• Allero@lemmy.today · 17 days ago

    Here’s my guess, aside from highlighted token issues:

    We all know LLMs train on human-generated data. And when we ask something like “how many R’s” or “how many L’s” are in a given word, we usually don’t mean a count of every occurrence; we normally mean “how many of that letter appear in a row, so I can spell it right”.

    Yes, the word “strawberry” has 3 R’s. But what most people are interested in is whether it is “strawberry” or “strawbery”, and their “how many R’s” refers to exactly that, not to the whole word.

    • jj4211@lemmy.world · 17 days ago

      It doesn’t even see the word ‘strawberry’; the input has been tokenized, so the model never sees the ‘text’ that was typed in.

      It’s more like it sees a question like: how many ‘r’s in 草莓 (the Chinese word for strawberry)?

      And it spits out an answer based not on analysis of the input, but on a model of what people might have said.
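
      As a rough illustration of what tokenization does here, a minimal sketch using OpenAI’s tiktoken library (assuming it is installed; the exact split is illustrative and depends on the model’s vocabulary):

      ```python
      # The model never receives letters, only integer token IDs from a BPE vocabulary.
      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

      text = "how many r's in strawberry?"
      token_ids = enc.encode(text)

      print(token_ids)                             # opaque integers, not characters
      print([enc.decode([t]) for t in token_ids])  # roughly ['how', ' many', ' r', "'s", ' in', ' straw', 'berry', '?']
      ```

      So “strawberry” typically arrives as a couple of chunks like ‘ straw’ + ‘berry’, and the letter ‘r’ is never something the model can count directly.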