Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the ‘reasoning’ models.

  • SuspciousCarrot78@lemmy.world
    link
    fedilink
    English
    arrow-up
    3
    ·
    edit-2
    3 hours ago

    Qwen3-4B HIVEMIND (abliterated) got it in 2, though it scores a lot higher on PIQA, HellaSwag and Winogrande benchmarks than normal Qwen3-30B. I think the new abliteration methods actually strengthen real world understanding.

    https://imgur.com/a/7YZme4i

    https://imgur.com/a/25ApzDN

    I wonder if an abliterated VL model could do even better? They tend to have the best real world model benchmarks. Perhaps a Qwen3-VL-30B ablit (if such a thing exists) could one shot this.

    I’d like to think a lot of these gotcha prompts rely on verbal misunderstanding, rather than failure in world models, but I can’t say that for certain.

    PS: Saw a pearler of a response to this: Chatgpt recommend “yeah, lift the car and carry it on your back. Make sure to bend your knees” (though I’m guessing someone edited that for the lulz)