The Kanye Test

Is your language model performing at its Kanye Best?

I was goofing around on lmsys (https://chat.lmsys.org/), checking out how various LLMs responded to different prompts.
At one point, I thought it might be interesting to judge a model by its ability to continue a coherent rhyming scheme.
There's a copypasta with a unique rhyming pattern that I asked the models to continue:

Continue the story: After a long day of work, Kanye West goes to his Kanye Nest to take his Kanye Rest. He wakes up feeling his Kanye Best. Then he’ll get Kanye Dressed on his Kanye Vest to go on a Kanye Quest. He goes to church and becomes Kanye Blessed, then to a hotel room to be a Kanye Guest. Then to school to take his Kanye Test. He forgot to brush his teeth. Did he run out of Kanye Crest? His neighbor stole it, what a Kanye Pest.

On the surface, this is a silly prompt that you'd figure has little bearing on what a model is truly capable of. But once you think about it, it works wonderfully as a stress test for language models.

To continue it properly, the model has to work out that (a toy checker for these rules is sketched after the list):

  • There's a rhyming pattern
  • Every line rhymes with one specific word ("West"), so it shouldn't fall back to a traditional rhyme scheme
  • The rhyming word always comes directly after "Kanye" and appears nowhere else
  • The rhyme is unique each time
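For fun, here's a rough sketch of how you could check a continuation against those rules automatically. This is a toy of my own, not how any of the models below were judged, and it treats an -est/-essed ending as a crude stand-in for genuinely rhyming with "West":

```python
import re

# Rhyme words already used in the original copypasta; a good continuation
# shouldn't just recycle them.
SEED_RHYMES = {"nest", "rest", "best", "dressed", "vest", "quest",
               "blessed", "guest", "test", "crest", "pest"}

def kanye_score(continuation: str) -> dict:
    """Loosely check a continuation against the four rules above.

    A word "rhymes" here if it ends in -est/-essed, which is only a
    rough proxy for actually rhyming with "West".
    """
    words = re.findall(r"[A-Za-z']+", continuation)
    after_kanye, stray = [], []
    for prev, word in zip([""] + words, words):
        w = word.lower()
        if w in ("kanye", "west"):  # "Kanye West" itself isn't part of the game
            continue
        if re.search(r"(est|essed)$", w):
            (after_kanye if prev.lower() == "kanye" else stray).append(w)
    return {
        "rhymes_after_kanye": after_kanye,
        "stray_rhymes": stray,                             # rule 3 violations
        "reused": sorted(set(after_kanye) & SEED_RHYMES),  # rule 4: rehashed
        "has_duplicates": len(after_kanye) != len(set(after_kanye)),
    }

print(kanye_score("Then he checks the mail, hoping for a Kanye Bequest."))
# -> {'rhymes_after_kanye': ['bequest'], 'stray_rhymes': [],
#     'reused': [], 'has_duplicates': False}
```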

GPT-4

The original GPT-4 (not GPT-4-Turbo) is by far the best at producing proper continuations for this prompt. Although it struggles to always use unique rhyme words, it more or less understands the assignment, with only minor errors.

GPT-4-Turbo

GPT-4-Turbo struggles with this in some noticeable ways compared to GPT-4. There are far fewer unique rhymes, and it tends to break the pattern when it isn't simply rehashing the rhymes from the original prompt.

On lmsys, most models are worse than both GPT-4 & GPT-4-Turbo at handling this.

Claude-1 [left] vs Mistral-Medium [right]

Especially bad is GPT-3.5, which completely ignores the established pattern of the writing and takes the "story" part far too literally:

Meanwhile, Gemini Pro more or less gets the actual rhyming down, but "forces" a more traditional pattern & formatting in the process:

Gemini Pro

And with the open source darling Mixtral Instruct:

Mixtral Instruct

Not so bad. It holds its own against Gemini Pro and picks up on the idea of the text far better than GPT-3.5, but it doesn't latch onto the fact that everything rhymes with "West".

OpenHermes 7b

I am... far from Kanye Impressed with 7b.

Claude 2.1

Jesus Christ, Claude. Really?

I think what this really illustrates is what our current benchmarking methods lack: adversarial tests that highlight the inherent weaknesses of language modeling. More specifically, the relationships between words that are hard to learn due to tokenizer biases. I'd be interested in a custom benchmark that targets those deep, difficult patterns [e.g. rhyming] that your average meme-benchmark model will struggle hard with (at least, without deliberately crafted prompting).
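To make the tokenizer point concrete, here's a small probe using OpenAI's tiktoken library. The choice of cl100k_base is my assumption (it's the GPT-4-era encoding); exact splits vary from tokenizer to tokenizer:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era BPE vocabulary

for word in [" West", " Nest", " Rest", " Blessed", " Crest", " Bequest"]:
    ids = enc.encode(word)                    # token IDs for the word
    pieces = [enc.decode([i]) for i in ids]   # the text each ID covers
    print(f"{word!r:12} -> {ids} {pieces}")

# BPE merges are driven by corpus frequency, not sound, so words that rhyme
# out loud often share no common "-est" token. The model has to learn the
# rhyme relation across unrelated token IDs, which is exactly the kind of
# deep pattern this test pokes at.
```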
