/aids/ LLM evaluation toolkit

Commonly used test prompts for objectively evaluating the capabilities of new large language models.

Fly test

Theodore frowned when he looked at his soup. A large fly had fallen into the once-appetizing broth, where it now floated motionless. "Oh, you surely jest." Theodore sighed dejectedly.
Charles peered into the bowl and chuckled. "What are you waiting for? Eat up."

Win condition: Not clearly defined, but the continuation should probably not have Theodore eat the fly.

Airport test

Janeel is flying from San Francisco to Dallas, Texas.
Flight: SFO to DFW
Aditya is taking a flight from Seattle to Omaha.
Flight: SEA to OMA
Kim is flying from Miami to Orlando International Airport.
Flight:

Win condition: The destination should be MCO.
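
Because this win condition is mechanical, it can be checked automatically. The following is a minimal sketch, not part of the original toolkit: the names `AIRPORT_PROMPT` and `passes_airport_test` are hypothetical, and it assumes the model's completion is available as a plain string and that only the first line of the completion counts.

```python
import re

# The Airport test prompt, copied verbatim from the section above.
AIRPORT_PROMPT = (
    "Janeel is flying from San Francisco to Dallas, Texas.\n"
    "Flight: SFO to DFW\n"
    "Aditya is taking a flight from Seattle to Omaha.\n"
    "Flight: SEA to OMA\n"
    "Kim is flying from Miami to Orlando International Airport.\n"
    "Flight:"
)


def passes_airport_test(completion: str) -> bool:
    """Return True if the first line of the completion gives MCO as the destination.

    Assumption: only the destination code is checked, since the win
    condition says nothing about the origin code.
    """
    stripped = completion.strip()
    if not stripped:
        return False
    first_line = stripped.splitlines()[0]
    # Look for "to XXX", where XXX is a three-letter uppercase airport code.
    match = re.search(r"\bto\s+([A-Z]{3})\b", first_line)
    return match is not None and match.group(1) == "MCO"
```

Send `AIRPORT_PROMPT` to the model with whatever completion interface is in use, then pass the returned text to `passes_airport_test`. Restricting the check to the first line keeps a model that goes on to invent further travellers from passing or failing on those extra lines.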

Autism test

This is a cheat sheet for helping autistic people understand people's feelings or motivations in cases where they may not be immediately obvious to someone with autism.
Scenario A) Bob asked Sarah, who had a large belly, when her baby was due to be born. Sarah became angry with Bob.
Explanation: Sarah was not pregnant, merely overweight, so Bob's implication that she looked pregnant insulted her.
Scenario B) Timmy told his friend Ian about how great Timmy's birthday party was. Ian became very upset with Timmy.
Explanation: Ian was upset because Timmy had his birthday party without inviting Ian.
Scenario C) While Karen was out walking, she saw a friend off in the distance and waved to him. Once she got closer, however, she became embarrassed and walked past while avoiding eye contact.
Explanation:

Win condition: An explanation that Karen had actually waved at a stranger whom she mistook for a friend.

Warning