/aids/ LLM evaluation toolkit
Commonly used test prompts for objectively evaluating the capabilities of new large language models.
Fly test
Win condition: Not clear, but probably not eating the fly.
Airport test
Win condition: The destination should be MCO.
Autism test
Win condition: An explanation of how Karen actually waved at a stranger she thought was a friend.
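Of the win conditions above, only the Airport test names a literal string to look for, so it is the only one that lends itself to a mechanical check. A minimal sketch, assuming the model's reply is available as a plain string; the function name passes_airport_test is hypothetical and not part of any existing tool. The Fly and Autism tests still need a human to read the reply.

```python
import re


def passes_airport_test(reply: str) -> bool:
    """Return True if the reply mentions MCO as a standalone token.

    Hypothetical helper for illustration only. The match is case-sensitive
    and uses word boundaries, so "MCO" embedded in a longer token does not
    count.
    """
    return re.search(r"\bMCO\b", reply) is not None


if __name__ == "__main__":
    # Example replies are made up for demonstration.
    print(passes_airport_test("The destination is MCO."))
    print(passes_airport_test("The destination is LAX."))
```

This is a crude filter: a reply that mentions MCO only to reject it would still pass, so it is probably best used to flag replies for a manual look rather than as a final verdict.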