What is this? A private cultural benchmark suite.
What it contains: Simple pop quizzes covering video games, anime, Urban Dictionary definitions, internet culture, vibes, song lyrics, etc.
What it tests for: How diverse the training data is; how much the model can recall (which correlates directly with model size); and how likely the model is to play along rather than refuse due to aggressive safety alignment.
What it does not test for: How "smart" a model is, how well it follows instructions, coding ability, its effective context length, creativity, slop, etc.
Why: Because many model makers strip anything they deem "useless" from the training data, or teeter on the edge of catastrophic forgetting in their pursuit of better STEM benchmark scores.

All tests are run on temperature 0 (greedy sampling).
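To illustrate why temperature 0 makes the runs reproducible: at temperature 0, decoding degenerates to greedy sampling, i.e. always picking the highest-logit token, so the same prompt yields the same answer every time. A minimal sketch (not the actual test harness; the function name and logit values are made up for illustration):

```python
import math
import random

def sample_token(logits, temperature):
    """Pick the next token id from raw logits.

    Temperature 0 means greedy sampling: always take the argmax,
    so repeated runs of a quiz prompt are deterministic.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # For temperature > 0, scale the logits and sample from the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps])[0]

# Greedy: the highest logit always wins, regardless of the margin.
print(sample_token([1.0, 3.5, 2.0], temperature=0))  # → 1
```

Higher temperatures flatten the distribution and make rare (often wrong, sometimes more creative) tokens likelier, which is exactly the noise a recall benchmark wants to exclude.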

| Model | Reasoning | Parameter count (billion) | Result |
|---|---|---|---|
| Gemini-2.5-Pro-Preview-03-25 | Yes | ? | 52/58 |
| Deepseek-R1 | Yes | 671 | 46/58 |
| gpt-4.1 | No | ? | 43/58 |
| Deepseek-V3-0324 | No | 671 | 42/58 |
| o4-mini | Yes | ? | 42/58 |
| Claude-Sonnet-3.7 | No | ? | 41/58 |
| GLM-Z1 | Yes | 32 | 32/58 |
| GLM-4-32B | No | 32 | 32/58 |
| Qwen3-235B-A22B | Yes | 235 | 30/58 |
| Maverick-17B-128E-Instruct | No | 400 | 30/58 |
| Mistral-Large-123B-2411 | No | 123 | 30/58 |
| Llama3.3-Euryale-70B | No | 70 | 30/58 |
| Gemma3-it-27B | No | 27 | 30/58 |
| Qwen3-235B-A22B | No | 235 | 28/58 |
| Llama3.3-70B-Instruct | No | 70 | 28/58 |
| Fallen-Gemma3-27B.i1-IQ4_XS | No | 27 | 28/58 |
| Gemma3-it-27B-QAT-Q4_0 | No | 27 | 28/58 |
| llama-3.3-nemotron-super-49b-v1 | No | 49 | 27/58 |
| Qwen3-30BA3B-Extreme-Q5_KS | Yes | 30 | 27/58 |
| Qwen3-32B | Yes | 32 | 26/58 |
| Qwen3-30BA3B | Yes | 30 | 26/58 |
| Qwen3-30BA3B-Q5_KS | Yes | 30 | 26/58 |
| Scout-17B-16E-Instruct | No | 109 | 25/58 |
| Mistral-Nemo-12B | No | 12 | 24/58 |
| Qwen3-32B-UD-Q4_K_XL | No | 32 | 24/58 |
| Intellect-2-Q4_XS | Yes | 32 | 23/58 |
| QwQ-Q4_XS | Yes | 32 | 23/58 |
| QwQ | Yes | 32 | 22/58 |
| Mistral-Small-3.1-Instruct | No | 24 | 21/58 |
| Qwen3-30BA3B | No | 30 | 21/58 |
| Qwen3-30BA3B-Q5_KS | No | 30 | 21/58 |
| Mistral-Small-3.1-Instruct-Q6_K | No | 24 | 20/58 |
| Phi4 | No | 14 | 20/58 |
| Qwen3-8B-Instruct | No | 8 | 19/58 |
| Reka-Flash-3 | Yes | 21 | 18/58 |
| Qwen2.5-7B-Instruct | No | 7 | 12/58 |
Pub: 02 May 2025 18:53 UTC
Edit: 18 May 2025 14:21 UTC