Problem

  • existing benchmarks are used for leaderboards, which VCs look at, providing an incentive for gaming
  • can be gamed easily because the questions/prompts are out in the open for anyone to look at
  • can also be gamed unintentionally: foundation model training sets include web crawls, so publicly posted benchmark questions get scraped into the training data and "contaminate" it; even if a particular benchmark checks for contamination after the fact, it can't undo it
  • contamination only gets worse over time, as the internet tends to copy and spread content around
  • many existing benchmarks are frankly just badly made and not very good measures of real world performance

Solutions

  • contamination can be solved by obscuring the benchmark content from crawlers in different ways (one possible approach is sketched after this list)
  • gaming can be solved by making the benchmark private (bad, because no one is going to run the benchmark over and over for every new model that comes out, unpaid, and there are trust issues), or, our preferred choice, by just making a benchmark without fanfare, so it stays obscure and no one cares enough to game it
  • crowdsource both the benchmark questions (while using obscuring methods) and the results for the leaderboard
  • quality and relevance of the benchmark can be addressed with careful prompt design
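
As a concrete illustration of the "obscuring" idea above, here is a minimal sketch of one possible approach (an assumption for illustration, not the project's actual method): publish the question set base64-encoded so the plain text never appears verbatim in a web crawl, and decode it only at benchmark time.

```python
# One possible obscuring approach (illustrative assumption, not the project's
# actual method): never publish questions as plain text; base64-encode the
# whole set so crawlers only ever see an opaque blob.
import base64
import json

def encode_questions(questions: list[dict]) -> str:
    """Serialize the question set and base64-encode it for publishing."""
    return base64.b64encode(json.dumps(questions).encode("utf-8")).decode("ascii")

def decode_questions(blob: str) -> list[dict]:
    """Decode the published blob back into usable questions at benchmark time."""
    return json.loads(base64.b64decode(blob).decode("utf-8"))

if __name__ == "__main__":
    qs = [{"prompt": "The answer is choice", "answer": " B"}]  # made-up example entry
    blob = encode_questions(qs)
    print(blob)                    # what a crawler would scrape
    print(decode_questions(blob))  # what the benchmark runner uses
```

Base64 is just the simplest stand-in here; anything from light encryption to gated downloads would serve the same purpose of keeping plain-text questions out of crawls.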

Our Design

  • current models can output token probabilities, basically the percentage chance they assign to each candidate for the very next token
  • we can leverage this by designing prompts where the very next token should logically be something specific, e.g. a multiple choice question with correct choice X means the model should have near 100% confidence that the token following "The answer is choice" is X, or more complicated prompting like you can see in the submission details rentry (a minimal example is sketched after this list)
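
Here is a minimal sketch of that scoring idea using Hugging Face transformers; the model name and prompt are placeholders, and the harness details are assumptions rather than the project's actual implementation.

```python
# Minimal sketch of scoring one multiple choice question via next-token
# probabilities (model name and prompt are placeholders for illustration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; swap in whatever model is being benchmarked

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def answer_probability(prompt: str, expected: str) -> float:
    """Return the probability the model assigns to `expected` as the very next token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits at the last position
    probs = torch.softmax(next_token_logits, dim=-1)
    expected_id = tokenizer.encode(expected)[0]            # first token of the expected answer
    return probs[expected_id].item()

prompt = (
    "Q: Which planet is known as the Red Planet?\n"
    "A) Venus  B) Mars  C) Jupiter  D) Saturn\n"
    "The answer is choice"
)
print(answer_probability(prompt, " B"))  # a capable model should put this near 1.0
```

A single forward pass per question is all that's needed, which is what makes this cheaper than the generate-and-grade alternative discussed in the Q&A below.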

Q&A

Carefully designing prompts is tiresome; why not have a model generate a ton of answers to a question, with samplers on, and have another model grade them?

  • the grader model has its own capacity to hallucinate (even GPT-4 will sometimes say something and then contradict what it obviously just said one sentence ago)
  • not everyone can run the grader if it's too big for their hardware
  • not as lightweight and fast as just getting a single token probability

Token probabilities still sound like an inaccurate method if you don't use a multiple choice prompt.

  • true, depending on the prompt there is some chance that a token which looks nonsensical on the surface still leads to a perfectly logical continuation; however, that chance should be very small
  • a little inaccuracy does not matter, because we will use a reasonably large number of questions and because we will be measuring relative performance, not absolute (think of how students are graded on percentile rather than raw score; a toy example is sketched after this list)
  • there is no proof that carefully designed testing with token probabilities isn't an effective method (I guess you could say we're going to prove whether or not it is)
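
To make the relative-scoring point concrete, here is a toy sketch: each model's raw score is taken to be its mean next-token probability over all questions, and the leaderboard reports percentile ranks among the models tested (the model names and numbers are made up).

```python
# Toy illustration of relative (percentile) scoring; raw scores are assumed to
# be mean next-token probabilities, and all names/numbers are made up.
def percentile_ranks(scores: dict[str, float]) -> dict[str, float]:
    """Map each model to the percentage of tested models it scores at least as well as."""
    values = list(scores.values())
    n = len(values)
    return {
        name: 100.0 * sum(1 for v in values if v <= score) / n
        for name, score in scores.items()
    }

raw = {"model-a": 0.81, "model-b": 0.64, "model-c": 0.73}  # mean answer probabilities
print(percentile_ranks(raw))  # {'model-a': 100.0, 'model-b': 33.3..., 'model-c': 66.6...}
```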

Why not solve the gaming problem by making the benchmark completely private?

  • no one with the necessary hardware is going to take on the responsibility of running a benchmark over and over for every new model that comes out, unpaid
  • no one is going to write all those benchmark questions by themselves, unpaid
  • a private maintainer cannot necessarily be trusted to report results honestly
  • the reported results cannot be independently verified

But what if a malicious party leaks the benchmark on the internet?

  • we actually have a plethora of safeguards against gaming that have nothing to do with preventing contamination, so if someone decides to try and game the benchmark anyway, they will have to spend considerable effort to do so, and we will have ways of knowing who did it (one purely hypothetical example of such a tracing method is sketched at the end of this answer)
  • don't worry about it
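
As a purely hypothetical illustration of how a leak could be traced back to its source (an assumption for illustration, not a description of the actual safeguards): each trusted recipient could get a copy of the question set carrying a unique canary string, so any leaked copy identifies who it was issued to.

```python
# Hypothetical leak-tracing sketch (illustrative assumption only, NOT the
# project's actual safeguards): tag each recipient's copy of the question set
# with a unique canary string so a leaked copy can be traced back to them.
import secrets

def issue_copy(questions: list[str], recipient: str, registry: dict[str, str]) -> list[str]:
    """Return this recipient's copy with a unique canary embedded, and record it."""
    canary = secrets.token_hex(8)
    registry[canary] = recipient
    return questions + [f"[canary:{canary}]"]

def trace_leak(leaked_copy: list[str], registry: dict[str, str]) -> str | None:
    """Return the recipient whose canary appears in a leaked copy, if any."""
    for line in leaked_copy:
        if line.startswith("[canary:") and line.endswith("]"):
            return registry.get(line[len("[canary:"):-1])
    return None

registry: dict[str, str] = {}
copy_for_anon = issue_copy(["example question 1", "example question 2"], "anon123", registry)
print(trace_leak(copy_for_anon, registry))  # -> 'anon123'
```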