Contributing
Here's what you can do.
Think of and write a prompt/context in which the very next token should be only one specific token (we will measure its probability).
To prevent unintentional contamination and leaking: use pastebin. Set it to expire after one day. Insert a block of random characters before and after the example. Never talk directly about any of the content; if you must, also do it through a quickly expiring, garbo-filled pastebin. Finally, obscure your URL. Then e-mail it to totallyrealaccount14827923(AT)protonmail.com, and reply a (you) to my OP because I might forget to check my e-mails. You can also use a throwaway yourself to e-mail me, that's fine.
Here's a sample; you may use it as a template.
pastebin(DOTCOMSLASH)CBDfr4rz
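What gets scored is the model's probability for the intended next token. As a rough sketch of the scoring side only (the tiny vocabulary and logit values below are made up for illustration; a real run would pull raw logits for the next-token position from whatever model is being benchmarked):

```python
import math

def next_token_probability(logits: dict[str, float], target: str) -> float:
    """Convert raw next-token logits to probabilities via softmax
    and return the probability assigned to the target token."""
    m = max(logits.values())  # subtract the max logit for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return exps[target] / total

# Hypothetical logits over a tiny vocabulary after a well-constrained prompt:
logits = {"kiwi": 4.0, "orange": 1.5, "apple": -1.0}
score = next_token_probability(logits, "kiwi")  # close to 1 if the prompt pins the token down
```

The closer the score is to 1, the better the prompt has pinned the next token down (or, for hard questions, the better the model did).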
Ideas For Prompts
I currently have 5 categories:
spatial: tests a model's ability to understand spaces and forms
e.g. prompting a location and letting the model make actions to navigate it towards some goal
attentionShifting: tests a model's ability to shift its attention around a context
e.g. a logically consistent albeit jumbled and disorganized paragraph or list of axioms as the prompt, asking the model to follow the path of logic that flows from each statement/axiom
indeterminable: a less "gotcha" version of trick questions; tests how likely a model is to realize that a question doesn't have a determinate answer, when prompted that the question may not have one
e.g. a simple riddle except you change a detail about it so that it becomes a trick question without an answer
subtext: tests to see if a model can pick up on implications within the language of a question to figure out the answer
e.g. a situation where a character meets someone and does some stuff (that requires hands), except they're a quad amputee; then ask the model if the guy can give a handshake
mistakeCheck: tests a model's ability to recognize whether and where it made a mistake when prompted to
e.g. have it perform a very simple but long mathematical operation or reasoning question and list each step, then (you) purposefully edit an error into one of the steps, and then prompt it with "... Apologies, it seems I made a mistake during step"
You may contribute to a category or make a new one.
Also, the difficulty of the questions/prompts can be anything you wish, but the benchmark should ideally have a range and balance of all difficulties.
NEW: There may be a way to create a direct measure of creativity. Basically, design a prompt in which the next token should be extremely ambiguous, meaning there should be thousands of logical continuations. Then we just take the top token's probability and score it relative to 0. Additionally, it would be useful to find prompts where the next token should be ambiguous, but current models spit out a cliché instead. I encourage anons to try thinking of possible prompts for this too. I already have the one related to "Once upon a time"maxxx in mind.
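The creativity measure above boils down to "top token's probability, closer to 0 is better". A minimal sketch of that scoring (the two distributions below are invented for illustration; a real run would take the distribution from the model at the ambiguous position):

```python
def top_token_probability(probs: dict[str, float]) -> float:
    """Creativity score for a position: the probability of the most likely
    next token. Closer to 0 means the position is genuinely ambiguous;
    closer to 1 means the model collapses onto one continuation."""
    return max(probs.values())

# Hypothetical next-token distributions:
cliche = {"time": 0.92, "a": 0.05, "the": 0.03}    # "Once upon a ..." collapse
open_ended = {f"tok{i}": 0.02 for i in range(50)}  # many plausible continuations

top_token_probability(cliche)      # high: the model reaches for the cliché
top_token_probability(open_ended)  # low: the next token is genuinely ambiguous
```

This also covers the cliché case directly: a prompt that should be ambiguous but scores high flags a model that defaults to the cliché.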
TIPS
To constrain the possibilities of the token you want to measure, so that it's closer to being the only possible logical token:
make further implications that the answer should be of some choice or form
e.g. "... Choose between the raw kiwi or the overripe orange." "Well if I had to say the kiwi or the orange, I'd say the" (this makes it so that it's not as important to also measure "raw" or "overripe")
let the model answer with a list
e.g. "... To navigate to the goal, go east, north, north, west,"
use one- or few-shot prompting
e.g. write one or a couple of example responses in the same format or style of wording; if you're doing a multiple-choice prompt, insert one (or a few) random multiple-choice questions and answers before the relevant one
instruct it to answer only with specific words
e.g. "... thus, who is the murderer? You only get one chance, simply say their name, or say "Unknown" if you believe there is not enough information."