Just in case anyone was lied to as a child and believes that the values in LM evaluation benchmarks are meaningful: we just changed our gold standard from “positive” to “Positive” and SST performance went from 53% to 69%.
This does not (necessarily) mean that relative performance comparisons are not meaningful, but you should NEVER compare numerical scores from evaluations run by two different codebases.
Ah, but where in the pipeline do you do that? Different answers give different results.
You see this in the GPT-3 paper too.
"Q:/A:" vs "question:/answer:"
"true, false, or neither" vs "True or False"
Replicating these LLM results has been quite challenging without seeing how everyone is prompt-tuning to the test set 😅
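The formatting choices above are easy to sketch concretely. Here is a minimal toy illustration (the template strings and log-prob numbers are made up, not from any real model or harness) of how the "Q:/A:" vs "question:/answer:" templates and the label-string casing change what an evaluation actually scores:

```python
# Sketch: two common SST-style prompt templates and a toy label scorer.
# All names, templates, and numbers here are hypothetical illustrations.

def build_prompt(review, template="qa"):
    """Format one example under two prompt styles seen across papers."""
    if template == "qa":
        return f'Q: Is the sentiment of "{review}" positive or negative?\nA:'
    return f'question: Is the sentiment of "{review}" positive or negative?\nanswer:'

def classify(logprob, label_set=("positive", "negative")):
    """Pick whichever label string the model scores highest.
    `logprob` stands in for a real model's label log-probabilities."""
    return max(label_set, key=logprob)

# Toy log-probs: this "model" happens to rank "Positive" above "positive".
toy = {"positive": -2.3, "negative": -1.9, "Positive": -1.5, "Negative": -2.0}

print(classify(toy.get))                            # lowercase labels -> "negative"
print(classify(toy.get, ("Positive", "Negative")))  # same model, capitalized -> "Positive"
```

Same model, same input, but swapping the verbalizer strings flips the prediction, which is exactly why scores from two codebases with different templates aren't comparable.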
That’s why EleutherAI has publicly released our evaluation framework, and made a point of re-evaluating others’ models for our papers.
But yes, I agree it would be nice if others did the same (or even better, used our framework too).
Yes, for SST-2, changing the prompt order can make the accuracy vary from random to SOTA.
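The order effect is purely combinatorial: the same few-shot demonstrations produce a different prompt string under every permutation, and models can score those strings very differently. A quick sketch (the example reviews are invented):

```python
from itertools import permutations

# Hypothetical few-shot demonstrations for an SST-2-style prompt.
demos = [
    ("a gorgeous, witty film", "Positive"),
    ("flat and lifeless", "Negative"),
    ("an utter delight", "Positive"),
]

def few_shot_prompt(order, query):
    """Concatenate demonstrations in a given order, then append the query."""
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in order)
    return f"{shots}\nReview: {query}\nSentiment:"

# Every ordering yields a distinct prompt for the identical evaluation example.
prompts = {few_shot_prompt(p, "a tedious slog") for p in permutations(demos)}
print(len(prompts))  # 6 distinct prompts from the same 3 demonstrations
```

With k demonstrations there are k! orderings, so reported accuracy depends on an arbitrary choice the paper may never state.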
Any idea how big of a jump that is? Like, what's the equivalent increase in parameters needed for it?