Conversation

This does not (necessarily) mean that relative performance comparisons are not meaningful, but you should NEVER compare numerical scores from evaluations run by two different codebases.
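A toy sketch of the point, in Python. All names and data below are hypothetical, not taken from any real harness: the same model predictions, scored by two eval codebases that differ only in answer normalization, yield different "accuracy" numbers for identical model behavior.

```python
# Toy illustration (all strings hypothetical): identical predictions,
# two scoring conventions, two different headline numbers.

gold  = ["True", "False", "True", "Neither"]
preds = ["true", "False", "True", "neither"]  # one model's fixed outputs

def accuracy_strict(preds, gold):
    # "Codebase 1": exact string match.
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def accuracy_lenient(preds, gold):
    # "Codebase 2": case-insensitive match.
    return sum(p.lower() == g.lower() for p, g in zip(preds, gold)) / len(gold)

print(accuracy_strict(preds, gold))   # 0.5
print(accuracy_lenient(preds, gold))  # 1.0
```

Within either scorer, comparing two models can still be meaningful; comparing 0.5 from one against 1.0 from the other is not.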
You see this in the GPT-3 paper too: "Q:/A:" vs. "question:/answer:", and "true, false, or neither" vs. "True or False". Replicating these LLM results has been quite challenging without seeing how everyone is prompt-tuning to the test set 😅
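For concreteness, a minimal sketch of the kind of formatting difference being described: the same QA item rendered under a "Q:/A:" template versus a "question:/answer:" template. The templates and the item are illustrative, not the GPT-3 paper's exact prompts.

```python
# Illustrative only: two prompt templates for the same item. Neither is
# claimed to be the exact template from any paper or harness.

item = {"question": "Is the sky blue?", "answer": "True"}

template_qa    = "Q: {question}\nA:"               # "Q:/A:" style
template_words = "question: {question}\nanswer:"   # "question:/answer:" style

print(template_qa.format(**item))
print(template_words.format(**item))
# A model scored on the likelihood of the continuation sees two different
# contexts, so its measured accuracy can differ between the two formats.
```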
That’s why EleutherAI has publicly released our evaluation framework and made a point of re-evaluating others’ models for our papers. But yes, I agree it would be nice if others did the same (or even better, used our framework too).
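The framework referred to is EleutherAI's lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness). A minimal usage sketch, assuming the 0.4.x `lm_eval` Python API; the entry point has changed over time, and the model and task names below are just examples, so check the repo's README:

```python
# Sketch of running EleutherAI's lm-evaluation-harness programmatically.
# Assumes lm-eval 0.4.x (`pip install lm-eval`); model and task are examples.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF checkpoint
    tasks=["lambada_openai"],
    num_fewshot=0,
)
print(results["results"])
```

Because everyone then runs the same prompting and scoring code, numbers produced this way are comparable across models in a way that numbers from two different codebases are not.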