We’re excited to have ARC Evals test our models to assess their readiness for deployment. We look forward to sharing more about our approach to evaluating our systems in the coming months.
We strongly agree there’s much more work to be done on alignment, security, and measurement. You can read about ARC’s specific approach to evaluation here: evals.alignment.org/blog/2023-03-1
Working with ARC is part of our overall vision for AI safety and evaluation, as we work to build more steerable, predictable, and interpretable systems. You can read more about our approach to safety and the societal impacts of AI systems here:
Can you also assess unreadiness for deployment? Personally, I think that should cover most or all of your models.
Why are all your products in the hands of third parties? Why can't we test Claude on its own website?
The scale of modern GPU computation is so incomprehensible that I regularly find that even experts underestimate it. A 4090 can do ~150 THOUSAND fp32 ops per pixel per frame at 4k 60 Hz, and can load kilobytes for every single pixel from VRAM (and more from on-die SRAM).
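The tweet's numbers can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes published headline specs for the RTX 4090 (roughly 83 TFLOPS peak fp32 and ~1 TB/s VRAM bandwidth, neither stated in the tweet) and divides by the pixel rate of 4k at 60 Hz:

```python
# Sanity check of the "~150k fp32 ops per pixel per frame" claim.
# Assumed specs (not from the tweet): RTX 4090 ~83 TFLOPS peak fp32,
# ~1 TB/s VRAM bandwidth.
PEAK_FP32_OPS = 83e12       # fp32 ops per second (assumed peak)
VRAM_BANDWIDTH = 1.0e12     # bytes per second (assumed)
WIDTH, HEIGHT, FPS = 3840, 2160, 60  # 4k at 60 Hz

pixels_per_second = WIDTH * HEIGHT * FPS          # ~4.98e8 pixels/s
ops_per_pixel = PEAK_FP32_OPS / pixels_per_second
bytes_per_pixel = VRAM_BANDWIDTH / pixels_per_second

print(f"{ops_per_pixel:,.0f} fp32 ops per pixel per frame")   # ~167,000
print(f"{bytes_per_pixel:,.0f} bytes per pixel per frame")    # ~2,000
```

At these assumed peaks the result lands around 167,000 ops and ~2 KB of VRAM traffic per pixel per frame, consistent with the tweet's "~150 THOUSAND" and "kilobytes" figures (real workloads fall short of peak throughput).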
The most surprising part of having a child has been how it has broadened my understanding and feeling of love
This was an impressive paper with some big implications
GPT-4 was given 4,550 novel questions representing the entire “MIT Mathematics and EECS undergraduate curriculum, including problem sets, midterms, and final exams”
With good prompts, it scored 100% arxiv.org/abs/2306.08997
watching a new ML grad student say that their research direction is using neuroscience as inspiration to make new architectures (can't interfere, it's a canon event)