Thread for examples of alignment research MIRI has said relatively positive stuff about: ("Relatively" because our overall view of the field is that not much progress has been made, and it's not clear how we can change that going forward. But there's still better vs. worse.)
Nate singles out John Wentworth's Natural Abstractions as a line of future research that "could maybe help address the core difficulty if it succeeds wildly more than I currently expect it to succeed". Ditto "sufficiently ambitious interpretability work".
Quote Tweet
Here is an important example of a rare, nonfake prosaic/short-term alignment research project. How can I tell it's nonfake? Because I can't instantly predict what results they'll get by reading the project description; which is rare. alignmentforum.org/posts/k7oxdbNa
Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, and Christiano at OpenAI:
Quote Tweet
A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences AND THEN measured how hard you could optimize against the trained function before the results got actually worse. twitter.com/OpenAI/status/…
Askell et al. at Anthropic:
Quote Tweet
I'm slightly cheerful about this paper; it caused me to mildly positively update on @AnthropicAI. Directly tackles alignment; doesn't significantly burn the capability commons that I noticed; doesn't overstate the accomplishment and claim that alignment has now been solved. twitter.com/AnthropicAI/st…
Bowman et al. at Anthropic:
Quote Tweet
Replying to @ESYudkowsky and @AnthropicAI
I should also say out loud that I like the attempt to manifest problems early and via a paradigm that thinks about relative competence between humans and AIs. (Though there's also a threshold in here about whether the AI is *any* good at deceiving humans.)
The post also highlights Stuart Armstrong's work "formalizing the shutdown problem, an example case in point of why corrigibility is hard, which so far as I know is still resisting all attempts at solution". Early corrigibility work is collected here:
Rumbelow and Watkins at SERI-MATS:
Quote Tweet
This is one of the more hopeful processes happening on Earth right now - because it may give rise to a culture of people with something like security mindset, who try to break things, instead of imagining how wonderfully they'll work. lesswrong.com/posts/aPeJE8bS
Bills*, Cammarata*, Mossing*, Tillman*, Gao*, Goh, Sutskever, Leike, Wu*, and Saunders* at OpenAI:
Quote Tweet
Commentary at greater length:
- I'm encouraged that somebody ran right out and tried this.
- It's not clear (to me, yet) that it worked all that well, or better than expected; I have not yet significantly updated my model of how technically hard interpretability is.
- It is… twitter.com/blader/status/…