Thread for examples of alignment research MIRI has said relatively positive stuff about:
("Relatively" because our overall view of the field is that not much progress has been made, and it's not clear how we can change that going forward. But there's still better vs. worse.)
Nate singles out John Wentworth's Natural Abstractions as a line of future research that "could maybe help address the core difficulty if it succeeds wildly more than I currently expect it to succeed". Ditto "sufficiently ambitious interpretability work".
(Though Nate worries that current interpretability research isn't being ambitious enough to help with what he sees as the central alignment challenge: lesswrong.com/s/v55BhXbpJuaE.)
Eliezer endorsing Redwood Research's first big project: twitter.com/ESYudkowsky/st
+ alignmentforum.org/posts/k7oxdbNa.
Quote Tweet
Here is an important example of a rare, nonfake prosaic/short-term alignment research project. How can I tell it's nonfake? Because I can't instantly predict what results they'll get by reading the project description; which is rare. alignmentforum.org/posts/k7oxdbNa
Burns, Ye, Klein, and Steinhardt at UC Berkeley:
Quote Tweet
Very dignified work! Signal boosting. twitter.com/CollinBurns4/s…
Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, and Christiano at OpenAI:
Quote Tweet
A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences AND THEN measured how hard you could optimize against the trained function before the results got actually worse. twitter.com/OpenAI/status/…
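To make the shape of that experiment concrete, here is a tiny self-contained toy (my own sketch, not OpenAI's code or data): a "true" reward, a flawed proxy reward standing in for the learned preference model, and best-of-n selection as the knob for how hard you optimize against the proxy. The features and numbers are invented purely for illustration.

```python
# Toy sketch (not OpenAI's code): optimize harder and harder against a flawed
# proxy reward and watch the "true" reward get worse, as in the measurement
# praised above. All features and numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    # What we actually want: feature 0 is good, feature 1 is harmful.
    return x[:, 0] - x[:, 1]

def proxy_reward(x):
    # A flawed learned proxy: mistakenly rewards the harmful feature too.
    return x[:, 0] + 2.0 * x[:, 1]

def true_reward_of_best_of_n(n, trials=2000):
    # Best-of-n selection against the proxy is the "how hard do we optimize" knob.
    picks = []
    for _ in range(trials):
        x = rng.normal(size=(n, 2))            # n candidate outputs
        best = np.argmax(proxy_reward(x))      # choose by proxy reward
        picks.append(true_reward(x)[best])     # score the pick by true reward
    return float(np.mean(picks))

for n in [1, 4, 16, 64, 256]:
    print(f"best-of-{n:>3}: mean true reward {true_reward_of_best_of_n(n):+.3f}")
```

In this toy, best-of-1 applies no optimization pressure and scores about zero; as n grows, the proxy-optimized pick scores steadily worse on the true reward, which is the qualitative pattern the quoted tweet describes the paper measuring with a real learned reward model.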
Askell et al. at Anthropic:
Quote Tweet
I'm slightly cheerful about this paper; it caused me to mildly positively update on @AnthropicAI. Directly challenges alignment; doesn't significantly burn the capability commons that I noticed; doesn't overstate the accomplishment and claim that alignment has now been solved. twitter.com/AnthropicAI/st…
Bowman et al. at Anthropic:
Quote Tweet
Replying to @ESYudkowsky and @AnthropicAI
I should also say out loud that I like the attempt to manifest problems early and via a paradigm that thinks about relative competence between humans and AIs. (Though there's also a threshold in here about whether the AI is *any* good at deceiving humans.)
MIRI also obviously thinks that the alignment researchers we directly fund have done relatively cool stuff. Some who post on LW these days:
Garrabrant - lesswrong.com/users/scott-ga
Demski - lesswrong.com/users/abramdem
Kosoy - lesswrong.com/users/vanessa-
Hubinger -
The Visible Thoughts Project at MIRI is attempting to build a dataset we think may be useful for alignment work: lesswrong.com/posts/zRn6cLtx
(Announcement post is old now, update coming soon.)
The post also says of Circuits, and of similar interpretability work by Chris Olah's team: "we have said for years that Circuits-style research deserves all possible dollars that can be productively spent on it".
distill.pub/2020/circuits/
Likewise re Paul Christiano: "I continue to recommend throwing as much money at Paul as he says he can use, and I wish he said he knew how to use larger amounts of money." lesswrong.com/posts/S7csET9C
lesswrong.com/posts/CpvyhFy9 again singles out Olah as "exceptional in the field" and Christiano as "one of the few people trying to have foundational ideas at all" (while expressing pessimism that their research directions will succeed, and broader pessimism about the field).
The post also highlights Stuart Armstrong's work "formalizing the shutdown problem, an example case in point of why corrigibility is hard, which so far as I know is still resisting all attempts at solution".
Early corrigibility work is collected here:
+ "Various people who work or worked for MIRI came up with some actually-useful notions here and there, like Jessica Taylor's expected utility quantilization".
Taylor's paper: intelligence.org/files/Quantili.
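For readers who haven't seen it, here is a minimal sketch of the idea (my own illustration, not code from Taylor's paper): a q-quantilizer samples from the top q fraction of a trusted base distribution of actions, ranked by estimated utility, rather than taking the single argmax. The base distribution and utility function below are made up.

```python
# Minimal quantilizer sketch (my own illustration, not from Taylor's paper).
import numpy as np

rng = np.random.default_rng(0)

def quantilize(actions, utility, q=0.1):
    """Sample one action uniformly from the top-q fraction of `actions` by `utility`.

    `actions` are assumed to be draws from a trusted base distribution, so
    capping optimization pressure at the top q fraction limits how far the
    chosen action can stray from what that base distribution would do.
    """
    scores = np.array([utility(a) for a in actions])
    k = max(1, int(np.ceil(q * len(actions))))
    top = np.argsort(scores)[-k:]              # indices of the top-q actions
    return actions[rng.choice(top)]

# Toy usage: "ordinary" actions from a base distribution, a stand-in utility.
base_actions = rng.normal(size=1000)
utility = lambda a: a                          # invented utility estimate
print("maximizer pick:      ", max(base_actions, key=utility))
print("0.1-quantilizer pick:", quantilize(base_actions, utility, q=0.1))
```

The motivating property, roughly as shown in the paper, is that a q-quantilizer can incur at most 1/q times the expected cost of simply sampling from the base distribution, even when the utility estimate is badly wrong in the tails.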
Rumbelow and Watkins at SERI-MATS:
Quote Tweet
This is one of the more hopeful processes happening on Earth right now - because it may give rise to a culture of people with something like security mindset, who try to break things, instead of imagining how wonderfully they'll work. lesswrong.com/posts/aPeJE8bS
Bills*, Cammarata*, Mossing*, Tillman*, Gao*, Goh, Sutskever, Leike, Wu*, and Saunders* at OpenAI:
Quote Tweet
Commentary at greater length:
- I'm encouraged that somebody ran right out and tried this.
- It's not clear (to me, yet) that it worked all that well, or better than expected; I have not yet significantly updated my model of how technically hard interpretability is.
- It is… twitter.com/blader/status/…