A lot of AI safety discourse focuses on very specific models of AI and AI safety. These are interesting, but I don't know how I could be confident in any one. I prefer to accept that we're just very uncertain. One important axis of that uncertainty is roughly "difficulty".
In this lens, one can see a lot of safety research as "eating marginal probability" of things going well, progressively addressing harder and harder safety scenarios.
[Image: Safety by eating marginal probability -- different safety methods are pictured as progressively pushing forward a "present margin of safety research", allowing us to build safe models in slightly harder scenarios.]
To be clear: this uncertainty view doesn't justify reckless behavior with future powerful AI systems! But I value being honest about my uncertainty. I'm very concerned about safety, but I don't want to be an "activist scientist" being maximally pessimistic to drive action.
A concrete "easy scenario": LLMs are just straightforwardly generative models over possible writers, and RLHF just selects within that space. We can then select for brilliant, knowledgeable, kind, thoughtful experts on any topic. I wouldn't bet on this, but it's possible!
(Tangent: Sometimes people say things like "RLHF/CAI/etc aren't real safety research". My own take would be that this kind of work has probably increased the probability of a good outcome by more than anything else so far. I say this despite focusing on interpretability myself.)
In any case, let's say we accept the overall idea of a distribution over difficulty. I think it's a pretty helpful framework for organizing a safety research portfolio. We can go through the distribution segment by segment.
In easy scenarios, we basically have the methods we need for safety, and the key issues are things like fairness, economic impact, misuse, and potentially geopolitics. A lot of this is on the policy side, which is outside my expertise.
[Image: Highlighting easy scenarios on graph -- In Easy Safety Scenarios, present methods are largely sufficient, and the main challenges are issues like toxicity, intentional misuse, economic impacts, and potentially geopolitical instability.]
For intermediate scenarios, pushing further on alignment work – discovering safety methods, like Constitutional AI, which might work in somewhat harder scenarios – may be the most effective strategy. Scalable oversight and process-oriented learning seem like promising directions.
[Image: Highlighting intermediate scenarios on graph -- In Intermediate Safety Scenarios, catastrophic accidents are a plausible outcome, but building safe AI systems may be within reach with concerted scientific research. Marginal research can really help.]
For the most pessimistic scenarios, safety isn't realistically solvable in the near term. Unfortunately, the worst situations may *look* very similar to the most optimistic situations.
[Image: Highlighting hard scenarios on graph -- In Pessimistic Safety Scenarios, safety isn't realistically solvable, at least on a short timeline. Our main goal is providing evidence we're in such a scenario. The most pessimistic scenarios may appear optimistic, so humility is very important.]
In these scenarios, our goal is to realize and provide strong evidence that we're in such a situation (e.g. by testing for dangerous failure modes, mechanistic interpretability, understanding generalization, …).
It would be very valuable to reduce uncertainty about the situation. If we were confidently in an optimistic scenario, priorities would be much simpler. If we were confidently in a pessimistic scenario (with strong evidence), action would seem much easier.
[Image: A wide, high-entropy distribution being separated into two distributions of lower entropy. Caption: We'd like to gain evidence about the situation. The situation would be different if we knew what was true.]
This thread expresses one way of thinking about all of this… But the thing I'd really encourage you to ask is what you believe, and where you're uncertain. The discourse often focuses on very specific views, when there are so many dimensions on which one might be uncertain.
[Image: Many distributions are displayed. Caption: What is your distribution?]