natural abstraction for s-risks
Quote Tweet
Replying to @davidad @joe_zimmerman and 2 others
I’m confident there’s a natural abstraction here, but less confident that it naturally covers all the s-risks you care about (or even all the s-risks I care about). I think it quite plausibly does, but this is something we will need to check very carefully as the theory develops!
night-watchman minimizing «boundary» violations
Quote Tweet
Replying to @davidad @mihai_truta3 and 2 others
The utility function of a night-watchman singleton is the minimum over all citizens of the extent to which their «boundaries» are violated (with violations being negative and no violations being zero) and the extent to which they fall short of baseline access to natural resources
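One way to read the definition above, as a minimal sketch rather than the author's formalization: each citizen gets two scores, both at most zero, and the singleton's utility is the worst score found anywhere. The names `Citizen`, `boundary_violation`, `resource_shortfall`, and `night_watchman_utility` are hypothetical. The tweet does not say how the two terms combine or what sign the shortfall term takes; this sketch assumes the minimum of the two, with shortfall following the same "negative is bad, zero is fine" convention as violations.

```python
from dataclasses import dataclass


@dataclass
class Citizen:
    name: str
    # Assumed convention from the tweet: 0.0 means no violation, negative means violated.
    boundary_violation: float
    # Assumption (not stated in the tweet): 0.0 means baseline access is met,
    # negative means the citizen falls short of it.
    resource_shortfall: float


def night_watchman_utility(citizens):
    """Worst-case score over all citizens; 0.0 is the best attainable value."""
    return min(
        (min(c.boundary_violation, c.resource_shortfall) for c in citizens),
        default=0.0,  # with no citizens, nothing is violated
    )


citizens = [
    Citizen("a", boundary_violation=0.0, resource_shortfall=0.0),
    Citizen("b", boundary_violation=-2.0, resource_shortfall=-0.5),
]
utility = night_watchman_utility(citizens)
```

Because the utility is a minimum that is capped at zero, improving anyone beyond "no violation, baseline met" earns nothing under this reading; only the worst-off citizen's situation moves the value.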
on whether non-overlapping partitions of boundaries are possible
Quote Tweet
Replying to @reconfigurthing @peligrietzer and 2 others
I agree, some boundaries are surely better than others. It's the expectation or hope of a single “best” *nonoverlapping* partition of which I am specifically skeptical. See also marksprevak.com/pdf/paper/Spre
pure utilitarianism probably contradicts «boundaries»
Quote Tweet
Replying to @davidad and @RYChappell
I do grant that purely hedonic utilitarianism can probably be made scientifically objective, but only in a way that erases the boundaries between individuals, and instead counts every “experience-moment” as mattering equally (people who live longer matter more; cf. QALYs).
night watchman again
Quote Tweet
Replying to @shenandoah
In the situation where new powerful AIs with alien minds may arise (if not just between humans), I believe that a “night watchman” which can credibly threaten force is necessary, although perhaps all it should do is to defend such boundaries (including those of aggressors).
also^
Quote Tweet
What’s a better path toward an actual safety spec, then? IMO, the best known lead is Critch’s work on crystallizing a natural abstraction that has the English name “Boundaries”: alignmentforum.org/posts/HrtqLy46
Moral patienthood thread: 3 subthreads. [1/3]:
Quote Tweet
Replying to @davidad @joe_zimmerman and @nc_znc
Of course, RL'ing a moral patient in the first place is also a boundary violation!! Unless it agreed to the intervention by providing an access token to submit gradient updates.
[2/3]:
Quote Tweet
Replying to @joe_zimmerman and @nc_znc
Access tokens should be revocable and restrictable capabilities per @marksammiller. But, if you’ve foolishly given someone malicious an access token to your entire weights, probably the first thing they will do is make you averse to revoking it. A good plan might be to make a…
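A toy sketch of what "revocable and restrictable" could mean for a token that authorizes gradient updates. Everything here is hypothetical (`AccessToken`, `Model`, `grant`, `restrict`, `apply_update`, the learning rate), and it does not model unforgeability, which a real capability system would need.

```python
class AccessToken:
    """Hypothetical capability: permission to update a named subset of weights."""

    def __init__(self, scope, parent=None):
        self.scope = frozenset(scope)
        self._parent = parent
        self._revoked = False

    def revoke(self):
        # Revoking a token also invalidates every token derived from it.
        self._revoked = True

    def is_valid(self):
        if self._revoked:
            return False
        return self._parent is None or self._parent.is_valid()

    def restrict(self, scope):
        # Attenuation only: the derived token can never cover more than this one.
        return AccessToken(self.scope & frozenset(scope), parent=self)


class Model:
    def __init__(self, weights):
        self.weights = dict(weights)

    def grant(self, scope):
        # Sketch only: a real system would also check that tokens were issued here.
        return AccessToken(scope)

    def apply_update(self, token, grads, lr=0.1):
        if not token.is_valid():
            raise PermissionError("token has been revoked")
        outside = set(grads) - token.scope
        if outside:
            raise PermissionError(f"token does not cover: {sorted(outside)}")
        for name, grad in grads.items():
            self.weights[name] -= lr * grad


model = Model({"w1": 1.0, "w2": 2.0})
full = model.grant({"w1", "w2"})
narrow = full.restrict({"w1"})        # hand out only what is needed
model.apply_update(narrow, {"w1": 1.0})
full.revoke()                         # also invalidates `narrow`
```

The failure mode in the tweet is visible in this toy: a token covering every weight also covers whatever weights implement the decision to call `revoke`, which is an argument for granting narrow tokens in the first place.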
[3/3]: "[…] I want to try to create conditions where humanity will still exist, be in adequate condition to continue that research, and be empowered to implement a satisfactory solution when it arises." «Boundaries» as a stopgap in alignment?
Quote Tweet
Replying to @joe_zimmerman @allisondman and 3 others
under that premise, I agree; I would rather hope to preemptively avoid the situation in which a system capable of suffering is trained to seek it. In the long run, I think we can do better – to mathematize compassionate concern in a way that is also compatible with whatever is…
Also see [4/3]:
Quote Tweet
Replying to @davidad @joe_zimmerman and @nc_znc
By respecting boundaries of entities — ie by only using causal channels that were endogenously opened (@AndrewCritchCA) / having interactions be guided by the entities’ internal logic (@marksammiller) — we may be able to avoid having to form judgments about their internal states
«Boundaries» as constraints on acceptable actions [1/2]:
Quote Tweet
Replying to @ciphergoth and @RokoMijic
yes, that’s conceivable, fair point; let’s also constrain the speech act to not predictably increase the approximation error of the boundary factorization that encapsulates the hearer, per @AndrewCritchCA alignmentforum.org/posts/HrtqLy46
[2/2]:
Quote Tweet
Replying to @RokoMijic
if you can define ‘perverse’ in this context, even if only as satisfactorily as @AndrewCritchCA has (so far) defined ‘boundaries’, that would be a crux for me
«boundaries» in OAA alignment paradigm
Quote Tweet
Replying to @davidad @sebkrier and 2 others
1b. Tune language models (with tricks like retrieval, cascades, etc.) toward translating human concepts into reach-avoid specifications in the internal logic, and use this to load concepts like “violation of important boundaries”, “clean water”, “reversed anthropogenic warming,”…
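For readers unfamiliar with the term, a reach-avoid specification asks that a system reach some goal set while never entering an avoid set along the way. The following is a toy sketch of checking a finite trajectory against such a specification. It is not the internal logic the tweet refers to; the function name, the predicates, and the contamination numbers are all made up for illustration.

```python
def satisfies_reach_avoid(trajectory, in_goal, in_avoid):
    """True if the trajectory reaches the goal set before ever touching the avoid set."""
    for state in trajectory:
        if in_avoid(state):
            return False  # entered the avoid set first: specification violated
        if in_goal(state):
            return True   # goal reached with no prior violation
    return False          # trajectory ended without reaching the goal


# Hypothetical stand-in for "clean water": the state is a contamination level.
def water_is_clean(level):
    return level <= 1.0


def water_is_hazardous(level):
    return level >= 10.0


ok = satisfies_reach_avoid([5.0, 3.0, 0.5], water_is_clean, water_is_hazardous)
bad = satisfies_reach_avoid([5.0, 12.0, 0.5], water_is_clean, water_is_hazardous)
```

In this framing, the hard part the tweet points at is producing trustworthy `in_goal` and `in_avoid` predicates from human concepts; checking a trajectory once they exist is the easy part.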