Last time, I outlined why aligning AI is so important!
To recap:
Transformative AI (TAI) could arrive soon.
TAI would be a very powerful technology.
TAI is unlikely to be aligned with human values.
This is a very wild set of claims to make, but of all the possible days to outline the key considerations behind them, today seems like as good a day as any! By writing these posts, I aim to clarify my thinking and reduce the uncertainty I have right now. Here’s a (very much non-exhaustive) list of decision-relevant questions I'll try to touch on in this series!
Timelines and takeoff
How likely is it that we can get to TAI by scaling up existing ML architectures with more computational power?
Conditional on this scaling hypothesis being true, what is the median timeline for the arrival of TAI based on extrapolating from current trends?
What are the dynamics of this trajectory, i.e. is it sub-linear, linear, exponential, super-exponential or something else? (See the sketch after this list.)
Before getting to TAI, will advanced AI systems have better-but-not-TAI capabilities across the board, or will they instead be TAI-level but in narrow domains?
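To make these growth regimes concrete, here's a minimal sketch in Python (every parameter is made up, so this illustrates the shapes of the curves rather than any actual forecast) comparing an exponential compute trend with a fixed doubling time against a super-exponential one whose doubling time shrinks each year:

```python
import numpy as np

# Toy comparison of growth regimes (illustration only; all numbers
# below are made up, not estimates of real compute trends).

years = np.arange(41)  # a 40-year horizon

# Exponential: compute doubles every 2 years, so the growth rate is fixed.
exp_growth = 2.0 ** (years / 2.0)

# Super-exponential: the doubling time starts at 2 years and shrinks
# by 5% each year, so growth compounds on itself.
doubling_time = 2.0 * 0.95 ** years
yearly_factor = 2.0 ** (1.0 / doubling_time)
super_exp_growth = np.cumprod(np.concatenate(([1.0], yearly_factor[:-1])))

for y in (10, 20, 40):
    print(f"year {y:>2}: exponential x{exp_growth[y]:.3g}, "
          f"super-exponential x{super_exp_growth[y]:.3g}")
```

The qualitative point: the two trends look similar early on, but the super-exponential one packs most of its growth into the final stretch, which is exactly the "relatively little warning" scenario below.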
These questions matter because when and how advanced AI systems arrive will make a big difference to the best way of deploying them. For example, one hypothetical set of answers is: “we can get TAI within the next 40 years by scaling the amount of computational resources we throw at ML models at a steady linear pace”.
If true, this would imply that we should be devoting substantial resources to alignment, and that incrementally improving our understanding of models as they become more powerful is an acceptable strategy.
On the other hand, if advanced AI systems increase the pace at which AI systems improve, we may get more explosive dynamics in the run-up to (and maybe even after) TAI, with capabilities rapidly outpacing our understanding and relatively little warning ahead of time.
Technical challenges
If we get TAI, what sort of ML architecture is it likely to be built on?
To what extent is it possible to specify human values which we’d want advanced AI systems to follow when deployed in the real world?
Are advanced AI systems likely to misgeneralise their goals when faced with out-of-distribution situations?
Do modern machine learning methods incentivise deception from advanced AI systems in a way that makes misalignment less detectable before deployment?
Can we predict the behaviour of advanced AI systems (and TAI) by extrapolating from the behaviour of smaller models? (See the sketch after this list.)
How much more costly is it to train aligned models versus unaligned models?
Are many prosaic methods of alignment dual-use, contributing to faster capabilities development? If so, does that mean advances in alignment will generalise and scale less well than advances in capabilities?
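The extrapolation question above is usually discussed in terms of scaling laws. As a minimal sketch (assuming, as one common hypothesis goes, that loss follows a power law L(N) = a*N^(-b) + c in parameter count N, and using entirely fabricated data points), fitting such a curve to small models and reading off a prediction for a much larger one looks like this:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical scaling-law fit. The loss values are fabricated;
# the functional form L(N) = a * N**-b + c treats c as an
# irreducible loss floor.
def power_law(n, a, b, c):
    return a * n ** -b + c

model_sizes = np.array([1e7, 1e8, 1e9, 1e10])  # parameter counts
eval_losses = np.array([4.2, 3.4, 2.8, 2.4])   # made-up numbers

(a, b, c), _ = curve_fit(power_law, model_sizes, eval_losses,
                         p0=(10.0, 0.1, 1.0))
print(f"predicted loss at 1e12 params: {power_law(1e12, a, b, c):.2f}")
```

Even when a fit like this tracks loss well, it may say little about specific capabilities, which can emerge abruptly rather than smoothly; that gap is much of why the question is hard.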
These questions revolve around identifying what the technical alignment problem actually is and why it is difficult. For example, one hypothetical view could be something like: “prosaic alignment techniques which make models more useful are cheaper and easier to find than theoretical discoveries that ensure models are robustly aligned with human values, but they do not scale to more powerful models which display surprising emergent capabilities like deception”.
In that case, we might lean against an incremental approach of doing just enough alignment for models to look acceptable for product release. Such an approach is likely to advance model capabilities while failing at alignment when it matters most, i.e. when large, powerful models that can hide misalignment during training arrive.
Deployment challenges
Is the market structure of large AI labs going to be oligopolistic, monopolistically competitive or something else?
How far ahead are the top AI labs right now and how long would it take a newcomer to catch up?
Are governments more likely to accelerate or decelerate AI development?
Given these firm entry dynamics, what is the probability of large AI labs getting into a capabilities arms race, versus agreeing to norms and regulations which involve moving more carefully?
Considering everything above, what is the probability that we will get a TAI which is misaligned with human values?
What sorts of behaviour do we expect such a misaligned TAI to actually exhibit?
How likely is it that the behaviour of a misaligned TAI leads to existential catastrophe?
These questions gesture at one key feature of the AI alignment problem: even if the technical challenge of alignment is no harder than other difficult STEM problems, it is unlike many of them because we don’t have an unlimited amount of time, or an unlimited number of tries, to solve it.
Plausible solutions
This last section has only one question, and in many ways, it is the question which matters. Everything above builds towards this:
What is the most promising portfolio of strategies for reducing the probability of an existential catastrophe from misaligned TAI?
This list of questions is a lot to unpack, and I’ve by no means covered every relevant angle. However, it gives you a sense of what I think is worth spending time to figure out! I expect many of my upcoming blog posts to be oriented around trying to clarify uncertainties I have regarding these questions, so I’d love to hear if there’s something big I’m missing!
Hence the name of this series! Indeed, the subtitle for this post is an allusion to something I said to a friend a few months ago, after realising just how weird it was to have spent the past hour discussing AI timelines:
“Either TAI is going to change everything, or we're in a cult on the 6th floor of a WeWork!”
All 3 leading AI labs had major product releases today: OpenAI dropped GPT-4, Anthropic rolled out Claude and Google released PaLM’s API!