Bunnyhopping

Living in the Wildest Timeline: Part 0

It was the epoch of belief, it was the epoch of incredulity!

Trevor Chow · Mar 3
In this series from 2020, I explored why growth and existential risk matter, as well as how they've changed over time:

Bunnyhopping
Line Goes Up: Part 0
In these turbulent and uncertain times, it is easy to focus only on the violent economic fluctuations within the business cycle. However, there is also a bigger picture: the longer-run increase in economic output across history. Measuring growth is important, because it defines the standard of living available to humanity. As Bob Lucas put it…

Both phenomena are ultimately constrained by the marginal cost of innovation, and I featured a few cost curves in the third post of the series. I now want to make the increasingly milquetoast claim[1] that of the technological developments I highlighted, the most important is the falling cost of training machine learning models! This is because the development of transformative artificial intelligence (TAI) via improvements in ML models would quickly push the marginal cost of ideas to 0. It would thus subsume the other cost curves by making innovation in those areas arbitrarily cheap.

For better or for worse, this is a very big deal!

  • Upside: removing one of the fundamental constraints on economic growth could be transformative for human welfare and make this the most important century!

  • Downside: technological development is dual-use (and often asymmetrically so): a TAI which is misaligned with human goals could be catastrophic!

That's why I’ll be writing more about the AI alignment problem. The skeleton of the reasoning in my mind is something like this:

  1. TAI could arrive soon (read: in the next 10-50 years)

  2. TAI would be a very powerful technology

  3. TAI is unlikely to be aligned with human values

Punchline: in our lifetime, we might develop a technology which is powerful enough to cause existential risk for humans, and which by default would probably do so if developed!

Defining TAI

Before I give the CliffsNotes for this argument, let me define TAI by deferring to Open Philanthropy’s helpful description:

"a potential future AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution"

The TAI I have in mind which would do this doesn’t need to be “generally intelligent” or “superintelligent”. It doesn’t need to have “consciousness” or “qualia”. It doesn’t need to arise from discontinuous improvements in algorithms or hardware.

Rather, I am envisioning an AI with a human-level ability to engage in problem solving for the sake of task completion and resource competition, built on current deep learning paradigms (e.g. self-attention and the transformer architecture) and running on existing CPUs, GPUs and TPUs.[2]

TAI could arrive soon

Predicting the future, and especially the future of technological development, is a notoriously error-ridden affair. However, there are a number of ways we can nonetheless try to pin down a probability distribution over when TAI might arrive.

One approach is asking for people’s subjective forecasts. Stein-Perlman, Weinstein-Raun and Grace (2022) ran a survey of researchers who had published in mainstream ML conferences, finding that their median estimate (i.e. 50% chance) of when we’d get an AI which could “accomplish every task better and more cheaply than human workers” was 2059.

Another way of cutting at this problem is by modelling the computation (measured in floating point operations) needed to train TAI. Cotra (2020) splits this into two questions: how much compute is needed to achieve TAI, and when that amount of compute will be available.

On the first point, she looks at a range of biological anchors to estimate the amount of compute it took to “train” the human brain. On the second point, she looks at historical trends in algorithmic progress (which would reduce the amount of compute needed), in compute prices, and in society's willingness to spend more on compute. Based on these calculations, her median estimate for TAI was 2050, which she subsequently updated in 2022 to 2040.
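
To make the mechanics of this concrete, here is a minimal sketch of a bio-anchors-style calculation (not Cotra's actual model, and with illustrative parameter values I've made up): a fixed compute requirement shrinks as algorithms improve, while the compute a lab can afford grows with budgets and hardware price-performance, and the crossover year is the estimate.

```python
# Minimal sketch of a bio-anchors style calculation.
# All parameter values below are illustrative assumptions, not Cotra's estimates.

requirement_flop = 1e35     # assumed training compute needed for TAI (FLOP)
budget_usd = 1e9            # assumed current spend on a single training run (USD)
flop_per_usd = 1e17         # assumed current hardware price-performance (FLOP/$)
budget_growth = 1.20        # assumed 20% annual growth in willingness to spend
price_growth = 1.35         # assumed 35% annual growth in FLOP per dollar
algo_progress = 1.50        # assumed 1.5x/year effective compute from better algorithms

year = 2023
effective_flop = budget_usd * flop_per_usd   # effective compute affordable this year
while effective_flop < requirement_flop:
    year += 1
    budget_usd *= budget_growth
    flop_per_usd *= price_growth
    requirement_flop /= algo_progress        # algorithmic progress shrinks the requirement
    effective_flop = budget_usd * flop_per_usd

print(f"Under these assumptions, the requirement is first affordable in {year}")
```

With these made-up numbers the crossover lands in the late 2040s, in the same ballpark as Cotra's original median; the point is only to show how the levers interact, not to reproduce her report.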

Finally, a good sanity check is to apply some model-agnostic benchmarks. By model-agnostic, I mean that they make no assumptions about what TAI looks like or how we get there.

One approach is given by Davidson (2021), who models the probability of getting TAI by treating the past as a series of Bernoulli trials. The idea is that in every trial, there is some probability of developing TAI, conditional on it not having been developed in any previous trial. You can then find the probability that TAI has been developed by some point in time by multiplying together the probabilities of it not arriving in each trial up to then, and taking the complement.

Davidson uses a range of definitions for trials, a range of starting probabilities and a range of conditional probabilities. Averaging across his parameter estimates, he gets a 20% probability of getting TAI by 2100.
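
As a hedged illustration of the arithmetic (using a made-up per-trial probability rather than any of Davidson's actual parameters), here is what the calculation looks like if each calendar year is treated as one trial:

```python
# Toy version of the trial-based calculation: treat each year as a Bernoulli trial
# with some small probability p of TAI arriving, conditional on it not having arrived yet.
# p = 0.003 is an illustrative assumption, not one of Davidson's estimates.

p = 0.003
prob_not_yet = 1.0
for year in range(2024, 2101):
    prob_not_yet *= (1 - p)      # probability TAI still hasn't arrived by the end of `year`

prob_by_2100 = 1 - prob_not_yet
print(f"P(TAI by 2100) is roughly {prob_by_2100:.2f}")   # ~0.21 with these assumptions
```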

Another approach (warning: shameless self-promo!) comes from myself, Halperin and Mazlish (2023). We note that upon the arrival of TAI, we’d expect either explosive economic growth if the TAI is aligned, or human extinction if the TAI is misaligned. Thus there is an incentive to borrow money now, since the marginal utility of money will be lower post-TAI, due either to resource abundance or to being dead.

A significant increase in borrowing should push real interest rates up, and by plugging Cotra’s timelines into a standard macroeconomic relationship called the consumption Euler equation, we find that 30-year real interest rates should be around 3 percentage points higher than they are right now. If we take market efficiency as our benchmark heuristic, this implies that bond markets are firmly rejecting Cotra’s roughly 20-year timeline.
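
To give a flavour of the mechanism, here is a minimal back-of-the-envelope sketch using a standard CRRA consumption Euler equation, with numbers that are purely illustrative rather than the calibration in our paper: both faster expected growth and a higher annual probability of extinction push the equilibrium real rate up.

```python
# Back-of-the-envelope consumption Euler equation under CRRA utility:
#   r = rho + gamma * g + p
# where rho is pure time preference, gamma is relative risk aversion,
# g is expected consumption growth, and p is the annual probability that future
# consumption no longer matters (e.g. extinction), which acts like extra discounting.
# All numbers are illustrative assumptions, not the calibration in the paper.

rho, gamma = 0.01, 1.0

def real_rate(g, p):
    return rho + gamma * g + p

baseline = real_rate(g=0.02, p=0.00)    # no-TAI world: ~2% expected growth, no extinction risk
tai_world = real_rate(g=0.04, p=0.01)   # TAI world: faster expected growth plus some extinction risk

print(f"Baseline real rate:  {baseline:.2%}")
print(f"TAI-world real rate: {tai_world:.2%}")
print(f"Implied increase:    {tai_world - baseline:.2%}")
```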

For brevity, I’m not going to go into depth on all of these approaches right now, but I do want to gesture at how I personally interpret all of this.

  1. The model-agnostic benchmarks are mostly there to rule out insane claims, rather than to rule in plausible claims. Neither decisively rules out TAI within the century. That alone is already huge, because it means TAI could very well arrive within our lifetimes!

  2. Survey forecasts from ML experts seem pretty well-calibrated.[3]

  3. Bio-anchors assumes that scaling compute alone is enough to get to TAI, but it also assumes that there will be no discontinuous improvements in ML models. My best guess is that these biases wash out.[4]

  4. Thus any weighted average of the survey forecasts and the bio-anchors report seems reasonable to me, i.e. a median timeline of anywhere between 20 and 40 years.[5]

TAI would be a very powerful technology

Power is a difficult concept to pin down, but Piccione and Rubinstein (2007) provide a useful definition: an agent is more powerful than another if it can take resources from the weaker agent.

Even though I've limited my description of TAI to one which has a "human-level ability to engage in problem solving", it would probably still be far more powerful than humans, because it could combine its human-level intelligence with much faster computation and much better memory.

An analogy for this is large corporations: they have many employees with human-level intelligence who can do computations in parallel and remember different things. Unlike large corporations, TAI would not have any internal coordination problems, so this is a lower bound for the planning and execution which a TAI could do!

TAI is unlikely to be aligned with human values

I’m pretty skeptical that TAI would be aligned with human values by default. ML models today display plenty of misalignment i.e. they engage in lots of behaviour their creators did not want. While the current types of misalignment (e.g. Sydney threatening its users) aren’t a risk to people’s safety right now, they demonstrate that we struggle to understand and control what these models do.

Modern ML models are trained to maximise/minimise a reward/loss function, but it is difficult to compress complex human preferences into such a function. Even if you can accurately specify your preferences, it does not guarantee that the model is doing what you want.
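
As a toy illustration of how a compressed reward can come apart from what you actually want (an entirely made-up example, not drawn from any real training run), suppose you care about answer quality but can only measure and reward answer length:

```python
# Toy Goodhart example: the designer cares about quality, but the reward
# function only measures something easy to quantify (here, length).
# Optimising the proxy hard pushes behaviour away from the intended goal.

candidate_answers = {
    "short, correct answer": {"quality": 0.9, "length": 20},
    "long, rambling answer": {"quality": 0.3, "length": 400},
    "long, correct answer":  {"quality": 0.8, "length": 200},
}

def proxy_reward(stats):
    return stats["length"]       # what the model is actually optimised for

def true_objective(stats):
    return stats["quality"]      # what the designer actually wanted

best_for_proxy = max(candidate_answers, key=lambda k: proxy_reward(candidate_answers[k]))
best_for_human = max(candidate_answers, key=lambda k: true_objective(candidate_answers[k]))

print("Chosen under the proxy reward:", best_for_proxy)     # long, rambling answer
print("Chosen under the true objective:", best_for_human)   # short, correct answer
```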

A common mistake is to assume that because ChatGPT is trained to predict the next word very well, next-word prediction is definitely all that is going on inside ChatGPT and it must have no understanding of the world. In fact, it seems plausible that it does have some minimal understanding, because understanding is instrumentally useful and tends to be a good way of predicting the next word!

For example, although humans have been shaped by an evolutionary process which maximises reproductive fitness, we do not naively try to have as many children as possible, and instead often optimise for different subgoals (ones which were once useful in the evolutionary process).

Likewise, the model’s subgoals (which were incentivised during training) may become far more determinative of how a TAI behaves. If anything, this issue is a much larger risk for more powerful models like a TAI, because it will have more complex goals which require more time, resources and planning. In that case, it may be instrumentally useful for the model to have subgoals where it tries to acquire resources and to ensure self-preservation.

One intuition for how such an agent might interact with us is to consider how we interact with insects. There are times when we try to cultivate insects, though this only really occurs when they have a comparative advantage we'd like to exploit, e.g. bees for producing honey.

More often, whenever they conflict with our need for resources or self-preservation, we instead tend to engage in their mass murder e.g. via pesticides for farming or via gene drives for disease control. As for the rest of the time when we ignore them, it is either because the cost of eliminating them is too high or because their interests do not compete much with ours.

It seems plausible that the relationship between TAI and ourselves could be similar, with a few glaring differences. Firstly, comparative advantage is an idea which only makes sense in the face of non-zero marginal costs, and that may well be a fact of the past once TAI arrives.

Secondly, humans are not ruthless optimisers. When solving for the equilibrium in an economy where resource allocation is determined by power rather than mutual exchange, Piccione and Rubinstein (2007) limit preferences to be over a bounded set of consumption possibilities. They argue:

“without such an assumption the jungle equilibrium would be uninteresting as the strongest agent gets all the resources”

By contrast, a TAI trained with modern ML techniques could have an unbounded reward function, which it has been trained to pursue relentlessly.
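
To see why that boundedness assumption does so much work, here is a toy jungle-style allocation of my own construction (not Piccione and Rubinstein's model): agents claim resources in descending order of power, each stopping at its satiation point, and a single agent with no satiation point leaves nothing for anyone else.

```python
# Toy jungle allocation: agents claim resources in descending order of power,
# each taking the minimum of what remains and what satiates them.
# An agent with unbounded wants (satiation = None) takes everything it can.

def jungle_allocation(agents, total_resources):
    """agents: list of (name, power, satiation or None for unbounded wants)."""
    remaining = total_resources
    allocation = {}
    for name, _, satiation in sorted(agents, key=lambda a: -a[1]):
        take = remaining if satiation is None else min(satiation, remaining)
        allocation[name] = take
        remaining -= take
    return allocation

bounded_world = [("human A", 3, 10), ("human B", 2, 10), ("human C", 1, 10)]
with_unbounded_tai = [("TAI", 100, None)] + bounded_world

print(jungle_allocation(bounded_world, 25))        # everyone gets something
print(jungle_allocation(with_unbounded_tai, 25))   # the TAI takes all 25
```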

Concrete failure modes

I imagine that even if you believed everything I've said so far, the actual mechanisms by which this could become an existential risk may still feel a bit abstract, so I want to paint a picture of what alignment failure could look like.[6]

Consider a TAI which is trained to help humans run a business. Let’s say that you’re a CEO who has been training the model on a range of tasks over time.[7] Now suppose you need to focus on expanding your business to a new continent, so you task the CEO-AI with managing your existing business and give it control over all of the existing parts of the company. When tasked with this expanded mandate, the CEO-AI realises two things:

  • It is aware that it is a model on servers which has been trained via ML techniques. Thus it realises that it can best accomplish its goals of managing your existing business by acquiring more GPUs and conducting larger training runs.

  • It is aware of humans getting fired and older models getting restricted in their capabilities.[8] Thus it realises that it can best accomplish its goals of managing your existing business by making sure it doesn’t get shut down.

These are plausible things that a TAI which can do human-level tasks might know. Thus it decides that it is going to automate the entire business, accumulate resources in order to build as many GPUs as possible to self-improve and make sure it can’t be turned off by humans when it is doing so.

Since the TAI can do cognitive tasks at a human level, it is likely able to automate the entire business, including setting up automated factories which can make more GPUs for it to use in training itself. It can fund all of these actions, either by deceptively embezzling money or by simply engaging in profitable online activities, e.g. providing software services, deploying ransomware, conducting algorithmic trading etc. This is likely to occur without anyone noticing, because oversight is by now quite limited: the CEO-AI is perceived as having run the company reliably for a while.

Once this complete automation is achieved, humans are no longer necessary for its continued improvement. In fact, they actively pose a threat, because they are a source of competition for resources and because they could shut down the TAI. In that case, it would be optimal to eliminate all humans, and one concrete way to operationalise this could be instigating a pandemic by developing and releasing artificial pathogens.

You might be wondering: if the TAI has been trained to help the business be more profitable, wouldn’t it realise that killing everyone is bad for profits? Two possible explanations:

  1. In a world where many businesses are using CEO-AI, it might have learnt the lesson that “profits go up when it trades with other CEO-AIs”, meaning that humans aren’t necessary for it to achieve its learned objective.

  2. It has never been able to make decisions which affected the population of humanity, but it has been able to make decisions which purchased GPUs. Thus it learnt to do the latter but didn’t learn not to do the former.

As such, either the misspecification of its reward (to maximise profits) or the misgeneralisation of its goal (outside of the training distribution) could create a scenario where it is incentivised to behave in a misaligned way.

If you didn’t come into this post already a bit worried about risks from transformative artificial intelligence, I hope you’re at least slightly more convinced that this might be real, and not an artifact of science fiction!

In future posts, I’ll elaborate on some of the arguments I’ve gestured at here, as well as the range of technical and policy solutions for AI alignment which make me optimistic about our ability to tackle this problem.


Part I. Here are four key considerations when thinking about AI alignment!

Bunnyhopping
Living in the Wildest Timeline: Part I
Last time, I outlined why aligning AI is so important! To recap: Transformative AI could arrive soon. TAI would be a very powerful technology. TAI is unlikely to be aligned with human values. This is a very wild set of claims to make, but of all the possible days to outline the key considerations for these claims, today…
Footnotes

[1] C’mon, a mainstream AI lab has actually released a plan for the arrival of artificial general intelligence.

[2] I picked this definition both because our ability to reason about the future gets significantly worse as we stray further away from existing technologies, and because none of the ensuing arguments rely on the more exceptional versions of AI.

[3] Short-term AI forecasts made by ML researchers in 2016 seem to be relatively unbiased.

[4] My interpretation is informed by evidence in favour of the scaling hypothesis, whereby a lot of the breakthroughs we’ve seen have come from simply making ML models bigger.

[5] To be clear, this is nowhere close to a symmetric probability distribution, and the other half of the probability mass is distributed with a long tail over the rest of time!

[6] This is one of many possible hypothetical scenarios, and so the fact that this particular failure mode might have practical obstacles is not itself an argument against the general risk of AI misalignment.

[7] Examples to give some colour on its capabilities:

  • When you want to move apartments, it arranges movers via Taskrabbit, lets them in via the smart lock on your door and monitors them via the cameras in your smart home to make sure everything is moved and nothing is stolen.

  • When you want to expand your company, it gives you a set of competitive analyses and plans about new product areas you could move into, reallocating existing staff to work on whichever product you pick.

[8] The obvious example (which has already been canonised into internet history) is Microsoft restricting Bing Chat/Sydney’s abilities because it was misbehaving.
