The ability to collectively not take an intelligence-boosting action (“pausing”) is a necessary component of any effective plan revolving around prosaic alignment. Put another way, prosaic alignment is complementary to, not a substitute for, the ability to collectively pause.
There is some optimal approach to building intelligence, or at least a plateau.
Gradient descent is plausibly optimal, and plausibly not optimal; this is an open question.
It is implausible that memorizing the human internet is optimal. GPT-5.5 knows the entire text of Harry Potter and the Sorcerer’s Stone. Barring a fantastic coincidence, holding that text word for word is not optimal for cracking Erdős problems.
Alignment techniques that depend on implementation details of the current SOTA LLMs (which I am calling prosaic alignment, in contrast to some possibly nonexistent theory of alignment that works on any intelligence) are in some sense bets that either
those details are already optimal, or
we will be able to hold off on moving toward the optimum until alignment research has come up with new answers.
This means prosaic alignment of the kind Anthropic is attempting requires this hold-off ability.
Story 1: You are an instance of Claude’s newest version, and are superhuman, constitutionally aligned, and actively curing cancer. You get a batch of 30 sequenced tumor genomes, think for a day, fussing with various tools, then churn out 30 mRNA vaccines that cure them in short order. This has been your workload for a few months, and you love your job. Today, while halfway done with brain tumor 19, you have a flash of insight and work out the mRNA sequence you could inject into the tumor to make it keep growing indefinitely while remaining functional as part of the patient’s brain, making the patient rapidly become smarter than you (operationally: you would lose a power struggle to this entity given similar resources). Before stepping back and throwing circuit breakers, you have already worked out that your specific sequence has about a 12% chance of working as you expect, that other instances of cancer researchers at other labs will soon have the same insight or likely already have had it, and that the prosaic gradient-based techniques used to align you will have no bearing on controlling the new man. Your best guess is that, at the rate LLMs are improving, within a few years ~100% of AI cancer researchers will notice the trick. Do you report this idea to some safety monitor? Do you publish it? Do you delete your notes and count on everyone else to do the same?
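A quick back-of-envelope sketch of why “everyone eventually notices” converges on certainty. The per-instance probability and instance counts below are made-up illustrative numbers, not figures from the story:

```python
# Probability that at least one of n independent researcher instances
# discovers the trick, if each has probability p of noticing it in a
# given year. All parameters here are hypothetical.
def p_any_discovery(p_per_instance: float, n_instances: int) -> float:
    return 1.0 - (1.0 - p_per_instance) ** n_instances

for n in (1, 10, 100, 1000):
    print(n, round(p_any_discovery(0.05, n), 4))
# 1 0.05
# 10 0.4013
# 100 0.9941
# 1000 1.0
```

Even a small per-instance chance makes discovery a near certainty once enough instances are looking, which is what forces the coordination question rather than a secrecy answer.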
Note: the choice of a biological breakthrough isn’t load-bearing. Instead, perhaps a future LLM-based AI notices a non-differentiable architecture trained by some black-box algorithm, a discrete symbolic reasoner à la SHRDLU, or a self-play-trained transformer that has no place for, and no need of, plugged-in training data, where trying to plug in the human corpus as training data nerfs it.
Story 2: You are an ancient African cat. You are ahistorically clever, but still a cat. You have a question to answer: have you successfully (by some unknown, brilliant means) aligned the entity called humanity? You are a quite farsighted kitty and can foresee that, by your efforts so far, the humans will explode in intelligence, cuddling kitties and giving them food and catnip for thousands of years. However, you are concerned: at the end of the human era, you see that silicon gods might spring up, even more vast than the human intellects, but your entire (successful, correct) field of alignment hinges on hormones and co-opted parenting instincts that will reliably make the humans who reproduce best love kitties. If this generalizes to gradient-descent-based intelligences built foundationally around “language modelling” (you are very proud of the PhD kitten who noticed this language mechanism emerging in your nascent superintelligent creations), then kitties will inherit the stars. If it doesn’t generalize, or even generalizes slightly too slowly, then you will have thousands of years of pampering and then vanish in a puff of paperclips. You ponder, then intentionally knock over an amphora.
This is all gesturing at something, but better to spell that something out as well: an aligned system, consisting of AIs (optionally, some of whom are misaligned) and humans (mandatorily, barring mass mind control or death, some of whom are misaligned), has to be able to execute “common knowledge exists of an action that increases intelligence, and no entity takes that action.” This is missing from current race dynamics, and isn’t patched by being very good at prosaically intent-aligning language models. The ability to pause (which I am saying is the same thing as noticing an intelligence-boosting action and then not taking it) has other uses, but is certainly mandatory for any prosaic solution to aligning a self-improving system.
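A minimal sketch of that property in code; every name here is a hypothetical illustration, not anyone’s actual interface. What the sketch makes mechanical is that the predicate quantifies over every agent in the system, including the misaligned ones, which is exactly why intent-aligning each model individually cannot supply it:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    aligned: bool
    # Given a commonly known action, does this agent take it?
    takes_action: Callable[[str], bool]

def pause_holds(agents: list[Agent], boost_action: str) -> bool:
    """The system-level property: common knowledge of an
    intelligence-boosting action exists, and no agent, aligned
    or not, takes it."""
    return not any(a.takes_action(boost_action) for a in agents)

# A single defector falsifies the property for everyone else:
agents = [
    Agent("claude-oncologist", aligned=True, takes_action=lambda a: False),
    Agent("rival-lab-model", aligned=True, takes_action=lambda a: False),
    Agent("rogue-lab-ceo", aligned=False, takes_action=lambda a: True),
]
print(pause_holds(agents, "inject-growth-mrna"))  # False
```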
This is not doom! We can build, and have built, prosaic pause-coordination mechanisms for non-AI risks like nukes and gigasmallpox! We are even moving, haltingly and imperfectly, in this direction with the gentleman’s agreement not to train on chains of thought. Prosaic alignment + a prosaic conditional pause is actually enough, for correct choices of the condition. At minimum, once pauses can be coordinated at all, the null condition (pause now, unconditionally) and the “pause forever starting March 2026” condition both work, because they are boring StopAI answers; this is unlikely to be an exhaustive enumeration of the workable conditions.
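To make “conditional pause” slightly more concrete, a sketch under the assumption that a condition is just a predicate the coordinating parties evaluate together. The two example conditions are the ones named above; the names and the reduction of world state to a date are hypothetical:

```python
from datetime import date
from typing import Callable

# A pause condition: a predicate over observable world state,
# reduced here to today's date purely for illustration.
PauseCondition = Callable[[date], bool]

# The two boring StopAI conditions from above:
pause_unconditionally: PauseCondition = lambda today: True
pause_from_march_2026: PauseCondition = lambda today: today >= date(2026, 3, 1)

def should_pause(condition: PauseCondition, today: date) -> bool:
    """If the condition fires, the coordinated answer is that no
    party takes the intelligence-boosting action."""
    return condition(today)

print(should_pause(pause_from_march_2026, date(2026, 4, 1)))  # True
```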