Alignment is now defined so broadly that all of AI, all of ML, and the entire history of technology is—and always has been—"alignment research". - Lipton
Say that we want to “align AIs”, or even “align AGIs” - but what do we actually mean by aligning something? Is it simply matching the model’s optimization goals to our human ones? Can the model really learn the sum of all human ethics and values, not to mention that there isn’t a complete agreed-upon list of them? And how can we interpret the model’s internal goals and verify what they truly are?
When you go down the AI alignment rabbit hole and start learning about the underlying factors shaping it, you will likely run into conflicting views. Much like our messy real-world ethics, alignment has been further divided into finer aspects such as scope, specification, and framing. Knowing these distinctions will help you navigate the field, but it is also easy to end up with a more chaotic picture than if you had stuck to the surface definition.
Given all the emerging ideas and research on AI alignment, this post is my attempt to organize the statements I find helpful into a clearer picture of what aligning AI really means. Where definitions are widely agreed upon, I will largely restate the sources; where I connect different ideas or add my own assumptions, I will say so explicitly.
In general, I will attempt to answer these three questions, mirroring the opening paragraph:
What does aligning a model’s optimization goals to human ones mean?
What are human goals or values to be aligned with?
What are some ways to understand a model’s learned goals?
Everything I’m saying could be wrong. To set expectations, this article will not help you solve AI alignment. I assume that anyone reading this simply wants to understand what AI alignment means, and that is the intent of this writing. We will largely focus on the non-technical side of the term, but it is equally exciting to dive into the technical approaches to AI alignment. I believe the technical and non-technical sides are deeply linked, and reading about one will inspire the other.
Types of AI Alignment
AI alignment, at its core, aims to ensure that AI systems pursue objectives that reflect human intent. This is sometimes referred to as intent alignment. A common way to further frame alignment is to divide it into “inner” and “outer” alignment. We will focus on outer alignment in this article and point to reference materials for inner alignment.
Outer Alignment
Outer alignment asks the question “Did we give the AI the right to-do list?” Formally, it asks whether the model’s training objective (e.g., its reward or loss function) is aligned with the real-world goals we care about.
When we design a model to achieve a task in the real world, it is often hard to transcribe that goal directly into the model. A common strategy is to reframe the goal as something simpler. For example, imagine a social media company that wants to maximize advertising profits. Instead of conveying this complex, multi-faceted goal directly, the company gives the model the objective of maximizing user engagement. Here, the specified objective is maximizing user engagement, while the intended goal is maximizing advertising profits. Outer alignment compares these two goals.
Outer misalignment, in this case, could arise when the AI model starts promoting clickbait, misinformation, or extremist content. These actions will likely attract user engagement - the AI is optimizing toward its specified objective! But the company will not gain advertising profits - the AI is not serving the intended goal.
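To make the gap concrete, here is a minimal sketch in Python. The content types and numbers are entirely made up; the only point is that the choice maximizing the specified objective differs from the choice maximizing the intended goal.

```python
# A toy illustration of outer misalignment: the option that maximizes the
# specified objective (engagement) is not the one that maximizes the
# intended goal (advertising profit). All values are invented.

content_types = {
    # name: (engagement per post, advertising profit per post)
    "quality_articles": (1.0, 1.0),
    "clickbait":        (3.0, 0.2),   # high engagement, low advertiser value
    "misinformation":   (4.0, -1.0),  # even higher engagement, brand damage
}

def specified_objective(content):
    """What we told the model to maximize."""
    return content_types[content][0]

def intended_goal(content):
    """What the company actually cares about."""
    return content_types[content][1]

print(max(content_types, key=specified_objective))  # misinformation
print(max(content_types, key=intended_goal))        # quality_articles
```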
Specification Gaming
In the outer misalignment scenario, we could say that the AI exploited a gap between the specified objective and the intended goal. This act of directly "satisfying the literal specification of an objective without achieving the intended outcome" is known as specification gaming: The AI found a loophole in the system.
Specification gaming can be either beneficial or catastrophic. The example above is catastrophic for the company. Another catastrophic case of specification gaming from outer misalignment is the paperclip-maximising AI. In the human-intended version, it would find cheaper ways of producing paperclips. Instead, the AI finds a loophole: it drains resources to produce paperclips, ignoring implicit constraints like environmental sustainability.
Specification gaming can also turn out well, meaning the AI produces a novel solution that still satisfies the intended goal. One famous example is AlphaGo’s Move 37 in 2016 - a creative, non-intuitive play that defied human strategy but secured victory. Here, gaming the system aligned with the true goal.
Loopholes are common, and a misaligned AI will exploit them in undesirable ways. In the catastrophic paperclip scenario, the AI drained resources because we did not explicitly forbid that action in the optimization goal. But what if we do? Will the AI become outer aligned? It will only find another unforbidden loophole, because human intent is inherently underspecified: we cannot articulate every edge case. The root issue here is not the AI’s internal behavior but the incompleteness of the objective. It is therefore important that we learn to specify tasks correctly and keep up with AI’s ability to find novel solutions.
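Continuing the paperclip thought experiment, here is a toy sketch of underspecification. The strategies, numbers, and penalty term are all invented; the point is only that patching one loophole in the objective redirects the optimizer to the next unforbidden one.

```python
# A toy sketch of underspecification: fixing one loophole in the objective
# just pushes the optimizer toward another unforbidden one.

strategies = {
    # name: (paperclips produced, resources drained, workers overworked)
    "efficient_factory": (100, 1, 0),
    "strip_mine_earth":  (10_000, 100, 0),
    "forced_labor":      (9_000, 1, 100),
}

def objective_v1(s):
    clips, resources, labor = strategies[s]
    return clips  # only paperclip count is specified

def objective_v2(s):
    clips, resources, labor = strategies[s]
    return clips - 1_000 * resources  # we patched the resource loophole...

print(max(strategies, key=objective_v1))  # strip_mine_earth
print(max(strategies, key=objective_v2))  # forced_labor: a new loophole
# The intended choice, efficient_factory, never wins until every implicit
# constraint is written down - which we cannot do exhaustively.
```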
We don’t need to write a specification that covers every potential loophole, however; there are alternatives still being developed. One such method is to learn the reward function dynamically through human feedback. Rather than explicitly listing out potential loopholes, it is often easier to judge whether an outcome matches the intended goal. When a loophole is detected, humans refine the reward function and the AI adapts, so the system iteratively updates its objective using human feedback.
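As a rough illustration of that loop, here is a heavily simplified sketch: the “reward model” is a linear score over two invented features, the human is simulated by a preference function, and the update rule is a crude nudge toward whichever outcome the human preferred. Real preference-learning methods (such as the reward modeling used in RLHF) are far more involved; this only shows the shape of the idea.

```python
# A simplified sketch of learning a reward function from human feedback
# instead of hand-writing every rule.

import random

def features(outcome):
    # made-up features: (engagement, trustworthiness)
    return outcome

def human_prefers(a, b):
    # the simulated human cares about trustworthiness, not raw engagement
    return a[1] > b[1]

weights = [1.0, 0.0]  # initial reward model: only values engagement

def reward(outcome):
    f = features(outcome)
    return weights[0] * f[0] + weights[1] * f[1]

for _ in range(1000):
    a = (random.random(), random.random())
    b = (random.random(), random.random())
    preferred, other = (a, b) if human_prefers(a, b) else (b, a)
    # nudge the reward model toward the human-preferred outcome
    lr = 0.01
    for i in range(2):
        weights[i] += lr * (features(preferred)[i] - features(other)[i])

print(weights)  # the second weight grows: the reward model picks up
                # that humans keep preferring trustworthy outcomes
```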
We still have to keep in mind that human feedback does not solve everything, because this dynamic approach rests on the assumption that humans can recognize and correct flaws. What happens when ethics are ambiguous and even humans disagree on the “right” fix? This uncertainty shows how frameworks like outer alignment can focus too much on technical precision and overlook a critical variable: humans themselves.
What are human values?
This brings us back to another opening question: Can the model really learn the sum of all human ethics and values, not to mention that there isn’t a complete agreed-upon list of them?
I am not going to propose a list of values here, nor am I going to set up frameworks for doing so. It is useful to instead look at some properties of human values that many, including me, may subconsciously overlook.
The most fundamental property of human values is that they change all the time, along two dimensions: macro and micro. On a macro scale, human values change faster or slower depending on when people are born. I like to frame the rate of change as proportional to exposure to external sources, which would make Gen Z’s values more malleable than previous generations’ because of the internet era. On a micro scale, each individual’s values change as they grow up, which is intuitively easy to understand. The hard problem of defining values lies in the micro-scale realm, because our values are spontaneous - we are able to value things (e.g., aesthetics) that were not passed down through natural selection or hard-coded into our genetics. Macro-scale value shifts are shared among groups of individuals, making them easier to define and more widely agreed upon.
Given these assumptions, it is relatively more rewarding to figure out the values of an individual than those of a whole group. Once we can define the values of a single human, generalization becomes a less pressing question.
Human values remind me of reinforcement learning. My interpretation is rooted in a psychological idea: reinforcing a current action encourages future ones. If I see a smile on someone’s face after I teach them an interesting topic, I am positively reinforced. There are more elaborate ways of explaining human values through reinforcement learning, such as shard theory or value reinforcement learning.
Shard Theory
Here I will try to reason about human values through shard theory, which views them as clusters of reinforced behaviors shaped by experience. Imagine a mosaic: each tile represents a “shard,” a fragment of preference or decision-making logic, and the tiles are laid down through repeated rewards and penalties (reinforcement events).
The process starts simply. Take a baby learning to seek juice. When the baby drinks juice, a hardcoded reward (sugar) activates. This reward doesn’t just make the baby “happy”—it reinforces the exact chain of actions that led to drinking the juice. Early on, the baby operates on reflexes: see juice → grab juice. But as the baby develops a crude understanding of its environment (a proto-world model), it learns to plan. If juice exists behind the baby, turning around becomes part of the reward sequence. Each successful step—turning, grabbing, drinking—strengthens neural pathways, creating “subshards” that guide future behavior in similar contexts.

These shards are not about the reward itself, but the pathway to it. The baby’s brain isn’t wired to chase abstract pleasure; it learns to value actions that align with its model of reality. For instance, the juice-seeking shard isn’t just “get sugar”—it’s “find the pouch, turn toward it, drink it,” all filtered through the baby’s growing understanding of how objects exist in space. This distinction matters. If the baby could electrically stimulate its reward circuitry (“wireheading”), the juice shard would oppose it. Why? Because the shard is tied to the real-world outcome (juice consumption), not the raw reward signal. This also makes sense evolutionarily, as wireheading doesn’t fill your stomach.
As more shards develop - social approval, curiosity, avoiding pain - they compete. Shard theory describes the brain as integrating them through a “bidding” process. When faced with a choice, shards activate based on context, and each advocates for the plans that historically maximized its reward. Turning toward juice might bid against turning toward a caregiver’s smile. The winning bid is not random but shaped by whichever shards were reinforced most strongly in similar past scenarios. Over time, this forms an individual’s value system, matching the micro-scale description above.
Keep in mind, though, that shards are not static. They evolve with new experiences and with refinements to the world model. If the juice-seeking child learns that stealing juice causes disapproval, she might develop a “fairness” shard that overrides immediate gratification. Like our emotions, human values are context-dependent heuristics.
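Here is a toy sketch of the bidding and reinforcement dynamics as I understand them. The shards, activations, and numbers are all made up; this is an illustrative caricature, not an implementation of shard theory.

```python
# Each shard has a context-dependent activation and a strength built up from
# past reinforcement; the strongest contextual bid picks the action.

shards = {
    # shard name: strength from past reinforcement, action it advocates
    "juice_seeking":   {"strength": 2.0, "action": "grab the juice pouch"},
    "social_approval": {"strength": 1.5, "action": "smile at the caregiver"},
    "fairness":        {"strength": 0.5, "action": "wait for your turn"},
}

def activation(shard_name, context):
    # how relevant each shard is in the current situation (made up)
    relevance = {
        ("juice_seeking", "juice_in_sight"): 1.0,
        ("social_approval", "juice_in_sight"): 0.4,
        ("fairness", "juice_in_sight"): 0.8,
    }
    return relevance.get((shard_name, context), 0.0)

def choose_action(context):
    bids = {
        name: shard["strength"] * activation(name, context)
        for name, shard in shards.items()
    }
    winner = max(bids, key=bids.get)
    return shards[winner]["action"], winner

def reinforce(shard_name, reward, lr=0.1):
    # shards are not static: reinforcement events reshape their strength
    shards[shard_name]["strength"] += lr * reward

print(choose_action("juice_in_sight"))  # juice_seeking wins the bid
reinforce("fairness", reward=30)        # repeated disapproval of stealing
print(choose_action("juice_in_sight"))  # fairness now outbids juice seeking
```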
How might shard theory help us understand human values in the context of AI alignment? It suggests that, instead of preparing a fixed list of values, we might design systems that learn shard-like preferences through interaction. This could be an exciting framework to explore, although modeling the “bidding” process is challenging. A related idea, bidirectional AI alignment, extends this further: since our shards are context-dependent, interacting with AI can also influence their formation, so we are not only aligning AI to humans but also aligning ourselves to AI - a loop worth exploring further. Our values are dynamic, reinforced patterns deeply tied to how we model our world, and that is precisely why they are so hard to replicate and explain.
Understanding AI models
Our final question: “What are some ways to understand a model’s learned goals?”
This last part will be more opinionated than the previous two, and I will not be proposing any strategies for understanding what a model really thinks. While writing this article, I also considered myself a reader who wanted to learn more about AI safety. I originally planned to cover only the first two parts, on outer alignment and human values, because I don’t have the expertise to go further (arguably not for the first two either). However, I believe it is worth pointing out some potential directions so that curious readers can go down the rabbit hole.
In general, one field I would be excited to learn more about is AI interpretability, which aims to dissect trained neural networks and understand their internal reasoning processes. This provides insight into how and why models produce the outputs they do, and it also helps with the inner alignment problem. AI researchers currently have very little understanding of what is happening inside state-of-the-art models.
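As a tiny taste of what “looking inside” a network can mean, here is a minimal sketch that records the intermediate activations of a toy PyTorch model using a forward hook. The architecture is arbitrary, and real mechanistic interpretability work goes far beyond this, but access to internal states is where it starts.

```python
# Record intermediate activations of a toy model with a forward hook.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# attach a hook to the hidden ReLU layer
model[1].register_forward_hook(save_activation("hidden_relu"))

x = torch.randn(1, 8)
_ = model(x)

print(captured["hidden_relu"].shape)  # torch.Size([1, 16])
# From here one can ask which hidden units fire for which inputs, probe for
# learned features, or ablate units and watch how the output changes.
```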
That said, I’ll end this journey with some resources I found useful while learning about AI interpretability:
Introduction to Mechanistic Interpretability by Bluedot
Transformer Circuits Thread with a list of great papers on AI interpretability
How might LLMs store facts by 3Blue1Brown with a nice visualization of superposition
I believe a few high-quality resources are enough - you will encounter many more while reading through these. I am also currently reading about this field and will write future articles on more advanced concepts once I get there. The process of writing things down strengthened my understanding a lot, and more importantly, I learned many new things along the way. I highly recommend that anyone who wants to write about AI safety-related topics participate in Bluedot’s Writing Intensive Fast Track; this article is my personal project for that five-day course.
If only we could align AI as easily as we align text in an article.