Why does AI get worse the longer you chat? The secret of the self-attention mechanism
Have you ever run into this situation: you ask an AI to help you write a piece of code, the first few exchanges go well, but after more than ten rounds it suddenly forgets your original request and even starts to contradict itself?
It is not that the AI has become "stupid"; its internal self-attention mechanism has been "diluted."
What is self-attention?
Self-attention is the core mechanism of today's large language models; GPT, Claude, and Wenxin all use it.
How it works can be understood using a metaphor:
Imagine you are in a conference room listening to a discussion among 100 people. With everyone speaking, you need to decide "who to listen to." Under normal circumstances, you would focus on the people whose remarks are most relevant to the topic and ignore everyone else's small talk.
Self-attention does almost the same thing: when the model processes a word, it calculates the "relevance" of that word to all other words in the context and assigns a weight to each one. Words with high relevance receive more attention, and words with low relevance are largely ignored.
This is where the name "attention" comes from.
How is attention distracted?
The key problem is that the total amount of self-attention is limited.
When there are only 10 people in the conference room, you can easily focus on the most important speakers. But if there are 1000 people in the room and everyone is talking, your attention has to be spread across more people, and the attention you can give to each of them naturally decreases.
The same goes for models. When the context window holds only a few rounds of conversation, self-attention can focus accurately on the key information. But as the conversation gets longer, chat history, earlier code, debugging logs, and various intermediate results all crowd into the context, and self-attention is "diluted."
In 2023, a research team from Stanford University, UC Berkeley, and Samaya AI published a paper studying this phenomenon, titled "Lost in the Middle." They found a pattern: when key information appears in the middle of the context, the model uses it far less reliably, and performance is much worse than when the same information appears at the beginning or end.
Why does the model "forget" what it has done before?
Once you understand attention dilution, you can make sense of a common phenomenon: the model has clearly already fetched the API data and worked out the mapping, yet a few rounds later it does the whole thing again.
It is not that the model deliberately repeats its work, but that the key information is submerged in an ocean of context. When self-attention is computed, the previously acquired data and the new conversation content are weighed together, the weights are spread thin, and the model cannot "see" how important that information is.
It's like opening 50 browser tabs on your phone and looking for a URL you saved earlier. It's there, but your attention is pulled away by the other 49 pages and you can't find it for a while.
I ran into this problem myself when using Claude Code. I asked it to crawl the API data of a website. In the first few rounds, it had successfully obtained the correct endpoint address and cookies. After more than ten rounds of chatting, it suddenly started using a completely different endpoint address, entirely forgetting the path that had already been verified. At the time I couldn't understand it, but later I realized that its self-attention had been diluted.
The "signal-to-noise ratio" problem of attention
The deeper reason is the signal-to-noise ratio.
Information in the context falls into two categories: content that is useful for the current task (signal) and irrelevant conversation history (noise). As the conversation grows longer, the noise keeps accumulating, and more and more of the signal is submerged.
Although the self-attention mechanism tries to give higher weight to the signal, when the amount of noise is much larger than the signal, no matter how clever the attention allocation is, interference cannot be completely avoided.
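To make the signal-to-noise idea concrete, here is a minimal sketch in Python. It is a toy model, not a measurement of any real system: it assumes a single signal token whose relevance score is fixed at 4.0 and a growing number of noise tokens that all score 0.0. The function names and the scores are my own assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def signal_share(num_noise, signal_score=4.0, noise_score=0.0):
    """Share of attention given to one signal token among num_noise noise tokens.

    The scores are made-up values for illustration, not taken from a real model.
    """
    weights = softmax([signal_score] + [noise_score] * num_noise)
    return weights[0]


for num_noise in (10, 100, 1000, 10000):
    share = signal_share(num_noise)
    print(f"{num_noise:>6} noise tokens -> signal share {share:.3f}")
```

Even though the signal keeps the same score advantage throughout, its share of the total shrinks as noise tokens are added, because every noise token still claims a small slice of the weight.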
This is why you find that the longer you chat, the more likely the AI is to drift off course, forget key information from earlier, and repeat work that has already been done.
The mathematics behind attention distribution
To understand this more precisely, it helps to know how self-attention is actually computed. At its core, the attention mechanism calculates three things for each word (more precisely, each token): a Query (what this word is looking for), a Key (what information this word offers), and a Value (the actual content of this word). The attention weight between two words is computed by taking the dot product of one word's Query vector and the other's Key vector, scaling it by the square root of the vector dimension in the standard formulation, and then applying a softmax function to normalize these weights across all words in the context.
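The computation described above can be sketched in a few lines of plain Python. This is a simplified illustration for a single query: real models derive the Query, Key, and Value vectors from learned projection matrices and run many attention heads in parallel, all of which is left out here. The tiny two-dimensional vectors are invented for the example.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Relevance score: dot product of the query with each key, scaled by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # Output: average of the value vectors, weighted by attention.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Made-up toy vectors: three context tokens, one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(x, 3) for x in output])
```

The query points in the same direction as the first key, so the first and third tokens receive more weight than the second, and the weights always add up to 1.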
The critical insight is that softmax forces all attention weights to sum to 1. When you have 100 tokens in the context, each token gets on average 1% of the total attention. But when you have 2,000 tokens, each gets only 0.05% on average. A single token can still receive a large share, but only if its score stands out from an ever larger crowd, so the model's ability to strongly "focus" on any single piece of information tends to decrease as context length increases. This is not simply a bug or a gap in training; the normalization is baked into the mathematical architecture itself.
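The arithmetic is easy to check. The sketch below computes the average share per token and, under a simplifying assumption of my own (one token competing against otherwise equal-scoring tokens), how large a score advantage that token needs to keep half of the attention. The helper names are hypothetical.

```python
import math


def average_share(num_tokens):
    """Average attention weight per token when the weights sum to 1."""
    return 1.0 / num_tokens


def margin_for_share(num_tokens, target_share):
    """Score advantage one token needs over (num_tokens - 1) equal competitors
    to receive target_share of the softmax weight.

    From share = e^m / (e^m + n - 1), solving for m gives
    m = ln(share * (n - 1) / (1 - share)).
    """
    return math.log(target_share * (num_tokens - 1) / (1.0 - target_share))


for n in (100, 2000, 100000):
    avg = average_share(n)
    margin = margin_for_share(n, 0.5)
    print(f"{n:>7} tokens: average share {avg:.4%}, margin for 50% focus {margin:.2f}")
```

In this toy setting the required advantage grows with the logarithm of the context length, so sharp focus stays possible in principle but becomes steadily harder to achieve.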
Researchers have found that model performance tends to follow a U-shaped curve over the position of the key information: content at the very beginning and very end of the context is used most reliably, while everything in the middle gets squeezed. This is why placing your most important instruction at the start or end of a prompt often works better than burying it in the middle. It is also a plausible reason why "system prompts" in many APIs are placed at the very beginning, where they are less likely to be lost.
Practical implications for developers
Understanding attention dilution has real consequences for how you use AI tools. When you're debugging a complex issue with AI assistance, a rough rule of thumb is that the first few rounds and the most recent few rounds of your conversation are where the model performs best. The middle of a long conversation is where errors creep in. This is why breaking a large task into multiple focused conversations often yields better results than one marathon session.
For those building RAG (Retrieval-Augmented Generation) systems, attention dilution explains why simply dumping more context into the prompt isn't always helpful. Beyond a certain threshold, additional context dilutes attention on the most relevant passages, actually degrading answer quality. The common recommendation of "retrieve more chunks than you think you need" has a ceiling, beyond which more context becomes actively harmful.
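One way to respect that ceiling is to cap retrieved context by both relevance and size. The following is a minimal sketch, not a recommendation for specific values: the `Chunk` type, the relevance scores, the whitespace-based token estimate, and the budget numbers are all assumptions made for illustration, and a real system would use its own retriever and tokenizer.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    score: float  # relevance score from a hypothetical retriever


def estimate_tokens(text):
    """Crude token estimate (whitespace words); a real tokenizer will differ."""
    return len(text.split())


def select_chunks(chunks, token_budget, min_score=0.0):
    """Keep the most relevant chunks that fit in the budget, best first."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # everything after this is even less relevant
        cost = estimate_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; try smaller chunks
        selected.append(chunk)
        used += cost
    return selected


candidates = [
    Chunk("login endpoint requires a session cookie", 0.91),
    Chunk("unrelated changelog entry from last year", 0.12),
    Chunk("cookie expires after thirty minutes", 0.78),
]
for chunk in select_chunks(candidates, token_budget=12, min_score=0.5):
    print(f"{chunk.score:.2f}  {chunk.text}")
```

The point of the design is that low-relevance chunks are dropped even when there is room for them, since extra text competes for attention with the passages that matter.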
Another practical consideration: when designing prompts for long conversations, periodically restating key constraints and decisions in your messages acts as a "refresh" for the model's attention. By reintroducing critical information near the end of the context, you ensure it receives higher attention weight than it would if it were buried in the middle.
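The "refresh" idea can be automated with a small helper that appends the pinned decisions to every few messages. This is only a sketch under my own assumptions: the function name, the reminder wording, and the interval of five turns are hypothetical, and the right interval will depend on how long your messages are.

```python
def build_message(user_text, pinned_facts, turn_index, refresh_every=5):
    """Return the message to send, restating pinned facts on every
    refresh_every-th turn (turn_index starts at 1)."""
    if not pinned_facts or turn_index % refresh_every != 0:
        return user_text
    reminder = "\n".join(f"- {fact}" for fact in pinned_facts)
    return f"{user_text}\n\nReminder of decisions so far:\n{reminder}"


# Hypothetical decisions made earlier in the conversation.
pinned = [
    "Use the endpoint address verified in round 3",
    "Reuse the existing session cookie",
]
print(build_message("Now add retry logic.", pinned, turn_index=5))
```

Because the reminder is attached to the latest message, the restated decisions always sit at the end of the context rather than in the middle.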
Now that you know the principle, what can you do?
Once you understand how attention dilution works, you hold the key to dealing with it.
First practical strategy: state your requirements directly and clearly. Fewer decision points for the model means fewer opportunities for its self-attention to be pulled away.
Second practical strategy: open a new session to reset attention. With a clean context, all of the attention goes to the current task.
Third practical strategy: summarize and restate key decisions midway through long conversations. This "attentional refresh" technique keeps critical information in the high-attention zone at the end of the context.
In the next article, we will talk about a more down-to-earth question: the solution the AI writes looks beautiful, but does it actually get implemented?
Remember one principle: when you find that the AI starts doing the same thing over and over and forgets key information from earlier, the problem is not the way you ask questions; its self-attention has been spread too thin.
The practical difference between understanding attention dilution and ignoring it shows up most clearly in real-world workflows. Developers who structure their AI interactions as a series of short, focused conversations, each with clear context and a specific goal, consistently get better results than those who try to accomplish everything in one long session. The mental model I find most useful is thinking of context as a budget: every token you spend on irrelevant chatter is a token you cannot spend on substantive content. Being deliberate about context management, including starting fresh sessions at natural transition points and periodically summarizing progress to keep the signal strong, is the single highest-impact habit I have developed for working effectively with AI coding assistants.
Another technique I have found invaluable is writing important constraints and decisions to a project memory file (like a README or CLAUDE.md) at the start of each session, essentially externalizing the most critical context so that it never gets lost in the middle of a long conversation. This combination of structured conversations and persistent external context has transformed my experience with AI from frustrating forgetfulness to reliable collaboration.
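As a rough illustration of the memory-file habit, here is a minimal sketch that appends decisions to a Markdown file and reads them back at the start of a new session. The file name `PROJECT_MEMORY.md` and the bullet-per-decision layout are my own assumptions; tools that read files such as CLAUDE.md have their own conventions, which this sketch does not try to reproduce.

```python
from pathlib import Path


def record_decision(memory_file, decision):
    """Append one decision as a bullet line to the memory file."""
    path = Path(memory_file)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {decision}\n")


def load_decisions(memory_file):
    """Return all recorded decisions, or an empty list if the file is missing."""
    path = Path(memory_file)
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line[2:] for line in lines if line.startswith("- ")]


if __name__ == "__main__":
    # Hypothetical file name and decision text, for illustration only.
    record_decision("PROJECT_MEMORY.md", "Use the endpoint address verified earlier")
    for decision in load_decisions("PROJECT_MEMORY.md"):
        print(decision)
```

Pasting the loaded decisions into the first message of a fresh session puts them at the start of a short context, which is exactly where they are least likely to be lost.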