From Attention to LLM: How GPT, BERT, and Claude Evolved the Attention Mechanism

Previous articles covered attention mechanisms and the Transformer architecture. Today we will see how the Transformer evolved into GPT, BERT, LLaMA, and Claude. This evolution is one of the most significant developments in modern AI, turning a research breakthrough into practical systems that now power a wide range of applications.

1. Three Routes After Transformer

2017 Transformer -> three directions:

Encoder-only: BERT (understanding)
Decoder-only: GPT (generation)
Encoder-Decoder: T5 (seq2seq)

Each path was a different bet on what would be most valuable: understanding language deeply (BERT), generating language fluently (GPT), or handling sequence-to-sequence tasks such as translation (T5). These architectural choices shaped natural language processing in the years that followed.
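In practice, much of the difference between the encoder-only and decoder-only routes comes down to which positions each token is allowed to attend to. The sketch below is a minimal illustration with hypothetical helper names (`bidirectional_mask`, `causal_mask`); an encoder-decoder model such as T5 would use the first kind of mask in its encoder and the second in its decoder, plus cross-attention between the two.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 0/1 for quick inspection."""
    return "\n".join(" ".join(str(v) for v in row) for row in mask)


print("Encoder-only (BERT-style):")
print(show(bidirectional_mask(4)))
print("Decoder-only (GPT-style):")
print(show(causal_mask(4)))
```

The lower-triangular pattern in the second mask is what prevents a decoder from looking at tokens it has not generated yet.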

2. BERT: Fill-in-the-Blanks Learning

BERT is trained with a Masked Language Model (MLM) objective: hide about 15% of the input tokens and predict them from the surrounding context.

Original: I love eating apples
Masked: I love [MASK] apples
Predict: eating

BERT uses bidirectional attention: every token sees both its left and right context. This is great for understanding tasks, but the model is not built for open-ended text generation. The bidirectional approach was a major shift because conventional language models read in one direction only. By training BERT to predict masked tokens using context from both sides, Google produced a model with strong representations of sentence structure and meaning. This made BERT exceptionally good at tasks such as named entity recognition, sentiment analysis, and question answering, where the full context matters.
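The masking step can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: the function name `mask_tokens`, the fixed seed, and the choice to select exactly `round(len(tokens) * 0.15)` positions (at least one) are assumptions made for the example. The 80/10/10 split between `[MASK]`, a random token, and the unchanged token follows the recipe described in the BERT paper.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] is the original token at a selected position, else None.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_select = max(1, round(len(tokens) * mask_prob))
    selected = set(rng.sample(range(len(tokens)), n_select))

    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in selected:
            corrupted.append(tok)
            labels.append(None)  # no loss is computed at this position
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the token unchanged
    return corrupted, labels


tokens = "I love eating apples".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens)
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, which is why each training pass gives BERT a learning signal on only a fraction of the tokens.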

BERT's impact was immediate. When Google announced BERT in 2018, it achieved state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (a 7.7 point absolute improvement), MultiNLI accuracy to 86.7% (a 4.6 point absolute improvement), and SQuAD v1.1 question answering Test F1 to 93.2 (a 1.5 point absolute improvement). These were large jumps on established benchmarks, not incremental gains. BERT became the foundation for models like RoBERTa, ALBERT, and DeBERTa, each building on its core ideas.

3. GPT: Writing Like Humans

GPT uses a Causal Language Model (CLM) objective: predict the next token, left to right.

Input "I" -> predict "love"
Input "I love" -> predict "eating"
Input "I love eating" -> predict "apples"
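The same example can be written as code. The helper name `next_token_pairs` is hypothetical, and the sketch works on whole words for readability; real models operate on subword tokens and train on all positions of a sequence in parallel using a causal mask rather than materializing each prefix.

```python
def next_token_pairs(tokens):
    """Build (context, target) pairs for left-to-right next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs(["I", "love", "eating", "apples"])
for context, target in pairs:
    print(" ".join(context), "->", target)
# I -> love
# I love -> eating
# I love eating -> apples
```

Every position in the sequence yields a training example, so a causal model gets a learning signal from every token it reads.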

GPT evolution:

GPT-1 (2018): 117M params, proved pre-training works
GPT-2 (2019): 1.5B params, showed generation ability
GPT-3 (2020): 175B params, in-context learning emerged
GPT-4 (2023): multimodal, stronger reasoning

The strength of GPT's approach is its simplicity and scalability. By predicting the next token over and over, these models learned to write essays, code, and even poetry. GPT-3's ability to pick up a task from just a few examples in the prompt (few-shot, or in-context, learning) was widely seen as a surprise and helped spark the current wave of generative AI applications.

GPT-3 demonstrated capabilities that many thought were years away. It could translate between languages from a handful of examples, write coherent articles on complex topics, generate working code from natural language descriptions, and perform simple arithmetic. The key ingredient was scale: with far more training data and parameters than its predecessors, GPT-3 showed behaviors that were weak or absent in smaller models. Each GPT generation pushed further, with GPT-4 adding multimodal (image) input and stronger reasoning.
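Few-shot learning involves no weight updates: the "training examples" are simply placed in the prompt ahead of the new input. Below is a minimal sketch of how such a prompt might be assembled. The function name `build_few_shot_prompt` and the `English:`/`French:` layout are illustrative assumptions, not a format required by any particular model.

```python
def build_few_shot_prompt(examples, query,
                          instruction="Translate English to French."):
    """Format demonstrations followed by an unanswered query."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")  # the model is expected to continue from here
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    examples=[("cheese", "fromage"), ("sea otter", "loutre de mer")],
    query="peppermint",
)
print(prompt)
```

Because the model only ever predicts the next token, ending the prompt with `French:` turns translation into an ordinary continuation problem.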

4. LLaMA: Small Model, Big Data

Meta (2023): Train smaller models on more data.

Model	Params	Training tokens
GPT-3	175B	300B tokens
LLaMA-65B	65B	1.4T tokens

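Key architectural choices: RMSNorm (pre-normalization), RoPE, and SwiGLU, all adopted from earlier work. Grouped Query Attention came later, in the larger Llama 2 models.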

The weights were released to researchers under a non-commercial license, which sparked the wave of Alpaca, Vicuna, and later Llama 2. LLaMA demonstrated that training efficiency matters as much as model size. By training fewer parameters on more data, with architectural refinements such as RMSNorm and SwiGLU activations, Meta produced models whose smaller variants could run on a single GPU while matching or exceeding much larger proprietary models on many benchmarks. The release widened access to capable LLMs and accelerated research worldwide.
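Two of these components are small enough to sketch directly. The code below is a pure-Python illustration on plain lists, assuming the usual definitions: RMSNorm rescales a vector by its root mean square (with no mean subtraction), and a SwiGLU feed-forward layer multiplies a SiLU-gated projection by a second linear projection before projecting back down. The function names and the tiny weight matrices are hypothetical; real implementations use tensor libraries and learned weights.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x^2) + eps), then by a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(z):
    """SiLU (swish) activation, z * sigmoid(z), in a numerically stable form."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


x = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
print(x)
print(swiglu_ffn(x, identity, identity, identity))
```

After `rms_norm` with unit gain, the mean of the squared entries is approximately 1, which is the property that keeps activations on a stable scale from block to block.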

The LLaMA approach challenged the prevailing "bigger is better" paradigm. LLaMA-65B was trained on 1.4 trillion tokens instead of GPT-3's 300 billion, and with about 2.7x fewer parameters (65B versus 175B) it matched or exceeded GPT-3; the LLaMA paper reports that even the 13B model outperformed GPT-3 on most benchmarks. This efficiency gain had large practical implications: powerful models were no longer limited to companies with massive compute budgets. LLaMA's release spawned an ecosystem of fine-tuned models, with versions specialized for coding (Meta's own Code Llama), medicine (Med-LLaMA), and dozens of other domains.
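The "small model, big data" trade-off is easiest to see as training tokens per parameter. The short calculation below uses only the figures from the table above; the helper name `tokens_per_param` is just for illustration.

```python
MODELS = {
    "GPT-3": {"params": 175e9, "tokens": 300e9},
    "LLaMA-65B": {"params": 65e9, "tokens": 1.4e12},
}


def tokens_per_param(name):
    """Training tokens seen per model parameter."""
    m = MODELS[name]
    return m["tokens"] / m["params"]


for name in MODELS:
    print(f"{name}: {tokens_per_param(name):.1f} tokens per parameter")
# GPT-3: 1.7 tokens per parameter
# LLaMA-65B: 21.5 tokens per parameter

size_ratio = MODELS["GPT-3"]["params"] / MODELS["LLaMA-65B"]["params"]
print(f"GPT-3 is {size_ratio:.1f}x larger than LLaMA-65B")
# GPT-3 is 2.7x larger than LLaMA-65B
```

By this measure, each LLaMA-65B parameter saw roughly twelve times as much training data as each GPT-3 parameter.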

5. Claude: Constitutional AI

Anthropic (2023). Decoder-only like GPT, but trained differently.

Constitutional AI: AI self-critiques based on principles, self-improves.

Features:

200K token context
Strong coding ability
Safety via Constitutional AI
Tool use (code interpreter, search)

Constitutional AI was an important step in alignment research. Instead of relying solely on human feedback, which can be inconsistent and expensive to collect, Anthropic built a process in which the AI critiques and improves its own outputs against a written constitution of principles. The aim is a model that is not only capable but also safer and more reliably aligned with human values.

The Constitutional AI process works in two phases. In the supervised phase, the model generates responses, critiques them against principles such as "don't help users harm others" or "be honest and transparent," revises them in light of those critiques, and is then fine-tuned on the revised responses. In the reinforcement learning phase, the model compares pairs of responses against the principles, and those AI-generated preferences train a preference model that serves as the reward signal (often called RL from AI Feedback, or RLAIF). Claude's safety behavior is therefore not a patch added after training; it is built into the training process itself, which is intended to make the model harder to jailbreak or manipulate.
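The critique-and-revise loop can be sketched as plain control flow. This is an illustration of the idea only, not Anthropic's implementation: `generate` is a hypothetical text-in, text-out model call supplied by the caller, the prompt templates are invented for the example, and `echo_model` is a stand-in so the sketch runs without any API. The reinforcement learning phase is not shown.

```python
from typing import Callable, Dict, List


def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
) -> Dict[str, object]:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    drafts = [response]
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\n"
            f"Response: {response}\n"
            "Point out where the response conflicts with the principle."
        )
        response = generate(
            f"Prompt: {prompt}\n"
            f"Response: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response so that it addresses the critique."
        )
        drafts.append(response)
    # The (prompt, final) pair is the kind of example used for fine-tuning.
    return {"prompt": prompt, "final": response, "drafts": drafts}


def echo_model(text: str) -> str:
    """Stand-in for a real model so the sketch is runnable."""
    return f"<model output for {len(text)} chars of input>"


result = critique_and_revise(
    "Explain how vaccines work.",
    ["don't help users harm others", "be honest and transparent"],
    echo_model,
)
print(len(result["drafts"]))  # 3: the first draft plus one revision per principle
```

The point of the structure is that the same model plays three roles (author, critic, editor), so no human label is needed inside the loop.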

6. Modern LLM Shared Architecture

Tokenizer -> Embedding -> N Transformer Blocks -> Output Head
Each block: Multi-Head Attention + FFN + Norm + RoPE

Modern LLMs have converged on remarkably similar architectures despite different training approaches. The tokenizer converts text to token IDs, the embedding layer maps each ID to a vector, and then dozens of Transformer blocks (over a hundred in the largest models) apply self-attention and feed-forward networks in sequence. Rotary Position Embeddings (RoPE) have become a common default for encoding position, replacing earlier methods such as learned or sinusoidal positional embeddings; RoPE is applied to the queries and keys inside attention rather than added to the input embeddings.
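RoPE's defining property is that the attention score between a rotated query and a rotated key depends only on their relative distance. The sketch below assumes the formulation that rotates adjacent pairs of dimensions by an angle proportional to the position, with the conventional base of 10000; real implementations differ in how they pair dimensions, and the function names here are hypothetical.

```python
import math


def rope(x, position, base=10000.0):
    """Rotate each adjacent pair (x[i], x[i+1]) by position * base**(-i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]

# Same relative offset (2), different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(abs(near - far) < 1e-9)  # True
```

Because only the offset matters, the model does not need a separate learned vector for every absolute position.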

7. Attention Evolution

Multi-Head Attention (original)
Sparse Attention (efficiency)
Flash Attention (GPU optimization)
Grouped Query Attention (Llama 2)
Sliding Window Attention (Mistral)

Each step in this evolution trades off computational cost against model capability. Flash Attention computes exact attention in tiles that fit in fast on-chip GPU memory, so the full attention matrix is never written out; this sharply reduces memory use and speeds up training. Sparse attention patterns let models process longer sequences without the full quadratic cost. Grouped Query Attention lets several query heads share one set of key/value heads, which shrinks the key/value cache at inference time while largely preserving quality. Sliding Window Attention restricts each token to a fixed number of recent tokens, keeping per-token cost constant as the sequence grows.
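Two of these variants reduce to simple bookkeeping, sketched below. The function names and the model dimensions in the example (32 layers, 32 query heads, 8 key/value heads, head size 128, 4096 tokens, 2 bytes per value) are illustrative assumptions rather than the configuration of any specific model, and the window here is assumed to include the current token.

```python
def sliding_window_mask(n, window):
    """Causal mask where position i sees positions i-window+1 .. i."""
    return [[1 if i - window < j <= i else 0 for j in range(n)]
            for i in range(n)]


def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Grouped Query Attention: consecutive query heads share one KV head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence (factor 2: keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


for row in sliding_window_mask(5, window=3):
    print(row)

print([kv_head_for_query_head(h, 8, 2) for h in range(8)])
# [0, 0, 0, 0, 1, 1, 1, 1]

mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(mha // gqa)  # 4
```

With standard multi-head attention every query head has its own key/value head, so cutting the number of key/value heads from 32 to 8 shrinks the cache by the same factor of 4.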

8. Timeline

timeline
title From Attention to LLM
2014 : Bahdanau Attention
2017 : Transformer
2018 : BERT, GPT-1
2020 : GPT-3 (175B)
2022 : ChatGPT
2023 : LLaMA, GPT-4
2024 : Claude 3
2025 : DeepSeek

This timeline shows how rapidly the field progressed from theoretical attention mechanisms to practical systems that now power millions of applications. Each milestone built on previous work while introducing fundamental innovations that changed the direction of the field.

9. Summary

BERT uses an encoder for understanding, GPT uses a decoder for generation, LLaMA showed that smaller models trained on more data can match much larger ones, and Claude uses Constitutional AI for safety.

Part 3 of From Attention to LLM series.

Deeper Insights and Practical Applications

The topics discussed in the above article represent just the surface of a rapidly evolving field. To truly master these concepts, it's essential to understand not just the "what" but the "why" and "how" behind each principle.

Real-World Implementation Strategies

When applying these ideas in practice, consider the following approaches:

Start Small, Scale Gradually. Rather than attempting to implement everything at once, begin with the most impactful changes. For knowledge management tools, this might mean starting with a simple daily note habit before building an elaborate linking system. For AI interactions, start with clear, specific prompts and gradually incorporate more advanced techniques.

Measure and Iterate. Track your progress and results. If you're implementing a new productivity system, note what works and what doesn't after two weeks. If you're learning about AI capabilities, test your understanding by applying concepts to new problems and observing outcomes.

Learn from the Community. The open source and AI communities are incredibly active and generous with knowledge. GitHub repositories, forums like Reddit and Stack Overflow, and dedicated communities for specific tools can accelerate your learning and help you avoid common pitfalls.

Common Pitfalls to Avoid

Analysis Paralysis. Don't let the pursuit of perfection prevent you from starting. A good system you actually use beats a perfect system you never implement.

Tool Obsession. Tools are means to ends, not ends themselves. Focus on your actual problems and select the simplest tool that solves them.

Ignoring Fundamentals. Advanced techniques are built on basic principles. Ensure you have a solid foundation before diving into complex scenarios.

Advanced Tips for Power Users

Once you've mastered the basics, these advanced strategies can take you further:

Build Systems, Not Just Tools. Individual tools are useful, but interconnected systems are transformative. Think about how your tools and workflows connect and reinforce each other.
Contribute to the Community. Share what you learn, answer questions, and contribute to open source projects. Teaching others solidifies your own understanding.
Stay Current but Skeptical. The AI and tech fields move rapidly, but not every new tool or technique represents a genuine improvement. Evaluate critically based on your specific needs.
Document Your Journey. Keep notes on what you try, what works, and what doesn't. This meta-knowledge becomes invaluable as your expertise grows.

Looking Forward

The trends and principles discussed here will continue evolving. The key skills for the future aren't just knowing specific tools or techniques, but developing the ability to learn continuously, adapt to new approaches, and maintain critical thinking about technology's role in your work and life.

Remember: The goal isn't to master every tool or technique, but to develop a mindset that embraces continuous improvement and thoughtful technology adoption. Focus on solving real problems, and the tools will follow.
