Getting Started with Tokens: A Complete Guide to the Basic Unit of Text in the AI Era
You may have run into this: you wrote a prompt of only a few hundred words, yet the AI's reply was cut off or never came; or you opened the API bill at the end of the month and wondered how you could possibly have used that much. Both questions point to one concept you need to understand: the Token.
The Token is the basic unit an AI model uses to understand, process, and generate text. Once you understand it, you can estimate API costs accurately, write more efficient prompts, and avoid unexplained truncation. Put bluntly, if you do not understand Tokens, you are fumbling in the dark when you use AI.
I will start from scratch and walk you through what Tokens are, how they are counted, how prices differ between models, and practical money-saving techniques. By the end, you will be able to calculate how many Tokens each call costs and apply techniques that can cut the cost roughly in half.
1. What you will learn
After completing this tutorial, you will be able to do the following on your own:
- Accurately calculate the number of Tokens in any text: whether it is a Chinese or an English document, you can work out how many Tokens it consumes
- Compare the Token prices of different AI models: know the difference between GPT-4, Claude, and Chinese domestic models, and why the same content can cost several times more on one than on another
- Understand the logic behind API bills: no more confusion over unexplained charges, because you can calculate the cost of each call yourself
- Write prompts that use fewer Tokens: compress the same result into fewer Tokens and save real money
- Handle the edge cases of long text: know what the context window is, what to do when it is full, and how to process long documents in blocks
2. Prerequisites
Before reading this article, you should have:
- Basic programming concepts: know what variables, functions, and API calls are. As long as code does not scare you, you do not need to know how to write it.
- Experience with ChatGPT or a similar AI chat tool: know what AI can do and have a basic idea of what a "large model" is.
- Junior high school English level: Token calculation involves some English terms, but I explain the core concepts in plain language.
If you are a complete beginner, I recommend trying ChatGPT or Claude before reading this article; the material will make more sense.
3. Environment setup
Before you start counting Tokens, you need a counting tool. I recommend the 'tiktoken' library published by OpenAI, which is the most accurate option for OpenAI models.
Software to be installed:
- Python 3.8+: Programming environment, must be installed. If you are not sure about the version, open the terminal and run python3 --version to check.
- pip: Python package manager, usually installed with Python. Verification command:
pip --version
Installation steps:
- Open the terminal (Terminal on Mac; PowerShell or Command Prompt on Windows)
- Create a folder specifically for practice:
mkdir token-practice && cd token-practice
- Create a Python virtual environment (optional but recommended):
python3 -m venv venv
- Activate the virtual environment:
# Mac/Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
After successful activation, the (venv) flag will appear in front of the terminal prompt.
- Install the tiktoken library:
pip install tiktoken
Expected output (the wheel file name varies with your platform and Python version):
Collecting tiktoken
Downloading tiktoken-0.7.0-cp311-cp311-macosx_10_9_x86_64.whl (xxx kB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0
- Install an additional auxiliary tool, 'langchain', for later experiments (optional; the scripts in this article only import tiktoken):
pip install langchain
After the installation is complete, run python -c "import tiktoken; print('tiktoken OK')" to verify it. If you see 'tiktoken OK', everything is working.
4. What exactly is a Token
To understand Tokens, you must first understand how an AI model handles text.
AI models do not directly "read" text the way the human brain does. They first break the text into small pieces, process those pieces, and then put the result back together. These small pieces are Tokens.
A simple analogy: when you read an article, your eyes actually scan groups of words quickly rather than reading one letter at a time. The AI model processes text in a similar way, except that its "grouping rules" are derived from statistical patterns in language.
English Token Rules:
Token division in English is fairly simple. The usual rules of thumb are:
- 1 Token ≈ 4 characters
- 1 English word ≈ 1-2 Tokens
For example, a common short word such as 'hello' or 'a' is typically 1 Token, while a longer or rarer word such as 'Artificial' may be split into 2 or more, depending on the encoder.
Chinese Token rule:
Chinese is a little different. It has no spaces between words the way English does, so it is handled differently:
- 1 Chinese character ≈ 1-2 Tokens
- Punctuation marks also count as Tokens
- 1 Chinese word ≈ 1-3 Tokens
More precisely, the number of Tokens in Chinese text depends on the tokenizer and model used. OpenAI's GPT series uses the BPE (Byte Pair Encoding) algorithm, under which Chinese averages 1-2 Tokens per character.
Why is Token division so complicated?
Because the AI model does not understand language in units of "characters" or "words"; it works in units of "subwords." The word 'Transformer' may be broken into two Tokens, 'Trans' + 'former', because both parts often appear in other words. This splitting lets the model handle new words it has never seen before.
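To make the subword idea concrete, here is a minimal sketch that splits a word by greedy longest-match against a tiny made-up vocabulary. Both `TOY_VOCAB` and `split_into_subwords` are illustrative assumptions; real BPE tokenizers apply learned merge rules over bytes rather than this simple matching, so treat it only as an intuition aid.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Characters that match nothing in the vocabulary are emitted on their own.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary, far smaller than any real one.
TOY_VOCAB = {"Trans", "former", "form", "er", "in", "per"}

print(split_into_subwords("Transformer", TOY_VOCAB))  # ['Trans', 'former']
print(split_into_subwords("performer", TOY_VOCAB))    # ['per', 'former']
```

Note that 'performer' never appears in the vocabulary, yet it still splits into known pieces; that is the sense in which subwords cope with unseen words.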
Remember this core point: the Token count is not the word count; it is the number of smallest units the model processes. For the same content, Chinese needs fewer characters than English (each Chinese character carries more information), but with encoders such as cl100k_base it often ends up with as many or more Tokens, because each character costs 1-2 Tokens. This is not absolute and depends on the specific content and encoder.
5. How the Token count is calculated
Now that you know what a Token is, the next step is to work out how to count them.
Official calculation rules (taking OpenAI as an example):
OpenAI uses a tokenization algorithm called BPE (Byte Pair Encoding). Put simply: count the most frequently occurring adjacent character combinations in the corpus, merge them into Tokens, and repeat. The more rounds of merging, the longer the resulting Tokens.
The 'cl100k_base' encoder used by GPT-4 has a vocabulary of about 100,000 different Tokens.
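The merge procedure described above can be sketched in a few lines of plain Python. This is a toy, character-level version written for illustration: the function name `learn_bpe_merges` and the four-word corpus are assumptions, and production encoders such as cl100k_base work on bytes with a far larger corpus and vocabulary.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken alphabetically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('o', 'w'), ('l', 'ow')]
```

After two merges the frequent word 'low' has become a single symbol, while the rarer endings 'e', 'r', 's', 't' are still separate, which is why common words end up as one Token and rare ones as several.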
Practical calculation method:
The 'tiktoken' library gives the most accurate count for OpenAI models:
import tiktoken

# Select encoder, use cl100k_base for GPT-4
encoding = tiktoken.get_encoding("cl100k_base")
text = "Hello, world! Hello world! "
# The encode method returns a list of Token IDs
token_ids = encoding.encode(text)
print(f"Text length: {len(text)} characters")
print(f"Token quantity: {len(token_ids)}")
print(f"Token IDs: {token_ids}")
Expected output (the ID list is omitted here because the exact values depend on the encoder's vocabulary):
Text length: 27 characters
Token quantity: 8
See? 27 characters correspond to 8 Tokens, a little over 3 characters per Token, in line with the English rule of thumb. Chinese text comes out much closer to 1 Token per character, and punctuation and special characters each take Tokens of their own.
Token consumption estimates for different scenarios:
| Text type | Estimation rule | Example |
|---|---|---|
| Short English sentence | 1 Token per 4 characters | "Hello world" → 2 Tokens |
| Long English paragraph | 1 Token per 4 characters, or about 1 Token per 0.75 words | 100-word article → approximately 130-140 Tokens |
| Short Chinese sentence | 1-1.5 Tokens per character | "你好世界" ("Hello world", 4 characters) → approximately 4-6 Tokens |
| Long Chinese paragraph | 1-1.5 Tokens per character | 100 Chinese characters → approximately 100-150 Tokens |
| Code | Varies widely. Simple code is close to English; complex code costs more Tokens. | 1 line of code → 5-20 Tokens |
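When no tokenizer is at hand, the rules of thumb in the table can be turned into a quick estimator. The sketch below is an approximation only: the function name `estimate_tokens`, the default of 1.25 Tokens per Chinese character (the midpoint of 1-1.5), and the use of the basic CJK Unicode range to detect Chinese are all assumptions made for illustration.

```python
def estimate_tokens(text, cjk_tokens_per_char=1.25, other_chars_per_token=4.0):
    """Rough Token estimate: CJK characters cost more than other characters."""
    # Count characters in the basic CJK Unified Ideographs range.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return round(cjk * cjk_tokens_per_char + other / other_chars_per_token)


print(estimate_tokens("Hello world"))  # 3 (rule of thumb; a real tokenizer may differ)
print(estimate_tokens("你好世界"))      # 5
```

Use it for budgeting and sanity checks only; for billing-grade numbers, count with the real tokenizer or read the usage figures returned by the API.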
Recommended practical calculation tools:
Besides tiktoken, there are online tools you can use directly:
- OpenAI Tokenizer (platform.openai.com/tokenizer): The official tool and the most accurate, but it only supports GPT-series encodings
- Anthropic Token Counter (anthropic.com): Token counting for Claude models
- tiktokenizer (tiktokenizer.vercel.app): An online front end for tiktoken, with good visualization
I recommend bookmarking the OpenAI Tokenizer and pasting text into it whenever you are unsure.
6. Comparison of Token prices for mainstream AI models
Once you know how to count Tokens, the next thing you will care about is price. The differences between models are far from small.
Comparison of Token prices for mainstream models (data as of 2024; prices are estimates, so check each vendor's latest official pricing; prices marked ¥ are quoted per 1K Tokens):
| Model | Input price ($/1M Tokens) | Output price ($/1M Tokens) | Context window | Remarks |
|---|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 128K | The most cost-effective multimodal model |
| GPT-4 Turbo | $10.00 | $30.00 | 128K | A faster version of GPT-4 |
| GPT-4 (8K) | $30.00 | $60.00 | 8K | Old version, more expensive |
| GPT-3.5 Turbo | $0.50 | $1.50 | 16K | Cheapest GPT model |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | Strong at processing long text |
| Claude 3 Opus | $15.00 | $75.00 | 200K | Flagship model, more expensive |
| Claude 3 Haiku | $0.25 | $1.25 | 200K | The cheapest Claude |
| Tongyi Qianwen (Qwen-Max) | ¥0.04/1K | ¥0.12/1K | 8K | Low-cost Chinese domestic option |
| Wenxin Yiyan (ERNIE Bot) 4.0 | ¥0.12/1K | ¥0.12/1K | 4K | From Baidu; strict rate limits |
| Kimi (Moonshot) | ¥0.012/1K | ¥0.012/1K | 128K | Best long-context value for money |
Reasons for price differences:
You may ask why Claude 3.5 Sonnet is cheaper than GPT-4o on input even though its context window is so much larger. Several factors are behind the price differences:
- Different model capabilities: GPT-4o is positioned as stronger in reasoning, multi-turn dialogue, and multimodal understanding.
- Differences in training costs: the larger the model and the more training data, the higher the cost
- Market positioning: Chinese domestic models usually adopt low-price strategies to win market share
- Context windows: being able to process longer text requires more GPU memory and computing resources
Example of actual cost calculation:
Suppose you want to process a Chinese document of about 50,000 Tokens (roughly 35,000-50,000 characters at 1-1.5 Tokens per character) and the reply takes 20,000 Tokens. Cost = Tokens ÷ 1,000,000 × price per 1M Tokens (Kimi's ¥ price is per 1K Tokens). The cost with different models:
| Model | Input cost | Output cost (assuming 20,000 Tokens) | Total cost |
|---|---|---|---|
| GPT-4o | $0.25 | $0.30 | $0.55 |
| GPT-3.5 Turbo | $0.025 | $0.03 | $0.055 |
| Claude 3.5 Sonnet | $0.15 | $0.30 | $0.45 |
| Kimi | ¥0.60 | ¥0.24 | ¥0.84 |
If you process 10 of these documents a day, a 30-day month costs about $165 with GPT-4o and about $16.50 with GPT-3.5 Turbo. The gap is obvious.
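The arithmetic behind these figures follows directly from the per-million prices in the price table: divide the Token count by 1,000,000 and multiply by the price. A minimal sketch, with the helper names `call_cost` and `monthly_cost` and the 30-day month chosen here as assumptions:

```python
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one call, with prices quoted per 1M Tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


def monthly_cost(cost_per_document, documents_per_day, days=30):
    """Scale a per-document cost to a month of daily processing."""
    return cost_per_document * documents_per_day * days


# 50,000 input Tokens and 20,000 output Tokens at the 2024 list prices.
gpt4o = call_cost(50_000, 20_000, 5.00, 15.00)
gpt35 = call_cost(50_000, 20_000, 0.50, 1.50)

print(f"GPT-4o: ${gpt4o:.3f} per document, ${monthly_cost(gpt4o, 10):.2f} per month")
print(f"GPT-3.5 Turbo: ${gpt35:.3f} per document, ${monthly_cost(gpt35, 10):.2f} per month")
# GPT-4o: $0.550 per document, $165.00 per month
# GPT-3.5 Turbo: $0.055 per document, $16.50 per month
```

Because output Tokens are priced several times higher than input Tokens, the output estimate often dominates the bill, so it is worth setting it realistically.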
7. Hands-on 1: Calculate the number of Tokens in any text
Now for the hands-on part. In this section, I will walk you through a complete Token calculation script.
Task goal: Write a Python script that can calculate the number of Tokens for any text and give a rough cost estimate.
Full code:
#!/usr/bin/env python3
"""
Token calculator
Function: Calculate the number of text Tokens and estimate the cost of API calls
"""
import sys

import tiktoken

# Encoder names for different models.
# tiktoken only ships OpenAI encodings; there is no Claude encoder,
# so the Claude costs below are approximations based on an OpenAI encoding.
ENCODERS = {
    "gpt-4": "cl100k_base",
    "gpt-3.5": "cl100k_base",
    "gpt-4o": "o200k_base",
}

# Price list ($/1M Token), data source: official pricing of each vendor (estimate)
PRICES = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}


def count_tokens(text, model="gpt-4"):
    """Calculate the number of Tokens in the text"""
    # Both GPT-4 and GPT-3.5 use cl100k_base; unknown models fall back to it
    encoding = tiktoken.get_encoding(ENCODERS.get(model, "cl100k_base"))
    return len(encoding.encode(text))


def estimate_cost(token_count, model="gpt-4o", is_output=False):
    """Estimate API call costs"""
    if model not in PRICES:
        return None
    price = PRICES[model]["output"] if is_output else PRICES[model]["input"]
    return (token_count / 1_000_000) * price


def main():
    print("=" * 50)
    print("Token Calculator v1.0")
    print("=" * 50)
    # Read input text
    if len(sys.argv) > 1:
        # Read from command line parameters
        text = " ".join(sys.argv[1:])
    else:
        # Interactive input
        print("\nPlease enter the text to be calculated (press Ctrl+D to finish after entering):")
        text = sys.stdin.read()
    if not text.strip():
        print("Error: No text entered")
        sys.exit(1)
    # Calculate the number of Tokens
    token_count = count_tokens(text)
    char_count = len(text)
    print(f"\n [Calculation Result]")
    print(f"Number of characters: {char_count}")
    print(f"Token quantity: {token_count}")
    print(f"Token/character ratio: {token_count/char_count:.2f}")
    # Estimate costs for each model (the same Token count is reused for all of them)
    print(f"\n [Cost estimation (assuming 1000 Tokens output)]")
    for model in PRICES:
        input_cost = estimate_cost(token_count, model, is_output=False)
        output_cost = estimate_cost(1000, model, is_output=True)
        total = input_cost + output_cost
        print(f"{model:25s} |Input: ${input_cost:.6f}| Output: ${output_cost:.6f}| Total: ${total:.6f}")


if __name__ == "__main__":
    main()
How to run:
- Save the above code as token_counter.py
- Method 1: Calculate a single line of text
python token_counter.py "Hello, how are you today? "
Expected output:
==================================================
Token Calculator v1.0
==================================================

 [Calculation Result]
Number of characters: 26
Token quantity: 8
Token/character ratio: 0.31

 [Cost estimation (assuming 1000 Tokens output)]
gpt-4o                    |Input: $0.000040| Output: $0.015000| Total: $0.015040
gpt-4-turbo               |Input: $0.000080| Output: $0.030000| Total: $0.030080
gpt-3.5-turbo             |Input: $0.000004| Output: $0.001500| Total: $0.001504
claude-3.5-sonnet         |Input: $0.000024| Output: $0.015000| Total: $0.015024
claude-3-haiku            |Input: $0.000002| Output: $0.001250| Total: $0.001252
- Method 2: Calculate multiple lines of text (interactive mode)
python token_counter.py
Then enter multiple lines of text and press Ctrl+D (Mac/Linux) or Ctrl+Z followed by Enter (Windows) to end the input.
If an error is reported:
- ModuleNotFoundError: No module named 'tiktoken': Run pip install tiktoken to reinstall
- UnicodeEncodeError: It may be a terminal encoding problem. Try setting export PYTHONIOENCODING=utf-8
- Chinese input is displayed as garbled text: Make sure the terminal encoding is UTF-8. Windows users may need to run chcp 65001
8. Hands-on 2: Optimize prompts to reduce Token consumption
After counting Tokens, the next step is saving money. Prompt optimization is the most cost-effective way to do it: the same result for about half the cost.
Scenario example: You need to write a prompt that asks the AI to review your code.
Before optimization (lengthy version):
Please help me review the code below. I will give you a piece of Python code that you need to review from the following aspects:
1. Correctness of the code: Check whether the code works properly and whether there are any syntax errors or logic errors
2. Code security: Check for security vulnerabilities, such as SQL injection, XSS, etc.
3. Code performance: Check for obvious performance issues, such as unnecessary loops, inefficient algorithms, etc.
4. Readability of the code: Check whether the code is easy to understand, whether the variable names are clear, and whether the comments are sufficient
5. Code conventions: Check whether it conforms to the PEP8 specification and whether naming is consistent
Please provide a detailed analysis of each aspect. If a problem is found, please point out the specific line number and problem description.
Finally, please give an overall evaluation, on a scale of 1-10, and explain the reasons.
Here is the code:
def calculate_sum(numbers):
    result = 0
    for i in range(len(numbers)):
        result = result + numbers[i]
    return result
Please review it carefully.
After optimization (simplified version):
Review this Python code for correctness, security, performance, readability, and style. Point out specific problems and give an overall score of 1-10.
Code:
def calculate_sum(numbers):
    result = 0
    for i in range(len(numbers)):
        result = result + numbers[i]
    return result
Comparison results:
| Version | Characters | Tokens | Savings |
|---|---|---|---|
| Before optimization | 482 | 142 | - |
| After optimization | 147 | 52 | 63% |
Summary of optimization techniques:
- Cut the filler: the AI understands concise instructions and does not need courtesies such as "please" and "please review it carefully"
- Merge repeated requirements: the five aspects each had a full sentence of explanation, but a single comma-separated list conveys the same thing
- Drop format instructions you do not need: if the output format is not strict, do not spell it out in detail
- Get straight to the point: "Review this code" is much shorter than "Please help me review the code from the following aspects"
- Use abbreviations and numerals: for example, "1-10" is cheaper than "on a scale from one to ten"
Another practical case: optimizing structured output
Suppose you want the AI to return data in JSON format:
Before optimization:
Please help me generate a user record, including user name, email address, age, and mobile phone number, in JSON format. Please make sure that the email format is correct, that the age is a number, and that the mobile phone number is 11 digits. Please let me know if any of the fields do not meet the requirements.
After optimization:
Generate user JSON: {"name":"","email":"","age":0,"phone":""}. The email must be valid, the age numeric, and the phone 11 digits.
The Token count dropped from 68 to 31, a saving of 54%, with almost the same result.
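The savings percentages quoted in this section are simply (before - after) / before. A small helper makes the check repeatable whenever you revise a prompt; the name `savings_ratio` is an assumption for this sketch, and the Token counts passed in are the ones reported above.

```python
def savings_ratio(tokens_before, tokens_after):
    """Fraction of Tokens saved by an optimized prompt."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return (tokens_before - tokens_after) / tokens_before


print(f"Code review prompt: {savings_ratio(142, 52):.0%} saved")  # 63% saved
print(f"JSON prompt: {savings_ratio(68, 31):.0%} saved")          # 54% saved
```

Tracking this ratio alongside output quality helps catch the point where further trimming starts to hurt the answers.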
9. Key configuration: context window and Token limits
Besides the Token consumption of a single call, there is another important concept: the context window.
What is a context window?
The context window is the upper limit on the total number of Tokens the AI can process at one time. It covers the prompt you enter, all previous messages and AI replies in the conversation, and the reply currently being generated.
For example: the context window of GPT-4o is 128K Tokens, which means that in a single conversation the total of all content, history included, cannot exceed 128,000 Tokens.
Why is this restriction important?
Because the AI does not have unlimited memory. When your conversation grows too long and exceeds the context window:
- Old content is "squeezed out" (the earliest messages are discarded)
- Or the API reports an error and refuses further input
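The "squeezing out" behavior can be reproduced on the client side by dropping the oldest messages until the history fits a Token budget. The sketch below is one possible approach, not how any particular product does it: `trim_history` and `rough_count` are hypothetical names, and `rough_count` is only a stand-in for a real tokenizer.

```python
def rough_count(text):
    """Hypothetical stand-in for a tokenizer: about 1 Token per 4 characters."""
    return max(1, len(text) // 4)


def trim_history(messages, max_tokens, count=rough_count):
    """Keep the newest messages whose combined size fits within max_tokens."""
    kept = []
    used = 0
    # Walk from newest to oldest so recent context survives.
    for message in reversed(messages):
        cost = count(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # Restore chronological order.
    return kept


history = [
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, max_tokens=25)
print(len(trimmed))  # 2 (the oldest message was dropped)
```

In practice you would usually pin the system prompt and trim only the turns after it, and pass a real tokenizer as `count`.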
Comparison of context windows for different models:
| Model | Context window | Approximate capacity |
|---|---|---|
| GPT-4o | 128K | About 100,000 Chinese characters |
| GPT-3.5 Turbo | 16K | About 12,000 Chinese characters |
| Claude 3.5 Sonnet | 200K | About 150,000 Chinese characters |
| Claude 3 Opus | 200K | About 150,000 Chinese characters |
| Kimi | 128K | About 100,000 Chinese characters |
What should I do if I exceed the limit?
Scenario: You want the AI to summarize a 200,000-character article, but the model's context window is only 128,000 Tokens.
Solution: Process it in blocks
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")


def chunk_text(text, max_tokens=3000, overlap=100):
    """
    Divide long text into blocks, each block does not exceed max_tokens
    overlap is the number of overlapping Tokens between blocks, used to maintain context continuity
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        # Decoding a slice can cut a multi-byte character at a block edge;
        # tiktoken then puts a replacement character there
        chunks.append(encoding.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # Last block reached; avoid emitting an overlap-only tail
        # Move the window, stepping back by `overlap` Tokens
        start = end - overlap
    return chunks


# Usage example
long_text = """Here is the content of a very long document...
(Omitted 200,000 words)"""
chunks = chunk_text(long_text, max_tokens=3000, overlap=200)
print(f"Document is divided into {len(chunks)} blocks")

# Block-wise processing
for i, chunk in enumerate(chunks):
    print(f"\nProcessing block {i + 1} ({len(encoding.encode(chunk))} Tokens)...")
    # Call the AI API here to process each block
Practical tips:
- When processing long documents, leave a 20% margin on the number of Tokens per block (for example, if the limit is 3000, actually use 2400)
- Use overlap so that content is not cut off abruptly between blocks
- You can add a one-sentence "previous summary" at the beginning of each block to give the AI the context
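The margin and "previous summary" tips can be combined into a small loop that carries a running summary from block to block. This is a sketch under assumptions: `safe_block_size` and `summarize_long_document` are hypothetical helpers, the prompt wording is illustrative, and `call_model` is a placeholder for whatever function sends a prompt to your AI API and returns its reply.

```python
def safe_block_size(limit, margin_percent=20):
    """Block size after reserving a safety margin, e.g. 3000 -> 2400."""
    return limit * (100 - margin_percent) // 100


def summarize_long_document(chunks, call_model):
    """Process chunks in order, prepending the previous summary to each prompt."""
    summary = ""
    for index, chunk in enumerate(chunks, start=1):
        if summary:
            prompt = f"Previous summary: {summary}\n\nPart {index}:\n{chunk}"
        else:
            prompt = f"Part {index}:\n{chunk}"
        # The reply to each block becomes the context for the next one.
        summary = call_model(prompt)
    return summary


def fake_model(prompt):
    """Stand-in for a real API call; it only reports the prompt length."""
    return f"summary of {len(prompt)} characters"


print(safe_block_size(3000))  # 2400
print(summarize_long_document(["first part", "second part"], fake_model))
```

Keep in mind that the carried summary also consumes Tokens, so it counts against the margin you reserved for each block.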
10. Common problems and troubleshooting
This section lists several pitfalls that are easy to fall into in practice.
Problem 1: The count from 'tiktoken' differs from what the API returns
- Reason: API billing also counts the message structure (role, content and other fields), not just the text content itself
- Solution: To match API billing exactly, read the 'usage' field of the API response (for example 'response.usage.total_tokens') instead of counting yourself
- Code example:
from openai import OpenAI  # Client interface of openai>=1.0

client = OpenAI()  # Reads the OPENAI_API_KEY environment variable
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "your question"}],
)
# Get the exact number of Tokens from the API response
print(f"Input Tokens: {response.usage.prompt_tokens}")
print(f"Output Tokens: {response.usage.completion_tokens}")
print(f"Total Tokens: {response.usage.total_tokens}")
Problem 2: Claude's Token count differs from GPT's
- Reason: Claude uses a different tokenizer (tiktoken does not support Claude's encoding)
- Solution: Use Anthropic's official token-counting tools, or estimate (for English, Claude averages roughly 3-4 characters per Token, the same order of magnitude as GPT's BPE)
- Estimation formula (very rough): Claude Token count ≈ character count × 0.75 (Chinese) or character count / 4 (English)
Problem 3: The prompt is too long and gets rejected or truncated
- Error message: This model's maximum context length is 128000 tokens
- Reason: Prompt + conversation history + expected output exceeds the context limit of the model
- Solution:
  - Shorten the prompt (use the optimization techniques in Section 8 of this article)
  - Trim the conversation history and delete unimportant messages
  - If a single call still exceeds the limit, process the input in blocks (see Section 9)
Problem 4: The API returns a '400 Bad Request' or '413' error
- Reason: The request exceeds the maximum number of Tokens (or the request size) the model allows
- Solution: Check the prompt length and make sure prompt Tokens plus expected output Tokens stay below the context window limit × 0.9 (keeping 10% as a safety margin)
- Debugging code:
# Relies on count_tokens from token_counter.py in Section 7
def check_limit(prompt, model_max_tokens, output_estimate=500):
    """Check whether the limit is exceeded"""
    prompt_tokens = count_tokens(prompt)
    total_needed = prompt_tokens + output_estimate
    budget = int(model_max_tokens * 0.9)  # Keep 10% of the window as margin
    if total_needed > budget:
        print(f"Warning: {total_needed} Tokens are required, the limit may be exceeded")
        print(f"Suggestion: Reduce by approximately {total_needed - budget} Tokens")
    else:
        print(f"OK: Using {total_needed} Tokens, {budget - total_needed} Tokens remaining")
Problem 5: The output is unexpectedly cut off
- Reason: The output reached the model's output limit or the max_tokens value set in the request (not the context limit)
- Solution: Explicitly ask for "concise" or segmented output in the prompt, and check the max_tokens parameter: raise it if it is set lower than the reply you need
11. Advanced directions
After completing the basics, you can continue to explore in these directions:
Direction 1: Build an automated pipeline for Token optimization
Integrate Token counting into the prompt development process so that Token consumption and cost are checked automatically every time a prompt is modified. For example, write a script that raises an alert when the Token count exceeds a threshold.
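As a starting point for such a script, the sketch below scans a set of named prompts and reports the ones over a threshold. Everything here is hypothetical: `find_over_budget`, the `PROMPTS` dictionary, the threshold of 50, and `rough_count`, which merely stands in for a real tokenizer that you would pass as `count`.

```python
def rough_count(text):
    """Hypothetical stand-in for a tokenizer: about 1 Token per 4 characters."""
    return max(1, len(text) // 4)


def find_over_budget(prompts, threshold, count=rough_count):
    """Return {name: token_count} for every prompt that exceeds the threshold."""
    over = {}
    for name, text in prompts.items():
        tokens = count(text)
        if tokens > threshold:
            over[name] = tokens
    return over


# Illustrative prompt library; in a real pipeline these would be loaded from files.
PROMPTS = {
    "review": "Review this Python code for correctness and style.",
    "summary": "Summarize the following document. " * 20,
}

offenders = find_over_budget(PROMPTS, threshold=50)
for name, tokens in offenders.items():
    print(f"ALERT: prompt '{name}' uses about {tokens} Tokens (threshold 50)")
```

Run as part of a pre-commit hook or CI job, a non-empty result could fail the build so that oversized prompts are caught before they reach production.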
Direction 2: Learn the tokenization principles of different models
If you are interested in technical details, you can study tokenization algorithms such as BPE, WordPiece, and SentencePiece. Understanding the underlying principles helps you optimize prompts more precisely.
Direction 3: Explore the latest techniques for long-context optimization
RAG (Retrieval-Augmented Generation), the map-reduce pattern, the reflection pattern, and similar approaches are all advanced techniques for processing long documents. These methods let you process very long text at a lower cost.
Direction 4: Cost monitoring and budget management
In a production environment, it is a good idea to build a monitoring dashboard for Token consumption. Record the cost of each API call and set budget alerts to avoid a runaway bill at the end of the month.
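A dashboard needs data first, so the simplest version of this idea is an in-memory tracker that records every call and reports when spending passes a budget. The class below is a minimal sketch: `CostTracker`, its method names, and the $1.00 budget are assumptions, and a production version would persist the records and read Token counts from the API's usage data.

```python
class CostTracker:
    """Accumulates API spend and flags when a budget is exceeded."""

    def __init__(self, budget):
        self.budget = budget
        self.total = 0.0
        self.calls = []

    def record(self, model, input_tokens, output_tokens, input_price, output_price):
        """Record one call; prices are per 1M Tokens. Returns the call's cost."""
        cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
        self.total += cost
        self.calls.append((model, input_tokens, output_tokens, cost))
        return cost

    def over_budget(self):
        """True once accumulated spend is above the budget."""
        return self.total > self.budget


tracker = CostTracker(budget=1.00)
for _ in range(2):
    tracker.record("gpt-4o", 50_000, 20_000, 5.00, 15.00)
    if tracker.over_budget():
        print(f"Budget alert: ${tracker.total:.2f} spent, budget ${tracker.budget:.2f}")
```

With the prices from Section 6, the second 50,000-Token document pushes the total to $1.10 and triggers the alert.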
The Token may seem like a small concept, but it is the underlying logic of how AI is used and billed. Once you understand it, you can use AI far more efficiently and at a much lower cost. I hope this tutorial is helpful to you.
If there is anything I have not explained clearly, or if you run into other problems in practice, you are welcome to discuss it in the comment area.