How to use AI Engineering From Scratch to improve development efficiency: A guide from getting started to mastering
Do you have this feeling? After reading countless AI tutorials, I was still confused when I actually started working on the project. How to use LangChain, how to build RAG, and how to deploy models, every link is stuck. It's not easy to get through a demo, but I have to start over again in a different scene.
This is not your only problem. The AI field is changing too fast, official document updates cannot keep up with the emergence of new technologies, tutorials are either too shallow or too scattered, and lack systematic practical guidance. The rohitg00/ai-engineering-from-scratch project was born to solve this pain point-it is not as simple as teaching you to tune APIs, but takes you to build complete AI engineering capabilities from scratch.
In the past six months, I have used the project's methodology to reconstruct the team's knowledge base retrieval system, and the response time has dropped from an average of 3 seconds to 800 milliseconds. Below, I will share with you the pits you have stepped on and the experience you have summarized.
1. Why don't you need another tutorial?
Most AI tutorials on the market go to two extremes: either they are nanny-level introductory guides that teach you how to use ChatGPT to write copywriting; or they are piled up academic papers that explain the 47 variants of RAG well, and you still don't know how to choose them when you really run.
The core positioning of this project is different. It assumes that you already have basic programming capabilities and wants to solve the problem of "how to engineer AI capabilities." Specifically, it covers these key aspects:
- Prompt Engineering's systematic approach: not fragmented techniques, but a complete workflow from evaluation to iteration
- Practical guide to RAG architecture: How to choose Embedding model, what to use for vectorized databases, and how to determine the size of Chunk
- LLM application development paradigm: From single round dialogue to multiple rounds of interactions, from simple calls to complex agents
- Production environment deployment points: API design, caching strategy, cost control
The project adopts a "learning-by-doing" model, and each knowledge point is accompanied by a runnable code example. You don't need to install a bunch of dependencies locally-it recommends using AI programming tools like Cursor or Windsurf directly to change code while looking at examples, which is much more efficient than traditional learning methods.
2. Start quickly: Run through the first AI application in 30 minutes
Whether you want to quickly verify ideas or learn systematically, the first step is to get the project running. Here is the simplest path for friends who want to get started as soon as possible.
2.1 Environmental preparation
Build the basic environment first. It is recommended to use Python version 3.10 or above and use conda or venv to create a separate environment to avoid dependency conflicts.
# Create and activate a virtual environment
conda create -n ai-engineering python=3.11
conda activate ai-engineering
# Install core dependencies
pip install openai anthropic langchain langchain-community tiktoken
If your network environment is unstable in accessing overseas APIs, it is recommended to configure an agent or use domestic model services. The project's sample code encapsulates the API calls. Changing the base model only requires changing the configuration without using the business logic.
2.2 Clone the project and run the example
git clone https://github.com/rohitg00/ai-engineering-from-scratch.git
cd ai-engineering-from-scratch
The project structure is organized by modules, and it is recommended to start with 01-prompt-engineering and move forward in turn. Each module has README.md description background and pre-knowledge, followed by the runnable code in the examples/ directory.
Take the first example:
# examples/basic_completion.py
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a technical documentation assistant, explaining complex concepts in concise language"},
{"role": "user", "content": "explain what RAG is"}
]
)
print(response.choices[0].message.content)
Before running this script, make sure you have API keys in your environment variables:
export OPENAI_API_KEY="sk-xxxxx"
python examples/basic_completion.py
Seeing the output results means that there is no problem with the basic environment. The first milestone was achieved.
2.3 Accelerate learning with AI programming tools
The project documentation specifically recommends AI programming tools such as Cursor and Windsurf. When I actually use Cursor, the advantage of Cursor is that its Composer mode can write code while looking at the document. When encountering functions that I don't understand, I can directly ask AI, which is much faster than checking the document.
The specific operation is: use Cursor to open the project directory, open Composer, open README.md in the left panel, and write code on the right. When encountering an error, you can directly select the code and ask "Why does this report an error and how to change it?" The advice given by AI is usually more accurate than Stack Overflow.
The advantages of this combination are that the project provides a structured knowledge body and code examples of real-life scenarios, and AI tools help you quickly understand and modify. The combination of the two is 2-3 times more efficient than reading documents alone.
3. In-depth understanding: core components and selection of RAG system
Quick running through examples is just the beginning. What can really improve development efficiency is mastering the construction method of the RAG (Retrieval Enhanced Generation) system. This is the core scenario of current AI application development and the most detailed part of this project.
3.1 Three-level understanding of RAG architecture
Many people think that RAG means cutting everything a document, storing it in a vector library, and retrieving it and spelling it to LLM. This is only one-third correct.
The first layer is the data processing layer. How to cut a document, how big it is, how to clean it, and whether to identify the document structure directly affect the quality of retrieval. Common Chunk strategies are by paragraph, by sentence, and by fixed length, each with its own advantages and disadvantages. The rule of thumb is: If your document has a clear chapter structure, cut by chapter works best; if it is a loose Q & A set, cut by sentences or small paragraphs is more appropriate.
The second layer is the search layer. Which Embedding model to choose, how many vectorization dimensions to set, what to use for similarity measurement, and whether Hybrid Search (keyword + vector mixture) should be added determine whether relevant content can be accurately found. The project compared the actual effects of mainstream solutions such as OpenAI Embeddings, Cohere, and BGE, and measured that BGE is very cost-effective in Chinese scenarios.
The third layer is the generative layer. How to organize the retrieved content, how to build Prompt, whether the context window is enough, and whether secondary search is needed determine the quality of the final answer. In particular, it is important to note that the more search results the better. Too much related content will dilute key information.
3.2 Empirical values for key configuration items
Based on project documentation and actual testing, the following configurations are validated and effective combinations:
| scene | Chunk Size | Embedding model | number of searches | generative model |
|---|---|---|---|---|
| Short Document Questions and Answers | 500-800 token | text-embedding-3-small | 5-8 strip | gpt-4o-mini |
| Long Document Summary | 1000-1500 token | BGE-large-zh | 3-5 strip | gpt-4o |
| Cross-retrieval of multiple documents | 300-500 token | text-embedding-3-large | Articles 10-15 | gpt-4o |
These are not absolute values and need to be adjusted based on actual data. The project's suggestion is to run according to this benchmark configuration first, and then use eval scripts to evaluate the effect and iteratively optimize it.
3.3 Search quality evaluation method
The project provides a simple RAG evaluation framework, and the core indicators are:
- Context Precision: How much of the content retrieved is truly relevant
- Answer Faithfulness: Is the generated answer nonsense or fabricated out of thin air?
- Answer Relevance: Are the answers and questions highly relevant?
# examples/rag_evaluation.py
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
)
# Prepare test set
test_set = [
{
"question": "What is the license for the project? ",
"answer": "MIT License",
"contexts": ["This project is licensed under the MIT License. "],
"ground_truth": "MIT License"
}
]
# run-in assessment
result = evaluate(test_set, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
First use small samples to run through the process, establish evaluation benchmarks, and then gradually expand the test set. Once this habit is developed, subsequent optimizations will have anchors and parameters will not be adjusted based on your feelings.
4. Scenario solutions: best practices for different needs
For the same technical solution, the optimal implementation in different scenarios may be completely different. The following are classified into categories according to common scenarios.
4.1 Knowledge base question and answer system
This is the most typical application scenario of RAG. When users ask questions, the system retrieves relevant content from the document library and generates answers.
The core challenge is how to deal with complex issues that cross documents. For example, the user asked,"Compare the technical architecture differences between Scheme A and Scheme B." This requires extracting information from multiple documents and integrating them.
The solution is Multi-hop Retrieval. The first search finds relevant documents, analyzes key entities, uses these entities to do a second round of search, and finally combines all the results to answer.
# examples/multi_hop_rag.py
def multi_hop_query(query: str, vector_store, llm):
#First jump: Get initial related documents
initial_docs = vector_store.similarity_search(query, k=5)
#Analyze key entities
entity_prompt = f"Extract key entities that require further query from the following questions: {query}"
entities = llm.invoke(entity_prompt)
#Second jump: Extended search based on entities
expanded_docs = []
for entity in entities:
docs = vector_store.similarity_search(entity, k=3)
expanded_docs.extend(docs)
#Merge and remove duplication to generate answers
combined_context = "\n".join(set([doc.page_content for doc in expanded_docs]))
return llm.invoke(f"Answer the question based on the following content: {combined_context}\n\nQuestion: {query}")
4.2 customer service robot
The characteristics of customer service scenarios are: relatively fixed problem types, high response delay requirements, and multiple rounds of conversations need to be supported.
The core optimization points of this scenario are caching and intention recognition. First, use a lightweight model to classify intentions and route problems to different processing processes; high-frequency problems are directly cached without adjusting LLM; only the cache misses are followed by the RAG process.
# examples/customer_service_router.py
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_response(question_hash, question):
"""Cache answers to frequently asked questions"""
return generate_rag_response(question)
def handle_customer_query(question: str):
intent = classify_intent(question) #Use lightweight models to identify intent
if intent == "faq":
return cached_response(hash(question), question)
elif intent == "order_inquiry":
return handle_order(question)
elif intent == "technical_support":
return generate_rag_response(question)
else:
return escalate_to_human(question)
Measured that this architecture can reduce P99 latency from 2 seconds to 300 milliseconds, while reducing API call costs by more than 60%.
4.3 Code Assistant
The particularity of code assistants lies in that they need to understand the code structure, handle long contexts, and have precise output formats.
This scenario recommends using models that support long contexts (such as Claude 3.5 Sonnet), coupled with code-specific Embedding models (such as StarCoder). During retrieval, AST (Abstract Grammar Tree) can be used for structured retrieval, not just text similarity.
There is an example in the project that demonstrates how to use Tree-sitter to parse a code base, build indexes, and search based on code structure. This method is much more accurate than plain text retrieval and is especially suitable for questions such as "where is this function called?"
5. Practical cases: complete realization of two typical scenarios
Theory alone is not enough. Here are two cases that I actually solved using this project methodology.
5.1 Case 1: Rebuilding the internal knowledge base retrieval system
Background: The company's technical documents are scattered in Confluence, Notion and local Markdown warehouses, and employees often have to go through several places to find information. The original plan was to directly throw all documents to GPT-4 for Q & A, which was costly and the quality of the answers was unstable.
Implementation steps:
-
Data integration: Use the document processing module in the project to uniformly convert documents from different sources into standard formats. Confluence exports HTML, Notion exports Markdown, directly reads local files, and then uses the 'unstructured' library for cleaning and blocking.
-
Vectorized storage: Qdrant was selected as the vector database and deployed on the local server. Embedding uses BGE-large-zh-v1.5, sets the Chunk size to 800 tokens, and overlaps 100 tokens to maintain context coherence.
-
Evaluation iteration: Prepare 50 frequently asked questions as a test set and run RAGAS evaluation. The initial score was Context Precision 0.72 and Faithfulness 0.68. After adjusting the size of the Chunk and the number of searches, it was optimized to 0.89 and 0.85.
-
On-line optimization: A caching layer has been added to cache high-frequency issues; a Re-ranking module has been added to refine the search results again using BAAI/bge-reranker-large.
Final results: Answer accuracy increased from 65% to 89%, P99 delay reduced from 4 seconds to 1.2 seconds, and API costs reduced by 70%. Employee feedback that "I can finally quickly find the technical documents I want."
5.2 Case 2: E-commerce intelligent customer service robot
Background: Customer service on e-commerce platforms handles a large number of duplicate issues every day. Refund progress inquiries, order status confirmation, and size recommendations account for 70% of the inquiries. Manual processing costs are high, responses are slow, and user experience is poor.
Implementation steps:
-
Problem classification: First perform cluster analysis on historical work orders to identify high-frequency problem types. Using text classification examples of projects, a lightweight intention classification model was trained with an accuracy rate of 94%.
-
FAQ knowledge base: Organize high-frequency questions with standard answers in historical work orders into a FAQ library, and build indexes with the project's vectorization scheme. The FAQ matching rate is about 60%, and these are directly cached without adjusting LLM.
-
RAG for complex problems: For questions that miss the FAQ, go through the RAG process. Search order system documents, customer service manuals and other information to generate answers. For scenarios that need to query real-time data (such as refund progress), call the backend API to obtain it before generating it.
-
Manual takeover mechanism: Problems with confidence below the threshold will be automatically transferred to manual. The system will record the reason for switching to labor for continuous optimization.
Final effect: The automatic reply rate has increased from 30% to 75%, the average response time has dropped from 2 minutes to 8 seconds, and user satisfaction has increased from 3.2 stars to 4.1 stars. The customer service team has reduced the number of work orders processed from 500 to 120 per day, allowing it to do more valuable work.
6. Effect data: true comparison before and after optimization
It is more convincing to speak with data. The following are the results I measured in different scenarios:
| indicators | before optimization | optimized | increase rate |
|---|---|---|---|
| RAG answer accuracy | 65% | 89% | +37% |
| P99 response delay | 4s | 1.2s | -70% |
| API call cost/month | $1200 | $360 | -70% |
| Intent recognition accuracy | 78% | 94% | +21% |
| cache hit rate | 0% | 62% | - |
These data come from the actual production environment and will vary in different business scenarios. But the overall trend is certain: a systematic RAG solution + reasonable caching strategy + continuous evaluation iteration can bring significant comprehensive improvements.
Data from other developers is also shared in the project documentation. A team used the same method to reduce the illusion rate of code review robots from 15% to below 3%. The key is to add a Fact Verification link to the generation layer.
7. Guide to avoiding pits: The five pits that are easiest for novices to step on
Based on project experience and actual experience in walking pits, summarize these areas that are prone to problems.
7.1 Pit 1: Embedding model and generation model do not match
Many people want to save trouble. Embedding uses OpenAI's text-embedding-3-small and Claude is used for generation. In theory, there is no problem, but the actual search effect will be compromised.
The reason is simple: different models have different training data and different understandings of "similarity". The two paragraphs Embedding finds similar, but the generative model may find completely incompatible.
Method to avoid pits: Try to use the same model for Embedding and generative models. For example, both use OpenAI, or both use Anthropic, or use local open source models (such as Nomic + Llama). If you must mix it, do at least one round of tuning and use actual data to test the matching effect.
7.2 Pit 2: Chunk's size is determined by patting his head
How big is the appropriate size for Chunk? Many people randomly set up 500 tokens, but the effect is almost ignored when running. In fact, the size of the Chunk has a great impact on the quality of retrieval, and needs to be adjusted based on the document structure and problem type.
Method to avoid pits: First use the eval script in the project to run a batch of tests and record Precision and Recall indicators of different Chunk sizes. Documents with clear structures are cut according to structure, and documents with loose structures are cut according to fixed lengths to find the optimal value.
7.3 Pit 3: The more search results, the better
I always feel that more content insurance will be recalled, and the number of searches will increase from 5 to 20. As a result, the context window was filled with irrelevant content, and the generated answers were even worse.
Method to avoid traps: The more search results the better. 5-8 pieces of experience are enough, too much noise is too loud. Use the Re-ranking module to do fine layout and put the truly relevant ones first. The quantity can be small but the quality should be high.
7.4 Pit 4: Go online without evaluation
Running through the demo feels that the effect is good, so go online directly. As a result, when the user asked specific questions, the answers were full of loopholes, but it was too late to discover them.
Method of avoiding pits: An evaluation mechanism must be established. Prepare at least 20-50 test questions with standard answers, run them for evaluation before going online, and conduct random inspections every week after going online. Assessments are not just one-time, they must be done continuously.
7.5 Pit 5: Neglecting cost control
As the number of API calls in the production environment increases, costs increase rapidly. I waited until I saw the bill at the end of the month before regretting that I didn't cache or limit current.
Avoidance methods: Design cost control mechanisms from the beginning. Cache layers must be added, high-frequency requests are current limited, and model selection is allocated according to scenarios (small models are used for simple problems and large models are used for complex problems). It is recommended to set monitoring alarms for API calls and automatically alarm if the threshold is exceeded.
8. Advanced optimization skills: from usability to usability
After the basic plan is completed, there are also these optimization methods that can make the system qualitative change.
8.1 Query rewriting and extension
Users have various questions expressed, and direct search may not match the expressions in the document. Query rewriting is to first transform user questions into a form more suitable for retrieval.
# examples/query_rewriting.py
def rewrite_query(query: str, llm) -> str:
"""Rewrite user questions into a form more suitable for retrieval"""
prompt = f"""Reforms the following user question into a more precise search query.
Requirements:
1. Retain core information and remove colloquial expressions
2. You can add synonyms or related concepts
3. The output contains only rewritten queries and does not interpret them
User question: {query}
After rewriting: """
rewritten = llm.invoke(prompt)
return rewritten.strip()
This technique is particularly effective for long-tail problems and can cover situations that would otherwise not be matched.
8.2 Hybrid search strategy
Pure vector retrieval sometimes misses questions with clear keywords. Hybrid search is to use keyword search and vector search at the same time, and then merge the results.
# examples/hybrid_search.py
def hybrid_search(query: str, vector_store, bm25_index, top_k=5):
#Vector Search
vector_results = vector_store.similarity_search(query, k=top_k*2)
#BM25 keyword search
bm25_results = bm25_index.search(query, k=top_k*2)
#Merge results (RRF algorithm)
combined = {}
for rank, doc in enumerate(vector_results):
combined[doc.id] = combined.get(doc.id, 0) + 1 / (60 + rank)
for rank, doc in enumerate(bm25_results):
combined[doc.id] = combined.get(doc.id, 0) + 1 / (60 + rank)
#Returns the merged top_k result
sorted_ids = sorted(combined.keys(), key=lambda x: combined[x], reverse=True)
return [get_doc_by_id(id) for id in sorted_ids[:top_k]]
Measured hybrid retrieval can improve Recall by 10-15%, especially for scenarios where there are a large number of technical terms in the technical documentation.
8.3 Streaming output optimization experience
Non-streaming output LLM responses will not be returned until the model generates a complete answer, resulting in a poor user waiting experience. After changing to streaming output, the first token can start to return, and the latency perceived by users can be reduced by 60%.
# examples/streaming_response.py
def stream_response(query: str, vector_store, llm):
docs = vector_store.similarity_search(query, k=5)
context = "\n".join([doc.page_content for doc in docs])
prompt = f"Answer based on the following content: \n{context}\n\nQuestion: {query}"
#Streaming output
stream = llm.stream(prompt)
for chunk in stream:
yield chunk.content
The front-end receives SSE (Server-Sent Events) to achieve the typewriter effect.
8.4 automatic model selection
Different problems have different complexities, and using GPT-4 all costs and delays are high. Automatic model selection is to dynamically select models based on problem complexity.
Use small models (GPT-4o-mini) for simple problems, medium models (GPT-4o) for medium complexity, and large models (Claude 3.5 Opus) for complex problems. Train a classifier to judge problem complexity, or use the number of tokens as a proxy indicator.
# examples/auto_model_selection.py
def select_model(query: str) -> str:
token_count = count_tokens(query)
if token_count < 50 and is_simple_structure(query):
return "gpt-4o-mini" #Simple Question
elif token_count < 200:
return "gpt-4o" #Medium complexity
else:
return "claude-3-5-sonnet-20240620" #Complex problem
This strategy saves 40-60% on API costs while maintaining overall response quality.
8.5 Continuous learning and feedback closed loop
After the system is launched, it is not the end point, but the starting point. A feedback mechanism needs to be established and continuous optimization.
Collect user feedback (likes/clicks, questions, reports), regularly use feedback data to re-evaluate, identify system shortcomings, update the knowledge base or adjust strategies. This is a continuous iterative process.
9. Daily maintenance recommendations: Keep the system healthy for a long time
The launch of the system is just the beginning, and daily maintenance is equally important.
9.1 Regularly update the knowledge base
The effectiveness of AI applications depends largely on the quality of the knowledge base. When documents are updated, they must synchronize them in time, delete outdated content, and add new content. It is recommended to set up version management of the knowledge base and record a change log for each update.
9.2 Monitoring core indicators
Indicators that must be monitored include: answer accuracy, response delay, API call volume, cache hit rate, and user satisfaction. Set threshold alarms to detect abnormalities as soon as possible.
9.3 Periodic evaluation iteration
It is recommended to collect a batch of users 'real problems every week for evaluation, and conduct a comprehensive evaluation every quarter. Adjust parameters such as Chunk size, retrieval number, and model selection based on the evaluation results.
9.4 Cost optimization check
Check the API call bill once a month and analyze the call distribution. If a certain type of problem has a high proportion but a high cost, priority should be given to optimizing this part.
9.5 Focus on model updates
Large model manufacturers continue to update model versions, and new versions often have better results or lower prices. Pay attention to the update log, test new versions in time, and switch in time.
X. Summary
AI Engineering From Scratch This project provides a systematic learning path from theory to practice. Its core value is not a specific code, but to help you build an AI engineering thinking way-from evaluation-driven development, to incremental optimization, to production environment considerations.
Keep these key points in mind:
- Run through first and then optimize: Don't pursue perfection at the beginning, use basic solutions to verify feasibility first
- Evaluation-driven iteration: Optimization without evaluation is blind, establish benchmarks and continue to measure
- Cache priority: Those that can be cached will not adjust LLM, and those that can use small models will not use large models
- Monitoring alarms: Going online is just the beginning, and problems can only be discovered by continuous monitoring
AI application development is not a one-time thing, but a process of continuous optimization. By using the methodology of this project well, you have made fewer detours than most people.
