Creating LLM Agents for RAG: A Detailed Guide to Architecture, Development, and Application

Contents

  • What is RAG and why has it become a key element of modern AI systems?

  • Understanding LLM Agents

  • Agent components: brain, memory, planning

  • Tools for developing LLM and RAG agents

  • Configuring the environment for creating a RAG agent

  • Preparation of documents

  • RAG architecture: a scheme for practical implementation

  • Building a search quality control chain

  • Querying the system and working with the context

  • Creating an LLM Agent

  • Agent architecture: logic, interaction, tools

  • Improving the RAG agent using advanced techniques

  • Multi-agent systems: relevance and quality assessment

  • Future directions and key challenges

  • Technical appendix: an example RAG agent implementation in Python

What is RAG and why has it become a key element of modern AI systems?

Retrieval-Augmented Generation (RAG) lets a language model access external data and generate responses based not only on its own parameters but also on documents, internal knowledge bases, and regulations. This approach reduces the likelihood of errors and produces more reliable output, especially in corporate environments where accuracy and control requirements are high.

RAG has become the main tool for companies working with large volumes of documents: it speeds up information retrieval, automates analytical tasks, and improves the quality of customer support. In a number of practical deployments, RAG has cut query processing time by tens of percent and stabilized response quality.

Understanding LLM Agents

LLM agents extend the capabilities of plain RAG. Where RAG simply answers a question, an agent can plan its work, execute multiple steps, use tools, check the quality of results, and adjust its strategy.

In legal and technical review, such systems automate document analysis, flag inconsistencies, and produce recommendations. This reduces the load on staff and increases the accuracy of information processing.

Agent components: brain, memory, planning

Agent (brain)

Manages the problem-solving process: interprets the request, selects tools, forms a strategy, and evaluates the result.

Memory

The memory system includes short-term context, working memory, and long-term vector storage. In the RAG architecture, vector memory provides access to corporate documents and knowledge.

Planning

The planner determines the sequence of actions, monitors step execution, and verifies task completion. This keeps the agent's behavior stable and predictable.

Tools for developing LLM and RAG agents

The following categories of tools are used to build RAG agents:

  • orchestration frameworks such as LangChain and LlamaIndex;

  • LLM providers and their APIs (the appendix uses OpenAI models);

  • embedding models for vectorizing documents and queries;

  • vector databases such as Chroma for storing and searching embeddings.

Together they provide development flexibility: you can add tools, build chains of reasoning, and implement complex search scenarios.

Configuring the environment for creating a RAG agent

A typical stack consists of Python, LangChain or LlamaIndex, and a vector database server. Production deployments also account for security requirements, logging, and horizontal scaling.

Preparation of documents

Document preparation includes file parsing, cleaning, chunking, and vectorization. A chunk size in the range of 500-1500 tokens preserves meaning and improves search accuracy.
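As a minimal sketch of this stage (assuming LangChain's RecursiveCharacterTextSplitter and reducing cleaning to whitespace normalization, which a real pipeline would extend):

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def prepare_documents(raw_texts):
    # Basic cleaning: collapse whitespace; real pipelines also strip markup, etc.
    cleaned = [" ".join(t.split()) for t in raw_texts]

    # chunk_size is in characters here, a common approximation of the
    # 500-1500 token range recommended above
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    docs = []
    for i, text in enumerate(cleaned):
        for chunk in splitter.split_text(text):
            docs.append(Document(page_content=chunk, metadata={"source_id": i}))
    return docs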

RAG architecture: a scheme for practical implementation

To visualize the RAG architecture, we recommend a diagram that includes:

  • document ingestion, parsing, and cleaning;

  • chunking and vectorization of the fragments;

  • loading the embeddings into a vector store;

  • the user query and retrieval of relevant fragments;

  • context assembly and prompt construction;

  • response generation by the LLM.

This diagram shows the complete data path from source document to final response.

Building a search quality control chain

To improve search quality, a combination of steps is used:

  • vector similarity search over the embedded document base;

  • relevance assessment of the retrieved fragments;

  • repeat retrieval with a refined query when the context is insufficient.

This approach significantly improves accuracy and reduces the chance of retrieving irrelevant fragments.
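A minimal sketch of such a chain, assuming the Chroma store built in the appendix; similarity_search_with_relevance_scores returns (document, score) pairs with scores normalized to [0, 1], and the 0.75 cutoff is an illustrative assumption:

def retrieve_with_threshold(vector_store, query, k=8, min_score=0.75):
    # Fetch more candidates than needed, then keep only strong matches
    results = vector_store.similarity_search_with_relevance_scores(query, k=k)
    return [doc for doc, score in results if score >= min_score]

An empty result signals the agent that the query should be refined and the search repeated.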

Querying the system and working with the context

The prompt template combines the instruction, the response format, and the context passed in from retrieval. This keeps the model's behavior stable and prevents fabricated conclusions when information is missing.
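A minimal template along these lines (the exact wording is an assumption; the appendix shows a fuller version):

PROMPT_TEMPLATE = """You are an assistant that answers strictly from the supplied context.

Question: {question}

Context:
{context}

If the context is insufficient, reply exactly: "Not enough information."
Answer:"""

It can be filled in with str.format(question=..., context=...) before being passed to the model.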

Creating an LLM Agent

The agent's work proceeds in several stages; a minimal control-loop sketch follows the list:

  • Analyzing the query, choosing a strategy, and deciding whether a search is needed.

  • Retrieving documents from vector memory and assessing whether they are sufficient.

  • Drafting a response and performing an initial correctness check.

  • If necessary, refining the query, re-searching, and improving the response.
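A minimal control loop over these stages, reusing agent_answer and llm from the appendix; the 0.7 acceptance threshold, the two-iteration limit, and the rephrasing prompt are illustrative assumptions:

def agent_loop(vector_store, user_query, max_iterations=2):
    # Analyze -> retrieve -> draft -> verify -> refine, as listed above
    query = user_query
    answer = ""
    for _ in range(max_iterations):
        # agent_answer retrieves context and drafts a reply with a confidence score
        answer, confidence = agent_answer(vector_store, query)
        if confidence >= 0.7:
            break
        # Low confidence: refine the query and search again
        query = llm.invoke(
            f"Rephrase this search query to be more specific: {query}"
        ).content
    return answer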

Two useful techniques:

  1. Automatic query expansion (covered in its own section below)

  2. Task decomposition (see the sketch after this list)
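A sketch of task decomposition, reusing the llm object from the appendix; the prompt wording and the line-based parsing of sub-questions are assumptions:

def decompose_task(question):
    # Ask the model to break a complex question into simpler sub-questions
    prompt = (
        "Break the following question into 2-4 simpler sub-questions, "
        f"one per line, without numbering:\n{question}"
    )
    lines = llm.invoke(prompt).content.splitlines()
    return [line.strip() for line in lines if line.strip()]

Each sub-question can then be answered separately and the partial answers merged into a final response.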

Agent architecture: logic, interaction, tools

It is recommended to add a diagram to the article illustrating:

  • the user query entering the agent core (the brain);

  • query analysis and strategy selection;

  • calls to tools, including vector search;

  • evaluation of the retrieved context and its sufficiency;

  • draft answer generation and quality checks;

  • the refinement loop leading to the final response.

The diagram helps the reader see how search, reasoning, and final response generation are related.

Improving the RAG agent using advanced techniques

Semantic search with DPR

Using separate encoders for queries and documents improves the quality of matching.
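A sketch using Hugging Face's pretrained DPR (Dense Passage Retrieval) checkpoints; note this is an alternative stack, not the appendix code, which uses OpenAI embeddings:

from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Separate encoders: one trained on questions, one on passages
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

q_emb = q_enc(**q_tok("How is the annual inspection carried out?", return_tensors="pt")).pooler_output
p_emb = ctx_enc(**ctx_tok("Equipment inspection procedure...", return_tensors="pt")).pooler_output

# Dot-product similarity between question and passage embeddings
score = (q_emb @ p_emb.T).item()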

Query expansion

Lets the agent refine the query wording on its own and increase the completeness of the search results.
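A minimal expansion step, reusing the llm object from the appendix; the prompt wording is an assumption:

def expand_query(query):
    # Generate paraphrases so retrieval covers more phrasings of the same need
    prompt = (
        "Suggest 3 alternative phrasings of this search query, one per line:\n"
        f"{query}"
    )
    variants = llm.invoke(prompt).content.splitlines()
    return [query] + [v.strip() for v in variants if v.strip()]

Running retrieval for each variant and merging the results increases recall at the cost of extra searches.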

Iterative refinement

Combines a draft response, a quality check, and revision to reach a more accurate result.
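A single critique-and-revise pass might look like this sketch (the prompt wording is an assumption; production systems typically loop until a quality bar is met):

def refine_answer(question, context, draft):
    # Step 1: ask the model to critique the draft against the context
    critique = llm.invoke(
        "Check this draft answer against the context and list any errors or gaps.\n"
        f"Context:\n{context}\nQuestion: {question}\nDraft: {draft}"
    ).content
    # Step 2: revise the draft using the critique
    return llm.invoke(
        f"Revise the draft to fix the issues below.\nDraft: {draft}\nIssues: {critique}"
    ).content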

Multi-agent systems: relevance and quality assessment

In complex scenarios, several agents are used: one evaluates document relevance, another checks the logic of the response, and a third verifies facts against the sources.

The diagram of such a solution includes an Orchestrator (coordinator) and three modules: Search Agent, Answer Agent, Review Agent.
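A minimal orchestrator over the three modules, where each agent is reduced to an LLM call with its own role prompt (a simplifying assumption; in production each agent would have its own tools), reusing retrieve_relevant_docs, format_context, and llm from the appendix:

def orchestrate(question, vector_store):
    # Search Agent: retrieve candidate fragments
    docs = retrieve_relevant_docs(vector_store, question)
    context = format_context(docs)

    # Answer Agent: draft a response strictly from the retrieved context
    draft = llm.invoke(
        f"Answer using only this context.\nContext:\n{context}\nQuestion: {question}"
    ).content

    # Review Agent: verify the draft against the sources before release
    verdict = llm.invoke(
        "Does this answer follow from the context? Reply OK or list the problems.\n"
        f"Context:\n{context}\nAnswer: {draft}"
    ).content

    if verdict.strip().startswith("OK"):
        return draft
    # Send the draft back for one revision pass
    return llm.invoke(
        f"Revise the answer to address these problems: {verdict}\nAnswer: {draft}"
    ).content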

Future directions and key challenges

The development of RAG agents is moving in three directions:

  • increasing search depth and hybrid matching methods;

  • development of self-learning and autonomous memory updating;

  • integration of models capable of complex multi-step reasoning.

The main challenge remains the same: ensuring the quality, explainability, and controllability of AI systems in an environment with high reliability requirements.

Technical appendix: an example RAG agent implementation in Python

This example demonstrates the basic framework of a RAG agent, which includes:

  • initialization of the LLM and the embedding model;

  • building a persistent vector store from chunked documents;

  • retrieval of relevant fragments for a user query;

  • prompt assembly with context and a self-reported confidence estimate;

  • an entry point demonstrating the full cycle on sample documents.

It can serve as a starting point for building a complete system.

from typing import List, Tuple

from dotenv import load_dotenv

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Expects OPENAI_API_KEY in a local .env file
load_dotenv()

# Generator model: low temperature for stable, grounded answers
llm = ChatOpenAI(
    model="gpt-4.1-mini",
    temperature=0.1,
)

# Embedding model used both for indexing and for querying
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

def build_vector_store(texts: List[str], persist_dir: str = "./rag_index") -> Chroma:
    """Chunk the raw texts and index them in a persistent Chroma collection."""
    # Overlapping chunks preserve context across fragment boundaries
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )

    docs: List[Document] = []
    for i, t in enumerate(texts):
        for chunk in splitter.split_text(t):
            docs.append(Document(page_content=chunk, metadata={"source_id": i}))

    # With persist_directory set, recent chromadb versions persist automatically
    return Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        persist_directory=persist_dir,
    )

def retrieve_relevant_docs(vector_store: Chroma, query: str, k: int = 4) -> List[Document]:
    """Return the k fragments most similar to the query."""
    return vector_store.similarity_search(query, k=k)

SYSTEM_PROMPT = """
You are an expert assistant who answers strictly on the basis of the provided documents.
If the information is insufficient, say so honestly.
"""

def format_context(docs: List[Document]) -> str:
    """Number the retrieved fragments so the model can reference them."""
    parts = []
    for idx, d in enumerate(docs, start=1):
        parts.append(f"[Fragment {idx}]\n{d.page_content}\n")
    return "\n".join(parts)

def agent_answer(vector_store: Chroma, user_query: str) -> Tuple[str, float]:
    """Retrieve context, generate an answer, and parse a self-reported confidence."""
    docs = retrieve_relevant_docs(vector_store, user_query, k=4)
    context_text = format_context(docs)

    prompt = f"""
{SYSTEM_PROMPT}

User question: {user_query}

Context:
{context_text}

First give the answer, then estimate your confidence (0-1).

Format:
Answer: <text>
Confidence: <number>
"""

    resp = llm.invoke(prompt)
    text = resp.content

    # Fall back to 0.5 if the model ignores the requested format
    confidence = 0.5
    for line in text.splitlines():
        if "Confidence:" in line:
            try:
                confidence = float(line.split("Confidence:")[1].strip())
            except ValueError:
                pass

    return text, confidence

if __name__ == "__main__":
    corporate_docs = [
        "Risk assessment regulations...",
        "Equipment inspection procedure...",
        "Incident registration instructions..."
    ]

    vs = build_vector_store(corporate_docs)

    query = "How is the annual technical inspection carried out?"
    answer, conf = agent_answer(vs, query)

    print(answer)
    print("Confidence:", conf)