
ML 101 - Data Grounding and RAG


Data Grounding

Prompt: "What is the battery capacity of the latest Tesla Model 3?"

  • Without Data Grounding: A model trained only up to 2023 might say:
    "The Tesla Model 3 has a battery capacity of up to 75 kWh."
    This could be outdated if the car has been updated since the model's training cutoff.
  • With Data Grounding Using RAG: The model checks Tesla's latest data:
    "The 2025 Tesla Model 3 has a battery capacity of up to 82 kWh."

Imagine an elderly librarian (our LLM) who is incredibly knowledgeable but hasn't left the library in years. They know an enormous amount, but their knowledge is frozen in time. Data grounding is like giving them a smartphone with internet access: now, instead of relying solely on memory, they can provide up-to-date information by checking reliable external sources.

Ways to ground a model:

  • RAG: Combining retrieval with generation for real-time accuracy.
  • Prompt Engineering: Crafting prompts that embed necessary context or examples (see the sketch after this list).
  • Fine-Tuning: Updating the model's weights with domain-specific data.
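
To make the prompt-engineering route concrete, here is a minimal sketch of grounding through the prompt alone; the facts, question, and template below are invented for illustration:

```python
# A minimal sketch of prompt-based grounding: fresh, trusted facts are
# pasted into the prompt so the model answers from them instead of its
# frozen training data. All strings here are made up for illustration.

facts = "- The 2025 Tesla Model 3 has a battery capacity of up to 82 kWh."
question = "What is the battery capacity of the latest Tesla Model 3?"

prompt = f"""Answer the question using ONLY the facts below.
If the facts are insufficient, say you don't know.

Facts:
{facts}

Question: {question}"""

# `prompt` is then sent to whatever LLM completion API you use.
print(prompt)
```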

Retrieval-Augmented Generation (RAG)

Imagine you're searching for a specific book in a huge library. You know the library has all sorts of information you need, but flipping through every single page or shelf would be time-consuming. Instead, you use catalogs and indexes to find the relevant books quickly. Once you have those books in hand, you summarize or combine their content to answer your question.

That, in a nutshell, is RAG:

  1. Retrieve the relevant data (like pulling the right books off the shelf).
  2. Generate an answer (like creating a summary from those books).

Why RAG?

LLMs are powerful, but they come with major constraints:

  1. Limited knowledge window: they know nothing beyond their last training data.
  2. Bias inherited from their training data.
  3. Hallucinations: because they generate text by predicting the most probable next word, they sometimes produce confident but false statements.

RAG addresses these limitations by letting your LLM "look up" new or private data (like going to that library) without retraining the entire model from scratch. This is especially important for enterprises that have a lot of proprietary information.

RAG Architecture

RAG typically involves two main steps:

  1. Retrieval: This is where you "search the library." A retrieval component (often a semantic search engine) scans the external data sources to find relevant chunks of text (see the sketch after this list).
  2. Generation: After the relevant texts are retrieved, the LLM "reads" them and generates the final answer, ideally grounded in those retrieved facts.
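
To make the retrieval step concrete, here is a minimal sketch of semantic search via cosine similarity over embeddings. The `embed` function is a hash-seeded stand-in for a real embedding model, and the corpus is invented, so treat this as the shape of the computation rather than a production recipe:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g. a sentence transformer).
    A hash-seeded random vector keeps the sketch runnable offline."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

# Index time: embed every chunk of the external corpus once.
corpus = [
    "The 2025 Tesla Model 3 has a battery capacity of up to 82 kWh.",
    "Wearable device shipments grew strongly in 2024.",
    "The cafeteria serves lunch from 11:30 to 14:00.",
]
index = np.stack([embed(chunk) for chunk in corpus])

# Query time: embed the question and rank chunks by cosine similarity.
query_vec = embed("Tesla Model 3 battery capacity")
scores = index @ query_vec              # unit vectors, so dot product = cosine
top_k = np.argsort(scores)[::-1][:2]    # keep the 2 best-matching chunks

retrieved = [corpus[i] for i in top_k]
print(retrieved)  # these chunks become the context for the generation step
```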

RAG Example

  1. User Query: "What are the key market trends for wearable devices in 2025?"
  2. Index Search: The system checks an index (like a library catalog) built from corporate research documents on wearable devices.
  3. Retrieve Documents: Relevant research papers, whitepapers, or internal reports are found. Let's say 3–5 documents are deemed most relevant.
  4. LLM Generation: The LLM then processes these 3–5 documents and crafts a synthesized response. Instead of relying on its "old" training data, it now grounds its answer in these up-to-date, private documents.
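
Putting these steps together, here is a minimal end-to-end sketch. The retriever returns canned documents and the LLM call is a placeholder, not any specific library's API:

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder retriever: in practice, semantic search over an index
    of your documents (see the retrieval sketch above)."""
    return [
        "Internal report: 2025 wearable shipment forecast ...",
        "Whitepaper: health-sensor miniaturization trends ...",
        "Market study: smartwatch vs. smart-ring adoption ...",
    ][:k]

def call_llm(prompt: str) -> str:
    """Placeholder for any LLM completion API; returns a canned reply here."""
    return "Synthesized answer grounded in sources [1]-[3]."

def answer(query: str) -> str:
    docs = retrieve(query, k=5)
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    prompt = (
        "Using ONLY the sources below, answer the question and cite "
        f"sources by number.\n\nSources:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

print(answer("What are the key market trends for wearable devices in 2025?"))
```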

Precision and Recall

LLMs tolerate noisy context fairly well, so design your system to maximize recall upstream and enforce precision downstream through reranking, validation, or filters.

| Stage | Focus | Explanation |
| --- | --- | --- |
| 🔍 Retriever | High Recall | "Find anything that might help." |
| 🧠 LLM + Reranker | High Precision | "Now pick only the most relevant and trustworthy." |
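
One way to realize this split in code, sketched with a deliberately crude word-overlap score standing in for a real reranker (e.g. a cross-encoder model):

```python
def cross_encoder_score(query: str, candidate: str) -> float:
    """Stand-in for a real reranker: a crude word-overlap score,
    just so the sketch runs end to end."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, candidates: list[str], keep: int = 3) -> list[str]:
    """High recall in, high precision out: re-score every candidate with a
    stronger (slower) model and keep only the best few."""
    return sorted(candidates, key=lambda c: cross_encoder_score(query, c),
                  reverse=True)[:keep]

# Upstream, the retriever casts a wide net (say, dozens of candidates);
# downstream, the reranker cuts the list to the most trustworthy chunks.
candidates = [
    "2025 wearable market trends: rings and watches converge.",
    "Cafeteria lunch hours.",
    "Wearable sensor supply chain notes.",
]
print(rerank("wearable market trends 2025", candidates, keep=2))
```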

Imagine you're a detective investigating a crime. You have a long list of suspects and need to identify the guilty ones. Precision asks: of the suspects you arrested, how many were actually guilty? Recall asks: of all the guilty, how many did you arrest?

| Metric | What it Means | Pros | Cons |
| --- | --- | --- | --- |
| High Precision | Most predicted positives are correct | Accurate predictions, low false alarms | Might miss actual positives (low recall) |
| Low Precision | Many predicted positives are wrong | Might catch all actual positives (if recall is high) | Many false alarms |
| High Recall | Most actual positives are found | Rarely misses important positives | Might include irrelevant or wrong positives (low precision) |
| Low Recall | Many actual positives are missed | Very selective predictions | Misses important positive cases |
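
In terms of counts, precision = TP / (TP + FP) and recall = TP / (TP + FN). A tiny worked example over retrieved documents:

```python
# Precision = TP / (TP + FP): of everything we returned, how much was relevant?
# Recall    = TP / (TP + FN): of everything relevant, how much did we return?
relevant  = {"doc1", "doc3", "doc4"}   # ground-truth positives
retrieved = {"doc1", "doc2", "doc3"}   # what the system returned

tp = len(relevant & retrieved)         # 2 true positives: doc1, doc3
precision = tp / len(retrieved)        # 2/3 ≈ 0.67
recall    = tp / len(relevant)         # 2/3 ≈ 0.67
print(f"precision={precision:.2f} recall={recall:.2f}")
```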

| Common Use Case | Optimize For | Why? |
| --- | --- | --- |
| Intent Detection (chatbots) | High Precision, Low Recall | Wrong intents lead to confusing replies. Better to admit uncertainty than be wrong. |
| Information Retrieval (RAG) | Moderate Precision, High Recall | Gather all potentially relevant context. The LLM can handle noise via reranking. |
| Question Answering | High Precision, Recall Ignored | Trust relies on correctness. Better to say "I don't know" than hallucinate. |
| Summarization | High Precision, Recall Ignored | The summary must reflect the source truthfully. Don't invent info. |
| Code Generation | High Precision, Recall Ignored | Invalid code breaks functionality. Avoid hallucinated or broken output. |

Assessing RAG

To assess model performance, we use the following metrics (rated between 0 and 1, higher is better):

  • Faithfulness (Generation)
    • Does the answer stick to what the documents say, or is it making things up?
  • Context Recall (Retrieval)
    • Did we grab all the relevant info, or did we miss something useful?
  • Context Precision (Retrieval)
    • Measures how much of what we retrieved is actually relevant.
  • Factual Correctness (Generation)
    • Are the facts in the answer actually true based on the retrieved content?
  • Answer Semantic Similarity (Generation)
    • Even if phrased differently, does the answer mean the same as a good one?
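
As one concrete instance, answer semantic similarity is commonly computed as the cosine similarity between embeddings of the generated answer and a reference answer. A minimal sketch, again with a hash-seeded stand-in for a real embedding model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def semantic_similarity(answer: str, reference: str) -> float:
    """Cosine similarity of the two embeddings, in [-1, 1].
    Real embedding models place paraphrases close to 1; this toy
    stand-in won't, but the computation has the same shape."""
    return float(embed(answer) @ embed(reference))

print(semantic_similarity(
    "The 2025 Model 3 packs up to an 82 kWh battery.",
    "Battery capacity of the 2025 Tesla Model 3 is up to 82 kWh.",
))
```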