Killer-Skills

rag — Retrieval-Augmented Generation for AI agents

v0.5.0
GitHub

About this Skill

RAG is a Retrieval-Augmented Generation system that combines document retrieval with large language model generation to produce grounded responses. It is ideal for AI agents that need accurate, source-backed answers.

Features

Installs via the uv add llmemory command
Enables local cross-encoder models for reranking
Configures OpenAI reranking for improved relevance
Ingests documents into llmemory for retrieval
Retrieves relevant chunks from ingested documents
Reranks retrieved chunks for improved accuracy

Author: juanre · Updated: 3/6/2026

Quality Score: 59 (Top 5%, Excellent), based on code quality & docs
Installation
Universal install (auto-detects Cursor, Windsurf, and VS Code IDEs):
> npx killer-skills add juanre/llmemory/rag

Agent Capability Analysis

The rag MCP Server by juanre is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Perfect for AI Agents needing accurate and grounded responses through document retrieval and LLM generation.

Core Value

Empowers agents to generate accurate responses by combining document retrieval with LLM generation, utilizing llmemory and supporting reranking through local cross-encoder models or OpenAI reranking.

Capabilities Granted for rag MCP Server

Ingesting documents for knowledge retrieval
Reranking search results for improved relevance
Generating grounded responses using Retrieval-Augmented Generation

Prerequisites & Limits

  • Requires llmemory installation
  • Optional: OpenAI API Key for reranking support


LLMemory RAG Systems

Installation

```bash
uv add llmemory
# For reranking support
uv add "llmemory[reranker-local]"  # Local cross-encoder models
# or configure OpenAI reranking (no extra install needed)
```

Overview

Retrieval-Augmented Generation (RAG) combines llmemory's document retrieval with LLM generation for accurate, grounded responses.

RAG Pipeline:

  1. Ingest: Add documents to llmemory
  2. Retrieve: Search for relevant chunks
  3. Rerank: Improve relevance ordering (optional but recommended)
  4. Augment: Build prompt with retrieved context
  5. Generate: Get LLM response

When to use RAG:

  • Question answering over your documents
  • Customer support with knowledge base
  • Research assistance
  • Code documentation search
  • Any application needing accurate, source-backed answers

Quick Start

```python
from llmemory import LLMemory, SearchType, DocumentType
from openai import AsyncOpenAI

async def rag_system():
    # Initialize
    memory = LLMemory(
        connection_string="postgresql://localhost/mydb",
        openai_api_key="sk-..."
    )
    await memory.initialize()

    # 1. Ingest documents
    await memory.add_document(
        owner_id="workspace-1",
        id_at_origin="kb",
        document_name="product_guide.md",
        document_type=DocumentType.MARKDOWN,
        content="Your product documentation..."
    )

    # 2. Retrieve with reranking
    results = await memory.search(
        owner_id="workspace-1",
        query_text="how to reset password",
        search_type=SearchType.HYBRID,
        query_expansion=True,   # Better retrieval
        rerank=True,            # Better ranking
        rerank_top_k=50,        # Rerank top 50 candidates
        rerank_return_k=10,     # Prefer 10 best after reranking
        limit=5                 # Final result count (max of limit and rerank_return_k)
    )

    # 3. Build prompt with context
    context = "\n\n".join([
        f"Source: {r.metadata.get('source', 'unknown')}\n{r.content}"
        for r in results
    ])

    prompt = f"""Answer the question using only the provided context.

Context:
{context}

Question: how to reset password

Answer:"""

    # 4. Generate response
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    print(response.choices[0].message.content)
    await memory.close()

import asyncio
asyncio.run(rag_system())
```

Query Routing for Production RAG

Production RAG systems should detect when queries cannot be answered from available documents.

When to use query routing:

  • User queries may be unanswerable from your knowledge base
  • Need to route to web search or external APIs
  • Want to avoid hallucinated answers
  • Building conversational assistants

Example:

```python
from llmemory import LLMemory

async def answer_with_routing(query: str):
    async with LLMemory(connection_string="...") as memory:
        # Search with automatic routing
        result = await memory.search_with_routing(
            owner_id="workspace-1",
            query_text=query,
            enable_routing=True,
            limit=5
        )

        if result["route"] == "retrieval":
            # Answer from documents
            return generate_answer(result["results"])
        elif result["route"] == "web_search":
            # Route to web search
            return fetch_from_web(query)
        elif result["route"] == "unanswerable":
            # Honest response
            return "I don't have information to answer that question."
        else:  # clarification
            return "Could you please provide more details?"
```

API Reference:

search_with_routing()

Route queries intelligently before searching.

Signature:

```python
async def search_with_routing(
    owner_id: str,
    query_text: str,
    enable_routing: bool = True,
    routing_threshold: float = 0.7,
    **search_kwargs
) -> Dict[str, Any]
```

Parameters:

  • owner_id (str): Owner identifier
  • query_text (str): Search query
  • enable_routing (bool, default: True): Enable automatic routing
  • routing_threshold (float, default: 0.7): Confidence threshold
  • **search_kwargs: Additional arguments passed to search()

Returns: Dict with:

  • route (str): "retrieval", "web_search", "unanswerable", or "clarification"
  • confidence (float): 0-1 confidence in routing decision
  • results (List[SearchResult]): If route="retrieval"
  • message (str): If route != "retrieval"
  • reason (str): Explanation of routing decision

Example:

```python
result = await memory.search_with_routing(
    owner_id="support",
    query_text="How do I reset my password?",
    routing_threshold=0.8
)

if result["route"] == "retrieval":
    answer = generate_rag_response(result["results"])
else:
    answer = result["message"]  # Pre-formatted response
```

Complete RAG Pipeline

Step 1: Document Ingestion

```python
from llmemory import LLMemory, DocumentType, ChunkingConfig, LLMemoryConfig

async def ingest_knowledge_base(owner_id: str):
    """Ingest documents into RAG system."""

    # Configure chunking for RAG (smaller chunks for precise retrieval)
    chunking_config = ChunkingConfig(
        chunk_size=300,           # Tokens per chunk (smaller for RAG)
        chunk_overlap=50,         # Overlap for context preservation
        strategy="hierarchical",  # Chunking strategy
        min_chunk_size=100,       # Minimum chunk size
        max_chunk_size=500        # Maximum chunk size
    )

    # Enable chunk summaries via LLMemoryConfig
    config = LLMemoryConfig()
    config.chunking.enable_chunk_summaries = True
    config.chunking.summary_max_tokens = 80

    memory = LLMemory(
        connection_string="postgresql://localhost/mydb",
        config=config
    )
    await memory.initialize()

    documents = [
        {
            "name": "product_guide.md",
            "type": DocumentType.MARKDOWN,
            "content": "...",
            "metadata": {"category": "guide", "version": "2.0"}
        },
        {
            "name": "faq.md",
            "type": DocumentType.MARKDOWN,
            "content": "...",
            "metadata": {"category": "faq"}
        },
        {
            "name": "api_docs.md",
            "type": DocumentType.TECHNICAL_DOC,
            "content": "...",
            "metadata": {"category": "api", "language": "python"}
        }
    ]

    for doc in documents:
        result = await memory.add_document(
            owner_id=owner_id,
            id_at_origin="knowledge_base",
            document_name=doc["name"],
            document_type=doc["type"],
            content=doc["content"],
            metadata=doc["metadata"],
            chunking_config=chunking_config,
            generate_embeddings=True
        )
        print(f"Ingested {doc['name']}: {result.chunks_created} chunks")
```

Step 2: Retrieval Configuration

```python
async def retrieve_for_rag(
    memory: LLMemory,
    owner_id: str,
    query: str,
    top_k: int = 5
) -> List[SearchResult]:
    """Retrieve relevant chunks for RAG."""

    results = await memory.search(
        owner_id=owner_id,
        query_text=query,

        # Hybrid search for best quality
        search_type=SearchType.HYBRID,
        alpha=0.6,  # Slight favor to semantic search

        # Query expansion for better recall
        query_expansion=True,
        max_query_variants=3,

        # Reranking for precision
        rerank=True,
        rerank_top_k=20,        # Consider top 20 candidates
        rerank_return_k=top_k,  # Prefer top_k after reranking

        # Final limit (actual count = max(limit, rerank_return_k))
        limit=top_k
    )

    return results
```

Step 3: Reranking Configuration

llmemory supports multiple reranking methods:

OpenAI Reranking (Recommended for Quality)

```bash
# Configure via environment
LLMEMORY_RERANK_PROVIDER=openai
LLMEMORY_RERANK_MODEL=gpt-4.1-mini
LLMEMORY_RERANK_TOP_K=30
LLMEMORY_RERANK_RETURN_K=10
```

```python
# Or programmatically
from llmemory import LLMemory, LLMemoryConfig

config = LLMemoryConfig()
config.search.enable_rerank = True
config.search.rerank_provider = "openai"
config.search.default_rerank_model = "gpt-4.1-mini"
config.search.rerank_top_k = 30
config.search.rerank_return_k = 10

memory = LLMemory(
    connection_string="postgresql://localhost/mydb",
    config=config
)
```

Local Cross-Encoder Reranking (Faster, No API Calls)

```bash
# Install local reranker dependencies
uv add "llmemory[reranker-local]"
```

```python
# Configure
config = LLMemoryConfig()
config.search.enable_rerank = True
config.search.default_rerank_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
config.search.rerank_device = "cpu"  # or "cuda"
config.search.rerank_batch_size = 16
```

Lexical Reranking (Fallback, No Dependencies)

```python
# Automatic fallback when no reranker configured
# Uses token overlap scoring
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=True  # Uses lexical reranking
)
```

Reranker API Reference

CrossEncoderReranker

Local cross-encoder model for reranking search results without API calls.

Constructor:

```python
CrossEncoderReranker(
    model_name: str = "cross-encoder/ms-marco-MiniLM-L6-v2",
    device: Optional[str] = None,
    batch_size: int = 16
)
```

Parameters:

  • model_name (str, default: "cross-encoder/ms-marco-MiniLM-L6-v2"): Hugging Face cross-encoder model name
    • Available models: "cross-encoder/ms-marco-MiniLM-L6-v2", "cross-encoder/ms-marco-TinyBERT-L2-v2"
  • device (Optional[str]): Device to run on: "cpu", "cuda", or None (auto-detect)
  • batch_size (int, default: 16): Batch size for inference

Methods:

score()

Score query-document pairs for relevance.

Signature:

```python
async def score(
    query_text: str,
    results: Sequence[SearchResult]
) -> Sequence[float]
```

Parameters:

  • query_text (str): Search query
  • results (Sequence[SearchResult]): Search results to score

Returns:

  • Sequence[float]: Relevance scores (same length as results)

Example:

```python
from llmemory import CrossEncoderReranker

# Initialize reranker
reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L6-v2",
    device="cpu",
    batch_size=32
)

# Get initial search results
results = await memory.search(
    owner_id="workspace-1",
    query_text="machine learning",
    limit=50,
    rerank=False  # Get unranked results
)

# Rerank with cross-encoder
scores = await reranker.score("machine learning", results)

# Sort by new scores
scored_results = list(zip(scores, results))
scored_results.sort(key=lambda x: x[0], reverse=True)
top_results = [r for _, r in scored_results[:10]]
```

Installation:

```bash
# Requires sentence-transformers
uv add "llmemory[reranker-local]"
```

OpenAIResponsesReranker

Use OpenAI GPT models for intelligent reranking with natural language understanding.

Constructor:

```python
OpenAIResponsesReranker(
    model: str = "gpt-4.1-mini",
    max_candidates: int = 30,
    temperature: float = 0.0
)
```

Parameters:

  • model (str, default: "gpt-4.1-mini"): OpenAI model name
    • Recommended: "gpt-4.1-mini" (fast, cost-effective), "gpt-4" (higher quality)
  • max_candidates (int, default: 30): Maximum candidates to send to API
  • temperature (float, default: 0.0): Model temperature (0 = deterministic)

Methods:

score()

Score query-document pairs using OpenAI API.

Signature:

```python
async def score(
    query_text: str,
    results: Sequence[SearchResult]
) -> Sequence[float]
```

Parameters:

  • query_text (str): Search query
  • results (Sequence[SearchResult]): Search results to score

Returns:

  • Sequence[float]: Relevance scores between 0 and 1

Example:

```python
from llmemory import OpenAIResponsesReranker

# Initialize reranker (uses OPENAI_API_KEY from env)
reranker = OpenAIResponsesReranker(
    model="gpt-4.1-mini",
    max_candidates=20,
    temperature=0.0
)

# Get initial search results
results = await memory.search(
    owner_id="workspace-1",
    query_text="customer retention strategies",
    limit=50,
    rerank=False
)

# Rerank with OpenAI
scores = await reranker.score("customer retention strategies", results)

# Sort by scores
scored_results = list(zip(scores, results))
scored_results.sort(key=lambda x: x[0], reverse=True)
top_results = [r for _, r in scored_results[:10]]

print(f"Top result score: {scored_results[0][0]:.3f}")
```

Cost Considerations:

  • Each rerank call makes one API request
  • Costs depend on model and number of candidates
  • Consider caching reranked results for repeated queries

When to use:

  • Need highest quality reranking
  • Willing to pay API costs
  • Latency tolerance (100-300ms overhead)

RerankerService

Internal service that wraps reranker implementations (rarely used directly).

Usage: Automatically created by LLMemory when reranking is enabled via configuration. Generally not instantiated directly by users.

SearchResult Fields Reference

Search results contain multiple score fields depending on the search configuration. Understanding these fields helps optimize RAG retrieval quality.

Core Fields

chunk_id (UUID)

  • Unique identifier for the chunk

document_id (UUID)

  • Parent document identifier

content (str)

  • Full chunk text content

metadata (Dict[str, Any])

  • Chunk metadata (may include title, section, page number, etc.)

score (float)

  • Primary relevance score
  • For hybrid search: combined score from vector and text search
  • For vector search: same as similarity
  • For text search: same as text_rank
  • After reranking: same as rerank_score

Optional Score Fields

similarity (Optional[float])

  • Vector similarity score (cosine similarity)
  • Range: 0.0 to 1.0 (higher = more similar)
  • Populated when search_type is VECTOR or HYBRID
  • Example: 0.87 indicates strong semantic similarity

text_rank (Optional[float])

  • BM25 full-text search rank
  • Higher values indicate better keyword matches
  • Populated when search_type is TEXT or HYBRID
  • Not normalized to [0,1] range

rrf_score (Optional[float])

  • Reciprocal Rank Fusion score
  • Populated when query_expansion=True (multi-query search)
  • Combines rankings from multiple query variants
  • Higher values indicate consistent ranking across variants

rerank_score (Optional[float])

  • Reranker relevance score
  • Populated when rerank=True
  • Range and interpretation depends on reranker:
    • OpenAI reranker: 0.0 to 1.0 (normalized probability)
    • Cross-encoder: typically -10 to +10 (raw logit score)
    • Lexical reranker: 0.0 to 1.0 (token overlap ratio)
  • Higher values indicate higher relevance according to reranker

summary (Optional[str])

  • Concise chunk summary (30-50% of original length)
  • Populated when ChunkingConfig.enable_chunk_summaries=True
  • Generated during document ingestion
  • Use for prompts to reduce token usage: text = result.summary or result.content
  • See "Enable and Use Chunk Summaries" section below

Using Score Fields

```python
# Example: Analyzing search result scores
results = await memory.search(
    owner_id="workspace-1",
    query_text="machine learning algorithms",
    search_type=SearchType.HYBRID,
    query_expansion=True,
    rerank=True,
    limit=5
)

for result in results:
    print(f"Chunk ID: {result.chunk_id}")
    print(f"  Final score: {result.score:.3f}")

    # Vector component (if hybrid/vector search)
    if result.similarity is not None:
        print(f"  Vector similarity: {result.similarity:.3f}")

    # Text component (if hybrid/text search)
    if result.text_rank is not None:
        print(f"  BM25 rank: {result.text_rank:.3f}")

    # Multi-query fusion (if query_expansion=True)
    if result.rrf_score is not None:
        print(f"  RRF score: {result.rrf_score:.3f}")

    # Reranking (if rerank=True)
    if result.rerank_score is not None:
        print(f"  Rerank score: {result.rerank_score:.3f}")

    # Summary (if enabled during ingestion)
    if result.summary:
        print(f"  Summary: {result.summary[:100]}...")
```

Reranking Parameters

rerank_top_k (int, default: 50)

  • Number of initial candidates to send to reranker
  • Retrieve this many results from base search before reranking
  • Larger values: better quality but slower and more expensive
  • Recommended range: 20-100

rerank_return_k (int, default: 15)

  • Preferred number of results after reranking
  • Results are prioritized by rerank score
  • Actual result count: max(limit, rerank_return_k)
  • Set higher than limit to ensure best reranked results

limit (int, default: 10)

  • Final result count returned to user
  • Works with rerank_return_k: final_count = max(limit, rerank_return_k)
  • Example: limit=5, rerank_return_k=10 → returns 10 results
  • Example: limit=20, rerank_return_k=10 → returns 20 results
```python
# Example: Reranking parameter interactions
results = await memory.search(
    owner_id="workspace-1",
    query_text="database optimization",
    search_type=SearchType.HYBRID,
    rerank=True,
    rerank_top_k=50,     # Consider top 50 from base search
    rerank_return_k=10,  # Prefer 10 best after reranking
    limit=5              # But return max(5, 10) = 10 results
)
# Returns 10 results (max of limit and rerank_return_k)
assert len(results) == 10

results = await memory.search(
    owner_id="workspace-1",
    query_text="database optimization",
    search_type=SearchType.HYBRID,
    rerank=True,
    rerank_top_k=50,    # Consider top 50 from base search
    rerank_return_k=5,  # Prefer 5 best after reranking
    limit=20            # But return max(20, 5) = 20 results
)
# Returns 20 results (max of limit and rerank_return_k)
assert len(results) == 20
```

Step 4: Prompt Augmentation

```python
def build_rag_prompt(
    query: str,
    results: List[SearchResult],
    system_instructions: str = "Answer based only on the provided context."
) -> str:
    """Build RAG prompt with retrieved context."""

    # Format context from search results
    context_parts = []
    for i, result in enumerate(results, 1):
        # Include source information
        source = result.metadata.get("source", "Unknown")
        doc_name = result.metadata.get("document_name", "")

        # Use summary if available (more concise for prompts)
        # Summary is populated when ChunkingConfig.enable_chunk_summaries=True
        text = result.summary or result.content

        context_parts.append(
            f"[Source {i}: {doc_name or source}]\n{text}"
        )

    context = "\n\n".join(context_parts)

    # Build final prompt
    prompt = f"""{system_instructions}

Context:
{context}

Question: {query}

Answer:"""

    return prompt
```

Advanced Prompt Patterns

With Citation Requirements:

```python
def build_prompt_with_citations(query: str, results: List[SearchResult]) -> str:
    context_parts = []
    for i, result in enumerate(results, 1):
        source = result.metadata.get("document_name", f"Source {i}")
        # Use summary if enabled (see enabling summaries section below)
        text = result.summary or result.content
        context_parts.append(f"[{i}] {source}: {text}")

    context = "\n\n".join(context_parts)

    prompt = f"""Answer the question using the provided context. Cite sources using [number] format.

Context:
{context}

Question: {query}

Answer (with citations):"""

    return prompt
```

With Metadata Filtering:

```python
async def rag_with_filters(
    memory: LLMemory,
    owner_id: str,
    query: str,
    category: str
):
    """RAG with metadata filtering."""
    results = await memory.search(
        owner_id=owner_id,
        query_text=query,
        search_type=SearchType.HYBRID,
        metadata_filter={"category": category},  # Filter by category
        rerank=True,
        limit=5
    )

    return build_rag_prompt(query, results)
```

Step 5: LLM Generation

```python
from openai import AsyncOpenAI

async def generate_rag_response(
    query: str,
    results: List[SearchResult],
    model: str = "gpt-4"
) -> dict:
    """Generate LLM response with RAG context."""

    # Build prompt
    prompt = build_rag_prompt(query, results)

    # Generate with OpenAI
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that answers questions based on provided context."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.3,  # Lower temperature for factual answers
        max_tokens=500
    )

    # Extract response
    answer = response.choices[0].message.content

    return {
        "answer": answer,
        "sources": [
            {
                "content": r.content[:200] + "...",
                "score": r.score,
                "metadata": r.metadata
            }
            for r in results
        ],
        "model": model
    }
```

Complete RAG System Example

```python
from llmemory import LLMemory, SearchType, DocumentType
from openai import AsyncOpenAI
from typing import List, Dict, Any

class RAGSystem:
    """Complete RAG system with llmemory."""

    def __init__(self, connection_string: str, openai_api_key: str):
        self.memory = LLMemory(
            connection_string=connection_string,
            openai_api_key=openai_api_key
        )
        self.client = AsyncOpenAI(api_key=openai_api_key)
        self.initialized = False

    async def initialize(self):
        """Initialize the RAG system."""
        await self.memory.initialize()
        self.initialized = True

    async def ingest_document(
        self,
        owner_id: str,
        document_name: str,
        content: str,
        document_type: DocumentType = DocumentType.TEXT,
        metadata: Dict[str, Any] = None
    ):
        """Add a document to the knowledge base."""
        result = await self.memory.add_document(
            owner_id=owner_id,
            id_at_origin="rag_kb",
            document_name=document_name,
            document_type=document_type,
            content=content,
            metadata=metadata or {},
            generate_embeddings=True
        )
        return {
            "document_id": str(result.document.document_id),
            "chunks_created": result.chunks_created
        }

    async def answer_question(
        self,
        owner_id: str,
        question: str,
        top_k: int = 5,
        model: str = "gpt-4"
    ) -> Dict[str, Any]:
        """Answer a question using RAG."""

        # Retrieve relevant chunks
        results = await self.memory.search(
            owner_id=owner_id,
            query_text=question,
            search_type=SearchType.HYBRID,
            query_expansion=True,
            max_query_variants=3,
            rerank=True,
            rerank_top_k=20,
            rerank_return_k=top_k,
            limit=top_k
        )

        if not results:
            return {
                "answer": "I don't have enough information to answer this question.",
                "sources": [],
                "confidence": "low"
            }

        # Build prompt
        context = "\n\n".join([
            f"[Source: {r.metadata.get('document_name', 'Unknown')}]\n{r.summary or r.content}"
            for r in results
        ])

        prompt = f"""Answer the question using only the provided context. If the answer cannot be found in the context, say so.

Context:
{context}

Question: {question}

Answer:"""

        # Generate response
        response = await self.client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3
        )

        answer = response.choices[0].message.content

        # Determine confidence based on scores
        avg_score = sum(r.score for r in results) / len(results)
        confidence = "high" if avg_score > 0.5 else "medium" if avg_score > 0.3 else "low"

        return {
            "answer": answer,
            "sources": [
                {
                    "document_name": r.metadata.get("document_name"),
                    "content_preview": r.content[:150] + "...",
                    "score": r.score,
                    "similarity": r.similarity,
                    "rerank_score": r.rerank_score  # Populated when rerank=True
                }
                for r in results
            ],
            "confidence": confidence,
            "model": model
        }

    async def close(self):
        """Clean up resources."""
        await self.memory.close()

# Usage
async def main():
    rag = RAGSystem(
        connection_string="postgresql://localhost/mydb",
        openai_api_key="sk-..."
    )
    await rag.initialize()

    # Ingest documents
    await rag.ingest_document(
        owner_id="user-123",
        document_name="product_guide.md",
        content="...",
        document_type=DocumentType.MARKDOWN,
        metadata={"category": "guide"}
    )

    # Answer questions
    result = await rag.answer_question(
        owner_id="user-123",
        question="How do I reset my password?",
        top_k=5
    )

    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Sources: {len(result['sources'])}")

    await rag.close()
```

RAG Best Practices

1. Chunk Size Optimization

```python
from llmemory import ChunkingConfig, LLMemoryConfig

# For RAG, use smaller chunks (better precision)
chunking_config = ChunkingConfig(
    chunk_size=300,           # 300 tokens (vs 1000 default)
    chunk_overlap=50,         # 50 tokens overlap
    strategy="hierarchical",  # Create parent/child chunks
    min_chunk_size=100,       # Minimum chunk size
    max_chunk_size=500        # Maximum chunk size
)

# Enable summaries via LLMemoryConfig (set when creating LLMemory)
# config = LLMemoryConfig()
# config.chunking.enable_chunk_summaries = True
# config.chunking.summary_max_tokens = 80

await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="doc.md",
    document_type=DocumentType.MARKDOWN,
    content="...",
    chunking_config=chunking_config
)

# Smaller chunks:
# - More precise retrieval
# - Better for prompts (fit more sources)
# - Less noise in context
#
# Larger chunks:
# - More context per chunk
# - Better for broad questions
# - Fewer chunks needed
```

2. Use Parent Context for Broader Context

```python
# Retrieve with parent context
results = await memory.search(
    owner_id="workspace-1",
    query_text="API authentication",
    search_type=SearchType.HYBRID,
    include_parent_context=True,  # Include surrounding chunks
    context_window=2,             # ±2 chunks
    limit=5
)

# Build prompt with parent context
for result in results:
    print(f"Main chunk: {result.content}")
    if result.parent_chunks:
        print(f"Context from {len(result.parent_chunks)} parent chunks")
        for parent in result.parent_chunks:
            print(f"  - {parent.content[:100]}...")
```

3. Reranking for Quality

Always use reranking in RAG for better relevance:

```python
# Without reranking (lower quality)
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=False,
    limit=5
)

# With reranking (higher quality)
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=True,
    rerank_top_k=20,     # Consider top 20 candidates
    rerank_return_k=10,  # Prefer 10 best after reranking
    limit=5              # Final count = max(5, 10) = 10 results
)

# Reranking improves:
# - Relevance of top results
# - Precision for RAG prompts
# - Reduces hallucination (better context)
```

4. Query Expansion for Recall

```python
# Use multi-query for better recall
results = await memory.search(
    owner_id="workspace-1",
    query_text="reduce latency",
    query_expansion=True,  # Generates variants like "improve response time"
    max_query_variants=3,
    rerank=True,           # Rerank after fusion
    limit=5
)

# Good for:
# - Vague queries
# - Different terminology in docs
# - Comprehensive answers
```

5. Metadata for Filtering

```python
# Add rich metadata during ingestion
await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="api_v2_docs.md",
    document_type=DocumentType.TECHNICAL_DOC,
    content="...",
    metadata={
        "category": "api",
        "version": "2.0",
        "language": "python",
        "last_updated": "2024-10-01"
    }
)

# Filter during retrieval
results = await memory.search(
    owner_id="workspace-1",
    query_text="authentication",
    metadata_filter={
        "category": "api",
        "version": "2.0"
    },
    limit=5
)
```

6. Enable and Use Chunk Summaries

Chunk summaries provide concise representations of chunks, making prompts more efficient by reducing token usage while preserving key information.

Enabling Summaries:

```python
from llmemory import LLMemory, ChunkingConfig, LLMemoryConfig, DocumentType

# Enable summaries via LLMemoryConfig
config = LLMemoryConfig()
config.chunking.enable_chunk_summaries = True
config.chunking.summary_max_tokens = 80  # Control summary length

memory = LLMemory(
    connection_string="postgresql://localhost/mydb",
    config=config
)
await memory.initialize()

# Use custom chunking config for chunk size settings
chunking_config = ChunkingConfig(
    chunk_size=300,
    chunk_overlap=50,
    strategy="hierarchical"
)

await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="doc.md",
    document_type=DocumentType.MARKDOWN,
    content="...",
    chunking_config=chunking_config
)
```

Using Summaries in Prompts:

```python
def build_prompt_with_summaries(query: str, results: List[SearchResult]):
    """Build prompt using chunk summaries when available."""
    context_parts = []
    for result in results:
        # SearchResult.summary is populated when enable_chunk_summaries=True
        # Falls back to full content if summaries weren't generated
        text = result.summary or result.content
        context_parts.append(text)

    context = "\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
```

Benefits:

  • Reduced prompt token usage (summaries are ~30-50% of original size)
  • More chunks fit in context window
  • Faster LLM processing
  • Preserved key information for accurate answers

RAG Evaluation

Measuring Retrieval Quality

```python
async def evaluate_retrieval(
    memory: LLMemory,
    owner_id: str,
    test_queries: List[Dict[str, Any]]
):
    """Evaluate retrieval quality."""

    metrics = {
        "precision_at_5": [],
        "recall": [],
        "mrr": []  # Mean Reciprocal Rank
    }

    for test in test_queries:
        query = test["query"]
        relevant_doc_ids = set(test["relevant_docs"])

        # Retrieve
        results = await memory.search(
            owner_id=owner_id,
            query_text=query,
            rerank=True,
            limit=10
        )

        # Calculate precision@5
        top_5_docs = {str(r.document_id) for r in results[:5]}
        precision = len(top_5_docs & relevant_doc_ids) / 5
        metrics["precision_at_5"].append(precision)

        # Calculate recall
        retrieved_docs = {str(r.document_id) for r in results}
        recall = len(retrieved_docs & relevant_doc_ids) / len(relevant_doc_ids)
        metrics["recall"].append(recall)

        # Calculate MRR
        for rank, result in enumerate(results, 1):
            if str(result.document_id) in relevant_doc_ids:
                metrics["mrr"].append(1.0 / rank)
                break
        else:
            metrics["mrr"].append(0.0)

    return {
        "avg_precision_at_5": sum(metrics["precision_at_5"]) / len(test_queries),
        "avg_recall": sum(metrics["recall"]) / len(test_queries),
        "mean_reciprocal_rank": sum(metrics["mrr"]) / len(test_queries)
    }
```

Related Skills

  • basic-usage - Core document and search operations
  • hybrid-search - Vector + BM25 hybrid search fundamentals
  • multi-query - Query expansion for improved retrieval
  • multi-tenant - Multi-tenant isolation patterns for SaaS

Important Notes

RAG Pipeline Optimization: LLM generation dominates end-to-end latency in the complete pipeline (retrieve → rerank → generate). Typical per-stage timings:

  • Retrieval: 50-150ms
  • Reranking: 50-200ms (depending on provider)
  • LLM generation: 500-2000ms

Retrieval plus reranking therefore usually add 100-350ms before generation begins.

Chunk Size for RAG: Smaller chunks (200-400 tokens) work better for RAG than larger chunks:

  • More precise retrieval
  • Less noise in context
  • More chunks fit in prompt
  • Better for specific questions

Multi-Tenant RAG: Always use owner_id for data isolation in multi-tenant RAG systems. Never expose one tenant's documents to another.

Reranking ROI: Reranking adds 50-200ms but significantly improves answer quality by ensuring the most relevant chunks appear first in the prompt, reducing hallucination and improving accuracy.
