Killer-Skills

rag — Retrieval-Augmented Generation for AI agents

v0.5.0
GitHub

About this Skill

RAG is a Retrieval-Augmented Generation system that combines document retrieval with large language model generation to produce grounded responses. It is ideal for AI agents that need accurate, source-backed answers.

Features

Installs via the uv add llmemory command
Enables local cross-encoder models for reranking
Configures OpenAI reranking for improved relevance
Ingests documents into llmemory for retrieval
Retrieves relevant chunks from ingested documents
Reranks retrieved chunks for improved accuracy

Author: juanre · Updated: 3/6/2026

Quality Score: 59 (Top 5%, Excellent), based on code quality & docs
Installation
Universal install (auto-detects Cursor, Windsurf, and VS Code IDEs):
> npx killer-skills add juanre/llmemory/rag

Agent Capability Analysis

The rag MCP Server by juanre is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Perfect for AI Agents needing accurate and grounded responses through document retrieval and LLM generation.

Core Value

Empowers agents to generate accurate responses by combining document retrieval with LLM generation, utilizing llmemory and supporting reranking through local cross-encoder models or OpenAI reranking.

Capabilities Granted for rag MCP Server

Ingesting documents for knowledge retrieval
Reranking search results for improved relevance
Generating grounded responses using Retrieval-Augmented Generation

Prerequisites & Limits

  • Requires llmemory installation
  • Optional: OpenAI API Key for reranking support


LLMemory RAG Systems

Installation

```bash
uv add llmemory
# For reranking support
uv add "llmemory[reranker-local]"  # Local cross-encoder models
# or configure OpenAI reranking (no extra install needed)
```

Overview

Retrieval-Augmented Generation (RAG) combines llmemory's document retrieval with LLM generation for accurate, grounded responses.

RAG Pipeline:

  1. Ingest: Add documents to llmemory
  2. Retrieve: Search for relevant chunks
  3. Rerank: Improve relevance ordering (optional but recommended)
  4. Augment: Build prompt with retrieved context
  5. Generate: Get LLM response

When to use RAG:

  • Question answering over your documents
  • Customer support with knowledge base
  • Research assistance
  • Code documentation search
  • Any application needing accurate, source-backed answers

Quick Start

```python
from llmemory import LLMemory, SearchType, DocumentType
from openai import AsyncOpenAI

async def rag_system():
    # Initialize
    memory = LLMemory(
        connection_string="postgresql://localhost/mydb",
        openai_api_key="sk-..."
    )
    await memory.initialize()

    # 1. Ingest documents
    await memory.add_document(
        owner_id="workspace-1",
        id_at_origin="kb",
        document_name="product_guide.md",
        document_type=DocumentType.MARKDOWN,
        content="Your product documentation..."
    )

    # 2. Retrieve with reranking
    results = await memory.search(
        owner_id="workspace-1",
        query_text="how to reset password",
        search_type=SearchType.HYBRID,
        query_expansion=True,   # Better retrieval
        rerank=True,            # Better ranking
        rerank_top_k=50,        # Rerank top 50 candidates
        rerank_return_k=10,     # Prefer 10 best after reranking
        limit=5                 # Final result count (max of limit and rerank_return_k)
    )

    # 3. Build prompt with context
    context = "\n\n".join([
        f"Source: {r.metadata.get('source', 'unknown')}\n{r.content}"
        for r in results
    ])

    prompt = f"""Answer the question using only the provided context.

Context:
{context}

Question: how to reset password

Answer:"""

    # 4. Generate response
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    print(response.choices[0].message.content)
    await memory.close()

import asyncio
asyncio.run(rag_system())
```

Query Routing for Production RAG

Production RAG systems should detect when queries cannot be answered from available documents.

When to use query routing:

  • User queries may be unanswerable from your knowledge base
  • Need to route to web search or external APIs
  • Want to avoid hallucinated answers
  • Building conversational assistants

Example:

```python
from llmemory import LLMemory

async def answer_with_routing(query: str):
    async with LLMemory(connection_string="...") as memory:
        # Search with automatic routing
        result = await memory.search_with_routing(
            owner_id="workspace-1",
            query_text=query,
            enable_routing=True,
            limit=5
        )

        if result["route"] == "retrieval":
            # Answer from documents
            return generate_answer(result["results"])
        elif result["route"] == "web_search":
            # Route to web search
            return fetch_from_web(query)
        elif result["route"] == "unanswerable":
            # Honest response
            return "I don't have information to answer that question."
        else:  # clarification
            return "Could you please provide more details?"
```

API Reference:

search_with_routing()

Route queries intelligently before searching.

Signature:

```python
async def search_with_routing(
    owner_id: str,
    query_text: str,
    enable_routing: bool = True,
    routing_threshold: float = 0.7,
    **search_kwargs
) -> Dict[str, Any]
```

Parameters:

  • owner_id (str): Owner identifier
  • query_text (str): Search query
  • enable_routing (bool, default: True): Enable automatic routing
  • routing_threshold (float, default: 0.7): Confidence threshold
  • **search_kwargs: Additional arguments passed to search()

Returns: Dict with:

  • route (str): "retrieval", "web_search", "unanswerable", or "clarification"
  • confidence (float): 0-1 confidence in routing decision
  • results (List[SearchResult]): If route="retrieval"
  • message (str): If route != "retrieval"
  • reason (str): Explanation of routing decision

Example:

```python
result = await memory.search_with_routing(
    owner_id="support",
    query_text="How do I reset my password?",
    routing_threshold=0.8
)

if result["route"] == "retrieval":
    answer = generate_rag_response(result["results"])
else:
    answer = result["message"]  # Pre-formatted response
```

Complete RAG Pipeline

Step 1: Document Ingestion

```python
from llmemory import LLMemory, DocumentType, ChunkingConfig, LLMemoryConfig

async def ingest_knowledge_base(owner_id: str):
    """Ingest documents into RAG system."""

    # Configure chunking for RAG (smaller chunks for precise retrieval)
    chunking_config = ChunkingConfig(
        chunk_size=300,           # Tokens per chunk (smaller for RAG)
        chunk_overlap=50,         # Overlap for context preservation
        strategy="hierarchical",  # Chunking strategy
        min_chunk_size=100,       # Minimum chunk size
        max_chunk_size=500        # Maximum chunk size
    )

    # Enable chunk summaries via LLMemoryConfig
    config = LLMemoryConfig()
    config.chunking.enable_chunk_summaries = True
    config.chunking.summary_max_tokens = 80

    memory = LLMemory(
        connection_string="postgresql://localhost/mydb",
        config=config
    )
    await memory.initialize()

    documents = [
        {
            "name": "product_guide.md",
            "type": DocumentType.MARKDOWN,
            "content": "...",
            "metadata": {"category": "guide", "version": "2.0"}
        },
        {
            "name": "faq.md",
            "type": DocumentType.MARKDOWN,
            "content": "...",
            "metadata": {"category": "faq"}
        },
        {
            "name": "api_docs.md",
            "type": DocumentType.TECHNICAL_DOC,
            "content": "...",
            "metadata": {"category": "api", "language": "python"}
        }
    ]

    for doc in documents:
        result = await memory.add_document(
            owner_id=owner_id,
            id_at_origin="knowledge_base",
            document_name=doc["name"],
            document_type=doc["type"],
            content=doc["content"],
            metadata=doc["metadata"],
            chunking_config=chunking_config,
            generate_embeddings=True
        )
        print(f"Ingested {doc['name']}: {result.chunks_created} chunks")
```

Step 2: Retrieval Configuration

```python
async def retrieve_for_rag(
    memory: LLMemory,
    owner_id: str,
    query: str,
    top_k: int = 5
) -> List[SearchResult]:
    """Retrieve relevant chunks for RAG."""

    results = await memory.search(
        owner_id=owner_id,
        query_text=query,

        # Hybrid search for best quality
        search_type=SearchType.HYBRID,
        alpha=0.6,  # Slight favor to semantic search

        # Query expansion for better recall
        query_expansion=True,
        max_query_variants=3,

        # Reranking for precision
        rerank=True,
        rerank_top_k=20,        # Consider top 20 candidates
        rerank_return_k=top_k,  # Prefer top_k after reranking

        # Final limit (actual count = max(limit, rerank_return_k))
        limit=top_k
    )

    return results
```

Step 3: Reranking Configuration

llmemory supports multiple reranking methods:

OpenAI Reranking (Recommended for Quality)

```bash
# Configure via environment
LLMEMORY_RERANK_PROVIDER=openai
LLMEMORY_RERANK_MODEL=gpt-4.1-mini
LLMEMORY_RERANK_TOP_K=30
LLMEMORY_RERANK_RETURN_K=10
```

```python
# Or programmatically
from llmemory import LLMemory, LLMemoryConfig

config = LLMemoryConfig()
config.search.enable_rerank = True
config.search.rerank_provider = "openai"
config.search.default_rerank_model = "gpt-4.1-mini"
config.search.rerank_top_k = 30
config.search.rerank_return_k = 10

memory = LLMemory(
    connection_string="postgresql://localhost/mydb",
    config=config
)
```

Local Cross-Encoder Reranking (Faster, No API Calls)

```bash
# Install local reranker dependencies
uv add "llmemory[reranker-local]"
```

```python
# Configure
config = LLMemoryConfig()
config.search.enable_rerank = True
config.search.default_rerank_model = "cross-encoder/ms-marco-MiniLM-L6-v2"
config.search.rerank_device = "cpu"  # or "cuda"
config.search.rerank_batch_size = 16
```

Lexical Reranking (Fallback, No Dependencies)

```python
# Automatic fallback when no reranker configured
# Uses token overlap scoring
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=True  # Uses lexical reranking
)
```

Reranker API Reference

CrossEncoderReranker

Local cross-encoder model for reranking search results without API calls.

Constructor:

```python
CrossEncoderReranker(
    model_name: str = "cross-encoder/ms-marco-MiniLM-L6-v2",
    device: Optional[str] = None,
    batch_size: int = 16
)
```

Parameters:

  • model_name (str, default: "cross-encoder/ms-marco-MiniLM-L6-v2"): Hugging Face cross-encoder model name
    • Available models: "cross-encoder/ms-marco-MiniLM-L6-v2", "cross-encoder/ms-marco-TinyBERT-L2-v2"
  • device (Optional[str]): Device to run on: "cpu", "cuda", or None (auto-detect)
  • batch_size (int, default: 16): Batch size for inference

Methods:

score()

Score query-document pairs for relevance.

Signature:

```python
async def score(
    query_text: str,
    results: Sequence[SearchResult]
) -> Sequence[float]
```

Parameters:

  • query_text (str): Search query
  • results (Sequence[SearchResult]): Search results to score

Returns:

  • Sequence[float]: Relevance scores (same length as results)

Example:

```python
from llmemory import CrossEncoderReranker

# Initialize reranker
reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L6-v2",
    device="cpu",
    batch_size=32
)

# Get initial search results
results = await memory.search(
    owner_id="workspace-1",
    query_text="machine learning",
    limit=50,
    rerank=False  # Get unranked results
)

# Rerank with cross-encoder
scores = await reranker.score("machine learning", results)

# Sort by new scores
scored_results = list(zip(scores, results))
scored_results.sort(key=lambda x: x[0], reverse=True)
top_results = [r for _, r in scored_results[:10]]
```

Installation:

```bash
# Requires sentence-transformers
uv add "llmemory[reranker-local]"
```

OpenAIResponsesReranker

Use OpenAI GPT models for intelligent reranking with natural language understanding.

Constructor:

```python
OpenAIResponsesReranker(
    model: str = "gpt-4.1-mini",
    max_candidates: int = 30,
    temperature: float = 0.0
)
```

Parameters:

  • model (str, default: "gpt-4.1-mini"): OpenAI model name
    • Recommended: "gpt-4.1-mini" (fast, cost-effective), "gpt-4" (higher quality)
  • max_candidates (int, default: 30): Maximum candidates to send to API
  • temperature (float, default: 0.0): Model temperature (0 = deterministic)

Methods:

score()

Score query-document pairs using OpenAI API.

Signature:

```python
async def score(
    query_text: str,
    results: Sequence[SearchResult]
) -> Sequence[float]
```

Parameters:

  • query_text (str): Search query
  • results (Sequence[SearchResult]): Search results to score

Returns:

  • Sequence[float]: Relevance scores between 0 and 1

Example:

```python
from llmemory import OpenAIResponsesReranker

# Initialize reranker (uses OPENAI_API_KEY from env)
reranker = OpenAIResponsesReranker(
    model="gpt-4.1-mini",
    max_candidates=20,
    temperature=0.0
)

# Get initial search results
results = await memory.search(
    owner_id="workspace-1",
    query_text="customer retention strategies",
    limit=50,
    rerank=False
)

# Rerank with OpenAI
scores = await reranker.score("customer retention strategies", results)

# Sort by scores
scored_results = list(zip(scores, results))
scored_results.sort(key=lambda x: x[0], reverse=True)
top_results = [r for _, r in scored_results[:10]]

print(f"Top result score: {scored_results[0][0]:.3f}")
```

Cost Considerations:

  • Each rerank call makes one API request
  • Costs depend on model and number of candidates
  • Consider caching reranked results for repeated queries

When to use:

  • Need highest quality reranking
  • Willing to pay API costs
  • Latency tolerance (100-300ms overhead)

RerankerService

Internal service that wraps reranker implementations (rarely used directly).

Usage: Automatically created by LLMemory when reranking is enabled via configuration. Generally not instantiated directly by users.

SearchResult Fields Reference

Search results contain multiple score fields depending on the search configuration. Understanding these fields helps optimize RAG retrieval quality.

Core Fields

chunk_id (UUID)

  • Unique identifier for the chunk

document_id (UUID)

  • Parent document identifier

content (str)

  • Full chunk text content

metadata (Dict[str, Any])

  • Chunk metadata (may include title, section, page number, etc.)

score (float)

  • Primary relevance score
  • For hybrid search: combined score from vector and text search
  • For vector search: same as similarity
  • For text search: same as text_rank
  • After reranking: same as rerank_score

Optional Score Fields

similarity (Optional[float])

  • Vector similarity score (cosine similarity)
  • Range: 0.0 to 1.0 (higher = more similar)
  • Populated when search_type is VECTOR or HYBRID
  • Example: 0.87 indicates strong semantic similarity

text_rank (Optional[float])

  • BM25 full-text search rank
  • Higher values indicate better keyword matches
  • Populated when search_type is TEXT or HYBRID
  • Not normalized to [0,1] range

rrf_score (Optional[float])

  • Reciprocal Rank Fusion score
  • Populated when query_expansion=True (multi-query search)
  • Combines rankings from multiple query variants
  • Higher values indicate consistent ranking across variants

rerank_score (Optional[float])

  • Reranker relevance score
  • Populated when rerank=True
  • Range and interpretation depends on reranker:
    • OpenAI reranker: 0.0 to 1.0 (normalized probability)
    • Cross-encoder: typically -10 to +10 (raw logit score)
    • Lexical reranker: 0.0 to 1.0 (token overlap ratio)
  • Higher values indicate higher relevance according to reranker

summary (Optional[str])

  • Concise chunk summary (30-50% of original length)
  • Populated when ChunkingConfig.enable_chunk_summaries=True
  • Generated during document ingestion
  • Use for prompts to reduce token usage: text = result.summary or result.content
  • See "Enable and Use Chunk Summaries" section below

Using Score Fields

```python
# Example: Analyzing search result scores
results = await memory.search(
    owner_id="workspace-1",
    query_text="machine learning algorithms",
    search_type=SearchType.HYBRID,
    query_expansion=True,
    rerank=True,
    limit=5
)

for result in results:
    print(f"Chunk ID: {result.chunk_id}")
    print(f"  Final score: {result.score:.3f}")

    # Vector component (if hybrid/vector search)
    if result.similarity is not None:
        print(f"  Vector similarity: {result.similarity:.3f}")

    # Text component (if hybrid/text search)
    if result.text_rank is not None:
        print(f"  BM25 rank: {result.text_rank:.3f}")

    # Multi-query fusion (if query_expansion=True)
    if result.rrf_score is not None:
        print(f"  RRF score: {result.rrf_score:.3f}")

    # Reranking (if rerank=True)
    if result.rerank_score is not None:
        print(f"  Rerank score: {result.rerank_score:.3f}")

    # Summary (if enabled during ingestion)
    if result.summary:
        print(f"  Summary: {result.summary[:100]}...")
```

Reranking Parameters

rerank_top_k (int, default: 50)

  • Number of initial candidates to send to reranker
  • Retrieve this many results from base search before reranking
  • Larger values: better quality but slower and more expensive
  • Recommended range: 20-100

rerank_return_k (int, default: 15)

  • Preferred number of results after reranking
  • Results are prioritized by rerank score
  • Actual result count: max(limit, rerank_return_k)
  • Set higher than limit to ensure best reranked results

limit (int, default: 10)

  • Final result count returned to user
  • Works with rerank_return_k: final_count = max(limit, rerank_return_k)
  • Example: limit=5, rerank_return_k=10 → returns 10 results
  • Example: limit=20, rerank_return_k=10 → returns 20 results
```python
# Example: Reranking parameter interactions
results = await memory.search(
    owner_id="workspace-1",
    query_text="database optimization",
    search_type=SearchType.HYBRID,
    rerank=True,
    rerank_top_k=50,     # Consider top 50 from base search
    rerank_return_k=10,  # Prefer 10 best after reranking
    limit=5              # But return max(5, 10) = 10 results
)
# Returns 10 results (max of limit and rerank_return_k)
assert len(results) == 10

results = await memory.search(
    owner_id="workspace-1",
    query_text="database optimization",
    search_type=SearchType.HYBRID,
    rerank=True,
    rerank_top_k=50,    # Consider top 50 from base search
    rerank_return_k=5,  # Prefer 5 best after reranking
    limit=20            # But return max(20, 5) = 20 results
)
# Returns 20 results (max of limit and rerank_return_k)
assert len(results) == 20
```

Step 4: Prompt Augmentation

```python
def build_rag_prompt(
    query: str,
    results: List[SearchResult],
    system_instructions: str = "Answer based only on the provided context."
) -> str:
    """Build RAG prompt with retrieved context."""

    # Format context from search results
    context_parts = []
    for i, result in enumerate(results, 1):
        # Include source information
        source = result.metadata.get("source", "Unknown")
        doc_name = result.metadata.get("document_name", "")

        # Use summary if available (more concise for prompts)
        # Summary is populated when ChunkingConfig.enable_chunk_summaries=True
        text = result.summary or result.content

        context_parts.append(
            f"[Source {i}: {doc_name or source}]\n{text}"
        )

    context = "\n\n".join(context_parts)

    # Build final prompt
    prompt = f"""{system_instructions}

Context:
{context}

Question: {query}

Answer:"""

    return prompt
```

Advanced Prompt Patterns

With Citation Requirements:

```python
def build_prompt_with_citations(query: str, results: List[SearchResult]) -> str:
    context_parts = []
    for i, result in enumerate(results, 1):
        source = result.metadata.get("document_name", f"Source {i}")
        # Use summary if enabled (see enabling summaries section below)
        text = result.summary or result.content
        context_parts.append(f"[{i}] {source}: {text}")

    context = "\n\n".join(context_parts)

    prompt = f"""Answer the question using the provided context. Cite sources using [number] format.

Context:
{context}

Question: {query}

Answer (with citations):"""

    return prompt
```

With Metadata Filtering:

```python
async def rag_with_filters(
    memory: LLMemory,
    owner_id: str,
    query: str,
    category: str
):
    """RAG with metadata filtering."""
    results = await memory.search(
        owner_id=owner_id,
        query_text=query,
        search_type=SearchType.HYBRID,
        metadata_filter={"category": category},  # Filter by category
        rerank=True,
        limit=5
    )

    return build_rag_prompt(query, results)
```

Step 5: LLM Generation

```python
from openai import AsyncOpenAI

async def generate_rag_response(
    query: str,
    results: List[SearchResult],
    model: str = "gpt-4"
) -> dict:
    """Generate LLM response with RAG context."""

    # Build prompt
    prompt = build_rag_prompt(query, results)

    # Generate with OpenAI
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that answers questions based on provided context."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.3,  # Lower temperature for factual answers
        max_tokens=500
    )

    # Extract response
    answer = response.choices[0].message.content

    return {
        "answer": answer,
        "sources": [
            {
                "content": r.content[:200] + "...",
                "score": r.score,
                "metadata": r.metadata
            }
            for r in results
        ],
        "model": model
    }
```

Complete RAG System Example

```python
from llmemory import LLMemory, SearchType, DocumentType
from openai import AsyncOpenAI
from typing import List, Dict, Any

class RAGSystem:
    """Complete RAG system with llmemory."""

    def __init__(self, connection_string: str, openai_api_key: str):
        self.memory = LLMemory(
            connection_string=connection_string,
            openai_api_key=openai_api_key
        )
        self.client = AsyncOpenAI(api_key=openai_api_key)
        self.initialized = False

    async def initialize(self):
        """Initialize the RAG system."""
        await self.memory.initialize()
        self.initialized = True

    async def ingest_document(
        self,
        owner_id: str,
        document_name: str,
        content: str,
        document_type: DocumentType = DocumentType.TEXT,
        metadata: Dict[str, Any] = None
    ):
        """Add a document to the knowledge base."""
        result = await self.memory.add_document(
            owner_id=owner_id,
            id_at_origin="rag_kb",
            document_name=document_name,
            document_type=document_type,
            content=content,
            metadata=metadata or {},
            generate_embeddings=True
        )
        return {
            "document_id": str(result.document.document_id),
            "chunks_created": result.chunks_created
        }

    async def answer_question(
        self,
        owner_id: str,
        question: str,
        top_k: int = 5,
        model: str = "gpt-4"
    ) -> Dict[str, Any]:
        """Answer a question using RAG."""

        # Retrieve relevant chunks
        results = await self.memory.search(
            owner_id=owner_id,
            query_text=question,
            search_type=SearchType.HYBRID,
            query_expansion=True,
            max_query_variants=3,
            rerank=True,
            rerank_top_k=20,
            rerank_return_k=top_k,
            limit=top_k
        )

        if not results:
            return {
                "answer": "I don't have enough information to answer this question.",
                "sources": [],
                "confidence": "low"
            }

        # Build prompt
        context = "\n\n".join([
            f"[Source: {r.metadata.get('document_name', 'Unknown')}]\n{r.summary or r.content}"
            for r in results
        ])

        prompt = f"""Answer the question using only the provided context. If the answer cannot be found in the context, say so.

Context:
{context}

Question: {question}

Answer:"""

        # Generate response
        response = await self.client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3
        )

        answer = response.choices[0].message.content

        # Determine confidence based on scores
        avg_score = sum(r.score for r in results) / len(results)
        confidence = "high" if avg_score > 0.5 else "medium" if avg_score > 0.3 else "low"

        return {
            "answer": answer,
            "sources": [
                {
                    "document_name": r.metadata.get("document_name"),
                    "content_preview": r.content[:150] + "...",
                    "score": r.score,
                    "similarity": r.similarity,
                    "rerank_score": r.rerank_score  # Populated when rerank=True
                }
                for r in results
            ],
            "confidence": confidence,
            "model": model
        }

    async def close(self):
        """Clean up resources."""
        await self.memory.close()

# Usage
async def main():
    rag = RAGSystem(
        connection_string="postgresql://localhost/mydb",
        openai_api_key="sk-..."
    )
    await rag.initialize()

    # Ingest documents
    await rag.ingest_document(
        owner_id="user-123",
        document_name="product_guide.md",
        content="...",
        document_type=DocumentType.MARKDOWN,
        metadata={"category": "guide"}
    )

    # Answer questions
    result = await rag.answer_question(
        owner_id="user-123",
        question="How do I reset my password?",
        top_k=5
    )

    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Sources: {len(result['sources'])}")

    await rag.close()
```

RAG Best Practices

1. Chunk Size Optimization

```python
from llmemory import ChunkingConfig, LLMemoryConfig

# For RAG, use smaller chunks (better precision)
chunking_config = ChunkingConfig(
    chunk_size=300,           # 300 tokens (vs 1000 default)
    chunk_overlap=50,         # 50 tokens overlap
    strategy="hierarchical",  # Create parent/child chunks
    min_chunk_size=100,       # Minimum chunk size
    max_chunk_size=500        # Maximum chunk size
)

# Enable summaries via LLMemoryConfig (set when creating LLMemory)
# config = LLMemoryConfig()
# config.chunking.enable_chunk_summaries = True
# config.chunking.summary_max_tokens = 80

await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="doc.md",
    document_type=DocumentType.MARKDOWN,
    content="...",
    chunking_config=chunking_config
)

# Smaller chunks:
# - More precise retrieval
# - Better for prompts (fit more sources)
# - Less noise in context
#
# Larger chunks:
# - More context per chunk
# - Better for broad questions
# - Fewer chunks needed
```

2. Use Parent Context for Broader Context

```python
# Retrieve with parent context
results = await memory.search(
    owner_id="workspace-1",
    query_text="API authentication",
    search_type=SearchType.HYBRID,
    include_parent_context=True,  # Include surrounding chunks
    context_window=2,             # ±2 chunks
    limit=5
)

# Build prompt with parent context
for result in results:
    print(f"Main chunk: {result.content}")
    if result.parent_chunks:
        print(f"Context from {len(result.parent_chunks)} parent chunks")
        for parent in result.parent_chunks:
            print(f"  - {parent.content[:100]}...")
```

3. Reranking for Quality

Always use reranking in RAG for better relevance:

```python
# Without reranking (lower quality)
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=False,
    limit=5
)

# With reranking (higher quality)
results = await memory.search(
    owner_id="workspace-1",
    query_text="query",
    rerank=True,
    rerank_top_k=20,     # Consider top 20 candidates
    rerank_return_k=10,  # Prefer 10 best after reranking
    limit=5              # Final count = max(5, 10) = 10 results
)

# Reranking improves:
# - Relevance of top results
# - Precision for RAG prompts
# - Reduces hallucination (better context)
```

4. Query Expansion for Recall

```python
# Use multi-query for better recall
results = await memory.search(
    owner_id="workspace-1",
    query_text="reduce latency",
    query_expansion=True,  # Generates variants like "improve response time"
    max_query_variants=3,
    rerank=True,           # Rerank after fusion
    limit=5
)

# Good for:
# - Vague queries
# - Different terminology in docs
# - Comprehensive answers
```

5. Metadata for Filtering

```python
# Add rich metadata during ingestion
await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="api_v2_docs.md",
    document_type=DocumentType.TECHNICAL_DOC,
    content="...",
    metadata={
        "category": "api",
        "version": "2.0",
        "language": "python",
        "last_updated": "2024-10-01"
    }
)

# Filter during retrieval
results = await memory.search(
    owner_id="workspace-1",
    query_text="authentication",
    metadata_filter={
        "category": "api",
        "version": "2.0"
    },
    limit=5
)
```

6. Enable and Use Chunk Summaries

Chunk summaries provide concise representations of chunks, making prompts more efficient by reducing token usage while preserving key information.

Enabling Summaries:

```python
from llmemory import LLMemory, ChunkingConfig, LLMemoryConfig, DocumentType

# Enable summaries via LLMemoryConfig
config = LLMemoryConfig()
config.chunking.enable_chunk_summaries = True
config.chunking.summary_max_tokens = 80  # Control summary length

memory = LLMemory(
    connection_string="postgresql://localhost/mydb",
    config=config
)
await memory.initialize()

# Use custom chunking config for chunk size settings
chunking_config = ChunkingConfig(
    chunk_size=300,
    chunk_overlap=50,
    strategy="hierarchical"
)

await memory.add_document(
    owner_id="workspace-1",
    id_at_origin="kb",
    document_name="doc.md",
    document_type=DocumentType.MARKDOWN,
    content="...",
    chunking_config=chunking_config
)
```

Using Summaries in Prompts:

```python
def build_prompt_with_summaries(query: str, results: List[SearchResult]):
    """Build prompt using chunk summaries when available."""
    context_parts = []
    for result in results:
        # SearchResult.summary is populated when enable_chunk_summaries=True
        # Falls back to full content if summaries weren't generated
        text = result.summary or result.content
        context_parts.append(text)

    context = "\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
```

Benefits:

  • Reduced prompt token usage (summaries are ~30-50% of original size)
  • More chunks fit in context window
  • Faster LLM processing
  • Preserved key information for accurate answers

RAG Evaluation

Measuring Retrieval Quality

```python
async def evaluate_retrieval(
    memory: LLMemory,
    owner_id: str,
    test_queries: List[Dict[str, Any]]
):
    """Evaluate retrieval quality."""

    metrics = {
        "precision_at_5": [],
        "recall": [],
        "mrr": []  # Mean Reciprocal Rank
    }

    for test in test_queries:
        query = test["query"]
        relevant_doc_ids = set(test["relevant_docs"])

        # Retrieve
        results = await memory.search(
            owner_id=owner_id,
            query_text=query,
            rerank=True,
            limit=10
        )

        # Calculate precision@5
        top_5_docs = {str(r.document_id) for r in results[:5]}
        precision = len(top_5_docs & relevant_doc_ids) / 5
        metrics["precision_at_5"].append(precision)

        # Calculate recall
        retrieved_docs = {str(r.document_id) for r in results}
        recall = len(retrieved_docs & relevant_doc_ids) / len(relevant_doc_ids)
        metrics["recall"].append(recall)

        # Calculate MRR
        for rank, result in enumerate(results, 1):
            if str(result.document_id) in relevant_doc_ids:
                metrics["mrr"].append(1.0 / rank)
                break
        else:
            metrics["mrr"].append(0.0)

    return {
        "avg_precision_at_5": sum(metrics["precision_at_5"]) / len(test_queries),
        "avg_recall": sum(metrics["recall"]) / len(test_queries),
        "mean_reciprocal_rank": sum(metrics["mrr"]) / len(test_queries)
    }
```

Related Skills

  • basic-usage - Core document and search operations
  • hybrid-search - Vector + BM25 hybrid search fundamentals
  • multi-query - Query expansion for improved retrieval
  • multi-tenant - Multi-tenant isolation patterns for SaaS

Important Notes

RAG Pipeline Optimization: LLM generation dominates end-to-end latency in the complete pipeline (retrieve → rerank → generate). Typical per-stage timings:

  • Retrieval: 50-150ms
  • Reranking: 50-200ms (depending on provider)
  • LLM generation: 500-2000ms

Retrieval plus reranking therefore usually add 100-350ms before generation begins.

Chunk Size for RAG: Smaller chunks (200-400 tokens) work better for RAG than larger chunks:

  • More precise retrieval
  • Less noise in context
  • More chunks fit in prompt
  • Better for specific questions

Multi-Tenant RAG: Always use owner_id for data isolation in multi-tenant RAG systems. Never expose one tenant's documents to another.

Reranking ROI: Reranking adds 50-200ms but significantly improves answer quality by ensuring the most relevant chunks appear first in the prompt, reducing hallucination and improving accuracy.
