RAGAS (Retrieval Augmented Generation Assessment) is a framework for reference-free evaluation of RAG systems using LLMs. This evaluation reports the following RAGAS metrics:
| Metric | What It Measures | Good Score |
|---|---|---|
| Faithfulness | Is the answer factually accurate based on retrieved context? | > 0.80 |
| Answer Relevance | Is the answer relevant to the user's question? | > 0.80 |
| Context Recall | Was all relevant information retrieved from documents? | > 0.80 |
| Context Precision | Is retrieved context clean without irrelevant noise? | > 0.80 |
| RAGAS Score | Overall quality metric (average of above) | > 0.80 |
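The table describes the RAGAS Score as the average of the four metrics. A minimal sketch of that arithmetic, assuming an unweighted mean; the function name `ragas_score` and the constant `GOOD_SCORE` are illustrative and not taken from the evaluation script:

```python
from statistics import mean

# Threshold from the "Good Score" column of the table above.
GOOD_SCORE = 0.80


def ragas_score(faithfulness: float, answer_relevance: float,
                context_recall: float, context_precision: float) -> float:
    """Unweighted mean of the four per-question metrics (assumed weighting)."""
    return mean([faithfulness, answer_relevance, context_recall, context_precision])


# Metric values of question 2 in the evaluation result sample at the end of this document.
score = ragas_score(0.8500, 0.5790, 1.0000, 1.0000)
print(f"RAGAS score: {score:.4f} (good: {score > GOOD_SCORE})")
```

A single weak metric (Answer Relevance at 0.5790 here) can be masked by the average, so it is worth reading the per-metric columns as well as the overall score.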
lightrag/evaluation/
├── eval_rag_quality.py        # Main evaluation script
├── sample_dataset.json        # 3 test questions about LightRAG
├── sample_documents/          # Matching markdown files for testing
│   ├── 01_lightrag_overview.md
│   ├── 02_rag_architecture.md
│   ├── 03_lightrag_improvements.md
│   ├── 04_supported_databases.md
│   ├── 05_evaluation_and_deployment.md
│   └── README.md
├── __init__.py                # Package init
├── results/                   # Output directory
│   ├── results_YYYYMMDD_HHMMSS.json    # Raw metrics in JSON
│   └── results_YYYYMMDD_HHMMSS.csv     # Metrics in CSV format
└── README.md                  # This file
Quick Test: Index files from sample_documents/ into LightRAG, then run the evaluator to reproduce results (~89-100% RAGAS score per question).
pip install ragas datasets langfuse
Or use your project dependencies (already included in pyproject.toml):
pip install -e ".[evaluation]"
Basic usage (uses defaults):
cd /path/to/LightRAG
python lightrag/evaluation/eval_rag_quality.py
Specify custom dataset:
python lightrag/evaluation/eval_rag_quality.py --dataset my_test.json
Specify custom RAG endpoint:
python lightrag/evaluation/eval_rag_quality.py --ragendpoint http://my-server.com:9621
Specify both (short form):
python lightrag/evaluation/eval_rag_quality.py -d my_test.json -r http://localhost:9621
Get help:
python lightrag/evaluation/eval_rag_quality.py --help
Results are saved automatically in lightrag/evaluation/results/:
results/
├── results_20241023_143022.json    ← Raw metrics in JSON format
└── results_20241023_143022.csv     ← Metrics in CSV format (for spreadsheets)
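Because each run writes a new timestamped pair of files, it can be handy to pick up the most recent one programmatically. A small sketch, assuming only the `results_YYYYMMDD_HHMMSS` naming shown above; the helper `latest_results_file` is hypothetical and not part of the evaluation script:

```python
from pathlib import Path
from typing import Optional


def latest_results_file(results_dir: str, suffix: str = ".json") -> Optional[Path]:
    """Return the newest results_YYYYMMDD_HHMMSS file, or None if none exist."""
    candidates = sorted(Path(results_dir).glob(f"results_*{suffix}"))
    # The YYYYMMDD_HHMMSS timestamp sorts lexicographically in chronological order.
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    print(latest_results_file("lightrag/evaluation/results"))
```

Pass `suffix=".csv"` to locate the spreadsheet-friendly file instead.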
Results include:
The evaluation script supports command-line arguments for easy configuration:
| Argument | Short | Default | Description |
|---|---|---|---|
| --dataset | -d | sample_dataset.json | Path to test dataset JSON file |
| --ragendpoint | -r | http://localhost:9621 or $LIGHTRAG_API_URL | LightRAG API endpoint URL |
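For readers wrapping the evaluator in their own tooling, the two arguments above could be modeled with `argparse` roughly as follows. This is an illustrative sketch, not the script's actual parser; in particular it assumes `$LIGHTRAG_API_URL` takes precedence over the hard-coded `http://localhost:9621` when both are available:

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(description="RAGAS evaluation (illustrative sketch)")
    parser.add_argument(
        "--dataset", "-d",
        default="sample_dataset.json",
        help="Path to test dataset JSON file",
    )
    parser.add_argument(
        "--ragendpoint", "-r",
        # Assumption: the environment variable wins over the built-in default.
        default=os.getenv("LIGHTRAG_API_URL", "http://localhost:9621"),
        help="LightRAG API endpoint URL",
    )
    return parser


args = build_parser().parse_args(["-d", "my_dataset.json"])
print(args.dataset, args.ragendpoint)
```

An explicit `-r` value on the command line overrides both the environment variable and the default.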
Use default dataset and endpoint:
python lightrag/evaluation/eval_rag_quality.py
Custom dataset with default endpoint:
python lightrag/evaluation/eval_rag_quality.py --dataset path/to/my_dataset.json
Default dataset with custom endpoint:
python lightrag/evaluation/eval_rag_quality.py --ragendpoint http://my-server.com:9621
Custom dataset and endpoint:
python lightrag/evaluation/eval_rag_quality.py -d my_dataset.json -r http://localhost:9621
Absolute path to dataset:
python lightrag/evaluation/eval_rag_quality.py -d /path/to/custom_dataset.json
Show help message:
python lightrag/evaluation/eval_rag_quality.py --help
The evaluation framework supports customization through environment variables:
⚠️ IMPORTANT: Both LLM and Embedding endpoints MUST be OpenAI-compatible
| Variable | Default | Description |
|---|---|---|
| LLM Configuration | | |
| EVAL_LLM_MODEL | gpt-4o-mini | LLM model used for RAGAS evaluation |
| EVAL_LLM_BINDING_API_KEY | falls back to OPENAI_API_KEY | API key for LLM evaluation |
| EVAL_LLM_BINDING_HOST | (optional) | Custom OpenAI-compatible endpoint URL for LLM |
| Embedding Configuration | | |
| EVAL_EMBEDDING_MODEL | text-embedding-3-large | Embedding model for evaluation |
| EVAL_EMBEDDING_BINDING_API_KEY | falls back to EVAL_LLM_BINDING_API_KEY → OPENAI_API_KEY | API key for embeddings |
| EVAL_EMBEDDING_BINDING_HOST | falls back to EVAL_LLM_BINDING_HOST | Custom OpenAI-compatible endpoint URL for embeddings |
| Performance Tuning | | |
| EVAL_MAX_CONCURRENT | 2 | Number of concurrent test case evaluations (1=serial) |
| EVAL_QUERY_TOP_K | 10 | Number of documents to retrieve per query |
| EVAL_LLM_MAX_RETRIES | 5 | Maximum LLM request retries |
| EVAL_LLM_TIMEOUT | 180 | LLM request timeout in seconds |
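The fallback chains in the table can be easier to follow as code. The sketch below resolves the variables in the documented order; the function name `resolve_eval_config` and the returned dictionary keys are hypothetical, and the evaluation script may structure this differently:

```python
import os
from typing import Mapping


def resolve_eval_config(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve evaluation settings using the documented defaults and fallbacks."""
    llm_key = env.get("EVAL_LLM_BINDING_API_KEY") or env.get("OPENAI_API_KEY")
    llm_host = env.get("EVAL_LLM_BINDING_HOST")  # None means the OpenAI official API
    return {
        "llm_model": env.get("EVAL_LLM_MODEL", "gpt-4o-mini"),
        "llm_api_key": llm_key,
        "llm_host": llm_host,
        "embedding_model": env.get("EVAL_EMBEDDING_MODEL", "text-embedding-3-large"),
        # Embedding key: own variable, then the LLM key, then OPENAI_API_KEY.
        "embedding_api_key": env.get("EVAL_EMBEDDING_BINDING_API_KEY") or llm_key,
        # Embedding host inherits the LLM host unless set explicitly.
        "embedding_host": env.get("EVAL_EMBEDDING_BINDING_HOST") or llm_host,
        "max_concurrent": int(env.get("EVAL_MAX_CONCURRENT", "2")),
        "query_top_k": int(env.get("EVAL_QUERY_TOP_K", "10")),
        "llm_max_retries": int(env.get("EVAL_LLM_MAX_RETRIES", "5")),
        "llm_timeout": int(env.get("EVAL_LLM_TIMEOUT", "180")),
    }


config = resolve_eval_config({
    "EVAL_LLM_BINDING_API_KEY": "your-custom-key",
    "EVAL_LLM_BINDING_HOST": "http://localhost:8000/v1",
})
print(config["embedding_host"], config["max_concurrent"])
```

Passing a plain dictionary instead of `os.environ` makes the fallback behavior easy to check without touching the real environment.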
Example 1: Default Configuration (OpenAI Official API)
export OPENAI_API_KEY=sk-xxx
python lightrag/evaluation/eval_rag_quality.py
Both LLM and embeddings use OpenAI's official API with default models.
Example 2: Custom Models on OpenAI
export OPENAI_API_KEY=sk-xxx
export EVAL_LLM_MODEL=gpt-4o-mini
export EVAL_EMBEDDING_MODEL=text-embedding-3-large
python lightrag/evaluation/eval_rag_quality.py
Example 3: Same Custom OpenAI-Compatible Endpoint for Both
# Both LLM and embeddings use the same custom endpoint
export EVAL_LLM_BINDING_API_KEY=your-custom-key
export EVAL_LLM_BINDING_HOST=http://localhost:8000/v1
export EVAL_LLM_MODEL=qwen-plus
export EVAL_EMBEDDING_MODEL=BAAI/bge-m3
python lightrag/evaluation/eval_rag_quality.py
Embeddings automatically inherit LLM endpoint configuration.
Example 4: Separate Endpoints (Cost Optimization)
# Use OpenAI for LLM (high quality)
export EVAL_LLM_BINDING_API_KEY=sk-openai-key
export EVAL_LLM_MODEL=gpt-4o-mini
# No EVAL_LLM_BINDING_HOST means use OpenAI official API
# Use local vLLM for embeddings (cost-effective)
export EVAL_EMBEDDING_BINDING_API_KEY=local-key
export EVAL_EMBEDDING_BINDING_HOST=http://localhost:8001/v1
export EVAL_EMBEDDING_MODEL=BAAI/bge-m3
python lightrag/evaluation/eval_rag_quality.py
LLM uses OpenAI official API, embeddings use local custom endpoint.
Example 5: Different Custom Endpoints for LLM and Embeddings
# LLM on one OpenAI-compatible server
export EVAL_LLM_BINDING_API_KEY=key1
export EVAL_LLM_BINDING_HOST=http://llm-server:8000/v1
export EVAL_LLM_MODEL=custom-llm
# Embeddings on another OpenAI-compatible server
export EVAL_EMBEDDING_BINDING_API_KEY=key2
export EVAL_EMBEDDING_BINDING_HOST=http://embedding-server:8001/v1
export EVAL_EMBEDDING_MODEL=custom-embedding
python lightrag/evaluation/eval_rag_quality.py
Both use different custom OpenAI-compatible endpoints.
Example 6: Using Environment Variables from .env File
# Create .env file in project root
cat > .env << EOF
EVAL_LLM_BINDING_API_KEY=your-key
EVAL_LLM_BINDING_HOST=http://localhost:8000/v1
EVAL_LLM_MODEL=qwen-plus
EVAL_EMBEDDING_MODEL=BAAI/bge-m3
EOF
# Run evaluation (automatically loads .env)
python lightrag/evaluation/eval_rag_quality.py
The evaluation framework includes built-in concurrency control to prevent API rate limiting issues:
Why Concurrency Control Matters:
Default Configuration (Conservative):
EVAL_MAX_CONCURRENT=2 # Two test cases evaluated at a time (set to 1 for serial evaluation)
EVAL_QUERY_TOP_K=10 # TOP_K query parameter of LightRAG
EVAL_LLM_MAX_RETRIES=5 # Retry failed requests 5 times
EVAL_LLM_TIMEOUT=180 # 3-minute timeout per request
Common Issues and Solutions:
| Issue | Solution |
|---|---|
| Warning: "LM returned 1 generations instead of 3" | Reduce EVAL_MAX_CONCURRENT to 1 or decrease EVAL_QUERY_TOP_K |
| Context Precision returns NaN | Lower EVAL_QUERY_TOP_K to reduce LLM calls per test case |
| Rate limit errors (429) | Increase EVAL_LLM_MAX_RETRIES and decrease EVAL_MAX_CONCURRENT |
| Request timeouts | Increase EVAL_LLM_TIMEOUT above the default of 180 seconds |
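To illustrate what a cap such as EVAL_MAX_CONCURRENT does, here is a generic sketch of bounded concurrency with `asyncio.Semaphore`. It is not the evaluation script's implementation; `evaluate_all` and `fake_evaluate` are hypothetical names, and the fake evaluator merely sleeps in place of real LLM calls:

```python
import asyncio


async def evaluate_all(test_cases, evaluate_one, max_concurrent: int = 2):
    """Run evaluate_one over all cases with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(case):
        # Each case waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await evaluate_one(case)

    # gather() returns results in the same order as the input cases.
    return await asyncio.gather(*(guarded(case) for case in test_cases))


async def fake_evaluate(case):
    """Hypothetical stand-in for a real per-question evaluation."""
    await asyncio.sleep(0.01)
    return {"question": case, "ragas_score": 1.0}


results = asyncio.run(evaluate_all(["q1", "q2", "q3"], fake_evaluate, max_concurrent=2))
print(len(results), "test cases evaluated")
```

With `max_concurrent=1` the cases run strictly one after another, which is the setting recommended above when rate-limit warnings appear.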
sample_dataset.json contains 3 generic questions about LightRAG. Replace with questions matching YOUR indexed documents.
Custom Test Cases:
{
  "test_cases": [
    {
      "question": "Your question here",
      "ground_truth": "Expected answer from your data",
      "project": "evaluation_project_name"
    }
  ]
}
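Before running a long evaluation it can help to sanity-check a custom dataset. The sketch below parses the structure shown above and rejects incomplete entries. Treating `question` and `ground_truth` as required and `project` as optional is an assumption made for illustration, and `load_test_cases` is a hypothetical helper:

```python
import json

# Assumption: these two fields are mandatory; "project" is treated as optional.
REQUIRED_FIELDS = ("question", "ground_truth")


def load_test_cases(raw_json: str) -> list:
    """Parse a dataset string and verify every test case has the required fields."""
    data = json.loads(raw_json)
    cases = data.get("test_cases", [])
    for index, case in enumerate(cases):
        missing = [field for field in REQUIRED_FIELDS if not case.get(field)]
        if missing:
            raise ValueError(f"test case {index} is missing: {', '.join(missing)}")
    return cases


sample = """
{
  "test_cases": [
    {
      "question": "Your question here",
      "ground_truth": "Expected answer from your data",
      "project": "evaluation_project_name"
    }
  ]
}
"""
cases = load_test_cases(sample)
print(f"{len(cases)} valid test case(s)")
```

To validate a file, read it first, for example `load_test_cases(open("my_test.json", encoding="utf-8").read())`.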
| Metric | Low Score Indicates |
|---|---|
| Faithfulness | Responses contain hallucinations or incorrect information |
| Answer Relevance | Answers don't match what users asked |
| Context Recall | Missing important information in retrieval |
| Context Precision | Retrieved documents contain irrelevant noise |
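The interpretation table can be applied mechanically to a set of scores. A small sketch, assuming the 0.80 threshold from the metrics table at the top of this document; the metric keys and the `diagnose` helper are hypothetical and do not reflect the schema of the results files:

```python
GOOD_SCORE = 0.80

# Meanings copied from the table above; the keys are illustrative names.
LOW_SCORE_MEANING = {
    "faithfulness": "Responses contain hallucinations or incorrect information",
    "answer_relevance": "Answers don't match what users asked",
    "context_recall": "Missing important information in retrieval",
    "context_precision": "Retrieved documents contain irrelevant noise",
}


def diagnose(scores: dict, threshold: float = GOOD_SCORE) -> list:
    """List the metrics that are not above the threshold (NaN or missing count as low)."""
    findings = []
    for metric, meaning in LOW_SCORE_MEANING.items():
        value = scores.get(metric, float("nan"))
        # "not (value > threshold)" is also true for NaN, so NaN scores are flagged.
        if not (value > threshold):
            findings.append(f"{metric}: {meaning}")
    return findings


for finding in diagnose({
    "faithfulness": 0.8500,
    "answer_relevance": 0.5790,
    "context_recall": 1.0000,
    "context_precision": 1.0000,
}):
    print(finding)
```

Flagging NaN as low is a deliberate choice here, since Context Precision can return NaN when LLM calls fail.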
Low Faithfulness:
Low Answer Relevance:
Low Context Recall:
top_k results
Low Context Precision:
pip install ragas datasets
Cause: This warning indicates API rate limiting or concurrent request overload:
EVAL_QUERY_TOP_K=10, that's 10 calls)
EVAL_MAX_CONCURRENT × LLM calls per test
Solutions (in order of effectiveness):
Serial Evaluation (one test case at a time):
export EVAL_MAX_CONCURRENT=1
python lightrag/evaluation/eval_rag_quality.py
Reduce Retrieved Documents:
export EVAL_QUERY_TOP_K=5 # Halves Context Precision LLM calls
python lightrag/evaluation/eval_rag_quality.py
Increase Retry & Timeout:
export EVAL_LLM_MAX_RETRIES=10
export EVAL_LLM_TIMEOUT=180 # Default value; raise it if requests still time out
python lightrag/evaluation/eval_rag_quality.py
Use Higher Quota API (if available):
This error occurs with RAGAS 0.3.x when LLM and Embeddings are not explicitly configured. The evaluation framework now handles this automatically by:
Solution: Ensure you have set one of the following:
- OPENAI_API_KEY environment variable (default)
- EVAL_LLM_BINDING_API_KEY for custom API key

The framework will automatically configure the evaluation models.
Make sure you're running from the project root:
cd /path/to/LightRAG
python lightrag/evaluation/eval_rag_quality.py
The evaluation uses your configured LLM (OpenAI by default). Ensure:
.env
The evaluator queries a running LightRAG API server at http://localhost:9621. Make sure:
python lightrag/api/lightrag_server.py)
python lightrag/evaluation/eval_rag_quality.py
results/ folder
Evaluation Result Sample:
INFO: ======================================================================
INFO: 🔍 RAGAS Evaluation - Using Real LightRAG API
INFO: ======================================================================
INFO: Evaluation Models:
INFO: • LLM Model: gpt-4.1
INFO: • Embedding Model: text-embedding-3-large
INFO: • Endpoint: OpenAI Official API
INFO: Concurrency & Rate Limiting:
INFO: • Query Top-K: 10 Entities/Relations
INFO: • LLM Max Retries: 5
INFO: • LLM Timeout: 180 seconds
INFO: Test Configuration:
INFO: • Total Test Cases: 6
INFO: • Test Dataset: sample_dataset.json
INFO: • LightRAG API: http://localhost:9621
INFO: • Results Directory: results
INFO: ======================================================================
INFO: 🚀 Starting RAGAS Evaluation of LightRAG System
INFO: 🔧 RAGAS Evaluation (Stage 2): 2 concurrent
INFO: ======================================================================
INFO:
INFO: ===================================================================================================================
INFO: 📊 EVALUATION RESULTS SUMMARY
INFO: ===================================================================================================================
INFO: # | Question | Faith | AnswRel | CtxRec | CtxPrec | RAGAS | Status
INFO: -------------------------------------------------------------------------------------------------------------------
INFO: 1 | How does LightRAG solve the hallucination probl... | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | ✓
INFO: 2 | What are the three main components required in ... | 0.8500 | 0.5790 | 1.0000 | 1.0000 | 0.8573 | ✓
INFO: 3 | How does LightRAG's retrieval performance compa... | 0.8056 | 1.0000 | 1.0000 | 1.0000 | 0.9514 | ✓
INFO: 4 | What vector databases does LightRAG support and... | 0.8182 | 0.9807 | 1.0000 | 1.0000 | 0.9497 | ✓
INFO: 5 | What are the four key metrics for evaluating RA... | 1.0000 | 0.7452 | 1.0000 | 1.0000 | 0.9363 | ✓
INFO: 6 | What are the core benefits of LightRAG and how ... | 0.9583 | 0.8829 | 1.0000 | 1.0000 | 0.9603 | ✓
INFO: ===================================================================================================================
INFO:
INFO: ======================================================================
INFO: 📊 EVALUATION COMPLETE
INFO: ======================================================================
INFO: Total Tests: 6
INFO: Successful: 6
INFO: Failed: 0
INFO: Success Rate: 100.00%
INFO: Elapsed Time: 161.10 seconds
INFO: Avg Time/Test: 26.85 seconds
INFO:
INFO: ======================================================================
INFO: 📈 BENCHMARK RESULTS (Average)
INFO: ======================================================================
INFO: Average Faithfulness: 0.9053
INFO: Average Answer Relevance: 0.8646
INFO: Average Context Recall: 1.0000
INFO: Average Context Precision: 1.0000
INFO: Average RAGAS Score: 0.9425
INFO: ----------------------------------------------------------------------
INFO: Min RAGAS Score: 0.8573
INFO: Max RAGAS Score: 1.0000
Happy Evaluating! 🚀