# Retrieval Pipeline

How the RAG system finds relevant content for chat queries.
## Pipeline Overview

```text
User Query → Query Processing → Hybrid Search → Ranking → Context → Response
                   ↓                               ↓
            Query Expansion           Reciprocal Rank Fusion
                   ↓                               ↓
          Embedding Generation         Cross-Encoder Rerank
```
## 1. Query Processing

- Clean and normalize input
- Extract key terms for keyword search
- Generate query variations for better recall
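The steps above can be sketched as follows; the function and stopword list are illustrative, not the production implementation (a real system might generate variations with an LLM):

```python
import re

def process_query(raw: str) -> dict:
    """Illustrative query-processing sketch: normalize, extract terms, vary."""
    # Clean and normalize: lowercase, strip punctuation, collapse whitespace.
    cleaned = re.sub(r"[^\w\s]", "", raw.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Extract key terms for keyword search: drop common stop words.
    stopwords = {"the", "a", "an", "of", "for", "to", "in", "is", "how", "do", "i"}
    terms = [t for t in re.findall(r"[a-z0-9]+", cleaned) if t not in stopwords]
    # Generate simple variations for better recall.
    variations = [cleaned, " ".join(terms)]
    return {"cleaned": cleaned, "terms": terms, "variations": variations}
```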
## 2. Embedding Generation

- Model: Azure OpenAI `text-embedding-3-small`
- Dimensions: 1536
### Caching Strategy

Redis (hot, 1-hour TTL) → PostgreSQL (cold, 30-day TTL) → API
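The tiered lookup can be sketched with plain dicts standing in for the real stores (function and parameter names are illustrative; TTL handling is omitted):

```python
def cached_embedding(text, hot, cold, embed_api):
    """Tiered embedding lookup: hot cache (Redis) -> cold store
    (PostgreSQL) -> embedding API. Dicts stand in for the real stores."""
    if text in hot:
        return hot[text]
    if text in cold:
        hot[text] = cold[text]   # promote to the hot tier on a cold hit
        return cold[text]
    vec = embed_api(text)        # full miss: call the embedding API
    hot[text] = vec              # write through both tiers
    cold[text] = vec
    return vec
```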
## 3. Hybrid Search

### Vector Search (pgvector)

Semantic similarity using an HNSW index.

### Keyword Search (Postgres TSVECTOR)

Full-text search for exact matches.
### Why Hybrid?
| Search Type | Strength | Weakness |
|---|---|---|
| Vector | Semantic similarity | May miss exact terms |
| Keyword | Exact matches | No semantic understanding |
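The two retrieval paths can be sketched in memory (hypothetical helper names; in production these are pgvector and TSVECTOR queries, not Python loops):

```python
import math

def vector_search(qvec, docs, k=3):
    """Rank docs by cosine similarity to the query vector
    (in-memory stand-in for the pgvector HNSW index)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return sorted(docs, key=lambda d: cos(qvec, d["vec"]), reverse=True)[:k]

def keyword_search(terms, docs, k=3):
    """Rank docs by exact-term overlap (stand-in for Postgres full-text search)."""
    def hits(d):
        words = d["text"].lower().split()
        return sum(1 for t in terms if t in words)
    return sorted(docs, key=hits, reverse=True)[:k]
```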
## 4. Score Fusion

Reciprocal Rank Fusion (RRF) combines the vector and keyword rankings. Each document's fused score is the sum, over the result lists it appears in, of:

score += 1 / (k + rank)

where `rank` is the document's 1-based position in that list and `k` is a smoothing constant (commonly 60).
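A minimal sketch of the fusion step (function name is illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: a document at 1-based rank r in a result
    list contributes 1 / (k + r); contributions are summed across lists,
    then documents are sorted by fused score, descending."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears high in both lists ("b" below) outranks one that tops only a single list: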
## 5. Reranking

Cross-encoder reranking for better relevance:

- Model: `BAAI/bge-reranker-base`
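The reranking step can be sketched with an injected scoring function; in production `score_fn` would be a cross-encoder (e.g. `BAAI/bge-reranker-base` via sentence-transformers), which scores each query-document pair jointly rather than comparing precomputed embeddings:

```python
def rerank(query, docs, score_fn, top_n=5):
    """Score every (query, doc) pair with score_fn, then keep the
    top_n highest-scoring documents. score_fn is injected so this
    sketch stays self-contained."""
    scored = [(score_fn(query, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_n]]
```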
## 6. Context Assembly

Token budget allocation:
| Component | Budget (tokens) |
|---|---|
| System prompt | 1000 |
| History | 8000 |
| Context | ~110000 |
| Response | 4000 |
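Packing retrieved chunks into the context budget can be sketched greedily (names are illustrative; `count_tokens` is injected, whereas a real system might use a tokenizer such as tiktoken):

```python
# Budgets from the table above.
BUDGETS = {"system": 1000, "history": 8000, "context": 110_000, "response": 4000}

def fit_to_budget(chunks, budget, count_tokens):
    """Pack ranked chunks into the token budget in order; any chunk
    that would overflow the budget is skipped."""
    used, selected = 0, []
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n <= budget:
            selected.append(chunk)
            used += n
    return selected
```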
## Performance Targets
| Metric | Target | Alert |
|---|---|---|
| Total Latency | < 500ms | > 1000ms |
| Vector Search | < 100ms | > 200ms |
| Cache Hit Rate | > 70% | < 50% |