Quick Answer
Retrieval-augmented generation (RAG) lets LLMs answer questions using your documents. Embed document chunks, store them in pgvector or Qdrant, retrieve the top-k matches and rerank them, then pass the best ones to the LLM as context. Always cite sources in the response.
- Chunk size of 500-1000 tokens works for most cases
- Reranking (Cohere, BGE) can improve retrieval quality by 20-40%
- Always display citations — hallucinations kill trust
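The chunk-size guideline above can be sketched as a sliding window with overlap. This is a minimal illustration, not a production splitter: the function name `chunkWithOverlap` is hypothetical, and "tokens" are approximated by whitespace-separated words, whereas a real pipeline would count tokens with the embedding model's own tokenizer.

```typescript
// Sketch only: "tokens" here are whitespace-separated words, which is an
// approximation of what an embedding model's tokenizer would produce.
function chunkWithOverlap(text: string, size = 800, overlap = 100): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size).join(" "));
    // Stop once the window has reached the end of the text.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}

// Usage: 10 words, windows of 4 words with 1 word of overlap.
const demoChunks = chunkWithOverlap("a b c d e f g h i j", 4, 1);
console.log(demoChunks);
```

A fixed window like this can still cut a sentence in half, so it is best treated as a baseline. A splitter that respects sentence or section boundaries is likely to preserve meaning better.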
What You'll Need
- Document corpus (PDFs, markdown, web pages)
- Embedding model (text-embedding-3-small, bge-m3, or assisters-embed)
- Vector DB: pgvector, Qdrant, Weaviate, or Chroma
- LLM via OpenAI-compatible API
Steps
- Ingest and chunk. Use `unstructured` or `langchain` for PDFs. Chunk at 800 tokens with 100 overlap.
- Embed. Batch embed chunks:

      const { data } = await ai.embeddings.create({
        model: 'assisters-embed-v1',
        input: chunks,
      });

- Store in pgvector. `INSERT INTO documents (content, embedding) VALUES (...)`
- Create index. `CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);`
- Query pipeline. Embed user question, vector search top 20, rerank to top 5.
- Rerank. Use Cohere Rerank or BGE reranker:

      const { results } = await ai.rerank.create({
        query,
        documents: candidates,
        top_n: 5,
      });

- Prompt the LLM. System: `Answer using only the provided context. Cite sources with [n].`
- Return with citations. Link back to original documents.
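The last two steps (prompting with numbered context and linking citations back to documents) might look roughly like the sketch below. The `RetrievedChunk` shape and the helper names `buildMessages` and `resolveCitations` are assumptions made for illustration, and the example.com URLs are placeholders. The system prompt is the one from the steps above.

```typescript
// Hypothetical shape for a reranked chunk; field names are assumptions.
interface RetrievedChunk {
  docId: string;
  title: string;
  url: string;
  content: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Number each chunk so the model can cite it as [1], [2], ...
function buildMessages(question: string, chunks: RetrievedChunk[]): ChatMessage[] {
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.title}\n${c.content}`)
    .join("\n\n");
  return [
    {
      role: "system",
      content: "Answer using only the provided context. Cite sources with [n].",
    },
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
  ];
}

// Map the [n] markers found in an answer back to the source chunks.
function resolveCitations(answer: string, chunks: RetrievedChunk[]): RetrievedChunk[] {
  const pattern = /\[(\d+)\]/g;
  const cited: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = Number(match[1]);
    // Ignore out-of-range markers and duplicates.
    if (n >= 1 && n <= chunks.length && cited.indexOf(n) === -1) cited.push(n);
  }
  return cited.sort((a, b) => a - b).map((n) => chunks[n - 1]);
}

// Usage with placeholder data.
const sources: RetrievedChunk[] = [
  { docId: "kb-1", title: "Account basics", url: "https://example.com/kb-1", content: "Accounts are created by email." },
  { docId: "kb-2", title: "Password reset", url: "https://example.com/kb-2", content: "Open Settings and choose Reset." },
];
const promptMessages = buildMessages("How do I reset my password?", sources);
const citedSources = resolveCitations("Open Settings and choose Reset [2].", sources);
console.log(promptMessages[1].content);
console.log(citedSources.map((c) => c.url));
```

Dropping out-of-range markers such as `[9]` is a deliberate choice here: a citation that points at no retrieved chunk is a sign the model invented a source, so it should not be rendered as a link.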
Common Mistakes
- Bad chunking. Splitting mid-sentence destroys meaning. Use semantic chunking.
- No reranking. First-pass vector search is noisy.
- Losing metadata. Always keep doc_id, title, url.
- Ignoring recency. Add time decay for news/social corpora.
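The recency point above can be illustrated with a simple exponential decay applied to retrieval scores. This is one possible weighting, not a prescribed one: the `ScoredHit` shape, the `applyTimeDecay` name, and the 30-day half-life are all assumptions chosen for the example. The hit keeps `docId`, `title`, and `url` so the metadata survives re-scoring.

```typescript
// Hypothetical search hit that carries its metadata through re-scoring.
interface ScoredHit {
  docId: string;
  title: string;
  url: string;
  score: number; // similarity or rerank score, higher is better
  publishedAt: Date;
}

// Exponential decay: a hit loses half its score every `halfLifeDays`.
function applyTimeDecay(hits: ScoredHit[], now: Date, halfLifeDays = 30): ScoredHit[] {
  if (halfLifeDays <= 0) throw new Error("halfLifeDays must be positive");
  const msPerDay = 24 * 60 * 60 * 1000;
  return hits
    .map((hit) => {
      // Clamp at zero so future-dated documents are not boosted.
      const ageDays = Math.max(0, (now.getTime() - hit.publishedAt.getTime()) / msPerDay);
      const decay = Math.pow(0.5, ageDays / halfLifeDays);
      return { ...hit, score: hit.score * decay };
    })
    .sort((a, b) => b.score - a.score);
}

// Usage: an older, stronger match versus a fresh, weaker one.
const referenceTime = new Date(Date.UTC(2026, 0, 31));
const sixtyDaysAgo = new Date(referenceTime.getTime() - 60 * 24 * 60 * 60 * 1000);
const rankedHits = applyTimeDecay(
  [
    { docId: "old", title: "Older post", url: "https://example.com/old", score: 0.9, publishedAt: sixtyDaysAgo },
    { docId: "new", title: "Fresh post", url: "https://example.com/new", score: 0.5, publishedAt: referenceTime },
  ],
  referenceTime,
);
console.log(rankedHits.map((h) => `${h.docId}: ${h.score.toFixed(3)}`));
```

The half-life is the knob to tune: a short one suits news and social feeds, while reference documentation may need a very long one or no decay at all.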
Top Tools
| Tool | Purpose |
|---|---|
| pgvector | SQL + vectors in one DB |
| Qdrant | Dedicated vector DB |
| LangChain / LlamaIndex | Orchestration |
| Cohere Rerank | Reranking API |
| Unstructured | Document parsing |
Conclusion
RAG is the dominant pattern for domain-specific AI in 2026. Start with pgvector + Assisters, add reranking, always cite. Misar Dev builds full RAG stacks in minutes.