What actually breaks when you run a RAG pipeline in production

// Building an on-premise RAG system for HIPAA-sensitive documents at UHG — the failure modes nobody writes about until they ship.

The part of RAG that gets discussed most — model choice, chunk size, similarity threshold — is not the part that breaks in production. What breaks is everything around it.

At UnitedHealth Group, the constraint was strict: no external API calls, all processing on-premise, zero data exposure. That ruled out hosted embeddings and most of the tooling the community assumes. We ran quantized Llama 3 and Mistral via FastAPI on internal hardware, with pgvector and Chroma as the retrieval layer.

The first failure mode was chunking. Healthcare documents do not follow clean paragraph boundaries: they contain dense clinical shorthand, structured tables, and cross-references that naive text splitters cut straight through. A chunk that looks coherent in isolation often loses its meaning without the surrounding section header. A dosage line, for instance, says little once it is separated from the section that names what it belongs to.
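One way to keep the header attached is to split on section boundaries first and carry the header along with every chunk. The sketch below is illustrative, not the splitter we ran: the header convention (an all-caps line ending in a colon), the `Chunk` type, `chunk_by_section`, and the `max_chars` budget are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Assumed header convention: an all-caps line ending in a colon, e.g. "MEDICATIONS:".
# Real clinical documents will need rules derived from the actual corpus.
HEADER_RE = re.compile(r"^[A-Z][A-Z /&-]{2,60}:$")


@dataclass
class Chunk:
    header: str  # section header the text was found under ("" if none seen yet)
    text: str    # chunk body, without the header

    def for_embedding(self) -> str:
        """Text to embed: the header is prepended so the chunk keeps its context."""
        return f"{self.header}\n{self.text}" if self.header else self.text


def chunk_by_section(document: str, max_chars: int = 400) -> list[Chunk]:
    """Split on section headers, then by size, never letting a chunk span sections."""
    chunks: list[Chunk] = []
    header = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines under the header that was active when they were read.
        if buffer:
            chunks.append(Chunk(header, "\n".join(buffer)))
            buffer.clear()

    for raw in document.splitlines():
        line = raw.strip()
        if not line:
            continue
        if HEADER_RE.match(line):
            flush()  # close the previous section before switching headers
            header = line.rstrip(":")
            continue
        # Start a new chunk when adding this line would exceed the size budget.
        if buffer and sum(len(l) + 1 for l in buffer) + len(line) > max_chars:
            flush()
        buffer.append(line)
    flush()
    return chunks
```

Embedding `chunk.for_embedding()` rather than `chunk.text` means a short line such as an allergy code is indexed together with the section it came from. Tables and cross-references need their own handling and are not covered by this sketch.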

The second failure mode was retrieval confidence. Vector similarity scores are not calibrated probabilities. A score of 0.82 might indicate a strong match, or a passage that merely shares vocabulary with the query and leads the model to a plausible-sounding hallucination. We added a post-retrieval reranking step and confidence thresholds that had to be met before any answer was surfaced to a downstream agent.
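The shape of that gate can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production reranker: `token_overlap` is a toy stand-in for a real reranking model, and `Candidate`, `rerank_and_gate`, `min_score`, and `top_k` are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    text: str
    similarity: float  # raw vector similarity from the retriever; not a probability


def token_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a reranker: fraction of query tokens present in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank_and_gate(
    query: str,
    candidates: list[Candidate],
    scorer: Callable[[str, str], float] = token_overlap,
    min_score: float = 0.5,  # assumed threshold; tune it on labelled queries
    top_k: int = 3,
) -> list[tuple[float, Candidate]]:
    """Rescore candidates against the query and keep only those above the threshold.

    An empty result means "abstain": nothing should be passed downstream.
    """
    scored = [(scorer(query, c.text), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, cand) for score, cand in scored[:top_k] if score >= min_score]


if __name__ == "__main__":
    query = "metformin dose for renal impairment"
    candidates = [
        Candidate("patient reports dose of lisinopril unchanged", 0.82),
        Candidate("reduce metformin dose for renal impairment stage 3", 0.74),
    ]
    for score, cand in rerank_and_gate(query, candidates):
        print(f"{score:.2f}  {cand.text}")
```

In this example the candidate with the higher vector similarity (0.82) is dropped and the 0.74 one is kept, because the gate ignores the raw similarity and relies on the second-stage score. Returning an empty list rather than the best available passage is the point: the downstream agent gets nothing instead of a weak match.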

The lesson: RAG is a data pipeline with an LLM in the middle, not an LLM with a database attached. The engineering discipline is in the pipeline, not the model.
