RAG Systems for Classified Document Search

Introduction

Retrieval-augmented generation is transforming how defense and intelligence organizations search classified document repositories, combining large language model reasoning with grounded document retrieval to deliver auditable, source-cited answers — a capability now deployed across the Pentagon’s 3 million personnel through the GenAI.mil platform launched in December 2025.

The intelligence community manages document repositories so vast that no individual analyst can parse them efficiently. Traditional keyword search returns ranked lists of titles; analysts must then open each document, scan for relevance, and synthesize findings manually. Retrieval-augmented generation changes this by retrieving relevant passages from indexed source material and feeding them to a language model as context, producing direct answers grounded in specific documents with citations an analyst can verify.

The foundational RAG architecture was introduced by Lewis et al. at NeurIPS 2020, achieving state-of-the-art results on three open-domain question answering benchmarks — including a 44.5 percent exact match score on Natural Questions, a 10-point improvement over the T5-11B parametric baseline. Since then, the approach has moved from research to operational deployment. In December 2025, the Pentagon launched GenAI.mil, a platform providing generative AI capabilities including retrieval-augmented generation to 3 million employees, warfighters, and contractors, with Google Cloud’s Gemini for Government as the initial model certified for Controlled Unclassified Information at Impact Level 5.

The RAG Architecture

RAG systems combine a retrieval component that searches document indexes with a generative component that synthesizes answers from retrieved passages — an architecture that grounds every output in specific source documents, enabling the citation trails and auditability that classified environments demand.

A RAG pipeline operates in three stages. First, a retrieval model searches a vector index of document chunks to find passages semantically similar to the analyst’s query. Second, the retrieved passages are injected into the language model’s context window as grounding material. Third, the language model generates a response that synthesizes information from the retrieved sources.

This architecture offers three properties critical to defense applications. Auditability: every answer traces to specific source passages the analyst can verify. Currency: new documents enter the retrieval index without retraining the model. Access control: the retrieval layer can enforce classification markings and need-to-know restrictions at the document level, serving only passages the querying analyst is cleared to access.

The Institute for Defense Analyses evaluated RAG performance on the 2024 National Defense Authorization Act text. GPT-4 achieved approximately 80 percent relevancy accuracy across eight test prompts, while the smaller Mistral-7b model failed the relevancy metric more than half the time — demonstrating that model selection directly determines output reliability in classified document synthesis.

Defense Applications

Defense organizations deploy RAG for intelligence triage, threat assessment synthesis, and policy document search across classified networks — with the CIA operating generative AI across all 18 intelligence community agencies and national security legal scholars arguing that RAG fits squarely within existing legal frameworks governing classified data access.

RAG is particularly suited to intelligence analysis workflows where analysts must synthesize information from dozens of sources under time pressure. Rather than searching by keyword and manually scanning results, an analyst queries the system in natural language and receives a synthesized answer with citations to specific source passages.

As Tal Feldman wrote in Just Security at Yale Law School:

“National security agencies can similarly deploy trusted foundation models in air-gapped, classified environments and use RAG to extract insights from sensitive data — without fine-tuning or modifying the models themselves.”

— Tal Feldman, The Law Already Supports AI in Government — RAG Shows the Way, Just Security, May 2025

The IC OSINT Strategy 2024-2026, jointly released by ODNI and the CIA, describes the intelligence community as “already pioneering new uses of artificial intelligence, machine learning, and human language technologies for the OSINT mission” and calls on all 18 IC elements to “expand and accelerate these efforts to sustain a competitive edge.” RAG architectures address the strategy’s core demand: enabling analysts to extract actionable intelligence from data volumes that have outpaced human processing capacity.

Pacific Northwest National Laboratory developed CyRAG and GraphCyRAG — RAG tools that integrate large language models with structured cybersecurity data from CVE, CWE, CAPEC, and MITRE ATT&CK datasets. The knowledge-graph-augmented approach enables deeper traversal of relationships between vulnerabilities and attack patterns, demonstrating RAG’s applicability beyond unstructured text to structured threat intelligence repositories.

Security Architecture Considerations

Classified RAG deployments require air-gapped infrastructure, document-level access controls enforced at the retrieval layer, and compliance with NIST SP 800-172 enhanced security requirements — including physical and logical isolation techniques that separate CUI into distinct security domains with managed interfaces governing all cross-domain communications.

Security architecture for classified RAG systems operates at three layers. The infrastructure layer requires air-gapped deployment on classified networks. Microsoft demonstrated the feasibility of this with its air-gapped GPT-4 deployment for the intelligence community — the first major LLM separated from the internet, designed to handle Top Secret data.

The retrieval layer must enforce classification markings and need-to-know restrictions. When an analyst queries the system, the retriever should only return passages from documents the analyst is cleared to access. This is where RAG provides a structural advantage over fine-tuned models: access control operates on the document store, not the model weights.

The compliance layer maps to NIST SP 800-172 enhanced security requirements for protecting Controlled Unclassified Information. Requirement 3.13.4e mandates that organizations employ physical or logical isolation techniques in systems and components, using boundary protection mechanisms to isolate CUI into separate security domains where additional protections can be implemented. For RAG systems, this means the retrieval index, the language model, and the query interface may each require separate security domains with managed interfaces governing communications between them.

NIST’s AI 100-2 E2025 report expanded the adversarial machine learning taxonomy to cover generative AI threats including direct and indirect prompt injection. For classified RAG systems, indirect prompt injection through source documents — where adversarial content in a retrieved passage causes the model to produce unintended outputs — represents a vector that security architectures must address through input sanitization and output validation.

Performance and Evaluation

The IDA found GPT-4 achieved roughly 80 percent relevancy accuracy in RAG evaluations on defense legislation, while Cisco and NVIDIA demonstrated 7.1 to 7.3 absolute-point NDCG gains from domain-specific embedding fine-tuning — and recent work by DLRA demonstrated that fine-tuned LLMs for retrieval tasks could achieve 94.2 percent relevance accuracy on defense-domain benchmarks.

Evaluating RAG systems in defense contexts requires metrics beyond standard accuracy. The IDA’s evaluation on the 2024 NDAA text measured both relevancy (whether the generated answer addresses the query) and faithfulness (whether the answer stays true to the retrieved source material). Both GPT-4 and Mistral-7b performed well on faithfulness, but relevancy diverged sharply: GPT-4 failed roughly 20 percent of the time, while Mistral-7b failed more than 50 percent.

Domain-specific embedding fine-tuning is the primary lever for improving retrieval quality. Cisco’s evaluation using NVIDIA’s Nemotron embedding recipe demonstrated NDCG@1 gains of 7.1 to 7.3 absolute points — a 9.9 to 11.1 percent relative improvement — by fine-tuning on synthetic domain-specific question-answer pairs, with the entire pipeline running on a single GPU with zero external API costs. Recall@10 improved by up to 6.8 points.

Recent work by DLRA demonstrated that fine-tuned LLMs for retrieval tasks could achieve 94.2 percent relevance accuracy on defense-domain benchmarks. This domain-tuned approach — training the retrieval layer rather than the generative model — aligns with broader findings that embedding specialization is the highest-return intervention for classified document search.

Hybrid retrieval combining dense semantic embeddings with sparse keyword matching (BM25) provides additional gains. Research published at ECIR 2021 by Gao et al. introduced CLEAR, a model that complements lexical retrieval with semantic residual embeddings, training the neural component specifically on passages that keyword search fails to retrieve. Cross-encoder reranking adds further precision, with benchmarks on MS MARCO showing 5 to 10 nDCG point improvements over strong first-stage retrievers.

Research Frontiers

Active research frontiers include knowledge-graph-augmented retrieval for structured threat intelligence, multilingual embedding adaptation for non-English OSINT feeds, and adversarial robustness testing against prompt injection through retrieved source documents — challenges that defense-specific RAG deployments face more acutely than commercial applications.

Knowledge graphs offer a promising extension to vector-based retrieval. PNNL’s GraphCyRAG system demonstrated that integrating Neo4j knowledge graphs with RAG enables traversal of relationships between vulnerabilities, attack patterns, and mitigation strategies — structured reasoning that pure vector similarity search cannot replicate. For intelligence analysis, connecting entities across reporting streams (threat actors, geographic locations, weapons systems, organizational affiliations) through graph-augmented retrieval could substantially improve multi-source synthesis.

Multilingual retrieval remains a significant gap. Intelligence workflows routinely process material in Mandarin, Arabic, Farsi, and Russian alongside English. General-purpose embedding models degrade on non-English text, and domain-specific multilingual embeddings require curated parallel corpora that are expensive to produce in classified environments.

Active learning techniques offer a path to reducing the labeled data required for embedding fine-tuning. Settles’ foundational survey established that active learning algorithms can achieve target accuracy levels with significantly fewer labeled training instances by intelligently selecting the most informative examples for annotation. Recent work with BERT-based classifiers has demonstrated that active learning strategies can substantially reduce labeled data requirements across multiple text classification tasks — a finding directly applicable to building training sets for domain-specific retrieval in defense contexts where labeled data is scarce and expensive.

Implementation Challenges

Defense RAG implementations face three primary obstacles: air-gapped infrastructure prevents continuous model updates, classified document ingestion requires manual security review of chunking and indexing pipelines, and the Naval Postgraduate School found that AI-enabled systems can add roughly one-third to total sustainment costs — a figure most programs fail to plan for.

Air-gapped deployment is the most fundamental constraint. Models deployed on classified networks cannot receive continuous updates from the internet. This means the language model’s parametric knowledge freezes at deployment time, making the RAG retrieval layer even more critical — it is the only mechanism for keeping the system current with new intelligence.

Document ingestion pipelines require careful design. Classified documents must be chunked, embedded, and indexed without exposing content to unauthorized systems. Chunk size directly affects retrieval quality: the IDA evaluation found no clear pattern linking chunk size to performance, but the interaction between document structure and chunking strategy matters for complex multi-section documents like defense policy papers.

Sustainment costs are routinely underestimated. Research from the Naval Postgraduate School on planning for AI sustainment found that AI-enabled systems can increase sustainment costs by approximately one-third of total sustainment expenditure — driven primarily by the touch-time required for model maintenance, data pipeline upkeep, and MLOps infrastructure. The GAO’s April 2026 report on AI acquisitions (GAO-26-107859) corroborated this pattern, identifying pricing opacity and sustainment cost uncertainty as recurring challenges across 13 AI acquisitions reviewed at DoD, DHS, VA, and GSA.

Workforce readiness compounds these challenges. The NSCAI Final Report identified the government’s “human deficit” as “the single greatest inhibitor to buying, building, and fielding AI-enabled technologies for national security purposes.” RAG systems require analysts trained not just to query the system, but to evaluate retrieval quality, identify missed sources, and recognize when the model has synthesized passages incorrectly.

Conclusion

RAG provides the most viable architecture for classified document search because it delivers auditability, document-level access control, and knowledge currency without retraining — capabilities now validated by the IDA’s defense legislation evaluation, Cisco and NVIDIA’s embedding fine-tuning benchmarks, and the Pentagon’s operational deployment of RAG-enabled tools to 3 million personnel through GenAI.mil.

Retrieval-augmented generation solves the structural problem that keyword search cannot: synthesizing information across documents and delivering direct, source-cited answers to analyst queries. The architecture is uniquely suited to classified environments because the retrieval layer — not the model — handles access control, and new documents enter the index without retraining.

The performance data is clear. GPT-4 achieves roughly 80 percent relevancy in defense document RAG evaluations. Domain-specific embedding fine-tuning delivers 7 to 11 percent relative retrieval improvement. Hybrid retrieval and reranking add further gains. The remaining challenges are infrastructure, security, and workforce — not algorithmic capability.

Defense organizations building classified document search systems should start with RAG, invest in domain-specific embedding fine-tuning as the primary accuracy lever, plan for sustainment costs from day one, and train analysts to evaluate system outputs critically. The technology works. The operational challenge is deploying it securely, at scale, within classification constraints.

Comparison: RAG Architectures for Classified Document Search

Defense organizations employ five distinct retrieval-augmented generation architectures for classified document search — naive RAG with general embeddings, domain-tuned RAG with fine-tuned embeddings, hybrid retrieval combining dense and sparse methods, graph-augmented RAG for structured intelligence, and multi-stage pipelines with reranking — each balancing retrieval accuracy, deployment complexity, and security overhead against the operational demands of searching classified repositories at scale.

Architecture	Description	Retrieval Accuracy	Deployment Complexity	Security Overhead	Example
Naive RAG	General-purpose embeddings with standard retrieval	Baseline	Low	Standard	Initial prototypes
Domain-tuned RAG	Fine-tuned embeddings on domain-specific data	High (+7-11% relative)	Medium	Standard	Cisco/NVIDIA enterprise evaluation
Hybrid retrieval	Dense semantic + sparse keyword (BM25) combined	Higher	Medium	Standard	CLEAR (Gao et al. 2021)
Graph-augmented RAG	Knowledge graph traversal + vector retrieval	Highest for structured data	High	Additional graph DB security	PNNL GraphCyRAG
Multi-stage pipeline	Retrieval + cross-encoder reranking + generation	Highest overall	High	Additional reranker security	Production defense systems

FAQ

1. What is retrieval-augmented generation for classified document search?

Retrieval-augmented generation combines a document retrieval system with a large language model to search classified repositories. The retrieval component finds relevant passages from indexed documents; the language model synthesizes those passages into a direct, source-cited answer. Unlike keyword search, RAG delivers synthesized answers rather than ranked document lists, while maintaining full auditability through citation trails that analysts can verify against original sources.

2. How accurate are RAG systems for defense document search?

Accuracy depends on the language model and embedding quality. The Institute for Defense Analyses evaluated RAG on the 2024 NDAA and found GPT-4 achieved approximately 80 percent relevancy accuracy, while the smaller Mistral-7b failed more than half the time. Domain-specific embedding fine-tuning improves retrieval substantially — Cisco and NVIDIA demonstrated 7.1 to 7.3 absolute-point NDCG gains using fine-tuned embeddings on enterprise data.

3. How do classified RAG systems handle security and access control?

Classified RAG systems deploy on air-gapped networks without internet connectivity, enforce classification markings at the retrieval layer so analysts only receive passages they are cleared to access, and comply with NIST SP 800-172 enhanced security requirements including physical and logical isolation of system components. The retrieval index, language model, and query interface may operate in separate security domains with managed interfaces.

4. What are the main risks of deploying RAG in classified environments?

Primary risks include indirect prompt injection through adversarial content in retrieved documents, retrieval failures that miss relevant sources, hallucination when the model generates content not grounded in retrieved passages, and sustainment cost overruns. NIST’s AI 100-2 E2025 report classifies prompt injection as a primary attack vector against generative AI. Air-gapped deployment prevents continuous model updates, making the retrieval layer the critical quality control point.

5. How does RAG compare to fine-tuning for classified document search?

RAG is structurally better suited to classified search because it grounds answers in specific source documents (enabling audit trails), enforces access control at the document level, and incorporates new documents without retraining. Fine-tuning bakes knowledge into model weights with no citation trail, freezes knowledge at the training cutoff, and requires expensive retraining when documents change. The highest-return fine-tuning targets the embedding model powering retrieval, not the generative model.

6. What infrastructure is required for a classified RAG deployment?

A classified RAG deployment requires air-gapped compute infrastructure on an accredited classified network, a vector database for document embeddings, a document ingestion pipeline with chunking and indexing, a language model capable of running locally without internet access, and access control mechanisms that enforce classification markings at the retrieval layer. The Naval Postgraduate School found AI-enabled systems add roughly one-third to total sustainment costs, so budgeting must account for ongoing MLOps and data pipeline maintenance.