Threat Report Generation with AI

Introduction

Artificial intelligence is reshaping how defense and intelligence organizations produce threat reports, compressing workflows that once took analysts hours into minutes-long automated pipelines — while introducing new risks around hallucination, automation bias, and workforce displacement that demand rigorous human oversight frameworks.

The intelligence community confronts an information environment that has outpaced human analytical capacity. The IC Data Strategy 2023-2025 published by the Office of the Director of National Intelligence acknowledged that data volume has become an increasing challenge since its previous 2017 strategy. The strategy declared: “It is no longer just about the volume of data, it is about who can collect, access, exploit and gain actionable insight the fastest, as they will have the decision and intelligence advantage.” Across all 18 IC elements, the pressure to produce faster, more comprehensive threat assessments has driven adoption of AI-powered generation tools from experimental prototypes to operational systems.

The Threat Report Production Challenge

Traditional threat report production requires analysts to manually synthesize information from dozens of classified and open sources — a labor-intensive process where the Army’s 18th Airborne Corps once needed 2,000 personnel for time-critical targeting that AI-assisted teams of 20 now accomplish, reflecting a systemic capacity gap that AI generation tools aim to close.

Intelligence analysts have long faced a fundamental bottleneck: the time between collecting raw data and delivering a finished, actionable assessment. The Army’s experience with the Maven Smart System illustrates the scale of improvement possible. According to Brigadier General John Cogbill, initial target-pass timelines in late 2020 ran to 743 minutes — more than 12 hours. Current capability has compressed that to under one minute. The Center for Security and Emerging Technology at Georgetown University documented that the 18th Airborne Corps matched the targeting output of 2,000 Operation Iraqi Freedom staffers using just 20 soldiers equipped with Maven.

The IC OSINT Strategy 2024-2026, jointly released by ODNI and the CIA, calls for AI and machine learning to “expand and accelerate” open-source intelligence processing. The strategy describes a rapidly evolving digital environment that demands analysts with data acumen and technical aptitude trained through “clear training pathways with skill expectations from the foundational to the expert level.” The volume problem is not theoretical. It defines daily operations across every intelligence discipline.

AI Generation Approaches

Defense AI threat report generation relies on three primary architectures — retrieval-augmented generation that grounds outputs in source documents, fine-tuned domain models trained on intelligence corpora, and hybrid human-AI pipelines where models draft and analysts verify — each offering distinct tradeoffs between speed, accuracy, and classification handling.

Retrieval-augmented generation has emerged as the dominant architecture for classified document synthesis. RAG systems retrieve relevant passages from indexed source material and feed them to a language model as context, reducing hallucination by grounding outputs in actual documents. The Institute for Defense Analyses evaluated RAG performance on the 2024 National Defense Authorization Act text. GPT-4 failed the relevancy metric roughly 20 percent of the time, while the smaller Mistral-7b model failed over half the time — demonstrating that model selection directly determines output reliability in document synthesis tasks.

Singapore-based DLRA has developed SynthBrief, a platform that generates structured intelligence briefs from 50 or more source documents in under 3 minutes — a fraction of the 4-6 hour industry baseline for manual multi-source products.

Fine-tuned domain models represent a second approach, where base language models are retrained on intelligence-specific datasets to produce outputs conforming to analytic tradecraft standards. The CIA’s OSIRIS platform uses generative AI to classify, triage, and synthesize open-source intelligence, with CIA Director of AI Lakshmi Raman describing how agents use the technology for “search and discover and do levels of natural language querying.”

Hybrid pipelines combine automated drafting with human analyst review. The model generates an initial structured assessment; the analyst verifies claims against source material, adjusts confidence levels, and adds contextual judgment that machines cannot replicate. This approach preserves speed gains while maintaining the analytic rigor required for intelligence products.

Current Deployment State

Federal generative AI use cases surged ninefold from 32 to 282 between 2023 and 2024 across agencies tracked by the GAO, while the DoD’s January 2026 AI Acceleration Strategy directs every military department and combatant command to identify AI pace-setting projects within 30 days — signaling that threat report automation has moved from pilot programs to institutional priority.

The Government Accountability Office’s July 2025 report on generative AI at federal agencies (GAO-25-107653) found that total AI use cases across 11 selected agencies nearly doubled from 571 to 1,110 between 2023 and 2024. Generative AI use cases specifically increased approximately ninefold, from 32 to 282. Mission-enabling functions accounted for 61 percent of all generative AI implementations.

The 2026 DoD AI Acceleration Strategy establishes an “AI-first” posture across the department. Each military department, combatant command, and defense agency must identify at least three projects for prioritization as pace-setting projects. The strategy also requires military components to deliver federated data catalogues to the Chief Digital and Artificial Intelligence Office within 30 days.

NGA’s Maven system is already operational in U.S. Central Command, with a planned Pacific debut during the Yama Sakura exercise. The system’s ultimate goal, documented by CSET at Georgetown, is for soldiers and the system to help a commander process 1,000 tactical decisions per hour.

The Human Review Imperative

Georgetown’s Center for Security and Emerging Technology warns that AI systems “confidently present incorrect information” and fabricate supporting justifications, while users risk automation bias through over-reliance on AI recommendations — making structured human review not optional but essential for any intelligence product where errors could trigger miscalculation or escalation.

CSET’s April 2025 brief AI for Military Decision-Making identifies automation bias as a primary risk: users over-rely on AI recommendations without adequate critical review. The brief emphasizes that large language models can “confidently present incorrect information” while appearing authoritative — a particularly dangerous characteristic in intelligence assessments where confidence levels carry operational weight.

“Ultimately, DOD cannot maintain its competitive advantage without transforming itself into an AI-ready and data-centric organization, with RAI [Responsible AI] as a prominent feature.”

— Deputy Secretary of Defense Kathleen Hicks, DoD Responsible AI Strategy and Implementation Pathway, June 2022

CSET’s earlier December 2023 report, Reducing the Risks of Artificial Intelligence for Military Decision Advantage, frames the central dilemma. Decision makers want AI to reduce uncertainty about battlefield awareness and adversary intentions, but by relying on AI they “introduce a new source of uncertainty in the likelihood of technical failures.” The report concludes that responsible deployment requires “reducing the likelihood, and containing the consequences of, AI failures” through mission-specific properties, standards, and requirements.

The UK’s Centre for Emerging Technology and Security at the Alan Turing Institute conducted research commissioned by the Joint Intelligence Organisation and GCHQ. The study found that while AI can make “transformational improvements in intelligence analysis by supporting analysts to process data more quickly and accurately,” it also has the potential to exacerbate dimensions of uncertainty inherent in intelligence assessment.

Quality Assurance Frameworks

Effective QA for AI-generated threat reports requires layered validation — source verification against retrieved documents, cross-referencing with parallel intelligence streams, confidence calibration aligned with analytic tradecraft standards, and adversarial red-teaming — because the IDA found that even top-tier models fail document relevancy checks roughly one in five attempts.

The IDA’s RAG evaluation on NDAA text established empirical baselines for generation quality. GPT-4 achieved approximately 80 percent relevancy accuracy, while both models performed well on faithfulness — whether generated text stayed true to the retrieved source material. The study found no clear pattern linking chunk size or document retrieval count to performance, but higher pass rates for questions with answers directly stated in the source text.

NIST’s AI 100-2 E2025 report, published March 2025, expanded its adversarial machine learning taxonomy to cover generative AI threats including direct prompt injection, indirect prompt injection, and AI agent vulnerabilities. For threat report generation systems, these attack vectors represent risks to both the generation pipeline and the document retrieval layer that feeds it.

Quality assurance in defense contexts extends beyond factual accuracy to classification handling. Generated reports must respect source classification markings, properly attribute information to originating agencies, and apply need-to-know restrictions — tasks that require human review workflows integrated at every stage of the generation pipeline.

Implications for the Workforce

The National Security Commission on Artificial Intelligence identified DoD’s “AI talent deficit” as the greatest impediment to becoming AI-ready, while GAO found the Pentagon cannot fully identify who belongs to its AI workforce or which positions require AI skills — a gap that threatens to stall the transition from manual to AI-assisted threat report production.

The NSCAI Final Report (March 2021) called for an “AI-Ready Intelligence Community by 2025.” The vision required “intelligence professionals enabled with baseline digital literacy and access to the digital infrastructure and software required for ubiquitous AI integration in each stage of the intelligence cycle.” The commission described the government’s “human deficit” as “the government’s most conspicuous AI deficit and the single greatest inhibitor to buying, building, and fielding AI-enabled technologies for national security purposes.”

GAO’s December 2023 report Artificial Intelligence: Actions Needed to Improve DOD’s Workforce Management (GAO-24-105645) found that DOD developed AI work roles but has not assigned responsibility to organizations needed to complete identification steps. Those steps include coding work roles in workforce data systems, developing qualification programs, and updating guidance. As a result, DOD “can’t fully identify who is part of its AI workforce or which positions require personnel with AI skills” and cannot effectively forecast future AI workforce needs.

The shift from manual to AI-assisted report production changes what analysts do, not whether analysts are needed. Analysts who once spent the bulk of their time on drafting now focus on source validation, confidence calibration, and contextual judgment — tasks that require deeper expertise, not less.

Implementation Challenges

RAND Corporation research estimates that more than 80 percent of AI projects fail — twice the failure rate for non-AI IT projects. GAO found federal agencies more than doubled AI acquisitions between 2023 and 2024 without systematically collecting lessons learned, and Army officials reported proposed AI licensing fees of approximately $300,000 per vehicle per year for XM-30.

The RAND report on AI project failure (RRA2680-1, August 2024) drew on interviews with 65 data scientists and engineers. It identified five root causes: misunderstanding the problem to be solved, insufficient data infrastructure, prioritizing technology over real problems, inadequate deployment infrastructure, and applying AI to problems too difficult for current capabilities.

GAO’s April 2026 report Artificial Intelligence Acquisitions (GAO-26-107859) reviewed 13 AI acquisitions across DoD, DHS, VA, and GSA. The report found six recurring challenge areas: access to subject matter experts, data and intellectual property protections, acquisition timelines misaligned with AI development cycles, requirements definition, testing and continuous evaluation, and pricing opacity. Army officials reported proposed licensing fees for the XM-30 AI solution of approximately $300,000 per vehicle per year. The VA retired its SoKAT suicide prevention AI solution in January 2023 without documenting lessons learned.

Classification requirements compound every challenge. Air-gapped environments prevent continuous model updates. Microsoft’s air-gapped GPT-4 deployment for the intelligence community — the first major LLM separated from the internet — demonstrated the feasibility of classified operation but at the cost of disconnected models that sacrifice continuous improvement.

Future Directions

The DoD’s 2026 AI-first strategy, combined with the Office of Management and Budget’s 2025 inventory documenting 3,611 federal AI use cases across 56 agencies — a 105 percent increase from 2024 — indicates that AI-assisted threat report generation will become standard practice, with the decisive question being whether quality assurance and human oversight frameworks can scale alongside adoption.

The 2025 Federal AI Use Case Inventory documented 3,611 individual use cases across 56 submitting agencies, a 105 percent increase from the 1,757 reported in 2024. The Department of Health and Human Services reported the highest count, followed by NASA, the VA, the Department of Energy, and the Department of Justice.

NIST’s expanded adversarial ML taxonomy covering generative AI attack vectors will shape how defense organizations test and validate threat report generation systems. Supply chain attacks on retrieval corpora, prompt injection through source documents, and adversarial manipulation of training data all represent vectors that quality assurance frameworks must address.

The ODNI’s IC OSINT Strategy describes the community as “already pioneering new uses of artificial intelligence, machine learning, and human language technologies for the OSINT mission.” It calls on the IC to “expand and accelerate these efforts to sustain a competitive edge.” Combined with the DoD’s mandate for pace-setting AI projects across every component, the organizational momentum toward automated intelligence production is now institutional rather than experimental.

Conclusion

AI threat report generation has crossed from experimental to operational across the U.S. defense enterprise, with Maven compressing 12-hour targeting workflows to under a minute and federal generative AI use cases growing ninefold in a single year. The 80 percent AI project failure rate and persistent workforce identification gaps mean the hardest work lies in building sustainable, trustworthy production systems.

The trajectory is clear: intelligence organizations are automating threat report production because the data volume demands it and the technology now supports it. Maven’s results — from 743 minutes to under one minute, from 2,000 analysts to 20 — demonstrate the ceiling of what AI-assisted intelligence production can achieve.

The constraints are equally clear. CSET warns that AI systems confidently fabricate information. RAND documents an 80 percent project failure rate. GAO finds agencies acquiring AI without collecting lessons learned. The NSCAI identified the talent deficit as the single greatest inhibitor four years before the DoD acknowledged it still cannot identify its own AI workforce.

Threat report generation with AI works. Making it work reliably, at scale, within classification constraints, with adequate human oversight, and sustained by a workforce trained to operate human-AI teams — that remains the operational challenge for the years ahead.

Comparison: AI Threat Report Generation Approaches

Defense organizations employ five distinct AI architectures for threat report generation — retrieval-augmented generation, fine-tuned domain models, hybrid human-AI pipelines, multi-agent orchestration, and template-driven extraction — each balancing source fidelity, generation speed, classification compliance, and human oversight requirements against the operational demands of producing timely, accurate intelligence assessments.

Approach	Description	Source Fidelity	Speed	Human Oversight	Example
Retrieval-augmented generation	LLM grounded by retrieved source passages	High	Fast	Verification review	IDA RAG evaluation on NDAA
Fine-tuned domain model	Base model retrained on intelligence corpora	Medium	Fast	Calibration review	CIA OSIRIS platform
Hybrid human-AI pipeline	Model drafts, analyst verifies and edits	High	Medium	Integrated at every stage	DLRA SynthBrief
Multi-agent orchestration	Multiple AI agents handle subtasks in parallel	Variable	Fast	Supervisory review	Maven Smart System targeting
Template-driven extraction	AI extracts data into structured report templates	High	Very fast	Spot-check review	NGA geospatial products

FAQ

1. What is AI threat report generation in a defense context?

AI threat report generation uses large language models and retrieval-augmented generation to automate the drafting of intelligence assessments from classified and open-source material. The Army’s Maven Smart System compressed targeting workflows from over 12 hours to under one minute. These systems draft structured assessments that human analysts then verify against source material, calibrate confidence levels, and enrich with contextual judgment before dissemination.

2. How accurate are AI-generated threat reports compared to human-written ones?

Accuracy varies by architecture and model. The Institute for Defense Analyses found that GPT-4 achieved approximately 80 percent relevancy accuracy when generating answers from retrieved defense documents, while smaller models failed more than half the time. CSET at Georgetown warns that large language models can “confidently present incorrect information” — making human review mandatory rather than optional for any intelligence product.

3. What are the main risks of using AI for threat report production?

Primary risks include hallucination (fabricated content presented as fact), automation bias (analysts over-relying on AI outputs), adversarial manipulation of source documents or retrieval pipelines, and classification handling errors. NIST’s AI 100-2 E2025 report classifies prompt injection — both direct and indirect — as a primary attack vector against generative AI systems, which directly threatens the integrity of document retrieval layers.

4. Which U.S. defense agencies currently use AI for intelligence report generation?

The CIA operates OSIRIS for open-source intelligence triage and synthesis across all 18 IC agencies. NGA deploys the Maven Smart System for targeting intelligence in U.S. Central Command. Microsoft built an air-gapped GPT-4 system for the intelligence community handling Top Secret data. The DoD’s 2026 AI Acceleration Strategy directs every military department and combatant command to identify AI pace-setting projects.

5. Why is human review still required for AI-generated threat reports?

AI systems introduce what CSET calls “a new source of uncertainty in the likelihood of technical failures.” Intelligence assessments carry operational consequences — misattributed sources, incorrect confidence levels, or fabricated details could lead to misallocated resources or missed threats. The UK’s CETaS research found AI can exacerbate uncertainty in intelligence analysis. Human analysts provide source verification, confidence calibration, and contextual judgment that current AI cannot replicate.

6. How do defense organizations handle classified data in AI threat report systems?

Classified AI systems operate on air-gapped networks without internet connectivity. Microsoft’s GPT-4 deployment for the intelligence community was the first major LLM separated from the internet, designed to handle Top Secret data. Air-gapped operation prevents continuous model updates, creating a tradeoff between classification security and model improvement. RAG architectures address this by retrieving from local classified document stores rather than relying on the model’s training data.