Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) with external knowledge but are known to be vulnerable to optimization-based data poisoning, where adversaries inject adversarially optimized passages into the knowledge base.
Such attacks are designed to bias retrieval and generation, leading the system to produce misleading or adversarially steered responses. Existing defenses rely on query-specific passage filtering during inference, at either the retrieval or generation stage, inherently incurring latency and redundantly re-evaluating the same passages across queries. This results in substantial runtime overhead, which grows further when additional passages must be retrieved to replace those that are filtered out.
We present JUDO, a query-agnostic indexing-time defense that detects and filters poisoned passages at ingestion by measuring the semantic instability that arises when adversarial trigger tokens are removed. We demonstrate that clean passages remain stable under such perturbations, whereas optimized poisoned passages exhibit erratic embedding shifts. Applied once per passage, JUDO filters malicious content without altering the inference pipeline or incurring runtime cost.
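To make the detection idea concrete, the following is a minimal Python sketch of such an indexing-time instability check, assuming a generic sentence embedder, a simple drop-one-token perturbation, and an illustrative threshold; the scoring rule and parameters here are assumptions for illustration, not JUDO's exact configuration.

```python
# Sketch of an indexing-time instability check: embed a passage, remove
# individual tokens, re-embed, and measure how erratically the embedding shifts.
# Embedder choice, perturbation scheme, and threshold are assumed, not JUDO's own.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def instability_score(passage: str, n_drops: int = 8) -> float:
    """Score how much a passage's embedding shifts under token removal."""
    tokens = passage.split()
    base = model.encode(passage, normalize_embeddings=True)
    shifts = []
    # Drop one token at a time (evenly spaced positions) and re-embed.
    for i in np.linspace(0, len(tokens) - 1, num=min(n_drops, len(tokens)), dtype=int):
        perturbed = " ".join(tokens[:i] + tokens[i + 1:])
        emb = model.encode(perturbed, normalize_embeddings=True)
        shifts.append(1.0 - float(np.dot(base, emb)))  # cosine distance to original
    # Clean passages shift little and consistently; optimized poisoned passages
    # shift erratically, so the score combines magnitude and variability.
    return float(np.mean(shifts) + np.std(shifts))

def flag_at_ingestion(passage: str, threshold: float = 0.05) -> bool:
    # Applied once per passage at indexing time; the threshold is an assumed value.
    return instability_score(passage) > threshold
```

Because the check runs once at ingestion, its cost is amortized over all future queries and the inference pipeline is left untouched.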
We evaluate JUDO against previous defenses on known poisoning attacks across multiple retrieval and QA benchmarks, where it achieves state-of-the-art detection of 0.8–0.9 F1 and a 70% reduction in attack success rate, providing a practical, zero-overhead defense for securing RAG systems against optimization-based poisoning.