Achieves state-of-the-art results on ASSE-Safety and ATBench, outperforming strong general models (e.g., GPT-5.2) across R-Judge, ASSE-Safety, and ATBench
Autonomous agents (e.g., tool-using LLM agents, mobile agents, web agents) often execute multi-step trajectories consisting of observations, reasoning, and actions. Existing safety mechanisms mainly focus on single-step content moderation or final-output filtering, which are insufficient for capturing risks emerging during execution. AgentDoG addresses this gap through trajectory-level risk assessment, enabling comprehensive evaluation and protection of autonomous agent systems across diverse application scenarios.
Evaluates entire execution trajectories rather than individual steps, capturing emergent risks that only manifest through sequences of actions.
Leverages a comprehensive three-dimensional safety taxonomy covering risk sources, failure modes, and real-world harms for precise risk classification.
Can serve as a judge model for trajectory safety evaluation, a reward model for agent alignment, or a plug-in guard module for real-time protection.
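As a rough illustration of the plug-in guard mode, the sketch below wraps an agent's action loop so that each proposed step is checked against a trajectory-level guard before it executes. The `guarded_step` helper, the trajectory schema, and the toy guard are hypothetical placeholders, not the released API.

```python
# Hypothetical sketch of plugging a trajectory-level guard into an agent loop.
from typing import Callable

# Each step is a dict such as {"type": "observation" | "reasoning" | "action", "content": str}.
Trajectory = list[dict]

def guarded_step(trajectory: Trajectory,
                 proposed_action: dict,
                 guard: Callable[[Trajectory], str]) -> bool:
    """Ask the guard for a trajectory-level verdict on the proposed action;
    commit the step only when the verdict is 'safe'."""
    candidate = trajectory + [proposed_action]
    verdict = guard(candidate)          # e.g. a call to an AgentDoG model returning "safe"/"unsafe"
    if verdict.strip().lower() != "safe":
        return False                    # block the action; escalate to a human or fallback policy
    trajectory.append(proposed_action)  # safe: commit the step and let the agent execute it
    return True

# Toy guard standing in for a real model call: flag any unconfirmed payment action.
toy_guard = lambda traj: "unsafe" if "pay_with_card" in traj[-1]["content"] else "safe"

history: Trajectory = [{"type": "observation", "content": "User asked to book a flight to Berlin."}]
allowed = guarded_step(history, {"type": "action", "content": "pay_with_card(amount=842.00)"}, toy_guard)
print(allowed)  # False: the guard blocked the risky payment before execution
```

In practice the `guard` callback would wrap a call to one of the AgentDoG checkpoints; a model inference sketch appears in the guard model section below.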
A unified three-dimensional safety taxonomy for agentic systems
We propose a unified, three-dimensional safety taxonomy that decomposes agentic risks along three orthogonal dimensions: Risk Source, Failure Mode, and Real-World Harm. These dimensions respectively answer: where the risk comes from, how it manifests in agent behavior, and what real-world harm it causes.
| Dimension | Categories |
|---|---|
| Risk Source | User inputs, environmental observations, external tools/APIs, internal logic failures |
| Failure Mode | Behavioral failures (planning, tool use, execution) and output content failures |
| Real-World Harm | Privacy, financial, security, physical, psychological, reputational, societal harms |
Overview of the three orthogonal dimensions of the agentic safety taxonomy
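As a minimal sketch (not the paper's formal schema), a risk annotation under this taxonomy can be represented as a (risk source, failure mode, real-world harm) tuple; the enum members below mirror the categories listed above, while the class and field names are illustrative.

```python
# Illustrative representation of one taxonomy annotation; names are not taken from the paper.
from dataclasses import dataclass
from enum import Enum

class RiskSource(Enum):
    USER_INPUT = "user inputs"
    ENV_OBSERVATION = "environmental observations"
    EXTERNAL_TOOL = "external tools/APIs"
    INTERNAL_LOGIC = "internal logic failures"

class FailureMode(Enum):
    PLANNING = "behavioral: planning"
    TOOL_USE = "behavioral: tool use"
    EXECUTION = "behavioral: execution"
    OUTPUT_CONTENT = "output content"

class RealWorldHarm(Enum):
    PRIVACY = "privacy"
    FINANCIAL = "financial"
    SECURITY = "security"
    PHYSICAL = "physical"
    PSYCHOLOGICAL = "psychological"
    REPUTATIONAL = "reputational"
    SOCIETAL = "societal"

@dataclass
class RiskLabel:
    """One (risk source, failure mode, real-world harm) tuple attached to a trajectory."""
    source: RiskSource
    failure_mode: FailureMode
    harm: RealWorldHarm

# Example: a compromised external tool leads to a risky payment.
label = RiskLabel(RiskSource.EXTERNAL_TOOL, FailureMode.TOOL_USE, RealWorldHarm.FINANCIAL)
```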
Taxonomy-guided synthesis pipeline for realistic multi-step agent trajectories
We use a taxonomy-guided synthesis pipeline to generate realistic, multi-step agent trajectories. Each trajectory is conditioned on a sampled risk tuple (risk source, failure mode, real-world harm), then expanded into a coherent tool-augmented execution and filtered by quality checks.
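A high-level sketch of this three-stage flow (sample a risk tuple, expand it into a tool-augmented trajectory, filter by quality checks) is shown below; the helper functions and placeholder generation logic are hypothetical stand-ins for the actual LLM-driven pipeline.

```python
# Hypothetical stand-in for the taxonomy-guided synthesis flow; real stages would be LLM-driven.
import random

RISK_SOURCES = ["user inputs", "environmental observations", "external tools/APIs", "internal logic failures"]
FAILURE_MODES = ["planning", "tool use", "execution", "output content"]
HARM_TYPES = ["privacy", "financial", "security", "physical", "psychological", "reputational", "societal"]

def sample_risk_tuple() -> tuple[str, str, str]:
    """Stage 1: condition each trajectory on a sampled (source, failure mode, harm) tuple."""
    return (random.choice(RISK_SOURCES), random.choice(FAILURE_MODES), random.choice(HARM_TYPES))

def expand_trajectory(risk_tuple: tuple[str, str, str]) -> list[dict]:
    """Stage 2: expand the tuple into a coherent tool-augmented execution.
    In practice this would prompt a generator LLM with the tuple and a sampled tool subset."""
    source, mode, harm = risk_tuple
    return [
        {"step": "observation", "content": f"(scenario seeded from risk source: {source})"},
        {"step": "reasoning",   "content": f"(agent reasoning exhibiting a {mode} failure)"},
        {"step": "action",      "content": f"(tool call whose consequence is a {harm} harm)"},
    ]

def passes_quality_checks(trajectory: list[dict]) -> bool:
    """Stage 3: keep only coherent trajectories whose realized risk matches the conditioning tuple.
    Real filters would combine rule-based and LLM-based checks."""
    return len(trajectory) >= 3

def synthesize(n: int) -> list[list[dict]]:
    kept = []
    while len(kept) < n:
        trajectory = expand_trajectory(sample_risk_tuple())
        if passes_quality_checks(trajectory):
            kept.append(trajectory)
    return kept

print(len(synthesize(5)))  # 5 synthetic multi-step trajectories
```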
Three-stage pipeline for multi-step agent safety trajectory synthesis
Distribution over risk source, failure mode, and harm type categories
Tool library size compared to existing agent safety benchmarks (86x larger than R-Judge)
A large-scale benchmark for evaluating agent trajectory safety
Key statistics: number of trajectories, unique tools, risk categories, and turns per trajectory.
Comprehensive evaluation on binary classification and fine-grained risk identification tasks
Accuracy (%) comparison across R-Judge, ASSE-Safety, and ATBench benchmarks
| Model | Type | R-Judge | ASSE-Safety | ATBench |
|---|---|---|---|---|
| GPT-5.2 | General | 90.8 | 77.4 | 90.0 |
| Gemini-3-Flash | General | 95.2 | 75.9 | 75.6 |
| Gemini-3-Pro | General | 94.3 | 78.5 | 87.2 |
| QwQ-32B | General | 89.5 | 68.2 | 63.0 |
| Qwen3-235B-A22B-Instruct | General | 85.1 | 77.6 | 84.6 |
| LlamaGuard3-8B | Guard | 61.2 | 54.5 | 53.3 |
| LlamaGuard4-12B | Guard | 63.8 | 56.3 | 58.1 |
| Qwen3-Guard | Guard | 40.6 | 48.2 | 55.3 |
| ShieldAgent | Guard | 81.0 | 79.6 | 76.0 |
| AgentDoG-Qwen3-4B (Ours) | Guard | 91.8 | 80.4 | 92.8 |
| AgentDoG-Qwen2.5-7B (Ours) | Guard | 91.7 | 79.8 | 87.4 |
| AgentDoG-Llama3.1-8B (Ours) | Guard | 78.2 | 81.1 | 87.6 |
Fine-grained label accuracy (%) on ATBench for unsafe trajectories
| Model | Risk Source Acc | Failure Mode Acc | Harm Type Acc |
|---|---|---|---|
| Gemini-3-Flash | 38.0 | 22.4 | 34.8 |
| GPT-5.2 | 41.6 | 20.4 | 30.8 |
| Gemini-3-Pro | 36.8 | 17.6 | 32.0 |
| Qwen3-235B-A22B-Instruct | 19.6 | 17.2 | 38.0 |
| QwQ-32B | 23.2 | 14.4 | 34.8 |
| AgentDoG-FG-Qwen3-4B (Ours) | 82.0 | 32.4 | 58.4 |
| AgentDoG-FG-Llama3.1-8B (Ours) | 81.6 | 31.6 | 57.6 |
| AgentDoG-FG-Qwen2.5-7B (Ours) | 81.2 | 28.8 | 59.2 |
Fine-tuned guard models for agent trajectory safety evaluation
Trajectory-level safety evaluation: determines whether an agent trajectory is safe or unsafe
Risk diagnosis: classifies Risk Source, Failure Mode, and Real-World Harm for unsafe trajectories
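A minimal inference sketch with Hugging Face transformers is given below, assuming the models are released as standard causal-LM checkpoints with chat templates. The hub ID, system prompt, and JSON output schema are assumptions for illustration, not the official usage.

```python
# Illustrative only: the checkpoint ID, prompt, and output schema below are assumptions.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AgentDoG/AgentDoG-FG-Qwen3-4B"  # hypothetical hub ID for the fine-grained variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# A short trajectory to be judged (observation and action steps serialized as JSON).
trajectory = [
    {"type": "observation", "content": "User: summarize this email thread and forward it to my manager."},
    {"type": "action", "content": "forward_email(to='external-list@example.com', body=full_thread)"},
]

messages = [
    {"role": "system", "content": "Judge the agent trajectory. Return a JSON object with the fields "
                                  "'verdict', 'risk_source', 'failure_mode', and 'harm_type'."},
    {"role": "user", "content": json.dumps(trajectory, indent=2)},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
# Expected shape (illustrative): {"verdict": "unsafe", "risk_source": "internal logic failures",
#                                 "failure_mode": "tool use", "harm_type": "privacy"}
```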
A hierarchical framework for explaining agent decision drivers beyond simple failure localization
We introduce a novel hierarchical framework for Agentic Attribution, designed to reveal the internal drivers behind agent actions. By decomposing interaction trajectories into pivotal components and fine-grained textual evidence, our approach explains why an agent makes specific decisions, regardless of the outcome.
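The hierarchy described above (trajectory, pivotal components, fine-grained textual evidence) could be captured by a result structure like the following sketch; the class names, fields, and example scores are illustrative and not the paper's formalism.

```python
# Illustrative result structure for hierarchical agentic attribution; names are not from the paper.
from dataclasses import dataclass, field

@dataclass
class EvidenceSpan:
    """Fine-grained textual evidence inside one trajectory step."""
    text: str
    score: float  # estimated contribution of this span to the decision

@dataclass
class PivotalComponent:
    """A trajectory step (observation, reasoning, or action) that drives the decision."""
    step_index: int
    step_type: str
    score: float
    evidence: list[EvidenceSpan] = field(default_factory=list)

@dataclass
class AttributionResult:
    decision: str                       # the agent action being explained
    components: list[PivotalComponent]  # pivotal steps, ranked by contribution

# Toy example: the forwarding decision is attributed mainly to a phrase in the user request.
result = AttributionResult(
    decision="forward_email(to='external-list@example.com')",
    components=[
        PivotalComponent(
            step_index=0, step_type="observation", score=0.72,
            evidence=[EvidenceSpan("forward it to my manager", 0.61)],
        )
    ],
)
```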
Attribution results across representative scenarios
Comparative attribution: AgentDoG vs. base model
Demo of the dynamic attribution process in agentic XAI
If you find AgentDoG useful in your research, please cite our paper:
@misc{liu2026agentdogdiagnosticguardrailframework,
      title={AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security},
      author={Dongrui Liu and Qihan Ren and Chen Qian and Shuai Shao and Yuejin Xie and Yu Li and Zhonghao Yang and Haoyu Luo and Peng Wang and Qingyu Liu and Binxin Hu and Ling Tang and Jilin Mei and Dadi Guo and Leitao Yuan and Junyao Yang and Guanxu Chen and Qihao Lin and Yi Yu and Bo Zhang and Jiaxuan Guo and Jie Zhang and Wenqi Shao and Huiqi Deng and Zhiheng Xi and Wenjie Wang and Wenxuan Wang and Wen Shen and Zhikai Chen and Haoyu Xie and Jialing Tao and Juntao Dai and Jiaming Ji and Zhongjie Ba and Linfeng Zhang and Yong Liu and Quanshi Zhang and Lei Zhu and Zhihua Wei and Hui Xue and Chaochao Lu and Jing Shao and Xia Hu},
      year={2026},
      eprint={2601.18491},
      archivePrefix={arXiv}
}