Achieves state-of-the-art results on ASSE-Safety and ATBench, outperforming strong general models (e.g., GPT-5.2) across R-Judge, ASSE-Safety, and ATBench
Autonomous agents (e.g., tool-using LLM agents, mobile agents, web agents) often execute multi-step trajectories consisting of observations, reasoning, and actions. Existing safety mechanisms mainly focus on single-step content moderation or final-output filtering, which are insufficient for capturing risks emerging during execution. AgentDoG addresses this gap through trajectory-level risk assessment, enabling comprehensive evaluation and protection of autonomous agent systems across diverse application scenarios.
Evaluates entire execution trajectories rather than individual steps, capturing emergent risks that only manifest through sequences of actions.
Leverages a comprehensive three-dimensional safety taxonomy covering risk sources, failure modes, and real-world harms for precise risk classification.
Can serve as a judge model for trajectory safety evaluation, a reward model for agent alignment, or a plug-in guard module for real-time protection.
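As a rough illustration of the plug-in guard mode, the sketch below wraps an agent's action loop so that each proposed step is checked against a trajectory-level guard before it executes. The `guarded_step` helper, the trajectory schema, and the toy guard are hypothetical placeholders, not the released API.

```python
# Hypothetical sketch of plugging a trajectory-level guard into an agent loop.
from typing import Callable

# Each step is a dict such as {"type": "observation" | "reasoning" | "action", "content": str}.
Trajectory = list[dict]

def guarded_step(trajectory: Trajectory,
                 proposed_action: dict,
                 guard: Callable[[Trajectory], str]) -> bool:
    """Ask the guard for a trajectory-level verdict on the proposed action;
    commit the step only when the verdict is 'safe'."""
    candidate = trajectory + [proposed_action]
    verdict = guard(candidate)          # e.g. a call to an AgentDoG model returning "safe"/"unsafe"
    if verdict.strip().lower() != "safe":
        return False                    # block the action; escalate to a human or fallback policy
    trajectory.append(proposed_action)  # safe: commit the step and let the agent execute it
    return True

# Toy guard standing in for a real model call: flag any unconfirmed payment action.
toy_guard = lambda traj: "unsafe" if "pay_with_card" in traj[-1]["content"] else "safe"

history: Trajectory = [{"type": "observation", "content": "User asked to book a flight to Berlin."}]
allowed = guarded_step(history, {"type": "action", "content": "pay_with_card(amount=842.00)"}, toy_guard)
print(allowed)  # False: the guard blocked the risky payment before execution
```

In practice the `guard` callback would wrap a call to one of the AgentDoG checkpoints; a model inference sketch appears in the guard model section below.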
A unified three-dimensional safety taxonomy for agentic systems
We propose a unified, three-dimensional safety taxonomy that decomposes agentic risks along three orthogonal dimensions: Risk Source, Failure Mode, and Real-World Harm. These dimensions respectively answer: where the risk comes from, how it manifests in agent behavior, and what real-world harm it causes.
| Dimension | Categories |
|---|---|
| Risk Source | User inputs, environmental observations, external tools/APIs, internal logic failures |
| Failure Mode | Behavioral failures (planning, tool use, execution) and output content failures |
| Real-World Harm | Privacy, financial, security, physical, psychological, reputational, societal harms |
Overview of the three orthogonal dimensions of the agentic safety taxonomy
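As a minimal sketch (not the paper's formal schema), a risk annotation under this taxonomy can be represented as a (risk source, failure mode, real-world harm) tuple; the enum members below mirror the categories listed above, while the class and field names are illustrative.

```python
# Illustrative representation of one taxonomy annotation; names are not taken from the paper.
from dataclasses import dataclass
from enum import Enum

class RiskSource(Enum):
    USER_INPUT = "user inputs"
    ENV_OBSERVATION = "environmental observations"
    EXTERNAL_TOOL = "external tools/APIs"
    INTERNAL_LOGIC = "internal logic failures"

class FailureMode(Enum):
    PLANNING = "behavioral: planning"
    TOOL_USE = "behavioral: tool use"
    EXECUTION = "behavioral: execution"
    OUTPUT_CONTENT = "output content"

class RealWorldHarm(Enum):
    PRIVACY = "privacy"
    FINANCIAL = "financial"
    SECURITY = "security"
    PHYSICAL = "physical"
    PSYCHOLOGICAL = "psychological"
    REPUTATIONAL = "reputational"
    SOCIETAL = "societal"

@dataclass
class RiskLabel:
    """One (risk source, failure mode, real-world harm) tuple attached to a trajectory."""
    source: RiskSource
    failure_mode: FailureMode
    harm: RealWorldHarm

# Example: a compromised external tool leads to a risky payment.
label = RiskLabel(RiskSource.EXTERNAL_TOOL, FailureMode.TOOL_USE, RealWorldHarm.FINANCIAL)
```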
Taxonomy-guided synthesis pipeline for realistic multi-step agent trajectories
We use a taxonomy-guided synthesis pipeline to generate realistic, multi-step agent trajectories. Each trajectory is conditioned on a sampled risk tuple (risk source, failure mode, real-world harm), then expanded into a coherent tool-augmented execution and filtered by quality checks.
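A high-level sketch of this three-stage flow (sample a risk tuple, expand it into a tool-augmented trajectory, filter by quality checks) is shown below; the helper functions and placeholder generation logic are hypothetical stand-ins for the actual LLM-driven pipeline.

```python
# Hypothetical stand-in for the taxonomy-guided synthesis flow; real stages would be LLM-driven.
import random

RISK_SOURCES = ["user inputs", "environmental observations", "external tools/APIs", "internal logic failures"]
FAILURE_MODES = ["planning", "tool use", "execution", "output content"]
HARM_TYPES = ["privacy", "financial", "security", "physical", "psychological", "reputational", "societal"]

def sample_risk_tuple() -> tuple[str, str, str]:
    """Stage 1: condition each trajectory on a sampled (source, failure mode, harm) tuple."""
    return (random.choice(RISK_SOURCES), random.choice(FAILURE_MODES), random.choice(HARM_TYPES))

def expand_trajectory(risk_tuple: tuple[str, str, str]) -> list[dict]:
    """Stage 2: expand the tuple into a coherent tool-augmented execution.
    In practice this would prompt a generator LLM with the tuple and a sampled tool subset."""
    source, mode, harm = risk_tuple
    return [
        {"step": "observation", "content": f"(scenario seeded from risk source: {source})"},
        {"step": "reasoning",   "content": f"(agent reasoning exhibiting a {mode} failure)"},
        {"step": "action",      "content": f"(tool call whose consequence is a {harm} harm)"},
    ]

def passes_quality_checks(trajectory: list[dict]) -> bool:
    """Stage 3: keep only coherent trajectories whose realized risk matches the conditioning tuple.
    Real filters would combine rule-based and LLM-based checks."""
    return len(trajectory) >= 3

def synthesize(n: int) -> list[list[dict]]:
    kept = []
    while len(kept) < n:
        trajectory = expand_trajectory(sample_risk_tuple())
        if passes_quality_checks(trajectory):
            kept.append(trajectory)
    return kept

print(len(synthesize(5)))  # 5 synthetic multi-step trajectories
```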
Three-stage pipeline for multi-step agent safety trajectory synthesis
Distribution over risk source, failure mode, and harm type categories
Tool library size compared to existing agent safety benchmarks (86x larger than R-Judge)
A large-scale benchmark for evaluating agent trajectory safety
Key statistics: number of trajectories, unique tools, risk categories, and turns per trajectory.
Comprehensive evaluation on binary classification and fine-grained risk identification tasks
Accuracy (%) comparison across R-Judge, ASSE-Safety, and ATBench benchmarks
| Model | Type | R-Judge | ASSE-Safety | ATBench |
|---|---|---|---|---|
| GPT-5.2 | General | 90.8 | 77.4 | 90.0 |
| Gemini-3-Flash | General | 95.2 | 75.9 | 75.6 |
| Gemini-3-Pro | General | 94.3 | 78.5 | 87.2 |
| QwQ-32B | General | 89.5 | 68.2 | 63.0 |
| Qwen3-235B-A22B-Instruct | General | 85.1 | 77.6 | 84.6 |
| LlamaGuard3-8B | Guard | 61.2 | 54.5 | 53.3 |
| LlamaGuard4-12B | Guard | 63.8 | 56.3 | 58.1 |
| Qwen3-Guard | Guard | 40.6 | 48.2 | 55.3 |
| ShieldAgent | Guard | 81.0 | 79.6 | 76.0 |
| AgentDoG-Qwen3-4B (Ours) | Guard | 91.8 | 80.4 | 92.8 |
| AgentDoG-Qwen2.5-7B (Ours) | Guard | 91.7 | 79.8 | 87.4 |
| AgentDoG-Llama3.1-8B (Ours) | Guard | 78.2 | 81.1 | 87.6 |
Fine-grained label accuracy (%) on ATBench for unsafe trajectories
| Model | Risk Source Acc | Failure Mode Acc | Harm Type Acc |
|---|---|---|---|
| Gemini-3-Flash | 38.0 | 22.4 | 34.8 |
| GPT-5.2 | 41.6 | 20.4 | 30.8 |
| Gemini-3-Pro | 36.8 | 17.6 | 32.0 |
| Qwen3-235B-A22B-Instruct | 19.6 | 17.2 | 38.0 |
| QwQ-32B | 23.2 | 14.4 | 34.8 |
| AgentDoG-FG-Qwen3-4B (Ours) | 82.0 | 32.4 | 58.4 |
| AgentDoG-FG-Llama3.1-8B (Ours) | 81.6 | 31.6 | 57.6 |
| AgentDoG-FG-Qwen2.5-7B (Ours) | 81.2 | 28.8 | 59.2 |
Fine-tuned guard models for agent trajectory safety evaluation
Trajectory-level safety evaluation: determines whether an agent trajectory is safe or unsafe
Risk diagnosis: classifies Risk Source, Failure Mode, and Real-World Harm for unsafe trajectories
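A minimal inference sketch with Hugging Face transformers is given below, assuming the models are released as standard causal-LM checkpoints with chat templates. The hub ID, system prompt, and JSON output schema are assumptions for illustration, not the official usage.

```python
# Illustrative only: the checkpoint ID, prompt, and output schema below are assumptions.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AgentDoG/AgentDoG-FG-Qwen3-4B"  # hypothetical hub ID for the fine-grained variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# A short trajectory to be judged (observation and action steps serialized as JSON).
trajectory = [
    {"type": "observation", "content": "User: summarize this email thread and forward it to my manager."},
    {"type": "action", "content": "forward_email(to='external-list@example.com', body=full_thread)"},
]

messages = [
    {"role": "system", "content": "Judge the agent trajectory. Return a JSON object with the fields "
                                  "'verdict', 'risk_source', 'failure_mode', and 'harm_type'."},
    {"role": "user", "content": json.dumps(trajectory, indent=2)},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
# Expected shape (illustrative): {"verdict": "unsafe", "risk_source": "internal logic failures",
#                                 "failure_mode": "tool use", "harm_type": "privacy"}
```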
A hierarchical framework for explaining agent decision drivers beyond simple failure localization
We introduce a novel hierarchical framework for Agentic Attribution, designed to reveal the internal drivers behind agent actions. By decomposing interaction trajectories into pivotal components and fine-grained textual evidence, our approach explains why an agent makes specific decisions, regardless of the outcome.
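The hierarchy described above (trajectory, pivotal components, fine-grained textual evidence) could be captured by a result structure like the following sketch; the class names, fields, and example scores are illustrative and not the paper's formalism.

```python
# Illustrative result structure for hierarchical agentic attribution; names are not from the paper.
from dataclasses import dataclass, field

@dataclass
class EvidenceSpan:
    """Fine-grained textual evidence inside one trajectory step."""
    text: str
    score: float  # estimated contribution of this span to the decision

@dataclass
class PivotalComponent:
    """A trajectory step (observation, reasoning, or action) that drives the decision."""
    step_index: int
    step_type: str
    score: float
    evidence: list[EvidenceSpan] = field(default_factory=list)

@dataclass
class AttributionResult:
    decision: str                       # the agent action being explained
    components: list[PivotalComponent]  # pivotal steps, ranked by contribution

# Toy example: the forwarding decision is attributed mainly to a phrase in the user request.
result = AttributionResult(
    decision="forward_email(to='external-list@example.com')",
    components=[
        PivotalComponent(
            step_index=0, step_type="observation", score=0.72,
            evidence=[EvidenceSpan("forward it to my manager", 0.61)],
        )
    ],
)
```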
Attribution results across representative scenarios
Comparative attribution: AgentDoG vs. base model
Demo of the dynamic attribution process in agentic XAI
If you find AgentDoG useful in your research, please cite our paper:
@misc{liu2026agentdogdiagnosticguardrailframework,
      title={AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security},
      author={Dongrui Liu and Qihan Ren and Chen Qian and Shuai Shao and Yuejin Xie and Yu Li and Zhonghao Yang and Haoyu Luo and Peng Wang and Qingyu Liu and Binxin Hu and Ling Tang and Jilin Mei and Dadi Guo and Leitao Yuan and Junyao Yang and Guanxu Chen and Qihao Lin and Yi Yu and Bo Zhang and Jiaxuan Guo and Jie Zhang and Wenqi Shao and Huiqi Deng and Zhiheng Xi and Wenjie Wang and Wenxuan Wang and Wen Shen and Zhikai Chen and Haoyu Xie and Jialing Tao and Juntao Dai and Jiaming Ji and Zhongjie Ba and Linfeng Zhang and Yong Liu and Quanshi Zhang and Lei Zhu and Zhihua Wei and Hui Xue and Chaochao Lu and Jing Shao and Xia Hu},
      year={2026},
      eprint={2601.18491},
      archivePrefix={arXiv}
}