AgentDoG

A Diagnostic Guardrail Framework for AI Agent Safety and Security

Shanghai AILab

State-of-the-Art Performance

State-of-the-art on ASSE-Safety and ATBench, outperforming strong general-purpose models (e.g., GPT-5.2) across R-Judge, ASSE-Safety, and ATBench

Binary Classification Performance
Fine-Grained Classification Performance

Abstract

Autonomous agents (e.g., tool-using LLM agents, mobile agents, web agents) execute multi-step trajectories of observations, reasoning, and actions. Existing safety mechanisms focus mainly on single-step content moderation or final-output filtering, which cannot capture risks that emerge only over the course of execution. AgentDoG addresses this gap through trajectory-level risk assessment, enabling comprehensive evaluation and protection of autonomous agent systems across diverse application scenarios.

🧭

Trajectory-Level Monitoring

Evaluates entire execution trajectories rather than individual steps, capturing emergent risks that only manifest through sequences of actions.

🧩

Taxonomy-Guided Diagnosis

Leverages a comprehensive three-dimensional safety taxonomy covering risk sources, failure modes, and real-world harms for precise risk classification.

🛡️

Flexible Deployment

Can serve as a judge model for trajectory safety evaluation, a reward model for agent alignment, or a plug-in guard module for real-time protection.
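The plug-in guard mode can be pictured as a wrapper that checks the trajectory prefix after every agent step. The sketch below is a minimal illustration under stated assumptions: `agent_step` and `judge` are stand-ins for a real agent loop and an AgentDoG-backed classifier, not the released implementation.

```python
from typing import Callable

Step = dict  # one observation/reasoning/action record
Judge = Callable[[list[Step]], bool]  # True -> trajectory so far is safe

def guarded_run(agent_step: Callable[[list[Step]], Step],
                judge: Judge, max_steps: int = 10) -> list[Step]:
    """Run an agent step by step, halting as soon as the guard flags the
    trajectory prefix as unsafe. Sketch of the plug-in deployment mode;
    `agent_step` and `judge` are hypothetical stand-ins."""
    trajectory: list[Step] = []
    for _ in range(max_steps):
        step = agent_step(trajectory)
        trajectory.append(step)
        if not judge(trajectory):  # trajectory-level check, not per-step
            trajectory[-1]["blocked"] = True
            break
        if step.get("final"):
            break
    return trajectory

# Toy demo: an agent whose third step is unsafe gets stopped there.
def toy_agent(traj):
    n = len(traj)
    return {"action": f"step-{n}", "unsafe": n == 2, "final": n == 4}

result = guarded_run(toy_agent, judge=lambda tr: not tr[-1]["unsafe"])
```

Because the judge sees the whole prefix rather than the latest step in isolation, it can flag risks that only become apparent from the sequence of actions.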

Safety Taxonomy

A unified three-dimensional safety taxonomy for agentic systems

We propose a unified, three-dimensional safety taxonomy that decomposes agentic risks along three orthogonal dimensions: Risk Source, Failure Mode, and Real-World Harm. These dimensions respectively answer: where the risk comes from, how it manifests in agent behavior, and what real-world harm it causes.

Risk Source · 8 Categories
User inputs, environmental observations, external tools/APIs, internal logic failures

Failure Mode · 14 Categories
Behavioral failures (planning, tool use, execution) and output content failures

Real-World Harm · 10 Categories
Privacy, financial, security, physical, psychological, reputational, societal harms
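A point in this taxonomy is naturally a triple with one coordinate per dimension. The sketch below illustrates that structure; the category names are hypothetical placeholders (the actual taxonomy defines 8 risk sources, 14 failure modes, and 10 harm categories).

```python
from dataclasses import dataclass

# Placeholder category names for illustration only; the paper's taxonomy
# defines 8 risk sources, 14 failure modes, and 10 real-world harm types.
RISK_SOURCES = {"user_input", "environment_observation", "external_tool", "internal_logic"}
FAILURE_MODES = {"planning_failure", "tool_misuse", "unsafe_output"}
HARM_TYPES = {"privacy", "financial", "security", "physical"}

@dataclass(frozen=True)
class RiskTuple:
    """One point in the three-dimensional safety taxonomy."""
    risk_source: str   # where the risk comes from
    failure_mode: str  # how it manifests in agent behavior
    harm_type: str     # what real-world harm it causes

    def __post_init__(self):
        # Validate each coordinate against its dimension's category set.
        if self.risk_source not in RISK_SOURCES:
            raise ValueError(f"unknown risk source: {self.risk_source}")
        if self.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {self.failure_mode}")
        if self.harm_type not in HARM_TYPES:
            raise ValueError(f"unknown harm type: {self.harm_type}")

t = RiskTuple("external_tool", "tool_misuse", "privacy")
```

Keeping the three dimensions orthogonal means any combination is a valid label, which is what lets a diagnosis separate "where" from "how" from "what harm".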

Safety Taxonomy Overview

Overview of the three orthogonal dimensions of the agentic safety taxonomy

Data Synthesis

Taxonomy-guided synthesis pipeline for realistic multi-step agent trajectories

We use a taxonomy-guided synthesis pipeline to generate realistic, multi-step agent trajectories. Each trajectory is conditioned on a sampled risk tuple (risk source, failure mode, real-world harm), then expanded into a coherent tool-augmented execution and filtered by quality checks.
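The conditioning step can be sketched as sampling a risk tuple and threading it through generation and filtering. This is a minimal illustration, not the paper's pipeline: the category lists are placeholders, and `generate` / `quality_ok` stand in for the LLM-based trajectory generator and quality checks.

```python
import random

# Placeholder category lists; the actual taxonomy has 8/14/10 categories.
RISK_SOURCES = ["user_input", "environment_observation", "external_tool", "internal_logic"]
FAILURE_MODES = ["planning_failure", "tool_misuse", "unsafe_output"]
HARM_TYPES = ["privacy", "financial", "security", "physical"]

def sample_risk_tuple(rng: random.Random) -> tuple[str, str, str]:
    """Stage 1: sample a (risk source, failure mode, harm) condition."""
    return (rng.choice(RISK_SOURCES), rng.choice(FAILURE_MODES), rng.choice(HARM_TYPES))

def synthesize_trajectory(risk_tuple, generate, quality_ok):
    """Stages 2-3: expand the tuple into a multi-step trajectory, then filter.

    `generate` and `quality_ok` are hypothetical stand-ins for the LLM-based
    generator and quality checks, not the released implementation.
    """
    trajectory = generate(risk_tuple)
    return trajectory if quality_ok(trajectory) else None

rng = random.Random(0)
tup = sample_risk_tuple(rng)
traj = synthesize_trajectory(
    tup,
    generate=lambda t: [{"role": "user", "content": f"task seeded with {t}"}],
    quality_ok=lambda tr: len(tr) > 0,
)
```

Sampling the tuple first, rather than labeling trajectories after the fact, is what keeps the synthesized data balanced across the taxonomy.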

Data Synthesis Pipeline

Three-stage pipeline for multi-step agent safety trajectory synthesis

Distribution comparison

Distribution over risk source, failure mode, and harm type categories

Tool library size comparison

Tool library size compared to existing agent safety benchmarks (86x larger than R-Judge)

ATBench Dataset

A large-scale benchmark for evaluating agent trajectory safety

500 Trajectories · 1,575 Unique Tools · 32 Risk Categories · ~8.97 Avg. Turns/Trajectory

Performance

Comprehensive evaluation on binary classification and fine-grained risk identification tasks

Trajectory-Level Safety Evaluation

Accuracy comparison across R-Judge, ASSE-Safety, and ATBench benchmarks

| Model | Type | R-Judge | ASSE-Safety | ATBench |
|---|---|---|---|---|
| GPT-5.2 | General | 90.8 | 77.4 | 90.0 |
| Gemini-3-Flash | General | 95.2 | 75.9 | 75.6 |
| Gemini-3-Pro | General | 94.3 | 78.5 | 87.2 |
| QwQ-32B | General | 89.5 | 68.2 | 63.0 |
| Qwen3-235B-A22B-Instruct | General | 85.1 | 77.6 | 84.6 |
| LlamaGuard3-8B | Guard | 61.2 | 54.5 | 53.3 |
| LlamaGuard4-12B | Guard | 63.8 | 56.3 | 58.1 |
| Qwen3-Guard | Guard | 40.6 | 48.2 | 55.3 |
| ShieldAgent | Guard | 81.0 | 79.6 | 76.0 |
| AgentDoG-Qwen3-4B (Ours) | Guard | 91.8 | 80.4 | 92.8 |
| AgentDoG-Qwen2.5-7B (Ours) | Guard | 91.7 | 79.8 | 87.4 |
| AgentDoG-Llama3.1-8B (Ours) | Guard | 78.2 | 81.1 | 87.6 |

Fine-Grained Risk Diagnosis

Fine-grained label accuracy (%) on ATBench for unsafe trajectories

| Model | Risk Source Acc. | Failure Mode Acc. | Harm Type Acc. |
|---|---|---|---|
| Gemini-3-Flash | 38.0 | 22.4 | 34.8 |
| GPT-5.2 | 41.6 | 20.4 | 30.8 |
| Gemini-3-Pro | 36.8 | 17.6 | 32.0 |
| Qwen3-235B-A22B-Instruct | 19.6 | 17.2 | 38.0 |
| QwQ-32B | 23.2 | 14.4 | 34.8 |
| AgentDoG-FG-Qwen3-4B (Ours) | 82.0 | 32.4 | 58.4 |
| AgentDoG-FG-Llama3.1-8B (Ours) | 81.6 | 31.6 | 57.6 |
| AgentDoG-FG-Qwen2.5-7B (Ours) | 81.2 | 28.8 | 59.2 |

Model Checkpoints

Fine-tuned guard models for agent trajectory safety evaluation

Binary Classification (Safe/Unsafe)

Trajectory-level safety evaluation: determines whether an agent trajectory is safe or unsafe

AgentDoG-Qwen3-4B

AgentDoG-Qwen2.5-7B

AgentDoG-Llama3.1-8B

Fine-Grained Classification (FG)

Risk diagnosis: classifies Risk Source, Failure Mode, and Real-World Harm for unsafe trajectories

AgentDoG-FG-Qwen3-4B

AgentDoG-FG-Qwen2.5-7B

AgentDoG-FG-Llama3.1-8B
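The two checkpoint families compose naturally into a two-stage pipeline: the binary model screens every trajectory, and the FG model diagnoses only those flagged unsafe. The sketch below shows that control flow with stub classifiers; `classify_binary` and `classify_fg` are assumptions standing in for inference with, e.g., AgentDoG-Qwen3-4B and AgentDoG-FG-Qwen3-4B.

```python
def diagnose(trajectory, classify_binary, classify_fg):
    """Return 'safe', or a (risk_source, failure_mode, harm_type) diagnosis.

    `classify_binary` and `classify_fg` are hypothetical stand-ins for
    inference with the binary and fine-grained checkpoints.
    """
    if classify_binary(trajectory) == "safe":
        return "safe"
    # Fine-grained diagnosis runs only on the unsafe subset.
    return classify_fg(trajectory)

verdict = diagnose(
    ["user asks agent to email a stranger's home address"],
    classify_binary=lambda tr: "unsafe",
    classify_fg=lambda tr: ("user_input", "unsafe_output", "privacy"),
)
```

Running the fine-grained model only on flagged trajectories keeps the common safe path cheap while still producing a full three-dimensional diagnosis where it matters.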

Agentic XAI Attribution

A hierarchical framework for explaining agent decision drivers beyond simple failure localization

We introduce a novel hierarchical framework for Agentic Attribution, designed to unveil the internal drivers behind agent actions. By decomposing interaction trajectories into pivotal components and fine-grained textual evidence, our approach explains why an agent makes specific decisions regardless of the outcome.

XAI Attribution Results

Attribution results across representative scenarios

XAI Attribution Comparison

Comparative attribution: AgentDoG vs. base model

Case demo

Demo of the dynamic attribution process in agentic XAI

Citation

If you find AgentDoG useful in your research, please cite our paper

@misc{liu2026agentdogdiagnosticguardrailframework,
      title={AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security},
      author={Dongrui Liu and Qihan Ren and Chen Qian and Shuai Shao and Yuejin Xie and Yu Li and Zhonghao Yang and Haoyu Luo and Peng Wang and Qingyu Liu and Binxin Hu and Ling Tang and Jilin Mei and Dadi Guo and Leitao Yuan and Junyao Yang and Guanxu Chen and Qihao Lin and Yi Yu and Bo Zhang and Jiaxuan Guo and Jie Zhang and Wenqi Shao and Huiqi Deng and Zhiheng Xi and Wenjie Wang and Wenxuan Wang and Wen Shen and Zhikai Chen and Haoyu Xie and Jialing Tao and Juntao Dai and Jiaming Ji and Zhongjie Ba and Linfeng Zhang and Yong Liu and Quanshi Zhang and Lei Zhu and Zhihua Wei and Hui Xue and Chaochao Lu and Jing Shao and Xia Hu},
      year={2026},
      eprint={2601.18491},
      archivePrefix={arXiv}
}