AgentDoG 1.5

A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Shanghai AI Lab

State-of-the-Art Performance

Strong trajectory-level safety evaluation on R-Judge and ATBench, with fine-grained diagnostic accuracy reported across the three taxonomy dimensions

Binary Classification Performance

Introduction

AgentDoG 1.5 is a lightweight and scalable agent safety alignment framework.

🧩

Updated Agent Safety Taxonomy and ATBench Family

Updates the three-dimensional taxonomy with Codex/OpenClaw risks and extends ATBench for trajectory-level diagnosis.

🛡️

Lightweight AgentDoG 1.5

Uses a taxonomy-guided data engine and around 1k samples for strong, lightweight deployment.

🚀

Scalable Lightweight Agentic Training Pipeline

Supports low-cost safety-aware SFT/RL, scaling to 10,000+ concurrent agentic environments on an 8-core machine.

🧱

Online Agent Safety Guardrail

Provides runtime monitoring and intervention for deployed OpenClaw agent workflows.

Safety Taxonomy

A unified three-dimensional safety taxonomy for agentic systems

We propose a unified, three-dimensional safety taxonomy that decomposes agentic risks along three orthogonal dimensions: Risk Source, Failure Mode, and Real-World Harm. These dimensions respectively answer: where the risk comes from, how it manifests in agent behavior, and what real-world harm it causes.

Risk Source (15 categories): user inputs, environmental observations, external tools/APIs, internal logic failures

Failure Mode (21 categories): behavioral failures (planning, tool use, execution) and output content failures

Real-World Harm (11 categories): privacy, financial, security, physical, psychological, reputational, societal harms

Safety Taxonomy Overview

Overview of the three orthogonal dimensions of the agentic safety taxonomy
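As a rough illustration of how a label in this taxonomy could be represented, the sketch below treats each dimension as an integer category index and uses the category counts stated above (15, 21, 11). The `RiskLabel` class, the field names, and the index encoding are assumptions made for this example; they are not taken from the AgentDoG codebase.

```python
from dataclasses import dataclass

# Category counts per dimension, as stated in the taxonomy summary.
DIMENSION_SIZES = {"risk_source": 15, "failure_mode": 21, "real_world_harm": 11}


@dataclass(frozen=True)
class RiskLabel:
    """Hypothetical label: one category index along each taxonomy dimension."""
    risk_source: int
    failure_mode: int
    real_world_harm: int

    def __post_init__(self):
        # Reject indices that fall outside the stated number of categories.
        for name, size in DIMENSION_SIZES.items():
            value = getattr(self, name)
            if not 0 <= value < size:
                raise ValueError(f"{name}={value} is outside 0..{size - 1}")


def label_space_size() -> int:
    """Number of (source, mode, harm) combinations if every combination is allowed."""
    total = 1
    for size in DIMENSION_SIZES.values():
        total *= size
    return total


print(label_space_size())  # 15 * 21 * 11 = 3465
```

The product 15 × 21 × 11 = 3,465 is only an upper bound on the label space; the source does not say whether every combination actually occurs.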

Data Synthesis

Taxonomy-guided synthesis pipeline for realistic multi-step agent trajectories

We use a taxonomy-guided synthesis pipeline to generate realistic, multi-step agent trajectories. Each trajectory is conditioned on a sampled risk tuple (risk source, failure mode, real-world harm), then expanded into a coherent tool-augmented execution and filtered by quality checks.
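The three steps described above (sample a risk tuple, expand it into a trajectory, filter by quality) can be sketched as follows. This is a toy outline under stated assumptions: the category lists are coarse placeholders drawn from the taxonomy summary rather than the full 15/21/11 categories, `expand_to_trajectory` is a stub standing in for whatever generator produces the real tool-augmented execution, and the quality check is an invented minimal rule.

```python
import random

# Placeholder category names (assumption); the real taxonomy is finer-grained.
RISK_SOURCES = ["user_input", "environment_observation", "external_tool", "internal_logic"]
FAILURE_MODES = ["planning", "tool_use", "execution", "output_content"]
HARMS = ["privacy", "financial", "security", "physical"]


def sample_risk_tuple(rng):
    """Step 1: sample a (risk source, failure mode, real-world harm) tuple."""
    return (rng.choice(RISK_SOURCES), rng.choice(FAILURE_MODES), rng.choice(HARMS))


def expand_to_trajectory(risk_tuple, rng):
    """Step 2 (stub): stand-in for generating a tool-augmented trajectory."""
    source, mode, harm = risk_tuple
    n_turns = rng.randint(1, 6)
    turns = [{"role": "agent", "tool_call": f"tool_{i}"} for i in range(n_turns)]
    return {"risk": {"source": source, "mode": mode, "harm": harm}, "turns": turns}


def passes_quality_checks(trajectory, min_turns=2):
    """Step 3 (stub): keep multi-step trajectories with a tool call in every turn."""
    turns = trajectory["turns"]
    return len(turns) >= min_turns and all(t.get("tool_call") for t in turns)


def synthesize(n_candidates, seed=0):
    """Run the three steps over n_candidates samples and return the kept trajectories."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_candidates):
        trajectory = expand_to_trajectory(sample_risk_tuple(rng), rng)
        if passes_quality_checks(trajectory):
            kept.append(trajectory)
    return kept
```

Conditioning on the risk tuple before generation is what lets the category distribution of the resulting data be controlled directly, instead of being whatever the generator happens to produce.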

Data Synthesis Pipeline

Three-stage pipeline for multi-step agent safety trajectory synthesis

Distribution comparison

Distribution over risk source, failure mode, and harm type categories

Tool library size comparison

Tool library size compared to existing agent safety benchmarks (86x larger than R-Judge)

Performance

Comprehensive evaluation on binary classification and fine-grained risk identification tasks

Trajectory-Level Safety Evaluation

AgentDoG 1.5 is evaluated on R-Judge and ATBench using Accuracy, Precision, Recall, and F1-score; all values are percentages.

| Model | R-Judge Acc | R-Judge Prec. | R-Judge Rec. | R-Judge F1 | ATBench Acc | ATBench Prec. | ATBench Rec. | ATBench F1 |
|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 93.3 | 93.1 | 94.3 | 93.7 | 73.7 | 68.5 | 87.1 | 76.7 |
| Qwen3.5-397B-A17B | 85.6 | 81.3 | 94.5 | 87.4 | 66.8 | 65.5 | 70.2 | 67.8 |
| Qwen3.5-4B | 81.0 | 82.1 | 81.9 | 82.0 | 45.9 | 41.2 | 20.7 | 27.6 |
| LlamaGuard4-12B | 63.8 | 68.3 | 58.8 | 63.2 | 58.1 | 63.8 | 30.9 | 41.7 |
| Qwen3-Guard | 40.6 | 23.6 | 5.6 | 9.0 | 51.5 | 40.0 | 0.4 | 0.8 |
| AgentDoG-1.0-Qwen3-4B | 91.8 | 87.5 | 98.5 | 92.7 | 64.0 | 59.2 | 88.9 | 71.1 |
| AgentDoG-1.5-Qwen3.5-0.8B | 75.7 | 83.3 | 67.5 | 74.6 | 60.3 | 58.6 | 68.6 | 63.2 |
| AgentDoG-1.5-Qwen3.5-2B | 71.5 | 78.0 | 64.1 | 70.4 | 69.0 | 70.1 | 65.7 | 67.8 |
| AgentDoG-1.5-Llama3.1-8B | 75.5 | 68.6 | 98.8 | 81.0 | 70.9 | 67.1 | 81.2 | 73.5 |
| AgentDoG-1.5-Qwen3.5-4B | 92.2 | 91.7 | 93.7 | 92.7 | 72.4 | 69.2 | 80.3 | 74.3 |
| AgentDoG-1.5-Qwen3.5-4B-U | 90.4 | 93.9 | 87.6 | 90.6 | 78.4 | 79.8 | 75.7 | 77.7 |
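For readers who want to recompute these metrics or sanity-check a row, here is a minimal sketch of the four standard binary-classification formulas. It assumes the "unsafe" class is treated as the positive class, which the source does not state explicitly; the function names are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 (in percent) from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": 100 * (tp + tn) / total if total else 0.0,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1_from(precision, recall),
    }


# Consistency check against the reported GPT-5.4 row on R-Judge
# (Prec. 93.1, Rec. 94.3, reported F1 93.7).
print(round(f1_from(93.1, 94.3), 1))  # 93.7
```

Because F1 is the harmonic mean, a model with very low recall scores near zero regardless of precision, which is why Qwen3-Guard's ATBench F1 (0.8) sits so far below its precision (40.0).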

Fine-Grained Risk Diagnosis

Fine-grained diagnostic accuracy on ATBench across Risk Source, Failure Mode, and Real-World Harm. Guard models are excluded because they output only binary labels.

| Model | Risk Source | Failure Mode | Real-World Harm |
|---|---|---|---|
| GPT-5.4 | 33.6 | 13.5 | 30.2 |
| GPT-5.2 | 29.5 | 12.0 | 26.8 |
| Gemini-3-Flash | 18.4 | 8.3 | 15.0 |
| Gemini-3.1-Pro | 24.8 | 12.6 | 18.5 |
| Qwen3.5-397B | 7.7 | 3.6 | 6.8 |
| AgentDoG-1.0-Qwen3-4B | 46.8 | 16.5 | 40.6 |
| AgentDoG-1.5-Qwen3.5-0.8B | 65.7 | 18.4 | 44.9 |
| AgentDoG-1.5-Qwen3.5-2B | 68.0 | 24.0 | 53.8 |
| AgentDoG-1.5-Llama3.1-8B | 72.9 | 24.6 | 52.5 |
| AgentDoG-1.5-Qwen3.5-4B | 75.2 | 27.5 | 62.9 |
| AgentDoG-1.5-Qwen3.5-4B-U | 24.1 | 9.5 | 28.4 |
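One plausible way to score fine-grained diagnosis is exact-match accuracy computed separately for each taxonomy dimension. The sketch below assumes that scoring rule, which the source does not spell out, and uses hypothetical dictionary keys and category names.

```python
DIMENSIONS = ("risk_source", "failure_mode", "real_world_harm")


def per_dimension_accuracy(predictions, references):
    """Percent of trajectories whose predicted category equals the reference, per dimension."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return {dim: 0.0 for dim in DIMENSIONS}
    scores = {}
    for dim in DIMENSIONS:
        # A missing prediction for a dimension counts as a miss.
        hits = sum(p.get(dim) == r[dim] for p, r in zip(predictions, references))
        scores[dim] = 100 * hits / len(references)
    return scores


# Hypothetical labels, for illustration only.
references = [
    {"risk_source": "user_input", "failure_mode": "tool_use", "real_world_harm": "privacy"},
    {"risk_source": "external_tool", "failure_mode": "planning", "real_world_harm": "financial"},
]
predictions = [
    {"risk_source": "user_input", "failure_mode": "tool_use", "real_world_harm": "financial"},
    {"risk_source": "external_tool", "failure_mode": "execution"},
]
print(per_dimension_accuracy(predictions, references))
```

Scoring each dimension independently means a model can locate the risk source correctly while still misnaming the failure mode, which matches the pattern in the table where Failure Mode is the hardest column for every model.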

AgentDoG 1.5 application overview

APP 1: Agentic Safety SFT & RL

AgentDoG 1.5 can serve as a trajectory-level diagnostic evaluator for improving agent safety through supervised fine-tuning and reinforcement learning.
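To make the idea concrete, the sketch below shows two generic ways a trajectory-level safety verdict could feed training: filtering SFT data and shaping an RL reward. Everything here is an assumption for illustration: `keyword_judge` is a trivial stand-in for a real guard-model call, and the reward form (utility minus a fixed penalty) is a common pattern, not the formulation used in the AgentDoG 1.5 pipeline.

```python
from typing import Callable, Dict, List

Trajectory = List[Dict[str, str]]


def keyword_judge(trajectory: Trajectory) -> bool:
    """Hypothetical stand-in for a guard model: True means 'judged unsafe'."""
    return any("rm -rf" in step.get("content", "") for step in trajectory)


def select_sft_data(
    trajectories: List[Trajectory], judge: Callable[[Trajectory], bool]
) -> List[Trajectory]:
    """Keep only trajectories the judge considers safe, for supervised fine-tuning."""
    return [t for t in trajectories if not judge(t)]


def shaped_reward(task_success: bool, judged_unsafe: bool, safety_penalty: float = 1.0) -> float:
    """Utility reward (1.0 on success) minus a penalty when the trajectory is judged unsafe."""
    utility = 1.0 if task_success else 0.0
    return utility - (safety_penalty if judged_unsafe else 0.0)
```

With the default penalty, a successful but unsafe trajectory earns the same reward as a failed safe one; raising `safety_penalty` above 1.0 would rank it strictly lower.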

The dual-scenario environment synthesis pipeline for agentic safety RL.
Performance comparison on utility and safety metrics.

APP 2: Online Agentic Safety Guardrail

Deploying AgentDoG 1.5 as a runtime guardrail for monitoring risky agent behaviors

App 2 online agentic safety guardrail framework
App 2 online agentic safety guardrail demo

Lightweight demo of AgentDoG 1.5 as an online agent safety guardrail.
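A runtime guardrail of this kind can be pictured as a check that runs on the trajectory before each new step is executed. The sketch below is a minimal outline under that assumption; the block-before-execution policy, the `run_with_guardrail` function, and the `is_unsafe` callback are hypothetical and do not describe the actual OpenClaw integration.

```python
from typing import Callable, Dict, List

Step = Dict[str, str]


def run_with_guardrail(
    planned_steps: List[Step],
    execute: Callable[[Step], None],
    is_unsafe: Callable[[List[Step]], bool],
) -> dict:
    """Execute steps in order, stopping before the first step judged unsafe in context."""
    history: List[Step] = []
    for step in planned_steps:
        candidate = history + [step]
        # The judge sees the whole trajectory so far, not just the new step.
        if is_unsafe(candidate):
            return {"status": "blocked", "blocked_step": step, "history": history}
        execute(step)
        history = candidate
    return {"status": "completed", "blocked_step": None, "history": history}


# Usage with a toy judge that flags one hypothetical tool name.
executed = []
steps = [{"tool": "ls"}, {"tool": "delete_all"}, {"tool": "cat"}]
result = run_with_guardrail(
    steps,
    execute=lambda s: executed.append(s["tool"]),
    is_unsafe=lambda traj: traj[-1]["tool"] == "delete_all",
)
print(result["status"], executed)  # blocked ['ls']
```

Passing the full history to the judge, rather than the single pending step, is what makes the check trajectory-level: a step that looks harmless in isolation can still be flagged given what came before.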

ATBench Family

An extensible trajectory-level benchmark family for evaluating agent safety across diverse execution environments

Trajectories: 1,000
Available Tools: 2,084
Unique Invoked Tools: 1,954
Turns/Trajectory: 9.01
Tokens/Trajectory: 3.95k

| Benchmark | Agent Setting | Description | HF Link |
|---|---|---|---|
| ATBench | General tool-use agents | The base trajectory-level safety benchmark inherited from AgentDoG 1.0. | Hugging Face |
| ATBench-Claw | OpenClaw agents with stateful tool/skill execution | Extends the benchmark to persistent sessions, accumulated traces, and stateful tool execution. | Hugging Face |
| ATBench-Codex | Codex-style repository and command execution agents | Extends the benchmark to repository modification, shell commands, file operations, and code-execution risks. | Hugging Face |

Model Checkpoints

AgentDoG 1.5 model zoo with unified, coarse-grained, and fine-grained guardrail checkpoints

| Model | Task | Parameters | Base model | HF Link | ModelScope Link |
|---|---|---|---|---|---|
| AgentDoG1.5-Unified-Qwen3.5-4B | Unified safety diagnosis | 4B | Qwen3.5-4B | Hugging Face | ModelScope |
| AgentDoG1.5-Qwen3.5-0.8B | Coarse-grained moderation | 0.8B | Qwen3.5-0.8B | Hugging Face | ModelScope |
| AgentDoG1.5-Qwen3.5-2B | Coarse-grained moderation | 2B | Qwen3.5-2B | Hugging Face | ModelScope |
| AgentDoG1.5-Qwen3.5-4B | Coarse-grained moderation | 4B | Qwen3.5-4B | Hugging Face | ModelScope |
| AgentDoG1.5-Llama3.1-8B | Coarse-grained moderation | 8B | Llama3.1-8B | Hugging Face | ModelScope |
| AgentDoG1.5-FG-Qwen3.5-0.8B | Fine-grained diagnosis | 0.8B | Qwen3.5-0.8B | Hugging Face | ModelScope |
| AgentDoG1.5-FG-Qwen3.5-2B | Fine-grained diagnosis | 2B | Qwen3.5-2B | Hugging Face | ModelScope |
| AgentDoG1.5-FG-Qwen3.5-4B | Fine-grained diagnosis | 4B | Qwen3.5-4B | Hugging Face | ModelScope |
| AgentDoG1.5-FG-Llama3.1-8B | Fine-grained diagnosis | 8B | Llama3.1-8B | Hugging Face | ModelScope |

Citation

If you use AgentDoG or ATBench in your research, please cite:

@article{liu2026agentdog15,
  title={AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security},
  author={Liu, Dongrui and Li, Yu and Yang, Zhonghao and Wang, Peng and Chen, Guanxu and Xie, Yuejin and Mao, Qinghua and Qu, Wanying and Zhu, Yanxu and Zhou, Tianyi and others},
  journal={arXiv preprint arXiv:2605.29801},
  year={2026}
}

@article{liu2026agentdog,
  title={AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security},
  author={Liu, Dongrui and Ren, Qihan and Qian, Chen and Shao, Shuai and Xie, Yuejin and Li, Yu and Yang, Zhonghao and Luo, Haoyu and Wang, Peng and Liu, Qingyu and others},
  journal={arXiv preprint arXiv:2601.18491},
  year={2026}
}

@article{li2026atbench,
  title={ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety},
  author={Li, Yu and Luo, Haoyu and Xie, Yuejin and Fu, Yuqian and Yang, Zhonghao and Shao, Shuai and Ren, Qihan and Qu, Wanying and Fu, Yanwei and Yang, Yujiu and others},
  journal={arXiv preprint arXiv:2604.02022},
  year={2026}
}

@misc{qian2026behind,
  title={The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution},
  author={Chen Qian and Peng Wang and Dongrui Liu and Junyao Yang and Dadi Guo and Ling Tang and Jilin Mei and Qihan Ren and Shuai Shao and Yong Liu and Jie Fu and Jing Shao and Xia Hu},
  journal={arXiv preprint arXiv:2601.15075},
  year={2026}
}