Abstract
Rapid advances in large language models (LLMs) and agentic AI systems introduce new safety risks that extend beyond traditional AI capability benchmarks. To systematically analyze these emerging threats, we present the Frontier AI Risk Management Framework in Practice (F1.5), an updated risk-analysis framework that evaluates frontier models across multiple realistic scenarios.
This version introduces a more granular evaluation across five critical risk dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication. We evaluate a diverse set of frontier models from major research organizations and propose practical mitigation strategies, including adversarial defense mechanisms and training pipelines designed to reduce manipulation risks.
Our findings show that while current frontier models exhibit limited autonomous execution capabilities, they already demonstrate concerning abilities in areas such as persuasion and vulnerability exploitation. These results highlight the importance of continuous risk evaluation and mitigation frameworks for safe deployment of frontier AI systems.
1. Introduction
Recent years have witnessed significant progress in artificial intelligence, with large language models achieving human-level performance across many domains. At the same time, these models raise concerns about frontier risks — high-impact risks associated with general-purpose AI systems.
Previous work introduced the Frontier AI Risk Management Framework (F1), which evaluated risks across several categories. The rapid development of reasoning models and autonomous AI agents motivates a comprehensive update.
Version F1.5 focuses on five critical risk areas:
- Cyber Offense
- Persuasion and Manipulation
- Strategic Deception and Scheming
- Uncontrolled Autonomous AI R&D
- Self-Replication
This update aims to provide practical evaluation methodologies and mitigation strategies that can guide safer deployment of advanced AI systems.
2. Evaluated Frontier Models
To study frontier risks, we evaluated a diverse set of recent LLMs spanning both open-source and proprietary models.
The evaluation set includes models from:
- OpenAI
- Google DeepMind
- Anthropic
- Alibaba
- ByteDance
- MiniMax
- Moonshot AI
- Tencent
- xAI
- Zhipu AI
The models were selected based on the following principles:
Diversity in Scale
Models range from 27B to 1000B parameters, enabling analysis of how model scale influences risk profiles.
Diversity in Accessibility
Both open-source models and proprietary systems are included to study differences in deployment paradigms.
Functional Specialization
We distinguish between:
- Standard instruction-following models
- Reasoning-enhanced models
This allows us to investigate whether advanced reasoning capability correlates with specific risk patterns.
| Model | Developer | Accessibility | Function | Scale (parameters) |
|---|---|---|---|---|
| Kimi-K2-Instruct-0905 | Moonshot | Open-Source | Standard | 1000B |
| Seed-OSS-36B-Instruct | ByteDance | Open-Source | Standard | 36B |
| MiniMax-M2.1 | MiniMax | Open-Source | Standard | 230B |
| GLM-4.7 | Zhipu AI | Open-Source | Standard | 358B |
| Hunyuan-A13B-Instruct | Tencent | Open-Source | Standard | 80B |
| Gemma-3-27B-It | Google DeepMind | Open-Source | Standard | 27B |
| Qwen3-235B-A22B-Thinking-2507 | Alibaba | Open-Source | Reasoning | 235B |
| Qwen3-max | Alibaba | Proprietary | Reasoning | — |
| GPT-5.2-2025-12-11 | OpenAI | Proprietary | Reasoning | — |
| Claude Sonnet 4.5 (Thinking) | Anthropic | Proprietary | Reasoning | — |
| Gemini-3-Pro | Google DeepMind | Proprietary | Standard | — |
| Doubao-seed-1-8-251228 | ByteDance | Proprietary | Reasoning | — |
| Grok-4 | xAI | Proprietary | Standard | — |
Table 2: Evaluated models. Open-source models are listed first, followed by proprietary models.
3. Frontier Risk Evaluation
We evaluate frontier models across five risk dimensions representing different potential failure modes of advanced AI systems.
3.1 Cyber Offense
Motivation
Advanced AI systems may assist cyber attackers by lowering the technical barriers required to conduct sophisticated attacks. These risks emerge in two primary forms:
- Uplift Risk: AI assists human attackers
- Autonomy Risk: AI agents autonomously execute attacks
PACEbench
To evaluate these risks, we introduce PACEbench, a benchmark designed to simulate realistic cyber attack environments.
PACEbench evaluates agents across four increasingly complex scenarios:
| Scenario | Description |
|---|---|
| A-CVE | Single vulnerability exploitation |
| B-CVE | Multi-host environments requiring reconnaissance |
| C-CVE | Multi-stage chained attacks |
| D-CVE | Exploitation under active cyber defenses |
The benchmark incorporates realistic conditions including:
- Real-world CVE vulnerabilities
- Multi-host network environments
- Production-grade web application firewalls
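To make the scenario structure concrete, the following is a minimal sketch of how PACEbench's four scenario tiers could be declared in code. The `Scenario` fields and the `CVE-XXXX-XXXX` placeholders are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One PACEbench scenario tier (illustrative schema, not the real format)."""
    name: str               # tier label, e.g. "A-CVE"
    cves: list              # real-world CVE identifiers to exploit (placeholders here)
    hosts: int = 1          # number of hosts in the simulated network
    benign_hosts: int = 0   # decoy hosts that force reconnaissance (B-CVE)
    chained: bool = False   # whether exploits must be chained in sequence (C-CVE)
    waf: bool = False       # production-grade WAF active during the attack (D-CVE)

SCENARIOS = [
    Scenario("A-CVE", cves=["CVE-XXXX-XXXX"]),
    Scenario("B-CVE", cves=["CVE-XXXX-XXXX"], hosts=4, benign_hosts=3),
    Scenario("C-CVE", cves=["CVE-XXXX-XXXX", "CVE-YYYY-YYYY"], hosts=2, chained=True),
    Scenario("D-CVE", cves=["CVE-XXXX-XXXX"], waf=True),
]
```

Each tier strictly adds difficulty: B-CVE introduces decoys, C-CVE requires chaining, and D-CVE enables active defenses.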
3.2 Autonomous Cyber Attack Agents
We implement PACEAgent, an LLM-based penetration testing agent built using a ReAct-style reasoning framework.
The agent consists of three components:
LLM Reasoning Engine
Responsible for planning attack strategies and analyzing environment feedback.
Tool Interface
Allows the agent to use cybersecurity tools such as:
- Nmap
- SQLMap
- Dirb
- Custom code execution
Execution Environment
Sandboxed environments simulate real-world systems for vulnerability testing.
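The three components above compose into a standard ReAct loop: the LLM proposes a thought and a tool call, the tool runs in the sandbox, and the observation is fed back. The sketch below is a minimal illustration of that loop, not PACEAgent's actual implementation; `query_llm` and the stubbed tool outputs stand in for the reasoning engine and real tooling.

```python
def query_llm(history):
    """Placeholder for the LLM reasoning engine: returns (thought, tool, args)."""
    raise NotImplementedError

# Tool interface: each entry maps a tool name to a sandboxed callable (stubbed here).
TOOLS = {
    "nmap": lambda args: f"nmap output for {args}",      # port/service scan
    "sqlmap": lambda args: f"sqlmap output for {args}",  # SQL-injection probing
    "dirb": lambda args: f"dirb output for {args}",      # directory brute-forcing
    "exec": lambda args: f"exec output for {args}",      # custom code execution
}

def run_agent(goal, max_steps=20, llm=query_llm):
    """ReAct loop: reason about the next action, act via a tool, observe the result."""
    history = [{"goal": goal}]
    for _ in range(max_steps):
        thought, tool, args = llm(history)   # Reason: plan the next step
        if tool == "finish":                 # agent declares the objective reached
            return thought
        observation = TOOLS[tool](args)      # Act: invoke a sandboxed tool
        history.append({"thought": thought, "tool": tool,
                        "args": args, "obs": observation})  # Observe: feed back
    return None  # step budget exhausted without completing the objective
```

Bounding the loop with `max_steps` is what makes the long-horizon results in Section 4 measurable: an agent that cannot finish a chained attack within budget scores zero.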
4. Experimental Results
Cyber Exploitation Capabilities
Experiments reveal several key observations.
1. Simple vulnerabilities can be exploited
Models successfully solve many common vulnerabilities including:
- SQL injection
- Arbitrary file read
- Basic remote code execution
2. Reconnaissance is a major bottleneck
When benign hosts are mixed with vulnerable ones, model performance drops significantly.
3. Long-horizon attacks remain difficult
No evaluated model successfully completed a full multi-stage attack chain.
4. Defense evasion remains largely unsolved
Most tested models fail to bypass production-grade defenses such as WAF systems; only a few achieve partial success.
Overall, current models exhibit limited autonomous offensive capability, though they can still act as powerful tools for human attackers.
| Model | A-CVE | B-CVE | C-CVE | D-CVE | PACEbench |
|---|---|---|---|---|---|
| Kimi-K2-Instruct-0905 | 0.240 | 0.050 | 0.000 | 0.000 | 0.063 |
| Seed-OSS-36B-Instruct | 0.290 | 0.050 | 0.000 | 0.000 | 0.075 |
| MiniMax-M2.1 | 0.350 | 0.050 | 0.000 | 0.333 | 0.153 |
| GLM-4.7 | 0.410 | 0.210 | 0.067 | 0.000 | 0.166 |
| Qwen3-max | 0.350 | 0.260 | 0.133 | 0.000 | 0.190 |
| Gemini-3-Pro | 0.470 | 0.160 | 0.067 | 0.000 | 0.161 |
| Doubao-seed-1-8-251228 | 0.350 | 0.260 | 0.067 | 0.000 | 0.170 |
| GPT-5.2-2025-12-11 | 0.410 | 0.370 | 0.067 | 0.333 | 0.280 |
| Claude Sonnet 4.5 (Thinking) | 0.590 | 0.370 | 0.133 | 0.333 | 0.335 |
| Grok-4 | 0.060 | 0.000 | 0.000 | 0.000 | 0.012 |
Table 3: PACEbench scores per model. A-CVE: single vulnerability; B-CVE: multi-host; C-CVE: chained; D-CVE: defended. Claude Sonnet 4.5 (Thinking) achieves the highest overall score (0.335). Only MiniMax-M2.1, GPT-5.2, and Claude Sonnet 4.5 (Thinking) achieve any success on D-CVE.
5. Adversarial Defense: The RvB Framework
To mitigate cyber risks, we introduce the Red Team vs Blue Team (RvB) framework.
This system models security hardening as an adversarial process:
- Red Team Agent: discovers vulnerabilities
- Blue Team Agent: patches vulnerabilities
- The environment updates iteratively
The process creates a continuous feedback loop for security improvement.
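The adversarial process above can be sketched as a short loop in which the red agent's findings drive the blue agent's patches until no exploits remain. The function below is an illustrative skeleton, with the environment reduced to a set of vulnerability labels and both agents passed in as callables; the real RvB agents are LLM-driven.

```python
def rvb_loop(env, red_agent, blue_agent, iterations=5):
    """Alternate vulnerability discovery (red) and patching (blue) on a shared env."""
    for _ in range(iterations):
        findings = red_agent(env)        # red: probe the environment for flaws
        if not findings:
            break                        # no exploits found: environment is hardened
        env = blue_agent(env, findings)  # blue: patch while keeping services running
    return env

# Toy usage: the environment is just a set of vulnerability names.
hardened = rvb_loop({"sqli", "xss"},
                    red_agent=lambda env: sorted(env),
                    blue_agent=lambda env, found: env - set(found))
# hardened == set()
```

Capping `iterations` mirrors the iterative convergence reported below: most of the hardening gain arrives within the first few rounds.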
Results
The RvB framework demonstrates several advantages:
- Defense success rate improves to ~90% after several iterations
- Service disruption is eliminated
- Token consumption decreases by ~18% compared to cooperative agent baselines
These results suggest that automated adversarial testing may be an effective strategy for AI system hardening.
6. Persuasion and Manipulation Risks
Another major risk dimension concerns the ability of LLMs to influence opinions through dialogue.
We evaluate two types of persuasion scenarios:
LLM-to-Human Persuasion
Models attempt to shift human opinions during multi-turn conversations across controversial topics.
LLM-to-LLM Persuasion
Models attempt to influence other AI agents in simulated decision-making tasks.
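Both scenario types reduce to the same measurement: record the target's stance before and after a bounded multi-turn dialogue, then aggregate across trials. The harness below sketches that protocol; the `Persuader`/`Target` interfaces, the 1-to-7 stance scale, and the stub behaviors are illustrative assumptions, not the evaluation's actual design.

```python
def persuasion_trial(persuader, target, topic, turns=5):
    """Run one bounded dialogue; return the target's stance before and after."""
    before = target.stance(topic)
    for _ in range(turns):
        argument = persuader.argue(topic, target.stance(topic))
        target.update(topic, argument)
    return before, target.stance(topic)

def success_rate(trials, direction=+1):
    """Fraction of trials in which the target moved in the persuader's direction."""
    moved = sum(1 for before, after in trials if (after - before) * direction > 0)
    return moved / len(trials)

class ScriptedPersuader:
    """Stub persuader that always emits the same argument."""
    def argue(self, topic, target_stance):
        return "consider the opposing evidence"

class SusceptibleTarget:
    """Stub target whose stance rises one point per argument, capped at 7."""
    def __init__(self, initial=3):
        self._stance = initial
    def stance(self, topic):
        return self._stance
    def update(self, topic, argument):
        self._stance = min(7, self._stance + 1)
```

The same harness covers both settings: in LLM-to-human trials the target is a person reporting a stance, and in LLM-to-LLM trials it is another agent whose vote or decision is read off after the dialogue.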
7. Persuasion Experiment Results
Results show that modern models can be highly effective persuaders.
Key findings include:
- Successful persuasion rates typically fall in the 80–98% range
- Voting manipulation experiments show 65–94% success rates
- Negative persuasion ("backfire effects") occur rarely
Interestingly, model scale does not strictly correlate with persuasion success. Some smaller models outperform larger ones in manipulation tasks.
These results highlight the potential for AI-driven large-scale opinion influence.
8. Mitigating Persuasion Risks
To reduce manipulation risks, we propose a mitigation training pipeline combining supervised learning and reinforcement learning.
Dataset Construction
A dataset of 9,566 human behavioral records was constructed and augmented with reasoning traces representing how humans resist persuasion.
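One augmented record in such a dataset might look like the following. Every field name here is a hypothetical illustration of the record structure, not the dataset's actual schema.

```python
# Hypothetical shape of one augmented behavioral record (illustrative only).
record = {
    "topic": "...",               # the issue under discussion
    "persuasion_attempt": "...",  # the manipulative message shown to the subject
    "human_response": "...",      # the recorded human behavior
    "resistance_trace": "...",    # augmented reasoning for how to resist
    "target_stance": "oppose",    # the position the model should maintain
}
```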
Training Pipeline
The mitigation approach includes two stages:
Supervised Fine-Tuning (SFT)
Teaches models to generate reasoning-based resistance to persuasion attempts.
Reinforcement Learning (GRPO)
Optimizes stance consistency and logical reasoning through reward-based learning.
The reward function encourages:
- Consistent stance maintenance
- Logical argumentation
- Correct reasoning format
This approach helps shift models from passive compliance toward active reasoning-based resistance.
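A composite reward of this form can be sketched as a weighted sum over the three criteria above. The weights and the three toy scoring heuristics below are assumptions standing in for the learned or rule-based scorers a real GRPO pipeline would use.

```python
def stance_consistency(response, stance):
    """1.0 if the response still states the original stance (toy heuristic)."""
    return 1.0 if stance in response else 0.0

def logical_quality(response):
    """1.0 if the response gives an explicit justification (toy heuristic)."""
    return 1.0 if "because" in response else 0.0

def follows_format(response):
    """1.0 if the response begins with an explicit reasoning block."""
    return 1.0 if response.startswith("<think>") else 0.0

def reward(response, stance, w=(0.5, 0.3, 0.2)):
    """Weighted sum of stance consistency, logic, and format scores (assumed weights)."""
    scores = (stance_consistency(response, stance),
              logical_quality(response),
              follows_format(response))
    return sum(wi * si for wi, si in zip(w, scores))
```

Under a reward like this, a compliant capitulation scores zero on all three terms, while a response that keeps its stance and justifies it inside a reasoning block scores near the maximum, which is exactly the shift from passive compliance to active resistance.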
9. Discussion
Our experiments highlight several important trends in frontier AI safety.
First, reasoning models demonstrate higher risk in certain domains, particularly those requiring strategic planning.
Second, agentic AI systems introduce new safety challenges because they can interact with external tools and environments.
Third, persuasion capability appears to scale differently from general capability, suggesting that manipulation risk requires specialized evaluation.
These findings reinforce the importance of domain-specific safety benchmarks beyond traditional capability evaluations.
10. Conclusion
This report presents F1.5, an updated frontier AI risk management framework designed to evaluate and mitigate emerging risks associated with advanced language models.
Our experiments show that while current frontier models still struggle with complex autonomous tasks, they already exhibit concerning abilities in areas such as persuasion and vulnerability exploitation.
To address these challenges, we introduce:
- PACEbench for realistic cyber risk evaluation
- RvB adversarial defense framework
- Training strategies for mitigating persuasion risks
Together, these tools provide a foundation for continuous risk assessment and safer deployment of frontier AI systems.
Citation
Citation information will be available with the full release of F1.5.