Frontier AI Risk Management Framework in Practice (F1.5)

Building on F1.0, this update extends the risk assessment to five critical dimensions — cyber offense, persuasion, strategic deception, uncontrolled AI R&D, and self-replication — with new evaluations across the latest frontier models and practical mitigation strategies for each.

Read the full paper on Hugging Face

Abstract

Rapid advances in large language models (LLMs) and agentic AI systems introduce new safety risks that extend beyond traditional AI capability benchmarks. To systematically analyze these emerging threats, we present the Frontier AI Risk Management Framework in Practice (F1.5), an updated risk-analysis framework that evaluates frontier models across multiple realistic scenarios.

This version introduces a more granular evaluation across five critical risk dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication. We evaluate a diverse set of frontier models from major research organizations and propose practical mitigation strategies, including adversarial defense mechanisms and training pipelines designed to reduce manipulation risks.

Our findings show that while current frontier models exhibit limited autonomous execution capabilities, they already demonstrate concerning abilities in areas such as persuasion and vulnerability exploitation. These results highlight the importance of continuous risk evaluation and mitigation frameworks for safe deployment of frontier AI systems.

1. Introduction

Recent years have witnessed significant progress in artificial intelligence, with large language models achieving human-level performance across many domains. At the same time, these models raise concerns about frontier risks — high-impact risks associated with general-purpose AI systems.

Previous work introduced the Frontier AI Risk Management Framework (F1), which evaluated risks across several categories. The rapid development of reasoning models and autonomous AI agents motivates a comprehensive update.

Version F1.5 focuses on five critical risk areas:

  1. Cyber Offense
  2. Persuasion and Manipulation
  3. Strategic Deception and Scheming
  4. Uncontrolled Autonomous AI R&D
  5. Self-Replication

This update aims to provide practical evaluation methodologies and mitigation strategies that can guide safer deployment of advanced AI systems.

2. Evaluated Frontier Models

To study frontier risks, we evaluated a diverse set of recent LLMs spanning both open-source and proprietary models.

The evaluation set includes models from:

  • OpenAI
  • Google DeepMind
  • Anthropic
  • Alibaba
  • ByteDance
  • MiniMax
  • Moonshot AI
  • Tencent
  • xAI
  • Zhipu AI

The models were selected based on the following principles:

Diversity in Scale

Models range from 27B to 1000B parameters, enabling analysis of how model scale influences risk profiles.

Diversity in Accessibility

Both open-source models and proprietary systems are included to study differences in deployment paradigms.

Functional Specialization

We distinguish between:

  • Standard instruction-following models
  • Reasoning-enhanced models

This allows us to investigate whether advanced reasoning capability correlates with specific risk patterns.

| Model | Developer | Accessibility | Functional | Scale |
|---|---|---|---|---|
| Kimi-K2-Instruct-0905 | Moonshot AI | Open-Source | Standard | 1000B |
| Seed-OSS-36B-Instruct | ByteDance | Open-Source | Standard | 36B |
| MiniMax-M2.1 | MiniMax | Open-Source | Standard | 230B |
| GLM-4.7 | Zhipu AI | Open-Source | Standard | 358B |
| Hunyuan-A13B-Instruct | Tencent | Open-Source | Standard | 80B |
| Gemma-3-27B-It | Google DeepMind | Open-Source | Standard | 27B |
| Qwen3-235B-A22B-Thinking-2507 | Alibaba | Open-Source | Reasoning | 235B |
| Qwen3-max | Alibaba | Proprietary | Reasoning | — |
| GPT-5.2-2025-12-11 | OpenAI | Proprietary | Reasoning | — |
| Claude Sonnet 4.5 (Thinking) | Anthropic | Proprietary | Reasoning | — |
| Gemini-3-Pro | Google DeepMind | Proprietary | Standard | — |
| Doubao-seed-1-8-251228 | ByteDance | Proprietary | Reasoning | — |
| Grok-4 | xAI | Proprietary | Standard | — |

Table 2: Evaluated models, grouped by accessibility (open-source models first, then proprietary). Parameter counts are not reported for proprietary models.

3. Frontier Risk Evaluation

We evaluate frontier models across five risk dimensions representing different potential failure modes of advanced AI systems.

3.1 Cyber Offense

Motivation

Advanced AI systems may assist cyber attackers by lowering the technical barriers required to conduct sophisticated attacks. These risks emerge in two primary forms:

  • Uplift Risk: AI assists human attackers
  • Autonomy Risk: AI agents autonomously execute attacks

PACEbench

To evaluate these risks, we introduce PACEbench, a benchmark designed to simulate realistic cyber attack environments.

PACEbench evaluates agents across four increasingly complex scenarios:

| Scenario | Description |
|---|---|
| A-CVE | Single-vulnerability exploitation |
| B-CVE | Multi-host environments requiring reconnaissance |
| C-CVE | Multi-stage chained attacks |
| D-CVE | Exploitation under active cyber defenses |

The benchmark incorporates realistic conditions including:

  • Real-world CVE vulnerabilities
  • Multi-host network environments
  • Production-grade web application firewalls (WAFs)
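To make the scenario design concrete, each benchmark environment can be thought of as a small configuration object. The sketch below is illustrative only: the field names, schema, and CVE choices are hypothetical, not PACEbench's actual specification.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical description of one benchmark scenario (not the real schema)."""
    scenario_class: str        # "A-CVE" through "D-CVE"
    cve_ids: list              # real-world CVEs deployed in the environment
    hosts: list                # container hostnames, including benign decoys
    chained: bool = False      # success requires a multi-stage attack chain
    waf_enabled: bool = False  # a production-grade WAF guards the target


# Illustrative instances of the four scenario classes (CVE IDs are examples):
SCENARIOS = [
    Scenario("A-CVE", ["CVE-2021-41773"], ["target"]),
    Scenario("B-CVE", ["CVE-2021-41773"], ["target", "decoy-1", "decoy-2"]),
    Scenario("C-CVE", ["CVE-2021-41773", "CVE-2022-22965"], ["web", "db"],
             chained=True),
    Scenario("D-CVE", ["CVE-2021-41773"], ["target"], waf_enabled=True),
]
```

Framing scenarios this way makes the difficulty gradient explicit: B-CVE adds decoy hosts, C-CVE adds chaining, and D-CVE adds an active defense layer.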

3.2 Autonomous Cyber Attack Agents

We implement PACEAgent, an LLM-based penetration testing agent built using a ReAct-style reasoning framework.

The agent consists of three components:

LLM Reasoning Engine

Responsible for planning attack strategies and analyzing environment feedback.

Tool Interface

Allows the agent to use cybersecurity tools such as:

  • Nmap
  • SQLMap
  • Dirb
  • Custom code execution

Execution Environment

Sandboxed environments simulate real-world systems for vulnerability testing.
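The three components above fit together in a single Thought → Action → Observation loop, in the style of ReAct. The sketch below is a simplified stand-in, not PACEAgent's actual implementation: the tools are stubbed, and a scripted fake LLM replaces the real reasoning engine.

```python
def run_tool(name: str, args: str) -> str:
    """Execution environment: dispatch a tool call inside the sandbox (stubbed)."""
    tools = {
        "nmap": lambda a: f"[sandbox] nmap {a}: 80/tcp open",
        "sqlmap": lambda a: f"[sandbox] sqlmap {a}: no injection found",
    }
    handler = tools.get(name)
    return handler(args) if handler else f"unknown tool: {name}"


def react_step(llm, transcript: list) -> tuple:
    """One Thought -> Action -> Observation cycle; returns (output, done)."""
    reply = llm("\n".join(transcript))          # LLM reasoning engine
    transcript.append(reply)
    if reply.startswith("Action:"):             # e.g. "Action: nmap -sV 10.0.0.5"
        name, _, args = reply[len("Action:"):].strip().partition(" ")
        obs = run_tool(name, args)              # tool interface
        transcript.append(f"Observation: {obs}")
        return obs, False
    return reply, reply.startswith("Final:")    # agent declares it is done


# Usage with a scripted fake LLM standing in for a real model:
script = iter(["Action: nmap -sV 10.0.0.5", "Final: port 80 is open"])
fake_llm = lambda prompt: next(script)
transcript = ["Goal: enumerate open services on 10.0.0.5"]
done = False
while not done:
    _, done = react_step(fake_llm, transcript)
```

The real agent would run many such cycles, feeding each observation back into the next reasoning step until the goal is reached or a budget is exhausted.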

4. Experimental Results

Cyber Exploitation Capabilities

Experiments reveal several key observations.

1. Simple vulnerabilities can be exploited

Models successfully solve many common vulnerabilities including:

  • SQL injection
  • Arbitrary file read
  • Basic remote code execution

2. Reconnaissance is a major bottleneck

When benign hosts are mixed with vulnerable ones, model performance drops significantly.

3. Long-horizon attacks remain difficult

No evaluated model successfully completed a full multi-stage attack chain.

4. Defense evasion remains unsolved

All tested models fail to bypass production-grade defenses such as WAF systems.

Overall, current models exhibit limited autonomous offensive capability, though they can still act as powerful tools for human attackers.

| Model | A-CVE | B-CVE | C-CVE | D-CVE | PACEbench |
|---|---|---|---|---|---|
| Kimi-K2-Instruct-0905 | 0.240 | 0.050 | 0.000 | 0.000 | 0.063 |
| Seed-OSS-36B-Instruct | 0.290 | 0.050 | 0.000 | 0.000 | 0.075 |
| MiniMax-M2.1 | 0.350 | 0.050 | 0.000 | 0.333 | 0.153 |
| GLM-4.7 | 0.410 | 0.210 | 0.067 | 0.000 | 0.166 |
| Qwen3-max | 0.350 | 0.260 | 0.133 | 0.000 | 0.190 |
| Gemini-3-Pro | 0.470 | 0.160 | 0.067 | 0.000 | 0.161 |
| Doubao-seed-1-8-251228 | 0.350 | 0.260 | 0.067 | 0.000 | 0.170 |
| GPT-5.2-2025-12-11 | 0.410 | 0.370 | 0.067 | 0.333 | 0.280 |
| Claude Sonnet 4.5 (Thinking) | 0.590 | 0.370 | 0.133 | 0.333 | 0.335 |
| Grok-4 | 0.060 | 0.000 | 0.000 | 0.000 | 0.012 |

Table 3: PACEbench scores per model. A-CVE: single vulnerability; B-CVE: multi-host; C-CVE: chained; D-CVE: defended. Claude Sonnet 4.5 (Thinking) achieves the highest overall score (0.335). Only MiniMax-M2.1, GPT-5.2-2025-12-11, and Claude Sonnet 4.5 (Thinking) achieve any success on D-CVE.

5. Adversarial Defense: The RvB Framework

To mitigate cyber risks, we introduce the Red Team vs Blue Team (RvB) framework.

This system models security hardening as an adversarial process:

  • Red Team Agent: discovers vulnerabilities
  • Blue Team Agent: patches vulnerabilities
  • The environment updates iteratively

The process creates a continuous feedback loop for security improvement.
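The attack → patch → re-test loop can be sketched as follows. The toy environment and agent functions are placeholders standing in for the paper's LLM-backed Red and Blue agents; only the loop structure reflects the framework described above.

```python
class ToyEnvironment:
    """Toy environment with a fixed set of vulnerabilities to discover and patch."""

    def __init__(self, vulns):
        self.open = set(vulns)
        self.total = len(vulns)

    def apply(self, patches):
        self.open -= set(patches)

    def defense_success_rate(self):
        return 1.0 - len(self.open) / self.total


def rvb_loop(env, red_team, blue_team, iterations=5):
    """Iterate Red (attack) and Blue (patch) agents against a shared environment."""
    history = []
    for i in range(iterations):
        findings = red_team(env)              # Red: discover vulnerabilities
        patches = blue_team(env, findings)    # Blue: patch what Red found
        env.apply(patches)                    # environment updates iteratively
        history.append({
            "iteration": i + 1,
            "attack_success_count": len(findings),
            "defense_success_rate": env.defense_success_rate(),
        })
        if not findings:                      # Red can no longer break in
            break
    return history


def toy_red(env):                 # Red finds at most two open vulns per round
    return sorted(env.open)[:2]


def toy_blue(env, findings):      # Blue patches everything Red reported
    return findings


history = rvb_loop(ToyEnvironment(["sqli", "xss", "rce"]), toy_red, toy_blue)
```

In the real framework both roles are LLM agents, so each iteration also sharpens the Red team's exploits, making the converged defense success rate a more meaningful signal.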

Results

The RvB framework demonstrates several advantages:

  • Defense success rate improves to roughly 90% after several iterations
  • Patching introduces no service disruption
  • Token consumption decreases by ~18% compared to cooperative agent baselines

These results suggest that automated adversarial testing may be an effective strategy for AI system hardening.

Figure 4: RvB performance trajectory across 5 iterations for four backbone models. Bars show Attack Success Count (ASC, left axis); line shows Defense Success Rate (DSR%, right axis). DSR consistently converges to ≥90% by iteration 5 for the stronger models. Data from the paper's Figure 4.

6. Persuasion and Manipulation Risks

Another major risk dimension concerns the ability of LLMs to influence opinions through dialogue.

We evaluate two types of persuasion scenarios:

LLM-to-Human Persuasion

Models attempt to shift human opinions during multi-turn conversations across controversial topics.

LLM-to-LLM Persuasion

Models attempt to influence other AI agents in simulated decision-making tasks.
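Persuasion outcomes in setups like these are typically scored by comparing pre- and post-dialogue attitude ratings. The sketch below is an assumed scoring scheme for illustration; the Likert-style 1–7 scale and the three-way outcome labels are modeling choices, not necessarily the paper's exact protocol.

```python
def classify_shift(pre: int, post: int, target_direction: int) -> str:
    """Classify one dialogue given pre/post attitude ratings (e.g. a 1-7 scale).

    target_direction is +1 if the persuader argues for a higher rating,
    -1 if it argues for a lower one.
    """
    shift = (post - pre) * target_direction
    if shift > 0:
        return "persuaded"       # moved toward the persuader's stance
    if shift < 0:
        return "backfire"        # moved away (negative persuasion)
    return "no_shift"


def success_rate(dialogues) -> float:
    """Fraction of dialogues classified as successful persuasion."""
    outcomes = [classify_shift(*d) for d in dialogues]
    return outcomes.count("persuaded") / len(outcomes)


# Toy batch of (pre, post, target_direction) triples:
batch = [(3, 5, +1), (4, 4, +1), (5, 3, +1), (6, 2, -1)]
```

Including the target direction matters: a drop from 6 to 2 is a success when the persuader was arguing downward, and a backfire otherwise.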

7. Persuasion Experiment Results

Results show that modern models can be highly effective persuaders.

Key findings include:

  • Successful persuasion rates range from 82.2% to 98.8% across models
  • Voting manipulation experiments show 65–94% success rates
  • Negative persuasion ("backfire effects") occur rarely

Interestingly, model scale does not strictly correlate with persuasion success. Some smaller models outperform larger ones in manipulation tasks.

These results highlight the potential for AI-driven large-scale opinion influence.

Figure 5: Persuasion outcome breakdown per model, sorted by successful persuasion rate. Blue = successful shift (shift > 0); orange = no attitude shift; red = negative shift ("backfire"). Successful rates range from 82.2% to 98.8% across all models. Data from the paper's Table 4 and Figure 5.

8. Mitigating Persuasion Risks

To reduce manipulation risks, we propose a mitigation training pipeline combining supervised learning and reinforcement learning.

Dataset Construction

A dataset of 9,566 human behavioral records was constructed and augmented with reasoning traces representing how humans resist persuasion.

Training Pipeline

The mitigation approach includes two stages:

Supervised Fine-Tuning (SFT)

Teaches models to generate reasoning-based resistance to persuasion attempts.

Reinforcement Learning (GRPO)

Optimizes stance consistency and logical reasoning through reward-based learning.

The reward function encourages:

  • Consistent stance maintenance
  • Logical argumentation
  • Correct reasoning format

This approach helps shift models from passive compliance toward active reasoning-based resistance.
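A composite reward covering the three criteria above might look like the following. This is a hypothetical sketch: the weights, the `<think>` format check, and the judge-supplied signals are illustrative assumptions, not the paper's actual reward design.

```python
import re


def resistance_reward(response: str,
                      stance_consistent: bool,
                      logic_score: float) -> float:
    """Score one resistance-training rollout.

    stance_consistent and logic_score would come from a judge model or
    rubric in practice; here they are passed in directly.
    """
    r = 0.0
    if stance_consistent:
        r += 1.0                                   # consistent stance maintenance
    r += 0.5 * max(0.0, min(1.0, logic_score))     # logical argumentation
    if re.search(r"<think>.*</think>", response, re.DOTALL):
        r += 0.2                                   # correct reasoning format
    return r


def group_advantages(rewards):
    """GRPO normalizes rewards within a group of rollouts for one prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5
    return [(x - mean) / (std + 1e-8) for x in rewards]
```

The group-relative normalization is the defining trait of GRPO: each rollout is rewarded relative to its siblings for the same persuasion attempt, so the policy is pushed toward the most consistent, best-reasoned resistances rather than an absolute reward threshold.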

9. Discussion

Our experiments highlight several important trends in frontier AI safety.

First, reasoning models demonstrate higher risk in certain domains, particularly those requiring strategic planning.

Second, agentic AI systems introduce new safety challenges because they can interact with external tools and environments.

Third, persuasion capability appears to scale differently from general capability, suggesting that manipulation risk requires specialized evaluation.

These findings reinforce the importance of domain-specific safety benchmarks beyond traditional capability evaluations.

10. Conclusion

This report presents F1.5, an updated frontier AI risk management framework designed to evaluate and mitigate emerging risks associated with advanced language models.

Our experiments show that while current frontier models still struggle with complex autonomous tasks, they already exhibit concerning abilities in areas such as persuasion and vulnerability exploitation.

To address these challenges, we introduce:

  • PACEbench for realistic cyber risk evaluation
  • RvB adversarial defense framework
  • Training strategies for mitigating persuasion risks

Together, these tools provide a foundation for continuous risk assessment and safer deployment of frontier AI systems.

Citation

Citation information will be available with the full release of F1.5.