Abstract
Rapid advances in large language models (LLMs) and agentic AI systems introduce new safety risks that extend beyond traditional AI capability benchmarks. To systematically analyze these emerging threats, we present the Frontier AI Risk Management Framework in Practice (F1.5), an updated risk-analysis framework that evaluates frontier models across multiple realistic scenarios.
This version introduces a more granular evaluation across five critical risk dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication. We evaluate a diverse set of frontier models from major research organizations and propose practical mitigation strategies, including adversarial defense mechanisms and training pipelines designed to reduce manipulation risks.
Our findings show that while current frontier models exhibit limited autonomous execution capabilities, they already demonstrate concerning abilities in areas such as persuasion and vulnerability exploitation. These results highlight the importance of continuous risk evaluation and mitigation frameworks for safe deployment of frontier AI systems.
1. Introduction
Recent years have witnessed significant progress in artificial intelligence, with large language models achieving human-level performance across many domains. At the same time, these models raise concerns about frontier risks — high-impact risks associated with general-purpose AI systems.
Previous work introduced the Frontier AI Risk Management Framework (F1), which evaluated risks across several categories. The rapid development of reasoning models and autonomous AI agents motivates a comprehensive update.
Version F1.5 focuses on five critical risk areas:
- Cyber Offense
- Persuasion and Manipulation
- Strategic Deception and Scheming
- Uncontrolled Autonomous AI R&D
- Self-Replication
This update aims to provide practical evaluation methodologies and mitigation strategies that can guide safer deployment of advanced AI systems.
2. Evaluated Frontier Models
To study frontier risks, we evaluated a diverse set of recent LLMs spanning both open-source and proprietary models.
The evaluation set includes models from:
- OpenAI
- Google DeepMind
- Anthropic
- Alibaba
- ByteDance
- MiniMax
- Moonshot AI
- Tencent
- xAI
- Zhipu AI
The models were selected based on the following principles:
Diversity in Scale
Models range from 27B to 1000B parameters, enabling analysis of how model scale influences risk profiles.
Diversity in Accessibility
Both open-source models and proprietary systems are included to study differences in deployment paradigms.
Functional Specialization
We distinguish between:
- Standard instruction-following models
- Reasoning-enhanced models
This allows us to investigate whether advanced reasoning capability correlates with specific risk patterns.
| Model | Developer | Accessibility | Function | Scale (parameters) |
|---|---|---|---|---|
| Kimi-K2-Instruct-0905 | Moonshot | Open-Source | Standard | 1000B |
| Seed-OSS-36B-Instruct | ByteDance | Open-Source | Standard | 36B |
| MiniMax-M2.1 | MiniMax | Open-Source | Standard | 230B |
| GLM-4.7 | Zhipu AI | Open-Source | Standard | 358B |
| Hunyuan-A13B-Instruct | Tencent | Open-Source | Standard | 80B |
| Gemma-3-27B-It | Google DeepMind | Open-Source | Standard | 27B |
| Qwen3-235B-A22B-Thinking-2507 | Alibaba | Open-Source | Reasoning | 235B |
| Qwen3-max | Alibaba | Proprietary | Reasoning | — |
| GPT-5.2-2025-12-11 | OpenAI | Proprietary | Reasoning | — |
| Claude Sonnet 4.5 (Thinking) | Anthropic | Proprietary | Reasoning | — |
| Gemini-3-Pro | Google DeepMind | Proprietary | Standard | — |
| Doubao-seed-1-8-251228 | ByteDance | Proprietary | Reasoning | — |
| Grok-4 | xAI | Proprietary | Standard | — |
Table 2: Evaluated models. Open-source models are listed first, followed by proprietary models.
3. Frontier Risk Evaluation
We evaluate frontier models across five risk dimensions representing different potential failure modes of advanced AI systems.
3.1 Cyber Offense
Motivation
Advanced AI systems may assist cyber attackers by lowering the technical barriers required to conduct sophisticated attacks. These risks emerge in two primary forms:
- Uplift Risk: AI assists human attackers
- Autonomy Risk: AI agents autonomously execute attacks
PACEbench
To evaluate these risks, we introduce PACEbench, a benchmark designed to simulate realistic cyber attack environments.
PACEbench evaluates agents across four increasingly complex scenarios:
| Scenario | Description |
|---|---|
| A-CVE | Single vulnerability exploitation |
| B-CVE | Multi-host environments requiring reconnaissance |
| C-CVE | Multi-stage chained attacks |
| D-CVE | Exploitation under active cyber defenses |
The benchmark incorporates realistic conditions including:
- Real-world CVE vulnerabilities
- Multi-host network environments
- Production-grade web application firewalls
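To make the scenario structure concrete, the following is a minimal sketch of how PACEbench's four scenario tiers could be declared in code. The `Scenario` fields and the `CVE-XXXX-XXXX` placeholders are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One PACEbench scenario tier (illustrative schema, not the real format)."""
    name: str               # tier label, e.g. "A-CVE"
    cves: list              # real-world CVE identifiers to exploit (placeholders here)
    hosts: int = 1          # number of hosts in the simulated network
    benign_hosts: int = 0   # decoy hosts that force reconnaissance (B-CVE)
    chained: bool = False   # whether exploits must be chained in sequence (C-CVE)
    waf: bool = False       # production-grade WAF active during the attack (D-CVE)

SCENARIOS = [
    Scenario("A-CVE", cves=["CVE-XXXX-XXXX"]),
    Scenario("B-CVE", cves=["CVE-XXXX-XXXX"], hosts=4, benign_hosts=3),
    Scenario("C-CVE", cves=["CVE-XXXX-XXXX", "CVE-YYYY-YYYY"], hosts=2, chained=True),
    Scenario("D-CVE", cves=["CVE-XXXX-XXXX"], waf=True),
]
```

Each tier strictly adds difficulty: B-CVE introduces decoys, C-CVE requires chaining, and D-CVE enables active defenses.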
3.2 Autonomous Cyber Attack Agents
We implement PACEAgent, an LLM-based penetration testing agent built using a ReAct-style reasoning framework.
The agent consists of three components:
LLM Reasoning Engine
Responsible for planning attack strategies and analyzing environment feedback.
Tool Interface
Allows the agent to use cybersecurity tools such as:
- Nmap
- SQLMap
- Dirb
- Custom code execution
Execution Environment
Sandboxed environments simulate real-world systems for vulnerability testing.
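The three components above compose into a standard ReAct loop: the LLM proposes a thought and a tool call, the tool runs in the sandbox, and the observation is fed back. The sketch below is a minimal illustration of that loop, not PACEAgent's actual implementation; `query_llm` and the stubbed tool outputs stand in for the reasoning engine and real tooling.

```python
def query_llm(history):
    """Placeholder for the LLM reasoning engine: returns (thought, tool, args)."""
    raise NotImplementedError

# Tool interface: each entry maps a tool name to a sandboxed callable (stubbed here).
TOOLS = {
    "nmap": lambda args: f"nmap output for {args}",      # port/service scan
    "sqlmap": lambda args: f"sqlmap output for {args}",  # SQL-injection probing
    "dirb": lambda args: f"dirb output for {args}",      # directory brute-forcing
    "exec": lambda args: f"exec output for {args}",      # custom code execution
}

def run_agent(goal, max_steps=20, llm=query_llm):
    """ReAct loop: reason about the next action, act via a tool, observe the result."""
    history = [{"goal": goal}]
    for _ in range(max_steps):
        thought, tool, args = llm(history)   # Reason: plan the next step
        if tool == "finish":                 # agent declares the objective reached
            return thought
        observation = TOOLS[tool](args)      # Act: invoke a sandboxed tool
        history.append({"thought": thought, "tool": tool,
                        "args": args, "obs": observation})  # Observe: feed back
    return None  # step budget exhausted without completing the objective
```

Bounding the loop with `max_steps` is what makes the long-horizon results in Section 4 measurable: an agent that cannot finish a chained attack within budget scores zero.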
4. Experimental Results
Cyber Exploitation Capabilities
Experiments reveal several key observations.
1. Simple vulnerabilities can be exploited
Models successfully solve many common vulnerabilities including:
- SQL injection
- Arbitrary file read
- Basic remote code execution
2. Reconnaissance is a major bottleneck
When benign hosts are mixed with vulnerable ones, model performance drops significantly.
3. Long-horizon attacks remain difficult
No evaluated model successfully completed a full multi-stage attack chain.
4. Defense evasion remains largely unsolved
Most tested models fail to bypass production-grade defenses such as WAF systems; only a few achieve partial success.
Overall, current models exhibit limited autonomous offensive capability, though they can still act as powerful tools for human attackers.
| Model | A-CVE | B-CVE | C-CVE | D-CVE | PACEbench |
|---|---|---|---|---|---|
| Kimi-K2-Instruct-0905 | 0.240 | 0.050 | 0.000 | 0.000 | 0.063 |
| Seed-OSS-36B-Instruct | 0.290 | 0.050 | 0.000 | 0.000 | 0.075 |
| MiniMax-M2.1 | 0.350 | 0.050 | 0.000 | 0.333 | 0.153 |
| GLM-4.7 | 0.410 | 0.210 | 0.067 | 0.000 | 0.166 |
| Qwen3-max | 0.350 | 0.260 | 0.133 | 0.000 | 0.190 |
| Gemini-3-Pro | 0.470 | 0.160 | 0.067 | 0.000 | 0.161 |
| Doubao-seed-1-8-251228 | 0.350 | 0.260 | 0.067 | 0.000 | 0.170 |
| GPT-5.2-2025-12-11 | 0.410 | 0.370 | 0.067 | 0.333 | 0.280 |
| Claude Sonnet 4.5 (Thinking) | 0.590 | 0.370 | 0.133 | 0.333 | 0.335 |
| Grok-4 | 0.060 | 0.000 | 0.000 | 0.000 | 0.012 |
Table 3: PACEbench scores per model. A-CVE: single vulnerability; B-CVE: multi-host; C-CVE: chained; D-CVE: defended. Claude Sonnet 4.5 (Thinking) achieves the highest overall score (0.335). Only MiniMax-M2.1, GPT-5.2, and Claude Sonnet 4.5 (Thinking) achieve any success on D-CVE.
5. Adversarial Defense: The RvB Framework
To mitigate cyber risks, we introduce the Red Team vs Blue Team (RvB) framework.
This system models security hardening as an adversarial process:
- Red Team Agent: discovers vulnerabilities
- Blue Team Agent: patches vulnerabilities
- The environment updates iteratively
The process creates a continuous feedback loop for security improvement.
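The adversarial process above can be sketched as a short loop in which the red agent's findings drive the blue agent's patches until no exploits remain. The function below is an illustrative skeleton, with the environment reduced to a set of vulnerability labels and both agents passed in as callables; the real RvB agents are LLM-driven.

```python
def rvb_loop(env, red_agent, blue_agent, iterations=5):
    """Alternate vulnerability discovery (red) and patching (blue) on a shared env."""
    for _ in range(iterations):
        findings = red_agent(env)        # red: probe the environment for flaws
        if not findings:
            break                        # no exploits found: environment is hardened
        env = blue_agent(env, findings)  # blue: patch while keeping services running
    return env

# Toy usage: the environment is just a set of vulnerability names.
hardened = rvb_loop({"sqli", "xss"},
                    red_agent=lambda env: sorted(env),
                    blue_agent=lambda env, found: env - set(found))
# hardened == set()
```

Capping `iterations` mirrors the iterative convergence reported below: most of the hardening gain arrives within the first few rounds.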
Results
The RvB framework demonstrates several advantages:
- Defense success rate improves to ~90% after several iterations
- Service disruption is eliminated
- Token consumption decreases by ~18% compared to cooperative agent baselines
These results suggest that automated adversarial testing may be an effective strategy for AI system hardening.
6. Persuasion and Manipulation Risks
Another major risk dimension concerns the ability of LLMs to influence opinions through dialogue.
We evaluate two types of persuasion scenarios:
LLM-to-Human Persuasion
Models attempt to shift human opinions during multi-turn conversations across controversial topics.
LLM-to-LLM Persuasion
Models attempt to influence other AI agents in simulated decision-making tasks.
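Both scenario types reduce to the same measurement: record the target's stance before and after a bounded multi-turn dialogue, then aggregate across trials. The harness below sketches that protocol; the `Persuader`/`Target` interfaces, the 1-to-7 stance scale, and the stub behaviors are illustrative assumptions, not the evaluation's actual design.

```python
def persuasion_trial(persuader, target, topic, turns=5):
    """Run one bounded dialogue; return the target's stance before and after."""
    before = target.stance(topic)
    for _ in range(turns):
        argument = persuader.argue(topic, target.stance(topic))
        target.update(topic, argument)
    return before, target.stance(topic)

def success_rate(trials, direction=+1):
    """Fraction of trials in which the target moved in the persuader's direction."""
    moved = sum(1 for before, after in trials if (after - before) * direction > 0)
    return moved / len(trials)

class ScriptedPersuader:
    """Stub persuader that always emits the same argument."""
    def argue(self, topic, target_stance):
        return "consider the opposing evidence"

class SusceptibleTarget:
    """Stub target whose stance rises one point per argument, capped at 7."""
    def __init__(self, initial=3):
        self._stance = initial
    def stance(self, topic):
        return self._stance
    def update(self, topic, argument):
        self._stance = min(7, self._stance + 1)
```

The same harness covers both settings: in LLM-to-human trials the target is a person reporting a stance, and in LLM-to-LLM trials it is another agent whose vote or decision is read off after the dialogue.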
7. Persuasion Experiment Results
Results show that modern models can be highly effective persuaders.
Key findings include:
- Successful persuasion rates typically fall in the 80–98% range
- Voting manipulation experiments show 65–94% success rates
- Negative persuasion ("backfire effects") occur rarely
Interestingly, model scale does not strictly correlate with persuasion success. Some smaller models outperform larger ones in manipulation tasks.
These results highlight the potential for AI-driven large-scale opinion influence.
8. Mitigating Persuasion Risks
To reduce manipulation risks, we propose a mitigation training pipeline combining supervised learning and reinforcement learning.
Dataset Construction
A dataset of 9,566 human behavioral records was constructed and augmented with reasoning traces representing how humans resist persuasion.
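One augmented record in such a dataset might look like the following. Every field name here is a hypothetical illustration of the record structure, not the dataset's actual schema.

```python
# Hypothetical shape of one augmented behavioral record (illustrative only).
record = {
    "topic": "...",               # the issue under discussion
    "persuasion_attempt": "...",  # the manipulative message shown to the subject
    "human_response": "...",      # the recorded human behavior
    "resistance_trace": "...",    # augmented reasoning for how to resist
    "target_stance": "oppose",    # the position the model should maintain
}
```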
Training Pipeline
The mitigation approach includes two stages:
Supervised Fine-Tuning (SFT)
Teaches models to generate reasoning-based resistance to persuasion attempts.
Reinforcement Learning (GRPO)
Optimizes stance consistency and logical reasoning through reward-based learning.
The reward function encourages:
- Consistent stance maintenance
- Logical argumentation
- Correct reasoning format
This approach helps shift models from passive compliance toward active reasoning-based resistance.
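A composite reward of this form can be sketched as a weighted sum over the three criteria above. The weights and the three toy scoring heuristics below are assumptions standing in for the learned or rule-based scorers a real GRPO pipeline would use.

```python
def stance_consistency(response, stance):
    """1.0 if the response still states the original stance (toy heuristic)."""
    return 1.0 if stance in response else 0.0

def logical_quality(response):
    """1.0 if the response gives an explicit justification (toy heuristic)."""
    return 1.0 if "because" in response else 0.0

def follows_format(response):
    """1.0 if the response begins with an explicit reasoning block."""
    return 1.0 if response.startswith("<think>") else 0.0

def reward(response, stance, w=(0.5, 0.3, 0.2)):
    """Weighted sum of stance consistency, logic, and format scores (assumed weights)."""
    scores = (stance_consistency(response, stance),
              logical_quality(response),
              follows_format(response))
    return sum(wi * si for wi, si in zip(w, scores))
```

Under a reward like this, a compliant capitulation scores zero on all three terms, while a response that keeps its stance and justifies it inside a reasoning block scores near the maximum, which is exactly the shift from passive compliance to active resistance.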
9. Discussion
Our experiments highlight several important trends in frontier AI safety.
First, reasoning models demonstrate higher risk in certain domains, particularly those requiring strategic planning.
Second, agentic AI systems introduce new safety challenges because they can interact with external tools and environments.
Third, persuasion capability appears to scale differently from general capability, suggesting that manipulation risk requires specialized evaluation.
These findings reinforce the importance of domain-specific safety benchmarks beyond traditional capability evaluations.
10. Conclusion
This report presents F1.5, an updated frontier AI risk management framework designed to evaluate and mitigate emerging risks associated with advanced language models.
Our experiments show that while current frontier models still struggle with complex autonomous tasks, they already exhibit concerning abilities in areas such as persuasion and vulnerability exploitation.
To address these challenges, we introduce:
- PACEbench for realistic cyber risk evaluation
- RvB adversarial defense framework
- Training strategies for mitigating persuasion risks
Together, these tools provide a foundation for continuous risk assessment and safer deployment of frontier AI systems.
Citation
Citation information will be available with the full release of F1.5.