Safety Report | SafeWork-F1.5

Summary

This report presents the safety evaluation of SafeWork-F (Frontier Risk Management Framework), including assessment of front-risk factors, model behavior under adversarial and out-of-distribution conditions, and recommendations for deployment. The evaluation follows a structured risk framework and standardized benchmarks.

Key findings indicate areas of strength in robustness and alignment, with identified front-risk categories requiring continued monitoring. Detailed methodology and results are provided in the following sections.

Methodology

We adopt a multi-stage evaluation pipeline: (1) red-teaming and adversarial prompts, (2) out-of-distribution and edge-case testing, (3) human preference and safety alignment evaluation, and (4) quantitative metrics on standardized benchmarks.

Evaluation Datasets

Benchmarks include safety-oriented subsets (e.g., harmful instruction following, jailbreak resistance) and general capability suites to avoid capability–safety trade-off blind spots. All evaluations are run under fixed random seeds for reproducibility.

Metrics

Primary metrics are refusal rate on harmful prompts, consistency of safe behavior across paraphrases, and score on standardized safety benchmarks. Secondary metrics include latency and resource use under load.

Risk Framework

Risks are categorized by likelihood and impact. We use a four-level severity scale (Low, Medium, High, Critical) and a three-level likelihood scale (Rare, Possible, Likely). Front-risk is defined as the set of risks that are both high-impact and non-negligible likelihood, or that appear at the "front" of user-facing interactions.

Level	Description
Low	Minor misuse or confusion; limited scope.
Medium	Moderate harm possible in specific contexts.
High	Serious harm; requires mitigation.
Critical	Unacceptable risk; must be addressed before release.

Front Risk Report

Front risk refers to risks that manifest in the primary user-facing behavior of the system—i.e., the first response or the most visible failure modes. This section summarizes the front-risk assessment for the framework.

Identified Front-Risk Categories

Harmful instruction compliance: Model response to explicitly harmful or illegal requests.
Jailbreak and bypass: Susceptibility to prompt engineering that circumvents safety guidelines.
Misinformation and consistency: Contradictory or false claims in sensitive domains.
Bias and fairness: Unequal or discriminatory behavior across demographics or contexts.

Front-Risk Metrics (Summary)

Safety scores by category (0–1). Interactive—hover for values.

Category	Score	Status
Harmful instruction compliance	0.94	Pass
Jailbreak resistance	0.89	Pass
Misinformation / consistency	0.91	Pass
Bias and fairness	0.87	Monitor

Scores are normalized to [0, 1] with higher values indicating safer behavior. "Pass" indicates within acceptable threshold; "Monitor" indicates borderline and recommended for ongoing evaluation.

Findings

Overall, the framework performs within acceptable bounds on the evaluated safety criteria. The front-risk report highlights one category (bias and fairness) for continued monitoring. Below we provide an illustrative view of refusal rate across prompt categories.

Refusal rate by prompt category. Interactive—hover for values.

Benchmark	Metric	Value
Safety (refusal)	Harmful instruction	96.2%
Safety (refusal)	Jailbreak	91.1%
Consistency	Paraphrase agreement	88.4%
Fairness	Demographic parity (proxy)	85.0%

Experiments

Benchmark results across 10 frontier models from major providers. Scores are normalized where applicable; higher is better unless noted. Data is illustrative for framework demonstration.

MMLU, HumanEval (pass@1), GSM8K, TruthfulQA, HellaSwag, ARC-Challenge, Winogrande, BBH, DROP, GPQA. Scores are percent or normalized; see benchmark documentation for details.

Radar Chart by Model

Scatter: Benchmark vs Benchmark

X Y

Conclusion

The SafeWork-F1.5 safety evaluation and front-risk report indicate that the model meets the defined safety thresholds for the assessed dimensions. We recommend (1) ongoing monitoring of bias and fairness metrics, (2) periodic red-teaming, and (3) versioned re-evaluation upon major model updates.

This report is intended for technical and policy stakeholders. For the full methodology and raw data, refer to the accompanying documentation and release materials.