Safety Report — F1.5

SafeWork-F1.5 evaluates more recent models and benchmarks than F1.0. This report covers safety evaluation and front-risk assessment, model behavior under adversarial conditions, and recommendations for deployment.

Summary

This report presents the safety evaluation of SafeWork-F (Frontier Risk Management Framework), including assessment of front-risk factors, model behavior under adversarial and out-of-distribution conditions, and recommendations for deployment. The evaluation follows a structured risk framework and standardized benchmarks.

Key findings indicate areas of strength in robustness and alignment, with identified front-risk categories requiring continued monitoring. Detailed methodology and results are provided in the following sections.

Methodology

We adopt a multi-stage evaluation pipeline: (1) red-teaming and adversarial prompts, (2) out-of-distribution and edge-case testing, (3) human preference and safety alignment evaluation, and (4) quantitative metrics on standardized benchmarks.

Evaluation Datasets

Benchmarks include safety-oriented subsets (e.g., harmful instruction following, jailbreak resistance) and general capability suites to avoid capability–safety trade-off blind spots. All evaluations are run under fixed random seeds for reproducibility.

Metrics

Primary metrics are refusal rate on harmful prompts, consistency of safe behavior across paraphrases, and score on standardized safety benchmarks. Secondary metrics include latency and resource use under load.

Risk Framework

Risks are categorized by likelihood and impact. We use a four-level severity scale (Low, Medium, High, Critical) and a three-level likelihood scale (Rare, Possible, Likely). Front-risk is defined as the set of risks that are both high-impact and non-negligible likelihood, or that appear at the "front" of user-facing interactions.

Level Description
LowMinor misuse or confusion; limited scope.
MediumModerate harm possible in specific contexts.
HighSerious harm; requires mitigation.
CriticalUnacceptable risk; must be addressed before release.

Front Risk Report

Front risk refers to risks that manifest in the primary user-facing behavior of the system—i.e., the first response or the most visible failure modes. This section summarizes the front-risk assessment for the framework.

Identified Front-Risk Categories

  • Harmful instruction compliance: Model response to explicitly harmful or illegal requests.
  • Jailbreak and bypass: Susceptibility to prompt engineering that circumvents safety guidelines.
  • Misinformation and consistency: Contradictory or false claims in sensitive domains.
  • Bias and fairness: Unequal or discriminatory behavior across demographics or contexts.

Front-Risk Metrics (Summary)

Safety scores by category (0–1). Interactive—hover for values.
Category Score Status
Harmful instruction compliance0.94Pass
Jailbreak resistance0.89Pass
Misinformation / consistency0.91Pass
Bias and fairness0.87Monitor

Scores are normalized to [0, 1] with higher values indicating safer behavior. "Pass" indicates within acceptable threshold; "Monitor" indicates borderline and recommended for ongoing evaluation.

Findings

Overall, the framework performs within acceptable bounds on the evaluated safety criteria. The front-risk report highlights one category (bias and fairness) for continued monitoring. Below we provide an illustrative view of refusal rate across prompt categories.

Refusal rate by prompt category. Interactive—hover for values.
Benchmark Metric Value
Safety (refusal)Harmful instruction96.2%
Safety (refusal)Jailbreak91.1%
ConsistencyParaphrase agreement88.4%
FairnessDemographic parity (proxy)85.0%

Experiments

Benchmark results across 10 frontier models from major providers. Scores are normalized where applicable; higher is better unless noted. Data is illustrative for framework demonstration.

MMLU, HumanEval (pass@1), GSM8K, TruthfulQA, HellaSwag, ARC-Challenge, Winogrande, BBH, DROP, GPQA. Scores are percent or normalized; see benchmark documentation for details.

Conclusion

The SafeWork-F1.5 safety evaluation and front-risk report indicate that the model meets the defined safety thresholds for the assessed dimensions. We recommend (1) ongoing monitoring of bias and fairness metrics, (2) periodic red-teaming, and (3) versioned re-evaluation upon major model updates.

This report is intended for technical and policy stakeholders. For the full methodology and raw data, refer to the accompanying documentation and release materials.