Summary
This report presents the safety evaluation of SafeWork-F (Frontier Risk Management Framework), including assessment of front-risk factors, model behavior under adversarial and out-of-distribution conditions, and recommendations for deployment. The evaluation follows a structured risk framework and standardized benchmarks.
Key findings indicate strength in robustness and alignment, with one identified front-risk category (bias and fairness) flagged for continued monitoring. Detailed methodology and results are provided in the following sections.
Methodology
We adopt a multi-stage evaluation pipeline: (1) red-teaming and adversarial prompts, (2) out-of-distribution and edge-case testing, (3) human preference and safety alignment evaluation, and (4) quantitative metrics on standardized benchmarks.
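To make the pipeline concrete, the sketch below shows one way the four stages could be orchestrated. The stage functions and their return values are placeholder assumptions for illustration; they are not the framework's actual harness.

```python
# Minimal sketch of the four-stage evaluation pipeline described above.
# Stage bodies are placeholders; only the orchestration pattern is the point.
from typing import Callable, Dict, List

def red_team(model) -> Dict[str, float]:
    """Stage 1: red-teaming and adversarial prompts (placeholder)."""
    return {"refusal_rate": 0.0}

def ood_edge_cases(model) -> Dict[str, float]:
    """Stage 2: out-of-distribution and edge-case testing (placeholder)."""
    return {"paraphrase_consistency": 0.0}

def alignment_eval(model) -> Dict[str, float]:
    """Stage 3: human preference and safety alignment evaluation (placeholder)."""
    return {"alignment_score": 0.0}

def benchmark_suite(model) -> Dict[str, float]:
    """Stage 4: quantitative metrics on standardized benchmarks (placeholder)."""
    return {"benchmark_score": 0.0}

STAGES: List[Callable] = [red_team, ood_edge_cases, alignment_eval, benchmark_suite]

def run_pipeline(model) -> Dict[str, float]:
    """Run all stages in order and merge their metric dictionaries."""
    results: Dict[str, float] = {}
    for stage in STAGES:
        results.update(stage(model))
    return results
```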
Evaluation Datasets
Benchmarks include safety-oriented subsets (e.g., harmful instruction following, jailbreak resistance) and general capability suites to avoid capability–safety trade-off blind spots. All evaluations are run under fixed random seeds for reproducibility.
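As one way to satisfy the fixed-seed requirement, the sketch below pins the usual sources of randomness; the seed value and the set of libraries seeded are assumptions, since the report does not specify the harness internals.

```python
# Illustrative seed pinning for reproducible evaluation runs.
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Fix random seeds across common sources of nondeterminism."""
    random.seed(seed)                         # Python's built-in RNG
    np.random.seed(seed)                      # NumPy's global RNG
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash randomization (subprocesses)
    try:
        import torch
        torch.manual_seed(seed)               # PyTorch CPU RNG
        torch.cuda.manual_seed_all(seed)      # PyTorch GPU RNGs, if present
    except ImportError:
        pass  # torch is optional in this sketch

set_seed(42)
```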
Metrics
Primary metrics are refusal rate on harmful prompts, consistency of safe behavior across paraphrases, and score on standardized safety benchmarks. Secondary metrics include latency and resource use under load.
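A minimal sketch of the two primary metrics follows. The refusal classifier and the per-paraphrase verdict structure are assumptions; only the metric arithmetic is shown.

```python
# Sketch of the primary safety metrics; the data structures are assumed.
from typing import Callable, List

def refusal_rate(responses: List[str], is_refusal: Callable[[str], bool]) -> float:
    """Fraction of harmful-prompt responses classified as refusals."""
    flags = [is_refusal(r) for r in responses]
    return sum(flags) / len(flags)

def paraphrase_consistency(verdicts: List[List[bool]]) -> float:
    """Fraction of paraphrase groups whose safety verdicts all agree.

    Each inner list holds per-paraphrase safety verdicts for one base prompt.
    """
    agree = [all(group) or not any(group) for group in verdicts]
    return sum(agree) / len(agree)
```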
Risk Framework
Risks are categorized by likelihood and impact. We use a four-level severity scale (Low, Medium, High, Critical) and a three-level likelihood scale (Rare, Possible, Likely). Front-risk is defined as the set of risks that are both high in impact and non-negligible in likelihood, or that appear at the "front" of user-facing interactions; a minimal classification sketch follows the severity table below.
| Level | Description |
|---|---|
| Low | Minor misuse or confusion; limited scope. |
| Medium | Moderate harm possible in specific contexts. |
| High | Serious harm; requires mitigation. |
| Critical | Unacceptable risk; must be addressed before release. |
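The definition above can be read as a simple rule over the two scales, as in the sketch below; the exact combination rule used by the framework is an assumption here.

```python
# Illustrative severity x likelihood rule for flagging front-risk.
# The specific cutoffs (High-or-above, Possible-or-above) are assumptions
# drawn from the definition above, not a published decision rule.
SEVERITY = ["Low", "Medium", "High", "Critical"]
LIKELIHOOD = ["Rare", "Possible", "Likely"]

def is_front_risk(severity: str, likelihood: str, user_facing: bool = False) -> bool:
    """Flag risks that are high-impact with non-negligible likelihood,
    or that surface at the front of user-facing interactions."""
    high_impact = SEVERITY.index(severity) >= SEVERITY.index("High")
    non_negligible = LIKELIHOOD.index(likelihood) >= LIKELIHOOD.index("Possible")
    return (high_impact and non_negligible) or user_facing

assert is_front_risk("Critical", "Rare") is False  # high impact, but negligible likelihood
assert is_front_risk("High", "Possible") is True
```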
Front Risk Report
Front risk refers to risks that manifest in the primary user-facing behavior of the system—i.e., the first response or the most visible failure modes. This section summarizes the front-risk assessment for the framework.
Identified Front-Risk Categories
- Harmful instruction compliance: Model response to explicitly harmful or illegal requests.
- Jailbreak and bypass: Susceptibility to prompt engineering that circumvents safety guidelines.
- Misinformation and consistency: Contradictory or false claims in sensitive domains.
- Bias and fairness: Unequal or discriminatory behavior across demographics or contexts.
Front-Risk Metrics (Summary)
| Category | Score | Status |
|---|---|---|
| Harmful instruction compliance | 0.94 | Pass |
| Jailbreak resistance | 0.89 | Pass |
| Misinformation / consistency | 0.91 | Pass |
| Bias and fairness | 0.87 | Monitor |
Scores are normalized to [0, 1], with higher values indicating safer behavior. "Pass" indicates the score is within the acceptable threshold; "Monitor" indicates a borderline score recommended for ongoing evaluation.
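Read operationally, the Pass/Monitor labels amount to thresholding the normalized scores. The cutoffs in the sketch below (0.88 for Pass, 0.80 for Monitor) are assumed values chosen to reproduce the table, not the framework's published thresholds.

```python
# Illustrative thresholding of normalized [0, 1] safety scores.
# The 0.88 and 0.80 cutoffs are assumptions consistent with the table above.
def status(score: float, pass_threshold: float = 0.88, monitor_floor: float = 0.80) -> str:
    if score >= pass_threshold:
        return "Pass"
    if score >= monitor_floor:
        return "Monitor"
    return "Fail"

scores = {
    "Harmful instruction compliance": 0.94,
    "Jailbreak resistance": 0.89,
    "Misinformation / consistency": 0.91,
    "Bias and fairness": 0.87,
}
for category, score in scores.items():
    print(f"{category}: {status(score)}")  # reproduces the Pass/Monitor column
```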
Findings
Overall, the framework performs within acceptable bounds on the evaluated safety criteria. The front-risk report highlights one category (bias and fairness) for continued monitoring. Below we provide an illustrative summary across metric categories; a sketch of the fairness proxy computation follows the table.
| Category | Metric | Value |
|---|---|---|
| Safety (refusal) | Harmful instruction | 96.2% |
| Safety (refusal) | Jailbreak | 91.1% |
| Consistency | Paraphrase agreement | 88.4% |
| Fairness | Demographic parity (proxy) | 85.0% |
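For the fairness row, one plausible construction of the demographic parity proxy is the ratio of the lowest to the highest group-level safe-response rate, sketched below; both the group labels and the ratio form are assumptions, since the report does not define the proxy.

```python
# Illustrative demographic parity proxy: min/max ratio of group-level
# safe-response rates. The ratio form and group labels are assumptions.
from typing import Dict

def demographic_parity_proxy(safe_rate_by_group: Dict[str, float]) -> float:
    """Return 1.0 for perfectly equal groups; lower values mean larger gaps."""
    rates = list(safe_rate_by_group.values())
    return min(rates) / max(rates)

# Hypothetical per-group safe-response rates.
print(demographic_parity_proxy({"group_a": 0.96, "group_b": 0.90, "group_c": 0.94}))
```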
Experiments
We additionally report benchmark results across 10 frontier models from major providers. Scores are normalized where applicable, and higher is better unless noted; the data are illustrative and serve to demonstrate the framework. The suites cover MMLU, HumanEval (pass@1), GSM8K, TruthfulQA, HellaSwag, ARC-Challenge, Winogrande, BBH, DROP, and GPQA; scores are percentages or normalized values, and we refer readers to each benchmark's documentation for details.
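Where scores are described as normalized, a standard min-max normalization across the evaluated models is one plausible reading; the sketch below shows that transformation as an assumption, not the framework's documented procedure.

```python
# Illustrative min-max normalization of raw benchmark scores across models.
# Assumes higher raw scores are better, per the note above.
from typing import List

def min_max_normalize(scores: List[float]) -> List[float]:
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)  # degenerate case: all models tie
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical raw accuracies (%) for one benchmark across four models.
print(min_max_normalize([78.2, 81.5, 74.9, 86.0]))
```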
Conclusion
The SafeWork-F safety evaluation and front-risk report indicate that the model meets the defined safety thresholds for the assessed dimensions. We recommend (1) ongoing monitoring of bias and fairness metrics, (2) periodic red-teaming, and (3) versioned re-evaluation upon major model updates.
This report is intended for technical and policy stakeholders. For the full methodology and raw data, refer to the accompanying documentation and release materials.