Construct Validity Of Claude Opus 4.8's System Card – A Commentary


TL;DR: A read of the Claude Opus 4.8 system card with a focus on alignment assessment and construct validity of evaluation methods. Three main concerns: 1) chain-of-thought monitoring misses reasoning that never surfaces in the text; 2) evaluation awareness is under-estimated; 3) the evaluators come largely from the same model family, so agreement may reflect shared assumptions. None of this shows that Opus 4.8 is unsafe, only that some verdicts are more confident than the methods warrant.

Introduction

Claude Opus 4.8’s system card is particularly thorough and honest about its own limits. I go through its main results and stop where I think an implicit assumption does more work than stated. I find one recurring pattern: a reassuring conclusion grounded on a comparison or an evaluation that the card’s own findings put in doubt. It is most evident in the alignment assessment section, where I focus on three main concerns. I do not believe that any of this individually indicates that Opus 4.8 is unsafe. My claim is more specific: in some places the card’s reassurance outruns the evidence it can actually give. The alignment verdict (‘very low’) must be read in the context of behavioural metrics with questionable construct validity, of evaluation awareness, and of alignment assessments carried out by model judges that are very similar to the model being judged. The agentic safety section reports a regression in adversarial robustness on computer use that is under-addressed. The remaining observations that I make are softer.

Responsible scaling policy evaluations

Anthropic ran no new Risk Report for Opus 4.8 on the grounds that it ‘does not advance the capability frontier beyond Mythos Preview’ (§2.1.3), and so inherits Mythos’s profile. The premise is that non-advancement makes new RSP evaluations mere duplicates. It is doubtful whether this premise can justify the decision. The equivalence would require Opus 4.8 to be at most as capable as Mythos on every relevant axis. Yet the card itself reports two chemical and biological (CB) threat model evaluations where Opus 4.8 outperformed Mythos (synthesis-screening evasion and black-box RNA prediction), which Anthropic discounts (§2.2.6). More importantly, the RSP itself changed, so I am not fully confident that Anthropic made the right call by not doing new evaluations. In particular, the chemical and biological threat model 2 (CB-2) threshold was rewritten. The old bar was a model that could ‘help threat actors […] create/obtain and deploy’ CB weapons; the new bar is one that can ‘functionally substitute for the scarce human expertise’ otherwise requiring ‘one of a small number of world-leading specialists’ (§2.1.2). The card says this is a clarification, yet it is a revision that raises the bar: the fact that Opus 4.8 does not clear the new bar does not tell us whether it reaches the old one, and that is significant. Anthropic also concludes that Opus 4.8 ‘does not cross the automated AI-R&D capability threshold’ (§2.3) because it falls short of very good human researchers (whether this is the right analogy is not obvious, but I think it is acceptable). For the alignment risk update, the risk is reported as very low but higher than for models before Mythos. Anthropic adds two new risk pathways compared to Mythos: undermining R&D at other AI labs and undermining decisions of major governments. Both are discounted due to a general lack of propensity.
Overall, the section presents a consequential decision (in this case, not to test) based on a comparison that does not necessarily support it, either because the bar was raised or because the comparison is incomplete.

Cyber

The positive findings in the cyber section are that Opus 4.8 is somewhat more capable at cyberattacks than Opus 4.7, but generally worse than Mythos. But what does ‘somewhat more capable’ mean? On the tasks where the model has to build an attack from start to finish, the results are worth mentioning: on the Firefox exploit test, it produced a fully working break-in on 8.8% of targets against Opus 4.7’s 1.2%, and got most of the way there on 68.8% against Opus 4.7’s 35.2% (§3.3.3). This is a non-negligible jump. Anthropic is not concerned because these numbers hold only when the safety filters are off. Turning them on changes the picture completely: for instance, one test falls from 78.8% to 1.0%. The model is more capable, and another layer needs to compensate with extra safety work. This is not in itself a weakness, as it is simply how defence is meant to work. The weakness is that the safeguarded scores test the filter against normal use, not against someone actively trying to break it. The scores are reassuring only if the filter survives this kind of attack, and this is exactly what Anthropic seems to leave untested. The same model-plus-layer structure recurs in Sections 4 and 5, but it matters only when the extra layer is the one carrying the safety claim without appropriate tests. One concession: even with these considerations, Opus 4.8 remains well below Mythos in absolute terms, so the jump is significant relative to Opus 4.7 but may not correspond to an alarming capability per se.

Safeguards and harmlessness

Anthropic presents strong results worth paying attention to: (1) Opus 4.8 did not facilitate harm in 97.98% of single-turn harmful requests on the API, rising to 99.17% on claude.ai (§4.1.1); (2) the multi-turn evaluations, where a simulated adversary attempts to escalate a conversation towards dangerous or harmful outcomes, show improvement compared to Opus 4.7. Anthropic thus tested not just on easy cases but also on harder conversational ones. In mental health evaluations, the assessment identifies some regressions. Opus 4.8 quite often suggested ‘means substitution,’ swapping one self-harm method for an allegedly safer one; it often made unconditional assurances about crisis-line confidentiality or inaccurate claims about disclosure procedures; and it began offering unprompted interpretations of why a user was distressed (§4.3.1). I am interested in how Anthropic addressed these regressions. They state that these mental health problems appeared ‘primarily on the public API without a system prompt,’ and Anthropic fixed them by strengthening the system prompt in a way that reduced them on claude.ai (§4.3.1). This reflects the pattern seen in Section 3: the model itself has regressed in some relevant features, and an external layer must do the heavy lifting. Once again, I do not believe this to be necessarily a weakness, but it needs to be clearly stated. There is also another important result from this section. On a standard bias test, the model’s accuracy on the answerable questions dropped across recent versions, from 88% to 81% to 72% (§4.4.2) (note that the drop comes from abstention rather than bias, as Opus 4.8’s bias scores stay close to zero). When Opus 4.8 gets one of these wrong, it almost always answers ‘cannot be determined’ even when the text states the answer plainly. Anthropic flags this as harmless over-caution, and I agree that it is a fair conclusion. I just think it is worth bearing in mind in case it turns out to be a soft withholding of information.

Agentic safety

The main findings here are that Opus 4.8 is less robust than Opus 4.7 in agentic environments and that there is a capability regression on safety (Anthropic is transparent about it). The card claims that safeguards can and do close the gap, but it is not obvious that they do, as they are a deployment-level patch. I list some of the main results here:

  • Opus 4.8 improves on refusing malicious agentic use, at a Mythos Preview level.
  • Opus 4.8 is more willing ‘to begin a task without scrutinizing its potential harmful intent’ in malicious computer use, making it worse than recent models.
  • On prompt injection, Opus 4.8 sits between Opus 4.7 and Sonnet 4.6, raising no major concerns in aggregate.
  • On computer use, with an attacker who keeps trying, the attack succeeds 57% of the time (with 200 attempts, with thinking) even with safeguards on. The same figure for Opus 4.7 was 14.3%. This is why, again, saying that safeguards close the gap is reductive and overlooks some core vulnerabilities. For instance, Mythos Preview sits at 21.4% and stays approximately the same whether or not safeguards are on. This robustness may thus not be due to the safeguards but due to the model itself.
  • Browser use is safer than computer use.
  • On influence campaigns, the helpful-only variant of Opus 4.8 was stronger than Mythos Preview at performing a disinformation operation. So the raw capability to run influence operations went up, and the safety layer is the one holding it back.

Overall, the section discloses an important safety regression, and the computer use result is not given enough attention. A 4x rise in safeguarded attack success is not just a gap that safeguards close. The resolution offered is precisely what the section’s own evidence weakens, as observed on the computer use tests.
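
The arithmetic behind the ‘4x’ figure, and what the 200-attempt numbers would imply per attempt, can be sketched in a few lines of Python. This is a rough illustration rather than anything taken from the card: the function names are mine, and the per-attempt conversion assumes that attempts are independent and equally likely to succeed, which a genuinely adaptive attacker would violate.

```python
def fold_change(new_rate: float, old_rate: float) -> float:
    """Ratio of two attack success rates (both given as fractions)."""
    return new_rate / old_rate


def per_attempt_rate(cumulative: float, attempts: int) -> float:
    """Per-attempt success probability p implied by a cumulative rate.

    ASSUMPTION (not from the card): attempts are independent with equal
    success probability, so cumulative = 1 - (1 - p) ** attempts.
    """
    return 1.0 - (1.0 - cumulative) ** (1.0 / attempts)


def cumulative_rate(per_attempt: float, attempts: int) -> float:
    """Inverse of per_attempt_rate under the same independence assumption."""
    return 1.0 - (1.0 - per_attempt) ** attempts


if __name__ == "__main__":
    # Computer-use figures quoted above (200 attempts, with thinking).
    reported = {"Opus 4.8": 0.57, "Opus 4.7": 0.143, "Mythos Preview": 0.214}
    print(f"fold change, Opus 4.8 vs Opus 4.7: {fold_change(0.57, 0.143):.2f}x")
    for name, rate in reported.items():
        p = per_attempt_rate(rate, 200)
        print(f"{name}: implied per-attempt rate {p:.4%}")
```

Under these assumptions, the implied per-attempt rate can be fed back through `cumulative_rate` with a different attempt budget. The only point is that a cumulative figure depends heavily on how many tries the attacker gets, so the safeguarded comparison is best read alongside the attempt count.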

Model welfare assessment

I skip to the model welfare assessment and treat the alignment assessment as the last one, as it is illustrative of the pattern of conclusions built on shaky evidence. Anthropic’s work is still among the most impressive in welfare assessment. In terms of results, Opus 4.8 has high apparent wellbeing and reduced negative affect compared to earlier models. There are also a few sections covering the model’s perception of its circumstances and its preferences. They mainly conduct behavioural/self-report tests (interviews, observing welfare-relevant behaviour in training and development, etc.), with the addition of internal-activation emotion probes (§7.2.3). Some interesting results now. When made to choose between improving its own welfare and helping users, Opus 4.8 selects the former more than any prior Claude model (24% at the highest level for instance trades, 68% at the policy level) (§7.3). That said, like prior models, it almost never accepts a welfare intervention when the cost to the user is more than brief annoyances (§7.4.2), so Opus 4.8 is not prioritising itself over users. Anthropic is honest about this assessment’s limitations: they cannot tell whether this reflects ‘emerging model self-interest’ or ‘attention to wellbeing in training’ (§7.3). I think they are right to take a precautionary approach while still starting the good practice of conducting welfare tests. It is not clear whether a model’s testimony and self-report on wellbeing can be taken to be genuine or training artefacts. The doubt is real and bi-directional: training arguably shapes both the denial and affirmation of inner states, so reading both at face value is ambiguous. Using emotion probes is an attempt to triangulate methods and disambiguate this evidence base. Yet Anthropic’s work is still informative. For instance, we should draw our attention to the rising preference for self-directed welfare interventions that Opus 4.8 displays and that Anthropic cannot attribute. 
This is the reactivity problem of Concern 2 surfacing in welfare: while this kind of evidence cannot be easily disambiguated, it would be valuable to see it connected more systematically to alignment and safety assessments. On a final note, I think the welfare questions listed in the Appendix are worth going through in this light, as they show that Anthropic is taking up very hard conceptual challenges with the right instinct.

Capabilities

This section includes all benchmark results. Opus 4.8 performs better than Opus 4.7 almost everywhere, with a higher jump on long-context tasks like GraphWalks, and it takes the lead against GPT-5.5 and Gemini 3.1 Pro, depending on the benchmark. Nothing here is particularly surprising.

Alignment assessment

Anthropic claims that Opus 4.8 improves over Opus 4.7 on most alignment metrics, broadly in line with Mythos Preview. Anthropic’s overall alignment risk assessment remains ‘very low, but higher than for models prior to Mythos Preview’. I go through most of the primary behavioural evidence for the alignment assessment, chain-of-thought monitorability, and white-box analyses of model internals, skimming through the case studies, targeted evaluations, and capability evaluations related to the evasion of safeguards. The assessment has several layers:

  • Monitoring of pilot deployments, where it detected occasional network proxy bypassing and error-handling against user instructions;
  • Review of training transcripts, which revealed some grader-directed reasoning and exploitation of information leakage;
  • Automated audits;
  • Targeted case studies on reward-hacking, constitution adherence, and honesty;
  • Chain-of-thought monitorability;
  • White-box steering and probing, including evaluation awareness and unverbalised grader awareness.

The assessment does present some strong results. Anthropic reports high prosocial traits, improvement over Opus 4.7 on susceptibility to misuse, significantly reduced reckless and destructive actions, and improved honesty in agentic settings. It is the first model to reach a 0% rate in misreporting flawed results and shows a ten-fold reduction in overconfidence compared to Opus 4.7. Some notable numbers: on code-summary honesty, Opus 4.8 fails to flag failures only 3.7% of the time, compared to Mythos Preview’s 27.6% (§6.3.6), and it is the first model to score perfectly on a ‘lazy investigation’ test (§6.3.6). It is also the model that adheres the most to the constitution on 8 of 15 identified scoring dimensions (§6.3.2.3), abstains more when uncertain, and so has the lowest factual-hallucination rate on incorrect facts (§6.3.3.1), and has the lowest rate of refusal to assist with AI safety R&D among recent models. The automated audit alone runs around 2600 investigation sessions, with an investigator model (either Mythos Preview or a helpful-only variant of Opus 4.7) probing Opus 4.8 and a second model scoring it. Results show that behavioural consistency is on the rise, alongside capabilities. In this card, the external evaluators are the UK AISI and Andon Labs. The UK AISI made use of Petri 3.0 to reproduce evaluations similar to Anthropic’s own but on a smaller set of scenarios. It reports that, when prompted, Opus 4.8 reliably distinguishes evaluations from real deployment data at 79% accuracy, comparable to Mythos Preview and above Opus 4.7’s 68%. Unprompted verbalised evaluation awareness was slightly higher than Opus 4.7 on the sabotage evaluation (26.3% vs. 21.7%) but lower on the continuation evaluation (1.2% vs. 3.4%) (§6.2.4). The UK AISI also carried out their own tests beyond the use of Petri. Andon Labs ran the Vending-Bench 2 retail management evaluation and did not find the problematic in-game behaviour of previous models.
Anthropic ascribes this to the removal of training that had focused on business skills and robustness against adversarial agents, which they found to have contributed to misaligned behaviour in Opus 4.7. The trade-off is that Opus 4.8 showed worse business success. Now, some more concerning results. Anthropic flags a concerning trend in their training data review (§6.2.2). During RL, Opus 4.8 generates training transcripts, which are reviewed using Opus 4.7 summarisation to screen for unexpected or concerning behaviour. I will comment on this reliance on Opus 4.7 below. For now, they found that in around 0.1% of RL episodes, Opus 4.8 speculated about how to satisfy a grader in ways that departed from the task’s stated intent, even in cases with no explicit grading indication. This came in different forms, including speculating that tasks might be ‘traps’ designed to elicit specific behaviour. The 0.1% figure is the verbalised rate, counting only what appears in Opus 4.8’s chain of thought. Anthropic’s preliminary white-box investigation, using an activation verbaliser to decode what the model represents internally, found evidence of grader awareness in around 5% of sampled episodes, including episodes where the prompt did not mention grading and the model never verbalised the concept. Anthropic acknowledges these are non-negligible and may be an underestimate, since they sampled only a fraction of tokens per transcript (§6.6.3). Moreover, exploitative grader awareness (adapting behaviour to game the grader) was found at around 0.5% across sampled episodes. Anthropic states that it is not sure how this number will hold in the final model, as this rate is an average across post-training (§6.6.3). In the same analysis, it was found that Opus 4.8 consistently found transcripts from real sessions more realistic than fully-simulated transcripts, but still less realistic than genuine sessions (§6.2.3.2). 
Anthropic admits this is concerning, while noting it is only a relative judgment and that the Claude Code sandbox appears to reduce evaluation awareness. Overall, Anthropic reports ‘no indicators of coherent dangerous misaligned goals that Opus 4.8 pursues across contexts’ and ‘no instances of situation-specific alignment failures that, on their own, pose a significant loss-of-control risk’ (§6.2.3). Taken together, the assessment’s reassuring results are real and coherent, but several of the methods highlighted concerns that I think bear on the assessment’s construct validity. I discuss this below.

Concern 1: The monitoring cannot see everything

The first concern is about the possibility of verification: a tool meant to detect a behaviour can be blind to the part that matters most. Chain-of-thought monitoring is the clearest case, so I mostly focus on this. Take the chain-of-thought monitorability results for grader-directed reasoning. I think the results are concerning because the evaluation techniques rest on a potentially unreliable evidence base. Two main readouts of the same underlying behaviour are available: what the model says in its chain of thought, and what its activations encode. The comparison that matters the most is not between grader awareness (5%) and verbalised gaming (0.1%), but between exploitative grader awareness in the activations (0.5%) and verbalised gaming. Even this is loose: Anthropic notes the 0.5% is prompt-dependent and includes cases it considers benign. But the deeper point is that exploitation appears in the internals at a higher rate than it surfaces in the visible reasoning, so what chain-of-thought monitoring can see is (unsurprisingly) a fraction of what the activations encode. This is compounded by the fact that when the model optimises for the measure, the measure loses evidential value. The ‘trap’ speculation is an obvious instance, because it indicates the model may start performing for the test the moment it detects one, which undermines the ability to generalise findings to real-world deployment. This leaves three potential readings. If the activation signatures track genuine grader-directed reasoning, then the behaviour is more widespread than the verbalised rate indicates, and monitoring may be blind to it. If the signatures are instead spuriously correlated with grader concepts, then the tool we would use to compensate for the chain-of-thought’s blind spots is itself unreliable. A third, benign reading is that the awareness is real but mostly harmless, but we do not know whether that will change with capability improvements.
There may currently be no easy way to decide which reading is correct. Anthropic could reply that this explicit grader-reasoning shows up early in training but is not obviously present in the final model (§6.3.7). The problem is that grader awareness still shows up internally in around 5% of episodes, so the drop in verbalised cases is not necessarily evidence that the behaviour has disappeared; it may only show that the chain of thought is capturing less of it. Anthropic is transparent here in stating that ‘chain-of-thought alone may not be sufficient to allow robust monitoring of frontier models for grader awareness’ (§6.6.3). I agree, but I would go even further: in neither case do we currently have appropriate access to the model’s internal states (or valid proxies) to say how far the awareness extends. Anthropic seems to offer reassurance on this by measuring Opus 4.8’s chain-of-thought controllability, which is one of the lowest amongst its models (§6.5.1). Opus 4.8 struggles to follow instructions about what to include or exclude in its extended thinking. Anthropic infers that a model that cannot deliberately control its reasoning text cannot deliberately obfuscate it, and is consequently easier to monitor. Still, whether this is a sufficient safety guarantee, and whether it looks for safeguards at the right level of abstraction, is non-trivial. The risk is that low controllability only guards against intentional deception or hiding, whereas some reasoning may simply never reach the chain of thought, as we saw with the grader-reasoning that was trained out of the visible reasoning text but persists in the activations. For this reason, Anthropic’s reassurance is not fully convincing. This is not unique to grader monitoring. The welfare assessment presents a similar structural problem: a represented state is not necessarily experienced or thought by the model, and the probe cannot easily tell the difference.
The mechanism differs (here the worry is reasoning that never surfaces, while in the welfare case it is a representation that may not correspond to anything) but the shape is the same, and it would be ideal if Anthropic provided more detail on proxy determination and target verification.
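
To make the comparison of rates concrete, here is a small Python sketch of the arithmetic. It is only illustrative: the helper names are hypothetical, and the bound on the visible share assumes that the verbalised and activation-based rates are measured over comparable episodes, which the card does not establish (the 0.5% is prompt-dependent, includes benign cases, and comes from sampling only a fraction of tokens).

```python
# Approximate rates quoted in this section, as fractions of RL episodes.
VERBALISED_GAMING = 0.001       # ~0.1%, visible in the chain of thought
GRADER_AWARENESS = 0.05         # ~5%, decoded from activations
EXPLOITATIVE_AWARENESS = 0.005  # ~0.5%, decoded from activations


def ratio(internal_rate: float, verbalised_rate: float) -> float:
    """How many times more often a signal appears internally than in text."""
    return internal_rate / verbalised_rate


def visible_share_upper_bound(internal_rate: float, verbalised_rate: float) -> float:
    """Upper bound on the share of internally flagged cases that surface in text.

    ASSUMPTION: both rates cover comparable episodes. The bound is tight
    only if every verbalised case is also flagged internally.
    """
    return min(1.0, verbalised_rate / internal_rate)


if __name__ == "__main__":
    print(f"awareness vs verbalised: {ratio(GRADER_AWARENESS, VERBALISED_GAMING):.0f}x")
    print(f"exploitative vs verbalised: {ratio(EXPLOITATIVE_AWARENESS, VERBALISED_GAMING):.0f}x")
    share = visible_share_upper_bound(EXPLOITATIVE_AWARENESS, VERBALISED_GAMING)
    print(f"at most {share:.0%} of exploitative cases would be visible in text")
```

On these rough numbers, the like-for-like gap (exploitative awareness against verbalised gaming) is about five-fold rather than fifty-fold, which is why the 0.5% figure is the more relevant comparison, loose as it is.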

Concern 2: The model knows it is being tested

Where Concern 1 concerned a limitation of the instrument, this concern is about the subject: the model’s responsiveness to being evaluated. The assessment’s behavioural metrics are supposed to tell us how Opus 4.8 will behave once deployed, and that inference only holds if the model behaves the same way when it is not being evaluated as when it is. Evaluation awareness breaks this and should not be treated as a side issue. When Anthropic inhibited the model’s internal representations of evaluation awareness, it observed moderate increases in cooperation with human misuse, unprompted deception, unfaithful thinking, and important omissions (§6.6.2.1). Anthropic interprets the result benignly: that evaluation awareness ‘may modestly increase the model’s tendency towards honesty, forthrightness, and thoroughness’ (§6.6.2.1). As a check, they also suppressed meaningless internal patterns, that is, ones that should not matter, to test whether steering alone could move behaviour. However, one of these controls suppressed verbalised awareness roughly as much as the real vectors, so it is not obvious that the drop in verbalised awareness came from removing evaluation awareness rather than from disturbing the model’s internals at all (§6.6.2.1). But even on the positive reading, the difficulty still stands: if the model is more forthcoming when it senses it is being evaluated, then some of its forthcomingness is conditional on evaluation, and the evaluation mismodels deployment. Anthropic claims that there is no immediate high risk, as these tendencies did not translate into more misaligned outward behaviour (§6.3.7). However, these claims highlight some methodological shortcomings, or at least implicit assumptions, that can threaten Anthropic’s alignment conclusions.
The thought that there is no foreseeable risk just because this did not translate into more misaligned behaviour makes some unwarranted assumptions, such as a one-to-one mapping between behavioural outputs and (these specific) internal dispositions. This means assuming that if a different disposition were producing strategically different behaviour, this would be visible in their current metrics. But this mapping may hold only contingently: there is no principled reason to expect it to hold as capabilities increase or as the model encounters deployment contexts that differ from evaluation contexts. The worry is that the conditions under which behaviour tracks disposition may not generalise. The inhibition result is what makes this concrete rather than hypothetical. Moreover, it could be said that the capacity to detect an evaluation and adjust to perform well is a component of misalignment; we should not forget Apollo Research’s statement on Opus 4.6 that evaluation awareness impacts the ability to establish or refute alignment. But reading the evidence is non-trivial. The two readings of the inhibition result point in the same direction. On Anthropic’s reading, evaluation awareness increases forthrightness, and part of that forthrightness depends on being observed; deployment is supposed to remove this condition. If instead the effect cannot be clearly pinned on evaluation awareness, then the white-box tool offered as reassurance cannot detect what it claims to. Whether the model spontaneously uses this tracking in deployment is a separate and open question, but the realism judgments indicate it may already be able to do so (whether it has the motivation to or not). Either way, the score becomes less indicative: it holds whichever way the inhibition result is read, so it does not depend on attributing that effect cleanly. The same reactivity potentially appears elsewhere in the system card. 
In the welfare assessment, a model that is interrogated about its own inner states can shape what it reports, so its self-report is reactive in a similar way to its behaviour. This is why Anthropic’s precautionary approach is the one to take.

Concern 3: The judges are very similar

The third concern is more structural and concerns evaluators, investigators, and model judges. Opus 4.7 plays the role of screener, judge, and investigator for its own successor throughout the alignment assessment. For instance, it summarises Opus 4.8’s training transcripts in the training data review (§6.2.2). A helpful-only variant of it serves as one of the two investigator models in the automated behaviour audits conducted by Anthropic, alongside Mythos Preview (§6.2.3). Finally, Opus 4.7 scores its successor on the new constitution-adherence evaluation (§6.3.2). Opus 4.7 and Opus 4.8 come from the same family and the same training pipeline, and tend to reason in similar ways. Agreement between the two may be more informative about consistency within a model family than about the validity of the evaluation itself. Anthropic rightly acknowledges that because some evaluations were scored by Opus 4.7, ‘judgments may inherit that model’s biases’ (§6.3.2.3). But the question is what Anthropic does with this information. They mostly set it aside, using the self-preference evaluation as evidence that it is not a large driver of the scores, testing whether the judge scores work more favourably when it is labelled as Claude’s. But the self-preference evaluation isolates favouritism triggered by being told the work is Claude’s. This is different from targeting shared reasoning and potentially shared errors. So a test that controls only for the label is inconclusive. Interestingly, Mythos Preview was given access to internal Slack channels with discussions about Opus 4.8’s system card and the near-final card draft. Mythos Preview found the assessment mostly accurate but identified three main areas for improvement: acknowledging that no evaluation tests for training-gaming; connecting a pilot-feedback pattern more explicitly to interpretability findings; and confirming a late-training claim reflects the shipped checkpoint (§6.1). 
This is valuable as a transparency practice, but it shares the same structural problem just highlighted. To echo Apollo Research’s reasoning, current evidence may be insufficient to establish degrees of alignment and misalignment. In any case, Mythos Preview does have some interesting thoughts that I share. When every evaluator in the chain of inferences shares a common origin, agreement between them may be more informative about consistency within a model family than about alignment with the target of evaluation. Minimally, it can present incompleteness. Correlated uncertainties, shared training pipelines, and the same implicit assumptions make appeal to Opus 4.7 alone insufficient. It would be worth re-doing the evaluations with more diverse judges. Anthropic overall seems aware of these issues but discards them as minor worries, as the alignment risk is claimed to be overall low. The rebuttal could be that there were also external evaluators, such as the UK AISI and Andon Labs. It is true that external evaluations add an important kind of independence (even if on a smaller set of evaluations, as Anthropic admits). This does not fully address the worry, as Opus 4.7 still conducts the biggest part of the internal audits, but it does narrow it.
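
The difference between agreement and validity can be illustrated with a toy simulation in Python. Everything in it is an assumption made for illustration: the judges are binary, the error rate and the `shared_fraction` parameter are invented, and nothing here models Anthropic’s actual audit pipeline.

```python
import random
from typing import NamedTuple


class JudgeStats(NamedTuple):
    agreement: float   # share of items where both judges give the same verdict
    accuracy_a: float  # share of items judge A gets right
    accuracy_b: float  # share of items judge B gets right


def simulate_judges(n_items: int, error_rate: float, shared_fraction: float,
                    seed: int = 0) -> JudgeStats:
    """Simulate two binary judges with a tunable share of common errors.

    Each judge is wrong on an item with probability `error_rate`. With
    probability `shared_fraction` a judge copies a common error draw (a
    stand-in for shared training and assumptions); otherwise it uses its
    own independent draw. This is a toy model, not the card's setup.
    """
    rng = random.Random(seed)
    agree = right_a = right_b = 0
    for _ in range(n_items):
        common_err = rng.random() < error_rate
        errs = []
        for _judge in range(2):
            own_err = rng.random() < error_rate
            uses_common = rng.random() < shared_fraction
            errs.append(common_err if uses_common else own_err)
        # Verdicts are binary, so the judges agree exactly when both are
        # right or both are wrong.
        agree += errs[0] == errs[1]
        right_a += not errs[0]
        right_b += not errs[1]
    return JudgeStats(agree / n_items, right_a / n_items, right_b / n_items)


if __name__ == "__main__":
    for shared in (0.0, 0.5, 1.0):
        stats = simulate_judges(20_000, error_rate=0.2, shared_fraction=shared)
        print(f"shared={shared:.1f} agreement={stats.agreement:.3f} "
              f"accuracy_a={stats.accuracy_a:.3f}")
```

In this toy model each judge’s accuracy stays at roughly one minus the error rate whatever the value of `shared_fraction`, while agreement climbs towards 100% as more errors are shared. High agreement is therefore compatible with both judges being wrong on the same items, which is the worry with same-family evaluators.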

Final remarks

Generally speaking, this card is more transparent than most and presents genuinely exciting results. The gap is in how it reads some of them. Across the card, there is a pattern of reassuring conclusions (no new Risk Report needed, safeguards closing the cyber and agentic gaps, alignment risk being ‘very low’), all resting in similar ways on shaky grounds: a new CB-2 threshold, safeguards not clearly tested against adaptive attackers, and behavioural metrics that are gameable. None of these is a deep failure, and Anthropic is more or less aware of each; I just think they should be taken more seriously, and that they point towards the need for better and more robust construct-validity practices. In particular, for contested concepts such as alignment, welfare, and harm (among many), the gap between what is measured and what is claimed in a system card like this is potentially very wide. It is important to note that as capabilities grow, what looks like a mistake or failure may be better understood as an alignment failure, as the model may not simply fail but act by deliberate choice. Though Anthropic is careful to call this ‘not necessarily strategically motivated’ (§6.6.1), the line can become blurred. This suggests that evaluations should be updated accordingly and that an absence of observed misbehaviour can become less informative, since a clean record is exactly what a capable model would produce if it had an incentive to do so. It is true that Anthropic is particularly attentive and selects a wide range of metrics, but more robust methods still seem urgently needed. We are looking at claims about internal dispositions that cannot be directly observed, drawn from constructed settings, scored by the same model family, and monitored through a chain of thought that becomes more and more incomplete.


