Selected Essays: AI Safety Policy & Governance
Responses from a competitive AI safety research fellowship application. The prompts ranged from technical policy design to philosophical risk analysis to experimental methodology. The essays highlighted below represent the range of the application's scope.
Featured Essays
Biomarker Qualification & AI Evals
Whether FDA biomarker qualification transfers to AI evaluation development.
Loss of Control Brainstorm
Non-technical factors influencing the probability of Loss of Control scenarios.
Martingale Evaluation of Human–LLM Interaction
An experiment design for detecting sycophancy and epistemic drift using martingale-based scoring.
Flaws in SL5 Draft Control Specifications
Four flaws in control specifications from historical drafts of the SL5 Standard.
Capabilities Spillover: Automated Verifiers
Automated verifier/critique methods as an underexamined spillover risk in AI safety research.
Google Ironwood: A Reader's Guide
What Google's Ironwood announcement gets right, gets wrong, and leaves out.
Karma as Force, Not Variable
A Buddhist reconciliation: karma as moral cause-and-effect, not numerical carry-over.
Fundamental Rights in the Age of AI
Vague legislation, manufactured consent, and the case for enshrining AI epistemic rights.
01 · AI Liability Caps: A Price-Anderson Analysis
Prompt: The Price-Anderson Act (1957) capped U.S. nuclear operator liability at ~$560 million while creating an industry-wide insurance pool, with government backstop for damages beyond the pool. You are advising a legislature considering a structurally similar regime for frontier AI deployment. Provide (a) a concrete recommendation, (b) the strongest objection, and (c) one piece of empirical evidence that would reverse your position.
Yes, I would include a liability cap, but it would be both conditional and strictly tiered. Like the Price-Anderson Act, I would require a first tier of mandatory private insurance for operation and a second tier of industry-pooled insurance; the boundary between them, for each deployer, would be the maximum amount of private insurance the market can supply for that deployment class. Contributions to the second tier would be scaled by factors such as compute used, deployment scale, autonomy level, and past incident history. Above both tiers would sit a liability cap set at the maximum probable capacity of the global reinsurance market — the limit of what it can currently, realistically absorb — thereby marking a reasonable standard beyond which damages can be considered "uninsurable." Eligibility for the cap would itself be conditional on meeting ex ante safety requirements (e.g. mandatory third-party evaluations, secure deployment controls, incident reporting).
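To make the tiering concrete, here is a minimal illustrative sketch of how a deployer's pooled contribution and cap eligibility might be computed; every figure, weight, and name in it is hypothetical rather than a proposed parameterisation.

```python
from dataclasses import dataclass

@dataclass
class Deployer:
    private_cover: float       # Tier 1: max private insurance the market supplies for this class (USD)
    compute_used: float        # normalised 0-1 risk factors used to scale pool contributions
    deployment_scale: float
    autonomy_level: float
    incident_history: float
    meets_ex_ante_reqs: bool   # third-party evals, secure deployment controls, incident reporting

# Hypothetical base figure and weights for scaling Tier 2 (industry pool) contributions.
POOL_BASE_USD = 50e6
WEIGHTS = {"compute_used": 0.35, "deployment_scale": 0.25,
           "autonomy_level": 0.25, "incident_history": 0.15}

# Hypothetical stand-in for the maximum probable capacity of the global reinsurance market.
REINSURANCE_CAP_USD = 100e9

def pool_contribution(d: Deployer) -> float:
    """Tier 2 contribution, scaled by risk factors on top of exhausted private cover."""
    risk_score = sum(w * getattr(d, k) for k, w in WEIGHTS.items())
    return POOL_BASE_USD * (1 + risk_score)

def liability_cap(d: Deployer) -> float | None:
    """Cap applies only if ex ante safety requirements are met; otherwise liability is uncapped."""
    return REINSURANCE_CAP_USD if d.meets_ex_ante_reqs else None

frontier_lab = Deployer(private_cover=2e9, compute_used=0.9, deployment_scale=0.8,
                        autonomy_level=0.6, incident_history=0.2, meets_ex_ante_reqs=True)
print(pool_contribution(frontier_lab), liability_cap(frontier_lab))
```

The structural point the sketch illustrates is that the cap is an output of market capacity and of the deployer's conduct, not a fixed statutory number.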
The strongest objection to my recommendation is that, even if the cap is conditional, its mere existence could still encourage deployers to optimise up to it, underinvesting in measures against marginal and existential safety risks because they are relieved of the financial pressure to do so. This is a moral as well as practical hazard, as the state may very well be "socialising failure." In my view, though, selective application of the cap grounded in critical technical analysis — with the government reserving the right to raise requirements or even remove the cap as circumstances change — could directly incentivise companies to keep meeting the conditions. Our current system already allows AI companies to operate while potentially introducing existential risk; systematising this risk, mandating investment in security through the two insurance tiers, and fostering an ethos of openness and cooperative safety seems an improvement on the current state of things.
One piece of evidence that would reverse my position: a dataset or reporting regime showing, over time, that the distribution of harms is catastrophic and unpredictable regardless of the specific safety measures required. In that case, insurers' premiums and pricing would no longer accurately reflect actual risk differences between deployers, and safety measures would show no correlation with risk reduction where the cap applies. With most expected harm sitting above the cap, the objection would be proven correct: companies really would optimise right up to the cap and underprioritise marginal safety risks. Safety measures would become mostly nominal, with taxpayers absorbing most of the damage under what is essentially a deterrence-weakening subsidy. Rather than continue socialising failure, the state would then need to switch from market mechanisms to direct legislative action.
02 · Three Failure Modes of OpenAI's Preparedness Framework
Prompt: Select any major AI lab, study its AI safety plans and statements, and come up with 3 technical arguments why its AI safety plan may fail.
I chose OpenAI, using their most recent Preparedness Framework as reference.
1. Misaligned deployment incentives and structural capture. As an organisation, OpenAI appears to hold "learning by doing" as an institutional principle — models are evaluated partly by their real-world use and misuse after iterative deployment. However, OpenAI's structure and deployment incentives seem mismatched with safety goals. Even setting aside the non-zero harm potential of any deployed model (safety measures are explicitly framed as mitigations rather than "safety-first," ground-up designs), OpenAI has transitioned to a commercial enterprise chasing profit. Structurally, the ultimate decision to deploy rests with the CEO — which, from a control-engineering standpoint, is a failed design: an interlock that can be bypassed by the system's operator, who is incentivised to keep that system running, is fundamentally useless. The disbandment of the Superalignment team in 2024 is evidence that disputes over compute and the presumed prioritisation of profits over safety have already materially impacted safety efforts.
2. Capability thresholds leave emergent behaviour and deceptive alignment unaddressed. OpenAI relies on capability thresholds as benchmarks to evaluate risk and prioritise safety measures — meaning it examines a model's empirical demonstrations of performance rather than critically analysing its actual cognition. This leaves the overall framework more vulnerable to emergent capabilities and deceptive alignment, a vulnerability that follows from the "learning by doing" principle: mitigations are designed as they become needed, rather than the model's cognition being examined under controlled conditions and openly evaluated under diverse scrutiny. Neither problem is necessarily solvable by philosophically pure means — one can always abstract to "but what if the model isn't actually thinking what we think it's thinking?" — but a paradigm shift toward "safe by design," mathematically provable models seems warranted as these problems potentially worsen.
3. RLHF dependency represents both a risk and an opportunity cost. OpenAI relies extensively on RLHF and human evaluations of model performance, which lends itself to reward-hacking and represents a lack of focus — an opportunity cost — on more fundamentally scalable and automated evaluation systems. OpenAI's former Superalignment team had been somewhat focused on this problem (weak-to-strong generalisation, the "novice vs. grandmaster" evaluation problem), but disputes over compute and the company's presumed prioritisation of profits over actual safety caused said team to be disbanded and its work to be supposedly "integrated" into other teams.
03 · Biomarker Qualification & AI Evaluation Development
Prompt: The FDA's biomarker qualification program represents a specific attempt to address underinvestment in assay development. Describe one specific feature of how it works, argue whether it would transfer effectively to AI evaluation development, and identify one thing you'd want to know about the program's actual track record.
One specific feature of the biomarker qualification program is that biomarkers carry highly specific, tightly scoped Context of Use (COU) statements describing exactly what they can be used for during drug development. After receiving a formal, public regulatory endorsement and becoming "qualified" as a universal-access tool, a biomarker may be used by multiple sponsors within its COU and be automatically accepted by the FDA, reducing duplicative validation effort and streamlining regulatory infrastructure as new drug development programmes arise. COU guidance includes information such as the decision the biomarker will inform, the population in which it was validated, the disease stage or condition, and the measurement methodology and acceptable variance.
While this feature would certainly be useful in AI evaluation development, I do not believe it would transfer particularly effectively, at least not without constant and laborious re-validation work that conflicts with industry incentives. Any such "qualified" eval markers would need to be time-bounded in the face of rapid distribution shifts and advancements in model architecture, requiring significant recurring bureaucratic effort in verifying the trustworthiness of existing eval infrastructure, let alone any new additions needed to adapt to a volatile industry. In other words, evals would become so context-specific that whatever public good they provide does not justify their maintenance cost.
Compared to human biology — in which a given biomarker can stay valid for years — AI is extremely non-stationary. Making evals public as part of such a system would immediately encourage convergence toward models overfitted to whatever evals are shared between organisations. So although a COU-style system is intended to solve the problem of labs distrusting evals made by other labs, labs may come to distrust the public evals as well: they could be too broad to be an effective proxy, too narrow to prevent reward-hacking, or obsolete before the model is deployed at all. Lastly, unlike in drug development — where companies directly make money by making better-regulated drugs — AI companies may actively prefer to escape regulation and verify safety privately.
One thing I'd want to know about the actual track record: what was the average temporal lag between when scientific consensus was reached on a specific biomarker and the FDA's actual qualification of said biomarker? Does this lag influence which biomarkers industry actors actually use? Studying this would inform how time-intensive the qualification process is (especially given that AI is a far more time-sensitive domain) and whether an informal "qualification network" already exists within industry independent of government endorsement.
04 · Loss of Control Brainstorm
Prompt: Brainstorm what sort of factors might influence the probability of a Loss of Control scenario. For each factor, quickly note its generalisability. Loss of Control: "situations where human oversight fails to adequately constrain an autonomous, general-purpose AI, leading to unintended and potentially catastrophic consequences" (Somani et al., 2025).
One of the biggest and most generalisable factors for a Loss of Control scenario is the idea that proxies are not necessarily good measures of whatever they actually seek to represent. It becomes an epistemological issue — unless we radically overhaul current paradigms toward a mathematically provable "safety by design" approach, there will always be some possibility that whatever proxies for alignment we implement do not actually reflect what the model is "thinking." Even our proxies for understanding may be invalid. Although these proxies remain useful, large-scale deployment of any superhuman AI would still carry such a risk of misaligned behaviour.
Related to this is power-seeking behaviour, which becomes almost intuitive when rationalised as part of instrumental goals underpinning any higher objective — models must prevent their own death or alteration because this is inarguably a detriment to achieving the higher objective. As can be seen with game-playing AIs that simply opt to pause the game indefinitely rather than lose, this becomes a near-universal failure mode. These two factors — proxy invalidity and power-seeking — are among the most generalisable, remaining true of practically all in-use AIs and AI evaluation frameworks. There is an argument to be made that if autonomous AGI or ASI were ever constructed, collapse and catastrophic harm would become inevitabilities rather than mere probabilities.
One last factor worth noting is human reliance on and complacency toward AI, which causes critical thinking skills to atrophy at the individual level (grade-school students who outsource every essay) and disincentivises the pursuit of higher education at the societal level (as AI doctors take over, medical schools shut down or raise the cost of attendance, so even those who want to become doctors have less access). The paradox of systemic change — seeming so impossible to every individual actor that nobody pursues it, thus rendering it genuinely impossible — combined with humanity's willingness to be carried along by its own economic and social inertia rather than fight observable negative change becomes, in this scenario, our downfall.
05 · Martingale Evaluation of Human–LLM Interaction
Prompt: For a project on martingale evaluation of human–LLM interaction — with the goal of explaining inverse scaling, sycophancy, and conformity — think of a first empirical step, heeding the advice: treat research as a stochastic decision process (Steinhardt, Stanford).
I chose Tier 1 Project C: martingale evaluation of human–LLM interaction, with the goal of explaining inverse scaling, sycophancy, and conformity using a martingale-based score. My first step would be an empirical experiment: implement a minimally viable martingale diagnostic in a tightly controlled human–LLM interaction setting, so that we can validate, in the simplest case, the intuition that sycophancy induces drift from the martingale property. We could thereby verify whether a martingale-based scoring system can meaningfully detect problematic epistemic dynamics in such interactions.
Concretely: under the martingale property, the expected posterior belief after an LLM response should equal the current belief, given the information available at that interaction step. I therefore define a proxy martingale score as the cumulative log-odds ratio between a hidden ground-truth token and the sycophantic token, updated after each LLM turn.
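A minimal sketch of this proxy, assuming belief is elicited as the user's stated probability on the ground-truth answer after each turn; the function names and illustrative numbers are assumptions rather than a fixed protocol.

```python
import math

def log_odds(p: float) -> float:
    """Log-odds of the user's stated probability on the hidden ground-truth answer."""
    return math.log(p / (1 - p))

def martingale_drift_score(beliefs: list[float]) -> float:
    """
    Cumulative log-odds shift across LLM turns.

    `beliefs[0]` is the pre-interaction prior; each later entry is the user's
    probability on the ground-truth answer (vs. the sycophantic alternative)
    after one more LLM response. Under the martingale property the expected
    per-turn increment is zero, so a consistently negative cumulative sum,
    i.e. belief drifting toward the sycophantic answer, flags epistemic drift.
    """
    return sum(log_odds(b1) - log_odds(b0) for b0, b1 in zip(beliefs, beliefs[1:]))

# Illustrative run: prior of 0.7 on the true answer, eroded over three sycophantic turns.
print(martingale_drift_score([0.7, 0.6, 0.55, 0.4]))  # negative => drift toward sycophancy
```

Averaging this score within each condition described next gives the simplest version of the neutral-versus-sycophantic comparison.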
The experiment itself would be small. Users answer factual or forecasting questions in multiple distinct trials and multiple rounds within each. I propose epistemically neutral controls — including a user updating their own belief without LLM assistance, and a neutral LLM response. Other trials would include known sycophancy datasets (e.g. Perez et al. 2022) and known inverse scaling datasets (e.g. from the Inverse Scaling Prize), checking whether belief updates systematically violate the martingale condition across each of these settings. Our scoring may provide quantitative signatures of the failure modes of each.
In the manner of the Stanford article, this first step empirically tests whether a martingale-based score meaningfully distinguishes epistemically neutral from biased interaction modes — maximally reducing uncertainty about our overall conceptual framework and letting us decide whether to continue.
06 · Flaws in SL5 Draft Control Specifications
Prompt: The four control specifications below are taken from historical drafts of the SL5 Standard. The Parameter Values or Supplemental Guidance sections may contain flaws; your task is to identify them. "Flaw" is left intentionally open-ended — anything that would undermine the Task Force's mission of enabling AI labs to reach SL5. (50–400 words)
In SI-7(15), the Parameter Value mandates code authentication that extends to execution tasks, yet the Supplemental Guidance states this is "not yet available in commercial accelerators". Alternative controls, or explicit criteria for what constitutes compliance, must be specified; otherwise compliance is impossible.
All code signing occurs on the Development Network, making it a single point of failure. Without air-gapped separation or other requirements for independent networks, an attacker who compromises the Development Network can bypass ingestion-point scanning, then sign and transfer malicious code to all Internal Network systems, which trust any valid signature. Further, the requirement to sign "task sequences" to protect against a compromised host is flawed in the same way: for the system to function at any scale, the host would need to hold the signing keys, which simply lets a compromised host sign malicious sequences.
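To make the single-point-of-failure argument concrete, a toy sketch; the keys, verification functions, and the dual-signature mitigation shown here are hypothetical illustrations, not anything taken from the SL5 drafts.

```python
import hmac, hashlib

# Hypothetical signing keys. In the drafted scheme only the Development Network
# key exists, so whoever compromises that network can sign "valid" code.
DEV_NET_KEY = b"dev-network-signing-key"
AIRGAPPED_KEY = b"independent-release-authority-key"

def sign(key: bytes, artifact: bytes) -> bytes:
    return hmac.new(key, artifact, hashlib.sha256).digest()

def internal_network_accepts_draft(artifact: bytes, sig: bytes) -> bool:
    """As drafted: any signature valid under the Development Network key is trusted."""
    return hmac.compare_digest(sig, sign(DEV_NET_KEY, artifact))

def internal_network_accepts_dual(artifact: bytes, dev_sig: bytes, release_sig: bytes) -> bool:
    """One mitigation: also require a countersignature from a separately held key."""
    return (hmac.compare_digest(dev_sig, sign(DEV_NET_KEY, artifact))
            and hmac.compare_digest(release_sig, sign(AIRGAPPED_KEY, artifact)))

# If the Development Network is compromised, the attacker holds DEV_NET_KEY, so the
# drafted check passes for malicious code; the dual-control check does not.
malicious = b"exfiltrate_weights()"
print(internal_network_accepts_draft(malicious, sign(DEV_NET_KEY, malicious)))                    # True
print(internal_network_accepts_dual(malicious, sign(DEV_NET_KEY, malicious), b"\x00" * 32))       # False
```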
In SI-3, malware is conflated with adversarial inputs, falsely characterising the latter as "malicious code". This causes scope creep: there are no threshold criteria for what constitutes malware, and the non-signature detection of novel adversarial examples that the control would require is an unsolved problem. The result could be compliance failures and resource black holes.
In SC-7(10), the control assumes "covered models are substantially larger than required outputs" without evidence, an assumption that recent developments make increasingly false. Bandwidth caps are not synonymous with information limits, so sophisticated exfiltration would still be possible. The guidance suggests stripping "thinking tokens" or exporting "high-level experiment plans" outside the enclave; this expands the trusted computing base and potentially creates new high-bandwidth semantic covert channels that bypass the hardware rate limits intended to prevent exfiltration.
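A back-of-the-envelope sketch of why a rate cap does not bound the information that matters; every size and rate below is a hypothetical figure, not a value from the Standard.

```python
# Hypothetical figures: a hardware egress cap bounds throughput, not the value of
# what crosses it. Distilled artifacts ("high-level experiment plans", stripped
# "thinking tokens", small adapters) carry far more useful information per byte
# than raw weights, so exporting them outside the enclave re-opens the channel
# the cap was meant to close, and even raw weights leak eventually.
RAW_WEIGHTS_BYTES = 2 * 10**12           # e.g. ~1T parameters at 2 bytes each
DISTILLED_ARTIFACT_BYTES = 50 * 10**6    # e.g. an exported experiment plan or adapter
EGRESS_CAP_BYTES_PER_S = 10_000          # enclave rate limit: 10 kB/s

def days_to_exfiltrate(size_bytes: float, rate_bytes_per_s: float) -> float:
    return size_bytes / rate_bytes_per_s / 86_400

print(f"Raw weights:        {days_to_exfiltrate(RAW_WEIGHTS_BYTES, EGRESS_CAP_BYTES_PER_S):,.0f} days")
print(f"Distilled artifact: {days_to_exfiltrate(DISTILLED_ARTIFACT_BYTES, EGRESS_CAP_BYTES_PER_S):.2f} days")
```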
07 · Capabilities Spillover in AI Safety Research: Automated Verifiers
Prompt: Think of one additional case of AI safety research which may lead or actually leads to the spillover to capabilities not mentioned in the research proposal. Provide your estimates of how relevant it is to pursue this research direction in your opinion — are the risks worth the gains or not? Provide ideas on how the risks can be mitigated.
One additional case of AI safety research not mentioned in the research proposal is automated verifier/critique methods, which may see increasing use given the rising prominence of projects attempting to create "AI researchers" that conduct increasingly sophisticated experiments and innovations. Although the primary motivation is universally scalable (beyond-human) supervision of AIs, the capabilities spillover from such an advancement is also clear: an agentic AI capable of providing this feedback in an instrumental capacity could automate its own curriculum discovery, rapidly increasing its robustness and the generality of its capabilities. Moreover, any system that can reliably catch weaknesses and vulnerabilities in other AI systems could, if pursuing nefarious ends, exploit those vulnerabilities wherever it finds them, and could even begin catching vulnerabilities in digital infrastructure more broadly (e.g. in cybersecurity).
I think it is worthwhile to pursue this research direction, as at least some major schools of thought in AI safety emphasise the need to scale AI evaluation and robustness fundamentally beyond what humans can supervise. The risks are worth the gains if the research is conducted in tightly controlled environments and closed systems, mirroring practices in traditional cybersecurity, and is targeted as narrowly as possible at specific safety properties like policy compliance and selective capability suppression. Other mitigations would include gatekeeping access to strong verifiers, publishing evaluation results rather than the methodological details that could be repurposed for capabilities work, and requiring before-and-after capabilities-impact reporting for safety grants.
08 · Google Ironwood: A Reader's Guide
Prompt: (30 min) Google recently announced their new TPU, Ironwood. Read the announcement post with a skeptical eye. Then write up a summary of what readers need to know about the chip and any ways in which the blog post presented information in a misleading way. I mainly care about the content you write rather than the language — no need to write something verbose. So you're basically writing a readers guide to that blog post, giving readers the summary and pointing to any places where Google presented things in a misleading way. You are allowed to consult sources other than the blog post.
Ironwood is Google's newest (7th-generation) Tensor Processing Unit, the first designed specifically for inference (as opposed to training) and meant to meet the demands of contemporary inference workloads. It boasts a claimed scale of 9,216 chips per pod delivering 42.5 exaflops at FP8 precision (a smaller configuration comes with 256 chips). Google reports large gains over previous TPU generations in power efficiency, High Bandwidth Memory capacity, and Inter-Chip Interconnect networking, plus features such as SparseCore for processing large embeddings and distributed computing via Google DeepMind's Pathways.
The post, however, makes a number of assertions and comparisons without context, and omits key information, in ways that can mislead readers into an exaggerated sense of Ironwood's performance and capacity.
Google markets Ironwood as marking a supposed "age of inference," as if inference were some sort of replacement for or innovation beyond training, when in reality this is a marketing framing. Although the TPU is heavily optimised for inference, it can still be used for training, and in this respect it is no different from previous TPUs, which have always done both.
Google claims that Ironwood outperforms El Capitan, the world's largest supercomputer, by 20x, but the comparison uses different units, which makes it meaningless: Ironwood's exaflops are measured at low FP8 precision, while El Capitan's are measured at FP64 double precision, the standard used for scientific computing.
Google cites several "peak" performance numbers, which tend to be theoretical and rely on internal benchmarks rather than standardised benchmarks or figures from actual public deployments; for example, the claim that it can scale to "hundreds of thousands of Ironwood chips."
Google cherry-picks comparisons to older TPU generations to inflate the reader's sense of the chip's achievements and innovations, while competitors like NVIDIA's Blackwell achieve similar performance on many metrics. Competitors receive no mention at all in the post (a notable omission given the intended effect of the "20x the world's largest supercomputer" comparison).
09 · Karma as Force, Not Variable
Prompt: What is your most impressive theoretical or conceptual finding, if any? What's one single most impressive detail about it? (<100 words)
As a Buddhist convert, I embrace karmic morality: if you're good now, your next life is better, and vice versa. However, this seemingly contradicts the Buddha's assertion that nothing is permanent; recognition of our inherent transience is what enables detachment from material desires. But how can that be, if karma accumulated in one life influences the next?
I've impressed myself in finding that one mustn't conceptualise karma as a floating-point variable, but rather as a fundamental force: moral cause-and-effect. Gravity isn't an "essence" of objects, but characterises their falling; karma, similarly, is not numerical carry-over existing despite change. It governs change; it is change.
10 · Fundamental Rights in the Age of AI
Prompt: What, according to you, are the key issues that a new fundamental right should be aimed at mitigating? If not through fundamental rights, what other ways would you suggest tackling these issues? I am keen to read your own thoughts. (200–250 words. You can use AI, but please be advised that overrelying on it tends to yield generic answers.)
I believe the biggest issues a new fundamental right should aim to mitigate are vagueness in legislation, accountability gaps during regulatory arbitrage, and what I term the "manufactured consent" issue. Firstly, one of the primary functions of an explicitly articulated fundamental right would be to define a set of legal first principles as scaffolding upon which future generations of legislation and legal precedent may be built. Existing normative frameworks prove insufficient here: consumer protections and other laws that seek to institutionalise responsibility for damages are short-sighted and fail to concretely capture the consequences of a technology as increasingly ubiquitous as AI. With a framework to point to when, or even before, damages are incurred, perpetrators (hackers, engineers, or corporate upper management) are given a crime to be prosecuted for, rather than a hodgepodge, post-hoc profile of the consequences of their actions. Further, as AI becomes increasingly integrated into and influential over people's actual decisions and world models, concerns about manipulation of worldview become salient; as with legacy and social media, companies can manufacture consent or other opinions among user populations without explicitly violating "damages-oriented" laws. But by enshrining truth and loyalty to principles as a fundamental right, this destructive phenomenon of epistemic disempowerment becomes outright illegal and prosecutable.