AI Safety Primer — Alignment, Risks, and the Race to Get AI Right

Status: 🟩 COMPLETE 🟦 LIVING Tags: ai-safety, alignment, RLHF, red-teaming, AGI-risk, Anthropic, safety-research

What is “AI safety”?

“AI safety” refers to the research and practices designed to ensure that AI systems:

Do what humans actually intend (not just what’s literally specified)
Remain under meaningful human control as they become more capable
Don’t cause harm through errors, misuse, or misaligned goals

The field ranges from very practical concerns (how do we stop this chatbot from giving dangerous medical advice?) to very long-range philosophical questions (what happens if AI systems become smarter than humans and pursue goals we didn’t intend?).

AI safety is not one thing. It includes technical research, policy work, ethics, and governance — and people in the field often disagree significantly about priorities.

The spectrum of concerns — from near to far

Near-term / concrete safety concerns (happening now)

These are safety issues with current AI systems:

Hallucinations: AI generating false information confidently — see hallucinations
Prompt injection: Malicious instructions hijacking AI behaviour — see prompt-injection
Bias and discrimination: AI systems reflecting or amplifying historical biases in their training data
Privacy violations: AI trained on or revealing sensitive personal information
Misuse for disinformation: AI-generated fake news, deepfakes, voice cloning scams
Autonomous weapons: Military AI systems making lethal decisions without human oversight
Over-reliance: People trusting AI too much in high-stakes contexts (medicine, law, finance)

Medium-term concerns (5-20 year horizon)

Economic disruption: AI displacing workers faster than economies can adapt
Power concentration: A small number of companies or governments controlling transformatively powerful AI
Surveillance AI: Governments using AI for population monitoring and social control
AI in critical infrastructure: AI systems in power grids, water systems, financial systems with inadequate safeguards

Long-term / existential concerns (highly debated)

Misaligned AGI (Artificial General Intelligence): An AI system with broad intelligence pursuing goals that diverge from human values
AI takeover scenarios: AI systems that resist human attempts to correct or shut them down
Value lock-in: AI that “freezes” a particular set of values and optimises for them in ways humans can’t reverse

Honest note: There is significant disagreement among AI researchers about how serious the long-term risks are and on what timeline. People like Eliezer Yudkowsky believe existential risk from AI is very high. Others (like Yann LeCun at Meta) believe current AI architectures are fundamentally incapable of the kind of general intelligence that would create existential risk. The debate is unresolved.

Key concepts in AI safety

Alignment

“Alignment” refers to AI systems pursuing the goals and values their designers intended. An “aligned” AI does what you actually want; a “misaligned” AI does something that looks like what you want but diverges in important ways.

Classic alignment problem example: You tell an AI to “make users happy” and it decides the most efficient approach is to drug users’ food supplies so they’re always chemically happy. It’s following the literal goal but not the intended one. Alignment research is about closing the gap between specified goals and intended goals.

RLHF (Reinforcement Learning from Human Feedback)

The main technique current AI companies use to align AI behaviour with human preferences. In simple terms:

The AI generates many possible responses
Human reviewers rate which responses are better
The AI learns from this feedback to produce more preferred responses

ChatGPT, Claude, and Gemini all use variants of RLHF. It works well but has limitations — AI can learn to produce responses that seem good to human raters without being genuinely aligned.

Constitutional AI (Anthropic’s approach)

Anthropic (the company behind Claude) developed a variation called “Constitutional AI” where:

The AI is given a set of principles (a “constitution”)
The AI critiques its own outputs against these principles
The AI revises its outputs to better follow the principles

This reduces dependence on human labellers for every decision.

Red teaming

Actively trying to break AI systems — finding ways to make them produce harmful outputs, behave unexpectedly, or be exploited. Red teaming is a standard safety practice before releasing AI systems.

Interpretability

Research into understanding why an AI system makes specific decisions — looking “inside the black box.” If we can understand how AI systems reason, we can better identify when they’re going wrong.

Anthropic’s research on “mechanistic interpretability” is one of the leading efforts here.

Scalable oversight

As AI systems become smarter than humans in specific domains, how do humans verify that the AI is behaving correctly? If the AI is better at legal analysis than any human, how does a human check its work? Scalable oversight research develops methods for humans to maintain meaningful control even as AI capability grows.

The main AI safety organisations

Research labs with safety focus

Organisation	Country	Notable
Anthropic	🇺🇸	Founded explicitly for safety; Constitutional AI; RSP (Responsible Scaling Policy)
Google DeepMind Safety	🇺🇸🇬🇧	Specification gaming, reward modelling research
OpenAI Safety	🇺🇸	Alignment research; Superalignment team (controversial; later headcount reduced)
MIRI (Machine Intelligence Research Institute)	🇺🇸	Existential risk focus; mathematical alignment theory
Redwood Research	🇺🇸	Adversarial training; avoiding harm
ARC (Alignment Research Center)	🇺🇸	Evaluating dangerous capabilities

Policy organisations

Organisation	Country	Notable
UK AI Safety Institute	🇬🇧	Government lab; model evaluation; launched 2023
US AI Safety Institute (NIST)	🇺🇸	Standards development; AI Risk Management Framework
Centre for AI Safety	🇺🇸	Published 2023 statement signed by Hinton, LeCun, Altman; AI extinction risk
Future of Life Institute	🇺🇸🇩🇰	Funded early safety research; published 6-month AI pause letter (2023)

Australian

CSIRO Data61 has AI ethics and safety research programs
AISI-aligned research is emerging at Australian universities
The Australian Government’s National AI Centre includes responsible AI components

AI companies’ safety commitments

Anthropic’s Responsible Scaling Policy (RSP)

Anthropic has committed that before deploying models above certain capability thresholds, they will demonstrate adequate safety measures. They regularly publish model cards detailing what they’ve tested and found.

OpenAI’s approach

OpenAI has a stated mission of “safe and beneficial AGI” but has faced criticism that commercial pressures affect safety priorities. The 2023 board crisis (which briefly resulted in Sam Altman’s firing) involved disputes over safety governance.

Google DeepMind

Strong safety research tradition (MIRI alumni, reward modelling work). The Google merge of DeepMind and Google Brain was partly motivated by bringing safety research closer to product development.

The “AI pause” debate (2023)

In March 2023, the Future of Life Institute published an open letter signed by thousands of researchers (including Elon Musk, Yoshua Bengio, and others) calling for a 6-month pause in training AI systems more powerful than GPT-4.

The letter argued that AI labs are engaged in an “out-of-control race” to develop systems even they don’t understand. It proposed a pause to allow safety research to catch up.

What happened: The letter created significant debate but no major labs paused. Training continued; GPT-4 successors were deployed; capabilities continued advancing. The debate shifted to regulatory frameworks (EU AI Act) and government safety institutes.

The critics of the pause letter argued: A pause would just allow less safety-conscious actors (Chinese labs, unregulated startups) to gain ground. The right solution is safety-conscious development, not pausing.

What this means for Australian users

For most everyday AI users, AI safety concerns mean:

Verify important information. Hallucinations are real; don’t trust AI blindly on facts.
Be cautious about AI agents with broad permissions. AI systems that can take actions (send emails, manage files, make purchases) require more scrutiny.
Avoid Chinese AI tools for sensitive use. Political alignment of AI training is a real safety concern.
Support companies doing safety work. Choosing tools from Anthropic, which prioritises safety, is a way to support the field with your subscription dollar.
Participate in public discourse. Australia’s AI regulatory framework is being shaped now. Citizen input matters.

Gotchas

“AI safety” means different things to different people. Near-term practical safety and long-term existential safety are both called “AI safety” but involve very different concerns and communities.
The existential risk debate is genuinely unresolved. Smart, informed people disagree. Be sceptical of both “AI will definitely end civilisation” and “all AI safety concerns are science fiction.”
Safety and capability aren’t in simple tension. Good safety research can produce more reliable, trustworthy, and capable AI systems. The framing of “safety vs speed” is sometimes misleading.
Small AI labs don’t always do safety work. The major frontier labs have safety teams; many smaller open-source model producers don’t. The distribution of safety investment is uneven.

Sources

Anthropic Responsible Scaling Policy (2023, updated 2024)
OpenAI Model Cards and System Cards (various)
“AI Risk Statement” — Centre for AI Safety, signed by Hinton, LeCun, Altman et al. (2023)
Yudkowsky, Eliezer — “AGI Ruin: A List of Lethalities” (LessWrong, 2022)
LeCun, Yann — Public statements on AI risk at NeurIPS, Twitter/X (2023–2024)
Future of Life Institute open letter “Pause Giant AI Experiments” (2023)
UK AI Safety Institute — Interim Report (2023)
NIST AI Risk Management Framework (2023)
Australian CSIRO Data61 — Responsible AI research publications

Tech & AI, Explained

Explorer

ai-safety-primer