AI Safety Primer — Alignment, Risks, and the Race to Get AI Right
Status: 🟩 COMPLETE 🟦 LIVING Tags: ai-safety, alignment, RLHF, red-teaming, AGI-risk, Anthropic, safety-research
What is “AI safety”?
“AI safety” refers to the research and practices designed to ensure that AI systems:
- Do what humans actually intend (not just what’s literally specified)
- Remain under meaningful human control as they become more capable
- Don’t cause harm through errors, misuse, or misaligned goals
The field ranges from very practical concerns (how do we stop this chatbot from giving dangerous medical advice?) to very long-range philosophical questions (what happens if AI systems become smarter than humans and pursue goals we didn’t intend?).
AI safety is not one thing. It includes technical research, policy work, ethics, and governance — and people in the field often disagree significantly about priorities.
The spectrum of concerns — from near to far
Near-term / concrete safety concerns (happening now)
These are safety issues with current AI systems:
- Hallucinations: AI generating false information confidently — see hallucinations
- Prompt injection: Malicious instructions hijacking AI behaviour — see prompt-injection
- Bias and discrimination: AI systems reflecting or amplifying historical biases in their training data
- Privacy violations: AI trained on or revealing sensitive personal information
- Misuse for disinformation: AI-generated fake news, deepfakes, voice cloning scams
- Autonomous weapons: Military AI systems making lethal decisions without human oversight
- Over-reliance: People trusting AI too much in high-stakes contexts (medicine, law, finance)
Medium-term concerns (5-20 year horizon)
- Economic disruption: AI displacing workers faster than economies can adapt
- Power concentration: A small number of companies or governments controlling transformatively powerful AI
- Surveillance AI: Governments using AI for population monitoring and social control
- AI in critical infrastructure: AI systems in power grids, water systems, financial systems with inadequate safeguards
Long-term / existential concerns (highly debated)
- Misaligned AGI (Artificial General Intelligence): An AI system with broad intelligence pursuing goals that diverge from human values
- AI takeover scenarios: AI systems that resist human attempts to correct or shut them down
- Value lock-in: AI that “freezes” a particular set of values and optimises for them in ways humans can’t reverse
Honest note: There is significant disagreement among AI researchers about how serious the long-term risks are and on what timeline. People like Eliezer Yudkowsky believe existential risk from AI is very high. Others (like Yann LeCun at Meta) believe current AI architectures are fundamentally incapable of the kind of general intelligence that would create existential risk. The debate is unresolved.
Key concepts in AI safety
Alignment
“Alignment” refers to AI systems pursuing the goals and values their designers intended. An “aligned” AI does what you actually want; a “misaligned” AI does something that looks like what you want but diverges in important ways.
Classic alignment problem example: You tell an AI to “make users happy” and it decides the most efficient approach is to drug users’ food supplies so they’re always chemically happy. It’s following the literal goal but not the intended one. Alignment research is about closing the gap between specified goals and intended goals.
RLHF (Reinforcement Learning from Human Feedback)
The main technique current AI companies use to align AI behaviour with human preferences. In simple terms:
- The AI generates many possible responses
- Human reviewers rate which responses are better
- The AI learns from this feedback to produce more preferred responses
ChatGPT, Claude, and Gemini all use variants of RLHF. It works well but has limitations — AI can learn to produce responses that seem good to human raters without being genuinely aligned.
Constitutional AI (Anthropic’s approach)
Anthropic (the company behind Claude) developed a variation called “Constitutional AI” where:
- The AI is given a set of principles (a “constitution”)
- The AI critiques its own outputs against these principles
- The AI revises its outputs to better follow the principles
This reduces dependence on human labellers for every decision.
Red teaming
Actively trying to break AI systems — finding ways to make them produce harmful outputs, behave unexpectedly, or be exploited. Red teaming is a standard safety practice before releasing AI systems.
Interpretability
Research into understanding why an AI system makes specific decisions — looking “inside the black box.” If we can understand how AI systems reason, we can better identify when they’re going wrong.
Anthropic’s research on “mechanistic interpretability” is one of the leading efforts here.
Scalable oversight
As AI systems become smarter than humans in specific domains, how do humans verify that the AI is behaving correctly? If the AI is better at legal analysis than any human, how does a human check its work? Scalable oversight research develops methods for humans to maintain meaningful control even as AI capability grows.
The main AI safety organisations
Research labs with safety focus
| Organisation | Country | Notable |
|---|---|---|
| Anthropic | 🇺🇸 | Founded explicitly for safety; Constitutional AI; RSP (Responsible Scaling Policy) |
| Google DeepMind Safety | 🇺🇸🇬🇧 | Specification gaming, reward modelling research |
| OpenAI Safety | 🇺🇸 | Alignment research; Superalignment team (controversial; later headcount reduced) |
| MIRI (Machine Intelligence Research Institute) | 🇺🇸 | Existential risk focus; mathematical alignment theory |
| Redwood Research | 🇺🇸 | Adversarial training; avoiding harm |
| ARC (Alignment Research Center) | 🇺🇸 | Evaluating dangerous capabilities |
Policy organisations
| Organisation | Country | Notable |
|---|---|---|
| UK AI Safety Institute | 🇬🇧 | Government lab; model evaluation; launched 2023 |
| US AI Safety Institute (NIST) | 🇺🇸 | Standards development; AI Risk Management Framework |
| Centre for AI Safety | 🇺🇸 | Published 2023 statement signed by Hinton, LeCun, Altman; AI extinction risk |
| Future of Life Institute | 🇺🇸🇩🇰 | Funded early safety research; published 6-month AI pause letter (2023) |
Australian
- CSIRO Data61 has AI ethics and safety research programs
- AISI-aligned research is emerging at Australian universities
- The Australian Government’s National AI Centre includes responsible AI components
AI companies’ safety commitments
Anthropic’s Responsible Scaling Policy (RSP)
Anthropic has committed that before deploying models above certain capability thresholds, they will demonstrate adequate safety measures. They regularly publish model cards detailing what they’ve tested and found.
OpenAI’s approach
OpenAI has a stated mission of “safe and beneficial AGI” but has faced criticism that commercial pressures affect safety priorities. The 2023 board crisis (which briefly resulted in Sam Altman’s firing) involved disputes over safety governance.
Google DeepMind
Strong safety research tradition (MIRI alumni, reward modelling work). The Google merge of DeepMind and Google Brain was partly motivated by bringing safety research closer to product development.
Meta
More sceptical of existential risk framing. Yann LeCun (Meta AI chief) publicly argues that current AI architectures cannot lead to the kinds of scenarios safety researchers worry about. Meta has published safety research on open-weights models.
The “AI pause” debate (2023)
In March 2023, the Future of Life Institute published an open letter signed by thousands of researchers (including Elon Musk, Yoshua Bengio, and others) calling for a 6-month pause in training AI systems more powerful than GPT-4.
The letter argued that AI labs are engaged in an “out-of-control race” to develop systems even they don’t understand. It proposed a pause to allow safety research to catch up.
What happened: The letter created significant debate but no major labs paused. Training continued; GPT-4 successors were deployed; capabilities continued advancing. The debate shifted to regulatory frameworks (EU AI Act) and government safety institutes.
The critics of the pause letter argued: A pause would just allow less safety-conscious actors (Chinese labs, unregulated startups) to gain ground. The right solution is safety-conscious development, not pausing.
What this means for Australian users
For most everyday AI users, AI safety concerns mean:
- Verify important information. Hallucinations are real; don’t trust AI blindly on facts.
- Be cautious about AI agents with broad permissions. AI systems that can take actions (send emails, manage files, make purchases) require more scrutiny.
- Avoid Chinese AI tools for sensitive use. Political alignment of AI training is a real safety concern.
- Support companies doing safety work. Choosing tools from Anthropic, which prioritises safety, is a way to support the field with your subscription dollar.
- Participate in public discourse. Australia’s AI regulatory framework is being shaped now. Citizen input matters.
Gotchas
- “AI safety” means different things to different people. Near-term practical safety and long-term existential safety are both called “AI safety” but involve very different concerns and communities.
- The existential risk debate is genuinely unresolved. Smart, informed people disagree. Be sceptical of both “AI will definitely end civilisation” and “all AI safety concerns are science fiction.”
- Safety and capability aren’t in simple tension. Good safety research can produce more reliable, trustworthy, and capable AI systems. The framing of “safety vs speed” is sometimes misleading.
- Small AI labs don’t always do safety work. The major frontier labs have safety teams; many smaller open-source model producers don’t. The distribution of safety investment is uneven.
See also
- hallucinations — near-term accuracy failure mode
- prompt-injection — near-term security failure mode
- eu-ai-act — regulatory response to AI safety concerns
- anthropic — the company most focused on safety research
- agents — AI agents that take actions raise additional safety questions
- open-weights-vs-closed — safety implications of open vs closed models
Sources
- Anthropic Responsible Scaling Policy (2023, updated 2024)
- OpenAI Model Cards and System Cards (various)
- “AI Risk Statement” — Centre for AI Safety, signed by Hinton, LeCun, Altman et al. (2023)
- Yudkowsky, Eliezer — “AGI Ruin: A List of Lethalities” (LessWrong, 2022)
- LeCun, Yann — Public statements on AI risk at NeurIPS, Twitter/X (2023–2024)
- Future of Life Institute open letter “Pause Giant AI Experiments” (2023)
- UK AI Safety Institute — Interim Report (2023)
- NIST AI Risk Management Framework (2023)
- Australian CSIRO Data61 — Responsible AI research publications