AI Image Generation — How Machines Make Pictures from Words

Status: 🟩 COMPLETE 🟦 LIVING Section: 10 — AI and LLMs Tags: image-generation, text-to-image, diffusion, generative-ai, midjourney, stable-diffusion, dall-e, flux

What it is

AI image generation lets you describe a picture in plain English (or many other languages) and have a computer create that picture from scratch in seconds. You type something like “a golden retriever wearing a graduation cap, sitting in a library, oil painting style” and the AI produces an image matching your description, in the style you asked for. Depending on the prompt, the result can be photorealistic or artistic.

This is one of the most visible and widely used AI capabilities today. It powers everything from Canva’s “magic generate” button to professional concept-art workflows to many of the images you see on social media.

A plain-English explanation of how it works

The technology behind most AI image generators is called diffusion — think of it like this:

Training phase (done once by the company): The AI is shown hundreds of millions to billions of images, mostly gathered from the internet, each with a text description attached. It learns which visual patterns correspond to which words. “Sunset” → warm oranges and pinks on a horizon. “Corgi” → specific body shape, fur texture, ear shape.
The “noise-to-image” trick: When you type a prompt, the AI starts with a completely random blur of pixels (like TV static) and then gradually “cleans” that noise, step by step, guided by your description — until a coherent image emerges. Each step nudges the picture toward matching your words.
The guidance signal: A separate component called a text encoder (similar to the one inside a language model) turns your prompt into a mathematical signal. The diffusion process uses this signal as its compass while cleaning up the noise.

This whole process happens in seconds on modern hardware. The result is an image that never existed before — synthesised from everything the model learned during training.
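
The noise-to-image loop described above can be sketched in a few lines of Python. This is a toy illustration only, not how any real generator is implemented: the `target` list is a made-up stand-in for "what the prompt describes", and the function names are hypothetical. A real diffusion model never sees the answer; a trained network predicts which noise to remove at each step, guided by the encoded prompt.

```python
import random

def mean_abs_error(image, target):
    """Average per-pixel distance between the current image and the target."""
    return sum(abs(x - t) for x, t in zip(image, target)) / len(target)

def toy_denoise(target, steps=30, seed=0):
    """Start from random noise and close part of the gap to `target` each step.

    ASSUMPTION: `target` is known in advance. That is the cheat that makes
    this a toy; a real model has to predict the noise instead.
    """
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # the "TV static" start
    errors = [mean_abs_error(image, target)]
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap, so the final step lands on the target.
        image = [x + (t - x) / remaining for x, t in zip(image, target)]
        errors.append(mean_abs_error(image, target))
    return image, errors

target = [0.9, 0.5, 0.1, 0.5, 0.9, 0.5, 0.1, 0.5]  # a made-up 8-pixel "picture"
final, errors = toy_denoise(target)
print(f"error at start: {errors[0]:.3f}, after last step: {errors[-1]:.6f}")
```

The error shrinks at every step, which is the "each step nudges the picture" idea in miniature.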

The main quality dimensions

When evaluating AI-generated images, people look at:

Dimension	What it means
Prompt adherence	Did it actually draw what you asked for?
Photorealism	How convincing does it look as a real photo?
Artistic range	Can it do many styles — oil painting, anime, watercolour, sketch?
Coherence	Are faces correct? Hands right? Text readable?
Resolution	High enough for print? Or blurry when zoomed?
Speed	How many seconds per image?
Editing / iteration	Can you tweak specific parts without redoing everything?
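
For the resolution question, a rough way to check whether an image is "high enough for print" is to divide its pixel dimensions by the printer's dots per inch (DPI). The sketch below assumes 300 DPI, a common rule of thumb for sharp prints, and a 1024 × 1024 pixel image as an example size; both numbers are assumptions for illustration, not figures from any particular tool.

```python
def max_print_size_inches(width_px, height_px, dpi=300):
    """Largest print size, in inches, at the given dots per inch."""
    return width_px / dpi, height_px / dpi

# ASSUMPTION: a 1024 x 1024 image printed at 300 DPI.
w, h = max_print_size_inches(1024, 1024)
print(f"{w:.2f} x {h:.2f} inches at 300 DPI")
```

An image that only stretches to a few inches at print DPI will look soft on a poster unless it is upscaled first.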

The major image generators (mid-2026)

Consumer / easy-to-use

Tool	Country	Best for	Free tier?
Midjourney	🇺🇸	Stunning artistic quality; the “prestige” option	Very limited (via Discord)
DALL·E 3 (inside ChatGPT)	🇺🇸	Easy to use; follows complex prompts well	Yes (limited)
Adobe Firefly	🇺🇸	Commercially safe; trained on licensed images	Yes (inside Creative Cloud)
Canva AI (powered by Flux/Stable Diffusion)	🇦🇺🇺🇸	Non-designers; drag-and-drop integration	Yes
Google Imagen 3	🇺🇸	Photorealism; inside Gemini and Workspace	Yes (via Gemini)
Microsoft Designer / Copilot	🇺🇸	Windows users; quick social-media images	Yes
Ideogram	🇨🇦	Text inside images (logos, posters — historically hard for AI)	Yes

Power / professional

Tool	Country	Best for
Flux (Black Forest Labs)	🇩🇪	Open-weights; highest quality for technical users
Stable Diffusion (Stability AI)	🇬🇧	Run locally; infinite customisation via community models
Recraft V3	🇺🇸 🇬🇧	Vector-style outputs; brand and logo work
Leonardo.ai	🇦🇺	Game assets; fine-tuning for consistent characters

Chinese (⛔ avoid for personal or business use)

Tongyi Wanxiang (Alibaba), Jimeng (ByteDance), Kolors (Kuaishou) — see vendors-chinese-avoid

Key concepts you’ll encounter

Text-to-image: The most common mode — you write a prompt and get an image.

Image-to-image (img2img): You provide a starting image and a prompt; the AI modifies the image toward your description. Useful for style transfer (“make this photo look like a painting”) or editing.

Inpainting: You mask (cover) a specific area of an image and ask the AI to regenerate just that part. Used to fix hands, remove objects, change backgrounds.
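
The masking idea can be sketched as follows. This is a simplified illustration with tiny made-up "images" (lists of numbers) and a hypothetical `composite` helper: it only shows how a mask decides which pixels are kept and which are replaced. Real tools also generate the new pixels so they blend with their surroundings, which is not shown here.

```python
def composite(original, generated, mask):
    """Keep original pixels where mask is 0; take generated pixels where mask is 1."""
    return [
        [g if m else o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]

original  = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
generated = [[9, 9, 9], [9, 9, 9], [9, 9, 9]]  # stand-in for the model's output
mask      = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]  # 1 = regenerate this pixel
result = composite(original, generated, mask)
for row in result:
    print(row)
```

Only the bottom-right 2 × 2 block changes; everything outside the mask is untouched.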

Outpainting: You extend the edges of an image beyond its original borders. Great for “expanding” a narrow crop into a wider scene.

ControlNet: An add-on (mainly for Stable Diffusion) that lets you give the AI a structural guide — a pose skeleton, edge map, or depth map — so the generated image follows a specific composition even when you change the style.

LoRA (Low-Rank Adaptation): A small add-on model you can attach to a base model to specialise its output — e.g., always draw in a particular artist’s style or always generate a specific character’s face consistently. Very popular in Stable Diffusion communities.
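
The "low-rank" part of the name can be shown with a small worked example. Instead of changing a large weight matrix directly, a LoRA stores two thin matrices whose product is added to the original weights. The sketch below uses a made-up 4 × 4 matrix and hypothetical helper names; real models apply the same idea to matrices with millions of entries.

```python
def matmul(a, b):
    """Plain matrix multiplication on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a)."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(wrow, drow)]
        for wrow, drow in zip(weight, delta)
    ]

# Made-up base weights: 4 x 4 = 16 numbers.
weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
lora_b = [[0.5], [0.0], [0.0], [0.0]]  # 4 x 1
lora_a = [[0.0, 1.0, 0.0, 0.0]]        # 1 x 4, so the add-on is rank 1 (8 numbers)
adapted = apply_lora(weight, lora_b, lora_a)
for row in adapted:
    print(row)
```

The add-on stores 8 numbers instead of 16 here; at real model sizes the saving is far larger, which is why LoRA files are small and easy to share.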

CFG scale / guidance scale: A dial that controls how strictly the AI follows your prompt. Too high → stiff, over-saturated. Too low → dreamy but ignores your words.
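
Under the hood, this dial is usually implemented as classifier-free guidance: at each step the model makes one prediction with your prompt and one without, and the scale controls how far to push from the second toward (and past) the first. A minimal sketch, with made-up numbers standing in for the two predictions and a hypothetical function name:

```python
def apply_guidance(uncond, cond, scale):
    """Classifier-free guidance: move from the no-prompt prediction toward the prompted one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0]  # made-up prediction with an empty prompt
cond = [1.0, 3.0]    # made-up prediction with the user's prompt
for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(uncond, cond, scale))
```

A scale of 0 ignores the prompt entirely, 1 uses the prompted prediction as-is, and larger values exaggerate the difference, which is where the stiff, over-saturated look at very high settings comes from.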

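Steps: How many denoising steps the AI takes. More steps generally improve quality up to a point, with diminishing returns, and always cost more time. Usually 20–50 is sufficient; some fast models are designed to need only a handful.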

Negative prompt: Words you add to tell the AI what NOT to include. “blurry, ugly, extra fingers, watermark” are common negatives.

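Seed: A number that initialises the random process. The same seed with the same prompt, model version, and settings reproduces the same image (small differences can still appear across different hardware or software versions). Useful for iterating small changes.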
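
Why a seed makes results repeatable can be shown with Python's built-in random number generator. This is a sketch of the principle only; the `starting_noise` function is hypothetical and stands in for the random "static" a generator begins with.

```python
import random

def starting_noise(seed, n_pixels=6):
    """Produce the same list of random values every time for a given seed."""
    rng = random.Random(seed)
    return [round(rng.gauss(0.0, 1.0), 3) for _ in range(n_pixels)]

a = starting_noise(42)
b = starting_noise(42)  # same seed: identical starting noise
c = starting_noise(43)  # different seed: different starting noise
print(a == b, a == c)
```

Because the denoising process is otherwise deterministic, identical starting noise plus identical settings leads to the same picture.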

How to write a good image prompt

A well-written prompt typically includes:

Subject — what’s in the image (“a woman reading a book”)
Setting — where/when (“in a sunlit café, 1920s Paris”)
Style — art form or aesthetic (“Art Nouveau illustration, soft pastel colours”)
Quality boosters — for photorealistic: “DSLR, 85mm lens, shallow depth of field, golden hour lighting”
Negative prompt (where supported) — “blurry, watermark, text, extra limbs”
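
The parts listed above can be assembled mechanically, which helps when generating many variations. The helper below is a hypothetical sketch, not any tool's API: it joins the non-empty parts with commas and keeps the negative prompt separate, since tools that support one take it as a separate field.

```python
def build_prompt(subject, setting="", style="", quality=(), negative=()):
    """Join the non-empty parts into a prompt; return (prompt, negative_prompt)."""
    parts = [subject, setting, style, *quality]
    prompt = ", ".join(p.strip() for p in parts if p and p.strip())
    negative_prompt = ", ".join(n.strip() for n in negative if n and n.strip())
    return prompt, negative_prompt

prompt, negative = build_prompt(
    subject="a woman reading a book",
    setting="in a sunlit café, 1920s Paris",
    style="Art Nouveau illustration, soft pastel colours",
    negative=("blurry", "watermark", "text", "extra limbs"),
)
print(prompt)
print("negative:", negative)
```

Keeping each part in its own argument makes it easy to swap one element (say, the style) while holding the rest fixed.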

Different generators respond differently to prompts. Midjourney tends to prioritise aesthetics and interpret prompts loosely; DALL·E 3 tends to be more literal and precise; Stable Diffusion rewards technical knowledge of keywords.

Commercial and copyright considerations

This is genuinely unsettled territory (mid-2026):

Who owns AI-generated images? In many countries, including Australia, an image generated entirely by AI with no human creative input is generally not protected by copyright, because copyright requires a human author. The legal picture is evolving.
Training data disputes: Several major image generators have been sued by artists whose work was used in training without permission. Some, like Adobe Firefly, were specifically trained on licensed content to avoid this.
Commercial use rights: Check each platform’s terms. Midjourney’s paid tiers grant commercial rights; free tiers may not. Stable Diffusion is open-weights, but its licence terms differ between model versions: some allow broad commercial use, others restrict it.
AI labels: Australia’s voluntary guidance (and growing international regulation) encourages labelling AI-generated images, especially in advertising and news.

What AI image generation still can’t do well (mid-2026)

Hands and fingers: A persistent weakness — extra fingers, fused hands, wrong counts. Improving but still unreliable.
Text inside images: Readable text in generated images is difficult. Ideogram specialises in this and does it better than most.
Consistent characters: Getting the same person/character to look identical across multiple images requires workarounds (seeds, LoRAs, face-lock features).
Complex spatial reasoning: “Put the red ball to the left of the blue cube, on top of the wooden table” — AI often muddles complex positional logic.
Long-form coherence: A single image is fine; a 10-panel comic strip with consistent characters and settings is much harder.
Truly novel inventions: AI remixes what it’s seen. Genuinely new concepts that don’t resemble anything in its training data are harder to realise.

Gotchas

“Free” often means watermarked or low-resolution. Check the resolution and watermark policy before using in any project.
Prompt sensitivity: Small wording changes can dramatically alter results. “A man” vs “a person” vs “an adult” may give different outputs.
Style drift: If you generate 10 variations, they’ll all look slightly different even with the same prompt, because each one starts from a different random seed. For brand consistency you need a system (LoRA, reference images, face lock).
NSFW filters: Most consumer tools have aggressive content filters. Legitimate prompts (nude statues, medical illustrations, violence in historical context) are often blocked.
Model versions matter: Midjourney v6 looks very different from v5. Always check what version you’re using; older versions generally give lower quality.
Generation ≠ editing: Most tools are great at creating from scratch but clunky at surgical edits. Adobe Firefly and Photoshop’s generative fill are better for editing existing images.
Australian privacy note: If you’re generating images of real people (politicians, celebrities), check Australian defamation and image-rights law before publishing.

How image generation fits into a workflow

Concept art / mood boards: Fast and cheap way to explore visual directions before hiring a designer.
Social media content: Canva + AI = endless variations without stock photo subscriptions.
Product mockups: Show a product in different scenes or on different people without a photoshoot.
Marketing materials: Faster iteration; human designer refines the winner.
Game and film pre-production: Storyboards, character concepts, environment sketches at speed.
Book covers / album art: Independent creators now produce professional-looking artwork affordably.

It does NOT replace: final commercial photography, established illustrators whose style carries its own meaning, or custom illustration that must be legally safe to use.

Sources

Stability AI documentation and model releases (2022–2026)
Black Forest Labs Flux announcements (2024–2026)
Midjourney version changelogs
Adobe Firefly commercial use terms
DALL·E 3 system card (OpenAI, 2023)
Australian Attorney-General’s Department — AI and copyright consultation (2023–2024)
