Which AI Model Writes Better Villains: A Genre-by-Genre Test

Murdok | Published May 17, 2026 | Updated May 17, 2026 | 9 min read

Villains break AI models in ways that heroes never do. Ask any model to write a brave protagonist facing impossible odds, and you'll get something serviceable — competent, even moving. Ask that same model to write a villain who's genuinely unsettling, who holds contradictory beliefs with complete sincerity, who makes the reader lean forward instead of pulling back — and suddenly you can see exactly where the seams are.

I've spent months running villain prompts across Claude, GPT-4, and Gemini for different projects, and the differences are stark enough to matter to your actual writing process. This isn't a theoretical comparison. It's the stuff I wish someone had told me before I wasted three hours trying to get one model to do something it fundamentally isn't built for.

Why Villain Writing Is the Ideal AI Model Stress Test

Heroes are easy to simulate. They want clear things for understandable reasons. They grow. They overcome. Every model has absorbed enough of this narrative structure to produce a passable version on autopilot.

Villains are harder because the best ones resist the gravitational pull of the narrative. They don't think they're the villain. They have a worldview that coheres internally even if it's monstrous externally. They have voice consistency under pressure — meaning their speech patterns, their specific obsessions, their tells don't evaporate the moment the plot needs them to be scary.

There's also the menace problem. Real menace in fiction isn't about describing someone as dangerous. It's about the reader sensing it through accumulation — a particular phrasing, a misplaced calm, a moment where the villain's logic is almost convincing. Most AI models default to telling you a villain is threatening rather than making you feel it. The models that can thread the needle between those two things are the ones worth understanding.

The best villain writing is a study in controlled dissonance: the character's self-perception and their actual behavior should never quite align, and the reader should feel that gap like a splinter.

Testing villain writing across genres also matters because genre sets the rules the villain gets to break. A thriller villain operating through plausible institutional power hits differently than a dark lord with cosmic ambitions. A literary fiction antagonist might not even identify as an antagonist. Testing models across these contexts reveals which ones understand genre as a set of reader expectations rather than just aesthetic trappings.

The Three Axes of a Compelling Villain

Before the comparison makes sense, you need a framework for evaluating what you're reading. I judge villain output on three axes.

Voice is the most immediately detectable. Does this character sound like a specific person, or do they sound like "generic menacing figure"? A well-voiced villain has verbal tics, a particular relationship with formality, specific things they reach for when they're making a point. The corrupt senator who keeps using sports metaphors. The cult leader who speaks entirely in questions. Voice is what makes a villain legible and individual.

Motivation is where models most often fail quietly. They'll give you a motivation that sounds right on paper — childhood trauma, ideological extremism, wounded pride — but it won't drive the character's behavior in a consistent, specific way. Motivation shouldn't be backstory. It should be the engine running underneath every scene the character appears in.

Menace is the hardest to prompt for and the easiest to destroy by over-explaining. It lives in implication, in what the villain doesn't say, in the moment of stillness before something terrible. Models that generate menace through accumulation rather than announcement are rare, and they're worth knowing about.
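
If you want to keep notes while comparing outputs, the three axes can be turned into a simple scoring sheet. This is only a sketch: the 1-to-5 scale, the class name `VillainScore`, and the `weakest_axis` helper are assumptions made for illustration, not something the comparison in this article depends on.

```python
from dataclasses import dataclass

AXES = ("voice", "motivation", "menace")


@dataclass
class VillainScore:
    """Hypothetical scoring sheet; the 1-5 scale is an assumption."""
    model: str
    voice: int       # does the character sound like a specific person?
    motivation: int  # does the motivation drive behavior in each scene?
    menace: int      # is the threat felt through implication, not announced?

    def __post_init__(self):
        # Reject scores outside the assumed 1-5 range.
        for axis in AXES:
            value = getattr(self, axis)
            if not 1 <= value <= 5:
                raise ValueError(f"{axis} must be between 1 and 5, got {value}")

    def weakest_axis(self) -> str:
        """Return the lowest-scoring axis (ties resolve in AXES order)."""
        return min(AXES, key=lambda axis: getattr(self, axis))


if __name__ == "__main__":
    # Made-up scores for a placeholder model, purely to show usage.
    sheet = VillainScore(model="model_a", voice=4, motivation=2, menace=3)
    print(sheet.weakest_axis())
```

Scoring each draft the same way makes it easier to see which axis a given output is failing on before you decide what to revise.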


Head-to-Head: How Each Model Handles the Thriller Antagonist

The thriller villain lives in our world. They have budgets, calendars, and plausible deniability. Their power comes from institutions, not magic systems. And their menace has to feel real in a way fantasy villains don't — because the reader's lizard brain is constantly checking whether this could actually happen.

I ran all three models on the same prompt: a senior intelligence officer who has authorized a civilian's death and is now in a routine meeting, where he has to perform normalcy. Here's a version of that prompt you can adapt:

Write a scene (400 words) in close third person from the POV of Richard Voss, a 58-year-old deputy director of a domestic intelligence agency. He's just authorized the death of a whistleblower and is now sitting in a budget meeting. He's not anxious — he's done this before. His internal voice should be precise, slightly contemptuous of the people around him, and occasionally nostalgic. He should never think about what he did in moral terms. Show his menace through what he notices and ignores, not through description of him as dangerous.

Claude handled this best, by a clear margin. What it produced had real interiority — Voss noticed the wrong things (a junior analyst's cheap watch, the specific font on a slide deck) and his contempt was specific rather than generic. Crucially, Claude honored the instruction about moral framing. Voss didn't agonize or suppress guilt. He simply didn't have the category. That's hard to do. It requires the model to sustain a coherent but alien psychology rather than projecting a familiar one.

GPT-4 produced technically competent prose with one significant problem: it couldn't resist signaling. Small intrusions of authorial judgment kept appearing — a pause that lingered "just a moment too long," a slight tightening around the eyes. These tells are the model reassuring you that it knows Voss is bad. They undermine the effect immediately. That said, GPT-4's Voss had sharper, more clipped dialogue than Claude's, and his voice in direct speech was excellent.

Gemini struggled most with the psychological consistency requirement. The Voss it generated drifted — sometimes cold and institutional, sometimes briefly self-aware in ways that felt more like a thriller protagonist's corrupted mentor than an actual architect of harm. The prose was clean but the character kept breaking frame.

For thriller antagonists: Claude for interiority and sustained psychology, GPT-4 for sharp dialogue you can drop into existing scenes.

Fantasy vs. Literary Fiction: Where Model Strengths Diverge

Fantasy and literary fiction pull villain writing in completely opposite directions, and the model rankings shift accordingly.

The Fantasy Villain

Fantasy villains carry extra freight. They have to be large enough to justify an entire narrative, often cosmological in their ambitions, but they also need specific texture or they become cardboard. The failure mode here is the eloquent nihilist — beautifully spoken, philosophically consistent, but somehow never surprising.

GPT-4 is unexpectedly strong with fantasy antagonists. It has clearly absorbed a huge amount of epic fantasy and can write villain monologues that feel mythologically weighted without tipping into parody. More usefully, it handles the villain who believes in something — not just destruction for its own sake, but a genuine (if warped) vision of order or beauty or sacrifice.

Here's the prompt:

Write a monologue (300 words) for Serevane, an ancient elven archivist who has spent 800 years systematically destroying knowledge she deems "corrupting." She doesn't see herself as a destroyer — she sees herself as a curator of what's worth preserving. She's speaking to a young scholar she's about to imprison. Her tone should be warm, almost maternal, and entirely sincere. She should reference specific texts she's destroyed as if recalling beloved objects she's had to let go. No self-doubt. No irony. She means every word.

GPT-4 produced a Serevane who was genuinely chilling because she was genuinely kind. The warmth didn't feel performative; it felt as though the model understood that her kindness and her horror were the same thing. Claude's version was good but more self-conscious: the maternal warmth felt performed rather than inhabited.

Gemini, interestingly, excelled at one narrow but valuable thing in fantasy: physical and sensory presence. If you need your fantasy villain to feel embodied — their movements, their specific gestures, the way a room changes when they enter it — Gemini's output often has a cinematic quality the other models lack. It's less useful for interiority, but for the externally observed villain scene, it's worth testing.

The Literary Fiction Antagonist

Here's where things get genuinely interesting. Literary fiction often doesn't have a villain in any clean sense. It has someone whose presence damages the people around them — through smallness, through fear, through withholding love. The antagonist might be a mother. A boss. A friend who stays too long.

This is Claude's territory. No contest.

Literary antagonists require holding moral ambiguity without resolving it, which means the model can't lean on genre signals to tell it when someone is "supposed" to seem threatening. It has to build that entirely from behavior, from the small recognizable textures of how people make each other suffer.

Here's the prompt:

Write a dinner table scene (500 words) in close third person from the POV of an adult daughter visiting her mother for the holidays. The mother, Diane, is not abusive in any dramatic way — she's subtly corrosive. She asks questions that aren't really questions. She mentions the daughter's weight once, very briefly, then moves on. She talks about a neighbor's success in a way that isn't quite a comparison. She loves her daughter. That's the problem. Write Diane's dialogue so that every line is technically fine and collectively devastating. The daughter should not name what's happening. She should just feel it.

Claude's output for this kind of prompt is genuinely difficult to distinguish from competent human literary fiction. It understands that Diane's damage comes from precision and deniability. GPT-4 made Diane too obviously pointed: the lines were sharper but less true, more like satire than observation. Gemini's version had Diane comment on the food twice, which is true to life but too on the nose; it lacks the restraint that makes the character devastating.


How to Choose and Combine Models for Your Villain's Specific Weakness

The practical truth is that you don't have to pick one model and commit. The smarter approach is to diagnose what your villain is currently missing and route that specific problem to the right tool.

If your villain's motivation feels thin or decorative — present in the backstory but not driving the scenes — use Claude to excavate it. Ask it to write three scenes where the motivation expresses itself without the villain ever naming it. Don't ask for motivation as explanation. Ask for motivation as behavior.

If your villain's dialogue is flat or interchangeable with your other characters' (if you could swap their lines with the hero's without breaking anything), take the existing dialogue to GPT-4 and ask it to rewrite specifically for voice. GPT-4 is good at giving characters verbal signatures: the thing they return to, the rhythm that's theirs alone.

If your villain needs physical presence and staging — if they feel like a floating head delivering speeches rather than a body in a room — use Gemini for the blocking. Ask it to rewrite the scene focusing only on what the villain does with their hands, their attention, their body. Then integrate the physicality into whatever draft you have.

Here's an example of a diagnostic prompt for the motivation case:

Here is a scene with my villain, Marcus. His motivation is that he experienced profound public humiliation at 22 and has organized his entire life around ensuring it can never happen again — but this motivation isn't showing up in the scene. Rewrite the scene (keep the plot beats identical) so that Marcus's behavior in every exchange is slightly shaped by threat-detection: he's always reading the room for status, always noting who has power over him, always making small adjustments. He should never acknowledge this consciously. [PASTE SCENE]

This kind of targeted, diagnostic prompting works because you're not asking the model to invent a villain from scratch — you're asking it to solve a specific craft problem you've already identified. Models perform dramatically better when the task is narrow and the standard of success is clear.
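
The diagnose-then-route approach in this section can be sketched as a small lookup table. Treat it as an illustration only: the weakness keys, the `ROUTES` table, and the `build_request` helper are hypothetical names, the model names are plain labels rather than API identifiers, and the instruction strings paraphrase the advice above.

```python
# Weakness -> (model label, revision instruction), following this section.
# Keys and helper names are hypothetical; model names are labels only.
ROUTES = {
    "motivation": (
        "Claude",
        "Write three scenes where the motivation expresses itself "
        "without the villain ever naming it.",
    ),
    "dialogue": (
        "GPT-4",
        "Rewrite this dialogue so the villain has a verbal signature: "
        "the thing they return to, the rhythm that's theirs alone.",
    ),
    "physicality": (
        "Gemini",
        "Rewrite the scene focusing only on what the villain does with "
        "their hands, their attention, their body.",
    ),
}


def build_request(weakness: str, scene: str) -> dict:
    """Pair a diagnosed weakness with a model label and a full prompt."""
    if weakness not in ROUTES:
        raise KeyError(f"unknown weakness: {weakness!r}")
    model, instruction = ROUTES[weakness]
    return {"model": model, "prompt": f"{instruction}\n\n{scene}"}


if __name__ == "__main__":
    request = build_request("dialogue", "[PASTE SCENE]")
    print(request["model"])
    print(request["prompt"])
```

The point of the table is that the diagnosis comes first: you name the weakness yourself, and only then does a model or an instruction get chosen.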

One last thing worth knowing: model outputs are most useful as first drafts of voice, not finished character work. The goal isn't to have the AI write your villain. It's to find the villain's frequency — the specific timbre of how they speak and think — and then write them yourself from inside that frequency. Run a few hundred words of output, read it aloud, feel for the places where it snaps into focus. That's your signal. That's what you take back to your own draft.

The specific thing to try this week: take a scene where your villain already appears and run it through all three models with the same revision instruction. Compare what each model reaches for. The differences will tell you more about your villain's current weaknesses than any amount of abstract planning.
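
A minimal sketch of that exercise, assuming you wrap each model behind a function that takes a prompt and returns text. No real API is called here: `compare_models` and `placeholder_model` are hypothetical names, and the placeholders stand in for whatever client code you actually use.

```python
from typing import Callable, Dict


def compare_models(
    scene: str,
    instruction: str,
    models: Dict[str, Callable[[str], str]],
) -> Dict[str, str]:
    """Send the identical prompt to every model and collect the drafts."""
    prompt = f"{instruction}\n\n{scene}"
    return {name: generate(prompt) for name, generate in models.items()}


def placeholder_model(label: str) -> Callable[[str], str]:
    """Stand-in for a real model call; swap in your own client function."""
    def generate(prompt: str) -> str:
        return f"[{label} draft for a {len(prompt)}-character prompt]"
    return generate


if __name__ == "__main__":
    drafts = compare_models(
        scene="[PASTE SCENE]",
        instruction="Rewrite this scene so the villain's menace is implied.",
        models={
            "Claude": placeholder_model("Claude"),
            "GPT-4": placeholder_model("GPT-4"),
            "Gemini": placeholder_model("Gemini"),
        },
    )
    for name, draft in drafts.items():
        print(f"--- {name} ---\n{draft}\n")
```

Keeping the prompt identical across models is what makes the comparison meaningful: any difference in the drafts comes from the model, not from the wording.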
