When better prompts aren't the answer: the naturalist who couldn't stop saying "upon"

By Magnus Hultberg • 25 May 2026

I built a small thing called QRious Specimens. You scan a QR code, any QR code, and it gives you back a unique creature, illustrated in the style of a Victorian naturalist's plate, with field notes to match. The whole project is a small love letter to Mary Anning, and the About page tells that part of the story. This post is about what I learned making it.

Here's what happens when you scan. The encoded string runs through a hash chain to produce a deterministic seed, which I treat as the creature's DNA. That seed first draws an instant Victorian SVG sketch from pure mathematics, so nobody has to stare at a blank screen while the real image is generated, and then prompts Gemini for a watercolour-and-ink illustration. The illustration is passed on to Claude alongside the DNA, and Claude writes two paragraphs of period-appropriate field notes in response. Everything accumulates in the Gazette, a feed of every species discovered. Same QR, same creature, same notes, anywhere.
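To make the "hash chain to deterministic seed" step concrete, here is a minimal sketch. It is not the project's actual code: the function names, the number of rounds, and the choice of FNV-1a as the hash are all assumptions made for illustration. The only property that matters is that the same QR payload always yields the same integer.

```typescript
// Hypothetical sketch: the real hash chain is not shown in this post.
// FNV-1a (32-bit) stands in for whatever hash the project actually uses.
// It is applied to UTF-16 code units here, which is enough for a demo.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime in 32 bits
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Chain the hash for a fixed number of rounds, feeding each digest back in
// together with the original payload. The round count is an assumption.
function seedFromQr(payload: string, rounds = 4): number {
  let digest = fnv1a(payload);
  for (let i = 1; i < rounds; i++) {
    digest = fnv1a(`${digest}:${payload}`);
  }
  return digest;
}

// Same payload in, same seed out, on any device.
const seed = seedFromQr("https://example.com/poster");
console.log(seed === seedFromQr("https://example.com/poster")); // true
```

Because nothing in the function depends on time, user, or device, every downstream trait derived from the seed inherits the same determinism.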

Five lessons learned:

Pin the seed. The same QR yields the same creature anywhere, every time. That turns a gimmicky novelty into a shared world: you discover species, you don't conjure them out of randomness, and "first to find one" is a real claim.
Cache the species, not the scan. Because the creature is deterministic, you can cache by its hash rather than by user or request. Thousands of people might scan the same poster; the expensive AI work happens once.
Always show something. The Victorian SVG sketch is drawn from pure mathematics, instantly, from the same seed. If Gemini and Claude take 10+ seconds to deliver their work or fail outright, the user never sees a blank screen.
Multimodal grounding writes better prose. Sending Claude the generated illustration alongside the DNA grounds the field notes in what was actually drawn, not in abstract parameters. Generating the image first and the field notes second takes a bit longer, but the connection between image and text is much stronger.
Measure before you fix. As the corpus grew, the entries started sounding "same-y". The instinct was to rewrite the prompt. That instinct was wrong, and the rest of this post outlines a better way.
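The "cache the species, not the scan" lesson above can be sketched in a few lines. The types and the function name are hypothetical, and a real deployment would presumably use a persistent store, not an in-memory Map. The idea being illustrated is only that the cache key is the creature's seed, never the user or the request.

```typescript
// Hypothetical types and names; the post does not show the real cache.
interface Species {
  seed: number;
  fieldNotes: string;
}

type Generator = (seed: number) => Promise<Species>;

// Wrap an expensive generator with a cache keyed by the deterministic seed.
// Storing the promise itself means concurrent scans of the same poster
// share one in-flight generation instead of each paying for their own.
function cachedBySeed(generate: Generator): Generator {
  const cache = new Map<number, Promise<Species>>();
  return (seed) => {
    const hit = cache.get(seed);
    if (hit) return hit;
    const pending = generate(seed).catch((err) => {
      cache.delete(seed); // do not cache failures; let the next scan retry
      throw err;
    });
    cache.set(seed, pending);
    return pending;
  };
}
```

Caching the promise instead of the resolved value is a design choice: it covers the case where many people scan the same poster within the few seconds the first generation is still running.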

The realisation

It crept up on me. I was passing Claude each creature's DNA and asking for two paragraphs of period-appropriate field notes. The illustrations were varied: jellyfish shapes, spider shapes, things with five eyes, things with no eyes. The prose, on the other hand, started feeling like one naturalist with a writing tic.

Two-thirds of the entries opened with the same shape. "Upon dredging the sediments...", "Upon first observing...", "Whilst examining the specimen...". A temporal-prepositional clause, a slight Victorian throat-clear, and only then the creature. Once I saw it, I couldn't unsee it. Open the Gazette, scroll the first five entries: Upon, Upon, Whilst, Upon, Upon.

The illusion of a living journal with multiple contributors collapsed immediately.

Why "vary your openers" doesn't work

The obvious move is to add a line to the prompt. "Vary your sentence openers. Avoid starting with 'Upon'." Briefly satisfying. The corpus opens up to what looks like greater variation for a few entries. Then the model picks a new favourite and parks there instead: "The creature presents..." dominated for a while, then "Three eyes regarded me..." took over.

Same disease, different host.

The reason is structural. A single-shot prompt has no memory of what was generated for the previous thousand QR codes. It can influence the entry being produced from the inputs right now, but it cannot enforce distribution across a corpus. You can keep iterating prompts forever and the LLM output will keep collapsing toward whatever shape feels most natural to that prompt.

I don't think this is immediately obvious: the prompt has a natural limitation. There are problems prompts can solve, and there are problems prompts cannot solve no matter how cleverly you rewrite them. Figuring out which is which can make or break your project.

The fix

In conversation with Claude, the answer arrived as a different shape of question. Instead of asking the prompt to "vary", we wrote six specific opener directives, distinct rhetorical shapes rather than vague guidance:

Anatomical detail: lead with a feature, a count, an arrangement.
Setting: place, weather, surrounding matter, before the creature enters the sentence.
Sensory clue: a glimmer, a sound, a movement preceding the sighting.
Anomaly or mistaken identity: something taken for one thing before revealing itself as another.
Discovery act: what you were doing when something caught your eye.
Question or contemplation: wonder at the form before describing it.

The directive sent to Claude for any given creature is selected by dna.seed % 6. The same integer seed that determines genus, body plan, eye count, every anatomical trait, now also picks the opener shape. Same QR, same directive, every time.

The shape of the opener text then stops being a stylistic decision Claude makes per-entry and becomes another deterministic projection of the DNA. Claude still writes the actual prose. The directive only controls the angle of entry.
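The selection described above is small enough to show in full. This is a sketch, not the project's code: the directive strings paraphrase the list in this post (the real prompt wording is not shown), and the Dna interface and function name are hypothetical. Only dna.seed % 6 comes from the post itself.

```typescript
// Directive texts paraphrase the six shapes listed in the post;
// the exact wording sent to Claude is an assumption.
const OPENER_DIRECTIVES = [
  "Anatomical detail: lead with a feature, a count, an arrangement.",
  "Setting: place, weather, surrounding matter, before the creature appears.",
  "Sensory clue: a glimmer, a sound, a movement preceding the sighting.",
  "Anomaly or mistaken identity: taken for one thing, revealed as another.",
  "Discovery act: what you were doing when something caught your eye.",
  "Question or contemplation: wonder at the form before describing it.",
];

// Hypothetical minimal shape of the creature's DNA.
interface Dna {
  seed: number;
}

// The seed is assumed to be a non-negative integer, so the remainder is
// always a valid index. The same seed always picks the same directive.
function openerDirective(dna: Dna): string {
  return OPENER_DIRECTIVES[dna.seed % OPENER_DIRECTIVES.length];
}
```

The model is never asked to "be varied". Variety across the corpus comes from the spread of seeds, and each individual prompt gets one concrete instruction.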

Measuring it

The harder bit is knowing it worked, and continuing to know. So Claude came up with a test for it.

The repo now contains a small heuristic classifier (openerShape.ts) that reads the first sentence of a field note and buckets it into one of eight categories: temporal-prepositional, generic-introduction, body-part-first, numeric-feature, setting-first, mistaken-identity, rhetorical-question, other. Pure keyword matching on the first one to three tokens. Deliberately coarse. The job is to catch gross uniformity, not subtle style drift.
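For a sense of how coarse "deliberately coarse" can be, here is a sketch of such a classifier. The eight category names are the ones listed above. Everything else (the keyword lists, the order of the checks, the function name) is an illustrative guess and not the contents of the real openerShape.ts.

```typescript
type OpenerShape =
  | "temporal-prepositional" | "generic-introduction" | "body-part-first"
  | "numeric-feature" | "setting-first" | "mistaken-identity"
  | "rhetorical-question" | "other";

// Illustrative keyword lists (assumptions, not the project's real lists).
const TEMPORAL = new Set(["upon", "whilst", "while", "when", "after", "during"]);
const NUMBERS = new Set(["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]);
const BODY_PARTS = new Set(["eye", "eyes", "tentacle", "tentacles", "appendage", "appendages", "fin", "fins", "mantle", "carapace"]);
const SETTINGS = new Set(["beneath", "among", "amid", "along", "in", "at", "under", "beside"]);
const QUESTIONS = new Set(["what", "how", "why", "who", "could", "can", "is"]);
const MISTAKEN = ["what i took", "i had taken", "at first i", "mistaking"];
const GENERIC = ["the creature", "the specimen", "this creature", "this specimen"];

// Bucket a first sentence by looking only at its first one to three tokens.
function classifyOpener(firstSentence: string): OpenerShape {
  const tokens = firstSentence
    .trim()
    .toLowerCase()
    .split(/[^a-z0-9']+/)
    .filter(Boolean)
    .slice(0, 3);
  const first = tokens[0] ?? "";
  const lead = tokens.join(" ");

  // Phrase checks run first so "What I took for..." is not read as a question.
  if (MISTAKEN.some((p) => lead.startsWith(p))) return "mistaken-identity";
  if (GENERIC.some((p) => lead.startsWith(p))) return "generic-introduction";
  if (TEMPORAL.has(first)) return "temporal-prepositional";
  if (NUMBERS.has(first) || /^\d+$/.test(first)) return "numeric-feature";
  if (tokens.slice(0, 2).some((t) => BODY_PARTS.has(t))) return "body-part-first";
  if (SETTINGS.has(first)) return "setting-first";
  if (QUESTIONS.has(first)) return "rhetorical-question";
  return "other";
}
```

A classifier like this will misfile plenty of individual sentences. That is acceptable for the job described here, which is to notice when two-thirds of a corpus lands in one bucket.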

A separate script generates a batch of real field notes by calling Claude with real DNA seeds, and commits the first sentence of each entry to a JSON file in the repo. The test suite reads that committed corpus and runs three assertions: no single opener shape exceeds 35% of entries, at least four distinct shapes appear, and no single first word exceeds 35%. The 35% threshold is roughly twice the 17% each shape would get if the six directives were evenly distributed: loose enough for sampling noise, tight enough to catch the original disease (which had been running at 67%).
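The three assertions can be sketched as a single check over the committed corpus. The entry shape, function names, and failure messages below are assumptions. The 35% ceiling and the minimum of four shapes are the thresholds stated above (one in six is about 16.7%, and twice that is about 33%, so 35% leaves a little headroom).

```typescript
// Hypothetical shape of one committed corpus entry.
interface CorpusEntry {
  shape: string;     // bucket assigned by the opener classifier
  firstWord: string; // first word of the field note
}

// Fraction of the list taken up by its most common value (0 for an empty list).
function maxShare(values: string[]): number {
  if (values.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  return Math.max(...Array.from(counts.values())) / values.length;
}

// Returns a list of failure messages; an empty list means the corpus passes.
function checkCorpus(entries: CorpusEntry[], maxFraction = 0.35, minShapes = 4): string[] {
  const failures: string[] = [];
  const shapes = entries.map((e) => e.shape);
  const words = entries.map((e) => e.firstWord.toLowerCase());
  if (maxShare(shapes) > maxFraction) failures.push("one opener shape dominates");
  if (new Set(shapes).size < minShapes) failures.push("too few distinct shapes");
  if (maxShare(words) > maxFraction) failures.push("one first word dominates");
  return failures;
}
```

In a test suite, each returned message would map to one failing assertion. Returning messages instead of throwing keeps the check usable from a reporting script as well.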

If anyone, me or Claude or some future me with grand ideas, drifts the prompt back toward uniformity, the test fails. The committed corpus file is, in effect, permanent empirical evidence that the prompt is behaving.

The same disease, found again

Once you've identified this pattern, you can recognise it everywhere.

Adding pull-quotes to the Gazette (single evocative lines extracted from each entry for the community feed) ran straight into the same problem. Without intervention, between 38% and 63% of pull-quotes led with a numeric feature: "Three eyes regarded me from the gloom...", "Four appendages folded with...", "Seven tentacles described...". Counts are salient, so the model led with them. New surface, same disease.

The medicine was identical: six lead-directives (colour and texture, motion, setting, mistaken identity, aphorism, sensory clue) rotated by the same dna.seed. As a belt-and-braces measure I also added an explicit negative constraint to the prompt: "Do not begin with a bare number-word." Structural fix, as well as a spot-fix on top.

What I think this means

The instinct, when an LLM's output feels off, is to rewrite the prompt. Smarter prompt, better behaviour. Sometimes that's right.

But the prompt is a lever, and like any lever it can only move what it's attached to. If the problem lives at the level of a single response (tone, accuracy, structure) the prompt is the right tool. If the problem lives at the level of a corpus (distribution, variety, the felt sense of "these all sound the same") the prompt cannot reach it. Not without help.

The help here was three things, in this order. First, noticing the difference between a single-response problem and a corpus problem. Second, building a way to measure the corpus, even a coarse one, so the difference between "feels better" and "actually better" stopped being vibes. Third, accepting that the structural fix had to come from outside the prompt: a deterministic seed, six shaped slots, the variety baked into the DNA rather than hoped for in the inference.

If you're building anything where an LLM produces many small artefacts that sit next to each other (names, descriptions, summaries, captions, recipes, anything that accumulates), it's worth asking whether your problem is a single-response problem or a corpus problem.

The two may seem identical at first glance. But in practice they have entirely different fixes.