How We Reduced Our AI Costs by 7x in a Week

[Chart: MoreExams daily AI API costs in March 2026, dropping from a large spike on March 17 to a flat low baseline from March 23 onward]

When you upload a 200-page PDF and ask MoreExams to generate 50 practice questions, a large language model has to read your entire document. That's a lot of tokens. Multiply that by hundreds of users doing the same thing, and the API bill adds up fast.

On March 17, our daily AI spend hit a point we weren't happy with. Not catastrophic, but clearly unsustainable at the growth trajectory we were on. Over the following week, we made two focused changes — prompt caching and more targeted prompts — and by March 23 our daily cost had dropped to roughly one-seventh of what it was. This is that story.

Where the Money Was Going

Every generation request in MoreExams has two expensive parts: the input tokens (your document, plus the system prompt that tells the AI what to do) and the output tokens (the actual questions, flashcards, or cheat sheet). Input tokens are cheaper per unit, but they dominate our volume because we send the full document context on every request.

The system prompt alone — the detailed instructions that tell the model how to format questions, calibrate difficulty, avoid hallucination, handle different question types — was several thousand tokens. That prompt was being sent fresh with every single request, even when the same user was generating multiple batches from the same document. We were paying to re-read the same instructions over and over.

Fix One: Prompt Caching

Anthropic's API supports prompt caching: if you mark a portion of your input as cacheable, the model provider stores a processed version of that segment. Subsequent requests that share the same cached prefix are significantly cheaper: cache reads are billed at roughly a tenth of the base input-token rate, in exchange for a small premium on the first request that writes the cache.

We restructured our prompts so that the static parts — the detailed system instructions, the output schema definitions, the grounding rules — sit at the top and are marked for caching. The dynamic part (your document content and the specific generation request) comes after. The result: on the second and subsequent requests within a session, we're paying cache read prices on thousands of tokens instead of full input prices.
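Conceptually, the restructuring looks something like the sketch below. This is not our production code; the model name, instruction text, and function names are placeholders. The important part is the `cache_control` marker on the static system block, which Anthropic's Messages API uses to decide what to cache:

```python
# Illustrative sketch of a cache-friendly request structure.
# SYSTEM_INSTRUCTIONS stands in for our real (several-thousand-token) prompt.
SYSTEM_INSTRUCTIONS = (
    "Detailed formatting rules, output schema definitions, "
    "difficulty calibration, and grounding rules go here..."
)

def build_request(document_text: str, user_request: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 4096,
        # Static prefix: identical across requests and marked cacheable,
        # so repeat requests within the cache window pay cache-read prices.
        "system": [
            {
                "type": "text",
                "text": SYSTEM_INSTRUCTIONS,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic suffix: the user's document and request, sent after the
        # cached prefix so they never invalidate it.
        "messages": [
            {
                "role": "user",
                "content": f"{document_text}\n\n{user_request}",
            }
        ],
    }

payload = build_request(
    "...200 pages of course notes...",
    "Generate 10 medium-difficulty multiple-choice questions.",
)
```

Ordering matters: anything above the cache marker must be byte-identical between requests, which is why the static instructions go first and the per-request content goes last.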

For a typical user who uploads a document and generates two or three batches of questions, this change alone cut input token costs by around 60%. For power users generating across multiple question types from the same course, the savings were even larger.

Fix Two: Smarter, Tighter Prompts

Caching helped the input side. On the output side, we had a different problem: our prompts were asking the model to produce more than we actually needed.

Early versions of our prompts included instructions like "provide a detailed explanation for why each wrong answer is incorrect." Good for learning, but we were generating this for every question in every batch — including cases where the user was just doing a quick warm-up quiz and didn't need verbose answer explanations. We were paying for output that wasn't being surfaced in the UI.

We audited every generation prompt against what we actually displayed to users. Anywhere we were requesting content that wasn't shown, we removed it. Where we needed explanations, we made them request-time parameters rather than always-on. This reduced average output token count per request by about 35%.

We also found that some of our prompts were over-specified in ways that caused the model to hedge and repeat itself. A prompt that says "make sure the questions are not too easy and not too hard" generates more verbose self-correction in the output than a prompt that says "difficulty: medium" with a precise definition. Tighter constraints, shorter output.
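Both output-side fixes reduce to the same move: turn always-on prose instructions into explicit parameters. A minimal sketch, with hypothetical names and difficulty definitions (ours are more detailed):

```python
# Difficulty as a precise, defined parameter instead of vague hedging
# language like "not too easy and not too hard".
DIFFICULTY_DEFINITIONS = {
    "easy": "answerable directly from a single sentence in the source",
    "medium": "requires combining two or three facts from the source",
    "hard": "requires applying a concept from the source to a new scenario",
}

def build_generation_prompt(
    n_questions: int, difficulty: str, include_explanations: bool
) -> str:
    lines = [
        f"Generate {n_questions} multiple-choice questions.",
        f"Difficulty: {difficulty} ({DIFFICULTY_DEFINITIONS[difficulty]}).",
    ]
    # Explanations are a request-time flag: we only pay for those output
    # tokens when the UI will actually display them.
    if include_explanations:
        lines.append(
            "For each wrong answer, explain in one sentence why it is incorrect."
        )
    return "\n".join(lines)
```

A quick warm-up quiz calls this with `include_explanations=False`; a full study session turns it on.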

The Combined Effect

Neither change alone would have hit 7x. Caching brought input costs down substantially. Tighter prompts cut output tokens. Together, they compound: lower input cost per request plus fewer output tokens per request equals a dramatically lower cost per generation.
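As a back-of-the-envelope illustration of the compounding, here is a toy cost model. Every number below is made up for the sketch (the per-token prices, token counts, and cache hit rate are not our real figures), but the structure shows why the two savings multiply rather than add:

```python
# Toy per-request cost model: cached input tokens are billed at a
# discounted rate, uncached input at full rate, output at output rate.
def cost_per_generation(
    input_tokens, output_tokens, cached_fraction,
    input_price, output_price, cache_read_discount=0.1,
):
    cached = input_tokens * cached_fraction * input_price * cache_read_discount
    uncached = input_tokens * (1 - cached_fraction) * input_price
    return cached + uncached + output_tokens * output_price

# Before: no caching, verbose outputs (illustrative numbers only).
before = cost_per_generation(60_000, 8_000, 0.0, 3e-6, 15e-6)
# After: 90% of input hits a warm cache, ~35% fewer output tokens.
after = cost_per_generation(60_000, 5_200, 0.9, 3e-6, 15e-6)
print(f"{before / after:.1f}x")  # → "2.7x" with these toy numbers
```

With these toy inputs the multiple is only about 2.7x; the real multiple depends on how much of your spend is input-side, how often requests reuse a warm cache, and your actual prices, which is why our aggregate result landed higher.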

The chart above shows the daily spend across March. The high bars on March 17–19 are the pre-optimization baseline. The spike on March 22 was a traffic anomaly — an unusually high volume day that we hit right in the middle of the rollout. From March 23 onward, the new baseline holds flat at roughly one-seventh of where we started.

Importantly, we ran the new prompts against a set of quality benchmarks before shipping — sample documents, expected outputs, human review of edge cases. The question quality is unchanged. The cheat sheet coverage is unchanged. What changed is how efficiently we get there.

What This Means for Users

Lower infrastructure costs let us keep MoreExams affordable as we grow. They also give us headroom to run more ambitious AI features — longer document support, multi-document synthesis, more sophisticated question types — without those features becoming economically unviable.

We're continuing to profile every generation path. There are still gains to find on specific question types and on the document pre-processing pipeline. The March numbers were a good start. We're not done.

If you're building AI-powered products and haven't audited your prompts recently, this is the nudge: check what you're actually displaying to users versus what you're asking the model to produce. The gap is usually larger than you think.
