Methodology

Why QuizPilot is not an AI wrapper

Turning a messy exam document into a correct quiz is mostly not a job for a language model. Here is the system that does it, and why about one in three documents never calls one.

The Pilot AI team · June 2026 · 8 min read

Key takeaways

QuizPilot is an extraction system, not a thin layer over a language model. Deterministic code handles detection, parsing, segmentation, and validation. The model is asked for one narrow judgment, or it is not called at all.
There is no fast-but-sloppy shortcut. Dropping a whole document into a model is slow, expensive, and silently loses most of the questions, because the model can only output so much in one call and truncates a large bank to a fraction. Splitting the text at question boundaries and extracting small batches in parallel is faster, cheaper, and keeps every question, so a 500-question document yields close to 500.
About one in three real uploads becomes a quiz with no extraction-model calls. Those documents are parsed deterministically, with the correct answer traced to a marker in the source rather than guessed.
The rule behind every design choice is simple: a wrong answer is worse than no answer. The pipeline would rather drop a doubtful question than ship a confident mistake.
Coverage compounds. A repeatable loop turns each new document format real users bring into a parser, so the share handled without AI grows over time instead of standing still.

It is tempting to think there is a fast, rough way to build a quiz generator and a slow, careful way. There is not. The obvious approach, dropping the whole document into a language model and asking for questions, is slow, expensive, and quietly throws most of the questions away. Doing it properly turns out to be the faster and cheaper way too.

Start with the throwing-away part, because it is the one people miss. A model call has a hard ceiling on how much it can produce in one go. Ask it to pull every question out of a 500-question bank in a single shot and it does not refuse. It simply runs out of room and stops, handing back a fraction of them, sometimes only a few dozen, and silently dropping the rest. That one giant generation is also a single long call, and you pay to push the entire document through the model every time you run it. Slow, expensive, and lossy, all at once.
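A rough back-of-the-envelope version of that ceiling, as a sketch. The output cap and the tokens-per-question figure below are made-up illustrative numbers, not measurements from QuizPilot or from any particular model.

```python
import math

def questions_that_fit(output_cap_tokens: int, tokens_per_question: int) -> int:
    """How many extracted questions fit in a single response before it is cut off."""
    return output_cap_tokens // tokens_per_question

def calls_needed(total_questions: int, output_cap_tokens: int, tokens_per_question: int) -> int:
    """How many separate calls it takes to return every question in full."""
    per_call = questions_that_fit(output_cap_tokens, tokens_per_question)
    if per_call == 0:
        raise ValueError("a single question does not fit in one response")
    return math.ceil(total_questions / per_call)

# Hypothetical numbers: an 8,000-token output cap and 150 tokens per extracted question.
ONE_SHOT = questions_that_fit(8_000, 150)
CALLS = calls_needed(500, 8_000, 150)
```

Under those assumed numbers a single call can return 53 of the 500 questions, and it takes 10 calls to return all of them. The exact figures will differ by model and by output format. The shape of the problem does not.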

Then there is the mess. A scanned exam bank has questions glued together with no spacing. A spreadsheet has its answer key in a column that quietly shifts halfway down. The same symbol means three different things on three different lines. Hand all of that to a model in one pass and it will give you back plausible questions with the wrong answers marked correct. For most products that is a cosmetic bug. For a learning tool it teaches a student something false.

So QuizPilot treats extraction as an engineering problem first and a model problem last. Deterministic code reads the text, splits it at question boundaries so no question is ever cut in half, and feeds the model small batches in parallel instead of one giant document. The model is brought in for the one thing it is genuinely better at than a parser, on a small piece at a time, and nothing more.
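The split-then-batch idea can be sketched in a few lines. This is a minimal illustration, not QuizPilot's splitter: the boundary rule (a line that starts with a number followed by a dot or a closing parenthesis), the batch size, and the `extract_batch` callable are all assumptions made for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical boundary rule: a question starts on a line that begins with "12." or "12)".
QUESTION_START = re.compile(r"^(?=[ \t]*\d+[.)])", re.MULTILINE)

def split_at_boundaries(text: str) -> list[str]:
    """Split only at question starts, so no question is ever cut in half."""
    pieces = (piece.strip() for piece in QUESTION_START.split(text))
    return [piece for piece in pieces if piece]

def make_batches(questions: list[str], batch_size: int) -> list[list[str]]:
    """Group whole questions into small batches."""
    return [questions[i:i + batch_size] for i in range(0, len(questions), batch_size)]

def extract_all(
    text: str,
    extract_batch: Callable[[list[str]], list[str]],  # stand-in for one model call
    batch_size: int = 10,
    workers: int = 4,
) -> list[str]:
    """Run the extractor over every batch in parallel and keep document order."""
    batches = make_batches(split_at_boundaries(text), batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(extract_batch, batches))  # map() preserves input order
    return [item for batch in results for item in batch]
```

Because every batch holds whole questions, a batch that fails can be retried on its own without touching the others, and no single response has to carry the whole document.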

What happens to a document

Every document takes the shortest path that produces a correct result. It is read according to its file type, checked for whether it already contains questions, and classified by content type, because a physics sheet needs different formatting rules than a history quiz.

Then the deterministic parser cascade runs. Eleven format parsers try in order, and the first one that matches cleanly wins, with no model involved. If none fit, the document falls through to progressively more capable and more expensive stages: a structure parse that asks the model only to pick the correct option, then full extraction over chunks split at question boundaries, and finally a vision model for pages that are scanned or garbled. Most documents never reach that far.

1. Read the file. Separate extractors for PDF, images, DOCX, XLSX, and PPTX.
2. Check for questions. Decide whether the document already contains questions, and estimate how many.
3. Detect the content type. Math and physics format differently from standard text, so the rules adapt.
4. Deterministic parser cascade (no AI). Eleven format parsers try in order. The first clean match wins, with no model call.
5. Structure parse, single judgment (AI, bounded). Code extracts the questions and options. The model only picks which option is correct.
6. Batched extraction (AI, bounded). Only when no parser fits. The document is split at question boundaries and read in parallel.
7. Vision fallback (AI, bounded). Scanned or garbled pages are rendered and read by a vision model.
8. Validate, dedupe, shuffle, save. Every question is checked, near-duplicates are removed, and answer positions are randomized.

A document takes the shortest path to a correct result. The deterministic cascade resolves about one in three uploads with no model call. The stages beneath it are fallbacks, each one narrowing what the model is allowed to decide.
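The "first clean match wins" rule is easy to sketch. The two parsers below are toy stand-ins invented for this example (a star-marked format and an "Answer: B" format). They are not QuizPilot's eleven parsers, and the question dictionary shape is an assumption.

```python
import re
from typing import Callable, Optional

Parser = Callable[[str], Optional[list[dict]]]

def _blocks(text: str) -> list[list[str]]:
    """Split on blank lines and return the non-empty lines of each block."""
    return [
        [line.strip() for line in block.splitlines() if line.strip()]
        for block in re.split(r"\n\s*\n", text.strip())
    ]

def parse_star_marked(text: str) -> Optional[list[dict]]:
    """Toy format: a stem, then options, with exactly one option prefixed by '*'."""
    questions = []
    for lines in _blocks(text):
        if len(lines) < 3:
            return None
        stem, options = lines[0], lines[1:]
        marked = [i for i, option in enumerate(options) if option.startswith("*")]
        if len(marked) != 1:
            return None  # not a clean match: decline rather than guess
        cleaned = [option.lstrip("*").strip() for option in options]
        questions.append({"text": stem, "options": cleaned, "correct": marked[0]})
    return questions or None

def parse_answer_line(text: str) -> Optional[list[dict]]:
    """Toy format: a stem, then options, then a final line such as 'Answer: B'."""
    questions = []
    for lines in _blocks(text):
        if len(lines) < 4:
            return None
        match = re.fullmatch(r"Answer:\s*([A-Z])", lines[-1])
        options = lines[1:-1]
        if match is None or ord(match.group(1)) - ord("A") >= len(options):
            return None
        index = ord(match.group(1)) - ord("A")
        questions.append({"text": lines[0], "options": options, "correct": index})
    return questions or None

def run_cascade(text: str, parsers: list[tuple[str, Parser]]) -> tuple[Optional[str], list[dict]]:
    """Try each parser in order. The first one that returns questions wins."""
    for name, parser in parsers:
        result = parser(text)
        if result:
            return name, result
    return None, []  # nothing matched: fall through to the model-backed stages

CASCADE = [("star_marked", parse_star_marked), ("answer_line", parse_answer_line)]
```

The important property is that a parser declines on anything ambiguous. A `None` here is not a failure. It is the signal to try the next, more expensive stage.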

One bounded job for the model

This is the part the word "wrapper" gets wrong. The model in QuizPilot has a small, bounded job. It reads the options out of a chunk of text, or in the structure-parse stage it does only one thing: pick which option is correct. It does not decide where one question ends and the next begins, which answers are duplicates, or whether the result is good enough to ship. Deterministic code owns all of that, along with file extraction, detection, segmentation, validation, de-duplication, answer shuffling, caching, and metrics.

The result is measurable. About one in three real uploads is turned into a quiz with no extraction-model call at all, parsed end to end by the deterministic cascade. For the rest, the model is one bounded step inside a pipeline that checks its work, not the pipeline itself.
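As a sketch of what a single bounded judgment can look like: code has already extracted the stem and the options, and the model is asked for one letter. The prompt wording and the `ask_model` callable are placeholders for illustration, since the real prompts are not published.

```python
from typing import Callable, Optional

LETTERS = "ABCDEFGH"

def pick_correct_option(
    stem: str,
    options: list[str],
    ask_model: Callable[[str], str],  # hypothetical stand-in for one model call
) -> Optional[int]:
    """Ask for one letter, and accept the reply only if it is exactly one valid letter."""
    if not 2 <= len(options) <= len(LETTERS):
        return None
    letters = LETTERS[:len(options)]
    listing = "\n".join(f"{letter}) {option}" for letter, option in zip(letters, options))
    prompt = (
        "Reply with the single letter of the correct option and nothing else.\n\n"
        f"{stem}\n{listing}"
    )
    reply = ask_model(prompt).strip().upper()
    if len(reply) == 1 and reply in letters:
        return letters.index(reply)
    return None  # anything else is treated as "unsure", and the question is dropped
```

The model never returns question text, option text, or structure, so it has no opportunity to rewrite them. The worst it can do is pick the wrong letter or produce a reply that gets rejected.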

When the model is needed, the primary extractor is Gemini 2.5 Flash-Lite, with automatic failover to a second provider if it is rate-limited or refuses, and a separate vision model for scanned pages. The interesting engineering is not which model we call. It is how little we ask it to do. A smaller job is cheaper, faster, and has less room to go wrong.
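Failover of this kind is a small amount of code. The sketch below assumes each provider is wrapped in a callable that raises one of two hypothetical exceptions. It does not use any real provider SDK, and the exception names are invented for the example.

```python
from typing import Callable

class RateLimited(Exception):
    """Hypothetical: the provider asked us to slow down."""

class Refused(Exception):
    """Hypothetical: the provider declined to process the content."""

Provider = Callable[[list[str]], list[dict]]

def extract_with_failover(
    batch: list[str],
    providers: list[tuple[str, Provider]],
) -> tuple[str, list[dict]]:
    """Try providers in order and move on only for rate limits and refusals."""
    errors = []
    for name, call in providers:
        try:
            return name, call(batch)
        except (RateLimited, Refused) as exc:
            errors.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Only the two expected failure modes trigger failover. Any other exception propagates, so a genuine bug is not hidden behind a quiet switch to the second provider.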

~1 in 3 real uploads become a quiz with zero extraction-model calls

Why this is genuinely hard

A few real examples make the point. Inside a single document, question numbering is rarely consistent. One question reads 1. with a space after the dot, the next reads 2.Question glued straight to its text. A naive parser silently folds the glued one into the question above it, and you lose a question without ever knowing it was there.
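The glued-numbering failure is easy to reproduce. The two patterns below are illustrative, not the production rules: one demands whitespace after the dot, the other also accepts a letter glued straight to it while still refusing decimals such as 3.14.

```python
import re

# Naive rule: a number, a dot, then whitespace.
NAIVE_START = re.compile(r"^\d+\.\s", re.MULTILINE)

# Tolerant rule: a number and a dot, followed by whitespace or a letter.
# [^\W\d] matches a word character that is not a digit, so "3.14" is not a question start.
TOLERANT_START = re.compile(r"^\d+\.(?=\s|[^\W\d])", re.MULTILINE)

def count_question_starts(text: str, pattern: re.Pattern) -> int:
    """Count the lines the pattern treats as the start of a question."""
    return len(pattern.findall(text))

SAMPLE = (
    "1. What is 2 + 2?\n"
    "A) 3\n"
    "B) 4\n"
    "2.What is 3.5 rounded up?\n"
    "A) 3\n"
    "B) 4\n"
)
```

On `SAMPLE` the naive pattern finds one question start and the tolerant pattern finds two. The naive result raises no error, which is exactly why the lost question goes unnoticed.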

Symbols are overloaded. A hash can be a question number, a correct-answer marker, or a delimiter, depending on context, and the marker for a correct answer changes by region and format. It might be a star, a plus, an equals sign, a hash, a check mark, or the words for correct answer spelled out. The same character has to be read three different ways on three different lines, which a single fixed rule cannot do and a careful cascade can.
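To make the overloading concrete, here is a toy classifier for the hash character alone. The three conventions it recognises are assumptions chosen for the example. Real documents vary, and a real cascade would also use the surrounding lines rather than one line in isolation.

```python
import re

def hash_role(line: str) -> str:
    """Guess what a '#' means on this line, checking the most specific reading first."""
    stripped = line.strip()
    if re.fullmatch(r"#+", stripped):
        return "delimiter"          # a line of nothing but hashes separates blocks
    if re.match(r"#\s*\d+\b", stripped):
        return "question_number"    # "#12 What is ..."
    if re.match(r"#\s*\S", stripped) or re.search(r"\S\s*#$", stripped):
        return "correct_marker"     # "#Paris" or "Paris #"
    return "none"
```

The order of the checks is the point. A single rule such as "a leading hash marks the correct answer" would mark every hash-numbered question stem as a correct option. Even this ordering is only a heuristic: an option whose text is itself a number, written as "#4", would need context from neighbouring lines to resolve.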

Spreadsheets drift. An exam bank fills the subject column on the first row of a block and leaves it blank below, so a later row lines up against the wrong column and a correct answer gets marked wrong. That is invisible to a model reading cell by cell, and caught by code that knows the shape of the sheet. Each of these was a real bug, found in a real document, and each is now a permanent regression test so it cannot come back.
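One way to make code "know the shape of the sheet" is to keep every row keyed by its header and carry a blank block-level cell down from the row above. This is a sketch under two assumptions: rows have already been read into dictionaries keyed by column name, and a blank subject cell means "same as the row above". The column names are made up.

```python
def forward_fill(rows: list[dict], column: str) -> list[dict]:
    """Fill blank cells in one column with the last non-blank value above them."""
    filled = []
    last_value = None
    for row in rows:
        value = row.get(column)
        if value is None or str(value).strip() == "":
            value = last_value      # blank cell: inherit from the block header row
        else:
            last_value = value      # non-blank cell: start of a new block
        filled.append({**row, column: value})
    return filled

SHEET = [
    {"subject": "Physics", "question": "Q1", "answer": "B"},
    {"subject": "",        "question": "Q2", "answer": "C"},
    {"subject": None,      "question": "Q3", "answer": "A"},
    {"subject": "History", "question": "Q4", "answer": "D"},
]
FILLED = forward_fill(SHEET, "subject")
```

Keeping the blank cell in place, instead of dropping it and letting the remaining cells slide left, is what keeps the answer column aligned. The fill only restores the subject.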

A wrong answer is worse than no answer

That single rule explains the entire verification layer. Before a quiz ships, every question is checked. It must have real options, not placeholders or empties. The options must be unique. The correct index must point to an option that exists, with an automatic fix for the common case of a model counting from one instead of zero. Questions with too few options, or a structure parse the model is unsure about, are dropped rather than guessed.
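Those checks translate almost directly into code. The sketch below is an illustration of the rule, not the production validator: the placeholder list, the minimum option count, and the question dictionary shape are assumptions. The off-by-one repair is applied only in the one unambiguous case, where the index equals the number of options.

```python
from typing import Optional

# Hypothetical placeholder strings that should never count as a real option.
PLACEHOLDERS = {"", "-", "...", "n/a", "tbd"}

def validate_question(question: dict, min_options: int = 2) -> Optional[dict]:
    """Return a cleaned question, or None if it should be dropped."""
    options = [str(option).strip() for option in question.get("options", [])]
    if len(options) < min_options:
        return None                                  # too few options
    if any(option.lower() in PLACEHOLDERS for option in options):
        return None                                  # empty or placeholder option
    if len({option.casefold() for option in options}) != len(options):
        return None                                  # duplicate options
    correct = question.get("correct")
    if isinstance(correct, bool) or not isinstance(correct, int):
        return None                                  # unsure or missing answer
    if correct == len(options):
        correct -= 1                                 # counted from one, not zero
    if not 0 <= correct < len(options):
        return None                                  # index points at nothing
    return {
        "text": str(question.get("text", "")).strip(),
        "options": options,
        "correct": correct,
    }
```

Every failure path returns `None` instead of attempting a repair, which is the rule in miniature: a dropped question costs a little coverage, while a guessed one can teach something false.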

The assembled quiz is de-duplicated by a normalized signature, answer positions are shuffled so the correct choice is not always in the same place, and a failed or partial extraction is never cached, so a bad run cannot poison later ones. A coverage gate stops a handful of stray markers in a long document from hijacking the whole parse. And if a paying user ever loses more than a small fraction of their expected questions, an alert is raised automatically, so we hear about a bad document before they have to tell us.
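A minimal sketch of three of those steps: de-duplication by signature, shuffling that keeps track of the correct answer, and a coverage gate. The normalization rules and the 0.8 threshold are assumptions for illustration. The real signature and thresholds are not published.

```python
import random
import re
import unicodedata

def _normalize(text: str) -> str:
    """Fold case, strip a leading question number and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"^\s*\d+\s*[.)]\s*", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def signature(question: dict) -> tuple:
    """Same stem and same set of options means the same question, whatever the order."""
    options = tuple(sorted(_normalize(option) for option in question["options"]))
    return (_normalize(question["text"]), options)

def dedupe(questions: list[dict]) -> list[dict]:
    """Keep the first question seen for each signature."""
    seen, unique = set(), []
    for question in questions:
        sig = signature(question)
        if sig not in seen:
            seen.add(sig)
            unique.append(question)
    return unique

def shuffle_options(question: dict, rng: random.Random) -> dict:
    """Shuffle the options and move the correct index along with its option."""
    order = list(range(len(question["options"])))
    rng.shuffle(order)  # order[new_position] == old_position
    return {
        **question,
        "options": [question["options"][old] for old in order],
        "correct": order.index(question["correct"]),
    }

def passes_coverage_gate(parsed: int, expected: int, min_ratio: float = 0.8) -> bool:
    """Reject a parse that explains only a sliver of the questions the document seems to hold."""
    if expected <= 0:
        return parsed > 0
    return parsed / expected >= min_ratio
```

Passing a seeded `random.Random` into `shuffle_options` makes the shuffle reproducible in tests. The same ratio used by `passes_coverage_gate` could also drive the lost-questions alert, with a different threshold.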

Getting there took the kind of work that never shows up in a demo. A boundary-first splitter that respects where questions actually start cut mid-question breaks by 18 percent and reduced model calls by 13 percent. Teaching it to handle glued and hash-numbered questions cut those breaks by a further 33 percent.

+335 documents, +5.4 points of coverage

+141 documents, +1.7 points of coverage

Each pass of the corpus loop moves more document formats onto the deterministic, zero-AI path. Two recent ships added 335 and 141 documents of coverage.

Coverage that compounds

The deterministic share is not frozen. It grows through a repeatable loop: capture the new document formats real users upload, find the ones the parsers miss, build and adversarially test a new parser against real files, and ship it. Every pass widens the set of documents handled with no AI at all. A wrapper is only ever as good as its model is this week. QuizPilot gets better on its own schedule.

It is also built for a specific reality. Central Asian exam banks have their own conventions: tickets, mixed Cyrillic, Latin, and Turkic scripts, region-specific answer markers. A generic document-to-quiz tool faceplants on them. QuizPilot was shaped against thousands of these real documents, which is a moat that has little to do with the model and everything to do with the years of formats it has already learned to read.

The moat is the system

None of this means QuizPilot avoids AI. It uses it, deliberately and narrowly, for the judgments a model is good at, with deterministic code doing everything around it and checking the result. The honest claim is not that the output is perfect. It is that answers on the deterministic path are traced to a marker in the source rather than inferred, and the model path is bounded and validated rather than trusted on faith.

Put together, the surgical use of AI, the regional formats, and a process that compounds are far harder to copy than any single model call. That is the difference between a wrapper and a system. One ships what the model says. The other is engineered to be right.

Notes

The one-in-three figure is traffic-weighted and measured from production extraction metrics over a recent window. It counts real uploads parsed with no extraction-model call, and is not a lab benchmark.
Primary extraction runs on Gemini 2.5 Flash-Lite, with automatic failover to a second provider, and a separate vision model for scanned pages. Specific parser rules, thresholds, and model prompts are intentionally left out of this piece.
Coverage figures count document formats moved onto the deterministic path, recorded at the time each change shipped.
