Efficacy

Why AvtoPilot's accuracy falls as learners get better

For the first ten active days on AvtoPilot, raw accuracy, first-exposure accuracy, and almost every in-app number drift downward. That is not learners getting worse. It is the system resurfacing what they are about to forget, and the one yardstick it cannot bend climbs.

The Pilot AI teamJune 202611 min read

Key takeaways

Over the first ten active days, raw practice accuracy falls from 84 percent to 78 percent, and first-exposure accuracy, measured only on never-before-seen questions, falls from 81 percent to 73 percent. The material a committed learner faces is getting harder, not their understanding getting worse.
Down is the method working. FSRS spaced repetition deliberately resurfaces the items a learner is about to forget, and the easy, common questions get exhausted early, so what survives is harder by construction. You cannot measure learning by the difficulty of what you are surviving.
The clean yardstick is the fixed mock exam: twenty questions drawn at random from the whole 1,220-question bank, scored the same way every time and never adaptively hardened. There, average score rises from 14.0 to 15.7 of 20 and the pass rate from 20 percent to 31 percent across ten attempts.
The paired result holds under the strictest cut. Among the 1,416 learners who took at least five mocks, average score went from 14.01 to 14.72 of 20 and the share scoring eighteen or better from 19.8 percent to 26.1 percent, a gain of 6.3 points, with no survivorship in that comparison.
What this is not: observational data with no control group, on our own low-stakes practice mocks rather than the government exam, over about five months of a young product. We can say learners who practiced improved on our measures. We do not claim AvtoPilot caused it.

It is tempting to think a learning app should make its numbers go up. Open AvtoPilot, practice for ten days, and the most obvious one, the share of questions you get right, goes down instead. On the first active day the average committed learner answers 84 percent correctly. By the tenth, 78 percent. First-exposure accuracy, measured only on questions a learner has never seen before, falls further, from 81 percent to 73. Almost everything in the app that looks like progress drifts the wrong way.

AvtoPilot prepares people for the Uzbek government driving-theory exam: twenty questions, two mistakes allowed, so eighteen correct is a pass. Over about five months, from early February to late June 2026, more than 63,000 learners signed up, and the ones who practiced answered more than 3.3 million questions, sat more than 33,000 full mock exams, and put in more than 16,000 hours of study. About nineteen in twenty study in Uzbek. That is a large enough record to see something real, and the first thing it shows is a decline.

The decline is not a bug, and it is not learners getting worse. It is the single most important thing to understand about reading these numbers. The questions a committed learner faces on day ten are harder than the ones they faced on day one, because the system makes them harder on purpose. You cannot measure learning by the difficulty of what you are surviving.

This piece is about that paradox, and about the one measurement we trust to cut through it: a fixed mock exam that never hardens. We will walk through what goes down, why down is the system working, what the clean yardstick shows instead, and, just as carefully, what these numbers do not prove.

The number that should go up goes down

Start with raw accuracy, the plain share of practice questions a learner answers correctly. Take the people who stayed, the 420 learners who practiced on at least ten separate days, and line up their accuracy by active day. It reads 84, 85, 84, 81, 82, 81, 80, 79, 79, 78. A small bump on day two, then a steady slide of six points.

Now strip out memory. Look only at questions a learner is seeing for the very first time, where no answer can be a remembered one. Among the 309 learners in that group with enough first-exposure data, the same shape appears, lower and steeper: 81 on day one, then 76, 76, 73, 74, 73, 71, 72, 71, and 73 on day ten. By either measure, the longer a committed learner studies, the lower the percentage on the screen.

We could leave this chart out. Most products would. We are starting with it, because the temptation in our position is to reach for a flattering story, and the honest version is the more useful one. On the metric a learner sees most, the most dedicated cohort gets worse for ten straight days. Then we looked at what the system was actually feeding them, and the worry inverted.

Accuracy on all questionsAccuracy on first-seen questions

Percent of answers correct

Raw practice accuracy over the first ten active days falls from 84 percent to 78 percent, and first-exposure accuracy from 81 percent to 73 percent. The finding is that in-app accuracy declines as a committed learner keeps going, because the questions are getting harder, not the learner worse.

Why down is the system working

AvtoPilot schedules practice with FSRS, a modern spaced-repetition algorithm. Its whole job is to estimate, for every question, the moment you are about to forget the answer, and to put that item back in front of you just before you do. A question you got right a few days ago and have not seen since is exactly what the scheduler serves you today, at the point it judges your memory has decayed to the edge. The items you have comfortably mastered drop out of rotation. The ones on the verge of slipping away come back.

A second force pushes the same way. The bank holds 1,220 questions, and they are not equally hard. The easy, common ones get answered, learned, and retired early. What remains for a learner ten days in is the long tail: the rare signs, the edge-case right-of-way rules, the near-identical pairs that trip everyone. The pool is not standing still while the learner improves against it. The pool itself is getting harder.

Put those together and the falling line means close to the opposite of what it looks like. A learner holding at 78 percent on day ten is doing it against deliberately harder material than the learner at 84 percent on day one. This is what learning researchers call desirable difficulty: the productive struggle of retrieving something you almost forgot is the very thing that moves it into long-term memory. A queue that kept your accuracy pinned near perfect would be a queue that had stopped teaching you anything.

That is the trap with accuracy as a yardstick. It moves for two reasons that point in opposite directions, how much you have learned and how hard the current material is, and under spaced repetition the second effect swamps the first. You cannot measure learning by the difficulty of what you are surviving. Which is why, to measure learning at all, we stop looking at accuracy and look somewhere else.

The one yardstick that cannot harden

If practice difficulty drifts, you need a measurement that does not. AvtoPilot has one built in. The mock exam is the same shape as the real thing every time: twenty questions, a fresh randomized draw from the whole 1,220-question bank, two mistakes allowed, scored out of twenty. It is not adaptive. It does not resurface your weak items or retire your strong ones. Each mock is an unbiased sample of the entire body of material, which makes a learner's first attempt directly comparable to their tenth. When the difficulty is held flat, any movement in the score is movement in the learner, not in the test.

Take the 1,416 learners who completed at least five mocks and average their score by attempt number. It rises: 14.0, 14.3, 14.5, 14.6, 14.7, 14.9, 15.2, 15.3, 15.4, 15.7 out of 20. The pass rate, the share scoring eighteen or better, climbs with it: 20, 22, 21, 25, 26, 27, 29, 31, 32, 31 percent. The same people posting a lower percentage in adaptive practice are posting a higher score on the fixed test.

This is the same population, the same app, the same five months. The only thing that changed is the ruler. Stop measuring people by the difficulty of what they are surviving, measure them against a fixed, representative test, and the decline turns into a climb.

Average mock exam scorePass mark, 18 of 20

Average score out of 20

Average mock-exam score rises from 14.0 to 15.7 of 20 and the pass rate from 20 percent to 31 percent across ten attempts, among the 1,416 learners with at least five mocks. On a fixed test that never hardens, the same people whose practice accuracy fell improve steadily.

The result that survives the hardest cut

The ten-attempt curve has a catch we will not bury. Not everyone takes ten mocks. The first attempt is averaged over all 1,416 learners, but by the tenth only 546 remain, and the people who take ten mocks are not a random sample of the people who take one. Some of that rising line is the most committed learners selecting themselves into the later attempts.

So we ran it the strict way. Every one of the 1,416 learners completed at least five mocks, so we can set each person's first mock beside their fifth with no one dropping out of the comparison. On that paired, same-people basis, average score rises from 14.01 to 14.72 out of 20, a gain of 0.71. The pass rate rises from 19.8 percent to 26.1 percent, a gain of 6.3 points.

That is the most conservative cut we can make, the same learners measured from their first fixed test to their fifth, and the result holds. The full ten-attempt curve points further in the same direction, to 15.7 of 20 and a 31 percent pass rate, and we report it as directional precisely because its later points lean on that shrinking, self-selected group.

+6.3 ptsRise in mock-exam pass rate from each learner's first exam to their fifth, same 1,416 learners, no survivorship

Speed and memory point the same way

Two more measurements line up behind the mock-exam climb, and neither is the accuracy percentage. The first is speed. Across the ten-day cohort, time per question falls from about 19.3 seconds on day one to about 15.8 on day ten, roughly 18 percent faster. That happens while the questions are getting harder, not easier, which rules out the easy explanation that people are just racing through simpler items. Faster on harder material is recognition becoming automatic.

The second is memory, measured on its own terms. FSRS does not only schedule reviews, it predicts the probability a learner will recall each item when it comes due. For the 1,187 learners with at least 50 reviews and parameters fitted to their own history, the model predicted 88.6 percent retention and observed 87.7 percent. The forecast lands within about one point of reality, which tells us two things. The scheduling is well calibrated, and learners are actually holding on to roughly 88 percent of the material that falls due, rather than cramming it and dumping it.

Neither of these is the headline. They are corroboration. A learner getting faster on harder questions while retaining nearly nine in ten due items is consistent with the mock scores going up, and inconsistent with the falling practice accuracy meaning what it appears to mean.

What these numbers do not prove

We are going to be precise about the limits, because the honest version of this story is more useful than the flattering one.

First, this is observational, not an experiment. We have no control group, no learners randomly assigned to study some other way. So we can say that learners who practiced on AvtoPilot improved on our measures. We cannot say AvtoPilot caused the improvement rather than the motivation that brought those learners back, or the simple effect of seeing more questions over time. Causation is not something this data can carry.

Second, survivorship runs through all of it. Of roughly 34,000 people who practiced at all, 672 reached five active days, 420 reached ten, and 281 reached fourteen. The ten-day cohort is a committed minority, about 420 out of 34,000, and committed people tend to be the ones improving. The paired first-to-fifth comparison is built to remove the worst of this, but it cannot change the fact that we are mostly describing people who chose to keep going.

Third, and most important, the pass rate here is not the government pass rate. It is the share of our own low-stakes practice mocks scored eighteen or better, a threshold we set to mirror the real one. We make no claim about how these learners do at the testing center. We do not measure that, and we will not imply a number we do not have.

Fourth, the mocks draw from the same bank learners practice on, so part of any gain is plain familiarity with the questions rather than deeper understanding of the rules. We randomize the draw and shuffle the answer order to blunt this, but we cannot claim to have removed it entirely.

Fifth, this is about five months of a young product. Five months is enough to see a pattern. It is not enough to call it permanent.

What we believe the data shows

Strip it down to what we are willing to stand behind. In AvtoPilot's adaptive practice, accuracy falls over the first ten active days because the system is deliberately serving harder material, and that decline is the method working as intended, not learners getting worse. On a fixed mock exam that is not hardened this way, the same learners improve: 14.0 to 15.7 of 20 across ten attempts, and a 6.3 point rise in pass rate from each learner's first mock to their fifth, with no survivorship in that comparison. Speed and calibrated retention move the same direction.

What we believe, stated carefully, is this. Learners who keep practicing on AvtoPilot get measurably better at our fixed, exam-format mock test, the one ruler we did not let the system bend, and two independent measures agree with it. We are not claiming we caused it. We are not claiming anything about the testing center. We are not promising any learner a result.

The general lesson is the one we opened with. When a learning app's numbers go down, the first question is not whether the learners are failing. It is whether the app made the questions harder. Here it did, on purpose, and the people who stayed got better anyway. Build the measurement that cannot flatter you, hold it still, and watch that one. When we did, the number went up.

Notes

All figures cover the window from 6 February to 29 June 2026, about five months, and are measured from production usage of AvtoPilot rather than a lab benchmark. Over that window: more than 63,000 registered learners, more than 3.3 million practice answers, more than 33,000 completed twenty-question mock exams, and more than 16,000 study hours. The busiest single day, 9 April 2026, saw 174,789 practice answers, with usage concentrated in the 8 to 10pm Tashkent evening.
The day-by-day accuracy figures are averaged over the 420 learners who practiced on at least ten distinct active days. First-exposure accuracy counts only questions a learner had never seen before, over the 309 learners in that cohort with enough first-exposure data. The retention funnel: 34,250 learners practiced on at least one day, 672 reached five active days, 420 reached ten, and 281 reached fourteen.
The per-attempt mock averages cover the 1,416 learners who completed at least five mock exams. The sample is complete through the fifth attempt and shrinks thereafter, from 1,416 learners at the first attempt to 546 at the tenth, as fewer learners take that many exams. We therefore treat the paired first-to-fifth comparison as the headline and the full ten-attempt curve as directional. Each mock is a fresh randomized draw of twenty questions from the full 1,220-question bank, scored against the exam's threshold of eighteen of twenty.
The paired result compares each of the 1,416 learners' first mock to their fifth, with no within-window dropout: average score 14.01 to 14.72 of 20, pass rate 19.8 to 26.1 percent. Pass means scoring eighteen or more of twenty on our practice mock. It is not, and should not be read as, a pass rate on the government driving-theory exam, for which we hold no outcome data.
Calibrated retention is reported for the 1,187 learners with at least 50 reviews and FSRS parameters optimized to their own history: a predicted retention of 88.6 percent against an observed 87.7 percent, an error within about one point on material that came due. Speed is mean time per question for the ten-day cohort, about 19.3 seconds on day one and 15.8 on day ten.
The bank holds 1,220 questions, each available in Uzbek Latin, Uzbek Cyrillic, and Russian. Of the 59,807 learners who set a language preference, 93.8 percent chose Uzbek Latin, 5.4 percent Russian, and 0.8 percent Uzbek Cyrillic, so about nineteen in twenty study in Uzbek. None of these figures come from a controlled trial. They describe learners who chose to use AvtoPilot, and cannot separate the app's effect from the effect of studying at all.

Keep reading

AvtoPilotAvtoPilot is a trilingual study app that prepares learners in Uzbekistan for the government driving-theory exam with adaptive, spaced-repetition practice and full mock exams in Uzbek and Russian.

Back to research