Designing a Metacognitive Learning Engine

January 2026 · 10 min read · Cognitive Science · EdTech · Adaptive Learning

Confidence calibration, spaced repetition, and misconception detection — a deterministic learning engine powering confidence-calibrated exam prep.

The Question

The evidence base for how humans acquire and retain knowledge is remarkably well-established. Ebbinghaus established the forgetting curve in 1885. Bjork’s work on desirable difficulties has been replicated for decades. Koriat’s research on metacognitive monitoring (the accuracy of people’s confidence judgments about their own knowledge) has produced a substantial literature. Yet the study tools most people use ignore nearly all of it.

The typical exam-prep app is a flashcard deck with a progress bar. It treats all correct answers as equal, all wrong answers as equal, and confidence as irrelevant. It cannot distinguish between a learner who guessed correctly and one who knew the material. It cannot detect that a student is systematically overconfident in one domain and underconfident in another. It has no model of the learner at all.

The question that launched Meridian Labs was deceptively simple: what would a study system look like if it actually used what cognitive science knows about learning?

The Approach

The first design decision was the most consequential: deterministic heuristics over machine learning. Three considerations motivated this choice.

First, interpretability. When a learner asks “why is the system showing me this question?” the answer should be traceable. A spaced repetition interval driven by a transparent formula can be explained; a neural network’s recommendation cannot, at least not to the person using it. For a system designed to build metacognitive awareness, opacity defeats the purpose.

Second, cold-start viability. ML-based adaptive systems need substantial interaction data before they can personalize effectively. A learner preparing for an exam in two weeks cannot wait for the model to converge. The engine needed to make intelligent decisions from the first question.

Third, the research literature already provides the parameters. Butterfield and Metcalfe (2001) demonstrated that high-confidence errors produce stronger hypercorrection effects—when you’re certain you’re right and discover you’re wrong, the correction is more durable than for low-confidence errors. This finding maps directly to a scoring function. A confident-wrong response receives a −5 penalty; a tentative-wrong response receives −2. The asymmetry is not a design whim—it is a translation of empirical findings into arithmetic.

The system penalizes confident errors 2.5× more harshly than tentative errors in its spaced repetition scheduling. The ratio is not an arbitrary tuning knob: it reflects the finding that confident errors are the most dangerous form of ignorance, precisely because the learner has no reason to revisit them.

The confidence matrix produces four outcomes per question: confident-correct (+3), confident-wrong (−5), tentative-correct (+1), and tentative-wrong (−2). This four-outcome model is the atomic unit of the entire engine. Every downstream system (mastery levels, spaced repetition intervals, exam readiness scores, misconception detection) builds on it.
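
To make the arithmetic concrete, here is a minimal sketch of the four-outcome matrix in Python. The names and function signature are illustrative assumptions, not the engine’s actual code:

```python
# Illustrative sketch of the four-outcome confidence matrix.
# Names and signatures are assumptions, not the shipped implementation.
from enum import Enum


class Outcome(Enum):
    CONFIDENT_CORRECT = "confident_correct"
    CONFIDENT_WRONG = "confident_wrong"
    TENTATIVE_CORRECT = "tentative_correct"
    TENTATIVE_WRONG = "tentative_wrong"


# Scores as described above: confident errors are penalized 2.5x
# more harshly than tentative errors (-5 vs. -2), following the
# hypercorrection findings (Butterfield & Metcalfe, 2001).
SCORES = {
    Outcome.CONFIDENT_CORRECT: +3,
    Outcome.CONFIDENT_WRONG: -5,
    Outcome.TENTATIVE_CORRECT: +1,
    Outcome.TENTATIVE_WRONG: -2,
}


def score_response(confident: bool, correct: bool) -> int:
    """Map a (confidence, correctness) pair to its scheduling score."""
    outcome = {
        (True, True): Outcome.CONFIDENT_CORRECT,
        (True, False): Outcome.CONFIDENT_WRONG,
        (False, True): Outcome.TENTATIVE_CORRECT,
        (False, False): Outcome.TENTATIVE_WRONG,
    }[(confident, correct)]
    return SCORES[outcome]
```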

The Architecture

The engine has four interlocking systems, each grounded in a different strand of the cognitive science literature.

1. Six-Level Mastery System. Rather than a binary known/unknown state, the engine tracks mastery across six levels: Novice, Beginner, Intermediate, Proficient, Advanced, and Expert. Transitions require sustained performance, not single correct answers. Demotion is asymmetric: dropping a level is easier than gaining one, reflecting the established finding that skill regression is faster than skill acquisition (Bjork & Bjork, 1992). Each level maps to a spaced repetition interval: Novice items resurface within hours, Expert items after weeks.
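
A minimal sketch of the mastery ladder, assuming example thresholds and intervals (the engine’s actual values are not published here); the asymmetric thresholds make demotion easier than promotion:

```python
# Illustrative mastery ladder with asymmetric promotion/demotion.
# Thresholds and intervals are assumptions chosen for the example.
from datetime import timedelta

LEVELS = ["Novice", "Beginner", "Intermediate", "Proficient", "Advanced", "Expert"]

# Spaced repetition interval per level: Novice items resurface
# within hours, Expert items after weeks.
INTERVALS = {
    "Novice": timedelta(hours=4),
    "Beginner": timedelta(days=1),
    "Intermediate": timedelta(days=3),
    "Proficient": timedelta(days=7),
    "Advanced": timedelta(days=14),
    "Expert": timedelta(days=30),
}

PROMOTE_AT = +6  # sustained positive scores required to climb
DEMOTE_AT = -3   # a smaller deficit is enough to fall


def update_level(level: str, rolling_score: int) -> str:
    """Promote on sustained performance; demote more readily."""
    idx = LEVELS.index(level)
    if rolling_score >= PROMOTE_AT and idx < len(LEVELS) - 1:
        return LEVELS[idx + 1]
    if rolling_score <= DEMOTE_AT and idx > 0:
        return LEVELS[idx - 1]
    return level
```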

2. Four-Factor Exam Readiness. A single “percent correct” metric is nearly useless for predicting exam performance. The engine computes readiness from four weighted factors: mastery distribution (35%), recent accuracy trend (30%), coverage breadth (15%), and confidence calibration accuracy (20%). The combination is not purely linear: high mastery with poor calibration still produces only a moderate readiness score, because the learner doesn’t yet know what they don’t know.
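
A sketch of the readiness blend using the stated weights. The calibration damping term is my assumption, included to illustrate one way the combination can be made non-compensatory:

```python
# Four-factor readiness. Weights are from the text; the damping
# term is an assumption showing how poor calibration can cap the
# score even when mastery is high.
def exam_readiness(mastery: float, trend: float,
                   coverage: float, calibration: float) -> float:
    """All inputs normalized to [0, 1]; returns readiness in [0, 1]."""
    linear = (0.35 * mastery +
              0.30 * trend +
              0.15 * coverage +
              0.20 * calibration)
    # A learner who doesn't know what they don't know can't be
    # fully exam-ready, no matter the raw accuracy.
    damping = 0.5 + 0.5 * calibration
    return linear * damping
```

With strong mastery (0.9), a good trend (0.8), broad coverage (0.9), but weak calibration (0.3), the linear blend of 0.75 is damped to roughly 0.49: a moderate score, as intended.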

3. Misconception Detection. When a learner answers the same question wrong twice with high confidence, the engine flags a misconception: a stable, incorrect mental model that passive review will not correct. Misconception items receive 4× scheduling priority, forcing active confrontation. Resolution requires three consecutive correct answers, not just one. This is grounded in the conceptual change literature: misconceptions are resistant to instruction because they are internally coherent (Vosniadou, 1994). A single correction rarely dislodges them.
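
A sketch of the detection logic as a small state machine. The trigger (two confident errors), the 4× priority, and the three-correct resolution rule come from the text; the data shapes are assumptions:

```python
# Per-item misconception tracking: two high-confidence errors flag
# the item; three consecutive correct answers resolve it.
from dataclasses import dataclass


@dataclass
class ItemState:
    confident_errors: int = 0
    consecutive_correct: int = 0
    misconception: bool = False

    def record(self, confident: bool, correct: bool) -> None:
        if correct:
            self.consecutive_correct += 1
            # Resolution requires three consecutive correct answers.
            if self.misconception and self.consecutive_correct >= 3:
                self.misconception = False
                self.confident_errors = 0
        else:
            self.consecutive_correct = 0
            if confident:
                self.confident_errors += 1
                if self.confident_errors >= 2:
                    self.misconception = True

    def priority(self, base: float = 1.0) -> float:
        # Flagged items receive 4x scheduling priority.
        return base * (4.0 if self.misconception else 1.0)
```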

4. Bloom’s Taxonomy Practice Tests. The engine generates practice exams at four cognitive levels, progressively increasing from recall to application to analysis to evaluation. This is not a difficulty slider; it is a structured progression through Bloom’s taxonomy that mirrors how expert knowledge is actually organized. A learner who can recall a fact but cannot apply it in a novel scenario has not mastered it in any meaningful sense.
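
One plausible way the progression could drive test generation: include every level up to the learner’s current stage, weighted toward the frontier. The allocation below is an illustrative assumption, not the engine’s actual blueprint:

```python
# Four-level practice test blueprint following the article's
# recall -> application -> analysis -> evaluation progression.
BLOOM_LEVELS = ["recall", "application", "analysis", "evaluation"]


def test_blueprint(stage: int, n_questions: int = 20) -> dict[str, int]:
    """Allocate questions across levels up to the current stage,
    weighting the frontier level twice as heavily as review levels."""
    weights = [2 if i == stage else 1 for i in range(stage + 1)]
    total = sum(weights)
    counts = [n_questions * w // total for w in weights]
    counts[stage] += n_questions - sum(counts)  # absorb rounding remainder
    return dict(zip(BLOOM_LEVELS, counts))
```

For example, test_blueprint(2) yields {'recall': 5, 'application': 5, 'analysis': 10}: earlier levels stay in the mix while the frontier level dominates.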

These systems interact. A misconception detected during practice testing adjusts both mastery level and exam readiness. A strong confidence calibration score can partially offset lower raw accuracy in the readiness calculation, because a well-calibrated learner knows where to focus. The engine embeds a theory of learning, not just a set of features.

What I Learned

Building twenty-six production apps using frontier AI tools across the full development lifecycle (content generation, code architecture, testing, deployment, iteration) produced an unintentional research artifact. The process itself became a case study in human-AI collaboration at a scale that is only recently feasible.

The cognitive science grounding is real, but the validation is incomplete. The algorithms are derived from peer-reviewed findings, the architecture makes principled decisions, and the system demonstrably behaves differently from a naive flashcard app. But the controlled experiments have not been run. There is no A/B test comparing this engine against a baseline with matched content and different scheduling algorithms. There is no longitudinal study tracking exam outcomes.

The platform exists. The instrumentation is built. The question is no longer “can cognitive science principles be translated into software?” but “do they produce measurably better learning outcomes when they are?” That experiment remains to be run.

The most valuable output may not be the engine itself but the experimental platform it created. Twenty-six apps across four certification domains, with 55,000+ questions instrumented for confidence tracking, mastery progression, and misconception detection. A dataset waiting for a hypothesis.

Orchestrating large language models across research, content generation, architecture, and testing required a kind of metacognitive discipline: constantly evaluating what the AI could handle reliably and what required human judgment. Building a system designed to develop metacognitive skills in learners demanded developing those same skills in the builder.

Paper

Perry, A. C. (2026). Confidence-Calibrated Adaptive Learning: An Integrated Adaptive Engine for Professional Exam Preparation. Submitted to AIED 2026 Late Breaking Results (Seoul, South Korea).

References

  1. Ebbinghaus, H. (1885). Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie [Memory: A contribution to experimental psychology]. Leipzig: Duncker & Humblot.
  2. Bjork, R. A., & Bjork, E. L. (1992). A new theory of disuse and an old theory of stimulus fluctuation. In A. Healy et al. (Eds.), From learning processes to cognitive processes: Essays in honor of William K. Estes (Vol. 2, pp. 35–67). Erlbaum.
  3. Koriat, A. (1997). Monitoring one’s own knowledge during study: A cue-utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126(4), 349–370.
  4. Butterfield, B., & Metcalfe, J. (2001). Errors committed with high confidence are hypercorrected. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(6), 1491–1494.
  5. Vosniadou, S. (1994). Capturing and modeling the process of conceptual change. Learning and Instruction, 4(1), 45–69.
  6. Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380.
  7. Meichenbaum, D. (1985). Stress Inoculation Training. Pergamon Press.
  8. Bloom, B. S. (Ed.). (1956). Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I: Cognitive Domain. David McKay.