Report: The cognitive & pedagogical implications of generative AI in education

Introduction

This report synthesises findings from recent research on the application & impact of Large Language Models (LLMs) in educational settings. The evidence indicates that while LLMs present opportunities for efficiency, their integration poses significant cognitive hazards for learners & complex pedagogical challenges for educators. Neurocognitive studies reveal that using LLMs for tasks like essay writing measurably reduces brain connectivity, impairs memory recall, & diminishes a student’s sense of ownership over their work, a phenomenon termed “cognitive debt”.

Fundamentally, LLMs are statistical text predictors, not reasoning engines. This core design leads to critical limitations: they are prone to “hallucinating” plausible but false information, propagating biases from their training data, & failing simple logic, spatial-reasoning, & common-sense problems that humans find trivial. A benchmark of popular models like GPT-4 Turbo & Claude 3 Opus showed average scores below 40% on such tasks.

In practice, these limitations manifest as educational hazards. LLMs can generate instructional materials containing scientific misconceptions, provide flawed feedback on student work, & mislead learners with incorrect tutoring. Efforts to build sophisticated tools like nationwide AI tutors, such as Estonia’s “AI Leap” initiative, highlight a profound gap between high-level learning science & the dialogue-level, turn-by-turn guidance required for effective AI behaviour. Furthermore, LLMs cannot replace even minimally competent human teaching assistants: they lack the essential human capacities for perception, professional judgment, & relationship-building; they cannot learn from error or experience; & they cannot be held accountable. Effective use of LLMs requires significant pre-existing knowledge from the user, underscoring that AI cannot be a substitute for foundational learning. The consensus across the source papers is a call for deliberate, sceptical integration of AI, emphasising human-in-the-loop oversight & a continued focus on carefully & purposefully building coherent, cohesive student knowledge.

The cognitive impact of LLM use on learners

Research demonstrates a measurable impact on learners’ cognitive processes & outcomes when using LLMs. Studies employing electroencephalography (EEG) & behavioural analysis reveal a pattern of reduced cognitive engagement & impaired skill retention.

Evidence from neurocognitive studies

A multi-session study by Kosmyna et al. (2025) on essay writing, the much-publicised “cognitive debt” pre-print, provides strong evidence of the neurological effects of LLM assistance. Participants were divided into three groups: LLM-assisted, Search Engine-assisted, & Brain-only.

  • Reduced neural connectivity: EEG analysis showed that brain connectivity systematically scaled down with the amount of external support. The Brain-only group exhibited the strongest & most widespread neural networks, the Search Engine group showed intermediate engagement, & the LLM-assisted group showed the weakest overall coupling.
  • Cognitive offloading: When participants who had previously used an LLM were asked to write without it (the “LLM-to-Brain” group), they showed weaker neural connectivity & an under-engagement of alpha & beta networks compared to those who had practised without tools. This suggests that prior LLM use may diminish the brain’s recruitment of networks for content planning & generation.
  • Increased load on integration: Conversely, when participants who had previously worked without tools were given an LLM (“Brain-to-LLM” group), they demonstrated a network-wide spike in connectivity across all frequency bands. This indicates that integrating AI output into an existing, self-generated workflow imposes a high cognitive load.

Observed behavioural effects & “cognitive debt”

The neural patterns correlate with significant behavioural differences, pointing to what Kosmyna et al. term an “accumulation of cognitive debt,” where short-term convenience leads to long-term deficits in cognitive skill.

  • Impaired memory & recall: The LLM group demonstrated a severely diminished ability to recall & quote from essays they had just written. In the first session, 83.3% of the LLM group could not provide a correct quotation, a failure rate that persisted across sessions. In contrast, both the Search Engine & Brain-only groups achieved near-perfect quoting ability by the second session.
  • Diminished sense of ownership: The LLM group reported a fragmented & conflicted sense of authorship. While the Brain-only group claimed nearly unanimous full ownership, LLM users often reported partial ownership (e.g. “50/50”) or no ownership at all.
  • Metacognitive laziness: Aru & Laak (2025) note that standard LLM tools can “offload students’ thinking, reduce mental effort, encourage metacognitive laziness, & create an illusion of learning.” This is echoed by Riley & Bruno (2024) who state that when students rely on chatbots, they “can miss opportunities to learn how to think critically, to organise ideas, & to consider alternate viewpoints.”

Fundamental limitations of large language models

The educational hazards of LLMs stem directly from their core design. They are not repositories of knowledge or reasoning machines but sophisticated pattern-matching systems trained to predict the next most probable morpheme (a fragment of a word, which is the smallest unit of meaning) in a sequence.

Core functionality: Statistical text prediction

As explained by Riley & Bruno (2024), LLMs are statistical models that answer the question: “According to your model of the statistics of human language, what morphemes are likely to come next?” The toy sketch after the list below makes this prediction loop concrete.

  • Not search engines: LLMs do not store or search their training data. They store statistical relationships between text tokens, which are units of their vocabulary. This means they cannot verify information or provide sources in the way a search engine does.
  • Imitation of intelligence: While interactions feel conversational, LLMs do not “know” or “understand”. It is more accurate to think of them as “role-playing entities that imitate intelligence” (much like what Alan Turing envisaged for his “Turing Test”). Their goal is to provide plausible, helpful-sounding responses, not necessarily responses that are true, genuinely helpful, or accurate.
  • Hallucinations: LLMs are known to “hallucinate,” generating text that sounds plausible but is factually incorrect. This is a direct result of their predictive function; if a false statement is statistically likely, the model can generate it with confidence.
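
To make the prediction loop concrete, here is a toy next-token predictor built from bigram counts. It is a drastic simplification, not how a production LLM is implemented (real models use neural networks over vast contexts & subword tokens), & every name in it is illustrative:

```python
from collections import Counter, defaultdict

# Toy "language model": bigram counts over a tiny corpus. Real LLMs learn
# billions of parameters, but the principle is the same: given the context,
# emit a statistically probable next token.
corpus = "the cat sat on the mat the cat ate the fish".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent continuation of `token` in the corpus."""
    seen = bigrams.get(token)
    return seen.most_common(1)[0][0] if seen else "<unknown>"

print(predict_next("the"))  # -> "cat" ("the" is followed by "cat" twice)
print(predict_next("cat"))  # -> "sat" (a tie with "ate", broken by first occurrence)
```

Nothing in this loop verifies truth: if a false continuation were statistically dominant in the training text, the predictor would emit it just as confidently, which is exactly the hallucination mechanism described above.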

Benchmark performance on simple tasks

A study by Williams & Huckle (2024) created a “Linguistic Benchmark” of 30 questions designed to be easy for adult humans but challenging for LLMs, testing domains like logic, spatial reasoning, & common sense. The results revealed significant performance gaps.

Model             Average Score   95% Confidence Interval
GPT-4 Turbo       38%             [23%, 55%]
Claude 3 Opus     35%             [21%, 52%]
Gemini 1.5 Pro    30%             [15%, 45%]
Mistral Large     28%             [15%, 42%]
Llama 3 70B       27%             [14%, 43%]
Mistral 8x22B     20%             [7%, 34%]
Gemini 1.0 Pro    16%             [6%, 29%]

Source: Williams & Huckle (2024)
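
The wide intervals reflect the small sample: with only 30 questions, the uncertainty around any average score is large, which is why the ranges above overlap heavily. As a minimal sketch, such an interval can be estimated by percentile bootstrap over per-question marks (an assumed technique for illustration; Williams & Huckle’s exact procedure may differ):

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the per-question marks with replacement, many times over.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    return means[int(n_resamples * alpha / 2)], means[int(n_resamples * (1 - alpha / 2)) - 1]

# Hypothetical per-question marks for a 30-item benchmark (1 = correct).
scores = [1] * 11 + [0] * 19          # 11/30, i.e. ~37% average
print(bootstrap_ci(scores))           # a wide interval, roughly (0.20, 0.53)
```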

The study identified common failure modes, including:

  • Overfitting: Models defaulted to the answers for well-known online versions of logic puzzles rather than solving the modified, simpler versions presented.
  • Lack of spatial intelligence: LLMs failed simple navigation questions (e.g. “If you’re in London facing west, Edinburgh would be to your left…”).
  • Incorrect mathematical reasoning: Models made basic counting errors (e.g. finding 5 ‘L’s in ‘LOLLAPALOOZA’, which contains only four), as the one-liner below confirms.
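
A trivial check of the correct count, runnable in any Python interpreter:

```python
# The benchmark word contains four 'L's, not the five some models reported.
print("LOLLAPALOOZA".count("L"))  # -> 4
```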

Biased & static training

The knowledge base of an LLM is limited & shaped by its training data, which has several critical flaws.

  • Biased data sample: The data is predominantly in English & reflects a USA-centric perspective, meaning LLMs are exposed to a biased sample of cultural practices & values, including strong historical biases related to gender, race, religion, ethnicity, & attitudes towards political & economic systems. This can make them less relevant or even inaccurate for culturally & linguistically diverse students.
  • Prevalence of misconceptions: LLMs may reproduce common misconceptions if they are prevalent online. This includes pedagogical myths like Learning Styles or scientific inaccuracies, pseudoscience, & conspiracy theories.
  • Static knowledge: LLMs are not continuously learning from user interactions. Their capabilities are almost entirely derived from their initial training, meaning they do not adapt to the specific needs of an individual student or cohort of students over time.

Educational hazards & application challenges

The cognitive & technical limitations of LLMs translate into significant practical challenges & risks when applied in educational contexts, from tutoring systems to classroom instruction.

The AI tutor dilemma: the Estonian “AI leap” project

The “AI Leap” initiative in Estonia, aiming to provide a bespoke AI tutor to every high school student, serves as a case study for the profound challenges in this domain (Aru & Laak, 2025).

  • The translation gap: A central challenge is the “translation gap” between the rich body of theory in learning sciences & the practical, dialogue-level guidance needed for an LLM. Academic research focuses on high-level constructs, whereas LLMs require instructions for “turn-by-turn, dialogue-level decisions.”
  • Complexity of student needs: Effective tutoring requires modelling a complex set of student needs, including motivational, emotional, & cognitive states, prior knowledge, & executive functioning. Current student models often oversimplify these needs to easily measurable metrics like test scores (the sketch after this list illustrates the contrast).
  • Redefining success: The AI Leap team argues that success should not be measured by better test scores but by a sustained positive effect on learning-related beliefs, attitudes, skills, & key educational outcomes. Achieving this requires a level of pedagogical sophistication that current LLMs do not possess by default.
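
As a purely illustrative sketch (hypothetical field & function names, not the actual AI Leap design), a student model that goes beyond test scores, together with the kind of turn-by-turn policy an LLM tutor would need, might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class StudentModel:
    """Illustrative multi-dimensional student model (hypothetical fields).

    Deployed systems often reduce this to test scores alone; the other
    dimensions are harder to measure but drive good tutoring decisions.
    """
    test_scores: dict[str, float] = field(default_factory=dict)      # easily measured
    prior_knowledge: dict[str, float] = field(default_factory=dict)  # concept mastery estimates
    motivation: float = 0.5          # motivational state
    frustration: float = 0.0         # emotional state inferred from dialogue
    executive_function: float = 0.5  # planning & self-regulation estimate

def next_tutor_move(s: StudentModel) -> str:
    """Toy dialogue-level policy: the turn-by-turn decisions an LLM tutor needs."""
    if s.frustration > 0.7:
        return "acknowledge the difficulty & offer a worked example"
    if s.motivation < 0.3:
        return "connect the task to the student's stated interests"
    return "ask a probing question rather than giving the answer"
```

Bridging the translation gap means deriving rules like those in next_tutor_move from learning-science findings, guidance the literature rarely states at dialogue level.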

Instructional & assessment risks

Riley & Bruno (2024) outline several key hazards for educators using LLMs for instruction:

  • Lesson planning: May generate ineffective lesson sequences or plans based on low-quality online materials.
  • Tutoring: Can provide factually incorrect answers or make computational errors, confusing students.
  • Grading/feedback: May focus on superficial aspects like grammar over conceptual understanding & can miscalculate scores.
  • Instructional materials: Can create content based on debunked theories (e.g. Learning Styles) or scientific misinformation.
  • Writing instruction: Using LLMs for writing can prevent students from engaging in the effortful thinking that the writing process is designed to cultivate.
  • Administration: May generate content (e.g. job descriptions, observation notes) that does not align with strategic goals or legal requirements.

The irreplaceability of human educators

It is also widely & justifiably argued that LLMs are fundamentally incapable of performing the core functions of a human teacher, even a minimally competent one. Education is described as “relational, embodied, contextual, & stubbornly human.”

  • Perception: Human teachers & teaching assistants can notice a student’s confusion through body language, tone, or silence. An LLM has no comparable perceptual access to the classroom.
  • Judgement: Human teachers & teaching assistants exercise professional judgment shaped by experience, ethics, & context. An LLM offers probability-weighted text without understanding the difference between a struggling & a disengaged student.
  • Responsibility: Human teachers & teaching assistants are accountable for mistakes & can correct them. An LLM is “structurally irresponsible” & cannot be held accountable for its hallucinations or biases.
  • Classroom management: Human teachers & teaching assistants are a physical presence who can redirect behaviour, support groups, & ensure safety. An LLM can write a behaviour plan but cannot implement one.

Strategic considerations & recommendations

The expression “technological solutionism” refers to the belief that technology such as AI provides comprehensive, automatic solutions to complex educational challenges, a belief often driven by techno-optimism & hype rather than by deliberate decisions grounded in learning & pedagogy. The following strategic principles for integrating AI in education emphasise human cognition & deliberate implementation over technological solutionism, & they reject the oversimplification that AI supplants the need for educators to teach foundational knowledge.

The primacy of human knowledge

A central theme across the source papers & report is that knowledge cannot be outsourced to AI. Cognitive science shows that humans need to build a broad base of knowledge to learn new ideas & think critically.

  • Knowledge as a prerequisite: As Riley & Bruno (2024) state, “effective use of LLMs requires the user to possess existing background knowledge & expertise.” Students who lack this knowledge will have their ability to use the technology “severely limited.”
  • Avoiding skill obsolescence fallacies: Education leaders are warned against oversimplifications like, “If AI can do this, we don’t need to teach it any more.” The purpose of many assignments, such as writing, is the cognitive process itself, not the final product.

The necessity of human oversight & scepticism

Given the inherent unreliability of LLMs, human oversight is non-negotiable.

  • Human-in-the-loop: Williams & Huckle (2024) stress the need for a “human-in-the-loop for enterprise applications”, a principle that applies directly to education. Educators must fact-check any AI-generated materials & monitor student interactions; a minimal illustration follows this list.
  • Educator responsibility: Administrators should emphasise that “educators are responsible for the validity & usefulness of the materials they choose to use”; those who mandate AI tools should be held similarly responsible (Riley & Bruno, 2024).
  • Scepticism of future claims: Educators should be sceptical of speculative claims about AI’s future capabilities, such as the achievement of “artificial general intelligence.” Decisions should be based on how the technology currently functions, not on predictions of what it might become.
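
As a minimal illustration of the human-in-the-loop principle (hypothetical names, not drawn from the source papers), a materials pipeline can simply refuse to release anything a qualified educator has not signed off:

```python
from dataclasses import dataclass

@dataclass
class Material:
    """AI-generated teaching material awaiting human review (illustrative)."""
    content: str
    reviewed_by: str | None = None  # educator who fact-checked it, if anyone

def release_to_students(material: Material) -> str:
    """Human-in-the-loop gate: unreviewed AI output never reaches learners."""
    if material.reviewed_by is None:
        raise PermissionError("AI-generated material requires educator sign-off")
    return material.content
```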

A call for deliberate integration

Finally, education systems should not be swayed by techno-optimism.

  • Start with educational goals: Aru & Laak (2025) argue that the goal is not to “simply bring AI tutors into the classrooms but to first agree on what the learning process is &, based on that knowledge, how to best enhance students’ well-being & learning.”
  • Collaborate or develop: To ensure tools meet pedagogical needs, education systems should “either collaborate with technology companies… or invest in developing their own solutions.” Vanilla, off-the-shelf LLMs are not optimised for learning.

References