Position paper: Issues & principles for AI in learning research

This article is a summary of a scientific position paper (Bauer et al., 2025) that highlights weaknesses in recent studies of AI in education & outlines principles for conducting research that focuses on real learning outcomes. While enthusiasm around these technologies is understandable, much of the current discourse is shaped by hype rather than science. The paper aims to help readers critically navigate the rapidly growing body of research on AI in education, particularly following the rise of generative AI tools, & in doing so to move the conversation away from surface-level excitement & toward a more thoughtful, rigorous understanding of how AI can genuinely support education.

The authors highlight key issues including outdated algorithm evaluations, reliance on subjective or performance-based measures rather than actual learning outcomes, & flawed study designs that limit interpretability. In response, they advocate for a return to rigorous, evidence-based research grounded in the science of learning (SoL), emphasising the need for robust methodologies, appropriate comparisons, & a focus on instructional effectiveness rather than technological novelty. The article outlines principles for research that can meaningfully assess AI’s educational potential while avoiding misleading generalisations & over-hype.

NotebookLM audio summary

If you’d like to hear Google NotebookLM’s AI-generated audio summary of the position paper (Bauer et al., 2025) as a “deep dive” conversation, they’re an acquired taste, but here you are. (Running time: 0:21:06)

Issues with research to date & how it can be misleading

1. Over-hype & neglect of prior research

Current publication trends often reflect a tendency to over-hype recent AI developments. This is sometimes driven by a neglect of previous theoretical & empirical insights from research on AI-enhanced education & digital learning technologies. While discussion papers exploring opportunities & challenges are valuable starting points, research needs to move towards evidence-based studies building on this foundation.

The publication boom following ChatGPT’s launch includes numerous discussion papers, but a critical view that draws on the previous literature is necessary. SoL research, in contrast, focused on systematically establishing insights through rigorous methods, an emphasis that is often overlooked in current trends.

2. Outdated algorithm performance studies

Some initial studies focus on the performance of new AI algorithms themselves. However, due to the rapid pace of AI development, findings from these studies can quickly become outdated, potentially before they are even published through typical peer-review processes.

Research evaluating the performance of the “new generation of algorithms” is therefore liable to become obsolete quickly.

3. Methodological weaknesses compromising validity

Many recent primary studies & research syntheses have been criticised for methodological issues that compromise the interpretability & validity of their findings. Unlike SoL research, which used rigorous methodologies to establish instructional effectiveness, more recent trends sometimes prioritise rapid exploration over methodological robustness.

Studies often fail to adequately acknowledge their limitations, which contributes to the hype. Characteristics that warrant critical scrutiny include the measurement approaches used, the study designs employed, & the inclusion criteria applied in research syntheses.

4. Focus on subjective or performance-based measures over actual learning

Studies often focus on subjective variables like satisfaction or self-assessments to make claims about AI’s usefulness, neglecting actual learning processes & outcomes. Furthermore, some studies use performance improvements during an AI-supported task as an indicator of learning effects. However, performance during support is not sufficient evidence for actual learning, which requires demonstrating performance or knowledge improvements that persist without AI support.

Studies claiming learning benefits based solely on improvements during a writing or programming task while using AI support are misleading, as this does not prove learning has occurred that transfers or persists when the support is removed. Research syntheses must critically assess measurements to avoid confusing learning with performance enhancements.
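To make the distinction concrete, here is a minimal, hypothetical sketch in Python (all group names & numbers are invented for illustration, not drawn from the paper) contrasting an effect measured during an AI-supported task with an effect measured on a later post-test completed without AI support:

```python
# Hypothetical illustration: performance *during* AI support vs. an
# unsupported post-test. All numbers are simulated for demonstration.
import numpy as np

rng = np.random.default_rng(42)
n = 50  # students per group

# Scores on a task completed WITH the AI tool available
with_ai_task = rng.normal(loc=80, scale=8, size=n)
without_ai_task = rng.normal(loc=65, scale=8, size=n)

# Scores on a later post-test completed WITHOUT any AI support
with_ai_posttest = rng.normal(loc=66, scale=8, size=n)
without_ai_posttest = rng.normal(loc=65, scale=8, size=n)

def cohens_d(a, b):
    """Standardised mean difference using a pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

print("Effect during supported task:", round(cohens_d(with_ai_task, without_ai_task), 2))
print("Effect on unsupported post-test:", round(cohens_d(with_ai_posttest, without_ai_posttest), 2))
# A large effect in the first comparison but a negligible one in the second
# would indicate performance enhancement, not durable learning.
```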

5. Problematic study designs & comparisons

The designs of studies significantly influence what conclusions can be drawn. Pre-post designs without comparison conditions cannot attribute changes solely to the AI intervention. “No-intervention” control groups (comparing AI to doing nothing) show only that AI is better than no support, offering limited insight into its instructional quality compared to alternative methods. Designs comparing different interventions can lead to misinterpretations if the purpose (technology comparison vs. instructional method comparison) is unclear. Meta-analyses risk an “apples & oranges” issue by mixing studies with varying interventions & comparison conditions, potentially blurring distinctions & resulting in misleading generalisations about AI effectiveness.

Comparing an AI feedback system to a control group receiving no instruction or support shows only that the AI is better than nothing, not how it compares to human feedback or non-AI computer feedback. Mixing findings from studies comparing AI to no instruction with those comparing AI to teacher-led instruction in a single meta-analysis can lead to broad, misleading claims about AI’s overall effectiveness.
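As a rough illustration of how a synthesis could keep such comparisons apart, the sketch below uses invented study data & simple unweighted averages (a real meta-analysis would use weighted models) to pool effect sizes separately by comparison condition:

```python
# Hypothetical sketch: keeping effect sizes from different comparison
# conditions separate instead of pooling them into one overall estimate.
# Study labels and effect sizes are invented for illustration.
import pandas as pd

studies = pd.DataFrame([
    {"study": "A", "comparison": "no instruction",           "effect_size_d": 0.90},
    {"study": "B", "comparison": "no instruction",           "effect_size_d": 0.75},
    {"study": "C", "comparison": "teacher feedback",         "effect_size_d": 0.15},
    {"study": "D", "comparison": "non-AI computer feedback", "effect_size_d": 0.10},
])

# Pooling everything yields one inflated, hard-to-interpret number...
print("Overall mean d:", studies["effect_size_d"].mean().round(2))

# ...whereas grouping by comparison condition shows what AI is actually
# being compared against, avoiding the "apples & oranges" problem.
print(studies.groupby("comparison")["effect_size_d"].mean().round(2))
```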

Principles of good quality research

Based on the critique of current practices & the emphasis on systematic, evidence-based approaches rooted in SoL research, the authors put forward the following principles for good quality research into AI in education:

1. Prioritise cognitive learning processes & outcomes

Good research should focus on understanding how AI impacts students’ cognitive processes & actual learning outcomes, such as the acquisition of knowledge & development of skills.

Effective learning is driven by optimising cognitive outcomes. While motivation & emotions are linked to cognitive learning, the primary focus for assessing AI’s educational impact should be on deep learning processes such as synthesising, evaluating, & integrating knowledge, & on developing what the researchers identify as “transversal skills”. Focusing on subjective measures like satisfaction, or on mere performance during a supported task, is insufficient.

Research should measure improvements in domain-specific knowledge or transversal skills through post-tests or tasks performed without AI support, rather than relying on student self-reports of learning or performance improvements observed only while the AI tool is being used.

2. Build systematically on existing theoretical & empirical insights

Research should integrate theoretical, empirical, & methodological insights from prior research on digital technologies & SoL. It should avoid neglecting established knowledge & instead build upon it through systematic investigations.

The current “publication boom” sometimes neglects previous theoretical & empirical insights. A systematic, evidence-based approach that learns from past research on instructional effectiveness is crucial for effective AI integration & avoiding potential risks.

Designing AI-enhanced learning opportunities should be informed by learning theories, & the AI-driven cognitive support must align with established principles & undergo rigorous validation. Research designs should reflect lessons learned from previous technology comparison studies.

3. Employ rigorous methodologies & acknowledge limitations

Good research uses robust methodologies that allow for clear interpretation & valid conclusions, & is explicit about the limitations arising from specific study characteristics.

Methodological issues compromise the interpretability & validity of findings. It is crucial to understand what conclusions can & cannot be drawn from a study based on its design & measurements to prevent over-hyping AI capabilities.

Using experimental designs with appropriate comparison groups (e.g. comparing AI to non-AI instruction or different types of AI support), rather than just pre-post designs without controls or comparisons to a “no-instruction” group, is essential for determining the AI’s unique effect. Critically assessing the quality of measurements in primary studies included in meta-analyses is also vital.
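As a minimal sketch of this kind of design, assuming simulated data & the scipy library, an unsupported post-test could be compared across three conditions rather than relying on a single-group pre-post change:

```python
# Hypothetical sketch: comparing an unsupported post-test across three
# conditions (AI feedback, human feedback, no feedback) rather than
# relying on a pre-post design without any comparison group.
# All data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
ai_feedback = rng.normal(72, 10, 60)
human_feedback = rng.normal(70, 10, 60)
no_feedback = rng.normal(62, 10, 60)

f_stat, p_value = stats.f_oneway(ai_feedback, human_feedback, no_feedback)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# A significant omnibus test would then be followed by pairwise contrasts;
# only the AI vs. human-feedback contrast speaks to the AI's unique effect
# beyond "better than nothing".
```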

4. Focus on instructional effectiveness, not just technology presence

The focus should be on how AI enables or delivers effective instruction & supports learning activities, rather than assuming the technology itself inherently leads to learning benefits.

Drawing on earlier debates in educational research, the authors highlight that learning outcomes are influenced by the instruction delivered, not simply the medium (or technology) used. AI serves as a tool that can substitute for existing instruction, augment it with additional support, or redefine tasks to enable new forms of learning. Its effectiveness depends on the quality of its instructional implementation.

Research should investigate how AI provides cognitive support (e.g. through specific types of feedback or scaffolding) or how it transforms learning tasks (e.g. by enabling specific constructive or interactive activities). It should then evaluate the effectiveness of these instructional approaches as mediated by AI, rather than simply comparing AI-present vs. AI-absent conditions without specifying the instructional difference.
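One way to separate instruction from medium is a factorial design that crosses the instructional feature with the delivery medium; the sketch below is a hypothetical illustration (simulated data, statsmodels assumed available), not a design prescribed by the paper:

```python
# Hypothetical sketch: crossing the instructional feature (adaptive vs.
# generic feedback) with the delivery medium (AI vs. human) to test whether
# outcomes track the instruction rather than the technology.
# Data are simulated for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
cells = []
for feedback in ["adaptive", "generic"]:
    for medium in ["AI", "human"]:
        base = 70 if feedback == "adaptive" else 62  # instruction drives the simulated outcome
        scores = rng.normal(base, 8, 40)
        cells.append(pd.DataFrame({"posttest": scores, "feedback": feedback, "medium": medium}))
df = pd.concat(cells, ignore_index=True)

model = smf.ols("posttest ~ C(feedback) * C(medium)", data=df).fit()
print(anova_lm(model, typ=2))
# If the feedback main effect dominates and the medium effect is negligible,
# the benefit lies in the instructional method, not in AI presence per se.
```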

5. Use appropriate comparison conditions to assess AI effects

Research should carefully select comparison conditions to determine the specific type of effect AI has on learning (inversion, substitution, augmentation, or redefinition). The ISAR model distinguishes these effects based on how AI-enhanced learning compares to conditions without AI:

  • Inversion means reduced learning compared to the non-AI condition.
  • Substitution means equivalence to non-AI alternatives.
  • Augmentation means additional support compared to the non-AI condition or to lower-quality support.
  • Redefinition means fostering deeper learning processes not supported by the non-AI condition.

Clear comparisons are necessary to identify when & where AI provides instructional benefits. For example, to demonstrate an augmentation effect, AI-enhanced instruction providing adaptive feedback might be compared to instruction with non-adaptive feedback or no feedback; to show a redefinition effect, an AI-supported learning-by-design task fostering constructive learning would be compared to a non-AI task that only supports passive or active learning. Meta-analyses should differentiate studies based on the nature of the comparison to avoid the “apples & oranges” problem.
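Purely as an illustration of how the ISAR categories map onto comparison results, the hypothetical helper below classifies an AI vs. non-AI comparison; the numeric threshold & the “deeper processes” flag are invented conventions for the sketch, not part of the model itself:

```python
# Hypothetical helper illustrating how ISAR categories map onto the outcome
# of a comparison with a non-AI condition. The equivalence margin and the
# "deeper processes" flag are invented for illustration; the ISAR model
# does not prescribe numeric cut-offs.
def classify_isar(effect_size_d, enables_deeper_processes=False,
                  equivalence_margin=0.10):
    """Classify an AI vs. non-AI comparison into an ISAR category."""
    if effect_size_d < -equivalence_margin:
        return "inversion"      # learning reduced relative to the non-AI condition
    if abs(effect_size_d) <= equivalence_margin:
        return "substitution"   # roughly equivalent to the non-AI alternative
    if enables_deeper_processes:
        return "redefinition"   # fosters processes the non-AI condition did not support
    return "augmentation"       # additional support beyond the non-AI condition

print(classify_isar(-0.30))                                # inversion
print(classify_isar(0.02))                                 # substitution
print(classify_isar(0.45))                                 # augmentation
print(classify_isar(0.45, enables_deeper_processes=True))  # redefinition
```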

By adhering to these principles, research can move “beyond the hype” & systematically investigate how AI can truly enhance cognitive learning, while also identifying & mitigating potential risks like inversion effects. Successful integration also depends on contextual factors like AI literacy, infrastructure, access, & regulations.

Reference

Bauer, E., Greiff, S., Graesser, A. C., Scheiter, K., & Sailer, M. (2025). Looking Beyond the Hype: Understanding the Effects of AI on Learning. Educational Psychology Review, 37(2), 45. https://doi.org/10.1007/s10648-025-10020-8