This blog was written by Ketan, PhD candidate, University of Massachusetts Amherst; Melissa Goodnight, Assistant Professor, University of Illinois at Urbana-Champaign; and Stephen Sireci, Distinguished University Professor, University of Massachusetts Amherst.
There is a broad consensus in global education that we are facing a learning crisis. Yet in 2024, the UNESCO Institute for Statistics (UIS) reported that nearly half of all countries still do not systematically assess learning outcomes, leaving the progress of roughly 680 million children unmeasured. The outlook is even bleaker for Sustainable Development Goal (SDG) 4.1.1a, which focuses on foundational literacy and numeracy in the early grades. Between 2018 and 2022, only 34 countries reported data at the Grade 2/3 level, compared with 98 at the end of primary school.
Assessing younger learners poses unique technical challenges, particularly in reading across diverse languages. At the same time, large-scale assessments, whether international or national, vary widely in standards, definitions and procedures, reflecting different priorities and limiting the global comparability of their data.
Beyond technical criteria in assessment design
In response to this problem, UIS and its partners developed criteria to help countries produce data aligned with SDG 4.1.1. The seven technical criteria cover key aspects of assessment design and implementation: content alignment, quality of test items, sampling, standardised administration, reliability, linking procedures, and maintaining standards over time. These are supported by practical guidance and documentation from the Assessment for Minimum Proficiency Level-a (AMPL-a) programme.
Alongside these technical criteria, UIS also offers general considerations to guide assessment planning and processes. Among them, one of the most important is “utility to the country,” which states that assessments should add value beyond global reporting by supporting policy dialogue, informing programme design, and strengthening national capacity. However, UIS does not require documentation of these general considerations for an assessment to count toward SDG reporting.
Especially now, with tighter development budgets, large-scale assessments must offer more than comparable statistics. Their real utility lies in pairing reliable global reporting with concrete support for national policies and programmes that improve learning.
Validity extends from measurement to meaningful consequences
Prioritising utility in assessment design is integral to the concept of validity. Samuel Messick defined validity as an integrated judgment about how well evidence and theory support the inferences and actions drawn from test scores. In other words, validity is not only about the technical quality of a test, but also about whether the interpretations and actions based on its results are appropriate and lead to meaningful outcomes.
The Standards for Educational and Psychological Testing (2014) outline five sources of validity evidence:
- Content: how well the assessment reflects the skills or knowledge it is meant to measure
- Response processes: whether test-takers think about and engage with items as intended
- Internal structure: how related items work together to reflect what is being measured
- Relations to other variables: whether scores connect with other measures in expected ways
- Consequences of testing: whether using the scores leads to intended benefits without causing harm
In practice, the first four sources tend to receive more emphasis, while consequences are often overlooked. This imbalance is reflected in the SDG assessment criteria, where consequences, including utility to the country, are acknowledged but not required.
A comprehensive view of assessment validity extends beyond the test instrument itself to consider the full sequence from measurement to impact. A test measures a construct, produces scores, and supports inferences about learners. These inferences guide interpretations, which inform decisions and actions, leading to both intended and unintended consequences. Within this framework, the following broad stages of validation emerge:
- Accuracy of measurement: evidence that the test measures the intended construct accurately and reliably
- Appropriateness of interpretation: verification that interpretations and the decisions that follow from them are appropriate and supported
- Utility of action: evidence that assessment-driven actions lead to positive outcomes that outweigh negative effects
Validity evidence is therefore needed across all these stages to ensure assessments not only function technically but also inform sound decisions with meaningful consequences. When assessments are led primarily by international agencies and experts without sufficient care for local culture, capacity and context, the result can be unintended negative effects such as stakeholder disengagement and limited uptake of results.
Designing for consequences through a theory of action
As the 2030 SDG deadline approaches, assessments must be judged not only by their technical quality but also by the consequences they are designed to produce. Large-scale assessments are difficult to modify once launched, as maintaining comparability across cycles limits the scope for change. While not all consequences can be anticipated at the outset, considering potential benefits and risks during design creates space for them to be addressed; otherwise, consequences unfold by default rather than by design.
One practical way to integrate consequences into assessment design is through a theory of action (ToA). A ToA makes the reasoning from test to consequence explicit by mapping how components such as items, administration, scoring, reporting and resources are expected to connect to decisions and, ultimately, to outcomes.
For SDG 4.1.1a, this means clarifying from the outset how foundational learning data will support policy dialogue, inform system reforms and shape instructional practice in the early grades. It also means anticipating risks, such as inadequate contextualisation across languages, limited sensitivity to diverse learning environments, and inequitable uptake of results in settings where stakeholder engagement and technical capacity are constrained. By requiring assessment experts, practitioners and policymakers to jointly define intended uses and mechanisms for impact, a ToA embeds consequences into design choices: which constructs are measured, how samples are drawn, who collects the data, when and where it is gathered, in what format, and how results are reported and used.
In this way, assessments become instruments for system improvement rather than static measurement exercises, tying validity not only to technical soundness but also to the actions and changes they are intended to inform. Whether through international initiatives or national assessments, countries must ensure that theories of action are tailored to their context, aligning outcomes and pathways with local needs, capacities, and priorities.
UKFIET 2025 session on foundational learning assessments
The upcoming UKFIET 2025 Conference session on “Mobilizing partnerships for capacity development in foundational learning assessments” offers a timely space to advance this conversation. The task is not only to meet global SDG 4.1.1a reporting demands but to do so in ways that strengthen national systems, build sustainable capacity, and engage diverse stakeholders and communities. Across large-scale assessment initiatives (AMPL, UNICEF’s Foundational Learning Module, and PAL Network’s citizen-led ICAN-ICAR assessments), the real charge is to balance comparability with utility, foster trust, inform decisions and ensure assessments become catalysts for lasting change.