ZIB Education

On the Use of Natural Language Processing and Machine Learning in Educational Assessment

Dr. Fabian Zehner➚, ZIB-associated researcher DIPF

Assessments, such as tests, can produce complex data. For example, open-ended (text) responses need to be evaluated, or extensive log data capturing test takers’ interactions with computer-based test environments have to be processed. In these tasks, the computer can assist humans and thus create numerous valuable use cases. For example, machine learning and, in the case of linguistic data, natural language processing can be used. The habilitation project applies this methodology in the context of educational assessment, from which especially large-scale assessments such as PISA can benefit.

Assessments in educational research and practice often involve natural language. Tests and questionnaires, for example, contain linguistic stimuli such as instructions or questions to which test takers and respondents are supposed to respond. These observed responses may in turn contain natural language themselves, for example in the form of short text responses or essays. Such observations are complicated to evaluate. A similar complexity arises from so-called log data, i.e., when a computer-based assessment captures behavioral data (such as clicks on objects, keystrokes, or scrolling). In this way, more complex assessment situations can be created (e.g., for so-called embedded or stealth assessment), which particularly benefits the measurement of so-called 21st-century skills (e.g., collaborative problem solving).

In both of these complex data setups, machine-learning methods enable the recognition of patterns that help address different research questions. When text responses are to be processed, techniques of natural language processing are used in addition. These enable the computer, for example, to automatically evaluate whether text responses are correct or incorrect, or whether a response follows one line of argumentation rather than another. In the case of log data, machine learning can be used, among other things, to improve the estimation of the latent trait of interest (such as ability) or to identify relevant characteristics other than the construct to be measured (such as whether a test taker is engaged).
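To make the idea of automatically evaluating text responses concrete, the following is a minimal, purely illustrative sketch: a bag-of-words representation combined with a nearest-neighbor comparison against human-coded example responses. This is not ReCo's actual pipeline; all response texts, labels, and function names are hypothetical.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term frequencies for a short text response."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def score_response(response, labeled_examples):
    """Assign the label of the most similar human-coded example (1-NN)."""
    vec = vectorize(response)
    best_label, best_sim = None, -1.0
    for text, label in labeled_examples:
        sim = cosine(vec, vectorize(text))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Hypothetical human-coded training responses for a science item
examples = [
    ("the water evaporates because of the heat", "correct"),
    ("heat makes the water turn into vapor", "correct"),
    ("the water disappears by magic", "incorrect"),
    ("i do not know", "incorrect"),
]
```

In practice, systems of this kind use richer linguistic representations and classifiers trained on large human-coded samples, but the basic logic — mapping free text to vectors and generalizing from coded examples — is the same.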

In the habilitation project, both natural language processing and machine learning are used to answer research-relevant questions and to enable practical applications. One focus is on the automatic evaluation of text responses by software developed and programmed in the same project context: ReCo.

The project …

  1. pursues, for example, the development of a graphical user interface so that the automatic evaluation of textual data is easily accessible to researchers (Zehner & Andersen, 2020),
  2. shows how PISA's administration mode (paper- vs. computer-based assessment) has affected students' responses (Zehner, Goldhammer, Lubaway, & Sälzer, 2019; Zehner, Kroehne, Hahnel, & Goldhammer, 2020),
  3. develops a theoretical framework to identify causal conditions in text responses that make a response, for example, a correct answer,
  4. aims at developing methodology to directly compare responses across different languages,
  5. aims at using open-ended text responses in computer-adaptive testing, and
  6. demonstrates how disengaged test behavior can be predicted (moderately well) by means of machine learning and theory-driven feature extraction from log data (Zehner, Harrison et al., 2020).

These examples are excerpts from the project's various research lines.
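The theory-driven feature extraction from log data mentioned in point 6 can be illustrated with a minimal sketch: deriving per-item indicators such as time on task, number of actions, and a rapid-guessing flag from raw event logs. The event format, threshold value, and feature set here are hypothetical and chosen only for illustration, not taken from the project.

```python
# Hypothetical log events: (item_id, action, timestamp_in_seconds)
log = [
    ("item1", "start", 0.0),
    ("item1", "click_option", 1.2),
    ("item1", "submit", 1.5),
    ("item2", "start", 1.5),
    ("item2", "scroll", 10.0),
    ("item2", "click_option", 55.0),
    ("item2", "submit", 60.0),
]

def extract_features(log, rapid_threshold=5.0):
    """Per-item features: time on task, action count, rapid-guessing flag.

    A response time below `rapid_threshold` seconds is flagged as a
    possible rapid guess, a common theory-driven indicator of
    disengaged test behavior.
    """
    raw = {}
    for item, action, t in log:
        f = raw.setdefault(item, {"start": None, "end": None, "actions": 0})
        if action == "start":
            f["start"] = t
        else:
            f["actions"] += 1
            f["end"] = t
    return {
        item: {
            "time_on_task": f["end"] - f["start"],
            "n_actions": f["actions"],
            "rapid_guess": (f["end"] - f["start"]) < rapid_threshold,
        }
        for item, f in raw.items()
    }
```

Features of this kind can then serve as predictors in a machine-learning model, combining substantive theory about test-taking behavior with data-driven pattern recognition.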

Zehner, F. & Andersen, N. (2020). ReCo: Textantworten automatisch auswerten (Methodenworkshop). Zeitschrift für Soziologie der Erziehung und Sozialisation, 40(3), 334–340.

Zehner, F., Goldhammer, F., Lubaway, E. & Sälzer, C. (2019). Unattended consequences: How text responses alter alongside PISA's mode change from 2012 to 2015. Education Inquiry, 10(1), 34–55. doi: 10.1080/20004508.2018.1518080➚

Zehner, F., Harrison, S., Eichmann, B., Deribo, T., Bengs, D., Andersen, N. & Hahnel, C. (2020). The NAEP Data Mining Competition: On the value of theory-driven psychometrics and machine learning for predictions based on log data. In A. N. Rafferty, J. Whitehill, C. Romero, & V. Cavalli-Sforza (Eds.), Proceedings of the Thirteenth International Conference on Educational Data Mining (EDM 2020) (pp. 302–312). Morocco. [online available➚]

Zehner, F., Kroehne, U., Hahnel, C. & Goldhammer, F. (2020). PISA Reading: Mode effects unveiled in text responses. Psychological Test and Assessment Modeling, 62, 55–75. [online available➚]


Mentor: Prof. Dr. Frank Goldhammer➚


Phone: 089 289 28274
Email: zib.edu@sot.tum.de