Large Language Models (LLMs) have invaded our lives, first by helping us write e-mails, abstracts and program code, then by condensing texts into neat summaries or attractive audio overviews, and finally ending up in more questionable uses, some of which verge upon divination techniques (e.g., “which party should I vote for, given that I …?”). Following up on our 2022 blog post, “Can artificial intelligence read medieval inquisition records?”, it is high time for historians to consider more broadly what applications LLMs may have in their research.
Leaving aside some widely discussed questionable uses of LLMs, which sometimes trouble our teaching and our evaluation of academic articles, we want to focus here on the exceptional promise LLMs hold for speeding up data acquisition. Recently, Andres Karjus discussed various ways in which LLMs can assist the work of historians and social scientists in the article “Machine-assisted quantitizing designs: augmenting humanities and social sciences with artificial intelligence”, published in Humanities and Social Sciences Communications. To take just one benefit of LLMs, their exceptional capacity to understand texts in various languages makes them excellent annotators. This can solve many of the problems stemming from the lack of manpower in the notoriously underfinanced humanities and from the – certainly deplorable – decline in foreign (and source) language competencies of our students (or has the changing focus of our curricula already led to the scarcity of these skillsets?).
Of course, there are good and bad uses of LLMs for information extraction. If we merely ask for data from historical sources to be summarized and rely unquestioningly on the LLM’s results, we have missed a crucial step in any data extraction process: cross-validation. And here, the most reliable means still is, and is likely to remain, cross-evaluation by human specialists. We have started using LLMs for information extraction extensively in our DISSINET work, but we always include the crucial step of manually annotating a sample of the data for the same variables, usually on the basis of the same instructions that we gave the LLM, so that we can formally express the accuracy of the annotation. Thus, we are able to ascertain how much we can trust the LLM outputs.
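To make this step concrete, here is a minimal sketch of how such a comparison might be computed, assuming (purely for illustration) that the LLM outputs and the human annotations of the same sample are stored in two CSV files keyed by a shared record identifier; the file and column names are hypothetical, not those of our actual pipeline.

```python
# Minimal sketch: comparing LLM annotations against a manually annotated sample.
# File names and column names ("llm_annotations.csv", "record_id", "label") are
# hypothetical placeholders; adapt them to your own data layout.
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score

llm = pd.read_csv("llm_annotations.csv")      # columns: record_id, label
human = pd.read_csv("human_annotations.csv")  # columns: record_id, label

merged = llm.merge(human, on="record_id", suffixes=("_llm", "_human"))

# Simple agreement rate between the LLM and the human annotator.
accuracy = accuracy_score(merged["label_human"], merged["label_llm"])

# Cohen's kappa corrects for chance agreement and is more informative
# when the label distribution is skewed (e.g., a rare topic).
kappa = cohen_kappa_score(merged["label_human"], merged["label_llm"])

print(f"Agreement with human annotator: {accuracy:.2%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Reporting a chance-corrected measure alongside raw agreement is one simple way to express formally how far the LLM outputs can be trusted for a given variable.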
If used properly, the capacity of LLMs to capture fine-grained semantic information allows us to bypass many limits of traditional keyword-based searches. We have experimented with various tasks in information extraction from historical documents, usually in view of particular studies, such as the following (a minimal sketch of one such annotation task is given after the list):
- recognition of male/female gender (from names, pronouns etc.);
- classification of document genre;
- presence or absence of specific topics or their variants in individual notarial documents, such as religious belief, or of contestation of religious authorities in inquisition records;
- and even more specialized and interpretatively complex tasks crucial for extracting data from trial records, such as recognizing whether a statement is portrayed as real vs. hypothetical or otherwise under question (i.e., modality), as well as whether it refers to trial time (e.g., a conversation between the inquisitor and the deponent) or constitutes testimony (e.g., a past conversation between the deponent and another suspect).
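As promised above, here is a minimal sketch of what one such annotation task (modality: real vs. hypothetical) might look like in code. The prompt and labels are illustrative rather than our exact instructions, and the OpenAI Python client is used only as one possible backend.

```python
# Minimal sketch of LLM-based annotation for the modality task described above.
# The prompt wording and label set are illustrative assumptions, not the exact
# DISSINET instructions; any chat-completion API could be substituted.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You will be given a clause from a medieval inquisition record "
    "(in Latin or in translation). Answer with exactly one word: "
    "'real' if the statement is portrayed as something that happened, "
    "or 'hypothetical' if it is presented as conditional, doubted, "
    "or otherwise under question."
)

def annotate_modality(clause: str, model: str = "gpt-4o") -> str:
    """Return the model's one-word modality label for a single clause."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": clause},
        ],
        temperature=0,  # deterministic output helps reproducibility
    )
    return response.choices[0].message.content.strip().lower()

# Example usage on an invented clause:
# annotate_modality("Item dixit quod, si inquisitor veniret, fugeret.")
```

The same labels produced by a human annotator on a random sample then feed directly into the accuracy and kappa computation sketched earlier.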
Apart from this, in work led by our own Tomáš Hampejs, we have even started to extract structures comparable to our very complex CASTEMO data model, with considerable success. While there are still years of work ahead of us, we foresee an “AISTEMO” future: “Artificial Intelligence Semantic Text Modelling” rendering full texts into historically informed, complex, properly structured data models. This is not likely to replace manual approaches; rather, it means we can now ask the more ambitious questions we always wanted to ask and pursue them on the basis of a much larger corpus. Human cross-validation, however, will remain key to assessing the usability of LLM data, and thus the validity of the results achieved with such data. The work of historians and social scientists will change, often shifting partially from sequential reading to annotating random samples of clauses, but it will remain a crucial part of the research process.