Towards a corpus of medieval inquisition records: First numbers

Drawing on the invaluable work of modern editors, the DISSINET project is building a textual corpus of medieval inquisitorial material. It now contains 15 registers, totals over 1.6 million tokens, and ranges from North-Central Italy through Languedoc to England, and from the 1230s to the 1520s.

29 Mar 2024. David Zbíral, Gideon Jozua Kotzé, Robert Laurence John Shaw

What are inquisition records?

Medieval inquisition records are notarial documents of different types, including depositions, records of sentencing, and legal consultations. Some registers are predominantly composed of documents of a single type, for instance deposition records, while others contain documents of multiple types. Beyond this mix of genres, the registers also differ significantly in content and style. Taken as a whole, they offer significant opportunities for computational text analysis with a comparative focus. For this reason, we have started compiling a sizable digital corpus of medieval inquisition records.

Spatiotemporal coverage of the current corpus

Due to the haphazard preservation of medieval inquisition material, as well as the political and social conditions of different areas of medieval Europe, extant records do not directly reflect the actual extent of Christian dissidence in medieval Europe. Furthermore, since we use published editions as the basis of our corpus, its coverage depends on previous editorial work: some extensive registers remain unpublished. In addition, our work on the digitisation and cleaning of available editions is progressing gradually. Nevertheless, the corpus we have compiled so far already gives a fairly representative picture of extant medieval inquisition material. It includes registers produced in Languedoc, North-Central Italy and England across three centuries, dating from the 1230s to the 1520s. The spatiotemporal coverage is shown in Fig. 1.

Fig. 1. Spatiotemporal coverage of the corpus. The x-axis shows time, the y-axis the short name of the register, and the colour shows the geographical area.

Text preprocessing

As mentioned, our corpus of medieval inquisition records is founded on modern scholarly editions of the original texts. These editions were scanned on a professional robotic scanner (Qidenus), then further processed and run through optical character recognition (OCR) with the ABBYY FineReader software at the Centre for Information Technologies of Masaryk University’s Faculty of Arts. Subsequently, a team of historians with strong Latin skills cleaned the scanned texts in Microsoft Word. This process entailed manually removing editorial segments that should not be part of the main text, such as footnotes, and correcting the major OCR errors by text replacement and manual edits. From the Microsoft Word DOCX files, we produced plain text files in the UTF-8 encoding and stored them in a private GitHub repository. In one case, that of the Register of Jacques Fournier, we adopted a different approach: we ordered a commercial manual transcription (Word Pro; lead: Yann Pitchal). Apart from providing a high-quality transcription of the edition, it also incorporated into the main text the corrections that the editor presented (1) in the critical apparatus instead of the main text, and (2) in the subsequently published extensive erratum to the original edition.
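The "text replacement" step used to correct recurring OCR errors can be illustrated with a minimal Python sketch. This is not the project's actual tooling (the cleaning described above was done in Microsoft Word). The function name `clean_ocr_text` and the replacement pairs in `OCR_FIXES` are invented for illustration only, as is the choice to normalise the text to Unicode NFC.

```python
import re
import unicodedata

# Hypothetical replacement table: these pairs are invented for illustration
# and are NOT the project's actual list of OCR corrections.
OCR_FIXES = {
    "hcreticus": "hereticus",
    "inquifitor": "inquisitor",
}


def clean_ocr_text(text: str, fixes: dict = OCR_FIXES) -> str:
    """Apply whole-word replacements to OCR output (illustrative sketch)."""
    # Assumption: normalise to NFC so that composed and decomposed
    # characters are stored consistently before saving as UTF-8.
    text = unicodedata.normalize("NFC", text)
    for wrong, right in fixes.items():
        # \b limits each replacement to whole words, so that longer
        # words containing the same letters are left untouched.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text


if __name__ == "__main__":
    sample = "Item dixit quod hcreticus fuit."
    print(clean_ocr_text(sample))
```

Restricting replacements to whole words is a cautious design choice: it avoids silently altering inflected forms that were never checked, which would then be left for manual correction.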

Composition of the corpus

The volume of most registers has not been precisely measured so far, and to the best of our knowledge, computational text analysis techniques have not yet been used at all to analyse medieval inquisition material. However, these techniques – which we are using in DISSINET alongside manual approaches to data acquisition – have much to offer, and we hope to deploy their potential in upcoming studies. In this blog post, we share some first numbers that historians of medieval inquisition and dissidence might find of interest.

As it now stands, our corpus is composed of fifteen cleaned digital texts (Tab. 1). It contains just over 1.6 million word tokens.

Name of the register	Short name	Area	Date	Token count	Token count in Latin docs	Type count in Latin docs	Number of Latin docs
Proceedings against Bernard of Niort and his family	Niort	France	1234/1235	5981	5981	1258	117
Register HHH of the Carcassonne inquisition	Carcassonne	France	1246-1247	9437	9437	1370	22
Book of sentences from Orvieto	Orvieto	Italy	1268	38183	38183	2817	69
Register of Pons of Parnac, Ranulph of Plaissac and other inquisitors in Toulouse	Toulouse	France	1273-1282	102360	102360	7824	199
Register of the inquisition of Bologna	Bologna	Italy	1291-1310	208256	208256	12221	922
Proceedings against the Guglielmites in Milano	Guglielmites	Italy	1300-1302	41367	41367	3543	114
Register of Geoffroy of Ablis	Ablis	France	1308-1309	69815	69815	5150	46
Book of sentences of Bernard Gui	Gui	France	1308-1323	278055	278055	13444	711
Register of Jacques Fournier	Fournier	France	1318-1325	654591	654591	24595	hundreds
Proceedings against heretics in Giaveno by Alberto de Castellario	Castellario	Italy	1335	35072	35072	2444	248
Proceedings against heretics in Piedmont by Tomasso of Casasco	Casasco	Italy	1373-1388	10215	10215	2481	61
Proceedings against heretics in Piedmont by Antonio of Settimo	Settimo	Italy	1387-1388	31563	31563	5089	25
Proceedings against heretics in Norwich	Norwich	England	1428-1431	67314	41892	4332	124
Proceedings against heretics in Coventry	Coventry	England	1486-1522	33493	19695	3153	68
Proceedings against heretics in Kent	Kent	England	1511-1512	48842	17631	2085	98

Total				1634544	1564113	53773	2824+

Tab. 1. Composition and basic descriptive measures of the corpus. Here, “documents” means the lowest-level notarial documents, that is, texts which form a unit in an inquisition process (for instance, the deposition of one individual on one day, or the sentence of one individual). Tokens are, roughly, words. Types are unique word forms. For instance, the string “word words word” has 3 tokens, but only 2 types (“word”, “words”), and only 1 lemma or basic word form (“word”).
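The distinction between tokens and types can be made concrete with a short Python sketch. It is a simplified illustration, not the procedure used to compute the figures in Tab. 1: the tokenizer (runs of letters, lowercased) and the function names `tokenize` and `count_tokens_and_types` are assumptions made for this example. Counting lemmas would additionally require a lemmatizer and is left out.

```python
import re


def tokenize(text: str) -> list:
    """Split text into word tokens (simplifying assumption: a token is
    a run of letters, and case is ignored)."""
    # [^\W\d_]+ matches one or more Unicode letters.
    return re.findall(r"[^\W\d_]+", text.lower())


def count_tokens_and_types(text: str) -> tuple:
    """Return (number of tokens, number of types) for a text."""
    tokens = tokenize(text)
    # Types are the unique word forms, so a set removes the repeats.
    return len(tokens), len(set(tokens))


if __name__ == "__main__":
    n_tokens, n_types = count_tokens_and_types("word words word")
    print(n_tokens, n_types)  # prints: 3 2
```

Because types are counted over unique forms, type counts for individual registers cannot simply be added up: a form that occurs in several registers counts only once in the corpus as a whole.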

We are now looking forward to using this corpus to study crucial aspects of medieval inquisition records, inquisitorial discourse, and dissidence itself, continuing to develop DISSINET’s comprehensive data-oriented computational approach to these intriguing historical trial documents.