Towards a corpus of medieval inquisition records: First numbers
Drawing on the invaluable work of modern editors, the DISSINET project is building a textual corpus of medieval inquisitorial material. It now contains 15 registers, totals over 1.6 million tokens, and ranges from North-Central Italy through Languedoc to England, and from the 1230s to the 1520s.
29 Mar 2024
David Zbíral
Gideon Kotzé
Robert Laurence John Shaw
What are inquisition records?
Medieval inquisition records are notarial documents of different types, including depositions, records of sentencing, and legal consultations. Some registers are predominantly composed of documents of the same type, for instance deposition records, while others contain documents of multiple types. Beyond this mix of genres, the registers also differ significantly in content and style. Taken as a whole, they offer considerable opportunities for computational text analysis with a comparative focus. For this reason, we have started compiling a sizable digital corpus of medieval inquisition records.
Spatiotemporal coverage of the current corpus
Due to the haphazard preservation of medieval inquisition material, as well as the political and social conditions of different areas of medieval Europe, extant records do not directly reflect the actual extent of Christian dissidence. Furthermore, since we use published editions as the basis of our corpus, its coverage depends on previous editorial work: some extensive registers remain unpublished. In addition, our work on the digitisation and cleaning of available editions is progressing gradually. Nevertheless, the corpus we have compiled already gives a fairly representative image of extant medieval inquisition material. It includes registers produced in Languedoc, North-Central Italy and England, dating from the 1230s to the 1520s, a span of nearly three centuries. The spatiotemporal coverage is shown in Fig. 1.
Fig. 1. Spatiotemporal coverage of the corpus. The x-axis shows time, the y-axis the short name of the register, and the colour shows the geographical area.
Text preprocessing
As mentioned, our corpus of medieval inquisition records is founded on modern scholarly editions of the original texts. These editions were scanned on a professional robotic scanner (Qidenus) and then processed with optical character recognition (OCR) in the ABBYY FineReader software at the Centre for Information Technologies of Masaryk University’s Faculty of Arts. Subsequently, a team of historians with strong Latin skills cleaned the recognized texts in Microsoft Word. This process entailed manually removing editorial segments that should not be part of the main text, such as footnotes, and correcting the major OCR errors through text replacement and manual edits. From the resulting DOCX files, we produced plain text files in UTF-8 encoding and stored them in a private GitHub repository. In one case, that of the Register of Jacques Fournier, we adopted a different approach: we commissioned a commercial manual transcription (Word Pro; lead: Yann Pitchal). Apart from providing a high-quality transcription of the edition, it also incorporates into the main text the corrections that the editor presented (1) in the critical apparatus instead of the main text, and (2) in the subsequently published extensive erratum to the original edition.
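The post does not say which tool was used for the DOCX-to-text step, so the following is only a minimal sketch of one way to do it with the Python standard library. It assumes the cleaned DOCX files hold the main text as ordinary paragraphs; the function names `docx_to_text` and `export_utf8` are hypothetical and not part of the DISSINET workflow.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the body text of a DOCX file, one paragraph per line.

    Hypothetical helper: reads only word/document.xml, so footnotes
    (stored in a separate part of the archive) are not included.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # Join all text runs (w:t elements) that belong to this paragraph.
        paragraphs.append("".join(node.text or "" for node in para.iter(W + "t")))
    return "\n".join(paragraphs)


def export_utf8(docx_path, txt_path):
    """Write the extracted text to a plain text file in UTF-8 encoding."""
    Path(txt_path).write_text(docx_to_text(docx_path), encoding="utf-8")
```

This sketch ignores tabs, manual line breaks and text boxes, so a real conversion of scholarly editions would need more care than shown here.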
Composition of the corpus
The size of most registers has not been precisely measured so far, and to the best of our knowledge, computational text analysis techniques have not yet been used at all to analyse medieval inquisition material. However, these techniques, which we are using in DISSINET alongside manual approaches to data acquisition, have much to offer, and we hope to realise their potential in upcoming studies. In this blog post, we share some first numbers that historians of medieval inquisition and dissidence might find of interest.
As it now stands, our corpus is composed of fifteen cleaned digital texts (Tab. 1). It contains just over 1.6 million word tokens.
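The headline figure can be reproduced from the per-register counts listed in Tab. 1. The sketch below simply adds up those counts; the variable names are ours, and the derived percentage is an arithmetic by-product of the table rather than a figure reported by the project.

```python
# (short name, token count, token count in Latin docs), copied from Tab. 1
ROWS = [
    ("Niort", 5981, 5981),
    ("Carcassonne", 9437, 9437),
    ("Orvieto", 38183, 38183),
    ("Toulouse", 102360, 102360),
    ("Bologna", 208256, 208256),
    ("Guglielmites", 41367, 41367),
    ("Ablis", 69815, 69815),
    ("Gui", 278055, 278055),
    ("Fournier", 654591, 654591),
    ("Castellario", 35072, 35072),
    ("Casasco", 10215, 10215),
    ("Settimo", 31563, 31563),
    ("Norwich", 67314, 41892),
    ("Coventry", 33493, 19695),
    ("Kent", 48842, 17631),
]

total_tokens = sum(tokens for _, tokens, _ in ROWS)
latin_tokens = sum(latin for _, _, latin in ROWS)
latin_share = latin_tokens / total_tokens

print(len(ROWS))              # 15
print(total_tokens)           # 1634544
print(latin_tokens)           # 1564113
print(f"{latin_share:.1%}")   # 95.7%
```

Only the three English registers have a Latin token count below their overall token count, which is why the two totals differ.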
| Name of the register | Short name | Area | Date | Token count | Token count in Latin docs | Type count in Latin docs | Number of Latin docs |
|---|---|---|---|---|---|---|---|
| Proceedings against Bernard of Niort and his family | Niort | France | 1234/1235 | 5981 | 5981 | 1258 | 117 |
| Register HHH of the Carcassonne inquisition | Carcassonne | France | 1246-1247 | 9437 | 9437 | 1370 | 22 |
| Book of sentences from Orvieto | Orvieto | Italy | 1268 | 38183 | 38183 | 2817 | 69 |
| Register of Pons of Parnac, Ranulph of Plaissac and other inquisitors in Toulouse | Toulouse | France | 1273-1282 | 102360 | 102360 | 7824 | 199 |
| Register of the inquisition of Bologna | Bologna | Italy | 1291-1310 | 208256 | 208256 | 12221 | 922 |
| Proceedings against the Guglielmites in Milano | Guglielmites | Italy | 1300-1302 | 41367 | 41367 | 3543 | 114 |
| Register of Geoffroy of Ablis | Ablis | France | 1308-1309 | 69815 | 69815 | 5150 | 46 |
| Book of sentences of Bernard Gui | Gui | France | 1308-1323 | 278055 | 278055 | 13444 | 711 |
| Register of Jacques Fournier | Fournier | France | 1318-1325 | 654591 | 654591 | 24595 | hundreds |
| Proceedings against heretics in Giaveno by Alberto de Castellario | Castellario | Italy | 1335 | 35072 | 35072 | 2444 | 248 |
| Proceedings against heretics in Piedmont by Tomasso of Casasco | Casasco | Italy | 1373-1388 | 10215 | 10215 | 2481 | 61 |
| Proceedings against heretics in Piedmont by Antonio of Settimo | Settimo | Italy | 1387-1388 | 31563 | 31563 | 5089 | 25 |
| Proceedings against heretics in Norwich | Norwich | England | 1428-1431 | 67314 | 41892 | 4332 | 124 |
| Proceedings against heretics in Coventry | Coventry | England | 1486-1522 | 33493 | 19695 | 3153 | 68 |
| Proceedings against heretics in Kent | Kent | England | 1511-1512 | 48842 | 17631 | 2085 | 98 |
| Total | | | | 1634544 | 1564113 | 53773 | 2824+ |
Tab. 1. Composition and basic descriptive measures of the corpus. Documents here mean lowest-level notarial documents, that is, texts which form a unit in an inquisition process (for instance, the deposition of one individual on one day, or the sentence of one individual). Tokens roughly correspond to words. Types are unique word forms. For instance, the string “word words word” has 3 tokens, but only 2 types (“word”, “words”), and only 1 lemma or basic word form (“word”). The document total (2824+) is the sum for the fourteen registers with exact figures; the documents of the Register of Jacques Fournier, listed as “hundreds”, come on top of it.
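The difference between tokens and types can be made concrete with a few lines of Python. This is a minimal sketch, not the tokenizer used for the corpus: it assumes that a token is any run of letters or digits and that case is ignored, and the two short Latin-like strings are made up for illustration. Counting lemmas would additionally require a Latin lemmatizer, which is not shown.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive rule)."""
    return re.findall(r"\w+", text.lower())


def describe(text):
    """Count tokens (running words) and types (unique word forms)."""
    tokens = tokenize(text)
    return {"tokens": len(tokens), "types": len(set(tokens))}


# The example from the table caption.
print(describe("word words word"))  # {'tokens': 3, 'types': 2}

# Types are not additive: word forms shared by two texts are counted
# only once when the texts are taken together.
doc_a = "dixit quod vidit hereticos"   # invented sample, 4 types
doc_b = "dixit quod non vidit"         # invented sample, 4 types
types_a = set(tokenize(doc_a))
types_b = set(tokenize(doc_b))
print(len(types_a) + len(types_b))     # 8
print(len(types_a | types_b))          # 5
```

This would be consistent with the type total in Tab. 1 (53773) being far smaller than the sum of the per-register type counts, on the assumption that the total counts distinct word forms across all Latin documents of the corpus.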
We are now looking forward to using this corpus to study crucial aspects of medieval inquisition records, inquisitorial discourse, and dissidence itself, continuing to develop DISSINET’s comprehensive data-oriented computational approach to these intriguing historical trial documents.