Mapping the Latin of medieval inquisition records: DISSINET’s lexico-semantic network as a language resource

DISSINET is a research project focusing on knowledge discovery from historical texts – Latin inquisition records – through the synergy between the manual collection of structured data and natural language processing (NLP) approaches. As an important product of our analytical work, we are at the initial stages of compiling a specialised language resource for Latin with the potential to enrich the description of the Latin language and improve the quality of computational analysis, including NLP, performed on Latin sources.

21 Apr 2023

David Zbíral | Gideon Kotzé | Robert L. J. Shaw

In the Dissident Networks Project (DISSINET), we are developing a comprehensive text annotation approach that we call Computer-Assisted Semantic Text Modelling (CASTEMO). CASTEMO is a procedure of knowledge-graph building, where textual data are captured at a high level of syntactic-semantic accuracy. It is capable of retaining narrators’ perspectives, original expressions in the original languages, and the hierarchy of texts and their different parts. These details can be connected to project-specific research-oriented ontologies, allowing for efficient querying and analysis of the data.

The semantic core of CASTEMO multi-language knowledge graphs is a lexico-semantic network, i.e. a network of lemmatized concepts with defined meanings (thus semantically disambiguated). Unlike entities within the knowledge graph representing specific instances (individuals) – such as persons, groups, locations, animals, texts etc. – these concepts represent types, i.e. they are generic entities. In CASTEMO, we divide these into Actions, which represent verbs (both single-word and multi-word, i.e. composite), and Concepts, which encompass other parts of speech (such as single-word as well as multi-word nouns, adjectives, pronouns, adverbs and adverbial expressions, etc.).

In effect, our collection and analysis of data from medieval Latin sources has led us to develop a lexico-semantic network that is not dissimilar in structure to resources such as WordNet and Latin WordNet. Of course, DISSINET is not an infrastructural or linguistic project, and its lexico-semantic network is not based on a representative predefined glossary. Rather, it grows as our data collection progresses. Nevertheless, available resources proved to have various limitations for our purposes. For instance, while studying Latin WordNet, a state-of-the-art lexico-semantic network for Latin, we found that it often lacked the lemmas that we needed, especially multi-word lexical units. We have thus embraced the opportunity, through our annotation of Latin sources, to build our own structured and machine-operable description of the Latin language.

Let’s have a look at some numbers. As of 18 April 2023, our database contains 565 Actions, of which 519 are Latin, and 4,033 Concepts, of which 1,946 are Latin. A tiny network perhaps, but it is rich in connections. Among the 15,116 ontological and semantic relations of different kinds within our overall DISSINET knowledge graph, 6,503 occur within our Concepts and Actions network.

The DISSINET lexico-semantic network can serve as an additional language resource for Latin in the following ways:

For Latin Concepts, it provides an English-language taxonomy, allowing us to group Latin concepts (both verbs and other parts of speech) in broader categories for purposes of querying and analysis. For example, the superclass taxonomy allows us to introduce “is a” relations, such as the fact that a “baptism” is a type of “sacrament”.
For Latin Actions, it provides their nominal counterpart Concept (that we call action/event equivalent) in either English or Latin. This ensures the interchangeability between verbal and nominal expressions. For example, “accepit pacem” (“he received [the kiss of] peace”) has a nominal counterpart relation to “kiss of peace”, which has the superclass “greeting”.
It is rich in mapping multi-word lexical units in Latin (multi-word expressions and collocations): as many as 815 Latin-language Concepts (and 65 Latin Actions) are composed of more than one word.
In ways similar to FrameNet, it provides any Action (i.e. verbal lemma with unambiguous meaning) with its valency frame, i.e. a structure which relates any actant role (subject, object 1, object 2) to its morphosyntactic valency (case, preposition), allowed entity types following our ontology (Person, Group, Concept… or Empty, if that actant slot must not be occupied), and meaning (defined through a relation to a Concept). For instance, it defines that the subject of “video” (alqm/alqd), that is “to see” (sb/st), will be in the nominative without a preposition, that it must be a Person or a Group, and that the actant occupying the subject slot is a “perceiver”; that the primary object will be in the accusative without a preposition, an accusative with infinitive, or a clausal complement introduced by a relative pronoun, that it must be occupied by a Person, Group, Object, Event or Statement, and that the actant will be “content of perception”; and finally, that in this meaning of “video”, no entity is allowed in the second object role. So far, we have constructed such valency frames for 502 Latin Actions.
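The first two points above can be illustrated with a minimal sketch in Python. It is not DISSINET's actual data model: the dictionary names and helper functions are hypothetical, and only the three example relations (“baptism” is a “sacrament”, “accepit pacem” has the nominal counterpart “kiss of peace”, which has the superclass “greeting”) come from the text.

```python
# Hypothetical, simplified stand-in for a fragment of the network.
# Only the example relations quoted in the article are included.

# Concept -> its superclass Concept ("is a" relation).
SUPERCLASS = {
    "baptism": "sacrament",
    "kiss of peace": "greeting",
}

# Action -> nominal counterpart Concept (action/event equivalent).
ACTION_EVENT_EQUIVALENT = {
    "accepit pacem": "kiss of peace",
}


def ancestors(concept):
    """Return the chain of superclasses above a Concept, nearest first."""
    chain = []
    seen = {concept}
    while concept in SUPERCLASS:
        concept = SUPERCLASS[concept]
        if concept in seen:  # guard against accidental cycles in the data
            break
        seen.add(concept)
        chain.append(concept)
    return chain


def is_a(concept, category):
    """True if `category` is among the superclasses of `concept`."""
    return category in ancestors(concept)


def categories_of_action(action):
    """Concepts an Action falls under, reached via its nominal counterpart."""
    counterpart = ACTION_EVENT_EQUIVALENT.get(action)
    if counterpart is None:
        return []
    return [counterpart] + ancestors(counterpart)


def is_multi_word(label):
    """A lexical unit counts as multi-word if it has more than one token."""
    return len(label.split()) > 1
```

With such a structure, a query for everything that counts as a “greeting” would pick up the verbal expression “accepit pacem” as well as the nominal Concept, which is the interchangeability the action/event equivalent relation is meant to ensure.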

Fig. 1. Screenshot of the valency section of an Action from InkVisitor, a data collection interface for building CASTEMO knowledge graphs.

DISSINET’s goal is not to match, let alone replace, any existing language resources such as Latin WordNet or Latin Valency Lexicon (VALLEX). These are – unlike DISSINET’s lexico-semantic network – based on a balanced glossary and cover a broader domain than our project requires; our network, by contrast, grows only as our data collection progresses. Nevertheless, we believe that what is, from one point of view, something of a by-product of our research can already be of interest to linguists and NLP specialists in several ways:

It contains some lemmas not covered by Latin WordNet and connects them to a broader lexico-semantic network.
It provides superclasses for multi-word lexical units.
It provides some valency frames not described by the Latin Valency Lexicon.
It pairs up morphosyntactic valency with semantics (e.g., someone who sees something is a “perceiver”) and expected entity types (e.g., only a Person or a Group can see somebody or something).
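To make the last point concrete, here is a minimal sketch of how the “video” frame described earlier might be represented and checked. The class, field and function names are hypothetical and do not reflect InkVisitor's internal format; the slot contents follow the prose description of “video”, and “Location” is used only as an example of an entity type the frame does not allow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActantSlot:
    """One actant position of a valency frame (hypothetical structure)."""
    morphosyntax: tuple      # allowed case / construction patterns
    entity_types: frozenset  # allowed entity types; empty set = slot must stay empty
    semantics: Optional[str] # Concept describing the actant's role, if any


# Frame for "video" (alqm/alqd), "to see" (sb/st), as described in the text.
VIDEO_FRAME = {
    "subject": ActantSlot(
        ("nominative",),
        frozenset({"Person", "Group"}),
        "perceiver",
    ),
    "object1": ActantSlot(
        ("accusative", "accusative with infinitive",
         "clause introduced by relative pronoun"),
        frozenset({"Person", "Group", "Object", "Event", "Statement"}),
        "content of perception",
    ),
    "object2": ActantSlot((), frozenset(), None),  # no entity allowed
}


def check_actants(frame, actants):
    """Return a list of problems for a {slot name: entity type} assignment."""
    problems = []
    for slot_name, slot in frame.items():
        entity_type = actants.get(slot_name)
        if entity_type is None:
            continue  # unfilled slots are not checked in this sketch
        if not slot.entity_types:
            problems.append(f"{slot_name}: must stay empty")
        elif entity_type not in slot.entity_types:
            problems.append(f"{slot_name}: {entity_type} not allowed")
    return problems
```

For example, `check_actants(VIDEO_FRAME, {"subject": "Person", "object1": "Event"})` reports no problems, whereas a “Location” in the subject slot or any entity in the second object slot is flagged. A real validator would also need to check case and preposition, which this sketch only stores.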

We are interested in your feedback, as we are looking to align our growing network with existing resources and standards for the mutual benefit of communities in the digital humanities and natural language processing of ancient languages. Once this alignment is achieved, we hope to release a public version in the near future.

Tab. 1. Database statistics as of 18 April 2023

Knowledge graph component	Count
Entities	31,726
Actions	565
English Actions	39
Latin Actions	519
Latin Actions with valency frames	502
Latin Actions with semantics of actant positions	483
Concepts	4,033
Concepts linked to English WordNet 3.1 meaning identifiers	1,100
English Concepts	2,039
Latin Concepts	1,946
Single-word Latin Concepts	1,129
Single-word Latin Concepts matching a Latin WordNet lemma	939
Multi-word Latin Concepts	815
Multi-word Latin Actions	65
Relations	15,116
Relations within Concepts and Actions	6,503
Superclass Relation	3,425
Synonym Relation	394
Action/Event Equivalent Relation (relating Actions with their nominal counterparts)	410
Actant Semantics Relations (subject, object 1 and object 2)	1,410
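Some proportions implied by Tab. 1 can be derived with a few lines of Python. The counts below are copied from the table; the `share` helper and the choice of which rows to compare are our own illustrative assumptions, not part of the DISSINET tooling.

```python
# Counts copied from Tab. 1 (database statistics as of 18 April 2023).
STATS = {
    "Latin Actions": 519,
    "Latin Actions with valency frames": 502,
    "Latin Concepts": 1946,
    "Single-word Latin Concepts": 1129,
    "Single-word Latin Concepts matching a Latin WordNet lemma": 939,
    "Multi-word Latin Concepts": 815,
    "Relations": 15116,
    "Relations within Concepts and Actions": 6503,
}


def share(part, whole):
    """Percentage of `whole` accounted for by `part`, to one decimal place."""
    return round(100 * STATS[part] / STATS[whole], 1)


if __name__ == "__main__":
    pairs = [
        ("Latin Actions with valency frames", "Latin Actions"),
        ("Multi-word Latin Concepts", "Latin Concepts"),
        ("Single-word Latin Concepts matching a Latin WordNet lemma",
         "Single-word Latin Concepts"),
        ("Relations within Concepts and Actions", "Relations"),
    ]
    for part, whole in pairs:
        print(f"{part} / {whole}: {share(part, whole)}%")
```

Read this way, roughly 83% of the single-word Latin Concepts match a Latin WordNet lemma, and about 43% of all relations in the knowledge graph lie within the Concepts and Actions network.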

Fig. 2. DISSINET database’s taxonomy, generated on 19 April 2023. This is a force-directed plot of the current graph, using the Kamada-Kawai layout. Green edges indicate superclass relations between Concepts, while blue edges indicate relations linking Actions to their nominal counterparts in Concepts.

Fig. 3. Distribution of taxonomy lengths for Concepts linked through superclass relations. This shows that a relatively high number of Concepts are still unlinked, but also that most of those that are linked form taxonomies of a reasonable length.
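A distribution of this kind could be computed along the following lines. This is a sketch under stated assumptions: “taxonomy length” is taken to mean the number of superclass steps from a Concept up to its topmost ancestor, each Concept is assumed to have at most one superclass, and the function names are hypothetical. How the figure itself was computed is not specified here.

```python
from collections import Counter


def taxonomy_length(concept, superclass):
    """Number of superclass steps from a Concept to its topmost ancestor."""
    length = 0
    seen = {concept}
    while concept in superclass:
        concept = superclass[concept]
        if concept in seen:  # guard against accidental cycles in the data
            break
        seen.add(concept)
        length += 1
    return length


def length_distribution(concepts, superclass):
    """Map each taxonomy length to the number of Concepts having it.

    Length 0 covers both top-level Concepts and Concepts with no
    superclass relation at all; this sketch does not tell them apart.
    """
    return Counter(taxonomy_length(c, superclass) for c in concepts)


# Toy data: "baptism", "sacrament", "kiss of peace" and "greeting" come from
# the article; "ritual" and "heretic" are hypothetical additions.
TOY_SUPERCLASS = {
    "baptism": "sacrament",
    "sacrament": "ritual",
    "kiss of peace": "greeting",
}
TOY_CONCEPTS = ["baptism", "sacrament", "ritual",
                "kiss of peace", "greeting", "heretic"]

if __name__ == "__main__":
    distribution = length_distribution(TOY_CONCEPTS, TOY_SUPERCLASS)
    for length in sorted(distribution):
        print(length, distribution[length])
```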
