David Zbíral | Gideon Kotzé | Robert L. J. Shaw
In the Dissident Networks Project (DISSINET), we are developing a comprehensive text annotation approach that we call Computer-Assisted Semantic Text Modelling (CASTEMO). CASTEMO is a procedure of knowledge-graph building, where textual data are captured at a high level of syntactic-semantic accuracy. It is capable of retaining narrators’ perspectives, original expressions in the original languages, and the hierarchy of texts and their different parts. These details can be connected to project-specific research-oriented ontologies, allowing for efficient querying and analysis of the data.
The semantic core of CASTEMO multi-language knowledge graphs is a lexico-semantic network, i.e. a network of lemmatized concepts with defined meanings (thus semantically disambiguated). Unlike entities within the knowledge graph representing specific instances (individuals) – such as persons, groups, locations, animals, texts etc. – these concepts represent types, i.e. they are generic entities. In CASTEMO, we divide these into Actions, which represent verbs (both single-word and multi-word, i.e. composite), and Concepts, which encompass other parts of speech (such as single-word as well as multi-word nouns, adjectives, pronouns, adverbs and adverbial expressions, etc.).
In effect, our collection and analysis of data from medieval Latin sources has led us to develop a lexico-semantic network that is not dissimilar in structure to resources such as WordNet and Latin WordNet. Of course, DISSINET is not an infrastructural or linguistic project, and its lexico-semantic network is not based on a representative predefined glossary. Rather, it grows as our data collection progresses. Nevertheless, available resources proved to have various limitations for our purposes. For instance, while studying Latin WordNet, a state-of-the-art lexico-semantic network for Latin, we found that it often lacked the lemmas that we needed, especially multi-word lexical units. We have thus embraced the opportunity, through our annotation of Latin sources, to build our own structured and machine-operable description of the Latin language.
Let’s have a look at some numbers. As of 18 April 2023, our database contains 565 Actions, of which 519 are Latin, and 4,033 Concepts, of which 1,946 are Latin. A tiny network perhaps, but it is rich in connections. Among the 15,116 ontological and semantic relations of different kinds within our overall DISSINET knowledge graph, 6,503 occur within our Concepts and Actions network.
The DISSINET lexico-semantic network can serve as an additional language resource for Latin in the following ways:
- For Latin Concepts, it provides an English-language taxonomy, allowing us to group Latin concepts (both verbs and other parts of speech) in broader categories for purposes of querying and analysis. For example, the superclass taxonomy allows us to introduce “is a” relations, such as the fact that a “baptism” is a type of “sacrament”.
- For Latin Actions, it provides their nominal counterpart Concept (that we call action/event equivalent) in either English or Latin. This ensures the interchangeability between verbal and nominal expressions. For example, “accepit pacem” (“he received [the kiss of] peace”) has a nominal counterpart relation to “kiss of peace”, which has the superclass “greeting”.
- It maps many multi-word lexical units in Latin (multi-word expressions and collocations): as many as 815 Latin Concepts and 65 Latin Actions are composed of more than one word.
- In ways similar to FrameNet, it provides any Action (i.e. verbal lemma with unambiguous meaning) with its valency frame, i.e. a structure which relates each actant role (subject, object 1, object 2) to its morphosyntactic valency (case, preposition), its allowed entity types following our ontology (Person, Group, Concept… or Empty, if that actant slot must not be occupied), and its meaning (defined through a relation to a Concept). Take “video” (alqm/alqd), that is “to see” (sb/st). The frame defines that the subject will be in the nominative without a preposition, that it must be a Person or a Group, and that the actant occupying the subject slot is a “perceiver”. The primary object will be in the accusative without a preposition, an accusative with infinitive, or a clausal complement introduced by a relative pronoun; it must be a Person, Group, Object, Event or Statement, and the actant will be “content of perception”. Finally, in this meaning of “video”, no entity is allowed in the second object role. So far, we have constructed such valency frames for 502 of our 519 Latin Actions (see Tab. 1).
Fig. 1. Screenshot of the valency section of an Action from InkVisitor, a data collection interface for building CASTEMO knowledge graphs.
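The “video” frame described above can be written down as a small data structure. What follows is a minimal Python sketch, not InkVisitor's actual data model: the names `ActantSlot`, `ValencyFrame`, `allows` and `accepts` are assumptions made for illustration, and only the content of the frame itself is taken from the description. The sketch also assumes that a slot may be left unfilled, which the description does not specify.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

EMPTY = "Empty"  # marker for an actant slot that must not be occupied


@dataclass(frozen=True)
class ActantSlot:
    """One actant position of a valency frame (hypothetical structure)."""
    morphosyntax: Tuple[str, ...]     # allowed case / construction patterns
    entity_types: FrozenSet[str]      # allowed entity types, or {EMPTY}
    semantics: Optional[str] = None   # Concept naming the actant's role

    def allows(self, entity_type: str) -> bool:
        # A slot marked Empty rejects every entity type.
        return EMPTY not in self.entity_types and entity_type in self.entity_types


@dataclass(frozen=True)
class ValencyFrame:
    """Valency frame of one Action (hypothetical structure)."""
    lemma: str
    gloss: str
    subject: ActantSlot
    object1: ActantSlot
    object2: ActantSlot

    def accepts(self, subject=None, object1=None, object2=None) -> bool:
        # Assumption: only filled slots are checked; None means "not filled".
        filled = (
            (self.subject, subject),
            (self.object1, object1),
            (self.object2, object2),
        )
        return all(slot.allows(etype) for slot, etype in filled if etype is not None)


# The frame for "video" as described in the text.
VIDEO = ValencyFrame(
    lemma="video (alqm/alqd)",
    gloss="to see (sb/st)",
    subject=ActantSlot(
        ("nominative",),
        frozenset({"Person", "Group"}),
        "perceiver",
    ),
    object1=ActantSlot(
        ("accusative", "accusative with infinitive",
         "clause introduced by a relative pronoun"),
        frozenset({"Person", "Group", "Object", "Event", "Statement"}),
        "content of perception",
    ),
    object2=ActantSlot((), frozenset({EMPTY})),
)

# A Person seeing a Statement fits the frame; a Location as subject does not.
print(VIDEO.accepts(subject="Person", object1="Statement"))
print(VIDEO.accepts(subject="Location", object1="Person"))
```

Keeping morphosyntax, entity types and semantics together in one slot mirrors the pairing the frame is meant to express: a single lookup answers both “which case does this actant take?” and “what role does it play?”.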
DISSINET’s goal is not to match, let alone replace, any existing language resources such as Latin WordNet or Latin Valency Lexicon (VALLEX). Unlike DISSINET’s lexico-semantic network, which we extend as our data collection progresses, these resources are based on a balanced glossary and cover a broader domain than our project requires. Nevertheless, we believe that what is something of a by-product of our research (from one point of view) can already be of interest to linguists and NLP specialists in several ways:
- It contains some lemmas not covered by Latin WordNet and connects them to a broader lexico-semantic network.
- It provides superclasses for multi-word lexical units.
- It provides some valency frames not described by the Latin Valency Lexicon.
- It pairs up morphosyntactic valency with semantics (e.g., someone who sees something is a “perceiver”) and expected entity types (e.g., only a Person or a Group can see somebody or something).
We are interested in your feedback, as we are looking to align our growing network with existing resources and standards for the mutual benefit of communities in the digital humanities and natural language processing of ancient languages. Once this alignment is in place, we hope to release a public version in the near future.
Tab. 1. Database statistics as of 18 April 2023
| Knowledge graph component | Count |
| --- | ---: |
| Entities | 31,726 |
| Actions | 565 |
| English Actions | 39 |
| Latin Actions | 519 |
| Latin Actions with valency frames | 502 |
| Latin Actions with semantics of actant positions | 483 |
| Concepts | 4,033 |
| Concepts linked to English WordNet 3.1 meaning identifiers | 1,100 |
| English Concepts | 2,039 |
| Latin Concepts | 1,946 |
| Single-word Latin Concepts | 1,129 |
| Single-word Latin Concepts matching a Latin WordNet lemma | 939 |
| Multi-word Latin Concepts | 815 |
| Multi-word Latin Actions | 65 |
| Relations | 15,116 |
| Relations within Concepts and Actions | 6,503 |
| Superclass Relation | 3,425 |
| Synonym Relation | 394 |
| Action/Event Equivalent Relation (relating Actions with their nominal counterparts) | 410 |
| Actant Semantics Relations (subject, object 1 and object 2) | 1,410 |
Fig. 2. DISSINET database’s taxonomy, generated on 19 April 2023. This is a force-directed plot of the current graph, using the Kamada-Kawai layout. Green edges indicate superclass relations between Concepts, while blue edges indicate relations linking Actions to their nominal counterparts in Concepts.
Fig. 3. Distribution of taxonomy lengths for Concepts linked through superclass relations. This shows that a relatively high number of Concepts are still unlinked, but also that most of those that are linked form taxonomies of a reasonable length.
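To make the notion of taxonomy length concrete, here is a minimal Python sketch of how superclass chains could be walked and counted. It is an illustration under stated assumptions, not DISSINET's actual code: the function names are hypothetical, each Concept is assumed to have at most one superclass, and “length” is taken to mean the number of superclass steps from a Concept up to its topmost ancestor. The sample data contains only the relations given as examples earlier in the text.

```python
from collections import Counter

# Superclass relations (child -> parent), taken from the examples in the text.
SUPERCLASS = {
    "baptism": "sacrament",
    "kiss of peace": "greeting",
}

# Action/event equivalent relations (Action -> nominal counterpart Concept).
ACTION_EVENT_EQUIVALENT = {
    "accepit pacem": "kiss of peace",
}


def superclass_chain(concept, superclass):
    """Return the concept followed by its ancestors, nearest first."""
    chain = [concept]
    seen = {concept}
    while chain[-1] in superclass:
        parent = superclass[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the data
            raise ValueError(f"cycle in superclass relations at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


def is_a(concept, ancestor, superclass):
    """True if `ancestor` is a direct or indirect superclass of `concept`."""
    return ancestor in superclass_chain(concept, superclass)[1:]


def action_chain(action, equivalents, superclass):
    """Classify an Action through its nominal counterpart Concept."""
    return superclass_chain(equivalents[action], superclass)


def length_distribution(concepts, superclass):
    """Count how many concepts have a chain of 0, 1, 2, ... superclass steps."""
    return Counter(len(superclass_chain(c, superclass)) - 1 for c in concepts)


concepts = ["baptism", "sacrament", "kiss of peace", "greeting"]
print(is_a("baptism", "sacrament", SUPERCLASS))
print(action_chain("accepit pacem", ACTION_EVENT_EQUIVALENT, SUPERCLASS))
print(length_distribution(concepts, SUPERCLASS))
```

In this toy sample, a Concept with no superclass gets length 0, which corresponds to the unlinked Concepts mentioned in the caption of Fig. 3.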