Mapping Waldensian “synagogues”
DISSINET has published a map of Waldensian gathering places in Piedmont.
DISSINET is a research project focusing on knowledge discovery from historical texts – Latin inquisition records – through the synergy between the manual collection of structured data and natural language processing (NLP) approaches. As an important product of our analytical work, we are at the initial stages of compiling a specialised language resource for Latin, with the potential to enrich the description of the Latin language and to improve the quality of computational analysis, including NLP, performed on Latin sources.
David Zbíral | Gideon Kotzé | Robert L. J. Shaw
In the Dissident Networks Project (DISSINET), we are developing a comprehensive text annotation approach that we call Computer-Assisted Semantic Text Modelling (CASTEMO). CASTEMO is a procedure of knowledge-graph building, where textual data are captured at a high level of syntactic-semantic accuracy. It is capable of retaining narrators’ perspectives, original expressions in the original languages, and the hierarchy of texts and their different parts. These details can be connected to project-specific research-oriented ontologies, allowing for efficient querying and analysis of the data.
The semantic core of CASTEMO multi-language knowledge graphs is a lexico-semantic network, i.e. a network of lemmatized concepts with defined meanings (thus semantically disambiguated). Unlike entities within the knowledge graph representing specific instances (individuals) – such as persons, groups, locations, animals, texts etc. – these concepts represent types, i.e. they are generic entities. In CASTEMO, we divide these into Actions, which represent verbs (both single-word and multi-word, i.e. composite), and Concepts, which encompass other parts of speech (such as single-word as well as multi-word nouns, adjectives, pronouns, adverbs and adverbial expressions, etc.).
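The distinction between Actions and Concepts as generic, type-level entities can be pictured with a small data structure. This is only a minimal sketch: the class names, field names, identifiers and the relation label below are hypothetical and do not reflect the actual CASTEMO or InkVisitor data model; the Latin lemmas are illustrative examples.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityClass(Enum):
    ACTION = "Action"    # verbs, single-word or multi-word
    CONCEPT = "Concept"  # other parts of speech


@dataclass(frozen=True)
class LexicalType:
    """A generic (type-level) entity: one lemma with one disambiguated meaning."""
    lemma: str
    language: str
    entity_class: EntityClass
    definition: str = ""

    @property
    def is_multi_word(self) -> bool:
        # A lemma containing whitespace is treated as a multi-word unit.
        return len(self.lemma.split()) > 1


@dataclass
class LexicoSemanticNetwork:
    """Hypothetical container: types keyed by id, plus labelled relations."""
    types: dict = field(default_factory=dict)      # id -> LexicalType
    relations: list = field(default_factory=list)  # (source_id, label, target_id)

    def add(self, type_id: str, lexical_type: LexicalType) -> None:
        self.types[type_id] = lexical_type

    def relate(self, source_id: str, label: str, target_id: str) -> None:
        # Refuse relations whose endpoints are not in the network yet.
        if source_id not in self.types or target_id not in self.types:
            raise KeyError("both ends of a relation must already be in the network")
        self.relations.append((source_id, label, target_id))


network = LexicoSemanticNetwork()
network.add("A1", LexicalType("confiteri", "Latin", EntityClass.ACTION, "to confess"))
network.add("C1", LexicalType("confessio", "Latin", EntityClass.CONCEPT, "confession"))
network.add("C2", LexicalType("panis benedictus", "Latin", EntityClass.CONCEPT, "blessed bread"))
# Link an Action to its nominal counterpart (the label is an assumed name).
network.relate("A1", "action_event_equivalent", "C1")
```

Keeping each lemma-meaning pair as its own record is one way to express semantic disambiguation: two senses of the same lemma would simply be two entries with different identifiers.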
In effect, our collection and analysis of data from medieval Latin sources has led us to develop a lexico-semantic network that is not dissimilar in structure to resources such as WordNet and Latin WordNet. Of course, DISSINET is not an infrastructural or linguistic project, and its lexico-semantic network is not based on a representative predefined glossary. Rather, it grows as our data collection progresses. Nevertheless, available resources proved to have various limitations for our purposes. For instance, while studying Latin WordNet, a state-of-the-art lexico-semantic network for Latin, we found that it often lacked the lemmas that we needed, especially multi-word lexical units. We have thus embraced the opportunity, through our annotation of Latin sources, to build our own structured and machine-operable description of the Latin language.
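A coverage check of this kind, comparing project lemmas against an external lemma list and separating single-word from multi-word units, could look roughly like the sketch below. The function name and both lemma lists are hypothetical toy data; they say nothing about what Latin WordNet actually contains.

```python
def coverage_report(project_lemmas, reference_lemmas):
    """Group project lemmas by word count and by presence in a reference list."""
    reference = {lemma.lower() for lemma in reference_lemmas}
    report = {
        "single": {"found": [], "missing": []},
        "multi": {"found": [], "missing": []},
    }
    for lemma in project_lemmas:
        kind = "multi" if len(lemma.split()) > 1 else "single"
        status = "found" if lemma.lower() in reference else "missing"
        report[kind][status].append(lemma)
    return report


# Toy inputs, invented for illustration only.
toy_project = ["domus", "haereticus", "panis benedictus"]
toy_reference = ["domus", "panis"]

report = coverage_report(toy_project, toy_reference)
for kind, groups in report.items():
    print(kind, "found:", groups["found"], "missing:", groups["missing"])
```

Exact string matching is the simplest possible criterion; a real comparison would probably also need to normalise orthographic variants, which this sketch does not attempt.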
Let’s have a look at some numbers. As of 18 April 2023, our database contains 565 Actions, of which 519 are Latin, and 4,033 Concepts, of which 1,946 are Latin. A tiny network perhaps, but it is rich in connections. Among the 15,116 ontological and semantic relations of different kinds within our overall DISSINET knowledge graph, 6,503 occur within our Concepts and Actions network.
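To put these figures in proportion, the shares can be worked out directly from the counts quoted above. The snippet is only a small arithmetic aid; the variable names are ours, and the counts are those reported for 18 April 2023.

```python
# Counts as reported for 18 April 2023.
actions_total, actions_latin = 565, 519
concepts_total, concepts_latin = 4033, 1946
relations_total, relations_lexical = 15116, 6503


def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


latin_action_share = share(actions_latin, actions_total)          # 91.9
latin_concept_share = share(concepts_latin, concepts_total)       # 48.3
lexical_relation_share = share(relations_lexical, relations_total)  # 43.0

print(f"Latin Actions: {latin_action_share}% of all Actions")
print(f"Latin Concepts: {latin_concept_share}% of all Concepts")
print(f"Relations within Concepts and Actions: {lexical_relation_share}% of all relations")
```

In other words, roughly nine in ten Actions and just under half of the Concepts are Latin, and about 43% of all relations in the knowledge graph sit within the Concepts and Actions network.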
The DISSINET lexico-semantic network can serve as an additional language resource for Latin in the following ways:
Fig. 1. Screenshot of the valency section of an Action from InkVisitor, a data collection interface for building CASTEMO knowledge graphs.
DISSINET’s goal is not to match, let alone replace, existing language resources such as Latin WordNet or the Latin Valency Lexicon (VALLEX). Unlike DISSINET’s lexico-semantic network, which we extend as our data collection progresses, these resources are based on a balanced glossary and cover a broader domain than our project requires. Nevertheless, we believe that what is, from one point of view, something of a by-product of our research can already be of interest to linguists and NLP specialists in several ways:
We are interested in your feedback, as we are looking to align our growing network with existing resources and standards for the mutual benefit of the communities working in the digital humanities and in natural language processing of ancient languages. Once this alignment is achieved, we hope to release a public version in the near future.
Tab. 1. Database statistics as of 18 April 2023
Knowledge graph component | Count
Entities | 31,726
Actions | 565
English Actions | 39
Latin Actions | 519
Latin Actions with valency frames | 502
Latin Actions with semantics of actant positions | 483
Concepts | 4,033
Concepts linked to English WordNet 3.1 meaning identifiers | 1,100
English Concepts | 2,039
Latin Concepts | 1,946
Single-word Latin Concepts | 1,129
Single-word Latin Concepts matching a Latin WordNet lemma | 939
Multi-word Latin Concepts | 815
Multi-word Latin Actions | 65
Relations | 15,116
Relations within Concepts and Actions | 6,503
Superclass Relation | 3,425
Synonym Relation | 394
Action/Event Equivalent Relation (relating Actions with their nominal counterparts) | 410
Actant Semantics Relations (subject, object 1 and object 2) | 1,410
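Several rows of Tab. 1 are easier to read as coverage rates. The sketch below derives them from the tabulated counts; the dictionary keys are shorthand labels of our own, not database field names.

```python
# Selected counts from Tab. 1 (18 April 2023).
tab1 = {
    "latin_actions": 519,
    "latin_actions_with_valency": 502,
    "latin_actions_with_actant_semantics": 483,
    "concepts": 4033,
    "concepts_linked_to_english_wordnet": 1100,
    "single_word_latin_concepts": 1129,
    "single_word_latin_concepts_in_latin_wordnet": 939,
}


def rate(part_key, whole_key):
    """Percentage of the `whole_key` count covered by `part_key`, one decimal."""
    return round(100 * tab1[part_key] / tab1[whole_key], 1)


coverage = {
    "valency frames": rate("latin_actions_with_valency", "latin_actions"),
    "actant semantics": rate("latin_actions_with_actant_semantics", "latin_actions"),
    "English WordNet links": rate("concepts_linked_to_english_wordnet", "concepts"),
    "Latin WordNet lemma match": rate(
        "single_word_latin_concepts_in_latin_wordnet", "single_word_latin_concepts"
    ),
}

for label, percent in coverage.items():
    print(f"{label}: {percent}%")
```

On these counts, about 96.7% of Latin Actions have valency frames, 93.1% have semantics of actant positions, 27.3% of all Concepts are linked to English WordNet 3.1, and 83.2% of single-word Latin Concepts match a Latin WordNet lemma.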
Fig. 2. Taxonomy of the DISSINET database, generated on 19 April 2023. This is a force-directed plot of the current graph, using the Kamada-Kawai layout. Green edges indicate superclass relations between Concepts, while blue edges indicate relations linking Actions to their nominal counterparts in Concepts.
Fig. 3. Distribution of taxonomy lengths for Concepts linked through superclass relations. This shows that a relatively high number of Concepts are still unlinked, but also that most of those that are linked form taxonomies of a reasonable length.
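A distribution like the one in Fig. 3 can be computed by walking each Concept's superclass chain and counting the steps. The sketch below is an assumption about how such lengths might be measured, not a description of how the figure was produced: it assumes each Concept has at most one superclass, counts a Concept without a superclass as length 0, and uses invented English labels as toy data.

```python
from collections import Counter


def taxonomy_length(concept, superclass_of):
    """Number of superclass steps from `concept` up to the top of its chain."""
    length = 0
    seen = {concept}
    while concept in superclass_of:
        concept = superclass_of[concept]
        if concept in seen:
            # Guard against accidental cycles in the relation data.
            raise ValueError(f"cycle detected at {concept!r}")
        seen.add(concept)
        length += 1
    return length


# Toy data: child -> superclass (invented for illustration).
superclass_of = {
    "dog": "animal",
    "animal": "living being",
    "house": "building",
}
concepts = ["dog", "animal", "living being", "house", "building", "bread"]

lengths = {c: taxonomy_length(c, superclass_of) for c in concepts}
distribution = Counter(lengths.values())

for length in sorted(distribution):
    print(f"length {length}: {distribution[length]} concept(s)")
```

If Concepts can have several superclasses, the mapping would need to hold a list per Concept and the length would have to be defined as, for example, the longest path upwards.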