Digitising our medieval heritage requires better search tools
OPINION: The Middle Ages continue to captivate our imagination through their portrayal in films, TV series and games. The term medieval evokes vivid associations in our minds. But where does our knowledge of the Middle Ages truly originate?
When we read a book on medieval history, what are its sources?
The answer lies in art and archaeology, but – most importantly – we learn about the distant past from manuscripts, the written sources that have come down to us through the centuries.
At the moment, numerous libraries, museums and archives are digitising their collections, but since the number of surviving texts is in the tens of thousands, efficient tools are needed for finding all copies of a particular work.
Language modelling-based tools can assist us in discovering works across a wide range of written sources that are being made available digitally.
The wide range of manuscript sources
Medieval manuscripts are windows into the past, preserving knowledge, stories and beliefs from centuries ago. The range of written sources that have come down to us includes a huge number of texts of all sorts: historical chronicles, chivalric romances, mystery plays, travelogues and medical texts such as remedies for toothaches, cures for baldness and treatments for cataracts.
Culinary recipes describe dishes for grand banquets, such as a roasted pig with live birds sewn into its belly, and give instructions for the silver vessels in which they should be served. Medieval manuscripts may contain lists of apothecaries’ weights, astrological calculations and alchemical texts detailing the preparation of the philosopher’s stone or the mysterious fifth element.
We also have calendars, personal letters, court records, and lists of important people such as those who contributed to a specific monastery or noble families that arrived in Britain in 1066 with William the Conqueror.
Charters recording land ownership, all kinds of religious texts from sermons to confessions, religious lyrics and instructions to solitary nuns or parish priests are also among the surviving texts found in manuscripts.
Manuscripts in England and beyond
The number of extant texts increases towards the late Middle Ages. While we have only a few hundred texts written in English before the year 1000, tens of thousands survive from 1200 to 1500 in England alone. These texts are found in manuscripts in England, Scotland, Wales and Ireland. They are housed in university libraries like those in Cambridge and Oxford, local archives, cathedral libraries such as those in Lincoln or Canterbury and private collections.
However, texts related to medieval England survive not only in the British Isles. English manuscripts are also found in Scandinavian libraries. The Index of Middle English Prose (IMEP) lists dozens of texts copied in English which are now housed in libraries across the Nordic countries, ranging from the Royal Libraries in Copenhagen and Stockholm to the University Library in Uppsala. In Norway, late medieval English prose texts can be found in the collection assembled by Martin Schøyen.
Thanks to the Internet, medieval manuscripts have never been more accessible. An increasing number of manuscripts are being digitised. Still, it is important to acknowledge that far from everything has been digitised yet. Also, cataloguing practices can vary across different institutions and countries.
When digitising manuscripts, libraries often also digitise their own catalogues as a reference aid, but many of these may be decades old. Moreover, librarians may not be as well-versed in unfamiliar languages as in those used in their own countries. For instance, librarians in Scandinavia are more likely to be familiar with Old Norse than Middle English due to their local interests and regional focus.
Interestingly, copies of texts written in Old or Middle English are occasionally discovered in Italian libraries, which were located along pilgrim routes to Rome. Consequently, there is a growing need for tools and resources to locate specific texts within digitised manuscripts and across libraries.
The Index of Middle English Prose (IMEP)
To address this need, reference works such as The Index of Middle English Prose (IMEP) play a crucial role. The IMEP aims to identify all surviving copies of texts written in English and in prose between 1200 and 1500. The foundations for this project were laid over 40 years ago in 1978 at the Problems in Middle English Prose conference held in Cambridge.
The series is published by Boydell & Brewer and currently totals 24 printed volumes, which describe texts collection by collection, library by library. It is now also searchable through a website developed by Cambridge University Library.
Creating a resource like the IMEP relies on various elements coming together seamlessly. These include the bibliographical work carried out by the highly trained specialists who compile the volumes.
It will also be possible to add digital infrastructure that links texts indexed in the IMEP database to manuscript images and descriptions on library websites.
One major challenge faced by search engines is the considerable linguistic variation in medieval English texts. For instance, the word ‘through’ has been documented with over 500 different spellings in Middle English (thorow, thurh, dorw, dorwgh, thoro, thoroghe, thoroo, thorowght, thorv, thro, throue, thwrw, twrw, etc. etc.).
The editorial policy of IMEP is to record linguistic variants as they appear in the texts, rather than normalise them to a particular spelling. Scribes commonly changed the spelling, wording and word order when they copied texts, with the result that no two copies are identical. Consequently, search tools that attempt to match an exact spelling will fail to find all existing copies of a text.
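To make the problem concrete, here is a minimal Python sketch comparing an exact-spelling search with a simple character-similarity search over the spellings of 'through' quoted above. It is only an illustration: the function names and the 0.5 threshold are assumptions made for this example, and the similarity measure (Python's standard difflib) is a generic stand-in, not the method used by any IMEP search tool.

```python
from difflib import SequenceMatcher

# Spellings of 'through' quoted in the article: a small sample of the
# 500+ documented Middle English variants.
VARIANTS = ["thorow", "thurh", "dorw", "dorwgh", "thoro", "thoroghe",
            "thoroo", "thorowght", "thorv", "thro", "throue", "thwrw", "twrw"]


def exact_matches(query, words):
    """Return the words spelled exactly like the query."""
    return [w for w in words if w == query]


def fuzzy_matches(query, words, threshold=0.5):
    """Return the words whose similarity ratio to the query meets the threshold.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return [w for w in words
            if SequenceMatcher(None, query, w).ratio() >= threshold]


# An exact search for the modern spelling finds none of the variants.
print(exact_matches("through", VARIANTS))

# A similarity search recovers several of them, though not necessarily all.
print(fuzzy_matches("through", VARIANTS))
```

Even this simple similarity measure recovers close spellings such as thro and thorow, while more distant ones such as twrw fall below the illustrative threshold. That gap is why more flexible approaches are needed.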
A language modelling-based search tool developed in Oslo
Fortunately, in Oslo, we have developed a search engine that can retrieve texts despite all the variation. The tool is based on language modelling, which is closely related to what is today often referred to as artificial intelligence.
The development of the search tool is a collaborative effort between Cambridge University Library and the University of Oslo. Funding for the project has been provided by an EU Marie Skłodowska-Curie fellowship from 2021 to 2023.
The team behind the tool consists of Dr Jacob Thaisen at the Department of Literature, Area Studies and European Languages, Dr Anders Nøklestad at the Text Laboratory of the University of Oslo and Dr Alpo Honkapohja, the visiting Marie Curie fellow. The tool will be integrated into the IMEP website at Cambridge and made freely available Open Access.
(This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 101025997).
The ScienceNorway Researchers' zone consists of opinions, blogs and popular science pieces written by researchers and scientists from or based in Norway.