Crossing the Boundary of Time: Fine-Tuning Modern NLP Models for Specialized Historical Corpus Data

ICAME42 - TU Dortmund University, Germany 2021

Beatrix Busse, Ingo Kleiber, Sophie Du Bois, Nina Dumrukcic

Abstract

The application of deep learning (DL) within NLP has yielded promising results for a variety of tasks, and the field has seen a ‘neural turn’. While DL approaches have become the standard for contemporary English, historical data have received far less attention, since state-of-the-art models are trained almost exclusively on contemporary language data.

In the spirit of this conference’s theme, “crossing boundaries,” this paper serves as a case study of how adapting current DL language models to a (historical) corpus domain can improve next-word prediction and other downstream tasks for working with historical data.

To this end, the baseline performance of state-of-the-art language models such as BERT (see Devlin et al. 2018) and GPT-2 (see Radford et al. 2019) is compared to that of models fine-tuned on both our own corpus of 16th-century English grammars and external historical data such as EEBO TCP (Early English Books Online, Text Creation Partnership).
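As a rough illustration of this fine-tuning step, the following sketch adapts GPT-2 to historical text using Hugging Face transformers and PyTorch; the corpus file name, block size, and training hyperparameters are illustrative assumptions, not our actual project settings.

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Read the historical corpus (hypothetical file name) and split the
# token stream into fixed-length training blocks.
with open("grammar_corpus_16c.txt", encoding="utf-8") as f:
    text = f.read()
ids = tokenizer(text, return_tensors="pt").input_ids[0]
block_size = 512
blocks = [ids[i:i + block_size]
          for i in range(0, len(ids) - block_size, block_size)]

loader = DataLoader(blocks, batch_size=4, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # illustrative number of epochs
    for batch in loader:
        # For causal LM fine-tuning, the labels are the input ids
        # themselves; the model shifts them internally to compute
        # the next-word prediction loss.
        loss = model(input_ids=batch, labels=batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-grammars-16c")
tokenizer.save_pretrained("gpt2-grammars-16c")
```

Comparing the perplexity of the base and fine-tuned checkpoints on held-out historical text then yields the baseline comparison described above.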

The corpus introduced and utilized in this paper, which is part of the larger HeidelGram project (see, e.g., Busse et al. 2020), comprises what we classify as British grammars of English from the 16th century. For instance, William Bullokar's Brief Grammar for English (1586), which defines, among other things, parts of speech and grammatical cases, constitutes such a grammar.

By applying fine-tuned language models, we approach, from a computational perspective, several problems identified within the HeidelGram research project, such as mitigating the effects of poor scans and OCR errors, as well as various classification tasks (e.g., categorizing reference types).
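To make the classification setting concrete, the following sketch fine-tunes a BERT classifier on reference types; the label set, training examples, and hyperparameters are invented for illustration and do not reflect our actual annotation scheme.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# Hypothetical reference-type categories.
labels = ["quotation", "paraphrase", "mention"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# Invented training examples: (sentence, label index).
examples = [
    ("As Lily writeth, a noune is the name of a thing.", 0),
    ("Other grammarians divide speech into eight parts.", 2),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for sentence, label in examples:
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    loss = model(**enc, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: predict the reference type of a new passage.
model.eval()
with torch.no_grad():
    enc = tokenizer("He citeth Priscian on the cases.", return_tensors="pt")
    predicted = model(**enc).logits.argmax(dim=-1).item()
print(labels[predicted])
```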

In addition, we will propose and demonstrate a relational database approach powering both our corpora and our processing and analysis pipelines. While relying on text files and XML annotations (e.g., TEI-based) has been the standard in corpus construction, approaches using relational databases are gaining attention (e.g., Davies 2005) due to their advantages, such as speed, advanced queries, and unlimited annotation (ibid.).
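A minimal sketch of what such a relational layout might look like, loosely following Davies (2005), is given below in SQLite; the table and column names are illustrative assumptions, not the actual HeidelGram schema.

```python
import sqlite3

conn = sqlite3.connect("grammar_corpus_sketch.db")  # hypothetical database
conn.executescript("""
CREATE TABLE IF NOT EXISTS texts (
    text_id    INTEGER PRIMARY KEY,
    author     TEXT,
    title      TEXT,
    year       INTEGER
);
CREATE TABLE IF NOT EXISTS tokens (
    token_id   INTEGER PRIMARY KEY,
    text_id    INTEGER REFERENCES texts(text_id),
    position   INTEGER,   -- running token index within the text
    surface    TEXT,      -- token exactly as it appears in the source
    normalized TEXT,      -- normalized (modernized) spelling
    pos_tag    TEXT       -- part-of-speech annotation
);
""")

# Example query: every attestation of a normalized form, with metadata,
# ordered chronologically.
rows = conn.execute("""
    SELECT t.author, t.year, tok.position, tok.surface
    FROM tokens AS tok
    JOIN texts AS t USING (text_id)
    WHERE tok.normalized = 'noun'
    ORDER BY t.year
""").fetchall()
```

Storing one row per token makes annotation layers ordinary columns, so new annotations can be added without touching the source texts, and corpus queries become standard SQL joins.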

While the focus will be on our specialized grammar corpus, we believe that insights from our experiments, both in fine-tuning existing models and in using relational databases, will be of interest to anyone working with historical corpora who wants to apply current NLP methods in practice.

References

Bullokar, William. 1586. Brief Grammar for English. London: Edmund Bollifant.

Busse, Beatrix, Kirsten Gather, and Ingo Kleiber. 2020. “A Corpus-Based Analysis of Grammarians’ References in 19th-Century British Grammars.” In Variation in Time and Space: Observing the World Through Corpora, edited by Anna Cermakova and Markéta Malá. Diskursmuster - Discourse Patterns 20. Berlin: De Gruyter.

Davies, Mark. 2005. “The Advantage of Using Relational Databases for Large Corpora: Speed, Advanced Queries, and Unlimited Annotation.” International Journal of Corpus Linguistics 10 (3): 307–34.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Unpublished manuscript. https://arxiv.org/pdf/1810.04805.

Early English Books Online (EEBO) TCP. n.d. Retrieved November 24, 2020, from https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models are Unsupervised Multitask Learners.”

Presentation Material

Slides (.pdf)