The Case for Custom Software Development: HGSimpleCorpusNetwork – A Network Analysis Toolbox for (Historical) Corpora

The discipline of corpus linguistics has always been closely linked to technology, and some even claim that “[i]t was not the linguistic climate, but the technological one that stimulated the development of corpora” (Tognini-Bonelli 2010: 15).

Whenever we conduct corpus linguistic research, we rely heavily on our data (the corpora), but also on the software and tools that we use. As Anthony puts it, “[t]he functionality offered by software tools largely dictates what corpus linguistics research methods are available to a researcher” (Anthony 2013: 141). Furthermore, the choice of tools almost always has some impact on the results of the analysis, because every tool embodies decisions made by its developers.

This dependence on tools has led to the development of hundreds of applications, toolboxes, and frameworks, which arguably cover most imaginable use cases. Nevertheless, new tools continue to be developed even though de facto standard tools are available that are widely used, well documented, and well tested.

The primary reason for this continued development seems to be the need to process and analyze ever more complex and specialized data using a rapidly growing set of methodologies. In addition, custom software has the advantage of allowing researchers to “tailor the output of the analysis to fit [their] research needs” (Biber et al. 2006: 255) and to stay “in the driver's seat” (Gries 2009: 1236).

We make a careful case for custom software development within the field of corpus linguistics. To do so, we present HGSimpleCorpusNetwork, a custom toolbox for the analysis of historical, diachronic corpora using network analytical approaches. This toolbox, currently under development within the HeidelGram project (which investigates discourses of English grammar writing between 1550 and 1900), is tailored specifically towards the data, methodology, and research questions at hand. For example, the tool is especially useful when analyzing (potentially unreliable) data generated via unsupervised OCR. Following an agile approach, the software is developed alongside the research project and continuously adapted and improved in line with the current research aims.

Looking back at earlier case studies that used the toolbox, we present advantages, issues, and lessons learned regarding the development of custom, project-specific corpus analysis software. Based on these insights, we also provide some generalized guidelines and best practices for those who need to decide between using existing tools and developing project-specific software.

References

Anthony, L. 2013. A critical look at software tools in corpus linguistics. Linguistic Research 30 (2), 141–61.

Biber, D., Conrad, S., & Reppen, R. (1998) 2006. Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.

Gries, S. T. 2009. What is corpus linguistics? Language and Linguistics Compass 3 (5), 1225–41.

Tognini-Bonelli, E. 2010. Theoretical overview of the evolution of corpus linguistics. In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics, 14–27. London: Routledge.