HGSimpleCorpusNetwork

HGSimpleCorpusNetwork is a Python tool for batch frequency analysis of corrupted corpus data. Given a set of search terms and a set of text files, the script generates an adjacency matrix, a GEXF file, and a GraphML file linking the search terms to the texts.
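To illustrate the kind of output described above, here is a minimal sketch of a term-by-text adjacency matrix and a GraphML export. It is not the tool's actual code: the function names `build_adjacency` and `to_graphml`, the `term:`/`text:` node prefixes, and the use of exact, case-insensitive matching are all assumptions made for illustration, and only the standard library is used.

```python
import re
import xml.etree.ElementTree as ET


def build_adjacency(terms, texts):
    """Count how often each term occurs in each text.

    `texts` maps a text name to its content. Matching here is exact and
    case-insensitive (an assumption; the real tool also matches fuzzily).
    """
    matrix = {}
    for name, content in texts.items():
        tokens = re.findall(r"\w+", content.lower())
        matrix[name] = {term: tokens.count(term.lower()) for term in terms}
    return matrix


def to_graphml(matrix):
    """Serialise the term-text links as a minimal GraphML string."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    # Declare an integer edge attribute that carries the frequency.
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")

    terms = sorted({term for row in matrix.values() for term in row})
    # Hypothetical prefixes keep term ids and text ids from colliding.
    for text_name in matrix:
        ET.SubElement(graph, "node", id="text:" + text_name)
    for term in terms:
        ET.SubElement(graph, "node", id="term:" + term)

    # One edge per (term, text) pair with a non-zero count.
    for text_name, row in matrix.items():
        for term, count in row.items():
            if count > 0:
                edge = ET.SubElement(graph, "edge",
                                     source="term:" + term,
                                     target="text:" + text_name)
                ET.SubElement(edge, "data", key="weight").text = str(count)
    return ET.tostring(root, encoding="unicode")
```

Calling `to_graphml(build_adjacency(terms, texts))` returns a string that can be written to a `.graphml` file and opened in a graph tool such as Gephi. A GEXF export would follow the same pattern with different element names.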

To account for corrupted data (e.g. OCR errors), the search algorithm supports Levenshtein distance and gestalt pattern matching, so that it also recognises similar (i.e. distorted) tokens.
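The two similarity measures mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: `levenshtein` and `is_similar` are hypothetical names, the thresholds `max_distance` and `min_ratio` are arbitrary defaults, and gestalt pattern matching is approximated with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def is_similar(token, term, max_distance=1, min_ratio=0.8):
    """Treat a token as a hit if either measure accepts it.

    Both thresholds are assumed values chosen for illustration.
    """
    token, term = token.lower(), term.lower()
    if levenshtein(token, term) <= max_distance:
        return True
    return SequenceMatcher(None, token, term).ratio() >= min_ratio
```

With these defaults, an OCR misreading such as `c0rpus` is accepted as a match for `corpus` because the two strings are one substitution apart. Looser thresholds catch more distorted tokens at the cost of more false positives.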

HGSimpleCorpusNetwork is freely available under the MIT License on GitHub.