Digital tools for textual analysis

Revision as of 13:20, 18 September 2014 by MeaghanBrown

This is a list of digital humanities tools and readings dealing with textual analysis, most of which were initially compiled by Brett Hirsch, Heather Froehlich, and other participants of the Folger Institute's Early Modern Digital Agendas (2013) institute for advanced topics in digital humanities. For more resources, please see the Glossary of digital humanities terms. Additional links and resources are welcome.

Tools

For additional text analysis tools, see the Bamboo DiRT list.

AntConc (Desktop, cross-platform: Mac/Win/Unix)

  • AntConc is concordance software for digital text analysis with built-in statistical analysis metrics. It helps identify keywords, collocates, and clusters. Many resources are available on the software page.
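To make the idea of a concordance concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not AntConc's implementation. The function name `kwic`, the `width` parameter, and the simple regular-expression tokenizer are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines with `width` words on each side of a hit.

    Hypothetical helper for illustration; real concordancers offer far more
    (sorting, regex search, collocate statistics).
    """
    # Naive tokenizer: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample = "To be, or not to be, that is the question"
for line in kwic(sample, "be", width=2):
    print(line)
```

Each output line shows one occurrence of the search word in brackets with its neighbouring words, which is the basic view a concordancer builds on.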

DocuScope (Desktop; Java, cross-platform)

  • DocuScope is a text analysis environment with a suite of interactive visualization tools for corpus-based rhetorical analysis.

Gephi (Desktop, cross-platform)

  • Gephi is open-source network analysis software for data visualization and manipulation.

Intelligent Archive (Desktop; Java, cross-platform)

  • Intelligent Archive is an interface to an archive of texts, and incorporates a range of counting functionalities to support statistical analysis and computational stylistics.

Juxta (Desktop; Mac/Win/Unix)

  • Juxta is an open-source cross-platform tool for comparing and collating multiple witnesses to a single textual work. The software allows users to set any of the witnesses as the base text, to add or remove witness texts, to switch the base text at will, and to annotate Juxta-revealed comparisons and save the results.
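Comparing a witness against a base text can be sketched with Python's standard `difflib` module. This is only an illustration of word-level collation, not how Juxta itself works. The function name `collate` and the sample readings are assumptions.

```python
import difflib

def collate(base, witness):
    """List word-level differences between a base text and one witness.

    Hypothetical helper: returns (operation, base_reading, witness_reading)
    tuples for every span where the two texts disagree.
    """
    a, b = base.split(), witness.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants

base = "O that this too too solid flesh would melt"
witness = "O that this too too sallied flesh would melt"
print(collate(base, witness))
```

Switching the base text amounts to swapping the two arguments. Handling more than two witnesses would mean running the comparison once per witness against the chosen base.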

MALLET (Desktop; Java, cross-platform)

  • MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
  • Graphical User Interface version of MALLET (Desktop; Java, cross-platform)
    • This is a graphical user interface (GUI) for MALLET's Latent Dirichlet Allocation implementation.

MorphAdorner (Desktop; Java, cross-platform)

  • MorphAdorner is an XML lemmatizer, text segmenter and natural language processing parser for Early Modern text (especially EEBO-TCP texts).

Python (Desktop; Mac/Win/Unix)

  • Python is a free programming language with a clean, flexible, and legible syntax. Python packages like the Natural Language Toolkit, BeautifulSoup, and Whoosh are particularly useful resources for digital text analysis.
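As a small taste of text analysis in Python, the sketch below counts word frequencies using only the standard library. It deliberately avoids the packages named above, which provide much richer tokenizing and parsing. The function name `word_frequencies` and the crude tokenizer are assumptions for illustration.

```python
import re
from collections import Counter

def word_frequencies(text, top=3):
    """Return the `top` most frequent words in `text` as (word, count) pairs."""
    # Lowercase first so that "The" and "the" are counted together.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top)

print(word_frequencies("the cat and the hat and the bat", top=2))
```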

R (Desktop; Mac/Win/Unix)

  • R is a free software environment for statistical computing and graphics.

WordHoard (Desktop; Java, cross-platform)

  • WordHoard is an application for the close reading and scholarly analysis of deeply tagged texts.

Wordsmith (Desktop, Mac/Win/Unix)

  • Wordsmith is concordance software for digital text analysis with built-in statistical analysis metrics, including keywords and collocation. Many resources are available on the software page.

Versioning Machine (Web)

  • Versioning Machine is a framework and an interface for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI) Guidelines.

Voyant Tools (Web)

  • Voyant Tools is a web-based reading and analysis environment for digital texts.

Readings in textual analysis

General Resources: A bibliography

Atwell, Eric. “Corpora-List Question: Citing Linguistic Corpora.” [Corpora-List]. 7 March 2013. Web. 8 August 2013. See also the threaded response from Angela Chambers.

Biber, Douglas. “Representativeness in Corpus Design.” (PDF) Literary and Linguistic Computing 8, no. 4 (1993): 243-57. Web. 22 August 2013.

Burton, Matt. “The Joy of Topic Modeling.” Mcburton.net. 21 May 2013. Web. 9 August 2013.

Davies, Mark. “A corpus-based study of lexical developments in Early and Late Modern English.” In Handbook of English Historical Linguistics, edited by Merja Kytö and Päivi Pahta. Forthcoming from Cambridge University Press.

—. “Expanding Horizons in Historical Linguistics with the 400 million word Corpus of Historical American English.” (PDF) Corpora 7, no. 2 (2012): 121-57. Accessed July 10, 2013.

Duhaime, Douglas. "Co-citation Networks in the EEBO-TCP Corpus." July 26, 2014. http://douglasduhaime.com/blog/co-citation-networks-in-the-eebo-tcp-corpus

—. "Identifying Poetry In Unstructured Corpora." May 5, 2014.

—. "NGram Frequencies and 18th Commonplaces." March 13, 2014.

Froehlich, Heather. "An introductory bibliography to corpus linguistics." Web. 11 May 2014.

Graham, Shawn, Ian Milligan and Scott Weingart. The Historian's Macroscope.

Jockers, Matthew. Macroanalysis: Digital Methods and Literary History. Urbana-Champaign: University of Illinois Press, 2013.

Meeks, Elijah and Scott Weingart. "The Digital Humanities Contribution to Topic Modeling." Journal of Digital Humanities 2, no. 1 (Winter 2012).

The Programming Historian is a suite of resources and tutorials for doing digitally-inflected work. Peer-reviewed and open-source.