EMDA2015 Curriculum

The following outlines the curriculum for the "Early Modern Digital Agendas: Advanced Topics" institute, convened from 15 June through 1 July 2015. The application deadline was 2 March 2015. Please contact institute@folger.edu with any questions.

General Overview

The institute will convene in the Folger Board Room, whose presentation technology and wireless access have recently been upgraded. All participants will be required to attend all sessions. The institute will meet from 9:30 a.m. to 4:30 p.m., Mondays through Fridays. This schedule provides for a two-hour morning session and a three-hour afternoon session, with a generous break for lunch. The afternoon session also has a built-in break; participants and visiting faculty will join Folger staff and readers at the Library’s daily tea from 3:00 to 3:30 on weekdays. Weekly evening social events will allow conversations to continue and community to build outside the sessions.

The schedule will allow for consultation among participants, faculty, and Folger staff. Discussions of assigned and pre-circulated readings will be led by the director and the visiting faculty, and links to digital exemplars and tools will be made available before presentations.

Week 1: 15-19 June 2015

Data: Creation, Management, and Curation

After orientations and community-formation steps, this week considers issues relating to the creation, management, and curation of data in early modern DH. That work begins by recognizing that most early modern digital projects have been—and will continue to be—built upon the corpus provided by the Text Creation Partnership. That project has transcribed the digital facsimiles found in Early English Books Online, which are themselves converted from mid-twentieth-century microfilms. The digital resources available to early modern scholars are more extensive than in many fields. But they are a product of the history of their creation, and participants will also investigate options beyond EEBO-TCP. Regardless of how text is produced, it must be managed and curated, and participants will discuss best practices in the field of data curation.

Monday morning will begin with an orientation necessary for work in a restricted-access, non-circulating, rare book library: reader registration will be followed by an introduction to the rules and regulations of the Reading Room in the course of a tour of the Library. Owen Williams will organize introductions to the Folger’s online catalogue, Hamnet, and growing digital repositories like LUNA. Underway since the mid-1990s, LUNA provides high-resolution imaging of collection material, made freely available online, with source master digital images and associated metadata. Participants will confer with the institute’s two Technical Assistants to configure wireless protocols and the like. Following these orientations, Professor Jonathan Hope and the participants will convene for a welcome lunch.

The first afternoon session will be crucial for community-building and accomplishing the agenda of the rest of the institute. Priorities include: (1) establishing a level of critical discussion which theorizes and contextualizes DH within the broad field of the humanities; and (2) establishing sub-groups within the institute which allow for the development of good inter-personal relations, the sharing of knowledge, and the creation of a supportive context in which participants’ research plans can be refined. The fifteen participants will meet in three sub-groups of five people each. In each sub-group, participants will introduce themselves and describe their work, research interests, and experience in early modern studies and DH. The institute will then reconvene as a whole, and each person will introduce another member of their sub-group. The aim of these introductions is to establish a research problem for each participant that relates to DH and for which the participant will develop a solution, a visualization, a guided approach, or a list of resources over the coming weeks. Professor Hope and Dr. Williams will also outline plans for the institute’s digital presence. They will point out EMDA2013’s success with live tweeting of presentations and discussions (with over 3,400 tweets); private wiki-sites for each sub-group to record ongoing work and allow sharing between participants; and a public blog to present the participants’ work and interim discoveries.

Professor Hope will draw upon the participants’ collective introductions to scope out the group’s sense of the issues, both theoretical and practical, that are of current concern in early modern digital humanities. This discussion will provide an overview of the meta-critical questions in which the institute is interested. Professor Hope will lead discussion of the first set of assigned texts, which are concerned not only with the advanced analysis of data but also with the particulars of how its creation affects the product being analyzed and the producer.

In a scholarly discussion on Tuesday morning called “Historicizing Data,” Professor Lisa Gitelman (Professor of English and of Media, Culture, and Communication, New York University) will address what data is and how it overlaps with and differs from information, and will discuss the importance of creating, historicizing, and curating it for scholarly applications and analysis. She will draw participants’ attention to points in history (including early modernity) when the explosion of data and its technological manipulation framed new kinds of inquiry.

On Tuesday afternoon, Dr. Paul Schaffner (Head of Electronic Text Production at the University of Michigan’s Digital Production Library Service and the TCP Production Manager) and Rebecca Welzenbach (TCP Project Outreach Librarian) will discuss the Text Creation Partnership (TCP) component of EEBO, by which a growing proportion of EEBO’s books are available in full-text form. For these advanced TCP users, they will provide an insiders’ look, describing not just how many books and bytes EEBO-TCP provides, but where the inconsistencies lurk, where the data may exhibit sufficient bias to affect analysis, what kinds of variation are present, and how significant they are. They will outline the creation process briefly before focusing on the effects of process and process-related constraints and what kinds of uses those effects would facilitate or impede. Dr. Schaffner and Ms. Welzenbach will also sketch out the likely future of EEBO-TCP beyond the eventual Phase II release, especially concerning ways the project plans to maintain and preserve the data’s reliability and permanence while anticipating multiple (and dynamic) uses.

While EEBO-TCP is the primary source for most textually based early modern DH work, Optical Character Recognition (OCR) is potentially another way to create texts on a large scale from EEBO. Professor Laura Mandell (Director of the Initiative for Digital Humanities, Media and Culture (IDHMC), Texas A&M University) and Matthew Christy (Lead Programmer at IDHMC, Texas A&M University) will introduce the OCR process of data creation over two days. OCR software mechanically types texts from page images: large collections of data that are only page images can thus be typed and then effectively searched by word, mined, visualized, etc. The Mellon-funded Early Modern OCR Project (eMOP) at Texas A&M, for instance, has mechanically typed the 45 million page images comprising Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO).

On Wednesday morning, EMDA participants will pre-process their own page images of early modern texts for OCR, using images acquired on their own or via the LUNA digital repository at the Folger. We will use ImageMagick and Fred’s Scripts to binarize, deskew, and denoise these images.
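To make the binarization step concrete, the following is a minimal Python sketch of Otsu's method, one common way of choosing a black/white threshold for a greyscale page image. It is an illustration only and is not a description of what ImageMagick or Fred’s Scripts do internally. The function names `otsu_threshold` and `binarize` and the tiny sample image are invented for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark and light pixels.

    Otsu's method picks the threshold that maximizes the variance
    between the two resulting classes (ink vs. paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a greyscale image (rows of 0-255 ints) to rows of 1 (ink) / 0 (paper)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Invented 2x3 "page": low values are dark ink, high values are light paper.
sample = [[20, 30, 220],
          [25, 210, 230]]
binary = binarize(sample)
```

Real page images would first be loaded with an imaging library. Deskewing and denoising are separate steps that this sketch does not attempt.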

On Wednesday afternoon, participants will be shown how to use Tesseract, an open-source OCR engine, with training libraries that Christy has made for running Tesseract on early modern documents; they will then have the opportunity to OCR their own page images. Mandell will first give participants a behind-the-scenes look at OCR in ECCO and JSTOR, discussing how the differences between them are related to early modern printing practices and technological developments. She will then demonstrate how to install multiple sets of font training for use by Tesseract: we will try running the participants’ page images through Tesseract using several different kinds of fonts.

On Thursday morning, Mandell will demonstrate how to train Tesseract by using Aletheia and an IDHMC tool called Franken+, which generates an ideal early modern document in a specific font. Afterwards, in a hands-on session, participants will be introduced to two tools for correcting the OCR in EEBO and ECCO documents: Cobre and TypeWright. On Thursday afternoon, Mandell will demonstrate how to create a digital edition and/or a digital corpus by using TypeWright and by working with uncorrected OCR outputs. By the end of the two days of presentation and hands-on work, the goal is to give participants some idea of the range of possibilities available through OCR and related tools.

The first week concludes with discussion focusing on the management and curation of data, whether it is produced through TCP, OCR, or transcription and TEI-encoding. On Thursday afternoon, Dr. Erika Farr (Head of Digital Archives, Emory University) and Trevor Muñoz (Associate Director, Maryland Institute for Technology in the Humanities (MITH)) will discuss the need for good file management practices, including data back-up, consistent file naming, and adherence to standards (e.g., metadata and file formats). The importance of maintaining preservation-quality storage either locally or through cloud-based services will be addressed. Farr and Muñoz will help participants synthesize the many elements of preservation and curation into practical and actionable data management plans for their own projects and develop strategies for sharing data that are relevant and valuable to their communities of practice.

The Institute wraps up the week with a turn to Folger projects and the role of a library in shaping DH. One of the main limitations of EEBO-TCP and eMOP’s OCR processing is that they are restricted to print sources. On Friday morning, Dr. Heather Wolfe (Folger Curator of Manuscripts and Project Director of the Early Modern Manuscripts Online (EMMO) project), Mike Poston (Folger Data Architect and principal developer of the EMMO database), and Dr. Paul Dingman (EMMO Project Manager) will introduce the Folger’s own ambitious project to encode a non-print corpus. They will demonstrate current developments in online manuscript transcription and tagging, discuss the challenges of building a manuscript corpus, and encourage the participants to think about the pros and cons of various output models.

In the afternoon, Eric Johnson (the Folger Shakespeare Library’s first Director of Digital Access), Dr. Erin Blake (Head of Collection Information Services), and Dr. Meg Brown (Folger/CLIR Fellow for Data Curation) will offer a roundtable presentation on how the Folger curates data and knowledge generated by its staff and others. They will discuss the Folger’s open-access wikis, bibliographic data in MARC format (including the upcoming new Hamnet interface), curated digital editions, and other initiatives. Dr. Williams will chair the session.

With reference to their own projects discussed at the beginning of the week, participants will share their experiences using EEBO-TCP as a research and teaching tool, OCR as an alternative for early modern text creation, and the TEI-encoding projects they might want to undertake. Readings for the second week will be distributed, assignments set, and the Technical Assistants will support the installation of requisite software as needed.

Week 2: 22-26 June 2015

Data Analysis: Statistical, Linguistic, Visual, Network

For many scholars of early modern English, DH is equated with the analysis of data. Following from the hands-on demonstrations of advanced data creation, management, and curation in week one, the second week will feature a series of experts who will discuss the principles of analysis before shifting to advanced techniques in the most challenging areas of DH. Object lessons will be taken from major projects currently underway that expand the set of data available to scholars and the tools through which they are created and accessed. The participants will consider the theories and applications of statistical analysis that underlie so much of the analytical work done in the field before turning to advanced corpus analysis. Presentations on visualization and its design will likewise return participants to first principles. How does visualization impede or advance access to patterns now intelligible with DH techniques? How can visualizations support what John Tukey called “exploratory analysis” of data rather than statistical description? Quantitative network analysis (QNA) is a burgeoning field in early modern studies that will be presented by two of its leading practitioners. Week 2 will be rounded out by a case study that brings together many of the topics addressed during the week, including Principal Components Analysis (PCA), topic modeling, and visualization.

Professor Hope will begin with discussion on the nature of the transformations scholars perform on texts when subjecting them to statistical analysis. To what extent is this analogous to “traditional” literary criticism, inasmuch as it involves comparison and assessment of similarity and difference? To what extent does it depart from the “traditional,” changing the object of study and the mode of argument? Building on the “Historicizing Data” roundtable in Week 1, can scholars better understand the fundamentals of statistical analysis by thinking about the historical development of libraries and their catalogues? Libraries organized by subject “project” their books into three-dimensional space, so that books with similar content are found next to each other. Many statistical procedures function similarly, projecting books into hyper-dimensional spaces, and then using distance metrics to identify similarity and difference within the complex mathematical spaces the analysis creates. Once DH scholars understand the geometry of statistical comparison, they can grasp the potential literary significance of the associations identified by counting, and can begin to understand the difference between statistical significance and literary significance, and realize that it is the job of the literary scholar, not the statistician, to decide on the latter.
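The idea of projecting texts into a space and measuring the distance between them can be sketched in a few lines of Python. This is a simplified illustration and not the institute's own procedure: each text becomes a vector of relative word frequencies (one dimension per word), and cosine distance is used as one possible distance metric. The three short "texts" and all function names are invented for the example.

```python
import math
from collections import Counter


def frequency_vector(text, vocabulary):
    """Represent a text as relative word frequencies, one dimension per word."""
    words = text.lower().split()
    if not words:
        return [0.0] * len(vocabulary)
    counts = Counter(words)
    return [counts[w] / len(words) for w in vocabulary]


def cosine_distance(u, v):
    """Return 0.0 for vectors pointing the same way, 1.0 for vectors with no overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Invented toy texts; a real study would use whole plays or books.
texts = {
    "A": "the king and the queen",
    "B": "the queen and the king",
    "C": "a ship at sea",
}

# The shared vocabulary defines the dimensions of the space.
vocabulary = sorted({w for t in texts.values() for w in t.lower().split()})
vectors = {name: frequency_vector(t, vocabulary) for name, t in texts.items()}

d_ab = cosine_distance(vectors["A"], vectors["B"])  # same words: distance near 0
d_ac = cosine_distance(vectors["A"], vectors["C"])  # no shared words: distance 1
```

The numbers say only that A and B use the same words in the same proportions. Whether that similarity matters for a literary argument remains a question for the scholar.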

In the afternoon, Professor Alan B. Farmer (Associate Professor of English, The Ohio State University) and Dr. Goran Proot (Conservateur, Bibliothèque Mazarine) will present work derived from physical examinations of early modern books and from curated metadata of what are called “short title catalogues.” Professor Farmer will discuss using the print Short-Title Catalogue of English titles and the online English Short Title Catalogue to examine the ephemerality of different kinds of publications in the book trade of early modern England. He will consider the relative impact of format, leaf counts, edition-sheets, genre, and binding on the likelihood of entire editions becoming lost, as well as the topic of how lost editions might change our sense of the larger English book trade. He will also address certain methodological issues involved in using both online catalogues and printed reference works in order to conduct this kind of research. Dr. Proot will work with the Short Title Catalogue Flanders (STCV) and the Universal Short Title Catalogue to elucidate and uncover the data that can be recovered about the material object through statistical analysis of format, typography, and title-page layout. In comparing this data with the book production of other regions on the European Continent, Dr. Proot will also raise some quantitative questions concerning how representative the existing corpus of early English titles is and what types of books were most likely to be lost. Dr. Proot will discuss the importance of Sammelbände (volumes consisting of more than one edition or title) for survival rates, the importance of leaf counts, and the impact of the English (bibliophile) book trade from the late eighteenth century to the present. Both Professor Farmer and Dr. Proot examine printing cycles and trends, the economies of the early modern book trade, and the statistical analysis of material objects through physical analysis and metadata. Together, their presentations will illuminate the vital frontiers between DH and the book history field.

On Tuesday, Professors Tony McEnery (Professor of Linguistics and English Language and Faculty Dean, Lancaster University), David Hoover (Professor of English, New York University), and Jan Rybicki (Assistant Professor at the Institute of English Studies, Jagiellonian University, Kraków, Poland) will introduce recent advances in corpus linguistics and stylometrics and their applications for literary analysis. Professor McEnery will survey the history of corpus linguistics before reviewing several areas in detail. He will guide discussion on the key techniques used in corpus linguistics, principally collocation and keyword analysis. He will look at the use of corpora to explore language change through time, a key topic for the study of the literature of the early modern period. He will explore a recent development in corpus linguistics that is of particular interest to scholars of literature, namely the use of GIS techniques to visually comprehend literature and related materials, e.g., authors’ letters. Professors Hoover and Rybicki will then join the discussion. Professor Hoover will provide an overview of stylometric research as it applies to English literature by drawing on his deep experience in the field. Following this, the session will shift to a hands-on approach to literary analysis. Professor Rybicki will provide the participants with suitable electronic literary texts (if they do not have their own), start them on a stylometric analysis of the texts using the stylo R package (or its spring 2015 descendant), and further process the results with network analysis software (e.g., Gephi). The goal is for each participant to have a network graph to share by the end of the afternoon. Time permitting, participants will discuss their own experiences with corpus linguistics analysis and the tools they have employed. They will also propose the best visualization work they have seen and discuss affordances and shortcomings in preparation for upcoming presentations.
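As a rough illustration of keyword analysis, the Python sketch below ranks words that are unusually frequent in a study corpus relative to a reference corpus, using the log-likelihood statistic that is widely used for "keyness" in corpus linguistics. It is an assumption-laden toy and is not tied to any particular tool presented in the session. The function names and the two tiny token lists are invented.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a, b: frequency of the word in the study and reference corpus
    c, d: total number of tokens in the study and reference corpus
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_study)
    if b > 0:
        ll += b * math.log(b / expected_ref)
    return 2 * ll


def keywords(study_tokens, reference_tokens, top_n=5):
    """Return the words most over-represented in the study corpus (both non-empty)."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        # Keep only words relatively more frequent in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy corpora; real ones would contain thousands of tokens.
study = "love love love the the sea".split()
reference = "the the the sea sea war".split()
top = keywords(study, reference)
```

With corpora this small the scores mean little. The point is only the shape of the calculation: observed counts are compared with the counts expected if the word were spread evenly across both corpora.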

The institute moves on to the fundamental principles of data visualization and its related processes from a practical design perspective. Visiting faculty will not only provide the experience of critiquing existing visualizations and tools, but will also model design challenge exercises to help participants practice applying the concepts. Professor Mike Gleicher (Professor of Computer Sciences, University of Wisconsin) and Stephan Thiel (Studio NAND, Berlin) will lead a day of sessions on visualization techniques, tools, and workflows. Professor Gleicher will explore the foundations of data visualization: how we turn data into pictures to help in understanding or communicating it. He will review principles of human perception, statistics, and design to develop a basis for “Data Science” before discussing the unique challenges of applying these tools to humanities scholarship. Participants will develop their skills at analyzing visualizations, allowing them to practice critiquing visualizations to understand how specific designs can help with data interpretation and communication. Professor Gleicher will not advocate any particular visualization tools or approaches, but rather provide participants with a foundation in visualization and analysis that can help them understand the potential for visualization in their work, assess tools and techniques, create and adapt visual designs to fit their needs, and better communicate with visualization developers and designers.

Mr. Thiel, a visualization designer, will describe the processes related to visualization and analysis from a practical design perspective. He will explain the high-level design process behind emoto, the award-winning data visualization artwork for the 2012 London Olympic Games, to reflect on the project’s design decisions. He will guide participants through the entire process of visualization design. Participants will get the chance to explore the hands-on process of visualization using a mix of existing (e.g., Tableau or Lyra) and custom software tools that do not require programming knowledge. Participants will be invited to use either their own data or a prepared set of thirty-seven German translations of Act 1, Scene 3 from Shakespeare’s Othello from the TransVis project on which Mr. Thiel is collaborating.

Although much theoretical work has been done elucidating networks, from Jean-François Lyotard’s evocative description of the postmodern self as a “nodal point” to Tiziana Terranova’s analysis of global network culture in “Free Labor,” surprisingly little work has addressed the question: why networks? What is the conceptual power of networks? Dr. Ruth Ahnert (Lecturer in Renaissance Studies, Queen Mary University of London) and Dr. Sebastian Ahnert (Royal Society University Research Fellow, Cambridge) bring expertise on early modern literature and network science to the study of large early modern letter collections. Network analysis is a highly interdisciplinary field that has grown rapidly over the past fifteen years as a result of the ubiquity of network data in everyday life. Drs. Ahnert will introduce participants to the basic ways in which network connectivity can be quantified. Their recent publication on Protestant letter networks (English Literary History 82.1) has shown how various network measurements can highlight the different roles that individuals play in a correspondence network, including those who inhabit crucial infrastructural roles without necessarily writing many letters. This application will serve as an example of the kinds of historical and literary questions that network analysis can help us to answer. Drs. Ahnert will then provide a practical, step-by-step guide to turning historical records into data suitable for computational analysis. They will use their current work, funded by the Arts and Humanities Research Council (UK), on the massive archive of Tudor State Papers Online as an example of the steps involved, including the disambiguation of individuals and places, the classification of relationships between people, and decisions on how to include the temporal dimension of correspondence networks. Leading from this, Drs. Ahnert will introduce participants to a variety of network visualization and analysis tools with different technical skill requirements, such as Gephi and the Python NetworkX library. As part of a hands-on tutorial session, participants will have the opportunity either to explore a sample dataset or learn how to convert their own data into a network dataset for analysis.
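Two of the simplest ways of quantifying network connectivity, degree and betweenness, can be sketched in plain Python without any external library. The example is a hypothetical illustration and does not reproduce the presenters' data or method: the correspondents (A to F and a go-between X) are invented, and the betweenness routine is a compact version of Brandes' algorithm for unweighted, undirected graphs. Libraries such as NetworkX provide equivalent measures ready-made.

```python
from collections import defaultdict, deque


def build_graph(letters):
    """Turn (sender, recipient) pairs into an undirected adjacency map."""
    graph = defaultdict(set)
    for sender, recipient in letters:
        graph[sender].add(recipient)
        graph[recipient].add(sender)
    return dict(graph)


def degree(graph):
    """Number of distinct correspondents per person."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness(graph):
    """Count how many shortest paths between other people pass through each person."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        order = []                                 # nodes in order of discovery
        preds = {node: [] for node in graph}       # predecessors on shortest paths
        paths = {node: 0 for node in graph}        # number of shortest paths
        dist = {node: -1 for node in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                               # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):                  # accumulate dependencies
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both ends.
    return {node: s / 2 for node, s in score.items()}


# Invented network: two tight circles of correspondents joined only by X.
letters = [("A", "B"), ("A", "C"), ("B", "C"),
           ("D", "E"), ("D", "F"), ("E", "F"),
           ("C", "X"), ("X", "D")]
graph = build_graph(letters)
deg = degree(graph)
btw = betweenness(graph)
```

In this toy network X exchanges letters with only two people, yet every path between the two circles runs through X. That gives X the highest betweenness and shows how a low-volume correspondent can hold a key infrastructural position.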

One of the most striking methodological issues facing researchers is the vast quantity of data that is becoming available, as corpora shift from 40 texts to 400, and then to 400,000. If scholars are focused on a history of words, then such data sets are an advantage. But when scholarship seeks to move beyond words to study the development of genres, for example, then the quantities of data pose significant challenges for the researcher. On Friday afternoon, Professor Hope, Professor Gleicher, and Dr. Mike Witmore (Director, Folger Shakespeare Library) will demonstrate the tools being developed as part of the “Visualizing English Print” (VEP) project, a major Mellon-funded initiative. Its team seeks to develop tools and protocols that enable researchers to analyze and visualize the data being made available through EEBO-TCP and other archives. Dr. Witmore will explore with the group a number of approaches that have grown up alongside the tools being developed within VEP, providing several case studies that have developed out of this group’s work with EEBO-TCP texts drawn from the years 1530-1799. Emphasizing the need for corpus-wide findings to engage existing and emerging questions in literary studies, Dr. Witmore will focus on three areas where those findings seem relevant: the apparent distinctiveness of Shakespearean drama as compared with early modern drama more generally, the relationship between fictional prose (the novel) and works of “moral philosophy,” and the relationship of literary genre to authorship.

Week 3: 29 June-1 July 2015

The Implications of Digital/At-Scale Research for the Field of Literary Studies

In the third week, Professor Hope will redirect participants’ attention to the challenges digital tools and methods pose to literary studies and scholars. He will also broaden the scope of the institute’s agenda to the larger (period) ecosystems of DH. Challenges range from the practical ones of how scholars collaboratively conceive a digital project and organize its workflow, interoperability and sustainability, to fundamental questions about the basis, aims, and procedures of literary studies. To facilitate discussion, participants will be joined by two of the most trenchant practitioners and theorists of DH: Professors Andrew Prescott (Professor of Digital Humanities, University of Glasgow, and Digital Fellow with the AHRC, in which role he leads and advises on almost all UK DH funding) and Ted Underwood (Associate Professor of English, University of Illinois at Urbana-Champaign). Prescott is a medievalist, with expertise in imaging, and a keen sense of the history of DH and computational approaches to the humanities generally; Underwood works on mainly nineteenth-century materials, but his publications have consistently raised the issue of what literary scholars must take responsibility for if they are to use digital methods critically and effectively. Underwood has argued that DH poses a challenge for literary scholars who are used to basing their arguments on “turning points” and exceptions, whereas digital evidence, usually collected at scale, typically tells stories about continuity and gradual change. The discussions will focus on this, using the projects of participants as examples: do digital corpora and tools place DH practitioners at the dawn of a new world, or are they just in for more (or a lot more) of the same?

To provide the institute’s coda, participants will prepare and then deliver individual presentations, in which they will be charged to respond to the institute’s themes and lay out plans and issues for their future research. They will discuss what they have learned, speculate on what needs to be done or made available to researchers in the field, and describe what they have been inspired to investigate. They will also indicate what their continuing contribution to the Institute’s digital presence will be. In EMDA2013, these sessions were extremely successful, even celebratory, as they generated offers of support and collaboration. In EMDA2015, these sessions will again mark the beginning of the lasting digital presence that the participants will create.