Case Study: Interpretation in the Creation of Corpora, Catalogs, and Databases
As we have seen in this class, the digitization and coding of corpora and the creation of databases create a host of problems for scholars. What metadata is necessary or useful? What is the best way to code the data? How accurate is the text? How is the text going to be used? Will the corpus be open access? How will access and database upkeep be funded over the long term? Addressing these problems is rarely simple, and most often the solutions are subjective and creative choices based on theoretical constructs, scholarly research, and financial feasibility. In other words, the digital tools that you use for your research (in this case, databases) are neither objective nor passive objects. In fact, they are the product of someone else’s biases, research, opinions, and limitations. If we do not pay attention to how the tools were designed, we may find ourselves unintentionally influenced by the assumptions of their creators and by the intellectual frameworks and boundaries that they have imposed. This exercise asks you to think carefully about how our tools are created and to what extent they might be directing our methodologies and research agendas.
To get a sense of the problems outlined above, this case study focuses on several digital tools created to study early modern publications in England between 1450 and 1800. Over the last 30 years, individual scholars, non-profits, and for-profit institutions have begun numerous projects that provide digital access to various early modern corpora. I am listing only a few with which I want you to familiarize yourself. Note that, unlike the others, the English Short Title Catalogue does not provide direct access to texts. Rather, it is a catalogue whose data is relevant to the other projects.
- English Short Title Catalogue (ESTC)
- Early English Books Online (EEBO)
- Eighteenth Century Collections Online (ECCO)
- Text Creation Partnership (TCP)
- Google Books
While you are familiarizing yourself with these databases, ask yourself the following questions.
- What is the purpose of the database? What problem(s) is it trying to solve?
- When was the project established?
- Who were its creators?
- What methods did the creators use to collect and create data — both the document texts and the metadata?
- Is it open access or behind a pay wall? Why?
- Do you have access to the metadata, and what does that metadata tell you?
- How was the data input into the database? Is there consistency in the metadata?
- Was the text scanned using OCR? Was the OCR text checked by a human? If so, how (random sampling, reCAPTCHA, double keying)?
- What are the search capabilities of the database, and what does that tell you about how its authors wanted you to use it?
- What are the limits of the database due to design, search capabilities, data input, etc.?
- In what ways could the database limit, control, or define your research agenda if used uncritically?
After you have answered the questions above, please read the following text to see whether you have come to similar conclusions. On what points do you agree or disagree with Gadd’s analysis?
- Ian Gadd, “The Use and Misuse of Early English Books Online,” Literature Compass 6 (2009): 680–692.
Finally, I would like you to take a look at Connected Histories. Connected Histories is slightly different from the others that you have looked at so far. It is a database of databases: a single interface that allows you to search across multiple platforms, including several of those that you have looked at already. The authors explain how they did this here. Ask yourself the same questions about Connected Histories that you asked about the other databases.
When we get together in class, we will discuss your findings. Be sure that you have taken good notes so that you will be prepared to compare and contrast the various features of the tools.