Case Study: Interpretation in the Creation of Corpora, Catalogs, and Databases

13 Jan

[Image: Pollard and Redgrave sample]

I know that most of you are busy reading for this week’s discussion, but for those of you who are a little bit ahead, I wanted to share with you one of your reading assignments for week 12.  It’s a case study in methodology, which I think is directly relevant to your other courses and is important for those of you who are drafting your thesis prospectuses.  So, if you have some time and want to get a head start, here it is.

As we have seen in this class, the digitization and coding of corpora and the creation of databases create a host of problems for scholars.  What metadata is necessary or useful?  What is the best way to code the data?  How accurate is the text?  How is the text going to be used?  Will the corpus be open access?  How will access and database upkeep be funded over the long term?  Addressing these problems is not always simple, and most often the solutions are subjective and creative choices based on theoretical constructs, scholarly research, and financial efficacy.  In other words, the digital tools that you use for your research (in this case, databases) are neither objective nor passive objects.  In fact, they are the product of someone else’s biases, research, opinions, and limitations.  If we do not pay attention to how the tools were designed, we may find ourselves unintentionally influenced by the assumptions of their creators and the intellectual frameworks and boundaries that they have imposed.  This exercise asks you to think long and hard about how our tools are created and to what extent they might be directing our methodologies and research agendas.

To get a sense of the problems outlined above, this case study focuses on several digital tools created to study early modern publications in England between 1450 and 1800.  There are numerous projects that individual scholars, non-profits, and for-profit institutions have begun in the last 30 years which provide digital access to various early modern corpora.  I am listing only a few with which I want you to familiarize yourself.  Note that unlike the others, the English Short Title Catalogue does not provide direct access to texts.  Rather, it is a catalogue with data that is relevant to the other projects.

While you are familiarizing yourself with these databases, ask yourself the following questions.

  • What is the purpose of the database?  What problem(s) is it trying to solve?
  • When was the project established?
  • Who were its creators?
  • What methods did the creators use to collect and create data — both the document texts and the metadata?
  • Is it open access or behind a pay wall?  Why?
  • Do you have access to the metadata, and what does that metadata tell you?
  • How was the data input into the database?  Is there consistency in the metadata?
  • Was the text scanned using OCR?  Was the OCR text checked by a human?  How (random sampling, reCAPTCHA, double keyed)?
  • What are the search capabilities of the database, and what does that tell you about how its authors wanted you to use it?
  • Do you have access to the coding of documents (e.g. XML)?  How does the coding impose boundaries for text analysis?
  • What are the limits of the database due to design, search capabilities, data input, etc.?
  • In what ways could the database limit, control, or define your research agenda if used uncritically?
  • What are the strengths of the database?  How do they compare to the other databases?
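To make the question about coding more concrete, here is a minimal sketch of how markup decides what you can ask of a text.  The XML fragment is invented for illustration and only loosely modeled on TEI-style encoding; the element names (`div`, `p`, `name`) and the `type` attribute are my assumptions, not the schema of any database discussed here.

```python
import xml.etree.ElementTree as ET

# Invented fragment; NOT taken from any of the databases in this case study.
sample = """<text>
  <div type="essay">
    <p>The statue of <name type="artwork">Laocoon</name> is admired by all.</p>
    <p>Virgil describes the same story in his poem.</p>
  </div>
</text>"""

root = ET.fromstring(sample)

# Only the names that the encoder chose to tag can be retrieved as names.
tagged_names = [el.text for el in root.iter("name")]

# "Virgil" appears in the text but was never tagged, so a query that
# relies on the markup misses it even though a plain-text search finds it.
full_text = "".join(root.itertext())
virgil_in_text = "Virgil" in full_text
virgil_tagged = "Virgil" in tagged_names

print(tagged_names, virgil_in_text, virgil_tagged)
```

In this toy example, a search limited to tagged names would return Laocoon and nothing else.  The encoder’s choices, not the text itself, set that boundary.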

Now that you have had a close look at the structure of the databases, let’s see how they compare to each other from a scholarly usability standpoint.   Let’s choose a prominent eighteenth-century author, Joseph Addison, and look at his essay, A discourse on antient and modern learning, in the following databases: ECCO, TCP, and Google Books.  Let’s assume that we know nothing about the essay, except the name of its author and its title.  Let’s ask these questions:

  • How easy was it to find and navigate the text?
  • How many editions of the essay did you find?
  • In what ways was the metadata different, and how does that affect both your ability to search and analyze the text?
  • How reliable is the transcribed text?
  • Pick an obscure word or two from the original text (“Achitophel,” for example) and search for them in each database’s version of the discourse.  Does the search find the word?  If not, what would that mean for your research if you were relying on the search functions alone?
  • Pretend that you are doing research on the sculpture, Laocoon, which is in the Belvedere courtyard at the Vatican.  What does the text say about it?  What did you need to do in order to find a reference to the sculpture?
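If an exact search fails, one way to see why is to try an approximate match against the transcribed text.  The sketch below is only an illustration: the sample sentence is invented, not taken from any of the databases, and the errors in it (a long s read as “f,” an “e” read as “c”) are assumptions about the kind of OCR damage you might run into.  The helper `fuzzy_find` and its `cutoff` value are hypothetical names and settings of my own.

```python
import difflib
import re


def fuzzy_find(word, text, cutoff=0.8):
    """Return tokens in text that approximately match word (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    # Map lowercased tokens back to their original spelling.
    lowered = {t.lower(): t for t in tokens}
    matches = difflib.get_close_matches(
        word.lower(), list(lowered), n=5, cutoff=cutoff
    )
    return [lowered[m] for m in matches]


# Invented sample standing in for an uncorrected OCR transcription.
ocr_text = "the Character of Achitophcl is fo well drawn"

# An exact search misses the misread word; a fuzzy search recovers it.
exact_hit = "Achitophel" in ocr_text
near_hits = fuzzy_find("Achitophel", ocr_text)

print(exact_hit, near_hits)
```

A database’s own search box usually will not do this for you, which is why it is worth asking how the text was checked before trusting a search that comes back empty.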

After you are done answering the questions above, please read the following text to see if you have come to a similar conclusion.  On what points do you agree or disagree with Gadd’s analysis?

Finally, I would like you to take a look at Connected Histories.  Connected Histories is slightly different from the others that you have looked at so far.  It is a database of databases: a single interface that allows you to search across multiple platforms, including several of those that you have looked at already.  The authors explain how they did this here.  Ask yourself the same questions about Connected Histories that you did about the other databases.

When we get together in class, we will discuss your findings.  Be sure that you have taken good notes so that you will be prepared to compare and contrast the various features of the tools.

