Are We Entering The Age of Big Data?

2 Mar

The proliferation of online digital content, resources, and technology has signaled the beginning of many new and exciting projects within the field of history. James Grossman, Executive Director of the American Historical Association, recently shared some personal musings on whether we have entered “The Age of Big Data” in historical research, and there is reason to believe that we have in fact entered a new age of change. So much digital content now exists that historians must begin learning how to incorporate research tools that help analyze data sets too large for one person to study. We are working in a world of overabundance, where thousands of letters, land sales, wills, and maps are now readily available at our fingertips.[1] According to Tim Hitchcock, this means we are entering “a different and generically distinct form of historical analysis [that] fundamentally changes the character of the historical agenda that is otherwise in place.”[2] If this is true, what digital tools can we use to better understand these character changes?

Throughout the semester, our class has been exploring the validity of utilizing “graphs, maps, and trees” in historical research. We have also contemplated the possibility that these visualizations may supplant the “close reading” techniques we have learned throughout our entire educational careers. I believe such visualizations are complementary to “close reading” and enhance our studies for several reasons. For one, I wholeheartedly agree with those who have rightly pointed out that data mining and visualization have helped us ask new questions, detect unique patterns, and provide new answers about the past.[3] However, what is more important, in my opinion, is that these analytical tools help us make sense of big datasets and enable us to be more efficient in our reading of large amounts of content. We cannot ask questions about big data until we put that data into a form that is readily understandable for both creator and researcher. John Theibault brought this to my attention with his paper “Visualizations and Historical Arguments,” which analyzes what visualization methods best “deploy the visual capabilities of the computer to show what we wish to communicate.”[4] Clearly defined and readable digital content is just as important as a readable book.

I explored several different websites to see data mining and visualization projects in action. My research interests are primarily focused on 19th-century American history, so I found myself browsing sites that had digital projects focusing on this period. The first site I visited was Robert K. Nelson’s “Mining the Dispatch,” which–following Theibault’s description of visualization–was strikingly beautiful in its density (amount of information) and transparency (ability of information to be easily understood by an audience).[5] Nelson utilized the MALLET topic modeling program to extract specific topics from a large collection of newspaper articles from the Richmond Daily Dispatch, which he then broke up into smaller categories and subcategories that were turned into visual graphs.[6] These graphs reveal interesting information about the place of slavery, nationalism, economics, politics, and many other topics in the lives of the people who subscribed to the Richmond Daily Dispatch during the Civil War. I personally find it interesting that the number of “for hire and wanted ads” looking for African American labor always spiked around the month of January. Why was this?

I also enjoyed going through the University of Nebraska-Lincoln’s website “Railroads and the Making of Modern America,” which has a wide range of digital tools for data mining and visualization.[7] To wit:

1. TokenX: Described as a “text visualization, analysis, and play tool,” this program, in conjunction with the RMMA project, allows users to conduct text mining on a collection of speeches William Jennings Bryan made on four railroad trips during his unsuccessful 1896 bid for the U.S. Presidency. Users can pick out specific speeches and then organize these speeches into a word cloud, as I did with this August 8, 1896 speech.

2. The Aurora Project: This visualization project outlines the relationship between slavery and the growth of railroads in America from 1840 to 1870. I liked being able to study the interactive pie charts and bar graphs that helped me visualize the growth of both (until the Civil War for slavery) on a year-to-year basis, although I was unable to get the interactive map at the top of the page to work in my internet browser. I also appreciated the “Scholarly Interpretation” page, which helped provide a context for the visual data and reinforced the fact that many slaves did much more than farm labor at plantations.

3. Land Sales in Nebraska: This simple, animated cinematic map achieves several important tasks. Although I’d like a little more functionality and interactivity with it, the map gives us a partial glimpse into the relationship between people in Nebraska and their geography from 1865-1886 and, perhaps more importantly, shows us how this relationship changed over time. We can also detect interesting patterns with this animation. For example, one will notice that from roughly 1873-1876, railroad construction slowed dramatically, a result of the Panic of 1873 that crippled the entire U.S. economy.
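A word cloud like the one TokenX produces rests on a simple idea: count how often each word appears and size the words accordingly. The following is a minimal sketch of that counting step, not TokenX's actual method. The function name `word_frequencies`, the short stop-word list, and the sample sentence are all assumptions made for illustration; the sample text is invented and is not a quotation from any Bryan speech.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real tools use far longer ones.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "that", "we", "for", "it"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop words as (word, count) pairs."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Count every word that is not in the stop-word list.
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Invented sample text, used only to show the counting step.
sample = "The railroad serves the farmer, and the farmer feeds the railroad town."

for word, count in word_frequencies(sample, top_n=3):
    print(word, count)
```

A word cloud program would then draw each word at a size proportional to its count. The historian's job of explaining why those words matter in context remains.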

It is important for us to once again remember that digital content requires us to proceed with caution when analyzing big data sets. When we create visualizations that document change over time, we need to display “rhetorical honesty” and document our entire research process–not just the end results–so that the audience can see where our numbers are coming from, much like an economist’s or scientist’s report that utilizes quantitative data.[8] When we use text mining, we still need to provide a responsible interpretation that puts those words into a proper historical context. Most importantly, collaboration is key in helping us to make sense of big data. To this end, I agree with Michael Simeone, Jennifer Guiliano, Rob Kooper, and Peter Bajcsy that we should look for ways to build “cloud computing” repositories for collecting resources and sharing digital artifacts amongst scholars in an interdisciplinary setting. Through these systems, we may begin to “think about joining the historian’s analytical and narrative skills to the statistician’s methods of organization and analysis[,] or the historian’s facility with sifting and contextualizing information to the computer scientist’s (or marketing professional’s) ability to generate and process data.”[9] Indeed, we can use digital resources to find new opportunities to make the entire field of humanities a more important part of our lives.

[1] James Grossman, “‘Big Data’: An Opportunity for Historians?” Accessed February 26, 2013; Michael Simeone, Jennifer Guiliano, Rob Kooper, Peter Bajcsy, “Digging into data using new collaborative infrastructures supporting humanities-based computer science research,” First Monday 16, no. 5 (2 May 2011). Accessed February 26, 2013. It should also be pointed out that most archival material is not digitized. A recent report by the Smithsonian states that the institution is hoping to digitize 14 million objects in its collection, which amounts to roughly 10 percent of its total collection. See “Digitization of Smithsonian Collections.” Accessed February 21, 2013.

[2] Tim Hitchcock, “Academic History Writing and the Headache of Big Data.” Accessed March 1, 2013.

[3] Grossman, “‘Big Data’: An Opportunity for Historians?”; Council on Library and Information Resources, “Using Zotero and TAPoR on the Old Bailey Proceedings: Data Mining with Criminal Intent.” Accessed February 27, 2013.

[4] John Theibault, “Visualizations and Historical Arguments” (Spring 2012 Version). Accessed February 27, 2013.

[5] Ibid.; Robert K. Nelson, “Mining the Dispatch.” Accessed February 27, 2013.

[6] For more info on topic modeling, see Scott B. Weingart, “Topic Modeling for Humanists: A Guided Tour.” Accessed February 28, 2013.

[7] William G. Thomas et al., “Railroads and the Making of Modern America.” Accessed February 27, 2013.

[8] Theibault, “Visualizations and Historical Arguments.”

[9] Simeone, Guiliano, Kooper, Bajcsy, “Digging into Data”; quote from Grossman, “‘Big Data’: An Opportunity for Historians?”


3 Responses to “Are We Entering The Age of Big Data?”

  1. Nick Sacco March 2, 2013 at 3:07 am #

    Reblogged this on Exploring the Past and commented:

    Check out the full post at the IUPUI Digital History Blog!

  2. xtinexby March 2, 2013 at 3:24 pm #


    I appreciate your focus in your blog post on using visualizations for big data and remembering to be transparent not only with the images but how they were formed.
    Do you feel like now that we have all of the tools and possibilities for visualization, that it is dangerous and easy to produce graphs, charts, and maps that do not have any value? Or do you think that most of these tools and projects are making a valuable contribution to historical knowledge?

    • Nick Sacco March 3, 2013 at 3:59 pm #

      Christine (or Xtine!),

      Thanks for the comment and question. I think the question of value is something that the creator(s) of graphs, charts, maps, etc. has to determine ahead of time. Like the writing of a book, the ways one can frame an interpretive visualization are endless, so picking one way that is “most valuable” is tough. As a fairly inexperienced reader of visualizations, I think I have often put too much value into graphs, charts, and maps in books without seriously analyzing the interpretation behind them. Just flip on to the next page, right?!

      John Theibault’s essay showed me that the creators of visualizations need to avoid the pitfalls of misleading and manipulative interpretations and datasets that give readers the false impression that such visualizations HAVE value. “Chartjunk” does exist, in sum. Theibault explained these problems as such:

      “…sometimes, visualizations can be shaped to seem more conclusive than the underlying data actually warrant. For example, if a line graph showing differences ranging from 65 percent to 85 percent has its baseline set at 50 percent rather than zero, it leads people to see the differences in values as starker than they actually are. These issues of what one might call rhetorical honesty in the formulation of visualizations were compounded for historians and other humanists by the hard choices that were required to generate the data to be processed in the first place.”
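Theibault's baseline example can be worked through with a little arithmetic. The sketch below is only an illustration of the numbers in the quote (65 percent, 85 percent, and a baseline of either zero or 50 percent). The function name `apparent_ratio` is an assumption of mine, and it models the "drawn height" of each value as its distance above the baseline.

```python
def apparent_ratio(low, high, baseline=0.0):
    """Ratio of the drawn heights of two values measured from an axis baseline."""
    if baseline >= low:
        raise ValueError("baseline must sit below the smaller value")
    return (high - baseline) / (low - baseline)


# Values taken from Theibault's example: 65 percent and 85 percent.
zero_based = apparent_ratio(65, 85)       # 85 / 65
truncated = apparent_ratio(65, 85, 50)    # (85 - 50) / (65 - 50) = 35 / 15

print(f"zero baseline: {zero_based:.2f}x, 50% baseline: {truncated:.2f}x")
```

With a zero baseline the larger value is drawn about 1.3 times as high as the smaller one. With the baseline raised to 50 percent it is drawn about 2.3 times as high, even though the underlying data have not changed.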
