5B. Data Mining and Text Analysis

(JDrucker 9/2013)

The term data mining refers to any process of analysis performed on a dataset to extract information from it. That definition is so general that it could mean something as simple as doing a string search (typing into a search box) in a library catalogue or in Google. Mining quantitative data or statistical information is standard practice in the social sciences, where software packages for doing this work have a long history and vary in sophistication and complexity. For a good, succinct introduction to SPSS, one of the standard applications for statistical analysis, read this “dummies” guide.

But data mining in the digital humanities usually involves extracting information from a body of texts and/or their metadata in order to ask research questions that may or may not be quantitative. Suppose you want to compare the frequency of the words “she” and “he” in newspaper accounts of political speeches in the early 20th century, before and after the 19th Amendment guaranteed women the right to vote in August 1920. Suppose, further, that you want to collocate these words with the phrases in which they appear and sort the results by various factors: frequency, affective value, attribution, and so on. This kind of text analysis is a subset of data mining. Quite a few tools have been developed to analyze unstructured texts, that is, texts in conventional formats without explicit markup. Text analysis programs use word counts, keyword density, frequency, and other measures to extract meaningful information. The question of what constitutes meaningful information is always open to discussion, and completely silly or meaningless results can be generated as readily with text analysis tools as with any others.
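
As a minimal sketch of the she/he comparison described above, here is one way the counting and collocation steps might look in Python. It assumes the newspaper accounts are available as plain-text files; the filenames are hypothetical placeholders.

    import re
    from collections import Counter

    def tokenize(text):
        """Lowercase a text and split it into word tokens."""
        return re.findall(r"[a-z']+", text.lower())

    def pronoun_counts(tokens):
        """Count occurrences of 'she' and 'he'."""
        counts = Counter(tokens)
        return {"she": counts["she"], "he": counts["he"]}

    def concordance(tokens, keyword, window=5):
        """Collocate a keyword with its surrounding words
        (a simple keyword-in-context listing)."""
        hits = []
        for i, token in enumerate(tokens):
            if token == keyword:
                context = tokens[max(0, i - window): i + window + 1]
                hits.append(" ".join(context))
        return hits

    # Hypothetical corpus files, split at August 1920.
    for filename in ["speeches_pre_1920.txt", "speeches_post_1920.txt"]:
        with open(filename, encoding="utf-8") as f:
            tokens = tokenize(f.read())
        print(filename, pronoun_counts(tokens))
        for line in concordance(tokens, "she")[:5]:
            print("  ", line)

Sorting the concordance lines by affective value or attribution, as suggested above, would require an additional layer of annotation or a sentiment lexicon; this sketch handles only frequency and context.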

Exercise: Even a very simple tool, like Textalyser (http://textalyser.net/), can generate results that are useful, but for what? Make use of the tool and then define a context or problem for which it would be useful. Think about the various categories of analysis. What are stop words? What other features can you control, and how do they affect the results? Now look at a more complicated tool and compare the language that describes its features with that of Textalyser. What is a “conceptual grammar,” for instance, and what are the applications (e.g. Visual Text) that the developers describe in their promotional materials?

While text analysis is considered qualitative research, the algorithms the tools run use quantitative methods as well as search-and-match procedures to identify the elements and features in any text. Is the apparent paradox between quantitative and qualitative approaches in text analysis real?

In 2009 (and again in 2011 and 2013), the National Endowment for the Humanities ran a “Digging into Data Challenge” as part of its funding of digital scholarship. The goal was to take digital projects with large data sets and create useful ways to engage with them. Take a look at the program and at the kinds of proposals that were funded in 2009. One of these used two tools, Zotero (developed at George Mason University’s Center for History and New Media) and TAPoR (an earlier version of what is now Voyeur, developed by a group of Canadian researchers), to create a new front end for a project, the transcripts of trials at the Old Bailey in London. The Old Bailey records provide one of the longest continuous chronological accounts of trials and criminal proceedings in existence, and are therefore a fascinating document of changes in attitudes, values, punishments, and the social history of crime.

Exercises for critical analysis and discussion:

Case 1: Old Bailey Online

  • API (application programming interface) for Old Bailey for search/query (see the sketch after this list)
  • Zotero – manage, save records, integrate
  • Voyant/Voyeur — visualization
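
To make the first item concrete, the sketch below shows what querying a search API over HTTP and tallying the results can look like. The endpoint URL, parameter names, and response fields here are hypothetical stand-ins, not the documented Old Bailey API; consult the project’s own documentation for the real interface.

    import json
    from collections import Counter
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Hypothetical endpoint and parameters, for illustration only.
    API_URL = "http://api.example.org/trials/search"

    def search_trials(term, start=0, count=100):
        """Query the (hypothetical) API and return its decoded JSON hits."""
        query = urlencode({"term": term, "start": start, "count": count})
        with urlopen(f"{API_URL}?{query}") as response:
            return json.load(response)["hits"]

    # Tally matching trials by year, assuming each hit carries an
    # ISO-style date string such as "1789-04-15".
    hits = search_trials("poaching")
    by_year = Counter(hit["date"][:4] for hit in hits)
    print(by_year.most_common(10))

The point of an API, as opposed to the site’s search form, is exactly this: results come back as data that other tools (Zotero, Voyeur) can save, recombine, and visualize.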

First look at the site: http://www.oldbaileyonline.org

Then look at the CLIR report on the project (http://www.clir.org/pubs/reports/pub151/case-studies/dmci) or the final research summary (http://criminalintent.org/wp-content/uploads/2011/09/Data-Mining-with-Criminal-Intent-Final1.pdf).

5bfig1

Figure 1: How is the API structured and what does it enable? Compare with the original Old Bailey Online search. If the Old Bailey becomes “a collection of texts”…

5bfig2

Figure 2: Zotero saves search results, not just points within the corpus.

5bfig3

Figure 3: Export of results.

5bfig5

Figure 5: Voyeur – correlate the information in this image.

5bfig6

Figure 6: Compare with Figure 5.

Other features: TF-IDF (term frequency–inverse document frequency), a weighting that scores a term highly in a document when it is frequent there but rare across the rest of the corpus.
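
A minimal sketch of one common TF-IDF formulation, tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term; the toy documents here are invented, and other weighting variants exist.

    import math
    from collections import Counter

    docs = {
        "doc1": "the prisoner stole a loaf of bread".split(),
        "doc2": "the prisoner was found not guilty".split(),
        "doc3": "bread and butter for the jury".split(),
    }

    def tf_idf(term, doc_name):
        """Term frequency in one document times log inverse document frequency."""
        tf = Counter(docs[doc_name])[term]
        df = sum(1 for tokens in docs.values() if term in tokens)
        return tf * math.log(len(docs) / df) if df else 0.0

    print(tf_idf("prisoner", "doc1"))  # frequent in few documents: weighted up
    print(tf_idf("the", "doc1"))       # appears in every document: weight is 0.0

This is why TF-IDF is useful for text mining: it automatically discounts ubiquitous words (“the,” “of”) without requiring a hand-built stop word list.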

Case 2: Erotics in Emily Dickinson (http://hcil2.cs.umd.edu/trs/2006-01/2006-01.pdf)

Exercise: Look at pages 4 and 6 and analyze the visualizations.

  • How does this kind of “data” analysis differ from that of the Old Bailey project?
  • What are the means by which the visualizations were produced?

Case 3: Compus – a corpus of 100 transcribed letters from 1531–1532 (petitions for clemency)

Exercise: Look at Figure 1, then p. 2, and examine the encoding/tagging.

How is the process of generating the visualization different from that in the Old Bailey or Emily Dickinson projects?

Exercise: Doing text analysis with Voyeur/Voyant and ManyEyes.

One of the challenges with any kind of data mining is to translate the results into a legible format. Information visualization, as we know, compresses large amounts of information into an image that shows patterns across a range of variables. Using visualization tools for “reading” and analyzing the results of text mining has various advantages and liabilities. In this set of exercises, try to identify the ways the rhetorical force of the tools shapes the results.

TBD: Exercises

Summary: Methods of doing text analysis are a subset of data mining. They depend upon statistical analysis and algorithms that can actually “understand” (that is, process in a meaningful way) features of natural language. Visualization tools are used to display many of the results of text analysis and introduce their own arguments in the process. While this lesson has focused on “unstructured” texts, the next will look at the basic principles of “structured” texts that make use of mark-up to introduce a layer of interpretation or analysis into the process.

Takeaway

Text analysis is a way to perform data mining on digitally encoded text files. One of the earliest forms of humanities computing, at its simplest it is a combination of string search, match, count, and sort functions that show word frequency, context, and lexical preferences. It can be performed on unstructured data. Topic modelling is an advanced form of text analysis that analyzes relations (such as proximity) among textual elements as well as their frequency.
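
As a brief illustration of topic modelling, here is a sketch using LDA (latent Dirichlet allocation), one common technique, via the gensim library; the four-document corpus is invented and far too small for meaningful results.

    # Requires gensim (pip install gensim); the documents are invented.
    from gensim import corpora, models

    documents = [
        "the trial of the prisoner for theft of bread",
        "the jury found the prisoner guilty of theft",
        "her letters speak of love and longing",
        "a poem of love written in her hand",
    ]
    texts = [doc.split() for doc in documents]

    # Map each word to an integer id, then represent each document
    # as a bag of (word id, count) pairs.
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Fit a two-topic model; LDA groups words that tend to co-occur,
    # which is how it captures relations beyond raw frequency.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
    for topic in lda.print_topics():
        print(topic)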

Required readings for 6A:

Study questions for 6A:

  1. How does text encoding work?
  2. Discuss XML in terms of structured data.
