6B. Distant Reading and Cultural Analytics

(JDrucker 9/2013)

Many concepts and terms in digital humanities, such as mark-up and data mining, have come into being through a community of users. Distant reading and cultural analytics, by contrast, are associated with individual authors, Franco Moretti and Lev Manovich respectively, each of whom has used the term and applied its principles to research projects.

Distant reading is the idea of processing the content of a large number of textual items (subjects, themes, persons, places, etc.) or information about them (publication date, place, author, title) without reading the actual texts. The “reading” is a form of data mining that allows information in the text or about the text to be processed and analyzed. Debates about distant reading range from the suggestion that it is a misnomer to call it reading, since it is really statistical processing and/or data mining, to arguments that the reading of the corpus of literary or historical (or other) works has a role to play in the humanities. Proponents of the method argue that text processing can expose aspects of texts at a scale that is not possible for human readers, and that these provide new points of departure for research. Patterns in changes in vocabulary, nomenclature, terminology, moods, themes, and a nearly inexhaustible number of other topics can be detected using distant reading techniques, and larger social and cultural questions can be asked about what has been included in and left out of traditional studies of literary and historical materials.
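To make the idea of detecting a pattern in vocabulary concrete, here is a minimal sketch in Python. It assumes a tiny, invented corpus of (year, text) pairs; the sentences, the function names, and the choice to group by decade are all illustrative assumptions, not the method of any particular project. It computes how large a share of the words in each decade a chosen term accounts for.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs standing in for a large collection.
corpus = [
    (1811, "The carriage waited while the family spoke of marriage and fortune."),
    (1818, "A carriage and a letter arrived at the estate."),
    (1853, "The railway brought the factory workers to the city."),
    (1859, "Factory smoke hung over the railway station."),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequency_by_decade(corpus, term):
    """Return {decade: share of all tokens in that decade that equal term}."""
    term_counts = Counter()
    total_counts = Counter()
    for year, text in corpus:
        decade = (year // 10) * 10
        tokens = tokenize(text)
        total_counts[decade] += len(tokens)
        term_counts[decade] += tokens.count(term)
    return {d: term_counts[d] / total_counts[d] for d in sorted(total_counts)}

carriage_trend = relative_frequency_by_decade(corpus, "carriage")
railway_trend = relative_frequency_by_decade(corpus, "railway")
print(carriage_trend)
print(railway_trend)
```

Even in this sketch, the tokenizer, the grouping into decades, and the choice of term are decisions made before any "result" appears. A real study would run the same kind of count over thousands of texts.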

Cultural analytics is a phrase coined by Lev Manovich to describe the work he has embarked on that uses large screen displays and digital capacities to analyze, organize, sort, and computationally process large numbers of images. Images in digital form have different properties than texts, and the act of remediating an image into a digital file is more radical than the act of typing or transcribing a text into an alphanumeric stream (we could quibble over this, but essentially text is produced in alphanumeric code, while no equivalent or analogous code exists for images). Finding ways to process the remediated digital files based on values, color, degrees of difference from a median or norm, and so on, has constituted one of the core research areas of cultural analytics.
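As a rough illustration of processing images by value and by difference from a median, the following Python sketch treats each image as a small grid of (R, G, B) pixels and reduces it to one number, its average brightness. The three "pages," their pixel values, and the brightness formula are invented for the example; an actual project would load real image files with an image-processing library and would likely extract many features rather than one.

```python
from statistics import mean, median

# Hypothetical images: each is a grid of (R, G, B) pixels with values 0-255.
images = {
    "page_dark": [[(10, 10, 10), (30, 30, 30)],
                  [(20, 20, 20), (20, 20, 20)]],
    "page_mid": [[(120, 120, 120), (130, 130, 130)],
                 [(125, 125, 125), (125, 125, 125)]],
    "page_light": [[(240, 240, 240), (250, 250, 250)],
                   [(245, 245, 245), (245, 245, 245)]],
}

def brightness(image):
    """Feature: the average of the R, G, B values over all pixels."""
    return mean(sum(pixel) / 3 for row in image for pixel in row)

# One numerical feature per image.
features = {name: brightness(img) for name, img in images.items()}

# How far each image sits from the median of the collection.
collection_median = median(features.values())
deviation = {name: value - collection_median for name, value in features.items()}

# Images sorted from darkest to lightest, ready to be laid out in a plot.
ranked = sorted(features, key=features.get)
print(features, deviation, ranked)
```

Once each image is a number (or a list of numbers), the collection can be sorted, plotted, and compared, which is the step that makes very large sets of images tractable.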

In distant reading and cultural analytics the fundamental issues of digital humanities are present: the basic decisions about what can be measured (parameterized), counted, sorted, and displayed are interpretative acts that shape the outcomes of the research projects. The research results should be read in relation to those decisions, not as statements of self-evident fact about the corpus under investigation. (For example, if the publication date of books is used as an element of the data being processed, is each date that of first publication, of a subsequent printing, or of an edition that has been modified or changed? How do publication dates and composition dates match? War and Peace is still in print, but how should we assess the publication date of such a work?)
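The effect of such a decision can be sketched in a few lines of Python. The records below are placeholders with invented titles and years (they are not bibliographic data), and the field names are assumptions made for the example. The same three books are counted by century twice, once by date of first publication and once by the date of the edition in hand.

```python
from collections import Counter

# Hypothetical records: titles and years are invented placeholders.
records = [
    {"title": "Novel A", "first_published": 1869, "edition_in_hand": 2007},
    {"title": "Novel B", "first_published": 1813, "edition_in_hand": 1995},
    {"title": "Novel C", "first_published": 1925, "edition_in_hand": 1925},
]

def count_by_century(records, date_field):
    """Count records per century, using whichever date field is chosen."""
    counts = Counter()
    for record in records:
        century = (record[date_field] - 1) // 100 + 1
        counts[century] += 1
    return dict(counts)

by_first = count_by_century(records, "first_published")
by_edition = count_by_century(records, "edition_in_hand")
print(by_first)
print(by_edition)
```

Counted by first publication, two of the three books belong to the nineteenth century; counted by edition, none do. Neither count is wrong, but each answers a different question, which is why the parameter has to be stated alongside the result.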

Case studies:

On Distant Reading

A) Franco Moretti, Stanford Literary Lab (http://litlab.stanford.edu/?page_id=13)

Exercise: What kinds of patterns are being analyzed (geography, networks, stylistics) and how are parameters set?

What is Distant Reading? (New York Times, Sunday Book Review, 2011)

Pamphlet on “quantitative formalism” (http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf)

Exercise: Why is this a misleading graph? http://www.rogerwhitson.net/britnovel2012/wp-content/uploads/2012/10/graph-11.png

B) Matt Jockers (worked extensively with Moretti to design the software/algorithms used in distant reading)

Read reviews of his book, summarize the issues, and compare them with the responses to Moretti’s work.

C) “Conjecture-based” analysis: Patrick Juola’s “Conjecturator” (https://twitter.com/conjecturator)

D) Dan Cohen and Fred Gibbs, 1,681,161 titles in Victorian literature

Exercise: Analyze the graphic and compare with the network diagram of Hamlet

Cultural Analytics

Lev Manovich, http://lab.softwarestudies.com/2008/09/cultural-analytics.html

Read: How to Compare One Million Images (http://softwarestudies.com/cultural_analytics/2011.How_To_Compare_One_Million_Images.pdf)

Discuss: Some details of the project:

  • 1,074,790 manga pages
  • supercomputers
  • visual features
  • feature = numerical value of an image property

Exercise: Analyze the analysis (p.5)

  • argument: tiny sample method vs. large cultural data sets
  • claims: “full spectrum of graphical possibilities” revealed
  • benefits/disadvantages
  • controlled vocabulary / crowd sourcing
  • digital image processing / image plots

Exercise: Google “cultural analytics,” look at the image results, and analyze them

Exercise: Design a project for which cultural analytics would be useful. Think in terms of the large volume of visual information which can be processed. In what circumstances might this be of value?

Exercise: What are the differences and similarities between distant reading and cultural analytics?


Cultural analytics is a phrase used to describe the analysis of very large data sets. Computational tools to analyze big data have to balance the production of patterns and summaries at a large scale with the capacity to drill down into the data at a small scale. A number of “digging into data” projects have made large repositories of cultural materials more useful through faceted search and customizable browsing interfaces.

Distant reading is a combination of text analysis and other data mining performed on metadata or other available information. Natural language processing applications can summarize the contents of a large corpus of texts. Data mining techniques can show other patterns at a scale that is beyond the capacity of human processing (e.g. how many times does the word “prejudice” appear in 200,000 hours of newscasts?). The term distant reading was coined in opposition to the notion of “close reading” that is at the heart of humanistic interpretation through careful attention to the composition and meaning of texts (or images or musical works).
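A question like the one about “prejudice” looks simple, but even counting requires a definition of what counts. The Python sketch below uses two invented transcript snippets (stand-ins for the newscasts, not real data) and two hypothetical helper functions: one counts only the exact word, the other counts any word beginning with the same letters, such as "prejudiced" or "prejudices."

```python
import re

# Hypothetical transcript snippets standing in for a large archive.
transcripts = [
    "Prejudice was the topic tonight. Critics called the ruling prejudiced.",
    "The panel discussed pride and prejudice, and the prejudices of the era.",
]

def count_exact(texts, word):
    """Count occurrences of the word as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)

def count_family(texts, stem):
    """Count every word that begins with the given stem, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)

exact = count_exact(transcripts, "prejudice")
family = count_family(transcripts, "prejudice")
print(exact, family)
```

The two counts differ on the same material, so a reported figure means little without the rule that produced it. Close reading would then be needed to tell whether a given occurrence refers to a social attitude, a legal term, or a novel's title.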

  1. What are the basic components of a network? How are they defined? How do they translate into a data structure?
  2. What is meant by ‘connectivity’ and what are the limits of the ways network definitions represent actual situations?

