6A. Text Encoding: Mark-up and TEI

(JDrucker 9/2013)

Mark-up languages are among the common forms of structured data. The term “mark-up” refers to the use of tags that bracket words or phrases in a document. The tags are always applied within a hierarchical structure and are always embedded in the text stream itself. Experimental approaches that address the conceptual and logistical problems arising from this hierarchical structure have not yet produced an effective alternative, and mark-up remains standard practice in editing, processing, and publishing texts in electronic form.

The use of HTML tags, introduced in an earlier section, is a very basic form of mark-up. But where HTML is used to create instructions for browsers to display texts (specifying format, font, size, etc.), the mark-up languages discussed here are designed to call attention to the content of texts. This can involve anything from noting the distinctions among parts of a text, such as title, author, or stanza, to interpreting mood, atmosphere, place, or any other element of a text.

As discussed in Lesson 2A, every act of introducing mark-up into a text is an act of interpretation. Mark-up is a way of making an explicit intervention in a text so that it can be analyzed, searched, and put into relation with other texts in a repository or corpus. It is an essential element of digital humanities work, since it is the primary way of structuring texts as they are transcribed, digitized, or born digital.
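To make the idea of tags that bracket content concrete, here is a minimal sketch in Python using the standard library's XML parser. The tag names (poem, title, author, stanza, line) are a hand-made, hypothetical tag set chosen for illustration, not a published standard. The point is only that once the parts of a text are named, a program can ask for them directly.

```python
import xml.etree.ElementTree as ET

# A hypothetical, hand-made tag set; the element names are illustrative only.
marked_up = """
<poem>
  <title>The Tyger</title>
  <author>William Blake</author>
  <stanza>
    <line>Tyger Tyger, burning bright,</line>
    <line>In the forests of the night;</line>
  </stanza>
</poem>
""".strip()

poem = ET.fromstring(marked_up)

# Because each tag states what a span of text *is*, it can be retrieved by name.
title = poem.findtext("title")
author = poem.findtext("author")
lines = [line.text for line in poem.iter("line")]

print(title)
print(author)
print(len(lines), "lines tagged in the stanza")
```

Nothing in the tags says how the poem should look on screen. They record a decision about what each piece of text is, which is what makes the text searchable by category.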

Mark-up is slow, demanding work, but it is also intellectually engaging. Mark-up languages can be selected from among the many domain-specific standards (again, see Lesson 2A), or custom built for a specific project or task. These two approaches can also be combined, but then the processing of the marked-up text will have to be custom built as well. This means that the transformations, selections, and display instructions will need to be written in XSL (typically as XSLT stylesheets) in a way that matches the mark-up.
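The dependence of processing on the tag set can be sketched in a few lines. XSLT is not part of Python's standard library, so the example below uses plain Python as a stand-in for the kind of rule an XSLT stylesheet would express. The custom tags, the DISPLAY_RULES table, and the to_html function are all hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom tags mapped to HTML display elements.
# An XSLT stylesheet would express the same pairing as template rules.
DISPLAY_RULES = {
    "recipe": "div",
    "name": "h2",
    "ingredient": "p",
    "step": "p",
}

def to_html(element):
    """Recursively rewrite custom tags as HTML tags, keeping the text."""
    html_tag = DISPLAY_RULES.get(element.tag)
    if html_tag is None:
        # A tag with no rule means the processing no longer matches the mark-up.
        raise ValueError(f"no display rule for <{element.tag}>")
    out = ET.Element(html_tag, {"class": element.tag})
    out.text = element.text
    out.tail = element.tail
    for child in element:
        out.append(to_html(child))
    return out

source = ET.fromstring(
    "<recipe><name>Toast</name>"
    "<ingredient>bread</ingredient>"
    "<step>Toast the bread.</step></recipe>"
)
html = ET.tostring(to_html(source), encoding="unicode")
print(html)
```

If an encoder introduces a tag that has no rule, the transformation fails. That is the practical meaning of saying the processing must match the mark-up.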

TEI, the Text Encoding Initiative, is the prevailing standard mark-up scheme for text and should be used if you are working with literary texts. The scheme includes basic bibliographical tags (publication information, edition information, and so on), tags for the basic structure of a work (chapters, titles, subtitles, etc.), and tags for basic elements of literary content. The TEI is a complex scheme, and the documentation on it is excellent. In addition, Oxygen, the most commonly used XML editor, has TEI support built in. For more information on TEI, see the website of the TEI Consortium, the community that builds and maintains it.
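As a rough illustration of the two layers just described, bibliographical information and the structure of the work, the sketch below parses a skeletal TEI document with Python's standard library. The element names and namespace follow the TEI Guidelines, but the title and text are invented placeholders, and a real TEI file would carry much fuller header information.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace, so queries must name it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A skeletal TEI document; the title and text are placeholders.
tei_source = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Text</title></titleStmt>
      <publicationStmt><p>Unpublished classroom example.</p></publicationStmt>
      <sourceDesc><p>Written for this exercise.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter" n="1">
        <head>Chapter One</head>
        <p>First paragraph of the chapter.</p>
      </div>
    </body>
  </text>
</TEI>
""".strip()

root = ET.fromstring(tei_source)

# Bibliographical layer: the title recorded in the header.
title = root.findtext(
    "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS
)

# Structural layer: every division marked as a chapter, with its heading.
chapters = root.findall(".//tei:div[@type='chapter']", TEI_NS)
heads = [div.findtext("tei:head", namespaces=TEI_NS) for div in chapters]

print(title)
print(heads)
```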

For customized mark-up, the first phase of working with mark-up is to decide on a scheme or content model for the texts. The content model is not inherent in the text, but instead embodies the intellectual tasks to which the work is being put. Is a novel being analyzed for its gender politics? Its ecological themes? Its depictions of place? All of these? The tag set that is devised for analysis should fit the theme and/or content of the text but also of the work that you want to do with it. Creating a “content model” for a project is an intellectual exercise as critical as creating a classification scheme. It shapes the interpretative framework within which the work will proceed.

Because XML is always hierarchical in structure, one of the challenges in making a content model is to make decisions about the “parent-child” structures this involves. The fundamental conflict that became clear early in discussions of XML and TEI was that of overlapping hierarchies. One such conflict exists in the decision to mark up a physical object or its contents, because it is virtually impossible to do both. A poem may straddle two pages, and a single XML hierarchy has no way to accommodate the mark-up of both the physical autonomy of each page and the unity of the poem at the same time. In general, TEI concentrates on the intellectual content of a work, not the physical features of its original instantiation.
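The overlap problem can be demonstrated directly. In the sketch below, the first string tries to treat both the page and the poem as containers, and the parser rejects it because the tags cross instead of nesting. The second string shows one common compromise: the poem stays a container and the page boundary is reduced to an empty marker (TEI's pb element works this way). The book, poem, and line tags are simplified, hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Overlapping containers: the poem starts on page 1 and ends on page 2.
# The tags cross instead of nesting, so this is not well-formed XML.
overlapping = (
    "<book>"
    "<page n='1'><poem><line>First line</line></page>"
    "<page n='2'><line>Second line</line></poem></page>"
    "</book>"
)

try:
    ET.fromstring(overlapping)
    overlap_parses = True
except ET.ParseError:
    overlap_parses = False

# Compromise: the poem is the container; page breaks become empty markers.
milestones = (
    "<book>"
    "<pb n='1'/>"
    "<poem><line>First line</line>"
    "<pb n='2'/>"
    "<line>Second line</line></poem>"
    "</book>"
)
book = ET.fromstring(milestones)
poem_lines = [line.text for line in book.iter("line")]
page_breaks = [pb.get("n") for pb in book.iter("pb")]

print("overlapping version parses:", overlap_parses)
print("lines in the poem:", poem_lines)
print("page breaks recorded:", page_breaks)
```

Note what the compromise gives up. The page is no longer an element that contains anything, so the physical object is recorded only as a series of points within the intellectual structure.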

Exercise: The classic exercise is to take a recipe and try to determine what the tag set should be for its elements and how they should be introduced into the text. In this exercise, contrast the “semantic” elements of a recipe, a poem, and an advertisement.

  • Isolate the different content types in each instance simply by bracketing them.
  • Come up with a set of descriptive tags for the recipe.
  • Look at TEI and locate the appropriate tags for the poem.
  • Now try to create a tag set for the advertisement.
  • Look at the three different tag sets independent of the content to which they are going to be applied. What do the tag sets tell you?
  • Try applying the tag sets to the content of each of the textual objects. What differences do you find in the process? What does this tell you about tagging?
  • Compare your tag sets with those of your neighbor. Are they the same?
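One possible outcome of the recipe step is sketched below in Python. The tag set is only one hypothetical answer among many, and the recipe text is invented. The code then does what the exercise asks: it lists the tag set and its parent-child pairs independent of the content, so that the model can be inspected on its own.

```python
import xml.etree.ElementTree as ET

# One hypothetical tag set for a recipe; other encoders would choose differently.
recipe_source = """
<recipe>
  <name>Plain Omelette</name>
  <yield>1 serving</yield>
  <ingredients>
    <ingredient><quantity>2</quantity> <item>eggs</item></ingredient>
    <ingredient><quantity>1</quantity> <unit>tablespoon</unit> <item>butter</item></ingredient>
  </ingredients>
  <method>
    <step>Beat the eggs.</step>
    <step>Melt the butter in a pan and pour in the eggs.</step>
  </method>
</recipe>
""".strip()

recipe = ET.fromstring(recipe_source)

# The tag set by itself, with all of the recipe's words set aside.
tag_set = sorted({el.tag for el in recipe.iter()})

# The hierarchy the tag set imposes: which tag may sit inside which.
parent_child = sorted(
    {(parent.tag, child.tag) for parent in recipe.iter() for child in parent}
)

# The content, retrieved through the tags.
items = [item.text for item in recipe.iter("item")]

print(tag_set)
print(parent_child)
print(items)
```

Read without the content, this tag set already says that a recipe is a named thing made of measured ingredients and ordered steps. It has no place for, say, the history of the dish or the cook's commentary.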

The documentation of the creation of a tag set for a project is very important. Creating clear definitions of what tags describe and how they are to be used is essential if you are making your own custom XML scheme. If you are using TEI, be sure to follow the tag descriptions accurately. This is particularly important if the texts you are marking up are to be incorporated into a larger project (like an online encyclopedia, repository, collection, etc.) where they have to match the format of other files. Even the same individual working on different days can use tags differently. The range of interpretation is difficult to restrict, and individual acts of tagging are rarely consistent.
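A documented tag set can also be checked mechanically, at least for the tag names themselves. The sketch below assumes a hypothetical project whose documentation is a simple table of permitted tags and definitions, and it reports any tag in a file that the documentation does not define. The TAG_DEFINITIONS table, the undocumented_tags function, and both sample letters are invented for illustration. A real project would more likely validate against a DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical project documentation: every permitted tag with its definition.
TAG_DEFINITIONS = {
    "letter": "One complete piece of correspondence.",
    "salutation": "The opening address to the recipient.",
    "para": "A paragraph of the letter body.",
    "place": "A named geographic location mentioned in the text.",
    "person": "A named individual mentioned in the text.",
}

def undocumented_tags(xml_text):
    """Return the tags used in xml_text that the documentation does not define."""
    root = ET.fromstring(xml_text)
    used = {el.tag for el in root.iter()}
    return sorted(used - set(TAG_DEFINITIONS))

# The same encoder, on two different days, tagging the same kind of text.
monday = (
    "<letter><salutation>Dear <person>Ann</person>,</salutation>"
    "<para>I am in <place>Leeds</place>.</para></letter>"
)
friday = (
    "<letter><greeting>Dear <person>Ann</person>,</greeting>"
    "<para>I am in <location>Leeds</location>.</para></letter>"
)

print(undocumented_tags(monday))
print(undocumented_tags(friday))
```

A check like this catches only drift in tag names. It cannot tell whether a documented tag has been applied according to its definition, which remains a matter of interpretation.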

To get a good idea of a custom-built tag set for a project, go to http://www.artistsbooksonline.org and look at the DTD and tag definitions. What do you think the tag set for the Old Bailey project was? How do tags and search processes relate to each other? Data mining? What is the fundamental difference between marked-up text and non-marked-up text and when is it useful to go to the work of marking up a file?

Takeaway:

Mark-up schemes are integral to digital humanities projects and allow large collections of digital files to be searched and analyzed in a coherent and coordinated way. But mark-up schemes are formalized expressions of interpretation. They are models of content, and they are limited by the hierarchical structure required by the technical constraints of the system. Almost all digital scholarship and publication requires mark-up, and familiarity with its operations and effects is a crucial part of doing digital humanities work.

Required readings for 6B:

Franco Moretti, “Conjectures on World Literature,” New Left Review 1, January/February 2000 (http://newleftreview.org/A2094)

Lev Manovich, Jeremy Douglass, et al., “How to Compare One Million Images” (http://softwarestudies.com/cultural_analytics/2011.How_To_Compare_One_Million_Images.pdf)

Study questions for 6B:

  1. What is the concept of “distant reading” and how does it relate to and differ from other forms of data mining we have looked at to date?
  2. What are the challenges faced in trying to analyze large numbers of images by contrast to those we?
