2A. HTML and Structured Data

(JDrucker 9/2013)

Structured data, content modelling, interpretation, and display

All content in digital formats can be characterized as structured or unstructured data. In actuality, all data is structured: even typing on a keyboard “structures” a text as an alphabetic file in which each keystroke is recorded as a character code (such as ASCII). The distinction of one letter from another or from a number structures the data at the primary level. But the concept of “structured data” is used to refer to another, second, level of organization that allows data to be managed or manipulated through that extra structure. Common ways to structure data are to introduce mark-up using tags, to use comma-separated values, or to use other data structures.
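
As a minimal sketch of this second level of organization, the Python snippet below holds the same information first as a plain sentence and then as comma-separated values. The column names (speaker, act, scene) and the sample rows are assumptions made up for illustration; they do not come from any particular project.

```python
import csv
import io

# Unstructured: the facts are present, but only a human reader can pick them out.
sentence = "Juliet speaks in Act 2, Scene 2; Romeo speaks in Act 1, Scene 1."

# Structured: the same facts as comma-separated values with named columns.
# The column names are illustrative assumptions.
csv_text = """speaker,act,scene
Juliet,2,2
Romeo,1,1
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# The extra structure lets a program select by field rather than by guesswork.
juliet_rows = [row for row in rows if row["speaker"] == "Juliet"]
print(juliet_rows)
```

The values are still just text; what has changed is that each value now sits in a named position that a program can address.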

The distinction between structured and unstructured data has ramifications for the ways information can be used, analyzed, and displayed. Structured data is given explicit formal properties by means of the secondary levels of organization, or encoding, referred to above. These use extra elements (such as tags, to be discussed below), data structures (tables, spreadsheets, databases), or other means to add an extra level of interpretation or value to the data. The term unstructured data is generally used to refer to texts, images, sound files, or other digitally encoded information that has not had a secondary structure imposed upon it.

Sidebar: Think about the text of Romeo and Juliet. Every line in the play is structured by virtue of being alphabetic. But the text is also divided into lines spoken by characters, stage directions, and information about the act, scene and so on. If we want to find any instance of “Juliet,” a simple string search will locate the name. That is a search operation on unstructured data. But if we want to be able to pull all of the lines spoken by Juliet, we would have to introduce a tag that marks who is speaking, such as <speaker>, into the text. The degree of granularity introduced by the structure will determine how much control we have over the manipulation and/or analysis. Every line could be marked for attributes such as class, race, or gender, but if we then wanted to sort or analyze all of the lines with obscene language, this set of tags, or structures, would be of no use. Every act of structuring introduces another level of interpretation, and is itself an act of interpretation, with powerful implications.
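
The difference between the two kinds of search in the sidebar above can be sketched in Python. This is only an illustration: the <speech> element and its "speaker" attribute are hypothetical names chosen for the sketch, not the tags of any published edition of the play.

```python
import xml.etree.ElementTree as ET

# A tiny hand-made fragment. The <speech> element and its "speaker"
# attribute are hypothetical names chosen for this sketch.
play = """<scene>
  <speech speaker="Romeo">But, soft! what light through yonder window breaks? It is the east, and Juliet is the sun.</speech>
  <speech speaker="Juliet">O Romeo, Romeo! wherefore art thou Romeo?</speech>
  <speech speaker="Juliet">What's in a name?</speech>
</scene>"""

# Unstructured search: counts the string "Juliet" wherever it occurs,
# including inside the markup and inside Romeo's own line.
string_hits = play.count("Juliet")

# Structured search: uses the markup to select only speeches by Juliet.
root = ET.fromstring(play)
juliet_lines = [
    speech.text
    for speech in root.findall("speech")
    if speech.get("speaker") == "Juliet"
]

print(string_hits)
print(juliet_lines)
```

The string search cannot tell a line about Juliet from a line by Juliet. The structured search can, but only because someone decided in advance that "who is speaking" was worth encoding.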

The most ubiquitous and familiar form of mark-up is HTML (Hypertext Markup Language), which was created to standardize display of files carried over the internet, read by browsers, and displayed on screens. Many scholarly projects make use of other forms of markup language, and the principles that are fundamental to HTML transfer to their use, even if each markup language is different. SGML (Standard Generalized Markup Language), the older standard from which HTML was derived, predates the Web and, technically, should be considered a metalanguage, that is, a language used to describe other languages. Mark-up languages for the Web were designed to standardize communication and, in essence, to make files display in the same way across different browsers and platforms. Good resources for understanding mark-up can be found at http://www.w3.org/MarkUp/SGML/ and http://www-sul.stanford.edu/tools/tutorials/html2.0/gentle.html

Sidebar: Markup languages come in many flavors. Geospatial information uses KML, many text-based projects use a standard called TEI (Text Encoding Initiative), and so on. The use of these standards helps projects communicate with each other and share data. A good exercise is to study a tag set for a domain in your area of interest or expertise and/or make one of your own. For instance, the creation of a specialized tag set allows people working in a shared knowledge domain to create consistency across collections of documents created by different users (e.g. Golf Markup Language, Music Markup Language, Chemical Markup Language etc.). But a mark-up language is also a naming system, a way to formalize the elements of a domain of knowledge or expressions (e.g. texts, scores, performances, documents). In spite of the growing power of natural language processing (referred to as NLP), structured data remains the most common way of creating standards, formal systems, and data analysis. Structured data is particularly crucial as collections of documents grow in scale or complexity, or are integrated from a variety of users or repositories. Standards in data formats make it possible for data in files to be searched and analyzed consistently. (If one day you mark up Romeo and Juliet using the <girl> and <boy> tags and the next day someone else uses <man> and <woman> for the same characters, that creates inconsistency.) In reality, the implementation of standards is difficult, inconsistency is a fact of life, and data crosswalks (matching values in one set of terms with those in another) only go partway towards fixing this problem. Nonetheless, structuring data is a crucial aspect of Digital Humanities work.
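
A data crosswalk of the kind mentioned in the sidebar above can be sketched as a simple lookup table. The mapping and the sample tags below are assumptions invented for the illustration; the "nurse" tag is included to show why a crosswalk only goes partway.

```python
# Two hypothetical tag vocabularies for the same characters, following the
# <girl>/<boy> versus <woman>/<man> example. The mapping is an assumption.
crosswalk = {"girl": "woman", "boy": "man"}

# Tags found in one project's copy of the play (made-up data).
# "nurse" has no counterpart in the crosswalk.
tags_project_a = ["girl", "boy", "girl", "nurse"]


def apply_crosswalk(tags, mapping):
    """Translate tags through the mapping; collect those it cannot match."""
    translated, unmatched = [], []
    for tag in tags:
        if tag in mapping:
            translated.append(mapping[tag])
        else:
            unmatched.append(tag)
    return translated, unmatched


translated, unmatched = apply_crosswalk(tags_project_a, crosswalk)
print(translated)
print(unmatched)
```

Whatever lands in the unmatched list needs a human decision, which is one reason crosswalks reduce inconsistency without removing it.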

The standards for tags in Web markup languages such as HTML, along with their definitions, rules for use, and other guidelines, are maintained by the W3C (World Wide Web Consortium). The W3C page listed below also contains a list of existing markup languages, which is fascinating to read.

See: http://www.w3.org/MarkUp/SGML/

HTML

If you understand the basic principles of any markup language, you will be able to extend this knowledge to any other. Because HTML is so common, it is a good starting place. Simply stated, all files displayed on the Web use HTML in order to be read by a browser. Other file formats (jpg, mp3, png, etc.) may be embedded in HTML frameworks (as a picture, television, speaker, or aquarium might be held in a physical frame), but HTML is the basic language of the web. Again, it is called a “mark-up” language because it uses tags to instruct a browser on how to display information in a file. HTML can be considered crude and reductive, and when it was first created, it angered graphic designers because it used a very simple set of instructions to render text simply in order of size and importance (boldness). Early HTML made no allowance for the use of specific typefaces, for instance.

HTML tags name the elements of a file (e.g. header, paragraph, line break) for the purposes of standardizing the display. Essentially, HTML serves as encoded instructions for the browser. All markup languages and structured data are subject to the rules of well-formedness. This means the files must be made so that they conform to the rules of markup in order to display properly, or “parse,” in the browser. A file that does not parse is like a play made in a sport to which it does not belong (a home run does not “parse” in football) or a structure that is not correct (a circle that does not close) because it does not conform to the rules. HTML is governed by its own rules and by those common to all markup languages.
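
The idea of well-formedness can be sketched with a short Python check. One assumption should be stated loudly: the sketch uses Python's strict XML parser as a stand-in, whereas real browsers are far more forgiving and will usually try to display HTML that breaks the rules. The function name is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


closed = "<p>A paragraph with <b>bold</b> text.</p>"
unclosed = "<p>A paragraph with <b>bold text.</p>"  # <b> is never closed

print(is_well_formed(closed))
print(is_well_formed(unclosed))
```

The second string is the "circle that does not close": the parser reaches the closing paragraph tag while the bold element is still open and rejects the file.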

Because markup languages structure data, they can be used for analysis. HTML tags mark up the physical features of documents; they do not analyze content. HTML does not have tags for <proper_name_female_girl>, for instance. But in a textual markup system a more elaborate means of structuring allows attributes to modify terms and tags to produce a very high degree of analysis of semantic (meaning) value in a text. When markup languages are interpretative and analytic, they can be processed before the information in them is displayed (e.g. give me all the instances of a male speaker using obscene language). The processes of data selection, transformation, and display are each governed by instructions. Display can be managed by style sheets so that global instructions can be given to entire sets of documents, rather than having each document styled independently (e.g. all chapter titles will be blue, 24 point Garamond, with three lines of space following, indented 3 picas). Style sheets can be maintained independently, and documents “reference” them, or call on them for instructions. A single style sheet can be used for any number of web pages. Suppose you decide to change all of your chapter titles from bold to italic. Do you want to change the <b> tag surrounding each chapter title to <i>? Or do you want to change a style sheet that instructs all text marked <chapter_title> to be displayed differently? More powerful style sheets, called Cascading Style Sheets (CSS), are the common way to control display to a very fine degree of design specification.
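
The bold-to-italic question above can be sketched in Python. Here a dictionary stands in for a style sheet and two short strings stand in for web pages. The class name "chapter-title", the helper functions, and the CSS values are all assumptions chosen for the illustration.

```python
# One shared rule set, standing in for a style sheet. The class name
# "chapter-title" and the CSS values are illustrative assumptions.
stylesheet = {"chapter-title": "font-weight: bold; color: blue;"}


def render_css(rules):
    """Turn the rule dictionary into CSS text."""
    return "\n".join(f".{name} {{ {decl} }}" for name, decl in rules.items())


def render_page(title, body):
    """Mark the title by its role (a class), not by its appearance (<b> or <i>)."""
    return f'<h1 class="chapter-title">{title}</h1>\n<p>{body}</p>'


pages = [
    render_page("Chapter 1", "First text."),
    render_page("Chapter 2", "Second text."),
]

# Changing one rule restyles every page; no page has to be edited.
stylesheet["chapter-title"] = "font-style: italic; color: blue;"
css = render_css(stylesheet)
print(css)
```

Because the pages record only that a piece of text is a chapter title, the decision about how chapter titles look lives in one place and can be changed once.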

Exercise

Style a page, then create a style sheet to govern all style features globally across a collection of pages.

Exercise

What does HTML identify? Describe the formal / format elements of documents.

What doesn’t it do? What would be necessary to model content? How is TEI different from HTML?

Look at Whitman: http://www.whitmanarchive.org/

            Rossetti: http://www.rossettiarchive.org/index.html

            Find poems, translators, authors, prose, commentary, footnotes etc.

                        Can you extract, search, analyze, find, style?

Structured data is crucial for scholarly interpretation. In answering the question, “How is digital humanities different from web development?” we immediately recognize the difference between displaying content and analyzing it interpretatively: in a digital humanities project, structure and argument are integrally related.

Exercise

Take John Unsworth’s seven scholarly primitives (discovering, annotating, comparing, referring, sampling, illustrating, representing) and see how they are embodied in a digital humanities site vs. a commercial site (Amazon). To what extent are social media sites engaged in digital humanities activities?

 Sites:  

Blake: http://www.blakearchive.org/blake/

Spatial history project: Republic of Letters

http://republicofletters.stanford.edu/case-study/voltaire-and-the-enlightenment/

VCDH: Valley of the Shadow http://valley.lib.virginia.edu/VoS/choosepart.html

Salem Witch Trial Project: http://etext.virginia.edu/salem/witchcraft/

Exercise

Discuss the ways in which Will Thomas’s discussion of the shift from quantitative methods to digital humanities questions is present in any of these sites. What is meant by the term cliometrics? How does it relate to traditional and digital humanities?

Exercise

Tools for Annotation:

DiRT: https://digitalresearchtools.pbworks.com/w/page/17801672/FrontPage

Exercise

Take time to look at the ways in which structure is present in every aspect of a digital humanities project site, from display to repository, to ways of organizing information, navigation, and use. Take apart and analyze: Perseus Digital Library http://www.perseus.tufts.edu/hopper/

            What are the elements of the site?

            How do they embody and support functionality?

            What does the term content model mean theoretically and practically?

Takeaways:

Structured data has a second level of organization.

Markup languages are a common means of structuring data.

Markup languages are metalanguages, languages that describe language.

Structured data expresses a model of content and interpretation. Structuring data allows analysis, repurposing, and manipulation of data/texts/files in systematic ways. It also disambiguates, for example between the place name “Washington” and the personal name.

Consistency is crucial in any structured data set.

Structured data is interpreted, and can be used for analysis and manipulation in ways that unstructured data cannot.

Recap:

–       Model of DH projects repository/metadata/dbase/service/display

–       Mark-up languages as a way to make structured data.

Readings for 2B:

C2DH: Chapter 14, Sperberg-McQueen, Classification and its Structures

Michel Foucault, “Introduction,” The Order of Things, citing Borges

serendip.brynmawr.edu/sci_cult/evolit/…/prefaceOrderFoucault.pdf

Musical instrument classification,

http://en.wikipedia.org/wiki/Musical_instrument_classification

Study questions for 2B:

  1. What are the ways you can get at the worldview embodied in a classification system?

 

 

Copyright © 2014 - All Rights Reserved