3B. Data and Databases

Critical and Practical Issues

(JDrucker 9/2013)

Basics

What is data? We take the term for granted because it is so ubiquitous. The phrase “big data” is bandied about constantly, and it conjures images of nearly infinite amounts of information codified in discrete units that make it available for analysis and research in realms of spying, commerce, medicine, population research, epidemiology, and political opinion, to name just a few. But all data starts with decisions about how it is made. Data does not exist in the world. It is not a form of atomistic information waiting to be counted and sorted like cells in a swab or cars on a highway. Instead, data is made by defining parameters for its creation. So before we begin to deal with databases, and the ways their structure supports various kinds of activity, we have to address the fundamental theoretical and practical issues involved in the concept and production of data.

For instance, if we look around the room where we are and decide what to measure, what can be quantified? Temperature and physical qualities of the room, demographic statistics on the persons present, features of the university, and so on. Basically, anything to which you can give a metric can be transformed into “data” by observation and measure. Data is anything you can parameterize. But what is the scale that we use to capture this information about phenomena? Do we use a temperature gauge that would work on the surface of the sun to tell the difference between one person’s body temperature and another’s? Between the heat at the edge of the room by the window and the temperature by the door? What scale registers significant differences? The creation of significant description from raw phenomena is the task of data creation, which is why the term “capta” makes more sense. “Data” derives from the Latin word datum, which means “given.” “Capta,” from the Latin for “taken,” suggests active “capture” and creation or construction. Because all parameterized information depends on the point of view from which it was created, capta explains the process of creating quantitative information while acknowledging the “madeness” of the information.

Exercise: Data analysis in the present situation. If your only tool is a hammer, you see only nails. If your only approach to phenomena is to transform them into things that are quantified, you see everything as something to be measured. But what scale or unit or system of measure is being used? The answers connect us back to questions of value across and within cultures. “A day’s walk” or a “woman’s work” have no absolute value and no transcendent parameters.

(Example: Imagine that an alien anthropologist from a nocturnal culture, capturing “data” about classroom use at UCLA, finds most of the spaces under-utilized. The information visualization made to show the occupation of the university suggests it can accommodate many more students, because the “data” were collected at one time of day instead of another. In this example, the simple matter of when a data set is collected restructures the results.)
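The effect of collection time can be sketched in a few lines of Python. The hourly head counts below are invented purely for illustration (they are not real UCLA figures), and the room capacity and the function name `utilization` are likewise assumptions.

```python
# Hypothetical head counts for one classroom, keyed by hour of day (24h clock).
# All numbers are invented for illustration only.
occupancy_by_hour = {2: 0, 4: 0, 10: 38, 12: 40, 14: 35, 22: 2}
ROOM_CAPACITY = 40  # assumed number of seats

def utilization(hours):
    """Average share of seats filled over the sampled hours."""
    counts = [occupancy_by_hour[h] for h in hours]
    return sum(counts) / (len(counts) * ROOM_CAPACITY)

night_sample = utilization([2, 4, 22])   # what a nocturnal observer records
day_sample = utilization([10, 12, 14])   # what a daytime observer records

print(f"night: {night_sample:.0%}, day: {day_sample:.0%}")
```

Both observers measure the same room with the same method; only the sampling window differs, and with it the conclusion about whether the room is under-used.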

Metric standards have their own strange histories. We know that inches and centimeters are human-created standards for measure of space and dimension. A year, by contrast, has a relation to a natural cycle of motion around the sun, as the day is determined by the turning earth. But what is the means by which a “minute” is determined or a day broken into hours? Are all hours the same? Medieval monks had a system for dividing the day into twelve hours of daylight and twelve of darkness throughout the year. In summer the daylight “hours” were longer than in winter, and the night hours correspondingly shorter, but the division of units served their purposes. If we are transcribing the record of activities from a monastery in this period, how do we reconcile these differences with the standard measures of time we are accustomed to using?
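One way to approach the reconciliation is arithmetic: a seasonal daylight “hour” is one twelfth of the time between sunrise and sunset. The following is a minimal sketch, assuming we know local sunrise and sunset for the date in question. The function name and the sample times are hypothetical and are not taken from any historical record.

```python
from datetime import datetime

def temporal_hour_to_clock(hour_number, sunrise, sunset):
    """Return the clock time at which daylight hour `hour_number` (1-12) begins.

    Assumes daylight is split into twelve equal parts, so each "hour"
    lasts (sunset - sunrise) / 12.
    """
    if not 1 <= hour_number <= 12:
        raise ValueError("daylight hours are numbered 1 to 12")
    hour_length = (sunset - sunrise) / 12
    return sunrise + (hour_number - 1) * hour_length

# Illustrative sunrise and sunset times only.
summer_rise = datetime(2013, 6, 21, 5, 0)
summer_set = datetime(2013, 6, 21, 21, 0)
winter_rise = datetime(2013, 12, 21, 8, 0)
winter_set = datetime(2013, 12, 21, 16, 0)

summer_hour = (summer_set - summer_rise) / 12   # length of one summer daylight hour
winter_hour = (winter_set - winter_rise) / 12   # length of one winter daylight hour
third_hour_summer = temporal_hour_to_clock(3, summer_rise, summer_set)
third_hour_winter = temporal_hour_to_clock(3, winter_rise, winter_set)
```

With these sample values, “the third hour” begins at a different clock time in each season, so a transcription that silently converts it to a modern time has already made an interpretive decision.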

Temperature data seems to be empirically derived, based on the thermal condition of phenomena under investigation. But the Celsius (Centigrade) and Fahrenheit scales have very different units. The Fahrenheit scale is an idiosyncratic scale, rooted in the experience of the man who designed it. According to one popular account, he defined the low end as the coldest temperature taken in the town where he lived, the midpoint as body temperature, and the high point as that at which water boils. This was later refined and made into a more precise system, but it is remarkable that a standard metric was created with a human reference point (he had a slight fever when defining the precise body temperature). In an important sense, all metrics share this characteristic: they are created in reference to human experience, but they function as if they are value-neutral and universal.
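The difference in units can be made concrete with the standard conversion formula, F = C × 9/5 + 32. The short sketch below shows that a one-degree step means different things on the two scales; the function name is our own.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit using F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

freezing_f = celsius_to_fahrenheit(0)     # freezing point of water
boiling_f = celsius_to_fahrenheit(100)    # boiling point of water

# A step of one Celsius degree spans 1.8 Fahrenheit degrees.
step_f = celsius_to_fahrenheit(38) - celsius_to_fahrenheit(37)
print(freezing_f, boiling_f, round(step_f, 1))
```

A record that says only “the temperature rose by two degrees” is therefore incomplete as data until the scale is named.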

Exercise: Create a value scale that is relevant to your experience and to a domain of knowledge that you can use to “measure” the differences among phenomena in that domain.

In the day-to-day creation of data sets and databases, these more theoretical questions are not asked; instead, we get on with the business of using standard metrics, categories, classification systems, and spreadsheets to make databases. Databases come in many forms: flat, relational, object-oriented, and so on. Databases can be described by their contents, their function, their structure, or other characteristics. For our purposes, we will begin with a very simple flat database that can be created in a spreadsheet. Then we’ll see its limitations, and create a relational database. Our case study involves the fictional pet talent agency, StarPaws.

Creating a data model is the first step of database construction. What are the kinds of information that need to be stored and how will they be identified and used? How often will they change? How do the components relate to or affect each other? These questions are not really answered in the abstract, but by doing: defining the content types and making a model of their relations. This can be done on paper, by hand, and/or using a database design tool, but the technological elements are dependent on the conceptual ones. A database is only as good as its content model.

The term “content type” refers to a type of content you want to distinguish, such as a name, address, or age in a personnel record, or, in the case of books or music, title, author, publisher, etc. What are the content types for materials in your domain? The content types name the categories; the data entered under them are the actual information. A spreadsheet is a simple way to make a data set. It is also powerful because data in a spreadsheet can be manipulated there, exported for other purposes, and related to other data elements in more complex databases. The graphic format is simply rows and columns.
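The row-and-column format, and the ease of export, can be sketched with Python's standard `csv` module. The column names and the two sample records are invented for illustration; each column corresponds to one content type.

```python
import csv
import io

# Each column heading is a content type; each row is one record.
# Field names and values are hypothetical.
fieldnames = ["title", "author", "publisher", "year"]
records = [
    {"title": "Sample Title B", "author": "Author, Second",
     "publisher": "Press Two", "year": 1990},
    {"title": "Sample Title A", "author": "Author, First",
     "publisher": "Press One", "year": 2005},
]

# Sort the rows by any column, as a spreadsheet would.
by_title = sorted(records, key=lambda row: row["title"])
by_year = sorted(records, key=lambda row: row["year"])

# Export to CSV text, a format most other tools can import.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(by_title)
csv_text = buffer.getvalue()
print(csv_text)
```

The same records can be reordered by title or by year without retyping anything, which is what a plain text document cannot offer.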

Exercise: StarPaws Pet Talent Agency. Imagine that your rich, eccentric Hollywood uncle has left you an inheritance in the form of a pet talent agency. He was very old school, and kept his client and talent lists on 3 x 5 cards in long boxes. The talent cards hold elaborate records of the animals, owners, talents, kennels, addresses, etc., and there are also cards for the clients. If you simply type the information into a text document, you cannot sort it by categories, but would have to read through all the entries to find information. The value of a spreadsheet is that you can sort the rows by the values in any column, using various methods (alphabetical order, numerical order, date, size, etc.).

First, imagine the cards, create the information for ten of them. Be sure to include the owners, pet names, roles played, talents, descriptions of pets, and other relevant information.

Then, figure out what the content types are and create a spreadsheet. What if three people are all transferring information from the cards? Do they all enter the information in the same format (e.g., names as “last name, first name” or not; date of birth as dd/mm/yy or mm/dd/yyyy)? What are the implications of such decisions? Are all the cards standardized? Do some have information fields not in other cards? Will you organize the project by owner names or pet names? Or by talent/skills?
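The cost of inconsistent entry can be seen by trying to reconcile it afterward. The following is a minimal sketch, assuming the typists used different name orders and date conventions; the helper names and sample values are hypothetical. Note that an ambiguous date such as 03/04/2010 cannot be resolved by code alone: someone has to know which convention the typist followed.

```python
from datetime import datetime

def normalize_name(raw):
    """Return a name as 'Last, First', accepting 'First Last' or 'Last, First'.

    Simplification: assumes exactly one given name and one family name.
    """
    raw = raw.strip()
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
    else:
        first, last = raw.rsplit(" ", 1)
    return f"{last}, {first}"

def normalize_date(raw, convention):
    """Parse a date using the typist's known convention; return ISO yyyy-mm-dd."""
    formats = {"dmy": "%d/%m/%Y", "mdy": "%m/%d/%Y"}
    return datetime.strptime(raw, formats[convention]).date().isoformat()

names = [normalize_name("Rosa Alvarez"), normalize_name("Alvarez, Rosa")]

# The same string yields two different dates depending on the convention.
as_dmy = normalize_date("03/04/2010", "dmy")
as_mdy = normalize_date("03/04/2010", "mdy")
print(names, as_dmy, as_mdy)
```

Agreeing on one format before data entry begins is far cheaper than writing this kind of repair code later.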

Now create a scenario in which the information changes: a pet’s owner changes, a new pet with the same talent but a different name joins a kennel, a pet with the same name but different skills is added, etc.

What about the roles played by the various animals? Can you link the talent to the roles? What if you are looking for a dog of a certain color, with the ability to dance on hind legs while juggling, that is located in Marina del Rey and available for work next week? You begin to see the difficulties and advantages of organizing information in a structured way. Humanities domains bring their own challenges to the design of the conceptual model of data.
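Once the cards are structured records, a compound question like the one above becomes a filter over fields. The sketch below uses plain Python; every record, field name, and value is invented for illustration.

```python
# Hypothetical StarPaws records; every value is invented.
pets = [
    {"name": "Biscuit", "species": "dog", "color": "brown",
     "talents": {"dance on hind legs", "juggle"},
     "location": "Marina del Rey", "available_next_week": True},
    {"name": "Mango", "species": "dog", "color": "brown",
     "talents": {"dance on hind legs"},
     "location": "Marina del Rey", "available_next_week": True},
    {"name": "Pepper", "species": "cat", "color": "black",
     "talents": {"juggle"},
     "location": "Pasadena", "available_next_week": False},
]

def find_pets(records, species, color, required_talents, location):
    """Return names of available pets that match every criterion."""
    return [
        pet["name"]
        for pet in records
        if pet["species"] == species
        and pet["color"] == color
        and required_talents <= pet["talents"]   # all required talents present
        and pet["location"] == location
        and pet["available_next_week"]
    ]

matches = find_pets(pets, "dog", "brown",
                    {"dance on hind legs", "juggle"}, "Marina del Rey")
print(matches)
```

Notice that `talents` holds several values in a single field. A spreadsheet cell handles that awkwardly, which is one of the difficulties the exercise is meant to surface.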

A spreadsheet is a fairly simple form of data structure, but it is also a powerful instrument for analysis, modeling, and work of various kinds. Spreadsheets were created in analogue environments for the management of information, as well as for the presentation and analysis of data. If you want to look at a budget, for instance, a spreadsheet is a good way to do it, and if you want to project forward what a change in a pay rate or an interest rate will do to costs, it is exceptionally useful to be able to automate this process. This is what made the automated spreadsheet VisiCalc, created in the late 1970s, into what was known as the first “killer app.” The digital spreadsheet is considered the application that made computing an integral part of business life.

Some milestones in the history of spreadsheet and database development include the following:

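  • 1969-70 LANPAR, “automatic natural order recalculation algorithm” (Rene Pardo and Remy Landau)
  • 1970-72 Edgar Codd, relational database concepts
  • 1974-76 IBM SEQUEL (Structured English Query Language); QUEL, the query language of the Ingres project
  • 1970s RDBMS (relational database management systems)
  • 1979 VisiCalc (Dan Bricklin and Bob Frankston), the “killer app” on Apple and later IBM machines
  • 1980s Lotus 1-2-3
  • 1980s SQL (pronounced “sequel”)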

Exercise: StarPaws, continued. A spreadsheet provides many advantages over a card catalogue or Rolodex, and it is considered a “flat” database: all of the information is stored in one table. A relational database breaks information into multiple tables linked by keys. These permit data to be grouped by relations. One crucial feature of relational databases is that they allow data to vary dependently (when a dog’s owner changes, so does the telephone number for locating it) or independently (when several dogs play the same role in a film, the role stays stable but the relation to the pets varies). If you take the information in your cards and/or spreadsheets and organize it into a set of tables, which pieces of information belong together and which will be separate? Why? You can draw this on paper.
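The dependent and independent variation described above can be sketched with Python's built-in `sqlite3` module. The table and column names and all sample values are assumptions made for illustration; this is one possible layout, not the answer to the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE owners (owner_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE pets   (pet_id INTEGER PRIMARY KEY, name TEXT,
                         owner_id INTEGER REFERENCES owners(owner_id));
    CREATE TABLE roles  (role_id INTEGER PRIMARY KEY, title TEXT);
    -- Link table: many pets can play one role, and one pet many roles.
    CREATE TABLE pet_roles (pet_id INTEGER REFERENCES pets(pet_id),
                            role_id INTEGER REFERENCES roles(role_id));

    INSERT INTO owners VALUES (1, 'Owner One', '555-0101'),
                              (2, 'Owner Two', '555-0102');
    INSERT INTO pets   VALUES (1, 'Biscuit', 1), (2, 'Mango', 2);
    INSERT INTO roles  VALUES (1, 'Dancing Dog');
    INSERT INTO pet_roles VALUES (1, 1), (2, 1);
""")

PHONE_QUERY = """
    SELECT owners.phone FROM pets
    JOIN owners ON pets.owner_id = owners.owner_id
    WHERE pets.name = ?
"""

phone_before = conn.execute(PHONE_QUERY, ("Biscuit",)).fetchone()[0]

# Dependent variation: change the owner key and the phone number follows.
conn.execute("UPDATE pets SET owner_id = 2 WHERE name = 'Biscuit'")
phone_after = conn.execute(PHONE_QUERY, ("Biscuit",)).fetchone()[0]

# Independent variation: the role is stored once, whoever plays it.
players = [row[0] for row in conn.execute("""
    SELECT pets.name FROM pet_roles
    JOIN pets ON pets.pet_id = pet_roles.pet_id
    WHERE pet_roles.role_id = 1 ORDER BY pets.name
""")]
print(phone_before, phone_after, players)
```

The phone number lives only in the `owners` table, so it never has to be retyped on a pet's record. In a flat spreadsheet the same change would mean editing every row that mentions that pet.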

Whether you build a database in a software program like Access, FileMaker, MySQL, or any other, the principles are essentially the same for all relational databases. However, other forms of database structure exist that do not depend only on an entity-relationship model, but also on other principles. Look at object-oriented databases, RDF formats, and linked open data (LOD). If you build a database, you design the content model, create fields for data entry, and design the relationships. Then you build a form-based entry interface for putting data into the database. The form might be organized very differently from the database in order to make it more useful or coherent. Learning to manipulate the data through searches/queries, reports, and other methods will show you the value of a database for the management of information as well as metadata.
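As a contrast with tables, RDF-style data is expressed as subject-predicate-object statements (“triples”). The sketch below imitates that shape with plain Python tuples rather than a real RDF library; the identifiers such as `ex:biscuit` are made up, and the `match` helper is hypothetical.

```python
# Each statement is a (subject, predicate, object) triple.
# The "ex:" identifiers are invented placeholders, not real URIs.
triples = [
    ("ex:biscuit", "ex:type", "ex:Dog"),
    ("ex:biscuit", "ex:ownedBy", "ex:owner1"),
    ("ex:biscuit", "ex:hasTalent", "ex:juggling"),
    ("ex:mango", "ex:type", "ex:Dog"),
    ("ex:mango", "ex:hasTalent", "ex:dancing"),
]

def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

dogs = [s for (s, _, _) in match(triples, predicate="ex:type", obj="ex:Dog")]
biscuit_facts = match(triples, subject="ex:biscuit")
print(dogs, len(biscuit_facts))
```

In this shape, a new kind of statement can be added as one more triple, with no table to redesign. The trade-off is that consistency has to be maintained by shared vocabularies rather than by a fixed schema.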

The basic principles of database management and design are modularity, content type definition or data modelling, and relations, and then the combinatoric use of data through selection and display. Since all data is capta, that is, a construct made through interpretation, databases are powerful rhetorical instruments that often pass themselves off as value-neutral observations or records of events, information, or things in the world.

Exercise: Think about census data and categories that have been taken as “givens” or as “natural” in some cultures and times in history that might now be questioned or challenged. If medical data and census data are linked, can you see problems in the ways these worldviews might differ?

Data structures, like classification systems, organize and express values. Michael Christie’s article pays attention to the ways database structures limit what can be said and/or done with cultural materials. Why does he argue for narrative and the need for multi-dimensional, non-linear forms? How are his issues related to the Wallack and Srinivasan essay read earlier?

Exercise: Discuss and paraphrase the following points from Christie:

  • Digital songlines – relation to space/place
  • Kinship, language, humor relation to environment, embedded
  • Cartesian systems – rational, object and representation distinct
  • Storyworld not storyline
  • Collaboration with a sentient landscape/multi-layered

Takeaway

Flat databases create a structure in which content can be stored by type. Relational databases allow information to be controlled and varied according to whether it is in a dependent or independent relation. Databases allow for authority control, consistency, and standardization across large bodies of information.

Required Readings for 4A

  • * Manovich, “Database as Symbolic Form”
  • * Ed Folsom, “Database as genre,” PMLA
  • * Responses to Folsom, PMLA

Study Questions for 4A:

  1. What does Lev Manovich mean by “database logic,” and do his distinctions between narrative (sequential, linear, causal trajectory) and database (unordered and unstructured) match your experience of using ORBIS, the Chicago Encyclopedia, or the Whitman Archive (pick one)?
  2. In what ways do Ed Folsom’s and Jerome McGann’s descriptions of what constitutes a database match or differ from Manovich’s (and each other’s)? You may try to include some discussion of whether their comments share an attitude about the “liberatory” subtext of Manovich’s approach, but this is not necessary.

Copyright © 2014 - All Rights Reserved