Getting Started with DDI-Codebook
This document provides a basic introduction to using the DDI-Codebook specification. Though the explanations are primarily tailored to users who deal with large numbers of files, the information should be useful for all DDI users. It is assumed here that you've had some experience with XML; if not, consult our FAQ.
The markup task may appear daunting at first, so we have broken the process down into discrete steps. If you have questions or would like some advice on your specific situation, please don't hesitate to get in touch with us. You may also want to join the DDI list-serv.
If you're curious to see how other organizations make use of DDI XML, see DDI Implementations.
- Step One: Review the Tags
- Step Two: Mark Up and Validate Documents
- Step Three: Select and Implement Display Software
- Step Four: Test Marked Up Documents
Step One: Review the Tags
Step One is all about organizing information, and for this kind of task it may be useful to consult a librarian or an archivist who is familiar with metadata.
You may also find it helpful to review some sample codebooks marked up by ICPSR (note that these codebooks are marked up according to Version 1.0 of the specification).
- First, you need to determine what information will be recorded in your DDI instances. The DDI features over 300 different tags (most of them optional), but it's unlikely that anyone would make use of all of them.
The DDI has five main sections:
- Document Description -- This is essentially "header" or citation information about the marked up DDI instance itself. You may decide to use only a few of the elements in this section.
- Study Description -- This section describes the study at a broad level and includes information on geographic and temporal scope as well as methodological information.
- Files Description -- This section is a description of the physical data file(s) in terms of record and variable counts, logical record length, etc.
- Data (Variables) Description -- This section presents detailed information on each data item, including question text, variable label, category labels and values, etc.
- Other Related Materials -- Other documents or files related to the study.
If you have existing metadata records, the process of defining the fields you want to use is somewhat easier, as you can just look at what was recorded on the old records. At this stage, you should generate a simple list of metadata fields you used, with a brief explanation of each, like this:
TITLE - the title of the survey
PI - the principal investigator
UNIVERSE - the population covered by the survey
GEOGRAPHY - the countries covered by the survey
RESPONSE - the response rate
Again, you're just creating a template that includes the kinds of information you wish to record about the dataset. If you want this information to display in your DDI instance or codebook, it needs to be included on this list.
- Next, you need to find out which DDI tags those fields map to. This is best done by reading over the DTD or Schema itself, or the Tag Library. Essentially, you're building a mapping document that lays out the structure and content of your XML. Pay careful attention to whether a field is optional or repeatable, as this will affect your XML.
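As a rough illustration, a mapping document can be kept in machine-readable form so that records can be checked against it. The sketch below is hypothetical: the element paths and the required/repeatable flags are assumptions made for the example and should be confirmed against the DTD or Tag Library for the DDI version you use.

```python
# Hypothetical mapping from local field names to DDI-Codebook element paths.
# Paths and flags are assumptions for illustration; verify them in the Tag Library.
FIELD_MAP = {
    "TITLE": {"path": "stdyDscr/citation/titlStmt/titl", "required": True, "repeatable": False},
    "PI": {"path": "stdyDscr/citation/rspStmt/AuthEnty", "required": False, "repeatable": True},
    "UNIVERSE": {"path": "stdyDscr/stdyInfo/sumDscr/universe", "required": False, "repeatable": True},
    "GEOGRAPHY": {"path": "stdyDscr/stdyInfo/sumDscr/nation", "required": False, "repeatable": True},
    "RESPONSE": {"path": "stdyDscr/method/anlyInfo/respRate", "required": False, "repeatable": True},
}


def check_record(record):
    """Check one metadata record (field name -> list of values) against FIELD_MAP."""
    problems = []
    for field, rule in FIELD_MAP.items():
        values = record.get(field, [])
        if rule["required"] and not values:
            problems.append(f"{field}: required but missing")
        if not rule["repeatable"] and len(values) > 1:
            problems.append(f"{field}: not repeatable but has {len(values)} values")
    return problems
```

Running `check_record` over existing metadata records before markup begins can surface fields that will not fit the structure the DDI expects.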
- Check for any fields that can be entered automatically, i.e., any fields that will be the same for all of your XML documents. For example, in <docDscr>, the <prodStmt> section covers basic information about who produced the XML document, using what software, as part of which grant, etc. In all likelihood, this information will be the same for all the XML you're producing. The same is true of <distStmt>, and you'll want to record information on when an XML document was last updated in <verStmt>. In the <dataAccs> section, you can also record where the data can be accessed; this is a good place for your institution's name and Web site address.
Note that if you plan to prepare a whole series of DDI codebooks, you may want to create a local supplement to the tag library, listing locally-required elements and controlled vocabulary and canned language for specific fields. Staff may prefer to have this type of shorter "this is what we do here" document for ready reference. For example, the tag library supplement might refer the person doing the markup to an in-house production database for certain pieces of information, or it might assign responsibility for some part of the markup to a specific person.
- Now that you've gathered a list of the fields, mapped them to DDI elements, and decided on which DDI elements should be displayed for all documents, you just need to build your template, which should take the form of an XML document.
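One possible way to produce such a template is to generate it with a short script, so that the fields shared by all documents are filled in once. This is a minimal sketch, not a complete or validated instance: it assumes a DTD-based version of the specification (no XML namespace), and the function name and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_template(producer, access_uri):
    """Return a skeleton DDI-Codebook instance as an XML string."""
    root = ET.Element("codeBook")

    # Document Description: identical for every instance we produce.
    doc_cit = ET.SubElement(ET.SubElement(root, "docDscr"), "citation")
    ET.SubElement(ET.SubElement(doc_cit, "titlStmt"), "titl")  # filled in per document
    ET.SubElement(ET.SubElement(doc_cit, "prodStmt"), "producer").text = producer

    # Study Description: per-study fields stay empty in the template.
    stdy = ET.SubElement(root, "stdyDscr")
    stdy_cit = ET.SubElement(stdy, "citation")
    ET.SubElement(ET.SubElement(stdy_cit, "titlStmt"), "titl")
    set_avail = ET.SubElement(ET.SubElement(stdy, "dataAccs"), "setAvail")
    ET.SubElement(set_avail, "accsPlac", URI=access_uri).text = producer

    # Placeholders for the file and variable sections.
    ET.SubElement(root, "fileDscr")
    ET.SubElement(root, "dataDscr")
    return ET.tostring(root, encoding="unicode")
```

The empty elements are placeholders to be filled in for each study; the result still has to be validated against the DTD or Schema.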
- After you've got the template made, you need to validate it. This point needs to be emphasized:
Always validate your XML.
Depending upon what display software you choose, invalid XML may either not display at all, or display incorrectly. Validating your template will ensure that you don't waste a lot of time and energy on the next step.
Validating your XML should be relatively easy. Conventional XML editors include validation utilities, and if your server is set up to handle XML, you probably have server-side validation utilities installed already. ICPSR uses either XMetaL or oXygen to validate single XML documents, and Paxer (on the server) to validate large collections of XML documents. Another tool for validating XML is NetBeans, a free, open source Java development tool.
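Before a document can be validated it has to be well-formed, and that much can be checked with a few lines of script. The sketch below uses only Python's standard library and checks well-formedness only; it does not validate against the DDI DTD or Schema, which still requires a validating tool such as the editors mentioned above. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, "") if the text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""
```

A document that fails this check will also fail validation, so it can serve as a quick first pass.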
Step Two: Mark Up and Validate Documents
Step Two is more of a technical process. At this stage, you may want to consult a programmer, or someone who's proficient in XML, because this stage is all about coding (tagging). There may be ways to automate your markup, depending on your source materials. This step really depends upon the materials you are working with and what your ultimate goal for markup is.
The first thing you need to do is determine what software you're going to use to generate XML. If you're creating new documentation or if you're working from an unformatted text/Word document that's not suitable for text processing, then you may want to purchase an XML editor and begin the process of tagging individual documents, using your template. If your source document has a regular format, you may use text processing to insert DDI tags around the appropriate content. If your metadata is in a database or some sort of delimited format, then you'll most likely want to have a programmer build a script that generates XML from your original metadata files.
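For the delimited-file case, a script along the following lines might be a starting point. It is a sketch under assumptions: the input is comma-delimited with hypothetical column headers `name` and `label`, and only the variable name and label are written out.

```python
import csv
import io
import xml.etree.ElementTree as ET


def variables_to_ddi(delimited_text):
    """Build a <dataDscr> element with one <var> per row of the delimited input."""
    data_dscr = ET.Element("dataDscr")
    for row in csv.DictReader(io.StringIO(delimited_text)):
        # Each row becomes <var name="..."><labl>...</labl></var>.
        var = ET.SubElement(data_dscr, "var", name=row["name"])
        ET.SubElement(var, "labl").text = row["label"]
    return data_dscr
```

Building the elements with an XML library rather than by string concatenation means that characters such as `&` and `<` in labels are escaped automatically when the result is serialized.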
It is sometimes the case that there is more than one way to get to an end result in terms of the mark-up you need. Below we describe the different scenarios that we are familiar with, but these may not reflect the full range of options. If you have a situation or a mark-up goal not represented here and would like some advice, please get in touch with us at ddi@icpsr.umich.edu.
Scenario A
Source documents: MARC or Dublin Core catalog records (study descriptions)
Goal: To format catalog records in DDI for use with XML-based search engines and for interoperability with other DDI-based catalogs.
You may want to explore ICPSR's DDI XML metadata records.
Scenario B
Source documents: SPSS portable or system file for microdata and PDF questionnaire; or SPSS data definition statements, ASCII data file, and PDF questionnaire
Goal: To create a DDI-compliant file with variable-level information -- including question text -- for use with an online analysis package.
Possible strategies:
- Using Nesstar Publisher (http://www.nesstar.com) and a PDF converter:
  - Convert the PDF file to ASCII text using the utility found at: http://www.foolabs.com/xpdf/download.html.
  - Use Nesstar Publisher to convert the portable or system file to DDI-formatted variable-level information with frequencies. Or use the Publisher with the data definition statements and the raw ASCII data to achieve the same result.
  - Manually insert question text into the Nesstar-generated document.
  OR
  - Prepare a DDI file with question text that is derived from the ASCII questionnaire file. Variable names must match the variable names in the Nesstar-produced file. Merge the questions file with the variables file in Nesstar Publisher. You will now have a file to import into the Nesstar system for analysis.
- Using Survey Documentation and Analysis (SDA) (sda.berkeley.edu):
  - Convert the PDF file to ASCII text using the utility found at: http://www.foolabs.com/xpdf/download.html.
  - Use SDA tools to prepare a system file for SDA. The system file can be used in SDA for analysis, but you can also then convert it to DDI format for other purposes using the tool: SDA to XML (http://sda.berkeley.edu/ddi/tools/). This application converts SDA system files into DDI variable specifications, including frequencies and statistics. Question text is integrated as part of the preparation of the SDA system file.
  - As an alternative, use the SDA XCONVERT program (http://sda.berkeley.edu/ddi/tools/) to convert SAS/SPSS/Stata data definition statements to DDI format. Note that question text and frequencies will not be included in the product of this conversion.
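The Nesstar strategy above depends on question text being matched to variables by name. The sketch below illustrates that matching logic in a standalone script; it is not how Nesstar Publisher works internally, and the function name and the dictionary of questions are hypothetical. It places each <qstn> after the variable's last <labl>, which is an assumption about element order that should be checked against the DTD.

```python
import xml.etree.ElementTree as ET


def merge_question_text(variables_xml, questions):
    """Add <qstn><qstnLit> to each <var> whose name appears in `questions`.

    Returns the merged XML string and a sorted list of question names
    that matched no variable.
    """
    root = ET.fromstring(variables_xml)
    unmatched = set(questions)
    for var in root.iter("var"):
        name = var.get("name")
        if name not in questions:
            continue
        qstn = ET.Element("qstn")
        ET.SubElement(qstn, "qstnLit").text = questions[name]
        # Insert after the last <labl>; fall back to appending at the end.
        children = list(var)
        labl_positions = [i for i, c in enumerate(children) if c.tag == "labl"]
        position = labl_positions[-1] + 1 if labl_positions else len(children)
        var.insert(position, qstn)
        unmatched.discard(name)
    return ET.tostring(root, encoding="unicode"), sorted(unmatched)
```

The list of unmatched names is worth reviewing, since a mismatch usually means a variable name differs between the questionnaire file and the variables file.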
Scenario C
Source documents: Word-processed/ASCII/PDF legacy codebooks for microdata
Goal: To display DDI codebooks on the Web in a navigable way, separate from data.
Use an XML authoring tool such as XMetaL, oXygen, or XMLSpy for markup. First, you will need to import or reference the DDI DTD, which enables the markup package to present the necessary tags. Proceed according to package instructions, cutting and pasting from the source document into the new markup file.
Alternative: Use a Unix editor like emacs or a text processing tool like TextPad to add markup to the file. This can be accomplished using macros that work against the formatting of the file. You can also use regular expressions, a pattern matching language, to discern more complex patterns in the file and then search and replace or perform other text processing operations using the regular expressions.
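As an illustration of the regular-expression approach, the sketch below wraps variable lines from a legacy codebook in DDI tags. The input layout is an assumption made for the example: a variable name starting with a capital letter at the beginning of a line, then at least two spaces or tabs, then the label. Real codebooks will need a pattern fitted to their own formatting.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed layout: NAME, two or more spaces/tabs, then the label, all on one line.
VAR_LINE = re.compile(r"^(?P<name>[A-Z]\w*)[ \t]{2,}(?P<label>\S.*)$", re.MULTILINE)


def tag_variable_lines(text):
    """Replace each matching line with a <var> element; leave other lines untouched."""
    def to_var(match):
        name = quoteattr(match.group("name"))           # quoted, escaped attribute value
        label = escape(match.group("label").strip())    # escapes &, < and >
        return f"<var name={name}><labl>{label}</labl></var>"

    return VAR_LINE.sub(to_var, text)
```

Lines that do not match the pattern pass through unchanged, so the output should be reviewed by hand before validation.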
Apply an XSLT stylesheet to the markup (see Step Three). It's important to note that XSLT isn't a built-in capability of all Web browsers. Unless your computer's operating system is quite up-to-date, it's unlikely that you'll be able to view XML files, except as raw XML (i.e., you won't get the attractive display specified by the XSLT document). Using a server-side operation based on Cocoon is one option for transforming XML.
Scenario D
Source documents: Codebooks for aggregate data like the U.S. Census
Goal: To convert metadata describing aggregate data to DDI form in order to import into an analysis package or a data extractor/finder application.
Using an XML editor, follow the instructions in the following document to mark up aggregate/tabular data:
Block, William C. "How to Markup Aggregate Data in the DDI." (PPT 45K, PDF 124K)
Scenario E
Source documents: Output from Blaise Computer Assisted Interview software
Goal: To convert Blaise output to DDI-compliant XML form.
Use the program written by Janez Stebe of the Slovenian Data Archive, and subsequently modified by I-Lin Kuo: DDI_DTD.cif.zip (ZIP 4K)
To run this program, open Blaise and select Tools >> Cameleon.
The result of this script is the portion of the DDI describing the variables -- you will have to produce the other sections (Document, Study, and File Descriptions) by other means. The result of the script is not yet well-formed XML due to the existence of Blaise formatting characters within the output, but it's pretty close. After you run the script, open up the result with a text editor (DO NOT use an XML editor) and you'll see the instructions to clean the output to make it well-formed XML.
Validate your final document to ensure that everything is correct.
Step Three: Select and Implement Display Software
Step Three may involve both a systems/server administrator and an XML person.
Unless you're on an up-to-date Windows machine with the latest version of MSXML, your Web browser probably can't display XML on its own, save as raw code. In order to render XML into an attractive, understandable document, you'll need a stylesheet to dictate display, and you will need software to read the stylesheet (XSLT document) and render the XML accordingly.
For consistent display, you'll want to take advantage of a server-side solution, which should be installed on your Web server by the individual responsible for such things. Be aware that this kind of install is seldom a simple out-of-the-box thing. As mentioned above, Apache Cocoon, which is free software, may be used for this.
Once you've got the necessary software installed, you'll need to build an XML stylesheet (an XSLT document) or use ICPSR's generic stylesheet or one of the others provided on the site. The following site contains stylesheets created at the University of Guelph Data Resource Centre for use with DDI files: http://tdr.uoguelph.ca/DATA/WKSHPS/IASSIST2002/.
The XSLT specifies precisely how the XML should display. Basically, your XML display software will read the XML, then look to the XSLT for instructions, and generate a new file that it sends to the Web browser. XSLT is written in XML; it's not hard to learn if you already know HTML and XML, but it can be challenging if you don't.
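To make the read, transform, and emit flow concrete, the sketch below does by hand what a stylesheet processor does from XSLT instructions: it reads a DDI instance and generates an HTML table of variables. It is a stand-in written in plain Python, not XSLT and not a replacement for a server-side processor, and the function name is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def render_variables(ddi_xml):
    """Turn the <var> elements of a DDI instance into a simple HTML table."""
    root = ET.fromstring(ddi_xml)
    rows = []
    for var in root.iter("var"):
        name = html.escape(var.get("name", ""))
        label = html.escape(var.findtext("labl", default=""))
        rows.append(f"<tr><td>{name}</td><td>{label}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

In a real deployment the layout decisions made here in code would live in the XSLT document, so that display can change without touching the XML files.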
Step Four: Test Marked Up Documents
At this point, you have the XML files validated and sitting in the appropriate folder on your server, as well as your XSLT file(s). Your sysadmin has installed the necessary software, and you're ready to begin testing. With any luck, your files will display and the only changes you'll need to make are visual changes dictated by the XSLT.
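Before testing the display, it may help to run a quick batch check over the folder of XML files. The sketch below is one way to do that under stated assumptions: `REQUIRED_PATHS` is a hypothetical list of locally required elements, files are assumed to end in `.xml`, and the check covers well-formedness and element presence only, not validity against the DTD or Schema.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Hypothetical local requirements; adjust to match your own template.
REQUIRED_PATHS = ["stdyDscr/citation/titlStmt/titl"]


def check_folder(folder):
    """Return {file name: list of problems} for every .xml file in the folder."""
    report = {}
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError as err:
            report[path.name] = [f"not well-formed: {err}"]
            continue
        report[path.name] = [
            f"missing {p}" for p in REQUIRED_PATHS if root.find(p) is None
        ]
    return report
```

Files with an empty problem list are ready for display testing; the rest are worth fixing first, since display problems caused by bad XML are harder to diagnose through the browser.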