This document provides a basic introduction to using the DDI specification in its latest version, 3.1; you can also consult our guide to Getting Started with DDI: Version 2. Though the explanations are primarily tailored to users who deal with large numbers of files, the information should be useful for all DDI users. It is assumed here that you've had some experience with XML; if not, consult our FAQ.
The markup task may appear daunting at first, so we have broken the process down into discrete steps. If you have questions or would like some advice on your specific situation, please don't hesitate to get in touch with us. You may also want to join the DDI list-serv.
If you're curious to see how other organizations make use of DDI XML, see DDI Implementations.
The current version of the DDI specification, version 3.1, is designed to support information entry at virtually any point in the data life cycle. A few points to remember before organizing your materials for entry:
XML instances of individual major modules can be created independent of other portions of the 3.1 structure, although all major modules make use of elements in the Reusable schema and generally have links to one or more other modules.
It is advisable to create all elements with their namespace prefixes (l:Variable as opposed to Variable in the logical product). This will allow you to integrate everything into a single XML instance in the future with greater ease. A list of suggested namespace prefixes can be found in the Documentation, Part 1.
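As a minimal sketch of the prefix advice, the following Python snippet uses the standard library's ElementTree to create a namespace-qualified l:Variable. The namespace URI and the ID value are assumptions made for illustration; check both against the DDI 3.1 schemas and the prefix list in the Documentation, Part 1.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the logical product module; verify it against
# the DDI 3.1 schemas before relying on it.
LOGICAL_NS = "ddi:logicalproduct:3_1"

# Ask ElementTree to serialize this namespace with the suggested "l" prefix.
ET.register_namespace("l", LOGICAL_NS)

# Clark notation ({uri}local-name) binds the element to its namespace.
variable = ET.Element(f"{{{LOGICAL_NS}}}Variable", {"id": "V1"})  # hypothetical ID

xml_text = ET.tostring(variable, encoding="unicode")
print(xml_text)
```

Because the element carries its namespace from the start, it can later be moved into a combined instance without renaming.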
Some modules require links to elements in other modules so that there is an implied order of creation. These include:
Physical Data Product requires a link to a Logical Product
Physical Instance requires a link to a Physical Data Product
Modules may make references to other modules, such as the Variable in Logical Product referring to a Question in Data Collection or a Concept in Study Unit. These types of links do not require a certain order of creation, but you may need to go back and add links from existing modules to newly created ones.
Note that Variable requires a link to a Concept. If concepts have not been developed for legacy files, you can create a few broad concepts in order to create the link. This is not ideal but may be a practical approach when translating large numbers of legacy documents without concept information.
You need to version "versionable" elements and modules at the point of publication (see Documentation, Part 1). Maintaining version control prior to that depends on the needs of your system. Individuals may version internal working drafts to track changes. Group editing may require more extensive version control. Publication in this case refers to the point at which you make the instance public. External users need to be able to reference a specific version or to late- or early-bind their references with confidence.
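Returning to the required links noted above (Physical Data Product to Logical Product, Physical Instance to Physical Data Product), the implied order of creation can be worked out mechanically. The sketch below is only an illustration using Python's standard graphlib module (Python 3.9 or later); the dictionary is a hand-built summary of those two requirements, not something read from the schema.

```python
from graphlib import TopologicalSorter

# Each module maps to the set of modules it requires a link to.
required_links = {
    "Physical Data Product": {"Logical Product"},
    "Physical Instance": {"Physical Data Product"},
}

# static_order() yields every module after the modules it depends on.
creation_order = list(TopologicalSorter(required_links).static_order())
print(creation_order)
```

Optional references, such as a Variable pointing to a Question, are deliberately left out because they do not constrain the order of creation.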
New users of DDI 3 may want to familiarize themselves with the DDI Best Practices.
Note that these documents are the first attempt to define procedures for using DDI 3 optimally and as such are not considered "final." The best practice authors and editors welcome your feedback on the usability of the documents and priorities for new best practices. Please send your comments and suggestions to firstname.lastname@example.org.
Start with the Study Unit.
The only required elements are:
- Identification of the XML instance you are creating using the Study Unit scheme
- A citation
However, at minimum, you should record the purpose of the study. This is also a good time to start defining the concepts being used in the study. These can be revised later, but they can be referred to by questions and variables and it's a good idea to get these defined as early as possible. If all the information you have is the required citation and identification, you may want to wait to work on this section.
Note that concepts are maintainable. You should consider whether you will make your concept structure a maintained object that can be reused within your organization or by others.
Move on to Data Collection.
This may be as simple as a reference to an external questionnaire and textual information on the methodology, collection event, and data processing. The only required item is the identification of the XML instance. It is strongly recommended that question information be encoded so that it can be referenced by variables and used in a description of the instrument.
Questions are encoded in a Question Scheme and include references to Response Domains that detail the response categories, ranges, text, or other type of valid responses to a question. The Instrument is needed to indicate order, skip patterns, pre- or post-text, and to attach a response unit, analysis unit, or universe. Remember that Question Schemes are maintainable and so information specific to the use of a question in an instrument must be contained in the instrument rather than the Question Scheme.
With a new project, it is assumed that you are encoding information as it is created. If your process first determines the purpose of the study and then moves through data sources and collection methodology, identification of concepts, development of questions, the data collection event, and data processing, then that is the order in which the information is captured. This can be done with separate XML instances for the Study Unit and Data Collection modules or as a single integrated instance (Data Collection is an element in the Study Unit). This flexibility allows you to match the order of entry to your internal processes.
Design the Logical Product (variables section).
A single data collection may result in a number of logical data products, but in general the first is the codification of the raw data captured in the collection process. The data may reflect the exact entry of the respondent, a recoded representation of the response, the result of a calculation, or situational information (geography, conditions, etc.). Regardless, the structure of the Logical Data Product will require the construction of Variables to be used as is or assembled into NCubes.
Creating variables in DDI 3 requires thinking about them in a slightly different way. To create a variable, you must, in this order, (1) define the categories, (2) assemble them into category coding structures, and (3) use these structures to create variables. Note that the category definitions and category coding structures may be created once, given IDs, and reused as needed through referencing.
One approach is to first go through the response domains of the questions, identify all the unique categories used to represent them in the variables, encode the information regarding them (label and definition), and then assemble into the required coding structures. Remember that multiple variables can use a single coding structure in its entirety or specified segments (levels, most discrete categories, a range, or selected categories). The coding structure is created once as a complete structure. Once all the category coding structures are created, variables are assembled using these coding structures or defining valid response ranges.
An alternative approach is to assemble the categories and category coding structures as needed for each variable. This requires keeping three sections open simultaneously and adding to the category listing and the category coding structure list as a variable needs new items. Which approach you use is a matter of personal preference. Either way, a variable that uses categories requires category coding structures that already exist, and category coding structures require categories that already exist. Both the category listing and the coding structures can be modified later to include missing items.
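The ordering rule (categories, then category coding structures, then variables) can be sketched as a small registry that refuses to accept an item whose prerequisites are missing. This is an illustration in plain Python, not DDI markup; the class name, method names, and all IDs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    categories: dict = field(default_factory=dict)    # category ID -> label
    code_schemes: dict = field(default_factory=dict)  # scheme ID -> {code: category ID}
    variables: dict = field(default_factory=dict)     # variable name -> scheme ID

    def add_category(self, cat_id, label):
        self.categories[cat_id] = label

    def add_code_scheme(self, scheme_id, codes):
        # A coding structure may only point at categories that already exist.
        missing = set(codes.values()) - set(self.categories)
        if missing:
            raise ValueError(f"undefined categories: {sorted(missing)}")
        self.code_schemes[scheme_id] = dict(codes)

    def add_variable(self, name, scheme_id):
        # A variable may only point at a coding structure that already exists.
        if scheme_id not in self.code_schemes:
            raise ValueError(f"undefined code scheme: {scheme_id}")
        self.variables[name] = scheme_id


reg = Registry()
reg.add_category("CAT_YES", "Yes")
reg.add_category("CAT_NO", "No")
reg.add_code_scheme("CS_YESNO", {"1": "CAT_YES", "2": "CAT_NO"})
reg.add_variable("OWNS_HOME", "CS_YESNO")
reg.add_variable("EMPLOYED", "CS_YESNO")  # the same coding structure is reused
```

Both variables reference one coding structure by ID, which mirrors the reuse-through-referencing idea described above.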
When creating Variable information, include reference to the Question and/or Concept if that section has been created. Note that concepts, questions, categories, category coding structures, variables, and NCubes are all maintainable objects and may be maintained as separate objects and referred to or be reused by many Study Units. Individuals or organizations can create and maintain "banks" of these objects in order to provide consistency in usage within and between studies.
NCubes are always created after Variables and are assembled using one or more Variables as dimensions.
Create a Physical Data Product.
The Physical Data Product describes the physical structure of the data store. Use the structure that matches your needs (fixed format, delimited, tabular, relational, etc.). The sole purpose of the Physical Data Product is to link the physical location of a data item (variable or cell in an NCube) with the description of what it contains. This means it links a physical address (record type and location in the record) with the variable or cell coordinate address in an NCube.
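To illustrate the link between a physical address and a variable, here is a rough Python sketch for a fixed-format record. The layout, variable IDs, and sample record are all invented for the example, and start columns are assumed to be counted from 1.

```python
# Hypothetical fixed-format layout: variable ID -> (start column, width).
# Start columns are counted from 1 in this sketch.
layout = {
    "V_COUNTRY": (1, 2),
    "V_AGE": (3, 3),
    "V_SEX": (6, 1),
}


def read_record(line, layout):
    """Return {variable ID: raw string value} for one fixed-width record."""
    return {
        var_id: line[start - 1:start - 1 + width]
        for var_id, (start, width) in layout.items()
    }


record = read_record("DE0422", layout)
print(record)
```

The layout table plays the role described above: it carries no meaning of its own and only says where each described variable sits in the record.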
Create the Physical Instance.
A Physical Instance description has a one-to-one relationship with a data file. For example, you can take a single file of survey results for a five-country study (one Physical Instance, one Physical Data Product) and subset it into one file per country (five Physical Instances, one Physical Data Product). Each Physical Instance can define the subset of records it represents (in this case, an individual country) and record the summary statistics unique to that file.
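The five-country example might look like this in a rough Python sketch. The records, country codes, and the choice of statistics are made up for illustration; the point is only that each subset file gets its own description and its own summary statistics while the record structure stays the same.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records from the combined five-country file: (country, age).
records = [
    ("DE", 42), ("DE", 30), ("FR", 55),
    ("IT", 61), ("ES", 28), ("NL", 47),
]

# Split the records into one subset per country.
by_country = defaultdict(list)
for country, age in records:
    by_country[country].append(age)

# One entry per physical instance: record count and mean age for that file.
instances = {
    country: {"records": len(ages), "mean_age": mean(ages)}
    for country, ages in by_country.items()
}
print(instances)
```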
Pull it all together.
You now have the base modules for a single Study Unit. These can stay "free-floating" as individual instances or be assembled into a DDI Instance through the creation of the Instance Module. The Instance Module pulls together the sections (through reference or assembly into a single XML instance) and adds a full citation (its bibliographic information), archive and holding information, and overall coverage information (topic, temporal, geographic).
You may be able to use the citation of the Study Unit as a base for the instance citation, but remember that the citation here is primarily the "Dublin Core" record of the assembled published instance. If you have created any coverage information, it can be used as a base for the instance level coverage information. Note that the coverage information for the instance must be inclusive of all of the lower level descriptions.
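One way to check that instance-level coverage is inclusive of the lower-level descriptions is to derive it from them. The following Python sketch assumes coverage has already been pulled out of the modules into simple dictionaries; the keys, topics, and dates are hypothetical.

```python
from datetime import date

# Hypothetical lower-level coverage descriptions gathered from the modules.
module_coverage = [
    {"topics": {"employment"},
     "start": date(2005, 1, 1), "end": date(2005, 6, 30)},
    {"topics": {"employment", "housing"},
     "start": date(2005, 3, 1), "end": date(2006, 2, 28)},
]

# Instance-level coverage: the union of topics and the widest date span.
instance_coverage = {
    "topics": set().union(*(c["topics"] for c in module_coverage)),
    "start": min(c["start"] for c in module_coverage),
    "end": max(c["end"] for c in module_coverage),
}
print(instance_coverage)
```

Geographic coverage would follow the same pattern, with the instance listing every area named at a lower level.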
Make a group.
Groups are made of existing Study Units and/or groups. Common information is moved to the group level and comparison information is provided. Studies that are comparable-by-design can use the grouping mechanism to indicate the common information that is inherited by the individual study units. Use the grouping decision trees to fill in the required group attributes; these provide machine-actionable direction to programmers who need to process the documents. Groups created for after-the-fact comparison need to use the Comparative module.
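The idea of common information held at the group level and inherited by study units can be pictured with Python's ChainMap. This is a loose analogy rather than a model of the DDI group attributes: the keys and values are invented, and the behaviour where a study-level value takes precedence over the group value is an assumption of this sketch.

```python
from collections import ChainMap

# Information shared by every study in the group (hypothetical values).
group_common = {"universe": "Adults 18+", "mode": "face-to-face interview"}

# Information recorded on the individual study units.
study_2005 = {"collection_year": 2005}
study_2006 = {"collection_year": 2006, "mode": "telephone interview"}

# Lookups check the study first, then fall back to the group level.
resolved_2005 = ChainMap(study_2005, group_common)
resolved_2006 = ChainMap(study_2006, group_common)

print(resolved_2005["mode"])
print(resolved_2006["mode"])
```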
Legacy codebooks can be created in the same manner as above or by starting at any of the following points:
- Study Unit
- Data Collection
- Logical Data Product
Start with setup files.
If you use an SPSS/SAS/Stata setup file and translation software, you will end up with a very raw Logical Data Product, a Physical Data Product, and a basic Physical Instance module. The result will require considerable cleaning to remove repeated categories and category coding structures and to adjust the references to them within the variable descriptions.
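The cleaning step can be sketched as follows, assuming the translation output has been reduced to one code list per variable. The variable names, codes, and generated scheme IDs are hypothetical; a real tool would operate on the DDI elements themselves.

```python
# Hypothetical raw output: every variable carries its own copy of its code list.
raw_variables = {
    "Q1": {"1": "Yes", "2": "No"},
    "Q2": {"1": "Yes", "2": "No"},
    "Q3": {"1": "Agree", "2": "Neutral", "3": "Disagree"},
}

schemes = {}        # scheme ID -> code list (each distinct list stored once)
scheme_ids = {}     # frozen code list -> scheme ID
variable_refs = {}  # variable name -> scheme ID

for name, codes in raw_variables.items():
    key = frozenset(codes.items())  # identical code lists produce the same key
    if key not in scheme_ids:
        scheme_ids[key] = f"CS{len(scheme_ids) + 1}"
        schemes[scheme_ids[key]] = codes
    variable_refs[name] = scheme_ids[key]

print(variable_refs)
```

Q1 and Q2 end up pointing at a single shared coding structure, which is the adjustment of references described above.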
It will probably be much more practical to go from DDI to an SPSS/SAS/Stata setup file than the other way around, primarily because of the limited amount of information currently found in these setup file structures. This is not a new problem: most SPSS/SAS setup files currently in social science data archives carry little or no content information on variables, only a var1, var2, var3 approach.
However, as a means of capturing the Physical Data Product and Physical Instance information, including accurate links to variable IDs, these files are still very useful. How useful will depend on the translation tools created for this process and how well they capture the Logical Data Product and provide sorting and editing capabilities for it. Since SPSS/SAS/Stata currently do not support the relational structure of NCubes, translation tools will not be very useful for describing aggregate data in this format.
Software to expedite this approach is in development and new features will continue to be added. While this will result in a bare bones DDI instance, additional information such as an expansion of the Concept list and basic citation and Study Unit information should be added. Note that the same document can be used for multiple record subsets of the same logical and physical structure, by simply creating an additional physical instance for each physical data file.
Use existing structured electronic files.
There are few options for standard translation tools to deal with other types of electronic files like codebooks and data dictionaries, given the variety of structures represented. Projects dealing with a number of files of similar format will find it useful to parse structured information into a rough DDI structure and then clean up the resulting transformation. This can be done through programming, macros, and regular expressions. In these cases, it is important to program in the creation of unique IDs and links as well as provide the appropriate required elements for each output instance so that it can be cleaned using an XML parser that validates both structure and content. Working with small segments, cleaning them individually, and then assembling the parts allows easier identification of errors within the sections. A tool set that assists in creating accurate links between sections would be a valuable asset.
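As a rough sketch of parsing structured text with regular expressions while generating unique IDs and links, consider the Python example below. The data dictionary layout is invented, and the element and attribute names (Variables, Variable, Location, variableRef) are placeholders for illustration, not DDI 3.1 elements.

```python
import re
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical data dictionary lines: NAME  START-END  LABEL
dictionary_text = """\
AGE     1-3   Age of respondent
SEX     4-4   Sex of respondent
"""

LINE = re.compile(r"^(?P<name>\w+)\s+(?P<start>\d+)-(?P<end>\d+)\s+(?P<label>.+)$")
next_id = count(1)  # source of unique, sequential ID numbers

root = ET.Element("Variables")  # placeholder element names, not DDI elements
for line in dictionary_text.splitlines():
    match = LINE.match(line)
    if match is None:
        continue  # leave unparsed lines for manual review
    var = ET.SubElement(root, "Variable", {"id": f"V{next(next_id)}"})
    ET.SubElement(var, "Name").text = match["name"]
    ET.SubElement(var, "Label").text = match["label"].strip()
    # The link reuses the ID generated above, so it cannot drift out of sync.
    ET.SubElement(root, "Location", {
        "variableRef": var.get("id"),
        "start": match["start"],
        "end": match["end"],
    })

print(ET.tostring(root, encoding="unicode"))
```

Generating the ID and the reference in the same step is the main design point: links created this way are correct by construction and need less cleanup later.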
Convert existing instance to DDI 3.1.
If you are converting an earlier version of DDI to a 3.0 instance, an XML browser tool allows you to view your old document in a 3.0 tag set. This tool enables the converted document to be saved to a clipboard as a 3.0 XML document. Currently this tool handles only microdata structures and commonly used sections of the earlier DDI specification. However, by handling the restructuring of variables and ensuring valid internal references, it provides major assistance in 2.x to 3.0 translations. Note that you will then have to upgrade to 3.1 to use the most current version of the standard.
Previewing your DDI instances for migration will save you time and effort in the long run. These are points to remember when reviewing:
- Not all sections of DDI were used by everyone, so you need to know which problem areas you can happily avoid
- Identify any sections that may need pre-editing or special attention
- Identify any elements or attributes that were "created" locally to relay specific information
Particular areas require some additional attention due to variations in the earlier versions of DDI or because of major structural changes in 3.0. Special attention should be paid to:
- Future home
  - Notes and Text
  - Special use
- Large text sections
  - Should these be pre-parsed?
- Missing information
  - Absent content
- Repetitive variable content
  - Repeated use of category set
  - Use of complex category groups
Remember that all of these will need to be handled at some point in the translation process. By making as much of the process "automatic" as possible, you will save time and reduce errors. Some options for handling migration are noted below:
- Create a pre-edited intermediary document in the "old" format
- Use software that allows for in-process approval and editing
- Determine "default" categories and determine post-transfer editing process
- Identify sections that will be transferred by hand
Finally, be sure to validate your document using an XML editor.
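If you also want a quick scripted check alongside the editor, a sketch like the one below can catch some problems early. It only tests well-formedness and a simplified notion of IDs and references; the "id" and "ref" attribute names are assumptions made for this example, and it does not replace validation against the DDI schemas.

```python
import xml.etree.ElementTree as ET


def check_instance(xml_text):
    """Return a list of problems: parse errors, duplicate IDs, dangling references."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]

    problems = []
    seen = set()
    # First pass: collect every ID and report duplicates.
    for elem in root.iter():
        elem_id = elem.get("id")
        if elem_id is None:
            continue
        if elem_id in seen:
            problems.append(f"duplicate id: {elem_id}")
        seen.add(elem_id)
    # Second pass: every reference must point at a collected ID.
    for elem in root.iter():
        ref = elem.get("ref")  # hypothetical reference attribute
        if ref is not None and ref not in seen:
            problems.append(f"dangling reference: {ref}")
    return problems


print(check_instance('<a><b id="x"/><c ref="x"/></a>'))
print(check_instance('<a><c ref="y"/></a>'))
```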