Developing Products of the DDI Alliance

DDI Alliance, 2020-04-12

I.    Introduction
II.   Disco - DDI-RDF Discovery Vocabulary
III.  DDI-CDI - Cross Domain Integration
IV.   SDTL - Structured Data Transformation Language 

 

Introduction

The DDI Alliance continues to develop new products to meet the needs of current and potential user groups. New products are proposed by members and supported by contributed work and funding from the Alliance, as agreed to by the Executive Board. A major development effort, entitled the "Moving Forward Project," resulted in new content and production approaches in existing products as well as new product development. The products below are currently under development within the DDI Alliance.

Development of a number of RDF vocabularies, including XKOS, DISCO and PHDD, began in a workshop on "Semantic Statistics for Social, Behavioural, and Economic Sciences: Leveraging the DDI Model for the Linked Data Web" at Schloss Dagstuhl - Leibniz Center for Informatics, Germany, in September 2011. The work continued at three further meetings: a follow-up working meeting (Discovery vocabulary) at the 3rd Annual European DDI Users Group Meeting (EDDI11) in Gothenburg, Sweden, in December 2011; a second workshop on "Semantic Statistics for Social, Behavioural, and Economic Sciences: Leveraging the DDI Model for the Linked Data Web" at Schloss Dagstuhl in October 2012; and a follow-up meeting (Discovery only) at GESIS-Leibniz Institute for the Social Sciences in Mannheim, Germany, in February 2013.

 

Disco - DDI-RDF Discovery Vocabulary

This specification is designed to support the discovery of microdata sets and related metadata using RDF technologies in the Web of Linked Data. The vocabulary leverages the DDI specification to create a simplified version of this model for the discovery of data files. It is based on a subset of the DDI XML formats of DDI Codebook and DDI Lifecycle. It supports programmatically identifying the datasets relevant to a specific research purpose. Existing DDI XML instances can be transformed into this RDF format and thereby exposed in the Web of Linked Data. The reverse process is not intended, because the developers of the RDF discovery vocabulary have defined DDI-RDF components and reused components of other RDF vocabularies that make sense only in the Linked Data field.
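
The one-way transformation from DDI XML to RDF described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the Alliance's actual transformation. The sample document is a heavily reduced DDI Codebook fragment, the example.org IRIs and the function name codebook_to_triples are hypothetical, and the Disco, Dublin Core, and SKOS terms used here are assumptions that should be checked against the published vocabulary before any real use.

```python
import xml.etree.ElementTree as ET

# Namespace used by DDI Codebook 2.5 XML instances.
DDI_NS = {"ddi": "ddi:codebook:2_5"}

# Vocabulary IRIs. The specific Disco terms used below (Study, Variable,
# variable) are assumptions for illustration; verify them in the vocabulary.
DISCO = "http://rdf-vocabulary.ddialliance.org/discovery#"
DCTERMS = "http://purl.org/dc/terms/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# A heavily reduced, made-up DDI Codebook fragment.
SAMPLE = """<codeBook xmlns="ddi:codebook:2_5">
  <stdyDscr><citation><titlStmt>
    <titl>Example Household Survey</titl>
    <IDNo>EHS-001</IDNo>
  </titlStmt></citation></stdyDscr>
  <dataDscr>
    <var name="AGE"><labl>Age of respondent</labl></var>
    <var name="SEX"><labl>Sex of respondent</labl></var>
  </dataDscr>
</codeBook>"""


def codebook_to_triples(xml_text, base_iri):
    """Return (subject, predicate, object) tuples for a study and its variables.

    Objects are either IRIs or plain literal strings; a real serializer
    would need to distinguish the two.
    """
    root = ET.fromstring(xml_text)
    title_path = "ddi:stdyDscr/ddi:citation/ddi:titlStmt/"
    title = root.findtext(title_path + "ddi:titl", namespaces=DDI_NS)
    study_id = root.findtext(title_path + "ddi:IDNo", namespaces=DDI_NS)

    study = f"{base_iri}study/{study_id}"
    triples = [
        (study, RDF_TYPE, DISCO + "Study"),
        (study, DCTERMS + "title", title),
    ]
    for var in root.findall("ddi:dataDscr/ddi:var", DDI_NS):
        var_iri = f"{base_iri}variable/{var.get('name')}"
        label = var.findtext("ddi:labl", namespaces=DDI_NS)
        triples.append((var_iri, RDF_TYPE, DISCO + "Variable"))
        triples.append((study, DISCO + "variable", var_iri))
        triples.append((var_iri, SKOS + "prefLabel", label))
    return triples


if __name__ == "__main__":
    for triple in codebook_to_triples(SAMPLE, "http://example.org/"):
        print(triple)
```

The sketch only goes from XML to triples, which mirrors the point made above: the RDF side reuses terms from other vocabularies, so a lossless round trip back to DDI XML is not a design goal.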

 

DDI-CDI - Cross Domain Integration

In today's world, the types and formats of data researchers deal with are proliferating. Data Science has come into its own: technology tools for dealing with "big data" and the application of machine learning and related technologies have enabled new possibilities in how research is conducted. The demand for data sources, both traditional and new, has grown as a result.

The generation and availability of more and different data place an emphasis on understanding the data, both through human-readable documentation and, increasingly, in machine-actionable ways. More emphasis is also placed on data provenance: where did the data come from, and how was it processed so that it could be used?

For the DDI suite of metadata standards, this provides a new and expanded focus: the specifications must be able to describe new forms of data and to be implemented in a wider range of technologies. Ultimately, the diverse types of data must be seen as an integrated whole, complete with a description of the structure, meaning, and provenance of each part. DDI-CDI is intended to meet this need.

Existing DDI specifications - DDI Codebook and DDI Lifecycle - provide tested models for the documentation of much of what is needed. Applying this rich metadata set to the new forms of data, however, was not something they were designed to do.

The new DDI - Cross Domain Integration (DDI-CDI) is an application of the model which emerged from many years of work on a "next generation" DDI specification, known as "DDI 4". It is designed to be a model which can be used to connect disparate forms of data with each other, whether they are described in DDI Codebook/Lifecycle or in any other fashion. As such, it can be used as a way of integrating these new forms of data with more traditional, existing data, or with each other.

Although built on the basis of the DDI 4 work, it is a new kind of specification, aimed at supplementing the existing metadata models, but also being useful in its own right. It emphasizes the model, which is aligned with many other relevant non-DDI standards used to describe data from a variety of sources. While an XML syntax representation is provided, it is also available in a Unified Modeling Language (UML) "Canonical XMI" form for easy use in other technology platforms.

In the current release, DDI-CDI has two main parts: a section for describing data in rectangular (wide), long (event), multi-dimensional (cube), and NoSQL (big data) formats; and a section for describing process as a primary aspect of data provenance. To support these parts, the model also includes a rich set of classes for describing variables, classifications, and other concepts and representations.
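
To make the difference between the rectangular (wide) and long layouts concrete, the following sketch reshapes the same made-up records from one layout into the other. It is only an illustration of the two data organizations; it does not use or represent the DDI-CDI model itself, and the function name wide_to_long and the field names are hypothetical.

```python
def wide_to_long(rows, id_field):
    """Reshape one-row-per-unit records into one-row-per-measurement records.

    In the wide layout each variable is a column. In the long layout each
    row carries the unit identifier, the variable name, and a single value.
    """
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name == id_field:
                continue  # the identifier is repeated on every long row
            long_rows.append(
                {id_field: row[id_field], "variable": name, "value": value}
            )
    return long_rows


# Made-up example data in wide layout: one row per person.
wide_rows = [
    {"person_id": 1, "age": 34, "income": 52000},
    {"person_id": 2, "age": 41, "income": 61000},
]

long_rows = wide_to_long(wide_rows, "person_id")

if __name__ == "__main__":
    for row in long_rows:
        print(row)
```

Both layouts hold the same information, which is why a description of how a data set is organized needs to be separate from the description of the variables it contains.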

DDI-CDI is intended to fill the emerging need for integration of disparate data described above. The existing DDI standards, Codebook and Lifecycle, were built primarily for describing social science data. DDI-CDI was designed to be more generic, with the intention of developing a DDI standard that provides detailed descriptions of data coming from multiple sources. Because interesting and relevant data can come from almost any source and on any topic, DDI-CDI is designed to describe the variety of ways a data set might be organized and the provenance of that data, independent of the subject field or the technology platform used.

The DDI Alliance is collaborating with CODATA on a series of webinars to support the public review of DDI-CDI.

 

SDTL - Structured Data Transformation Language

Structured Data Transformation Language (SDTL) is an independent intermediate language for representing data transformation commands. Statistical analysis packages (e.g., SPSS, Stata, SAS, and R) provide similar functionality, but each one has its own command language. SDTL consists of JSON schemas for common operations, such as RECODE, MERGE FILES, and VARIABLE LABELS. SDTL provides machine-actionable descriptions of variable-level data transformation histories derived from any data transformation language. Provenance metadata represented in SDTL can be added to documentation in DDI and other metadata standards.
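
As a rough illustration of what a machine-actionable description of a transformation command can look like, the sketch below expresses a recode as JSON and then both applies it to data and turns it into a provenance note. The JSON shape, the key names, and the functions apply_recode and describe are hypothetical simplifications made up for this example; they are not the published SDTL schema.

```python
import json

# Hypothetical, simplified command description. NOT the published SDTL schema.
COMMAND_JSON = """
{
  "command": "Recode",
  "sourceVariable": "age",
  "targetVariable": "age_group",
  "rules": [
    {"from": 0, "to": 17, "value": "minor"},
    {"from": 18, "to": 64, "value": "adult"},
    {"from": 65, "to": 120, "value": "senior"}
  ]
}
"""


def apply_recode(command, rows):
    """Return copies of the rows with the recoded target variable added."""
    recoded = []
    for row in rows:
        new_row = dict(row)
        source_value = row[command["sourceVariable"]]
        new_row[command["targetVariable"]] = None  # no matching rule
        for rule in command["rules"]:
            if rule["from"] <= source_value <= rule["to"]:
                new_row[command["targetVariable"]] = rule["value"]
                break
        recoded.append(new_row)
    return recoded


def describe(command):
    """Build a short variable-level provenance note from the same command."""
    return (
        f'{command["targetVariable"]} derived from '
        f'{command["sourceVariable"]} by {command["command"]} '
        f'({len(command["rules"])} rules)'
    )


if __name__ == "__main__":
    command = json.loads(COMMAND_JSON)
    print(apply_recode(command, [{"age": 30}, {"age": 70}]))
    print(describe(command))
```

Because the command is data rather than package-specific script text, the same description can drive both the transformation and its documentation, regardless of which statistical package the original script was written for.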

SDTL greatly enhances the value of DDI, because it is a key component of an automated metadata production process. Currently, DDI metadata is almost always created by data repositories, not by data producers. Even when data are born digital, data producers discard provenance information that could be transported into DDI, because they do data management and variable transformations in statistical packages with minimal metadata capabilities. SDTL and the tools created by the C2Metadata Project are designed to create a metadata life cycle that parallels the data life cycle. The same scripts that are used to transform and manage data files can be used to update metadata files. As a result, data producers can create more accurate and complete DDI metadata with less time and effort, both for themselves and for data repositories.
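
The idea of a metadata life cycle that parallels the data life cycle can be sketched as follows: a list of transformation commands, already parsed from a script, is replayed against a variable-level metadata record instead of against the data. This is a minimal sketch under stated assumptions. The command names, the dictionary keys, and the function update_metadata are hypothetical, and the code does not represent the C2Metadata tools or the SDTL schema.

```python
def update_metadata(metadata, commands):
    """Return a new metadata record reflecting the given commands.

    `metadata` maps variable names to dictionaries of descriptive fields.
    `commands` is a list of hypothetical, simplified command descriptions.
    The input record is left unchanged.
    """
    updated = {name: dict(info) for name, info in metadata.items()}
    for cmd in commands:
        if cmd["command"] == "Recode":
            # A recode creates a new variable and records where it came from.
            updated[cmd["targetVariable"]] = {
                "label": cmd.get("label", ""),
                "derivedFrom": [cmd["sourceVariable"]],
                "derivation": "Recode",
            }
        elif cmd["command"] == "SetVariableLabel":
            # A labelling command only changes descriptive metadata.
            updated.setdefault(cmd["variable"], {})["label"] = cmd["label"]
    return updated


# Made-up starting metadata and commands.
original = {"age": {"label": "Age in years"}}
script_commands = [
    {"command": "Recode", "sourceVariable": "age", "targetVariable": "age_group"},
    {"command": "SetVariableLabel", "variable": "age_group", "label": "Age group"},
]

result = update_metadata(original, script_commands)

if __name__ == "__main__":
    for name, info in result.items():
        print(name, info)
```

In this sketch the derived variable ends up documented with its source and derivation without anyone writing that documentation by hand, which is the saving in time and effort described above.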