Under review: 2020-04-16
In today's world, the types and formats of data researchers deal with are proliferating. Data Science has come into its own: technology tools for dealing with "big data" and the application of machine learning and related technologies have enabled new possibilities in how research is conducted. The demand for data sources, both traditional and new, has grown as a result.
The generation and availability of more and different data place an emphasis on understanding the data, both through human-readable documentation and, increasingly, in machine-actionable ways. Data provenance receives growing attention: where did the data come from, and how was it processed so that it could be used?
For the DDI suite of metadata standards, this provides a new and expanded focus - the specifications must be able to describe new forms of data, to be implemented in a wider range of technologies. Ultimately, the diverse types of data must be seen as an integrated whole, complete with a description of the structure, meaning, and provenance of each part. DDI-CDI is intended to meet this need.
Existing DDI specifications - DDI Codebook and DDI Lifecycle - provide tested models for the documentation of much of what is needed. Applying this rich metadata set to the new forms of data, however, was not something they were designed to do.
Although built on the DDI 4 work, DDI-CDI is a new kind of specification, aimed at supplementing the existing metadata models while also being useful in its own right. It emphasizes the model, which is aligned with many other relevant non-DDI standards used to describe data from a variety of sources. An XML syntax representation is provided, and the model is also available in a Unified Modeling Language (UML) "Canonical XMI" form for easy use on other technology platforms.
In the current release, DDI-CDI has two main parts: a section for describing data in rectangular (wide), long (event), multi-dimensional (cube), and NoSQL (big data) formats; and a section for describing process as a primary aspect of data provenance. To support these parts, the model also includes a rich set of structures for describing variables, classifications, and other concepts and representations.
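The difference between the wide and long layouts named above can be illustrated with a minimal sketch. It is not DDI-CDI syntax and uses no DDI tooling: the record layout, the field names (`person_id`, `variable`, `value`), and the helper `wide_to_long` are all hypothetical, chosen only to show the same observations organized in two ways.

```python
# Illustrative only: these record layouts are hypothetical and are not
# DDI-CDI structures or part of any DDI tooling.

# Wide (rectangular) layout: one row per unit, one column per variable.
wide_rows = [
    {"person_id": 1, "age": 34, "income": 52000},
    {"person_id": 2, "age": 51, "income": 61000},
]


def wide_to_long(rows, id_key):
    """Turn each non-identifier cell of a wide row into its own long record."""
    long_rows = []
    for row in rows:
        for variable, value in row.items():
            if variable == id_key:
                continue  # the identifier is repeated on every long record
            long_rows.append(
                {id_key: row[id_key], "variable": variable, "value": value}
            )
    return long_rows


# Long layout: one row per (unit, variable) pair.
long_rows = wide_to_long(wide_rows, "person_id")
```

Both layouts carry the same values. What changes is where the variable name lives: in a column header in the wide form, and in a data cell in the long form. A structural description has to capture that distinction before the two can be integrated.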
DDI-CDI is intended to fill the emerging need, described above, for integrating disparate data. The existing DDI standards, Codebook and Lifecycle, were built primarily for describing social science data. DDI-CDI was designed to be more generic, with the intention of developing a DDI standard that provides detailed descriptions of data coming from multiple sources. Because interesting and relevant data can come from almost any source and on any topic, DDI-CDI is designed to describe the variety of ways a data set might be organized and the provenance of that data, independent of the subject field or the technology platform used.
Current examples were provided for the review package.