Blog Posts by Sam Hume

Why JSON for Datasets?

This post explains the rationale for creating a dataset exchange format using JSON instead of alternative file formats. The PHUSE/CDISC/FDA Dataset-JSON as Alternative Transport Format for Regulatory Submissions pilot project motivated this post. Responses from pilot participants have been very positive, but we have received a few comments about alternative file formats. This post answers the question: Why JSON?

As the name implies, Dataset-JSON is a JSON-based data exchange format for tabular datasets. JSON is the de facto standard for data exchange, especially via RESTful APIs. Almost all programming languages and frameworks support JSON. As a new standard, Dataset-JSON needs dataset conversion and viewing tools, and JSON, a human-readable, text-based format, simplifies the development and testing of these new software tools. In fact, during the first COSA Dataset-JSON Hackathon, over 20 tools were developed in a short timeframe, including the SAS, R, and Python conversion tools used in the pilot project.
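To make the idea concrete, here is a heavily simplified sketch of what a tabular dataset might look like on the wire. This is illustrative only, with example attribute and column names; consult the Dataset-JSON specification for the normative structure:

```json
{
  "name": "DM",
  "label": "Demographics",
  "columns": [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
    {"name": "AGE", "label": "Age", "dataType": "integer"}
  ],
  "rows": [
    ["CDISC01-001", 34],
    ["CDISC01-002", 58]
  ]
}
```

Because the content is plain text, a developer can inspect a dataset in any editor and a tester can diff two versions of a dataset with standard tools, which is part of what made rapid tool development at the hackathon possible.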

The Dataset-JSON standard targets the data exchange use case and supports a wide range of data exchange scenarios involving tabular datasets. It is optimized for ease of sharing tabular data between a wide variety of information systems. File size, read/write speeds, and ease of querying, while important, are secondary to support for data exchange. Dataset-JSON provides reasonable file sizes and processing speeds such that it functions well for data exchange but does not need to be the optimal dataset format for big data or analytical processing.

In many cases, Dataset-JSON data exchange will be API-based. In fact, many data exchange scenarios may never store Dataset-JSON as a file; instead, the data is retrieved using an API. For example, an EDC vendor may provide a Dataset-JSON API for sponsors to retrieve datasets. Those datasets may be generated dynamically from study data stored in a database. The sponsor requests those datasets from the EDC vendor via calls to its API. The EDC system receives the API call, queries the database, formats that data as Dataset-JSON, and sends the Dataset-JSON payload to the requesting sponsor. The sponsor, receiving this Dataset-JSON formatted data, saves each dataset to a data lake by writing it out as a Parquet file, a SAS dataset, or an R dataframe. In this Dataset-JSON API-based data exchange scenario, the data is never stored as a Dataset-JSON file. Furthermore, APIs may use compression and pagination to improve the efficiency of retrieving large datasets.
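A minimal sketch of the receiving side of such an exchange, assuming a payload with `columns` and `rows` keys (a simplification; the endpoint URL, authentication, and full schema details are omitted). Once the JSON body arrives, the rows can be reshaped into records before being written out as Parquet, a SAS dataset, or a dataframe:

```python
import json

def rows_to_records(columns, rows):
    """Pair each row's values with the corresponding column names."""
    names = [col["name"] for col in columns]
    return [dict(zip(names, row)) for row in rows]

# Hypothetical response body shaped like a Dataset-JSON payload
# (simplified for illustration; not the normative schema).
payload = json.loads("""
{
  "columns": [{"name": "USUBJID"}, {"name": "AGE"}],
  "rows": [["CDISC01-001", 34], ["CDISC01-002", 58]]
}
""")

records = rows_to_records(payload["columns"], payload["rows"])
# records is now a list of plain dicts, ready to hand to whatever
# dataframe or storage library the receiving system prefers.
```

In a real integration, the `payload` would come from an HTTP response rather than a string literal, but the reshaping step is the same either way.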

Importantly, Dataset-JSON targets interoperability between different technologies. For example, a vendor should be able to create dataframes in R, convert them to Dataset-JSON, and then send them to a sponsor that converts them into SAS datasets before beginning their work. That is, each party should be able to use their preferred technology to generate the datasets without worrying about what technology the other party uses. Interoperability also includes other technologies, such as data collection systems like EDC and ePRO, in addition to technologies used in statistical computing environments.

Taking a broader look at the research and healthcare data standards landscape, JSON aligns well with other data exchange standards. Dataset-JSON is part of the CDISC ODM v2.0 standard which provides the ability to represent raw, non-tabular data as JSON. For example, ODM v2.0 could be used to represent hierarchical raw data collected in an EDC system that will be transformed into Dataset-JSON datasets to send to the sponsor. Using JSON for both non-tabular and tabular data exchange has obvious benefits for tool developers. Additionally, a future version of Define-XML, likely to be named Define-JSON, will ensure that the metadata and datasets can be represented using JSON. The primary source of CDISC metadata, the CDISC Library, returns metadata as JSON by default. In terms of healthcare data standards, HL7 FHIR, a widely implemented data exchange standard for EHR data, supports JSON for data exchange and uses NDJSON (Newline Delimited JSON) for bulk data exports. The FHIR standard has been used in research studies to retrieve EHR data as RWD.

Extensibility is an important attribute of a data exchange standard, and the Dataset-JSON schema can easily be extended to add additional metadata. Extensibility enables implementers to add support for specific data exchange scenarios not covered by the base standard. Extensibility has been a widely used feature of ODM-based standards from the beginning and has enabled implementers to adapt ODM to support a diverse set of data exchange scenarios.
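As a purely hypothetical illustration of the idea (not a mechanism prescribed here; see the standard for how extensions are actually defined), an implementer might carry an extra, clearly named attribute alongside the standard metadata to support a scenario the base standard does not cover:

```json
{
  "name": "DM",
  "label": "Demographics",
  "sponsorArchiveId": "study-123-dm-v2"
}
```

Here `sponsorArchiveId` is an invented, sponsor-specific attribute; tools that do not understand it can simply ignore it, while the two exchange partners that agreed on it can use it.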

To process very large datasets, software tools should not attempt to load the entire JSON dataset into memory as the size of the dataset may exceed available memory. Instead, software tools should use a library that reads JSON as a stream. To provide additional support for processing large datasets, the Dataset-JSON team plans to add support for NDJSON. NDJSON will enable software to read the dataset one line at a time, effectively streaming the JSON content, and will make it easy to create large datasets by simply appending new rows.
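To illustrate why line-delimited rows make streaming simple, here is a minimal sketch assuming each NDJSON line carries one row as a JSON array. The row layout is an assumption for illustration; the team has not yet finalized the NDJSON format:

```python
import io
import json

# Hypothetical NDJSON content: one JSON-encoded row per line
# (illustrative; the actual Dataset-JSON NDJSON layout may differ).
stream = io.StringIO(
    '["CDISC01-001", 34]\n'
    '["CDISC01-002", 58]\n'
)

rows = []
for line in stream:                # read one line (one row) at a time,
    rows.append(json.loads(line))  # so memory use stays bounded
```

Appending to such a dataset is equally simple: open the file in append mode and write one more JSON-encoded row followed by a newline, with no need to parse or rewrite the existing content.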

Finally, Dataset-JSON eliminates the limitations that SAS V5 XPORT imposes on data standards as well as on the technologies that process study data. This is an important requirement for any SAS V5 XPORT alternative under consideration.

Thanks to all who have participated in the pilot thus far and helped test and improve the Dataset-JSON standard. If you have suggestions to improve this post, please create an issue or pull request on the markdown version in GitHub.