Generate a HEAL-compliant Data Dictionary

Info

The following instructions pertain to the stand-alone, executable version of the HEAL VLMD tool as well as the use of the VLMD tool in HEAL Workspaces. These two options are recommended for users who are unfamiliar with installing Python software or who want to generate VLMD documents as quickly and easily as possible. If you would like to install and integrate the VLMD tool into an existing, local pipeline, please see the HEAL Data Utilities on GitHub or PyPI for more information.

The HEAL VLMD tool was created to help investigators generate HEAL-compliant variable-level metadata (VLMD) documents that may be uploaded to the HEAL Data Platform. This VLMD tool uses a command-line interface (CLI), is available within HEAL Data Platform Workspaces, and can be incorporated into existing pipelines in the form of a Python module.


Using the Stand-alone VLMD Tool

In an effort to further streamline the VLMD extraction process for researchers, we have developed a stand-alone executable version of the VLMD tool.

Download the VLMD Tool

You can download the latest version of the VLMD tool for your operating system (macOS, Windows, or Linux) from the NIH HEAL Initiative’s GitHub repository:

Download Latest Software Release

Once you have downloaded the appropriate zip file, double-click the file to unzip the package. You should then see a file labeled vlmd or vlmd.exe, depending on your operating system and how it is configured.

Double-clicking vlmd opens your computer's command-line interface (CLI). Once the interface opens and the VLMD tool is loaded, you will be presented with four commands: documentation, extract, start, and validate.

CLI Commands

extract

Extract the variable-level metadata from an existing file with a specific type/format.

start

Start a data dictionary from an empty template

validate

Check (validate) an existing HEAL data dictionary file against the HEAL specifications, for example after filling out a template or after further annotating a dictionary extracted from a different format.

Info

Typing the documentation command will launch the VLMD Data Dictionary definitions in the HEAL Data Utilities documentation.

Using the VLMD Tool in HEAL Workspaces with Python

The VLMD tool has also been preloaded into a HEAL workspace, so that you may use it there instead of downloading it to your local machine. To request access to a workspace, see instructions here.

Once workspace access has been approved, select the (Generic) Jupyter Lab Notebook with R Kernel to get started using the VLMD tool. You can start by uploading your REDCap data dictionary or data file to the persistent drive (/pd). Any data not saved to the persistent drive will be lost when the workspace is terminated. For more information, please see our documentation on HEAL workspaces.

Info

Files containing human subjects data must be de-identified before uploading them to a workspace, and the user is responsible for ensuring that he or she has permission to upload the data to the cloud. Workspaces are secure and any file(s) a user uploads are only accessible by that user.

If you are extracting variable-level metadata from a dataset stored in a format that contains metadata (e.g., Stata, SAS, or SPSS), we recommend making a copy of the dataset in which all of the data (i.e., the actual observations) have been deleted, leaving only the variable names, formats, labels, etc. Many people are unaware that this is possible, but it is an easy way to share information about your dataset without sharing the data themselves.

For example, in Stata, once you have a dataset loaded in memory all that is required is:

drop in 1/l
save empty_dataset

Similarly, in SAS:

data empty_dataset;
    set original_dataset;
    stop;
run;

where the stop statement stops SAS from processing any rows.

In either case, this will leave you with an empty dataset containing all of the original variable-level metadata which you may safely upload to a workspace for use with the VLMD tool.

After you’ve launched the workspace and uploaded your data dictionary or data file, you can import the necessary functions. Below are examples of how to extract VLMD from an SPSS data file, create a new VLMD file from scratch, and validate an existing data dictionary in CSV and JSON formats, all within a workspace.

Python Functions

extract

from healdata_utils import convert_to_vlmd

convert_to_vlmd(input_filepath="~/pd/myfile.sav",inputtype="spss")

Note

Currently the Python function is named convert_to_vlmd, but it will be renamed extract_to_vlmd to be consistent with the CLI. The name extract was chosen because it better reflects the functionality.

start

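from pathlib import Path
from healdata_utils import write_vlmd_template

tmpdir = Path("~/pd").expanduser()  # example folder; replace with your own output location
write_vlmd_template(tmpdir.joinpath("heal.csv"), numfields=10)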

validate

from healdata_utils import validate_vlmd_csv,validate_vlmd_json

validate_vlmd_csv("data/myhealcsvdd.csv")

validate_vlmd_json("data/myhealjsondd.json")
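When a pipeline handles both formats, the choice between the two validators can be made from the file extension. The following is a minimal sketch and not part of healdata_utils: pick_validator_name and validate_data_dictionary are hypothetical helper names, and whatever the validator returns is passed through unchanged because its structure is not described here.

```python
from pathlib import Path

# Map file extensions to the healdata_utils validators used in the examples above.
VALIDATORS = {".csv": "validate_vlmd_csv", ".json": "validate_vlmd_json"}


def pick_validator_name(path):
    """Return the name of the validator that matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in VALIDATORS:
        raise ValueError(f"Expected a .csv or .json data dictionary, got: {path}")
    return VALIDATORS[suffix]


def validate_data_dictionary(path):
    """Hypothetical wrapper: call the matching healdata_utils validator."""
    # Imported lazily so the extension check also works where the package is absent.
    import healdata_utils

    validator = getattr(healdata_utils, pick_validator_name(path))
    return validator(str(path))
```

Calling validate_data_dictionary("data/myhealcsvdd.csv") would then dispatch to validate_vlmd_csv, while an unsupported extension fails early with a clear message.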


Input

There are many applications and software packages that are commonly used during the data collection and processing phases of studies. The HEAL VLMD tool accommodates several of these different input file formats. Please follow the links below if you would like to learn more:

Output

VLMD extraction will result in a JSON and CSV version of the HEAL data dictionary in the output folder along with the validation reports in the errors folder. See below:

Errors

heal-csv-errors.json

  • Validation report for the table in the CSV file, checked against the Frictionless schema.

If valid, this file will contain:

{
    "valid": true,
    "errors": []
}

heal-json-errors.json

  • Validation report from the jsonschema check of the JSON file.

If valid, this file will contain:

{
    "valid": true,
    "errors": []
}
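Both reports can also be checked programmatically rather than opened by hand. The sketch below assumes only what is shown above: each report is a JSON object with a "valid" flag and an "errors" list, and the files are named heal-csv-errors.json and heal-json-errors.json. The helper names read_report and summarize_reports are hypothetical, and the structure of individual error entries is not assumed.

```python
import json
from pathlib import Path


def read_report(report_path):
    """Load one validation report and return (is_valid, errors)."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    # A passing report has "valid": true and an empty "errors" list.
    return bool(report.get("valid", False)), report.get("errors", [])


def summarize_reports(errors_dir):
    """Read both reports in the errors folder; a missing file maps to None."""
    summary = {}
    for name in ("heal-csv-errors.json", "heal-json-errors.json"):
        path = Path(errors_dir) / name
        summary[name] = read_report(path) if path.exists() else None
    return summary
```

For example, summarize_reports("errors") returns a dictionary keyed by report file name, so a script can stop as soon as either report is missing or not valid.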

If no outputdir is specified, the resulting HEAL-compliant data dictionaries will be named:

  • heal-csvtemplate-data-dictionary.csv: This is the CSV data dictionary
  • heal-jsontemplate-data-dictionary.json: This is the JSON version of the data dictionary

For more information on workflows, functions, and definitions, please see the HEAL Data Utilities Documentation.

Workflow Summary

Typical workflows for creating a HEAL-compliant data dictionary include:

  1. Create your data dictionary

    (a) Run the vlmd extract command (or convert_to_vlmd if in Python) to generate a HEAL-compliant data dictionary from your desired input format, or

    (b) Run the vlmd start command to start from an empty template.

  2. Annotate the data dictionary with additional information in your preferred HEAL data dictionary format (either JSON or CSV).

  3. Run the vlmd validate command with your HEAL data dictionary as the input to validate.

  4. Repeat (2) and (3) until you are ready to submit. Please note, currently only name and description are required.
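Because only name and description are currently required, a quick pre-check between annotation rounds can flag rows that would fail on those two fields. This is a minimal sketch that does not replace vlmd validate: it assumes the CSV data dictionary has columns headed name and description, and find_missing_required is a hypothetical helper name.

```python
import csv


def find_missing_required(csv_path, required=("name", "description")):
    """Return (row_number, field) pairs where a required field is blank or absent."""
    problems = []
    # utf-8-sig also handles files saved with a byte-order mark (common from Excel).
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        # start=2 because row 1 of the file is the header.
        for row_number, row in enumerate(csv.DictReader(handle), start=2):
            for field in required:
                if not (row.get(field) or "").strip():
                    problems.append((row_number, field))
    return problems
```

An empty list means every row has both fields filled in; the full check against the HEAL specifications still comes from step 3.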

Next Steps

Once you’ve created your HEAL-compliant data dictionary, you are ready to submit it to the Platform. Please see our instructions on submitting a data dictionary.

If you need any help generating a HEAL-compliant data dictionary with the VLMD tool, or have a general inquiry, please contact us at heal-support@datacommons.io.