Variable Level Metadata (Data Dictionaries)
This schema defines the variable-level metadata for one data dictionary for a given study. Note that a given study can have multiple data dictionaries.
title
(string,required)
description
(string)
data_dictionary
(array,required)
The individual variable-level metadata fields, which are integrated into the variable-level metadata object within the HEAL platform metadata service.
NOTE
Only the name and description properties are required.
For categorical variables, the constraints.enum and encodings (where applicable) properties are highly encouraged.
For studies using HEAL or other common data elements (CDEs), standardsMappings information is highly encouraged.
The type and format properties may be particularly useful for some variable types (e.g., date-like variables).
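As a rough illustration of the required properties, the sketch below builds a small data dictionary as a Python dict and checks each record for name and description. The record contents and the helper `missing_required` are hypothetical and are not part of the schema or of any HEAL tooling.

```python
# Hypothetical example: property names follow the schema text, values are made up.
REQUIRED_RECORD_PROPERTIES = ("name", "description")

data_dictionary = {
    "title": "Example study data dictionary",
    "description": "Illustrative dictionary with two variables",
    "data_dictionary": [
        {
            "name": "age",
            "description": "The participant's age at the time of study enrollment",
            "type": "integer",
        },
        {
            "name": "visit_date",
            "description": "Date of the study visit",
            "type": "date",
            "format": "%Y-%m-%d",
        },
    ],
}


def missing_required(record):
    """Return the required properties that are absent or empty in a record."""
    return [key for key in REQUIRED_RECORD_PROPERTIES if not record.get(key)]


for record in data_dictionary["data_dictionary"]:
    print(record["name"], missing_required(record))
```

A record that lacks a description, such as `{"name": "x"}`, would be reported with `["description"]`.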
Properties for each record
module
(string)
The section, form, survey instrument, set of measures or other broad category used
to group variables.
Examples:
Demographics
PROMIS
Substance use
Medical History
Sleep questions
Physical activity
name
(string,required)
The name of a variable (i.e., field) as it appears in the data.
title
(string)
The human-readable title or label of the variable.
Examples:
My Variable
Gender identity
description
(string,required)
An extended description of the variable. This could be the definition of a variable or the
question text (e.g., if a survey).
Examples:
The participant's age at the time of study enrollment
What is the highest grade or level of school you have completed or the highest degree you have received?
type
(string)
A classification or category of a particular data element or property expected or allowed in the dataset.
Definitions:
number: A numeric value with optional decimal places (e.g., 3.14).
integer: A whole number without decimal places (e.g., 42).
string: A sequence of characters (e.g., "test").
any: Any type of data is allowed (e.g., true).
boolean: A binary value representing true or false (e.g., true).
date: A specific calendar date (e.g., "2023-05-25").
datetime: A specific date and time, including timezone information (e.g., "2023-05-25T10:30:00Z").
time: A specific time of day (e.g., "10:30:00").
year: A specific year (e.g., 2023).
yearmonth: A specific year and month (e.g., "2023-05").
duration: A length of time (e.g., "PT1H").
geopoint: A pair of latitude and longitude coordinates (e.g., [51.5074, -0.1278]).
Possible values:
number
integer
string
any
boolean
date
datetime
time
year
yearmonth
duration
geopoint
format
(string)
Indicates the format of the type specified in the type property. Each format is dependent on the type specified.
For example: if type is "string", see the String formats.
If type is "date", "datetime", or "time", the default format is ISO8601 formatting for those respective types (see details on the ISO8601 format for Date, Datetime, or Time). If you want to specify a date-like variable using standard Python/C strptime syntax, see here for details.
See here for more information about appropriate format values by variable type.
[Additional information]
Date formats (date, datetime, and time type variables):
A format for a date variable (date, time, datetime).
default: An ISO8601 format string.
any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies.
{PATTERN}: The value can be parsed according to {PATTERN}, which MUST follow the date formatting syntax of C / Python strftime, such as:
%Y-%m-%d (for date, e.g., 2023-05-25)
%Y%m%d (for date without dashes, e.g., 20230525)
%Y-%m-%dT%H:%M:%S (for datetime, e.g., 2023-05-25T10:30:45)
%Y-%m-%dT%H:%M:%SZ (for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z)
%Y-%m-%dT%H:%M:%S%z (for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300)
%Y-%m-%dT%H:%M (for datetime without seconds, e.g., 2023-05-25T10:30)
%Y-%m-%dT%H (for datetime without minutes and seconds, e.g., 2023-05-25T10)
%H:%M:%S (for time, e.g., 10:30:45)
%H:%M:%SZ (for time with UTC timezone, e.g., 10:30:45Z)
%H:%M:%S%z (for time with timezone offset, e.g., 10:30:45+0300)
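To see how a {PATTERN} value might be applied, the following minimal sketch parses a few of the example values with Python's standard `datetime.strptime`. The helper name `parse_with_pattern` is hypothetical; the compact date is parsed with `%Y%m%d`, and a literal `Z` in a pattern is matched as plain text, so the result carries no timezone in that case.

```python
from datetime import datetime


def parse_with_pattern(value: str, pattern: str) -> datetime:
    """Parse a value using a strptime/strftime-style pattern."""
    return datetime.strptime(value, pattern)


# (value, pattern) pairs taken from the examples in the text.
examples = [
    ("2023-05-25", "%Y-%m-%d"),
    ("20230525", "%Y%m%d"),
    ("2023-05-25T10:30:45", "%Y-%m-%dT%H:%M:%S"),
    ("2023-05-25T10:30:45Z", "%Y-%m-%dT%H:%M:%SZ"),
    ("2023-05-25T10:30:45+0300", "%Y-%m-%dT%H:%M:%S%z"),
    ("10:30:45", "%H:%M:%S"),
]

for value, pattern in examples:
    print(pattern, "->", parse_with_pattern(value, pattern).isoformat())
```

A value that does not match its pattern raises `ValueError`, which a validator could catch and report.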
String formats:
email: valid email addresses (e.g., test@gmail.com)
uri: valid URI addresses (e.g., https://example.com/resource123)
binary: a base64-encoded binary string (e.g., an authentication token like aGVsbG8gd29ybGQ=)
uuid: a universally unique identifier, also known as a GUID (e.g., f47ac10b-58cc-4372-a567-0e02b2c3d479)
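The string formats can be checked with the Python standard library, as in the sketch below. The function names are hypothetical, the email regular expression is deliberately loose, and the URI check (a scheme plus a network location) is an assumption rather than the rule the HEAL platform applies.

```python
import base64
import binascii
import re
import uuid
from urllib.parse import urlparse

# Loose, illustrative email pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_email(value: str) -> bool:
    return EMAIL_RE.match(value) is not None


def is_uri(value: str) -> bool:
    parsed = urlparse(value)
    return bool(parsed.scheme and parsed.netloc)


def is_base64(value: str) -> bool:
    try:
        base64.b64decode(value, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def is_uuid(value: str) -> bool:
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


print(is_email("test@gmail.com"))
print(is_uri("https://example.com/resource123"))
print(is_base64("aGVsbG8gd29ybGQ="))
print(is_uuid("f47ac10b-58cc-4372-a567-0e02b2c3d479"))
```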
Geopoint formats:
There are two formats for geopoint (describing a geographic point):
array: if 'lat,long' (e.g., 36.63,-90.20)
object: if {'lat':36.63,'lon':-90.20}
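A reader of such data might normalize both geopoint representations to a single (latitude, longitude) tuple. The sketch below assumes the coordinate order given in the text (latitude first) and uses a hypothetical helper, `parse_geopoint`, that dispatches on the Python type of the value.

```python
def parse_geopoint(value):
    """Return (lat, lon) as floats from a 'lat,long' string, a pair, or an object."""
    if isinstance(value, dict):
        # object format, e.g. {'lat': 36.63, 'lon': -90.20}
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, str):
        # 'lat,long' text, e.g. "36.63,-90.20"
        lat, lon = value.split(",")
        return float(lat), float(lon)
    # already a two-element sequence, e.g. [51.5074, -0.1278]
    lat, lon = value
    return float(lat), float(lon)


print(parse_geopoint("36.63,-90.20"))
print(parse_geopoint({"lat": 36.63, "lon": -90.20}))
print(parse_geopoint([51.5074, -0.1278]))
```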
constraints
(object)
- maxLength (integer) Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.
- enum (array) Constrains possible values to a set of values.
Examples:
[1, 2, 3, 4]
['White', 'Black or African American', 'American Indian or Alaska Native', 'Native Hawaiian or Other Pacific Islander', 'Asian', 'Some other race', 'Multiracial']
- pattern (string) A regular expression pattern the data MUST conform to.
- maximum (integer) Specifies the maximum value of a field (e.g., the maximum, or most recent, date; the maximum integer; etc.). Note that this is different from the maxLength property.
- minimum (integer) Specifies the minimum value of a field.
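The constraints can be read as simple checks on each value. The following sketch, with a hypothetical `check_constraints` helper, returns the names of the constraints a value fails. Treating pattern as a full match against the whole value is an assumption, since the text only says the data must conform to the pattern.

```python
import re


def check_constraints(value, constraints):
    """Return the names of the constraints that the value violates."""
    failed = []
    if "maxLength" in constraints and hasattr(value, "__len__"):
        if len(value) > constraints["maxLength"]:
            failed.append("maxLength")
    if "enum" in constraints and value not in constraints["enum"]:
        failed.append("enum")
    if "pattern" in constraints:
        # Assumption: the whole value must match the regular expression.
        if re.fullmatch(constraints["pattern"], str(value)) is None:
            failed.append("pattern")
    if "maximum" in constraints and value > constraints["maximum"]:
        failed.append("maximum")
    if "minimum" in constraints and value < constraints["minimum"]:
        failed.append("minimum")
    return failed


print(check_constraints("Hello World", {"maxLength": 11}))
print(check_constraints(5, {"enum": [1, 2, 3, 4]}))
print(check_constraints(7, {"minimum": 0, "maximum": 5}))
```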
encodings
(object)
Variable value encodings provide a way to further annotate any value within any variable type,
making values easier to understand.
Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings, and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.
As another use case, this field provides a way to document categorical values that are stored as "short" labels (such as abbreviations).
Examples:
{'0': 'No', '1': 'Yes'}
{'HW': 'Hello world', 'GBW': 'Good bye world', 'HM': 'Hi, Mike'}
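An encodings object can be applied as a plain lookup table. In this sketch the helper `decode` is hypothetical; stored values are converted to strings before lookup because the keys in the examples are strings, and values without an encoding are passed through unchanged.

```python
def decode(values, encodings):
    """Replace stored values with their labels where an encoding exists."""
    return [encodings.get(str(value), value) for value in values]


yes_no = {"0": "No", "1": "Yes"}
greetings = {"HW": "Hello world", "GBW": "Good bye world", "HM": "Hi, Mike"}

print(decode([0, 1, "1", 2], yes_no))
print(decode(["HW", "HM"], greetings))
```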
ordered
(boolean)
Indicates whether a categorical variable is ordered. This variable is
relevant for variables that have an ordered relationship but not
necessarily a numerical relationship (e.g., Strongly disagree < Disagree
< Neutral < Agree).
missingValues
(array)
A list of missing values specific to a variable.
Examples:
['Missing', 'Skipped', 'No preference']
['Missing']
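When reading data, the missingValues list can be used to turn placeholder strings into a true null. The helper `mask_missing` below is a hypothetical sketch that maps any listed value to `None`.

```python
def mask_missing(values, missing_values):
    """Replace values listed in missing_values with None."""
    missing = set(missing_values)
    return [None if value in missing else value for value in values]


raw = ["Tea", "Missing", "Coffee", "Skipped", "No preference"]
print(mask_missing(raw, ["Missing", "Skipped", "No preference"]))
```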
trueValues
(array)
For boolean variables (as defined in the type field), this field allows
a physical string representation to be cast as true (increasing
readability of the field). It can include one or more values.
Examples:
['required', 'Yes', 'Checked']
['required']
falseValues
(array)
For boolean variables (as defined in the type field), this field allows
a physical string representation that is not a standard false value to be cast as false
(increasing readability of the field). It can include one or more values.
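Taken together, trueValues and falseValues describe how stored strings map to booleans. The sketch below uses a hypothetical `cast_boolean` helper; the falseValues list shown is an illustrative assumption, since the text gives examples only for trueValues.

```python
def cast_boolean(value, true_values, false_values):
    """Cast a stored representation to True or False, or raise ValueError."""
    if value in true_values:
        return True
    if value in false_values:
        return False
    raise ValueError(f"{value!r} is not listed in trueValues or falseValues")


true_values = ["required", "Yes", "Checked"]
false_values = ["not required", "No", "Unchecked"]  # hypothetical example values

print(cast_boolean("Checked", true_values, false_values))
print(cast_boolean("No", true_values, false_values))
```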
repo_link
(string)
A link to the variable as it exists in its home repository, if applicable.
standardsMappings
(array)
A published set of standard variables such as the NIH Common Data Elements program.
relatedConcepts
(array)
Mappings to a published set of concepts related to the given field, such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.).
univarStats
(object)
Univariate statistics inferred from the data about the given variable.
- median (number)
- mean (number)
- std (number)
- min (number)
- max (number)
- mode (number)
- count (integer)
- twentyFifthPercentile (number)
- seventyFifthPercentile (number)
- categoricalMarginals (array)
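For a numeric variable, most of the univarStats properties could be derived with Python's `statistics` module (Python 3.8 or later), as in this sketch. The helper `univariate_stats` is hypothetical, and two choices are assumptions because the schema text does not specify them: std is computed as the sample standard deviation, and the percentiles use the "inclusive" quantile method. categoricalMarginals is left out because its structure is not described here.

```python
import statistics


def univariate_stats(values):
    """Compute univarStats-style summary statistics for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # assumption: sample standard deviation
        "min": min(values),
        "max": max(values),
        "mode": statistics.mode(values),
        "count": len(values),
        "twentyFifthPercentile": q1,
        "seventyFifthPercentile": q3,
    }


stats = univariate_stats([1, 2, 2, 3, 5])
for key, value in stats.items():
    print(key, value)
```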