Phenopackets Module
The RareLink-Phenopacket module allows users to generate GA4GH Phenopackets from data stored in a local REDCap project. For the RareLink-CDM the engine is preconfigured so that Phenopackets can be instantly exported via the RareLink CLI.
The RareLink-Phenopacket module is designed to be modular and flexible so that it can be adapted to other REDCap data structures. See the section "Usage for other REDCap data models" below.
Get started
To use the Phenopacket export, you need a running REDCap project with API access and the RareLink-CDM instruments set up. You also need the framework and all its components running. You can run the following commands to set everything up:
rarelink framework update to update the framework and all components.
rarelink setup redcap-project to set up a REDCap project with your REDCap administrator.
rarelink setup keys to set up the REDCap API access locally.
RareLink-CDM to Phenopackets
Once you have captured data in your REDCap project using the RareLink-CDM REDCap instruments, you can export the data to Phenopackets. The data is exported to one Phenopacket JSON file per individual and can be used for further analysis.
For this, simply run:
rarelink phenopackets export
And you will be guided through the exporting process. The Phenopackets will be exported to the configured output directory (default is your Downloads folder).
Note
Make sure you comply with your local data protection regulations and ethical agreements before exporting the data!
The section (6.4) Family History (rarelink_6_4_family_history) is not implemented yet. This section may be included in future versions of the RareLink-Phenopacket module.
Hint
Read the Preconfigurations below to see how the RareLink-Phenopacket module is configured to handle specific fields for dates, data privacy, and preferences over certain fields for the Phenopacket export.
RareLink-Phenopacket engine
The RareLink-Phenopacket module is developed in a modular way to allow for easy adaptation to other REDCap data structures. All data model specific configurations and mappings of the RareLink-CDM are within its GitHub folder. Therefore, all functions and modules we developed can be used or adapted for other REDCap data models extending the RareLink-CDM once the data model is converted to a similar LinkML schema.
Overview
To provide an overview, the RareLink-Phenopacket module consists of the following components:
mappings (GitHub Folder): Contains all the mappings from the REDCap data model to the respective blocks in the Phenopacket schema without containing data-model specific values or codes.
DataProcessor (Python Class): Contains all functions to process any REDCap data to Phenopacket-compliant data, including field fetching, data processing, data validation, label and mapping, repeated element, and generation methods.
create (Python function): Contains the main function to generate Phenopackets from the processed data.
write (Python function): Contains the function to write the generated Phenopackets to a JSON file.
phenopacket pipeline (Python function): Contains the pipeline to generate Phenopackets from the processed data.
Phenopackets validation
The engine utilizes the Phenopackets Python Library and its Python classes to generate Phenopackets. These classes include inherent validation mechanisms, which raise errors if a required field is missing or if a field is not in the correct format.
If further validation is needed, you can use the Validation Module of the pyphetools Python Package.
Preconfigurations
The engine provides several preconfigurations to streamline data processing. These include:
Date conversions
The engine converts dates to a Phenopacket Age element, expressed as an ISO8601 duration with year and month precision, for the following elements:
Individual.timeAtLastEncounter
VitalStatus.timeOfDeath
PhenotypicFeature.onset
PhenotypicFeature.resolution
Disease.onset
For example, the resulting ISO8601 duration is formatted as follows:
age:
  iso8601duration: "P25Y3M"
The Individual.dateOfBirth must be a Phenopacket TimeStamp element. Therefore, for data privacy, only the year and month are exported from REDCap.
Preferences:
PhenotypicFeature.onset: The engine prefers the ISO8601Duration defined in section 6.2.3 Phenotype Determination over the Age.
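The date-to-age conversion described above can be illustrated with a minimal sketch. It is not the engine's own code: the function name iso8601_age is hypothetical, and the rule that a month only counts once the day of birth has been reached is an assumption made for the example.

```python
from datetime import date


def iso8601_age(date_of_birth: date, event_date: date) -> str:
    """Return the age at event_date as an ISO8601 duration with years and months."""
    if event_date < date_of_birth:
        raise ValueError("event_date lies before date_of_birth")
    # Number of calendar-month steps between the two dates.
    months = (event_date.year - date_of_birth.year) * 12 + (
        event_date.month - date_of_birth.month
    )
    # Assumption: a month only counts once the day of birth has been reached.
    if event_date.day < date_of_birth.day:
        months -= 1
    years, remaining_months = divmod(months, 12)
    return f"P{years}Y{remaining_months}M"


print(iso8601_age(date(1998, 1, 15), date(2023, 4, 20)))  # P25Y3M
```

Because only year and month of the date of birth are exported, a real conversion cannot rely on the exact day; the day-of-month rule is included here only to make the arithmetic explicit.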
Usage for other REDCap data models
If you want to adapt the RareLink-Phenopacket module to another REDCap data model, you can follow these steps:
Develop your REDCap sheets and instruments according to the Develop REDCap Instruments section. Try to use the RareLink-CDM for as much as you can - this will make the mapping and export process easier.
(OPTIONAL): Convert your REDCap data model to a LinkML schema. This can be done by following the instructions in the RareLink-CDM section.
Convert your REDCap data model using the redcap_to_linkml function. This will convert your REDCap data to a JSON schema that handles repeating elements more inherently.
Create mapping configurations for your data model.
Mapping Configuration for Custom Data Models
To create a custom mapping configuration, develop a Python file with a
create_phenopacket_mappings() function. You can use either a single configuration
approach (standard dictionary) or a multi-configuration approach (list of dictionaries)
for more complex data models.
Single Configuration Approach (Standard):
def create_phenopacket_mappings():
    """
    Create a comprehensive mapping configuration for Phenopacket creation.

    Returns:
        Dict: Mapping configurations for Phenopacket generation
    """
    return {
        "individual": {
            "instrument_name": "your_instrument_name",
            "mapping_block": {
                "id_field": "your_id_field",
                "date_of_birth_field": "your_dob_field",
                # ... other field mappings
            },
            "label_dicts": {
                "GenderIdentity": {
                    "code1": "label1",
                    "code2": "label2",
                    # ... other labels
                }
            },
            "mapping_dicts": {
                "map_sex": {
                    "code1": "FEMALE",
                    "code2": "MALE",
                    # ... other mappings
                }
            }
        },
        # Add other blocks: vitalStatus, diseases, phenotypicFeatures, etc.
        "metadata": {
            "code_systems": {
                # Optional: custom code systems
            }
        }
    }
Multi-Configuration Approach (For Complex Data Models):
If your data model has multiple instruments for the same phenopacket block (e.g., different instruments for different types of phenotypic features), you can use the multi-configuration approach:
def create_phenopacket_mappings():
    """
    Create a comprehensive mapping configuration for Phenopacket creation
    with support for multiple instruments per block.

    Returns:
        Dict: Mapping configurations for Phenopacket generation
    """
    return {
        # Standard blocks
        "individual": {
            # Standard configuration as above
        },
        # Multi-configuration for phenotypic features
        "phenotypicFeatures": [
            # First instrument configuration
            {
                "instrument_name": "first_instrument",
                "mapping_block": FIRST_FEATURES_BLOCK,
                "data_model": "first_model",
                "label_dicts": {
                    # Label dictionaries
                },
                "mapping_dicts": {
                    # Mapping dictionaries
                },
                # Additional configuration
            },
            # Second instrument configuration
            {
                "instrument_name": "second_instrument",
                "mapping_block": SECOND_FEATURES_BLOCK,
                "data_model": "second_model",
                # Other configuration
            }
        ],
        # Other blocks follow the standard configuration
    }
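Since a block may hold either a single dictionary or a list of dictionaries, code that consumes the configuration has to accept both shapes. The following is a minimal sketch of one way to normalize them; the helper name block_configs is hypothetical and is not part of the RareLink API.

```python
from typing import Any, Dict, List

BlockConfig = Dict[str, Any]


def block_configs(mappings: Dict[str, Any], block: str) -> List[BlockConfig]:
    """Return the configurations of one block as a list, whichever approach was used."""
    config = mappings.get(block)
    if config is None:
        return []          # block not configured at all
    if isinstance(config, dict):
        return [config]    # single configuration approach
    return list(config)    # multi-configuration approach


mappings = {
    "individual": {"instrument_name": "your_instrument_name"},
    "phenotypicFeatures": [
        {"instrument_name": "first_instrument", "data_model": "first_model"},
        {"instrument_name": "second_instrument", "data_model": "second_model"},
    ],
}

for block in ("individual", "phenotypicFeatures", "diseases"):
    names = [cfg["instrument_name"] for cfg in block_configs(mappings, block)]
    print(block, names)
```

Treating every block as a list keeps the downstream loop identical for simple and complex data models.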
Command-Line Usage
Export Phenopackets using the command-line interface with various options:
# Basic export using default RareLink mappings
rarelink phenopackets export
# Specify a custom input file
rarelink phenopackets export --input-path /path/to/custom/input.json
# Specify a custom output directory
rarelink phenopackets export --output-dir /path/to/custom/output
# Use custom mapping configuration
rarelink phenopackets export --mappings /path/to/custom_mappings.py
# Enable debug mode for verbose logging
rarelink phenopackets export --debug
# Skip environment validation
rarelink phenopackets export --skip-validation
# Override CREATED_BY from .env
rarelink phenopackets export --created-by "Your Name"
# Set custom timeout (in seconds)
rarelink phenopackets export --timeout 7200
All Command-Line Options
| Option | Short Flag | Type | Description |
|---|---|---|---|
| --input-path | | PATH | Path to the input LinkML JSON file |
| --output-dir | | PATH | Directory to save Phenopackets |
| --mappings | | PATH | Path to custom mapping configuration module |
| --debug | | FLAG | Enable debug mode for verbose logging |
| --skip-validation | | FLAG | Skip environment validation |
| --created-by | | TEXT | Override CREATED_BY from .env |
| --timeout | | INTEGER | Timeout in seconds (default: 3600) |
| --help | | FLAG | Show help message and exit |
The Phenopacket engine is designed to work with multiple data models:
- RareLink CDM Structure:
In the RareLink CDM, data is organized by instrument name with a specific data field:
{
  "repeated_elements": [
    {
      "redcap_repeat_instrument": "rarelink_6_2_phenotypic_feature",
      "redcap_repeat_instance": 1,
      "phenotypic_feature": {
        "snomedct_8116006": "HP:0001059",
        "other_fields": "..."
      }
    }
  ]
}
- Custom Data Model Structure:
For custom data models like CIEINR, data may be organized differently:
{
  "repeated_elements": [
    {
      "redcap_repeat_instrument": "infections_initial_form",
      "redcap_repeat_instance": 1,
      "infections_initial_form": {
        "snomedct_21483005": "hp_0002383",
        "other_fields": "..."
      }
    }
  ]
}
The engine automatically detects the data model structure and accesses the fields accordingly.
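To make the difference between the two layouts concrete, here is a minimal sketch of how the inner data dictionary of a repeated element could be located. The lookup order (instrument-named key first, then the first dictionary-valued key) and the function name instrument_data are assumptions for illustration, not the engine's documented detection logic.

```python
from typing import Any, Dict, Optional


def instrument_data(element: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the inner data dict of a repeated element, or None if there is none."""
    instrument = element.get("redcap_repeat_instrument")
    # Custom-model layout: the data sits under a key equal to the instrument name.
    inner = element.get(instrument)
    if isinstance(inner, dict):
        return inner
    # RareLink-CDM layout: the data sits under a differently named key,
    # so fall back to the first dictionary-valued entry (assumption).
    for value in element.values():
        if isinstance(value, dict):
            return value
    return None


cdm_element = {
    "redcap_repeat_instrument": "rarelink_6_2_phenotypic_feature",
    "redcap_repeat_instance": 1,
    "phenotypic_feature": {"snomedct_8116006": "HP:0001059"},
}
custom_element = {
    "redcap_repeat_instrument": "infections_initial_form",
    "redcap_repeat_instance": 1,
    "infections_initial_form": {"snomedct_21483005": "hp_0002383"},
}

print(instrument_data(cdm_element))     # {'snomedct_8116006': 'HP:0001059'}
print(instrument_data(custom_element))  # {'snomedct_21483005': 'hp_0002383'}
```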
Mapping Configuration Structure
The mapping configuration is a nested dictionary with the following key components:
Block-level Configuration
- instrument_name: REDCap instrument name for repeated elements
- mapping_block: Dictionary mapping REDCap fields to Phenopacket schema
- label_dicts: Dictionaries for human-readable labels
- mapping_dicts: Dictionaries for code mappings
Supported Blocks
- individual
- vitalStatus
- diseases
- phenotypicFeatures
- measurements
- medical_actions
- variationDescriptor
- interpretations
- metadata
Tip
Check out the RareLink-CDM combined.py and all other mappings (src/rarelink/rarelink_cdm/mappings/phenopackets/) to see the full structure and examples of mapping configurations.
Advanced Configuration Options
Multiple Instruments
You can specify multiple instruments for a block using a set or list:
"instrument_name": {"instrument1", "instrument2"}
Data Model Specification
For custom data models, specify the model:
"data_model": "infections" # or "conditions", "rarelink_cdm", etc.
Type Fields
Specify explicit type fields to scan:
"type_fields": [ "field1", "field2", "field3" ]
Multi-onset Support
Enable multi-onset for features with multiple onset dates:
"multi_onset": True, "onset_date_fields": [ "onset_date_1", "onset_date_2" ]
Field Scanning Control
Disable automatic field scanning:
"enable_field_scanning": False
Mapping Strategies
Label Dictionaries
Provide human-readable labels for codes:
"label_dicts": { "GenderIdentity": { "code1": "Female", "code2": "Male" } }
Mapping Dictionaries
Map local codes to standardized terms:
"mapping_dicts": { "map_sex": { "local_code1": "FEMALE", "local_code2": "MALE" } }
Instrument Names
Specify the correct REDCap instrument for repeated elements:
"instrument_name": "your_repeat_instrument_name"
Best Practices
Use ontology codes where possible
Provide comprehensive label and mapping dictionaries
Ensure instrument names match REDCap configuration
Use the RareLink-CDM as a reference for mapping structure
Enable multi-onset for features with multiple occurrence dates
Set appropriate data model for specialized instruments
Use explicit type fields to control which fields generate features
This section provides general examples of how to structure repeating and non-repeating data blocks. Customize the right-hand side values to fit specific user fields. The left-hand values are derived from the respective Phenopacket blocks Disease and Individual.
INDIVIDUAL_BLOCK = {
"id_field": "<individual_id>",
"date_of_birth_field": "<date_of_birth>",
"time_at_last_encounter_field": "<last_encounter>",
"sex_field": "<sex>",
"karyotypic_sex_field": "<karyotypic_sex>",
"gender_field": "<gender>",
}
DISEASE_BLOCK = {
"redcap_repeat_instrument": "<instrument_name>",
"term_field_1": "<disease_term_1>",
"term_field_2": "<disease_term_2>",
"term_field_3": "<disease_term_3>",
"term_field_4": "<disease_term_4>",
"term_field_5": "<disease_term_5>",
"excluded_field": "<excluded_term>",
"onset_date_field": "<onset_date>",
"onset_category_field": "<onset_category>",
"primary_site_field": "<primary_site>",
}
PHENOTYPIC_FEATURES_BLOCK = {
"redcap_repeat_instrument": "<instrument_name>",
"type_field": "<feature_type_field>",
"excluded_field": "<excluded_feature_field>",
"onset_date_field": "<onset_date_field>",
"onset_age_field": "<onset_age_field>",
"resolution_field": "<resolution_field>",
"severity_field": "<severity_field>",
"evidence_field": "<evidence_field>",
"modifier_field_1": "<modifier_field_1>",
"modifier_field_2": "<modifier_field_2>",
# For multi-onset support
"multi_onset": True,
"onset_date_fields": ["<date_field_1>", "<date_field_2>", "<date_field_3>"]
}
Notes:
Replace <instrument_name> and other placeholders with the specific field names or codes used in your REDCap project or dataset.
For repeating blocks, ensure the redcap_repeat_instrument value matches the instrument name configured in REDCap.
For multi-onset features, set “multi_onset”: True and provide a list of date fields.
Customize as needed for other field mappings.
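A mismatch between the redcap_repeat_instrument value in a mapping block and the instrument names present in the data is easy to overlook. The following minimal sketch shows one way to check for it before running an export; the helper unmatched_instruments and the sample values are hypothetical and not part of RareLink.

```python
from typing import Any, Dict, Iterable, List


def unmatched_instruments(
    mapping_blocks: Iterable[Dict[str, Any]], data: Dict[str, Any]
) -> List[str]:
    """Return configured instrument names that never occur in the repeated elements."""
    present = {
        element.get("redcap_repeat_instrument")
        for element in data.get("repeated_elements", [])
    }
    configured = [
        block["redcap_repeat_instrument"]
        for block in mapping_blocks
        if "redcap_repeat_instrument" in block  # non-repeating blocks are skipped
    ]
    return [name for name in configured if name not in present]


# Illustrative blocks; the second instrument name contains a deliberate typo.
blocks = [
    {"id_field": "<individual_id>"},
    {"redcap_repeat_instrument": "rarelink_5_disease"},
    {"redcap_repeat_instrument": "rarelink_6_2_phenotypic_featur"},
]
data = {
    "repeated_elements": [
        {"redcap_repeat_instrument": "rarelink_5_disease", "redcap_repeat_instance": 1},
        {"redcap_repeat_instrument": "rarelink_6_2_phenotypic_feature", "redcap_repeat_instance": 1},
    ]
}

print(unmatched_instruments(blocks, data))  # ['rarelink_6_2_phenotypic_featur']
```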
The label dictionaries map codes to the human-readable labels defined in your
value sets. Replace the placeholders with the codes and labels relevant to
your use case. Make sure to include the get_mapping_by_name function below in
your .py file so that the DataProcessor can access the mappings correctly.
All codes that are not defined here are fetched from the BioPortal API by
the DataProcessor.
label_dicts = {
"CategoryName1": {
"<code_1>": "<label_1>",
"<code_2>": "<label_2>",
"<code_3>": "<label_3>",
"<code_4>": "<label_4>",
"<code_5>": "<label_5>",
},
"CategoryName2": {
"<code_1>": "<label_1>",
"<code_2>": "<label_2>",
"<code_3>": "<label_3>",
"<code_4>": "<label_4>",
},
}
def get_mapping_by_name(name, to_boolean=False):
    for mapping_dict in mapping_dicts:
        if mapping_dict["name"] == name:
            mapping = mapping_dict["mapping"]
            if to_boolean:
                return {key: value.lower() == "true" for key, value in mapping.items()}
            return mapping
    raise KeyError(f"No mapping found for name: {name}")
The mapping dictionaries map codes to standardized terms or enums, with the mapped values corresponding to Phenopacket-specific elements. Replace the placeholders with the relevant codes and Phenopacket terms.
mapping_dicts = [
{
"name": "<mapping_name_1>",
"mapping": {
"<code_1>": "<PHENOPACKET_TERM_1>", # Example: "FEMALE"
"<code_2>": "<PHENOPACKET_TERM_2>", # Example: "MALE"
"<code_3>": "<PHENOPACKET_TERM_3>", # Example: "UNKNOWN_SEX"
"<code_4>": "<PHENOPACKET_TERM_4>", # Example: "OTHER_SEX"
"<code_5>": "<PHENOPACKET_TERM_5>", # Example: "NOT_RECORDED"
},
},
{
"name": "<mapping_name_2>",
"mapping": {
"<code_1>": "<PHENOPACKET_TERM_1>",
"<code_2>": "<PHENOPACKET_TERM_2>",
"<code_3>": "<PHENOPACKET_TERM_3>",
},
},
]
Notes:
Mapping Name: Replace <mapping_name_x> with descriptive names for the mapping (e.g., “map_sex”, “map_disease”).
Codes: Replace <code_x> with actual codes (e.g., snomedct_248152002).
Phenopacket Terms: Replace <PHENOPACKET_TERM_X> with specific Phenopacket-standardized terms (e.g., “FEMALE”, “UNKNOWN_SEX”).
Add additional mappings as necessary to include all relevant Phenopacket-specific elements.
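As a usage illustration, the sketch below fills the template with two small mappings and looks them up with get_mapping_by_name. The mapping names and the second SNOMED code are illustrative assumptions, and map_excluded is a hypothetical mapping used only to show the to_boolean option.

```python
mapping_dicts = [
    {
        "name": "map_sex",
        "mapping": {
            "snomedct_248152002": "FEMALE",
            "snomedct_248153007": "MALE",  # illustrative code, check your value set
        },
    },
    {
        # Hypothetical mapping whose values are the strings "true" / "false".
        "name": "map_excluded",
        "mapping": {"yes": "true", "no": "false"},
    },
]


def get_mapping_by_name(name, to_boolean=False):
    """Look up a mapping by name; optionally convert 'true'/'false' strings to bool."""
    for mapping_dict in mapping_dicts:
        if mapping_dict["name"] == name:
            mapping = mapping_dict["mapping"]
            if to_boolean:
                return {key: value.lower() == "true" for key, value in mapping.items()}
            return mapping
    raise KeyError(f"No mapping found for name: {name}")


print(get_mapping_by_name("map_sex")["snomedct_248152002"])   # FEMALE
print(get_mapping_by_name("map_excluded", to_boolean=True))   # {'yes': True, 'no': False}
```

Requesting a name that is not listed raises the KeyError "No mapping found for name", which is the error described in the troubleshooting notes.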
Phenopacket Adapters
Adapters are preprocessing functions that transform or split data before any mapper runs. They are designed to solve structural challenges that arise when a single REDCap instrument or data element does not map cleanly to a single Phenopacket block. All adapters are:
Opt-in — activated by a configuration key; absent means no effect.
Mapper-agnostic — they produce data shaped to the conventions that existing mappers already understand, so no mapper code changes are needed.
Reusable — they are designed for general use, not tied to any specific data model.
The adapters live in src/rarelink/phenopackets/adapter/.
Multi-Onset Adapter
Module: rarelink.phenopackets.adapter.multi_onset
When to use it
Use the multi-onset adapter when a single clinical finding has been observed
on multiple occasions, each with its own date, and you want to represent
each observation as a separate PhenotypicFeature entry with its own
onset value.
For example, a recurrent respiratory infection may have been recorded on
three separate dates. Without multi-onset, only the first date would be
captured. With multi-onset, three separate PhenotypicFeature messages
are produced — one per date — all sharing the same type, severity, and
modifiers.
Configuration
Enable multi-onset in the mapping block for the relevant phenotypicFeatures instrument:
FEATURES_BLOCK = {
"redcap_repeat_instrument": "your_instrument_name",
"type_field": "your_type_field",
"severity_field": "your_severity_field",
# Enable multi-onset
"multi_onset": True,
"onset_date_fields": [
"onset_date_1",
"onset_date_2",
"onset_date_3",
# ... add as many as your instrument supports
],
}
What it does
For each non-empty date in onset_date_fields:
Creates the base PhenotypicFeature using the normal mapping function.
Deep-copies it.
Replaces the onset field with an Age element computed from that date and the subject’s date of birth.
Appends the copy to the result list.
If no valid dates are found, the base feature (with whatever onset was already set) is returned as a single-element list.
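The steps above can be sketched with plain dictionaries standing in for the PhenotypicFeature messages. This is an illustration under stated assumptions, not the adapter's implementation: the names expand_multi_onset and iso8601_age are hypothetical, dates are assumed to be ISO strings, and the age rule counts a month only once the day of birth has been reached.

```python
import copy
from datetime import date
from typing import Any, Dict, List, Optional


def iso8601_age(date_of_birth: date, event_date: date) -> str:
    """Age at event_date as an ISO8601 duration with years and months."""
    months = (event_date.year - date_of_birth.year) * 12 + (
        event_date.month - date_of_birth.month
    )
    if event_date.day < date_of_birth.day:
        months -= 1
    years, remaining = divmod(months, 12)
    return f"P{years}Y{remaining}M"


def expand_multi_onset(
    base_feature: Dict[str, Any],
    instrument_data: Dict[str, Any],
    onset_date_fields: List[str],
    date_of_birth: Optional[date],
) -> List[Dict[str, Any]]:
    """Return one copy of base_feature per valid onset date."""
    if date_of_birth is None:
        return [base_feature]  # no age calculation possible
    features = []
    for field in onset_date_fields:
        raw = instrument_data.get(field)
        if not raw:
            continue  # empty date field
        try:
            onset = date.fromisoformat(raw)
        except ValueError:
            continue  # not a valid ISO date
        if onset < date_of_birth:
            continue  # implausible date, skipped in this sketch
        feature = copy.deepcopy(base_feature)
        feature["onset"] = {"age": {"iso8601duration": iso8601_age(date_of_birth, onset)}}
        features.append(feature)
    # Fall back to the base feature when no valid date was found.
    return features or [base_feature]


base = {"type": {"id": "HP:0004469"}, "severity": {"id": "HP:0012826"}}
inner = {"onset_date_1": "2022-02-01", "onset_date_2": "2023-03-01", "onset_date_3": ""}
result = expand_multi_onset(
    base, inner, ["onset_date_1", "onset_date_2", "onset_date_3"], date(2000, 1, 15)
)
print([f["onset"]["age"]["iso8601duration"] for f in result])  # ['P22Y0M', 'P23Y1M']
```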
Data shape expected
The adapter reads onset dates directly from the inner instrument dict of each repeated element:
{
"redcap_repeat_instrument": "your_instrument_name",
"redcap_repeat_instance": 1,
"your_instrument_name": {
"your_type_field": "HP:0004469",
"severity_field": "HP:0012826",
"onset_date_1": "2022-02-01",
"onset_date_2": "2023-03-01",
"onset_date_3": "2023-12-01"
}
}
Produces three separate PhenotypicFeature messages, each with
onset.age.iso8601duration computed from the respective date and DOB.
Notes
All copies share the same type, severity, and modifiers.
If DOB is unavailable, no age calculation is possible and the base feature is returned unchanged.
Combine with Ontology Routing Adapter when the same instrument also mixes HP and MONDO codes.
Ontology Routing Adapter
Module: rarelink.phenopackets.adapter.ontology_routing_adapter
When to use it
Use the ontology routing adapter when a single data element (or a set of mutually exclusive sub-fields within one instrument) may be populated with codes from different ontologies depending on the record — and each ontology should map to a different Phenopacket block.
The canonical example is a “specific finding” field that stores either an HPO term (a phenotypic abnormality) or a MONDO term (a disease entity). Both are clinically valid answers to the same question, but they belong in different Phenopacket blocks:
| Ontology prefix | Phenopacket block | Semantic basis |
|---|---|---|
| HP | phenotypicFeatures | HPO terms describe phenotypic abnormalities |
| MONDO | diseases | MONDO terms describe disease entities |
| OMIM | diseases | OMIM identifiers describe disease entities |
| ORDO | diseases | Orphanet identifiers describe rare disease entities |
These defaults are grounded in the GA4GH Phenopacket v2 schema and are
always active. They can be extended or overridden via the rules key
(see below).
Configuration
Add an ontology_routing key to your mapping_configs:
mapping_configs = {
...
"ontology_routing": {
"enabled": True,
# Instruments whose repeated elements should be inspected
"instruments": [
"your_mixed_instrument",
"another_mixed_instrument",
],
# Per-instrument: which fields to scan for routable codes.
# These are the type_field_1…N names from your mapping block.
# If omitted, all string fields are scanned (slower, zero-config).
"scan_fields": {
"your_mixed_instrument": [
"field_holding_hp_or_mondo_1",
"field_holding_hp_or_mondo_2",
],
"another_mixed_instrument": [
"field_a",
"field_b",
],
},
# Per-instrument: onset date fields.
# The first non-empty value is used as Disease.onset when a
# MONDO-coded element is routed to the diseases block.
"onset_fields": {
"your_mixed_instrument": [
"onset_date_1",
"onset_date_2",
],
"another_mixed_instrument": [
"condition_onset_date",
],
},
# Optional: override or extend the built-in prefix→block rules.
# Useful if your data model uses a non-standard ontology.
# "rules": {
# "HP": "phenotypicFeatures",
# "MONDO": "diseases",
# "MYCUSTOM": "diseases",
# }
},
...
}
When ontology_routing is absent from mapping_configs, the adapter
is never called and pipeline behaviour is identical to before.
What it does
Before any mapper runs, the adapter:
Iterates over data["repeated_elements"].
For each element from a configured instrument, scans the configured (or all) string fields for a value whose ontology prefix is in the routing rules.
Routes the element:
  HP-coded → placed in data["__routed__phenotypicFeatures"] unchanged. The PhenotypicFeatureMapper processes it normally, including multi-onset if configured.
  MONDO/OMIM/ORDO-coded → normalized to the term_field_1 / onset_date_field convention and placed in data["__routed__diseases"]. The DiseaseMapper consumes it via the same path it uses for all other disease data.
Elements with no routable code (e.g. a SNOMED sub-field that holds only SNOMED values) are left in the original repeated_elements stream and processed normally by whatever mapper is configured for that instrument.
The original repeated_elements list is never mutated.
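The routing steps can be sketched for a single instrument as follows. This is a simplified illustration, not the adapter's code: route_elements and ontology_prefix are hypothetical names, the prefix is assumed to be the letters before the first "_" or ":", and any rule target other than phenotypicFeatures is treated as diseases.

```python
import re
from typing import Any, Dict, List, Optional

# Default prefix-to-block rules as described for the adapter.
DEFAULT_RULES = {
    "HP": "phenotypicFeatures",
    "MONDO": "diseases",
    "OMIM": "diseases",
    "ORDO": "diseases",
}


def ontology_prefix(value: Any) -> Optional[str]:
    """Upper-cased prefix of a code such as 'mondo_0043653' or 'HP:0001059'."""
    if not isinstance(value, str):
        return None
    match = re.match(r"([A-Za-z]+)[_:]", value)
    return match.group(1).upper() if match else None


def route_elements(
    data: Dict[str, Any],
    instrument: str,
    scan_fields: List[str],
    onset_fields: List[str],
    rules: Optional[Dict[str, str]] = None,
) -> Dict[str, List[Dict[str, Any]]]:
    """Split the elements of one instrument by the ontology of their first routable code."""
    rules = rules or DEFAULT_RULES
    routed: Dict[str, List[Dict[str, Any]]] = {
        "__routed__phenotypicFeatures": [],
        "__routed__diseases": [],
    }
    for element in data.get("repeated_elements", []):
        if element.get("redcap_repeat_instrument") != instrument:
            continue
        inner = element.get(instrument) or {}
        # The first scanned field whose prefix has a rule decides the target block.
        code = next(
            (inner[f] for f in scan_fields if ontology_prefix(inner.get(f)) in rules),
            None,
        )
        if code is None:
            continue  # no routable code: element stays in the normal stream
        if rules[ontology_prefix(code)] == "phenotypicFeatures":
            routed["__routed__phenotypicFeatures"].append(element)  # unchanged
        else:
            onset = next((inner[f] for f in onset_fields if inner.get(f)), None)
            routed["__routed__diseases"].append({
                "term_field_1": code,
                "onset_date_field": onset,
                "__source_instrument": instrument,
                "__source_instance": element.get("redcap_repeat_instance"),
            })
    return routed


data = {"repeated_elements": [{
    "redcap_repeat_instrument": "infections_initial_form",
    "redcap_repeat_instance": 2,
    "infections_initial_form": {
        "type_of_infection": "snomedct_127856007",
        "snomedct_127856007": "mondo_0043653",
        "infection_severity": "hp_0012826",
        "infection_date": "2023-02-01",
    },
}]}
routed = route_elements(
    data, "infections_initial_form", ["snomedct_127856007"], ["infection_date"]
)
print(routed["__routed__diseases"][0]["term_field_1"])  # mondo_0043653
```

Restricting scan_fields matters in this example: infection_severity also holds an HP code, and scanning it would change where the element is routed.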
Data shape expected
A typical mixed element:
{
"redcap_repeat_instrument": "infections_initial_form",
"redcap_repeat_instance": 2,
"infections_initial_form": {
"type_of_infection": "snomedct_127856007", ← SNOMED category, ignored
"snomedct_127856007": "mondo_0043653", ← MONDO → diseases
"infection_severity": "hp_0012826",
"infection_date": "2023-02-01"
}
}
The adapter detects mondo_0043653 in snomedct_127856007, routes
the element to diseases, and normalizes it to:
{
"term_field_1": "mondo_0043653",
"onset_date_field": "2023-02-01",
"onset_category_field": None,
"excluded_field": None,
"primary_site_field": None,
"__source_instrument": "infections_initial_form",
"__source_instance": 2
}
Validation integration
The adapter also provides check_prefix_placement(), which is called
automatically by validate_phenopackets after each file is written. It
inspects the serialized Phenopacket JSON and emits non-fatal soft warnings
when:
An HP: term appears in diseases (likely a routing error)
A non-HP: term appears in phenotypicFeatures (MONDO/OMIM in a features block)
These warnings appear in the validation output but do not cause the phenopacket to fail validation. They are intended to help data curators catch misconfigured routing rules.
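The two conditions above can be sketched as a check on a serialized Phenopacket loaded as a Python dictionary. This is not the signature or behaviour of check_prefix_placement(); the function prefix_warnings is a hypothetical stand-in that only relies on the Phenopacket v2 JSON fields diseases[].term.id and phenotypicFeatures[].type.id.

```python
from typing import Any, Dict, List


def prefix_warnings(phenopacket: Dict[str, Any]) -> List[str]:
    """Collect soft warnings about ontology prefixes that sit in an unexpected block."""
    warnings: List[str] = []
    for disease in phenopacket.get("diseases", []):
        term_id = disease.get("term", {}).get("id", "")
        if term_id.startswith("HP:"):
            warnings.append(f"HP term {term_id} found in diseases")
    for feature in phenopacket.get("phenotypicFeatures", []):
        type_id = feature.get("type", {}).get("id", "")
        if type_id and not type_id.startswith("HP:"):
            warnings.append(f"non-HP term {type_id} found in phenotypicFeatures")
    return warnings


packet = {
    "id": "example",
    "phenotypicFeatures": [
        {"type": {"id": "HP:0004469"}},
        {"type": {"id": "MONDO:0043653"}},
    ],
    "diseases": [{"term": {"id": "HP:0002383"}}],
}

for message in prefix_warnings(packet):
    print("warning:", message)
```

The function only returns messages and never raises, which mirrors the non-fatal nature of the warnings described above.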
Adding the adapter to your mapping config
If you are building a custom data model and have instruments with mixed
ontology codes, add the ontology_routing block alongside your existing
phenotypicFeatures and diseases configurations. No changes to
the mapper classes or the rest of the pipeline are needed.
Tip
Combine this adapter with the Multi-Onset Adapter when MONDO-routed
elements are diseases observed multiple times. Configure multi_onset
in the base phenotypicFeatures block; the routing adapter handles the
HP/MONDO split first, and the multi-onset adapter then expands the HP
elements per date.
Common issues and their solutions:
No phenotypic features or diseases are generated
Check that the instrument names in your mapping configuration match the actual instrument names in your data
Verify that the field paths in your mapping block point to the correct fields
Enable debug mode (--debug) for more detailed logs
Field values not found
For RareLink CDM, remember that data is in a field named differently from the instrument (e.g., “rarelink_5_disease” instrument has data in “disease” field)
For CIEINR-like models, data is in a field with the same name as the instrument
Multi-onset features not working
Ensure “multi_onset” is set to True in your configuration
Verify that “onset_date_fields” contains the correct field names
Check that the date fields contain valid date values
Error: “No mapping found for name”
Ensure your mapping dictionaries include the requested mapping name
Verify that the get_mapping_by_name function is properly implemented
Timeout errors
Increase the timeout value with --timeout 7200 or higher
Process fewer records at a time
Missing labels for codes
Add the codes to your label dictionaries
Check BioPortal API access if using external label lookups