API documentation
CNA(config)
Copy number alteration data for all samples included in GENIE.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
Attributes:
-
data
(DataFrame
) –copy number data for all samples. HGNC gene symbol is row index, sample id is column index. If CNA information is missing for a gene-sample pair, the value is NaN.
get_cna(sample_id, gene_symbol)
Get copy number data for a single gene in a single sample.
Parameters:
-
sample_id
(str
) –sample identifier.
Returns:
-
float
–copy number (NaN if not measured).
Configuration(genie_dir, config_file=None)
Genie source data configuration.
The default configuration is read from a default JSON config file which is part of the package. If a user config file is specified, it is used to update the information from the default config file.
Parameters:
-
genie_dir
(str
) –the base directory name of the GENIE data files.
-
config_file
(str
, default:None
) –optional user config file (JSON).
Examples:
>>> import os
>>> import genie
>>> home = os.getenv("HOME")
>>> genie_dir = os.path.join(home, "genie_data")
>>> user_config_file = os.path.join(home, ".genie_config.json")
>>> # With a user config file, overwriting some defaults
>>> config = genie.Configuration(genie_dir, user_config_file)
get_aux_file_name(key)
Get the file name of an auxiliary data file.
Parameters:
-
key
(str
) –auxiliary data file identifier used in the config file.
Returns:
-
str
–the file name for this auxiliary data file.
get_cache_file_name(file_name)
Get the name of the Parquet cache file.
The name of the cache file is created from the original file name by stripping the extensions and appending the suffix '.parquet'.
Parameters:
-
file_name
(str
) –the name of the data file
Returns:
-
str
–the name of the cache file
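The naming rule can be sketched with the standard library. This is only an illustration, not the package's implementation: the helper name `cache_file_name` and the example file name are hypothetical, and the sketch assumes that every dot-separated suffix counts as an extension.

```python
from pathlib import Path


def cache_file_name(file_name: str) -> str:
    """Hypothetical helper: strip all extensions and append '.parquet'."""
    path = Path(file_name)
    stem = path.name
    # Remove suffixes one at a time, e.g. 'a.txt.gz' -> 'a.txt' -> 'a'.
    while Path(stem).suffix:
        stem = Path(stem).stem
    return str(path.with_name(stem + ".parquet"))


print(cache_file_name("data_CNA.txt"))
```

Under this assumption a name such as `v1.2_data.txt` would be cut at the first dot, so the real method may treat such names differently.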
get_case_list_file_names()
Get all files with case lists.
Returns:
-
dict
–dictionary with disease names as keys and file names as values.
get_config()
Get the complete configuration data.
Returns:
-
dict
–complete configuration as dictionary.
get_data_file_name(key)
Get the file name of a GENIE data file.
Parameters:
-
key
(str
) –data file identifier used in the config file.
Returns:
-
str
–the file name for this data file.
get_dir(key)
Get a directory name from the configuration.
Parameters:
-
key
(str
) –directory identifier used in the config file.
Returns:
-
str
–directory name.
get_gene_panel_file_names()
Get all files with gene panel descriptions.
Returns:
-
dict
–dictionary with panel IDs as keys and file names as values.
get_genie_dir()
Get the base directory of GENIE data.
Returns:
-
str
–base directory name for GENIE data.
get_genie_version()
Get the version of the GENIE release.
The version is obtained from the study meta file.
Returns:
-
str
–release version of GENIE data.
get_meta_file_name(key)
Get the file name of a GENIE meta data file.
Parameters:
-
key
(str
) –meta data file identifier used in the config file.
Returns:
-
str
–the file name for this meta file.
load_aux_file(key)
Load auxiliary data from a file or file cache.
Loads a dataframe from a file. The dataframe is read from the Parquet cache file if it exists. If the cache file does not exist yet, it is created so that the data loads faster next time.
Parameters:
-
key
(str
) –auxiliary file identifier used in the config file.
Returns:
-
DataFrame
–the data loaded from the file or its cache.
load_data_file(key)
Load GENIE data from a file or file cache.
Loads a dataframe from a file. The dataframe is read from the Parquet cache file if it exists. If the cache file does not exist yet, it is created so that the data loads faster next time.
Parameters:
-
key
(str
) –data file identifier used in the config file.
Returns:
-
DataFrame
–the data loaded from the file or its cache.
load_file(file_name)
Load a dataframe from a file or file cache.
Loads a dataframe from a file. The dataframe is read from the Parquet cache file if it exists. If the cache file does not exist yet, it is created so that the data loads faster next time.
Parameters:
-
file_name
(str
) –the name of the data file.
Returns:
-
DataFrame
–the data loaded from the file or its cache.
Genie(genie_dir, config_file=None, verbose=True)
This is the main class for this module, holding all GENIE data.
An object of this class holds and provides all data from a Genie release. This includes all panels (assays) and the genomic regions tested by each panel, patient information, sample information, copy number data, and mutation data.
Parameters:
-
genie_dir
(str
) –path name of the directory with Genie data files.
-
config_file
(str
, default:None
) –path name of an optional user config file, defining alternative names for GENIE data files if names should change for future versions of GENIE.
Attributes:
-
panel_set
(PanelSet
) –all panels (assays) included in GENIE.
-
patient_info
(PatientInfo
) –clinical information about patients.
-
sample_info
(SampleInfo
) –sample annotations.
-
cna
(CNA
) –copy number data.
-
mutations
(Mutations
) –mutation data.
-
version
(str
) –Genie version number.
aggregate_to_amino_acid_level(sample_mutation_profiles, gene_symbols, universe='ensembl')
Aggregate mutations to amino acid level.
This function aggregates the sample mutation level profiles to amino acid level. Several different nucleic acid changes can lead to the same amino acid change. These are aggregated by this function.
The gene_symbols
argument needs to be specified because a mutation can
affect multiple genes, so the mapping from hgvsg to gene symbol is not
unambiguous.
Parameters:
-
sample_mutation_profiles
(DataFrame
) –a single column dataframe with hgvsg and sample_id as row index and the boolean mutation status as column.
-
gene_symbols
(list
) –genes of interest - the hgvsgs from the mutation profiles may map to multiple genes, and only the gene symbols from this list will be kept.
-
universe
(str
, default:'ensembl'
) –one of ensembl and mane, defines which annotations will be used to create the mapping from hgvsg to gene symbols.
Returns:
-
DataFrame
–dataframe with two-dimensional row index (hgvsp, sample_id) and single column with the mutation status
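The aggregation rule (a sample carries an amino acid change if it has any of the nucleic acid changes leading to it) can be illustrated without pandas. The sketch below is not the package's code: `to_amino_acid_level`, the plain-dict data layout and the placeholder variant names are assumptions made for illustration.

```python
from collections import defaultdict


def to_amino_acid_level(profiles, hgvsg_to_hgvsp):
    """Combine hgvsg-level calls into hgvsp-level calls per sample.

    profiles: {(hgvsg, sample_id): bool}, hypothetical stand-in for the dataframe
    hgvsg_to_hgvsp: {hgvsg: hgvsp}, hypothetical annotation mapping
    """
    result = defaultdict(bool)  # missing entries start as False (wild type)
    for (hgvsg, sample_id), mutated in profiles.items():
        hgvsp = hgvsg_to_hgvsp.get(hgvsg)
        if hgvsp is None:
            continue  # variant does not map to a protein change of interest
        # Mutated if ANY contributing nucleic acid change is present.
        result[(hgvsp, sample_id)] |= mutated
    return dict(result)


profiles = {
    ("g.A", "S1"): True,
    ("g.B", "S1"): False,
    ("g.A", "S2"): False,
    ("g.B", "S2"): False,
}
print(to_amino_acid_level(profiles, {"g.A": "p.X", "g.B": "p.X"}))
```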
aggregate_to_gene_level(sample_mutation_profiles, gene_symbols, universe='ensembl', min_mutations=1)
Aggregate mutations to gene level.
This function aggregates the sample mutation level profiles to gene
level. A sample is called mutated for a gene if it has at least
min_mutations
mutations of that gene. For oncogenes, min_mutations
should be
kept at the default value of 1. For tumor suppressor genes, it may be
desired to require at least two hits to call a sample mutated, assuming
that the two hits would affect the two copies of that gene. Any
filtering for the functional relevance of mutations needs to be done
prior to calling this function.
The gene_symbols
argument needs to be specified because a mutation can
affect multiple genes, so the mapping from hgvsg to gene symbol is not
unambiguous.
Parameters:
-
sample_mutation_profiles
(DataFrame
) –a single column dataframe with hgvsg and sample_id as row index and the boolean mutation status as column.
-
gene_symbols
(list
) –genes of interest - the hgvsgs from the mutation profiles may map to multiple genes, and only the gene symbols from this list will be kept.
-
universe
(str
, default:'ensembl'
) –one of ensembl and mane, defines which annotations will be used to create the mapping from hgvsg to gene symbols.
-
min_mutations
(int
, default:1
) –Call a gene mutated if it has at least that many mutations. Should be 1 for oncogenes and may be set to 2 for tumor suppressor genes.
Returns:
-
DataFrame
–dataframe with two-dimensional row index (gene, sample_id) and single column with the mutation status
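The effect of `min_mutations` can be shown with a small, dependency-free sketch. The function name `to_gene_level`, the dict-based inputs and the placeholder identifiers are hypothetical; the actual method works on dataframes and on the package's own annotations.

```python
from collections import Counter


def to_gene_level(profiles, hgvsg_to_genes, gene_symbols, min_mutations=1):
    """Call a (gene, sample) pair mutated if it has >= min_mutations hits.

    profiles: {(hgvsg, sample_id): bool}
    hgvsg_to_genes: {hgvsg: [gene, ...]}, one variant may map to several genes
    """
    counts = Counter()
    for (hgvsg, sample_id), mutated in profiles.items():
        for gene in hgvsg_to_genes.get(hgvsg, []):
            if gene in gene_symbols:  # keep only the genes of interest
                counts[(gene, sample_id)] += int(mutated)
    return {key: n >= min_mutations for key, n in counts.items()}


profiles = {
    ("g.A", "S1"): True, ("g.B", "S1"): True,
    ("g.A", "S2"): True, ("g.B", "S2"): False,
}
mapping = {"g.A": ["GENE1"], "g.B": ["GENE1", "GENE2"]}
print(to_gene_level(profiles, mapping, ["GENE1"], min_mutations=2))
```

With `min_mutations=2`, sample `S2` has a single hit and is therefore reported as not mutated, which mirrors the two-hit reasoning for tumor suppressor genes.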
append_patient_info(more_info)
Add more columns to the patient information data.
Parameters:
-
more_info
(DataFrame
) –table with additional columns to be added to the patient information. This dataframe must have the
PATIENT_ID
as index.
Returns:
-
None
–nothing
append_sample_info(more_info)
Add more columns to the sample information data.
Parameters:
-
more_info
(DataFrame
) –table with additional columns to be added to the sample information. This dataframe must have the
SAMPLE_ID
as index.
Returns:
-
None
–nothing
get_amino_acid_level_frequencies(gene_symbols, hgvsgs=None, hgvsps=None, cancer_types=None, cancer_types_to_keep=None, cancer_subtypes=None, cancer_subtypes_to_keep=None, sample_ids=None, cancer_subtype_resolution=False, extra_group_columns=None, universe='ensembl', precision=1, panel_coverage_threshold=0.8, impute=False)
Get amino acid level mutation frequencies.
This function returns the amino acid level mutation frequencies of the specified genes across either all samples or across selected samples after filtering based on sample ids or cancer types or cancer subtypes. Counts and frequencies are for patient numbers, not sample numbers. A patient is considered to have a particular mutation if at least one sample of this patient has that mutation.
There may be several different nucleic acid mutations leading to the same protein sequence change. These different nucleic acid variants are aggregated at the amino acid level, that is, a sample is considered mutated for a particular amino acid change if it has any of the nucleic acid changes leading to that amino acid variant, and it is considered wild type if it has none of these mutations.
The number of genes should be small, because otherwise the volume of results is very high and it will take very long for the function to return.
For each mutation detected in at least one sample, it is checked for
each sample whether the mutation was tested by the panel used for this
sample. Samples not tested for the majority of mutations (less than
panel_coverage_threshold
) are not included in the returned counts and
frequencies. For the remaining samples, missing values are imputed or
set to wild type, depending on the impute
argument. See
get_imputed_sample_mutation_profiles
for details.
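The panel coverage filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: `covered_panels` and the dict-of-sets layout are hypothetical, and coverage is taken to be the fraction of the mutations of interest that a panel tests.

```python
def covered_panels(panel_mutations, mutations, panel_coverage_threshold=0.8):
    """Return {panel: coverage} for panels that meet the threshold.

    panel_mutations: {panel_id: set of mutations tested by that panel}
    mutations: all mutations detected in at least one sample
    """
    mutations = set(mutations)
    kept = {}
    for panel, tested in panel_mutations.items():
        coverage = len(mutations & set(tested)) / len(mutations)
        # Panels below the threshold are excluded, together with their samples.
        if coverage >= panel_coverage_threshold:
            kept[panel] = coverage
    return kept


mutations = ["m1", "m2", "m3", "m4", "m5"]
panels = {
    "PANEL_A": {"m1", "m2", "m3", "m4", "m5"},
    "PANEL_B": {"m1", "m2", "m3", "m4"},
    "PANEL_C": {"m1", "m2"},
}
print(covered_panels(panels, mutations))
```

Samples profiled with the remaining panels can still have untested mutations, which is where imputation or the wild-type default comes in.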
If the cancer_types_to_keep
argument is specified, all cancer types
not included in this list are summarized in the other
category.
If the cancer_subtypes_to_keep
argument is specified, all cancer
subtypes not included in this list are summarized in the other
category.
If extra_group_columns
is specified, counts are provided for each
value of these extra columns. One example for this is PRIMARY_RACE
.
The argument extra_group_columns
needs to be provided as a dictionary,
where the dict key is the name of the column, and the dict value is a
list of all values of that column for which an extra row in the count
table will be provided, while other values will be summarized as
other
. For example, if counts shall be returned separately for the
"White", "Black" and "Asian" population and the remaining patients shall
be summarized as "other", the extra_group_columns
argument needs to be
set to {"PRIMARY_RACE": ["White", "Black", "Asian"]}
. If no
aggregation to an other
category is desired, set the dict value to
None
. For example, {"PRIMARY_RACE": None}
would return separate
counts for each race.
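How the `extra_group_columns` dictionary maps annotation values to groups can be sketched in a few lines. The helpers `group_label` and `group_labels` are hypothetical and only model the semantics described above (listed values are kept, everything else becomes `other`, and `None` keeps every value).

```python
def group_label(value, keep=None):
    """Return the value itself if it is kept, otherwise 'other'."""
    if keep is None or value in keep:
        return value
    return "other"


def group_labels(record, extra_group_columns):
    """Apply the grouping to one annotation record (a plain dict here)."""
    return {
        column: group_label(record[column], keep)
        for column, keep in extra_group_columns.items()
    }


spec = {"PRIMARY_RACE": ["White", "Black", "Asian"]}
print(group_labels({"PRIMARY_RACE": "Asian"}, spec))
print(group_labels({"PRIMARY_RACE": "Unknown"}, spec))
print(group_labels({"PRIMARY_RACE": "Unknown"}, {"PRIMARY_RACE": None}))
```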
Parameters:
-
gene_symbols
(list
) –genes to include in the result.
-
hgvsgs
(list
, default:None
) –keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.
-
hgvsps
(list
, default:None
) –keep only these hgvsps, i.e., exclude all other hgvsps for the specified gene symbols. If None, keep all hgvsps.
-
cancer_types
(list
, default:None
) –cancer types to include in the result. All if None.
-
cancer_types_to_keep
(list
, default:None
) –cancer types not to summarize in the other category. No summarization if None.
-
cancer_subtypes
(list
, default:None
) –cancer subtypes to include in the result (Oncotree codes). All if None.
-
cancer_subtypes_to_keep
(list
, default:None
) –cancer subtypes not to summarize in the other category (Oncotree codes). No summarization if None.
-
sample_ids
(list
, default:None
) –Keep only these samples. No additional filtering if None.
-
cancer_subtype_resolution
(bool
, default:False
) –if True, provide frequencies by cancer subtype, otherwise return frequencies summarized by cancer type. If cancer_subtypes_to_keep is specified, this is automatically set to True, otherwise it defaults to False.
-
extra_group_columns
(dict
, default:None
) –by default, counts are returned for each combination of gene symbol, cancer type (CANCER_TYPE), and possibly cancer subtype (ONCOTREE_CODE). If counts should be split by additional factors, such as PRIMARY_RACE, they can be added here.
-
universe
(str
, default:'ensembl'
) –one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.
-
precision
(int
, default:1
) –number of fractional digits of formatted allele frequency percentages.
-
panel_coverage_threshold
(float
, default:0.8
) –Exclude panels and samples profiled with these panels if the fraction of mutations tested by these panels is below that threshold. If this is set to 1, only panels that test all mutations are included. In that case, no imputation is needed. See
get_imputed_sample_mutation_profiles
for details.
-
impute
(bool
, default:False
) –Whether to impute missing values (mutations not tested). If this is set to False (the default), then missing values are replaced with wild type. See
get_imputed_sample_mutation_profiles
for details.
Returns:
-
DataFrame
–Amino acid level counts and frequencies.
get_annotated_unique_mutations(gene_symbols=None, universe='ensembl')
Get annotated mutations for all or selected genes.
This function returns a dataframe with all unique mutations found in at least one sample in Genie for the specified genes. If no genes are specified, return annotations for all genes.
If the universe mane is specified, more than one transcript can be returned for a genomic variant, for example a MANE Select and a MANE Plus clinical variant.
Parameters:
-
gene_symbols
(list
, default:None
) –return mutations for these genes. If not specified, mutations for all genes will be returned.
-
universe
(str
, default:'ensembl'
) –ensembl or mane. Return annotations for Ensembl transcripts or for MANE transcripts.
Returns:
-
DataFrame
–dataframe with annotated unique mutations.
get_cancer_subtype_sample_counts(cancer_types=None, cancer_subtypes_to_keep=None)
Get cancer subtypes and their sample numbers.
Get the number of samples for each cancer subtype included in Genie. This function returns sample numbers and not patient numbers because quite often patients have multiple samples of the same cancer type but slightly different cancer subtype annotations.
get_cancer_subtypes(cancer_types=None)
Get cancer subtypes for all or selected cancer types.
Parameters:
-
cancer_types
(list
, default:None
) –optional list of cancer types for which to return cancer subtypes. If not specified, all cancer types are returned.
Returns:
-
DataFrame
–cancer subtypes for all or specified cancer types. Cancer types as index of the returned dataframe, Oncotree codes and names of cancer subtypes as columns.
get_cancer_type_patient_counts(cancer_types_to_keep=None)
Get cancer types and their patient numbers.
Get the number of patients for each cancer type included in Genie. If
the optional argument cancer_types_to_keep
is specified, all other
cancer types are summarized as other.
The counts returned are for patients, not samples. Patients can have more than one sample and even more than one cancer type. Each patient is counted only once per cancer type, independent of the number of samples for that patient. However, if a patient has samples of more than one cancer type, the patient will be counted multiple times, once per cancer type. As a consequence, the sum of all counts returned by this function is larger than the total number of patients in Genie.
Parameters:
-
cancer_types_to_keep
(list
, default:None
) –cancer types not to be summarized in the other category. If not specified, all cancer types will be returned.
Returns:
-
Series
–number of patients per cancer type
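The counting rule (once per patient and cancer type, regardless of the number of samples) can be made concrete with a short sketch. The function `patient_counts` and the tuple-based input are hypothetical stand-ins for the sample annotation table.

```python
from collections import defaultdict


def patient_counts(samples):
    """Count distinct patients per cancer type.

    samples: iterable of (patient_id, cancer_type), one entry per sample
    """
    patients = defaultdict(set)
    for patient_id, cancer_type in samples:
        patients[cancer_type].add(patient_id)  # sets ignore repeated samples
    return {cancer_type: len(ids) for cancer_type, ids in patients.items()}


samples = [
    ("P1", "Type A"), ("P1", "Type A"),  # two samples, same cancer type
    ("P1", "Type B"),                    # same patient, second cancer type
    ("P2", "Type B"),
]
counts = patient_counts(samples)
print(counts, sum(counts.values()))
```

Here two patients yield a total count of three, because `P1` contributes to both cancer types.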
get_cancer_types()
Get a list of cancer type names as used in Genie.
Returns:
-
list
–all cancer types used in Genie.
get_gene_level_frequencies(gene_symbols, hgvsgs=None, cancer_types=None, cancer_types_to_keep=None, cancer_subtypes=None, cancer_subtypes_to_keep=None, sample_ids=None, cancer_subtype_resolution=False, extra_group_columns=None, universe='ensembl', min_mutations=1, precision=1, panel_coverage_threshold=0.8, impute=False)
Get mutation frequencies of selected genes across samples.
This function returns the gene level mutation frequencies of the specified genes across either all samples or across selected samples after filtering by sample ids or cancer types or cancer subtypes. Counts and frequencies are for patient numbers, not sample numbers. A patient is considered to have a particular mutation if at least one sample of this patient has that mutation.
The number of genes should be small, because otherwise the volume of results is very high and it will take very long for the function to return.
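The step from sample-level calls to patient-level counts can be sketched as below. The helper `patient_frequency` and its inputs are hypothetical, and using `precision` as the number of digits passed to `round` is an assumption about how the percentage is formatted.

```python
def patient_frequency(sample_status, sample_to_patient, precision=1):
    """Return (mutated patients, all patients, percentage).

    sample_status: {sample_id: bool}, gene-level call per sample
    sample_to_patient: {sample_id: patient_id}
    """
    patient_status = {}
    for sample_id, mutated in sample_status.items():
        patient = sample_to_patient[sample_id]
        # A patient is mutated if at least one of their samples is mutated.
        patient_status[patient] = patient_status.get(patient, False) or mutated
    n_mutated = sum(patient_status.values())
    n_patients = len(patient_status)
    return n_mutated, n_patients, round(100 * n_mutated / n_patients, precision)


status = {"S1": False, "S2": True, "S3": False, "S4": False}
patients = {"S1": "P1", "S2": "P1", "S3": "P2", "S4": "P3"}
print(patient_frequency(status, patients))
```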
For each mutation detected in at least one sample, it is checked for
each sample whether the mutation was tested by the panel used for this
sample. Samples not tested for the majority of mutations (less than
panel_coverage_threshold
) are not included in the returned counts and
frequencies. For the remaining samples, missing values are imputed or
set to wild type, depending on the impute
argument. See
get_imputed_sample_mutation_profiles
for details. Imputation is very
time consuming, and for large enough panel_coverage_threshold
s the
differences in the results are marginal, which is why imputation is
switched off by default.
If the cancer_types_to_keep
argument is specified, all cancer types
not included in this list are summarized in the other
category.
If the cancer_subtypes_to_keep
argument is specified, all cancer
subtypes not included in this list are summarized in the other
category.
If extra_group_columns
is specified, counts are provided for each
value of these extra columns from sample or patient annotations. One
example for this is PRIMARY_RACE
, another one is SEX
. The argument
extra_group_columns
needs to be provided as a dictionary, where the
dict key is the name of the column, and the dict value is a list of all
values of that column for which an extra row in the count table will be
provided, while other values will be summarized as other
. For example,
if counts shall be returned separately for the "White", "Black" and
"Asian" population and the remaining patients shall be summarized as
"other", the extra_group_columns
argument needs to be set to
{"PRIMARY_RACE": ["White", "Black", "Asian"]}
. If no aggregation to an
other
category is desired, set the dict value to None
. For example,
{"PRIMARY_RACE": None}
would return separate counts for each race.
Parameters:
-
gene_symbols
(list
) –genes to include in the result.
-
hgvsgs
(list
, default:None
) –keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.
-
cancer_types
(list
, default:None
) –cancer types to include in the result. All if None.
-
cancer_types_to_keep
(list
, default:None
) –cancer types not to summarize in the other category. No summarization if None.
-
cancer_subtypes
(list
, default:None
) –cancer subtypes to include in the result (Oncotree codes). All if None.
-
cancer_subtypes_to_keep
(list
, default:None
) –cancer subtypes not to summarize in the other category (Oncotree codes). No summarization if None.
-
sample_ids
(list
, default:None
) –Keep only these samples. No additional filtering if None.
-
cancer_subtype_resolution
(bool
, default:False
) –if True, provide frequencies by cancer subtype, otherwise return frequencies summarized by cancer type. If cancer_subtypes_to_keep is specified, this is automatically set to True, otherwise it defaults to False.
-
extra_group_columns
(dict
, default:None
) –by default, counts are returned for each combination of gene symbol, cancer type (CANCER_TYPE), and possibly cancer subtype (ONCOTREE_CODE). If counts should be split by additional factors, such as PRIMARY_RACE, they can be added here.
-
universe
(str
, default:'ensembl'
) –one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.
-
min_mutations
(int
, default:1
) –Call a gene mutated if it has at least that many mutations. Should be 1 for oncogenes and may be set to 2 for tumor suppressor genes.
-
precision
(int
, default:1
) –number of fractional digits of formatted allele frequency percentages.
-
panel_coverage_threshold
(float
, default:0.8
) –Exclude panels and samples profiled with these panels if the fraction of mutations tested by these panels is below that threshold. If this is set to 1, only panels that test all mutations are included. In that case, no imputation is needed. See
get_imputed_sample_mutation_profiles
for details.
-
impute
(bool
, default:False
) –Whether to impute missing values (mutations not tested). If this is set to False (the default), then missing values are replaced with wild type. See
get_imputed_sample_mutation_profiles
for details.
Returns:
-
DataFrame
–gene level counts and frequencies.
get_imputed_sample_mutation_profiles(sample_mutation_profiles, panel_coverage_threshold=0.8, impute=False)
Create a sample vs mutation matrix with imputation.
This function accepts sample mutation profiles as obtained by the
function get_sample_mutation_profiles
. The dataframe returned by
get_sample_mutation_profiles
contains only values (True or False for
MUT or WT) for those sample-mutation combinations that were actually
tested by the panel used for a sample. For gene level aggregation, a
full sample-vs-mutation matrix without missing values is needed.
If impute
is set to True, then this function calculates such a matrix
using MICE (Multiple Imputation by Chained Equations) for imputing
values. MUT is encoded as 1, WT as 0, and the imputation will result in
a fractional number between 0 and 1 for each missing value. Random
numbers (uniform distribution between 0 and 1) are then used to assign
“MUT” or “WT” depending on the value of the random variable and on the
imputed value from MICE. To be more precise, the originally missing
value gets a “MUT” if the random number is larger than the imputed value
from MICE, and “WT” otherwise. This way we get a full
mutation-versus-sample matrix without missing values.
Imputation is a time consuming process. For example, imputation for KRAS
mutations in NSCLC and CRC with the default panel coverage threshold
takes about 15 minutes. If impute
is False, all missing values are
replaced with "WT". For a large panel_coverage_threshold
, imputation
changes frequencies only marginally. Therefore, imputation is switched
off by default.
Please be careful when calling this function - the memory requirements for an all genes versus all samples matrix would most likely exceed what is available in the compute environment. Therefore, always work with subsets of genes and maybe also indications.
Parameters:
-
sample_mutation_profiles
(DataFrame
) –the mutation profiles as obtained by
get_sample_mutation_profiles
.
-
panel_coverage_threshold
(float
, default:0.8
) –Exclude panels and samples profiled with these panels if the fraction of mutations tested by these panels is below that threshold. If this is set to 1, only panels that test all mutations are included. In that case, no imputation is needed.
-
impute
(bool
, default:False
) –Whether to impute missing values (mutations not tested). If this is set to False (the default), then missing values are replaced with wild type. Imputation takes very long (maybe hours), and for large
panel_coverage_threshold
s, frequencies don't change much, which is why imputation is switched off by default.
Returns:
-
DataFrame
–Sample-mutation profiles with no missing values.
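For the default case (`impute=False`), completing the matrix amounts to treating every untested sample-mutation pair as wild type. The sketch below shows only that step, on plain dicts; `fill_missing_as_wild_type` is a hypothetical name, and neither the coverage filter nor the MICE-based imputation is modelled.

```python
def fill_missing_as_wild_type(profiles):
    """Return a complete mutation-by-sample mapping without missing values.

    profiles: {(hgvsg, sample_id): bool}, tested pairs only
    """
    mutations = sorted({hgvsg for hgvsg, _ in profiles})
    samples = sorted({sample_id for _, sample_id in profiles})
    # Untested pairs are absent from `profiles`; they default to False (WT).
    return {
        (hgvsg, sample_id): profiles.get((hgvsg, sample_id), False)
        for hgvsg in mutations
        for sample_id in samples
    }


tested = {("g.A", "S1"): True, ("g.B", "S1"): False, ("g.A", "S2"): False}
print(fill_missing_as_wild_type(tested))
```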
get_nucleic_acid_level_frequencies(gene_symbols, hgvsgs=None, cancer_types=None, cancer_types_to_keep=None, cancer_subtypes=None, cancer_subtypes_to_keep=None, sample_ids=None, cancer_subtype_resolution=False, extra_group_columns=None, universe='ensembl', precision=1)
Get nucleic acid level mutation frequencies.
This function returns the mutation frequencies of the specified genes across either all samples or across selected samples after filtering by sample ids or cancer types or cancer subtypes. Counts and frequencies are for patient numbers, not sample numbers. A patient is considered to have a particular mutation if at least one sample of this patient has that mutation.
The number of genes should be small, because otherwise the volume of results is very high and it will take very long for the function to return.
For each mutation detected in at least one sample, it is checked for each sample whether the mutation was tested by the panel used for this sample. Samples not tested for a mutation are not included in the returned counts and frequencies.
If the cancer_types_to_keep
argument is specified, all cancer types
not included in this list are summarized in the other
category.
If the cancer_subtypes_to_keep
argument is specified, all cancer
subtypes not included in this list are summarized in the other
category.
If extra_group_columns
is specified, counts are provided for each
value of these extra columns. One example for this is PRIMARY_RACE
.
The argument extra_group_columns
needs to be provided as a dictionary,
where the dict key is the name of the column, and the dict value is a
list of all values of that column for which an extra row in the count
table will be provided, while other values will be summarized as
other
. For example, if counts shall be returned separately for the
"White", "Black" and "Asian" population and the remaining patients shall
be summarized as "other", the extra_group_columns
argument needs to
be set to {"PRIMARY_RACE": ["White", "Black", "Asian"]}
. If no
aggregation to an other
category is desired, set the dict value to
None
. For example, {"PRIMARY_RACE": None}
would return separate
counts for each race.
Parameters:
-
gene_symbols
(list
) –genes to include in the result.
-
hgvsgs
(list
, default:None
) –keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.
-
cancer_types
(list
, default:None
) –cancer types to include in the result. All if None.
-
cancer_types_to_keep
(list
, default:None
) –cancer types not to summarize in the other category. No summarization if None.
-
cancer_subtypes
(list
, default:None
) –cancer subtypes to include in the result (Oncotree codes). All if None.
-
cancer_subtypes_to_keep
(list
, default:None
) –cancer subtypes not to summarize in the other category (Oncotree codes). No summarization if None.
-
sample_ids
(list
, default:None
) –Keep only these samples. No additional filtering if None.
-
cancer_subtype_resolution
(bool
, default:False
) –if True, provide frequencies by cancer subtype, otherwise return frequencies summarized by cancer type. If cancer_subtypes_to_keep is specified, this is automatically set to True, otherwise it defaults to False.
-
extra_group_columns
(dict
, default:None
) –by default, counts are returned for each combination of hgvsg, cancer type (CANCER_TYPE), and possibly cancer subtype (ONCOTREE_CODE). If counts should be split by additional factors, such as PRIMARY_RACE, they can be added here.
-
universe
(str
, default:'ensembl'
) –one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.
-
precision
(int
, default:1
) –number of fractional digits of formatted allele frequency percentages.
get_sample_mutation_profiles(gene_symbols, hgvsgs=None, cancer_types=None, cancer_subtypes=None, sample_ids=None, universe='ensembl')
Get a mutation profile of selected genes across samples.
This function returns the mutation profile of the specified genes across either all samples or across selected samples, based on sample ids or cancer types or cancer subtypes. If more than one sample selection criterion is specified, then the intersection of the specified criteria is used to determine the final set of samples. The number of genes should be small, because otherwise the volume of results is very high and it will take very long for the function to return.
For each mutation detected in at least one sample, it is checked for
each sample whether the mutation was tested by the panel used for this
sample. If this is the case, mutated
is returned as "False", unless
the mutation was in fact detected for this sample, in which case "True"
is returned. Samples that have not been tested for a mutation are not
included in the result.
Parameters:
-
gene_symbols
(list
) –genes to include in the result.
-
hgvsgs
(list
, default:None
) –keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.
-
cancer_types
(list
, default:None
) –cancer types to include in the result.
-
cancer_subtypes
(list
, default:None
) –cancer subtypes to include in the result (Oncotree codes).
-
sample_ids
(list
, default:None
) –sample identifiers to include in the result.
-
universe
(str
, default:'ensembl'
) –one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.
Returns:
-
DataFrame
–Mutation status (MUT or WT) for all tested mutations.
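The tested/untested logic can be illustrated with plain Python containers. This is a sketch under assumptions, not the package's code: the function name `sample_mutation_profiles` and the three input structures are hypothetical simplifications of the panel, sample and mutation tables.

```python
def sample_mutation_profiles(detected, sample_panel, panel_mutations):
    """Return {(hgvsg, sample_id): bool} for tested pairs only.

    detected: set of (hgvsg, sample_id) pairs where the mutation was found
    sample_panel: {sample_id: panel_id}
    panel_mutations: {panel_id: set of hgvsgs covered by the panel}
    """
    all_mutations = {hgvsg for hgvsg, _ in detected}
    profiles = {}
    for sample_id, panel in sample_panel.items():
        for hgvsg in all_mutations:
            if hgvsg in panel_mutations[panel]:
                # Tested: True if detected in this sample, otherwise False.
                profiles[(hgvsg, sample_id)] = (hgvsg, sample_id) in detected
            # Not tested: the pair is left out of the result.
    return profiles


detected = {("g.A", "S1")}
sample_panel = {"S1": "PANEL_A", "S2": "PANEL_A", "S3": "PANEL_B"}
panel_mutations = {"PANEL_A": {"g.A"}, "PANEL_B": set()}
print(sample_mutation_profiles(detected, sample_panel, panel_mutations))
```

Sample `S3` does not appear in the output because its panel does not cover the mutation.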
get_tmb(sample_ids=None, min_genomic_range=10000, tmb_intermediate_threshold=6, tmb_high_threshold=20)
Get the tumor mutational burden (TMB) for all specified samples.
The TMB is calculated by dividing the number of mutations reported for a
sample by the total genomic range covered by the panel that is used to
profile that sample. If the total genomic range of a panel is smaller
than min_genomic_range
, NA is returned.
Parameters:
-
sample_ids
(list
, default:None
) –samples for which the TMB is to be returned. If None (the default), TMB is returned for all GENIE samples.
-
min_genomic_range
(int
, default:10000
) –if the total genomic range covered by a panel is smaller than this value, TMB is returned as NA for all samples tested with such a panel.
-
tmb_intermediate_threshold
(int
, default:6
) –samples are classified as "TMB intermediate" if the TMB is >= this threshold but smaller than
tmb_high_threshold
. If it is smaller than this threshold, it is classified as "TMB low".
-
tmb_high_threshold
(int
, default:20
) –samples are classified as "TMB high" if the TMB is >= this threshold.
Returns:
-
DataFrame
–Table with the mutation count, size of genomic range covered, TMB, and TMB_class for each sample, with SAMPLE_ID as index.
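The calculation and classification can be sketched as follows. The description does not state the unit of the genomic range or of the TMB; the sketch assumes a range in base pairs and a TMB in mutations per megabase, which is an assumption, not something the documentation confirms. The function names `tmb_per_mb` and `tmb_class` are hypothetical.

```python
import math


def tmb_per_mb(mutation_count, genomic_range_bp, min_genomic_range=10000):
    """Mutations per megabase (assumed unit), NaN for panels that are too small."""
    if genomic_range_bp < min_genomic_range:
        return math.nan
    return mutation_count / (genomic_range_bp / 1_000_000)


def tmb_class(tmb, tmb_intermediate_threshold=6, tmb_high_threshold=20):
    """Classify a TMB value using the thresholds described above."""
    if math.isnan(tmb):
        return None  # no class when the TMB itself is not available
    if tmb >= tmb_high_threshold:
        return "TMB high"
    if tmb >= tmb_intermediate_threshold:
        return "TMB intermediate"
    return "TMB low"


for count, size in [(3, 1_000_000), (12, 1_000_000), (50, 2_000_000), (3, 5_000)]:
    value = tmb_per_mb(count, size)
    print(count, size, value, tmb_class(value))
```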
summary()
Get a summary of data in this GENIE release.
The summary includes the number of panels, number of patients, number of samples, number of genes tested by at least one panel, and the number of genes with copy number alterations identified at least once.
Merger
Deep merging of dictionary hierarchies.
This class can be used to merge two different dictionary trees. This is useful, for example, to load a default configuration from a JSON file and merge it with a user configuration file where the defaults are kept for all values not overwritten by the user configuration file.
deep_merge(dict1, dict2)
staticmethod
Deep merging of two hierarchical dictionary trees.
The dict1
dictionary is updated with any information from dict2
.
The value of a dictionary item for a particular key can be another
dictionary, and a flat merge would simply replace the dictionary item
for that key in dict1
with the dictionary item from dict2
.
Such a flat merge would therefore lose the information for that
dictionary item from dict1
. A deep merge does not drop the value of a
dictionary item in dict1
and replace it with the dictionary item from
dict2
with the same key, but instead updates the dictionary item in
dict1
with the information from dict2
, changing or adding only those
keys that are part of dict2
.
Parameters:
-
dict1
(dict
) –the dictionary to be updated.
-
dict2
(dict
) –the dictionary with the update information.
Returns:
-
dict
–The updated dictionary (
dict1
updated with dict2
).
Examples:
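The difference between a flat and a deep merge can be illustrated with a small standalone sketch. The function `deep_merge_sketch` below is a hypothetical re-implementation of the behaviour described above, not the code of `Merger.deep_merge`, and the configuration keys are invented for illustration.

```python
import copy


def deep_merge_sketch(dict1, dict2):
    """Update dict1 in place with dict2, descending into nested dictionaries."""
    for key, value in dict2.items():
        if isinstance(value, dict) and isinstance(dict1.get(key), dict):
            # Both sides hold a dictionary: merge them instead of replacing
            deep_merge_sketch(dict1[key], value)
        else:
            # Scalar value or new key: take the value from dict2
            dict1[key] = value
    return dict1


# Hypothetical default and user configurations
defaults = {"dir": {"data": "data", "cache": "cache"}, "verbose": False}
user = {"dir": {"cache": "/tmp/genie_cache"}}

# A flat merge replaces the whole "dir" entry and loses the "data" key
flat = {**defaults, **user}

# A deep merge keeps "data" and only overwrites "cache"
deep = deep_merge_sketch(copy.deepcopy(defaults), user)
```

The defaults are copied before merging because the sketch, like the method described above, updates its first argument in place.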
Meta(config, key)
Bases: dict
This class is a dictionary initialized with data from a meta file.
GENIE comes with several files with meta information about the actual data files. This class is a dictionary that is initialized with key-value pairs from such a meta information file.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
-
key
(str
) –the file name key from the config file.
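As an illustration of a dictionary initialized from key-value pairs, the sketch below parses lines of the form `key: value`. It is not the package's `Meta` class: the name `MetaSketch`, the constructor taking a list of lines instead of `config` and `key`, the `key: value` line format, and the example entries are all assumptions.

```python
class MetaSketch(dict):
    """Dictionary filled from 'key: value' lines (hypothetical format)."""

    def __init__(self, lines):
        super().__init__()
        for line in lines:
            # Split at the first colon only, so values may contain colons
            key, sep, value = line.partition(":")
            if sep:
                self[key.strip()] = value.strip()


# Invented example content of a meta file
lines = [
    "genetic_alteration_type: MUTATION_EXTENDED",
    "data_filename: data_mutations_extended.txt",
    "",
]
meta = MetaSketch(lines)
```

Because the class derives from `dict`, all the usual dictionary operations such as `meta["data_filename"]` or `meta.get(...)` are available.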
Mutations(config)
Mutation data for all samples included in GENIE.
Objects of this class contain all SNVs and indels found in any sample in the GENIE database.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
Attributes:
-
data
(DataFrame
) –mutation data for all samples.
-
annot
(dict
) –annotations of mutations derived from re-annotation by ICA. The dictionary has two keys - "ensembl" and "mane". annot["ensembl"] contains annotations for all Ensembl transcripts in GENIE (GENIE is based on Ensembl transcripts). annot["mane"] contains annotations for all MANE transcripts mapping to the mutations in GENIE.
get_detected_mutations(gene_symbols, universe='ensembl')
Get detected mutations for a list of genes.
This function returns mutations that are found for the specified genes in all samples included in GENIE. If a mutation is not included in the returned values for a particular sample, this does not mean that the gene is wild type for this sample, because the mutation may not be on the panel used for that sample.
Parameters:
-
gene_symbols
(list
) –list of HGNC gene symbols.
-
universe
(str
, default:'ensembl'
) –one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.
Returns:
-
DataFrame
–all mutations of these genes found in GENIE samples.
get_mutation_annotations(hgvsgs, universe='ensembl')
Get annotations for a list of mutations.
Get ICA re-annotations for a list of mutations specified by HGVSG. The returned dataframe can include more than one row per unique genomic variant. For example, a locus can be a downstream_gene_variant for one gene and a 3_prime_UTR_variant for another gene. Furthermore, if the mane universe is specified, there can be a MANE Select and one or more MANE Plus Clinical transcript variants covering the genomic location, and there can be more than one gene covering that location. There can be up to 15 different MANE transcripts for a genomic locus. Which of these transcripts should be included in downstream analyses depends on the use case. Therefore, the user of this function needs to add appropriate filtering to the returned annotations.
Parameters:
-
hgvsgs
(list
) –list of HGVSG specifications of mutations.
-
universe
(str
, default:'ensembl'
) –one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.
Returns:
-
DataFrame
–Annotations for specified mutations.
get_sample_mutation_profiles(gene_symbols, sample_ids, sample_info, tested_mutations, universe='ensembl', hgvsgs=None)
Get mutation profiles of genes for all tested samples.
While the function get_detected_mutations
returns
mutation-sample-pairs only for those samples where a mutation was
actually detected, this function adds all samples where a mutation was
also tested but not found. The returned dataframe contains a column
mutated
that is either True or False.
Parameters:
-
gene_symbols
(list
) –genes to include in the result
-
sample_ids
(list
) –sample identifiers of samples to include
-
sample_info
(SampleInfo
) –annotation of all samples
-
tested_mutations
(TestedMutations
) –cache providing hgvsg vs panel matrix
-
universe
(str
, default:'ensembl'
) –one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.
-
hgvsgs
(list
, default:None
) –keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.
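The idea of combining detected mutations with the tested/not-tested information can be sketched with plain Python data structures. This is a simplified illustration under stated assumptions, not the package's implementation: the function name, the input structures, the panel names, and the mutation identifier are hypothetical. Only the `mutated` column name is taken from the description above.

```python
def sample_mutation_profiles_sketch(sample_panel, is_tested, detected):
    """Build one row per sample-mutation pair that was actually tested.

    sample_panel: dict mapping sample id to panel id
    is_tested:    dict mapping hgvsg to {panel id: bool}
    detected:     set of (sample id, hgvsg) pairs with a detected mutation
    """
    rows = []
    for sample_id, panel_id in sample_panel.items():
        for hgvsg, tested_by in is_tested.items():
            if not tested_by.get(panel_id, False):
                continue  # not tested: neither mutated nor wild type is known
            rows.append(
                {
                    "SAMPLE_ID": sample_id,
                    "hgvsg": hgvsg,
                    "mutated": (sample_id, hgvsg) in detected,
                }
            )
    return rows


# Invented example data
sample_panel = {"S1": "PANEL_A", "S2": "PANEL_A", "S3": "PANEL_B"}
is_tested = {"mut_1": {"PANEL_A": True, "PANEL_B": False}}
detected = {("S1", "mut_1")}

profiles = sample_mutation_profiles_sketch(sample_panel, is_tested, detected)
```

Sample `S3` does not appear in the result because its panel does not cover the mutation, so no statement about its mutation status can be made.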
get_unique_mutations(gene_symbols=None, universe='ensembl')
Get a unique list of mutations found in GENIE for a list of genes.
For the ensembl universe, all variants affecting Ensembl transcripts that were included in the original GENIE mutation data file are returned. For the mane universe, only variants affecting MANE transcripts are returned.
Parameters:
-
gene_symbols
(list
, default:None
) –list of genes to query.
-
universe
(str
, default:'ensembl'
) –one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.
Returns:
-
list
–hgvsg for all unique mutations for the specified genes.
hgvsp3_to_hgvsp1(hgvsp)
Translate 3-letter amino acid codes to 1-letter amino acid codes.
Parameters:
-
hgvsp
(str
) –amino acid change with 3-letter amino acid codes
Returns:
-
str
–hgvsp_short with 1-letter amino acid codes
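A translation of this kind can be sketched with a lookup table and a regular expression. The function `three_to_one_sketch` below is a hypothetical standalone illustration, not the package's `hgvsp3_to_hgvsp1`; in particular, the handling of the `p.` prefix and of the `Ter` stop codon symbol are assumptions.

```python
import re

# Standard 3-letter to 1-letter amino acid codes, plus Ter for a stop codon
AA3_TO_AA1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}


def three_to_one_sketch(hgvsp):
    """Replace every known 3-letter amino acid code with its 1-letter code."""
    return re.sub(
        r"[A-Z][a-z]{2}",
        # Leave unknown capitalized triplets unchanged
        lambda match: AA3_TO_AA1.get(match.group(0), match.group(0)),
        hgvsp,
    )


short = three_to_one_sketch("p.Val600Glu")
```

Lowercase keywords such as `del` or `fs` are not matched by the pattern, because it requires an uppercase first letter.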
Panel(config, global_tested_positions, global_panel_info, panel_id)
This class represents a panel (assay) included in GENIE.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
-
global_tested_positions
(Type[TestedPositions]
) –tested positions for all panels.
-
global_panel_info
(DataFrame
) –assay information for all panels.
-
panel_id
(str
) –the panel identifier.
Attributes:
-
id
(str
) –panel identifier.
-
description
(str
) –panel description.
-
genes
(list
) –genes on panel.
-
tested_positions
(TestedPositions
) –tested positions for this panel.
-
panel_info
(dict
) –assay information for this panel.
gene_is_on_panel(gene_symbol)
Check if a gene is included in this panel.
Parameters:
-
gene_symbol
(str
) –HGNC gene symbol.
Returns:
-
bool
–True if gene is on panel, else False.
get_total_range()
Get the overall size of genomic ranges tested by this panel.
The overall size of tested genomic regions can be used to estimate TMB.
Returns:
-
int
–Sum of lengths of all ranges tested by this panel.
position_is_tested(chr, pos)
Check if a genomic position is tested by this panel.
Parameters:
-
chr
(str
) –chromosome.
-
pos
(int
) –position on chromosome.
Returns:
-
bool
–True if position is tested by this panel, else False.
range_is_tested(chr, start_pos, end_pos)
Check if a genomic range is tested by this panel.
Parameters:
-
chr
(str
) –chromosome.
-
start_pos
(int
) –start position of range on chromosome.
-
end_pos
(int
) –end position of range on chromosome.
Returns:
-
bool
–True if entire range is tested by this panel, else False.
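The position and range checks, as well as the total range, can be illustrated with a simple interval model. The sketch below is not the package's implementation: it assumes that a panel's tested regions are given as inclusive `(start, end)` intervals per chromosome, and all function names and example coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def range_is_tested(regions, chrom, start_pos, end_pos):
    """True only if the entire range lies within one merged interval."""
    return any(
        start <= start_pos and end_pos <= end
        for start, end in merge_intervals(regions.get(chrom, []))
    )


def position_is_tested(regions, chrom, pos):
    """A position is a range of length one."""
    return range_is_tested(regions, chrom, pos, pos)


def total_range(regions):
    """Sum of the lengths of all merged intervals (inclusive coordinates)."""
    return sum(
        end - start + 1
        for intervals in regions.values()
        for start, end in merge_intervals(intervals)
    )


# Invented tested regions for one panel
regions = {"7": [(100, 200), (201, 300), (500, 600)], "12": [(10, 19)]}
```

Merging before checking matters for ranges that span two adjacent intervals: in this example, the range 150 to 250 on chromosome 7 is fully tested even though no single input interval contains it.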
tested_positions()
Get all tested positions for this panel.
Returns:
-
DataFrame
–tested positions for this panel.
PanelSet(config)
Set of all panels included in GENIE.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
Attributes:
-
panels
(dict
) –all panels used by GENIE, with the panel identifier as key and
Panel
as value.
-
tested_positions
(TestedPositions
) –tested positions for all panels.
-
tested_mutations
(TestedMutations
) –tested mutations for all panels.
-
panel_info
(DataFrame
) –assay information for all panels.
panels_for_mutation(hgvsg)
Get a list of all panels testing a particular mutation.
This is done using the precomputed
Parameters:
-
hgvsg
(str
) –mutation to be checked.
Returns:
-
dict
–dictionary with panel ids as key and
Panel
objects as values containing all panels testing this mutation.
panels_for_position(chr, pos)
Get a list of all panels testing a genomic location.
Parameters:
-
chr
(str
) –chromosome.
-
pos
(int
) –position on chromosome.
Returns:
-
list
–List of panels probing this location.
panels_for_range(chr, start_pos, end_pos)
Get a list of all panels testing a genomic range.
Parameters:
-
chr
(str
) –chromosome.
-
start_pos
(int
) –start position of range.
-
end_pos
(int
) –end position of range.
Returns:
-
list
–List of panels probing this range.
PatientInfo(config)
Patient information for all subjects included in GENIE.
Patient information includes sex, primary race, ethnicity, clinical center, contact id, dod(?) id, year of contact, dead or alive, year of death.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
Attributes:
-
data
(DataFrame
) –patient information for all subjects.
append_info(more_info)
Add more columns to the patient information data.
Parameters:
-
more_info
(DataFrame
) –table with additional columns to be added to the patient information. This dataframe must have the
PATIENT_ID
as index.
Returns:
-
None
–nothing
get_info_for_patient(patient_id)
Get patient information for a single patient.
Parameters:
-
patient_id
(str
) –patient identifier.
Returns:
-
dict
–patient information.
SampleInfo(config)
Sample information for all samples included in GENIE.
Sample information includes patient id, age at sequencing, Oncotree code, sample type, sequencing assay id, cancer type, cancer type detailed, sample type detailed.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
Attributes:
-
data
(DataFrame
) –sample information for all samples.
append_info(more_info)
Add more columns to the sample information data.
Parameters:
-
more_info
(DataFrame
) –table with additional columns to be added to the sample information. This dataframe must have the
SAMPLE_ID
as index.
Returns:
-
None
–nothing
get_info_for_sample(sample_id)
Get sample information for a single sample.
Parameters:
-
sample_id
(str
) –sample identifier.
Returns:
-
dict
–sample information.
get_sample_ids()
Get all sample identifiers.
Returns:
-
list
–sample identifiers
TestedMutations(config)
This class knows for each panel-mutation pair if the mutation was tested.
Checking if a particular mutation was tested by a panel could be done with
the TestedPositions
class. However, for thousands of mutations and
hundreds of panels this takes quite long (about a day for all mutations
detected in GENIE samples and for all panels). Therefore, in a data
preparation step done once for each new GENIE release, all mutations found
in any sample are checked against all panels and a panel-mutation matrix is
created where matrix elements are True if a mutation is tested by a panel,
False otherwise.
The TestedMutations
class uses this precomputed cache file. Loading it
takes only about 0.5 seconds, compared with about 1 day for computing the
matrix on the fly.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
Attributes:
-
parquet_file
(str
) –name of parquet cache file.
is_tested_matrix(hgvsgs=None, panels=None)
Get table of mutations versus panels telling if mutation is tested.
Get a table with mutations (HGVSGs) as rows and panels as columns where table cells are True or False depending on whether a particular mutation was tested by a particular panel.
Parameters:
-
hgvsgs
(list
, default:None
) –return table for subset of these mutations (or all mutations if None).
-
panels
(list
, default:None
) –return table for subset of these panels (or all panels if None). This can be provided as a list of panel identifiers or a list of Panel objects.
Returns:
-
DataFrame
–table indicating which mutation was tested by which panel.
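The subsetting behaviour can be sketched with a nested dictionary standing in for the DataFrame. This is a simplified illustration, not the package's implementation: the function names, the dictionary representation, and the mutation and panel identifiers are hypothetical.

```python
def subset_matrix_sketch(matrix, hgvsgs=None, panels=None):
    """Return the rows and columns of a mutation-by-panel matrix.

    matrix: dict mapping hgvsg to {panel id: bool}
    None for hgvsgs or panels means that all rows or columns are kept.
    """
    selected = matrix if hgvsgs is None else {
        hgvsg: matrix[hgvsg] for hgvsg in hgvsgs if hgvsg in matrix
    }
    return {
        hgvsg: {
            panel: tested
            for panel, tested in row.items()
            if panels is None or panel in panels
        }
        for hgvsg, row in selected.items()
    }


def panels_testing_sketch(matrix, hgvsg):
    """List the panels whose matrix cell is True for one mutation."""
    return [panel for panel, tested in matrix.get(hgvsg, {}).items() if tested]


# Invented matrix with two mutations and two panels
matrix = {
    "mut_A": {"P1": True, "P2": False},
    "mut_B": {"P1": True, "P2": True},
}
subset = subset_matrix_sketch(matrix, hgvsgs=["mut_B"], panels=["P2"])
```

Reading a cell from a precomputed matrix like this is a constant-time lookup, which is what makes the cache so much faster than checking positions panel by panel.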
TestedPositions(config)
This class holds all tested positions for all panels.
Parameters:
-
config
(Type[Configuration]
) –the GENIE source data configuration.
Attributes:
-
data
(DataFrame
) –tested positions for all panels.
positions_for_panel(panel_id)
Get all tested positions for a particular panel.
Parameters:
-
panel_id
(str
) –the panel identifier.
Returns:
-
DataFrame
–tested positions for specified panel.