API documentation

CNA(config)

Copy number alteration data for all samples included in GENIE.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

Attributes:

  • data (DataFrame) –

    copy number data for all samples. HGNC gene symbol is row index, sample id is column index. If CNA information is missing for a gene-sample pair, the value is NaN.

get_cna(sample_id, gene_symbol)

Get the copy number of a single gene in a single sample.

Parameters:

  • sample_id (str) –

    sample identifier.

  • gene_symbol (str) –

    HGNC gene symbol.

Returns:

  • float

    copy number (NaN if not measured).

Configuration(genie_dir, config_file=None)

Genie source data configuration.

The default configuration is read from a default JSON config file which is part of the package. If a user config file is specified, it is used to update the information from the default config file.

Parameters:

  • genie_dir (str) –

    the base directory name of the GENIE data files.

  • config_file (str, default: None ) –

    optional user config file (JSON).

Examples:

>>> import os
>>> import genie
>>> home = os.getenv("HOME")
>>> genie_dir = os.path.join(home, "genie_data")
>>> user_config_file = os.path.join(home, ".genie_config.json")
>>> # Without a user config file (use only defaults)
>>> config = genie.Configuration(genie_dir)
>>> # With a user config file, overwriting some defaults
>>> config = genie.Configuration(genie_dir, user_config_file)

get_aux_file_name(key)

Get the file name of an auxiliary data file.

Parameters:

  • key (str) –

    auxiliary data file identifier used in the config file.

Returns:

  • str

    the file name for this auxiliary data file.

get_cache_file_name(file_name)

Get the name of the Parquet cache file.

The name of the cache file is created from the original file name by stripping the extensions and appending the suffix '.parquet'.

Parameters:

  • file_name (str) –

    the name of the data file

Returns:

  • str

    the name of the cache file
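The naming rule described above can be sketched as follows. This is not the package's own code: the helper name cache_file_name_sketch and the file names are illustrative, and the sketch assumes that every extension is stripped, not only the last one.

```python
from pathlib import Path


def cache_file_name_sketch(file_name):
    """Strip all extensions from file_name and append '.parquet'."""
    path = Path(file_name)
    stem = path.name
    # Assumption: every extension is removed, not only the last one.
    for _ in path.suffixes:
        stem = Path(stem).stem
    return str(path.with_name(stem + ".parquet"))


single_extension = cache_file_name_sketch("data_CNA.txt")
double_extension = cache_file_name_sketch("data.seg.gz")
print(single_extension)
print(double_extension)
```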

get_case_list_file_names()

Get all files with case lists.

Returns:

  • dict

    dictionary with disease names as keys and file names as values.

get_config()

Get the complete configuration data.

Returns:

  • dict

    complete configuration as dictionary.

get_data_file_name(key)

Get the file name of a GENIE data file.

Parameters:

  • key (str) –

    data file identifier used in the config file.

Returns:

  • str

    the file name for this data file.

get_dir(key)

Get a directory name from the configuration.

Parameters:

  • key (str) –

    directory identifier used in the config file.

Returns:

  • str

    directory name.

get_gene_panel_file_names()

Get all files with gene panel descriptions.

Returns:

  • dict

    dictionary with panel IDs as keys and file names as values.

get_genie_dir()

Get the base directory of GENIE data.

Returns:

  • str

    base directory name for GENIE data.

get_genie_version()

Get the version of the GENIE release.

The version is obtained from the study meta file.

Returns:

  • str

    release version of GENIE data.

get_meta_file_name(key)

Get the file name of a GENIE meta data file.

Parameters:

  • key (str) –

    meta data file identifier used in the config file.

Returns:

  • str

    the file name for this meta file.

load_aux_file(key)

Load auxiliary data from a file or file cache.

Loads a dataframe from a file. The dataframe is read from the Parquet cache file if it exists. If the cache file does not exist yet, it is created so that the data loads faster next time.

Parameters:

  • key (str) –

    auxiliary file identifier used in the config file.

Returns:

  • DataFrame

    the data loaded from the file or its cache.

load_data_file(key)

Load GENIE data from a file or file cache.

Loads a dataframe from a file. The dataframe is read from the Parquet cache file if it exists. If the cache file does not exist yet, it is created so that the data loads faster next time.

Parameters:

  • key (str) –

    data file identifier used in the config file.

Returns:

  • DataFrame

    the data loaded from the file or its cache.

load_file(file_name)

Load a dataframe from a file or file cache.

Loads a dataframe from a file. The dataframe is read from the Parquet cache file if it exists. If the cache file does not exist yet, it is created so that the data loads faster next time.

Parameters:

  • file_name (str) –

    the name of the data file.

Returns:

  • DataFrame

    the data loaded from the file or its cache.
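The read-from-cache-or-create-cache pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: a JSON file stands in for the Parquet cache, rows are plain dictionaries instead of a DataFrame, and the name load_table_sketch and the '.cache.json' suffix are hypothetical.

```python
import csv
import json
import os
import tempfile


def load_table_sketch(file_name):
    """Return the rows of a tab-separated file, using a cache file if present."""
    # Assumption: JSON stands in for the Parquet cache used by the package.
    cache_name = os.path.splitext(file_name)[0] + ".cache.json"
    if os.path.exists(cache_name):
        with open(cache_name, encoding="utf-8") as handle:
            return json.load(handle)
    with open(file_name, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    # Write the cache so that the next call can skip parsing.
    with open(cache_name, "w", encoding="utf-8") as handle:
        json.dump(rows, handle)
    return rows


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "table.txt")
    with open(source, "w", encoding="utf-8") as handle:
        handle.write("SAMPLE_ID\tVALUE\nS1\t1\n")
    first = load_table_sketch(source)   # parses the file, writes the cache
    cache_exists = os.path.exists(os.path.join(tmp, "table.cache.json"))
    second = load_table_sketch(source)  # served from the cache
```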

Genie(genie_dir, config_file=None, verbose=True)

This is the main class for this module, holding all GENIE data.

An object of this class holds and provides all data from a Genie release. This includes all panels (assays) and the genomic regions tested by each panel, patient information, sample information, copy number data, mutation data.

Parameters:

  • genie_dir (str) –

    path name of the directory with Genie data files.

  • config_file (str, default: None ) –

    path name of an optional user config file, defining alternative names for GENIE data files if names should change for future versions of GENIE.

Attributes:

  • panel_set (PanelSet) –

    all panels (assays) included in GENIE.

  • patient_info (PatientInfo) –

    clinical information about patients.

  • sample_info (SampleInfo) –

    sample annotations.

  • cna (CNA) –

    copy number data.

  • mutations (Mutations) –

    mutation data.

  • version (str) –

    Genie version number.

aggregate_to_amino_acid_level(sample_mutation_profiles, gene_symbols, universe='ensembl')

Aggregate mutations to amino acid level.

This function aggregates the sample mutation level profiles to amino acid level. Several different nucleic acid changes can lead to the same amino acid change. These are aggregated by this function.

The gene_symbols argument needs to be specified because a mutation can affect multiple genes, so the mapping from hgvsg to gene symbol is ambiguous.

Parameters:

  • sample_mutation_profiles (DataFrame) –

    a single column dataframe with hgvsg and sample_id as row index and the boolean mutation status as column.

  • gene_symbols (list) –

    genes of interest. The hgvsgs from the mutation profiles may map to multiple genes, and only the gene symbols from this list will be kept.

  • universe (str, default: 'ensembl' ) –

    one of ensembl and mane, defines which annotations will be used to create the mapping from hgvsg to gene symbols.

Returns:

  • DataFrame

    dataframe with two-dimensional row index (hgvsp, sample_id) and single column with the mutation status
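The aggregation rule can be sketched in plain Python as follows. This is a simplified illustration, not the package's code: dictionaries stand in for the dataframes, the function name is hypothetical, and the variant identifiers are placeholders rather than real HGVS strings.

```python
def aggregate_to_amino_acid_sketch(profiles, hgvsg_to_hgvsp):
    """Combine nucleic acid level calls into amino acid level calls.

    profiles maps (hgvsg, sample_id) to True (mutated) or False (wild type).
    hgvsg_to_hgvsp maps each hgvsg to the protein change it causes.
    """
    result = {}
    for (hgvsg, sample_id), mutated in profiles.items():
        hgvsp = hgvsg_to_hgvsp.get(hgvsg)
        if hgvsp is None:
            continue  # variant not annotated for the genes of interest
        key = (hgvsp, sample_id)
        # Mutated if any contributing nucleic acid change is present.
        result[key] = result.get(key, False) or mutated
    return result


# Placeholder identifiers: two nucleic acid variants with the same protein change.
mapping = {"g.variant_a": "p.change_1", "g.variant_b": "p.change_1"}
profiles = {
    ("g.variant_a", "S1"): False,
    ("g.variant_b", "S1"): True,
    ("g.variant_a", "S2"): False,
    ("g.variant_b", "S2"): False,
}
amino_acid_level = aggregate_to_amino_acid_sketch(profiles, mapping)
```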

aggregate_to_gene_level(sample_mutation_profiles, gene_symbols, universe='ensembl', min_mutations=1)

Aggregate mutations to gene level.

This function aggregates the sample mutation level profiles to gene level. A sample is called mutated for a gene if it has at least min_mutations mutations in that gene. For oncogenes, min_mutations should be kept at the default value of 1. For tumor suppressor genes, it may be desirable to require at least two hits to call a sample mutated, assuming that the two hits affect the two copies of that gene. Any filtering for the functional relevance of mutations needs to be done prior to calling this function.

The gene_symbols argument needs to be specified because a mutation can affect multiple genes, so the mapping from hgvsg to gene symbol is ambiguous.

Parameters:

  • sample_mutation_profiles (DataFrame) –

    a single column dataframe with hgvsg and sample_id as row index and the boolean mutation status as column.

  • gene_symbols (list) –

    genes of interest. The hgvsgs from the mutation profiles may map to multiple genes, and only the gene symbols from this list will be kept.

  • universe (str, default: 'ensembl' ) –

    one of ensembl and mane, defines which annotations will be used to create the mapping from hgvsg to gene symbols.

  • min_mutations (int, default: 1 ) –

    Call a gene mutated if it has at least that many mutations. Should be 1 for oncogenes and may be set to 2 for tumor suppressor genes.

Returns:

  • DataFrame

    dataframe with two-dimensional row index (gene, sample_id) and single column with the mutation status
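The effect of min_mutations and of the gene_symbols filter can be sketched in plain Python. This is an illustration only: dictionaries stand in for the dataframes, the function name and the variant identifiers are hypothetical, and the gene assignment in the example is made up.

```python
from collections import defaultdict


def aggregate_to_gene_level_sketch(profiles, hgvsg_to_genes, gene_symbols,
                                   min_mutations=1):
    """Call each (gene, sample) pair mutated if it has enough mutations.

    profiles maps (hgvsg, sample_id) to True (mutated) or False (wild type).
    hgvsg_to_genes maps each hgvsg to all genes it affects.
    """
    wanted = set(gene_symbols)
    hits = defaultdict(int)
    for (hgvsg, sample_id), mutated in profiles.items():
        for gene in hgvsg_to_genes.get(hgvsg, ()):
            if gene in wanted:  # drop genes that are not of interest
                hits[(gene, sample_id)] += int(mutated)
    return {key: count >= min_mutations for key, count in hits.items()}


# Placeholder variants; the second one is assumed to overlap two genes.
hgvsg_to_genes = {"g.variant_a": ["GENE1"], "g.variant_b": ["GENE1", "GENE2"]}
profiles = {
    ("g.variant_a", "S1"): True,
    ("g.variant_b", "S1"): True,
    ("g.variant_a", "S2"): True,
    ("g.variant_b", "S2"): False,
}
one_hit = aggregate_to_gene_level_sketch(profiles, hgvsg_to_genes, ["GENE1"])
two_hits = aggregate_to_gene_level_sketch(profiles, hgvsg_to_genes, ["GENE1"],
                                          min_mutations=2)
```

With min_mutations=1 both samples are called mutated for GENE1; with min_mutations=2 only the sample carrying both variants is.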

append_patient_info(more_info)

Add more columns to the patient information data.

Parameters:

  • more_info (DataFrame) –

    table with additional columns to be added to the patient information. This dataframe must have the PATIENT_ID as index.

Returns:

  • None

    nothing

append_sample_info(more_info)

Add more columns to the sample information data.

Parameters:

  • more_info (DataFrame) –

    table with additional columns to be added to the sample information. This dataframe must have the SAMPLE_ID as index.

Returns:

  • None

    nothing

get_amino_acid_level_frequencies(gene_symbols, hgvsgs=None, hgvsps=None, cancer_types=None, cancer_types_to_keep=None, cancer_subtypes=None, cancer_subtypes_to_keep=None, sample_ids=None, cancer_subtype_resolution=False, extra_group_columns=None, universe='ensembl', precision=1, panel_coverage_threshold=0.8, impute=False)

Get amino acid level mutation frequencies.

This function returns the amino acid level mutation frequencies of the specified genes across either all samples or across selected samples after filtering based on sample ids or cancer types or cancer subtypes. Counts and frequencies are for patient numbers, not sample numbers. A patient is considered having a particular mutation if at least one sample of this patient has that mutation.

There may be several different nucleic acid mutations leading to the same protein sequence change. At amino acid level, these different nucleic acid variants are combined into a single amino acid level variant, that is, a sample is considered mutated for a particular amino acid change if it has any of the nucleic acid changes leading to that amino acid variant, and it is considered wild type if it has none of these mutations.

The number of genes should be small, because otherwise the volume of results is very high and it will take very long for the function to return.

For each mutation detected in at least one sample, it is checked for each sample whether the mutation was tested by the panel used for this sample. Samples tested for too few of these mutations (a fraction below panel_coverage_threshold) are not included in the returned counts and frequencies. For the remaining samples, missing values are imputed or set to wild type, depending on the impute argument. See get_imputed_sample_mutation_profiles for details.

If the cancer_types_to_keep argument is specified, all cancer types not included in this list are summarized in the other category.

If the cancer_subtypes_to_keep argument is specified, all cancer subtypes not included in this list are summarized in the other category.

If extra_group_columns is specified, counts are provided for each value of these extra columns. One example for this is PRIMARY_RACE. The argument extra_group_columns needs to be provided as a dictionary, where the dict key is the name of the column, and the dict value is a list of all values of that column for which an extra row in the count table will be provided, while other values will be summarized as other. For example, if counts shall be returned separately for the "White", "Black" and "Asian" population and the remaining patients shall be summarized as "other", the extra_group_columns argument needs to be set to {"PRIMARY_RACE": ["White", "Black", "Asian"]}. If no aggregation to an other category is desired, set the dict value to None. For example, {"PRIMARY_RACE": None} would return separate counts for each race.
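How the extra_group_columns dictionary maps column values to groups can be sketched as follows. This is a minimal illustration of the rule described above, not the package's code: the helper names are hypothetical, patient records are plain dictionaries, and the values "Unknown" and "Pacific Islander" are made up for the example.

```python
from collections import Counter


def group_label(value, keep):
    """Return value unchanged if it is kept, otherwise the label 'other'."""
    if keep is None:  # None: no aggregation, every value gets its own row
        return value
    return value if value in keep else "other"


def group_key(record, extra_group_columns):
    """Build the grouping key of one record from all extra columns."""
    return tuple(
        group_label(record[column], keep)
        for column, keep in extra_group_columns.items()
    )


records = [
    {"PRIMARY_RACE": "White"},
    {"PRIMARY_RACE": "White"},
    {"PRIMARY_RACE": "Asian"},
    {"PRIMARY_RACE": "Unknown"},
    {"PRIMARY_RACE": "Pacific Islander"},
]

# Keep three values, summarize everything else as "other".
selected = {"PRIMARY_RACE": ["White", "Black", "Asian"]}
with_other = Counter(group_key(r, selected) for r in records)

# No aggregation: one group per distinct value.
separate = Counter(group_key(r, {"PRIMARY_RACE": None}) for r in records)
```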

Parameters:

  • gene_symbols (list) –

    genes to include in the result.

  • hgvsgs (list, default: None ) –

    keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.

  • hgvsps (list, default: None ) –

    keep only these hgvsps, i.e., exclude all other hgvsps for the specified gene symbols. If None, keep all hgvsps.

  • cancer_types (list, default: None ) –

    cancer types to include in the result. All if None.

  • cancer_types_to_keep (list, default: None ) –

    cancer types not to summarize in the other category. No summarization if None.

  • cancer_subtypes (list, default: None ) –

    cancer subtypes to include in the result (Oncotree codes). All if None.

  • cancer_subtypes_to_keep (list, default: None ) –

    cancer subtypes not to summarize in the other category (Oncotree codes). No summarization if None.

  • sample_ids (list, default: None ) –

    Keep only these samples. No additional filtering if None.

  • cancer_subtype_resolution (bool, default: False ) –

    if True, provide frequencies by cancer subtype, otherwise return frequencies summarized by cancer type. If cancer_subtypes_to_keep is specified, this is automatically set to True, otherwise it defaults to False.

  • extra_group_columns (dict, default: None ) –

    by default, counts are returned for each combination of gene symbol, cancer type (CANCER_TYPE), and possibly cancer subtype (ONCOTREE_CODE). If counts should be split by additional factors, such as PRIMARY_RACE, it can be added here.

  • universe (str, default: 'ensembl' ) –

    one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.

  • precision (int, default: 1 ) –

    number of fractional digits of formatted allele frequency percentages.

  • panel_coverage_threshold (float, default: 0.8 ) –

    Exclude panels and samples profiled with these panels if the fraction of mutations tested by these panels is below that threshold. If this is set to 1, only panels that test all mutations are included. In that case, no imputation is needed. See get_imputed_sample_mutation_profiles for details.

  • impute (bool, default: False ) –

    Whether to impute missing values (mutations not tested). If this is set to False (the default), then missing values are replaced with wild type. See get_imputed_sample_mutation_profiles for details.

Returns:

  • DataFrame

    Amino acid level counts and frequencies.

get_annotated_unique_mutations(gene_symbols=None, universe='ensembl')

Get annotated mutations for all or selected genes.

This function returns a dataframe with all unique mutations found in at least one sample in Genie for the specified genes. If no genes are specified, return annotations for all genes.

If the universe mane is specified, more than one transcript can be returned for a genomic variant, for example a MANE Select and a MANE Plus clinical variant.

Parameters:

  • gene_symbols (list, default: None ) –

    return mutations for these genes. If not specified, mutations for all genes will be returned.

  • universe (str, default: 'ensembl' ) –

    ensembl or mane. Return annotations for Ensembl transcripts or for MANE transcripts.

Returns:

  • DataFrame

    dataframe with annotated unique mutations.

get_cancer_subtype_sample_counts(cancer_types=None, cancer_subtypes_to_keep=None)

Get cancer subtypes and their sample numbers.

Get the number of samples for each cancer subtype included in Genie. This function returns sample numbers and not patient numbers because quite often patients have multiple samples of the same cancer type but slightly different cancer subtype annotations.

get_cancer_subtypes(cancer_types=None)

Get cancer subtypes for all or selected cancer types.

Parameters:

  • cancer_types (list, default: None ) –

    optional list of cancer types for which to return cancer subtypes. If not specified, all cancer types are returned.

Returns:

  • DataFrame

    cancer subtypes for all or specified cancer types. Cancer types are the index of the returned dataframe; Oncotree codes and names of cancer subtypes are the columns.

get_cancer_type_patient_counts(cancer_types_to_keep=None)

Get cancer types and their patient numbers.

Get the number of patients for each cancer type included in Genie. If the optional argument cancer_types_to_keep is specified, all other cancer types are summarized as other.

The counts returned are for patients, not samples. Patients can have more than one sample and even more than one cancer type. Each patient is counted only once per cancer type, independent of the number of samples for that patient. However, if a patient has samples of more than one cancer type, the patient will be counted multiple times, once per cancer type. As a consequence, the sum of all counts returned by this function is larger than the total number of patients in Genie.

Parameters:

  • cancer_types_to_keep (list, default: None ) –

    cancer types not to summarize in the other category. If not specified, all cancer types will be returned.

Returns:

  • Series

    number of patients per cancer type
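The counting rule (once per patient and cancer type, independent of the number of samples) can be sketched as follows. This is an illustration, not the package's code: the function name is hypothetical, a list of (patient, cancer type) pairs stands in for the sample annotations, and the patients and cancer type labels are made up.

```python
from collections import defaultdict


def cancer_type_patient_counts_sketch(samples, cancer_types_to_keep=None):
    """Count distinct patients per cancer type.

    samples is a list of (patient_id, cancer_type) pairs, one per sample.
    """
    patients = defaultdict(set)
    for patient_id, cancer_type in samples:
        if (cancer_types_to_keep is not None
                and cancer_type not in cancer_types_to_keep):
            cancer_type = "other"
        # A set ignores repeated samples of the same patient and cancer type.
        patients[cancer_type].add(patient_id)
    return {cancer_type: len(ids) for cancer_type, ids in patients.items()}


samples = [
    ("P1", "Breast Cancer"),
    ("P1", "Breast Cancer"),  # second sample, same cancer type
    ("P1", "Melanoma"),       # same patient, different cancer type
    ("P2", "Melanoma"),
    ("P3", "Glioma"),
]
all_counts = cancer_type_patient_counts_sketch(samples)
kept_counts = cancer_type_patient_counts_sketch(samples, ["Melanoma"])
```

In this example the counts sum to 4 although there are only 3 patients, because P1 contributes to two cancer types.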

get_cancer_types()

Get a list of cancer type names as used in Genie.

Returns:

  • list

    all cancer types used in Genie.

get_gene_level_frequencies(gene_symbols, hgvsgs=None, cancer_types=None, cancer_types_to_keep=None, cancer_subtypes=None, cancer_subtypes_to_keep=None, sample_ids=None, cancer_subtype_resolution=False, extra_group_columns=None, universe='ensembl', min_mutations=1, precision=1, panel_coverage_threshold=0.8, impute=False)

Get mutation frequencies of selected genes across samples.

This function returns the gene level mutation frequencies of the specified genes across either all samples or across selected samples after filtering by sample ids or cancer types or cancer subtypes. Counts and frequencies are for patient numbers, not sample numbers. A patient is considered having a particular mutation if at least one sample of this patient has that mutation.

The number of genes should be small, because otherwise the volume of results is very high and it will take very long for the function to return.

For each mutation detected in at least one sample, it is checked for each sample whether the mutation was tested by the panel used for this sample. Samples tested for too few of these mutations (a fraction below panel_coverage_threshold) are not included in the returned counts and frequencies. For the remaining samples, missing values are imputed or set to wild type, depending on the impute argument. See get_imputed_sample_mutation_profiles for details. Imputation is very time consuming, and for a sufficiently large panel_coverage_threshold the differences in the results are marginal, which is why imputation is switched off by default.

If the cancer_types_to_keep argument is specified, all cancer types not included in this list are summarized in the other category.

If the cancer_subtypes_to_keep argument is specified, all cancer subtypes not included in this list are summarized in the other category.

If extra_group_columns is specified, counts are provided for each value of these extra columns from sample or patient annotations. One example for this is PRIMARY_RACE, another one is SEX. The argument extra_group_columns needs to be provided as a dictionary, where the dict key is the name of the column, and the dict value is a list of all values of that column for which an extra row in the count table will be provided, while other values will be summarized as other. For example, if counts shall be returned separately for the "White", "Black" and "Asian" population and the remaining patients shall be summarized as "other", the extra_group_columns argument needs to be set to {"PRIMARY_RACE": ["White", "Black", "Asian"]}. If no aggregation to an other category is desired, set the dict value to None. For example, {"PRIMARY_RACE": None} would return separate counts for each race.

Parameters:

  • gene_symbols (list) –

    genes to include in the result.

  • hgvsgs (list, default: None ) –

    keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.

  • cancer_types (list, default: None ) –

    cancer types to include in the result. All if None.

  • cancer_types_to_keep (list, default: None ) –

    cancer types not to summarize in the other category. No summarization if None.

  • cancer_subtypes (list, default: None ) –

    cancer subtypes to include in the result (Oncotree codes). All if None.

  • cancer_subtypes_to_keep (list, default: None ) –

    cancer subtypes not to summarize in the other category (Oncotree codes). No summarization if None.

  • sample_ids (list, default: None ) –

    Keep only these samples. No additional filtering if None.

  • cancer_subtype_resolution (bool, default: False ) –

    if True, provide frequencies by cancer subtype, otherwise return frequencies summarized by cancer type. If cancer_subtypes_to_keep is specified, this is automatically set to True, otherwise it defaults to False.

  • extra_group_columns (dict, default: None ) –

    by default, counts are returned for each combination of gene symbol, cancer type (CANCER_TYPE), and possibly cancer subtype (ONCOTREE_CODE). If counts should be split by additional factors, such as PRIMARY_RACE, it can be added here.

  • universe (str, default: 'ensembl' ) –

    one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.

  • min_mutations (int, default: 1 ) –

    Call a gene mutated if it has at least that many mutations. Should be 1 for oncogenes and may be set to 2 for tumor suppressor genes.

  • precision (int, default: 1 ) –

    number of fractional digits of formatted allele frequency percentages.

  • panel_coverage_threshold (float, default: 0.8 ) –

    Exclude panels and samples profiled with these panels if the fraction of mutations tested by these panels is below that threshold. If this is set to 1, only panels that test all mutations are included. In that case, no imputation is needed. See get_imputed_sample_mutation_profiles for details.

  • impute (bool, default: False ) –

    Whether to impute missing values (mutations not tested). If this is set to False (the default), then missing values are replaced with wild type. See get_imputed_sample_mutation_profiles for details.

Returns:

  • DataFrame

    gene level counts and frequencies.

get_imputed_sample_mutation_profiles(sample_mutation_profiles, panel_coverage_threshold=0.8, impute=False)

Create a sample vs mutation matrix with imputation.

This function accepts sample mutation profiles as obtained by the function get_sample_mutation_profiles. The dataframe returned by get_sample_mutation_profiles contains only values (True or False for MUT or WT) for those sample-mutation combinations that were actually tested by the panel used for a sample. For gene level aggregation, a full sample-vs-mutation matrix without missing values is needed.

If impute is set to True, then this function calculates such a matrix using MICE (Multiple Imputation by Chained Equations) for imputing values. MUT is encoded as 1, WT as 0, and the imputation will result in a fractional number between 0 and 1 for each missing value. Random numbers (uniform distribution between 0 and 1) are then used to assign “MUT” or “WT” depending on the value of the random variable and on the imputed value from MICE. To be more precise, the originally missing value gets a “MUT” if the random number is larger than the imputed value from MICE, and “WT” otherwise. This way we get a full mutation-versus-sample matrix without missing values.

Imputation is a time consuming process. For example, imputation for KRAS mutations in NSCLC and CRC with the default panel coverage threshold takes about 15 minutes. If impute is False, all missing values are replaced with "WT". For a large panel_coverage_threshold, imputation changes frequencies only marginally. Therefore, imputation is switched off by default.

Please be careful when calling this function - the memory requirements for an all genes versus all samples matrix would most likely exceed what is available in the compute environment. Therefore, always work with subsets of genes and maybe also indications.

Parameters:

  • sample_mutation_profiles (DataFrame) –

    the mutation profiles as obtained by get_sample_mutation_profiles.

  • panel_coverage_threshold (float, default: 0.8 ) –

    Exclude panels and samples profiled with these panels if the fraction of mutations tested by these panels is below that threshold. If this is set to 1, only panels that test all mutations are included. In that case, no imputation is needed.

  • impute (bool, default: False ) –

    Whether to impute missing values (mutations not tested). If this is set to False (the default), then missing values are replaced with wild type. Imputation takes very long (maybe hours), and for large panel_coverage_thresholds, frequencies don't change much, which is why imputation is switched off by default.

Returns:

  • DataFrame

    Sample-mutation profiles with no missing values.
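The behaviour for impute=False can be sketched in plain Python: panels below the coverage threshold are dropped together with their samples, and the remaining gaps are filled with wild type. This is a simplified illustration, not the package's code; the function name, the dictionary inputs and the identifiers are hypothetical, and the MICE imputation path is not shown.

```python
def fill_untested_with_wild_type(profiles, sample_panel,
                                 panel_coverage_threshold=0.8):
    """Build a complete (mutation, sample) table without imputation.

    profiles maps (mutation, sample_id) to True/False for tested pairs only.
    sample_panel maps each sample_id to the panel used to profile it.
    """
    mutations = sorted({mutation for mutation, _ in profiles})
    tested = {}  # panel -> set of mutations tested by that panel
    for mutation, sample_id in profiles:
        tested.setdefault(sample_panel[sample_id], set()).add(mutation)
    kept_panels = {
        panel for panel, covered in tested.items()
        if len(covered) / len(mutations) >= panel_coverage_threshold
    }
    full = {}
    for sample_id, panel in sample_panel.items():
        if panel not in kept_panels:
            continue  # sample profiled with a panel below the threshold
        for mutation in mutations:
            # Untested pairs are set to wild type (False).
            full[(mutation, sample_id)] = profiles.get((mutation, sample_id), False)
    return full


sample_panel = {"S1": "panel_A", "S2": "panel_B", "S3": "panel_C"}
profiles = {
    ("m1", "S1"): True, ("m2", "S1"): False, ("m3", "S1"): False,  # 3 of 3 tested
    ("m1", "S2"): False, ("m2", "S2"): True,                       # 2 of 3 tested
    ("m3", "S3"): True,                                            # 1 of 3 tested
}
full = fill_untested_with_wild_type(profiles, sample_panel,
                                    panel_coverage_threshold=0.6)
```

Here panel_C covers only one of three mutations and is dropped, while the untested pair ("m3", "S2") is filled with wild type.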

get_nucleic_acid_level_frequencies(gene_symbols, hgvsgs=None, cancer_types=None, cancer_types_to_keep=None, cancer_subtypes=None, cancer_subtypes_to_keep=None, sample_ids=None, cancer_subtype_resolution=False, extra_group_columns=None, universe='ensembl', precision=1)

Get nucleic acid level mutation frequencies.

This function returns the mutation frequencies of the specified genes across either all samples or across selected samples after filtering by sample ids or cancer types or cancer subtypes. Counts and frequencies are for patient numbers, not sample numbers. A patient is considered having a particular mutation if at least one sample of this patient has that mutation.

The number of genes should be small, because otherwise the volume of results is very high and it will take very long for the function to return.

For each mutation detected in at least one sample, it is checked for each sample whether the mutation was tested by the panel used for this sample. Samples not tested for a mutation are not included in the returned counts and frequencies.

If the cancer_types_to_keep argument is specified, all cancer types not included in this list are summarized in the other category.

If the cancer_subtypes_to_keep argument is specified, all cancer subtypes not included in this list are summarized in the other category.

If extra_group_columns is specified, counts are provided for each value of these extra columns. One example for this is PRIMARY_RACE. The argument extra_group_columns needs to be provided as a dictionary, where the dict key is the name of the column, and the dict value is a list of all values of that column for which an extra row in the count table will be provided, while other values will be summarized as other. For example, if counts shall be returned separately for the "White", "Black" and "Asian" population and the remaining patients shall be summarized as "other", the extra_group_columns argument needs to be set to {"PRIMARY_RACE": ["White", "Black", "Asian"]}. If no aggregation to an other category is desired, set the dict value to None. For example, {"PRIMARY_RACE": None} would return separate counts for each race.

Parameters:

  • gene_symbols (list) –

    genes to include in the result.

  • hgvsgs (list, default: None ) –

    keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.

  • cancer_types (list, default: None ) –

    cancer types to include in the result. All if None.

  • cancer_types_to_keep (list, default: None ) –

    cancer types not to summarize in the other category. No summarization if None.

  • cancer_subtypes (list, default: None ) –

    cancer subtypes to include in the result (Oncotree codes). All if None.

  • cancer_subtypes_to_keep (list, default: None ) –

    cancer subtypes not to summarize in the other category (Oncotree codes). No summarization if None.

  • sample_ids (list, default: None ) –

    Keep only these samples. No additional filtering if None.

  • cancer_subtype_resolution (bool, default: False ) –

    if True, provide frequencies by cancer subtype, otherwise return frequencies summarized by cancer type. If cancer_subtypes_to_keep is specified, this is automatically set to True, otherwise it defaults to False.

  • extra_group_columns (dict, default: None ) –

    by default, counts are returned for each combination of hgvsg, cancer type (CANCER_TYPE), and possibly cancer subtype (ONCOTREE_CODE). If counts should be split by additional factors, such as PRIMARY_RACE, it can be added here.

  • universe (str, default: 'ensembl' ) –

    one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.

  • precision (int, default: 1 ) –

    number of fractional digits of formatted allele frequency percentages.

get_sample_mutation_profiles(gene_symbols, hgvsgs=None, cancer_types=None, cancer_subtypes=None, sample_ids=None, universe='ensembl')

Get a mutation profile of selected genes across samples.

This function returns the mutation profile of the specified genes across either all samples or across selected samples, based on sample ids or cancer types or cancer subtypes. If more than one sample selection criterion is specified, then the intersection of the specified criteria is used to determine the final set of samples. The number of genes should be small, because otherwise the volume of results is very high and it will take very long for the function to return.

For each mutation detected in at least one sample, it is checked for each sample whether the mutation was tested by the panel used for this sample. If this is the case, mutated is returned as "False", unless the mutation was in fact detected for this sample, in which case "True" is returned. Samples that have not been tested for a mutation are not included in the result.

Parameters:

  • gene_symbols (list) –

    genes to include in the result.

  • hgvsgs (list, default: None ) –

    keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.

  • cancer_types (list, default: None ) –

    cancer types to include in the result.

  • cancer_subtypes (list, default: None ) –

    cancer subtypes to include in the result (Oncotree codes).

  • sample_ids (list, default: None ) –

    sample identifiers to include in the result.

  • universe (str, default: 'ensembl' ) –

    one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.

Returns:

  • DataFrame

    Mutation status (MUT or WT) for all tested mutations.
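The tested-versus-detected logic can be sketched as follows. This is a simplified illustration, not the package's code: the function name is hypothetical, and each panel is represented directly by the set of mutations it covers, whereas the package works with the genomic regions tested by each panel.

```python
def sample_mutation_profiles_sketch(detected, panel_tests, sample_panel):
    """Return the mutation status for every tested (mutation, sample) pair.

    detected is a set of (mutation, sample_id) pairs reported as mutated.
    panel_tests maps each panel to the set of mutations it covers.
    sample_panel maps each sample_id to the panel used to profile it.
    """
    mutations = {mutation for mutation, _ in detected}
    profiles = {}
    for sample_id, panel in sample_panel.items():
        for mutation in mutations:
            if mutation in panel_tests[panel]:
                # Tested: True if detected in this sample, otherwise False.
                profiles[(mutation, sample_id)] = (mutation, sample_id) in detected
            # Untested pairs are left out of the result.
    return profiles


detected = {("m1", "S1"), ("m2", "S2")}
panel_tests = {"panel_A": {"m1", "m2"}, "panel_B": {"m2"}}
sample_panel = {"S1": "panel_A", "S2": "panel_B"}
profiles = sample_mutation_profiles_sketch(detected, panel_tests, sample_panel)
```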

get_tmb(sample_ids=None, min_genomic_range=10000, tmb_intermediate_threshold=6, tmb_high_threshold=20)

Get the tumor mutational burden (TMB) for all specified samples.

The TMB is calculated by dividing the number of mutations reported for a sample by the total genomic range covered by the panel that is used to profile that sample. If the total genomic range of a panel is smaller than min_genomic_range, NA is returned.

Parameters:

  • sample_ids (list, default: None ) –

    samples for which the TMB is to be returned. If None (the default), TMB is returned for all GENIE samples.

  • min_genomic_range (int, default: 10000 ) –

    if the total genomic range covered by a panel is smaller than this value, TMB is returned as NA for all samples tested with such a panel.

  • tmb_intermediate_threshold (int, default: 6 ) –

    samples are classified as "TMB intermediate" if the TMB is >= this threshold but smaller than tmb_high_threshold. If it is smaller than this threshold, it is classified as "TMB low".

  • tmb_high_threshold (int, default: 20 ) –

    samples are classified as "TMB high" if the TMB is >= this threshold.

Returns:

  • DataFrame

    Table with the mutation count, size of genomic range covered, TMB, and TMB_class for each sample, with SAMPLE_ID as index.
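
The calculation and classification described above can be sketched for a single sample as follows. This is a minimal illustration, not the package's implementation: the helper name classify_tmb is hypothetical, and the sketch assumes that TMB is expressed in mutations per megabase, which the description does not state explicitly.

```python
import math


def classify_tmb(mutation_count, genomic_range, min_genomic_range=10000,
                 tmb_intermediate_threshold=6, tmb_high_threshold=20):
    """Return (tmb, tmb_class) for one sample (hypothetical helper)."""
    # Panels covering less than min_genomic_range get no TMB value.
    if genomic_range < min_genomic_range:
        return math.nan, None
    # Assumption: TMB is reported as mutations per megabase.
    tmb = mutation_count * 1_000_000 / genomic_range
    if tmb >= tmb_high_threshold:
        tmb_class = "TMB high"
    elif tmb >= tmb_intermediate_threshold:
        tmb_class = "TMB intermediate"
    else:
        tmb_class = "TMB low"
    return tmb, tmb_class


# 12 mutations on a panel covering 1.2 Mb
print(classify_tmb(12, 1_200_000))
```

Both thresholds are inclusive lower bounds, matching the ">=" wording in the parameter descriptions.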

summary()

Get a summary of data in this GENIE release.

The summary includes the number of panels, the number of patients, the number of samples, the number of genes tested by at least one panel, and the number of genes with copy number alterations identified at least once.

Merger

Deep merging of dictionary hierarchies.

This class can be used to merge two different dictionary trees. This is useful, for example, to load a default configuration from a JSON file and merge it with a user configuration file where the defaults are kept for all values not overwritten by the user configuration file.

deep_merge(dict1, dict2) staticmethod

Deep merging of two hierarchical dictionary trees.

The dict1 dictionary is updated with any information from dict2. The value of a dictionary item for a particular key can be another dictionary, and a flat merge would simply replace the dictionary item for that key in dict1 with the dictionary item from dict2. Such a flat merge would therefore lose the information for that dictionary item from dict1. A deep merge does not drop the value of a dictionary item in dict1 and replace it with the dictionary item from dict2 with the same key, but instead updates the dictionary item in dict1 with the information from dict2, changing or adding only those keys that are part of dict2.

Parameters:

  • dict1 (dict) –

    the dictionary to be updated.

  • dict2 (dict) –

    the dictionary with the update information.

Returns:

  • dict

    The updated dictionary (dict1 updated with dict2).

Examples:

>>> genie.Merger.deep_merge(default_config, user_config)
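
The difference between a flat and a deep merge can be illustrated with a small standalone sketch. The function below is a generic recursive merge written for this example, not the code of Merger.deep_merge, and the configuration keys are made up.

```python
def deep_merge(dict1, dict2):
    """Recursively update dict1 with dict2 and return dict1 (illustrative sketch)."""
    for key, value in dict2.items():
        if isinstance(value, dict) and isinstance(dict1.get(key), dict):
            # Both sides hold a dictionary: merge them instead of replacing.
            deep_merge(dict1[key], value)
        else:
            # Scalars, lists, and new keys are simply taken from dict2.
            dict1[key] = value
    return dict1


# Hypothetical configuration content, for illustration only.
default_config = {"files": {"cna": "cna.txt", "mutations": "mut.txt"}, "cache": True}
user_config = {"files": {"cna": "my_cna.txt"}}

# Flat merge: the whole "files" dictionary is replaced, "mutations" is lost.
flat = {**default_config, **user_config}

# Deep merge: only "cna" is overwritten, "mutations" keeps its default.
deep = deep_merge(default_config, user_config)

print(flat["files"])
print(deep["files"])
```

Note that this sketch modifies dict1 in place, in line with the description of dict1 as "the dictionary to be updated".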

Meta(config, key)

Bases: dict

This class is a dictionary initialized with data from a meta file.

GENIE comes with several files with meta information about the actual data files. This class is a dictionary that is initialized with key-value pairs from such a meta information file.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

  • key (str) –

    the file name key from the config file.

Mutations(config)

Mutation data for all samples included in GENIE.

Objects of this class contain all SNVs and indels found in any sample in the GENIE database.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

Attributes:

  • data (DataFrame) –

    mutation data for all samples.

  • annot (dict) –

    annotations of mutations derived from re-annotation by ICA. The dictionary has two keys - "ensembl" and "mane". annot["ensembl"] contains annotations for all Ensembl transcripts in GENIE (GENIE is based on Ensembl transcripts). annot["mane"] contains annotations for all MANE transcripts mapping to the mutations in GENIE.

get_detected_mutations(gene_symbols, universe='ensembl')

Get detected mutations for a list of genes.

This function returns mutations that are found for the specified genes in all samples included in GENIE. If a mutation is not included in the returned values for a particular sample, this does not mean that the gene is wild type in this sample, because the mutation may not be covered by the panel used for that sample.

Parameters:

  • gene_symbols (list) –

    list of HGNC gene symbols.

  • universe (str, default: 'ensembl' ) –

    one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.

Returns:

  • DataFrame

    all mutations of these genes found in GENIE samples.

get_mutation_annotations(hgvsgs, universe='ensembl')

Get annotations for a list of mutations.

Get ICA re-annotations for a list of mutations specified by HGVSG. The returned dataframe can include more than one row per unique genomic variant. For example, a locus can be a downstream_gene_variant for one gene and a 3_prime_UTR_variant for another gene. Furthermore, if the mane universe is specified, there can be a MANE Select and one or more MANE Plus Clinical transcript variants covering the genomic location, and there can be more than one gene covering that location. There can be up to 15 different MANE transcripts for a genomic locus. Which of these transcripts should be included in downstream analyses depends on the use case. Therefore, the user of this function needs to add appropriate filtering to the returned annotations.

Parameters:

  • hgvsgs (list) –

    list of HGVSG specifications of mutations.

  • universe (str, default: 'ensembl' ) –

    one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.

Returns:

  • DataFrame

    Annotations for specified mutations.
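
As an illustration of the kind of filtering left to the caller, the sketch below keeps only MANE Select rows for a set of genes. It works on a plain list of dictionaries rather than the returned dataframe, and all field names (hgvsg, gene, transcript, mane_status) and values are placeholders invented for this example; the actual column names of the annotation table may differ.

```python
def keep_mane_select(rows, gene_symbols):
    """Keep MANE Select annotation rows for the given genes (hypothetical helper)."""
    wanted = set(gene_symbols)
    return [
        row for row in rows
        if row["mane_status"] == "MANE Select" and row["gene"] in wanted
    ]


# Placeholder annotation rows: one genomic variant, three transcript-level rows.
annotations = [
    {"hgvsg": "variant_1", "gene": "GENE_A", "transcript": "TX_1",
     "mane_status": "MANE Select"},
    {"hgvsg": "variant_1", "gene": "GENE_A", "transcript": "TX_2",
     "mane_status": "MANE Plus Clinical"},
    {"hgvsg": "variant_1", "gene": "GENE_B", "transcript": "TX_3",
     "mane_status": "MANE Select"},
]

selected = keep_mane_select(annotations, ["GENE_A"])
print(selected)
```

Other use cases may instead keep all transcripts of a gene or pick the row with the most severe consequence; the choice belongs to the downstream analysis.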

get_sample_mutation_profiles(gene_symbols, sample_ids, sample_info, tested_mutations, universe='ensembl', hgvsgs=None)

Get mutation profiles of genes for all tested samples.

While the function get_detected_mutations returns mutation-sample-pairs only for those samples where a mutation was actually detected, this function adds all samples where a mutation was also tested but not found. The returned dataframe contains a column mutated that is either True or False.

Parameters:

  • gene_symbols (list) –

    genes to include in the result

  • sample_ids (list) –

    sample identifiers of samples to include

  • sample_info (SampleInfo) –

    annotation of all samples

  • tested_mutations (TestedMutations) –

    cache providing hgvsg vs panel matrix

  • universe (str, default: 'ensembl' ) –

    one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.

  • hgvsgs (list, default: None ) –

    keep only these hgvsgs, i.e., exclude all other hgvsgs for the specified gene_symbols. If None, keep all hgvsgs.
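
The idea of extending detected mutation-sample pairs with tested-but-not-found samples can be sketched with plain Python containers. This is an assumption-laden illustration, not the method's implementation: the function name, the input shapes (a set of detected pairs, a sample-to-panel mapping, and a nested mutation-by-panel dictionary), and the identifiers are all hypothetical, and the sketch assumes that untested sample-mutation pairs are simply left out.

```python
def mutation_profiles(detected, sample_panel, is_tested):
    """Build rows with a boolean 'mutated' flag for every tested sample.

    detected:     set of (sample_id, hgvsg) pairs where a mutation was found
    sample_panel: dict mapping sample_id to the panel used for that sample
    is_tested:    dict mapping hgvsg to {panel_id: bool}
    """
    rows = []
    for hgvsg in sorted({h for _, h in detected}):
        for sample_id, panel_id in sorted(sample_panel.items()):
            if not is_tested[hgvsg].get(panel_id, False):
                continue  # sample's panel did not test this mutation
            rows.append({
                "sample_id": sample_id,
                "hgvsg": hgvsg,
                "mutated": (sample_id, hgvsg) in detected,
            })
    return rows


# Placeholder identifiers, for illustration only.
detected = {("S1", "m1")}
sample_panel = {"S1": "P1", "S2": "P1", "S3": "P2"}
is_tested = {"m1": {"P1": True, "P2": False}}

profiles = mutation_profiles(detected, sample_panel, is_tested)
for row in profiles:
    print(row)
```

In this example S1 is reported as mutated, S2 as tested but not mutated, and S3 does not appear because its panel does not cover the mutation.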

get_unique_mutations(gene_symbols=None, universe='ensembl')

Get a unique list of mutations found in GENIE for a list of genes.

For the ensembl universe, all variants affecting Ensembl transcripts that were included in the original GENIE mutation data file are returned. For the mane universe, only variants affecting MANE transcripts are returned.

Parameters:

  • gene_symbols (list, default: None ) –

    list of genes to query.

  • universe (str, default: 'ensembl' ) –

    one of "ensembl" or "mane"; use "ensembl" (the default) to include the original Ensembl transcripts from GENIE, and use "mane" to use RefSeq MANE transcripts instead.

Returns:

  • list

    hgvsg for all unique mutations for the specified genes.

hgvsp3_to_hgvsp1(hgvsp)

Translate 3-letter amino acid codes to 1-letter amino acid codes.

Parameters:

  • hgvsp (str) –

    amino acid change with 3-letter amino acid codes

Returns:

  • str

    hgvsp_short with 1-letter amino acid codes
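
A translation of this kind can be sketched with a lookup table and a regular expression. The function below is a standalone illustration with a hypothetical name, not the package's code; in particular, mapping the stop codon "Ter" to "*" is an assumption about the desired short notation.

```python
import re

# Standard 3-letter to 1-letter amino acid codes; "Ter" -> "*" is an assumption.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Sec": "U", "Pyl": "O", "Ter": "*",
}
_CODE_PATTERN = re.compile("|".join(THREE_TO_ONE))


def three_to_one(hgvsp):
    """Replace every 3-letter amino acid code in hgvsp by its 1-letter code."""
    return _CODE_PATTERN.sub(lambda match: THREE_TO_ONE[match.group(0)], hgvsp)


print(three_to_one("p.Val600Glu"))      # p.V600E
print(three_to_one("p.Arg213Ter"))      # p.R213*
print(three_to_one("p.Gly12AspfsTer5"))  # p.G12Dfs*5
```

Because the 3-letter codes are capitalized and HGVS keywords such as del, ins, dup, and fs are lowercase, a case-sensitive match leaves those keywords untouched.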

Panel(config, global_tested_positions, global_panel_info, panel_id)

This class represents a panel (assay) included in GENIE.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

  • global_tested_positions (Type[TestedPositions]) –

    tested positions for all panels.

  • global_panel_info (DataFrame) –

    assay information for all panels.

  • panel_id (str) –

    the panel identifier.

Attributes:

  • id (str) –

    panel identifier.

  • description (str) –

    panel description.

  • genes (list) –

    genes on panel.

  • tested_positions (TestedPositions) –

    tested positions for this panel.

  • panel_info (dict) –

    assay information for this panel.

gene_is_on_panel(gene_symbol)

Check if a gene is included in this panel.

Parameters:

  • gene_symbol (str) –

    HGNC gene symbol.

Returns:

  • bool

    True if gene is on panel, else False.

get_total_range()

Get the overall size of genomic ranges tested by this panel.

The overall size of tested genomic regions can be used to estimate TMB.

Returns:

  • int

    Sum of lengths of all ranges tested by this panel.
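
The sum of range lengths can be illustrated as follows. The sketch assumes that tested ranges are given as (chromosome, start, end) tuples with inclusive coordinates and that ranges do not overlap; neither assumption is confirmed by the description above, and the helper name is hypothetical.

```python
def total_range(ranges):
    """Sum the lengths of (chromosome, start, end) ranges.

    Assumes inclusive start and end coordinates and non-overlapping ranges.
    """
    return sum(end - start + 1 for _, start, end in ranges)


# Placeholder ranges: 100 bases on chromosome 1 and 50 bases on chromosome 2.
tested_ranges = [("1", 100, 199), ("2", 1, 50)]
print(total_range(tested_ranges))  # 150
```

If ranges could overlap, they would have to be merged first to avoid counting positions twice.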

position_is_tested(chr, pos)

Check if a genomic position is tested by this panel.

Parameters:

  • chr (str) –

    chromosome.

  • pos (int) –

    position on chromosome.

Returns:

  • bool

    True if position is tested by this panel, else False.

range_is_tested(chr, start_pos, end_pos)

Check if a genomic range is tested by this panel.

Parameters:

  • chr (str) –

    chromosome.

  • start_pos (int) –

    start position of range on chromosome.

  • end_pos (int) –

    end position of range on chromosome.

Returns:

  • bool

    True if entire range is tested by this panel, else False.
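
The position and range checks can be sketched with a simple interval containment test. The helper names and the representation of tested ranges as (chromosome, start, end) tuples with inclusive coordinates are assumptions made for this example, not the class's actual data structure.

```python
def range_is_tested(ranges, chrom, start_pos, end_pos):
    """True if [start_pos, end_pos] lies entirely within one tested range."""
    return any(
        c == chrom and start <= start_pos and end_pos <= end
        for c, start, end in ranges
    )


def position_is_tested(ranges, chrom, pos):
    """A single position is a range of length one."""
    return range_is_tested(ranges, chrom, pos, pos)


# Placeholder tested ranges with inclusive coordinates.
tested_ranges = [("7", 1000, 2000), ("17", 500, 800)]

print(position_is_tested(tested_ranges, "7", 1500))      # True
print(range_is_tested(tested_ranges, "7", 1900, 2100))   # False: extends past 2000
print(position_is_tested(tested_ranges, "17", 1500))     # False: wrong chromosome
```

This sketch reports a query range as untested when it spans two adjacent tested ranges; handling that case would require merging adjacent ranges beforehand.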

tested_positions()

Get all tested positions for this panel.

Returns:

  • DataFrame

    tested positions for this panel.

PanelSet(config)

Set of all panels included in GENIE.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

Attributes:

  • panels (dict) –

    all panels used by GENIE, with the panel identifier as key and Panel as value.

  • tested_positions (TestedPositions) –

    tested positions for all panels.

  • tested_mutations (TestedMutations) –

    tested mutations for all panels.

  • panel_info (DataFrame) –

    assay information for all panels.

panels_for_mutation(hgvsg)

Get a list of all panels testing a particular mutation.

This is done using the precomputed

Parameters:

  • hgvsg (str) –

    mutation to be checked.

Returns:

  • dict

    dictionary with panel ids as key and Panel objects as values containing all panels testing this mutation.
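
A lookup of this kind can be sketched under the assumption that it relies on a panel-mutation matrix like the one described for the TestedMutations class. The nested dictionary standing in for that matrix, the function signature, and the identifiers below are hypothetical.

```python
def panels_for_mutation(is_tested, panels, hgvsg):
    """Return {panel_id: panel} for all panels testing the given mutation.

    is_tested: dict mapping hgvsg to {panel_id: bool} (stand-in for the matrix)
    panels:    dict mapping panel_id to a panel object
    """
    tested_by = is_tested.get(hgvsg, {})
    return {
        panel_id: panel
        for panel_id, panel in panels.items()
        if tested_by.get(panel_id, False)
    }


# Placeholder data; strings stand in for Panel objects.
panels = {"P1": "panel one", "P2": "panel two", "P3": "panel three"}
is_tested = {"m1": {"P1": True, "P2": False, "P3": True}}

print(panels_for_mutation(is_tested, panels, "m1"))
```

A mutation that is missing from the matrix yields an empty dictionary in this sketch.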

panels_for_position(chr, pos)

Get a list of all panels testing a genomic location.

Parameters:

  • chr (str) –

    chromosome.

  • pos (int) –

    position on chromosome.

Returns:

  • list

    List of panels probing this location.

panels_for_range(chr, start_pos, end_pos)

Get a list of all panels testing a genomic range.

Parameters:

  • chr (str) –

    chromosome.

  • start_pos (int) –

    start position of range.

  • end_pos (int) –

    end position of range.

Returns:

  • list

    List of panels probing this range.

PatientInfo(config)

Patient information for all subjects included in GENIE.

Patient information includes sex, primary race, ethnicity, clinical center, contact id, dod(?) id, year of contact, dead or alive, year of death.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

Attributes:

  • data (DataFrame) –

    patient information for all subjects.

append_info(more_info)

Add more columns to the patient information data.

Parameters:

  • more_info (DataFrame) –

    table with additional columns to be added to the patient information. This dataframe must have the PATIENT_ID as index.

Returns:

  • None

    nothing

get_info_for_patient(patient_id)

Get patient information for a single patient.

Parameters:

  • patient_id (str) –

    patient identifier.

Returns:

  • dict

    patient information.

SampleInfo(config)

Sample information for all samples included in GENIE.

Sample information includes patient id, age at sequencing, Oncotree code, sample type, sequencing assay id, cancer type, cancer type detailed, sample type detailed.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

Attributes:

  • data (DataFrame) –

    sample information for all samples.

append_info(more_info)

Add more columns to the sample information data.

Parameters:

  • more_info (DataFrame) –

    table with additional columns to be added to the sample information. This dataframe must have the SAMPLE_ID as index.

Returns:

  • None

    nothing

get_info_for_sample(sample_id)

Get sample information for a single sample.

Parameters:

  • sample_id (str) –

    sample identifier.

Returns:

  • dict

    sample information.

get_sample_ids()

Get all sample identifiers.

Returns:

  • list

    sample identifiers

TestedMutations(config)

This class knows for each panel-mutation pair if the mutation was tested.

Checking whether a particular mutation was tested by a panel could be done with the TestedPositions class. However, for thousands of mutations and hundreds of panels this takes a long time (about a day for all mutations detected in GENIE samples and for all panels). Therefore, in a data preparation step done once for each new GENIE release, all mutations found in any sample are checked against all panels and a panel-mutation matrix is created whose elements are True if a mutation is tested by a panel and False otherwise.

The TestedMutations class uses this precomputed cache file. Loading it takes only about 0.5 seconds, compared to about 1 day for computing the matrix on the fly.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

Attributes:

  • parquet_file (str) –

    name of parquet cache file.

is_tested_matrix(hgvsgs=None, panels=None)

Get table of mutations versus panels telling if mutation is tested.

Get a table with mutations (HGVSGs) as rows and panels as columns where table cells are True or False depending on whether a particular mutation was tested by a particular panel.

Parameters:

  • hgvsgs (list, default: None ) –

    return table for subset of these mutations (or all mutations if None).

  • panels (list, default: None ) –

    return table for subset of these panels (or all panels if None). This can be provided as a list of panel identifiers or a list of Panel objects.

Returns:

  • DataFrame

    table indicating which mutation was tested by which panel.

TestedPositions(config)

This class holds all tested positions for all panels.

Parameters:

  • config (Type[Configuration]) –

    the GENIE source data configuration.

Attributes:

  • data (DataFrame) –

    tested positions for all panels.

positions_for_panel(panel_id)

Get all tested positions for a particular panel.

Parameters:

  • panel_id (str) –

    the panel identifier.

Returns:

  • DataFrame

    tested positions for specified panel.