FAQs
Pharos and IDG
What are the criteria for including predictive models in Pharos?

Utility

  • Adds value to Pharos’ users. Examples include:
    • Predictions that fill in gaps for which experimental evidence is not available
    • Confidence metrics or rankings for experimental data
    • Aggregation of knowledge across sources to generate insight into a target’s functional role
  • Predictions / calculations are well defined and described
    • Standardized confidence metrics that can be compared across targets
    • Details on how metrics are calculated
  • Predictions apply to a significant number of targets or diseases
    • i.e. at least 100 targets / diseases
  • Ideally, predictions for targets include dark targets

Quality

  • High-performing predictions for their domain
    • As shown in publication
    • Or from results in a challenge
  • Clearly defined metrics on the confidence in the model results (e.g. overfitting scores, specificity, sensitivity, etc.)

Accessibility

  • Either via an API or through TCRD
  • Training and testing sets are made publicly available, including a clear description of the collection methods, and how the data may be used by others.
  • Source code is available in a public repository

Maintenance

  • Yearly updates

How should developers generating predictive algorithms for IDG provide you with their algorithms and related information including access to underlying datasets used in developing those algorithms?

The preferred option for third-party developed algorithms (3PDA) is to have them packaged as an R or Python package or a Java library. These could be published on CRAN, PyPI, MVNRepository or equivalents, with source code simultaneously deposited into an IDG-linked GitHub repository. Packages should be documented using the guidelines of the respective publishing platform.

Because the underlying datasets can change over time, they should be bundled with the 3PDA package in a compressed format; if this is not possible, “how-to” instructions for accessing these data should be included. Furthermore, we expect that data used for 3PDA that is not in TCRD will be provided to the TCRD team using common data sharing practices as described at https://github.com/jtleek/datasharing.

Briefly, the data package must contain:
  1. The tidy data set(s) used by the program/algorithm (preferably in long format).
  2. A code book describing each variable and its values in the tidy data set(s).
  3. The raw data used to generate the tidy data set(s).
  4. An explicit and exact recipe used to go from the raw data to the tidy data set(s), including all annotated code used for transformation.
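
As an illustration of the "long format" preferred in item 1, the following minimal sketch reshapes a wide table (one row per target, one column per score) into one row per target and variable. The column names and values are hypothetical placeholders, not data from TCRD.

```python
def wide_to_long(rows, id_column, variable_name="variable", value_name="value"):
    """Reshape wide rows into long rows: one output row per (id, column) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every long row instead
            long_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return long_rows


# Hypothetical wide table; "score_a" and "score_b" are made-up column names.
wide = [
    {"target": "DRD2", "score_a": 0.91, "score_b": 0.40},
    {"target": "ABL1", "score_a": 0.75, "score_b": 0.62},
]
long_rows = wide_to_long(wide, id_column="target")
```

In the long form, each variable can be described once in the code book, and new variables add rows rather than columns.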

Algorithm developers are expected to work together with the TCRD team to facilitate the integration of their tidy data set(s) and results into TCRD.

From the Pharos perspective, the main goal is to avoid on-the-fly computation as far as possible. Thus we anticipate that vetted 3PDA will be used to precompute results for individual targets and/or diseases; these results will be stored in TCRD and then made available via Pharos. This turns computation into look-up, ensuring a responsive interface irrespective of the performance characteristics of the underlying algorithms.
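
The precompute-then-look-up pattern described above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for TCRD (which is MySQL), and `slow_prediction`, the `prediction` table and its columns are hypothetical names, not part of TCRD or Pharos.

```python
import sqlite3


def slow_prediction(target_symbol):
    """Hypothetical stand-in for an expensive 3PDA model call."""
    return len(target_symbol) / 10.0


def precompute(connection, target_symbols):
    """Run the model once per target and store the results for later look-up."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS prediction (sym TEXT PRIMARY KEY, score REAL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO prediction (sym, score) VALUES (?, ?)",
        [(sym, slow_prediction(sym)) for sym in target_symbols],
    )
    connection.commit()


def lookup(connection, target_symbol):
    """Serve a stored result; no model code runs at request time."""
    row = connection.execute(
        "SELECT score FROM prediction WHERE sym = ?", (target_symbol,)
    ).fetchone()
    return None if row is None else row[0]


connection = sqlite3.connect(":memory:")
precompute(connection, ["DRD2", "ABL1"])
```

Here `lookup(connection, "DRD2")` is a single indexed read, so its latency does not depend on how slow the model behind `slow_prediction` is.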

How are decisions made on the datasets actually held in TCRD?

The initial selection of TCRD datasets was focused on capturing a diverse set of data types about genes, proteins and small molecules. These were collected and processed from numerous resources [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210555/].

In addition to mRNA and protein expression data, disease and phenotype associations, 3D-structural information and pathways, TCRD captured ChEMBL and DrugCentral bioactivity data, drug-target interactions and post-processed information about the functions of genes and proteins from 66 resources. An additional 114 experimental datasets compressed into the Harmonizome [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4930834/] were also captured. Text-mined bibliometric associations and statistics from biomedical and patent literature were used to develop the “Target Development Levels” [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6339563/], which continue to be used today.

In addition to updating and upgrading sets of data and information from prior releases, the current TCRD/Pharos release [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778974/] includes human-virus and human-human protein-protein interactions; mouse and rat phenotype information; cancer cell line (CCLE) and network-based cellular signatures (LINCS); and Guide-to-Pharmacology and IDG-Consortium generated data [https://druggablegenome.net/IDGResourceTables]. See our “About” page for further information [https://pharos.nih.gov/about].

Are you able to incorporate visualizations or data from other sources?

We are open to suggestions on incorporating new types of visualizations and are continuously looking for information that is complementary to our current aggregated set. Preference will be given to datasets that are well-curated, from well-maintained aggregated resources. If you would like to suggest a specific dataset or aggregator for inclusion / incorporation into TCRD/Pharos, please contact us at Pharos@mail.nih.gov.

How can I access Pharos/TCRD/Harmonizome data?

Because TCRD is continuously updated, static datasets are likely to render 3PDA models obsolete within a relatively short period (2-6 months), which may require additional effort from 3PDA developers to maintain datasets and update models. Depending on the nature of the computation, users can either download the TCRD dataset and access it via a local MySQL installation, or query it via the Pharos GraphQL API. Descriptions of how the original data sources are transformed for TCRD are currently documented here: http://juniper.health.unm.edu/tcrd/download/

What are the underlying technologies in Pharos & TCRD?

Pharos is split into a frontend (UI) codebase and a backend (database) codebase. The frontend is written in Angular (currently version 11, and kept up to date). Additional tools include Google Firebase for account management, the Apollo GraphQL client, the Angular Material design spec, and D3.js for visualizations.

The backend is written in Node.js, and provides a GraphQL interface to allow more fine-grained data access than a traditional API. This is especially useful to retrieve nested data, such as TDL for all protein-protein interaction targets for a given target. The backend is coupled to a MySQL database that stores the TCRD.
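
To illustrate the kind of nested retrieval mentioned above, here is a minimal Python sketch that builds a single GraphQL query for the TDL of every protein-protein interaction partner of a target. The field names `ppis`, `tdl` and `sym`, as well as the endpoint URL, are assumptions for illustration; please confirm them against the schema and sample queries on the Pharos API page before relying on them.

```python
import json
import urllib.request

# Assumption: endpoint URL, not stated in this FAQ; check the Pharos API page.
PHAROS_GRAPHQL_URL = "https://pharos-api.ncats.io/graphql"


def build_ppi_tdl_query(symbol):
    """Build one nested query: a target, its PPI partners, and each partner's TDL."""
    # json.dumps quotes and escapes the symbol for use as a string literal.
    return (
        "{ target(q: {sym: %s}) { sym tdl ppis { target { sym tdl } } } }"
        % json.dumps(symbol)
    )


def post_graphql(query, url=PHAROS_GRAPHQL_URL):
    """POST a query as JSON and return the decoded JSON response."""
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_ppi_tdl_query("DRD2")
```

Calling `post_graphql(query)` would send the request; it is left uncalled here so the sketch runs without network access.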

How can I find out more about the data shown in Pharos?

Please consult our “About” page [https://pharos.nih.gov/about] for details.

How can I cite Pharos?

Sheils, T., Mathias, S. et al, "TCRD and Pharos 2021: mining the human proteome for disease biology", Nucl. Acids Res., 2021. DOI: 10.1093/nar/gkaa993

How can I contact the Pharos team?

For bug reports and general comments, feel free to use the Feedback form at the top right, or contact us by email at pharos@mail.nih.gov.

Using Pharos
How do I search for targets that are differentially expressed in one or more diseases?

Starting from the disease details page, click “Explore Associated Targets.” This takes you to a target list for all targets found to be associated with the disease in question. If expression data exist, you will see a histogram facet for the target list labeled “Expression Atlas Log2 Fold Change.” You can filter the target list based on the value for the Log2 Fold Change by changing the bounds of the slider and clicking “Apply.” You can also sort the target list by the Log2 Fold Change from the sorting dropdown near the list of targets. Note the icon that displays the sort order (ascending or descending), since proteins that are upregulated or downregulated may both be of interest.

How do I identify diseases in which a target is differentially expressed?

Navigate to the Disease Associations By Source panel on a target detail page. Disease information from Expression Atlas will show a log2 fold change indicating the degree of differential expression between the disease and non-disease states.

How do I see filters other than the ones shown by default on the Target List page?

There are more than 50 filters available. To see the full list, go to the Target List page and click “See All Categories.”

Are accounts or logins required to access Pharos?

Logins are not required to access and download data. Signing in allows users to save custom lists of targets for future analysis. Authentication is handled using a social sign-on feature, which means Pharos does not retain any login information.

When I search for 'abl kinase', why do I get non-kinase targets?

The results for this query will include non-kinase targets because Pharos indexes many pieces of information associated with a target, including publications, drug labels, GeneRIFs and so on. Thus, if a publication refers to ABL1 but also discusses a nuclear hormone receptor (NHR), then the phrase 'abl kinase' is also associated with the NHR target.

In this scenario, consider using the facets (in particular the Target Family facet) to drill down to the appropriate class.

Alternatively, if you know the gene symbol or UniProt ID, you can navigate directly to the Target Details page by selecting the appropriate entry from the autocomplete options (e.g. ABL1 (Target)).

Please consult these papers for suggestions and background information.

Timothy K Sheils, Stephen L Mathias, Keith J Kelleher, Vishal B Siramshetty, Dac-Trung Nguyen, Cristian G Bologa, Lars Juhl Jensen, Dušica Vidović, Amar Koleti, Stephan C Schürer, Anna Waller, Jeremy J Yang, Jayme Holmes, Giovanni Bocci, Noel Southall, Poorva Dharkar, Ewy Mathé, Anton Simeonov, Tudor I Oprea. TCRD and Pharos 2021: mining the human proteome for disease biology, Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D1334–D1346, https://doi.org/10.1093/nar/gkaa993

Timothy Sheils, Stephen L. Mathias, Vishal B. Siramshetty, Giovanni Bocci, Cristian G. Bologa, Jeremy J. Yang, Anna Waller, Noel Southall, Dac‐Trung Nguyen, Tudor I. Oprea. How to Illuminate the Druggable Genome Using Pharos. Current Protocols in Bioinformatics, Volume 69, Issue 1, March 2020, e92, https://doi.org/10.1002/cpbi.92

Pharos API
Given a UniProt accession for a target, how can I access drug information associated with it?

Using the GraphQL API, drug information, like any other information in Pharos, can be retrieved as a JSON object. A sample query for downloading drug information, given a gene symbol such as DRD2, would likely start like this:

{
  target(q: {sym: "DRD2"}) {
    ligandCounts {
      name
      value
    }
    drugs: ligands(isdrug: true) {
      name
      isdrug
      activities {
        type
        value
        moa
      }
    }
  }
}
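
Once the JSON response for a query like this has been retrieved, it can be flattened into rows for further analysis. The sketch below assumes the standard GraphQL response envelope (a top-level `data` key) and the field names used in the sample query. The `sample_response` values are placeholders for illustration, not real Pharos data.

```python
def drug_activity_rows(response):
    """Flatten the 'drugs' list of a target response into (drug, type, value, moa) rows."""
    target = (response.get("data") or {}).get("target") or {}
    rows = []
    for drug in target.get("drugs") or []:
        for activity in drug.get("activities") or []:
            rows.append(
                (
                    drug.get("name"),
                    activity.get("type"),
                    activity.get("value"),
                    activity.get("moa"),
                )
            )
    return rows


# Placeholder response in the assumed shape; the values are invented.
sample_response = {
    "data": {
        "target": {
            "ligandCounts": [],
            "drugs": [
                {
                    "name": "example-drug",
                    "isdrug": True,
                    "activities": [{"type": "Ki", "value": 8.5, "moa": 1}],
                }
            ],
        }
    }
}
rows = drug_activity_rows(sample_response)
```

The function returns an empty list when the target is missing or has no drugs, so callers do not need to special-case unknown symbols.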

An alternative approach is to explore the list of Ligands associated with the target through Pharos. To do this, navigate to the target details page, and click “Explore Approved Drugs” or “Explore Active Ligands,” to generate a Ligand List of those associated compounds. From there, histogram facets will provide an overview of activity values for the compounds and allow further refinement by those activity values, or by other facet values.

A shortcut to generate a Ligand List like this is to type the gene symbol into the search bar, and select the autocomplete entry for “{{symbol}} (Associated Ligands).”

What else can I do with the Pharos API?
Our API page provides many sample queries to help you start fetching the data you want.
Visualizations
What do the numbers on the illumination graph mean?

These radial plots summarize the level of accumulated knowledge about each target. The further a point is from the center of the radial plot, the more knowledge exists about the target. Hovering the mouse over a label presents the list of associated resources, with links, on the left. To compare the knowledge for the target with the knowledge for its target family, select 'overlay another dataset'. The normalized knowledge for the target family will be shown in orange.

To construct these plots we:
  1. Begin with a set of gene-attribute associations. The attributes may be pathways, GO terms, phenotypes, diseases, drugs, tissues, proteins, etc., depending on the dataset. Some datasets require a few preprocessing steps to get here.
  2. Count the number of associations for each gene.
  3. Normalize the counts by calculating the empirical cumulative probability of the count for each gene, which is equal to the fraction of genes with a count less than or equal to the count for that gene.

The normalized counts/CDF values/empirical cumulative probabilities indicate the relative amount of knowledge about a gene compared to other genes in a given dataset. Genes with relatively high numbers of associations get assigned values near 1 and genes with relatively low numbers of associations get assigned values near 0.
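
The counting and normalization steps above can be sketched in a few lines of Python. This is a minimal illustration of the empirical cumulative probability as described, not the code Pharos itself runs; the gene names and counts are made up.

```python
from bisect import bisect_right


def normalized_knowledge(association_counts):
    """Map each gene to the fraction of genes whose count is <= its own count."""
    sorted_counts = sorted(association_counts.values())
    total = len(sorted_counts)
    return {
        # bisect_right gives the number of sorted counts that are <= count
        gene: bisect_right(sorted_counts, count) / total
        for gene, count in association_counts.items()
    }


# Hypothetical association counts for four genes in one dataset.
counts = {"GENE_A": 0, "GENE_B": 3, "GENE_C": 3, "GENE_D": 10}
scores = normalized_knowledge(counts)
# scores == {"GENE_A": 0.25, "GENE_B": 0.75, "GENE_C": 0.75, "GENE_D": 1.0}
```

Genes with tied counts receive the same value, and the gene with the most associations always receives 1.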