Named entity recognition (NER)

The recognition of protein and gene names in text builds upon the text-mining pipeline of the STRING database (Franceschini et al., 2013). This relies heavily on a highly efficient and flexible NER engine, which is implemented in C++ and is described in full detail in Pafilis et al. (2013). Most recently we have updated the dictionaries and underlying human genome annotation as part of releasing STRING version 10 (Szklarczyk et al., 2015). The dictionary merges synonym information from multiple sources, including the Ensembl (Cunningham et al., 2015) and UniProt (UniProt Consortium, 2015) databases.

Text corpora

To construct a literature corpus, we first downloaded all articles from the PubMed Central (PMC) Open Access Subset and converted them to the format required by the NER software. We next queried PubMed for additional articles for which full-text is freely available online and automatically downloaded these in PDF format from the publisher’s website when possible. We used AbiWord to convert these to HTML format and subsequently converted this to the format required by the NER software. Because the text is extracted from PDF files, the text is subject to formatting artefacts. The resulting combined corpus consists of approximately 2 million full-text articles. Despite this effort, we presently do not have access to mine the full-text articles for the majority of Medline entries. We thus extend the literature corpus with these abstracts, which we extracted from a local copy of Medline and converted to the same format as the full-text articles.

We complement the literature corpus with a second corpus of USPTO patent texts, which we downloaded from Google Books. As the file format of these have changed over the years, we developed several parsers to convert them all into the same unified format as the literature corpus. Most notably, all patents prior to 2002 were scanned and converted to plain text through optical character recognition (OCR); these are thus also subject to OCR errors.

Fractional counting

A document can mention multiple proteins without pertaining equally much to all of them. To address this, we use a fractional counting scheme in which each paper that mentions at least one protein contributes a total count of 1, which is distributed across the mentioned proteins relative to how many times each of them was mentioned. The total count for protein i is thus:

$$N_i = \sum_{j \in D} \frac{n_{ij}}{n_j}$$

where \(D\) is the document set, \(n_{ij}\) is the number of times protein \(i\) is mentioned in document \(j\), \(n_j\) is total number of mentions of any protein in document \(j\).

References

Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Girón CG, Gordon L, Hourlier T, Hunt SE, Janacek SH, Johnson N, Juettemann T, Kähäri AK, Keenan S, Martin FJ, Maurel T, McLaren W, Murphy DN, Nag R, Overduin B, Parker A, Patricio M, Perry E, Pignatelli M, Riat HS, Sheppard D, Taylor K, Thormann A, Vullo A, Wilder SP, Zadissa A, Aken BL, Birney E, Harrow J, Kinsella R, Muffato M, Ruffier M, Searle SM, Spudich G, Trevanion SJ, Yates A, Zerbino DR, Flicek P (2015). Ensembl 2015. Nucleic Acids Research, 43:D662–D669

Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, Jensen LJ (2013). STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Research, 41:D808–D815

Pafilis E, Frankild SP, Fanini L, Faulwetter S, Pavloudi C, Vasileiadou A, Arvanitidis C, Jensen LJ (2013). The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLOS ONE, 8:e65390

Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerte-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP, Kuhn M, Bork P, Jensen LJ, von Mering C (2015). STRING v10: protein–protein interaction networks, integrated over the tree of life . Nucleic Acids Research, 43:D447–D452.

UniProt Consortium (2015). UniProt: a hub for protein information. Nucleic Acids Research, 43:D204–D212