OA STM Corpus

A corpus, and small treebank, of Open Access journal articles from multiple disciplines in Science, Technology, and Medicine

Natural Language Processing (NLP) tools perform best when they are used on the same kind of content on which they were trained and tested. Unfortunately for those of us in the STM domains, our content differs substantially from the newswire text commonly used to develop most NLP tools. There are some corpora of STM content, but the ones we know of are specific to one domain, such as biomedicine, and typically consist of abstracts instead of full articles. This is less than optimal because math articles are very different from biomed articles, and full articles are very different from abstracts.

Corpus

To improve this situation, Elsevier is providing a selection of 110 journal articles from 10 different STM domains as a freely redistributable corpus. The articles were selected from our Open Access content and have a Creative Commons CC-BY license, so they are free to redistribute and use. The domains are agriculture, astronomy, biology, chemistry, computer science, earth science, engineering, materials science, math, and medicine. Currently we provide 11 articles in each of the 10 domains. For each article in the corpus we provide:

the XML source,
a simple text version for easier text mining,
several versions with different annotations. These include part-of-speech tags, sentence breaks, NP and VP chunks, lemmas, syntactic constituent parses, Wikipedia concept identification, and discourse analysis. (Some of this is still under construction.)
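
Because each article comes in several parallel versions, a first step for many users will be to pair those versions up by article. The following is a minimal sketch of that step, not a description of the actual repository layout: it assumes (hypothetically) that the versions of one article share a file name stem and differ only in their extension, here `.xml` for the source and `.txt` for the simple text version. The function names and the `suffixes` parameter are illustrative.

```python
from collections import defaultdict
from pathlib import Path


def group_by_article(root, suffixes=(".xml", ".txt")):
    """Map each article ID (assumed to be the file stem) to its files.

    ASSUMPTION: all versions of an article share one file stem and are
    told apart by extension. Adjust if the real layout differs.
    """
    articles = defaultdict(dict)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            articles[path.stem][path.suffix] = path
    return dict(articles)


def missing_versions(articles, suffixes=(".xml", ".txt")):
    """Return the article IDs that lack one or more expected file types."""
    return {
        article_id: [s for s in suffixes if s not in files]
        for article_id, files in articles.items()
        if any(s not in files for s in suffixes)
    }
```

Calling `group_by_article` on a local copy of the corpus and passing the result to `missing_versions` gives a quick completeness check, which may be useful while some annotation types are still under construction.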

Annotations and Test Sets

In addition to having a wide-ranging STM corpus, we also hope that the content becomes densely annotated with many different types of NLP analyses beyond those mentioned above. Not only would this allow comparison of algorithms for the same type of annotations, it would also allow for the automatic selection of features to be used in creating higher-order annotations.

Most of the annotations are automatically created. However, we have identified 10 documents as a default test set. As new annotation types are added, those articles should be the first choice for manually reviewed and corrected test data.

Treebank

To seed the process of manually creating test sets, Elsevier has commissioned a treebank of the 10 full-text articles in the default test set. We hope this corpus and treebank become a valuable resource for NLP, linguistics, and text mining researchers, developers, and users, as we all work towards tools that handle STM content better.

Context

Quantity	Content
~12M	All Elsevier journal articles
~600k	All free-to-read Elsevier journal articles. PDFs free to read.
~15k	Open Access articles with CC-BY license. PDFs free to read, redistribute, and use.
110	STM Corpus articles. PDFs and XML free to read, redistribute, and use.
10	Default test set of articles. Starting point for manual annotations, and the source for our treebank.

Future

A public prerelease was made on January 11, 2015, for the FORCE2015 Hackathon. We will make revisions based on feedback from the prerelease. The full release of the corpus and treebank will occur after all 10 articles are treebanked. We may include additional articles in the corpus beyond the initial 110, depending on feedback from the community.