Project Partners

Dr Maria Liakata

Principal investigator for the SAPIENTA project.
Maria is Assistant Professor at the University of Warwick. Previously she held an Early Career Fellowship from the Leverhulme Trust (2010-2013) and had a joint affiliation with the Department of Computer Science at Aberystwyth University, UK, and the text mining group at the European Bioinformatics Institute (EMBL-EBI) in Cambridge, where she was hosted for the duration of her fellowship. Her research interests include computational linguistics, biomedical text mining, knowledge discovery, and machine learning applications for natural language processing.

Dr Colin Batchelor

Senior Informatics Analyst, Royal Society of Chemistry, Thomas Graham House, Cambridge, UK CB4 0WF.
Role on project: knowledge expert in chemistry and publishing, summary evaluation.

Dr Simone Teufel

Senior Lecturer, University of Cambridge, Computer Laboratory, Natural Language and Information Processing Group.
Role on project: advisor in natural language processing and especially in argumentative zoning, annotation schemes and text summarisation.

Prof. Sophia Ananiadou

Director of the National Centre for Text Mining (NaCTeM), Professor in text mining, University of Manchester.
Role on project: Advisor on text mining and biolexical resources, provision of annotated data.

Dr Amanda Clare

Lecturer in Computer Science, Department of Computer Science, Aberystwyth University.
Role on project: Advisor on semantic web technologies and machine learning.

Miss Shyamasree Saha

Software engineer, Text mining group, EMBL-EBI
Role on project: Software Engineer

Dr Simon Dobnik

Postdoctoral Research Fellow in Language Technology
Dialogue Technology Lab, Centre for Language Technology and Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg.
Role on project: Collaboration in automatic summary generation and evaluation

Dr Colin Sauze

Postdoctoral Research Associate, Department of Computer Science, Aberystwyth University.
Role on project: Collaboration involving the creation of a semantic wiki that links to SAPIENT, and implementation of web-service components of SAPIENT.

Other Links

ART Project

The project that produced the SAPIENT tool for annotation of general scientific papers.

Easily browsable ART corpus

A site for browsing papers in the ART corpus hosted at UKOLN.
The site contains the corpus description, and the pages can also be downloaded from here.

The ART Corpus

As part of the ART project 265 chemistry papers were manually annotated using core scientific concepts. The resultant corpus contains over 1 million words or 40,000 sentences. For further information and downloading the corpus visit:

Please reference the corpus as:
Liakata Maria and Soldatova Larisa. 2009. The ART corpus. Technical report, Aberystwyth University.

All 265 papers (225 + 41 from phase II of corpus development) can be obtained by contacting

Multi-CoreSC CRA corpus (MCCRA)

As part of the SAPIENTA project, 50 papers from the domain of Cancer Risk Assessment (CRA) were manually annotated by three biology experts, allowing multiple core scientific concepts per sentence. The corpus and its evaluation are described in our LREC 2016 paper:
Multi-label Annotation in Scientific Articles – The Multi-label Cancer Risk Assessment Corpus

You can download the corpus from here.

Please reference the corpus as:
James Ravenscroft, Maria Liakata, Anika Oellrich, and Shyamasree Saha. Multi-label Annotation in Scientific Articles – The Multi-label Cancer Risk Assessment Corpus. Proceedings of LREC 2016.

Multi-CoreSC Annotation Guidelines

Here you can find the annotation guidelines used by experts to annotate publications with multiple Core Scientific Concepts (CoreSC) per sentence.