Background


The goal of the “SAPIENT Automation” (SAPIENTA) project is to assess how useful it is to annotate Core Scientific Concepts (CoreSC), e.g. ‘Goal’, ‘Experiment’, ‘Method’, ‘Result’, ‘Conclusion’, in scientific papers. We therefore evaluated the outcomes of the JISC-funded project ART (completed March 2009) to assess the added benefit of annotating CoreSCs in papers. The ART project produced a corpus of 265 papers (> 1 million words) from physical chemistry and biochemistry annotated with such concepts, as well as a web annotation tool, SAPIENT, which allows experts to annotate the papers manually. We have automated the annotation of CoreSC concepts and have delivered the SAPIENTA tool for this purpose, training and testing it on the ART corpus. We have also used the automatically annotated CoreSCs to create automatic summaries, which were evaluated by chemistry experts.

 



The CoreSC annotation scheme

One of the main objectives of the ART project was to create a tool that would enable manual annotation of scientific papers with semantic information about the key components of a paper describing a scientific investigation. To this end, a tool, SAPIENT, was created, as well as a formalism representing the Core Information about Scientific Papers (CISP). CISP defines key generic scientific concepts and their properties, including the following: ‘Goal of investigation’, ‘Motivation’, ‘Object of investigation’, ‘Research Method’, ‘Experiment’, ‘Result’, ‘Observation’, ‘Conclusion’. The CoreSC scheme implements these concepts, together with ‘Hypothesis’, ‘Model’ and ‘Background’, as a sentence-based annotation scheme with three annotation layers.
The first layer pertains to the previously mentioned 11 categories, the second layer is for the annotation of properties of the concepts (e.g. “New”, “Old”) and the third layer caters for identifiers (conceptID), which link together instances of the same concept, e.g. all the sentences pertaining to the same method will be linked together with the same conceptID (e.g. “Met1”).
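The three layers can be pictured as one record per sentence. The following is a minimal sketch only, not the SAPIENT data model: the class name, field names, short category labels and the helper `group_by_concept` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Layer 1: the 11 categories named in the text.
# The short labels used here are an assumption for illustration.
CORESC_CATEGORIES = {
    "Hypothesis", "Motivation", "Goal", "Object", "Background", "Method",
    "Experiment", "Model", "Observation", "Result", "Conclusion",
}


@dataclass(frozen=True)
class SentenceAnnotation:
    """One annotated sentence (hypothetical structure, not SAPIENT's own)."""
    sentence_id: int
    text: str
    category: str                      # layer 1: CoreSC category
    prop: Optional[str] = None         # layer 2: property, e.g. "New" or "Old"
    concept_id: Optional[str] = None   # layer 3: identifier, e.g. "Met1"

    def __post_init__(self):
        # Reject labels outside the first-layer category set.
        if self.category not in CORESC_CATEGORIES:
            raise ValueError(f"unknown CoreSC category: {self.category}")


def group_by_concept(annotations):
    """Collect the sentence ids that share a conceptID (layer 3)."""
    groups = defaultdict(list)
    for ann in annotations:
        if ann.concept_id is not None:
            groups[ann.concept_id].append(ann.sentence_id)
    return dict(groups)
```

Under these assumptions, two sentences describing the same method would both carry the conceptID “Met1” and would end up in the same group.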
The CoreSC scheme is described in a set of 45-page guidelines, which explain the three annotation layers and contain a detailed description of the semantics of the categories, comprehensive examples, the category hierarchy and 6 rules for conflict resolution during annotation. A publication with more details about the scheme and its relation to the AZ-II annotation scheme is available here.
The annotation guidelines were used in conjunction with the SAPIENT tool to annotate 265 papers manually for the ART corpus.

A later version of the guidelines (2011), adapted for multiple annotations per sentence and used to annotate 50 biology papers, can be obtained by emailing liakata-At-ebi-dot-ac-dot-uk.

 

SAPIENT, SAPIENTA and the ART Corpus

SAPIENT, developed within the ART project, is an annotation tool implemented as a web application, which enables experts to annotate scientific papers manually, sentence by sentence.

It also incorporates some of the functionality of OSCAR3 (see Batchelor, C., Corbett, P. (2007) Semantic Enrichment of Journal Articles Using Chemical Named Entity Recognition. Proc. ACL), which allows the automated annotation of chemical named entities. SAPIENT has been designed primarily to work with CoreSC concepts, but it can be used to add value to repository papers and data according to any sentence-based annotation scheme.

SAPIENT allows the annotation of a paper with CoreSC concepts (‘Goal’, ‘Result’, etc.). Within the ART project, SAPIENT was used by 16 experts, who applied the CoreSC annotation scheme to a set of 265 papers from RSC Publishing journals. This resulted in the creation of the ART/CoreSC Corpus. Evaluation on 41 papers, each annotated by at least three different experts, showed significant agreement between annotators, underlining the usability of both the CoreSC scheme and the SAPIENT tool.

In SAPIENTA we further evaluated the CoreSC scheme and the ART corpus by incorporating machine learning algorithms into SAPIENT and automating the generation of core scientific concepts. SAPIENTA has been trained and tested on the ART corpus and has also been employed to annotate biology papers from PubMed Central. SAPIENTA can be used for both manual and automatic annotation. Manual annotation has been extended to allow multiple annotations per sentence and has been used to create a corpus of 50 biology papers annotated with CoreSCs. Automatic annotation assigns one concept per sentence.

 

Named Entity Annotation in SAPIENT and SAPIENTA

OSCAR3 is open source software for chemical named entity recognition (NER). It takes chemistry papers in SciXML format as input (SciXML is an XML schema for representing the logical structure of scientific papers, which also retains as much of the publishers’ original formatting as possible) and annotates the chemical entities recognized in the text (e.g. compounds, reactions, enzymes) using terms from OBO ontologies and other formalisms such as InChI. As part of the project Prospect, the system has already been applied to the RSC Publishing workflow to produce semantically enriched articles, enabling increased readability and allowing cross-linking with other articles. SAPIENT and SAPIENTA are built as independent extensions of OSCAR3 and allow semantic annotation of chemical entities. We plan to incorporate OSCAR4 soon.

 

Extractive summaries

One of the outputs of the project is the use of automatically annotated CoreSCs to create automatic summaries. The idea of extractive summaries for papers is not new, but the desirability of such summaries for biomedical papers has recently been highlighted and little work has been done in this area. We have made use of the distribution of CoreSC categories in abstracts to create automatic summaries which reflect the content of the paper and the cohesion of abstracts. The automatically generated summaries have been evaluated in a question answering task by chemistry experts. A publication is pending, but more information can be obtained by emailing liakata-At-ebi-dot-ac-dot-uk.
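As the publication is pending, the actual summarisation method is not described here. The following is only an illustrative sketch of how a category distribution could drive sentence selection. It assumes that each sentence carries a CoreSC label and a classifier confidence, and that the proportions of categories observed in abstracts are given. The function names, the tuple layout and the largest-remainder rounding are all assumptions, not the project's implementation.

```python
def allocate_slots(category_distribution, summary_length):
    """Turn category proportions into whole-sentence quotas.

    Uses largest-remainder rounding so the quotas sum to summary_length.
    """
    total = sum(category_distribution.values())
    raw = {c: summary_length * p / total
           for c, p in category_distribution.items()}
    quotas = {c: int(v) for c, v in raw.items()}
    leftover = summary_length - sum(quotas.values())
    # Hand the remaining slots to the largest fractional parts.
    by_remainder = sorted(raw, key=lambda c: raw[c] - quotas[c], reverse=True)
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas


def extract_summary(sentences, category_distribution, summary_length):
    """Pick sentences per category quota, then restore document order.

    sentences: list of (position, category, confidence, text) tuples.
    """
    quotas = allocate_slots(category_distribution, summary_length)
    chosen = []
    for category, quota in quotas.items():
        candidates = [s for s in sentences if s[1] == category]
        # Prefer the sentences the classifier is most confident about.
        candidates.sort(key=lambda s: s[2], reverse=True)
        chosen.extend(candidates[:quota])
    chosen.sort(key=lambda s: s[0])
    return [s[3] for s in chosen]
```

In this sketch, categories that are absent from the given distribution contribute no sentences, and the selected sentences are returned in their original order so that the summary follows the flow of the paper.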