The ART project produced SAPIENT, a tool for manual, sentence-based semantic annotation of papers. SAPIENT incorporates SSSplit, an XML-aware sentence splitter that was also created within the ART project. In the SAPIENT Automation (SAPIENTA) project we have released a new version of SAPIENT, called SAPIENTA, which automatically annotates core scientific concepts (CoreSC) at the sentence level and also permits multi-label manual annotation. SAPIENTA includes an improved version of SSSplit, which works with the PubMed Central DTD and with papers in SciXML, and can also be applied to plain text and other XML schemas. SSSplit can also be run from the command line to obtain sentence boundaries for a batch of papers in XML.

You can download the latest versions of SAPIENTA and SSSplit for non-commercial use below.

SAPIENT stands for “Semantic Annotation of Papers: Interface & ENrichment Tool”. It is an annotation interface, implemented as a web application, that helps users annotate scientific papers in XML, sentence by sentence, with a set of concepts called Core Scientific Concepts (CoreSCs; see the paper “Guidelines for the Annotation of General Scientific Concepts”, where CoreSCs appear under their earlier name, GSCs). CoreSCs constitute the set of concepts essential for describing a scientific investigation. However, SAPIENT can also be used with other annotation schemes to annotate papers in XML sentence by sentence. SAPIENT also incorporates Oscar3 functionality, allowing the automatic annotation of chemical named entities.

SAPIENTA stands for “Semantic Annotation of Papers: Interface & ENrichment Tool Automated”. It incorporates a machine learning classifier, trained using Conditional Random Fields (CRF), for identifying CoreSCs. The classifier has been evaluated on 265 chemistry and biochemistry papers, yielding more than 50% average accuracy across the 11 Core Scientific Concepts. The automatically generated concepts have been used to produce automatic summaries, which were evaluated in a question answering task by chemistry experts, yielding a precision of 75% and a recall of 66%. SAPIENTA also allows multi-label annotation at the sentence level and has been used by three biology experts to annotate 50 biology papers from PubMed Central that are relevant for Cancer Risk Assessment (CRA).

SAPIENT Sentence Splitter (SSSplit) is an XML-aware sentence splitter which preserves XML markup and identifies sentences through the addition of in-line markup. We developed our own sentence splitter because widely available sentence splitters could not handle XML properly. The XML markup, in the form of inline tags, carries useful information about document structure and formatting, which is important for determining the logical structure of the paper.

SSSplit is written in the platform-independent Java language (version 1.6), based on and extending open source Perl code for handling plain text. To make our sentence splitter XML-aware, we translated the Perl regular expression rules into Java and modified them to make them compatible with the SciXML and PubMed journal schemas.
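As a rough illustration of what "XML-aware" splitting involves, the following minimal Java sketch treats each XML tag as an opaque unit, so that punctuation inside tags never counts as a sentence boundary, and then wraps each sentence in in-line markup. It is not the SSSplit code: the class name, the method name, the `<s sid="...">` wrapper element and the single boundary rule (sentence-final punctuation followed by whitespace and an uppercase letter) are all assumptions made for this example, and the actual SSSplit rules are considerably richer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the SSSplit implementation.
final class XmlAwareSplitSketch {

    // Either a whole XML tag (matched first, so its contents are skipped),
    // or '.', '!' or '?' followed by whitespace and an uppercase letter.
    private static final Pattern TAG_OR_BOUNDARY =
            Pattern.compile("<[^>]+>|[.!?](?=\\s+[A-Z])");

    // Wraps each detected sentence in an assumed <s sid="n"> element,
    // leaving the original inline tags untouched.
    static String markSentences(String paragraph) {
        List<String> sentences = new ArrayList<String>();
        Matcher m = TAG_OR_BOUNDARY.matcher(paragraph);
        int start = 0;
        while (m.find()) {
            if (m.group().startsWith("<")) {
                continue; // a tag: never a sentence boundary
            }
            sentences.add(paragraph.substring(start, m.end()).trim());
            start = m.end();
        }
        String tail = paragraph.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail); // last sentence has no following uppercase letter
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < sentences.size(); i++) {
            out.append("<s sid=\"").append(i + 1).append("\">")
               .append(sentences.get(i)).append("</s>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "We grew <it>E. coli</it> at 37 C. Growth was measured "
                + "<xref ref=\"fig. A\">here</xref>. It was fast!";
        System.out.println(markSentences(input));
    }
}
```

In this sketch the full stop inside the `ref` attribute is ignored because the whole tag is consumed as one match, and "E. coli" is not split because the next word starts in lower case. The sketch makes no attempt to handle an inline element that spans a sentence boundary, or punctuation that sits immediately before a closing tag; a real splitter has to deal with both.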

For more details about SAPIENT and SSSplit you can also refer to our BioNLP 2009 paper. Please reference this paper if you find SAPIENT or SSSplit useful:

Liakata M., Q Claire and Soldatova L. N. (2009) Semantic Annotation of Papers: Interface and Enrichment Tool (SAPIENT). Proceedings of BioNLP 2009, Boulder, Colorado, pp 193–200

For SAPIENTA, publication is pending, but you can e-mail liakata-At-ebi-dot-ac-dot-uk to obtain more information about our manuscript “Automatic recognition of conceptualisation zones in scientific articles to aid biological information extraction”.

To download files click on the appropriate name below.

Software Downloads

Please check the description to the left for the download link(s).

SAPIENTA Web service

The latest version of the SAPIENTA web service allows you to upload papers and provides automatic CoreSC annotation. API coming soon!


Frequently asked questions about installing, running and troubleshooting the SAPIENTA software.
Click here for a general introduction to the SAPIENTA software.

Latest binary version of SAPIENTA

Click here to download SAPIENTA for Firefox 4 and later.
Click here to download SAPIENTA for Firefox 3.

Java 1.6 or later is required in both cases. For installation instructions click here.



Download this file to install SSSplit – an XML-aware sentence splitter. Requires Java 1.6 or later.
For details on the sentence splitter see our BioNLP 2009 paper.

Binary version for SAPIENT

Download this binary file if you want to run SAPIENT as a stand-alone process. Java 1.6 or later is needed.
For details on SAPIENT please refer to our BioNLP 2009 paper.

Installing SAPIENT from source

Additional instructions for compiling and running SAPIENT from source code.

SAPIENT source files

This is the .tar file for compiling SAPIENT (the manual annotation tool without automation) from source.