TOOLS: MetaMap

MedPost/SKR Part of Speech Tagger

Overview

The MedPost/SKR POS Tagger is a Java implementation of the MedPost/SKR Part of Speech Tagger for BioMedical Text.

The MedPost Tagger was originally developed by Larry Smith, Tom Rindflesch, and W. John Wilbur of the National Center for Biotechnology Information (NCBI) [Smith, Wilbur] and the Lister Hill National Center for Biomedical Communications (LHNCBC) [Rindflesch]. MedPost is currently written in a combination of C++ and Perl. It is described in the following paper: MedPost: A Part of Speech Tagger for BioMedical Text. Smith et al. Bioinformatics 2004;0:2271-0.

The MedPost/SKR Tagger is a Java-based implementation of the MedPost Tagger specifically formulated for the Semantic Knowledge Representation (SKR) work. MedPost/SKR has modified functionality and only produces SPECIALIST lexicon tags. The base algorithms are consistent between MedPost and MedPost/SKR.

MedPost is a stochastic part of speech tagger that employs a hidden Markov model (HMM) to combine contextual information with lexical information and so improve on baseline tagging accuracy. MedPost breaks the original text into sentences, tokenizes each sentence, and then tags the tokens. A static table of bigrams derived during the initial training phase is used to estimate the transition probabilities. The output probabilities of the HMM are determined for words in the lexicon by assuming equal probability for each of a word's possible tags. Output probabilities for unknown words are based on word orthography (e.g., upper or lower case, numerics, etc.) and on word endings up to 4 letters long. The Viterbi algorithm is used to find the most likely tag sequence in the HMM matching the tokens.
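The decoding step described above can be illustrated with a minimal Viterbi sketch over a bigram HMM. This is not the MedPost code: the tag set, the START and TRANS numbers, and the toy LEXICON are hypothetical values chosen for illustration, and unknown words are simply allowed any tag instead of being scored by orthography and word endings. Only the overall shape (bigram transitions, equal output probability over a word's lexicon tags, Viterbi search) follows the description.

```python
import math

TAGS = ["det", "aux", "noun", "verb"]

# Hypothetical toy probabilities, NOT values from MedPost's trained bigram table.
START = {"det": 0.5, "aux": 0.1, "noun": 0.3, "verb": 0.1}
TRANS = {
    "det":  {"det": 0.02, "aux": 0.03, "noun": 0.90, "verb": 0.05},
    "aux":  {"det": 0.50, "aux": 0.05, "noun": 0.25, "verb": 0.20},
    "noun": {"det": 0.10, "aux": 0.40, "noun": 0.20, "verb": 0.30},
    "verb": {"det": 0.50, "aux": 0.05, "noun": 0.40, "verb": 0.05},
}
# Hypothetical toy lexicon: each known word lists its possible tags.
LEXICON = {"this": ["det"], "is": ["aux"], "a": ["det"], "test": ["noun", "verb"]}


def emission(word, tag):
    """Equal probability over the tags the lexicon allows for a word.

    Simplification: an unknown word may take any tag with equal probability.
    Disallowed tags get a tiny floor so that log() stays defined.
    """
    allowed = LEXICON.get(word.lower(), TAGS)
    return 1.0 / len(allowed) if tag in allowed else 1e-12


def viterbi(tokens):
    """Return the most likely tag sequence for a non-empty token list."""
    # scores[t] = best log-probability of any tag path ending in tag t.
    scores = {t: math.log(START[t]) + math.log(emission(tokens[0], t)) for t in TAGS}
    backpointers = []
    for word in tokens[1:]:
        new_scores, pointers = {}, {}
        for t in TAGS:
            # Best previous tag for reaching tag t at this position.
            prev = max(TAGS, key=lambda p: scores[p] + math.log(TRANS[p][t]))
            new_scores[t] = (scores[prev] + math.log(TRANS[prev][t])
                             + math.log(emission(word, t)))
            pointers[t] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path.
    tag = max(TAGS, key=lambda t: scores[t])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["This", "is", "a", "test"]))
```

Working in log space avoids numeric underflow on long sentences, which is the usual reason to sum log-probabilities rather than multiply raw probabilities.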

MedPost was trained specifically for tagging biological text by using MEDLINE abstracts as the training corpus.

Input is any free-format text.

  1. Standalone - accepts an input file and an output file. In standalone mode, the tagger breaks the text into processing units at blank lines. Multiple units can be included in the input file.
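To make the blank-line rule concrete, the following sketch splits an input text into processing units. It is an illustration only: the function name split_units is hypothetical, and treating whitespace-only lines as blank is an assumption, since the text above does not say exactly how the tagger defines a blank line.

```python
import re


def split_units(text):
    """Split text into processing units at blank lines.

    Assumption: a line that is empty or contains only whitespace counts
    as blank, and several blank lines in a row act as one separator.
    """
    units = re.split(r"\n\s*\n", text.strip())
    return [u.strip() for u in units if u.strip()]


sample = "First unit, line one.\nFirst unit, line two.\n\nSecond unit.\n   \nThird unit.\n"
for number, unit in enumerate(split_units(sample), start=1):
    print(number, repr(unit))
```

A file prepared this way can hold several independent texts, each separated from the next by one blank line.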

The output from all operation modes is a Prolog formatted string where the text and associated tags are included in a list (see example below).

Example

Input:
This is a test.
	  

Result:

[
  ['This', 'det'],
  ['is', 'aux'],
  ['a', 'det'],
  ['test', 'noun'],
  ['.', 'pd']
].
^THE_END^
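For downstream processing, the Prolog-formatted result can be read back into token/tag pairs. The sketch below is one possible way to do this, based only on the example output above. The function name parse_tagger_output is hypothetical, and the pattern assumes that neither tokens nor tags contain an embedded single quote; how the tagger escapes such tokens is not covered here.

```python
import re

# Matches one ['token', 'tag'] pair.
# Assumption: no single quote appears inside a token or a tag.
PAIR = re.compile(r"\[\s*'([^']*)'\s*,\s*'([^']*)'\s*\]")


def parse_tagger_output(output):
    """Return (token, tag) tuples found before the ^THE_END^ marker."""
    body = output.split("^THE_END^", 1)[0]
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(body)]


example = """[
  ['This', 'det'],
  ['is', 'aux'],
  ['a', 'det'],
  ['test', 'noun'],
  ['.', 'pd']
].
^THE_END^
"""
for token, tag in parse_tagger_output(example):
    print(token, tag)
```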
	  

Prerequisites

  1. The MedPost/SKR Tagger is protected under the MetaMap Terms and Conditions. Please review prior to downloading the MedPost/SKR Tagger package.
  2. The MedPost/SKR Tagger has been tested using Ant 1.6.5.
  3. The MedPost/SKR Tagger has been tested using Java 1.6.0_24.
  4. README files explain how to compile and run the tagger.

Installation

Simply untar or unjar the distribution file. Under Windows, you can use WinZip or similar products to uncompress the distribution file.
% gunzip -c MedPost-SKR_Public.tar.gz | tar -xvf -
	
That's it for installation. If you want to test the installation, run the following commands:

NOTE: Before running, update run.sh (Unix/Linux) or run.bat (Windows), changing the top two lines to correspond to locations in your installation. Then run only the script that matches your platform.

% cd MedPost-SKR_Public/Sample 
% ./run.sh sample.txt sampleTest.out
% .\run.bat sample.txt sampleTest.out
	
The generated sampleTest.out file and the provided sample.txt.out file should be identical.
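The comparison can also be scripted. The sketch below uses Python's standard filecmp module to compare the two files byte for byte; the helper name outputs_match is hypothetical, and the paths assume the script is run from the MedPost-SKR_Public/Sample directory used in the commands above.

```python
import filecmp
import os


def outputs_match(generated, expected):
    """Return True when both files have identical contents.

    shallow=False forces a comparison of file contents rather than
    only of os.stat() signatures.
    """
    return filecmp.cmp(generated, expected, shallow=False)


if __name__ == "__main__":
    # Assumed file names, taken from the test commands above.
    generated, expected = "sampleTest.out", "sample.txt.out"
    if os.path.exists(generated) and os.path.exists(expected):
        print("identical" if outputs_match(generated, expected) else "different")
    else:
        print("run the sample test first")
```

Because the comparison is byte for byte, a difference in line endings between platforms would be reported as a mismatch even when the tags agree.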

Download

Important Note: MetaMap 2020 requires the UTF-8 version of the MedPost/SKR Tagger. (Gzip Tar - 470 KB)