MetaMap 2008_v2 Release Notes

Introduction

This document outlines a number of substantial changes made to produce MetaMap 2008_v2, the most important of which are
  1. Substantially improved performance,
  2. Changes in XML generation,
  3. A new sentence-breaking algorithm,
  4. A new phrase-breaking algorithm,
  5. Changes in MetaMap Machine Output (MMO),
  6. Initial implementation of Negex,
  7. Changes in the allowed form of PMIDs,
  8. No variant generation for short words, and
  9. Suppression of header information.
Other less visible changes, which are mentioned here but not described further, are bug fixes improving communication with the tagger server, an enhancement to the linguistic underpinnings of the parser, and fixes for bugs that caused incorrect positional information to be generated if MetaMap was run with the ignore_word_order option or if the input text began with an acronym or abbreviation.

Substantially Improved Performance

We have considerably sped up MetaMap computations by storing intermediate mapping results in an AVL tree rather than a linear list. Although this enhancement will not always be noticeable, it yields as much as a 50-fold speedup on certain especially troublesome citations containing many instances of the word human and of other words with many associated concepts.

We also fixed a long-standing bug in filehandle management that had caused filehandles to not be closed. This bug fix now allows MetaMap to run on much larger input files than before without exceeding operating-system limits on the number of open filehandles.

XML Generation

The version of XML generation in the initial release of MetaMap08, first described in the original MetaMap08 Release Notes, did not work properly. It has been fixed in this release, which correctly generates both formatted and unformatted XML output.

Another substantial change to the XML generation is the one-to-one mapping of input citations (not of input files) and XML documents: The original release of MetaMap08 was intended to generate an XML document for each input file, even if the input file contained multiple citations. By contrast, MetaMap08_v2 generates an XML document for each citation in an input file. Consequently, if an input file contains multiple citations, the XML output generated from that file will contain several XML documents separated by a blank line.

For example, consider the following input file:


Heart attack.

Lung cancer.

Because Heart attack. and Lung cancer. are separated by a blank line, they are considered separate citations, just as if multiple Medline citations had been downloaded to a single input file. The XML generated for that file will therefore be as follows:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE MMO PUBLIC "-//NLM//DTD MetaMap Machine Output//EN"
                     "http://ii-public.nlm.nih.gov/DTD/MMOtoXML_v2.dtd">
<MMO>

 . . .  XML for "heart attack." . . .

</MMO>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE MMO PUBLIC "-//NLM//DTD MetaMap Machine Output//EN"
                     "http://ii-public.nlm.nih.gov/DTD/MMOtoXML_v2.dtd">
<MMO>

 . . .  XML for "lung cancer." . . .

</MMO>
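
Because the output for a multi-citation file is a sequence of XML documents rather than a single document, a standard XML parser will generally not accept the whole output in one call. The following is a minimal Python sketch of one way to split such output before parsing. It assumes that every document begins with a line starting with an XML declaration, as in the example above; the function name split_xml_documents is hypothetical and is not part of MetaMap, and the empty MMO elements in the sample stand in for the elided content.

```python
import xml.etree.ElementTree as ET


def split_xml_documents(output_text):
    """Split concatenated XML output into one string per XML document.

    Assumption: each document starts with a line beginning "<?xml",
    as in the example above.
    """
    documents = []
    current = []
    for line in output_text.splitlines():
        if line.startswith("<?xml"):
            # A new declaration closes the document collected so far.
            text = "\n".join(current).strip()
            if text:
                documents.append(text)
            current = []
        current.append(line)
    text = "\n".join(current).strip()
    if text:
        documents.append(text)
    return documents


# Placeholder sample: the real content of each MMO element is elided.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE MMO PUBLIC "-//NLM//DTD MetaMap Machine Output//EN"
                     "http://ii-public.nlm.nih.gov/DTD/MMOtoXML_v2.dtd">
<MMO>
</MMO>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE MMO PUBLIC "-//NLM//DTD MetaMap Machine Output//EN"
                     "http://ii-public.nlm.nih.gov/DTD/MMOtoXML_v2.dtd">
<MMO>
</MMO>
"""

docs = split_xml_documents(sample)
# Parse each document separately; bytes input lets the parser honor
# the encoding named in the XML declaration.
roots = [ET.fromstring(doc.encode("utf-8")) for doc in docs]
```

Splitting on the declaration line rather than on blank lines keeps the sketch independent of any blank lines that may occur inside a document.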



Sentence-Breaking Algorithm

MetaMap08 (and all previous versions of the application) required that a blank space immediately follow a period for that period to signal the end of a sentence. We recently noticed, however, tens of thousands of instances in Medline citations in which a period not followed by a blank space ended a sentence. Consequently, we did not correctly handle instances such as the following (note that each line is intended to be read by itself--this is not continuous text):


serum markers for each of those subgroups.The mathematical simulation

neurogenesis in the adult nervous system.These findings may have

informed choices.Patients who are ready to make changes must be provided

affiliation with a trade union.Although still shut-out by the general

Note that in each of these actual lines from Medline citations, the period is not followed by a blank space but nonetheless marks the end of a sentence. We analyzed end-of-sentence periods that are not followed by a blank space and determined that a sentence break was very likely to have been intended if the period was followed by a string beginning with an uppercase letter, and that string was either

  1. one of a well-defined list of short (at most six characters) words such as The, In, We, This, These, It, Our, To, Study, When, etc. (see The and These in the first two examples above), or
  2. any long (seven characters or more) word (see Patients and Although in the last two examples above).
This logic has been included in MetaMap08_v2.
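
The rule just described can be sketched in Python as follows. This is an illustration only, not MetaMap's implementation: the function name split_unspaced_sentences is hypothetical, and SHORT_WORDS contains only the words named in this document, whereas the actual list is longer.

```python
import re

# Partial list: only the short words named in these notes.
SHORT_WORDS = {"The", "In", "We", "This", "These", "It", "Our", "To",
               "Study", "When"}

# A period immediately followed by a word starting with an uppercase letter.
_CANDIDATE = re.compile(r"\.(?=([A-Z][A-Za-z]*))")


def split_unspaced_sentences(text):
    """Split text at periods that are not followed by a blank space.

    A break is made only if the word after the period is in SHORT_WORDS
    or is at least seven characters long.
    """
    pieces = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        word = match.group(1)
        if word in SHORT_WORDS or len(word) >= 7:
            pieces.append(text[start:match.end()])
            start = match.end()
    pieces.append(text[start:])
    return pieces


line = "serum markers for each of those subgroups.The mathematical simulation"
sentences = split_unspaced_sentences(line)
```

Periods followed by a blank space are left alone here, since that case was already handled by the earlier sentence-breaking logic.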

Phrase-Breaking Algorithm

We have modified the phrase-breaking algorithm to force a phrase break whenever a "<", ">", or "=" character is encountered in the input text. This change was motivated by text such as the following, from PMID 16755936:


yeast extract > peptone > magnesium sulfate > vitamin C = potassium phosphate > calcium chloride = ammonium sulfate.

which should clearly not be analyzed as a single phrase.
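
As a rough illustration of the effect of this change, the Python sketch below splits text at every "<", ">", or "=" character. It shows only the forced breaks; MetaMap's phrase breaking also depends on parsing, which is not modeled here, and the function name force_phrase_breaks is hypothetical.

```python
import re


def force_phrase_breaks(text):
    """Split text wherever a "<", ">", or "=" character occurs."""
    segments = re.split(r"[<>=]", text)
    # Drop surrounding blanks and any empty segments.
    return [segment.strip() for segment in segments if segment.strip()]


text = ("yeast extract > peptone > magnesium sulfate > vitamin C = "
        "potassium phosphate > calcium chloride = ammonium sulfate.")
segments = force_phrase_breaks(text)
```

Applied to the text from PMID 16755936, this yields seven segments, one per nutrient.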

We have also added the ability to display information about the phrases into which input utterances are divided; this output is displayed by the --debug phrase option. The output consists of the word Phrase followed by the PMID, Section Type (ti or ab), utterance number, number of content-bearing phrase tokens, and finally the phrase tokens themselves. The six fields are separated by a vertical bar "|":


Phrase|16755936|ab|4|3|[of,the,individual,nutrient,component]

In the above example, there are only three content-bearing phrase tokens, even though the phrase contains five words; the example is correct because of and the are not content-bearing.
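
A line in this format can be read by a postprocessing program along the following lines. This is a minimal Python sketch based only on the field description above; the names PhraseLine and parse_phrase_line are hypothetical, and the sketch assumes that individual tokens contain no commas.

```python
from typing import List, NamedTuple


class PhraseLine(NamedTuple):
    pmid: str
    section: str                # "ti" or "ab"
    utterance_number: int
    content_token_count: int
    tokens: List[str]


def parse_phrase_line(line):
    """Parse one line of --debug phrase output into its six fields."""
    # Split on at most five bars so the token field is kept whole.
    fields = line.rstrip("\n").split("|", 5)
    if len(fields) != 6 or fields[0] != "Phrase":
        raise ValueError("not a Phrase line: %r" % line)
    _, pmid, section, utterance, count, token_field = fields
    inner = token_field.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    # Assumption: tokens themselves contain no commas.
    tokens = [token for token in inner.split(",") if token]
    return PhraseLine(pmid, section, int(utterance), int(count), tokens)


parsed = parse_phrase_line(
    "Phrase|16755936|ab|4|3|[of,the,individual,nutrient,component]")
```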

Finally, MetaMap can now generate phrases only, with no concept identification, when called with both --debug phrase and --phrases_only. Because this method of running MetaMap does no concept identification, it is extremely fast.

Changes in MMO

The form of phrase and utterance terms in MetaMap Machine Output (MMO) has changed slightly in order to allow MMO to more closely match the form of XML output. Just as MetaMap08 introduced an argument to MMO terms that represents positional information, MetaMap08_v2 introduces an additional argument representing the character positions in the string in which <CR> characters have been replaced by blank spaces. The reason for this change and examples of the previous and current MMO forms follow.

Consider this extract from the beginning of PMID 17047334:


PMID- 17047334
OWN - NLM
STAT- MEDLINE
DA  - 20061106
DCOM- 20070618
PUBM- Print-Electronic
IS  - 0001-5652 (Print)
VI  - 62
IP  - 2
DP  - 2006
TI  - Ethnic differences in key candidate genes for spontaneous preterm birth:
      TNF-alpha and its receptors.
PG  - 107-18
AB  - OBJECTIVES: Spontaneous preterm birth (PTB) has a significant ethnic
      disparity with people of African descent having an almost 2-fold higher
      incidence than those of European descent in the United States.

One of the phrases identified by MetaMap in the first utterance of the abstract, a significant ethnic disparity, is represented in XML output by


    <PText>a significant ethnic
      disparity</PText>

Note that the line break between ethnic and disparity and the six blank spaces before disparity in the original input text are faithfully reproduced in the XML output; had the beginning of the citation's abstract read instead


AB  - OBJECTIVES: Spontaneous preterm birth (PTB) has a significant ethnic disparity
      with people of African descent having an almost 2-fold higher incidence
      than those of European descent in the United States.

the XML code generated for the phrase a significant ethnic disparity would have been instead


    <PText>a significant ethnic disparity</PText>

In order to ensure that the MMO representation of phrases and utterances mirrors as faithfully as possible their XML representation, we have modified the MMO phrase and utterance terms to include all blank spaces in the original text. We deemed it unwise, however, to include <CR> characters in MMO terms, because users' postprocessing programs expect all MMO terms to be contained on a single line. A compromise balancing faithfulness to the original text and backward compatibility for our users involved modifying MMO phrase and utterance terms by

  1. changing each <CR> character to a blank space, and
  2. adding an extra argument at the end of phrase and utterance terms representing the character positions in the utterance in which a <CR> character was replaced by a blank space.
For example, the previous form of the MMO term generated for the phrase a significant ethnic disparity was


phrase('a significant ethnic disparity',
       [det([lexmatch([a]),inputmatch([a]),tag(det),tokens([a])]),
        mod([lexmatch([significant]),inputmatch([significant]),tag(adj),tokens([significant])]),
        mod([lexmatch([ethnic]),inputmatch([ethnic]),tag(adj),tokens([ethnic])]),
        head([lexmatch([disparity]),inputmatch([disparity]),tag(noun),tokens([disparity])])],
        325/36).

Note that this term has been pretty-printed for readability; in actual MMO output, the entire term would appear on one line. The argument 325/36 tells us that the string a significant ethnic disparity begins at character position 325 of the citation (counting from the very beginning, i.e., from PMID- 17047334), and contains 36 characters.

The deficiency in this representation is that it does not correctly capture <CR> characters and multiple blank spaces. By way of contrast, the new form in which <CR>s and multiple blanks are more faithfully represented (again, pretty-printed for readability) is


phrase('a significant ethnic       disparity',
       [det([lexmatch([a]),inputmatch([a]),tag(det),tokens([a])]),
        mod([lexmatch([significant]),inputmatch([significant]),tag(adj),tokens([significant])]),
        mod([lexmatch([ethnic]),inputmatch([ethnic]),tag(adj),tokens([ethnic])]),
        head([lexmatch([disparity]),inputmatch([disparity]),tag(noun),tokens([disparity])])],
        325/36,
        [345]).

The additional argument [345] shows that one <CR> character at character position 345 was replaced by a blank space in the MMO representation.

Similarly, the utterance term for the first utterance in the citation's abstract would be the following (the actual utterance text has been replaced by "____" in order to show the entire utterance term on one line):


utterance('17047334.ab.1', ____, 277/216, [345,423]).


The argument [345,423] shows that <CR> characters at positions 345 (between ethnic and disparity) and 423 (between higher and incidence) have been replaced by blank spaces.
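
A postprocessing program can use the new argument to recover the original line breaks. The Python sketch below shows the arithmetic: each listed position minus the term's starting position gives the index, within the term's string, of the blank space that replaced a <CR>. The function name restore_line_breaks is hypothetical, and the sketch assumes that a <CR> corresponds to a single newline character.

```python
def restore_line_breaks(text, start, cr_positions):
    """Put back line breaks that MMO replaced with blank spaces.

    text         -- string argument of a phrase or utterance term
    start        -- starting character position (the 325 in 325/36)
    cr_positions -- positions from the extra argument, e.g. [345]
    """
    chars = list(text)
    for position in cr_positions:
        index = position - start
        if not 0 <= index < len(chars):
            raise ValueError("position %d lies outside the text" % position)
        if chars[index] != " ":
            raise ValueError("no blank space at position %d" % position)
        # Assumption: a <CR> is represented as one newline character.
        chars[index] = "\n"
    return "".join(chars)


# The phrase string from the new MMO form: 20 + 7 + 9 = 36 characters.
phrase_text = "a significant ethnic" + " " * 7 + "disparity"
restored = restore_line_breaks(phrase_text, 325, [345])
```

For the phrase above, 345 - 325 = 20, so the blank at index 20 (the one immediately after ethnic) becomes a line break, leaving the six blanks that precede disparity in the original text.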


Initial Implementation of Negex

This MetaMap release includes the initial implementation of Negex, which was originally described in the original MetaMap08 Release Notes. In this initial implementation, Negex output is generated if and only if machine_output (-q) or XML (-%) is specified; Negex output will be included in human-readable output in a future release when invoked via --negex.


Allowed Form of PMIDs

A bug introduced in MetaMap08 prevented the analysis of citations whose PMIDs were not purely numeric. MetaMap08_v2 includes a fix that allows the analysis of citations whose PMIDs contain any printing ASCII characters. The examples below show allowable PMID formats.


PMID- 19052218
PMID- MP:001
PMID- 19052218_Findings
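
Programs that prepare input for MetaMap may want to check PMID lines before submitting them. The Python sketch below is one possible check, not MetaMap's own validation: it interprets "printing ASCII characters" as the non-blank characters with codes 0x21 through 0x7E, which is an assumption, and the function name extract_pmid is hypothetical.

```python
import re

# Assumption: "printing ASCII" means codes 0x21-0x7E (no blank spaces).
_PMID_LINE = re.compile(r"^PMID- ([\x21-\x7e]+)\s*$")


def extract_pmid(line):
    """Return the PMID from a "PMID- ..." line, or None if not allowed."""
    match = _PMID_LINE.match(line)
    return match.group(1) if match else None


pmids = [extract_pmid(line) for line in (
    "PMID- 19052218",
    "PMID- MP:001",
    "PMID- 19052218_Findings",
)]
```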

No Variant Generation for Short Words

In order to simplify and streamline processing by eliminating many false positive concepts, MetaMap will no longer generate variants for words of one or two characters. This change will suppress, for example, the generation of


   966 TXS (TBXAS1 gene) [Gene or Genome]

from the input word t, and the generation of


   966 AAS (Addiction Admission Scale) [Intellectual Product]

from the input word aa.

Suppression of Header Information

By default, MetaMap displays informational messages at startup, such as


/nfsvol/nls/bin/metamap08 (2008)


Control options:
  mm_data_year=0809
Berkeley DB databases (normal strict 0809 model) are open.
Static variants will come from table varsan.
Accessing lexicon /nfsvol/nls/specialist/SKR/src/lexicon/data/lexiconStatic2008.
Variant generation mode: static.
Initializing tagger Established connection to Tagger Server on ind1.
on ind1...

These messages can now be suppressed by specifying the --no_header_info option. We will provide a short (one-character) version of this option if there is sufficient demand for it.

Last Modified: June 08, 2009 ii-public