MetaMap 2008_v2 Release Notes
Introduction

This document outlines a number of substantial changes made to produce MetaMap 2008_v2, the most important of which are described below.
Substantially Improved Performance

We have considerably sped up MetaMap computations by storing intermediate results of mappings in an AVL tree rather than a linear list. Although this enhancement will not always be noticeable, it will result in as much as a 50-fold speedup on certain especially troublesome citations containing many instances of human and other words with many associated concepts.

We also fixed a long-standing bug in filehandle management that had prevented filehandles from being closed. This bug fix now allows MetaMap to run on much larger input files than before without exceeding operating-system limits on the number of open filehandles.

XML Generation

The version of XML generation in the initial release of MetaMap08, first described in the original MetaMap08 Release Notes, did not work properly. It has been fixed in this release, which correctly generates both formatted and unformatted XML output.

Another substantial change to the XML generation is the one-to-one mapping of input citations (not of input files) to XML documents: The original release of MetaMap08 was intended to generate an XML document for each input file, even if the input file contained multiple citations. By contrast, MetaMap08_v2 generates an XML document for each citation in an input file. Consequently, if an input file contains multiple citations, the XML output generated from that file will contain several XML documents separated by a blank line. For example, consider the following input file:
Because Heart attack. and Lung cancer. are separated by a blank line, they are considered separate citations, just as if multiple Medline citations had been downloaded to a single input file. The XML generated for that file will therefore be as follows:
Sentence-Breaking Algorithm

MetaMap08 (and all previous versions of the application) required that a blank space immediately follow a period for that period to signal the end of a sentence. We recently noticed, however, tens of thousands of instances in Medline citations in which a period not followed by a blank space ended a sentence. Consequently, we did not correctly handle instances such as the following (note that each line is intended to be read by itself; this is not continuous text):
Note that in each of these actual lines from Medline citations, the period is not followed by a blank space, but nonetheless still marks the end of a sentence. We analyzed the phenomenon of end-of-sentence periods that are not followed by a blank space, and determined that if a period was followed by a string beginning with an uppercase letter, and that string was either
Phrase-Breaking Algorithm

We have modified the phrase-breaking algorithm to force a phrase break whenever a "<", ">", or "=" character is encountered in the input text. This change was motivated by text such as the following, from PMID 16755936:
which should clearly not be analyzed as a single phrase. We have also added the ability to display information about the phrases into which input utterances are divided; this output is displayed by the --debug phrase option. The output consists of the word Phrase followed by the PMID, Section Type (ti or ab), utterance number, number of content-bearing phrase tokens, and finally the phrase tokens themselves. The six fields are separated by a vertical bar "|":
In the above example, there are only three content-bearing phrase tokens, even though the phrase contains five words; the example is correct because on and the are not content-bearing.

Finally, MetaMap can now generate phrases only, with no concept identification, when it is called with both --debug phrase and --phrases_only. Because this method of running MetaMap identifies phrases only, and does no concept identification, it is extremely fast.

Changes in MMO

The form of phrase and utterance terms in MetaMap Machine Output (MMO) has changed slightly in order to allow MMO to more closely match the form of XML output. Just as MetaMap08 introduced an argument to MMO terms that represents positional information, MetaMap08_v2 introduces an additional argument representing the character positions in the string in which <CR> characters have been replaced by blank spaces. The reason for this change and examples of the previous and current MMO forms follow. Consider this extract from the beginning of PMID 17047334:
One of the phrases identified by MetaMap in the first utterance of the abstract, a significant ethnic disparity, is represented in XML output by
Note that the line break between ethnic and disparity and the six blank spaces before disparity in the original input text are faithfully reproduced in the XML output; had the beginning of the citation's abstract read instead
the XML code generated for the phrase a significant ethnic disparity would have been instead
In order to ensure that the MMO representation of phrases and utterances mirrors as faithfully as possible their XML representation, we have modified the MMO phrase and utterance terms to include all blank spaces in the original text. We deemed it unwise, however, to include <CR> characters in MMO terms, because users' postprocessing programs expect all MMO terms to be contained on a single line. A compromise balancing faithfulness to the original text and backward compatibility for our users involved modifying MMO phrase and utterance terms by
Note that this term has been pretty-printed for readability; in actual MMO output, the entire term would appear on one line. The argument 325/36 tells us that the string a significant ethnic disparity begins at the 325th character of the abstract (counting from the very beginning, i.e., PMID- 17047334), and contains 36 characters. The deficiency in this representation is that it does not correctly capture <CR> characters and multiple blank spaces. By way of contrast, the new form in which <CR>s and multiple blanks are more faithfully represented (again, pretty-printed for readability) is
The additional argument [345] shows that one <CR> character at character position 345 was replaced by a blank space in the MMO representation. Similarly, the utterance term for the first utterance in the citation's abstract would be the following (the actual utterance text has been replaced by "____" in order to show the entire utterance term on one line):
The argument [345,423] shows that <CR> characters at positions 345 (between ethnic and disparity) and 423 (between higher and incidence) have been replaced by blank spaces.

Initial Implementation of Negex

This MetaMap release includes the initial implementation of Negex, which was first described in the original MetaMap08 Release Notes. In this initial implementation, Negex output is generated if and only if machine_output (-q) or XML (-%) is specified; Negex output will be included in human-readable output in a future release when invoked via --negex.

Allowed Form of PMIDs

A bug introduced in MetaMap08 prevented the analysis of citations whose PMIDs were not purely numeric. MetaMap08_v2 includes a fix that allows the analysis of citations whose PMIDs contain any printing ASCII characters. The table below shows examples of allowable PMID formats.
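As a rough illustration of the "printing ASCII" rule, the following Python sketch checks whether a candidate PMID string would qualify. It is not MetaMap's own code: the function name is_allowable_pmid is hypothetical, and the sketch assumes that "printing" means the ASCII codes 0x21 through 0x7E, so that blank spaces and control characters are rejected. The release notes do not state whether a blank space counts as a printing character.

```python
def is_allowable_pmid(pmid: str) -> bool:
    """Return True if pmid is non-empty and made only of printing ASCII.

    Assumption (not confirmed by the release notes): "printing" is taken
    to mean codes 0x21-0x7E, which excludes the blank space and all
    control characters such as newline.
    """
    return bool(pmid) and all(0x21 <= ord(ch) <= 0x7E for ch in pmid)


if __name__ == "__main__":
    # Hypothetical identifiers, chosen only to exercise the check.
    for candidate in ["17047334", "abc-123.x", "12 34", ""]:
        print(repr(candidate), is_allowable_pmid(candidate))
```

Under a looser reading in which the blank space is also a printing character, the lower bound would be 0x20 instead of 0x21.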
No Variant Generation for Short Words

In order to simplify and streamline processing by eliminating many false-positive concepts, MetaMap will no longer generate variants for words of one or two characters. This change will suppress, for example, the generation of
from the input word t, and the generation of
from the input word aa.

Suppression of Header Information

MetaMap will by default display at startup time informational messages such as
These messages can now be suppressed by specifying the --no_header_info option. We will provide a short (one-character) version of this option if there is sufficient demand for it.
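The options named in these notes can be combined on a single command line. The Python sketch below assembles such an argument list, for example for use with subprocess.run. It uses only the options mentioned above (--no_header_info, --debug phrase, and --phrases_only). The function name build_metamap_args and the default executable name "metamap" are hypothetical; substitute whatever launcher your installation provides.

```python
from typing import List


def build_metamap_args(
    executable: str = "metamap",  # hypothetical name; depends on the installation
    no_header_info: bool = False,
    phrases_only: bool = False,
) -> List[str]:
    """Assemble a MetaMap command line from options described in these notes."""
    args = [executable]
    if no_header_info:
        # Suppress the informational messages shown at startup.
        args.append("--no_header_info")
    if phrases_only:
        # Phrase-only mode is requested with --debug phrase and --phrases_only.
        args += ["--debug", "phrase", "--phrases_only"]
    return args


if __name__ == "__main__":
    print(build_metamap_args(no_header_info=True, phrases_only=True))
```

Building the command as a list rather than a single string avoids shell quoting problems when the list is passed to subprocess.run.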