Changes made in MetaMap2009 V2 mostly involve bug fixes and enhancements to our implementation of NegEx, the AA-detection logic, and lexical processing. This release also includes three more substantive changes:
- New Phrase-Breaking Method,
- MatchMap Consolidation, and, most significantly,
- Improvements in XML generation.
Previous versions of MetaMap considered the following eight characters
: ( ) ; / < > =
as phrase-breaking characters. We have now added
*
to the list of phrase-breaking characters because of text such as the following:
Research projects supported through the BBBE program include, but are not limited to: * Fermentation technology * Enzyme technology * Recombinant DNA technology * Cell culture technology * Ex vivo and therapeutic stem cell culture technology * Metabolic engineering * Tissue engineering * Nanobiotechnology * Quantitative systems biotechnology * Biosensor development * Food processing with special focus on the safety of the nation's food supply
If "*" is not treated as a phrase-breaking character, all of
* Fermentation technology * Enzyme technology * Recombinant DNA technology * Cell culture technology * Ex vivo
will be treated as a single phrase, as will
therapeutic stem cell culture technology * Metabolic engineering * Tissue engineering * Nanobiotechnology * Quantitative systems biotechnology * Biosensor development * Food processing with special focus on the safety of the nation's food supply
These long phrases lead to a massive combinatorial explosion that requires 45 minutes of processing time. If instead "*" is treated as a phrase-breaking character, each bullet will be treated as a separate phrase, and MetaMap can process the entire text in about three seconds.
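The effect of the extra phrase-breaking character can be illustrated with a minimal Python sketch. This is not MetaMap's own code; the names PHRASE_BREAKERS and split_on_breakers are hypothetical, and the sketch only shows character-level splitting, not MetaMap's full phrase analysis.

```python
import re

# The eight original phrase-breaking characters plus the newly added "*".
# (Hypothetical constant name; not taken from MetaMap's source.)
PHRASE_BREAKERS = ":();/<>=*"


def split_on_breakers(text, breakers=PHRASE_BREAKERS):
    """Split text at every phrase-breaking character and drop empty chunks."""
    pattern = "[" + re.escape(breakers) + "]"
    return [chunk.strip() for chunk in re.split(pattern, text) if chunk.strip()]


text = ("include, but are not limited to: * Fermentation technology "
        "* Enzyme technology * Recombinant DNA technology")
for chunk in split_on_breakers(text):
    print(chunk)
```

With "*" in the list, each bullet becomes its own short chunk; with only the original eight characters, everything after the colon stays in one long chunk.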
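MatchMaps are a data structure associated with candidate concepts; they represent which words of the input text match which words of the candidate concept, together with the lexical variation between the matched words.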
For example, consider the input text "obstructive sleep apnea" and the candidate concept "sleep apnea". The matching words "sleep" and "apnea" are the 2nd and 3rd words of the text, and the 1st and 2nd words of the concept. The words in the input text and candidate are identical, so there is no (i.e., zero) lexical variation. The MatchMap would therefore be [[[2,3],[1,2],0]]. For the candidate concept "sleep apneas", the MatchMap would be the same, other than having lexical variation of 1 instead of 0. The detailed computation of lexical variation is explained on pp. 2-3 of MetaMap Evaluation.
For another instructive example of MatchMaps, consider the input text "protein synthesis" and the candidate concept "protein synthesizers". In this case, the matching words are the 1st and 2nd words of both the text and the concept, and there is a high degree of lexical variation between "synthesis" and "synthesizers". The MatchMap here is [[[1,1],[1,1],0],[[2,2],[2,2],6]].
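The MatchMap layout described above can be mirrored in a short Python sketch. The names MatchMapEntry and exact_match_map are hypothetical, and the sketch handles only the zero-variation case in which the concept words appear identically and contiguously in the text; computing non-zero lexical variation is outside its scope.

```python
from typing import List, NamedTuple


class MatchMapEntry(NamedTuple):
    """One MatchMap component; all word positions are 1-based."""
    text_start: int
    text_end: int
    concept_start: int
    concept_end: int
    variation: int

    def as_list(self):
        # Same layout as the notation used in this document.
        return [[self.text_start, self.text_end],
                [self.concept_start, self.concept_end],
                self.variation]


def exact_match_map(text_words: List[str], concept_words: List[str]):
    """Return a one-entry MatchMap if the concept occurs verbatim in the text."""
    n = len(concept_words)
    for i in range(len(text_words) - n + 1):
        if text_words[i:i + n] == concept_words:
            # Identical words, so the lexical variation is 0.
            return [MatchMapEntry(i + 1, i + n, 1, n, 0)]
    return []


mm = exact_match_map("obstructive sleep apnea".split(), "sleep apnea".split())
print([entry.as_list() for entry in mm])  # [[[2, 3], [1, 2], 0]]
```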
In cases such as this last one, we now consolidate two consecutive MatchMaps if they denote adjacent positions. More specifically, if two MatchMaps
[[Text1Start,Text1End], [Concept1Start,Concept1End], Variation1]
[[Text2Start,Text2End], [Concept2Start,Concept2End], Variation2]
are such that Text2Start = Text1End + 1 and Concept2Start = Concept1End + 1, then the MatchMaps can be consolidated into
[[Text1Start,Text2End], [Concept1Start,Concept2End], Variation3]
where Variation3 is the arithmetic average of Variation1 and Variation2.
In the protein synthesis example above, the unconsolidated MatchMap
[[[1,1],[1,1],0],[[2,2],[2,2],6]]
would thus be consolidated into
[[[1,2],[1,2],3]]
Note that consolidating MatchMaps will lead to slightly lower candidate and mapping scores, because the penalty for lexical variation will be lessened.
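The consolidation rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than MetaMap's implementation: the function name consolidate is hypothetical, runs of more than two adjacent components are folded left to right (the text only specifies the two-component case), and the average is left unrounded because the text does not say how fractional averages are handled.

```python
def consolidate(match_map):
    """Merge consecutive components that are adjacent in both text and concept.

    Each component has the form [[text_start, text_end],
    [concept_start, concept_end], variation].
    """
    merged = []
    for text_span, concept_span, variation in match_map:
        if merged:
            prev_text, prev_concept, prev_variation = merged[-1]
            adjacent = (text_span[0] == prev_text[1] + 1
                        and concept_span[0] == prev_concept[1] + 1)
            if adjacent:
                # Assumption: plain arithmetic average, no rounding.
                merged[-1] = [[prev_text[0], text_span[1]],
                              [prev_concept[0], concept_span[1]],
                              (prev_variation + variation) / 2]
                continue
        merged.append([list(text_span), list(concept_span), variation])
    return merged


# "protein synthesis" against "protein synthesizers"
print(consolidate([[[1, 1], [1, 1], 0], [[2, 2], [2, 2], 6]]))
# [[[1, 2], [1, 2], 3.0]]
```

Components that are not adjacent in both the text and the concept are returned unchanged.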
In this release of MetaMap, we have made significant changes to our XML tags. For a full explanation of the old and new XML tags, please see MetaMap2009 V2 XML.