This document provides an example of the processing necessary to allow entity recognition of Resource Description Framework (RDF) concepts, in this case Experimental Factor Ontology (EFO) concepts, using MetaMap.
Each SPARQL query first finds the nodes containing concept identifiers, in this case nodes with the is-a relation (rdf:type) "owl:Class"; if such a node also has an rdfs:label (preferred name), the subject and object are added to the result graph:
SELECT ?s ?o { ?s rdf:type owl:Class . ?s rdfs:label ?o }
Similarly, the following SPARQL query does the same for owl:Class nodes that also have efo:alternative_term values:
SELECT ?s ?o { ?s rdf:type owl:Class . ?s efo:alternative_term ?o }
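The effect of these two queries can be illustrated without rdflib: treat the graph as a set of (subject, predicate, object) triples and keep the label (or alternative term) of every subject that is also typed owl:Class. A minimal sketch; the triple data below is invented for illustration:

```python
# Toy triple store: a list of (subject, predicate, object) tuples.
# The URIs mirror the EFO example; the triples are illustrative only.
TRIPLES = [
    ("efo:EFO_0003549", "rdf:type", "owl:Class"),
    ("efo:EFO_0003549", "rdfs:label", "caudal tuberculum"),
    ("efo:EFO_0003549", "efo:alternative_term", "posterior tubercle"),
    ("efo:EFO_0003549", "efo:alternative_term", "posterior tuberculum"),
    ("efo:someNode", "rdfs:label", "not a class"),  # no rdf:type owl:Class
]

def select_pairs(triples, predicate):
    """Emulate: SELECT ?s ?o { ?s rdf:type owl:Class . ?s <predicate> ?o }"""
    classes = {s for (s, p, o) in triples
               if p == "rdf:type" and o == "owl:Class"}
    return [(s, o) for (s, p, o) in triples
            if p == predicate and s in classes]

labels = select_pairs(TRIPLES, "rdfs:label")
synonyms = select_pairs(TRIPLES, "efo:alternative_term")
```

Note how the second node is dropped: it has a label but is never declared an owl:Class, so neither query matches it.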
We have one graph for concepts and another for synonyms. Each graph is indexed in a dictionary by the identifier, one dictionary for concepts and one for synonyms.
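That indexing step can be sketched as follows, assuming each query result row is a (subject identifier, value) pair; rows sharing an identifier are collected under one key:

```python
def index_rows(rows):
    """Index (identifier, value) result rows into a dict keyed by identifier."""
    d = {}
    for ident, value in rows:
        d.setdefault(ident, []).append((ident, value))
    return d

# Illustrative rows modeled on the EFO_0003549 example.
concept_rows = [("EFO_0003549", "caudal tuberculum")]
synonym_rows = [("EFO_0003549", "posterior tubercle"),
                ("EFO_0003549", "posterior tuberculum")]

cdict = index_rows(concept_rows)    # one dictionary for concepts
syndict = index_rows(synonym_rows)  # one for synonyms
```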
The function abbrev_uri is used to produce a concept identifier from the Uniform Resource Identifier (URI) of the concept in a form that is usable by MetaMap.
http://www.ebi.ac.uk/efo/EFO_0003549 -> EFO_0003549
http://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:153671 -> CHEBI_153671
http://purl.org/obo/owl/CL#CL_0001003 -> CL_0001003
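The mapping above amounts to stripping a known namespace prefix and replacing any remaining colon (as in CHEBI:153671) with an underscore. A condensed sketch of the extractor's abbrev_uri, restricted to three of the prefixes for brevity:

```python
# Subset of the extractor's prefix list, for illustration.
PREFIXLIST = [
    'http://www.ebi.ac.uk/efo/',
    'http://www.ebi.ac.uk/chebi/searchId.do?chebiId=',
    'http://purl.org/obo/owl/CL#',
]

def abbrev_uri(uri):
    """Remove a known prefix from uri, leaving the unique part of the
    identifier; colons are replaced with underscores for MetaMap."""
    for prefix in PREFIXLIST:
        if uri.startswith(prefix):
            return uri[len(prefix):].replace(':', '_')
    return uri  # unknown namespace: return the URI unchanged
```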
When producing the Concept Names and Sources file, MRCONSO.RRF, all of the keys in the concept dictionary are traversed. For each concept, a record for the label and, if present, records for any synonyms are produced. For example, the concept EFO_0003549 "caudal tuberculum" has the following MRCONSO.RRF records generated:
$ grep EFO_0003549 MRCONSO.RRF
EFO_0003549|ENG|P|L00004389|PF|S00015577|Y|A00015591||||EFO|PT|http://www.ebi.ac.uk/efo/EFO_0003549|caudal tuberculum|0|N||
EFO_0003549|ENG|S|L00015225|SY|S00019475|Y|A00019506||||EFO|PT|http://www.ebi.ac.uk/efo/EFO_0003549|posterior tubercle|0|N||
EFO_0003549|ENG|S|L00015226|SY|S00019476|Y|A00019507||||EFO|PT|http://www.ebi.ac.uk/efo/EFO_0003549|posterior tuberculum|0|N||
The Semantic Types file, MRSTY.RRF, has one record for each semantic type assigned to a concept:
$ grep EFO_0003549 MRSTY.RRF
EFO_0003549|T205|A0.0.0.0.0.0|Unknown|AT0000000||
The Source Information file, MRSAB.RRF, contains information used by MetaMap when restricting processing to a particular set of sources, or excluding them.
$ cat MRSAB.RRF
C4000000|C4000000|GO|GO|http://purl.org/obo/owl/GO#||||||||0|3|2||||ENG|ascii|Y|Y||
C4000001|C4000001|EFO|EFO|http://www.ebi.ac.uk/efo/||||||||0|5222|3501||||ENG|ascii|Y|Y||
C4000002|C4000002|UO|UO|http://purl.org/obo/owl/UO#||||||||0|135|76||||ENG|ascii|Y|Y||
C4000003|C4000003|NCBITaxon|NCBITaxon|http://purl.org/obo/owl/NCBITaxon#||||||||0|77|46||||ENG|ascii|Y|Y||
C4000004|C4000004|SPAN|SPAN|http://www.ifomis.org/bfo/1.1/span#||||||||0|1|1||||ENG|ascii|Y|Y||
C4000005|C4000005|CL|CL|http://purl.org/obo/owl/CL#||||||||0|713|617||||ENG|ascii|Y|Y||
C4000006|C4000006|oboInOwl|oboInOwl|http://www.geneontology.org/formats/oboInOwl#||||||||0|1|1||||ENG|ascii|Y|Y||
C4000007|C4000007|OBO|OBO|http://purl.obolibrary.org/obo/||||||||0|68|42||||ENG|ascii|Y|Y||
C4000008|C4000008|PATO|PATO|http://purl.org/obo/owl/PATO||||||||0|5|3||||ENG|ascii|Y|Y||
C4000009|C4000009|SNAP|SNAP|http://www.ifomis.org/bfo/1.1/snap#||||||||0|8|7||||ENG|ascii|Y|Y||
C4000010|C4000010|CHEBI|CHEBI|http://www.ebi.ac.uk/chebi/searchId.do?chebiId=||||||||0|0|0||||ENG|ascii|Y|Y||
The Concept Name Ranking file, MRRANK.RRF, provides the precedence information the filtering program filter_mrconso needs to determine which records in MRCONSO.RRF are redundant and should be removed.
$ cat MRRANK
0400|EFO|PT|N|
0399|EFO|SY|N|
0398|GO|PT|N|
0397|GO|SY|N|
0396|NCBITaxon|PT|N|
0395|NCBITaxon|SY|N|
0394|CL|PT|N|
0393|CL|SY|N|
0392|CHEBI|PT|N|
0391|CHEBI|SY|N|
0390|OBO|PT|N|
0389|OBO|SY|N|
0388|oboInOwl|PT|N|
0387|oboInOwl|SY|N|
0386|UO|PT|N|
0385|UO|SY|N|
0384|SNAP|PT|N|
0383|SNAP|SY|N|
0382|PATO|PT|N|
0381|PATO|SY|N|
0380|SPAN|PT|N|
0379|SPAN|SY|N|
When producing the Concept Names file, MRCON, all of the keys in the concept dictionary are traversed. For each concept, a record for the label and, if present, records for any synonyms are produced. For example, the concept EFO_0003549 "caudal tuberculum" has the following MRCON records generated:
$ grep EFO_0003549 MRCON
EFO_0003549|ENG|P|L0004578|PF|S0004578|caudal tuberculum|0|
EFO_0003549|ENG|S|L0004579|SY|S0004579|posterior tubercle|0|
EFO_0003549|ENG|S|L0004580|SY|S0004580|posterior tuberculum|0|
Each record contains the concept identifier, language, term status, lexical unique identifier (unused), string type, string unique identifier, term string, and least restriction level (unused).
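Splitting a record on the pipe delimiter recovers those fields. A small sketch applied to the first example record above; the field names follow the usual UMLS column mnemonics and are introduced here for illustration:

```python
# ORF MRCON column mnemonics (CUI, LAT, TS, LUI, STT, SUI, STR, LRL).
MRCON_FIELDS = ("cui", "lat", "ts", "lui", "stt", "sui", "str", "lrl")

def parse_mrcon(line):
    """Parse one pipe-delimited MRCON record into a field dictionary."""
    values = line.rstrip("\n").split("|")[:len(MRCON_FIELDS)]
    return dict(zip(MRCON_FIELDS, values))

record = parse_mrcon("EFO_0003549|ENG|P|L0004578|PF|S0004578|caudal tuberculum|0|")
```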
Similarly, for each record in MRCON there is a corresponding record in the Vocabulary Sources File, MRSO:
$ grep EFO_0003549 MRSO
EFO_0003549|L0004578|S0004578|EFO|PT|http://www.ebi.ac.uk/efo/EFO_0003549|0|
EFO_0003549|L0004579|S0004579|EFO|SY|http://www.ebi.ac.uk/efo/EFO_0003549|0|
EFO_0003549|L0004580|S0004580|EFO|SY|http://www.ebi.ac.uk/efo/EFO_0003549|0|
Each record contains the concept identifier, lexical unique identifier (unused), string unique identifier, source identifier, term type in source, unique identifier in source (in this case the fully specified URI), and source restriction level (unused).
The Semantic Types file, MRSTY, has one record for each semantic type assigned to a concept:
$ grep EFO_0003549 MRSTY
EFO_0003549|T205|unkn|
The Source Information file, MRSAB, contains information used by MetaMap when restricting to a particular set of sources, or excluding them.
$ more MRSAB
C4000000|C4000000|SPAN|SPAN|http://www.ifomis.org/bfo/1.1/span#||||||||0|1|1||||ENG|ascii|Y|Y|
C4000001|C4000001|CL|CL|http://purl.org/obo/owl/CL#||||||||0|713|617||||ENG|ascii|Y|Y|
C4000002|C4000002|oboInOwl|oboInOwl|http://www.geneontology.org/formats/oboInOwl#||||||||0|1|1||||ENG|ascii|Y|Y|
C4000003|C4000003|GO|GO|http://purl.org/obo/owl/GO#||||||||0|3|2||||ENG|ascii|Y|Y|
C4000004|C4000004|OBO|OBO|http://purl.obolibrary.org/obo/||||||||0|68|42||||ENG|ascii|Y|Y|
C4000005|C4000005|PATO|PATO|http://purl.org/obo/owl/PATO||||||||0|5|3||||ENG|ascii|Y|Y|
C4000006|C4000006|EFO|EFO|http://www.ebi.ac.uk/efo/||||||||0|5222|3501||||ENG|ascii|Y|Y|
C4000007|C4000007|SNAP|SNAP|http://www.ifomis.org/bfo/1.1/snap#||||||||0|8|7||||ENG|ascii|Y|Y|
C4000008|C4000008|UO|UO|http://purl.org/obo/owl/UO#||||||||0|135|76||||ENG|ascii|Y|Y|
C4000009|C4000009|CHEBI|CHEBI|http://www.ebi.ac.uk/chebi/searchId.do?chebiId=||||||||0|0|0||||ENG|ascii|Y|Y|
C4000010|C4000010|NCBITaxon|NCBITaxon|http://purl.org/obo/owl/NCBITaxon#||||||||0|77|46||||ENG|ascii|Y|Y|
The Concept Name Ranking file, MRRANK, provides the precedence information the filtering program filter_mrconso needs to determine which records in MRCON and MRSO are redundant and should be removed.
$ more MRRANK
0400|EFO|PT|N|
0399|EFO|SY|N|
0398|GO|PT|N|
0397|GO|SY|N|
0396|NCBITaxon|PT|N|
0395|NCBITaxon|SY|N|
0394|CL|PT|N|
0393|CL|SY|N|
0392|CHEBI|PT|N|
0391|CHEBI|SY|N|
0390|OBO|PT|N|
0389|OBO|SY|N|
0388|oboInOwl|PT|N|
0387|oboInOwl|SY|N|
0386|UO|PT|N|
0385|UO|SY|N|
0384|SNAP|PT|N|
0383|SNAP|SY|N|
0382|PATO|PT|N|
0381|PATO|SY|N|
0380|SPAN|PT|N|
0379|SPAN|SY|N|
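The ranking table itself is mechanical to produce: walk the source list in priority order and emit a descending precedence number for each (source, term type) pair. A sketch modeled on the extractor's gen_mrrank function:

```python
# Source list in priority order, as used by the extractor.
SRCLIST = ['EFO', 'GO', 'NCBITaxon', 'CL', 'CHEBI', 'OBO',
           'oboInOwl', 'UO', 'SNAP', 'PATO', 'SPAN']

def gen_mrrank_lines(srclist, ttylist=('PT', 'SY'), start=400):
    """Return MRRANK records; higher numbers rank higher (preferred)."""
    lines = []
    pred = start
    for sab in srclist:
        for tty in ttylist:
            lines.append('%04d|%s|%s|N|' % (pred, sab, tty))
            pred -= 1
    return lines

mrrank = gen_mrrank_lines(SRCLIST)
```

Preferred terms (PT) of a source always outrank its synonyms (SY), and earlier sources outrank later ones.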
Thanks to Tomasz Adamusiak for explaining how the EFO inferred data was organized.
Original data for concept EFO_0003549 (URI: http://www.ebi.ac.uk/efo/EFO_0003549) in N3 format:
:EFO_0003549 a owl:Class;
    rdfs:label "caudal tuberculum"^^<http://www.w3.org/2001/XMLSchema#string>;
    :alternative_term
        "posterior tubercle"^^<http://www.w3.org/2001/XMLSchema#string>,
        "posterior tuberculum"^^<http://www.w3.org/2001/XMLSchema#string>;
    :bioportal_provenance
        "Brain structure which is part of the diencephalon and is larger than the dorsal thalamus and ventral thalamus. From Neuroanatomy of the Zebrafish Brain.[accessedResource: ZFA:0000633][accessDate: 05-04-2011]"^^<http://www.w3.org/2001/XMLSchema#string>,
        "posterior tubercle[accessedResource: ZFA:0000633][accessDate: 05-04-2011]"^^<http://www.w3.org/2001/XMLSchema#string>,
        "posterior tuberculum[accessedResource: ZFA:0000633][accessDate: 05-04-2011]"^^<http://www.w3.org/2001/XMLSchema#string>;
    :definition "Brain structure which is part of the diencephalon and is larger than the dorsal thalamus and ventral thalamus. From Neuroanatomy of the Zebrafish Brain."^^<http://www.w3.org/2001/XMLSchema#string>;
    :definition_citation "ZFA:0000633"^^<http://www.w3.org/2001/XMLSchema#string>;
    :definition_editor "Tomasz Adamusiak"^^<http://www.w3.org/2001/XMLSchema#string>;
    rdfs:subClassOf :EFO_0003331.
This module requires rdflib (http://code.google.com/p/rdflib/), version 4.1.0. The program generates both ORF and RRF versions of the data files. The RRF versions are used by the 2013 release of the Data File Builder, and the ORF versions are used by the 2011 release.
""" /rhome/wjrogers/studio/python/rdf/efo_extract.py, Mon Jan 9 14:05:05 2012, edit by Will Rogers Extract concepts and synomyms from EFO_inferred_v2.18.owl and generate UMLS format tables for use by MetaMap's Data File Builder. Original Author: Willie Rogers, 09jan2012 """ from rdflib import ConjunctiveGraph, Graph from rdflib import Namespace, URIRef from string import join from readrdf import readrdf import re import sys from mwi_utilities import normalize_ast_string efo_datafile = '/usr/local/pub/rdf/EFO_inferred_v2.18.owl' EFO = Namespace("http://www.ebi.ac.uk/efo/") RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#") RDF = Namespace('http://www.w3.org/1999/02/22-rdf-syntax-ns#') OWL = Namespace("http://www.w3.org/2002/07/owl#") BFO = Namespace("http://www.ifomis.org/bfo/1.1/snap#MaterialEntity") mmencoding='ascii' prefixdict = { 'http://www.ebi.ac.uk/efo/' : 'EFO', 'http://purl.org/obo/owl/GO#' : 'GO', 'http://purl.org/obo/owl/NCBITaxon#' : 'NCBITaxon', 'http://purl.org/obo/owl/CL#' : 'CL', 'http://www.ebi.ac.uk/chebi/searchId.do?chebiId=' : 'CHEBI', 'http://purl.obolibrary.org/obo/' : 'OBO', 'http://www.geneontology.org/formats/oboInOwl#' : 'oboInOwl', 'http://purl.org/obo/owl/UO#' : 'UO', 'http://www.ifomis.org/bfo/1.1/snap#' : 'SNAP', 'http://purl.org/obo/owl/PATO' : 'PATO', 'http://www.ifomis.org/bfo/1.1/span#' : 'SPAN', } prefixlist = [ 'http://www.ebi.ac.uk/efo/', 'http://purl.org/obo/owl/GO#', 'http://purl.org/obo/owl/NCBITaxon#', 'http://purl.org/obo/owl/CL#', 'http://www.ebi.ac.uk/chebi/searchId.do?chebiId=', 'http://purl.obolibrary.org/obo/', 'http://www.geneontology.org/formats/oboInOwl#', 'http://purl.org/obo/owl/UO#', 'http://www.ifomis.org/bfo/1.1/snap#', 'http://purl.org/obo/owl/PATO', 'http://www.ifomis.org/bfo/1.1/span#', ] srclist = [ 'EFO', 'GO', 'NCBITaxon', 'CL', 'CHEBI', 'OBO', 'oboInOwl', 'UO', 'SNAP', 'PATO', 'SPAN'] semtypes = [('T045','Genetic Function','genf'), ('T028','Gene or Genome','gegm'), ('T116','Amino Acid, Peptide, or 
Protein','aapp'),] def list_id_and_labels(graph): for s,p,o in graph: if p.__str__ == 'http://www.w3.org/2000/01/rdf-schema#label': print(s,p,o) def query_labels(graph): ns = dict(efo=EFO,rdfs=RDFS) return graph.query('SELECT ?aname ?bname WHERE { ?a rdfs:label ?b }', initNs=ns) def query_efo_type(graph, typename): ns = dict(efo=EFO, rdfs=RDFS) return graph.query('SELECT ?aname ?bname WHERE { ?a efo:%s ?b }' % typename, initNs=ns) def query_efo_concepts(graph): ns = dict(owl=OWL, rdf=RDF) return graph.query('SELECT ?s { ?s rdf:type owl:Class }', initNs=ns) def query_efo_concept_labels(graph): " return all labels (concepts) from graph (EFO_inferred_v2.18.owl)" ns = dict(owl=OWL, rdf=RDF, rdfs=RDFS) return graph.query('SELECT ?s ?o { ?s rdf:type owl:Class . ?s rdfs:label ?o }', initNs=ns) def query_efo_concept_synonyms(graph): " return all alternative_terms (synonyms) from graph (EFO_inferred_v2.18.owl)" ns = dict(owl=OWL, rdf=RDF, efo=EFO) return graph.query('SELECT ?s ?o { ?s rdf:type owl:Class . 
?s efo:alternative_term ?o }', initNs=ns) def escape_re_chars(prefix): " escape regular expression special characters in url" return prefix.replace('?','\?').replace('#','\#') def abbrev_uri_original(uri): " remove prefix from uri leaving unique part of identifier " for prefix in prefixlist: m = re.match(r"(%s)(.*)" % escape_re_chars(prefix),uri) if m != None: return m.group(2).replace(':','_') return uri def abbrev_uri(uri): " remove prefix from uri leaving unique part of identifier " for prefix in prefixlist: # print('prefix = ''%s''' % prefix) m = uri.find(prefix) if m == 0: newuri = uri[len(prefix):].replace(':','_') if newuri.find(':') >= 0: print("problem with abbreviated uri: %s" % newuri) return newuri return uri def get_source_name_original(uri): " derive source name from uri " for prefix in prefixdict.keys(): m = re.match(r"(%s)(.*)" % escape_re_chars(prefix),uri) if m != None: return prefixdict[m.group(1)] return uri def get_source_name(uri): " derive source name from uri " for prefix in prefixdict.keys(): m = uri.find(prefix) if m == 0: return prefixdict[uri[0:len(prefix)]] return uri def collect_concepts(graph): """ Return dictionaries (maps) of concepts and synonyms (alternative_terms) from results of SPARQL queries """ cdict = {} conceptresult=query_efo_concept_labels(graph) serialnum = 1 for row in conceptresult: key = tuple(row)[0].__str__() if cdict.has_key(key): cdict[key].append(row) else: cdict[key] = [row] syndict = {} synonymresult = query_efo_concept_synonyms(graph) for row in synonymresult: key = tuple(row)[0].__str__() if syndict.has_key(key): syndict[key].append(row) else: syndict[key] = [row] return cdict,syndict def is_valid_cui(cui): return re.match(r"[A-Za-z]+[\_]*[0-9]+" , cui) def gen_mrcon_original(graph,filename): """ Generate UMLS format MRCON table. 
return rows of the form: EFO_0003549|ENG|P|L0000001|PF|S0000001|caudal tuberculum|0| EFO_0003549|ENG|S|L0000002|SY|S0000002|posterior tubercle|0| EFO_0003549|ENG|S|L0000003|SY|S0000003|posterior tuberculum|0| """ conceptresult=query_efo_concept_labels(graph) fp = open(filename,'w') serialnum = 1 for row in conceptresult: if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): fp.write("%s|ENG|P|L%07d|PF|S%07d|%s|0|\n" % (abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')), serialnum, serialnum, tuple(row)[1].encode(mmencoding, 'replace'))) serialnum = serialnum + 1 fp.close() def gen_mrcon(filename, cdict={}, syndict={}, strdict={}, luidict={}): """ Generate UMLS format MRCON (concepts) table. return rows of the form: EFO_0003549|ENG|P|L0000001|PF|S0000001|caudal tuberculum|0| EFO_0003549|ENG|S|L0000002|SY|S0000002|posterior tubercle|0| EFO_0003549|ENG|S|L0000003|SY|S0000003|posterior tuberculum|0| """ fp = open(filename,'w') serialnum = 1 for key in cdict.keys(): for row in cdict[key]: if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): term = tuple(row)[1].encode(mmencoding, 'replace').strip() lui = get_lui(luidict,term) if lui == None: sys.stderr.write('gen_mrso: LUI missing for prefname %s\n' % (term)) sui = strdict.get(term) if sui == None: sys.stderr.write('gen_mrcon:SUI missing for preferred name %s\n' % (term)) fp.write("%s|ENG|P|%s|PF|%s|%s|0|\n" % (abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')), lui, sui, term)) serialnum = serialnum + 1 if syndict.has_key(key): if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): for row in syndict[key]: term = tuple(row)[1].encode(mmencoding, 'replace').strip() lui = get_lui(luidict,term) if lui == None: sys.stderr.write('gen_mrso: LUI missing for synonym %s\n' % (term)) sui = strdict.get(term) if sui == None: sys.stderr.write('gen_mrcon:SUI missing for synonym %s\n' % (term)) fp.write("%s|ENG|S|%s|SY|%s|%s|0|\n" % 
(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')), lui, sui, term)) serialnum = serialnum + 1 fp.close() def gen_mrso(filename, cdict={}, syndict={}, strdict={}, luidict={}): """ Generate UMLS format MRSO (sources) table. EFO_0003549|L0000001|S0000001|EFO|PT|http://www.ebi.ac.uk/efo/EFO_0003549|0| EFO_0003549|L0000002|S0000002|EFO|SY|http://www.ebi.ac.uk/efo/EFO_0003549|0| EFO_0003549|L0000003|S0000003|EFO|SY|http://www.ebi.ac.uk/efo/EFO_0003549|0| """ fp = open(filename,'w') serialnum = 1 for key in cdict.keys(): for row in cdict[key]: if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): term = tuple(row)[1].encode(mmencoding, 'replace').strip() lui = get_lui(luidict,term) if lui == None: sys.stderr.write('gen_mrso: LUI missing for prefname %s\n' % (term)) sui = strdict.get(term) if sui == None: sys.stderr.write('gen_mrso: SUI missing for prefname %s\n' % (term)) fp.write("%s|%s|%s|%s|PT|%s|0|\n" % (abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')), lui, sui, get_source_name(tuple(row)[0].encode(mmencoding, 'replace')), tuple(row)[0].encode(mmencoding, 'replace'))) serialnum = serialnum + 1 if syndict.has_key(key): for row in syndict[key]: if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): term = tuple(row)[1].encode(mmencoding, 'replace').strip() lui = get_lui(luidict,term) if lui == None: sys.stderr.write('gen_mrso: LUI missing for synonym %s\n' % (term)) sui = strdict.get(term) if sui == None: sys.stderr.write('SUI missing for synonym %s\n' % (term)) fp.write("%s|%s|%s|%s|SY|%s|0|\n" % (abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')), lui, sui, get_source_name(tuple(row)[0].encode(mmencoding, 'replace')), tuple(row)[0].encode(mmencoding, 'replace'))) serialnum = serialnum + 1 fp.close() def gen_mrconso(filename, cdict={}, syndict={}, auidict={}, strdict={}, luidict={}): """ Generate UMLS RRF format MRCONSO (concept+sources) table 
cui|lat|ts|lui|stt|sui|ispref|aui|saui|scui|sdui|sab|tty|code|str|srl|suppress|cvf EFO0003549|ENG|P|L0000001|PF|S0000001|Y|A0003549||||EFO|PT|http://www.ebi.ac.uk/efo/EFO_0003549|caudal tuberculum|0|?|N|| """ fp = open(filename,'w') serialnum = 1 for key in cdict.keys(): for row in cdict[key]: uri = tuple(row)[0].encode(mmencoding, 'replace').strip() if is_valid_cui(abbrev_uri(uri)): sab = get_source_name(uri) term = tuple(row)[1].encode(mmencoding, 'replace').strip() aui = auidict.get((term,sab)) if aui == None: sys.stderr.write('gen_mrconso:AUI missing for preferred name %s,%s\n' % (term,sab)) lui = get_lui(luidict,term) if lui == None: sys.stderr.write('gen_mrso: LUI missing for prefname %s\n' % (term)) sui = strdict.get(term) if sui == None: sys.stderr.write('gen_mrconso:SUI missing for preferred name %s\n' % (term)) fp.write("%s|ENG|P|%s|PF|%s|Y|%s||||%s|PT|%s|%s|0|N||\n" % \ (abbrev_uri(uri),lui,sui, aui,sab,uri,term)) serialnum = serialnum + 1 if syndict.has_key(key): uri = tuple(row)[0].encode(mmencoding, 'replace').strip() if is_valid_cui(abbrev_uri(uri)): for row in syndict[key]: sab = get_source_name(uri) term = tuple(row)[1].encode(mmencoding, 'replace').strip() aui = auidict.get((term,sab)) if aui == None: sys.stderr.write('gen_mrconso:AUI missing for synonym %s,%s\n' % (term,sab)) lui = get_lui(luidict,term) if lui == None: sys.stderr.write('gen_mrso: LUI missing for synonym %s\n' % (term)) sui = strdict.get(term) if sui == None: sys.stderr.write('gen_mrconso:SUI missing for synonym %s\n' % (term)) fp.write("%s|ENG|S|%s|SY|%s|Y|%s||||%s|PT|%s|%s|0|N||\n" % \ (abbrev_uri(uri),lui,sui, aui,sab,uri,term)) serialnum = serialnum + 1 fp.close() def get_semantic_typeid(uri): """ return semantic type id for uri, currently all uris belong to the unknown semantic type. 
""" return 'T205' def get_semantic_typeabbrev(uri): """ return semantic type abbreviation for uri, currently all uris belong to the unknown semantic type.""" return 'unkn' def get_semantic_typename(uri): """ return semantic type name for uri, currently all uris belong to the unknown semantic type.""" return "Unknown" def get_semantic_typetree_number(uri): """ return semantic tree number for uri, currently all uris belong to the unknown semantic type.""" return "A0.0.0.0.0.0" def get_semantic_typeui(uri): """ return semantic tree number for uri, currently all uris belong to the unknown semantic type.""" return "AT0000000" def gen_mrsty(filename, cdict={}, syndict={}): """ Generate UMLS ORF format MRSTY (semantic type) table. Currently, all of the concepts are assigned the semantic type "unkn". """ fp = open(filename,'w') for key in cdict.keys(): for row in cdict[key]: if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): fp.write("%s|%s|%s|\n" % (abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')), get_semantic_typeid(tuple(row)[0].encode(mmencoding, 'replace')), get_semantic_typeabbrev(tuple(row)[0].encode(mmencoding, 'replace')))) fp.close() def gen_mrsty_rrf(filename, cdict={}, syndict={}): """ Generate UMLS ORF format MRSTY (semantic type) table. Currently, all of the concepts are assigned the semantic type "unkn". 
MRSTY.RFF contaons lines like C0000005|T116|A1.4.1.2.1.7|Amino Acid, Peptide, or Protein|AT17648347|| C0000005|T121|A1.4.1.1.1|Pharmacologic Substance|AT17575038|| C0000005|T130|A1.4.1.1.4|Indicator, Reagent, or Diagnostic Aid|AT17634323|| C0000039|T119|A1.4.1.2.1.9|Lipid|AT17617573|256| C0000039|T121|A1.4.1.1.1|Pharmacologic Substance|AT17567371|256| """ fp = open(filename,'w') for key in cdict.keys(): for row in cdict[key]: if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): fp.write("%s|%s|%s|%s|%s||\n" % (abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')), get_semantic_typeid(tuple(row)[0].encode(mmencoding, 'replace')), get_semantic_typetree_number(tuple(row)[0].encode(mmencoding, 'replace')), get_semantic_typename(tuple(row)[0].encode(mmencoding, 'replace')), get_semantic_typeui(tuple(row)[0].encode(mmencoding, 'replace')))) fp.close() def gen_mrsat(filename, cdict={}, syndict={}, auidict={}, strdict={}, luidict={}): """ Generate UMLS format MRSAT (Simple Concept and String Attributes) table. Currently, empty. """ fp = open(filename,'w') fp.close() def gen_mrsab(filename,cdict={},syndict={}): """ Generate UMLS format MRSAB (Source Informatino) table. Currently, empty. 
""" cui_index = 4000000 fp = open(filename,'w') for k,v in prefixdict.items(): rcui=vcui='C%7s' % cui_index vsab=rsab=v son=k sf=vstart=vend=imeta=rmeta=slc=scc=ssn=scit='' srl='0' if len(cdict) > 0: # count concepts that belong to source cres=filter(lambda x: re.match(r"(%s)(.*)" % k,x[0].__str__()), cdict.items()) cfr='%d' % len(cres) if len(syndict) > 0: # count synonyms that belong to source sres=filter(lambda x: re.match(r"(%s)(.*)" % k,x[0].__str__()), syndict.items()) tfr='%d' % (len(cres)+len(sres)) else: tfr='%d' % len(cres) else: cfr=tfr='' cxty=ttyl=atnl='' lat='ENG' cenc='ascii' curver=sabin='Y' fp.write('%s\n' % \ join((vcui,rcui,vsab,rsab,son,sf,vstart,vend, imeta,rmeta,slc,scc,srl,tfr,cfr,cxty, ttyl,atnl,lat,cenc,curver,sabin,ssn,scit),'|')) cui_index = cui_index + 1 fp.close() def gen_mrrank(filename): """ Generate UMLS format MRRANK (Concept Name Ranking) table. """ ttylist = ['PT', 'SY'] pred = 400 fp = open(filename, 'w') for sab in srclist: for tty in ttylist: fp.write('%04d|%s|%s|N|\n' % (pred,sab,tty)) pred = pred - 1 fp.close() def print_result(result): for row in result: print("%s|%s" % (tuple(row)[0],tuple(row)[1])) def write_result(result, filename): f = open(filename, 'w') for row in result: f.write(('%s\n' % join((tuple(row)[0],tuple(row)[1]), '|')).encode(mmencoding, 'replace')) f.close() def gen_mrcon_list(cdict={}, syndict={}): """ return rows of the form: EFO_0003549|ENG|P|L0000001|PF|S0000001|caudal tuberculum|0| EFO_0003549|ENG|S|L0000002|SY|S0000002|posterior tubercle|0| EFO_0003549|ENG|S|L0000003|SY|S0000003|posterior tuberculum|0| """ mrconlist = [] serialnum = 1 for key in cdict.keys(): for row in cdict[key]: if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): # "%s|ENG|P|L%07d|PF|S%07d|%s|0|\n" mrconlist.append((abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')),'ENG','P', 'L%07d' % serialnum, 'PF', 'S%07d' % serialnum, tuple(row)[1].encode(mmencoding, 'replace'),'0','')) serialnum = serialnum + 1 if 
syndict.has_key(key): for row in syndict[key]: if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): # "%s|ENG|S|L%07d|SY|S%07d|%s|0|\n" mrconlist.append((abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace')),'ENG','P', 'L%07d' % serialnum, 'SY', 'S%07d' % serialnum, tuple(row)[1].encode(mmencoding, 'replace'),'0','')) serialnum = serialnum + 1 return mrconlist def gen_mrconso_list(cdict={}, syndict={}, auidict={}): """ return rows of the form: EFO_0003549|ENG|P|L0000001|PF|S0000001||||caudal tuberculum|0| EFO_0003549|ENG|S|L0000002|SY|S0000002|posterior tubercle|0| EFO_0003549|ENG|S|L0000003|SY|S0000003|posterior tuberculum|0| """ # mrconlist = [] # serialnum = 1 # for key in cdict.keys(): # for row in cdict[key]: # if is_valid_cui(abbrev_uri(tuple(row)[0].encode(mmencoding, 'replace'))): pass def gen_strdict(cdict, syndict): """ Generate dict of concept and synonym triples mapped by string. """ strdict = {} for triplelist in cdict.values(): for triple in triplelist: strkey = triple[1].__str__() if strdict.has_key(strkey): strdict[strkey].append(triple) else: strdict[strkey] = [triple] for triplelist in syndict.values(): for triple in triplelist: strkey = triple[1].__str__() if strdict.has_key(strkey): strdict[strkey].append(triple) else: strdict[strkey] = [triple] return strdict def gen_nmstrdict(cdict, syndict): """ Generate dict of concept and synonym triples mapped by nomalized string. """ strdict = {} for triplelist in cdict.values(): for triple in triplelist: strkey = normalize_ast_string(triple[1].__str__()) if strdict.has_key(strkey): strdict[strkey].append(triple) else: strdict[strkey] = [triple] for triplelist in syndict.values(): for triple in triplelist: strkey = normalize_ast_string(triple[1].__str__()) if strdict.has_key(strkey): strdict[strkey].append(triple) else: strdict[strkey] = [triple] return strdict def gen_strdict_histogram(strdict): """ Generate histrogram of lengths of string dictionary values. 
""" histogram = {} for v in strdict.values(): key = '%d' % len(v) if histogram.has_key(key): histogram[key] += 1 else: histogram[key] = 1 return histogram def gen_strdict_listsizedict(strdict): sizedict = {} for k,v in strdict.items(): key = '%d' % len(v) if sizedict.has_key(key): sizedict[key].append(k) else: sizedict[key] = [k] return sizedict def gen_aui_dict(cdict=[],syndict=[],auiprefix='A',offset=0): """ A simple way to generate atom unique identifiers (AUIS): 1. Generate list of strings + vocabulary source from ontology. 2. Sort list 3. assign auis in descending order of sorted list. cdict: concept dictionary syndict: synonym dictionary auiprefix: prefix for Atom identifiers, usually "A" for standalone DataSets, should be "B" for dataset to be used with UMLS. A can be used if range new identifier space is outside of existing UMLS atom identifier space. offset=start of range for identifiers, default is zero """ aset=set([]) auidict={} for cstr in cdict.keys(): if cstr == 'http://www.ebi.ac.uk/efo/EFO_0000694': print("%s -> %s" % (cstr,'is SARS ')) prefterm = cdict[cstr][0][1].strip().encode(mmencoding, 'replace') sab = get_source_name(cstr.encode(mmencoding, 'replace')) if prefterm == 'SARS': print('%s --> pref: %s,%s' % (cstr, prefterm, sab)) aset.add((prefterm, sab)) if syndict.has_key(cstr): for row in syndict[cstr]: synonym = row[1].strip().encode(mmencoding, 'replace') if synonym == 'SARS': print('%s --> syn: %s,%s' % (cstr, synonym, sab)) sab = get_source_name(cstr.encode(mmencoding, 'replace')) aset.add((synonym, sab)) alist = [x for x in aset] alist.sort() i = offset for atom in alist: auidict[atom] = '%s%08d' % (auiprefix,i) i = i + 1 return auidict def gen_sui_dict(cdict=[],syndict=[],suiprefix='S',offset=0): """ A simple way to generate String Unique Identifiers(SUIS): 1. Generate list of strings + vocabulary source from ontology. 2. Sort list 3. assign auis in descending order of sorted list. 
cdict: concept dictionary syndict: synonym dictionary suiprefix: prefix for string identifiers, usually "S" for standalone DataSets, should be "T" for dataset to be used with UMLS, "S" can be used if range new identifier space is outside of existing UMLS string identifier space. offset=start of range for identifiers, default is zero """ sset=set([]) suidict={} for cstr in cdict.keys(): if cstr == 'http://www.ebi.ac.uk/efo/EFO_0000694': print("%s -> %s" % (cstr,'is SARS ')) prefterm = cdict[cstr][0][1].strip().encode(mmencoding, 'replace') sset.add(prefterm) if prefterm == 'SARS': print('%s --> pref: %s' % (cstr, prefterm)) for cstr in syndict.keys(): if cstr == 'http://www.ebi.ac.uk/efo/EFO_0000694': print("%s -> %s" % (cstr,'is SARS ')) for row in syndict[cstr]: synonym = row[1].strip().encode(mmencoding, 'replace') sset.add(synonym) if synonym == 'SARS': print('%s --> syn: %s' % (cstr, synonym)) slist = [x for x in sset] slist.sort() i = offset for mstring in slist: suidict[mstring] = '%s%08d' % (suiprefix,i) i = i + 1 return suidict def gen_lui_dict(cdict=[],syndict=[],luiprefix='L',offset=0): """ A simple way to generate Lexical Unique Identifiers(SUIS): 1. Generate list of strings + vocabulary source from ontology. 2. Sort list 3. assign auis in descending order of sorted list. cdict: concept dictionary syndict: synonym dictionary suiprefix: prefix for string identifiers, usually "L" for standalone DataSets, should be "M" for dataset to be used with UMLS, "L" can be used if range new identifier space is outside of existing UMLS string identifier space. 
offset=start of range for identifiers, default is zero """ nasset=set([]) luidict={} for cstr in cdict.keys(): if cstr == 'http://www.ebi.ac.uk/efo/EFO_0000694': print("%s -> %s" % (cstr,'is SARS ')) prefterm = cdict[cstr][0][1].strip().encode(mmencoding, 'replace') nasset.add(normalize_ast_string(prefterm)) if prefterm == 'SARS': print('%s --> pref: %s' % (cstr, prefterm)) for cstr in syndict.keys(): if cstr == 'http://www.ebi.ac.uk/efo/EFO_0000694': print("%s -> %s" % (cstr,'is SARS ')) for row in syndict[cstr]: synonym = row[1].strip().encode(mmencoding, 'replace') nasset.add(normalize_ast_string(synonym)) if synonym == 'SARS': print('%s --> syn: %s' % (cstr, synonym)) naslist = [x for x in nasset] naslist.sort() i = offset for nasstring in naslist: luidict[nasstring] = '%s%08d' % (luiprefix,i) i = i + 1 return luidict def get_lui(luidict, mstring): """ get lui for un-normalized string from lui dictionary """ return luidict.get(normalize_ast_string(mstring), 'LUI unknown') def print_couples(alist): for el in alist: print("%s: %s" % (el[0].__str__(),el[1].__str__())) def process(rdffilename): print('reading %s' % rdffilename) graph=readrdf(rdffilename) print('finding concepts and synonyms') cdict,syndict = collect_concepts(graph) print('Generating Atom Unique Identifier Dictionary') auidict = gen_aui_dict(cdict,syndict) print('Generating String Unique Identifier Dictionary') suidict = gen_sui_dict(cdict,syndict) print('Generating Lexical Unique Identifier Dictionary') luidict = gen_lui_dict(cdict,syndict) #rrf print('generating MRCONSO.RRF') gen_mrconso('MRCONSO.RRF',cdict,syndict,auidict,suidict,luidict) #orf print('generating MRCON') gen_mrcon('MRCON',cdict,syndict,suidict,luidict) print('generating MRSO') gen_mrso('MRSO',cdict,syndict,suidict,luidict) #both rrf and orf print('generating MRSAB') gen_mrsab('MRSAB.RRF',cdict,syndict) print('generating MRRANK') gen_mrrank('MRRANK.RRF') print('generating MRSAT') gen_mrsat('MRSAT.RRF',cdict,syndict) 
print('generating MRSTY') gen_mrsty('MRSTY',cdict,syndict) print('generating MRSTY.RRF') gen_mrsty_rrf('MRSTY.RRF',cdict,syndict) if __name__ == "__main__": print('reading %s' % efo_datafile) process(efo_datafile)
""" readrdf.py -- rdf utilities """ from rdflib import Graph from string import join import sys def readrdf(filename): graph = Graph() graph.parse(filename) return graph def print_triples(graph): for s,p,o in graph: print s,p,o def print_piped_triples(graph): for s,p,o in graph: print join([s,p,o], '|') def doit(filename): graph = readrdf(filename) print_triples(graph) return graph def save_graph(graph, filename, format='n3'): f = open(filename, 'w') f.write(graph.serialize(None, format)) f.close() def write_piped_triples(graph, filename): f = open(filename, 'w') for s,p,o in graph: f.write(('%s\n' % join([s,p,o], '|')).encode('utf-8')) f.close()
""" MetaWordIndex Utilities. Currently only contains a Python implementation of the Prolog predicate normalize_meta_string/2. Translated from Alan Aronson's original prolog version: mwi_utilities.pl. Created: Fri Dec 11 10:10:23 2008 @author <a href="mailto:wrogers@nlm.nih.gov">Willie Rogers</a> @version $Id: MWIUtilities.java,v 1.2 2005/03/04 16:11:12 wrogers Exp $ """ import sys import string import nls_strings from metamap_tokenization import tokenize_text_utterly, is_ws_word, ends_with_s def normalize_meta_string(metastring): """ Normalize metathesaurus string. normalizeMetaString(String) performs "normalization" on String to produce NormalizedMetaString. The purpose of normalization is to detect strings which are effectively the same. The normalization process (also called lexical filtering) consists of the following steps: * removal of (left []) parentheticals; * removal of multiple meaning designators (<n>); * NOS normalization; * syntactic uninversion; * conversion to lowercase; * replacement of hyphens with spaces; and * stripping of possessives. Some right parentheticals used to be stripped, but no longer are. Lexical Filtering Examples: The concept "Abdomen" has strings "ABDOMEN" and "Abdomen, NOS". Similarly, the concept "Lung Cancer" has string "Cancer, Lung". And the concept "1,4-alpha-Glucan Branching Enzyme" has a string "1,4 alpha Glucan Branching Enzyme". Note that the order in which the various normalizations occur is important. The above order is correct. important; e.g., parentheticals must be removed before either lowercasing or normalized syntactic uninversion (which includes NOS normalization) are performed. @param string meta string to normalize. @return normalized meta string. 
""" try: pstring = remove_left_parentheticals(metastring.strip()) un_pstring = nls_strings.normalized_syntactic_uninvert_string(pstring) lc_un_pstring = un_pstring.lower() hlc_un_pstring = remove_hyphens(lc_un_pstring) norm_string = strip_possessives(hlc_un_pstring) return norm_string.strip() except TypeError,te: print('%s: string: %s' % (te,metastring)) left_parenthetical = ["[X]","[V]","[D]","[M]","[EDTA]","[SO]","[Q]"] def remove_left_parentheticals(astring): """ remove_left_parentheticals(+String, -ModifiedString) remove_left_parentheticals/2 removes all left parentheticals (see left_parenthetical/1) from String. ModifiedString is what is left. """ for lp in left_parenthetical: if astring.find(lp) == 0: return astring[len(lp):].strip() return astring def remove_hyphens(astring): """ remove_hyphens/2 removes hyphens from String and removes extra blanks to produce ModifiedString. """ return remove_extra_blanks(astring.replace('-', ' ')).strip() # def remove_possessives(tokens): # """ remove_possessives/2 filters out possessives # from the results of tokenize_text_utterly/2. """ # if len(tokens) > 2: # if is_ws_word(tokens[0]) & (tokens[1] == "'") & (tokens[2] == "s"): # return tokens[0:1] + remove_possessives(tokens[3:]) # else: # return tokens[0:1] + remove_possessives(tokens[1:]) # elif len(tokens) > 1: # if is_ws_word(tokens[0]) & (tokens[1] == "'") & ends_with_s(tokens[0]): # return tokens[0:1] + remove_possessives(tokens[2:]) # else: # return tokens[0:1] + remove_possessives(tokens[1:]) # return tokens # def remove_possessives(tokens): # modtokens=[] # for token in tokens: # if token[-2:] == "'s": # modtokens.append(token[:-2]) # elif token[-2:] == "s'": # modtokens.append(token[:-1]) # else: # modtokens.append(token) # return modtokens # def remove_possessives(tokens): # """ remove_possessives/2 filters out possessives # from the results of tokenize_text_utterly/2. 
""" # if len(tokens) > 1: # if is_ws_word(tokens[0]) & (tokens[1] == "'"): # if tokens[0].endswith('s'): # return tokens[0:1] + remove_possessives(tokens[2:]) # elif len(tokens) > 2: # if (tokens[2] == "s"): # return tokens[0:1] + remove_possessives(tokens[3:]) # else: # return tokens[0:1] + remove_possessives(tokens[1:]) # else: # return tokens[0:1] + remove_possessives(tokens[1:]) # else: # return tokens[0:1] + remove_possessives(tokens[1:]) # return tokens def is_quoted_string(tokens, i): pass def is_apostrophe_s(tokens, i): if i+2 < len(tokens): if is_ws_word(tokens[i]) & (tokens[i+1] == "'") & (tokens[i+2] == "s"): if i+3 < len(tokens): if string.punctuation.find(tokens[i+3][0]) >= 0: return False return True return False def is_s_apostrophe(tokens, i): if i+1 < len(tokens): if is_ws_word(tokens[i]) & tokens[i].endswith('s') & (tokens[i+1] == "'"): return True return False def remove_possessives(tokens): """ remove_possessives/2 filters out possessives from the results of tokenize_text_utterly/2. EBNF for possessives using tokenization of original prolog predicates: tokenlist -> token tokenlist | possessive tokenlist ; quoted_string -> "'" tokenlist "'" ; possessive --> apostrophe_s_possessive | s_apostrophe_possessive ; apostrophe_s_possessive -> alnum_word "'" "s" ; s_apostrophe_possessive -> alnum_word_ending_with_s "'" ; """ i = 0 newtokens = [] while i < len(tokens): if is_apostrophe_s(tokens, i): newtokens.append(tokens[i]) i+=3 elif is_s_apostrophe(tokens, i): newtokens.append(tokens[i]) i+=2 else: newtokens.append(tokens[i]) i+=1 return newtokens def strip_possessives(astring): """ strip_possessives/2 tokenizes String, uses metamap_tokenization:remove_possessives/2, and then rebuilds StrippedString. 
*/ """ tokens = tokenize_text_utterly(astring) stripped_tokens = remove_possessives(tokens) if tokens == stripped_tokens: return astring else: return string.join(stripped_tokens,'') # def strip_possessives(astring): # if astring.find("''s") >= 0: # rstring2 = astring.replace("''s","|dps|") # rstring1 = rstring2.replace("'s","") # rstring0 = rstring1.replace("|dps|","''s") # else: # rstring0 = astring.replace("'s","") # return rstring0.replace("s'","s") def remove_extra_blanks(astring): """ remove extra inter-token blanks """ return string.join(astring.split()) def normalize_ast_string(aststring): """ similar to normalize_meta_string except hyphens are not removed normalizeAstString(String) performs "normalization" on String to produce NormalizedMetaString. The purpose of normalization is to detect strings which are effectively the same. The normalization process (also called lexical filtering) consists of the following steps: * syntactic uninversion; * conversion to lowercase; * stripping of possessives. """ un_pstring = nls_strings.syntactic_uninvert_string(aststring) lc_un_pstring = un_pstring.lower() norm_string = strip_possessives(lc_un_pstring) return norm_string if __name__ == '__main__': # a test fixture to test normalizeMetaString method. if len(sys.argv) > 1: print('"%s"' % normalize_meta_string(string.join(sys.argv[1:], ' '))) # fin: mwi_utilities
""" metamap_tokenization.py -- Purpose: MetaMap tokenization routines includes implementation of tokenize_text_utterly """ def lex_word(text): token = text[0] i = 1 if i < len(text): ch = text[i] # while (ch.isalpha() or ch.isdigit() or (ch == "'")) and (i < len(text)): while (ch.isalpha() or ch.isdigit()) and (i < len(text)): token = token + ch i+=1 if i < len(text): ch = text[i] return token, text[i:] def tokenize_text_utterly(text): tokens = [] rest = text try: ch = rest[0] while len(rest) > 0: if ch.isalpha(): token,rest = lex_word(rest) tokens.append(token) else: tokens.append(ch) rest = rest[1:] if len(rest) > 0: ch = rest[0] return tokens except IndexError, e: print("%s: %s" % (e,text)) # tokenize_text_utterly('For 8-bit strings, this method is locale-dependent.') def is_ws_word(astring): return astring.isalnum() & (len(astring) > 1) def ends_with_s(astring): return astring[-1] == 's'
""" File: nls_strings.py Module: NLS Strings Author: Lan (translated to Python by Willie Rogers Purpose: Provide miscellaneous string manipulation routines. Source: strings_lra.pl """ import lex import sets import sys from metamap_tokenization import tokenize_text_utterly def normalized_syntactic_uninvert_string(pstring): """ normalized version of uninverted string. """ normstring = normalize_string(pstring) normuninvstring = syntactic_uninvert_string(normstring) return normuninvstring def normalize_string(astring): """ Normalize string. Elminate multiple meaning designators and "NOS" strings. First eliminates multiple meaning designators (<n>) and then eliminates all forms of NOS. """ string1 = eliminate_multiple_meaning_designator_string(astring) norm_string = eliminate_nos_string(string1) if (len(norm_string.strip()) > 0): return norm_string; else: return astring def eliminate_multiple_meaning_designator_string(astring): """ Remove multiple meaning designators; method removes an expression of the form <n> where n is an integer from input string. The modified string is returned. """ try: if (astring.find('<') >= 0) & (astring.find('>') > astring.find('<')): if astring[astring.find('<') + 1:astring.find('>')].isdigit(): return astring[0:astring.index('<')] + eliminate_multiple_meaning_designator_string(astring[astring.index('>')+1:]).strip() else: return astring else: return astring except ValueError, e: sys.stderr.write('%s: astring="%s"\n' % (e,astring)) sys.exit() def eliminate_nos_string(astring): """ Eliminate NOS String if present. 
""" norm_string0 = eliminate_nos_acros(astring) return eliminate_nos_expansion(norm_string0).lower() nos_strings = [ ", NOS", " - NOS", " NOS", ".NOS", " - (NOS)", " (NOS)", "/NOS", "_NOS", ",NOS", "-NOS", ")NOS" ] # "; NOS", def eliminate_nos_acros(astring): # split_string_backtrack(String,"NOS",Left,Right), charindex = astring.find("NOS"); if charindex >= 0: left = astring[0:max(charindex, 0)] right = astring[charindex+3: max(charindex+3, len(astring))] if ((len(right) != 0) and right[0].isalnum()) and ((len(left) != 0) and left[len(left)-1].isalpha()): charindex = astring.find("NOS", charindex+1) if charindex == -1: return astring; for nos_string in nos_strings: charindex = astring.find(nos_string) if charindex >= 0: left2 = astring[0: max(charindex, 0)] right2 = astring[charindex+len(nos_string):max(charindex+len(nos_string),len(astring))] if nos_string == ")NOS": return eliminate_nos_acros(left2 + ")" + right2) elif nos_string == ".NOS": return eliminate_nos_acros(left2 + "." + right2); elif (nos_string == " NOS"): if len(right2) > 0: if right2[0].isalnum(): return left2 + " NOS" + eliminate_nos_acros(right2) if not abgn_form(astring[charindex+1:]): return eliminate_nos_acros(left2 + right2) elif nos_string == "-NOS": return eliminate_nos_acros(left2 + right2) else: return eliminate_nos_acros(left2 + right2); return astring; abgn_forms = sets.Set([ "ANB-NOS", "ANB NOS", "C NOS", "CL NOS", # is this right? # C0410315|ENG|s|L0753752|PF|S0970087|Oth.inf.+bone dis-NOS| "NOS AB", "NOS AB:ACNC:PT:SER^DONOR:ORD:AGGL", "NOS-ABN", "NOS ABN", "NOS-AG", "NOS AG", "NOS ANB", "NOS-ANTIBODY", "NOS ANTIBODY", "NOS-ANTIGEN", "NOS ANTIGEN", "NOS GENE", "NOS NRM", "NOS PROTEIN", "NOS-RELATED ANTIGEN", "NOS RELATED ANTIGEN", "NOS1 GENE PRODUCT", "NOS2 GENE PRODUCT", "NOS3 GENE PRODUCT", "NOS MARGN", ]) def abgn_form(astr): """ Determine if a pattern of the form: "NOS ANTIBODY", "NOS AB", etc. exists. if so return true. 
""" return astr in abgn_forms def syntactic_uninvert_string(astring): """ invert strings of the form "word1, word2" to "word2 word1" if no prepositions or conjunction are present. syntactic_uninvert_string calls lex.uninvert on String if it contains ", " and does not contain a preposition or conjunction.""" if contains_prep_or_conj(astring): return astring else: return lex.uninvert(astring) prep_or_conj = sets.Set([ # init preposition and conjunctions "aboard", "about", "across", "after", "against", "aka", "albeit", "along", "alongside", "although", "amid", "amidst", "among", "amongst", "and", # "anti", "around", "as", "astride", "at", "atop", # "bar", "because", "before", "beneath", "beside", "besides", "between", "but", "by", "circa", "contra", "despite", # "down", "during", "ex", "except", "excluding", "failing", "following", "for", "from", "given", "if", "in", "inside", "into", "less", "lest", # "like", # "mid", "minus", # "near", "nearby", "neath", "nor", "notwithstanding", "of", # "off", "on", "once", # "only", "onto", "or", # "out", # "past", "pending", "per", # "plus", "provided", "providing", "regarding", "respecting", # "round", "sans", "sensu", "since", "so", "suppose", "supposing", "than", "though", "throughout", "to", "toward", "towards", "under", "underneath", "unless", "unlike", "until", "unto", "upon", "upside", "versus", "vs", "w", "wanting", "when", "whenever", "where", "whereas", "wherein", "whereof", "whereupon", "wherever", "whether", "while", "whilst", "with", "within", "without", # "worth", "yet", ]) def contains_prep_or_conj(astring): for token in tokenize_text_utterly(astring): if token in prep_or_conj: return True return False nos_expansion_string = [ ", not otherwise specified", "; not otherwise specified", ", but not otherwise specified", " but not otherwise specified", " not otherwise specified", ", not elsewhere specified", "; not elsewhere specified", " not elsewhere specified", "not elsewhere specified" ] def 
eliminate_nos_expansion(astring): """ Eliminate any expansions of NOS """ lcString = astring.lower() for expansion in nos_expansion_string: charindex = lcString.find(expansion) if charindex == 0: return eliminate_nos_expansion(lcString[len(expansion):]) elif charindex > 0: return eliminate_nos_expansion(lcString[0:charindex] + lcString[charindex + len(expansion):]) return astring
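The effect of eliminate_nos_expansion can be seen with a reduced sketch that deletes expansion phrases case-insensitively wherever they occur; strip_nos_expansion_sketch and its two-entry default list are illustrative only, not the module's full table.

```python
def strip_nos_expansion_sketch(s, expansions=(', not otherwise specified',
                                              ' not otherwise specified')):
    """Lowercase the string and delete every listed NOS expansion."""
    t = s.lower()
    for exp in expansions:
        t = t.replace(exp, '')
    return t

strip_nos_expansion_sketch('Abdominal injury, not otherwise specified')
```

The real implementation recurses so that multiple or nested expansions are removed one at a time, but the net effect on typical Metathesaurus strings is the same.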
import sys

COMMA = ','
SPACE = ' '

def uninvert(s):
    """ Recursively uninverts a string, i.e.,
    injury, abdominal ==> abdominal injury.
    Translated directly from uninvert(s,t) in lex.c.
    @param s INPUT: string "s" containing the term to be uninverted.
    @return OUTPUT: string containing the uninverted string.
    """
    if len(s) == 0:
        return s
    sp = s.find(COMMA)
    while sp > 0:
        cp = sp
        cp += 1
        if cp < len(s) and s[cp] == SPACE:
            while cp < len(s) and s[cp] == SPACE:
                cp += 1
            return uninvert(s[cp:]) + " " + s[0:sp].strip()
        else:
            sp += 1
            sp = s.find(COMMA, sp)
    return s.strip()

if __name__ == '__main__':
    if len(sys.argv) > 1:
        print('%s -> %s' % (sys.argv[1], uninvert(sys.argv[1])))
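A compact recursive rendering of the same uninversion logic (using str.partition, and ignoring the original's handling of runs of spaces after the comma) shows the recursion on an inverted term; uninvert_sketch is a hypothetical name.

```python
def uninvert_sketch(s):
    """Uninvert 'word1, word2' to 'word2 word1', recursively, splitting
    at the first comma-space pair."""
    if ', ' not in s:
        return s.strip()
    head, _, tail = s.partition(', ')
    return uninvert_sketch(tail) + ' ' + head.strip()

uninvert_sketch('injury, abdominal')
```

A comma not followed by a space (as in "1,4-alpha-Glucan") is left alone in both versions, which is what keeps chemical names intact.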
This document was generated using AFT v5.098