SMASH (String Metric-Assisted Assessment of Semantic Heterogeneity)
Semantic heterogeneity (SH) is detrimental to data interoperability and integration in healthcare. Assessing SH is difficult, yet fundamental to addressing the problem. Using expert-based and data-driven methods, we assessed SH among HIV-associated data elements (DEs). Using ClinicalTrials.gov, we identified and obtained eight data dictionaries and created a DE inventory. We vectorized DEs by study and developed a new method, String Metric-assisted Assessment of Semantic Heterogeneity (SMASH), to find DEs that were similar in An and Bn, unique to An, or unique to Bn. An HIV expert assessed pairs for semantic equivalence. Heterogeneous DEs were either semantically equivalent but syntactically different (HIV-positive/HIV+/Seropositive) or syntactically equivalent but semantically different ("Partner" [sexual]/"Partner" [relationship]). Context of usage was considered. SMASH aided identification of SH. Of 1,175 DEs from pairs, 1,048 (87%) were semantically heterogeneous and 127 (13%) were homogeneous. Most heterogeneous pairs (97%) were semantically equivalent but syntactically different. Expert-based and data-driven methods are complementary for assessing SH, especially among semantically equivalent but syntactically different DEs. Similar combined expert-based and data-driven solutions are recommended for resolving SH.
The DE vectorization process.
DEs similar in An and Bn, unique to An, and unique to Bn.
The distance metrics and string similarity process.
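The partitioning of DEs into those similar in An and Bn, unique to An, and unique to Bn can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ratio from Python's standard-library difflib.SequenceMatcher as the string metric, and the function names, the 0.8 threshold, and the sample DE names are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def similarity(x: str, y: str) -> float:
    """Score in [0, 1]; a stand-in for whichever string metric a study adopts."""
    return SequenceMatcher(None, normalize(x), normalize(y)).ratio()


def partition(a, b, threshold=0.8):
    """Split two DE lists into candidate similar pairs and per-study leftovers."""
    similar = []
    matched_a, matched_b = set(), set()
    for x in a:
        for y in b:
            score = similarity(x, y)
            if score >= threshold:  # assumed cutoff, chosen only for illustration
                similar.append((x, y, round(score, 2)))
                matched_a.add(x)
                matched_b.add(y)
    unique_a = [x for x in a if x not in matched_a]
    unique_b = [y for y in b if y not in matched_b]
    return similar, unique_a, unique_b


# Hypothetical DE names from two study data dictionaries.
study_a = ["HIV status", "Partner", "Age at enrollment"]
study_b = ["hiv  status", "Partner", "Seropositive"]
similar, unique_a, unique_b = partition(study_a, study_b)
```

With these sample inputs, "Partner" is paired with "Partner" although the two may differ in meaning, and "Seropositive" is left unmatched although it may be equivalent to "HIV status". A string metric alone cannot settle either case, which is why the candidate pairs still go to an expert for semantic review.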
Brown W, Weng C, Vawdrey DK, Carballo-Diéguez A, Bakken S. SMASH: A Data-driven Informatics Method to Assist Experts in Characterizing Semantic Heterogeneity among Data Elements. AMIA Annu Symp Proc. 2016; 2016:1717-1726. PMID: 28269930.
HERO (HIV-associated Entities In Research Ontology)