A rule-based semantic approach for data integration, standardization and dimensionality reduction utilizing the UMLS: Application to predicting bariatric surgery outcomes

Utilization of existing clinical data for improving patient outcomes poses a number of challenging and complex problems, involving lack of data integration, the absence of standardization across inhomogeneous data sources, and computationally demanding and time-consuming exploration of very large datasets. In this paper, we present a robust semantic data integration, standardization and dimensionality reduction method to tackle these problems. Our approach enables the integration of clinical data from diverse sources by resolving canonical inconsistencies and semantic heterogeneity, using the National Library of Medicine's Unified Medical Language System (UMLS) to produce standardized medical data. Through a combined application of rule-based semantic networks and machine learning, our approach enables a large reduction in the dimensionality of the data and thus allows for fast and efficient application of data mining techniques to large clinical datasets. An example application of the techniques developed in our study is presented for the prediction of bariatric surgery outcomes.


Introduction
With the rising interest in utilizing patients' medical history for efficient and effective prediction of clinical outcomes, Clinical Decision Support Systems (CDSS) have become an area of research that shows tremendous potential for enhancing medical care while reducing the associated costs. CDSS aim at helping clinicians utilize the existing medical information and history of patients for improved clinical decision making and thus improved outcomes [ [1], [2], [3]]. CDSS can help a clinician in several ways, including: 1) using a patient's medical history to help decide the most appropriate treatment for the patient, 2) monitoring and recording the patient's medical information before and after the start of treatment and alerting the clinician in case of any changes, and 3) using the interrelationships or findings learned from the medical data of other patients with a particular condition to help in early diagnosis and treatment for future patients [4].
For successful application of a CDSS, integration and standardization of medical data are a necessity, considering the dispersed, heterogeneous nature of the existing medical data [5]. Integration of medical data is challenging because of variations in data entry, as well as imperfections due to human error or unavailability of data [6]. This is one reason cognitive issues are of utmost importance in modern medical informatics [7]. Data cleansing needs to be handled very carefully in the case of medical data because the output of this operation will later be utilized during clinical decision making. Data selection and integration can then be performed on the cleaned data by choosing the relevant data, combining it, and finally presenting and visualizing the data in a format suitable for computer-assisted decision making [8]. Standardization of medical data is as important as in any other field: it helps clinicians in their decision-making process by generating standardized, interoperable and universally accepted medical terms. A standardization tool must map exclusively to standard medical terms [ [9], [10], [11]]. The pressing need for data standardization across diverse Electronic Health Record (EHR) systems was highlighted by the President's Council of Advisors on Science and Technology (PCAST) in a 2010 report titled "Realizing the full potential of health information technology to improve healthcare for Americans: the path forward" [12]. The report argued for improved medical data standardization through a "universal exchange language whose semantics is intrinsically extensible" and for "managing and storing data for advanced data-mining techniques through breaking it down into the smallest individual pieces" [12]. One example of such an interoperable and universal exchange language for medical data is the Q-UEL [ [13], [14], [15]].
Furthermore, with the emergence of big data in medicine, there is an ever-increasing need for methods that enable dimensionality reduction of large data sets. Efficient dimensionality reduction of big medical data is of utmost importance particularly when computationally-demanding machine learning techniques are employed to analyze the data with the goal of improving clinical decision making.
In this study, a robust semantic data integration, standardization and dimensionality reduction approach for clinical data is presented. Our approach enables the integration of clinical data from disparate sources by resolving canonical inconsistencies and semantic heterogeneity, using the National Library of Medicine's Unified Medical Language System (UMLS) to produce standardized medical data. The resulting application, henceforth referred to as RxSem, builds upon our previous preliminary research [16] and enables domain experts (i.e., healthcare and medical professionals) to semantically describe their data needs and integration requirements. Furthermore, the dimensionality reduction techniques developed in our study enable fast and efficient application of machine learning techniques to large medical datasets.
As an illustrative example, our approach is utilized to predict bariatric surgical outcomes employing traditional data mining techniques. The case of bariatric surgery was chosen for this study because of its relevance to obesity, a worldwide epidemic and major health problem facing societies. Obesity is regarded as a complex, multi-factorial chronic disease that develops from an interaction of genotype and the environment and has serious health consequences. The annual health care cost attributable to obesity was about $147 billion in 2008, and it is estimated that the cost will expand to $344 billion by 2018 in the United States alone [17]. Managing obesity-related medical data electronically has been achieved to a great extent, but the medical field is still in need of a powerful predictive tool to assist clinicians in making decisions regarding optimal patient selection before the operation, during the course of treatment and during follow-up, ultimately leading to better outcomes for the patients. The patterns in which various factors impact the successful outcome of bariatric surgery are key findings of interest for surgeons and medical professionals and are critical decision-making elements for improving the nature of prognosis and treatment. Integration of the datasets based on rules provided by experts, standardization of the attributes in the integrated dataset, and efficient data mining of the full and reduced datasets are presented in this research.

Overview
Fig. 1 shows the architecture of the system developed in this study for integration, standardization, dimensionality reduction and data mining of bariatric surgery outcomes. The integration and standardization approach utilized here is based on the methods developed in our previous study [16]. Our improved approach in the current study utilizes machine learning for selecting the best semantic subgraph and reducing the dimensionality of the data. In this section, after a brief overview of the methods, the dataset for bariatric surgery is introduced. Next, for the sake of completeness, a brief introduction to the data integration and standardization steps is given in section 2.3. Readers interested in further details of the data integration and standardization methods may refer to our previous conference publication [16]. Subsections 2.4 to 2.5 introduce the methods developed for dimensionality reduction through semantic networks and machine learning. The system can be divided into three layers: the data integration, semantic, and data mining layers. Our approach starts with the semantic integration of the datasets. A system is developed that relates the datasets based on defined rules and combines them to form a single integrated file. The data integration layer includes the multiple source schemas, the semantic rules engine and the target schema [16]. The semantic layer emphasizes the standardization of the semantically integrated data using the UMLS. The terms in the integrated file are compared to the UMLS Metathesaurus to find standardized matching terms. The final output of this part of the system is a semantic network with higher-level categorical semantic types and relationships. Next, a dimensionality reduction technique is applied to find the significant attributes of the dataset through semantic subgraphs. Finally, the data mining part of the system employs the reduced dataset to identify the factors that impact the outcome of bariatric surgery.
The data mining layer generates multiple models and compares them. This step produces a set of significant attributes. Variable selection, sometimes referred to as feature selection, is one of the most fundamental concerns in developing predictive analytics models using clinical data. The goal of variable selection is to choose a subset of variables from the pool of all available variables to be included in the predictive model. An appropriate variable selection method reduces the dimensionality of the problem by eliminating variables that do not contribute to the performance of a given predictive model. Choosing an appropriate variable selection method is critical to the overall success of the predictive modeling process, since it reduces the cost associated with gathering and storing the data and improves computational speed while ensuring that the predictive model's performance is not degraded. Additionally, a model developed using a reduced set of variables is more parsimonious and more understandable to end users. There are a large number of variable selection methods available (see Refs. [18,19] or [20] for a comprehensive overview). These methods can be classified into two broad categories: statistically based and semantically based approaches. Statistically based methods reduce the number of variables using statistical analytical methods (e.g., Principal Component Analysis, Partial Least Squares and others), while semantically based methods use semantic relationships among variables to select the final subset of the variables. Semantically based methods allow prior knowledge to be incorporated in selecting an appropriate subset of variables for developing predictive models. The variable selection method presented in the current study is an attempt to develop a hybrid approach that integrates the best ideas of each of the two approaches. To achieve this, a two-step process is used.
First, as seen in Fig. 1, our approach allows the data mining engine to train and select the best predictive model. Once the best predictive model is selected, the most significant variables from that model are extracted. It is worth noting that all of the machine learning algorithms used in this step provide the capability to extract the most significant variables. In the second step, our method determines the semantic relationships among the extracted set of variables. This is achieved by determining the semantic relationship of each pair of variables using the UMLS semantic network modeling capability. If there is no direct semantic relationship between two selected variables, the full-model semantic network is used to identify other variables that constitute a path between them. If there is an "is-a" relationship between any two variables in the full-model semantic network, the variable on the "from" side of the "is-a" relationship is eliminated from the reduced set of variables. For all other relationships, all the variables on the path connecting the two variables are added to the subset. Although the reduced variable subset constructed in this way is not unique, since multiple paths are possible between any two variables, it consists of variables that are not only statistically significant but also semantically related. We note, however, that using the semantic network created from the full set of variables, it is possible to construct many subset semantic networks, all containing the same subset of variables but each traversing different paths of variable connectivity. Fig. 2a and b compare two such subnetworks containing the same subset of variables.

Fig. 2. a and b. Two sample subgraphs. The bolded nodes of the graph denote UMLS semantic types corresponding to the significant attributes generated by the data mining engine.

The criterion used to find the best subgraph is the smallest number of nodes that do not correspond to any of the significant attributes. The number of non-matching nodes is four in the top graph and five in the bottom graph, respectively, so the top graph is considered better than the bottom one.
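The two reduction rules above can be sketched in a few lines. This is a minimal illustrative sketch, not the actual RxSem implementation: the semantic-type names and the triple representation of UMLS relationships are assumptions made for the example.

```python
# Sketch of the two semantic-reduction rules described above.
# Relationships are assumed to be (from_type, relation, to_type)
# triples; the type names below are purely illustrative.

def apply_isa_rule(variables, relationships):
    """Drop any variable on the 'from' side of an is-a relationship."""
    dropped = {frm for frm, rel, to in relationships
               if rel == "isa" and frm in variables and to in variables}
    return variables - dropped

def best_subgraph(subgraphs, significant):
    """Pick the candidate subgraph with the fewest nodes that do not
    correspond to a significant attribute."""
    return min(subgraphs, key=lambda g: len(g - significant))

significant = {"Finding", "Disease or Syndrome", "Therapeutic Procedure"}
top = significant | {"Event", "Phenomenon or Process", "Activity",
                     "Health Care Activity"}
bottom = significant | {"Event", "Phenomenon or Process", "Activity",
                        "Occupational Activity", "Health Care Activity"}
# The top graph has 4 non-matching nodes, the bottom 5, so top wins.
assert best_subgraph([top, bottom], significant) == top
```

With real data, `subgraphs` would be the candidate subnetworks extracted from the full UMLS semantic network, as in Fig. 2.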

The case of bariatric surgery
The Bariatric Surgery Information System (BSIS), containing medical data of a large number of patients who underwent the surgery, was used in this study. The dataset used in this project is from the National Surgical Quality Improvement Program (NSQIP) of the American College of Surgeons, which captures surgical outcomes nationwide [21]. This initiative includes over 200 institutions and, in our sample, included valuable information on more than 100,000 patients. This rich set of data chronicles patients' medical conditions from the first visit long before the surgery, all the findings during the operation and hospitalization, and a large repertoire of follow-up visits spanning up to 3 years after the initial visit. There is also comprehensive documentation of technical failures and complications. The original data source contained five data sets with information relating to patients who underwent bariatric surgery, obtained from reliable experts. The description of each data set is as follows:
• Demog - Demographic information of the patients.
• Preop - Medical information of the patient during visits before the surgery.
• Intraop - Information about the surgery that was performed on the patient.
• AE - Data relating to side effects, complications or adverse effects observed in the patient after the surgery.
• Postop - Medical information of the patient during multiple visits after the surgery.

Semantic integration and standardization
Data selection and integration directives (medical rules) were provided by medical experts. The integrated file contained 120,000 patient records with over 250 attributes. The algorithm for rule-based integration was implemented using Java as well as SQL; the latter was more efficient in terms of performance. To standardize the integrated file, we used the web, expert advice and the UMLS to find the best-matched standard terms for the metadata in the integrated file. A specification file, provided by experts, along with the metadata of the integrated file, was compared to the UMLS to find the best-matched standard terms. The web was used to find terms that the UMLS failed to match.
The UMLS files and data sets were stored in a MySQL database using the UMLS downloads available on the UMLS website. The search algorithm was implemented in Java, which interacted with all databases to find the best match for the medical terms. Several levels of search were implemented until the best possible matches were found. As the first step, the metadata of the integrated file was searched for in the UMLS Metathesaurus. The second step used keywords from the description of the term in the specification file to search the UMLS. In the third level of search, all possible permutations of the term being searched were fed into the UMLS search query. The fourth level of search involved searching the UMLS for all possible combinations of the words. Finally, each word in the metadata being searched for was fed into the UMLS search query individually. Terms that passed any one of the levels of search above did not go through the further levels of search. The application was semi-automatic, which made the output more relevant than that obtained using complete automation. The importance of having a human-in-the-loop in machine learning has been previously highlighted by other researchers [22].
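The tiered search cascade above can be sketched as follows. This is an illustrative sketch only: `umls_lookup` is a hypothetical stand-in for the actual Metathesaurus query against the MySQL database, and the toy in-memory index replaces the real UMLS tables.

```python
from itertools import permutations

# Hypothetical stand-in for the real Metathesaurus query.
def umls_lookup(query, index):
    return index.get(query.lower())

def tiered_search(term, description_keywords, index):
    # Level 1: the metadata term itself.
    hit = umls_lookup(term, index)
    if hit:
        return hit
    # Level 2: keywords from the specification-file description.
    for kw in description_keywords:
        hit = umls_lookup(kw, index)
        if hit:
            return hit
    # Levels 3-5: permutations of the term's words, then shorter
    # combinations, down to each individual word.
    words = term.split()
    for n in range(len(words), 0, -1):
        for perm in permutations(words, n):
            hit = umls_lookup(" ".join(perm), index)
            if hit:
                return hit
    return None  # escalate to web search / human review
```

A term passing an earlier level short-circuits the later ones, mirroring the behavior described above; unresolved terms fall through to the semi-automatic, human-in-the-loop step.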
The matching standard terms obtained from the UMLS Metathesaurus after following the above-mentioned search strategies were presented to the users. The user also had the choice of selecting an appropriate match for the term from a list of suggestions if the default selected match was not satisfactory. After the best-matched standard terms were assigned to each term in the metadata of the integrated file, the algorithm replaced the metadata of the integrated file with the corresponding standard terms. The output of this operation provided an integrated and standardized file containing bariatric surgery data.
Furthermore, the relationships among the semantic types returned by the UMLS for the matched standard terms were obtained. A matrix was generated with the semantic types corresponding to our dataset and the relationships among them. This semantic network is later used during data mining and discovery of significant data attributes.
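The relationship matrix can be assembled directly from (type, relation, type) triples. The sketch below is a minimal illustration; the two triples shown are examples of UMLS Semantic Network relations, not the actual contents of our matrix.

```python
# Building a semantic-type relationship matrix from illustrative
# (from_type, relation, to_type) triples.
triples = [
    ("Finding", "manifestation_of", "Disease or Syndrome"),
    ("Therapeutic Procedure", "treats", "Disease or Syndrome"),
]

types = sorted({t for frm, _, to in triples for t in (frm, to)})
idx = {t: i for i, t in enumerate(types)}

# matrix[i][j] holds the relation from type i to type j (None if absent).
matrix = [[None] * len(types) for _ in types]
for frm, rel, to in triples:
    matrix[idx[frm]][idx[to]] = rel
```

The non-empty cells of this matrix define the edges of the semantic network used in the dimensionality reduction step.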

Dimensionality reduction using semantic networks
We combined the semantic representation with the results of the data mining engine. The data mining engine initially considered all the attributes of the integrated dataset to find the best-fit model. The significant attributes of the best-fit model were then extracted to perform a data reduction step before generating the final predictive model. Based on these significant attributes, a number of subgraphs of the semantic network generated by the UMLS were isolated. The best of these subgraphs was selected, with the criterion being the smallest possible subgraph connecting all the concepts relating to the significant attributes. The dataset was reduced to the significant attributes from the data mining engine and the best possible subgraph. This reduced dataset was sent through the data mining engine to generate the best final predictive model. Sample subgraphs are shown in Fig. 2; the bolded nodes of the graph denote UMLS semantic types corresponding to the significant attributes generated by the data mining engine. The criterion used to find the best subgraph is the smallest number of nodes that do not correspond to any of the significant attributes.
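One way to isolate a connecting subgraph is to join every pair of significant semantic types by a shortest path through the full network. The sketch below is a simplified illustration under that assumption (the paper's subgraph enumeration may differ); the adjacency structure and node names are hypothetical.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first shortest path over an adjacency-list graph."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []  # no path found

def connecting_subgraph(graph, significant):
    """Union of shortest paths between every pair of significant nodes."""
    nodes = set(significant)
    sig = list(significant)
    for i in range(len(sig)):
        for j in range(i + 1, len(sig)):
            nodes.update(shortest_path(graph, sig[i], sig[j]))
    return nodes
```

The dataset is then reduced by keeping only the attributes whose semantic types fall inside the returned node set.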

The data mining engine
SAS Enterprise Miner® was utilized for mining the data. Before performing the data mining operation, the first challenge was to find the best target variable for the analysis. Since the dataset is on bariatric surgery, the target variable can be related to the impact of the surgery on the body mass index (BMI) of the patient. The integrated dataset has a number of attributes for the weight of the patient, one for each visit of the patient before and after the surgery. The BMI of the patient was calculated by dividing the weight by the square of the patient's height. The change in the BMI of the patient after the surgery was chosen as the target variable (Table 1). The Data Partitioning node in SAS Enterprise Miner was used to partition the data into training and validation data, with the training data used for preliminary model fitting and the validation data used for monitoring, tuning and assessing the model. Decision trees, regression and neural networks were used as the techniques to generate the models for the dataset.
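The target-variable construction is straightforward; the sketch below assumes metric units (kg, m) and hypothetical function names, since the actual column names in the integrated file are not reproduced here.

```python
# BMI = weight (kg) / height (m) squared.
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

def bmi_change(pre_weight_kg, post_weight_kg, height_m):
    """Change in BMI after surgery; negative means BMI decreased."""
    return bmi(post_weight_kg, height_m) - bmi(pre_weight_kg, height_m)

# e.g. a 1.8 m patient dropping from 129.6 kg to 81.0 kg:
# BMI goes from about 40 to about 25, a change of about -15.
```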
Three decision tree models were run on the dataset with variations in their properties. DTM1 was run with the default properties in Enterprise Miner. DTM2 was generated using an interactive process, where an attribute of interest was chosen to start the splitting of the tree and further branches of the tree were trained by the software. In DTM3, the depth and splitting rule were specified.
The dataset in this study was also modeled using three regression models. In the first attempt, using forward selection, a baseline model was generated that represented an overall average of the dataset. In the next step, the best of the models with one input was chosen, followed by a model with two inputs. This sequence was continued until no significant improvement could be made. The next regression model used backward selection, which started with a saturated model and eliminated variables one at a time as long as the results did not change significantly. The third regression model used stepwise selection.
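The forward-selection sequence can be sketched generically. In this illustrative sketch, `fit_score` is a hypothetical stand-in for fitting a regression on the chosen inputs and returning a goodness-of-fit score (Enterprise Miner performs this internally).

```python
# Generic forward selection: greedily add the input that most improves
# the fit, stopping when no significant improvement remains.
def forward_selection(candidates, fit_score, min_gain=1e-4):
    chosen, best = [], fit_score([])          # baseline: overall average
    while True:
        gains = {v: fit_score(chosen + [v]) for v in candidates
                 if v not in chosen}
        if not gains:
            break
        v = max(gains, key=gains.get)
        if gains[v] - best < min_gain:
            break                              # no significant improvement
        chosen.append(v)
        best = gains[v]
    return chosen
```

Backward selection is the mirror image: start from the saturated model and greedily remove the variable whose removal costs the least, while the score stays within tolerance.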
Finally, three neural network models were run for this dataset with variations in some properties and network architecture. NNM1 used a multilayer perceptron architecture. NNM2 used a generalized linear model architecture, and in NNM3 the network architecture was left at the default but the number of hidden units was increased.

Results
During the data mining phase, the integrated dataset was mined using decision tree, regression, and neural network models. Each type of model was run with three separate configurations. In all, nine models were compared, and various statistical measures including misclassification rate, accuracy, sensitivity, specificity, and precision were considered. In terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), these measures are defined as: Accuracy = (TP + TN)/(TP + TN + FP + FN); Sensitivity = TP/(TP + FN); Specificity = TN/(TN + FP); Precision = TP/(TP + FP).

Table 2 shows the accuracy, specificity, sensitivity, precision and misclassification rate of the models generated when all the attributes (full model) are included, as opposed to when only the significant attributes (reduced model) are considered. Both approaches, using the full data and the reduced data, generated fairly similar results. Table 3 highlights the volume of the data and the performance (time taken for the execution of the data mining engine) before and after dimensionality reduction. As can be seen in Table 3, our rule-based semantic approach for reducing data dimensionality was highly effective in reducing the volume of the data and the time needed to run the analysis. The results in Table 2 show that the reduced model performs as well as the full model: the outcome of the data mining using the nine different algorithms was not significantly changed when the reduced data was utilized instead of the full data. In particular, the misclassification rates for the reduced set are very close to those for the full set (Table 2).

Table 3. Comparison between the original dataset and the reduced dataset in terms of file size, number of attributes and performance.

The similarity of the data mining results after data reduction is one of the most important findings of this research. The data reduction was performed by applying the results of the UMLS semantic network to the output of the data mining engine run on the whole dataset.
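The evaluation measures compared across the nine models follow directly from the binary confusion matrix; a minimal sketch:

```python
# The measures used in Table 2, computed from a binary confusion matrix
# (TP = true positives, TN = true negatives, FP/FN = false pos./neg.).
def metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return {
        "accuracy":          (tp + tn) / total,
        "sensitivity":       tp / (tp + fn),   # a.k.a. recall
        "specificity":       tn / (tn + fp),
        "precision":         tp / (tp + fp),
        "misclassification": (fp + fn) / total,
    }
```

Note that the misclassification rate is simply one minus the accuracy, which is why the two track each other across the full and reduced models.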
The understandability of the data mining results increased considerably because of the smaller number of attributes considered during data mining. As is clearly evident from the results in Table 2, Table 3, the predictive power of the full and reduced models is similar, while runtime and data volume decreased significantly for the reduced model. Determining the attributes to be removed from the dataset during data reduction is typically the most computationally expensive part of such a study. This computational cost was nearly negligible in the procedure developed here, owing to the semantic network generated by the UMLS.

Discussion
Utilization of clinical data for medical informatics with the goal of improving patient outcomes poses a number of challenging and complex problems. Additionally, the use of clinical data for retrospective analysis in support of medical informatics is problematic considering that most clinical data, at the time of collection, were intended for patient care and may not have been explicitly collected for medical informatics purposes. These challenges stem from the fact that clinical data come from disparate sources and therefore may lack canonical consistency and semantic homogeneity, making their integration particularly challenging. Furthermore, clinical data are usually large, complex, nontraditional time series and may require extensive data preparation prior to integration, standardization and mining.
Health care applications need semantic medical data integration because they must accept medical data in a structured representation. Thus, these applications must address the heterogeneous nature of medical data [23] in order to allow processing such as analysis, mining, manipulation, and translation. In this study, we presented a robust rule-based semantic integration, standardization and dimensionality reduction procedure for medical data. Data integration is guided by a set of rules obtained from medical experts. The UMLS was used to achieve the standardization of the dataset, with a stepwise procedure for mapping the metadata to the UMLS. Decision tree, regression and neural network models were used to analyze the integrated dataset, to find the significant variables and thus reduce the dataset. This dimensionality reduction of the data utilizing a semantic network allowed us to reduce the data volume and data mining runtime significantly.
It should be noted that unlike other dimensionality reduction methods (e.g., principal component analysis, linear discriminant analysis, or canonical correlation analysis), the reduced model presented here is not statistically based but semantically based. This is an important contribution of this study. We argue that for medical data mining to be useful, the understandability of the model and of the features used is more important than accuracy alone, since these models are inherently used by medical professionals as decision support systems. In a decision support situation, understandability and alignment of the model with the practitioners' mental models are of paramount importance. Semantically based models can accomplish this task.
There are various challenges involved in the process of semantic integration, standardization and dimensionality reduction. The first and most important of these is understanding the semantics of the elements involved. There are only a few ways this can be done: by gathering information from the creators of the data, from documentation, or from the schema or data itself. The first two sources are often inaccessible because the data sources to be integrated may have been in use for a long time and the documentation, if any exists, may be incorrect or outdated. The last source, the schema or data, is often unreliable for inferring semantics. The second challenge involves the incompleteness and inaccuracy of the schema and data, which frequently do not provide sufficient information to determine the exact nature of the relationships. The third challenge is the cost involved in matching the elements of the schema, because the datasets are often very large. The worst of these problems is the customization required for certain elements during matching, owing to its subjective nature. The problem of matching data tuples faces similar challenges to schema matching. Among the various schema matching techniques, rule-based semantic matching is an important one. Some of the benefits of rule-based schema matching are that it is inexpensive, fairly fast, and provides a quick and concise method to capture valuable user knowledge about the domain. Our work addresses the heterogeneous nature of the target schema and enhances interoperability by mapping terms semantically to the UMLS. Standardization through a process of semantic mapping of the target schema to standard terms was successfully achieved in this study.
Semantic networks and ontologies have been utilized by researchers for analyzing medical data in the past [ [24], [25], [26], [27], [28]]. Their popularity in the field of medicine is mainly due to their power in representing data and the relationships among dataset elements.
In summary, we believe our most important contributions in this study are as follows:
• Information integration: We proposed a rule-based semantic integration approach in which domain experts provide the integration rules in a precise logical language, and developers implement the integration in an appropriate database programming language. We demonstrated this approach using the bariatric surgery application.
• Data standardization: Information from multiple heterogeneous sources must be prepared for mining. Two of the most important tasks in data preparation are data cleaning and standardization. In our study, we utilized the UMLS standard medical vocabulary for data standardization.
• Dimensionality reduction: The integrated and standardized medical datasets can be (and usually are) very large. To make medical data mining feasible, we proposed to identify a small subset of parameters that play a significant role in the prediction of the target variable. In our project, the integrated data had in excess of 250 columns and 100,000 rows; in other projects, the size and dimensionality of the data can be much greater. We proposed the following approach for dimensionality reduction: (1) first, we used a small subset of the data to rapidly determine a subset of significant parameters; (2) next, we generated the semantic network graph of these significant parameters, plus other parameters that are semantically related to them; (3) subsequently, we explored the subgraphs of the semantic network obtained in step 2 and selected the "best" subgraph; (4) we reduced the dimensionality of the integrated data by considering only the parameters in the chosen subgraph; and (5) we performed data mining on the reduced data, generating predictive models as good as those generated using the full data, but at a fraction of the computational cost.

Limitation and future research
We have shown that our approach allows for variable subset selection using a hybrid method that integrates statistical and semantic approaches. The current paper describes this method and provides a proof of concept using a limited dataset. We have shown that while the reduced model is far more parsimonious than the full model, its predictive power is comparable to that of the full-set model. However, we are cognizant that the method presented in this paper suffers from several limitations, which are described here along with proposed future research avenues to overcome them. First, as acknowledged earlier, many possible subset semantic networks can be created using the same subset of variables, and the current paper does not propose a method for selecting the "best" subnetwork. This is a computationally difficult problem to solve and requires additional insights. Second, our method has only been tested on a limited dataset. We invite other researchers to replicate our experimental results using other well-known medical datasets. Third, we are aware that there are a number of well-established and robust approaches (e.g., the OMOP [29], PCORnet [30], i2b2 [31] and Sentinel [32] initiatives) that have been developed to facilitate data integration and standardization of clinical data. We do not claim that our proposed approach should be viewed as yet another data integration and standardization approach. We note that these approaches have been developed primarily to overcome data integration and standardization problems, not the variable subset selection problem. Additionally, we note that these methods require the collected data to be curated based on specific standards and templates that facilitate the eventual integration and standardization of data among disparate sources. Our method does not require this; it assumes no such standardization of data at the point of collection.
The standardization and eventual integration occur using three of the most important components of the UMLS: the SPECIALIST Lexicon, the Metathesaurus, and the Semantic Network. Therefore, we argue that our approach should not be viewed as a "competitor" of these methods, but rather as a front-end preprocessing approach that could be used to augment and enhance their capabilities. We posit that a research project examining the efficacy of using our approach as a front end to these approaches would make a valuable contribution to the extant literature.

Summary and conclusions
In summary, RxSem is a system that goes beyond data integration, standardization and mining of the data related to bariatric surgery by utilizing semantic networks to reduce data dimensionality, thus making predictive analytics using large datasets feasible and efficient. Although our study focused on medical data, we believe our novel approach and the techniques developed in this study are general and may apply to a vast array of information systems.