OASIcs.ICCSW.2015.87.pdf
- Filesize: 361 kB
- 8 pages
Conducting statistical analyses of complex scientific datasets is challenging. Finding the relationships within the data is time-consuming for scientists and statisticians alike. The process involves preprocessing the raw data, selecting appropriate statistics, performing the analysis, and providing correct interpretations; of these steps, data preprocessing is tedious and a particular time drain. Large datasets provided for analysis often follow no standard for recording information, and they contain spelling, typing, or transmission errors. As a result, the same meaning may be expressed in many different ways, and an analysis system cannot deal with these inconsistencies automatically. What is needed is an automatic method for transforming raw clinical data into data that can be processed automatically. In this paper we propose a method that combines decision tree learning with a string similarity algorithm and that is fast and accurate for clinical data cleaning. Experimental results show that it outperforms individual string similarity algorithms and the traditional data cleaning process.
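The abstract does not say how the decision tree and the string similarity algorithm are combined, so the following Python sketch is only one plausible reading, not the authors' method. It assumes that similarity scores between a raw value and each canonical term serve as features, and that a depth-1 decision tree (a stump) learns the threshold separating true variants from unrelated terms. The term list, the labelled pairs, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def features(a, b):
    """Two similarity scores in [0, 1] for a pair of normalised strings."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b)) or 1
    edit_sim = 1.0 - levenshtein(a, b) / longest
    seq_sim = SequenceMatcher(None, a, b).ratio()
    return [edit_sim, seq_sim]


def fit_stump(X, y):
    """Learn a depth-1 decision tree: the (feature, threshold) split
    with the fewest misclassified training pairs."""
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            errors = sum((row[f] >= t) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1], best[2]


# Hypothetical labelled pairs: (canonical term, raw value, same meaning?)
PAIRS = [
    ("diabetes", "diabetis", True),
    ("hypertension", "hypertention", True),
    ("asthma", "astma", True),
    ("diabetes", "asthma", False),
    ("hypertension", "asthma", False),
    ("asthma", "anaemia", False),
]

X = [features(term, raw) for term, raw, _ in PAIRS]
y = [label for _, _, label in PAIRS]
feature_index, threshold = fit_stump(X, y)


def clean(raw, vocabulary):
    """Map raw to its most similar canonical term if the learned split
    accepts the match; otherwise return raw unchanged."""
    score, term = max((features(raw, t)[feature_index], t) for t in vocabulary)
    return term if score >= threshold else raw


VOCABULARY = ["diabetes", "hypertension", "asthma"]
print(clean("diabetis", VOCABULARY))
print(clean("migraine", VOCABULARY))
```

With these toy pairs the misspelling "diabetis" is mapped to "diabetes", while "migraine", which resembles none of the canonical terms, is left as it is. A realistic setup would need far more labelled pairs and probably a deeper tree over more similarity measures.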