Automatic Transformation of Raw Clinical Data Into Clean Data Using Decision Tree Learning Combining with String Similarity Algorithm

Author Jian Zhang

Cite As

Jian Zhang. Automatic Transformation of Raw Clinical Data Into Clean Data Using Decision Tree Learning Combining with String Similarity Algorithm. In 2015 Imperial College Computing Student Workshop (ICCSW 2015). Open Access Series in Informatics (OASIcs), Volume 49, pp. 87-94, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)


Conducting statistical analyses of complex scientific datasets is challenging. Finding the relationships within the data is time-consuming for scientists and statisticians alike. The process involves preprocessing the raw data, selecting appropriate statistics, performing the analysis, and providing correct interpretations; of these steps, data preprocessing is tedious and a particular drain on time. Large datasets provided for analysis often follow no standard for recording information and contain spelling, typing, or transmission errors. As a result, the same meaning appears under many different expressions, and an analysis system cannot deal with these inconsistencies automatically. What is needed is an automatic method for transforming raw clinical data into data that can be processed automatically. In this paper we propose a method that combines decision tree learning with a string similarity algorithm and is fast and accurate for clinical data cleaning. Experimental results show that it outperforms individual string similarity algorithms and the traditional data cleaning process.
  • Raw Clinical Data
  • Decision Tree Learning
  • String Similarity Algorithm
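
To make the string similarity idea concrete, the sketch below maps a raw, possibly misspelled value to the closest term in a controlled vocabulary using normalised Levenshtein similarity. It is an illustration only and not the paper's implementation: the function names, the example vocabulary, and the 0.75 threshold are all assumptions, and the decision tree component is left out because the abstract does not say how the two are combined.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def clean_value(raw: str, vocabulary: Sequence[str],
                threshold: float = 0.75) -> Optional[str]:
    """Return the closest vocabulary term, or None if nothing is similar enough.

    The threshold is a hypothetical choice for this sketch, not a value
    taken from the paper.
    """
    normalised = raw.strip().lower()
    best_term, best_score = None, 0.0
    for term in vocabulary:
        score = similarity(normalised, term.lower())
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None
```

With a hypothetical vocabulary of `["diabetes", "hypertension", "asthma"]`, `clean_value("Diabetis ", vocabulary)` returns `"diabetes"`, while an unrelated entry such as `"unknown"` returns `None` and could be left for manual review.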



