Automatic Transformation of Raw Clinical Data Into Clean Data Using Decision Tree Learning Combining with String Similarity Algorithm

Zhang, Jian

doi:10.4230/OASIcs.ICCSW.2015.87

File

OASIcs.ICCSW.2015.87.pdf

Filesize: 361 kB
8 pages

Document Identifiers

DOI: 10.4230/OASIcs.ICCSW.2015.87
URN: urn:nbn:de:0030-drops-54850

Cite As

Jian Zhang. Automatic Transformation of Raw Clinical Data Into Clean Data Using Decision Tree Learning Combining with String Similarity Algorithm. In 2015 Imperial College Computing Student Workshop (ICCSW 2015). Open Access Series in Informatics (OASIcs), Volume 49, pp. 87-94, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)
https://doi.org/10.4230/OASIcs.ICCSW.2015.87

Abstract

Conducting statistical analyses of complex scientific datasets is challenging. Finding the relationships within data is time-consuming for scientists and statisticians alike. The process involves preprocessing the raw data, selecting appropriate statistics, performing the analysis, and providing correct interpretations; of these, data preprocessing is tedious and a particular time drain. Large datasets provided for analysis often follow no standard for recording information and contain spelling, typing, or transmission errors. As a result, the same meaning may be expressed in many different ways, and an analysis system cannot handle these inconsistencies automatically. What is needed is an automatic method for transforming raw clinical data into data that can be processed automatically. In this paper we propose a method that combines decision tree learning with a string similarity algorithm and is fast and accurate for clinical data cleaning. Experimental results show that it outperforms individual string similarity algorithms and the traditional data cleaning process.
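To make the problem of "many expressions for the same meaning" concrete, the following is a minimal sketch of the string-similarity half of such a pipeline only. It is not the paper's implementation: the choice of Levenshtein edit distance, the function names, the example vocabulary, and the 0.7 threshold are all assumptions made for illustration, and the decision tree component is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def normalize(raw: str, vocabulary: list[str], threshold: float = 0.7):
    """Map a raw value to the most similar canonical term, or None
    when no term reaches the (hypothetical) similarity threshold."""
    cleaned = raw.strip().lower()
    best = max(vocabulary, key=lambda term: similarity(cleaned, term))
    if similarity(cleaned, best) >= threshold:
        return best
    return None


# Hypothetical canonical vocabulary, for illustration only.
VOCABULARY = ["diabetes", "hypertension", "asthma"]

if __name__ == "__main__":
    for value in ["Diabetis ", "hypertention", "xyz"]:
        print(repr(value), "->", normalize(value, VOCABULARY))
```

A fixed threshold like the one above is the weak point of a similarity-only approach: it has to be tuned by hand and applies the same cut-off to every value. Learning the decision from labelled examples is one plausible reason to pair similarity scores with a learned classifier such as a decision tree.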

Keywords

Raw Clinical Data
Decision Tree Learning
String Similarity Algorithm
