AutoML for Explainable Anomaly Detection (XAD)

Authors: Nikolaos Myrtakis, Ioannis Tsamardinos, Vassilis Christophides



PDF: OASIcs.Tannen.8.pdf (2.6 MB, 23 pages)

Author Details

Nikolaos Myrtakis
  • Department of Computer Science, University of Crete, Heraklion, Greece
  • ETIS Laboratory, CY Cergy Paris Université, ENSEA, France
Ioannis Tsamardinos
  • Department of Computer Science, University of Crete, Heraklion, Greece
Vassilis Christophides
  • ETIS Laboratory, CY Cergy Paris Université, ENSEA, France

Cite As

Nikolaos Myrtakis, Ioannis Tsamardinos, and Vassilis Christophides. AutoML for Explainable Anomaly Detection (XAD). In The Provenance of Elegance in Computation - Essays Dedicated to Val Tannen. Open Access Series in Informatics (OASIcs), Volume 119, pp. 8:1-8:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/OASIcs.Tannen.8

Abstract

Numerous algorithms have been proposed for detecting anomalies (outliers, novelties) in an unsupervised manner. Unfortunately, it is generally not trivial to understand why a given sample (record) is labelled as an anomaly, and thus to diagnose its root causes. We propose a reduced-dimensionality, surrogate-model approach to explaining detector decisions: approximate the detection model with another one that employs only a small subset of the features. Samples can then be visualized in this low-dimensional space for human inspection. To this end, we develop PROTEUS, an AutoML pipeline that produces the surrogate model and is specifically designed for feature selection on imbalanced datasets. The PROTEUS surrogate model can explain not only the training data but also out-of-sample (unseen) data. In other words, PROTEUS produces predictive explanations by approximating the decision surface of an unsupervised detector. PROTEUS is designed to return an accurate estimate of out-of-sample predictive performance, which serves as a metric of the quality of the approximation. Computational experiments confirm the efficacy of PROTEUS in producing predictive explanations for different families of detectors and in reliably estimating their predictive performance on unseen data. Unlike several ad-hoc feature importance methods, PROTEUS is robust to high-dimensional data.
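The surrogate-model idea from the abstract can be illustrated with a minimal sketch: fit an unsupervised detector, treat its decisions as pseudo-labels, select a small feature subset, and train a supervised surrogate whose cross-validated performance measures how well it approximates the detector. This is an illustrative approximation only, not the PROTEUS pipeline itself (which adds AutoML configuration search, oversampling for imbalance, and a dedicated performance-estimation protocol); the detector, selector, and classifier choices below are assumptions for the sketch.

```python
# Sketch of the surrogate-model approach: approximate an unsupervised
# anomaly detector with a supervised model over a small feature subset.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))  # mostly "normal" samples in 20 dimensions
X[:25, :3] += 4.0               # plant a few anomalies in the first 3 features

# 1. Unsupervised detector: its decisions become the pseudo-labels.
detector = IsolationForest(random_state=0).fit(X)
y = (detector.predict(X) == -1).astype(int)  # 1 = flagged as anomaly

# 2. Feature selection: keep a small subset that still separates the classes,
#    so samples can later be visualized in this low-dimensional space.
selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
X_small = selector.transform(X)

# 3. Supervised surrogate on the reduced space; cross-validated AUC serves
#    as an estimate of how faithfully it mimics the detector out-of-sample.
surrogate = RandomForestClassifier(random_state=0)
auc = cross_val_score(surrogate, X_small, y, cv=5, scoring="roc_auc").mean()
print("selected features:", selector.get_support(indices=True))
print(f"surrogate CV AUC: {auc:.2f}")
```

A high cross-validated AUC indicates the three selected features suffice to reproduce the detector's decision surface; a low score would signal that the chosen subset (or surrogate family) cannot explain the detector's behaviour.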

Subject Classification

ACM Subject Classification
  • Computing methodologies → Anomaly detection
Keywords
  • Anomaly Explanation
  • Predictive Explanation
  • Anomaly Interpretation
  • Explainable AI

