The Effect of Methods Used for Missing Data Imputation on Classification Success

Published in the 11th International Statistics Days Conference, Mugla, Turkey, 2018

Abstract: The missing data problem is frequently encountered in many fields, such as field research, clinical studies, and business data. Missing data can increase the error of statistical estimates, lead to incorrect predictions, and cause a loss of information that grows with the missing data rate, which can leave a study inadequate. Many methods based on deletion and imputation have been developed to address missing data problems. Which methods are suitable for a given data set depends on factors such as sample size, the amount of missing data, and the missing data mechanism. Data sets used to solve real-life problems often do not come from a single source. When a data set is assembled from many different sources, data are lost through mistakes in collection, merging, and similar steps. Before starting statistical analysis, it is necessary to analyze the missing data and perform the required data cleaning operations in order to improve the quality of the data. This study compares the methods used to impute missing data in terms of classification success. In addition to descriptive and predictive methods based on basic statistics, machine learning methods for imputing missing data were included in the comparison. Credit default risk prediction was chosen as the application. The goal is to predict the loan repayment status of bank customers using data obtained from a bank, which contain missing observations in several variables. Before the application, Little’s MCAR test was applied to detect the missing data mechanism in the data set. A separate data set was then created for each imputation method by filling in the missing values with that method. Afterwards, classification success rates were obtained on these data sets with LightGBM, a gradient boosting algorithm, and the imputation methods were compared on the basis of these rates.
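The comparison procedure described in the abstract (fill in the missing values with each method, train a classifier on each completed data set, compare the success rates) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the data are synthetic rather than the bank data, values are removed completely at random, only mean and median imputation are compared, and a simple nearest-centroid classifier stands in for LightGBM so that the example runs with the Python standard library alone. All function names are hypothetical.

```python
import random
import statistics


def make_data(n=400, seed=0):
    """Synthetic two-class data (an assumption, not the bank data set)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Two features whose means shift with the class label.
        x1 = rng.gauss(2.0 * label, 1.0)
        x2 = rng.gauss(-1.5 * label, 1.0)
        rows.append(([x1, x2], label))
    return rows


def mask_mcar(rows, rate=0.2, seed=1):
    """Set each feature value to None independently with probability `rate`."""
    rng = random.Random(seed)
    return [([None if rng.random() < rate else v for v in x], y)
            for x, y in rows]


def fit_fill_values(train, stat):
    """Per-column fill value computed from observed training values only."""
    n_cols = len(train[0][0])
    return [stat([x[j] for x, _ in train if x[j] is not None])
            for j in range(n_cols)]


def impute(rows, fill):
    """Replace every None in column j with fill[j]."""
    return [([fill[j] if v is None else v for j, v in enumerate(x)], y)
            for x, y in rows]


def nearest_centroid_accuracy(train, test):
    """Stand-in classifier (not LightGBM): assign the closest class centroid."""
    centroids = {}
    for label in {y for _, y in train}:
        members = [x for x, y in train if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return sum(predict(x) == y for x, y in test) / len(test)


def compare_imputers(rate=0.2):
    """Return the test accuracy obtained with each imputation method."""
    data = mask_mcar(make_data(), rate=rate)
    split = int(0.75 * len(data))
    train, test = data[:split], data[split:]
    methods = {"mean": statistics.fmean, "median": statistics.median}
    results = {}
    for name, stat in methods.items():
        fill = fit_fill_values(train, stat)
        results[name] = nearest_centroid_accuracy(
            impute(train, fill), impute(test, fill)
        )
    return results


if __name__ == "__main__":
    for name, acc in compare_imputers().items():
        print(f"{name:>6}: accuracy = {acc:.3f}")
```

The fill values are estimated on the training rows and then reused for the test rows, so that no information from the test set enters the imputation step. In a setting closer to the one described above, the stand-in classifier would be replaced by a gradient boosting model and the list of methods extended with predictive and machine learning imputers.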