Statistical Evaluation of Ensemble Learning Outcomes for Missing Value Problem in Imbalanced Class Distribution
Published in Süleyman Demirel University Journal of Natural and Applied Sciences, 2023
Abstract: Rapid technological development in recent decades has produced data with diverse structures and high dimensionality. Because of these rapid changes and the problems encountered in data sets, traditional methods have inevitably been replaced by machine learning methods. This study addresses two important data problems: data sets with i) missing observations and ii) imbalanced class distribution. It aims to complete data sets that exhibit both problems at the same time by using various missing-value imputation methods, and to assess how well ensemble learning algorithms perform on the resulting data. In the sensor-collected data set used for the application, the training data contain 59000 negative-class observations versus 1000 positive-class observations. The resulting models were tested on data with an imbalanced class distribution of 2.4%. In addition, approximately 99% of the features in the data set have missing values, with missing rates of up to 82%. These missing observations were imputed with hot deck imputation, mean, median, mode, multiple imputation, expectation maximization, and k nearest neighbour methods. The data sets completed with these imputation methods were comparatively tested with the Extra Trees, Random Forest, Gradient Boosting, LightGBM, and XGBoost algorithms, and the most promising result was obtained with XGBoost.
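As a rough illustration of the simplest imputation methods named in the abstract (mean, median, and mode), the sketch below fills the gaps in a single feature column using only the Python standard library. The `impute` and `missing_rate` helpers and the `sensor` values are hypothetical and are not taken from the study. The class share at the end assumes the training counts are 59000 negative and 1000 positive observations.

```python
from statistics import mean, median, mode


def impute(column, strategy="mean"):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    # Pick the fill statistic by name; all three come from the stdlib.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in column]


def missing_rate(column):
    """Fraction of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


# Hypothetical sensor feature with two missing readings (illustrative only).
sensor = [2.0, None, 4.0, 4.0, None, 10.0]

mean_filled = impute(sensor, "mean")
median_filled = impute(sensor, "median")
mode_filled = impute(sensor, "mode")

# Assumed reading of the training counts: 59000 negative, 1000 positive.
positive_share = 1000 / (59000 + 1000)

print(f"missing rate: {missing_rate(sensor):.2f}")
print(f"positive share in training data: {positive_share:.4f}")
```

Single-value fills like these are cheap but shrink the variance of a feature, which is one reason a comparison against hot deck, multiple imputation, expectation maximization, and k nearest neighbour methods is informative when missing rates are high.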