Publications

Statistical Evaluation of Ensemble Learning Outcomes for Missing Value Problem in Imbalanced Class Distribution

Published in Süleyman Demirel University Journal of Natural and Applied Sciences, 2023

Abstract: Rapid developments in technology have produced data with diverse structures and high dimensionality in recent decades. Because of these rapid changes and the problems encountered in data sets, it has become inevitable that traditional methods are replaced by machine learning methods. This study addresses two important data problems: data sets with i) missing observations and ii) imbalanced class distribution. It aims to complete data sets that have both problems at the same time by using various missing observation imputation methods, and to assess the success of ensemble learning algorithms on the resulting data. In the sensor-collected data set used for the application, the training set contains 59000 negative observations versus 1000 positive observations. The resulting models were tested on data with an imbalanced class distribution of 2.4%. In addition, approximately 99% of the features in the data set have missing data, with missing rates of up to 82%. These missing observations were imputed with hot deck imputation, mean, median, mode, multiple imputation, expectation maximization, and k nearest neighbour methods. The data sets completed with the imputation methods were comparatively tested with algorithms such as Extra Trees, Random Forest, Gradient Boosting, LightGBM, and XGBoost, and the most promising result was obtained with the XGBoost algorithm.
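As a rough illustration of the simplest imputation methods named in this abstract (mean, median, and mode), the sketch below fills missing entries in a single column using only the Python standard library. The function names and the toy sensor readings are hypothetical and are not taken from the paper; hot deck, multiple imputation, expectation maximization, and k nearest neighbour imputation are not shown.

```python
from statistics import mean, median, mode


def missing_rate(values):
    """Share of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute_column(values, strategy="mean"):
    """Fill None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    statistic = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


# Hypothetical sensor readings; None marks a missing observation.
sensor = [2.0, None, 4.0, 4.0, None, 10.0]

print(missing_rate(sensor))
for strategy in ("mean", "median", "mode"):
    print(strategy, impute_column(sensor, strategy))
```

Each strategy yields a different completed column, which is why the choice of imputation method can change the downstream classification result.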

Read more

A Hybrid Approach Using User-Based Collaborative Filtering and Clustering for Personalized Product Recommendation System

Published in Istanbul Ticaret Üniversitesi Sosyal Bilimler Dergisi, 2022

Abstract: Today's competitive conditions force companies, especially retail and e-commerce companies, to know their customers better, to understand their preferences and behaviours, to predict their needs and, in this way, to make offers that make them feel special. One of the methods companies use to meet their personalization needs is product recommendation systems. Purpose: This study proposes a hybrid approach that combines k-means with user-based collaborative filtering in order to improve the user-based collaborative filtering method, which is one of the most frequently used methods in the literature and in the business world for personalized product recommendation systems. Method: User-based collaborative filtering and k-means methods are used. Findings: The current method and the proposed method were applied to two different data sets. To compare the methods, each data set was split into an 80% training set and a 20% test set, and the errors (RMSE) of the models built on the training data were calculated. The comparison showed that the error of the proposed method was lower on both data sets. Originality: This study proposes an approach that can also use additional information about users, extending the user-based collaborative filtering method, which works only on user-product scores. In addition, the proposed method has been applied to real supermarket data as well as to the MovieLens data set, which is frequently used in the literature.
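The abstract does not spell out how the two methods are combined. One plausible reading, sketched below purely as an assumption, is to restrict the neighbours used by user-based collaborative filtering to users in the same cluster and to score predictions with RMSE. The cluster labels are taken as given here (for example, as the output of k-means on user attributes); k-means itself is not implemented, and all names and ratings are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity over the items that both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(ratings, clusters, user, item):
    """Similarity-weighted average rating from neighbours in the user's cluster."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or clusters[other] != clusters[user]:
            continue  # assumption: only same-cluster users act as neighbours
        if item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None  # None when no usable neighbour exists


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    pairs = list(pairs)
    return sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


# Hypothetical user-product scores and precomputed cluster labels.
ratings = {
    "ann": {"milk": 5, "bread": 4},
    "bob": {"milk": 5, "bread": 4, "tea": 2},
    "cem": {"milk": 1, "bread": 5, "tea": 5},
}
clusters = {"ann": 0, "bob": 0, "cem": 1}

print(predict(ratings, clusters, "ann", "tea"))
print(rmse([(2.0, 3.0), (4.0, 4.0)]))
```

In this toy data only "bob" shares a cluster with "ann", so the prediction for "ann" ignores the rating from "cem" entirely.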

Read more

Statistical Learning Model for In-Silico Mutagenicity Prediction

Published in Süleyman Demirel University Journal of Natural and Applied Sciences, 2021

Abstract: Among the toxicity tests, mutagenicity, defined as a genetic change that can occur due to an agent, has an important place. In this study, statistical learning algorithms were used within an in-silico approach to improve the mutagenicity determination process in general. This approach was applied to a set of molecules with experimentally obtained mutagenicity information, and remarkable classification success was achieved. For this study, the Bursi and Benchmark data sets, which consist of molecules found in the literature, were combined and the properties of the molecules were calculated with the Molecular Operating Environment (MOE). Decision tree algorithms were then applied to the resulting data set of 10835 molecules and 193 variables, and parameter selection was performed with a grid search approach. Using the models built with the best parameters obtained, variables were selected according to their importance in predicting mutagenicity, and the number of descriptor variables was reduced to the 72 most effective ones. Various statistical learning algorithms were applied to the reduced data set consisting of the selected variables, and the five classification algorithms with the best results were selected. With these algorithms, whose performance was improved through parameter optimization, correct prediction rates of approximately 90% were obtained for mutagenicity classification.

Read more

Churn Analysis for Factoring: An Application in Turkish Factoring Sector

Published in y-BIS 2019 Conference: ISBIS Young Business and Industrial Statisticians Workshop on Recent Advances in Data Science and Business Analytics 2019 Istanbul/Turkey, 2019

Abstract: Due to the increasingly competitive environment in many areas in recent years, customers can easily turn to alternative services. For this reason, it is very important to predict which customers will turn to another service, especially in sectors such as telecom and banking, which have a membership-based revenue model. As in many sectors, churn prediction models that identify customers who plan to move to competitors are being developed in the factoring sector. Based on the prediction results, companies aim to prevent customers from leaving by developing campaigns or other actions for the customers at risk, and to increase customer loyalty. At this stage, focusing on the right customer is critical in order to reduce campaign costs and increase customer loyalty. To identify the right customer, successful prediction models are being developed using current classification algorithms. However, it is not enough to treat customer churn prediction as just a classification model. Additional analyses are needed to inform decision processes such as selecting the targeted customers, determining the types of actions to be taken for them, and personalizing the actions for different customer groups. Therefore, customer churn prediction needs to be considered as a holistic customer relationship model, which includes the developed forecasting model as well as analyses for getting to know the customer, such as profiling and segmentation.

Read more

Ensemble Learning Methods as In-silico Model for Prediction of Mutagenicity

Published in 11. International Statistics Days Conference Mugla/Turkey, 2018

Abstract: With developing technologies, in-vitro experiments, which are conducted outside living organisms, and in-vivo experiments, which are conducted on living organisms, have begun to give way to statistical and computational methods that are developed in a computer environment and do not require laboratory experiments. These methods, generally called in-silico, are capable of making predictions about candidate drug molecules prior to in-vitro and/or in-vivo testing. Obtaining information about drug molecules with an accurate in-silico approach can help decide whether laboratory experiments should be conducted. Foreseeing the design of the tests to be performed can provide advantages such as using fewer experimental animals, predicting the chemical concentration to be used, and reducing time and cost. Today, various toxicity tests are used in the legal and ethical regulation of chemicals (drug molecules, food additives, cosmetics, etc.). Among the toxicity tests, mutagenicity, defined as a genetic change that can occur due to an agent, has an important place. In particular, candidate drug molecules must have no mutagenic effects in order to continue their clinical trials. Mutagenicity screening must be carried out in multiple steps consisting of in-vivo and in-vitro tests, none of which is sufficient on its own. The in-silico approach has been used with statistical learning algorithms as the first step in order to improve the mutagenicity determination process in general. This approach was applied to a set of molecules with experimentally obtained mutagenicity information, and promising classification results were obtained. Within the scope of the study, the Bursi and Benchmark molecule data sets from the literature were merged, and the properties of the molecules in the data set were calculated with the Molecular Operating Environment (MOE) program. Statistical learning algorithms (AdaBoost, ExtraTrees and Random Forest) were applied to the resulting 10835 x 193 data set, and parameter selection was performed with a grid search approach. All selections and applications were performed with 10-fold cross validation. Using the models built with the best parameters obtained, variables were selected according to their importance in predicting mutagenicity, and the 72 most effective variables were retained. Nineteen different statistical learning algorithms were applied to the new data set consisting of the selected variables, and the seven ensemble classification algorithms that yielded the best results (AdaBoost, Bagging, ExtraTrees, GradientBoosting, Random Forest, XGBoost, LGBM) were chosen. With these algorithms, whose performance was improved through parameter optimization, correct mutagenicity classification rates ranging from 78% to 90% were obtained.
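The combination of grid search and 10-fold cross validation mentioned in this abstract can be sketched in a few lines of standard-library Python. To stay self-contained, a one-dimensional nearest-neighbour classifier stands in for the ensemble models, and the data are a synthetic toy set; neither comes from the study, and all function names are hypothetical.

```python
def k_fold_indices(n, k):
    """Assign index i to fold i % k, giving k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cv_accuracy(data, n_neighbors, k=10):
    """Mean held-out accuracy over k folds for one hyperparameter value."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [p for i, p in enumerate(data) if i not in held_out]
        correct = sum(
            knn_predict(train, data[i][0], n_neighbors) == data[i][1]
            for i in fold
        )
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


def grid_search(data, grid, k=10):
    """Score every candidate value with k-fold CV and return the best one."""
    results = {value: cv_accuracy(data, value, k) for value in grid}
    best = max(results, key=results.get)  # ties go to the first candidate
    return best, results


# Synthetic toy data: (feature, label) pairs with the label switching at x = 10.
data = [(x, 1 if x >= 10 else 0) for x in range(20)]
best, results = grid_search(data, grid=[1, 3, 5], k=10)
print(best, results)
```

Every candidate is scored only on folds it was not fitted on, so the selected value reflects held-out performance rather than fit to the training points.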

Read more

The Effect of Methods Used for Missing Data Imputation on Classification Success

Published in 11. International Statistics Days Conference Mugla/Turkey, 2018

Abstract: The missing data problem is frequently encountered in many fields, such as field research, clinical study results, and business data. Missing data can cause various problems, such as larger errors in statistical estimates, false predictions, loss of information depending on the missing data rate, and inadequate analyses. Many methods based on deletion and imputation approaches have been developed to solve missing data problems. The methods that can be used for a given data set vary depending on factors such as sample size, amount of missing data, and missing data mechanism. Data sets used to solve real-life problems are often not fed from a single source. In data sets created by combining many different sources, data are lost due to mistakes in data collection, merging, or similar steps. Before starting statistical analysis, it is necessary to analyze the missing data and perform the necessary data cleaning operations, and thus improve the quality of the data. This study aims to compare the methods used for imputing missing data in terms of classification success. In addition to descriptive and predictive methods based on basic statistics, machine learning methods applied to the imputation of missing data were also used for comparison. The problem of predicting credit default risk was chosen for the application. The goal is to predict the loan repayment status of bank customers using data obtained from a bank, with missing observations in different variables. Before the application, Little’s MCAR test was applied in order to detect the missing data mechanism in the data set. In the application, a separate data set was created for each imputation method by filling in the missing values with that method. Afterwards, classification success rates were obtained on these data sets with LightGBM, a gradient boosting algorithm, and the imputation methods were compared using these rates.

Read more

Performance Comparison of Tree-Based Algorithms for In-silico Mutagenicity Prediction

Published in Akademik Bilisim 2018 Karabuk/Turkey, 2018

Abstract: Observing the biological activities of chemical compounds is a time-consuming and costly process. In-silico experiments are simulations carried out in a computer environment to accelerate this process and reduce its cost. With in-silico experiments, biological molecular systems become much easier to understand in terms of time and cost. Mutagenicity, which may cause genetic change in the cell, can be predicted with the in-silico approach using data mining algorithms. A data set of 8208 observations and 155 variables is used in the study, and tree-based data mining algorithms are applied to it to determine mutagenicity. Among the algorithms used, CART showed a classification success of 71.67%, GBM 77.9%, XGBoost 84.21% and finally Random Forest 84.68%.

Read more