Ensemble Learning Methods as In-silico Model for Prediction of Mutagenicity

Published in the 11th International Statistics Days Conference, Mugla, Turkey, 2018

Abstract: With advancing technology, in-vitro experiments, which are conducted outside living organisms, and in-vivo experiments, which are conducted on living organisms, have begun to give way to statistical and computational methods that run on a computer and require no laboratory work. These methods, generally called in-silico, can predict the properties of candidate drug molecules prior to in-vitro and/or in-vivo testing. Information about drug molecules obtained with an accurate in-silico approach can help decide whether laboratory experiments should be conducted at all. Planning the design of the tests in advance offers advantages such as using fewer experimental animals, predicting the chemical concentration to be used, and reducing time and cost. Today, various toxicity tests are used in the legal and ethical regulation of chemicals (drug molecules, food additives, cosmetics, etc.). Among these tests, mutagenicity, defined as a genetic change that can occur due to an agent, has an important place. In particular, candidate drug molecules must show no mutagenic effects before their clinical trials can continue. Mutagenicity screening must be carried out in multiple steps consisting of in-vivo and in-vitro tests, none of which is sufficient on its own. In this study, the in-silico approach was used with statistical learning algorithms as a first step, in order to improve the overall mutagenicity determination process. The approach was applied to a set of molecules with experimentally determined mutagenicity information, and promising classification performance was obtained. Within the scope of the study, two data sets of molecules, known in the literature as Bursi and Benchmark, were merged, and the properties of the molecules in the merged data set were calculated with the Molecular Operating Environment (MOE) program.
Statistical learning algorithms (AdaBoost, ExtraTrees and Random Forest) were applied to the resulting 10835 x 193 data set, and parameters were selected with a grid search approach. All selections and applications were performed with 10-fold cross-validation. Using the models fitted with the best parameters found, variable selection was performed with respect to mutagenicity prediction, and the 72 most effective variables were retained. Nineteen different statistical learning algorithms were then applied to the new data set consisting of the selected variables, and the seven ensemble classification algorithms that gave the best results (AdaBoost, Bagging, ExtraTrees, GradientBoosting, Random Forest, XGBoost, LGBM) were chosen. With these algorithms, whose performance was further improved by parameter optimization, correct classification rates for mutagenicity ranging from 78% to 90% were obtained.
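
The workflow described above (grid search under 10-fold cross-validation, variable selection from the tuned model, then re-evaluation on the reduced variable set) might be sketched as follows with scikit-learn. This is only an illustration under stated assumptions: the data are synthetic placeholders generated by `make_classification`, not the Bursi/Benchmark molecules or MOE descriptors; the parameter grid and the number of retained variables `k` are hypothetical; and because the abstract does not state the selection criterion, the sketch assumes ranking by the forest's impurity-based feature importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the descriptor matrix (ASSUMPTION: not the study's data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, random_state=0
)

# 10-fold stratified cross-validation, reused for tuning and evaluation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Hypothetical, deliberately small grid; the study's actual grid is not given.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

# ASSUMPTION: rank variables by feature importance of the tuned model
# and keep the top k (the study retained 72 of its descriptors).
k = 8
importances = search.best_estimator_.feature_importances_
selected = np.argsort(importances)[::-1][:k]
X_selected = X[:, selected]

# Re-evaluate a model with the tuned parameters on the reduced variable set.
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
scores = cross_val_score(final_model, X_selected, y, cv=cv, scoring="accuracy")

print("best parameters:", search.best_params_)
print("selected variable indices:", sorted(selected.tolist()))
print("mean 10-fold accuracy on selected variables: %.3f" % scores.mean())
```

Note that selecting variables on the full data set and then cross-validating on the same data can give optimistic accuracy estimates. A stricter variant would nest the selection step inside each fold, for example with a scikit-learn `Pipeline`.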