Comparison of Random Forest and Boosted Regression Tree in improving predicted affinity

Publication year: 1400 (Solar Hijri calendar)
Document type: Conference paper
Language: English

The full text of this paper has not been provided and is not available.

National scientific document ID: IBIS10_002

Indexing date: 5 Tir 1401

Abstract:

One of the challenges in predicting protein-ligand affinity is how target flexibility should be considered in the docking procedure. Recently, ensemble docking has gained increasing attention as a promising solution to this problem. However, it remains unclear how an optimal set of conformations can be chosen in order to reduce the computational cost and the number of false positives in pose prediction. To generate an efficient ensemble of CDK2 X-ray structures, a robust graph-based selection algorithm is proposed, with which 126 non-redundant CDK2 structures are selected for the ensemble dataset. A diverse set of ligands is extracted from ChEMBL and docked to the non-redundant receptor ensemble. A feature set of 512 features, including energetic features of the docking results, alongside eight simple molecular features of the ligands, is considered in the final dataset for the machine learning (ensemble-based) affinity prediction method. The use of machine learning eliminates the need for classical scoring functions such as force-field, knowledge-based, and empirical functions, which are prone to limitations as the training data size increases. In this study, the Random Forest (RF) and Boosted Regression Trees (BRT) ensemble learning algorithms are used for the final affinity prediction. The impurity importance values of the RF method are then used to choose the CDK2 structures that play a more important role in ensemble docking. Experiments show that docking to only those receptors selected by RF reduces the error and also the error skewness. Finally, using these methods, MSE = 1.3 and R_p = 0.5 are obtained for RF, and MSE = 1.37 and R_p = 0.52 for BRT (hyperparameters set to their default values and models iterated 50 times). By letting machine learning select important features, an accuracy of 1 kcal/mol is achieved, which is significantly better than methods not based on machine learning.
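
The workflow described in the abstract (train RF and BRT with default hyperparameters, score them by MSE and R_p over repeated runs, then keep only the receptors that RF ranks as important) can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It assumes scikit-learn and NumPy, uses synthetic data in place of the CDK2 docking dataset, treats each feature column as one hypothetical receptor score, and takes R_p to be the Pearson correlation. The helper names (`make_synthetic_data`, `evaluate`, `select_receptors`), the 80/20 split, and the choice of 5 repeats instead of 50 are all assumptions made for brevity.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def make_synthetic_data(n_ligands=300, n_receptors=12, seed=0):
    """Hypothetical stand-in for the docking dataset: one score column per receptor."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_ligands, n_receptors))
    # Assumption: only the first three "receptors" carry signal about affinity.
    noise = rng.normal(scale=0.5, size=n_ligands)
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + noise
    return X, y


def evaluate(model_cls, X, y, n_repeats=5):
    """Mean test MSE and Pearson R over repeated random splits, default hyperparameters."""
    mses, rps = [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = model_cls(random_state=seed).fit(X_tr, y_tr)
        pred = model.predict(X_te)
        mses.append(mean_squared_error(y_te, pred))
        rps.append(np.corrcoef(y_te, pred)[0, 1])
    return float(np.mean(mses)), float(np.mean(rps))


def select_receptors(X, y, top_k=3, seed=0):
    """Rank columns by RF impurity-based importance and return the top_k indices."""
    rf = RandomForestRegressor(random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]
    return sorted(order[:top_k].tolist())


X, y = make_synthetic_data()
selected = select_receptors(X, y)

results = {}
for name, cls in [("RF", RandomForestRegressor), ("BRT", GradientBoostingRegressor)]:
    results[name] = {
        "all": evaluate(cls, X, y),
        "selected": evaluate(cls, X[:, selected], y),
    }
    (mse_a, rp_a), (mse_s, rp_s) = results[name]["all"], results[name]["selected"]
    print(f"{name}: all MSE={mse_a:.2f} Rp={rp_a:.2f} | "
          f"selected MSE={mse_s:.2f} Rp={rp_s:.2f}")
```

In this sketch, `feature_importances_` is scikit-learn's impurity-based importance, which is the kind of quantity the abstract refers to. How importances over the 512 real features would be aggregated per CDK2 structure is not stated in the abstract and is left out here.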

Authors

Sara Mohammadi

Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

Zahra Narimani

Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran

Mitra Ashouri

Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

Mohammad Hossein Karimi-Jafari

Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran