Comparison of Random Forest and Boosted Regression Tree in improving predicted affinity
Published in: The first international conference and the tenth national bioinformatics conference of Iran
Publication year: 1400
Document type: Conference paper
Language: English
National scientific document ID:
IBIS10_002
Indexing date: 5 Tir 1401
Abstract:
One of the challenges in predicting protein-ligand affinity is how target flexibility should be considered in the docking procedure. Recently, ensemble docking has gained increasing attention as a promising solution to this problem. However, it is still unclear how an optimal set of conformations can be chosen in order to reduce both the computational cost and the number of false positives in pose prediction. To generate an efficient ensemble of CDK2 X-ray structures, a robust graph-based selection algorithm is proposed, with which 126 non-redundant CDK2 structures are selected for the ensemble dataset. A diverse set of ligands is extracted from ChEMBL and docked to the non-redundant receptor ensemble. A feature set of 512 features, including energetic features of the docking results alongside eight simple molecular features of the ligands, is considered in the final dataset for the machine learning (ensemble-based) affinity prediction method. The use of machine learning eliminates the need for classical scoring functions such as force-field, knowledge-based and empirical functions, which are prone to limitations as the training data size increases. In this study, the Random Forest (RF) and Boosted Regression Trees (BRT) ensemble learning algorithms are used for the final affinity prediction. The impurity importance values of the RF method are then used to choose the CDK2 structures that play a more important role in ensemble docking. Experiments show that docking to only those receptors selected by RF reduces the error and also the error skewness. Finally, using the mentioned methods, MSE_RF = 1.3 and Rp_RF = 0.5 are obtained for RF, and MSE_BRT = 1.37 and Rp_BRT = 0.52 for BRT (hyperparameters set to their default values and models iterated 50 times). By letting machine learning select important features, an accuracy of 1 kcal/mol is achieved, which is significantly better than methods not based on machine learning.
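The evaluation and selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that Rp denotes the Pearson correlation coefficient, and that the importance of a receptor structure can be approximated by summing the impurity importances of the features derived from it. The function names, the feature names and the feature-to-structure mapping are all hypothetical; in practice the importance values would come from a trained RF model.

```python
import math
from collections import defaultdict


def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between measured and predicted affinities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rank_structures(feature_importance, feature_to_structure, top_k):
    """Sum feature importances per receptor structure and return the top_k IDs.

    Features missing from feature_to_structure (for example ligand-only
    descriptors) are ignored. Ties are broken by structure ID.
    """
    totals = defaultdict(float)
    for feature, importance in feature_importance.items():
        structure = feature_to_structure.get(feature)
        if structure is not None:
            totals[structure] += importance
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [structure for structure, _ in ranked[:top_k]]


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    measured = [6.1, 7.4, 5.2, 8.0]
    predicted = [6.5, 7.0, 5.9, 7.1]
    print("MSE:", mean_squared_error(measured, predicted))
    print("Rp:", pearson_r(measured, predicted))

    importance = {"s1_e1": 0.10, "s1_e2": 0.20, "s2_e1": 0.50,
                  "s3_e1": 0.05, "mol_weight": 0.15}
    mapping = {"s1_e1": "S1", "s1_e2": "S1", "s2_e1": "S2", "s3_e1": "S3"}
    print("Selected structures:", rank_structures(importance, mapping, top_k=2))
```

Summing importances per structure is only one possible aggregation. Taking the maximum or the mean per structure would be equally plausible, and the abstract does not state which rule was used.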
Keywords:
Authors
Sara Mohammadi
Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
Zahra Narimani
Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
Mitra Ashouri
Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
Mohammad Hossein Karimi-Jafari
Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran