Finding a genomic pattern for classifying breast cancer patients by Random Forest and Linear Discriminant Analysis and cancer-specific biomarkers by integrative analysis of DNA methylation and gene expression

Romina Norouzi; Sajjad Gharaghani

Publish place: The 11th National Conference and the 2nd International Conference on Bioinformatics of Iran

Publish Year: 1401

Document type: Conference paper

Language: English

The full text of this paper has not been provided and is not available.

Permanent link to this paper:

https://civilica.com/doc/1848048

National scientific document ID:

IBIS11_005

Indexing date: 19 Azar 1402

Abstract:

Differential methylation analysis of Illumina Human Methylation 450K data from TCGA was performed between 27 normal and 313 tumor samples across 27,342 methylation sites. The beta value (β) was used as an index to quantify the level of methylation, the Wilcoxon rank-sum test was used to determine the differentially methylated CpGs (DMCs), and the p-values were adjusted using the FDR method. DMCs were reported if the mean methylation difference was >0.2 at an FDR of 5%. Transcriptome data and clinical information of breast cancer patients (1222 samples) were accessed from the TCGA database. Differentially expressed genes (DEGs) were reported only if the log-fold change was >1.5 and the adjusted p-value was smaller than 0.05. We then integrated the differential expression and methylation data to identify breast cancer-specific markers. Next, to find a genomic pattern for classifying breast cancer patients, we applied Random Forest (RF) and Linear Discriminant Analysis (LDA) to the TCGA data, comparing a complex model with three hyperparameters (RF) against a simple baseline model (LDA). For the machine learning analysis, we split the dataset into training and testing groups (80%-20%); according to the training output, there were 1090 samples and 13793 genes. To follow a robust experimental design, we evaluated RF and LDA with 5 runs of 10-fold cross-validation. The RF implementation has hyperparameters such as the number of features randomly selected for each tree and the minimum node size. The LDA implementation has no hyperparameters. To determine which model performs better, we compared the two models over the 5 repetitions of 10-fold cross-validation and ensured that both models worked on the same data and partitions, in the same order.
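The requirement that both models see the same partitions in the same order can be illustrated with a minimal sketch. It assumes plain (unstratified) k-fold splitting and uses two toy placeholder models in place of RF and LDA; the names `repeated_kfold`, `mean_accuracy`, `majority_class`, and `nearest_centroid`, as well as the toy data, are hypothetical and not taken from the study.

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Return (train_idx, test_idx) pairs for n_repeats shuffles, each cut into n_splits folds."""
    rng = random.Random(seed)  # fixed seed makes the partitions reproducible
    partitions = []
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Fold i takes every n_splits-th sample, so fold sizes differ by at most one.
        folds = [order[i::n_splits] for i in range(n_splits)]
        for i, test_idx in enumerate(folds):
            train_idx = [s for j, fold in enumerate(folds) if j != i for s in fold]
            partitions.append((train_idx, test_idx))
    return partitions


def mean_accuracy(fit_predict, X, y, partitions):
    """Average test-fold accuracy of one model over a fixed list of partitions."""
    scores = []
    for train_idx, test_idx in partitions:
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def majority_class(X_train, y_train, X_test):
    """Placeholder baseline: always predicts the most frequent training label."""
    top = max(set(y_train), key=y_train.count)
    return [top] * len(X_test)


def nearest_centroid(X_train, y_train, X_test):
    """Placeholder model: predicts the class whose mean feature value is closest."""
    centroids = {}
    for label in set(y_train):
        vals = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = sum(vals) / len(vals)
    return [min(centroids, key=lambda c: abs(x - centroids[c])) for x in X_test]


# Toy one-feature data (made up for illustration): two well-separated classes.
X = list(range(20)) + list(range(100, 120))
y = [0] * 20 + [1] * 20

# Build the 5 x 10 partitions once and reuse the same list for both models.
partitions = repeated_kfold(len(X), n_splits=10, n_repeats=5, seed=0)
acc_simple = mean_accuracy(majority_class, X, y, partitions)
acc_centroid = mean_accuracy(nearest_centroid, X, y, partitions)
print(len(partitions), acc_simple, acc_centroid)
```

Generating the partition list once and passing it to every model is what keeps the comparison paired: any difference in mean accuracy then comes from the models, not from different splits.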
Across the five runs on the training set, the accuracy was 0.65 for Random Forest and 0.48 for Linear Discriminant Analysis. After finding reliable biomarkers and specific patterns, functional enrichment and drug-gene interaction analyses were done. To discover the main functions and associated enriched pathways of the differentially expressed and methylated genes, we used Gene Ontology (GO); the Benjamini–Hochberg method with p-value <0.05 and q-value <0.05 criteria was used for the GO analysis of DEGs. For the drug-gene interaction analysis of the differentially expressed and methylated genes, we selected only protein-coding genes with logFC > 3 or logFC < -3 as DGIdb input, exported the interactions from DGIdb, and then kept only the activator and inhibitor interaction types.
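The FDR adjustment mentioned above can be sketched as a Benjamini–Hochberg procedure followed by the DMC reporting rule (FDR of 5% and mean methylation difference > 0.2). This is an illustrative sketch only: the p-values are assumed to come from a rank-sum test computed elsewhere, the difference threshold is assumed to apply to the absolute difference, and the function names, site IDs, and numbers are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_sites(site_ids, pvalues, mean_diffs, fdr=0.05, min_diff=0.2):
    """Keep sites passing both the FDR cutoff and the (assumed absolute) difference cutoff."""
    adjusted = benjamini_hochberg(pvalues)
    return [s for s, q, d in zip(site_ids, adjusted, mean_diffs)
            if q < fdr and abs(d) > min_diff]


# Made-up illustrative values, not data from the study.
sites = ["cg_a", "cg_b", "cg_c", "cg_d"]
pvals = [0.01, 0.04, 0.03, 0.005]
diffs = [0.35, 0.10, -0.25, 0.05]

adjusted = benjamini_hochberg(pvals)
selected = select_sites(sites, pvals, diffs)
print(adjusted, selected)
```

In this toy input every site passes the FDR cutoff, so the difference threshold is what removes `cg_b` and `cg_d`; applying both filters together mirrors the two-part reporting criterion in the abstract.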

Keywords:

machine learning, DNA methylation and gene expression, cancer-specific biomarkers

Authors

Romina Norouzi

University of Tehran

Sajjad Gharaghani

University of Tehran