CIVILICA We Respect the Science
(Specialized publisher of national conferences / Publication license number from the Ministry of Culture and Islamic Guidance: 8971)

Extracting Protein Names from Biological Literature

Paper title: Extracting Protein Names from Biological Literature
National paper ID: JR_ACSIJ-3-2_009
Published in: Issue 2, Volume 3, March 2014 (year 1392)
Authors:

Huang-Cheng Kuo - Department of Computer Science and Information Engineering National Chiayi University, Chia-Yi City, Taiwan
Ken-I Lin - Department of Computer Science and Information Engineering National Chiayi University, Chia-Yi City, Taiwan

Abstract:
Named entity recognition is an essential task in extracting biological knowledge. In a biological corpus, protein names and other terminology are mixed together in natural-language sentences. Whether an abbreviation is a protein name sometimes depends on the context. Protein names are often composed of gene names, cell names, or even drug names. Moreover, the number of newly coined protein names keeps increasing. Even with the assistance of a dictionary, it is still hard to identify all protein names in a biological corpus automatically and correctly. We modify a hierarchical model of protein name tokens. On the one hand, we use a rule-based method to improve the accuracy of protein name recognition. On the other hand, we use an N-gram language model to determine the boundaries of protein names. Numerous studies have noted that the hardest part is identifying abbreviations and words beginning with an uppercase letter. To improve recognition performance, we use a dictionary to strengthen recognition of abbreviations and words beginning with an uppercase letter. Experimental results show an increase in performance of about 10%. We use the YAPEX and GENIA corpora in our experiments. Our method achieves an F-score of 0.697 on the YAPEX corpus and 0.691 on the GENIA corpus. Finally, when the UniProt dictionary database is used to strengthen abbreviation recognition, the F-score reaches 0.797 on the YAPEX corpus and 0.806 on the GENIA corpus.
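
As a rough illustration of two ideas mentioned in the abstract (dictionary lookup for abbreviations and capitalized words, and evaluation by F-score), the following is a minimal sketch and not the authors' implementation. The toy dictionary, the candidate heuristic, the function names, and the token-level scoring are all assumptions made for illustration; the paper's rule-based and N-gram components are not reproduced here.

```python
# Hypothetical stand-in for a protein dictionary such as UniProt.
# The entries are illustrative only.
TOY_DICTIONARY = {"il-2", "p53"}


def is_candidate(token):
    """Assumed heuristic: treat a token as a possible abbreviation or
    capitalized word if it contains an uppercase letter or a digit."""
    return any(ch.isupper() or ch.isdigit() for ch in token)


def tag_proteins(tokens, dictionary=TOY_DICTIONARY):
    """Mark a token as a protein name only if it is a candidate
    and its lowercased form appears in the dictionary."""
    return [is_candidate(t) and t.lower() in dictionary for t in tokens]


def f_score(predicted, gold):
    """Token-level F-score (harmonic mean of precision and recall).
    Scoring per token rather than per entity is a simplification."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


tokens = ["The", "IL-2", "gene", "activates", "NF-kB", "and", "p53"]
gold = [False, True, False, False, True, False, True]
predicted = tag_proteins(tokens)
score = f_score(predicted, gold)
```

In this toy sentence the dictionary misses "NF-kB", so recall drops while precision stays perfect. This is the kind of gap that a larger dictionary is meant to narrow.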

Keywords:
Named Entity Recognition, Protein Name Recognition, N-gram Language Model

Paper page and full-text download: https://civilica.com/doc/245348/