A New Dataset of Persian Handwritten Documents and its Segmentation

Paper title: A New Dataset of Persian Handwritten Documents and its Segmentation
National paper ID: ICMVIP07_138
Published in the 7th Iranian Conference on Machine Vision and Image Processing, 1390 (Solar Hijri calendar)
Authors:

Alireza Alaei - Department of Studies in Computer Science, University of Mysore, Mysore, 570006, India
P. Nagabhushan - Department of Studies in Computer Science, University of Mysore, Mysore, 570006, India
Umapada Pal - Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata–108, India

Abstract:
In document image analysis, and especially in handwritten document image recognition, standard datasets play a vital role in evaluating the performance of algorithms and comparing results obtained by different groups of researchers. In this paper, an unconstrained Persian handwritten text dataset (PHTD) is introduced. The PHTD contains 140 handwritten documents of three different categories written by 40 individuals. The total numbers of text-lines and words/subwords in the dataset are 1787 and 27073, respectively. Most of the PHTD documents contain overlapping or touching text-lines. The average number of text-lines per document in the PHTD is 13. Two types of ground truth, based on pixel information and content information, are generated for the dataset. With these two types of ground truth, the PHTD can be used in many areas of document image processing, such as sentence recognition/understanding, text-line segmentation, word segmentation, word recognition, and character segmentation. To provide a framework for other researchers, recent text-line segmentation results on this dataset are also reported.
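
The counts quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the paper. It uses only the figures stated above, the variable names are illustrative assumptions, and the words-per-line figure is derived here rather than reported by the authors.

```python
# Figures quoted in the abstract (variable names are hypothetical).
documents = 140
text_lines = 1787
words_or_subwords = 27073

# Average text-lines per document; the abstract reports an average of 13.
lines_per_document = text_lines / documents

# Average words/subwords per text-line (derived here, not stated in the abstract).
words_per_line = words_or_subwords / text_lines

print(f"{lines_per_document:.2f} text-lines per document")
print(f"{words_per_line:.2f} words/subwords per text-line")
```

Rounding the first value to the nearest integer agrees with the average of 13 text-lines per document given in the abstract.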

Keywords:
Handwritten document; Persian handwritten recognition; Persian handwritten dataset; Ground truth

