Inspired by the success of deep learning based approaches to English scene text recognition, we pose and benchmark scene text recognition for three Indic scripts: Devanagari, Telugu and Malayalam. Synthetic word images rendered from Unicode fonts are used for training the recognition system. Performance is benchmarked on a new dataset, IIIT-ILST, comprising hundreds of real scene images containing text in the above-mentioned scripts. We use a segmentation-free, hybrid but end-to-end trainable CNN-RNN deep neural network for transcribing the word images to the corresponding texts. The cropped word images need not be segmented into sub-word units, and the error is calculated and backpropagated for the given word image at once. The network is trained using CTC loss, which has proven quite effective for sequence-to-sequence transcription tasks. The CNN layers in the network learn to extract robust feature representations from word images. The sequence of features learnt by the convolutional block is transcribed to a sequence of labels by the RNN+CTC block. The transcription is not bound by word length or a lexicon, which makes it well suited to Indian languages, which are highly inflectional.
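To illustrate why a CTC-trained network needs no sub-word segmentation, the following is a minimal sketch of greedy (best-path) CTC decoding. It is not the authors' implementation: the function name `ctc_greedy_decode`, the helper `one_hot`, the toy alphabet and the choice of index 0 as the blank label are all assumptions made for this example. Each frame produced by the RNN receives a score per label; the decoder takes the best label per frame, collapses consecutive repeats, and drops blanks.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks.

    frame_scores: list of per-frame score lists (one score per label).
    alphabet:     list mapping label index -> character (hypothetical toy alphabet).
    blank:        index of the CTC blank label (assumed to be 0 here).
    """
    # Pick the highest-scoring label for every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a character only when the label changes and is not the blank.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


def one_hot(index, size):
    """Helper for the toy example: a frame that strongly prefers one label."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Toy alphabet: index 0 is the blank, the rest are Devanagari characters.
alphabet = ["", "क", "म", "ल"]

# Seven frames for a three-character word; no frame-to-character alignment is given.
frames = [one_hot(i, len(alphabet)) for i in [1, 1, 0, 2, 2, 0, 3]]
word = ctc_greedy_decode(frames, alphabet)
```

The output length depends only on the frame labels, not on a lexicon or a fixed word length. A blank between two identical labels is what allows a doubled character to survive the collapsing step.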
In most Optical Character Recognition software, a substantial percentage of errors are caused by the incorrect segmentation of degraded words. This is especially true for recognizing old books, newspapers and historical manuscripts. In this paper, we propose multiple segmentation methods which address the problem of cuts and merges in degraded words. We have created an annotated dataset of 1034 word images with pixel level ground truth for quantitative evaluation of the methods. We compare the methods with a baseline implementation based on connected component analysis. We report substantial improvement in accuracy both at character and at word level. Keywords: Character Segmentation; Degradation Correction; Malayalam; Indian Language.
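The connected component baseline mentioned above can be sketched as follows. This is an illustrative version only, not the paper's implementation: the function name `connected_components`, the use of 8-connectivity and the tiny binary word images are assumptions. The toy images show how a merge fuses two characters into one component and how a cut splits one character into two, which is where such a baseline tends to fail on degraded words.

```python
from collections import deque


def connected_components(rows):
    """Return sorted bounding boxes (left, top, right, bottom) of ink components.

    rows: list of equal-length strings, '1' for ink and '0' for background.
    Uses 8-connectivity (an assumption made for this sketch).
    """
    image = [[int(ch) for ch in row] for row in rows]
    height, width = len(image), len(image[0])
    visited = [[False] * width for _ in range(height)]
    boxes = []

    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or visited[r][c]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            visited[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not visited[ny][nx]):
                            visited[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


# Two well-separated characters: the baseline finds two components.
clean = ["1101",
         "1101",
         "1101"]

# A merge: stray ink bridges the gap, so both characters become one component.
merged = ["1101",
          "1111",
          "1101"]

# A cut: the left character is broken, so it is reported as two components.
cut = ["1101",
       "0001",
       "1101"]

clean_boxes = connected_components(clean)
merged_boxes = connected_components(merged)
cut_boxes = connected_components(cut)
```

In this toy setting the component count is 2 for the clean word, 1 for the merged word and 3 for the cut word, although all three images depict the same two characters.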
Malayalam is an Indian language with its own script, spoken by 40 million people. It has a rich literary tradition. A character recognition system for this language will be of immense help in a spectrum of applications ranging from data entry to reading aids. The Malayalam script has a large number of similar characters, making the recognition problem challenging. In this chapter, we present our approach for recognition of Malayalam documents, both printed and handwritten. Classification results as well as ongoing activities are presented.
This paper describes the character recognition process from printed documents containing Hindi and Telugu text. Hindi and Telugu are among the most popular languages in India. The bilingual recognizer is based on Principal Component Analysis followed by support vector classification. It attains an overall accuracy of approximately 96.7%. Extensive experimentation is carried out on an independent test set of approximately 200,000 characters. Applications based on this OCR are sketched.
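The pipeline of Principal Component Analysis followed by support vector classification can be sketched on toy data as below. This is a simplified illustration under stated assumptions, not the recognizer described in the paper: it keeps a single principal component (found by power iteration), trains a binary linear SVM by subgradient descent on the hinge loss, and uses made-up three-dimensional feature vectors. The names `top_component`, `project`, `train_linear_svm` and `classify`, and all hyperparameters, are hypothetical. A real character recognizer would use many components and a multi-class classifier.

```python
import math


def top_component(rows, iters=200):
    """Return (mean, unit vector) of the first principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    # Covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(row, mean, component):
    """Project one feature vector onto the principal component."""
    return sum((x - m) * c for x, m, c in zip(row, mean, component))


def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Binary linear SVM on 1-D inputs, trained by subgradient descent on hinge loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1:
                # Margin violated: hinge term and regularizer both contribute.
                w += lr * (y * x - lam * w)
                b += lr * y
            else:
                # Margin satisfied: only the regularizer shrinks w.
                w -= lr * lam * w
    return w, b


# Made-up feature vectors for two character classes (labels -1 and +1).
features = [(-2.0, -2.1, 0.1), (-1.8, -2.0, -0.1), (-2.2, -1.9, 0.0),
            (2.0, 2.1, 0.1), (1.9, 2.0, -0.1), (2.1, 1.8, 0.0)]
labels = [-1, -1, -1, 1, 1, 1]

mean, component = top_component(features)
projected = [project(row, mean, component) for row in features]
weight, bias = train_linear_svm(projected, labels)


def classify(row):
    """Reduce with PCA, then apply the linear SVM decision rule."""
    return 1 if weight * project(row, mean, component) + bias >= 0 else -1


predictions = [classify(row) for row in features]
```

The dimensionality reduction is fitted once on the training features and the same mean and component are reused at prediction time, so training and test samples are projected into the same space.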