IIT Jodhpur has been working on recognizers for typewritten text and on language models that improve the accuracy of OCR output. We have also worked on novel methods for preprocessing and categorizing printed documents.
To preprocess printed documents, we have trained a deep parallel architecture that super-resolves a given text image. We have also developed a layout matching scheme that is invariant to rotation of the document image; this scheme helps in categorizing documents.
Two methods have been developed for recognizing typewritten text. The first uses a 2D CNN to recognize segmented characters. The second uses an LSTM-based approach that takes an entire text line as input and recognizes it without prior character segmentation. The LSTM-based approach yielded better recognition performance. Ongoing work focuses on making the recognizer robust to degradation of the input image.
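To illustrate how a line-level recognizer can emit text without segmenting characters first, the sketch below shows greedy CTC-style decoding of per-frame scores. This is only an assumption for illustration: the summary above does not state which decoding scheme the LSTM recognizer uses, and the names `greedy_ctc_decode`, `BLANK`, and the toy alphabet are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # hypothetical index reserved for the "no character" (blank) label


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Turn per-frame label scores for one text line into a string.

    Assumes label 0 is the blank and label k (k >= 1) maps to alphabet[k - 1].
    Picks the best label in each frame, merges consecutive repeats,
    and drops blanks.
    """
    decoded: List[str] = []
    previous = BLANK
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a character only when the label is not blank and differs
        # from the label chosen in the previous frame.
        if best != BLANK and best != previous:
            decoded.append(alphabet[best - 1])
        previous = best
    return "".join(decoded)


# Toy scores for five frames over the labels (blank, 'a', 'b').
toy_scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
]
line_text = greedy_ctc_decode(toy_scores, "ab")
```

In this sketch the blank frame between the two runs of 'a' is what allows a doubled character to survive the merging of repeats, which is why a line-level model of this kind needs no explicit character boundaries.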
To build language models for Indian languages, a corpus is being collected and the existing literature on statistical language modeling is being surveyed. Work on building n-gram models and morphological analyzers is in progress.
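As a rough illustration of how an n-gram model could help choose among competing OCR outputs, the following is a minimal sketch of a bigram model with add-one smoothing. It is not the project's implementation: the class `BigramModel`, the helper `pick_best`, the smoothing choice, and the toy English corpus are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Iterable, List


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences: Iterable[List[str]]) -> None:
        self.context_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        self.vocab = {"</s>"}  # end marker can be predicted like a word
        for tokens in sentences:
            padded = ["<s>", *tokens, "</s>"]
            self.vocab.update(tokens)
            # Every token except the last serves as a context once.
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))

    def log_prob(self, tokens: List[str]) -> float:
        """Smoothed log-probability of a tokenized sentence."""
        padded = ["<s>", *tokens, "</s>"]
        vocab_size = len(self.vocab)
        total = 0.0
        for prev, word in zip(padded, padded[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def pick_best(model: BigramModel, candidates: List[List[str]]) -> List[str]:
    """Return the candidate transcription the model scores highest."""
    return max(candidates, key=model.log_prob)


# Hypothetical toy corpus and two competing OCR hypotheses.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
model = BigramModel(corpus)
best = pick_best(model, [["the", "cot", "sat"], ["the", "cat", "sat"]])
```

Here the hypothesis containing the unseen word "cot" receives a lower score than the one seen in the corpus, so the model prefers "the cat sat". A real system for Indian languages would likely need a much larger corpus and a smoothing method better suited to rich morphology.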