Indian Language Benchmark Portal

2 results

Please Login/Register to submit the new Resources

Bangla Word Clustering Based on Tri-gram, 4-gram and 5-gram Language Model
Dipaloke SahaMd Saddam HossainMD. Saiful IslamSabir Ismail

In this paper, we describe a research method that generates Bangla word clusters on the basis of relating to meaning in language and contextual similarity. The importance of word clustering is in parts of speech (POS) tagging, word sense disambiguation, text classification, recommender system, spell checker, grammar checker, knowledge discover and for many others Natural Language Processing (NLP) applications. In the history of word clustering, English and some other languages have already implemented some methods on word clustering efficiently. But due to lack of the resources, word clustering in Bangla has not been still implemented efficiently. Presently, its implementation is in the beginning stage. In some research of word clustering in English based on preceding and next five words of a key word they found an efficient result. Now, we are trying to implement the tri-gram, 4-gram and 5-gram model of word clustering for Bangla to observe which one is the best among them. We have started our research with quite a large corpus of approximate 1 lakh Bangla words. We are using a machine learning technique in this research. We will generate word clusters and analyze the clusters by testing some different threshold values.

Document Decomposition of Bangla Printed Text
Md. Fahad HasanTasmin AfrozSabir IsmailMd. Saiful Islam

Today all kind of information is getting digitized and along with all this digitization, the huge archive of various kinds of documents is being digitized too. We know that, Optical Character Recognition is the method through which, newspapers and other paper documents convert into digital resources. But, it is a fact that this method works on texts only. As a result, if we try to process any document which contains non-textual zones, then we will get garbage texts as output. That is why; in order to digitize documents properly they should be prepossessed carefully. And while preprocessing, segmenting document in different regions according to the category properly is most important. But, the Optical Character Recognition processes available for Bangla language have no such algorithm that can categorize a newspaper/book page fully. So we worked to decompose a document into its several parts like headlines, sub headlines, columns, images etc. And if the input is skewed and rotated, then the input was also deskewed and de-rotated. To decompose any Bangla document we found out the edges of the input image. Then we find out the horizontal and vertical area of every pixel where it lies in. Later on the input image was cut according to these areas. Then we pick each and every sub image and found out their height-width ratio, line height. Then according to these values the sub images were categorized. To deskew the image we found out the skew angle and de skewed the image according to this angle. To de-rotate the image we used the line height, matra line, pixel ratio of matra line.

Filter by Author
P. D. Gujrati (8)
Manish Shrivastava (7)
Umapada Pal (5)
Partha Pratim Roy (5)
Iti Mathur (4)
C.V. Jawahar (4)