We present in this paper our work on comparison between Statistical Machine Translation (SMT) and Rule-based machine translation for translation from Marathi to Hindi. Rule Based systems although robust take lots of time to build. On the other hand statistical machine translation systems are easier to create, maintain and improve upon. We describe the development of a basic Marathi-Hindi SMT system and evaluate its performance. Through a detailed error analysis, we, point out the relative strengths and weaknesses of both systems. Effectively, we shall see that even with a small amount of training corpus a statistical machine translation system has many advantages for high quality domain specific machine translation over that of a rule-based counterpart.
Lack of proper linguistic resources is the major challenges faced by the Machine Translation system developments when dealing with the resource poor languages. In this paper, we describe effective ways to utilize the lexical resources to improve the quality of statistical machine translation. Our research on the usage of lexical resources mainly focused on two ways, such as; augmenting the parallel corpus with more vocabulary and to provide various word forms. We have augmented the training corpus with various lexical resources such as lexical words, function words, kridanta pairs and verb phrases. We have described the case studies, evaluations and detailed error analysis for both Marathi to Hindi and Hindi to Marathi machine translation systems. From the evaluations we observed that, there is an incremental growth in the quality of machine translation as the usage of various lexical resources increases. Moreover, usage of various lexical resources helps to improve the coverage and quality of machine translation where limited parallel corpus is available.
Use of social media has grown dramatically during the last few years. Users follow informal languages in communicating through social media. The language of communication is often mixed in nature, where people transcribe their regional language with English and this technique is found to be extremely popular. Natural language processing (NLP) aims to infer the information from these text where Part-of-Speech (PoS) tagging plays an important role in getting the prosody of the written text. For the task of PoS tagging on Code-Mixed Indian Social Media Text, we develop a supervised system based on Conditional Random Field classifier. In order to tackle the problem effectively, we have focused on extracting rich linguistic features. We participate in three different language pairs, ie. English-Hindi, English-Bengali and English-Telugu on three different social media platforms, Twitter, Facebook & WhatsApp. The proposed system is able to successfully assign coarse as well as fine-grained PoS tag labels for a given a code-mixed sentence. Experiments show that our system is quite generic that shows encouraging performance levels on all the three language pairs in all the domains.