Sentence Level Paraphrase Identification System for Tamil Language


Author(s):

  1. Dr.C.S.Kanimozhiselvi, Kongu Engineering College, Erode, Tamilnadu, India
  2. Dr.S.Malliga, Kongu Engineering College, Erode, Tamilnadu, India
  3. Dr.S.V.Kogilavani, Kongu Engineering College, Erode, Tamilnadu, India, kogilavani.sv@gmail.com

Abstract:

Automatic paraphrase detection has many applications, including plagiarism detection and new event detection. A paraphrase is a restatement of a given fact using different phrases. Paraphrase identification is a classical natural language processing task of the classification type. The aim of this work is to detect sentence-level plagiarism through paraphrase identification of sentences in Tamil. The Tamil sentences are processed using a Tamil shallow parser. Shallow parsing analyzes a sentence to identify the parts of speech of its words, such as nouns, verbs, and adjectives. The sentences are also processed using the word2vec tool to identify word order between sentences. From the output of the shallow parser and word2vec, a feature file is constructed in which the text values are converted into a numerical matrix. This feature file is given as input to machine learning algorithms, which classify each sentence pair as paraphrase or not-a-paraphrase. If a pair is classified as a paraphrase, the sentence is considered plagiarized. The performance of these methods is measured using accuracy, precision, recall, and F-measure. The analysis based on these measures shows that the Random Forest method classifies sentence pairs as paraphrase or not-a-paraphrase with higher accuracy than the other methods.
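
The four evaluation measures named in the abstract can be illustrated with a short sketch. This is not the authors' code: the function name `evaluate`, the `Metrics` container, and the label strings "P" (paraphrase) and "NP" (not-a-paraphrase) are hypothetical choices made for illustration, and the sample labels are made up, not results from the paper. It assumes paraphrase is treated as the positive class.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f_measure: float


def evaluate(gold, predicted, positive="P"):
    """Compute accuracy, precision, recall and F-measure for binary labels.

    `gold` and `predicted` are equal-length sequences of labels.
    `positive` is the label treated as the positive class (assumed here
    to be "P" for paraphrase; the label names are hypothetical).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1  # predicted paraphrase, truly a paraphrase
        elif p == positive:
            fp += 1  # predicted paraphrase, actually not
        elif g == positive:
            fn += 1  # missed a true paraphrase
        else:
            tn += 1  # correctly rejected

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return Metrics(accuracy, precision, recall, f_measure)


# Made-up labels for five sentence pairs, for illustration only.
gold = ["P", "P", "NP", "NP", "P"]
predicted = ["P", "NP", "NP", "P", "P"]
result = evaluate(gold, predicted)
print(result)
```

In this toy input there are two true positives, one false positive, one false negative, and one true negative, so accuracy is 3/5 and precision, recall, and F-measure are each 2/3. Comparing classifiers such as Random Forest against alternatives would amount to calling `evaluate` once per classifier on the same gold labels.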
