IPRC - 2016

Permanent URI for this collection: http://repository.kln.ac.lk/handle/123456789/157

  • Intelligent Sorting System for Curriculum Vitae using Natural Language Processing
    (Faculty of Graduate Studies, University of Kelaniya, Sri Lanka, 2016) Weerasooriya, T.
    Natural Language Processing (NLP) has undergone tremendous development over the past few decades. The logic behind sentence analysis plays a vital role in NLP applications. The present study uses Stanford CoreNLP, an NLP toolkit that provides Part-of-Speech (POS) tagging and Named Entity tagging, to extract the essential information from a curriculum vitae (CV) and then to rank the best candidates according to that information. The system design is as follows. First, the proposed system categorizes the candidates according to the post applied for. The second step checks for the basic qualifications required by the company; if these requirements are not met, the CV is rejected. The third step uses POS tagging to interpret and assign marks for each section of the CV. The extracurricular activities section is grammatically ambiguous, as it contains achievements in sports, clubs and societies. The research aimed to classify the extracurricular activities using a combination of a rule-based parser and the Named Entity tagger. First, the sentence is passed through the rule-based parser, which classifies it as a sport or a club activity using a word match specific to each group. The category with the highest match receives a ¾ share of the decision. The Named Entity tagger then searches the sentence for any sports or organizations, and its classification receives a ¼ share of the decision. The sentence is assigned to the category with the highest score. During testing on a CV containing 28 extracurricular activities, the system classified 14 as achievements in Sports and 14 as achievements in Clubs and Societies, whereas the correct classification is 17 in Sports and 11 in Clubs and Societies. The methodology would succeed in sorting ambiguous sentences where a corpus-based method would fail (e.g. “Compered at Kelani Hockey 6’s”: the keyword of the sentence is Hockey, but it is not an achievement in sports). Being an adaptable system using NLP, it could be customized to assign a weighted score to specific keywords depending on the requirements of the organization. The fourth step assigns a total score to the CV. At the end of the cycle, the system outputs the list of the top 50 CVs qualified for the post. The system was tested with a sample data set from the CV bank of the Career Fair 2015 (CF) of the University of Kelaniya. The manual CV sorting process of the CF required at least 2 minutes per CV, and each CV was sorted individually. Compared with this manual process, the system was less time-consuming, more organized and more efficient.
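The ¾ and ¼ weighted decision described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the keyword sets, the function names and the `ner_label` argument (standing in for the output of the Named Entity tagger) are all hypothetical, and only the two weights come from the abstract.

```python
# Hypothetical keyword lists; the abstract does not publish the actual rules.
SPORT_WORDS = {"captain", "team", "tournament", "played", "championship"}
CLUB_WORDS = {"compered", "president", "secretary", "society", "club", "committee"}

RULE_WEIGHT = 0.75  # share given to the rule-based word match (from the abstract)
NER_WEIGHT = 0.25   # share given to the Named Entity tagger (from the abstract)


def rule_vote(tokens):
    """Return the category with more keyword matches, or None when tied."""
    sport_hits = sum(token in SPORT_WORDS for token in tokens)
    club_hits = sum(token in CLUB_WORDS for token in tokens)
    if sport_hits == club_hits:
        return None
    return "sports" if sport_hits > club_hits else "clubs"


def classify(sentence, ner_label=None):
    """Combine the rule-based vote and an externally supplied NER vote.

    ner_label is assumed to be "sports", "clubs" or None; in the study it
    would come from the Named Entity tagger, which is not called here.
    """
    tokens = [word.strip(".,;:()\"").lower() for word in sentence.split()]
    scores = {"sports": 0.0, "clubs": 0.0}
    rule_label = rule_vote(tokens)
    if rule_label is not None:
        scores[rule_label] += RULE_WEIGHT
    if ner_label in scores:
        scores[ner_label] += NER_WEIGHT
    best = max(scores, key=scores.get)
    # Return None when neither component produced a vote.
    return best if scores[best] > 0 else None


# The rule vote (clubs, 0.75) outweighs a hypothetical NER vote (sports, 0.25).
label = classify("Compered at Kelani Hockey 6's", ner_label="sports")
```

Under these assumed keyword lists, the example sentence from the abstract is assigned to clubs even if a tagger flags "Hockey" as a sport, because the rule-based vote carries three times the weight.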
  • Comparison of Part of Speech taggers for Sinhala Language
    (Faculty of Graduate Studies, University of Kelaniya, Sri Lanka, 2016) Jayaweera, M.; Dias, N.G.J.
    Part of Speech (POS) tagging is an important tool for processing natural languages and one of the basic analytical models used in many natural language processing applications. It is the process of marking up a word in a corpus as corresponding to a particular part of speech, such as noun, verb, adjective or adverb. The automatic assignment of descriptors to given tokens is called tagging, and the descriptor is called a tag. The tag may indicate a part of speech category as well as semantic information, so tagging is a kind of classification. Assigning one of the parts of speech to a given word is called part of speech tagging, commonly referred to as POS tagging. In grammar, a part of speech (also known as word class, lexical class, or lexical category) is a linguistic category of words (or, more precisely, lexical items), which is generally defined by the syntactic or morphological behavior of the lexical item in the language. Each part of speech explains not what the word is, but how the word is used. In fact, the same word can be a noun in one sentence and a verb or adjective in another. Almost all natural languages have the lexical categories noun and verb, but beyond these there are significant variations between languages. The significance of the part of speech for language processing is that it gives a considerable amount of information about the word and its neighbours. There are different approaches to the problem of assigning a part of speech tag to each word of a natural language sentence. The most widely used methods for English are the statistical methods, namely Hidden Markov Model (HMM) based tagging, and the rule-based or transformation-based methods. Subsequent studies have added various modifications to these basic approaches to improve the performance of the taggers for English.
    In this paper we present a comparison of the different studies that have been carried out on POS tagging for the Sinhala language. For Sinhala, there are 4 reported works on developing a POS tagger. In 2004, an HMM-based POS tagger using a bigram model was proposed and reported an accuracy of only 60%. Another HMM-based approach was tried for Sinhala in 2013 and reported an accuracy of 62%. In 2016, another study reported an accuracy of 72% with a hybrid approach based on a bigram HMM and a rule-based approach to predicting the relevant tag for unknown words. The tagger that we have developed is based on a trigram HMM approach, which uses knowledge of the distribution of words and part of speech categories to predict the relevant tag for unknown words. The Witten-Bell discounting technique was used for smoothing, and our approach gave an accuracy of 91.50% with a corpus of 90551 annotated words.
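As a rough illustration of how Witten-Bell smoothing can be combined with a trigram model, the Python sketch below estimates tag-transition probabilities in the interpolated form P(t | h) = (c(h, t) + T(h) · P_lower(t)) / (c(h) + T(h)), where c(h) is the number of times context h was seen and T(h) is the number of distinct tags that followed it. The abstract does not say which distributions the authors smoothed, so applying it to tag transitions is an assumption. The class name, the `<s>` padding symbol and the toy English tag sequences are hypothetical as well.

```python
from collections import defaultdict


class WittenBellTagModel:
    """Tag n-gram model with interpolated Witten-Bell smoothing (sketch)."""

    def __init__(self, tag_sequences, order=3):
        self.order = order
        # counts[context][tag] = times `tag` followed `context` (a tuple of tags)
        self.counts = defaultdict(lambda: defaultdict(int))
        self.tags = set()
        for sequence in tag_sequences:
            self.tags.update(sequence)
            padded = ["<s>"] * (order - 1) + list(sequence)
            for i in range(order - 1, len(padded)):
                # Record every context length from 0 up to order - 1.
                for n in range(order):
                    context = tuple(padded[i - n:i])
                    self.counts[context][padded[i]] += 1

    def prob(self, tag, context=()):
        """Smoothed P(tag | context), using at most order - 1 context tags."""
        context = tuple(context)
        if self.order > 1:
            context = context[-(self.order - 1):]
        else:
            context = ()
        return self._prob(tag, context)

    def _prob(self, tag, context):
        # Lower-order estimate: drop the oldest context tag, or fall back
        # to a uniform distribution over the tag set below the unigram level.
        if context:
            lower = self._prob(tag, context[1:])
        else:
            lower = 1.0 / len(self.tags)
        followers = self.counts.get(context)
        if not followers:
            return lower  # unseen context: rely entirely on the lower order
        total = sum(followers.values())  # c(h)
        types = len(followers)           # T(h)
        return (followers.get(tag, 0) + types * lower) / (total + types)


# Toy tag sequences for illustration only (not Sinhala corpus data).
TOY_TAG_SEQUENCES = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["NOUN", "VERB", "DET", "NOUN"],
]
model = WittenBellTagModel(TOY_TAG_SEQUENCES)
p_seen = model.prob("VERB", ("DET", "NOUN"))
p_unseen = model.prob("ADJ", ("DET", "NOUN"))
```

In this sketch an unseen trigram such as DET NOUN ADJ still receives a non-zero probability from the bigram and unigram estimates, which is the property that lets an HMM tagger score tag sequences it never observed in training.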