Sinhala Character Recognition using Tesseract OCR

Manisha, U.K.D.N.; Liyanage, S.R.

dc.contributor.author	Manisha, U.K.D.N.
dc.contributor.author	Liyanage, S.R.
dc.date.accessioned	2018-08-10T08:29:10Z
dc.date.available	2018-08-10T08:29:10Z
dc.date.issued	2018
dc.description.abstract	In Sri Lanka, many fields use Sinhala script, including newspaper editing, writing, postal services, and application processing. In these fields, Sinhala text is often available only as scanned or printed copies, which must be entered manually into a computerized system at considerable cost in time and money. The proposed method consists of two parts: image pre-processing and training of the OCR classifier. In image pre-processing, the scanned images were enhanced and binarized using image processing techniques such as grayscale conversion and binarization by local thresholding. Distortions in the scanned images, such as watermarks and highlights, were removed through grayscale conversion with color intensity changes. In OCR training, the Tesseract OCR engine was used to create a Sinhala language data file, and that data file was then used with a customized function to detect Sinhala characters in scanned documents. The OCR engine was primarily used to create the language data file. First, the pre-processed images were segmented (white letters on a black background) using local adaptive thresholding, applying Otsu’s thresholding algorithm to separate the text from the background. Page layout analysis was then performed to identify non-text areas such as images and to split multi-column text into columns. Next, baselines and words were detected using blob analysis, in which the blobs were sorted with the x-coordinate (the left edge of the blob) as the sort key, which makes it possible to track the skew across the page. Once separated, each character was manually labeled as a Sinhala language character. Given the Sinhala language data file, the OCR function returns the recognized text, the recognition confidence, and the location of the text in the original image. By considering the recognition confidence of each word, it is possible to control the accuracy of the system.
The classifier was trained using 40 character sets, with 20 images per character, and tested on over 1000 characters (200 words) in varying font sizes, achieving approximately 97% accuracy. The elapsed time was less than 0.05 per line of more than 20 words, a substantial improvement over manual data entry. Since the classifier can be retrained using the testing images, it can be extended to support active learning.	en_US
dc.identifier.citation	Manisha, U.K.D.N. and Liyanage, S.R. (2018). Sinhala Character Recognition using Tesseract OCR. 3rd International Conference on Advances in Computing and Technology (ICACT ‒ 2018), Faculty of Computing and Technology, University of Kelaniya, Sri Lanka. p12.	en_US
dc.identifier.uri	http://repository.kln.ac.lk/handle/123456789/18983
dc.language.iso	en	en_US
dc.publisher	3rd International Conference on Advances in Computing and Technology (ICACT ‒ 2018), Faculty of Computing and Technology, University of Kelaniya, Sri Lanka.	en_US
dc.subject	Optical Character Recognition	en_US
dc.subject	Computer Vision	en_US
dc.subject	Image Processing	en_US
dc.subject	Image Segmentation	en_US
dc.title	Sinhala Character Recognition using Tesseract OCR	en_US
dc.type	Article	en_US
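
The abstract names three steps that are easy to illustrate in isolation: Otsu's thresholding, sorting blobs by their left edge, and filtering words by recognition confidence. The sketch below is not the authors' implementation. It is a minimal pure-Python illustration in which every function name, the (left, top, width, height) blob layout, and the 0 to 1 confidence scale are assumptions made for the example. It also applies Otsu's method globally to a flat list of gray levels, whereas the abstract describes a local adaptive variant.

```python
from typing import List, Sequence, Tuple

Blob = Tuple[int, int, int, int]   # hypothetical layout: (left, top, width, height)
Word = Tuple[str, float]           # hypothetical layout: (text, confidence in 0..1)


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, all other pixels the second class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of gray levels in the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break               # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_inverted(pixels: Sequence[int], t: int) -> List[int]:
    """Map dark (text) pixels to 1 and light background to 0,
    i.e. white letters on a black background."""
    return [1 if p <= t else 0 for p in pixels]


def sort_blobs_by_left_edge(blobs: Sequence[Blob]) -> List[Blob]:
    """Order blobs by x-coordinate of the left edge (the sort key in the abstract)."""
    return sorted(blobs, key=lambda blob: blob[0])


def filter_by_confidence(words: Sequence[Word], min_conf: float) -> List[Word]:
    """Keep only words whose recognition confidence is at least min_conf."""
    return [w for w in words if w[1] >= min_conf]


if __name__ == "__main__":
    gray = [10] * 50 + [200] * 50          # dark text pixels, light background
    t = otsu_threshold(gray)
    print("threshold:", t)
    print(sort_blobs_by_left_edge([(30, 0, 5, 5), (10, 0, 5, 5)]))
    print(filter_by_confidence([("a", 0.9), ("b", 0.4)], 0.5))
```

Raising `min_conf` trades coverage for accuracy, which is one way to read the abstract's remark that word-level confidence can be used to control the accuracy of the system.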

Collections

ICACT 2018