Title: SKIP TRIE MATCHING: A GREEDY ALGORITHM FOR REAL-TIME OCR ERROR CORRECTION ON SMARTPHONES

Issue Number: Vol. 3, No. 3
Year of Publication: 2013
Page Numbers: 261-270
Authors: Vladimir Kulyukin, Aditya Vanka, Haitao Wang
Journal Name: International Journal of Digital Information and Wireless Communications (IJDIWC)
- Hong Kong

Abstract:


Proactive nutrition management is considered by many nutritionists and dieticians as a key factor in reducing diabetes, cancer, and other illnesses caused by mismanaged diets. As more individuals manage their daily activities with smartphones, they start using their smartphones as diet management tools. Unfortunately, while there are many vision-based mobile applications to process barcodes, especially aligned ones, there is a relative dearth of vision-based applications for extracting useful nutrition information items such as nutrition facts, caloric contents, and ingredients. In this article, we present a greedy algorithm, called Skip Trie Matching (STM), for real time optical character recognition (OCR) output error correction on smartphones. The STM algorithm uses a dictionary of strings stored in a trie data structure to correct OCR errors by skipping misrecognized characters while driving down several paths in the trie. The number of skipped characters is referred to as the skip distance. The algorithm’s worst-case performance is , where n is the length of the input string to spellcheck. The algorithm’s performance is compared with Apache Lucene’s spell checker [1], a state of the art spell checker where spell checking can be done with the n-gram matching [2] or the Levenshtein edit distance (LED) [3]. The input data for comparison tests are text strings produced by the Tesserract OCR engine [4] on text image segments of nutrition data automatically extracted by an Android 2.3.6 smartphone application from real-time video streams of grocery product packages. The evaluation results indicate that, while the STM algorithm is greedy in that it does not find all possible corrections of a misspelled word, it gives higher recalls than Lucene’s n-gram matching or LED. The average run time of the STM algorithm is also lower than Lucene’s implementations of both algorithms.