Title: SKIP TRIE MATCHING FOR REAL TIME OCR OUTPUT ERROR CORRECTION ON ANDROID SMARTPHONES

Year of Publication: Jul - 2013
Page Numbers: 1-11
Authors: Vladimir Kulyukin, Aditya Vanka
Conference Name: The Third International Conference on Digital Information and Communication Technology and its Applications (DICTAP2013)
- Czech Republic

Abstract:


Proactive nutrition management is considered by many dieticians as a key factor in reducing cancer, diabetes, and other illnesses caused by mismanaged diets. As more individuals manage their daily activities with smartphones, they start using their smartphones as diet management tools. Unfortunately, while there are many vision-based mobile applications to process barcodes, there is a relative dearth of visionbased applications for extracting useful nutrition information items such as nutrition facts, caloric contents, and ingredients. In this paper, we present a greedy algorithm, called Skip Trie Matching (STM), for real time OCR output error correction on smartphones. The STM algorithm uses a dictionary of strings stored in a trie data structure to correct OCR errors by skipping misrecognized characters. The number of skipped characters is referred to as the skip distance. The algorithm’s worst-case performance is O(n│∑│d log│∑│), where │∑│ is the constant size of the character alphabet over which the trie is constructed (e.g., 26 characters in the ISO basic Latin alphabet) and n is the length of the input string to be spellchecked. The algorithm’s performance is compared with Apache Lucene’s spell checker, a state of the art spell checker where spell checking can be done with the n-gram matching or the Levenshtein edit distance. The input data for comparison tests are text strings produced by the Tesserract OCR engine on text image segments of nutrition data automatically extracted by an Android 2.3.6 smartphone application from real-time video streams of U.S. grocery product packages. Preliminary evaluation results indicate that, while the STM algorithm is greedy in that it does not find all possible corrections of a misspelled word, it gives higher recalls than Lucene’s n-gram matching or Levenshtein edit distance. The average run time of the STM algorithm is also lower than Lucene’s.