Title: AUTOMATIC DOCUMENT STRUCTURE ANALYSIS OF STRUCTURED PDF FILES

Issue Number: Vol. 1, No. 2
Year of Publication: Aug - 2011
Page Numbers: 404-411
Authors: Rosmayati Mohemad, Abdul Razak Hamdan, Zulaiha Ali Othman, Noor Maizura Mohamad Noor
Journal Name: International Journal of New Computer Architectures and their Applications (IJNCAA)
- Hong Kong

Abstract:


Portable Document Format (PDF) is the most convenient way to publish information because it is independent of the operating system. However, the information in a PDF document is unstructured and is suitable only for human readers. In addition, PDF has a non-tagged internal structure, which makes the extraction task difficult. Automatically analyzing and recognizing the detailed structure of PDF documents, especially paragraph and tabular areas, is vital for precisely extracting relevant information for use in other domain applications. The motivation of this study is to support knowledge extraction and to exploit the actual semantics of the extracted content to improve further analysis. This paper proposes an intelligent approach that automatically identifies and recognizes the layout and structure of PDF documents together with their text, and then structures the extracted information into an ontology-based representation. An experimental study was conducted on a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall and F-measure scores show significant results in detecting tabular and paragraph structures.
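For readers unfamiliar with the evaluation measures named in the abstract, the following is a minimal sketch of how precision, recall and F-measure are conventionally computed from detection counts. The function name and the example counts are hypothetical illustrations; they are not taken from the paper, and the paper's own evaluation procedure may differ in detail.

```python
def precision_recall_f_measure(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-measure) from raw detection counts.

    Standard definitions, assumed here rather than quoted from the paper:
      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
      F-measure = 2 * precision * recall / (precision + recall)
    """
    detected = true_positives + false_positives   # everything the system reported
    actual = true_positives + false_negatives     # everything that really exists
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 8 tables detected correctly, 2 spurious detections,
# 2 real tables missed.
precision, recall, f_measure = precision_recall_f_measure(8, 2, 2)
print(f"precision={precision:.2f} recall={recall:.2f} F-measure={f_measure:.2f}")
```

The zero checks only guard against division by zero when nothing is detected or nothing is present; in that case the sketch returns 0.0 by convention.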