Title: Comparison of Different Implementation of Inverted Indexes in Hadoop

Year of Publication: March - 2014
Page Numbers: 52-58
Authors: Hediyeh Baban, S. Kami Makki, and Stefan Andrei
Conference Name: The Second International Conference on E-Technologies and Business on the Web (EBW2014), Malaysia

Abstract:


A growing number of applications need to handle Big Data, as many corporations and organizations are required to collect more data from their operations. Processing Big Data with the MapReduce model has recently become very popular, because traditional data warehousing solutions are not feasible for such datasets. Hadoop provides an environment for executing MapReduce programs over distributed-memory clusters, which supports the processing of large datasets in a distributed computing environment. Information retrieval systems facilitate searching the content of books or journals based on metadata or indexing. An inverted index is a data structure that stores a mapping from content, such as words or numbers, to its locations in one or more documents. In this paper we propose three different implementations of inverted indexes (Indexer, IndexerCombiner, and IndexerMap) in the Hadoop environment using the MapReduce programming model, and compare their performance to evaluate the impact of factors such as data format and output file format in MapReduce.
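To make the inverted index definition concrete, the following is a minimal in-memory sketch in Python of how such an index can be built with map, shuffle, and reduce steps. It does not use Hadoop and is not one of the paper's three implementations (Indexer, IndexerCombiner, IndexerMap), whose details the abstract does not give. The function names, the tokenization rule, and the sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per token, as a mapper might."""
    # Assumed tokenization: lowercase runs of letters and digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        yield word, doc_id


def shuffle(pairs):
    """Group document ids by word, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for word, doc_id in pairs:
        grouped[word].append(doc_id)
    return grouped


def reduce_phase(word, doc_ids):
    """Collapse the ids seen for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run the three steps over a dict of {doc_id: text}."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample corpus, used only to exercise the sketch.
docs = {
    "doc1": "Hadoop runs MapReduce jobs",
    "doc2": "MapReduce builds an inverted index",
    "doc3": "An index maps words to documents",
}
index = build_inverted_index(docs)
```

Here `index["mapreduce"]` lists the documents that contain the word, which is the content-to-location mapping described in the abstract. In an actual Hadoop job the shuffle step is performed by the framework rather than by user code.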