Title: Measuring the Performance of Data Placement Structures for MapReduce-based Data Warehousing Systems

Issue Number: Vol. 8, No. 1
Year of Publication: March - 2018
Page Numbers: 11-20
Authors: S. Kami Makki, M. Rakibul Hasan
Journal Name: International Journal of New Computer Architectures and their Applications (IJNCAA)
- Hong Kong
DOI:  http://dx.doi.org/10.17781/P002371


The exponential growth of data requires systems that are able to provide a scalable and fault-tolerant infrastructure for storage and processing of vast amount of data efficiently. Hive is a MapReduce-based data warehouse for data aggregation and query analysis. This data warehousing system can arrange millions of rows of data into tables, and its data placement structures play a significant role for increasing the performance of this data warehouse. Hive also provides SQL-like language called HiveQL, which is able to compile MapReduce jobs into queries on Hadoop. In this paper, we measure the efficiency of these data placement structures (Record Columnar File (RCFile) and Optimize Record Columnar File (ORCFile)) in terms of data loading, storage and query processing using MapReduce framework. The experimental results showed the effectiveness of these data placement structures for Hive data warehousing systems.