Title: Performance Evaluation of Data Placement Structures for Big Data Warehouses

Year of Publication: Jul - 2016
Page Numbers: 15-21
Authors: Mohammad Rakibul Hasan, S. Kami Makki
Conference Name: The Third International Conference on Data Mining, Internet Computing, and Big Data (BigData2016), Turkey


Rapid data growth demands systems that provide a scalable infrastructure for the efficient distributed storage and processing of vast amounts of data. Hive is a MapReduce-based data warehousing system for data summarization and query analysis. It can organize millions of rows of data into tables, and its data placement structures play a significant role in the warehouse's performance. Hive also provides an SQL-like language called HiveQL, whose queries are compiled into MapReduce jobs that run on Hadoop. In this paper, we investigate the performance of Hive's data placement structures, RCFile and ORCFile. The experimental results show the effectiveness of RCFile and ORCFile as data placement structures in a MapReduce system.
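To make the notion of a data placement structure concrete, the following is a minimal sketch of the general idea behind row-columnar layouts such as RCFile: rows are first split horizontally into row groups, and each group is then stored column by column. It is an illustration only, not the paper's implementation or Hive's on-disk format; the names `to_row_groups`, `read_column`, and `group_size` are hypothetical, and real formats add compression, metadata, and indexes that are omitted here.

```python
from typing import Any, Dict, List, Sequence

Row = Sequence[Any]
RowGroup = Dict[str, List[Any]]


def to_row_groups(rows: List[Row], columns: List[str], group_size: int) -> List[RowGroup]:
    """Split rows into fixed-size row groups, then lay out each group by column.

    Hypothetical helper: a simplified, in-memory stand-in for a
    row-columnar placement structure.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    groups: List[RowGroup] = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within one row group, values of the same column are stored together.
        groups.append({name: [row[i] for row in chunk] for i, name in enumerate(columns)})
    return groups


def read_column(groups: List[RowGroup], name: str) -> List[Any]:
    """Read a single column by touching only that column in each row group."""
    values: List[Any] = []
    for group in groups:
        values.extend(group[name])
    return values


# Illustrative data only; not taken from the paper's experiments.
rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]
groups = to_row_groups(rows, ["id", "tag", "price"], group_size=2)
prices = read_column(groups, "price")
```

In this sketch, a query that needs only `price` never touches the `id` or `tag` values, which is the intuition for why column-oriented placement within row groups can reduce the data read by analytical queries.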