ORC (Optimized Row Columnar) and Snappy compression can offer higher performance for Hadoop. But sometimes the Hive planner can’t keep up, causing your queries to run slowly and thus under utilizing your cluster. While this problem would be common to any type of compressed input, here’s something to consider when tuning for performance.
To set the stage, we were using a commercially available Hadoop platform at a client with Tez, YARN, ORC and Hive 0.13. The platform consisted of high end data nodes connected with a 10G network interconnect. We were using this Hadoop cluster with a data ingestion pipeline that starts with text files and ends with an ORC table having Snappy compression:
step 0: 80 G flat text delimited snmp sample data 210 byte rows x .5 Trillion rows
step 1: 2.9 G .tar.gz files
step 2: 7.8 G text files compressed via Snappy
step 3: 1.3 G ORC with Snappy compression
ORC typically provides a ~ 60 to 1 compression over text and while this represents a huge advance in the last 5 years, we noticed that queries over this highly compressed table would take forever and we were only using 2 reducers. Our 350+ container cluster was mostly idle while two reducers were busy. The Hive execution planner wasn’t keeping up with the compression techniques.
After researching this problem and experimenting, we noticed the Hive query planner still uses (#bytes input to mappers/hive.exec.reducers.bytes.per.reducer) to estimate the number of reducers needed to perform aggregates (think count *, count distinct). With a highly compressed 1.3GB input data set divided the default setting of:
hive.exec.reducers.bytes.per.reducer=1000000000 (1G) you get “2” (once you round up) as the number of reducers.
Given this formula, Hive only adds two reducers to the TEZ dag execution plan to process 80 GB of structured text from the ORC formatted table. Since Hive can only look at the raw HDFS file sizes and the data is uncompressed by the input formatter, Hive doesn’t have visibility into the uncompressed data set size. Thus, Hive ends up getting the input data set size wrong, by a factor of 60! Additionally, Hive currently generates a static execution plan so it can’t adjust once the data from the table passes through the InputFormatter and mappers.
Here’s the key lesson learned: To tune queries reading from a highly compressed table, you need to think about both the compressed and un-compressed size of data processed.
After just a couple trials, we are seeing much better parallelism with setting bytes per reducer to 50M:
Using ORC and SNAPPY are best practices to improve performance, however if your cluster is underutilized and you want a faster response time then look at ways of overriding Hive’s execution planner with
SET hive.exec.reducers.bytes.per.reducer=50000000; to get much better parallelism.