Business users use Hive because it presents itself as SQL, a language they already know and are able to use. Developers use Hive because it is easy to use, less complex, and faster to develop in than writing Java. The ease of learning and using Hive can lull users into a false sense of security, and believing that you can do everything you need to do in Hive can be dangerous. On every Hadoop implementation I have worked on, I have encountered a problem that I decided to code in Hive because it was easier; however, when I went to run it, I found that the Hive query ran for too long and would not meet the SLA.
Hive is by definition an abstraction layer built for general-purpose use. This is great in that it can do many things and do them pretty well; however, generalizing the code also means that there are some problems where it will inevitably perform very poorly. One example is skew joins, where the way Hive executes the join sends the vast majority of the data to very few reducers. Those reducers run slowly, and you lose many of the performance gains Hadoop can give you through parallelism.
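To make the skew problem concrete, here is a minimal Python sketch that counts how many records land on each reducer when join keys are assigned by partitioning. It is an illustration only, not Hive's actual code: the function name `reducer_loads`, the integer keys, and the simple modulo standing in for a hash partitioner are all assumptions made for the example.

```python
from collections import Counter


def reducer_loads(keys, num_reducers):
    """Count the records each reducer receives.

    Assumption: keys are non-negative integers and `key % num_reducers`
    stands in for a hash partitioner. Every record with the same key
    goes to the same reducer, which is what makes skew possible.
    """
    counts = Counter(key % num_reducers for key in keys)
    return [counts[r] for r in range(num_reducers)]


# Evenly distributed keys: each of the 4 reducers gets the same share.
uniform_keys = list(range(100))
uniform_loads = reducer_loads(uniform_keys, 4)

# Skewed keys: 90 of the 100 records share the single join key 7,
# so one reducer ends up with almost all of the work.
skewed_keys = [7] * 90 + list(range(10))
skewed_loads = reducer_loads(skewed_keys, 4)

print(uniform_loads)
print(skewed_loads)
```

In the skewed case the job can only finish when the one overloaded reducer finishes, no matter how many other reducers sit idle.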
Recently, I developed a Hive query to process and load some data. This query needed to run every day, processing several million records from the previous day of data and comparing the records to billions of records already stored in a Hive table to identify new records that should be loaded. This join is very easy to write in Hive, and significantly more difficult to write in MapReduce or Pentaho. Since it was going to be simple I decided to develop this join in Hive.
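The logic of that comparison is an anti-join: keep only the daily records whose key does not already exist in the stored table. The Python sketch below shows the idea at toy scale. The record layout, the `id` field, and the function name `find_new_records` are hypothetical, and holding every existing key in memory would clearly not work for billions of records, so treat this as a description of the logic rather than of how the query or the later rewrite was built.

```python
def find_new_records(daily_records, existing_keys):
    """Return the daily records whose key is not already stored.

    Assumption: each record is a dict with a hypothetical "id" field
    that serves as the join key.
    """
    seen = set(existing_keys)  # set membership checks are O(1) on average
    return [rec for rec in daily_records if rec["id"] not in seen]


# Toy data: ids 2 and 5 are already in the stored table.
daily = [{"id": 1}, {"id": 2}, {"id": 3}]
existing = [2, 5]

new_records = find_new_records(daily, existing)
print(new_records)
```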
After about 30 minutes of development on the Hive query, I ran it. The next morning, after 16 hours, the query was still running. I needed the entire daily process to finish in eight hours, which meant this particular query could take no longer than two hours. I looked at these results and decided that if I rewrote my query in a more efficient manner, it might run faster. The next day, the rewritten query was still running. I banged my head on this for a week.
Finally, I realized that I was never going to be able to get this to perform in Hive and I decided to use Pentaho’s Visual MapReduce capabilities. This was significantly more complex to code, but still only took me 6 hours to develop and debug. Using Pentaho, I was able to build and optimize my code specifically for this use case. Rewriting the Hive query using Pentaho’s Visual MapReduce returned results in 17 minutes.
I was able to rewrite and run my Hive query in Pentaho’s Visual MapReduce in less time than it took to run the Hive query. When I decided to let the Hive query run to completion, I found it took 76 hours. Total development and running time in Pentaho was 6 hours and 17 minutes.
This is not the only time I have seen a drastic improvement in performance from rewriting a Hive query using MapReduce. In the past nine months I have rewritten four Hive queries in MapReduce and seen performance improvements of more than 5x.
Hive is still a very good tool most of the time, but it is important to remember that Hive can sometimes be a bad choice. Building production processes with Pentaho’s Visual MapReduce or Java MapReduce, optimized specifically for the use case, can result in major performance improvements over general-purpose MapReduce abstraction layers such as Hive. Do not let Hive's ease of use lull you into believing that it is always the best choice.