How best to implement a Dashboard from data in HDFS/Hadoop

Question

We have a bunch of data (several TB) in Hadoop HDFS and it's growing. We want to create a dashboard that reports on its contents, e.g. counts of different types of objects, trends over time, etc.

Our first thought was to use something like Tableau or d3.js on top of Hive queries. But Hive is just too darn slow for these "precanned" queries.

Now we're thinking of using Hive to extract data regularly from HDFS and store the output in a "more real time" database e.g. HBase or an RDBMS (e.g. MySQL).

That'll work, but I'm worried we're missing a simpler / easier solution (if there is one) that requires less ETL / extracts and fewer duplicate data storage mechanisms (HDFS + something else).

Jeff Hammerbacher · Answer 1 · 2013-10-10T16:58:04.960

I'd recommend keeping the data in HDFS and converting it to the Parquet file format. Parquet uses a concise, columnar representation of nested data and will reduce the I/O required for many of your queries.
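As a rough sketch of that conversion step, the helper below builds a CREATE TABLE ... AS SELECT statement that copies an existing table into a Parquet-backed one. The function name and the table names `events_text` and `events_parquet` are hypothetical, and the exact keyword depends on your version (older Impala releases spell it `PARQUETFILE`), so treat this as an illustration rather than the required procedure.

```python
import re

# Accept only plain identifiers, since table names are interpolated into SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def parquet_ctas(source_table: str, target_table: str) -> str:
    """Build a statement that copies source_table into a new Parquet table."""
    for name in (source_table, target_table):
        if not _IDENT.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    return (
        f"CREATE TABLE {target_table} STORED AS PARQUET "
        f"AS SELECT * FROM {source_table}"
    )


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    print(parquet_ctas("events_text", "events_parquet"))
```

You would run the resulting statement once per table (or per new batch of data) from whichever SQL shell or client you already use.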

Once your data is in the Parquet format, I'd use Impala to issue SQL queries against the data. Impala implements a highly efficient execution engine for SQL queries over data stored in HDFS. Impala queries will return results to your dashboard with low latency. Unlike Hive, the Impala execution engine doesn't rely on Hadoop's MapReduce implementation.
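To make the dashboard side concrete, here is a minimal sketch of the kind of "counts by type over time" query a dashboard backend might issue. It takes any Python DB-API 2.0 cursor; against Impala you would presumably get one from a DB-API client such as impyla, which is an assumption on my part and not shown here. The table `events` and its columns `event_date` and `object_type` are hypothetical, and the demo uses an in-memory SQLite database purely as a stand-in so the example runs without a cluster.

```python
import re
import sqlite3


def counts_by_day_and_type(cursor, table):
    """Return (event_date, object_type, count) rows via a DB-API cursor."""
    # The table name is interpolated, so accept only a plain identifier.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"not a plain identifier: {table!r}")
    cursor.execute(
        f"SELECT event_date, object_type, COUNT(*) AS n FROM {table} "
        "GROUP BY event_date, object_type "
        "ORDER BY event_date, object_type"
    )
    return cursor.fetchall()


def demo():
    """Run the query against a tiny SQLite stand-in for the real table."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE events (event_date TEXT, object_type TEXT)")
    cur.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("2013-10-01", "image"),
            ("2013-10-01", "image"),
            ("2013-10-01", "video"),
            ("2013-10-02", "image"),
        ],
    )
    rows = counts_by_day_and_type(cur, "events")
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The rows come back already aggregated, so the dashboard layer (D3, Tableau, etc.) only has to plot them.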

If you have text data that you'd like to view on the dashboard, I'd recommend Cloudera Search for indexing it. Cloudera Search is a version of SolrCloud that stores and serves partitioned Lucene indices out of HDFS.

It's quite trivial to install Impala and Search with Cloudera Manager. Cloudera Manager is a free software tool that provides an in-browser GUI for installing and managing Cloudera and related third-party software. If you install and manage your cluster with Cloudera Manager, you don't have to worry about tuning your configuration or ensuring cross-version compatibility between HDFS, Parquet, and Impala.

To try out your new cluster, you may want to use Cloudera Manager to install Hue as well. Hue provides a web-based GUI for end users of Cloudera and related third-party software. From Hue you can explore the data in HDFS and issue SQL or keyword search queries over your data.

For an example of an interactive dashboard built with D3 that uses Cloudera Impala and Search on the backend, check out Zoomdata. This video is a wonderful demonstration of the interactive capabilities of Impala and Search.

If you'd like to use Tableau, Cloudera makes a connector for Tableau available that works with Impala.

Note that the already exceptional performance of Impala for small data sets will be aided by the upcoming in-memory cache that's being added to HDFS with our next release.
