What is Impala?
Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.
With Impala, analysts and data scientists can perform real-time, “speed of thought” analytics on data stored in Hadoop via SQL or through Business Intelligence (BI) tools. The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata, removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.
How does Impala provide faster query responses than Hive for the same data on HDFS?
You should see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".
In other words, Impala doesn't use MapReduce at all. It simply has its own daemons running on all your nodes, which read the data in HDFS directly (and cache metadata about it), so that these daemons can return results quickly without having to go through a whole MapReduce job.
The reason for this is that there is a certain fixed overhead involved in running a MapReduce job, so by short-circuiting MapReduce altogether you can cut runtime considerably, especially for short queries.
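To make the overhead argument concrete, here is a minimal back-of-the-envelope sketch. It assumes a query's runtime is just a fixed startup cost plus the time spent actually scanning data. The function names and every number below are hypothetical placeholders chosen for illustration, not measurements of Hive or Impala.

```python
def total_time(startup_s, work_s):
    """Runtime modeled as a fixed startup cost plus the actual scan work."""
    return startup_s + work_s


def speedup(batch_startup_s, daemon_startup_s, work_s):
    """How many times faster the low-overhead engine is under this toy model."""
    return total_time(batch_startup_s, work_s) / total_time(daemon_startup_s, work_s)


# Hypothetical startup costs: 20 s to launch a batch job, 0.5 s for a
# query handed to an already-running daemon.
BATCH_STARTUP_S = 20.0
DAEMON_STARTUP_S = 0.5

# Compare a short ad-hoc query (2 s of work) with a long scan (600 s of work).
for work_s in (2.0, 600.0):
    ratio = speedup(BATCH_STARTUP_S, DAEMON_STARTUP_S, work_s)
    print(f"work={work_s:>6.1f}s  speedup={ratio:.2f}x")
```

Under these assumed numbers the short query benefits a lot while the long scan barely changes, which matches the intuition that removing a fixed cost matters most for small interactive queries.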
That being said, Impala does not replace Hive; it is good for very different use cases. Unlike Hive, Impala doesn't provide fault tolerance for a running query, so if there is a problem during your query, the query is gone and has to be rerun. For ETL-type jobs, where the failure of one job would be costly, I would definitely recommend Hive, but Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to take a look at some data and analyze it without building robust jobs. Also, from my personal experience, Impala is still not very mature, and I've seen occasional crashes when the amount of data is larger than available memory.
Impala Use Cases
What are good use cases for Impala as opposed to Hive or MapReduce?
Impala is well-suited to executing SQL queries for interactive exploratory analytics on large data sets. Hive and MapReduce are appropriate for very long-running, batch-oriented tasks such as ETL.
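As an illustration of what such an exploratory query can look like from a script, here is a minimal sketch written against Python's generic DB-API 2.0 interface. The helper `top_groups`, the `events` table, and its columns are hypothetical. An in-memory SQLite database stands in for the cluster so that the example runs anywhere; with Impala, the connection would instead come from a DB-API-compatible client, which is an assumption and not shown here.

```python
import sqlite3


def top_groups(conn, table, group_col, limit=3):
    """Run a small exploratory aggregation over any DB-API 2.0 connection."""
    # Identifiers are interpolated directly into the SQL text, so only
    # pass trusted table and column names to this helper.
    sql = (
        f"SELECT {group_col}, COUNT(*) AS n "
        f"FROM {table} "
        f"GROUP BY {group_col} "
        f"ORDER BY n DESC, {group_col} "
        f"LIMIT {int(limit)}"
    )
    cur = conn.cursor()
    cur.execute(sql)
    rows = cur.fetchall()
    cur.close()
    return rows


# Stand-in data (hypothetical sensor readings) in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (sensor TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0), ("a", 5.0), ("b", 6.0)],
)

# Which sensors produced the most rows?
print(top_groups(conn, "events", "sensor"))
```

The query itself is the kind of quick "take a look" aggregation that fits an interactive engine; a multi-hour ETL pipeline would be a poor fit for this style of one-off call.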
Is MapReduce required for Impala? Will Impala continue to work as expected if MapReduce is stopped?
Impala does not use MapReduce at all.
Can Impala be used for complex event processing?
For example, in an industrial environment, many agents may generate large amounts of data. Can Impala be used to analyze this data, checking for notable changes in the environment?
Complex Event Processing (CEP) is usually performed by dedicated stream-processing systems. Impala is not a stream-processing system, as it most closely resembles a relational database.
Is Impala intended to handle real-time queries in low-latency applications, or is it for ad-hoc queries for the purpose of data exploration?
Ad-hoc queries are the primary use case for Impala. We anticipate it being used in many other situations where low latency is required. Whether Impala is appropriate for any particular use case depends on the workload, data size and query volume.