Why is Hive much slower than Impala in Cloudera? - hive

I tried to run SQL like the following:
select count(*) from test_table where columna='a' and columnb in ('test1', 'test2')
With Impala in Cloudera it takes around 2 minutes, but with Hive it takes 20 minutes. Is this normal? If so, why does Impala run so much faster than Hive in Cloudera, and in what kind of scenario would Hive be faster than Impala?
Thanks.

Related

Hive to Spectrum - Query Migration

I have hundreds of Hive queries (HQL, using Hive functions such as date_sub, lead, and lag) that I need to convert to Redshift Spectrum. Is there a tool that helps with this?

Hawq Queries Very Slow

I'm trying Pivotal HAWQ with Ambari, and I'm now trying to run some queries over Hive tables with HAWQ.
From what I have seen, HAWQ can query Hive tables through HCatalog (https://community.hortonworks.com/articles/43264/hawqhdb-and-hadoop-with-hive-and-hbase.html), so I use the psql tool on the command line to run queries like this:
SELECT * FROM hcatalog.hive-db-name.hive-table-name;
Previously I ran some queries on Hive to compare the results with HAWQ. I was expecting HAWQ to be much faster, but it is much slower: the query response time is much longer than in Hive.
Can someone explain why this is happening?

How to get the record count of an HBase table? Which is the fastest way to query the records?

I have 100 million records in an HBase table, and I have created a Hive external table over it.
What is the fastest way to query the records?
Hive ---> Select count(*) from table.
The query has been running for more than 8 hours.
Please guide me.
I think a better way here would be to use HBase's built-in RowCounter operation, which internally runs a MapReduce job to count the number of rows.
Syntax:
hbase org.apache.hadoop.hbase.mapreduce.RowCounter mytable
Hive supports the COUNT() query directly:
SELECT COUNT(*) FROM table
But it will get slower as the number of records increases, because Hive uses MapReduce jobs. If you want really fast queries, I would recommend using Apache Phoenix or the ORM tool Kundera.

Hive command for finding the data nodes on which a query was run

Can somebody help me with a Hive command to find the data nodes on which a particular Hive query was run?
For example: Select * from mytable;
Which data nodes did it run on, in my Hadoop cluster that has only Hive installed?
A DataNode is only for storage. What you really want to know is which MapReduce node is running the SQL.
Hive transforms the SQL into normal MapReduce jobs, so you can find your SQL job in the JobTracker (MR1) or ResourceManager (YARN) web interface.
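On YARN, the same lookup can also be scripted against the ResourceManager REST endpoint /ws/v1/cluster/apps instead of the web page. The following is only a minimal Python sketch: the address rm.example.com:8088, the sample response, and the idea that the job name contains part of the query text are assumptions for illustration. Note that amHostHttpAddress only identifies the node hosting the ApplicationMaster, not every node that ran a task.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_url(rm_address, states=("RUNNING",)):
    """Build the ResourceManager URL that lists applications in the given states."""
    query = urlencode({"states": ",".join(states)})
    return f"http://{rm_address}/ws/v1/cluster/apps?{query}"


def fetch_apps(url):
    """Fetch and decode the JSON listing (needs a reachable ResourceManager)."""
    with urlopen(url) as response:
        return json.load(response)


def matching_apps(payload, name_fragment):
    """Return (id, state, AM host) for apps whose name contains name_fragment."""
    # "apps" is null in the response when no application matches the filter.
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        (app["id"], app["state"], app.get("amHostHttpAddress"))
        for app in apps
        if name_fragment.lower() in app.get("name", "").lower()
    ]


# Hypothetical response body, shaped like the JSON the endpoint returns.
sample = {
    "apps": {
        "app": [
            {
                "id": "application_1400000000000_0001",
                "name": "Select * from mytable(Stage-1)",
                "state": "RUNNING",
                "amHostHttpAddress": "worker3.example.com:8042",
            },
            {
                "id": "application_1400000000000_0002",
                "name": "distcp",
                "state": "RUNNING",
                "amHostHttpAddress": "worker1.example.com:8042",
            },
        ]
    }
}

for row in matching_apps(sample, "mytable"):
    print(row)
```

Against a real cluster, replace sample with fetch_apps(apps_url("<your ResourceManager host:port>")).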

Factors that limit Presto's speed?

I installed Presto today on our server at work (version 0.57), and running select count(*) from log; takes more than 17 minutes for a table with only 640 million records (~64GB).
I am under the impression that this is way too slow for Presto, but I am not sure.
Some info:
Hive and Presto have both been installed with default configurations from their documentation.
The Hive table is an external table with about 24 columns, most of them String and 3 of them Array, and the file is stored as Textfile (Hive complains about RCFile with my file for some reason).
The table will be mostly used for grouping and count operations.
Do you have any tips for increasing performance, or an idea of what the target query time should be for a simple count(*) on a table?
Cheers
You should solve your problem with RCFile. Using RCFile will increase performance significantly (by 2x to 4x according to the developers, which matches my experience). Try converting the table with CREATE TABLE <new rcfile table name> AS SELECT * FROM <old textfile table name>; in Presto. (Be sure to have enough disk space.)
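For a rough sense of scale, the figures in the question (about 64 GB and 640 million rows in a little over 17 minutes) can be turned into an effective scan rate. This is only back-of-the-envelope arithmetic in Python, assuming 1 GB = 1024 MB and treating 17 minutes as the full run time, so the real rate is slightly lower.

```python
def scan_rate(size_gb, rows, minutes):
    """Return (MB per second, rows per second) for a full-table scan."""
    seconds = minutes * 60
    mb_per_s = size_gb * 1024 / seconds
    rows_per_s = rows / seconds
    return mb_per_s, rows_per_s


# Figures taken from the question: ~64 GB, 640 million rows, 17 minutes.
mb_per_s, rows_per_s = scan_rate(size_gb=64, rows=640_000_000, minutes=17)
print(f"{mb_per_s:.1f} MB/s, {rows_per_s:,.0f} rows/s")
# prints: 64.3 MB/s, 627,451 rows/s
```

A rate in this range is something a single disk could plausibly sustain on its own, so besides the file format it may be worth checking how many workers and splits actually take part in the scan.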