I'm trying Pivotal HAWQ with Ambari, and now I'm trying to run some queries over Hive tables with HAWQ.
From what I have seen, HAWQ can query Hive tables through HCatalog (https://community.hortonworks.com/articles/43264/hawqhdb-and-hadoop-with-hive-and-hbase.html), so I use the psql tool on the command line to run queries like this:
SELECT * FROM hcatalog.hive-db-name.hive-table-name;
Previously I ran some queries on Hive to compare results with HAWQ. I was expecting HAWQ to be much faster, but it is turning out to be much slower; the query response takes much longer than in Hive.
Can someone explain why this is happening?
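For reference, this is the kind of plan check that can show how HAWQ is handling the HCatalog table (a minimal sketch; the database and table names are made up):
-- EXPLAIN in HAWQ's psql shows whether the Hive table is read through an external
-- scan rather than HAWQ's native storage, which may account for some of the difference.
EXPLAIN SELECT count(*) FROM hcatalog.default.sales_events;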
Related
I tried running SQL like the following:
select count(*) from test_table where columna='a' and columnb in ('test1', 'test2')
In Impala on Cloudera it takes around 2 minutes, but in Hive it takes 20 minutes. Is this normal? If yes, why does Impala run so much faster than Hive on Cloudera, and in which kinds of scenarios would Hive be faster than Impala?
Thanks.
I'm running Hive 1.0, trying to compute column statistics using the built-in analyze command. HQL script looks like:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
use db;
analyze table tbl compute statistics for columns;
This kicks off a map-only MR task as expected. The job runs to 100% for both map and reduce, then reports:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
But the job is registered as a SUCCESS.
Googling led me to this JIRA ticket, but the resolution says the problem is resolved in Hive 0.14. Is there something simple I'm missing in the query?
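For reference, this is roughly how I check whether any column stats actually get written (a sketch; db, tbl, and col1 are placeholders):
use db;
-- On this Hive version, a column-level describe should list the computed statistics
-- (min, max, null count, distinct count) if the analyze step actually persisted them.
describe formatted tbl col1;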
EDIT: Five and a half years later, I've changed jobs and industries twice, picked up Spark and then abandoned Hadoop altogether in all my workflows, and the world aligned around efficient cloud data lakes that don't require a new query language. Hive is a distant memory for me, but I hope the other answer seekers found sufficient workarounds. I don't think I ever did.
Can somebody help me with a Hive command to find the data nodes on which a particular Hive query was run?
For example: select * from mytable;
On which data nodes in my Hadoop cluster (which has only Hive) did it run?
A DataNode is only for storage; what you really want to know is which MR nodes ran the SQL.
Hive transforms the SQL into normal MR jobs, so you can find your SQL job in the JobTracker (MR1) or ResourceManager (YARN) web interface.
I am running 10 Hive scripts using an Oozie coordinator. It is getting stuck in one of the scripts at the reduce stage, at the same percentage, without any error. The scripts are simple insert statements, and when I tested them on the command line they worked just fine. How do I debug this?
It was a data skew issue: 80% of the data was mapped to a single key. Once we updated to Hive 0.10, the skew join optimization resolved the issue.
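For anyone hitting the same symptom before they can upgrade, the relevant settings look roughly like this (a sketch, not our exact configuration; the threshold needs tuning per dataset):
-- Treat join keys with very many rows as skewed and handle them in a follow-up job
-- instead of sending them all to a single reducer.
set hive.optimize.skewjoin=true;
-- Row count per key above which a join key is considered skewed.
set hive.skewjoin.key=100000;
-- For skew in GROUP BY aggregations rather than joins.
set hive.groupby.skewindata=true;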
My Hive query has multiple outer joins and takes very long to execute. I was wondering if it would make sense to break it into multiple smaller queries and use Pig to perform the transformations.
Is there a way I could query Hive tables or read Hive table data within a Pig script?
Thanks
The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try to work with the data as it is stored in HDFS.
Note that Howl has been renamed to HCatalog.
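With HCatalog in place, reading a Hive table from a Pig script looks roughly like this (a sketch; the database, table, and column names are made up, and older HCatalog releases use the org.apache.hcatalog.pig.HCatLoader class name instead):
-- Start Pig with HCatalog support, e.g.: pig -useHCatalog
A = LOAD 'mydb.mytable' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = FILTER A BY some_column IS NOT NULL;
DUMP B;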