Hive queries taking too long - hive

I have a CDP environment running Hive. For some reason, some queries run quickly while others take more than 5 minutes, even a plain select current_timestamp or similar.
I can see that my cluster usage is quite low, so I don't understand why this is happening.
How can I make full use of my cluster? I have read some posts on the Cloudera website, but they have not helped much; after all the tuning, everything behaves the same.
Something to note is that the Hive logs contain the following message:
"Get Query Coordinator (AM) 350"
and yet the reported execution time for the query itself was quite low.
I am using Tez. Any idea what I should look at?

Besides taking care of the overall tuning: https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279
Please check my answer to this same issue here: Enable hive parallel processing.
That post explains what you need to do to enable parallel processing.
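If the "Get Query Coordinator (AM)" wait is where your time is going, the delay is often the time spent acquiring a Tez Application Master rather than running the query itself. A sketch of the HiveServer2 settings that keep Tez sessions warm (property names are from Hive's configuration; the values are illustrative, and these are normally set in hive-site.xml rather than per session):

```sql
-- Pre-launch Tez sessions at HiveServer2 startup so queries don't wait for an AM
SET hive.server2.tez.initialize.default.sessions=true;
-- Number of warm sessions to keep per queue (illustrative value)
SET hive.server2.tez.sessions.per.default.queue=4;
-- YARN queue(s) the pooled sessions run in
SET hive.server2.tez.default.queues=default;
-- Allow independent stages of a single query to run in parallel
SET hive.exec.parallel=true;
```

With warm sessions enabled, a trivial query like select current_timestamp no longer pays the AM startup cost on every run.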

Related

What is "cold start" in Hive and why doesn't Impala suffer from this?

I'm reading the literature on comparing Hive and Impala.
Several sources state some version of the following "cold start" line:
It is well known that MapReduce programs take some time before all nodes are running at full capacity. In Hive, every query suffers this “cold start” problem.
Reference
That quote alone is not enough for me to understand what is meant by "cold start". I'm looking for more information and clarity.
For context, I'm a data scientist. I create queries, and have only basic understanding of big data concepts.
I've referred to questions that explain why Impala is faster (example), but they don't explicitly address or define cold start.
With every Hive query, a MapReduce Job is executed which requires overhead and time for nodes within the MapReduce cluster to work on the task. This is known as "cold start". On the other hand, because Impala sits directly atop HDFS, it does not invoke a MapReduce job and avoids the overhead and time needed in a MapReduce job. Rather, Impala daemon processes are active at boot time and ready to process queries.
Takeaway: cold start refers to the overhead required in booting and executing a MapReduce job.
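The effect is easy to reproduce outside Hive. A small Python sketch (an analogy, not Hive itself): launching a fresh process per "query" stands in for a MapReduce job's startup overhead, while in-process work stands in for an always-running Impala daemon.

```python
import subprocess, sys, time

# "Cold start": every query pays process startup cost,
# like each Hive-on-MapReduce query launching a fresh job.
start = time.perf_counter()
for _ in range(3):
    subprocess.run([sys.executable, "-c", "print(2 + 2)"],
                   capture_output=True, check=True)
cold = time.perf_counter() - start

# "Warm daemon": the same work with no startup cost,
# like Impala daemons that are already running at boot.
start = time.perf_counter()
for _ in range(3):
    _ = 2 + 2
warm = time.perf_counter() - start

print(f"cold-start total: {cold:.3f}s, warm total: {warm:.6f}s")
```

The startup cost dwarfs the work itself, which is exactly why cold start dominates short, interactive queries.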

Do BigQuery queries run faster at night?

I just received this from Google Support, and it surprised me as I didn't know there was a congestion issue - do other people have this experience?
To fasten up your query, I would recommend that you try to run your query in other time like midnight
BigQuery is nocturnal, so it runs better in the dark. There are fewer predators around, so BigQuery can be free to express itself and cavort across the prairies near the Google Datacenters.
Other techniques to "enfasten" the queries involve running them from the ley lines of power, which are described in the Alchemical diaries of Hermes Trismegistus. Unfortunately, I am not permitted to share their location, and may be putting myself at risk of excommunication from a number of secret societies by just mentioning their existence.
Finally, if you name your tables with the suffix __Turbo, BigQuery will run them in turbo mode, which means they run on 486/66 processors instead of the default Z80 datapath.
Edited to add:
In a non-snarky answer, if you do not have reserved BigQuery capacity (i.e. fixed-price reservations), you may experience lower throughput at certain times. BigQuery has a shared pool of resources, so if lots of other customers are using it at the same time, there may not be enough resources to give everyone the resources that their queries would need to run at full speed.
That said, BigQuery uses a very large pool of resources, and we (currently) run at a utilization rate where every user gets nearly all of the resources they need nearly all of the time.
If you are seeing your queries slow down by 20% at certain times of the day, this might not be surprising. If you see queries take 2 or 3 times as long as they usually do, there is probably something else going on.

How do I diagnose a slow performing Postgres database on Heroku?

We have a rails app on Heroku with the Postgres add on.
In isolation, no single query performs out of the norm. However, in production some read queries perform very badly.
Currently I can only assume that it's the combination of concurrent queries that is causing the slowdown. But how can I prove this and diagnose it further?
New Relic just tells me that a query is slow, but I already know it's slow. My hypothesis is not that the queries themselves are too slow, but that the DB server is under too heavy a load. Are there any tools that would tell me that for sure?
Thanks!
New Relic tells you several details about slow queries, including query analysis and an explain plan.
However, that report is not available on all plans:
http://newrelic.com/pricing/details
Check your plan details. If your plan doesn't include it, you may want to upgrade for one month to identify the issue and take action, then downgrade again.
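You can also ask Postgres itself whether concurrent load, rather than any single query, is the problem. A sketch using the built-in activity views (on Heroku you can reach a psql prompt with heroku pg:psql; column names vary a little by Postgres version, e.g. total_time instead of total_exec_time in older releases):

```sql
-- How many backends are active or waiting right now?
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- Statements running longest at this moment
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC;

-- Cumulative per-query statistics (requires the pg_stat_statements extension)
SELECT calls, total_exec_time, mean_exec_time, query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

If pg_stat_activity regularly shows many concurrent active backends at the times your reads slow down, that supports the "server under load" hypothesis rather than "individual queries are slow".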

What are the advantages of setting "hive.exec.parallel" to false in Hive ?

I have learned that when hive.exec.parallel is set to true in Hive, i.e.
set hive.exec.parallel=true;
independent stages of a query can run in parallel.
Are there any advantages of setting this parameter to false?
I'll reiterate: obviously, whenever possible, you would like to run things in parallel and get more throughput. Why would someone set this parameter to false? Are there any disadvantages?
It's simply a parameter because, when it was introduced, it wasn't clear how stable it would be, so you needed to be able to turn it off. Once enough people had tried it and found it stable, the default switched to true:
https://issues.apache.org/jira/browse/HIVE-1033
There is no realistic disadvantage at this time.
In my experience, the only disadvantage is resource use. If you have limited resources available, it can be better overall to run queries serially. When queries run in parallel, one query can drive several jobs at the same time, which can starve the cluster of resources. If you don't need the speed and your cluster carries a heavy workload, it may be better overall to let things run serially.
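To make "independent tasks" concrete: a query like the following has two aggregation stages that do not depend on each other, so with the setting on Hive can launch them at the same time (table names are made up for illustration):

```sql
SET hive.exec.parallel=true;
-- Optional cap on how many stages run at once
SET hive.exec.parallel.thread.number=8;

-- The two subqueries below are independent stages;
-- with the setting on, Hive can run them concurrently.
SELECT * FROM (
  SELECT 'orders' AS src, count(*) AS n FROM orders
  UNION ALL
  SELECT 'payments' AS src, count(*) AS n FROM payments
) t;
```

The more independent stages a query has, the more jobs it can hold on the cluster at once, which is exactly the resource-use trade-off described above.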
Mayank, this property can also cause problems under certain conditions. Hive locks a database while queries are running against it. For example, if you have a complex query with multiple stages running against one database, the parallel property can increase your efficiency, but it will also hold a LOCK on the database, which may block other processes running against the same database for the duration of its execution.
I recently faced this issue and resolved it by setting this property to false.
I hope this answer helps you understand in which scenarios you have to set it to false.
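If you suspect this locking behaviour, you can inspect the locks Hive is holding (this requires concurrency/lock-manager support to be enabled in your Hive configuration; the table name below is hypothetical):

```sql
-- List locks in the current database
SHOW LOCKS;

-- Locks on one specific table, with holder details
SHOW LOCKS my_table EXTENDED;
```

Running this while the slow query is executing shows whether other sessions are blocked waiting on the same database or table.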

SQL Optimization Strategy

I'm a student working on my database assignment.
I want to use indexing and query optimization as my database optimization strategy.
The problem is: how can I prove that my strategy makes an improvement? My lecturer said that query optimization can be proved by calculation. Does anyone have more ideas? What should I calculate?
What about indexing? I need evidence to prove it works. How?
In terms of evidence of optimization, you need instrumented code for your test cases (e.g. so you can take timings accurately) and re-runnable test cases. Ideally, a re-runnable set of test cases can also reset the database to a baseline, so you can guarantee that the starting state of the data is the same for each test run.
For each test case, you also need to understand other, more subtle factors:
Are you running against a cold or a warm procedure cache?
Are you running against a cold or a warm data cache?
For larger datasets, are you using the exact same table, i.e. no page splits have occurred since?
I would think a before and after explain plan would go a long way towards proving an improvement.
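Both kinds of evidence, timings and a before/after explain plan, can come from a tiny, fully re-runnable harness. A sketch using SQLite so it is self-contained (the same idea applies to any DBMS; table, column, and index names are made up):

```python
import sqlite3, time

def timed(con, query):
    """Return the wall-clock time of one execution of the query."""
    start = time.perf_counter()
    con.execute(query).fetchall()
    return time.perf_counter() - start

def run_case(with_index):
    """Build a fresh baseline DB, optionally add the index, time the query."""
    con = sqlite3.connect(":memory:")          # fresh baseline every run
    con.execute("CREATE TABLE t (id INTEGER, v INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)",
                    ((i, i % 1000) for i in range(200_000)))
    if with_index:
        con.execute("CREATE INDEX idx_v ON t(v)")
    query = "SELECT count(*) FROM t WHERE v = 42"
    best = min(timed(con, query) for _ in range(5))   # best-of-5 to reduce noise
    plan = " ".join(row[-1] for row in
                    con.execute("EXPLAIN QUERY PLAN " + query))
    con.close()
    return best, plan

t_scan, plan_scan = run_case(False)
t_idx, plan_idx = run_case(True)
print(f"full scan: {t_scan:.5f}s  plan: {plan_scan}")
print(f"indexed:   {t_idx:.5f}s  plan: {plan_idx}")
```

The timing difference is the measurement, and the change in the plan (a table scan versus a search using idx_v) is the explanation of why the index helps; together they are exactly the before/after evidence an assignment write-up needs.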
See SQL Server Performance HERE.
Which DBMS are you using?
I suggest you take a look at what tracing options your DBMS product provides. For example, in Oracle you can use SQL Trace and parse the output with tkprof to produce the figures you'll need to prove that your database optimization strategy shows an improvement.
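As a sketch of the Oracle route (session-level tracing; the trace file's name and location depend on your instance settings):

```sql
-- Turn on SQL Trace for the current session
ALTER SESSION SET sql_trace = TRUE;
-- ... run the workload you want to measure ...
ALTER SESSION SET sql_trace = FALSE;
```

Afterwards, format the raw trace file with tkprof, e.g. tkprof mytrace.trc report.txt sort=exeela, to get per-statement parse, execute, and fetch timings you can compare before and after your optimization.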