Hive queries take a long time

My Hive queries take a long time. What can I do to mitigate that?
What are some ways to optimize queries?
I already check the partitions on the table and make sure I'm filtering the results on the partition column, such as date.
Here's my query
CREATE VIEW dwo_analysis.spark_experiment_unbounce_control_lookup_b AS
SELECT
    a.visid,
    a.date_time
FROM dwo_analysis.spark_experiment_unbounce_control_list a
LEFT JOIN dwo_analysis.spark_experiment_unbounce_aq_list aq ON aq.visid = a.visid
LEFT JOIN dwo_analysis.spark_experiment_unbounce_ap_list ap ON ap.visid = a.visid
WHERE ap.visid IS NULL
  AND aq.visid IS NULL;

Related

HIVE join taking too long, but is fast on Impala

I have a query like the one below. It runs in 15 seconds on Impala, but the same query takes 10+ minutes on Hive. I have to join several other tables to this query (with joins similar to the one below) and the total time is more than an hour (sometimes it fails or gets stuck after an hour), while on Impala the whole thing runs within a minute.
Can you please tell me why this might be happening and how I might optimize the join below on Hive?
SELECT count(*)
FROM table_A A
LEFT JOIN table_B B ON cast(A.value AS decimal(5, 2)) BETWEEN B.fromvalue AND B.tovalue
AND A.date BETWEEN B.fromdate AND B.todate ;
Check the query plan and try to configure a map join.
Theta joins (non-equality joins) like yours are implemented in Hive as a cross join plus a filter. With a map join it will run much faster.
See here for how to configure a map join: https://stackoverflow.com/a/49154414/2700344.
Check the query plan again and make sure the MapJoinOperator is used.
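The linked answer boils down to a few session settings; a minimal sketch (the size thresholds are illustrative and must fit your small table and available memory):

```sql
-- Let Hive automatically convert joins with a small enough table into map joins
set hive.auto.convert.join=true;
-- Size threshold (bytes) below which a table counts as "small"; 25 MB is illustrative
set hive.mapjoin.smalltable.filesize=25000000;
-- Skip the conditional task and convert eagerly, up to the given total size
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=25000000;
```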
Even with a map join, Hive is slower than Impala, though it can process bigger datasets.
Hive runs basic MapReduce tasks, which are slow by nature.
Also, joins work best as equi-joins (colA = colB), and you are not doing any equi-joins (BETWEEN is colA >= colB AND colA <= colC).
One of the main features of Impala is that it is really fast at reading data.
So basically, yes, Hive is slow compared to Impala, and there is not much you can do about it. That's just how it is.
Also, you are doing a count over a LEFT JOIN, which means that, if there are no duplicate matches, the output is simply the number of rows in A. So maybe you do not need the join at all...
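Put differently, if each row of table_A can match at most one row of table_B, the join cannot change the row count (a LEFT JOIN never removes rows from A, it can only duplicate them when several B rows match), so a sketch of the simplification would be:

```sql
-- Same result as the LEFT JOIN version, provided no A row matches multiple B rows
SELECT count(*)
FROM table_A;
```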

Spool output multiple tables involved in query

I am trying to find efficient ways to pull a large amount of data from a database and put it on a cloud platform for analysis (for technical reasons there's no way of doing this in an automated way). For the big tables, I'd like to extract a month's worth of data at a time as CSVs; however, one huge table doesn't have dates, and its IDs have a prefix, so I can't simply select a range of IDs. So I think I'd have to join onto another table. Something like this:
select *
from big_table
inner join (
    select *
    from table2
    where date between to_date('2020-04-01', 'yyyy-mm-dd')
                   and to_date('2020-05-01', 'yyyy-mm-dd')
) query_result
on big_table.id = query_result.id;
Thing is, I want to be able to spool to CSV the results of both this query and the inner query into separate files. The inner query can take some time to run (approx 8 minutes), so ideally I'd want to run the whole query above and export to two locations, rather than run the above query and the inner query as separate tasks (thereby duplicating the work).
Is this possible?
I don't think you can spool to two locations at the same time. But you could create a temporary table and use that:
create temp table temp_table2 as
select t2.*
from table2 t2
where date >= date '2020-04-01' and
date < date '2020-05-01';
Then, you might want to create an index on it:
create index idx_temp_table2_id on temp_table2(id);
Then:
select bt.*
from big_table bt join
temp_table2 t2
on bt.id = t2.id;
You would still need to spool them separately.
Note: A simple index on table2(date) might be sufficient to speed both queries.
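If this is SQL*Plus (the to_date/spool wording suggests Oracle), you could then spool the two extracts back to back; a sketch, assuming SQL*Plus 12.2+ for SET MARKUP CSV and with hypothetical file names:

```sql
SET MARKUP CSV ON
-- First file: the filtered month of table2
SPOOL table2_2020_04.csv
SELECT * FROM temp_table2;
SPOOL OFF

-- Second file: the matching big_table rows, reusing the same temp table
SPOOL big_table_2020_04.csv
SELECT bt.*
FROM big_table bt
JOIN temp_table2 t2 ON bt.id = t2.id;
SPOOL OFF
```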

Tuning Oracle Query for slow select

I'm working on an Oracle query that does a select on a huge table, but the joins with other tables seem to cost a lot of processing time.
I'm looking for tips on how to improve this query.
I'm attaching a version of the query and its explain plan.
Query
SELECT
l.gl_date,
l.REST_OF_TABLES,
(
SELECT
MAX(tt.task_id)
FROM
bbb.jeg_pa_tasks tt
WHERE
l.project_id = tt.project_id
AND l.task_number = tt.task_number
) task_id
FROM
aaa.jeg_labor_history l,
bbb.jeg_pa_projects_all p
WHERE
p.org_id = 2165
AND l.project_id = p.project_id
AND p.project_status_code = '1000'
Something to mention:
This query pulls data from Oracle to send to a SQL Server database, so it needs to be this big; I can't narrow the scope of the query.
The purpose is to set it up as a SQL Server job with SSIS so it runs periodically.
One obvious suggestion is not to use a subquery in the SELECT clause.
Instead, you can try joining the tables:
SELECT
l.gl_date,
l.REST_OF_TABLES,
t.task_id
FROM
aaa.jeg_labor_history l
Join bbb.jeg_pa_projects_all p
On (l.project_id = p.project_id)
Left join (SELECT
tt.project_id,
tt.task_number,
MAX(tt.task_id) task_id
FROM
bbb.jeg_pa_tasks tt
Group by tt.project_id, tt.task_number) t
On (l.project_id = t.project_id
AND l.task_number = t.task_number)
WHERE
p.org_id = 2165
AND p.project_status_code = '1000';
Cheers!!
I don't know exactly how many rows this query returns or how many rows the table/view has, so I can only offer a few simple tips which might help query performance:
Check Indexes. There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement.
Limit the size of your working data set.
Only select columns you need.
Remove unnecessary tables.
Remove calculated columns in JOIN and WHERE clauses.
Use inner join, instead of outer join if possible.
Your view contains a lot of data, so you can also break it down and pull only the information you need from it.
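As a sketch of the indexing tip, composite indexes covering the WHERE and JOIN columns of the query above might look like this (the index names are hypothetical; verify against the explain plan before adding them):

```sql
-- Covers the org_id/project_status_code filter and the project_id join
CREATE INDEX jeg_pa_projects_n1
    ON bbb.jeg_pa_projects_all (org_id, project_status_code, project_id);

-- Lets MAX(task_id) per (project_id, task_number) be answered from the index alone
CREATE INDEX jeg_pa_tasks_n1
    ON bbb.jeg_pa_tasks (project_id, task_number, task_id);
```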

Simple optimization in a SQL join

Take the following simple SQL. It pulls some set of fields from two tables: a "Jobs" table, and a supporting table that we join off of the jobs table. Note that it's a left join because in this case the supporting-table data is not required to be there.
select [fields]
from jobs j
left join supporting_data sd on sd.id = j.supporting_data_id
Would the query perform any different when written as follows:
select [fields]
from jobs j
left join supporting_data sd
on (j.supporting_data_id > 0) and (sd.id = j.supporting_data_id)
The difference being that if the main table has a value of "-1" (which I commonly see in databases to indicate "no value"), boolean short-circuit evaluation should kick in and stop the query from checking the supporting_data table at all for that record.
Of course there should always be an index on the field. But if I had a record with jobs.supporting_data_id = -1, wouldn't this cause the database engine to scan the index for that record? Maybe negligible... Just wondering if there is any difference internally.
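One caveat worth spelling out: with a LEFT JOIN, an extra predicate in the ON clause is not the same as one in the WHERE clause. A sketch of the difference (the selected column names are illustrative):

```sql
-- Filter in ON: jobs rows with supporting_data_id = -1 survive with NULL sd columns,
-- and the lookup into supporting_data can be skipped for them.
select j.id, sd.name
from jobs j
left join supporting_data sd
  on (j.supporting_data_id > 0) and (sd.id = j.supporting_data_id);

-- Filter in WHERE: those same rows are removed entirely, which silently turns
-- the left join into an inner join for the -1 records.
select j.id, sd.name
from jobs j
left join supporting_data sd on sd.id = j.supporting_data_id
where j.supporting_data_id > 0;
```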

Huge Performance Cost to using SQL Server ORDER BY clause?

What is causing a query to take longer if we have an ORDER BY clause at the end?
If I run the query without the ORDER BY it takes a split second, but throw the ORDER BY on and it takes MINUTES!
Is there a known reason for this?
SELECT top 100 a.UniqueID
,a.SomeID
,a.ContentID
,SortOrder
,b.ValueOfMine
INTO #ContentHistory
FROM widgetHistory.dbo.CustomerProductContent a WITH (NOLOCK)
LEFT JOIN widgetHistory.dbo.ProductContent b WITH (NOLOCK) ON a.ContentID = b.ContentID
LEFT JOIN widgetHistory.dbo.SomeThings k WITH (NOLOCK) ON a.SomeID = k.SomeID
LEFT JOIN widgetHistory.dbo.SubscriptionContents c WITH (NOLOCK) ON b.ContentID = c.ContentID
AND c.SubscriptionID = k.SubscriptionID
WHERE c.ContentStatus = 'GO'
ORDER BY UniqueID
It won't even complete, so I cannot view the execution plan.
Without the ORDER BY, SQL Server will give you the first 100 rows it computes as soon as it's done computing them.
With the ORDER BY, SQL Server must compute all rows, sort them, and only then can it give you the 100 rows you asked for.
As SQL is set-oriented, I think you would be better off creating your temporary table first and then using ORDER BY when you query the result set from the temporary table. Tables by definition have no default ordering, so you are always better off using the ORDER BY clause when you actually query the data rather than when you are loading it.
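A sketch of that suggestion applied to the query above (with one caveat: TOP 100 without an ORDER BY returns an arbitrary 100 rows, so this only matches the original intent if any 100 rows will do):

```sql
-- Populate the temp table without sorting...
SELECT top 100 a.UniqueID, a.SomeID, a.ContentID, SortOrder, b.ValueOfMine
INTO #ContentHistory
FROM widgetHistory.dbo.CustomerProductContent a WITH (NOLOCK)
LEFT JOIN widgetHistory.dbo.ProductContent b WITH (NOLOCK) ON a.ContentID = b.ContentID
LEFT JOIN widgetHistory.dbo.SomeThings k WITH (NOLOCK) ON a.SomeID = k.SomeID
LEFT JOIN widgetHistory.dbo.SubscriptionContents c WITH (NOLOCK) ON b.ContentID = c.ContentID
    AND c.SubscriptionID = k.SubscriptionID
WHERE c.ContentStatus = 'GO';

-- ...and sort only the 100 rows when reading them back out
SELECT * FROM #ContentHistory ORDER BY UniqueID;
```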