Improve performance of a Hive window function query

I'm trying to run the Hive query below with a windowing function and it's taking forever. I'm hoping someone has some tips on what I could do to speed it up. table1 below has close to 1 billion records and table2 only has a few thousand. Any tips are greatly appreciated.
Code:
SELECT up.uid, up.ban, up.ban_pref,
       DENSE_RANK() OVER (PARTITION BY up.uid ORDER BY up.ban_pref DESC, bnp.tot_pod DESC) AS rank
FROM table1 AS up
INNER JOIN table2 AS bnp ON up.ban = bnp.ban;

Maybe this has already been resolved, but here are my thoughts:
1. First, try to complete the join as a map-side join, since the second table is small. This can be done by setting hive.auto.convert.join=true (see the sketch below).
2. Then have the window function execute over the join output.
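A minimal sketch of that setup, assuming a reasonably recent Hive; the small-table threshold shown is the stock default and may need tuning for your cluster:

-- Broadcast the small table to the mappers instead of shuffling the billion-row table for the join
SET hive.auto.convert.join=true;
-- Tables below this size in bytes qualify for the map-side join (25 MB is the default)
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT up.uid, up.ban, up.ban_pref,
       DENSE_RANK() OVER (PARTITION BY up.uid ORDER BY up.ban_pref DESC, bnp.tot_pod DESC) AS rank
FROM table1 AS up
INNER JOIN table2 AS bnp ON up.ban = bnp.ban;

Even with the join converted, the shuffle and sort for PARTITION BY up.uid still has to move the large table once; that part is inherent to the window function.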

Related

SQL - LEFT JOIN takes an extremely long time to execute

I'm trying to see if there are any rows in table A which I've missed in table B.
For this, I'm using the following query:
SELECT t1.cusa
FROM patch t1
LEFT JOIN trophy t2
ON t2.titleid = t1.titleid
WHERE t2.titleid IS NULL
And the query worked before, but now that the trophy table has nearly 200,000 rows, it's extremely slow. I waited 5 minutes for it to execute, but it was still loading and eventually timed out.
Is there any way to speed this query up?
Adding indexes on titleid in both tables (but especially t2) is the quickest way to get better performance. 200K records is nothing for SQL Server.
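For reference, a minimal sketch of those indexes, assuming SQL Server and the table/column names from the question (the index names are placeholders):

CREATE INDEX IX_trophy_titleid ON trophy (titleid);
CREATE INDEX IX_patch_titleid ON patch (titleid);

The index on trophy.titleid is the important one: it lets the join (or the NOT EXISTS below) seek for matches instead of scanning the whole table for every row of patch.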
Try this and it might perform a bit better!
SELECT t1.cusa
FROM patch t1
WHERE NOT EXISTS (SELECT 1
                  FROM trophy t2
                  WHERE t2.titleid = t1.titleid);

New Analyst SQL Optimization - Multiple Spools

Unfortunately my ability to query has outgrown my knowledge of SQL optimization, so I am hoping someone will help a young analyst by looking at this atrocious execution plan and provide some wisdom as to how I could speed it up. I've read a few threads about spooling, but they were mostly discussions about whether an Eager Table Spool is good or bad, and the answer is always "it depends".
My execution plan looks like it's Spooling and Sorting the same #Temp Table multiple times, and it's eating up a lot of execution cost.
My understanding of a Table Spool is that it is temporary storage to be used later, but if the data is already stored for later use, why would it spool the same data over and over again? My query doesn't require any ordering so why would it need to sort the same #TempTable/Spool multiple times?
I'm so new to this that I can't figure out how to attach the entire execution plan... so I attached an image of the bottom half of it.
Help me, experienced analysts. You're my only hope.
A little context:
I currently have a transaction table that tracks all changes made to a lead in my CRM, and I am attempting to create a new table from this data to speed up reporting.
I am pulling data from this transaction table and flagging the first action, first user, and other firsts of a lead by using ROW_NUMBER(). I am then inserting every "first" into a #Temp table, as I know I am going to use this data multiple times.
SELECT
    ID,
    Action,
    ROW_NUMBER() OVER (PARTITION BY ID, Action ORDER BY Date) AS ActionNum,
    ROW_NUMBER() OVER (PARTITION BY ID, Actor ORDER BY Date) AS UserNum
INTO #Temp
FROM Table;
I am then LEFT JOINing this #Temp table many times (10 times, actually). I have tried multiple other ways of solving this issue, but using ROW_NUMBER multiple times seems like the best solution.
SELECT
    *
FROM #Temp T1
LEFT JOIN #Temp T2
    ON T2.ID = T1.ID AND T2.Action = 'A2' AND T2.ActionNum = 1
LEFT JOIN #Temp T3
    ON T3.ID = T1.ID AND T3.Action = 'A3' AND T3.ActionNum = 1
LEFT JOIN #Temp T4
    ON T4.ID = T1.ID AND T4.UserNum = 1
WHERE
    T1.Action = 'A1'
    AND T1.ActionNum = 1;
I've looked into creating a clustered index on the #Temp table, but I must not be doing it right because it didn't change anything about my execution plan.
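Something like this is what I had in mind, keying on the columns the later joins filter on (the key choice is a guess on my part):

CREATE CLUSTERED INDEX CIX_Temp ON #Temp (ID, Action, ActionNum);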
Thanks in advance for all your help! Any good reading material is also greatly appreciated!
Best,
Austin

Optimize WHERE clause in query

I have the following query:
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
WHERE table2.serialcode = '49257'
and date = (select max(a.date) from table1 a where a.serialcode ='49257');
It seems it is retrieving the select max subquery for each join. It takes a lot of time. Is there a way to optimize it? Any help will be appreciated.
Sub-selects that end up being evaluated "per row of the main query" can cause tremendous performance problems once you try to scale to larger numbers of rows.
Sub-selects can almost always be eliminated with a data model tweak.
Here's one approach: add a new is_latest column to the table to track whether the row holds the max value (and for ties, break them using other fields such as a created timestamp or the row ID). Set it to 1 if true, else 0.
Then you can add where is_latest = 1 to your query and this will radically improve performance.
You can schedule the update to happen or add a trigger etc. if you need an automated way of keeping is_latest up to date.
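A rough sketch of that maintenance step, assuming SQL Server syntax and ignoring tie-breaking for brevity:

-- One-time column addition; existing rows default to "not latest"
ALTER TABLE table1 ADD is_latest bit NOT NULL DEFAULT 0;

-- Recompute the flag from the current data
UPDATE t
SET is_latest = CASE WHEN t.date = m.max_date THEN 1 ELSE 0 END
FROM table1 t
JOIN (SELECT serialcode, MAX(date) AS max_date
      FROM table1
      GROUP BY serialcode) m
    ON m.serialcode = t.serialcode;

The original query then filters on WHERE table2.serialcode = '49257' AND is_latest = 1 with no subquery at all.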
Other approaches involve 2 tables - one where you keep only the latest record and another table where you keep the history.
Another option is to compute the max date once, into a variable, and reference that in the main query:
declare @maxDate datetime;
select @maxDate = max(a.date) from table1 a where a.serialcode = '49257';
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
WHERE table2.serialcode = '49257'
and date = @maxDate;
You can optimize this query using indexes. Here are some that should help: table1(serialcode, serial, date), table1(serialcode, date), and pivots(serialcode).
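Spelled out as DDL, assuming plain B-tree indexes (the names are placeholders):

CREATE INDEX idx_table1_serialcode_serial_date ON table1 (serialcode, serial, date);
CREATE INDEX idx_table1_serialcode_date ON table1 (serialcode, date);
CREATE INDEX idx_pivots_serialcode ON pivots (serialcode);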
Note: I find it very strange that you have columns called serial and serialcode in the same table, and the join is on serial.
Since you haven't mentioned which DB you are using, I'll answer as if it were Oracle.
You can use a WITH clause to pull the subquery out and have it execute just once.
WITH d AS (
SELECT max(a.date) max_date from TABLE1 a WHERE a.serialcode ='49257'
)
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
JOIN d on (table2.date = d.max_date)
WHERE table2.serialcode = '49257'
Please note that you haven't qualified the date column, so I just assumed it belonged to table1 and not pivots. You can change it if needed. One piece of advice on the same note: always qualify your columns using the table.column format.

SELECT MAX() too slow - any alternatives?

I've inherited a SQL Server based application, and it has a stored procedure that contains the following but hits a timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives such as ROW_NUMBER() OVER (PARTITION BY...
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.
SQL performance problems are seldom addressed by rewriting the query; the compiler already knows how to rewrite it anyway. The problem is always indexing. To compute MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2 you need an index on BItems(ICID).
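As DDL, those two indexes might look like this (the names are invented):

CREATE INDEX IX_BData_BID_StatusTime ON dbo.BData (BID, StatusTime);
CREATE INDEX IX_BItems_ICID ON dbo.BItems (ICID);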
The query could also probably be expressed as a correlated APPLY, because that seems to be what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
This is not semantically the same query as the OP's: the original would return multiple rows on a StatusTime collision. My guess, though, is that this is what's desired ('the most recent BData for this BItem').
Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
    SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
    FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
    ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick: avoiding the schema prefix
This may be a late response, but I recently ran into the same performance issue where a simple query involving max() is taking more than 1 hour to execute.
After looking at the execution plan, it seems that in order to perform the MAX() function, every record meeting the WHERE clause condition will be fetched; in your case, every record in the table needs to be fetched before the MAX() can be computed. Also, indexing BData.StatusTime on its own will not speed up the query: an index is useful for looking up a particular record, but it will not help with performing the comparison.
In my case, I didn't have the GROUP BY, so all I did was use an ORDER BY ... DESC clause with SELECT TOP 1. The query went from over 1 hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully your query can be sped up too.
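For what that looks like in the no-GROUP BY case, borrowing the table and column names from this question (so treat it as a sketch):

SELECT TOP (1) *
FROM dbo.BData
ORDER BY StatusTime DESC;

With an index leading on StatusTime, this is satisfied by reading a single row from the end of the index instead of scanning and sorting the whole table.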
Cheers!
The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (SELECT bd.*,
             ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC) AS seqnum
      FROM BData bd
     ) bd INNER JOIN
     BItems b
     ON b.InternalID = bd.BID AND bd.seqnum = 1
WHERE b.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.
Depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery, but more than likely it won't yield any faster results.
The best option would probably be to add an index on BID, with an INCLUDE containing StatusTime, and if possible filtered to the InternalIDs matching BItems.ICID = 2.
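One reading of that suggestion as DDL (the index name is invented; note that a filtered index cannot reference another table, so the ICID = 2 condition would have to be denormalized into BData first):

CREATE NONCLUSTERED INDEX IX_BData_BID ON dbo.BData (BID) INCLUDE (StatusTime);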
[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers / suggestions. Unfortunately I couldn't get any further with this, so have given-up trying for now.
It looks like the best solution is to re-write the application to UPDATE the latest data into a different table; that way it's a really quick and simple SELECT to get the latest readings.
Thanks again for the suggestions.

Why is an IN statement with a list of items faster than an IN statement with a subquery?

I'm having the following situation:
I've got a quite complex view from which I've to select a couple of records.
SELECT * FROM VW_Test INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id
WHERE VW_Test.id IN (1000,1001,1002,1003,1004,[etc])
This returns a result practically instantly (currently with 25 items in that IN statement). However, when I use the following query, it is dramatically slower.
SELECT * FROM VW_Test INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id
WHERE VW_Test.id IN (SELECT id FROM TBL_Test)
With 25 records in the TBL_Test this query takes about 5 seconds. I've got an index on that id in the TBL_Test.
Anyone got an idea why this happens and how to get performance up?
EDIT: I forgot to mention that this subquery
SELECT id FROM TBL_Test
returns a result instantly as well.
Well, when using a subquery, the database engine first has to generate the results for the subquery before it can do anything else, which takes time. If you have a predefined list, this does not need to happen and the engine can simply use those values as-is. At least, this is how I understand it.
How to improve performance: do away with the subquery. I don't think you even need the IN clause in this case. The INNER JOIN should suffice.
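A sketch of that simplification, keeping the join and dropping the IN entirely, since the INNER JOIN on id already restricts VW_Test to ids present in TBL_Test:

SELECT *
FROM VW_Test
INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id;

One caveat: if id is not unique in TBL_Test, the join can multiply rows where the IN version would not, so the two are only equivalent when id is a key of TBL_Test.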