How can I improve performance when ROW_NUMBER() with PARTITION BY is used in a Hive query?
select *
from
(
SELECT
'123' AS run_session_id
, tbl1.transaction_id
, tbl1.src_transaction_id
, tbl1.transaction_created_epoch_time
, tbl1.currency
, tbl1.event_type
, tbl1.event_sub_type
, tbl1.estimated_total_cost
, tbl1.actual_total_cost
, tbl1.tfc_export_created_epoch_time
, tbl1.authorizer
, tbl1.acquirer
, tbl1.processor
, tbl1.company_code
, tbl1.country_of_account
, tbl1.merchant_id
, tbl1.client_id
, tbl1.ft_id
, tbl1.transaction_created_date
, tbl1.event_pst_time
, tbl1.extract_id_seq
, tbl1.src_type
, ROW_NUMBER() OVER(PARTITION BY tbl1.transaction_id ORDER BY tbl1.event_pst_time DESC) AS seq_num -- while writing back to the pfit events table, write each event so that event_pst_time is populated correctly
FROM nest.nest_cost_events tbl1 --<hiveFinalDB>-- -- DB variables won't work, so the DB needs to be changed accordingly for testing and PROD deployment
WHERE extract_id_seq BETWEEN 275 - 60
AND 275
AND event_type in('ACT','CBR','SKU','CAL','KIT','BXT' )) tbl1
where seq_num=1;
This table is partitioned by src_type.
Right now it takes 20 minutes to process 154M records; I want to reduce that to 10 minutes.
Any suggestions?
Thanks
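One common alternative worth trying is to replace the ROW_NUMBER() + filter pattern with a single aggregation: Hive compares structs field by field, so packing each row into a struct with event_pst_time first and taking MAX() per transaction_id keeps the latest event without the per-partition window sort. This is an untested sketch against the query above, shown with only a few of the columns; the remaining columns would be added to the struct the same way:

```sql
-- Sketch: latest event per transaction_id via the max-struct trick.
-- Struct comparison is field-by-field, so event_pst_time must come first.
SELECT
    latest.transaction_id
  , latest.event_pst_time
  , latest.event_type        -- add the remaining columns to the struct the same way
FROM (
  SELECT
      MAX(NAMED_STRUCT(
          'event_pst_time', tbl1.event_pst_time
        , 'transaction_id', tbl1.transaction_id
        , 'event_type',     tbl1.event_type
      )) AS latest
  FROM nest.nest_cost_events tbl1
  WHERE extract_id_seq BETWEEN 275 - 60 AND 275
    AND event_type IN ('ACT','CBR','SKU','CAL','KIT','BXT')
  GROUP BY tbl1.transaction_id
) t;
```

Whether this beats the window function depends on the data skew of transaction_id, so it is worth benchmarking both on a sample before committing to it.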
Related
I have a very simple query, but it takes too long to run when I use MAX and GROUP BY. Could you please propose an alternative? I am using Oracle 18c to run this query. (a_num_ver, id, site_id) is the primary key.
SELECT id
, site_id
, sub_id
, max(a_num_ver) as a_num_ver
, ae_no
, max(aer_ver) AS aer_ver
FROM table_1
GROUP BY id
, site_id
, sub_id
, ae_no
Try using a parallel hint of 4 or 8 if your DBA allows it. I tried a similar query on a table with around 296,292,720 rows. Without the hint it took around 2 minutes to execute; with PARALLEL(8) it came down to 20 seconds.
SELECT /*+ PARALLEL(8) */
id
, site_id
, sub_id
, max(a_num_ver) as a_num_ver
, ae_no
, max(aer_ver) AS aer_ver
FROM table_1
GROUP BY id
, site_id
, sub_id
, ae_no
Hi: I have a situation where I need to find the max value of 3 calculated fields and store it in another field. Is it possible to do this in one SQL query? Below is an example:
SELECT Income1 ,
Income1 * 2% as Personal_Income ,
Income2 ,
Income2 * 10% as Share_Income ,
Income3 ,
Income3 * 1% as Job_Income ,
Max(Personal_Income, Share_Income, Job_Income )
From Table
One way I tried is to calculate Personal_Income, Share_Income, Job_Income in the first pass and in the second pass I used
Select
Case when Personal_income > Share_Income and Personal_Income > Job_Income
then Personal_income
when Share_income > Job_Income
then Share_income
Else Job_income
End as greatest_income
but this requires two scans of a billion-row table. How can I avoid this and do it in a single pass? Any help is much appreciated.
As of Hive 1.1.0 you can use the greatest() function. This query does it in a single table scan:
select Income1 ,
Personal_Income ,
Income2 ,
Share_Income ,
Income3 ,
Job_Income ,
greatest(Personal_Income, Share_Income, Job_Income ) as greatest_income
from
(
SELECT Income1 ,
Income1 * 0.02 as Personal_Income ,
Income2 ,
Income2 * 0.10 as Share_Income ,
Income3 ,
Income3 * 0.01 as Job_Income
From Table
)s
;
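If the aliased intermediates are not needed elsewhere, the subquery can be dropped and the expressions repeated inline. A sketch against the same hypothetical table and columns:

```sql
-- Sketch: single-pass greatest() with the expressions inlined,
-- avoiding the wrapping subquery entirely.
SELECT Income1
     , Income1 * 0.02 AS Personal_Income
     , Income2
     , Income2 * 0.10 AS Share_Income
     , Income3
     , Income3 * 0.01 AS Job_Income
     , greatest(Income1 * 0.02, Income2 * 0.10, Income3 * 0.01) AS greatest_income
FROM Table;
```

The optimizer typically collapses the subquery version anyway, so this is mostly a readability choice.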
First, I apologize for not having a fiddle, but my data set was too large. So I am including a file.
I have a set of queries that combine user data with location data. The location history data is gathered from RF devices, and a history for each user is persisted. Working against the supplied SQL file, I need to combine rows where a staff member may enter a room, leave that room, then come back; this might constitute a visit with a patient. Another case is a staff member being logged in a room consecutively. We only care about rows whose timestamps are more than 2 minutes apart, because the RF reader may pick up a tag when a staff member merely walks past a room, and that location gets logged. Here is the initial set of queries:
with
StaffHistory as(
SELECT
LocationHistories.UserProfileId,
UserProfiles.FirstName,
UserProfiles.LastName,
LocationHistories.LocationId,
Locations.LocationName,
LocationHistories.LocationHistoryTimeStamp,
PreviousLocationTime = LAG(LocationHistories.LocationHistoryTimeStamp, 1) OVER
(PARTITION BY LocationHistories.UserProfileId ORDER BY LocationHistories.LocationHistoryTimeStamp),
NextLocationTime = Lead(LocationHistories.LocationHistoryTimeStamp, 1) OVER
(PARTITION BY LocationHistories.UserProfileId ORDER BY LocationHistories.LocationHistoryTimeStamp)
FROM
LocationHistories
INNER JOIN UserProfiles ON LocationHistories.UserProfileId = UserProfiles.Id
INNER JOIN Locations ON LocationHistories.LocationId = Locations.Id
where LocationTrackingType = 1),
StaffInRoomTime as(
SELECT
StaffHistory.UserProfileId,
StaffHistory.FirstName,
StaffHistory.LastName,
StaffHistory.LocationId,
StaffHistory.LocationName,
DATEDIFF(SECOND, LocationHistoryTimeStamp, NextLocationTime) as TimeSpentInRoom,
StaffHistory.LocationHistoryTimeStamp,
StaffHistory.PreviousLocationTime,
StaffHistory.NextLocationTime
FROM
StaffHistory
Where DATEDIFF(SECOND, LocationHistoryTimeStamp, NextLocationTime) > 120
)
select * from StaffInRoomTime ORDER BY UserProfileId, LocationHistoryTimeStamp
I used common table expressions just for this example. These are actual views in the DB.
The first query joins the histories with the staff. We also create a couple columns for the Previous logged time and the Next logged time. This is so we can determine the length of time in a room.
The second query pulls from the first and adds a column for how long the staff member was in that location; it also keeps only the rows where the gap between LocationHistoryTimeStamp and NextLocationTime is greater than 2 minutes.
What I am trying to achieve is combine the data where a staff member may be logged for a room consecutively or if they leave the room and come back.
Here is an example of the data set where the staff member is logged in a room consecutively:
Here would be the outcome:
Here is an example of spanning multiple rooms for a given visit:
I have tried an INNER JOIN on the second query using MIN(LocationHistoryTimeStamp). However, the time spans ended up being incorrect, so I am missing something.
This was the INNER JOIN query I tried (note it doesn't run as written):
Select
StaffHistory.LocationId,
StaffHistory.LocationName,
StaffHistory.UserProfileId,
Min(LocationHistoryTimeStamp) as LocationHistoryTimeStamp,
DATEDIFF(SECOND, Min(LocationHistoryTimeStamp), Lead(Min(LocationHistoryTimeStamp), 1) OVER
(PARTITION BY UserProfileId ORDER BY Min(LocationHistoryTimeStamp))) As TimeSpentInRoom,
NextLocationTime = Lead(Min(LocationHistoryTimeStamp), 1) OVER
(PARTITION BY UserProfileId ORDER BY Min(LocationHistoryTimeStamp))
FROM
StaffHistory
Where TimeSpentInRoom > 120
GROUP BY LocationId,LocationName,UserProfileId
Here is statement:
WITH StaffHistory
AS ( SELECT LocationHistories.UserProfileId ,
UserProfiles.FirstName ,
UserProfiles.LastName ,
LocationHistories.LocationId ,
Locations.LocationName ,
LocationHistories.LocationHistoryTimeStamp ,
PreviousLocationTime = LAG(LocationHistories.LocationHistoryTimeStamp,
1) OVER ( PARTITION BY LocationHistories.UserProfileId ORDER BY LocationHistories.LocationHistoryTimeStamp ) ,
NextLocationTime = LEAD(LocationHistories.LocationHistoryTimeStamp,
1) OVER ( PARTITION BY LocationHistories.UserProfileId ORDER BY LocationHistories.LocationHistoryTimeStamp )
FROM LocationHistories
INNER JOIN UserProfiles ON LocationHistories.UserProfileId = UserProfiles.Id
INNER JOIN Locations ON LocationHistories.LocationId = Locations.Id
WHERE LocationTrackingType = 1
),
StaffInRoomTime
AS ( SELECT StaffHistory.UserProfileId ,
StaffHistory.FirstName ,
StaffHistory.LastName ,
StaffHistory.LocationId ,
StaffHistory.LocationName ,
DATEDIFF(SECOND, LocationHistoryTimeStamp,
NextLocationTime) AS TimeSpentInRoom ,
StaffHistory.LocationHistoryTimeStamp ,
StaffHistory.PreviousLocationTime ,
StaffHistory.NextLocationTime
FROM StaffHistory
WHERE DATEDIFF(SECOND, LocationHistoryTimeStamp,
NextLocationTime) > 120
),
prepareIsland
AS ( SELECT * ,
CASE WHEN LAG(LocationId) OVER ( PARTITION BY UserProfileId ORDER BY LocationHistoryTimeStamp ) <> LocationId
THEN 1
ELSE 0
END AS prepIsland
FROM StaffInRoomTime
),
islands
AS ( SELECT * ,
SUM(prepIsland) OVER ( ORDER BY UserProfileId , LocationHistoryTimeStamp ) AS Island
FROM prepareIsland
)
SELECT island ,
UserProfileId ,
FirstName ,
LastName ,
LocationId ,
LocationName ,
SUM(TimeSpentInRoom) TimeSpentInRoom ,
MIN(LocationHistoryTimeStamp) LocationHistoryTimeStamp ,
MIN(PreviousLocationTime) PreviousLocationTime ,
MAX(NextLocationTime) NextLocationTime
FROM islands
GROUP BY island ,
UserProfileId ,
FirstName ,
LastName ,
LocationId ,
LocationName
ORDER BY UserProfileId ,
LocationHistoryTimeStamp
This question already has answers here:
How to get the last row of an Oracle table
(7 answers)
Closed 8 years ago.
I have a table storing transactions called TRANSFER. I needed to write a query that returns only the newest transaction for a given stock tag (a unique key identifying the material), so I used the following query:
SELECT a.TRANSFER_ID
, a.TRANSFER_DATE
, a.ASSET_CATEGORY_ID
, a.ASSET_ID
, a.TRANSFER_FROM_ID
, a.TRANSFER_TO_ID
, a.STOCK_TAG
FROM TRANSFER a
INNER JOIN (
SELECT STOCK_TAG
, MAX(TRANSFER_DATE) maxDATE
FROM TRANSFER
GROUP BY STOCK_TAG
) b
ON a.STOCK_TAG = b.STOCK_TAG AND
a.Transfer_Date =b.maxDATE
But I end up with a problem: when more than one transfer happens on the same transfer date, it returns all of those rows, whereas I need only the latest one. How can I get the latest row?
edited:
transfer_id transfer_date asset_category_id asset_id stock_tag
1 24/12/2010 100 111 2000
2 24/12/2011 100 111 2000
To avoid the potential situation of rows not being inserted in transfer_date order, and maybe for performance reasons, you might like to try:
select
TRANSFER_ID ,
TRANSFER_DATE ,
ASSET_CATEGORY_ID,
ASSET_ID ,
TRANSFER_FROM_ID ,
TRANSFER_TO_ID ,
STOCK_TAG
from (
SELECT
TRANSFER_ID ,
TRANSFER_DATE ,
ASSET_CATEGORY_ID,
ASSET_ID ,
TRANSFER_FROM_ID ,
TRANSFER_TO_ID ,
STOCK_TAG ,
row_number() over (
partition by stock_tag
order by transfer_date desc,
transfer_id desc) rn
FROM TRANSFER)
where rn = 1
Consider selecting MAX(TRANSFER_ID) in your subquery, assuming that TRANSFER_ID is an incrementing field, such that later transfers always have larger IDs than earlier transfers.
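That tie-breaking suggestion would look something like the sketch below, keeping the original join-on-subquery shape but keyed on MAX(TRANSFER_ID) instead of MAX(TRANSFER_DATE). It assumes, as stated, that TRANSFER_ID is monotonically increasing with insert order:

```sql
-- Sketch: break same-date ties by joining on the largest TRANSFER_ID
-- per stock tag (assumes IDs increase with insert order).
SELECT a.TRANSFER_ID
     , a.TRANSFER_DATE
     , a.ASSET_CATEGORY_ID
     , a.ASSET_ID
     , a.TRANSFER_FROM_ID
     , a.TRANSFER_TO_ID
     , a.STOCK_TAG
FROM TRANSFER a
INNER JOIN (
    SELECT STOCK_TAG
         , MAX(TRANSFER_ID) AS maxID
    FROM TRANSFER
    GROUP BY STOCK_TAG
) b
  ON a.STOCK_TAG = b.STOCK_TAG
 AND a.TRANSFER_ID = b.maxID
```

Since TRANSFER_ID is unique, this is guaranteed to return exactly one row per stock tag, which the date-based join cannot promise.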
Hello all, I have a problem that I just CAN'T get to work the way I want.
I want to show news and reviews (2 tables) with random output that is not the same every time.
Here is my query; I really hope someone can explain what I am doing wrong:
SELECT
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.id ,
anmeldelser.godkendt
FROM
anmeldelser
LIMIT 0,6
UNION ALL
SELECT
nyheder.id ,
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.godkendt
FROM nyheder
ORDER BY rand() LIMIT 0,6
First off, it looks like the column orders of the two SELECT statements don't match, which they must for a UNION.
What does the following return?
SELECT
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.id ,
anmeldelser.godkendt
FROM
anmeldelser
LIMIT 0,6
UNION ALL
SELECT
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.id ,
nyheder.godkendt
FROM nyheder
ORDER BY rand() LIMIT 0,6
(Which RDBMS are you using? The SQL you have is not valid for Sybase, but there may be techniques available depending on the flavour of SQL you are using.)
Since RAND() appears only in the ORDER BY clause, would it not only be evaluated once for the whole query, and not once per row?
The problem is that the first SELECT is not choosing random rows. Wrapping the union in a subquery and ordering the combined result randomizes both tables:
SELECT temp.* FROM
(
SELECT
anmeldelser.id ,
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.godkendt,
'News' as artType
FROM anmeldelser
UNION
SELECT
nyheder.id ,
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.godkendt,
'Review' as artType
FROM nyheder
) temp
ORDER BY rand() LIMIT 0,6