To return only the latest row [duplicate] - sql

This question already has answers here:
How to get the last row of an Oracle table
(7 answers)
Closed 8 years ago.
I have a table storing transactions called TRANSFER. I needed to write a query that returns only the newest transaction entry for a given stock tag (a unique key identifying the material), so I used the following query:
SELECT a.TRANSFER_ID
     , a.TRANSFER_DATE
     , a.ASSET_CATEGORY_ID
     , a.ASSET_ID
     , a.TRANSFER_FROM_ID
     , a.TRANSFER_TO_ID
     , a.STOCK_TAG
FROM TRANSFER a
INNER JOIN (
    SELECT STOCK_TAG
         , MAX(TRANSFER_DATE) maxDATE
    FROM TRANSFER
    GROUP BY STOCK_TAG
) b
  ON a.STOCK_TAG = b.STOCK_TAG
 AND a.TRANSFER_DATE = b.maxDATE
But I end up with a problem: when more than one transfer happens on the same transfer date, the query returns all of those rows, whereas I need only the latest one. How can I get just the latest row?
Edited to add sample data:

transfer_id  transfer_date  asset_category_id  asset_id  stock_tag
1            24/12/2010     100                111       2000
2            24/12/2011     100                111       2000

To avoid the potential situation of rows not being inserted in transfer_date order, and maybe for performance reasons, you might like to try:
SELECT TRANSFER_ID
     , TRANSFER_DATE
     , ASSET_CATEGORY_ID
     , ASSET_ID
     , TRANSFER_FROM_ID
     , TRANSFER_TO_ID
     , STOCK_TAG
FROM (
    SELECT TRANSFER_ID
         , TRANSFER_DATE
         , ASSET_CATEGORY_ID
         , ASSET_ID
         , TRANSFER_FROM_ID
         , TRANSFER_TO_ID
         , STOCK_TAG
         , ROW_NUMBER() OVER (
               PARTITION BY STOCK_TAG
               ORDER BY TRANSFER_DATE DESC, TRANSFER_ID DESC
           ) rn
    FROM TRANSFER
)
WHERE rn = 1

Consider selecting MAX(TRANSFER_ID) in your subquery, assuming that TRANSFER_ID is an incrementing field, such that later transfers always have larger IDs than earlier transfers.
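A minimal sketch of that variant, adapting the join from the question (assumption: TRANSFER_ID increases monotonically, so the largest ID per stock tag is the newest transfer):

SELECT a.TRANSFER_ID
     , a.TRANSFER_DATE
     , a.ASSET_CATEGORY_ID
     , a.ASSET_ID
     , a.TRANSFER_FROM_ID
     , a.TRANSFER_TO_ID
     , a.STOCK_TAG
FROM TRANSFER a
INNER JOIN (
    -- MAX(TRANSFER_ID) is unique per stock tag, so ties on
    -- TRANSFER_DATE can no longer produce multiple rows
    SELECT STOCK_TAG
         , MAX(TRANSFER_ID) maxID
    FROM TRANSFER
    GROUP BY STOCK_TAG
) b
  ON a.STOCK_TAG = b.STOCK_TAG
 AND a.TRANSFER_ID = b.maxID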

Related

Query keeps giving me duplicate records. How can I fix this?

I wrote a query which uses 2 temp tables and then joins them into 1. However, I am seeing duplicate records in the student visit temp table (query is below). How could this be modified to remove the duplicate records from the visit temp table?
with clientbridge as (
    Select *
    from (
        Select visitorid, -- VisID
               roomnumber,
               room_id,
               profid,
               student_id,
               cohd.datekey,
               RANK() over (PARTITION BY visitorid, student_id, profid
                            ORDER BY cohd.datekey desc) as rn
        from university.course_office_hour_bridge cohd
        --where student_id = '9999999-aaaa-6634-bbbb-96fa18a9046e'
    )
    where rn = 1
),
-----------------Data Header Table
studentvisit as (
    SELECT
        -- Visit key will allow us to track everything they did within that visit.
        distinct visid_visitorid,
        --calcualted_visitorid,
        uniquevisitkey,
        --channel, -- says the room they're in. Channel might not be reliable; would need to see how that operates
        --office_list, -- add 7 to exact
        --user_college,
        --first_office_hour_name,
        --first_question_time_attended,
        studentaccountid_5,
        profid_officenumber_8,
        studentvisitstarttime,
        room_id_115,
        --date_time,
        qqq144, -- Course Name
        qqq145, -- Course Office Hour Benefit
        qqq146, -- Course Office Hour ID
        datekey
    FROM university.office_hour_details ohd
    --left join university.course_office_hour_bridge cohd on ohd.visid_visitorid
    where DateKey >= '2022-10-01' --between '2022-10-01' and '2022-10-27'
      and (qqq146 <> '')
)
select *
from clientbridge cb
inner join studentvisit sv on sv.visid_visitorid = cb.visitorid
I think you may have a better shot by joining the two datasets in the same query where you want the data ranked; otherwise the rank from your first query is ignored within the results of the second query. Perhaps something like:
;with studentvisit as (
    SELECT
        -- Visit key will allow us to track everything they did within that visit.
        distinct visid_visitorid,
        --calcualted_visitorid,
        uniquevisitkey,
        --channel, -- says the room they're in. Channel might not be reliable; would need to see how that operates
        --office_list, -- add 7 to exact
        --user_college,
        --first_office_hour_name,
        --first_question_time_attended,
        studentaccountid_5,
        profid_officenumber_8,
        studentvisitstarttime,
        room_id_115,
        --date_time,
        qqq144, -- Course Name
        qqq145, -- Course Office Hour Benefit
        qqq146, -- Course Office Hour ID
        datekey
    FROM university.office_hour_details ohd
    --left join university.course_office_hour_bridge cohd on ohd.visid_visitorid
    where DateKey >= '2022-10-01' --between '2022-10-01' and '2022-10-27'
      and (qqq146 <> '')
),
clientbridge as (
    Select
        sv.*,
        cohd.visitorid, -- VisID
        cohd.roomnumber,
        cohd.room_id,
        cohd.profid,
        cohd.student_id,
        cohd.datekey as cohd_datekey, -- aliased to avoid clashing with sv.datekey
        RANK() over (PARTITION BY cohd.visitorid, cohd.student_id, cohd.profid
                     ORDER BY cohd.datekey desc) as rn
    from university.course_office_hour_bridge cohd
    inner join studentvisit sv on sv.visid_visitorid = cohd.visitorid
)
select *
from clientbridge
WHERE rn = 1

My SQL MERGE statement runs for too long

I have this Hive MERGE statement:
MERGE INTO destination dst
USING (
SELECT
-- DISTINCT fields
company
, contact_id as id
, ct.cid as cid
-- other fields
, email
, timestamp_utc
-- there are actually about 6 more
-- deduplication
, ROW_NUMBER() OVER (
PARTITION BY company
, ct.id
, contact_id
ORDER BY timestamp_utc DESC
) as r
FROM
source
LATERAL VIEW explode(campaign_id) ct AS cid
) src
ON
dst.company = src.company
AND dst.campaign_id = src.cid
AND dst.id = src.id
-- On match: keep latest loaded
WHEN MATCHED
AND dst.updated_on_utc < src.timestamp_utc
AND src.r = 1
THEN UPDATE SET
email = src.email
, updated_on_utc = src.timestamp_utc
WHEN NOT MATCHED AND src.r = 1 THEN INSERT VALUES (
src.id
, src.email
, src.timestamp_utc
, src.license_name
, src.cid
)
;
This runs for a very long time (30 minutes for 7 GB of Avro-compressed data on disk), and I wonder if there are any SQL-level ways to improve it.
ROW_NUMBER() is here to deduplicate the source table, so that in the MATCHED clauses we only select the latest row.
One thing I am not sure of is this warning from the Hive documentation:
SQL Standard requires that an error is raised if the ON clause is such that more than 1 row in source matches a row in target. This check is computationally expensive and may affect the overall runtime of a MERGE statement significantly. hive.merge.cardinality.check=false may be used to disable the check at your own risk. If the check is disabled, but the statement has such a cross join effect, it may lead to data corruption.
I do indeed disable the cardinality check: although the ON clause might match 2 rows in source, those rows are limited to 1 by the r = 1 condition later in the MATCHED clauses.
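For reference, that setting is disabled per session, e.g.:

-- disable the MERGE cardinality check for this session
-- (at your own risk, per the warning quoted above)
SET hive.merge.cardinality.check=false;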
Overall I like this MERGE statement, but it is just too slow and any help would be appreciated.
Note that the destination table is partitioned. The source table is not, as it is an external table which for every run must be fully merged, so fully scanned (in the background, already-merged data files are removed and new files are added before the next run). I am not sure that partitioning would help in that case.
What I have done so far:
- played with the hdfs/hive/yarn configuration
- tried a temporary table (2 steps) instead of a single MERGE; the run time jumped to more than 2 hours.
Option 1: Move the filter WHERE src.r = 1 inside the src subquery and check the merge performance. This will reduce the number of source rows before the merge; see the sketch below.
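A minimal sketch of Option 1 applied to the statement above (the column list is abbreviated to the fields shown in the question; the PARTITION BY is assumed to be company, ct.cid, contact_id, since the ct alias in the question only defines cid):

MERGE INTO destination dst
USING (
    SELECT * FROM (
        SELECT
            company
          , contact_id as id
          , ct.cid as cid
          , email
          , timestamp_utc
          -- other fields as in the question
          , ROW_NUMBER() OVER (
                PARTITION BY company, ct.cid, contact_id
                ORDER BY timestamp_utc DESC
            ) as r
        FROM source
        LATERAL VIEW explode(campaign_id) ct AS cid
    ) t
    WHERE r = 1  -- deduplicate before the join, not after
) src
ON dst.company = src.company
   AND dst.campaign_id = src.cid
   AND dst.id = src.id
-- the src.r = 1 tests are no longer needed in the WHEN clauses
WHEN MATCHED AND dst.updated_on_utc < src.timestamp_utc THEN UPDATE SET
    email = src.email
  , updated_on_utc = src.timestamp_utc
WHEN NOT MATCHED THEN INSERT VALUES (
    src.id, src.email, src.timestamp_utc, src.license_name, src.cid
)
;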
The other two options do not require ACID mode; both do a full rewrite of the target.
Option 2: Rewrite using UNION ALL + row_number (this should be the fastest one):
insert overwrite table destination
select
  company
, id
, cid
, email
, timestamp_utc
-- add more fields
from
(
    select --dedupe, keep the last updated row per key using row_number
      s.*
    , ROW_NUMBER() OVER (PARTITION BY company, cid, id ORDER BY timestamp_utc DESC) as rn
    from
    (
        select --union all source and target
          company
        , contact_id as id
        , ct.cid as cid
        , email
        , timestamp_utc
        -- add more fields
        from source LATERAL VIEW explode(campaign_id) ct AS cid
        UNION ALL
        select
          company
        , id
        , campaign_id as cid
        , email
        , updated_on_utc as timestamp_utc
        -- add more fields
        from destination
    ) s --union all
) s --ranked rows
where rn = 1 --filter duplicates
If source contains a lot of duplicates, you can apply additional row_number filtering to the src subquery as well before union.
One more approach using full join: https://stackoverflow.com/a/37744071/2700344
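A hedged sketch of that full-join rewrite under the same abbreviated column list (here a matching source row simply wins; reproducing the MERGE's "keep latest" test on updated_on_utc would need a CASE instead of COALESCE):

insert overwrite table destination
select
  coalesce(s.company, d.company) as company
, coalesce(s.id, d.id) as id
, coalesce(s.email, d.email) as email
, coalesce(s.timestamp_utc, d.updated_on_utc) as updated_on_utc
, coalesce(s.cid, d.campaign_id) as campaign_id
-- add more fields
from destination d
full join deduped_src s -- hypothetical name for the r = 1 source subquery from Option 1
  on  d.company = s.company
  and d.campaign_id = s.cid
  and d.id = s.id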

Row_number partition by performance

How can I improve the performance of a Hive query that uses ROW_NUMBER() with PARTITION BY?
select *
from
(
SELECT
'123' AS run_session_id
, tbl1.transaction_id
, tbl1.src_transaction_id
, tbl1.transaction_created_epoch_time
, tbl1.currency
, tbl1.event_type
, tbl1.event_sub_type
, tbl1.estimated_total_cost
, tbl1.actual_total_cost
, tbl1.tfc_export_created_epoch_time
, tbl1.authorizer
, tbl1.acquirer
, tbl1.processor
, tbl1.company_code
, tbl1.country_of_account
, tbl1.merchant_id
, tbl1.client_id
, tbl1.ft_id
, tbl1.transaction_created_date
, tbl1.event_pst_time
, tbl1.extract_id_seq
, tbl1.src_type
, ROW_NUMBER() OVER(PARTITION BY tbl1.transaction_id ORDER BY tbl1.event_pst_time DESC) AS seq_num -- when writing back to the pfit events table, write each event so that event_pst_time is populated the right way
FROM nest.nest_cost_events tbl1 --<hiveFinalDB>-- DB variables won't work, so the DB needs to be changed accordingly for testing and PROD deployment
WHERE extract_id_seq BETWEEN 275 - 60 AND 275
AND event_type in ('ACT','CBR','SKU','CAL','KIT','BXT')) tbl1
where seq_num=1;
This table is partitioned by src_type.
It is currently taking 20 minutes to process 154M records; I want to reduce that to 10 minutes.
Any suggestions?
Thanks

SQL - Group By 2 fields, where Count is Greater than X

I'm trying to find the correct and efficient way to write this query. For a sales data set, I want to isolate customers with more than 4 transactions in a given month, but along with that, I want the details of each sale.
What I've written so far is two volatile tables: T1 (Details) and T2 (Count). T1 puts all the details of the sales into a volatile table, and T2 counts the transactions grouped by the identifiers Store, Date, Time, Trans_ID, Cashier and TotalSale. Some transactions show up twice based on the details in T1, so I'm trying to eliminate the duplicates without removing the details.
My third query joins T1 against T2, keeping rows where the count in T2 is greater than 1.
I think this works; however, I'm not sure it's the most efficient way to run it, as it takes hours. Any input on changes I can make? Code is pasted below:
create volatile table IC_DTL as
(
SELECT DISTINCT
o.orgn_nm REGION
, o.str_org_nbr STORE
, d.sltrn_dt SLS_DATE
, cast(d.last_upd_ts as TIME) SLS_TIME
, d.pos_rgstr_id REGISTER
, d.sltrn_id TRANS_ID
, dp.user_id CASHIER_ID
, d.sltrn_line_nbr LINE_ITEM
, i.sub_class_nbr SUB_CLASS
, i.item_sc_desc SC_DESC
, i.item_sku_nbr SKU
, i.item_desc SKU_DESC
, d.ITEM_QTY QTY
, d.EXT_RETL_AMT EXT_RETAIL
, st.grs_sls_amt GROSS_SALE_AMT
, pc.s_paymt_meth_desc PAYMT_METH
, py.cardhldr_acct_nbr ACCT_NBR
FROM ALL MY TABLES (9 JOINS)
WHERE MY CONDITIONS
)
WITH DATA
ON COMMIT PRESERVE ROWS;
Create volatile table IC_COUNT as
(
select cashier_id, count(Trans_ID) as TOTAL
FROM IC_DTL
GROUP BY store, sls_date, sls_time, trans_id, cashier_id
, gross_sale_amt
)
WITH DATA
ON COMMIT PRESERVE ROWS;
select *
from IC_DTL D
JOIN IC_COUNT C
ON C.CASHIER_ID = D.CASHIER_ID
WHERE C.TOTAL >= 2
Any help is greatly appreciated!
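For comparison, a minimal sketch that replaces the second volatile table with a derived count per cashier and calendar month (assumptions: Teradata-style SQL to match the volatile tables above, a transaction identified by trans_id, the month taken from sls_date, and the "more than 4 per month" threshold from the problem statement; column names follow the question):

-- keep every detail row for cashiers with more than 4 distinct
-- transactions in a calendar month, without a second volatile table
SELECT d.*
FROM IC_DTL d
JOIN (
    SELECT cashier_id
         , EXTRACT(YEAR  FROM sls_date) AS sls_yr
         , EXTRACT(MONTH FROM sls_date) AS sls_mth
         , COUNT(DISTINCT trans_id)     AS total
    FROM IC_DTL
    GROUP BY 1, 2, 3
) c
  ON  c.cashier_id = d.cashier_id
  AND c.sls_yr  = EXTRACT(YEAR  FROM d.sls_date)
  AND c.sls_mth = EXTRACT(MONTH FROM d.sls_date)
WHERE c.total > 4;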

Datediff between two tables

I have these two tables:
1 - Add to queue table

TransID   Add Date
10        10/10/2012
11        14/10/2012
11        18/11/2012
11        25/12/2012
12        1/1/2013
2 - Removed from queue table

TransID   Removed Date
10        15/1/2013
11        12/12/2012
11        13/1/2013
11        20/1/2013
TransID is the key between the two tables, and I can't modify them. What I want is to query the amount of time each transaction spent in the queue.
It's easy when there is one item in each table, but when an item gets queued more than once, how do I calculate that?
Assuming the order in which TransIDs are entered into the Add table is the same order in which they are removed, you can use the following:
WITH OrderedAdds AS
( SELECT TransID,
AddDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY AddDate)
FROM AddTable
), OrderedRemoves AS
( SELECT TransID,
RemovedDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY RemovedDate)
FROM RemoveTable
)
SELECT OrderedAdds.TransID,
OrderedAdds.AddDate,
OrderedRemoves.RemovedDate,
[DaysInQueue] = DATEDIFF(DAY, OrderedAdds.AddDate, ISNULL(OrderedRemoves.RemovedDate, CURRENT_TIMESTAMP))
FROM OrderedAdds
LEFT JOIN OrderedRemoves
ON OrderedAdds.TransID = OrderedRemoves.TransID
AND OrderedAdds.RowNumber = OrderedRemoves.RowNumber;
The key part is that each record gets a row number based on the transaction ID and the date it was entered; you can then join on both the row number and TransID to stop any cross joining.
Example on SQL Fiddle
DISCLAIMER: There is probably a problem with this, but I hope to send you in one possible direction. Make sure to expect problems.
You can try the following direction (which might work in some way depending on your system, version, etc.):
SELECT transId, (SUM(remove_date_sum) - SUM(add_date_sum)) / (60*60*24) AS daysInQueue
FROM
(
    SELECT transId, SUM(UNIX_TIMESTAMP(add_date)) as add_date_sum, 0 as remove_date_sum
    FROM add_to_queue
    GROUP BY transId
    UNION ALL
    SELECT transId, 0 as add_date_sum, SUM(UNIX_TIMESTAMP(remove_date)) as remove_date_sum
    FROM remove_from_queue
    GROUP BY transId
) t
GROUP BY transId;
A bit of explanation: as far as I know, you cannot sum dates, but you can convert them to some sort of timestamp. Check if UNIX_TIMESTAMP works for you, or figure out something else. Then you can sum in each table, create a union by conveniently leaving the other column as zero, and subtract one sum from the other in the outer query.
As for the division at the end of the first SELECT: UNIX_TIMESTAMP yields seconds, so you divide by 60*60*24 to get days, or whatever unit it is that you want.
This all said, I would probably solve this using a stored procedure or some client script. SQL is not a weapon for every battle; making two separate queries can be much simpler.
Answer 2, after your comments. (As a side note, some of your dates, such as 15/1/2013 and 13/1/2013, are not in a proper date format.)
select transId, sum(numberOfDays) totalQueueTime
from (
    select a.transId,
           datediff(day, a.addDate, isnull(r.removeDate, a.addDate)) numberOfDays
    from AddTable a
    left join RemoveTable r on a.transId = r.transId
) X
group by transId
Answer 1: before your comments
This assumes that a new record won't be added for a TransID unless the previous one has been removed. Also note the following query will return numberOfDays as zero for unremoved records:
select a.transId, a.addDate, r.removeDate,
datediff(day,a.addDate,isnull(r.removeDate,a.addDate)) numberOfDays
from AddTable a left join RemoveTable r on a.transId = r.transId
order by a.transId, a.addDate, r.removeDate