Optimize SQL Script: getting range value from another table - sql

I believe my script is correct, but it is not efficient: it takes so long to run that when I run it at work, the session is aborted before it finishes.
I have basically 2 tables
Table A - contains every transaction a person makes
Person's_ID Transaction TransactionDate
---------------------------------------
123 A 01/01/2017
345 B 04/06/2015
678 C 13/07/2015
123 F 28/10/2016
Table B - contains person's ID and GraduationDate
What I want to do is check if a person is active.
Active = the person made at least one transaction in the month before their GraduationDate
The run time is too long because there are millions of persons, each with multiple transactions, and every transaction is recorded as a separate row in Table A.
SELECT
    PERSON_ID
FROM
    (SELECT PERSON_ID, TRANSACTIONDATE FROM TABLE_A) A
LEFT JOIN
    (SELECT CIN AS PERSON_ID, GRAD_DATE FROM TABLE_B) B
    ON A.PERSON_ID = B.PERSON_ID
   AND TRANSACTIONDATE <= GRAD_DATE
WHERE TRANSACTIONDATE BETWEEN GRAD_DATE - INTERVAL '30' DAY AND GRAD_DATE;
*Table A and B are products of joined tables hence they are subqueried.

If you just want active customers, I would try exists:
SELECT PERSON_ID
FROM TABLE_A A
WHERE EXISTS (SELECT 1
              FROM TABLE_B B
              WHERE A.PERSON_ID = B.PERSON_ID
                AND A.TRANSACTIONDATE BETWEEN B.GRAD_DATE - INTERVAL '30' DAY AND B.GRAD_DATE
             );
The performance, though, is likely to be similar to your query. If the tables were really tables, I would suggest indexes. In reality, you will probably need to understand the views (so you can create better indexes) or perhaps use temporary tables.
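As a sanity check, here is a minimal sketch of the EXISTS pattern using SQLite via Python. SQLite has no INTERVAL syntax, so date(grad_date, '-30 days') stands in for it; the tiny sample data and ISO date format are my own assumptions, and DISTINCT is added so each qualifying person appears only once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (person_id INT, transactiondate TEXT);
CREATE TABLE table_b (person_id INT, grad_date TEXT);
-- invented sample data: person 123 has one transaction inside the 30-day window
INSERT INTO table_a VALUES (123, '2016-12-15'), (123, '2016-10-28'), (345, '2015-06-04');
INSERT INTO table_b VALUES (123, '2017-01-01'), (345, '2016-01-01');
""")

rows = conn.execute("""
SELECT DISTINCT a.person_id
FROM table_a a
WHERE EXISTS (SELECT 1
              FROM table_b b
              WHERE b.person_id = a.person_id
                AND a.transactiondate BETWEEN date(b.grad_date, '-30 days')
                                          AND b.grad_date)
""").fetchall()
print(rows)  # only person 123 transacted within 30 days of graduation
```

ISO-formatted date strings compare correctly as text, which is why the BETWEEN works without any date type.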

A non-equi-join might be quite inefficient (no matter if it's coded as join or a Not Exists), but the logic can be rewritten to:
SELECT
    PERSON_ID
FROM
 ( -- combine both Selects
   SELECT 0 AS flag, -- indicating source table
          PERSON_ID, TRANSACTIONDATE AS dt
   FROM TABLE_A
   UNION ALL
   SELECT 1 AS flag,
          PERSON_ID, GRAD_DATE
   FROM TABLE_B
 ) A
QUALIFY
    flag = 1 -- only return a row from table B
    AND Min(dt) -- if the previous row (from table A) is within 30 days
        Over (PARTITION BY PERSON_ID
              ORDER BY dt, flag
              ROWS BETWEEN 1 Preceding AND 1 Preceding) >= dt - 30
This assumes that there's only one row from table A per person; otherwise the MIN has to be changed to:
AND Max(CASE WHEN flag = 0 THEN dt END) -- latest table A row, if within 30 days
    Over (PARTITION BY PERSON_ID
          ORDER BY dt, flag
          ROWS UNBOUNDED Preceding) >= dt - 30
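QUALIFY is Teradata-specific; in engines without it, the same single-scan logic can be wrapped in a derived table with an ordinary WHERE. A minimal SQLite sketch of the UNION ALL plus running-window idea (the sample data, ISO dates, and date(dt, '-30 days') arithmetic are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (person_id INT, transactiondate TEXT);
CREATE TABLE table_b (person_id INT, grad_date TEXT);
INSERT INTO table_a VALUES (123, '2016-12-15'), (123, '2016-10-28'), (345, '2015-06-04');
INSERT INTO table_b VALUES (123, '2017-01-01'), (345, '2016-01-01');
""")

rows = conn.execute("""
SELECT person_id
FROM (
    SELECT flag, person_id, dt,
           -- latest table-A date at or before this row
           MAX(CASE WHEN flag = 0 THEN dt END)
               OVER (PARTITION BY person_id
                     ORDER BY dt, flag
                     ROWS UNBOUNDED PRECEDING) AS last_txn
    FROM (
        SELECT 0 AS flag, person_id, transactiondate AS dt FROM table_a
        UNION ALL
        SELECT 1 AS flag, person_id, grad_date FROM table_b
    )
)
WHERE flag = 1 AND last_txn >= date(dt, '-30 days')
""").fetchall()
print(rows)  # person 123's latest transaction falls within the 30-day window
```

Both tables are scanned exactly once; the non-equi-join is replaced by a single ordered window pass per person.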

Retrieve records from versioned table

This SQL case has been troubling me for a while, and I wanted to ask here what other folks think.
I have a user table whose rows own vehicles, but the same vehicle may be owned by multiple users over time; a column called effective_date tells from what day the ownership is effective. Two drivers don't own the same vehicle at once, but records are versioned, meaning we can check who owned a vehicle 2 years ago, or 5 years ago, using the effective date.
Table has following columns,
id, version, name, vehicle_id, effective_date. Every change to this table is versioned
Now there is another table called accidents which tells what accident with vehicle and when, not versioned
it has id, description, vehicle_id, acc_date
Now I am trying to select all accidents and who caused each one. An inner join doesn't work here. What I do is select all rows from the accident table and run a subquery for each row to find the user's id and version responsible for the cause. This is super slow, and I am looking for a more performant way of organizing the data or constructing the query. Right now it runs a subquery for every row it selects from the accident table, because each row has a different accident date. I am OK doing a few queries if there is an easy way of doing it within a single query.
Example
user table
id  version  name  vehicle_id  effective_date
1   1        A     1           01/10/2021
1   2        A     2           02/10/2021
2   1        B     1           03/10/2021
2   2        B     2           04/10/2021
accident:
id  description  vehicle_id  acc_date
1   hit1         1           03/5/2021
2   hit2         1           03/15/2021
Result:
user_id  user_version  acc_id  vehicle_id  acc_date
1        1             1       1           03/5/2021
2        1             2       1           03/15/2021
thanks for your help
To get the latest user at the time of the accident you can use ROW_NUMBER() per accident, sorting by descending effective_date and keeping only users effective on or before the accident date. With this ordering the first user listed for each accident is the responsible one.
For example:
select *
from (
    select *,
           row_number() over(partition by a.id
                             order by u.effective_date desc) as rn
    from user u
    join accident a on a.vehicle_id = u.vehicle_id
    where u.effective_date <= a.acc_date
) x
where rn = 1
Select user_id, user_version, acc_id, vehicle_id, acc_date
from (
    Select row_number() over
               (Partition by b.id
                Order by a.effective_date desc) sn,
           a.id as user_id,
           a.version as user_version,
           b.id as acc_id,
           a.vehicle_id,
           b.acc_date
    from user a
    Inner Join Accident b
        on a.vehicle_id = b.vehicle_id
       and a.effective_date <= b.acc_date
) a
where sn = 1
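A minimal SQLite sketch of the per-accident ROW_NUMBER() approach (partitioning by the accident id so each accident gets its own latest effective owner), reproducing the Result table above. The table is named users here to avoid the reserved word user in some engines, and dates are normalized to ISO format:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INT, version INT, name TEXT, vehicle_id INT, effective_date TEXT);
CREATE TABLE accident (id INT, description TEXT, vehicle_id INT, acc_date TEXT);
INSERT INTO users VALUES (1,1,'A',1,'2021-01-10'), (1,2,'A',2,'2021-02-10'),
                         (2,1,'B',1,'2021-03-10'), (2,2,'B',2,'2021-04-10');
INSERT INTO accident VALUES (1,'hit1',1,'2021-03-05'), (2,'hit2',1,'2021-03-15');
""")

rows = conn.execute("""
SELECT user_id, user_version, acc_id, vehicle_id, acc_date
FROM (SELECT u.id AS user_id, u.version AS user_version,
             a.id AS acc_id, a.vehicle_id, a.acc_date,
             ROW_NUMBER() OVER (PARTITION BY a.id
                                ORDER BY u.effective_date DESC) AS rn
      FROM users u
      JOIN accident a ON a.vehicle_id = u.vehicle_id
                     AND u.effective_date <= a.acc_date)
WHERE rn = 1
ORDER BY acc_id
""").fetchall()
print(rows)  # accident 1 -> user 1 v1, accident 2 -> user 2 v1
```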

Retrieve last record in a group based on string - DB2

I have a table with transactional data in a DB2 database, from which I want to retrieve the last record per location and product. The date is unfortunately stored as a YYYYMMDD string. There is no transaction id or similar field I can key in on, and no primary key.
DATE      LOCATION  PRODUCT  QTY
20210105  A         P1       4
20210106  A         P1       3
20210112  A         P1       7
20210104  B         P1       3
20210105  B         P1       1
20210103  A         P2       6
20210105  A         P2       5
I want to retrieve results showing the last transaction per location, per product, so the results should be:
DATE      LOCATION  PRODUCT  QTY
20210112  A         P1       7
20210105  B         P1       1
20210105  A         P2       5
I've looked at answers to similar questions but for some reason can't make the jump from an answer that addresses a similar question to code that works in my environment.
Edit: I've tried the code below, taken from an answer to this question. It returns multiple rows for a single location/product combination. I've tried the other answers in that question too, but have not had luck getting them to execute.
SELECT *
FROM t
WHERE DATE > '20210401'
  AND DATE IN (SELECT max(DATE) FROM t GROUP BY LOCATION)
ORDER BY PRODUCT DESC
Thank you!
You can use ROW_NUMBER(). For example, if your table is called t you can do:
select *
from (
    select *,
           row_number() over(partition by location, product
                             order by date desc) as rn
    from t
) x
where rn = 1
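A quick SQLite check of the ROW_NUMBER() answer against the sample data above (the YYYYMMDD strings sort correctly as text, so ORDER BY date DESC works as-is):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (date TEXT, location TEXT, product TEXT, qty INT);
INSERT INTO t VALUES
  ('20210105','A','P1',4), ('20210106','A','P1',3), ('20210112','A','P1',7),
  ('20210104','B','P1',3), ('20210105','B','P1',1),
  ('20210103','A','P2',6), ('20210105','A','P2',5);
""")

rows = conn.execute("""
SELECT date, location, product, qty
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY location, product
                                   ORDER BY date DESC) AS rn
      FROM t)
WHERE rn = 1
ORDER BY location, product
""").fetchall()
print(rows)  # one row per (location, product): the latest date wins
```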
You can use lead() to get the last row before a change:
select t.*
from (select t.*,
             lead(date) over (partition by location, product order by date) as next_lp_date,
             lead(date) over (order by date) as next_date
      from t
     ) t
where next_lp_date is null or next_lp_date <> next_date
It looks like you just needed to match your keys within the subselect.
SELECT *
FROM t T1
WHERE DATE > '20210401'
  AND DATE IN (SELECT max(DATE)
               FROM t T2
               WHERE T2.Location = T1.Location AND T2.Product = T1.Product)

Postgres/SQL subquery - return multiples columns per grouping based on condition

Struggling with this subquery - it should be basic, but I'm missing something. I need to make these available as part of a larger query.
I have customers, and I want to get the ONE transaction with the HIGHEST timestamp.
Customer
customer  foo
1         val1
2         val2
Transaction
tx_key  customer  timestamp  value
1       1         11/22      10
2       1         11/23      15
3       2         11/24      20
4       2         11/25      25
The desired output of the query:
customer  foo   timestamp  value
1         val1  11/23      15
2         val2  11/25      25
I successfully wrote a subquery to calculate what I needed by using multiple sub queries, but it is very slow when I have a larger data set.
I did it like this:
(select timestamp from transaction where transaction.customer = customer.customer order by timestamp desc limit 1) as tx_timestamp,
(select value from transaction where transaction.customer = customer.customer order by timestamp desc limit 1) as tx_value
So how do I reduce this down to only calculating it once? In my real data set, I have 15 columns joined over 100k rows, so doing this over and over is not performant enough.
In Postgres, the simplest method is distinct on:
select distinct on (customer) c.*, t.timestamp, t.value
from transaction t join
     customer c
     using (customer)
order by customer, t.timestamp desc;
Try this query please:
SELECT
    C.customer, C.foo, T.timestamp, T.value
FROM Transaction T
JOIN Customer C
  ON C.customer = T.customer
JOIN
    (SELECT customer, max(timestamp) as timestamp
     FROM Transaction
     GROUP BY customer) MT
  ON T.customer = MT.customer
 AND T.timestamp = MT.timestamp
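SQLite has no DISTINCT ON, but the GROUP BY/MAX join above runs almost anywhere. A small sketch (transaction and timestamp are renamed txn and ts here, since both are reserved or special words in several engines; sample data follows the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer INT, foo TEXT);
CREATE TABLE txn (tx_key INT, customer INT, ts TEXT, value INT);
INSERT INTO customer VALUES (1,'val1'), (2,'val2');
INSERT INTO txn VALUES (1,1,'2021-11-22',10), (2,1,'2021-11-23',15),
                       (3,2,'2021-11-24',20), (4,2,'2021-11-25',25);
""")

rows = conn.execute("""
SELECT c.customer, c.foo, t.ts, t.value
FROM customer c
JOIN txn t ON t.customer = c.customer
JOIN (SELECT customer, MAX(ts) AS ts
      FROM txn GROUP BY customer) mt
  ON mt.customer = t.customer AND mt.ts = t.ts
ORDER BY c.customer
""").fetchall()
print(rows)  # one row per customer: the transaction with the highest timestamp
```

Every per-customer column is computed in a single pass instead of one correlated subquery per column.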

SQL count of 90 day gaps between records

Say I have a Payment table. I need to know the number of times the gap between payments is greater than 90 days, grouped by PersonID. Payment frequency varies and there is no expected number of payments; there could be 0 or hundreds of payments in a 90-day period. If there was no payment for a year, that would count as 1. If there was a payment every month, the count would be 0. If there were 4 payments the first month, then a 90-day gap, then 2 more payments, then another 90-day gap, the count would be 2.
CREATE TABLE Payments
(
ID int PRIMARY KEY,
PersonID int FOREIGN KEY REFERENCES Persons(ID),
CreateDate datetime
)
If you have SQL Server 2012 or later, you can use the LAG or LEAD function to peek at other rows, making this easy:
Select PersonId, Sum(InfrequentPayment) InfrequentPayments
from
(
    select PersonId
         , case
             when dateadd(day,#period,paymentdate) < coalesce(lead(PaymentDate) over (partition by personid order by PaymentDate),getutcdate())
             then 1
             else 0
           end InfrequentPayment
    from #Payment
) x
Group by PersonId
Demo: http://sqlfiddle.com/#!6/9eecb7d/491
Explanation:
The outer SQL is fairly trivial; we take the results of the inner SQL, group by PersonId, and sum the number of payments judged infrequent.
The inner SQL is also simple; we're selecting every record, making a note of the person and whether that payment (or rather the delay after that payment) was judged infrequent.
The case statement determines what constitutes an infrequent payment.
Here we say that if the record's paymentdate plus 90 days is still earlier than the next payment (or current date if it's the last payment, so there's no next payment) then it's infrequent (1); otherwise it's not (0).
The coalesce is simply there to handle the last record for a person; i.e. so that if there is no next payment the current date is used (thus capturing anyone who's last payment was over 90 days before today).
Now for the "clever" bit: lead(PaymentDate) over (partition by personid order by PaymentDate).
LEAD is a new SQL function which lets you look at the record after the current one (LAG is to see the previous record).
If you're familiar with row_number() or rank() you may already understand what's going on here.
To determine the record after the current one we don't look at the current query though; rather we specify an order by clause just for this function; that's what's in the brackets after the over keyword.
We also want to only compare each person's payment dates with other payments made by them; not by any customer. To achieve that we use the partition by clause.
I hope that makes sense / meets your requirement. Please say if anything's unclear and I'll try to improve my explanation.
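A compact SQLite illustration of the LEAD() gap count. The COALESCE-to-current-date check for a trailing gap is omitted here so the output stays deterministic, and the sample data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (person_id INT, payment_date TEXT);
INSERT INTO payments VALUES
  (1,'2021-01-01'), (1,'2021-01-15'), (1,'2021-06-01'),  -- one 90+ day gap
  (2,'2021-01-01'), (2,'2021-02-01');                    -- no 90+ day gap
""")

rows = conn.execute("""
SELECT person_id, SUM(infrequent) AS infrequent_payments
FROM (SELECT person_id,
             -- 1 when this payment plus 90 days is still before the next payment
             CASE WHEN date(payment_date, '+90 days') <
                       LEAD(payment_date) OVER (PARTITION BY person_id
                                                ORDER BY payment_date)
                  THEN 1 ELSE 0 END AS infrequent
      FROM payments)
GROUP BY person_id
ORDER BY person_id
""").fetchall()
print(rows)  # person 1 has one 90+ day gap, person 2 has none
```

The last payment per person compares against a NULL LEAD(), which the CASE treats as not infrequent.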
EDIT
For older versions of SQL, the same effect can be achieved by use of ROW_NUMBER and a LEFT OUTER JOIN; i.e.
;with cte (PersonId, PaymentDate, SequenceNo) as
(
select PersonId
, PaymentDate
, ROW_NUMBER() over (partition by PersonId order by PaymentDate)
from #Payment
)
select a.PersonId
, sum(case when dateadd(day,#period,a.paymentdate) < coalesce(b.paymentdate,getutcdate()) then 1 else 0 end) InfrequentPayments
from cte a
left outer join cte b
on b.PersonId = a.PersonId
and b.SequenceNo = a.SequenceNo + 1
Group by a.PersonId
Another method which should work on most databases (though less efficient)
select PersonId
, sum(InfrequentPayment) InfrequentPayments
from
(
select PersonId
, case when dateadd(day,#period,paymentdate) < coalesce((
select min(PaymentDate)
from #Payment b
where b.personid = a.personid
and b.paymentdate > a.paymentdate
),getutcdate()) then 1 else 0 end InfrequentPayment
from #Payment a
) x
Group by PersonId
A generic query for this problem, given a timestamp column, would be something like this:
SELECT p1.personID, COUNT(*)
FROM payments p1
JOIN payments p2
  ON p1.timestamp < p2.timestamp
 AND p1.personID = p2.personID
 AND NOT EXISTS ( -- exclude combinations of p1 and p2 where a payment p exists between them
       SELECT * FROM payments p
       WHERE p.personID = p1.personID
         AND p.timestamp > p1.timestamp
         AND p.timestamp < p2.timestamp)
WHERE DATEDIFF(p2.timestamp, p1.timestamp) >= 90
GROUP BY p1.personID

A query calls two instances of the same tables joined to compare fields, gives mirrored results. How do I eliminate mirrored duplicates?

This is a simpler version of the query I have.
with Alias1 as
 (select distinct ID, file_tag, status, creation_date
  from tables
  where creation_dt >= sysdate and creation_dt <= sysdate + 1),
Alias2 as
 (select distinct ID, file_tag, status, creation_date
  from the same tables
  where creation_dt >= sysdate and creation_dt <= sysdate + 1)
select distinct Alias1.ID ID_1,
       Alias2.ID ID_2,
       Alias1.file_tag,
       Alias1.creation_date in_dt1,
       Alias2.creation_date in_dt2
from Alias1, Alias2
where Alias1.file_tag = Alias2.file_tag
  and Alias1.ID != Alias2.ID
order by Alias1.creation_date desc
This is an example of the results. Both of these are the same, though their values are flipped.
ID_1 ID_2 File_Tag in_dt1 in_dt2
70 66 Apples 6/25/2012 3:06 6/25/2012 2:53:47 PM
66 70 Apples 6/25/2012 2:53 6/25/2012 3:06:18 PM
The goal of the query is to find more than one ID with a matching file tag and do stuff to the one submitted earlier in the day (the query runs daily and only needs duplicates from that given day). I am still relatively new to SQL/Oracle and wonder if there's a better way to approach this problem.
SELECT *
FROM (SELECT id, file_tag, creation_date in_dt
, row_number() OVER (PARTITION BY file_tag
ORDER BY creation_date) rn
, count(*) OVER (PARTITION BY file_tag) ct
FROM tables
WHERE creation_date >= TRUNC(SYSDATE)) tbls
WHERE rn = 1
AND ct > 1;
This should get you the first (earliest) row within each file_tag having at least 2 records today.
The inner select calculates the relative row numbers of each set of identical file_tag records by creation date. The outer select retrieves the first one in each partition.
This assumes from your goal statement that you want to do something with the earliest single row for each file_tag. The inner query only returns rows with a creation_date of sometime on the current day.
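A small SQLite check of this earliest-duplicate query (sample data loosely based on the Apples example above; the sysdate filter is dropped so the output is deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INT, file_tag TEXT, creation_date TEXT);
INSERT INTO files VALUES (70,'Apples','2012-06-25 15:06'),
                         (66,'Apples','2012-06-25 14:53'),
                         (80,'Pears','2012-06-25 10:00');
""")

rows = conn.execute("""
SELECT id, file_tag, creation_date
FROM (SELECT id, file_tag, creation_date,
             ROW_NUMBER() OVER (PARTITION BY file_tag
                                ORDER BY creation_date) AS rn,
             COUNT(*) OVER (PARTITION BY file_tag) AS ct
      FROM files)
WHERE rn = 1 AND ct > 1
""").fetchall()
print(rows)  # only the earliest 'Apples' row; 'Pears' has no duplicate
```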
Here is an easy way, just by changing your comparison operation:
select distinct Alias1.ID ID_1, Alias2.ID ID_2, Alias1.file_tag,
Alias1.creation_date in_dt1, Alias2.creation_date in_dt2
from Alias1 join
Alias2
on Alias1.file_tag = Alias2.file_tag and
Alias1.ID < Alias2.ID
order by Alias1.creation_dt desc
Replacing the not-equals with less-than orders the two IDs so the smaller one is always first, which eliminates the mirrored duplicates. Note: I also fixed the join syntax.
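And a matching SQLite check of the less-than trick (same invented sample data as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INT, file_tag TEXT, creation_date TEXT);
INSERT INTO files VALUES (70,'Apples','2012-06-25 15:06'),
                         (66,'Apples','2012-06-25 14:53'),
                         (80,'Pears','2012-06-25 10:00');
""")

rows = conn.execute("""
SELECT a.id, b.id, a.file_tag
FROM files a
JOIN files b ON a.file_tag = b.file_tag
            AND a.id < b.id
""").fetchall()
print(rows)  # one row per matching pair instead of two mirrored rows
```

With a.id != b.id the same pair would appear twice, once in each order; a.id < b.id keeps exactly one of the two.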