Max(date) in inner query - sql

I was given sample SQL which does not seem to do what I need.
Big table has 4 million rows and small table has 600 thousand rows.
/* Sample code: (I was given this sample by a senior analyst) */
SELECT SUM(BigTable.VALUE)
FROM BigTable INNER JOIN SmallTable
WHERE BigTable.ID = SmallTable.ID
AND BigTable.VALUATION_DATE IN
(SELECT MAX(VALUATION_DATE)
FROM BigTable)
GROUP BY BigTable.ID
/* My code: (I placed a WHERE in the inner query) */
SELECT BigTable.ID, SUM(BigTable.VALUE)
FROM BigTable INNER JOIN SmallTable
WHERE BigTable.ID = SmallTable.ID
AND BigTable.VALUATION_DATE IN
(SELECT MAX(VALUATION_DATE)
FROM BigTable INNER JOIN SmallTable
WHERE BigTable.ID = SmallTable.ID)
GROUP BY BigTable.ID
If ID xyz has three accounts with values $1, $2, $3 respectively on the most recent date, I want to return the sum of all accounts on that date: xyz, $6

I believe the INNER JOIN syntax you are using is incorrect. After naming the table to be joined, you need an ON clause stating which columns the tables are joined on.
The following query uses the correct syntax (although it may not be the right logic for your implementation).
SELECT BigTable.ID, SUM(BigTable.VALUE)
FROM BigTable INNER JOIN SmallTable
ON BigTable.ID = SmallTable.ID
WHERE BigTable.VALUATION_DATE IN
(SELECT MAX(VALUATION_DATE)
FROM BigTable INNER JOIN SmallTable
ON BigTable.ID = SmallTable.ID)
GROUP BY BigTable.ID
Only cross joins and natural joins are written without the ON keyword, with any filtering done in the WHERE clause.
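To check the ON-based query against the example from the question (ID xyz with accounts worth $1, $2 and $3 on the most recent date), here is a minimal sketch using Python's built-in sqlite3 module. The rows are made up for illustration, and SQLite only stands in for whatever database you are actually running.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BigTable (ID TEXT, VALUATION_DATE TEXT, VALUE INTEGER);
CREATE TABLE SmallTable (ID TEXT);
INSERT INTO SmallTable VALUES ('xyz');
-- Hypothetical sample rows: three accounts for xyz on the latest date.
INSERT INTO BigTable VALUES
  ('xyz', '2020-01-31', 10),
  ('xyz', '2020-02-29', 1),
  ('xyz', '2020-02-29', 2),
  ('xyz', '2020-02-29', 3),
  ('abc', '2020-02-29', 99);  -- not in SmallTable, so the join drops it
""")

query = """
SELECT BigTable.ID, SUM(BigTable.VALUE)
FROM BigTable INNER JOIN SmallTable
  ON BigTable.ID = SmallTable.ID
WHERE BigTable.VALUATION_DATE IN
  (SELECT MAX(VALUATION_DATE)
   FROM BigTable INNER JOIN SmallTable
     ON BigTable.ID = SmallTable.ID)
GROUP BY BigTable.ID
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('xyz', 6)]
```

Note that the subquery picks one latest date across all joined rows, so an ID whose own latest date is older than that would return nothing.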

You should put the join condition in the ON clause rather than in the WHERE clause:
SELECT SUM(BigTable.VALUE)
FROM BigTable
INNER JOIN SmallTable ON BigTable.ID = SmallTable.ID
AND BigTable.VALUATION_DATE = (
SELECT MAX(VALUATION_DATE)
FROM BigTable)
And you should not use GROUP BY ID.

Use window functions:
SELECT b.ID, b.VALUE
FROM (SELECT b.*,
ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.VALUATION_DATE DESC) as seqnum
FROM BigTable b
) b JOIN
SmallTable s
ON b.ID = s.ID
WHERE b.seqnum = 1;
I don't think aggregation is necessary. But, if you have multiple values on the same date for the same id, then:
SELECT b.ID, SUM(b.VALUE)
FROM (SELECT b.*,
RANK() OVER (PARTITION BY b.id ORDER BY b.VALUATION_DATE DESC) as seqnum
FROM BigTable b
) b JOIN
SmallTable s
ON b.ID = s.ID
WHERE b.seqnum = 1
GROUP BY b.id;
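The difference between the two window functions shows up when several rows share the latest date. The sketch below, with made-up data and SQLite (3.25 or later, for window function support) standing in for the real database, runs the same query once with RANK() and once with ROW_NUMBER().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BigTable (ID TEXT, VALUATION_DATE TEXT, VALUE INTEGER);
CREATE TABLE SmallTable (ID TEXT);
INSERT INTO SmallTable VALUES ('xyz'), ('pqr');
-- Hypothetical rows: xyz has three accounts on its latest date,
-- and pqr's latest date is older than xyz's.
INSERT INTO BigTable VALUES
  ('xyz', '2020-02-29', 1),
  ('xyz', '2020-02-29', 2),
  ('xyz', '2020-02-29', 3),
  ('xyz', '2020-01-31', 10),
  ('pqr', '2020-01-31', 7),
  ('pqr', '2019-12-31', 5);
""")

def latest_sum(window_fn):
    """Sum VALUE over the rows numbered 1 by the given window function."""
    sql = f"""
    SELECT b.ID, SUM(b.VALUE)
    FROM (SELECT b.*,
                 {window_fn}() OVER (PARTITION BY b.ID
                                     ORDER BY b.VALUATION_DATE DESC) AS seqnum
          FROM BigTable b
         ) b JOIN
         SmallTable s
         ON b.ID = s.ID
    WHERE b.seqnum = 1
    GROUP BY b.ID
    ORDER BY b.ID
    """
    return conn.execute(sql).fetchall()

with_rank = latest_sum("RANK")              # ties all get rank 1
with_row_number = latest_sum("ROW_NUMBER")  # exactly one row per ID
print(with_rank)
print(with_row_number)
```

With RANK() all three xyz rows tie at 1 and are summed to 6, while ROW_NUMBER() keeps only one of them (which one is not defined). Each ID is also evaluated against its own latest date here, unlike the single global MAX in the earlier answers.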

Related

Removing duplicate values from a column in SQL

I have two tables A (group_id, id, subject) and B (id, date). Below is the result of joining tables A and B on id. I have tried using distinct and partition to remove the duplicates in the group_id field only, but no luck:
My code:
select
a.group_id, a.id, a.subject, b.date
from
A a
inner join
(select
b.*,
row_number() over (partition by group_id order by date asc) as seqnum
from
B b) b on a.id = b.id and seqnum = 1
order by
date desc;
I got this error when I ran the code:
Partitioning can not be used stand-alone in query near 'partition by group_id order by date asc) as seqnum from B' at line 1
This is my expected result:
Thank you in advance!
It looks like you want the earliest date for each row in the table you show. Your question mentions two tables, but you only show one.
I recommend a correlated subquery in most databases:
select b.*
from b
where b.date = (select min(b2.date)
from b b2
where b2.group_id = b.group_id
);
I see. You need to join first and then use row_number():
select ab.*
from (select a.group_id, a.id, a.subject, b.date,
row_number() over (partition by a.group_id order by b.date) as seqnum
from A a join
B b
on a.id = b.id
) ab
where seqnum = 1
order by date desc;
You are almost there. But the column that you try to use to partition (ie group_id) comes from table a, which is not available in the subquery.
You would need to JOIN and assign the row number in a subquery, and then filter in the outer query.
select *
from (
select
a.group_id,
a.id,
a.subject,
b.date,
row_number() over (partition by a.group_id order by b.date asc) as seqnum
from a
inner join b on a.id = b.id
) t
where seqnum = 1
ORDER BY date desc;
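As a quick check of the join-then-number approach, here is a small sketch with invented rows for A and B, run through Python's sqlite3 module (SQLite 3.25 or later is assumed for ROW_NUMBER()). The table and column names follow the question; the data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (group_id INTEGER, id INTEGER, subject TEXT);
CREATE TABLE B (id INTEGER, date TEXT);
-- Hypothetical rows: group 1 has two ids, group 2 has one.
INSERT INTO A VALUES (1, 10, 'math'), (1, 11, 'art'), (2, 12, 'music');
INSERT INTO B VALUES (10, '2021-03-01'), (11, '2021-01-15'), (12, '2021-02-01');
""")

rows = conn.execute("""
SELECT group_id, id, subject, date
FROM (
    -- group_id comes from A, so the join has to happen before numbering
    SELECT a.group_id, a.id, a.subject, b.date,
           ROW_NUMBER() OVER (PARTITION BY a.group_id
                              ORDER BY b.date ASC) AS seqnum
    FROM A a
    INNER JOIN B b ON a.id = b.id
) ab
WHERE seqnum = 1
ORDER BY date DESC
""").fetchall()
print(rows)  # [(2, 12, 'music', '2021-02-01'), (1, 11, 'art', '2021-01-15')]
```

Each group_id appears once, carrying the id and subject of its earliest date.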
Another way to achieve your goal, though it may not be the most efficient one:
SELECT
A.group_id, A.id, B.Date, A.subject
FROM A
INNER JOIN B
ON A.Id = B.Id
INNER JOIN
(
SELECT
A.Group_id, MIN(B.Date) AS Date
FROM A
INNER JOIN B
ON A.Id = B.Id
GROUP BY A.group_id
) AS supportTable
ON A.group_id = supportTable.group_id
AND B.Date = supportTable.Date

Need assistance in rewriting this query

We have this query in production, which runs daily.
It does a lot of joins and also uses window functions in Hive.
We tried adding a few SET options, but that did not help much.
The structure is something like this:
SELECT
C.f1, C.f2, A.f2 ...
FROM (
SELECT * FROM (
SELECT T1.*, B.atid, B.a_id,
ROW_NUMBER() OVER (PARTITION BY T1.wtid, B.atid ORDER BY T1.b_ts DESC) AS RANK_
FROM T1 AS T1
JOIN T5 ON T1.t_dt = T5.t_dt
JOIN T2 B ON T1.wtid = B.wtid and T1.b_ts = B.b_ts
LEFT OUTER JOIN (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV
ON T1.TYP = PV.p_cd
WHERE T1.state not in ("INVALID")
AND T1.evt_name NOT IN ('INACTIVE','DORMANT')
AND ISNULL(PV.p_cd)
) T
WHERE T.rank_ = 1
) A
JOIN (SELECT *, row_number() over (partition by ac_id order by b_ts desc) rank_
FROM T4
WHERE event not in ('CT','UPD')
) AS C
ON A.a_id = C.a_id
AND A.atid = C.ac_id
AND C.rank_ = 1
JOIN T6 ON C.t_dt = T6.t_dt
As I cannot drop any of the tables (or joins), my approach was to replace the window function with another join using the aggregate function max, but I was not able to rewrite it.
Also, I am not sure that this would actually improve performance, so any guidance will help us.
Analytic functions usually perform better than joins with select max, because with an analytic function you read the table only once, and the row_number calculation is parallelized by the PARTITION BY key.
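For comparison, this is roughly what the max-based rewrite from the question would look like for the T4 part of the query. It is only a sketch: the rows are invented, SQLite (3.25+) stands in for Hive, and only the columns visible in the question (ac_id, a_id, b_ts, event) are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T4 (ac_id INTEGER, a_id INTEGER, b_ts TEXT, event TEXT);
-- Hypothetical rows; the 'UPD' row is the newest for ac_id 1 but is filtered out.
INSERT INTO T4 VALUES
  (1, 100, '2021-01-01', 'NEW'),
  (1, 100, '2021-01-05', 'PAY'),
  (1, 100, '2021-01-09', 'UPD'),
  (2, 200, '2021-01-03', 'NEW');
""")

# Window-function form: T4 is read once.
window_sql = """
SELECT ac_id, a_id, b_ts, event
FROM (SELECT T4.*,
             ROW_NUMBER() OVER (PARTITION BY ac_id ORDER BY b_ts DESC) AS rank_
      FROM T4
      WHERE event NOT IN ('CT', 'UPD')) C
WHERE rank_ = 1
ORDER BY ac_id
"""

# MAX-and-join form: T4 is read twice (once to aggregate, once to join back).
max_join_sql = """
SELECT C.ac_id, C.a_id, C.b_ts, C.event
FROM T4 C
JOIN (SELECT ac_id, MAX(b_ts) AS max_ts
      FROM T4
      WHERE event NOT IN ('CT', 'UPD')
      GROUP BY ac_id) M
  ON C.ac_id = M.ac_id AND C.b_ts = M.max_ts
WHERE C.event NOT IN ('CT', 'UPD')
ORDER BY C.ac_id
"""

window_rows = conn.execute(window_sql).fetchall()
max_join_rows = conn.execute(max_join_sql).fetchall()
print(window_rows)
print(max_join_rows)
```

Both forms return the same rows on this data, but the join form scans T4 twice and would return several rows per ac_id if two rows shared the maximum b_ts.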
Try to regroup joins and filtering.
Join
LEFT OUTER JOIN (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV
ON T1.TYP = PV.p_cd
together with the WHERE condition ISNULL(PV.p_cd), filters some rows out of T1.
These conditions do the same:
WHERE T1.state not in ("INVALID")
AND T1.evt_name NOT IN ('INACTIVE','DORMANT')
Move this join into the subquery. If it filters a lot, this may help to reduce the dataset in T1 before all the other joins and row_number():
(select T1.* from T1
left join (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV
ON T1.TYP = PV.p_cd
where T1.state not in ("INVALID")
AND T1.evt_name NOT IN ('INACTIVE','DORMANT')
AND ISNULL(PV.p_cd)
) as T1
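The LEFT JOIN plus null check is an anti-join: it keeps only the T1 rows whose TYP has no matching 'ORIG_CD' code in T3. The sketch below shows the filter on its own with made-up rows. SQLite stands in for Hive here, so the check is written as `IS NULL` rather than Hive's ISNULL() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (wtid INTEGER, TYP TEXT, state TEXT, evt_name TEXT);
CREATE TABLE T3 (p_cd TEXT, PV_TY_CD TEXT);
INSERT INTO T3 VALUES ('X', 'ORIG_CD'), ('Y', 'OTHER');
-- Hypothetical rows, one per filtering case.
INSERT INTO T1 VALUES
  (1, 'X', 'VALID',   'ACTIVE'),   -- dropped: TYP matches an ORIG_CD code
  (2, 'Y', 'VALID',   'ACTIVE'),   -- kept: 'Y' is not an ORIG_CD code
  (3, 'Z', 'INVALID', 'ACTIVE'),   -- dropped: state
  (4, 'Z', 'VALID',   'DORMANT'),  -- dropped: evt_name
  (5, 'Z', 'VALID',   'ACTIVE');   -- kept
""")

kept = conn.execute("""
SELECT T1.wtid
FROM T1
LEFT JOIN (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV
  ON T1.TYP = PV.p_cd
WHERE T1.state NOT IN ('INVALID')
  AND T1.evt_name NOT IN ('INACTIVE', 'DORMANT')
  AND PV.p_cd IS NULL
ORDER BY T1.wtid
""").fetchall()
print(kept)  # [(2,), (5,)]
```

Only two of the five rows survive, which is the kind of reduction that makes it worth applying this filter before the heavier joins.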
Also, the first row_number is calculated only on the T1 and B tables:
PARTITION BY T1.wtid, B.atid ORDER BY T1.b_ts DESC
Consider joining the T5 table after the row_number filter. If this join is heavy and the row_number filter reduces the dataset, wrap the row_number and its filter in a subquery again and join the filtered subquery with T5:
(--filtered by row_number
select * from
(
SELECT T1.*, B.atid, B.a_id,
ROW_NUMBER() OVER (PARTITION BY T1.wtid, B.atid ORDER BY T1.b_ts DESC) AS RANK_
from
(select T1.* from T1
left join (SELECT p_cd FROM T3 WHERE PV_TY_CD = 'ORIG_CD') PV
ON T1.TYP = PV.p_cd
where T1.state not in ("INVALID")
AND T1.evt_name NOT IN ('INACTIVE','DORMANT')
AND ISNULL(PV.p_cd)
) as T1 JOIN T2 B ON T1.wtid = B.wtid and T1.b_ts = B.b_ts
) T WHERE T.rank_ = 1
) T --filtered
JOIN T5 ON T.t_dt = T5.t_dt
This may help depending on your data.
Read also: https://stackoverflow.com/a/51061613/2700344

How to create a temp table in PostgreSQL?

I'm trying to use temp tables to simplify my query. At first I used WITH, which was not recognized unless I joined each table explicitly. What's the best way to approach this query? What's wrong with this syntax?
For the account that purchased the most (in total over their lifetime as a customer) standard_qty paper, how many accounts still had more in total purchases?
create temp table t1 as (
SELECT
a.id as account_id,
SUM(o.standard_qty) as all_std_qty
FROM
accounts a
JOIN orders o ON (a.id = o.account_id)
GROUP BY
1
order by
2 desc
limit
1
)
create temp table t2 as (
SELECT
a.id as account_id,
SUM(o.total) as total_purchases
FROM
accounts a
JOIN orders o ON (a.id = o.account_id)
GROUP BY
1
)
create temp table t3 as (
SELECT
t1.account_id,
t2.total_purchases as total_pur FROM
t1
JOIN t2
ON (t1.account_id = t2.account_id)
)
SELECT
count(a.id) as count_ids
FROM
accounts a
JOIN orders o ON (a.id = o.account_id)
WHERE
o.total > t3.total_pur
I think you missed a join with t3: you reference it in the WHERE clause without joining it, and that's the problem. Can you please try the query below:
WITH t1 as (
SELECT
a.id as account_id,
SUM(o.standard_qty) as all_std_qty
FROM
accounts a
JOIN orders o ON (a.id = o.account_id)
GROUP BY
1
order by
2 desc
limit
1
), t2 as (
SELECT
a.id as account_id,
SUM(o.total) as total_purchases
FROM
accounts a
JOIN orders o ON (a.id = o.account_id)
GROUP BY
1
), t3 as (
SELECT
t1.account_id,
t2.total_purchases as total_pur FROM
t1
JOIN t2
ON (t1.account_id = t2.account_id)
)
SELECT
count(a.id) as count_ids
FROM
accounts a
JOIN orders o ON (a.id = o.account_id)
inner join t3 on a.id=t3.account_id
WHERE
o.total > t3.total_pur
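If the question is read as "count the accounts whose lifetime total purchases exceed those of the account with the most standard_qty", one possible variant compares the per-account totals in t2 against the single row in t3. This is a sketch under that assumption only, with invented data and SQLite standing in for PostgreSQL. It differs from the query above, which compares individual order totals, and it uses explicit column names instead of the ordinals 1 and 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER);
CREATE TABLE orders (account_id INTEGER, standard_qty INTEGER, total INTEGER);
INSERT INTO accounts VALUES (1), (2), (3), (4);
-- Hypothetical orders: account 1 buys the most standard_qty (150),
-- with lifetime total purchases of 800.
INSERT INTO orders VALUES
  (1, 100, 500), (1, 50, 300),
  (2, 10, 900),  (2, 5, 400),
  (3, 20, 200),
  (4, 30, 1000);
""")

sql = """
WITH t1 AS (
    SELECT a.id AS account_id, SUM(o.standard_qty) AS all_std_qty
    FROM accounts a
    JOIN orders o ON a.id = o.account_id
    GROUP BY a.id
    ORDER BY all_std_qty DESC
    LIMIT 1
), t2 AS (
    SELECT a.id AS account_id, SUM(o.total) AS total_purchases
    FROM accounts a
    JOIN orders o ON a.id = o.account_id
    GROUP BY a.id
), t3 AS (
    SELECT t1.account_id, t2.total_purchases AS total_pur
    FROM t1
    JOIN t2 ON t1.account_id = t2.account_id
)
SELECT COUNT(*) AS count_ids
FROM t2
CROSS JOIN t3            -- t3 has exactly one row
WHERE t2.total_purchases > t3.total_pur
"""
count_ids = conn.execute(sql).fetchone()[0]
print(count_ids)  # 2
```

Accounts 2 and 4 have lifetime totals of 1300 and 1000, both above account 1's 800, so the count is 2.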

SQL: Modifying Inner Join to Select One Row

I have two tables, A and B that I want to inner join on location. However, for each row in A, there are many rows in B whose location matches. I want to end up with at most the same number of rows as in A. Specifically, I want to take the row in B where date is earliest. Here's what I have so far:
SELECT *
FROM A
INNER JOIN B ON A.location = B.location
How would I modify this so that each row in A only gets joined with a single row in B (using the earliest date)?
Attempt:
SELECT *
FROM A
INNER JOIN B ON A.location = B.location
AND B.date = (SELECT MIN(date) FROM B)
Is that the right approach?
You can use the ANSI/ISO standard row_number() function:
SELECT *
FROM A INNER JOIN
(SELECT B.*, ROW_NUMBER() OVER (PARTITION BY B.location ORDER BY B.date) as seqnum
FROM B
) B
ON A.location = B.location AND seqnum = 1;
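To see why the attempt with `SELECT MIN(date) FROM B` is not quite right, the sketch below runs it next to the row_number() version on a few invented rows (SQLite 3.25+ assumed; the name and note columns are hypothetical additions for readability).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (location TEXT, name TEXT);
CREATE TABLE B (location TEXT, date TEXT, note TEXT);
INSERT INTO A VALUES ('NYC', 'a1'), ('LA', 'a2'), ('SF', 'a3');
INSERT INTO B VALUES
  ('NYC', '2020-01-02', 'late'),
  ('NYC', '2020-01-01', 'early'),
  ('LA',  '2020-05-05', 'only');
""")

# Earliest B row per location, numbered inside a derived table.
window_rows = conn.execute("""
SELECT A.location, A.name, B.date, B.note
FROM A INNER JOIN
     (SELECT B.*,
             ROW_NUMBER() OVER (PARTITION BY B.location ORDER BY B.date) AS seqnum
      FROM B
     ) B
     ON A.location = B.location AND seqnum = 1
ORDER BY A.location
""").fetchall()

# The attempt from the question: MIN(date) is taken over all of B,
# so only locations that happen to have that global date survive.
attempt_rows = conn.execute("""
SELECT A.location, A.name, B.date, B.note
FROM A
INNER JOIN B ON A.location = B.location
          AND B.date = (SELECT MIN(date) FROM B)
""").fetchall()

print(window_rows)
print(attempt_rows)
```

The window version returns one row each for LA and NYC (SF has no match in B, so the inner join drops it), while the attempt returns only the NYC row.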
SELECT TOP(1) * FROM A
INNER JOIN B ON
A.LOCATION=B.LOCATION
ORDER BY B.DATE

Aliasing derived table which is a union of two selects

I can't get the syntax right for aliasing the derived table correctly:
SELECT * FROM
(SELECT a.*, b.*
FROM a INNER JOIN b ON a.B_id = b.B_id
WHERE a.flag IS NULL AND b.date < NOW()
UNION
SELECT a.*, b.*
FROM a INNER JOIN b ON a.B_id = b.B_id
INNER JOIN c ON a.C_id = c.C_id
WHERE a.flag IS NOT NULL AND c.date < NOW())
AS t1
ORDER BY RAND() LIMIT 1
I'm getting a Duplicate column name of B_id. Any suggestions?
The problem isn't the union, it's the select a.*, b.* in each of the inner select statements - since a and b both have B_id columns, that means you have two B_id cols in the result.
You can fix that by changing the selects to something like:
select a.*, b.col_1, b.col_2 -- repeat for columns of b you need
In general, I'd avoid using select table1.* in queries you're using from code (rather than just interactive queries). If someone adds a column to the table, various queries can suddenly stop working.
In your derived table, you are retrieving the column id that exists in table a and table b, so you need to choose one of them or give an alias to them:
SELECT * FROM
(SELECT a.*, b.[all columns except id]
FROM a INNER JOIN b ON a.B_id = b.B_id
WHERE a.flag IS NULL AND b.date < NOW()
UNION
SELECT a.*, b.[all columns except id]
FROM a INNER JOIN b ON a.B_id = b.B_id
INNER JOIN c ON a.C_id = c.C_id
WHERE a.flag IS NOT NULL AND c.date < NOW())
AS t1
ORDER BY RAND() LIMIT 1
First, you could use UNION ALL instead of UNION. The two subqueries will have no common rows because of the mutually exclusive conditions on a.flag.
Another way you could write it is:
SELECT a.*, b.*
FROM a
INNER JOIN b
ON a.B_id = b.B_id
WHERE ( a.flag IS NULL
AND b.date < NOW()
)
OR
( a.flag IS NOT NULL
AND EXISTS
( SELECT *
FROM c
WHERE a.C_id = c.C_id
AND c.date < NOW()
)
)
ORDER BY RAND()
LIMIT 1
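The sketch below compares the UNION form with the OR/EXISTS form on a few invented rows. Several things are assumptions made for the sake of a runnable example: SQLite stands in for MySQL, a fixed cutoff parameter replaces NOW(), the ORDER BY RAND() LIMIT 1 step is left out, an A_id column is added to identify rows, and explicit column lists are used to avoid the duplicate B_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (A_id INTEGER, B_id INTEGER, C_id INTEGER, flag INTEGER);
CREATE TABLE b (B_id INTEGER, date TEXT);
CREATE TABLE c (C_id INTEGER, date TEXT);
INSERT INTO b VALUES (10, '2020-06-01'), (11, '2022-01-01');
-- C_id 100 appears twice on purpose, to show what the join to c does.
INSERT INTO c VALUES (100, '2020-01-01'), (100, '2019-01-01'), (101, '2023-01-01');
INSERT INTO a VALUES
  (1, 10, 100, NULL),  -- flag NULL, b date before cutoff: selected
  (2, 11, 100, NULL),  -- flag NULL, b date after cutoff: not selected
  (3, 10, 100, 1),     -- flag set, c date before cutoff: selected
  (4, 10, 101, 1);     -- flag set, c date after cutoff: not selected
""")
params = {"cutoff": "2021-01-01"}  # stand-in for NOW()

union_template = """
SELECT a.A_id, a.B_id, b.date
FROM a INNER JOIN b ON a.B_id = b.B_id
WHERE a.flag IS NULL AND b.date < :cutoff
{op}
SELECT a.A_id, a.B_id, b.date
FROM a INNER JOIN b ON a.B_id = b.B_id
INNER JOIN c ON a.C_id = c.C_id
WHERE a.flag IS NOT NULL AND c.date < :cutoff
ORDER BY 1
"""
union_rows = conn.execute(union_template.format(op="UNION"), params).fetchall()
union_all_rows = conn.execute(union_template.format(op="UNION ALL"), params).fetchall()

exists_rows = conn.execute("""
SELECT a.A_id, a.B_id, b.date
FROM a INNER JOIN b ON a.B_id = b.B_id
WHERE (a.flag IS NULL AND b.date < :cutoff)
   OR (a.flag IS NOT NULL
       AND EXISTS (SELECT 1 FROM c
                   WHERE a.C_id = c.C_id AND c.date < :cutoff))
ORDER BY a.A_id
""", params).fetchall()

print(union_rows)
print(exists_rows)
print(len(union_all_rows))
```

On this data UNION and the EXISTS form agree. UNION ALL returns row 3 twice because c has two matching rows for C_id 100, so the UNION ALL suggestion is only safe when c.C_id is unique.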