Conditional filter with row numbers - sql

I have a sample query below that returns an ID, a date, and a value, along with a row number that is partitioned by ID and ordered by meeting date:
SELECT
c.ID
,m.CONTACT_DATE
,d.TEST
,row_number() over(partition by c.ID
order by m.CONTACT_DATE desc
) [rn]
FROM COMMUNITY C
INNER JOIN MEETING m
ON c.ID = m.CONTACT_ID
LEFT JOIN DISCUSSION d
ON m.DISCUSSION_TEST = d.TEST
Running this query returns results like:
ID CONTACT_DATE TEST rn
01 2017-05-01 NULL 1
01 2017-04-01 1 2
01 2017-03-01 NULL 3
02 2017-08-01 NULL 1
02 2017-09-01 NULL 2
02 2017-10-01 1 3
03 2017-02-01 NULL 1
03 2017-01-01 NULL 2
What I'd like to do is get the most recent CONTACT_DATE for each ID (i.e. place this in subquery T, then WHERE T.rn = 1 GROUP BY T.ID).
However, if there's a value under TEST, then instead I want to see the most recent CONTACT_DATE that has a value, like below:
ID CONTACT_DATE TEST rn
01 2017-04-01 1 2
02 2017-10-01 1 3
03 2017-02-01 NULL 1
How can I filter for the most recent CONTACT_DATE that has a value under TEST, while still getting the most recent CONTACT_DATE if all values for that ID are NULL?

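You can change your row_number ordering so that rows with a non-NULL TEST sort first within each ID; filtering on rn = 1 then gives the result you want: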
row_number() over(partition by c.ID
order by CASE WHEN d.TEST IS NOT NULL THEN 1 ELSE 2 END
, m.CONTACT_DATE desc
)
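To see the effect on the sample data, here is a small sketch that runs the reordered row_number through Python's built-in sqlite3 module (it needs SQLite 3.25 or later for window functions). The single contacts table is an assumption that stands in for the COMMUNITY / MEETING / DISCUSSION join, and the outer WHERE rn = 1 is the filter described in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened table standing in for the three joined tables.
conn.execute("CREATE TABLE contacts (id TEXT, contact_date TEXT, test INTEGER)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("01", "2017-05-01", None),
        ("01", "2017-04-01", 1),
        ("01", "2017-03-01", None),
        ("02", "2017-08-01", None),
        ("02", "2017-09-01", None),
        ("02", "2017-10-01", 1),
        ("03", "2017-02-01", None),
        ("03", "2017-01-01", None),
    ],
)

query = """
SELECT id, contact_date, test
FROM (
    SELECT id, contact_date, test,
           ROW_NUMBER() OVER (
               PARTITION BY id
               -- rows with a TEST value sort first, then newest date first
               ORDER BY CASE WHEN test IS NOT NULL THEN 1 ELSE 2 END,
                        contact_date DESC
           ) AS rn
    FROM contacts
) t
WHERE rn = 1
ORDER BY id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With this data, IDs 01 and 02 return their latest row that has a TEST value, and ID 03 falls back to its latest row overall.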

Related

create table with 2 column with different conditions SQL

I have a table with this format:
Id_command, date_creat
01 01-01-2020
02 01-01-2021
03 01-11-2020
..
I would like to build a new table from it where the first column contains all the id_command values where date_creat > 01-01-2020 and the second column contains those where date_creat > 01-01-2021.
The expected result :
Id_command (date_creat > 01-01-2020) , id command(date_creat < 31-12-2020)
01 02
03
I had the idea to create two different tables and then outer join them, but I am not sure whether this can be done in a simpler way.
Thanks
First select the relevant rows from the table and add a row number:
select Id_command,
row_number() over (order by Id_command) as rn
from tab
where date_creat > DATE'2020-01-01'
ID_COMMAND RN
---------- ----------
2 1
3 2
Do the same for the second condition.
Finally, full outer join those two subqueries on the row number:
with a as(
select Id_command,
row_number() over (order by Id_command) as rn
from tab
where date_creat > DATE'2020-01-01'
), b as (
select Id_command,
row_number() over (order by Id_command) as rn
from tab
where date_creat <= DATE'2020-01-01')
select a.Id_command, b.Id_command
from a
full outer join b
on a.rn = b.rn
order by 1,2
ID_COMMAND ID_COMMAND
---------- ----------
2 1
3
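For anyone who wants to try the pairing without an Oracle instance, here is a rough sketch using Python's built-in sqlite3 module. Two things are assumptions on my side: the date 01-11-2020 is read as 1 November 2020, and the FULL OUTER JOIN is emulated with a LEFT JOIN plus a UNION ALL of the unmatched rows, because SQLite versions before 3.39 do not support FULL OUTER JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id_command TEXT, date_creat TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?)",
    [("01", "2020-01-01"), ("02", "2021-01-01"), ("03", "2020-11-01")],
)

query = """
WITH a AS (
    SELECT id_command, ROW_NUMBER() OVER (ORDER BY id_command) AS rn
    FROM tab
    WHERE date_creat > '2020-01-01'
), b AS (
    SELECT id_command, ROW_NUMBER() OVER (ORDER BY id_command) AS rn
    FROM tab
    WHERE date_creat <= '2020-01-01'
)
-- every row of a, with the row of b that has the same row number, if any
SELECT a.rn AS rn, a.id_command AS col_a, b.id_command AS col_b
FROM a LEFT JOIN b ON a.rn = b.rn
UNION ALL
-- rows of b that found no partner in a
SELECT b.rn, NULL, b.id_command
FROM b LEFT JOIN a ON a.rn = b.rn
WHERE a.rn IS NULL
ORDER BY 1
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The row number only serves to line the two lists up side by side; it carries no meaning of its own.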

Multiple left outer joins on Hive

In Hive, I have two tables as shown below:
SELECT * FROM p_test;
OK
p_test.id p_test.age
01 1
02 2
01 10
02 11
Time taken: 0.07 seconds, Fetched: 4 row(s)
SELECT * FROM p_test2;
OK
p_test2.id p_test2.height
02 172
01 170
Time taken: 0.053 seconds, Fetched: 2 row(s)
I need to get the age difference between rows of the same user in the p_test table, so I run the following HiveQL using the row_number function:
SELECT *
FROM
(SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t1
LEFT JOIN
(SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t2
ON t2.id=t1.id AND t1.rn=(t2.rn+1)
LEFT JOIN
(SELECT * FROM p_test2) t_2
ON t_2.id = t1.id;
The result is:
t1.id t1.age t1.rn t2.id t2.age t2.rn t_2.id t_2.height
01 1 1 NULL NULL NULL 01 170
01 10 2 01 1 1 01 170
02 11 1 NULL NULL NULL 02 172
02 2 2 02 11 1 02 172
Time taken: 60.773 seconds, Fetched: 4 row(s)
Everything is fine so far. However, if I move the condition that left joins t1 and t2 to the last line, as shown below:
SELECT *
FROM
(SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t1
LEFT JOIN
(SELECT *, ROW_NUMBER() OVER(partition by id order by age asc) rn FROM p_test) t2
LEFT JOIN
(SELECT * FROM p_test2) t_2
ON t_2.id = t1.id
AND t2.id=t1.id AND t1.rn=(t2.rn+1);
I get this unexpected result:
t1.id t1.age t1.rn t2.id t2.age t2.rn t_2.id t_2.height
01 1 1 01 1 1 NULL NULL
01 1 1 01 10 2 NULL NULL
01 1 1 02 11 1 NULL NULL
01 1 1 02 2 2 NULL NULL
01 10 2 01 1 1 01 170
01 10 2 01 10 2 NULL NULL
01 10 2 02 11 1 NULL NULL
01 10 2 02 2 2 NULL NULL
02 11 1 01 1 1 NULL NULL
02 11 1 01 10 2 NULL NULL
02 11 1 02 11 1 NULL NULL
02 11 1 02 2 2 NULL NULL
02 2 2 01 1 1 NULL NULL
02 2 2 01 10 2 NULL NULL
02 2 2 02 11 1 02 172
02 2 2 02 2 2 NULL NULL
It seems that the condition I moved to the last line no longer works. This has bothered me for a long time. Thanks in advance to anyone who can explain it.
In your second query, the LEFT JOIN with t2 has no ON condition, so it is transformed into a CROSS JOIN. This is why you have duplication: each of the 4 rows in t1 is paired with all 4 rows in t2, which gives 4x4=16 rows.
The ON condition does work, but it applies only to the last LEFT JOIN, the one with the t_2 subquery. It is checked only to decide which t_2 rows to join; it does not affect the first CROSS JOIN (the LEFT JOIN without an ON condition) at all.
Every join should have its own ON condition, except cross joins.
See also this answer about joins without ON condition behavior: https://stackoverflow.com/a/46843832/2700344
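The row counts are easy to check outside Hive. This is only a sketch using Python's built-in sqlite3 module (SQLite 3.25 or later), not HiveQL: the missing ON is written as an explicit CROSS JOIN, and storing age as an integer is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p_test (id TEXT, age INTEGER)")
conn.execute("CREATE TABLE p_test2 (id TEXT, height INTEGER)")
conn.executemany("INSERT INTO p_test VALUES (?, ?)",
                 [("01", 1), ("02", 2), ("01", 10), ("02", 11)])
conn.executemany("INSERT INTO p_test2 VALUES (?, ?)",
                 [("02", 172), ("01", 170)])

# The numbered subquery used for both t1 and t2 in the question.
numbered = ("(SELECT id, age, "
            "ROW_NUMBER() OVER (PARTITION BY id ORDER BY age) AS rn "
            "FROM p_test)")

# Each join carries its own ON condition.
own_on = f"""
SELECT COUNT(*)
FROM {numbered} t1
LEFT JOIN {numbered} t2 ON t2.id = t1.id AND t1.rn = t2.rn + 1
LEFT JOIN p_test2 t_2 ON t_2.id = t1.id
"""

# All conditions pushed into the last ON: t1 and t2 are cross joined first.
last_on = f"""
SELECT COUNT(*)
FROM {numbered} t1
CROSS JOIN {numbered} t2
LEFT JOIN p_test2 t_2
       ON t_2.id = t1.id AND t2.id = t1.id AND t1.rn = t2.rn + 1
"""

own_on_count = conn.execute(own_on).fetchone()[0]
last_on_count = conn.execute(last_on).fetchone()[0]
print(own_on_count, last_on_count)
```

The first query keeps one row per p_test row, while the second keeps every t1/t2 pair and only uses the conditions to decide whether a height gets attached.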
BTW, you can do the same without the t2 join at all by using the lag or lead analytic functions to calculate values ordered by age.
Like this:
lag(age) over(partition by id order by age) -- to get the previous age
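A minimal sketch of the lag variant, again run through Python's sqlite3 module rather than Hive (SQLite 3.25 or later; age stored as an integer is an assumption). It computes the difference to the previous age of the same id in a single pass over p_test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p_test (id TEXT, age INTEGER)")
conn.executemany("INSERT INTO p_test VALUES (?, ?)",
                 [("01", 1), ("02", 2), ("01", 10), ("02", 11)])

query = """
SELECT id,
       age,
       -- NULL for the first row of each id, because there is no previous age
       age - LAG(age) OVER (PARTITION BY id ORDER BY age) AS age_diff
FROM p_test
ORDER BY id, age
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The height can still be attached afterwards with a single LEFT JOIN to p_test2 on id.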

select best row available from each group each item for dates available

I have data like this
Group id Date
1 1 2015-01-01
2 1 2015-01-01
1 1 2015-02-01
2 1 2015-02-01
1 1 2015-03-01
1 2 2015-04-01
2 2 2015-04-01
I want to select one record for each date for each id: the Group 2 row if available, otherwise the Group 1 row for that date. For each id there is always a record with Group 1. So the end result should be:
Group id Date
2 1 2015-01-01
2 1 2015-02-01
1 1 2015-03-01
2 2 2015-04-01
Use the ROW_NUMBER window function, partitioned by id and Date and ordered so that the higher Group comes first:
select * from
(
select row_number() over(partition by id, [Date] order by [Group] desc) as RN, *
from yourtable
) A
where RN = 1
If an id and Date can have more than one row in the preferred Group and you want to return all of them, use DENSE_RANK instead of ROW_NUMBER.
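Here is a sketch of the ROW_NUMBER approach on the sample data, run through Python's sqlite3 module (SQLite 3.25 or later). The columns are renamed to grp and dt, which is my own choice to avoid quoting reserved words, and each id and date forms one partition with the group ordered descending so that Group 2 wins when it exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# grp and dt are hypothetical names for the Group and Date columns.
conn.execute("CREATE TABLE yourtable (grp INTEGER, id INTEGER, dt TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [
        (1, 1, "2015-01-01"),
        (2, 1, "2015-01-01"),
        (1, 1, "2015-02-01"),
        (2, 1, "2015-02-01"),
        (1, 1, "2015-03-01"),
        (1, 2, "2015-04-01"),
        (2, 2, "2015-04-01"),
    ],
)

query = """
SELECT grp, id, dt
FROM (
    SELECT grp, id, dt,
           -- highest group first within each id and date
           ROW_NUMBER() OVER (PARTITION BY id, dt ORDER BY grp DESC) AS rn
    FROM yourtable
) a
WHERE rn = 1
ORDER BY dt, id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```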
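Alternatively, take the highest group for each id and date with a windowed MAX (the columns here are named gid, id and grp_date):
select distinct max(gid) over(partition by id, grp_date) as gid, id, grp_date
from test
order by grp_date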

ROW_NUMBER() with DISTINCT

I've got a table of ticket assignments showing the different groups a ticket is transferred to before it's resolved. Here is a simplified table:
asgn_grp | date | ticket_id
---------|--------|----------
A | 1-1-15 | 1
A | 1-2-15 | 1
B | 1-3-15 | 1
A | 1-1-15 | 2
C | 1-2-15 | 2
B | 1-3-15 | 2
C | 1-1-15 | 3
B | 1-2-15 | 3
I need to get a count of the second distinct group that a ticket was assigned to, meaning I want to know once a ticket is transferred out of the group it's in; internal transfers don't count. So the second distinct group for ticket 1 is B, for ticket 2 it is C, and for ticket 3 it is B. I need to get a count of these, so the end result I need is:
asgn_grp | count
---------|-------
B | 2
C | 1
I've tried
SELECT distinct top 2 asgn_grp, ROW_NUMBER() OVER (ORDER BY date)
as my sub-query and pulling the second one out of that, but when I add the ROW_NUMBER() it messes up my distinct. If I pull the ROW_NUMBER() out of the sub-query, I have no way to order my values to ensure I get the second one after I DISTINCT the list.
Also, let me know if I was unclear about anything.
Instead of using distinct, try using group by twice.
select asgn_grp, count(*) from (
select * , row_number() over (partition by ticket_id order by min_date) rn
from (
select asgn_grp, ticket_id, min(date) min_date
from Table1 group by asgn_grp, ticket_id
) t1
) t2 where rn = 2
group by asgn_grp;
http://sqlfiddle.com/#!3/a0d1e
The derived table t1 contains every unique asgn_grp for each ticket_id along with the minimum date of each asgn_grp. For the sample data t1 has the following rows:
ASGN_GRP TICKET_ID MIN_DATE
A 1 January, 01 2015 00:00:00+0000
B 1 January, 03 2015 00:00:00+0000
A 2 January, 01 2015 00:00:00+0000
B 2 January, 03 2015 00:00:00+0000
C 2 January, 02 2015 00:00:00+0000
B 3 January, 02 2015 00:00:00+0000
C 3 January, 01 2015 00:00:00+0000
The outer query then uses row_number() to number each asgn_grp within a ticket_id by its min_date and generates the following for t2:
ASGN_GRP TICKET_ID MIN_DATE RN
A 1 January, 01 2015 00:00:00+0000 1
B 1 January, 03 2015 00:00:00+0000 2
A 2 January, 01 2015 00:00:00+0000 1
C 2 January, 02 2015 00:00:00+0000 2
B 2 January, 03 2015 00:00:00+0000 3
C 3 January, 01 2015 00:00:00+0000 1
B 3 January, 02 2015 00:00:00+0000 2
This table is filtered for RN = 2 and is grouped by asgn_grp to get the count for each asgn_grp.
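If the fiddle link is unavailable, the same steps can be reproduced locally. This is a sketch with Python's sqlite3 module (SQLite 3.25 or later) instead of SQL Server; the ISO date strings are my own encoding of the sample dates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (asgn_grp TEXT, date TEXT, ticket_id INTEGER)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [
        ("A", "2015-01-01", 1), ("A", "2015-01-02", 1), ("B", "2015-01-03", 1),
        ("A", "2015-01-01", 2), ("C", "2015-01-02", 2), ("B", "2015-01-03", 2),
        ("C", "2015-01-01", 3), ("B", "2015-01-02", 3),
    ],
)

query = """
SELECT asgn_grp, COUNT(*) AS cnt
FROM (
    SELECT t1.*,
           -- number the distinct groups of a ticket by first appearance
           ROW_NUMBER() OVER (PARTITION BY ticket_id ORDER BY min_date) AS rn
    FROM (
        -- first GROUP BY: one row per group per ticket
        SELECT asgn_grp, ticket_id, MIN(date) AS min_date
        FROM Table1
        GROUP BY asgn_grp, ticket_id
    ) t1
) t2
WHERE rn = 2
GROUP BY asgn_grp
ORDER BY asgn_grp
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```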
First, you need to identify groups of constant values of asgn_grp for each ticket. You can do that with a difference of row numbers.
Then, you need the ordering for each group. For that, use the minimum date in the group. Finally, you can rank these groups to get the second one, using dense_rank() on the date.
select asgn_grp, count(*)
from (select ticket_id, asgn_grp,
dense_rank() over (partition by ticket_id order by grpdate) as seqnum
from (select s.*, min(date) over (partition by ticket_id, asgn_grp, grp) as grpdate
from (select s.*,
(row_number() over (partition by ticket_id order by date) -
row_number() over (partition by ticket_id, asgn_grp order by date)
) as grp
from simplified s
) s
) s
) s
where seqnum = 2
group by asgn_grp;
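The difference of row numbers is the least obvious part, so here is a sketch of just that step for ticket 1, run through Python's sqlite3 module (SQLite 3.25 or later; ISO date strings are my own encoding). Consecutive rows with the same asgn_grp end up with the same grp value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE simplified (asgn_grp TEXT, date TEXT, ticket_id INTEGER)")
conn.executemany(
    "INSERT INTO simplified VALUES (?, ?, ?)",
    [
        ("A", "2015-01-01", 1), ("A", "2015-01-02", 1), ("B", "2015-01-03", 1),
        ("A", "2015-01-01", 2), ("C", "2015-01-02", 2), ("B", "2015-01-03", 2),
        ("C", "2015-01-01", 3), ("B", "2015-01-02", 3),
    ],
)

query = """
SELECT asgn_grp, date,
       -- position within the ticket minus position within (ticket, group):
       -- constant while the group does not change
       ROW_NUMBER() OVER (PARTITION BY ticket_id ORDER BY date)
     - ROW_NUMBER() OVER (PARTITION BY ticket_id, asgn_grp ORDER BY date) AS grp
FROM simplified
WHERE ticket_id = 1
ORDER BY date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The two A rows share grp = 0 and the B row gets grp = 2, so (ticket_id, asgn_grp, grp) identifies each run of the same group.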
The query below uses inner joins. If you also need the assignment groups with a count of zero (the ones that never changed), use outer joins instead of inner joins.
WITH TBL AS
(
SELECT A.*, ROW_NUMBER() OVER(PARTITION BY ticket_id ORDER BY asgn_grp) AS RN
FROM yourTable AS A
)
SELECT A.asgn_grp, COUNT(*) AS CNT
FROM TBL AS A
INNER JOIN TBL AS B
ON B.ticket_id = A.ticket_id
AND A.RN = B.RN + 1
GROUP BY A.asgn_grp
As for why using DISTINCT with ROW_NUMBER() changes your results:
See this question about the difference between DISTINCT and GROUP BY.
From that:
The GROUP BY query aggregates before it computes. The DISTINCT query computes before the aggregate.
So when you use ROW_NUMBER(), which yields a scalar value per row, the query computes it first, giving every row a unique ROW_NUMBER() value. DISTINCT is then applied over rows that include that value, so it no longer finds any duplicate rows.
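A quick way to see this, sketched with Python's sqlite3 module (SQLite 3.25 or later) on the three rows of ticket 1; the table name yourTable is just a placeholder:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (asgn_grp TEXT, date TEXT, ticket_id INTEGER)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [("A", "2015-01-01", 1), ("A", "2015-01-02", 1), ("B", "2015-01-03", 1)],
)

# DISTINCT alone collapses the two A rows.
plain = conn.execute(
    "SELECT DISTINCT asgn_grp FROM yourTable"
).fetchall()

# With ROW_NUMBER() in the select list every row is unique,
# so DISTINCT has nothing left to remove.
numbered = conn.execute(
    "SELECT DISTINCT asgn_grp, ROW_NUMBER() OVER (ORDER BY date) AS rn "
    "FROM yourTable"
).fetchall()

print(len(plain), len(numbered))
```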
And for your results you can use this query:
SELECT ticket_id, asgn_grp,
(SELECT COUNT([date]) FROM yourTable t WHERE t.asgn_grp = r.asgn_grp And t.ticket_id = r.ticket_id)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ticket_id ORDER BY [date]) As ra
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ticket_id, asgn_grp ORDER BY [date] Desc) As rn
FROM yourTable) findingOldDates
WHERE rn = 1) r
WHERE ra = 2

Pivot SQL with Rank

Basically, I have the following query and I am trying to pick out only the unique ranks from it:
WITH numbered_rows
as (
SELECT Claim,
reserve,
time,
RANK() OVER (PARTITION BY Claim ORDER BY time asc) as 'Rank'
FROM (
SELECT cc.Claim,
MAX(csd.time) as time,
csd.reserve
FROM ClaimData csd WITH (NOLOCK)
JOIN Core cc WITH (NOLOCK)
on cc.ClaimID = csd.ClaimID
GROUP BY cc.Claim, csd.Reserve
) as t
)
select *
from numbered_rows cur, numbered_rows prev
where cur.Claim= prev.Claim
and cur.Rank = prev.Rank -1
The results set I get is the following:
Claim reserve Time Rank Claim reserve Time Rank
--------------------------------------------------------------------
11 0 12/10/2012 1 11 15000 5/30/2013 2
34 2000 1/21/2013 1 34 750 1/31/2013 2
34 750 1/31/2013 2 34 0 3/31/2013 3
07 800000 5/9/2013 1 07 0 5/10/2013 2
But I only want to see the following (with the Claim 34 row that ends at Rank 2 removed, because it's not the highest):
Claim reserve Time Rank Claim reserve Time Rank
--------------------------------------------------------------------
11 0 12/10/2012 1 11 15000 5/30/2013 2
34 750 1/31/2013 2 34 0 3/31/2013 3
07 800000 5/9/2013 1 07 0 5/10/2013 2
I think you can do this by just reversing your logic, i.e. ordering by time DESC, switching cur and prev in your final select, and changing -1 to +1. Then limit prev.Rank to 1, which ensures that you only include the latest 2 results for each claim:
WITH numbered_rows AS
( SELECT Claim,
reserve,
time,
[Rank] = RANK() OVER (PARTITION BY Claim ORDER BY time DESC)
FROM ( SELECT cc.Claim,
[Time] = MAX(csd.time),
csd.reserve
FROM ClaimData AS csd WITH (NOLOCK)
INNER JOIN Core AS cc WITH (NOLOCK)
ON cc.ClaimID = csd.ClaimID
GROUP BY cc.Claim, csd.Reserve
) t
)
SELECT *
FROM numbered_rows AS prev
INNER JOIN numbered_rows AS cur
ON cur.Claim= prev.Claim
AND cur.Rank = prev.Rank + 1
WHERE prev.Rank = 1;
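To check the idea against the sample output, here is a sketch using Python's sqlite3 module (SQLite 3.25 or later) rather than SQL Server. The claim_history table is an assumption: it holds the already aggregated (Claim, reserve, MAX(time)) rows, with dates rewritten as ISO strings, and the columns are named ts and rnk to stay clear of keywords:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the output of the inner GROUP BY query.
conn.execute("CREATE TABLE claim_history (claim TEXT, reserve INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO claim_history VALUES (?, ?, ?)",
    [
        ("11", 0, "2012-12-10"), ("11", 15000, "2013-05-30"),
        ("34", 2000, "2013-01-21"), ("34", 750, "2013-01-31"),
        ("34", 0, "2013-03-31"),
        ("07", 800000, "2013-05-09"), ("07", 0, "2013-05-10"),
    ],
)

query = """
WITH numbered_rows AS (
    SELECT claim, reserve, ts,
           -- newest row of each claim gets rank 1
           RANK() OVER (PARTITION BY claim ORDER BY ts DESC) AS rnk
    FROM claim_history
)
SELECT cur.claim, cur.reserve, cur.ts, prev.reserve, prev.ts
FROM numbered_rows AS prev
INNER JOIN numbered_rows AS cur
        ON cur.claim = prev.claim
       AND cur.rnk = prev.rnk + 1
WHERE prev.rnk = 1   -- pair the latest row with the one just before it
ORDER BY cur.claim
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each claim yields a single pair: the second latest row (cur) followed by the latest row (prev), which matches the three rows asked for in the question.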