How to count distinct values of a field cumulatively using a recursive CTE or another method in SQL?

Using the example below, Day 1 has 1, 1, and 3 distinct name(s) for houses A, B, and C respectively.
When calculating the distinct name(s) for each house on Day 2, data up to and including Day 2 is used.
When calculating the distinct name(s) for each house on Day 3, data up to and including Day 3 is used.
Can a recursive CTE be used?
Data:
Day | House | Name
------------------
1     A       Jack
1     B       Pop
1     C       Anna
1     C       Dew
1     C       Franco
2     A       Jon
2     B       May
2     C       Anna
3     A       Jon
3     B       Ken
3     C       Dew
3     C       Dew
Result:
Day | House | Distinct names
----------------------------
1     A       1
1     B       1
1     C       3
2     A       2 (jack and jon)
2     B       2
2     C       3
3     A       2 (jack and jon)
3     B       3
3     C       3

Without knowing the requirements and the size of the data, it's hard to give an optimal solution. Assuming a small dataset and a need for a quick and dirty calculation, just use a correlated subquery like this:
SELECT p.[Day]
     , p.House
     , (SELECT COUNT(DISTINCT [Name])
        FROM #Bing
        WHERE [Day] <= p.[Day] AND House = p.House) AS DistinctNames
FROM #Bing p
GROUP BY [Day], House
ORDER BY 1

There is no need for a recursive CTE. Just mark the first time a name is seen in a house and use a cumulative sum:
select day, house,
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (partition by house order by day) as num_unique_names
from (select t.*,
             row_number() over (partition by house, name order by day) as seqnum
      from t
     ) t
group by day, house
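As a hedged illustration of the first-occurrence approach above, the following Python sketch loads the question's sample rows into an in-memory SQLite database and runs an equivalent query. The table name `t` comes from the answer; the helper `cumulative_distinct_names`, the `new_names` alias, and the split of the nested `sum(sum(...))` into two subquery levels are assumptions made for this sketch. It needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# Sample rows from the question: (day, house, name).
ROWS = [
    (1, "A", "Jack"), (1, "B", "Pop"), (1, "C", "Anna"), (1, "C", "Dew"),
    (1, "C", "Franco"), (2, "A", "Jon"), (2, "B", "May"), (2, "C", "Anna"),
    (3, "A", "Jon"), (3, "B", "Ken"), (3, "C", "Dew"), (3, "C", "Dew"),
]

QUERY = """
SELECT day, house,
       -- running total of first-time names per house
       SUM(new_names) OVER (PARTITION BY house ORDER BY day) AS num_unique_names
FROM (
    -- how many names appear for the first time on each (day, house)
    SELECT day, house,
           SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS new_names
    FROM (
        -- seqnum = 1 marks the earliest row of each (house, name) pair
        SELECT day, house, name,
               ROW_NUMBER() OVER (PARTITION BY house, name ORDER BY day) AS seqnum
        FROM t
    ) AS flagged
    GROUP BY day, house
) AS per_day
ORDER BY day, house
"""


def cumulative_distinct_names(rows):
    """Hypothetical helper: return (day, house, cumulative distinct names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (day INTEGER, house TEXT, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in cumulative_distinct_names(ROWS):
        print(row)
```

With the sample data this yields 1, 2, 2 for house A, 1, 2, 3 for house B, and 3, 3, 3 for house C across days 1 to 3, which matches the expected result in the question.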

Related

Return top 1 million based off of two criteria (SQL Query)

I’d like to build a query that returns the top 1 million customers based off of two criteria that ranks 10 million customers.
Criterion 1 being a grade assigned to each customer from 1 to 5, 1 being the best
Criterion 2 being a grade assigned to each customer from A to E, A being the best
Criterion 1 outweighs Criterion 2, in that before you move to B (Criterion 2), you must first go from 1 to 5 (Criterion 1) within the A band (Criterion 2) i.e. A customer that scores a 5 (criterion 1) and an A (criterion 2), is a better customer than a customer that scores a 1 (criterion 1) and a B (criterion 2).
I’d like the query to return the top 1 million customers, stopping within the bands that return the 1 million-th customer e.g. if customer 1 million is in the 4C band, don’t return any customers beyond 4C. It’s ok if it's just over 1 million, to accommodate every customer in 4C band.
This is my attempt at it but this doesn’t account for sequence:
SELECT *
FROM CUSTOMER_POPULATION
WHERE Criterion1 IN (5,4,3,2,1)
AND Criterion2 IN ('A','B','C','D','E')
LIMIT 1000000
TIA.
WITH CUSTOMER_POPULATION (NAME, CRITERION1, CRITERION2) AS
(SELECT * FROM VALUES
('Alice',3,'A'),('Bob',4,'C'),('Carol',5,'E'),('Dave',2,'C')
,('Esther',2,'E'),('Fred',5,'C'),('Gladys',3,'E'),('Harvey',2,'E')
,('Iona',3,'C'),('John',1,'A'),('Kate',4,'E'),('Leo',3,'B')
,('Mary',2,'C'),('Nora',3,'A'),('Oscar',1,'D'),('Penny',3,'C')
,('Quincy',3,'A'),('Ruth',5,'E'),('Sam',4,'B'),('Tina',2,'C')
,('Ulrich',1,'B'),('Velma',5,'B'),('Wayne',2,'C'),('Xena',5,'B')
,('Yale',1,'D'),('Zoe',5,'C')
)
SELECT *
FROM CUSTOMER_POPULATION
WHERE Criterion1 IN (5,4,3,2,1)
AND Criterion2 IN ('A','B','C','D','E')
ORDER BY CONCAT(Criterion2, Criterion1)
LIMIT 1000000
NAME   | CRITERION1 | CRITERION2
--------------------------------
John     1            A
Alice    3            A
Nora     3            A
Quincy   3            A
Ulrich   1            B
Leo      3            B
Sam      4            B
Xena     5            B
Velma    5            B
Dave     2            C
Wayne    2            C
Tina     2            C
Mary     2            C
Iona     3            C
Penny    3            C
Bob      4            C
Fred     5            C
Zoe      5            C
Oscar    1            D
Yale     1            D
Harvey   2            E
Esther   2            E
Gladys   3            E
Kate     4            E
Ruth     5            E
Carol    5            E
rank() will let you number the bands according to their ordering by the two criteria. Because ties get the same rank, every customer in a band shares the rank of the band's first row, so the results are not cut off in the middle of a band at exactly the one millionth row.
with data as (
select *, rank() over (order by Criterion2, Criterion1) as rnk
from CUSTOMER_POPULATION
where Criterion1 IN (5,4,3,2,1) and Criterion2 in ('A','B','C','D','E')
)
select * from data where rnk <= 1000000;
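To see the band behaviour on a small scale, here is a hedged Python sketch that runs the same `rank()` query in an in-memory SQLite database against the 26 sample customers from the earlier answer. The helper `top_customers` and the parameterised cutoff are assumptions for illustration; a cutoff of 10 stands in for 1,000,000. It needs SQLite 3.25 or later.

```python
import sqlite3

# Sample customers from the earlier answer: (name, criterion1, criterion2).
CUSTOMERS = [
    ("Alice", 3, "A"), ("Bob", 4, "C"), ("Carol", 5, "E"), ("Dave", 2, "C"),
    ("Esther", 2, "E"), ("Fred", 5, "C"), ("Gladys", 3, "E"), ("Harvey", 2, "E"),
    ("Iona", 3, "C"), ("John", 1, "A"), ("Kate", 4, "E"), ("Leo", 3, "B"),
    ("Mary", 2, "C"), ("Nora", 3, "A"), ("Oscar", 1, "D"), ("Penny", 3, "C"),
    ("Quincy", 3, "A"), ("Ruth", 5, "E"), ("Sam", 4, "B"), ("Tina", 2, "C"),
    ("Ulrich", 1, "B"), ("Velma", 5, "B"), ("Wayne", 2, "C"), ("Xena", 5, "B"),
    ("Yale", 1, "D"), ("Zoe", 5, "C"),
]

QUERY = """
WITH data AS (
    SELECT name, criterion1, criterion2,
           -- tied rows (same band) all receive the rank of the band's first row
           RANK() OVER (ORDER BY criterion2, criterion1) AS rnk
    FROM customer_population
)
SELECT name, criterion1, criterion2, rnk
FROM data
WHERE rnk <= ?
ORDER BY rnk, name
"""


def top_customers(customers, cutoff):
    """Hypothetical helper: every customer whose band starts at or before `cutoff`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_population "
        "(name TEXT, criterion1 INTEGER, criterion2 TEXT)"
    )
    conn.executemany("INSERT INTO customer_population VALUES (?, ?, ?)", customers)
    rows = conn.execute(QUERY, (cutoff,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in top_customers(CUSTOMERS, 10):
        print(row)
```

With a cutoff of 10, the tenth customer falls in the C/2 band, so all four C/2 customers are returned and the query yields 13 rows rather than 10.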

PostgreSQL Absolute Ranking

The users table is like this:
Id | Name | Room | Point
------------------------
1    A      1      10
2    B      1      20
3    C      2      30
4    D      2      40
I want to get a ranking under some conditions.
The query is SELECT *, RANK() OVER (ORDER BY users.point ASC) rnk FROM users WHERE users.room = 2
With this query the ranking column (rnk) is not an absolute ranking, because the rank is computed only over the rows that pass the WHERE filter.
The query result is
Id | Name | Room | Point | rnk
------------------------------
3    C      2      30      1
4    D      2      40      2
But I want absolute ranking, and the expected result is
Id | Name | Room | Point | rnk
------------------------------
3    C      2      30      3
4    D      2      40      4
Rank first, filter later. For example:
select *
from (
select *, rank() over(order by point) as rnk
from users
) x
where room = 2
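The "rank first, filter later" pattern can be checked with a small Python sketch that runs the query in an in-memory SQLite database (3.25 or later) on the four sample users. The helper name `absolute_rank_for_room` and the parameterised room number are assumptions made for this illustration.

```python
import sqlite3

# Sample rows from the question: (id, name, room, point).
USERS = [(1, "A", 1, 10), (2, "B", 1, 20), (3, "C", 2, 30), (4, "D", 2, 40)]

QUERY = """
SELECT id, name, room, point, rnk
FROM (
    -- rank over ALL users, before any filtering happens
    SELECT id, name, room, point,
           RANK() OVER (ORDER BY point) AS rnk
    FROM users
) AS x
WHERE room = ?
ORDER BY rnk
"""


def absolute_rank_for_room(users, room):
    """Hypothetical helper: users of one room with their rank among all users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT, room INTEGER, point INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", users)
    rows = conn.execute(QUERY, (room,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in absolute_rank_for_room(USERS, 2):
        print(row)
```

For room 2 this returns users C and D with ranks 3 and 4, as in the expected result, because the window function is evaluated in the inner query before the outer WHERE clause removes room 1.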

Teradata/SQL, select all rows until a certain value is reached per partition

I'd like to select all rows from a table until (and including) a certain value is reached per partition. In this case all rows per id that precede when status has the value 'b' for the last time. Note: the timestamp is in order per id
id | name   | status | status | timestamp
-----------------------------------------------
1    Sta      open     a        10:50:09.000000
1    Danny    open     c        10:50:19.000000
1    Elle     closed   b        10:50:39.000000
2    anton    closed   a        16:00:09.000000
2    jill     done     b        16:00:19.000000
2    tom      open     b        16:05:09.000000
2    bill     open     c        16:07:09.000000
3    ann      done     b        08:00:13.000000
3    stef     done     b        08:12:13.000000
3    martin   open     b        08:25:13.000000
3    jeff     open     a        09:00:13.000000
3    luke     open     c        09:07:13.000000
3    karen    open     c        09:15:13.000000
3    lucy     open     a        10:00:13.000000
The output would look like this:
id | name   | status | status | timestamp
-----------------------------------------------
1    Sta      open     a        10:50:09.000000
1    Danny    open     c        10:50:19.000000
1    Elle     closed   b        10:50:39.000000
2    anton    closed   a        16:00:09.000000
2    jill     done     b        16:00:19.000000
2    tom      open     b        16:05:09.000000
3    ann      done     b        08:00:13.000000
3    stef     done     b        08:12:13.000000
3    martin   open     b        08:25:13.000000
I've tried to solve this using QUALIFY with RANK etc., but unfortunately with no success. It would be appreciated if somebody could help me!
Selecting all rows per id up to and including the last occurrence of status 'b' is the same as excluding the rows before the first occurrence of 'b' once you reverse the sort order:
SELECT *
FROM tab
QUALIFY -- tag the last 'b'
    Count(CASE WHEN status = 'b' THEN 1 END)
    Over (PARTITION BY id
          ORDER BY timestamp DESC
          ROWS Unbounded Preceding) > 0
ORDER BY id, timestamp
;
This will not return ids where no 'b' exists.
If you want to return those, too, add another condition to QUALIFY:
OR -- no 'b' found
Count(CASE WHEN status = 'b' THEN 1 end)
Over (PARTITION BY id) = 0
As both counts share the same partition, it's still a single STAT step in Explain.
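For readers without a Teradata system, the reversed running count can be tried in SQLite (3.25 or later) from Python. SQLite has no QUALIFY clause, so this sketch swaps in a subquery plus WHERE as an assumption-laden stand-in, not the Teradata syntax itself. The column names `state` (for the first of the two columns headed "status" in the question) and `ts`, and the helper `rows_up_to_last_b`, are hypothetical names chosen for this example.

```python
import sqlite3

# Sample rows from the question: (id, name, state, status, ts).
ROWS = [
    (1, "Sta", "open", "a", "10:50:09.000000"),
    (1, "Danny", "open", "c", "10:50:19.000000"),
    (1, "Elle", "closed", "b", "10:50:39.000000"),
    (2, "anton", "closed", "a", "16:00:09.000000"),
    (2, "jill", "done", "b", "16:00:19.000000"),
    (2, "tom", "open", "b", "16:05:09.000000"),
    (2, "bill", "open", "c", "16:07:09.000000"),
    (3, "ann", "done", "b", "08:00:13.000000"),
    (3, "stef", "done", "b", "08:12:13.000000"),
    (3, "martin", "open", "b", "08:25:13.000000"),
    (3, "jeff", "open", "a", "09:00:13.000000"),
    (3, "luke", "open", "c", "09:07:13.000000"),
    (3, "karen", "open", "c", "09:15:13.000000"),
    (3, "lucy", "open", "a", "10:00:13.000000"),
]

QUERY = """
SELECT id, name, state, status, ts
FROM (
    SELECT id, name, state, status, ts,
           -- walking backwards in time, count the 'b' rows seen so far
           COUNT(CASE WHEN status = 'b' THEN 1 END)
               OVER (PARTITION BY id
                     ORDER BY ts DESC
                     ROWS UNBOUNDED PRECEDING) AS b_seen
    FROM tab
) AS flagged
WHERE b_seen > 0          -- plays the role of QUALIFY ... > 0
ORDER BY id, ts
"""


def rows_up_to_last_b(rows):
    """Hypothetical helper: rows per id up to and including the last 'b'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tab (id INTEGER, name TEXT, state TEXT, status TEXT, ts TEXT)"
    )
    conn.executemany("INSERT INTO tab VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in rows_up_to_last_b(ROWS):
        print(row)
```

On the sample data this keeps Sta, Danny, and Elle for id 1, anton, jill, and tom for id 2, and ann, stef, and martin for id 3, matching the desired output in the question.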

Count condition met

I have a table (stu_grades) that stores student data and the grades students received at the centers they attended.
I want to find out how many times each student in that table got each grade (e.g. 'A', then 'B', etc.) at any center.
stu_grades
stu_ID|grade1|grade2|Grade3|center
1 A A C 1
2 B B B 2
3 C C A 1
1 C A C 2
The same student can occur more than once in the table, with the same grades or different grades, at the same or a different center.
I especially want to check where a grade has appeared 3 or more times, and in how many centers it appears.
So the final output should look like this:
Stu_ID|Grade|Count|centercount
1 A 3 2 (As they acquired 'A' from 2 centres)
1 C 3 2
2 B 3 1 (As they only exist in 1 centre)
3 C 2 1
3 A 1 1
select
    stu_id,
    grade,
    sum(count) count,
    count(distinct center) centercount
from (
    select stu_id, grade, center, count(*)
    from stu_grades,
    lateral unnest(array[grade1, grade2, grade3]) grade
    group by 1, 2, 3
) s
group by 1, 2
order by 1, 2;
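The query above relies on PostgreSQL's `lateral unnest`. As a hedged, portable illustration of the same unpivot-then-aggregate idea, the sketch below swaps in a `UNION ALL` over the three grade columns and runs it in an in-memory SQLite database from Python. The `UNION ALL` rewrite, the CTE name `unpivoted`, the alias `cnt`, and the helper `grade_counts` are assumptions for this example and not part of the original answer.

```python
import sqlite3

# Sample rows from the question: (stu_id, grade1, grade2, grade3, center).
ROWS = [
    (1, "A", "A", "C", 1),
    (2, "B", "B", "B", 2),
    (3, "C", "C", "A", 1),
    (1, "C", "A", "C", 2),
]

QUERY = """
WITH unpivoted AS (
    -- one row per (student, center, grade) instead of three grade columns
    SELECT stu_id, center, grade1 AS grade FROM stu_grades
    UNION ALL
    SELECT stu_id, center, grade2 FROM stu_grades
    UNION ALL
    SELECT stu_id, center, grade3 FROM stu_grades
)
SELECT stu_id, grade,
       COUNT(*) AS cnt,                       -- times the grade was received
       COUNT(DISTINCT center) AS centercount  -- centers it was received at
FROM unpivoted
GROUP BY stu_id, grade
ORDER BY stu_id, grade
"""


def grade_counts(rows):
    """Hypothetical helper: (stu_id, grade, count, centercount) per student and grade."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stu_grades "
        "(stu_id INTEGER, grade1 TEXT, grade2 TEXT, grade3 TEXT, center INTEGER)"
    )
    conn.executemany("INSERT INTO stu_grades VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in grade_counts(ROWS):
        print(row)
```

To keep only grades that appear 3 or more times, as the question asks, a `HAVING COUNT(*) >= 3` clause could be added after the `GROUP BY`.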

Need help designing a proper sql query

I have this table:
DebitDate | DebitTypeID | DebitPrice | DebitQuantity
----------------------------------------------------
40577 1 50 3
40577 1 100 1
40577 2 75 2
40578 1 50 2
40578 2 150 2
I would like to get these details with a single query (if that's possible):
date, debit_id, total_sum_of_same_debit, how_many_debits_per_day
So from the example above I would get:
40577, 1, (50*3)+(100*1), 2 (because 40577 has 1 and 2 so total of 2 debits per this day)
40577, 2, (75*2), 2 (because 40577 has 1 and 2 so total of 2 debits per this day)
40578, 1, (50*2), 2 (because 40578 has 1 and 2 so total of 2 debits per this day)
40578, 2, (150*2), 2 (because 40578 has 1 and 2 so total of 2 debits per this day)
So I have this SQL query:
SELECT DebitDate, DebitTypeID, SUM(DebitPrice*DebitQuantity) AS TotalSum
FROM DebitsList
GROUP BY DebitDate, DebitTypeID, DebitPrice, DebitQuantity
Now I'm having trouble: I'm not sure where to put the count for the last piece of information I need.
You would need a correlated subquery to get this new column. You also need to drop DebitPrice and DebitQuantity from the GROUP BY clause for it to work.
SELECT DebitDate,
       DebitTypeID,
       SUM(DebitPrice*DebitQuantity) AS TotalSum,
       ( select Count(distinct E.DebitTypeID)
         from DebitsList E
         where E.DebitDate=D.DebitDate) as CountDebits
FROM DebitsList D
GROUP BY DebitDate, DebitTypeID
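As a quick check of the correlated-subquery answer, this hedged Python sketch runs it in an in-memory SQLite database against the five sample rows. The helper name `debit_summary` and the added `ORDER BY` (used only to make the output order deterministic) are assumptions for this example.

```python
import sqlite3

# Sample rows from the question: (DebitDate, DebitTypeID, DebitPrice, DebitQuantity).
ROWS = [
    (40577, 1, 50, 3),
    (40577, 1, 100, 1),
    (40577, 2, 75, 2),
    (40578, 1, 50, 2),
    (40578, 2, 150, 2),
]

QUERY = """
SELECT D.DebitDate,
       D.DebitTypeID,
       SUM(D.DebitPrice * D.DebitQuantity) AS TotalSum,
       -- distinct debit types on the same date, whatever the current type is
       (SELECT COUNT(DISTINCT E.DebitTypeID)
        FROM DebitsList E
        WHERE E.DebitDate = D.DebitDate) AS CountDebits
FROM DebitsList D
GROUP BY D.DebitDate, D.DebitTypeID
ORDER BY D.DebitDate, D.DebitTypeID
"""


def debit_summary(rows):
    """Hypothetical helper: (date, type, total sum, distinct types that day)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DebitsList "
        "(DebitDate INTEGER, DebitTypeID INTEGER, DebitPrice INTEGER, DebitQuantity INTEGER)"
    )
    conn.executemany("INSERT INTO DebitsList VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in debit_summary(ROWS):
        print(row)
```

The totals come out as 250, 150, 100, and 300, each with a per-day debit count of 2, which agrees with the figures the question works out by hand.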
I think this can help you.
SELECT DebitDate, SUM(DebitPrice*DebitQuantity) AS TotalSum, Count(DebitDate) as DebitDateCount
FROM DebitsList where DebitTypeID = 1
GROUP BY DebitDate