Group By multiple columns with conditions Spark SQL - sql

Can anyone shed some light on how I should tackle this problem?
Current data:
Name | Code | Date       | Count
---------------------------------
A    | 1A   | 2020-05-03 | 34
A    | 1A   | 2020-04-02 | 25
B    | 3D   | 2021-04-23 | 24
C    | 2X   | 2021-04-01 | 01
C    | 2X   | 2021-03-31 | 01
Desired Output:
Name | Code | Date       | Count
---------------------------------
A    | 1A   | 2020-05-03 | 34
B    | 3D   | 2021-04-23 | 24
C    | 2X   | 2021-04-01 | 01
C    | 2X   | 2021-03-31 | 01
Output from my code:
Name | Code | Date       | Count
---------------------------------
A    | 1A   | 2020-05-03 | 34
B    | 3D   | 2021-04-23 | 24
C    | 2X   | 2021-04-01 | 01
Below is my code:
SELECT
  name,
  code,
  MAX(date) AS dates,
  MAX(Cases_Number) AS Max_Num
FROM (
  SELECT
    lhd_2010_name AS name,
    lhd_2010_code AS code,
    notification_date AS date,
    FLOOR(SUM(num)) AS Cases_Number
  FROM cases
  GROUP BY
    notification_date,
    lhd_2010_name,
    lhd_2010_code
  ORDER BY Cases_Number DESC, notification_date, lhd_2010_name DESC
) AS innertable
GROUP BY name, code ORDER BY Max_Num DESC
In the inner table I had to sum up the counts (they were all 1 in the raw data) with a GROUP BY on Name, Code, and Date to get the total counts. Then in the outer table I have to find the max count for each Name+Code combination. If the max count is tied within a Name+Code combination, we should output that row too.
I understand the reason for the missing row is that I used MAX(date), but this was the only way I could group by name and code while still showing the dates. If I group by name, code, and date, it shows all the other rows as well.
Thanks in Advance

Let's call your main table main. We can first group by name, code, and count to find the number of duplicates, which we alias countDup, and filter for countDup > 1. Basically, we need this kind of row:
|C |2X |1 |2021-04-01|
The code looks like this:
val ds2 = main.groupBy("name", "code", "count")
  .agg(count("*").alias("countDup"))
  .where(col("countDup").gt(1))
Preview of ds2:
+----+----+-----+--------+
|name|code|count|countDup|
+----+----+-----+--------+
|   C|  2X|    1|       2|
+----+----+-----+--------+
Then we join back with the main table (left join), add the maximum count per name/code partition as a ranking column, and filter for only the rows we want. Code:
main
  .join(ds2, Seq("name", "code", "count"), "left")
  .withColumn("ranking", expr("max(count) over (partition by name, code)"))
  .filter(col("countDup").isNotNull || col("count").equalTo(col("ranking")))
  .drop("countDup", "ranking")
  .orderBy("name")
Final output (ordered by name):
+----+----+-----+----------+
|name|code|count|date      |
+----+----+-----+----------+
|A   |1A  |34   |2020-05-03|
|B   |3D  |24   |2021-04-23|
|C   |2X  |1    |2021-04-01|
|C   |2X  |1    |2021-03-31|
+----+----+-----+----------+
I hope this is what you need!
SPARK SQL VERSION
First, we create the temp table:
main.createTempView("main")
Then apply the following SQL:
SELECT name, code, date, count
FROM (
  SELECT m.name, m.code, m.date, m.count, r.countDup,
         MAX(m.count) OVER (PARTITION BY m.name, m.code) AS ranking
  FROM main m
  LEFT JOIN (
    SELECT name, code, count, COUNT(*) AS countDup
    FROM main
    GROUP BY name, code, count
    HAVING COUNT(*) > 1
  ) r
  ON m.name = r.name AND m.code = r.code AND m.count = r.count
)
WHERE countDup > 0 OR count == ranking
ORDER BY name
Result is the same as above!
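As a footnote, here is a possibly simpler formulation, offered as a sketch rather than a tested answer: since the requirement is "the maximum count per name+code, keeping ties", RANK() expresses that directly, because every row tied for the highest count shares rank 1:
-- Sketch only: rank rows by count within each name+code group;
-- RANK() assigns 1 to all rows tied for the top count, so ties survive.
SELECT name, code, date, count
FROM (
  SELECT m.*,
         RANK() OVER (PARTITION BY name, code ORDER BY count DESC) AS rnk
  FROM main m
)
WHERE rnk = 1
ORDER BY name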

Related

Using COUNT and GROUP BY in Spark SQL

I'm trying to get pretty basic output that pulls unique NDC Codes for medications and counts the number of unique patients that take each drug. My dataset basically looks like this:
patient_id | drug_ndc
---------------------
01 | 250
02 | 725
03 | 1075
04 | 1075
05 | 250
06 | 250
I want the output to look something like this:
NDC | Patients
--------------
250 | 3
1075 | 2
725 | 1
I tried using some queries like this:
select distinct drug_ndc as NDC, count patient_id as Patients
from table 1
group by 1
order by 1
But I keep getting errors. I've tried with and without using an alias, but to no avail.
The correct syntax should be:
select drug_ndc as NDC, count(*) as Patients
from table 1
group by drug_ndc
order by 1;
SELECT DISTINCT is almost never appropriate with GROUP BY. And you can use COUNT(*) unless the patient id can be NULL.
To get the number of unique patients, you should do:
select drug_ndc as NDC, count(distinct patient_id) as Patients
from table 1
group by drug_ndc;

SQL Finding sum of rows and returning count of keys

For a database table looking something like this:
id | year | stint | sv
----+------+-------+---
mk1 | 2001 | 1 | 30
mk1 | 2001 | 2 | 20
ml0 | 1999 | 1 | 43
ml0 | 2000 | 1 | 44
hj2 | 1993 | 1 | 70
I want to get the following output:
count
-------
3
with the conditions being: count the number of ids that have sv > 40 in a single year later than 1994. If there is more than one stint for the same year, add up the sv points and check whether the total is > 40.
This is what I have written so far but it is obviously not right:
SELECT COUNT(DISTINCT id),
SUM(sv) as SV
FROM public.pitching
WHERE (year > 1994 AND sv >40);
I know the syntax is completely wrong and some of the conditions' information is missing but I'm not familiar enough with SQL and don't know how to properly do the summing of two rows in the same table with a condition (maybe with a subquery?). Any help would be appreciated! (using postgres)
You could use a nested query to get the aggregations, and wrap that for getting the count. Note that the condition on the sum must be in a having clause:
SELECT COUNT(id)
FROM (
  SELECT id, year, SUM(sv) AS sv
  FROM public.pitching
  WHERE year > 1994
  GROUP BY id, year
  HAVING SUM(sv) > 40
) years
If an id should only count once even if it fulfils the condition in more than one year, then do COUNT(DISTINCT id) instead of COUNT(id).
You can try like the following, using sum as a window function partitioned by id and year (so multiple stints in the same year are added up):
select count(distinct (id, year)) from
(
  select id, year, sum(sv) over (partition by id, year) s
  from public.pitching
  where year > 1994
) t where s > 40
Online Demo

How to get aggregate data from a dynamic number of related rows in adjacent table

I have a table of matches played, roughly looking like this:
player_id | match_id | result | opponent_rank
----------------------------------------------
82 | 2847 | w | 42
82 | 3733 | w | 185
82 | 4348 | l | 10
82 | 5237 | w | 732
82 | 5363 | w | 83
82 | 7274 | w | 6
51 | 2347 | w | 39
51 | 3746 | w | 394
51 | 5037 | l | 90
... | ... | ... | ...
To get all the winning streaks (not just the top streak by any player), I use this query:
SELECT player.tag, s.streak, match.date, s.player_id, s.match_id FROM (
SELECT streaks.streak, streaks.player_id, streaks.match_id FROM (
SELECT w1.player_id, max(w1.match_id) AS match_id, count(*) AS streak FROM (
SELECT w2.player_id, w2.match_id, w2.win, w2.date, sum(w2.grp) OVER w AS grp FROM (
SELECT m.player_id, m.match_id, m.win, m.date, (m.win = false AND LAG(m.win, 1, true) OVER w = true)::integer AS grp FROM matches_m AS m
WHERE m.opponent_position < '100'
WINDOW w AS (PARTITION BY m.player_id ORDER BY m.date, m.match_id)
) AS w2
WINDOW w AS (PARTITION BY w2.player_id ORDER BY w2.date, w2.match_id)
) AS w1
WHERE w1.win = true
GROUP BY w1.player_id, w1.grp
ORDER BY w1.player_id DESC, count(*) DESC
) AS streaks
ORDER BY streaks.streak DESC
LIMIT 100
) AS s
LEFT JOIN player ON player.id = s.player_id
LEFT JOIN match ON match.id = s.match_id
And the result looks like this (note that this is not a fixed table/view, as the query above can be extended by certain parameters such as nationality, date range, ranking of players, etc):
player_id | match_id | streak
-------------------------------
82 | 3733 | 2
82 | 7274 | 3
51 | 3746 | 2
... | ... | ...
What I want to add now is a bunch of aggregate data to provide details about the winning streaks. For starters, I'd like to know the average rank of the opponents during each of those streaks. Other data are the duration of the streak in time, first and last date, the name of the opponent who ended the streak (or whether it's still ongoing), and so on. I've tried various things - CTEs, some elaborate joins, unions, and adding them in as lag functions in the existing code - but I'm completely stuck on how to solve this.
As is obvious from the code, my SQL skills are very basic, so please excuse any mistakes or inefficient statements. For complete context, I'm using Postgres 9.4 on Debian, the matches_m table is a materialized view with 550k lines (query takes 2.5s right now). The data comes from http://aligulac.com/about/db/, I just mirror it to create the aforementioned view.
I think this does what you want.
The key idea is to assign a "streak group" to each winning streak, so you can aggregate them. You can do this by observing:
A match in a winning streak is obviously a "win".
A winning streak can be identified by counting the number of losses before it -- this is constant for the streak.
Postgres introduced the filter clause in 9.4, which makes the syntax a little easier:
select player_id, count(*) as streak_length,
avg(opponent_rank) as avg_opponent_rank
from (select m.*,
count(*) filter (where result = 'l') over (partition by player_id order by date) as streak_grp
from matches_m m
) m
where result = 'w'
group by player_id, streak_grp;
You need to get all rows for the highest streaks instead of an aggregated row.
This returns the top 100 streaks with details (would be easier to return all streaks over n instead).
SELECT ....
FROM
(
SELECT streaks.*,
-- used to filter the top 100 streaks
-- (would be more efficient without by filtering streaks only in Where)
Dense_Rank()
Over (ORDER BY streak DESC, grp, player_id) AS topStreak
FROM
(
SELECT w1.*,
Count(*)
Over (PARTITION BY player_id, grp) AS streak -- count wins per streak
FROM
( -- simplified assigning the group numbers to a single Cumulative Sum
SELECT m.player_id, m.match_id, m.win, m.DATE, --additional columns needed
-- cumulative sum over 0/1, doesn't increase for wins, i.e. a streak of wins gets the same number
Sum(CASE WHEN win = False THEN 1 ELSE 0 end)
Over(PARTITION BY m.player_id
ORDER BY DATE, match_id
ROWS Unbounded Preceding) AS grp
FROM matches_m AS m
WHERE m.opponent_position < '100' -- should be < 100 if it's an INT
) AS w1
WHERE w1.win = True -- remove the losses
) AS streaks
-- restrict the number of rows processed by the DENSE_RANK
-- (could be used instead of DENSE_RANK + WHERE topStreak <= 100)
WHERE streak > 20
) AS s
WHERE topStreak <= 100
Now you can apply any kind of analysis on those streaks. As PG is not my main DBMS I don't know if this is easier using Arrays or Window Functions like last_value(opponent_player_id) over ...
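For the follow-up aggregates the question asks about, here is a hedged sketch building on the streak_grp idea from the first answer (the column names date and opponent_rank are taken from the question's sample table): once each streak has its group number, any aggregate can ride along, e.g. average opponent rank plus the streak's first and last date:
select player_id, streak_grp,
       count(*)           as streak_length,
       avg(opponent_rank) as avg_opponent_rank,
       min(date)          as first_win,  -- streak duration = last_win - first_win
       max(date)          as last_win
from (select m.*,
             count(*) filter (where result = 'l')
                 over (partition by player_id order by date) as streak_grp
      from matches_m m
     ) m
where result = 'w'
group by player_id, streak_grp;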

SQL: Values in column listed twice after pivot

When querying a specific table, I need to change the structure of the result, making it so that all values from a given year are on the same row, in separate columns that identify the category that the value belongs to.
The table looks like this (example data):
year | category | amount
1991 | A of s | 56
1992 | A of s | 55
1993 | A of s | 40
1994 | A of s | 51
1995 | A of s | 45
1991 | Total | 89
1992 | Total | 80
1993 | Total | 80
1994 | Total | 81
1995 | Total | 82
The result I need is this:
year | a_of_s | total
1991 | 56 | 89
1992 | 55 | 80
1993 | 40 | 80
1994 | 51 | 81
1995 | 45 | 82
From what I can understand I need to use pivot. However, my problem seems to be that I don't understand pivot. I've attempted to adapt the queries of solutions in similar questions where pivot seems to be part of the answer, and so far what I've come up with is this:
SELECT year, [A of s], [Total] FROM table
pivot (
max(amount)
FOR category in ([A of s], [Total])
) pvt
ORDER BY year
This returns the correct table structure, but all cells in the columns a_of_s and total are NULL, and every year is listed twice. What am I missing to get the result I need?
EDIT: After fixing the errors pointed out in the comments, the only real issue that remains is that years in the year column are listed twice.
Possibly related: Is the aggregate function I use in pivot (max, sum, min, etc) arbitrary?
I assume you don't really need to pivot your table; with the result you require, you can take an alternative approach to achieve it.
This is the query I made that returns according to your requirement:
;With cte as
(
select year, Amount from tbl
where category = 'A of s'
)
select
tbl1.year, tbl2.Amount as A_of_S, tbl1.Amount as Total
from tbl as tbl1
inner join cte as tbl2 on tbl1.year = tbl2.year
where tbl1.category = 'Total'
and this is the SQL fiddle I created for you with your test data -> SQL fiddle
Much simpler answer:
WITH VTE AS(
SELECT *
FROM (VALUES (1991,'A of s',56),
(1992,'A of s',55),
(1993,'A of s',40),
(1994,'A of s',51),
(1995,'A of s',45),
(1991,'Total',89),
(1992,'Total',80),
(1993,'Total',80),
(1994,'Total',81),
(1995,'Total',82)) V([year],category, amount))
SELECT [year],
MAX(CASE category WHEN 'A of s' THEN amount END) AS [A of s],
MAX(CASE category WHEN 'Total' THEN amount END) AS Total
FROM VTE
GROUP BY [year];
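Two notes on the open points in the question, stated as my best understanding rather than a definitive answer: first, the aggregate used in a pivot is effectively arbitrary here, since each (year, category) pair holds a single amount, so MAX, MIN, and SUM all return the same value; second, the duplicated years most likely come from extra columns in the source table, because PIVOT implicitly groups by every column that is neither aggregated nor pivoted. Restricting the input to the three needed columns should fix it - a sketch, assuming SQL Server and a table named tbl as in the first answer:
SELECT [year], [A of s] AS a_of_s, [Total] AS total
FROM (
  -- keep only the pivot-relevant columns so nothing else splits the groups
  SELECT [year], category, amount FROM tbl
) AS src
PIVOT (MAX(amount) FOR category IN ([A of s], [Total])) AS pvt
ORDER BY [year];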

count calculation on the basis of date

I have this query where I am counting values from three different tables, getting two columns (date and count) from each of them: 3 records from the first, 12 from the second, and 12 from the third. Finally I am doing a little calculation on these count columns across all three tables wherever the date from the second column finds a match.
For example, if the date in the first table's column matches the second and third one, I add up all these values and take a percentage. Though I've created a query for it, it is not giving correct data.
I need to know how I can perform this calculation on the basis of date. I am using an Oracle DB.
SELECT TRUNC(anss.date),
       (a.count1 + b.count2 + c.count3) * 100 / d.count4 AS status
FROM
  (SELECT COUNT(pans.actual) count1,
          TRUNC(anss.date) AS subdate
   FROM pans, anss
   WHERE pans.actres = '1'
     AND TRUNC(anss.date) > sysdate - 100
     AND pans.id = anss.id
   GROUP BY TRUNC(anss.date)
  ) a,
  (SELECT COUNT(conans.actres) count2,
          TRUNC(anss.date) AS subdate
   FROM conans, anss
   WHERE conans.actres = '1'
     AND TRUNC(anss.date) > sysdate - 100
     AND conans.id = anss.id
   GROUP BY TRUNC(anss.date)
  ) b,
  (SELECT COUNT(anss.submitted) count3,
          TRUNC(anss.date) AS subdate
   FROM anss
   WHERE submitted = 1
     AND TRUNC(anss.date) > sysdate - 100
   GROUP BY TRUNC(anss.date)
  ) c,
  (SELECT COUNT(pans.actres) count4
   FROM pans, anss
   WHERE anss.date > sysdate - 100
  ) d,
  anss,
  pans
WHERE a.subdate = b.subdate
  AND b.subdate = c.subdate
  AND a.subdate = c.subdate
  AND TRUNC(anss.date) > sysdate - 100
GROUP BY TRUNC(anss.date), a.count1, b.count2, c.count3, d.count4
count1
------------------
count | date
3 | 12/12/1928
5 | 12/12/1998
6 | 12/12/1995
count2
------------------
count| date
3 | 12/12/1928
5 | 12/12/1998
6 | 12/12/1995
23 | 12/12/1924
56 | 12/12/1993
68 | 12/12/1992
39 | 12/12/1921
58 | 12/12/1990
63 | 12/12/1999
count3
------------------
count| date
3 | 12/12/1928
5 | 12/12/1998
6 | 12/12/1995
23 | 12/12/1924
56 | 12/12/1993
68 | 12/12/1992
39 | 12/12/1921
58 | 12/12/1990
63 | 12/12/1999
count4
------------------
4500
now I have to calculate
(count1 + count2 + count3) * 100 / count4
when count1.date = count2.date = count3.date
HTH
You take all anss records of the last 100 days and cross join pans. Is this cross join desired? (It would be better to use ANSI join syntax to avoid accidental cross joins.)
You find all dates that are present in all of a, b and c and join accordingly.
You cross join the a-b-c join with your anss-pans cross join. Again: do you really want to cross join here? You group by anss.date (date is a bad name, as it is a keyword, by the way), but what does anss.date have to do with the records retrieved?
Syntactically your statement looks correct, but I mistrust your joins :-)
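To make the ANSI join suggestion concrete, here is a hedged sketch of the same per-day calculation. It assumes the trailing anss/pans cross joins and the outer GROUP BY were unintentional and drops them, and it reads count4 from anss alone rather than from the pans cross join:
SELECT a.subdate,
       (a.count1 + b.count2 + c.count3) * 100 / d.count4 AS status
FROM (SELECT TRUNC(anss.date) AS subdate, COUNT(pans.actual) AS count1
      FROM pans JOIN anss ON pans.id = anss.id
      WHERE pans.actres = '1' AND TRUNC(anss.date) > sysdate - 100
      GROUP BY TRUNC(anss.date)) a
JOIN (SELECT TRUNC(anss.date) AS subdate, COUNT(conans.actres) AS count2
      FROM conans JOIN anss ON conans.id = anss.id
      WHERE conans.actres = '1' AND TRUNC(anss.date) > sysdate - 100
      GROUP BY TRUNC(anss.date)) b
  ON b.subdate = a.subdate
JOIN (SELECT TRUNC(anss.date) AS subdate, COUNT(anss.submitted) AS count3
      FROM anss
      WHERE anss.submitted = 1 AND TRUNC(anss.date) > sysdate - 100
      GROUP BY TRUNC(anss.date)) c
  ON c.subdate = a.subdate
CROSS JOIN (SELECT COUNT(*) AS count4
            FROM anss
            WHERE anss.date > sysdate - 100) d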