SQL: Count distinct column combinations with multiple conditions - sql

I have a joined table that looks like this where mytable contains date, storeid, item and units, whereas stores has state and storeid.
date state storeid item units
==============================================
2020-01-22 new york 712 a 5
2020-01-22 new york 712 b 7
2020-02-18 new york 712 c 0
2020-05-11 new york 518 b 9
2020-01-22 new york 518 b 10
2020-01-21 oregon 613 b 0
2020-02-13 oregon 613 b 9
2020-04-30 oregon 613 b 10
2020-01-22 oregon 515 c 3
And I am trying to create a column that counts the unique number of times that both storeid and item occur in a row where the date is between a given date range and units is greater than 0. Also, only need to select/group by state and the calculated column. I have something that looks like this:
select
s.state,
count(distinct case
when m.date between '2020-01-01' and '2020-03-31'
and m.units > 0
then m.storeid, m.item
end) as q1_total
from mytable as m
left join (select
state,
storeid
from stores) s
on m.storeid=s.storeid
group by s.state
I know my count function isn't written correctly, but I'm not sure how to fix it. Trying to get the end result to look like this.
state q1total
=========================
new york 3
oregon 2

With this query:
select distinct state, storeid, item
from mytable
where date between '2020-01-01' and '2020-03-31' and units > 0
you get all the distinct combinations of state, storeid and item under your conditions and you can aggregate on it:
select state, count(*) q1total
from (
select distinct state, storeid, item
from mytable
where date between '2020-01-01' and '2020-03-31' and units > 0
) t
group by state
See the demo.
Results:
> state | q1total
> :------- | ------:
> new york | 3
> oregon | 2

Related

Trying to count unique observations in SQL using Partition By

I have these two datasets:
Conditions: I would like to count the number of Unique Discharge_ID as Total_Discharges in my final dataset.
ICU_ID is a little bit more difficult. For PT_ID 001, what is happening is that PT 001 has 4 of the same discharge dates but 4 unique ICU_IDs. Since all of these ICU_IDs occur within 30 days of the Discharge_DT, I only want to count one of them. That is why total discharges for AZ is 1 and ICU_Admits = 1.
For PT_ID 002, I have 2 different Discharge_IDs but 1 ICU Admit that occurred within 30 days of both of the Discharge_IDs. I would like to count the Discharges as 2, and ICU_admits as 1.
DF1: Dataset of Discharges from hospital and admission to ICU within 30 days of Discharge_DT
City
PT_ID
Hospital_ID
Admit_Dt
Discharge_DT
Discharge_ID
ICU_ID
AZ
001
ABC
01-01-2021
01-03-2021
001,ABC,01-01-2021,01-03-2021
001,XYZ,01-05-2021,01-06-2021
AZ
001
ABC
01-01-2021
01-03-2021
001,ABC,01-01-2021,01-03-2021
001,XYZ,01-08-2021,01-09-2021
AZ
001
ABC
01-01-2021
01-03-2021
001,ABC,01-01-2021,01-03-2021
001,XYZ,01-11-2021,01-11-2021
AZ
001
ABC
01-01-2021
01-03-2021
001,ABC,01-01-2021,01-03-2021
001,XYZ,01-15-2021,01-16-2021
CA
002
DEF
04-03-2021
04-07-2021
001,ABC,04-03-2021,04-07-2021
002,LMN,04-27-2021,04-27-2021
CA
002
DEF
04-20-2021
04-21-2021
001,ABC,04-20-2021,04-21-2021
002,LMN,04-27-2021,04-27-2021
DF desired:
City
TotalDischarges
ICU_Admit
AZ
1
1
CA
2
1
Current Code:
DROP TABLE IF EXISTS #edit1
WITH CTE_df1 as (
select * from df1
)
select
City,
PT_ID,
Hospital_ID,
Admit_Dt,
Discharge_DT,
Discharge_ID,
count(ICU_ID) over (partition by ICU_ID) as ICU_Pts,
count(distinct Discharge_ID) as Total_Discharges
into #edit1
from CTE_df1
group by City, Discharge_ID, ICU_ID, PT_ID
order by City,
;with CTE_edit1 as (
select * from #edit1
)
select City, sum(ICU_Pts), sum(Total_Discharges)
from CTE_edit1
group by City
order by City
Current Output: PT_ID 001 works great but PT_ID 002 shows up at 2 in ICU_Admit as it is counting both as unique ICU visits.
City
TotalDischarges
ICU_Admit
AZ
1
1
CA
2
2
Any help would be appreciated

SQL Window Function over sliding time window

I have the following data:
country objectid objectuse
record_date
2022-07-20 chile 0 4
2022-07-01 chile 1 4
2022-07-02 chile 1 4
2022-07-03 chile 1 4
2022-07-04 chile 1 4
... ... ... ...
2022-07-26 peru 3088 4
2022-07-27 peru 3088 4
2022-07-28 peru 3088 4
2022-07-30 peru 3088 4
2022-07-31 peru 3088 4
The data describes the daily usage of an object within a country for a single month (July 2022), and not all object are used every day. One of the things I am interested in finding is the sum of the monthly maximums for the month:
WITH month_max AS (
SELECT
country,
objectid,
MAX(objectuse) AS maxuse
FROM mytable
GROUP BY
country,
objectid
)
SELECT
country,
SUM(maxuse)
FROM month_max
GROUP BY country;
Which results in this:
country sum
-------------
chile 1224
peru 17008
But what I actually want is to get the rolling sum of the maxima from the beginning of the month up to each date. So that I get something that looks like:
country sum
record_date
2022-07-01 chile 1
2022-07-01 peru 1
2022-07-02 chile 2
2022-07-02 peru 3
... ... ...
2022-07-31 chile 1224
2022-07-31 peru 17008
I tried using a window function like this to no avail:
SELECT
*,
SUM(objectuse) OVER (
PARTITION BY country
ORDER BY record_date ROWS 30 PRECEDING
) as cumesum
FROM mytable
order BY cumesum DESC;
Is there a way I can achieve the desired result in SQL?
Thanks in advance.
EDIT: For what it's worth, I asked the same question but on Pandas and I received an answer; perhaps it helps to figure out how to do it in SQL.
What ended up working is probably not the most efficient approach to this problem. I essentially created backwards looking blocks from each day in the month back towards the beginning of the month. Within each of these buckets I get the maximum of objectuse for each objectid within that bucket. After taking the max, I sum across all the maxima for that backward looking period. I do this for every day in the data.
Here is the query that does it:
WITH daily_lookback AS (
SELECT
A.record_date,
A.country,
B.objectid,
MAX(B.objectuse) AS maxuse
FROM mytable AS A
LEFT JOIN mytable AS B
ON A.record_date >= B.record_date
AND A.country = B.country
AND DATE_PART('month', A.record_date) = DATE_PART('month', B.record_date)
AND DATE_PART('year', A.record_date) = DATE_PART('year', B.record_date)
GROUP BY
A.record_date,
A.country,
B.objectid
)
SELECT
record_date,
country,
SUM(maxuse) AS usetotal
FROM daily_lookback
GROUP BY
record_date,
country
ORDER BY
record_date;
Which gives me exactly what I was looking for: the cumulative sum of the objectid maximums for the backward looking period, like this:
country sum
record_date
2022-07-01 chile 1
2022-07-01 peru 1
2022-07-02 chile 2
2022-07-02 peru 3
... ... ...
2022-07-31 chile 1224
2022-07-31 peru 17008
You need to change your inner query to use the windowed maximum:
WITH month_max AS (
SELECT record_date, country, objectid,
MAX(objectuse) over (PARTITION BY country, objectid ORDER BY record_date) AS mx
FROM mytable
)
SELECT record_date, country, SUM(mx) as "sum"
FROM month_max
GROUP BY record_date, country;
This does assume one row per object per date.
Here's a re-written version of your query. With indexing it seems possible that it might run faster:
select record_date, country, min(usetotal) as usetotal
from mytable d inner join lateral (
select distinct sum(max(objectuse)) over () as usetotal from mytable a
where a.record_date between date_trunc('month', d.record_date) and d.record_date
and a.country = d.country
group by objectid
) T on 1 = 1
group by record_date, country
order by record_date, country;
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=63760e30aecf4c885ec4967045b6cd03

SQL sum values for each ID

I have a dataset about trains, it's including a table for the customers information which is a number representing an age group and the amount of travellers for that age group.
The ID represents a location which has multiple departure times, which has multiple age groups.
The data looks something like this
StationID
Time of Departure
TravellerID
Amount of travellers
1
12:13
4001
30
1
12:13
4002
15
1
19:45
4001
10
1
19:45
4002
20
I want to sum the amount of travellers for each departure
I tried to code it this way:
SELECT StationID,[Time of Departure], sum(Amount)
FROM Train_Stations AS TS
INNER JOIN DepartureData AS DD
ON DD.FK_StationID = TS.PK_StationID
INNER JOIN CustomerInfo AS CI
ON CI.FK_StationID = TS.PK_StationID
GROUP BY StationID, [Time of Departure]
The result is like this:
StationID
Time of Departure
Amount
1
12:13
75
1
12:13
75
1
19:45
75
1
19:45
75
But I want it like this:
StationID
Time of Departure
Amount
1
12:13
45
1
19:45
30
Seems, you do something different.Based on your data query is correct
WITH CTE(StationID,DEPARTURE_TIME,TRAVELLERID,AMOUNT_OF_TRAVELLERS) AS
(
SELECT 1,CAST('12:13'AS TIME),4001,30 UNION ALL
SELECT 1,CAST('12:13'AS TIME),4002,15 UNION ALL
SELECT 1,CAST('19:45'AS TIME),4001,10 UNION ALL
SELECT 1,CAST('19:45'AS TIME),4002,20
)
SELECT C.StationID,C.DEPARTURE_TIME,SUM(AMOUNT_OF_TRAVELLERS)TOTAL_TRAVELLERS
FROM CTE AS C
GROUP BY C.StationID,C.DEPARTURE_TIME
You should specify the column as DD.StationID. It will return as an expected result.
SELECT DD.StationID,DD.[Time of Departure], sum(DD.Amount)
FROM Train_Stations AS TS
INNER JOIN DepartureData AS DD
ON DD.FK_StationID = TS.PK_StationID
INNER JOIN CustomerInfo AS CI
ON CI.FK_StationID = TS.PK_StationID
GROUP BY DD.StationID, DD.[Time of Departure]

Netezza add new field for first record value of the day in SQL

I'm trying to add new columns of first values of the day for location and weight.
For instance, the original data format is:
id dttm location weight
--------------------------------------------
1 1/1/20 11:10:00 A 40
1 1/1/20 19:07:00 B 41.1
2 1/1/20 08:01:00 B 73.2
2 1/1/20 21:00:00 B 73.2
2 1/2/20 10:03:00 C 74
I want each id to have only one day record, such as:
id dttm location weight
--------------------------------------------
1 1/1/20 11:10:00 A 40
2 1/1/20 08:01:00 B 73.2
2 1/2/20 10:03:00 C 74
I have other columns in my data set that I'm using location and weight to create, so I don't think I can just filter for 'first' records of the day.. Is it possible to write query to recognize first record of the day for those two columns and create new column with those values?
You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by id, ddtm::date order by dttm) as seqnum
from t
) t
where seqnum = 1;

Sum only for Employee ID's present in latest snapshot

I have a database with a row per month for each employee working in our company. So, if employee A has been working for our company from July 2016 till now, this person has approx. 24 rows (one row for each month she was in service).
I'm trying to summarize the experience each of the current employees have in a particular function. So, if employee A has worked 6 months in Sales and 18 months in Marketing, then I count the number of rows this employee has Sales or Marketing in the column indicating the function.
I have created a code which does seems to count the functional experience per employee, but it double counts data. It does not take the latest snapshot as starting point.
SELECT A.EMPLOYEE_ID,
SUM(CASE WHEN A.FUNCTION_CODE ='CUS' THEN 1 ELSE 0 END) AS EXP_CUS,
SUM(CASE WHEN A.FUNCTION_CODE ='MKT' THEN 1 ELSE 0 END) AS EXP_MKT
FROM [dbname].[AGL_V_HRA_FE_R].[VW_HRA_EMPLOYEE_DETAIL] AS A INNER JOIN [dbname].[AGL_V_HRA_FE_R].[VW_HRA_EMPLOYEE_DETAIL] AS B ON A.EMPLOYEE_ID = B.EMPLOYEE_ID
WHERE B.WORKLEVEL_CODE > '1'
GROUP BY A.EMPLOYEE_ID
I expected the output for employee A to be EXP_CUS = 6 and EXP_MKT = 18. Instead, the output for both is much higher as it is double counting rows. When I add the line AND B.SNAPSHOT_DATE = '2019-06-30', the output is correct. I don't like to manually adjust the code every month and rather refer to the latest snapshot date.
ADDED
The original table looks like this
SNAPSHOT_DATE | EMPLOYEE_ID | FUNCTION_CODE
2019-06-30 | 000000001 | CUS
2019-06-30 | 000000002 | MKT
2019-05-31 | 000000001 | CUS
2019-05-31 | 000000002 | MKT
2019-04-30 | 000000001 | MKT
2019-04-30 | 000000002 | MKT
The desired output would be
EMPLOYEE_ID | EXP_CUS | EXP_MKT
000000001 | 2 | 1
000000002 | 0 | 3
You can use PIVOT to get your desired result as below-
SELECT EMPLOYEE_ID,
ISNULL([CUS],0) AS [EXP_CUS],
ISNULL([MKT],0) AS [EXP_MKT]
FROM
(
SELECT EMPLOYEE_ID,FUNCTION_CODE,COUNT(SNAPSHOT_DATE) T
FROM your_table
GROUP BY EMPLOYEE_ID,FUNCTION_CODE
)P
PIVOT(
SUM(T)
FOR FUNCTION_CODE IN ([CUS],[MKT])
)PVT
Output is-
EMPLOYEE_ID EXP_CUS EXP_MKT
000000001 2 1
000000002 0 3
I don't understand why you are using a self join. This seems to do what you want:
SELECT ED.EMPLOYEE_ID,
SUM(CASE WHEN ED.FUNCTION_CODE ='CUS' THEN 1 ELSE 0 END) AS EXP_CUS,
SUM(CASE WHEN ED.FUNCTION_CODE ='MKT' THEN 1 ELSE 0 END) AS EXP_MKT
FROM [dbname].[AGL_V_HRA_FE_R].[VW_HRA_EMPLOYEE_DETAIL] ed
WHERE ED.WORKLEVEL_CODE > '1'
GROUP BY ED.EMPLOYEE_ID;
If you only want employees with the most recent snapshot date, then you can use window functions:
SELECT ED.EMPLOYEE_ID,
SUM(CASE WHEN ED.FUNCTION_CODE ='CUS' THEN 1 ELSE 0 END) AS EXP_CUS,
SUM(CASE WHEN ED.FUNCTION_CODE ='MKT' THEN 1 ELSE 0 END) AS EXP_MKT
(SELECT ED.*,
MAX(SNAPSHOT_DATE) OVER () as OVERALL_MAX_SNAPSHOT_DATE,
MAX(SNAPSHOT_DATE) OVER (PARTITION BY EMPLOYEE_ID) as EMPLOYEE_MAX_SNAPSHOT_DATE
FROM [dbname].[AGL_V_HRA_FE_R].[VW_HRA_EMPLOYEE_DETAIL] ED
) ED
WHERE ED.WORKLEVEL_CODE > '1' AND
EMPLOYEE_MAX_SNAPSHOT_DATE = OVERALL_MAX_SNAPSHOT_DATE
GROUP BY ED.EMPLOYEE_ID;