Grouped conditional sum in Oracle SQL - sql

my_table shows the account balance of each person's credits N months ago. From this table, I want to get the monthly sum of each person's balances for the past 2 and 3 months and divide each sum by 2 and 3 respectively (that is, a moving average of the sum of balance for the last 2 and 3 months).
Please note that I need the sum of the balance in the past M months divided by M months.
PERSON_ID CRED_ID MONTHS_BEFORE BALANCE
01 01 1 1100
01 01 2 1500
01 01 3 2000
01 02 1 50
01 02 2 400
01 02 3 850
02 06 1 300
02 06 2 320
02 11 1 7500
02 11 2 10000
One way to do this would be to:
select
person_id, sum(balance) / 2 as ma_2
from
my_table
where
months_before <= 2
group by
person_id
and merge this result with
select
person_id, sum(balance) / 3 as ma_3
from
my_table
where
months_before <= 3
group by
person_id
I want to know if this can be handled with a case or a conditional sum or something along these lines:
select
person_id,
sum(balance) over (partition by person_id when months_before <= 2) / 2 as ma_2,
sum(balance) over (partition by person_id when months_before <= 3) / 3 as ma_3
from
my_table
The desired result would look as follows:
PERSON_ID MA_2 MA_3
01 1525.00 1966.66
02 9060.00 9060.00

If these two queries give what you want and you need to merge them, then only ma_2 needs a conditional sum:
select person_id,
sum(case when months_before <= 2 then balance end) / 2 as ma_2,
sum(balance) / 3 as ma_3
from my_table
where months_before <= 3
group by person_id
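As a quick check of the conditional-sum query above, here is a minimal sketch that runs it against the question's sample rows. It uses Python's built-in sqlite3 module rather than Oracle, which is an assumption made only to keep the example self-contained. The divisors are written as 2.0 and 3.0 because SQLite performs integer division on integer operands.

```python
import sqlite3

# Sample rows from the question: (person_id, cred_id, months_before, balance).
ROWS = [
    ("01", "01", 1, 1100), ("01", "01", 2, 1500), ("01", "01", 3, 2000),
    ("01", "02", 1, 50), ("01", "02", 2, 400), ("01", "02", 3, 850),
    ("02", "06", 1, 300), ("02", "06", 2, 320),
    ("02", "11", 1, 7500), ("02", "11", 2, 10000),
]

# One pass over the table: the CASE restricts ma_2 to the last two months,
# while the WHERE clause already restricts ma_3 to the last three.
QUERY = """
select person_id,
       sum(case when months_before <= 2 then balance end) / 2.0 as ma_2,
       sum(balance) / 3.0 as ma_3
from my_table
where months_before <= 3
group by person_id
order by person_id
"""


def moving_averages():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table my_table "
        "(person_id text, cred_id text, months_before integer, balance integer)"
    )
    conn.executemany("insert into my_table values (?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in moving_averages():
        print(row)
```

Person 02 has only two months of rows in the sample, so dividing by a fixed 3 yields 6040 for ma_3 with this data.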

If you had a "month" column, you would use a window function:
select t.*,
avg(balance) over (partition by person_id
order by month
rows between 2 preceding and current row
) as avg_3month
from t;
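To illustrate the window-function variant, the sketch below runs the same frame clause with Python's sqlite3 module (window functions need SQLite 3.25 or later). The table contents and the integer "month" column are hypothetical, since the question's table only has months_before.

```python
import sqlite3

# Hypothetical monthly balances for one person; not taken from the question.
ROWS = [("A", 1, 100), ("A", 2, 200), ("A", 3, 300), ("A", 4, 400)]

# Average over the current row and the two rows before it, per person.
QUERY = """
select person_id, month,
       avg(balance) over (partition by person_id
                          order by month
                          rows between 2 preceding and current row) as avg_3month
from t
order by person_id, month
"""


def rolling_average():
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (person_id text, month integer, balance integer)")
    conn.executemany("insert into t values (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in rolling_average():
        print(row)
```

The first two rows of each partition are averaged over fewer than three months, because the frame simply contains fewer rows there.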

Related

Snowflake SQL - Count Distinct Users within descending time interval

I want to count the distinct users over the last 60 days, then the distinct users over the last 59 days, and so on.
Ideally, the output would look like this (TARGET OUTPUT)
Day Distinct Users
60 200
59 200
58 188
57 185
56 180
[...] [...]
where 60 days is the max total possible distinct users, and then 59 would have a little less and so on and so forth.
my query looks like this.
select
count(distinct (case when datediff(day,DATE,current_date) <= 60 then USER_ID end)) as day_60,
count(distinct (case when datediff(day,DATE,current_date) <= 59 then USER_ID end)) as day_59,
count(distinct (case when datediff(day,DATE,current_date) <= 58 then USER_ID end)) as day_58
FROM Table
The issue with my query is that it outputs the data by column instead of by row (as shown below) and, most importantly, I have to write out this logic 60 times, once for each of the 60 days.
Current Output:
Day_60 Day_59 Day_58
209 207 207
Is it possible to write the SQL in a way that creates the target as shown initially above?
Using below data in CTE format -
with data_cte(dates,userid) as
(select * from values
('2022-05-01'::date,'UID1'),
('2022-05-01'::date,'UID2'),
('2022-05-02'::date,'UID1'),
('2022-05-02'::date,'UID2'),
('2022-05-03'::date,'UID1'),
('2022-05-03'::date,'UID2'),
('2022-05-03'::date,'UID3'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID2'),
('2022-05-04'::date,'UID3'),
('2022-05-04'::date,'UID4'),
('2022-05-05'::date,'UID1'),
('2022-05-06'::date,'UID1'),
('2022-05-07'::date,'UID1'),
('2022-05-07'::date,'UID2'),
('2022-05-08'::date,'UID1')
)
Query to get all dates and count and distinct counts -
select dates,count(userid) cnt, count(distinct userid) cnt_d
from data_cte
group by dates;
DATES CNT CNT_D
2022-05-01 2 2
2022-05-02 2 2
2022-05-03 3 3
2022-05-04 5 4
2022-05-05 1 1
2022-05-06 1 1
2022-05-08 1 1
2022-05-07 2 2
Query to get difference of date from current date
select dates,datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates;
DATES DDIFF CNT CNT_D
2022-05-01 45 2 2
2022-05-02 44 2 2
2022-05-03 43 3 3
2022-05-04 42 5 4
2022-05-05 41 1 1
2022-05-06 40 1 1
2022-05-08 38 1 1
2022-05-07 39 2 2
Get only the records whose date difference is within a certain range by adding a HAVING clause -
select datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates
having ddiff<=43;
DDIFF CNT CNT_D
43 3 3
42 5 4
41 1 1
39 2 2
38 1 1
40 1 1
If you need to prefix 'day_' to each date diff, you can add an outer query over the previously fetched data set and add the prefix to the date diff column as follows -
I am using CTE syntax, but you may use a sub-query instead, given that you will select from a table -
,cte_1 as (
select datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates
having ddiff<=43)
select 'day_'||to_char(ddiff) days,
cnt,
cnt_d
from cte_1;
DAYS CNT CNT_D
day_43 3 3
day_42 5 4
day_41 1 1
day_39 2 2
day_38 1 1
day_40 1 1
Updated the answer to get distinct user count for number of days range.
A clause can be included in the final query to limit to number of days needed.
with data_cte(dates,userid) as
(select * from values
('2022-05-01'::date,'UID1'),
('2022-05-01'::date,'UID2'),
('2022-05-02'::date,'UID1'),
('2022-05-02'::date,'UID2'),
('2022-05-03'::date,'UID5'),
('2022-05-03'::date,'UID2'),
('2022-05-03'::date,'UID3'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID6'),
('2022-05-04'::date,'UID2'),
('2022-05-04'::date,'UID3'),
('2022-05-04'::date,'UID4'),
('2022-05-05'::date,'UID7'),
('2022-05-06'::date,'UID1'),
('2022-05-07'::date,'UID8'),
('2022-05-07'::date,'UID2'),
('2022-05-08'::date,'UID9')
),cte_1 as
(select datediff(day,dates,current_date()) ddiff,userid
from data_cte), cte_2 as
(select distinct ddiff from cte_1 )
select cte_2.ddiff,
(select count(distinct userid)
from cte_1 where cte_1.ddiff <= cte_2.ddiff) cnt
from cte_2
order by cte_2.ddiff desc
DDIFF CNT
47 9
46 9
45 9
44 8
43 5
42 4
41 3
40 1
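The correlated-subquery approach above can be tried locally with the sketch below, which uses Python's sqlite3 module instead of Snowflake. The sample rows are hypothetical, julianday() stands in for datediff(), and REF_DATE is an assumed fixed replacement for current_date so that the output does not change from day to day.

```python
import sqlite3

# Assumed reference date used in place of current_date.
REF_DATE = "2022-05-05"

# Small hypothetical sample: (dates, userid).
ROWS = [
    ("2022-05-01", "UID1"), ("2022-05-01", "UID2"),
    ("2022-05-02", "UID1"),
    ("2022-05-03", "UID3"),
    ("2022-05-04", "UID1"), ("2022-05-04", "UID4"),
]

# For every distinct day offset, count the distinct users seen within
# that many days of the reference date.
QUERY = """
with cte_1 as (
    select cast(julianday(:ref) - julianday(dates) as integer) as ddiff, userid
    from data_cte
), cte_2 as (
    select distinct ddiff from cte_1
)
select cte_2.ddiff,
       (select count(distinct userid)
        from cte_1 where cte_1.ddiff <= cte_2.ddiff) as cnt
from cte_2
order by cte_2.ddiff desc
"""


def distinct_users_by_range():
    conn = sqlite3.connect(":memory:")
    conn.execute("create table data_cte (dates text, userid text)")
    conn.executemany("insert into data_cte values (?, ?)", ROWS)
    result = conn.execute(QUERY, {"ref": REF_DATE}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in distinct_users_by_range():
        print(row)
```

Only day offsets that actually occur in the data produce a row, so a day with no activity is skipped rather than repeated with the previous count.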
You can do unpivot after getting your current output.
sample one.
select
*
from (
select
209 Day_60,
207 Day_59,
207 Day_58
)unpivot ( cnt for days in (Day_60,Day_59,Day_58));
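Where UNPIVOT is not available, the same reshaping can be sketched with UNION ALL. This is a swapped-in technique, not the Snowflake syntax above, and it is run here through Python's sqlite3 module purely as an assumption to keep the example self-contained.

```python
import sqlite3

# One branch per column turns the single wide row into three (days, cnt) rows.
QUERY = """
with wide as (
    select 209 as day_60, 207 as day_59, 207 as day_58
)
select 'DAY_60' as days, day_60 as cnt from wide
union all
select 'DAY_59', day_59 from wide
union all
select 'DAY_58', day_58 from wide
"""


def unpivot_rows():
    conn = sqlite3.connect(":memory:")
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in unpivot_rows():
        print(row)
```

Like the UNPIVOT version, this still needs one entry per day column, so it changes the shape of the output without removing the repetition.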

SQL - Snowflake - Inner Join not working as expected

I have a table ADS in Snowflake like so (data is inserted each day). Note that rows 3 and 4 are duplicate entries:
ID  REPORT_DATE  CLICKS  IMPRESSIONS
1   Jan 01       20      400
1   Jan 02       25      600
1   Jan 03       80      900
1   Jan 03       80      900
2   Jan 01       30      500
2   Jan 02       55      650
2   Jan 03       90      950
I want to select all entries based on ID with the max REPORT_DATE - essentially I want to know the latest number of CLICKS and IMPRESSIONS for each ID:
ID  REPORT_DATE  CLICKS  IMPRESSIONS
1   Jan 03       80      900
2   Jan 03       90      950
This query successfully gives me the max DATE for each ID:
SELECT
MAX(REPORT_DATE),
ID
FROM ADS
GROUP BY
ID;
Result:
ID  MAX(REPORT_DATE)
1   Jan 03
2   Jan 03
However, when I try to conduct an inner join, duplicates arise:
SELECT
a.ID,
a.REPORT_DATE,
a.CLICKS,
a.IMPRESSIONS
FROM ADS a
INNER JOIN (
SELECT
MAX(REPORT_DATE) AS REPORT_DATE,
ID
FROM ADS
GROUP BY
ID
) b
ON a.ID = b.ID
AND a.REPORT_DATE = b.REPORT_DATE;
Result:
ID  REPORT_DATE  CLICKS  IMPRESSIONS
1   Jan 03       80      900
1   Jan 03       80      900
2   Jan 03       90      950
How can I construct my query to remove these duplicates?
You could use QUALIFY and ROW_NUMBER():
SELECT a.ID,a.REPORT_DATE,a.CLICKS,a.IMPRESSIONS
FROM ADS a
QUALIFY ROW_NUMBER() OVER(PARTITION BY ID ORDER BY REPORT_DATE DESC) = 1;
Please note that ORDER BY REPORT_DATE is not stable (in case of a tie). I would suggest adding another sort column so that the ordering tuple is always unique.
If the rows that have a tie are the same it actually is not an issue.
You can use row_number() window function:
select id, report_date, clicks, impressions from
(
select id, report_date, clicks, impressions,
row_number() over (partition by id order by report_date desc) rn
from ADS
) t
where rn = 1
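The ROW_NUMBER() approach can be checked against the question's rows with the sketch below. It uses Python's sqlite3 module, which has no QUALIFY clause, so the subquery form is used. The year in report_date is a placeholder, and the extra sort keys are an assumed way to make the ordering deterministic on ties.

```python
import sqlite3

# Rows from the question; the duplicated (1, Jan 03) row is kept on purpose.
# The year 2022 is a placeholder, since the question only shows month and day.
ROWS = [
    (1, "2022-01-01", 20, 400), (1, "2022-01-02", 25, 600),
    (1, "2022-01-03", 80, 900), (1, "2022-01-03", 80, 900),
    (2, "2022-01-01", 30, 500), (2, "2022-01-02", 55, 650),
    (2, "2022-01-03", 90, 950),
]

# Number the rows of each id from newest to oldest and keep only the first.
QUERY = """
select id, report_date, clicks, impressions
from (
    select id, report_date, clicks, impressions,
           row_number() over (partition by id
                              order by report_date desc,
                                       clicks desc,
                                       impressions desc) as rn
    from ads
) t
where rn = 1
order by id
"""


def latest_per_id():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table ads (id integer, report_date text, clicks integer, impressions integer)"
    )
    conn.executemany("insert into ads values (?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_per_id():
        print(row)
```

Because exactly one row per id gets rn = 1, the duplicated Jan 03 row for id 1 appears only once in the result.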

Selecting records that have low numbers consecutively

I have a table as following (using bigquery):
id year month day rating
111 2020 11 30 4
111 2020 12 01 4
112 2020 11 30 5
113 2020 11 30 5
Is there a way in which I can select ids that have ratings that are consecutively (two or more consecutive records) low (low as in both records' ratings less than 4.5)?
For example, my desired output is:
id year month day rating
111 2020 11 30 4
111 2020 12 01 4
If you want all rows, then you need to look at both the previous rating and the next rating:
SELECT t.*
FROM (SELECT t.*,
LAG(rating) OVER (PARTITION BY id ORDER BY year, month, day ASC) AS prev_rating,
LEAD(rating) OVER (PARTITION BY id ORDER BY year, month, day ASC) AS next_rating,
FROM dataset.table t
) t
WHERE (rating < 4.5 and prev_rating < 4.5) OR
(rating < 4.5 and next_rating < 4.5)
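A minimal sketch of the LAG/LEAD approach, run with Python's sqlite3 module instead of BigQuery (an assumption for the sake of a self-contained example). The table name ratings and the extra row for id 114 are hypothetical. The id 114 row shows that a single low rating with no low neighbour is not selected.

```python
import sqlite3

# Rows from the question plus one hypothetical isolated low rating (id 114).
ROWS = [
    (111, 2020, 11, 30, 4.0),
    (111, 2020, 12, 1, 4.0),
    (112, 2020, 11, 30, 5.0),
    (113, 2020, 11, 30, 5.0),
    (114, 2020, 11, 30, 3.0),
]

# A row qualifies when it is low and its previous or next row
# (within the same id, in date order) is also low.
QUERY = """
select id, year, month, day, rating
from (
    select r.*,
           lag(rating)  over (partition by id order by year, month, day) as prev_rating,
           lead(rating) over (partition by id order by year, month, day) as next_rating
    from ratings r
) t
where rating < 4.5 and (prev_rating < 4.5 or next_rating < 4.5)
order by id, year, month, day
"""


def consecutive_low_rows():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table ratings (id integer, year integer, month integer, day integer, rating real)"
    )
    conn.executemany("insert into ratings values (?, ?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in consecutive_low_rows():
        print(row)
```

For the first and last row of each id, LAG or LEAD returns NULL, and a comparison with NULL is never true, so those sides are simply ignored.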
Below is for BigQuery Standard SQL
select * except(grp, seq_len)
from (
select *, sum(1) over(partition by id, grp) seq_len
from (
select *,
countif(rating >= 4.5) over(partition by id order by year, month, day) grp
from `project.dataset.table`
)
where rating < 4.5
)
where seq_len > 1
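The grouping idea can be sketched outside BigQuery as follows, using Python's sqlite3 module. SQLite has no COUNTIF, so a SUM over a CASE expression is swapped in. The table name ratings and the rows for ids 114 and 115 are hypothetical. The sketch measures run length per (id, grp), so that low ratings belonging to different ids are never counted as one run.

```python
import sqlite3

# Question rows plus hypothetical ids: 114 has one isolated low rating,
# 115 has two low ratings separated by a high one.
ROWS = [
    (111, 2020, 11, 30, 4.0), (111, 2020, 12, 1, 4.0),
    (112, 2020, 11, 30, 5.0),
    (113, 2020, 11, 30, 5.0),
    (114, 2020, 11, 30, 4.0),
    (115, 2020, 11, 28, 4.0), (115, 2020, 11, 29, 5.0), (115, 2020, 11, 30, 4.0),
]

# grp counts the high ratings seen so far within an id, so every run of
# consecutive low ratings shares one grp value. After the high rows are
# filtered out, the size of each (id, grp) partition is the run length.
QUERY = """
select id, year, month, day, rating
from (
    select g.*, count(*) over (partition by id, grp) as seq_len
    from (
        select r.*,
               sum(case when rating >= 4.5 then 1 else 0 end)
                   over (partition by id order by year, month, day) as grp
        from ratings r
    ) g
    where rating < 4.5
) s
where seq_len > 1
order by id, year, month, day
"""


def low_rating_runs():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table ratings (id integer, year integer, month integer, day integer, rating real)"
    )
    conn.executemany("insert into ratings values (?, ?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in low_rating_runs():
        print(row)
```

With this data only the two rows for id 111 are returned. Ids 114 and 115 are excluded because none of their low ratings are adjacent to another low rating.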

Insert zero values for unexisting groups in Redshift

I'm writing a simple query on Amazon Redshift as follows:
SELECT EXTRACT(year FROM created_at) AS year,
EXTRACT(month FROM created_at) AS month,
member_id,
COUNT(*) as pageviews
FROM TABLE
GROUP BY year,
month,
member_id
ORDER BY year,
month,
member_id
This gives me the following result as an example:
year month member_id pageviews
2015 1 100 29
2015 2 100 22
2015 3 100 178
2015 4 100 34
2015 1 200 56
2015 3 200 16
Here's the result I would like to have:
year month member_id pageviews
2015 1 100 29
2015 2 100 22
2015 3 100 178
2015 4 100 34
2015 1 200 56
2015 2 200 0
2015 3 200 16
2015 4 200 0
In the result above, notice the additional rows with zero pageviews.
How do I get this result? Any help would be much appreciated.
Use a cross join to generate the rows and then a left join to bring in the data:
SELECT ym.year,
ym.month,
m.member_id,
COUNT(t.member_id) as pageviews
FROM (SELECT DISTINCT EXTRACT(year FROM created_at) AS year, EXTRACT(month FROM created_at) AS month FROM TABLE) ym CROSS JOIN
(SELECT DISTINCT member_id FROM TABLE) m LEFT JOIN
TABLE t
ON EXTRACT(year FROM t.created_at) = ym.year AND
EXTRACT(month FROM t.created_at) = ym.month AND
t.member_id = m.member_id
GROUP BY ym.year, ym.month, m.member_id
ORDER BY ym.year, ym.month, m.member_id;
This assumes that all year/month combinations are included in the table.
If you have other tables that are better sources for members and the dates, try them -- that may be faster than SELECT DISTINCT.
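The cross join plus left join pattern can be tried with the sketch below. It runs on Python's sqlite3 module rather than Redshift, so strftime() stands in for EXTRACT(). The table name pageviews and its rows are hypothetical.

```python
import sqlite3

# Hypothetical page views: member 100 is active in months 1-4,
# member 200 only in months 1 and 3.
ROWS = [
    (100, "2015-01-05"), (100, "2015-02-10"), (100, "2015-03-15"), (100, "2015-04-20"),
    (200, "2015-01-07"), (200, "2015-01-08"), (200, "2015-03-02"),
]

# ym x m generates every (year, month, member) combination; the left join
# attaches the matching rows, and COUNT(t.member_id) ignores the NULLs
# produced for combinations without data, giving 0.
QUERY = """
select ym.year, ym.month, m.member_id, count(t.member_id) as pageviews
from (select distinct cast(strftime('%Y', created_at) as integer) as year,
                      cast(strftime('%m', created_at) as integer) as month
      from pageviews) ym
cross join (select distinct member_id from pageviews) m
left join pageviews t
  on cast(strftime('%Y', t.created_at) as integer) = ym.year
 and cast(strftime('%m', t.created_at) as integer) = ym.month
 and t.member_id = m.member_id
group by ym.year, ym.month, m.member_id
order by m.member_id, ym.year, ym.month
"""


def pageviews_with_zeros():
    conn = sqlite3.connect(":memory:")
    conn.execute("create table pageviews (member_id integer, created_at text)")
    conn.executemany("insert into pageviews values (?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in pageviews_with_zeros():
        print(row)
```

Counting t.member_id rather than * matters here: COUNT(*) would count the NULL-extended row and report 1 instead of 0 for the empty months.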

SQL NOOB - Oracle joins and Row Number

I was hoping to get some guidance on a SQL script I am trying to put together for Oracle database 11g.
I am attempting to perform a count of claims from the 'claim' table, and order them by year / month / and enterprise.
I was able to get a count of claims and order them like I would like, however I need to pull data from another table and I am having trouble combining the 'row_number' function with a join.
Here is my script so far:
SELECT TO_CHAR (SYSTEM_ENTRY_DATE, 'YYYY') YEAR,
TO_CHAR (SYSTEM_ENTRY_DATE, 'MM') MONTH,
ENTERPRISE_IID,
COUNT (*) CLAIMS
FROM (SELECT CLAIM.CLAIM_EID,
CLAIM.SYSTEM_ENTRY_DATE,
CLAIM.ENTERPRISE_IID,
ROW_NUMBER () OVER (PARTITION BY CLAIM.CLAIM_EID, CLAIM.ENTERPRISE_IID
ORDER BY CLAIM.SYSTEM_ENTRY_DATE DESC) RN
FROM CLAIM
WHERE CLAIM_IID IN (SELECT DISTINCT (CLAIM_IID)
FROM CLAIM_LINE
WHERE STATUS <> 'D')
AND CLAIM.CONTEXT = '1'
AND CLAIM.CLAIM_STATUS = 'A'
AND CLAIM.LAST_ANALYSIS_DATE IS NOT NULL)
WHERE RN = 1
GROUP BY ENTERPRISE_IID,
TO_CHAR (SYSTEM_ENTRY_DATE, 'YYYY'),
TO_CHAR (SYSTEM_ENTRY_DATE, 'MM');
So far all of my data is coming from the 'claim' table. This pulls the following result:
YEAR MONTH ENTERPRISE_IID CLAIMS
---- ----- -------------- ----------
2016 01 6 1
2015 08 6 3
2016 02 6 2
2015 09 6 2
2015 07 6 2
2015 09 5 22
2015 11 5 29
2015 12 5 27
2016 04 5 8
2015 07 5 29
2015 05 5 15
2015 06 5 5
2015 10 5 45
2016 03 5 54
2015 03 5 10
2016 02 5 70
2016 01 5 55
2015 08 5 32
2015 04 5 12
19 rows selected.
The enterprise_IID is the primary key on the 'enterprise' table. The 'enterprise' table also contains the 'name' attribute for each entry. I would like to join the claim and enterprise table in order to show the enterprise name for this count, and not the enterprise_IID.
As you can tell I am rather new to Oracle and SQL, and I am a bit stuck on this one. I was thinking that I should do an inner join between the two tables, but I am not quite sure how to do that when using the row_number function.
Or perhaps I am taking the wrong approach here, and someone could push me in another direction.
Here is what I tried:
SELECT TO_CHAR (SYSTEM_ENTRY_DATE, 'YYYY') YEAR,
TO_CHAR (SYSTEM_ENTRY_DATE, 'MM') MONTH,
ENTERPRISE_IID,
ENTERPRISE.NAME,
COUNT (*) CLAIMS
FROM (SELECT CLAIM.CLAIM_EID,
CLAIM.SYSTEM_ENTRY_DATE,
CLAIM.ENTERPRISE_IID,
ROW_NUMBER () OVER (PARTITION BY CLAIM.CLAIM_EID, CLAIM.ENTERPRISE_IID
ORDER BY CLAIM.SYSTEM_ENTRY_DATE DESC) RN
FROM CLAIM, enterprise
INNER JOIN ENTERPRISE
ON CLAIM.ENTERPRISE_IID = ENTERPRISE.ENTERPRISE_IID
WHERE CLAIM_IID IN (SELECT DISTINCT (CLAIM_IID)
FROM CLAIM_LINE
WHERE STATUS <> 'D')
AND CLAIM.CONTEXT = '1'
AND CLAIM.CLAIM_STATUS = 'A'
AND CLAIM.LAST_ANALYSIS_DATE IS NOT NULL)
WHERE RN = 1
GROUP BY ENTERPRISE.NAME,
ENTERPRISE_IID,
TO_CHAR (SYSTEM_ENTRY_DATE, 'YYYY'),
TO_CHAR (SYSTEM_ENTRY_DATE, 'MM');
Thank you in advance!
"Desired Output"
YEAR MONTH NAME CLAIMS
---- ----- ---- ----------
2016 01 Ent1 1
2015 08 Ent1 3
2016 02 Ent1 2
2015 09 Ent1 2
2015 07 Ent1 2
2015 09 Ent2 22
2015 11 Ent2 29
2015 12 Ent2 27
2016 04 Ent2 8
2015 07 Ent2 29
2015 05 Ent2 15
2015 06 Ent2 5
2015 10 Ent2 45
2016 03 Ent2 54
2015 03 Ent2 10
2016 02 Ent2 70
2016 01 Ent2 55
2015 08 Ent2 32
2015 04 Ent2 12
19 rows selected.
You can try this. Joins can be used when calculating row numbers with row_number function.
SELECT TO_CHAR (SYSTEM_ENTRY_DATE, 'YYYY') YEAR,
TO_CHAR (SYSTEM_ENTRY_DATE, 'MM') MONTH,
ENTERPRISE_IID,
NAME,
COUNT (*) CLAIMS
FROM (SELECT CLAIM.CLAIM_EID,
CLAIM.SYSTEM_ENTRY_DATE,
CLAIM.ENTERPRISE_IID,
ENTERPRISE.NAME,
ROW_NUMBER () OVER (PARTITION BY CLAIM.CLAIM_EID, CLAIM.ENTERPRISE_IID
ORDER BY CLAIM.SYSTEM_ENTRY_DATE DESC) RN
FROM CLAIM --, enterprise (this is not required as the table is being joined already)
INNER JOIN ENTERPRISE ON CLAIM.ENTERPRISE_IID = ENTERPRISE.ENTERPRISE_IID
INNER JOIN (SELECT DISTINCT CLAIM_IID FROM CLAIM_LINE WHERE STATUS <> 'D') CLAIM_LINE
ON CLAIM.CLAIM_IID = CLAIM_LINE.CLAIM_IID
WHERE CLAIM.CONTEXT = '1'
AND CLAIM.CLAIM_STATUS = 'A'
AND CLAIM.LAST_ANALYSIS_DATE IS NOT NULL) t
WHERE RN = 1
GROUP BY NAME, --ENTERPRISE.NAME (The alias ENTERPRISE is not accessible here.)
ENTERPRISE_IID,
TO_CHAR(SYSTEM_ENTRY_DATE, 'YYYY'),
TO_CHAR(SYSTEM_ENTRY_DATE, 'MM');
I'd write the query like this:
SELECT TO_CHAR(TRUNC(c.system_entry_date,'MM'),'YYYY') AS year
, TO_CHAR(TRUNC(c.system_entry_date,'MM'),'MM') AS month
, e.enterprise_name AS name
, COUNT(*) AS claims
FROM (
SELECT r.claim_eid
, r.enterprise_iid
, MAX(r.system_entry_date) AS system_entry_date
FROM ( SELECT DISTINCT l.claim_iid
FROM claim_line l
WHERE l.status <> 'D'
) d
JOIN claim r
ON r.claim_iid = d.claim_iid
AND r.context = '1'
AND r.claim_status = 'A'
AND r.last_analysis_date IS NOT NULL
GROUP
BY r.claim_eid
, r.enterprise_iid
) c
JOIN enterprise e
ON e.enterprise_iid = c.enterprise_iid
GROUP
BY c.enterprise_iid
, TRUNC(c.system_entry_date,'MM')
, e.enterprise_name
ORDER
BY e.enterprise_name
, TRUNC(c.system_entry_date,'MM')
A few notes:
I prefer to qualify ALL column references with the table name or short table alias, and assign aliases to all inline views.
Since the usage of ROW_NUMBER() appears to be to get the "latest" system_entry_date for a claim and eliminate duplicates, I'd prefer to use a GROUP BY and a MAX() aggregate.
I prefer to use a join operation rather than the IN (subquery) pattern. (Or, I would tend to use an EXISTS (correlated subquery) pattern.)
I don't think it matters too much if you use TO_CHAR or EXTRACT. The TO_CHAR gets you the leading zero in the month; I don't think EXTRACT(MONTH ) gets you the leading zero. I'd use whichever gets me closest to the resultset I need.
Personally, I would return just a single column, either containing the year and month as one string, e.g. TO_CHAR( , 'YYYYMM'), or just a DATE value. It all depends on what I'm going to be doing with that.
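To compare the ROW_NUMBER() and GROUP BY / MAX() ways of picking the latest row per claim, here is a minimal sketch using Python's sqlite3 module instead of Oracle. The rows are hypothetical, the CLAIM_LINE and status filters are left out for brevity, and the enterprise name column is assumed to be called name, as described in the question.

```python
import sqlite3

ENTERPRISES = [(5, "Ent2"), (6, "Ent1")]

# Hypothetical claims: C1 and C4 each have two versions with different dates.
CLAIMS = [
    ("C1", 6, "2016-01-10"), ("C1", 6, "2016-01-20"),
    ("C2", 6, "2016-02-01"),
    ("C3", 5, "2016-01-05"),
    ("C4", 5, "2016-01-06"), ("C4", 5, "2015-12-30"),
]

# Variant 1: keep the newest row of each claim via ROW_NUMBER().
LATEST_BY_ROW_NUMBER = """
select claim_eid, enterprise_iid, system_entry_date
from (select claim_eid, enterprise_iid, system_entry_date,
             row_number() over (partition by claim_eid, enterprise_iid
                                order by system_entry_date desc) as rn
      from claim)
where rn = 1
"""

# Variant 2: collapse each claim to its newest date via GROUP BY and MAX().
LATEST_BY_MAX = """
select claim_eid, enterprise_iid, max(system_entry_date) as system_entry_date
from claim
group by claim_eid, enterprise_iid
"""

# Shared outer query: join to enterprise and count claims per month and name.
MONTHLY_COUNTS = """
select strftime('%Y', c.system_entry_date) as year,
       strftime('%m', c.system_entry_date) as month,
       e.name,
       count(*) as claims
from ({latest}) c
join enterprise e on e.enterprise_iid = c.enterprise_iid
group by e.name, strftime('%Y', c.system_entry_date), strftime('%m', c.system_entry_date)
order by e.name, year, month
"""


def monthly_counts(latest_sql):
    conn = sqlite3.connect(":memory:")
    conn.execute("create table enterprise (enterprise_iid integer, name text)")
    conn.execute(
        "create table claim (claim_eid text, enterprise_iid integer, system_entry_date text)"
    )
    conn.executemany("insert into enterprise values (?, ?)", ENTERPRISES)
    conn.executemany("insert into claim values (?, ?, ?)", CLAIMS)
    result = conn.execute(MONTHLY_COUNTS.format(latest=latest_sql)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(monthly_counts(LATEST_BY_ROW_NUMBER))
    print(monthly_counts(LATEST_BY_MAX))
```

Both variants give the same counts here because only the date is needed from the latest row. If other columns of that row were required, the ROW_NUMBER() form would be the more direct choice.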
Just a hypothesis to start with, because the required query output is unclear:
SELECT
C.ENTERPRISE_IID,
E.ENTERPRISE_NAME,
extract(year from C.SYSTEM_ENTRY_DATE) SYSTEM_ENTRY_YEAR,
extract(month from C.SYSTEM_ENTRY_DATE) SYSTEM_ENTRY_MONTH,
count(distinct C.CLAIM_EID) CLAIM_COUNT
FROM
CLAIM C,
ENTERPRISE E
WHERE
C.CLAIM_IID IN (
SELECT DISTINCT (CLAIM_IID)
FROM CLAIM_LINE
WHERE STATUS <> 'D'
)
AND C.CONTEXT = '1'
AND C.CLAIM_STATUS = 'A'
AND C.LAST_ANALYSIS_DATE IS NOT NULL
AND E.ENTERPRISE_IID = C.ENTERPRISE_IID
GROUP BY
C.ENTERPRISE_IID,
E.ENTERPRISE_NAME,
extract(year from C.SYSTEM_ENTRY_DATE),
extract(month from C.SYSTEM_ENTRY_DATE)
ORDER BY
extract(year from C.SYSTEM_ENTRY_DATE),
extract(month from C.SYSTEM_ENTRY_DATE),
E.ENTERPRISE_NAME