Finding the average when values are missing using SQL - sql

I'm using Presto but any flavor of SQL will do.
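I have a table in the following format:
Group_id | event_id | month | party    | time_interval
======================================================
1        | 1        | Jan   | Player A | 1 hour
1        | 1        | Jan   | Player A | 2 hours
1        | 1        | Jan   | Player B | 1 hour
1        | 1        | Jan   | Player B | 1 hour
1        | 2        | Jan   | Player A | 3 hours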
I need to get the average per group_id, per month, per party.
Here's how the average should be calculated:
(total number of hours per group, per month, per party) / (total number of distinct events per group, per month)
Here's the output I should be expecting for clarity's sake:
Group_id | month | party    | avg_time_interval
===============================================
1        | Jan   | Player A | 3 hours
1        | Jan   | Player B | 1 hour
Now here's the tricky part. The first row makes perfect sense: Player A has 6 hours across both events, which we divide by 2 distinct events to get an average of 3.
For the 2nd row, however, the result should be 1 hour instead of 2. Player B has no time recorded for event 2, so the interval there should be treated as 0. There are still 2 unique events for that group_id and month, so the 2 hours total should be divided by 2, not by 1.
This missing data has made the problem more complicated than it should be. Otherwise, I believe the following would have solved it:
SELECT Group_id, month, party, total/num_cases
FROM (
    SELECT Group_id, month, party,
           SUM(time_interval) AS total,
           COUNT(DISTINCT event_id) AS num_cases
    FROM table
    GROUP BY Group_id, month, party
)

You can first count the distinct event_id values per group_id and month, then join that result back to your table as follows:
SELECT T.Group_id, T.month, T.party,
       SUM(T.time_interval) * 1.0 / MAX(D.eid) AS avg_time_interval
FROM tbl T
JOIN
(
    SELECT Group_id, month,
           COUNT(DISTINCT event_id) AS eid
    FROM tbl
    GROUP BY Group_id, month
) D
ON T.Group_id = D.Group_id
AND T.month = D.month
GROUP BY T.Group_id, T.month, T.party
ORDER BY T.Group_id, T.month, T.party
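To check the join approach against the sample data, here is a minimal sketch using Python's built-in sqlite3 module rather than Presto. It assumes time_interval is stored as a plain number of hours (the question shows it as text such as "1 hour"), and the table name tbl follows the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: time_interval is a numeric hour count, not text like "1 hour".
conn.execute(
    "CREATE TABLE tbl (group_id INT, event_id INT, month TEXT, "
    "party TEXT, time_interval REAL)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "Jan", "Player A", 1),
        (1, 1, "Jan", "Player A", 2),
        (1, 1, "Jan", "Player B", 1),
        (1, 1, "Jan", "Player B", 1),
        (1, 2, "Jan", "Player A", 3),
    ],
)

query = """
SELECT t.group_id, t.month, t.party,
       SUM(t.time_interval) * 1.0 / MAX(d.eid) AS avg_time_interval
FROM tbl t
JOIN (
    -- distinct events per group and month, regardless of party
    SELECT group_id, month, COUNT(DISTINCT event_id) AS eid
    FROM tbl
    GROUP BY group_id, month
) d
  ON t.group_id = d.group_id AND t.month = d.month
GROUP BY t.group_id, t.month, t.party
ORDER BY t.group_id, t.month, t.party
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the event count comes from a subquery grouped only by group_id and month, Player B is divided by 2 events even though no row exists for Player B in event 2.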

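select distinct Group_id
      ,month
      ,party
      ,total_hours_per_party / max(dns_rnk) over(partition by Group_id, month) as avg_time_interval
from (
      select Group_id
            ,month
            ,party
            ,sum(time_interval) over(partition by Group_id, month, party) as total_hours_per_party
            -- dense_rank numbers the distinct events, so its max is the distinct event count
            ,dense_rank() over(partition by Group_id, month order by event_id) as dns_rnk
      from t
     ) t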
Group_id | month | party    | avg_time_interval
===============================================
1        | Jan   | Player A | 3
1        | Jan   | Player B | 1

Related

Create column for rolling total for the previous month of a current rows date

Context
Using Presto syntax, I'm trying to create an output table that has rolling totals of an 'amount' column value for each day in a month. In each row there will also be a column with a rolling total for the previous month, and also a column with the difference between the totals.
Output Requirements
1. Completed: create a month_to_date_amount column that stores a rolling total of the amount column. The range for the rolling total is from the 1st of the month to the current row's date value, and the total restarts each month. I already have a working query below that creates this column.
SELECT
    *,
    SUM(amount) OVER (
        PARTITION BY
            team,
            month_id
        ORDER BY
            date ASC
    ) month_to_date_amount
FROM (
    SELECT -- this subquery is required to handle duplicate dates
        date,
        SUM(amount) AS amount,
        team,
        month_id
    FROM input_table
    GROUP BY
        date,
        team,
        month_id
) AS t
2. Create a prev_month_to_date_amount column that:
a. stores the previous month's rolling amount for the current row's date and team, added to the same output row.
b. returns 0 if there is no record matching the previous month's date (e.g. the previous month's date for March 31 is Feb 31, which does not exist). A record will also not exist for days that have no amount values. Example output table is below.
3. Create a movement column that stores the difference between the month_to_date_amount and prev_month_to_date_amount columns of the current row.
Question
Could someone assist with my 2nd and 3rd requirements above to achieve my desired output shown below? By either adding on to my current query above, or creating another more efficient one if necessary. A solution with multiple queries is fine.
Input Table
team | date       | amount | month_id
=====================================
A    | 2022-04-01 | 1      | 2022-04
A    | 2022-04-01 | 1      | 2022-04
A    | 2022-04-02 | 1      | 2022-04
B    | 2022-04-01 | 3      | 2022-04
B    | 2022-04-02 | 3      | 2022-04
B    | 2022-05-01 | 4      | 2022-05
B    | 2022-05-02 | 4      | 2022-05
C    | 2022-05-01 | 1      | 2022-05
C    | 2022-05-02 | 1      | 2022-05
C    | 2022-06-01 | 5      | 2022-06
C    | 2022-06-02 | 5      | 2022-06
This is a good use case for the window function LAG. In summary, the query partitions the data by team and day of month, then uses LAG to fetch the previous month's amount and calculate the movement value.
For example, for the Team B data the window function creates two partitions: one with the Team B 2022-04-01 and 2022-05-01 rows, and one with the Team B 2022-04-02 and 2022-05-02 rows, each ordered by date. Then, for each row in each partition, LAG reads the previous row (if one exists), which gives the previous month's amount and allows the movement to be calculated.
I hope this helps.
with
totals as
(
    select
        *,
        sum(amount) over(
            partition by team, month_id
            order by date, team) monthToDateAmount
    from
    (   select
            date,
            sum(amount) as amount,
            team,
            month_id
        from input_table
        group by
            date,
            team,
            month_id
    ) as x
),
totalsWithMovement as
(
    select
        *,
        monthToDateAmount
            - coalesce(lag(monthToDateAmount) over(
                partition by team, day(date(date))
                order by team, date), 0)
            as movement,
        coalesce(lag(monthToDateAmount) over(
            partition by team, day(date(date))
            order by team, month_id), 0)
            as prevMonthToDateAmount
    from
        totals
)
select
    date, amount, team, monthToDateAmount,
    prevMonthToDateAmount, movement
from
    totalsWithMovement
order by
    team, date;
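The LAG approach can be tried on the input table with the following sketch, which uses Python's built-in sqlite3 module instead of Presto. As an assumption for SQLite, the day of month is extracted with strftime('%d', date) in place of day(date(date)), and the column names are written in snake_case. Note that LAG returns the previous row in the partition, which is the previous month only when no month is skipped for that team and day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input_table (team TEXT, date TEXT, amount INT, month_id TEXT)"
)
conn.executemany(
    "INSERT INTO input_table VALUES (?, ?, ?, ?)",
    [
        ("A", "2022-04-01", 1, "2022-04"), ("A", "2022-04-01", 1, "2022-04"),
        ("A", "2022-04-02", 1, "2022-04"), ("B", "2022-04-01", 3, "2022-04"),
        ("B", "2022-04-02", 3, "2022-04"), ("B", "2022-05-01", 4, "2022-05"),
        ("B", "2022-05-02", 4, "2022-05"), ("C", "2022-05-01", 1, "2022-05"),
        ("C", "2022-05-02", 1, "2022-05"), ("C", "2022-06-01", 5, "2022-06"),
        ("C", "2022-06-02", 5, "2022-06"),
    ],
)

query = """
WITH daily AS (
    -- collapse duplicate dates first
    SELECT date, team, month_id, SUM(amount) AS amount
    FROM input_table
    GROUP BY date, team, month_id
),
totals AS (
    SELECT *,
           SUM(amount) OVER (
               PARTITION BY team, month_id ORDER BY date
           ) AS month_to_date_amount
    FROM daily
),
with_prev AS (
    SELECT *,
           COALESCE(LAG(month_to_date_amount) OVER (
               PARTITION BY team, strftime('%d', date) ORDER BY date
           ), 0) AS prev_month_to_date_amount
    FROM totals
)
SELECT team, date, month_to_date_amount, prev_month_to_date_amount,
       month_to_date_amount - prev_month_to_date_amount AS movement
FROM with_prev
ORDER BY team, date
"""
rows = conn.execute(query).fetchall()
team_b = [row for row in rows if row[0] == "B"]
for row in rows:
    print(row)
```

For Team B, the May rows pick up the April month-to-date values for the same day of month, while the April rows fall back to 0 because no earlier row exists.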

Retrieve Customers with a Monthly Order Frequency greater than 4

I am trying to optimize the query below to fetch all customers who have placed 4 or more orders in each of the past three months.
Customer ID | Feb | Mar | Apr
=============================
0001        | 4   | 5   | 6
0002        | 3   | 2   | 4
0003        | 4   | 2   | 3
In the above table, only the customer with Customer ID 0001 should be picked, as they consistently have 4 or more orders in every month.
Below is the query I have written. It pulls all customers with an average purchase frequency of 4 or more over the last 90 days, but it does not check that there were 4 or more purchases in each of the last three months.
Query:
SELECT distinct lines.customer_id Customer_ID, (COUNT(lines.order_id)/90) PurchaseFrequency
from fct_customer_order_lines lines
LEFT JOIN product_table product
ON lines.entity_id= product.entity_id
AND lines.vendor_id= product.vendor_id
WHERE LOWER(product.country_code)= "IN"
AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
AND lines.date < CURRENT_DATE()
GROUP BY Customer_ID
HAVING PurchaseFrequency >=4;
I tried to use window functions, but I'm not sure whether they are needed in this case.
I would count the orders per month instead of computing the average, and then retrieve the customers whose count is 4 or more in each of the last three months.
I also think you should select your interval using "month(CURRENT_DATE()) - 3" instead of a 90-day window. If needed, you should handle the case where the current date falls in January, February or March, and go back to October, November and December of the previous year.
I'm not familiar with Google BigQuery so I can't write your query, but I hope this helps.
So I've found the solution to this using a WITH clause, as below:
WITH filtered_orders AS (
    select
        customer_id ID,
        extract(MONTH from date) Order_Month,
        count(order_id) CountofOrders
    from customer_order_lines lines
    where EXTRACT(YEAR FROM date) = 2022 AND EXTRACT(MONTH FROM date) IN (2,3,4)
    group by ID, Order_Month
    having CountofOrders >= 4)
select ID
from filtered_orders
group by ID
having count(Order_Month) = 3;
Hope this helps!
One option is to first count the orders by month and then keep only the customers whose count is at or above your threshold in every month they purchased:
WITH ORDERS_BY_MONTH AS (
    SELECT
        DATE_TRUNC(lines.date, MONTH) PurchaseMonth,
        lines.customer_id Customer_ID,
        COUNT(lines.order_id) PurchaseFrequency
    FROM fct_customer_order_lines lines
    LEFT JOIN product_table product
        ON lines.entity_id = product.entity_id
        AND lines.vendor_id = product.vendor_id
    -- UPPER rather than LOWER: a lower-cased value can never equal "IN"
    WHERE UPPER(product.country_code) = "IN"
        AND lines.date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
        AND lines.date < CURRENT_DATE()
    GROUP BY PurchaseMonth, Customer_ID
)
SELECT
    Customer_ID,
    AVG(PurchaseFrequency) AvgPurchaseFrequency
FROM ORDERS_BY_MONTH
GROUP BY Customer_ID
HAVING COUNT(1) = COUNTIF(PurchaseFrequency >= 4)
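The count-per-month idea can be checked against the sample table with this sketch, written for Python's built-in sqlite3 module rather than BigQuery. The orders table, its column names, and the generated order dates are hypothetical, built only so that the monthly counts match the Feb/Mar/Apr table in the question. It also requires all three months to be present, which the COUNTIF variant above does not enforce on its own.

```python
import sqlite3

# Monthly order counts for Feb, Mar, Apr 2022, taken from the question's table.
counts = {"0001": (4, 5, 6), "0002": (3, 2, 4), "0003": (4, 2, 3)}

order_rows = []
order_id = 0
for customer_id, per_month in counts.items():
    for month, n in zip((2, 3, 4), per_month):
        for day in range(1, n + 1):  # one hypothetical order per day
            order_id += 1
            order_rows.append((customer_id, order_id, f"2022-{month:02d}-{day:02d}"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, order_id INT, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", order_rows)

query = """
WITH orders_by_month AS (
    SELECT customer_id,
           strftime('%Y-%m', order_date) AS purchase_month,
           COUNT(order_id) AS purchase_frequency
    FROM orders
    WHERE order_date >= '2022-02-01' AND order_date < '2022-05-01'
    GROUP BY customer_id, strftime('%Y-%m', order_date)
)
SELECT customer_id
FROM orders_by_month
GROUP BY customer_id
-- every one of the three months must exist and reach the threshold
HAVING COUNT(*) = 3 AND MIN(purchase_frequency) >= 4
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Using MIN(purchase_frequency) >= 4 is one way to express "every month meets the threshold"; a customer with a single weak month is dropped.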

Find row that has the largest sum

I have data for tutors. I recorded hours spent tutoring by month in the SESSION table. I need to know who had the most tutoring hours in March, 2006.
TABLE TUTOR
tutorID
1
2
TABLE SESSION
tutorID Hours Month
1 2 March
1 1 March
2 1 March
Expected Output:
TutorID
1
I would suggest:
select top 1 sum(Hours), tutorID
from SESSION
where Month like 'March'
group by tutorID
order by sum(Hours) DESC
Use 2 CTEs.
The 1st will return all the sums for each tutor.
The 2nd will return the maximum of the sums returned by the 1st cte.
Finally your select statement will return only the tutors from the 1st cte that have sum of hours equal to that maximum returned by the 2nd cte.
with
sumcte as (
    select tutorID, sum(hours) sumhours
    from session
    where month = 'March' -- here there should be another condition for the year?
    group by tutorID
),
maxcte as (
    select max(sumhours) maxhours from sumcte
)
select tutorid from sumcte
where sumhours = (select maxhours from maxcte)
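The two-CTE approach can be run against the sample SESSION rows with the sketch below, using Python's built-in sqlite3 module as a stand-in for the actual database. The sample data has no year column, so the year filter mentioned in the comment above is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (tutorID INT, hours INT, month TEXT)")
conn.executemany(
    "INSERT INTO session VALUES (?, ?, ?)",
    [(1, 2, "March"), (1, 1, "March"), (2, 1, "March")],
)

query = """
WITH sumcte AS (
    -- total hours per tutor for the month
    SELECT tutorID, SUM(hours) AS sumhours
    FROM session
    WHERE month = 'March'
    GROUP BY tutorID
),
maxcte AS (
    -- the largest of those totals
    SELECT MAX(sumhours) AS maxhours FROM sumcte
)
SELECT tutorID
FROM sumcte
WHERE sumhours = (SELECT maxhours FROM maxcte)
ORDER BY tutorID
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Unlike the top 1 query, this form returns every tutor tied for the maximum rather than an arbitrary one of them.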

Get the first appearance of each value in a hive query

I have an event table, every row has:
event-id (primary key)
user-id
item-id
day
The same item-id can appear on different days. For each item-id, I need to find the first day on which it appears and count all of its occurrences on that day.
For example
event-id user-id item-id day
1 pp a 2015/05/01
2 df a 2015/05/01
3 pp b 2015/05/02
3 al a 2015/05/02
I want the following result:
day item-id count
2015/05/01 a 2
2015/05/02 b 1
I'm using this query:
SELECT
min(day) as day,
item_id,
count (event_id) as count
FROM
events
GROUP BY
day,
item_id;
but it doesn't work correctly.
Here is one method:
SELECT e.*
FROM (SELECT day, item_id, count(*) as cnt,
             MIN(day) OVER (PARTITION BY item_id) as minday
      FROM events
      GROUP BY day, item_id
     ) e
WHERE day = minday;
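Here is a sketch of the same idea on the sample rows, using Python's built-in sqlite3 module rather than Hive. As an assumption, the grouping and the window function are split into two nested steps, which is equivalent here and avoids relying on an engine accepting both in one SELECT. The sample uses event ids 1 to 4, since the question's table repeats id 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id INT PRIMARY KEY, user_id TEXT, "
    "item_id TEXT, day TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "pp", "a", "2015/05/01"),
        (2, "df", "a", "2015/05/01"),
        (3, "pp", "b", "2015/05/02"),
        (4, "al", "a", "2015/05/02"),
    ],
)

query = """
SELECT day, item_id, cnt
FROM (
    SELECT day, item_id, cnt,
           -- earliest day for each item, repeated on every row of that item
           MIN(day) OVER (PARTITION BY item_id) AS minday
    FROM (
        SELECT day, item_id, COUNT(*) AS cnt
        FROM events
        GROUP BY day, item_id
    ) AS per_day
) AS e
WHERE day = minday
ORDER BY item_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The row for item a on 2015/05/02 is dropped because its day differs from the item's earliest day.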

SQL Grouping with No Duplicates

Here is the output. No problem here, it is exactly what I want. I added DISTINCT ID to remove duplicates, and that works within each grouped month.
MN | CNT
====================
1 | 1
10 | 2
11 | 5
12 | 5
SELECT EXTRACT(MONTH FROM TRUNC(HDATE)) as MN, COUNT(DISTINCT ID) as CNT
FROM Schema.TRAVEL
WHERE (ARR = '2' OR ARR = '3')
AND
HDATE BETWEEN to_date('2015-10-01', 'yyyy-mm-dd') AND to_date('2016-09-30', 'yyyy-mm-dd')
GROUP BY EXTRACT(MONTH FROM TRUNC(HDATE));
But I can still have duplicates that span more than one month. If I have a record in October and another in November with the same ID, I want to count it only once. That is my issue.
Over the course of a year, or any time period, an ID should be counted only once, but I still need to keep the monthly groupings and output. How can I do this?
In other words, you want to count each id in the first month where it appears.
SELECT EXTRACT(MONTH FROM TRUNC(HDATE)) as MN, COUNT(DISTINCT ID) as CNT
FROM (SELECT id, MIN(HDATE) as HDATE
      FROM Schema.TRAVEL t
      WHERE ARR IN ('2', '3') AND
            HDATE BETWEEN DATE '2015-10-01' AND DATE '2016-09-30'
      GROUP BY id
     ) t
GROUP BY EXTRACT(MONTH FROM TRUNC(HDATE));
Note: If an id appears before '2015-10-01', this will still count the id in the first month it appears after that date. If you don't want such an id counted at all, move the HDATE comparison to the outer query.
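The "count each id in its first month" approach can be illustrated with the sketch below, using Python's built-in sqlite3 module rather than Oracle. The travel rows are hypothetical sample data, dates are stored as ISO text, and strftime('%m', ...) stands in for EXTRACT(MONTH FROM ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE travel (id INT, arr TEXT, hdate TEXT)")
conn.executemany(
    "INSERT INTO travel VALUES (?, ?, ?)",
    [
        (1, "2", "2015-10-05"),
        (1, "2", "2015-11-10"),  # same id again in November
        (2, "3", "2015-10-20"),
        (3, "2", "2015-11-01"),
        (3, "3", "2015-12-15"),  # same id again in December
        (4, "1", "2015-12-01"),  # excluded by the ARR filter
    ],
)

query = """
SELECT CAST(strftime('%m', first_hdate) AS INTEGER) AS mn,
       COUNT(*) AS cnt
FROM (
    -- one row per id, carrying only its earliest date in the range
    SELECT id, MIN(hdate) AS first_hdate
    FROM travel
    WHERE arr IN ('2', '3')
      AND hdate BETWEEN '2015-10-01' AND '2016-09-30'
    GROUP BY id
) AS t
GROUP BY CAST(strftime('%m', first_hdate) AS INTEGER)
ORDER BY mn
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Ids 1 and 3 each appear in two months but are counted once, in October and November respectively, so December has no row at all in this sample.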