How to determine number of days in a month in Presto? - sql

I have data with date, userid, and amount. I want to calculate sum(amount) divided by the total number of days in each month, with the final result presented on a monthly basis.
The table I have looks like this:
date userid amount
2019-01-01 111 10
2019-01-15 112 20
2019-01-20 113 10
2019-02-01 114 30
2019-02-15 111 20
2019-03-01 115 40
2019-03-23 155 50
The desired result is like this:
date avg_qty_sold
Jan-19 1.29
Feb-19 1.79
Mar-19 2.90
avg_qty_sold comes from sum(amount) / total days in the respective month.
E.g. for Jan 2019 the sum of amount is 40 and the number of days in January is 31, so avg_qty_sold is 40/31 = 1.29.
Currently I'm using CASE WHEN for this. Is there a better approach?

Since Presto 318, this is as easy as:
SELECT day(last_day_of_month(some_date))
See https://trino.io/docs/current/functions/datetime.html#last_day_of_month
Before Presto 318, you can combine date_trunc with EXTRACT:
date_trunc('month', date_value) gives the beginning of the month, while date_add('month', 1, date_trunc('month', date_value)) gives the beginning of the next month
subtracting the two date values returns an interval day to second
EXTRACT(DAY FROM interval) returns the day portion of that interval. You can also use the day convenience function instead of EXTRACT(DAY FROM ...); the EXTRACT syntax is more verbose but more standard.
presto:default> SELECT
-> date_value,
-> EXTRACT(DAY FROM (
-> date_add('month', 1, date_trunc('month', date_value)) - date_trunc('month', date_value)))
-> FROM (VALUES DATE '2019-01-15', DATE '2019-02-01') t(date_value);
date_value | _col1
------------+-------
2019-01-15 | 31
2019-02-01 | 28
(2 rows)
A less natural, but slightly shorter, alternative is to get the day number of the last day of the given month with day(date_add('day', -1, date_add('month', 1, date_trunc('month', date_value)))):
presto:default> SELECT
-> date_value,
-> day(date_add('day', -1, date_add('month', 1, date_trunc('month', date_value))))
-> FROM (VALUES DATE '2019-01-15', DATE '2019-02-01') t(date_value);
date_value | _col1
------------+-------
2019-01-15 | 31
2019-02-01 | 28
(2 rows)
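Applying either variant to the original question, a minimal sketch (untested; it assumes the table is named t and the date column is called "date", on Presto/Trino 318 or later):
SELECT
date_trunc('month', "date") AS month,
-- min("date") is some date inside the month, so last_day_of_month(min("date"))
-- gives the number of days in that month; the CAST avoids integer division
CAST(sum(amount) AS double) / day(last_day_of_month(min("date"))) AS avg_qty_sold
FROM t
GROUP BY date_trunc('month', "date")
ORDER BY 1;
If the Jan-19 style labels are needed, date_format can be layered on top of the truncated month.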

Related

SQL - POSTGRES - DATE_PART why is sql resulting in week 53 when it should be week 2

TABLE
INSERT INTO runners
("runner_id", "registration_date")
VALUES
(1, '2021-01-01'),
(2, '2021-01-03'),
(3, '2021-01-08'),
(4, '2021-01-15');
SQL Query
SELECT
DATE_PART('WEEK', R.registration_date) AS week_num,
COUNT(runner_id)
FROM
pizza_runner.runners R
GROUP BY
week_num
ORDER BY
week_num ASC;
I was expecting the query to return weeks 1 and 2 only but for some reason I am getting 53.
The documentation does a good job explaining the ISO rules for weeks - which Postgres follows:
The number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year. In other words, the first Thursday of a year is in week 1 of that year.
Using your dataset:
SELECT r.*,
extract(week from registration_date) AS week_num,
extract(isodow from registration_date) as day_of_week
FROM runners r
ORDER BY registration_date;
runner_id | registration_date | week_num | day_of_week
----------+-------------------+----------+------------
1         | 2021-01-01        | 53       | 5
2         | 2021-01-03        | 53       | 7
3         | 2021-01-08        | 1        | 5
4         | 2021-01-15        | 2        | 5
It turns out that January 3rd, 2021 was a Sunday (day of week 7). January 4th, 2021 was a Monday, and according to the ISO rules this is when the first week of that year began. Earlier dates (January 3rd, 2nd, 1st, and so on) belong to the last week of 2020 (week 53), even though those dates fall in calendar year 2021.
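If what you actually want is weeks counted from January 1st (Jan 1-7 = week 1, Jan 8-14 = week 2, and so on) rather than ISO weeks, one option is to bucket by the number of days since the start of the year. A rough sketch, assuming registration_date is a date column:
SELECT
-- date minus date yields an integer number of days, so integer division by 7
-- puts Jan 1-7 into week 1, Jan 8-14 into week 2, and so on
(registration_date - date_trunc('year', registration_date)::date) / 7 + 1 AS week_num,
COUNT(runner_id)
FROM pizza_runner.runners
GROUP BY week_num
ORDER BY week_num;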

Calculating difference (or deltas) between current and previous row with clickhouse

It would be awesome if there was a way to index rows during a query.
Is there a way to SELECT (compute) the difference of a single column between consecutive rows?
Let's say, something like the following query
SELECT
toStartOfDay(stamp) AS day,
count(day) AS events,
day[current] - day[previous] AS difference, -- how do I calculate this
day[current] / day[previous] AS percent -- and this
FROM records
GROUP BY day
ORDER BY day
I want to get the integer and percentage difference between the current row's 'events' column and the previous one for something similar to this:
day                 | events | difference | percent
--------------------+--------+------------+--------
2022-01-06 00:00:00 | 197    | NULL       | NULL
2022-01-07 00:00:00 | 656    | 459        | 3.32
2022-01-08 00:00:00 | 15     | -641       | 0.02
2022-01-09 00:00:00 | 7      | -8         | 0.46
2022-01-10 00:00:00 | 137    | 130        | 19.5
My version of ClickHouse doesn't support window functions but, while reading up on the LAG() function mentioned in the comments, I found neighbor(), which works perfectly for what I'm trying to do:
SELECT
toStartOfDay(stamp) AS day,
count(day) AS events,
(events - neighbor(events, -1)) AS diff,
(events / neighbor(events, -1)) AS perc
FROM records
GROUP BY day
ORDER BY day
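For reference, on ClickHouse versions that do support window functions (21.3 and later), a roughly equivalent sketch using lagInFrame instead of neighbor() could look like this (untested):
SELECT
day,
events,
events - lagInFrame(events) OVER (ORDER BY day ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS diff,
events / lagInFrame(events) OVER (ORDER BY day ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS perc
FROM
(
-- pre-aggregate per day, then compare consecutive days in the outer query
SELECT toStartOfDay(stamp) AS day, count() AS events
FROM records
GROUP BY day
)
ORDER BY day
Note that lagInFrame returns 0 (not NULL) for the first row, so diff and perc for the earliest day will differ from the NULLs shown in the desired output.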

Total revenue of an account for the preceding 12 months - Redshift SQL

My question is about SQL: I am looking to find the total revenue of a parent account for the last 12 months.
The data will look something like this
revenue | name | month  | year
--------+------+--------+-----------
10000   | abc  | 201001 | 2010-01-12
10000   | abc  | 201402 | 2014-02-14
2000    | abc  | 201404 | 2014-04-12
3000    | abc  | 201406 | 2014-06-30
30000   | def  | 201301 | 2013-01-14
6000    | def  | 201304 | 2013-04-12
9000    | def  | 201407 | 2013-07-19
And the output should be something like this
revenue | name | month  | year       | Running Sum
--------+------+--------+------------+------------
10000   | abc  | 201001 | 2010-01-12 | 10000
10000   | abc  | 201402 | 2014-02-14 | 10000
2000    | abc  | 201404 | 2014-04-12 | 12000
3000    | abc  | 201406 | 2014-06-30 | 15000
30000   | def  | 201301 | 2013-01-14 | 30000
6000    | def  | 201304 | 2013-04-12 | 36000
9000    | def  | 201407 | 2013-07-19 | 45000
I have tried using a windowing function, something like this, which expresses the logic that I need:
select revenue, name, date, month,
sum(revenue) over (partition by name order by month rows between '12 months' preceding AND CURRENT ROW )
from table
but the above command gives a syntax error
Redshift does not support intervals in the window frame specification.
So, convert to a number. A convenient one in this case is the number of months since some point in time:
select revenue, name, date, month,
sum(revenue) over (partition by name
order by datediff(month, '1900-01-01', month)
range between 12 preceding and current row
)
from table;
I will note that your logic adds up data from 13 months, not 12. I suspect you want between 11 preceding and current row.
You can use rows between if you have data for all months:
sum(revenue) over (partition by name
order by datediff(month, '1900-01-01', month)
rows between 12 preceding and current row
)
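Putting that together, a sketch of the rows-based variant with an 11-month lookback (assuming at most one row per name and month, and keeping the placeholder table name from the question):
select revenue, name, date, month,
sum(revenue) over (partition by name
order by datediff(month, '1900-01-01', month)
-- the current row plus the 11 preceding rows covers exactly 12 months
rows between 11 preceding and current row
) as running_sum
from table;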

Calculate Churn by aggregating by date range in SQL

I am trying to calculate the churn rate from data that has customer_id, group, and date. The aggregation is going to be by id, group, and date. The churn formula is (customers in previous cohort - customers in last cohort) / customers in previous cohort
customers in previous cohort refers to the cohort from the 28-day window before that
customers in last cohort refers to the cohort from the last 28 days
I am not sure how to aggregate by date range to calculate the churn.
Here is sample data that I copied from SQL Group by Date Range:
Date Group Customer_id
2014-03-01 A 1
2014-04-02 A 2
2014-04-03 A 3
2014-05-04 A 3
2014-05-05 A 6
2015-08-06 A 1
2015-08-07 A 2
2014-08-29 XXXX 2
2014-08-09 XXXX 3
2014-08-10 BB 4
2014-08-11 CCC 3
2015-08-12 CCC 2
2015-03-13 CCC 3
2014-04-14 CCC 5
2014-04-19 CCC 4
2014-08-16 CCC 5
2014-08-17 CCC 3
2014-08-18 XXXX 2
2015-01-10 XXXX 3
2015-01-20 XXXX 4
2014-08-21 XXXX 5
2014-08-22 XXXX 2
2014-01-23 XXXX 3
2014-08-24 XXXX 2
2014-02-25 XXXX 3
2014-08-26 XXXX 2
2014-06-27 XXXX 4
2014-08-28 XXXX 1
2014-08-29 XXXX 1
2015-08-30 XXXX 2
2015-09-31 XXXX 3
The goal is to calculate the churn rate every 28 days between 2014 and 2015 using the formula given above, i.e. aggregate the data into rolling 28-day windows and compute the churn from those aggregates.
Here is what I tried to aggregate the data by date range:
SELECT COUNT(distinct customer_id) AS count_ids, Group,
DATE_SUB(CAST(Date AS DATE), INTERVAL 56 DAY) AS Date_min,
DATE_SUB(CURRENT_DATE, INTERVAL 28 DAY) AS Date_max
FROM churn_agg
GROUP BY count_ids, Group, Date_min, Date_max
I hope someone can help me with the aggregation and churn calculation. I simply want to subtract each aggregated count_ids from the next aggregated count_ids 28 days later, i.e. successive differences of the same column (count_ids). I am not sure whether I need a rolling window or a simple aggregation to find the churn.
As corrected by #jarlh, it's not 2015-09-31 but 2015-09-30
You can use this to create 28 days calendar:
create table daysby28 (i int, _Date date);
insert into daysby28 (i, _Date)
SELECT i, cast('01-01-2014'as date) + i*INTERVAL '28 day'
from generate_series(0,50) i
order by 1;
After creating the churn_agg table from the fiddle #jarlh shared, this query gives you what you want:
with cte as
(
select count(Customer) as TotalCustomer, Cohort, CohortDateStart From
(
select distinct a.Customer_id as Customer, b.i as Cohort, b._Date as CohortDateStart
from churn_agg a left join daysby28 b on a._Date >= b._Date and a._Date < b._Date + INTERVAL '28 day'
) a
group by Cohort, CohortDateStart
)
select a.CohortDateStart,
1.0*(b.TotalCustomer - a.TotalCustomer)/(1.0*b.TotalCustomer) as Churn from cte a
left join cte b on a.cohort > b.cohort
and not exists(select 1 from cte c where c.cohort > b.cohort and c.cohort < a.cohort)
order by 1
The fiddle with everything put together is here
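As an aside, the previous-cohort lookup in the final select (the self-join plus NOT EXISTS) can also be written with LAG(), if the database supports window functions. A sketch, meant to be appended after the same cte as above:
select CohortDateStart,
-- lag(TotalCustomer) is the customer count of the immediately preceding cohort;
-- the first cohort has no predecessor, so its Churn comes out NULL
1.0 * (lag(TotalCustomer) over (order by Cohort) - TotalCustomer)
/ lag(TotalCustomer) over (order by Cohort) as Churn
from cte
order by CohortDateStart;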

GROUP BY several hours

I have a table where our product records its activity log. The product starts working at 23:00 every day and usually works for one or two hours. This means that a batch started at 23:00 finishes around 1:00am the next day.
Now, I need statistics on how many posts are registered per batch, but I cannot figure out a script that would let me achieve this. So far I have the following SQL code:
SELECT COUNT(*), DATEPART(DAY,registrationtime),DATEPART(HOUR,registrationtime)
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
ORDER BY DATEPART(DAY, registrationtime), DATEPART(HOUR,registrationtime)
which results in following
count day hour
....
1189 9 23
8611 10 0
2754 10 23
6462 11 0
1885 11 23
I.e. I want the number for 9th 23:00 grouped with the number for 10th 00:00, 10th 23:00 with 11th 00:00 and so on. How could I do it?
You can do it very easily. Use DATEADD to add an hour to the original registrationtime. If you do so, all registration times belonging to one batch fall on the same day, and you can simply group by the day part.
You could also do it in a more complicated way using CASE WHEN, but that is overkill given this easy solution.
I had to do something similar a few days ago. I had fixed time spans for work shifts to group by, where one of them could start at 10pm on one day and end the next morning at 6am.
What I did was:
Define a "shift date", which was simply the date (with zero time) on which the shift started, for every entry in the table. I did this by checking whether the timestamp of the entry was between 0am and 6am; in that case I took only the date part of DATEADD(dd, -1, entryDate), which returns the previous day for all entries between 0am and 6am.
I also added an ID for the shift: 0 for the first one (6am to 2pm), 1 for the second one (2pm to 10pm) and 2 for the last one (10pm to 6am).
I was then able to group over the shift date and shift ID (see the sketch after the example below).
Example:
Consider the following source entries:
Timestamp SomeData
=============================
2014-09-01 06:01:00 5
2014-09-01 14:01:00 6
2014-09-02 02:00:00 7
Step one extended the table as follows:
Timestamp SomeData ShiftDay
====================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00
2014-09-01 14:01:00 6 2014-09-01 00:00:00
2014-09-02 02:00:00 7 2014-09-01 00:00:00
Step two extended the table as follows:
Timestamp SomeData ShiftDay ShiftID
==============================================================
2014-09-01 06:01:00 5 2014-09-01 00:00:00 0
2014-09-01 14:01:00 6 2014-09-01 00:00:00 1
2014-09-02 02:00:00 7 2014-09-01 00:00:00 2
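A sketch of that derivation in T-SQL (untested; the table name SourceTable is a placeholder for wherever the entries live):
SELECT
[Timestamp],
SomeData,
-- entries between 0am and 6am belong to the shift that started the previous day
CAST(CASE WHEN DATEPART(HOUR, [Timestamp]) < 6
THEN DATEADD(dd, -1, [Timestamp])
ELSE [Timestamp]
END AS date) AS ShiftDay,
-- 0 = 6am to 2pm, 1 = 2pm to 10pm, 2 = 10pm to 6am
CASE WHEN DATEPART(HOUR, [Timestamp]) >= 6 AND DATEPART(HOUR, [Timestamp]) < 14 THEN 0
WHEN DATEPART(HOUR, [Timestamp]) >= 14 AND DATEPART(HOUR, [Timestamp]) < 22 THEN 1
ELSE 2
END AS ShiftID
FROM SourceTable;
Grouping by ShiftDay and ShiftID then yields one row per shift.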
If you add one hour to registrationtime, you will be able to group by the date part:
GROUP BY
CAST(DATEADD(HOUR, 1, registrationtime) AS date)
If the starting hour must be reflected accurately in the output (as 9, 23, 10, 23 rather than as 10, 0, 11, 0), you could obtain it as MIN(registrationtime) in the SELECT clause:
SELECT
count = COUNT(*),
day = DATEPART(DAY, MIN(registrationtime)),
hour = DATEPART(HOUR, MIN(registrationtime))
Finally, in case you are not aware, you can reference columns by their aliases in ORDER BY:
ORDER BY
day,
hour
just so that you do not have to repeat the expressions.
The query below will give you what you are expecting:
;WITH CTE AS
(
SELECT COUNT(*) AS [Count], DATEPART(DAY, registrationtime) AS [Day], DATEPART(HOUR, registrationtime) AS [Hour],
RANK() OVER (PARTITION BY DATEPART(HOUR, registrationtime) ORDER BY DATEPART(DAY, registrationtime), DATEPART(HOUR, registrationtime)) AS Batch_ID
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY DATEPART(DAY, registrationtime), DATEPART(HOUR, registrationtime)
)
SELECT SUM([Count]) AS [Count], Batch_ID
FROM CTE
GROUP BY Batch_ID
ORDER BY Batch_ID
You can write CASE expressions as below:
CASE WHEN DATEPART(HOUR, registrationtime) = 23
THEN DATEPART(DAY, registrationtime) + 1 -- roll 23:00 into the next day
ELSE DATEPART(DAY, registrationtime)
END,
CASE WHEN DATEPART(HOUR, registrationtime) = 23
THEN 0 -- treat 23:00 as hour 0 of that next day
ELSE DATEPART(HOUR, registrationtime)
END
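For completeness, a rough sketch of a full query built on the first of those expressions (untested; like the original day/hour grouping, it ignores month boundaries). Grouping by the adjusted day alone is enough to count each 23:00 batch together with the following day's early-morning rows:
SELECT COUNT(*) AS cnt,
CASE WHEN DATEPART(HOUR, registrationtime) = 23
THEN DATEPART(DAY, registrationtime) + 1
ELSE DATEPART(DAY, registrationtime)
END AS batch_day
FROM RegistrationMessageLogEntry
WHERE registrationtime > '2014-09-01 20:00'
GROUP BY CASE WHEN DATEPART(HOUR, registrationtime) = 23
THEN DATEPART(DAY, registrationtime) + 1
ELSE DATEPART(DAY, registrationtime)
END
ORDER BY batch_day;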