Count grouping by specific date intervals - sql

I want to count the number of "services" a customer has in each 30-day period since the contract start day. In other words, I have to count the services in month-long periods that begin on the customer's start date. Simplified, the table looks something like this:
services
------------------
id serial
id_customer bigint
service_date date
Let's imagine there is only one type of service. I solved it like this:
SELECT
    DATE_PART('year', service_date) || '-' ||
    CASE WHEN DATE_PART('day', service_date) >= 15 THEN
        DATE_PART('month', service_date)
    ELSE
        CASE WHEN DATE_PART('month', service_date) = 1 THEN
            12
        ELSE
            DATE_PART('month', service_date) - 1
        END
    END bill, count(id)
FROM services
WHERE id_customer = 1
GROUP BY bill
results would be
bill | count
-------------------
2019-02 | 2455333
In the example the start date for id_customer 1 is 2019-02-15 but for 2019-02 period I will be counting the services until 2019-03-14.
What I want to know is: is there a better or more efficient solution?
I saw the solution here, but it implies an INNER JOIN with a GROUP BY on the same table, which I think would be slower because my table has a lot of records.

You don't need to worry about the actual number of days in a month nor the month, year or day-of-month.
Just use the start date for a customer and let PostgreSQL generate the correct billing cycle periods for you.
To run a single query over all customers, I have used a separate table with the customer id as well as a billing_start date configured, for which we can then run a query such as the following:
WITH
periods (id, period_start, period_end) AS (
    SELECT
        id,
        generate_series(billing_start, current_date, '1 month'::interval)::date,
        (generate_series(billing_start, current_date, '1 month'::interval) + '1 month'::interval)::date
    FROM test_customers
),
data AS (
    SELECT
        periods.id AS customer,
        period_start,
        count(test_services.*) AS service_calls
    FROM periods INNER JOIN test_services ON (test_services.id_customer = periods.id)
    WHERE test_services.service_date >= periods.period_start AND test_services.service_date < periods.period_end
    GROUP BY 1, 2
)
SELECT customer, to_char(period_start, 'YYYY-MM') AS bill, service_calls
FROM data
ORDER BY 1, 2
;
...resulting in an output such as the following:
customer | bill | service_calls
----------+---------+---------------
1 | 2018-12 | 382736
1 | 2019-01 | 382735
1 | 2019-02 | 345696
2 | 2018-12 | 382736
2 | 2019-01 | 382734
2 | 2019-02 | 234580
3 | 2018-12 | 382734
3 | 2019-01 | 382736
3 | 2019-02 | 123463
4 | 2018-12 | 382734
4 | 2019-01 | 382736
4 | 2019-02 | 12346
5 | 2019-01 | 382735
5 | 2019-02 | 283965
6 | 2019-01 | 382735
6 | 2019-02 | 172848
7 | 2019-01 | 382734
7 | 2019-02 | 61732
8 | 2019-02 | 333351
9 | 2019-02 | 222234
10 | 2019-02 | 111117
(21 rows)
Complete online example: https://rextester.com/IHLJ95398
An important thing to note for this to be fast is a multi-column index on id_customer and service_date because that's where the counting takes place, which then can be done without sorting:
CREATE INDEX idx_svc_customer_date ON test_services (id_customer, service_date);
(otherwise, the sorting will most likely be done on disk, rather than in memory for large data sets)
If you just want the cycles for a single customer, use it like this:
WITH
periods (id, period_start, period_end) AS (
    SELECT
        id,
        generate_series(billing_start, current_date, '1 month'::interval)::date,
        (generate_series(billing_start, current_date, '1 month'::interval) + '1 month'::interval)::date
    FROM test_customers WHERE id = 4
),
data AS (
    SELECT
        periods.id AS customer,
        period_start,
        count(test_services.*) AS service_calls
    FROM periods INNER JOIN test_services ON (test_services.id_customer = periods.id)
    WHERE test_services.service_date >= periods.period_start AND test_services.service_date < periods.period_end
    GROUP BY 1, 2
)
-- only one customer is selected, so the customer column is left out of the output
SELECT to_char(period_start, 'YYYY-MM') AS bill, service_calls
FROM data
ORDER BY 1
;
...giving:
bill | service_calls
---------+---------------
2018-12 | 382734
2019-01 | 382736
2019-02 | 12346
(3 rows)
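To check the period boundaries outside the database, the same bucketing can be sketched in Python. This is only an illustration under stated assumptions: periods are half-open intervals starting at the billing start plus k whole months, the day is clamped to the end of short months, and the helper names add_months and count_by_billing_period are made up here. For start days 29 to 31 the clamping may differ from what generate_series produces.

```python
import calendar
from collections import Counter
from datetime import date


def add_months(d, months):
    """Return d shifted by whole months, clamping the day to the month's end."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def count_by_billing_period(billing_start, service_dates):
    """Count services per period [start + k months, start + (k + 1) months)."""
    counts = Counter()
    for service_date in service_dates:
        if service_date < billing_start:
            continue  # before the contract started, so no period applies
        # First guess from the calendar-month difference, then step back one
        # period if that period starts after the service date.
        k = (service_date.year - billing_start.year) * 12 + (
            service_date.month - billing_start.month
        )
        if add_months(billing_start, k) > service_date:
            k -= 1
        counts[add_months(billing_start, k).strftime("%Y-%m")] += 1
    return dict(counts)
```

With a start date of 2019-02-15, services on 2019-02-15 and 2019-03-14 both land in the 2019-02 period, while one on 2019-03-15 opens the 2019-03 period.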

Related

Aggregate columns based on different conditions?

I have a Teradata query that generates:
customer | order | amount | days_ago
123 | 1 | 50 | 2
123 | 1 | 50 | 7
123 | 2 | 10 | 19
123 | 3 | 100 | 35
234 | 4 | 20 | 20
234 | 5 | 10 | 10
With performance in mind, what’s the most efficient way to produce an output per customer where orders is the number of distinct orders a customer had within the last 30 days and total is the sum of the amount of the distinct orders regardless of how many days ago the order was placed?
Desired output:
customer | orders | total
123 | 2 | 160
234 | 2 | 30
Given your rules, maybe it takes two steps - de-duplicate first then aggregate:
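SELECT customer,
       SUM(CASE WHEN days_ago <= 30 THEN 1 ELSE 0 END) AS orders,
       SUM(amount) AS total
FROM
    (SELECT customer, "order",              -- ORDER is a reserved word, so it is quoted
            MAX(amount) AS amount,          -- MIN gives the same result when duplicate rows carry the same amount
            MIN(days_ago) AS days_ago       -- most recent row per order; use MAX to judge an order by its oldest row
     FROM your_relation
     GROUP BY 1, 2) AS DistinctCustOrder
GROUP BY 1;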
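The two steps (de-duplicate, then aggregate) can be traced on the sample rows with a small Python sketch. It assumes, as one reading of the rules, that duplicate rows of an order keep the largest amount and the smallest days_ago; the function name summarize and the window_days parameter are illustrative only.

```python
# (customer, order, amount, days_ago), copied from the question
rows = [
    (123, 1, 50, 2),
    (123, 1, 50, 7),
    (123, 2, 10, 19),
    (123, 3, 100, 35),
    (234, 4, 20, 20),
    (234, 5, 10, 10),
]


def summarize(rows, window_days=30):
    # Step 1: keep one entry per (customer, order).
    distinct = {}
    for customer, order, amount, days_ago in rows:
        key = (customer, order)
        if key in distinct:
            prev_amount, prev_days = distinct[key]
            distinct[key] = (max(prev_amount, amount), min(prev_days, days_ago))
        else:
            distinct[key] = (amount, days_ago)

    # Step 2: aggregate per customer. Recent orders are counted,
    # but every distinct order contributes to the total.
    result = {}
    for (customer, _order), (amount, days_ago) in distinct.items():
        orders, total = result.get(customer, (0, 0))
        recent = 1 if days_ago <= window_days else 0
        result[customer] = (orders + recent, total + amount)
    return result


print(summarize(rows))
```

For the sample data this yields 2 orders and a total of 160 for customer 123, and 2 orders and a total of 30 for customer 234, matching the desired output.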

postgresql - cumul. sum active customers by month (removing churn)

I want to create a query to get the cumulative sum by month of our active customers. The tricky thing here is that (unfortunately) some customers churn and so I need to remove them from the cumulative sum on the month they leave us.
Here is a sample of my customers table :
customer_id | begin_date | end_date
-----------------------------------------
1 | 15/09/2017 |
2 | 15/09/2017 |
3 | 19/09/2017 |
4 | 23/09/2017 |
5 | 27/09/2017 |
6 | 28/09/2017 | 15/10/2017
7 | 29/09/2017 | 16/10/2017
8 | 04/10/2017 |
9 | 04/10/2017 |
10 | 05/10/2017 |
11 | 07/10/2017 |
12 | 09/10/2017 |
13 | 11/10/2017 |
14 | 12/10/2017 |
15 | 14/10/2017 |
Here is what I am looking to achieve :
month | active customers
-----------------------------------------
2017-09 | 7
2017-10 | 6
I've managed to achieve it with the following query. However, I'd like to know if there is a better way.
select
    "begin_date" as "date",
    sum((new_customers.new_customers-COALESCE(churn_customers.churn_customers,0))) OVER (ORDER BY new_customers."begin_date") as active_customers
FROM (
    select
        date_trunc('month',begin_date)::date as "begin_date",
        count(id) as new_customers
    from customers
    group by 1
) as new_customers
LEFT JOIN(
    select
        date_trunc('month',end_date)::date as "end_date",
        count(id) as churn_customers
    from customers
    where
        end_date is not null
    group by 1
) as churn_customers on new_customers."begin_date" = churn_customers."end_date"
order by 1
;
You can use a CTE to count the end dates per month and then subtract that count from the number of start dates in the same month, using a LEFT JOIN.
SQL Fiddle
Query 1:
WITH edt
AS (
    SELECT to_char(end_date, 'yyyy-mm') AS mon
        ,count(*) AS ct
    FROM customers
    WHERE end_date IS NOT NULL
    GROUP BY to_char(end_date, 'yyyy-mm')
    )
SELECT to_char(c.begin_date, 'yyyy-mm') as month
    ,COUNT(*) - MAX(COALESCE(ct, 0)) AS active_customers
FROM customers c
LEFT JOIN edt ON to_char(c.begin_date, 'yyyy-mm') = edt.mon
GROUP BY to_char(begin_date, 'yyyy-mm')
ORDER BY month;
Results:
| month | active_customers |
|---------|------------------|
| 2017-09 | 7 |
| 2017-10 | 6 |
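The figures above are the net change per month (new customers minus churned ones). To see how a running total would be built on top of that, here is a minimal Python sketch using the sample rows from the question; the function name active_by_month is made up for illustration. Note that with a running total the October figure becomes 13 rather than 6.

```python
from collections import Counter
from datetime import datetime
from itertools import accumulate

# (begin_date, end_date or None) for customers 1 to 15, copied from the question
customers = [
    ("15/09/2017", None), ("15/09/2017", None), ("19/09/2017", None),
    ("23/09/2017", None), ("27/09/2017", None),
    ("28/09/2017", "15/10/2017"), ("29/09/2017", "16/10/2017"),
    ("04/10/2017", None), ("04/10/2017", None), ("05/10/2017", None),
    ("07/10/2017", None), ("09/10/2017", None), ("11/10/2017", None),
    ("12/10/2017", None), ("14/10/2017", None),
]


def month_key(text):
    """Turn 'dd/mm/YYYY' into 'YYYY-MM'."""
    return datetime.strptime(text, "%d/%m/%Y").strftime("%Y-%m")


def active_by_month(customers):
    """Return (month, net change, running total) for every month with activity."""
    net = Counter()
    for begin, end in customers:
        net[month_key(begin)] += 1
        if end:
            net[month_key(end)] -= 1
    months = sorted(net)
    running = accumulate(net[m] for m in months)
    return [(m, net[m], total) for m, total in zip(months, running)]


print(active_by_month(customers))
```

Because churn is keyed by its own month rather than joined onto the months that have new customers, a month with only departures would still appear in this sketch.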

Bigquery: new column

I have the following table structure
+----+-------------+------------+
| id | transaction | time |
+----+-------------+------------+
| 1 | 10 | 01.01.2018 |
| 1 | 20 | 10.01.2018 |
| 2 | 20 | 05.01.2018 |
| 2 | 30 | 15.01.2018 |
| 2 | 5 | 03.02.2018 |
+----+-------------+------------+
What I want to do now is calculate the sum of transaction for each id. However, I would like a rolling sum, computed separately for each period of time (say, each month). So I would like to end up with something like:
+----+-------+-------+
| id | sum_1 | sum_2 |
+----+-------+-------+
| 1 | 30 | 30 |
| 2 | 50 | 55 |
+----+-------+-------+
So that means, I would like to group time monthly, and calculate the sum for each id up to this point. So it's not like a classic partition I assume. Of course I could just do it separately and then join, but as I have quite many monthly or maybe weekly partitions, this might not be feasible. Maybe someone has an idea.
Below is example for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 10 transaction, '01.01.2018' time UNION ALL
  SELECT 1, 20, '10.01.2018' UNION ALL
  SELECT 2, 20, '05.01.2018' UNION ALL
  SELECT 2, 30, '15.01.2018' UNION ALL
  SELECT 2, 5, '03.02.2018'
)
SELECT id, month,
  SUM(transactions) OVER(PARTITION BY id ORDER BY month) rolling_transactions
FROM (
  SELECT id,
    DATE_TRUNC(PARSE_DATE('%d.%m.%Y', time), MONTH) month,
    SUM(transaction) transactions
  FROM `project.dataset.table`
  GROUP BY id, month
)
ORDER BY id, month
with result as
Row id month rolling_transactions
1 1 2018-01-01 30
2 2 2018-01-01 50
3 2 2018-02-01 55
A flattened result like this is preferable because it scales to any number of months, weeks, or other time periods; you can then pivot the result in your application.
Note: for weekly case - just change MONTH to WEEK in DATE_TRUNC
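As a rough illustration of both halves (the flat rolling sum and the later pivot in the application), here is a Python sketch on the sample rows. The function names rolling_monthly_sums and pivot are hypothetical, and the pivot assumes that an id with no rows in a month simply carries its previous total forward.

```python
from collections import defaultdict
from datetime import datetime

# (id, transaction, time), copied from the question
rows = [
    (1, 10, "01.01.2018"),
    (1, 20, "10.01.2018"),
    (2, 20, "05.01.2018"),
    (2, 30, "15.01.2018"),
    (2, 5, "03.02.2018"),
]


def rolling_monthly_sums(rows):
    """Return flat (id, month, running total) tuples, ordered by id and month."""
    monthly = defaultdict(int)
    for id_, amount, day in rows:
        month = datetime.strptime(day, "%d.%m.%Y").strftime("%Y-%m-01")
        monthly[(id_, month)] += amount
    running = defaultdict(int)
    flat = []
    for id_, month in sorted(monthly):
        running[id_] += monthly[(id_, month)]
        flat.append((id_, month, running[id_]))
    return flat


def pivot(flat):
    """Turn the flat result into one list of totals per id, one slot per month."""
    months = sorted({month for _, month, _ in flat})
    by_id = defaultdict(dict)
    for id_, month, total in flat:
        by_id[id_][month] = total
    wide = {}
    for id_, totals in by_id.items():
        last = 0
        wide[id_] = []
        for month in months:
            last = totals.get(month, last)  # carry forward when a month is missing
            wide[id_].append(last)
    return months, wide


flat = rolling_monthly_sums(rows)
print(flat)
print(pivot(flat))
```

The flat list matches the three result rows above, and the pivoted form gives 30 and 30 for id 1 and 50 and 55 for id 2, as in the question's desired table.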

Querying all past and future round birthdays

I got the birthdates of users in a table and want to display a list of round birthdays for the next n years (starting from an arbitrary date x) which looks like this:
+----------------------------------------------------------------------------------------+
| Name | id | birthdate | current_age | birthday | year | month | day | age_at_date |
+----------------------------------------------------------------------------------------+
| User 1 | 1 | 1958-01-23 | 59 | 2013-01-23 | 2013 | 1 | 23 | 55 |
| User 2 | 2 | 1988-01-29 | 29 | 2013-01-29 | 2013 | 1 | 29 | 25 |
| User 3 | 3 | 1963-02-12 | 54 | 2013-02-12 | 2013 | 2 | 12 | 50 |
| User 1 | 1 | 1958-01-23 | 59 | 2018-01-23 | 2018 | 1 | 23 | 60 |
| User 2 | 2 | 1988-01-29 | 29 | 2018-01-29 | 2018 | 1 | 29 | 30 |
| User 3 | 3 | 1963-02-12 | 54 | 2018-02-12 | 2018 | 2 | 12 | 55 |
| User 1 | 1 | 1958-01-23 | 59 | 2023-01-23 | 2023 | 1 | 23 | 65 |
| User 2 | 2 | 1988-01-29 | 29 | 2023-01-29 | 2023 | 1 | 29 | 35 |
| User 3 | 3 | 1963-02-12 | 54 | 2023-02-12 | 2023 | 2 | 12 | 60 |
+----------------------------------------------------------------------------------------+
As you can see, I want the list to "wrap around": it should not only show the next upcoming round birthday, which is easy, but also historical and far-future data.
The core idea of my current approach is the following: via generate_series I generate all dates from 1900 to 2100 and join them to the users by matching the day and month of the birthdate. Based on that, I calculate the age at each date and finally select only those birthdays that are round (divisible by 5) and yield a nonnegative age.
WITH
test_users(id, name, birthdate) AS (
    VALUES
        (1, 'User 1', '23-01-1958' :: DATE),
        (2, 'User 2', '29-01-1988'),
        (3, 'User 3', '12-02-1963')
),
dates AS (
    SELECT
        s AS date,
        date_part('year', s) AS year,
        date_part('month', s) AS month,
        date_part('day', s) AS day
    FROM generate_series('01-01-1900' :: TIMESTAMP, '01-01-2100' :: TIMESTAMP, '1 days' :: INTERVAL) AS s
),
birthday_data AS (
    SELECT
        id AS member_id,
        test_users.birthdate AS birthdate,
        (date_part('year', age((test_users.birthdate)))) :: INT AS current_age,
        date :: DATE AS birthday,
        date_part('year', date) AS year,
        date_part('month', date) AS month,
        date_part('day', date) AS day,
        ROUND(extract(EPOCH FROM (dates.date - birthdate)) / (60 * 60 * 24 * 365)) :: INT AS age_at_date
    FROM test_users, dates
    WHERE
        dates.day = date_part('day', birthdate) AND
        dates.month = date_part('month', birthdate) AND
        dates.year >= date_part('year', birthdate)
)
SELECT
    test_users.name,
    bd.*
FROM test_users
LEFT JOIN birthday_data bd ON bd.member_id = test_users.id
WHERE
    bd.age_at_date % 5 = 0 AND
    bd.birthday BETWEEN NOW() - INTERVAL '5' YEAR AND NOW() + INTERVAL '10' YEAR
ORDER BY bd.birthday;
My current approach seems to be very inefficient and rather complicated: It takes >100ms. Does anybody have an idea for a more compact and performant query? I am using Postgresql 9.5.3. Thank you!
Maybe try to join the generate series:
create table bday(id serial, name text, dob date);
insert into bday (name, dob) values ('a', '08-21-1972'::date);
insert into bday (name, dob) values ('b', '03-20-1974'::date);
select * from bday ,
lateral( select generate_series( (1950-y)/5 , (2010-y)/5)*5 + y as year
from (select date_part('year',dob)::integer as y) as t2
) as t1;
This will for each entry generate years between 1950 and 2010.
You can add a WHERE clause to exclude people born after 2010 (they can't have a birthday in range),
or to exclude people born before 1850 (they are unlikely to still be around).
--
Edit (after your edit):
Your generate_series creates about 365 rows per year. Over 100 years that is more than 30,000 rows, and they get joined to each user (3 users => roughly 100,000 rows).
My query generates rows only for the years needed. Over 100 years that is 20 rows.
That means 20 rows per user.
Dividing by 5 ensures that the start year is a round birthday.
(1950-y)/5 calculates how many round birthdays there were before 1950.
A person born in 1941 needs to skip 1941 and 1946, but has a round birthday in 1951. So that is the difference (9 years) divided by 5, plus 1 to account for the 0th birthday.
If the person is born after 1950 the number is negative, and greatest(-1,...)+1 gives 0, starting at the actual birthday year.
But actually it should be
select * from bday ,
lateral( select generate_series( greatest(-1,(1950-y)/5)+1, (2010-y)/5)*5 + y as year
from (select date_part('year',dob)::integer as y) as t2
) as t1;
(you could use greatest(0,...)+1 instead if you want to start at age 5)
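The year arithmetic can be checked independently with a short Python sketch. It is not a translation of the query: Python's // floors while PostgreSQL integer division truncates toward zero, so the sketch uses an explicit ceiling division and max() instead of mirroring greatest(-1, ...)+1. The function name round_birthday_years and its parameters are illustrative only.

```python
def round_birthday_years(birth_year, first_year, last_year, step=5):
    """Years in [first_year, last_year] in which the age is a multiple of step."""
    start = max(birth_year, first_year)  # nobody has a birthday before being born
    # Smallest age that is a multiple of step and reached in or after `start`
    # (ceiling division written with floor division on the negated value).
    first_age = -(-(start - birth_year) // step) * step
    return list(range(birth_year + first_age, last_year + 1, step))


# Born 1941, range 1950 to 2010: the first round birthday in range is 1951.
print(round_birthday_years(1941, 1950, 2010)[0])

# User 1 from the question (born 1958), for a window of 2012 to 2027.
print(round_birthday_years(1958, 2012, 2027))
```

For someone born inside the range, the list starts at the birth year itself (age 0); edge cases around the start of the range are worth testing against the SQL version, since the two division rules differ for negative numbers.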

How to insert additional values in between a GROUP BY

I am currently making a monthly report using MySQL. I have a table named "monthly" that looks something like this:
id | date | amount
10 | 2009-12-01 22:10:08 | 7
9 | 2009-11-01 22:10:08 | 78
8 | 2009-10-01 23:10:08 | 5
7 | 2009-07-01 21:10:08 | 54
6 | 2009-03-01 04:10:08 | 3
5 | 2009-02-01 09:10:08 | 456
4 | 2009-02-01 14:10:08 | 4
3 | 2009-01-01 20:10:08 | 20
2 | 2009-01-01 13:10:15 | 10
1 | 2008-12-01 10:10:10 | 5
Then, when I make a monthly report (grouped by month of each year), I get something like this:
yearmonth | total
2008-12 | 5
2009-01 | 30
2009-02 | 460
2009-03 | 3
2009-07 | 54
2009-10 | 5
2009-11 | 78
2009-12 | 7
I used this query to achieve the result:
SELECT substring( date, 1, 7 ) AS yearmonth, sum( amount ) AS total
FROM monthly
GROUP BY substring( date, 1, 7 )
But I need something like this:
yearmonth | total
2008-01 | 0
2008-02 | 0
2008-03 | 0
2008-04 | 0
2008-05 | 0
2008-06 | 0
2008-07 | 0
2008-08 | 0
2008-09 | 0
2008-10 | 0
2008-11 | 0
2008-12 | 5
2009-01 | 30
2009-02 | 460
2009-03 | 3
2009-04 | 0
2009-05 | 0
2009-06 | 0
2009-07 | 54
2009-08 | 0
2009-09 | 0
2009-10 | 5
2009-11 | 78
2009-12 | 7
That is, something that displays zeroes for the months that don't have any value. Is it even possible to do that in a MySQL query?
You should generate a dummy rowsource and LEFT JOIN with it:
SELECT  year, month, COALESCE(SUM(m.amount), 0) AS total
FROM    (
        SELECT  1 AS month
        UNION ALL
        SELECT  2
        …
        UNION ALL
        SELECT  12
        ) months
CROSS JOIN
        (
        SELECT  2008 AS year
        UNION ALL
        SELECT  2009 AS year
        ) years
LEFT JOIN
        monthly m
ON      m.date >= CONCAT_WS('.', year, month, 1)
        AND m.date < CONCAT_WS('.', year, month, 1) + INTERVAL 1 MONTH
GROUP BY
        year, month
You can create these as tables on disk rather than generating them each time.
MySQL is the only system of the major four that does not have an easy way to generate arbitrary resultsets.
Oracle, SQL Server and PostgreSQL do have one (CONNECT BY, recursive CTEs and generate_series, respectively).
Quassnoi is right, and I'll add a comment about how to recognize when you need something like this:
You want '2008-01' in your result, yet nothing in the source table has a date in January, 2008. Result sets have to come from the tables you query, so the obvious conclusion is that you need an additional table - one that contains each month you want as part of your result.
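The same idea (build the full list of months first, then look up each month's total and fall back to zero) can be sketched in Python on the sample rows. The function name monthly_totals and its year-range parameters are assumptions made for this illustration.

```python
from collections import defaultdict

# (date, amount), copied from the question's "monthly" table
rows = [
    ("2009-12-01 22:10:08", 7),
    ("2009-11-01 22:10:08", 78),
    ("2009-10-01 23:10:08", 5),
    ("2009-07-01 21:10:08", 54),
    ("2009-03-01 04:10:08", 3),
    ("2009-02-01 09:10:08", 456),
    ("2009-02-01 14:10:08", 4),
    ("2009-01-01 20:10:08", 20),
    ("2009-01-01 13:10:15", 10),
    ("2008-12-01 10:10:10", 5),
]


def monthly_totals(rows, first_year, last_year):
    """Return (yearmonth, total) for every month in the range, zero-filled."""
    totals = defaultdict(int)
    for stamp, amount in rows:
        totals[stamp[:7]] += amount  # 'YYYY-MM' prefix, like SUBSTRING(date, 1, 7)

    # The "dummy rowsource": every month of every requested year.
    all_months = [
        f"{year:04d}-{month:02d}"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]
    # The "LEFT JOIN": months without data fall back to 0.
    return [(ym, totals.get(ym, 0)) for ym in all_months]


for yearmonth, total in monthly_totals(rows, 2008, 2009):
    print(yearmonth, "|", total)
```

The list of months plays the role of the extra table: it is what guarantees that 2008-01 shows up even though no source row falls in that month.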