How to select data with an unusual grouping by date? - sql

There is a table:
id        direction_id   created_at
1         2              22 November 2021, 16:00:00
2         2              22 November 2021, 16:20:00
43        2              22 November 2021, 16:25:00
455       1              22 November 2021, 16:27:00
6567      2              22 November 2021, 17:36:00
674556    2              22 November 2021, 20:01:00
5243554   1              22 November 2021, 20:50:00
5243554   1              22 November 2021, 21:46:00
I need to get the following result:
1   2   created_at_by_hour
1   3   22.11.21 17
1   4   22.11.21 18
1   4   22.11.21 19
1   4   22.11.21 20
2   5   22.11.21 21
3   5   22.11.21 22
1 and 2 in the header are all possible values of direction_id that appear in the table.
created_at is truncated to the hour, and for each hour you need to count how many records satisfy created_at <= created_at_by_hour. The grouping should be such that if there is an hour in which no records were created, the previous hour's counts are simply repeated.
The table consists of three fields: id (int), direction_id (int), created_at (timestamptz). I need an hourly breakdown (based on the created_at field) with the number of records created before each "grouped" hour. Not just the total number, but a separate count for each direction_id (there are only two of them: 1 and 2). If no records were created for a given direction_id in a given hour, repeat the previous value; the result should end at the last created_at. created_at is the time the record was created.
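For reference, a minimal reproducible setup matching the sample data might look like this (the table name test is taken from the first answer below; the column types follow the description above):
create table test (
    id           int,
    direction_id int,
    created_at   timestamptz
);

insert into test (id, direction_id, created_at) values
    (1,       2, '2021-11-22 16:00:00'),
    (2,       2, '2021-11-22 16:20:00'),
    (43,      2, '2021-11-22 16:25:00'),
    (455,     1, '2021-11-22 16:27:00'),
    (6567,    2, '2021-11-22 17:36:00'),
    (674556,  2, '2021-11-22 20:01:00'),
    (5243554, 1, '2021-11-22 20:50:00'),
    (5243554, 1, '2021-11-22 21:46:00');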

In my opinion, it is better to generate a series of hours between the min and max dates and then calculate the count for each direction.
Demo
with time_range as (
    select
        min(created_at) + interval '1 hour' as min,
        max(created_at) + interval '1 hour' as max
    from test
)
select
    count(*) filter (where direction_id = 1) as "1",
    count(*) filter (where direction_id = 2) as "2",
    to_char(gs.hour, 'dd.mm.yy HH24') as created_at_by_hour
from test t
cross join time_range tr
inner join generate_series(tr.min, tr.max, interval '1 hour') gs(hour)
    on t.created_at <= gs.hour
group by gs.hour
order by gs.hour

Truncate the date down to the hour, group by it and count. Then use SUM OVER to get a running total of the counts. In order to show missing hours in the table, you must generate a series of hours and outer join your data.
with hourly as (
    select date_trunc('hour', created_at) as hour, direction_id
    from mytable
),
hours(hour) as (
    select *
    from generate_series(
        (select min(hour) from hourly),
        (select max(hour) from hourly),
        interval '1 hour'
    )
)
select
    hours.hour,
    sum(count(*) filter (where hourly.direction_id = 1)) over (order by hour) as "1",
    sum(count(*) filter (where hourly.direction_id = 2)) over (order by hour) as "2"
from hours
left join hourly using (hour)
group by hour
order by hour;
Demo: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=21d0c838452a09feac4ebc57906829f4

Related

SQL Bigquery Counting repeated customers from transaction table

I have a transaction table that looks something like this.
userid   orderDate    amount
111      2021-11-01   20
112      2021-09-07   17
111      2021-11-21   17
I want to count how many distinct customers (userid) who bought from our store this month also bought from our store in the previous month. For example, in February 2020 we had 20 customers, and 7 of those 20 also bought from our store in the previous month, January 2020. I want to do this for all previous months, ending up with something like:
year   month   repeated customers
2020   01      11
2020   02      7
2020   03      9
I have written this, but it only works for the current month. How would I rewrite it to get the table shown above?
WITH CURRENT_PERIOD AS (
    SELECT DISTINCT userid
    FROM table1
    WHERE DATE(orderDate) BETWEEN DATE_TRUNC(CURRENT_DATE(), MONTH)
                              AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
),
PREVIOUS_PERIOD AS (
    SELECT DISTINCT userid
    FROM table1
    WHERE DATE(orderDate) BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)
                              AND LAST_DAY(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
)
SELECT count(1)
FROM CURRENT_PERIOD RC
WHERE RC.userid IN (SELECT DISTINCT userid FROM PREVIOUS_PERIOD)
You can summarize to get one record per month, use lag(), and then aggregate:
select yyyymm,
       countif(prev_yyyymm = date_add(yyyymm, interval -1 month)) as repeated_customers
from (select userid,
             date_trunc(date(orderDate), month) as yyyymm,
             lag(date_trunc(date(orderDate), month))
                 over (partition by userid order by date_trunc(date(orderDate), month)) as prev_yyyymm
      from table1
      group by 1, 2
     ) t
group by yyyymm
order by yyyymm;
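If you need year and month as separate columns, as in the expected result, you could wrap the query above and extract them from yyyymm; a sketch, assuming the same table1 / orderDate schema:
select extract(year  from yyyymm) as year,
       extract(month from yyyymm) as month,
       repeated_customers
from (
    -- the query from above, with its count aliased as repeated_customers
    select yyyymm,
           countif(prev_yyyymm = date_add(yyyymm, interval -1 month)) as repeated_customers
    from (select userid,
                 date_trunc(date(orderDate), month) as yyyymm,
                 lag(date_trunc(date(orderDate), month))
                     over (partition by userid order by date_trunc(date(orderDate), month)) as prev_yyyymm
          from table1
          group by 1, 2
         ) t
    group by yyyymm
) m
order by year, month;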

SQL - In a week get result count of records in that week and count of records ageing 7days from that week

This is Redshift SQL.
I'm trying to get two results per week:
Total records in that week
Total records ageing greater than 7 days from that week
Say there are 100 sample records in the format below; in this example, 7 records per week:
day code week
1/1/2020 P001 1
1/2/2020 P002 1
1/3/2020 P003 1
1/4/2020 P004 1
1/5/2020 P005 2
1/6/2020 P006 2
1/7/2020 P007 2
1/8/2020 P008 2
1/9/2020 P009 2
1/10/2020 P010 2
1/11/2020 P011 2
.....................
4/8/2020 P099 15
Trying to get output like this:
Week count count>7 days
1 7 0
2 7 7
3 7 14
4 7 21
15 7 98
Basically, for the latest week, I'm trying to get the distinct number of records ageing more than 7 days. In the actual use case, the number of records per week will vary.
What I've tried:
select
    calendar_week_number,
    count(code) as count_1,
    count(distinct (case when datediff(day, trunc(completion_date - 7), '2020-01-01') then code end)) as count_2,
    count(case when completion_date between to_date('20200101', 'YYYYMMDD')
                                         and to_date(completion_date, 'YYYYMMDD') - 7
               then code end) as count_3
from rbsrpt.RBS_DAILY_ASIN_PROC_SNPSHT ul
left join rbsrpt.dim_rbs_time t on trunc(ul.completion_date) = trunc(t.cal_date)
where mp = 1
  and calendar_year = 2020
group by calendar_week_number
order by calendar_week_number desc
but my output is as below:
week count1 count 2 count 3
51 2866 2866 0
50 3211 3211 0
49 6377 6377 0
48 9013 9013 0
47 5950 5950 0
One option uses lateral joins. It is probably more efficient to aggregate the calendar table by week first, then perform the searches week by week against the dataset.
Assuming Postgres (since there is no TO_DATE() in MySQL):
select t.cal_date, c1.*, c2.*
from (
    select calendar_week_number, min(cal_date) as cal_date
    from rbsrpt.dim_rbs_time
    group by calendar_week_number
) t
cross join lateral (
    select count(*) as cnt
    from rbsrpt.rbs_daily_asin_proc_snpsht r
    where r.completion_date >= t.cal_date
      and r.completion_date <  t.cal_date + interval '7 day'
) c1
cross join lateral (
    select count(*) as cnt_aged
    from rbsrpt.rbs_daily_asin_proc_snpsht r
    where r.completion_date >= t.cal_date - interval '7 day'
      and r.completion_date <  t.cal_date
) c2
This ages out records after 7 days. If you wanted 30 days instead, you would change the where clause of the second subquery:
cross join lateral (
    select count(*) as cnt_aged
    from rbsrpt.rbs_daily_asin_proc_snpsht r
    where r.completion_date >= t.cal_date - interval '30 day'
      and r.completion_date <  t.cal_date - interval '23 day'
) c2
Edit: if your database does not support lateral joins, you can use subqueries instead:
select t.cal_date,
    (
        select count(*)
        from rbsrpt.rbs_daily_asin_proc_snpsht r
        where r.completion_date >= t.cal_date
          and r.completion_date <  t.cal_date + interval '7 day'
    ) as cnt,
    (
        select count(*)
        from rbsrpt.rbs_daily_asin_proc_snpsht r
        where r.completion_date >= t.cal_date - interval '7 day'
          and r.completion_date <  t.cal_date
    ) as cnt_aged
from (
    select calendar_week_number, min(cal_date) as cal_date
    from rbsrpt.dim_rbs_time
    group by calendar_week_number
) t
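If the "count>7 days" column in the expected output is simply a running total of everything created in earlier weeks, a window function over the weekly counts may be enough. A rough sketch, assuming a table with the day/code/week layout from the question (the table name weekly_data is hypothetical):
select week,
       count(code) as cnt,
       -- sum of all earlier weeks' counts; 0 for the first week
       coalesce(sum(count(code)) over (order by week
                                       rows between unbounded preceding and 1 preceding), 0) as cnt_older_7_days
from weekly_data  -- hypothetical table holding the day/code/week sample
group by week
order by week;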

How do I compare a current partial month vs a previous partial month with postgres?

I'm building some basic reports and I want to see if I'm on track to surpass last month's metrics without waiting for the month to end. Basically I want to compare June 1 (start of current month) through June 23 (current_date) against May 1 (start of previous month) through May 23 (current_date - 1 month).
My goal is to show a count of distinct users that did event1 and event2.
Here's what I have so far:
CREATE VIEW events AS (
    SELECT *
    FROM public.event
    WHERE TYPE IN ('event1', 'event2')
      AND created_at > now() - interval '1 months'
);

CREATE VIEW MAU AS (
    SELECT EXTRACT(DOW FROM created_at) AS month,
           DATE_TRUNC('week', created_at) AS week,
           COUNT(*) AS total_engagement,
           COUNT(DISTINCT user_id) AS total_users
    FROM events
    GROUP BY 2, 1
    ORDER BY week DESC
);

SELECT month,
       week,
       SUM(total_engagement) OVER (PARTITION BY month ORDER BY week) AS total_engagment
FROM MAU
ORDER BY 1 DESC, 2
Here's an example of what that returns:
Month Week Unique Engagement
6 2017-05-22 00:00:00 165
6 2017-05-29 00:00:00 355
6 2017-06-05 00:00:00 572
6 2017-06-12 00:00:00 723
5 2017-05-22 00:00:00 757
5 2017-05-29 00:00:00 1549
5 2017-06-05 00:00:00 2394
5 2017-06-12 00:00:00 3261
5 2017-06-19 00:00:00 3592
Expected return
Month Day Total Engagement
6 1 50
6 2 100
6 3 180
5 1 89
5 2 213
5 3 284
5 4 341
Can you point out where I've got this wrong or if there's an easier way to do it?
You are confusing days, weeks and months in your question, but from the expected output I assume that you want the month number, the week number within the month, and a count of those pairs.
SELECT month,
       week,
       count(*) as total_engagement
FROM (
    SELECT extract(month from created_at) as month,
           extract('day' from date_trunc('week', created_at::date) -
                   date_trunc('week', date_trunc('month', created_at::date))) / 7 + 1 as week
    FROM public.event
    WHERE type IN ('event1', 'event2')
      AND created_at > now() - interval '1 month'
) t
GROUP BY 1, 2
The most interesting part could be getting the week number within a month and for that you can check this answer.
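For the literal month-to-date comparison described in the question (June 1-23 vs May 1-23), a rough sketch, assuming the same public.event table and columns, could be:
select date_trunc('month', created_at)::date as month_start,
       count(distinct user_id)               as distinct_users
from public.event
where type in ('event1', 'event2')
  and (
        -- current month up to now()
        created_at >= date_trunc('month', now())
        -- or the same span of the previous month
     or (created_at >= date_trunc('month', now()) - interval '1 month'
         and created_at <  date_trunc('month', now()) - interval '1 month'
                            + (now() - date_trunc('month', now())))
      )
group by 1
order by 1;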

Add Missing monthly dates in a timeseries data in Postgresql

I have monthly time series data in a table where the dates are the last day of each month. Some dates are missing from the data. I want to insert those dates with a zero value for the other attributes.
Table is as follows:
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-04-30 34
2 2014-05-31 45
2 2014-08-31 47
I want to convert this table to
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-03-31 0
1 2015-04-30 34
2 2014-05-31 45
2 2014-06-30 0
2 2014-07-31 0
2 2014-08-31 47
Is there any way we can do this in PostgreSQL?
Currently we are doing this in Python, but as our data grows day by day it's not efficient to handle the I/O just for this one task.
Thank you
You can do this using generate_series() to generate the dates and then left join to bring in the values:
with m as (
    select id, min(report_date) as minrd, max(report_date) as maxrd
    from t
    group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) as report_date
      from m
     ) m
left join t on m.report_date = t.report_date;
EDIT:
Turns out that the above doesn't quite work, because adding months to the end of month doesn't keep the last day of the month.
This is easily fixed:
with t as (
    select 1 as id, date '2012-01-31' as report_date, 10 as price union all
    select 1 as id, date '2012-04-30', 20
), m as (
    select id, min(report_date) - interval '1 day' as minrd, max(report_date) - interval '1 day' as maxrd
    from t
    group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) + interval '1 day' as report_date
      from m
     ) m
left join t on m.report_date = t.report_date;
The first CTE is just to generate sample data.
This is a slight improvement over Gordon's query which fails to get the last date of a month in some cases.
Essentially you generate all the month end dates between the min and max date for each id (using generate_series) and left join on this generated table to show the missing dates with 0 price.
with minmax as (
    select id, min(report_date) as mindt, max(report_date) as maxdt
    from t
    group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select *,
             generate_series(date_trunc('MONTH', mindt + interval '1' day),
                             date_trunc('MONTH', maxdt + interval '1' day),
                             interval '1' month) - interval '1 day' as report_date
      from minmax
     ) m
left join t on m.report_date = t.report_date
Sample Demo
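If the same report_date can occur for more than one id, the final join in both of the queries above would probably also need to match on id; a minimal adjustment to the last line (same aliases assumed):
left join t on m.id = t.id
           and m.report_date = t.report_date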

Total Number of Records per Week

I have a Postgres 9.1 database. I am trying to generate the number of records per week (for a given date range) and compare it to the previous year.
I have the following code used to generate the series:
select generate_series('2013-01-01', '2013-01-31', '7 day'::interval) as series
However, I am not sure how to join the counted records to the dates generated.
So, using the following records as an example:
Pt_ID exam_date
====== =========
1 2012-01-02
2 2012-01-02
3 2012-01-08
4 2012-01-08
1 2013-01-02
2 2013-01-02
3 2013-01-03
4 2013-01-04
1 2013-01-08
2 2013-01-10
3 2013-01-15
4 2013-01-24
I wanted to have the records return as:
series thisyr lastyr
=========== ===== =====
2013-01-01 4 2
2013-01-08 3 2
2013-01-15 1 0
2013-01-22 1 0
2013-01-29 0 0
I am not sure how to reference the date range in the subquery. Thanks for any assistance.
The simple approach would be to solve this with a CROSS JOIN, as demonstrated by @jpw. However, there are some hidden problems:
The performance of an unconditional CROSS JOIN deteriorates quickly with growing number of rows. The total number of rows is multiplied by the number of weeks you are testing for, before this huge derived table can be processed in the aggregation. Indexes can't help.
Starting weeks with January 1st leads to inconsistencies. ISO weeks might be an alternative. See below.
All of the following queries make heavy use of an index on exam_date. Be sure to have one.
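A plain b-tree index is enough here; for instance (the index name is just an example):
CREATE INDEX tbl_exam_date_idx ON tbl (exam_date);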
Only join to relevant rows
Should be much faster:
SELECT d.day, d.thisyr
     , count(t.exam_date) AS lastyr
FROM  (
   SELECT d.day::date, (d.day - '1 year'::interval)::date AS day0  -- for 2nd join
        , count(t.exam_date) AS thisyr
   FROM   generate_series('2013-01-01'::date
                        , '2013-01-31'::date         -- last week overlaps with Feb.
                        , '7 days'::interval) d(day)  -- returns timestamp
   LEFT   JOIN tbl t ON  t.exam_date >= d.day::date
                     AND t.exam_date <  d.day::date + 7
   GROUP  BY d.day
   ) d
LEFT  JOIN tbl t ON  t.exam_date >= d.day0  -- repeat with last year
                 AND t.exam_date <  d.day0 + 7
GROUP BY d.day, d.thisyr
ORDER BY d.day;
This is with weeks starting from Jan. 1st like in your original. As commented, this produces a couple of inconsistencies: Weeks start on a different day each year and since we cut off at the end of the year, the last week of the year consists of just 1 or 2 days (leap year).
The same with ISO weeks
Depending on requirements, consider ISO weeks instead, which start on Mondays and always span 7 days. But they cross the border between years. Per documentation on EXTRACT():
week
The number of the week of the year that the day is in. By definition (ISO 8601), weeks start on Mondays and the first week of a year contains January 4 of that year. In other words, the first Thursday of a year is in week 1 of that year.
In the ISO definition, it is possible for early-January dates to be part of the 52nd or 53rd week of the previous year, and for late-December dates to be part of the first week of the next year. For example, 2005-01-01 is part of the 53rd week of year 2004, and 2006-01-01 is part of the 52nd week of year 2005, while 2012-12-31 is part of the first week of 2013. It's recommended to use the isoyear field together with week to get consistent results.
Above query rewritten with ISO weeks:
SELECT w AS isoweek
     , day::text  AS thisyr_monday, thisyr_ct
     , day0::text AS lastyr_monday, count(t.exam_date) AS lastyr_ct
FROM  (
   SELECT w, day
        , date_trunc('week', '2012-01-04'::date)::date + 7 * w AS day0
        , count(t.exam_date) AS thisyr_ct
   FROM  (
      SELECT w
           , date_trunc('week', '2013-01-04'::date)::date + 7 * w AS day
      FROM   generate_series(0, 4) w
      ) d
   LEFT   JOIN tbl t ON  t.exam_date >= d.day
                     AND t.exam_date <  d.day + 7
   GROUP  BY d.w, d.day
   ) d
LEFT  JOIN tbl t ON  t.exam_date >= d.day0  -- repeat with last year
                 AND t.exam_date <  d.day0 + 7
GROUP BY d.w, d.day, d.day0, d.thisyr_ct
ORDER BY d.w, d.day;
January 4th is always in the first ISO week of the year. So this expression gets the date of Monday of the first ISO week of the given year:
date_trunc('week', '2012-01-04'::date)::date
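For example (result shown as a comment):
SELECT date_trunc('week', '2012-01-04'::date)::date;  -- 2012-01-02, the Monday of ISO week 1 of 2012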
Simplify with EXTRACT()
Since ISO weeks coincide with the week numbers returned by EXTRACT(), we can simplify the query. First, a short and simple form:
SELECT w AS isoweek
     , COALESCE(thisyr_ct, 0) AS thisyr_ct
     , COALESCE(lastyr_ct, 0) AS lastyr_ct
FROM   generate_series(1, 5) w
LEFT   JOIN (
   SELECT EXTRACT(week FROM exam_date)::int AS w, count(*) AS thisyr_ct
   FROM   tbl
   WHERE  EXTRACT(isoyear FROM exam_date)::int = 2013
   GROUP  BY 1
   ) t13 USING (w)
LEFT   JOIN (
   SELECT EXTRACT(week FROM exam_date)::int AS w, count(*) AS lastyr_ct
   FROM   tbl
   WHERE  EXTRACT(isoyear FROM exam_date)::int = 2012
   GROUP  BY 1
   ) t12 USING (w);
Optimized query
The same with more details and optimized for performance
WITH params AS (  -- enter parameters here, once
   SELECT date_trunc('week', '2012-01-04'::date)::date AS last_start
        , date_trunc('week', '2013-01-04'::date)::date AS this_start
        , date_trunc('week', '2014-01-04'::date)::date AS next_start
        , 1 AS week_1
        , 5 AS week_n  -- show weeks 1 - 5
   )
SELECT w.w AS isoweek
     , p.this_start + 7 * (w - 1) AS thisyr_monday
     , COALESCE(t13.ct, 0)        AS thisyr_ct
     , p.last_start + 7 * (w - 1) AS lastyr_monday
     , COALESCE(t12.ct, 0)        AS lastyr_ct
FROM   params p
     , generate_series(p.week_1, p.week_n) w(w)
LEFT   JOIN (
   SELECT EXTRACT(week FROM t.exam_date)::int AS w, count(*) AS ct
   FROM   tbl t, params p
   WHERE  t.exam_date >= p.this_start  -- only relevant dates
   AND    t.exam_date <  p.this_start + 7 * (p.week_n - p.week_1 + 1)::int
   -- AND t.exam_date < p.next_start   -- don't cross over into next year
   GROUP  BY 1
   ) t13 USING (w)
LEFT   JOIN (  -- same for last year
   SELECT EXTRACT(week FROM t.exam_date)::int AS w, count(*) AS ct
   FROM   tbl t, params p
   WHERE  t.exam_date >= p.last_start
   AND    t.exam_date <  p.last_start + 7 * (p.week_n - p.week_1 + 1)::int
   -- AND t.exam_date < p.this_start
   GROUP  BY 1
   ) t12 USING (w);
This should be very fast with index support and can easily be adapted to intervals of choice.
The implicit JOIN LATERAL for generate_series() in the last query requires Postgres 9.3.
SQL Fiddle.
Using a cross join should work; I'm just going to paste the markdown output from SQL Fiddle below. It would seem that your sample output is incorrect for series 2013-01-08: thisyr should be 2, not 3. This might not be the best way to do this though; my PostgreSQL knowledge leaves a lot to be desired.
SQL Fiddle
PostgreSQL 9.2.4 Schema Setup:
CREATE TABLE Table1
("Pt_ID" varchar(6), "exam_date" date);
INSERT INTO Table1
("Pt_ID", "exam_date")
VALUES
('1', '2012-01-02'),('2', '2012-01-02'),
('3', '2012-01-08'),('4', '2012-01-08'),
('1', '2013-01-02'),('2', '2013-01-02'),
('3', '2013-01-03'),('4', '2013-01-04'),
('1', '2013-01-08'),('2', '2013-01-10'),
('3', '2013-01-15'),('4', '2013-01-24');
Query 1:
select
    series,
    sum(case when exam_date
                  between series and series + '6 day'::interval
             then 1 else 0 end) as thisyr,
    sum(case when exam_date + '1 year'::interval
                  between series and series + '6 day'::interval
             then 1 else 0 end) as lastyr
from table1
cross join generate_series('2013-01-01', '2013-01-31', '7 day'::interval) as series
group by series
order by series
Results:
| SERIES | THISYR | LASTYR |
|--------------------------------|--------|--------|
| January, 01 2013 00:00:00+0000 | 4 | 2 |
| January, 08 2013 00:00:00+0000 | 2 | 2 |
| January, 15 2013 00:00:00+0000 | 1 | 0 |
| January, 22 2013 00:00:00+0000 | 1 | 0 |
| January, 29 2013 00:00:00+0000 | 0 | 0 |