PostgreSQL 10.12
I have a table with calculated data grouped by date with hour, e.g.:
hourly_stats
clicks_count | visitors_count | product_id | promoter_id | bundle_id | date_time
------------------------------------------------------------------------------------------
15 | 6 | 123 | 456 | 789 | 2018-11-02 12:00:00
8 | 3 | 123 | 456 | 789 | 2018-11-02 16:00:00
2 | 1 | 123 | 456 | 789 | 2018-11-13 10:00:00
5 | 2 | 123 | 456 | 789 | 2018-11-13 21:00:00
Every new hour I collect statistics for the previous hour and insert it into the table.
In addition, to always display fresh data, I use a materialized view, which stores the calculated data from the beginning of the current hour to the current moment (refreshed every 5 minutes).
The core part of the query is always based on two timestamp values and looks like this:
SELECT *
FROM (
SELECT
clicks_count,
visitors_count,
product_id,
promoter_id,
bundle_id,
date_time
FROM hourly_stats
UNION ALL (
SELECT
clicks_count,
visitors_count,
product_id,
promoter_id,
bundle_id,
date_time
FROM materialized_stats
)
) AS stats
WHERE (date_time > start_date AND date_time <= end_date)
This core part is used in multiple really complex queries, which are too slow. For example, in one case a query takes more than 1.5 minutes to complete when the table has more than 20 million records and no rows are filtered out by start_date and end_date.
I decided to add two more tables with calculated data grouped by year-month-day:
daily_stats
clicks_count | visitors_count | product_id | promoter_id | bundle_id | date_time
------------------------------------------------------------------------------------------
23 | 9 | 123 | 456 | 789 | 2018-11-02
7 | 3 | 123 | 456 | 789 | 2018-11-13
and by year-month:
monthly_stats
clicks_count | visitors_count | product_id | promoter_id | bundle_id | date_time
------------------------------------------------------------------------------------------
30 | 12 | 123 | 456 | 789 | 2018-11
So, if I have start_date = '2019-01-01 00:00:00' and end_date = '2020-08-12 16:00:00', I will be able to collect data like this:
(SELECT
clicks_count,
visitors_count,
product_id,
promoter_id,
bundle_id,
date_time
FROM monthly_stats
WHERE 'monthly_condition')
UNION ALL
(SELECT
clicks_count,
visitors_count,
product_id,
promoter_id,
bundle_id,
date_time
FROM daily_stats
WHERE 'daily_condition')
UNION ALL
(SELECT
clicks_count,
visitors_count,
product_id,
promoter_id,
bundle_id,
date_time
FROM hourly_stats
WHERE 'hourly_condition')
UNION ALL (
SELECT
clicks_count,
visitors_count,
product_id,
promoter_id,
bundle_id,
date_time
FROM materialized_stats
)
Each calculated row is added to the corresponding table only after the base time period (month, day, or hour) is over. So for a specific set of product_id | promoter_id | bundle_id I should get:
19 rows from monthly_stats +
11 rows from daily_stats +
16 rows from hourly_stats +
1 row from materialized_stats
Restrictions already implemented (at the application layer):
max end_date value may be equal to the end of the current day
start_date is always less than end_date
start_date and end_date values are specified with an hour precision
Question: how do I implement the 'monthly_condition', 'daily_condition' and 'hourly_condition' above? They should be based on parts of start_date and end_date, but I don't quite understand how to do this.
Thanks for any help.
This is an interesting problem. I had to solve this once before for SQL Server. PostgreSQL makes it much easier. Everything down to the fullness cte has been tested. The allstats cte is a best guess since I do not have your tables or data.
with invars as (
select '2016-08-15 12:35:00'::timestamptz as start_date,
'2020-08-12 19:00:00'::timestamptz as end_date
), days as (
select c.dhour,
tstzrange(
date_trunc('hour', i.start_date),
date_trunc('hour', i.end_date), '[)') as qrange
from invars i
cross join lateral generate_series(
date_trunc('hour', i.start_date),
date_trunc('hour', i.end_date),
interval '1 hour'
) as c(dhour)
), calendar as (
select dhour,
date_trunc('day', dhour) as dday,
date_trunc('month', dhour) as dmonth,
qrange
from days
), fullness as (
select dhour, dday, dmonth, qrange,
qrange @> tstzrange(dday, dday + interval '1 day', '[)') as full_day,
qrange @> tstzrange(dmonth, dmonth + interval '1 month', '[)') as full_month
from calendar
), allstats as (
select clicks_count, visitors_count, product_id, promoter_id, bundle_id
from monthly_stats
where date_time in (select distinct to_char(dmonth, 'YYYY-MM')
from fullness where full_month)
union all
select clicks_count, visitors_count, product_id, promoter_id, bundle_id
from daily_stats
where date_time in (select distinct to_char(dday, 'YYYY-MM-DD')
from fullness where full_day and not full_month)
union all
select clicks_count, visitors_count, product_id, promoter_id, bundle_id
from hourly_stats
where date_time in (select dhour from fullness
where not full_day and not full_month
and dhour < date_trunc('hour', now()))
union all
select clicks_count, visitors_count, product_id, promoter_id, bundle_id
from materialized_stats
)
select * from allstats;
I think your problem description leaves off the fact that the start_date can begin in the middle of a month or even a day. This query covers that.
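For anyone who wants to sanity-check the bucket boundaries outside the database, here is a minimal Python sketch of the same idea: walk the range and take a whole month when one fits, otherwise a whole day, otherwise a single hour. It assumes a half-open interval [start, end) with both bounds at hour precision; `split_range` and `next_month` are hypothetical helper names, not something from the query above.

```python
from datetime import datetime, timedelta


def next_month(ts):
    """Return midnight on the first day of the month after ts."""
    if ts.month == 12:
        return datetime(ts.year + 1, 1, 1)
    return datetime(ts.year, ts.month + 1, 1)


def split_range(start, end):
    """Split [start, end) into full months, full days and leftover hours.

    Assumption: start and end are already truncated to the hour.
    """
    months, days, hours = [], [], []
    cur = start
    while cur < end:
        day_start = cur.replace(hour=0, minute=0, second=0, microsecond=0)
        month_start = day_start.replace(day=1)
        if cur == month_start and next_month(cur) <= end:
            # The whole month fits: one row from monthly_stats.
            months.append(cur.strftime("%Y-%m"))
            cur = next_month(cur)
        elif cur == day_start and cur + timedelta(days=1) <= end:
            # The whole day fits: one row from daily_stats.
            days.append(cur.strftime("%Y-%m-%d"))
            cur += timedelta(days=1)
        else:
            # Partial day: fall back to hourly_stats.
            hours.append(cur)
            cur += timedelta(hours=1)
    return months, days, hours


# The range from the question.
months, days, hours = split_range(datetime(2019, 1, 1), datetime(2020, 8, 12, 16))
print(len(months), len(days), len(hours))  # 19 11 16
```

The counts match the 19 + 11 + 16 rows expected in the question; the one row from materialized_stats for the current hour is outside the scope of this sketch.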
Related
I've got raw data from table with information about clients. Information comes from different sources, so it causes duplicates but with different dates:
id pp type start_dt end_dt
100| 1 | Y | 01.05.19 | 01.10.20
100| 1 | Y | 10.08.20 | 01.10.20
100| 1 | N | 01.10.20 | 02.12.21
100| 1 | N | 13.12.20 | 02.12.21
100| 1 | Y | 02.12.21 | 02.12.26
100| 1 | Y | 20.12.21 | 20.12.26
For example, in this table rows 2, 4 and 6 have a start date between the "start_dt" and "end_dt" of the previous row. Each is a duplicate, but I need to combine the min start date and max end date from both rows for each type.
FYI: the first two rows and the last two rows have the same id, pp and type, but I need to merge them separately because of the timeline.
What I want to get (continuous timeline for a client is a key):
id pp type start_dt end_dt | cnt
100| 1 | Y | 01.05.19 | 01.10.20 | 2
100| 1 | N | 01.10.20 | 02.12.21 | 2
100| 1 | Y | 02.12.21 | 20.12.26 | 2
I'm using PL/SQL. I think it could be solved by window functions, but I can't figure out which functions to use.
I tried to solve it with GROUP BY and HAVING > 1, but that collapses the four rows with the same type (rows 1, 2 and 5, 6) into one. I need a separate row for each period of a type, preserving the continuous timeline of dates for one client.
From Oracle 12, you can use MATCH_RECOGNIZE for row-by-row pattern matching:
SELECT *
FROM table_name
MATCH_RECOGNIZE(
PARTITION BY id, pp
ORDER BY start_dt
MEASURES
FIRST(type) AS type,
FIRST(start_dt) AS start_dt,
MAX(end_dt) AS end_dt,
COUNT(*) AS cnt
PATTERN (overlapping* last_row)
DEFINE
overlapping AS type = NEXT(type)
AND MAX(end_dt) >= NEXT(start_dt)
)
Which, for the sample data:
CREATE TABLE table_name (id, pp, type, start_dt, end_dt) AS
SELECT 100, 1, 'Y', DATE '2019-05-01', DATE '2020-10-01' FROM DUAL UNION ALL
SELECT 100, 1, 'Y', DATE '2020-08-10', DATE '2020-10-01' FROM DUAL UNION ALL
SELECT 100, 1, 'N', DATE '2020-10-01', DATE '2021-12-02' FROM DUAL UNION ALL
SELECT 100, 1, 'N', DATE '2020-12-13', DATE '2021-12-02' FROM DUAL UNION ALL
SELECT 100, 1, 'Y', DATE '2021-12-02', DATE '2026-12-02' FROM DUAL UNION ALL
SELECT 100, 1, 'Y', DATE '2021-12-20', DATE '2026-12-20' FROM DUAL;
Outputs:
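ID  | PP | TYPE | START_DT            | END_DT              | CNT
--: | -: | :--- | :------------------ | :------------------ | --:
100 | 1  | Y    | 2019-05-01 00:00:00 | 2020-10-01 00:00:00 | 2
100 | 1  | N    | 2020-10-01 00:00:00 | 2021-12-02 00:00:00 | 2
100 | 1  | Y    | 2021-12-02 00:00:00 | 2026-12-20 00:00:00 | 2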
If you want to use analytic and aggregation functions then it is a bit more complicated:
SELECT id, pp, type,
MIN(start_dt) AS start_dt,
MAX(end_dt) AS end_dt,
COUNT(*) AS cnt
FROM (
SELECT id, pp, type, start_dt, end_dt,
SUM(grp_change) OVER (
PARTITION BY id, pp, type
ORDER BY start_dt
) AS grp
FROM (
SELECT t.*,
CASE
WHEN start_dt <= MAX(end_dt) OVER (
PARTITION BY id, pp, type
ORDER BY start_dt
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
THEN 0
ELSE 1
END AS grp_change
FROM table_name t
)
)
GROUP BY id, pp, type, grp
ORDER BY id, pp, start_dt
I prefer this version because comparing "type = next(type)" without "type" being in the "order by" may lead to errors.
match_recognize(
partition by id, pp, type
order by start_dt,end_dt
measures first(start_dt) as start_dt, max(end_dt) as end_dt, count(*) as n
pattern (merged* strt)
define
merged as max(end_dt) >= next(start_dt)
)
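If it helps to see the merging rule without MATCH_RECOGNIZE, here is a rough Python sketch of the same logic: sort by start date and extend the running interval while the type stays the same and the next row starts no later than the running maximum end date. `merge_intervals` is a hypothetical helper name, and rows are assumed to be plain (id, pp, type, start_dt, end_dt) tuples.

```python
from datetime import date


def merge_intervals(rows):
    """Merge neighbouring rows of the same type that overlap the running interval.

    rows: iterable of (id, pp, type, start_dt, end_dt) tuples.
    Returns a list of (id, pp, type, start_dt, end_dt, cnt) tuples.
    """
    merged = []
    # Order by id, pp, then start and end date, like PARTITION BY / ORDER BY.
    for id_, pp, type_, start, end in sorted(rows, key=lambda r: (r[0], r[1], r[3], r[4])):
        last = merged[-1] if merged else None
        if last and last[:3] == [id_, pp, type_] and start <= last[4]:
            # Same client and type, and the row starts inside the running interval.
            last[4] = max(last[4], end)
            last[5] += 1
        else:
            merged.append([id_, pp, type_, start, end, 1])
    return [tuple(m) for m in merged]


sample = [
    (100, 1, "Y", date(2019, 5, 1), date(2020, 10, 1)),
    (100, 1, "Y", date(2020, 8, 10), date(2020, 10, 1)),
    (100, 1, "N", date(2020, 10, 1), date(2021, 12, 2)),
    (100, 1, "N", date(2020, 12, 13), date(2021, 12, 2)),
    (100, 1, "Y", date(2021, 12, 2), date(2026, 12, 2)),
    (100, 1, "Y", date(2021, 12, 20), date(2026, 12, 20)),
]
for row in merge_intervals(sample):
    print(row)
```

For the sample data this yields three rows (Y, N, Y), each with a count of 2, which is the result asked for in the question.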
I need to create a query that returns time intervals from a table that has an attribute for (almost) every day.
The original table looks like the following:
Person | Date | Date_Type
-------|------------|----------
Sam | 01.06.2020 | Vacation
Sam | 02.06.2020 | Vacation
Sam | 03.06.2020 | Work
Sam | 04.06.2020 | Work
Sam | 05.06.2020 | Work
Frodo | 01.06.2020 | Work
Frodo | 02.06.2020 | Work
.....
And the desired result should look like:
Person | Date_Interval | Date_Type
-------|-----------------------|----------
Sam | 01.06.2020-02.06.2020 | Vacation
Sam | 03.06.2020-05.06.2020 | Work
Frodo | 01.06.2020-02.06.2020 | Work
.....
Will be grateful for any idea :)
This reads like a gaps-and-islands problem. Here is one approach:
select person, min(date) startdate, max(date) enddate, date_type
from (
select t.*,
row_number() over(partition by person order by date) rn1,
row_number() over(partition by person, date_type order by date) rn2
from mytable t
) t
group by person, date_type, rn1 - rn2
This also works if not all dates are contiguous (since you stated that you have almost all dates, I understood you don't have them all).
This is a type of gaps-and-islands problem.
To get adjacent days with the same date_type, you can subtract a sequence. It will be constant for adjacent days. Then you can aggregate:
select person, date_type, min(date), max(date)
from (select t.*,
row_number() over (partition by person, date_type
order by date) as seqnum
from t
) t
group by person, date_type, (date - seqnum);
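To see why subtracting a sequence works, here is a small Python sketch of the same trick: within one person and date_type, the value "date minus row number" stays constant across a run of consecutive days, so it can serve as the grouping key. `islands` is a hypothetical helper name, and rows are assumed to be (person, day, date_type) tuples.

```python
from datetime import date, timedelta
from itertools import groupby


def islands(rows):
    """Collapse consecutive days with the same date_type into intervals.

    rows: iterable of (person, day, date_type) tuples.
    Returns a list of (person, first_day, last_day, date_type) tuples.
    """
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2], r[1]))
    for (person, date_type), grp in groupby(ordered, key=lambda r: (r[0], r[2])):
        # enumerate() plays the role of row_number(); day - seqnum is
        # constant for a run of adjacent days.
        numbered = enumerate(grp)
        for _, run in groupby(numbered, key=lambda p: p[1][1] - timedelta(days=p[0])):
            days = [r[1] for _, r in run]
            result.append((person, days[0], days[-1], date_type))
    return sorted(result, key=lambda r: (r[0], r[1]))


sample = [
    ("Sam", date(2020, 6, 1), "Vacation"),
    ("Sam", date(2020, 6, 2), "Vacation"),
    ("Sam", date(2020, 6, 3), "Work"),
    ("Sam", date(2020, 6, 4), "Work"),
    ("Sam", date(2020, 6, 5), "Work"),
    ("Frodo", date(2020, 6, 1), "Work"),
    ("Frodo", date(2020, 6, 2), "Work"),
]
for row in islands(sample):
    print(row)
```

Unlike the rn1 - rn2 variant, this key also splits an island when a calendar day is missing, because the date itself takes part in the subtraction.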
One of the simplest methods is to use MATCH_RECOGNIZE to perform a row-by-row comparison and aggregation:
SELECT *
FROM table_name
MATCH_RECOGNIZE (
PARTITION BY Person
ORDER BY "DATE"
MEASURES
FIRST( "DATE" ) AS start_date,
LAST( "DATE") AS end_date,
FIRST( Date_Type ) AS date_type
ONE ROW PER MATCH
-- last_date is left undefined, so it matches any row and closes the island
PATTERN ( successive_dates* last_date )
DEFINE
SUCCESSIVE_DATES AS (
FIRST( Date_Type ) = NEXT( Date_Type )
AND MAX( "DATE" ) + INTERVAL '1' DAY = NEXT( "DATE")
)
);
Which, for the sample data:
CREATE TABLE table_name ( Person, "DATE", Date_Type ) AS
SELECT 'Sam', DATE '2020-06-01', 'Vacation' FROM DUAL UNION ALL
SELECT 'Sam', DATE '2020-06-02', 'Vacation' FROM DUAL UNION ALL
SELECT 'Sam', DATE '2020-06-03', 'Work' FROM DUAL UNION ALL
SELECT 'Sam', DATE '2020-06-04', 'Work' FROM DUAL UNION ALL
SELECT 'Sam', DATE '2020-06-05', 'Work' FROM DUAL UNION ALL
SELECT 'Frodo', DATE '2020-06-01', 'Work' FROM DUAL UNION ALL
SELECT 'Frodo', DATE '2020-06-02', 'Work' FROM DUAL;
Outputs:
PERSON | START_DATE | END_DATE | DATE_TYPE
:----- | :------------------ | :------------------ | :--------
Frodo | 2020-06-01 00:00:00 | 2020-06-02 00:00:00 | Work
Sam | 2020-06-01 00:00:00 | 2020-06-02 00:00:00 | Vacation
Sam | 2020-06-03 00:00:00 | 2020-06-05 00:00:00 | Work
I am unable to group by the date part of a timestamp column in the query below:
CHG_TABLE
+----+--------+----------------+-----------------+-------+-----------+
| Key|Seq_Num | Start_Date | End_Date | Value |Record_Type|
+----+--------+----------------+-----------------+-------+-----------+
| 1 | 1 | 5/25/2019 2.05 | 12/31/9999 00.00| 800 | Insert |
| 1 | 1 | 5/25/2019 2.05 | 5/31/2019 11.12 | 800 | Update |
| 1 | 2 | 5/31/2019 11.12| 12/31/9999 00.00| 900 | Insert |
| 1 | 2 | 5/31/2019 11.12| 6/15/2019 12.05 | 900 | Update |
| 1 | 3 | 6/15/2019 12.05| 12/31/9999 00.00| 1000 | Insert |
| 1 | 3 | 6/15/2019 12.05| 6/25/2019 10.20 | 1000 | Update |
+----+--------+----------------+-----------------+-------+-----------+
RESULT:
+-----+------------------+----------------+-----------+----------+
| Key | Month_Start_Date | Month_End_Date |Begin_Value|End_Value |
+-----+------------------+----------------+-----------+----------+
| 1 | 6/1/2019 | 6/30/2019 | 1700 | 1000 |
| 1 | 7/1/2019 | 7/31/2019 | 1000 | 1000 |
+-----+------------------+----------------+-----------+----------+
Begin_Value : Sum(Value) for Max(Start_Date) < Month_Start_Date -> Should pick up latest date from last month
End_Value : Sum(Value) for Max(Start_Date) <= Month_End_Date -> Should pick up the latest date
SELECT k.key,
dd.month_start_date,
dd.month_end_date,
gendata.value first_value,
gendata.next_value last_value
FROM dim_date dd CROSS JOIN dim_person k
JOIN (SELECT ct.key,
dateadd('day',1,last_day(ct.start_date)) start_date ,
SUM(ct.value),
lead(SUM(ct.value)) OVER(ORDER BY ct.start_date) next_value
FROM (SELECT key,to_char(start_Date,'MM-YYYY') MMYYYY, max(start_Date) start_date
FROM CHG_TABLE
GROUP BY to_char(start_Date,'MM-YYYY'), key
) dt JOIN CHG_TABLE ct ON
dt.start_date = ct.start_date AND
dt.key = ct.key
group by ct.key, to_char(start_Date,'MM-YYYY')
) gendata ON
to_char(dd.month_end_date,'MM-YYYY') = to_char(to_char(start_Date,'MM-YYYY')) AND
k.key = gendata.key;
Error:
start_Date is not a valid group by expression
Related post:
Monthly Snapshot using Date Dimension
I hope I understood your question correctly.
You can check the query below:
WITH chg_table ( key, seq_num, start_date, end_date, value, record_type ) AS
(
SELECT 1,1,TO_DATE('5/25/2019 2.05','MM/DD/YYYY HH24.MI'),TO_DATE('12/31/9999 00.00','MM/DD/YYYY HH24.MI'), 800, 'Insert' FROM DUAL UNION ALL
SELECT 1,1,TO_DATE('5/25/2019 2.05','MM/DD/YYYY HH24.MI'),TO_DATE('5/31/2019 11.12','MM/DD/YYYY HH24.MI'), 800, 'Update' FROM DUAL UNION ALL
SELECT 1,2,TO_DATE('5/31/2019 11.12','MM/DD/YYYY HH24.MI'),TO_DATE('12/31/9999 00.00','MM/DD/YYYY HH24.MI'), 900, 'Insert' FROM DUAL UNION ALL
SELECT 1,2,TO_DATE('5/31/2019 11.12','MM/DD/YYYY HH24.MI'),TO_DATE('6/15/2019 12.05','MM/DD/YYYY HH24.MI'), 900, 'Update' FROM DUAL UNION ALL
SELECT 1,3,TO_DATE('6/15/2019 12.05','MM/DD/YYYY HH24.MI'),TO_DATE('12/31/9999 00.00','MM/DD/YYYY HH24.MI'), 1000, 'Insert' FROM DUAL UNION ALL
SELECT 1,3,TO_DATE('6/15/2019 12.05','MM/DD/YYYY HH24.MI'),TO_DATE('6/25/2019 10.20','MM/DD/YYYY HH24.MI'), 1000, 'Update' FROM DUAL
)
select key , new_start_date Month_Start_Date , new_end_date Month_End_Date , begin_value ,
nvl(lead(begin_value) over(order by new_start_date),begin_value) end_value
from
(
select key , new_start_date , new_end_date , sum(value) begin_value
from
(
select key, seq_num, start_date
, value, record_type ,
trunc(add_months(start_date,1),'month') new_start_date ,
trunc(add_months(start_date,2),'month')-1 new_end_date
from chg_table
where record_type = 'Insert'
)
group by key , new_start_date , new_end_date
)
order by new_start_date
;
Db Fiddle link: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=c77a71afa82769b48f424e1c0fa1c0b6
I am assuming that you are getting an "ORA-00979: not a GROUP BY expression" and this is due to your use of the TO_CHAR(timestamp_col,'DD-MM-YYYY') in the GROUP BY clause.
Adding the TO_CHAR(timestamp_col,'DD-MM-YYYY') to the select side of your statement should resolve this and provide the results you are expecting.
a, b, dateadd('day',1,last_day(timestamp_col)) start_date, TO_CHAR(timestamp_col,'DD-MM-YYYY'), ...
I have a database table with a start date and a number of months. How can I transform that into multiple rows based on the number of months?
I want to transform this
Into this:
We can try using a calendar table here, which includes all possible start of month dates which might appear in the expected output:
with calendar as (
select '2017-09-01'::date as dt union all
select '2017-10-01'::date union all
select '2017-11-01'::date union all
select '2017-12-01'::date union all
select '2018-01-01'::date union all
select '2018-02-01'::date union all
select '2018-03-01'::date union all
select '2018-04-01'::date union all
select '2018-05-01'::date union all
select '2018-06-01'::date union all
select '2018-07-01'::date union all
select '2018-08-01'::date
)
select
t.id as subscription_id,
c.dt,
t.amount_monthly
from calendar c
inner join your_table t
on c.dt >= t.start_date and
c.dt < t.start_date + (t.month_count::text || ' month')::interval
order by
t.id,
c.dt;
This can easily be done using generate_series() in Postgres
select t.id,
g.dt::date,
t.amount_monthly
from the_table t
cross join generate_series(t.start_date,
t.start_date + interval '1' month * (t.month_count - 1),
interval '1' month) as g(dt);
OK, it's very easy to implement this in PostgreSQL, just use generate_series, as below:
select * from month_table ;
id | start_date | month_count | amount | amount_monthly
------+------------+-------------+--------+----------------
1382 | 2017-09-01 | 3 | 38 | 1267
1383 | 2018-02-01 | 6 | 50 | 833
(2 rows)
select
id,
generate_series(start_date,start_date + (month_count || ' month') :: interval - '1 month'::interval, '1 month'::interval)::date as date,
amount_monthly
from
month_table ;
id | date | amount_monthly
------+------------+----------------
1382 | 2017-09-01 | 1267
1382 | 2017-10-01 | 1267
1382 | 2017-11-01 | 1267
1383 | 2018-02-01 | 833
1383 | 2018-03-01 | 833
1383 | 2018-04-01 | 833
1383 | 2018-05-01 | 833
1383 | 2018-06-01 | 833
1383 | 2018-07-01 | 833
(9 rows)
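For comparison, the same expansion can be sketched in plain Python, which may help when checking the expected row count by hand. This is only an illustration: `add_months` and `expand` are hypothetical helper names, and start dates are assumed to fall on the first of the month, as in the sample data.

```python
from datetime import date


def add_months(d, n):
    """Shift a date by n months (assumes the day exists in the target month)."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, d.day)


def expand(rows):
    """Turn (id, start_date, month_count, amount_monthly) into one row per month."""
    return [
        (id_, add_months(start, i), amount_monthly)
        for id_, start, month_count, amount_monthly in rows
        for i in range(month_count)
    ]


sample = [
    (1382, date(2017, 9, 1), 3, 1267),
    (1383, date(2018, 2, 1), 6, 833),
]
for row in expand(sample):
    print(row)
```

The loop over range(month_count) corresponds to the generate_series() call that stops at start_date + (month_count - 1) months.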
You may not need so many subqueries but this should help you understand how it can be broken down
WITH date_minmax AS(
SELECT
min(start_date) as date_first,
(max(start_date) + (month_count::text || ' months')::interval)::date AS date_last
FROM "your_table"
GROUP BY month_count
), series AS (
SELECT generate_series(
date_first,
date_last,
'1 month'::interval
)::date as list_date
FROM date_minmax
)
SELECT
id as subscription_id,
list_date as date,
amount_monthly as amount
FROM series
JOIN "your_table"
ON list_date <@ daterange(
start_date,
(start_date + (month_count::text || ' months')::interval)::date
)
ORDER BY list_date
This should achieve the desired result http://www.sqlfiddle.com/#!17/7d943/1
I have a simple select query that has this result:
first_date | last_date | outstanding
14/01/2015 | 14/04/2015 | 100000
I want to split it to be
first_date | last_date | period | outstanding
14/01/2015 | 31/01/2015 | 31/01/2015 | 100000
01/02/2015 | 28/02/2015 | 28/02/2015 | 100000
01/03/2015 | 31/03/2015 | 31/03/2015 | 100000
01/04/2015 | 14/04/2015 | 31/04/2015 | 100000
Please show me how to do it simply, without using function/procedure, object and cursor.
Try:
WITH my_query_result AS(
SELECT date '2015-01-14' as first_date , date '2015-04-14' as last_date,
100000 as outstanding
FROM dual
)
SELECT greatest( trunc( add_months( first_date, level - 1 ),'MM'), first_date )
as first_date,
least( trunc( add_months( first_date, level ),'MM')-1, last_date )
as last_date,
trunc( add_months( first_date, level ),'MM')-1 as period,
outstanding
FROM my_query_result t
connect by level <= months_between( trunc(last_date,'MM'), trunc(first_date,'MM') ) + 1;
A side note: April has only 30 days, so a date 31/04/2015 in your question is wrong.
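The greatest/least logic above can also be checked with a short Python sketch: each slice starts at the later of the month start and first_date, and ends at the earlier of the month end and last_date. `split_by_month` is a hypothetical helper name used only for this illustration.

```python
import calendar
from datetime import date, timedelta


def split_by_month(first_date, last_date):
    """Yield (first_date, last_date, period) for each calendar month touched."""
    cur = first_date
    while cur <= last_date:
        # monthrange() returns (weekday of day 1, number of days in the month).
        month_end = cur.replace(day=calendar.monthrange(cur.year, cur.month)[1])
        yield cur, min(month_end, last_date), month_end
        cur = month_end + timedelta(days=1)


for row in split_by_month(date(2015, 1, 14), date(2015, 4, 14)):
    print(row)
```

For the dates in the question this produces four slices, and the period of the last one is 30 April 2015.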