Convert data from wide to long with sequential dates in PostgreSQL - SQL

I have a table with date ranges like below:
id start_date end_date product supply_per_day
1 2020-03-01 2020-03-01 A 10
1 2020-03-01 2020-03-01 B 10
1 2020-03-01 2020-03-02 A 5
2 2020-02-28 2020-03-02 A 10
2 2020-03-01 2020-03-03 B 4
2 2020-03-02 2020-03-05 A 5
I want to reshape this data from wide to long, like:
id date product supply_per_day
1 2020-03-01 A 10
1 2020-03-01 B 10
1 2020-03-01 A 5
1 2020-03-02 A 5
2 2020-02-28 A 10
2 2020-02-29 A 10
2 2020-03-01 A 10
2 2020-03-02 A 10
2 2020-03-01 B 4
2 2020-03-02 B 4
2 2020-03-03 B 4
2 2020-03-02 A 5
2 2020-03-03 A 5
2 2020-03-04 A 5
2 2020-03-05 A 5
Could someone give me some ideas, please?

For Oracle 12c and later, you can use:
SELECT t.id,
       d.dt,
       t.product,
       t.supply_per_day
FROM   table_name t
       OUTER APPLY (
         SELECT start_date + LEVEL - 1 AS dt
         FROM   DUAL
         CONNECT BY start_date + LEVEL - 1 <= end_date
       ) d
Which, for the sample data:
CREATE TABLE table_name ( id, start_date, end_date, product, supply_per_day ) AS
SELECT 1, DATE '2020-03-01', DATE '2020-03-01', 'A', 10 FROM DUAL UNION ALL
SELECT 1, DATE '2020-03-01', DATE '2020-03-01', 'B', 10 FROM DUAL UNION ALL
SELECT 1, DATE '2020-03-01', DATE '2020-03-02', 'A', 5 FROM DUAL UNION ALL
SELECT 2, DATE '2020-02-28', DATE '2020-03-02', 'A', 10 FROM DUAL UNION ALL
SELECT 2, DATE '2020-03-01', DATE '2020-03-03', 'B', 4 FROM DUAL UNION ALL
SELECT 2, DATE '2020-03-02', DATE '2020-03-05', 'A', 5 FROM DUAL;
Outputs:
ID  DT                   PRODUCT  SUPPLY_PER_DAY
1   2020-03-01 00:00:00  A        10
1   2020-03-01 00:00:00  B        10
1   2020-03-01 00:00:00  A        5
1   2020-03-02 00:00:00  A        5
2   2020-02-28 00:00:00  A        10
2   2020-02-29 00:00:00  A        10
2   2020-03-01 00:00:00  A        10
2   2020-03-02 00:00:00  A        10
2   2020-03-01 00:00:00  B        4
2   2020-03-02 00:00:00  B        4
2   2020-03-03 00:00:00  B        4
2   2020-03-02 00:00:00  A        5
2   2020-03-03 00:00:00  A        5
2   2020-03-04 00:00:00  A        5
2   2020-03-05 00:00:00  A        5

In Postgres you can use generate_series() for this:
select t.id, g.day::date as date, t.product, t.supply_per_day
from the_table t
cross join generate_series(t.start_date, t.end_date, interval '1 day') as g(day)
order by t.id, g.day
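The same expansion can be reproduced without a Postgres instance. This is only a sketch: SQLite (via Python's sqlite3) stands in for Postgres here, using a recursive CTE in place of generate_series(); the table and column names follow the question.

```python
import sqlite3

# In-memory copy of the question's sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE the_table (id INT, start_date TEXT, end_date TEXT,
                        product TEXT, supply_per_day INT);
INSERT INTO the_table VALUES
  (1, '2020-03-01', '2020-03-01', 'A', 10),
  (1, '2020-03-01', '2020-03-01', 'B', 10),
  (1, '2020-03-01', '2020-03-02', 'A', 5),
  (2, '2020-02-28', '2020-03-02', 'A', 10),
  (2, '2020-03-01', '2020-03-03', 'B', 4),
  (2, '2020-03-02', '2020-03-05', 'A', 5);
""")

# Walk each range forward one day at a time until end_date is reached.
rows = conn.execute("""
WITH RECURSIVE days(id, day, end_date, product, supply_per_day) AS (
    SELECT id, start_date, end_date, product, supply_per_day FROM the_table
    UNION ALL
    SELECT id, date(day, '+1 day'), end_date, product, supply_per_day
    FROM days
    WHERE day < end_date
)
SELECT id, day, product, supply_per_day
FROM days
ORDER BY id, product, day
""").fetchall()

print(len(rows))  # 15 rows, including the leap day 2020-02-29
```

Note that 2020 is a leap year, so id 2's 2020-02-28 to 2020-03-02 range expands to four rows, including 2020-02-29.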

Related

SQL time-series resampling

I have a ClickHouse table with rows like this:
id                   created_at
6962098097124188161  2022-07-01 00:00:00
6968111372399976448  2022-07-02 00:00:00
6968111483775524864  2022-07-03 00:00:00
6968465518567268352  2022-07-04 00:00:00
6968952917160271872  2022-07-07 00:00:00
6968952924479332352  2022-07-09 00:00:00
I need to resample the time series and get a cumulative count by date, like this:
created_at           count
2022-07-01 00:00:00  1
2022-07-02 00:00:00  2
2022-07-03 00:00:00  3
2022-07-04 00:00:00  4
2022-07-05 00:00:00  4
2022-07-06 00:00:00  4
2022-07-07 00:00:00  5
2022-07-08 00:00:00  5
2022-07-09 00:00:00  6
I've tried this
SELECT
    arrayJoin(
        timeSlots(
            MIN(created_at),
            toUInt32(24 * 3600 * 10),
            24 * 3600
        )
    ) AS ts,
    SUM(COUNT(*)) OVER (ORDER BY ts)
FROM table
but it counts all rows.
How can I get the expected result?
Why not use GROUP BY on the day? Something like:
select toDate(created_at) as d, count(*) from table_name group by d
(On its own this gives per-day counts only; it does not fill the missing days or produce the running total.)
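The grouped query is only the first step; the expected table additionally fills the missing days and keeps a running total. In ClickHouse itself, ORDER BY ... WITH FILL is one server-side option for materializing the missing days. The overall pipeline (count per day, fill gaps, accumulate) can be sketched in plain Python with the question's dates:

```python
from datetime import date, timedelta
from itertools import accumulate

# The question's created_at values, reduced to dates (one row each).
created = [date(2022, 7, d) for d in (1, 2, 3, 4, 7, 9)]

# 1) per-day counts (what GROUP BY toDate(created_at) produces)
counts = {}
for d in created:
    counts[d] = counts.get(d, 0) + 1

# 2) fill missing days between the min and max date
start, end = min(counts), max(counts)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# 3) running (cumulative) total over the filled series
running = list(accumulate(counts.get(d, 0) for d in days))
print(list(zip(days, running)))
```

Steps 2 and 3 are exactly what the comment's GROUP BY alone does not cover.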

How to dynamically add rows with last-12-month totals when a particular period is missing within the 12-month window?

This is the input Table:
ITEM  QTY  DATEPERIOD
A     2    1/1/2020 0:00
A     3    2/1/2020 0:00
A     4    3/1/2020 0:00
A     1    4/1/2020 0:00
A     2    5/1/2020 0:00
A     2    6/1/2020 0:00
A     2    8/1/2020 0:00
A     2    10/1/2020 0:00
A     2    12/1/2020 0:00
A     2    1/1/2021 0:00
A     3    2/1/2021 0:00
A     4    3/1/2021 0:00
A     2    5/1/2021 0:00
A     2    6/1/2021 0:00
A     2    8/1/2021 0:00
A     1    9/1/2021 0:00
A     2    10/1/2021 0:00
A     1    11/1/2021 0:00
A     1    12/1/2021 0:00
This input table has no data for July 2021. When I calculate the last 12 months for each row, I can get last-12-month totals for the months that do exist in the table.
But since the input table has no row for July 2021, the usual query
SUM(qty) OVER (
    PARTITION BY item
    ORDER BY dateperiod
    RANGE BETWEEN INTERVAL '11' MONTH PRECEDING
              AND INTERVAL '0'  MONTH FOLLOWING
) AS total,
would only produce last-12-month totals for months that exist in the table (e.g. June 2021). The expected output is that, even though there is no data for July 2021, a row is dynamically generated whose last-12-month window runs from Aug 2020 through July 2021; the resulting qty is 19.
Similarly, the input table is missing April 2021, so the query should generate a row whose last-12-month window runs from May 2020 through April 2021; the resulting qty is also 19.
So the expected output will be in the form of
ITEM  DATEPERIOD  Output
A     1/1/2020    2
A     2/1/2020    5
A     3/1/2020    9
A     4/1/2020    10
A     5/1/2020    12
A     6/1/2020    14
A     7/1/2020    14
A     8/1/2020    16
A     9/1/2020    16
A     10/1/2020   18
A     11/1/2020   18
A     12/1/2020   20
A     1/1/2021    20
A     2/1/2021    20
A     3/1/2021    20
A     4/1/2021    19
A     5/1/2021    19
A     6/1/2021    19
A     7/1/2021    19
A     8/1/2021    19
A     9/1/2021    20
A     10/1/2021   20
A     11/1/2021   21
A     12/1/2021   20
Please let me know if this is possible
You can use a hierarchical query to generate a calendar and then use an OUTER JOIN to join it to your data (however, since you are doing it per item, you probably want a PARTITIONed OUTER JOIN):
WITH calendar (month) AS (
  SELECT ADD_MONTHS(min_dp, LEVEL - 1) AS month
  FROM (
    SELECT MIN(dateperiod) AS min_dp,
           MAX(dateperiod) AS max_dp
    FROM   table_name
  )
  CONNECT BY ADD_MONTHS(min_dp, LEVEL - 1) <= max_dp
)
SELECT item,
       c.month AS dateperiod,
       COALESCE(t.qty, 0) AS qty,
       SUM(t.qty) OVER (
         PARTITION BY t.item
         ORDER BY c.month
         RANGE BETWEEN INTERVAL '11' MONTH PRECEDING
                   AND INTERVAL '0'  MONTH FOLLOWING
       ) AS total
FROM   calendar c
       LEFT OUTER JOIN table_name t
       PARTITION BY (t.item)
       ON (c.month = t.dateperiod);
Which, for your sample data:
CREATE TABLE table_name (ITEM, QTY, DATEPERIOD) AS
SELECT 'A', 2, DATE '2020-01-01' FROM DUAL UNION ALL
SELECT 'A', 3, DATE '2020-02-01' FROM DUAL UNION ALL
SELECT 'A', 4, DATE '2020-03-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2020-04-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-05-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-06-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-08-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-10-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2020-12-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-01-01' FROM DUAL UNION ALL
SELECT 'A', 3, DATE '2021-02-01' FROM DUAL UNION ALL
SELECT 'A', 4, DATE '2021-03-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-05-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-06-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-08-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2021-09-01' FROM DUAL UNION ALL
SELECT 'A', 2, DATE '2021-10-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2021-11-01' FROM DUAL UNION ALL
SELECT 'A', 1, DATE '2021-12-01' FROM DUAL;
Outputs:
ITEM  DATEPERIOD           QTY  TOTAL
A     2020-01-01 00:00:00  2    2
A     2020-02-01 00:00:00  3    5
A     2020-03-01 00:00:00  4    9
A     2020-04-01 00:00:00  1    10
A     2020-05-01 00:00:00  2    12
A     2020-06-01 00:00:00  2    14
A     2020-07-01 00:00:00  0    14
A     2020-08-01 00:00:00  2    16
A     2020-09-01 00:00:00  0    16
A     2020-10-01 00:00:00  2    18
A     2020-11-01 00:00:00  0    18
A     2020-12-01 00:00:00  2    20
A     2021-01-01 00:00:00  2    20
A     2021-02-01 00:00:00  3    20
A     2021-03-01 00:00:00  4    20
A     2021-04-01 00:00:00  0    19
A     2021-05-01 00:00:00  2    19
A     2021-06-01 00:00:00  2    19
A     2021-07-01 00:00:00  0    19
A     2021-08-01 00:00:00  2    19
A     2021-09-01 00:00:00  1    20
A     2021-10-01 00:00:00  2    20
A     2021-11-01 00:00:00  1    21
A     2021-12-01 00:00:00  1    20
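The calendar plus 11-months-preceding window above can be sanity-checked without a database. This plain-Python sketch (the helper names are invented here) recomputes the rolling totals from the sample data and reproduces the 19s for the missing months:

```python
# Month -> qty from the question's sample data (2020-07, 2020-09,
# 2020-11, 2021-04 and 2021-07 are deliberately absent).
qty = {
    (2020, 1): 2, (2020, 2): 3, (2020, 3): 4, (2020, 4): 1,
    (2020, 5): 2, (2020, 6): 2, (2020, 8): 2, (2020, 10): 2,
    (2020, 12): 2, (2021, 1): 2, (2021, 2): 3, (2021, 3): 4,
    (2021, 5): 2, (2021, 6): 2, (2021, 8): 2, (2021, 9): 1,
    (2021, 10): 2, (2021, 11): 1, (2021, 12): 1,
}

def months(start, end):
    """Yield every (year, month) from start to end -- the calendar CTE."""
    y, m = start
    while (y, m) <= end:
        yield (y, m)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def minus_months(ym, n):
    """Step (year, month) back n months -- the 11 MONTH PRECEDING bound."""
    idx = ym[0] * 12 + ym[1] - 1 - n
    return idx // 12, idx % 12 + 1

# Every calendar month gets a total, including months with no row.
totals = {}
for ym in months((2020, 1), (2021, 12)):
    lo = minus_months(ym, 11)
    totals[ym] = sum(q for k, q in qty.items() if lo <= k <= ym)
```

For the missing July 2021, the window Aug 2020 through Jul 2021 sums to 19, matching the expected output.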

How can I join two tables on an ID and a DATE RANGE in SQL

I have 2 query result tables containing records for different assessments. There are RAssessments and NAssessments which make up a complete review.
The aim is eventually to determine which reviews were completed. I would like to join the two tables on ID and on date; HOWEVER, the dates the two assessments were completed on may not be identical and may be several days apart, and some IDs may have more RAssessments than NAssessments.
Therefore, I would like to join T1 to T2 on ID and on T1Date (plus or minus 7 days). There is no other way to match the two tables and align the records other than using a date range, as this is a poorly designed database. I hope for some help with this as I am stumped.
Here is some sample data:
Table #1:
ID  RAssessmentDate
1   2020-01-03
1   2020-03-03
1   2020-05-03
2   2020-01-09
2   2020-04-09
3   2022-07-21
4   2020-06-30
4   2020-12-30
4   2021-06-30
4   2021-12-30
Table #2:
ID  NAssessmentDate
1   2020-01-07
1   2020-03-02
1   2020-05-03
2   2020-01-09
2   2020-07-06
2   2020-04-10
3   2022-07-21
4   2021-01-03
4   2021-06-28
4   2022-01-02
4   2022-06-26
I would like my end result table to look like this:
ID  RAssessmentDate  NAssessmentDate
1   2020-01-03       2020-01-07
1   2020-03-03       2020-03-02
1   2020-05-03       2020-05-03
2   2020-01-09       2020-01-09
2   2020-04-09       2020-04-10
2   NULL             2020-07-06
3   2022-07-21       2022-07-21
4   2020-06-30       NULL
4   2020-12-30       2021-01-03
4   2021-06-30       2021-06-28
4   2021-12-30       2022-01-02
4   NULL             2022-06-26
Try this:
SELECT COALESCE(a.ID, b.ID) AS ID,
       a.RAssessmentDate,
       b.NAssessmentDate
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY RAssessmentDate) AS RowId, *
    FROM table1
) a
FULL OUTER JOIN (
    SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY NAssessmentDate) AS RowId, *
    FROM table2
) b ON a.ID = b.ID AND a.RowId = b.RowId
WHERE (a.RAssessmentDate BETWEEN '2020-01-01' AND '2022-01-02')
   OR (b.NAssessmentDate BETWEEN '2020-01-01' AND '2022-01-02')
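Note that the ROW_NUMBER pairing only lines rows up positionally; it does not enforce the plus-or-minus 7-day rule from the question. A date-difference join is closer to that intent. Here is a minimal sketch using Python's sqlite3 (julianday stands in for your database's date-difference function; only ID 1 from the sample is loaded):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (ID INT, RAssessmentDate TEXT);
CREATE TABLE table2 (ID INT, NAssessmentDate TEXT);
INSERT INTO table1 VALUES (1, '2020-01-03'), (1, '2020-03-03'), (1, '2020-05-03');
INSERT INTO table2 VALUES (1, '2020-01-07'), (1, '2020-03-02'), (1, '2020-05-03');
""")

# Join on ID plus a +/- 7 day window around RAssessmentDate.
rows = conn.execute("""
SELECT t1.ID, t1.RAssessmentDate, t2.NAssessmentDate
FROM table1 t1
LEFT JOIN table2 t2
       ON t2.ID = t1.ID
      AND ABS(julianday(t2.NAssessmentDate) - julianday(t1.RAssessmentDate)) <= 7
ORDER BY t1.ID, t1.RAssessmentDate
""").fetchall()
print(rows)
```

A full solution still needs FULL OUTER semantics (for the unmatched NULL rows) and a tie-break when several dates fall inside the same window.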

auto increment inside group

I have a dataframe:
import pandas as pd

df = pd.DataFrame.from_dict({
    'product': ('a', 'a', 'a', 'a', 'c', 'b', 'b', 'b'),
    'sales': ('-', '-', 'hot_price', 'hot_price', '-', 'min_price', 'min_price', 'min_price'),
    'price': (100, 100, 50, 50, 90, 70, 70, 70),
    'dt': ('2020-01-01 00:00:00', '2020-01-01 00:05:00', '2020-01-01 00:07:00', '2020-01-01 00:10:00', '2020-01-01 00:13:00', '2020-01-01 00:15:00', '2020-01-01 00:19:00', '2020-01-01 00:21:00'),
})
product sales price dt
0 a - 100 2020-01-01 00:00:00
1 a - 100 2020-01-01 00:05:00
2 a hot_price 50 2020-01-01 00:07:00
3 a hot_price 50 2020-01-01 00:10:00
4 c - 90 2020-01-01 00:13:00
5 b min_price 70 2020-01-01 00:15:00
6 b min_price 70 2020-01-01 00:19:00
7 b min_price 70 2020-01-01 00:21:00
I need the next output:
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
How I do it:
unique_group = 0
df['unique_group'] = unique_group
for i in range(1, len(df)):
    current, prev = df.loc[i], df.loc[i - 1]
    if not all([
        current['product'] == prev['product'],
        current['sales'] == prev['sales'],
        current['price'] == prev['price'],
    ]):
        unique_group += 1
    df.loc[i, 'unique_group'] = unique_group
Is it possible to do it without iteration? I tried using cumsum(), shift(), ngroup(), drop_duplicates() but unsuccessfully.
IIUC, GroupBy.ngroup:
df['unique_group'] = df.groupby(['product', 'sales', 'price'],sort=False).ngroup()
print(df)
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
This works even if the data frame is not ordered; note, though, that ngroup assigns the same number to an identical (product, sales, price) combination wherever it reappears, rather than starting a new group per consecutive run.
Another approach, which works when identical rows are consecutive (as in the ordered frame):
cols = ['product','sales','price']
df['unique_group'] = df[cols].ne(df[cols].shift()).any(axis=1).cumsum().sub(1)
Another option which might be a bit faster than groupby:
df['unique_group'] = (~df.duplicated(['product','sales','price'])).cumsum() - 1
Output:
product sales price dt unique_group
0 a - 100 2020-01-01 00:00:00 0
1 a - 100 2020-01-01 00:05:00 0
2 a hot_price 50 2020-01-01 00:07:00 1
3 a hot_price 50 2020-01-01 00:10:00 1
4 c - 90 2020-01-01 00:13:00 2
5 b min_price 70 2020-01-01 00:15:00 3
6 b min_price 70 2020-01-01 00:19:00 3
7 b min_price 70 2020-01-01 00:21:00 3
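The consecutive-run numbering that the shift/cumsum and duplicated tricks compute can also be made explicit with itertools.groupby in plain Python (no pandas), using the question's (product, sales, price) tuples:

```python
from itertools import groupby

# The question's rows, reduced to the columns that define a group.
rows = [
    ('a', '-', 100), ('a', '-', 100),
    ('a', 'hot_price', 50), ('a', 'hot_price', 50),
    ('c', '-', 90),
    ('b', 'min_price', 70), ('b', 'min_price', 70), ('b', 'min_price', 70),
]

# groupby() collapses consecutive equal tuples into runs; enumerate numbers them.
unique_group = []
for gid, (_, run) in enumerate(groupby(rows)):
    unique_group.extend(gid for _ in run)

print(unique_group)  # [0, 0, 1, 1, 2, 3, 3, 3]
```

This matches the loop in the question: a new number starts only when the tuple changes from the previous row.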

Transposing SQLite rows and columns with average per hour

I have a table in SQLite called param_vals_breaches that looks like the following:
id param queue date_time param_val breach_count
1 c a 2013-01-01 00:00:00 188 7
2 c b 2013-01-01 00:00:00 156 8
3 c c 2013-01-01 00:00:00 100 2
4 d a 2013-01-01 00:00:00 657 0
5 d b 2013-01-01 00:00:00 23 6
6 d c 2013-01-01 00:00:00 230 12
7 c a 2013-01-01 01:00:00 100 0
8 c b 2013-01-01 01:00:00 143 9
9 c c 2013-01-01 01:00:00 12 2
10 d a 2013-01-01 01:00:00 0 1
11 d b 2013-01-01 01:00:00 29 5
12 d c 2013-01-01 01:00:00 22 14
13 c a 2013-01-01 02:00:00 188 7
14 c b 2013-01-01 02:00:00 156 8
15 c c 2013-01-01 02:00:00 100 2
16 d a 2013-01-01 02:00:00 657 0
17 d b 2013-01-01 02:00:00 23 6
18 d c 2013-01-01 02:00:00 230 12
I want to write a query that will show me a particular queue (e.g. "a") with the average param_val and breach_count for each param on an hour by hour basis. So transposing the data to get something that looks like this:
Results for Queue A
Hour 0 Hour 0 Hour 1 Hour 1 Hour 2 Hour 2
param avg_param_val avg_breach_count avg_param_val avg_breach_count avg_param_val avg_breach_count
c xxx xxx xxx xxx xxx xxx
d xxx xxx xxx xxx xxx xxx
Is this possible? I'm not sure how to go about it. Thanks!
SQLite does not have a PIVOT function but you can use an aggregate function with a CASE expression to turn the rows into columns:
select param,
avg(case when time = '00' then param_val end) AvgHour0Val,
avg(case when time = '00' then breach_count end) AvgHour0Count,
avg(case when time = '01' then param_val end) AvgHour1Val,
avg(case when time = '01' then breach_count end) AvgHour1Count,
avg(case when time = '02' then param_val end) AvgHour2Val,
avg(case when time = '02' then breach_count end) AvgHour2Count
from
(
select param,
strftime('%H', date_time) time,
param_val,
breach_count
from param_vals_breaches
where queue = 'a'
) src
group by param;
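The conditional-aggregation pivot can be verified end to end with Python's sqlite3. This sketch loads a two-hour subset of the sample rows (a reduced dataset, not the full table) and runs a trimmed version of the query above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE param_vals_breaches (id INT, param TEXT, queue TEXT,
                                  date_time TEXT, param_val INT, breach_count INT);
INSERT INTO param_vals_breaches VALUES
  (1, 'c', 'a', '2013-01-01 00:00:00', 188, 7),
  (4, 'd', 'a', '2013-01-01 00:00:00', 657, 0),
  (7, 'c', 'a', '2013-01-01 01:00:00', 100, 0),
  (10, 'd', 'a', '2013-01-01 01:00:00', 0, 1);
""")

# One avg() per (hour, measure) pair turns the rows into columns.
rows = conn.execute("""
select param,
       avg(case when time = '00' then param_val end) AvgHour0Val,
       avg(case when time = '00' then breach_count end) AvgHour0Count,
       avg(case when time = '01' then param_val end) AvgHour1Val,
       avg(case when time = '01' then breach_count end) AvgHour1Count
from (
    select param, strftime('%H', date_time) time, param_val, breach_count
    from param_vals_breaches
    where queue = 'a'
) src
group by param
order by param
""").fetchall()
print(rows)
```

With one row per (param, hour) in this subset, each avg() is just that row's value, which makes the transposition easy to eyeball.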