SQL time-series resampling

I have a ClickHouse table with some rows like this:

id                   created_at
6962098097124188161  2022-07-01 00:00:00
6968111372399976448  2022-07-02 00:00:00
6968111483775524864  2022-07-03 00:00:00
6968465518567268352  2022-07-04 00:00:00
6968952917160271872  2022-07-07 00:00:00
6968952924479332352  2022-07-09 00:00:00
I need to resample the time series and get a running count by date, like this:

created_at           count
2022-07-01 00:00:00  1
2022-07-02 00:00:00  2
2022-07-03 00:00:00  3
2022-07-04 00:00:00  4
2022-07-05 00:00:00  4
2022-07-06 00:00:00  4
2022-07-07 00:00:00  5
2022-07-08 00:00:00  5
2022-07-09 00:00:00  6
I've tried this:

SELECT
    arrayJoin(
        timeSlots(
            MIN(created_at),
            toUInt32(24 * 3600 * 10),
            24 * 3600
        )
    ) AS ts,
    SUM(COUNT(*)) OVER (ORDER BY ts)
FROM table

but it counts all rows. How can I get the expected result?

Why not use GROUP BY on created_at? Like:

select count(*) from table_name group by toDate(created_at)
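Note that a plain GROUP BY gives per-day counts only; the question also needs the missing days filled in and a running sum on top. Here is a minimal sketch of that whole pipeline in Python using SQLite rather than ClickHouse (the table name `t` and the recursive CTE standing in for `timeSlots()` are assumptions for the demo):

```python
import sqlite3

# Fill the date range with a recursive CTE, count rows per day
# with a LEFT JOIN (missing days get 0), then take a running sum.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (1, "2022-07-01"), (2, "2022-07-02"), (3, "2022-07-03"),
        (4, "2022-07-04"), (5, "2022-07-07"), (6, "2022-07-09"),
    ],
)
rows = conn.execute("""
    WITH RECURSIVE days(d) AS (
        SELECT (SELECT MIN(date(created_at)) FROM t)
        UNION ALL
        SELECT date(d, '+1 day') FROM days
        WHERE d < (SELECT MAX(date(created_at)) FROM t)
    )
    SELECT d, SUM(c) OVER (ORDER BY d) AS cnt
    FROM (
        SELECT days.d AS d, COUNT(t.created_at) AS c
        FROM days LEFT JOIN t ON date(t.created_at) = days.d
        GROUP BY days.d
    )
    ORDER BY d
""").fetchall()
for d, cnt in rows:
    print(d, cnt)
```

The output matches the expected table above: days 2022-07-05, 06, and 08 appear with the carried-forward count.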

Related

How to allocate a list of payments to a list of invoices/charges in SQL?

Let's say I have the following two tables. The first is invoice data.

customer_id  scheduled_payment_date  scheduled_total_payment
1004         2021-04-08 00:00:00     1300
1004         2021-04-29 00:00:00     1300
1004         2021-05-13 00:00:00     1300
1004         2021-06-11 00:00:00     1300
1004         2021-06-26 00:00:00     1300
1004         2021-07-12 00:00:00     1300
1004         2021-07-26 00:00:00     1300
1003         2021-04-05 00:00:00     2012
1003         2021-04-21 00:00:00     2012
1003         2021-05-05 00:00:00     2012
1003         2021-05-17 00:00:00     2012
1003         2021-06-02 00:00:00     2012
1003         2021-06-17 00:00:00     2012
The second is payment data.

customer_id  payment_date         total_payment
1003         2021-04-06 00:00:00  2012
1003         2021-04-16 00:00:00  2012
1003         2021-05-03 00:00:00  2012
1003         2021-05-18 00:00:00  2012
1003         2021-06-01 00:00:00  2012
1003         2021-06-17 00:00:00  2012
1004         2021-04-06 00:00:00  1300
1004         2021-04-22 00:00:00  200
1004         2021-04-27 00:00:00  2600
1004         2021-06-11 00:00:00  1300
I want to allocate the payments to the invoices in the correct order, i.e. payments are allocated to the earliest charge first, and when that is paid, allocation moves on to the next earliest charge. The results should look like:

customer_id  payment_date         scheduled_payment_date  total_payment  payment_allocation  scheduled_total_payment
1004         2021-04-06 00:00:00  2021-04-08 00:00:00     1300           1300                1300
1004         2021-04-22 00:00:00  2021-04-29 00:00:00     200            200                 1300
1004         2021-04-27 00:00:00  2021-04-29 00:00:00     2600           1100                1300
1004         2021-04-27 00:00:00  2021-05-13 00:00:00     2600           1300                1300
1004         2021-04-27 00:00:00  2021-06-11 00:00:00     2600           200                 1300
1004         2021-06-11 00:00:00  2021-06-11 00:00:00     1300           1100                1300
1004         2021-06-11 00:00:00  2021-06-26 00:00:00     1300           200                 1300
1003         2021-04-06 00:00:00  2021-04-05 00:00:00     2012           2012                2012
1003         2021-04-16 00:00:00  2021-04-21 00:00:00     2012           2012                2012
1003         2021-05-03 00:00:00  2021-05-05 00:00:00     2012           2012                2012
1003         2021-05-18 00:00:00  2021-05-17 00:00:00     2012           2012                2012
1003         2021-06-01 00:00:00  2021-06-02 00:00:00     2012           2012                2012
1003         2021-06-17 00:00:00  2021-06-17 00:00:00     2012           2012                2012
How can I do this in SQL?
When I was searching for the answer to this question I couldn't find a good solution anywhere, so I figured out my own, which I think can be understood and adapted for similar situations.
WITH payments_data AS (
    SELECT
        *,
        SUM(total_payment) OVER (
            PARTITION BY customer_id ORDER BY payment_ind ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS total_payment_cum,
        COALESCE(SUM(total_payment) OVER (
            PARTITION BY customer_id ORDER BY payment_ind ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        ), 0) AS prev_total_payment_cum
    FROM (
        SELECT
            customer_id,
            payment_date,
            ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY payment_date ASC) AS payment_ind,
            total_payment
        FROM
            payments
    ) AS payments_ind
), charges_data AS (
    SELECT
        customer_id,
        scheduled_payment_date,
        scheduled_total_payment,
        SUM(scheduled_total_payment) OVER (
            PARTITION BY customer_id ORDER BY scheduled_payment_date ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS scheduled_total_payment_cum,
        COALESCE(SUM(scheduled_total_payment) OVER (
            PARTITION BY customer_id ORDER BY scheduled_payment_date ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        ), 0) AS prev_scheduled_total_payment_cum
    FROM
        charges
)
SELECT
    *,
    CASE
        WHEN current_balance >= 0 THEN IIF(
            updated_charges >= total_payment,
            total_payment,
            updated_charges
        )
        WHEN current_balance < 0 THEN IIF(
            scheduled_total_payment >= updated_payments,
            updated_payments,
            scheduled_total_payment
        )
        ELSE 0
    END AS payment_allocation
FROM (
    SELECT
        pd.customer_id,
        pd.payment_ind,
        payment_date,
        scheduled_payment_date,
        total_payment,
        scheduled_total_payment,
        total_payment_cum,
        scheduled_total_payment_cum,
        prev_total_payment_cum,
        prev_scheduled_total_payment_cum,
        prev_total_payment_cum - prev_scheduled_total_payment_cum AS current_balance,
        IIF(
            prev_total_payment_cum - prev_scheduled_total_payment_cum >= 0,
            scheduled_total_payment - (prev_total_payment_cum - prev_scheduled_total_payment_cum),
            NULL
        ) AS updated_charges,
        IIF(
            prev_total_payment_cum - prev_scheduled_total_payment_cum < 0,
            total_payment + (prev_total_payment_cum - prev_scheduled_total_payment_cum),
            NULL
        ) AS updated_payments
    FROM
        payments_data AS pd
        JOIN charges_data AS cd
            ON pd.customer_id = cd.customer_id
    WHERE
        prev_total_payment_cum < scheduled_total_payment_cum
        AND total_payment_cum > prev_scheduled_total_payment_cum
) data
There is a lot going on here so I wrote up an article explaining it in detail. You can find it on Medium here.
The basic idea is to track the cumulative amount of payment and charge through each record (the payments_data and charges_data CTEs) and then use this information to identify whether the charge and payment match each other (the WHERE statement that generates the "data" subquery). If they match then identify how much of the payment should be allocated to the charge (all the calculations related to the "current_balance").
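The basic idea can also be sketched in plain Python (a hedged illustration, not the original T-SQL; the `allocate` helper is hypothetical): each payment and each charge occupies an interval on the cumulative-amount axis, and a payment's allocation to a charge is simply the overlap of the two intervals. The data below is customer 1004 from the question.

```python
payments = [("2021-04-06", 1300), ("2021-04-22", 200),
            ("2021-04-27", 2600), ("2021-06-11", 1300)]
charges = [("2021-04-08", 1300), ("2021-04-29", 1300),
           ("2021-05-13", 1300), ("2021-06-11", 1300),
           ("2021-06-26", 1300), ("2021-07-12", 1300),
           ("2021-07-26", 1300)]

def allocate(payments, charges):
    """Return (payment_date, charge_date, allocation) tuples."""
    out = []
    p_cum = 0
    for p_date, p_amt in payments:
        p_lo, p_hi = p_cum, p_cum + p_amt   # this payment's cumulative interval
        c_cum = 0
        for c_date, c_amt in charges:
            c_lo, c_hi = c_cum, c_cum + c_amt
            # Overlap of the two cumulative intervals = allocation.
            alloc = min(p_hi, c_hi) - max(p_lo, c_lo)
            if alloc > 0:
                out.append((p_date, c_date, alloc))
            c_cum = c_hi
        p_cum = p_hi
    return out

for row in allocate(payments, charges):
    print(row)
```

This reproduces the 1004 rows of the expected table, including the split of the 2600 payment across three charges. The WHERE clause in the SQL above is exactly the "intervals overlap" test, and the CASE expression computes the same `min/max` overlap.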

Select data between 2 datetime fields based on current date/time

I have a table that has the following values (reduced for brevity):

Period  Periodfrom           Periodto             Glperiodoracle  Glperiodcalendar
88      2022-01-01 00:00:00  2022-01-28 00:00:00  JAN-FY2022      JAN-2022
89      2022-01-29 00:00:00  2022-02-25 00:00:00  FEB-FY2022      FEB-2022
90      2022-02-26 00:00:00  2022-04-01 00:00:00  MAR-FY2022      MAR-2022
91      2022-04-02 00:00:00  2022-04-29 00:00:00  APR-FY2022      APR-2022
92      2022-04-30 00:00:00  2022-05-27 00:00:00  MAY-FY2022      MAY-2022
93      2022-05-28 00:00:00  2022-07-01 00:00:00  JUN-FY2022      JUN-2022
94      2022-07-02 00:00:00  2022-07-29 00:00:00  JUL-FY2022      JUL-2022
95      2022-07-30 00:00:00  2022-08-26 00:00:00  AUG-FY2022      AUG-2022
96      2022-08-27 00:00:00  2022-09-30 00:00:00  SEP-FY2022      SEP-2022
97      2022-10-01 00:00:00  2022-10-28 00:00:00  OCT-FY2023      OCT-2022
I want to make a stored procedure that when executed (without receiving parameters) will return the single row corresponding to the date between PeriodFrom and PeriodTo based on execution date.
I have something like this:
Select top 1 Period,
Periodfrom,
Periodto,
Glperiodoracle,
Glperiodcalendar
From Calendar_Period
Where Periodfrom <= getdate()
And Periodto >= getdate()
I understand that using BETWEEN could lead to errors, but would this work in the edge cases, taking seconds into account?
It looks like (i) your end date is inclusive, and (ii) the time portion is always 00:00. So the correct and most performant condition would be:
where cast(getdate() as date) between Periodfrom and Periodto
It will, for example, return the first row when the current time is 2022-01-28 23:59:59.999.
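As a sanity check of that edge case, here is a small SQLite sketch in Python (table and column names follow the question, but `period_for` is a hypothetical helper standing in for the stored procedure, and two columns are omitted for brevity): casting the current timestamp down to a date keeps the inclusive end date working even at 23:59:59.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Calendar_Period
    (Period INTEGER, Periodfrom TEXT, Periodto TEXT, Glperiodcalendar TEXT)""")
conn.executemany("INSERT INTO Calendar_Period VALUES (?,?,?,?)", [
    (88, "2022-01-01", "2022-01-28", "JAN-2022"),
    (89, "2022-01-29", "2022-02-25", "FEB-2022"),
])

def period_for(now):
    # `now` stands in for getdate(); date() plays the role of cast(... as date).
    return conn.execute(
        """SELECT Period, Glperiodcalendar FROM Calendar_Period
           WHERE date(?) BETWEEN Periodfrom AND Periodto""",
        (now,),
    ).fetchone()

print(period_for("2022-01-28 23:59:59"))  # late on the end date still hits period 88
```

Dropping the cast and comparing the full timestamp would push 2022-01-28 23:59:59 past the stored Periodto of 2022-01-28 00:00:00, which is exactly the bug the answer warns about.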

Create table with 15 minutes interval on date time in Snowflake

I am trying to create a table in Snowflake with 15-minute intervals. I have tried with a generator, but that doesn't give the 15-minute interval. Is there a function I can use to generate and build this table for a couple of years' worth of data?
Such as:

Date        Hour
2022-03-29  02:00 AM
2022-03-29  02:15 AM
2022-03-29  02:30 AM
2022-03-29  02:45 AM
2022-03-29  03:00 AM
2022-03-29  03:15 AM
...         ...
Thanks
Use the following as a time generator with a 15-minute interval, and then use other date/time functions as needed to extract the date part or time part into separate columns.

with CTE as (
    select timestampadd(min, seq4()*15, date_trunc(hour, current_timestamp())) as time_count
    from table(generator(rowcount => 4*24))
)
select time_count from CTE;
+-------------------------------+
| TIME_COUNT |
|-------------------------------|
| 2022-03-29 14:00:00.000 -0700 |
| 2022-03-29 14:15:00.000 -0700 |
| 2022-03-29 14:30:00.000 -0700 |
| 2022-03-29 14:45:00.000 -0700 |
| 2022-03-29 15:00:00.000 -0700 |
| 2022-03-29 15:15:00.000 -0700 |
.
.
.
....truncated output
| 2022-03-30 13:15:00.000 -0700 |
| 2022-03-30 13:30:00.000 -0700 |
| 2022-03-30 13:45:00.000 -0700 |
+-------------------------------+
There are already many answers to this question on this site (four of them from this month alone).
But the major point to note is that you must not use SEQx() as the number generator (you can use it in the ORDER BY, but that is not needed). As noted in the docs:
Important
This function uses sequences to produce a unique set of increasing integers, but does not necessarily produce a gap-free sequence. When operating on a large quantity of data, gaps can appear in a sequence. If a fully ordered, gap-free sequence is required, consider using the ROW_NUMBER window function.
CREATE TABLE table_of_2_years_date_times AS
SELECT
date_time::date as date,
date_time::time as time
FROM (
SELECT
row_number() over (order by null)-1 as rn
,dateadd('minute', 15 * rn, '2022-03-01'::date) as date_time
from table(generator(rowcount=>4*24*365*2))
)
ORDER BY rn;
then selecting the top/bottom:
(SELECT * FROM table_of_2_years_date_times ORDER BY date,time LIMIT 5)
UNION ALL
(SELECT * FROM table_of_2_years_date_times ORDER BY date desc,time desc LIMIT 5)
ORDER BY 1,2;
DATE        TIME
2022-03-01  00:00:00
2022-03-01  00:15:00
2022-03-01  00:30:00
2022-03-01  00:45:00
2022-03-01  01:00:00
2024-02-28  22:45:00
2024-02-28  23:00:00
2024-02-28  23:15:00
2024-02-28  23:30:00
2024-02-28  23:45:00
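The gap-free approach can also be sanity-checked outside Snowflake. This Python sketch (an illustration, not Snowflake code) mirrors the ROW_NUMBER idea: each timestamp is computed from an explicit row index instead of trusting a sequence generator to be gap-free.

```python
from datetime import datetime, timedelta

# Index rows 0..n-1 and add 15*n minutes to the start date,
# just as the SQL above does with row_number()-1.
start = datetime(2022, 3, 1)
rows = [start + timedelta(minutes=15 * n) for n in range(4 * 24 * 365 * 2)]

print(rows[0], rows[1], rows[-1])
```

Two years at 4 slots/hour is 70,080 rows, ending at 2024-02-28 23:45, matching the top/bottom query output above.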

Convert data from wide to long with sequential dates in PostgreSQL

I have a table with dates like below:

id  start_date  end_date    product  supply_per_day
1   2020-03-01  2020-03-01  A        10
1   2020-03-01  2020-03-01  B        10
1   2020-03-01  2020-03-02  A        5
2   2020-02-28  2020-03-02  A        10
2   2020-03-01  2020-03-03  B        4
2   2020-03-02  2020-03-05  A        5
I want to make this data long, with one row per day, like:

id  date        product  supply_per_day
1   2020-03-01  A        10
1   2020-03-01  B        10
1   2020-03-01  A        5
1   2020-03-02  A        5
2   2020-02-28  A        10
2   2020-02-29  A        10
2   2020-03-01  A        10
2   2020-03-02  A        10
2   2020-03-01  B        4
2   2020-03-02  B        4
2   2020-03-03  B        4
2   2020-03-02  A        5
2   2020-03-03  A        5
2   2020-03-04  A        5
2   2020-03-05  A        5

Can you give me some ideas, please?
For Oracle 12c and later, you can use:
SELECT t.id,
d.dt,
t.product,
t.supply_per_day
FROM table_name t
OUTER APPLY(
SELECT start_date + LEVEL - 1 AS dt
FROM DUAL
CONNECT BY start_date + LEVEL - 1 <= end_date
) d
Which, for the sample data:
CREATE TABLE table_name ( id, start_date, end_date, product, supply_per_day ) AS
SELECT 1, DATE '2020-03-01', DATE '2020-03-01', 'A', 10 FROM DUAL UNION ALL
SELECT 1, DATE '2020-03-01', DATE '2020-03-01', 'B', 10 FROM DUAL UNION ALL
SELECT 1, DATE '2020-03-01', DATE '2020-03-02', 'A', 5 FROM DUAL UNION ALL
SELECT 2, DATE '2020-02-28', DATE '2020-03-02', 'A', 10 FROM DUAL UNION ALL
SELECT 2, DATE '2020-03-01', DATE '2020-03-03', 'B', 4 FROM DUAL UNION ALL
SELECT 2, DATE '2020-03-02', DATE '2020-03-05', 'A', 5 FROM DUAL;
Outputs:

ID  DT                   PRODUCT  SUPPLY_PER_DAY
1   2020-03-01 00:00:00  A        10
1   2020-03-01 00:00:00  B        10
1   2020-03-01 00:00:00  A        5
1   2020-03-02 00:00:00  A        5
2   2020-02-28 00:00:00  A        10
2   2020-02-29 00:00:00  A        10
2   2020-03-01 00:00:00  A        10
2   2020-03-02 00:00:00  A        10
2   2020-03-01 00:00:00  B        4
2   2020-03-02 00:00:00  B        4
2   2020-03-03 00:00:00  B        4
2   2020-03-02 00:00:00  A        5
2   2020-03-03 00:00:00  A        5
2   2020-03-04 00:00:00  A        5
2   2020-03-05 00:00:00  A        5
db<>fiddle here
In Postgres you can use generate_series() for this:
select t.id, g.day::date as date, t.product, t.supply_per_day
from the_table t
cross join generate_series(t.start_date, t.end_date, interval '1 day') as g(day)
order by t.id, g.day
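For readers without Postgres at hand, the generate_series() idea can be sketched in Python with SQLite, which lacks that function; a recursive CTE (an assumption for this demo, shown on a subset of the question's rows) expands each (start_date, end_date) pair into one row per day:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE the_table
    (id INTEGER, start_date TEXT, end_date TEXT, product TEXT, supply_per_day INTEGER)""")
conn.executemany("INSERT INTO the_table VALUES (?,?,?,?,?)", [
    (1, "2020-03-01", "2020-03-01", "A", 10),
    (1, "2020-03-01", "2020-03-02", "A", 5),
    (2, "2020-02-28", "2020-03-02", "A", 10),
])
rows = conn.execute("""
    WITH RECURSIVE expanded(id, day, end_date, product, supply_per_day) AS (
        -- seed: every source row starts at its start_date
        SELECT id, start_date, end_date, product, supply_per_day FROM the_table
        UNION ALL
        -- step: advance one day until end_date is reached (inclusive)
        SELECT id, date(day, '+1 day'), end_date, product, supply_per_day
        FROM expanded WHERE day < end_date
    )
    SELECT id, day, product, supply_per_day
    FROM expanded ORDER BY id, product, day
""").fetchall()
for r in rows:
    print(r)
```

Note that the leap day 2020-02-29 appears in the expansion of the 2020-02-28 to 2020-03-02 range, just as in the Oracle output above.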

SQL Select up to a certain sum

I have been trying to figure out a way to write a SQL script to select a given sum, and would appreciate any ideas given to do so.
I am trying to do a stock valuation based on the dates of goods received. At month-end closing, the value of my stocks remaining in the warehouse would be a specified sum of the last received goods.
The below query is done by a couple of unions but reduces to:
SELECT DATE, W1 FROM Table
ORDER BY DATE DESC
Query result:
Row DATE W1
1 2019-02-28 00:00:00 13250
2 2019-02-28 00:00:00 42610
3 2019-02-28 00:00:00 41170
4 2019-02-28 00:00:00 13180
5 2019-02-28 00:00:00 20860
6 2019-02-28 00:00:00 19870
7 2019-02-28 00:00:00 37780
8 2019-02-28 00:00:00 47210
9 2019-02-28 00:00:00 32000
10 2019-02-28 00:00:00 41930
I have thought about solving this issue by calculating a cumulative sum as follows:
Row DATE W1 Cumulative Sum
1 2019-02-28 00:00:00 13250 13250
2 2019-02-28 00:00:00 42610 55860
3 2019-02-28 00:00:00 41170 97030
4 2019-02-28 00:00:00 13180 110210
5 2019-02-28 00:00:00 20860 131070
6 2019-02-28 00:00:00 19870 150940
7 2019-02-28 00:00:00 37780 188720
8 2019-02-28 00:00:00 47210 235930
9 2019-02-28 00:00:00 32000 267930
10 2019-02-28 00:00:00 41930 309860
However, I am stuck when figuring out a way to use a parameter to return only the rows of interest.
For example, if the parameter were specified as 120000, it would return rows until the cumulative sum reaches exactly 120000, trimming the last row as needed:
Row DATE W1 Cumulative Sum W1_Select
1 2019-02-28 00:00:00 13250 13250 13250
2 2019-02-28 00:00:00 42610 55860 42610
3 2019-02-28 00:00:00 41170 97030 41170
4 2019-02-28 00:00:00 13180 110210 13180
5 2019-02-28 00:00:00 20860 131070 9790
----------
Total 120000
This just requires some arithmetic:

select t.*,
       (case when running_sum < #threshold then w1
             else #threshold - (running_sum - w1)
        end) as w1_select
from (select date, w1, sum(w1) over (order by date) as running_sum
      from t
     ) t
where running_sum - w1 < #threshold;

Actually, in your case the dates are all the same. That is a bit counter-intuitive, but you need to order by the row number for this to work:

select t.*,
       (case when running_sum < #threshold then w1
             else #threshold - (running_sum - w1)
        end) as w1_select
from (select date, w1, sum(w1) over (order by row) as running_sum
      from t
     ) t
where running_sum - w1 < #threshold;
Here is a db<>fiddle.
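The same arithmetic can be sketched in plain Python (a hedged illustration; the `select_up_to` helper is hypothetical, and the values are the W1 column from the question): keep whole rows while the running sum stays under the threshold, and trim the row that crosses it.

```python
w1 = [13250, 42610, 41170, 13180, 20860, 19870, 37780, 47210, 32000, 41930]

def select_up_to(values, threshold):
    out, running = [], 0
    for v in values:
        if running >= threshold:
            break  # threshold already reached; drop remaining rows
        # Take the whole row, or only what is left of the threshold.
        out.append(min(v, threshold - running))
        running += v
    return out

picked = select_up_to(w1, 120000)
print(picked, sum(picked))  # last row trimmed to 9790; total is exactly 120000
```

`min(v, threshold - running)` is the Python equivalent of the SQL CASE: the full `w1` while under the threshold, and `threshold - (running_sum - w1)` for the crossing row.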