SQL Select up to a certain sum

I have been trying to figure out a way to write a SQL script to select rows up to a given sum, and would appreciate any ideas on how to do so.
I am trying to do a stock valuation based on the dates of goods received. At month-end closing, the value of my stocks remaining in the warehouse would be a specified sum of the last received goods.
The below query is done by a couple of unions but reduces to:
SELECT DATE, W1 FROM Table
ORDER BY DATE DESC
Query result:
Row DATE W1
1 2019-02-28 00:00:00 13250
2 2019-02-28 00:00:00 42610
3 2019-02-28 00:00:00 41170
4 2019-02-28 00:00:00 13180
5 2019-02-28 00:00:00 20860
6 2019-02-28 00:00:00 19870
7 2019-02-28 00:00:00 37780
8 2019-02-28 00:00:00 47210
9 2019-02-28 00:00:00 32000
10 2019-02-28 00:00:00 41930
I have thought about solving this issue by calculating a cumulative sum as follows:
Row DATE W1 Cumulative Sum
1 2019-02-28 00:00:00 13250 13250
2 2019-02-28 00:00:00 42610 55860
3 2019-02-28 00:00:00 41170 97030
4 2019-02-28 00:00:00 13180 110210
5 2019-02-28 00:00:00 20860 131070
6 2019-02-28 00:00:00 19870 150940
7 2019-02-28 00:00:00 37780 188720
8 2019-02-28 00:00:00 47210 235930
9 2019-02-28 00:00:00 32000 267930
10 2019-02-28 00:00:00 41930 309860
However, I am stuck on finding a way to use a parameter to return only the rows of interest.
For example, if a parameter was specified as '120000', it would return rows until the cumulative sum reaches 120000, with the last row's value truncated so that the selected values total exactly 120000.
Row DATE W1 Cumulative Sum W1_Select
1 2019-02-28 00:00:00 13250 13250 13250
2 2019-02-28 00:00:00 42610 55860 42610
3 2019-02-28 00:00:00 41170 97030 41170
4 2019-02-28 00:00:00 13180 110210 13180
5 2019-02-28 00:00:00 20860 131070 9790
----------
Total 120000

This just requires some arithmetic:
select t.*,
       (case when running_sum < #threshold then w1
             else #threshold - (running_sum - w1)
        end) as w1_select
from (select date, w1,
             sum(w1) over (order by date) as running_sum
      from t
     ) t
where running_sum - w1 < #threshold;
Actually, in your case the dates are all the same. That is a bit counter-intuitive, but it means you need to order by row rather than date for the running sum to be well-defined:
select t.*,
       (case when running_sum < #threshold then w1
             else #threshold - (running_sum - w1)
        end) as w1_select
from (select date, w1,
             sum(w1) over (order by row) as running_sum
      from t
     ) t
where running_sum - w1 < #threshold;
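As a runnable sanity check, the same cutoff logic can be reproduced in SQLite (the table name `t`, column `w1`, and the 120000 threshold follow the question; `rn` stands in for the row number, and the rest is a sketch):

```python
import sqlite3

# Sketch of the running-sum cutoff from the answer, using SQLite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (rn INTEGER, date TEXT, w1 INTEGER)")
vals = [13250, 42610, 41170, 13180, 20860, 19870, 37780, 47210, 32000, 41930]
db.executemany("INSERT INTO t VALUES (?, ?, ?)",
               [(i + 1, "2019-02-28", w) for i, w in enumerate(vals)])

rows = db.execute("""
SELECT rn, w1, running_sum,
       CASE WHEN running_sum < :th THEN w1
            ELSE :th - (running_sum - w1)   -- truncate the final row
       END AS w1_select
FROM (SELECT rn, w1,
             SUM(w1) OVER (ORDER BY rn) AS running_sum
      FROM t) x
WHERE running_sum - w1 < :th
""", {"th": 120000}).fetchall()
```

The selected `w1_select` values are 13250, 42610, 41170, 13180 and 9790, which total exactly 120000, matching the desired output above.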

Related

How to allocate a list of payments to a list of invoices/charges in SQL?

Let's say I have the following two tables. The first is invoice data.
customer_id  scheduled_payment_date  scheduled_total_payment
-----------  ----------------------  -----------------------
1004         2021-04-08 00:00:00     1300
1004         2021-04-29 00:00:00     1300
1004         2021-05-13 00:00:00     1300
1004         2021-06-11 00:00:00     1300
1004         2021-06-26 00:00:00     1300
1004         2021-07-12 00:00:00     1300
1004         2021-07-26 00:00:00     1300
1003         2021-04-05 00:00:00     2012
1003         2021-04-21 00:00:00     2012
1003         2021-05-05 00:00:00     2012
1003         2021-05-17 00:00:00     2012
1003         2021-06-02 00:00:00     2012
1003         2021-06-17 00:00:00     2012
The second is payment data.
customer_id  payment_date         total_payment
-----------  -------------------  -------------
1003         2021-04-06 00:00:00  2012
1003         2021-04-16 00:00:00  2012
1003         2021-05-03 00:00:00  2012
1003         2021-05-18 00:00:00  2012
1003         2021-06-01 00:00:00  2012
1003         2021-06-17 00:00:00  2012
1004         2021-04-06 00:00:00  1300
1004         2021-04-22 00:00:00  200
1004         2021-04-27 00:00:00  2600
1004         2021-06-11 00:00:00  1300
I want to allocate the payments to the invoices in the correct order, i.e. payments are allocated to the earliest charge first, and once that is fully paid, allocation moves to the next earliest charge. The results should look like:
customer_id  payment_date         scheduled_payment_date  total_payment  payment_allocation  scheduled_total_payment
-----------  -------------------  ----------------------  -------------  ------------------  -----------------------
1004         2021-04-06 00:00:00  2021-04-08 00:00:00     1300           1300                1300
1004         2021-04-22 00:00:00  2021-04-29 00:00:00     200            200                 1300
1004         2021-04-27 00:00:00  2021-04-29 00:00:00     2600           1100                1300
1004         2021-04-27 00:00:00  2021-05-13 00:00:00     2600           1300                1300
1004         2021-04-27 00:00:00  2021-06-11 00:00:00     2600           200                 1300
1004         2021-06-11 00:00:00  2021-06-11 00:00:00     1300           1100                1300
1004         2021-06-11 00:00:00  2021-06-26 00:00:00     1300           200                 1300
1003         2021-04-06 00:00:00  2021-04-05 00:00:00     2012           2012                2012
1003         2021-04-16 00:00:00  2021-04-21 00:00:00     2012           2012                2012
1003         2021-05-03 00:00:00  2021-05-05 00:00:00     2012           2012                2012
1003         2021-05-18 00:00:00  2021-05-17 00:00:00     2012           2012                2012
1003         2021-06-01 00:00:00  2021-06-02 00:00:00     2012           2012                2012
1003         2021-06-17 00:00:00  2021-06-17 00:00:00     2012           2012                2012
How can I do this in SQL?
When I was searching for the answer to this question I couldn't find a good solution anywhere, so I worked out my own, which I think can be understood and adapted to similar situations.
WITH payments_data AS (
    SELECT
        *,
        SUM(total_payment) OVER (
            PARTITION BY customer_id ORDER BY payment_ind ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS total_payment_cum,
        COALESCE(SUM(total_payment) OVER (
            PARTITION BY customer_id ORDER BY payment_ind ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        ), 0) AS prev_total_payment_cum
    FROM (
        SELECT
            customer_id,
            payment_date,
            ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY payment_date ASC) AS payment_ind,
            total_payment
        FROM payments
    ) AS payments_ind
), charges_data AS (
    SELECT
        customer_id,
        scheduled_payment_date,
        scheduled_total_payment,
        SUM(scheduled_total_payment) OVER (
            PARTITION BY customer_id ORDER BY scheduled_payment_date ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS scheduled_total_payment_cum,
        COALESCE(SUM(scheduled_total_payment) OVER (
            PARTITION BY customer_id ORDER BY scheduled_payment_date ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        ), 0) AS prev_scheduled_total_payment_cum
    FROM charges
)
SELECT
    *,
    CASE
        WHEN current_balance >= 0 THEN IIF(
            updated_charges >= total_payment,
            total_payment,
            updated_charges
        )
        WHEN current_balance < 0 THEN IIF(
            scheduled_total_payment >= updated_payments,
            updated_payments,
            scheduled_total_payment
        )
        ELSE 0
    END AS payment_allocation
FROM (
    SELECT
        pd.customer_id,
        pd.payment_ind,
        payment_date,
        scheduled_payment_date,
        total_payment,
        scheduled_total_payment,
        total_payment_cum,
        scheduled_total_payment_cum,
        prev_total_payment_cum,
        prev_scheduled_total_payment_cum,
        prev_total_payment_cum - prev_scheduled_total_payment_cum AS current_balance,
        IIF(
            prev_total_payment_cum - prev_scheduled_total_payment_cum >= 0,
            scheduled_total_payment - (prev_total_payment_cum - prev_scheduled_total_payment_cum),
            NULL
        ) AS updated_charges,
        IIF(
            prev_total_payment_cum - prev_scheduled_total_payment_cum < 0,
            total_payment + (prev_total_payment_cum - prev_scheduled_total_payment_cum),
            NULL
        ) AS updated_payments
    FROM
        payments_data AS pd
        JOIN charges_data AS cd
            ON pd.customer_id = cd.customer_id
    WHERE
        prev_total_payment_cum < scheduled_total_payment_cum
        AND total_payment_cum > prev_scheduled_total_payment_cum
) data
There is a lot going on here, so I wrote up an article on Medium explaining it in detail.
The basic idea is to track the cumulative amount of payments and charges through each record (the payments_data and charges_data CTEs), then use this information to identify whether a charge and a payment overlap (the WHERE clause that generates the "data" subquery), and finally work out how much of the payment should be allocated to the charge (the calculations based on "current_balance").
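For intuition, the same cumulative-matching idea can be sketched in plain Python: walk the payments in date order and pour each one into the earliest charge that still has a balance. The function name and data layout here are illustrative, not part of the original answer.

```python
def allocate_payments(charges, payments):
    """charges, payments: lists of (date, amount), each sorted by date.
    Returns (payment_date, scheduled_payment_date, allocation) triples."""
    out = []
    ci = 0                                   # index of the earliest open charge
    remaining = charges[0][1] if charges else 0
    for pay_date, amount in payments:
        while amount > 0 and ci < len(charges):
            alloc = min(amount, remaining)   # pour into the current charge
            out.append((pay_date, charges[ci][0], alloc))
            amount -= alloc
            remaining -= alloc
            if remaining == 0:               # charge fully paid: move on
                ci += 1
                if ci < len(charges):
                    remaining = charges[ci][1]
    return out

# Customer 1004 from the question:
charges_1004 = [("2021-04-08", 1300), ("2021-04-29", 1300), ("2021-05-13", 1300),
                ("2021-06-11", 1300), ("2021-06-26", 1300), ("2021-07-12", 1300),
                ("2021-07-26", 1300)]
payments_1004 = [("2021-04-06", 1300), ("2021-04-22", 200),
                 ("2021-04-27", 2600), ("2021-06-11", 1300)]
allocations = allocate_payments(charges_1004, payments_1004)
```

Running this on customer 1004's data reproduces the seven allocation rows shown in the expected-results table.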

SQL time-series resampling

I have a ClickHouse table with rows like this:
id                   created_at
-------------------  -------------------
6962098097124188161  2022-07-01 00:00:00
6968111372399976448  2022-07-02 00:00:00
6968111483775524864  2022-07-03 00:00:00
6968465518567268352  2022-07-04 00:00:00
6968952917160271872  2022-07-07 00:00:00
6968952924479332352  2022-07-09 00:00:00
I need to resample the time series and get a cumulative count by date, like this:
created_at           count
-------------------  -----
2022-07-01 00:00:00  1
2022-07-02 00:00:00  2
2022-07-03 00:00:00  3
2022-07-04 00:00:00  4
2022-07-05 00:00:00  4
2022-07-06 00:00:00  4
2022-07-07 00:00:00  5
2022-07-08 00:00:00  5
2022-07-09 00:00:00  6
I've tried this
SELECT
arrayJoin(
timeSlots(
MIN(created_at),
toUInt32(24 * 3600 * 10),
24 * 3600
)
) as ts,
SUM(
COUNT(*)
) OVER (
ORDER BY
ts
)
FROM
table
but it counts all rows.
How can I get the expected result?
Why not use group by on the date? Like:
select toDate(created_at), count(*) from table_name group by toDate(created_at)
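Note the expected output above is a cumulative count over a gap-filled calendar, not a plain per-day count. In ClickHouse the missing dates can be generated with `ORDER BY ... WITH FILL` (or `timeSlots`, as in the attempt); the shape of the query is easier to see in a small self-contained SQLite sketch, where a recursive calendar CTE stands in for ClickHouse's date filling (table and column names follow the question; the rest is an assumption):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER, created_at TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2022-07-01"), (2, "2022-07-02"), (3, "2022-07-03"),
    (4, "2022-07-04"), (5, "2022-07-07"), (6, "2022-07-09"),
])

rows = db.execute("""
WITH RECURSIVE calendar(d) AS (       -- every date in the observed range
    SELECT MIN(date(created_at)) FROM events
    UNION ALL
    SELECT date(d, '+1 day') FROM calendar
    WHERE d < (SELECT MAX(date(created_at)) FROM events)
),
daily AS (                            -- raw count per day that has events
    SELECT date(created_at) AS d, COUNT(*) AS n
    FROM events
    GROUP BY d
)
SELECT calendar.d,
       SUM(COALESCE(daily.n, 0)) OVER (ORDER BY calendar.d) AS cum_count
FROM calendar
LEFT JOIN daily ON daily.d = calendar.d
ORDER BY calendar.d
""").fetchall()
```

The LEFT JOIN keeps the empty days (2022-07-05, 2022-07-06, 2022-07-08) in the output with the running total carried forward.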

how to clean a sql table based on startdate, enddate and effective date

I have a really dirty table that mixes the start date with the effective date of a value change.
The table looks like this:
id  value  startdate   enddate     effective date
--  -----  ----------  ----------  --------------
1   0.3    2020-10-07  2021-02-28  2020-07-01
1   1      2020-10-07  2021-02-28  2020-10-07
2   0.46   2021-01-01              2021-01-01
2   1      2021-01-01  2020-10-07  2021-05-01
3   1      2021-08-01              2021-08-01
4   1      2019-03-01              2019-03-01
4   0.5    2019-03-01              2020-08-01
4   0.7    2019-03-01              2021-05-01
When the enddate is empty, it means that no change is planned; when the start date is later than the effective date, it means an older record was deleted and a new one created with other values.
My goal is to clean the table and get it sorted as something like this:
id  value  startdate_valid  enddate_valid
--  -----  ---------------  -------------
1   0.3    2020-07-01       2020-10-07
1   1      2020-10-07       2021-02-28
2   0.46   2021-01-01       2021-05-01
2   1      2021-05-01
3   1      2021-08-01
4   1      2019-03-01       2020-08-01
4   0.5    2020-08-01       2021-05-01
4   0.7    2021-05-01
Any idea of how I can achieve this?
EDIT:
I think I was able to get the startdate_valid value by using
MAX([effective date]) OVER(PARTITION BY id, YEAR([effective date]), MONTH([effective date]) ORDER BY [effective date])
This makes sense as I have the startdate included in the effective date, but I am still stuck on getting the enddate_valid.
I have found a solution to my problem. I needed to do it in two steps, so if someone has a better solution, please share it and I will mark it as correct.
SELECT
    *,
    COALESCE(
        LEAD(sub.StartDate_value) OVER (PARTITION BY sub.id ORDER BY sub.StartDate_value),
        sub.[startdate]
    ) AS [EndDate_value]
FROM (
    SELECT
        id, value,
        COALESCE(
            MAX([effective date]) OVER (PARTITION BY id, YEAR([effective date]), MONTH([effective date]) ORDER BY [effective date]),
            startdate
        ) AS StartDate_value
    FROM table
) sub
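A reduced sketch of the same two-step idea can be checked in SQLite, using only id 4 from the question and dropping the per-month MAX for brevity (step 1 coalesces the effective date with startdate, step 2 takes the LEAD of that value as the end of validity; the simplified default of NULL for the last row is an assumption):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE t
              (id INTEGER, value REAL, startdate TEXT, enddate TEXT, effective TEXT)""")
db.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", [
    (4, 1,   "2019-03-01", None, "2019-03-01"),
    (4, 0.5, "2019-03-01", None, "2020-08-01"),
    (4, 0.7, "2019-03-01", None, "2021-05-01"),
])

rows = db.execute("""
SELECT id, value,
       COALESCE(effective, startdate) AS startdate_valid,   -- step 1
       LEAD(COALESCE(effective, startdate))                 -- step 2
           OVER (PARTITION BY id
                 ORDER BY COALESCE(effective, startdate)) AS enddate_valid
FROM t
ORDER BY id, startdate_valid
""").fetchall()
```

Each row's validity ends where the next row's begins, which matches the desired rows for id 4 above.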

Sum column values over a window based on variable date range (impala)

Given a table as follows:
client_id date connections
---------------------------------------
121438297 2018-01-03 0
121438297 2018-01-08 1
121438297 2018-01-10 3
121438297 2018-01-12 1
121438297 2018-01-19 7
363863811 2018-01-18 0
363863811 2018-01-30 5
363863811 2018-02-01 4
363863811 2018-02-10 0
I am looking for an efficient way to sum the number of connections that occur within x number of days following the current row (the current row being included in the sum), partitioned by client_id.
If x=6 then the output table would result in :
client_id date connections connections_within_6_days
---------------------------------------------------------------------
121438297 2018-01-03 0 1
121438297 2018-01-08 1 5
121438297 2018-01-10 3 4
121438297 2018-01-12 1 1
121438297 2018-01-19 7 7
363863811 2018-01-18 0 0
363863811 2018-01-30 5 9
363863811 2018-02-01 4 4
363863811 2018-02-10 0 0
Concerns:
I do not want to add all missing dates and then perform a sliding window counting the x following rows because my table is already extremely large.
I am using Impala and the range between interval 'x' days following and current row is not supported.
The generic solution is a bit troublesome for multiple periods, but you can use multiple CTEs to support that. The idea is to "unpivot" the counts based on when they go in and out and then use a cumulative sum.
So:
with conn as (
      select client_id, date, connections
      from t
      union all
      select client_id, date - interval 7 day, -connections
      from t
     ),
     conn1 as (
      select client_id, date,
             sum(sum(connections)) over (partition by client_id order by date desc) as connections_within_6_days
      from conn
      group by client_id, date
     )
select t.*, conn1.connections_within_6_days
from t join
     conn1
     on conn1.client_id = t.client_id and
        conn1.date = t.date;
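The unpivot-plus-cumulative-sum trick for a forward-looking window can be verified with a runnable SQLite sketch. SQLite has no interval arithmetic, so `date(date, '-7 days')` replaces it, and the two aggregation levels are split into separate CTEs (the original collapses them into `sum(sum(...))`); table and column names follow the question.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (client_id INTEGER, date TEXT, connections INTEGER)")
db.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (121438297, "2018-01-03", 0), (121438297, "2018-01-08", 1),
    (121438297, "2018-01-10", 3), (121438297, "2018-01-12", 1),
    (121438297, "2018-01-19", 7),
    (363863811, "2018-01-18", 0), (363863811, "2018-01-30", 5),
    (363863811, "2018-02-01", 4), (363863811, "2018-02-10", 0),
])

rows = db.execute("""
WITH unpivoted AS (
    SELECT client_id, date, connections FROM t
    UNION ALL
    -- each row stops counting for windows that start more than 6 days before it
    SELECT client_id, date(date, '-7 days'), -connections FROM t
),
net AS (
    SELECT client_id, date, SUM(connections) AS n
    FROM unpivoted
    GROUP BY client_id, date
),
cum AS (
    SELECT client_id, date,
           SUM(n) OVER (PARTITION BY client_id ORDER BY date DESC)
               AS connections_within_6_days
    FROM net
)
SELECT t.client_id, t.date, t.connections, cum.connections_within_6_days
FROM t
JOIN cum ON cum.client_id = t.client_id AND cum.date = t.date
ORDER BY t.client_id, t.date
""").fetchall()
```

The descending cumulative sum means each date accumulates the rows from its own date through six days ahead, reproducing the expected connections_within_6_days column.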

SQL Collapse Data

I am trying to collapse data that is in a sequence sorted by date, while grouping on the person and the type.
The data is stored in a SQL Server database and looks like the following -
seq person date type
--- ------ ------------------- ----
1 1 2018-02-10 08:00:00 1
2 1 2018-02-11 08:00:00 1
3 1 2018-02-12 08:00:00 1
4 1 2018-02-14 16:00:00 1
5 1 2018-02-15 16:00:00 1
6 1 2018-02-16 16:00:00 1
7 1 2018-02-20 08:00:00 2
8 1 2018-02-21 08:00:00 2
9 1 2018-02-22 08:00:00 2
10 1 2018-02-23 08:00:00 1
11 1 2018-02-24 08:00:00 1
12 1 2018-02-25 08:00:00 2
13 2 2018-02-10 08:00:00 1
14 2 2018-02-11 08:00:00 1
15 2 2018-02-12 08:00:00 1
16 2 2018-02-14 16:00:00 3
17 2 2018-02-15 16:00:00 3
18 2 2018-02-16 16:00:00 3
This data set contains about 1.2 million records that resemble the above.
The result that I would like to get from this would be -
person start type
------ ------------------- ----
1 2018-02-10 08:00:00 1
1 2018-02-20 08:00:00 2
1 2018-02-23 08:00:00 1
1 2018-02-25 08:00:00 2
2 2018-02-10 08:00:00 1
2 2018-02-14 16:00:00 3
I have the data in the first format by running the following query -
select
    ROW_NUMBER() OVER (ORDER BY date) AS seq,
    person,
    date,
    type
from table
group by person, date, type
I am just not sure how to keep the minimum date with the other distinct values from person and type.
This is a gaps-and-islands problem, so you can use differences of row_number() values in the grouping:
select person, min(date) as start, type
from (select *,
row_number() over (partition by person order by seq) seq1,
row_number() over (partition by person, type order by seq) seq2
from table
) t
group by person, type, (seq1 - seq2)
order by person, start;
The correct solution using the difference of row numbers is:
select person, type, min(date) as start
from (select t.*,
row_number() over (partition by person order by seq) as seqnum_p,
row_number() over (partition by person, type order by seq) as seqnum_pt
from t
) t
group by person, type, (seqnum_p - seqnum_pt)
order by person, start;
type needs to be included in the GROUP BY.
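As a self-contained check of the row-number-difference approach, here is the same query run in SQLite against the sample data from the question (only the subquery alias is added; everything else follows the answer):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (seq INTEGER, person INTEGER, date TEXT, type INTEGER)")
db.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (1, 1, "2018-02-10 08:00:00", 1), (2, 1, "2018-02-11 08:00:00", 1),
    (3, 1, "2018-02-12 08:00:00", 1), (4, 1, "2018-02-14 16:00:00", 1),
    (5, 1, "2018-02-15 16:00:00", 1), (6, 1, "2018-02-16 16:00:00", 1),
    (7, 1, "2018-02-20 08:00:00", 2), (8, 1, "2018-02-21 08:00:00", 2),
    (9, 1, "2018-02-22 08:00:00", 2), (10, 1, "2018-02-23 08:00:00", 1),
    (11, 1, "2018-02-24 08:00:00", 1), (12, 1, "2018-02-25 08:00:00", 2),
    (13, 2, "2018-02-10 08:00:00", 1), (14, 2, "2018-02-11 08:00:00", 1),
    (15, 2, "2018-02-12 08:00:00", 1), (16, 2, "2018-02-14 16:00:00", 3),
    (17, 2, "2018-02-15 16:00:00", 3), (18, 2, "2018-02-16 16:00:00", 3),
])

rows = db.execute("""
SELECT person, MIN(date) AS start, type
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY person ORDER BY seq) AS seqnum_p,
             ROW_NUMBER() OVER (PARTITION BY person, type ORDER BY seq) AS seqnum_pt
      FROM t) x
GROUP BY person, type, (seqnum_p - seqnum_pt)
ORDER BY person, start
""").fetchall()
```

The difference seqnum_p - seqnum_pt is constant within each consecutive run of one type, so grouping on it collapses each island to its earliest date.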