I have 3 tables:
INVENTORY_IN:
ID INV_TIMESTAMP PRODUCT_ID IN_QUANTITY SUPPLIER_ID
...
1 10.03.21 01:00:00 101 100 4
2 11.03.21 02:00:00 101 50 3
3 14.03.21 01:00:00 101 10 2
INVENTORY_OUT:
ID INV_TIMESTAMP PRODUCT_ID OUT_QUANTITY CUSTOMER_ID
...
1 10.03.21 02:00:00 101 30 1
2 11.03.21 01:00:00 101 40 2
3 12.03.21 01:00:00 101 80 1
INVENTORY_BALANCE:
INV_DATE PRODUCT_ID QUANTITY
...
09.03.21 101 20
10.03.21 101 90
11.03.21 101 100
12.03.21 101 20
13.03.21 101 20
14.03.21 101 30
I want to use FIFO (first in-first out) logic for the inventory, and to see which quantities correspond to each SUPPLIER-CUSTOMER combination.
The desired output looks like this (queried for dates >= 2021-03-10):
PRODUCT_ID | SUPPLIER_ID | CUSTOMER_ID | QUANTITY
---------: | ----------: | ----------: | -------:
101 | | 1 | 20
101 | 4 | 1 | 60
101 | 4 | 2 | 40
101 | 3 | 1 | 30
101 | 3 | | 20
101 | 2 | | 10
Edit: I added a diagram that explains every row. Each black arrow corresponds to one supplier-customer match. There are 7 arrows but only 6 rows in the desired result, because for supplier_id = 4 and customer_id = 1 the desired result is the sum of the quantities matched between them.
Option 1
This is probably a job for PL/SQL. Starting with the data types to output:
CREATE TYPE supply_details_obj AS OBJECT(
product_id NUMBER,
quantity NUMBER,
supplier_id NUMBER,
customer_id NUMBER
);
CREATE TYPE supply_details_tab AS TABLE OF supply_details_obj;
Then we can define a pipelined function to read the INVENTORY_IN and INVENTORY_OUT tables one row at a time and merge the two keeping a running total of the remaining inventory or amount to supply:
CREATE FUNCTION assign_suppliers_to_customers (
i_product_id IN INVENTORY_IN.PRODUCT_ID%TYPE
)
RETURN supply_details_tab PIPELINED
IS
v_supplier_id INVENTORY_IN.SUPPLIER_ID%TYPE;
v_customer_id INVENTORY_OUT.CUSTOMER_ID%TYPE;
v_quantity_in INVENTORY_IN.IN_QUANTITY%TYPE := NULL;
v_quantity_out INVENTORY_OUT.OUT_QUANTITY%TYPE := NULL;
v_cur_in SYS_REFCURSOR;
v_cur_out SYS_REFCURSOR;
BEGIN
OPEN v_cur_in FOR
SELECT in_quantity, supplier_id
FROM INVENTORY_IN
WHERE product_id = i_product_id
ORDER BY inv_timestamp;
OPEN v_cur_out FOR
SELECT out_quantity, customer_id
FROM INVENTORY_OUT
WHERE product_id = i_product_id
ORDER BY inv_timestamp;
LOOP
IF v_quantity_in IS NULL THEN
FETCH v_cur_in INTO v_quantity_in, v_supplier_id;
IF v_cur_in%NOTFOUND THEN
v_supplier_id := NULL;
END IF;
END IF;
IF v_quantity_out IS NULL THEN
FETCH v_cur_out INTO v_quantity_out, v_customer_id;
IF v_cur_out%NOTFOUND THEN
v_customer_id := NULL;
END IF;
END IF;
EXIT WHEN v_cur_in%NOTFOUND AND v_cur_out%NOTFOUND;
IF v_quantity_in > v_quantity_out THEN
PIPE ROW(
supply_details_obj(
i_product_id,
v_quantity_out,
v_supplier_id,
v_customer_id
)
);
v_quantity_in := v_quantity_in - v_quantity_out;
v_quantity_out := NULL;
ELSE
PIPE ROW(
supply_details_obj(
i_product_id,
v_quantity_in,
v_supplier_id,
v_customer_id
)
);
v_quantity_out := v_quantity_out - v_quantity_in;
v_quantity_in := NULL;
END IF;
END LOOP;
CLOSE v_cur_in;
CLOSE v_cur_out;
RETURN;
END;
/
Then, for the sample data:
CREATE TABLE INVENTORY_IN ( ID, INV_TIMESTAMP, PRODUCT_ID, IN_QUANTITY, SUPPLIER_ID ) AS
SELECT 0, TIMESTAMP '2021-03-09 00:00:00', 101, 20, 0 FROM DUAL UNION ALL
SELECT 1, TIMESTAMP '2021-03-10 01:00:00', 101, 100, 4 FROM DUAL UNION ALL
SELECT 2, TIMESTAMP '2021-03-11 02:00:00', 101, 50, 3 FROM DUAL UNION ALL
SELECT 3, TIMESTAMP '2021-03-14 01:00:00', 101, 10, 2 FROM DUAL;
CREATE TABLE INVENTORY_OUT ( ID, INV_TIMESTAMP, PRODUCT_ID, OUT_QUANTITY, CUSTOMER_ID ) AS
SELECT 1, TIMESTAMP '2021-03-10 02:00:00', 101, 30, 1 FROM DUAL UNION ALL
SELECT 2, TIMESTAMP '2021-03-11 01:00:00', 101, 40, 2 FROM DUAL UNION ALL
SELECT 3, TIMESTAMP '2021-03-12 01:00:00', 101, 80, 1 FROM DUAL;
The query:
SELECT product_id,
supplier_id,
customer_id,
SUM( quantity ) AS quantity
FROM TABLE( assign_suppliers_to_customers( 101 ) )
GROUP BY
product_id,
supplier_id,
customer_id
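-- No ORDER BY here: the rows returned by the function expose only
-- product_id, quantity, supplier_id and customer_id, so inv_timestamp
-- cannot be referenced in this query.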
Outputs:
PRODUCT_ID | SUPPLIER_ID | CUSTOMER_ID | QUANTITY
---------: | ----------: | ----------: | -------:
101 | 0 | 1 | 20
101 | 4 | 1 | 60
101 | 4 | 2 | 40
101 | 3 | 1 | 30
101 | 3 | null | 20
101 | 2 | null | 10
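The two-cursor merge that Option 1 performs can also be sketched outside the database. The following is only an illustration in Python, not the PL/SQL function itself: the name `fifo_match` and the tuple layout are made up for this sketch, both inputs are assumed to be pre-sorted by timestamp, and all quantities are assumed to be positive.

```python
def fifo_match(supplies, demands):
    """Match supplies to demands first-in, first-out.

    supplies: list of (supplier_id, quantity), oldest first (hypothetical layout)
    demands:  list of (customer_id, quantity), oldest first (hypothetical layout)
    Returns (supplier_id, customer_id, quantity) rows, summed per pair,
    with None standing in for a side that has run out.
    """
    totals = {}                      # dicts keep insertion order
    i = j = 0
    supplier = customer = None
    left_in = left_out = 0           # unmatched remainder of the current rows
    while True:
        # Pull the next supply / demand row once the current one is used up.
        if left_in == 0 and i < len(supplies):
            supplier, left_in = supplies[i]
            i += 1
        if left_out == 0 and j < len(demands):
            customer, left_out = demands[j]
            j += 1
        if left_in == 0 and left_out == 0:
            break                    # both sides exhausted
        if left_in == 0:             # demand with nothing left to supply it
            key, moved = (None, customer), left_out
        elif left_out == 0:          # supply that nobody consumed
            key, moved = (supplier, None), left_in
        else:                        # regular match: move the smaller amount
            key, moved = (supplier, customer), min(left_in, left_out)
        totals[key] = totals.get(key, 0) + moved
        left_in = max(left_in - moved, 0)
        left_out = max(left_out - moved, 0)
    return [(s, c, q) for (s, c), q in totals.items()]


supplies = [(0, 20), (4, 100), (3, 50), (2, 10)]
demands = [(1, 30), (2, 40), (1, 80)]
for row in fifo_match(supplies, demands):
    print(row)
```

With the sample data (including the opening balance modelled as supplier 0), this yields the same six supplier-customer rows as the table above.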
Option 2
A (very) complicated SQL query:
WITH in_totals ( ID, INV_TIMESTAMP, PRODUCT_ID, IN_QUANTITY, SUPPLIER_ID, TOTAL_QUANTITY ) AS (
SELECT i.*,
SUM( in_quantity ) OVER ( PARTITION BY product_id ORDER BY inv_timestamp )
FROM inventory_in i
),
out_totals ( ID, INV_TIMESTAMP, PRODUCT_ID, OUT_QUANTITY, CUSTOMER_ID, TOTAL_QUANTITY ) AS (
SELECT o.*,
SUM( out_quantity ) OVER ( PARTITION BY product_id ORDER BY inv_timestamp )
FROM inventory_out o
),
split_totals ( product_id, inv_timestamp, supplier_id, customer_id, quantity ) AS (
SELECT i.product_id,
MIN( COALESCE( LEAST( i.inv_timestamp, o.inv_timestamp ), i.inv_timestamp ) )
AS inv_timestamp,
i.supplier_id,
o.customer_id,
SUM(
COALESCE(
LEAST(
i.total_quantity - o.total_quantity + o.out_quantity,
o.total_quantity - i.total_quantity + i.in_quantity,
i.in_quantity,
o.out_quantity
),
0
)
)
FROM in_totals i
LEFT OUTER JOIN
out_totals o
ON ( i.product_id = o.product_id
AND i.total_quantity - i.in_quantity <= o.total_quantity
AND i.total_quantity >= o.total_quantity - o.out_quantity )
GROUP BY
i.product_id,
i.supplier_id,
o.customer_id
ORDER BY
inv_timestamp
),
missing_totals ( product_id, inv_timestamp, supplier_id, customer_id, quantity ) AS (
SELECT i.product_id,
i.inv_timestamp,
i.supplier_id,
NULL,
i.in_quantity - COALESCE( s.quantity, 0 )
FROM inventory_in i
INNER JOIN (
SELECT product_id,
supplier_id,
SUM( quantity ) AS quantity
FROM split_totals
GROUP BY product_id, supplier_id
) s
ON ( i.product_id = s.product_id
AND i.supplier_id = s.supplier_id )
ORDER BY i.inv_timestamp
)
SELECT product_id, supplier_id, customer_id, quantity
FROM (
SELECT product_id, inv_timestamp, supplier_id, customer_id, quantity
FROM split_totals
WHERE quantity > 0
UNION ALL
SELECT product_id, inv_timestamp, supplier_id, customer_id, quantity
FROM missing_totals
WHERE quantity > 0
ORDER BY inv_timestamp
);
Which, for the sample data above, outputs:
PRODUCT_ID | SUPPLIER_ID | CUSTOMER_ID | QUANTITY
---------: | ----------: | ----------: | -------:
101 | 0 | 1 | 20
101 | 4 | 1 | 60
101 | 4 | 2 | 40
101 | 3 | 1 | 30
101 | 3 | null | 20
101 | 2 | null | 10
If your system controls the timestamps so that you cannot consume what has not yet been supplied (I've met systems that didn't track the intraday balance), then you can use a pure SQL solution with an interval join. The only thing to take care of is the last supply that was not consumed in full: its remainder should be added as a supply with no customer.
Here's the query with comments:
CREATE TABLE INVENTORY_IN ( ID, INV_TIMESTAMP, PRODUCT_ID, IN_QUANTITY, SUPPLIER_ID ) AS
SELECT 0, TIMESTAMP '2021-03-09 00:00:00', 101, 20, 0 FROM DUAL UNION ALL
SELECT 1, TIMESTAMP '2021-03-10 01:00:00', 101, 100, 4 FROM DUAL UNION ALL
SELECT 2, TIMESTAMP '2021-03-11 02:00:00', 101, 50, 3 FROM DUAL UNION ALL
SELECT 3, TIMESTAMP '2021-03-14 01:00:00', 101, 10, 2 FROM DUAL;
CREATE TABLE INVENTORY_OUT ( ID, INV_TIMESTAMP, PRODUCT_ID, OUT_QUANTITY, CUSTOMER_ID ) AS
SELECT 1, TIMESTAMP '2021-03-10 02:00:00', 101, 30, 1 FROM DUAL UNION ALL
SELECT 2, TIMESTAMP '2021-03-11 01:00:00', 101, 40, 2 FROM DUAL UNION ALL
SELECT 3, TIMESTAMP '2021-03-12 01:00:00', 101, 80, 1 FROM DUAL;
with i as (
select
/*Get total per product, supplier at each timestamp
to calculate running sum on timestamps without need to resolve ties with over(... rows between) addition*/
inv_timestamp
, product_id
, supplier_id
, sum(in_quantity) as quan
, sum(sum(in_quantity)) over(
partition by product_id
order by
inv_timestamp asc
, supplier_id asc
) as rsum
from INVENTORY_IN
group by
product_id
, supplier_id
, inv_timestamp
)
, o as (
select /*The same for customer*/
inv_timestamp
, product_id
, customer_id
, sum(out_quantity) as quan
, sum(sum(out_quantity)) over(
partition by product_id
order by
inv_timestamp asc
, customer_id asc
) as rsum
/*Last consumption per product: when lead goes beyond the current window*/
, lead(0, 1, 1) over(
partition by product_id
order by
inv_timestamp asc
, customer_id asc
) as last_consumption
from INVENTORY_OUT
group by
product_id
, customer_id
, inv_timestamp
)
, distr as (
select
/*Distribute the quantity. This is the basic interval intersection:
new_value_to = least(t1.value_to, t2.value_to)
new_value_from = greatest(t1.value_from, t2.value_from)
So we need a capacity of the interval
*/
i.product_id
, least(i.rsum, nvl(o.rsum, i.rsum))
- greatest(i.rsum - i.quan, nvl(o.rsum - o.quan, i.rsum - i.quan)) as supplied_quan
/*At the last supply we can have something not used.
Calculate it to add later as not consumed
*/
, case
when last_consumption = 1
and i.rsum > nvl(o.rsum, i.rsum)
then i.rsum - o.rsum
end as rest_quan
, i.supplier_id
, o.customer_id
, i.inv_timestamp as i_ts
, o.inv_timestamp as o_ts
from i
left join o
on i.product_id = o.product_id
/*No equality here, because values are continuous:
>= will include the same value in two intervals if some of value_to of one table equals
another's table value_to (which is value_from for the next interval)*/
and i.rsum > o.rsum - o.quan
and o.rsum > i.rsum - i.quan
)
select
product_id
, supplier_id
, customer_id
, sum(quan) as quan
from (
select /*Get distributed quantities*/
product_id
, supplier_id
, customer_id
, supplied_quan as quan
, i_ts
, o_ts
from distr
union all
select /*Add not consumed part of last consumed supply*/
product_id
, supplier_id
, null
, rest_quan
, i_ts
, null /*No consumption*/
from distr
where rest_quan is not null
)
group by
product_id
, supplier_id
, customer_id
order by
min(i_ts) asc
/*To order not consumed last*/
, min(o_ts) asc nulls last
PRODUCT_ID | SUPPLIER_ID | CUSTOMER_ID | QUAN
---------: | ----------: | ----------: | ---:
101 | 0 | 1 | 20
101 | 4 | 1 | 60
101 | 4 | 2 | 40
101 | 3 | 1 | 30
101 | 3 | null | 20
101 | 2 | null | 10
Related
I am trying to find the customer count and sales by the type of customer (New and Returning) and the number of times they have purchased.
txn_date Customer_ID Transaction_Number Sales Reference(not in the SQL table) customer type (not in the sql table)
1/2/2019 1 12345 $10 Second Purchase SLS Repeat
4/3/2018 1 65890 $20 First Purchase SLS Repeat
3/22/2019 3 64453 $30 First Purchase SLS new
4/3/2019 4 88567 $20 First Purchase SLS new
5/21/2019 4 85446 $15 Second Purchase SLS new
1/23/2018 5 89464 $40 First Purchase SLS Repeat
4/3/2019 5 99674 $30 Second Purchase SLS Repeat
4/3/2019 6 32224 $20 Second Purchase SLS Repeat
1/23/2018 6 46466 $30 First Purchase SLS Repeat
1/20/2018 7 56558 $30 First Purchase SLS new
I am using the below code to get the aggregate sales and customer count for the total customers:
select seqnum, count(distinct customer_id), sum(sales) from (
select co.*,
row_number() over (partition by customer_id order by txn_date) as seqnum
from somya co)
group by seqnum
order by seqnum;
I want to get the same data by the customer type:
for example for the new customers my result should show:
New Customers Customer_Count Sum(Sales)
1st Purchase 3 $80
2nd Purchase 1 $15
Returning Customers Customer_Count Sum(Sales)
1st Purchase 3 $90
2nd Purchase 3 $60
I am trying the below query to get the data for new and repeat customers:
New Customers:
select seqnum, count(distinct customer_id), sum(sales)
from (
select co.*,
row_number() over (partition by customer_id order by trunc(txn_date)) as seqnum,
MIN (TRUNC (TXN_DATE)) OVER (PARTITION BY customer_id) as MIN_TXN_DATE
from somya co
)
where MIN_TXN_DATE between '01-JAN-19' and '31-DEC-19'
group by seqnum
order by seqnum asc;
Returning Customers:
select seqnum, count(distinct customer_id), sum(sales)
from (
select co.*,
row_number() over (partition by customer_id order by trunc(txn_date)) as seqnum,
MIN (TRUNC (TXN_DATE)) OVER (PARTITION BY customer_id) as MIN_TXN_DATE
from somya co
)
where MIN_TXN_DATE <'01-JAN-19'
group by seqnum
order by seqnum asc;
I am not able to figure out what is wrong with my query or if there is a problem with my logic.
This is just sample data. I have transactions from all years in my database, so I need to restrict the transaction date in the query. But as soon as I narrow the data down by transaction date, the repeat-customer query returns nothing and the new-customer query returns the total customer count for that period.
If I understand correctly, you need to know the first time someone becomes a customer. And then use this:
select (case when first_year < 2019 then 'returning' else 'new' end) as custtype,
seqnum, count(*), sum(sales)
from (select co.*,
row_number() over (partition by customer_id, extract(year from txn_date) order by txn_date) as seqnum,
min(extract(year from txn_date)) over (partition by customer_id) as first_year
from somya co
) s
where txn_date >= date '2019-01-01' and
txn_date < date '2020-01-01'
group by (case when first_year < 2019 then 'returning' else 'new' end),
seqnum
order by custtype, seqnum;
You can categorize your sales data to assign a customer type and a purchase sequence using windowing functions, like this:
SELECT sd.txn_date,
sd.customer_id,
sd.transaction_number,
sd.sales,
case when min(txn_date) over ( partition by customer_id ) < DATE '2019-01-01'
AND max(txn_date) OVER ( partition by customer_id ) >= DATE '2019-01-01'
THEN 'Repeat'
ELSE 'New' END customer_type,
row_number() over ( partition by customer_id order by txn_date) purchase_sequence
FROM sales_data sd
+-----------+-------------+--------------------+-------+---------------+-------------------+
| TXN_DATE | CUSTOMER_ID | TRANSACTION_NUMBER | SALES | CUSTOMER_TYPE | PURCHASE_SEQUENCE |
+-----------+-------------+--------------------+-------+---------------+-------------------+
| 03-APR-18 | 1 | 65890 | 20 | Repeat | 1 |
| 02-JAN-19 | 1 | 12345 | 10 | Repeat | 2 |
| 22-MAR-19 | 3 | 64453 | 30 | New | 1 |
| 03-APR-19 | 4 | 88567 | 20 | New | 1 |
| 21-MAY-19 | 4 | 85446 | 15 | New | 2 |
| 23-JAN-18 | 5 | 89464 | 40 | Repeat | 1 |
| 03-APR-19 | 5 | 99674 | 30 | Repeat | 2 |
| 23-JAN-18 | 6 | 46466 | 30 | Repeat | 1 |
| 03-APR-19 | 6 | 32224 | 20 | Repeat | 2 |
| 20-JAN-18 | 7 | 56558 | 30 | New | 1 |
+-----------+-------------+--------------------+-------+---------------+-------------------+
Then, you can wrap that in a common table expression (aka "WITH" clause) and summarize by the customer type and purchase sequence:
WITH categorized_sales_data AS (
SELECT sd.txn_date,
sd.customer_id,
sd.transaction_number,
sd.sales,
case when min(txn_date) over ( partition by customer_id ) < DATE '2019-01-01' AND max(txn_date) OVER ( partition by customer_id ) >= DATE '2019-01-01' THEN 'Repeat' ELSE 'New' END customer_type,
row_number() over ( partition by customer_id order by txn_date) purchase_sequence
FROM sales_data sd)
SELECT customer_type, purchase_sequence, count(*), sum(sales)
FROM categorized_sales_data
group by customer_type, purchase_sequence
order by customer_type, purchase_sequence
+---------------+-------------------+----------+------------+
| CUSTOMER_TYPE | PURCHASE_SEQUENCE | COUNT(*) | SUM(SALES) |
+---------------+-------------------+----------+------------+
| New | 1 | 3 | 80 |
| New | 2 | 1 | 15 |
| Repeat | 1 | 3 | 90 |
| Repeat | 2 | 3 | 60 |
+---------------+-------------------+----------+------------+
Here's a full SQL with test data:
with sales_data (txn_date, Customer_ID, Transaction_Number, Sales ) as (
SELECT TO_DATE('1/2/2019','MM/DD/YYYY'), 1, 12345, 10 FROM DUAL UNION ALL
SELECT TO_DATE('4/3/2018','MM/DD/YYYY'), 1, 65890, 20 FROM DUAL UNION ALL
SELECT TO_DATE('3/22/2019','MM/DD/YYYY'), 3, 64453, 30 FROM DUAL UNION ALL
SELECT TO_DATE('4/3/2019','MM/DD/YYYY'), 4, 88567, 20 FROM DUAL UNION ALL
SELECT TO_DATE('5/21/2019','MM/DD/YYYY'), 4, 85446, 15 FROM DUAL UNION ALL
SELECT TO_DATE('1/23/2018','MM/DD/YYYY'), 5, 89464, 40 FROM DUAL UNION ALL
SELECT TO_DATE('4/3/2019','MM/DD/YYYY'), 5, 99674, 30 FROM DUAL UNION ALL
SELECT TO_DATE('4/3/2019','MM/DD/YYYY'), 6, 32224, 20 FROM DUAL UNION ALL
SELECT TO_DATE('1/23/2018','MM/DD/YYYY'), 6, 46466, 30 FROM DUAL UNION ALL
SELECT TO_DATE('1/20/2018','MM/DD/YYYY'), 7, 56558, 30 FROM DUAL ),
-- Query starts here
/* WITH */ categorized_sales_data AS (
SELECT sd.txn_date,
sd.customer_id,
sd.transaction_number,
sd.sales,
case when min(txn_date) over ( partition by customer_id ) < DATE '2019-01-01' AND max(txn_date) OVER ( partition by customer_id ) >= DATE '2019-01-01' THEN 'Repeat' ELSE 'New' END customer_type,
row_number() over ( partition by customer_id order by txn_date) purchase_sequence
FROM sales_data sd)
SELECT customer_type, purchase_sequence, count(*), sum(sales)
FROM categorized_sales_data
group by customer_type, purchase_sequence
order by customer_type, purchase_sequence
Response to comment from OP
all the customers whose first purchase date is in 2019 would be a new customer. Any customer who has transacted in 2019 but their first purchase date is before 2019 would be a repeat customer
So, change
case when min(txn_date) over ( partition by customer_id ) < DATE '2019-01-01'
AND max(txn_date) OVER ( partition by customer_id ) >= DATE '2019-01-01'
THEN 'Repeat' ELSE 'New' END customer_type
to
case when min(txn_date) over ( partition by customer_id )
BETWEEN DATE '2019-01-01' AND DATE '2020-01-01' - INTERVAL '1' SECOND
THEN 'New' ELSE 'Repeat' END customer_type
i.e., if and only if a customer's first purchase was in 2019 then they are "new".
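To make the revised rule concrete, here is a small Python sketch of the same classification and summary. It is an illustration under stated assumptions, not a translation of the SQL: the function name `summarize` and the `(txn_date, customer_id, sales)` tuple layout are invented for the example, and the year is passed in as a parameter.

```python
from collections import defaultdict
from datetime import date


def summarize(transactions, year=2019):
    """Count customers and sum sales per (customer_type, purchase_sequence).

    transactions: iterable of (txn_date, customer_id, sales)  (hypothetical layout)
    A customer is 'New' only if their first-ever purchase falls in `year`.
    """
    by_customer = defaultdict(list)
    for txn_date, customer_id, sales in transactions:
        by_customer[customer_id].append((txn_date, sales))

    summary = defaultdict(lambda: [0, 0])   # key -> [row count, sales total]
    for rows in by_customer.values():
        rows.sort()                          # oldest purchase first
        first_purchase = rows[0][0]
        ctype = "New" if first_purchase.year == year else "Repeat"
        for seq, (_, sales) in enumerate(rows, start=1):
            cell = summary[(ctype, seq)]
            cell[0] += 1
            cell[1] += sales
    return {key: tuple(val) for key, val in sorted(summary.items())}


sample = [
    (date(2019, 1, 2), 1, 10), (date(2018, 4, 3), 1, 20),
    (date(2019, 3, 22), 3, 30), (date(2019, 4, 3), 4, 20),
    (date(2019, 5, 21), 4, 15), (date(2018, 1, 23), 5, 40),
    (date(2019, 4, 3), 5, 30), (date(2019, 4, 3), 6, 20),
    (date(2018, 1, 23), 6, 30), (date(2018, 1, 20), 7, 30),
]
for key, value in summarize(sample).items():
    print(key, value)
```

Note that with this rule customer 7, whose only purchase is in 2018, lands in the "Repeat" group, so the counts differ from the earlier summary table, where that customer was labelled "New".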
I have a list of account balances over time. The schema looks like this:
+-------------+---------+---------+----------------------+
| customer_id | city_id | value | timestamp |
+-------------+---------+---------+----------------------+
| 1 | 1 | -500 | 2019-02-12T00:00:00 |
| 2 | 1 | -200 | 2019-02-12T00:00:00 |
| 3 | 2 | 200 | 2019-02-10T00:00:00 |
| 4 | 1 | -10 | 2019-02-09T00:00:00 |
+-------------+---------+---------+----------------------+
I want to aggregate this data, such that I get the daily total negative account balance partitioned by city and ordered by time:
+---------+---------+--------------+
| city_id | value | timestamp |
+---------+---------+--------------+
| 1 | -500 | 2019-02-12 |
| 1 | -200 | 2019-02-10 |
| 1 | -10 | 2019-02-09 |
+---------+---------+--------------+
What I've tried:
SELECT city_id, FORMAT_TIMESTAMP("%Y-%m-%d", TIMESTAMP(timestamp)) as date,
SUM(value) OVER (PARTITION BY city_id ORDER BY FORMAT_TIMESTAMP("%Y-%m-%d", TIMESTAMP(timestamp))) negative_account_balance
FROM `account_balances`
WHERE value < 0
However, this gives me strange account balance values like -5.985856421224E10. Any ideas why? Besides that, the query returns multiple entries for the same city and day. I would expect it to return each city only once per day.
Below is for BigQuery Standard SQL
#standardSQL
SELECT city_id, account_balance, `date` FROM (
SELECT city_id, `date`,
SUM(value) OVER(PARTITION BY city_id ORDER BY `date`) account_balance
FROM (
SELECT city_id, DATE(TIMESTAMP(t.timestamp)) AS `date`, SUM(value) value
FROM `project.dataset.account_balances` t
GROUP BY city_id, `date` )
)
WHERE account_balance< 0
You can test and play with the above using sample/dummy data, as in the example below
#standardSQL
WITH `project.dataset.account_balances` AS (
SELECT 1 customer_id, 1 city_id, -500 value, '2019-02-12T00:00:00' `timestamp` UNION ALL
SELECT 2, 1, -200, '2019-02-12T00:00:00' UNION ALL
SELECT 5, 1, 100, '2019-02-13T00:00:00' UNION ALL
SELECT 3, 2, 200, '2019-02-10T00:00:00' UNION ALL
SELECT 4, 1, -10, '2019-02-09T00:00:00'
)
SELECT city_id, account_balance, `date` FROM (
SELECT city_id, `date`,
SUM(value) OVER(PARTITION BY city_id ORDER BY `date`) account_balance
FROM (
SELECT city_id, DATE(TIMESTAMP(t.timestamp)) AS `date`, SUM(value) value
FROM `project.dataset.account_balances` t
GROUP BY city_id, `date` )
)
WHERE account_balance< 0
which produces below result
Row city_id account_balance date
1 1 -10 2019-02-09
2 1 -710 2019-02-12
3 1 -610 2019-02-13
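The two stages of that query (first collapse to one row per city and day, then take a running total per city and keep the negative ones) can be mirrored in a few lines of Python. This is a hedged sketch only: `negative_running_balance` and the `(city_id, value, timestamp)` tuple layout are names chosen for the illustration.

```python
from collections import defaultdict
from datetime import datetime


def negative_running_balance(rows):
    """rows: iterable of (city_id, value, iso_timestamp)  (hypothetical layout).

    Returns (city_id, running_balance, date) for every day on which the
    city's cumulative balance is below zero.
    """
    # Stage 1: one total per (city, day), like the inner GROUP BY.
    daily = defaultdict(int)
    for city_id, value, ts in rows:
        day = datetime.fromisoformat(ts).date()
        daily[(city_id, day)] += value

    # Stage 2: running sum per city in date order, like SUM(...) OVER(...).
    running = defaultdict(int)
    result = []
    for city_id, day in sorted(daily):
        running[city_id] += daily[(city_id, day)]
        if running[city_id] < 0:
            result.append((city_id, running[city_id], day.isoformat()))
    return result


sample = [
    (1, -500, "2019-02-12T00:00:00"),
    (1, -200, "2019-02-12T00:00:00"),
    (1, 100, "2019-02-13T00:00:00"),
    (2, 200, "2019-02-10T00:00:00"),
    (1, -10, "2019-02-09T00:00:00"),
]
for row in negative_running_balance(sample):
    print(row)
```

Aggregating before the window step is what guarantees a single row per city and day; applying the running sum directly to the raw rows would repeat the city for every transaction on that day.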
I took a simpler approach and used this sql (BTW When I tried your original query I got a result which seems ok)
SELECT city_id, FORMAT_TIMESTAMP("%Y-%m-%d", TIMESTAMP(timestamp)) as date,
SUM(value) as value
FROM `account_balances`
GROUP BY city_id, timestamp
HAVING value < 0
I used this data to check it out (Note: I changed the date format to match BigQuery format although the result is the same either way)
WITH account_balances as (
SELECT 1 AS customer_id, 1 as city_id, -500 as value, '2019-02-12 00:00:00' as timestamp UNION ALL
SELECT 2 AS customer_id, 1 as city_id, -200 as value, '2019-02-12 00:00:00' as timestamp UNION ALL
SELECT 3 AS customer_id, 2 as city_id, 200 as value, '2019-02-10 00:00:00' as timestamp UNION ALL
SELECT 4 AS customer_id, 1 as city_id, -10 as value, '2019-02-09 00:00:00' as timestamp
)
SELECT city_id, FORMAT_TIMESTAMP("%Y-%m-%d", TIMESTAMP(timestamp)) as date,
SUM(value) as value
FROM `account_balances`
GROUP BY city_id, timestamp
HAVING value < 0
I am playing around with bigquery and hit an interesting use case. I have a collection of customers and account balances. The account balances collection records any account balance change.
Customers:
+---------+--------+
| ID | Name |
+---------+--------+
| 1 | Alice |
| 2 | Bob |
+---------+--------+
Accounts balances:
+---------+---------------+---------+------------+
| ID | customer_id | value | timestamp |
+---------+---------------+---------+------------+
| 1 | 1 | -500 | 2019-02-12 |
| 2 | 1 | -200 | 2019-02-10 |
| 3 | 2 | 200 | 2019-02-10 |
| 4 | 1 | 0 | 2019-02-09 |
+---------+---------------+---------+------------+
The goal is to find out, for how long a customer has a negative account balance. The resulting collection would look like this:
+---------+--------+---------------------------------+
| ID | Name | Negative account balance since |
+---------+--------+---------------------------------+
| 1 | Alice | 2 days |
+---------+--------+---------------------------------+
Bob is not in the collection, because his last account record shows a positive value.
I think the following steps are involved:
get last account balance per customer, see if it is negative
go through the account balance values until you hit a positive (or no more) value
compute datediff
Is something like this even possible in SQL? Do you have any ideas on how to create such a query? To get the customers that currently have a negative account balance, I use this query:
SELECT customer_id FROM (
SELECT t.customer_id, t.value, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY timestamp DESC) as seqnum FROM `account_balances` t
) t
WHERE seqnum = 1 AND value < 0
Below is for BigQuery Standard SQL
#standardSQL
SELECT customer_id, name,
SUM(IF(negative_positive < 0, days, 0)) negative_days,
SUM(IF(negative_positive = 0, days, 0)) zero_days,
SUM(IF(negative_positive > 0, days, 0)) positive_days
FROM (
SELECT customer_id, negative_positive, grp,
1 + DATE_DIFF(MAX(ts), MIN(ts), DAY) days
FROM (
SELECT customer_id, ts, SIGN(value) negative_positive,
COUNTIF(flag) OVER(PARTITION BY customer_id ORDER BY ts) grp
FROM (
SELECT *, SIGN(value) = IFNULL(LEAD(SIGN(value)) OVER(PARTITION BY customer_id ORDER BY ts), 0) flag
FROM `project.dataset.balances`
)
)
GROUP BY customer_id, negative_positive, grp
)
LEFT JOIN `project.dataset.customers`
ON id = customer_id
GROUP BY customer_id, name
You can test and play with the above using the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.balances` AS (
SELECT 1 customer_id, -500 value, DATE '2019-02-12' ts UNION ALL
SELECT 1, -200, '2019-02-10' UNION ALL
SELECT 2, 200, '2019-02-10' UNION ALL
SELECT 1, 0, '2019-02-09'
), `project.dataset.customers` AS (
SELECT 1 id, 'Alice' name UNION ALL
SELECT 2, 'Bob'
)
SELECT customer_id, name,
SUM(IF(negative_positive < 0, days, 0)) negative_days,
SUM(IF(negative_positive = 0, days, 0)) zero_days,
SUM(IF(negative_positive > 0, days, 0)) positive_days
FROM (
SELECT customer_id, negative_positive, grp,
1 + DATE_DIFF(MAX(ts), MIN(ts), DAY) days
FROM (
SELECT customer_id, ts, SIGN(value) negative_positive,
COUNTIF(flag) OVER(PARTITION BY customer_id ORDER BY ts) grp
FROM (
SELECT *, SIGN(value) = IFNULL(LEAD(SIGN(value)) OVER(PARTITION BY customer_id ORDER BY ts), 0) flag
FROM `project.dataset.balances`
)
)
GROUP BY customer_id, negative_positive, grp
)
LEFT JOIN `project.dataset.customers`
ON id = customer_id
GROUP BY customer_id, name
-- ORDER BY customer_id
with result
Row customer_id name negative_days zero_days positive_days
1 1 Alice 3 1 0
2 2 Bob 0 0 1
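For comparison, the three steps listed in the question (take the latest balance, walk back while it stays negative, compute the date difference) can be sketched directly in Python. This is an illustration only: `negative_since` and the tuple layout are hypothetical, and the duration is assumed to be measured from the first record of the current negative streak to the latest record, which is the convention that gives "2 days" for Alice.

```python
from datetime import date


def negative_since(balances):
    """balances: iterable of (customer_id, value, day)  (hypothetical layout).

    Returns {customer_id: days} for customers whose latest balance is negative.
    """
    by_customer = {}
    for customer_id, value, day in balances:
        by_customer.setdefault(customer_id, []).append((day, value))

    result = {}
    for customer_id, rows in by_customer.items():
        rows.sort(reverse=True)              # newest record first
        latest_day, latest_value = rows[0]
        if latest_value >= 0:
            continue                         # currently not negative: skip
        streak_start = latest_day
        for day, value in rows:              # walk back through history
            if value >= 0:
                break                        # streak ends at first non-negative
            streak_start = day
        result[customer_id] = (latest_day - streak_start).days
    return result


sample = [
    (1, -500, date(2019, 2, 12)),
    (1, -200, date(2019, 2, 10)),
    (2, 200, date(2019, 2, 10)),
    (1, 0, date(2019, 2, 9)),
]
print(negative_since(sample))
```

The SQL answer above counts days inclusively (1 + DATE_DIFF), which is why it reports 3 negative days for Alice where this sketch reports 2.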
I have a table structure as below,
CREATE TABLE #CustOrder ( CustId INT, OrderDate DATE )
INSERT #CustOrder ( CustId, OrderDate )
VALUES ( 1, '2016-11-01' ),
( 1, '2019-09-01' ),
( 2, '2019-07-01' ),
( 2, '2019-11-01' ),
( 3, '2017-01-01' ),
( 4, '2016-12-01' ),
( 4, '2017-01-01' )
I want to list the customer with their future order dates, if they do not have a future order I want to list their last or most recent order. I have the following query.
; WITH LastOrder AS
(
SELECT
CO.CustId,
CO.OrderDate,
ROW_NUMBER() OVER(PARTITION BY CO.CustId ORDER BY ABS(DATEDIFF(DAY, CO.OrderDate, GETUTCDATE()))) AS RowNum
FROM #CustOrder AS CO
)
SELECT LO.CustId, LO.OrderDate
FROM LastOrder AS LO
WHERE LO.RowNum = 1
This query gives me the result as,
CustId | OrderDate
--------+-------------
1 | 2016-11-01
2 | 2019-07-01
3 | 2017-01-01
4 | 2017-01-01
However, I need the result as,
CustId | OrderDate
--------+-------------
1 | 2019-09-01
2 | 2019-07-01
3 | 2017-01-01
4 | 2017-01-01
As
Customer 1 has a future order on 2019-09-01
Customer 2 has two future orders but the first one is on 2019-07-01
Customer 3 has no more than 1 order, it should just return 2017-01-01
Customer 4 has two past orders but the most recent is 2017-01-01
rextester: http://rextester.com/PBKNA95127
CREATE TABLE #CustOrder ( CustId INT, OrderDate DATE )
INSERT #CustOrder ( CustId, OrderDate )
VALUES ( 1, '2016-11-01' ),
( 1, '2019-09-01' ),
( 2, '2019-07-01' ),
( 2, '2019-11-01' ),
( 3, '2017-01-01' ),
( 4, '2016-12-01' ),
( 4, '2017-01-01' )
; WITH LastOrder AS
(
SELECT
CO.CustId,
CO.OrderDate,
ROW_NUMBER() OVER(PARTITION BY CO.CustId
ORDER BY case when co.OrderDate > getdate() then 0 else 1 end
, abs(DATEDIFF(DAY, getdate(),CO.OrderDate)) asc
) AS RowNum
FROM #CustOrder AS CO
)
SELECT LO.CustId, LO.OrderDate
FROM LastOrder AS LO
WHERE LO.RowNum = 1
results:
+--------+------------+
| CustId | OrderDate |
+--------+------------+
| 1 | 2019-09-01 |
| 2 | 2019-07-01 |
| 3 | 2017-01-01 |
| 4 | 2017-01-01 |
+--------+------------+
You can use the MAX function to check if the latest date is in the future. If so, get the MIN date after today using MIN. Else get the latest date.
SELECT CUSTID,OrderDate
FROM (SELECT CustId,
OrderDate,
CASE WHEN MAX(orderdate) OVER(PARTITION BY CustId) > GETUTCDATE()
THEN MIN(case when orderdate >getutcdate() then orderdate end) OVER(PARTITION BY CustId)
ELSE MAX(orderdate) OVER(PARTITION BY CustId) end as latest_date
FROM #CustOrder) T
WHERE latest_date=orderDate
Min, Max, UNION approach
select custID, MIN(OrderDate)
from #CustOrder
where OrderDate > '2017-02-17'
group by custID
union all
select co1.custID, max(co1.OrderDate)
from #CustOrder co1
where not exists ( select 1
from #CustOrder co2
where co2.CustId = co1.CustId
and co2.OrderDate > '2017-02-17'
)
group by co1.custID
Start your ORDER BY with a CASE expression that prefers future over past, and then use the ABS DATEDIFF (like you have now) as the second condition in the ORDER BY.
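The same two-level ordering can be expressed as a sort key. The Python sketch below is only meant to show the idea: `next_or_last_order` is a made-up name, and the reference date is passed in explicitly (here 2017-02-17, the date hard-coded in the UNION approach above) instead of calling a clock function.

```python
from datetime import date


def next_or_last_order(orders, today):
    """orders: iterable of (cust_id, order_date)  (hypothetical layout).

    For each customer pick the nearest future order, or, if there is none,
    the most recent past order.
    """
    best = {}
    for cust_id, order_date in orders:
        # First component prefers future (0) over past (1);
        # second component prefers the date closest to `today`.
        key = (0 if order_date > today else 1, abs((order_date - today).days))
        if cust_id not in best or key < best[cust_id][0]:
            best[cust_id] = (key, order_date)
    return {cust_id: order_date for cust_id, (_, order_date) in best.items()}


orders = [
    (1, date(2016, 11, 1)), (1, date(2019, 9, 1)),
    (2, date(2019, 7, 1)), (2, date(2019, 11, 1)),
    (3, date(2017, 1, 1)),
    (4, date(2016, 12, 1)), (4, date(2017, 1, 1)),
]
print(next_or_last_order(orders, today=date(2017, 2, 17)))
```

Ordering by the absolute distance alone, as in the original query, picks 2016-11-01 for customer 1 because it is closer to the reference date than 2019-09-01; the leading future/past flag is what changes that.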
Maybe create another column, use the LAG() window function to grab the previous order date, and then put a conditional/CASE expression in the SELECT list? https://msdn.microsoft.com/en-us/library/hh231256.aspx
I hope I can describe my challenge in an understandable way.
I have two tables on an Oracle Database 12c which look like this:
Table name "Invoices"
I_ID | invoice_number | creation_date | i_amount
------------------------------------------------------
1 | 10000000000 | 01.02.2016 00:00:00 | 30
2 | 10000000001 | 01.03.2016 00:00:00 | 25
3 | 10000000002 | 01.04.2016 00:00:00 | 13
4 | 10000000003 | 01.05.2016 00:00:00 | 18
5 | 10000000004 | 01.06.2016 00:00:00 | 12
Table name "payments"
P_ID | reference | received_date | p_amount
------------------------------------------------------
1 | PAYMENT01 | 12.02.2016 13:14:12 | 12
2 | PAYMENT02 | 12.02.2016 15:24:21 | 28
3 | PAYMENT03 | 08.03.2016 23:12:00 | 2
4 | PAYMENT04 | 23.03.2016 12:32:13 | 30
5 | PAYMENT05 | 12.06.2016 00:00:00 | 15
I want a SELECT statement (maybe with Oracle analytic functions, which I am not really familiar with) where the payments are summed up, ordered by date, until the amount of an invoice is reached. If the sum of, for example, two payments exceeds the invoice amount, the remainder of the last payment should be applied to the next invoice.
In this example the result should be like this:
invoice_number | reference | used_pay_amount | open_inv_amount
----------------------------------------------------------
10000000000 | PAYMENT01 | 12 | 18
10000000000 | PAYMENT02 | 18 | 0
10000000001 | PAYMENT02 | 10 | 15
10000000001 | PAYMENT03 | 2 | 13
10000000001 | PAYMENT04 | 13 | 0
10000000002 | PAYMENT04 | 13 | 0
10000000003 | PAYMENT04 | 4 | 14
10000000003 | PAYMENT05 | 14 | 0
10000000004 | PAYMENT05 | 1 | 11
It would be nice if there is a solution with a "simple" select statement.
thx in advance for your time ...
Oracle Setup:
CREATE TABLE invoices ( i_id, invoice_number, creation_date, i_amount ) AS
SELECT 1, 100000000, DATE '2016-01-01', 30 FROM DUAL UNION ALL
SELECT 2, 100000001, DATE '2016-02-01', 25 FROM DUAL UNION ALL
SELECT 3, 100000002, DATE '2016-03-01', 13 FROM DUAL UNION ALL
SELECT 4, 100000003, DATE '2016-04-01', 18 FROM DUAL UNION ALL
SELECT 5, 100000004, DATE '2016-05-01', 12 FROM DUAL;
CREATE TABLE payments ( p_id, reference, received_date, p_amount ) AS
SELECT 1, 'PAYMENT01', DATE '2016-01-12', 12 FROM DUAL UNION ALL
SELECT 2, 'PAYMENT02', DATE '2016-01-13', 28 FROM DUAL UNION ALL
SELECT 3, 'PAYMENT03', DATE '2016-02-08', 2 FROM DUAL UNION ALL
SELECT 4, 'PAYMENT04', DATE '2016-02-23', 30 FROM DUAL UNION ALL
SELECT 5, 'PAYMENT05', DATE '2016-05-12', 15 FROM DUAL;
Query:
WITH total_invoices ( i_id, invoice_number, creation_date, i_amount, i_total ) AS (
SELECT i.*,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id )
FROM invoices i
),
total_payments ( p_id, reference, received_date, p_amount, p_total ) AS (
SELECT p.*,
SUM( p_amount ) OVER ( ORDER BY received_date, p_id )
FROM payments p
)
SELECT invoice_number,
reference,
LEAST( p_total, i_total )
- GREATEST( p_total - p_amount, i_total - i_amount ) AS used_pay_amount,
GREATEST( i_total - p_total, 0 ) AS open_inv_amount
FROM total_invoices
INNER JOIN
total_payments
ON ( i_total - i_amount < p_total
AND i_total > p_total - p_amount );
Explanation:
The two sub-query factoring (WITH ... AS ()) clauses just add an extra virtual column to the invoices and payments tables with the cumulative sum of the invoice/payment amount.
You can associate a range with each invoice (or payment) as the cumulative amount owing (paid) before the invoice (payment) was placed and the cumulative amount owing (paid) after. The two tables can then be joined where there is an overlap of these ranges.
The open_inv_amount is the positive difference between the cumulative amount invoiced and the cumulative amount paid.
The used_pay_amount is slightly more complicated but you need to find the difference between the lower of the current cumulative invoice and payment totals and the higher of the previous cumulative invoice and payment totals.
Output:
INVOICE_NUMBER REFERENCE USED_PAY_AMOUNT OPEN_INV_AMOUNT
-------------- --------- --------------- ---------------
100000000 PAYMENT01 12 18
100000000 PAYMENT02 18 0
100000001 PAYMENT02 10 15
100000001 PAYMENT03 2 13
100000001 PAYMENT04 13 0
100000002 PAYMENT04 13 0
100000003 PAYMENT04 4 14
100000003 PAYMENT05 14 0
100000004 PAYMENT05 1 11
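The range-overlap logic from the explanation can also be checked outside the database. The following is a minimal Python sketch, not part of the original answer: the function name allocate is hypothetical, and the invoice amounts (30, 25, 13, 18, 12) are an assumption taken from the sample data used later in this thread.

```python
from itertools import accumulate

def allocate(invoices, payments):
    """Match invoices to payments by overlapping cumulative ranges.

    invoices and payments are lists of (label, amount) in FIFO order.
    Returns (invoice, payment, used_pay_amount, open_inv_amount) rows.
    """
    i_totals = list(accumulate(amount for _, amount in invoices))
    p_totals = list(accumulate(amount for _, amount in payments))
    rows = []
    for (inv, i_amount), i_total in zip(invoices, i_totals):
        for (pay, p_amount), p_total in zip(payments, p_totals):
            # Same condition as the ON clause: the two ranges overlap.
            if i_total - i_amount < p_total and i_total > p_total - p_amount:
                used = (min(p_total, i_total)
                        - max(p_total - p_amount, i_total - i_amount))
                open_amount = max(i_total - p_total, 0)
                rows.append((inv, pay, used, open_amount))
    return rows

# Assumed invoice amounts; payment amounts as in the setup above.
invoices = [("100000000", 30), ("100000001", 25), ("100000002", 13),
            ("100000003", 18), ("100000004", 12)]
payments = [("PAYMENT01", 12), ("PAYMENT02", 28), ("PAYMENT03", 2),
            ("PAYMENT04", 30), ("PAYMENT05", 15)]

for row in allocate(invoices, payments):
    print(*row)
```

Under those assumed amounts the nine printed rows match the query output above. The nested loop also mirrors the cost of the join: every invoice is compared with every payment.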
Update:
Based on mathguy's method of using UNION to join the data, I came up with a different solution re-using some of my code.
WITH combined ( invoice_number, reference, i_amt, i_total, p_amt, p_total, total ) AS (
SELECT invoice_number,
NULL,
i_amount,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id ),
NULL,
NULL,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id )
FROM invoices
UNION ALL
SELECT NULL,
reference,
NULL,
NULL,
p_amount,
SUM( p_amount ) OVER ( ORDER BY received_date, p_id ),
SUM( p_amount ) OVER ( ORDER BY received_date, p_id )
FROM payments
ORDER BY 7,
2 NULLS LAST,
1 NULLS LAST
),
filled ( invoice_number, reference, i_prev, i_total, p_prev, p_total ) AS (
SELECT FIRST_VALUE( invoice_number ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( reference ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( i_total - i_amt ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( i_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( p_total - p_amt ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
COALESCE(
p_total,
LEAD( p_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM ),
LAG( p_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM )
)
FROM combined
),
vals ( invoice_number, reference, upa, oia, prev_invoice ) AS (
SELECT invoice_number,
reference,
COALESCE( LEAST( p_total, i_total ) - GREATEST( p_prev, i_prev ), 0 ),
GREATEST( i_total - p_total, 0 ),
LAG( invoice_number ) OVER ( ORDER BY ROWNUM )
FROM filled
)
SELECT invoice_number,
reference,
upa AS used_pay_amount,
oia AS open_inv_amount
FROM vals
WHERE upa > 0
OR ( reference IS NULL AND invoice_number <> prev_invoice AND oia > 0 );
Explanation:
The combined sub-query factoring clause joins the two tables with a UNION ALL and generates the cumulative totals for the amounts invoiced and paid. Finally, it orders the rows by ascending cumulative total (if there are ties, it puts the payments, in the order they were created, before the invoices).
The filled sub-query factoring clause will fill the previously generated table so that if a value is null then it will take the value from the next non-null row (and if there is an invoice with no payments then it will find the total of the previous payments from the preceding rows).
The vals sub-query factoring clause applies the same calculations as my previous query (see above). It also adds the prev_invoice column to help identify invoices which are entirely unpaid.
The final SELECT takes the values and filters out the unnecessary rows.
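The fill step is the least obvious part: FIRST_VALUE( ... ) IGNORE NULLS over a window from the current row to the last row picks the first non-null value at or after each row. A minimal Python sketch of that behaviour follows; the helper name next_non_null is hypothetical, and the sample columns assume invoice amounts of 30, 25, 13, 18 and 12 (the sample data used later in this thread), with rows ordered by cumulative total.

```python
def next_non_null(values):
    """For each position, return the first non-None value at or after it."""
    filled = [None] * len(values)
    carry = None
    # Walk backwards so each row can reuse the value found for later rows.
    for idx in range(len(values) - 1, -1, -1):
        if values[idx] is not None:
            carry = values[idx]
        filled[idx] = carry
    return filled

# Rows of "combined" ordered by cumulative total (assumed sample amounts):
# totals 12, 30, 40, 42, 55, 68, 72, 86, 87, 98
invoice_number = [None, "100000000", None, None, "100000001",
                  "100000002", None, "100000003", None, "100000004"]
reference = ["PAYMENT01", None, "PAYMENT02", "PAYMENT03", None,
             None, "PAYMENT04", None, "PAYMENT05", None]

for pair in zip(next_non_null(invoice_number), next_non_null(reference)):
    print(*pair)
```

The last pair has no payment reference, which corresponds to the part of the final invoice that no payment covers.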
Here is a solution that doesn't require a join. This is important if the amount of data is significant. I did some testing on my laptop (nothing commercial), using the free edition (XE) of Oracle 11.2. Using MT0's solution, the query with the join took about 11 seconds with 10k invoices and 10k payments. For 50k invoices and 50k payments, the query took 287 seconds (almost 5 minutes). This is understandable, since joining two 50k-row tables on a range condition can require up to 2.5 billion comparisons (50,000 x 50,000).
The alternative below uses a union. It uses lead() and last_value() to do the work the join does in the other solution. This union-based solution, with 50k invoices and 50k payments, took less than 0.5 seconds on my laptop.
I simplified the setup a bit; i_id, invoice_number and creation_date are all used for one purpose only: to order the invoice amounts. I use just an inv_id (invoice id) for that purpose, and similarly a pmt_id for payments.
For testing purposes, I created tables invoices and payments like so:
create table invoices (inv_id, inv_amt) as
(select level, trunc(dbms_random.value(20, 80)) from dual connect by level <= 50000);
create table payments (pmt_id, pmt_amt) as
(select level, trunc(dbms_random.value(20, 80)) from dual connect by level <= 50000);
Then, to test the solutions, I use the queries to populate a CTAS, like this:
create table bal_of_pmts as
[select query, including the WITH clause but without the setup CTE's, comes here]
In my solution, I show both the allocation of payments to one or more invoices and the payment of invoices from one or more payments. The output discussed in the original post covers only half of this information, but for symmetry it makes more sense to me to show both halves. The output (for the same inputs as in the original post) looks like this, with my version of inv_id and pmt_id:
    INV_ID       PAID     UNPAID     PMT_ID       USED  AVAILABLE
---------- ---------- ---------- ---------- ---------- ----------
         1         12         18        101         12          0
         1         18          0        103         18         10
         2         10         15        103         10          0
         2          2         13        105          2          0
         2         13          0        107         13         17
         3         13          0        107         13          4
         4          4         14        107          4          0
         4         14          0        109         14          1
         5          1         11        109          1          0
         5         11          0                    11
Notice how the left half is what the original post requested. There is an extra row at the end. Notice the NULL for payment id, for a "payment" of 11: that shows how much of the last invoice is left uncovered. If there were an invoice with id = 6, for an amount of, say, 22, then there would be one more row, showing the entire amount (22) of that invoice as "paid" from a payment with no id, meaning it is actually not covered (yet).
The query may be a little easier to understand than the join approach. To see what it does, it may help to look closely at intermediate results, especially the CTE c (in the WITH clause).
with invoices (inv_id, inv_amt) as (
select 1, 30 from dual union all
select 2, 25 from dual union all
select 3, 13 from dual union all
select 4, 18 from dual union all
select 5, 12 from dual
),
payments (pmt_id, pmt_amt) as (
select 101, 12 from dual union all
select 103, 28 from dual union all
select 105, 2 from dual union all
select 107, 30 from dual union all
select 109, 15 from dual
),
c (kind, inv_id, inv_cml, pmt_id, pmt_cml, cml_amt) as (
select 'i', inv_id, sum(inv_amt) over (order by inv_id), null, null,
sum(inv_amt) over (order by inv_id)
from invoices
union all
select 'p', null, null, pmt_id, sum(pmt_amt) over (order by pmt_id),
sum(pmt_amt) over (order by pmt_id)
from payments
),
d (inv_id, paid, unpaid, pmt_id, used, available) as (
select last_value(inv_id) ignore nulls over (order by cml_amt desc),
cml_amt - lead(cml_amt, 1, 0) over (order by cml_amt desc),
case kind when 'i' then 0
else last_value(inv_cml) ignore nulls
over (order by cml_amt desc) - cml_amt end,
last_value(pmt_id) ignore nulls over (order by cml_amt desc),
cml_amt - lead(cml_amt, 1, 0) over (order by cml_amt desc),
case kind when 'p' then 0
else last_value(pmt_cml) ignore nulls
over (order by cml_amt desc) - cml_amt end
from c
)
select inv_id, paid, unpaid, pmt_id, used, available
from d
where paid != 0
order by inv_id, pmt_id
;
In most cases, CTE d is all we need. However, if the cumulative sum for several invoices is exactly equal to the cumulative sum for several payments, my query would add a row with paid = unpaid = 0. (MT0's join solution does not have this problem.) To cover all possible cases, and not have rows with no information, I had to add the filter for paid != 0.
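The zero row is easy to see if the union approach is reduced to its core: merge the cumulative totals of both tables, sort them, and take the gap between consecutive totals. Here is a minimal Python sketch of just that step (the function name segments is hypothetical, and it only reproduces the paid/used column, not the ids):

```python
from itertools import accumulate

def segments(inv_amts, pmt_amts):
    """Return the gaps between consecutive cumulative totals of both lists."""
    breakpoints = sorted(list(accumulate(inv_amts)) + list(accumulate(pmt_amts)))
    previous = 0
    gaps = []
    for total in breakpoints:
        gaps.append(total - previous)
        previous = total
    return gaps

# Sample data from the query above: the gaps are the PAID / USED column.
print(segments([30, 25, 13, 18, 12], [12, 28, 2, 30, 15]))
# [12, 18, 10, 2, 13, 13, 4, 14, 1, 11]

# Tie: invoices 30 + 25 and payments 40 + 15 both reach 55,
# so the last gap has width zero.
print(segments([30, 25], [40, 15]))
# [30, 10, 15, 0]
```

The trailing 0 in the second call is the row that the paid != 0 filter removes.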