Sum and Running Sum, Distinct and Running Distinct - sql

I want to calculate sum, running sum, distinct, running distinct - preferably all in one query.
http://sqlfiddle.com/#!18/65eff/1
create table test (store int, day varchar(10), food varchar(10), quantity int)
insert into test select 101, '2021-01-01', 'rice', 1
insert into test select 101, '2021-01-01', 'rice', 1
insert into test select 101, '2021-01-01', 'rice', 2
insert into test select 101, '2021-01-01', 'fruit', 2
insert into test select 101, '2021-01-01', 'water', 3
insert into test select 101, '2021-01-01', 'fruit', 1
insert into test select 101, '2021-01-01', 'salt', 2
insert into test select 101, '2021-01-02', 'rice', 1
insert into test select 101, '2021-01-02', 'rice', 2
insert into test select 101, '2021-01-02', 'fruit', 1
insert into test select 101, '2021-01-02', 'pepper', 4
Uniques (distinct) & Total (sum) are simple:
select store, day, count(distinct food) as uniques, sum(quantity) as total
from test
group by store, day
But I want the output to be:
store  day         uniques  run_uniques  total  run_total
101    2021-01-01  4        4            12     12
101    2021-01-02  3        5            10     22
I tried a self-join with t.day >= prev.day to get cumulative/running data, but it's causing double-counting.

First off: always store data in the correct data type; day should be a date column.
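For example (a sketch; this assumes every existing day value parses as an ISO yyyy-MM-dd string):
-- convert the varchar column in place; this fails if any value is not a valid date
ALTER TABLE test ALTER COLUMN day date;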
Calculating a running sum of the sum(quantity) aggregate is quite simple: you just nest the aggregate inside a window function, SUM(SUM(...)) OVER (...).
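In isolation, that running total looks like this (a minimal sketch against the test table above):
select store, day,
       sum(quantity) as total,
       sum(sum(quantity)) over (partition by store order by day
                                rows unbounded preceding) as run_total
from test
group by store, day;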
Calculating the running number of unique foods per store is more complicated, because you want the rolling number of unique items before grouping, and there is no COUNT(DISTINCT ...) window function in SQL Server (which is what I'm using).
So I've gone with calculating a ROW_NUMBER() for each store and food across all days, then summing up the number of times we get 1, i.e. the first time we've seen this food for that store.
SELECT
    t.store,
    t.day,
    uniques     = COUNT(DISTINCT t.food),
    run_uniques = SUM(SUM(CASE WHEN t.rn = 1 THEN 1 ELSE 0 END))
                      OVER (PARTITION BY t.store ORDER BY t.day ROWS UNBOUNDED PRECEDING),
    total       = SUM(t.quantity),
    run_total   = SUM(SUM(t.quantity))
                      OVER (PARTITION BY t.store ORDER BY t.day ROWS UNBOUNDED PRECEDING)
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY store, food ORDER BY day) AS rn
    FROM test
) t
GROUP BY t.store, t.day;


How to get a continuous date interval from rows fulfilling specific condition?
I have a table of employee states with 2 types of user_position.
An interval is continuous if, per user_id, the row with the next higher date_position is exactly one day later and the user_position didn't change. A user cannot have different user positions on one day.
I have a feeling it requires several CASE expressions, window functions and tsrange, but I can't quite get the right result.
I would be really grateful if you could help me.
Fiddle:
http://sqlfiddle.com/#!17/ba641/1/0
The result should look like this:
user_id  user_position  position_start  position_end
1        1              01.01.2019      02.01.2019
1        2              03.01.2019      04.01.2019
1        1              05.01.2019      06.01.2019
2        1              01.01.2019      03.01.2019
2        2              04.01.2019      05.01.2019
2        2              08.01.2019      08.01.2019
2        2              10.01.2019      10.01.2019
Create/insert query for the source data:
CREATE TABLE IF NOT EXISTS users_position (
  id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  user_id integer,
  user_position integer,
  date_position date
);
INSERT INTO users_position (user_id,
user_position,
date_position)
VALUES
(1, 1, '2019-01-01'),
(1, 1, '2019-01-02'),
(1, 2, '2019-01-03'),
(1, 2, '2019-01-04'),
(1, 1, '2019-01-05'),
(1, 1, '2019-01-06'),
(2, 1, '2019-01-01'),
(2, 1, '2019-01-02'),
(2, 1, '2019-01-03'),
(2, 2, '2019-01-04'),
(2, 2, '2019-01-05'),
(2, 2, '2019-01-08'),
(2, 2, '2019-01-10');
SELECT user_id, user_position
, min(date_position) AS position_start
, max(date_position) AS position_end
FROM (
SELECT user_id, user_position,date_position
, count(*) FILTER (WHERE (date_position = last_date + 1
AND user_position = last_pos) IS NOT TRUE)
OVER (PARTITION BY user_id ORDER BY date_position) AS interval
FROM (
SELECT user_id, user_position, date_position
, lag(date_position) OVER w AS last_date
, lag(user_position) OVER w AS last_pos
FROM users_position
WINDOW w AS (PARTITION BY user_id ORDER BY date_position)
) sub1
) sub2
GROUP BY user_id, user_position, interval
ORDER BY user_id, interval;
db<>fiddle here
Basically, this forms intervals by counting the number of disruptions in continuity. Whenever the "next" row per user_id is not what's expected, a new interval starts.
The WINDOW clause lets you define a window once and reuse it repeatedly; it has no effect on performance.
last_date + 1 works while last_date is type date. See:
Is there a way to do date arithmetic on values of type DATE without result being of type TIMESTAMP?
Related:
Get start and end date time based on sequence of rows
Select longest continuous sequence
About the aggregate FILTER:
Aggregate columns with additional (distinct) filters
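For comparison, here is a sketch of the classic gaps-and-islands alternative using row_number() arithmetic; it leans on the same rule that a user has at most one row per day, so consecutive days in the same position share a constant (date_position - row_number) value:
SELECT user_id, user_position
     , min(date_position) AS position_start
     , max(date_position) AS position_end
FROM (
   SELECT user_id, user_position, date_position
        -- consecutive days in the same position yield the same grp value
        , date_position - (row_number() OVER (PARTITION BY user_id, user_position
                                              ORDER BY date_position))::int AS grp
   FROM users_position
   ) sub
GROUP BY user_id, user_position, grp
ORDER BY user_id, position_start;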

SQL Find the daily maximum units from a table which stores transactions

I have an SQL table which stores the units (inventory) of items at any given timestamp. Any transaction (add/delete) on an item basically updates this table with the new quantity and the timestamp of occurrence.
update_timestamp  item_id  units
1637993217        item1    3
1637993227        item2    1
1637993117        item1    2
1637993237        item1    5
I need to fetch the daily maximum units for every item from this table.
The query I am using is something similar to this:
SELECT date_format(from_unixtime(CAST(update_timestamp AS BIGINT) / 1000), '%Y-%m-%d') AS day,
       item_id,
       MAX(units) AS max_units
FROM Table
GROUP BY item_id, day;
which gives an output like:
day         item_id  max_units
2021-11-23  item1    5
2021-11-24  item1    6
2021-11-23  item2    3
...
However, when generating the output, I also need to account for units carried forward from the balance of the last transaction before the current day.
Example: for item1, there were a few transactions on 2021-11-24 and the quantity at the end of that day was 6. If the next transactions on this item occurred only on 2021-11-26, say in the sequence [4, 2, 3], then 6 should continue to be the maximum units of the item for 2021-11-25 and 2021-11-26 as well.
I am stuck here and unable to get this working in SQL. Currently I am approaching it by fetching the last transaction of every day separately and then using Python scripts to forward-fill the data for subsequent days, which is neither clean nor scalable in my case.
I am running queries on Presto SQL engine.
You can use the lag window function to get the previous value and select the maximum of it and the current one:
WITH dataset (update_timestamp, item_id, units) AS (
    VALUES (timestamp '2021-11-21 00:00:01', 'item1', 10),
           (timestamp '2021-11-23 00:00:02', 'item1', 6),
           (timestamp '2021-11-23 00:00:03', 'item2', 1),
           (timestamp '2021-11-24 00:00:01', 'item1', 2),
           (timestamp '2021-11-24 00:00:04', 'item1', 5)
)
SELECT item_id,
       day,
       coalesce( -- greatest returns NULL if one of the arguments is NULL, so fall back to "current"
           greatest(
               max_units,
               lag(max_units) over (
                   partition by item_id
                   order by day
               )
           ),
           max_units
       ) as max_units
FROM (
    SELECT item_id,
           date_trunc('day', update_timestamp) AS day,
           max(units) AS max_units
    FROM dataset
    GROUP BY item_id,
             date_trunc('day', update_timestamp)
)
Output:
item_id  day                      max_units
item2    2021-11-23 00:00:00.000  1
item1    2021-11-21 00:00:00.000  10
item1    2021-11-23 00:00:00.000  10
item1    2021-11-24 00:00:00.000  6
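Note that this only produces rows for days that actually have transactions, so the 2021-11-25 and 2021-11-26 rows from the example would not appear at all; the calendar-table approach in the next answer fills in those missing days.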
I think my answer is really close to Guru's. I made the assumption that you might need to fill in missing dates, so I created a calendar table; replace it with whatever you want.
This was written in BigQuery, so I'm not sure it will compile/execute in Presto, but I think the two are syntactically close.
with transactions as (
select cast('2021-11-17' as date) as update_timestamp, 'item1' as item_id, 3 as units union all
select cast('2021-11-18' as date), 'item2', 1 union all
select cast('2021-11-18' as date), 'item2', 5 union all
select cast('2021-11-20' as date), 'item1', 2 union all
select cast('2021-11-20' as date), 'item2', 3 union all
select cast('2021-11-20' as date), 'item2', 2 union all
select cast('2021-11-20' as date), 'item1', 10 union all
select cast('2021-11-24' as date), 'item1', 8 union all
select cast('2021-11-24' as date), 'item1', 5
),
some_calendar_table AS (
SELECT cast(d as date) as cal_date
FROM UNNEST(GENERATE_DATE_ARRAY('2021-11-15', '2021-11-30', INTERVAL 1 DAY)) AS d
),
daily_transaction_max as (
SELECT update_timestamp AS transaction_date,
item_id,
MAX(units) as max_value
from transactions
group by item_id, transaction_date
)
select cal.cal_date
     , t.item_id
     , mt.max_value as max_inventory_from_this_dates_transactions
     , greatest(coalesce(mt.max_value, 0)
              , coalesce(last_value(mt.max_value ignore nulls)
                             over (partition by t.item_id
                                   order by cal.cal_date
                                   rows between unbounded preceding and 1 preceding)
                       , 0)) as max_daily_inventory
from some_calendar_table cal
cross join (select distinct item_id from daily_transaction_max) t
left join daily_transaction_max mt
    on mt.transaction_date = cal.cal_date
   and mt.item_id = t.item_id
order by t.item_id, cal.cal_date
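If you want the same calendar fill directly in Presto, a rough equivalent of GENERATE_DATE_ARRAY is a sketch like this (assuming your Presto version supports sequence() over dates; the date range here is illustrative, substitute the span your data actually covers):
-- generates one row per day between the two dates
SELECT cal_date
FROM UNNEST(sequence(date '2021-11-15', date '2021-11-30', interval '1' day)) AS t(cal_date)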

Cumulative product using input from multiple columns

I have a combination of some daily return estimates and month-to-day (MTD) returns, which are issued weekly. I want to combine these two data series to get a daily estimated MTD value.
I have tried to summarize below what I would like to achieve.
I got all the columns except MTD_estimate, which I would like to derive from DailyReturnEstimate and MTD. Where an MTD value exists, it should use that value. Otherwise, it should do a cumulative product of the returns. My code looks as follows:
select *, exp(sum(log(1+DailyReturnEstimate)) OVER (ORDER BY dates) )-1 as Cumu_DailyReturn from TestTbl
My problem is that I am not sure how to use the MTD value, when present, in the cumulative product.
I am using Microsoft SQL Server 2012. I have made a small data example below:
CREATE TABLE TestTbl (
  id integer PRIMARY KEY,
  dates date,
  DailyReturnEstimate float,
  MTD float
);
INSERT INTO TestTbl
(id, Dates, DailyReturnEstimate, MTD) VALUES
(1, '2020-01-01', -0.01, NULL ),
(2, '2020-01-02', 0.005 , NULL ),
(3, '2020-01-03', 0.012 , NULL ),
(4, '2020-01-04', -0.007 , NULL ),
(5, '2020-01-05', 0.021 , 0.016 ),
(6, '2020-01-06', 0.001 , NULL );
This is a bit tricky, but the idea is to set up separate groups based on where mtd is already defined. Then do the calculation only within those groups:
select t.*,
       exp(sum(log(1 + coalesce(mtd, DailyReturnEstimate)))
               over (partition by grp order by dates)) - 1 as Cumu_DailyReturn
from (select t.*,
             count(mtd) over (order by dates) as grp
      from testtbl t
     ) t;
Here is a db<>fiddle.
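To see how the grouping behaves, you can run the inner derived table on its own; with the sample data, grp is 0 for the first four rows and becomes 1 at 2020-01-05, the row whose MTD value restarts the compounding:
-- running count of non-null MTD values defines the groups
select dates, DailyReturnEstimate, MTD,
       count(MTD) over (order by dates) as grp
from TestTbl
order by dates;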

SQL server query to find values grouped by one column but different in at least one of other columns

Please pardon the title of my question -
I have a table
TRXN (ID,ACCT_NUM,TRAN_MEMO,AMOUNT,DATE,LRN)
I want to write a query to pull records which have the same LRN but where at least one of the other columns has a different value. Is it possible?
In my answer I assume ID has unique values and exclude it from the comparison.
Table created:
CREATE TABLE #TRXN (ID INT IDENTITY(1, 1)
,ACCT_NUM INT
,TRAN_MEMO INT
,AMOUNT INT
,[DATE] DATE
,LRN INT
)
Sample data inserted:
INSERT INTO #TRXN VALUES (1, 2, 2, '1 jan 2000', 2)
,(2, 2, 2, '2 jan 2000', 2)
,(1, 2, 2, '1 jan 2000', 2)
,(1, 2, 2, '1 jan 2000', 3)
Rows that have the same LRN but at least one other column with a different value:
;WITH C AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN
                              ORDER BY ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN) AS Rn,
           ID, ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN
    FROM #TRXN
    WHERE LRN IN (SELECT LRN FROM #TRXN GROUP BY LRN HAVING COUNT(ID) > 1)
)
SELECT ID, ACCT_NUM, TRAN_MEMO, AMOUNT, [DATE], LRN
FROM C
WHERE Rn = 1
Output:
ID  ACCT_NUM  TRAN_MEMO  AMOUNT  DATE        LRN
--  --------  ---------  ------  ----------  ---
1   1         2          2       2000-01-01  2
2   2         2          2       2000-01-02  2
Why not simply use GROUP BY:
SELECT COUNT(1) AS numberOfGroupedRows, ID, ACCT_NUM, TRAN_MEMO, AMOUNT, DATE, LRN
FROM TRXN
GROUP BY ID, ACCT_NUM, TRAN_MEMO, AMOUNT, DATE, LRN
since GROUP BY will collapse all identical rows into a single row.
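If the goal is specifically to pull the rows that share an LRN while disagreeing in some other column, a direct EXISTS sketch against the original TRXN table could look like this (assuming ID is unique and the compared columns are NOT NULL; nullable columns would need extra handling):
SELECT t.*
FROM TRXN t
WHERE EXISTS (
    SELECT 1
    FROM TRXN u
    WHERE u.LRN = t.LRN
      AND u.ID <> t.ID                 -- a different record with the same LRN...
      AND (u.ACCT_NUM <> t.ACCT_NUM    -- ...that differs in at least one column
           OR u.TRAN_MEMO <> t.TRAN_MEMO
           OR u.AMOUNT <> t.AMOUNT
           OR u.[DATE] <> t.[DATE])
);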

How to select and sum() most recent item in a history table for each time period?

I have a history table containing a single entry when an entry in another table changes. I need to perform a query that produces the sum() or count() of the most recent entries for each time period.
Here's the relevant bits of my table structure:
CREATE TABLE opportunity_history (
"id" BIGSERIAL PRIMARY KEY,
"opportunity_id" TEXT NOT NULL,
"employee_external_id" TEXT NOT NULL,
"item_date" TIMESTAMP NOT NULL,
"amount" NUMERIC(18,2) NOT NULL DEFAULT 0
);
So for example if I have a single opportunity created in January, and updated twice in February, I want to count it once in Jan, and only once in Feb.
The other similar queries I have (which don't involve history - just singular data points) work fine by joining to a generate_series() in a single query. I would love to be able to achieve something similar. Here's an example using generate_series:
SELECT Periods.day, sum(amount) AS value
FROM (
    SELECT generate_series('2012-01-01', '2013-01-01', '1 month'::interval)::date AS day
) AS Periods
LEFT JOIN opportunity ON (
    employee_external_id = '...'
    AND close_date >= Periods.day
    AND close_date < Periods.day + interval '1 month'
)
GROUP BY 1
ORDER BY 1
However that doesn't work for opportunity_history, because if a single item is listed in the same month twice you get duplication.
I'm really stumped on this one. I've tried doing it via WITH RECURSIVE and nothing seems to unfold properly for me.
Edit:
Example data (skipping id columns and using dates instead of timestamps):
'foo', 'user1', 2013-01-01, 100
'bar', 'user1', 2013-01-02, 50
'foo', 'user1', 2013-01-12, 100
'bar', 'user1', 2013-01-13, 55
'foo', 'user1', 2013-01-23, 100
'foo', 'user1', 2013-02-04, 100
'foo', 'user1', 2013-02-15, 100
'bar', 'user1', 2013-03-01, 55
For sum I would want:
2013-01 155 (foo on 2013-01-23 and bar on 2013-01-13)
2013-02 100 (foo on 2013-02-15)
2013-03 55 (bar on 2013-03-01)
Or for count:
2013-01 2
2013-02 1
2013-03 1
Also note I'm happy to use "extended" SQL such as CTEs or WITH RECURSIVE or window functions if required. I'd rather avoid a loop in a Pg/plsql function if I can do it in a single query.
select item_month, count(*), sum(amount)
from (
    select opportunity_id,
           item_date,
           amount,
           to_char(item_date, 'yyyy-mm') as item_month,
           row_number() over (partition by opportunity_id, to_char(item_date, 'yyyy-mm')
                              order by item_date desc) as rn
    from opportunity_history
) t
where rn = 1
group by item_month
order by 1;
SQLFiddle example: http://sqlfiddle.com/#!15/c4152/1
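The same "latest row per opportunity per month" selection can also be written with Postgres's DISTINCT ON; this sketch should be equivalent to the row_number() filter above:
select item_month, count(*), sum(amount)
from (
    select distinct on (opportunity_id, to_char(item_date, 'yyyy-mm'))
           opportunity_id,
           to_char(item_date, 'yyyy-mm') as item_month,
           amount
    from opportunity_history
    -- DISTINCT ON keeps the first row per group, so order the latest entry first
    order by opportunity_id, to_char(item_date, 'yyyy-mm'), item_date desc
) t
group by item_month
order by 1;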
Is this what you need?
select
opportunity_id,
extract(year from item_date) as year,
extract(month from item_date) as month,
count(*),
sum(amount)
from opportunity_history
group by opportunity_id, year, month
order by opportunity_id, year, month
If not, please explain what else you require/why it is wrong.
See fiddle.