Identify which users have positive balance every day in SQL

I have bank transaction data from several users, with the following schema:
CREATE TABLE IF NOT EXISTS transactions (
id int,
user_id int,
created_at DATE,
amount float
);
INSERT INTO transactions VALUES
(1, 1, '2020-01-01', 100),
(2, 1, '2020-01-02', -50),
(3, 1, '2020-01-04', -50),
(4, 2, '2020-01-04', 80),
(5, 3, '2020-01-06', 10),
(6, 3, '2020-01-10', -10);
I want to know, for each day from the beginning of the transactions to the current date, which users have a positive balance on their accounts.
In this case, the output of the query would be:
date,user_id
'2020-01-01',1
'2020-01-02',1
'2020-01-03',1
'2020-01-04',1
'2020-01-04',2
'2020-01-05',2
'2020-01-06',2
'2020-01-07',2
...
'2021-05-17',2 -- Today's date, user 2 still has positive balance
'2020-01-06',3
'2020-01-07',3
'2020-01-08',3
'2020-01-09',3
'2020-01-10',3
Is there an easy way to do this using PostgreSQL? Or even better, in BigQuery?

Try this for BigQuery:
with transactions as (
select 1 as user_id, date '2020-01-01' as date, 100 as amount union all
select 1, '2020-01-02', -50 union all
select 1, '2020-01-04', -50 union all
select 2, '2020-01-04', 80 union all
select 3, '2020-01-06', 10 union all
select 3, '2020-01-10', -10
),
all_users as (
select min(date) as min_date, user_id
from transactions
group by user_id
),
all_days as (
select *
from all_users, unnest(generate_date_array('2020-01-01', current_date())) as date
where date >= min_date
)
select date, user_id
from all_days left join transactions using (user_id, date)
where true
qualify sum(amount) over (partition by user_id order by date) > 0
Without qualify:
with transactions as (
select 1 as user_id, date '2020-01-01' as date, 100 as amount union all
select 1, '2020-01-02', -50 union all
select 1, '2020-01-04', -50 union all
select 2, '2020-01-04', 80 union all
select 3, '2020-01-06', 10 union all
select 3, '2020-01-10', -10
),
all_users as (
select min(date) as min_date, user_id
from transactions
group by user_id
),
all_days as (
select *
from all_users, unnest(generate_date_array('2020-01-01', current_date())) as date
where date >= min_date
)
select date, user_id
from (
select date, user_id, sum(amount) over (partition by user_id order by date) as balance
from all_days left join transactions using (user_id, date)
)
where balance > 0
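Both variants follow the same shape: expand one row per (user, day), left-join the transactions, take a running SUM, and keep the positive days. Here is a runnable sketch of that shape using Python's stdlib sqlite3 as a stand-in for PostgreSQL/BigQuery (assumes SQLite >= 3.25 for window functions); a recursive CTE replaces generate_date_array, and '2020-01-10' stands in for current_date() so the output stays small and deterministic.

```python
import sqlite3

# Same idea as the answer above: one row per (user, day) from the user's
# first transaction onward, left-joined to transactions, with a running
# balance; keep the days where the balance is positive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (id INT, user_id INT, created_at DATE, amount FLOAT);
INSERT INTO transactions VALUES
  (1, 1, '2020-01-01', 100),
  (2, 1, '2020-01-02', -50),
  (3, 1, '2020-01-04', -50),
  (4, 2, '2020-01-04', 80),
  (5, 3, '2020-01-06', 10),
  (6, 3, '2020-01-10', -10);
""")
rows = conn.execute("""
WITH RECURSIVE days(d) AS (
  SELECT '2020-01-01'
  UNION ALL
  SELECT date(d, '+1 day') FROM days WHERE d < '2020-01-10'
),
all_days AS (
  SELECT u.user_id, days.d AS day
  FROM (SELECT user_id, MIN(created_at) AS min_date
        FROM transactions GROUP BY user_id) u
  JOIN days ON days.d >= u.min_date
)
SELECT a.day, a.user_id,
       SUM(COALESCE(t.amount, 0)) OVER
         (PARTITION BY a.user_id ORDER BY a.day) AS balance
FROM all_days a
LEFT JOIN transactions t
  ON t.user_id = a.user_id AND t.created_at = a.day
ORDER BY a.user_id, a.day
""").fetchall()
positive = [(day, uid) for day, uid, bal in rows if bal > 0]
print(positive)
```

With this sample, `positive` holds 14 (day, user_id) rows. Note that user 1 drops out on 2020-01-04 because its balance reaches exactly 0 there, which the `> 0` filter excludes; use `>= 0` if a zero balance should still count.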


Use rank to find the last purchase

I'm trying to find the last purchase for each customer_id. Since there are 3 customers I was expecting to get back 3 rows, but I'm getting more.
Can someone tell me what's wrong and how to fix this issue? Any help would be greatly appreciated.
ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'DD-MON-YYYY HH24:MI:SS.FF';
ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MON-YYYY HH24:MI:SS';
CREATE TABLE customers
(CUSTOMER_ID, FIRST_NAME, LAST_NAME) AS
SELECT 1, 'Faith', 'Mazzarone' FROM DUAL UNION ALL
SELECT 2, 'Lisa', 'Saladino' FROM DUAL UNION ALL
SELECT 3, 'Jerry', 'Torchiano' FROM DUAL;
CREATE TABLE items
(PRODUCT_ID, PRODUCT_NAME) AS
SELECT 100, 'Black Shoes' FROM DUAL UNION ALL
SELECT 101, 'Brown Shoes' FROM DUAL UNION ALL
SELECT 102, 'White Shoes' FROM DUAL;
CREATE TABLE purchases
(CUSTOMER_ID, PRODUCT_ID, QUANTITY, PURCHASE_DATE) AS
SELECT 1, 100, 1, TIMESTAMP'2022-10-11 09:54:48' FROM DUAL UNION ALL
SELECT 1, 100, 1, TIMESTAMP '2022-10-11 19:04:18' FROM DUAL UNION ALL
SELECT 2, 101,1, TIMESTAMP '2022-10-11 09:54:48' FROM DUAL UNION ALL
SELECT 2,101,1, TIMESTAMP '2022-10-17 19:04:18' FROM DUAL UNION ALL
SELECT 3, 101,1, TIMESTAMP '2022-10-11 09:54:48' FROM DUAL UNION ALL
SELECT 3,102,1, TIMESTAMP '2022-10-17 19:04:18' FROM DUAL UNION ALL
SELECT 3,102, 4,TIMESTAMP '2022-10-10 17:00:00' + NUMTODSINTERVAL ( LEVEL * 2, 'DAY') FROM dual
CONNECT BY LEVEL <= 5;
with cte as
(select
CUSTOMER_ID,
PRODUCT_ID,
QUANTITY,
PURCHASE_DATE,
rank() over (partition by customer_id order by purchase_date desc) rnk
from purchases
)
SELECT p.customer_id,
c.first_name,
c.last_name,
p.product_id,
i.product_name,
p.quantity,
p.purchase_date
from cte p
JOIN customers c ON c.customer_id = p.customer_id
JOIN items i ON i.product_id = p.product_id
from cte p;
where rnk = 1
First, don't use RANK or DENSE_RANK: they will assign identical purchase_date values the same rank and hence give you more than one "1" value. Use ROW_NUMBER instead.
Second, you have "from cte p" in there twice. Remove the second one.
And lastly, the real answer to your question is that you have a semicolon before the "where rnk = 1", so nothing after the semicolon is executed and the filter never runs. A semicolon ends the SQL statement completely.
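The RANK vs ROW_NUMBER point is easy to see on a tiny hypothetical table with a tied timestamp, run here via Python's stdlib sqlite3 (assuming SQLite >= 3.25 for window functions):

```python
import sqlite3

# Two purchases share the latest purchase_date, so RANK assigns both
# rank 1, while ROW_NUMBER breaks the tie and yields exactly one 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (customer_id INT, purchase_date TEXT);
INSERT INTO purchases VALUES
  (1, '2022-10-11 19:04:18'),
  (1, '2022-10-11 19:04:18'),  -- tie on the latest timestamp
  (1, '2022-10-11 09:54:48');
""")
rows = conn.execute("""
SELECT purchase_date,
       RANK()       OVER (ORDER BY purchase_date DESC) AS rnk,
       ROW_NUMBER() OVER (ORDER BY purchase_date DESC) AS rn
FROM purchases
""").fetchall()
rank_ones   = [r for r in rows if r[1] == 1]
rownum_ones = [r for r in rows if r[2] == 1]
print(len(rank_ones), len(rownum_ones))  # 2 vs 1
```

This is exactly why the rnk = 1 filter returns more than one row per customer when timestamps collide.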

Rolling Daily Distinct Counts Partition by Month

I have the following table:
CREATE TABLE tbl (
id int NOT NULL
, date date NOT NULL
, cid int NOT NULL
);
INSERT INTO tbl VALUES
(1 , '2022-01-01', 1)
, (2 , '2022-01-01', 1)
, (3 , '2022-01-01', 2)
, (4 , '2022-01-01', 3)
, (5 , '2022-01-02', 1)
, (6 , '2022-01-02', 4)
, (7 , '2022-01-03', 5)
, (8 , '2022-01-03', 6)
, (9 , '2022-02-01', 1)
, (10, '2022-02-01', 5)
, (11, '2022-02-02', 5)
, (12, '2022-02-02', 3)
;
I'm trying to count distinct users (= cid) each day, but the result is rolling during the month. E.g., for 2022-01-01, only distinct users with date = 2022-01-01 are counted. For 2022-01-02, distinct users with date between 2022-01-01 and 2022-01-02 are counted, and so on. The count should restart each month.
My desired output:
date distinct_cids
2022-01-01 3
2022-01-02 4
2022-01-03 6
2022-02-01 2
2022-02-02 3
I don't have access to Snowflake so I can't guarantee that this will work, but from the sound of it:
select date, count(distinct cid) over (partition by month(date) order by date)
from tbl
order by date;
If you have several years worth of data, you can partition by year, month:
select date, count(distinct cid) over (partition by year(date), month(date) order by date)
from tbl
order by date;
Date is a reserved word, so you may want to consider renaming your column.
EDIT: Since DISTINCT is disallowed in window aggregates here, you can try a vanilla SQL variant. It is likely slow for a large table:
select dt, count(cid)
from (
select distinct dt.dt, x.cid
from tbl x
join (
select distinct date as dt from tbl
) dt (dt)
on x.date <= dt.dt
and month(x.date) = month(dt.dt)
) t
group by dt
order by dt
;
The idea is that we create a new relation (t) with distinct users with a date less than or equal to the current date in the current month. Then we can just count those users for each date.
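The vanilla variant can be checked with Python's stdlib sqlite3 (month() is not a SQLite function, so strftime('%Y-%m', ...) takes its place; using a year+month key also keeps Januaries of different years apart, which a month-only partition would merge):

```python
import sqlite3

# Rolling distinct cids per day, restarting each month: pair every day
# with all earlier-or-equal days in the same month, dedupe (day, cid),
# then count per day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INT, date TEXT, cid INT);
INSERT INTO tbl VALUES
  (1,'2022-01-01',1),(2,'2022-01-01',1),(3,'2022-01-01',2),
  (4,'2022-01-01',3),(5,'2022-01-02',1),(6,'2022-01-02',4),
  (7,'2022-01-03',5),(8,'2022-01-03',6),(9,'2022-02-01',1),
  (10,'2022-02-01',5),(11,'2022-02-02',5),(12,'2022-02-02',3);
""")
rows = conn.execute("""
SELECT dt, COUNT(cid)
FROM (
  SELECT DISTINCT d.dt, x.cid
  FROM tbl x
  JOIN (SELECT DISTINCT date AS dt FROM tbl) d
    ON x.date <= d.dt
   AND strftime('%Y-%m', x.date) = strftime('%Y-%m', d.dt)
)
GROUP BY dt
ORDER BY dt
""").fetchall()
print(rows)
```

This reproduces the desired output, including the restart at 2022-02-01.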

Sum and Running Sum, Distinct and Running Distinct

I want to calculate sum, running sum, distinct, running distinct - preferably all in one query.
http://sqlfiddle.com/#!18/65eff/1
create table test (store int, day varchar(10), food varchar(10), quantity int)
insert into test select 101, '2021-01-01', 'rice', 1
insert into test select 101, '2021-01-01', 'rice', 1
insert into test select 101, '2021-01-01', 'rice', 2
insert into test select 101, '2021-01-01', 'fruit', 2
insert into test select 101, '2021-01-01', 'water', 3
insert into test select 101, '2021-01-01', 'fruit', 1
insert into test select 101, '2021-01-01', 'salt', 2
insert into test select 101, '2021-01-02', 'rice', 1
insert into test select 101, '2021-01-02', 'rice', 2
insert into test select 101, '2021-01-02', 'fruit', 1
insert into test select 101, '2021-01-02', 'pepper', 4
Uniques (distinct) & Total (sum) are simple:
select store, day, count(distinct food) as uniques, sum(quantity) as total
from test
group by store, day
But I want the output to be:
store  day         uniques  run_uniques  total  run_total
101    2021-01-01  4        4            12     12
101    2021-01-02  3        5            10     22
I tried a self-join with t.day >= prev.day to get cumulative/running data, but it's causing double-counting.
First off: always store data in the correct data type; day should be a date column.
Calculating a running sum of sum(quantity) aggregate is quite simple, you just nest it inside a window function: SUM(SUM(...)) OVER (...).
Calculating the running number of unique food per store is more complicated, because you want the rolling number of unique items before grouping, and there is no COUNT(DISTINCT ...) window function in SQL Server (which is what I'm using).
So I've gone with calculating a ROW_NUMBER() for each store and food across all days; then we just sum up the number of times we get 1, i.e. the first time we've seen this food.
SELECT
t.store,
t.day,
uniques = COUNT(DISTINCT t.food),
run_uniques = SUM(SUM(CASE WHEN t.rn = 1 THEN 1 ELSE 0 END))
OVER (PARTITION BY t.store ORDER BY t.day ROWS UNBOUNDED PRECEDING),
total = SUM(t.quantity),
run_total = SUM(SUM(t.quantity))
OVER (PARTITION BY t.store ORDER BY t.day ROWS UNBOUNDED PRECEDING)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY store, food ORDER BY day) rn
FROM test
) t
GROUP BY t.store, t.day;
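The same rn = 1 trick, translated from T-SQL and run via Python's stdlib sqlite3 (assuming SQLite >= 3.25 for window functions); the nested SUM(SUM(...)) is restructured into an inner GROUP BY plus an outer window, which expresses the same computation:

```python
import sqlite3

# A food's first-ever row per store gets ROW_NUMBER() = 1; cumulatively
# summing those flags per store yields the running distinct count.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (store INT, day TEXT, food TEXT, quantity INT);
INSERT INTO test VALUES
  (101,'2021-01-01','rice',1),(101,'2021-01-01','rice',1),
  (101,'2021-01-01','rice',2),(101,'2021-01-01','fruit',2),
  (101,'2021-01-01','water',3),(101,'2021-01-01','fruit',1),
  (101,'2021-01-01','salt',2),(101,'2021-01-02','rice',1),
  (101,'2021-01-02','rice',2),(101,'2021-01-02','fruit',1),
  (101,'2021-01-02','pepper',4);
""")
rows = conn.execute("""
SELECT store, day, uniques,
       SUM(firsts) OVER (PARTITION BY store ORDER BY day) AS run_uniques,
       total,
       SUM(total)  OVER (PARTITION BY store ORDER BY day) AS run_total
FROM (
  SELECT store, day,
         COUNT(DISTINCT food) AS uniques,
         SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) AS firsts,  -- first sighting of each food
         SUM(quantity) AS total
  FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY store, food ORDER BY day) AS rn
        FROM test)
  GROUP BY store, day
)
ORDER BY store, day
""").fetchall()
print(rows)  # uniques/run_uniques: day 1 -> 4/4, day 2 -> 3/5 (pepper is new)
```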

SQL last status change date

I am trying to get the dates of the last status changes. In brief, I want to query the minimum DATE value of the latest STATUS (ordered by CHANGE_NO) for each PRODUCT_ID.
So far, I could get only the latest dates for each product.
SELECT
*
FROM
(
SELECT
PRODUCT_ID, CHANGE_NO, STATUS, DATE
,MAX(CHANGE_NO) OVER(PARTITION BY PRODUCT_ID) MAX_CHANGE_NO
FROM TABLE
ORDER BY PRODUCT_ID, CHANGE_NO
)
WHERE MAX_CHANGE_NO = CHANGE_NO
Please kindly share the link if there is already a question/answer for a similar case; I've searched but couldn't find any.
Note: I am using Oracle SQL.
Thanks in advance.
Here's one way to do this with analytic functions (avoiding joins).
with
test_data ( product_id, change_no, status, dt ) as (
select 1, 1, 'A', date '2016-10-10' from dual union all
select 1, 2, 'B', date '2016-10-11' from dual union all
select 1, 3, 'C', date '2016-10-12' from dual union all
select 1, 4, 'D', date '2016-10-13' from dual union all
select 2, 1, 'Y', date '2016-02-02' from dual union all
select 2, 2, 'X', date '2016-02-03' from dual union all
select 2, 3, 'X', date '2016-02-04' from dual union all
select 3, 1, 'H', date '2016-06-20' from dual union all
select 3, 2, 'G', date '2016-06-21' from dual union all
select 3, 3, 'T', date '2016-06-22' from dual union all
select 3, 4, 'K', date '2016-06-23' from dual union all
select 3, 5, 'K', date '2016-06-24' from dual union all
select 3, 6, 'K', date '2016-06-25' from dual
)
-- End of test data (not part of the solution). SQL query begins below this line.
select product_id,
max(status) keep (dense_rank last order by change_no) as status,
max(dt) as dt
from (
select product_id, change_no, status, dt,
case when lead(status) over (partition by product_id
order by change_no desc)
= status then 0 else 1 end as flag
from test_data
)
where flag = 1
group by product_id
order by product_id -- if needed
;
Output
PRODUCT_ID STATUS DT
---------- ------ ----------
1 D 13/10/2016
2 X 03/02/2016
3 K 23/06/2016
SELECT * FROM (
SELECT PRODUCT_ID, CHANGE_NO, STATUS,DATE, MIN(DATE) OVER(PARTITION BY PRODUCT_ID,STATUS) as MIN_DATE_OF_LATEST_STATUS
FROM (SELECT PRODUCT_ID, CHANGE_NO, STATUS, DATE
,FIRST_VALUE(STATUS) OVER(PARTITION BY PRODUCT_ID ORDER BY CHANGE_NO DESC) LATEST_STATUS
FROM TABLE
) T
WHERE STATUS = LATEST_STATUS
) T
WHERE DATE = MIN_DATE_OF_LATEST_STATUS
Use the FIRST_VALUE window function to get the latest status for each product_id
Get the MIN date for those status rows
Finally get those rows where min_date = date
If change_no isn't needed in the final result, the query can be simplified to
SELECT PRODUCT_ID, STATUS, MIN(DATE) as MIN_DATE_OF_LATEST_STATUS
FROM (SELECT PRODUCT_ID, CHANGE_NO, STATUS, DATE
,FIRST_VALUE(STATUS) OVER(PARTITION BY PRODUCT_ID ORDER BY CHANGE_NO DESC) LATEST_STATUS
FROM TABLE
) T
WHERE STATUS = LATEST_STATUS
GROUP BY PRODUCT_ID, STATUS
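The "start of the last run" idea can also be sketched in SQLite via Python's stdlib sqlite3 (assuming SQLite >= 3.25): LAG(...) ordered ascending plays the role of the LEAD(...) over the descending order used in the first answer, and a ROW_NUMBER() replaces Oracle's KEEP (DENSE_RANK LAST). The table name status_log here is a stand-in for the question's table.

```python
import sqlite3

# Flag rows whose status differs from the previous row (run starts),
# then keep the latest run start per product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status_log (product_id INT, change_no INT, status TEXT, dt TEXT);
INSERT INTO status_log VALUES
  (1,1,'A','2016-10-10'),(1,2,'B','2016-10-11'),
  (1,3,'C','2016-10-12'),(1,4,'D','2016-10-13'),
  (2,1,'Y','2016-02-02'),(2,2,'X','2016-02-03'),(2,3,'X','2016-02-04'),
  (3,1,'H','2016-06-20'),(3,2,'G','2016-06-21'),(3,3,'T','2016-06-22'),
  (3,4,'K','2016-06-23'),(3,5,'K','2016-06-24'),(3,6,'K','2016-06-25');
""")
rows = conn.execute("""
WITH marked AS (
  SELECT product_id, change_no, status, dt,
         CASE WHEN LAG(status) OVER (PARTITION BY product_id ORDER BY change_no)
                   = status THEN 0 ELSE 1 END AS run_start
  FROM status_log
),
starts AS (
  SELECT product_id, status, dt,
         ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY change_no DESC) AS rn
  FROM marked
  WHERE run_start = 1
)
SELECT product_id, status, dt FROM starts WHERE rn = 1 ORDER BY product_id
""").fetchall()
print(rows)
```

Unlike the simplified FIRST_VALUE variant, this stays correct even when the latest status also occurred in an earlier, non-consecutive run.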

SQL return consecutive records

A simple table:
ForumPost
--------------
ID (int PK)
UserID (int FK)
Date (datetime)
What I'm looking to return is how many times a particular user has made at least 1 post a day for n consecutive days.
Example:
User 15844 has posted at least 1 post a day for 30 consecutive days 10 times
I've tagged this question with linq/lambda as well, as a solution there would also be great. I know I can solve this by iterating all the users' records, but this is slow.
There is a handy trick using ROW_NUMBER() to find consecutive entries. Imagine the following set of dates, with their row_number (starting at 0):
Date RowNumber
20130401 0
20130402 1
20130403 2
20130404 3
20130406 4
20130407 5
For consecutive entries, if you subtract the row_number from the date you get the same result for every row in a run. e.g.
Date RowNumber date - row_number
20130401 0 20130401
20130402 1 20130401
20130403 2 20130401
20130404 3 20130401
20130406 4 20130402
20130407 5 20130402
You can then group by date - row_number to get the sets of consecutive days (i.e. the first 4 records, and the last 2 records).
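The arithmetic behind that table can be sketched in a few lines of plain Python (dates are real date objects, so subtracting a row-number's worth of days is exact):

```python
from datetime import date, timedelta
from itertools import groupby

# Subtracting each date's 0-based position from the date itself yields
# a constant key per consecutive run; grouping on that key gives the
# run lengths.
dates = [date(2013, 4, 1), date(2013, 4, 2), date(2013, 4, 3),
         date(2013, 4, 4), date(2013, 4, 6), date(2013, 4, 7)]
keys = [d - timedelta(days=i) for i, d in enumerate(dates)]
runs = [len(list(g)) for _, g in groupby(keys)]
print(keys)
print(runs)  # lengths of the consecutive runs: [4, 2]
```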
To apply this to your example you would use:
WITH Posts AS
( SELECT FirstPost = DATEADD(DAY, 1 - ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY [Date]), [Date]),
UserID,
Date
FROM ( SELECT DISTINCT UserID, [Date] = CAST(Date AS [Date])
FROM ForumPost
) fp
), Posts2 AS
( SELECT FirstPost,
UserID,
Days = COUNT(*),
LastDate = MAX(Date)
FROM Posts
GROUP BY FirstPost, UserID
)
SELECT UserID, ConsecutiveDates = MAX(Days)
FROM Posts2
GROUP BY UserID;
Example on SQL Fiddle (simple with just most consecutive days per user)
Further example to show how to get all consecutive periods
EDIT
I don't think the above quite answered the question, this will give the number of times a user has posted on, or over n consecutive days:
WITH Posts AS
( SELECT FirstPost = DATEADD(DAY, 1 - ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY [Date]), [Date]),
UserID,
Date
FROM ( SELECT DISTINCT UserID, [Date] = CAST(Date AS [Date])
FROM ForumPost
) fp
), Posts2 AS
( SELECT FirstPost,
UserID,
Days = COUNT(*),
FirstDate = MIN(Date),
LastDate = MAX(Date)
FROM Posts
GROUP BY FirstPost, UserID
)
SELECT UserID, [Times Over N Days] = COUNT(*)
FROM Posts2
WHERE Days >= 30
GROUP BY UserID;
Example on SQL Fiddle
Your particular application makes this pretty simple, I think. If you have 'n' distinct dates in an 'n'-day interval, those 'n' distinct dates must be consecutive.
Scroll to the bottom for a general solution that requires only common table expressions and changing to PostgreSQL. (Kidding. I implemented in PostgreSQL, because I'm short of time.)
create table ForumPost (
ID integer primary key,
UserID integer not null,
post_date date not null
);
insert into forumpost values
(1, 1, '2013-01-15'),
(2, 1, '2013-01-16'),
(3, 1, '2013-01-17'),
(4, 1, '2013-01-18'),
(5, 1, '2013-01-19'),
(6, 1, '2013-01-20'),
(7, 1, '2013-01-21'),
(11, 2, '2013-01-15'),
(12, 2, '2013-01-16'),
(13, 2, '2013-01-17'),
(16, 2, '2013-01-17'),
(14, 2, '2013-01-18'),
(15, 2, '2013-01-19'),
(21, 3, '2013-01-17'),
(22, 3, '2013-01-17'),
(23, 3, '2013-01-17'),
(24, 3, '2013-01-17'),
(25, 3, '2013-01-17'),
(26, 3, '2013-01-17'),
(27, 3, '2013-01-17');
Now, let's look at the output of this query. For brevity, I'm looking at 5-day intervals, not 30-day intervals.
select userid, count(distinct post_date) distinct_dates
from forumpost
where post_date between '2013-01-15' and '2013-01-19'
group by userid;
USERID DISTINCT_DATES
1 5
2 5
3 1
For users that fit the criteria, the number of distinct dates in that 5-day interval will have to be 5, right? So we just need to add that logic to a HAVING clause.
select userid, count(distinct post_date) distinct_dates
from forumpost
where post_date between '2013-01-15' and '2013-01-19'
group by userid
having count(distinct post_date) = 5;
USERID DISTINCT_DATES
1 5
2 5
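To see the HAVING filter behave as described, here is the same query run on a trimmed copy of the sample rows via Python's stdlib sqlite3 (user 2's duplicate-day post and user 3's single-day posts are kept so the distinct counts differ):

```python
import sqlite3

# 5 distinct dates inside a 5-day window can only happen if the user
# posted every day, so HAVING COUNT(DISTINCT ...) = 5 keeps exactly
# the users with an unbroken streak in that window.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE forumpost (id INT, userid INT, post_date TEXT);
INSERT INTO forumpost VALUES
  (1,1,'2013-01-15'),(2,1,'2013-01-16'),(3,1,'2013-01-17'),
  (4,1,'2013-01-18'),(5,1,'2013-01-19'),
  (11,2,'2013-01-15'),(12,2,'2013-01-16'),(13,2,'2013-01-17'),
  (16,2,'2013-01-17'),(14,2,'2013-01-18'),(15,2,'2013-01-19'),
  (21,3,'2013-01-17'),(22,3,'2013-01-17'),(23,3,'2013-01-17');
""")
rows = conn.execute("""
SELECT userid, COUNT(DISTINCT post_date) AS distinct_dates
FROM forumpost
WHERE post_date BETWEEN '2013-01-15' AND '2013-01-19'
GROUP BY userid
HAVING COUNT(DISTINCT post_date) = 5
ORDER BY userid
""").fetchall()
print(rows)
```

User 3 posts three times on a single day and is filtered out, while the duplicate for user 2 on 2013-01-17 is absorbed by the DISTINCT.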
A more general solution
It doesn't really make sense to say that, if you post every day from 2013-01-01 to 2013-01-31, you've posted 30 consecutive days 2 times. Instead, I'd expect the clock to start over on 2013-01-31. My apologies for implementing in PostgreSQL; I'll try to implement in T-SQL later.
with first_posts as (
select userid, min(post_date) first_post_date
from forumpost
group by userid
),
period_intervals as (
select userid, first_post_date period_start,
(first_post_date + interval '4' day)::date period_end
from first_posts
), user_specific_intervals as (
select
userid,
(period_start + (n || ' days')::interval)::date as period_start,
(period_end + (n || ' days')::interval)::date as period_end
from period_intervals, generate_series(0, 30, 5) n
)
select userid, period_start, period_end,
(select count(distinct post_date)
from forumpost
where forumpost.post_date between period_start and period_end
and userid = forumpost.userid) distinct_dates
from user_specific_intervals
order by userid, period_start;