Most elegant way to eliminate equal but opposite data using SQL

I have a relatively simple set of data that looks like this:
invoice_id  created_at  amount_in_cents  user_id
22348       2019-11-07    550            31773927
22349       2019-11-08   -550            31773927
22498       2019-11-10  -3400            2389483
22499       2019-11-10   3400            2389483
22500       2019-11-11  18000            93842938
As you can see, the first two rows of the sample data are attributed to the same user_id, but are of inverse amounts (add up to 0). Same with rows 3 and 4. I want to remove all invoices where there is an inverse invoice for the same user, within 30 days of each other, leaving just the fifth row.
I could do this with Python, but it would add a lot to the process. Is there a simple way to do this with SQL?

You could use not exists with a correlated subquery:
select t.*
from mytable t
where not exists (
    select 1
    from mytable t1
    where t1.user_id = t.user_id
      and greatest(t1.created_at, t.created_at)
            <= least(t1.created_at, t.created_at) + interval '30 days'
      and t1.amount_in_cents = -t.amount_in_cents
)
The not exists condition ensures that no other record exists for the same user and with an opposite amount within 30 days.
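If the goal is to actually remove the rows rather than filter them in a query, the same condition can be inverted inside a DELETE. A minimal sketch, assuming Postgres and a table named invoices (the question doesn't give the table name):
-- A sketch only: delete every row that has an offsetting invoice
-- for the same user within 30 days (the inverse of the filter above).
delete from invoices t
where exists (
    select 1
    from invoices t1
    where t1.invoice_id <> t.invoice_id  -- guard: a 0-amount row would otherwise match itself
      and t1.user_id = t.user_id
      and t1.amount_in_cents = -t.amount_in_cents
      and greatest(t1.created_at, t.created_at)
            <= least(t1.created_at, t.created_at) + interval '30 days'
);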

I don't think there is a simple solution to this problem. If you wanted to remove all matching pairs, then you could enumerate and remove:
select min(invoice_id), min(created_at), user_id, max(amount_in_cents) as amount_in_cents
from (select t.*,
             row_number() over (partition by user_id, amount_in_cents order by created_at) as seqnum
      from t
     ) t
group by abs(amount_in_cents), user_id, seqnum
having count(*) = 1;  -- only one "matching" amount
However, the limitation on 30 days is challenging and I think you might need a recursive CTE for it.
Consider the following data:
user_id  date    amount
1        jan 1    500
1        jan 15   500
1        feb 1   -500
1        feb 10  -500
What result would you want?

Related

Aggregating two values in same select statement. Second aggregation is decreasing in value for each row for some reason

I'm currently trying to aggregate two values simultaneously in one select statement; however, the second aggregated value is decreasing for some reason. I know what I'm doing is wrong, but I don't understand why it's wrong (assuming it's the very last code block). Mainly just trying to better understand what's going on, and why it's happening.
I already have a corrected query that works (at the bottom).
Note: Query and outputs are simplified, please ignore any syntax issues. Additionally, in real query, I need to keep subscription_start_date field in until the end.
Query with issue (very last block):
WITH max_product_user_count AS (
    -- The total count is obtained when "days" = 0
    SELECT
        subscription_start_date,
        datediff('days', subscription_start_date, subscription_date) AS days,
        product,
        num_users AS total_user_count
    FROM users
    WHERE days = 0
),
daily_product_user_count AS (
    -- As "days" goes up, the number of subscribers for each start date/product type decreases
    SELECT
        subscription_start_date,
        datediff('days', subscription_start_date, subscription_date) AS days,
        product,
        num_users AS daily_user_count
    FROM users
    WHERE days IN (0, 5, 14, 21, 30, 33, 60)
)
-- Trying to aggregate by product and day, across all subscription start dates
SELECT
    d.product,
    d.days,
    SUM(daily_user_count) AS daily_count,
    SUM(total_user_count) AS total_count
FROM daily_product_user_count d
INNER JOIN max_product_user_count m
    ON d.subscription_start_date = m.subscription_start_date
    AND d.product = m.product
GROUP BY 1, 2
ORDER BY 1, 2
Current Output:
PRODUCT    DAYS  DAILY_COUNT  TOTAL_COUNT
product_1     0        10000        10000
product_1     5        99231        99781
product_1    14        96124        98123
product_1    21        85123        96441
product_1    30        23412        94142
product_1    33        12931        92111
product_1    60        10231        90123
Expected Output:
PRODUCT    DAYS  DAILY_COUNT  TOTAL_COUNT
product_1     0        10000        10000
product_1     5        99231        10000
product_1    14        96124        10000
product_1    21        85123        10000
product_1    30        23412        10000
product_1    33        12931        10000
product_1    60        10231        10000
Updated correct query:
WITH max_product_user_count AS (
    SELECT
        subscription_start_date,
        datediff('days', subscription_start_date, subscription_date) AS days,
        product,
        num_users AS total_user_count
    FROM users
    WHERE days = 0
),
max_user_count_aggregation AS (
    SELECT
        product,
        SUM(total_user_count) AS total_count
    FROM max_product_user_count
    GROUP BY 1
),
daily_product_user_count AS (
    SELECT
        subscription_start_date,
        datediff('days', subscription_start_date, subscription_date) AS days,
        product,
        num_users AS daily_user_count
    FROM users
    WHERE days IN (0, 5, 14, 21, 30, 33, 60)
),
daily_user_count_aggregation AS (
    SELECT
        product,
        days,
        SUM(daily_user_count) AS daily_count
    FROM daily_product_user_count
    GROUP BY 1, 2
)
SELECT
    d.product,
    d.days,
    daily_count,
    total_count
FROM daily_user_count_aggregation d
INNER JOIN max_user_count_aggregation m ON d.product = m.product
ORDER BY 1, 2
If I understand what you are trying to do, the query is way more complicated than necessary. I think this does what you want:
SELECT datediff('days', subscription_start_date, subscription_date) AS days,
       product,
       SUM(num_users) FILTER (WHERE days IN (0, 5, 14, 21, 30, 33, 60)) AS daily_user_count,
       SUM(num_users) FILTER (WHERE days = 0) AS total_user_count
FROM users
GROUP BY days, product;
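Note that not every engine supports the FILTER clause (Postgres does; Snowflake and Redshift, to my knowledge, do not). If yours doesn't, a sketch of the equivalent CASE-based conditional aggregation, assuming the same users table:
-- A sketch for engines without FILTER: conditional aggregation with CASE.
SELECT datediff('days', subscription_start_date, subscription_date) AS days,
       product,
       SUM(CASE WHEN datediff('days', subscription_start_date, subscription_date)
                     IN (0, 5, 14, 21, 30, 33, 60)
                THEN num_users END) AS daily_user_count,
       SUM(CASE WHEN datediff('days', subscription_start_date, subscription_date) = 0
                THEN num_users END) AS total_user_count
FROM users
GROUP BY days, product;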
I would advise you to ask a new question, explaining the logic you want to implement and providing reasonable sample data and desired results.

How to write a SQL query to find the first time when sum greater than a number?

I have a postgresql table:
create table orders
(
    id int,
    cost int,
    time timestamp
);
How to write a PostgreSQL query to find the first time when sum(cost) is greater than 200?
For example:
id  cost  time
------------------
1   120   2019-10-10
2    50   2019-11-11
3    80   2019-12-12
4    60   2019-12-16
The first time sum(cost) is greater than 200 is 2019-12-12.
This is a variation of Nick's answer (which would be correct with an ORDER BY). However, this version is more efficient:
select d.*
from (select o.*,
             sum(o.cost) over (order by o.time) as running_cost
      from orders o
     ) d
where running_cost - cost < 200 and
      running_cost >= 200;
Note that this does not require an order by in the outer query to work correctly.
There is also almost a way to solve this without using a subquery:
select o.*
from orders o
order by (sum(cost) over (order by time) >= 200) desc,
         time asc
limit 1;
The only issue is that this will return a row even if no row matches the condition. You could get around that by using a subquery in the limit:
limit (case when (select sum(cost) from orders) >= 200 then 1 else 0 end)
But then a subquery would be needed.
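Putting the pieces together, a sketch of the combined query (Postgres accepts a scalar subquery as the LIMIT expression):
-- A sketch: LIMIT is 0 when the total never reaches the threshold, 1 otherwise.
-- This relies on costs being non-negative, so some running sum reaches 200
-- exactly when the grand total does.
select o.*
from orders o
order by (sum(cost) over (order by time) >= 200) desc,
         time asc
limit (case when (select sum(cost) from orders) >= 200 then 1 else 0 end);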
For PostgreSQL, you can get this result by using a CTE to calculate the SUM of cost for rows up to and including the current one, and then selecting the first row which has total cost >= 200:
WITH CTE AS (
    SELECT time,
           SUM(cost) OVER (ORDER BY time
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total
    FROM data
)
SELECT *
FROM CTE
WHERE total >= 200
ORDER BY total
LIMIT 1
Output:
time        total
2019-12-12  250
Demo on SQLFiddle

Need to count unique transactions by month but ignore records that occur 3 days after 1st entry for that ID

I have a table with just two columns: User_ID and fail_date. Each time somebody's card is rejected, they are logged in the table; their card is automatically tried again 3 days later, and if it fails again, another entry is added to the table. I am trying to write a query that counts unique failures by month, so I only want to count the first entry, not the 3-day retries, if they exist. My data set looks like this:
user_id  fail_date
222      01/01
222      01/04
555      02/15
777      03/31
777      04/02
222      10/11
so my desired output would be something like this:
month  unique_fails
jan    1
feb    1
march  1
april  0
oct    1
I'll be running this in Vertica, but I'm not so much looking for perfect syntax in replies, just help on how to approach this problem, as I can't really think of a way to make it work. Thanks!
You could use lag() to get the previous timestamp per user. If the current and the previous timestamp are less than or exactly three days apart, it's a follow up. Mark the row as such. Then you can filter to exclude the follow ups.
It might look something like:
SELECT month,
       count(*) unique_fails
FROM (SELECT month(fail_date) month,
             CASE
               WHEN datediff(day,
                             lag(fail_date) OVER (PARTITION BY user_id
                                                  ORDER BY fail_date),
                             fail_date) <= 3 THEN
                 1
               ELSE
                 0
             END follow_up
      FROM elbat) x
WHERE follow_up = 0
GROUP BY month;
I'm not so sure about the exact syntax in Vertica, so it might need some adaptations. I also don't know if fail_date actually is some date/time type or just a string. If it's just a string, the date/time-specific functions may not work on it and would have to be replaced, or the string would have to be converted prior to passing it to them.
If the data spans several years, you might also want to include the year in addition to the month, to keep months from different years apart: in the inner SELECT, add a column year(fail_date) year, and add year to the column list and the GROUP BY of the outer SELECT, as sketched below.
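A sketch of that year-aware variant, under the same assumptions (table elbat, Vertica-style functions) as the query above:
SELECT year,
       month,
       count(*) unique_fails
FROM (SELECT year(fail_date) year,
             month(fail_date) month,
             CASE
               WHEN datediff(day,
                             lag(fail_date) OVER (PARTITION BY user_id
                                                  ORDER BY fail_date),
                             fail_date) <= 3 THEN 1
               ELSE 0
             END follow_up
      FROM elbat) x
WHERE follow_up = 0
GROUP BY year, month;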
You can add a flag about whether this is a "unique_fail" by doing:
select t.*,
       (case when lag(fail_date) over (partition by user_id order by fail_date) > fail_date - 3
             then 0 else 1
        end) as first_failure_flag
from t;
Then, you want to count this flag by month:
select to_char(fail_date, 'Mon'),  -- should always include the year
       sum(first_failure_flag)
from (select t.*,
             (case when lag(fail_date) over (partition by user_id order by fail_date) > fail_date - 3
                   then 0 else 1
              end) as first_failure_flag
      from t
     ) t
group by to_char(fail_date, 'Mon')
order by min(fail_date)
In a derived table, determine the previous fail_date (prev_fail_date) for each user_id and fail_date using a correlated subquery.
Using the derived table dt, count a failure only if the number of days between the current fail_date and prev_fail_date is greater than 3.
The DateDiff() function, alongside the If() function, is used to determine the cases which are not repeated tries.
To group this result by month, you can use the MONTH() function.
But the data can span multiple years, so you need to separate them out by year as well, using a multi-level GROUP BY with the YEAR() function.
Try the following (in MySQL) - you can get idea for other RDBMS as well:
SELECT YEAR(dt.fail_date) AS year_fail_date,
       MONTH(dt.fail_date) AS month_fail_date,
       -- count first-ever failures (no previous attempt) as well as
       -- failures more than 3 days after the previous one
       COUNT( IF(dt.prev_fail_date IS NULL
                 OR DATEDIFF(dt.fail_date, dt.prev_fail_date) > 3,
                 user_id, NULL) ) AS unique_fails
FROM (
    SELECT t1.user_id,
           t1.fail_date,
           (SELECT t2.fail_date
            FROM your_table AS t2
            WHERE t2.user_id = t1.user_id
              AND t2.fail_date < t1.fail_date
            ORDER BY t2.fail_date DESC
            LIMIT 1) AS prev_fail_date
    FROM your_table AS t1
) AS dt
GROUP BY year_fail_date,
         month_fail_date
ORDER BY year_fail_date ASC,
         month_fail_date ASC

How to find the number of purchases over time intervals SQL

I'm using Redshift (Postgres-based) and Pandas to do my work. I'm trying to get the number of user actions, let's say purchases, to make it easier to understand. I have a table, purchases, that holds the following data:
user_id  timestamp   price
1        2015-02-01  200
1        2015-02-02   50
1        2015-02-10   75
Ultimately I would like the number of purchases over certain time intervals, such as:
userid, 28-14_days, 14-7_days, 7
Here is what I have so far; I'm aware I don't have an upper limit on the dates:
SELECT DISTINCT x_days.user_id, SUM(x_days.purchases) AS x_num, SUM(y_days.purchases) AS y_num,
       x_days.x_date, y_days.y_date
FROM (
    SELECT purchases.user_id, COUNT(purchases.user_id) AS purchases,
           DATE(purchases.timestamp) AS x_date
    FROM purchases
    WHERE purchases.timestamp > (current_date - INTERVAL '%(x_days_ago)s day')
      AND purchases.max_value > 200
    GROUP BY DATE(purchases.timestamp), purchases.user_id
) AS x_days
JOIN (
    SELECT purchases.user_id, COUNT(purchases.user_id) AS purchases,
           DATE(purchases.timestamp) AS y_date
    FROM purchases
    WHERE purchases.timestamp > (current_date - INTERVAL '%(y_days_ago)s day')
      AND purchases.max_value > 200
    GROUP BY DATE(purchases.timestamp), purchases.user_id
) AS y_days
ON x_days.user_id = y_days.user_id
GROUP BY x_days.user_id, x_days.x_date, y_days.y_date
params={'x_days_ago':x_days_ago, 'y_days_ago':y_days_ago}
where these are set in python/pandas
x_days_ago = 14
y_days_ago = 7
But this didn't work out exactly as planned:
   user_id  x_num  y_num  x_date      y_date
0  5451772      1      1  2015-02-10  2015-02-10
1  5026678      1      1  2015-02-09  2015-02-09
2  6337993      2      1  2015-02-14  2015-02-13
3  6204432      1      3  2015-02-10  2015-02-11
4  3417539      1      1  2015-02-11  2015-02-11
Even though I don't have an upper date to look between (so x is effectively searching from 14 days ago to now and y from 7 days ago to now, meaning they overlap), in some cases y is higher.
Can anyone help me either fix this or give me a better way?
Thanks!
It might not be the most efficient answer, but you can generate each sum with a sub-select:
WITH summed AS (
    SELECT user_id, day, COUNT(1) AS purchases
    FROM (SELECT user_id, DATE(timestamp) AS day FROM purchases) AS _
    GROUP BY user_id, day
),
users AS (SELECT DISTINCT user_id FROM purchases)
SELECT user_id,
       (SELECT SUM(purchases) FROM summed
        WHERE summed.user_id = users.user_id
          AND day >= DATE(NOW() - interval '7 days')) AS days_7,
       (SELECT SUM(purchases) FROM summed
        WHERE summed.user_id = users.user_id
          AND day >= DATE(NOW() - interval '14 days')) AS days_14
FROM users;
(This was tested in Postgres, not in Redshift; but the Redshift documentation suggests that both WITH and DISTINCT are supported.) I would have liked to do this with a window, to obtain rolling sums; but it's a little onerous without generate_series().
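For what it's worth, both counts can also be had in a single pass with conditional aggregation, which avoids the per-user subselects. A sketch, assuming the same purchases table and column names as in the question:
-- A sketch using CASE-based conditional aggregation
-- (should work in both Postgres and Redshift):
SELECT p.user_id,
       SUM(CASE WHEN p.timestamp >= current_date - interval '7 day'  THEN 1 ELSE 0 END) AS days_7,
       SUM(CASE WHEN p.timestamp >= current_date - interval '14 day' THEN 1 ELSE 0 END) AS days_14
FROM purchases p
GROUP BY p.user_id;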

query to display additional column based on aggregate value

I've been mulling over this problem for a couple of hours now with no luck, so I thought people on SO might be able to help :)
I have a table with data regarding processing volumes at stores. The first three columns shown below can be queried from that table. What I'm trying to do is add a fourth column that's basically a flag indicating whether a store has processed >= $150 and, if so, displaying the corresponding date. The way this works is that the first instance where the store has surpassed $150 is the date that gets displayed. Subsequent processing volumes don't count after the first instance the activated date is hit. For example, for store 4, there's just one instance of the activated date.
store_id  sales_volume  date        activated_date
--------------------------------------------------
2         5             03/14/2012
2         125           05/21/2012
2         30            11/01/2012  11/01/2012
3         100           02/06/2012
3         140           12/22/2012  12/22/2012
4         300           10/15/2012  10/15/2012
4         450           11/25/2012
5         100           12/03/2012
Any insights as to how to build out this fourth column? Thanks in advance!
The solution starts by calculating the cumulative sales. Then, you want the activation date only when the cumulative sales first pass through the $150 level. This happens when adding the current sales amount pushes the cumulative amount over the threshold. The following case expression handles this:
select t.store_id, t.sales_volume, t.date,
       (case when 150 > cumesales - t.sales_volume and 150 <= cumesales
             then t.date
        end) as ActivationDate
from (select t.*,
             sum(sales_volume) over (partition by store_id order by date) as cumesales
      from t
     ) t
If you have an older version of Postgres that does not support cumulative sum, you can get the cumulative sales with a subquery like:
(select sum(sales_volume) from t t2 where t2.store_id = t.store_id and t2.date <= t.date) as cumesales
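For completeness, a sketch of the full query with that correlated subquery dropped in (same table t and columns as above):
select t.store_id, t.sales_volume, t.date,
       (case when 150 > cumesales - t.sales_volume and 150 <= cumesales
             then t.date
        end) as ActivationDate
from (select t.*,
             -- correlated subquery replacing the window function
             (select sum(t2.sales_volume)
              from t t2
              where t2.store_id = t.store_id
                and t2.date <= t.date) as cumesales
      from t
     ) t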
Variant 1
You can LEFT JOIN to a table that calculates the first date surpassing the 150 $ limit per store:
SELECT t.*, b.activated_date
FROM tbl t
LEFT JOIN (
    SELECT store_id, min(thedate) AS activated_date
    FROM (
        SELECT store_id, thedate,
               sum(sales_volume) OVER (PARTITION BY store_id
                                       ORDER BY thedate) AS running_sum
        FROM tbl
    ) a
    WHERE running_sum >= 150
    GROUP BY 1
) b ON t.store_id = b.store_id AND t.thedate = b.activated_date
ORDER BY t.store_id, t.thedate;
The calculation of the first day has to be done in two steps, since the window function accumulating the running sum has to be applied in a separate SELECT.
Variant 2
Another window function instead of the LEFT JOIN. May or may not be faster. Test with EXPLAIN ANALYZE.
SELECT *,
       CASE WHEN running_sum >= 150 AND thedate = first_value(thedate)
                 OVER (PARTITION BY store_id, running_sum >= 150 ORDER BY thedate)
            THEN thedate END AS activated_date
FROM (
    SELECT *,
           sum(sales_volume)
               OVER (PARTITION BY store_id ORDER BY thedate) AS running_sum
    FROM tbl
) b
ORDER BY store_id, thedate;
-> SQLfiddle demonstrating both.