Way to exclude all subsequent rows after a non-continuous order - SQL

Story:
I am looking at continuous records based on a 1-month interval. As soon as this rule is broken, any subsequent rows should be excluded from the list, even if the continuity resumes later.
Sample Data:
+----------------+---------+-------------+
| date_purchased | product | date_rebill |
+----------------+---------+-------------+
| 2019-01-01     | a       | 2019-02-01  |
| 2019-01-01     | a       | 2019-03-01  |
| 2019-01-01     | a       | 2019-04-01  |
| 2019-01-01     | a       | 2019-06-01  |
| 2019-01-01     | a       | 2019-07-01  |
| 2019-01-01     | a       | 2019-08-01  |
| 2019-02-01     | b       | 2019-05-01  |
| 2019-02-01     | b       | 2019-06-01  |
+----------------+---------+-------------+
In this example May is missing for product a, therefore the June, July and August records should be excluded.
As for product b, there should be no records at all, or at least the rebill count should be 0, because the first rebill happens more than a month after the date purchased.
Query:
I started with something like the query below. It gives me a diff of 1 for consecutive months. The issue is that I can't simply filter the data set to diff = 1, because consecutive rows can occur again after a break has happened.
select date_purchased
     , product
     , datediff(month, previous_date, date_rebill) as diff
from (
    select date_purchased
         , product
         , date_rebill
         , lag(date_rebill, 1, date_purchased)
               over (partition by product order by date_rebill asc) as previous_date
    from table
) as base
My Objective:
My objective here is to remove any future rows as soon as the "consecutiveness" rule is broken.

If I understand correctly, you can use row_number() and arithmetic:
select t.*
from (select t.*,
             row_number() over (partition by product order by date_rebill) as seqnum
      from t
     ) t
where datediff(month, date_purchased, date_rebill) = seqnum;
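
Tracing this against the sample data for product a shows why the filter works (a worked trace, not output from a real run):

| date_rebill | seqnum | datediff(month, date_purchased, date_rebill) |
|-------------|--------|----------------------------------------------|
| 2019-02-01  | 1      | 1 (kept)                                     |
| 2019-03-01  | 2      | 2 (kept)                                     |
| 2019-04-01  | 3      | 3 (kept)                                     |
| 2019-06-01  | 4      | 5 (excluded: May is missing)                 |
| 2019-07-01  | 5      | 6 (excluded)                                 |
| 2019-08-01  | 6      | 7 (excluded)                                 |

Once a month is skipped, seqnum falls permanently behind the month difference, so no later row can satisfy the equality even if the monthly rhythm resumes. For product b, the first rebill has seqnum 1 but a month difference of 3, so no b rows survive, matching the expected count of 0.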

Related

SQL: Return values if data shows up over a window of time

I have a script that runs daily to check the status of orders for different sellers. The output populates a table that looks like this, which tells me the failure status of an order:
| date       | failure                 | order | seller_id |
|------------|-------------------------|-------|-----------|
| 2021-04-01 | stuck_in_pending_status | 123   | user1     |
| 2021-04-01 | shipping_is_late        | 456   | user2     |
| 2021-04-01 | stuck_in_pending_status | 789   | user3     |
| 2021-04-02 | stuck_in_pending_status | 123   | user1     |
| 2021-04-02 | shipping_is_late        | 456   | user2     |
| 2021-04-03 | stuck_in_pending_status | 123   | user1     |
| 2021-04-04 | stuck_in_pending_status | 987   | user1     |
| 2021-04-04 | shipping_is_late        | 654   | user3     |
I can get summary stats on the overall health of the system with this query, to see how orders are failing and whether orders are piling up or there is a spike in failures on any particular date:
SELECT
    date,
    failure,
    COUNT(0)
FROM my_table
WHERE
    date >= '2021-03-01'
GROUP BY
    date,
    failure
I can also add seller_id = 'foo' to the WHERE clause to get seller-specific failures.
I would like to get a bit more granular and see the specific health of orders at the seller level, specifically whether there are issues with the same order over a period of time (say 3 days). So if the same order shows up in the failures over a 3-day period, give me the seller so that I can notify someone to look into why that is happening.
For example, with the table above, I would like the query to return user1 since order 123 has had an issue for 3 straight days.
What would be the best way to structure a query like that? Would I use a WINDOW function?
If I understand correctly, you can use a window function:
select *
from (
    select *,
           row_number() over (partition by "order", seller_id order by date desc) rn
    from mytable
    where date >= '2021-03-01'
) t
where rn = 1
Just use lag(). To get all rows that are the "third" in order, you can use:
select t.*
from (select t.*,
             lag(date, 2) over (partition by "order" order by date) as date_2
      from mytable t
     ) t
where date_2 = date - interval '2 day';
Note: date/time functions are specific to each database. This uses standard syntax; you may need to tweak it for your database.
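
For example, on SQL Server the interval arithmetic would become dateadd() (a sketch under that assumption; otherwise identical to the query above):

select t.*
from (select t.*,
             lag(date, 2) over (partition by "order" order by date) as date_2
      from mytable t
     ) t
where date_2 = dateadd(day, -2, date);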

Adding indicator column to table based on having two consecutive days within group

I need to add logic that flags the first of two consecutive days as 1 and the second day as 0, grouped by a column (test). If a test (a) has three consecutive days, then the third should start with 1 again, etc.
An example table would be like the following, with new_col being the column I need.
| test | test_date | new_col |
|------|-----------|---------|
| a    | 1/1/2020  | 1       |
| a    | 1/2/2020  | 0       |
| a    | 1/3/2020  | 1       |
| b    | 1/1/2020  | 1       |
| b    | 1/2/2020  | 0       |
| b    | 1/15/2020 | 1       |
This seems to be a gaps-and-islands problem, and I assume some window function approach should get me there.
I tried something like the following to get the consecutive part, but I am struggling with the indicator column.
select
    test,
    test_date,
    grp_var = dateadd(day,
        -row_number() over (partition by test order by test_date), test_date)
from
    my_table
This does read as a gaps-and-islands problem. I would recommend using the difference between row_number() and the date to generate the groups, and then arithmetic:
select
    test,
    test_date,
    row_number() over(
        partition by test, dateadd(day, -rn, test_date)
        order by test_date
    ) % 2 new_col
from (
    select
        t.*,
        row_number() over(partition by test order by test_date) rn
    from mytable t
) t
Demo on DB Fiddle:
test | test_date  | new_col
:--- | :--------- | ------:
a    | 2020-01-01 |       1
a    | 2020-01-02 |       0
a    | 2020-01-03 |       1
b    | 2020-01-01 |       1
b    | 2020-01-02 |       0
b    | 2020-01-15 |       1
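
The trick is that dateadd(day, -rn, test_date) is constant within a run of consecutive days, so it serves as an island identifier. A worked trace of the inner query (not actual output):

test | test_date  | rn | dateadd(day, -rn, test_date)
a    | 2020-01-01 | 1  | 2019-12-31
a    | 2020-01-02 | 2  | 2019-12-31
a    | 2020-01-03 | 3  | 2019-12-31
b    | 2020-01-01 | 1  | 2019-12-31
b    | 2020-01-02 | 2  | 2019-12-31
b    | 2020-01-15 | 3  | 2020-01-12

Within each (test, island) partition, row_number() % 2 then alternates 1, 0, 1, ..., which is exactly the requested flag.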

Dealing with required duplicates in table records

Here's the situation. My team forecasts sales and revenue numbers at a monthly resolution, but would like all reporting to be at a daily resolution. So what I am doing is ingesting these numbers, dividing the monthly targets by the number of days in the month, and saving the result in a table.
So I start off with something like this:
| date    | forecasted_units | forecasted_revenue |
|---------|------------------|--------------------|
| 2020-01 | 372              | 9300               |
| 2020-02 | 435              | 9280               |
...
My target table now looks like this:
| date       | forecasted_units | forecasted_revenue |
|------------|------------------|--------------------|
| 2020-01-01 | 12               | 300                |
| 2020-01-02 | 12               | 300                |
| 2020-01-03 | 12               | 300                |
| ...        | ...              | ...                |
| 2020-02-01 | 15               | 320                |
| 2020-02-02 | 15               | 320                |
| 2020-02-03 | 15               | 320                |
...
Now, my actual table is quite a lot wider than the one above and full of duplicated records. As you can see, there is a lot of data redundancy. My question is: is there a more efficient way to store the same resolution of data in one table?
My immediate thought is to reshape the table to include a start date and end date to look like this:
| start_date | end_date   | forecasted_units | forecasted_revenue |
|------------|------------|------------------|--------------------|
| 2020-01-01 | 2020-01-31 | 12               | 300                |
| 2020-02-01 | 2020-02-29 | 15               | 320                |
But that would offload all the computation to the instance generating all the reports because it would have to generate the data for each day in between the start and end date.
Is there a better way to do this?
Unfortunately, Redshift does not support the handy Postgres function generate_series(), which would have greatly simplified the task here.
Typical alternative solutions would involve a calendar table - basically, a table that enumerates all possible dates. If you have a table with a sufficient number of rows, you can generate such a dataset on the fly with row_number() and dateadd():
select dateadd(day, row_number() over(order by 1) - 1, '2020-01-01') dt
from my_large_table;
You can store the results in another table (using the create table ... as select ... syntax), or use the query result directly. In both cases, you would then join it with your actual table. To count the number of days in the month, we use a window count:
select
    d.dt,
    t.forecasted_units / count(*) over(partition by t.date) forecasted_units,
    t.forecasted_revenue / count(*) over(partition by t.date) forecasted_revenue
from (
    select dateadd(day, row_number() over(order by 1) - 1, '2020-01-01') dt
    from my_large_table
) d
inner join mytable t on t.date = date_trunc('month', d.dt)
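
If you would rather materialize the calendar once instead of rebuilding it in every query, the create table ... as select form mentioned above could look like this (a sketch; the table name calendar and the start date are assumptions):

create table calendar as
select dateadd(day, row_number() over(order by 1) - 1, '2020-01-01') dt
from my_large_table;

The derived table d in the query above can then be replaced with a plain join against calendar.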

SQLite: generating customer counts for a date range (months) using a normalized table

I have a sales funnel dataset in SQLite and each row represents a movement through the funnel. As there are quite a few ways a potential customer can move through the funnel (and possibly even go backwards), I wasn't planning on flattening/denormalizing the table. How could I calculate "the number of customers per month up to today"?
customer | opp_value | status_old  | status_new | current_status | status_change_date | current_lead_status | lead_created_date
cust_8   | 22        | confirmed   | paying     | paying         | 2020-01-01         | Customer            | 2020-01-01
cust_9   | 23        | confirmed   | paying     | churned        | 2020-01-03         | Customer            | 2020-01-02
cust_9   | 23        | paying      | churned    | churned        | 2020-03-24         | Customer            | 2020-02-25
cust_13  | 30        | negotiation | lost       | paying         | 2020-04-03         | Lost                | 2020-03-20
cust_14  | 45        | qualified   | confirmed  | paying         | 2020-03-03         | Customer            | 2020-02-28
cust_14  | 45        | confirmed   | paying     | paying         | 2020-04-03         | Customer            | 2020-02-28
...      | ...       | ...         | ...        | ...            | ...                | ...                 | ...
We're assuming we use end-of-month as the definition of whether a customer is still with us.
The result, with the above data should be:
month    | customers
Jan-2020 | 2 (cust_8, cust_9)
Feb-2020 | 2 (cust_8, cust_9)
Mar-2020 | 1 (cust_8)           # cust_9 churned
Apr-2020 | 2 (cust_8, cust_14)
May-2020 | 2 (cust_8, cust_14)
The part I'd really like to understand is how to create the month column, as I can't rely on the dates in status_change_date since there might be missing records. Would one have to generate that column manually? I know I can generate dates manually using:
WITH RECURSIVE cnt (x) AS (
    SELECT 0
    UNION ALL
    SELECT x + 1
    FROM cnt
    LIMIT (
        SELECT ROUND(((julianday('2020-05-01') - julianday('2020-01-01')) / 30) + 1)
    )
)
SELECT
    date(julianday('2020-01-01'), '+' || x || ' month') AS month
FROM cnt
but I am wondering if there is a better way. Would it possibly be easier to create a snapshot table and generate the current state of each customer for each date?
If you have the dates, you can use a brute-force method. This determines the most recent status for each customer for each date:
select dc.date,
       sum(dc.as_of_status = 'paying') as customers
from (select distinct d.date, t.customer,
             first_value(t.status_new) over (partition by d.date, t.customer order by t.status_change_date desc) as as_of_status
      from dates d join
           t
           on t.status_change_date <= d.date
     ) dc
group by dc.date
order by dc.date;
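
Here, dates is assumed to be a table (or CTE) holding the end-of-month dates. In SQLite it could be generated with a recursive CTE along the lines of the one in the question (a sketch, assuming the Jan-May 2020 range):

WITH RECURSIVE dates (date) AS (
    SELECT date('2020-01-01', '+1 month', '-1 day')
    UNION ALL
    SELECT date(date, '+1 day', '+1 month', '-1 day')
    FROM dates
    WHERE date < '2020-05-31'
)
SELECT date FROM dates;

Each step jumps to the first day of the next month and then back one day, producing 2020-01-31, 2020-02-29, ..., 2020-05-31.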

SQL query to select today and previous day's price

I have historic stock price data that looks like the below. I want to generate a new table that has one row for each ticker with the most recent day's price and its previous day's price. What would be the best way to do this? My database is Postgres.
+--------+-------+------------+
| ticker | price | date       |
+--------+-------+------------+
| AAPL   | 6     | 10-23-2015 |
| AAPL   | 5     | 10-22-2015 |
| AAPL   | 4     | 10-21-2015 |
| AXP    | 5     | 10-23-2015 |
| AXP    | 3     | 10-22-2015 |
| AXP    | 5     | 10-21-2015 |
+--------+-------+------------+
You can do something like this:
with ranking as (
    select ticker, price, dt,
           rank() over (partition by ticker order by dt desc) as rank
    from stocks
)
select * from ranking where rank in (1, 2);
Example: http://sqlfiddle.com/#!15/e45ea/3
Results for your example will look like this:
| ticker | price | dt                        | rank |
|--------|-------|---------------------------|------|
| AAPL   | 6     | October, 23 2015 00:00:00 | 1    |
| AAPL   | 5     | October, 22 2015 00:00:00 | 2    |
| AXP    | 5     | October, 23 2015 00:00:00 | 1    |
| AXP    | 3     | October, 22 2015 00:00:00 | 2    |
If your table is large and you have performance issues, use a WHERE clause to restrict the data to the last 30 days or so.
Your best bet is to use a window function with an aggregated CASE expression to pivot the data.
You can see more on window functions here: http://www.postgresql.org/docs/current/static/tutorial-window.html
Below is a pseudocode version of where you may need to head to answer your question (sorry, I couldn't validate it since I don't have a Postgres database set up).
SELECT
    ticker,
    SUM(CASE WHEN rank = 1 THEN price ELSE 0 END) today,
    SUM(CASE WHEN rank = 2 THEN price ELSE 0 END) yesterday
FROM (
    SELECT
        ticker,
        price,
        date,
        rank() OVER (PARTITION BY ticker ORDER BY date DESC) as rank
    FROM your_table) p
WHERE rank IN (1, 2)
GROUP BY ticker;
Edit - Updated the case statement with an 'else'
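
An alternative sketch that avoids the pivot by using lag() (also unvalidated; same your_table and columns as above):

SELECT ticker, price AS today, prev_price AS yesterday
FROM (
    SELECT
        ticker,
        price,
        lag(price) OVER (PARTITION BY ticker ORDER BY date) AS prev_price,
        rank() OVER (PARTITION BY ticker ORDER BY date DESC) AS rnk
    FROM your_table) p
WHERE rnk = 1;

lag(price) picks up the previous day's price within each ticker, and keeping only the most recent row per ticker (rnk = 1) yields one row per ticker with today's and yesterday's prices.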