Match group of variables and values with the nearest datetime - google-bigquery

I have a transaction table that looks like that:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:11:00 001 101 -45 Y
2021-03-01 10:11:00 001 105 -25 Y
2021-03-01 10:11:00 001 109 -40 Y
The table does not have an id column; the transaction_start for a given store_no will never be the same.
Whenever a transaction is post voided, the transaction is then repeated with the same store_no, item_no but with a negative/minus amount and an equal or higher transaction_start. Also, the column post_voided is then equal to 'Y'.
In the example above, the rows 1-3 have the same transaction_start and store_no, thus belonging to the same receipt, containing three different items (101, 105, 109). The same logic is applied to the other rows: rows 4-5 belong to a same receipt, and so on. In the example, 4 different receipts can be seen. The last receipt, given by the last three rows, is a post voided of the first receipt (rows 1-3).
What I want to do is to change the transaction_start for the post_voided = 'Y' transactions (in my example, only one receipt - represented by the last three rows - has it) to the next/closest datetime of a similar receipt that has the variables store_no, item_no and (negative) amount (but post_voided = 'N') (in my example, the similar ticket is given by the first three rows - store_no, all item_no and (positive) amount match). The transaction_start for the post voided receipt is always equal or higher than the "original" receipt.
Desired output:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:00:00 001 101 -45 Y
2021-03-01 10:00:00 001 105 -25 Y
2021-03-01 10:00:00 001 109 -40 Y
Here a link of the table: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=26142fa24e46acb4213b96c86f4eb94b
Thanks in advance!

Consider below
select a.* replace(ifnull(b.transaction_start, a.transaction_start) as transaction_start)
from `project.dataset.table` a
left join (
select * replace(-amount as amount)
from `project.dataset.table`
where post_voided = 'N'
) b
using (store_no, item_no)
if applied to sample data in your question - output is
Consider below for new / extended example (https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=91f9f180fd672e7c357aa48d18ced5fd)
select x.* replace(ifnull(y.original_transaction_start, x.transaction_start) as transaction_start)
from `project.dataset.table` x
left join (
select b.transaction_start, b.store_no, b.item_no, b.amount amount,
max(a.transaction_start) original_transaction_start
from `project.dataset.table` a
join `project.dataset.table` b
on a.store_no = b.store_no
and a.item_no = b.item_no
and a.amount = -b.amount
and a.post_voided = 'N'
and b.post_voided = 'Y'
and a.transaction_start < b.transaction_start
group by b.transaction_start, b.store_no, b.item_no, b.amount
) y
using (store_no, item_no, amount, transaction_start)
with output

Related

Query to find active days per year to find revenue per user per year

I have 2 dimension tables and 1 fact table as follows:
user_dim
user_id
user_name
user_joining_date
1
Steve
2013-01-04
2
Adam
2012-11-01
3
John
2013-05-05
4
Tony
2012-01-01
5
Dan
2010-01-01
6
Alex
2019-01-01
7
Kim
2019-01-01
bundle_dim
bundle_id
bundle_name
bundle_type
bundle_cost_per_day
101
movies and TV
prime
5.5
102
TV and sports
prime
6.5
103
Cooking
prime
7
104
Sports and news
prime
5
105
kids movie
extra
2
106
kids educative
extra
3.5
107
spanish news
extra
2.5
108
Spanish TV and sports
extra
3.5
109
Travel
extra
2
plans_fact
user_id
bundle_id
bundle_start_date
bundle_end_date
1
101
2019-10-10
2020-10-10
2
107
2020-01-15
(null)
2
106
2020-01-15
2020-12-31
2
101
2020-01-15
(null)
2
103
2020-01-15
2020-02-15
1
101
2020-10-11
(null)
1
107
2019-10-10
2020-10-10
1
105
2019-10-10
2020-10-10
4
101
2021-01-01
2021-02-01
3
104
2020-02-17
2020-03-17
2
108
2020-01-15
(null)
4
102
2021-01-01
(null)
4
103
2021-01-01
(null)
4
108
2021-01-01
(null)
5
103
2020-01-15
(null)
5
101
2020-01-15
2020-02-15
6
101
2021-01-01
2021-01-17
6
101
2021-01-20
(null)
6
108
2021-01-01
(null)
7
104
2020-02-17
(null)
7
103
2020-01-17
2020-01-18
1
102
2020-12-11
(null)
2
106
2021-01-01
(null)
7
107
2020-01-15
(null)
note: NULL bundle_end_date refers to active subscription.
user active days can be calculated as: bundle_end_date - bundle_start_date (for the given bundle)
total revenue per user could be calculated as : total no. of active days * bundle rate per day
I am looking to write a query to find revenue generated per user per year.
Here is what I have for the overall revenue per user:
select pf.user_id
, sum(datediff(day, pf.bundle_start_date, coalesce(pf.bundle_end_date, getdate())) * bd.price_per_day) total_cost_per_bundle
from plans_fact pf
inner join bundle_dim bd on bd.bundle_id = pf.bundle_id
group by pf.user_id
order by pf.user_id;
You need a 'year' table to help parse out each multi-year spanning row into it's seperate years. For each year, you need to also recalculate the start and end dates. That's what I do in the yearParsed cte in the code below. I hard code the years into the join statement that creates y. You probably will do it different but however you get those values will work.
After that, pretty much sum as you did before, just adding the year column to your grouping.
Aside from that, all I did was move the null coalesce logic to the cte to make the overall logic simpler.
with yearParsed as (
select pf.*,
y.year,
startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
endDt = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
from plans_fact pf
cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
join (values
(2019, '2019-01-01', '2019-12-31'),
(2020, '2020-01-01', '2020-12-31'),
(2021, '2021-01-01', '2021-12-31')
) y (year, startDt, endDt)
on pf.bundle_start_date <= y.endDt
and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
yp.year,
total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id,
yp.year
order by yp.user_id,
yp.year;
Now, if this is common, you should probably create a base-table for your 'year' table. But if it's not common, but for this report you don't want to have to keep coming back to hard-code the year information into the y table, you can do this:
declare #yearTable table (
year int,
startDt char(10),
endDt char(10)
);
with y as (
select year = year(min(pf.bundle_start_date))
from #plans_fact pf
union all
select year + 1
from y
where year < year(getdate())
)
insert #yearTable
select year,
startDt = convert(char(4),year) + '-01-01',
endDt = convert(char(4),year) + '-12-31'
from y;
and it will create the appropriate years for you. But you can see why creating a base table may be preferred if you have this or a similar need often.

How can I create a column which computes only the change of other column on redshift?

I have this dataset:
product customer date value buyer_position
A 123455 2020-01-01 00:01:01 100 1
A 123456 2020-01-02 00:02:01 100 2
A 523455 2020-01-02 00:02:05 100 NULL
A 323455 2020-01-03 00:02:07 100 NULL
A 423455 2020-01-03 00:09:01 100 3
B 100455 2020-01-01 00:03:01 100 1
B 999445 2020-01-01 00:04:01 100 NULL
B 122225 2020-01-01 00:04:05 100 2
B 993848 2020-01-01 10:04:05 100 3
B 133225 2020-01-01 11:04:05 100 NULL
B 144225 2020-01-01 12:04:05 100 4
The dataset has the product the company sells and the customers who saw the product. A customer can see more than one product, but the combination product + customer doesn't have any repetition. I want to get how many people bought the product before the customer sees it.
This would be the perfect output:
product customer date value buyer_position people_before
A 123455 2020-01-01 00:01:01 100 1 0
A 123456 2020-01-02 00:02:01 100 2 1
A 523455 2020-01-02 00:02:05 100 NULL 2
A 323455 2020-01-03 00:02:07 100 NULL 2
A 423455 2020-01-03 00:09:01 100 3 2
B 100455 2020-01-01 00:03:01 100 1 0
B 999445 2020-01-01 00:04:01 100 NULL 1
B 122225 2020-01-01 00:04:05 100 2 1
B 993848 2020-01-01 10:04:05 100 3 2
B 133225 2020-01-01 11:04:05 100 NULL 3
B 144225 2020-01-01 12:04:05 100 4 3
As you can see, when the customer 122225 saw the product he wanted, two people have already bought it. In the case of customer 323455, two people have already bought the product A.
I think I should use some window function, like lag(). But lag() function won't get this "cumulative" information. So I'm kind of lost here.
This looks like a window count of non-null values of buyer_position over the preceding rows:
select t.*,
coalesce(count(buyer_position) over(
partition by product
order by date
rows between unbounded preceding and 1 preceding
), 0) as people_before
from mytable t
Hmmm . . . If I understand correctly, You want the max of the buyer position for the customer/product minus 1:
select t.*,
max(buyer_position) over (partition by customer, product order by date rows between unbounded preceding and current row) - 1
from t;

Finding a minimum date before another date

Let's say I have two tables. One is a table with information about customer service inquiries, which contains information about the customer and the time the inquiry was placed. The customer's information (in this case, the ID) is saved for all future inquiries.
CUST_ID INQUIRY_ID INQUIRY_DATE
001 34 2015-05-03 08:15
001 36 2015-05-05 13:12
002 39 2015-05-10 18:43
003 42 2015-05-12 14:58
003 46 2015-05-14 07:27
001 50 2015-05-18 19:06
003 55 2015-05-20 11:40
The other table contains information about the resolution dates for all customer inquiries.
CUST_ID RESOLVED_DATE
001 2015-05-06 12:54
002 2015-05-11 08:09
003 2015-05-14 19:37
001 2015-05-19 16:12
003 2015-05-22 08:40
The resolution table doesn't have a key to link to the inquiry table other than the CUST_ID, so in order to calculate the time to resolution, I want to determine the minimum inquiry date before the resolution for EACH resolution date. The resulting table would look like this:
CUST_ID FIRST_INQUIRY RESOLVED_DT
001 2015-05-03 08:15 2015-05-06 12:54
001 2015-05-18 19:06 2015-05-19 16:12
002 2015-05-10 18:43 2015-05-11 08:09
003 2015-05-12 14:58 2015-05-14 19:37
003 2015-05-20 11:40 2015-05-22 08:40
At first I just went with min(case when INQUIRY_DATE < RESOLVED_DT), but for people like customers 001 and 003 who have multiple inquiries across different dates, the query would just return the first ever inquiry date, not the first since the last inquiry. Does anyone know how to do this? I'm using Netezza.
One option is to create a subquery for each table (inquries and resolutions) which numbers the transaction for each CUST_ID using the date. Then, the two subqueries can be joined together using this ordered index column along with the CUST_ID.
I also used the INQUIRY_ID in the inquiries table to break a tie, should it occur. There is not way to break a tie in the resolutions table for a given customer and date based on the data you showed us.
SELECT t1.CUST_ID, t1.INQUIRY_ID AS FIRST_INQUIRY, t2.RESOLVED_DATE AS RESOLVED_DT
FROM
(
SELECT CUST_ID, INQUIRY_ID, INQUIRY_DATE,
(SELECT COUNT(*) + 1
FROM inquiries
WHERE CUST_ID = t.CUST_ID AND INQUIRY_DATE <= t.INQUIRY_DATE
AND INQUIRY_ID < t.INQUIRY_ID) AS index
FROM inquiries AS t
) AS t1
INNER JOIN
(
SELECT CUST_ID, RESOLVED_DATE,
(SELECT COUNT(*) + 1
FROM resolutions
WHERE CUST_ID = t.CUST_ID AND RESOLVED_DATE < t.RESOLVED_DATE) AS index
FROM resolutions t
) AS t2
ON t1.CUST_ID = t2.CUST_ID AND t1.index = t2.index
Here are what the subquery tables look like:
inquiries:
CUST_ID INQUIRY_ID INQUIRY_DATE index
001 34 2015-05-03 08:15 1
001 36 2015-05-05 13:12 2
002 39 2015-05-10 18:43 1
003 42 2015-05-12 14:58 1
003 46 2015-05-14 07:27 2
001 50 2015-05-18 19:06 3
003 55 2015-05-20 11:40 3
resolutions:
CUST_ID RESOLVED_DATE index
001 2015-05-06 12:54 1
002 2015-05-11 08:09 1
003 2015-05-14 19:37 1
001 2015-05-19 16:12 2
003 2015-05-22 08:40 2
Note that this solution is not robust to missing data, e.g. there is an inquiry which was not closed, or the resolution was never recorded.

SQL - Creating a timeline for each ID (Vertica)

I am dealing with the following problem in SQL (using Vertica):
In short -- Create a timeline for each ID (in a table where I have multiple lines, orders in my example, per ID)
What I would like to achieve -- At my disposal I have a table on historical order date and I would like to compute new customer (first order ever in the past month), active customer- (>1 order in last 1-3 months), passive customer- (no order for last 3-6 months) and inactive customer (no order for >6 months) rates.
Which steps I have taken so far -- I was able to construct a table similar to the example presented below:
CustomerID Current order date Time between current/previous order First order date (all-time)
001 2015-04-30 12:06:58 (null) 2015-04-30 12:06:58
001 2015-09-24 17:30:59 147 05:24:01 2015-04-30 12:06:58
001 2016-02-11 13:21:10 139 19:50:11 2015-04-30 12:06:58
002 2015-10-21 10:38:29 (null) 2015-10-21 10:38:29
003 2015-05-22 12:13:01 (null) 2015-05-22 12:13:01
003 2015-07-09 01:04:51 47 12:51:50 2015-05-22 12:13:01
003 2015-10-23 00:23:48 105 23:18:57 2015-05-22 12:13:01
A little bit of intuition: customer 001 placed three orders from which the second one was 147 days after its first order. Customer 002 has only placed one order in total.
What I think that the next steps should be -- I would like to know for each date (also dates on which a certain user did not place an order), for each CustomerID, how long it has been since his/her last order. This would imply that I would create some sort of timeline for each CustomerID. In the example presented above I would get 287 (days between 1st of May 2015 and 11th of February 2016, the timespan of this table) lines for each CustomerID. I have difficulties solving this previous step. When I have performed this step I want to create a field which shows at each date the last order date, the period between the last order date and the current date, and what state someone is in at the current date. For the example presented earlier, this would look something like this:
CustomerID Last order date Current date Time between current date /last order State
001 2015-04-30 12:06:58 2015-05-01 00:00:00 0 00:00:00 New
...
001 2015-04-30 12:06:58 2015-06-30 00:00:00 60 11:53:02 Active
...
001 2015-09-24 17:30:59 2016-02-01 00:00:00 129 11:53:02 Passive
...
...
002 2015-10-21 17:30:59 2015-10-22 00:00:00 0 06:29:01 New
...
002 2015-10-21 17:30:59 2015-11-30 00:00:00 39 06:29:01 Active
...
...
003 2015-05-22 12:13:01 2015-06-23 00:00:00 31 11:46:59 Active
...
003 2015-07-09 01:04:51 2015-10-22 00:00:00 105 11:46:59 Inactive
...
At the dots there should be all the inbetween dates but for sake of space I have left these out of the table.
When I know for each date what the state is of each customer (active/passive/inactive) my plan is to sum the states and group by date which should give me the sum of new, active, passive and inactive customers. From here on I can easily compute the rates at each date.
Anybody that knows how I can possibly achieve this task?
Note -- If anyone has other ideas how to achieve the goal presented above (using some other approach compared to the approach I had in mind) please let me know!
EDIT
Suppose you start from a table like this:
SQL> select * from ord order by custid, ord_date ;
custid | ord_date
--------+---------------------
1 | 2015-04-30 12:06:58
1 | 2015-09-24 17:30:59
1 | 2016-02-11 13:21:10
2 | 2015-10-21 10:38:29
3 | 2015-05-22 12:13:01
3 | 2015-07-09 01:04:51
3 | 2015-10-23 00:23:48
(7 rows)
You can use Vertica's Timeseries Analytic Functions TS_FIRST_VALUE(), TS_LAST_VALUE() to fill gaps and interpolate last_order date to the current date:
Then you just have to join this with a Vertica's TimeSeries generated from the same table with interval one day starting from the first day each customer did place his/her first order up to now (current_date):
select
custid,
status_dt,
last_order_dt,
case
when status_dt::date - last_order_dt::date < 30 then case
when nord = 1 then 'New' else 'Active' end
when status_dt::date - last_order_dt::date < 90 then 'Active'
when status_dt::date - last_order_dt::date < 180 then 'Passive'
else 'Inactive'
end as status
from (
select
custid,
last_order_dt,
status_dt,
conditional_true_event (first_order_dt is null or
last_order_dt > lag(last_order_dt))
over(partition by custid order by status_dt) as nord
from (
select
custid,
ts_first_value(ord_date) as first_order_dt ,
ts_last_value(ord_date) as last_order_dt ,
dt::date as status_dt
from
( select custid, ord_date from ord
union all
select distinct(custid) as custid, current_date + 1 as ord_date from ord
) z timeseries dt as '1 day' over (partition by custid order by ord_date)
) x
) y
where status_dt <= current_date
order by 1, 2
;
And you will get something like this:
custid | status_dt | last_order_dt | status
--------+------------+---------------------+---------
1 | 2015-04-30 | 2015-04-30 12:06:58 | New
1 | 2015-05-01 | 2015-04-30 12:06:58 | New
1 | 2015-05-02 | 2015-04-30 12:06:58 | New
...
1 | 2015-05-29 | 2015-04-30 12:06:58 | New
1 | 2015-05-30 | 2015-04-30 12:06:58 | Active
1 | 2015-05-31 | 2015-04-30 12:06:58 | Active
...
etc.

Incremental count in SQL Server 2005

I am working with a Raiser's Edge database using SQL Server 2005. I have written SQL that will produce a temporary table containing details of direct debit instalments. Below is a small table containing the key variables for the question I'm going to ask, with some fictional data:
Donor_ID Instalment_ID Instalment_Date Amount
1234 1111 01/01/2011 £5.00
1234 1112 01/02/2011 £0.00
1234 1113 01/03/2011 £5.00
1234 1114 01/04/2011 £5.00
1234 1115 01/05/2011 £0.00
1234 1116 01/06/2011 £0.00
2345 2111 01/01/2011 £0.00
2345 2112 01/02/2011 £5.00
2345 2113 01/03/2011 £5.00
2345 2114 01/04/2011 £0.00
2345 2115 01/05/2011 £0.00
2345 2116 01/06/2011 £0.00
As you will see, some of the values in the Amount column are £0.00. This can occur when a donor has insufficient funds in their account, for example.
What I'd like to do is write a SQL query that will create a field containing an incremental count of consecutive £0.00 payments that resets after a non-£0.00 payment or after a change in Donor_ID. I have reproduced the above data below, with the field I'd like to see.
Donor_ID Instalment_ID Instalment_Date Amount New_Field
1234 1111 01/01/2011 £5.00
1234 1112 01/02/2011 £0.00 1
1234 1113 01/03/2011 £5.00
1234 1114 01/04/2011 £5.00
1234 1115 01/05/2011 £0.00 1
1234 1116 01/06/2011 £0.00 2
2345 2111 01/01/2011 £0.00 1
2345 2112 01/02/2011 £5.00
2345 2113 01/03/2011 £5.00
2345 2114 01/04/2011 £0.00 1
2345 2115 01/05/2011 £0.00 2
2345 2116 01/06/2011 £0.00 3
To help clarify what I'm looking for, I think what I'm looking to do would be similar to a winning streak field on a list of a football team's results. For example:
Opponent Score Winning_Streak
Arsenal 1-0 1
Liverpool 0-0
Swansea 3-1 1
Chelsea 2-1 2
Fulham 4-0 3
Stoke 0-0
Man Utd 1-3
Reading 2-1 1
I've considered various options, but have made no progress. Unless I've missed something obvious, I think that a solution more advanced than my current SQL programming level might be required.
If I am thinking about this problem correctly, I believe that you want a row number when the Amount is 0.00 pounds.
Select 0 as As InsufficientCount
, Donor_ID
, Installment_ID
, Amount
From [Table]
Where Amount > 0.00
Union
Select Row_Number() Over (Partition By Donor_ID Order By Installment_ID)
, Donor_ID
, Installment_ID
, Amount
From [Table]
Where Amount = 0.00
This union select should only give you 'ranks' where the Amount equals 0.
Am calling your new field streakAmount
ALTER TABLE instalments ADD streakAmount int NULL;
Then, to update the value:
UPDATE instalments
SET streakAmount =
(SELECT
COUNT(*)
FROM
instalments streak
WHERE
streak.donor_id = instalments.donor_id
AND
streak.instalment_date <= instalments.instalment_date
AND
(streak.instalment_date >
-- find previous instalment date, if any exists
COALESCE(
(
SELECT
MAX(instalment_date)
FROM
instalments prev
WHERE
prev.donor_id = instalments.donor_id
AND
prev.amount > 0
AND
prev.instalment_date < instalments.instalment_date
)
-- otherwise min date
, cast('1753-1-1' AS date))
)
)
WHERE
amount = 0;
http://sqlfiddle.com/#!6/a571f/18