SQL - Creating a timeline for each ID (Vertica)

I am dealing with the following problem in SQL (using Vertica):
In short -- Create a timeline for each ID (in a table where I have multiple lines, orders in my example, per ID)
What I would like to achieve -- I have a table of historical order dates at my disposal, and I would like to compute the rates of new customers (first order ever placed within the past month), active customers (>1 order in the last 1-3 months), passive customers (no order for the last 3-6 months) and inactive customers (no order for more than 6 months).
Which steps I have taken so far -- I was able to construct a table similar to the example presented below:
CustomerID | Current order date  | Time between current/previous order | First order date (all-time)
-----------+---------------------+-------------------------------------+----------------------------
001        | 2015-04-30 12:06:58 | (null)                              | 2015-04-30 12:06:58
001        | 2015-09-24 17:30:59 | 147 05:24:01                        | 2015-04-30 12:06:58
001        | 2016-02-11 13:21:10 | 139 19:50:11                        | 2015-04-30 12:06:58
002        | 2015-10-21 10:38:29 | (null)                              | 2015-10-21 10:38:29
003        | 2015-05-22 12:13:01 | (null)                              | 2015-05-22 12:13:01
003        | 2015-07-09 01:04:51 | 47 12:51:50                         | 2015-05-22 12:13:01
003        | 2015-10-23 00:23:48 | 105 23:18:57                        | 2015-05-22 12:13:01
A little bit of intuition: customer 001 placed three orders, of which the second was placed 147 days after the first. Customer 002 has only placed one order in total.
What I think that the next steps should be -- For each date (including dates on which a given user did not place an order), and for each CustomerID, I would like to know how long it has been since his/her last order. This implies creating some sort of timeline for each CustomerID; in the example presented above I would get 287 lines per CustomerID (the days between 1 May 2015 and 11 February 2016, the timespan of this table). I am having difficulty with this step. Once that is done, I want to create fields that show, for each date, the last order date, the period between that last order date and the current date, and the state the customer is in on that date. For the example presented earlier, this would look something like this:
CustomerID | Last order date     | Current date        | Time between current date / last order | State
-----------+---------------------+---------------------+----------------------------------------+---------
001        | 2015-04-30 12:06:58 | 2015-05-01 00:00:00 | 0 00:00:00                             | New
...
001        | 2015-04-30 12:06:58 | 2015-06-30 00:00:00 | 60 11:53:02                            | Active
...
001        | 2015-09-24 17:30:59 | 2016-02-01 00:00:00 | 129 11:53:02                           | Passive
...
...
002        | 2015-10-21 17:30:59 | 2015-10-22 00:00:00 | 0 06:29:01                             | New
...
002        | 2015-10-21 17:30:59 | 2015-11-30 00:00:00 | 39 06:29:01                            | Active
...
...
003        | 2015-05-22 12:13:01 | 2015-06-23 00:00:00 | 31 11:46:59                            | Active
...
003        | 2015-07-09 01:04:51 | 2015-10-22 00:00:00 | 105 11:46:59                           | Inactive
...
At the dots there should be all the in-between dates, but for the sake of space I have left these out of the table.
When I know, for each date, the state of each customer (new/active/passive/inactive), my plan is to sum the states grouped by date, which should give me the number of new, active, passive and inactive customers per date. From there I can easily compute the rates at each date.
Does anybody know how I can achieve this?
Note -- If anyone has other ideas on how to achieve the goal presented above (using a different approach from the one I had in mind), please let me know!
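For illustration, here is a minimal sketch of the intermediate per-customer, per-day table described above. It assumes the orders live in a table ord(custid, ord_date), as in the edit below, and that a calendar table dim_date(cal_date) covering the period is available; both names are assumptions, not part of the original question.
-- Sketch: one row per customer per calendar day, with the most recent order
-- on or before that day and the number of days elapsed since it.
-- dim_date(cal_date) is an assumed calendar/date-dimension table.
select c.custid,
       d.cal_date,
       max(o.ord_date)                     as last_order_dt,
       d.cal_date - max(o.ord_date)::date  as days_since_last_order
from (select custid, min(ord_date)::date as first_dt
      from   ord
      group  by custid) c
join dim_date d
  on  d.cal_date between c.first_dt and current_date
left join ord o
  on  o.custid = c.custid
  and o.ord_date::date <= d.cal_date
group by c.custid, d.cal_date
order by c.custid, d.cal_date;
The edit below shows a Vertica-specific alternative that avoids the need for a separate calendar table by using the TIMESERIES clause.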

EDIT
Suppose you start from a table like this:
SQL> select * from ord order by custid, ord_date ;
custid | ord_date
--------+---------------------
1 | 2015-04-30 12:06:58
1 | 2015-09-24 17:30:59
1 | 2016-02-11 13:21:10
2 | 2015-10-21 10:38:29
3 | 2015-05-22 12:13:01
3 | 2015-07-09 01:04:51
3 | 2015-10-23 00:23:48
(7 rows)
You can use Vertica's timeseries analytic functions TS_FIRST_VALUE() and TS_LAST_VALUE() to fill the gaps and carry the last order date forward to the current date.
The TIMESERIES clause generates, for each customer, one row per day with an interval of one day, starting from the day he/she placed the first order and running up to now (current_date):
select
    custid,
    status_dt,
    last_order_dt,
    case
        when status_dt::date - last_order_dt::date < 30 then
            case when nord = 1 then 'New' else 'Active' end
        when status_dt::date - last_order_dt::date < 90  then 'Active'
        when status_dt::date - last_order_dt::date < 180 then 'Passive'
        else 'Inactive'
    end as status
from (
    select
        custid,
        last_order_dt,
        status_dt,
        conditional_true_event(first_order_dt is null or
                               last_order_dt > lag(last_order_dt))
            over (partition by custid order by status_dt) as nord
    from (
        select
            custid,
            ts_first_value(ord_date) as first_order_dt,
            ts_last_value(ord_date)  as last_order_dt,
            dt::date                 as status_dt
        from (
            select custid, ord_date from ord
            union all
            select distinct(custid) as custid, current_date + 1 as ord_date from ord
        ) z timeseries dt as '1 day' over (partition by custid order by ord_date)
    ) x
) y
where status_dt <= current_date
order by 1, 2
;
And you will get something like this:
custid | status_dt | last_order_dt | status
--------+------------+---------------------+---------
1 | 2015-04-30 | 2015-04-30 12:06:58 | New
1 | 2015-05-01 | 2015-04-30 12:06:58 | New
1 | 2015-05-02 | 2015-04-30 12:06:58 | New
...
1 | 2015-05-29 | 2015-04-30 12:06:58 | New
1 | 2015-05-30 | 2015-04-30 12:06:58 | Active
1 | 2015-05-31 | 2015-04-30 12:06:58 | Active
...
etc.
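From here, the per-day statuses can be rolled up into the daily counts and rates described in the question. A minimal sketch, assuming the query above is saved as a view named daily_status(custid, status_dt, last_order_dt, status); the view name is hypothetical:
-- daily_status is an assumed view wrapping the query above.
select status_dt,
       count(*)                                              as customers,
       sum(case when status = 'New'      then 1 else 0 end)  as new_cnt,
       sum(case when status = 'Active'   then 1 else 0 end)  as active_cnt,
       sum(case when status = 'Passive'  then 1 else 0 end)  as passive_cnt,
       sum(case when status = 'Inactive' then 1 else 0 end)  as inactive_cnt,
       -- share of active customers per day; the other rates follow the same pattern
       sum(case when status = 'Active'   then 1 else 0 end) / count(*)::float as active_rate
from daily_status
group by status_dt
order by status_dt;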

Related

Match group of variables and values with the nearest datetime

I have a transaction table that looks like this:
transaction_start   | store_no | item_no | amount | post_voided
--------------------+----------+---------+--------+------------
2021-03-01 10:00:00 | 001      | 101     |  45    | N
2021-03-01 10:00:00 | 001      | 105     |  25    | N
2021-03-01 10:00:00 | 001      | 109     |  40    | N
2021-03-01 10:05:00 | 002      | 103     |  35    | N
2021-03-01 10:05:00 | 002      | 135     |  20    | N
2021-03-01 10:08:00 | 001      | 140     |   2    | N
2021-03-01 10:11:00 | 001      | 101     | -45    | Y
2021-03-01 10:11:00 | 001      | 105     | -25    | Y
2021-03-01 10:11:00 | 001      | 109     | -40    | Y
The table does not have an id column; the transaction_start for a given store_no will never be the same.
Whenever a transaction is post-voided, the transaction is repeated with the same store_no and item_no but with a negative amount and an equal or later transaction_start. The column post_voided is then equal to 'Y'.
In the example above, rows 1-3 have the same transaction_start and store_no, so they belong to the same receipt, which contains three different items (101, 105, 109). The same logic applies to the other rows: rows 4-5 belong to the same receipt, and so on. In total, 4 different receipts can be seen in the example; the last receipt, given by the last three rows, is a post-void of the first receipt (rows 1-3).
What I want to do is to change the transaction_start of the post_voided = 'Y' transactions (in my example, only the receipt represented by the last three rows) to the closest earlier transaction_start of a similar receipt, i.e. one with the same store_no and item_no, the opposite (positive) amount, and post_voided = 'N' (in my example, that similar receipt is given by the first three rows: store_no, all item_no values and the positive amounts match). The transaction_start of the post-voided receipt is always equal to or later than that of the "original" receipt.
Desired output:
transaction_start   | store_no | item_no | amount | post_voided
--------------------+----------+---------+--------+------------
2021-03-01 10:00:00 | 001      | 101     |  45    | N
2021-03-01 10:00:00 | 001      | 105     |  25    | N
2021-03-01 10:00:00 | 001      | 109     |  40    | N
2021-03-01 10:05:00 | 002      | 103     |  35    | N
2021-03-01 10:05:00 | 002      | 135     |  20    | N
2021-03-01 10:08:00 | 001      | 140     |   2    | N
2021-03-01 10:00:00 | 001      | 101     | -45    | Y
2021-03-01 10:00:00 | 001      | 105     | -25    | Y
2021-03-01 10:00:00 | 001      | 109     | -40    | Y
Here is a link to the table: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=26142fa24e46acb4213b96c86f4eb94b
Thanks in advance!
Consider below
select a.* replace(ifnull(b.transaction_start, a.transaction_start) as transaction_start)
from `project.dataset.table` a
left join (
    select * replace(-amount as amount)
    from `project.dataset.table`
    where post_voided = 'N'
) b
using (store_no, item_no)
If applied to the sample data in your question, the output is:
Consider below for new / extended example (https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=91f9f180fd672e7c357aa48d18ced5fd)
select x.* replace(ifnull(y.original_transaction_start, x.transaction_start) as transaction_start)
from `project.dataset.table` x
left join (
    select b.transaction_start, b.store_no, b.item_no, b.amount amount,
           max(a.transaction_start) original_transaction_start
    from `project.dataset.table` a
    join `project.dataset.table` b
        on  a.store_no = b.store_no
        and a.item_no = b.item_no
        and a.amount = -b.amount
        and a.post_voided = 'N'
        and b.post_voided = 'Y'
        and a.transaction_start < b.transaction_start
    group by b.transaction_start, b.store_no, b.item_no, b.amount
) y
using (store_no, item_no, amount, transaction_start)
with output
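Since the fiddles in the question are SQL Server, here is a rough T-SQL sketch of the same matching idea, using OUTER APPLY in place of BigQuery's SELECT * REPLACE. dbo.transactions is an assumed table name for the sample data, and this is a sketch rather than a tested equivalent:
-- For each post-voided row, pick the latest earlier transaction_start of a
-- matching non-voided row (same store_no/item_no, opposite amount);
-- non-voided rows keep their own transaction_start.
-- dbo.transactions is an assumed table name.
select coalesce(o.transaction_start, t.transaction_start) as transaction_start,
       t.store_no, t.item_no, t.amount, t.post_voided
from dbo.transactions t
outer apply (
    select max(n.transaction_start) as transaction_start
    from dbo.transactions n
    where t.post_voided = 'Y'
      and n.post_voided = 'N'
      and n.store_no    = t.store_no
      and n.item_no     = t.item_no
      and n.amount      = -t.amount
      and n.transaction_start < t.transaction_start
) o;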

SQLite query - Merging time periods

I am working with a SQLite RDB and have the following problem.
PID | EID | EPISODETYPE    | START_TIME          | END_TIME
----+-----+----------------+---------------------+--------------------
123 | 556 | emergency_room | 2020-03-29 15:09:00 | 2020-03-30 20:36:00
123 | 558 | ward           | 2020-04-30 20:35:00 | 2020-05-04 22:12:00
123 | 660 | ward           | 2020-05-04 22:12:00 | 2020-05-21 08:59:00
123 | 661 | icu            | 2020-05-21 09:00:00 | 2020-07-01 17:00:00
Basically, PID is each patient's unique identifier. Each patient has an episode identifier (EID) for each of the different beds they occupy during a single stay.
What I wish to accomplish is to group all episodes that belong to a single hospital stay and return a stay number for each episode.
I would want my query to result in this:
PID | EID | StayNumber
----+-----+-----------
123 | 556 | 1
123 | 558 | 2
123 | 660 | 2
123 | 661 | 2
The 1st row has StayNumber 1 as it is the first stay.
As the 2nd, 3rd and 4th rows are from the same hospital stay (we can tell by the overlapping or relatively close start and end times), they are all labelled StayNumber 2.
A hospital stay is defined as the period of time during which the patient never left the hospital.
I tried to write the query by starting off with a GROUP BY PID (to isolate the process for each individual patient) and using datetime to compute a simple time-difference rule, but I have trouble writing a query that compares the end time of one row with the start time of the next row.
Thank you in advance.
I am a SQL learner
UPDATE ***
Use the window function LAG() to flag the rows where a new hospital stay starts and the window function SUM() to get the stay numbers:
SELECT PID, EID,
       SUM(flag) OVER (PARTITION BY PID ORDER BY START_TIME) StayNumber
FROM (
    SELECT *,
           strftime('%s', START_TIME) -
           strftime('%s', LAG(END_TIME, 1, datetime(START_TIME, '-1 hour'))
                          OVER (PARTITION BY PID ORDER BY START_TIME)) > 60 flag
    FROM tablename
)
See the demo.
Results:
|PID | EID | StayNumber
|:-- | :-- | ---------:
|123 | 556 | 1
|123 | 558 | 2
|123 | 660 | 2
|123 | 661 | 2
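If one row per stay is also needed (for example the stay's start and end times), the numbered query above can be wrapped in a CTE and aggregated; this is a sketch built directly on that query, with "numbered" as a hypothetical CTE name:
-- Collapse the numbered episodes into one row per hospital stay.
-- "numbered" is just a CTE name; tablename is the same table as above.
WITH numbered AS (
  SELECT PID, EID, START_TIME, END_TIME,
         SUM(flag) OVER (PARTITION BY PID ORDER BY START_TIME) AS StayNumber
  FROM (
    SELECT *,
           strftime('%s', START_TIME) -
           strftime('%s', LAG(END_TIME, 1, datetime(START_TIME, '-1 hour'))
                          OVER (PARTITION BY PID ORDER BY START_TIME)) > 60 AS flag
    FROM tablename
  )
)
SELECT PID, StayNumber,
       MIN(START_TIME) AS stay_start,
       MAX(END_TIME)   AS stay_end
FROM numbered
GROUP BY PID, StayNumber;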

Creating a new calculated column in SQL

Is there a way to find the solution I need, so that for 2 days there are 2 UDs (because 24 June appears 2 times), and for the rest there are single days?
I am showing the expected output here:
Primary key | UD  | Date
------------+-----+------------------------
1           | 123 | 2015-06-24 00:00:00.000
6           | 456 | 2015-06-24 00:00:00.000
2           | 123 | 2015-06-25 00:00:00.000
3           | 658 | 2015-06-26 00:00:00.000
4           | 598 | 2015-06-27 00:00:00.000
5           | 156 | 2015-06-28 00:00:00.000
No of times | Number of days
------------+---------------
4           | 1
2           | 2
The logic is that 4 users used the application on 1 day and there are 2 users who used the application on 2 days.
You can use two levels of aggregation:
select cnt, count(*)
from (select date, count(*) as cnt
      from t
      group by date
     ) d
group by cnt
order by cnt desc;
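If the intent is instead to count how many UDs were active on a given number of distinct days (as the wording "users who used the application on 1 day / 2 days" suggests), one could group by UD first. This reading of the requirement is an assumption, not part of the answer above:
-- Assumes the same table t and columns UD, date as in the query above.
select day_cnt as number_of_days, count(*) as no_of_uds
from (select UD, count(distinct date) as day_cnt
      from t
      group by UD
     ) u
group by day_cnt
order by day_cnt;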

How to go between a set of dates and times

I have a set of data where one column is a date and time. I have been asked for all the data in the table between two dates and, within those dates, only a certain time window. For example, I want the data between 01/02/2019 - 10/02/2019 and within the times 12:00 AM to 07:00 AM. (My real date ranges span a number of months; I am just using these dates as an example.)
I can cast the date and time into two different columns to separate them out as shown below:
select
    name
    ,dateandtimetest
    ,cast(dateandtimetest as date) as JustDate
    ,cast(dateandtimetest as time) as JustTime
INTO #Test01
from [dbo].[TestTable]
I put this into a test table so that I could see if I could use a BETWEEN on the JustTime column, because I know I can do the BETWEEN on the dates with no problem. My idea was to apply the two filters in two separate tables and perform an inner join to get the results I need:
from #Test01
WHERE justtime between '00:00' and '05:00'
The above code will not give me the data I need. I have been racking my brain over this, so any help would be much appreciated!
The test table I am using to try and get the correct code is shown below:
Name    | DateAndTimeTest
--------+--------------------
Lauren  | 2019-02-01 04:14:00
Paul    | 2019-02-02 08:20:00
Bill    | 2019-02-03 12:00:00
Graham  | 2019-02-05 16:15:00
Amy     | 2019-02-06 02:43:00
Jordan  | 2019-02-06 03:00:00
Sid     | 2019-02-07 15:45:00
Wes     | 2019-02-18 01:11:00
Adam    | 2019-02-11 11:11:00
Rhodesy | 2019-02-11 15:16:00
I have now managed to get the data between the times on a single date using the code below, but I would need this to run for every date over a 3-month period:
select *
from dbo.TestTable
where DateAndTimeTest between '2019-02-11 00:00:00' and '2019-02-11 08:30:00'
You can use SQL similar to the following:
select *
from dbo.TestTable
where (CAST(DateAndTimeTest as date) between '2019-02-11' AND '2019-02-11') AND
(CAST(DateAndTimeTest as time) between '00:00:00' and '08:30:00')
The above query will return all records where the DateAndTimeTest value is in the date range 2019-02-11 to 2019-02-11 and the time is between 12:00 AM and 8:30 AM.
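Applying the same pattern to the ranges stated in the question (dates 2019-02-01 through 2019-02-10, times 12:00 AM to 07:00 AM) would look like this:
-- Same dbo.TestTable as above; only the literals change.
select *
from dbo.TestTable
where (CAST(DateAndTimeTest as date) between '2019-02-01' AND '2019-02-10') AND
      (CAST(DateAndTimeTest as time) between '00:00:00' and '07:00:00')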

Finding a minimum date before another date

Let's say I have two tables. One is a table with information about customer service inquiries, which contains information about the customer and the time the inquiry was placed. The customer's information (in this case, the ID) is saved for all future inquiries.
CUST_ID | INQUIRY_ID | INQUIRY_DATE
--------+------------+-----------------
001     | 34         | 2015-05-03 08:15
001     | 36         | 2015-05-05 13:12
002     | 39         | 2015-05-10 18:43
003     | 42         | 2015-05-12 14:58
003     | 46         | 2015-05-14 07:27
001     | 50         | 2015-05-18 19:06
003     | 55         | 2015-05-20 11:40
The other table contains information about the resolution dates for all customer inquiries.
CUST_ID | RESOLVED_DATE
--------+-----------------
001     | 2015-05-06 12:54
002     | 2015-05-11 08:09
003     | 2015-05-14 19:37
001     | 2015-05-19 16:12
003     | 2015-05-22 08:40
The resolution table doesn't have a key to link to the inquiry table other than the CUST_ID, so in order to calculate the time to resolution, I want to determine the minimum inquiry date before the resolution for EACH resolution date. The resulting table would look like this:
CUST_ID | FIRST_INQUIRY    | RESOLVED_DT
--------+------------------+-----------------
001     | 2015-05-03 08:15 | 2015-05-06 12:54
001     | 2015-05-18 19:06 | 2015-05-19 16:12
002     | 2015-05-10 18:43 | 2015-05-11 08:09
003     | 2015-05-12 14:58 | 2015-05-14 19:37
003     | 2015-05-20 11:40 | 2015-05-22 08:40
At first I just went with min(case when INQUIRY_DATE < RESOLVED_DT), but for people like customers 001 and 003 who have multiple inquiries across different dates, the query would just return the first ever inquiry date, not the first since the last inquiry. Does anyone know how to do this? I'm using Netezza.
One option is to create a subquery for each table (inquiries and resolutions) which numbers the transactions for each CUST_ID using the date. Then the two subqueries can be joined together using this ordered index column along with the CUST_ID.
I also used the INQUIRY_ID in the inquiries table to break a tie, should it occur. There is no way to break a tie in the resolutions table for a given customer and date based on the data you showed us.
SELECT t1.CUST_ID, t1.INQUIRY_DATE AS FIRST_INQUIRY, t2.RESOLVED_DATE AS RESOLVED_DT
FROM
(
    SELECT CUST_ID, INQUIRY_ID, INQUIRY_DATE,
           (SELECT COUNT(*) + 1
            FROM inquiries
            WHERE CUST_ID = t.CUST_ID AND INQUIRY_DATE <= t.INQUIRY_DATE
              AND INQUIRY_ID < t.INQUIRY_ID) AS index
    FROM inquiries AS t
) AS t1
INNER JOIN
(
    SELECT CUST_ID, RESOLVED_DATE,
           (SELECT COUNT(*) + 1
            FROM resolutions
            WHERE CUST_ID = t.CUST_ID AND RESOLVED_DATE < t.RESOLVED_DATE) AS index
    FROM resolutions t
) AS t2
ON t1.CUST_ID = t2.CUST_ID AND t1.index = t2.index
Here is what the subquery tables look like:
inquiries:
CUST_ID | INQUIRY_ID | INQUIRY_DATE     | index
--------+------------+------------------+------
001     | 34         | 2015-05-03 08:15 | 1
001     | 36         | 2015-05-05 13:12 | 2
002     | 39         | 2015-05-10 18:43 | 1
003     | 42         | 2015-05-12 14:58 | 1
003     | 46         | 2015-05-14 07:27 | 2
001     | 50         | 2015-05-18 19:06 | 3
003     | 55         | 2015-05-20 11:40 | 3
resolutions:
CUST_ID | RESOLVED_DATE    | index
--------+------------------+------
001     | 2015-05-06 12:54 | 1
002     | 2015-05-11 08:09 | 1
003     | 2015-05-14 19:37 | 1
001     | 2015-05-19 16:12 | 2
003     | 2015-05-22 08:40 | 2
Note that this solution is not robust to missing data, e.g. if there is an inquiry which was never closed, or a resolution that was never recorded.
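For reference, the same numbering idea can also be written with ROW_NUMBER(), which Netezza supports; this sketch is meant to mirror the correlated-subquery version above rather than being a separately verified solution:
-- Number inquiries and resolutions per customer in date order, then pair them up
-- on the customer and the per-customer sequence number.
SELECT i.CUST_ID,
       i.INQUIRY_DATE  AS FIRST_INQUIRY,
       r.RESOLVED_DATE AS RESOLVED_DT
FROM (
    SELECT CUST_ID, INQUIRY_DATE,
           ROW_NUMBER() OVER (PARTITION BY CUST_ID
                              ORDER BY INQUIRY_DATE, INQUIRY_ID) AS rn
    FROM inquiries
) i
JOIN (
    SELECT CUST_ID, RESOLVED_DATE,
           ROW_NUMBER() OVER (PARTITION BY CUST_ID
                              ORDER BY RESOLVED_DATE) AS rn
    FROM resolutions
) r
  ON r.CUST_ID = i.CUST_ID
 AND r.rn      = i.rn;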