How do I find the difference between one event timestamp and the first event timestamp following it that is not the same event as the original? - sql

I'm trying to find the difference between two timestamps that meet certain criteria. My table has IDs, timestamps, a payment state, and a subtype. For a certain ID, if it has ever entered payment state "unpaid" with subtype "grace_period", I need to find out whether that same ID has ever gone back to payment state "paid" with subtype "active". If so, the end result needs to be the difference between the date they became unpaid and the first date where they're active. I've included a photo for reference.
I've tried using IF/THEN statements and nested case statements, but none of them are really working. Assume that the dates are true datetimes.
Thanks for your help with this!

Use datediff() and case when:
select id,
       datediff(day,
                min(case when paymentstate = 'unpaid' and subtype = 'grace_period' then date end),
                min(case when paymentstate = 'paid' and subtype = 'active' then date end)
       ) as ddiff -- days from the first unpaid/grace_period date to the first paid/active date
from t -- your table
group by id
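Note that this assumes each id goes through at most one unpaid/grace_period to paid/active cycle; if an id can cycle back and forth repeatedly, the window-function approach below handles each occurrence separately.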

Redshift supports the ignore nulls option, so getting the dates is pretty simple:
select t.*,
       datediff(day, date, next_pa_date) as diff_in_days
from (select t.*,
             lead(case when paymentstate = 'paid' and subtype = 'active' then date end ignore nulls) over (partition by id order by date) as next_pa_date
      from t
     ) t
where paymentstate = 'unpaid' and subtype = 'grace_period' and
      next_pa_date is not null;
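For each row, lead(... ignore nulls) looks forward within the id and returns the date of the next paid/active row, so the outer filter keeps only the unpaid/grace_period rows and measures each one against the first active date that follows it.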

Related

SQL - Update column based on date comparison with previous occurrence

I have a huge table;
I want to create a third column based on the time difference between two dates for the same id. If the difference is less than a month, the row is active; if it is between 1 and 2 months, inactive; and anything more than 2 months is dormant. The expected outcome is below. (Note the last entries don't have activity definitions, as I don't have previous occurrences.)
My question is how to do such an operation.
case when date_ >= date_add((select max(date_) from schema.table), -30) then 'Active'
     when date_ < date_add((select max(date_) from schema.table), -30)
          and date_ >= date_add((select max(date_) from schema.table), -60) then 'Inactive'
     when date_ < date_add((select max(date_) from schema.table), -60) then 'Dormant'
end as Activity
The code I came up with is not what I need, as it only checks against the final entry date in the table. What I need is more akin to a for loop, checking each row and comparing it to the previous occurrence.
edit:
By partitioning over id and dense ranking them, I reached something that almost works. I just need to compare to the previous element in the dense rank groups.
Create the base data first using LAG().
Then compare that with the original row.
SELECT ID, DATE,
       CASE
           WHEN PREVIOUS_DATE IS NULL THEN NULL -- no previous occurrence, so no activity definition
           WHEN DATEDIFF(DATE, PREVIOUS_DATE) <= 30 THEN 'Active'
           WHEN DATEDIFF(DATE, PREVIOUS_DATE) BETWEEN 31 AND 60 THEN 'Inactive'
           ELSE 'Dormant'
       END AS Activity
FROM (SELECT ID, DATE,
             LAG(DATE) OVER (PARTITION BY ID ORDER BY DATE) AS PREVIOUS_DATE
      FROM MYTABLE) RS

Return Only Most Recent Instance of Item From Query (Where Multiple Instances Exist)

I have written the following subquery, which is returning instances of item counts from my application's log table.
The idea is that from this subquery I will be pulling information on item counts from a specific date, to be compared to the same information from a different date - info such as, for a given location on the system, what the latest quantity of all items counted within it was.
select
    LOCATION,
    ITEM,
    SUM(CASE
            WHEN ACTION = 'COUNT-OK'
            THEN QUANTITY
            ELSE QUANTITY * CHANGE -- if ACTION <> 'COUNT-OK', we need to adjust the quantity
        END) AS QuantityCalc,
    DATE_TIME
from LOG_TABLE
where ACTION IN ('COUNT-ADJ', 'COUNT-OK')
  AND CAST(DATE_TIME AS DATE) = #CountDate -- declared elsewhere
group by LOCATION, ITEM, DATE_TIME
order by DATE_TIME desc
My issue is with the rows returned. Because these are application logs, there is a row for each count being done on the system, so only the most recent 'QuantityCalc' for a given item in a location would be accurate.
I need a way to return only the most recent instance of a count happening (where the LOCATION and ITEM values are the same). I am using a SUM in the main query which is pulling the QuantityCalc value from this subquery to find the total Quantity by Item and Location per specific count (to compare them side by side). This is currently being thrown off by instances such as the below.
I've attached an example image of what this query returns. My issue is with Item2 in Location B and Item3 in location C, and I'd be looking for the query to ONLY return rows 2, 3, 5 and 8 (including header).
Thank you
You can pre-filter the logs for the latest row per location/item tuple, then aggregate. We would typically use row_number() to enumerate the rows in a subquery:
select
    LOCATION,
    ITEM,
    sum(case when ACTION = 'COUNT-OK' then QUANTITY else QUANTITY * CHANGE end) AS QuantityCalc,
    DATE_TIME
from (
    select l.*,
           row_number() over(partition by LOCATION, ITEM order by DATE_TIME desc) AS RN
    from LOG_TABLE l
    where ACTION IN ('COUNT-ADJ', 'COUNT-OK') and CAST(DATE_TIME AS DATE) = #CountDate
) l
where RN = 1
group by LOCATION, ITEM, DATE_TIME
order by DATE_TIME desc
Side note: the filtering on date_time can probably be optimized; rather than casting your column to date, we can check it directly against a range defined from the date parameter. The syntax of date arithmetic varies widely across databases (and you did not tell us which one you are using), but in standard SQL that would be:
DATE_TIME >= #CountDate and DATE_TIME < #CountDate + interval '1' day

Attribute sales to email campaigns based on time window

Hopefully someone is able to help me with the following. I have two tables, one containing events which are "clicked emails" and one containing "conversions" (=orders). I would like to write attribution logic in BigQuery SQL to match the conversions to the clicked emails based on the following logic:
the matching ID is a "person_id"
conversions that happened within a 5 day (120h) timeframe after the clicked email count towards that mail
if the customer clicked on two different emails within a five day window, the email clicked most recently before the conversion gets attributed the revenue
To know: the clicks table also contains clicks which do not have a conversion, and the conversions table also contains conversions not related to emails.
Desired end result: a table containing a count of all attributed orders and a count of all clicks, by date and campaign name.
I figured I would need to do a left join on person_id, getting only the conversions that might be related to a click. However, now I need to define the window up until which conversions are counted (the 5 days). Maybe I could include this in the where statement? Then, in case the count of conversions is > 1, I need to take only the conversion where the date diff is smallest.
How far I got:
SELECT
    c.person_id,
    c.campaign_name,
    c.datetime,
    s.processed_at,
    c.email,
    s.order_number,
    SUM(s.total_price) AS revenue,
    COUNT(DISTINCT s.order_number) AS orders
FROM
    `klaviyo_de_clicks` c
LEFT JOIN
    `klaviyo_de_conversions` s
ON
    c.person_id = s.person_id
GROUP BY
    1, 2, 3, 4, 5, 6
Thanks for your help!
You can get the emails just before the conversion using union all and window functions.
The idea is to put all the data into a single "result set" with the clicks and conversions separated. Then use window functions to get the previous click information for each conversion.
Your sample code has a bunch of columns not alluded to in your question. But this is the structure of the code:
with cc as (
      select person_id, date_time, conversion_id, null as click_id
      from conversions c
      union all
      select person_id, date_time, null, click_id
      from clicks c
     )
select cc.*,
       (click_date_time >= datetime_sub(date_time, interval 5 day)) as is_click_conversion
from (select cc.*,
             max(case when click_id is not null then date_time end) over (partition by person_id order by date_time) as click_date_time,
             last_value(click_id ignore nulls) over (partition by person_id order by date_time) as prev_click_id
      from cc
     ) cc
where conversion_id is not null;
If you need additional columns, you can use join to bring them in.
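For example, to get the desired end result (attributed orders and clicks by date and campaign), a sketch along these lines could work, assuming the query above is wrapped in a CTE named attributed and that the clicks table also carries click_id and campaign_name (hypothetical names, not taken from the question):
-- Sketch only: join attributed conversions back to the clicks,
-- then roll up by date and campaign.
select c.campaign_name,
       date(c.date_time) as click_date,
       count(distinct c.click_id) as clicks,
       count(a.conversion_id) as attributed_orders
from clicks c
left join attributed a
       on a.prev_click_id = c.click_id
      and a.is_click_conversion
group by 1, 2;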

aggregate multiple rows based on time ranges

I have a customer who uses different devices over a specific period of time, tracked with a valid_from and valid_to date. But every time something changes for a device, a new row is written without any visible change in the row-based data, besides a new valid_from/valid_to.
What I'm trying to do is aggregate the first two rows into one, and the same for rows 3 and 4, while leaving 5 and 6 as they are. All the solutions I've come up with so far only work for a usage history where the user doesn't switch back to device A; everything keeps failing.
I'd really appreciate some help, thanks in advance!
If you know that the previous valid_to is the same as the current valid_from, you can use lag() to identify where a new grouping starts, then use a cumulative sum to assign the grouping, and finally aggregate:
select cust, act_dev, min(valid_from), max(valid_to)
from (select t.*,
             sum(case when prev_valid_to = valid_from then 0 else 1 end) over (partition by cust order by valid_from) as grouping
      from (select t.*,
                   lag(valid_to) over (partition by cust, act_dev order by valid_from) as prev_valid_to
            from t
           ) t
     ) t
group by cust, act_dev, grouping;
Here is a db<>fiddle.
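To see how the grouping forms, consider a hypothetical history (made-up dates) where the customer switches from device A to B and back to A:

cust | act_dev | valid_from | valid_to   | prev_valid_to | grouping
-----+---------+------------+------------+---------------+---------
1    | A       | 2020-01-01 | 2020-02-01 | null          | 1
1    | A       | 2020-02-01 | 2020-03-01 | 2020-02-01    | 1
1    | B       | 2020-03-01 | 2020-04-01 | null          | 2
1    | B       | 2020-04-01 | 2020-05-01 | 2020-04-01    | 2
1    | A       | 2020-05-01 | 2020-06-01 | 2020-03-01    | 3

The final aggregation then returns one row per (cust, act_dev, grouping): A from 2020-01-01 to 2020-03-01, B from 2020-03-01 to 2020-05-01, and A again from 2020-05-01 to 2020-06-01.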

SQL merging result sets on a unique column value

I have 2 similar queries which both work on the same table, and I essentially want to combine their results such that the second query supplies default values for what the first query doesn't return. I've simplified the problem as much as possible here. I'm using Oracle btw.
The table has account information in it for a number of accounts, and there are multiple entries for each account with a commit_date to tell when the account information was inserted. I need to get the account info which was current for a certain date.
The queries take a list of account ids and a date.
Here is the query:
-- Select the row which was current for the accounts for the given date
-- (won't return anything for an account which didn't exist for the given date).
SELECT actr.*
FROM Account_Information actr
WHERE actr.account_id IN (30000316, 30000350, 30000351)
  AND actr.commit_date <= to_date('2010-DEC-30', 'YYYY-MON-DD')
  AND actr.commit_date =
      (
        SELECT MAX(actrInner.commit_date)
        FROM Account_Information actrInner
        WHERE actrInner.account_id = actr.account_id
          AND actrInner.commit_date <= to_date('2010-DEC-30', 'YYYY-MON-DD')
      )
This looks a little ugly, but it returns a single row for each account which was current for the given date. The problem is that it doesn't return anything if the account didn't exist until after the given date.
Selecting the earliest account info for each account is trivial - I don't need to supply a date for this one:
-- Select the earliest row for the accounts.
SELECT actr.*
FROM Account_Information actr
WHERE actr.account_id IN (30000316, 30000350, 30000351)
  AND actr.commit_date =
      (
        SELECT MAX(actrInner.commit_date)
        FROM Account_Information actrInner
        WHERE actrInner.account_id = actr.account_id
      )
But I want to merge the result sets in such a way that:
For each account, if there is account info for it in the first result set - use that.
Otherwise, use the account info from the second result set.
I've researched all of the joins I can use, without success. Unions almost do it, but they only merge unique rows. I want to merge based on the account id in each row.
Sql Merging two result sets - my case is obviously more complicated than that
SQL to return a merged set of results - I might be able to adapt that technique? I'm a programmer being forced to write SQL and I can't quite follow that example well enough to see how I could modify it for what I need.
The standard way to do this is with a left outer join and coalesce. That is, your overall query will look like this:
SELECT ...
FROM defaultQuery
LEFT OUTER JOIN currentQuery ON ...
If you did a SELECT *, each row would correspond to the current account data plus your defaults. With me so far?
Now, instead of SELECT *, for each column you want to return, you do a COALESCE() on matched pairs of columns:
SELECT COALESCE(currentQuery.columnA, defaultQuery.columnA) ...
This will choose the current account data if present, otherwise it will choose the default data.
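Concretely, using the question's two queries as inline views, a sketch might look like this (only account_id and commit_date are spelled out; repeat the COALESCE for the remaining columns):
SELECT COALESCE(cur.account_id, def.account_id) AS account_id,
       COALESCE(cur.commit_date, def.commit_date) AS commit_date
       -- repeat COALESCE(cur.col, def.col) for the other columns
FROM (
       -- second query: fallback row per account (always present)
       SELECT actr.*
       FROM Account_Information actr
       WHERE actr.account_id IN (30000316, 30000350, 30000351)
         AND actr.commit_date = (SELECT MAX(actrInner.commit_date)
                                 FROM Account_Information actrInner
                                 WHERE actrInner.account_id = actr.account_id)
     ) def
LEFT OUTER JOIN (
       -- first query: row current for the given date (may be missing)
       SELECT actr.*
       FROM Account_Information actr
       WHERE actr.account_id IN (30000316, 30000350, 30000351)
         AND actr.commit_date = (SELECT MAX(actrInner.commit_date)
                                 FROM Account_Information actrInner
                                 WHERE actrInner.account_id = actr.account_id
                                   AND actrInner.commit_date <= to_date('2010-DEC-30', 'YYYY-MON-DD'))
     ) cur
  ON cur.account_id = def.account_id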
You can do this more directly using analytic functions:
select *
from (SELECT actr.*,
             max(commit_date) over (partition by account_id) as MaxCommitDate,
             max(case when commit_date <= to_date('2010-DEC-30', 'YYYY-MON-DD') then commit_date end) over
                 (partition by account_id) as MaxCommitDate2
      FROM Account_Information actr
      WHERE actr.account_id IN (30000316, 30000350, 30000351)
     ) t
where (MaxCommitDate2 is not null and Commit_Date = MaxCommitDate2) or
      (MaxCommitDate2 is null and Commit_Date = MaxCommitDate)
The subquery calculates two values, the two possibilities of commit dates. The where clause then chooses the appropriate row, using the logic that you want.
I've combined the other answers. Tried it out at apex.oracle.com. Here's some explanation.
MAX(CASE WHEN commit_date <= to_date('2010-DEC-30', 'YYYY-MON-DD') THEN commit_date END) will give us the latest date on or before Dec 30th, or NULL if there isn't one. Combining that with a COALESCE, we get
COALESCE(MAX(CASE WHEN commit_date <= to_date('2010-DEC-30', 'YYYY-MON-DD') THEN commit_date END), MAX(commit_date)).
Now we take the account id and commit date we have and join them with the original table to get all the other fields. Here's the whole query that I came up with:
SELECT *
FROM Account_Information
JOIN (SELECT account_id,
             COALESCE(MAX(CASE WHEN commit_date <=
                                    to_date('2010-DEC-30', 'YYYY-MON-DD')
                               THEN commit_date END),
                      MAX(commit_date)) AS commit_date
      FROM Account_Information
      WHERE account_id IN (30000316, 30000350, 30000351)
      GROUP BY account_id)
USING (account_id, commit_date);
Note that if you do use USING, you have to use * instead of actr.*.