Count values checking if consecutive - sql

This is my table:
Event Order Timestamp
delFailed 281475031393706 2018-07-24T15:48:08.000Z
reopen 281475031393706 2018-07-24T15:54:36.000Z
reopen 281475031393706 2018-07-24T15:54:51.000Z
I need to count the 'delFailed' and 'reopen' events and calculate #delFailed - #reopen.
The difficulty is that two identical consecutive events must be counted only once, so in this case the result should be "0", not "-1".
This is what I have so far (it is wrong: it gives me -1 instead of 0 because there are two consecutive "reopen" events):
with
events as (
select
event as events,
orders,
"timestamp"
from main_source_execevent
where orders = '281475031393706'
and event in ('reopen', 'delFailed')
order by "timestamp"
),
count_events as (
select
count(events) as CEvents,
events,
orders
from events
group by orders, events
)
select (
(select cevents from count_events where events = 'delFailed') - (select cevents from count_events where events = 'reopen')
) as nAttempts,
orders
from count_events
group by orders
How can I count a run of identical consecutive events only once?

This is a gaps-and-islands problem. You can use two row numbers to detect rows that belong to the same run of consecutive events.
Explanation:
one row number is computed over all rows, ordered by Timestamp;
the other row number is computed per Event, ordered by Timestamp.
SELECT *
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
| event | Order | timestamp | grp | rn |
|-----------|-----------------|----------------------|-----|----|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 |
With those two row numbers you get the result above. Then use grp - rn to tell whether rows belong to the same run of consecutive events.
SELECT *,grp-rn
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
| event | Order | timestamp | grp | rn | grp-rn |
|-----------|-----------------|----------------------|-----|----|----------|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 | 0 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 | 1 |
You can see that rows in the same run of consecutive events have the same grp-rn value, so we can group by grp-rn (together with Event) and count each group once.
Final query:
CREATE TABLE T(
Event VARCHAR(50),
"Order" VARCHAR(50),
Timestamp Timestamp
);
INSERT INTO T VALUES ('delFailed',281475031393706,'2018-07-24T15:48:08.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:36.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:51.000Z');
Query 1:
SELECT
COALESCE(SUM(CASE WHEN Event = 'delFailed' THEN 1 END), 0) -
COALESCE(SUM(CASE WHEN Event = 'reopen' THEN 1 END), 0) result
FROM (
-- one row per island of identical consecutive events
SELECT Event
FROM (
SELECT *
,ROW_NUMBER() OVER(ORDER BY Timestamp) grp
,ROW_NUMBER() OVER(PARTITION BY Event ORDER BY Timestamp) rn
FROM T
) t1
GROUP BY grp - rn,Event
) t2
Results:
| result |
|--------|
| 0 |
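For anyone who wants to try the grp - rn idea without a database server, here is a minimal sketch that runs the same kind of query through Python's built-in sqlite3 module. SQLite stands in for the original database (it needs SQLite 3.25 or newer for window functions), and the table name `t`, the column name `ts` and the helper `count_islands_diff` are made up for this illustration.

```python
import sqlite3


def count_islands_diff(rows):
    """Return #delFailed - #reopen, counting each run of identical events once.

    rows: iterable of (event, timestamp_text) tuples.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (event TEXT, ts TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)
    (diff,) = con.execute("""
        WITH numbered AS (
            SELECT event,
                   ROW_NUMBER() OVER (ORDER BY ts)                    AS grp,
                   ROW_NUMBER() OVER (PARTITION BY event ORDER BY ts) AS rn
            FROM t
        ),
        islands AS (
            -- rows of one run share the same (event, grp - rn) pair
            SELECT event FROM numbered GROUP BY event, grp - rn
        )
        SELECT COALESCE(SUM(event = 'delFailed'), 0)
             - COALESCE(SUM(event = 'reopen'), 0)
        FROM islands
    """).fetchone()
    con.close()
    return diff


sample = [
    ("delFailed", "2018-07-24T15:48:08.000Z"),
    ("reopen", "2018-07-24T15:54:36.000Z"),
    ("reopen", "2018-07-24T15:54:51.000Z"),
]
print(count_islands_diff(sample))
```

The COALESCE calls only matter when one of the two event types never occurs, in which case SUM would otherwise return NULL.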

I would just use lag() to get the first event in any sequence of similar values. Then do the calculation:
select sum( (event = 'reopen')::int ) as num_reopens,
sum( (event = 'delFailed')::int ) as num_delFailed
from (select mse.*,
lag(event) over (partition by orders order by "timestamp") as prev_event
from main_source_execevent mse
where orders = '281475031393706' and
event in ('reopen', 'delFailed')
) e
where prev_event <> event or prev_event is null;
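If you want the difference directly rather than the two counts, the lag() idea can be sketched like this. It is run through Python's sqlite3 module as a stand-in for the original database (SQLite 3.25+ is assumed for window functions); the table name `ev`, the column `ts` and the helper `attempts_per_order` are hypothetical names for the illustration.

```python
import sqlite3


def attempts_per_order(rows):
    """Return [(orders, #delFailed - #reopen)], keeping only the first event of each run.

    rows: iterable of (event, orders, timestamp_text) tuples.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE ev (event TEXT, orders TEXT, ts TEXT)")
    con.executemany("INSERT INTO ev VALUES (?, ?, ?)", rows)
    result = con.execute("""
        SELECT orders,
               SUM(event = 'delFailed') - SUM(event = 'reopen') AS n_attempts
        FROM (
            SELECT event, orders,
                   LAG(event) OVER (PARTITION BY orders ORDER BY ts) AS prev_event
            FROM ev
            WHERE event IN ('reopen', 'delFailed')
        )
        -- keep a row only when it differs from the previous event of the same order
        WHERE prev_event IS NULL OR prev_event <> event
        GROUP BY orders
        ORDER BY orders
    """).fetchall()
    con.close()
    return result


sample = [
    ("delFailed", "281475031393706", "2018-07-24T15:48:08.000Z"),
    ("reopen", "281475031393706", "2018-07-24T15:54:36.000Z"),
    ("reopen", "281475031393706", "2018-07-24T15:54:51.000Z"),
]
print(attempts_per_order(sample))
```

Because the window is partitioned by orders, the same query works for several orders at once if the filter on a single order is dropped.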

Related

Finding created on dates for duplicates in SQL

I have one table of contact records and I'm trying to get the count of duplicate records that were created on each date. I'm not looking to include the original instance in the count. I'm using SQL Server.
Here's an example table
| email | created_on |
| ------------- | ---------- |
| aaa#email.com | 08-16-22 |
| bbb#email.com | 08-16-22 |
| zzz#email.com | 08-16-22 |
| bbb#email.com | 07-12-22 |
| aaa#email.com | 07-12-22 |
| zzz#email.com | 06-08-22 |
| aaa#email.com | 06-08-22 |
| bbb#email.com | 04-21-22 |
And I'm expecting to return
| created_on | dupe_count |
| ---------- | ---------- |
| 08-16-22 | 3 |
| 07-12-22 | 2 |
| 06-08-22 | 0 |
| 04-21-22 | 0 |
I created a derived table with a row number per email, ordered by created date. You then query that and ignore the row where the email was first created (row number 1). This works fine in this case.
Entire code:
Create table #Temp
(
email varchar(50),
dateCreated date
)
insert into #Temp
(email, dateCreated) values
('aaa#email.com', '08-16-22'),
('bbb#email.com', '08-16-22'),
('zzz#email.com', '08-16-22'),
('bbb#email.com', '07-12-22'),
('aaa#email.com', '07-12-22'),
('zzz#email.com', '06-08-22'),
('aaa#email.com', '06-08-22'),
('bbb#email.com', '04-21-22')
select datecreated, sum(case when r = 1 then 0 else 1 end) as duplicates
from
(
Select email, datecreated, ROW_NUMBER() over(partition by email
order by datecreated) as r from #Temp
) b
group by dateCreated
drop table #Temp
Output:
datecreated duplicates
2022-04-21 0
2022-06-08 0
2022-07-12 2
2022-08-16 3
You can calculate the difference between total count of emails for every day and the count of unique emails for the day:
select created_on,
count(email) - count(distinct email) as dupe_count
from cte
group by created_on
It seems I misunderstood your request, and you wanted to take previous created_on dates into account too:
ct as (
select created_on,
(select case when (select count(*)
from cte t2
where t1.email = t2.email and t1.created_on > t2.created_on
) > 0 then email end) as c
from cte t1)
select created_on,
count(distinct c) as dupe_count
from ct
group by created_on
order by 1
It seems that in Oracle it is also possible to do the aggregation in a single query:
select created_on,
count(distinct case when (select count(*)
from cte t2
where t1.email = t2.email and t1.created_on > t2.created_on
) > 0 then email end) as c
from cte t1
group by created_on
order by 1
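The queries in this answer read from a `cte` that is not shown. As a self-contained sketch of the same "has this email been seen on an earlier date" check, here is a version run through Python's sqlite3 module. SQLite is only a stand-in for SQL Server or Oracle, the table name `contacts` and the helper `dupes_per_day` are invented for the example, and the dates are written in ISO format so that plain text comparison orders them correctly.

```python
import sqlite3


def dupes_per_day(rows):
    """Return [(created_on, dupe_count)], newest date first.

    An email counts as a duplicate on a date if it already exists on an earlier date.
    rows: iterable of (email, created_on) tuples with ISO dates (YYYY-MM-DD).
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE contacts (email TEXT, created_on TEXT)")
    con.executemany("INSERT INTO contacts VALUES (?, ?)", rows)
    result = con.execute("""
        WITH flagged AS (
            SELECT t1.created_on, t1.email,
                   EXISTS (SELECT 1 FROM contacts t2
                           WHERE t2.email = t1.email
                             AND t2.created_on < t1.created_on) AS seen_before
            FROM contacts t1
        )
        SELECT created_on,
               COUNT(DISTINCT CASE WHEN seen_before THEN email END) AS dupe_count
        FROM flagged
        GROUP BY created_on
        ORDER BY created_on DESC
    """).fetchall()
    con.close()
    return result


sample = [
    ("aaa#email.com", "2022-08-16"),
    ("bbb#email.com", "2022-08-16"),
    ("zzz#email.com", "2022-08-16"),
    ("bbb#email.com", "2022-07-12"),
    ("aaa#email.com", "2022-07-12"),
    ("zzz#email.com", "2022-06-08"),
    ("aaa#email.com", "2022-06-08"),
    ("bbb#email.com", "2022-04-21"),
]
for row in dupes_per_day(sample):
    print(row)
```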

SQL Server : how to select first and last value within a date range grouped by user

I have the following table (called report) in SQL Server:
+---------+------------------------+---------+
| User_id | timestamp | balance |
+---------+------------------------+---------+
| 1 |2021-04-29 09:31:10.100 | 10 |
| 1 |2021-04-29 09:35:25.800 | 15 |
| 1 |2021-04-29 09:36:30.550 | 5 |
| 2 |2021-04-29 09:38:15.009 | 100 |
+---------+------------------------+---------+
I would like to group the opening balance, closing balance and net movement of all users between a date period (only if the user has a record within that date range)
I would like the following output if my query asked for everything between 2021-04-29 and 2021-04-30
+---------+-----------------+-----------------+--------------+
| User_id | opening_balance | closing_balance | net_movement |
+---------+-----------------+-----------------+--------------+
| 1 | 10 | 5 | -5 |
| 2 | 100 | 100 | 0 |
+---------+-----------------+-----------------+--------------+
I am unclear on the best approach to take. Should I be making multiple queries for the TOP 1 of the balance ([TOP 1 order by timestamp] AND [TOP 1 order by timestamp DESC])? I am also unclear on how to calculate the net movement if I do manage to get the values.
Any clues or nudges in the right direction would be most appreciated.
You can use conditional aggregation:
select user_id,
max(case when seqnum = 1 then balance end) as opening,
max(case when seqnum_desc = 1 then balance end) as closing,
sum(case when seqnum = 1 and seqnum_desc = 1 then 0
when seqnum = 1 then - balance
when seqnum_desc = 1 then balance
end) as movement
from (select r.*,
row_number() over (partition by user_id order by timestamp) as seqnum,
row_number() over (partition by user_id order by timestamp desc) as seqnum_desc
from report r
) r
group by user_id;
You can also do this without explicit aggregation:
select distinct user_id,
first_value(balance) over (partition by user_id order by timestamp) as opening,
first_value(balance) over (partition by user_id order by timestamp desc) as closing,
(first_value(balance) over (partition by user_id order by timestamp desc) -
first_value(balance) over (partition by user_id order by timestamp)
) as movement
from report;
I would expect the two methods to have similar performance. I find that the first is clearer on the intent, though.
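Neither query above restricts the rows to the requested period. A minimal sketch of the first_value() version with a date filter added could look like the following. It runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server (SQLite 3.25+ assumed); the column name `ts`, the helper `balance_summary` and the half-open range [start, end) are assumptions made for this illustration.

```python
import sqlite3


def balance_summary(rows, start, end):
    """Return [(user_id, opening, closing, movement)] for rows with start <= ts < end.

    rows: iterable of (user_id, timestamp_text, balance) tuples.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE report (user_id INTEGER, ts TEXT, balance INTEGER)")
    con.executemany("INSERT INTO report VALUES (?, ?, ?)", rows)
    result = con.execute("""
        SELECT DISTINCT user_id,
               FIRST_VALUE(balance) OVER (PARTITION BY user_id ORDER BY ts)      AS opening,
               FIRST_VALUE(balance) OVER (PARTITION BY user_id ORDER BY ts DESC) AS closing,
               FIRST_VALUE(balance) OVER (PARTITION BY user_id ORDER BY ts DESC)
             - FIRST_VALUE(balance) OVER (PARTITION BY user_id ORDER BY ts)      AS movement
        FROM report
        WHERE ts >= ? AND ts < ?   -- users without rows in the range drop out
        ORDER BY user_id
    """, (start, end)).fetchall()
    con.close()
    return result


sample = [
    (1, "2021-04-29 09:31:10.100", 10),
    (1, "2021-04-29 09:35:25.800", 15),
    (1, "2021-04-29 09:36:30.550", 5),
    (2, "2021-04-29 09:38:15.009", 100),
]
# "Everything between 2021-04-29 and 2021-04-30" is read here as up to, but excluding, 2021-05-01.
print(balance_summary(sample, "2021-04-29", "2021-05-01"))
```

The WHERE clause is applied before the window functions, so the opening and closing balances are taken only from rows inside the range.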

How to combine Cross Join and String Agg in Bigquery with date time difference

I am trying to go from the following table
| user_id | touch      | Date       | Purchase Amount |
|---------|------------|------------|-----------------|
| 1       | Impression | 2020-09-12 | 0               |
| 1       | Impression | 2020-10-12 | 0               |
| 1       | Purchase   | 2020-10-13 | 125$            |
| 1       | Email      | 2020-10-14 | 0               |
| 1       | Impression | 2020-10-15 | 0               |
| 1       | Purchase   | 2020-10-30 | 122             |
| 2       | Impression | 2020-10-15 | 0               |
| 2       | Impression | 2020-10-16 | 0               |
| 2       | Email      | 2020-10-17 | 0               |
to
| user_id | path                           | Number of days between First Touch and Purchase  | Purchase Amount |
|---------|--------------------------------|--------------------------------------------------|-----------------|
| 1       | Impression,Impression,Purchase | 2020-10-13(Purchase) - 2020-09-12 (Impression)   | 125$            |
| 1       | Email,Impression, Purchase     | 2020-10-30(Purchase) - 2020-10-14(Email)         | 122$            |
| 2       | Impression, Impression, Email  | 2020-12-31 (Fixed date) - 2020-10-15(Impression) | 0$              |
In essence, I am trying to create a new row for each unique user in the table every time a 'Purchase' is encountered in a comma-separated string.
Also, take the difference between the first touch and the first purchase for each unique user. When a new row is created, we do the same for the same user, as shown in the example above.
From the little I have gathered I need to use a mixture of cross join and string agg but I tried using a case statement within string agg and was not able to get to the required result.
Is there a better way to do it in SQL (Bigquery).
Thank you
Below is for BigQuery Standard SQL
#standardSQL
select user_id,
string_agg(touch order by date) path,
date_diff(max(date), min(date), day) days,
sum(amount) amount
from (
select user_id, touch, date, amount,
countif(touch = 'Purchase') over win grp
from `project.dataset.table`
window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
Another change: in case there is no Purchase in the path, we calculate the number of days up to a fixed date we have set. How can I add this to the query above?
select user_id,
string_agg(touch order by date) path,
date_diff(if(countif(touch = 'Purchase') = 0, '2020-12-31', max(date)), min(date), day) days,
sum(amount) amount
from (
select user_id, touch, date, amount,
countif(touch = 'Purchase') over win grp
from `project.dataset.table`
window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
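In other words, you need a solution that starts a new row after every 'Purchase' in touch.
A running count of the purchases seen before the current row gives the group number:
Select user_id,
string_agg(touch order by date) as path, -- or whichever aggregates you need
Sum(purchase_amount) as purchase_amount
From
(Select t.*,
-- count only earlier rows, so that each Purchase closes its own group
Coalesce(Sum(case when touch = 'Purchase' then 1 else 0 end)
over (partition by user_id order by date rows between unbounded preceding and 1 preceding), 0) as sm
From t) t
Group by user_id, sm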
We could approach this as a gaps-and-island problem, where every island ends with a purchase. How do we define the groups? By counting how many purchases we have ahead (current row included) - so with a descending sort in the query.
select user_id, string_agg(touch order by date),
min(date) as first_date, max(date) as max_date,
date_diff(max(date), min(date), day) as cnt_days
from (
select t.*,
countif(touch = 'Purchase') over(partition by user_id order by date desc) as grp
from mytable t
) t
group by user_id, grp
You can give each row a value equal to the number of 'Purchase' rows that come before it, and then group on that value (written here in MySQL 8+ syntax, with mytable standing in for your table name):
with r as (select row_number() over(order by t1.user_id, t1.date) rid, t1.* from mytable t1)
select t3.user_id, group_concat(t3.touch order by t3.date), sum(t3.amount), datediff(max(t3.date), min(t3.date))
from (select
(select sum(r1.touch = 'Purchase' AND r1.rid < r2.rid) from r r1) c1, r2.* from r r2
) t3
group by t3.user_id, t3.c1;
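To check the grouping idea against the sample data from the question, here is a small sketch that tags every row with the number of purchases seen before it and then builds the paths. It is not BigQuery: SQLite (3.25+, via Python's sqlite3 module) computes the group number, and Python does the string aggregation, because ordered aggregation in SQLite depends on the version. The table `touches`, the column `day`, the helper `purchase_paths` and the default fixed end date are all assumptions for the illustration.

```python
import sqlite3
from datetime import date
from itertools import groupby


def purchase_paths(rows, fixed_end=date(2020, 12, 31)):
    """Return [(user_id, path, days, amount)], one entry per path ending in a Purchase.

    A trailing path without a Purchase is measured up to fixed_end instead.
    rows: iterable of (user_id, touch, iso_date, amount) tuples.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE touches (user_id INTEGER, touch TEXT, day TEXT, amount INTEGER)")
    con.executemany("INSERT INTO touches VALUES (?, ?, ?, ?)", rows)
    tagged = con.execute("""
        SELECT user_id, touch, day, amount,
               -- purchases strictly before the current row; NULL for the first row becomes 0
               COALESCE(SUM(touch = 'Purchase') OVER (
                   PARTITION BY user_id ORDER BY day
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS grp
        FROM touches
        ORDER BY user_id, day
    """).fetchall()
    con.close()

    out = []
    for (user_id, _grp), group in groupby(tagged, key=lambda r: (r[0], r[4])):
        group = list(group)
        touches = [r[1] for r in group]
        first = date.fromisoformat(group[0][2])
        last = date.fromisoformat(group[-1][2]) if "Purchase" in touches else fixed_end
        out.append((user_id, ",".join(touches), (last - first).days, sum(r[3] for r in group)))
    return out


sample = [
    (1, "Impression", "2020-09-12", 0),
    (1, "Impression", "2020-10-12", 0),
    (1, "Purchase", "2020-10-13", 125),
    (1, "Email", "2020-10-14", 0),
    (1, "Impression", "2020-10-15", 0),
    (1, "Purchase", "2020-10-30", 122),
    (2, "Impression", "2020-10-15", 0),
    (2, "Impression", "2020-10-16", 0),
    (2, "Email", "2020-10-17", 0),
]
for row in purchase_paths(sample):
    print(row)
```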

Query with conditional lag statement

I'm trying to find the previous value of a column where the row meets some criteria. Consider the table:
| user_id | session_id | time | referrer |
|---------|------------|------------|------------|
| 1 | 1 | 2018-01-01 | [NULL] |
| 1 | 2 | 2018-02-01 | google.com |
| 1 | 3 | 2018-03-01 | google.com |
I want to find, for each session, the previous value of session_id where the referrer is NULL. So, for the second AND third rows, the value of parent_session_id should be 1.
However, by just using lag(session_id) over (partition by user_id order by time), I will get parent_session_id=2 for the 3rd row.
I suspect it can be done using a combination of window functions, but I just can't figure it out.
I'd use last_value() in combination with if():
WITH t AS (SELECT * FROM UNNEST([
struct<user_id int64, session_id int64, time date, referrer string>(1, 1, date('2018-01-01'), NULL),
(1,2,date('2018-02-01'), 'google.com'),
(1,3,date('2018-03-01'), 'google.com')
]) )
SELECT
*,
last_value(IF(referrer is null, session_id, NULL) ignore nulls)
over (partition by user_id order by time rows between unbounded preceding and 1 preceding) lastNullrefSession
FROM t
You could even do this via a correlated subquery:
SELECT
session_id,
(SELECT MAX(t2.session_id) FROM yourTable t2
WHERE t2.referrer IS NULL AND t2.session_id < t1.session_id) prev_session_id
FROM yourTable t1
ORDER BY
session_id;
Here is an approach using analytic functions which might work:
WITH cte AS (
SELECT *,
SUM(CASE WHEN referrer IS NULL THEN 1 ELSE 0 END)
OVER (ORDER BY session_id) cnt
FROM yourTable
)
SELECT
session_id,
CASE WHEN cnt = 0
THEN NULL
ELSE MIN(session_id) OVER (PARTITION BY cnt) END prev_session_id
FROM cte
ORDER BY
session_id;
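For engines without IGNORE NULLS, a similar result can be sketched with a windowed MAX over the preceding rows. The example below runs on SQLite (3.25+) through Python's sqlite3 module as a stand-in; the table name `sessions` and the helper `parent_sessions` are hypothetical, and the approach assumes that session_id increases with time for each user, so that the largest earlier id is also the most recent one.

```python
import sqlite3


def parent_sessions(rows):
    """Return [(session_id, parent_session_id)] ordered by user and time.

    parent_session_id is the latest earlier session of the same user with a NULL referrer.
    rows: iterable of (user_id, session_id, time_text, referrer_or_None) tuples.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE sessions (user_id INTEGER, session_id INTEGER, time TEXT, referrer TEXT)"
    )
    con.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?)", rows)
    result = con.execute("""
        SELECT session_id,
               MAX(CASE WHEN referrer IS NULL THEN session_id END) OVER (
                   PARTITION BY user_id ORDER BY time
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING  -- exclude the current row
               ) AS parent_session_id
        FROM sessions
        ORDER BY user_id, time
    """).fetchall()
    con.close()
    return result


sample = [
    (1, 1, "2018-01-01", None),
    (1, 2, "2018-02-01", "google.com"),
    (1, 3, "2018-03-01", "google.com"),
]
print(parent_sessions(sample))
```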

Group rows into sequences using a sliding window on a DateTime column

I have a table that stores timestamped events. I want to group the events into 'sequences' by using 5-min sliding window on the timestamp column, and write the 'sequence ID' (any ID that can distinguish sequences) and 'order in sequence' into another table.
Input - event table:
+----+-------+-----------+
| Id | Name | Timestamp |
+----+-------+-----------+
| 1 | test | 00:00:00 |
| 2 | test | 00:06:00 |
| 3 | test | 00:10:00 |
| 4 | test | 00:14:00 |
+----+-------+-----------+
Desired output - sequence table. Here SeqId is the ID of the starting event, but it doesn't have to be, just something to uniquely identify a sequence.
+---------+-------+----------+
| EventId | SeqId | SeqOrder |
+---------+-------+----------+
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 2 | 2 |
| 4 | 2 | 3 |
+---------+-------+----------+
What would be the best way to do it? This is MSSQL 2008, I can use SSAS and SSIS if they make things easier.
CREATE TABLE #Input (Id INT, Name VARCHAR(20), Time_stamp TIME)
INSERT INTO #Input
VALUES
( 1 ,'test','00:00:00' ),
( 2 ,'test','00:06:00' ),
( 3 ,'test','00:10:00' ),
( 4 ,'test','00:14:00' )
SELECT * FROM #Input;
WITH cte AS -- add a sequential number
(
SELECT *,
ROW_NUMBER() OVER(ORDER BY Id) AS sort
FROM #Input
), cte2 as -- find the Id's with a difference of more than 5min
(
SELECT cte.*,
CASE WHEN DATEDIFF(MI, cte_1.Time_stamp,cte.Time_stamp) < 5 THEN 0 ELSE 1 END as GrpType
FROM cte
LEFT OUTER JOIN
cte as cte_1 on cte.sort =cte_1.sort +1
), cte3 as -- assign a SeqId
(
SELECT GrpType, Time_Stamp,ROW_NUMBER() OVER(ORDER BY Time_stamp) SeqId
FROM cte2
WHERE GrpType = 1
), cte4 as -- find the Time_Stamp range per SeqId
(
SELECT cte3.*,cte_2.Time_stamp as TS_to
FROM cte3
LEFT OUTER JOIN
cte3 as cte_2 on cte3.SeqId =cte_2.SeqId -1
)
-- final query
SELECT
t.Id,
cte4.SeqId,
ROW_NUMBER() OVER(PARTITION BY cte4.SeqId ORDER BY t.Time_stamp) AS SeqOrder
FROM cte4 INNER JOIN #Input t ON t.Time_stamp>=cte4.Time_stamp AND (t.Time_stamp <cte4.TS_to OR cte4.TS_to IS NULL);
This code is slightly more complex, but it returns the expected output (which Gordon Linoff's solution doesn't), and it's even slightly faster.
You seem to want things grouped together when they are less than five minutes apart. You can assign the groups by getting the previous time stamp and marking the beginning of a group. You then need to do a cumulative sum to get the group id:
with e as (
select e.*,
-- flag = 1 marks the start of a group: no previous event, or a gap of 5 minutes or more
(case when datediff(minute, prev_timestamp, timestamp) < 5 then 0 else 1 end) as flag
from (select e.*,
(select top 1 e2.timestamp
from events e2
where e2.timestamp < e.timestamp
order by e2.timestamp desc
) as prev_timestamp
from events e
) e
)
select e.Id as eventId, e.seqId,
row_number() over (partition by seqId order by timestamp) as seqOrder
from (select e.*, (select sum(flag) from e e2 where e2.timestamp <= e.timestamp) as seqId
from e
) e;
By the way, this logic is easier to express in SQL Server 2012+ because the window functions are more powerful.
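As a rough idea of what the window-function version might look like, here is a sketch with LAG() and a running SUM(). It runs on SQLite (3.25+) through Python's sqlite3 module rather than on SQL Server, so the date arithmetic uses strftime('%s', ...) instead of DATEDIFF; the table `events`, the column `ts`, the helper `sequences` and the 300-second threshold parameter are assumptions for the illustration. Here seq_id is a running counter, not the id of the starting event.

```python
import sqlite3


def sequences(rows, gap_seconds=300):
    """Return [(event_id, seq_id, seq_order)] ordered by timestamp.

    A new sequence starts at the first event and after any gap of gap_seconds or more.
    rows: iterable of (id, name, 'YYYY-MM-DD HH:MM:SS') tuples.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (id INTEGER, name TEXT, ts TEXT)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    result = con.execute("""
        WITH with_prev AS (
            SELECT id, ts, LAG(ts) OVER (ORDER BY ts) AS prev_ts
            FROM events
        ),
        flagged AS (
            SELECT id, ts,
                   CASE WHEN prev_ts IS NULL
                          OR CAST(strftime('%s', ts) AS INTEGER)
                           - CAST(strftime('%s', prev_ts) AS INTEGER) >= ?
                        THEN 1 ELSE 0 END AS starts_seq
            FROM with_prev
        ),
        grouped AS (
            -- the running sum of the start flags numbers the sequences
            SELECT id, ts, SUM(starts_seq) OVER (ORDER BY ts) AS seq_id
            FROM flagged
        )
        SELECT id, seq_id,
               ROW_NUMBER() OVER (PARTITION BY seq_id ORDER BY ts) AS seq_order
        FROM grouped
        ORDER BY ts
    """, (gap_seconds,)).fetchall()
    con.close()
    return result


sample = [
    (1, "test", "2020-01-01 00:00:00"),
    (2, "test", "2020-01-01 00:06:00"),
    (3, "test", "2020-01-01 00:10:00"),
    (4, "test", "2020-01-01 00:14:00"),
]
for row in sequences(sample):
    print(row)
```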