How to identify the first response message to each previous message? - google-bigquery

I have a table that contains messages from an agent and a customer. I need to identify the first response from the agent to the customer, for each message sent by the customer and vice versa.
I have been trying to do this without success using the following query:
WITH HAVE AS
(SELECT "A" AS CONVERSATIONID, "CONSUMER" AS SENTBY, 1631929267942 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929298918 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929307192 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "CONSUMER" AS SENTBY, 1631929313065 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929317717 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929333779 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "CONSUMER" AS SENTBY, 1631929337240 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929404611 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "CONSUMER" AS SENTBY, 1631929448033 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929477379 AS TIMEL
)
SELECT
CONVERSATIONID,
SENTBY,
TIMEL,
CASE
WHEN FIRST_VALUE(TIMEL) OVER (PARTITION BY CONVERSATIONID, SENTBY ORDER BY TIMEL ASC) = TIMEL THEN 1
ELSE 0
END AS FIRST_MESSAGE,
FROM
HAVE
ORDER BY CONVERSATIONID, TIMEL
Is there a way to achieve the following result (where the red marks indicate that the value should be 1):
Any help would be greatly appreciated.

Consider below approach
select * except(grp),
if(
row_number() over(partition by conversationid, grp order by timel) = 1
and grp > 0, 1, 0
) first_message
from (
select * except(isnew),
countif(isnew) over(partition by conversationid order by timel) grp
from (
select *,
sentby != lag(sentby) over(partition by conversationid order by timel) isnew
from have
)
)
If applied to the sample data in your question, the output is
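The query's three steps (flag a change of sender, turn the flags into a running group number, mark the first row of every group after the first) can be traced outside BigQuery. The following is only a rough Python sketch of the same idea on the question's sample data. The function name `flag_first_responses` is made up for illustration, and the steps are collapsed into one pass, on the assumption that the first row of each new group is exactly the row where the sender changes.

```python
# Sample rows from the question: (conversationid, sentby, timel).
rows = [
    ("A", "CONSUMER", 1631929267942),
    ("A", "AGENT", 1631929298918),
    ("A", "AGENT", 1631929307192),
    ("A", "CONSUMER", 1631929313065),
    ("A", "AGENT", 1631929317717),
    ("A", "AGENT", 1631929333779),
    ("A", "CONSUMER", 1631929337240),
    ("A", "AGENT", 1631929404611),
    ("A", "CONSUMER", 1631929448033),
    ("A", "AGENT", 1631929477379),
]


def flag_first_responses(rows):
    """Return (conversationid, sentby, timel, first_message) tuples."""
    flagged = []
    previous_sender = {}  # last sender seen per conversation
    for conversation, sender, timel in sorted(rows, key=lambda r: (r[0], r[2])):
        last = previous_sender.get(conversation)
        # A row is a first response when the sender differs from the previous
        # message in the same conversation; the opening message never counts.
        is_first = int(last is not None and last != sender)
        flagged.append((conversation, sender, timel, is_first))
        previous_sender[conversation] = sender
    return flagged


flags = [row[3] for row in flag_first_responses(rows)]
print(flags)  # [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
```

The opening CONSUMER message gets 0 because nothing precedes it, which corresponds to the `grp > 0` condition in the query.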

Related

How to retain values from one message to the first reply to the message?

This question is an extension of How to identify the first response message to each previous message?, but given that it deals with different logic, I felt it best to raise another question.
This is the data that I now have thanks to @Mikhail:
This is the code used to generate the original data:
WITH HAVE AS
(SELECT "A" AS CONVERSATIONID, "CONSUMER" AS SENTBY, 1631929267942 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929298918 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929307192 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "CONSUMER" AS SENTBY, 1631929313065 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929317717 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929333779 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "CONSUMER" AS SENTBY, 1631929337240 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929404611 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "CONSUMER" AS SENTBY, 1631929448033 AS TIMEL UNION ALL
SELECT "A" AS CONVERSATIONID, "AGENT" AS SENTBY, 1631929477379 AS TIMEL
)
Achieved using this code:
select * except(grp),
if(
row_number() over(partition by conversationid, grp order by timel) = 1
and grp > 0, 1, 0
) first_message
from (
select * except(isnew),
countif(isnew) over(partition by conversationid order by timel) grp
from (
select *,
sentby != lag(sentby) over(partition by conversationid order by timel) isnew
from have
)
)
What I need to do now is also get the value in TIMEL for the originating message and apply it to the first response. In this scenario, the Consumer asks a question at 1631929267942 (record 1) and the Agent responds at 1631929298918 (record 2). The Consumer could potentially ask multiple questions before a reply from the Agent, but only the first of those questions should be used as the TIMEL value.
Thanks again for the help.
The query below replaces the TIMEL value of the first response from the AGENT with that of the CONSUMER, i.e. the time at which the CONSUMER message originated.
select CONVERSATIONID, SENTBY, TIMEL,
case when
new_col > lag(new_col) over (partition by conversationid, grp order by timel)
then min(timel) over(partition by conversationid, grp order by timel)
else timel
end as ORIGIN_TIMEL,
from(
select * except(isnew),
countif(isnew) over(partition by conversationid order by timel) grp
from (
select *,
new_col < lag(new_col) over(partition by conversationid order by timel) isnew
from(
select *,
if(sentby = 'CONSUMER', 0, 1) new_col,
from `project.dataset.table`
order by conversationid, timel
)
)
)
order by conversationid, timel
I've added a new row for CONSUMER to verify if the first CONSUMER response time is being used to replace the first AGENT response time.
Output of the above query:
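To make the grouping logic easier to follow, here is a minimal Python sketch of what the query appears to do: a new group starts whenever the sender switches from AGENT back to CONSUMER, and the first AGENT row after a CONSUMER row takes the earliest timestamp of its group. The function name `origin_times` and the second CONSUMER row (timestamp 1631929270000) are assumptions added for illustration, and the sketch assumes only the two sender values CONSUMER and AGENT occur.

```python
# (conversationid, sentby, timel); the second row is a hypothetical extra
# CONSUMER message sent before the AGENT replied.
rows = [
    ("A", "CONSUMER", 1631929267942),
    ("A", "CONSUMER", 1631929270000),
    ("A", "AGENT", 1631929298918),
    ("A", "AGENT", 1631929307192),
    ("A", "CONSUMER", 1631929313065),
    ("A", "AGENT", 1631929317717),
]


def origin_times(rows):
    """Return (conversationid, sentby, timel, origin_timel) tuples."""
    result = []
    previous_sender = {}  # last sender seen per conversation
    group_start = {}      # first timestamp of the current exchange
    for conversation, sender, timel in sorted(rows, key=lambda r: (r[0], r[2])):
        last = previous_sender.get(conversation)
        # A new exchange starts with the conversation itself or whenever the
        # CONSUMER writes again after an AGENT message.
        if last is None or (last == "AGENT" and sender == "CONSUMER"):
            group_start[conversation] = timel
        # Only the first AGENT reply inherits the start of the exchange.
        if last == "CONSUMER" and sender == "AGENT":
            origin = group_start[conversation]
        else:
            origin = timel
        result.append((conversation, sender, timel, origin))
        previous_sender[conversation] = sender
    return result


origins = [row[3] for row in origin_times(rows)]
print(origins)
```

With this input, the first AGENT reply carries 1631929267942 (the first of the two CONSUMER questions), and the reply to the later question carries 1631929313065.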

How do I select only one row based on student's phone type priority

Hello and thank you for helping me.
A student can have multiple email addresses where the type is: personal, work, other, school
I need to write a query that selects one email address for a student. If a student has more than one email address, the address selected needs to be based on the email type.
For example,
If a student has a personal email then I select only the personal email.
If a student does not have a personal email but has the other email types then I will select their school email address.
The priority order of the email types is: personal, school, work, other. The goal is to select only one record based on the priority list (personal, school, work, and then other).
Student table structure
student_id
email_type
email_addr
You can use window functions and a case expression:
select *
from (
select t.*,
row_number() over(
partition by student_id
order by case email_type
when 'personal' then 1
when 'school' then 2
when 'work' then 3
else 4
end
) rn
from mytable t
) t
where rn = 1
In Oracle, you can shorten this with decode():
select *
from (
select t.*,
row_number() over(
partition by student_id
order by decode(email_type, 'personal', 1, 'school', 2, 'work', 3, 4)
) rn
from mytable t
) t
where rn = 1
Another typical solution is fetch first row with ties:
select t.*
from mytable t
order by row_number() over(
partition by student_id
order by decode(email_type, 'personal', 1, 'school', 2, 'work', 3, 4)
)
fetch first row with ties
You can use an aggregate function with KEEP:
SELECT student_id,
MAX( email_type ) KEEP (
DENSE_RANK FIRST
ORDER BY DECODE( email_type, 'personal', 1, 'school', 2, 'work', 3, 4 )
) AS email_type,
MAX( email_addr ) KEEP (
DENSE_RANK FIRST
ORDER BY DECODE( email_type, 'personal', 1, 'school', 2, 'work', 3, 4 )
) AS email_addr
FROM student
GROUP BY student_id
For some test data:
CREATE TABLE student ( student_id, email_type, email_addr ) AS
SELECT 1, 'school', 'person1#school' FROM DUAL UNION ALL
SELECT 1, 'work', 'person1#work' FROM DUAL UNION ALL
SELECT 1, 'other', 'person1#other' FROM DUAL UNION ALL
SELECT 2, 'personal', 'person2#home' FROM DUAL UNION ALL
SELECT 2, 'other', 'person2#other' FROM DUAL;
This outputs:
STUDENT_ID | EMAIL_TYPE | EMAIL_ADDR
---------: | :--------- | :-------------
1 | school | person1#school
2 | personal | person2#home
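All of the answers above rest on the same idea: map each email type to a numeric rank and keep the row with the lowest rank per student. As a rough illustration outside SQL, the sketch below does this in Python on the same test data. The names `PRIORITY` and `pick_email` are made up for the example.

```python
# Rank per email type; any type not listed here is treated as rank 4,
# mirroring the ELSE branch / DECODE default in the queries.
PRIORITY = {"personal": 1, "school": 2, "work": 3}

students = [
    (1, "school", "person1#school"),
    (1, "work", "person1#work"),
    (1, "other", "person1#other"),
    (2, "personal", "person2#home"),
    (2, "other", "person2#other"),
]


def pick_email(rows):
    """Return {student_id: (email_type, email_addr)} for the best-ranked row."""
    best = {}
    for student_id, email_type, email_addr in rows:
        rank = PRIORITY.get(email_type, 4)
        # Keep the row only if it beats what we have seen for this student.
        if student_id not in best or rank < best[student_id][0]:
            best[student_id] = (rank, email_type, email_addr)
    return {sid: (etype, addr) for sid, (_, etype, addr) in best.items()}


chosen = pick_email(students)
print(chosen)
```

Student 1 ends up with the school address and student 2 with the personal one, matching the output table above.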

How can i find rows before a specific value?

I have the following rows, and I want to select all the rows before the type "shop". I tried using a CASE expression in the WHERE clause but didn't get any result. How can I do it?
|id|visitnumber|type |
|01| 1|register|
|01| 2|visit |
|01| 3|visit |
|01| 4|shop |
|01| 5|visit |
For example, what I want to get is the number of visits before type = "shop".
This would be very helpful, because I am trying to get all the actions that happened before a specific event in BigQuery.
|id|numberofvisits|
|01| 3|
One method uses correlated subqueries:
select id, count(*)
from t
where visitnumber < (select min(t2.visitnumber) from t t2 where t2.id = t.id and type = 'shop')
group by id;
However, in BigQuery, I prefer an approach using window functions:
select id, countif(visitnumber < visitnumber_shop)
from (select t.*,
min(case when type = 'shop' then visitnumber end) over (partition by id) as visitnumber_shop
from t
) t
group by id;
This has the advantage of keeping all ids even those that don't have a "shop" type.
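As a rough cross-check of the window-function version, the same two steps (find the first "shop" visit per id, then count the rows before it) can be written in Python. This is only a sketch: the function name `visits_before_first_shop` and the id "02" rows without a shop are assumptions added to show how an id with no "shop" row is kept with a count of 0.

```python
# (id, visitnumber, type); the "02" rows are hypothetical and have no shop.
rows = [
    ("01", 1, "register"),
    ("01", 2, "visit"),
    ("01", 3, "visit"),
    ("01", 4, "shop"),
    ("01", 5, "visit"),
    ("02", 1, "register"),
    ("02", 2, "visit"),
]


def visits_before_first_shop(rows):
    """Return {id: number of rows before that id's first 'shop' row}."""
    # Step 1: the MIN(CASE WHEN type = 'shop' ...) OVER (PARTITION BY id) part.
    first_shop = {}
    for id_, visitnumber, type_ in rows:
        if type_ == "shop":
            first_shop[id_] = min(visitnumber, first_shop.get(id_, visitnumber))
    # Step 2: the COUNTIF(visitnumber < visitnumber_shop) part.
    counts = {}
    for id_, visitnumber, _ in rows:
        counts.setdefault(id_, 0)  # every id is kept, even without a shop
        shop_at = first_shop.get(id_)
        if shop_at is not None and visitnumber < shop_at:
            counts[id_] += 1
    return counts


result = visits_before_first_shop(rows)
print(result)  # {'01': 3, '02': 0}
```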
One option uses a subquery for filtering:
select id, count(*) number_of_visits
from mytable t
where t.visit_number < (
select min(t1.visit_number)
from mytable t1
where t1.id = t.id and t1.type = 'shop'
)
group by id
You can also use window functions:
select id, count(*) number_of_visits
from (
select
t.*,
countif(type = 'shop') over(partition by id order by visit_number) has_shop
from mytable t
) t
where has_shop = 0
group by id
Below option is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_LENGTH(SPLIT(REGEXP_EXTRACT(',' || STRING_AGG(type ORDER BY visitnumber), r'(.*?),shop'))) - 1 AS number_of_visits_before_first_shop
FROM `project.dataset.table`
GROUP BY id
You can test, play with above using dummy data as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT '01' id, 1 visitnumber, 'register' type UNION ALL
SELECT '01', 2, 'visit' UNION ALL
SELECT '01', 3, 'visit' UNION ALL
SELECT '01', 4, 'shop' UNION ALL
SELECT '01', 5, 'visit' UNION ALL
SELECT '02', 1, 'register' UNION ALL
SELECT '02', 2, 'visit' UNION ALL
SELECT '02', 3, 'visit' UNION ALL
SELECT '03', 1, 'shop' UNION ALL
SELECT '03', 2, 'shop' UNION ALL
SELECT '03', 3, 'visit'
)
SELECT id,
ARRAY_LENGTH(SPLIT(REGEXP_EXTRACT(',' || STRING_AGG(type ORDER BY visitnumber), r'(.*?),shop'))) - 1 AS number_of_visits_before_first_shop
FROM `project.dataset.table`
GROUP BY id
with result
Row id number_of_visits_before_first_shop
1 01 3
2 02 null
3 03 0
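The string trick in this answer can be hard to read at first, so here is a small Python sketch of the same steps: join the types in visit order, take everything before the first ",shop" with a non-greedy match, and count the comma-separated pieces. The function name `count_before_first_shop` is made up for the example, and Python's `re` module stands in for BigQuery's REGEXP_EXTRACT.

```python
import re


def count_before_first_shop(types_in_order):
    """Count the entries before the first 'shop', or None if there is none."""
    # The leading comma makes the pattern work when 'shop' is the first type.
    joined = "," + ",".join(types_in_order)
    match = re.search(r"(.*?),shop", joined)
    if match is None:
        return None
    # The captured text starts with a comma, so the split yields one extra
    # empty piece, hence the "- 1".
    return len(match.group(1).split(",")) - 1


by_id = {
    "01": ["register", "visit", "visit", "shop", "visit"],
    "02": ["register", "visit", "visit"],
    "03": ["shop", "shop", "visit"],
}
counts = {id_: count_before_first_shop(types) for id_, types in by_id.items()}
print(counts)  # {'01': 3, '02': None, '03': 0}
```

One caveat that applies to the pattern itself: it would also match a type that merely begins with "shop", so it is safest when the type values are known in advance.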
This is the query I ran on BigQuery with an Analytics 360 test dataset:
select
id,
visitnumber,
countif(hit_number < hitnumber_quickviewclick) as hitsprev_quickviewclick
from (
select
a.fullVisitorID as id,
a.visitnumber as visitnumber,
h.hitNumber as hit_number,
MIN (case when h.eventInfo.eventAction = 'Quickview Click' then h.hitNumber end) over (partition by a.fullVisitorID) as hitnumber_quickviewclick
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170725` as a
CROSS JOIN UNNEST(hits) as h
) as T
group by 1,2;
I wanted a query that finds the total number of hits before the event action 'Quickview Click' occurred. If this is wrong or can be improved, let me know!
Thanks a lot, guys!
This is how I would approach in SQL in general:
select count(*)
from yourtable yt
where type = 'visit' and not exists (
select 1
from yourtable yt2
where yt.id > yt2.id and yt2.type = 'shop'
)
However, I would very much think about situations when we want to find visits before the next shop... And the next shop... And the next shop. For that purpose you could find out the ids of shop and group by intervals.

SQL Query for finding longest streak of wins

I have data like below -
Year,winning_country
2001,IND
2002,IND
2003,IND
2004,AUS
2005,AUS
2006,SA
2007,SA
2008,SA
2009,IND
2010,IND
2011,IND
2012,IND
2013,AUS
2014,AUS
2015,SA
2016,NZ
2017,SL
2018,IND
The question here is to find out the longest streak of wins for each country and desired output will be like below -
Country,no_of_wins
IND,4
AUS,2
SA,3
SL,1
NZ,1
Can someone help here.
This is a gaps and islands problem, but the simplest method is to subtract a sequence from the year. So, to get all the sequences:
select country, count(*) as streak,
min(year) as from_year, max(year) as to_year
from (select year, country,
row_number() over (partition by country order by year) as seqnum
from t
) t
group by country, (year - seqnum);
To get the longest per country, aggregate again or use window functions:
select country, streak
from (select country, count(*) as streak,
min(year) as from_year, max(year) as to_year,
row_number() over (partition by country order by count(*) desc) as seqnum_2
from (select year, country,
row_number() over (partition by country order by year) as seqnum
from t
) t
group by country, (year - seqnum)
) cy
where seqnum_2 = 1;
I prefer using row_number() to get the longest streak because it allows you to also get the years when it occurred.
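The "subtract a sequence from the year" step is the heart of this answer: within one country, consecutive years minus a running row number give the same constant, so that constant identifies a streak. A minimal Python sketch of the idea on the question's data follows. The function name `longest_streaks` is made up, and the sketch assumes one row per year.

```python
from collections import Counter

wins = [
    (2001, "IND"), (2002, "IND"), (2003, "IND"), (2004, "AUS"), (2005, "AUS"),
    (2006, "SA"), (2007, "SA"), (2008, "SA"), (2009, "IND"), (2010, "IND"),
    (2011, "IND"), (2012, "IND"), (2013, "AUS"), (2014, "AUS"), (2015, "SA"),
    (2016, "NZ"), (2017, "SL"), (2018, "IND"),
]


def longest_streaks(wins):
    """Return {country: length of its longest run of consecutive years}."""
    seqnum = Counter()   # row_number() over (partition by country order by year)
    islands = Counter()  # (country, year - seqnum) -> rows in that streak
    for year, country in sorted(wins):
        seqnum[country] += 1
        islands[(country, year - seqnum[country])] += 1
    longest = {}
    for (country, _), length in islands.items():
        longest[country] = max(longest.get(country, 0), length)
    return longest


streaks = longest_streaks(wins)
print(streaks)
```

For IND, the years 2009 to 2012 have row numbers 4 to 7, so all four map to the key 2005 and form the streak of length 4.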
Looks like a gaps-and-islands problem.
The SQL below calculates some ranking based on 2 row_number.
Then it's just a matter of grouping.
SELECT q2.Country, MAX(q2.no_of_wins) AS no_of_wins
FROM
(
SELECT q1.winning_country as Country,
COUNT(*) AS no_of_wins
FROM
(
SELECT t.Year, t.winning_country,
(ROW_NUMBER() OVER (ORDER BY t.Year ASC) -
ROW_NUMBER() OVER (PARTITION BY t.winning_country ORDER BY t.Year)) AS rnk
FROM yourtable t
) q1
GROUP BY q1.winning_country, q1.rnk
) q2
GROUP BY q2.Country
ORDER BY MAX(q2.no_of_wins) DESC
If Redshift supports analytic functions, the query would be as follows.
with t1 as
(
select 2001 as year,'IND' as cntry from dual union
select 2002,'IND' from dual union
select 2003,'IND' from dual union
select 2004,'AUS' from dual union
select 2005,'AUS' from dual union
select 2006,'SA' from dual union
select 2007,'SA' from dual union
select 2008,'SA' from dual union
select 2009,'IND' from dual union
select 2010,'IND' from dual union
select 2011,'IND' from dual union
select 2012,'IND' from dual union
select 2013,'AUS' from dual union
select 2014,'AUS' from dual union
select 2015,'SA' from dual union
select 2016,'NZ' from dual union
select 2017,'SL' from dual union
select 2018,'IND' from dual) ,
t2 as (select year, cntry, year - row_number() over (partition by cntry order by year) as grpBy from t1 order by cntry),
t3 as (select cntry, count(grpBy) as consWins from t2 group by cntry, grpBy),
res as (select cntry, consWins, row_number() over (partition by cntry order by consWins desc) as rnk from t3)
select cntry, consWins from res where rnk=1;
Hope this helps.
Here is a solution that uses a Redshift Python UDF.
There may be simpler ways to achieve the same but this is a good example of how to create a simple UDF.
create table temp_c (competition_year int ,winning_country varchar(4));
insert into temp_c (competition_year, winning_country)
values
(2001,'IND'),
(2002,'IND'),
(2003,'IND'),
(2004,'AUS'),
(2005,'AUS'),
(2006,'SA'),
(2007,'SA'),
(2008,'SA'),
(2009,'IND'),
(2010,'IND'),
(2011,'IND'),
(2012,'IND'),
(2013,'AUS'),
(2014,'AUS'),
(2015,'SA'),
(2016,'NZ'),
(2017,'SL'),
(2018,'IND')
;
create or replace function find_longest_streak(InputStr varChar)
returns integer
stable
as $$
    # InputStr is a comma-separated list of years in ascending order
    MaxStreak=0
    ThisStreak=0
    ThisYearStr=''
    LastYear=0
    for ThisYearStr in InputStr.split(','):
        if int(ThisYearStr) == LastYear + 1:
            ThisStreak+=1
        else:
            if ThisStreak > MaxStreak:
                MaxStreak=ThisStreak
            ThisStreak=1
        LastYear=int(ThisYearStr)
    # the streak still running at the end of the list must be compared too
    return max(MaxStreak,ThisStreak)
$$ language plpythonu;
select winning_country,
find_longest_streak(listagg(competition_year,',') within group (order by competition_year))
from temp_c
group by winning_country
order by 2 desc
;
How about something like...
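SELECT
winning_country,
COUNT(*)
FROM yourtable
GROUP BY winning_country
HAVING MAX(year) - MIN(year) = COUNT(year) - 1
This assumes no duplicate entries (yourtable is a placeholder name). Note that it returns only countries whose wins all fall in one unbroken run of years, so a country with several separate streaks is left out.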
Creating a session abstraction does the trick:
WITH winning_changes AS (
SELECT *,
CASE WHEN LAG(winning_country) OVER (ORDER BY year) <> winning_country THEN 1 ELSE 0 END AS same_winner
FROM winners
),
sequences AS (
SELECT *,
SUM(same_winner) OVER (ORDER BY year) AS winning_session
FROM winning_changes
),
streaks AS (
SELECT winning_country AS country,
winning_session,
COUNT(*) streak
FROM sequences
GROUP BY 1,2
)
SELECT country,
MAX(streak) AS no_of_wins
FROM streaks
GROUP BY 1;
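The "session" idea here is that a new session starts whenever the winner differs from the previous year's winner, and a streak is the number of rows in one session. In Python, `itertools.groupby` groups consecutive equal keys in much the same way, so a rough sketch looks like this. The function name `longest_streaks_by_session` is made up, and only the 2009 to 2018 part of the question's data is used to keep the example short.

```python
from itertools import groupby

# Subset of the question's data (2009-2018).
wins = [
    (2009, "IND"), (2010, "IND"), (2011, "IND"), (2012, "IND"),
    (2013, "AUS"), (2014, "AUS"), (2015, "SA"), (2016, "NZ"),
    (2017, "SL"), (2018, "IND"),
]


def longest_streaks_by_session(wins):
    """Return {country: longest run of consecutive rows won by that country}."""
    longest = {}
    # groupby starts a new group each time the winner changes between
    # neighbouring rows, like the LAG comparison plus running SUM in the query.
    for country, run in groupby(sorted(wins), key=lambda w: w[1]):
        length = sum(1 for _ in run)
        longest[country] = max(longest.get(country, 0), length)
    return longest


sessions = longest_streaks_by_session(wins)
print(sessions)
```

Like the query, this looks only at neighbouring rows, so it treats two wins as consecutive even if a year is missing between them.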

Count number of events before and after a particular event in SQL?

I have a table containing dates and events. There is an event named 'A'. I want to find out how many events occurred before and after event 'A' in BigQuery SQL.
For example,
User Date Events
123 2018-02-13 D
123 2018-02-12 B
123 2018-02-10 C
123 2018-02-11 A
123 2018-02-01 X
The answer would be something like this.
User Event Before After
123 A 2 2
I have tried many queries but none of them work. Any idea how to solve this problem?
below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.events` AS (
SELECT 123 user, '2018-02-13' dt, 'D' event UNION ALL
SELECT 123, '2018-02-12', 'B' UNION ALL
SELECT 123, '2018-02-11', 'A' UNION ALL
SELECT 123, '2018-02-10', 'C' UNION ALL
SELECT 123, '2018-02-01', 'X'
)
SELECT user, event, before, after
FROM (
SELECT user, event,
COUNT(1) OVER(PARTITION BY user ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) before,
COUNT(1) OVER(PARTITION BY user ORDER BY dt ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING ) after
FROM `project.dataset.events`
)
WHERE event = 'A'
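The two window frames in this query count the rows strictly before and strictly after the current row once the rows are ordered by date. The sketch below shows the same idea in Python, as an illustration only: after sorting, the position of 'A' is the "before" count and the remaining rows are the "after" count. The function name `before_after` is made up, and the sketch handles a single user, so the PARTITION BY user part is left out.

```python
# (dt, event) for user 123, as in the answer's sample data.
events = [
    ("2018-02-13", "D"),
    ("2018-02-12", "B"),
    ("2018-02-11", "A"),
    ("2018-02-10", "C"),
    ("2018-02-01", "X"),
]


def before_after(events, target="A"):
    """Return (event, before, after) for every occurrence of target."""
    # ISO-formatted date strings sort chronologically.
    ordered = sorted(events)
    found = []
    for position, (_, event) in enumerate(ordered):
        if event == target:
            # Rows before = index in the ordered list; rows after = the rest.
            found.append((event, position, len(ordered) - position - 1))
    return found


print(before_after(events))  # [('A', 2, 2)]
```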
For each "A", you can get the number of events to the next "A" using row_number() and lead():
select t.*,
(lead(seqnum) over (order by date) - seqnum - 1) as num_other_events
from (select t.*, row_number() over (order by date) as seqnum
from t
) t
where event = 'A';
This produces the results for each "A". Given that you have three "A"s in your sample data and only want "2", I'm not sure what logic is used for that.
If you want to count number of events as they appear in the table before the row with event A, there is no way to do this because BigQuery doesn't preserve physical order of rows in a table.
If you want to count Before and After using the date column, you can do
WITH
events AS (
SELECT
DATE('2018-02-13') AS event_date,
"D" AS event
UNION ALL
SELECT
DATE('2018-02-12') AS event_date,
"B" AS event
UNION ALL
SELECT
DATE('2018-02-10') AS event_date,
"C" AS event
UNION ALL
SELECT
DATE('2018-02-11') AS event_date,
"A" AS event
UNION ALL
SELECT
DATE('2018-02-01') AS event_date,
"X" AS event),
event_a AS (
SELECT
*
FROM
events
WHERE
event = "A")
SELECT
ANY_VALUE(event_a.event) AS Event,
COUNTIF(events.event_date<event_a.event_date) AS Before,
COUNTIF(events.event_date>event_a.event_date) AS After
FROM
events,
event_a
Hope this answers your question
Create table #temp(T_date varchar(100),Events varchar(100))

insert into #temp values
('2018-02-13','A'),
('2018-02-12','B'),
('2018-02-10','C'),
('2018-02-11','A'),
('2018-02-01','X'),
('2018-02-06','A')

select max(rn)-min(rn)
from
(
select *,ROW_NUMBER() over(order by (select 1)) as rn from #temp
)a
where Events='A'