Find first event occurring after given event - google-bigquery

I am working with a table consisting of a number of web sessions with various events and event IDs. To simplify my question, let's say that I have three columns: session_id, event_name and event_id, where the event ID can be used to order the events in ascending/descending order. Let's also say that we have a large number of events and that I am particularly interested in 3 of them, with event_name 'open', 'submit' and 'decline'. Assume that these 3 events can occur in any order.
What I would like to do is add a new column that, for each session, says which (if any) of the two events 'submit' and 'decline' first follows the event 'open'. I have tried using the FIRST_VALUE window function but have not managed to make it work yet.
So for a session with event sequence 'open', ... (a number of different events in between), 'submit', 'decline', I would like to return 'submit';
for a session with event sequence 'open', ... (a number of different events in between), 'decline', I would like to return 'decline';
and for a session in which neither 'submit' nor 'decline' occurs after 'open', I would like to return null.
You can use the following table with name 'events' for writing example SQL code:
I hope the question and its formulation is clear. Thank you very much in advance!
Sincerely,
Bertan

Use the query below (assuming you have at most one 'submit' or 'decline' per session!)
select *, if(event_name != 'open', null, ['decline', 'submit'][ordinal(
  sum(case event_name when 'decline' then 1 when 'submit' then 2 end) over win
)]) status
from your_table
window win as (
  partition by session_id
  order by event_id
  rows between 1 following and unbounded following
)
Applied to the sample data in your question, the output is:
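The answer above relies on there being at most one following 'submit'/'decline'. A more general way to express "first qualifying event that follows 'open'" is a correlated subquery that takes the earliest match. Here is a minimal sketch, runnable in SQLite via Python; the table mirrors the question's events table, and the rows are invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (session_id INTEGER, event_id INTEGER, event_name TEXT);
INSERT INTO events VALUES
  (1, 1, 'open'), (1, 2, 'click'), (1, 3, 'submit'), (1, 4, 'decline'),
  (2, 1, 'open'), (2, 2, 'scroll'), (2, 3, 'decline'),
  (3, 1, 'open'), (3, 2, 'click');
""")

# For each 'open' event, find the first 'submit' or 'decline' that follows it
# (ordered by event_id) within the same session; NULL if there is none.
rows = conn.execute("""
SELECT e.session_id,
       (SELECT e2.event_name
        FROM events e2
        WHERE e2.session_id = e.session_id
          AND e2.event_id > e.event_id
          AND e2.event_name IN ('submit', 'decline')
        ORDER BY e2.event_id
        LIMIT 1) AS first_following
FROM events e
WHERE e.event_name = 'open'
ORDER BY e.session_id
""").fetchall()
print(rows)  # [(1, 'submit'), (2, 'decline'), (3, None)]
```

Unlike the window-sum trick, this stays correct when several 'submit'/'decline' events follow a single 'open'.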

Related

Is it possible to get time difference between dates and provide a default value, with PostgreSQL?

So, the table setup is something like this:
table: ticket(ticket_id, ...)
table: ticket_status_history(status_history_id, ticket_id, ticket_status, ticket_status_datetime)
The default ticket_status is OPEN, and the first ticket status that I'm interested in is ACKNOWLEDGED.
So, the idea is that a specific ticket has a set of ticket_status_history events, each recorded in the separate table. Each ticket status entry points to its corresponding ticket, ticket_id is a foreign key.
Now, a ticket can actually be created directly in ACKNOWLEDGED so it would get a corresponding entry directly in ACKNOWLEDGED, without ever being in OPEN. But many of them will go OPEN -> ACKNOWLEDGED -> ...
What I'd like to do is determine for each ticket the time interval between ticket creation (OPEN) and ticket acknowledgment (ACKNOWLEDGED), but if the ticket was created directly in ACKNOWLEDGED, set the time difference to a default of 0 (because the ticket was created directly in that state).
Is this doable in SQL, for PostgreSQL? I'm a bit stumped at the moment. I found this: Calculate Time Difference Between Two Rows, but it's for SQL Server, and I'm not sure how the default value could be included.
The end state would actually be aggregating the time differences and computing an average duration, but I can't figure out the first step 😞
Your query could look like this:
SELECT t.*,
       coalesce(ack.ticket_status_datetime - op.ticket_status_datetime,
                '0'::interval) AS op_ack_diff
FROM ticket t
LEFT JOIN ticket_status_history ack ON (t.ticket_id = ack.ticket_id
                                        AND ack.ticket_status = 'ACKNOWLEDGED')
LEFT JOIN ticket_status_history op ON (t.ticket_id = op.ticket_id
                                       AND op.ticket_status = 'OPEN')
WHERE t.ticket_id = x;
The difference of the timestamps yields null if one of the entries is missing. The coalesce function will return its second argument in this case.

Stream Analytics Left outer join Not Producing Rows

What I am trying to do:
I want to "throttle" an input stream to its output. Specifically, as I receive multiple similar inputs, I only want to produce an output if one hasn't already been produced in the last N hours.
For example, the input could be thought of as "send an email", but I will get dozens/hundreds of those events. I only want to send an email if I haven't already sent one in the last N hours (or have never sent one).
See the final example here: https://learn.microsoft.com/en-us/stream-analytics-query/join-azure-stream-analytics#examples for something similar to what I am trying to do
What my setup looks like:
There are two inputs to my query:
Ingress: this is the "raw" input stream
Throttled-Sent: this is just a consumer group off of my output stream
My query is as follows:
WITH
AllEvents as (
/* This CTE is here only because we can't seem to use GetMetadataPropertyValue in a join clause, so we "materialize" it here for use later */
SELECT
*,
GetMetadataPropertyValue([Ingress], '[User].[Type]') AS Type,
GetMetadataPropertyValue([Ingress], '[User].[NotifyType]') AS NotifyType,
GetMetadataPropertyValue([Ingress], '[User].[NotifyEntityId]') AS NotifyEntityId
FROM
[Ingress]
),
UseableEvents as (
SELECT *
FROM AllEvents
WHERE NotifyEntityId IS NOT NULL
),
AlreadySentEvents as (
/* These are the events that would have been previously output (referenced here via a consumer group). We want to capture these to make sure we are not sending new events when an older "already-sent" event can be found */
SELECT
*,
GetMetadataPropertyValue([Throttled-Sent], '[User].[Type]') AS Type,
GetMetadataPropertyValue([Throttled-Sent], '[User].[NotifyType]') AS NotifyType,
GetMetadataPropertyValue([Throttled-Sent], '[User].[NotifyEntityId]') AS NotifyEntityId
FROM
[Throttled-Sent]
)
SELECT i.*
INTO Throttled
FROM UseableEvents i
/* Left join our sent events, looking for those within a particular time frame */
LEFT OUTER JOIN AlreadySentEvents s
ON i.Type = s.Type
AND i.NotifyType = s.NotifyType
AND i.NotifyEntityId = s.NotifyEntityId
AND DATEDIFF(hour, i, s) BETWEEN 0 AND 4
WHERE s.Type IS NULL /* The is null here is for only returning those Ingress rows that have no corresponding AlreadySentEvents row */
The results I'm seeing:
This query is producing no rows to the output. However, I believe it should be producing something because the Throttled-Sent input has zero rows to begin with. I have validated that my Ingress events are showing up (by simply adjusting the query to remove the left join and checking the results).
I feel like my problem is probably linked to one of the following areas:
I can't have an input that is a consumer group off of the output (but I don't know why that wouldn't be allowed)
My datediff usage/understanding is incorrect
Appreciate any help/guidance/direction!
For throttling, I would recommend looking at the IsFirst function; it might be an easier solution that does not require reading from the output.
For the current query, I think the order of the DATEDIFF parameters needs to be changed, since s comes before i: DATEDIFF(hour, s, i) BETWEEN 0 AND 4
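A rough sketch of the IsFirst approach, with caveats: ISFIRST is part of the Stream Analytics query language, but the partition keys and the 4-hour unit below are assumptions carried over from the question, and ISFIRST counts intervals like a tumbling window, so this means "first event of its key in each 4-hour interval" rather than "no event in the trailing 4 hours". Treat it as a starting point, not a tested query:

```sql
-- Emit only events that are the first of their key within the current
-- 4-hour interval; later duplicates in the same interval are dropped.
SELECT i.*
INTO Throttled
FROM UseableEvents i
WHERE ISFIRST(hour, 4) OVER (PARTITION BY i.Type, i.NotifyType, i.NotifyEntityId) = 1
```

This also removes the self-referencing consumer group, which sidesteps the first of the two suspected problem areas entirely.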

BigQuery Session & Hit level understanding

I want to ask about your knowledge regarding the concept of Events at:
Hit level
Session level
How in BigQuery (standard SQL) can I map this logic, and also compute:
Sessions
Events per session
Unique events
Please can somebody guide me to understand these concepts?
Note: totals.visits marks a session, and sometimes visitId is taken as the session identifier.
To achieve that you need to grapple a little with a few different concepts. The first is "what is a session" in GA lingo; you can find that here. A session is a collection of hits. A hit is one of the following: pageview, event, social interaction or transaction.
Now to see how that is represented in the BQ schema, you can look here. visitId and visitorId will help you define a session (as opposed to a user).
Then you can count the number of totals.hits that are events of the type you want.
It could look something like:
select visitId,
       sum(case when h.type = 'EVENT' then 1 else 0 end) as events
from `dataset.table_*` t, unnest(t.hits) as h
group by 1
That should work to get you an overview. If you need to slice and dice the event details (i.e. hits.eventInfo.*), then I suggest you make one query for all the visitIds and one for all the relevant events and their respective visitIds.
I hope that works!
Cheers
You can think of these concepts like this:
every row is a session
technically every row with totals.visits=1 is a valid session
hits is an array containing structs which contain information for every hit
You can write subqueries on arrays - basically treat them as tables. I'd recommend studying Working with Arrays and applying every exercise directly to hits, if possible.
Example for subqueries on session level
SELECT
fullvisitorid,
visitStartTime,
(SELECT SUM(IF(type='EVENT',1,0)) FROM UNNEST(hits)) events,
(SELECT COUNT(DISTINCT CONCAT(eventInfo.eventCategory,eventInfo.eventAction,eventInfo.eventLabel) )
FROM UNNEST(hits) WHERE type='EVENT') uniqueEvents,
(SELECT SUM(IF(type='PAGE',1,0)) FROM UNNEST(hits)) pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
totals.visits=1
LIMIT
1000
Example for Flattening to hit level
There's also the possibility to use fields in arrays for grouping if you cross join arrays with their parent row
SELECT
h.type,
COUNT(1) hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS t CROSS JOIN t.hits AS h
WHERE
totals.visits=1
GROUP BY
1
Regarding the relation between visitId and Sessions you can read this answer.
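The session-level pattern above (one row per session, aggregates computed over the hits array) can be imitated outside BigQuery to build intuition. A toy Python sketch, standing in for the correlated subqueries over UNNEST(hits) by iterating hit records stored with each session row (all data invented):

```python
# Each "row" is a session; "hits" is the array of hit records inside the row,
# mirroring the GA export schema used in the answer above.
sessions = [
    {"fullvisitorid": "A", "hits": [
        {"type": "PAGE",  "eventInfo": None},
        {"type": "EVENT", "eventInfo": ("cat", "act", "lbl")},
        {"type": "EVENT", "eventInfo": ("cat", "act", "lbl")},  # repeated event
    ]},
    {"fullvisitorid": "B", "hits": [
        {"type": "PAGE", "eventInfo": None},
        {"type": "PAGE", "eventInfo": None},
    ]},
]

# Equivalent of the subqueries on session level: aggregate each session's
# own hits array without ever flattening sessions together.
report = [
    {
        "fullvisitorid": s["fullvisitorid"],
        "events": sum(1 for h in s["hits"] if h["type"] == "EVENT"),
        "uniqueEvents": len({h["eventInfo"] for h in s["hits"] if h["type"] == "EVENT"}),
        "pageviews": sum(1 for h in s["hits"] if h["type"] == "PAGE"),
    }
    for s in sessions
]
print(report)
```

Session A yields 2 events but only 1 unique event, which is exactly the distinction between the events and uniqueEvents subqueries in the SQL example.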

SQL convert sample points into durations

This is similar to Compute dates and durations in mysql query, except that I don't have a unique ID column to work with, and I have samples not start/end points.
As an interesting experiment, I set cron to run ps aux > `date +%Y-%m-%d_%H-%M`.txt. I now have around 250,000 samples of "what the machine was running".
I would like to turn this into a list of "process | cmd | start | stop". The assumption is that a 'start' event is the first time when the pair existed, and a 'stop' event is the first sample where it stopped existing: there is no chance of a sample "missing" or anything.
That said, what ways exist for doing this transformation, preferably using SQL (on the grounds that I like SQL, and this seems like a nice challenge)? If pids could not be repeated this would be a trivial task (put everything in a table, then SELECT pid, MIN(time), MAX(time) ... GROUP BY pid). However, since PID/cmd pairs are repeated (I checked, there are duplicates), I need a method that does a true "find all contiguous segments" search.
If necessary I can do something of the form
Load file0 -> oldList
ForEach fileN:
Load fileN ->newList
oldList-newList = closedN
newList-oldList = openedN
oldList=newList
But that is not SQL and not interesting. And who knows, I might end up having real SQL data with this property to deal with at some point.
I'm thinking something where one first constructs a table of diff's, and then joins all close's against all open's and pulls the minimum-distance close after each open, but I'm wondering if there's a better way.
You don't mention what database you are using. Let me assume that you are using a database that supports ranking functions, since that simplifies the solution.
The key to solving this is an observation: you want to assign an id to each run of a pid, so repeated pids can be told apart. I am going to assume that a pid represents a new process when it did not appear in the previous timestamped output.
Now, the idea is:
Assign a sequential number to each set of output. The first call to ps gets 1, the next 2, and so on, based on date.
Assign a sequential number to each pid, based on date. The first appearance gets 1, the next 2, and so on.
For pids that appear in sequence, the difference is a constant. We can call this the groupid for that set.
So, this is the query in action:
select groupid, pid, min(time), max(time)
from (select t.*,
(dense_rank() over (order by time) -
row_number() over (partition by pid order by time)
) as groupid
from t
) t
group by groupid, pid
This works in most databases (SQL Server, Oracle, DB2, Postgres, Teradata, among others). It does not work in MySQL before version 8.0, which did not support window/analytic functions.
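The ranking-difference trick can be checked directly in SQLite (3.25+, which supports window functions). The sample data below is invented: two pids, one of which is reused by a new process after a gap:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (time INTEGER, pid INTEGER, cmd TEXT);
-- pid 100 runs during samples 1-3, disappears at 4, and a NEW process
-- reuses pid 100 at sample 5. pid 200 runs the whole time.
INSERT INTO t VALUES
  (1, 100, 'foo'), (2, 100, 'foo'), (3, 100, 'foo'), (5, 100, 'foo'),
  (1, 200, 'bar'), (2, 200, 'bar'), (3, 200, 'bar'), (4, 200, 'bar'), (5, 200, 'bar');
""")

# dense_rank() numbers the sample sets 1..N; row_number() numbers each pid's
# appearances. Their difference is constant across a contiguous run of a pid,
# so it identifies one "island", i.e. one real process.
rows = conn.execute("""
select pid, groupid, min(time) as start_time, max(time) as stop_time
from (select t.*,
             (dense_rank()  over (order by time) -
              row_number()  over (partition by pid order by time)
             ) as groupid
      from t
     ) t
group by groupid, pid
order by pid, start_time
""").fetchall()
print(rows)  # [(100, 0, 1, 3), (100, 1, 5, 5), (200, 0, 1, 5)]
```

The reused pid 100 correctly comes out as two separate segments (samples 1-3 and sample 5) while pid 200 stays a single segment.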

Am I doing the grouping by right?

I have a system that logs errors. Selection from my errortable:
SELECT message, personid, count(*)
FROM errorlog
WHERE time BETWEEN TO_DATE(foo) AND TO_DATE(foo) AND substr(message,0,3) = 'ERR'
GROUP BY personid, message
ORDER BY 3
What I want is to see if any user is "producing" more errors than others. For instance, for ERROR FOO, if user A has 4 errors and user B has 4000, then logic suggests that user B is doing something wrong.
But can I group the way I do? This is a modified version; the original selection only grouped by message and counted it, so ERROR FOO resulted in 4004 in my example above.
With your query, if the messages are different, then you will get multiple records per person.
If you only want one record per person, you would need to put an aggregate function around message
For example you could do:
SELECT MIN(message), personid, count(*)
FROM errorlog
WHERE time BETWEEN TO_DATE(foo) AND TO_DATE(foo) AND substr(message,0,3) = 'ERR'
GROUP BY personid
ORDER BY 3
Here I've changed message to MIN(message) and grouped by personid alone; MIN(message) returns the alphabetically first message for each person.
However, if you are happy to return multiple records per person, then I see no problem with your script. It will show a list of personid and message combinations ordered by how often they occur in the table, displaying only records whose message starts with ERR.