Analysis of sparse time series events with Spark or SQL

I have a set of status change events for users; for simplicity let's say ACTIVATED (A) and DEACTIVATED (D).
The scenario is similar to, e.g., a YouTube Premium subscription, where a user might activate or deactivate their subscription multiple times. Hence, both events can occur multiple times for the same user, with multiple time points (e.g. days, months) in between.
I want to calculate from the event history the number of users with an ACTIVATED status per month.
An example timeline could be:
t: Time point (end of month) of aggregation
u: One user
A: ACTIVATED event
D: DEACTIVATED event
t:        Jan  Feb  Mar  Apr  May
u1        A
u2        A    D              A
u3             A         D    A
Expected: 2    2    2    1    3
The data itself is available as a CSV / table with columns user-id, event-type, time-stamp. For the example above the raw data would be:
user-id event-type time-stamp
u1 A 2020-Jan-01
u2 A 2020-Jan-15
u2 D 2020-Feb-05
u2 A 2020-May-17
u3 A 2020-Feb-04
u3 D 2020-Apr-10
u3 A 2020-May-09
Note that even though I want to have the count at the end of each month, the events of course do not all happen at the same time. One user could also have more than one event in the same month.
The absolute count is not problematic: "count all users whose last event is A".
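For completeness, that overall count might look roughly like the following, assuming the rows are loaded into a table called events with underscored column names (user_id, event_type, time_stamp), which are placeholders here:
select count(*) as currently_active
from (
    -- keep only the latest event per user
    select user_id, event_type,
           row_number() over (partition by user_id order by time_stamp desc) as rn
    from events
) latest
where rn = 1
  and event_type = 'A';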
The tricky part is calculating it for the individual months where I have no change event, e.g. Mar in the example timeline.
I cannot simply group by month, since no event happened in Mar, but I need to be aware that an ACTIVATION or DEACTIVATION happened at earlier time points.
I can come up with two approaches:
Calculate for each time point with an increasing partitioning window in some loop, i.e. "for tCursor in Jan to May do: count all users where last event in range 'Jan - tCursor' is ACTIVATED".
Populate the history with redundant events in the time granularity of interest with some pre-processing loop for each user. Then I can avoid the iteratively increased time window.
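To make approach 2 a bit more concrete, here is a rough sketch of it in Spark SQL, written as a join against a generated list of month ends instead of an explicit loop. The table and column names (events, user_id, event_type, time_stamp), the hard-coded month range, and max_by (Spark 3.x) are assumptions for illustration:
with months as (
    -- one row per month end of interest (hard-coded range for the example)
    select last_day(d) as month_end
    from (select explode(sequence(to_date('2020-01-01'), to_date('2020-05-01'), interval 1 month)) as d) seq
),
status_per_month as (
    -- for every (month end, user), carry forward the latest event up to that month end
    select m.month_end, e.user_id,
           max_by(e.event_type, e.time_stamp) as last_event
    from months m
    join events e on e.time_stamp <= m.month_end
    group by m.month_end, e.user_id
)
select month_end, count(*) as active_users
from status_per_month
where last_event = 'A'
group by month_end
order by month_end;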
Both approaches seem somewhat rough (though they would work).
Is there some good alternative? Maybe some magic Spark function that I should be aware of?
Happy to get some input here. I am also not 100% sure what to search for. I would think there might even be a name for this general problem, since, as said, all on/off subscription services with sparse events should have the same issue.
Thanks

You can unpivot the data, aggregate, and use window functions:
with unpivoted as (
      select userid, 't1' as t,
             (case when t1 = 'A' then 1 else -1 end) as inc
      from t
      where t1 in ('A', 'D')
      union all
      select userid, 't2' as t,
             (case when t2 = 'A' then 1 else -1 end) as inc
      from t
      where t2 in ('A', 'D')
      union all
      . . . -- need to repeat for all times
)
select t, sum(inc) as change_at_time,
       sum(sum(inc)) over (order by t) as active_on_day
from unpivoted
group by t
order by t;
The 't1' is whatever time is represented by that column. It might really be a number (your question is not clear on the representation of the data).
The query would be simpler if you simply had rows with userid, time, and 'A'/'D' rather than having the values splayed across many columns.
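Given that the sample data in the question does appear to be one event per row, a minimal sketch of that simpler form (assuming a table events with columns user_id, event_type, time_stamp; the date_trunc syntax varies by engine):
select date_trunc('month', time_stamp) as event_month,
       sum(case when event_type = 'A' then 1 else -1 end) as change_in_month,
       sum(sum(case when event_type = 'A' then 1 else -1 end))
           over (order by date_trunc('month', time_stamp)) as active_at_month_end
from events
group by date_trunc('month', time_stamp)
order by event_month;
Months with no events at all (Mar in the example) still do not appear in the output; filling those gaps requires joining against a generated list of months, along the lines of approach 2 in the question.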

Related

Azure Stream Analytics - Find Most Recent `n` Events Within Time Interval

I am working with Azure Stream Analytics and, to illustrate my situation, I have streaming events corresponding to buy(+)/sell(-) orders from users of a certain amount. So, key fields in an individual event look like: {UserId: 'u12345', Type: 'Buy', Amt: 14.0}.
I want to write a query which outputs UserIds and the sum of Amt for the most recent (up to) 5 events within a sliding 24 hr period, partitioned by UserId.
To clarify:
If there are more than 5 events for a given UserId in the last 24 hours, I only want the sum of Amt for the most recent 5.
If there are fewer than 5 events, I want either the UserId to be omitted or the sum of the Amt of the events that do exist.
I've tried looking at LIMIT DURATION predicates, but there doesn't seem to be a way to limit the number of events as well as filter on time while PARTITION'ing by UserId. Has anyone done something like this?
Considering the comments, I think this should work:
WITH Last5 AS (
SELECT
UserId,
System.Timestamp() AS windowEnd,
COLLECTTOP(5) OVER (ORDER BY CAST(EventEnqueuedUtcTime AS DATETIME) DESC) AS Top5
FROM input1
TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
SlidingWindow(hour,24),
UserId
HAVING COUNT(*) >= 5 --We want at least 5
)
SELECT
L.UserId,
System.Timestamp() AS ts,
SUM(C.ArrayValue.value.Amt) AS sumAmt
INTO myOutput
FROM Last5 AS L
CROSS APPLY GetArrayElements(L.Top5) AS C
GROUP BY
System.Timestamp(), --Snapshot window
L.UserId
We use a CTE to first get the sliding window of 24h. In there we both filter to only retain windows of at least 5 records (HAVING COUNT(*) >= 5), and collect only the last 5 of them (COLLECTTOP(5) OVER ...). Note that I had to TIMESTAMP BY and CAST on my own timestamp when testing the query; you may not need that in your case.
Next we need to unpack the collected records; that's done via CROSS APPLY GetArrayElements, and then we sum them. I use a snapshot window for that, as I don't need time grouping on that one.
Please let me know if you need more details.

In SQL, Is there any way to construct a variable that tracks historical data within multiple groups?

I have a question about "variable construction" in SQL, more specifically BigQuery on GCP (Google Cloud Platform). I do not have a deep understanding of SQL, so I am having a hard time manipulating and constructing the variables I intend to make. Any comment would be very appreciated.
I'm thinking of constructing two variables, which seems quite tricky to me. I'd like to briefly introduce the structure of this dataset before I ask about the way of constructing those variables. This dataset is the historical record of game matches played by around 25,000 users, totaling around 100 million matches. 10 players participate in a single match, and each player chooses their own hero. Due to technical constraints, I can only manipulate and construct those two variables through BigQuery on GCP (Google Cloud Platform).
Constructing “Favorite Hero” Variable
First, I am planning to construct a "favorite hero" variable at the match-user level. As shown in the tables below, the baseline variables are 1) match_id (which specifies each match), 2) user_id (which specifies each user), 3) day (which indicates the date the match was played), and 4) hero_type (which indicates which hero each player (user) chose).
Let me make clear what I intend to construct. As shown below, the user "3258 (blue)" played four times within the observation period. So, for the fourth match of user 3258, his/her favorite hero_type is 36 because his/her cumulative favorite hero_type is 36. Please note that the "cumulative" does not include that very day. For example, the user "2067 (red)" played three times: 20190208, 20190209, 20190212. Each time, the player chose heroes 81, 81, and 34, respectively. So, the "favorite_hero" for the third match is 81, not 34. Also, I'd like to set the number of favorite heroes to 2.
The important thing to note is that there are consecutive but split tables as shown below. Although those tables are split, the timeline should not be discontinued but be linked to each other.
Constructing “Familiarity” Variable
I think the second variable I intend to make is quite a bit trickier than the previous one. I am planning to construct a "met_before" variable that counts the total number of cases where each user has met another player (or players) before. For example, in match_id 2, the users 3258 (blue) and 2067 (red) previously met each other at match_id 1. So, each user has a value of 1 for the variable "met_before". The concept of "match_id" becomes more important when making this variable than the previous one, because this variable is primarily based on the match_id. Another example: for match_id 5, the user 3258 (blue) has the value of 4 for the variable "met_before" because the player met user 2386 (green) twice (match_id 1 and 3) and user 2067 (red) twice (match_id 1 and 2).
Again, the important thing to note is that there are consecutive but split tables as shown below. Although those tables are split, the timeline should not be discontinued but be linked to each other.
As stated in the comments, it would be better if you could provide sample data.
Also, there are 2 separate problems in the question. It would be better to create 2 different threads for them.
I prepared sample data from your screenshots and the code you need.
You can try the code and give feedback based on the output; if there is anything wrong, we can iterate on it.
CREATE TEMP FUNCTION find_fav_hero(heroes ARRAY<INT64>) AS
((
SELECT STRING_AGG(CAST(hero as string) ORDER BY hero)
FROM (
SELECT *, max(cnt) over () as max_cnt
FROM (
SELECT hero, count(*) as cnt
FROM UNNEST(heroes) as hero
GROUP BY 1
)
)
WHERE cnt = max_cnt
));
WITH
rawdata as (
SELECT 2386 AS user_id, 20190208 as day, 30 as hero_type UNION ALL
SELECT 3268 AS user_id, 20190208 as day, 30 as hero_type UNION ALL
SELECT 2067 AS user_id, 20190208 as day, 81 as hero_type UNION ALL
SELECT 3268 AS user_id, 20190209 as day, 36 as hero_type UNION ALL
SELECT 2067 AS user_id, 20190209 as day, 81 as hero_type UNION ALL
SELECT 2386 AS user_id, 20190210 as day, 3 as hero_type UNION ALL
SELECT 3268 AS user_id, 20190210 as day, 36 as hero_type UNION ALL
SELECT 2386 AS user_id, 20190212 as day, 203 as hero_type UNION ALL
SELECT 3268 AS user_id, 20190212 as day, 36 as hero_type UNION ALL
SELECT 2067 AS user_id, 20190212 as day, 34 as hero_type
)
SELECT *,
count(*) over (partition by user_id order by day) - 1 as met_before,
find_fav_hero(array_agg(hero_type) over (partition by user_id order by day rows between unbounded preceding and 1 preceding )) as favourite_hero
from rawdata
order by day, user_id

Query to analyze log-in data and intelligently identify a shift worked

I have a large Vertica table that tracks almost any user activity within an enterprise wide program. There is a subset of users where I want to identify the hours they worked on a day to day basis. The tricky part is that some users work 12 hour shifts that span multiple days. Could anyone suggest the best way to do this? Here's what I was originally thinking:
select users.username,
       users.max_hour - users.min_hour as shift_length,
       users.activity_day
from
    (select username,
            timestamp_trunc(activity_dt_tm, 'ddd') as activity_day,
            ceil(max(hour(activity_dt_tm))) as max_hour,
            floor(min(hour(activity_dt_tm))) as min_hour
     from user_activity
     where timestamp_trunc(activity_dt_tm, 'ddd') = '2014/11/10'
     group by username, timestamp_trunc(activity_dt_tm, 'ddd')
    ) users
I would look at the results from that query and see which users' shifts were under a minimum threshold of, say, 8 hours, indicating they probably started working in the afternoon and worked into the following day. Once I have that list of usernames, I would pass them into a second query that would look ahead to the next day, grab the maximum hour of the activity data, and substitute it in for their max_hour. I'm not a SQL expert, but I think this might involve some temporary tables to pass the data around. If anyone could point me in the right direction it would be much appreciated.
Edit
Here's a SQL Fiddle with some staged data for 2 users. http://sqlfiddle.com/#!2/4ce900
User2 has activity of a normal 8-5 workday. User1 starts working around 7PM and works into the next day. I'd want the output to look something like this:
UserName | Shift Start | Shift End | Hours Worked
-------------------------------------------------
User1 | 7PM | 7AM | 12
User2 | 8AM | 5PM | 9
I'd want to attribute all the hours worked to the day the user started their shift.
You can use the SQL below to find the start, end, and duration of the breaks that a user had. You can then filter the breaks that are longer than a threshold and use them to separate the user's shifts.
select t1.username, t1.end_dt_tm beforeBreak, t2.start_dt_tm afterBreak, t2.start_dt_tm - t1.end_dt_tm as diff
from user_activity t1, user_activity t2
where t1.username = t2.username and t2.start_dt_tm =
(
select min(nxt.start_dt_tm) from user_activity nxt
where nxt.username = t1.username and nxt.start_dt_tm > t1.end_dt_tm
)
;
(note that your fiddle has the same row twice for user 1)
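To take this one step further, here is a sketch of how the gap detection could be turned into shift numbers and then aggregated per shift. It assumes the same user_activity(username, start_dt_tm, end_dt_tm) layout as above and a 4-hour break threshold; the window-function and interval syntax may need small adjustments for your Vertica version:
select username,
       min(start_dt_tm) as shift_start,
       max(end_dt_tm)   as shift_end,
       datediff('hour', min(start_dt_tm), max(end_dt_tm)) as hours_worked
from (
    select username, start_dt_tm, end_dt_tm,
           -- start a new shift whenever the gap to the previous activity exceeds 4 hours
           sum(case when prev_end is null
                      or start_dt_tm - prev_end > interval '4 hours'
                    then 1 else 0 end)
             over (partition by username order by start_dt_tm) as shift_no
    from (
        select username, start_dt_tm, end_dt_tm,
               lag(end_dt_tm) over (partition by username order by start_dt_tm) as prev_end
        from user_activity
    ) gaps
) shifts
group by username, shift_no
order by username, shift_start;
Each shift is keyed by its starting timestamp, so the hours can then be attributed to the day the shift started, as requested.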

count occurrences for each week using db2

I am looking for some general advice rather than a solution. My problem is that I have a list of dates per person where, due to administrative procedures, a person may have multiple records stored for one instance, yet the date recorded is when the data was entered as the person passed through the paper trail. I understand this is quite difficult to explain, so I'll give an example:
Person Date Audit
------ ---- -----
1 2000-01-01 A
1 2000-01-01 B
1 2000-01-02 C
1 2003-04-01 A
1 2003-04-03 A
where I want to know how many valid records a person has by removing the extra audit rows that have recorded the date as the day the data was entered, rather than the date the person first arrives in the dataset. So for the above person I am only interested in:
Person Date Audit
------ ---- -----
1 2000-01-01 A
1 2003-04-01 A
What makes this problem difficult is that I do not have the luxury of an audit column (the audit column here is just to show how the data is collected). I merely have dates. So one way I could crudely count real events (and remove repeat audit data) is to look at individual weeks within a person's history and, if a record (or records) exists for a given week, add 1 to my counter. This way, even though there are multiple records split over a few days, I am only counting the succession of dates as one record (which, after all, I am counting by date).
So does anyone know of any db2 functions that could help me solve this problem?
If you can live with standard weeks it's pretty simple:
select
person, year(dt), week(dt), min(dt), min(audit)
from
blah
group by
person, year(dt), week(dt)
If you need seven-day ranges starting with the first date you'd need to generate your own week numbers, a calendar of sorts, e.g. like so:
with minmax(mindt, maxdt) as ( -- date range of the "calendar"
select min(dt), max(dt)
from blah
),
cal(dt,i) as ( -- fill the range with every date, count days
select mindt, 0
from minmax
union all
select dt+1 day , i+1
from cal
where dt < (select maxdt from minmax) and i < 100000
)
select
person, year(blah.dt), wk, min(blah.dt), min(audit)
from
(select dt, int(i/7)+1 as wk from cal) t -- generate week numbers
inner join
blah
on t.dt = blah.dt
group by person, year(blah.dt), wk

Finding concurrent users in a sessions table created from log entries

We are exploring using BigQuery to store and analyze hundreds of millions of log entries representing user sessions. The source raw log entries contain a "connect" log type and a "disconnect" log type.
We have the option of processing the logs before they are ingested into BigQuery so that we have one entry per session, containing the session start TIMESTAMP and a "duration" value, or of inserting each log entry individually and calculating session times at the analysis stage. Let's imagine our table schema is of the form:
sessionStartTime: TIMESTAMP,
clientId: STRING,
duration: INTEGER
or (in the case we store two log entries per session: one connect and one disconnect):
time: TIMESTAMP,
type: INTEGER, //enum, 0 for connect, 1 for disconnect
clientId: STRING
Our problem is that we cannot find a way to get concurrent users using BigQuery: ideally we would be able to write a query that partitions the sessions table into timestamp "buckets" (let's say every minute) and gives us concurrent users per minute over a certain time range.
The simple way to think about concurrent users with respect to log entries is that at any moment in time they are calculated using the function f(t) = x0 + connects(t) - disconnects(t), where x0 is the initial concurrent user count (at time t0) and t is the timestamp bucket (in minutes in this example).
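To make the formula concrete, the running-sum version of it for the two-row-per-session schema might look roughly like this (standard SQL syntax; the table name is a placeholder, x0 is assumed to be 0 at the start of the table, and minutes with no events would still need to be filled in separately):
SELECT
  minute,
  net_change,
  SUM(net_change) OVER (ORDER BY minute) AS concurrent
FROM (
  SELECT TIMESTAMP_TRUNC(time, MINUTE) AS minute,
         SUM(IF(type = 0, 1, -1)) AS net_change   -- +1 per connect, -1 per disconnect
  FROM `myproject.mydataset.sessions`
  GROUP BY minute
)
ORDER BY minute;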
Can anybody recommend a way to do this?
Thanks!
Thanks for the sample data! (Available at https://bigquery.cloud.google.com/table/imgdge:sopub.sessions)
I'll take you up on your offer that "We have the option of processing the logs before they are ingested to bigquery so that we have one entry per session, containing the session start TIMESTAMP and a 'duration' value". This time, I'll do the processing with BigQuery and leave the results in a table of my own with:
SELECT u, start, MIN(end) end FROM (
SELECT a.f0_ u, a.time start, b.time end
FROM [imgdge:sopub.sessions] a
JOIN EACH [imgdge:sopub.sessions] b
ON a.f0_ = b.f0_
WHERE a.type = 'connect'
AND b.type='disconnect'
AND a.time < b.time
)
GROUP BY 1, 2
That gives me 819,321 rows. Not a big number for BigQuery, but since we are going to be doing combinations of it, it might explode. We'll limit the date range for calculating the concurrent sessions to keep it sane. I'll save the results of this query to [fh-bigquery:public_dump.imgdge_sopub_sessions_startend].
Once I have all the sessions with start and end time, I can go find how many concurrent sessions there are at each interesting instant. By minute, you said?
All the interesting minutes happen to be:
SELECT SEC_TO_TIMESTAMP(FLOOR(TIMESTAMP_TO_SEC(time)/60)*60) time
FROM [imgdge:sopub.sessions]
GROUP BY 1
Now let's combine this list of interesting times with all the sessions in my new table. For each minute we'll count all the sessions that started before this time, and ended after it:
SELECT time, COUNT(*) concurrent
FROM (
SELECT u, start, end, 99 x
FROM [fh-bigquery:public_dump.imgdge_sopub_sessions_startend]
WHERE start < '2013-09-30 00:00:00'
) a
JOIN
(
SELECT SEC_TO_TIMESTAMP(FLOOR(TIMESTAMP_TO_SEC(time)/60)*60) time, 99 x FROM [imgdge:sopub.sessions] GROUP BY 1) b
ON a.x = b.x
WHERE b.time < a.end
AND b.time >= a.start
GROUP BY 1
Notice the 99 x. It could be any number; I'm just using it to generate the combination of all the sessions with all the times. There are too many sessions for this kind of combinatorial game, so I'm limiting them with the WHERE start < '2013-09-30 00:00:00'.
And that's how you can count concurrent users.
Could you, instead of sessionStartTime, get sessionEndTime (or just add duration + sessionStartTime)? If you could do that, something like this can be made. It is not perfect, but it will give you somewhat relevant data.
SELECT AVG(perMinute) as avgUsersMin FROM
(
  SELECT COUNT(distinct clientId, 1000000) as perMinute,
         YEAR(sessionEndTime) as y, MONTH(sessionEndTime) as m, DAY(sessionEndTime) as d,
         HOUR(sessionEndTime) as h, MINUTE(sessionEndTime) as mn
  FROM [MyProject:MyTable]
  WHERE sessionEndTime BETWEEN someDate AND someOtherDate
  GROUP BY y,m,d,h,mn
);