Select the last user_property per user in a dataset in BigQuery - google-bigquery

I have this query, whose goal is to get all the user_properties of the events stored in the dataset. The result is around 300k+ rows per day, which is far too big, and I only care about one user_property per user, since it will have the keys that I want.
To explain it more: we record the events a user performs on the mobile/web app in the dataset, so every button they click or screen they search for is recorded, to be used later for analysis by clients. A single user may have 0, 100, or more events per day, and usually the last event recorded contains all the updated keys I want.
SELECT
  user_pseudo_id AS user_id,
  user_properties AS user_properties
FROM
  `TABLENAME`
ORDER BY
  user_pseudo_id, event_timestamp
I tried grouping the user_properties by user_pseudo_id, but that obviously didn't work because the properties are not the same.
My solution was to get all the results from the query above, loop over them, and store them in a Map<String, List<FieldValue>>. This does the trick, but userPropertiesResult.iterateAll() is too expensive and takes a lot of time.
So I came up with a better query that reduced the number of rows by a lot, following this answer: https://stackoverflow.com/a/43863450/7298897
SELECT
  a.user_pseudo_id AS user_id,
  a.user_properties AS user_properties
FROM
  `TABLENAME` AS a
JOIN (
  SELECT
    user_pseudo_id,
    MAX(event_timestamp) AS event_timestamp
  FROM
    `TABLENAME`
  GROUP BY
    user_pseudo_id) AS b
ON
  a.user_pseudo_id = b.user_pseudo_id
  AND a.event_timestamp = b.event_timestamp
But the problem is that the data returned is not as accurate as it was before.
So my question would be: how can I get only the last user_property per user?
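A common BigQuery standard SQL pattern for this (a sketch, not from the original post; it assumes the same `TABLENAME` and columns) is to rank each user's events with a window function and keep only the newest one:
SELECT
  user_pseudo_id AS user_id,
  user_properties
FROM
  `TABLENAME`
WHERE TRUE  -- BigQuery requires a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY user_pseudo_id
  ORDER BY event_timestamp DESC) = 1
Unlike the MAX(event_timestamp) join, this returns exactly one row per user even when several events share the same timestamp.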

Related

BigQuery Join Using Most Recent Row

I have seen variations of this question but have been searching StackOverflow for almost a week now trying various solutions and still struggling with this. Really appreciate you taking the time to consider my question.
I am working on a research project in GCP using BigQuery. I have a table result of ~100 million rows of events where there is a session_id column that relates to the session that the event originated from. I would like to join this with another table status of about 40 million rows that has that same session_id and tracks the status of those sessions. Both tables have a time column. In the result table, this is the time of the event. In the status table this is the time of any status changes. I want to join the rows in the result table with the corresponding row in the status table for the most recent state of the session up to or before the time of the event using the session ID. The result would be that each row in the result table would have the corresponding information about the state of the session when the event occurred.
How can I achieve this? Any way to do it that won't be really inefficient? Thank you so much for your help!
You may be able to use a left join:
select r.*, s.status  -- choose whatever columns you want
from result r left join
     (select s.*,
             lead(time) over (partition by session_id order by time) as next_time
      from status s
     ) s
     on r.session_id = s.session_id and
        s.time <= r.time and
        (r.time < s.next_time or s.next_time is null)
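The lead() call turns each status row into a validity interval [time, next_time), so each event joins to at most one status row: the most recent status change at or before the event. And because it's a left join, events that occurred before the session's first recorded status are still kept, with a null status.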

Select latest and earliest times within a time group and a pivot statement

I have attendance data that contains a username, a time, and a status (IN or OUT). I want to show attendance data that contains a name and the check-in/out times. I expect a person to check in and out no more than twice a day.
My problem is that one person can have multiple data entries, seconds apart, for the same login attempt. This is because I get the data from a fingerprint attendance scanner, and the machine in some cases makes multiple entries, sometimes just within 5-10 seconds. I want to collapse those into a single check-in/out time per attempt.
How can I identify the proper time for the login attempt, and then select the data with a pivot?
First, you need to normalize your data by removing the duplicate entries. In your situation, that's a challenge because the duplicated data isn't easily identified as a duplicate. You can make some assumptions, though. Below, I assume that no one will make multiple login attempts within a two-minute window. You can handle this by first using a Common Table Expression (CTE, using the WITH clause).
Within the CTE, you can use the LAG function. Essentially what this code is saying is "for each partition of user and entry type, if the previous value was within 2 minutes of this value, then put a number, otherwise put null." I chose null as the flag that will keep the value because LAG of the first entry is going to be null. So, your CTE will just return a table of entry events (ID) that were distinct attempts.
Now, you prepare another CTE that a PIVOT will pull from that has everything from your table, but only for the entry IDs you cared about. The PIVOT is going to look over the MIN/MAX of your IN/OUT times.
WITH UNIQUE_LOGINS AS (
  -- Analytic functions can't appear in WHERE, so compute LAG in an inline view first
  SELECT ID FROM (
    SELECT ID, TIME,
           LAG(TIME) OVER (PARTITION BY USERNAME, STATUS ORDER BY TIME) AS PREV_TIME
    FROM LOGIN_TABLE )
  WHERE PREV_TIME IS NULL                 -- first attempt in the partition (LAG is null)
     OR PREV_TIME + (2/60/24) < TIME ),   -- or more than 2 minutes after the previous one
TEMP_FOR_PIVOT AS (
  SELECT USERNAME, TIME, STATUS
  FROM LOGIN_TABLE
  WHERE ID IN (SELECT ID FROM UNIQUE_LOGINS) )
SELECT * FROM TEMP_FOR_PIVOT
PIVOT (
  MIN(TIME) AS FIRST_TIME, MAX(TIME) AS LAST_TIME
  FOR STATUS IN ('IN' AS CHECK_IN, 'OUT' AS CHECK_OUT) )
From there, if you need to rearrange or rename your columns, then you can just put that last SELECT into yet another CTE and then select your values from it. There is some more about PIVOT here: Rotate/pivot table with aggregation in Oracle
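With the aliases in the query above, Oracle builds the pivoted column names by concatenating the IN-list alias and the aggregate alias, so the output should have columns like CHECK_IN_FIRST_TIME, CHECK_IN_LAST_TIME, CHECK_OUT_FIRST_TIME, and CHECK_OUT_LAST_TIME, one row per USERNAME, ready for that final renaming CTE.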

Calculating Unique events in Google-Bigquery

How can one find the count of unique events in BigQuery? I am having a hard time calculating the count of unique events and matching it with the GA interface.
There are two ways this is used:
1) One is, as the original linked documentation says, to combine the full visitor id (fullVisitorId) with the session id (visitId), and count the distinct combinations.
SELECT
  EXACT_COUNT_DISTINCT(combinedVisitorId)
FROM (
  SELECT
    CONCAT(fullVisitorId, STRING(visitId)) AS combinedVisitorId
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
  WHERE
    hits.type = 'PAGE' )
2) The other is just counting distinct fullVisitorIds
SELECT
  EXACT_COUNT_DISTINCT(fullVisitorId)
FROM
  [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
  hits.type = 'PAGE'
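For reference, a standard SQL sketch of the same two counts (not from the original answer; it unnests the hits to mimic legacy SQL's automatic flattening) could look like:
SELECT
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS combined_visitors,
  COUNT(DISTINCT fullVisitorId) AS full_visitors
FROM
  `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` AS s,
  UNNEST(s.hits) AS h
WHERE
  h.type = 'PAGE'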

Can a nested Group By be done in a single Select?

Using T-SQL (we're on 2008, but if it can be done in 2012 using some new function/extension, please note)
This is purely out of curiosity... I ended up just going with a GROUP BY within a GROUP BY. But I'm curious to see if there is a way to do this in a single query; maybe there are some fancy-schmancy functions or extensions I haven't learned yet. It's more of a challenge than a need to get the job done, as it's already done.
I tried building an example table on here, but it's too large to build, so here's the concept. The table has three columns: UserID, UserGroupID, and Minutes. In one-hour increments, we log how much time a user spends within an application. So, for example, UserID 1 spent 10 minutes during the hour of 04/28/2014 10:00:00, and then 15 minutes during the hour of 04/28/2014 11:00:00, and so on. (For this example, please ignore any time constraints, such as per day or per month.)
I wanted to see the number of users per group that have used the application for at least 30 minutes. This is the logic that was used:
SELECT UserGroupID, COUNT(*)
FROM (
  SELECT UserGroupID, UserID
  FROM Example
  GROUP BY UserGroupID, UserID
  HAVING SUM([Minutes]) >= 30
) AS x
GROUP BY UserGroupID
The question is, can this be done in a single query? Not looking for efficiency here, I'm just curious.
I don't think so, but a negative is quite hard to prove.
The following query, which is yours without the having clause, can be simplified. This:
SELECT UserGroupID, COUNT(*)
FROM (
  SELECT UserGroupID, UserID
  FROM Example
  GROUP BY UserGroupID, UserID
) AS x
GROUP BY UserGroupID;
is pretty much the same as:
SELECT UserGroupId, COUNT(DISTINCT UserId)
FROM Example
GROUP BY UserGroupId;
(These are not exactly equivalent if UserId can be NULL, but that case could also be handled.)
I don't think there is a way to do your full query, though. You need to aggregate by UserGroupId, UserId to get the sum() condition. Then you need to aggregate just by UserGroupId. Nothing comes to mind.
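Even window functions (a sketch, not from the original answer; it assumes SQL Server 2005+) don't remove the derived table, they only move the first aggregation into it:
SELECT UserGroupID,
       COUNT(DISTINCT CASE WHEN UserTotal >= 30 THEN UserID END)
FROM (
  -- per-user total computed as a window aggregate instead of GROUP BY
  SELECT UserGroupID, UserID,
         SUM([Minutes]) OVER (PARTITION BY UserGroupID, UserID) AS UserTotal
  FROM Example
) AS x
GROUP BY UserGroupID;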

Finding first pending events from SQL table

I have a table which contains TV Guide data.
In a simplified form, the columns look like the following...
_id, title, start_time, end_time, channel_id
What I'm trying to do is create a list of TV shows in a NOW/NEXT format. Generating the 'NOW' list (what's currently being broadcast) is easy but trying to get a list of what is showing 'NEXT' is causing me problems.
I tried this...
SELECT * from TV_GUIDE where start_time >= datetime('now') GROUP BY channel_id
Sure enough, this gives me one TV show for each channel_id, but it gives me the very last shows (by date/time) in the TV_GUIDE table.
SQL isn't my strong point, and I'm struggling to work out why only the last TV shows are returned. It seems I need to do a sub-query of a query (or a query of a sub-query). I've tried combinations of ORDER BY and LIMIT, but they don't help.
I believe that you first must select the pairs (channel_id, earliest start_time) from TV_GUIDE, based on your criteria.
Then, you will display those records of TV_GUIDE which match those criteria:
SELECT source.*
FROM TV_GUIDE AS source
JOIN (SELECT channel_id, MIN(start_time) AS start_time
      FROM TV_GUIDE
      WHERE start_time >= datetime('now')
      GROUP BY channel_id) AS start_times
  ON (source.channel_id = start_times.channel_id
      AND source.start_time = start_times.start_time)
ORDER BY channel_id;
This first selects all shows with the minimum start time, one per channel, thus giving you the channel id and the start time. Then it fills in the other information from a JOIN with the same table (MySQL sometimes lets you retrieve that from a single query, but I feel it's a bad habit to acquire: maybe you add a field and it won't work anymore).
You might want to add an index on the combined fields (start_time, channel_id). Just to be on the safe side, make it a UNIQUE index.
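An alternative sketch (not from the original answer; it assumes SQLite 3.25+ with window-function support) picks the next show per channel with ROW_NUMBER instead of a self-join:
SELECT *
FROM (SELECT *,
             -- rank upcoming shows per channel, earliest first
             ROW_NUMBER() OVER (PARTITION BY channel_id
                                ORDER BY start_time) AS rn
      FROM TV_GUIDE
      WHERE start_time >= datetime('now')) AS next_shows
WHERE rn = 1
ORDER BY channel_id;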