Google BigQuery SQL: Aggregate Data Based on Sessions - sql

I am currently working with Google Analytics data in BigQuery, and one thing I have yet to wrap my head around is how to write a query that aggregates data over the events of one session.
I have searched around for something that might work, but haven't found it so far.
Basically, this is how the table looks (vastly simplified):
UserID | event_name  | event_timestamp
--------------------------------------
1      | login       | 1543171146125000
1      | other event | 1543171155329000
1      | other event | 1543171155341001
1      | login       | 1543171157796003
1      | other event | 1543171160541000
2      | login       | 1543171157796003
2      | other event | 1543171177531000
What I want to do now is aggregate data over user AND session, where a session is defined as all events up until the next login event appears for that user.
I'm assuming I have to come up with an additional field "session" that basically yields a new ID every time a login event_name is encountered for the UserID currently being aggregated.
So, for instance, if I want an aggregated event count, the resulting table would look something like:
UserID | session | EventCount
-----------------------------
1      | 1       | 3
1      | 2       | 2
2      | 1       | 2
My assumption would be that there is some sub-query I could use to get that magical "session" field, so something like:
SELECT UserID, session, COUNT(event_name) as EventCount
FROM (Insert Magical Subquery here)
GROUP BY UserID, session
Any ideas how this might be done? It seems like a simple thing but I just can't figure it out.

Based on your example, a session seems to start with a "login". So, you can just do a cumulative count of "login"s for each userid:
select t.*,
       countif(event_name = 'login') over (partition by userid order by event_timestamp) as session
from t;
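On the sample data above, that subquery would assign sessions like this (shown for illustration):
UserID | event_name  | session
------------------------------
1      | login       | 1
1      | other event | 1
1      | other event | 1
1      | login       | 2
1      | other event | 2
2      | login       | 1
2      | other event | 1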
You can then aggregate:
select userid, session, count(*) as eventcount
from (select t.*,
             countif(event_name = 'login') over (partition by userid order by event_timestamp) as session
      from t
     ) t
group by userid, session;

Related

PostgreSQL Count DISTINCT from one column when grouped by another

I have a single table that looks like the following (dumbed down):
userid | action | userstate
---------------------------
1      | click  | Maryland
2      | press  | Delaware
3      | jog    | New York
3      | leap   | New York
What I'm trying to query is "number of users doing ANY action, per state"
So the result would be:
state    | users_acting
------------------------
Maryland | 1
Delaware | 1
New York | 1
Note that an individual user will only ever appear in one state.
I can't get the mix of distinct users and grouping by state right. I can't do
SELECT DISTINCT (userid), COUNT(userid) FROM data GROUP BY state
because the DISTINCT column would need to be in the GROUP BY, which I don't actually want to do, not to mention the problems with the SELECT clause.
Thanks for any thoughts.
Just found out that there's a COUNT(DISTINCT ...) option which doesn't require that distinct value to be placed in the grouping clause.
SELECT state, COUNT(DISTINCT userid) AS users_acting FROM data GROUP BY state
Does the trick

Remove Duplicates Based Off of Two Columns in PostgreSQL

So let's say I have a table named Class with the following fields: userid, time, and score. The table looks like this:
+--------+------------+-------+
| userid | time       | score |
+--------+------------+-------+
| 1      | 08-20-2018 | 75    |
| 1      | 10-25-2018 | 50    |
| 1      | 02-01-2019 | 88    |
| 2      | 04-23-2019 | 98    | <-- remove
| 2      | 04-23-2019 | 86    |
| 3      | 06-05-2019 | 71    | <-- remove
| 3      | 06-05-2019 | 71    |
+--------+------------+-------+
However, I would like to remove records where the userid and the time are the same (since it doesn't make sense for someone to be given another score on the same day). This would also take care of records where the userid, time, and score are all the same. So in this table, rows 4 and 6 should be removed.
The following query gives me a list of the duplicated records:
SELECT userid, time
FROM class
GROUP BY userid, time
HAVING count(*) > 1;
However, how do I remove the duplicates while still keeping the userid, time, and score columns in the result?
You can use the row_number() window function to assign a number to each record, ordered by score, within each userid and time, and then select only the rows where this number is equal to one.
SELECT userid,
time,
score
FROM (SELECT userid,
time,
score,
row_number() OVER (PARTITION BY userid,
time
ORDER BY score) rn
FROM class) x
WHERE rn = 1;
First, you need some criterion to distinguish between two rows that have different scores (unless you want to choose randomly between the two). E.g., you could pick the highest score (like the SATs) or the lowest.
Assuming you want the highest score per day, you can do this:
SELECT distinct on (userid, time)
       userid, time, score
from class
order by userid, time, score desc
Some key things: the columns in your distinct on have to be in the left-most positions of your order by, but the magic is in the field that comes next in the order by - it'll pick the first row among dupes of (userid, time) when ordered by score desc.
You have a real problem with your data model. This is easy enough to fix in a select query, as the other answers suggest (I would recommend distinct on for this).
For actually deleting the rows, you can use ctid (as mentioned in a comment). The approach is:
delete from class t
where exists (select 1
              from class t2
              where t2.userid = t.userid and
                    t2.time = t.time and
                    t2.ctid < t.ctid
             );
That is, delete any row for which a row with a smaller ctid exists for the same userid/time combination.
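In PostgreSQL, the same self-join can also be written with DELETE ... USING; a sketch of that variant against the question's class table:
delete from class a
using class b
where a.userid = b.userid
  and a.time = b.time
  and b.ctid < a.ctid;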

SQL query to get latest user to update record

I have a postgres database that contains an audit log table which holds a historical log of updates to documents. It contains which document was updated, which field was updated, which user made the change, and when the change was made. Some sample data looks like this:
doc_id | user_id | created_date | field | old_value | new_value
--------+---------+------------------------+-------------+---------------+------------
A | 1 | 2018-07-30 15:43:44-05 | Title | | War and Piece
A | 2 | 2018-07-30 15:45:13-05 | Title | War and Piece | War and Peas
A | 1 | 2018-07-30 16:05:59-05 | Title | War and Peas | War and Peace
B | 1 | 2018-07-30 15:43:44-05 | Description | test 1 | test 2
B | 2 | 2018-07-30 17:45:44-05 | Description | test 2 | test 3
You can see that the Title of document A was changed three times, first by user 1 then by user 2, then again by user 1.
Basically I need to know which user was the last one to update a field on a particular document. So for example, I need to know that User 1 was the last user to update the Title field on document A. I don't really care what time it happened, just the document, field, and user.
So sample output would be something like this:
doc_id | field | user_id
--------+-------------+---------
A | Title | 1
B | Description | 2
Seems like it should be a fairly straightforward query to write, but I'm having some trouble with it. I would think that group by would be in order, but the problem is that if I group by doc_id I lose the user data:
select doc_id, max(created_date)
from document_history
group by doc_id;
doc_id | max
--------+------------------------
B      | 2018-07-30 17:45:44-05
A      | 2018-07-30 16:05:59-05
I could join these results back to the document_history table, but I would need to do so based on the doc_id and timestamp, which doesn't seem quite right. If two people edited a document at the exact same time, I would get multiple rows back for that document and field. Maybe that's so unlikely I shouldn't worry about it, but still...
Any thoughts on a way to do this in a single query?
You want to filter the records, so think where, not group by:
select dh.*
from document_history dh
where dh.created_date = (select max(dh2.created_date)
                         from document_history dh2
                         where dh2.doc_id = dh.doc_id and dh2.field = dh.field
                        );
In most databases, this will have better performance than a group by, if you have an index on document_history(doc_id, field, created_date).
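For reference, a minimal sketch of such an index (the index name is illustrative):
create index idx_document_history_doc_field_date
    on document_history (doc_id, field, created_date);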
If your DBMS supports window functions (e.g. PostgreSQL, SQL Server; aka analytic function in Oracle) you could do something like this (SQLFiddle with Postgres, other systems might differ slightly in the syntax):
http://sqlfiddle.com/#!17/981af/4
SELECT DISTINCT
doc_id, field,
first_value(user_id) OVER (PARTITION BY doc_id, field ORDER BY created_date DESC) as last_user
FROM get_last_updated
first_value() OVER (... ORDER BY x DESC) orders the rows of each window partition descending and then takes the first value, which corresponds to your latest timestamp.
I added the DISTINCT to get your expected result. The window function just adds a new column to your SELECT result, with the same value repeated within each partition. If you do not need that, remove the DISTINCT and you can work with the original data plus the newly gained information.
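If you would rather avoid the DISTINCT, a sketch of the same idea using row_number() instead (written against the question's document_history table; this is an alternative, not the original answer's code):
SELECT doc_id, field, user_id
FROM (SELECT doc_id, field, user_id,
             row_number() OVER (PARTITION BY doc_id, field
                                ORDER BY created_date DESC) AS rn
      FROM document_history) t
WHERE rn = 1;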

Does this type of database exist? (like a time-series DB that can get data for a specific point in time...)

Suppose I have a data structure like the one below...
USER_KEY is the user's unique key.
STATUS: 1 is online, 0 is offline.
UPDATED_DATE is the date the status was last updated.
USER_KEY | STATUS | UPDATED_DATE
----------------------------------------
1        | 1      | 2017-06-19 00:01:00
2        | 1      | 2017-06-19 00:01:01
3        | 1      | 2017-06-19 00:01:02
4        | 1      | 2017-06-19 00:01:02
1        | 0      | 2017-06-19 05:42:06
I want to get the data for a specific point in time.
When I select '2017-06-19 04:00:00',
the live user count is 4
and the live users are {1,2,3,4},
and when I select '2017-06-19 11:00:00',
the live user count is 3
and the live users are {2,3,4},
because user 1 went offline at 5:42 AM.
Right now I'm saving the user status data every hour, but this data is too big and inefficient.
Is there a database that can store this type of data efficiently?
Of course I could just save the count each time, but I need not only the count, but also all the users' keys.
You are looking for the last status for each user before a given point in time. In SQL Server, you can use window functions:
select count(*)
from (select t.*,
             row_number() over (partition by user_key order by updated_date desc) as seqnum
      from t
      where updated_date < #datetime
     ) t
where seqnum = 1 and status = 1;
You can say select * if you want the list of users.
I don't know what you mean by "type of database". You can readily do this in standard SQL.
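For instance, a sketch of that variant, returning just the keys of the users who are online (with #datetime still standing in for the point in time you care about):
select user_key
from (select t.*,
             row_number() over (partition by user_key order by updated_date desc) as seqnum
      from t
      where updated_date < #datetime
     ) t
where seqnum = 1 and status = 1;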

Delete all rows except latest per user, having a certain column value

I've got a table events that contains events for users, for example:
PK | user | event_type | timestamp
----------------------------------
1  | ab   | DTV        | 1
2  | ab   | DTV        | 2
3  | ab   | CPVR       | 3
4  | cd   | DTV        | 1
5  | cd   | DTV        | 2
6  | cd   | DTV        | 3
What I want to do is keep only one event per user, namely the one with the latest timestamp and event_type = 'DTV'.
After applying the delete to the example above, the table should look like this:
PK | user | event_type | timestamp
----------------------------------
2  | ab   | DTV        | 2
6  | cd   | DTV        | 3
Can any one of you come up with something that accomplishes this task?
Update: I'm using Sqlite. This is what I have so far:
delete from events
where PK not in (
    select PK from (
        select PK, user, max(timestamp)
        from events
        where event_type = 'DTV'
        group by user)
);
I'm pretty sure this can be improved upon. Any ideas?
I think you should be able to do something like this:
delete from events
where (user, timestamp) not in (
    select user, max(timestamp)
    from events
    where event_type = 'DTV'
    group by user
)
You could potentially do some more sophisticated tricks like table or partition replacement, depending on the database you're working with.
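For example, a sketch of the table-replacement idea in SQLite (the events_keep name is illustrative; this rebuilds the table with only the rows to keep, so test it on a copy first):
create table events_keep as
select e.*
from events e
join (select user, max(timestamp) as max_ts
      from events
      where event_type = 'DTV'
      group by user) m
  on m.user = e.user and m.max_ts = e.timestamp
where e.event_type = 'DTV';
drop table events;
alter table events_keep rename to events;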
If you are using SQL Server 2005/2008, then use the following SQL:
;WITH ce
     AS (SELECT *,
                Row_number() OVER (partition BY [user], event_type
                                   ORDER BY timestamp DESC) AS rownumber
         FROM events)
DELETE FROM ce
WHERE rownumber <> 1
      OR event_type <> 'DTV'
Your solution doesn't seem reliable enough to me, because your subquery is pulling a column that is neither aggregated nor added to GROUP BY. That said, I am not an experienced SQLite user, and your solution did work when I tested it. If there's any confirmation that the PK column is always reliably correlated with the MAX(timestamp) value in this situation, fine, your approach seems quite a decent one.
But if you are as unsure about your solution as I am, you could try the following:
DELETE FROM events
WHERE NOT EXISTS (
    SELECT *
    FROM (
        SELECT MAX(timestamp) AS ts
        FROM events e
        WHERE event_type = 'DTV'
          AND user = events.user
    ) s
    WHERE ts = events.timestamp
);
The inner instance of events is assigned a different alias so that the events alias could be used to unambiguously reference the outer instance of the table (the one the DELETE command is actually being applied to). This solution does assume that timestamp is unique per user, though.
A working example can be run and played with on SQL Fiddle.
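If (user, timestamp) could ever contain ties, the primary key can break them; a sketch along those lines (assuming PK is the table's primary-key column, as in the sample data):
DELETE FROM events
WHERE PK NOT IN (
    SELECT MAX(PK)
    FROM events e
    WHERE e.event_type = 'DTV'
      AND e.timestamp = (SELECT MAX(timestamp)
                         FROM events e2
                         WHERE e2.user = e.user
                           AND e2.event_type = 'DTV')
    GROUP BY e.user
);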