BigQuery Session & Hit level understanding - sql

I want to ask about your understanding of the concept of Events at the:
Hit level
Session level
How can I map this logic in BigQuery (standard SQL)? I also want to understand:
Sessions
Events per session
Unique events
Can somebody please guide me through these concepts? Sometimes totals.visits is treated as the session indicator, and sometimes visitId is taken as the session.

To achieve that you need to grapple a little with a few different concepts. The first is "what is a session" in GA lingo; you can find that here. A session is a collection of hits, and a hit is one of the following: a pageview, an event, a social interaction or a transaction.
Now, to see how that is represented in the BQ schema, you can look here. visitId and fullVisitorId will help you define a session (as opposed to a user).
Then you can count the hits that are events of the type you want.
It could look something like:
SELECT visitId,
SUM(CASE WHEN h.type = 'EVENT' THEN 1 ELSE 0 END) AS events
FROM `dataset.table_*` t, UNNEST(t.hits) AS h
GROUP BY 1
That should work to get you an overview. If you need to slice and dice the event details (i.e. hits.eventInfo.*), then I suggest you make one query for all the visitIds and one for all the relevant events with their respective visitIds.
I hope that works!
Cheers

You can think of these concepts like this:
every row is a session
technically every row with totals.visits=1 is a valid session
hits is an array containing structs which contain information for every hit
You can write subqueries on arrays - basically, treat them as tables. I'd recommend studying Working with Arrays and applying/transferring every exercise directly to hits, if possible.
Example for subqueries on session level
SELECT
fullvisitorid,
visitStartTime,
(SELECT SUM(IF(type='EVENT',1,0)) FROM UNNEST(hits)) events,
(SELECT COUNT(DISTINCT CONCAT(eventInfo.eventCategory,eventInfo.eventAction,eventInfo.eventLabel) )
FROM UNNEST(hits) WHERE type='EVENT') uniqueEvents,
(SELECT SUM(IF(type='PAGE',1,0)) FROM UNNEST(hits)) pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
totals.visits=1
LIMIT
1000
Example for flattening to hit level
There's also the possibility to use fields in arrays for grouping, if you cross join the arrays with their parent rows:
SELECT
h.type,
COUNT(1) hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS t CROSS JOIN t.hits AS h
WHERE
totals.visits=1
GROUP BY
1
Regarding the relation between visitId and Sessions you can read this answer.
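On that point, a common pattern (a sketch against the public sample dataset, not part of the answer above) is to build an explicit session id from fullVisitorId and visitId, since visitId alone is only unique per user:

```sql
-- sketch: one surrogate session id per row; visitId alone can collide across users
SELECT
CONCAT(fullVisitorId, '-', CAST(visitId AS STRING)) AS sessionId,
totals.hits AS hitsInSession
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
totals.visits = 1
LIMIT 1000
```

Grouping or joining on sessionId then behaves the way GA's session scope does, rather than lumping together unrelated users who happen to share a visitId.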


How to see whole session path in BigQuery?

I'm learning standard SQL in BigQuery and I have a task where I have to show what users did after entering the checkout, i.e. which specific URLs they visited. I figured out something like this, but it only shows me one previous step, and I need to see at least 5 of them. Is this possible? Thank you
WITH path_and_prev AS (
SELECT ARRAY(
SELECT AS STRUCT hits.page.pagePath
, LAG(hits.page.pagePath) OVER(ORDER BY i) prevPagePath
FROM UNNEST(hits) hits WITH OFFSET i
) x
FROM `xxxx.ga_sessions_20160801`
)
SELECT COUNT(*) AS cnt, pagePath, prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE REGEXP_CONTAINS(pagePath, r'(checkout/cart)')
GROUP BY 2, 3
ORDER BY cnt DESC
Here is the official GA schema for the BQ export:
https://support.google.com/analytics/answer/3437719?hl=en
(Just a tip: feel free to export it to a sheet (Excel or Google or whatever) and indent it decently to ease understanding of the nesting :) )
The only way to safely get session behaviour is to use hits.hitNumber. Since pagePath sits under page, which sits under hits, hitNumber will always be specified :)
It's up to you to filter on filled pagePath only, while still displaying the hitNumber value.
Tell me if the solution matches your issue, or correct me :)
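Building on the hitNumber advice, here is a sketch (not tested against real data; the table name is the questioner's placeholder) that extends the LAG() idea to carry up to five previous steps, ordered by hitNumber:

```sql
-- sketch: five previous pagePaths per hit, ordered by hitNumber
WITH path_and_prev AS (
SELECT ARRAY(
SELECT AS STRUCT page.pagePath,
LAG(page.pagePath, 1) OVER (ORDER BY hitNumber) AS prev1,
LAG(page.pagePath, 2) OVER (ORDER BY hitNumber) AS prev2,
LAG(page.pagePath, 3) OVER (ORDER BY hitNumber) AS prev3,
LAG(page.pagePath, 4) OVER (ORDER BY hitNumber) AS prev4,
LAG(page.pagePath, 5) OVER (ORDER BY hitNumber) AS prev5
FROM UNNEST(hits)
) AS x
FROM `xxxx.ga_sessions_20160801`
)
SELECT pagePath, prev1, prev2, prev3, prev4, prev5, COUNT(*) AS cnt
FROM path_and_prev, UNNEST(x)
WHERE REGEXP_CONTAINS(pagePath, r'checkout/cart')
GROUP BY 1, 2, 3, 4, 5, 6
ORDER BY cnt DESC
```

The LAG offsets return NULL where a session has fewer than five hits before the checkout page, so short sessions still appear.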

Google Analytics session-scoped fields returning multiple values

I've discovered that there are certain GA "session"-scoped fields in BigQuery that have multiple values for the same fullVisitorId and visitId fields (the example screenshot is not reproduced here).
Grouping the fields doesn't help either. In GA, I've compared the number of users against the number of users split by device, and the user counts differ.
This explains what's going on: a user gets grouped under multiple devices. My conclusion is that at some point during the user's session their browser user-agent changes, and in the subsequent hit a new device type is set in GA.
I'd have hoped GA would use either the first or the last value to avoid this scenario, but I guess it doesn't. So I'm accepting this as a "flaw" in GA and would rather pick one value. What's the best way to select the last or first device value from the below query:
SELECT
fullVisitorId,
visitId,
device.deviceCategory
FROM (
SELECT
*
FROM
`project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT
*
FROM
`project.dataset.ga_sessions_*` mobile ) table
I've tried doing a sub-select with STRING_AGG(), attempting to order by hits.time and limit to one value, and that still creates another row.
I've tested and found that the below fields all have the same issue:
visitNumber
totals.hits
totals.pageviews
totals.timeOnSite
trafficSource.campaign
trafficSource.medium
trafficSource.source
device.deviceCategory
totals.sessionQualityDim
channelGrouping
device.mobileDeviceInfo
device.mobileDeviceMarketingName
device.mobileDeviceModel
device.mobileInputSelector
device.mobileDeviceBranding
UPDATE
I ran queries around this particular fullVisitorId and visitId with the UNION removed, then with visitStartTime added, and then with both visitStartTime and hits.time added (result screenshots not reproduced here).
Well, from the looks of things, I think you have 3 options:
1 - Group by fullVisitorId, visitId and use MAX() or MIN() of deviceCategory. That should prevent a device switcher from being double-counted. It's kind of arbitrary, but then so is the GA data.
2 - Similar, but if the deviceCategory result can be anything (i.e. isn't constrained in the results to just the valid deviceCategory members), you can use a CASE to check whether MAX(deviceCategory) = MIN(deviceCategory), and if they differ, return 'Multiple Devices'.
3 - You could go further: count the number of different devices used, construct a concatenation that lists them in some way, etc.
I'm going to write up number 2 for you. In your question you have 2 different queries, one with [date] and one without - I'll provide both.
Without [date]:
SELECT
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY fullVisitorId, visitId
With [date]:
SELECT
[date],
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY [date], fullVisitorId, visitId
I'm assuming here that the SELECTs and UNION you gave are sound.
Also, I should point out that those {metric aggregations} should be something other than SUMs, otherwise you will still be double-counting.
I hope this helps.
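For completeness, option 3 could be sketched like this (a hedged variant of the queries above; STRING_AGG with DISTINCT is BigQuery standard SQL):

```sql
-- sketch: count and list the distinct devices seen per session
SELECT
fullVisitorId,
visitId,
COUNT(DISTINCT device.deviceCategory) AS deviceCount,
STRING_AGG(DISTINCT device.deviceCategory, ',') AS devices
FROM `project.dataset.ga_sessions_*`
GROUP BY fullVisitorId, visitId
```

Sessions with deviceCount > 1 are exactly the device switchers, and the devices column shows which categories were involved.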
It's simply not possible to have two values in this field for one row, because it can only contain one value.
There are 2 possibilities:
You're actually querying two separate datasets / two different views; that's not clearly visible from the example code. The client id (= fullVisitorId) is only unique per property (tracking id, the UA-xxxxx stuff). If you query two views from different properties, you have to expect to see the same ids used twice.
Given they come from one property, these two rows could actually be one session split at midnight, which means visitId stays the same but visitStartTime changes. But that would also mean the decision algorithm for device type changed in the meantime ... that would be curious.
Try using visitStartTime and see what happens.
If you're using two different properties, use a user id to combine the sessions, or separate them by adding a constant - you can't just combine them:
SELECT 'property_A' AS constant FROM ...
hth
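To make the visitStartTime check concrete, a diagnostic query could look like this (the visitor id filter is a hypothetical placeholder):

```sql
-- sketch: inspect all rows for one suspect visitor
SELECT
fullVisitorId,
visitId,
visitStartTime,
date,
device.deviceCategory
FROM
`project.dataset.ga_sessions_*`
WHERE
fullVisitorId = '1234567890123456789'  -- hypothetical id, substitute your own
ORDER BY
visitStartTime
```

If the same visitId shows up with two different visitStartTime values, you are looking at a midnight split rather than a genuine duplicate.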

SQL convert sample points into durations

This is similar to Compute dates and durations in mysql query, except that I don't have a unique ID column to work with, and I have samples, not start/end points.
As an interesting experiment, I set cron to run ps aux > $(date +%Y-%m-%d_%H-%M).txt. I now have around 250,000 samples of "what the machine was running".
I would like to turn this into a list of "process | cmd | start | stop". The assumption is that a 'start' event is the first sample where the pair existed, and a 'stop' event is the first sample where it stopped existing; there is no chance of a sample "missing" or anything like that.
That said, what ways exist for doing this transformation, preferably using SQL (on the grounds that I like SQL, and this seems like a nice challenge)? If pids could not be repeated this would be a trivial task (put everything in a table, then SELECT pid, MIN(time), MAX(time) ... GROUP BY pid). However, since PID/cmd pairs are repeated (I checked; there are duplicates), I need a method that does a true "find all contiguous segments" search.
If necessary I can do something of the form
Load file0 -> oldList
ForEach fileN:
Load fileN ->newList
oldList-newList = closedN
newList-oldList = openedN
oldList=newList
But that is not SQL and not interesting. And who knows, I might end up having real SQL data with this property to deal with at some point.
I'm thinking of something where one first constructs a table of diffs, then joins all closes against all opens and pulls the minimum-distance close after each open, but I'm wondering if there's a better way.
You don't mention what database you are using. Let me assume that you are using a database that supports ranking functions, since that simplifies the solution.
The key to solving this is an observation: you want to assign an id to each run of a pid so you can tell the runs apart. I am going to assume that a pid represents a new process when the pid did not appear in the previous timestamped output.
Now, the idea is:
Assign a sequential number to each set of output. The first call to ps gets 1, the next 2, and so on, based on date.
Assign a sequential number to each pid, based on date. The first appearance gets 1, the next 2, and so on.
For pids that appear in sequence, the difference is a constant. We can call this the groupid for that set.
So, this is the query in action:
select groupid, pid, min(time), max(time)
from (select t.*,
(dense_rank() over (order by time) -
row_number() over (partition by pid order by time)
) as groupid
from t
) t
group by groupid, pid
This works in most databases (SQL Server, Oracle, DB2, Postgres, Teradata, among others). It does not work in older versions of MySQL, which lack window/analytic functions (they were added in MySQL 8.0).
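To see the trick in action, here is a self-contained toy version with inline data (table and column names are illustrative; runs on any engine with window functions). pid 7 is present in every snapshot, while pid 9 vanishes after snapshot 2 and reappears at snapshot 4, so it must come out as two segments:

```sql
-- toy data: five ps snapshots at times 1..5
WITH t AS (
SELECT 7 AS pid, 1 AS time UNION ALL SELECT 9, 1 UNION ALL
SELECT 7, 2 UNION ALL SELECT 9, 2 UNION ALL
SELECT 7, 3 UNION ALL
SELECT 7, 4 UNION ALL SELECT 9, 4 UNION ALL
SELECT 7, 5 UNION ALL SELECT 9, 5
)
SELECT pid, MIN(time) AS start_time, MAX(time) AS stop_time
FROM (SELECT pid, time,
             (DENSE_RANK() OVER (ORDER BY time) -
              ROW_NUMBER() OVER (PARTITION BY pid ORDER BY time)
             ) AS groupid
      FROM t
     ) s
GROUP BY groupid, pid
ORDER BY pid, start_time
-- expected: (7, 1, 5), (9, 1, 2), (9, 4, 5)
```

DENSE_RANK numbers the snapshots 1..5; ROW_NUMBER numbers each pid's own appearances. The difference stays constant while a pid appears in consecutive snapshots and jumps when it skips one, which is what splits pid 9 into two groups.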

Dynamic user ranks

I have a basic karma/rep system that awards users based on their activities (questions, answers, etc..). I want to have user ranks (title) based on their points. Different ranks have different limitations and grant powers.
ranks table
id rankname points questions_per_day
1 beginner 150 10
2 advanced 300 30
I'm not sure if I need both a lower and an upper limit, but for the sake of simplicity I have only kept a max points limit; that is, a user below 150 points is a 'beginner', and from 150 upwards (even beyond 300) he's an 'advanced'.
For example, Bob with 157 points would have an 'advanced' tag displayed by his username.
How can I determine and display the rank/title of a user? Do I loop through each row and compare values?
What problems might arise if I scale this to thousands of users having their rank calculated this way? Surely it will tax the system to query and loop each time a user's rank is requested, no?
You'd do better to cache the rank along with the score. If a user's score only changes when they do certain activities, you can put a trigger on those activities. When the score changes, you recalculate the rank and save it in the user's record. That way, retrieving the rank is trivial; you only need to calculate it when the score changes.
You can get the matching rank id like this: query the rank that is closest to (but below or equal to) the user's score, and store this rank id in the user's record.
I added the pseudo-variable {USERSCORE} because I don't know if you use parameters or some other way to pass values into a query.
select r.id
from ranks r
where r.points <= {USERSCORE}
order by r.points desc
limit 1
A little difficult without knowing your schema. Try:
SELECT user.id, MIN(ranks.id) AS rankid FROM user JOIN ranks ON (user.score <= ranks.points) GROUP BY user.id;
Now you know the rank's id.
This is non-trivial though (GROUP BY and MIN are pipeline breakers and thus quite heavyweight operations), so GolezTrol's advice is good: you should cache this information and update it only when a user's score changes. A trigger sounds fine for this.
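The trigger-based caching idea could be sketched like this, in MySQL syntax, assuming a hypothetical users(id, score, rank_id) table alongside the ranks table from the question:

```sql
-- hypothetical schema: users(id, score, rank_id)
-- recompute the cached rank whenever a user's score is updated
CREATE TRIGGER users_rank_cache
BEFORE UPDATE ON users
FOR EACH ROW
SET NEW.rank_id = (
  SELECT r.id
  FROM ranks r
  WHERE r.points <= NEW.score
  ORDER BY r.points DESC
  LIMIT 1
);
```

Displaying a rank then costs a single join on users.rank_id; the lookup query only runs on writes, not on every page view.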

What is an unbounded query?

Is an unbounded query a query without a WHERE param = value statement?
Apologies for the simplicity of this one.
An unbounded query is one where the search criteria is not particularly specific, and is thus likely to return a very large result set. A query without a WHERE clause would certainly fall into this category, but let's consider for a moment some other possibilities. Let's say we have tables as follows:
CREATE TABLE SALES_DATA
(ID_SALES_DATA NUMBER PRIMARY KEY,
TRANSACTION_DATE DATE NOT NULL,
LOCATION NUMBER NOT NULL,
TOTAL_SALE_AMOUNT NUMBER NOT NULL,
...etc...);
CREATE TABLE LOCATION
(LOCATION NUMBER PRIMARY KEY,
DISTRICT NUMBER NOT NULL,
...etc...);
Suppose that we want to pull in a specific transaction, and we know the ID of the sale:
SELECT * FROM SALES_DATA WHERE ID_SALES_DATA = <whatever>
In this case the query is bounded, and we can guarantee it's going to pull in either one or zero rows.
Another example of a bounded query, but with a large result set, would be the one produced when the director of district 23 says "I want to see the total sales for each store in my district for every day last year", which would be something like
SELECT S.LOCATION, TRUNC(S.TRANSACTION_DATE), SUM(S.TOTAL_SALE_AMOUNT)
FROM SALES_DATA S,
LOCATION L
WHERE S.TRANSACTION_DATE BETWEEN '01-JAN-2009' AND '31-DEC-2009' AND
L.LOCATION = S.LOCATION AND
L.DISTRICT = 23
GROUP BY S.LOCATION,
TRUNC(S.TRANSACTION_DATE)
ORDER BY S.LOCATION,
TRUNC(S.TRANSACTION_DATE)
In this case the query should return 365 (or fewer, if stores are not open every day) rows for each store in district 23. If there are 25 stores in the district it'll return 9,125 rows or fewer.
On the other hand, let's say our VP of Sales wants some data. He/she/it isn't quite certain what's wanted, but he/she/it is pretty sure that whatever it is happened in the first six months of the year...not quite sure about which year...and not sure about the location, either - probably in district 23 (he/she/it has had a running feud with the individual who runs district 23 for the past 6 years, ever since that golf tournament where...well, never mind...but if a problem can be hung on the door of district 23's director so be it!)...and of course he/she/it wants all the details, and have it on his/her/its desk toot sweet! And thus we get a query that looks something like
SELECT L.DISTRICT, S.LOCATION, S.TRANSACTION_DATE,
S.something, S.something_else, S.some_more_stuff
FROM SALES_DATA S,
LOCATION L
WHERE EXTRACT(MONTH FROM S.TRANSACTION_DATE) <= 6 AND
L.LOCATION = S.LOCATION
ORDER BY L.DISTRICT,
S.LOCATION
This is an example of an unbounded query. How many rows will it return? Good question - that depends on how business conditions were, how many locations were open, how many days there were in February, etc.
Put more simply, if you can look at a query and have a pretty good idea of how many rows it's going to return (even though that number might be relatively large) the query is bounded. If you can't, it's unbounded.
Share and enjoy.
http://hibernatingrhinos.com/Products/EFProf/learn#UnboundedResultSet
An unbounded result set is where a query is performed and does not explicitly limit the number of returned results from a query. Usually, this means that the application assumes that a query will always return only a few records. That works well in development and in testing, but it is a time bomb waiting to explode in production.
The query may suddenly start returning thousands upon thousands of rows, and in some cases, it may return millions of rows. This leads to more load on the database server, the application server, and the network. In many cases, it can grind the entire system to a halt, usually ending with the application servers crashing with out of memory errors.
Here is one example of a query that will trigger the unbounded result set warning:
var query = from post in blogDataContext.Posts
where post.Category == "Performance"
select post;
If the performance category has many posts, we are going to load all of them, which is probably not what was intended. This can be fixed fairly easily by adding pagination using the Take() method:
var query = (from post in blogDataContext.Posts
where post.Category == "Performance"
select post)
.Take(15);
Now we are assured that we only need to handle a predictable, small result set, and if we need to work with all of them, we can page through the records as needed. Paging is implemented using the Skip() method, which instructs Entity Framework to skip (at the database level) N number of records before taking the next page.
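For comparison, the Skip/Take pattern corresponds to OFFSET/LIMIT paging in SQL (syntax varies by engine; the table and column names follow the example above):

```sql
-- page 3 of the "Performance" posts, 15 per page: Skip(30).Take(15)
SELECT *
FROM Posts
WHERE Category = 'Performance'
ORDER BY Id            -- paging requires a stable sort order
LIMIT 15 OFFSET 30     -- SQL Server: OFFSET 30 ROWS FETCH NEXT 15 ROWS ONLY
```

Note that without the ORDER BY, the database is free to return rows in any order, so successive pages could overlap or skip records.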
But there is another common occurrence of the unbounded result set problem from directly traversing the object graph, as in the following example:
var post = postRepository.Get(id);
foreach (var comment in post.Comments)
{
// do something interesting with the comment
}
Here, again, we are loading the entire set without regard for how big the result set may be. Entity Framework does not provide a good way of paging through a collection while traversing the object graph. It is recommended that you issue a separate, explicit query for the contents of the collection, which will allow you to page through it without loading too much data into memory.
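The separate, explicit query suggested here boils down to paging the child table directly (names follow the blog example; the post id is a placeholder):

```sql
-- page through one post's comments instead of loading post.Comments whole
SELECT *
FROM Comments
WHERE PostId = 42      -- the id fetched earlier; 42 is a placeholder
ORDER BY Id
LIMIT 15 OFFSET 0      -- advance the offset to fetch the next page
```

Each page is then a small, predictable result set regardless of how many comments the post has accumulated.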