How to query Total Events based on Experiment Variant? - google-bigquery

I'm looking to query some data from GA through BQ for use in A/B-test analysis.
What I'd like to pull out is how many users were placed into each variant, and the total number of add-to-cart completions.
The query below doesn't quite match what I'm seeing in GA (I know there can be differences), so I want to make sure I've got it right. It very closely matches the 'Unique Events' metric in GA, but I want to be sure it's showing me the 'Total Events' metric:
SELECT
  exp_.experimentVariant AS variant,
  COUNT(DISTINCT fullVisitorId) AS users,
  COUNTIF(hits_.eventInfo.eventAction = "add to cart") AS add_to_cart
FROM
  `XXXXX.YYYYY.ga_sessions_*`,
  UNNEST(hits) AS hits_,
  UNNEST(hits_.experiment) AS exp_
WHERE
  exp_.experimentId = "XXXYYYZZZ"
  AND _TABLE_SUFFIX BETWEEN "20220315" AND "20220405"
GROUP BY
  variant
ORDER BY
  variant
The reason I'm not sure it's right is that when I use the following query, the output completely matches the 'Total Events' metric in GA:
SELECT
  COUNT(DISTINCT fullVisitorId) AS users,
  COUNTIF(hits.eventInfo.eventAction = "add to cart") AS add_to_cart
FROM
  `XXXXX.YYYYY.ga_sessions_*`,
  UNNEST(hits) AS hits
WHERE
  _TABLE_SUFFIX BETWEEN "20220315" AND "20220405"

Your query returns all users who had a hit with the specified experiment variant, and all add-to-cart events where that variant was sent together with the hit. In that sense it is correct.
A user segment in GA of users exposed to the experiment works differently and returns a different result. Users in the experiment variant may also have performed add-to-cart events that didn't have the experiment parameter sent with them; for example, an add-to-cart event could have been sent before the user was even exposed to the experiment. If those events fall within the timeframe, they are included as long as the user qualifies for the segment.
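A sketch of a query that comes closer to that segment behaviour: first collect every user exposed to the experiment, then count all of their add-to-cart events in the window, whether or not the hit carried the experiment parameter (same placeholder IDs as above; real GA segments apply more rules than this):
WITH exposed_users AS (
  -- every user who saw the experiment at least once, per variant
  SELECT DISTINCT
    fullVisitorId,
    exp_.experimentVariant AS variant
  FROM
    `XXXXX.YYYYY.ga_sessions_*`,
    UNNEST(hits) AS hits_,
    UNNEST(hits_.experiment) AS exp_
  WHERE
    exp_.experimentId = "XXXYYYZZZ"
    AND _TABLE_SUFFIX BETWEEN "20220315" AND "20220405"
)
SELECT
  e.variant,
  COUNT(DISTINCT s.fullVisitorId) AS users,
  -- all add-to-cart hits by exposed users, with or without the experiment parameter
  COUNTIF(hits_.eventInfo.eventAction = "add to cart") AS add_to_cart
FROM
  `XXXXX.YYYYY.ga_sessions_*` AS s,
  UNNEST(hits) AS hits_
JOIN
  exposed_users AS e
  ON s.fullVisitorId = e.fullVisitorId
WHERE
  _TABLE_SUFFIX BETWEEN "20220315" AND "20220405"
GROUP BY
  e.variant
ORDER BY
  e.variant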

Related

Shopware API search products endpoint total from a stream inconsistent

I have two dynamic product groups.
First: Test Product with variants
Condition: Product is equal to Variant product
Result: a total of 7, as I expect.
Second: Active Products
Condition: Active yes
We can already see that the stream ids are set on only 5 products.
Now we get a total of 5 instead of the expected 15 products.
Why is it inconsistent, and how can I modify my request to also consider the variants?
You shouldn't rely on the stream_ids column as an indicator of which products are shown in a dynamic product group at any given moment, because several other factors determine whether a product is shown to a user in a dynamic product group.
The filters you define for the group resolve to an SQL query, which in simplified terms would yield something like WHERE active = 1 AND id IN ('...', '...'). So the stream_ids column isn't used to select the contents of a group; instead, the entire query including all filters is executed in the storefront request. The result of that query is what you see in the preview of the dynamic product group.
Why doesn't it correlate completely with the content of stream_ids?
Shopware features inheritance of fields. If a field of a variant hasn't been assigned a value, it may inherit that value from the parent product. This may not be reflected in the contents of stream_ids; in fact, the children/variants may even inherit the contents of stream_ids.
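As a simplified illustration of what inheritance means at query time (table and column names simplified; the queries Shopware actually builds are far more involved): a variant with no own value falls back to its parent's value, so a filter on "active" can match a variant through the parent, regardless of what stream_ids says.
-- Sketch only: a variant whose own "active" is NULL inherits the parent's value
SELECT p.id
FROM product AS p
LEFT JOIN product AS parent ON p.parent_id = parent.id
WHERE COALESCE(p.active, parent.active) = 1;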
Then there's the fact that the contents of the product group may vary depending on the current sales channel. That may be because the sales channel uses a different language, so the content of a translatable field used in a filter may differ. Also, if you use price filters, there may be products with multiple prices, which only apply when certain conditions defined in the rule builder are met.
In short, don't count on stream_ids: they can't reflect all these variables and are only used internally in some capacity, for invalidating caches and such. Instead, use the preview to judge what the average user will find when they see a product group. You can also choose which sales channel the preview should apply to, for exactly this reason: contents may differ per sales channel.

Daily Retention with Filter in BigQuery

I am using a query to calculate daily retention on my Firebase Analytics data exported to BigQuery. It works well and the numbers match those in Firebase, but when I try to filter the query by a cohort of users, the numbers don't add up.
I want to compare the results of an A/B test from Firebase, so I've looked at the user property "firebase_exp_2", which is my A/B test, and split the users into the two groups (0/1). The retention numbers don't match the numbers in my A/B test results in Firebase at all - in fact, they show the opposite pattern.
The query is adapted from here: https://github.com/sagishporer/big-query-queries-for-firebase/wiki/Query:-Daily-retention
All I've changed is adding the following under the "WHERE" clause:
WHERE
  event_name = 'user_engagement' AND user_pseudo_id IN
    (SELECT user_pseudo_id
     FROM `analytics_XXX.events_*`,
       UNNEST(user_properties) AS user_properties
     WHERE user_properties.key = 'firebase_exp_2'
       AND user_properties.value.string_value = '1')
Firebase says that there are 6,043 users in the Control group and 6,127 in the Variant A group, but my numbers are 5,632 and 5,730, and the retained users are around 1,000 users more than what Firebase reports.
What am I doing wrong?
The export to BigQuery happens on a daily basis, and each imported table is named events_YYYYMMDD. Additionally, a table is imported for events received throughout the current day; this table is named events_intraday_YYYYMMDD.
The addition you made queries events_*, which is fine. The linked example, however, uses events_201812*, which ignores the intraday table. That would explain why your numbers are lower: you are missing users added to the A/B test during the current day.
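To make the two parts consistent, both should cover the same tables. A sketch of counting the experiment's users across both the daily and the intraday tables (dates are placeholders; note that for tables matched by events_*, the _TABLE_SUFFIX of an intraday table starts with "intraday_"):
SELECT COUNT(DISTINCT user_pseudo_id) AS experiment_users
FROM `analytics_XXX.events_*`,
  UNNEST(user_properties) AS up
WHERE up.key = 'firebase_exp_2'
  AND up.value.string_value = '1'
  -- daily tables for the period, plus the matching intraday tables
  AND (_TABLE_SUFFIX BETWEEN '20181201' AND '20181231'
       OR _TABLE_SUFFIX LIKE 'intraday_201812%')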

How to get measurement protocol hit data in Big Query?

We're trying to connect online and offline behaviour via the Measurement Protocol.
A hit has been sent to Google Analytics with the following parameters (among others):
eventCategory= offline_transaction
source= store
medium= offline
The data is correctly registered in Google Analytics and is available in the Reporting section.
I'm trying to get them in BigQuery this way:
SELECT
  hits.eventInfo.eventCategory,
  trafficSource.source,
  trafficSource.medium
FROM [XXX:YYY.ga_sessions_20160827]
WHERE hits.eventInfo.eventCategory = "offline_transaction"
  AND trafficSource.source = "store"
  AND trafficSource.medium = "offline"
and the output is 'Query returned zero records'.
Any idea what I'm doing wrong? Is data coming from the Measurement Protocol available in BigQuery at all?
Thanks in advance.
I believe what is happening is that trafficSource.source/medium are recorded at the session level and hits.eventCategory at the hit level, so they are never included in a single row together and 0 rows match your query. Try something like the below:
SELECT
  MAX(IF(hits.eventInfo.eventCategory = "offline_transaction",
         hits.eventInfo.eventCategory, NULL)) WITHIN RECORD AS eventCategory,
  SUM(IF(hits.eventInfo.eventCategory = "offline_transaction",
         1, NULL)) WITHIN RECORD AS eventCnt,
  trafficSource.source,
  trafficSource.medium
FROM [XXX:YYY.ga_sessions_20160827]
WHERE hits.eventInfo.eventCategory = "offline_transaction"
  AND trafficSource.source = "store"
  AND trafficSource.medium = "offline"
This should give you a count of how many times that event occurred within each session. Without knowing more about your use case and what you want to pull out of the table, I don't know how else to help.
I've had to use the aggregate_function() WITHIN RECORD syntax frequently to deal with these types of issues.
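For reference, the same per-session count can be sketched in standard SQL (the answer above uses legacy SQL), where WITHIN RECORD is replaced by a correlated subquery over UNNEST and the table reference uses backticked dot notation:
SELECT
  trafficSource.source,
  trafficSource.medium,
  -- per-session count of the offline_transaction events
  (SELECT COUNTIF(h.eventInfo.eventCategory = "offline_transaction")
   FROM UNNEST(hits) AS h) AS eventCnt
FROM `XXX.YYY.ga_sessions_20160827`
WHERE trafficSource.source = "store"
  AND trafficSource.medium = "offline"
  -- keep only sessions that contain the event at least once
  AND EXISTS (
    SELECT 1 FROM UNNEST(hits) AS h
    WHERE h.eventInfo.eventCategory = "offline_transaction")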

SQL complicated query with joins

I have a problem with one query.
Let me explain what I want.
For the sake of brevity, let's say that I have three tables:
-Offers
-Ratings
-Users
Now what I want to do is create an SQL query:
I want offers to be listed with all their fields, plus an additional temporary column that IS NOT stored anywhere, called AverageUserScore.
This AverageUserScore is the result of grabbing all offers belonging to a particular user, then grabbing all ratings belonging to those offers, and averaging those ratings - that average is the AverageUserScore.
To explain further: I need this query for a Ruby on Rails application. In the application's browser view you can see all offers of other users, with the AverageUserScore at the very end, as the last column.
Associations:
Offer has many ratings
Offer belongs to user
Rating belongs to offer
User has many offers
Assumptions made:
You actually have a numeric column (of any type that SQL's AVG is fine with) in your Rating model. I'm using a column ratings.rating in my examples.
AverageUserScore is unconventional, so average_user_score is better.
You don't mind not getting users that have no offers: average rating is not clearly defined for them anyway.
You don't deviate from Rails' conventions far enough to have a primary key other than id.
Displaying offers for each user is a straightforward task: in a loop of #users.each do |user|, you can do user.offers.each do |offer| and be set. The only problem here is that it will execute a separate query for every user. Not good.
The "fetching offers" part is a standard N+1 counter seen even in the guides.
#users = User.includes(:offers).all
The interesting part here is only getting the averages.
For that I'm going to use Arel. It's already part of Rails, ActiveRecord is built on top of it, so you don't need to install anything extra.
You should be able to do a join like this:
User.joins(offers: :ratings)
And this won't get you anything interesting by itself (apart from filtering out users that have no offers). Inside, though, you'll get a huge set of every rating joined with its corresponding offer and that offer's user. Since we're taking averages per user, we need to group by users.id, effectively making one entry per users.id value. That is, one per user. A list of users, yes!
Let's stop for a second and make some assignments to make Arel-related code prettier. In fact, we only need two:
users = User.arel_table
ratings = Rating.arel_table
Okay. So. We need to get a list of users (all fields), and for each user fetch an average value seen on his offers' ratings' rating field. So let's compose these SQL expressions:
# users.*
user_fields = users[Arel.star] # Arel.star is a portable SQL "wildcard"
# AVG(ratings.rating) AS average_user_score
average_user_score = ratings[:rating].average.as('average_user_score')
All set. Ready for the final query:
User.includes(:offers)                       # N+1 counteraction
    .joins(offers: :ratings)                 # dat join
    .select(user_fields, average_user_score) # fields we need
    .group(users[:id])                       # grouping to only get one row per user
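For what it's worth, the SQL this composes to should look roughly like the following (assuming the conventional offers.user_id and ratings.offer_id foreign keys; grouping by the primary key is what lets PostgreSQL accept users.* alongside the aggregate):
-- Approximate SQL generated by the relation above
SELECT users.*, AVG(ratings.rating) AS average_user_score
FROM users
INNER JOIN offers  ON offers.user_id   = users.id
INNER JOIN ratings ON ratings.offer_id = offers.id
GROUP BY users.id;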

Current database-structure improvements & implementing badge system

I'm making a website that has logos that need to be guessed. Currently this is my db setup:
Users(user_id, etc)
Logos(logo_id, img, company, level)
Guess(guess_id, user_id, logo_id, guess, guess_count, guessed, time)
When a user makes a guess, it's done with an AJAX request. In this request, 2 queries are run: one to retrieve the company data (from Logos), one to insert/update the new guess in the db (Guess).
Now, on every page load I need to know the total number of guesses and how many logos there are per level. This requires 2 queries - one that checks Logos, and one that gets the number of solved (guessed = 1) guesses per level from Guess - as sketched below.
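Roughly, those two queries look like this (column names as above; the user id is a placeholder):
-- Logos available per level:
SELECT level, COUNT(*) AS total_logos
FROM Logos
GROUP BY level;

-- Solved guesses per level for the current user:
SELECT l.level, COUNT(*) AS guessed_logos
FROM Guess g
JOIN Logos l ON l.logo_id = g.logo_id
WHERE g.guessed = 1
  AND g.user_id = 123
GROUP BY l.level;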
Now I want to implement some kind of badge system, like here on SO. Reading through some other questions, I saw that it might be better to have a separate table containing the total number of guesses and such, so that it takes the same resources whether a user has 10 guesses or 10,000. I didn't do this for several reasons:
it requires an extra query in my AJAX call, which I'd like to keep as short as possible
page reloads shouldn't happen that frequently, so that shouldn't take too long
I wouldn't know how to count the total number of guesses per level, unless the table looked like AmountOfGuesses(id, user_id, level, counter) - but then it'd take more resources depending on the number of levels a user has unlocked.
As for the badge system, I know the conditions should be checked e.g. when a user submits an answer. Of course this requires yet another query every time an answer is submitted, namely to check the total number of answers the user has. Depending on that number, the badge should then be assigned. As for the badges, I was thinking about a table structure like this:
Badges( badge_id, name, description, etc)
BadgeAssigned( user_id, badge_id, time )
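In DDL terms, that would be something like the following (column types are just my assumption):
CREATE TABLE Badges (
  badge_id    INT PRIMARY KEY,
  name        VARCHAR(64),
  description TEXT
);

CREATE TABLE BadgeAssigned (
  user_id  INT,
  badge_id INT,
  time     DATETIME,
  PRIMARY KEY (user_id, badge_id),           -- each badge awarded once per user
  FOREIGN KEY (badge_id) REFERENCES Badges(badge_id)
);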
Does this structure seem good for badges?
Is the current structure of the rest of my database good, or is it better if it is adjusted?