I don't understand the difference between these two queries:
SELECT event_timestamp, user_pseudo_id, value.double_value as tax
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`, UNNEST(event_params) as event_params
WHERE event_name = "purchase" and event_params.key = "tax"
The other query is:
SELECT event_timestamp, user_pseudo_id,
(SELECT value.double_value FROM UNNEST(event_params) WHERE key = "tax") as tax
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE event_name = "purchase"
The first query returns 5,242 rows and the second 5,692. Where is the mistake?
Thank you!
It depends on what you define as accurate. The reason you are getting a row count mismatch is the way the tax field is being handled. You can see this by running the following query, which lists the discrepancies:
with unnested as (
SELECT event_timestamp, user_pseudo_id, value.double_value as tax
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`, UNNEST(event_params) as event_params
WHERE event_name = "purchase" and event_params.key = "tax"
)
SELECT events.event_timestamp, events.user_pseudo_id,
(SELECT value.double_value FROM UNNEST(event_params) WHERE key = "tax") as tax
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` events
LEFT JOIN unnested un
on events.event_timestamp=un.event_timestamp
and events.user_pseudo_id=un.user_pseudo_id
WHERE events.event_name = "purchase"
and un.event_timestamp is null
;
If you pick out a single record from that list and investigate with the two following queries:
SELECT *
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`, UNNEST(event_params) as event_params
WHERE 1=1
-- and event_name = "purchase" and event_params.key = "tax"
and event_name = "purchase" and event_timestamp=1608955242902332 and user_pseudo_id='43627350.3807676886';
SELECT
*,
(SELECT value.double_value FROM UNNEST(event_params) WHERE key = "tax") as tax
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` events
WHERE events.event_name = "purchase"
and event_timestamp=1608955242902332 and user_pseudo_id='43627350.3807676886'
;
The first query filters out the records that have no tax parameter at all, while the second returns those records with a NULL tax value. If a record should only count when it actually carries a tax value, then 5,242 is the correct number.
Related
I have two tables, one with events and another with tracks, and I want to combine them. In table events I have session_id, event_id, event_timestamp, product_id. In table tracks I have session_id, product_id, action, event_timestamp, track_id. In my final table I want session_id, event_id, event_timestamp, product_id and track_id. It doesn't look complicated, but the problem is that event_timestamp doesn't match between the two tables, and one product_id can have multiple track_id values. I have the following query, but it is causing duplicates.
SELECT
distinct tr.session_id
, tr.product_id::VARCHAR AS product_id
, tr.track_id::VARCHAR AS track_id
, me.action||'.'||me.category AS action_category
, me.event_timestamp
, me.EVENT_ID
FROM session_experiment se
INNER JOIN lyst_analytics.tracks tr --this join will filter all the tracks and mobile events that are in the experiment
ON se.session_id = tr.session_id
INNER JOIN CORE_APP.MOBILE_EVENTS me
ON me.session_id = se.session_id
AND tr.product_id::VARCHAR = me.product_id
AND tr.session_id = me.session_id
WHERE 1=1
AND me.product_id IS NOT NULL
AND COALESCE(DATE(tr.event_timestamp), DATE(me.event_timestamp)) > {{start_date}}
AND COALESCE(DATE(tr.event_timestamp), DATE(me.event_timestamp)) <= {{end_date}}
AND tr.event_timestamp > me.event_timestamp
AND me.action||'.'||me.category
IN ('clicked.add to bag', 'clicked.buy')
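One common way to collapse a one-to-many match like this is to rank the candidate tracks per event and keep only the nearest one, e.g. with ROW_NUMBER(). A minimal sketch with Python's sqlite3 and simplified, hypothetical tables (not the asker's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (event_id INTEGER, session_id TEXT, product_id TEXT, event_timestamp INTEGER);
CREATE TABLE tracks (track_id TEXT, session_id TEXT, product_id TEXT, event_timestamp INTEGER);
INSERT INTO events VALUES (1, 's1', 'p1', 100);
-- two candidate tracks for the same event -> a plain join would duplicate it
INSERT INTO tracks VALUES ('t1', 's1', 'p1', 110), ('t2', 's1', 'p1', 150);
""")

# Rank the tracks that occur after each event by how close they are in time,
# then keep only the closest one (rn = 1).
rows = con.execute("""
    SELECT event_id, track_id FROM (
        SELECT e.event_id, t.track_id,
               ROW_NUMBER() OVER (
                   PARTITION BY e.event_id
                   ORDER BY t.event_timestamp - e.event_timestamp
               ) AS rn
        FROM events e
        JOIN tracks t
          ON t.session_id = e.session_id
         AND t.product_id = e.product_id
         AND t.event_timestamp > e.event_timestamp
    )
    WHERE rn = 1
""").fetchall()
print(rows)  # [(1, 't1')]
```

The same PARTITION BY / ORDER BY idea carries over directly to the Snowflake/Redshift-style dialect in the question (there QUALIFY ROW_NUMBER() ... = 1 is an even shorter spelling, where supported).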
I am using firebase analytics and have my event params in google big query. I did this query:
SELECT * FROM `myProject.analytics_number.events_2021*`, UNNEST(event_params) AS param WHERE event_name = "Read_Free_Article" AND param.value.string_value = "X"
this gives me results like this:
Now I'd like to query multiple things. For example, I'd like avg(timeSpend) to get the average timeSpend value, and count(title) ... GROUP BY title to count the events that share the same title value.
But I don't understand how I can filter by the different event_params.key values. I only managed to filter by event_name and param.value.string_value, which just checks whether any event_params.value.string_value has the desired value.
You can access the individual param values in the SELECT part of the query like this:
SELECT
  (SELECT ep.value.string_value
   FROM UNNEST(event_params) ep
   WHERE ep.key = 'title') AS title,
  (SELECT ep.value.string_value
   FROM UNNEST(event_params) ep
   WHERE ep.key = 'publisher') AS publisher,
  COUNT(1) AS count_titles,
  AVG((SELECT CAST(ep.value.string_value AS NUMERIC)
       FROM UNNEST(event_params) ep
       WHERE ep.key = 'timeSpend')) AS avg_time_spend
FROM `myProject.analytics_number.events_2021*`
WHERE event_name = 'Read_Free_Article'
  AND (SELECT ep.value.string_value
       FROM UNNEST(event_params) ep
       WHERE ep.key = 'publisher') = 'X'
GROUP BY
  title,
  publisher
Note that the WHERE clause cannot reference the publisher alias, so the per-key sub-query is repeated there; the timeSpend lookup is wrapped in AVG() so it aggregates per group.
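The per-key lookup pattern isn't BigQuery-specific. It can be sketched with Python's sqlite3, using a plain key/value child table in place of UNNEST(event_params); all names and data here are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (event_id INTEGER);
CREATE TABLE event_params (event_id INTEGER, key TEXT, value TEXT);
INSERT INTO events VALUES (1), (2), (3);
INSERT INTO event_params VALUES
  (1, 'title', 'A'), (1, 'publisher', 'X'), (1, 'timeSpend', '10'),
  (2, 'title', 'A'), (2, 'publisher', 'X'), (2, 'timeSpend', '20'),
  (3, 'title', 'B'), (3, 'publisher', 'Y'), (3, 'timeSpend', '99');
""")

# One scalar sub-query per key, then group on the extracted values.
rows = con.execute("""
    SELECT
      (SELECT value FROM event_params p
       WHERE p.event_id = e.event_id AND key = 'title') AS title,
      COUNT(*) AS count_titles,
      AVG((SELECT CAST(value AS REAL) FROM event_params p
           WHERE p.event_id = e.event_id AND key = 'timeSpend')) AS avg_time_spend
    FROM events e
    WHERE (SELECT value FROM event_params p
           WHERE p.event_id = e.event_id AND key = 'publisher') = 'X'
    GROUP BY title
    ORDER BY title
""").fetchall()
print(rows)  # [('A', 2, 15.0)]
```

Only events 1 and 2 pass the publisher filter, so title 'A' gets a count of 2 and an average timeSpend of 15.0.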
I tried using a JOIN, but I'm not sure if that is the right/smart way to do it.
I want to know the number of users who did "first_open" and "BA_HOME_SCREEN" in a particular time period.
You provided the following dataset:
Date      event_name  user_id
05052021  first_open  123
25052021  ba_home     435
The goal is to count the occurrences of two different strings in the column event_name in two separated columns for each user_id.
This can be done with the PIVOT operator:
Select *
from (
Select "a" as user_id, "first_open" as event_name
Union all select "b" , "BA_HOME_SCREEN"
Union all select "b" , "first_open"
)
PIVOT(count(1) FOR event_name IN ("first_open", "BA_HOME_SCREEN"))
See the BigQuery PIVOT documentation.
Another approach would be to use a CASE expression (an IF statement would work as well):
select user_id, sum(case when event_name="first_open" then 1 else 0 end) as first_open
from ( ..... )
group by 1
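The CASE variant is portable to almost any SQL dialect. A runnable sketch of the same conditional aggregation using Python's sqlite3 (SQLite has no PIVOT; the data is hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, event_name TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)", [
    ("a", "first_open"),
    ("b", "BA_HOME_SCREEN"),
    ("b", "first_open"),
])

# One SUM(CASE ...) column per event name, one output row per user.
rows = con.execute("""
    SELECT user_id,
           SUM(CASE WHEN event_name = 'first_open' THEN 1 ELSE 0 END) AS first_open,
           SUM(CASE WHEN event_name = 'BA_HOME_SCREEN' THEN 1 ELSE 0 END) AS ba_home_screen
    FROM events
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
print(rows)  # [('a', 1, 0), ('b', 1, 1)]
```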
The following query produced correct results only for the inner query (post_engagement, website purchases), while all other numbers were incorrectly multiplied manyfold. Any ideas? Thanks.
Schema of the two tables:
Favorite_ads (id, campaign_id, campaign_name, objective, impressions, spend)
Actions (id, ads_id, action_type, value)
SELECT
f.campaign_id,
f.campaign_name,
f.objective,
SUM(f.impressions) AS Impressions,
SUM(f.spend) AS Spend,
SUM(a.post_engagement) AS "Post Engagement",
SUM(a.website_purchases) AS "Website Purchases"
FROM
favorite_ads f
LEFT JOIN (
SELECT
ads_id,
CASE WHEN action_type = 'post_engagement' THEN SUM(value) END AS
post_engagement,
CASE WHEN action_type = 'offsite_conversion.fb_pixel_purchase' THEN SUM(value) END AS website_purchases
FROM Actions a
GROUP BY ads_id, action_type
) a ON f.id = a.ads_id
WHERE date_trunc('month', f.date_start) = '2018-04-01 00:00:00' AND
  date_trunc('month', f.date_stop) = '2018-04-01 00:00:00' -- only get campaigns that ran in April, 2018
GROUP BY f.campaign_id, campaign_name, objective
Order by campaign_id
Without knowing the actual table structure, constraints, dependencies and data, it's hard to tell what the issue may be.
You already have some leads in the comments, which you should consider first.
For example, you wrote that this sub-query returns correct results:
SELECT ads_id,
CASE
WHEN action_type = 'post_engagement'
THEN SUM(value)
END AS post_engagement,
CASE
WHEN action_type = 'offsite_conversion.fb_pixel_purchase'
THEN SUM(value)
END AS website_purchases
FROM Actions a
GROUP BY ads_id, action_type
Does this one also give correct results?
SELECT ads_id,
SUM(
CASE
WHEN action_type = 'post_engagement'
THEN value
END
) AS post_engagement,
SUM(
CASE
WHEN action_type = 'offsite_conversion.fb_pixel_purchase'
THEN value
END
) AS website_purchases
FROM Actions
GROUP BY ads_id
If so, then try replacing your sub-query with that one.
If you still have a problem, then I'd investigate whether your join condition is correct: it seems that for a given campaign (campaign_id) you could have multiple entries with the same id, which would multiply the sub-query results. That depends on what the primary key (or unique constraint) of favorite_ads actually is.
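The multiplication effect is easy to demonstrate: grouping the sub-query by both ads_id and action_type leaves several rows per ad, and each one duplicates the ad's impressions in the outer SUM. A minimal sketch with Python's sqlite3 and hypothetical data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE favorite_ads (id INTEGER, impressions INTEGER);
CREATE TABLE actions (ads_id INTEGER, action_type TEXT, value INTEGER);
INSERT INTO favorite_ads VALUES (1, 100);
-- two action rows for the same ad -> the ad row is duplicated by the join
INSERT INTO actions VALUES
  (1, 'post_engagement', 5),
  (1, 'offsite_conversion.fb_pixel_purchase', 2);
""")

# Joining at the action grain: the single ad matches two action rows,
# so its impressions are counted twice.
fanout = con.execute("""
    SELECT SUM(f.impressions) FROM favorite_ads f
    LEFT JOIN actions a ON a.ads_id = f.id
""").fetchone()[0]

# Collapsing the sub-query to one row per ads_id removes the fan-out.
collapsed = con.execute("""
    SELECT SUM(f.impressions) FROM favorite_ads f
    LEFT JOIN (SELECT ads_id FROM actions GROUP BY ads_id) a ON a.ads_id = f.id
""").fetchone()[0]

print(fanout)     # 200 -- impressions counted once per action row
print(collapsed)  # 100 -- one row per ad, correct total
```

This is exactly why moving the CASE inside the SUM and grouping by ads_id alone (as in the suggested rewrite) fixes the inflated numbers.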
I have 2 integer fields in a table "users": leg_count and leg_length. The first stores the number of legs a user has, and the second their total length.
Each leg that belongs to a user is stored in a separate table, since a typical internet user can have anywhere from zero to infinitely many legs:
CREATE TABLE legs (
user_id int not null,
length int not null
);
I want to recalculate the statistics for all users in one query, so I try:
UPDATE users SET
leg_count = subquery.count, leg_length = subquery.length
FROM (
SELECT COUNT(*) as count, SUM(length) as length FROM legs WHERE legs.user_id = users.id
) AS subquery;
and get "subquery in FROM cannot refer to other relations of same query level" error.
So I have to do
UPDATE users SET
leg_count = (SELECT COUNT(*) FROM legs WHERE legs.user_id = users.id),
leg_length = (SELECT SUM(length) FROM legs WHERE legs.user_id = users.id)
which makes the database perform two SELECTs for each row, although the required data could be calculated in one SELECT:
SELECT COUNT(*), SUM(length) FROM legs;
Is it possible to optimize my UPDATE query to use only one SELECT subquery?
I use PostgreSQL, but I believe the solution exists for any SQL dialect.
TIA.
I would do:
WITH stats AS
( SELECT COUNT(*) AS cnt
, SUM(length) AS totlength
, user_id
FROM legs
GROUP BY user_id
)
UPDATE users
SET leg_count = cnt, leg_length = totlength
FROM stats
WHERE stats.user_id = users.id
You could use PostgreSQL's extended update syntax:
update users as u
set leg_count = aggr.cnt
, leg_length = aggr.length
from (
select legs.user_id
, count(*) as cnt
, sum(length) as length
from legs
group by
legs.user_id
) as aggr
where u.id = aggr.user_id
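SQLite 3.33+ supports the same UPDATE ... FROM syntax, so the pattern can be sketched end to end with Python's sqlite3 (assuming a reasonably recent SQLite build):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INTEGER, leg_count INTEGER, leg_length INTEGER);
CREATE TABLE legs (user_id INTEGER NOT NULL, length INTEGER NOT NULL);
INSERT INTO users VALUES (1, 0, 0);
INSERT INTO legs VALUES (1, 40), (1, 42);
""")

# One aggregated pass over legs, joined back to users by id.
con.execute("""
    UPDATE users
    SET leg_count = aggr.cnt, leg_length = aggr.total
    FROM (SELECT user_id, COUNT(*) AS cnt, SUM(length) AS total
          FROM legs GROUP BY user_id) AS aggr
    WHERE users.id = aggr.user_id
""")

row = con.execute("SELECT leg_count, leg_length FROM users WHERE id = 1").fetchone()
print(row)  # (2, 82)
```

Note that both answers share one caveat: a user with no legs at all gets no match in the aggregate and is left untouched, so the counters are not reset to zero for such users.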