How can I understand the logic behind the Views metric in Google Looker Studio?

I am trying to replicate a basic Looker Studio report in Google BigQuery.
Looker Studio is connected to Google Analytics 4 and pulls the correct data.
[Screenshot: the Views metric in the Looker Studio report]
I tried looking at the following in BQ:
WITH base_table AS (
  SELECT
    traffic_source.source AS source_text,
    CAST((SELECT value.string_value
          FROM UNNEST(event_params)
          WHERE key = 'page_location') AS STRING) AS page_location,
    COUNT(*) AS event_name_by_page_count
  FROM
    -my_table-
  WHERE 1=1
    AND event_name IN ('page_view')
  GROUP BY 1, 2
  ORDER BY 1, 3 DESC
)
SELECT
  source_text,
  SUM(base_table.event_name_by_page_count) AS page_views
FROM base_table
WHERE 1=1
  AND source_text IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
I still can't get close to the metric in the Looker Studio report. For context, over the same date range, my Looker report shows 40k views while my query shows 13k.
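One likely cause of a gap this large, offered as a hedged guess since the report definition isn't shown: the GA4 "Views" metric that Looker Studio surfaces counts both page_view and screen_view events, while the query above counts page_view only, so any traffic that logs screen_view is dropped. A minimal sketch counting both event types from the BigQuery export (the table placeholder is kept from the question):

-- Sketch: GA4 "Views" = page_view + screen_view in the standard reports.
-- Note that traffic_source.source in the export is the user's acquisition
-- source, which may not match a session-scoped source dimension in the report.
SELECT
  traffic_source.source AS source_text,
  COUNT(*) AS views
FROM
  -my_table-  -- placeholder, as in the question
WHERE
  event_name IN ('page_view', 'screen_view')
GROUP BY 1
ORDER BY 2 DESC

If the totals still differ, comparing unfiltered per-day event counts between the export and the report usually narrows down whether the gap comes from the date range, a filter, or the metric definition.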

Related

How can I join 4 large Salesforce tables in BigQuery with SQL?

What I want to do is join each table on the key field sendID, which is not unique (I use GROUP BY to deduplicate it).
FYI: these tables only contain data from the year 2022.
I have 4 tables from Salesforce.
Table 1: salesforce_sent (emails sent to different destinations; the largest table)
Table 2: salesforce_open (destinations which opened the email; also a pretty large table)
Table 3: salesforce_clicks (destinations which opened the email and clicked the link to a website)
Table 4: salesforce_sendjobs (helps to link the information from Salesforce and Google Analytics)
Each table has different columns. I already tried LEFT JOIN and INNER JOIN in my queries, but the runtime is insane (I've waited up to 2-3 hours and then cancelled the run).
What I tried is this (I guess an INNER JOIN is better than a LEFT JOIN because it's less heavy):
WITH SENT AS (
  SELECT
    sendid,
    EXTRACT(DATE FROM eventdate) AS sent_date,
    LOWER(emailaddress) AS emailaddress,
    COUNT(*) AS sent
  FROM salesforce_sent
  GROUP BY 1, 2, 3
),
CLICKS AS (
  SELECT
    sendid,
    EXTRACT(DATE FROM eventdate) AS click_date,
    url,
    REGEXP_EXTRACT(url, r'utm_source=([^&]+)') AS source,
    REGEXP_EXTRACT(url, r'utm_medium=([^&]+)') AS medium,
    REGEXP_EXTRACT(url, r'utm_campaign=([^&]+)') AS campaign,
    REGEXP_EXTRACT(url, r'utm_content=([^&]+)') AS ad_content,
    isunique AS isunique_click,
    COUNT(*) AS clicks
  FROM salesforce_clicks
  GROUP BY 1, 2, 3, 4, 5, 6, 7, 8
),
OPEN AS (
  SELECT
    sendid,
    EXTRACT(DATE FROM eventdate) AS open_date,
    isunique AS isunique_open,
    COUNT(*) AS open
  FROM salesforce_opens
  GROUP BY 1, 2, 3
),
SENDJOBS AS (
  SELECT
    sendid,
    EXTRACT(DATE FROM senttime) AS sent_date,
    LOWER(emailname) AS emailname,
    LOWER(SPLIT(emailname, '-')[SAFE_OFFSET(1)]) AS pos
  FROM salesforce_sendjobs
  GROUP BY 1, 2, 3, 4
)
SELECT
  a.sendid,
  a.sent_date,
  c.open_date,
  d.click_date,
  a.emailaddress,
  b.emailname,
  d.url,
  b.pos,
  d.source,
  d.medium,
  d.campaign,
  d.ad_content,
  SUM(a.sent) AS sent,
  SUM(c.open) AS open,
  SUM(d.clicks) AS clicks
FROM SENT a
INNER JOIN SENDJOBS b ON a.sendid = b.sendid
INNER JOIN OPEN c ON a.sendid = c.sendid
INNER JOIN CLICKS d ON a.sendid = d.sendid
WHERE 1=1
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
I also tried UNION ALL, but it's not what I'm looking for, because you lose the complete row whenever a sendID exists in table A but not in table B.
What I want is a merge of all 4 tables using sendID as the key. It would be really nice to make the query lighter somehow. It processes 100 GB per run (which is not that much).
I can also see that the compute problem is in the join.
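A likely explanation, hedged since the full schemas aren't shown: sendid is not unique in any of the four CTEs (each still carries per-date, per-address, or per-URL detail), so chaining three joins on sendid alone multiplies rows many-to-many. If one sendid has 1,000 sent rows, 500 open rows, and 200 click rows, the join materializes 1,000 x 500 x 200 intermediate rows before the final GROUP BY collapses them, which would explain hours of runtime on only 100 GB of input. The usual fix is to aggregate each table down to exactly one row per sendid first and only then join. A minimal sketch under that assumption (table and column names as in the question; the per-address, per-URL, and per-date detail is deliberately dropped so the key becomes unique):

WITH sent AS (
  SELECT sendid, MIN(EXTRACT(DATE FROM eventdate)) AS sent_date, COUNT(*) AS sent
  FROM salesforce_sent
  GROUP BY sendid                      -- exactly one row per sendid
),
opens AS (
  SELECT sendid, COUNT(*) AS opens
  FROM salesforce_opens
  GROUP BY sendid
),
clicks AS (
  SELECT sendid, COUNT(*) AS clicks
  FROM salesforce_clicks
  GROUP BY sendid
),
sendjobs AS (
  SELECT sendid, ANY_VALUE(LOWER(emailname)) AS emailname
  FROM salesforce_sendjobs
  GROUP BY sendid
)
SELECT
  s.sendid,
  s.sent_date,
  j.emailname,
  s.sent,
  o.opens,
  c.clicks
FROM sent s
LEFT JOIN sendjobs j USING (sendid)    -- all joins are now 1:1, so no fan-out
LEFT JOIN opens o USING (sendid)
LEFT JOIN clicks c USING (sendid)

LEFT JOIN (rather than INNER) keeps sends that had zero opens or clicks, and the counts stay correct because each side now contributes at most one row per sendid. Every dimension you keep in a CTE's GROUP BY (date, email address, URL) multiplies the join again, so add dimensions back one at a time while watching the bytes shuffled.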

Unrecognized name when joining 2 tables in Google BigQuery

I want to join two tables from different datasets. It is possible to INNER JOIN these two datasets but it does not work with a regular JOIN.
I want to join a Google Analytics 4 (GA4) item id on the item id of the datawarehouse.
In order to access the GA4 item id I need to UNNEST the GA4 items array.
When using the code below, I get the following error:
Unrecognized name: dwh_id; Did you mean dwh? at [9:79]
Here's the query I'm using now.
SELECT
  event_date AS ga4_date,
  ga4_items.item_id AS ga4_id,
  ga4_items.item_name,
  ga4_items.price,
  dwh.Product_SKU__Google_Analytics AS dwh_id
FROM `ga4-data` AS ga4
JOIN `datawarehouse-data` AS dwh ON dwh_id = ga4_id,
UNNEST(ga4.items) AS ga4_items
Let me know if you have the answer :)
The error happens because dwh_id and ga4_id are aliases defined in the SELECT list, and SELECT aliases cannot be referenced inside a JOIN ... ON condition; you have to repeat the underlying column expressions there. My best guess of what you're trying to do:
CREATE TEMP TABLE `ga4-data` AS
SELECT '2022-01-01' AS event_date,
[STRUCT('item001' AS item_id, 'name1' AS item_name, 100 AS price),
STRUCT('item002' AS item_id, 'name2' AS item_name, 200 AS price)] AS items
;
CREATE TEMP TABLE `datawarehouse-data` AS
SELECT 'item001' AS Product_SKU__Google_Analytics,
'col1' AS col1;
SELECT event_date as ga4_date,
ga4_items.item_id AS ga4_id,
ga4_items.item_name,
ga4_items.price,
dwh.Product_SKU__Google_Analytics as dwh_id
FROM `ga4-data` as ga4, UNNEST(ga4.items) as ga4_items
JOIN `datawarehouse-data` as dwh
ON dwh.Product_SKU__Google_Analytics = ga4_items.item_id;
Alright, I figured it out. It took a lot of trial and error but I got it:
WITH ga AS (
  SELECT event_date AS ga4_date,
         ga4_items.item_id AS id,
         ga4_items.item_name,
         ga4_items.price
  FROM `name-ga4-dataset` AS ga4, UNNEST(ga4.items) AS ga4_items
),
dwh AS (
  SELECT Product_SKU__Google_Analytics AS dwh_id
  FROM `name-dwh-dataset` AS dwh
)
SELECT *
FROM ga
JOIN dwh ON ga.id = dwh_id

Get apps with the highest review count since a dynamic series of days

I have two tables, apps and reviews (simplified for the sake of discussion):
apps table:
  id int

reviews table:
  id int
  review_date date
  app_id int (foreign key that points to apps)
2 questions:
1. How can I write a query / function to answer the following question?:
Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D? For example, for D = 2020-01-01, only apps whose first review arrived on or after 2020-01-01 compete, and among those we pick whichever has the highest total review count.
I think I know how to write a query if given an explicit date:
SELECT
  apps.id,
  count(reviews.*)
FROM reviews
INNER JOIN apps ON apps.id = reviews.app_id
GROUP BY 1
HAVING min(reviews.review_date) >= '2020-01-01'
ORDER BY 2 DESC
LIMIT 10;
But I don't know how to query this dynamically given the desired date series and compile all this information in a single view.
2. What's the best way to model this data?
It would be nice to have the number of reviews at the time for each date, as well as the app_id. As of now I'm thinking of something that might look like:
... 2020-01-01_app_id | 2020-01-01_review_count | 2020-01-02_app_id | 2020-01-02_review_count ...
But I'm wondering if there's a better way to do this. Stitching the data together also seems like a challenge.
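As an aside on question 2 (hedged, since the answers below concentrate on question 1): a wide layout with a new pair of columns per date has to be altered every time the date range grows and is hard to query. The conventional relational shape is a tall table with one row per date and app, which is also exactly what the queries below emit. A sketch with illustrative table and column names:

-- One row per (window start, app) instead of a new column pair per date.
CREATE TABLE daily_top_apps (
  review_window_start date NOT NULL,
  app_id              int  NOT NULL REFERENCES apps (id),
  review_count        int  NOT NULL,
  PRIMARY KEY (review_window_start, app_id)
);

A row like (2020-01-01, 42, 137) records that app 42 led with 137 reviews for the window starting 2020-01-01; pivoting to a wide layout, if ever needed, is better done at presentation time.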
I think this is what you are looking for:
Postgres 13 or newer
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT app_id, total_ct
FROM cte c
WHERE c.earliest_review >= d.review_window_start
ORDER BY total_ct DESC
FETCH FIRST 1 ROWS WITH TIES -- new & hot
) sub
GROUP BY 1
) a ON true;
WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:
Get top row(s) with highest value, with ties
Postgres 12 or older
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT total_ct, app_id
, rank() OVER (ORDER BY total_ct DESC) AS rnk
FROM cte c
WHERE c.earliest_review >= d.review_window_start
) sub
WHERE rnk = 1
GROUP BY 1
) a ON true;
db<>fiddle here
Same as above, but without WITH TIES.
We don't need to involve the table apps at all. The table reviews has all information we need.
The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:
How to force evaluation of subquery before joining / pushing down to foreign server
The optimized generate_series() call produces the series of days from earliest to latest review. See:
Generating time series between two dates in PostgreSQL
Join a count query on generate_series() and retrieve Null values as '0'
Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
If you are looking for hints, then here are a few:
1. Are you aware of generate_series() and how to use it to compose a table of dates given a start and end date? If not, then there are plenty of examples on this site.
2. To answer this question for any given date, you need only two measures for each app, and only one of these is used to compare an app against other apps. Your query in part 1 shows that you know what these two measures are.
Hints 1 and 2 should be enough to get this done. The only thing I can add is for you not to worry about making the database do "too much work." That is what it is there to do. If it does not do it quickly enough, then you can think about optimizations, but before you get to that step, concentrate on getting the answer that you want.
Please comment if you need further clarification on this.
The missing piece for me was lateral join.
I can accomplish just about what I want using the following:
select
  review_windows.review_window_start,
  id,
  review_total,
  earliest_review
from (
  select date_trunc('day', review_windows.review_windows)::date as review_window_start
  from generate_series(
         (select min(reviews.review_date) from reviews),
         (select max(reviews.review_date) from reviews),
         '1 year'
       ) review_windows
  order by 1 desc
) review_windows
left join lateral (
  select
    apps.id,
    count(reviews.*) as review_total,
    min(reviews.review_date) as earliest_review
  from reviews
  inner join apps on apps.id = reviews.app_id
  where reviews.review_date >= review_windows.review_window_start
  group by 1
  having min(reviews.review_date) >= review_windows.review_window_start
  order by 2 desc, 3 desc
  limit 2
) apps_most_reviews on true;

Calculating cohort data in Firebase -> BigQuery, but I want to separate them by tracking source. Grouping won't work?

I'm trying to calculate the quality of users with cohort data in BigQuery.
My current query is:
WITH analytics_data AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    event_name,
    app_info.id,
    geo.country AS country,
    platform,
    app_info.id AS bundle_id,
    UNIX_MICROS(TIMESTAMP("2019-12-05 00:00:00")) AS start_day,
    3600 * 1000 * 1000 * 24 AS one_day_micros
  FROM `table.events_*`
  WHERE _table_suffix BETWEEN "20191205" AND "20191218"
)
SELECT day_7_cohort / day_0_cohort AS seven_day_conversion
FROM (
  WITH day_7_users AS (
    SELECT DISTINCT user_pseudo_id
    FROM analytics_data
    WHERE event_name = 'watched_20_ads'
      AND event_timestamp BETWEEN start_day AND start_day + (12 * one_day_micros)
  ),
  day_0_users AS (
    SELECT DISTINCT user_pseudo_id
    FROM analytics_data
    WHERE event_name = "first_open"
      AND bundle_id = "com.bundle.id"
      AND country = "United States"
      AND platform = "ANDROID"
      AND event_timestamp BETWEEN start_day AND start_day + (1 * one_day_micros)
  )
  SELECT
    (SELECT COUNT(*) FROM day_0_users) AS day_0_cohort,
    (SELECT COUNT(*)
     FROM day_7_users
     JOIN day_0_users USING (user_pseudo_id)) AS day_7_cohort
)
The problem is that I'm unable to separate the users by tracking source. I want to separate the users by tracking source and country.
[Screenshots were attached here of the current single-value output and of the desired layouts, broken out by source and country.]
I'm not sure if it's possible to write a query that would return the data in a single table, without involving more queries and data storage elsewhere.
So your question is missing some data/fields, but I will provide a 'general' solution.
with data as (
  -- Select the fields you need to define criteria and cohorts
),
cohort_info as (
  -- Cohort logic (might be more complicated than this)
  select user_id, source, country -- , etc.
  from data
  group by 1, 2, 3
),
day_0_users as (
  -- Logic to determine who you are measuring for your calculation
),
day_7_users as (
  -- Logic to determine who qualifies as a 7 day user for your calculation
),
joined as (
  -- Join your CTEs together
  select
    cohort_info.source,
    cohort_info.country,
    count(distinct day_0_users.user_id) as day_0_count,
    count(distinct day_7_users.user_id) as day_7_count
  from day_0_users
  left join day_7_users using(user_id)
  inner join cohort_info using(user_id)
  group by 1, 2
)
select *, day_7_count / day_0_count as seven_day_conversion
from joined
I think using several CTEs in this manner will make your code more readable and will let you track your logic a bit better. Nested subqueries tend to get ugly.
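To make the skeleton concrete, here is a minimal sketch filled in with the GA4/Firebase export fields visible in the question. traffic_source.source and geo.country are standard export columns, the table name and event names are copied from the question, the timestamp-window filters are omitted for brevity, and picking one source/country per user via ANY_VALUE is an assumption:

WITH data AS (
  SELECT
    user_pseudo_id,
    event_name,
    traffic_source.source AS source,  -- user-level acquisition source
    geo.country AS country
  FROM `table.events_*`
  WHERE _table_suffix BETWEEN "20191205" AND "20191218"
),
cohort_info AS (
  -- one row per user with their source and country
  SELECT user_pseudo_id, ANY_VALUE(source) AS source, ANY_VALUE(country) AS country
  FROM data
  GROUP BY 1
),
day_0_users AS (
  SELECT DISTINCT user_pseudo_id FROM data WHERE event_name = 'first_open'
),
day_7_users AS (
  SELECT DISTINCT user_pseudo_id FROM data WHERE event_name = 'watched_20_ads'
)
SELECT
  c.source,
  c.country,
  COUNT(DISTINCT d0.user_pseudo_id) AS day_0_count,
  COUNT(DISTINCT d7.user_pseudo_id) AS day_7_count,
  SAFE_DIVIDE(COUNT(DISTINCT d7.user_pseudo_id),
              COUNT(DISTINCT d0.user_pseudo_id)) AS seven_day_conversion
FROM day_0_users d0
LEFT JOIN day_7_users d7 ON d7.user_pseudo_id = d0.user_pseudo_id
JOIN cohort_info c ON c.user_pseudo_id = d0.user_pseudo_id
GROUP BY 1, 2

SAFE_DIVIDE returns NULL instead of raising an error for any (source, country) group that happens to have no day-0 users.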

Is there a way to calculate the average number of times an event happens when all data is stored as string?

I am working in BigQuery and using SQL to calculate the average number of ads viewed per user based on their engagement level (levels range from 1 to 5). I previously calculated the average number of days users were active based on their engagement level, but when I calculate the average number of ads viewed based on engagement level, the query fails. My guess is that the ads-viewed value is stored as a string.
Is there a way to average the number of times 'ad viewed' occurs in a list of events, based on engagement?
I tried changing the original code I used to extract 'Average Days' so that it extracts 'Ads Viewed' instead, but that does not work.
I tried average(count(if(ads.viewed,1,0))), but that won't work either. I can't figure out what I am doing wrong.
I also checked this post (SQL average of string values), but it doesn't seem to apply.
SELECT
engagement_level,
COUNT(event="ADSVIEWED") AS AverageAds
I have also tried:
SELECT
engagement_level,
AVG(IF(event="ADSVIEWED",1,0)) AS AverageAds
But that doesn't work either.
It should output a table of each engagement level with the corresponding average. For 'Average Days' it worked out to be Engagement Level: Average Days (1: 2.45, 2: 3.21, 3: 4.67, etc.), but it doesn't work for the ads_viewed event.
If I understand correctly, you can do this without a subquery:
SELECT engagement_level,
COUNTIF(event = 'ADSVIEWED') / COUNT(DISTINCT user_id) as avg_per_user
FROM t
GROUP BY engagement_level;
This counts the number of events and divides by the number of users. If you only want to count users who have the event:
SELECT engagement_level,
COUNT(*) / COUNT(DISTINCT user_id) as avg_per_user
FROM t
WHERE event = 'ADSVIEWED'
GROUP BY engagement_level;
... to calculate the average number of ads viewed per user based on their engagement level ...
Below is for BigQuery Standard SQL
#standardSQL
SELECT engagement_level, AVG(Ads) AverageAds FROM (
SELECT engagement_level, user_id, COUNTIF(event = 'ADSVIEWED') Ads
FROM `project.dataset.table`
GROUP BY engagement_level, user_id
)
GROUP BY engagement_level
You can test and play with the above using dummy data, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 user_id, 1 engagement_level, 'ADSVIEWED' event UNION ALL
SELECT 1, 1, 'a' UNION ALL
SELECT 1, 1, 'ADSVIEWED' UNION ALL
SELECT 2, 1, 'b' UNION ALL
SELECT 2, 1, 'ADSVIEWED'
)
SELECT engagement_level, AVG(Ads) AverageAds FROM (
SELECT engagement_level, user_id, COUNTIF(event = 'ADSVIEWED') Ads
FROM `project.dataset.table`
GROUP BY engagement_level, user_id
)
GROUP BY engagement_level
with this result:

Row  engagement_level  AverageAds
1    1                 1.5
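One hedged addition, since the question title mentions data stored as strings: if the per-user ad count ever arrives as a string column rather than as countable event rows, SAFE_CAST converts it before averaging and yields NULL instead of an error for malformed values. The column and table names here are illustrative, not from the question:

#standardSQL
SELECT
  engagement_level,
  AVG(SAFE_CAST(ads_viewed AS INT64)) AS AverageAds  -- NULL for non-numeric strings
FROM `project.dataset.table`
GROUP BY engagement_level

AVG ignores NULLs, so rows whose string fails to parse simply drop out of the average rather than failing the whole query.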