How to join two consecutive events? - sql

I have two tables, one with events and another with tracks, and I want to combine them. The events table has session_id, event_id, event_timestamp, and product_id. The tracks table has session_id, product_id, action, event_timestamp, and track_id. In my final table I want session_id, event_id, event_timestamp, product_id, and track_id. It doesn't look complicated, but the problem is that event_timestamp doesn't match between the two tables, and one product_id can have multiple track_ids. I have the following query, but it is producing duplicates.
SELECT
distinct tr.session_id
, tr.product_id::VARCHAR AS product_id
, tr.track_id::VARCHAR AS track_id
, me.action||'.'||me.category AS action_category
, me.event_timestamp
, me.EVENT_ID
FROM session_experiment se
INNER JOIN lyst_analytics.tracks tr --this join will filter all the tracks and mobile events that are in the experiment
ON se.session_id = tr.session_id
INNER JOIN CORE_APP.MOBILE_EVENTS me
ON me.session_id = se.session_id
AND tr.product_id::VARCHAR = me.product_id
AND tr.session_id = me.session_id
WHERE 1=1
AND me.product_id IS NOT NULL
AND COALESCE(DATE(tr.event_timestamp), DATE(me.event_timestamp)) > {{start_date}}
AND COALESCE(DATE(tr.event_timestamp), DATE(me.event_timestamp)) <= {{end_date}}
AND tr.event_timestamp > me.event_timestamp
AND me.action||'.'||me.category
IN ('clicked.add to bag', 'clicked.buy')
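A common way to get rid of the duplicates (a sketch, not part of the original question) is to rank the candidate matches with a window function and keep a single row per track, e.g. the latest qualifying mobile event before each track. The sketch below reuses the tables and columns from the query above, drops the date filters for brevity, and assumes the engine supports window functions; whether to deduplicate per track or per event is a modelling choice.
WITH matched AS (
    SELECT
        tr.session_id,
        tr.product_id::VARCHAR          AS product_id,
        tr.track_id::VARCHAR            AS track_id,
        me.action || '.' || me.category AS action_category,
        me.event_timestamp,
        me.event_id,
        -- rank candidate events per track; the most recent event before the track wins
        ROW_NUMBER() OVER (
            PARTITION BY tr.track_id
            ORDER BY me.event_timestamp DESC
        ) AS rn
    FROM session_experiment se
    JOIN lyst_analytics.tracks tr
      ON se.session_id = tr.session_id
    JOIN CORE_APP.MOBILE_EVENTS me
      ON me.session_id = tr.session_id
     AND me.product_id = tr.product_id::VARCHAR
    WHERE me.product_id IS NOT NULL
      AND tr.event_timestamp > me.event_timestamp
      AND me.action || '.' || me.category IN ('clicked.add to bag', 'clicked.buy')
)
SELECT session_id, product_id, track_id, action_category, event_timestamp, event_id
FROM matched
WHERE rn = 1  -- one row per track_id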

Related

Trying to join multiple tables that don't all share the same pair of common columns, so values from the last table are repeating. Need help to solve this

I have the following query; the result is in the linked image. The `adjust` CTE being joined has no platform or date column, so its records are repeated. Is there a way to avoid this? It will cause issues in visualizations at the campaign level when the repeated items are summed.
with
sent as (
select campaign_name, date(date) as date, platform, count(id) as sent
from send
group by 1,2,3
),
bounce as (
select campaign_name, platform, count(id) as bounce
from bounce
group by 1,2
),
open as (
select campaign_name, platform, count(id) as clicks
from open
group by 1,2
),
adjust as (
select campaign, sum(purchase_events) as transactions, count(distinct adjust_id) as sessions, sum(sessions) as s2, sum(clicks) as ad_clicks
from adjust
group by 1
)
select
s.campaign_name,
s.date,
s.platform,
s.sent,
(s.sent-b.bounce) as delivered,
b.bounce,
o.clicks,
a.ad_clicks,
a.sessions,
a.s2,
a.transactions
from sent s
join bounce b on s.campaign_name = b.campaign_name and s.platform = b.platform
join open o on s.campaign_name = o.campaign_name and s.platform = o.platform
left join adjust a on s.campaign_name = a.campaign
See the result here
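The repetition happens because the adjust CTE only exists at campaign grain, so its numbers are copied onto every (date, platform) row of that campaign and then summed again at campaign level. One workaround (a sketch, not from the original post, reusing the CTEs above) is to attach the campaign-level adjust metrics to exactly one row per campaign; the remaining rows carry NULLs for those columns, so campaign-level sums stay correct.
-- keep the sent / bounce / open / adjust CTEs exactly as above, then add:
, ranked as (
select s.*,
row_number() over (partition by s.campaign_name order by s.date, s.platform) as rn
from sent s
)
select
r.campaign_name,
r.date,
r.platform,
r.sent,
(r.sent - b.bounce) as delivered,
b.bounce,
o.clicks,
case when r.rn = 1 then a.ad_clicks end as ad_clicks,
case when r.rn = 1 then a.sessions end as sessions,
case when r.rn = 1 then a.s2 end as s2,
case when r.rn = 1 then a.transactions end as transactions
from ranked r
join bounce b on r.campaign_name = b.campaign_name and r.platform = b.platform
join open o on r.campaign_name = o.campaign_name and r.platform = o.platform
left join adjust a on r.campaign_name = a.campaign
Alternatively, the adjust metrics can be reported in a separate campaign-level query so they are never mixed with the date/platform grain at all.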

How could I join 4 large Salesforce tables in BigQuery? SQL

What I want to do is join each table on the key field sendID, which is not unique (I use GROUP BY to make it unique).
FYI: These tables just have information from year 2022.
I have 4 tables from SalesForce.
Table 1: salesforce_sent (emails sent to different destinations and the largest table)
Table 2: salesforce_open (destinations that opened the email; also a pretty large table)
Table 3: salesforce_clicks (destinations that opened the email and clicked the link to a website)
Table 4: salesforce_sendjobs (helps link the Salesforce information to Google Analytics)
Each table has different columns. I already tried using LEFT JOIN and INNER JOIN in my queries, but the query run time is insane (I've waited up to 2-3 hours and then cancelled the run).
What I tried is this (I guess an INNER JOIN is better than a LEFT JOIN because it's less heavy):
WITH SENT AS (
SELECT
sendid as sendid,
EXTRACT(date FROM eventdate) as sent_date,
lower(emailaddress) as emailaddress,
COUNT(*) as sent,
FROM salesforce_sent
group by 1,2,3
),
CLICKS AS (
SELECT
sendid as sendid,
EXTRACT(DATE from eventdate) as click_date,
url as url,
regexp_extract(url, r'utm_source=([^&]+)') as source,
regexp_extract(url, r'utm_medium=([^&]+)') as medium,
regexp_extract(url, r'utm_campaign=([^&]+)') as campaign,
regexp_extract(url, r'utm_content=([^&]+)') as ad_content,
isunique as isunique_click,
COUNT(*) as clicks,
FROM salesforce_clicks
group by 1,2,3,4,5,6,7,8
),
OPEN AS (
SELECT
sendid as sendid,
EXTRACT(date FROM eventdate) as open_date,
isunique as isunique_open,
COUNT(*) as open
FROM salesforce_opens
group by 1,2,3
),
SENDJOBS AS (
SELECT
sendid as sendid,
EXTRACT(date FROM senttime) as sent_date,
LOWER(emailname) as emailname,
LOWER(SPLIT(emailname, '-')[SAFE_OFFSET(1)]) AS pos,
FROM salesforce_sendjobs
group by 1,2,3,4
)
SELECT
a.sendid as sendid,
a.sent_date,
c.open_date,
d.click_date,
a.emailaddress,
b.emailname,
d.url as url,
b.pos,
d.source,
d.medium,
d.campaign,
d.ad_content,
sum(a.sent) as sent,
sum(c.open) as open,
sum(d.clicks) as clicks
FROM SENT a
INNER JOIN SENDJOBS b ON a.sendid = b.sendid
INNER JOIN OPEN c ON a.sendid = c.sendid
INNER JOIN CLICKS d ON a.sendid = d.sendid
WHERE 1=1
GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12
I also tried using UNION ALL, but it's not what I'm looking for, because you won't get a complete row for information that exists in table A but not in table B.
What I want is a merge of all these 4 tables using sendID as the key. It would be really nice to make the query lighter somehow. The query processes 100 GB when run (which is not that much).
I can also see that the compute problem is in the join.
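The explosion comes from joining several CTEs that each have many rows per sendid (per email address, per URL, per isunique flag), which multiplies out to a near cross join for every sendid. A sketch of one way around it (not the asker's code, and it deliberately drops the per-address and per-URL detail): collapse every source to exactly one row per sendid first, so each join is 1:1 and cannot multiply rows. If the finer detail is really needed, each pair of tables has to be joined on a shared finer grain instead.
WITH sent AS (
  SELECT sendid,
         MIN(EXTRACT(date FROM eventdate)) AS sent_date,
         COUNT(*) AS sent
  FROM salesforce_sent
  GROUP BY sendid
),
opens AS (
  SELECT sendid, COUNT(*) AS opens
  FROM salesforce_opens
  GROUP BY sendid
),
clicks AS (
  SELECT sendid, COUNT(*) AS clicks
  FROM salesforce_clicks
  GROUP BY sendid
),
sendjobs AS (
  SELECT sendid,
         ANY_VALUE(LOWER(emailname)) AS emailname,
         ANY_VALUE(LOWER(SPLIT(emailname, '-')[SAFE_OFFSET(1)])) AS pos
  FROM salesforce_sendjobs
  GROUP BY sendid
)
SELECT s.sendid, s.sent_date, j.emailname, j.pos,
       s.sent, o.opens, c.clicks
FROM sent s
LEFT JOIN sendjobs j ON j.sendid = s.sendid
LEFT JOIN opens o ON o.sendid = s.sendid
LEFT JOIN clicks c ON c.sendid = s.sendid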

Select other table as a column based on datetime in BigQuery [duplicate]

This question already has an answer here:
Full outer join and Group By in BigQuery
(1 answer)
Closed 5 months ago.
I have two related tables, but I want to group them based on time. Here are the tables:
I want to select a receipt as a column based on published_at; it must be between pickup_time and drop_time, so I will get this result:
I tried with a JOIN, but it seems to only select rows where drop_time is NULL:
SELECT
t.source_id AS source_id,
t.pickup_time AS pickup_time,
t.drop_time AS drop_time,
ARRAY_AGG(STRUCT(r.source_id, r.receipt_id, r.published_at) ORDER BY r.published_at LIMIT 1)[SAFE_OFFSET(0)] AS receipt
FROM `my-project-gcp.data_source.trips` AS t
JOIN `my-project-gcp.data_source.receipts` AS r
ON
t.source_id = r.source_id
AND
r.published_at >= t.pickup_time
AND (
r.published_at <= t.drop_time
OR t.drop_time IS NULL
)
GROUP BY source_id, pickup_time, drop_time
and I tried with a sub-query, but got:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
SELECT
t.source_id AS source_id,
t.pickup_time AS pickup_time,
t.drop_time AS drop_time,
ARRAY_AGG((
SELECT
STRUCT(r.source_id, r.receipt_id, r.published_at)
FROM `my-project-gcp.data_source.receipts` as r
WHERE
t.source_id = r.source_id
AND
r.published_at >= t.pickup_time
AND (
r.published_at <= t.drop_time
OR t.drop_time IS NULL
)
LIMIT 1
))[SAFE_OFFSET(0)] AS receipt
FROM `my-project-gcp.data_source.trips` as t
GROUP BY source_id, pickup_time, drop_time
Each source_id is a car and only one driver can drive a car at once.
We can therefore partition by that column.
Your approach works for small tables. Since there is no unique join key, the cross join fails on large tables.
I present here a solution using UNION ALL and a look-back technique. It is quite fast and works up to medium-sized tables in the range of a few GB. It avoids the cross join, but it is a fairly long script.
The trips table lists all drives by the drivers. The receipts table lists all fines.
We need a unique row identifier for each trip to join on later; we use the row number for this, see the trips_with_rowid table.
The table summery_tmp unions three tables. First we load the trips table and add an empty column for the fines. Then we load the trips table again to mark the times when no one was driving the car. Finally, we add the receipts table so that only the source_id, pickup_time and fine columns are filled.
The summery table sorts this by pickup_time within each source_id, so the fine entries sit under the entry of the driver who took the car. For the fine entries, the column row_id_new is filled with the row_id of that driver's trip.
Grouping by row_id_new and filtering out unneeded entries does the job.
I changed the seconds of the entered times (laziness), so it differs a bit from your result.
With trips as
(Select 1 source_id ,timestamp("2022-7-19 9:37:47") pickup_time, timestamp("2022-07-19 9:40:00") as drop_time, "jhon" driver_name
Union all Select 1 ,timestamp("2022-7-19 12:00:01"),timestamp("2022-7-19 13:05:11"),"doe"
Union all Select 1 ,timestamp("2022-7-19 14:30:01"),null,"foo"
Union all Select 3 ,timestamp("2022-7-24 08:35:01"),timestamp("2022-7-24 09:15:01"),"bar"
Union all Select 4 ,timestamp("2022-7-25 10:24:01"),timestamp("2022-7-25 11:14:01"),"jhon"
),
receipts as
(Select 1 source_id, 101 receipt_id, timestamp("2022-07-19 9:37:47") published_at,40 price
Union all Select 1,102, timestamp("2022-07-19 13:04:47"),45
Union all Select 1,103, timestamp("2022-07-19 15:23:00"),32
Union all Select 3,301, timestamp("2022-07-24 09:15:47"),45
Union all Select 4,401, timestamp("2022-07-25 11:13:47"),45
Union all Select 5,501, timestamp("2022-07-18 07:12:47"),45
),
trips_with_rowid as
(
SELECT 2*row_number() over (order by source_id,pickup_time) as row_id, * from trips
),
summery_tmp as
(
Select *, null as fines from trips_with_rowid
union all Select row_id+1,source_id,drop_time,null,concat("no driver, last one ",driver_name),null from trips_with_rowid
union all select null,source_id, published_at, null,null, R from receipts R
),
summery as
(
SELECT last_value(row_id ignore nulls) over (partition by source_id order by pickup_time ) row_id_new
,*
from summery_tmp
order by 1,2
)
select source_id,min(pickup_time) pickup_time, min(drop_time) drop_time,
any_value(driver_name) driver_name, array_agg(fines IGNORE NULLS) as fines_Sum
from summery
group by row_id_new,source_id
having fines_sum is not null or (pickup_time is not null and driver_name not like "no driver%")
order by 1,2

Recursive subtraction from two separate tables to fill in historical data

I have two datasets hosted in Snowflake with social media follower counts by day. The main table we will be using going forward (follower_counts) shows follower counts by day:
This table is live as of 4/4/2020 and will be updated daily. Unfortunately, I am unable to get historical data in this format. Instead, I have a table with historical data (follower_gains) that shows net follower gains by day for several accounts:
Ideally, I want to take the follower_count value from the minimum date in the current table (follower_counts) and subtract the sum of gains (organic + paid) for each day, back to the minimum date of the follower_gains table, to fill in follower_count historically. In addition, several accounts have data in these tables, so it would need to be grouped by account. It should look like this:
I've only gotten as far as unioning these two tables together, but don't even know where to start with looping through these rows:
WITH a AS (
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
total_followers_count,
null AS paid_follower_gain,
null AS organic_follower_gain,
account_name,
last_update
FROM follower_counts
UNION ALL
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
null AS total_followers_count,
paid_follower_gain,
organic_follower_gain,
account_name,
last_update
FROM follower_gains)
SELECT
a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.total_followers_count,
a.organic_follower_gain,
a.paid_follower_gain,
a.account_name,
a.last_update
FROM a
ORDER BY date desc LIMIT 100
UPDATE: Changed union to union all and added not exists to remove duplicates. Made changes per the comments.
NOTE: Please make sure you don't post images of the tables. It's difficult to recreate your scenario to write a correct query. Test this solution and update so that I can make modifications if necessary.
You don't loop in SQL because it's not a procedural language; the operation you define in the query is performed for all the rows in a table.
with cte as (SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
(a.follower_count - (b.organic_gain+b.paid_gain)) AS follower_count,
a.account_name,
a.last_update,
b.organic_gain,
b.paid_gain
FROM follower_counts a
JOIN follower_gains b ON a.account_id = b.account_id
AND b.date < (select min(date) from
follower_counts c where a.account_id = c.account_id)
)
SELECT b.account_id,
b.date,
b.organizational_entity,
b.organizational_entity_type,
b.vanity_name,
b.localized_name,
b.localized_website,
b.organization_type,
b.follower_count,
b.account_name,
b.last_update,
b.organic_gain,
b.paid_gain
FROM cte b
UNION ALL
SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.follower_count,
a.account_name,
a.last_update,
NULL as organic_gain,
NULL as paid_gain
FROM follower_counts a where not exists (select 1 from
follower_gains c where a.account_id = c.account_id AND a.date = c.date)
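The CTE above subtracts only each day's own gain from the anchor count. If the goal is the cumulative back-fill described in the question, a reverse running sum does it without any looping. A sketch (untested, using the question's column names and Snowflake window functions; the inclusive/exclusive boundary of the window may need adjusting to how the counts are snapshotted):
WITH anchor AS (
    SELECT account_id,
           date AS anchor_date,
           total_followers_count AS anchor_count
    FROM follower_counts
    QUALIFY ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY date) = 1
),
backfill AS (
    SELECT g.account_id,
           g.date,
           a.anchor_count
             - SUM(g.organic_follower_gain + g.paid_follower_gain)
                 OVER (PARTITION BY g.account_id
                       ORDER BY g.date DESC
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
             AS total_followers_count
    FROM follower_gains g
    JOIN anchor a
      ON a.account_id = g.account_id
     AND g.date < a.anchor_date
)
SELECT account_id, date, total_followers_count FROM backfill
UNION ALL
SELECT account_id, date, total_followers_count FROM follower_counts
ORDER BY account_id, date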
You could do something like this; instead of using the variable, you can just wrap it in another bracket and write ) AS FollowerGrowth at the end (see the sketch after the snippet below).
DECLARE @FollowerGrowth INT =
( SELECT total_followers_count
FROM follower_gains
WHERE AccountID = xx )
-
( SELECT TOP 1 follower_count
FROM follower_counts
WHERE AccountID = xx
ORDER BY date ASC )
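For illustration, the wrapped form described above would look roughly like this (same AccountID = xx placeholder, SQL Server-style syntax):
SELECT
  ( SELECT total_followers_count
    FROM follower_gains
    WHERE AccountID = xx )
  -
  ( SELECT TOP 1 follower_count
    FROM follower_counts
    WHERE AccountID = xx
    ORDER BY date ASC ) AS FollowerGrowth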

Calculating cohort data in Firebase -> BigQuery, but I want to separate them by tracking source. Grouping won't work?

I'm trying to calculate the quality of users with cohort data in BigQuery.
My current query is:
WITH analytics_data AS (
SELECT user_pseudo_id, event_timestamp, event_name, app_info.id,geo.country as country,platform ,app_info.id as bundle_id,
UNIX_MICROS(TIMESTAMP("2019-12-05 00:00:00")) AS start_day,
3600*1000*1000*24 AS one_day_micros
FROM `table.events_*`
WHERE _table_suffix BETWEEN "20191205" AND "20191218"
)
SELECT day_7_cohort / day_0_cohort AS seven_day_conversion FROM (
WITH day_7_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_name = 'watched_20_ads' AND event_timestamp BETWEEN start_day AND start_day+(12*one_day_micros)
), day_0_users AS (
SELECT DISTINCT user_pseudo_id
FROM analytics_data
WHERE event_name = "first_open"
AND bundle_id = "com.bundle.id"
AND country = "United States"
AND platform = "ANDROID"
AND event_timestamp BETWEEN start_day AND start_day+(1*one_day_micros)
)
SELECT
(SELECT count(*)
FROM day_0_users) AS day_0_cohort,(SELECT count(*)
FROM day_7_users
JOIN day_0_users USING (user_pseudo_id)) AS day_7_cohort
)
The problem is that I'm unable to separate the users by tracking source.
I want to separate the users by: tracking source and country.
What I'm currently getting:
what I would like to see:
What would be perfect:
I'm not sure if it's possible to write a query that would return the data in a single table, without involving more queries and data storage elsewhere.
So your question is missing some data/fields, but I will provide a 'general' solution.
with data as (
-- Select the fields you need to define criteria and cohorts
),
cohort_info as (
-- Cohort Logic (might be more complicated than this)
select user_id, source, country---, etc...
from data
group by 1,2,3
),
day_0_users as (
-- Logic to determine who you are measuring for your calculation
),
day_7_users as (
-- Logic to detemine who qualifies as a 7 day user for your calculation
),
joined as (
-- Join your CTEs together
select
cohort_info.source,
cohort_info.country,
count(distinct day_0_users.user_id) as day_0_count,
count(distinct day_7_users.user_id) as day_7_count
from day_0_users
left join day_7_users using(user_id)
inner join cohort_info using(user_id)
group by 1,2
)
select *, day_7_count/day_0_count as seven_day_conversion
from joined
I think using several CTEs in this manner will make your code more readable and will enable you to track your logic a bit better. Nested subqueries tend to get ugly.
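For illustration only, here is one way that skeleton could be filled in for the original query. It assumes the standard Firebase export fields traffic_source.source and geo.country and reuses the event names and date window from the question; the field choices and cohort windows would need checking against the real dataset.
WITH data AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    traffic_source.source AS source,  -- first-attributed source in the Firebase export
    geo.country AS country,
    UNIX_MICROS(TIMESTAMP("2019-12-05 00:00:00")) AS start_day,
    3600*1000*1000*24 AS one_day_micros
  FROM `table.events_*`
  WHERE _table_suffix BETWEEN "20191205" AND "20191218"
),
cohort_info AS (
  SELECT user_pseudo_id AS user_id, source, country
  FROM data
  GROUP BY 1, 2, 3
),
day_0_users AS (
  SELECT DISTINCT user_pseudo_id AS user_id
  FROM data
  WHERE event_name = "first_open"
    AND event_timestamp BETWEEN start_day AND start_day + 1*one_day_micros
),
day_7_users AS (
  SELECT DISTINCT user_pseudo_id AS user_id
  FROM data
  WHERE event_name = "watched_20_ads"
    AND event_timestamp BETWEEN start_day AND start_day + 12*one_day_micros
),
joined AS (
  SELECT
    ci.source,
    ci.country,
    COUNT(DISTINCT d0.user_id) AS day_0_count,
    COUNT(DISTINCT d7.user_id) AS day_7_count
  FROM day_0_users d0
  LEFT JOIN day_7_users d7 ON d7.user_id = d0.user_id
  INNER JOIN cohort_info ci ON ci.user_id = d0.user_id
  GROUP BY 1, 2
)
SELECT *, SAFE_DIVIDE(day_7_count, day_0_count) AS seven_day_conversion
FROM joined
Note that a user whose events carry more than one source or country will appear in cohort_info more than once and be counted in both groups; pinning each user to the values on their first_open event is one way to avoid that.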