How can I join 4 large Salesforce tables in BigQuery with SQL?

What I want to do is join the tables on the key field sendID, which is not unique (I use GROUP BY to deduplicate it).
FYI: These tables just have information from year 2022.
I have 4 tables from SalesForce.
Table 1: salesforce_sent (emails sent to different destinations and the largest table)
Table 2: salesforce_open (destinations that opened the email; also a pretty large table)
Table 3: salesforce_clicks (destinations that opened the email and clicked the link to a website)
Table 4: salesforce_sendjobs (helps link the information from Salesforce and Google Analytics)
Each table has different columns. I already tried LEFT JOIN and INNER JOIN in my queries, but the query run time is insane (I've waited up to 2-3 hours and then cancelled the run).
This is what I tried (I assume an INNER JOIN is better than a LEFT JOIN because it's less heavy):
WITH SENT AS (
  SELECT
    sendid,
    EXTRACT(DATE FROM eventdate) AS sent_date,
    LOWER(emailaddress) AS emailaddress,
    COUNT(*) AS sent
  FROM salesforce_sent
  GROUP BY 1, 2, 3
),
CLICKS AS (
  SELECT
    sendid,
    EXTRACT(DATE FROM eventdate) AS click_date,
    url,
    REGEXP_EXTRACT(url, r'utm_source=([^&]+)') AS source,
    REGEXP_EXTRACT(url, r'utm_medium=([^&]+)') AS medium,
    REGEXP_EXTRACT(url, r'utm_campaign=([^&]+)') AS campaign,
    REGEXP_EXTRACT(url, r'utm_content=([^&]+)') AS ad_content,
    isunique AS isunique_click,
    COUNT(*) AS clicks
  FROM salesforce_clicks
  GROUP BY 1, 2, 3, 4, 5, 6, 7, 8
),
OPEN AS (
  SELECT
    sendid,
    EXTRACT(DATE FROM eventdate) AS open_date,
    isunique AS isunique_open,
    COUNT(*) AS open
  FROM salesforce_opens
  GROUP BY 1, 2, 3
),
SENDJOBS AS (
  SELECT
    sendid,
    EXTRACT(DATE FROM senttime) AS sent_date,
    LOWER(emailname) AS emailname,
    LOWER(SPLIT(emailname, '-')[SAFE_OFFSET(1)]) AS pos
  FROM salesforce_sendjobs
  GROUP BY 1, 2, 3, 4
)
SELECT
  a.sendid,
  a.sent_date,
  c.open_date,
  d.click_date,
  a.emailaddress,
  b.emailname,
  d.url,
  b.pos,
  d.source,
  d.medium,
  d.campaign,
  d.ad_content,
  SUM(a.sent) AS sent,
  SUM(c.open) AS open,
  SUM(d.clicks) AS clicks
FROM SENT a
INNER JOIN SENDJOBS b ON a.sendid = b.sendid
INNER JOIN OPEN c ON a.sendid = c.sendid
INNER JOIN CLICKS d ON a.sendid = d.sendid
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
I also tried UNION ALL, but it's not what I'm looking for, because you won't get a complete row when the information exists in table A but not in table B.
What I want is a merge of all 4 tables using sendID as the key. It would also be really nice to make the query lighter somehow. The query processes 100 GB when run (which is not that much).
I can also see that the compute bottleneck is in the join.
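One way to lighten the join (a sketch, not the full output, assuming sendid is the key you ultimately report on): collapse each CTE to exactly one row per sendid before joining. As written, a sendid with m sent rows, n open rows, and k click rows produces m × n × k joined rows, which is where the compute goes; 1:1 joins cannot fan out like that.

```sql
-- Sketch: pre-aggregate every table to one row per sendid,
-- then join 1:1 so the row count can no longer explode.
WITH sent AS (
  SELECT sendid, MIN(EXTRACT(DATE FROM eventdate)) AS sent_date, COUNT(*) AS sent
  FROM salesforce_sent
  GROUP BY sendid
),
opens AS (
  SELECT sendid, COUNT(*) AS opens
  FROM salesforce_opens
  GROUP BY sendid
),
clicks AS (
  SELECT sendid, COUNT(*) AS clicks
  FROM salesforce_clicks
  GROUP BY sendid
)
SELECT s.sendid, s.sent_date, s.sent, o.opens, c.clicks
FROM sent AS s
LEFT JOIN opens AS o USING (sendid)
LEFT JOIN clicks AS c USING (sendid)
```

Per-URL or per-date breakdowns would then live in separate queries at their own grain, rather than being multiplied through one wide join.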

Related

Query keeps giving me duplicate records. How can I fix this?

I wrote a query which uses 2 temp tables. And then joins them into 1. However, I am seeing duplicate records in the student visit temp table. (Query is below). How could this be modified to remove the duplicate records of the visit temp table?
with clientbridge as (
  select *
  from (
    select visitorid, --Visid
      roomnumber,
      room_id,
      profid,
      student_id,
      cohd.datekey,
      rank() over (partition by visitorid, student_id, profid order by cohd.datekey desc) as rn
    from university.course_office_hour_bridge cohd
    --where student_id = '9999999-aaaa-6634-bbbb-96fa18a9046e'
  )
  where rn = 1
),
-----------------Data Header Table
studentvisit as
(SELECT
--Visit key will allow us to track everything they did within that visit.
distinct visid_visitorid,
--calcualted_visitorid,
uniquevisitkey,
--channel, -- says the room they're in. Channel might not be reliable would need to see how that operates
--office_list, -- add 7 to exact
--user_college,
--first_office_hour_name,
--first_question_time_attended,
studentaccountid_5,
profid_officenumber_8,
studentvisitstarttime,
room_id_115,
--date_time,
qqq144, --Course Name
qqq145, -- Course Office Hour Benefit
qqq146, --Course Office Hour ID
datekey
FROM university.office_hour_details ohd
--left join university.course_office_hour_bridge cohd on ohd.visid_visitorid
where DateKey >='2022-10-01' --between '2022-10-01' and '2022-10-27'
and (qqq146 <> '')
)
select
*
from clientbridge cb inner join studentvisit sv on sv.visid_visitorid = cb.visitorid
I think you may have a better shot by joining the two datasets in the same query where you want the data ranked; otherwise the rank from your first query will be ignored within the results of the second query. Perhaps something like this:
;with studentvisit as
(SELECT
--Visit key will allow us to track everything they did within that visit.
distinct visid_visitorid,
--calcualted_visitorid,
uniquevisitkey,
--channel, -- says the room they're in. Channel might not be reliable would need to see how that operates
--office_list, -- add 7 to exact
--user_college,
--first_office_hour_name,
--first_question_time_attended,
studentaccountid_5,
profid_officenumber_8,
studentvisitstarttime,
room_id_115,
--date_time,
qqq144, --Course Name
qqq145, -- Course Office Hour Benefit
qqq146, --Course Office Hour ID
datekey
FROM university.office_hour_details ohd
--left_join niversity.course_office_hour_bridge cohd on ohd.visid_visitorid
where DateKey >='2022-10-01' --between '2022-10-01' and '2022-10-27'
and (qqq146 <> '')
)
,clientbridge as (
  select
    sv.*,
    cohd.roomnumber,
    cohd.room_id,
    cohd.profid,
    cohd.student_id,
    cohd.datekey as bridge_datekey, -- aliased so it doesn't collide with sv.datekey
    rank() over (partition by cohd.visitorid, cohd.student_id, cohd.profid order by cohd.datekey desc) as rn
  from university.course_office_hour_bridge cohd
  inner join studentvisit sv on sv.visid_visitorid = cohd.visitorid
)
select
*
from clientbridge WHERE rn=1

Trying to join multiple tables that don't all share the same pair of common columns, so values from the last table are repeated. Need help solving this

I have the following query; the result is attached as an image link. The `adjust` CTE to be joined has no platform or date column, so its records are repeated. Is there a way to avoid this? It will cause issues in campaign-level visualizations when the repeated values get summed.
with
sent as (
select campaign_name, date(date) as date, platform, count(id) as sent
from send
group by 1,2,3
),
bounce as (
select campaign_name, platform, count(id) as bounce
from bounce
group by 1,2
),
open as (
select campaign_name, platform, count(id) as clicks
from open
group by 1,2
),
adjust as (
select campaign, sum(purchase_events) as transactions, count(distinct adjust_id) as sessions, sum(sessions) as s2, sum(clicks) as ad_clicks
from adjust
group by 1
)
select
s.campaign_name,
s.date,
s.platform,
s.sent,
(s.sent-b.bounce) as delivered,
b.bounce,
o.clicks,
a.ad_clicks,
a.sessions,
a.s2,
a.transactions
from sent s
join bounce b on s.campaign_name = b.campaign_name and s.platform = b.platform
join open o on s.campaign_name = o.campaign_name and s.platform = o.platform
left join adjust a on s.campaign_name = a.campaign
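A possible way around the repetition (a sketch, assuming the adjust numbers genuinely only exist at campaign grain): join adjust to a campaign-level rollup, where the join is 1:1, and keep the date/platform breakdown in a separate query at its own grain.

```sql
-- Sketch: campaign-grain query where adjust joins 1:1,
-- so sums at campaign level can no longer double-count.
with email as (
  select campaign_name, count(id) as sent
  from send
  group by 1
),
adjust_agg as (
  select campaign,
         sum(purchase_events) as transactions,
         sum(clicks) as ad_clicks
  from adjust
  group by 1
)
select e.campaign_name, e.sent, a.transactions, a.ad_clicks
from email e
left join adjust_agg a on e.campaign_name = a.campaign
```

There is no correct way to spread the campaign-level adjust values across dates and platforms without more data, so reporting them at their own grain avoids inventing an allocation.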

Select other table as a column based on datetime in BigQuery [duplicate]

This question already has an answer here:
Full outer join and Group By in BigQuery
(1 answer)
Closed 5 months ago.
I have two tables that have a relationship, but I want to group them based on time. Here are the tables.
I want to select a receipt as a column based on published_at; it must be between pickup_time and drop_time, so I will get this result:
I tried with JOIN, but it seems to select only rows where drop_time is NULL
SELECT
t.source_id AS source_id,
t.pickup_time AS pickup_time,
t.drop_time AS drop_time,
ARRAY_AGG(STRUCT(r.source_id, r.receipt_id, r.published_at) ORDER BY r.published_at LIMIT 1)[SAFE_OFFSET(0)] AS receipt
FROM `my-project-gcp.data_source.trips` AS t
JOIN `my-project-gcp.data_source.receipts` AS r
ON
t.source_id = r.source_id
AND
r.published_at >= t.pickup_time
AND (
r.published_at <= t.drop_time
OR t.drop_time IS NULL
)
GROUP BY source_id, pickup_time, drop_time
and I tried with a sub-query, and got:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
SELECT
t.source_id AS source_id,
t.pickup_time AS pickup_time,
t.drop_time AS drop_time,
ARRAY_AGG((
SELECT
STRUCT(r.source_id, r.receipt_id, r.published_at)
FROM `my-project-gcp.data_source.receipts` as r
WHERE
t.source_id = r.source_id
AND
r.published_at >= t.pickup_time
AND (
r.published_at <= t.drop_time
OR t.drop_time IS NULL
)
LIMIT 1
))[SAFE_OFFSET(0)] AS receipt
FROM `my-project-gcp.data_source.trips` as t
GROUP BY source_id, pickup_time, drop_time
Each source_id is a car, and only one driver can drive a car at a time, so we can partition by that column.
Your approach works for small tables. Since there is no unique join key, the join degenerates into a cross join and fails on large tables.
I present here a solution using UNION ALL and a look-back technique. It is quite fast and works up to medium table sizes in the range of a few GB. It avoids the cross join, but it is a fairly long script.
The trips table lists all drives by the drivers; the receipts table lists all fines.
We need a unique row identifier for each trip to join on later; we use the row number for this (see the trips_with_rowid table).
The summery_tmp table unions three inputs: the trips table with an empty column added for the fines; the trips table again, to mark the times when no one was driving the car; and the receipts table, with only the columns source_id, pickup_time and fine filled.
In the summery table this is sorted by pickup_time within each source_id, so the fine entries sit directly under the entry of the driver who took the car, and for the fine entries the column row_id_new is filled with the row_id of that driver.
Grouping by row_id_new and filtering out unneeded entries does the job.
I changed the second of the entered times (laziness), so it differs a bit from your result.
With trips as
(Select 1 source_id ,timestamp("2022-7-19 9:37:47") pickup_time, timestamp("2022-07-19 9:40:00") as drop_time, "jhon" driver_name
Union all Select 1 ,timestamp("2022-7-19 12:00:01"),timestamp("2022-7-19 13:05:11"),"doe"
Union all Select 1 ,timestamp("2022-7-19 14:30:01"),null,"foo"
Union all Select 3 ,timestamp("2022-7-24 08:35:01"),timestamp("2022-7-24 09:15:01"),"bar"
Union all Select 4 ,timestamp("2022-7-25 10:24:01"),timestamp("2022-7-25 11:14:01"),"jhon"
),
receipts as
(Select 1 source_id, 101 receipt_id, timestamp("2022-07-19 9:37:47") published_at,40 price
Union all Select 1,102, timestamp("2022-07-19 13:04:47"),45
Union all Select 1,103, timestamp("2022-07-19 15:23:00"),32
Union all Select 3,301, timestamp("2022-07-24 09:15:47"),45
Union all Select 4,401, timestamp("2022-07-25 11:13:47"),45
Union all Select 5,501, timestamp("2022-07-18 07:12:47"),45
),
trips_with_rowid as
(
SELECT 2*row_number() over (order by source_id,pickup_time) as row_id, * from trips
),
summery_tmp as
(
Select *, null as fines from trips_with_rowid
union all Select row_id+1,source_id,drop_time,null,concat("no driver, last one ",driver_name),null from trips_with_rowid
union all select null,source_id, published_at, null,null, R from receipts R
),
summery as
(
SELECT last_value(row_id ignore nulls) over (partition by source_id order by pickup_time ) row_id_new
,*
from summery_tmp
order by 1,2
)
select source_id,min(pickup_time) pickup_time, min(drop_time) drop_time,
any_value(driver_name) driver_name, array_agg(fines IGNORE NULLS) as fines_Sum
from summery
group by row_id_new,source_id
having fines_sum is not null or (pickup_time is not null and driver_name not like "no driver%")
order by 1,2

Unrecognized name when joining 2 tables in Google Big Query

I want to join two tables from different datasets. It is possible to INNER JOIN these two datasets but it does not work with a regular JOIN.
I want to join a Google Analytics 4 (GA4) item id on the item id of the datawarehouse.
In order to access the GA4 item id I need to UNNEST the GA4 items array.
When using the code below, I get the following error:
Unrecognized name: dwh_id; Did you mean dwh? at [9:79]
Here's the query I'm using now.
SELECT
event_date as ga4_date,
ga4_items.item_id AS ga4_id,
ga4_items.item_name,
ga4_items.price,
dwh.Product_SKU__Google_Analytics as dwh_id,
FROM `ga4-data` as ga4
JOIN `datawarehouse-data` as dwh ON dwh_id = ga4_id,
UNNEST(ga4.items) as ga4_items
Let me know if you have the answer :)
My best guess of what you're trying to do:
CREATE TEMP TABLE `ga4-data` AS
SELECT '2022-01-01' AS event_date,
[STRUCT('item001' AS item_id, 'name1' AS item_name, 100 AS price),
STRUCT('item002' AS item_id, 'name2' AS item_name, 200 AS price)] AS items
;
CREATE TEMP TABLE `datawarehouse-data` AS
SELECT 'item001' AS Product_SKU__Google_Analytics,
'col1' AS col1;
SELECT event_date as ga4_date,
ga4_items.item_id AS ga4_id,
ga4_items.item_name,
ga4_items.price,
dwh.Product_SKU__Google_Analytics as dwh_id
FROM `ga4-data` as ga4, UNNEST(ga4.items) as ga4_items
JOIN `datawarehouse-data` as dwh
ON dwh.Product_SKU__Google_Analytics = ga4_items.item_id;
Alright, I figured it out. It took a lot of trial and error but I got it:
WITH ga as
(
SELECT event_date as ga4_date,
ga4_items.item_id as id,
ga4_items.item_name,
ga4_items.price
FROM `name-ga4-dataset` as ga4, UNNEST(ga4.items) as ga4_items
),
dwh as
(
SELECT Product_SKU__Google_Analytics as dwh_id
FROM `name-dwh-dataset` as dwh
)
SELECT * FROM ga
JOIN dwh
ON ga.id = dwh_id

Recursive subtraction from two separate tables to fill in historical data

I have two datasets hosted in Snowflake with social media follower counts by day. The main table we will be using going forward (follower_counts) shows follower counts by day:
This table is live as of 4/4/2020 and will be updated daily. Unfortunately, I am unable to get historical data in this format. Instead, I have a table with historical data (follower_gains) that shows net follower gains by day for several accounts:
Ideally - I want to take the follower_count value from the minimum date in the current table (follower_counts) and subtract the sum of gains (organic + paid gains) for each day, until the minimum date of the follower_gains table, to fill in the follower_count historically. In addition, there are several accounts with data in these tables, so it would need to be grouped by account. It should look like this:
I've only gotten as far as unioning these two tables together, but don't even know where to start with looping through these rows:
WITH a AS (
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
total_followers_count,
null AS paid_follower_gain,
null AS organic_follower_gain,
account_name,
last_update
FROM follower_counts
UNION ALL
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
null AS total_followers_count,
organic_follower_gain,
paid_follower_gain,
account_name,
last_update
FROM follower_gains)
SELECT
a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.total_followers_count,
a.organic_follower_gain,
a.paid_follower_gain,
a.account_name,
a.last_update
FROM a
ORDER BY date desc LIMIT 100
UPDATE: Changed union to union all and added not exists to remove duplicates. Made changes per the comments.
NOTE: Please make sure you don't post images of the tables. It's difficult to recreate your scenario to write a correct query. Test this solution and update so that I can make modifications if necessary.
You don't loop in SQL because it's not a procedural language. The operation you define in the query is performed over all the rows of a table.
with cte as (SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
(a.follower_count - (b.organic_gain+b.paid_gain)) AS follower_count,
a.account_name,
a.last_update,
b.organic_gain,
b.paid_gain
FROM follower_counts a
JOIN follower_gains b ON a.account_id = b.account_id
AND b.date < (select min(date) from
follower_counts c where a.account_id = c.account_id)
)
SELECT b.account_id,
b.date,
b.organizational_entity,
b.organizational_entity_type,
b.vanity_name,
b.localized_name,
b.localized_website,
b.organization_type,
b.follower_count,
b.account_name,
b.last_update,
b.organic_gain,
b.paid_gain
FROM cte b
UNION ALL
SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.follower_count,
a.account_name,
a.last_update,
NULL as organic_gain,
NULL as paid_gain
FROM follower_counts a where not exists (select 1 from
follower_gains c where a.account_id = c.account_id AND a.date = c.date)
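As an alternative sketch of the same back-fill (same column names as above, Snowflake window-function syntax): anchor on the earliest known count per account and subtract a reverse running sum of the gains. Whether the current day's own gain belongs in the frame depends on whether your counts are start-of-day or end-of-day, so treat the frame boundary as an assumption to verify against your data.

```sql
-- Sketch: reconstruct historical counts from the earliest real count
-- minus a running total of the gains recorded after each date.
with base as (
  select account_id, date as first_date, total_followers_count
  from follower_counts
  qualify row_number() over (partition by account_id order by date) = 1
)
select g.account_id,
       g.date,
       b.total_followers_count
         - sum(g.organic_follower_gain + g.paid_follower_gain)
             over (partition by g.account_id
                   order by g.date desc
                   rows between unbounded preceding and current row)
           as total_followers_count
from follower_gains g
join base b
  on b.account_id = g.account_id
 and g.date < b.first_date
```

This keeps everything set-based (no looping) and handles all accounts in one pass via the partition.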
You could do something like this; instead of using the variable you can just wrap it in another bracket and write `) AS FollowerGrowth` at the end:
DECLARE @FollowerGrowth INT =
    ( SELECT total_followers_count
      FROM follower_gains
      WHERE account_id = xx )
    -
    ( SELECT TOP 1 follower_count
      FROM follower_counts
      WHERE account_id = xx
      ORDER BY date ASC )