Attribute sales to email campaigns based on time window - sql

Hopefully someone is able to help me with the following. I have two tables, one containing events which are "clicked emails" and one containing "conversions" (=orders). I would like to write attribution logic in BigQuery SQL to match the conversions to the clicked emails based on the following logic:
the matching ID is a "person_id"
conversions that happened within a 5-day (120 h) window after the clicked email are attributed to that email
if the customer clicked on two different emails within a five-day window, the email clicked most recently before the conversion gets attributed the revenue
To know: the clicks table also contains clicks which do not have a conversion, and the conversions table also contains conversions not related to emails.
Desired end result: a table containing a count of all attributed orders and a count of all clicks, by date and campaign name.
I figured I would need a left join on person_id that brings in only the conversions that might be related to a click. However, I then need to define the window up to which conversions are counted (the 5 days). Maybe I could include this in the WHERE clause? After that, in case the count of conversions is > 1, I need to take only the conversion with the smallest date difference into account.
How far I got:
SELECT
c.person_id,
c.campaign_name,
c.datetime,
s.processed_at,
c.email,
s.order_number,
SUM(s.total_price) AS revenue,
COUNT(DISTINCT s.order_number) AS orders
FROM
`klaviyo_de_clicks` c
LEFT JOIN
`klaviyo_de_conversions` s
ON
c.person_id = s.person_id
GROUP BY
1,2,3,4,5,6
Thanks for your help!

You can get the emails just before the conversion using union all and window functions.
The idea is to put all the data into a single "result set" with the clicks and conversions separated. Then use window functions to get the previous click information for each conversion.
Your sample code has a bunch of columns not alluded to in your question. But this is the structure of the code:
with cc as (
select person_id, date_time, conversion_id, null as click_id
from conversions c
union all
select person_id, date_time, null, click_id
from clicks c
)
select cc.*,
(click_date_time >= datetime_add(date_time, interval -5 day)) as is_click_conversion
from (select cc.*,
max(case when click_id is not null then date_time end) over (partition by person_id order by date_time) as click_date_time,
last_value(click_id ignore nulls) over (partition by person_id order by date_time) as prev_click_id
from cc
) cc
where conversion_id is not null;
If you need additional columns, you can use join to bring them in.
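For the end result described in the question (attributed orders and clicks per date and campaign name), the same attribution rule can also be spelled out as a direct join plus row_number(), along the lines the question itself sketches. A rough BigQuery sketch, assuming datetime is the click timestamp, processed_at is the conversion timestamp (both DATETIME; use timestamp_add instead if they are TIMESTAMPs), and order_number uniquely identifies a conversion:
with attributed as (
  -- one row per conversion, keeping only the most recent click in the 120 h before it
  select s.order_number, c.campaign_name, date(c.datetime) as click_date
  from `klaviyo_de_conversions` s
  join `klaviyo_de_clicks` c
    on c.person_id = s.person_id
  where s.processed_at > c.datetime
    and s.processed_at <= datetime_add(c.datetime, interval 120 hour)
  qualify row_number() over (partition by s.order_number order by c.datetime desc) = 1
)
select date, campaign_name, cl.clicks,
       coalesce(ord.attributed_orders, 0) as attributed_orders
from (
  -- all clicks, whether or not they converted
  select date(datetime) as date, campaign_name, count(*) as clicks
  from `klaviyo_de_clicks`
  group by 1, 2
) cl
left join (
  select click_date as date, campaign_name, count(distinct order_number) as attributed_orders
  from attributed
  group by 1, 2
) ord
using (date, campaign_name);
Aggregating clicks and attributed orders separately before the final join avoids fanning out the click counts when a campaign/day has several attributed orders.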

Related

Big Query / SQL finding "new" data in a date range

I have a pretty big event log with columns:
id, timestamp, text, user_id
The text field contains a variety of things, like:
Road: This is the road name
City: This is the city name
Type: This is a type
etc..
I would like to get the result to the following:
Given a start and end date, how many **new** users used a road (that haven't before) grouped by road.
I've got various parts of this working fine (like the total number of users, the grouping by, the date range and so on). The SQL for getting the new users is eluding me though, having tried solutions like SELECT AS STRUCT on sub queries amongst other things.
Ultimately, I'd love to see a result like:
road, total_users, new_users
Any help would be much appreciated.
If I understand correctly, you want something like this:
select road, countif(seqnum = 1) as new_users, count(distinct user_id) as num_users
from (select l.*,
row_number() over (partition by l.user_id, l.text order by l.timestamp) as seqnum
from log l
where l.type = 'Road'
) l
where timestamp >= #timestamp1 and timestamp < #timestamp2
group by road;
This assumes that you have a column that specifies the type (i.e. "road") and another column with the name of the road (i.e. "Champs-Elysees").
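If there is no separate type/road column and the road name only lives inside the text field (as in "Road: This is the road name"), one way to derive it on the fly in BigQuery is REGEXP_EXTRACT. A sketch under that assumption (the regular expression and the @timestamp parameters are placeholders to adapt):
select road,
       count(distinct user_id) as total_users,
       countif(seqnum = 1) as new_users
from (select l.*,
             regexp_extract(l.text, r'^Road:\s*(.+)$') as road,
             row_number() over (partition by l.user_id,
                                             regexp_extract(l.text, r'^Road:\s*(.+)$')
                                order by l.timestamp) as seqnum
      from log l
      where starts_with(l.text, 'Road:')
     ) l
where timestamp >= @timestamp1 and timestamp < @timestamp2
group by road;
As above, seqnum is computed over the whole history, so countif(seqnum = 1) only counts users whose first-ever use of that road falls inside the date range.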

How to find the distinct records when a value was changed in a table with daily snap shots

I have a table that has a SNAP_EFF_DT (date the record was inserted into the table) field. All records are inserted on a daily basis to record any changes a specific record may have. I want to pull out only the dates and values when a change took place from a previous date.
I am using Teradata SQL Assistant to query this data. This is what I have so far:
SEL DISTINCT MIN(a.SNAP_EFF_DT) as SNAP_EFF_DT, CLIENT_ID, FAVORITE_COLOR
FROM CUSTOMER_TABLE
GROUP BY 2,3;
This does give me the first instance of a change to a specific color. However, if a customer first likes blue on 1/1/2019, then changes to green on 2/1/2019, and then changes back to blue on 3/1/2019 I won't get that last change in the results and will assume their current favorite color is green, when in fact it changed back to blue. I would like a code that returns all 3 changes.
Simply use LAG to compare the current and the previous row's color:
SELECT t.*,
LAG(FAVORITE_COLOR)
OVER (PARTITION BY CLIENT_ID
ORDER BY SNAP_EFF_DT) AS prev_color
FROM CUSTOMER_TABLE AS t
QUALIFY
FAVORITE_COLOR <> prev_color
OR prev_color IS NULL
If your Teradata version doesn't support LAG, switch to
MIN(FAVORITE_COLOR)
OVER (PARTITION BY CLIENT_ID
ORDER BY SNAP_EFF_DT
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev_color
One method uses a self-join:
select ct.*
from CUSTOMER_TABLE ct left join
CUSTOMER_TABLE ctprev
on ctprev.client_id = ct.client_id AND
ctprev.SNAP_EFF_DT = ct.SNAP_EFF_DT - interval '1' day
where ctprev.client_id is null or
(ctprev.FAVORITE_COLOR <> ct.FAVORITE_COLOR or
. . .
);
Note: This assumes that the values are not null, although the logic can be adjusted to handle null values as well.
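For example, if FAVORITE_COLOR itself can be NULL, one way to adjust the comparison (a sketch reusing the LAG idea from the first answer; prev_color and rn are just illustrative aliases):
SELECT *
FROM (
    SELECT t.*,
           LAG(FAVORITE_COLOR)
               OVER (PARTITION BY CLIENT_ID
                     ORDER BY SNAP_EFF_DT) AS prev_color,
           ROW_NUMBER()
               OVER (PARTITION BY CLIENT_ID
                     ORDER BY SNAP_EFF_DT) AS rn
    FROM CUSTOMER_TABLE AS t
) dt
WHERE rn = 1                                                         -- always keep the first snapshot
   OR FAVORITE_COLOR <> prev_color                                   -- ordinary change
   OR (FAVORITE_COLOR IS NULL AND prev_color IS NOT NULL)            -- changed to NULL
   OR (FAVORITE_COLOR IS NOT NULL AND prev_color IS NULL AND rn > 1) -- changed back from NULL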

table returned after group and joins has unrealistically large numbers

I'm trying to fuse the marketing data we have in Google with the data we have in Facebook, by location. The first SELECT statement is getting columns from the table made via the nested SELECT statement in line 5. I then have to join that with a different table to get DMA Name (line 11). Finally I union that with the Facebook data. When I run the query, the results for clicks, spend, and impressions are all in the billions when I sum up all DMAs. Instead, they should be anywhere from 10 million to 100 million, depending on the metric.
I am really new to SQL, so I am sure there are better ways to think about how to attack this problem. I am sure my syntax isn't up to the best practice standards. I welcome all feedback.
SELECT sum(sub.clicks) AS clicks, sum(sub.spend) AS spend,
       sum(sub.impressions) AS impressions, sub.date,
       location_with_adwordsID.DMA_NAME, sub.ad_network_type_2
FROM
  (SELECT sum(clicks) AS clicks, sum(cost) AS spend,
          sum(impressions) AS impressions,
          cast(date AS Date) AS date, city_criteria_id,
          ad_network_type_2
   FROM adwords.location
   GROUP BY date, city_criteria_id, ad_network_type_2) AS sub
LEFT JOIN location_conversion.location_with_adwordsID
  ON CAST(sub.city_criteria_id AS STRING) =
     CAST(location_with_adwordsID.criteria_id AS STRING)
GROUP BY date, DMA_NAME, ad_network_type_2
UNION ALL
(SELECT sum(clicks) AS clicks, sum(spend) AS spend,
        sum(impressions) AS impressions,
        CAST(date AS Date) AS date, lower(dma) AS fbdma,
        'Facebook' as Source
 FROM facebook_ad_insights_dma.ad_insights_locations
 GROUP BY Date, fbdma)
Here is the structure of the 'location_with_adwordsID' table.
https://drive.google.com/file/d/1oKd3O_fVOjwO1EnZ5LFjHIiB3EB32be5/view?usp=sharing
Here is the structure of the 'adwords.location' table.
https://drive.google.com/file/d/1XlHC7Ug2yW9XNkNR6kolmmJPrfUa-S6n/view?usp=sharing
The reason for the LEFT JOIN is this: Google Ads gives me location data with a seemingly proprietary 'city_id'. To join this data with the Facebook data, I need to add a DMA column to my adwords table and then union FB and Google. That's where my 'location_with_adwordsID' comes in, which is a table made by Google that has city_id by DMA and zip code. So my desired outcome after this join is a table with the same number of rows as 'adwords.location', but with an extra column of 'DMA'.
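One quick way to sanity-check that expectation, sketched with the table names above (if the join is truly one-to-one, the two counts should match):
SELECT
  (SELECT COUNT(*) FROM adwords.location) AS adwords_rows,
  (SELECT COUNT(*)
   FROM adwords.location loc
   LEFT JOIN location_conversion.location_with_adwordsID ad
     ON CAST(loc.city_criteria_id AS STRING) = CAST(ad.criteria_id AS STRING)) AS joined_rows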
Thanks.
It is hard to provide a definitive answer without seeing the table structures and sample data.
However, based on your SQL code, it looks like you have an unnecessary nested query in your first SELECT: you do not need the sub subquery. You can join adwords.location directly to location_conversion.location_with_adwordsID and use aggregate functions (SUM) on the SELECTed fields. This simplifies the query and eliminates a potential source of duplication.
Try:
SELECT
  sum(loc.clicks) AS clicks,
  sum(loc.cost) AS spend,
  sum(loc.impressions) AS impressions,
  cast(loc.date AS Date) AS date,
  ad.DMA_NAME,
  loc.ad_network_type_2
FROM
  adwords.location loc
LEFT JOIN location_conversion.location_with_adwordsID ad
  ON CAST(loc.city_criteria_id AS STRING) = CAST(ad.criteria_id AS STRING)
GROUP BY
  date,
  DMA_NAME,
  ad_network_type_2
UNION ALL
SELECT
  sum(clicks) AS clicks,
  sum(spend) AS spend,
  sum(impressions) AS impressions,
  CAST(date AS Date) AS date,
  lower(dma) AS fbdma,
  'Facebook' as Source
FROM facebook_ad_insights_dma.ad_insights_locations
GROUP BY
  date,
  fbdma
If you still get unrealistic numbers, then you have to check the relationship between adwords.location (which I aliased as loc) and location_conversion.location_with_adwordsID (aliased ad): if there are multiple records in ad for a given criteria_id, then your query will count the same loc record several times, which would cause the issue. In that case you must refine the JOIN by adding additional criteria.
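A quick way to check for that, using the table and column names from the question: any rows returned here mean the lookup table carries several DMA rows for a single city id, and each of them multiplies the adwords metrics in the join.
SELECT criteria_id, COUNT(*) AS rows_per_id
FROM location_conversion.location_with_adwordsID
GROUP BY criteria_id
HAVING COUNT(*) > 1
ORDER BY rows_per_id DESC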

SQL: Average value per day

I have a table called 'tweets'. The table 'tweets' includes (amongst others) the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh:mm:ss), 'classified' and 'processed_text'. Within the 'processed_text' column there are certain strings such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of 'classified' per ticker-string per day. The column 'classified' contains the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of 'classified' for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I can also do them manually, one by one, but this would cost me a terrible lot of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I stated, however, I am now looking for the average classified scores per ticker-string. My question is therefore: could someone show me how to adjust this script in such a way that I can calculate the average classified scores per ticker-string per day?
SELECT d.date, t.ticker,
       COALESCE(COUNT(DISTINCT t.tweet_id), 0) AS tweets,
       AVG(t.classified) AS avg_classified
FROM dates d
LEFT JOIN
     (SELECT DATE(created_at) AS date, tweet_id, classified,
             SUBSTR(processed_text,
                    LOCATE('{TICKER|', processed_text) + 8,
                    LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
                      - LOCATE('{TICKER|', processed_text) - 8) AS ticker
      FROM tweets) t
     ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
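For example, since the tickers in the question (IBM, GOOG, BAC) are known ahead of time, the pivot can be done in MySQL with conditional aggregation. A sketch along those lines, reusing the dates table and the LIKE patterns from the question:
SELECT d.date,
       COALESCE(AVG(CASE WHEN processed_text LIKE '%{TICKER|IBM}%'  THEN classified END), 0) AS IBM,
       COALESCE(AVG(CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' THEN classified END), 0) AS GOOG,
       COALESCE(AVG(CASE WHEN processed_text LIKE '%{TICKER|BAC}%'  THEN classified END), 0) AS BAC
FROM dates d
LEFT JOIN tweets t
  ON DATE(t.created_at) = d.date
GROUP BY d.date;
AVG skips the NULLs produced by the CASE expressions, so each column is the average classified score for that ticker only, and the COALESCE returns 0 on days with no matching tweets, as requested in the question.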

SQL merging result sets on a unique column value

I have 2 similar queries which both work on the same table, and I essentially want to combine their results such that the second query supplies default values for what the first query doesn't return. I've simplified the problem as much as possible here. I'm using Oracle btw.
The table has account information in it for a number of accounts, and there are multiple entries for each account, with a commit_date to tell when the account information was inserted. I need to get the account info that was current for a certain date.
The queries take a list of account ids and a date.
Here is the query:
-- Select the row which was current for the accounts for the given date. (won't return anything for an account which didn't exist for the given date)
SELECT actr.*
FROM Account_Information actr
WHERE actr.account_id in (30000316, 30000350, 30000351)
AND actr.commit_date <= to_date( '2010-DEC-30','YYYY-MON-DD ')
AND actr.commit_date =
(
SELECT MAX(actrInner.commit_date)
FROM Account_Information actrInner
WHERE actrInner.account_id = actr.account_id
AND actrInner.commit_date <= to_date( '2010-DEC-30','YYYY-MON-DD ')
)
This looks a little ugly, but it returns a single row for each account which was current for the given date. The problem is that it doesn't return anything if the account didn't exist until after the given date.
Selecting the earliest account info for each account is trivial - I don't need to supply a date for this one:
-- Select the earliest row for the accounts.
SELECT actr.*
FROM Account_Information actr
WHERE actr.account_id in (30000316, 30000350, 30000351)
AND actr.commit_date =
(
SELECT MAX(actrInner.commit_date)
FROM Account_Information actrInner
WHERE actrInner.account_id = actr.account_id
)
But I want to merge the result sets in such a way that:
For each account, if there is account info for it in the first result set - use that.
Otherwise, use the account info from the second result set.
I've researched all of the joins I can use without success. Unions almost do it but they will only merge for unique rows. I want to merge based on the account id in each row.
Sql Merging two result sets - my case is obviously more complicated than that
SQL to return a merged set of results - I might be able to adapt that technique? I'm a programmer being forced to write SQL and I can't quite follow that example well enough to see how I could modify it for what I need.
The standard way to do this is with a left outer join and coalesce. That is, your overall query will look like this:
SELECT ...
FROM defaultQuery
LEFT OUTER JOIN currentQuery ON ...
If you did a SELECT *, each row would correspond to the current account data plus your defaults. With me so far?
Now, instead of SELECT *, for each column you want to return, you do a COALESCE() on matched pairs of columns:
SELECT COALESCE(currentQuery.columnA, defaultQuery.columnA) ...
This will choose the current account data if present, otherwise it will choose the default data.
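Spelled out with the two queries from the question as inline views (a sketch; cur and def are just illustrative aliases, and each column you need gets its own COALESCE pair):
SELECT def.account_id,
       COALESCE(cur.commit_date, def.commit_date) AS commit_date
       -- ... one COALESCE(cur.col, def.col) per remaining column
FROM (
      -- the question's second query: one row per account, regardless of date
      SELECT actr.*
      FROM Account_Information actr
      WHERE actr.account_id IN (30000316, 30000350, 30000351)
        AND actr.commit_date = (SELECT MAX(i.commit_date)
                                FROM Account_Information i
                                WHERE i.account_id = actr.account_id)
     ) def
LEFT OUTER JOIN (
      -- the question's first query: the row current for the given date
      SELECT actr.*
      FROM Account_Information actr
      WHERE actr.account_id IN (30000316, 30000350, 30000351)
        AND actr.commit_date = (SELECT MAX(i.commit_date)
                                FROM Account_Information i
                                WHERE i.account_id = actr.account_id
                                  AND i.commit_date <= to_date('2010-DEC-30','YYYY-MON-DD'))
     ) cur
  ON cur.account_id = def.account_id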
You can do this more directly using analytic functions:
select *
from (SELECT actr.*, max(commit_date) over (partition by account_id) as maxCommitDate,
max(case when commit_date <= to_date( '2010-DEC-30','YYYY-MON-DD ') then commit_date end) over
(partition by account_id) as MaxCommitDate2
FROM Account_Information actr
WHERE actr.account_id in (30000316, 30000350, 30000351)
) t
where (MaxCommitDate2 is not null and Commit_date = MaxCommitDate2) or
(MaxCommitDate2 is null and Commit_Date = MaxCommitDate)
The subquery calculates two values, the two possibilities of commit dates. The where clause then chooses the appropriate row, using the logic that you want.
I've combined the other answers. Tried it out at apex.oracle.com. Here's some explanation.
MAX(CASE WHEN commit_date <= to_date('2010-DEC-30', 'YYYY-MON-DD') THEN commit_date END) will give us the latest date on or before Dec 30th, or NULL if there isn't one. Combining that with a COALESCE, we get
COALESCE(MAX(CASE WHEN commit_date <= to_date('2010-DEC-30', 'YYYY-MON-DD') THEN commit_date END), MAX(commit_date)).
Now we take the account id and commit date we have and join them with the original table to get all the other fields. Here's the whole query that I came up with:
SELECT *
FROM Account_Information
JOIN (SELECT account_id,
COALESCE(MAX(CASE WHEN commit_date <=
to_date('2010-DEC-30', 'YYYY-MON-DD')
THEN commit_date END),
MAX(commit_date)) AS commit_date
FROM Account_Information
WHERE account_id in (30000316, 30000350, 30000351)
GROUP BY account_id)
USING (account_id, commit_date);
Note that if you do use USING, you have to use * instead of actr.*.