How to find combination of intersection from many tables? - sql

I have a list of different channels that could potentially bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count the daily active users who come from each combination of these channels. Each channel has its own table and tracks its respective users.
The tables look like the following:
channel A
date user_id
2020-08-01 A
2020-08-01 B
2020-08-01 C
channel B
date user_id
2020-08-01 C
2020-08-01 D
2020-08-01 G
channel C
date user_id
2020-08-01 A
2020-08-01 C
2020-08-01 F
I want to know the following combinations
Only visit channel A
Only visit channel A & B
Only visit channel B & C
Only visit channel B
etc.
However, when there are a lot of channels (I have around 8), the number of combinations becomes very large. What I've done so far is roughly as simple as this (this query covers the combinations that include channel A):
SELECT
a.date,
COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
...
FROM a LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1
but this is extremely tedious when there are 8 channels in total (28 variations for combinations of 2, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this? I was thinking of using FULL OUTER JOIN but can't quite get the hang of it. Answers are really appreciated.

I would approach this with union all and two levels of aggregation:
select date, channels, count(*) as num_users
from (select date, user_id, string_agg(channel order by channel) as channels
from ((select distinct date, user_id, 'a' as channel from a) union all
(select distinct date, user_id, 'b' as channel from b) union all
(select distinct date, user_id, 'c' as channel from c)
) abc
group by date, user_id
) c
group by date, channels;
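The two aggregation levels can be illustrated outside SQL as well. Below is a minimal Python sketch, assuming the sample rows from the question are held in memory; the names channel_rows and count_channel_combinations are made up for this illustration and are not part of the query above.

```python
from collections import Counter, defaultdict

# Sample rows from the question: channel name -> list of (date, user_id).
channel_rows = {
    "a": [("2020-08-01", "A"), ("2020-08-01", "B"), ("2020-08-01", "C")],
    "b": [("2020-08-01", "C"), ("2020-08-01", "D"), ("2020-08-01", "G")],
    "c": [("2020-08-01", "A"), ("2020-08-01", "C"), ("2020-08-01", "F")],
}


def count_channel_combinations(rows_by_channel):
    # First level: collect the set of channels per (date, user_id),
    # which plays the role of UNION ALL followed by string_agg.
    channels_per_user = defaultdict(set)
    for channel, rows in rows_by_channel.items():
        for date, user_id in rows:
            channels_per_user[(date, user_id)].add(channel)

    # Second level: count users per (date, channel combination).
    counts = Counter()
    for (date, _user_id), channels in channels_per_user.items():
        counts[(date, ",".join(sorted(channels)))] += 1
    return dict(counts)


combination_counts = count_channel_combinations(channel_rows)
for key, num_users in sorted(combination_counts.items()):
    print(key, num_users)
```

Only combinations that actually occur show up, so nothing has to be enumerated by hand.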

However, when there are a lot of channels (I have around 8 channels) the combination is a lot
extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this?
The query below is for BigQuery Standard SQL and addresses exactly this aspect of the OP's concern:
#standardSQL
CREATE TEMP FUNCTION generate_combinations(a ARRAY<INT64>)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
var combine = function(a) {
var fn = function(n, src, got, all) {
if (n == 0) {
if (got.length > 0) {
all[all.length] = got;
} return;
}
for (var j = 0; j < src.length; j++) {
fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
} return;
}
var all = []; for (var i = 1; i < a.length; i++) {
fn(i, a, [], all);
}
all.push(a);
return all;
}
return combine(a)
''';
with users as (
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
), visits as (
select date, user_id,
string_agg(channel, ' & ' order by channel) combination
from users
group by date, user_id
), channels AS (
select channel, cast(row_number() over(order by channel) as string) channel_num
from (select distinct channel from users)
), combinations as (
select string_agg(channel, ' & ' order by channel_num) combination
from unnest(generate_combinations(generate_array(1,(select count(1) from channels)))) AS items,
unnest(split(items)) AS channel_num
join channels using(channel_num)
group by items
)
select date,
combination as channels_visited_only,
count(distinct user_id) dau
from visits
join combinations using (combination)
group by date, combination
order by combination
Applied to the sample data from your question, this returns one row per date and visited channel combination, with its distinct-user count (dau).
Some explanations to help with using the above:
CTE users simply unions all tables and adds a channel column to distinguish which table each row came from
CTE visits builds the list of all visited channels for each user-date combination
CTE channels simply prepares the list of channels and assigns each a number for later use
CTE combinations uses the JS UDF to generate all combinations of the channels' numbers and then joins them back to channels to produce the channel combinations
and the final SELECT statement simply looks for those users whose list of visited channels matches a channel combination generated in the previous step
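The combination-generation step can be sketched in a few lines of Python, assuming the channel names are already known. This is an illustration of what the JS UDF produces, not a replacement for it, and all_channel_combinations is a hypothetical helper name.

```python
from itertools import combinations
from math import comb


def all_channel_combinations(channels):
    # Every non-empty subset of the channels, smallest subsets first,
    # rendered with the same ' & ' separator the query uses.
    ordered = sorted(channels)
    return [
        " & ".join(subset)
        for size in range(1, len(ordered) + 1)
        for subset in combinations(ordered, size)
    ]


labels = all_channel_combinations(["A", "B", "C"])
print(labels)

# The counts quoted in the question for 8 channels.
sizes = {k: comb(8, k) for k in (2, 3, 4)}
print(sizes, len(all_channel_combinations("ABCDEFGH")))
```

With 8 channels there are 255 non-empty combinations in total, which is why generating them beats writing each one out as a column.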
Some recommendations for further streamlining the above code:
assuming your channel table names follow the channel_* pattern,
you can use the wildcard tables feature in the users CTE, and instead of
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
you can use something like the line below, so just one line instead of as many lines as you have channels
select distinct date, user_id, _TABLE_SUFFIX as channel from channel_*

I think you could use set operators to answer your questions: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators
E.g.
"Only visit channel A" is (A except B) except C
"Visit both channel A and B" is A intersect B
etc.
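The same set logic, shown on the sample users for a single date with Python sets as a rough illustration. The variable names are made up; in SQL each expression would be a query over the channel tables.

```python
# Users per channel on 2020-08-01, taken from the sample data in the question.
a = {"A", "B", "C"}
b = {"C", "D", "G"}
c = {"A", "C", "F"}

only_a = (a - b) - c        # (A except B) except C
a_and_b = a & b             # A intersect B
only_a_and_b = (a & b) - c  # in both A and B, but not in C

print(only_a, a_and_b, only_a_and_b)
```

Each combination still needs its own expression, so this approach does not remove the enumeration problem for 8 channels.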

I am thinking full join and aggregation:
select date, a.channel_a, b.channel_b, c.channel_c, count(*) cnt
from (select 'a' channel_a, a.* from channel_a a) a
full join (select 'b' channel_b, b.* from channel_b b) b using (date, user_id)
full join (select 'c' channel_c, c.* from channel_c c) c using (date, user_id)
group by date, a.channel_a, b.channel_b, c.channel_c

Related

Trying to join multiple tables without all the pair of common columns hence the values are repeating from the last tables. Need help to solve this

I have the following query; the result is attached as an image link. The adjust results have to be joined, but adjust has no platform or date column, hence its records are repeated across rows. Is there a way to avoid this? The repetition will cause issues in campaign-level visualizations, where the repeated values get summed.
with
sent as (
select campaign_name, date(date) as date, platform, count(id) as sent
from send
group by 1,2,3
),
bounce as (
select campaign_name, platform, count(id) as bounce
from bounce
group by 1,2
),
open as (
select campaign_name, platform, count(id) as clicks
from open
group by 1,2
),
adjust as (
select campaign, sum(purchase_events) as transactions, count(distinct adjust_id) as sessions, sum(sessions) as s2, sum(clicks) as ad_clicks
from adjust
group by 1
)
select
s.campaign_name,
s.date,
s.platform,
s.sent,
(s.sent-b.bounce) as delivered,
b.bounce,
o.clicks,
a.ad_clicks,
a.sessions,
a.s2,
a.transactions
from sent s
join bounce b on s.campaign_name = b.campaign_name and s.platform = b.platform
join open o on s.campaign_name = o.campaign_name and s.platform = o.platform
left join adjust a on s.campaign_name = a.campaign
See the result here

Including count combinations with null value in SQL

I have one dataset, and am trying to list all of the combinations of said dataset. However, I am unable to figure out how to include the combinations that are null. For example, Longitudinal? can be no and cohort can be 11-20, however for Region 1, there were no patients of that age in that region. How can I show a 0 for the count?
Here is the code:
SELECT "s_safe_005prod"."ig_eligi_group1"."site_name" AS "Site Name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong" AS "Longitudinal?",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort" AS "Cohort",
count(*) AS "count"
FROM "s_safe_005prod"."ig_eligi_group1"
GROUP BY "s_safe_005prod"."ig_eligi_group1"."site_name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort"
ORDER BY "s_safe_005prod"."ig_eligi_group1"."site_name",
"s_safe_005prod"."ig_eligi_group1"."il_eligi_ellong" ASC,
"s_safe_005prod"."ig_eligi_group1"."il_eligi_elcohort" ASC
Create a cross join across the unique values from each of the three grouping fields to create a set of all possible combinations. Then left join that to the counts you have originally and coalesce null values to zero.
WITH groups AS
(
SELECT a.site_name, b.longitudinal, c.cohort
FROM (SELECT DISTINCT site_name FROM s_safe_005prod.ig_eligi_group1) a,
(SELECT DISTINCT il_eligi_ellong AS longitudinal FROM s_safe_005prod.ig_eligi_group1) b,
(SELECT DISTINCT il_eligi_elcohort AS cohort FROM s_safe_005prod.ig_eligi_group1) c
),
dat AS
(
SELECT site_name,
il_eligi_ellong AS longitudinal,
il_eligi_elcohort AS cohort,
count(*) AS "count"
FROM s_safe_005prod.ig_eligi_group1
GROUP BY site_name,
il_eligi_ellong,
il_eligi_elcohort
)
SELECT groups.site_name,
groups.longitudinal,
groups.cohort,
COALESCE(dat."count",0) AS "count"
FROM groups
LEFT JOIN dat ON groups.site_name = dat.site_name
AND groups.longitudinal = dat.longitudinal
AND groups.cohort = dat.cohort;
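The idea of cross joining the distinct values and then filling the gaps with zero can be sketched in Python as follows. The rows are hypothetical stand-ins for ig_eligi_group1 (site name, longitudinal flag, cohort), and counts_with_zeros is an illustrative name.

```python
from collections import Counter
from itertools import product

# Hypothetical rows: (site_name, longitudinal, cohort).
rows = [
    ("Region 1", "no", "0-10"),
    ("Region 1", "yes", "0-10"),
    ("Region 2", "no", "11-20"),
    ("Region 2", "no", "11-20"),
]


def counts_with_zeros(rows):
    observed = Counter(rows)  # the plain GROUP BY counts
    sites = sorted({r[0] for r in rows})
    longitudinals = sorted({r[1] for r in rows})
    cohorts = sorted({r[2] for r in rows})
    # Cross join of the distinct values, then "left join" to the counts,
    # defaulting to 0 where no row was observed (the COALESCE step).
    return {
        combo: observed.get(combo, 0)
        for combo in product(sites, longitudinals, cohorts)
    }


result = counts_with_zeros(rows)
for combo, count in result.items():
    print(combo, count)
```

Note that a value that never appears anywhere in the table (for example a cohort with no patients at any site) still cannot show up this way; it would need a separate lookup list.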

Recursive subtraction from two separate tables to fill in historical data

I have two datasets hosted in Snowflake with social media follower counts by day. The main table we will be using going forward (follower_counts) shows follower counts by day:
This table is live as of 4/4/2020 and will be updated daily. Unfortunately, I am unable to get historical data in this format. Instead, I have a table with historical data (follower_gains) that shows net follower gains by day for several accounts:
Ideally - I want to take the follower_count value from the minimum date in the current table (follower_counts) and subtract the sum of gains (organic + paid gains) for each day, until the minimum date of the follower_gains table, to fill in the follower_count historically. In addition, there are several accounts with data in these tables, so it would need to be grouped by account. It should look like this:
I've only gotten as far as unioning these two tables together, but don't even know where to start with looping through these rows:
WITH a AS (
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
total_followers_count,
null AS organic_follower_gain,
null AS paid_follower_gain,
account_name,
last_update
FROM follower_counts
UNION ALL
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
null AS total_followers_count,
organic_follower_gain,
paid_follower_gain,
account_name,
last_update
FROM follower_gains)
SELECT
a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.total_followers_count,
a.organic_follower_gain,
a.paid_follower_gain,
a.account_name,
a.last_update
FROM a
ORDER BY date desc LIMIT 100
UPDATE: Changed union to union all and added not exists to remove duplicates. Made changes per the comments.
NOTE: Please make sure you don't post images of the tables. It's difficult to recreate your scenario to write a correct query. Test this solution and update so that I can make modifications if necessary.
You don't loop through rows in SQL because it's not a procedural language. The operation you define in the query is performed for all the rows in a table.
with cte as (SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
(a.follower_count - (b.organic_gain+b.paid_gain)) AS follower_count,
a.account_name,
a.last_update,
b.organic_gain,
b.paid_gain
FROM follower_counts a
JOIN follower_gains b ON a.account_id = b.account_id
AND b.date < (select min(date) from
follower_counts c where a.account_id = c.account_id)
)
SELECT b.account_id,
b.date,
b.organizational_entity,
b.organizational_entity_type,
b.vanity_name,
b.localized_name,
b.localized_website,
b.organization_type,
b.follower_count,
b.account_name,
b.last_update,
b.organic_gain,
b.paid_gain
FROM cte b
UNION ALL
SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.follower_count,
a.account_name,
a.last_update,
NULL as organic_gain,
NULL as paid_gain
FROM follower_counts a where not exists (select 1 from
follower_gains c where a.account_id = c.account_id AND a.date = c.date)
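The backfill described in the question amounts to a running subtraction going backwards in time. Below is a minimal Python sketch for a single account, under a loudly stated assumption: the count for a historical day is the count of the next later day minus that historical day's total gain (organic + paid). If the gains should be shifted by one day, the subtraction has to shift accordingly. The function name and the numbers are hypothetical.

```python
from datetime import date


def backfill_follower_counts(anchor_date, anchor_count, gains_by_date):
    """Reconstruct historical counts for one account.

    anchor_date / anchor_count: earliest row of follower_counts.
    gains_by_date: date -> organic + paid net gain from follower_gains.
    Assumption: count(day) = count(next later day) - gain(day).
    """
    history = {}
    running = anchor_count
    # Walk backwards from the day just before the anchor date.
    earlier_days = sorted(
        (d for d in gains_by_date if d < anchor_date), reverse=True
    )
    for day in earlier_days:
        running -= gains_by_date[day]
        history[day] = running
    return history


gains = {date(2020, 4, 1): 5, date(2020, 4, 2): 3, date(2020, 4, 3): 2}
history = backfill_follower_counts(date(2020, 4, 4), 100, gains)
for day in sorted(history):
    print(day, history[day])
```

For several accounts the function would be called once per account_id. In SQL the equivalent is typically a window sum over the gains, partitioned by account and ordered by date descending.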
You could do something like this. Instead of using the variable, you can also wrap the whole expression in another pair of brackets and write ) AS FollowerGrowth at the end.
DECLARE @FollowerGrowth INT =
( SELECT total_followers_count
FROM follower_gains
WHERE AccountID = xx )
-
( SELECT TOP 1 follower_count
FROM follower_counts
WHERE AccountID = xx
ORDER BY date ASC )

How to make a row appear several times depending on a value in a column?

I'm creating a dataset for users eligible to win a raffle. All registered users are eligible, however premium users get 2 tickets to enter instead of 1. If I have a table like below:
user_id type
16234 premium
19273 regular
13846 regular
22343 regular
28820 premium
How do I get it to print:
user_id
16234
16234
19273
13846
22343
28820
28820
Here is a "BigQuery"ish way of expressing the logic:
SELECT u_id
FROM (SELECT 16234 as user_id, 'premium' as type UNION ALL
SELECT 19273, 'regular'
) t JOIN
UNNEST(ARRAY[t.user_id, t.user_id]) u_id with offset n
ON n = 1 or type = 'premium';
Or like this:
SELECT t.user_id
FROM (SELECT 16234 as user_id, 'premium' as type UNION ALL
SELECT 19273, 'regular'
) t CROSS JOIN
UNNEST(GENERATE_ARRAY(1, (CASE WHEN type = 'premium' THEN 2 ELSE 1 END))) n;
The advantage of this approach over something like UNION ALL is that it generalizes quite easily. For instance, if premium users got 20 tickets but regulars only got 5, this would be simpler to implement.
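To show how the generalization works, here is a small Python sketch of the same row-multiplication idea, with the ticket counts kept in a lookup table. tickets_per_type and expand_entries are made-up names for this example.

```python
# Tickets per user type; changing these numbers is the only edit needed
# to move from 2/1 tickets to, say, 20/5.
tickets_per_type = {"premium": 2, "regular": 1}

users = [
    (16234, "premium"),
    (19273, "regular"),
    (13846, "regular"),
    (22343, "regular"),
    (28820, "premium"),
]


def expand_entries(users, tickets_per_type):
    # Repeat each user_id once per ticket, like CROSS JOIN UNNEST(GENERATE_ARRAY(1, n)).
    return [
        user_id
        for user_id, user_type in users
        for _ in range(tickets_per_type[user_type])
    ]


entries = expand_entries(users, tickets_per_type)
print(entries)
```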
You can select all users and then union the premium users a second time:
(select user_id from my_table) union all
(select user_id from my_table where type='premium')

Compare and find differences in two tables in Oracle

I have 2 tables:
account: ID, ACC, AE_CCY, DRCR_IND, AMOUNT, MODULE
flex: ID, ACC, AE_CCY, DRCR_IND, AMOUNT, MODULE
I want to show differences comparing only by: AE_CCY, DRCR_IND, AMOUNT, MODULE and ACC by first 4 characters
Example:
ID ACC AE_CCY DRCR_IND AMOUNT MODULE
-- --------- ------ -------- ------ ------
1 734647674 USD D 100 OP
and in flex:
ID ACC AE_CCY DRCR_IND AMOUNT MODULE
-- --------- ------ -------- ------ ------
1 734647654 USD D 100 OP
2 734665474 USD D 100 OP
9 734611111 USD D 100 OP
ID's 2 and 9 should be shown as differences.
If I use FULL JOIN I get no differences, because substr(account.ACC,1,4) = substr(flex.ACC,1,4) holds and the other columns are equal too, and MINUS doesn't work because the IDs are different.
Do you mean you want to group by the first 4 characters of ACC, then diff them?
And, if not, why is Flex:ID=1 NOT a difference to account:ID=1, if ID=2 and ID=9 are, especially since it reads that ID is not a comparison field?
A brute-force set theory answer:
SELECT * FROM ACCOUNT
UNION
SELECT * FROM FLEX
MINUS
(SELECT * FROM ACCOUNT
INTERSECT
SELECT * FROM FLEX)
I think what you want is the full join with an additional condition. Something like:
select F.ID, F.AE_CCY, F.DRCR_IND, F.AMOUNT, F.MODULE, F.ACC
from account a join flex f
on substr(a.ACC,1,4) = substr(f.ACC,1,4)
where a.AE_CCY <> f.AE_CCY
or a.DRCR_IND <> f.DRCR_IND
or a.AMOUNT <> f.AMOUNT
or a.MODULE <> f.MODULE
or a.ACC <> f.ACC
This way, the join is still performed on the first 4 characters, but the where condition checks the entire field (as well as the other four).
Revised solution: This is something of a stab in the dark, but I'm wondering if what you're really looking for is a list of records that don't have a match in the other table. In that case, a full outer join might be the answer:
select coalesce(F.ID,a.ID) as ID,
coalesce(F.AE_CCY,a.AE_CCY) as AE_CCY,
coalesce(F.DRCR_IND,a.DRCR_IND) as DRCR_IND,
coalesce(F.AMOUNT,a.AMOUNT) as AMOUNT,
coalesce(F.MODULE,a.MODULE) as MODULE,
coalesce(F.ACC,a.ACC) as ACC
from account a full outer join flex f
on substr(a.ACC,1,4) = substr(f.ACC,1,4)
and a.AE_CCY = f.AE_CCY
and a.DRCR_IND = f.DRCR_IND
and a.AMOUNT = f.AMOUNT
and a.MODULE = f.MODULE
where a.id is null
or f.id is null
Third attempted solution: Thinking about it further, I think you're saying that you want each record from the first table to match exactly one record in the second table (and vice versa). That's a difficult problem because relational databases aren't really designed to work that way.
The solution below uses the full outer join again, to get only rows that don't appear in the other table. This time, we're adding ROW_NUMBER to assign a unique number to each member of a set of duplicate values found in either table. In the example from your comment, with 5 identical rows in one table and 1 of the same row in another, the first table will be numbered 1-5 and the second will be 1. Therefore, by adding that as a join condition, we assure that each row has only one match. The one flaw in this design is that a perfect match on ACC is not guaranteed to take precedence over another value. Making that work would be quite a bit more difficult.
select coalesce(F.ID,a.ID) as ID,
coalesce(F.AE_CCY,a.AE_CCY) as AE_CCY,
coalesce(F.DRCR_IND,a.DRCR_IND) as DRCR_IND,
coalesce(F.AMOUNT,a.AMOUNT) as AMOUNT,
coalesce(F.MODULE,a.MODULE) as MODULE,
coalesce(F.ACC,a.ACC) as ACC
from (select a.*,
row_number()
over (partition by AE_CCY,DRCR_IND,AMOUNT,MODULE,substr(ACC,1,4)
order by acc) as rn
from account a) a
full outer join
(select f.*,
row_number()
over (partition by AE_CCY,DRCR_IND,AMOUNT,MODULE,substr(ACC,1,4)
order by acc) as rn
from flex f) f
on substr(a.ACC,1,4) = substr(f.ACC,1,4)
and a.AE_CCY = f.AE_CCY
and a.DRCR_IND = f.DRCR_IND
and a.AMOUNT = f.AMOUNT
and a.MODULE = f.MODULE
and a.RN = f.RN
where a.id is null
or f.id is null
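The row-number pairing can be traced in plain Python on the sample rows from the question. This is only a sketch of the matching logic; the helper names compare_key, number_rows and unmatched are invented for the illustration.

```python
from collections import defaultdict


def compare_key(row):
    # The comparison columns: first 4 characters of ACC plus the other four fields.
    return (row["ACC"][:4], row["AE_CCY"], row["DRCR_IND"], row["AMOUNT"], row["MODULE"])


def number_rows(rows):
    # Equivalent of ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY acc).
    groups = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["ACC"]):
        groups[compare_key(row)].append(row)
    return {
        (key, rn): row
        for key, group in groups.items()
        for rn, row in enumerate(group, start=1)
    }


def unmatched(account_rows, flex_rows):
    a, f = number_rows(account_rows), number_rows(flex_rows)
    # Full outer join on (key, rn), keeping only rows without a partner.
    only_account = [row for k, row in a.items() if k not in f]
    only_flex = [row for k, row in f.items() if k not in a]
    return only_account, only_flex


def make(id_, acc):
    return {"ID": id_, "ACC": acc, "AE_CCY": "USD", "DRCR_IND": "D", "AMOUNT": 100, "MODULE": "OP"}


account = [make(1, "734647674")]
flex = [make(1, "734647654"), make(2, "734665474"), make(9, "734611111")]
only_account, only_flex = unmatched(account, flex)
print([r["ID"] for r in only_account], [r["ID"] for r in only_flex])
```

On this data the single account row pairs with the flex row that sorts first by ACC (ID 9), so IDs 1 and 2 are reported rather than the 2 and 9 the question expects. That is the flaw mentioned above: the count of differences is right, but which row of a duplicate group is reported depends on the ordering.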
I like to use:
SELECT min(which) which, id, ae_ccy, drcr_ind, amount, module, acc
FROM (SELECT DISTINCT 'account' which, id, ae_ccy, drcr_ind, amount, module,
substr(acc, 1, 4) acc
FROM ACCOUNT
UNION ALL
SELECT DISTINCT 'flex' which, id, ae_ccy, drcr_ind, amount, module,
substr(acc, 1, 4) acc
FROM flex)
GROUP BY id, ae_ccy, drcr_ind, amount, module, acc
HAVING COUNT(*) != 2
ORDER BY id, 1
It will show both the new rows, the old missing rows and any difference.