SQL Pivot in BigQuery - sql

I have a SQL table with information about email campaigns that my company has created. Each line of the table is an action that a user has taken on a specific campaign:
User ID  Campaign Name  Status
01       Campaign#1     opened
01       Campaign#1     clicked
01       Campaign#2     opened
02       Campaign#1     opened
02       Campaign#2     opened
I want to pivot this in SQL so that it shows the number of unique users who have opened and clicked on each campaign:
Campaign Name  Opened  Clicked
Campaign#1     2149    122
Campaign#2     4223    141
I've been trying to work with:
SELECT user_id, campaign_name, status from table
PIVOT(
COUNT (DISTINCT user_id)
FOR status IN
( [opened],
[clicked]
) ) AS PivotTable
But then I am getting:
Unrecognized name: opened at [5:6]

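BigQuery's PIVOT expects the IN list to contain constant values such as string literals ('opened'). The square-bracket form [opened] is SQL Server syntax; BigQuery reads it as a reference to a column named opened, hence the "Unrecognized name" error. Consider the example below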
select * from your_table
pivot (count(distinct UserID) for Status in ('opened', 'clicked'))
if applied to sample data in your question
with your_table as (
select '01' UserID, 'Campaign#1' CampaignName, 'opened' Status union all
select '01', 'Campaign#1', 'clicked' union all
select '01', 'Campaign#2', 'opened' union all
select '02', 'Campaign#1', 'opened' union all
select '02', 'Campaign#2', 'opened'
)
the output is
CampaignName  opened  clicked
Campaign#1    2       1
Campaign#2    2       0
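The pivot above can be cross-checked without BigQuery. The sketch below is only an illustration: it uses Python's built-in sqlite3 module, which has no PIVOT operator, so it swaps in conditional aggregation (one COUNT(DISTINCT CASE ...) per status). The table and column names are taken from the answer's sample data.

```python
import sqlite3

# Sample rows from the question: (UserID, CampaignName, Status).
rows = [
    ("01", "Campaign#1", "opened"),
    ("01", "Campaign#1", "clicked"),
    ("01", "Campaign#2", "opened"),
    ("02", "Campaign#1", "opened"),
    ("02", "Campaign#2", "opened"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (UserID TEXT, CampaignName TEXT, Status TEXT)"
)
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", rows)

# SQLite has no PIVOT, so each pivoted value gets its own aggregate.
# CASE yields NULL for non-matching statuses, and COUNT ignores NULLs.
pivot_sql = """
SELECT CampaignName,
       COUNT(DISTINCT CASE WHEN Status = 'opened'  THEN UserID END) AS opened,
       COUNT(DISTINCT CASE WHEN Status = 'clicked' THEN UserID END) AS clicked
FROM your_table
GROUP BY CampaignName
ORDER BY CampaignName
"""
result = conn.execute(pivot_sql).fetchall()
print(result)
```

Conditional aggregation is also a reasonable fallback on engines that lack PIVOT, at the cost of writing one expression per status by hand.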

Related

Window Function - date ranges

I'm trying to calculate the duration between different statuses, which works for the most part.
I have this table
Table
for id = 102, I was able to calculate duration of each status.
with ab as (
select id,
status,
max(updated_time) as end_time,
min(updated_time) as updated_time
from Table
group by id, status
)
select *,
lead(updated_time) over (partition by id order by updated_time) - updated_time as duration,
extract(epoch from duration) as duration_seconds
from ab
Output for id = 102
But for id = 101, the status moved from 'IN_PROGRESS' to 'BLOCKED' and back to 'IN_PROGRESS'.
Here I need the result below so that I can get the correct IN_PROGRESS duration:
Expected
One way to do this would be to track every time there is a change of STATUS for a given ID sorted by VERSION. The below query provides the desired output. More than brevity, I thought having multiple steps showing the transformations would be helpful. The column UNIX timestamp can be easily converted to human readable DateTimestamp format based on the specific database being used. The sample table definition and file used has also been shared below.
Query
WITH VW_STATUS_CHANGE AS
(
SELECT ID, STATUS, LAG(STATUS) OVER (PARTITION BY ID ORDER BY VERSION) LAG_STATUS, VERSION, UNIXTIME,
CASE WHEN LAG (STATUS) OVER (PARTITION BY ID ORDER BY VERSION) <> STATUS THEN 1 ELSE 0 END STATUS_CHANGE
FROM STACKOVERFLOWSQL
),
VW_CREATE_SYNTHETIC_PARTITION AS
(
SELECT ID, STATUS, LAG_STATUS, VERSION, UNIXTIME,STATUS_CHANGE,
SUM(STATUS_CHANGE) OVER (ORDER BY ID, VERSION) AS ROWNUMBER
FROM VW_STATUS_CHANGE
) ,
VW_RESULTS_INTERMEDIATE AS
(
SELECT ID, STATUS, LAG_STATUS, VERSION, UNIXTIME, STATUS_CHANGE,
"FIRST_VALUE"(UNIXTIME) OVER (
PARTITION BY "ID",
"STATUS", ROWNUMBER
ORDER BY
"VERSION"
) "TIME_FIRST_VALUE",
"FIRST_VALUE"(UNIXTIME) OVER (
PARTITION BY "ID",
"STATUS", ROWNUMBER
ORDER BY
"VERSION" DESC
) "TIME_LAST_VALUE"
FROM VW_CREATE_SYNTHETIC_PARTITION
ORDER BY ID, VERSION
)
SELECT DISTINCT ID, STATUS, TIME_FIRST_VALUE, TIME_LAST_VALUE
FROM VW_RESULTS_INTERMEDIATE
ORDER BY TIME_FIRST_VALUE
AWS Athena Table Used along with Sample data.
CREATE EXTERNAL TABLE STACKOVERFLOWSQL (
ID INTEGER,
STATUS STRING,
VERSION INTEGER,
UNIXTIME INTEGER
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
"skip.header.line.count"="1"
)
STORED AS TEXTFILE
LOCATION 's3://<S3BUCKETNAME>/';
Dataset Used:
ID,STATUS,VERSION,UNIXTIME
101,NOT_ASSIGNED,1,1668124141
101,IN_PROGRESS,2,1668124143
101,IN_PROGRESS,3,1668124146
101,IN_PROGRESS,4,1668124150
101,IN_PROGRESS,5,1668124155
101,BLOCKED,6,1668124161
101,BLOCKED,7,1668124168
101,IN_PROGRESS,8,1668124176
101,IN_PROGRESS,9,1668124185
101,IN_PROGRESS,10,1668124195
101,COMPLETED,11,1668124206
105,NOT_ASSIGNED,1,1668124207
105,IN_PROGRESS,2,1668124209
105,IN_PROGRESS,3,1668124212
105,IN_PROGRESS,4,1668124216
105,IN_PROGRESS,5,1668124221
105,IN_PROGRESS,6,1668124227
105,COMPLETED,7,1668124234
Result from the View
ID STATUS TIME_FIRST_VALUE TIME_LAST_VALUE
101 NOT_ASSIGNED 1668124141 1668124141
101 IN_PROGRESS 1668124143 1668124155
101 BLOCKED 1668124161 1668124168
101 IN_PROGRESS 1668124176 1668124195
101 COMPLETED 1668124206 1668124206
105 NOT_ASSIGNED 1668124207 1668124207
105 IN_PROGRESS 1668124209 1668124227
105 COMPLETED 1668124234 1668124234
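The change-tracking idea above (flag each status change, then use a running sum of the flags as a group number) can be tried locally. This is a minimal sketch, not the Athena query itself: it runs on Python's built-in sqlite3 (window functions need SQLite 3.25 or later), the table name status_log is made up for the example, and it swaps the FIRST_VALUE plus DISTINCT step for a plain GROUP BY with MIN and MAX. That swap assumes UNIXTIME never decreases as VERSION increases, which holds for the sample data.

```python
import sqlite3

# Dataset from the answer: (ID, STATUS, VERSION, UNIXTIME).
data = [
    (101, "NOT_ASSIGNED", 1, 1668124141),
    (101, "IN_PROGRESS", 2, 1668124143),
    (101, "IN_PROGRESS", 3, 1668124146),
    (101, "IN_PROGRESS", 4, 1668124150),
    (101, "IN_PROGRESS", 5, 1668124155),
    (101, "BLOCKED", 6, 1668124161),
    (101, "BLOCKED", 7, 1668124168),
    (101, "IN_PROGRESS", 8, 1668124176),
    (101, "IN_PROGRESS", 9, 1668124185),
    (101, "IN_PROGRESS", 10, 1668124195),
    (101, "COMPLETED", 11, 1668124206),
    (105, "NOT_ASSIGNED", 1, 1668124207),
    (105, "IN_PROGRESS", 2, 1668124209),
    (105, "IN_PROGRESS", 3, 1668124212),
    (105, "IN_PROGRESS", 4, 1668124216),
    (105, "IN_PROGRESS", 5, 1668124221),
    (105, "IN_PROGRESS", 6, 1668124227),
    (105, "COMPLETED", 7, 1668124234),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE status_log (id INTEGER, status TEXT, version INTEGER, unixtime INTEGER)"
)
conn.executemany("INSERT INTO status_log VALUES (?, ?, ?, ?)", data)

islands_sql = """
WITH status_change AS (
    -- 1 when the status differs from the previous version of the same id
    SELECT id, status, version, unixtime,
           CASE WHEN LAG(status) OVER (PARTITION BY id ORDER BY version) <> status
                THEN 1 ELSE 0 END AS changed
    FROM status_log
),
islands AS (
    -- running sum of change flags: rows in one unbroken run share a number
    SELECT id, status, unixtime,
           SUM(changed) OVER (PARTITION BY id ORDER BY version) AS island
    FROM status_change
)
SELECT id, status,
       MIN(unixtime) AS time_first_value,
       MAX(unixtime) AS time_last_value
FROM islands
GROUP BY id, island, status
ORDER BY time_first_value
"""
result = conn.execute(islands_sql).fetchall()
for row in result:
    print(row)
```

Because the second IN_PROGRESS run of id 101 gets its own island number, it comes out as a separate row instead of being merged with the first run.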

Transpose in Google BigQuery/Excel

I have a question about transposing data in BQ (or exporting it and doing the transpose in Excel). I am trying to get this result:
ClientID  Date  Variant1  Variant2
AB        12/2  123       456
My current query will give this output:
ClientID  Date  Variant
AB        12/1  123
AB        12/2  456
SELECT DISTINCT
case when (hits.ecommerceAction.action_type = '3') then date end date,
clientId AS client_id,
page.pagepath as pagepath,
product.productVariant as variant,
FROM
`xxxx.ga_sessions_`,
UNNEST(hits) AS hits, unnest(hits.product) as product
Is there any way to achieve the transpose step? My current output is more like master data: all the product-related information is in one column. I'd appreciate any thoughts!
Consider below approach
select * from (
select ClientID, Variant, Pagepath,
max(Date) over win Date,
row_number() over (win order by Date) pos
from your_current_output
window win as (partition by ClientID)
)
pivot (any_value(Variant) as Variant, any_value(Pagepath) as Pagepath for pos in (1,2,3))
if applied to the sample data in your question
with your_current_output as (
select '12/1' Date, 123 ClientID, 'abc' Variant, 'fis.com' Pagepath union all
select '12/2', 123, 'efg', 'fere.com'
)
output is
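For experimenting with the approach outside BigQuery, here is a rough sketch on Python's built-in sqlite3 (window functions need SQLite 3.25 or later). SQLite has no PIVOT, so MAX(CASE WHEN pos = n ...) stands in for any_value(...) per position; the alias LastDate and the Variant_1/Pagepath_1 style column names are choices made for this example. It also assumes the dates sort correctly as plain text, which is true for '12/1' and '12/2' but not in general.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_current_output "
    "(Date TEXT, ClientID INTEGER, Variant TEXT, Pagepath TEXT)"
)
# Sample rows from the answer.
conn.executemany(
    "INSERT INTO your_current_output VALUES (?, ?, ?, ?)",
    [("12/1", 123, "abc", "fis.com"), ("12/2", 123, "efg", "fere.com")],
)

transpose_sql = """
WITH numbered AS (
    -- number each client's rows by date and carry the latest date along
    SELECT ClientID, Variant, Pagepath,
           MAX(Date) OVER (PARTITION BY ClientID) AS LastDate,
           ROW_NUMBER() OVER (PARTITION BY ClientID ORDER BY Date) AS pos
    FROM your_current_output
)
SELECT ClientID, LastDate,
       MAX(CASE WHEN pos = 1 THEN Variant  END) AS Variant_1,
       MAX(CASE WHEN pos = 1 THEN Pagepath END) AS Pagepath_1,
       MAX(CASE WHEN pos = 2 THEN Variant  END) AS Variant_2,
       MAX(CASE WHEN pos = 2 THEN Pagepath END) AS Pagepath_2
FROM numbered
GROUP BY ClientID, LastDate
"""
result = conn.execute(transpose_sql).fetchall()
print(result)
```

Each extra position needs its own pair of CASE expressions, which mirrors having to list the positions explicitly in the IN clause of the PIVOT version.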

Pivoting columns from STRUCT type

I have a SQL table with information about email campaigns that my company has created. Each line of the table is an action that a user has taken on a specific campaign:
UserID  properties.CampaignName  properties.Status
01      Campaign#01              opened
01      Campaign#01              clicked
01      Campaign#02              opened
02      Campaign#01              opened
02      Campaign#02              opened
I want to pivot this in SQL so that it shows the number of unique users who have opened and clicked on each campaign:
CampaignName  Opened  Clicked
Campaign#01   2149    122
Campaign#02   4223    141
My initial thought was to use the PIVOT query:
select * from table
pivot (count(distinct UserID) for properties.Status in ('opened', 'clicked'))
But I didn't realize something: the CampaignName and Status columns are nested under the "properties" column - so basically I have properties.status, properties.country, properties.campaignname, etc. Therefore the error I am getting is:
Error running query Column raw_properties of type STRUCT cannot be used as an implicit grouping column of a PIVOT clause at [2:1]
Consider below
select * from (select UserID, properties.* from your_table)
pivot (count(distinct UserID) for Status in ('opened', 'clicked'))
if applied to sample data in your question
with your_table as (
select '01' as UserID, struct('Campaign#01' as CampaignName, 'opened' as Status) as properties union all
select '01', struct('Campaign#01', 'clicked') union all
select '01', struct('Campaign#02', 'opened') union all
select '02', struct('Campaign#01', 'opened') union all
select '02', struct('Campaign#02', 'opened')
)
output is
CampaignName  opened  clicked
Campaign#01   2       1
Campaign#02   2       0
So, technically this is exactly the same answer that I already provided you with - https://stackoverflow.com/a/70115346/5221944 - you just needed to flatten struct into separate columns first
You can still use your PIVOT clause; the issue here is the struct values. My approach was to unnest that data so that its fields sit at the same level as the other columns, and then reuse your query. Below are the steps I followed:
working table
create table `project.dataset.table`(
userid INT64,
properties STRUCT<CampaignName STRING,Status STRING>
);
dummy data
INSERT INTO `project.dataset.table`(userid,properties) VALUES(01,STRUCT("Campaign#01","opened"));
INSERT INTO `project.dataset.table`(userid,properties) VALUES(01,STRUCT("Campaign#01","clicked"));
INSERT INTO `project.dataset.table`(userid,properties) VALUES(01,STRUCT("Campaign#02","opened"));
INSERT INTO `project.dataset.table`(userid,properties) VALUES(02,STRUCT("Campaign#01","opened"));
INSERT INTO `project.dataset.table`(userid,properties) VALUES(02,STRUCT("Campaign#02","opened"));
query
with campaign_unnest as (
select o.userid,
prp.Status,prp.CampaignName
from `project.dataset.table` o, unnest([o.properties]) as prp
)
select * from campaign_unnest pivot (count(distinct userid) for Status in ('opened', 'clicked'))
output
CampaignName  Opened  Clicked
Campaign#01   2       1
Campaign#02   2       0
Let me know if this answer fits what you are trying to achieve. If not, please let me know so I can update my answer.
For reference, I used the Google BigQuery documentation on arrays and structs, UNNEST, and WITH.
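Both answers follow the same two steps: flatten the struct, then pivot. As a language-neutral illustration only, the sketch below does this in plain Python, with nested dictionaries standing in for the STRUCT column. The variable names are made up for the example, and the data is the sample from the question.

```python
from collections import defaultdict

# Sample rows; the nested dict plays the role of the STRUCT column.
rows = [
    {"UserID": "01", "properties": {"CampaignName": "Campaign#01", "Status": "opened"}},
    {"UserID": "01", "properties": {"CampaignName": "Campaign#01", "Status": "clicked"}},
    {"UserID": "01", "properties": {"CampaignName": "Campaign#02", "Status": "opened"}},
    {"UserID": "02", "properties": {"CampaignName": "Campaign#01", "Status": "opened"}},
    {"UserID": "02", "properties": {"CampaignName": "Campaign#02", "Status": "opened"}},
]

# Step 1: flatten, like `select UserID, properties.*` in the first answer.
flat = [{"UserID": r["UserID"], **r["properties"]} for r in rows]

# Step 2: pivot, collecting distinct users per campaign and status.
statuses = ("opened", "clicked")
users = defaultdict(lambda: {s: set() for s in statuses})
for r in flat:
    by_status = users[r["CampaignName"]]  # the campaign appears even with no match
    if r["Status"] in by_status:
        by_status[r["Status"]].add(r["UserID"])

pivot = {
    campaign: {s: len(ids) for s, ids in by_status.items()}
    for campaign, by_status in sorted(users.items())
}
print(pivot)
```

Once the fields are ordinary top-level keys, the grouping step no longer has to deal with the nested value, which is the same reason the flattened subquery avoids the STRUCT grouping error.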

Get ID from rows with repeated values on two colums to update them later

I have the following table:
ID, campaign, merchant,date.
I need to know what rows are repeated, which means, which have the same campaign with the same merchant, example:
ID, campaign, merchant, date
1, "hello", 260, 01/01/21
2, "hello", 260, 01/01/21
I can do this with this:
select campaign, merchant, count(*)
from public.my_table
group by campaign, merchant
HAVING count(*) > 1
That's OK, but I haven't found a way to rename these repeated campaigns to "hello day/month/year hour:minute:seconds" using this last query.
So what I want is to change the repeated campaigns (for the same merchant) like this:
ID, campaign, merchant, date
1, "hello 01/01/2021 02:22:22", 260
2, "hello 01/01/2021 02:22:32", 260
It's been hours but no luck yet, thank you!
I think I found a solution minutes after writing the post:
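UPDATE public.my_table
-- Add a space and format the value as in the question.
-- Assumes "date" is a date or timestamp column.
SET campaign = campaign || ' ' || to_char(date, 'DD/MM/YYYY HH24:MI:SS')
WHERE id IN (
    SELECT id
    FROM (
        -- count the rows that share the same campaign and merchant
        SELECT id, count(*) OVER (PARTITION BY campaign, merchant) AS cnt
        FROM public.my_table
    ) table_with_count
    WHERE table_with_count.cnt > 1
);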
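The update can be tried on a throwaway database first. The sketch below is an assumption-laden stand-in rather than the PostgreSQL statement itself: it uses Python's built-in sqlite3 (window functions need SQLite 3.25 or later), stores the date as text, and invents a third row ("bye") plus the two timestamps so that a non-duplicate is also covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table "
    "(id INTEGER PRIMARY KEY, campaign TEXT, merchant INTEGER, date TEXT)"
)
# Rows 1 and 2 repeat (campaign, merchant); row 3 is a hypothetical non-duplicate.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        (1, "hello", 260, "2021-01-01 02:22:22"),
        (2, "hello", 260, "2021-01-01 02:22:32"),
        (3, "bye", 260, "2021-01-01 03:00:00"),
    ],
)

# Rename only the rows whose (campaign, merchant) pair occurs more than once.
conn.execute(
    """
    UPDATE my_table
    SET campaign = campaign || ' ' || date
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id, COUNT(*) OVER (PARTITION BY campaign, merchant) AS cnt
            FROM my_table
        )
        WHERE cnt > 1
    )
    """
)
rows = conn.execute("SELECT id, campaign FROM my_table ORDER BY id").fetchall()
print(rows)
```

Note that two duplicates with an identical date value would still be identical after the rename, so this only separates rows whose timestamps differ.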

How to make a row appear several times depending on a value in a column?

I'm creating a dataset for users eligible to win a raffle. All registered users are eligible, however premium users get 2 tickets to enter instead of 1. If I have a table like below:
user_id type
16234 premium
19273 regular
13846 regular
22343 regular
28820 premium
How do I get it to print:
user_id
16234
16234
19273
13846
22343
28820
28820
Here is a "BigQuery"ish way of expressing the logic:
SELECT u_id
FROM (SELECT 16234 as user_id, 'premium' as type UNION ALL
SELECT 19273, 'regular'
) t JOIN
UNNEST(ARRAY[t.user_id, t.user_id]) u_id with offset n
ON n = 1 or type = 'premium';
Or like this:
SELECT t.user_id
FROM (SELECT 16234 as user_id, 'premium' as type UNION ALL
SELECT 19273, 'regular'
) t CROSS JOIN
UNNEST(GENERATE_ARRAY(1, (CASE WHEN type = 'premium' THEN 2 ELSE 1 END))) n;
The advantage of this approach over something like UNION ALL is that it generalizes quite easily. For instance, if premium users got 20 tickets but regulars only got 5, this would be simpler to implement.
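To make the generalization point concrete, here is a small sketch that keeps the ticket count per user type in a lookup table and joins against a generated number series. It is an illustration under assumptions, not BigQuery code: it runs on Python's built-in sqlite3, uses a recursive CTE in place of GENERATE_ARRAY, and the tickets table with its n_tickets column is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        (16234, "premium"),
        (19273, "regular"),
        (13846, "regular"),
        (22343, "regular"),
        (28820, "premium"),
    ],
)

# Hypothetical lookup table: changing the rules (say 20 and 5) only means
# editing these rows, not the query.
ticket_rules = [("premium", 2), ("regular", 1)]
conn.execute("CREATE TABLE tickets (type TEXT PRIMARY KEY, n_tickets INTEGER)")
conn.executemany("INSERT INTO tickets VALUES (?, ?)", ticket_rules)
max_tickets = max(n for _, n in ticket_rules)

raffle_sql = """
WITH RECURSIVE numbers(n) AS (
    -- generates 1..max_tickets, playing the role of GENERATE_ARRAY
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < ?
)
SELECT u.user_id
FROM users u
JOIN tickets t ON t.type = u.type
JOIN numbers ON numbers.n <= t.n_tickets
ORDER BY u.rowid, numbers.n
"""
entries = [row[0] for row in conn.execute(raffle_sql, (max_tickets,))]
print(entries)
```

Each user is repeated once per number up to their ticket count, so premium users appear twice and regular users once.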
You can select all users and then union the premium users:
(select user_id from my_table) union all
(select user_id from my_table where type='premium')