Pivoting columns from STRUCT type - google-bigquery

I have a SQL table with information about email campaigns that my company has created. Each line of the table is an action that a user has taken on a specific campaign:
UserID  properties.CampaignName  properties.Status
01      Campaign#01              opened
01      Campaign#01              clicked
01      Campaign#02              opened
02      Campaign#01              opened
02      Campaign#02              opened
I wanted to Pivot this on SQL, in a way that would render the unique number of people who have opened and clicked on each campaign:
CampaignName  Opened  Clicked
Campaign#01   2149    122
Campaign#02   4223    141
My initial thought was to use the PIVOT query:
select * from table
pivot (count(distinct UserID) for properties.Status in ('opened', 'clicked'))
But I didn't realize something: the CampaignName and Status columns are nested under the "properties" column - so basically I have properties.status, properties.country, properties.campaignname, etc. Therefore the error I am getting is:
Error running query Column raw_properties of type STRUCT cannot be used as an implicit grouping column of a PIVOT clause at [2:1]

Consider below
select * from (select UserID, properties.* from your_table)
pivot (count(distinct UserID) for Status in ('opened', 'clicked'))
if applied to sample data in your question
with your_table as (
select '01' as UserID, struct('Campaign#01' as CampaignName, 'opened' as Status) as properties union all
select '01', struct('Campaign#01', 'clicked') union all
select '01', struct('Campaign#02', 'opened') union all
select '02', struct('Campaign#01', 'opened') union all
select '02', struct('Campaign#02', 'opened')
)
output is
CampaignName  opened  clicked
Campaign#01   2       1
Campaign#02   2       0
So, technically this is exactly the same answer that I already provided you with - https://stackoverflow.com/a/70115346/5221944 - you just needed to flatten struct into separate columns first

You can still use your PIVOT clause; the issue here is the struct values. My approach was to unnest that data, bring it to the same level as the other columns, and then reuse your query. Below are the steps I followed:
working table
create table `project.dataset.table`(
userid INT64,
properties STRUCT<CampaignName STRING,Status STRING>
);
dummy data
INSERT INTO `project.dataset.table`(userid,properties) VALUES(01,STRUCT("Campaign#01","opened"));
INSERT INTO `project.dataset.table`(userid,properties) VALUES(01,STRUCT("Campaign#01","clicked"));
INSERT INTO `project.dataset.table`(userid,properties) VALUES(01,STRUCT("Campaign#02","opened"));
INSERT INTO `project.dataset.table`(userid,properties) VALUES(02,STRUCT("Campaign#01","opened"));
INSERT INTO `project.dataset.table`(userid,properties) VALUES(02,STRUCT("Campaign#02","opened"));
query
with campaign_unnest as (
select o.userid,
prp.Status,prp.CampaignName
from `project.dataset.table` o, unnest([o.properties]) as prp
)
select * from campaign_unnest pivot (count(distinct userid) for Status in ('opened', 'clicked'))
output
CampaignName  Opened  Clicked
Campaign#01   2       1
Campaign#02   2       0
Let me know if this answer fits what you are trying to achieve. If not, please let me know so I can update my answer.
For reference, I used the Google BigQuery documentation on arrays and structs, UNNEST, and WITH.
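To see the flatten-then-pivot logic outside BigQuery, here is a minimal Python sketch. It is not BigQuery code: the nested dict stands in for the STRUCT column, and the helper names `flatten` and `pivot_distinct_users` are made up for illustration.

```python
from collections import defaultdict

# Sample rows from the question; "properties" plays the role of the STRUCT column.
rows = [
    {"UserID": "01", "properties": {"CampaignName": "Campaign#01", "Status": "opened"}},
    {"UserID": "01", "properties": {"CampaignName": "Campaign#01", "Status": "clicked"}},
    {"UserID": "01", "properties": {"CampaignName": "Campaign#02", "Status": "opened"}},
    {"UserID": "02", "properties": {"CampaignName": "Campaign#01", "Status": "opened"}},
    {"UserID": "02", "properties": {"CampaignName": "Campaign#02", "Status": "opened"}},
]


def flatten(row):
    """Lift the nested fields to the top level, like SELECT UserID, properties.*."""
    flat = {"UserID": row["UserID"]}
    flat.update(row["properties"])
    return flat


def pivot_distinct_users(flat_rows, statuses=("opened", "clicked")):
    """Count distinct users per campaign and status, like COUNT(DISTINCT UserID) in PIVOT."""
    users = defaultdict(lambda: {s: set() for s in statuses})
    for r in flat_rows:
        bucket = users[r["CampaignName"]]  # the campaign is the implicit grouping column
        if r["Status"] in bucket:          # statuses not listed in IN (...) are ignored
            bucket[r["Status"]].add(r["UserID"])
    return {
        campaign: {s: len(ids) for s, ids in by_status.items()}
        for campaign, by_status in sorted(users.items())
    }


result = pivot_distinct_users(flatten(r) for r in rows)
print(result)
```

Once the nested fields are ordinary columns, the only remaining grouping column is the campaign name, which is why the STRUCT error goes away after flattening.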

Related

SQL Pivot in BigQuery

I have a SQL table with information about email campaigns that my company has created. Each line of the table is an action that a user has taken on a specific campaign:
User ID  Campaign Name  Status
01       Campaign#1     opened
01       Campaign#1     clicked
01       Campaign#2     opened
02       Campaign#1     opened
02       Campaign#2     opened
I wanted to Pivot this on SQL, in a way that would render the unique number of people who have opened and clicked on each campaign:
Campaign Name  Opened  Clicked
Campaign#1     2149    122
Campaign#2     4223    141
I've been trying to work with:
SELECT user_id, campaign_name, status from table
PIVOT(
COUNT (DISTINCT user_id)
FOR status IN
( [opened],
[clicked]
) ) AS PivotTable
But then I am getting:
Unrecognized name: opened at [5:6]
Consider below example
select * from your_table
pivot (count(distinct UserID) for Status in ('opened', 'clicked'))
if applied to sample data in your question
with your_table as (
select '01' UserID, 'Campaign#1' CampaignName, 'opened' Status union all
select '01', 'Campaign#1', 'clicked' union all
select '01', 'Campaign#2', 'opened' union all
select '02', 'Campaign#1', 'opened' union all
select '02', 'Campaign#2', 'opened'
)
the output is
CampaignName  opened  clicked
Campaign#1    2       1
Campaign#2    2       0

Count instances of value (say, '4') in several columns/ rows

I have survey responses in a SQL database. Scores are 1-5.
Current format of the data table is this:
Survey_id, Question_1, Question_2, Question_3
383838, 1,1,1
392384, 1,5,4
393894, 4,3,5
I'm running a new query where I need % 4's, % 5's ... question doesn't matter, just overall.
At first glance I'm thinking
sum(iif(Question_1 =5,1,0)) + sum(iif(Question_2=5,1,0)) .... as total5s
sum(iif(Question_1=4,1,0)) + sum(iif(Question_2=4,1,0)) .... as total4s
But I am unsure if this is the quickest or most elegant way to achieve this.
EDIT: Hmm on first test this query already appears not to work correctly
EDIT2: I think I need sum instead of count in my example, will edit.
You have to unpivot the data and calculate the % responses thereafter. Because there are a limited number of questions, you can use union all to unpivot the data.
select 100.0*count(case when question=4 then 1 end)/count(*) as pct_4s
from (select survey_id,question_1 as question from tablename
union all
select survey_id,question_2 from tablename
union all
select survey_id,question_3 from tablename
) responses
Another way to do this could be
select 100.0*(count(case when question_1=4 then 1 end)
+count(case when question_2=4 then 1 end)
+count(case when question_3=4 then 1 end))
/(3*count(*))
from tablename
With unpivot as @Dudu suggested,
with unpivoted as (select *
from tablename
unpivot (response for question in (question_1,question_2,question_3)) u
)
select 100.0*count(case when response=4 then 1 end)/count(*)
from unpivoted
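The UNION ALL variant above can be tried on the question's three sample rows. The sketch below uses Python's built-in sqlite3 module purely as a convenient engine (the question does not say which database is in use), and adds a second column for the 5s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename ("
    "survey_id INTEGER, question_1 INTEGER, question_2 INTEGER, question_3 INTEGER)"
)
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?, ?)",
    [(383838, 1, 1, 1), (392384, 1, 5, 4), (393894, 4, 3, 5)],
)

# Unpivot the three question columns into one, then compute the share of each score.
query = """
SELECT 100.0 * COUNT(CASE WHEN question = 4 THEN 1 END) / COUNT(*) AS pct_4s,
       100.0 * COUNT(CASE WHEN question = 5 THEN 1 END) / COUNT(*) AS pct_5s
FROM (
    SELECT question_1 AS question FROM tablename
    UNION ALL
    SELECT question_2 FROM tablename
    UNION ALL
    SELECT question_3 FROM tablename
) AS responses
"""
pct_4s, pct_5s = conn.execute(query).fetchone()
print(round(pct_4s, 2), round(pct_5s, 2))
```

With nine responses in total and two of each score, both percentages come out at roughly 22.22.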

Sorting twice on same column

I'm having a bit of a weird question, given to me by a client.
He has a list of data, with a date between parentheses like so:
Foo (14/08/2012)
Bar (15/08/2012)
Bar (16/09/2012)
Xyz (20/10/2012)
However, he wants the list to be displayed as follows:
Foo (14/08/2012)
Bar (16/09/2012)
Bar (15/08/2012)
Xyz (20/10/2012)
(notice that the second Bar has moved up one position)
So, the logic behind it is, that the list has to be sorted by date ascending, EXCEPT when two rows have the same name ('Bar'). If they have the same name, it must be sorted with the LATEST date at the top, while staying in the other sorting order.
Is this even remotely possible? I've experimented with a lot of ORDER BY clauses, but couldn't find the right one. Does anyone have an idea?
I should have specified that this data comes from a table in a sql server database (the Name and the date are in two different columns). So I'm looking for a SQL-query that can do the sorting I want.
(I've dumbed this example down quite a bit, so if you need more context, don't hesitate to ask)
This works, I think
declare @t table (data varchar(50), date datetime)
insert @t
values
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
select t.*
from @t t
inner join (select data, COUNT(*) cg, MAX(date) as mg from @t group by data) tc
on t.data = tc.data
order by case when cg>1 then mg else date end, date desc
produces
data date
---------- -----------------------
Foo 2012-08-14 00:00:00.000
Bar 2012-09-16 00:00:00.000
Bar 2012-08-15 00:00:00.000
Xyz 2012-10-20 00:00:00.000
A way with better performance than any of the other posted answers is to just do it entirely with an ORDER BY and not a JOIN or using CTE:
DECLARE @t TABLE (myData varchar(50), myDate datetime)
INSERT INTO @t VALUES
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
SELECT *
FROM @t t1
ORDER BY (SELECT MIN(t2.myDate) FROM @t t2 WHERE t2.myData = t1.myData), T1.myDate DESC
This does exactly what you request and will work with any indexes and much better with larger amounts of data than any of the other answers.
Additionally it's much more clear what you're actually trying to do here, rather than masking the real logic with the complexity of a join and checking the count of joined items.
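The ORDER BY with a correlated subquery can be checked quickly on the sample rows. This sketch runs it through Python's sqlite3 module rather than SQL Server, so the table `t` and the text-typed ISO dates are assumptions made for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (myData TEXT, myDate TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("Foo", "2012-08-14"),
        ("Bar", "2012-08-15"),
        ("Bar", "2012-09-16"),
        ("Xyz", "2012-10-20"),
    ],
)

# First key: earliest date of the name (places the whole name group).
# Second key: the row's own date, newest first, inside the group.
rows = conn.execute(
    """
    SELECT t1.myData, t1.myDate
    FROM t AS t1
    ORDER BY (SELECT MIN(t2.myDate) FROM t AS t2 WHERE t2.myData = t1.myData),
             t1.myDate DESC
    """
).fetchall()
for row in rows:
    print(row)
```

Note that this keys every row of a name on that name's earliest date, so rows sharing a name are grouped together even when other names fall between them in date order. Whether that is the desired behaviour depends on how "same name" is meant in the question.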
This one uses analytic functions to perform the sort; it requires only one SELECT from your table.
The inner query finds gaps, where the name changes. These gaps are used to identify groups in the next query, and the outer query does the final sorting by these groups.
I have tried it here (SQL Fiddle) with extended test-data.
SELECT name, dat
FROM (
SELECT name, dat, SUM(gap) over(ORDER BY dat, name) AS grp
FROM (
SELECT name, dat,
CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name THEN 0 ELSE 1 END AS gap
FROM t
) x
) y
ORDER BY grp, dat DESC
Extended test-data
('Bar','2012-08-12'),
('Bar','2012-08-11'),
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-08-16'),
('Bar','2012-09-17'),
('Xyz','2012-10-20')
Result
Bar 2012-08-12
Bar 2012-08-11
Foo 2012-08-14
Bar 2012-09-17
Bar 2012-08-16
Bar 2012-08-15
Xyz 2012-10-20
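For anyone who wants to reproduce the result above without the SQL Fiddle, here is a sketch that runs the same LAG / running-SUM query through Python's sqlite3 module. It assumes an SQLite build with window functions (3.25 or later); the table and column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, dat TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("Bar", "2012-08-12"),
        ("Bar", "2012-08-11"),
        ("Foo", "2012-08-14"),
        ("Bar", "2012-08-15"),
        ("Bar", "2012-08-16"),
        ("Bar", "2012-09-17"),
        ("Xyz", "2012-10-20"),
    ],
)

query = """
SELECT name, dat
FROM (
    -- running total of gaps gives each run of equal names its own group number
    SELECT name, dat, SUM(gap) OVER (ORDER BY dat, name) AS grp
    FROM (
        -- gap = 1 whenever the name differs from the previous row in date order
        SELECT name, dat,
               CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name
                    THEN 0 ELSE 1 END AS gap
        FROM t
    ) AS x
) AS y
ORDER BY grp, dat DESC
"""
ordered = conn.execute(query).fetchall()
for row in ordered:
    print(row)
```

Here the two runs of Bar stay separate because Foo sits between them in date order, which is the difference from grouping by name alone.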
I think that this works, including the case I asked about in the comments:
declare @t table (data varchar(50), [date] datetime)
insert @t
values
('Foo','20120814'),
('Bar','20120815'),
('Bar','20120916'),
('Xyz','20121020')
; With OuterSort as (
select *,ROW_NUMBER() OVER (ORDER BY [date] asc) as rn from @t
)
--Now we need to find contiguous ranges of the same data value, and the min and max row number for such a range
, Islands as (
select data,rn as rnMin,rn as rnMax from OuterSort os where not exists (select * from OuterSort os2 where os2.data = os.data and os2.rn = os.rn - 1)
union all
select i.data,rnMin,os.rn
from
Islands i
inner join
OuterSort os
on
i.data = os.data and
i.rnMax = os.rn-1
), FullIslands as (
select
data,rnMin,MAX(rnMax) as rnMax
from Islands
group by data,rnMin
)
select
*
from
OuterSort os
inner join
FullIslands fi
on
os.rn between fi.rnMin and fi.rnMax
order by
fi.rnMin asc,os.rn desc
It works by first computing the initial ordering in the OuterSort CTE. Then, using two CTEs (Islands and FullIslands), we compute the parts of that ordering in which the same data value appears in adjacent rows. Having done that, we can compute the final ordering by any value that all adjacent values will have (such as the lowest row number of the "island" that they belong to), and then within an "island", we use the reverse of the originally computed sort order.
Note that this may, though, not be too efficient for large data sets. On the sample data it shows up as requiring 4 table scans of the base table, as well as a spool.
Try something like...
ORDER BY CASE date
WHEN '14/08/2012' THEN 1
WHEN '16/09/2012' THEN 2
WHEN '15/08/2012' THEN 3
WHEN '20/10/2012' THEN 4
END
In MySQL, you can do:
ORDER BY FIELD(date, '14/08/2012', '16/09/2012', '15/08/2012', '20/10/2012')
In Postgres, you can create a function FIELD and do:
CREATE OR REPLACE FUNCTION field(anyelement, anyarray) RETURNS numeric AS $$
SELECT
COALESCE((SELECT i
FROM generate_series(1, array_upper($2, 1)) gs(i)
WHERE $2[i] = $1),
0);
$$ LANGUAGE SQL STABLE
If you do not want to use the CASE, you can try to find an implementation of the FIELD function for SQL Server.

SQL Pivot Command

I am looking for some help on designing a simple pivot so that I can link it into other parts of my queries.
My data is like this
Items Table
Below is my table if I run Select * from items
ITEM Weight
12345 10
12345 11
654321 50
654321 20
654321 100
There are hundreds of Items in this table but each item code will only ever have
maximum of 3 weight records each.
I want the desired output
ITEM Weight_1 Weight_2 Weight_3
12345 10 11 null
654321 50 20 100
Would appreciate any suggestions,
I have played around with pivots but each subsequent item puts the weights into weight 4,5,6,7,etc
instead of starting at weight1 for each item.
Thanks
Update
Below is what I have used so far,
SELECT r.*
FROM (SELECT 'weight' + CAST(Row_number() OVER (ORDER BY regtime ASC)AS
VARCHAR(10))
line,
id,
weight
FROM items it) AS o PIVOT(MIN([weight]) FOR line IN (weight1, weight2,
weight3)) AS r
You were almost there! You were only missing the PARTITION BY clause in OVER:
SELECT r.*
FROM (SELECT 'weight' + CAST(Row_number() OVER (PARTITION BY id ORDER BY
regtime ASC)
AS
VARCHAR(10)) line,
id,
weight
FROM items it) AS o PIVOT(MIN([weight]) FOR line IN (weight1, weight2,
weight3)) AS r
When you PARTITION BY id, the row numbers are reset for each different id.
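The effect of PARTITION BY can be seen on the question's sample data with the sketch below. It uses Python's sqlite3 module (window functions need SQLite 3.25 or later), and because SQLite has no PIVOT keyword, MAX(CASE ...) stands in for it. The `regtime` values are invented here just to give the rows an order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item INTEGER, weight INTEGER, regtime INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (12345, 10, 1),   # regtime values are hypothetical ordering keys
        (12345, 11, 2),
        (654321, 50, 3),
        (654321, 20, 4),
        (654321, 100, 5),
    ],
)

query = """
SELECT item,
       MAX(CASE WHEN rn = 1 THEN weight END) AS weight_1,
       MAX(CASE WHEN rn = 2 THEN weight END) AS weight_2,
       MAX(CASE WHEN rn = 3 THEN weight END) AS weight_3
FROM (
    -- numbering restarts at 1 for every item because of PARTITION BY
    SELECT item, weight,
           ROW_NUMBER() OVER (PARTITION BY item ORDER BY regtime) AS rn
    FROM items
) AS numbered
GROUP BY item
ORDER BY item
"""
pivoted = conn.execute(query).fetchall()
for row in pivoted:
    print(row)
```

Dropping the PARTITION BY from the window would number the rows 1 to 5 across the whole table, which is what pushed later items into weight 4, 5 and so on.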
Update
You do not need dynamic pivot, since you will always have 3 weights. But, if you ever need dynamic number of columns, take a look at some of the examples here:
SQL Server PIVOT perhaps?
Pivot data in T-SQL
How do I build a summary by joining to a single table with SQL Server?
You will need a value to form the columns which I do with row_number. The outcome is what you want. The only negative that I have against PIVOT is that you need to know how many columns in advance. I use a similar method, but build up the select as dynamic SQL and can then insert my columns.
EDIT: updated to show columns as weight1, weight2, etc.
create table #temp (Item int, Weight int)
insert into #temp (Item, Weight)
Values (12345, 10),
(12345, 11),
(654321, 50),
(654321, 20),
(654321, 200)
SELECT *
FROM (SELECT Item,
Weight,
'weight' + cast(Row_number()
OVER (partition by Item order by item) as varchar(10)) as seq
FROM #temp) as Src
PIVOT ( MAX(Weight) FOR Seq IN ([Weight1], [Weight2], [Weight3]) ) as PVT
MySQL
Whenever you need a pivot, use GROUP_CONCAT; it will output a CSV list of the values you need.
Once you get used to working with it, it's a great tool.
SELECT item, GROUP_CONCAT(weight) as weights FROM table1
GROUP BY item
See: http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_group-concat
TSQL aka SQL-server
Many many questions on this because T-SQL supports a pivot keyword.
See:
Transact SQL Query-Pivot-SQL
Pivot data in T-SQL

SQL find min & max range within dataset

I have a table with the following columns:
contactId (int)
interval (int)
date (smalldate)
small sample data:
1,120,'12/02/2010'
1,121,'12/02/2010'
1,122,'12/02/2010'
1,123,'12/02/2010'
1,145,'12/02/2010'
1,146,'12/02/2010'
1,147,'12/02/2010'
2,122,'12/02/2010'
2,123,'12/02/2010'
2,124,'12/02/2010'
2,320,'12/02/2010'
2,321,'12/02/2010'
2,322,'12/02/2010'
2,450,'12/02/2010'
2,451,'12/02/2010'
how/is it possible - to get sql to return columns "contactId, minInterval, maxInterval, date", e.g
1,120,123,'12/02/2010'
1,145,147,'12/02/2010'
2,122,124,'12/02/2010'
2,320,322,'12/02/2010'
2,450,451,'12/02/2010'
hopefully this makes sense, basically i'm looking to figure out the min/max range of the intervals by provider & date for the range where they increment by one... once there is a break in the interval incrementer (e.g. more than one) then it would indicate a new min/max range...
any help is greatly appreciated :)
here is my exact SQL table setup:
create table availability
(
Id int,
ProviderId int,
IntervalId int,
Date date
)
sample data
providerid,intervalid,date
1128,108,2010-12-27
1128,109,2010-12-27
1128,110,2010-12-27
1128,111,2010-12-27
1128,112,2010-12-27
1128,113,2010-12-27
1128,114,2010-12-27
1128,120,2010-12-27
1128,121,2010-12-27
1128,122,2010-12-27
1128,123,2010-12-27
1128,124,2010-12-27
1128,125,2010-12-27
1213,108,2010-12-27
1213,109,2010-12-27
1213,110,2010-12-27
1213,111,2010-12-27
1213,112,2010-12-27
1213,113,2010-12-27
1213,114,2010-12-27
1213,115,2010-12-27
1213,232,2010-12-27
1213,233,2010-12-27
1213,234,2010-12-27
3954,198,2010-12-27
3954,199,2010-12-27
3954,200,2010-12-27
3954,201,2010-12-27
3954,202,2010-12-27
3954,203,2010-12-27
3954,204,2010-12-27
3954,205,2010-12-27
3954,206,2010-12-27
3954,207,2010-12-27
3954,208,2010-12-27
3954,209,2010-12-27
3954,210,2010-12-27
3954,211,2010-12-27
3954,212,2010-12-27
3954,213,2010-12-27
3954,214,2010-12-27
3954,215,2010-12-27
3954,216,2010-12-27
3954,217,2010-12-27
3954,218,2010-12-27
3954,229,2010-12-27
3954,230,2010-12-27
3954,231,2010-12-27
3954,232,2010-12-27
3954,233,2010-12-27
3954,234,2010-12-27
1128,108,2010-12-28
1128,109,2010-12-28
1128,110,2010-12-28
1128,111,2010-12-28
1128,112,2010-12-28
1128,113,2010-12-28
1128,114,2010-12-28
1128,115,2010-12-28
1128,116,2010-12-28
3954,186,2010-12-28
3954,187,2010-12-28
3954,188,2010-12-28
3954,189,2010-12-28
3954,190,2010-12-28
3954,213,2010-12-28
3954,214,2010-12-28
3954,215,2010-12-28
3954,216,2010-12-28
3954,217,2010-12-28
3954,218,2010-12-28
3954,219,2010-12-28
3954,220,2010-12-28
3954,221,2010-12-28
3954,222,2010-12-28
sample result using current sql within answers:
1062,180,180,2010-12-20
1062,179,179,2010-12-20
1062,178,178,2010-12-20
1062,177,177,2010-12-20
1062,176,176,2010-12-20
1062,175,175,2010-12-20
1062,174,174,2010-12-20
1062,173,173,2010-12-20
1062,172,172,2010-12-20
1062,171,171,2010-12-20
1062,170,170,2010-12-20
1062,169,169,2010-12-20
1062,168,168,2010-12-20
1062,167,167,2010-12-20
1062,166,166,2010-12-20
1062,165,165,2010-12-20
1062,164,164,2010-12-20
1062,163,163,2010-12-20
1062,162,162,2010-12-20
1062,161,161,2010-12-20
1062,160,160,2010-12-20
1062,159,159,2010-12-20
1062,158,158,2010-12-20
1062,157,157,2010-12-20
1062,156,156,2010-12-20
1062,155,155,2010-12-20
1062,154,154,2010-12-20
1062,153,153,2010-12-20
1062,152,152,2010-12-20
1062,151,151,2010-12-20
1062,150,150,2010-12-20
1062,149,149,2010-12-20
1062,148,148,2010-12-20
1062,147,147,2010-12-20
1062,146,146,2010-12-20
1062,145,145,2010-12-20
1062,144,144,2010-12-20
1062,143,143,2010-12-20
1062,142,142,2010-12-20
1062,141,141,2010-12-20
1062,140,140,2010-12-20
1062,139,139,2010-12-20
1062,138,138,2010-12-20
1062,137,137,2010-12-20
1062,136,136,2010-12-20
1062,135,135,2010-12-20
1062,134,134,2010-12-20
1062,133,133,2010-12-20
1062,132,132,2010-12-20
1062,131,131,2010-12-20
1062,130,130,2010-12-20
1062,129,129,2010-12-20
1062,128,128,2010-12-20
1062,127,127,2010-12-20
1062,126,126,2010-12-20
1062,125,125,2010-12-20
1062,124,124,2010-12-20
1062,123,123,2010-12-20
1062,122,122,2010-12-20
1062,121,121,2010-12-20
1062,120,120,2010-12-20
1062,119,119,2010-12-20
1062,118,118,2010-12-20
1062,117,117,2010-12-20
1062,116,116,2010-12-20
1062,115,115,2010-12-20
1062,114,114,2010-12-20
1062,113,113,2010-12-20
1062,112,112,2010-12-20
In SQL Server, Oracle and PostgreSQL:
WITH q AS
(
SELECT t.*, interval - ROW_NUMBER() OVER (PARTITION BY contactID, date ORDER BY interval) AS sr
FROM mytable t
)
SELECT contactID, date, MIN(interval), MAX(interval)
FROM q
GROUP BY
date, contactID, sr
ORDER BY
date, contactID, sr
Update:
With your test data I get this output:
WITH mytable (providerId, intervalId, date) AS
(
SELECT 1128,108,'2010-12-27' UNION ALL
SELECT 1128,109,'2010-12-27' UNION ALL
SELECT 1128,110,'2010-12-27' UNION ALL
SELECT 1128,111,'2010-12-27' UNION ALL
SELECT 1128,112,'2010-12-27' UNION ALL
SELECT 1128,113,'2010-12-27' UNION ALL
SELECT 1128,114,'2010-12-27' UNION ALL
SELECT 1128,120,'2010-12-27' UNION ALL
SELECT 1128,121,'2010-12-27' UNION ALL
SELECT 1128,122,'2010-12-27' UNION ALL
SELECT 1128,123,'2010-12-27' UNION ALL
SELECT 1128,124,'2010-12-27' UNION ALL
SELECT 1128,125,'2010-12-27'
),
q AS
(
SELECT t.*, intervalId - ROW_NUMBER() OVER (PARTITION BY providerId, date ORDER BY intervalId) AS sr
FROM mytable t
)
SELECT providerId, date, MIN(intervalId), MAX(intervalId)
FROM q
GROUP BY
date, providerId, sr
ORDER BY
date, providerId, sr
1128 2010-12-27 108 114
1128 2010-12-27 120 125
i.e. exactly what you were after.
Are you sure you are using the query correctly? Are you having duplicates on (providerId, intervalId, date)?
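The trick in this query is that, inside a run of consecutive intervals, the interval minus its row number stays constant, so that difference can serve as a group key. The sketch below checks it on the small contactId sample from the question using Python's sqlite3 module (SQLite 3.25 or later for window functions). The column names `contact_id`, `interval_id` and `day` are renamed stand-ins chosen to avoid keyword clashes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (contact_id INTEGER, interval_id INTEGER, day TEXT)")

sample = {
    1: [120, 121, 122, 123, 145, 146, 147],
    2: [122, 123, 124, 320, 321, 322, 450, 451],
}
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(cid, iv, "12/02/2010") for cid, ivs in sample.items() for iv in ivs],
)

query = """
WITH q AS (
    SELECT contact_id, interval_id, day,
           -- constant within each run of consecutive interval_id values
           interval_id - ROW_NUMBER() OVER (
               PARTITION BY contact_id, day ORDER BY interval_id
           ) AS sr
    FROM mytable
)
SELECT contact_id, MIN(interval_id), MAX(interval_id), day
FROM q
GROUP BY day, contact_id, sr
ORDER BY day, contact_id, sr
"""
ranges = conn.execute(query).fetchall()
for row in ranges:
    print(row)
```

Duplicate rows for the same contact, interval and date would advance the row number without advancing the interval, changing the difference and splitting a run, which is why the question about duplicates matters.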
It's probably possible to do this with a SQL query alone, but it will probably be a bit mind-boggling. Basically a subquery to find places where it increments by one, joined to the original dataset, with tons of other logic in there. That's my impression at least.
If I were you:
If this is a one-time deal, don't care about performance and just iterate over it and do the calculation 'manually'.
If this is a production dataset and you need to do this operation on a frequent / automated / performance-intensive setting, then rearrange the original dataset to make this kind of query easier.
Hope one of those options is available to you.