How to scale Pivoting in BigQuery?

Let's say I have a music video play stats table, mydataset.stats, for a given day (3B rows, 1M users, 6K artists).
Simplified schema is:
UserGUID String, ArtistGUID String
I need to pivot/transpose artists from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
with each artist's play count for the respective user.
An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it doesn't look like it scales to the numbers in my example.
Can this approach be scaled for my example?

I tried the approach below with up to 6,000 features and it worked as expected. I believe it will work up to 10K features, which is the hard limit on the number of columns in a table.
STEP 1 - Aggregate plays by user / artist
SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays
FROM [mydataset.stats] GROUP BY 1, 2
STEP 2 – Normalize uid and aid so they are consecutive numbers 1, 2, 3, …
We need this for at least two reasons: a) to make the dynamically generated SQL later on as compact as possible and b) to get more usable/friendly column names.
Combined with the first step, it looks like:
SELECT u.uid AS uid, a.aid AS aid, plays
FROM (
SELECT userGUID, artistGUID, COUNT(1) AS plays
FROM [mydataset.stats]
GROUP BY 1, 2
) AS s
JOIN (
SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID
Let’s write output to table - mydataset.aggs
STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time.
In my particular example, by experimenting, I found that the basic approach works well for 2,000 to 3,000 features at a time.
To be on the safe side I decided to use 2,000 features at a time.
The script below is used to dynamically generate the query that is then run to create the partitioned tables.
SELECT 'SELECT uid,' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)
The above query produces yet another query, like the one below:
SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid
This should be run and written to mydataset.pivot_1_2000
Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN) gives us two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
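For example, the generator for the second chunk only changes the HAVING range (a sketch under the same naming as above):
SELECT 'SELECT uid,' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 2000 and aid < 4001)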
As you can see, mydataset.pivot_1_2000 has the expected schema but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
STEP 4 – Merge all partitioned pivot tables into the final pivot table, with all features represented as columns in one table.
Same as in the steps above: first we generate the query and then run it.
Initially we will “stitch” mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, then the result with mydataset.pivot_4001_6000.
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
The output string from above should be run and the result written to mydataset.pivot_1_4000.
Then we repeat STEP 4 like below
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)
The result should be written to mydataset.pivot_1_6000.
The resulting table has the following schema:
uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int
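Once it is built, a typical consumer query against the materialized matrix might look like the sketch below (a17 and a42 are just assumed columns; any aNNN works the same way):
SELECT uid, a17, a42, (IFNULL(a17, 0) + IFNULL(a42, 0)) AS plays_a17_a42
FROM [mydataset.pivot_1_6000]
WHERE a17 IS NOT NULL OR a42 IS NOT NULL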
NOTE:
a. I tried this approach only up to 6000 features and it worked as expected
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 min.
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables' sizes are relatively small (30-40MB), and so are the billed bytes. For “before 2016” projects everything is billed as tier 1, but after October 2016 this can be an issue.
For more information, see Timing in High-Compute queries.
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I can be wrong) that storing a materialized feature matrix is not the best idea.

Related

statistics over kth events

Assuming a table with the following format
Team Score
---- -----
A 10
B 20
A 30
B 40
A 50
C 60
I would like to compute statistics, e.g. the mean over the "kth" game a given team played; e.g. if k = 1, the mean is (10 + 20 + 60) / 3. How can one accomplish this using BigQuery? Is there a much simpler way for the special case k = 1 vs. the general case?
Consider the query below. It assumes you have a column that represents the game number or game date, i.e. something that defines game order. In this example I use a column named game, but you should replace it with your own column.
select avg(score) avg_score from (
select * from your_table
qualify row_number() over(partition by team order by game) <= 1
)
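For the general case, one sketch is to parameterize k with a scripting variable (k = 2 here is only an example; game is still the assumed ordering column):
declare k int64 default 2;  -- which game number to average over
select avg(score) avg_score from (
  select * from your_table
  qualify row_number() over(partition by team order by game) = k
);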

Transpose rows into columns in BigQuery (Pivot implementation) [duplicate]

This question already has answers here:
How to Pivot table in BigQuery
I want to generate a new table and place all key value pairs with keys as column names and values as their respective values using BigQuery.
Example:
Key  Value
channel_title Mahendra Guru
youtube_id ugEGMG4-MdA
channel_id UCiDKcjKocimAO1tV
examId 72975611-4a5e-11e5
postId 1189e340-b08f
channel_title Ab Live
youtube_id 3TNbtTwLY0U
channel_id UCODeKM_D6JLf8jJt
examId 72975611-4a5e-11e5
postId 0c3e6590-afeb
I want to convert it to:
channel_title youtube_id channel_id examId postId
Mahendra Guru ugEGMG4-MdA UCiDKcjKocimAO1tV 72975611-4a5e-11e5 1189e340-b08f
Ab Live 3TNbtTwLY0U UCODeKM_D6JLf8jJt 72975611-4a5e-11e5 0c3e6590-afeb
How to do it using BigQuery?
BigQuery does not yet support pivoting functions.
You can still do this in BigQuery using the approach below.
But first, in addition to the two columns in your input data, you must have one more column that specifies which groups of input rows need to be combined into one output row.
So, I assume your input table (yourTable) looks like below
id  Key  Value
1 channel_title Mahendra Guru
1 youtube_id ugEGMG4-MdA
1 channel_id UCiDKcjKocimAO1tV
1 examId 72975611-4a5e-11e5
1 postId 1189e340-b08f
2 channel_title Ab Live
2 youtube_id 3TNbtTwLY0U
2 channel_id UCODeKM_D6JLf8jJt
2 examId 72975611-4a5e-11e5
2 postId 0c3e6590-afeb
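If your raw data has no such grouping column, one possible way to derive it (a sketch, assuming a pos column that preserves the original row order and exactly these five keys per record) is:
SELECT 1 + INTEGER((rn - 1) / 5) AS id, key, value
FROM (
  SELECT key, value, ROW_NUMBER() OVER(ORDER BY pos) AS rn  -- pos and the "5 keys per record" layout are assumptions
  FROM yourRawTable
)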
So, first you should run the query below:
SELECT 'SELECT id, ' +
GROUP_CONCAT_UNQUOTED(
'MAX(IF(key = "' + key + '", value, NULL)) as [' + key + ']'
)
+ ' FROM yourTable GROUP BY id ORDER BY id'
FROM (
SELECT key
FROM yourTable
GROUP BY key
ORDER BY key
)
The result of the above query will be a string that (if formatted) will look like below:
SELECT
id,
MAX(IF(key = "channel_id", value, NULL)) AS [channel_id],
MAX(IF(key = "channel_title", value, NULL)) AS [channel_title],
MAX(IF(key = "examId", value, NULL)) AS [examId],
MAX(IF(key = "postId", value, NULL)) AS [postId],
MAX(IF(key = "youtube_id", value, NULL)) AS [youtube_id]
FROM yourTable
GROUP BY id
ORDER BY id
You should now copy the above result (note: you don't really need to format it; I did so only for presentation) and run it as a normal query.
The result will be as you expected:
id channel_id channel_title examId postId youtube_id
1 UCiDKcjKocimAO1tV Mahendra Guru 72975611-4a5e-11e5 1189e340-b08f ugEGMG4-MdA
2 UCODeKM_D6JLf8jJt Ab Live 72975611-4a5e-11e5 0c3e6590-afeb 3TNbtTwLY0U
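For reference, the same generated query written in Standard SQL would look roughly like the sketch below (the MAX(IF(...)) pattern is unchanged; only the alias quoting differs, since square-bracket aliases are Legacy SQL syntax):
#standardSQL
SELECT
  id,
  MAX(IF(key = "channel_id", value, NULL)) AS channel_id,
  MAX(IF(key = "channel_title", value, NULL)) AS channel_title,
  MAX(IF(key = "examId", value, NULL)) AS examId,
  MAX(IF(key = "postId", value, NULL)) AS postId,
  MAX(IF(key = "youtube_id", value, NULL)) AS youtube_id
FROM yourTable
GROUP BY id
ORDER BY id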
Please note: you can skip Step 1 if you can construct the proper query (as in Step 2) yourself, the number of fields is small and constant, or it is a one-time deal. Step 1 is just a helper step that builds the query for you, so you can regenerate it quickly at any time!
If you are interested - you can see more about pivoting in my other posts.
How to scale Pivoting in BigQuery?
Please note: there is a limitation of 10K columns per table, so you are limited to 10K distinct keys.
You can also see the below as simplified examples (if the above one is too complex/verbose):
How to transpose rows to columns with large amount of the data in BigQuery/SQL?
How to create dummy variable columns for thousands of categories in Google BigQuery?
Pivot Repeated fields in BigQuery

BigQuery full table to partition

I have 340 GB of data in one table (270 days' worth of data). Now I am planning to move this data to a partitioned table.
That means I will have 270 partitions. What is the best way to move this data to the partitioned table?
I don't want to run 270 queries, which is a very costly operation, so I am looking for an optimized solution.
I have multiple tables like this. I need to migrate all these tables to partitioned tables.
Thanks,
I see three options
Direct Extraction out of original table:
Actions (how many queries to run) = Days [to extract] = 270
Full Scans (how much data scanned measured in full scans of original table) = Days = 270
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 270 = $459.00
Hierarchical (recursive) Extraction: (described in Mosha's answer)
Actions = 2^ceil(log2(Days)) – 2 = 510
Full Scans = 2*ceil(log2(Days)) = 18
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 18 = $30.60
Clustered Extraction: (I will describe it in a sec)
Actions = Days + 1 = 271
Full Scans = [always] 2 = 2
Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 2 = $3.40
Summary
Method                                Actions   Total Full Scans   Total Cost
Direct Extraction                     270       270                $459.00
Hierarchical (recursive) Extraction   510       18                 $30.60
Clustered Extraction                  271       2                  $3.40
Definitely, for most practical purposes Mosha's solution is the way to go (I use it in most such cases).
It is relatively simple and straightforward.
Even though you need to run the query 510 times, the query is "relatively" simple and the orchestration logic is easy to implement with whatever client you usually use.
And the cost saving is quite visible!
From $460 down to $31!
Almost 15 times less!
In case you
a) want to lower the cost even further, by yet another factor of 9 (so it will be about 135x lower in total),
b) and like having fun and more challenges,
take a look at the third option.
“Clustered Extraction” Explanation
Idea / Goal:
Step 1
We want to transform the original table into another [single] table with 270 columns: one column for each day.
Each column will hold one serialized row for the respective day from the original table.
The total number of rows in this new table will be equal to the number of rows in the "heaviest" day.
This will require just one query (see example below) with one full scan.
Step 2
After this new table is ready, we extract day by day, querying ONLY the respective column, and write into the final daily table (the daily tables' schema is exactly the same as the original table's schema, and all those tables can be pre-created).
This will require 270 queries to be run, with scans approximately equivalent in total (this really depends on how complex your schema is, so it can vary) to one full scan of the original table.
While querying a column, we need to de-serialize the row's value and parse it back to the original schema.
Very simplified example: (using BigQuery Standard SQL here)
The purpose of this example is just to give direction, in case you find the idea interesting.
Serialization / de-serialization is extremely simplified to keep the focus on the idea and less on a particular implementation, which can differ from case to case (it mostly depends on the schema).
So, assume the original table (theTable) looks somewhat like below:
SELECT 1 AS id, "101" AS x, 1 AS ts UNION ALL
SELECT 2 AS id, "102" AS x, 1 AS ts UNION ALL
SELECT 3 AS id, "103" AS x, 1 AS ts UNION ALL
SELECT 4 AS id, "104" AS x, 1 AS ts UNION ALL
SELECT 5 AS id, "105" AS x, 1 AS ts UNION ALL
SELECT 6 AS id, "106" AS x, 2 AS ts UNION ALL
SELECT 7 AS id, "107" AS x, 2 AS ts UNION ALL
SELECT 8 AS id, "108" AS x, 2 AS ts UNION ALL
SELECT 9 AS id, "109" AS x, 2 AS ts UNION ALL
SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
SELECT 20 AS id, "120" AS x, 4 AS ts
Step 1 – transform the table and write the result into tempTable
SELECT
num,
MAX(IF(ts=1, ser, NULL)) AS ts_1,
MAX(IF(ts=2, ser, NULL)) AS ts_2,
MAX(IF(ts=3, ser, NULL)) AS ts_3,
MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
SELECT
ts,
CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser,
ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
FROM theTable
)
GROUP BY num
tempTable will look like below:
num ts_1 ts_2 ts_3 ts_4
1 1|101|1 6|106|2 10|110|3 19|119|4
2 2|102|1 7|107|2 11|111|3 20|120|4
3 3|103|1 8|108|2 12|112|3 null
4 4|104|1 9|109|2 13|113|3 null
5 5|105|1 null 14|114|3 null
6 null null 15|115|3 null
7 null null 16|116|3 null
8 null null 17|117|3 null
9 null null 18|118|3 null
Here, I am using simple concatenation for serialization
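For a real schema, one alternative sketch is to serialize the whole row as JSON instead of hand-concatenating fields (de-serialization in Step 2 would then use the JSON functions instead of SPLIT):
SELECT
  ts,
  TO_JSON_STRING(t) AS ser,  -- serialize the entire row as a JSON string
  ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
FROM theTable t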
Step 2 – extract rows for a specific day and write the output to the respective daily table.
Please note: in the example below we are extracting rows for ts = 2; this corresponds to column ts_2.
SELECT
r[OFFSET(0)] AS id,
r[OFFSET(1)] AS x,
r[OFFSET(2)] AS ts
FROM (
SELECT SPLIT(ts_2, "|") AS r
FROM tempTable
WHERE NOT ts_2 IS NULL
)
The result will look like below (which is expected):
id x ts
6 106 2
7 107 2
8 108 2
9 109 2
I wish I had more time to write this down properly, so don't judge it too harshly if something is missing. This is more of a directional answer, but at the same time the example is pretty reasonable, and if you have a plain, simple schema almost no extra thinking is required. Of course, with records and nested stuff in the schema, the most challenging part is serialization / de-serialization, but that's where the fun is, along with the extra $ savings.
I will add a fourth option to @Mikhail's answer.
DDL QUERY
Actions = 1 query to run
Full Scans = 1
Cost, $ = $5 x 0.34 x 1 = $1.70 (270 times cheaper than option #1 \o/)
With BigQuery's DDL support (CREATE TABLE ... AS SELECT) you can convert a non-partitioned table to a partitioned one while doing only one full scan of the source table.
To illustrate my solution I will use one of BigQuery's public tables, namely bigquery-public-data.hacker_news.comments. Below is the table's schema:
name      | type      | description
----------+-----------+-----------------------------------------------------
id        | INTEGER   | ...
by        | STRING    | ...
author    | STRING    | ...
...       |           |
time_ts   | TIMESTAMP | human-readable timestamp in UTC, YYYY-MM-DD hh:mm:ss
...       |           |
We are going to partition the comments table based on time_ts
#standardSQL
CREATE TABLE my_dataset.comments_partitioned
PARTITION BY DATE(time_ts)
AS
SELECT *
FROM `bigquery-public-data.hacker_news.comments`
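Once created, a quick sanity check (a sketch; the date is just an example) shows the benefit: a query restricted to one day scans only that partition rather than the full table:
#standardSQL
SELECT COUNT(*) AS comments_that_day
FROM my_dataset.comments_partitioned
WHERE time_ts >= TIMESTAMP('2015-01-01')
  AND time_ts <  TIMESTAMP('2015-01-02')  -- range filter on the partitioning column prunes to one daily partition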
I hope it helps :)
If your data was in sharded tables (i.e. with a YYYYmmdd suffix), you could have used the "bq partition" command. But with data in a single table, you will have to scan it multiple times, applying different WHERE clauses on your partition key column.
The only optimization I can think of is to do it hierarchically, i.e. instead of 270 queries doing 270 full table scans, first split the table in half, then each half in half, etc. This way you will need to pay for roughly 2*log_2(270) = 2*9 = 18 full scans.
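For illustration, the first level of such a split might look like the sketch below (table and column names are assumptions; each result is written to its own intermediate table, and then each half is split again the same way):
#standardSQL
-- first half of the days (write the result to an intermediate table)
SELECT * FROM my_dataset.original
WHERE ts_column < TIMESTAMP('2017-01-01');

-- second half of the days (write to another intermediate table), then repeat on each half
SELECT * FROM my_dataset.original
WHERE ts_column >= TIMESTAMP('2017-01-01');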
Once the conversion is done - all the temporary tables can be deleted to eliminate extra storage costs.

Dynamic pivot for thousands of columns

I'm using pgAdmin III / PostgreSQL 9.4 to store and work with my data. Sample of my current data:
x | y
--+--
0 | 1
1 | 1
2 | 1
5 | 2
5 | 2
2 | 2
4 | 3
6 | 3
2 | 3
How I'd like it to be formatted:
1, 2, 3 -- column names are unique y values
0, 5, 4 -- the first respective x values
1, 5, 6 -- the second respective x values
2, 2, 2 -- etc.
It would need to be dynamic because I have millions of rows and thousands of unique values for y.
Is using a dynamic pivot approach correct for this? I have not been able to successfully implement this:
DECLARE #columns VARCHAR(8000)
SELECT #columns = COALESCE(#columns + ',[' + cast(y as varchar) + ']',
'[' + cast(y as varchar)+ ']')
FROM tableName
GROUP BY y
DECLARE #query VARCHAR(8000)
SET #query = '
SELECT x
FROM tableName
PIVOT
(
MAX(x)
FOR [y]
IN (' + #columns + ')
)
AS p'
EXECUTE(#query)
It is stopping on the first line and giving the error:
syntax error at or near "#"
All dynamic pivot examples I've seen use this, so I'm not sure what I've done wrong. Any help is appreciated. Thank you for your time.
Note: It is important for the x values to be stored in the correct order, as sequence matters. I can add another column to indicate sequential order if necessary.
The term "first row" assumes a natural order of rows, which does not exist in database tables. So, yes, you need to add another column to indicate sequential order like you suspected. I am assuming a column tbl_id for the purpose. Using the ctid would be a measure of last resort. See:
Deterministic sort order for window functions
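For the query below I assume a table shaped roughly like this (a sketch; tbl_id supplies the deterministic order):
CREATE TABLE tbl (
  tbl_id serial PRIMARY KEY  -- defines the row order the pivot relies on
, x      int
, y      int
);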
The code you present looks like MS SQL Server code; invalid syntax for Postgres.
For millions of rows and thousands of unique values for Y it wouldn't even make sense to try and return individual columns. Postgres has generous limits, but not nearly generous enough for that. According to the source code or the manual, the absolute maximum number of columns is 1600.
So we don't even get to discuss the restrictive nature of SQL, which demands that columns and data types be known at call time and cannot adjust them dynamically during execution. You would need two separate calls, as we discussed in great detail under this related question:
Dynamic alternative to pivot with CASE and GROUP BY
Another answer by Clodoaldo under the same question returns arrays. That can actually be completely dynamic. And that's what I suggest here, too. The query is actually rather simple:
WITH cte AS (
SELECT *, row_number() OVER (PARTITION BY y ORDER BY tbl_id) AS rn
FROM tbl
ORDER BY y, tbl_id
)
SELECT text 'y' AS col, array_agg (y) AS values
FROM cte
WHERE rn = 1
UNION ALL
( -- parentheses required
SELECT text 'x' || rn, array_agg (x)
FROM cte
GROUP BY rn
ORDER BY rn
);
Result:
col | values
----+--------
y | {1,2,3}
x1 | {0,5,4}
x2 | {1,5,6}
x3 | {2,2,2}
db<>fiddle here
Old sqlfiddle
Explanation
The CTE computes a row_number rn for each row (each x) per group of y. We are going to use it twice, hence the CTE.
The 1st SELECT in the outer query generates the array of y values.
The 2nd SELECT in the outer query generates all arrays of x values, in order. The arrays can have different lengths.
Why the parentheses for UNION ALL? See:
Sum results of a few queries and then find top 5 in SQL

SQL based data diff: longest common subsequence

I'm looking for research papers or writings on applying the Longest Common Subsequence algorithm to SQL tables to obtain a data diff view. Other suggestions on how to solve the table diff problem are also welcome. The challenge is that SQL tables have this nasty habit of getting rather BIG, and applying straightforward algorithms designed for text processing may result in a program that never ends...
so given a table Original:
Key Content
1 This row is unchanged
2 This row is outdated
3 This row is wrong
4 This row is fine as it is
and the table New:
Key Content
1 This row was added
2 This row is unchanged
3 This row is right
4 This row is fine as it is
5 This row contains important additions
I need to find out the Diff:
+++ 1 This row was added
--- 2 This row is outdated
--- 3 This row is wrong
+++ 3 This row is right
+++ 5 This row contains important additions
If you export your tables into csv files, you can use http://sourceforge.net/projects/csvdiff/
Quote:
csvdiff is a Perl script to diff/compare two csv files with the
possibility to select the separator. Differences will be shown like:
"Column XYZ in record 999" is different. After this, the actual and the
expected result for this column will be shown.
This is probably too simple for what you're after, and it's not research :-), but just conceptual. I imagine you're looking to compare different methods for processing overhead (?).
--This is half of what you don't want ( A )
SELECT o.[Key] FROM tbl_Original o INNER JOIN tbl_New n ON o.Content = n.Content
--This is the other half of what you don't want ( B )
SELECT n.[Key] FROM tbl_Original o INNER JOIN tbl_New n ON o.Content = n.Content
--This is half of what you DO want ( C )
SELECT '+++' as diff, n.[Key], Content FROM tbl_New n WHERE n.[Key] NOT IN ( B )
--This is the other half of what you DO want ( D )
SELECT '---' as diff, o.[Key], Content FROM tbl_Original o WHERE o.[Key] NOT IN ( A )
--Combining C & D
( C )
Union
( D )
Order By diff, [Key]
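Written out concretely, the combined C / D query might look like the sketch below (the same NOT IN approach, with subqueries A and B inlined; table and column names follow the example above):
SELECT '+++' AS diff, n.[Key], n.Content
FROM tbl_New n
WHERE n.[Key] NOT IN ( SELECT n2.[Key]
                       FROM tbl_Original o INNER JOIN tbl_New n2 ON o.Content = n2.Content )  -- ( B )
UNION
SELECT '---' AS diff, o.[Key], o.Content
FROM tbl_Original o
WHERE o.[Key] NOT IN ( SELECT o2.[Key]
                       FROM tbl_Original o2 INNER JOIN tbl_New n ON o2.Content = n.Content )  -- ( A )
ORDER BY diff, [Key]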
Improvements...
- try creating indexed views of the base tables first
- try reducing the length of the content field to its minimum for uniqueness (trial/error), and then use that shorter result to do your comparisons
-- e.g. to get min length (1000 is arbitrary -- just need an exit)
DECLARE @i int
SET @i = 1
WHILE @i < 1000 AND EXISTS (
    SELECT LEFT(Content, @i) FROM tbl_Original GROUP BY LEFT(Content, @i) HAVING COUNT([Key]) > 1 )
BEGIN
    SET @i = @i + 1
END