Aggregating one BigQuery table into another BigQuery table - google-bigquery

I am trying to aggregate a multi-petabyte (around 7 PB) BigQuery table into another BigQuery table.
I have (partition_key, clusterkey1, clusterkey2, col1, col2, val),
where partition_key is used for the BigQuery partition and clusterkey1 and clusterkey2 are used for clustering.
For example
(timestamp1, timestamp2, 0, 1, 2, 1)
(timestamp3, timestamp4, 0, 1, 2, 7)
(timestamp31, timestamp22, 2, 1, 2, 2)
(timestamp11, timestamp12, 2, 1, 2, 3)
should result in
(0, 1, 2, 8)
(2, 1, 2, 5)
I want to aggregate val based on (clusterkey2, col1, col2), across all partition_key and clusterkey1 values.
What is a feasible way to do this?
Should I write a custom loader and just read all data from it line by line, or is there a native way to do this?

Depending on where and how you are executing this, you can do it by writing a simple SQL script and defining the target output, for example:
SELECT clusterkey2,
  col1,
  col2,
  SUM(val) AS val
FROM table
GROUP BY clusterkey2, col1, col2
This will get you the desired results.
From here you can do a few things, but they are mostly all outlined here in the documentation:
https://cloud.google.com/bigquery/docs/writing-results#writing_query_results
Specifically from the above you are looking to set the destination table.
One thing to note: you may want to include a partition key filter in the WHERE clause to help narrow down your data if you do not want aggregate results for the whole table.
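Putting it together, a minimal sketch of the whole job as a single statement with a destination table (the dataset and table names here are hypothetical, and the partition filter is optional and depends on your schema):
CREATE OR REPLACE TABLE mydataset.aggregated AS
SELECT clusterkey2,
  col1,
  col2,
  SUM(val) AS val
FROM mydataset.source_table
-- optional: prune partitions if you only need part of the table
WHERE partition_key >= TIMESTAMP('2020-01-01')
GROUP BY clusterkey2, col1, col2;
Since this runs entirely inside BigQuery, there is no need for a custom loader that reads the data line by line.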

Related

Parsing Multiple Snowflake Objects with consistent keys to rows

First post, hope I don't do anything too crazy.
I want to go from JSON/object to long format.
I have a table set up as follows (note: there will be a large but finite number of 50+ activity columns; 2 is a minimal working example). I'm not concerned about the formatting of the date column - different problem.
customer_id (varchar), activity_count (object, int), activity_duration (object, numeric)
sample starting point
In this case I'd like to explode this into this:
customer_id (varchar), time_period, activity_count (int), activity_duration (numeric)
sample end point - long
minimum data set
WITH smpl AS (
SELECT
'12a' AS id,
OBJECT_CONSTRUCT(
'd1910', 0,
'd1911', 26,
'd1912', 6,
'd2001', 73) as activity_count,
OBJECT_CONSTRUCT(
'd1910', 0,
'd1911', 260.1,
'd1912', 30,
'd2001', 712.3) AS activity_duration
UNION ALL
SELECT
'13b' AS id,
OBJECT_CONSTRUCT(
'd1910', 1,
'd1911', 2,
'd1912', 3,
'd2001', 4) as activity_count,
OBJECT_CONSTRUCT(
'd1910', 1,
'd1911', 2.2,
'd1912', 3.3,
'd2001', 4.3) AS activity_duration
)
select * from smpl
Extra credit for also taking this from JSON/object to wide (in Google BigQuery it's SELECT id, activity_count.* FROM tbl).
Thanks in advance.
I've tried tons of random FLATTEN()-based joins. In this instance I probably just need one working example.
This needs to scale to a moderate but finite number of objects (e.g. 50).
I'll also see if I can combine this with: Lateral flatten two columns without repetition in snowflake
Using FLATTEN:
WITH smpl AS (...)  -- the sample CTE from the question
SELECT s1.id, s1.key, s1.value AS activity_count, s2.value AS activity_duration
FROM (SELECT id, key, value FROM smpl, TABLE(FLATTEN(input => activity_count))) AS s1
JOIN (SELECT id, key, value FROM smpl, TABLE(FLATTEN(input => activity_duration))) AS s2
ON s1.id = s2.id AND s1.key = s2.key;
Output:
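(derived by running the query against the smpl CTE above; one row per id/key pair, row order not guaranteed)
ID  | KEY   | ACTIVITY_COUNT | ACTIVITY_DURATION
12a | d1910 | 0              | 0
12a | d1911 | 26             | 260.1
12a | d1912 | 6              | 30
12a | d2001 | 73             | 712.3
13b | d1910 | 1              | 1
13b | d1911 | 2              | 2.2
13b | d1912 | 3              | 3.3
13b | d2001 | 4              | 4.3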
@Lukasz Szozda gets close, but that answer doesn't scale as well with multiple variables (it's essentially a bunch of Cartesian products and I'd need a lot of ON conditions). I have a known constraint (each field is in a strict format), so it's easy to recycle the key.
After WAY WAY WAY too much messing with this (off-and-on searches for weeks) it finally clicked, and it's pretty easy.
SELECT
  id, key, activity_count[key] AS activity_count, activity_duration[key] AS activity_duration
FROM smpl, LATERAL FLATTEN(input => activity_count);
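For the smpl data above, this should return the same eight rows as the FLATTEN-join output shown earlier (one per id/key pair), without the explicit self-join.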
Each additional object column (e.g. an activity_duration2) just needs another column[key] term in the SELECT list. You can also use things other than key, such as index.
It's inspired by this link, but I just didn't quite follow it:
https://stackoverflow.com/a/36804637/20994650

BigQuery pivoting with a specific requirement

I have used PIVOT in BigQuery, but here is a specific use case for data that I need to show in Looker. I am trying a similar option in Looker, but wanted to know if I can just do this in BigQuery.
This is a sample of how my data looks in the BigQuery table:
The output should be as below:
If you look at it, it's pivoting, but I need to assign the column names as shown (for the specific range), and for the range of 6 and more I need to add the pivoted columns' data into one.
I don't see a pivot index or anything like it in BigQuery. Is there a way to sum up the column data after pivot index 6 or so? Any suggestions on how to achieve this?
Hope the approach below is helpful:
SELECT * FROM (
SELECT Node, bucket, total_code
FROM sample, UNNEST([RANGE_BUCKET(data1, [1, 2, 3, 4, 5, 6, 7])]) bucket
) PIVOT (SUM(total_code) `range` FOR bucket IN (1, 2, 3, 4, 5, 6, 7));
Output:
RANGE_BUCKET - https://cloud.google.com/bigquery/docs/reference/standard-sql/mathematical_functions#range_bucket
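To see why this collapses the top of the range into one column: RANGE_BUCKET returns the 0-based position of the point's upper bound in the boundaries array (effectively, how many boundaries are less than or equal to the value), so every data1 value of 7 or more lands in the last bucket. A quick illustration (the literal values here are made up):
SELECT
  RANGE_BUCKET(1, [1, 2, 3, 4, 5, 6, 7]) AS b1,   -- 1
  RANGE_BUCKET(5, [1, 2, 3, 4, 5, 6, 7]) AS b5,   -- 5
  RANGE_BUCKET(99, [1, 2, 3, 4, 5, 6, 7]) AS b99; -- 7: everything >= 7 shares the last bucket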

How to get data in a column in order by using the SQL IN operator

There is a data set as shown below:
When the input for event_type is 4, 1, 2, 3, for example, I would like to get 3, 999, 3, 9 from cnt_stamp, in this order. I wrote the SQL below, but it seems to always return 999, 3, 9, 3 regardless of the order of the input.
How can I fix the SQL to achieve this? Thank you for taking your time, and please let me know if you have any questions.
SELECT `cnt_stamp` FROM `stm_events` WHERE `event_type` in (4,1,2,3)
Add ORDER BY FIELD(event_type, 4, 1, 2, 3) to your query. It should look like:
SELECT cnt_stamp FROM stm_events WHERE event_type in (4,1,2,3) ORDER BY FIELD(event_type, 4, 1, 2, 3);
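Note that FIELD() is MySQL-specific. On engines without it, a CASE expression gives the same custom ordering; a sketch, assuming the same table and columns:
SELECT cnt_stamp
FROM stm_events
WHERE event_type IN (4, 1, 2, 3)
ORDER BY CASE event_type
  WHEN 4 THEN 1
  WHEN 1 THEN 2
  WHEN 2 THEN 3
  WHEN 3 THEN 4
END;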
It can't do that on its own because by default the data is sorted in ascending order; if you want the result in a custom order, it's better to create an extra column to use as an ordering index.

Doing a concat over a partition in SQL?

I have some data ordered like so:
date, uid, grouping
2018-01-01, 1, a
2018-01-02, 1, a
2018-01-03, 1, b
2018-02-01, 2, x
2018-02-05, 2, x
2018-02-01, 3, z
2018-03-01, 3, y
2018-03-02, 3, z
And I wanted a final form like:
uid, path
1, "a-a-b"
2, "x-x"
3, "z-y-z"
but running something like
select
a.uid
,concat(grouping) over (partition by date, uid) as path
from temp1 a
doesn't seem to want to play well with SQL or Google BigQuery (which is the specific environment I'm working in). Is there an easy enough way to get the groupings concatenated that I'm missing? I imagine there's a way to brute-force it by including a bunch of if-then statements as custom columns and then concatenating the result, but I'm sure that would be a lot messier. Any thoughts?
You are looking for string_agg() (note that grouping is a reserved word in BigQuery, hence the backticks):
select a.uid, string_agg(`grouping`, '-' order by date) as path
from temp1 a
group by a.uid;
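If you would rather keep one row per input row (closer to the OVER (PARTITION BY ...) shape in the question), STRING_AGG is also available as an aggregate analytic function in BigQuery. A sketch, assuming the same temp1 table; here the window's ORDER BY and frame control the concatenation order:
select a.uid,
  string_agg(`grouping`, '-') over (
    partition by a.uid
    order by date
    rows between unbounded preceding and unbounded following) as path
from temp1 a;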

Is IN with multiples of the same value slower?

For example, is this noticeably slower:
DELETE FROM [table] WHERE [REOID] IN ( 1, 1, 1, 2, 2, 1, 3, 5)
Than this:
DELETE FROM [table] WHERE [REOID] IN ( 1, 2, 3, 4)
SQL Server 2008 R2.
Thanks!
Most engines will eliminate duplicates in a constant IN list at the parsing stage.
Such a query will parse marginally slower than one with a deduplicated list, but it will produce the same plan, and in most real-world scenarios you will hardly notice any difference.
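One way to check this yourself on SQL Server: SET SHOWPLAN_TEXT compiles the statements and prints their plans without executing the deletes, so you can compare the two directly (table and column names as in the question):
SET SHOWPLAN_TEXT ON;
GO
DELETE FROM [table] WHERE [REOID] IN (1, 1, 1, 2, 2, 1, 3, 5);
GO
DELETE FROM [table] WHERE [REOID] IN (1, 2, 3, 5);
GO
SET SHOWPLAN_TEXT OFF;
GO
If both batches produce the same plan text, the duplicates were normalized away during parsing.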