Migrating from non-partitioned to Partitioned tables - google-bigquery

In June the BQ team announced support for date-partitioned tables, but the guide does not explain how to migrate existing non-partitioned tables into the new style.
I am looking for a way to update several, if not all, of my tables to the new style.
Also, what options are available besides DAY-type partitioning? Does the BQ UI show this? I wasn't able to create such a partitioned table from the BQ Web UI.

What works for me is the following set of queries, applied directly in BigQuery (via the web UI's compose-new-query dialog).
CREATE TABLE dataset.new_table  -- the destination can be the same or a new dataset
PARTITION BY DATE(date_column)
AS SELECT * FROM dataset.table_to_copy;
Then, as the next step, I drop the original table:
DROP TABLE dataset.table_to_copy;
I got this solution from https://fivetran.com/docs/warehouses/bigquery/partition-table, using only step 2.
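As a quick sanity check (a hedged aside: this assumes the INFORMATION_SCHEMA.PARTITIONS view is available in your project/region), you can confirm that rows actually landed in date partitions; the dataset and table names below are the placeholders from the example above.
-- Hedged check: list the partitions of the new table and their row counts.
-- Substitute your own dataset/table names.
SELECT table_name, partition_id, total_rows
FROM dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'new_table'
ORDER BY partition_id;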

from Pavan’s answer: Please note that this approach will charge you the scan cost of the source table for the query as many times as you query it.
from Pentium10's comments: So suppose I have several years of data; I need to prepare a different query for each day and run all of them, and if I have 1000 days of history, do I need to pay 1000 times the full query price of the source table?
As we can see, the main problem here is the full table scan for each and every day. The rest is less of a problem and can easily be scripted in any client of your choice.
So, the question below is: how to partition a table while avoiding a full table scan for each and every day?
The step-by-step below shows the approach.
It is generic enough to extend/apply to any real use case; meanwhile I am using bigquery-public-data.noaa_gsod.gsod2017 and limiting the "exercise" to just 10 days to keep it readable.
Step 1 – Create Pivot table
In this step we a) compress each row's content into a record/array and b) put them all into the respective "daily" column
#standardSQL
SELECT
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170101' THEN r END) AS day20170101,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170102' THEN r END) AS day20170102,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170103' THEN r END) AS day20170103,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170104' THEN r END) AS day20170104,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170105' THEN r END) AS day20170105,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170106' THEN r END) AS day20170106,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170107' THEN r END) AS day20170107,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170108' THEN r END) AS day20170108,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170109' THEN r END) AS day20170109,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170110' THEN r END) AS day20170110
FROM (
SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
FROM (
SELECT
stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
GROUP BY stn, d
)
)
GROUP BY line
Run the above query in the Web UI with pivot_table (or whatever name you prefer) as the destination.
As we can see, we get a table with 10 columns – one column per day – where the schema of each column is a copy of the schema of the original table.
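If you need more than a handful of days, a small helper query can generate the ARRAY_CONCAT_AGG(...) lines for you instead of typing them by hand. This is just a sketch – the date range below is an assumption for this example – and its output is meant to be copied into the SELECT list of the pivot query above.
#standardSQL
-- Hedged helper: emit one ARRAY_CONCAT_AGG(...) line per day in the range,
-- ready to paste into the pivot query above
SELECT STRING_AGG(
  FORMAT("ARRAY_CONCAT_AGG(CASE WHEN d = 'day%s' THEN r END) AS day%s",
    FORMAT_DATE('%Y%m%d', day), FORMAT_DATE('%Y%m%d', day)),
  ',\n') AS select_list
FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-01-10')) AS day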
Step 2 – Processing partitions one by one, ONLY scanning the respective column (no full table scan), inserting into the respective partition
#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170101) AS r
Run the above query from the Web UI with the destination table named mytable$20170101
You can run the same for the next day
#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170102) AS r
Now you should have the destination table mytable$20170102, and so on.
You should be able to automate/script this step with any client of your choice
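One hedged way to do that automation: generate the per-day extraction statement and the matching partition decorator with a query like the sketch below, then have your client loop over the result, running each statement with the corresponding mytable$YYYYMMDD destination (the date range is again just the 10 days of this example).
#standardSQL
-- Hedged helper: one row per day, with the destination partition decorator and
-- the extraction statement to run against pivot_table for that day
SELECT
  CONCAT('mytable$', FORMAT_DATE('%Y%m%d', day)) AS destination,
  FORMAT('SELECT r.* FROM pivot_table, UNNEST(day%s) AS r',
    FORMAT_DATE('%Y%m%d', day)) AS extraction_sql
FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-01-10')) AS day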
There are many variations of how you can use the above approach – it is up to your creativity.
Note: BigQuery allows up to 10,000 columns per table, so 365 columns for the respective days of one year is definitely not a problem here :o)
Unless there is a limitation on how far back you can go with new partitions – I heard (but haven't had a chance to check yet) that it is now no more than 90 days back.
Update
Please note:
The above version has a little extra logic to pack all aggregated cells into as few final rows as possible.
ROW_NUMBER() OVER(PARTITION BY d) AS line
and then
GROUP BY line
along with
ARRAY_CONCAT_AGG(…)
does this
This works well when the row size in your original table is not that big, so the final combined row size will still be within the row-size limit that BigQuery has (which I believe is 10 MB as of now).
If your source table already has a row size close to that limit, use the adjusted version below.
In this version, grouping is removed so that each row has a value for only one column.
#standardSQL
SELECT
CASE WHEN d = 'day20170101' THEN r END AS day20170101,
CASE WHEN d = 'day20170102' THEN r END AS day20170102,
CASE WHEN d = 'day20170103' THEN r END AS day20170103,
CASE WHEN d = 'day20170104' THEN r END AS day20170104,
CASE WHEN d = 'day20170105' THEN r END AS day20170105,
CASE WHEN d = 'day20170106' THEN r END AS day20170106,
CASE WHEN d = 'day20170107' THEN r END AS day20170107,
CASE WHEN d = 'day20170108' THEN r END AS day20170108,
CASE WHEN d = 'day20170109' THEN r END AS day20170109,
CASE WHEN d = 'day20170110' THEN r END AS day20170110
FROM (
SELECT
stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
GROUP BY stn, d
)
WHERE d BETWEEN 'day20170101' AND 'day20170110'
As you can see, the pivot table (sparse_pivot_table) is now sparse enough (same 21.5 MB, but now 114,089 rows vs. 11,584 rows in pivot_table), so it has an average row size of 190 B vs. 1.9 KB in the initial version – about 10 times less, in line with the number of columns in the example.
So before using this approach, some math needs to be done to project/estimate what can be done and how!
Still, each cell in the pivot table is a sort of JSON representation of a whole row in the original table: it holds not just the values, as the rows in the original table do, but also the schema.
As such it is quite verbose – thus the size of a cell can be multiple times bigger than the original size [which limits the usage of this approach ... unless you get even more creative :o) ... and there are still plenty of areas here to apply it :o)]
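One hedged way to do that math up front: approximate the per-row size of the source table from its JSON representation before choosing between the packed and the sparse pivot versions. This is only a rough ballpark (JSON is itself verbose), not an exact measure of the combined pivot-row size.
#standardSQL
-- Hedged estimate: approximate row sizes of the source table using the byte
-- length of each row serialized as JSON, to compare against the row-size limit
SELECT
  COUNT(*) AS total_rows,
  ROUND(AVG(BYTE_LENGTH(TO_JSON_STRING(t)))) AS avg_row_bytes_approx,
  MAX(BYTE_LENGTH(TO_JSON_STRING(t))) AS max_row_bytes_approx
FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t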

As of today, you can now create a partitioned table from a non-partitioned table by querying it and specifying the partition column. You'll pay for one full table scan on the original (non-partitioned) table. Note: this is currently in beta.
https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result
To create a partitioned table from a query result, write the results to a new destination table. You can create a partitioned table by querying either a partitioned table or a non-partitioned table. You cannot change an existing standard table to a partitioned table using query results.

Until the new feature is rolled out in BigQuery, there is another (much cheaper) way to partition the table(s) by using Cloud Dataflow. We used this approach instead of running hundreds of SELECT * statements, which would have cost us thousands of dollars.
Create the partitioned table in BigQuery using the normal partition command
Create a Dataflow pipeline and use a BigQueryIO.Read source to read the table
Use a Partition transform to partition each row
Using a maximum of 200 shards/sinks at a time (any more than that and you hit API limits), create a BigQueryIO.Write sink for each day/shard that writes to the corresponding partition using the partition decorator syntax – "$YYYYMMDD"
Repeat N times until all data is processed.
Here's an example on Github to get you started.
You still have to pay for the Dataflow pipeline(s), but it's a fraction of the cost of using multiple SELECT * in BigQuery.

If you have date sharded tables today, you can use this approach:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#converting_dated_tables_into_a_partitioned_table
If you have a single non-partitioned table to be converted to partitioned table, you can try the approach of running a SELECT * query with allow large results and using the table's partition as the destination (similar to how you'd restate data for a partition):
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
Please note that this approach will charge you the scan cost of the source table for the query as many times as you query it.
We are working on something to make this scenario significantly better in the next few months.

A DDL-only variant that also keeps the original table name (create the partitioned copy, drop the original, then rename the copy back):
CREATE TABLE `dataset.new_table`
PARTITION BY DATE(date_column)
AS SELECT * FROM `dataset.old_table`;
DROP TABLE `dataset.old_table`;
ALTER TABLE `dataset.new_table`
RENAME TO old_table;

Related

Difference of values in the same column in MS Access SQL (mdb)

I have a table which contains two columns with values that are not unique. Those values are generated automatically and I have no way to do anything about it: I cannot edit the table or the DB, nor create custom functions.
With that in mind, I've solved this problem in SQL Server, but my solution contains some functions that do not exist in MS Access.
The columns are Volume and ComponentID; here is my code in SQL:
with rows as (
select row_number() over (order by volume) as rownum, volume
from test where componentid = 'S3')
select top 10
rowsMinusOne.volume, coalesce(rowsMinusOne.volume - rows.volume,0) as diff
from rows as rowsMinusOne
left outer join rows
on rows.rownum = rowsMinusOne.rownum - 1
Sample data:
58.29168
70.57396
85.67902
97.04888
107.7026
108.2022
108.3975
108.5777
109
109.8944
Expected results:
Volume      diff
58.29168    0
70.57396    12.28228
85.67902    15.10506
97.04888    11.36986
107.7026    10.65368
108.2022    0.4996719
108.3975    0.1952896
108.5777    0.1801834
109         0.4223404
109.8944    0.89431
I have solved the coalesce part by replacing it with Nz. I have tried to use DCount to replace row_number (How to show the record number in a MS Access report table?) but I receive an error saying it cannot find the function (I am reading the data from code; that is the only thing I can do).
I also tried this, but as the answer says, I need a column with a unique value, which I do not have and cannot create: Microsoft Access query to duplicate ROW_NUMBER
Consider:
SELECT TOP 10 Table1.ComponentID,
DCount("*","Table1","ComponentID = 'S3' AND Volume<" & [Volume])+1 AS Seq, Table1.Volume,
Nz(Table1.Volume -
(SELECT Top 1 Dup.Volume FROM Table1 AS Dup
WHERE Dup.ComponentID = Table1.ComponentID AND Dup.Volume<Table1.Volume
ORDER BY Volume DESC),0) AS Diff
FROM Table1
WHERE (((Table1.ComponentID)="S3"))
ORDER BY Table1.Volume;
This will likely perform very slowly with a large dataset.
Alternative solutions:
Build a query that calculates the difference (a sketch follows below), use that query as the source for a report, and use the textbox RunningSum property to calculate the sequence number
VBA looping through a recordset and saving results to a 'temp' table
Export to Excel
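A hedged sketch of the first alternative (a difference-only query, leaving the sequence number to the report's RunningSum property) – untested, with table and column names taken from the question:
SELECT t.Volume,
Nz(t.Volume -
(SELECT Max(d.Volume) FROM Table1 AS d
WHERE d.ComponentID = t.ComponentID AND d.Volume < t.Volume), 0) AS Diff
FROM Table1 AS t
WHERE t.ComponentID = 'S3'
ORDER BY t.Volume;
Like the DCount version above, it assumes Volume values are effectively unique within a ComponentID.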

Displaying a single date header above multiple rows (RecyclerView)

Evening everyone
I've currently got a simple RecyclerView adapter which is being populated by an SQLite database. The user can add information to the database from the app, which then builds a row inside the RecyclerView. When you run the application it displays each row with its own date directly above it. I'm now looking to make the application look more professional by displaying only a single date above multiple records, as a header.
So far I've built 2 custom designs, one which displays the header along with the row and the other which is just a standard row without a header built in. I also understand how to implement two layouts in a single adapter.
I've also added a column to my database which simply stores the date in a way I can use to order the records, e.g. 20190101.
Now my key question: when populating the adapter using the information from the SQLite database, how can I get it to check whether the previous record has the same date? If it has the same date then it doesn't need to show the custom row with the header, but if it's a new date then it does.
Thank you
/////////////////////////////////////////////////////////////////////////////
Follow-up question for Krokodilko: I've spent the last hour trying to work your implementation into my SQLite query but still haven't been able to find the right combination.
Below is the original SQLite line I currently use to simply get all the results.
Cursor cursor = sqLiteDatabase.rawQuery("SELECT * FROM " + Primary_Table + " " , null);
First you must define an order which will be used to determine which record is previous and which one is next. As I understand it, you are simply using the date column.
Then the query is simple – use the LAG analytic function to pick a column value from the previous row. Here is a link to a simple demo (click the "Run" button):
https://sqliteonline.com/#fiddle-5c323b7a7184cjmyjql6c9jh
DROP TABLE IF EXISTS d;
CREATE TABLE d(
d date
);
insert into d values ( '2012-01-22'),( '2012-01-22'),( '2015-01-22');
SELECT *,
lag( d ) OVER (order by d ) as prev_date,
CASE WHEN d = lag( d ) OVER (order by d )
THEN 'Previous row has the same date'
ELSE 'Previous row has different date'
END as Compare_status
FROM d
ORDER BY d;
In the above demo the d column is used in the OVER (ORDER BY d) clause to determine the order of rows used by the LAG function.
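Applied to the app's own query, it could look roughly like the sketch below. Primary_Table and date_column are placeholders for the real table and column names, and window functions such as LAG require SQLite 3.25+ (older Android API levels ship older SQLite builds, so check this before relying on it). The adapter can then inflate the header layout whenever show_header is 1.
SELECT *,
CASE WHEN date_column = LAG(date_column) OVER (ORDER BY date_column)
THEN 0 ELSE 1
END AS show_header
FROM Primary_Table
ORDER BY date_column;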

BigQuery - Joining and pivoting large tables

I know there are some posts on pivoting, which I have used to get where I am today (thanks to the BQ community!). But this post seeks some advice on optimising this where a large number of pivot columns are needed, distributed table joins are needed... as well as deduping. Not asking much, right?!
Objective:
We have 2 large BQ tables, with a full 10 years of history, that need joining:
sales_order_header (13 GB - 1.35 million rows)
sales_order_line (50 GB - 5 million rows)
This is a typical 'header/line' one-to-many relationship. The data for the tables unfortunately arrives as 2 separate streams rather than 1 document style where the line is nested inside the header, which would be ideal – but it's not, so distributed joins become necessary for some of the views our BI tool (Tableau) wants to call periodically (every 60 mins) to ingest 'cleansed' data that is:
deduped (both tables that is)
joined header to line (on salesOrderId)
each has its own array of 'sourceData' name/value pairs that needs unpacking/'pivoting' so it is not an array
Point 3 presents an issue in its own right. We have a column called 'sourceData' which is basically where the core data is – it's an array of string name/value pairs (a row in BQ is a replication of a single row from a DB, so the key is a column name and the value is that column's value for a single row).
Now I think herein lies the issue: as there are 250 array entries (we know the exact number up front), this equates to 250 'unnest' statements each, using the best approach I can think of – sub-selects:
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
250 times
And this is done as a pattern for each of the header and line tables' respective views.
So the SQL for the view for just retrieving a deduped, flattened/pivoted array for the sales_order_header table is as follows. The sales_order_line has the same pattern for its view:
#standardSQL
WITH latest_snapshot_dups AS (
SELECT
salesOrderId,
PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", lastUpdated) AS lastUpdatedTimestampUTC,
sourceData,
_PARTITIONTIME AS bqPartitionTime
FROM
`project.ds.sales_order_header_refdata`
),
latest_snapshot_nodups AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY salesOrderId ORDER BY lastUpdatedTimestampUTC DESC) AS rowNum
FROM latest_snapshot_dups
)
SELECT
salesOrderId,
lastUpdatedTimestampUTC,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 of these
FROM
latest_snapshot_nodups
WHERE
rowNum = 1
Although I am just showing one here, we have these two similar views (with a total of 250 + 300 = 550 unique subqueries that unnest/pivot), and now that I want to join the header with the line views I run into an issue straight away, exceeding the limit on the number of subqueries.
Is there a better way to do this, assuming this is the data there is to work with? A better way to 'pivot' perhaps? Or a more efficient way of building a single view that optimises the order of things, rather than using 2 discrete views?
Thanks for your help BQ Community!
I run into an issue straight away exceeding a limit of subqueries
You are currently using the pattern below (the less significant parts of the code are removed for simplicity)
#standardSQL
SELECT
salesOrderId,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 OF these
FROM latest_snapshot_nodups
Try the pattern below
#standardSQL
SELECT
salesOrderId,
MAX(IF(name = 'a', val, NULL)) AS a,
MAX(IF(name = 'b', val, NULL)) AS b,
....250 OF these
FROM latest_snapshot_nodups, UNNEST(sourceData) kv
GROUP BY salesOrderId
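If typing the 250 lines is a concern, a helper query can generate them from the distinct names actually present in sourceData. This is only a sketch: it assumes every name is already a valid column alias (otherwise sanitise them first), and the table name follows the view definition above.
#standardSQL
-- Hedged helper: build the MAX(IF(...)) lines from the distinct keys found in
-- sourceData, ready to paste into the SELECT list of the pattern above
SELECT STRING_AGG(
  FORMAT("MAX(IF(name = '%s', val, NULL)) AS %s", name, name),
  ',\n' ORDER BY name) AS select_list
FROM (
  SELECT DISTINCT name
  FROM `project.ds.sales_order_header_refdata`, UNNEST(sourceData)
)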

Split table into multiple tables based on date using bigquery with a single query for partitioning

The original "why" of what I want to do is:
Restore a table maintaining its original partitioning instead of it all going into today's partition.
What I thought I could do is bq load into a temporary table, then run a query to split that table into one table per day, in the naming convention needed by bq partition, i.e. sharded_YYYYMMDD, and then run bq partition.
This page https://cloud.google.com/bigquery/docs/creating-partitioned-tables gives examples, but it requires running a query per day. That could be hundreds:
bq query --use_legacy_sql=false --allow_large_results --replace \
--noflatten_results --destination_table 'mydataset.temps$20160101' \
'SELECT stn,temp from `bigquery-public-data.noaa_gsod.gsod2016` WHERE mo="01" AND da="01" limit 100'
So how do I make a single query that will iterate over all the days and make one table per day?
I found a similar question here Split a table into multiple tables in BigQuery SQL but there is no answer about doing it with a single query.
The main problem here is having a full scan for each and every day. The rest is less of a problem and can easily be scripted in any client of your choice.
So, the question below is: how to avoid a full table scan for each and every day?
Try the step-by-step below to see the approach.
It is generic enough to extend/apply to your real case; meanwhile I am using the same example as in your question and limiting the exercise to just 10 days.
Step 1 – Create Pivot table
In this step we a) compress each row's content into a record/array and b) put them all into the respective "daily" column
#standardSQL
SELECT
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160101' THEN r END) AS day20160101,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160102' THEN r END) AS day20160102,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160103' THEN r END) AS day20160103,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160104' THEN r END) AS day20160104,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160105' THEN r END) AS day20160105,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160106' THEN r END) AS day20160106,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160107' THEN r END) AS day20160107,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160108' THEN r END) AS day20160108,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160109' THEN r END) AS day20160109,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160110' THEN r END) AS day20160110
FROM (
SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
FROM (
SELECT
stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
FROM `bigquery-public-data.noaa_gsod.gsod2016` AS t
GROUP BY stn, d
)
)
GROUP BY line
Run the above query in the Web UI with pivot_table (you can choose whatever name you want here) as the destination.
As you can see, we get a table with 10 columns – one column per day – where the schema of each column is a copy of the schema of the original table.
Step 2 – Creating sharded tables one by one, ONLY scanning the respective column (no full table scan)
#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20160101) AS r
Run the above query from the Web UI with the destination table named mytable_20160101
You can run the same for the next day
#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20160102) AS r
Now you should have a destination table named mytable_20160102, and so on.
You should be able to automate/script this step with any client of your choice
Note: those final daily tables will have exactly the same schema as the original table!
There are many variations of how you can use the above approach – it is up to your creativity.
Note: BigQuery allows up to 10,000 columns per table, so 365 columns for the respective days of one year is definitely not a problem here :o)
Answering myself here. Another approach I've seen done is to write a script that:
Parses the tablebackup.json file, outputs many files tablebackuppartitionYYYYMMDD.json split on a provided parameter.
Creates a batch script to bq load all the files into the appropriate table partitions.
The script would need to process row by row or in chunks to be able to handle massive backups, and it would take some time. The advantage of this method is that it would be generic and usable by a sysadmin untrained in BQ.

SQL - Getting the max effective date less than a date in another table

I'm currently working on a conversion script to transfer a bunch of old data out of a SQL Server 2000 database and onto a SQL Server 2008. One of the things I'm trying to accomplish during this conversion is to eliminate all of the composite keys and replace them with a "proper" primary key. Obviously, when I transfer the data I need to inject the foreign key values into the new table structures.
I'm currently stuck with one data set though and I can't seem to get my head around it in a set-based fashion. The two tables with which I am working are called Charge and Law. They have a 1:1 relationship and "link" on three columns. The first two are an equal link on the LawSource and LawStatute columns, but the third column is causing me problems. The ChargeDate column should link to the LawDate column where LawDate <= ChargeDate.
My current query is returning more than one row (in some cases) for a given Charge because the Law may have more than one LawDate that is less than or equal to the ChargeDate.
Here's what I currently have:
select LawId
from Law a
join Charge b on b.LawSource = a.LawSource
and b.LawStatute = a.LawStatute
and b.ChargeDate >= a.LawDate
Is there any way I can rewrite this to get the most recent entry in the Law table that has the same (or an earlier) date as the ChargeDate?
This would be easier in SQL 2008 with the partitioning functions (so, it should be easier in the future for you).
The usual caveats of "I don't have your schema, so this isn't tested" apply, but I think it should do what you need.
select
l.LawID
from
law l
join (
select
a.LawSource,
a.LawStatute,
max(a.LawDate) LawDate
from
Law a
join Charge b on b.LawSource = a.LawSource
and b.LawStatute = a.LawStatute
and b.ChargeDate >= a.LawDate
group by
a.LawSource, a.LawStatute
) d on l.LawSource = d.LawSource and l.LawStatute = d.LawStatute and l.LawDate = d.LawDate
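For reference, a hedged sketch of the SQL Server 2008 window-function version hinted at above – untested, with column names taken from the question, and assuming a Charge is identified by its LawSource, LawStatute and ChargeDate:
select LawSource, LawStatute, ChargeDate, LawId
from (
    select
        b.LawSource, b.LawStatute, b.ChargeDate, a.LawId,
        -- rn = 1 marks the Law row with the latest LawDate on or before the ChargeDate
        row_number() over (
            partition by b.LawSource, b.LawStatute, b.ChargeDate
            order by a.LawDate desc) as rn
    from Law a
    join Charge b on b.LawSource = a.LawSource
        and b.LawStatute = a.LawStatute
        and b.ChargeDate >= a.LawDate
) x
where rn = 1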
If performance is not an issue, cross apply provides a very readable way:
select *
from Law l
cross apply
(
select top 1 *
from Charge
where LawSource = l.LawSource
and LawStatute = l.LawStatute
and ChargeDate >= l.LawDate
order by
ChargeDate
) c
For each Law row, this looks up the row in the Charge table with the smallest ChargeDate that is on or after the LawDate.
To include rows from Law without a matching Charge, change cross apply to outer apply.