Note: this is nearly a duplicate of an earlier question, with the distinction that in this case the source table is date-partitioned and the destination table does not yet exist. Also, the accepted solution to that question didn't work in this case.
I'm trying to copy a single day's worth of data from one date-partitioned table into a new date-partitioned table that I have not yet created. My hope is that BigQuery would simply create the date-partitioned destination table for me, as it usually does in the non-date-partitioned case.
Using BigQuery CLI, here's my command:
bq cp mydataset.sourcetable\$20161231 mydataset.desttable\$20161231
Here's the output of that command:
BigQuery error in cp operation: Error processing job
'myproject:bqjob_bqjobid': Partitioning specification must be provided
in order to create partitioned table
I've tried doing something similar using the Python SDK: running a SELECT on a date-partitioned table (which selects data from only one date partition) and saving the results into a new destination table (which I hoped would also be date-partitioned). The job fails with the same error:
{u'message': u'Partitioning specification must be provided in order to
create partitioned table', u'reason': u'invalid'}
Clearly I need to add a partitioning specification, but I couldn't find any documentation on how to do so.
You need to create the partitioned destination table first (as per the docs):
If you want to copy a partitioned table into another partitioned
table, the partition specifications for the source and destination
tables must match.
So, just create the destination partitioned table before you start copying. If you can't be bothered specifying the schema, you can create the destination partitioned table like so:
bq mk --time_partitioning_type=DAY mydataset.temps
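If you want to double-check that the partitioning spec took effect, inspecting the table metadata is a quick way to do it; a DAY-partitioned table will show a timePartitioning section in the output:

# Show the table metadata as JSON; look for the "timePartitioning" section.
bq show --format=prettyjson mydataset.temps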
Then, use a query instead of a copy to write to the destination table. The schema will be copied with it:
bq query --use_legacy_sql=false --allow_large_results --replace --destination_table 'mydataset.temps$20160101' 'SELECT * FROM `source`'
Is it even possible to add a partition to an existing table in Athena that is currently not partitioned? If so, please also include the syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web search, and variants of this question exist for SQL Server, or for adding a partition to an already partitioned table. However, I could not find a case where one successfully adds a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads data onto its own disks or into memory, Athena works by scanning data in S3. This is how you enjoy the scale and low cost of the service.
What this means is that your data must live in different folders in a meaningful structure, such as year=2019 and year=2020, and the data for each year must be all, and only, in that year's folder.
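For illustration, a hypothetical layout for a table partitioned by year could look like the following (bucket and file names are made up):

s3://my-bucket/table1/year=2019/data-0001.parquet
s3://my-bucket/table1/year=2020/data-0001.parquet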
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
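As a sketch of such a CTAS query (the external_location and the non-partition column names are made up; ourDateStringCol is taken from the question), note that in Athena the partition columns must come last in the SELECT list:

CREATE TABLE table1_partitioned
WITH (
  format = 'PARQUET',                -- columnar format, good for analytical queries
  parquet_compression = 'SNAPPY',    -- compression codec
  external_location = 's3://my-bucket/table1_partitioned/',  -- made-up location
  partitioned_by = ARRAY['ourDateStringCol']  -- partition column(s)
) AS
SELECT col1, col2, ourDateStringCol  -- partition column must come last
FROM table1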
We cannot use a CREATE OR REPLACE TABLE statement for partitioned tables in BigQuery. I can export the table to GCS, but BigQuery then generates multiple JSON files that cannot be imported into a table at once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.
Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
If you want to drop a column you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove, and use the query result to overwrite the table or to create a new destination table (see the sketch below)
Export your table data to Cloud Storage, delete the data corresponding to the column (or columns) you want to remove, and then load the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table
There is a guide published for Manually Changing Table Schemas.
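For the first option, a minimal sketch (the table and column names are placeholders; this assumes the table is partitioned on a date column, so --time_partitioning_field keeps the new table partitioned the same way):

# Write everything except the unwanted column to a new date-partitioned table.
bq query --use_legacy_sql=false \
  --destination_table mydataset.mytable_v2 \
  --time_partitioning_field my_date_col \
  'SELECT * EXCEPT (column_to_drop) FROM `mydataset.mytable`'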
Edit:
In order to change a partitioned table into a non-partitioned table, you can use the Console to query your data and either overwrite your current table or copy the result to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table:
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
With the above query, you select the data from all the table's partitions and create an extra column showing which partition each row came from. Then, before executing it, there are two options: save the result in a new non-partitioned table, or overwrite the current table:
Creating a new table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose your project and dataset and write your new table's name > under "Destination table write preference" check "Write if empty".
Overwriting the current table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose the same project and dataset as your current table > write the same table name as the one you want to overwrite > under "Destination table write preference" check "Overwrite table".
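The same can be done from the command line; a rough equivalent of the "new table" option above (table names are placeholders, and the default write disposition already behaves like "Write if empty"):

# Save the query result into a new, non-partitioned table.
bq query --use_legacy_sql=false \
  --destination_table mydataset.nonpartitioned_copy \
  'SELECT *, _PARTITIONTIME AS pt FROM `project.dataset.table`'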
I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' date-partitioned tables (the ones that have a _PARTITIONTIME field), but I don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning) because they don't allow one to use the $ decorator.
Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.
However, I'm not sure how to perform the equivalent operation using DML. I find myself performing separate DELETE and INSERT commands. This isn't great because it adds complexity, requires multiple queries, and the operation isn't atomic.
I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However, I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?
The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.
because they don't allow one to use the $ decorator
But they do: you can use table_name$YYYYMMDD when you load into a column-partitioned table as well. For example, I made a partitioned table:
$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"
Then I loaded into a specific partition:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
I tried to load into the wrong partition for the input data, and received an error:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181105" ./row.csv
Some rows belong to different partitions rather than destination partition 20181105
I then replaced the data for the partition:
$ echo "2,0.11,2018-11-07" > row.csv
$ bq load --replace "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.
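For completeness, here is a sketch of what the MERGE version could look like for the question's 2018-10-05 example (it assumes, as the question states, that my_table and source_table share the same schema and that my_table is partitioned on the day column):

MERGE mydataset.my_table T
USING mydataset.source_table S
ON FALSE
WHEN NOT MATCHED AND S.day = DATE '2018-10-05' THEN
  -- insert every source row that belongs to the target partition
  INSERT ROW
WHEN NOT MATCHED BY SOURCE AND T.day = DATE '2018-10-05' THEN
  -- delete whatever was previously in that partition of the target
  DELETE

Because the ON condition is FALSE, no rows ever "match": all source rows are inserted and all target rows in the 2018-10-05 partition are deleted, in one atomic statement.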
The question is how to let Google BigQuery automatically create partitioned tables on a daily basis (one day -> one table, etc.)?
I've used the following command in the command line to create the table:
bq mk --time_partitioning_type=DAY testtable1
The table testtable1 appeared in the dataset, but how do I create tables for every day automatically?
From the partitioned table documentation, you need to run the command to create the table only once. After that, you specify the partition to which you want to write as the destination table of the query, such as testtable1$20170919.
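For example, a daily job would simply point its destination at that day's partition; in this sketch, the dataset name, staging table, and filter column are made up for illustration:

# Replace the 2017-09-19 partition with that day's rows.
bq query --use_legacy_sql=false --replace \
  --destination_table 'mydataset.testtable1$20170919' \
  'SELECT * FROM `mydataset.staging_table` WHERE event_date = "2017-09-19"'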
Can anyone please suggest how to create a partitioned table in BigQuery?
Example: Suppose I have log data in Google Storage for the year 2016. I stored all the data in one bucket, organized by year, month, and date. Now I want to create a table partitioned by date.
Thanks in Advance
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (and filters data for the specific date) and writes to the corresponding partition of a table. For example, to load data for May 1st, 2016, you'd specify the destination_table as table$20160501.
Currently, you'll have to run several query jobs to achieve this process. Please note that you'll be charged for each query job based on bytes processed.
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
There are two options:
Option 1
You can load each daily file into a separate table with a name like YourLogs_YYYYMMDD
See details on how to Load Data from Cloud Storage
After the tables are created, you can access them either using Table wildcard functions (Legacy SQL) or using a Wildcard Table (Standard SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples
Option 2
You can create a Date-Partitioned Table (just one table - YourLogs) - but you will still need to load each daily file into its respective partition (a sketch follows below) - see Creating and Updating Date-Partitioned Tables
After the table is loaded you can easily Query Date-Partitioned Tables
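A hedged sketch of Option 2 (the bucket layout and file format are made up, and --autodetect is used just to keep the example short):

# Create the date-partitioned table once.
bq mk --time_partitioning_type=DAY mydataset.YourLogs

# Load one day's files into that day's partition (repeat per day).
bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect \
  'mydataset.YourLogs$20160501' \
  'gs://my-bucket/logs/2016/05/01/*.json'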
Having partitions for an External Table is not allowed as of now. There is a Feature Request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.