I have two tables with the same schema, tab1 and tab1_partitioned, where the latter is partitioned by day.
I am trying to insert data into the partitioned table with the following command:
bq query --allow_large_results --replace --noflatten_results --destination_table 'advertiser.development_partitioned$20160101' 'select * from advertiser.development where ymd = 20160101';
but I get the following error:
BigQuery error in query operation: Error processing job 'total-handler-133811:bqjob_r78379ac2513cb515_000001553afb7196_1': Provided Schema does not match Table
Both have exactly the same schema and I really don't understand why I am getting that error. Can someone shed some light on my issue?
In fact, I would prefer it if BigQuery supported the dynamic partitioning insert that Hive supports, but several days of searching seem to indicate that it is not possible :-/
The behavior you are seeing is due to how we treat write dispositions when using them with table partitions.
You should be able to append to the partition using a WRITE_APPEND disposition to get the query to go through.
bq query --allow_large_results --append_table --noflatten_results --destination_table 'advertiser.development_partitioned$20160101' 'select * from advertiser.development where ymd = 20160101';
There are some complications to making it work with --replace, but we are looking into improved schema support for table partitions at this time.
Please let me know if this doesn't work for you. Thanks!
To answer the other part of your question about dynamic partitioning - we do plan to support richer flavors of partitioning and we believe that they will handle the majority of use cases.
FYI, I don't think it was always so, but there is now a way to copy data from a non-partitioned to a partitioned table in BigQuery using just DML from the BigQuery UI. For example, if you have a date string of the form YYYY-MM-DD in your origin table, you could run the following to move the data into a partitioned table:
CREATE TABLE my_dataset.my_table (sesh STRING, prod STRING) PARTITION BY DATE(_PARTITIONTIME);
INSERT INTO my_dataset.my_table (_PARTITIONTIME, sesh, prod)
SELECT CAST(PARSE_DATE('%Y-%m-%d', mydatestr) AS TIMESTAMP), sesh, prod
FROM my_dataset.my_orig_table;
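As a quick sanity check (using the same illustrative table), you can read the pseudo-column back to see which partition each row landed in:
SELECT DATE(_PARTITIONTIME) AS partition_date, sesh, prod FROM my_dataset.my_table;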
Is it even possible to add a partition to an existing table in Athena that is currently unpartitioned? If so, please also include the syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web search; variants exist for SQL Server and for adding a partition to an already partitioned table. However, I could not find a case where one successfully adds a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads the data onto its own disks or into memory, Athena works by scanning data in S3. This is how you get the scale and low cost of the service.
What this means is that you have to keep your data in different folders in a meaningful structure, such as year=2019 and year=2020, and make sure that the data for each year is all, and only, in that year's folder.
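For example, a layout along these lines (the bucket and file names are purely illustrative):
s3://my-bucket/my-table/year=2019/part-0000.parquet
s3://my-bucket/my-table/year=2020/part-0000.parquet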
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
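For illustration, a CTAS of that shape might look like the following sketch. The output table name, the S3 location, and the derived year partition column are assumptions here, and note that Athena requires partition columns to come last in the SELECT list:
CREATE TABLE table1_partitioned
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/table1_partitioned/',
  partitioned_by = ARRAY['year']
) AS
SELECT *, year(date(ourDateStringCol)) AS year  -- partition column must be last
FROM table1;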
Using the BigQuery scheduled queries functionality, is it possible to write-truncate a single partition using run-time parameters like @run_time?
What I am trying to do is give the destination table name as mytable${run_time|"%Y%m%d"}, but this gives me an error saying:
Invalid partition decorator in column-partitioned table mytable$20200124 with partitioning field { value: "event_date" }
If I don't give the partition decorator, the whole table is write-truncated, irrespective of the partition.
Yes, it is possible. I just replicated it with a scheduled query whose destination table is set to mytable${run_time|"%Y%m%d"}; notice that the Partitioning field in the configuration is left empty. In addition, I found this issue, where the following workaround is provided:
[...] If you just want to overwrite one partition, you could use a MERGE statement in the query and set the "Destination table" to the column-partitioned table.
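A rough sketch of that MERGE approach in a scheduled query (the table and staging-table names here are assumptions; @run_date is the schedule's built-in date parameter, and event_date is the partitioning column from the error message above):
MERGE mydataset.mytable T
USING (SELECT * FROM mydataset.staging WHERE event_date = @run_date) S
ON FALSE
WHEN NOT MATCHED BY SOURCE AND T.event_date = @run_date THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW;
The always-false join condition routes every source row to the INSERT branch, while the NOT MATCHED BY SOURCE branch clears just that day's partition, so the overwrite happens atomically.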
I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' date-partitioned tables (the ones that have a _PARTITIONTIME field), but I don't know how to do it with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning), because they don't allow one to use the $ decorator.
Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.
However, I'm not sure how to perform the equivalent operation using DML. I find myself performing separate DELETE and INSERT commands. This isn't great because it increases complexity and the number of queries, and the operation isn't atomic.
I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However, I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?
The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.
because they don't allow one to use the $ decorator
But they do: you can use table_name$YYYYMMDD when you load into a column-partitioned table as well. For example, I made a partitioned table:
$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"
Then I loaded into a specific partition:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
I tried to load into the wrong partition for the input data, and received an error:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181105" ./row.csv
Some rows belong to different partitions rather than destination partition 20181105
I then replaced the data for the partition:
$ echo "2,0.11,2018-11-07" > row.csv
$ bq load --replace "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.
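To spell out the MERGE variant the question asks for, here is a minimal sketch. It assumes source_table contains only the 2018-10-05 rows and, per the question, that my_table is partitioned on the DATE column day:
MERGE my_table T
USING source_table S
ON FALSE
WHEN NOT MATCHED BY SOURCE AND T.day = DATE '2018-10-05' THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW;
Because the join condition is always false, every source row falls into the INSERT branch, while the NOT MATCHED BY SOURCE branch deletes the partition's existing rows; both happen in one atomic statement.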
I have some INSERT queries written in Hive that need to be migrated to BigQuery.
For example:
insert into test.abc partition(yrmth) select * from test.xyz
In BigQuery, the partition decorator only supports the YYYYMMDD format. I'm able to dump the data into the partitioned table through the bq command-line tool by loading into test.abc$20171125.
How can I achieve the same using DML statements in Bigquery?
I have learnt that legacy SQL doesn't support writing DML statements, and standard SQL doesn't support table specifications like test.abc$20171125, which are required for loading the data into the corresponding partition.
You are correct - DML statements are not yet supported over partitioned tables.
Just run a simple select * from test.xyz with the destination table set to test.abc$20171125. This is supported by the Web UI, the bq command line, the API, and any client of your choice.
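For example, with the bq command-line tool (a sketch reusing the question's table names; the filter assumes yrmth holds the day as an integer like 20171125):
bq query --allow_large_results --replace --noflatten_results --destination_table 'test.abc$20171125' 'select * from test.xyz where yrmth = 20171125'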
Check https://issuetracker.google.com/issues/36383555 if you want to try the alpha release for column-based partitioned tables - DML over partitioned tables is part of it.
Note: this is nearly a duplicate of this question, with the distinction that in this case the source table is date-partitioned and the destination table does not yet exist. Also, the accepted solution to that question didn't work in this case.
I'm trying to copy a single day's worth of data from one date-partitioned table into a new date-partitioned table that I have not yet created. My hope is that BigQuery would simply create the date-partitioned destination table for me, as it usually does in the non-date-partitioned case.
Using BigQuery CLI, here's my command:
bq cp mydataset.sourcetable\$20161231 mydataset.desttable\$20161231
Here's the output of that command:
BigQuery error in cp operation: Error processing job
'myproject:bqjob_bqjobid': Partitioning specification must be provided
in order to create partitioned table
I've tried doing something similar using the Python SDK: running a select command on a date-partitioned table (which selects data from only one date partition) and saving the results into a new destination table (which I hoped would also be date-partitioned). The job fails with the same error:
{u'message': u'Partitioning specification must be provided in order to
create partitioned table', u'reason': u'invalid'}
Clearly I need to add a partitioning specification, but I couldn't find any documentation on how to do so.
You need to create the partitioned destination table first (as per the docs):
If you want to copy a partitioned table into another partitioned
table, the partition specifications for the source and destination
tables must match.
So, just create the destination partitioned table before you start copying. If you can't be bothered specifying the schema, you can create the destination partitioned table like so:
bq mk --time_partitioning_type=DAY mydataset.temps
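If you do want to specify the schema up front, bq mk also accepts an inline schema; the column list here is illustrative:
bq mk --time_partitioning_type=DAY --schema 'x:INTEGER,y:NUMERIC,d:DATE' mydataset.temps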
Then, use a query instead of a copy to write to the destination table. The schema will be copied with it:
bq query --allow_large_results --replace --destination_table 'mydataset.temps$20160101' 'SELECT * from `source`'