Daily I'm receiving a new table in BigQuery and I want to concatenate this new table's data to the main table; the dataset schemas are the same

Daily I'm receiving a new table (example: tablename_20220811) in BigQuery and I want to concatenate this new table's data to the main_table; the dataset schemas are the same.
I tried using wildcards, but I don't know how to pull in the daily loaded table.

You can use BigQuery scheduled queries with an interval (cron) in the schedule parameter:
Example with the bq command-line tool:
bq query \
--use_legacy_sql=false \
--destination_table=mydataset.desttable \
--display_name='My Scheduled Query' \
--schedule='every 24 hours' \
--append_table=true \
"SELECT
*
FROM
\`mydataset.tablename_*\`
WHERE
_TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE())"
In order to target the expected table, I used a wildcard and a filter on the table suffix. The table suffix should be equal to the current date as a STRING in yyyymmdd format.
The schedule runs the query every day.
You can also configure it directly with the Google Cloud console.

It sounds like you have the right naming format for BigQuery to treat your tables as a single 'date-sharded table'.
You need to ensure that the daily tables:
have the same schema
are in the same dataset
have the same name apart from the _yyyymmdd suffix
You will know if this worked because only one table will appear (with an icon showing multiple tables, rather than the usual icon).
With this in hand, you can write queries like
SELECT
fieldA,
fieldB
FROM
`some_dataset.tablename_*`
WHERE
_table_suffix BETWEEN '20220101' AND '20221201'
This gives you some idea of what's possible:
select from the full date-sharded table using backticks (essential!) and the wildcard syntax
filter using the special _table_suffix meta-field
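If the goal is to append each day's shard into a separate main table (say main_table, as in the question; the dataset and table names here are assumptions), one possible pattern is an INSERT driven by the same suffix filter:
INSERT INTO `some_dataset.main_table`
SELECT
*
FROM
`some_dataset.tablename_*`
WHERE
_table_suffix = FORMAT_DATE('%Y%m%d', CURRENT_DATE())
Run on its own or as the statement inside the scheduled query shown in the other answer, this picks up only the shard whose suffix matches the current date.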


Duplicate several tables in a BigQuery project at once

In our BQ export schema, we have one table for each day, as per the screenshot below.
I want to copy the tables before a certain date (2021-Feb-07). I know how to copy one day at a time via the UI, but is there not a way to use the Cloud Console to write code for copying the selected date range all at once? Or maybe an SQL command directly from a query window?
I think you should transform your sharded tables into a partitioned table, so you can handle them with just a single query. As mentioned in the official documentation, partitioned tables perform better.
To make the conversion, you can just execute the following command in the console.
bq partition \
--time_partitioning_type=DAY \
--time_partitioning_expiration 259200 \
mydataset.sourcetable_ \
mydataset.mytable_partitioned
This will turn your sharded tables sourcetable_YYYYMMDD into a single partitioned table mytable_partitioned, which can be queried across your entire set of data with just a single query.
SELECT
*
FROM
`myprojectid.mydataset.mytable_partitioned`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2022-01-01') AND TIMESTAMP('2022-01-03')
For more details about the conversion commands you can check this link. Also, I recommend checking the links about querying partitioned tables and about partitioned tables in general for more details.
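For the original goal (copying everything before 2021-Feb-07), once the data is in the partitioned table you can pull the whole date range in one statement; a rough sketch (the destination table name here is illustrative):
CREATE TABLE mydataset.mytable_before_20210207 AS
SELECT
*
FROM
`myprojectid.mydataset.mytable_partitioned`
WHERE
_PARTITIONTIME < TIMESTAMP('2021-02-07')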

Google BigQuery: How to remove a column using the bq command line?

The table is not that large, and I want to remove a column. I tried bq update project:schema.table schema_one_column_less.json but got an exception: BigQuery error in update operation: Provided Schema does not match Table xxx. Field xxxxx is missing in new schema. What's the right way to remove a column in place (without having to create a new table)?
This isn't supported by bq update; see here. Google offers the following workarounds, which include mechanisms via the CLI but have drawbacks.
Using the SELECT * EXCEPT mechanism, you can overwrite the original table, which avoids creating a new table but might also result in significant query costs.
From the example on the linked page, the command would look something like:
bq query \
--destination_table mydataset.mytable \
--replace \
--use_legacy_sql=false \
'SELECT
* EXCEPT(column_to_delete)
FROM
mydataset.mytable'
Want to complement the other answer with a pure SQL solution:
CREATE OR REPLACE TABLE
<your_table> AS
SELECT
* EXCEPT(<column_to_remove>)
FROM
<your_table>
It is worth mentioning that if there is a constraint (say, NOT NULL) on a column in your table, you will have to specify the full column list with the constraint, like:
CREATE OR REPLACE TABLE
<your_table>(<full_column_list>) AS
SELECT
* EXCEPT(<column_to_remove>)
FROM
<your_table>
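For instance, a hypothetical table with columns id (declared NOT NULL), name, and column_to_delete might be rewritten like this (the table and column names are illustrative, not from the question):
CREATE OR REPLACE TABLE
mydataset.mytable (id INT64 NOT NULL, name STRING) AS
SELECT
* EXCEPT(column_to_delete)
FROM
mydataset.mytable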

BigQuery: Atomically replace a date partition using DML

I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' date-partitioned tables (the ones that have a _PARTITIONTIME field) but don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning), because they don't allow one to use the $ decorator.
Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.
However, I'm not sure how to perform the equivalent operation using a DML. I find myself performing separate DELETE and INSERT commands. This isn't great because it increases complexity, the number of queries, and the operation isn't atomic.
I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?
The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.
"because they don't allow one to use the $ decorator"
But they do -- you can use table_name$YYYYMMDD when you load into a column-partitioned table as well. For example, I made a partitioned table:
$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"
Then I loaded into a specific partition:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
I tried to load into the wrong partition for the input data, and received an error:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181105" ./row.csv
Some rows belong to different partitions rather than destination partition 20181105
I then replaced the data for the partition:
$ echo "2,0.11,2018-11-07" > row.csv
$ bq load --replace "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.
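For completeness, here is a rough sketch of the MERGE approach for the layout described in the question (my_table partitioned on a DATE column named day, source_table with the same schema; the mydataset prefix is an assumption):
MERGE mydataset.my_table AS T
USING (
  -- only the day being (re)loaded
  SELECT * FROM mydataset.source_table WHERE day = DATE '2018-10-05'
) AS S
ON FALSE
-- with ON FALSE nothing ever matches, so every source row is inserted ...
WHEN NOT MATCHED THEN
  INSERT ROW
-- ... and every target row is "not matched by source"; delete only the old partition
WHEN NOT MATCHED BY SOURCE AND T.day = DATE '2018-10-05' THEN
  DELETE
The delete and insert happen in one atomic statement, and the constant filter on T.day keeps the delete restricted to that single partition.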

Google BQ: Running Parameterized Queries where Parameter Variable is the BQ Table Destination

I am trying to run a SQL script from the Linux command line against a BQ table destination. This SQL script will be used for multiple dates, clients, and BQ table destinations, so it requires using parameters in my BQ API command-line calls (the --parameter flag). Now, I have followed this link to learn about parameterized queries: https://cloud.google.com/bigquery/docs/parameterized-queries , but it's limited in helping me declare a table name.
My SQL script, called Advertiser_Date_Check.sql, is the following:
#standardSQL
SELECT *
FROM (SELECT *
FROM @variable_table
WHERE CAST(_PARTITIONTIME AS DATE) = @variable_date) as final
WHERE final.Advertiser IN UNNEST(@variable_clients)
Where the parameter variables represent the following:
variable_table: The BQ table that I want to query
variable_date: The date that I want to pull from the BQ table
variable_clients: An array of specific clients that I want to pull from the data (for the referenced date)
Now, my command line (Linux) call for the BQ data is the following:
TABLE_NAME=table_name_example
BQ_TABLE=$(echo '`project_id.dataset_id.'$TABLE_NAME'`')
TODAY=$(date +%F)
/bin/bq query --use_legacy_sql=false \
--parameter='variable_table::'$BQ_TABLE'' \
--parameter=variable_date::"$TODAY" \
--parameter='variable_clients:ARRAY<STRING>:["Client_1","Client_2","Client_3"]' \
"`cat /path/to/script/Advertiser_Date_Check.sql`"
The @variable_date and @variable_clients parameters have worked just fine in the past, when it was just them. However, since I want to run this exact SQL command on various tables in a loop, I created a parameter called variable_table. Parameterized queries have to be in standard SQL format, so the table name needs to follow this convention:
`project_id.dataset_id.table_name`
Whenever I try to run this on the command line, I usually get the following error:
Error in query string: Error processing job ... : Syntax error: Unexpected "@" at [4:12]
which references the parameter @variable_table, so it's having a hard time processing that this refers to a table name.
In past attempts, there even has been the error:
project_id.dataset_id.table_name: command not found
But this was mostly due to a poorly formed reference to the destination table name. The first error is the most common occurrence.
Overall, my questions regarding this matter are:
How do I reference a BQ table as a parameter in the command line for parameterized queries in the FROM clause (such as what I try to do with @variable_table)? Is it even possible?
Do you know of other methods to run a query on multiple BQ tables from the command line besides the way I am currently doing it?
Hope this all makes sense and thank you for your assistance!
From the documentation that you linked:
Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.
I think what might work for you in this case, though, is injecting the table name as a regular shell variable (instead of a query parameter). You'd want to make sure that you trust its contents, or that you build the string yourself, in order to avoid SQL injection. One approach is to have hardcoded constants for the table names and then choose which one to insert into the query text based on the user input.
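A minimal sketch of that approach, reusing the names from the question (the inline query string and the DATE typing of variable_date are illustrative choices, not the asker's exact script):
TABLE_NAME=table_name_example
BQ_TABLE="\`project_id.dataset_id.${TABLE_NAME}\`"
TODAY=$(date +%F)
# The table reference is expanded by the shell; only the date and clients stay query parameters.
QUERY="
SELECT *
FROM (SELECT *
FROM ${BQ_TABLE}
WHERE CAST(_PARTITIONTIME AS DATE) = @variable_date) AS final
WHERE final.Advertiser IN UNNEST(@variable_clients)"
/bin/bq query --use_legacy_sql=false \
--parameter=variable_date:DATE:"$TODAY" \
--parameter='variable_clients:ARRAY<STRING>:["Client_1","Client_2","Client_3"]' \
"$QUERY"
Escaping the backticks keeps the shell from treating them as command substitution, which is the likely cause of the "command not found" error mentioned above.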
I thought I would just post my example here, which only covers your question about creating a "dynamic table name", but you can also use my approach for your other variables. My approach was to do this operation directly in Python just before the BigQuery API call, by leveraging Python's datetime module (assuming you want your variables to be time-based).
Create a BigQuery table via the Python BQ API:
from google.colab import auth
from datetime import datetime
from google.cloud import bigquery

auth.authenticate_user()

# Build a time-based suffix for the destination table name.
now = datetime.now()
current_time = now.strftime("%Y%m%d%H%M")

project_id = '<project_id>'
client = bigquery.Client(project=project_id)

# Destination table: <project_id>.<dataset_id>.table_<YYYYMMDDHHMM>
table_id = "<project_id>.<dataset_id>.table_"
table_id = table_id + current_time
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """
SELECT
dataset_id,
project_id,
table_id,
CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type,
DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
TIMESTAMP_MILLIS(creation_time) AS creation_time,
row_count,
size_bytes,
round(safe_divide(size_bytes, (1000*1000)),1) as size_mb,
round(safe_divide(size_bytes, (1000*1000*1000)),3) as size_gb
FROM (select * from `<project_id>.<dataset_id>.__TABLES__`)
ORDER BY dataset_id, table_id asc;
"""
query_job = client.query(sql, job_config=job_config)
query_job.result()
print("Query results loaded to the table {}".format(table_id))
# Output:
# Query results loaded to the table <project_id>.<dataset_id>.table_202101141450
Feel free to copy and test it within a google colab notebook. Just fill in your own:
<project_id>
<dataset_id>

BigQuery insert into a partitioned table from an existing table

I have two tables with the same schema, tab1 and tab1_partitioned, where the latter is partitioned by day.
I am trying to insert data into the partitioned table with the following command:
bq query --allow_large_results --replace --noflatten_results --destination_table 'advertiser.development_partitioned$20160101' 'select * from advertiser.development where ymd = 20160101';
but I get the following error:
BigQuery error in query operation: Error processing job 'total-handler-133811:bqjob_r78379ac2513cb515_000001553afb7196_1': Provided Schema does not match Table
Both have exactly the same schema and I really don't understand why I am getting that error. Can someone shed some light on my issue?
In fact, I would prefer it if BigQuery supported the dynamic-partitioning insert that Hive supports, but a few days of searching suggest that this is not possible :-/
The behavior you are seeing is due to how we treat write dispositions when using them with table partitions.
You should be able to append to the partition using a WRITE_APPEND disposition to get the query to go through.
bq query --allow_large_results --append_table --noflatten_results --destination_table 'advertiser.development_partitioned$20160101' 'select * from advertiser.development where ymd = 20160101';
There are some complications to making it work with --replace, but we are looking into improved schema support for table partitions at this time.
Please let me know if this doesn't work for you. Thanks!
To answer the other part of your question about dynamic partitioning - we do plan to support richer flavors of partitioning and we believe that they will handle the majority of use cases.
FYI, I don't think it was always so, but there is now a way to copy data from non-partitioned to partitioned tables in BigQuery using just DML from the BigQuery UI. For example, if you have a date string in your origin table, of the form YYYY-MM-DD, you could run this to move the data to a partitioned table:
create table my_dataset.my_table (sesh STRING, prod STRING)
partition by DATE(_PARTITIONTIME);

insert into my_dataset.my_table (_PARTITIONTIME, sesh, prod)
select CAST(PARSE_DATE('%Y-%m-%d', mydatestr) as TIMESTAMP), sesh, prod
from my_dataset.my_orig_table;
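As a side note, if the destination table is partitioned on a DATE column instead of on ingestion time, a plain insert ... select routes each row to the right partition automatically, which is close to the Hive-style dynamic partitioning the question mentions. A sketch (the my_table_by_day name is illustrative):
create table my_dataset.my_table_by_day (sesh STRING, prod STRING, day DATE)
partition by day;

insert into my_dataset.my_table_by_day (sesh, prod, day)
select sesh, prod, PARSE_DATE('%Y-%m-%d', mydatestr)
from my_dataset.my_orig_table;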