Duplicate several tables in a BigQuery project at once - google-bigquery

In our BQ export schema, we have one table for each day as per the screenshot below.
I want to copy the tables before a certain date (2021-feb-07). I know how to copy one day at a time via the UI, but is there a way to write code in the Cloud Console that copies the selected date range all at once? Or maybe a SQL command directly from a query window?

I think you should transform your sharded tables into a partitioned table, so you can handle them with just a single query. As mentioned in the official documentation, partitioned tables perform better.
To make the conversion, you can just execute the following command in the console.
bq partition \
--time_partitioning_type=DAY \
--time_partitioning_expiration 259200 \
mydataset.sourcetable_ \
mydataset.mytable_partitioned
This will turn your sharded tables sourcetable_(xxx) into a single partitioned table mytable_partitioned, which can be queried with just a single query across your entire set of data entries.
SELECT
*
FROM
`myprojectid.mydataset.mytable_partitioned`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2022-01-01') AND TIMESTAMP('2022-01-03')
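Once the data is in mytable_partitioned, the original goal (everything before 2021-feb-07) can be materialised in one step by writing the filtered result to a destination table. A minimal sketch, with mydataset.mytable_before_20210207 as a placeholder destination name:
bq query \
--use_legacy_sql=false \
--destination_table=mydataset.mytable_before_20210207 \
"SELECT * FROM \`myprojectid.mydataset.mytable_partitioned\`
WHERE _PARTITIONTIME < TIMESTAMP('2021-02-07')"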
For more details about the conversion command, you can check this link. I also recommend checking the documentation on querying partitioned tables and on partitioned tables in general.

Related

Daily I’m receiving a new table in BigQuery; I want to concatenate this new table's data to the main table (the dataset schemas are the same)

Daily I’m receiving a new table (example: tablename_20220811) in BigQuery, and I want to concatenate this new table's data to the main_table; the dataset schemas are the same.
I tried using wildcards, but I don't know how to pull the daily loaded table.
You can use BigQuery scheduled queries with an interval (cron) in the schedule parameters:
Example with the bq command-line tool:
bq query \
--use_legacy_sql=false \
--destination_table=mydataset.desttable \
--display_name='My Scheduled Query' \
--schedule='every 24 hours' \
--append_table=true \
"SELECT
*
FROM
\`mydataset.tablename_*\`
WHERE
_TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE())"
In order to target the expected table, I used a wildcard and a filter based on the table suffix. The table suffix should be equal to the current date as a STRING in the format yyyymmdd.
The cron schedule runs the query every day.
You can also configure it directly with the Google Cloud console.
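If you just need a one-off append of a single day's shard rather than a recurring schedule, bq cp can also append to an existing table. A quick sketch, reusing the table names from the question:
bq cp -a mydataset.tablename_20220811 mydataset.main_table
The -a (--append_table) flag appends the source table to the destination instead of overwriting it.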
It sounds like you have the right naming format for BigQuery to treat your tables as a single 'date-sharded table'.
You need to ensure that the daily tables
have the same schema
are in the same dataset
have the same name apart from the _yyyymmdd suffix
You will know if this worked because only one table will appear (with an icon showing multiple tables, rather than the usual icon).
With this in hand, you can write queries like
SELECT
fieldA,
fieldB,
FROM
`some_dataset.tablename_*`
WHERE
_table_suffix BETWEEN '20220101' AND '20221201'
This gives you some idea of what's possible:
select from the full date-sharded table using backticks (essential!) and the wildcard syntax
filter using the special _table_suffix meta-field

Get table names in dataset after dataset truncate

It seems that the BigQuery CLI supports restoring tables in a dataset after they have been deleted by using BigQuery Time Travel functionality -- as in:
bq cp dataset.table@TIME_AGO_UNIX dataset.table
However, this assumes we know the names of the tables. I want to write a script to iterate over all the tables that were in the dataset at TIME_AGO_UNIX time.
How would I go about finding those tables at that time?
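For the restore step itself, a loop over bq cp with the snapshot decorator is straightforward once the table names are known. A sketch, assuming the names have already been collected into a file called tables_to_restore.txt (producing that list is the open question above) and using a placeholder timestamp in epoch milliseconds:
#!/bin/bash
# Restore each listed table to its state at TIME_AGO_UNIX (epoch milliseconds).
TIME_AGO_UNIX=1624046611000
while read -r TABLE; do
  bq cp "dataset.${TABLE}@${TIME_AGO_UNIX}" "dataset.${TABLE}"
done < tables_to_restore.txt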

Finding Bigquery table size across all projects

We are maintaining a table in BigQuery that captures all the activity logs from the Stackdriver logs. This table helps me list all the tables present, the user who created each table, the last command run on each table, etc., across projects and datasets in our organization. Along with this information, I also want the table size for the tables I am trying to check.
I can join with TABLES and TABLE_SUMMARY; however, I need to explicitly specify the project and dataset I want to query, while my driving table has details of multiple projects and datasets.
Is there any other metadata table I can get the table size from, or any logs that I can load into a BigQuery table to join and get the desired results?
You can use the bq command-line tool. With the command:
bq show --format=prettyjson project_id:dataset.table
This provides the numBytes, datasetId, projectId and more.
With a script you can use:
bq ls
and loop through the datasets and tables in each project to get the information needed. Keep in mind that you can also use the API or a client library.
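Putting the two commands together, a minimal sketch of such a script, assuming jq is available to parse the JSON output and with placeholder project IDs:
#!/bin/bash
# Sketch: list every dataset and table in each project, then pull numBytes
# (plus the table reference) out of `bq show`.
for PROJECT in my-project-a my-project-b; do
  for DATASET in $(bq ls --project_id="$PROJECT" --max_results=1000 --format=json \
      | jq -r '.[].datasetReference.datasetId'); do
    for TABLE in $(bq ls --project_id="$PROJECT" --max_results=1000 --format=json "$DATASET" \
        | jq -r '.[].tableReference.tableId'); do
      bq show --project_id="$PROJECT" --format=prettyjson "${DATASET}.${TABLE}" \
        | jq -r '[.tableReference.projectId, .tableReference.datasetId, .tableReference.tableId, .numBytes] | @tsv'
    done
  done
done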

Efficient way to copy date-sharded table in BigQuery via the command-line bq utility?

Is there a way to copy a date-sharded table to another dataset via the bq utility?
My current solution is generating a bash script to copy each day one-by-one and splitting the work, but it would be more efficient to do everything in parallel:
#!/bin/sh
bq cp old_dataset.table_20140101 new_dataset.table_20140101
..
bq cp old_dataset.table_20171001 new_dataset.table_20171001
You can specify multiple source tables but only a single destination table (refer to this question), so this may not work for you. However, if your data is date-partitioned (instead of sharded) then you can copy the table in one command.
I recommend you convert the sharded tables into a date-partitioned table, which effectively copies all the sharded tables into a new table. You can do this with the following command:
bq partition old_dataset.table_ new_dataset.partitioned
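If you would rather keep the shards and just run the copies in parallel, another option is to submit the copy jobs asynchronously (the global --nosync flag makes bq return as soon as each job is created) and let BigQuery run them concurrently. A rough sketch for the range in the question, assuming GNU date:
#!/bin/bash
# Sketch: submit one asynchronous copy job per daily shard,
# covering 2014-01-01 through 2017-10-01.
for d in $(seq 0 1369); do
  shard=$(date -d "2014-01-01 + ${d} days" +%Y%m%d)
  bq --nosync cp "old_dataset.table_${shard}" "new_dataset.table_${shard}"
done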

Create partitioned table in BigQuery

Can anyone please suggest how to create a partitioned table in BigQuery?
Example: Suppose I have log data in Google Storage for the year 2016, stored in one bucket and organized by year, month, and date. I want to create a table partitioned by date.
Thanks in advance
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (and filters data for the specific date) and writes to the corresponding partition of a table. For example, to load data for May 1st, 2016 -- you'd specify the destination_table as table$20160501.
Currently, you'll have to run several query jobs to achieve this process. Please note that you'll be charged for each query job based on bytes processed.
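One such query job might look like the sketch below; the staging table over the GCS data (mydataset.staging_logs) and its timestamp column are assumptions, and the $ decorator targets the partition for May 1st, 2016:
bq query \
--use_legacy_sql=false \
--destination_table='mydataset.mytable$20160501' \
--replace \
"SELECT * FROM mydataset.staging_logs WHERE DATE(event_timestamp) = '2016-05-01'"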
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
There are two options:
Option 1
You can load each daily file into a separate table named YourLogs_YYYYMMDD
See details on how to Load Data from Cloud Storage
After the tables are created, you can access them either using Table wildcard functions (Legacy SQL) or using a Wildcard Table (Standard SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples
Option 2
You can create a Date-Partitioned Table (just one table - YourLogs) - but you will still need to load each daily file into its respective partition (a load sketch follows below) - see Creating and Updating Date-Partitioned Tables
After the table is loaded you can easily Query Date-Partitioned Tables
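For Option 2, the per-day load into a partition might look like this sketch; the GCS path, file format, and schema file are assumptions (the same form works for Option 1 by loading into YourLogs_20160501 instead of using the $ decorator):
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
'mydataset.YourLogs$20160501' \
'gs://my-bucket/logs/2016/05/01/*.json' \
./schema.json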
Having partitions for an External Table is not allowed as of now. There is a Feature Request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.