I need to copy a production BigQuery dataset into the testing environment and use it to simulate the pipeline processing with new changes.
However, the production dataset is huge, so I usually only want to keep its most recent data for testing.
To do that, I would like to truncate all partitioned data in my dataset that is older than 30 days.
I tried setting partition expiration at the dataset level, but it doesn't work.
So how can I do that?
I ran some tests on this and confirmed the following.
When you set a default partition expiration at the dataset level, it only applies to new tables.
For existing partitioned tables, you need to set the expiration at the individual table level to expire their partitions.
For example:
ALTER TABLE `gcp_A.dataset_1.measurements`
SET OPTIONS (
-- Sets partition expiration to 30 days
partition_expiration_days=30
);
SELECT MIN(stamp) FROM `gcp_A.dataset_1.measurements`;
-- [result]
-- 2021-06-15 00:00:00 UTC
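If you don't want to wait for the expiration to take effect, a one-off DELETE can clear the old rows immediately; this is only a hedged sketch and assumes `stamp` is the partitioning TIMESTAMP column, as the verification query above suggests.

-- One-off cleanup (assumption: `stamp` is the partitioning column):
-- remove everything older than 30 days right away.
DELETE FROM `gcp_A.dataset_1.measurements`
WHERE stamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);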
Related
I want to store data in BigQuery using specific partitions. The partitions are ingestion-time based. I want to use a range of partitions spanning two years. I use the partition alias destination project-id:data-set.table-id$partition-date.
I get failures since it does not recognise the destination as an alias, but as an actual table.
Is it supported?
When you ingest data into BigQuery, it lands automatically in the corresponding partition. If you choose daily ingestion time as the partition column, that means every new day becomes a new partition. To be able to "backfill" partitions, you need to choose some other column for the partitioning (e.g. a column in the table with the ingestion date). When you write data from Dataflow (from anywhere, actually), the data will be stored in the partition corresponding to the value of that column for each record.
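As a hedged sketch of that approach (hypothetical project, dataset and column names), a table partitioned on a regular DATE column can be created with DDL; every record then lands in the partition matching its event_date value, no matter when it is written:

CREATE TABLE `my-project.my_dataset.events`
(
  event_date DATE,    -- partition column carried in the data itself
  user_id    STRING,
  payload    STRING
)
PARTITION BY event_date;

Backfilling two years of history is then just a matter of writing rows whose event_date falls within that range.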
Direct writes to partitions by ingestion time are not supported using the Write API.
Also, using the streaming API is not supported once a window of 31 days has passed.
From the documentation:
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.
The solution that works is to use BigQuery load jobs to insert the data; they can handle this scenario.
Because this operation involves a lot of I/O (files being created on GCS), it can be lengthy, costly and resource-intensive depending on the data.
One approach is to create table shards, splitting the big table into small ones so that the Storage Read and Write APIs can be used. Load jobs from the sharded tables into the partitioned table then require fewer resources, and the problem is already divided; see the sketch below.
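If the shards end up as date-suffixed tables, the final consolidation step could also be sketched in SQL with a wildcard query that derives the partition column from the shard suffix. This is a hedged illustration with hypothetical table and column names, not the load-job flow described above:

-- Consolidate date-sharded tables (events_YYYYMMDD) into a column-partitioned table;
-- each row lands in the partition for its derived event_date.
INSERT INTO `my-project.my_dataset.events_partitioned` (event_date, user_id, payload)
SELECT
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS event_date,
  user_id,
  payload
FROM `my-project.my_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20200101' AND '20211231';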
I have extensive experience working with Hive partitioned tables. I use Hive 2.x. I was interviewing for a Big Data Solution Architect role and was asked the question below.
Question: How would you ingest streaming data into a Hive table partitioned on date? The streaming data is first stored in an S3 bucket and then loaded into Hive. Although the S3 bucket names have a date identifier such as S3_ingest_YYYYMMDD, the content could have data for more than one date.
My Answer: Since the content could have more than one date, creating an external table might not be possible, because we want to read the files and distribute their rows based on the date. I suggested we first load the S3 bucket into an external staging table with no partitions and then load/insert into the final date-partitioned table using dynamic partition settings, which will dynamically distribute the data to the correct partition directories.
The interviewer said my answer was not correct, and I was curious to know what the correct answer was, but we ran out of time.
The only caveat in my answer is that, over time, the partitioned date directories will accumulate many small files, which can lead to the small-files issue; that can always be handled via a batch maintenance process.
What are the other/correct options to handle this scenario?
Thanks.
It depends on the requirements.
As per my understanding, if one file or folder of S3_ingest_YYYYMMDD files can contain more than one date, then some events are loaded the next day or even later. This is a rather common scenario.
Ingestion date and event date are two different dates. Put the ingested files into a table partitioned by ingestion date (the landing zone, LZ). That lets you track the initial data; if reprocessing is possible, use ingestion_date as a bookmark for reprocessing the LZ table.
Then schedule a process that takes the last two or more days of ingestion dates and loads them into a table partitioned by event_date. The last day will always be incomplete, and you may need to increase the look-back period to 3 or even more ingestion days (using an ingestion_date >= current_date - 2 days filter); it depends on how many days back ingestion may deliver event dates. In this process you use dynamic partitioning by event_date, apply some logic (cleaning, etc.) and load into the ODS or DM; a sketch follows below.
This approach is very similar to what you proposed. The difference is in the first table: it should be partitioned so that you can process data incrementally and do an easy restatement if you need to change the logic or if upstream data was also restated and reloaded into the LZ.
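A minimal HiveQL sketch of that scheduled step, assuming hypothetical lz.events and ods.events tables partitioned by ingestion_date and event_date respectively:

-- Enable dynamic partitioning for the insert (hypothetical table and column names).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Re-load the last 3 ingestion days; rows are distributed into event_date partitions.
INSERT OVERWRITE TABLE ods.events PARTITION (event_date)
SELECT
  e.user_id,
  e.payload,
  e.event_date              -- dynamic partition column must come last
FROM lz.events e
WHERE e.ingestion_date >= date_sub(current_date, 2);

With dynamic partitioning, INSERT OVERWRITE only replaces the event_date partitions that appear in the query result, so the incomplete last day is simply restated on the next run.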
I have a date partitioned table with around 400 partitions.
Unfortunately, one of the columns' datatypes has changed and needs to be converted from INT to STRING.
I can change the datatype as follows:
SELECT
  CAST(change_var AS STRING) AS change_var,
  <rest of columns>
FROM dataset.table_name
and overwrite the table, but the date partitioning is then lost.
Is there any way to keep the partitioning and change a column's datatype?
Option 1.
Export the table partition by partition. I created a simple library to achieve this: https://github.com/rdtr/bq-partition-porter
Then create a new table with the correct type and load the data into the new table again, partition by partition. Be careful about the quota (1,000 exports per day); 400 should be okay.
Option 2.
Using Cloud Dataflow, you can export the whole table and then use DynamicDestination to import the data into BigQuery by partition. If the number of partitions is too large, this approach would satisfy the requirement.
I expect the bq load command to gain some way to specify a partition key field name (since it's already described in bq load help). Until then, you need to follow one of these options.
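If the table is (or can be) partitioned on a date column rather than on ingestion time, a hedged DDL-only sketch is also possible: rebuild the table with the cast applied while declaring the same partitioning. The partition_date column name below is an assumption for illustration.

CREATE TABLE dataset.table_name_v2
PARTITION BY partition_date  -- assumption: the table is partitioned on a DATE column
AS
SELECT
  CAST(change_var AS STRING) AS change_var,
  * EXCEPT (change_var)
FROM dataset.table_name;

For a table partitioned on ingestion time, the per-partition export/reload options above remain the way to preserve each row's original partition.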
I'm running a query job on a large dataset in BigQuery. The job results are stored in a destinationTable. I want the tables to expire either within 1 day or within 1 hour (historical data vs. today's data).
Is there an option to set expirationTime on each table?
I am aware that I can set a defaultExpirationTime on the entire dataset, but since I have different expiration times, this is not an ideal solution.
Check the table's expirationTime property:
expirationTime (long) [Optional] The time when this table expires, in milliseconds since the epoch. If not present, the table will persist indefinitely. Expired tables will be deleted and their storage reclaimed.
You need to set it using the tables.patch API after the table is created or updated (depending on your logic).
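As a hedged alternative to calling the API directly, the same per-table expiration can also be set with DDL once the destination table exists; the table name and the 1-hour window below are illustrative assumptions.

-- Assumed destination table; expire it 1 hour from now (use INTERVAL 1 DAY for historical results).
ALTER TABLE `my-project.my_dataset.todays_results`
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
);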
When creating a partitioned table using bq mk --time_partitioning_type=DAY, are the partitions created based on the load time of the data, not on a date key within the table data itself?
To create partitions based on dates within the data, is the current approach to manually create sharded tables and load them based on date, as in this post from 2012?
Yes, partitions are created based on the data load time, not on the data itself.
You can use a partition decorator (mydataset.mytable1$20160810) if you want to load data into a specific partition.
Per my understanding, partitioning by column is something we should expect to be supported at some point, but not now.
Good news: BigQuery now supports two types of data partitioning, including partitioning by column. Please check here.
I like the feature: An individual operation can commit data into up to 2,000 distinct partitions.
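As a hedged illustration with hypothetical names: because partitioning is by column, a single DML statement can commit rows into many partitions at once, subject to that per-operation limit.

-- One INSERT writes each row into the partition matching its event_date
-- (hypothetical tables; stays within the distinct-partition limit per operation).
INSERT INTO `my-project.my_dataset.events_by_day` (event_date, user_id)
SELECT event_date, user_id
FROM `my-project.my_dataset.events_staging`;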