I'm trying to use the partitioning feature in Google BigQuery.
The logs written to the table seem to show up in it late. If a log event occurs at 13:00, it appears in the table roughly 15 minutes later. Is there any way to get data into the partitioned table in real time?
It is likely that you are adding data to the table using streaming inserts. According to the BigQuery documentation for partitioned tables, data still in the streaming buffer is associated with the _PARTITIONTIME IS NULL partition, so if you use _PARTITIONTIME in the WHERE clause of your query you are likely missing that data. You can add an explicit _PARTITIONTIME IS NULL condition to the WHERE clause to see streaming data that is still unpartitioned. It usually makes it into partitions within 15 minutes.
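For example, a query that also picks up rows still in the streaming buffer could look like the sketch below (Python client; project, dataset and table names are placeholders):

```python
# Minimal sketch, assuming a day-partitioned table `my-project.my_dataset.my_table`
# that receives streaming inserts (all names are placeholders).
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT *
FROM `my-project.my_dataset.my_table`
WHERE _PARTITIONTIME = TIMESTAMP('2017-06-01')
   OR _PARTITIONTIME IS NULL  -- rows still in the streaming buffer, not yet partitioned
"""
for row in client.query(sql).result():
    print(dict(row))
```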
I want to store data to BigQuery by using specific partitions. The partitions are ingestion-time based. I want to use a range of partitions spanning over two years. I use the partition alias destination project-id:data-set.table-id$partition-date.
I get failures since it does not recognise the destination as an alias but treats it as an actual table.
Is it supported?
When you ingest data into BigQuery, it will land automatically in the corresponding partition. If you choose a daily ingestion time as partition column, that means that every new day will be a new partition. To be able to "backfill" partitions, you need to choose some other column for the partition (e.g. a column in the table with the ingestion date). When you write data from Dataflow (from anywhere actually), the data will be stored in the partition corresponding to the value of that column for each record.
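A minimal sketch of creating such a column-partitioned table with the Python BigQuery client (all names are placeholders):

```python
# Minimal sketch: a table partitioned on a DATE column rather than on ingestion
# time, so backfilled rows land in the partition matching their own date.
# Project, dataset, table and field names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.my_dataset.events_by_event_date",
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # partition on this column instead of ingestion time
)
client.create_table(table)
```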
Direct writes to ingestion-time partitions are not supported using the Write API.
Also, streaming with the streaming API is not supported once the partition is more than 31 days in the past.
From the documentation:
When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.
The solution that works is to use BigQuery load jobs to insert the data; they can handle this scenario.
Because this operation involves a lot of I/O (files getting created on GCS), it can be lengthy, costly and resource-intensive depending on the data.
One approach is to create table shards, splitting the big table into smaller ones so that the Storage Read and Write APIs can be used. Load jobs from the sharded tables into the partitioned table then require fewer resources, and the problem is already divided.
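A sketch of the load-job approach with the Python client, loading files from GCS directly into one ingestion-time partition via the `table$YYYYMMDD` decorator (bucket, dataset and table names are placeholders; unlike streaming, load jobs are not limited to the 31-day window):

```python
# Sketch: load GCS files into a single ingestion-time partition via the
# `$YYYYMMDD` decorator. All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

dataset_ref = bigquery.DatasetReference("my-project", "my_dataset")
destination = bigquery.TableReference(dataset_ref, "events$20210115")  # target partition

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/export/2021-01-15/*.json",
    destination,
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```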
I am currently working on the optimization of a huge table in Google's BigQuery. The table has approximately 19 billion records, resulting in a total size of 5.2 TB. In order to experiment with performance with regards to clustering and time partitioning, I duplicated the table with time partitioning on a custom DATE column, MyDate, which is frequently used in queries.
When performing a query with a WHERE clause on that column (for instance, WHERE MyDate = "2022-08-08") against the time-partitioned table, the query is quicker and only reads around 20 GB, compared to the 5.2 TB scanned on the table without partitioning. So far, so good.
My issue, however, arises when applying an aggregate function, in my case MAX(MyDate): the queries on the partitioned and the non-partitioned table read the same amount of data and execute in roughly the same time. However, I would have expected the query on the partitioned table to be much quicker, as it only needs to scan a single partition.
There seem to be workarounds that fetch the dataset's metadata (information schema), as described here. However, I would like to avoid solutions like this, as they add complexity to our queries.
Is there a more elegant way to get the MAX of a time-partitioned BigQuery table based on a custom column, without scanning the whole table or fetching metadata from the information schema?
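For reference, the information-schema workaround mentioned above looks roughly like the sketch below (dataset and table names are placeholders). It reads only partition metadata, so it does not scan the table itself:

```python
# Sketch of the INFORMATION_SCHEMA.PARTITIONS workaround (names are placeholders).
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT PARSE_DATE('%Y%m%d', MAX(partition_id)) AS max_mydate
FROM `my-project.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'my_table'
  AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
"""
row = next(iter(client.query(sql).result()))
print(row["max_mydate"])
```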
I'm working with a BigQuery partitioned table. The partition is based on a Timestamp column in the data (rather than ingestion-based). We're streaming data into this table at a rate of several million rows per day.
We noticed that our queries based on specific days were scanning much more data than they should in a partitioned table.
Here is the current state of the UNPARTITIONED partition:
I'm assuming that little blip at the bottom-right is normal (streaming buffer for the rows inserted this morning), but there is this massive block of data between mid-November and early-December that lives in the UNPARTITIONED partition, instead of being sent to the proper daily partitions (the partitions for that period don't appear to exist at all in __PARTITIONS_SUMMARY__).
My two questions are:
Is there a particular reason why these rows would not have been partitioned correctly, while data before and after that period is fine?
Is there a way to 'flush' the UNPARTITIONED partition, i.e. force BigQuery to dispatch the rows to their correct daily partition?
I faced a similar issue, where a lot of rows stayed unpartitioned in a column-based partitioned table. What I observed is that some records were not partitioned because of the source of the streaming insert. As a solution, I ran an UPDATE on the table to set a partitioning date wherever the partitioning column was NULL. To be on the safe side, make sure the partitioning date column is not nullable.
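A sketch of that UPDATE with the Python client (table and column names are hypothetical; note that DML cannot modify rows that are still in the streaming buffer, so this only works once the rows have been flushed):

```python
# Sketch: backfill the partitioning column where it is NULL so the rows move
# out of the NULL partition. `event_ts` and `backup_ts` are hypothetical columns.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
UPDATE `my-project.my_dataset.events`
SET event_ts = backup_ts  -- hypothetical fallback value for the partitioning column
WHERE event_ts IS NULL
"""
client.query(sql).result()  # wait for the DML job to finish
```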
I have tried this solution, Migrating from non-partitioned to Partitioned tables, but I get this error.
"Error: Cannot query rows larger than 100MB limit."
Job ID: sandbox-kiana-analytics:bquijob_4a1b2032_15d2c7d17f3.
Vidhya,
I internally looked at the query you sent to BigQuery, and can see that, as part of your query, you are using ARRAY_AGG() to put all data for a day in one row. This results in very large rows, which ultimately exceed BigQuery's 100 MB per-row limit. This is a rather complex and inefficient way of partitioning the data. Instead, I suggest using the built-in support for data partitioning provided by BigQuery (example here). In this approach, you can create an empty date-partitioned table and add day-partition data to it for each day.
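A rough sketch of that approach with the Python client: create an empty ingestion-time partitioned table, then fill one partition per day with a query job whose destination uses the `table$YYYYMMDD` decorator (all names are placeholders):

```python
# Sketch: empty date-partitioned table plus a per-day backfill job.
# Project, dataset, table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Create an empty ingestion-time (daily) partitioned table.
table = bigquery.Table("my-project.my_dataset.events_partitioned")
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
client.create_table(table, exists_ok=True)

# 2. Write one day's worth of data into its partition.
dataset_ref = bigquery.DatasetReference("my-project", "my_dataset")
destination = bigquery.TableReference(dataset_ref, "events_partitioned$20170501")

job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replaces only that partition
)
sql = """
SELECT * FROM `my-project.my_dataset.events_unpartitioned`
WHERE DATE(event_ts) = '2017-05-01'
"""
client.query(sql, job_config=job_config).result()
```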
I'm trying to use Dataflow to stream into a BQ partitioned table.
The documentation says that:
Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
I can see that's the case when inserting rows into a date partitioned table.
Is there a way to be able to set the partition time of the rows I want to insert so that BigQuery can infer the correct partition?
So far I've tried doing: tableRow.set("_PARTITIONTIME", milliessinceepoch);
but I get hit with a no such field exception.
As of a month or so ago, you can stream into a specific partition of a date-partitioned table. For example, to insert into the partition for date 20160501 in table T, you can call insertAll with the table name T$20160501.
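A small sketch of this with the Python client (which calls insertAll under the hood); project, dataset and field names are placeholders:

```python
# Sketch: stream rows into the 2016-05-01 partition of table T using the
# `$YYYYMMDD` partition decorator. Names and schema are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

dataset_ref = bigquery.DatasetReference("my-project", "my_dataset")
target = bigquery.TableReference(dataset_ref, "T$20160501")

rows = [{"user_id": 123, "event": "click"}]  # must match T's schema
errors = client.insert_rows_json(target, rows)
if errors:
    raise RuntimeError(f"insertAll returned errors: {errors}")
```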
AFAIK, as of writing, BigQuery does not allow specifying the partition manually per row - it is inferred from the time of insertion.
However, as an alternative to BigQuery's built-in partitioned tables feature, you can use Dataflow's feature for streaming to multiple BigQuery tables at the same time: see Sharding BigQuery output tables.
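The linked example targets the older Java Dataflow SDK; as a rough illustration of the same idea with the Apache Beam Python SDK instead, a callable table destination can route each record to its own day-named table (all names and the schema are placeholders):

```python
# Rough sketch: per-day sharded output tables instead of BigQuery partitions.
# Table naming, schema and fields are placeholders.
import apache_beam as beam

def day_table(row):
    # e.g. my-project:my_dataset.events_20160501
    return "my-project:my_dataset.events_" + row["event_date"].replace("-", "")

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"event_date": "2016-05-01", "payload": "x"}])
        | beam.io.WriteToBigQuery(
            table=day_table,  # callable: one destination table per day
            schema="event_date:STRING,payload:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```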