Streaming into BQ partitioned tables - google-bigquery

I'm trying to use Dataflow to stream into a BigQuery partitioned table.
The documentation says that:
Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
I can see that's the case when inserting rows into a date partitioned table.
Is there a way to be able to set the partition time of the rows I want to insert so that BigQuery can infer the correct partition?
So far I've tried doing: tableRow.set("_PARTITIONTIME", millisSinceEpoch);
but I get hit with a "no such field" exception.

As of a month or so ago, you can stream into a specific partition of a date-partitioned table. For example, to insert into the partition for date 20160501 in table T, you can call insertAll with the table name T$20160501.
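For illustration, a minimal sketch of such a streaming insert, assuming the google-cloud-bigquery Java client; the dataset name, table name, and row fields are placeholders, not anything from the question:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.HashMap;
    import java.util.Map;

    public class StreamIntoPartition {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Append the $YYYYMMDD decorator to target the 2016-05-01 partition of table T.
        TableId target = TableId.of("my_dataset", "T$20160501");

        Map<String, Object> row = new HashMap<>();
        row.put("event_ts", System.currentTimeMillis() / 1000.0); // TIMESTAMP as seconds since epoch
        row.put("payload", "hello");

        InsertAllResponse response =
            bigquery.insertAll(InsertAllRequest.newBuilder(target).addRow(row).build());
        if (response.hasErrors()) {
          System.err.println("Insert errors: " + response.getInsertErrors());
        }
      }
    }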

AFAIK, as of this writing, BigQuery does not allow specifying the partition manually per row; it is inferred from the time of insertion.
However, as an alternative to BigQuery's built-in partitioned tables feature, you can use Dataflow's feature for streaming to multiple BigQuery tables at the same time: see Sharding BigQuery output tables.
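A rough sketch of that sharding idea, written against the current Apache Beam BigQueryIO API rather than the original Dataflow 1.x SDK, and routing each element by a date field it carries rather than by its window; the project, dataset, and the event_date field are assumptions:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    public class ShardByEventDate {
      // Routes each row to a per-day table such as my_dataset.events_20160501,
      // based on a date string the row already carries.
      static void writeSharded(PCollection<TableRow> rows) {
        rows.apply(BigQueryIO.writeTableRows()
            .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
              @Override
              public TableDestination apply(ValueInSingleWindow<TableRow> input) {
                String day = (String) input.getValue().get("event_date"); // e.g. "20160501"
                return new TableDestination(
                    "my-project:my_dataset.events_" + day, "events for " + day);
              }
            })
            // The per-day tables are assumed to exist already.
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      }
    }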

Related

BigQuery: Add ingestion timestamp to records ingested via streaming insert

I'm streaming records into BigQuery. I need to record when each row actually makes its way into BigQuery. How can I do this? I don't mind if this is off by up to three seconds.
BigQuery doesn't offer this functionality.
You need to script it so that when a row arrives in BigQuery it already has a column with the correct timestamp value.
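One hedged way to do that scripting from the Java client, assuming you control the insert path; the ingest_ts column name is made up for illustration:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.TableId;
    import java.util.HashMap;
    import java.util.Map;

    public class StampIngestionTime {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        Map<String, Object> row = new HashMap<>();
        row.put("payload", "some record");
        // Stamp the row with the client-side insert time; a few seconds of skew is expected.
        row.put("ingest_ts", System.currentTimeMillis() / 1000.0); // TIMESTAMP as seconds since epoch

        bigquery.insertAll(
            InsertAllRequest.newBuilder(TableId.of("my_dataset", "events")).addRow(row).build());
      }
    }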

How do I create a partitioned table from a non-partitioned table in BigQuery using a query?

I have tried this solution, Migrating from non-partitioned to Partitioned tables, but I get this error:
"Error: Cannot query rows larger than 100MB limit."
Job ID: sandbox-kiana-analytics:bquijob_4a1b2032_15d2c7d17f3.
Vidhya,
I internally looked at the query you sent to BigQuery, and I can see that as part of your query you are using ARRAY_AGG() to put all the data for a day in one row. This results in very large rows, which ultimately exceed BigQuery's 100MB per-row limit. This is a rather complex and inefficient way of partitioning the data. Instead, I suggest using the built-in support for data partitioning provided by BigQuery (example here). In this approach, you create an empty date-partitioned table and add day-partition data to it for each day.
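A sketch of that approach with the google-cloud-bigquery Java client; the dataset, table, and column names are placeholders, and you would repeat the backfill step for each day:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.StandardTableDefinition;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.bigquery.TableInfo;
    import com.google.cloud.bigquery.TimePartitioning;

    public class BackfillPartitions {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bq = BigQueryOptions.getDefaultInstance().getService();

        // 1. Create an empty day-partitioned destination table.
        StandardTableDefinition def = StandardTableDefinition.newBuilder()
            .setTimePartitioning(TimePartitioning.of(TimePartitioning.Type.DAY))
            .build();
        bq.create(TableInfo.of(TableId.of("my_dataset", "events_partitioned"), def));

        // 2. For one day, copy that day's rows into the matching partition
        //    via the $YYYYMMDD decorator (repeat per day).
        QueryJobConfiguration backfill = QueryJobConfiguration.newBuilder(
                "SELECT * FROM my_dataset.events_unpartitioned "
                    + "WHERE DATE(event_ts) = '2016-05-01'")
            .setDestinationTable(TableId.of("my_dataset", "events_partitioned$20160501"))
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .setUseLegacySql(false)
            .build();
        bq.query(backfill);
      }
    }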

Data arriving late in the partitioned table

I'm trying to use the partitioning feature in Google BigQuery.
The logs entering the table appear to be reflected late: if a log occurred at 13:00, there is a difference of approximately 15 minutes before it shows up in the table. Is there any way to get real-time data into the partitioned table?
It is likely that you are adding data to the table using streaming inserts. According to the BigQuery documentation for partitioned tables, data still in the streaming buffer is associated with the _PARTITIONTIME IS NULL partition, so if you use _PARTITIONTIME in the WHERE clause of your query, you are likely missing that data. You can add an explicit _PARTITIONTIME IS NULL to the WHERE clause to see streaming data which is still unpartitioned. It usually makes it into partitions within 15 minutes.
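For example, a query along these lines (here issued from the Java client; the dataset and table names are placeholders) returns both the rows already in a day's partition and the rows still sitting in the streaming buffer:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.QueryJobConfiguration;

    public class QueryStreamingBuffer {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Rows still in the streaming buffer have a NULL _PARTITIONTIME,
        // so include them explicitly alongside the target day's partition.
        QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
                "SELECT * FROM my_dataset.events "
                    + "WHERE _PARTITIONTIME = TIMESTAMP('2016-05-01') "
                    + "   OR _PARTITIONTIME IS NULL")
            .setUseLegacySql(false)
            .build();
        bigquery.query(query).iterateAll().forEach(row -> System.out.println(row));
      }
    }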

Partitioning based on column data?

When creating a partitioned table using bq mk --time_partitioning_type=DAY are the partitions created based on the load time of the data, not a date key within the table data itself?
To create partitions based on dates within the data, is the current approach to manually create sharded tables and load them based on date, as in this post from 2012?
Yes, partitions are created based on data load time, not on the data itself.
You can use a partition decorator (mydataset.mytable1$20160810) if you want to load data into a specific partition.
Per my understanding, partitioning by column is something we should expect to be supported at some point, but not now.
Good news: BigQuery now supports two types of data partitioning, including partitioning by a column. Please check here.
I like this feature: an individual operation can commit data into up to 2,000 distinct partitions.
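For reference, creating a column-partitioned table from the Java client might look roughly like this; the dataset, table, and column names are invented for the example:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.Field;
    import com.google.cloud.bigquery.LegacySQLTypeName;
    import com.google.cloud.bigquery.Schema;
    import com.google.cloud.bigquery.StandardTableDefinition;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.bigquery.TableInfo;
    import com.google.cloud.bigquery.TimePartitioning;

    public class CreateColumnPartitionedTable {
      public static void main(String[] args) {
        BigQuery bq = BigQueryOptions.getDefaultInstance().getService();

        Schema schema = Schema.of(
            Field.of("event_ts", LegacySQLTypeName.TIMESTAMP),
            Field.of("payload", LegacySQLTypeName.STRING));

        // Partition by the event_ts column instead of by load time.
        StandardTableDefinition def = StandardTableDefinition.newBuilder()
            .setSchema(schema)
            .setTimePartitioning(
                TimePartitioning.newBuilder(TimePartitioning.Type.DAY).setField("event_ts").build())
            .build();

        bq.create(TableInfo.of(TableId.of("my_dataset", "events_by_event_ts"), def));
      }
    }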

Import CSV to partitioned table on BigQuery using specific timestamp column?

I want to import a large CSV into a BigQuery partitioned table that has a timestamp-type column which is actually the date of each transaction. The problem is that when I load the data, everything is imported into one partition for today's date.
Is it possible to use my own timestamp value to partition it? How can I do that?
In BigQuery, partitioning based on a specific column is currently not supported, even if that column is date-related (a timestamp).
You either rely on the time of insertion, so the BigQuery engine inserts into the respective partition, or you specify exactly which partition you want to insert your data into.
See more about Creating and Updating Date-Partitioned Tables
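If you pre-split the CSV by day, a load into a specific partition via the decorator could look roughly like this (google-cloud-bigquery Java client; the bucket, file, dataset, and table names are placeholders):

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;

    public class LoadCsvIntoPartition {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bq = BigQueryOptions.getDefaultInstance().getService();

        // Load one day's worth of rows into the matching partition of the table.
        LoadJobConfiguration load = LoadJobConfiguration.newBuilder(
                TableId.of("my_dataset", "transactions$20160501"),
                "gs://my-bucket/transactions-2016-05-01.csv",
                FormatOptions.csv())
            .build();

        Job job = bq.create(JobInfo.of(load));
        job.waitFor();
      }
    }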
The best way to do that today is by using Google Dataflow [1]. You can develop a streaming pipeline which will read the file from a Google Cloud Storage bucket and insert the rows into the BigQuery table.
You will need to create the partitioned table manually [2] before running the pipeline, because Dataflow currently doesn't support creating partitioned tables.
There are multiple examples available at [3]; a condensed sketch of such a pipeline follows the links below.
[1] https://cloud.google.com/dataflow/docs/
[2] https://cloud.google.com/bigquery/docs/creating-partitioned-tables
[3] https://cloud.google.com/dataflow/examples/all-examples
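A condensed sketch of such a pipeline, written against the current Apache Beam API rather than the Dataflow SDK the links above describe; the CSV layout, field names, and table names are all assumptions:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    public class CsvToPartitionedTable {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("gs://my-bucket/transactions.csv"))
            // Assume each line is "transaction_ts,amount" with transaction_ts like "2016-05-01 12:00:00".
            .apply(MapElements.into(TypeDescriptor.of(TableRow.class)).via((String line) -> {
              String[] parts = line.split(",");
              return new TableRow().set("transaction_ts", parts[0]).set("amount", parts[1]);
            }))
            .apply(BigQueryIO.writeTableRows()
                // Route each row into the partition matching its own transaction date.
                .to((ValueInSingleWindow<TableRow> row) -> {
                  String day = ((String) row.getValue().get("transaction_ts"))
                      .substring(0, 10).replace("-", ""); // e.g. "20160501"
                  return new TableDestination("my_dataset.transactions$" + day, null);
                })
                // The partitioned table is assumed to have been created manually beforehand.
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }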