Process streaming data from BigQuery table using Dataflow

Process streaming data from BigQuery table using Dataflow - google-bigquery

Like the example shown in https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/cookbook/TriggerExample.java
There is a BigQuery table where new data gets appended eveny 15 minutes. There is a Timestamp column in table. Is it possible to perform streaming analysis by fixedWindow time-based trigger from data being added to that BigQuery table? similar to the above example which uses pub/sub?

Streaming data out of BigQuery is tricky -- unlike PubSub, BigQuery does not have a "subscribe to notifications" API. Is there a way you can stream upstream from BigQuery -- i.e., can you stream from whoever is pushing the 15-minute updates?

Related

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which is reading messages from PubSub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me This query will process 1.9 GB when run. But when actually running the query I don't get any results. My pipeline is running for a whole week now and I am getting the same results for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
.withSchema(createTableSchema())
.withFormatFunction(event -> createTableRow(event))
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
.to(TABLE);
Why is BigQuery taking so long for committing the values to the partitions but at the same time telling me there is actually data available?
EDIT 1:

BigQuery is processing data and not returning any rows because its processing also the data in your streaming buffer. Data on buffer is can take up to 90 min to be committed in the partitioned tables.
Check more details in this stack and also in the documentation available here.
When streaming to a partitioned table, data in the
streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
If you are having problems to write the data from pubsub in BigQuery, I recommend you to use an template avaiable in dataflow.
Use an Dataflow template avaiable in GCP to write the data from PubSub to BigQuery:
There is an tempate to write data from a pubsub topic to bigquery and it already takes care of the possible corner cases.
I tested it as following and works perfectly:
Create a subscription in you PubSub topic;
Create bucket for temporary storage;
Create the job as following:
For testing, I just sent a message to the topic in json format and the new data was added in the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork from the templates code from github and adjust it for your need.

Result of Bigquery job running on a table in which data is loaded via streamingAPI

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streamingAPI.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?

As described here the query engine will look at both the Columnar Storage and the streaming buffer, so potentially the query should see the streamed data.
It depends what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job then all data in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.

Update or Delete tables with streaming buffer in BigQuery?

I'm getting this following error when trying to delete records from a table created through GCP Console and updated with GCP BigQuery Node.js table insert function.
UPDATE or DELETE DML statements are not supported over table stackdriver-360-150317:my_dataset.users with streaming buffer
The table was created without streaming features. And from what I'm reading in documentation Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements.
Does it mean that once a record has been inserted with this function into a table, there's no way to delete records? At all? If that's the case, does it mean that table needs to be deleted and recreated from scratch? If that's not the case. Can you please suggest a workaround to avoid this issue?
Thanks!
Including new error message for SEO: "UPDATE or DELETE statement over table ... would affect rows in the streaming buffer, which is not supported" -- Fh

To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer or, when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column, so even with a simple WHERE query can be checked.
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table but it can take up to 90 minutes to become available for copy/export and other operations. You probably have to wait up to 90 minutes so all buffer is persisted on the cluster. You can use queries to see if the streaming buffer is empty or not like you mentioned.
If you use load job to create the table, you won't have streaming buffer, but probably you streamed some values to it.
Note the answer below to work with tables that have ongoing streaming buffers. Just use a WHERE to filter out the latest minutes of data and your queries will work. -- Fh

Make sure to change your filters so they don't include data that could be in the current streaming buffer.
For example, this query fails while I'm streaming to this table:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
Error: UPDATE or DELETE statement over table project.dataset.table would affect rows in the streaming buffer, which is not supported
You can fix it by only deleting older records:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
AND ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 40 MINUTE)
4282 rows affected.

Insert bigquery query result to mysql

In one of my PHP application, I need to show a report based on the aggregate data, which is fetched from BigQuery. I am planning to execute the queries using a PHP cron job then insert data to MySQL table from which the report will fetch data. Is there any better way of doing this like directly insert the data to MySQL without an application layer in between ?
Also I am interested in real time data, but the daily cron only update data once and there will be some mismatch of the counts with actual data if I check it after some time. If I run hourly cron jobs, I am afraid the data reading charges will be high as I am processing a dataset which is 20GB. Also my report cannot be fetched fro Bigquery itself and it needs to have data from MySQL database.

Google Bigquery query execution using google cloud dataflow

Is it possible to execute Bigquery's query using Google cloud data flow directly and fetch data, not reading data from table then putting conditions?
For example, PCollections res=p.apply(BigqueryIO.execute("Select col1,col2 from publicdata:samples.shakeseare where ...."))
Instead of reinventing using iterative method what Bigquery queries already implemented, we can use the same directly.
Thanks and Regards
Ajay K N

BigQueryIO currently only supports reading from a Table and not a Query or View (FAQ).
One way to work around this is in your main program to create a BigQuery permanent table by issuing a query before you run your Dataflow job. After, your job runs you could delete the table.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Process streaming data from BigQuery table using Dataflow - google-bigquery

Streaming data out of BigQuery is tricky -- unlike PubSub, BigQuery does not have a "subscribe to notifications" API. Is there a way you can stream upstream from BigQuery -- i.e., can you stream from whoever is pushing the 15-minute updates?

Related

BigQuery streaming insert from Dataflow - no results

Result of Bigquery job running on a table in which data is loaded via streamingAPI

Update or Delete tables with streaming buffer in BigQuery?

Insert bigquery query result to mysql

Google Bigquery query execution using google cloud dataflow

Categories

Resources