I have a Dataflow pipeline which is reading messages from PubSub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me This query will process 1.9 GB when run. But when actually running the query I don't get any results. My pipeline is running for a whole week now and I am getting the same results for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
.withSchema(createTableSchema())
.withFormatFunction(event -> createTableRow(event))
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
.to(TABLE);
Why is BigQuery taking so long for committing the values to the partitions but at the same time telling me there is actually data available?
EDIT 1:
BigQuery is processing data and not returning any rows because its processing also the data in your streaming buffer. Data on buffer is can take up to 90 min to be committed in the partitioned tables.
Check more details in this stack and also in the documentation available here.
When streaming to a partitioned table, data in the
streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
If you are having problems to write the data from pubsub in BigQuery, I recommend you to use an template avaiable in dataflow.
Use an Dataflow template avaiable in GCP to write the data from PubSub to BigQuery:
There is an tempate to write data from a pubsub topic to bigquery and it already takes care of the possible corner cases.
I tested it as following and works perfectly:
Create a subscription in you PubSub topic;
Create bucket for temporary storage;
Create the job as following:
For testing, I just sent a message to the topic in json format and the new data was added in the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork from the templates code from github and adjust it for your need.
Related
I am running the following query in my BQ console to see the query history data:
select * from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT;
I can see the all the data query data present in the results. I came across Audit Logs https://cloud.google.com/bigquery/docs/reference/auditlogs
I have created the sink using command:
gcloud logging sinks create bq-audit-sink pubsub.googleapis.com/projects/my-project/topics/bq_audit --log-filter='protoPayload.metadata."#type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata
But i am not able to find the query data i.e past query which is fired and the information about the job.
How to get the data is which we can get via INFORMATION_SCHEMA.JOBS_BY_PROJECT view.
The INFORMATION_SCHEMA table is a historical record, the log sync receives events when they flow through the log mechanism. The sink doesn't get backfilled with events from before the sink was setup, if that was your hope.
Are you not receiving any events in the pubsub topic? Try running a query in the instrumented project and observe what's emitted into the pubsub topic.
We are trying to export all the http request to our google load balancer into big query. Unfortunately we notice that data arrives 3 minutes later to BigQuery.
Starting from this tutorial:https://cloud.google.com/solutions/serverless-pixel-tracking
We created a Load Balancer that points to a pixel.png on a public storage
Created a sink to export all log to Pub/Sub
Created DataFlow with streaming insert pub/sub to BigQuery Table with provided template
Table is partitioned on date and has a cluster column on hour and minutes.
After we scale to 1000 request per seconds we noticed that data was delayed by 2 or 3 minutes
SELECT * FROM DATASET ORDER BY Timestamp desc Limit 100
this query will be executed with few seconds but the last result is 3 minutes old
I am exporting log for a lot of different resources into BigQuery directly without using dataflow or pub/sub and I can see them in realtime. If yuo do not need to do some special pre-processing in dataflow, you might want to try to export directly into BigQuery and remove other stuff in between that introduce latency.
I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streamingAPI.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
As described here the query engine will look at both the Columnar Storage and the streaming buffer, so potentially the query should see the streamed data.
It depends what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job then all data in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.
I'm getting this following error when trying to delete records from a table created through GCP Console and updated with GCP BigQuery Node.js table insert function.
UPDATE or DELETE DML statements are not supported over table stackdriver-360-150317:my_dataset.users with streaming buffer
The table was created without streaming features. And from what I'm reading in documentation Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements.
Does it mean that once a record has been inserted with this function into a table, there's no way to delete records? At all? If that's the case, does it mean that table needs to be deleted and recreated from scratch? If that's not the case. Can you please suggest a workaround to avoid this issue?
Thanks!
Including new error message for SEO: "UPDATE or DELETE statement over table ... would affect rows in the streaming buffer, which is not supported" -- Fh
To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer or, when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column, so even with a simple WHERE query can be checked.
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table but it can take up to 90 minutes to become available for copy/export and other operations. You probably have to wait up to 90 minutes so all buffer is persisted on the cluster. You can use queries to see if the streaming buffer is empty or not like you mentioned.
If you use load job to create the table, you won't have streaming buffer, but probably you streamed some values to it.
Note the answer below to work with tables that have ongoing streaming buffers. Just use a WHERE to filter out the latest minutes of data and your queries will work. -- Fh
Make sure to change your filters so they don't include data that could be in the current streaming buffer.
For example, this query fails while I'm streaming to this table:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
Error: UPDATE or DELETE statement over table project.dataset.table would affect rows in the streaming buffer, which is not supported
You can fix it by only deleting older records:
DELETE FROM `project.dataset.table`
WHERE id LIKE '%-%'
AND ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 40 MINUTE)
4282 rows affected.
Like the example shown in https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/cookbook/TriggerExample.java
There is a BigQuery table where new data gets appended eveny 15 minutes. There is a Timestamp column in table. Is it possible to perform streaming analysis by fixedWindow time-based trigger from data being added to that BigQuery table? similar to the above example which uses pub/sub?
Streaming data out of BigQuery is tricky -- unlike PubSub, BigQuery does not have a "subscribe to notifications" API. Is there a way you can stream upstream from BigQuery -- i.e., can you stream from whoever is pushing the 15-minute updates?