We are trying to export all HTTP requests to our Google load balancer into BigQuery. Unfortunately, we noticed that data arrives in BigQuery about 3 minutes late.
Starting from this tutorial: https://cloud.google.com/solutions/serverless-pixel-tracking
We created a load balancer that points to a pixel.png in a public storage bucket
Created a sink to export all logs to Pub/Sub
Created a Dataflow job from the provided Pub/Sub-to-BigQuery streaming template
The table is partitioned by date and clustered on hour and minute columns.
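The sink and template job described above might be created like this with the CLI (project, topic, and table names are hypothetical):

```shell
# Route load-balancer request logs to a Pub/Sub topic (names hypothetical)
gcloud logging sinks create lb-requests-sink \
  pubsub.googleapis.com/projects/my-project/topics/lb-requests \
  --log-filter='resource.type="http_load_balancer"'

# Run the provided streaming template from Pub/Sub into BigQuery
gcloud dataflow jobs run lb-requests-to-bq \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region us-central1 \
  --parameters inputTopic=projects/my-project/topics/lb-requests,outputTableSpec=my-project:pixel_tracking.requests
```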
After we scaled to 1,000 requests per second we noticed that data was delayed by 2 or 3 minutes:
SELECT * FROM DATASET ORDER BY Timestamp desc Limit 100
This query executes in a few seconds, but the most recent result is 3 minutes old.
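One way to quantify the delay is to compare the newest row's timestamp against the current time; a sketch using the bq CLI (dataset and table names hypothetical, column name taken from the query above):

```shell
# Seconds between "now" and the newest visible row
bq query --use_legacy_sql=false \
'SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(Timestamp), SECOND) AS lag_seconds
FROM `my-project.my_dataset.requests`'
```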
I am exporting logs for many different resources directly into BigQuery, without using Dataflow or Pub/Sub, and I can see them in real time. If you do not need special pre-processing in Dataflow, you might want to export directly into BigQuery and remove the pieces in between that introduce latency.
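A direct BigQuery log sink, skipping Pub/Sub and Dataflow entirely, might look like this (project and dataset names hypothetical):

```shell
# Export load-balancer logs straight into a BigQuery dataset
gcloud logging sinks create lb-to-bq \
  bigquery.googleapis.com/projects/my-project/datasets/lb_logs \
  --log-filter='resource.type="http_load_balancer"'

# The command prints a writer identity; grant it write access
# to the dataset (e.g. roles/bigquery.dataEditor) so the export works.
```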
Related
I have a Dataflow pipeline which is reading messages from PubSub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me "This query will process 1.9 GB when run", but when I actually run the query I don't get any results. My pipeline has been running for a whole week now, and I am seeing the same behavior for the last two days. However, for 2021-10-11 and the days before that I see actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
    .withSchema(createTableSchema())
    .withFormatFunction(event -> createTableRow(event))
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);
Why is BigQuery taking so long for committing the values to the partitions but at the same time telling me there is actually data available?
EDIT 1:
BigQuery is processing data but not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned table.
Check more details in this Stack Overflow answer and also in the documentation available here.
When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
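One way to see whether a table still has uncommitted streamed rows is to inspect its streamingBuffer metadata from tables.get; a sketch (table name hypothetical, requires jq):

```shell
# Shows estimatedRows/estimatedBytes still in the streaming buffer,
# or null once everything has been committed to the partitions
bq show --format=prettyjson my-project:my_dataset.my_table \
  | jq '.streamingBuffer'
```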
If you are having problems writing the data from Pub/Sub into BigQuery, I recommend using a template available in Dataflow.
Use the Dataflow template available in GCP to write the data from Pub/Sub to BigQuery:
There is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job as follows:
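Those three steps might look like this with the CLI (project, bucket, and table names hypothetical):

```shell
# 1. Subscription on the topic
gcloud pubsub subscriptions create test-sub --topic=test-topic

# 2. Bucket for temporary storage
gsutil mb gs://my-temp-bucket

# 3. Run the provided template from the subscription into BigQuery
gcloud dataflow jobs run test-job \
  --gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
  --region us-central1 \
  --staging-location gs://my-temp-bucket/temp \
  --parameters inputSubscription=projects/my-project/subscriptions/test-sub,outputTableSpec=my-project:my_dataset.output_table
```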
For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork the template's code from GitHub and adjust it to your needs.
I have a very simple process whose first step is to read a BigQuery table:
p.apply("BigQuery data load", BigQueryIO.read().usingStandardSql().fromQuery(BG_SELECT).withoutValidation().withoutResultFlattening())
This step takes about 2-3 minutes to perform (about 1,000 rows retrieved)!
When I look at BigQuery I see multiple log lines linked to my query:
10:54:37.703 BigQuery delete temp_
10:54:37.244 BigQuery delete temp_
10:54:35.492 BigQuery jobcompleted
10:54:34.802 BigQuery insert jobs
10:54:22.081 BigQuery jobcompleted
10:52:33.812 BigQuery insert jobs
10:52:33.106 BigQuery insert datas
10:52:32.410 BigQuery insert jobs
Are these 2 minutes for job completion normal?
(I have no parallel activity on BigQuery.)
How can I get better (normal!) performance?
By default BigQueryIO uses BATCH priority. Batch-mode queries are queued by BigQuery and started as soon as idle resources are available, usually within a few minutes.
You can explicitly set the priority to INTERACTIVE.
p.apply("BigQuery data load", BigQueryIO.readTableRows()
    .withQueryPriority(BigQueryIO.TypedRead.QueryPriority.INTERACTIVE)
    .usingStandardSql()
    .fromQuery(BG_SELECT)
    .withoutValidation()
    .withoutResultFlattening());
Interactive mode allows for BigQuery to execute the query as soon as possible.
I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies), which will later be exported to Google Cloud Storage.
I'm running this query using the BQ CLI with all_companies as the destination table, and this generates a BQ job (runtime: 20 min+).
The company_* tables are populated constantly using the streaming API.
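Such a job, roughly, as a CLI call (project, dataset, and table names hypothetical; the actual merge query may differ):

```shell
# Write the wildcard-query result into a destination table,
# replacing any previous contents
bq query --use_legacy_sql=false \
  --destination_table=my_dataset.all_companies \
  --replace \
  'SELECT * FROM `my-project.my_dataset.company_*`'
```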
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
As described here the query engine will look at both the Columnar Storage and the streaming buffer, so potentially the query should see the streamed data.
It depends what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job then all data in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.
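You can check which case applies by inspecting the job's statistics: creationTime vs. startTime shows how long the job sat in the queue before execution began. A sketch (job ID hypothetical, requires jq):

```shell
# Epoch-millisecond timestamps for when the job was created,
# actually started, and finished
bq show --format=prettyjson -j bqjob_r1234abcd_0001 \
  | jq '.statistics | {creationTime, startTime, endTime}'
```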
I'm trying to bulk-load some records into BigQuery, but it takes a long time to upload even a few thousand records.
I'm using the following command to load a gzipped JSON file. The file has ~2k rows with ~200 columns each:
./bin/bq load --project_id=my-project-id --source_format=NEWLINE_DELIMITED_JSON dataset.table /tmp/file.json.gz
Waiting on bqjob_r3a269dd7388c7b8e_000001579a6e064f_1 ... (50s)
Current status: DONE
This command takes ~50 seconds to load the records. Since I want to load at least 1 million records, this would take ~7 hours, which seems like too much for a tool that is supposed to handle petabytes of data.
Is it possible to speed up the process?
Try using the --nosync flag. This starts an asynchronous job in BigQuery; I found this to have much better performance.
Optimally, I would suggest storing file.json.gz in Google Cloud Storage.
./bin/bq load --nosync
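Loading from Cloud Storage also lets you split the input into many files and load them all in one parallel job via a wildcard URI; a sketch (bucket name hypothetical):

```shell
# Upload pre-split parts, then load them together with a wildcard
gsutil -m cp /tmp/file-*.json.gz gs://my-bucket/loads/
bq load --nosync --source_format=NEWLINE_DELIMITED_JSON \
  dataset.table 'gs://my-bucket/loads/file-*.json.gz'
```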
Like the example shown in https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/cookbook/TriggerExample.java
There is a BigQuery table where new data gets appended every 15 minutes, and the table has a Timestamp column. Is it possible to perform streaming analysis with a fixed-window, time-based trigger on data being added to that BigQuery table, similar to the above example which uses Pub/Sub?
Streaming data out of BigQuery is tricky -- unlike Pub/Sub, BigQuery does not have a "subscribe to notifications" API. Is there a way you can stream upstream of BigQuery -- i.e., can you stream from whatever is pushing the 15-minute updates?