Result of BigQuery job running on a table in which data is loaded via streaming API - google-bigquery

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated continuously via the streaming API.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streaming API adds data to the company_* tables at T0+1min, and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
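For reference, the command I'm running looks roughly like this (project and dataset names are placeholders):
bq query \
  --use_legacy_sql=false \
  --destination_table=mydataset.all_companies \
  'SELECT * FROM `myproject.mydataset.company_*`'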

As described here, the query engine will look at both the columnar storage and the streaming buffer, so the query should potentially see the streamed data.
It depends on what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job, then all data that is in the streaming buffer by T0+20min will be included.
If, on the other hand, the job starts immediately and takes 20 minutes to complete, you will only see the data that is in the streaming buffer at the moment the table is queried.

Related

Why does Spark-BigQuery create extra tables in the dataset?

So I am running a Spark (Scala) serverless Dataproc job that reads and writes data from/to BigQuery.
Here is the code that writes the data:
df.write.format("bigquery").mode(SaveMode.Overwrite).option("table", "table_name").save()
Everything works fine, but these extra tables get created in my dataset in addition to the final table. Do you know why, and what I can do so I won't have them?
Those tables are created as the result of view materialization or of loading the result of a query. They have an expiration time of 24 hours, configurable via the materializationExpirationTimeInMinutes option.
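If the extra tables come from reading a view or a SQL query through the connector, their lifetime can be shortened with that option. A rough sketch (dataset, view and expiration value are placeholders; the option belongs to the connector's read path):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bq-read").getOrCreate()
// Sketch: shorten the lifetime of the connector's temporary materialization tables.
val view = spark.read
  .format("bigquery")
  .option("viewsEnabled", "true")                         // required when reading views
  .option("materializationDataset", "my_temp_dataset")    // dataset where the temp tables land
  .option("materializationExpirationTimeInMinutes", "60") // default is 1440 (24 hours)
  .load("my_project.my_dataset.my_view")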

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which reads messages from Pub/Sub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me This query will process 1.9 GB when run, but when actually running the query I don't get any results. My pipeline has been running for a whole week now, and I have been getting the same (empty) result for the last two days. However, for 2021-10-11 and the days before that, I see actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
    .withSchema(createTableSchema())
    .withFormatFunction(event -> createTableRow(event))
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);
Why is BigQuery taking so long to commit the values to the partitions, while at the same time telling me that data is actually available?
BigQuery is processing data but not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned tables.
Check more details in this Stack Overflow post and also in the documentation available here.
When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
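A quick way to confirm that rows are still sitting in the streaming buffer is to inspect the table metadata (a sketch; the table reference follows the question above):
bq show --format=prettyjson my-project:my-dataset.my-table
# Look for a "streamingBuffer" section (estimatedRows, estimatedBytes, oldestEntryTime);
# it is only present while streamed data is waiting to be committed.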
If you are having problems writing the data from Pub/Sub into BigQuery, I recommend you use a Dataflow template available in GCP to write the data from Pub/Sub to BigQuery.
There is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job from the template (a sketch of the command is shown below).
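A rough sketch of the equivalent gcloud command for creating the job from the template (job name, region, bucket, subscription and table are placeholders):
gcloud dataflow jobs run pubsub-to-bq-test \
  --gcs-location=gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
  --region=us-central1 \
  --staging-location=gs://my-temp-bucket/temp \
  --parameters=inputSubscription=projects/my-project/subscriptions/test-topic-sub,outputTableSpec=my-project:my_dataset.output_table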
For testing, I just sent a message to the topic in JSON format, and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork the template's code from GitHub and adjust it to your needs.

Beam direct-runner slow BigQuery read

I have a very simple process whose first step is to read a BigQuery table:
p.apply("BigQuery data load", BigQueryIO.read().usingStandardSql().fromQuery(BG_SELECT).withoutValidation().withoutResultFlattening())
This step takes about 2-3 minutes to perform (about 1,000 rows retrieved)!
When I look at BigQuery I see multiple log lines linked to my query:
10:54:37.703 BigQuery delete temp_
10:54:37.244 BigQuery delete temp_
10:54:35.492 BigQuery jobcompleted
10:54:34.802 BigQuery insert jobs
10:54:22.081 BigQuery jobcompleted
10:52:33.812 BigQuery insert jobs
10:52:33.106 BigQuery insert datas
10:52:32.410 BigQuery insert jobs
Are these 2 minutes for job completion normal?
(I have no parallel activity on BigQuery.)
How can I get better (normal!) performance?
By default, BigQueryIO uses BATCH priority. Batch-mode queries are queued by BigQuery and are started as soon as idle resources are available, usually within a few minutes.
You can explicitly set the priority to INTERACTIVE:
p.apply("BigQuery data load", BigQueryIO.readTableRows()
.withQueryPriority(BigQueryIO.TypedRead.QueryPriority.INTERACTIVE)
.usingStandardSql()
.fromQuery(BG_SELECT)
.withoutValidation()
.withoutResultFlattening())
Interactive mode allows BigQuery to execute the query as soon as possible.

Is a cost incurred when partitioning date-sharded tables in BigQuery?

The BigQuery documentation quotes this command for creating a partitioned table from existing sharded tables:
bq partition mydataset.sharded_ mydataset.partitioned
(see partitioned tables)
But when I run this, I see that the data is actually getting moved. Since selecting data from large raw tables is very expensive, I wonder how Google applies billing in this situation.
The bq partition CLI command leverages copy jobs rather than queries, which don't incur execution costs (but you do still get charged for the persisted storage that it may generate).
If you're using the CLI, copy jobs can be specified using the bq cp command.
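For example, a single date shard can be copied into the matching partition using the partition decorator syntax (a sketch; the table names follow the example above, and the date suffix is a placeholder):
# Copies one shard into the corresponding partition of the partitioned table; copy jobs incur no query cost.
bq cp mydataset.sharded_20180101 'mydataset.partitioned$20180101'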

Insert BigQuery query results into MySQL

In one of my PHP applications, I need to show a report based on aggregate data fetched from BigQuery. I am planning to execute the queries using a PHP cron job and then insert the data into a MySQL table, from which the report will fetch its data. Is there a better way of doing this, such as inserting the data directly into MySQL without an application layer in between?
Also, I am interested in real-time data, but the daily cron only updates the data once, so the counts will drift from the actual data if I check it again after some time. If I run hourly cron jobs, I am afraid the data reading charges will be high, as I am processing a dataset of about 20 GB. Also, my report cannot be fetched from BigQuery itself; it needs to read its data from the MySQL database.