Google Dataflow SQL | Creating Branches | Error Handling - sql

I'm trying to use Dataflow SQL for stream ingestion.
We have a Pub/Sub topic (source) and a BigQuery table (sink).
To achieve that, we need to follow these steps:
From the BigQuery UI, add a schema for the topic manually.
Question: Can we automate this process using command-line options?
Write the SQL for the transformation and execute it using the gcloud dataflow sql query command (this helps us with dynamic queries and automation).
Question: Suppose a key is missing from some Pub/Sub messages and the pipeline marks those messages as errors in Stackdriver. Can we add a capability such as: if schema validation fails, move the message to table y, else to table x? Or something like: if we get message type y, move it to table y, else to table x?

You can use gcloud to add a schema to a topic. This was actually the only way to do it, at first: https://cloud.google.com/dataflow/docs/guides/sql/data-sources-destinations#gcloud
For saving messages that cannot be parsed into SQL rows, the functionality is often called "dead letter queue". It is available in Beam SQL DDL for Pubsub but is not yet available when using Dataflow SQL through the BigQuery UI. See https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#pubsub
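To make the dead letter queue idea concrete, here is a minimal sketch in plain Python of the routing decision itself. It is not the Beam SQL or Dataflow SQL implementation. The required keys and the function names (route_message, split_messages) are assumptions made up for illustration; the real schema is whatever you attached to the topic.

```python
import json

# Hypothetical required fields, chosen only for this illustration.
REQUIRED_KEYS = {"event_id", "event_type"}


def route_message(raw, required_keys=REQUIRED_KEYS):
    """Return ("main", row) for a valid message, ("dead_letter", info) otherwise."""
    try:
        row = json.loads(raw)
    except (TypeError, ValueError) as exc:
        # Payload is not parseable JSON at all.
        return "dead_letter", {"payload": raw, "error": f"invalid JSON: {exc}"}
    if not isinstance(row, dict):
        return "dead_letter", {"payload": raw, "error": "payload is not a JSON object"}
    missing = sorted(required_keys - set(row))
    if missing:
        # Parseable, but it does not satisfy the expected schema.
        return "dead_letter", {"payload": raw, "error": "missing keys: " + ", ".join(missing)}
    return "main", row


def split_messages(messages):
    """Split raw messages into rows for table x (main) and table y (dead letter)."""
    main, dead_letter = [], []
    for raw in messages:
        destination, record = route_message(raw)
        (main if destination == "main" else dead_letter).append(record)
    return main, dead_letter
```

Keeping the original payload and the failure reason together in the dead letter record makes it possible to inspect or replay rejected messages later.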

Related

Databricks SQL equivalent to "Create Trigger" logic?

Is there a Databricks SQL equivalent to "Create Trigger" logic? Basically every time table X gets new data, a few merge statements need to run on another table.
Alternatively, can a notebook be triggered whenever table X updates?
Even more alternatively, what about monitoring table X with some other Azure service/ADF and triggering required SQL statements?
Desired result is to be able to update a table Y upon updates in table X without this blocking some other activity, so if the solution is code based, it should not block the execution of the rest of the code in a notebook, for example.
If you store your data in Delta Format, you have access to a Change Data Feed.
If the data changes are new files appearing in your data lake, you can also use Autoloader to create a Streaming job that is triggered for each new file.

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which is reading messages from PubSub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me "This query will process 1.9 GB when run". But when I actually run the query, I don't get any results. My pipeline has been running for a whole week now, and I have been getting the same empty result for the last two days. However, for 2021-10-11 and the days before that, I do see actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
    .withSchema(createTableSchema())
    .withFormatFunction(event -> createTableRow(event))
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);
Why is BigQuery taking so long to commit the values to the partitions while at the same time telling me that there is actually data available?
EDIT 1:
BigQuery reports bytes processed but returns no rows because it is also scanning the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned table.
Check more details in this Stack Overflow answer and also in the documentation available here.
"When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column."
If you are having problems writing the data from Pub/Sub to BigQuery, I recommend using a template available in Dataflow.
There is a template for writing data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job as follows:
For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork the template code from GitHub and adjust it to your needs.

How to save a view using federated queries across two projects?

I'm looking to save a view which uses federated queries (from a MySQL Cloud SQL connection) between two projects. I'm receiving two different errors (depending on which project I try to save in).
If I try to save in the project containing the dataset, I get the error:
Not found: Connection my-connection-name
If I try to save in the project that contains the connection, I get the error:
Not found: Dataset my-project:my_dataset
My example query that crosses projects looks like:
SELECT
  bq.uuid,
  sql.item_id,
  sql.title
FROM
  `project_1.my_dataset.psa_v2_202005` AS bq
LEFT OUTER JOIN
  EXTERNAL_QUERY( 'project_2.us-east1.my-connection-name',
    '''SELECT item_id, title
    FROM items''') AS sql
ON
  bq.looks_info.query_item.item_id = sql.item_id
The documentation at https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#known_issues_and_limitations doesn't mention any limitations here.
Is there a way around this so I can save a view using an external connection from one project and dataset from another?
Your BigQuery table is located in US and your MySQL data source is located in us-east1. BigQuery automatically chooses to run the query in the location of your BigQuery table (i.e. in US), but your Cloud SQL for MySQL instance is in us-east1, and that's why your query fails. The BigQuery table and the Cloud SQL instance must be in the same location for this query to succeed.
The solution in cases like this is to move your BigQuery dataset manually to the same location as your Cloud SQL instance by following the steps explained in detail in this documentation. However, us-east1 is not currently supported for copying datasets. I therefore recommend creating a new connection in one of the locations mentioned in the documentation.
I hope you find the above pieces of information useful.

Get the Last Modified date for all BigQuery tables in a BigQuery Project

I have several databases within a BigQuery project which are populated by various job engines and applications. I would like to maintain a dashboard of the Last Modified dates for every table within our project to monitor job failures.
Are there any command line or SQL commands which could provide this list of Last Modified dates?
For a SQL command you could try this one:
#standardSQL
SELECT *, TIMESTAMP_MILLIS(last_modified_time)
FROM `dataset.__TABLES__` where table_id = 'table_id'
That said, I recommend checking whether you can log these errors at the application level. By doing so you can also understand why something didn't work as expected.
If you are already using GCP, you can make use of Stackdriver (it works on AWS as well). We started using it in our projects and I recommend giving it a try. We only tested it with Python applications, so I'm not sure how the tool performs with other clients, but it is probably quite similar.
I've just queried stacked GA4 data using the following code:
FROM analytics_#########.__TABLES__
where table_id LIKE 'events_2%'
I kept the 2 in 'events_2%' so that only the dated daily tables match and my intraday tables (events_intraday_...) are not pulled through as well.
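For the monitoring dashboard, the last_modified_time column of __TABLES__ is a count of milliseconds since the Unix epoch, which is why the SQL above wraps it in TIMESTAMP_MILLIS. A minimal sketch in plain Python of turning such rows into a list of stale tables follows. It assumes the rows have already been fetched as dictionaries; the names last_modified_utc and stale_tables and the 24-hour threshold are illustrative assumptions, not part of any BigQuery API.

```python
from datetime import datetime, timedelta, timezone


def last_modified_utc(last_modified_time_ms):
    """Convert a __TABLES__ last_modified_time (epoch milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(last_modified_time_ms / 1000, tz=timezone.utc)


def stale_tables(rows, now, max_age=timedelta(hours=24)):
    """Return (table_id, last_modified) pairs older than max_age, oldest first.

    rows is assumed to be an iterable of dicts with the keys
    "table_id" and "last_modified_time", as returned by the query above.
    """
    stale = []
    for row in rows:
        modified = last_modified_utc(row["last_modified_time"])
        if now - modified > max_age:
            stale.append((row["table_id"], modified))
    return sorted(stale, key=lambda item: item[1])
```

Passing now in explicitly, rather than reading the clock inside the function, keeps the check easy to test and makes repeated dashboard runs reproducible.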

Google Bigquery query execution using google cloud dataflow

Is it possible to execute a BigQuery query directly from Google Cloud Dataflow and fetch the result, rather than reading the whole table and then applying conditions?
For example, PCollections res=p.apply(BigqueryIO.execute("Select col1,col2 from publicdata:samples.shakespeare where ...."))
Instead of reimplementing iteratively what BigQuery queries already do, we could use the query directly.
Thanks and Regards
Ajay K N
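At the time of writing, BigQueryIO only supports reading from a Table and not a Query or View (FAQ). Later SDK releases added a fromQuery option on the BigQueryIO read transform, so check the documentation for the version you are using.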
One way to work around this is to have your main program create a permanent BigQuery table by issuing a query before you run your Dataflow job. After your job runs, you can delete the table.