As the title says, can you have more than one query in an Azure Stream Analytics job? If so, how should that be structured?
Yes, you can have multiple queries in a Stream Analytics job.
You would do something like the following:
select * into type1Output from inputSource where type = 1
select * into type2Output from inputSource where type = 2
The job has two outputs defined, called type1Output and type2Output. Each query writes to a different output.
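If both branches read from the same input, you can also factor the shared read into a WITH step so it is defined only once; a minimal sketch (the step name is illustrative):
WITH SourceData AS (
    SELECT *
    FROM inputSource
)
SELECT * INTO type1Output FROM SourceData WHERE type = 1
SELECT * INTO type2Output FROM SourceData WHERE type = 2
Each output still needs to be defined on the job, exactly as above.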
I have a Dataflow pipeline which reads messages from Pub/Sub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me "This query will process 1.9 GB when run." But when actually running the query I don't get any results. My pipeline has been running for a whole week now and I am getting the same result for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
.withSchema(createTableSchema())
.withFormatFunction(event -> createTableRow(event))
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
.to(TABLE);
Why is BigQuery taking so long to commit the values to the partitions, while at the same time telling me there is actually data available?
BigQuery is processing data but not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned tables.
You can find more details in this related answer and in the documentation available here.
When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
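As an aside, for an ingestion-time partitioned table that NULL value gives you a direct way to count what is still waiting in the buffer; a sketch illustrating the quoted behaviour (the table name is a hypothetical ingestion-time partitioned table, not the column-partitioned table from the question):
-- Rows that are still in the streaming buffer
SELECT COUNT(*) AS rows_in_buffer
FROM `my_project.my_dataset.my_ingestion_time_table`
WHERE _PARTITIONTIME IS NULL;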
If you are having problems writing the data from Pub/Sub into BigQuery, I recommend using a Dataflow template available in GCP. There is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job from the template.
For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
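You can then confirm that the row arrived with a quick query; a sketch (the dataset and table names are assumptions, the value comes from the test message above and presumes item lands as a STRING column):
SELECT field_dt, field_ts, item
FROM `my-project.my_dataset.output_table`
WHERE item = '9999';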
If you want something more complex, you can fork the template's code on GitHub and adjust it to your needs.
I am running the following query in my BQ console to see the query history data:
select * from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT;
I can see all the query data present in the results. I also came across Audit Logs: https://cloud.google.com/bigquery/docs/reference/auditlogs
I have created the sink using command:
gcloud logging sinks create bq-audit-sink pubsub.googleapis.com/projects/my-project/topics/bq_audit --log-filter='protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"'
But I am not able to find the query data, i.e. the past queries that were fired and the information about those jobs.
How can I get the data that is available via the INFORMATION_SCHEMA.JOBS_BY_PROJECT view?
The INFORMATION_SCHEMA view is a historical record, whereas the log sink only receives events as they flow through the logging mechanism. The sink doesn't get backfilled with events from before it was set up, if that was your hope.
Are you not receiving any events in the Pub/Sub topic at all? Try running a query in the instrumented project and observe what's emitted to the topic.
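For the history itself, the INFORMATION_SCHEMA view you are already querying is the right source; a sketch that narrows it down to query jobs (the region and the 7-day look-back are assumptions):
SELECT
  creation_time,
  user_email,
  job_id,
  statement_type,
  query,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY creation_time DESC;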
Trying to use Dataflow SQL for Stream ingestion:
We have a Pubsub topic (source) and BigQuery Table (sink).
To achieve that, we need to follow these steps:
From the BigQuery UI, add the schema for the topic manually.
Question: Can we automate this process using command-line options?
Write the SQL for the transformation and execute it using the gcloud Dataflow SQL query command (this helps us with dynamic queries and automation).
Question: Suppose a key is missing from some Pub/Sub messages, so the pipeline marks those messages as errors in Stackdriver. Can we add some capability so that, if schema validation fails, the message is routed to table y, and otherwise to table x? Something like: if we get message type y, write to table y, else to table x?
You can use gcloud to add a schema to a topic. This was actually the only way to do it, at first: https://cloud.google.com/dataflow/docs/guides/sql/data-sources-destinations#gcloud
For saving messages that cannot be parsed into SQL rows, the functionality is often called "dead letter queue". It is available in Beam SQL DDL for Pubsub but is not yet available when using Dataflow SQL through the BigQuery UI. See https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#pubsub
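For reference, in Beam SQL the dead letter topic is declared as a table property on the Pub/Sub external table; a sketch based on the linked page (the project, topics and payload fields are placeholders):
CREATE EXTERNAL TABLE my_topic (
    event_timestamp TIMESTAMP,
    attributes MAP<VARCHAR, VARCHAR>,
    payload ROW<item VARCHAR, field_ts TIMESTAMP>
)
TYPE pubsub
LOCATION 'projects/my-project/topics/my-topic'
TBLPROPERTIES '{"deadLetterQueue": "projects/my-project/topics/my-dead-letter-topic"}'
Messages that cannot be parsed into the payload row are then published to the dead letter topic instead of failing the pipeline.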
I have a scenario in which I am ingesting data from an MS SQL DB into Azure Data Lake using U-SQL. My table is quite big, with over 16 million records (soon it will be much more). I just do a SELECT a, b, c FROM dbo.myTable;
I realized, however, that only one vertex is used to read from the table.
My question is, is there any way to leverage parallelism while reading from a SQL table?
I don't believe parallelism for external data sources is supported yet for U-SQL (although happy to be corrected). If you feel this is an important missing feature you can create a request and vote for it here:
https://feedback.azure.com/forums/327234-data-lake
As a workaround, you could manually parallelise your queries, depending on the columns available in your data source, e.g. by date:
// External query working
USE DATABASE yourADLADB;
// Create the external query for year 2016
@results2016 =
    SELECT *
    FROM EXTERNAL yourSQLDBDataSource EXECUTE
        @"SELECT * FROM dbo.yourBigTable WITH (NOLOCK) WHERE yourDateCol BETWEEN '1 Jan 2016' AND '31 Dec 2016'";

// Create the external query for year 2017
@results2017 =
    SELECT *
    FROM EXTERNAL yourSQLDBDataSource EXECUTE
        @"SELECT * FROM dbo.yourBigTable WITH (NOLOCK) WHERE yourDateCol BETWEEN '1 Jan 2017' AND '31 Dec 2017'";

// Output 2016 results
OUTPUT @results2016
TO "/output/bigTable/results2016.csv"
USING Outputters.Csv();

// Output 2017 results
OUTPUT @results2017
TO "/output/bigTable/results2017.csv"
USING Outputters.Csv();
Now, I have created a different issue by breaking up the files into multiple parts. However, you could then read these using file sets, which will also parallelise, e.g.:
@input =
EXTRACT
... // your column list
FROM "/output/bigTable/results{year}.csv"
USING Extractors.Csv();
I would ask why you are choosing to move such a large file into your lake, given ADLA and U-SQL offer you the ability to query data where it lives. Can you explain further?
Queries to external data sources are not automatically parallelized in U-SQL. (This is something we are considering for the future.)
wBob's answer does give one option for achieving somewhat the same effect - though it of course requires you to manually partition and query the data using multiple U-SQL statements.
Please note that doing parallel reads in a non-transacted environment can lead to duplicate or missed data if parallel writes occur at the source. So some care needs to be taken, and users will need to know the trade-offs.
Another potential solution here would be to create an HDInsight cluster backed by the same ADLS store as your ADLA account.
You can then use Apache Sqoop to copy the data in parallel from SQL server to a directory in ADLS, and then import that data (which will be split across multiple files) to tables using U-SQL.
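That last U-SQL step could look something like the following; a sketch only, with the path, schema and table name all assumed, and it presumes dbo.myTable already exists in yourADLADB:
USE DATABASE yourADLADB;

// Read the Sqoop part files in parallel via a file set; the {partName}
// token becomes a virtual column that must be declared in the EXTRACT schema.
@sqoopFiles =
    EXTRACT a string,
            b string,
            c string,
            partName string
    FROM "/sqoop/bigTable/part-{partName}.csv"
    USING Extractors.Csv();

// Load the extracted rows into the managed U-SQL table.
INSERT INTO dbo.myTable
SELECT a, b, c
FROM @sqoopFiles;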
I have several databases within a BigQuery project which are populated by various jobs engines and applications. I would like to maintain a dashboard of all of the Last Modified dates for every table within our project to monitor job failures.
Are there any command line or SQL commands which could provide this list of Last Modified dates?
For a SQL command you could try this one:
#standardSQL
SELECT *, TIMESTAMP_MILLIS(last_modified_time)
FROM `dataset.__TABLES__` where table_id = 'table_id'
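If you want this for every table in a dataset at once, you can drop the table_id filter; a sketch (the dataset name is a placeholder):
#standardSQL
SELECT
  dataset_id,
  table_id,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `my_dataset.__TABLES__`
ORDER BY last_modified DESC
Note that __TABLES__ is scoped to a single dataset, so a project-wide dashboard needs one such query per dataset.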
I recommend, though, that you see if you can log these errors at the application level. By doing so you can also understand why something didn't work as expected.
If you are already using GCP you can make use of Stackdriver (it works on AWS as well). We started using it in our projects and I recommend giving it a try (we tested it with Python applications, though; I am not sure how the tool performs with other clients, but it might be quite similar).
I've just queried stacked GA4 data using the following code:
SELECT *, TIMESTAMP_MILLIS(last_modified_time)
FROM analytics_#########.__TABLES__
WHERE table_id LIKE 'events_2%'
I have kept the 2 in the events_ prefix to ensure my intraday tables do not pull through as well.