How do I add concurrency limits to a scheduled query in BigQuery?

I monitor my logs for an externally controlled event.
The observation of an instance of that externally controlled event triggers a Cloud Function (I intentionally limit the maximum instances of the Cloud Function to 1).
The Cloud Function triggers an “on-demand” Scheduled Query in BigQuery.
First Preference:
Can I set a limit to the number of active instances of a BigQuery on-demand Scheduled Query to just one (kind of like how I limit the instances of the Cloud Function to just one at a time)?
Alternative Possibility:
(If I cannot limit the number of instances directly) Can I prevent the BigQuery on-demand Scheduled Query from running if it was triggered less than five minutes ago?
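For the alternative, one option is to put the guard inside the Cloud Function itself: before triggering the scheduled query, check when its most recent run started and skip if that was under five minutes ago. A minimal sketch, assuming the Python client for the BigQuery Data Transfer Service (scheduled queries are transfer configs) and a hypothetical config resource name:

import datetime

from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp

client = bigquery_datatransfer_v1.DataTransferServiceClient()

# Hypothetical resource name of the scheduled query (a transfer config).
CONFIG = "projects/my-project/locations/us/transferConfigs/my-config-id"

def trigger_if_idle():
    # Find the most recent run; skip if it started under 5 minutes ago.
    latest = max(
        (run.run_time for run in client.list_transfer_runs(parent=CONFIG)),
        default=None,
    )
    now = datetime.datetime.now(datetime.timezone.utc)
    if latest and now - latest < datetime.timedelta(minutes=5):
        return  # a run was triggered less than five minutes ago

    requested = Timestamp()
    requested.GetCurrentTime()
    client.start_manual_transfer_runs(
        request={"parent": CONFIG, "requested_run_time": requested}
    )

Since the Cloud Function is already capped at one instance, this check is not racing against other copies of itself.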

Related

How to batch process data from Google Pub/Sub to Cloud Storage using Dataflow?

I'm building a Change Data Capture pipeline that reads data from a MySQL database and creates a replica in BigQuery. I'll be pushing the changes to Pub/Sub and using Dataflow to transfer them to Google Cloud Storage. I have figured out how to stream the changes, but I need to run batch processing for a few tables in my database.
Can Dataflow be used to run a batch job while reading from an unbounded source like Pub/Sub? Can I run this batch job to transfer data from Pub/Sub to Cloud Storage and then load this data into BigQuery? I want a batch job because a streaming job costs more.
Thanks for the clarification.
First, when you use Pub/Sub in Dataflow (the Beam framework), it's only possible in streaming mode:
Cloud Pub/Sub sources and sinks are currently supported only in streaming pipelines, during remote execution.
If your process doesn't need real time, you can skip Dataflow and save money. For the process I propose, you can use Cloud Functions or Cloud Run (App Engine too if you want, but it's not my first recommendation).
In both cases, create a process (Cloud Run or Cloud Function) that is triggered periodically (every week?) by Cloud Scheduler.
Solution 1
Connect your process to the pull subscription.
Every time you read a message (or a chunk of messages, for example 1000), stream-write it into BigQuery. However, streaming writes are not free in BigQuery ($0.05 per GB).
Loop until the queue is empty. Set the timeout to the max value (9 minutes for Cloud Functions, 15 minutes for Cloud Run) to prevent any timeout issue.
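A minimal sketch of Solution 1, assuming the Python clients and hypothetical project, subscription, and table names:

import json

from google.cloud import bigquery, pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "my-subscription")
bq = bigquery.Client()
TABLE = "my-project.my_dataset.my_table"

while True:
    # Pull a chunk of up to 1000 messages.
    response = subscriber.pull(
        request={"subscription": subscription, "max_messages": 1000}
    )
    if not response.received_messages:
        break  # the queue is empty

    rows = [json.loads(m.message.data) for m in response.received_messages]
    # Streaming insert: billed at $0.05 per GB.
    errors = bq.insert_rows_json(TABLE, rows)
    if not errors:
        # Only ack the messages once they are safely in BigQuery.
        subscriber.acknowledge(
            request={
                "subscription": subscription,
                "ack_ids": [m.ack_id for m in response.received_messages],
            }
        )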
Solution 2
Connect your process to the pull subscription.
Read a chunk of messages (for example 1000) and write them in memory (into an array).
Loop until the queue is empty. Set the timeout to the max value (9 minutes for Cloud Functions, 15 minutes for Cloud Run) to prevent any timeout issue. Also set the memory to the max value (2 GB) to prevent out-of-memory crashes.
Create a load job into BigQuery from your in-memory data array. Here the load job is free, and you are limited to 1,000 load jobs per day and per table.
However, this solution can fail if your app plus the data is larger than the max memory value. An alternative is to create a file in GCS every, for example, 1 million rows (depending on the size and memory footprint of each row). Name the file with a unique prefix, for example the date of the day (YYYYMMDD-tempFileXX), and increment the XX at each file creation. Then create a load job, not from data in memory, but from the data in GCS, with a wildcard in the file name (gs://myBucket/YYYYMMDD-tempFile*). This way, all the files that match the prefix will be loaded.
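A sketch of the load-job variant, again with hypothetical names and assuming rows is the in-memory array built above; load_table_from_json covers the in-memory case and load_table_from_uri the GCS-with-wildcard alternative:

from google.cloud import bigquery

bq = bigquery.Client()
TABLE = "my-project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Variant A: load job directly from the in-memory array of dict rows (free).
bq.load_table_from_json(rows, TABLE, job_config=job_config).result()

# Variant B: load job from the GCS temp files, matched by wildcard (free).
bq.load_table_from_uri(
    "gs://myBucket/YYYYMMDD-tempFile*", TABLE, job_config=job_config
).result()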
Recommendation: Pub/Sub messages are kept for up to 7 days in a subscription. I recommend triggering your process at least every 3 days, so you have time to react and debug before messages are deleted from the subscription.
Personal experience: streaming writes into BigQuery are cheap for a low volume of data. For a few cents, I recommend considering the 1st solution if you can pay for it. The management and the code are smaller/easier!

Service that does advanced queries on a data set, and automatically returns relevant updated results every time new data is added to the set?

I'm looking for a cloud service that can do advanced statistics calculations on a large amount of votes submitted by users, in "real time".
In our app, users can submit different kinds of votes, like picking a favorite, rating 1-5, saying yes/no, etc., on various topics.
We also want to show "live" statistics to the user, showing the popularity of a person, etc. This will be generated by a rather complex SQL query where we calculate the average number of times a person was picked as favorite, divided by the total number of votes and the number of games the person has participated in, etc. The score for the latest X games should also count higher than the overall score for all games. This is just an example; there are several other SQL queries of similar complexity.
All our presentable data (including calculated statistics) is served from Firestore documents, and the votes will be saved as Firestore documents.
Ideally, the Firebase-backend (functions, firestore etc) should not need to know about the query logic.
What I wish for is a pay as you go cloud service that does the following:
I define some schemas and set up the queries we need for the statistics we have (15-20 different SQL queries), like setting up views in MySQL.
On every vote, we push the vote data to this service, which will store it in a row.
The service should then, based on its knowledge of the defined queries and the content of the pushed vote data, determine which statistics are affected by the newly added row, and recalculate these. A specific vote type can affect one or more statistics.
Every time a statistic is recalculated, the result should be automatically pushed back to our Firebase backend (for instance by calling an HTTPS endpoint that hits a cloud function) - so we can update the relevant Firestore documents.
The service should be able to throttle the calculations, like only regenerating new statistics every 1 minute despite having several votes per second on the same topic.
Is there any product like this in the market? Or can it be built by combining available cloud services? And what is the official term for such a product, if I should search for it myself?
I know that I can probably build a solution like this myself and run it on a cloud-hosted database server that can scale as our needs grow - but I believe I'm not the first developer with a need for this, so I hope someone has solved it before me :)
You can leverage the existing cloud services available on the Google Cloud Platform.
Google BigQuery, Google Cloud Firestore, Google App Engine (CRON Jobs), Google Cloud Tasks
The services can be used to solve the problems mentioned above:
1) Google BigQuery: here you can define a schema for the data on which you're going to run the SQL queries. BigQuery supports standard and legacy SQL.
2) Every vote can be pushed to the defined BigQuery tables using its streaming insert service.
3) Every pushed vote can trigger the recalculation service, which calculates the statistics by executing the defined SQL queries; the query results can be stored as documents in collections in Google Cloud Firestore.
4) Google Cloud Firestore: here you can store the live statistics of the user. This is a real-time database, so you'll be able to configure listeners for modifications to the statistics and show them as soon as the statistics are recalculated.
5) In the same service that inserts every vote, create a new record with a "syncId" in another table. The idea is to group the votes cast in a particular interval under a corresponding syncId. The syncId can be suffixed with a timestamp. According to your requirements, a particular time interval can be set so that the recalculation is triggered by the CRON jobs service, which invokes the recalculation service within the interval. Once the recalculation related to a particular syncId is completed, the record corresponding to the syncId should be marked as completed.
We are leveraging the above technologies to build a web application on Google Cloud Platform, where the inputs are recorded in Google Firestore and then stream-inserted into Google BigQuery. The data stored in BigQuery is queried 30 seconds after each update using SQL queries, and the query results are stored in Google Cloud Firestore to serve dashboards, which are automatically updated using listeners configured on the collection in which the dashboard information is stored.
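A minimal sketch of the recalculation service (steps 3 and 5 above), assuming the Python clients, a hypothetical statistics query, and a hypothetical Firestore collection:

from google.cloud import bigquery, firestore

bq = bigquery.Client()
db = firestore.Client()

# Hypothetical example; in practice one of the 15-20 predefined statistics SQLs.
QUERY = """
SELECT person_id, AVG(is_favorite) AS favorite_rate, COUNT(*) AS votes
FROM `my-project.votes.votes`
GROUP BY person_id
"""

def recalculate():
    for row in bq.query(QUERY).result():
        # Write each recalculated statistic back to Firestore; the listeners
        # configured on this collection update the live dashboards.
        db.collection("statistics").document(str(row["person_id"])).set(
            {"favoriteRate": row["favorite_rate"], "votes": row["votes"]}
        )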

Azure Function to Azure SQL - Performance and Scaling

I have an Azure Function that writes to Azure SQL. It currently reads from a topic but could be changed to read from a queue. Is there a preference to read from a topic or a queue?
There are 200K messages per hour hitting the topic. I need to write the 200K messages per hour to Azure SQL. During processing I regularly get the error "The request limit for the database is 60 and has been reached.". I understand that I've hit the maximum number of DB connections. Is there a way to stop Azure from scaling up the number of Azure Function instances? What's the best way to share SQL connections?
Any other Azure Function to Azure SQL performance tips?
Thanks,
Richard
There is no well-defined way to achieve this with Service Bus. You may want to play with the host.json file and change the maxConcurrentCalls parameter:
"serviceBus": {
// The maximum number of concurrent calls to the callback the message
// pump should initiate. The default is 16.
"maxConcurrentCalls": XYZ,
}
but it only controls the number of parallel calls on a single instance.
I would suggest you look at Event Hubs. You get at least 2 bonuses:
You can switch to batches of events instead of one-by-one processing. This is usually a very effective way to insert a large amount of data into a SQL table (see the sketch below).
Max concurrency is limited by the number of Event Hub partitions, so you know the hard limit of concurrent calls.
On the downside, you would lose some Service Bus features like dead lettering, auto retries etc.
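As an illustration of the batching bonus, here is a sketch of an Azure Function in Python, assuming an Event Hub trigger configured with cardinality "many" in function.json and a hypothetical target table and app setting:

import json
import os
from typing import List

import azure.functions as func
import pyodbc

def main(events: List[func.EventHubEvent]):
    # With cardinality "many", one invocation receives a whole batch of events.
    rows = [json.loads(e.get_body()) for e in events]

    conn = pyodbc.connect(os.environ["SqlConnectionString"])  # hypothetical setting
    cursor = conn.cursor()
    cursor.fast_executemany = True  # send the whole batch in one round trip
    cursor.executemany(
        "INSERT INTO dbo.Messages (Id, Payload) VALUES (?, ?)",
        [(r["id"], json.dumps(r)) for r in rows],
    )
    conn.commit()

Batching like this keeps the number of SQL connections proportional to the number of function instances rather than to the number of messages.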

BigQuery External Data Source Query Quotas

I have a BigQuery table set up with a Cloud BigTable external data source. This works fine, and I'm able to run queries joining my BigTable data to some of my other BigQuery data. However, when I run too many queries against this table simultaneously, I get the following error:
Error encountered during job execution:
Exceeded rate limits: too many concurrent queries that read Cloud Bigtable data sources for this project. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
I can't find any documentation about the limits on concurrent queries on the linked page or on the BigQuery Quotas and Limits page. I'm not running that many queries here - max 10 at a time. Has anyone run into this before who knows what the actual concurrent query limit is?
edit:
So people don't have to dig through the attached Google ticket, the correct answer (as of April 2018) is 4 concurrent queries.
You should rather look at the Quotas & Limits for Query Jobs:
The following limits apply to query jobs created automatically by running interactive queries and to jobs submitted programmatically using jobs.query and query-type jobs.insert method calls.
Concurrent rate limit for on-demand, interactive queries — 50 concurrent queries
Queries with results that are returned from the query cache, and dry run queries do not count against this limit. You can specify a dry run query using the --dry_run flag or by setting the dryRun property in a query job.
Concurrent rate limit for queries that contain user-defined functions (UDFs) — 6 concurrent queries
The concurrent rate limit for queries that contain UDFs includes both interactive and batch queries. Interactive queries that contain UDFs also count toward the concurrent rate limit for interactive queries.
You can find more in the provided link.
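As a side note, the dry-run exemption mentioned above can be used from the Python client like this (the query text is just a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT name FROM `my-project.my_dataset.my_table`", job_config=job_config
)
print("Bytes that would be processed:", job.total_bytes_processed)

A dry run validates the query and estimates its cost without counting against the concurrent query limits.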

Inserting into BigQuery via load jobs (not streaming)

I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs - not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but it's now possible to use a micro-batching tactic with a streaming pipeline. If you decide to go one way or the other, you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are 2 different methods: FILE_LOADS and STREAMING_INSERTS. If you use the first one, you need to include withTriggeringFrequency and withNumFileShards. From my experience, it's better to use minutes for the frequency, and the exact number will depend on your throughput. If you receive quite a lot, try to keep it small; I have seen "stuck" errors when you increase it too much. The shards mostly affect your GCS billing: too many shards means more files created per table per X minutes.
If your input data size is not so big, streaming inserts can work really well and the cost shouldn't be a big deal. In that scenario you can use the STREAMING_INSERTS method and remove withTriggeringFrequency and withNumFileShards. Also, you can add withFailedInsertRetryPolicy, like InsertRetryPolicy.retryTransientErrors(), so no rows are lost (keep in mind that idempotency is not guaranteed with STREAMING_INSERTS, so duplication is possible).
You can check your jobs in BigQuery and validate that everything is working! Keep in mind the BigQuery job policies (I think it's 1,000 load jobs per table per day) when you are trying to define the triggering frequency and shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).