BigQuery - Move from a synchronous job to an asynchronous job

After a certain time, I would like to make a synchronous job (launched with jobs.query) asynchronous (as if it were launched with jobs.insert) but I don't want to launch a new job. Is it possible?

When you say jobs.query here, I'm assuming you're referring to the BigQuery REST API.
One approach to making your query job asynchronous, per their docs, is as follows:
Specify a minimal timeoutMs property in the QueryRequest when preparing the query POST request, so that your request does not wait until the query finishes, i.e. does not behave synchronously
After you POST the query and the timeout elapses, you'll receive a response with a jobReference attribute
Use the jobId from the jobReference to track/poll the status of your query later, asynchronously (a sketch follows below)
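Here's a minimal sketch of those three steps in Python with the requests library (the project ID, access token, and SQL are placeholders, and error handling is omitted):

import time
import requests

PROJECT = "my-project"   # placeholder
TOKEN = "ya29...."       # placeholder OAuth2 access token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
BASE = f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT}"

# Step 1: POST the query with a minimal timeoutMs so the call returns
# right away instead of waiting for the query to finish.
resp = requests.post(
    f"{BASE}/queries",
    headers=HEADERS,
    json={"query": "SELECT 1", "useLegacySql": False, "timeoutMs": 0},
).json()

# Step 2: the response carries a jobReference even if the job isn't done yet.
job_id = resp["jobReference"]["jobId"]

# Step 3: poll the job status later, asynchronously, via jobs.get.
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS).json()
    if job["status"]["state"] == "DONE":
        break
    time.sleep(2)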
And if you mean the non-REST approaches, such as the Java or Python client libraries, they have out-of-the-box APIs specifically for executing a query either synchronously or asynchronously.
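For example, with the google-cloud-bigquery Python client, client.query() returns a job handle right away; you can poll it with done() or block with result():

from google.cloud import bigquery

client = bigquery.Client()      # uses application default credentials
job = client.query("SELECT 1")  # returns a QueryJob without waiting for completion

# Asynchronous style: keep the job id around and poll later.
print(job.job_id, job.done())   # done() is a single non-blocking status check

# Synchronous style: block until the query finishes.
rows = job.result()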

Related

Test async data processing flows with Karate Labs

I'm looking for best practices or the recommended approach to test async code execution with Karate.
Our use cases are all pretty similar but a basic one is:
Client makes HTTP request to API
API accepts the request and creates a message, which is added to a queue
API replies with ACCEPTED / 202
Worker picks up the message from the queue, processes it, and updates the database
Eventually, after the work is finished, another endpoint delivers the updated data
How can I check with Karate that, after processing has finished, other endpoints return the correct result?
Concrete real life example:
Client requests a processing-intensive data export from the API, e.g. via HTTP POST /api/export
API creates a message with the information needed to create the export and puts it on an AWS SQS queue
API replies with 202
Worker receives the message, creates the export, uploads the result as a ZIP to S3, and finally creates an entry in the database representing this export
Client can now query the list-exports endpoint, e.g. via HTTP GET /api/exports
API returns 200 with the list of exports, including the newly created entry
Generally I have two ideas on how to approach this:
Use Karate's retry until on the endpoint that returns the list of exports
In the API response (step #3) return the message ID, and use the SQS HTTP API to poll the queue until the message has been processed, then query the list endpoint to check the result
Is either of those approaches recommended, or should I choose an entirely different solution?
The moment queuing comes into the picture, I would not recommend retry until. It would work if you are in a hurry, but if you are okay writing a little bit of Java code, please read on. Note that this Java "glue code" needs to be written only once, and then the team responsible for writing the functional flows will be up and running.
I personally would prefer option (2) just because when a test fails, you will have a lot more diagnostic information and traces to look at.
Pretty sure you won't have a problem using AWS Java libs to do things such as polling SQS.
I think this example will answer all your questions: https://twitter.com/getkarate/status/1417023536082812935
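To make option (2) concrete, here is the shape of that polling glue sketched in Python with boto3 and requests (the real glue in a Karate suite would be Java, as noted above; the base URL, queue URL, and export response shape are all placeholders):

import time
import requests
import boto3

API = "https://api.example.test"  # placeholder base URL
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/exports"  # placeholder

# 1. Kick off the async export; the API replies 202 immediately.
resp = requests.post(f"{API}/api/export")
assert resp.status_code == 202

# 2. Poll SQS until the queue has drained, i.e. the worker has picked up
#    and finished processing the message.
sqs = boto3.client("sqs")
for _ in range(30):
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages",
                        "ApproximateNumberOfMessagesNotVisible"],
    )["Attributes"]
    if (attrs["ApproximateNumberOfMessages"] == "0"
            and attrs["ApproximateNumberOfMessagesNotVisible"] == "0"):
        break
    time.sleep(2)

# 3. Now the list endpoint should contain the new export.
exports = requests.get(f"{API}/api/exports").json()
assert any(e["status"] == "finished" for e in exports)  # shape is hypothetical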

Are async routing functions and asynchronous middleware in Express blocking the execution process (in 2021)?

I know that Express allows you to execute asynchronous functions in routes and in middleware, but is this correct? I read the documentation and it specifies that asynchronous routes or middleware should NOT be assigned. Today, does Express support asynchronous functions? Do they block the execution process, or do asynchronous functions currently NOT block the execution process?
For example, if I set up an asynchronous route and several requests are made to that route at the same time, are they resolved in parallel?
Or, when assigning asynchronous routes, will these requests be resolved one after the other?
This is what I mean by "blocking the execution process": if one request fails or stalls, are the other requests left pending? Or am I misunderstanding?
I hope you can help me.
You can use async functions just fine with Express, but whether or not they block has nothing to do with whether they are async and everything to do with what the code in the function does. If it starts an asynchronous operation and then returns, it won't block. But if it executes a bunch of time-consuming synchronous code before it returns, that will block.
If getDBInfo() is asynchronous and returns a promise that resolves when it completes, then your examples will have the three database operations in flight at the same time. Whether or not they actually run truly in parallel depends entirely upon your database implementation, but the code you show here allows them to run in parallel if the database implements that.
The single thread of Javascript execution will run the first call to getDBInfo(); that DB request will be started and will immediately return a promise. Then it will hit the await and suspend execution of the containing function. That allows the event loop to start processing the second request, which does the same: when it hits the await, it suspends execution of its containing function and allows the event loop to process the third request, which does likewise. Then, some time later, one of the DB calls will complete (it could be any of the three), which resolves its promise, un-suspends the function, and sends the response. Then, one after another, the other two DB calls will finish and send their responses.
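The same suspend-at-await behavior can be sketched in Python's asyncio (an analogue, not Express itself; get_db_info is a stub standing in for the getDBInfo() discussed above): three simultaneous "requests" all have their DB calls in flight at once on a single thread.

import asyncio
import random

async def get_db_info(request_id):
    # Stand-in for a real async DB query; the await suspends this
    # coroutine and lets the event loop run the other handlers.
    await asyncio.sleep(random.uniform(0.1, 0.5))
    return f"rows for request {request_id}"

async def handler(request_id):
    print(f"request {request_id}: started")
    rows = await get_db_info(request_id)  # suspends here, does not block the thread
    print(f"request {request_id}: responding with {rows}")

async def main():
    # Three "simultaneous requests": all three DB calls are in flight
    # at the same time on a single thread.
    await asyncio.gather(handler(1), handler(2), handler(3))

asyncio.run(main())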

Is it safe to call BackgroundJob.Enqueue from inside a RecurringJob?

I have a RecurringJob that receives some rows from a database and I want to send an email for each row. Is it safe to call BackgroundJob.Enqueue within the recurring job handler for each row to send an email?
My aim is to keep the work in the recurring job to a minimum.
It's a bit late to answer this, but yes, you can call BackgroundJob.Enqueue() from within the recurring job. We use MongoDB and, since it is thread safe, we have one job creating other jobs that run both serially and in parallel.
The purpose of background jobs is to perform a task regardless of whether you start it from an API call, a recurring job, or another background job.
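For illustration, here is the same fan-out pattern sketched in Python with Celery (an analogue, not Hangfire; the broker URL and the fetch helper are assumptions, and send_email.delay() plays the role of BackgroundJob.Enqueue):

from celery import Celery

app = Celery("emails", broker="redis://localhost:6379/0")  # broker URL is an assumption

def fetch_rows_needing_email():
    # Hypothetical helper standing in for the real database query.
    return [1, 2, 3]

@app.task
def send_email(row_id):
    # One small job per row; actual sending is stubbed out here.
    print(f"sending email for row {row_id}")

@app.task
def recurring_export():
    # The "recurring job": do the minimum inline, then fan out one
    # queued job per row (the BackgroundJob.Enqueue equivalent).
    for row_id in fetch_rows_needing_email():
        send_email.delay(row_id)  # safe: enqueuing from inside a job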

Are there queues between WebFilters?

I am planning to use WebFlux to create a set of microservices. Each service has a set of processing stages: a set of WebFilters for pre-processing, a filter/stage that does some IO (which could be very slow), and a set of filters for post-processing. Each post-processing filter needs the result of the IO operation to do its work. How does this work under the hood? Does the framework automatically trigger filters when the result of the preceding filter is available?
More information
Let's say I have a pipeline of filters. The first filter makes a remote API call, which may be slow. The second filter uses the result of this API call. In an event-loop-based asynchronous processing pipeline, this would be modeled using one or more queues and a thread pool. Incoming requests would be put into the queue and picked up by a worker thread. Since the first step of the pipeline is the API call, this thread would initiate the call and return immediately, without waiting for the result. When the result is received, it would be placed on the queue again, picked up by a worker thread, and dispatched to the second filter, which can then use the result of the API call. Without a framework, we would set up the queues, threads, and message handlers so that this sequence of events happens at runtime. I just want to understand how this works in WebFlux.
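For concreteness, here is a minimal Python asyncio sketch of the hand-rolled queue-plus-worker model described above (illustrative only; it makes no claim about how WebFlux actually implements this):

import asyncio

async def remote_api_call(request):
    await asyncio.sleep(0.2)  # stand-in for the slow remote call
    return f"result for {request}"

async def worker(queue):
    while True:
        stage, payload = await queue.get()
        if stage == "call_api":
            # First filter: start the slow call; when it completes,
            # re-enqueue the result for the next stage.
            result = await remote_api_call(payload)
            await queue.put(("use_result", result))
        elif stage == "use_result":
            # Second filter: consume the API result.
            print(f"second filter got: {payload}")
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(queue)) for _ in range(2)]
    for request in ("req-1", "req-2"):
        await queue.put(("call_api", request))
    await queue.join()  # wait until both stages have run for every request
    for w in workers:
        w.cancel()

asyncio.run(main())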

How do I poll_job for batch operations in BigQuery?

Code:
batch.add(bigquery.jobs().insert(projectId=project_id, body=query_request_body))
Once I do batch.execute(), is there a way I can do a poll_job() on this batch request object which would return true if all jobs in the batch operation have completed?
Batching allows your client to put multiple API calls into a single request. This makes your use of the http channel slightly more efficient. Note that there is no "batch" API offered by BigQuery itself: each API call is processed independently.
Overview of batching: Batch Details
Some details that better describe the response: Batch Request
Given this, if you want to inspect "all the jobs in one request", then you will need to construct a batched set of jobs.get calls to inspect all the jobs.
If you provide the job_id references for each inserted job, then this will be easy to construct since you have all the job_ids. If not, you will have to extract these from the batched reply from all those jobs.insert calls. (You may already be inspecting the batched reply to ensure all the jobs.insert calls were successful, so extracting a bit of extra data may be trivial for you.)
It sounds like your ultimate goal here is to be as efficient as possible with your http connection, so don't forget to remove already-done jobs from consecutive batched jobs.get calls.
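Here is a sketch of that batched polling loop with the google-api-python-client batch facility, assuming you collected the job IDs at insert time (bigquery is your authorized service object; everything else is a placeholder):

import time

def wait_for_jobs(bigquery, project_id, job_ids, interval=5):
    """Poll a batched set of jobs.get calls until every job is DONE."""
    pending = set(job_ids)
    while pending:
        done = set()

        def on_job(request_id, response, exception):
            if exception is None and response["status"]["state"] == "DONE":
                # Note: DONE includes failed jobs; check
                # response["status"].get("errorResult") if you care.
                done.add(response["jobReference"]["jobId"])

        batch = bigquery.new_batch_http_request(callback=on_job)
        for job_id in pending:
            batch.add(bigquery.jobs().get(projectId=project_id, jobId=job_id))
        batch.execute()

        pending -= done  # drop already-done jobs from the next batched call
        if pending:
            time.sleep(interval)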
With all that said, there is a simpler way that may result in more efficient use of the channel: if you just want to wait until all jobs are done, then you could instead poll on each job individually until done. Latency will be bounded by the slowest job, and single job requests are simpler to manage.