How do I poll_job for batch operations in BigQuery? - google-bigquery

Code:
batch.add(bigquery.jobs().insert(projectId=project_id, body=query_request_body))
Once I do batch.execute(), is there a way I can do a poll_job() on this batch request object which would return true if all jobs in the batch operation have completed?

Batching allows your client to put multiple API calls into a single request. This makes your use of the http channel slightly more efficient. Note that there is no "batch" API offered by BigQuery itself: each API call is processed independently.
Overview of batching: Batch Details
Some details that better describe the response: Batch Request
Given this, if you want to inspect "all the jobs in one request", then you will need to construct a batched set of jobs.get calls to inspect all the jobs.
If you provide the job_id references for each inserted job, then this will be easy to construct since you have all the job_ids. If not, you will have to extract these from the batched reply from all those jobs.insert calls. (You may already be inspecting the batched reply to ensure all the jobs.insert calls were successful, so extracting a bit of extra data may be trivial for you.)
It sounds like your ultimate goal here is to be as efficient as possible with your http connection, so don't forget to remove already-done jobs from consecutive batched jobs.get calls.
With all that said, there is a simpler way that may result in more efficient use of the channel: if you just want to wait until all jobs are done, then you could instead poll on each job individually until done. Latency will be bounded by the slowest job, and single job requests are simpler to manage.
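That batched polling loop could look like the following sketch, assuming `service` is the same google-api-python-client object the question's `bigquery` variable refers to, and `job_ids` holds the jobIds you supplied or extracted from the insert replies; the helper names are illustrative:

```python
import time

def still_pending(pending, states):
    """Drop jobs whose batched jobs.get reply already says DONE."""
    return pending - {j for j, s in states.items() if s == "DONE"}

def poll_jobs(service, project_id, job_ids, interval=5):
    """Poll all jobs with batched jobs.get calls until every one is DONE."""
    pending = set(job_ids)
    states = {}

    def record(request_id, response, exception):
        if exception is None:
            states[request_id] = response["status"]["state"]

    while pending:
        batch = service.new_batch_http_request(callback=record)
        for job_id in pending:
            batch.add(service.jobs().get(projectId=project_id, jobId=job_id),
                      request_id=job_id)
        batch.execute()
        pending = still_pending(pending, states)  # shrink the next batch
        if pending:
            time.sleep(interval)
    return states
```

Note how `still_pending` implements the point above: already-done jobs are removed from each successive batched jobs.get call.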

Related

Can I send an API response before successful persistence of data?

I am currently developing a Microservice that is interacting with other microservices.
The problem is that those interactions are really time-consuming. I have already implemented concurrent calls via Uni and use caching where useful. But some calls still need a few seconds to respond, so I thought of another thing I could do to improve performance:
Is it possible to send a response before the successful persistence of data? I send requests to the other microservices, where they have to persist the results of my methods. Can I already send the user the result in a first response, and follow up with a second response once the persistence has succeeded?
That way, the front-end could already begin working even though my API is not 100% finished.
I saw that there is a possible status code 207, but it is rather used with streams where someone wants to split large files. Is there another possibility? Thanks in advance.
"Is it possible to send a response before the successful persistence of data? Can I already send the user the result in a first response, and follow up with a second response once the persistence has succeeded? That way, the front-end could already begin working even though my API is not 100% finished."
You can and should, but it is a philosophy change in your API and possibly you have to consider some edge cases and techniques to deal with them.
In the case of a long-running API call, you can issue an immediate "ack" response (a traditional 200, or more conventionally a 202 Accepted), where the body just means the operation is asynchronous and will complete in the future, something like { "id": 49584958, "apicall": "create", "status": "queued", "result": true }
Then you can:
poll your API with the returned ID to see whether the still-ongoing operation has succeeded or failed.
open an SSE (server-sent events) channel on which your server issues status messages as pending operations finish.
possibly use persistent connections and keepalives, or flush the response mid-stream, to achieve what you describe, i.e. something like a segmented response. I am not familiar with that approach, as I normally go for the suggestions above.
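A minimal, framework-free sketch of the ack-then-poll flow, assuming an in-memory store; the names (`start_create`, `poll`, `OPERATIONS`) are illustrative, and in a real service these functions would sit behind HTTP handlers:

```python
import itertools
import threading

OPERATIONS = {}                   # op id -> status record
_ids = itertools.count(49584958)  # arbitrary starting id, as in the example above

def start_create(payload):
    """Handler for the initial call: queue the work and ack immediately."""
    op_id = next(_ids)
    OPERATIONS[op_id] = {"id": op_id, "apicall": "create",
                         "status": "queued", "result": True}
    # persistence happens in the background while the ack goes out
    threading.Thread(target=_persist, args=(op_id, payload)).start()
    return OPERATIONS[op_id]      # the "ack" response body

def _persist(op_id, payload):
    # ... the slow persistence via the other microservices would happen here ...
    OPERATIONS[op_id]["status"] = "done"

def poll(op_id):
    """Handler the client polls with the returned id."""
    return OPERATIONS.get(op_id, {"status": "unknown"})
```

The client receives the queued record right away and calls `poll` (or listens on SSE) until the status flips to done.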
But in any case, the same edge cases apply: for example, what happens if a user then issues API calls that depend on the success of an ongoing (or not yet started) previous command, such as requesting information about something that is still being persisted?
You will have to deal with these situations with mechanisms like:
Reject related operations until the pending call is resolved server-side: the API could return e.g. a BUSY error indicating that operations are still ongoing when you want to, for example, delete something that is still being created.
Queue all operations so the server executes them sequentially.
Allow some simultaneous operations if you find they will not collide (e.g. creating 2 unrelated items).

BigQuery - Move from a synchronous job to an asynchronous job

After a certain time, I would like to make a synchronous job (launched with jobs.query) asynchronous (as if it were launched with jobs.insert) but I don't want to launch a new job. Is it possible?
When you say jobs.query here, I'm assuming you're referring to the BigQuery REST API.
One approach to make your query job asynchronous, per their docs, is as follows:
specify a small timeoutMs property in the QueryRequest body of your query POST request, so that the request does not wait until the query finishes, i.e. does not behave synchronously
after you POST the query and the timeout elapses, you'll receive a response containing a jobReference attribute
use the jobId from that jobReference to track/poll the status of your query later, asynchronously
And if you mean the non-REST approaches, the client libraries for languages such as Java and Python provide out-of-the-box APIs specifically for executing a query synchronously or asynchronously.
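A sketch of those steps with the google-api-python-client; `service`, `start_query_async`, and `query_done` are illustrative names:

```python
def job_state(job_resource):
    """Extract the state ('PENDING', 'RUNNING' or 'DONE') from a jobs.get reply."""
    return job_resource["status"]["state"]

def start_query_async(service, project_id, sql):
    """POST jobs.query with a tiny timeoutMs so the call returns before the query finishes."""
    resp = service.jobs().query(
        projectId=project_id,
        body={"query": sql, "timeoutMs": 0},
    ).execute()
    return resp["jobReference"]["jobId"]  # keep this; poll with it later

def query_done(service, project_id, job_id):
    """Later, check the job via jobs.get instead of blocking on jobs.query."""
    job = service.jobs().get(projectId=project_id, jobId=job_id).execute()
    return job_state(job) == "DONE"
```

With `timeoutMs` set to 0 the initial call returns almost immediately, and all waiting moves into the cheap `query_done` checks.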

File Based Processing versus REST API [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
We have a requirement to process 10,000 transactions once daily, offline (non-real-time).
Which of the two options is preferable:
A batch file with 10,000 rows sent once a day and processed,
or
An API call in small batches (as I presume sending 10K rows at once is not an option)?
I was advised by my architects that option 1 is preferable, and that an API only makes sense when batch sizes are small, the disadvantage of option 2 being that the caller has to break the payload into small chunks even though they have all the information available at once.
I am keen to see how option 2 could be viable, so any comments/suggestions to help make the case would be very helpful.
Thanks
Rahul
This is not a full answer. However, I would like to mention one reason in favor of REST API: Validation. This is better managed through the API. Once the file is dropped into an FTP location, it will be your responsibility to validate the format of the file. Will it be easy to route a "bad" file back to its source with a message to explain the bounce back?
With an API call, if the representation coming in does not adhere to a valid schema e.g. XML, json, etc. then your service can respond with a: "400 Bad Request" http status code. This keeps the responsibility of sending data in a valid format with the consumer of the service and helps to achieve a better separation of concerns.
Additional reasoning for a REST API:
Since your file contains transactions, each record should be atomic (If this were not true e.g. there are relationships between the records in the file, then those records should not be considered "transactions"). Therefore, chunking the file up into smaller batches should be trivial.
Regardless, you can define a service that accepts transactions in batch and respond with an HTTP status code of "202 Accepted". A 202 code indicates that the request was received and will be processed asynchronously. Therefore, the response can also contain callback links to check the status of individual transactions; or the batch as a whole. At that point, you would be implementing HATEOAS (Hypermedia as the Engine of Application State) and be in a position to automate the entire process and report on status.
Alternatively, with batch files, even if the file passes an upfront format validation check, you still have to process each transaction individually downstream. Some records may load, others may not. My assumption is that the records that fail to load still need to be handled, and you may need to give users a view of what succeeded vs. failed. All of this can be handled outside a REST API, but the API pattern is simple and elegant, IMHO, for this purpose.
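A minimal sketch of such a 202-style batch endpoint; the `submit_batch`/`batch_status` names, the in-memory `BATCHES` store, and the link paths are all illustrative:

```python
import uuid

BATCHES = {}  # batch_id -> per-transaction status

def submit_batch(transactions):
    """Accept a batch of transactions and answer 202 with links to poll status."""
    batch_id = uuid.uuid4().hex
    BATCHES[batch_id] = {str(i): "pending" for i in range(len(transactions))}
    body = {
        "links": {
            "batch": f"/batches/{batch_id}",
            "transactions": [f"/batches/{batch_id}/transactions/{i}"
                             for i in range(len(transactions))],
        }
    }
    # 202 Accepted: the request was received and will be processed asynchronously
    return 202, body

def batch_status(batch_id):
    """Poll handler: report per-transaction status for the whole batch."""
    return BATCHES.get(batch_id)
```

The callback links in the body are the HATEOAS part: the consumer follows them to check individual transactions or the batch as a whole, rather than hard-coding status URLs.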
Using a batch process is always a better idea; you can trigger the batch process using a REST API.
With batch processing you can always send an email with the message "improper file format", or report which records were processed and which were not. With plain REST you cannot easily keep track of records and transactions.
As mentioned in the comment above, you can use a REST API to trigger a batch process asynchronously and send the status response using HATEOAS.
Spring Batch + Spring REST using Spring Boot
I have the same question, and every answer I found was similarly subjective. Here are some ideas for comparing the two approaches:
A batch solution requires more storage than a REST API. You will need to store your results in an intermediate storage area and write them in an open format. You can compress them, but then you are trading storage for processing.
A REST API may use more network bandwidth than a batch solution, but only if the intermediate storage is not on a network drive. Fetch requests and status polling can consume a lot of bandwidth, though this can be mitigated with webhooks or websockets.
A REST API makes automatic recovery easier than a batch solution. REST response codes can drive automatic decisions about recovering from a failure, and you reduce the number of services needed to detect it. (If the network is down, a notification email would fail just as the REST API does.) REST also pushes you to define a good API for these cases.
A REST API can move as many rows as any other TCP-based protocol (such as FTP), but on failure you will need logic to manage retries, which means your REST API needs a chunk-aware protocol too. With a batch service, that logic lives in the FTP protocol, but with FTP's own semantics, not your business logic.
A batch service does not require reserving an instance all the time (CPU, IP address, port, etc.); it just runs when needed. But you will need a scheduler to start it, or manual intervention, and someone to restart it if it fails. Outside of a scheduler, it is not natural to automate.
A batch service requires less security setup from the developer: a REST API must take care of authentication, and must consider injection and other attack methods. A REST API can use helper services to prevent all of this, but that means more configuration.
Batch services are easy to deploy: they can run on your machine or a server, whenever the business needs. A REST API requires continuous health checks, a deployment strategy to keep it up, care with DNS configuration, and so on. Check whether your company provides all of these services.
If this solution is for your company, look at what your company already does. The current trend is to move to REST APIs, but if your support team does not know them and has a lot of experience with batch solutions, not "improving" could be the right choice.

MKNetworkKit: Chain of requests where each subsequent request needs data from the previous

I'm not sure how to implement this the best way:
I have multiple REST requests, each retrieving data from a different resource. The catch is that each request needs data from the previous one.
Now, I have MKNetworkKit running in this project. Do I really have to make a request, evaluate the data in its result block, start a new request from that result block, which in turn ends up in the next result block, and so forth?
It is not really recursive, since the evaluation differs for every request, and nesting request/block combinations ten levels deep does not seem like a nice way to do this (synchronous requests are apparently also bad and not supported in MKNetworkKit).
What would be the best practice here?
EDIT: I would also like to do this in one function call.
Same issue here. What I've ended up with is placing each desired network call in a queue (array or whatever you want to store your operations in) and updating my network response delegate so that it checks the queue for the next operation in the chain.
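The queue-of-steps idea can be sketched language-agnostically (here in Python rather than Objective-C, with illustrative names): each step is a callable that issues its request using the previous step's result, and a single driver function walks the queue:

```python
def run_chain(steps, seed=None):
    """Run request steps sequentially; each step receives the previous result."""
    result = seed
    queue = list(steps)       # the pending "network operations"
    while queue:
        step = queue.pop(0)   # next operation in the chain
        result = step(result) # each step would issue its request and parse data
    return result
```

Each element of `steps` stands in for one network call plus its evaluation logic, so the ten-level nesting collapses into one flat list and one function call, matching the EDIT in the question.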

How should I design my workflow so that tasks can run in parallel

How do I design a parallel processing workflow?
I have a scenario involving data analysis.
There are four steps, basically:
pick up a task, either read from a queue or received as a message through an API (a web service, maybe) that triggers the service
submit a request to a remote service based on the parameters from step 1
wait for the remote service to finish, then download the result
process the data downloaded in step 3
The four steps above look like a sequential workflow.
My question is how I can scale it out.
Every day I might need to perform hundreds to thousands of these tasks.
If I can run them in parallel, that will help a lot,
e.g. run 20 tasks at a time.
So can we configure Windows Workflow Foundation to run in parallel?
Thanks.
You may want to use PFX, the Parallel Extensions (http://www.albahari.com/threading/part5.aspx); then you can control how many threads to use for fetching, and I find PLINQ helpful.
So you loop over the list of URLs, perhaps read from a file or database, and in your select you can call a function to do the processing.
If you can go into more detail, for example whether you want the fetching and processing to be on different threads, it may be easier to give a more complete answer.
UPDATE:
This is how I would approach this, but I am also using ConcurrentQueue (http://www.codethinked.com/net-40-and-system_collections_concurrent_concurrentqueue) so I can be putting data into the queue while reading from it.
This way each thread can dequeue safely, without worrying about having to lock your collection.
Parallel.For(0, queue.Count, new ParallelOptions { MaxDegreeOfParallelism = 20 },
(j) =>
{
    string url;
    if (queue.TryDequeue(out url))
    {
        // call out to URL
        // process data
    }
});
You may want to put the data into another concurrent collection and have that be processed separately, it depends on your application needs.
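As a sketch of that split (in Python rather than C#, with illustrative names): a bounded pool of fetchers hands results to a separate consumer thread through a thread-safe queue, mirroring the ConcurrentQueue idea above:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def fetch_and_process(urls, fetch, process, workers=20):
    """Fetch with a bounded thread pool; process results on a separate consumer."""
    results = queue.Queue()   # thread-safe handoff, no explicit locking needed
    DONE = object()           # sentinel telling the consumer to stop
    processed = []

    def consumer():
        while True:
            item = results.get()
            if item is DONE:
                return
            processed.append(process(item))

    proc = threading.Thread(target=consumer)
    proc.start()
    # at most `workers` concurrent fetches, like MaxDegreeOfParallelism = 20
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(lambda u: results.put(fetch(u)), urls):
            pass
    results.put(DONE)
    proc.join()
    return processed
```

Because the queue decouples the two stages, slow processing never blocks the fetchers, and vice versa; you could just as well run several consumers if processing is the bottleneck.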
Depending on how your tasks and workflow are modeled, you can use a Parallel activity and create different branches for the different tasks to be performed. Each branch has its own logic, and the WF runtime will start the second WCF request to retrieve data as soon as it is waiting for the first to respond. This requires you to model the number of branches explicitly, but allows for different activities in each branch.
But from your description it sounds like you have the same steps for each task, in which case you could model it using a ParallelForEach activity and have it iterate over a collection of tasks. Each task object would need to contain all the information used for its request. This requires each task to have the same steps, but you can put in as many tasks as you want.
What works best really depends on your scenario.