As the title states, are BigQuery Storage Write Requests atomic?
I know that the legacy streaming API may return a partial error, indicating that some rows were successfully written. However, I understand we can achieve exactly-once processing at the stream level through the use of offsets. But it seems that these offsets apply to each write request and not to individual rows. Thus I am wondering: does a write operation using the new Storage API fully complete or fully fail?
Offsets are row-based, not request-based. If an append request contains three rows, for example, the offset advances by three when the request is processed successfully.
A success for an append in the Write API means all the rows in the request were committed. If the stream type supports it (e.g. not a default stream), then the append response will confirm the offset.
This differs from the legacy streaming API (tabledata.insertAll), which allows for partial failures. On that surface, a three-row request could have one successful row and two failures, or some other combination of successes and failures.
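To make the offset semantics concrete, here is a toy model in plain Python (deliberately not the real client library; the real API rejects stale or out-of-order offsets with errors rather than writing twice): a successful append of three rows advances the stream offset by three, and the whole batch commits or fails as a unit.

    class FakeWriteStream:
        """Toy model of Write API offset semantics; not the real client."""

        def __init__(self):
            self.rows = []
            self.offset = 0  # next row offset the stream expects

        def append(self, batch, expected_offset):
            if expected_offset != self.offset:
                # a duplicate or out-of-order append is rejected, not applied
                raise ValueError("offset mismatch")
            self.rows.extend(batch)    # all rows in the batch commit together
            self.offset += len(batch)  # offsets advance per row, not per request
            return self.offset         # the append response confirms the offset

    stream = FakeWriteStream()
    assert stream.append(["r1", "r2", "r3"], expected_offset=0) == 3
    assert stream.append(["r4"], expected_offset=3) == 4
    # Retrying the last batch with offset 3 now raises instead of duplicating.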
We have an API endpoint which accepts people.
During the call we check that the person's PIN has not already been used; if it has, we reject the request with a 422 input error.
Recently a client complained about a duplicate PIN, and we found that the API endpoint was being triggered by two different REST calls that shared a PIN but not a request body.
i.e.
{"First Name": "John", "Last Name": "Doe", "PIN": 722}
{"First Name": "Jane", "Last Name": "Doe", "PIN": 722}
Both requests come in within milliseconds of each other, so when the duplicate-PIN test is performed for the second record it returns false (the first record has not yet been inserted into the DB), and as a result the second record continues to be processed.
We have looked into a few options, such as unique constraints on the DB, which do work but would require huge amounts of rework to bubble the error up to the REST API response. A huge risk on a thriving production app.
There are around 5-6 different API calls that can modify the PIN collection in one guise or another, and around 20-30 different APIs where this sort of problem exists (unique emails, unique item names, etc.), so I am not sure we can maintain lists for quick access.
Aside from the DB constraints, are there any other options available to us that would be simpler to implement? Perhaps some obscure attribute on the .NET ApiController class.
In my head, at least, I would like a request to be made and subsequent requests to be queued. I am aware we could simply store the payload for processing later; however, given that the APIs are already being consumed, this doesn't seem to be an option, so the queue would have to block the response.
Trying to Google this has been far from successful, as everything assumes I am trying to reject complete duplicate bodies.
If you're basing the check on whether the PIN exists in the DB, in theory you could have tens or hundreds of potential duplicates coming in on the REST API, all assuming the PIN is OK because the database is slow to say whether a PIN exists.
If you think of it as a pipeline:
client -> API -> PIN check from db -> ok -> insert in db
the PIN check from the db is a bottleneck, as it will most likely be slower than the incoming REST calls. As you say, the solution is to enforce the uniqueness at the database level, but if that's not practical, you could use a 'gateway' check instead: basically a PIN cache that is fast to update and query, to which PINs are added once a request has been accepted for processing and before they are written to the database. So the pipeline becomes:
client -> API -> PIN check from cache -> ok -> write PIN to cache -> insert in db
and the second request would end up as:
client -> API -> PIN check from cache -> exists -> fail
so the cache keeps up with the requests, allowing the more leisurely database operations to continue with known-clean data. You'd have to build the cache (from the PINs in the database) on every service restart and, if it's an in-memory cache, only allow API calls once the cache is ready. Or you could use a disk cache. The downside is that it duplicates all the PINs in your database, but it's a compromise that lets you keep up with checking PINs in API calls.
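A minimal sketch of such a gateway cache, assuming Python and an in-memory set guarded by a lock (the question is about .NET, but the idea carries over; all names here are made up for illustration):

    import threading

    class PinGateway:
        """In-memory PIN cache with an atomic check-and-reserve."""

        def __init__(self, existing_pins):
            self._pins = set(existing_pins)  # seeded from the DB at start-up
            self._lock = threading.Lock()

        def try_reserve(self, pin):
            """Reserve a PIN atomically; False means it is already taken."""
            with self._lock:
                if pin in self._pins:
                    return False
                self._pins.add(pin)
                return True

        def release(self, pin):
            """Undo a reservation if the later DB insert fails."""
            with self._lock:
                self._pins.discard(pin)

With this in place, the two racing requests from the question resolve deterministically: the first try_reserve(722) returns True, the second returns False, so the second request can be rejected with the 422 before it ever reaches the database.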
I have several APIs that serve a large number of records to an application. The API responses are generally user-based (same API might serve different responses to different users).
To make it easier for the application side to get and load the data, the application receives the response in chunks. For example, it makes n consecutive requests, like this:
/api/myapi/1
/api/myapi/2
...
/api/myapi/n
The application caches the API responses in its local memory as objects. In each API response, an etag header is sent back to the application, containing the hashed value of the response. In each request made by the application, this etag value is sent along with the request parameters. Based on the etag value, the server determines whether the application has an old response in its cache or not.
In the case of the APIs that serve data in chunks, the application still only keeps one etag per API. That makes it impossible for the server to check whether the application's cached data is fresh or not.
An illustration:
In the case of the API above, each time the request is made, the server calculates the etag and sends it along with the response. Each time the application receives a response, it updates the value of the etag. In the end (when the nth call is made), the application only has the nth etag stored.
When the application needs these data again, it first makes the request to the server (/api/myapi/1), sending the nth etag. Most probably, the response with the first set of data will differ from the nth response, so the server asks the application to retake the data. The application retakes the data and updates the etag. This repeats until the nth request.
As you can see, even if the total response (all the sets of data) has not changed, the server will always compare the etag from the response of the previous set with the etag of the current data. This means that the server's response will always be 'retake the data', which is wrong.
The alternatives I came up with are:
The application stores all etags and sends etag[i] in the request /api/myapi/i (see the sketch after this list of alternatives).
Another way would be for the server to store all etags for every user (which I don't find effective). That could cause a problem in the case where the server has successfully sent a response but the application was unable to store that set of data: the server would not know that the application is still using an old response.
The third alternative would be for the server to calculate the etag of the whole response but send the response in n sets, returning the same etag (that of the whole data set) every time. This means that the server has to do the same job twice, which I still do not like as a solution.
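For what it's worth, here is a minimal sketch of alternative 1 on the client side, assuming Python; fetch is a hypothetical stand-in for the real HTTP call, assumed to send the given etag as If-None-Match and to return (status, body, etag):

    etags = {}  # chunk index -> last seen etag, instead of a single slot

    def request_chunk(i, fetch):
        status, body, etag = fetch(i, etags.get(i))
        if status == 304:   # server: cached chunk i is still fresh
            return None     # keep using the locally cached objects
        etags[i] = etag     # fresh data: remember this chunk's etag
        return body

The server-side check stays exactly as it is today; the only change is that the comparison for /api/myapi/i is made against chunk i's own etag.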
PS: The front-end developer says that it is not possible for him to store more than one etag for each request. That makes alternative 1 somewhat complicated (although not impossible).
Is there any other way to treat this scenario? If not, what could be a more efficient solution?
What would be the best option for exposing 220k records to third-party applications?
SF style 'bulk API' - independent of the standard API to maintain availability
server-side pagination
a callback to an FTP-generated file?
webhooks?
This bulk transfer will have to happen once a day or so. Any other suggestions welcome!
How are the 220k records being used?
Must serve it all at once
Not ideal for human consumers of this endpoint without special GUI considerations and communication.
A. I think that using a 'bulk API' would be marginally better than reading a file of the same data. (Not 100% sure on this.) Opening and interpreting a file might take a little bit more time than directly accessing data provided in an endpoint's response body.
Can send it in pieces
B. If only a small amount of data is needed at once, then server-side pagination should be used; it allows the consumer to request new batches of data as desired. This reduces unnecessary server load by not sending data without it being specifically requested (a minimal sketch follows this list).
C. If all of it needs to be received during a user-session, then find a way to send the consumer partial information along the way. Often users can be temporarily satisfied with partial data while the rest loads, so update the client periodically with information as it arrives. Consider AJAX Long-Polling, HTML5 Server Sent Events (SSE), HTML5 Websockets as described here: What are Long-Polling, Websockets, Server-Sent Events (SSE) and Comet?. Tech stack details and third party requirements will likely limit your options. Make sure to communicate to users that the application is still working on the request until it is finished.
Can send less data
D. If the third party applications only need to show updated records, could a different endpoint be created for exposing this more manageable (hopefully) subset of records?
E. If the end-result is displaying this data in a user-centric application, then maybe a manageable amount of summary data could be sent instead? Are there user-centric applications that show 220k records at once, instead of fetching individual ones (or small batches)?
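As mentioned under option B, here is a minimal sketch of server-side pagination, assuming Python and a DB-API cursor; the table, page size, and response shape are made-up illustrations:

    PAGE_SIZE = 500

    def get_page(cursor, page):
        """Serve one page of records; `page` starts at 1."""
        offset = (page - 1) * PAGE_SIZE
        cursor.execute(
            "SELECT id, name FROM records ORDER BY id LIMIT %s OFFSET %s",
            (PAGE_SIZE, offset),
        )
        rows = cursor.fetchall()
        return {
            "page": page,
            "records": rows,
            "has_more": len(rows) == PAGE_SIZE,  # cheap 'next page' hint
        }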
I would use a streaming API: an API that does a "select * from table" and then streams the results to the consumer. You do this with a for loop that fetches and outputs the records. This way you never use much memory, and as long as you frequently flush the output, the webserver will not close the connection and you will support any size of result set.
I know this works as I (shameless plug) wrote the mysql-crud-api that actually does this.
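The original project is PHP, but the loop-and-flush idea looks roughly like this in Python (a sketch assuming a DB-API cursor and a writable, flushable output stream; table and column names are placeholders):

    import json

    def stream_records(cursor, out):
        """Stream a whole table as JSON lines without buffering it all."""
        cursor.execute("SELECT id, name FROM records")
        while True:
            rows = cursor.fetchmany(1000)  # pull a small batch at a time
            if not rows:
                break
            for row in rows:
                out.write(json.dumps({"id": row[0], "name": row[1]}) + "\n")
            out.flush()  # frequent flushes keep the connection alive

Memory use stays flat regardless of how many records the table holds, which is why this approach scales to any size of result set.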
From a lot of articles and commercial APIs I have seen, most people make their APIs idempotent by asking the client to provide a requestId or idempotency key (e.g. https://www.masteringmodernpayments.com/blog/idempotent-stripe-requests) and basically store the requestId <-> response map in storage. So if a request comes in whose requestId is already in this map, the application just returns the stored response.
This is all good to me, but my problem is: how do I handle the case where a second call comes in while the first call is still in progress?
So here are my questions:
1. I guess the ideal behaviour would be for the second call to keep waiting until the first call finishes, and then return the first call's response? Is this how people do it?
2. If yes, how long should the second call wait for the first call to finish?
3. If the second call has a wait-time limit and the first call still hasn't finished, what should it tell the client? Should it just not return any response, so the client will time out and retry?
For Wunderlist we use database constraints to make sure that no request id (which is a column in every one of our tables) is ever used twice. Since our database technology (Postgres) guarantees that it would be impossible for two records to be inserted that violate this constraint, we only need to react properly to the potential insertion error. Basically, we outsource this detail to our datastore.
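A minimal sketch of this pattern, assuming Python with psycopg2 and a hypothetical tasks table whose request_id column carries a UNIQUE constraint:

    from psycopg2 import errors

    def insert_once(conn, request_id, payload):
        """Insert a row; a duplicate request_id means the work is already done."""
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO tasks (request_id, payload) VALUES (%s, %s)",
                    (request_id, payload),
                )
            conn.commit()
            return True   # first time this request id was seen
        except errors.UniqueViolation:
            conn.rollback()
            return False  # a retry: the datastore rejected the duplicate for us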
I would recommend, no matter how you go about this, to try not to need to coordinate in your application. If you try to know if two things are happening at once then there is a high likelihood that there would be bugs. Instead, there might be a system you already use which can make the guarantees you need.
Now, to specifically address your three questions:
1. For us, since we use database constraints, the database handles making things queue up and wait. This is why I personally prefer the old SQL databases - not for the SQL or relations, but because they are really good at locking and queuing. We use SQL databases as dumb disconnected tables.
2. This depends a lot on your system. We try to tune all of our timeouts to around 1s in each system and subsystem. We'd rather fail fast than queue up. If you don't know a good value ahead of time, you can measure your timings and set the timeout to your 99th percentile (a tiny sketch follows these answers).
3. We would return a 504 HTTP status (and an appropriate response body) to the client. The reason for having an idempotency key is so the client can retry a request - so we are never worried about timing out and letting them do just that. Again, we'd rather time out fast and fix the problems than let things queue up. If things queue up, then even after something is fixed, one has to wait a while for things to get better.
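The percentile-as-timeout idea from answer 2, sketched in Python (statistics.quantiles needs Python 3.8+; the sample numbers are placeholders):

    import statistics

    def p99_ms(latencies_ms):
        """99th percentile of observed latencies, to use as a timeout."""
        return statistics.quantiles(latencies_ms, n=100)[98]

    # e.g. p99_ms([120, 95, 130, 110, 98, 105]) -> timeout to configure, in ms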
It's a bit hard to tell whether the second call is from the same client with the same request token, or from a different client.
Normally, in the case of concurrent requests from different clients operating on the same resource, you would also want to implement a versioning strategy alongside a request token for idempotency.
A typical versioning strategy in a relational database might be a version column with a trigger that auto-increments the number each time a record is updated.
With this in place, all clients must specify their request token as well as the version they are updating (typically the If-Match header is used for this, with the version number as the value of the ETag).
On the server side, when it comes time to update the state of the resource, you first check that the version number in the database matches the version supplied in the ETag. If they match, you write the changes and the version increments. Assuming the second request was operating on the same version number as the first, it would then fail with a 412 (or 409, depending on how you interpret the HTTP specifications), and the client should not retry.
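A minimal sketch of that server-side check, assuming Python with a DB-API connection; here the version is bumped in the UPDATE itself rather than by a trigger, and the table and column names are made up:

    def update_with_version(conn, resource_id, expected_version, new_state):
        """Optimistic concurrency: write only if the version still matches."""
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE resources SET state = %s, version = version + 1 "
                "WHERE id = %s AND version = %s",
                (new_state, resource_id, expected_version),
            )
            conn.commit()
            return cur.rowcount == 1  # False -> stale version, answer 412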
If you really want to stop the second request immediately while the first request is in progress, you are going down the route of pessimistic locking, which doesn't suit REST APIs that well.
In the case where you are actually talking about the client retrying with the same request token because it received a transient network error, it's almost the same case.
Both requests will be running at the same time; the second request will start because the first request has not finished and has not recorded its request token to the database yet, but whichever one finishes first will succeed and record the request token.
The other request will receive a version conflict (since the first request has incremented the version), at which point it should recheck the request token table, find its own token in there, assume that it was a concurrent request that finished before it did, and return 200.
It seems like a lot, but if you want to cover all the weird and wonderful failure modes when you're dealing with REST, idempotency, and concurrency, this is the way to deal with it.
I'm not sure of the best way to implement this:
I have multiple REST requests, each of which retrieves data from a different resource. The thing is that each request needs data from the previous one.
Now, I have MKNetworkKit running in this project. Do I really have to make a request, evaluate the data in its result block, and start a new request from that result block, which in turn will end up in the next result block, and so forth?
It is not really recursive, since evaluation is different for every request, and it seems to me that nesting request/block combinations ten levels deep is not really a nice way to do this (synchronous requests apparently are also bad and not supported in MKNetworkKit).
What would be the best practice for doing this?
EDIT: I would also like to do this in one function call.
Same issue here. What I've ended up with is placing each desired network call in a queue (an array or whatever you want to store your operations in) and updating my network response delegate so that it checks the queue for the next operation in the chain.
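The same queue idea, sketched in Python rather than Objective-C (every name here is hypothetical); each step receives the previous result plus a completion callback, and that callback drives the next request, so the whole chain runs from one function call:

    def run_chain(steps, on_done):
        """Run dependent requests one after another.

        `steps` is a list of functions taking (previous_result, callback).
        """
        queue = list(steps)

        def next_step(result):
            if not queue:
                on_done(result)  # the whole chain has finished
                return
            step = queue.pop(0)
            step(result, next_step)  # completion drives the next request

        next_step(None)  # kick off the first request

    # Usage with fake 'requests':
    # def fetch_user(_, done): done({"user_id": 42})
    # def fetch_orders(user, done): done({"orders_for": user["user_id"]})
    # run_chain([fetch_user, fetch_orders], print)

This keeps each request's evaluation logic in its own function instead of nesting result blocks ten levels deep.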