How would I expose 200k+ records via an API?

What would be the best option for exposing 220k records to third-party applications?
SF-style 'bulk API', independent of the standard API to maintain availability
Server-side pagination
Callback to an FTP-generated file?
Webhooks?
This bulk transfer will have to happen about once a day. Any other suggestions are welcome!

How are the 220k records being used?
Must serve it all at once
Not ideal for human consumers of this endpoint without special GUI considerations and communication.
A. I think that using a 'bulk API' would be marginally better than reading a file of the same data. (Not 100% sure on this.) Opening and interpreting a file might take a little bit more time than directly accessing data provided in an endpoint's response body.
Can send it in pieces
B. If only a small amount of data is needed at once, then server-side pagination should be used and allows the consumer to request new batches of data as desired. This reduces unnecessary server load by not sending data without it being specifically requested.
C. If all of it needs to be received during a user session, then find a way to send the consumer partial information along the way. Often users can be temporarily satisfied with partial data while the rest loads, so update the client periodically with information as it arrives. Consider AJAX long-polling, HTML5 Server-Sent Events (SSE), or HTML5 WebSockets, as described in "What are Long-Polling, Websockets, Server-Sent Events (SSE) and Comet?". Tech stack details and third-party requirements will likely limit your options. Make sure to communicate to users that the application is still working on the request until it is finished.
Can send less data
D. If the third party applications only need to show updated records, could a different endpoint be created for exposing this more manageable (hopefully) subset of records?
E. If the end-result is displaying this data in a user-centric application, then maybe a manageable amount of summary data could be sent instead? Are there user-centric applications that show 220k records at once, instead of fetching individual ones (or small batches)?
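For option B, a minimal sketch of server-side pagination might look like the following. It assumes a hypothetical `records` table with an integer `id` primary key and uses an in-memory SQLite database purely so it can be run as-is. The cursor-style paging (`WHERE id > ?`) is one possible choice, not something the original question specifies.

```python
import sqlite3


def fetch_page(conn, after_id=0, limit=1000):
    """Return up to `limit` rows with id greater than `after_id`, plus the next cursor."""
    rows = conn.execute(
        "SELECT id, name FROM records WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    # A full page means there may be more data; a short page means we are done.
    # (If the table ends exactly on a page boundary, the next call returns no rows.)
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor


def demo_connection(n=25):
    """Build a throwaway in-memory table (hypothetical schema) for trying the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO records (id, name) VALUES (?, ?)",
        [(i, f"record-{i}") for i in range(1, n + 1)],
    )
    return conn
```

The consumer keeps calling `fetch_page` with the returned cursor until it comes back as `None`, so the server only ever does work for pages that were actually requested.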

I would use a streaming API. This is an API that does a "select * from table" and then streams the results to the consumer. You do this using a for loop that fetches and outputs the records one at a time. This way you never use much memory, and as long as you flush the output frequently, the web server will not close the connection and you can support a result set of any size.
I know this works as I (shameless plug) wrote the mysql-crud-api that actually does this.
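To illustrate the idea (this is not the code from mysql-crud-api, just a rough Python sketch under my own assumptions), a generator can walk the cursor and emit one JSON line per row. The table name, the SQLite backend and the newline-delimited JSON output format are all assumptions made for the example.

```python
import json
import sqlite3


def stream_records(conn, table="records"):
    """Yield one JSON line per row without loading the whole result set."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    for row in cursor:  # rows are fetched lazily as the loop advances
        yield json.dumps(dict(zip(columns, row))) + "\n"
```

A web framework that accepts an iterable as the response body could send each yielded line as it is produced. How and when the output is flushed depends on the framework and server in front of it.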


Need suggestions: Send multiple images to backend, perform upload operation in backend, send response

I need some best practice guidelines for a backend service in a scenario like this one:
UI sends multiple images for uploading to the backend service
Backend service receives all of the images and processes upload to storage one by one
One or more of the image uploads can fail
My question is: how do I send the response to the UI if my backend service is unable to upload one or more files?
One way would be to send the links of both the failed and the successful images together in a JSON response body, so the UI knows about the failures and handles them in its own way.
Another way would be to send only the links of the successfully uploaded images, which covers the best-case scenario.
Any suggestions are welcome, ideally with some reference links.
Use an Orchestrator - something specific that can coordinate multiple actions and provide a meaningful result back to the caller.
This might be as simple as a component sitting in the UI that orchestrates calls to the backend. The UI component and the backend service might be designed as parts of a cohesive solution, or the UI component might simply act as a type of client/proxy/facade to some random backend service.
UI calls the orchestrator with references to all the images it needs uploading.
The orchestrator works through the items, uploading each as you prefer (sequentially or in parallel, etc). For each file, handle errors however you prefer - e.g. try once and die gracefully on failure; put errors into a queue or some other mechanism for retry (how many times is up to you); etc.
Based on rules internal to the orchestrator, return status to the caller.
For potentially long-running processes (like file uploads) make sure the call to the orchestrator is asynchronous.
Rather than only returning a "complete" result at the end, the orchestrator might provide a simple status back, allowing callers to get some idea of where processing is at. For example, you might have a callback (from the orchestrator to its caller) that emits very simple statuses like: processing, failed and complete. A more complex solution would be for the orchestrator to return more specific info like %complete and detailed error info.
Have a look at how the big cloud providers do complex file uploads by reading their documentation and studying their APIs.
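As a rough sketch of the orchestrator steps above (one possible shape, not a definitive design), the function below works through the files one by one, retries failures a configurable number of times, and returns both the successes and the failures. The `store` callable, the `UploadReport` type and the extra "partial" status are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class UploadReport:
    succeeded: dict = field(default_factory=dict)  # filename -> stored link
    failed: dict = field(default_factory=dict)     # filename -> error message

    @property
    def status(self):
        # "partial" is an assumption of this sketch; the answer only names
        # processing, failed and complete.
        if not self.failed:
            return "complete"
        return "partial" if self.succeeded else "failed"


def upload_all(images, store, retries=1):
    """Upload each (name, data) pair via `store`, retrying a failure up to `retries` times."""
    report = UploadReport()
    for name, data in images:
        for _attempt in range(retries + 1):
            try:
                report.succeeded[name] = store(name, data)
                report.failed.pop(name, None)  # a retry succeeded, clear the error
                break
            except Exception as exc:  # a real service would catch narrower errors
                report.failed[name] = str(exc)
    return report
```

The report can be serialized into the JSON response body, which corresponds to the first option in the question: the UI sees both lists and decides how to handle the failures.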
I need some best practice guidelines for a backend service
In no particular order:
Keep it as simple as possible - generally, the fewer moving parts the better. E.g. pay attention to the Single Responsibility Principle (SRP).
Clean up after yourself. If the upload service generates any data - make sure you have a clean-up process so you don't end up with mountains of un-needed data lying around, especially stuff like image files. If you design an upload solution that maintains state (which is independent of what happens to the images once they are uploaded) then you'll be storing data which probably won't be needed once the images are all processed.
Think about support - not just developer debugging but also operational support. Getting your solution into production is not the end result, it's just the beginning.
If designing this solution across teams (e.g. frontend and backend teams) make sure both teams are involved in the design. If the backend team can't provide a solution that works for the frontend team then it's not going to end well.
Think about the likely error scenarios and how you can handle them.
This isn't really just a question of best practice, as there are multiple ways you could implement it, more than one of which could be valid. This is actually an architecture and design question, with more than one valid answer, hence I don't think it fits as a Stack Overflow question and you will not get references to any one correct approach.
That said, by way of an answer I will outline what I think you need. At a very high level, and not necessarily in this order but taking these factors into account, I would:
Design the UI process flow. For example, you may decide that the user process will have several stages:
User selects first image for upload;
User selects each subsequent image for upload;
User presses some kind of "Go" button after selecting all images;
System now uploads the batch, and user receives a response confirming success or otherwise;
User has option to click through to detailed success/error details.
Design the required success/error reports
Design the data needed to support the overall functionality
Provide one or more APIs giving the upload function and the report function(s) the CRUD access they need to this data
If you hit any specific technical issues at any stage, then please post new questions accordingly as you go.
As to the point you mentioned, how to send the UI response, there is more than one valid way, but I would return a basic success/failure response initially, containing only minimal details such as the number of successes, and return more details in further messages in response to user actions (such as clicking through to detailed success/error details), at which point I would retrieve the requested error details from the database.
As I said at the start of my answer, I don't think your question can be answered just in terms of best practices, as it's a whole architecture and design question, but I hope my answer helps you along this path.

File Based Processing versus REST API [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
We have a requirement where we need to process 10,000 transactions once daily in an offline (non-real-time) mode.
Which of the two options is preferable?
A batch file with 10,000 rows sent once a day and processed
or
An API call in small batches (as I am presuming sending 10K rows at once is not an option).
I was advised by my architects that option 1 is preferable and an API would only make sense when batch sizes are small - as the disadvantage of 2 is that the person calling the API has to break the payload down into small chunks when they have all the information available to them at once.
I am keen to see how "2" could be a viable option so any comments/suggestion to help make the case would be very helpful.
Thanks
Rahul
This is not a full answer. However, I would like to mention one reason in favor of REST API: Validation. This is better managed through the API. Once the file is dropped into an FTP location, it will be your responsibility to validate the format of the file. Will it be easy to route a "bad" file back to its source with a message to explain the bounce back?
With an API call, if the representation coming in does not adhere to a valid schema (e.g. XML, JSON, etc.), then your service can respond with a "400 Bad Request" HTTP status code. This keeps the responsibility for sending data in a valid format with the consumer of the service and helps to achieve a better separation of concerns.
Additional reasoning for a REST API:
Since your file contains transactions, each record should be atomic (If this were not true e.g. there are relationships between the records in the file, then those records should not be considered "transactions"). Therefore, chunking the file up into smaller batches should be trivial.
Regardless, you can define a service that accepts transactions in batch and respond with an HTTP status code of "202 Accepted". A 202 code indicates that the request was received and will be processed asynchronously. Therefore, the response can also contain callback links to check the status of individual transactions; or the batch as a whole. At that point, you would be implementing HATEOAS (Hypermedia as the Engine of Application State) and be in a position to automate the entire process and report on status.
Alternatively, with batch files, even if the file passes an upfront format validation check, you'll still have to process each transaction individually downstream. Some records may load, others may not. My assumption is that the records that fail to load would still need to be handled, and you may need to provide users with a view of what succeeded vs. failed. Now, this can all be handled outside the REST API. However, the API pattern is, IMHO, simple and elegant for this purpose.
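To make the 400 / 202 flow from this answer concrete, here is a minimal, framework-free sketch. The required fields, the `/batches/{id}` status link and the in-memory `jobs` dictionary are assumptions made up for the example. A real service would plug this logic into its own web framework and job queue.

```python
import uuid

REQUIRED_FIELDS = {"id", "amount"}  # hypothetical transaction schema


def accept_batch(transactions, jobs):
    """Validate a batch and return (http_status, body); processing happens elsewhere."""
    if not isinstance(transactions, list) or not transactions:
        return 400, {"error": "body must be a non-empty list of transactions"}
    for index, tx in enumerate(transactions):
        missing = REQUIRED_FIELDS - set(tx) if isinstance(tx, dict) else REQUIRED_FIELDS
        if missing:
            # Malformed input is bounced straight back to the sender.
            return 400, {"error": f"transaction {index} is missing {sorted(missing)}"}
    batch_id = str(uuid.uuid4())
    jobs[batch_id] = {"status": "pending", "transactions": transactions}
    # 202 Accepted: the work is done later; the link lets the caller check status.
    return 202, {"batch_id": batch_id, "links": {"status": f"/batches/{batch_id}"}}
```

The caller gets immediate feedback on format problems, and for valid batches it receives a link it can poll for the outcome of the batch.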
Using a batch process is always a better idea. You can trigger the batch process using a REST API.
With batch processing you can always send an email with a message such as "improper file format", or report which records were processed and which were not. With REST you cannot keep track of records and transactions.
As mentioned in the comment above, you can use a REST API to trigger a batch process asynchronously and send the status response using HATEOAS.
Spring Batch + Spring REST using Spring Boot
I have the same question, and all the answers I found were similarly subjective. I would like to offer some ideas for comparing the two concepts:
A batch solution requires more storage than a REST API. You will need to store your results in an intermediate storage area and write them in an open format. Perhaps you can compress them, but then you are trading storage for processing.
A REST API could use more network bandwidth than a batch solution, but only if the intermediate storage is not on a network drive. Fetch requests and query polling could require a lot of network bandwidth, but this could be solved with webhooks or WebSockets.
Automatic recovery is easier with a REST API than with a batch solution. The REST API response code can help you take an automatic decision to recover from a failure, and you reduce the number of services required to identify it. If the network is down, an email notification can fail just as a REST API call can. A REST API also helps you define a good interface for these cases.
A REST API can handle a high number of rows, like any other TCP-based protocol (such as FTP). But in case of any failure you will need logic to manage it, which means the REST API will require a chunk-enabled protocol too. For a batch service, this logic lives in the FTP protocol, but it is the protocol's own logic, not your business logic.
A batch service does not need to reserve an instance all the time (CPU, IP address, port, etc.); it just runs when it is needed. You will need a scheduler to start it, or manual effort, or a person to restart it if it fails. Again, apart from the scheduler, it is not natural to automate.
A batch service does not require extra security setup from the developer side. A REST API must take care of authentication and must also consider injection and other attack methods. A REST API could use helper services to prevent all of this, but that means more configuration.
Batch services are easy to deploy. A batch service can run on your machine or on a server, and you run it when the business needs it. A REST API requires continuous health checks, a deployment strategy to keep it up, DNS configuration, etc. Check whether your company provides all these services.
If this solution is for your company, check what your company is doing. Right now there is a common policy of moving to REST APIs, but if your support team does not know about them and has a lot of experience with batch solutions, it could be a good idea not to change.

Best way to store data between two request

I need some slightly theoretical advice. Here is my situation: I have a search system that returns a list of found items, but the user is only allowed to display a particular number of items on one page. When the first request reaches my WCF service, the service gets the whole list and checks whether it is longer than the number of items the user is allowed to get. If it isn't, there is no problem and the service returns the whole list. If it is, I need to let the user choose which page to display: I let the JavaScript know that the user should choose a page, the "page number dialog" is shown, and the user sends a second request with the page number. Based on this request, the web service selects the relevant items and sends them back to the user. So I need to store the whole list on the server between the first and second request, and I'd appreciate any ideas on how to store it. I was thinking about the session, but I don't know whether it is possible to set a timeout on only one particular session entry (e.g. Session["list"]). The list is used only once and can have thousands of items, so I don't want to keep it on the server too long.
PS. I can't use standard pagination; the scenario has to be exactly as described above.
Thanks
This sounds like a classic use-case for memcached. It is a network based key-value store for storing temporary values. Unlike in-memory state, it can be used to share temporary cached values among servers (say you have multiple nodes), and it is a great way to save state across requests (avoiding the latency that would be caused by using cookies, which are transmitted to/from the server on each http request).
The basic approach is to create a unique ID for each request, and associate it with a particular (set of) memcached key for that user's requests. You then save this unique ID in a cookie (or similar mechanism).
A warning, though: the memory is volatile, so values can be lost at any point. In practice this is infrequent, and memcached evicts entries using an LRU policy. More details: http://code.google.com/p/memcached/wiki/NewOverview
http://memcached.org/
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
I'm not a .net programmer, but there appear to be implementations:
http://code.google.com/p/memcached/wiki/Clients
.Net memcached client: https://sourceforge.net/projects/memcacheddotnet
.Net 2.0 memcached client: http://www.codeplex.com/EnyimMemcached (client developed in .NET 2.0 keeping performance and extensibility in mind; supports consistent hashing)
http://www.codeplex.com/memcachedproviders
BeIT Memcached Client (optimized C# 2.0): http://code.google.com/p/beitmemcached
jehiah: http://jehiah.cz/projects/memcached-win32
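Since I'm not a .NET programmer, here is the basic approach (unique ID per search, short-lived cache entry, ID handed back to the client) as a rough Python sketch instead. `ExpiringStore` is a hypothetical in-process stand-in for memcached, written only so the example runs without a server. With a real memcached client you would call its set/get operations with an expiry time instead.

```python
import time
import uuid


class ExpiringStore:
    """In-process stand-in for a cache such as memcached: values vanish after `ttl` seconds."""

    def __init__(self, clock=time.monotonic):
        self._items = {}
        self._clock = clock

    def set(self, key, value, ttl):
        self._items[key] = (self._clock() + ttl, value)

    def get(self, key):
        expires_at, value = self._items.get(key, (0, None))
        if self._clock() >= expires_at:
            self._items.pop(key, None)  # expired or never stored
            return None
        return value


def save_results(store, results, ttl=300):
    """First request: cache the full list and return the token to send to the client."""
    token = uuid.uuid4().hex
    store.set(token, results, ttl)
    return token


def get_page(store, token, page, page_size):
    """Second request: return the chosen page, or None if the entry has expired."""
    results = store.get(token)
    if results is None:
        return None
    start = (page - 1) * page_size
    return results[start:start + page_size]
```

Because the cache can drop entries at any time, the second request has to cope with `None`, for example by re-running the search.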

what are the Streaming APIs

Basically, I want to understand, from both a high-level and a technical point of view, what constitutes a streaming API. There is all sorts of material available, but I could not find a satisfactory explanation of what a streaming API is, or of how it differs from general APIs (REST, if applicable).
PS: I am not asking about multimedia streaming.
Kind of a vague question. I guess streaming usually means one of the following (or a combination):
downloading data for immediate consumption, rather than a whole file for storage, potentially with support for delivering partial data (lower quality, only relevant pieces etc), sometimes even without any storage at all in between producer and consumer
a persistent connection that continues to deliver new data as it becomes available, rather than having the client poll
Good examples of the first pattern are streaming XML parsers (such as SAX). They allow you to handle XML data that is too big to fit into memory (which is where a DOM parser likes to put it).
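As a small illustration of the SAX style, the sketch below uses Python's standard `xml.sax` module. The `<item price="...">` document shape is made up for the example. The handler reacts to each element as the parser reaches it and keeps only running totals, never a full tree.

```python
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements and sums their `price` attribute as events arrive."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.total = 0.0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1
            self.total += float(attrs.get("price", "0"))


def summarize(xml_bytes):
    """Parse the document event by event and return (item count, price total)."""
    handler = ItemCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.count, handler.total
```

`parseString` is used here only to keep the example short. For a genuinely large input, `xml.sax.parse` accepts a file name or file-like object and reads it incrementally.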
I just found another good answer here:
https://www.quora.com/What-is-meant-by-streaming-API
A streaming API differs from a normal REST API in that it leaves the HTTP connection open for as long as possible (i.e. a "persistent connection"). It pushes data to the client as and when it is available, so there is no need for the client to poll the server for newer data. Maintaining a persistent connection significantly reduces network latency when a server produces a continuous stream of data, as today's social media channels do. These APIs are mostly used to read or subscribe to data.

Streaming API vs Rest API?

The canonical example here is Twitter's API. I understand conceptually how the REST API works: essentially it's just a query to their server for your particular request, for which you then receive a response (JSON, XML, etc.). Great.
However, I'm not exactly sure how a streaming API works behind the scenes. I understand how to consume it. For example, with Twitter you listen for a response, and from the response you listen for data, in which the tweets arrive in chunks. You build up the chunks in a string buffer and wait for a line feed, which signifies the end of a tweet. But what are they doing to make this work?
Let's say I had a bunch of data and I wanted to set up a streaming API locally for other people on the net to consume (just like Twitter). How is this done, and with what technologies? Is this something Node.js could handle? I'm just trying to wrap my head around what they are doing to make this thing work.
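For reference, the consumer-side buffering described above could be sketched roughly like this. It is a generic illustration of newline-delimited framing, not Twitter's client code. Skipping blank lines is an assumption, since some streaming services send them as keep-alives.

```python
class LineBuffer:
    """Accumulates byte chunks and releases complete newline-terminated messages."""

    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        """Add a chunk and return the list of messages it completed (possibly empty)."""
        self._pending += chunk
        # Everything before the last newline is complete; the tail stays buffered.
        *complete, self._pending = self._pending.split(b"\n")
        # Skip blank lines and strip a trailing carriage return if present.
        return [line.rstrip(b"\r").decode("utf-8") for line in complete if line.strip()]
```

Each network chunk is passed to `feed`, and whatever comes back is a list of whole messages ready to be parsed as JSON.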
Twitter's stream API is essentially a long-running request that's left open; data is pushed into it as and when it becomes available.
The repercussion of that is that the server has to be able to deal with lots of concurrent open HTTP connections (one per client). A lot of existing servers don't manage that well. For example, Java servlet engines assign one thread per request, which can (a) get quite expensive and (b) quickly hit the normal max-threads setting and prevent subsequent connections.
As you guessed the Node.js model fits the idea of a streaming connection much better than say a servlet model does. Both requests and responses are exposed as streams in Node.js, but don't occupy an entire thread or process, which means that you could continue pushing data into the stream for as long as it remained open without tying up excessive resources (although this is subjective). In theory you could have a lot of concurrent open responses connected to a single process and only write to each one when necessary.
If you haven't looked at it already the HTTP docs for Node.js might be useful.
I'd also take a look at technoweenie's Twitter client to see what the consumer end of that API looks like with Node.js, the stream() function in particular.
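To show the "many open responses in a single process" idea without tying it to Node.js, here is a loose sketch of the same event-loop model in Python's `asyncio`. It is not how Twitter implements its stream. The `Broadcaster` class and the use of `None` as an end-of-stream marker are inventions for this example, and the network layer is left out entirely.

```python
import asyncio


class Broadcaster:
    """Fans each published message out to every open subscriber queue."""

    def __init__(self):
        self._subscribers = set()

    def publish(self, message):
        for queue in self._subscribers:
            queue.put_nowait(message)

    def close(self):
        self.publish(None)  # None is the end-of-stream marker in this sketch

    async def subscribe(self):
        """Async generator standing in for one long-lived open response."""
        queue = asyncio.Queue()
        self._subscribers.add(queue)
        try:
            while (message := await queue.get()) is not None:
                yield message
        finally:
            self._subscribers.discard(queue)


async def demo(clients=3):
    hub = Broadcaster()

    async def client():
        return [message async for message in hub.subscribe()]

    tasks = [asyncio.create_task(client()) for _ in range(clients)]
    await asyncio.sleep(0)  # let every client register before publishing
    for message in ("first", "second"):
        hub.publish(message)
    hub.close()
    return await asyncio.gather(*tasks)
```

Every subscriber is just a suspended coroutine waiting on its queue, so idle connections cost a small amount of memory rather than a thread each. In a real server, each yielded message would be written to that client's open HTTP response.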