Handling Large Requests to an API

Are there best practices for how to pass large lists between services? I see some recommendations to pass S3 file URLs between services if the payloads can be large, but that seems like a step backwards because if the data is in S3 then the client can't use the server's API schema to validate the request as easily as if the data were passed in a list.
I can't process the data in small batches because it all needs to be processed at once.
Example:
Service B has API 1.
API 1's job is to receive a list of cars and, once all cars have been received, to take some action on each car. All cars need to be acted on; it's not OK to take the action on only some cars.
Service A wants to send Service B 400,000 cars to store in Service B's database.
Should Service B structure API 1 so that it expects:
A list of cars
A URL for an S3 file that contains a list of cars
Something else

It hugely depends on what the actual requirements and constraints are. Sending a large amount of data via an API in one go is usually not a good idea for several reasons (network interruptions, memory consumption, etc.). If you don't have transactional requirements, you can just send the data in small batches and (re)design the API to support that. You could also consider switching completely from synchronous API calls to asynchronous ones, for example via a messaging pipeline such as Kafka.
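If batching is acceptable, one possible shape for such an API is an upload-session protocol: the client opens a session, sends the cars in chunks, and then issues a commit so the server still acts on all cars (or none) at once. Below is a minimal Python sketch; the endpoint paths, payload fields, and batch size are hypothetical, not part of any real Service B API.

```python
import requests  # any HTTP client would do

SERVICE_B = "https://service-b.example.com"  # hypothetical base URL
BATCH_SIZE = 1000                            # illustrative chunk size

def upload_cars(cars):
    # 1. Open an upload session so Service B can group the incoming batches.
    session_id = requests.post(f"{SERVICE_B}/car-uploads", timeout=30).json()["id"]

    # 2. Send the 400,000 cars in manageable chunks.
    for i in range(0, len(cars), BATCH_SIZE):
        resp = requests.post(
            f"{SERVICE_B}/car-uploads/{session_id}/batches",
            json={"cars": cars[i:i + BATCH_SIZE]},
            timeout=30,
        )
        resp.raise_for_status()

    # 3. Commit: only now does Service B act on the full set, preserving the
    #    "all cars or none" requirement.
    requests.post(f"{SERVICE_B}/car-uploads/{session_id}/commit", timeout=30).raise_for_status()
```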

Related

Is it better to have one or two REST endpoints for this scenario?

I have a microservice that handles the booking process. Half of the data is in a db bookings table and half within an external third-party service that is required for completing the booking.
From a REST perspective, is it better to expose two endpoints on the same microservice, one for each case, i.e.
POST /bookings (pass relevant data)
POST /external-service/bookings (pass relevant data)
or one
POST /bookings (pass all data for both cases)
which creates a new record in the bookings db table but also talks to the external service API to complete the booking?
Personally I am leaning towards the second approach.
From a REST perspective, it is generally recommended to have a single endpoint per resource; in this case, the booking is the resource.
The second approach, where you have one endpoint for creating a booking and it handles both saving to the database and interacting with the external service, would be more consistent with REST principles.
POST /bookings (pass all data for both cases)
It allows for better separation of concerns: the microservice is responsible for handling the booking resource and doesn't expose to the client the details of how the booking is completed. Additionally, it eliminates the need for the client to make multiple requests to different endpoints to complete a single action.
However, this approach may make testing and debugging more challenging if the external service is unavailable or needs to be updated.
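As a rough illustration of the single-endpoint approach, here is a Flask sketch; the helpers save_booking_to_db and mark_booking_failed and the external_service client are hypothetical names, not from the original question.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/bookings", methods=["POST"])
def create_booking():
    payload = request.get_json()

    # Persist the local half of the booking first.
    booking_id = save_booking_to_db(payload)  # hypothetical helper

    try:
        # Then complete the booking with the third-party service; the client
        # never needs to know this call happens.
        external_ref = external_service.complete_booking(booking_id, payload)  # hypothetical client
    except Exception:
        # Keep the two halves consistent if the external call fails.
        mark_booking_failed(booking_id)  # hypothetical helper
        return jsonify({"error": "booking could not be completed"}), 502

    return jsonify({"id": booking_id, "externalRef": external_ref}), 201
```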

What is the difference between data-sync and pub-sub in Deepstream

All:
I am pretty new to deepstream. On its website, the core concepts section describes them as:
data-sync: Interactive JSON documents that can be edited and observed. Changes are persisted and synced across clients.
and
publish-subscribe: Many clients can subscribe to topics and receive data whenever other clients publish it to the same topic.
I wonder what the difference is between its data-sync and pub-sub in terms of their purpose. Put another way, what task can one do that the other cannot?
Thanks
PubSub is a way for clients and servers to send messages to each other. These messages can contain all sorts of data, but once a message is delivered it's gone - there's no storage or statefulness. If you're familiar with EventEmitters in e.g. JavaScript, you're already familiar with the pattern.
Data-sync on the other hand is stateful, persistent data. Clients can request JSON documents called records, update them and subscribe to changes made by other clients. Records can be arranged in lists, and lists can be referenced by records, allowing data-sync to become the realtime backbone for all the data that drives your app.
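A library-free Python sketch of the two patterns (this is only a conceptual illustration, not deepstream's actual client API):

```python
# Publish-subscribe: fire-and-forget messages, no storage.
class EventBus:
    def __init__(self):
        self.subscribers = {}

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, data):
        for callback in self.subscribers.get(topic, []):
            callback(data)  # delivered once; afterwards the message is gone


# Data-sync: a named, stateful record; late subscribers still see the data.
class Record:
    def __init__(self):
        self.data = {}
        self.observers = []

    def subscribe(self, callback):
        self.observers.append(callback)
        callback(self.data)  # a new subscriber immediately receives current state

    def set(self, key, value):
        self.data[key] = value
        for callback in self.observers:
            callback(self.data)  # every change is pushed to all observers
```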

how would I expose 200k+ records via an API?

what would be the best option for exposing 220k records to third party applications?
SF style 'bulk API' - independent of the standard API to maintain availability
server-side pagination
a callback to an FTP-generated file?
webhooks?
This bulk transfer will have to happen once a day or so. Any other suggestions welcome!
How are the 220k records being used?
Must serve it all at once
Not ideal for human consumers of this endpoint without special GUI considerations and communication.
A. I think that using a 'bulk API' would be marginally better than reading a file of the same data. (Not 100% sure on this.) Opening and interpreting a file might take a little bit more time than directly accessing data provided in an endpoint's response body.
Can send it in pieces
B. If only a small amount of data is needed at once, then server-side pagination should be used, allowing the consumer to request new batches of data as desired. This reduces unnecessary server load by not sending data that wasn't specifically requested. (A short pagination sketch follows this list.)
C. If all of it needs to be received during a user-session, then find a way to send the consumer partial information along the way. Often users can be temporarily satisfied with partial data while the rest loads, so update the client periodically with information as it arrives. Consider AJAX Long-Polling, HTML5 Server Sent Events (SSE), HTML5 Websockets as described here: What are Long-Polling, Websockets, Server-Sent Events (SSE) and Comet?. Tech stack details and third party requirements will likely limit your options. Make sure to communicate to users that the application is still working on the request until it is finished.
Can send less data
D. If the third party applications only need to show updated records, could a different endpoint be created for exposing this more manageable (hopefully) subset of records?
E. If the end-result is displaying this data in a user-centric application, then maybe a manageable amount of summary data could be sent instead? Are there user-centric applications that show 220k records at once, instead of fetching individual ones (or small batches)?
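For option B, a minimal server-side pagination sketch in Flask; the fetch_records helper and the query-parameter names are assumptions for illustration only.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
MAX_PAGE_SIZE = 500  # illustrative cap

@app.route("/records")
def list_records():
    page = max(int(request.args.get("page", 1)), 1)
    size = min(int(request.args.get("size", MAX_PAGE_SIZE)), MAX_PAGE_SIZE)
    offset = (page - 1) * size

    # fetch_records is a hypothetical DB helper returning (rows, total_count).
    rows, total = fetch_records(offset=offset, limit=size)

    return jsonify({"page": page, "size": size, "total": total, "items": rows})
```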
I would use a streaming API. This is an API that does a "select * from table" and then streams the results to the consumer. You do this using a for loop to fetch and output the records. This way you never use much memory, and as long as you frequently flush the output, the web server will not close the connection and you will support result sets of any size.
I know this works because I (shameless plug) wrote mysql-crud-api, which does exactly this.
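The same idea in a short Python/Flask sketch (this is not mysql-crud-api itself, just the pattern: stream rows as they are fetched so memory stays flat; the table and column names are made up):

```python
import json
import sqlite3
from flask import Flask, Response

app = Flask(__name__)

@app.route("/records/stream")
def stream_records():
    def generate():
        conn = sqlite3.connect("app.db")  # illustrative database
        try:
            cursor = conn.execute("SELECT id, name FROM records")
            for row in cursor:  # rows are fetched and written incrementally
                yield json.dumps({"id": row[0], "name": row[1]}) + "\n"
        finally:
            conn.close()

    # Newline-delimited JSON keeps the connection streaming, and the client
    # can start processing before the full result set has been sent.
    return Response(generate(), mimetype="application/x-ndjson")
```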

Best way to store data between two requests

I need a bit of theoretical advice. Here is my situation: I have a search system which returns a list of found items, but the user is only allowed to display a particular number of items on one page. So when the first request is sent to my WCF service, it gets the whole list and tests whether the list is longer than the number of items the user is allowed to get. If it isn't, there is no problem and my service returns the whole list; but when it is longer, there is a problem.
I need to let the user choose which page to display, so I let the JavaScript know that the user should choose a page, the "page number dialog" is shown, and then the user sends a second request with the page number. Based on this request, the web service selects the relevant items and sends them back to the user. So what I need to do is store the whole list on the server between the first and second request, and I'd appreciate any idea how to store it. I was thinking about session state, but I don't know if it is possible to set a timeout only on a particular session entry (e.g. Session["list"]), because the list is used only once and can have thousands of items, so I don't want to keep it on the server too long.
PS: I can't use standard pagination; the scenario has to be exactly as described above.
Thanks
This sounds like a classic use case for memcached. It is a network-based key-value store for storing temporary values. Unlike in-memory state, it can be used to share temporary cached values among servers (say you have multiple nodes), and it is a great way to save state across requests (avoiding the latency that would be caused by using cookies, which are transmitted to/from the server on each HTTP request).
The basic approach is to create a unique ID for each request and associate it with a particular memcached key (or set of keys) for that user's requests. You then save this unique ID in a cookie (or similar mechanism).
A warning, though: the memory is volatile, so values can be lost at any point. In practice this is not frequent, and memcached evicts entries with an LRU policy. More details: http://code.google.com/p/memcached/wiki/NewOverview
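A minimal sketch of that approach, shown here with the pymemcache Python client purely for illustration; the asker's stack is .NET, where any of the clients listed below would play the same role.

```python
import json
import uuid
from pymemcache.client.base import Client  # any memcached client works the same way

cache = Client(("localhost", 11211))
RESULT_TTL = 300  # seconds; roughly how long a user may sit on the page dialog

def store_search_results(items):
    # First request: cache the full result list under a fresh key and return
    # the key to the client (e.g. in a cookie).
    key = "search:" + uuid.uuid4().hex
    cache.set(key, json.dumps(items), expire=RESULT_TTL)
    return key

def get_results_page(key, page, page_size):
    # Second request: look the list up again and return only the chosen page.
    cached = cache.get(key)
    if cached is None:
        return None  # expired or evicted; the search has to be re-run
    items = json.loads(cached)
    start = (page - 1) * page_size
    return items[start:start + page_size]
```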
http://memcached.org/
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
I'm not a .NET programmer, but there appear to be implementations (see http://code.google.com/p/memcached/wiki/Clients):
.Net memcached client: https://sourceforge.net/projects/memcacheddotnet
.Net 2.0 memcached clients: http://www.codeplex.com/EnyimMemcached (developed in .NET 2.0 keeping performance and extensibility in mind; supports consistent hashing) and http://www.codeplex.com/memcachedproviders
BeIT Memcached Client (optimized C# 2.0): http://code.google.com/p/beitmemcached
jehiah: http://jehiah.cz/projects/memcached-win32

Do I really need reliable sessions for my services? (description inside)

Our company leases a music service to its clients. The product consists of an automated mp3 player and daily renewals/updates of the customers' music library (mp3 songs) downloaded to their machines. So far we have used an ugly solution for the mp3 updates, synchronizing server and client folders using GBridge. This is obviously a disadvantage, as we force our clients to download our whole music library (currently 25,000 songs) while most of them will never play songs from all of our music categories (pop, rock, etc.). Most importantly, we can only offer one subscription package (our whole music library) while our competitors offer packages by category at lower prices. For those reasons we decided to turn to WCF.
The service uses PerCall instancing mode and implements two operations, invoked from a WinForms client application with the classic request-reply pattern.
The first operation retrieves from a database the categories a client is allowed to download from (request) and sends back to the client a list of these categories (reply).
The second operation is used for downloading. The client first downloads an XML version of the server's database; a similar XML file lies on the client side. The client app checks which songs, in each of the categories returned from the first operation, are missing from its own XML compared to the server's XML file. If any files (elements in the XML) are missing, it downloads them one file at a time. After each download, the client updates its XML and repeats the comparison until all files (elements) match in the two XML files.
Long story short: the instancing mode on the service is PerCall for throughput reasons and to keep memory consumption low, and both of my operations use the request-reply pattern, which means acknowledgement messages are sent back to the client with each response from the service. So if something goes wrong in the connection, or if the client can't reach the service, I can catch the CommunicationObjectFaultedException on the client, reconstruct the proxy and retry. Given all that, do you think there is a need for reliable sessions in my service implementation? What problems could arise if I don't have reliable sessions in the operations just described?
What problems could arise if I don't have reliable sessions in the operations just described?
I am aware of only a few problems being solved by reliable sessions, while they put a lot of stress on the server.
I would personally go for BasicHttpBinding (for better interoperability) without reliable session.
UPDATE
In order to understand Reliable Sessions, have a read of this and this.
If you are a bank sending money to and from other banks, it makes sense to use Reliable Sessions; this ensures the message is received by the final party involved. But in most cases you will not need it.