Need suggestions: Send multiple images to backend, perform upload operation in backend, send response - amazon-s3

I need some best practice guidelines for a backend service in a scenario like this one:
UI sends multiple images for uploading to the backend service
Backend service receives all of the images and uploads them to storage one by one
One or more of the image uploads can fail
My question is: how do I send the response to the UI if my backend service is unable to upload one or more of the files?
One way is to send the failed and successful image links together in a JSON response body, so the UI knows about the failures and handles them in its own way.
Another way is to send only the successfully uploaded images' links, which is the best-case scenario.
Any suggestions, ideally with some reference links, are welcome.

Use an Orchestrator - something specific that can coordinate multiple actions and provide a meaningful result back to the caller.
This might be as simple as a component sitting in the UI that orchestrates calls to the backend. The UI component and the backend service might be designed as parts of a cohesive solution, or the UI component might simply act as a type of client/proxy/facade to some random backend service.
UI calls the orchestrator with references to all the images it needs uploading.
The orchestrator works through the items, uploading each as you prefer (sequentially or in parallel, etc). For each file, handle errors however you prefer - e.g. try once and die gracefully on failure; put errors into a queue or some other mechanism for retry (how many times is up to you); etc.
Based on rules internal to the orchestrator, return status to the caller.
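As a minimal sketch of those two steps (work through the items, then return a status to the caller), assuming boto3, an existing S3 bucket and illustrative names that are not part of the original answer:

```python
# Minimal orchestrator sketch: upload each file to S3, collect per-file results.
# Assumes boto3 is installed and credentials exist; the bucket name, function
# names and result shape are illustrative, not from the original post.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

s3 = boto3.client("s3")
BUCKET = "my-upload-bucket"  # hypothetical bucket name

def upload_batch(files):
    """files: list of (key, local_path) tuples. Returns per-file status."""
    results = {"succeeded": [], "failed": []}
    for key, path in files:
        try:
            s3.upload_file(path, BUCKET, key)
            results["succeeded"].append({"key": key, "url": f"s3://{BUCKET}/{key}"})
        except (BotoCoreError, ClientError) as exc:
            # "die gracefully": record the failure and carry on with the rest
            results["failed"].append({"key": key, "error": str(exc)})
    results["status"] = (
        "complete" if not results["failed"]
        else "partial" if results["succeeded"]
        else "failed"
    )
    return results
```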
For potentially long-running processes (like file uploads) make sure the call to the orchestrator is asynchronous.
Rather than only returning a "complete" result at the end, the orchestrator might provide simple status updates along the way, allowing callers to get some idea of how far processing has progressed. For example, you might have a callback (from the orchestrator to its caller) that emits very simple statuses like: processing, failed and complete. A more complex solution would be for the orchestrator to return more specific information, such as percent complete and detailed error info.
Have a look at how the big cloud providers handle complex file uploads by reading their documentation and studying their APIs.
I need some best practice guidelines for a backend service
In no particular order:
Keep it as simple as possible - generally, the fewer moving parts the better. E.g. pay attention to the Single Responsibility Principle (SRP).
Clean up after yourself. If the upload service generates any data - make sure you have a clean-up process so you don't end up with mountains of un-needed data lying around, especially stuff like image files. If you design an upload solution that maintains state (which is independent of what happens to the images once they are uploaded) then you'll be storing data which probably won't be needed once the images are all processed.
Think about support - not just developer debugging but also operational support. Getting your solution into production is not the end result, it's just the beginning.
If designing this solution across teams (e.g. frontend and backend teams) make sure both teams are involved in the design. If the backend team can't provide a solution that works for the frontend team then it's not going to end well.
Think about the likely error scenarios and how you can handle them.

This isn't really just a question of best practice, as there are multiple ways you could implement it, more than one of which could be valid. It is really an architecture and design question, so I don't think it fits well as a Stack Overflow question, and you will not get references to any single correct approach.
That said, by way of an answer I will outline what I think you need. At a very high level, and not necessarily in this order but taking these factors into account, I would:
Design the UI process flow. For example, you may decide that the user process will have several stages:
User selects first image for upload;
User selects each subsequent image for upload;
User presses some kind of "Go" button after selecting all images;
System now uploads the batch, and user receives a response confirming success or otherwise;
User has option to click through to detailed success/error details.
Design the required success/error reports
Design the data needed to support the overall functionality
Provide one or more APIs giving the upload function and the report function(s) the CRUD access they need to this data
If you hit any specific technical issues at any stage, then please post new questions accordingly as you go.
As to the point you mentioned about how to send the UI response: there is more than one valid way, but I would return a basic success/failure response initially, containing only minimal details such as the number of successes, and return more detail in further messages in response to user actions (such as clicking through to detailed success/error information), at which point I would retrieve the requested error details from the database.
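As an illustration only (field names and values here are hypothetical, not taken from this answer), the two payloads could look something like this, expressed as Python literals:

```python
# Hypothetical payload shapes for the two-step response described above.
summary_response = {          # returned immediately after the batch is processed
    "batchId": "b-123",       # key the UI can use to fetch details later
    "received": 5,
    "succeeded": 4,
    "failed": 1,
}

detail_response = {           # returned when the user clicks through for details
    "batchId": "b-123",
    "items": [
        {"file": "a.jpg", "status": "ok", "url": "https://cdn.example.com/a.jpg"},
        {"file": "b.jpg", "status": "error", "reason": "file too large"},
    ],
}
```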
As I said at the start of my answer, I don't think your question can be answered just in terms of best practices, as it's a whole architecture and design question, but I hope my answer helps you along this path.

Related

Can I send an API response before successful persistence of data?

I am currently developing a Microservice that is interacting with other microservices.
The problem now is that those interactions are really time-consuming. I have already implemented concurrent calls via Uni and use caching where useful. I still have some calls that need a few seconds to respond, so I thought of another thing I could do to improve performance:
Is it possible to send a response before the successful persistence of data? I send requests to the other microservices, where they have to persist the results of my methods. Can I already send the user the result in a first response and then send a second response once the persistence has succeeded?
With that, the front-end could already begin working even though my API is not 100% finished.
I saw that there is a possible status code 207, but it's rather used with streams where someone wants to split large files. Is there another possibility? Thanks in advance.
"Is it possible to send a response before the sucessfull persistence of data? Can I already send the user the result in a first response and make a second response if the persistence process was sucessfull? With that, the front-end could already begin working even though my API is not 100% finished."
You can and should, but it is a philosophy change in your API, and you possibly have to consider some edge cases and the techniques to deal with them.
For a long-running API call, you can issue an "ack" response (a traditional 200, or more conventionally a 202 Accepted), where the answer just means the operation is asynchronous and will complete in the future, something like { id: 49584958, apicall: "create", status: "queued", result: true }
Then you can
poll your API with the returned ID to see whether the still-ongoing operation has succeeded or failed (see the sketch below);
have an SSE channel (real-time Server-Sent Events) where your server can issue status messages as pending operations finish;
maybe, using persistent connections and keep-alives, or by flushing the response part-way through, you can achieve what you describe, i.e. a kind of segmented response. I am not familiar with that approach, as I normally go for the suggestions above.
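A minimal sketch of the "ack then poll" option, assuming Flask and an illustrative in-memory job store (none of these names come from the original answer):

```python
# Accept the request, acknowledge immediately with 202, persist in the background,
# and expose a status endpoint the client can poll with the returned id.
import threading
import uuid
from flask import Flask, jsonify

app = Flask(__name__)
jobs = {}  # job_id -> "queued" | "done" | "failed" (use a real store in production)

def persist(job_id):
    try:
        # call the downstream microservice that persists the data (placeholder)
        ...
        jobs[job_id] = "done"
    except Exception:
        jobs[job_id] = "failed"

@app.route("/things", methods=["POST"])
def create_thing():
    job_id = str(uuid.uuid4())
    jobs[job_id] = "queued"
    threading.Thread(target=persist, args=(job_id,)).start()
    # Acknowledge immediately; 202 Accepted signals asynchronous completion.
    return jsonify({"id": job_id, "status": "queued"}), 202

@app.route("/things/<job_id>/status")
def poll(job_id):
    return jsonify({"id": job_id, "status": jobs.get(job_id, "unknown")})
```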
But in any case, the same edge cases apply: for example, what happens if, through your API, a user then issues calls that depend on the success of a previous command that is still ongoing (or has not even started)? For example, requesting information about something that is still being persisted?
You will have to deal with these situations with mechanisms like:
Reject related operations until the pending call is resolved server-side: the API could return e.g. a BUSY error informing the caller that operations are still ongoing when they want to, for example, delete something that is still being created.
Queue all operations so the server executes them all sequentially.
Allow some simultaneous operations if you find they will not collide (e.g. creating 2 unrelated items).
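The "reject while busy" rule could look roughly like this, continuing the previous sketch (the 409/BUSY shape is an assumption, not something prescribed above; `app` and `jobs` are the illustrative objects defined earlier):

```python
# Continuation of the previous sketch: reject a dependent operation while the
# resource it targets is still being created.
@app.route("/things/<job_id>", methods=["DELETE"])
def delete_thing(job_id):
    status = jobs.get(job_id)
    if status == "queued":
        # creation still in progress: refuse the dependent operation for now
        return jsonify({"error": "BUSY", "detail": "creation still in progress"}), 409
    if status is None:
        return jsonify({"error": "NOT_FOUND"}), 404
    jobs.pop(job_id)  # safe to delete once the pending call has resolved
    return "", 204
```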

Microservices + CQRS implementation

I am working on implementing a microservice architecture using the CQRS pattern. I have a working implementation using API Gateway, Lambda and DynamoDB with one exception - the event sourcing.
Event Sourcing has the applications publishing a notification to an event stream that other services in the platform can consume. This notification represents an event that took place as part of the originating HTTP request. For instance, if the user makes an HTTP POST with a complete "check patient into hospital" model, then the Lambda will break that apart and publish multiple events in sequential order.
Patient Checked in (includes Patient Id, hospital id + visit id)
Room Assigned (includes room number, + visit id)
Patient tested (includes tested + visit id)
Patient checked-out (visit id)
The intent for this pattern is to provide an audit trail of all events that took place while the patient was in the hospital. This example (not what I'm actually building) would be stored in an event source that can be replayed at any time. If the VisitId was deleted across all services we could just replay the events one at a time, in order, and reproduce an exact copy of the original record. You consider all records immutable to achieve this. Each POST would push into the event source and then land in the database that would pull the data out during a HTTP GET request. It would also have subscribers that would take pieces of this data and do other things - such as a "Visit Survey" service that would listen to the Patient Checked Out event and prep a post-op survey.
I've looked at several AWS services to provide this. I know about Kinesis Data Streams but I don't like the pricing structure nor do I want to deal with shards (no autoscaling). Since my entire platform is built on consumption based pricing (Dynamo, Lambda etc) I want to keep my event source the same way. This makes it easier for me to estimate a per-user cost as I just do math based on estimated requests per month, per user.
I've been using SNS for the stream itself, delivering the notifications, and it's been great. Super fast, and I have not had any major issues while developing with it. The issue, though, is that this is not suitable as a replay store - only for delivery of the event messages. For a replay store I thought Kinesis Firehose made a lot of sense... Send it to S3 + SNS at the same time. It turns out SNS isn't an available delivery destination. I can Put to S3 myself and then publish to SNS, but that seems like duplicate work in the code base when I can set up an S3 trigger to fire a Lambda and just have another small Lambda that reacts to the event landing in S3 and does the insert into DynamoDB. I've seen that this can be much slower, though, than just publishing through SNS. I'm also not sure about retry policies on the Put event. It does simplify retries, though, as I can just re-use the code in the triggered Lambda to replay all events in a bucket path.
I could just PutObject and then Publish to SNS within the same HTTP POST Lambda. If the SNS Publish fails, though, then I now have an object in S3 that was never published. I'd have to write a different Lambda to handle the fixing and publishing. Not the end of the world - either way I have two Lambdas to deploy. I'm just not sure which way makes more sense in this pattern with AWS services.
Has anyone done something similar and have any recommendations? Am I working my way into a technical hole that will be difficult to manage later? I'm open to other paths as well if I can keep it to a consumption based pricing model. Thanks!
Event Sourcing has the applications publishing a notification to an event stream that other services in the platform can consume.
You'll want to be a little bit careful here -- there are at least two different definitions of "event sourcing" running around.
If you care about event sourcing, in the sense usually coupled with CQRS (Greg Young, et al), then your events are your book of record. The important complication this introduces is that your service needs to be able to lock the "event stream" when making changes to it (without that lock, you run into "lost edit" scenarios and have to clean up the mess).
So the "pointer to your current changes" needs to live in something that has transactions. DynamoDB should be fine for this (based on my memory of the event sourcing break out room at re:Invent 2017). In theory, you could have the lock in dynamo, which contains a pointer to an immutable document stored in S3. I haven't been able to persuade myself that the trade offs justify the complexity, but as best I can tell there's nothing in that architecture that violates physics and causality.
If your operations team isn't happy with Dynamo, another reasonable option is RDS; choose your preferred relational data engine, deploy an event storage schema to it, and off you go.
As for the pub sub part, I believe you to be on the right track with SNS. It's the right choice for "fanning out" messages from a publisher to multiple consumers. Yes, it doesn't support replay, but that's fine -- replay can happen by pulling events from the book of record. See the later parts of Greg Young's Polyglot Data talk. Yes, sometimes you will get messages on both the push channel and the pull channel, but that's fine; you already signed up for idempotent message handling when you decided a distributed architecture was a good idea.
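For the fan-out part, a minimal boto3 sketch (the topic ARN, event shape and function name are illustrative assumptions, not taken from the question or answer):

```python
# Publish a notification to SNS so downstream consumers fan out from it.
# The durable copy of the event lives in the book of record; SNS only notifies.
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:visit-events"  # hypothetical

def publish_event(event: dict) -> None:
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps(event),
        MessageAttributes={
            # lets subscribers filter by event type
            "eventType": {"DataType": "String", "StringValue": event["type"]},
        },
    )

publish_event({"type": "PatientCheckedOut", "visitId": "v-42"})
```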
Edit
Why the need to store a pointer in DynamoDB?
Because S3 doesn't offer you any locking; which means that on the unhappy path, where two copies of your logic are trying to write different versions of your data, you end up victim to the lost edit problem.
You could manage the situation with optimistic locking - something analogous to HTTP's conditional PUT; but S3 (last time I checked) doesn't support conditional modification.
You could use S3 as an object store for immutable documents, but now you need some mechanism to determine which document in S3 is the "current" one. If you try to implement that in S3, you run into the same lost edit problem all over again.
So you need a different tool to handle that part of the problem; some tool that is suitable for "state succession". So DynamoDB fits there.
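As an illustration of that "state succession" idea, a sketch of optimistic locking with a DynamoDB conditional write; the table, attribute and function names here are assumptions, not the answerer's design:

```python
# Advance the "head" pointer of a stream only if nobody else advanced it first.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("event-stream-heads")  # hypothetical table

def advance_head(stream_id: str, expected_version: int, new_document_key: str) -> bool:
    try:
        table.put_item(
            Item={
                "stream_id": stream_id,
                "version": expected_version + 1,
                "document_key": new_document_key,  # e.g. the S3 key of the new events
            },
            # fail if another writer already moved the pointer past expected_version
            ConditionExpression="attribute_not_exists(version) OR version = :v",
            ExpressionAttributeValues={":v": expected_version},
        )
        return True
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # lost the race: reload the head and retry, no lost edit
        raise
```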
If you are using DynamoDB for locking, can you also use it for event storage? I don't have enough laps to feel confident that I know the answer there. For small problems, I'm mostly confident that the answer is yes. For large problems...?
Possibly useful discussions:
Rich Hickey; The Language of the System
Kenneth Truyers; Git as a NoSql Database

What are some use cases for the Application Log (BC-SRV-BAL)

Hello fellow developers,
I recently stumbled upon the Application Log and find it to be quite handy. Now I am wondering, from a best practice perspective, what are some use cases for utilizing the Application Log vs. normal messages / class based exceptions?
Normally the application log is used when the end user does not need to be informed of the information. The application log complements normal messages and class-based exceptions, but does not completely replace them.
Imagine a situation where there is an issue with the data in background processing. If a developer wants to see what data was being processed (after it has been processed), that will be difficult. A developer can therefore write some data to the application log, based on their judgement, wherever there is a possibility of failure.
Normally, this application logging is controlled by some user parameters, as is the granularity of the data that is stored in the application log.
Hope this helps.
The application log comes in handy to
store messages. Interactive messages and exceptions are lost after the user clicks them away. The application log stores that information for longer periods of time.
log background processes. These have no direct means to inform a user because there is no user, only some other process that triggered the batch.
provide additional details. Interactive messages are usually minimized to not spam the user with too many popups. The application log can provide additional aspects and side infos to accompany the main result.
log "undercurrents". If a reuse component is unsure what level of detail its consumer wants, it can write an application log with high level of detail that the consumer later can consume or not, as desired.
It is not appropriate when
you want to process the logged details in an automatic way. Application logs are for display to the end user. Application processing should store or hand over data in a more appropriate format.
you need to process vast amounts of data. Writing the application log is fast, but the database round trips take time, so large numbers of records can slow down the actual application too much.
you need to store sensitive data. Application logs are secured with authorization checks, but still they may not be the appropriate place for really sensitive information.

how would I expose 200k+ records via an API?

What would be the best option for exposing 220k records to third-party applications?
SF style 'bulk API' - independent of the standard API to maintain availability
server-side pagination
a callback to an FTP-generated file?
webhooks?
This bulk transfer will have to happen once a day or so. Any other suggestions welcome!
How are the 220k records being used?
Must serve it all at once
Not ideal for human consumers of this endpoint without special GUI considerations and communication.
A. I think that using a 'bulk API' would be marginally better than reading a file of the same data. (Not 100% sure on this.) Opening and interpreting a file might take a little bit more time than directly accessing data provided in an endpoint's response body.
Can send it in pieces
B. If only a small amount of data is needed at once, then server-side pagination should be used, allowing the consumer to request new batches of data as desired (see the keyset-pagination sketch after this answer). This reduces unnecessary server load by not sending data without it being specifically requested.
C. If all of it needs to be received during a user-session, then find a way to send the consumer partial information along the way. Often users can be temporarily satisfied with partial data while the rest loads, so update the client periodically with information as it arrives. Consider AJAX Long-Polling, HTML5 Server Sent Events (SSE), HTML5 Websockets as described here: What are Long-Polling, Websockets, Server-Sent Events (SSE) and Comet?. Tech stack details and third party requirements will likely limit your options. Make sure to communicate to users that the application is still working on the request until it is finished.
Can send less data
D. If the third party applications only need to show updated records, could a different endpoint be created for exposing this more manageable (hopefully) subset of records?
E. If the end-result is displaying this data in a user-centric application, then maybe a manageable amount of summary data could be sent instead? Are there user-centric applications that show 220k records at once, instead of fetching individual ones (or small batches)?
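A minimal keyset-pagination sketch for option B, assuming Flask and a hypothetical fetch_rows helper (none of these names come from the answer above):

```python
# Keyset pagination: the consumer asks for the next page by passing the last id it saw.
from flask import Flask, jsonify, request

app = Flask(__name__)
PAGE_SIZE = 1000  # illustrative page size

def fetch_rows(after_id: int, limit: int) -> list:
    # Placeholder for: SELECT id, ... FROM records WHERE id > ? ORDER BY id LIMIT ?
    return []

@app.route("/records")
def list_records():
    after_id = request.args.get("after_id", default=0, type=int)
    rows = fetch_rows(after_id, PAGE_SIZE)
    # hand back a "next" link only when a full page came back, i.e. more may exist
    next_link = (
        f"/records?after_id={rows[-1]['id']}" if len(rows) == PAGE_SIZE else None
    )
    return jsonify({"items": rows, "next": next_link})
```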
I would use a streaming API. This is an API that does a "select * from table" and then streams the results to the consumer. You do this using a for loop to fetch and output the records. This way you never use much memory, and as long as you frequently flush the output, the web server will not close the connection and you will support any size of result set.
I know this works because I (shameless plug) wrote mysql-crud-api, which does exactly this.
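In the same spirit, a minimal streaming sketch; Flask and the placeholder row source are assumptions, not taken from mysql-crud-api:

```python
# Fetch rows in a loop and flush them to the client as they are produced,
# so memory stays flat regardless of result-set size.
import json
from flask import Flask, Response

app = Flask(__name__)

def iter_rows():
    # Placeholder for a server-side cursor over "SELECT * FROM table"
    for i in range(220_000):
        yield {"id": i}

@app.route("/export")
def export():
    def generate():
        yield "["
        first = True
        for row in iter_rows():
            if not first:
                yield ","
            yield json.dumps(row)
            first = False
        yield "]"
    return Response(generate(), mimetype="application/json")
```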

File Based Processing versus REST API [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Closed 2 years ago.
We have a requirement where we need to process 10,000 transactions once daily in an offline (non-real-time) mode.
Which of the 2 options is preferable?
A batch file with 10,000 rows sent once a day and processed
or
API calls in small batches (as I presume sending 10K rows at once is not an option).
I was advised by my architects that option 1 is preferable, and that an API would only make sense when batch sizes are small - the disadvantage of 2 being that the caller of the API has to break the payload down into small chunks even though they have all the information available to them at once.
I am keen to see how "2" could be a viable option, so any comments/suggestions to help make the case would be very helpful.
Thanks
Rahul
This is not a full answer. However, I would like to mention one reason in favor of REST API: Validation. This is better managed through the API. Once the file is dropped into an FTP location, it will be your responsibility to validate the format of the file. Will it be easy to route a "bad" file back to its source with a message to explain the bounce back?
With an API call, if the representation coming in does not adhere to a valid schema e.g. XML, json, etc. then your service can respond with a: "400 Bad Request" http status code. This keeps the responsibility of sending data in a valid format with the consumer of the service and helps to achieve a better separation of concerns.
Additional reasoning for a REST API:
Since your file contains transactions, each record should be atomic (if this were not true, e.g. if there were relationships between the records in the file, then those records should not be considered "transactions"). Therefore, chunking the file up into smaller batches should be trivial.
Regardless, you can define a service that accepts transactions in batches and responds with an HTTP status code of "202 Accepted". A 202 code indicates that the request was received and will be processed asynchronously. Therefore, the response can also contain callback links to check the status of individual transactions, or of the batch as a whole. At that point, you would be implementing HATEOAS (Hypermedia as the Engine of Application State) and be in a position to automate the entire process and report on status.
Alternatively, with batch files, if the file passes an upfront format-validation check, then you'll still have to process each transaction individually downstream. Some records may load, others may not. My assumption is that the records that fail to load would still need to be handled, and you may need to provide the users a view of what succeeded vs. failed. Now, this can all be handled outside the REST API. However, the API pattern is simple and elegant IMHO for this purpose.
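A minimal sketch of that "202 Accepted plus status links" idea, assuming Flask; the endpoint paths, field names and in-memory registry are illustrative only:

```python
# Accept a batch of transactions, acknowledge with 202 and return follow-up links.
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
batches = {}  # batch_id -> {"status": ..., "transactions": [...]}; use a real store in production

@app.route("/transactions/batches", methods=["POST"])
def accept_batch():
    payload = request.get_json(silent=True)
    if not isinstance(payload, list):
        # invalid representation: keep responsibility with the consumer (400 Bad Request)
        return jsonify({"error": "expected a JSON array of transactions"}), 400
    batch_id = str(uuid.uuid4())
    batches[batch_id] = {"status": "processing", "transactions": payload}
    # hand the batch to an asynchronous worker / queue here
    return jsonify({
        "batchId": batch_id,
        "links": {  # HATEOAS-style links for following up on the batch
            "self": f"/transactions/batches/{batch_id}",
            "status": f"/transactions/batches/{batch_id}/status",
        },
    }), 202

@app.route("/transactions/batches/<batch_id>/status")
def batch_status(batch_id):
    batch = batches.get(batch_id)
    if batch is None:
        return jsonify({"error": "unknown batch"}), 404
    return jsonify({"batchId": batch_id, "status": batch["status"]})
```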
Using a batch process is always a better idea. You can trigger the batch process using a REST API.
With batch processing you can always send an email with a message such as "improper file format", and you can also report which records were processed and which were not. With REST you cannot keep track of records and transactions.
As mentioned in the comment above, you can use a REST API to trigger a batch process asynchronously and send the status response using HATEOAS.
Spring Batch + Spring REST using Spring Boot
I had the same question, and every answer I found gave the same subjective reasoning. I would like to put down some ideas to compare both concepts:
A batch solution requires more storage than a REST API. You will need to store your results in an intermediate storage area and write them in an open format. Perhaps you can compress them, but then you are trading storage for processing.
A REST API could use more network bandwidth than a batch solution, but only if the intermediate storage is not on a network drive. Fetch requests and query polling could require a lot of network bandwidth, but this can be mitigated with web-hooks or web-sockets.
A REST API is easier to recover automatically than a batch solution. The REST API response code can help you make an automatic decision on how to recover from a failure, and it reduces the number of services required to identify it. If the network is down, an email notification could fail just as the REST API would. A REST API also helps you define a good API for these cases.
A REST API can manage a high number of rows like any other TCP-based protocol (such as FTP), but in case of any failure you will need logic to manage it. That means the REST API will require a chunk-enabled protocol too (see the sketch below). For a batch service, this logic lives in the FTP protocol, but with its own logic, not your business logic.
A batch service does not require reserving an instance all the time (CPU, IP address, port, etc.); it just runs when it is needed. You will need a scheduler to start it, or manual effort, and someone to restart it if it fails. Again, outside of the scheduler, it is not natural to automate.
A batch service requires less security setup on the developer's side: a REST API must take care of authentication and must also consider injection and other attack methods. A REST API could use helper services to prevent all of this, but that means more configuration.
Batch services are easy to deploy. They can run on your machine or on a server, whenever the business needs them. A REST API requires continuous health checks, a deployment strategy to keep it up, care over DNS configuration, etc. Check whether your company gives you all of these services.
If this solution is for your company, check what your company is already doing. Right now there is a common policy of moving to REST APIs, but if your support team does not know about them and has a lot of experience with batch solutions, it could be a good idea not to change.
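To illustrate the chunk-enabled point above, a rough client-side sketch that splits the daily rows into batches and retries each batch; the endpoint URL, batch size and retry policy are assumptions, and the requests library is assumed to be installed:

```python
# Split the daily rows into batches, POST each batch, and retry failures with backoff.
import time
import requests

API_URL = "https://api.example.com/transactions/batches"  # hypothetical endpoint
BATCH_SIZE = 500
MAX_RETRIES = 3

def send_all(rows: list) -> list:
    failed_batches = []
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        for attempt in range(1, MAX_RETRIES + 1):
            resp = requests.post(API_URL, json=batch, timeout=30)
            if resp.status_code in (200, 202):
                break
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying
        else:
            # all retries exhausted: record the batch for later reconciliation
            failed_batches.append({"offset": start, "size": len(batch)})
    return failed_batches
```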