How S3 file download works - amazon-s3

To download a file from S3 using the Java SDK, we do the following (note: multipart download is off):
// s3 is an AmazonS3 client, getObjectRequest a GetObjectRequest for the key
S3Object s3Object = s3.getObject(getObjectRequest);
S3ObjectInputStream s3ObjectInputStream = s3Object.getObjectContent();
// ... write to a file from this stream
When we make a getObject call, the SDK issues a GET request for that object.
This call returns just the headers of the response.
Only when we actually start reading from the s3ObjectInputStream do we get the response body.
But all of this is a single REST call.
So I was confused about why the call returned only the headers first.
And how did S3 know when to start sending the response body?
We are making only one call, so how do we notify S3 that we have now started reading from the s3ObjectInputStream?
Where is the actual file stored until we read it from the stream?

S3 starts sending the response body immediately.
You just haven't started reading it from the network.
From the getObject documentation:
Be extremely careful when using this method; the returned Amazon S3 object contains a direct stream of data from the HTTP connection. The underlying HTTP connection cannot be reused until the user finishes reading the data and closes the stream.
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3.html#getObject-com.amazonaws.services.s3.model.GetObjectRequest-
A small amount has been buffered, but the object isn't being stored anywhere. The network connection is stalled.
If you were to start a request and wait long enough before reading it, S3 would eventually detect the connection as stalled, give up, and close the connection.
In practice, it's easy to separate the HTTP headers from the body in a stream, because the boundary between them is always exactly \r\n\r\n. This 4-byte sequence is invalid within the headers and mandatory after them, so the SDK simply stops extracting headers at that point in the response from S3, then builds and returns the response object, from which you can read the body as it arrives from the network.
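To make this concrete, here is a minimal Python sketch of the same mechanics against a plain-HTTP server (example.com is a stand-in for the S3 endpoint, which really uses HTTPS, but the framing is the same; the SDK does the equivalent parsing for you):

import socket

# Open a connection and send a GET request by hand.
sock = socket.create_connection(("example.com", 80))
sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")

# Read until the 4-byte header/body boundary appears.
buf = b""
while b"\r\n\r\n" not in buf:
    buf += sock.recv(1024)

headers, _, leftover = buf.partition(b"\r\n\r\n")
print(headers.decode("latin-1"))  # what getObject has parsed when it returns

# The body is consumed only as we recv() it; unread bytes sit in the kernel's
# receive buffer, and beyond that the server's sending simply stalls.
total = len(leftover)
while chunk := sock.recv(4096):
    total += len(chunk)  # the SDK would hand these bytes to your stream reads
sock.close()
print(total, "body bytes read")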

Related

How to upload an image over a socket with Ktor?

I want to send an image over a socket.
How can I do that with Ktor?
AFAIK, you cannot do an HTTP-multipart-style upload over websockets directly.
There are a couple of things you can try.
First, convert the image into a Base64 string or byte array on the client and send it to the websocket server.
One thing you may need to handle: if you use a byte array, you may also have to deal with the file headers contained in it.
If you read the image file from disk, you can get the bytes like this and pass them to the websocket:
// read the whole image file into a ByteArray; `use` closes the stream
val arr = File(path).inputStream().use { it.readBytes() }
The downside of doing this is that it may not work as expected for larger images, and if your websocket broadcasts to multiple listening clients, shipping the file bytes to each of them adds significant load and delay.
The better approach is to upload the image to your server with a normal HTTP multipart upload (no socket involved) and then send the image URL over the socket. The URL is delivered to the clients listening on that socket, and each client loads the image data only when it is actually needed.
If you send a big image as a byte array over the websocket, that websocket message will be far larger than one carrying just the image URL.
The recommended approach is the second one in most cases, apart from some specific use cases.
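As a sense check of why raw-byte payloads get heavy, note that Base64 encoding alone inflates the data by about a third (a quick Python check; the zero-filled buffer is just a stand-in for a ~1 MB image):

import base64

raw = b"\x00" * 1_000_000       # stand-in for a ~1 MB image
encoded = base64.b64encode(raw)
print(len(raw), len(encoded))   # 1000000 1333336 -> roughly 33% larger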

Requests - How to upload a large file in chunks with requests?

I have to upload large files (~5 GB). I am dividing each file into small chunks (10 MB), since I can't send all the data (5+ GB) at once (the API I am calling fails for payloads larger than 5 GB in one request). The API I am uploading to also specifies that a minimum of 10 MB of data must be sent per request. I did use read(10485760) and sent it via requests, which works fine.
However, I do not want to read the whole 10 MB into memory, and if I leverage multithreading in my script, each thread reading 10 MB would cost me too much memory.
Is there a way I can send a total of 10 MB per request to the API but read only 4096/8192 bytes at a time, transferring until I reach 10 MB, so that I do not overuse memory?
Please note I cannot pass the file object to requests directly: that would use less memory, but I would not be able to break the upload at 10 MB, and the entire 5 GB would go into one request, which I do not want.
Is there any way to do this via requests? I see httplib has it (https://github.com/python/cpython/blob/3.9/Lib/http/client.py): I could call send(fh.read(4096)) there in a loop until I complete 10 MB, finishing one 10 MB request without heavy memory usage.
This is what the requests documentation says:
In the event you are posting a very large file as a multipart/form-data request, you may want to stream the request. By default, requests does not support this, but there is a separate package which does - requests-toolbelt. You should read the toolbelt’s documentation for more details about how to use it.
So try streaming the upload first; if that doesn't work for your needs, go for requests-toolbelt.
To stream an upload, pass a file-like object or a generator as the data argument of the post or put call; requests then reads it piece by piece instead of loading the whole body into memory. (Note that stream=True only affects how the response is downloaded, not the upload.)
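A generator body lets you cap each request at 10 MB while holding only a few KB in memory at a time. A sketch, with api.example.com as a placeholder endpoint (a generator body is sent with chunked transfer encoding, which the target API must accept):

from itertools import chain
import requests

PART_SIZE = 10 * 1024 * 1024   # 10 MB per request, as the API requires
READ_SIZE = 8192               # bytes held in memory at any one time

def iter_part(f, limit):
    # Yield READ_SIZE-byte pieces of f until `limit` bytes have been produced.
    remaining = limit
    while remaining > 0:
        piece = f.read(min(READ_SIZE, remaining))
        if not piece:
            break              # end of file
        remaining -= len(piece)
        yield piece

with open("bigfile.bin", "rb") as f:
    while True:
        first = f.read(READ_SIZE)
        if not first:
            break              # whole file has been sent
        # This request's body streams 8 KB at a time, stopping at the 10 MB mark.
        body = chain([first], iter_part(f, PART_SIZE - len(first)))
        requests.post("https://api.example.com/upload", data=body)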

Logic App HTTP action Bad request due to max buffer size

Real quickly: I am trying to complete an HTTP action in an Azure Logic App that sends out a GET request and returns a CSV file as the response body. The issue is that when I run it, I get "BadRequest. Http request failed as there is an error: 'Cannot write more bytes to the buffer than the configured maximum buffer size: 104857600.'". I am not sure how to mitigate this buffer limit or whether I can increase it. I could use some help; I really need this CSV file returned so I can get it into blob storage.
Please try this way:
1. In the HTTP action's upper-right corner, choose the ellipsis button (...), and then choose Settings.
2. Under Content Transfer, set Allow chunking to On.
You can refer to "Handle large messages with chunking in Azure Logic Apps".

S3 to Redshift with NiFi

I have been reading for a while about how to load my S3 data into Redshift: the COPY command, Glue, etc.
My pipeline is almost entirely in NiFi, something like:
extract_data -> insert to S3 -> execute a Lambda process that transforms the data or enriches it using Athena, in 2 or 3 stages, into another S3 bucket (let's call it the processed bucket).
Now I want to continue this pipeline, loading the data from the processed bucket and inserting it into Redshift; I have an empty table created for this.
The idea is to load some tables incrementally, and for others to delete all the data loaded that day and reload it.
Can anyone give me a hint of where to start?
Thank you!
When data lands in your "processed bucket", you can fire a Lambda function that triggers a flow in Apache NiFi by calling an HTTP webhook. To expose such a webhook, you use one of the following processors:
ListenHTTP
Starts an HTTP Server and listens on a given base path to transform incoming requests into FlowFiles. The default URI of the Service will be http://{hostname}:{port}/contentListener. Only HEAD and POST requests are supported. GET, PUT, and DELETE will result in an error and the HTTP response status code 405.
HandleHttpRequest
Starts an HTTP Server and listens for HTTP Requests. For each request, creates a FlowFile and transfers to 'success'. This Processor is designed to be used in conjunction with the HandleHttpResponse Processor in order to create a Web Service.
So the flow would be ListenHTTP -> FetchS3Object -> Process -> PutSQL (with a Redshift connection pool). The Lambda function would POST to my-nifi-instance.com:PORT/my-webhook (ListenHTTP rejects GET with a 405, per the documentation above), so that ListenHTTP creates a FlowFile for the incoming request.
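A minimal sketch of such a Lambda handler in Python, using only the standard library (the NiFi host, port, and base path here are placeholders for your own configuration):

import json
import urllib.request

# Placeholder address of the ListenHTTP processor's endpoint.
NIFI_WEBHOOK = "http://my-nifi-instance.com:8085/my-webhook"

def lambda_handler(event, context):
    # Forward the S3 event so the NiFi flow knows which object just landed.
    body = json.dumps(event).encode("utf-8")
    req = urllib.request.Request(
        NIFI_WEBHOOK,
        data=body,  # supplying data makes this a POST, which ListenHTTP accepts
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return {"status": resp.status}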

Is there a way to add a header to an Apache response saying how long it took to retrieve a resource?

Is there a module or a built-in function in Apache which I can use/activate to send information about how long it took to retrieve/process a resource?
For example, when the resource http://dom.net/resource is accessed, the response headers would include the total time spent waiting for the resource to be ready before it was sent back to the client.
Apache doesn't really 'wait' until the resource is ready before sending the response back to you; it streams data back to the client as and when it receives it.
Depending on what you're interested in measuring, you could record the time taken for the client to receive the first byte or last byte back from Apache, or measure the time taken for Apache to receive the first byte from the (remote?) resource. The time taken for Apache to receive the entire response back from the remote resource is not something you can send in the headers, as the headers will have been sent to the client before the remote response is fully received. That information can, however, trivially be written to the Apache logs.
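A small Python sketch of the client-side measurements suggested above, using the asker's example URL (time to first body byte and time to last byte, as seen by the client):

import time
import requests

start = time.monotonic()
with requests.get("http://dom.net/resource", stream=True) as resp:
    # stream=True defers the body, so the first iter_content() chunk marks
    # the first body bytes arriving from Apache.
    next(resp.iter_content(8192), b"")
    ttfb = time.monotonic() - start
    for _ in resp.iter_content(8192):
        pass                          # drain the rest of the response
    ttlb = time.monotonic() - start
print(f"first byte after {ttfb:.3f}s, last byte after {ttlb:.3f}s")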