Is there "S3 range read function" that allows to read assigned byte range from AWS-S3 file? - amazon-s3

I'm trying to process a large file in AWS Lambda, and skipping through the whole file seems a bit wasteful.
Is there a "range read" function that allows reading only a predefined byte range from an S3 file?

Yes, this is possible. According to the S3 REST API documentation for GET Object, it supports use of the HTTP Range header.
Range
Downloads the specified range bytes of an object. For more information about the HTTP Range header, go to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
In the example syntax:
GET /ObjectName HTTP/1.1
Host: BucketName.s3.amazonaws.com
Date: date
Authorization: authorization string (see Authenticating Requests (AWS Signature Version 4))
Range:bytes=byte_range
Popular S3 client libraries, such as the AWS SDK for Java, provide convenient client-side APIs for specifying the range information.
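For example, with the AWS SDK for Java (v1) the range is set directly on the request object. A minimal sketch, where the bucket name, key, and byte offsets are placeholders:
import java.io.InputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class RangeReadExample {
  public static void main(String[] args) throws Exception {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Fetch only bytes 1000..1999 of the object (the range is inclusive on both ends).
    GetObjectRequest request = new GetObjectRequest("my-bucket", "my-key").withRange(1000, 1999);

    try (S3Object object = s3.getObject(request);
         InputStream in = object.getObjectContent()) {
      byte[] buf = new byte[1024];
      int n, total = 0;
      while ((n = in.read(buf)) != -1) {
        total += n;
      }
      System.out.println("Read " + total + " bytes from the requested range");
    }
  }
}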

Related

Implementing basic S3 compatible API with akka-http

I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use the S3 Java SDK to test my service API and ran into a problem with the putObject(...) method: I can't consume the file properly on my akka-http backend. I wrote a simple route for test purposes:
def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}
It saves the file, but the file is always corrupted. Looking inside the file, I found lines like this:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I try to PUT a file with any other REST client, it works fine, as expected.
I know S3 uses the "Expect: 100-continue" header, and maybe that is what causes the problem.
I really can't figure out how to deal with this. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal number as a string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream(), or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to return an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers, since you appear not to have implemented this in your API; but that would most likely only serve to prevent storing objects uploaded by this method. The fact that this isn't already what happens automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
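For illustration only, here is a rough Java sketch of stripping that aws-chunked framing to recover the payload bytes. It skips per-chunk signature verification entirely, which a real S3-compatible endpoint must not do, and the class and method names are made up for the sketch:
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AwsChunkedDecoder {
  // Strips the "size;chunk-signature=..." framing and returns the raw payload.
  public static byte[] decode(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while (true) {
      String header = readLine(data);                 // e.g. "20000;chunk-signature=73c6..."
      int semi = header.indexOf(';');
      int size = Integer.parseInt(semi < 0 ? header : header.substring(0, semi), 16);
      if (size == 0) break;                           // the final zero-length chunk ends the stream
      byte[] chunk = new byte[size];
      data.readFully(chunk);
      out.write(chunk);
      readLine(data);                                 // consume the \r\n that follows the chunk data
    }
    return out.toByteArray();
  }

  // Reads a CRLF-terminated ASCII line (enough for the chunk headers).
  private static String readLine(DataInputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = in.read()) != -1 && b != '\n') {
      if (b != '\r') sb.append((char) b);
    }
    return sb.toString();
  }
}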
¹ one of four -- the other three are a standard PUT, a web-based HTML form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.

is there a size limit to individual fields in HTTP POST?

I have an API for a file upload that expects a multipart form submission. But I have a customer writing a client whose system can't properly generate a multipart/form-data request. He's asking that I modify my API to accept the file in an application/x-www-form-urlencoded request, with the filename in one key/value pair and the contents of the file, base64 encoded, in another key/value pair.
In principle I can easily do this (tho I need a shower afterwards), but I'm worried about size limits. The files we expect in Production will be fairly large: 5-10MB, sometimes up to 20MB. I can't find anything that tells me about length limitations on individual key/value pair data inside a form POST, either in specs (I've looked at, among others, the HTTP spec and the Forms spec) or in a specific implementation (my API runs on a Java application server, Jetty, with an Apache HTTP server in front of it).
What is the technical and practical limit for an individual value in a key/value pair in a form POST?
There are artificial limits, configurable on Jetty's HttpConfiguration class, both for the maximum number of form keys and for the maximum size of the request body content.
In practical terms, this is a really bad idea.
You'll have a String, which uses 2 bytes per character, holding the Base64 data.
And you have the typical 33% overhead simply from being Base64.
They'll also have to UTF-8 URL-encode the Base64 string to deal with various special characters (such as "+", which has meaning in Base64 but is a space " " in URL-encoded form, so they'll need to encode that "+" as "%2B").
So for a 20MB file you'll have ...
20,971,520 bytes of raw data, represented as 27,892,122 characters in raw Base64, using (on average) 29,286,728 characters when urlencoded, which will use 58,573,455 bytes of memory in its String form.
The decoding process on Jetty will take the incoming raw urlencoded bytes and allocate 2x that size in a String before decoding the urlencoded form. So that's a 58,573,456 length java.lang.String (that uses 117,146,912 bytes of heap memory for the String, and don't forget the 29MB of bytebuffer data being held too!) just to decode that Base64 binary file as a value in a x-www-form-urlencoded String form.
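For a ballpark check, here is a small, self-contained Java sketch that Base64-encodes and then URL-encodes 20 MB of random bytes and prints the resulting sizes; the exact URL-encoded figure depends on the data, and the figures above use a flat 33% approximation:
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Random;

public class FormPostOverhead {
  public static void main(String[] args) throws Exception {
    byte[] raw = new byte[20 * 1024 * 1024];          // 20 MB of pretend file data
    new Random(42).nextBytes(raw);

    String base64 = Base64.getEncoder().encodeToString(raw);
    String urlEncoded = URLEncoder.encode(base64, StandardCharsets.UTF_8.name());

    System.out.println("raw bytes:           " + raw.length);
    System.out.println("Base64 chars:        " + base64.length());
    System.out.println("URL-encoded chars:   " + urlEncoded.length());
    // A Java String historically stores 2 bytes per char, so the in-memory
    // footprint of the URL-encoded value is roughly:
    System.out.println("approx String bytes: " + (2L * urlEncoded.length()));
  }
}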
I would push back and force them to use multipart/form-data properly. There are tons of good libraries that generate that form data correctly.
If they are using Java, tell them to use the httpmime library from the Apache HttpComponents project (they don't have to have/use/install Apache HttpClient to use httpmime; it's a standalone library).
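A rough client-side sketch with httpmime, sending the entity over a plain HttpURLConnection so Apache HttpClient itself isn't needed; the endpoint URL, field names, and file are placeholders:
import java.io.File;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.http.HttpEntity;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;

public class MultipartUploadSketch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("https://api.example.com/upload");   // hypothetical endpoint
    File file = new File("report.pdf");                    // hypothetical file

    HttpEntity entity = MultipartEntityBuilder.create()
        .addTextBody("filename", file.getName())
        .addBinaryBody("file", file, ContentType.APPLICATION_OCTET_STREAM, file.getName())
        .build();

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    // The builder generates the multipart boundary, so take Content-Type from the entity.
    conn.setRequestProperty("Content-Type", entity.getContentType().getValue());

    try (OutputStream out = conn.getOutputStream()) {
      entity.writeTo(out);                                  // streams the multipart body
    }
    System.out.println("HTTP status: " + conn.getResponseCode());
  }
}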
Alternative Approach
There's nothing saying you have to use application/x-www-form-urlencoded or multipart/form-data.
Offer a raw upload option via application/octet-stream
They use POST, and MUST include the following valid request headers ...
Connection: close
Content-Type: application/octet-stream
Content-Length: <whatever_size_the_content_is>
Connection: close is used to indicate when the HTTP exchange is complete.
Content-Type: application/octet-stream means Jetty will not process that content as request parameters and will not apply charset translations to it.
Content-Length is required to ensure that the entire file is sent/received.
Then just stream the raw binary bytes to you.
This is just for the file contents; if you have other information that needs to be passed in (such as the filename), consider using either query parameters or a custom request header (e.g. X-Filename: secretsauce.doc) for that.
On your servlet, you just use HttpServletRequest.getInputStream() to obtain those bytes, and you use the Content-Length value to verify that you received the entire file.
Optionally, you can make them provide a SHA1 hash in the request headers, like X-Sha1Sum: bed0213d7b167aa9c1734a236f798659395e4e19 which you then use on your side to verify that the entire file was sent/received properly.
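A minimal sketch of the receiving side, assuming a javax.servlet container and the optional X-Sha1Sum header suggested above; where the bytes actually get written is left out:
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RawUploadServlet extends HttpServlet {
  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    long expected = req.getContentLengthLong();        // from the Content-Length header
    String claimedSha1 = req.getHeader("X-Sha1Sum");   // optional client-provided digest

    MessageDigest sha1;
    try {
      sha1 = MessageDigest.getInstance("SHA-1");
    } catch (NoSuchAlgorithmException e) {
      throw new IOException(e);
    }

    long received = 0;
    byte[] buf = new byte[8192];
    try (InputStream in = req.getInputStream()) {
      int n;
      while ((n = in.read(buf)) != -1) {
        sha1.update(buf, 0, n);
        received += n;
        // A real servlet would also write buf[0..n) to disk or to S3 here.
      }
    }

    if (expected >= 0 && received != expected) {
      resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "Incomplete upload");
      return;
    }
    String actualSha1 = toHex(sha1.digest());
    if (claimedSha1 != null && !claimedSha1.equalsIgnoreCase(actualSha1)) {
      resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "SHA-1 mismatch");
      return;
    }
    resp.setStatus(HttpServletResponse.SC_OK);
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) sb.append(String.format("%02x", b));
    return sb.toString();
  }
}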

WinHTTP decompression function? [duplicate]

Presently I am using the WinHTTP API in C++ to get data from a server. The server can support various compression formats, so I want to use WinHTTP to get a compressed format (e.g. gzip) and decompress it. Is it possible to decompress the data using WinHTTP?
Sure, it's possible.
From here:
To set the decoding option, the application calls InternetSetOption with the handle returned from InternetOpen, InternetConnect, or HttpOpenRequest. The INTERNET_OPTION_HTTP_DECODING option is specified in the dwOption parameter, and the lpBuffer parameter points to a boolean variable set to true. To disable decoding, the application calls InternetSetOption with the INTERNET_OPTION_HTTP_DECODING option and the boolean variable set to false.
So HTTP compression is transparent to user code: you just need one call to InternetSetOption, and compressed responses will be decoded for you.

md5 checksums when uploading to file picker

Background
I'm working on integrating an existing app with File Picker. In our existing setup we rely on MD5 checksums to ensure data integrity. As far as I can see, File Picker does not provide any MD5 when responding to an upload against the REST API (nor when using the JavaScript client).
S3 storage, md5 and data integrity
We are using S3 for storage, and as far as I know you may provide S3 with an MD5 checksum when storing files, so that Amazon can verify the upload and reject the storing request if the data seems to be wrong.
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
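(This is the raw S3 mechanism rather than anything File Picker exposes.) With the AWS SDK for Java v1, setting Content-MD5 looks roughly like the sketch below; the bucket and key are placeholders:
import java.io.ByteArrayInputStream;
import java.security.MessageDigest;
import java.util.Base64;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class PutWithMd5 {
  public static void main(String[] args) throws Exception {
    byte[] data = "hello world".getBytes("UTF-8");

    // Compute the MD5 of the payload and Base64-encode it, as the Content-MD5 header expects.
    String contentMd5 = Base64.getEncoder().encodeToString(
        MessageDigest.getInstance("MD5").digest(data));

    ObjectMetadata meta = new ObjectMetadata();
    meta.setContentLength(data.length);
    meta.setContentMD5(contentMd5);   // S3 rejects the PUT if the payload doesn't match

    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    s3.putObject("my-bucket", "my-key", new ByteArrayInputStream(data), meta);
  }
}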
I have investigated the ETag header that Amazon returns, and found that it isn't clear what is actually returned as the ETag. The Java documentation states:
Gets the hex encoded 128-bit MD5 hash of this object's contents as computed by Amazon S3.
The Ruby documentation states:
Generally the ETAG is the MD5 of the object. If the object was uploaded using multipart upload then this is the MD5 of all of the upload-part-md5s
Another place in their documentation I found this:
The entity tag is a hash of the object. The ETag only reflects changes to the contents of an object, not its metadata. The ETag is determined when an object is created. For objects created by the PUT Object operation and the POST Object operation, the ETag is a quoted, 32-digit hexadecimal string representing the MD5 digest of the object data. For other objects, the ETag may or may not be an MD5 digest of the object data. If the ETag is not an MD5 digest of the object data, it will contain one or more non-hexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.
This seems to describe how the ETag is actually calculated on S3, and this Stack Overflow post seems to imply the same thing: the ETag cannot be trusted to always equal the file's MD5.
So, here are my questions:
1. In general, how does File Picker store files to S3? Are multipart POST requests used?
2. I see that when I do a HEAD request against, for example, https://www.filepicker.io/api/file/<file handle>, I do get an ETag header back. The ETag I get back does indeed match the MD5 of the file I uploaded. Are the returned headers more or less taken from S3 directly? Or is this actually an MD5 calculated by File Picker which I can trust?
3. Is it possible to have an explicit statement of the MD5 returned to clients of File Picker's API? For instance, when we POST a file we get a JSON structure back including the URL to the file and its size. Could the MD5 be included there?
4. Is it possible to provide File Picker with an MD5 which in turn will be used when posting files to S3, so we can get an end-to-end check on files?
1. Yes, we use the Python boto library, to be specific.
2. The ETag is pulled from S3.
3 & 4. It's been considered and is in our backlog, but hasn't been implemented yet.

I need Multi-Part DOWNLOADS from Amazon S3 for huge files

I know Amazon S3 added multi-part upload for huge files. That's great. What I also need is similar functionality on the client side, for customers who get part way through downloading a gigabyte-plus file and have errors.
I realize browsers have some level of retry and resume built in, but when you're talking about huge files I'd like to be able to pick up where they left off regardless of the type of error.
Any ideas?
Thanks,
Brian
S3 supports the standard HTTP "Range" header if you want to build your own solution.
S3 Getting Objects
I use aria2c. For private content, you can use "GetPreSignedUrlRequest" to generate temporary private URLs that you can pass to aria2c.
S3 has a feature called byte-range fetches. It's kind of the download complement to multipart upload:
Using the Range HTTP header in a GET Object request, you can fetch a byte-range from an object, transferring only the specified portion. You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request. Fetching smaller ranges of a large object also allows your application to improve retry times when requests are interrupted. For more information, see Getting Objects.
Typical sizes for byte-range requests are 8 MB or 16 MB. If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance. GET requests can directly address individual parts; for example, GET ?partNumber=N.
Source: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
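A sketch of that approach with the AWS SDK for Java v1, using a few concurrent ranged GETs and writing each range at the right offset of a local file; the bucket, key, output path, thread count, and part size are placeholders:
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class RangedDownload {
  public static void main(String[] args) throws Exception {
    String bucket = "my-bucket", key = "big-file.bin";
    long partSize = 16L * 1024 * 1024;                  // 16 MB ranges, per the advice above

    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    long size = s3.getObjectMetadata(bucket, key).getContentLength();

    RandomAccessFile out = new RandomAccessFile("out.bin", "rw");
    out.setLength(size);

    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<?>> futures = new ArrayList<>();
    for (long start = 0; start < size; start += partSize) {
      final long from = start, to = Math.min(start + partSize, size) - 1;   // Range is inclusive
      futures.add(pool.submit((Callable<Void>) () -> {
        S3Object part = s3.getObject(new GetObjectRequest(bucket, key).withRange(from, to));
        byte[] buf = new byte[64 * 1024];
        long pos = from;
        try (InputStream in = part.getObjectContent()) {
          int n;
          while ((n = in.read(buf)) != -1) {
            synchronized (out) {                        // RandomAccessFile isn't thread-safe
              out.seek(pos);
              out.write(buf, 0, n);
            }
            pos += n;
          }
        }
        return null;
      }));
    }
    for (Future<?> f : futures) f.get();                // a failed range can simply be re-fetched
    pool.shutdown();
    out.close();
  }
}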
Just updating for the current situation: S3 natively supports multipart GET as well as PUT. https://youtu.be/uXHw0Xae2ww?t=1459
NOTE: For Ruby users only
Try the aws-sdk gem for Ruby, and download with:
object = Aws::S3::Object.new(...)
object.download_file('path/to/file.rb')
because it downloads large files with multipart by default: files larger than 5 MB are downloaded using the multipart method.
http://docs.aws.amazon.com/sdkforruby/api/Aws/S3/Object.html#download_file-instance_method