Amazon S3 pre-signed URLs using the AWS Java SDK and extra / characters

I've been creating presigned HTTP PUT URLs and everything was working great until I wanted to start using "folders" in S3; I wanted the key to contain the character '/'.
Now I get a SignatureDoesNotMatch error when I send the HTTP PUT requests, probably because the '/' changes to %2F somewhere along the way... If I escape the character before creating the presigned URL it works great, but then the AWS Management Console doesn't understand it and shows it as one file instead of subfolders.
Any idea?
P.S.
The HTTP PUT requests are sent from C++ using the Poco Net library.
EDIT
I'm using a Poco HTTPRequest from C++ to my Java web server to generate a signed URL (returned in the response).
C++ then uses this URL to PUT a file into S3, again using Poco.
The problem was that the URLs returned from the web server were parsed through Poco URI objects, which auto-decoded the S3 object key and thus changed it. With that in mind I was able to fix my problem.

Tricky - I'll try to approach this bottom up.
Disclaimer: I got carried away visually inspecting the Poco libraries instead of actually debugging a code sample, which should yield more reliable results much faster, see below ;)
Analysis
If I escape the character before creating the presigned URL it works
great, but then the Amazon console management doesn't understand it
and shows it as one file instead of subfolders.
The latter stems from S3 not having a concept of folders on the storage level actually, see e.g. section Index Documents and Folders within Index Document Support:
Objects stored in Amazon S3 are stored within a flat container, i.e.,
an Amazon S3 bucket, and it does not provide any hierarchical
organization, similar to a file system's. However, you can create a
logical hierarchy using object key names and use these names to infer
logical folders that contain these objects.
That's exactly what the AWS Management Console is doing here as well:
The AWS Management Console also supports the concept of folders, by
using the same key naming convention used in the preceding sample.
However, your test regarding the assumption of / being encoded as %2F proves that this is indeed how Poco::Net is encoding the URL when performing the HTTP PUT request.
(I'm actually a bit surprised that the AWS Java SDK seems to generate different URLs here for / vs. %2F, insofar a recent analysis regarding Why is my S3 pre-signed request invalid when I set a response header override that contains a “+”? seems to indicate respective canonicalization by the AWS .NET SDK, see below for more on this.)
Potential Solution
In order for your scenario to work as desired, you'll need to figure out where the URL is encoded this way - I could think of two components in principle:
Poco::Net
Finding out why Poco::Net is encoding the URL differently than S3 expects (if at all, see below) is best done by debugging your code; here's where I'd start:
Class HTTPRequest uses class URI in turn, which automatically performs a few normalizations on all URIs and URI parts passed to it, in particular percent-encoded characters are decoded. The other way round is handled by method encode(), which is where things get interesting and call for a breakpoint, see URI.cpp:
lines 575 ff. - here encode() does its magic, which indeed seems to be in place, insofar neither the code within the function nor the various chars passed in via the reserved parameter contain the offending / (see lines 47 ff. for the respective constants in use)
consequently you might want to set a breakpoint in this function and backtrace the callstack to find out which code is actually doing the encoding upfront, which might not yield an offender at all, see below.
Java => C++ transition
You haven't specified yet which channel is actually used to communicate the pre-signed URL generated by the AWS Java SDK to C++ in turn. Given that the code review (mind you, visual inspection only, I haven't debugged this myself yet) of the Poco::Net functionality yields the conclusion that no obvious offender can be identified in the library itself, it seems more likely that the URL might already enter your C++ layer encoded (easily verified via debugging, of course) - are you by chance using any kind of web service between these components, for example?
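If it helps to narrow this down, comparing the URL at both ends is a quick check. On the Java side, pre-signed URL generation might look roughly like the following sketch (bucket name, key and SDK setup are made up for illustration, assuming a v1 AWS SDK for Java); the '/' in the key is left unencoded in the generated URL, so any %2F showing up later was introduced downstream:

import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import java.net.URL;
import java.util.Date;

public class PresignDebug {
    public static void main(String[] args) {
        // Credentials and region come from the default provider chain in this sketch.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
        // The key contains a plain, unescaped '/'; the SDK signs it as-is.
        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest("my-bucket", "folder/subfolder/file.bin")
                        .withMethod(HttpMethod.PUT)
                        .withExpiration(new Date(System.currentTimeMillis() + 15 * 60 * 1000));
        URL url = s3.generatePresignedUrl(request);
        // Log this exact string, then log what the C++ side receives after parsing
        // the web server's response; any difference shows where the key gets re-encoded.
        System.out.println(url);
    }
}

Logging the same URL again right after the C++ side has parsed the web server's response should reveal whether the change happens in transit or in the Poco URI handling.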
Good luck!

Related

Appropriate REST design for data export

What's the most appropriate way in REST to export something as PDF or other document type?
The next example explains my problem:
I have a resource called Banana.
I created all the canonical CRUD REST endpoints for that resource (i.e. GET /bananas; GET /bananas/{id}; POST /bananas/{id}; ...).
Now I need to create an endpoint which downloads a file (PDF, CSV, ...) containing a representation of all the bananas.
The first thing that came to my mind is GET /bananas/export, but in pure REST, using verbs in URLs should not be allowed. Using a more appropriate HTTP method might be cool, something like EXPORT /bananas, but unfortunately this is not (yet?) possible.
Finally I thought about using the Accept header on the same GET /bananas endpoint, which based on the different media type (application/json, application/pdf, ..) returns the corresponding representation of the data (json, pdf, ..), but I'm not sure if I am misusing the Accept header in this way.
Any ideas?
in pure REST, using verbs in URLs should not be allowed.
REST doesn't care what spelling conventions you use in your resource identifiers.
Example: https://www.merriam-webster.com/dictionary/post
Even though "post" is a verb (and worse, an HTTP method token!) that URI works just like every other resource identifier on the web.
The more interesting question, from a REST perspective, is whether the identifier should be the same that is used in some other context, or different.
REST cares a lot about caching (that's important to making the web "web scale"). In HTTP, caching is primarily about re-using prior responses.
The basic (but incomplete) idea being that we may be able to re-use a response that shares the same target URI.
HTTP also has built into it a general purpose mechanism for invalidating stored responses that is also focused on the target URI.
So here's one part of the riddle you need to think about: when someone sends a POST request to /bananas, should caches throw away the prior responses with the PDF representations?
If the answer is "no", then you need a different target URI. That can be anything that makes sense to you. /pdfs/bananas for example. (How many common path segments are used in the identifiers depends on how much convenience you will realize from relative references and dot segments.)
If the answer is "yes", then you may want to lean into using content negotiation.
In some cases, the answer might be "both" -- which is to say, to have multiple resources (each with its own identifier) that return the same representations.
That's a normal thing to do; we even have a mechanism for describing which resource is "preferred" (see RFC 6596).
REST does not care about this, but the HTTP standard does. Using the Accept header for the expected MIME type is the standard way of doing this, so you did the right thing. There is no need to move it to a separate endpoint if the data is the same and only the format differs.
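For example, a minimal JAX-RS sketch of serving both representations from the same /bananas path might look like this (class, method and helper names are made up; the JSON string and PDF bytes are placeholders):

import java.nio.charset.StandardCharsets;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/bananas")
public class BananaResource {

    // Chosen when the client sends "Accept: application/json".
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Response listAsJson() {
        String json = "[{\"id\":1,\"name\":\"Cavendish\"}]"; // stand-in for real serialization
        return Response.ok(json, MediaType.APPLICATION_JSON).build();
    }

    // Chosen when the client sends "Accept: application/pdf".
    @GET
    @Produces("application/pdf")
    public Response listAsPdf() {
        byte[] pdf = renderPdf(); // placeholder for rendering the same data as PDF
        return Response.ok(pdf, "application/pdf")
                .header("Content-Disposition", "attachment; filename=\"bananas.pdf\"")
                .build();
    }

    private byte[] renderPdf() {
        // A real implementation would use a PDF library (iText, PDFBox, ...).
        return "%PDF-1.4 placeholder".getBytes(StandardCharsets.UTF_8);
    }
}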
Media types are the best way to represent this, but there is a practical aspect to it in that people will browse a REST API using root nouns... I'd put some record-count limits on it, maybe GET /bananas/export/100 to get the first 100, and GET /bananas/export/all if they really want all of them.

Implementing basic S3 compatible API with akka-http

I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use the S3 Java SDK to test my service API and ran into a problem with the putObject(...) method. I can't consume the file properly on my akka-http backend. I wrote a simple route for test purposes:
def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}
It saves the file, but the file is always corrupted. Looking inside the file I found lines like this:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I try to PUT a file with any other REST client it works as expected.
I know S3 uses the "Expect: 100-continue" header, and maybe that is what causes the problems.
I really can't figure out how to deal with that. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal number as a string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
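For illustration only, a decoder for that framing might look like the sketch below (written in Java for brevity, not taken from any SDK); it deliberately skips the per-chunk signature verification that a real S3-compatible endpoint has to perform:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Strips the aws-chunked framing from a request body and returns the raw payload.
// It does NOT verify the per-chunk signatures, which a real service must do.
public class AwsChunkedDecoder {

    public static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        while (true) {
            String header = readLine(in); // e.g. "20000;chunk-signature=73c6b865ab58..."
            int size = Integer.parseInt(header.split(";", 2)[0], 16);
            if (size == 0) { // the final zero-length chunk ends the stream
                break;
            }
            payload.write(in.readNBytes(size)); // the chunk data itself
            readLine(in); // consume the CRLF that follows the chunk data
        }
        return payload.toByteArray();
    }

    // Reads up to the next LF and returns the line without CR/LF.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }
}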
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream(), or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to throw an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers since you appear not to have implemented this in your API, but this would most likely only serve to prevent storing objects uploaded by this method. The fact that this isn't already what happens automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
¹ one of four -- the other three are a standard PUT, a web-based html form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.

Google cloud load balancer custom http header is missing

While using Google Cloud HTTPS Load Balancer we hit the following bug. Couldn't find any information on it.
We have a custom http header in our request:
X-<Company name>-abcde. If we are working directly against the server all is good, but once we are working through the load balancer, our custom header is missing. We didn't find any reference in the documentation that there is a need to whitelist our headers or anything like that.
Why is my custom header not being transferred to my backend server when working through the Google Cloud Load Balancer? And how can I make it work?
Thanks
Data
After a lot of testing, these are the results I've come up with:
The Google Cloud HTTPS Load Balancer does transfer custom HTTP headers to the backend service.
However, it changes them to lower-case.
So, in your case, X-Custom-Header is transformed to x-custom-header.
Solution
Simply change your code to read the lower-case version of your custom HTTP header. This is a simple fix, but one which may not be supported in the long-term by Google (there's not a word on this in Google's documentation so it's subject to change with no notice).
Petition Google to change this idiosyncratic behaviour or at the very least mention it clearly in their documentation.
A little extra
As far as I know, the RFC 2047 which specified the X- prefix for custom HTTP headers and propagated the pseudo-standard of a capital letter for each word has been deprecated and replaced by RFC 6648 which recommends against the X- prefix in general and mentions nothing regarding the rest of the words in the custom HTTP header key name. If I were Google, I would change this behaviour to pass custom HTTP headers as is and let developers deal with the strings as they've set them.
The RFC (RFC 7230) for HTTP/1.1 Message Syntax and Routing says that header fields have case-insensitive field names. If you're relying on case to match the header, that doesn't align with the RFC.
Way back in the day I looked through either the Tomcat or Jetty source, and they worked with everything as a .toLower().
Go has a CanonicalMIMEHeaderKey where it'll format the headers in a common way to be sure everything is on the same page.
Python still harkens back to the RFC822 (hg.python.org/cpython/file/2.7/Lib/rfc822.py#l211) days, but it forces a .lower() on headers to standardize.
Basically though what the GCP HTTP(S) Load Balancer is doing is acceptable as far as the RFC is concerned.
This is most likely an application bug.
As other answers have stated, HTTP header names are case-insensitive. In my experience, every time headers appear to be case-sensitive, it is because there is a request wrapper somewhere in the application call stack.
Request wrappers like this are common (and usually necessary) in Java servlet filters. It's a common newbie mistake to use case-sensitive matching (e.g. a regular Java HashMap<String, T>()) for the header names in the wrapper.
That's where I would start looking for your bug.
A reasonable way to create a Java Map<String, T> that is both case insensitive and that doesn't modify the keys is to use new TreeMap<String, T>( String.CASE_INSENSITIVE_ORDER ).
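As a quick illustration of that last point (a toy example, names made up):

import java.util.Map;
import java.util.TreeMap;

public class HeaderMapExample {
    public static void main(String[] args) {
        // Case-insensitive lookups, while keys keep the exact spelling they were stored with.
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.put("X-Custom-Header", "value");

        System.out.println(headers.get("x-custom-header")); // value
        System.out.println(headers.get("X-CUSTOM-HEADER")); // value
        System.out.println(headers.keySet());               // [X-Custom-Header]
    }
}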

Changing page content with custom scheme in WebKitGTK1

I have an app using the WebKitGTK1 API with WebKit-GTK 2.4.9 on Linux. (This is the current version in Debian Jessie and versions 2.5+ don't support the v1 API.)
I've implemented a custom URI scheme for loading entire basic page content by using a resource-request-starting handler, which parses the incoming URI via webkit_web_resource_get_uri and if it matches the custom scheme, generates some HTML content and calls webkit_network_request_set_uri to replace the original URI with a base64'd data: URI containing the content to render. (This is similar to the accepted answer of this question.)
This mostly works well, and my handler is called on each request (including repeated requests with the same original URI) and generates the correct content -- but somewhere upstream the browser appears to render only the first returned data for any given original URI, even if the data URI I generate is different.
Possibly of note is that webkit_web_resource_get_uri returns the original non-data: URI even after calling webkit_network_request_set_uri, so I assume this URI is being cached, and in turn is then being used as a key in some higher-level component to cache the data instead of using the real URI from the request.
Unfortunately the resource's uri appears to be a G_PARAM_CONSTRUCT_ONLY property, and there doesn't appear to be any public API to set and/or clear it so that it uses the rewritten URI of the request instead. Is there some way to force GTK to set the property after construction anyway? As far as I can tell it does have a setter method internally, and the getter would do the Right Thing™ if the internal property were reset to NULL.
Or is there some better method to force WebKit to render the new data: URI despite anything it thinks to the contrary?
For the moment I've worked around it by including the values that make it generate different data in the original custom URI (passed to webkit_web_view_load_uri or in links in the generated page). This does work but it's a bit ugly, and could be problematic if I forget to add something in the future, or if something changes generation but is not known in advance. It seems a bit silly that it goes to all the trouble of raising the event that generates the correct data, only to throw it away later, (presumably) due to an URI comparison on the wrong URI.
I suppose using a known-unique value (eg. sequential incrementing id) would also work, and resolve some of the unknown-in-advance issues, but that's no less ugly.

Uploading a file via the JAX-RS REST client interface, with a third-party server

I need to invoke a remote REST interface handler and submit a file to it in the request body. Please note that I don't control the server. I cannot change the request to be multipart; the client has to work in accordance with an external specification.
So far I managed to make it work like this (omitting headers etc. for brevity):
byte[] data = readFileCompletely ();
client.target (url).request ().post (Entity.entity (data, "file/mimetype"));
This works, but will fail with huge files that don't fit into memory. And since I have no restriction on filesize, this is a concern.
Question: is it somehow possible to use streams or something similar to avoid reading the whole file into memory?
If possible, I'd prefer to avoid implementation-specific extensions. If not, a solution that works with RESTEasy (on Wildfly) is also acceptable.
RESTEasy as well as Jersey support InputStream out of the box, so simply use Entity.entity(inputStream, "application/octet-stream"); or whatever Content-Type header you want to set.
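For example, a minimal client-side sketch of that (the URL, file path and media type are placeholders):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.Response;

public class StreamingUpload {
    public static void main(String[] args) throws Exception {
        Client client = ClientBuilder.newClient();
        try (InputStream in = Files.newInputStream(Paths.get("/path/to/huge-file.bin"))) {
            Response response = client.target("https://example.com/upload")
                    .request()
                    .post(Entity.entity(in, "application/octet-stream"));
            // Whether the body is actually streamed (chunked) or buffered first
            // depends on the JAX-RS implementation and its configuration.
            System.out.println(response.getStatus());
        } finally {
            client.close();
        }
    }
}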
You can go low-level and construct the HTTP request using a library such as the plain java.net.URLConnection.
I have not tried it myself but there is example code which reads a local file and writes it to the request stream without loading it into a byte array.
Upload files from Java client to a HTTP server
Of course this solution requires more manual coding, but it should work (unless java.net.URLConnection loads the whole file into memory).
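A hedged sketch of that low-level approach, using HttpURLConnection with chunked streaming so the body is never held in memory as a whole (URL and file path are placeholders):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class UrlConnectionUpload {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("https://example.com/upload").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        conn.setChunkedStreamingMode(8192); // stream the body instead of buffering it

        try (InputStream in = Files.newInputStream(Paths.get("/path/to/huge-file.bin"));
             OutputStream out = conn.getOutputStream()) {
            in.transferTo(out); // Java 9+; copy manually in a loop on older JDKs
        }
        System.out.println(conn.getResponseCode());
    }
}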