We have uploaded several zip files to s3. All are in the hundreds of MB range.
When we download the files, typically via a script, both the file size and the file type change. The downloaded file is typically about 300 bytes, and its content is XML.
The content of the files looks similar to this (whitespace added for clarity):
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>NoSuchKey</Code>
  <Message>The specified key does not exist.</Message>
  <Key>gpdb-5.0.0.0/greenplum-db-5.0.0.0-rhel5-x86_64.zip</Key>
  <RequestId>83D2047BDBA195A6</RequestId>
  <HostId>tXKFaiRaNjD26j6fcrTjCk858PGBH2RAjLE1aO4+8hovD6mf+hUzJvCdWKKgrDJGaHXsjWbQP2A=</HostId>
</Error>
Any thoughts as to what might be causing this? It does not happen all of the time. It's somewhat intermittent.
As you will note in the S3 API Reference, this isn't a file -- it's an error message.
Files in S3 are called Objects, and the path + filename of the object is referred to as the object key.
A key is the unique identifier for an object within a bucket. Every object in a bucket has exactly one key. [...] Every object in Amazon S3 can be uniquely addressed through the combination of the web service endpoint, bucket name, key, and optionally, a version. For example, in the URL http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, "doc" is the name of the bucket and "2006-03-01/AmazonS3.wsdl" is the key.
http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#BasicsKeys
This error message, which should have been accompanied by a 404 Not Found error code when you tried to access the object, indicates that there is no object in the bucket whose key (path + filename) is the one shown in the error -- the one you requested. You should be able to confirm its absence in the S3 console.
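For example, if you'd rather confirm it from code than from the console, a HEAD request against the key will tell you; a minimal sketch using the Node.js aws-sdk (the bucket name is a placeholder, the key is the one from the error above):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// HEAD the key; a 404/NotFound error means no object exists under that exact key
s3.headObject({
  Bucket: 'my-bucket',   // placeholder
  Key: 'gpdb-5.0.0.0/greenplum-db-5.0.0.0-rhel5-x86_64.zip'
}, (err, data) => {
  if (err && err.code === 'NotFound') {
    console.log('No object with that key exists in the bucket');
  } else if (err) {
    console.error('Unexpected error:', err);
  } else {
    console.log('Object exists, size in bytes:', data.ContentLength);
  }
});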
If the object should have been uploaded some time in the past, this error means the object wasn't actually uploaded, or has subsequently been deleted.
If the object was very recently uploaded (typically within seconds), you should not get this error, but it is possible for this error to occur under either of two additional conditions:
If you try to check whether an object exists by sending a GET or HEAD request to the bucket, and then upload the object: a short period of time may elapse after the upload before the object is accessible, because of internal optimizations inside S3. When you try to fetch a nonexistent object, S3 may -- for a brief time -- retain an internal notion that the object is not there, even after it has safely been stored. Retry your request (a simple retry sketch follows below).
If you already had an object with the same key, deleted it, and then uploaded a new object with that key: for a brief time after the new upload, you could either get the error above or actually download the old object again.
These conditions are somewhat uncommon, but they can occur, particularly in buckets with a lot of traffic, because of S3's consistency model, which is an engineered tradeoff between performance, reliability, and the immediate availability of newly uploaded objects when the same key has recently been downloaded, requested, deleted, or overwritten.
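If you suspect you are hitting one of these transient conditions, retrying the download after a short delay is usually all that's needed; a rough sketch with the Node.js aws-sdk (bucket name is a placeholder):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// retry a GET a few times with a short delay, to ride out a transient NoSuchKey
async function getObjectWithRetry(bucket, key, attempts = 5, delayMs = 2000) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await s3.getObject({ Bucket: bucket, Key: key }).promise();
    } catch (err) {
      if (err.code !== 'NoSuchKey' || i === attempts - 1) throw err;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

getObjectWithRetry('my-bucket', 'gpdb-5.0.0.0/greenplum-db-5.0.0.0-rhel5-x86_64.zip')
  .then(data => console.log('downloaded', data.ContentLength, 'bytes'))
  .catch(err => console.error('gave up:', err.code));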
The <RequestId> and <HostId> values in the error response are opaque diagnostic codes that you can provide to AWS support if you need to submit a support request about a specific problem you are experiencing with S3... they can use these to find the specific request and identify the problem. They are not considered sensitive information, since they have no meaning outside of AWS.
In this case, there is no apparent need to contact AWS support, because it appears that you are simply trying to download an object that does not exist in the bucket you are downloading from. If you get alternating success and failure for the exact same file, that is unexpected, and a support case might be in order... but typically, an internal error in S3 results in a very different response.
Related
As per this question, and this one, the following piece of code allows me to point a subfolder in an S3 bucket to my domain.
However, in instances where the subdomain is not found, I get the following error message:
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>2CE9B7837081C817</RequestId>
<HostId>
T3p7mzSYztPhXetUu7GHPiCFN6l6mllZgry+qJWYs+GFOKMjScMmRNUpBQdeqtDcPMN3qSYU/Fk=
</HostId>
</Error>
I would not like it to display this error message; instead, in instances like this, I would like to serve from another S3 bucket subdomain (i.e. example-bucket.s3-website.us-east-2.amazonaws.com/error), where the user will be greeted with a fancy error message. So, in a situation where an S3 bucket subfolder is not found, it should fall back to there. How do I accomplish this by changing the Node function below?
'use strict';

// if the end of incoming Host header matches this string,
// strip this part and prepend the remaining characters onto the request path,
// along with a new leading slash (otherwise, the request will be handled
// with an unmodified path, at the root of the bucket)
const remove_suffix = '.example.com';

// provide the correct origin hostname here so that we send the correct
// Host header to the S3 website endpoint
const origin_hostname = 'example-bucket.s3-website.us-east-2.amazonaws.com'; // see comments, below

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;
    const host_header = headers.host[0].value;

    if (host_header.endsWith(remove_suffix))
    {
        // prepend '/' + the subdomain onto the existing request path ("uri")
        request.uri = '/' + host_header.substring(0, host_header.length - remove_suffix.length) + request.uri;
    }

    // fix the host header so that S3 understands the request
    headers.host[0].value = origin_hostname;

    // return control to CloudFront with the modified request
    return callback(null, request);
};
The Lambda@Edge function is an origin request trigger -- it runs after the CloudFront cache is checked and a cache miss has occurred, immediately before the request (as it stands after being modified by the trigger code) is sent to the origin server. By the time the response arrives from the origin, this code has finished and can't be used to modify the response.
There are several solutions, including some that are conceptually valid but extremely inefficient. Still, I'll mention those as well as the cleaner/better solutions, in the interest of thoroughness.
Lambda@Edge has 4 possible trigger points:
viewer-request - when the request first arrives at CloudFront, before the cache is checked; fires for every request.
origin-request - after the request is confirmed to be a cache miss, but before the request is sent to the origin server; only fires on cache misses.
origin-response - after a response (whether success or error) is returned from the origin server, but before the response is potentially stored in the cache and returned to the viewer; if this trigger modifies the response, the modified response will be stored in the CloudFront cache if cacheable, and returned to the viewer; only fires on cache misses.
viewer-response - immediately before the response is returned to the viewer, whether from the origin or cache; fires for every non-error response, unless that response was spontaneously emitted by a viewer-request trigger, or is the result of a custom error document that sets the status code to 200 (a definite anti-pattern, but still possible), or is a CloudFront-generated HTTP to HTTPS redirect.
Any of the trigger points can assume control of the signal flow, generate its own spontaneous response, and thus change what CloudFront would have ordinarily done -- e.g. if you generate a response directly from an origin-request trigger, CloudFront doesn't actually contact the origin... so what you could theoretically do is check S3 in the origin-request trigger to see whether the request will succeed, and generate a custom error response instead. The AWS JavaScript SDK is automatically bundled into the Lambda@Edge environment. While technically legitimate, this is probably a terrible idea in almost any case, since it will increase both costs and latency due to the extra "look-ahead" requests to S3.
Another option is to write a separate origin-response trigger to check for errors and, if one occurs, replace it with a customized response from the trigger code. But this idea also qualifies as non-viable, since that trigger will fire for all responses to cache misses, whether success or failure, increasing costs and latency and wasting time in the majority of cases.
A better idea (cost, performance, ease-of-use) is CloudFront Custom Error Pages, which allows you to define a specific HTML document that CloudFront will use for every error matching the specified code (e.g. 403 for access denied, as in the original question). CloudFront can also change that 403 to a 404 when handling those errors. This requires that you do several things when the source of the error file is a bucket:
create a second CloudFront origin pointing to the bucket
create a new cache behavior that routes exactly that one path (e.g. /shared/errors/not-found.html) for the error file over to the new origin (this means you can't use that path on any of the subdomains -- it will always go directly to the error file any time it's requested)
configure a CloudFront custom error response for code 403 to use the path /shared/errors/not-found.html.
set Error Caching Minimum TTL to 0, at least while testing, to avoid some frustration for yourself. See my write-up on this feature but disregard the part where I said "Leave Customize Error Response set to No".
But... that may or may not be needed, since S3's web hosting feature also includes optional Custom Error Document support. You'll need to create a single HTML file in your original bucket, enable the web site hosting feature on the bucket, and change the CloudFront Origin Domain Name to the bucket's web site hosting endpoint, which is shown in the S3 console and takes the form of ${bucket}.s3-website.${region}.amazonaws.com. In some regions, the hostname may have a dash - rather than a dot . after s3-website for legacy reasons, but the dot format should work in any region.
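If you prefer to script that last part, the website configuration (including the error document) can be applied with the SDK; a minimal sketch with the Node.js aws-sdk, where the bucket name and document keys are placeholders:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// enable S3 website hosting with an index document and a custom error document
s3.putBucketWebsite({
  Bucket: 'example-bucket',                    // placeholder
  WebsiteConfiguration: {
    IndexDocument: { Suffix: 'index.html' },
    ErrorDocument: { Key: 'error.html' }       // returned by the website endpoint for 403/404
  }
}, (err) => {
  if (err) console.error(err);
  else console.log('Website configuration applied');
});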
I almost hesitate to mention one other option that comes to mind, since it's fairly advanced and I fear the description might seem quite convoluted... but you could also do the following, and it would be pretty slick, since it would allow you to potentially generate a custom HTML page for each erroneous URL requested.
Create a CloudFront Origin Group with your main bucket as the primary and a second, empty, "placeholder" bucket as secondary. The only purpose served by the second bucket is to give CloudFront a valid name that it plans to connect to, even though we won't actually connect to it, as will become clear below.
When a request to the primary origin fails with one of the configured error status codes, the secondary origin is contacted. This feature is intended for handling the case when an origin fails, but we can leverage it for our purposes, because before actually contacting the failover origin, the same origin-request trigger fires a second time.
If the primary origin returns an HTTP status code that you’ve configured for failover, the Lambda function is triggered again, when CloudFront re-routes the request to the second origin.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html#concept_origin_groups.lambda
(It would be more accurate to say "...when CloudFront is preparing to re-route the request to the second origin," because the trigger fires first.)
When the trigger fires a second time, the specific reason it fires isn't preserved, but there is a way to identify whether you're running in the first or second invocation: one of these two values will contain the hostname of the origin server CloudFront is preparing to contact:
event.Records[0].cf.request.origin.s3.domainName # S3 rest endpoints
event.Records[0].cf.request.origin.custom.domainName # non-S3 origins and S3 website-hosting endpoints
So we can test the appropriate value (depending on origin type) in the trigger code, looking for the name of the second "placeholder" bucket. If it's there, bypass the current logic and generate the 404 response from inside the Lambda function. This could be dynamic/customized HTML, such as one that includes the requested URI, or one that varies depending on whether / or some other page was requested. As noted above, spontaneously generating a response from an origin-request trigger prevents CloudFront from actually contacting the origin. Generated responses from an origin-request trigger are limited to 1 MB, but that should be more than sufficient for this use case.
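A rough sketch of how the existing trigger could be extended to do this -- the placeholder bucket's hostname here is an assumption you would replace with the name of the secondary origin in your origin group:

'use strict';

const remove_suffix = '.example.com';
const origin_hostname = 'example-bucket.s3-website.us-east-2.amazonaws.com';
// hostname of the empty "placeholder" bucket configured as the secondary origin (assumed name)
const failover_hostname = 'placeholder-bucket.s3-website.us-east-2.amazonaws.com';

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;
    const host_header = headers.host[0].value;

    // on the second invocation, CloudFront is preparing to contact the failover origin;
    // detect that, and emit a custom 404 instead of forwarding the request
    const origin = request.origin.custom || request.origin.s3;
    if (origin.domainName === failover_hostname) {
        // a customized per-URL page could be built here instead of this static body
        return callback(null, {
            status: '404',
            statusDescription: 'Not Found',
            headers: {
                'content-type': [{ key: 'Content-Type', value: 'text/html' }]
            },
            body: '<html><body><h1>404 Not Found</h1></body></html>'
        });
    }

    if (host_header.endsWith(remove_suffix)) {
        // prepend '/' + the subdomain onto the existing request path ("uri")
        request.uri = '/' + host_header.substring(0, host_header.length - remove_suffix.length) + request.uri;
    }

    // fix the host header so that S3 understands the request
    headers.host[0].value = origin_hostname;

    // return control to CloudFront with the modified request
    return callback(null, request);
};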
The multipart upload overview documentation has, in the Multipart Upload Listings section, the following warning:
Note
Only use the returned listing for verification. You should not use the result of this listing when sending a complete multipart upload request. Instead, maintain your own list of the part numbers you specified when uploading parts and the corresponding ETag values that Amazon S3 returns.
Why?
Why I ask: Let's say I want to support resuming an upload that is interrupted. Doing so means knowing what remains to be uploaded, and therefore what already was uploaded. Knowing this is simpler if I may disregard the above warning. S3 is persisting the list of already-uploaded parts. I can obtain it from List Parts.
Whereas if I heed that warning, instead I'd need to intercept break or kill signals and persist the uploaded parts list locally. Although that's feasible, it seems silly to do this if S3 already has the list.
Furthermore, the warning says to use List Parts "only for verification". OK. Let's say I persist my own list, and compare it to List Parts. If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again. Therefore if List Parts is the ultimate authority, why not simply use it in the first place, and use it alone?
If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again.
You're missing the point of the warning.
It's not so much about whether parts were received. It's about whether they were received intact.
When you complete a multipart upload, you have to send a list of the parts and their etags. The etags are the hex md5sum of each part.
The lazy and careless way to complete a multipart upload would be to blindly submit the etags of the parts by just reading them from the "list" operation.
That is what they are warning against.
The correct way is to use your locally-created list, based on what you think S3 should have received -- that is, what you think the etag of each part should be, based on the local file.
If you are resuming an upload that was interrupted, you should go back and compare the parts already uploaded (by re-reading and re-checksumming the parts of the local file) against the checksums S3 has calculated for the parts already stored (as returned by the list operation)... then either resend any incorrect or missing parts, or abandon the upload, because the local file may have changed if one or more parts don't match your local calculation.
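To make that concrete, a resumption check might look roughly like this; a sketch with the Node.js aws-sdk, where the bucket, key, upload ID, and part size are placeholders, the part splitting must match whatever was used for the original upload, and pagination of listParts (which caps at 1000 parts per call) is ignored:

const AWS = require('aws-sdk');
const fs = require('fs');
const crypto = require('crypto');
const s3 = new AWS.S3();

// compute the hex md5 of one part of the local file
function localPartMd5(path, partNumber, partSize) {
  const fd = fs.openSync(path, 'r');
  const buf = Buffer.alloc(partSize);
  const bytesRead = fs.readSync(fd, buf, 0, partSize, (partNumber - 1) * partSize);
  fs.closeSync(fd);
  return crypto.createHash('md5').update(buf.slice(0, bytesRead)).digest('hex');
}

// return the part numbers that still need to be uploaded (or re-uploaded)
async function partsToResend(path, bucket, key, uploadId, partSize, totalParts) {
  const listed = await s3.listParts({ Bucket: bucket, Key: key, UploadId: uploadId }).promise();
  const etags = new Map(listed.Parts.map(p => [p.PartNumber, p.ETag.replace(/"/g, '')]));
  const missing = [];
  for (let n = 1; n <= totalParts; n++) {
    if (etags.get(n) !== localPartMd5(path, n, partSize)) missing.push(n);
  }
  return missing;
}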
Additionally, in the interest of data integrity, you should be sending the base64-encoded MD5 of each part with the individual part upload, in a Content-MD5 header, since this will cause S3 to refuse to accept a part that has been corrupted in any way during the upload.
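For illustration, an individual part upload with that header attached might look like this (same aws-sdk assumptions and placeholder names as the sketch above):

const AWS = require('aws-sdk');
const crypto = require('crypto');
const s3 = new AWS.S3();

// upload one part, letting S3 verify the bytes it received against the md5 we computed locally
async function uploadPartWithMd5(bucket, key, uploadId, partNumber, body) {
  const md5 = crypto.createHash('md5').update(body).digest('base64');
  const result = await s3.uploadPart({
    Bucket: bucket,
    Key: key,
    UploadId: uploadId,
    PartNumber: partNumber,
    Body: body,
    ContentMD5: md5            // S3 rejects the part if the received bytes don't match this digest
  }).promise();
  // keep this pair in your own local list for the eventual complete-multipart-upload request
  return { PartNumber: partNumber, ETag: result.ETag };
}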
We store our log files on S3, and to meet PCI requirements we have to be notified when someone tampers with the log files.
How can I be notified every time a PUT request is placed that replaces an existing object, or when an existing object is deleted? The alert should not fire if a new object is created, unless it replaces an existing one.
S3 does not currently provide overwrite notifications. Deletion notifications were added after the initial launch of the notification feature and can tell you when an object is deleted, but they do not tell you when an object is implicitly deleted by an overwrite.
However, S3 does have functionality to accomplish what you need, in a way that seems superior to what you are contemplating: object versioning and multi-factor authentication for deletion, both discussed here:
http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
With versioning enabled on the bucket, an overwrite of a file doesn't remove the old version of the file. Instead, each version of the file is assigned an opaque identifying string by S3, the version ID.
If someone overwrites a file, you would then have two versions of the same file in the bucket -- the original one and the new one -- so you not only have evidence of tampering, you also have the original file, undisturbed. Any object with more than one version in the bucket has, by definition, been overwritten at some point.
If you also enable Multi-Factor Authentication (MFA) Delete, then none of the versions of any object can be removed without access to the hardware or virtual MFA device.
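Both settings can be applied programmatically; here's a minimal sketch with the Node.js aws-sdk. The bucket name, MFA device ARN, and token are placeholders, and note that MFA Delete can, as far as I'm aware, only be changed with the root account's credentials via the API/CLI/SDK, not from the console:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// turn on versioning; enabling MFA Delete additionally requires the MFA serial and a current token
s3.putBucketVersioning({
  Bucket: 'example-log-bucket',                                         // placeholder
  MFA: 'arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456',  // "serial tokencode" (placeholder)
  VersioningConfiguration: {
    Status: 'Enabled',
    MFADelete: 'Enabled'
  }
}, (err) => {
  if (err) console.error(err);
  else console.log('Versioning and MFA Delete enabled');
});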
As a developer of AWS utilities, tools, and libraries (3rd party; I'm not affiliated with Amazon), I am highly impressed by Amazon's implementation of object versioning in S3, because it works in such a way that client utilities that are unaware of versioning -- or unaware that versioning is enabled on the bucket -- should not be affected in any way. This means you should be able to activate versioning on a bucket without changing anything in your existing code. For example:
fetching an object without an accompanying version id in the request simply fetches the newest version of the object
objects in versioned buckets aren't really deleted unless you explicitly delete a particular version; however, you can still "delete an object," and get the expected response back. Subsequently fetching the "deleted" object without specifying an accompanying version id still returns a 404 Not Found, as in the non-versioned environment, with the addition of an unobtrusive x-amz-delete-marker: header included in the response to indicate that the "latest version" of the object is in fact a delete marker placeholder. The individual versions of the "deleted" object remain accessible to version-aware code, unless purged.
other operations that are unrelated to versioning, which work on non-versioned buckets, continue to work the same way they did before versioning was enabled on the bucket.
But, again... with code that is version-aware, including the AWS console (two new buttons appear when you're looking at a versioned bucket -- you can choose between a versioning-aware console view and a versioning-unaware one), you can iterate through the different versions of an object and fetch any version that has not been permanently removed... and preventing unauthorized removal of objects is exactly the point of MFA Delete.
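For example, version-aware code can enumerate everything that has ever been stored under a prefix, including delete markers; a rough sketch with the Node.js aws-sdk (bucket and prefix are placeholders, pagination omitted):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// list every version and delete marker under a prefix -- handy for spotting overwrites and deletions
s3.listObjectVersions({ Bucket: 'example-log-bucket', Prefix: 'logs/' }, (err, data) => {
  if (err) return console.error(err);
  data.Versions.forEach(v =>
    console.log('version', v.VersionId, v.Key, v.IsLatest ? '(latest)' : '', v.LastModified));
  data.DeleteMarkers.forEach(m =>
    console.log('delete marker', m.VersionId, m.Key, m.LastModified));
});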
Additionally, of course, there's bucket logging, which is typically only delayed by a few minutes from real-time and could be used to detect unusual activity... the history of which would be preserved by the bucket versioning.
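Server access logging is also just a bucket property; a minimal sketch with the Node.js aws-sdk, where both bucket names and the prefix are placeholders and the target bucket must already grant S3's log delivery permission to write to it:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// deliver access logs for the source bucket into a separate log-delivery bucket
s3.putBucketLogging({
  Bucket: 'example-log-bucket',                  // placeholder: the bucket being audited
  BucketLoggingStatus: {
    LoggingEnabled: {
      TargetBucket: 'example-access-log-bucket', // placeholder: where the access logs land
      TargetPrefix: 'access-logs/'
    }
  }
}, (err) => {
  if (err) console.error(err);
  else console.log('Server access logging enabled');
});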
I'm working on an app for a client that will asynchronously receive a request, reply immediately, then go out and fetch a set of large files to perform work on, and finally upload the results to S3 minutes or hours later.
Can we know ahead of time what the eventual url of the file on S3 will be? I'm thinking of creating a hash based on the filename and some other metadata that we know at the incoming request initialization and using that as the name of the S3 file. Is there a predictable pattern of S3 host plus bucket plus file name, or is it something that we don't know until the file upload is complete?
I'm entertaining the idea of returning the eventual S3 filename to the initial request, with the expectation that on the client's end they can periodically check the url for the result. In addition, I'm considering requiring the client to pass a callback url in with the request. The app will then hit that url later with the success/fail status of the work.
Thanks.
The URL of a file uploaded to S3 can be entirely determined by you - it's purely dependent on the bucket and key name. Specifically, it's of the form:
http://s3.amazonaws.com/BUCKETNAME/KEYNAME
(Or certain other formats, depending on the region and addressing style. It's still completely predictable.)
So long as you pick the key name ahead of time, you'll know what the eventual URL will be.
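In other words, if you derive the key up front (for example from a hash of the request metadata, as you suggest), you can hand the URL back immediately and upload to that key later; a trivial sketch in Node.js with placeholder names:

// the object's eventual URL is fully determined by bucket + key
const bucket = 'my-results-bucket';            // placeholder
const jobId = 'abc123';                        // placeholder, e.g. a hash of the request metadata
const key = 'jobs/' + jobId + '/result.json';

// path-style form from the answer above; the virtual-hosted style
// (http://my-results-bucket.s3.amazonaws.com/jobs/...) works as well
const url = 'http://s3.amazonaws.com/' + bucket + '/' + key;
console.log(url);   // return this to the caller now; upload the object to (bucket, key) later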
I've been creating Presigned HTTP PUT URLs and everything was working great until I wanted to start using "folders" in S3; I wanted the key to have the character '/'.
Now I get "Signature doesn't match" when I send the HTTP PUT requests, probably because the '/' changes to %2F... If I escape the character before creating the presigned URL, it works great, but then the Amazon management console doesn't understand it and shows it as one file instead of subfolders.
Any idea?
P.S.
The HTTP PUT requests are sent using C++ with the POCO NET library.
EDIT
I'm using Poco HttpRequest from C++ to my Java web server to generate a signed URL (returned in the response).
C++ then uses this URL to PUT a file into S3, using Poco again.
The problem was that the URLs returned from the web server were parsed through Poco URI objects, which auto-decoded the S3 object key and thus changed it. With that in mind I was able to fix my problem.
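For anyone hitting the same symptom: the key should be signed with a literal '/', and the resulting URL must be passed to the uploader untouched -- decoding and re-encoding it along the way is what breaks the signature. Purely for illustration (my real stack is the AWS Java SDK plus Poco on the C++ side), generating such a URL with the Node.js aws-sdk would look like this, with placeholder bucket and key names:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// sign the key with a literal '/', then hand the URL through without another decode/encode pass
const url = s3.getSignedUrl('putObject', {
  Bucket: 'example-bucket',            // placeholder
  Key: 'folder/subfolder/file.bin',    // the '/' is what the console displays as folders
  Expires: 300                         // seconds
});
console.log(url);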
Tricky - I'll try to approach this bottom up.
Disclaimer: I got carried away visually inspecting the Poco libraries instead of actually debugging a code sample, which would yield more reliable results much faster; see below ;)
Analysis
If I escape the character before creating the presigned URL it works great, but then the Amazon console management doesn't understand it and shows it as one file instead of subfolders.
The latter stems from the fact that S3 actually has no concept of folders at the storage level; see e.g. section Index Documents and Folders within Index Document Support:
Objects stored in Amazon S3 are stored within a flat container, i.e., an Amazon S3 bucket, and it does not provide any hierarchical organization, similar to a file system's. However, you can create a logical hierarchy using object key names and use these names to infer logical folders that contain these objects.
That's exactly what the AWS Management Console is doing here as well:
The AWS Management Console also supports the concept of folders, by using the same key naming convention used in the preceding sample.
However, your test regarding the assumption of / being encoded as %2F proves that this is indeed how Poco::Net is encoding the URL when performing the HTTP PUT request.
(I'm actually a bit surprised that the AWS Java SDK seems to generate different URLs here for / vs. %2F, insofar as a recent analysis regarding Why is my S3 pre-signed request invalid when I set a response header override that contains a "+"? seems to indicate respective canonicalization by the AWS .NET SDK; see below for more on this.)
Potential Solution
In order for your scenario to work as desired, you'll need to figure out where the URL is encoded this way - I could think of two components in principle:
Poco::Net
Finding out why Poco::Net is encoding the URL differently than S3 does (if at all; see below) is best done by debugging your code; here's where I'd start:
Class HTTPRequest uses class URI in turn, which automatically performs a few normalizations on all URIs and URI parts passed to it; in particular, percent-encoded characters are decoded. The other way round is handled by method encode(), which is where things get interesting and call for a breakpoint, see URI.cpp:
lines 575 ff. - here encode() does its magic, which indeed seems to be in place, insofar as neither the code within the function nor the various chars passed in via the reserved parameter contain the offending / (see lines 47 ff. for the respective constants in use)
consequently you might want to set a breakpoint in this function and backtrace the call stack to find out which code is actually doing the encoding upfront -- which might not yield an offender at all; see below.
Java => C++ transition
You haven't specified yet which channel is actually used to communicate the pre-signed URL generated by the AWS Java SDK to C++ in turn. Given that the code review (mind you, visual inspection only; I haven't debugged this myself) of the Poco::Net functionality yields the conclusion that no obvious offender can be identified in the library itself, it seems more likely that the URL might already enter your C++ layer encoded (easily verified via debugging, of course) -- are you by chance using any kind of web service between these components, for example?
Good luck!