Get the eventual url for a file uploaded to S3 - amazon-s3

Working on an app for a client that will asynchronously receive a request, reply immediately, then go out and fetch a set of large files to perform work on, and finally upload the results to S3 minutes or hours later.
Can we know ahead of time what the eventual url of the file on S3 will be? I'm thinking of creating a hash based on the filename and some other metadata that we know at the incoming request initialization and using that as the name of the S3 file. Is there a predictable pattern of S3 host plus bucket plus file name, or is it something that we don't know until the file upload is complete?
I'm entertaining the idea of returning the eventual S3 filename to the initial request, with the expectation that on the client's end they can periodically check the url for the result. In addition, I'm considering requiring the client to pass a callback url in with the request. The app will then hit that url later with the success/fail status of the work.
Thanks.

The URL of a file uploaded to S3 can be entirely determined by you - it's purely dependent on the bucket and key name. Specifically, it's of the form:
http://s3.amazonaws.com/BUCKETNAME/KEYNAME
(Or one of a few other formats, depending on the region and whether you use path-style or virtual-hosted-style addressing. It's still completely predictable.)
So long as you pick the key name ahead of time, you'll know what the eventual URL will be.
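
For example, here is a minimal sketch of deriving the key (and therefore the URL) at request time. The hashing scheme, key layout, and bucket name are assumptions for illustration, not anything S3 prescribes:

// build the eventual object URL before the upload ever happens,
// from data already known when the incoming request arrives
const crypto = require('crypto'); // Node's built-in hashing module

function eventualS3Url(bucket, filename, metadata) {
    // derive a predictable key from the filename plus known metadata
    const hash = crypto.createHash('sha256')
        .update(filename + JSON.stringify(metadata))
        .digest('hex');
    const key = 'results/' + hash + '/' + filename; // example key layout (assumption)

    // path-style form; the virtual-hosted form https://BUCKETNAME.s3.amazonaws.com/KEYNAME
    // is equally predictable
    return 'https://s3.amazonaws.com/' + bucket + '/' + key;
}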

Related

Lambda script to direct to fallback S3 domain subfolder when not found

As per this question, and this one, the following piece of code allows me to point a subfolder in an S3 bucket to my domain.
However, in instances where the subdomain is not found, I get the following error message:
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>2CE9B7837081C817</RequestId>
<HostId>
T3p7mzSYztPhXetUu7GHPiCFN6l6mllZgry+qJWYs+GFOKMjScMmRNUpBQdeqtDcPMN3qSYU/Fk=
</HostId>
</Error>
I would not like it to display this error message. Instead, in instances like this, I would like to serve from another S3 bucket path (e.g. example-bucket.s3-website.us-east-2.amazonaws.com/error), where the user will be greeted with a fancy error message. So in a situation where an S3 bucket subfolder is not found, it should fall back to there. How do I accomplish this by changing the Node function below?
'use strict';
// if the end of incoming Host header matches this string,
// strip this part and prepend the remaining characters onto the request path,
// along with a new leading slash (otherwise, the request will be handled
// with an unmodified path, at the root of the bucket)
const remove_suffix = '.example.com';
// provide the correct origin hostname here so that we send the correct
// Host header to the S3 website endpoint
const origin_hostname = 'example-bucket.s3-website.us-east-2.amazonaws.com'; // see comments, below
exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;
    const host_header = headers.host[0].value;

    if (host_header.endsWith(remove_suffix)) {
        // prepend '/' + the subdomain onto the existing request path ("uri")
        request.uri = '/' + host_header.substring(0, host_header.length - remove_suffix.length) + request.uri;
    }

    // fix the host header so that S3 understands the request
    headers.host[0].value = origin_hostname;

    // return control to CloudFront with the modified request
    return callback(null, request);
};
The Lambda@Edge function is an origin request trigger -- it runs after the CloudFront cache is checked and a cache miss has occurred, immediately before the request (as it stands after being modified by the trigger code) is sent to the origin server. By the time the response arrives from the origin, this code has finished and can't be used to modify the response.
There are several solutions, including some that are conceptually valid but extremely inefficient. Still, I'll mention those as well as the cleaner/better solutions, in the interest of thoroughness.
Lambda@Edge has 4 possible trigger points:
viewer-request - when request first arrives at CloudFront, before the cache is checked; fires for every request.
origin-request - after the request is confirmed to be a cache miss, but before the request is sent to the origin server; only fires on cache misses.
origin-response - after a response (whether success or error) is returned from the origin server, but before the response is potentially stored in the cache and returned to the viewer; if this trigger modifies the response, the modified response will be stored in the CloudFront cache if cacheable, and returned to the viewer; only fires on cache misses.
viewer-response - immediately before the response is returned to the viewer, whether from the origin or the cache; fires for every non-error response, unless that response was spontaneously emitted by a viewer-request trigger, or is the result of a custom error document that sets the status code to 200 (a definite anti-pattern, but still possible), or is a CloudFront-generated HTTP to HTTPS redirect.
Any of the trigger points can assume control of the signal flow, generate its own spontaneous response, and thus change what CloudFront would ordinarily have done -- e.g. if you generate a response directly from an origin-request trigger, CloudFront doesn't actually contact the origin... so what you could theoretically do is check S3 in the origin-request trigger to see whether the request will succeed, and generate a custom error response instead. The AWS JavaScript SDK is automatically bundled into the Lambda@Edge environment. While technically legitimate, this is probably a terrible idea in almost any case, since it will increase both costs and latency due to the extra "look-ahead" requests to S3.
Another option is to write a separate origin-response trigger to check for errors and, if one occurs, replace it with a customized response from the trigger code. But this idea also qualifies as non-viable, since that trigger will fire for all responses to cache misses, whether success or failure, increasing costs and latency and wasting time in the majority of cases.
A better idea (cost, performance, ease-of-use) is CloudFront Custom Error Pages, which allows you to define a specific HTML document that CloudFront will use for every error matching the specified code (e.g. 403 for access denied, as in the original question). CloudFront can also change that 403 to a 404 when handling those errors. This requires that you do several things when the source of the error file is a bucket:
create a second CloudFront origin pointing to the bucket
create a new cache behavior that routes exactly that one path (e.g. /shared/errors/not-found.html) to the new origin (this means you can't use that path on any of the subdomains -- it will always go directly to the error file any time it's requested)
configure a CloudFront custom error response for code 403 to use the path /shared/errors/not-found.html.
set Error Caching Minimum TTL to 0, at least while testing, to avoid some frustration for yourself. See my write-up on this feature but disregard the part where I said "Leave Customize Error Response set to No".
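For reference, here is the same custom error response expressed as the CustomErrorResponses fragment of a CloudFront DistributionConfig (only the relevant portion, not a complete distribution config; the path is the example from the list above):

// fragment only; values mirror the steps described above
const customErrorResponses = {
    Quantity: 1,
    Items: [{
        ErrorCode: 403,                                    // S3 returns 403 (Access Denied) for missing keys
        ResponseCode: '404',                               // return 404 to the viewer instead
        ResponsePagePath: '/shared/errors/not-found.html', // served via the new cache behavior/origin
        ErrorCachingMinTTL: 0                              // keep at 0 while testing
    }]
};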
But... that may or may not be needed, since S3's web hosting feature also includes optional Custom Error Document support. You'll need to create a single HTML file in your original bucket, enable the website hosting feature on the bucket, and change the CloudFront Origin Domain Name to the bucket's website hosting endpoint, which is shown in the S3 console and takes the form of ${bucket}.s3-website.${region}.amazonaws.com. In some regions, the hostname might have a dash - rather than a dot . after s3-website for legacy reasons, but the dot format should work in any region.
I almost hesitate to mention one other option that comes to mind, since it's fairly advanced and I fear the description might seem quite convoluted... but you could also do the following, and it would be pretty slick, since it would allow you to potentially generate a custom HTML page for each erroneous URL requested.
Create a CloudFront Origin Group with your main bucket as the primary and a second, empty, "placeholder" bucket as the secondary. The only purpose the second bucket serves is to give CloudFront a valid name that it plans to connect to, even though we won't actually connect to it, as will become clear below.
When a request to the primary origin fails with one of the configured error status codes, the secondary origin is contacted. This feature is intended to handle the case when an origin fails, but we can leverage it for our purposes, because before the failover origin is actually contacted, the same origin-request trigger fires a second time.
If the primary origin returns an HTTP status code that you’ve configured for failover, the Lambda function is triggered again, when CloudFront re-routes the request to the second origin.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html#concept_origin_groups.lambda
(It would be more accurate to say "...when CloudFront is preparing to re-route the request to the second origin," because the trigger fires first.)
When the trigger fires a second time, the specific reason it fires isn't preserved, but there is a way to identify whether you're running in the first or second invocation: one of these two values will contain the hostname of the origin server CloudFront is preparing to contact:
event.Records[0].cf.request.origin.s3.domainName # S3 rest endpoints
event.Records[0].cf.request.origin.custom.domainName # non-S3 origins and S3 website-hosting endpoints
So we can test the appropriate value (depending on origin type) in the trigger code, looking for the name of the second "placeholder" bucket. If it's there, bypass the current logic and generate the 404 response from inside the Lambda function. This could be dynamic/customized HTML, such as one that includes the requested page URI, or one that varies depending on whether / or some other page was requested. As noted above, spontaneously generating a response from an origin-request trigger prevents CloudFront from actually contacting the origin. Generated responses from an origin-request trigger are limited to 1 MB, but that should be more than sufficient for this use case.
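Here is a sketch of what that second-invocation check might look like, building on the original function above. The placeholder bucket hostname and the HTML body are assumptions for illustration:

'use strict';
const remove_suffix = '.example.com';
const origin_hostname = 'example-bucket.s3-website.us-east-2.amazonaws.com';
const failover_hostname = 'placeholder-bucket.s3-website.us-east-2.amazonaws.com'; // the empty second bucket (assumption)

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const origin = request.origin;

    // which origin is CloudFront preparing to contact?
    const target = (origin.custom && origin.custom.domainName) ||
                   (origin.s3 && origin.s3.domainName);

    if (target === failover_hostname) {
        // second invocation: the primary origin already returned an error,
        // so generate the 404 here instead of contacting the placeholder bucket
        return callback(null, {
            status: '404',
            statusDescription: 'Not Found',
            headers: {
                'content-type': [{ key: 'Content-Type', value: 'text/html' }]
            },
            body: '<html><body><h1>Not Found</h1><p>' + request.uri + ' does not exist.</p></body></html>'
        });
    }

    // first invocation: same subdomain-to-path rewrite as the original function
    const host_header = request.headers.host[0].value;
    if (host_header.endsWith(remove_suffix)) {
        request.uri = '/' + host_header.substring(0, host_header.length - remove_suffix.length) + request.uri;
    }
    request.headers.host[0].value = origin_hostname;
    return callback(null, request);
};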

s3 file downloads are unpredictable

We have uploaded several zip files to s3. All are in the hundreds of MB range.
When we download the files, typically via a script, it appears that the file size and type both change. The new file size is typically about 300 bytes, and the downloaded file is XML.
The content of the files look similar to this (whitespace added for clarity):
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>gpdb-5.0.0.0/greenplum-db-5.0.0.0-rhel5-x86_64.zip</Key>
<RequestId>83D2047BDBA195A6</RequestId>
<HostId>tXKFaiRaNjD26j6fcrTjCk858PGBH2RAjLE1aO4+8hovD6mf+hUzJvCdWKKgrDJGaHXsjWbQP2A=</HostId>
</Error>
Any thoughts as to what might be causing this? It does not happen all of the time. It's somewhat intermittent.
As you will note in the S3 API Reference, this isn't a file -- it's an error message.
Files in S3 are called Objects, and the path + filename of the object is referred to as the object key.
A key is the unique identifier for an object within a bucket. Every object in a bucket has exactly one key. [...] Every object in Amazon S3 can be uniquely addressed through the combination of the web service endpoint, bucket name, key, and optionally, a version. For example, in the URL http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, "doc" is the name of the bucket and "2006-03-01/AmazonS3.wsdl" is the key.
http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#BasicsKeys
This error message, which should have been accompanied by a 404 Not Found error code when you tried to access the object, indicates that there is no object in the bucket whose key (path + filename) is the one shown in the error -- the one you requested. You should be able to confirm its absence in the S3 console.
If the object should have been uploaded some time in the past, this error means the object wasn't actually uploaded, or has subsequently been deleted.
If the object was very recently uploaded (typically within seconds), you should not get this error, but it is possible for this error to occur under either of two additional conditions:
If you tried to check whether the object existed by sending a GET or HEAD request for the key before uploading the object. If you do this, a short period of time may elapse before the object is accessible, because of internal optimizations inside S3: when you try to fetch a non-existent object, S3 may -- for a brief time -- retain an internal notion that the object is not there, even after it has safely been stored. Retry your request.
If you already had an object with the same key, then deleted it, and then uploaded a new object with the same key, then for a brief time after the new upload you could either get the error above or actually download the old object again.
These conditions are somewhat uncommon, but they can occur, particularly if your bucket has a lot of traffic, due to S3's consistency model, which is an engineered tradeoff between performance, reliability, and the immediate availability of uploaded objects when the same key has recently been downloaded, requested, deleted, or overwritten.
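If a script hits this intermittently for objects that were uploaded only moments earlier, a short bounded retry is usually enough. A minimal sketch, assuming the AWS SDK for JavaScript (v2) and example parameters:

// retry sketch; bucket, key, and timing values are examples
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function waitForObject(bucket, key, attempts = 5, delayMs = 2000) {
    for (let i = 0; i < attempts; i++) {
        try {
            // HEAD the object; succeeds once the key is visible
            return await s3.headObject({ Bucket: bucket, Key: key }).promise();
        } catch (err) {
            if (err.code !== 'NotFound' && err.code !== 'NoSuchKey') throw err; // real failure, not just "not yet visible"
            await new Promise(resolve => setTimeout(resolve, delayMs));
        }
    }
    throw new Error('Object ' + key + ' not visible after ' + attempts + ' attempts');
}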
The <RequestId> and <HostId> codes in the error response are opaque diagnostic codes that you can provide to AWS support, if you need to submit a support request about a specific problem you are experiencing with S3... they can use these to find the specific request and identify the problem. They are not considered sensitive information, since they have no meaning outside of AWS.
In this case, there is no apparent need to contact AWS support, because it appears that you are simply trying to download an object that is not in the specific bucket from which you tried to download this particular file. If you get alternating success and failure for the exact same file, that's unexpected, and a support case might be in order... but typically, an internal error in S3 should result in a very different response.

How to work around RequestTimeTooSkewed in Fine Uploader

I'm using Fine Uploader with S3 and I have a client whose computer time is off, resulting in an S3 RequestTimeTooSkewed error. Ideally, my client would have the right time, but I'd like to have my app be robust to this situation.
I've seen this post - https://github.com/aws/aws-sdk-js/issues/399 - on how to automatically retry the request: you take the ServerTime from the error response and use it as the time in the retried request. An alternative approach would be to get the time from a reliable external source every time, avoiding the need for a retry. However, I'm not sure how to hook either approach into Fine Uploader S3. Does anyone have an idea of how to do this?
A solution was provided in Fine Uploader 5.5 to address this very situation. From the S3 feature documentation:
If the clock on the machine running Fine Uploader is too far off of the current date, S3 may reject any requests sent from this machine. To overcome this situation, you can include a clock drift value, in milliseconds, when creating a new Fine Uploader instance. One way to set this value is to subtract the current time according to the browser from the current unix time according to your server. For example:
var uploader = new qq.s3.FineUploader({
    request: {
        clockDrift: SERVER_UNIX_TIME_IN_MS - Date.now()
    }
});
If this value is non-zero, Fine Uploader S3 will use it to pad the x-amz-date header and the policy expiration date sent to S3.
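One way to obtain SERVER_UNIX_TIME_IN_MS is to expose the server's clock to the browser. A sketch, assuming a hypothetical /server-time endpoint on your own server that returns the current unix time in milliseconds as plain text:

// the /server-time endpoint is an assumption, not part of Fine Uploader
fetch('/server-time')
    .then(function (response) { return response.text(); })
    .then(function (serverTimeMs) {
        var uploader = new qq.s3.FineUploader({
            request: {
                // ...your existing endpoint/signature configuration here...
                clockDrift: Number(serverTimeMs) - Date.now()
            }
        });
    });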

Why isn't List Parts to be used with Complete Multipart Upload?

The multipart upload overview documentation has, in the Multipart Upload Listings section, the following warning:
Note
Only use the returned listing for verification. You should not use the result of this listing when sending a complete multipart upload request. Instead, maintain your own list of the part numbers you specified when uploading parts and the corresponding ETag values that Amazon S3 returns.
Why?
Why I ask: Let's say I want to support resuming an upload that is interrupted. Doing so means knowing what remains to be uploaded, and therefore what already was uploaded. Knowing this is simpler if I may disregard the above warning. S3 is persisting the list of already-uploaded parts. I can obtain it from List Parts.
Whereas if I heed that warning, instead I'd need to intercept break or kill signals and persist the uploaded parts list locally. Although that's feasible, it seems silly to do this if S3 already has the list.
Furthermore, the warning says to use List Parts "only for verification". OK. Let's say I persist my own list, and compare it to List Parts. If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again. Therefore if List Parts is the ultimate authority, why not simply use it in the first place, and use it alone?
If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again.
You're missing the point of the warning.
It's not so much about whether parts were received. It's about whether they were received intact.
When you complete a multipart upload, you have to send a list of the parts and their etags. The etags are the hex md5sum of each part.
The lazy and careless way to complete a multipart upload would be to blindly submit the etags of the parts by just reading them from the "list" operation.
That is what they are warning against.
The correct way is to use your locally-created list, based on what you think S3 should have received, what you think the etag of each part should have been, based on the local file.
If you are resuming an upload that was interrupted, you should go back and compare the parts already uploaded (by re-reading and re-checksumming the parts of the local file) against the checksums S3 has calculated for the parts already stored (as returned by the list operation)... then either resend any incorrect or missing parts, or abandon the upload, because the local file may have changed if one or more parts don't match your local calculation.
Additionally, in the interest of data integrity, you should be sending the md5 of each part with the individual part uploads, base64-encoded, with a Content-MD5 header, since this will cause S3 to refuse to accept a part that has been corrupted in any way during the upload.
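To make the "locally-created list" idea concrete, here is a sketch, assuming the AWS SDK for JavaScript (v2), an in-progress multipart upload, and a locally maintained parts list. It re-checksums the local file's parts against what List Parts reports, then completes the upload from the local list (note that listParts is paginated, which the sketch ignores):

const fs = require('fs');
const crypto = require('crypto');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// compute the expected ETag (quoted hex MD5) of one part, read from the local file
function localPartETag(path, partNumber, partSize) {
    const fileSize = fs.statSync(path).size;
    const start = (partNumber - 1) * partSize;
    const length = Math.min(partSize, fileSize - start);
    const buf = Buffer.alloc(length);
    const fd = fs.openSync(path, 'r');
    fs.readSync(fd, buf, 0, length, start);
    fs.closeSync(fd);
    return '"' + crypto.createHash('md5').update(buf).digest('hex') + '"';
}

// localParts is your own record: [{ PartNumber, ETag }, ...]
async function verifyAndComplete(bucket, key, uploadId, path, partSize, localParts) {
    const listed = await s3.listParts({ Bucket: bucket, Key: key, UploadId: uploadId }).promise();
    for (const part of listed.Parts) {
        if (part.ETag !== localPartETag(path, part.PartNumber, partSize)) {
            // local file changed or part corrupted; resend or abandon as described above
            throw new Error('Part ' + part.PartNumber + ' does not match the local file');
        }
    }
    // complete using the locally maintained list, not the listing itself
    return s3.completeMultipartUpload({
        Bucket: bucket,
        Key: key,
        UploadId: uploadId,
        MultipartUpload: { Parts: localParts }
    }).promise();
}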

I need Multi-Part DOWNLOADS from Amazon S3 for huge files

I know Amazon S3 added the multi-part upload for huge files. That's great. What I also need is a similar functionality on the client side for customers who get part way through downloading a gigabyte plus file and have errors.
I realize browsers have some level of retry and resume built in, but when you're talking about huge files I'd like to be able to pick up where they left off regardless of the type of error.
Any ideas?
Thanks,
Brian
S3 supports the standard HTTP "Range" header if you want to build your own solution.
S3 Getting Objects
I use aria2c. For private content, you can use "GetPreSignedUrlRequest" to generate temporary private URLs that you can pass to aria2c.
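GetPreSignedUrlRequest is the .NET SDK's API for this; a sketch of the same idea with the JavaScript SDK (the bucket, key, expiry, and aria2c flags are illustrative):

// generate a temporary URL for a private object, then hand it to aria2c
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

const url = s3.getSignedUrl('getObject', {
    Bucket: 'example-bucket',
    Key: 'path/to/large-file.zip', // example key
    Expires: 3600                  // URL validity in seconds
});
console.log(url);
// then, e.g.:  aria2c -c -x 8 -s 8 "<url>"   (resumable, multi-connection download)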
S3 has a feature called byte range fetches. It's kind of the download complement to multipart upload:
Using the Range HTTP header in a GET Object request, you can fetch a byte-range from an object, transferring only the specified portion. You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request. Fetching smaller ranges of a large object also allows your application to improve retry times when requests are interrupted. For more information, see Getting Objects.
Typical sizes for byte-range requests are 8 MB or 16 MB. If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance. GET requests can directly address individual parts; for example, GET ?partNumber=N.
Source: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
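A minimal resumable ranged-download sketch along those lines, assuming the AWS SDK for JavaScript (v2) and example names; the 8 MB chunk size follows the guidance quoted above:

const fs = require('fs');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function resumableDownload(bucket, key, destPath, chunkSize = 8 * 1024 * 1024) {
    const head = await s3.headObject({ Bucket: bucket, Key: key }).promise();
    const totalSize = head.ContentLength;

    // resume from however many bytes are already on disk
    let offset = fs.existsSync(destPath) ? fs.statSync(destPath).size : 0;

    while (offset < totalSize) {
        const end = Math.min(offset + chunkSize, totalSize) - 1;
        const part = await s3.getObject({
            Bucket: bucket,
            Key: key,
            Range: 'bytes=' + offset + '-' + end // standard HTTP Range header
        }).promise();
        fs.appendFileSync(destPath, part.Body);
        offset = end + 1;
    }
}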
Just to update for the current situation: S3 natively supports multipart GET as well as PUT. https://youtu.be/uXHw0Xae2ww?t=1459
NOTE: For Ruby users only
Try the aws-sdk gem for Ruby, and download with:
object = Aws::S3::Object.new(...)
object.download_file('path/to/file')
Because it downloads large files with multipart by default:
Files larger than 5MB are downloaded using multipart method
http://docs.aws.amazon.com/sdkforruby/api/Aws/S3/Object.html#download_file-instance_method