Lambda script to direct to fallback S3 domain subfolder when not found - amazon-s3

As per this question, and this one, the following piece of code allows me to point a subfolder in an S3 bucket to my domain.
However, in instances where the subdomain is not found, I get the following error message:
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>2CE9B7837081C817</RequestId>
<HostId>
T3p7mzSYztPhXetUu7GHPiCFN6l6mllZgry+qJWYs+GFOKMjScMmRNUpBQdeqtDcPMN3qSYU/Fk=
</HostId>
</Error>
I would not like it to display this error message; instead, in instances like this, I would like to serve from another S3 bucket subdomain (i.e. example-bucket.s3-website.us-east-2.amazonaws.com/error, for example), where the user will be greeted with a fancy error message. So, in a situation where an S3 bucket subfolder is not found, it should fall back to there. How do I accomplish this by changing the Node function below?
'use strict';

// if the end of the incoming Host header matches this string,
// strip this part and prepend the remaining characters onto the request path,
// along with a new leading slash (otherwise, the request will be handled
// with an unmodified path, at the root of the bucket)
const remove_suffix = '.example.com';

// provide the correct origin hostname here so that we send the correct
// Host header to the S3 website endpoint
const origin_hostname = 'example-bucket.s3-website.us-east-2.amazonaws.com'; // see comments, below

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;
    const host_header = headers.host[0].value;

    if (host_header.endsWith(remove_suffix)) {
        // prepend '/' + the subdomain onto the existing request path ("uri")
        request.uri = '/' + host_header.substring(0, host_header.length - remove_suffix.length) + request.uri;
    }

    // fix the host header so that S3 understands the request
    headers.host[0].value = origin_hostname;

    // return control to CloudFront with the modified request
    return callback(null, request);
};

The Lambda@Edge function is an origin request trigger -- it runs after the CloudFront cache is checked and a cache miss has occurred, immediately before the request (as it stands after being modified by the trigger code) is sent to the origin server. By the time the response arrives from the origin, this code has finished and can't be used to modify the response.
There are several solutions, including some that are conceptually valid but extremely inefficient. Still, I'll mention those as well as the cleaner/better solutions, in the interest of thoroughness.
Lambda@Edge has 4 possible trigger points:
viewer-request - when the request first arrives at CloudFront, before the cache is checked; fires for every request.
origin-request - after the request is confirmed to be a cache miss, but before the request is sent to the origin server; only fires on cache misses.
origin-response - after a response (whether success or error) is returned from the origin server, but before the response is potentially stored in the cache and returned to the viewer; if this trigger modifies the response, the modified response will be stored in the CloudFront cache if cacheable, and returned to the viewer; only fires on cache misses.
viewer-response - immediately before the response is returned to the viewer, whether from the origin or cache; fires for every non-error response, unless that response was spontaneously emitted by a viewer-request trigger, or is the result of a custom error document that sets the status code to 200 (a definite anti-pattern, but still possible), or is a CloudFront-generated HTTP to HTTPS redirect.
Any of the trigger points can assume control of the signal flow, generate its own spontaneous response, and thus change what CloudFront would ordinarily have done -- e.g. if you generate a response directly from an origin-request trigger, CloudFront doesn't actually contact the origin... so what you could theoretically do is check S3 in the origin-request trigger to see whether the request will succeed, and generate a custom error response instead. The AWS JavaScript SDK is automatically bundled into the Lambda@Edge environment. While technically legitimate, this is probably a terrible idea in almost any case, since it will increase both costs and latency due to the extra "look-ahead" requests to S3.
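For illustration only, here's a minimal sketch of that look-ahead approach, assuming the AWS SDK for JavaScript v2 (the one bundled into the environment) and the hypothetical bucket/region from the question; every cache miss pays for an extra round trip to S3, which is exactly why this isn't recommended:

'use strict';

const AWS = require('aws-sdk'); // available without bundling in Lambda@Edge
const s3 = new AWS.S3({ region: 'us-east-2' }); // region is an assumption

exports.handler = async (event) => {
    const request = event.Records[0].cf.request;
    const key = request.uri.replace(/^\//, ''); // the key S3 would be asked for

    try {
        // the "look-ahead" request: does the object actually exist?
        await s3.headObject({ Bucket: 'example-bucket', Key: key }).promise();
        return request; // it does -- let CloudFront fetch it from the origin as usual
    } catch (err) {
        // it doesn't -- emit a spontaneous response; CloudFront never contacts the origin
        return {
            status: '404',
            statusDescription: 'Not Found',
            headers: { 'content-type': [{ key: 'Content-Type', value: 'text/html' }] },
            body: '<html><body><h1>Sorry, that page was not found.</h1></body></html>'
        };
    }
};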
Another option is to write a separate origin-response trigger to check for errors and, if one occurs, replace it with a customized response from the trigger code. But this idea also qualifies as non-viable, since that trigger will fire for all responses to cache misses, whether success or failure, increasing costs and latency and wasting time for the majority of cases.
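Purely as a sketch (again, not recommended for the reasons above), such an origin-response trigger might look roughly like this; the error HTML is a placeholder:

'use strict';

exports.handler = (event, context, callback) => {
    const response = event.Records[0].cf.response;

    // on a 403/404 from S3, swap in a friendlier 404 page
    if (response.status === '403' || response.status === '404') {
        response.status = '404';
        response.statusDescription = 'Not Found';
        response.headers['content-type'] = [{ key: 'Content-Type', value: 'text/html' }];
        response.body = '<html><body><h1>Sorry, that page was not found.</h1></body></html>';
    }

    // note that this runs for every cache miss, successful or not -- hence the waste
    return callback(null, response);
};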
A better idea (cost, performance, ease-of-use) is CloudFront Custom Error Pages, which allows you to define a specific HTML document that CloudFront will use for every error matching the specified code (e.g. 403 for access denied, as in the original question). CloudFront can also change that 403 to a 404 when handling those errors. This requires that you do several things when the source of the error file is a bucket (a rough sketch of the resulting settings follows the list):
create a second CloudFront origin pointing to the bucket
create a new cache behavior that routes exactly that one path (e.g. /shared/errors/not-found.html) over to the new origin (this means you can't use that path on any of the subdomains -- it will always go directly to the error file any time it's requested)
configure a CloudFront custom error response for code 403 to use the path /shared/errors/not-found.html.
set Error Caching Minimum TTL to 0, at least while testing, to avoid some frustration for yourself. See my write-up on this feature but disregard the part where I said "Leave Customize Error Response set to No".
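If it helps to see those settings together, here is roughly what the custom-error-response portion of the distribution configuration looks like when expressed as the JavaScript object the CloudFront API accepts; treat the exact field names as an approximation to verify against the current API reference:

// approximate DistributionConfig fragment -- verify against the CloudFront API docs
const customErrorResponses = {
    Quantity: 1,
    Items: [{
        ErrorCode: 403,                                     // what S3 returns for a missing key
        ResponseCode: '404',                                // what the viewer should see instead
        ResponsePagePath: '/shared/errors/not-found.html',  // served via the dedicated cache behavior
        ErrorCachingMinTTL: 0                               // keep at 0 while testing
    }]
};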
But... that may or may not be needed, since S3's web hosting feature also includes optional Custom Error Document support. You'll need to create a single HTML file in your original bucket, enable the website hosting feature on the bucket, and change the CloudFront Origin Domain Name to the bucket's website hosting endpoint, which is shown in the S3 console and takes the form of ${bucket}.s3-website.${region}.amazonaws.com. In some regions, the hostname might have a dash - rather than a dot . after s3-website for legacy reasons, but the dot format should work in any region.
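A minimal sketch of enabling that with the AWS SDK for JavaScript; the bucket name and document keys are placeholders:

const AWS = require('aws-sdk');
const s3 = new AWS.S3({ region: 'us-east-2' });

s3.putBucketWebsite({
    Bucket: 'example-bucket',
    WebsiteConfiguration: {
        IndexDocument: { Suffix: 'index.html' },
        ErrorDocument: { Key: 'error.html' }  // returned by the website endpoint for 4xx errors
    }
}).promise()
  .then(() => console.log('website hosting with custom error document enabled'));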
I almost hesitate to mention one other option that comes to mind, since it's fairly advanced and I fear the description might seem quite convoluted... but you could also do the following, and it would be pretty slick, since it would allow you to potentially generate a custom HTML page for each erroneous URL requested.
Create a CloudFront Origin Group with your main bucket as the primary and a second, empty, "placeholder" bucket as the secondary. The only purpose served by the second bucket is to give CloudFront a valid name that it plans to connect to, even though we won't actually connect to it, as will become clear below.
When a request to the primary origin fails with one of the configured error status codes, the secondary origin is contacted. This is intended for handling the case when an origin fails, but we can leverage it for our purposes, because before actually contacting the failover origin, the same origin-request trigger fires a second time.
If the primary origin returns an HTTP status code that you’ve configured for failover, the Lambda function is triggered again, when CloudFront re-routes the request to the second origin.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html#concept_origin_groups.lambda
(It would be more accurate to say "...when CloudFront is preparing to re-route the request to the second origin," because the trigger fires first.)
When the trigger fires a second time, the specific reason it fires isn't preserved, but there is a way to identify whether you're running in the first or second invocation: one of these two values will contain the hostname of the origin server CloudFront is preparing to contact:
event.Records[0].cf.request.origin.s3.domainName # S3 rest endpoints
event.Records[0].cf.request.origin.custom.domainName # non-S3 origins and S3 website-hosting endpoints
So we can test the appropriate value (depending on origin type) in the trigger code, looking for the name of the second "placeholder" bucket. If it's there, bypass the current logic and generate the 404 response from inside the Lambda function. This could be dynamic/customized HTML, such as a page that includes the requested URI, or one that varies depending on whether / or some other page was requested. As noted above, spontaneously generating a response from an origin-request trigger prevents CloudFront from actually contacting the origin. Generated responses from an origin-request trigger are limited to 1 MB, but that should be more than sufficient for this use case.
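Putting that together with the original trigger, a rough sketch might look like the following; the placeholder bucket's hostname is an assumption, and the generated HTML is only illustrative (escape the URI properly in real code):

'use strict';

const remove_suffix = '.example.com';
const origin_hostname = 'example-bucket.s3-website.us-east-2.amazonaws.com';
// hostname of the empty "placeholder" bucket configured as the secondary origin (assumed)
const failover_hostname = 'placeholder-bucket.s3-website.us-east-2.amazonaws.com';

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const origin = request.origin || {};
    // website-hosting endpoints appear under origin.custom, REST endpoints under origin.s3
    const origin_domain = (origin.custom && origin.custom.domainName) ||
                          (origin.s3 && origin.s3.domainName);

    if (origin_domain === failover_hostname) {
        // second invocation: the primary origin already failed, so generate the error
        // page here instead of letting CloudFront contact the placeholder bucket
        return callback(null, {
            status: '404',
            statusDescription: 'Not Found',
            headers: { 'content-type': [{ key: 'Content-Type', value: 'text/html' }] },
            body: '<html><body><h1>' + request.uri + ' was not found.</h1></body></html>'
        });
    }

    // first invocation: the original Host-header / path rewriting logic
    const headers = request.headers;
    const host_header = headers.host[0].value;
    if (host_header.endsWith(remove_suffix)) {
        request.uri = '/' + host_header.substring(0, host_header.length - remove_suffix.length) + request.uri;
    }
    headers.host[0].value = origin_hostname;
    return callback(null, request);
};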

Related

Results of S3 function call are being cached by my Lambda function

I have a lambda function that uses S3.listObjects to return a directory listing. The listing is sometimes (not always!) out of date - it doesn't contain recently uploaded objects and has old modification dates for the objects that it does have.
When I run the identical code locally it always works fine.
Clearly some sort of caching but I don't understand where...
Here's the relevant code:
function listFiles() {
    return new Promise(function (resolve, reject) {
        const params = {
            Bucket: "XXXXX",
            Prefix: "YYYYY"
        };
        s3.listObjects(params, function (err, data) {
            if (err) reject(err);
            else resolve(data.Contents);
        });
    });
}
That is due to the Amazon S3 data consistency model. S3 provides read-after-write consistency for PUTs of new objects; however, other requests -- including listObjects -- are eventually consistent, which means there could be a delay in propagation.
In practice, the eventual consistency settles in a matter of seconds. It's not a guarantee, though: it's unlikely, but not impossible, that Amazon returns stale data minutes later, especially across zones. It's more likely, however, that your client is caching a previous response for that same URL.
You might have run into a side effect of your Lambda container being reused. This is explained at a high level here. One consequence of container reuse is that background processes, temporary files, and global variable modifications are still around when your Lambda is re-invoked. Another article talks about how to guard against it.
If you are sending your logs to CloudWatch Logs, you can confirm that a container is being reused if the logs for a Lambda seem to be appended to the end of a previous log stream, instead of creating a new log stream.
When your Lambda container gets reused, the global variables outside your handler function are reused as well. For instance, if you change the log level of your logging calls to DEBUG at the end of your handler and your container gets reused, the next invocation will start at the top of the handler with that same log level.
If you're using the default S3 client session (it seems like you are), then this connection lives in a global (a singleton). If your S3 client connection is reused, it might pull the cached results of prior calls, and I would expect that connection to be reused in a later invocation.
One way to avoid this is to specify the If-None-Match request header. If the ETag of the object you're accessing doesn't match on the remote end, you'll get fresh data. You may set it to the last ETag you got (which you'd store in a global), or alternatively you may try setting a completely random value, which should act as a cache buster. It doesn't look like listObjects() accepts an If-None-Match header, however. You may try creating a new client session just for the current invocation.
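As a minimal sketch of that last suggestion, you could construct the client inside the handler rather than relying on a module-level singleton; the bucket and prefix placeholders are the ones from the question:

const AWS = require('aws-sdk');

exports.handler = async (event) => {
    // a fresh client per invocation, so nothing carries over from a reused container
    const s3 = new AWS.S3();
    const data = await s3.listObjects({ Bucket: 'XXXXX', Prefix: 'YYYYY' }).promise();
    return data.Contents;
};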
This article on recursive lambdas discusses the issue.

Best HTTP Method for Get or Create

I'm writing an HTTP based API, and I have a situation where the user specifies a resource, and if that resource doesn't yet exist the server creates it. It's basically built on top of Django's get_or_create method.
What would be the most idiomatic/correct HTTP method to use in this situation?
I suspect that POST would be proper; however, I'm not entirely sure. It seems that GET would be incorrect, seeing as it's not supposed to have any side effects.
I would use GET for this. Repeated calls to this endpoint will return the same resource, so it's still idempotent.
A GET request expresses the user's intent to not have any side effects. Naturally, there will always be side effects on the server like log entries for example, but the important distinction here is whether the user had asked for a side effect or not.
Another reason to stay away from GET surfaces if you respond with the recommended 201 Created response for a request where the resource is being created on the server. The next request would result in a different response, with status 200 OK, so it cannot be cached the way GET responses usually are.
Instead, I would suggest to use PUT, which is defined as
The PUT method requests that the enclosed entity be stored under the
supplied Request-URI. If the Request-URI refers to an already
existing resource, the enclosed entity SHOULD be considered as a
modified version of the one residing on the origin server. If the
Request-URI does not point to an existing resource, and that URI is
capable of being defined as a new resource by the requesting user
agent, the origin server can create the resource with that URI.
If a
new resource is created, the origin server MUST inform the user agent
via the 201 (Created) response. If an existing resource is modified,
either the 200 (OK) or 204 (No Content) response codes SHOULD be sent
to indicate successful completion of the request. If the resource
could not be created or modified with the Request-URI, an appropriate
error response SHOULD be given that reflects the nature of the
problem.
In the above form, it should be considered a "create or update" action.
To implement pure "get or create" you could respond with 409 Conflict in case an update would result in a different state.
However, especially if you are looking for idempotence, you might find that "create or update" semantics could actually be a better fit than "get or create". This depends heavily on the use case though.
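As a sketch of the difference, here is roughly how a PUT handler could behave under "get or create" semantics; Express and an in-memory Map are used purely for illustration:

const express = require('express');
const app = express();
app.use(express.json());

const items = new Map(); // stand-in for a real datastore

app.put('/api/items/:id', (req, res) => {
    const existing = items.get(req.params.id);

    if (existing === undefined) {
        items.set(req.params.id, req.body);
        return res.status(201).json(req.body);  // created under the client-supplied URI
    }

    if (JSON.stringify(existing) !== JSON.stringify(req.body)) {
        return res.status(409).json(existing);  // "get or create": refuse a conflicting update
    }

    return res.status(200).json(existing);      // identical representation: idempotent no-op
});

app.listen(3000);

For "create or update" semantics, the 409 branch would instead overwrite the stored representation and return 200 (or 204).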
I would not use GET for creating resources, because you never know if some robot (like a search engine robot) is following listed GET calls, which would then create a lot of useless resources.

Get the eventual url for a file uploaded to S3

Working on an app for a client that will asynchronously receive a request, reply immediately, then go out and fetch a set of large files to perform work on, and finally upload the results to S3 minutes or hours later.
Can we know ahead of time what the eventual url of the file on S3 will be? I'm thinking of creating a hash based on the filename and some other metadata that we know at the incoming request initialization and using that as the name of the S3 file. Is there a predictable pattern of S3 host plus bucket plus file name, or is it something that we don't know until the file upload is complete?
I'm entertaining the idea of returning the eventual S3 filename to the initial request, with the expectation that on the client's end they can periodically check the url for the result. In addition, I'm considering requiring the client to pass a callback url in with the request. The app will then hit that url later with the success/fail status of the work.
Thanks.
The URL of a file uploaded to S3 can be entirely determined by you - it's purely dependent on the bucket and key name. Specifically, it's of the form:
http://s3.amazonaws.com/BUCKETNAME/KEYNAME
(Or certain other formats, depending. It's still completely predictable.)
So long as you pick the key name ahead of time, you'll know what the eventual URL will be.
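In other words, something along these lines works; the bucket name and hashing scheme are just examples of metadata you already know when the request arrives:

const crypto = require('crypto');

// illustrative values -- in the real app these come from the incoming request
const originalFilename = 'big-input-file.zip';
const requestId = 'req-12345';
const bucket = 'example-results-bucket';

const hash = crypto.createHash('sha256')
    .update(originalFilename + ':' + requestId)
    .digest('hex');
const key = 'results/' + hash + '.zip';

// the eventual URL is fully determined by bucket + key, so it can be returned
// to the caller immediately, long before the upload actually happens
const url = 'http://s3.amazonaws.com/' + bucket + '/' + key;
console.log(url);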

Temporary URL for an asynchronous API

I am designing an asynchronous API according to the RESTful principles set out here. This involves a temporary URL, specified in the response's Location header, which clients can poll for status updates until their result is ready. I've never dealt with temporary URLs before, so how might I go about building one? We are using the Hunchentoot (Common Lisp) webserver.
I would not expect that URL to be temporary. That shows the status of the creation request. When creation is done, the response from that URL would indicate <status>done</status> or somesuch. If you really want to pull that URL (which I don't recommend), then you would return a 404.

Idempotence of HTTP PUT and DELETE

So the HTTP spec says that HTTP PUT and DELETE should be idempotent. Meaning, multiple PUT requests to the same URL with the same body should not result in additional side-effects on the server. Same goes with multiple HTTP DELETEs, if 2 or more DELETE requests are sent to the same URL, the second (or third, etc) requests should not return an error indicating that the resource has already been deleted.
However, what about PUT requests to a URI after a DELETE has been processed? Should it return 404?
For example, consider the following requests are executed in this order:
POST /api/items - creates an item resource, returns HTTP 201 and URI /api/items/6
PUT /api/items/6 - updates the data associated with item #6
PUT /api/items/6 - has no side effects as long as request body is same as previous PUT
DELETE /api/items/6 - deletes item #6 and returns HTTP 202
DELETE /api/items/6 - has no side effects, and also returns HTTP 202
GET /api/items/6 - this will now return a 404
PUT /api/items/6 - WHAT SHOULD HAPPEN HERE? 404? 409? something else?
So, should PUT be consistent with GET and return a 404, or, as @CodeCaster suggests, would a 409 be more appropriate?
RFC 2616, section 9.6, PUT:
The fundamental difference between the POST and PUT requests is
reflected in the different meaning of the Request-URI. The URI in a
POST request identifies the resource that will handle the enclosed
entity. That resource might be a data-accepting process, a gateway to
some other protocol, or a separate entity that accepts annotations.
In contrast, the URI in a PUT request identifies the entity enclosed
with the request -- the user agent knows what URI is intended and the
server MUST NOT attempt to apply the request to some other resource.
And:
If the resource could not be created or modified with the Request-URI, an appropriate error response SHOULD be given that reflects the nature of the problem.
So to define 'appropriate' is to look at the 400-series, indicating there's a client error. First I'll eliminate the irrelevant ones:
400 Bad Request: The request could not be understood by the server due to malformed
syntax.
401 Unauthorized: The request requires user authentication.
402 Payment Required: This code is reserved for future use.
406 Not Acceptable: The resource identified by the request [...] not acceptable
according to the accept headers sent in the request.
407 Proxy Authentication Required: This code [...] indicates that the
client must first authenticate itself with the proxy.
408 Request Timeout: The client did not produce a request within the time that the server was prepared to wait.
411 Length Required: The server refuses to accept the request without a defined Content-
Length.
So, which ones may we use?
403 Forbidden
The server understood the request, but is refusing to fulfill it.
Authorization will not help and the request SHOULD NOT be repeated.
This description actually fits pretty well, although it is usually used in a permissions-related context (as in: YOU may not ...).
404 Not Found
The server has not found anything matching the Request-URI. No
indication is given of whether the condition is temporary or
permanent. The 410 (Gone) status code SHOULD be used if the server
knows, through some internally configurable mechanism, that an old
resource is permanently unavailable and has no forwarding address.
This status code is commonly used when the server does not wish to
reveal exactly why the request has been refused, or when no other
response is applicable.
This one too, especially the last line.
405 Method Not Allowed
The method specified in the Request-Line is not allowed for the
resource identified by the Request-URI. The response MUST include an
Allow header containing a list of valid methods for the requested
resource.
There are no valid methods we can respond with, since we don't want any method to be executed on this resource at the moment, so we cannot return a 405.
409 Conflict
Conflicts are most likely to occur in response to a PUT request. For
example, if versioning were being used and the entity being PUT
included changes to a resource which conflict with those made by an
earlier (third-party) request, the server might use the 409 response
to indicate that it can't complete the request. In this case, the
response entity would likely contain a list of the differences
between the two versions in a format defined by the response
Content-Type.
But that assumes there already is a resource at the URI (how can there be a conflict with nothing?).
410 Gone
The requested resource is no longer available at the server and no
forwarding address is known. This condition is expected to be
considered permanent. Clients with link editing capabilities SHOULD
delete references to the Request-URI after user approval. If the
server does not know, or has no facility to determine, whether or not
the condition is permanent, the status code 404 (Not Found) SHOULD be
used instead.
This one also makes sense.
I've edited this post a few times now; it was accepted when it claimed "use 410 or 404", but now I think 403 might also be applicable, since the RFC doesn't state that a 403 has to be permissions-related (though it seems to be implemented that way by popular web servers). I think I have eliminated all other 400-codes, but feel free to comment (before you downvote).
Your question has an unstated, assumed premise: that the resource must exist for a PUT to succeed. This is not a valid assumption.
The relevant portion of the spec (RFC2616) says:
the user agent knows what URI is intended and the server MUST NOT attempt to apply the request to some other resource.
The spec does not say, "An object at the referenced URI must already exist in order for a PUT to that URI to succeed."
The easy example is a web store implemented via REST. GET returns a representation of the object at the given path, while DELETE removes the item at the given path. Those are easy. But the POST and PUT are not much more difficult to understand. POST can do anything, but one use of POST creates an object in a container that the client specifies, and lets the server return the URI of the newly created object within that container. PUT is more limited; it gives the server the representation for an object at a given URI. The object may already exist, or it may not. PUT is not a synonym for REPLACE.
In my opinion 409 or 410 is wrong for a PUT, unless the container itself (the thing you are trying to PUT into) does not exist.
Therefore:
POST /container
==> returns 200 with `Location:/container/resource-12345`
PUT /container/resource-98928
==> returns 201 CREATED or 200 OK
PUT /this-container-does-not-exist/resource-22828282
==> returns 400
Of course it is up to you whether you'd like your server to allow these PUT semantics. But there's nothing in the spec that says you must not allow clients to provide the URI of the resource they are PUTting.