Does an Amazon CloudFront distribution with multiple origins conflict?

I have 2 different images on 2 websites at:
http://www.siteA.com/avatar.png
http://www.siteB.com/avatar.png
If I create an Amazon CloudFront distribution with 2 origins, www.siteA.com and www.siteB.com, and then request uniqueDistributionID.cloudfront.net/avatar.png, which avatar.png will be returned? The one from siteA or the one from siteB?
Why, and why not?
I'm trying to understand the potential for conflicts in CloudFront distributions.

No. CloudFront doesn't have a concept of a "conflict", because when you have a distribution with multiple origins, you have to define which path patterns go to which origin.
CloudFront's path pattern matching is deterministic. It uses first match, not best match. Whichever pattern matches first is the one that will be used, even if that path is a dead-end at the origin server.
When CloudFront receives an end-user request, the requested path is compared with path patterns in the order in which cache behaviors are listed in the distribution. The first match determines which cache behavior is applied to that request.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-web-values-specify.html#DownloadDistValuesPathPattern
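For illustration, here's a trimmed sketch of ordered cache behaviors in a distribution config (field names are from the CloudFront API; most required fields are omitted):

"CacheBehaviors": {
  "Quantity": 2,
  "Items": [
    { "PathPattern": "images/*", "TargetOriginId": "siteA-origin" },
    { "PathPattern": "*.png",    "TargetOriginId": "siteB-origin" }
  ]
}

With this ordering, a request for /images/avatar.png matches images/* first and is sent to siteA-origin, a request for /avatar.png falls through to *.png and is sent to siteB-origin, and everything else uses the distribution's DefaultCacheBehavior, which points to exactly one origin.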
Update
CloudFront now supports a concept of Origin Groups, which allow any given cache behavior to send a request to one origin and, if that origin returns one of the error types you specify (e.g. 404 or 503), retry the request against a second origin. This can be used for failover, but it can also be used for cases where you want CloudFront to try one origin, and then another. The two origins in the origin group are tried, in order, for every cache miss. If either origin returns a cacheable response, that response will be stored in the cache.
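For reference, an origin group in the distribution config looks roughly like this (a trimmed sketch; assumes two origins with IDs "primary-origin" and "secondary-origin" are already defined):

"OriginGroups": {
  "Quantity": 1,
  "Items": [{
    "Id": "my-origin-group",
    "FailoverCriteria": {
      "StatusCodes": { "Quantity": 2, "Items": [404, 503] }
    },
    "Members": {
      "Quantity": 2,
      "Items": [
        { "OriginId": "primary-origin" },
        { "OriginId": "secondary-origin" }
      ]
    }
  }]
}

A cache behavior can then reference my-origin-group as its TargetOriginId instead of a single origin.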

Related

Lambda script to direct to fallback S3 domain subfolder when not found

As per this question, and this one, the following piece of code allows me to point a subfolder in an S3 bucket to my domain.
However, in instances where the subdomain is not found, I get the following error message:
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>2CE9B7837081C817</RequestId>
<HostId>
T3p7mzSYztPhXetUu7GHPiCFN6l6mllZgry+qJWYs+GFOKMjScMmRNUpBQdeqtDcPMN3qSYU/Fk=
</HostId>
</Error>
I would not like it to display this error message; instead, in instances like this, I would like to serve from another S3 bucket path (e.g. example-bucket.s3-website.us-east-2.amazonaws.com/error), where the user will be greeted with a fancy error message. So, in a situation where an S3 bucket subfolder is not found, it should fall back there. How do I accomplish this by changing the Node function below?
'use strict';
// if the end of incoming Host header matches this string,
// strip this part and prepend the remaining characters onto the request path,
// along with a new leading slash (otherwise, the request will be handled
// with an unmodified path, at the root of the bucket)
const remove_suffix = '.example.com';
// provide the correct origin hostname here so that we send the correct
// Host header to the S3 website endpoint
const origin_hostname = 'example-bucket.s3-website.us-east-2.amazonaws.com'; // see comments, below
exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;
    const host_header = headers.host[0].value;

    if (host_header.endsWith(remove_suffix)) {
        // prepend '/' + the subdomain onto the existing request path ("uri")
        request.uri = '/' + host_header.substring(0, host_header.length - remove_suffix.length) + request.uri;
    }

    // fix the host header so that S3 understands the request
    headers.host[0].value = origin_hostname;

    // return control to CloudFront with the modified request
    return callback(null, request);
};
The Lambda@Edge function is an origin request trigger -- it runs after the CloudFront cache is checked and a cache miss has occurred, immediately before the request (as it stands after being modified by the trigger code) is sent to the origin server. By the time the response arrives from the origin, this code has finished and can't be used to modify the response.
There are several solutions, including some that are conceptually valid but extremely inefficient. Still, I'll mention those as well as the cleaner/better solutions, in the interest of thoroughness.
Lambda@Edge has 4 possible trigger points:
viewer-request - fires when the request first arrives at CloudFront, before the cache is checked; fires for every request.
origin-request - fires after the request is confirmed to be a cache miss, but before the request is sent to the origin server; only fires on cache misses.
origin-response - fires after a response (whether success or error) is returned from the origin server, but before the response is potentially stored in the cache and returned to the viewer; if this trigger modifies the response, the modified response is what gets stored in the CloudFront cache (if cacheable) and returned to the viewer; only fires on cache misses.
viewer-response - fires immediately before the response is returned to the viewer, whether from the origin or the cache; fires for every non-error response, unless that response was spontaneously emitted by a viewer-request trigger, or is the result of a custom error document that sets the status code to 200 (a definite anti-pattern, but still possible), or is a CloudFront-generated HTTP-to-HTTPS redirect.
Any of the trigger points can assume control of the signal flow, generate its own spontaneous response, and thus change what CloudFront would otherwise have done -- e.g. if you generate a response directly from an origin-request trigger, CloudFront doesn't actually contact the origin... so what you could theoretically do is check S3 in the origin-request trigger to see whether the request will succeed, and generate a custom error response instead. The AWS JavaScript SDK is automatically bundled into the Lambda@Edge environment. While technically legitimate, this is probably a terrible idea in almost any case, since it will increase both costs and latency due to the extra "look-ahead" requests to S3.
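For completeness, a spontaneous response from an origin-request trigger looks roughly like this (a minimal sketch; the status and body are placeholders):

exports.handler = (event, context, callback) => {
    // returning a response object instead of the request object tells
    // CloudFront to skip the origin entirely and use this response
    const response = {
        status: '404',
        statusDescription: 'Not Found',
        headers: {
            'content-type': [{ key: 'Content-Type', value: 'text/html' }]
        },
        body: '<html><body><h1>Not Found</h1></body></html>'
    };
    return callback(null, response);
};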
Another option is to write a separate origin-response trigger to check for errors and, if one occurs, replace it with a customized response from the trigger code. But this idea also qualifies as non-viable, since that trigger will fire for all responses to cache misses, whether success or failure, increasing costs and latency and wasting time in the majority of cases.
A better idea (for cost, performance, and ease of use) is CloudFront Custom Error Pages, which lets you define a specific HTML document that CloudFront will use for every error matching the specified code (e.g. 403 for Access Denied, as in the original question). CloudFront can also change that 403 into a 404 when handling those errors. This requires that you do several things when the source of the error file is a bucket (a config sketch follows this list):
create a second CloudFront origin pointing to the bucket
create a new cache behavior that routes exactly that one path (e.g. /shared/errors/not-found.html) to the new origin (this means you can't use that path on any of the subdomains -- it will always go directly to the error file any time it's requested)
configure a CloudFront custom error response for code 403 to use the path /shared/errors/not-found.html.
set Error Caching Minimum TTL to 0, at least while testing, to avoid some frustration for yourself. See my write-up on this feature but disregard the part where I said "Leave Customize Error Response set to No".
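Assembled in the distribution config, the custom error response portion might look like this (a sketch; field names are from the CloudFront API):

"CustomErrorResponses": {
  "Quantity": 1,
  "Items": [{
    "ErrorCode": 403,
    "ResponseCode": "404",
    "ResponsePagePath": "/shared/errors/not-found.html",
    "ErrorCachingMinTTL": 0
  }]
}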
But... that may or may not be needed, since S3's web hosting feature also includes optional Custom Error Document support. You'll need to create a single HTML file in your original bucket, enable the website hosting feature on the bucket, and change the CloudFront Origin Domain Name to the bucket's website hosting endpoint, which is shown in the S3 console and takes the form ${bucket}.s3-website.${region}.amazonaws.com. In some regions, the hostname might have a dash - rather than a dot . after s3-website for legacy reasons, but the dot format should work in any region.
I almost hesitate to mention one other option that comes to mind, since it's fairly advanced and I fear the description might seem quite convoluted... but you could also do the following, and it would be pretty slick, since it would allow you to potentially generate a custom HTML page for each erroneous URL requested.
Create a CloudFront Origin Group with your main bucket as the primary and a second, empty, "placeholder" bucket as the secondary. The only purpose of the second bucket is to give CloudFront a valid origin that it plans to connect to, even though we won't actually let it connect, as should become clear below.
When a request to the primary origin fails with one of the configured error status codes, the secondary origin is contacted. This is intended for handling the case when an origin fails, but we can leverage it for our purposes, because before CloudFront actually contacts the failover origin, the same origin request trigger fires a second time.
If the primary origin returns an HTTP status code that you’ve configured for failover, the Lambda function is triggered again, when CloudFront re-routes the request to the second origin.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html#concept_origin_groups.lambda
(It would be more accurate to say "...when CloudFront is preparing to re-route the request to the second origin," because the trigger fires first.)
When the trigger fires a second time, the specific reason it fires isn't preserved, but there is a way to identify whether you're running in the first or second invocation: one of these two values will contain the hostname of the origin server CloudFront is preparing to contact:
event.Records[0].cf.request.origin.s3.domainName # S3 rest endpoints
event.Records[0].cf.request.origin.custom.domainName # non-S3 origins and S3 website-hosting endpoints
So we can test the appropriate value (depending on origin type) in the trigger code, looking for the name of the second "placeholder" bucket. If it's there, bypass the current logic and generate the 404 response from inside the Lambda function. This could be dynamic/customized HTML, such as one that includes the requested page URI, or one that varies depending on whether / or some other page was requested. As noted above, spontaneously generating a response from an origin-request trigger prevents CloudFront from actually contacting the origin. Generated responses from an origin-request trigger are limited to 1 MB, but that should be more than sufficient for this use case.
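Putting that together, the top of the origin-request trigger might look like this (a sketch; 'placeholder-bucket' stands in for whatever you name the empty secondary bucket):

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    // website-hosting endpoints appear under origin.custom; REST endpoints under origin.s3
    const origin = request.origin.custom || request.origin.s3;

    if (origin.domainName.startsWith('placeholder-bucket.')) {
        // second invocation: CloudFront is preparing to fail over to the
        // placeholder bucket, so generate the error response ourselves
        return callback(null, {
            status: '404',
            statusDescription: 'Not Found',
            headers: {
                'content-type': [{ key: 'Content-Type', value: 'text/html' }]
            },
            body: '<html><body><h1>' + request.uri + ' was not found.</h1></body></html>'
        });
    }

    // first invocation: fall through to the normal request-rewriting logic
    return callback(null, request);
};

Note that interpolating request.uri into HTML without escaping is for illustration only; escape it in real code.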

Two sets of cache with one server (Varnish Cache)

Is it possible to set up Varnish Cache with two independent cache stores?
Then, based on a custom HTTP header, either use cache1 or cache2.
For example:
Request 1 comes in with header store=Cache1; this should go to the Cache1 store in Varnish.
Request 2 comes in, exactly like Request 1 but with header store=Cache2; this should go to the Cache2 store in Varnish.
This use case occurs when the backend responds with a different body based on the header (but with the same URL) - a legitimate use case.
You could deal with this exactly as described, by partitioning the Varnish cache, similar to putting a separate Varnish cache for static files.
But what you want is actually much simpler. Your particular case is easily addressed by adjusting the VCL: you only need to tell Varnish that the cache key should depend on that particular header. So in your VCL, you would specify:
sub vcl_hash {
    if (req.http.store) {
        hash_data(req.http.store);
    }
}
This vcl_hash addition makes the cache entry depend on the value of the store HTTP header. Note that hash_data() adds to the default hash inputs (the URL and the Host header) rather than replacing them, so different URLs still get their own cache entries.

How to force the dispatcher to cache URLs with GET parameters

As I understood after reading these links:
How to find out what does dispatcher cache?
http://docs.adobe.com/docs/en/dispatcher.html
The Dispatcher always requests the document directly from the AEM instance in the following cases:
If the HTTP method is not GET. Other common methods are POST for form data and HEAD for the HTTP header.
If the request URI contains a question mark "?". This usually indicates a dynamic page, such as a search result, which does not need to be cached.
The file extension is missing. The web server needs the extension to determine the document type (the MIME-type).
The authentication header is set (this can be configured)
But I want to cache URLs with parameters.
If I request myUrl/?p1=1&p2=2&p3=3 once,
then the next request to myUrl/?p1=1&p2=2&p3=3 must be served from the dispatcher cache, but myUrl/?p1=1&p2=2&p3=3&newParam=newValue should be served by CQ the first time and from the dispatcher cache for subsequent requests.
I think the config /ignoreUrlParams is what you are looking for. It can be used to whitelist the query parameters that are used to determine whether a page is cached / delivered from cache or not.
Check http://docs.adobe.com/docs/en/dispatcher/disp-config.html#Ignoring%20URL%20Parameters for details.
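As a sketch (the parameter names are illustrative), the dispatcher.any section might look like this; parameters matched by an allow rule are ignored when deciding whether to cache:

/ignoreUrlParams
  {
  # any parameter not matched below prevents caching
  /0001 { /glob "*" /type "deny" }
  # utm_* parameters are ignored, so the page is cached despite them
  /0002 { /glob "utm_*" /type "allow" }
  }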
By default, it's not possible to cache requests that contain a query string. Such calls are considered dynamic, so they are not expected to be cached.
On the other hand, if you are certain that such requests should be cached because your application/feature is query-driven, you can work around it this way:
Add an Apache rewrite rule that moves the given query parameter into a selector (sketched after this list)
(optional) Add a CQ filter that recognizes the selector and moves it back into the query string
The selector can be constructed as key_value, but that puts some constraints on what can be passed here.
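A hedged sketch of such a rewrite (mod_rewrite in server/virtual-host context; key is a hypothetical parameter name, mapped to a key_value-style selector):

RewriteCond %{QUERY_STRING} (?:^|&)key=([^&]+)
RewriteRule ^/mypage\.html$ /mypage.key_%1.html? [PT,L]

The trailing ? drops the original query string, and %1 back-references the group captured in the RewriteCond.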
You can do this with Apache rewrites, but it would not be ideal practice: you'll be breaking the pattern that AEM uses.
Instead, use selectors and extensions. E.g. instead of server.com/mypage.html?somevalue=true, use:
server.com/mypage.somevalue-true.html
Most things you'll need that would ever get cached will work this way just fine. If you give me more details about your requirements and what you are trying to achieve, I can help you perfect the solution.

Cache-control: Is it possible to ignore query parameters when validating the cache?

Is it possible to set a cache-control header, when communicating with a reverse proxy, to ignore query parameters in determining what is a unique URI, or in short: treat a cached entry as valid even if some query parameters have changed?
Sometimes query parameters have nothing to do with the rendering of the page, at least from a server-side perspective. For instance, all the utm_* variables from Google AdWords. These are needed by the JavaScript on your page, so you don't want to strip them away and redirect to a cached page, but at the same time it would be advantageous not to treat two URIs which are basically the same, but have different utm_* parameters, as unique when communicating with a reverse proxy.
An example:
http://www.example.com/search?sort=price
http://www.example.com/search?sort=price&utm_campaign=shoes
Is there any way to tell the reverse proxy, using the HTTP 1.1 spec (i.e. some type of HTTP header), that it can treat these two pages as the same?
You can filter the query string in vcl_recv, and there is also a Varnish module for that [1].
Also, keep in mind that query string parameter order matters in this case [2].
See also this related question [3]
[1] https://www.varnish-cache.org/vmod/querystring
[2] http://cyberroadie.wordpress.com/2012/01/05/varnish-reordering-query-string/
[3] Stripping out select querystring attribute/value pairs so varnish will not vary cache by them
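As a sketch, filtering in vcl_recv (the first approach) might look like this; the regexes are illustrative and cover gclid and utm_* parameters:

sub vcl_recv {
    # strip utm_* and gclid parameters so they don't fragment the cache
    if (req.url ~ "(\?|&)(gclid|utm_[a-z]+)=") {
        set req.url = regsuball(req.url, "(gclid|utm_[a-z]+)=[^&]*&?", "");
        set req.url = regsub(req.url, "(\?|&)$", "");
    }
}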

Does the `Expires` HTTP header need to be consistent across multiple cold-cache requests?

I'm implementing a custom web server of a kind, and am looking into adding Expires header support. However, I'm a little unsure of how exactly to implement it.
If multiple cold-cache requests are made to the same unchanged resource on the server and the server returns different Expires headers (say it uses a time relative to the request to calculate the exact value, e.g. +6 hours from the request time), does that invalidate the cache on all the proxy servers in between as well? Or is that impossible (per the spec)?
OK, never mind, I found the relevant information under the Cache Revalidation and Reload Controls section of the HTTP spec.
Basically, you can serve all the different validators you want, but you must be aware that in such a case proxies may hold a set of different validators, from their own caches and from the various user agents communicating with the proxy. They may choose to send one to you, and it might not be the correct or most optimal one for the end users. However, a "best approach" has been suggested in the spec.
I suppose this covers Expires headers as well as ETags, Cache-Control, and whatnot.
Here's the relevant excerpt, in case anyone's interested:
When an intermediate cache is forced, by means of a max-age=0 directive, to revalidate its own cache entry, and the client has supplied its own validator in the request, the supplied validator might differ from the validator currently stored with the cache entry. In this case, the cache MAY use either validator in making its own request without affecting semantic transparency. However, the choice of validator might affect performance. The best approach is for the intermediate cache to use its own validator when making its request. If the server replies with 304 (Not Modified), then the cache can return its now validated copy to the client with a 200 (OK) response. If the server replies with a new entity and cache validator, however, the intermediate cache can compare the returned validator with the one provided in the client's request, using the strong comparison function. If the client's validator is equal to the origin server's, then the intermediate cache simply returns 304 (Not Modified). Otherwise, it returns the new entity with a 200 (OK) response. If a request includes the no-cache directive, it SHOULD NOT include min-fresh, max-stale, or max-age.
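As an aside, one way for a server to keep Expires consistent across cold-cache requests is to derive it from a fixed boundary rather than from each request's arrival time. A minimal Node.js sketch (not from the spec; nextExpiryBoundary is a hypothetical helper):

'use strict';
const http = require('http');

// snap the expiry to the next 6-hour boundary so every cold-cache request
// for the same unchanged resource sees the same Expires value
function nextExpiryBoundary(hours) {
    const ms = hours * 3600 * 1000;
    return new Date(Math.ceil(Date.now() / ms) * ms);
}

http.createServer((req, res) => {
    res.setHeader('Expires', nextExpiryBoundary(6).toUTCString());
    res.end('hello');
}).listen(8080);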