We are running an SPA that communicates with an API. Both are exposed to the public via Cloudfront.
We now have the issue that the requests we see in the backend are masked by Cloudfront. Meaning:
The Remote Address we see is the address of the AWS Cloud
The User Agent Header field is set to "Amazon Cloudfront" and not the browser of the user
So Cloudfront somehow intercepts the request in a way we didn't anticipate.
I already went through these steps: https://aws.amazon.com/premiumsupport/knowledge-center/configure-cloudfront-to-forward-headers/ but ended up cutting the connection between the API and the frontend.
We don't care about caching implications (we don't have a lot of traffic), we just need the right fields to show up in the backend.
By default, most request headers are removed, because CloudFront's default behaviors are generally designed around optimal caching. CloudFront's default header handling behavior is documented.
If you need to see specific headers at the origin, whitelist those headers for forwarding in the cache distribution. The documentation refers to this as “Selecting the Headers on Which You Want CloudFront to Base Caching” -- and that is what it does -- but that description masks what's actually happening. CloudFront removes the rest of the headers because it has no way of knowing for certain whether a specific header with a certain value might change the response that the origin generates. If it didn't remove these headers by default, there would be confusion in the other direction when users saw the "wrong" responses served from the cache.
In your case, you almost certainly don't want to include the Host header in what you are whitelisting for forwarding.
When testing, especially, be sure you also set the Error Caching Minimum TTL to 0, because the default value is 300 seconds... so you can't see whether the problem is fixed for up to 5 minutes after you fix it. This default is also by design, a protective measure to avoid overloading your origin with requests that are likely to continue to fail.
When examining responses from CloudFront, keep an eye on the Age response header, which is present any time the response is served from cache. It tells you how long it's been (in seconds) since CloudFront obtained the response it is currently returning to you.
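To make the Age check concrete, here is a minimal sketch in Python. The helper names (cache_age_seconds, describe) are made up for illustration, and it assumes you already have the response headers as a plain mapping from whatever HTTP client you use.

```python
from typing import Mapping, Optional


def cache_age_seconds(headers: Mapping[str, str]) -> Optional[int]:
    """Return the Age header as whole seconds, or None if absent or invalid."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("age")
    if raw is None:
        return None
    try:
        age = int(raw.strip())
    except ValueError:
        return None
    return age if age >= 0 else None


def describe(headers: Mapping[str, str]) -> str:
    """Summarise whether a response looks like it came from the cache."""
    age = cache_age_seconds(headers)
    if age is None:
        return "no Age header: response was probably fetched from the origin"
    return f"served from cache, obtained {age} seconds ago"
```

While testing a fix, a response that keeps coming back with a growing Age value is still the old cached copy.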
If you want to disable CloudFront caching, you can set Maximum, Minimum, and Default TTL all to 0 (this only affects 2xx and 3xx HTTP responses -- errors are cached for a different time window, as noted above), or your origin can consistently return Cache-Control: s-maxage=0, which will prevent CloudFront from caching the response.
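If you go the origin route, one way to make sure every response carries that header is a small wrapper around the application. This is only a sketch assuming a Python WSGI origin; the name no_shared_cache is hypothetical, and note that it replaces any Cache-Control header the application set itself.

```python
from typing import Callable, Iterable, List, Tuple

Header = Tuple[str, str]


def no_shared_cache(app: Callable) -> Callable:
    """Wrap a WSGI app so every response carries Cache-Control: s-maxage=0."""

    def wrapped(environ: dict, start_response: Callable) -> Iterable[bytes]:
        def patched_start_response(status: str, headers: List[Header], exc_info=None):
            # Drop any Cache-Control the app set, then add the directive
            # aimed at shared caches such as CloudFront.
            kept = [(n, v) for n, v in headers if n.lower() != "cache-control"]
            kept.append(("Cache-Control", "s-maxage=0"))
            if exc_info is not None:
                return start_response(status, kept, exc_info)
            return start_response(status, kept)

        return app(environ, patched_start_response)

    return wrapped
```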
I have built a system where I have product templates. A brand will overwrite the template to create a product. Images can be uploaded to the template and be overwritten on the product. The product images are uploaded to the corresponding brand's S3 bucket. But on the product template images are uploaded to a generic S3 bucket.
Is there a way to make the brand's bucket fall back to the generic bucket if it receives a 404 or 403 for a file url, similar to the hosted website redirect rules? These are just buckets with images so it wouldn't be a hosted website and I was hoping to avoid turning that on.
There is not a way to do this with S3 alone, but it can be done with CloudFront, in conjunction with two S3 buckets, configured in an origin group with appropriate origin failover settings, so that 403/404 errors from the first bucket cause CloudFront to make a follow-up request from the second bucket.
After you configure origin failover for a cache behavior, CloudFront does the following for viewer requests:
When there’s a cache hit, CloudFront returns the requested file.
When there’s a cache miss, CloudFront routes the request to the primary origin that you identified in the origin group.
When a status code that has not been configured for failover is returned, such as an HTTP 2xx or HTTP 3xx status code, CloudFront serves the requested content.
When the primary origin returns an HTTP status code that you’ve configured for failover, or after a timeout, CloudFront routes the request to the backup origin in the origin group.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html
This seems to be the desired behavior you're describing. It means, of course, that cache misses that need to fall back to the second bucket will require additional time to be served, but for cache hits, there won't be any delay since CloudFront only goes through the "try one, then try the other" on cache misses. It also means that you'll be paying for some traffic on the primary bucket for objects that aren't present, so it makes the most sense if the primary bucket will have the object more often than not.
This solution does not redirect the browser -- CloudFront follows the second path before returning a response -- so you'll want to be mindful of the Cache-Control settings you attach to the fallback objects when you upload them, since adding a (previously-absent) primary object after a fallback object is already fetched and cached (by either CloudFront or the browser) will not be visible until any cached objects expire.
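The flow above can be modelled in a few lines of Python. This is a toy simulation, not CloudFront's implementation: Bucket and OriginGroup are hypothetical stand-ins, the buckets are in-memory dicts, and the failover codes are assumed to be 403 and 404.

```python
from typing import Dict, Optional, Tuple

# Assumption: the status codes configured for failover on the origin group.
FAILOVER_STATUS_CODES = {403, 404}


class Bucket:
    """Toy stand-in for an S3 bucket: a mapping of object keys to bytes."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self.objects = objects

    def get(self, key: str) -> Tuple[int, Optional[bytes]]:
        if key in self.objects:
            return 200, self.objects[key]
        return 404, None


class OriginGroup:
    """Try the cache, then the primary origin, then the backup origin."""

    def __init__(self, primary: Bucket, backup: Bucket) -> None:
        self.primary = primary
        self.backup = backup
        self.cache: Dict[str, bytes] = {}

    def fetch(self, key: str) -> Tuple[int, Optional[bytes], str]:
        # Cache hit: neither origin is contacted.
        if key in self.cache:
            return 200, self.cache[key], "cache"
        # Cache miss: ask the primary origin first.
        status, body = self.primary.get(key)
        source = "primary"
        # Only a status configured for failover triggers the second request.
        if status in FAILOVER_STATUS_CODES:
            status, body = self.backup.get(key)
            source = "backup"
        if status == 200 and body is not None:
            self.cache[key] = body
        return status, body, source
```

The model also shows the caching caveat: once a fallback object is in the cache, adding the same key to the primary bucket changes nothing until the cached copy expires (expiry is not modelled here).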
I am thinking of using Cloudflare to cache a resource generated by a REST API endpoint.
Because the API can take time to return the result, I am wondering if it is possible to configure Cloudflare to refresh the resource in the background while always returning the cached resource to clients.
You can use page rules on the API endpoint to cache the result for X hours (or days, etc).
I think it will have to be GET though, I don't think POST is ever cached.
Problem:
I am storing a number of HLS streams in S3 with the following file structure:
Video1
├── hls3
│   ├── hlsv3-master.m3u8
│   ├── media-1
│   ├── media-2
│   ├── media-3
│   ├── media-4
│   └── media-5
└── hls4
    ├── hlsv4-master.m3u8
    ├── media-1
    ├── media-2
    ├── media-3
    ├── media-4
    └── media-5
In my user API I know exactly which user has access to which video content,
but I also need to ensure that video links are not shareable and are only accessible
by users with the right permissions.
Solutions:
1) Use signed / temp S3 urls for private S3 content. Whenever client wants to play some specific video it is
sending request to my API. If user has right permissions the API is generating signed url
and returning it back to client which is passing it to player.
The problem I see here is that the real video content is stored in dozens of segment files in the media-* directories
and I do not really see how I can protect all of them - would I need to sign each of the segment file urls separately?
2) S3 content is private. Video stream requests made by players are going through my API or separate reverse-proxy.
So whenever client decides to play specific video, API / reverse-proxy is getting the request, doing authentication & authorization
and passing the right content (master play list files & segments).
In this case I still need to make S3 content private and accessible only by my API / reverse-proxy. What should be the recommended way here?
S3 rest authentication via tokens?
3) Use encryption with protected key. In this case all of video segments are encrypted and publicly available. The key is also stored in S3
but is not publicly available. Every key request made by player is authenticated & authorized by my API / reverse-proxy.
These are 3 solutions I have in my mind right now. Not convinced on all of them. I am looking for something simple and bullet proof secure. Any recommendations / suggestions?
Used technology:
ffmpeg for video encoding to different bitrates
bento4 for video segmentation
would I need to sign each of the segment file urls separately?
If the player is requesting directly from S3, then yes. So that's probably not going to be the ideal approach.
One option is CloudFront in front of the bucket. CloudFront can be configured with an Origin Access Identity, which allows it to sign requests and send them to S3 so that it can fetch private S3 objects on behalf of an authorized user. CloudFront supports both signed URLs (using a different algorithm than S3, with two important differences that I will explain below) and signed cookies. Signed URLs and signed cookies in CloudFront work very similarly to each other, with the important difference being that a cookie can be set once, then automatically used by the browser for each subsequent request, avoiding the need to sign individual URLs. (Aha.)
For both signed URLs and signed cookies in CloudFront, you get two additional features not easily done with S3 if you use a custom policy:
The policy associated with a CloudFront signature can allow a wildcard in the path, so you could authorize access to any file in, say /media/Video1/* until the time the signature expires. S3 signed URLs do not support wildcards in any form -- an S3 URL can only be valid for a single object.
As long as the CloudFront distribution is configured for IPv4 only, you can tie a signature to a specific client IP address, allowing only access with that signature from a single IP address (CloudFront now supports IPv6 as an optional feature, but it isn't currently compatible with this option). This is a bit aggressive and probably not desirable with a mobile user base, which will switch source addresses as they switch from provider network to Wi-Fi and back.
Signed URLs must still all be generated for all of the content links, but you can generate and sign a URL only once and then reuse the signature, just string-rewriting the URL for each file making that option computationally less expensive... but still cumbersome. Signed cookies, on the other hand, should "just work" for any matching object.
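For reference, a custom policy with a wildcard resource looks roughly like what the following Python sketch builds. The function names, the distribution domain in the usage note, and the expiry time are placeholders. The RSA signing step is deliberately left out: you would sign the compact policy string with the private key of your CloudFront key pair and send the result alongside the encoded policy (as the CloudFront-Policy, CloudFront-Signature and CloudFront-Key-Pair-Id cookies).

```python
import base64
import json


def build_custom_policy(resource: str, expires_epoch: int, source_ip: str = "") -> str:
    """Build a CloudFront custom policy as compact JSON.

    resource may contain a wildcard, e.g. ".../media/Video1/*".
    source_ip is optional and only usable as described above (IPv4).
    """
    condition: dict = {"DateLessThan": {"AWS:EpochTime": expires_epoch}}
    if source_ip:
        condition["IpAddress"] = {"AWS:SourceIp": source_ip}
    policy = {"Statement": [{"Resource": resource, "Condition": condition}]}
    # Compact separators: the policy is serialised without whitespace.
    return json.dumps(policy, separators=(",", ":"))


def cloudfront_b64(data: bytes) -> str:
    """Base64-encode, then swap the characters that are unsafe in URLs/cookies."""
    encoded = base64.b64encode(data).decode("ascii")
    return encoded.replace("+", "-").replace("=", "_").replace("/", "~")
```

Usage would be something like cloudfront_b64(build_custom_policy("https://d111111abcdef8.cloudfront.net/media/Video1/*", expiry).encode()), where the domain is a made-up example and expiry is a Unix timestamp a few hours out.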
Of course, adding CloudFront should also improve performance through caching and Internet path shortening, since the request hops onto the managed AWS network closer to the browser than it typically will for requests direct to S3. When using CloudFront, requests from the browser are sent to whichever of 60+ global "edge locations" is assumed to be nearest the browser making the request. CloudFront can serve the same cached object to different users with different URLs or cookies, as long as the sigs or cookies are valid, of course.
To use CloudFront signed cookies, at least part of your application -- the part that sets the cookie -- needs to be "behind" the same CloudFront distribution that points to the bucket. This is done by declaring your application as an additional Origin for the distribution, and creating a Cache Behavior for a specific path pattern which, when requested, is forwarded by CloudFront to your application, which can then respond with the appropriate Set-Cookie: headers.
I am not affiliated with AWS, so don't mistake the following as a "pitch" -- just anticipating your next question: CloudFront + S3 is priced such that the cost difference compared to using S3 alone is usually negligible -- S3 doesn't charge you for bandwidth when objects are requested through CloudFront, and CloudFront's bandwidth charges are in some cases slightly lower than the charge for using S3 directly. While this seems counterintuitive, it makes sense that AWS would structure pricing in such a way as to encourage distribution of requests across its network rather than to focus them all against a single S3 region.
Note that no mechanism, either the one above or the one below is completely immune to unauthorized "sharing," since the authentication information is necessarily available to the browser, and thus to the user, depending on the user's expertise... but both approaches seem more than sufficient to keep honest users honest, which is all you can ever hope to do. Since signatures on signed URLs and cookies have expiration times, the duration of the share-ability is limited, and you can identify such patterns through CloudFront log analysis, and react accordingly. No matter what approach you take, don't forget the importance of staying on top of your logs.
The reverse proxy is also a good idea, probably easily implemented, and should perform quite acceptably with no additional data transport charges or throughput issues, if the EC2 machines running the proxy are in the same AWS region as the bucket, and the proxy is based on solid, efficient code like that found in Nginx or HAProxy.
You don't need to sign anything in this environment, because you can configure the bucket to allow the reverse proxy to access the private objects because it has a fixed IP address.
In the bucket policy, you do this by granting "anonymous" users the s3:getObject privilege, only if their source IPv4 address matches the IP address of one of the proxies. The proxy requests objects anonymously (no signing needed) from S3 on behalf of authorized users. This requires that you not be using an S3 VPC endpoint, but instead give the proxy an Elastic IP address or put it behind a NAT Gateway or NAT instance and have S3 trust the source IP of the NAT device. If you do use an S3 VPC endpoint, it should be possible to allow S3 to trust the request simply because it traversed the S3 VPC Endpoint, though I have not tested this. (S3 VPC Endpoints are optional; if you didn't explicitly configure one, then you don't have one, and probably don't need one).
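A bucket policy along those lines could be generated like this. It is a sketch only: the function name, the Sid, the bucket name and the proxy address in the test values are placeholders, and you should check the result against your own setup before applying it.

```python
import json
from typing import List


def proxy_only_bucket_policy(bucket: str, proxy_cidrs: List[str]) -> str:
    """Return a bucket policy allowing anonymous GETs only from the given IPs."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGetFromProxies",  # hypothetical label
                "Effect": "Allow",
                "Principal": "*",  # "anonymous" requests, no signing needed
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Only requests whose source address matches a proxy
                # (its Elastic IP, or the NAT device in front of it).
                "Condition": {"IpAddress": {"aws:SourceIp": proxy_cidrs}},
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

Everything else stays denied by default, so the player can never fetch segments from the bucket directly.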
Your third option seems weakest, if I understand it correctly. An authorized but malicious user gets the key and can share it all day long.
We attempted a passbook program but it never made it out of beta, but there are a few passes out there that keep phoning home (and throwing errors because the passes are out of sync with existing data). My plan is to 404 any incoming requests, but I'm not sure if that is the best way to handle existing passes. Any other ideas or is 404 the right solution?
There are a few options:
Return an updated pass that has a blank web service url
Return an appropriate error
Remove the DNS entry of the subdomain
Update the web service url
Any of the fields in the pass can be updated, including the web service url. Removing the url will prevent further requests for updates. This is potentially the most effective, but would require a bit of development to return the updated pass and would need to be maintained until all passes have been "disabled."
Return an appropriate error code
It may be easier to simply return an error code. This could be done through the web server configuration preventing the requests from being processed by your application (and presumably stop the errors in the application). This would allow you to remove the code altogether from your application.
The Passbook Web Service Reference indicates that Passbook will eventually give up when receiving persistent errors.
If a request fails—for example, due to a network connectivity issue—Passbook tries again several times after waiting a period of time. Each time it tries again, it waits longer. If the request continues to fail, it eventually gives up.
The documentation also indicates that standard HTTP status codes should be used in the response from the call to Getting the Latest Version of a Pass (and others).
Response
If request is authorized, return HTTP status 200 with a payload of the pass data.
If the request is not authorized, return HTTP status 401.
Otherwise, return the appropriate standard HTTP status.
Discussion
Support standard HTTP caching on this endpoint: check for the If-Modified-Since header and return HTTP status code 304 if the pass has not changed.
It sounds like the ending of the passbook program is permanent in which case 410 Gone would be an appropriate error code. (From RFC 2616).
410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
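If it is easier to do this in code than in the web server configuration, a catch-all handler is about as small as it gets. The sketch below assumes a Python WSGI stack, and gone_app is a made-up name; it answers every request with 410 regardless of path or method.

```python
from typing import Callable, Iterable


def gone_app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Answer every request with 410 Gone, whatever the path or method."""
    body = b"This pass service has been discontinued.\n"
    start_response(
        "410 Gone",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

For a quick local check it can be served with wsgiref.simple_server.make_server from the standard library; in production you would mount it wherever the old web service url pointed.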
Remove subdomain DNS
If your web service url was set up on a separate subdomain (e.g. passbook.example.com) you can simply remove the DNS entry for the subdomain. The requests will never reach the server and Passbook will eventually give up.
I am using CloudFront to front requests to our service hosted outside of amazon. The service is protected and we expect an "Authorization" header to be passed by the applications invoking our service.
We have tried invoking our service through CloudFront, but it looks like the header is getting dropped by CloudFront. Hence the service rejects the request and the client gets a 401 Unauthorized response.
For some static requests, which do not need authorization, we are not getting any error and are getting proper response from CloudFront.
I have gone through the CloudFront documentation and there is no specific information available on how headers are handled, so I was hoping that they would be passed as is, but it looks like that's not the case. Any guidance from you folks?
The list of the headers CF drops or modifies can be found here
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RequestAndResponseBehaviorCustomOrigin.html#RequestCustomRemovedHeaders
CloudFront does drop Authorization headers by default and will not pass it to the origin.
If you would like certain headers to be sent to the origin, you can setup a whitelist of headers under CloudFront->Behavior Settings->Forward headers. Just select the headers that you would like to be forwarded and CloudFront will do the job for you. I have tested it this way for one of our location based services and it works like a charm.
One thing that I need to verify is whether the Authorization header will be included in the cache key and whether it's safe to do that. That is something you might want to watch out for as well.
It makes sense that CF drops the Authorization header. Just imagine 2 users asking for the same object: the first one is granted access and CF caches the object, then the second user gets the object because it was previously cached by CloudFront.
The good news is that by using forward headers you can forward the Authorization header to the origin. That means the object will be cached more than once, as the header value is part of the cache "key".
For example, user A GETs private/index.html
Authorization: XXXXXXXXXXXXX
The object will be cached as private/index.html + XXXXXXXXXXXXX (this is the key used to cache the object in CF).
Now, when a new request from a different user arrives at CloudFront
GET private/index.html
Authorization: YYYYYYYYYYYY
The request will be passed to the origin, as the combination private/index.html + YYYYYYYYYYYY is not in the CF cache.
CF will then have cached 2 different objects with the same name (but a different combined hash key).
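The same idea as a toy Python model, in case it helps. This is not how CloudFront is implemented; HeaderKeyedCache is a hypothetical class that only shows what it means for the forwarded header value to be part of the cache key.

```python
from typing import Callable, Dict, Tuple


class HeaderKeyedCache:
    """Toy cache whose key is the path plus the forwarded header value."""

    def __init__(self, origin: Callable[[str, str], str]) -> None:
        self.origin = origin
        self.store: Dict[Tuple[str, str], str] = {}
        self.origin_requests = 0

    def get(self, path: str, authorization: str) -> str:
        key = (path, authorization)
        if key not in self.store:
            # Unknown path + header combination: go to the origin,
            # which performs the real authorization check.
            self.origin_requests += 1
            self.store[key] = self.origin(path, authorization)
        return self.store[key]
```

Two users with different Authorization values each cause one origin request for private/index.html, and a repeat request from either of them is served from the cache.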
In addition to specifying them under the Origin Behaviour section, you can also add custom headers to your origin configuration. In the AWS documentation for CloudFront custom headers:
If the header names and values that you specify are not already present in the viewer request, CloudFront adds them. If a header is present, CloudFront overwrites the header value before forwarding the request to the origin.
The benefit of this is that you can then use an All/wildcard setting for whitelisting your headers in the behaviour section.
It sounds like you are trying to serve up dynamic content from CloudFront (at least in the sense that the content is different for authenticated vs unauthenticated users) which is not really what it is designed to do.
CloudFront is a Content Distribution Network (CDN) for caching content at distributed edge servers so that the data is served near your clients rather than hitting your server each time.
You can configure CloudFront to cache pages for a short time if it changes regularly and there are some use cases where this is worthwhile (e.g. a high volume web site where you want to "micro cache" to reduce server load) but it doesn't sound like this is the way you are trying to use it.
In the case you describe:
The user will hit CloudFront for the page.
It won't be in the cache so CloudFront will try to pull a copy from the origin server.
The origin server will reply with a 401 so CloudFront will not cache it.
Even if this worked and headers were passed back and forth in some way, there is simply no point in using CloudFront if every page is going to hit your server anyway; you would just make the page slower because of the extra round trip to your server.