Our application data storage is backed by Google Cloud Storage (and S3 and Azure Blob Storage). We need to give arbitrary outside tools access to this storage (upload from local disk using CLI tools, unload from analytical databases such as Redshift, Snowflake and others). The specific use case is that users need to upload multiple big files (you can think of it much like m3u8 playlists for streaming video: an m3u8 playlist plus thousands of small video files). The tools and users may not be affiliated with Google in any way (they may not have a Google account). We also absolutely need the data transfer to go directly to the storage, bypassing our servers.
In S3 we use federation tokens to give access to a part of the S3 bucket.
So the model scenario on AWS S3 is:
1) The customer requests a data upload via our API.
2) We give the customer S3 credentials that are scoped to s3://customer/project/uploadId and allow uploading new files.
3) The client uses any tool to upload the data.
4) The client uploads s3://customer/project/uploadId/file.manifest, s3://customer/project/uploadId/file.00001, s3://customer/project/uploadId/file.00002, ...
5) Other data in the bucket (be it another uploadId or project) is safe because the given credentials are scoped.
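A minimal sketch of the scoping step in this scenario, assuming that the bucket is named customer and that the credentials are issued through STS GetFederationToken with an inline policy. The helper name build_upload_policy is hypothetical; the code only builds the policy document and does not call AWS.

```python
import json


def build_upload_policy(bucket: str, prefix: str) -> str:
    """Build an IAM policy document that only allows uploads under one prefix."""
    prefix = prefix.strip("/")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Upload only; no read, list or delete permissions are granted.
                "Action": ["s3:PutObject"],
                # The trailing /* limits the grant to keys under the prefix.
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }
    return json.dumps(policy, separators=(",", ":"))


# Assumed names taken from the example path s3://customer/project/uploadId.
policy_json = build_upload_policy("customer", "project/uploadId")
print(policy_json)
```

The resulting string would be passed as the Policy parameter when requesting the federation token (for example boto3's sts.get_federation_token(Name=..., Policy=policy_json, DurationSeconds=...)), so the temporary credentials cannot touch any other uploadId or project.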
In ABS we use SAS (shared access signature) tokens for the same purpose.
GCS does not seem to have anything similar, except for Signed URLs. Signed URLs have a problem though that they refer to a single file. That would either require us to know in advance how many files will be uploaded (we don't know) or the client would need to request each file's signed URL separately (strain on our API and also it's slow).
ACLs seemed to be a solution, but they are tied only to Google-related identities, and those can't be created quickly on demand. Service accounts are also an option, but their creation is slow and, IIUC, they are generally discouraged for this use case.
Is there a way to create short-lived credentials that are limited to a subset of a GCS bucket?
The ideal scenario would be that the service account we use in the app could generate a short-lived token that only has access to a subset of the bucket. But nothing like that seems to exist.
Unfortunately, no. For retrieving objects, signed URLs need to be for exact objects. You'd need to generate one per object.
Using the * wildcard specifies the subdirectory you are targeting and identifies all objects under it. For example, to access the objects in Folder1 in your bucket, you would use gs://Bucket/Folder1/*. However, the command gsutil signurl -d 120s key.json gs://bucketname/folderName/** creates a signed URL for each of the files under that prefix, not a single URL for the entire folder/subdirectory.
Reason: Since subdirectories are just an illusion of folders in a bucket and are actually object names that contain a ‘/’, every file in a subdirectory gets its own signed URL. There is no way to create a single signed URL for a specific subdirectory and make all of its files temporarily available.
There is an open feature request for this: https://issuetracker.google.com/112042863. Please add your use case there and watch it for further updates.
For now, one way to accomplish this would be to write a small App Engine app that users download from instead of going directly to GCS. The app would check authentication according to whatever mechanism you're using and, if the check passes, generate a signed URL for the requested resource and redirect the user to it.
Reference : https://stackoverflow.com/a/40428142/15803365
Related
I'm currently looking to host an app with an Angular frontend in an AWS S3 bucket, connecting to a PHP backend on AWS Elastic Beanstalk. I've got it set up and it's working nicely.
However, when using S3 to host a static website, anyone can view your code, including the various Angular JS files. This is mostly fine, but I want to create either a file or a folder for sensitive information that cannot be viewed by anyone but can be included/required by all other files. Essentially I want a key that I can attach to all calls to the backend to make sure only authorised requests get through.
I've experimented with various permissions but I always seem to be able to view all files, presumably because the static website hosting bucket policy ensures everything is public.
Any suggestions appreciated!
Cheers.
The whole idea of static website hosting on S3 is that the content is public. For example, during maintenance of your app/web you might redirect users to an S3 static page announcing that maintenance is ongoing.
I am not sure what exactly you have tried when you say you "experimented with various permissions". Have you tried setting up a bucket policy, or setting up the bucket as a CloudFront origin and using a signed URL? This might be a bit tricky considering you want these sensitive files to be called by other files. But in my opinion the way to hide those sensitive files will be either some sort of bucket policy or restricting access with some sort of signed URL.
Problem:
I am storing a number of HLS streams in S3 with the following file structure:
Video1
  ├──hls3
      ├──hlsv3-master.m3u8
      ├──media-1
      ├──media-2
      ├──media-3
      ├──media-4
      ├──media-5
  ├──hls4
      ├──hlsv4-master.m3u8
      ├──media-1
      ├──media-2
      ├──media-3
      ├──media-4
      ├──media-5
In my user API I know exactly which user has access to which video content,
but I also need to ensure that video links are not sharable and are only accessible
by users with the right permissions.
Solutions:
1) Use signed / temp S3 urls for private S3 content. Whenever a client wants to play a specific video, it
sends a request to my API. If the user has the right permissions, the API generates a signed url
and returns it to the client, which passes it to the player.
The problem I see here is that the real video content is stored in dozens of segment files in the media-* directories
and I do not really see how I can protect all of them - would I need to sign each of the segment file urls separately?
2) S3 content is private. Video stream requests made by players are going through my API or separate reverse-proxy.
So whenever client decides to play specific video, API / reverse-proxy is getting the request, doing authentication & authorization
and passing the right content (master play list files & segments).
In this case I still need to make S3 content private and accessible only by my API / reverse-proxy. What should be the recommended way here?
S3 rest authentication via tokens?
3) Use encryption with protected key. In this case all of video segments are encrypted and publicly available. The key is also stored in S3
but is not publicly available. Every key request made by player is authenticated & authorized by my API / reverse-proxy.
These are the 3 solutions I have in mind right now. I am not convinced by any of them. I am looking for something simple and bulletproof. Any recommendations / suggestions?
Used technology:
ffmpeg for video encoding to different bitrates
bento4 for video segmentation
would I need to sign each of the segment file urls separately?
If the player is requesting directly from S3, then yes. So that's probably not going to be the ideal approach.
One option is CloudFront in front of the bucket. CloudFront can be configured with an Origin Access Identity, which allows it to sign requests and send them to S3 so that it can fetch private S3 objects on behalf of an authorized user. CloudFront supports both signed URLs (using a different algorithm than S3, with two important differences that I will explain below) and signed cookies. Signed URLs and signed cookies in CloudFront work very similarly to each other, with the important difference being that a cookie can be set once and is then automatically sent by the browser with each subsequent request, avoiding the need to sign individual URLs. (Aha.)
For both signed URLs and signed cookies in CloudFront, you get two additional features not easily done with S3 if you use a custom policy:
The policy associated with a CloudFront signature can allow a wildcard in the path, so you could authorize access to any file in, say /media/Video1/* until the time the signature expires. S3 signed URLs do not support wildcards in any form -- an S3 URL can only be valid for a single object.
As long as the CloudFront distribution is configured for IPv4 only, you can tie a signature to a specific client IP address, allowing only access with that signature from a single IP address (CloudFront now supports IPv6 as an optional feature, but it isn't currently compatible with this option). This is a bit aggressive and probably not desirable with a mobile user base, which will switch source addresses as they switch from provider network to Wi-Fi and back.
Signed URLs must still all be generated for all of the content links, but you can generate and sign a URL only once and then reuse the signature, just string-rewriting the URL for each file making that option computationally less expensive... but still cumbersome. Signed cookies, on the other hand, should "just work" for any matching object.
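To make the custom policy described above concrete, here is a hedged sketch of building one with a wildcard path and an optional client IP restriction, plus the URL-safe base64 variant CloudFront expects for the policy value. The distribution domain, the expiry timestamp and the helper names are placeholders of mine. The RSA signing step is left out because it needs your CloudFront private key and a crypto library.

```python
import base64
import json


def build_custom_policy(resource, expires_epoch, source_ip=None):
    """Build a CloudFront custom policy as compact JSON (no extra whitespace)."""
    condition = {"DateLessThan": {"AWS:EpochTime": int(expires_epoch)}}
    if source_ip is not None:
        # Optional: tie the signature to one client address or CIDR range.
        condition["IpAddress"] = {"AWS:SourceIp": source_ip}
    policy = {"Statement": [{"Resource": resource, "Condition": condition}]}
    return json.dumps(policy, separators=(",", ":"))


def cloudfront_b64(data: bytes) -> str:
    """Base64-encode, then swap the characters that are unsafe in URLs and cookies."""
    encoded = base64.b64encode(data).decode("ascii")
    return encoded.replace("+", "-").replace("=", "_").replace("/", "~")


# Placeholder domain and an arbitrary example expiry (seconds since the epoch).
policy = build_custom_policy(
    "https://d111111abcdef8.cloudfront.net/media/Video1/*", 1767225600
)
policy_cookie_value = cloudfront_b64(policy.encode("utf-8"))
print(policy)
print(policy_cookie_value)
```

With signed cookies, this value would go into the CloudFront-Policy cookie, next to CloudFront-Signature (the signature of the policy, encoded the same way) and CloudFront-Key-Pair-Id. One policy then covers every segment under /media/Video1/.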
Of course, adding CloudFront should also improve performance through caching and Internet path shortening, since the request hops onto the managed AWS network closer to the browser than it typically will for requests direct to S3. When using CloudFront, requests from the browser are sent to whichever of 60+ global "edge locations" is assumed to be nearest the browser making the request. CloudFront can serve the same cached object to different users with different URLs or cookies, as long as the sigs or cookies are valid, of course.
To use CloudFront signed cookies, at least part of your application -- the part that sets the cookie -- needs to be "behind" the same CloudFront distribution that points to the bucket. This is done by declaring your application as an additional Origin for the distribution, and creating a Cache Behavior for a specific path pattern which, when requested, is forwarded by CloudFront to your application, which can then respond with the appropriate Set-Cookie: headers.
I am not affiliated with AWS, so don't mistake the following as a "pitch" -- just anticipating your next question: CloudFront + S3 is priced such that the cost difference compared to using S3 alone is usually negligible -- S3 doesn't charge you for bandwidth when objects are requested through CloudFront, and CloudFront's bandwidth charges are in some cases slightly lower than the charge for using S3 directly. While this seems counterintuitive, it makes sense that AWS would structure pricing in such a way as to encourage distribution of requests across its network rather than to focus them all against a single S3 region.
Note that no mechanism, either the one above or the one below is completely immune to unauthorized "sharing," since the authentication information is necessarily available to the browser, and thus to the user, depending on the user's expertise... but both approaches seem more than sufficient to keep honest users honest, which is all you can ever hope to do. Since signatures on signed URLs and cookies have expiration times, the duration of the share-ability is limited, and you can identify such patterns through CloudFront log analysis, and react accordingly. No matter what approach you take, don't forget the importance of staying on top of your logs.
The reverse proxy is also a good idea, probably easily implemented, and should perform quite acceptably with no additional data transport charges or throughput issues, if the EC2 machines running the proxy are in the same AWS region as the bucket, and the proxy is based on solid, efficient code like that found in Nginx or HAProxy.
You don't need to sign anything in this environment, because you can configure the bucket to allow the reverse proxy to access the private objects because it has a fixed IP address.
In the bucket policy, you do this by granting "anonymous" users the s3:GetObject privilege, only if their source IPv4 address matches the IP address of one of the proxies. The proxy requests objects anonymously (no signing needed) from S3 on behalf of authorized users. This requires that you not be using an S3 VPC endpoint, but instead give the proxy an Elastic IP address or put it behind a NAT Gateway or NAT instance and have S3 trust the source IP of the NAT device. If you do use an S3 VPC endpoint, it should be possible to allow S3 to trust the request simply because it traversed the S3 VPC Endpoint, though I have not tested this. (S3 VPC Endpoints are optional; if you didn't explicitly configure one, then you don't have one, and probably don't need one).
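A bucket policy along these lines might look like the following sketch. The bucket name, the statement ID and the proxy address are made-up placeholders, and the helper only builds the JSON document; you would still attach it to the bucket yourself.

```python
import json


def build_proxy_bucket_policy(bucket, proxy_ips):
    """Allow anonymous GetObject only when the request comes from a proxy IP."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadFromProxies",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": "*",  # "anonymous": no signed request required
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # The grant applies only if the source address matches.
                "Condition": {"IpAddress": {"aws:SourceIp": list(proxy_ips)}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Placeholder bucket name and a documentation-range Elastic IP.
print(build_proxy_bucket_policy("my-video-bucket", ["203.0.113.10/32"]))
```

Everything else in the bucket stays private because no other statement grants access; only requests whose source address is one of the listed proxy IPs are allowed to read objects.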
Your third option seems weakest, if I understand it correctly. An authorized but malicious user gets the key and can share it all day long.
I'm building a web application and am looking into using Amazon S3 to store user uploads.
My concern is that I don't want user A to see that the download link for a document he uploaded is urltoMyS3/doc1234.pdf, then try urltoMyS3/doc1235.pdf and get another user's document.
The only way I can think of to do this, is to only allow the web application to connect to S3, then check if the user has access to a file on the web application, have the web app download the file, and then serve it to the client. The problem with this method is the application would have to download the file first and would inevitably slow the download process down for the user.
How are user files typically handled with Amazon S3? Or is it simply not typically used in a scenario where the files should not be public? Is there another service for something like this?
Thanks
You can implement Query String Authentication, which will solve your problem.
Query string authentication is useful for giving HTTP or browser
access to resources that would normally require authentication. The
signature in the query string secures the request. Query string
authentication requests require an expiration date. You can specify
any future expiration time in epoch or UNIX time (number of seconds
since January 1, 1970).
You can do this by generating the appropriate links, see the following
https://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html#RESTAuthenticationQueryStringAuth
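For illustration, here is a rough sketch of the legacy (Signature Version 2) query string authentication scheme that the linked page describes, using only the Python standard library. The bucket, key and credentials are placeholders and the function name is mine. In practice an SDK's presign helper is the safer choice, and newer regions require Signature Version 4, which this sketch does not implement.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def presign_get_v2(bucket, key, access_key_id, secret_key, expires_epoch):
    """Build a time-limited GET URL with the legacy S3 query string scheme."""
    expires = int(expires_epoch)
    resource = f"/{bucket}/{quote(key)}"
    # StringToSign: verb, Content-MD5, Content-Type, Expires, resource.
    # Content-MD5 and Content-Type are empty for a plain GET.
    string_to_sign = f"GET\n\n\n{expires}\n{resource}"
    digest = hmac.new(
        secret_key.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha1
    ).digest()
    # The base64 signature must itself be URL-encoded ("+", "/" and "=").
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return (
        f"https://{bucket}.s3.amazonaws.com/{quote(key)}"
        f"?AWSAccessKeyId={access_key_id}&Expires={expires}&Signature={signature}"
    )


# Placeholder credentials; Expires is an absolute UNIX timestamp.
print(presign_get_v2("johnsmith", "photos/puppy.jpg",
                     "AKIAIOSFODNN7EXAMPLE", "not-a-real-secret", 1175139620))
```

Because the object key is part of the signed string, a user who changes doc1234.pdf to doc1235.pdf in such a URL invalidates the signature and the request is rejected.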
If time-bound authentication (as suggested in other answers) will not work for you, you could consider something like s3fs to mount your S3 bucket as a drive on your web application server. That way you can perform your authentication and then serve the file directly to the user, without them having any idea that the file resides in S3. Similarly, you can write uploaded files directly to this s3fs mount.
s3fs also allows you to configure a local cache of the S3 directory on your machine for faster access.
This works nicely in a clustered web server environment as well, since each server can mount the s3fs drive and perform reads/writes on it independently.
A link with more info
I am attempting to use an S3 bucket as a deployment location for an internal, auto-updating application's files. It would be the location where the new version's files are dumped for the application to pick up on an update. Since this is an internal application, I was hoping to keep the bucket private but still be able to access it using only a URL. I was hoping to look into using third-party auto-updating software, which means I can't use the Amazon API to access it.
Does anyone know a way to get a URL to a private bucket on S3?
You probably want to use one of the available AWS Software Development Kits (SDKs), which all implement the respective methods to generate these URLs by means of the GetPreSignedURL() method (e.g. Java: generatePresignedUrl(), C#: GetPreSignedURL()):
The GetPreSignedURL operations creates a signed http request. Query
string authentication is useful for giving HTTP or browser access to
resources that would normally require authentication. When using query
string authentication, you create a query, specify an expiration time
for the query, sign it with your signature, place the data in an HTTP
request, and distribute the request to a user or embed the request in
a web page. A PreSigned URL can be generated for GET, PUT and HEAD
operations on your bucket, keys, and versions.
There are a couple of related questions already and e.g. Why is my S3 pre-signed request invalid when I set a response header override that contains a “+”? contains a working sample in C# (aside from the content type issue Ragesh is experiencing of course).
Good luck!
I'm thinking about whether to host uploaded media files (video and audio) on S3 instead of locally. I need to check user's permissions on each download.
So there would be an action like get_file, which first checks the user's permissions and then gets the file from S3 and sends it using send_file to the user.
def get_file
  if @user.can_download(params[:file_id])
    # first, download the file from S3 and then send it to the user using send_file
  end
end
But in this case, the server (unnecessarily) downloads the file first from S3 and then sends it to the user. I thought the use case for S3 was to bypass the Rails/HTTP server stack for reduced load.
Am I thinking this wrong?
PS. I'm using CarrierWave for file uploads. Not sure if that's relevant.
Amazon S3 provides something called RESTful authenticated reads, which are basically timeoutable URLs to otherwise protected content.
CarrierWave provides support for this. Simply declare S3 access policy to authenticated read:
config.s3_access_policy = :authenticated_read
and then model.file.url will automatically generate the RESTful URL.
Typically you'd embed the S3 URL in your page, so that the client's browser fetches the file directly from Amazon. Note however that this exposes the raw unprotected URL. You could name the file with a long hash instead of something predictable, so it's at least not guessable -- but once that URL is exposed, it's essentially open to the Internet. So if you absolutely always need access control on the files, then you'll need to proxy it like you're currently doing. In that case, you may decide it's just better to store the file locally.