Downloading a public bucket from S3 without S3 tools (s3cmd)

I'd like to download a complete repository from S3. I know the bucket is reachable at https://s3.amazonaws.com/big-data-benchmark/pavlo
I'd like everything under /pavlo/sequence-snappy/5nodes
How should one download this with the least amount of manual effort, using readily available tools like wget? (The S3 tools require an actual S3 account, which I do not have and do not want.)

Though a bit of manual effort is needed, this is how it can be done:
Go to the bucket HTTP URL and append ?marker=pavlo/sequence-snappy/5nodes, resulting in https://s3.amazonaws.com/big-data-benchmark/?marker=pavlo/sequence-snappy/5nodes
Now, manually binary-search to find how large the dataset is. Fortunately, the listing for this particular bucket is predictable, and it appears to contain 100 items ranging from 000000_0 to 000099_0.
Use the following shell one-liner:
for i in {0000..0099}; do echo https://s3.amazonaws.com/big-data-benchmark/pavlo/sequence-snappy/5nodes/rankings/00${i}_0; done | xargs -n1 -P8 wget
Preferably we would like a more general solution which would also work for unpredictable filenames.

I think you will find that the S3 tools do not require an account for anonymous access to public buckets. (Nor do I understand why anyone wouldn't want a free account, but I digress.)
But here is a solution that works when the keys (paths/filenames) are not known or predictable:
If a bucket is truly public, as this one is, you'll find a paginated XML listing of all the keys at the root of the bucket.
curl -v https://s3.amazonaws.com/big-data-benchmark/, for example.
Each <Key> contains the path to an object. This is the List Objects V1 API, so you add ?marker= and the value of the last key in the listing, on the next request, to resume the listing, repeating the process until <IsTruncated> is no longer true.
Use this to build a list to pass to curl, wget, or your http client of choice, by appending the key to the bucket URL. S3 can handle many, many parallel requests, so you may want to parallelize the process.
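For illustration, here is a rough shell sketch of that pagination loop (not production-hardened; it assumes curl, grep, sed, and wget are available and uses the bucket and prefix from the question):

BUCKET_URL="https://s3.amazonaws.com/big-data-benchmark"
PREFIX="pavlo/sequence-snappy/5nodes"
marker=""
while :; do
  # Fetch one page of the List Objects V1 listing, resuming from the last key seen
  page=$(curl -s "${BUCKET_URL}/?prefix=${PREFIX}&marker=${marker}")
  # Pull every <Key> value out of the XML
  keys=$(echo "$page" | grep -o '<Key>[^<]*</Key>' | sed 's/<[^>]*>//g')
  [ -z "$keys" ] && break
  # Download each object by appending its key to the bucket URL, 8 in parallel
  echo "$keys" | sed "s|^|${BUCKET_URL}/|" | xargs -n1 -P8 wget -q -x
  marker=$(echo "$keys" | tail -n1)
  # Stop once S3 reports the listing is complete
  echo "$page" | grep -q '<IsTruncated>true</IsTruncated>' || break
done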

Related

hiding s3 path in aws cloudfront url

I am trying to make sure I did not miss anything in the AWS CloudFront documentation or anywhere else ...
I have a (not public) S3 bucket configured as the origin in a CloudFront web distribution (I don't think it matters, but I am using signed URLs).
Let's say I have a file at an S3 path like
/someRandomString/someCustomerName/someProductName/somevideo.mp4
So, perhaps the url generated by CloudFront would be something like:
https://my.domain.com/someRandomString/someCustomerName/someProductName/somevideo.mp4?Expires=1512062975&Signature=unqsignature&Key-Pair-Id=keyid
Is there a way to obfuscate the path to the actual file in the generated URL? All 3 parts before the filename can change, so I prefer not to use "Origin Path" in the Origin Settings to hide the beginning of the path. With that approach, I would have to create a lot of origins mapped to the same bucket but with different paths. If that's the only way, then the limit of 25 origins per distribution would be a problem.
Ideally, I would like to get something like
https://my.domain.com/someRandomObfuscatedPath/somevideo.mp4?Expires=1512062975&Signature=unqsignature&Key-Pair-Id=keyid
Note: I am also using my own domain/CNAME.
Thanks
Cris
One way could be to use a Lambda function that receives the S3 file's path, copies the file into an obfuscated directory (perhaps keeping a simple mapping from source path to obfuscated path), and then returns the signed URL of the copied file. This ensures that only the obfuscated path is visible externally.
Of course, this will (potentially) double the data storage, so you need some way to clean up the obfuscated folders. That could be done in a time-based manner: if each signed URL is expected to expire after 24 hours, you could create folders based on date, and each of the obfuscated directories could be deleted every other day.
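As a rough illustration of that copy-then-sign idea (not a full solution), something like the following could run inside the function, expressed here with the AWS CLI; the bucket, domain, key-pair ID, and key file are made-up placeholders:

SRC="s3://my-bucket/someRandomString/someCustomerName/someProductName/somevideo.mp4"
# Date-based folder so obfuscated copies can be cleaned up later, plus a random segment
OBFUSCATED_KEY="$(date +%Y-%m-%d)/$(uuidgen)/somevideo.mp4"
aws s3 cp "$SRC" "s3://my-bucket/${OBFUSCATED_KEY}"
# Sign the CloudFront URL for the copied object, valid for 24 hours (GNU date syntax)
aws cloudfront sign \
  --url "https://my.domain.com/${OBFUSCATED_KEY}" \
  --key-pair-id PLACEHOLDER_KEY_PAIR_ID \
  --private-key file://cloudfront-private-key.pem \
  --date-less-than "$(date -u -d '+24 hours' +%Y-%m-%dT%H:%M:%SZ)"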
Alternatively, you could use a service like tinyurl.com or something similar to create a mapping. It would be much easier, save on storage, etc. The only downside would be that it would not reflect your domain name.
If you have the ability to modify the routing of your domain then this is a non-issue, but I presume that's not an option.
Obfuscation is not a form of security.
If you wish to control which objects users can access, you should use Pre-Signed URLs or Cookies. This way, you can grant access to private objects via S3 or CloudFront and not worry about people obtaining access to other objects.
See: Serving Private Content through CloudFront
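For example (illustrative only; the bucket and key are placeholders), an S3 pre-signed URL can be generated with the AWS CLI, and CloudFront has an analogous aws cloudfront sign command:

# Grant temporary access to one private object for one hour
aws s3 presign s3://my-private-bucket/someProductName/somevideo.mp4 --expires-in 3600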

S3 Bucket Types

Just wondering if there is a recommended strategy for storing different types of assets/files in separate S3 buckets or just put them all in one bucket? The different types of assets that I have include: static site images, user's profile images, user-generated content like documents, files, and videos.
As far as how to group files into buckets goes, that is really not that critical an issue unless you want to have different domain names or CNAMEs for different types of content, in which case you would need a separate bucket for each domain name you want to use.
I would tend to group them by functionality. For example, static files used in your application that you have full control over might be deployed to a separate bucket from user-generated content. Or you might want to keep video in a different bucket than images, etc.
To add to my earlier comments about S3 metadata: it is going to be a critical part of optimizing how you serve up content from S3/CloudFront.
Basically, S3 metadata consists of key-value pairs. For example, you could have Content-Type as a key with a value of image/jpeg if the file is a .jpg. This will automatically send the appropriate Content-Type header, corresponding to your values, for requests made directly to the S3 URL or via CloudFront. The same is true of Cache-Control metadata. You can also use your own custom metadata keys. For example, I use a custom key named x-amz-meta-md5 to store an MD5 hash of the file. It is used for simple bucket comparisons against content stored in a revision control system, so we don't have to compute a checksum of each file in the bucket on the fly. We use this for pushing differential content updates to the buckets (i.e. only push those files that have changed).
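For illustration only (the bucket and file names are placeholders), this is roughly how those headers and a custom md5 value could be set when uploading with the AWS CLI; --metadata keys are stored with the x-amz-meta- prefix:

# Compute the hash once so it can be compared against revision control later
MD5=$(md5sum bigimage.jpg | awk '{print $1}')
aws s3 cp bigimage.jpg s3://my-assets-bucket/images/bigimage.jpg \
  --content-type image/jpeg \
  --cache-control "max-age=31536000, public" \
  --metadata md5="$MD5"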
As far as revision control goes, I would HIGHLY recommend using versioned file names. In other words, say you have bigimage.jpg and you want to make an update: call it bigimage1.jpg and change your code to reflect this. Why? Because optimally you would like to set long expiration time frames in your Cache-Control headers. Unfortunately, if you then want to deploy a file with the same name and you are using CloudFront, it becomes problematic to invalidate the edge cache locations. Whereas if you use a new file name, CloudFront will just begin to populate the edge nodes and you don't have to worry about invalidating the cache at all.
Similarly for user-produced content, you might want to include an md5 or some other (mostly) unique identifier scheme, so that each video/image can have its own unique filename and place in the cache.
For your reference, here is a link to the AWS documentation on setting up streaming in CloudFront:
http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/CreatingStreamingDistributions.html

Searching Inside an Amazon S3 Bucket

If I have a bucket with hundreds of thousands of images, is it OK to have to search for each image I want to display on my site via its ID, or is there a more efficient way (including having multiple folders in a bucket, maybe)?
I was also thinking of giving each image a unique hash or something similar in order to stop duplicated names in the bucket. Does that seem like a good idea?
You just link to each image using normal URLs. For public files, the URLs are in the format:
http://mybucket.s3.amazonaws.com/myimage.jpg
For private files, you need to generate a signed URL (which is easy using any of the SDKs) in the format:
http://mybucket.s3.amazonaws.com/myimage.jpg?AWSAccessKeyId=44CF9SAMPLEF252F707&Expires=1177363698&Signature=vjSAMPLENmGa%2ByT272YEAiv4%3D
There's nothing wrong with storing each file with a unique name. If you set the correct headers on the file, any downloads can still use the original name, e.g. Content-Disposition: attachment; filename=myimage.jpg;
For listing a bucket's contents, you would use the API's GetBucket (list objects) operation. I find it easier to use the SDKs for any access via the API.
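As a small illustration (the bucket and prefix are placeholders), the same listing can be done from the AWS CLI, which wraps that API call:

# List up to 100 keys under a prefix, showing just key and size
aws s3api list-objects --bucket mybucket --prefix images/ --max-items 100 \
  --query 'Contents[].{Key: Key, Size: Size}'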
It can be a pain to search or do things in parallel over bucket objects, as Amazon lists everything lexicographically (the only ordering currently supported). The problem with using random IDs is that all of it would be written to the same block storage and you cannot search in parallel to optimize.
Here is an interesting article on performance improvements. I use it in my work and see a significant difference under high load.
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html

How to create multiple keys for the same object in Amazon S3 OR CLOUDFRONT?

I am syndicating out my multimedia content (mp4 and images) to several clients. So I create one S3 object for every mp4, say "my_content_that_pays_my_bills.mp4", and let the clients access the S3 URL for the object and embed it wherever they want.
What I want is for client A to access this MP4 as "A_my_content_that_pays_my_bills.mp4"
and Client B to access this as "B_my_content_that_pays_my_bills.mp4" and so on.
I want to bill the clients by usage: so I could process access logs and count access to "B_my_content_that_pays_my_bills.mp4" and bill client B for usage.
I know that S3 allows only one key per object. So how do I get around this?
I don't know that you can alias file names in the way you'd like. Here are a couple of hacks I can think of for public files embedded freely by a customer:
1) Create one Cloudfront distribution per client, each pointing at the same bucket. Each AWS account can have 100 distributions, so you could support only that many clients. Or,
2) Duplicate the files, using the client-specific names that you'd like. This is simpler, but your file storage costs scale with your number of clients (which may or may not be significant).
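A quick sketch of option 2 with the AWS CLI (the bucket and client names are placeholders); each client then gets a distinct key that shows up separately in the access logs:

for client in A B C; do
  aws s3 cp s3://my-media-bucket/my_content_that_pays_my_bills.mp4 \
            "s3://my-media-bucket/${client}_my_content_that_pays_my_bills.mp4"
done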

How can I update files on Amazon's CDN (CloudFront)?

Is there any way to update files stored on Amazon CloudFront (Amazon's CDN service)?
It seems it won't pick up any update we make to a file (e.g. removing the file and storing a new one with the same file name as before).
Do I have to explicitly trigger an update process to remove the files from the edge servers to get the new file contents published?
Thanks for your help
Here is how I do it using the CloudFront control panel.
Select CloudFront from the list of services.
Make sure Distributions from the top left is selected.
Next click the link for the associated distribution from the list (under id).
Select the Invalidations tab.
Click the Create Invalidation button and enter the location of the files you want to be invalidated (updated).
Then click the Invalidate button and you should now see InProgress under status.
It usually takes 10 to 15 minutes to complete your invalidation
request, depending on the size of your request.
Once it says completed you are good to go.
Tip:
Once you have created a few invalidations, if you come back and need to invalidate the same files, use the select box and the Copy link will become available, making it even quicker.
Amazon has added an invalidation feature. Here is the API reference.
Sample Request from the API Reference:
POST /2010-08-01/distribution/[distribution ID]/invalidation HTTP/1.0
Host: cloudfront.amazonaws.com
Authorization: [AWS authentication string]
Content-Type: text/xml
<InvalidationBatch>
   <Path>/image1.jpg</Path>
   <Path>/image2.jpg</Path>
   <Path>/videos/movie.flv</Path>
   <CallerReference>my-batch</CallerReference>
</InvalidationBatch>
Set TTL=1 hour and replace the file. See:
http://developer.amazonwebservices.com/connect/ann.jspa?annID=655
Download Cloudberry Explorer freeware version to do this on single files:
http://blog.cloudberrylab.com/2010/08/how-to-manage-cloudfront-object.html
Cyberduck for Mac & Windows provides a user interface for object invalidation. Refer to http://trac.cyberduck.ch/wiki/help/en/howto/cloudfront.
I seem to remember seeing this on serverfault already, but here's the answer:
By "Amazon CDN" I assume you mean "CloudFront"?
It's cached, so if you need it to be updated right now (as opposed to "new version will be visible in 24hours") you'll have to choose a new name. Instead of "logo.png", use "logo.png--0", and then update it using "logo.png--1", and change your html to point to that.
There is no way to "flush" amazon cloudfront.
Edit: This was not possible at the time, but it is now. See the comments on this reply.
CloudFront's user interface offers this under the [i] button > "Distribution Settings", tab "Invalidations": https://console.aws.amazon.com/cloudfront/home#distribution-settings
In Ruby, using the fog gem:
require 'fog'

AWS_ACCESS_KEY = ENV['AWS_ACCESS_KEY_ID']
AWS_SECRET_KEY = ENV['AWS_SECRET_ACCESS_KEY']
AWS_DISTRIBUTION_ID = ENV['AWS_DISTRIBUTION_ID']

conn = Fog::CDN.new(
  :provider              => 'AWS',
  :aws_access_key_id     => AWS_ACCESS_KEY,
  :aws_secret_access_key => AWS_SECRET_KEY
)

# Paths of the objects to invalidate in the distribution
images = ['/path/to/image1.jpg', '/path/to/another/image2.jpg']
conn.post_invalidation AWS_DISTRIBUTION_ID, images
Even on invalidation, it still takes 5-10 minutes for the invalidation to process and refresh on all Amazon edge servers.
CrossFTP for Win, Mac, and Linux provides a user interface for CloudFront invalidation, check this for more details: http://crossftp.blogspot.com/2013/07/cloudfront-invalidation-with-crossftp.html
I am going to summarize possible solutions.
Case 1: One-time update: Use Console UI.
You can manually go through the console's UI as per #CoalaWeb's answer and initiate an "invalidation" on CloudFront that usually takes less than one minute to finish. It's a single click.
Additionally, you can manually update the path it points to in S3 there in the UI.
Case 2: Frequent update, on the Same path in S3: Use AWS CLI.
You can use the AWS CLI to run the same invalidation from the command line.
The command is:
aws cloudfront create-invalidation --distribution-id E1234567890 --paths "/*"
Replace the E1234567890 part with the DistributionId that you can see in the console. You can also limit this to certain files instead of /* for everything.
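For example (the paths and distribution ID are placeholders), invalidating only specific files looks like this:

aws cloudfront create-invalidation --distribution-id E1234567890 \
  --paths "/images/logo.png" "/css/site.css"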
An example of how to put it in package.json for a Node/JavaScript project as a target can be found in this answer. (different question)
Notes:
I believe the first 1000 invalidations per month are free right now (April 2021).
The user that performs AWS CLI invalidation should have CreateInvalidation access in IAM. (Example in the case below.)
Case 3: Frequent update, the Path on S3 Changes every time: Use a Manual Script.
If you are storing different versions of your files in S3 (i.e. the path contains the version-id of the files/artifacts) and you need to change that in CloudFront every time, you need to write a script to perform that.
Unfortunately, AWS CLI for CloudFront doesn't allow you to easily update the path with one command. You need to have a detailed script. I wrote one, which is available with details in this answer. (different question)
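As a rough sketch only (this is not the author's script; the distribution ID, new path, and use of jq are assumptions), such an update could look something like this with the AWS CLI:

DIST_ID=E1234567890
NEW_PATH="/artifacts/v42"
# Fetch the current distribution config together with its ETag
aws cloudfront get-distribution-config --id "$DIST_ID" > dist.json
ETAG=$(jq -r '.ETag' dist.json)
# Change the first origin's path and strip the wrapper object
jq --arg p "$NEW_PATH" '.DistributionConfig | .Origins.Items[0].OriginPath = $p' dist.json > config.json
# Push the modified config back; --if-match must carry the ETag from the read
aws cloudfront update-distribution --id "$DIST_ID" --if-match "$ETAG" \
  --distribution-config file://config.json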