Managing files on Amazon S3 - amazon-s3

I have a git repository that stores audio files.
Obviously, it's not the best usage of git, and the repo has become quite large.
As an alternative, I would like to be able to manipulate these audio files at the command line, "committing" when some work is done.
Is this kind of workflow possible when manipulating Amazon S3 files at the command line?
Or do you simply copy files to S3, for example with scp?

There are some rsync-style tools for S3 that may work for you; here is one that I have not tried: http://www.s3rsync.com/
How important are the older versions of the audio? Amazon S3 buckets can have 'versioning' turned on, and you get full versioning support. You pay the full storage price for each version. I don't know whether you have 10 GB or 10 TB to store, or what your budget is. The Amazon versioning is nice, but not many tools fully support it.

To manipulate S3 files you first have to download them and then upload them again when you are done; this is relatively simple to do.
However, if the number of files you have is truly large, the slow transfer rate and bandwidth charges will kill you. If you don't have that many files, DropBox is built on top of S3 and has syncing and rudimentary version control, and bandwidth is not charged.
I feel that using a good networked storage system and git on your LAN is still the better idea.
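The download, edit, upload cycle described in these answers can be made to feel a little like a commit by uploading only the files whose content changed since the last run. What follows is a minimal Python sketch of that idea, not a feature of any S3 tool: the manifest file name, the function names, and the upload callable are all hypothetical, and upload stands in for whatever client actually performs the transfer.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List

MANIFEST_NAME = ".s3manifest.json"  # hypothetical name, chosen for this sketch


def file_digest(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit(root: Path, upload: Callable[[Path, str], None]) -> List[str]:
    """Upload the files under root whose content changed since the last call.

    upload(local_path, key) is a placeholder for the real transfer.
    Returns the list of keys that were uploaded.
    """
    manifest_path = root / MANIFEST_NAME
    previous: Dict[str, str] = {}
    if manifest_path.exists():
        previous = json.loads(manifest_path.read_text())

    current: Dict[str, str] = {}
    changed: List[str] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name == MANIFEST_NAME:
            continue
        key = path.relative_to(root).as_posix()
        current[key] = file_digest(path)
        if previous.get(key) != current[key]:
            upload(path, key)
            changed.append(key)

    # Record the new state only after every upload has returned.
    manifest_path.write_text(json.dumps(current, indent=2))
    return changed
```

This sketch does not handle deleted files or keep history; if older versions matter, bucket versioning would have to provide that on the S3 side.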

Related

When to use s3cmd over accessing the S3 API programmatically?

I've been having difficulty understanding when to use the s3cmd program rather than the Java API. A vendor has documentation on accessing S3 with s3cmd. It is unclear to me, as the bucket names appear to be dynamic. No region is specified. Additionally, I'm reaching out over an endpoint. I've tried writing some Java code to interact with S3 the same way that s3cmd does, but I haven't been able to connect. Overall, it appears to be quite a bit different.
To me s3cmd seems to be a utility to manipulate these files or quickly get at them. Integrating this utility into a Java program seems meaningless.
Does anyone have any resources, or can anyone help me understand this better?
S3cmd (s3cmd) is a free command line tool and client for uploading, retrieving and managing data in Amazon S3 and other cloud storage service providers that use the S3 protocol, such as Google Cloud Storage or DreamHost DreamObjects. It is best suited for power users who are familiar with command line programs. It is also ideal for batch scripts and automated backup to S3, triggered from cron, etc.
S3cmd is written in Python. It's an open source project available under GNU Public License v2 (GPLv2) and is free for both commercial and private use. You will only have to pay Amazon for using their storage.
Lots of features and options have been added to S3cmd since its very first release in 2008. We recently counted more than 60 command line options, including multipart uploads, encryption, incremental backup, s3 sync, ACL and Metadata management, S3 bucket size, bucket policies, and more!

Comparison of uploading files to GCS using Google Drive vs gsutil

I have been comparing how to upload files to a cloud storage, one is in-browser (or emulating a browser) and the other is command-line via gsutil to a Google Cloud Storage bucket.
Does Google Drive use gsutil in the backend, or is the uploader a totally customized and proprietary piece of software? Is there a way to achieve upload speeds to a Google Cloud Storage bucket similar to the upload speeds I'm able to achieve via Drive? If not, what would you suggest to get upload speeds to a GCS bucket equivalent to those of Google Drive?
I'm not sure whether GDrive uses gsutil in the background.
There are several optimizations that you can use to improve gsutil speeds.
First of all, you can use perfdiag to run a small set of diagnostic tests that will give you an overview of the speeds you can achieve.
gsutil perfdiag -o test.json gs://<your bucket name>
Secondly, you will need to understand your workload (small or big files) and decide whether you need a regional or multi-regional bucket (yes, there is a performance difference). tl;dr:
"Regional buckets are great for data processing since their physical distance is fairly tight, and the overhead of write consistency is low."
"Multiregional Storage, on the other hand, guarantees 2 replicates which are geo diverse (100 miles apart) which can get better remote latency and availability."
There is some information in Cloud Atlas specifically on this topic, which you can check out here:
https://medium.com/google-cloud/google-cloud-storage-what-bucket-class-for-the-best-performance-5c847ac8f9f2
https://medium.com/google-cloud/google-cloud-storage-large-object-upload-speeds-7339751eaa24?source=user_profile---------12----------------
https://medium.com/@duhroach/optimizing-google-cloud-storage-small-file-upload-performance-ad26530201dc
https://medium.com/@duhroach/google-cloud-storage-performance-4cfcec8bad72
https://cloud.google.com/storage/docs/best-practices

S3: Service that replaces access to the local file system with S3

I have an application that heavily uses the local file system. We need to port the application to use S3. What services are out there that will automate access to S3 without having to change the source code of the application?
These services somehow mask the S3 FS as a local FS.
Thanks.
See FuseOverAmazon (or s3fs) but keep in mind that S3 is an eventual consistency data store and your app should be architected to take that into account. It's also important to note that trying to mount an S3 bucket as a file system has very poor performance.
Take a look at RioFS. Our project is an alternative to the “s3fs” project; its main advantages compared to “s3fs” are simplicity, speed of operations, and bug-free code. Currently the project is in the “beta” state, but it's been running on several high-loaded fileservers for quite some time.
We are looking for more people to join our project and help with the testing. From our side, we offer quick bug fixes and will listen to your requests for new features.
Hope it helps!

Allowing users to download files as a batch from AWS s3 or Cloudfront

I have a website that allows users to search for music tracks and download those they select as mp3s.
I have the site on my server and all of the mp3s on s3 and then distributed via cloudfront. So far so good.
The client now wishes for users to be able to select a number of music tracks and then download them all in bulk or as a batch instead of one at a time.
Usually I would place all the files in a zip and then present the user with a link to that new zip file to download. In this case, as the files are on s3, that would require that I first copy all the files from s3 to my webserver, process them into a zip, and then serve the download from my server.
Is there any way I can create a zip on s3 or CF, or is there some way to batch / group files into a zip?
Maybe I could set up an EC2 instance to handle this?
I would greatly appreciate some direction.
Best
Joe
I am afraid you won't be able to create the batches without additional processing. Firing up an EC2 instance might be an option to create a batch per user.
I am facing the exact same problem. So far the only thing I was able to find is the AWS CLI's s3 sync command:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
In my case, I am using Rails + its Paperclip addon which means that I have no way to easily download all of the user's images in one go, because the files are scattered in a lot of subdirectories.
However, if you can group your user's files in a better way, say like this:
/users/<ID>/images/...
/users/<ID>/songs/...
...etc., then you can solve your problem right away with:
aws s3 sync s3://<your_bucket_name>/users/<user_id>/songs /cache/<user_id>
Do keep in mind that you'll have to give your server the proper credentials so the AWS CLI tools can work without prompting for usernames/passwords.
And that should sort you.
Additional discussion here:
Downloading an entire S3 bucket?
S3 works on a one-object-per-HTTP-request basis.
So the answer is to use multiple threads to achieve the same thing.
With the Java API, use TransferManager:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html
You can get great performance with multiple threads.
There is no bulk download, sorry.
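The two ideas in these answers, fetching one object per request in parallel threads and then handing the user a single zip, can be sketched together in Python. This is only an illustration under stated assumptions: fetch_all, build_zip, and the fetch callable are hypothetical names, and fetch stands in for whatever call actually retrieves one object's bytes from S3 or CloudFront.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(keys: Iterable[str], fetch: Callable[[str], bytes],
              max_workers: int = 8) -> Dict[str, bytes]:
    """Fetch every key concurrently: one request per object, many in flight."""
    key_list = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip() pairs them correctly.
        bodies = list(pool.map(fetch, key_list))
    return dict(zip(key_list, bodies))


def build_zip(objects: Dict[str, bytes]) -> bytes:
    """Pack the fetched objects into an in-memory zip archive."""
    buffer = io.BytesIO()
    # MP3 data is already compressed, so entries are stored without deflating.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for key, body in objects.items():
            archive.writestr(key, body)
    return buffer.getvalue()
```

Everything is held in memory here, which is fine for a handful of tracks; for large selections the archive would more sensibly be written to a temporary file on the server or EC2 instance doing the work.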

Fastest / best way to copy data between S3 and EC2?

I have a fairly large amount of data (~30G, split into ~100 files) I'd like to transfer between S3 and EC2: when I fire up the EC2 instances I'd like to copy the data from S3 to EC2 local disks as quickly as I can, and when I'm done processing I'd like to copy the results back to S3.
I'm looking for a tool that'll do a fast / parallel copy of the data back and forth. I have several scripts hacked up, including one that does a decent job, so I'm not looking for pointers to basic libraries; I'm looking for something fast and reliable.
Unfortunately, Adam's suggestion won't work, as his understanding of EBS is wrong (although I wish he were right, and I have often thought it should work that way). EBS has nothing to do with S3; it only gives you an "external drive" for EC2 instances that is separate from, but attachable to, the instances. You still have to copy between S3 and EC2, even though there are no data transfer costs between the two.
You didn't mention the operating system of your instance, so I cannot give tailored information. A popular command line tool I use is http://s3tools.org/s3cmd. It is based on Python and therefore, according to its website, it should work on Windows as well as Linux, although I use it ALL the time on Linux. You could easily whip up a quick script that uses its built-in "sync" command, which works similarly to rsync, and have it triggered every time you're done processing your data. You could also use the recursive put and get commands to get and put data only when needed.
There are also graphical tools like Cloudberry Pro that have some command line options for Windows, which you can use to set up scheduled commands. http://s3tools.org/s3cmd is probably the easiest.
By now, there is a sync command in the AWS Command line tools, that should do the trick: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
On startup:
aws s3 sync s3://mybucket /mylocalfolder
before shutdown:
aws s3 sync /mylocalfolder s3://mybucket
Of course, the details are always fun to work out, e.g. how parallel it is (and whether you can make it more parallel, and whether that is any faster given the virtual nature of the whole setup).
Btw hope you're still working on this... or somebody is. ;)
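The startup and shutdown commands above can be wrapped around the processing step in a small script. The following Python sketch assumes the aws command is installed and has credentials; the function names and the work callable are hypothetical, and the run parameter exists only so the command runner can be swapped out for testing.

```python
import subprocess
from typing import Callable, List


def sync_command(source: str, destination: str) -> List[str]:
    """Build the argument list for: aws s3 sync SOURCE DESTINATION."""
    return ["aws", "s3", "sync", source, destination]


def process_with_sync(bucket_url: str, local_dir: str,
                      work: Callable[[str], None],
                      run=subprocess.run) -> None:
    """Pull the data down, run work(local_dir), then push the results back."""
    # check=True makes a failed sync raise instead of continuing silently.
    run(sync_command(bucket_url, local_dir), check=True)
    work(local_dir)
    # Only reached if work() returned without raising.
    run(sync_command(local_dir, bucket_url), check=True)
```

For example, process_with_sync("s3://mybucket", "/mylocalfolder", my_job) would mirror the two commands shown above around a hypothetical my_job function. On the question of parallelism, the AWS CLI exposes an s3 max_concurrent_requests configuration setting that may be worth experimenting with.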
I think you might be better off using an Elastic Block Store to store your files instead of S3. An EBS is akin to a 'drive' on S3 that can be mounted into your EC2 instance without having to copy the data each time, thereby allowing you to persist your data between EC2 instances without having to write to or read from S3 each time.
http://aws.amazon.com/ebs/
Install the s3cmd package with
yum install s3cmd
or
sudo apt-get install s3cmd
depending on your OS,
then copy the data with:
s3cmd get s3://tecadmin/file.txt
The ls command can also list the files.
For more details see this
For me the best form is:
wget http://s3.amazonaws.com/my_bucket/my_folder/my_file.ext
from PuTTY