Trying to download an S3 directory to my local machine using s3cmd. I'm using the command:
s3cmd sync --skip-existing s3://bucket_name/remote_dir ~/local_dir
But if I restart the download after an interruption, s3cmd doesn't skip the existing local files downloaded earlier and rewrites them. What is wrong with the command?
I had the same problem and found the solution in comment #38 from William Denniss here: http://s3tools.org/s3cmd-sync
If you have:
$ s3cmd sync --verbose s3://mybucket myfolder
Change it to:
$ s3cmd sync --verbose s3://mybucket/ myfolder/ # note the trailing slash
Then the MD5 hashes are compared and everything works correctly! --skip-existing works as well.
To recap, neither --skip-existing nor the MD5 checks work if you use the first command, and both work if you use the second (I made a mistake in my previous post, as I was testing with 2 different directories).
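For example, a resumed download then looks like this (a minimal sketch, reusing the bucket and directory names from the question):
# trailing slashes make s3cmd compare checksums and skip files already present locally
s3cmd sync --skip-existing s3://bucket_name/remote_dir/ ~/local_dir/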
Use boto-rsync instead. https://github.com/seedifferently/boto_rsync
It correctly syncs only new/changed files from s3 to the local directory.
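The basic invocation mirrors rsync (a sketch, assuming the bucket and paths from the question):
# copy down only objects that are new or changed relative to the local copy
boto-rsync s3://bucket_name/remote_dir ~/local_dir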
gsutil -m rm gs://{our_bucket}/{dir}/{subdir}/*
...
Removing gs://our_bucket/dir/subdir/staging-000000000102.json...
Removing gs://our_bucket/dir/subdir/staging-000000000101.json...
CommandException: 103 files/objects could not be removed.
The command is able to find the directory with the 103 .json files, and "tries" removing them, per the Removing gs://... lines in the output. For what reason might we be receiving CommandException: 103 files/objects could not be removed.?
This works on my local machine.
This works in our Docker container run locally.
This does not work in our Docker container on the GCP Compute Engine instance where we need it to be working.
Perhaps this is a permissions issue, with the Compute Engine instance not having permission to remove files from our GCS bucket?
Edit: We have a service account JSON in the /config folder of our Airflow project, and that service account is granted the Storage Admin role in IAM. Perhaps having the JSON in the /config folder is not sufficient for assigning permissions to the entire GCP Compute Engine instance? I am particularly confused because this server is able to query our BigQuery database, and WRITE to GCS, but cannot delete from GCS...
The solution in this link - https://gist.github.com/ryderdamen/926518ddddd46dd4c8c2e4ef5167243d was exactly what we needed:
Stop the instance
Edit the settings
Remove gsutil cache
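For the last step, a minimal sketch, assuming gsutil's default state directory:
# clear gsutil's cached credentials so it picks up the instance's new access scopes
rm -rf ~/.gsutil
# confirm which account gsutil will now authenticate as
gcloud auth list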
I'm very new at this and need some help; I'm sure I'm not doing something right. I have a Synology NAS that has a cool option to sync files to Google Cloud Storage. This is a great way to get my backups off site.
I have my backups syncing to a Coldline storage bucket. Now that my files are syncing, I'm looking to document the process in case I need to retrieve them.
I want to download a whole folder and all of the files inside it to a Windows server. I installed gsutil and am trying to run this command.
gsutil -m cp -R dir gs://bhp_backup_sync/backup/foldername
but after I run this I get the following exception.
CommandException: No URLs matched: dir
CommandException: 1 file/object could not be transferred.
NOOB here, what am I missing?
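For what it's worth, gsutil cp takes the source first and the destination second, so a download from the bucket would be roughly the reverse of the command above (a sketch; the local destination is an assumption):
# copy the folder from GCS down into the current local directory
gsutil -m cp -R gs://bhp_backup_sync/backup/foldername .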
I am trying to make a Django server on AWS. My Django app depends on some mathematical Python libraries like numpy, scipy, sklearn etc. However, there is an issue because of which I need to do this after every deployment:
sudo nano /etc/httpd/conf.d/wsgi.conf
# add this line in the file:
WSGIApplicationGroup %{GLOBAL}
# then reload Apache:
sudo /etc/init.d/httpd reload
Basically I need "WSGIApplicationGroup %{GLOBAL}" in my wsgi.conf file, otherwise I get a 504. I am using a custom AMI built on top of Amazon Linux 2014, and I am using the EB CLI for deployment. However, whenever I deploy, wsgi.conf is reset and no longer contains the line I added, so I have to manually SSH into the EC2 instance and redo this myself. That adds overhead to every deployment, and it's also not feasible once we scale up (cloning or creating instances also resets it). So is there a way to do this automatically after every deployment?
The content of wsgi.conf is fixed, so I can easily write a script to create it, but the issue is how to trigger the script automatically.
PS:I am new to AWS
You need to use the AWS Elastic Beanstalk feature called .ebextensions: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html
In your case you can't use the files or commands sections, because:
The commands are processed in alphabetical order by name, and they run before the application and web server are set up and the application version file is extracted.
You need to use the container_commands section:
They run after the application and web server have been set up and the application version file has been extracted, but before the application version is deployed.
Example .ebextensions/01wsgi.config (not tested :-))
container_commands:
  apache_reload:
    command: |
      echo "WSGIApplicationGroup %{GLOBAL}" >> /etc/httpd/conf.d/wsgi.conf
      /etc/init.d/httpd reload
Feel free to tweak my example as you want; for example, you can copy your prepared wsgi.conf file somewhere and then replace the original in the container_commands section.
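That replacement approach might look like this (an untested sketch; it assumes you bundle a complete wsgi.conf at .ebextensions/wsgi.conf in your application source, and relies on container commands running from the extracted source directory in alphabetical order):
container_commands:
  01_replace_wsgi_conf:
    command: cp .ebextensions/wsgi.conf /etc/httpd/conf.d/wsgi.conf
  02_apache_reload:
    command: /etc/init.d/httpd reload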
I have a 27GB file that I am trying to move from an AWS Linux EC2 instance to S3. I've tried both the 's3put' command and the 's3cmd put' command. Both work with a test file. Neither works with the large file. No errors are given; the command returns immediately, but nothing happens.
s3cmd put bigfile.tsv s3://bucket/bigfile.tsv
Though you can upload objects to S3 with sizes up to 5TB, S3 has a size limit of 5GB for an individual PUT operation.
In order to load files larger than 5GB (or even files larger than 100MB) you are going to want to use the multipart upload feature of S3.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
http://aws.typepad.com/aws/2010/11/amazon-s3-multipart-upload.html
(Ignore the outdated description of a 5GB object limit in the above blog post. The current limit is 5TB.)
The boto library for Python supports multipart upload, and the latest boto software includes an "s3multiput" command line tool that takes care of the complexities for you and even parallelizes part uploads.
https://github.com/boto/boto
The file did not exist, doh. I realised this after running the s3cmd command in verbose mode by adding the -v flag:
s3cmd put -v bigfile.tsv s3://bucket/bigfile.tsv
s3cmd version 1.1.0 supports multipart upload as part of the "put" command, but it's still in beta (currently).
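With one of those beta builds, a multipart upload of the file from the question would look something like this (a sketch; the chunk size is an arbitrary example):
# s3cmd 1.1.0+ splits the file into parts of the given size and uploads them in sequence
s3cmd put --multipart-chunk-size-mb=50 bigfile.tsv s3://bucket/bigfile.tsv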
I have the s3cmd command line tool for Linux installed. It works fine for putting files in a bucket. However, I want to move a file into a 'folder'. I know that folders aren't natively supported by S3, but my Cyberduck GUI tool converts them nicely for me to view my backups.
For instance, I have a file in the root of the bucket, called 'test.mov' that I want to move to the 'idea' folder. I am trying this:
s3cmd mv s3://mybucket/test.mov s3://mybucket/idea/test.mov
but I get strange errors like:
WARNING: Retrying failed request: /idea/test.mov (timed out)
WARNING: Waiting 3 sec...
I also tried quotes, but that didn't help either:
s3cmd mv 's3://mybucket/test.mov' 's3://mybucket/idea/test.mov'
Neither did using just the folder name:
s3cmd mv 's3://mybucket/test.mov' 's3://mybucket/idea/'
Is there a way without having to delete and re-put this 3GB file?
Update: Just FYI, I can put new files directly into a folder like this:
s3cmd put test2.mov s3://mybucket/idea/test2.mov
But I still don't know how to move them around...
To move/copy from one bucket to another or within the same bucket I use the s3cmd tool and it works fine. For instance:
s3cmd cp --recursive s3://bucket1/directory1 s3://bucket2/directory1
s3cmd mv --recursive s3://bucket1/directory1 s3://bucket2/directory1
Your file is probably quite big; try increasing the socket_timeout s3cmd configuration setting:
http://sumanrs.wordpress.com/2013/03/19/s3cmd-timeout-problems-moving-large-files-on-s3-250mb/
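That setting lives in your ~/.s3cfg; a sketch of the change (600 seconds is an arbitrary example value):
# in ~/.s3cfg: allow slow server-side copies of large objects more time before timing out
socket_timeout = 600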
Remove the ' signs. Your code should be:
s3cmd mv s3://mybucket/test.mov s3://mybucket/idea/test.mov
Also check the permissions on your bucket - for your username you should have all the permissions.
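One way to check is with s3cmd itself (a sketch using the bucket from the question):
# prints the object's metadata and ACL, including which user holds which permission
s3cmd info s3://mybucket/test.mov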
Also try connecting CloudFront to your bucket. I know it doesn't make sense, but I have had a similar problem with a bucket which did not have a CloudFront instance connected to it.