I am trying to transfer data from an AWS S3 bucket (e.g. s3://mySrcBkt) to a GCS location (a folder under a bucket, e.g. gs://myDestBkt/myDestination). I could not find this option in the interface, which only lets me specify a bucket, not a subfolder, and I found no similar provision in the storagetransfer API. Here is my code snippet:
String SOURCE_BUCKET = ....;
String ACCESS_KEY = ....;
String SECRET_ACCESS_KEY = ....;
String DESTINATION_BUCKET = ....;
String STATUS = "ENABLED";

TransferJob transferJob =
    new TransferJob()
        .setName(NAME)
        .setDescription(DESCRIPTION)
        .setProjectId(PROJECT)
        .setTransferSpec(
            new TransferSpec()
                .setObjectConditions(new ObjectConditions()
                    .setIncludePrefixes(includePrefixes))
                .setTransferOptions(new TransferOptions()
                    .setDeleteObjectsFromSourceAfterTransfer(false)
                    .setOverwriteObjectsAlreadyExistingInSink(false)
                    .setDeleteObjectsUniqueInSink(false))
                .setAwsS3DataSource(new AwsS3Data()
                    .setBucketName(SOURCE_BUCKET)
                    .setAwsAccessKey(new AwsAccessKey()
                        .setAccessKeyId(ACCESS_KEY)
                        .setSecretAccessKey(SECRET_ACCESS_KEY)))
                .setGcsDataSink(new GcsData()
                    .setBucketName(DESTINATION_BUCKET)))
        .setSchedule(
            new Schedule()
                .setScheduleStartDate(date)
                .setScheduleEndDate(date)
                .setStartTimeOfDay(time))
        .setStatus(STATUS);
Unfortunately I could not find anywhere to specify the destination folder for this transfer. I know gsutil rsync can do something similar, but scale and data integrity are concerns there. Can anyone point me to a way or workaround to achieve this?
Since a bucket, and not a subdirectory, is the only available option as a data transfer destination, the workaround for this scenario is to transfer into your bucket first, then rsync between the bucket and the subdirectory. Keep in mind that you should first run gsutil -m rsync -r -d -n to preview what it will do, as the -d flag can delete data accidentally.
There are a lot of files in my S3 bucket, and I want to download the 1000 most recent ones (by upload date).
How do I go about doing that with the AWS CLI or boto?
You can use the following command:
aws s3api list-objects --bucket <bucket> \
--query 'reverse(sort_by(Contents[].{Key: Key, LastModified: LastModified}, &LastModified))[:1000].[Key]' --output text | \
xargs -I {} aws s3 cp s3://<bucket>/{} .
I use the following JMESPath expressions:
sort_by: sorts the JSON array; here I project the contents down to (Key, LastModified) pairs and sort on the LastModified attribute
reverse: reverses the result, since you want the most recent objects first
[:1000]: takes only the first 1000 elements of the array
.[Key]: reads only the Key element, wrapped as an array so each element is output as text on its own line
xargs -I {} aws s3 cp s3://<bucket>/{} . then copies each file found above from your S3 bucket to the local directory
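The same approach can be sketched in Python with boto3. This is a rough sketch, not the asker's code: the helper names are mine, and it assumes configured AWS credentials and a bucket small enough to list in full.

```python
def most_recent_keys(objects, n=1000):
    """Return the keys of the n most recently modified objects, newest first."""
    ordered = sorted(objects, key=lambda o: o["LastModified"], reverse=True)
    return [o["Key"] for o in ordered[:n]]

def download_recent(bucket_name, n=1000):
    import boto3  # assumes credentials are configured (env vars, ~/.aws/credentials, ...)
    s3 = boto3.client("s3")
    # list_objects_v2 returns at most 1000 keys per page, so paginate
    paginator = s3.get_paginator("list_objects_v2")
    objects = [o for page in paginator.paginate(Bucket=bucket_name)
               for o in page.get("Contents", [])]
    for key in most_recent_keys(objects, n):
        # flatten the key into a local file name to avoid creating directories
        s3.download_file(bucket_name, key, key.replace("/", "_"))
```

Like the CLI pipeline above, this lists everything first and sorts client-side, since the S3 API has no server-side ordering by LastModified.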
You can use the code below to download the most recent matching object from S3:
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('test-bucket')
bucket_files = bucket.list('subdir/file_2017_')

# pair each key with its last_modified timestamp (an ISO 8601 string,
# so lexicographic order is chronological), then take the newest
pointer = [(bucket_file.last_modified, bucket_file) for bucket_file in bucket_files]
key_to_download = sorted(pointer, key=lambda p: p[0])[-1][1]
key_to_download.get_contents_to_filename('target_filename')
In my current project I need to check my S3 bucket's contents every 4 seconds for new files.
This script will run for around 3 hours each time the service is used, and by the end there will be around 2700 files under a single prefix.
This is my function to list those files:
public function listFiles($s3Prefix, $limit, $get_after = '')
{
    $command = $this->s3Client->getCommand('ListObjects');
    $command['Bucket'] = $this->s3_bucket;
    $command['Prefix'] = $s3Prefix;
    $command['MaxKeys'] = $limit;
    $command['Marker'] = $s3Prefix.'/'.$get_after;
    //$command['Query'] = 'sort_by(Contents,&LastModified)';
    $ret_s3 = $this->s3Client->execute($command);
    $ret['truncated'] = $ret_s3['IsTruncated'];
    $ret['files'] = $ret_s3['Contents'];
    return $ret;
} // listFiles
What I need is to get the files ordered by the LastModified field, so I do not have to fetch over 2k files.
Is there an extra parameter like
command['Query'] = 'sort_by(Contents,&LastModified)';
to add in the php API?
---------- EDITED ------------
As pointed out in Abhishek Meena's answer, in the shell it is possible to use
aws s3api list-objects --bucket "bucket-name" --prefix "some-prefix" --query "Contents[?LastModified>=\`2017-03-08\`]"
What I'm looking for is how to implement this in PHP.
PHP API: https://github.com/aws/aws-sdk-php
I don't know whether they have something to sort the objects by LastModified, but you can query and filter objects on the LastModified column.
This is what you can use to filter all the files modified after a certain time:
aws s3api list-objects --bucket "bucket-name" --prefix "some-prefix" --query "Contents[?LastModified>=\`2017-03-08\`]"
This is for the shell; they might have something similar for PHP.
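Worth noting: the CLI's --query filter is JMESPath evaluated client-side, not by S3, so in any SDK the filtering and sorting have to happen after listing. A rough sketch of that idea (shown here in Python with boto3 since there is no PHP elsewhere to test against; the helper names are mine, and the PHP version would list with ListObjectsV2 and usort() the Contents analogously):

```python
def modified_since(objects, cutoff):
    """Keep objects whose LastModified is at or after cutoff, newest first."""
    recent = [o for o in objects if o["LastModified"] >= cutoff]
    return sorted(recent, key=lambda o: o["LastModified"], reverse=True)

def list_recent(bucket_name, prefix, cutoff):
    import boto3  # assumes configured credentials
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    objects = [o for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix)
               for o in page.get("Contents", [])]
    return modified_since(objects, cutoff)
```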
I have included some code that uploads a war file into an s3 bucket (creating the bucket first if it does not exist). It then creates an elastic beanstalk application version using the just-uploaded war file.
Assume /tmp/server.war exists and is a valid war file. The following code will fail with boto.exception.BotoServerError: BotoServerError: 400 Bad Request:
#!/usr/bin/env python
import time
import boto

BUCKET_NAME = 'foo_bar23498'

s3 = boto.connect_s3()
bucket = s3.lookup(BUCKET_NAME)
if not bucket:
    bucket = s3.create_bucket(BUCKET_NAME, location='')

version_label = 'server%s' % int(time.time())

# upload the war file
key_name = '%s.war' % version_label
s3key = bucket.new_key(key_name)
print('uploading war file...')
s3key.set_contents_from_filename('/tmp/server.war',
                                 headers={'Content-Type': 'application/x-zip'})

# uses us-east-1 by default
eb = boto.connect_beanstalk()
eb.create_application_version(
    application_name='TheApp',
    version_label=version_label,
    s3_bucket=BUCKET_NAME,
    s3_key=key_name,
    auto_create_application=True)
What would cause this?
One possible cause of this error is the bucket name. Apparently you can have S3 bucket names that contain underscores, but you cannot create application versions using keys in those buckets.
If you change the BUCKET_NAME assignment above to
BUCKET_NAME = 'foo-bar23498'
it should work.
Yes, it feels weird to answer my own question, but apparently this is the recommended approach for this situation on Stack Overflow. I hope I save someone else a whole lot of debugging time.
I have the following folder structure in S3. Is there a way to recursively remove all files under a certain folder (say foo/bar1, foo, or foo/bar2/1, ...)?
foo/bar1/1/..
foo/bar1/2/..
foo/bar1/3/..
foo/bar2/1/..
foo/bar2/2/..
foo/bar2/3/..
With the latest aws-cli Python command line tools, recursively deleting all the files under a folder in a bucket is just:
aws s3 rm --recursive s3://your_bucket_name/foo/
Or delete everything under the bucket:
aws s3 rm --recursive s3://your_bucket_name
If what you actually want is to delete the bucket itself, there is a one-step shortcut:
aws s3 rb --force s3://your_bucket_name
which will remove the contents in that bucket recursively then delete the bucket.
Note: the s3:// protocol prefix is required for these commands to work
This used to require a dedicated API call per key (file), but has been greatly simplified due to the introduction of Amazon S3 - Multi-Object Delete in December 2011:
Amazon S3's new Multi-Object Delete gives you the ability to
delete up to 1000 objects from an S3 bucket with a single request.
See my answer to the related question delete from S3 using api php using wildcard for more on this and respective examples in PHP (the AWS SDK for PHP supports this since version 1.4.8).
Most AWS client libraries have meanwhile introduced dedicated support for this functionality one way or another, e.g.:
Python
You can achieve this with the excellent boto Python interface to AWS, roughly as follows (untested, off the top of my head):
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket("bucketname")
bucketListResultSet = bucket.list(prefix="foo/bar")
result = bucket.delete_keys([key.name for key in bucketListResultSet])
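For completeness, the same batched delete can be sketched with the newer boto3 as well. This is a hedged sketch, not tested against a live bucket: the function names are mine, and it assumes configured credentials.

```python
def chunked(seq, size=1000):
    """Split seq into lists of at most size elements."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def delete_prefix(bucket_name, prefix):
    import boto3  # assumes configured credentials
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = [o["Key"] for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix)
            for o in page.get("Contents", [])]
    # DeleteObjects accepts at most 1000 keys per request, hence the batching
    for batch in chunked(keys):
        s3.delete_objects(Bucket=bucket_name,
                          Delete={"Objects": [{"Key": k} for k in batch]})
```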
Ruby
This is available since version 1.24 of the AWS SDK for Ruby and the release notes provide an example as well:
bucket = AWS::S3.new.buckets['mybucket']
# delete a list of objects by keys, objects are deleted in batches of 1k per
# request. Accepts strings, AWS::S3::S3Object, AWS::S3::ObjectVersion and
# hashes with :key and :version_id
bucket.objects.delete('key1', 'key2', 'key3', ...)
# delete all of the objects in a bucket (optionally with a common prefix as shown)
bucket.objects.with_prefix('2009/').delete_all
# conditional delete, loads and deletes objects in batches of 1k, only
# deleting those that return true from the block
bucket.objects.delete_if{|object| object.key =~ /\.pdf$/ }
# empty the bucket and then delete the bucket, objects are deleted in batches of 1k
bucket.delete!
Or:
AWS::S3::Bucket.delete('your_bucket', :force => true)
You might also consider using an Amazon S3 Lifecycle rule to create an expiration for files with the prefix foo/bar1.
Open the S3 console and click a bucket, then click Properties and then Lifecycle.
Create an expiration rule for all files with the prefix foo/bar1 and set the expiration to 1 day after the file was created.
Save and all matching files will be gone within 24 hours.
Just don't forget to remove the rule after you're done!
No API calls, no third party libraries, apps or scripts.
I just deleted several million files this way.
A screenshot showing the Lifecycle Rule window (note in this shot the Prefix has been left blank, affecting all keys in the bucket):
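The same rule can also be put in place programmatically. A hedged boto3 sketch (the helper names and rule ID are my own; one day is the minimum expiration S3 allows):

```python
def expiration_rule(prefix, days=1):
    """Build a lifecycle rule dict that expires objects under prefix after days."""
    return {
        "ID": "expire-" + (prefix.rstrip("/").replace("/", "-") or "all"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }

def apply_expiration(bucket_name, prefix):
    import boto3  # assumes configured credentials
    s3 = boto3.client("s3")
    # replaces any existing lifecycle configuration on the bucket
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration={"Rules": [expiration_rule(prefix)]},
    )
```

As with the console rule, remember to remove it once the cleanup is done, or it will keep expiring new objects.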
The top-voted answer is missing a step.
Per aws s3 help:
Currently, there is no support for the use of UNIX-style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. When there are multiple filters, filters that appear later in the command take precedence over filters that appear earlier. For example, if the filter parameters passed to the command were --exclude "*" --include "*.txt", all files would be excluded except for files ending with .txt.
aws s3 rm --recursive s3://bucket/ --exclude="*" --include="/folder_path/*"
With the s3cmd package installed on a Linux machine, you can do this:
s3cmd rm s3://foo/bar --recursive
If you want to remove all objects with the "foo/" prefix using the AWS SDK for Java 2.0:
import java.util.ArrayList;
import java.util.Iterator;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

// ...

ListObjectsRequest listObjectsRequest = ListObjectsRequest.builder()
        .bucket(bucketName)
        .prefix("foo/")
        .build();

ListObjectsResponse objectsResponse = s3Client.listObjects(listObjectsRequest);

while (true) {
    ArrayList<ObjectIdentifier> objects = new ArrayList<>();
    for (Iterator<?> iterator = objectsResponse.contents().iterator(); iterator.hasNext(); ) {
        S3Object s3Object = (S3Object) iterator.next();
        objects.add(ObjectIdentifier.builder()
                .key(s3Object.key())
                .build());
    }

    // delete the current batch of keys (at most 1000 per request)
    s3Client.deleteObjects(DeleteObjectsRequest.builder()
            .bucket(bucketName)
            .delete(Delete.builder()
                    .objects(objects)
                    .build())
            .build());

    if (!objectsResponse.isTruncated()) {
        break;
    }
    // the deleted keys are gone, so re-listing from the start returns the rest
    objectsResponse = s3Client.listObjects(listObjectsRequest);
}
In case you are using the AWS SDK for Ruby v2:

s3.list_objects(bucket: bucket_name, prefix: "foo/").contents.each do |obj|
  next if obj.key == "foo/"
  s3.delete_object(
    bucket: bucket_name,
    key: obj.key,
  )
end

Note: this deletes everything under the "foo/" prefix in the bucket.
To delete all versions of the objects under a specific folder, pass the folder path (e.g. foo/bar1/1/) as the Prefix:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket("my-bucket-name")
bucket.object_versions.filter(Prefix="foo/bar1/1/").delete()
I just removed all files from my bucket by using PowerShell:
Get-S3Object -BucketName YOUR_BUCKET | % { Remove-S3Object -BucketName YOUR_BUCKET -Key $_.Key -Force:$true }
Just saw that Amazon added a "How to Empty a Bucket" option to the AWS console menu:
http://docs.aws.amazon.com/AmazonS3/latest/UG/DeletingaBucket.html
The best way is to use a lifecycle rule to delete the whole bucket's contents. Programmatically, you can use the following (PHP) code to PUT the lifecycle rule.
$expiration = array('Date' => date('U', strtotime('GMT midnight')));
$result = $s3->putBucketLifecycle(array(
'Bucket' => 'bucket-name',
'Rules' => array(
array(
'Expiration' => $expiration,
'ID' => 'rule-name',
'Prefix' => '',
'Status' => 'Enabled',
),
),
));
In the above case all the objects will be deleted starting from the given Date ("today at GMT midnight").
You can also specify Days instead, as follows. But with Days it will wait at least 24 hours (1 day is the minimum) before starting to delete the bucket contents.
$expiration = array('Days' => 1);
I needed to do the following...
def delete_bucket
  s3 = init_amazon_s3
  s3.buckets['BUCKET-NAME'].objects.each do |obj|
    obj.delete
  end
end

def init_amazon_s3
  config = YAML.load_file("#{Rails.root}/config/s3.yml")
  AWS.config(:access_key_id => config['access_key_id'], :secret_access_key => config['secret_access_key'])
  s3 = AWS::S3.new
end
s3cmd del --recursive s3://your_bucket --force