Can you put code from .aws/config directly into your program? - amazon-s3

I am using boto3.client('s3') to upload files with s3.upload_file(filename, bucket, key, Callback=callback, Config=TransferConfig(use_threads=False)), and in my .aws/config file I have:
s3 =
    max_concurrent_requests = 5
Is there a way to get max_concurrent_requests hard-coded into my program?

If you look at the library's documentation, there is an option to provide configuration from within your program: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
An example from the docs:
import boto3
from botocore.config import Config

my_config = Config(
    region_name='us-west-2',
    signature_version='v4',
    retries={
        'max_attempts': 10,
        'mode': 'standard'
    }
)

client = boto3.client('kinesis', config=my_config)

max_concurrent_requests is only supported by the AWS CLI configuration. You could shell out to the CLI from your Python script to leverage it. You can also set it from the command line: aws configure set default.s3.max_concurrent_requests 5. I haven't tested this, but I'd start there.
Here's a blog post on how to execute shell commands from Python: https://janakiev.com/blog/python-shell-commands/
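If you go the shelling-out route, a minimal sketch (untested, and assuming the aws CLI is installed and on your PATH) could look like this:
import subprocess

# Writes max_concurrent_requests into ~/.aws/config via the AWS CLI
subprocess.run(
    ["aws", "configure", "set", "default.s3.max_concurrent_requests", "5"],
    check=True,
)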

Related

How to read csv files from s3 bucket using Pyspark (in macos)?

I am trying to read a CSV into a DataFrame from an S3 bucket, but I'm facing issues. Can you tell me where I am making mistakes here?
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster('local')
conf.setAppName('sparkbasic')
sc = SparkContext.getOrCreate(conf=conf)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "abc")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xyz")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "mybucket/path/fileeast-1.redshift.amazonaws.com")

from pyspark.sql import SparkSession
sc = SparkSession.builder.appName('sparkbasic').getOrCreate()
This is the code where I get the error
csvDf = sc.read.csv("s3a://bucket/path/file/*.csv")
This is the error I get. I tried the links given in other Stack Overflow answers, but nothing has worked for me so far:
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
Maybe you can have a look at S3Fs.
Given your details, maybe a configuration like this could work:
import s3fs

fs = s3fs.S3FileSystem(client_kwargs={
    'endpoint_url': 'fileeast-1.redshift.amazonaws.com',
    'aws_access_key_id': 'abc',
    'aws_secret_access_key': 'xyz',
})
To check whether you can interact with S3, try the following command (NB: change somefile.csv to an existing file):
fs.info('s3://bucket/path/file/somefile.csv')
Note that in fs.info the path starts with s3. If you do not get an error, there is a good chance the following command will work:
csvDf = sc.read.csv("s3a://bucket/path/file/*.csv")
This time the path begins with s3a.
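If the Spark route keeps failing, a fallback sketch (untested; it assumes pandas is installed and reuses the hypothetical bucket/key from the question) is to read the CSV directly through s3fs:
import pandas as pd
import s3fs

# Credentials are picked up from the environment / ~/.aws by default
fs = s3fs.S3FileSystem()

# Hypothetical path reused from the question
with fs.open('s3://bucket/path/file/somefile.csv', 'rb') as f:
    df = pd.read_csv(f)

print(df.head())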

How can I reference an external SQL file using Airflow's BigQuery operator?

I'm currently using Airflow with the BigQuery operator to trigger various SQL scripts. This works fine when the SQL is written directly in the Airflow DAG file. For example:
bigquery_transform = BigQueryOperator(
    task_id='bq-transform',
    bql='SELECT * FROM `example.table`',
    destination_dataset_table='example.destination'
)
However, I'd like to store the SQL in a separate file saved to a storage bucket. For example:
bql='gs://example_bucket/sample_script.sql'
When calling this external file I receive a "Template Not Found" error.
I've seen some examples that load the SQL file into the Airflow DAG folder; however, I'd really like to access files saved to a separate storage bucket. Is this possible?
You can reference any SQL file in your Google Cloud Storage bucket. Here's an example where I call the file Query_File.sql from the SQL directory in my Airflow DAG bucket.
CONNECTION_ID = 'project_name'

with DAG('dag', schedule_interval='0 9 * * *', template_searchpath=['/home/airflow/gcs/dags/'],
         max_active_runs=15, catchup=True, default_args=default_args) as dag:

    battery_data_quality = BigQueryOperator(
        task_id='task-id',
        sql='/SQL/Query_File.sql',
        destination_dataset_table='project-name.DataSetName.TableName${{ds_nodash}}',
        write_disposition='WRITE_TRUNCATE',
        bigquery_conn_id=CONNECTION_ID,
        use_legacy_sql=False,
        dag=dag
    )
You can also consider using the gcs_to_gcs operator to copy files from your desired bucket into one that is accessible by Composer.
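A rough sketch of that copy step (untested; it assumes an Airflow 1.10-style contrib import and hypothetical bucket names):
from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator

# Copy the SQL file from your own bucket into the Composer DAG bucket,
# where template_searchpath can then reach it.
copy_sql = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy-sql-to-composer',
    source_bucket='example_bucket',               # bucket holding sample_script.sql
    source_object='sample_script.sql',
    destination_bucket='my-composer-dag-bucket',  # hypothetical Composer bucket
    destination_object='SQL/sample_script.sql',
    dag=dag,
)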
The download behaves differently in GoogleCloudStorageDownloadOperator between Airflow versions 1.10.3 and 1.10.15.
def execute(self, context):
    # Build the object/file name from the job_name passed in the DAG run conf
    self.object = context['dag_run'].conf['job_name'] + '.sql'
    logging.info('filename in GoogleCloudStorageDownloadOperator: %s', self.object)
    self.filename = context['dag_run'].conf['job_name'] + '.sql'

    self.log.info('Executing download: %s, %s, %s', self.bucket,
                  self.object, self.filename)
    hook = GoogleCloudStorageHook(
        google_cloud_storage_conn_id=self.google_cloud_storage_conn_id,
        delegate_to=self.delegate_to
    )

    file_bytes = hook.download(bucket=self.bucket,
                               object=self.object)
    if self.store_to_xcom_key:
        if sys.getsizeof(file_bytes) < 49344:
            context['ti'].xcom_push(key=self.store_to_xcom_key, value=file_bytes.decode('utf-8'))
        else:
            raise RuntimeError(
                'The size of the downloaded file is too large to push to XCom!'
            )

How to remove multiple S3 buckets at once?

I have a dozen buckets in AWS S3 that I would like to remove, all with similar names containing bucket-to-remove and each with some objects in them.
Using the UI is quite slow; is there a way to remove all these buckets quickly using the CLI?
You could try this sample line to delete all at once. Remember that this is highly destructive, so I hope you know what you are doing:
for bucket in $(aws s3 ls | awk '{print $3}' | grep my-bucket-pattern); do aws s3 rb "s3://${bucket}" --force ; done
That's it. It may take a while depending on the number of buckets and their contents.
I did this
aws s3api list-buckets \
  --query 'Buckets[?starts_with(Name, `bucket-pattern`) == `true`].[Name]' \
  --output text | xargs -I {} aws s3 rb s3://{} --force
Then update the bucket pattern as needed.
Be careful though, this is a pretty dangerous operation.
The absolute easiest way to bulk delete S3 buckets is to not write any code at all. I use Cyberduck to browse my S3 account and delete buckets and their contents quite easily.
Using boto3, you cannot delete buckets that have objects in them, so you first need to remove the objects before deleting the bucket. The easiest solution is a simple Python script such as:
import boto3

s3_client = boto3.client(
    "s3",
    aws_access_key_id="<your key id>",
    aws_secret_access_key="<your secret access key>"
)

response = s3_client.list_buckets()

for bucket in response["Buckets"]:
    # Only removes the buckets with the name you want.
    if "bucket-to-remove" in bucket["Name"]:
        s3_objects = s3_client.list_objects_v2(Bucket=bucket["Name"])
        # Deletes the objects in the bucket before deleting the bucket.
        if "Contents" in s3_objects:
            for s3_obj in s3_objects["Contents"]:
                rm_obj = s3_client.delete_object(
                    Bucket=bucket["Name"], Key=s3_obj["Key"])
                print(rm_obj)
        rm_bucket = s3_client.delete_bucket(Bucket=bucket["Name"])
        print(rm_bucket)
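Note that list_objects_v2 returns at most 1,000 keys per call, so for larger buckets a paginated variant may be safer. A sketch (untested, with a hypothetical bucket name):
import boto3

s3_client = boto3.client("s3")
bucket_name = "bucket-to-remove-example"  # hypothetical bucket name

# Paginate through all pages of keys before deleting the bucket itself
paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    for s3_obj in page.get("Contents", []):
        s3_client.delete_object(Bucket=bucket_name, Key=s3_obj["Key"])

s3_client.delete_bucket(Bucket=bucket_name)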
Here is a Windows solution.
First, test the filter before you delete:
aws s3 ls ^| findstr "<search here>"
and then execute
for /f "tokens=3" %a in ('<copy the correct command between the quotes>') do aws s3 rb s3://%a --force
According to the S3 docs, you can remove a bucket using the CLI command aws s3 rb only if the bucket does not have versioning enabled. If that's the case, you can write a simple bash script to get the bucket names and delete them one by one:
#!/bin/bash
# get buckets list => returns the timestamp + bucket name separated by lines
S3LS="$(aws s3 ls | grep 'bucket-name-pattern')"
# split the lines into an array. #see https://stackoverflow.com/a/13196466/6569593
oldIFS="$IFS"
IFS='
'
IFS=${IFS:0:1}
lines=( $S3LS )
IFS="$oldIFS"
for line in "${lines[@]}"
do
  BUCKET_NAME=${line:20:${#line}} # remove timestamp
  aws s3 rb "s3://${BUCKET_NAME}" --force
done
Be careful not to remove important buckets! I recommend printing each bucket name before actually removing it. Also be aware that the aws s3 rb command takes a while to run, because it recursively deletes all the objects inside the bucket.
To delete all S3 buckets in your account, use the technique below. It works well when run locally.
Step 1: Export your profile using the command below, or export your access_key and secret_access_key locally instead.
export AWS_PROFILE=<Your-Profile-Name>
Step 2: Use the Python code below. Run it locally and all your S3 buckets will be deleted.
import boto3

client = boto3.client('s3', region_name='us-east-2')
s3 = boto3.resource('s3')

response = client.list_buckets()
for bucket in response['Buckets']:
    s3_bucket = s3.Bucket(bucket['Name'])
    bucket_versioning = s3.BucketVersioning(bucket['Name'])
    if bucket_versioning.status == 'Enabled':
        s3_bucket.object_versions.delete()
    else:
        s3_bucket.objects.all().delete()
    response = client.delete_bucket(Bucket=bucket['Name'])
If you see an error like "boto3 not found", install it with pip: pip install boto3
I have used a Lambda function for deleting buckets with a specified prefix.
It will delete all the objects regardless of whether versioning is enabled or not.
Note: you should give appropriate S3 access to your Lambda.
import boto3

s3_client = boto3.client('s3')
s3 = boto3.resource('s3')

def lambda_handler(event, context):
    bucket_prefix = "your prefix"
    response = s3_client.list_buckets()
    for bucket in response["Buckets"]:
        # Only removes the buckets with the name you want.
        if bucket_prefix in bucket["Name"]:
            s3_bucket = s3.Bucket(bucket['Name'])
            bucket_versioning = s3.BucketVersioning(bucket['Name'])
            if bucket_versioning.status == 'Enabled':
                s3_bucket.object_versions.delete()
            else:
                s3_bucket.objects.all().delete()
            response = s3_client.delete_bucket(Bucket=bucket['Name'])

    return {
        'message': f"deleting buckets with prefix {bucket_prefix} was successful"
    }
If you're using PowerShell, this will work:
Foreach($x in (aws s3api list-buckets --query 'Buckets[?starts_with(Name, `name-pattern`) == `true`].[Name]' --output text))
{ aws s3 rb s3://$x --force }
The best option I have found is to use Cyberduck. You can select all the buckets in the GUI and delete them together.

What would cause a 'BotoServerError: 400 Bad Request' when calling create_application_version?

I have included some code that uploads a war file into an S3 bucket (creating the bucket first if it does not exist). It then creates an Elastic Beanstalk application version using the just-uploaded war file.
Assume /tmp/server.war exists and is a valid war file. The following code will fail with boto.exception.BotoServerError: BotoServerError: 400 Bad Request:
#!/usr/bin/env python
import time
import boto

BUCKET_NAME = 'foo_bar23498'

s3 = boto.connect_s3()
bucket = s3.lookup(BUCKET_NAME)
if not bucket:
    bucket = s3.create_bucket(BUCKET_NAME, location='')

version_label = 'server%s' % int(time.time())

# upload the war file
key_name = '%s.war' % version_label
s3key = bucket.new_key(key_name)
print 'uploading war file...'
s3key.set_contents_from_filename('/tmp/server.war',
                                 headers={'Content-Type': 'application/x-zip'})

# uses us-east-1 by default
eb = boto.connect_beanstalk()
eb.create_application_version(
    application_name='TheApp',
    version_label=version_label,
    s3_bucket=BUCKET_NAME,
    s3_key=key_name,
    auto_create_application=True)
What would cause this?
One possible cause of this error is the bucket name. Apparently you can have S3 bucket names that contain underscores, but you cannot create application versions using keys in those buckets.
If you change the BUCKET_NAME line above to
BUCKET_NAME = 'foo-bar23498'
it should work.
Yes, it feels weird to be answering my own question... apparently this is the recommended approach for this situation on Stack Overflow. I hope I save someone else a whole lot of debugging time.

How to delete files recursively from an S3 bucket

I have the following folder structure in S3. Is there a way to recursively remove all files under a certain folder (say foo/bar1, foo, or foo/bar2/1, ...)?
foo/bar1/1/..
foo/bar1/2/..
foo/bar1/3/..
foo/bar2/1/..
foo/bar2/2/..
foo/bar2/3/..
With the latest aws-cli Python command line tools, recursively deleting all the files under a folder in a bucket is just:
aws s3 rm --recursive s3://your_bucket_name/foo/
Or delete everything under the bucket:
aws s3 rm --recursive s3://your_bucket_name
If what you want is to actually delete the bucket, there is a one-step shortcut:
aws s3 rb --force s3://your_bucket_name
which will remove the contents of that bucket recursively and then delete the bucket.
Note: the s3:// protocol prefix is required for these commands to work.
This used to require a dedicated API call per key (file), but has been greatly simplified due to the introduction of Amazon S3 - Multi-Object Delete in December 2011:
Amazon S3's new Multi-Object Delete gives you the ability to
delete up to 1000 objects from an S3 bucket with a single request.
See my answer to the related question delete from S3 using api php using wildcard for more on this and respective examples in PHP (the AWS SDK for PHP supports this since version 1.4.8).
Most AWS client libraries have meanwhile introduced dedicated support for this functionality one way or another, e.g.:
Python
You can achieve this with the excellent boto Python interface to AWS roughly as follows (untested, off the top of my head):
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket("bucketname")
bucketListResultSet = bucket.list(prefix="foo/bar")
result = bucket.delete_keys([key.name for key in bucketListResultSet])
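For comparison, a rough boto3 equivalent (untested sketch; the collection's delete() batches multi-object delete requests for you):
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("bucketname")

# Delete every object whose key starts with the prefix, in batched requests
bucket.objects.filter(Prefix="foo/bar").delete()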
Ruby
This is available since version 1.24 of the AWS SDK for Ruby and the release notes provide an example as well:
bucket = AWS::S3.new.buckets['mybucket']
# delete a list of objects by keys, objects are deleted in batches of 1k per
# request. Accepts strings, AWS::S3::S3Object, AWS::S3::ObjectVersion and
# hashes with :key and :version_id
bucket.objects.delete('key1', 'key2', 'key3', ...)
# delete all of the objects in a bucket (optionally with a common prefix as shown)
bucket.objects.with_prefix('2009/').delete_all
# conditional delete, loads and deletes objects in batches of 1k, only
# deleting those that return true from the block
bucket.objects.delete_if{|object| object.key =~ /\.pdf$/ }
# empty the bucket and then delete the bucket, objects are deleted in batches of 1k
bucket.delete!
Or:
AWS::S3::Bucket.delete('your_bucket', :force => true)
You might also consider using Amazon S3 Lifecycle to create an expiration for files with the prefix foo/bar1.
Open the S3 console in your browser and click a bucket, then click Properties and then Lifecycle.
Create an expiration rule for all files with the prefix foo/bar1 and set the date to 1 day since file was created.
Save and all matching files will be gone within 24 hours.
Just don't forget to remove the rule after you're done!
No API calls, no third party libraries, apps or scripts.
I just deleted several million files this way.
(In the original screenshot of the Lifecycle Rule window, the Prefix was left blank, which affects all keys in the bucket.)
The top-voted answer is missing a step.
Per aws s3 help:
Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. [...] When there are multiple filters, the rule is that filters that appear later in the command take precedence over filters that appear earlier in the command. For example, if the filter parameters passed to the command were --exclude "*" --include "*.txt", all files will be excluded from the command except for files ending with .txt
aws s3 rm --recursive s3://bucket/ --exclude="*" --include="/folder_path/*"
With the s3cmd package installed on a Linux machine, you can do this:
s3cmd rm s3://foo/bar --recursive
In case you want to remove all objects with the "foo/" prefix using the Java AWS SDK 2.0:
import java.util.ArrayList;
import java.util.Iterator;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

//...

ListObjectsRequest listObjectsRequest = ListObjectsRequest.builder()
        .bucket(bucketName)
        .prefix("foo/")
        .build();

ListObjectsResponse objectsResponse = s3Client.listObjects(listObjectsRequest);

while (true) {
    ArrayList<ObjectIdentifier> objects = new ArrayList<>();

    for (Iterator<?> iterator = objectsResponse.contents().iterator(); iterator.hasNext(); ) {
        S3Object s3Object = (S3Object) iterator.next();
        objects.add(
                ObjectIdentifier.builder()
                        .key(s3Object.key())
                        .build()
        );
    }

    s3Client.deleteObjects(
            DeleteObjectsRequest.builder()
                    .bucket(bucketName)
                    .delete(
                            Delete.builder()
                                    .objects(objects)
                                    .build()
                    )
                    .build()
    );

    if (objectsResponse.isTruncated()) {
        objectsResponse = s3Client.listObjects(listObjectsRequest);
        continue;
    }

    break;
}
In case you are using the AWS SDK for Ruby V2:
s3.list_objects(bucket: bucket_name, prefix: "foo/").contents.each do |obj|
  next if obj.key == "foo/"

  resp = s3.delete_object({
    bucket: bucket_name,
    key: obj.key,
  })
end
Attention: everything under "foo/" in the bucket will be deleted.
To delete all the versions of the objects under a specific folder:
Pass the path /folder/subfolder/ as the Prefix:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket("my-bucket-name")
bucket.object_versions.filter(Prefix="foo/bar1/1/").delete()
I just removed all files from my bucket by using PowerShell:
Get-S3Object -BucketName YOUR_BUCKET | % { Remove-S3Object -BucketName YOUR_BUCKET -Key $_.Key -Force:$true }
Just saw that Amazon added a "How to Empty a Bucket" option to the AWS console menu:
http://docs.aws.amazon.com/AmazonS3/latest/UG/DeletingaBucket.html
The best way is to use a lifecycle rule to delete the whole bucket's contents. Programmatically, you can use the following code (PHP) to PUT the lifecycle rule:
$expiration = array('Date' => date('U', strtotime('GMT midnight')));
$result = $s3->putBucketLifecycle(array(
    'Bucket' => 'bucket-name',
    'Rules' => array(
        array(
            'Expiration' => $expiration,
            'ID' => 'rule-name',
            'Prefix' => '',
            'Status' => 'Enabled',
        ),
    ),
));
In the above case, all the objects will be deleted starting from the Date "today, GMT midnight".
You can also specify Days as follows, but with Days it will wait at least 24 hours (1 day is the minimum) before it starts deleting the bucket contents.
$expiration = array('Days' => 1);
I needed to do the following...
def delete_bucket
  s3 = init_amazon_s3
  s3.buckets['BUCKET-NAME'].objects.each do |obj|
    obj.delete
  end
end

def init_amazon_s3
  config = YAML.load_file("#{Rails.root}/config/s3.yml")
  AWS.config(:access_key_id => config['access_key_id'], :secret_access_key => config['secret_access_key'])
  s3 = AWS::S3.new
end
s3cmd del --recursive s3://your_bucket --force