I have about 1,000 objects in S3 whose keys are named like
abcyearmonthday1
abcyearmonthday2
abcyearmonthday3
...
and I want to rename them to
abc/year/month/day/1
abc/year/month/day/2
abc/year/month/day/3
How could I do this with boto3? Is there an easier way of doing it?
As explained in Boto3/S3: Renaming an object using copy_object, you cannot rename an object in S3. You have to copy the object under a new name and then delete the old object:
import boto3

s3 = boto3.resource('s3')
# Copy the object under the new key, then delete the original
s3.Object('my_bucket', 'my_file_new').copy_from(CopySource='my_bucket/my_file_old')
s3.Object('my_bucket', 'my_file_old').delete()
There is no direct way to rename an S3 object. The following two steps need to be performed (see the sketch below):
Copy the S3 object within the same bucket under the new key.
Then delete the old object.
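A minimal sketch of how this could be applied to the key pattern from the question, assuming a fixed-width abcYYYYMMDD prefix followed by a sequence number; the bucket name and the key parsing are assumptions to adapt to your actual data:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my_bucket')  # assumed bucket name

# Rename abcYYYYMMDDn -> abc/YYYY/MM/DD/n by copying and then deleting
for obj in bucket.objects.filter(Prefix='abc'):
    old_key = obj.key
    # assumed fixed-width layout: 'abc' + 4-digit year + 2-digit month + 2-digit day + sequence
    year, month, day, seq = old_key[3:7], old_key[7:9], old_key[9:11], old_key[11:]
    new_key = f"abc/{year}/{month}/{day}/{seq}"
    bucket.Object(new_key).copy_from(CopySource={'Bucket': bucket.name, 'Key': old_key})
    bucket.Object(old_key).delete()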
I had the same problem (in my case I wanted to rename files generated in S3 by the Redshift UNLOAD command). I solved it by creating a boto3 session and then copying and deleting file by file, like this:
import boto3

s3_session = boto3.session.Session(
    aws_access_key_id=my_access_key_id,
    aws_secret_access_key=my_secret_access_key,
).resource('s3')

# Build a list of (old_s3_file_path, new_s3_file_path) tuples (with prefix),
# e.g. ('prefix/old_filename.csv000', 'prefix/new_filename.csv')
s3_files_to_rename = []
s3_files_to_rename.append((old_file, new_file))

for old_file, new_file in s3_files_to_rename:
    s3_session.Object(s3_bucket_name, new_file).copy_from(CopySource=s3_bucket_name + '/' + old_file)
    s3_session.Object(s3_bucket_name, old_file).delete()
I am continuously adding parquet data sets to an S3 folder with a structure like this:
s3:::my-bucket/public/data/set1
s3:::my-bucket/public/data/set2
s3:::my-bucket/public/data/set3
At the beginning I only have set1, and my crawler is configured to run on the whole bucket s3:::my-bucket. This leads to the creation of a partitioned table named my-bucket with partitions named public, data and set1. What I actually want is a table named set1 without any partitions.
I see the reasons why this happens, as explained under How Does a Crawler Determine When to Create Partitions?. But when a new data set is uploaded (e.g. set2), I don't want it to become another partition (because it is completely different data with a different schema).
How can I force the Glue crawler to NOT create partitions?
I know I could define the crawler path as s3:::my-bucket/public/data/, but unfortunately I don't know where the new data sets will be created (e.g. it could also be s3:::my-bucket/other/folder/set2).
Any ideas how to solve this?
You can use TableLevelConfiguration to specify at which folder level the crawler should look for tables.
More information on that here.
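A rough sketch of setting that via boto3, assuming a crawler named my-crawler and that the setX folders should become the tables; the exact level number counts folders down from the bucket, so verify it against the Glue documentation for your layout:

import json
import boto3

glue = boto3.client('glue')

# Group files so that my-bucket/public/data/setX becomes table "setX" without partitions
glue.update_crawler(
    Name='my-crawler',  # hypothetical crawler name
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableLevelConfiguration": 4},  # assumed: bucket=1, public=2, data=3, setX=4
    }),
)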
My solution was to manually add the specific paths to the Glue crawler. The big picture is that I am using a Glue job to transform data from one S3 bucket and write it to another one. I initially configured the Glue crawler to crawl the whole bucket, but now every time the Glue transformation job runs it also updates the Glue crawler: it removes the initial full-bucket location (if it still exists) and then adds the new path to the S3 targets.
In Python it looks something like this:
import logging
import time

import boto3

glue_client = boto3.client("glue")


def update_target_paths(crawler):
    """
    Remove initial include path (whole bucket) from paths and
    add folder for current files to include paths.
    """

    def path_is(c, p):
        return c["Path"] == p

    # get S3 targets and remove initial bucket target
    s3_targets = list(
        filter(
            lambda c: not path_is(c, f"s3://{bucket_name}"),
            crawler["Targets"]["S3Targets"],
        )
    )
    # add new target path if not in targets yet
    if not any(filter(lambda c: path_is(c, output_loc), s3_targets)):
        s3_targets.append({"Path": output_loc})
        logging.info("Appending path '%s' to Glue crawler include path.", output_loc)
    crawler["Targets"]["S3Targets"] = s3_targets
    return crawler


def remove_excessive_keys(crawler):
    """Remove keys from the Glue crawler dict that are not needed/allowed when updating the crawler."""
    for k in ["State", "CrawlElapsedTime", "CreationTime", "LastUpdated", "LastCrawl", "Version"]:
        try:
            del crawler[k]
        except KeyError:
            logging.warning(f"Key '{k}' not in crawler result dictionary.")
    return crawler


if __name__ == "__main__":
    logging.info(f"Transforming from {input_loc} to {output_loc}.")
    if prefix_exists(curated_zone_bucket_name, curated_zone_key):
        logging.info("Target object already exists, appending.")
    else:
        logging.info("Target object doesn't exist, writing to new one.")
    transform()  # do data transformation and write to output bucket
    while True:
        try:
            crawler = get_crawler(CRAWLER_NAME)
            crawler = update_target_paths(crawler)
            crawler = remove_excessive_keys(crawler)
            # update Glue crawler with the new include paths, then start it
            glue_client.update_crawler(**crawler)
            glue_client.start_crawler(Name=CRAWLER_NAME)
            logging.info("Started Glue crawler '%s'.", CRAWLER_NAME)
            break
        except (
            glue_client.exceptions.CrawlerRunningException,
            glue_client.exceptions.InvalidInputException,
        ):
            logging.warning("Crawler still running...")
            time.sleep(10)
Variables defined globally: input_loc, output_loc, CRAWLER_NAME, bucket_name. The helper functions prefix_exists, get_crawler and transform are not shown here.
For every new data set a new path is added to the Glue crawler. No partitions will be created.
I am building a Python Lambda in AWS and want to add an S3 trigger to it. Following these instructions I saw how to get the bucket and key for the triggering object using:
import urllib.parse

def func(event):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
There is an example of such an object in the link, but I wasn't able to find a description of the entire event object anywhere in AWS's documentation.
Is there documentation for this object's structure? Where might I find it?
You can find documentation about the whole object in the S3 documentation:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html
I would also advise iterating over the records, because there can be multiple in a single event:
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    [...]
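Putting this together, a minimal handler sketch (the key unquoting mirrors the snippet from the question; the names are otherwise arbitrary):

import urllib.parse

def lambda_handler(event, context):
    # S3 can batch several records into a single notification event
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # object keys arrive URL-encoded (e.g. spaces become '+'), so decode them
        key = urllib.parse.unquote_plus(record['s3']['object']['key'], encoding='utf-8')
        print(f"New object: s3://{bucket}/{key}")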
I am trying to update the existing metadata of my S3 object, but instead of updating it, a new copy is created. The documentation shows the same approach, so I don't know why it is not able to update it.
import boto3

s3 = boto3.client('s3')

k = s3.head_object(Bucket='test-bucket', Key='test.json')
s3.copy_object(
    Bucket='test-bucket',
    Key='test.json',
    CopySource='test-bucket' + '/' + 'test.json',
    Metadata={'Content-Type': 'text/plain'},
    MetadataDirective='REPLACE',
)
I was able to update it using the copy_from method:
import boto3

s3 = boto3.resource('s3')
obj = s3.Object(bucketName, uploadedKey)
# Copy the object onto itself, replacing the metadata and content type
obj.copy_from(
    CopySource={'Bucket': bucketName, 'Key': uploadedKey},
    MetadataDirective="REPLACE",
    ContentType=value,
)
S3 metadata is read-only, so updating only the metadata of an S3 object is not possible. The only way to update the metadata is to recreate/copy the object. See the first paragraph of the official docs:
You can set object metadata at the time you upload it. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata.
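A minimal client-level sketch of such a copy, assuming the goal from the question is to change the content type of test.json in test-bucket (note that ContentType is its own copy_object parameter, separate from the user-defined Metadata dict):

import boto3

s3 = boto3.client('s3')

# Copy the object onto itself, replacing its metadata and content type
s3.copy_object(
    Bucket='test-bucket',
    Key='test.json',
    CopySource={'Bucket': 'test-bucket', 'Key': 'test.json'},
    ContentType='text/plain',   # assumed desired content type
    Metadata={},                # any user-defined metadata to set
    MetadataDirective='REPLACE',
)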
I am trying to simulate a directory listing for my bucket on AWS S3. Currently I am creating "index.html" locally as follows:
import os
from datetime import datetime

for root, dirs, files in os.walk(job_dir):
    objects = []
    for obj in dirs + files:
        full_path = os.path.join(root, obj)
        m_time_epoch = os.stat(full_path).st_mtime
        mtime = datetime.fromtimestamp(m_time_epoch).strftime('%c')
        size = os.stat(full_path).st_size
        obj_type = 'dir' if os.path.isdir(full_path) else 'file'
        objects.append({'name': obj,
                        'mtime': mtime,
                        'size': size,
                        'type': obj_type})
    generate_index(objects, dest_path)
I then pass it, together with the destination path (bucket URL), to a function that creates "index.html" from a Jinja template.
Is there a better way to do this? I would like to avoid JavaScript, though. I did some googling but so far have not found an elegant solution.
What would be the easiest alternative to "os.walk" using the boto3 Python client?
I found some snippets, e.g. here:
How do I list directory contents of an S3 bucket using Python and Boto3?
But isn't there a simpler solution?
Thanks...
I'd recommend using the list_objects_v2 method in boto3.
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
response_iterator = paginator.paginate(
    Bucket='MyBucket'
)

objects = []  # collect entries here if you want to feed them to a template
for response in response_iterator:
    for r in response['Contents']:
        print("File is called {}".format(r['Key']))
While iterating through the objects in the bucket, you can build the list of entries to pass to a Jinja template that creates the index.html page, as sketched below.
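A rough sketch of that, assuming a Jinja2 template file index.html.j2 that loops over the entries (the template and bucket names are placeholders); list_objects_v2 already returns LastModified and Size for each key, so no extra API calls are needed:

import boto3
from jinja2 import Environment, FileSystemLoader

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

entries = []
for page in paginator.paginate(Bucket='MyBucket'):
    for obj in page.get('Contents', []):
        entries.append({
            'name': obj['Key'],
            'mtime': obj['LastModified'].strftime('%c'),
            'size': obj['Size'],
        })

# render the listing with a Jinja2 template (index.html.j2 is assumed to exist)
env = Environment(loader=FileSystemLoader('.'))
html = env.get_template('index.html.j2').render(objects=entries)

# upload the generated listing back to the bucket
s3.put_object(Bucket='MyBucket', Key='index.html', Body=html.encode('utf-8'),
              ContentType='text/html')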
How do you rename an S3 key in a bucket with boto?
You can't rename files in Amazon S3. You can copy them with a new name, then delete the original, but there's no proper rename function.
Here is an example of a Python function that will copy an S3 object using Boto 2:
import boto


def copy_object(src_bucket_name,
                src_key_name,
                dst_bucket_name,
                dst_key_name,
                metadata=None,
                preserve_acl=True):
    """
    Copy an existing object to another location.

    src_bucket_name   Bucket containing the existing object.
    src_key_name      Name of the existing object.
    dst_bucket_name   Bucket to which the object is being copied.
    dst_key_name      The name of the new object.
    metadata          A dict containing new metadata that you want
                      to associate with this object. If this is None
                      the metadata of the original object will be
                      copied to the new object.
    preserve_acl      If True, the ACL from the original object
                      will be copied to the new object. If False
                      the new object will have the default ACL.
    """
    s3 = boto.connect_s3()
    bucket = s3.lookup(src_bucket_name)
    # Look up the existing object in S3
    key = bucket.lookup(src_key_name)
    # Copy the key to the destination bucket/key, carrying metadata and ACL as requested
    return key.copy(dst_bucket_name, dst_key_name,
                    metadata=metadata, preserve_acl=preserve_acl)
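To finish the rename with Boto 2 you would then delete the original key; a small usage sketch (bucket and key names are placeholders):

import boto

# copy under the new name, then remove the original to emulate a rename
copy_object('my-bucket', 'old/key.txt', 'my-bucket', 'new/key.txt')

s3 = boto.connect_s3()
bucket = s3.lookup('my-bucket')
bucket.delete_key('old/key.txt')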
There is no direct method to rename a file in S3. What you have to do is copy the existing file with a new name (just set the target key) and delete the old one. The example below uses the AWS SDK for .NET:
// Copy the object
AmazonS3Client s3 = new AmazonS3Client("AWSAccessKey", "AWSSecretKey");

CopyObjectRequest copyRequest = new CopyObjectRequest()
    .WithSourceBucket("SourceBucket")
    .WithSourceKey("SourceKey")
    .WithDestinationBucket("DestinationBucket")
    .WithDestinationKey("DestinationKey")
    .WithCannedACL(S3CannedACL.PublicRead);
s3.CopyObject(copyRequest);

// Delete the original
DeleteObjectRequest deleteRequest = new DeleteObjectRequest()
    .WithBucketName("SourceBucket")
    .WithKey("SourceKey");
s3.DeleteObject(deleteRequest);