Unable to delete S3 objects with spaces in name - amazon-s3

Through a typo I ended up creating a number of S3 files with spaces in their name. I realize based on the key naming guidelines that this is not an ideal situation, but the objects now exist. I have tried to delete them both from the AWS CLI and from the S3 console. Neither method produces an error, but the objects are not deleted. I tried renaming the files to remove the offending space, but this also fails on both CLI and console. How can I delete these objects?

Try using the AWS SDKs (boto3 in the steps below); a short sketch of these steps follows:
1. List the objects - see (boto3) S3.Client.list_objects
2. Filter the objects (keys) you want to delete from the list
3. Delete the objects in the filtered list using S3.Bucket.delete_objects
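A minimal boto3 sketch of those three steps, using the client API rather than the Bucket resource mentioned above; the bucket name and the "contains a space" filter are assumptions to adjust for your case:

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # assumption: replace with your bucket name

# 1. List all object keys (paginated, in case there are more than 1000)
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# 2. Filter the keys you want to delete (here: any key containing a space)
keys_to_delete = [k for k in keys if " " in k]

# 3. Delete the filtered keys; delete_objects accepts at most 1000 keys per call
for i in range(0, len(keys_to_delete), 1000):
    batch = keys_to_delete[i:i + 1000]
    s3.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": k} for k in batch]},
    )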

This answer applies when you are using boto3 instead of the AWS CLI but run into the same problem as the OP.
The problem:
When boto3 retrieves object names, spaces in the key are encoded as the "+" character. I don't know why the spaces are not URL-encoded as %20 (although this post has answers that might explain why); other special characters in the key name are URL-encoded. Ironically, a literal "+" in an object name is encoded as %2B by boto3.
The solution:
Before passing a key name to boto3's delete_object method, I cleaned up the key this way:

import urllib.parse

# Undo the form-style encoding: turn "+" back into spaces, then percent-decode the rest
remove_plus = form_encoded_key.replace("+", " ")
uncoded_key = urllib.parse.unquote(remove_plus)

response = client.delete_object(
    Bucket=bucket_name,
    Key=uncoded_key
)
I suppose there's a more correct way of handling application/x-www-form-urlencoded type strings, but this is working for me right now.
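For what it's worth, urllib.parse.unquote_plus does both steps in one call: it treats "+" as a space and then percent-decodes the rest, which matches the application/x-www-form-urlencoded convention (the key below is just an illustrative example):

import urllib.parse

# "+" becomes a space and %XX escapes are decoded, so a literal "+" stored as %2B survives
decoded_key = urllib.parse.unquote_plus("my+report%2Bfinal %281%29.txt")
# -> "my report+final (1).txt"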

Related

Is there a way to list the directories in a bucket using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I want to just get the filenames and then pull a file in a different cell if needed. I can access the files just fine using Cyberduck, and it shows their names there.
For example, I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both functions work, so I know my keys are correct, but it takes too long for the files to be loaded.
Considering that the list of files can be retrieved by a single node, you can just list the files in the directory from the driver. Look at this response.
wholeTextFiles returns (path, content) tuples, but I don't know whether the content is read lazily enough that you could take only the path part of each tuple.
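If the files live in S3, a minimal sketch of that listing step with boto3, run on the driver (the bucket and prefix names here are placeholders for your own):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Collect key names under the "directory" without reading any file contents
names = []
for page in paginator.paginate(Bucket="mainfolder-bucket", Prefix="date/"):
    names.extend(obj["Key"] for obj in page.get("Contents", []))

# Later, in another cell, load only the file you actually need, e.g.
# rdd = sc.textFile("s3a://mainfolder-bucket/" + names[0])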

Using a wildcard on S3 Event Notification prefix

I have a Lambda function that creates a thumbnail image for every image that gets uploaded to my bucket and then places the thumbnail in another bucket. When I upload a user image (profile pic) I use the user's ID and name as part of the key:
System-images/users/250/john_doe.jpg
Is there a way to use a wildcard in the prefix path? This is what I have so far, but it doesn't work.
No, you can't -- it's a literal prefix.
In your example, you could use either of these prefixes, depending on what else is in the bucket (if there are things sharing the common prefix that you don't want to match):
System-images/
System-images/users/
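For concreteness, here is roughly how one of those literal prefixes gets wired into the bucket notification configuration with boto3 (the bucket name and Lambda ARN below are placeholders):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-upload-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:make-thumbnail",  # placeholder
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            # Literal prefix/suffix matching only; no wildcards
                            {"Name": "prefix", "Value": "System-images/users/"},
                            {"Name": "suffix", "Value": ".jpg"},
                        ]
                    }
                },
            }
        ]
    },
)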
Wildcards in the prefix/suffix filters of Lambda notifications are not supported and never will be, since the asterisk (*) is a valid character that can be used in S3 object key names. However, you can work around this problem by adding a filter in your Lambda function. For example:
First, get the source key:
var srcKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));
Then, check if it is inside the users folder:
if (srcKey.indexOf('/users/') === -1) {
    callback('Not inside users folder!');
    return;
}

Search for a specific file in S3 using boto

I am using boto to parse S3 buckets. Basically I want to find a certain file in the bucket (say *.header, or any other regex expression the user has provided). Since I could not find any function for that in boto, I was trying to write a BFS routine to search through the contents of each folder, but I couldn't find any method to get the contents of a folder by key/key.name (which I am getting from bucketObj.list()). Is there any other method for doing this?
For instance, let's say I have multiple folders in the bucket, like
mybucket/A/B/C/x.txt
mybucket/A/B/D/y.jpg
mybucket/A/E/F/z.txt
and I want to find where all the *.txt files are,
so the boto script should return the following result:
mybucket/A/B/C/x.txt
mybucket/A/E/F/z.txt
There is no way to do wildcard searches or file globbing server-side with S3. The only filtering available via the API is by prefix: if you specify a prefix string, only results whose keys begin with that prefix are returned.
Otherwise, all filtering has to happen on the client side. Alternatively, you could store your keys in a database, do the searching there, and only retrieve the matches from S3.
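That client-side filtering is easy enough; a sketch with boto3 (the newer SDK, rather than the boto used in the question) that lists keys under a prefix and matches a user-supplied glob with fnmatch:

import fnmatch
import boto3

s3 = boto3.client("s3")
pattern = "*.txt"  # user-supplied glob

matches = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mybucket", Prefix="A/"):
    for obj in page.get("Contents", []):
        if fnmatch.fnmatch(obj["Key"], pattern):
            matches.append(obj["Key"])

# matches -> ["A/B/C/x.txt", "A/E/F/z.txt"]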

ruby on rails AWS-S3 list files in a bucket

I have a select box that I wish to fill with filenames from a client's S3 bucket.
In my controller I set the variable like this:
@files = AWS::S3::Bucket.find("clientsbucket").objects
which, when called in the view as options_for_select(@files), gives a list of objects but in the format <AWS::S3::Object:0x4f9e5b8>, <AWS::S3::Object:0x4f9e5a0>, etc.
For the life of me I can't figure out how to list the filename instead of this object info.
Any help muchly appreciated.
Well, access the key property of each object in the view!
The key property is the entire path to the file in the bucket.
<% objects.each do |object| %>
  <%= object.key %>
<% end %>
Even though the AWS SDK documentation isn't very informative, try to dig around.
Use the as_tree method on the objects so you can get the specific data you want.
http://docs.amazonwebservices.com/AWSRubySDK/latest/AWS/S3/Tree.html
Good luck!!

Preventing YQL from URL encoding a key

I am wondering if it is possible to prevent YQL from URL encoding a key for a datatable?
Example:
The current guardian API works with IDs like this:
item_id = "environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy"
The problem with these IDs is that they contain slashes (/), and these characters should not be URL-encoded in the API call but should instead stay as they are.
So if I now have this query
SELECT * FROM guardian.content.item WHERE item_id='environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy'
while using the following URL definition in my datatable
<url>http://content.guardianapis.com/{item_id}</url>
then this results in the following API call:
http://content.guardianapis.com/environment%2F2010%2Foct%2F29%2Fbiodiversity-talks-ministers-nagoya-strategy?format=xml&order-by=newest&show-fields=all
Instead the guardian API expects the call to look like this:
http://content.guardianapis.com/environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy?format=xml&order-by=newest&show-fields=all
So the problem is really just that the / characters get encoded as %2F, which I don't want to happen in this case.
Any ideas on how this can be achieved?
You can also check the full datatable I am using:
http://github.com/spier/yql-tables/blob/master/guardian/guardian.content.item.xml
The URI-template expansions in YQL (e.g. {item_id}) only follow version 3 of the spec. With version 4 it would be possible to change the expansion slightly to do what you want, but alas, not currently with YQL.
So, a solution. You could bring a very, very basic <execute> block into play: one which adds the item_id value to the path as needed.
<execute><![CDATA[
response.object = request.path(item_id).get().response;
]]></execute>
Finally, see the diff against your table (with a few other, minor tweaks to allow the above to work).