S3 API operation failure, garbage handling - amazon-s3

I have built an operation on top of the AWS S3 SDK that uses the SDK's copy operation.
I'm using multipart copy because my object is larger than the 5 GB maximum for a single copy.
My question is: what happens if all parts of the multipart copy complete successfully except the last one?
Should I handle deleting the parts that have already been copied?
Generally I'm expecting the copy operation to put the object in a temporary location and, only if the operation succeeds, move it to its final name in the destination S3 bucket. Does it work like that?

If a part doesn't transfer successfully, you can send it again.
Until all the parts are copied and the multipart upload (including one assembled with upload-part-copy) is completed, you don't have an accessible object... but you are still being charged for storage of whatever you have successfully uploaded/copied, unless you clean up manually or configure the bucket to automatically purge incomplete multipart uploads.
Best practice is to do both -- configure the bucket to discard, but also configure your code to clean up after itself.
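To make that concrete, here is a minimal sketch of both halves, assuming boto3; the bucket names, the 1 GB part size and the 7-day lifecycle window are placeholder choices, not anything mandated by S3.

import boto3

s3 = boto3.client("s3")

def multipart_copy(src_bucket, src_key, dst_bucket, dst_key, size, part_size=1024**3):
    # Start the multipart upload whose parts are server-side copies of the source.
    upload_id = s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)["UploadId"]
    try:
        parts = []
        for i, start in enumerate(range(0, size, part_size), start=1):
            end = min(start + part_size, size) - 1
            resp = s3.upload_part_copy(
                Bucket=dst_bucket, Key=dst_key, UploadId=upload_id, PartNumber=i,
                CopySource={"Bucket": src_bucket, "Key": src_key},
                CopySourceRange=f"bytes={start}-{end}",
            )
            parts.append({"PartNumber": i, "ETag": resp["CopyPartResult"]["ETag"]})
        # Only this call makes the destination object visible.
        s3.complete_multipart_upload(
            Bucket=dst_bucket, Key=dst_key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Code-level cleanup: abort so the already-copied parts stop accruing storage charges.
        s3.abort_multipart_upload(Bucket=dst_bucket, Key=dst_key, UploadId=upload_id)
        raise

# Bucket-level safety net: discard any multipart upload not completed within 7 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-dest-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "abort-incomplete-mpu",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }]},
)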

It looks like the AWS SDK doesn't write/close the object as an S3 object until the entire object has finished copying successfully.
I ran a simple test to verify whether the parts are written during the part-copy calls, and it looks like the object is not written to S3 at that point.
So the answer is that a multipart copy won't write the object until all parts have been copied successfully to the destination bucket.
There is no need for cleanup.


Make airflow read from S3 and post to slack?

I have a requirement where I want my airflow job to read a file from S3 and post its contents to slack.
Background
Currently, the airflow job has an S3 key sensor that waits for a file to be put in an S3 location and if that file doesn't appear in the stipulated time, it fails and pushes error messages to slack.
What needs to be done now
If airflow job succeeds, it needs to check another S3 location and if file there exists, then push its contents to slack.
Is this usecase possible with airflow?
You have already figured out that the first step of your workflow has to be an S3KeySensor.
As for the subsequent steps, depending on what you mean by "..it needs to check another S3 location and if file there exists..", you can go about it in the following way:
Step 1
a. If the file at the other S3 location is also supposed to appear there after some time, then of course you will need another S3KeySensor.
b. Otherwise, if this other file is simply expected to already be there (or not, but doesn't need to be waited on), perform the check for its presence using the check_for_key(..) function of S3_Hook (this can be done within the python_callable of a simple PythonOperator, or any other custom operator you are using for step 2).
Step 2
By now it is ascertained that the second file is present in the expected location (otherwise we wouldn't have come this far). Now you just need to read the contents of that file using the read_key(..) function, and then push the contents to Slack using the call(..) function of SlackHook. You might have an urge to use SlackAPIOperator (which you can, of course), but reading the file from S3 and sending its contents to Slack should still be clubbed into a single task, so you are better off doing both in a generic PythonOperator, employing the same hooks that the native operators use.
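A rough sketch of what that single task could look like, assuming the Airflow 2.x provider packages (import paths and hook arguments vary across versions; the connection IDs, DAG name, bucket, key and Slack channel are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.slack.hooks.slack import SlackHook

def s3_to_slack(**context):
    s3_hook = S3Hook(aws_conn_id="aws_default")
    bucket, key = "my-bucket", "reports/summary.txt"  # placeholder location

    # Step 1b: check for presence of the second file (no waiting involved).
    if not s3_hook.check_for_key(key, bucket_name=bucket):
        raise ValueError(f"s3://{bucket}/{key} does not exist")

    # Step 2: read the contents and push them to Slack within the same task.
    contents = s3_hook.read_key(key, bucket_name=bucket)
    SlackHook(slack_conn_id="slack_default").call(
        "chat.postMessage", json={"channel": "#alerts", "text": contents}
    )

with DAG("s3_to_slack_example", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    post_to_slack = PythonOperator(task_id="post_s3_file_to_slack",
                                   python_callable=s3_to_slack)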

How to manage user profile pic updates on AWS S3?

I am using AWS S3 for saving user profile pics on a mobile app.
How do I guarantee that a request for one of those pics will not return a corrupted file if it arrives while the user is updating the image?
Please note that although those files are small it could happen that the connection on the mobile app drops, resulting in a stalled upload (maybe even for hours).
My first idea was to upload the new file under a temporary name and upon completion delete the original file and rename the uploaded file.
I couldn't find any commands for that in the iOS SDK though.
Another approach would be to just increment a number with the filename and always point to the new file in the database upon completion. But this results in a big headache and unneeded complexity for cleanup since I am using a denormalized nosql database.
any ideas?
You're worrying about a non-problem.
S3 uploads are atomic. When you overwrite an object on S3, there is zero chance of corrupting a download of the previous object. The object isn't technically "overwritten" -- it is replaced -- a fine distinction, but with a difference -- nothing at all happens to the old object until the replacement upload has finished successfully.
(In fact, it's possible though unlikely that the previous object will still be returned for a short time after the new upload has completed, because of S3's eventual consistency model on overwrites).
Additionally, if you send the Content-MD5 header with an S3 upload, a failure in the upload process (stall, lost connection, corruption, etc.) cannot result in the replacement object being stored at all: unless the uploaded bytes validate against the specified Content-MD5, S3 aborts the operation and the prior version remains intact. (The SDK should be doing this for you.)
Note that this holds true whether or not object versioning is enabled on the bucket.
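If you want to be explicit about that Content-MD5 protection rather than relying on the SDK, a minimal sketch with boto3 could look like this (bucket and key names are placeholders):

import base64
import hashlib
import boto3

s3 = boto3.client("s3")

with open("avatar.jpg", "rb") as f:
    body = f.read()

md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

# If the bytes that arrive at S3 don't match this digest, the PUT is rejected
# and the previously stored profile picture remains untouched.
s3.put_object(
    Bucket="profile-pics",
    Key="users/1234/avatar.jpg",
    Body=body,
    ContentMD5=md5_b64,
)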

Simple way to load new files only into Redshift from S3?

The documentation for the Redshift COPY command specifies two ways to choose files to load from S3, you either provide a base path and it loads all the files under that path, or you specify a manifest file with specific files to load.
However in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.
Given that there is a table stl_file_scan that logs all the files that have been loaded from S3, it would be nice to somehow exclude those that have successfully been loaded. This seems like a fairly obvious feature, but I can't find anything in the docs or online about how to do this.
Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data -- new and old -- to a staging table, and then comparing/upserting to the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.
I know we could probably move the files that have already been loaded out of the bucket; however, we can't do that, as this bucket is the final storage place for another process which is not our own.
The only alternative I can think of is to have some other process running that tracks the files that have been successfully loaded into Redshift, periodically compares that list to the S3 bucket to determine the differences, and then writes the manifest file somewhere before triggering the copy process. But what a pain! We'd need a separate EC2 instance to run the process, which would have its own management and operational overhead.
There must be a better way!
This is how I solved the problem,
S3 -- (Lambda Trigger on newly created Logs) -- Lambda -- Firehose -- Redshift
It works at any scale: with more load there are more calls to Lambda and more data flowing to Firehose, and everything is taken care of automatically.
If there are issues with the format of a file, you can configure dead-letter queues; the events will be sent there, and you can reprocess them once you fix the Lambda.
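For illustration, the Lambda side of such a pipeline might look roughly like this, assuming boto3 and reasonably small objects (the delivery stream name is a placeholder):

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

def handler(event, context):
    # Triggered by S3 ObjectCreated notifications.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Firehose batches these records and issues the Redshift COPY for you.
        firehose.put_record(
            DeliveryStreamName="redshift-delivery-stream",
            Record={"Data": body},
        )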
Here I would like to mention the steps involved in loading data into Redshift:
1. Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during the export).
2. Split the files into chunks of 10-15 MB each to get optimal performance during upload and the final data load.
3. Compress the files to *.gz format so you don't end up with a $1000 surprise bill :) .. in my case text files compressed 10-20 times.
4. List all file names in a manifest file so that when you issue the COPY command to Redshift it is treated as one unit of load (steps 4, 5 and 7 are sketched after this list).
5. Upload the manifest file to the Amazon S3 bucket.
6. Upload the local *.gz files to the Amazon S3 bucket.
7. Issue the Redshift COPY command with the appropriate options.
8. Schedule file archiving from on-premises and the S3 staging area on AWS.
9. Capture errors and set up restartability in case something fails.
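As a rough illustration of steps 4, 5 and 7, assuming boto3 and psycopg2 (every bucket, table, role and credential value below is a placeholder):

import json
import boto3
import psycopg2

# Step 4: list the files to load in a manifest so COPY treats them as one unit.
files = ["s3://my-bucket/staging/part-001.gz", "s3://my-bucket/staging/part-002.gz"]
manifest = {"entries": [{"url": url, "mandatory": True} for url in files]}

# Step 5: upload the manifest to S3.
boto3.client("s3").put_object(
    Bucket="my-bucket",
    Key="manifests/load.manifest",
    Body=json.dumps(manifest).encode("utf-8"),
)

# Step 7: issue the COPY with MANIFEST and GZIP options.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="loader", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_schema.my_table
        FROM 's3://my-bucket/manifests/load.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        MANIFEST GZIP DELIMITER '|';
    """)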
In general, comparing loaded files against what already exists on S3 is a bad (though possible) practice. The common "industrial" practice is to use a message queue between the data producer and the data consumer that actually loads the data. Take a look at RabbitMQ vs Amazon SQS, etc.
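One possible shape of that queue-based pattern, assuming S3 event notifications are delivered to SQS and the loader uses boto3 (the queue URL is a placeholder):

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-s3-files"

def drain_new_files():
    # Collect the S3 URLs announced since the last load; each message body is
    # an S3 event notification document.
    new_files = []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=2)
        messages = resp.get("Messages", [])
        if not messages:
            return new_files
        for msg in messages:
            body = json.loads(msg["Body"])
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                new_files.append(f"s3://{bucket}/{key}")
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])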

Moving files >5 gig to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code which are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if a multipart upload is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it does use multipart uploading. Otherwise the current Java code, which is agnostic about S3, either has to write directly to S3 or has to do what it used to do and then use multipart uploading -- in fact, I would think the code would just write directly to S3 and not worry about uploading.
Can anyone tell me whether Pipelines can use multipart uploading and, if not, whether the correct approach is to have the program write directly to S3, or to continue writing to local storage and then have a separate program invoked within the same Pipeline perform the multipart upload?
The answer, based on AWS support, is that indeed files of 5 GB or more can't be uploaded directly to S3 in a single operation. And there is currently no way for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5GB limit imposed by S3 for each file-part put.
You need to write your own script wrapping AWS CLI or S3cmd (older). This script may be executed as a shell activity.
Writing directly to S3 may be an issue as S3 does not support append operations - unless you can somehow write multiple smaller objects in a folder.
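If the wrapper script ends up being Python rather than the AWS CLI or s3cmd, one option is boto3's managed transfer, which switches to multipart automatically above a configurable threshold (the paths, bucket and sizes below are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
)

# upload_file handles the part splitting, retries and completion internally,
# so files over 5 GB need no special treatment in the calling code.
s3.upload_file(
    Filename="/mnt/output/large-file.dat",
    Bucket="my-dest-bucket",
    Key="exports/large-file.dat",
    Config=config,
)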

Folder won't delete on Amazon S3

I'm trying to delete a folder created as a result of a MapReduce job. Other files in the bucket delete just fine, but this folder won't delete. When I try to delete it from the console, the progress bar next to its status just stays at 0. Have made multiple attempts, including with logout/login in between.
I had the same issue and used AWS CLI to fix it:
aws s3 rm s3://<your-bucket>/<your-folder-to-delete>/ --recursive ;
(this assumes you have run aws configure and aws s3 ls s3://<your-bucket>/ already works)
First and foremost, Amazon S3 doesn't actually have a native concept of folders/directories, rather is a flat storage architecture comprised of buckets and objects/keys only - the directory style presentation seen in most tools for S3 (including the AWS Management Console itself) is based solely on convention, i.e. simulating a hierarchy for objects with identical prefixes - see my answer to How to specify an object expiration prefix that doesn't match the directory? for more details on this architecture, including quotes/references from the AWS documentation.
Accordingly, your problem might stem from a tool using a different convention for simulating this hierarchy, see for example the following answers in the AWS forums:
Ivan Moiseev's answer to the related question Cannot delete file from bucket, where he suggests to use another tool to inspect whether you might have such a problem and remedy it accordingly.
The AWS team response to What are these _$folder$ objects? - This is a convention used by a number of tools including Hadoop to make directories in S3. They're primarily needed to designate empty directories. One might have preferred a more aesthetic scheme, but well that is the way that these tools do it.
Good luck!
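If you want to inspect what actually lives under that "folder" before (or instead of) retrying the console delete, a small boto3 sketch along these lines can help (bucket and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"
prefix = "mapreduce-output"  # no trailing slash, so it also matches a "mapreduce-output_$folder$" placeholder key

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        print("deleting", obj["Key"])
        s3.delete_object(Bucket=bucket, Key=obj["Key"])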
I was getting the following error when I tried to delete a folder in a bucket that held log files from CloudFront.
An unexpected error has occurred. Please try again later.
After I disabled logging in Cloudfront I was able to delete the folder successfully.
My guess is that it was a system folder used by Cloudfront that did not allow deletion by the owner.
In your case, you may want to check if MapReduce is holding on to the folder in question.
I was facing the same problem. I tried many login/logout attempts and refreshes, but the problem persisted. I searched Stack Overflow and found suggestions to cut and paste the folder into a different folder and then delete it, but that didn't work.
Another thing you should look at is versioning, which might affect your bucket; suspending versioning may allow you to delete the folder.
My solution was to delete it with code. I have used the boto package in Python for file handling on S3, and the deletion worked when I deleted that folder from my Python code.
import boto
from boto.s3.key import Key

keyId = "your_aws_access_key"
sKeyId = "your_aws_secret_key"
fileKey = "dummy/foldertodelete/"  # key of the folder placeholder to delete
bucketName = "mybucket001"         # name of the bucket where the key resides

conn = boto.connect_s3(keyId, sKeyId)  # connect to S3
bucket = conn.get_bucket(bucketName)   # get the bucket object
k = Key(bucket, fileKey)               # build the key for the given object
k.delete()                             # delete it
S3 doesn't keep directories; it just has a flat structure, so everything is managed by key.
For you it's a folder, but for S3 it's just a key.
If you want to delete a folder named "dummy",
then the key would be
fileKey = "dummy/"
First, read the contents of the directory with the getBucket method; you get back an array of all the files, then delete each file with the deleteObject method.
if (($contents = $this->S3->getBucket(AS_S3_BUCKET, "file_path")) !== false)
{
    foreach ($contents as $file)
    {
        $result = $this->S3->deleteObject(AS_S3_BUCKET, $file['name']);
    }
}
Here, $this->S3 is the S3 class object, and AS_S3_BUCKET is the bucket name.