Lambda getting access denied even though it has the necessary permissions - amazon-s3

I have a lambda that is triggered whenever a new CSV file is added to an S3 bucket. It parses the CSV into its individual rows and puts them onto an SQS queue to be processed further.
The problem is that even though the lambda has the appropriate permission (s3:GetObject on arn:aws:s3:::my-bucket-name/*), it always fails with an Access Denied error when calling GetObject.
Any idea why this is happening?

The issue was that the object key, as received by the lambda in the S3 event, was URL-encoded, causing the lambda to look for a non-existent file.
AWS treats a request for a non-existent key the same as an attempt to access a restricted resource, which is why I was receiving the somewhat misleading Access Denied error.
To fix it, I changed the file naming scheme to a simpler one that is unaffected by the encoding.
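For reference, another way to address this (not what the asker ultimately did) is to URL-decode the key taken from the event before calling GetObject. A minimal sketch, assuming a standard S3-triggered handler and boto3:

    import urllib.parse

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Object keys in S3 event notifications are URL-encoded
        # (spaces arrive as '+', for example), so decode before GetObject.
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        response = s3.get_object(Bucket=bucket, Key=key)
        return response["Body"].read().decode("utf-8")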

Related

Make airflow read from S3 and post to slack?

I have a requirement where I want my airflow job to read a file from S3 and post its contents to Slack.
Background
Currently, the airflow job has an S3 key sensor that waits for a file to be put in an S3 location; if the file doesn't appear within the stipulated time, the job fails and pushes error messages to Slack.
What needs to be done now
If the airflow job succeeds, it needs to check another S3 location and, if a file exists there, push its contents to Slack.
Is this use case possible with airflow?
You have already figured out that the first step of your workflow has to be an S3KeySensor.
As for the subsequent steps, depending on what you mean by "..it needs to check another S3 location and if file there exists..", you can go about it in the following way:
Step 1
a. If the file at the other S3 location is also supposed to appear there after some time, then of course you will require another S3KeySensor.
b. Otherwise, if this other file is simply expected to be there already (or not, but doesn't need to be waited on), perform the check for its presence using the check_for_key(..) function of S3_Hook (this can be done within the python_callable of a simple PythonOperator, or of any other custom operator that you are using for step 2).
Step 2
By now it is ascertained that the second file is present in the expected location (otherwise we wouldn't have come this far). Now you just need to read the contents of this file using the read_key(..) function and push them to Slack using the call(..) function of SlackHook. You might have an urge to use SlackAPIOperator (which you can, of course), but reading the file from S3 and sending its contents to Slack should still be clubbed into a single task, so you are better off doing both in a generic PythonOperator, employing the same hooks that the native operators use. A sketch of such a task follows.
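A minimal sketch of that combined task, assuming Airflow 1.10-style hook imports and hypothetical connection ids, bucket, key, and channel (module paths and the SlackHook.call signature vary across Airflow versions):

    from airflow.hooks.S3_hook import S3Hook
    from airflow.hooks.slack_hook import SlackHook
    from airflow.operators.python_operator import PythonOperator

    def read_s3_and_post_to_slack(**context):
        s3 = S3Hook(aws_conn_id="aws_default")            # hypothetical conn id
        bucket, key = "my-bucket", "reports/output.txt"    # hypothetical location

        # Step 1b: check that the second file is actually there.
        if not s3.check_for_key(key, bucket_name=bucket):
            raise ValueError("s3://%s/%s does not exist" % (bucket, key))

        # Step 2: read the file and push its contents to Slack.
        contents = s3.read_key(key, bucket_name=bucket)
        slack = SlackHook(slack_conn_id="slack_default")   # hypothetical conn id
        slack.call("chat.postMessage",
                   {"channel": "#alerts", "text": contents})

    post_to_slack = PythonOperator(
        task_id="read_s3_and_post_to_slack",
        python_callable=read_s3_and_post_to_slack,
        provide_context=True,
        dag=dag,  # assumes an existing DAG object
    )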

S3 API operation failure, garbage handling

I have built, on top of the AWS S3 SDK, an operation that uses the SDK's copy operation.
I'm using multipart copy because my object is larger than the maximum allowed for a single copy (5 GB).
My question is: what happens if all parts of the multipart copy complete successfully except the last one?
Should I handle deleting the parts that have already been copied?
Generally, I would expect the copy operation to put the object in a temporary location and move it to the final name (in the destination S3 bucket) only once the operation has succeeded. Does it work like that?
If a part doesn't transfer successfully, you can send it again.
Until the parts are all copied and the multipart upload (including those created using upload-part-copy) is completed, you don't have an accessible object... but you are still being charged for storage of what you have successfully uploaded or copied, unless you clean up manually or configure the bucket to automatically purge incomplete multipart uploads.
Best practice is to do both: configure the bucket to discard incomplete uploads, but also configure your code to clean up after itself.
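A minimal boto3 sketch of both sides of that advice (bucket names, keys, and part size are hypothetical), aborting the multipart copy if any part fails and adding a lifecycle rule as a safety net:

    import boto3

    s3 = boto3.client("s3")

    SRC = {"Bucket": "source-bucket", "Key": "big-object"}      # hypothetical
    DEST_BUCKET, DEST_KEY = "dest-bucket", "big-object-copy"    # hypothetical
    PART_SIZE = 5 * 1024**3  # 5 GiB maximum per copied part

    def multipart_copy(total_size):
        upload = s3.create_multipart_upload(Bucket=DEST_BUCKET, Key=DEST_KEY)
        upload_id = upload["UploadId"]
        try:
            parts = []
            for i, start in enumerate(range(0, total_size, PART_SIZE), start=1):
                end = min(start + PART_SIZE, total_size) - 1
                resp = s3.upload_part_copy(
                    Bucket=DEST_BUCKET, Key=DEST_KEY, UploadId=upload_id,
                    PartNumber=i, CopySource=SRC,
                    CopySourceRange=f"bytes={start}-{end}",
                )
                parts.append({"ETag": resp["CopyPartResult"]["ETag"],
                              "PartNumber": i})
            s3.complete_multipart_upload(
                Bucket=DEST_BUCKET, Key=DEST_KEY, UploadId=upload_id,
                MultipartUpload={"Parts": parts},
            )
        except Exception:
            # Clean up the already-copied parts so they stop accruing charges.
            s3.abort_multipart_upload(Bucket=DEST_BUCKET, Key=DEST_KEY,
                                      UploadId=upload_id)
            raise

    # Bucket-side safety net: purge incomplete multipart uploads after 7 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket=DEST_BUCKET,
        LifecycleConfiguration={"Rules": [{
            "ID": "abort-incomplete-mpu",
            "Status": "Enabled",
            "Filter": {},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]},
    )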
It looks like the AWS SDK doesn't write/close the destination as an S3 object until it has finished copying the entire object successfully.
I ran a simple test to verify whether it writes the parts during the copy-part calls, and it does not write the object to S3 at that point.
So the answer is that a multipart copy won't write the object until all parts have been copied successfully to the destination bucket.
There is no need for cleanup of a partially written destination object, although (as noted above) the copied parts do still incur storage charges until the upload is completed or aborted.

BigQuery InternalError loading from Cloud Storage (works with direct file upload)

Whenever I try to load a CSV file stored in Cloud Storage into BigQuery, I get an InternalError (both using the web interface and the command line). The CSV is an (abbreviated) part of the Google Ngram dataset.
A command like:
bq load 1grams.ngrams gs://otichybucket/import_test.csv word:STRING,year:INTEGER,freq:INTEGER,volume:INTEGER
gives me:
BigQuery error in load operation: Error processing job 'otichyproject1:bqjob_r28187461b449065a_000001504e747a35_1': An internal error occurred and the request could not be completed.
However, when I load this file directly through the web interface, using File upload as the source (loading from my local drive), it works.
I need to load from Cloud Storage, since I need to load much larger files (the original ngrams datasets).
I tried different files, always with the same result.
I'm an engineer on the BigQuery team. I was able to look up your job, and it looks like there was a problem reading the Google Cloud Storage object.
Unfortunately, we didn't log much of the context, but looking at the code, the things that could cause this are:
The URI you specified for the job is somehow malformed. It doesn't look malformed, but maybe there is some odd UTF8 non-printing character that I didn't notice.
The 'region' for your bucket is somehow unexpected. Is there any chance you've set the data location on your GCS bucket to something other than {US, EU, or ASIA}? See here for more info on bucket locations. If so, and you've set the location to a region rather than a continent, that could cause this error.
There could have been some internal error in GCS that caused this. However, I didn't see this in any of the logs, and it should be fairly rare.
We're putting in more logging to detect this in the future and fixing the issue with regional buckets (regional buckets may still fail, because BigQuery doesn't support cross-region data movement, but at least they will fail with an intelligible error).
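If you want to rule out the second point, one way to check the bucket's location is with the google-cloud-storage client; a small sketch (the library choice is an assumption, the bucket name comes from the question):

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("otichybucket")  # bucket name from the question
    # A regional value here (rather than a multi-region like US, EU, or ASIA)
    # would match the failure mode described in the answer above.
    print(bucket.location)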

What's My.Computer.Network.UploadFile behavior on duplicate filename?

I have been given a program that uploads PDF files to an FTP server, which is something I have never done before. I've been asked what the behavior is when attempting to upload a duplicate filename. The program apparently doesn't check for duplicate filenames itself, but the command that uploads the file is My.Computer.Network.UploadFile, and I can't find anywhere what happens when attempting to upload a duplicate file: does it throw an exception or overwrite the file?
It looks like My.Computer.Network.UploadFile is a wrapper around WebClient.UploadFile, and the documentation for that states:
This method uses the STOR command to upload an FTP resource.
In the FTP RFC 959 it says (the relevant part is the sentence about a file that already exists):
STORE (STOR)
This command causes the server-DTP to accept the data
transferred via the data connection and to store the data as
a file at the server site. If the file specified in the
pathname exists at the server site, then its contents shall
be replaced by the data being transferred. A new file is
created at the server site if the file specified in the
pathname does not already exist.
So, if everything is following standards (and that part of RFC 959 hasn't been replaced, I didn't dig further!), then it should replace the existing file. However, it is possible for the server to deny overwriting of existing files, so the behavior is not guaranteed.
Of course, the best thing to do would be to try it out in your environment and see what it does.
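If you do want to test it, a quick sketch using Python's ftplib (host, credentials, and filename are hypothetical) that issues the same STOR command twice and checks which contents survive:

    from ftplib import FTP
    from io import BytesIO

    # Hypothetical server and credentials; ftplib's storbinary issues the
    # same STOR command that My.Computer.Network.UploadFile uses.
    ftp = FTP("ftp.example.com")
    ftp.login("user", "password")

    ftp.storbinary("STOR duplicate-test.pdf", BytesIO(b"first version"))
    ftp.storbinary("STOR duplicate-test.pdf", BytesIO(b"second, longer version"))

    # If STOR overwrites (the RFC 959 behavior), the size matches the second upload.
    print(ftp.size("duplicate-test.pdf"))
    ftp.quit()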

Write files to S3 through Java

I have a program which takes input from S3, generates a text file, and then sends it to the mapper class. I am unable to write the file to S3 so that the mapper can read it later. I realize that we cannot write files to S3 directly, so I am trying to upload the generated text file to S3 using copyFromLocalFile(). However, I get a NullPointerException on the following line:
fs.copyFromLocalFile(true, new Path(tgiPath), mapIP);
I am creating the text file in the main function, so I am not sure where exactly it is being created. The only reason for the NullPointerException that I can think of is that the text file is not being written to the local disk. So my question is: how do I write files to the local disk? If I just specify the name of the file when creating it, where is it created and how do I access it?
Have a look at Jets3t
This seems to be exactly what you need.
Jets3t is awesome, but I am using Google's App Engine, and it doesn't work there because of threading limitations.
I banged my head against the wall until I came up with a solution that worked on App Engine by combining a bunch of existing libraries: http://socialappdev.com/using-amazon-s3-with-google-app-engine-02-2011