AWS S3: check whether a file already exists (Swift)

I am uploading a large number of files of different types to AWS S3.
I want to avoid re-uploading to AWS if the same object already exists there.
For this I found the listObjectsV2() function, but its response is very long; with 200+ objects it does not seem feasible. So I am looking for a function to which I can pass a path and which tells me whether the object exists or not.
The default behaviour of S3 is to overwrite the file, but I need to prevent this.
I am new to AWS S3, so please share solutions and thoughts.
I tried listObjectsV2().
I am looking for some function like:
func isAlreadyExist(url: String) -> Bool {
    // return true / false
}
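For what it's worth, the usual way to check this is a HEAD request on the object key rather than listing the whole bucket. Here is a rough illustration of that approach, shown with Python/boto3 rather than the Swift SDK; the bucket and key names are placeholders:
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def is_already_uploaded(bucket: str, key: str) -> bool:
    """Check whether an object already exists by issuing a HEAD request."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        # A 404 response means the object does not exist; re-raise anything else.
        if err.response["Error"]["Code"] == "404":
            return False
        raise

# Example: only upload when the key is not there yet (placeholder names).
if not is_already_uploaded("my-bucket", "uploads/photo.jpg"):
    s3.upload_file("photo.jpg", "my-bucket", "uploads/photo.jpg")
The AWS SDKs for other platforms expose the same HeadObject operation, so a Swift helper like isAlreadyExist would wrap whatever head-object call the iOS SDK provides.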


Passing AWS role to the application that uses default boto3 configs

I have an AWS setup that requires me to assume a role and get the corresponding credentials in order to write to S3. For example, to write with the AWS CLI, I need to use the --profile readwrite flag. If I write the code myself with boto3, I'd assume the role via STS, get credentials, and create a new session.
However, there are a number of applications and packages relying on boto3's default configuration; e.g. internal code runs like this:
s3 = boto3.resource('s3')
result_s3 = s3.Object(bucket, s3_object_key)
result_s3.put(
    Body=value.encode(content_encoding),
    ContentEncoding=content_encoding,
    ContentType=content_type,
)
According to the documentation, boto3 can be set to use a default profile via (among others) the AWS_PROFILE environment variable, and it clearly "works" in the sense that boto3.Session().profile_name does match the variable - but the applications still won't write to S3.
What would be the cleanest/most correct way to set this up properly? I tried pulling credentials from STS and writing them out as AWS_SECRET_TOKEN etc., but that didn't work for me...
Have a look at the answer here:
How to choose an AWS profile when using boto3 to connect to CloudFront
You can get boto3 to use the other profile like so:
rw = boto3.session.Session(profile_name='readwrite')
s3 = rw.resource('s3')
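To tie that back to the question's snippet, here is a small sketch of the same write going through the profile-scoped session; the bucket, key, payload and content-type values below are placeholders standing in for the question's variables:
import boto3

# Placeholder values standing in for the variables used in the question.
bucket = "my-bucket"
s3_object_key = "results/output.json"
value = '{"status": "ok"}'
content_encoding = "utf-8"
content_type = "application/json"

# Session bound to the read/write profile instead of the default one.
rw = boto3.session.Session(profile_name='readwrite')
s3 = rw.resource('s3')

# Same put as in the question, but issued through the profile-scoped resource.
s3.Object(bucket, s3_object_key).put(
    Body=value.encode(content_encoding),
    ContentEncoding=content_encoding,
    ContentType=content_type,
)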
I think the correct answer to my question is the one shared by Nathan Williams in the comments.
In my specific case, given that I had to initiate the code from Python and was a bit worried about AWS settings that might spill over into other operations, I used the fact that boto3 has a DEFAULT_SESSION singleton, used on every call, and simply overwrote it with a session that assumed the proper role:
hook = S3Hook(aws_conn_id=aws_conn_id)
boto3.DEFAULT_SESSION = hook.get_session()
(here, S3Hook is Airflow's S3 handling object). After that, everything in the same runtime worked perfectly.
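Outside of Airflow, a rough sketch of the same idea would be to assume the role explicitly via STS and install the resulting session as boto3's default; the role ARN and session name below are placeholders:
import boto3

# Assume the role explicitly and build a session from the temporary credentials.
sts = boto3.client('sts')
creds = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/readwrite',  # placeholder ARN
    RoleSessionName='readwrite-session',
)['Credentials']

# Overwrite the DEFAULT_SESSION singleton so that plain boto3.resource('s3')
# calls made by other packages pick up the assumed-role credentials.
boto3.DEFAULT_SESSION = boto3.session.Session(
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken'],
)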

Lambda not invoked when the uploaded files in the S3 bucket are large?

I have created a Lambda function that is invoked and performs a transformation based on events in the target source bucket.
This works fine when I upload a small file to the targeted source bucket.
But when I upload a large file (e.g. a 65 MB file), it looks like the Lambda is not invoked for that event.
I'd appreciate it if anyone can help with this kind of issue.
Thanks
I am guessing that big files are uploaded to S3 via S3 Multipart Upload instead of a regular put-object operation.
Maybe your Lambda function is only subscribed to s3:ObjectCreated:Put events. You need to subscribe it to s3:ObjectCreated:CompleteMultipartUpload events as well.
Large files are uploaded to S3 via S3 Multipart Upload instead of a regular PUT (single-part) upload.
There can be two problems:
In your Lambda you have probably created the subscription for s3:ObjectCreated:Put events only. You should add s3:ObjectCreated:CompleteMultipartUpload to the Lambda subscription list too.
Your Lambda timeout could be too small and only work for the smaller files. You might want to increase it.
There could be any of these issues:
Your event only captures the s3:ObjectCreated:Put event, as others have mentioned. Usually, if it's a big file, the event is s3:ObjectCreated:CompleteMultipartUpload instead. You could either (a) add the s3:ObjectCreated:CompleteMultipartUpload event to your capture, or (b) simply use the s3:ObjectCreated:* event, which includes Put, multipart upload, Post, Copy, and also similar events added in the future (source: https://aws.amazon.com/blogs/aws/s3-event-notification/); see the sketch after this list.
Your Lambda function might run longer than the timeout you set (the maximum is 15 min).
Your Lambda function might require more memory than the limit you set.
Your Lambda function might require more disk space than the limit you set. This may be an issue if your function downloads the data to disk first and performs the transformation there (the /tmp limit is 512 MB).
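As an illustration of option (b), here is a minimal boto3 sketch of wiring the notification to every ObjectCreated event type; the bucket name and Lambda ARN are placeholders, and the same change can be made in the S3 console under the bucket's event notification settings:
import boto3

s3 = boto3.client('s3')

# Subscribe the Lambda to all ObjectCreated event types (Put, Post, Copy and
# CompleteMultipartUpload), so multipart uploads of large files also trigger it.
s3.put_bucket_notification_configuration(
    Bucket='my-source-bucket',  # placeholder bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:my-transform',  # placeholder ARN
                'Events': ['s3:ObjectCreated:*'],
            }
        ]
    },
)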

Airflow S3Hook object has no attribute load_bytes

The goal of my operator is to communicate with S3 and then write some string data to my S3 bucket.
I saw that there is already an S3 hook to be used, and I thought this might be a better way than using boto3 directly.
The main logic looks like:
from airflow.hooks.S3_hook import S3Hook
hook = S3Hook('test_s3')
log.info(hook.load_bytes('some_data', 'some_key', 'a_bucket'))
Then I got an error like:
'S3Hook' object has no attribute 'load_bytes'
I'm pretty sure that the S3Hook class has that function (see here).
After this, I switched to the load_string function. However, Airflow then threw an error like:
'S3' object has no attribute 'upload_fileobj'
I'm using Airflow with S3 support, and I'm not sure why I get the error above. My test_s3 connection is fine, since I have tested it by using read_key to read some text files from S3 without any issue.
Does anyone else have a similar situation? I'm very confused; what did I miss? Thanks!
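For reference, a minimal sketch of the intended call with load_string, assuming an Airflow version whose S3Hook exposes it with the signature load_string(string_data, key, bucket_name=None, replace=False, encrypt=False); note also that load_bytes expects a bytes object, so b'some_data' would be needed in the snippet above. The 'upload_fileobj' error often points at an old boto3/botocore install, so upgrading those packages alongside Airflow is worth trying:
from airflow.hooks.S3_hook import S3Hook

# 'test_s3' is the connection id from the question; key and bucket are placeholders.
hook = S3Hook('test_s3')
hook.load_string('some_data', key='some_key', bucket_name='a_bucket', replace=True)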

Reducing Parse Server to only Parse Cloud

I'm currently using a self-hosted, up-to-date Parse Server, but I'm facing some security issues.
At the moment, calls to the /classes route can retrieve any object in any table, and even though I might want an object to be publicly readable, I don't want to expose all the fields of that object. In short, I don't want the database to be directly retrievable in any case; I would like to disable "everything" except Parse Cloud code. That is, I would still be able to call my own functions, but clients (Android, iOS, C#, JavaScript...) would not be able to retrieve data.
Is there any way to do this? I've been searching deeply for this and trying to debug some controllers, but I don't have a clue.
Thank you very much in advance.
tl;dr: set the ACL on all objects so they are only readable with the master key, and then tell the query in Cloud Code to use the master key when querying your data.
So, without changing Parse Server itself, you could make use of ACLs and only allow a specific user to access objects. You would then "log in" as that user in your Cloud Code and be able to access all objects.
As the old method, Parse.Cloud.useMasterKey(), isn't available in the open-source Parse Server, you will have to pass the useMasterKey parameter to the query you are running; this should do the trick for this particular request and will bypass ACLs/CLPs. There is an example in the Parse Server wiki as well.
For convenience, here is a short code example from the Wiki:
Parse.Cloud.define('getTotalMessageCount', function(request, response) {
    var query = new Parse.Query('Messages');
    query.count({
        useMasterKey: true
    }) // count() will use the master key to bypass ACLs
    .then(function(count) {
        response.success(count);
    });
});

Locally reading S3 files through Spark (or better: pyspark)

I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId
or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or so, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n; s3 did not work). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
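(As a side note, the fs.s3n.* keys are Hadoop configuration properties rather than Spark ones, so one place they can live, assuming a plain local pyspark setup, is the SparkContext's underlying Hadoop configuration; the key and secret values below are placeholders:)
from pyspark import SparkContext

sc = SparkContext()

# fs.s3n.* are Hadoop properties, so they are set on the Hadoop configuration
# behind the SparkContext rather than via SparkConf.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

rdd = sc.textFile("s3n://my_bucket/my_file")  # placeholder bucket/key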
Yes, you have to use s3n instead of s3. s3 is some weird abuse of S3, the benefits of which are unclear to me.
You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:
# conf takes a dict of Hadoop properties; note that hadoopFile also needs the
# input format, key and value class arguments shown in the last answer below.
rdd = sc.hadoopFile('s3n://my_bucket/my_file', conf={
    'fs.s3n.awsAccessKeyId': '...',
    'fs.s3n.awsSecretAccessKey': '...',
})
The problem was actually a bug in Amazon's boto Python module. It was related to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and can have some serious bugs that are very easy to run into. For the first problem, I would recommend updating the AWS command line interface, boto and Spark first whenever something strange happens: this has "magically" solved a few issues for me already.
Here is a solution on how to read the credentials from ~/.aws/credentials. It makes use of the fact that the credentials file is an INI file which can be parsed with Python's configparser.
import os
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile = 'default' # your AWS profile to use
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
See also my gist at https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 .
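As a small follow-up sketch, the values parsed above can be dropped straight into the fs.s3n.* properties instead of hard-coding them; the bucket path is a placeholder, sc is assumed to be a SparkContext as in the other answers, and the hadoopFile arguments mirror the config_dict example in the last answer below:
# Reuse the credentials parsed from ~/.aws/credentials above.
config_dict = {"fs.s3n.awsAccessKeyId": access_id,
               "fs.s3n.awsSecretAccessKey": access_key}

rdd = sc.hadoopFile("s3n://my_bucket/my_file",  # placeholder path
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.Text',
                    'org.apache.hadoop.io.LongWritable',
                    conf=config_dict)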
Setting environment variables could help.
In the Spark FAQ, under the question "How can I access data in S3?", they suggest setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
I cannot say much about the Java objects you have to give to the hadoopFile function, only that this function already seems deprecated in favour of newAPIHadoopFile. The documentation on this is quite sketchy, and I feel like you need to know Scala/Java to really get to the bottom of what everything means.
In the meantime, I figured out how to actually get some S3 data into pyspark, and I thought I would share my findings.
The Spark API documentation says that hadoopFile uses a dict that gets converted into a Java configuration (XML). I found the configuration for Java, which should probably reflect the values you should put into the dict: How to access S3/S3n from a local Hadoop installation
bucket = "mycompany-mydata-bucket"
prefix = "2015/04/04/mybiglogfile.log.gz"
filename = "s3n://{}/{}".format(bucket, prefix)

config_dict = {"fs.s3n.awsAccessKeyId": "FOOBAR",
               "fs.s3n.awsSecretAccessKey": "BARFOO"}

rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.Text',
                    'org.apache.hadoop.io.LongWritable',
                    conf=config_dict)
This code snippet loads the file from the bucket and prefix (file path in the bucket) specified on the first two lines.
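For what it's worth, a small usage sketch: the resulting RDD contains (key, value) pairs, and with TextInputFormat the line text should be the second element of each pair (the first being the byte offset):
# Pull the text lines out of the (offset, line) pairs and peek at a few.
lines = rdd.map(lambda kv: kv[1])
print(lines.take(5))

# Simple sanity check: count how many records were read.
print("record count:", rdd.count())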