S3 Copy Object with new metadata - amazon-s3

I am trying to set the Cache-Control header on all our existing files in the S3 storage by executing a copy to the exact same key but with new metadata. The S3 API supports this through the x-amz-metadata-directive: REPLACE header. The S3 API compatibility documentation at https://docs.developer.swisscom.com/service-offerings/dynamic.html#s3-api lists the Object Copy method as neither supported nor unsupported.
The copy itself works fine (to another key), but the option to set new metadata does not seem to work with either copying to the same or a different key. Is this not supported by the ATMOS s3-compatible API and/or is there any other way to update the metadata without having to read all the content and write it back to the storage?
I am currently using the Amazon Java SDK (v. 1.10.75.1) to make the calls.
UPDATE:
After some more testing it seems that the issue I am having is more specific. The copy works and I can change other metadata like Content-Disposition or Content-Type successfully. Just the Cache-Control is ignored.
As requested here is the code I am using to make the call:
BasicAWSCredentials awsCreds = new BasicAWSCredentials(accessKey, sharedsecret);
AmazonS3 amazonS3 = new AmazonS3Client(awsCreds);
amazonS3.setEndpoint(endPoint);
ObjectMetadata metadata = amazonS3.getObjectMetadata(bucketName, storageKey).clone();
metadata.setCacheControl("private, max-age=31536000");
CopyObjectRequest copyObjectRequest = new CopyObjectRequest(bucketName, storageKey, bucketName, storageKey).withNewObjectMetadata(metadata);
amazonS3.copyObject(copyObjectRequest);
Maybe the Cache-Control header on the PUT (Copy) request to the API is dropped somewhere on the way?
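For comparison, the same in-place copy can be sketched in Python. This is an illustration under assumptions, not a verified fix: it targets a boto3-style client (its `head_object` and `copy_object` calls), the helper names are made up here, and nothing about it has been tried against the ATMOS endpoint. It shows that with the REPLACE directive every header worth keeping has to be sent again, because the stored metadata is not carried over.

```python
# Headers that a REPLACE copy does not carry over unless they are sent again.
PRESERVED_HEADERS = ("ContentType", "ContentDisposition",
                     "ContentEncoding", "ContentLanguage")


def build_replace_copy_kwargs(bucket, key, head, cache_control):
    """Build keyword arguments for a boto3-style copy_object call that
    copies an object onto itself while replacing its metadata.

    `head` is the dict returned by a head_object call for the same key.
    """
    kwargs = {
        "Bucket": bucket,
        "Key": key,
        "CopySource": {"Bucket": bucket, "Key": key},
        "MetadataDirective": "REPLACE",  # sent as x-amz-metadata-directive
        "CacheControl": cache_control,
        "Metadata": dict(head.get("Metadata", {})),  # user-defined metadata
    }
    for name in PRESERVED_HEADERS:
        if name in head:
            kwargs[name] = head[name]
    return kwargs


def set_cache_control(client, bucket, key, cache_control):
    """Rewrite the Cache-Control of an existing object without downloading it."""
    head = client.head_object(Bucket=bucket, Key=key)
    kwargs = build_replace_copy_kwargs(bucket, key, head, cache_control)
    return client.copy_object(**kwargs)
```

With boto3 installed, `client` could be created with `boto3.client("s3", endpoint_url=endPoint)`. Inspecting the kwargs before sending them is also a cheap way to check whether Cache-Control leaves the client at all.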

According to the latest ATMOS Programmer's Guide, version 2.3.0, Tables 11 and 12, nothing states whether COPY of objects is supported or unsupported.
I've been working with ATMOS for quite some time, and my belief is that the S3 copy function is internally translated to a sequence of commands using ATMOS object versioning (page 76). So the Amazon copy operation might be translated to "create a version" followed by "delete or truncate the old referenced object". Maybe I'm totally wrong (since I don't work for EMC :-)) and they handle that in a different way... but that's how I read the native ATMOS API documentation.
What you could try to do:
Use the native ATMOS API (which is a bit painful, yes, I know), and then, create a version of the original object (page 76), update the metadata of such version (User Metadata, page 12), and then restore the version to the top-level object (page 131). After that, check if the metadata will be properly returned in the S3 API.
That's my 2 cents. If you decide to try such solution, post it here if that worked.

Related

Passing AWS role to the application that uses default boto3 configs

I have an AWS setup that requires me to assume a role and get the corresponding credentials in order to write to S3. For example, to write with the aws cli, I need to use the --profile readwrite flag. If I write the code myself with boto, I'd assume the role via STS, get credentials, and create a new session.
However, there is a bunch of applications and packages relying on boto3's configuration, e.g. internal code runs like this:
s3 = boto3.resource('s3')
result_s3 = s3.Object(bucket, s3_object_key)
result_s3.put(
    Body=value.encode(content_encoding),
    ContentEncoding=content_encoding,
    ContentType=content_type,
)
From documentation, boto3 can be set to use default profile using (among others) AWS_PROFILE env variable, and it clearly "works" in terms that boto3.Session().profile_name does match the variable - but the applications still won't write to s3.
What would be the cleanest/correct way to set them properly? I tried to pull credentials from sts, and write them as AWS_SECRET_TOKEN etc, but that didn't work for me...
Have a look at the answer here:
How to choose an AWS profile when using boto3 to connect to CloudFront
You can get boto3 to use the other profile like so:
rw = boto3.session.Session(profile_name='readwrite')
s3 = rw.resource('s3')
I think the correct answer to my question is one shared by Nathan Williams in the comment.
In my specific case, given that I had to initiate the code from Python and was a bit worried about setting AWS settings that might spill into other operations, I used the fact that boto3 has a DEFAULT_SESSION singleton, used each time, and just overwrote it with a session that assumed the proper role:
hook = S3Hook(aws_conn_id=aws_conn_id)
boto3.DEFAULT_SESSION = hook.get_session()
(here, S3Hook is Airflow's S3 handling object). After that (in the same runtime) everything worked perfectly.
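If the worry is that the overridden default leaks into unrelated code, the swap can be limited to a block and undone afterwards. A minimal sketch, assuming only that the module exposes a `DEFAULT_SESSION` attribute the way boto3 does; the helper name `temporary_default_session` is made up for this example.

```python
from contextlib import contextmanager


@contextmanager
def temporary_default_session(module, session):
    """Install `session` as module.DEFAULT_SESSION for the duration of a
    with-block, then put back whatever was there before."""
    previous = getattr(module, "DEFAULT_SESSION", None)
    module.DEFAULT_SESSION = session
    try:
        yield session
    finally:
        # Restore even if the wrapped code raised.
        module.DEFAULT_SESSION = previous
```

Usage would look like `with temporary_default_session(boto3, hook.get_session()): run_the_application()`, where `run_the_application` stands for whatever code calls `boto3.resource('s3')` internally. Clients and resources created inside the block keep their credentials after it ends; only the default for later calls is restored.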

Implementing basic S3 compatible API with akka-http

I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use s3 java sdk to test my service API and got the problem with the putObject(...) method. I can't consume file properly on my akka-http backend. I wrote simple route for the test purposes:
def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}
It saves file, but file is always corrupted. Looking inside the file I found the lines like these:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I try to PUT file with any other rest client it works as fine as expected.
I know S3 uses the "Expect: 100-continue" header, and maybe that causes the problem.
I really can't figure out how to deal with that. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal number as a string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream(), or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to throw an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers since you appear not to have implemented this in your API, but this would most likely only serve to prevent storing objects uploaded by this method. The fact that this isn't already what happens automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
¹ one of four -- the other three are a standard PUT, a web-based html form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.
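To make the framing concrete, here is a minimal Python sketch that strips the chunk headers described above and returns the plain payload. It deliberately does not verify the chunk signatures (a real server would have to, using the signing key and the previous signature), and the function name is invented for this example. It is written in Python rather than Scala only to keep it short; the parsing logic carries over to an akka-http stage.

```python
def decode_aws_chunked(body: bytes) -> bytes:
    """Remove aws-chunked framing from a fully buffered request body.

    Each chunk looks like:
        <hex size>;chunk-signature=<signature>\r\n<data>\r\n
    and a zero-sized chunk terminates the stream.
    Signatures are parsed but NOT verified here.
    """
    payload = bytearray()
    pos = 0
    while pos < len(body):
        header_end = body.index(b"\r\n", pos)
        header = body[pos:header_end].decode("ascii")
        size_hex, _, _signature_part = header.partition(";")
        size = int(size_hex, 16)
        data_start = header_end + 2
        data_end = data_start + size
        if body[data_end:data_end + 2] != b"\r\n":
            raise ValueError("chunk data is not terminated by CRLF")
        if size == 0:
            break  # final chunk carries no data
        payload += body[data_start:data_end]
        pos = data_end + 2
    return bytes(payload)


# Example with a placeholder signature (64 zeros instead of a real HMAC).
sig = b"0" * 64
wire = b"".join([
    b"5;chunk-signature=", sig, b"\r\n", b"hello", b"\r\n",
    b"0;chunk-signature=", sig, b"\r\n", b"\r\n",
])
print(decode_aws_chunked(wire))  # b'hello'
```

The "20000" seen in the stored file is such a hex size: 0x20000 is 131072 bytes, that is, 128 KiB of data per chunk.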

Locally reading S3 files through Spark (or better: pyspark)

I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId
or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or months, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
Yes, you have to use s3n instead of s3. s3 is some weird abuse of S3 the benefits of which are unclear to me.
You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:
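rdd = sc.hadoopFile(
    's3n://my_bucket/my_file',
    'org.apache.hadoop.mapred.TextInputFormat',  # input format, required by hadoopFile
    'org.apache.hadoop.io.LongWritable',         # key class (byte offset)
    'org.apache.hadoop.io.Text',                 # value class (line contents)
    conf={'fs.s3n.awsAccessKeyId': '...',
          'fs.s3n.awsSecretAccessKey': '...'})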
The problem was actually a bug in Amazon's boto Python module, related to the fact that the MacPorts version is old. Installing boto through pip solved the problem: ~/.aws/credentials was correctly read.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and can have some serious bugs that are very easy to run into. For the first problem, I would recommend first updating the aws command line interface, boto and Spark every time something strange happens: this has "magically" solved a few issues already for me.
Here is a solution on how to read the credentials from ~/.aws/credentials. It makes use of the fact that the credentials file is an INI file which can be parsed with Python's configparser.
import os
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile = 'default' # your AWS profile to use
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
See also my gist at https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 .
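The two values can then be fed into the `fs.s3n.*` properties mentioned in the error message, so that the keys never appear in a URL. A small sketch, assuming the standard layout of the credentials file; the function name is made up for this example.

```python
import configparser
import os


def s3n_conf_from_credentials(profile="default", path="~/.aws/credentials"):
    """Read one profile from an AWS credentials file and return it as a
    Hadoop configuration dict for the s3n filesystem."""
    # interpolation=None: treat '%' in values literally instead of as a template.
    config = configparser.ConfigParser(interpolation=None)
    if not config.read(os.path.expanduser(path)):
        raise FileNotFoundError(path)
    return {
        "fs.s3n.awsAccessKeyId": config.get(profile, "aws_access_key_id"),
        "fs.s3n.awsSecretAccessKey": config.get(profile, "aws_secret_access_key"),
    }
```

The returned dict is meant to be passed as the `conf` argument of `sc.hadoopFile` or `sc.newAPIHadoopFile`.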
Environment variables setup could help.
Here in Spark FAQ under the question "How can I access data in S3?" they suggest to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
I cannot say much about the Java objects you have to give to the hadoopFile function, only that this function already seems to be deprecated in favor of "newAPIHadoopFile". The documentation on this is quite sketchy and I feel like you need to know Scala/Java to really get to the bottom of what everything means.
In the meantime, I figured out how to actually get some S3 data into pyspark and I thought I would share my findings.
The Spark API documentation says that it uses a dict that gets converted into a Java configuration (XML). I found the configuration for Java in "How to access S3/S3n from local hadoop installation"; it should reflect the values you put into the dict:
bucket = "mycompany-mydata-bucket"
prefix = "2015/04/04/mybiglogfile.log.gz"
filename = "s3n://{}/{}".format(bucket, prefix)
config_dict = {"fs.s3n.awsAccessKeyId": "FOOBAR",
               "fs.s3n.awsSecretAccessKey": "BARFOO"}
# TextInputFormat yields (byte offset, line) pairs, so the key class is
# LongWritable and the value class is Text.
rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text',
                    conf=config_dict)
This code snippet loads the file from the bucket and prefix (file path in the bucket) specified on the first two lines.

S3 notification when file is overwritten, or deleted

We store our log files on S3, and to meet PCI requirements we have to be notified when someone tampers with them.
How can I be notified every time a PUT request replaces an existing object, or when an existing object is deleted? The alert should not fire if a new object is created unless it replaces an existing one.
S3 does not currently provide deletion or overwrite-only notifications. Deletion notifications were added after the initial launch of the notification feature and can notify you when an object is deleted, but they do not notify you when an object is implicitly deleted by overwrite.
However, S3 does have functionality to accomplish what you need, in a way that seems superior to what you are contemplating: object versioning and multi-factor authentication for deletion, both discussed here:
http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
With versioning enabled on the bucket, an overwrite of a file doesn't remove the old version of the file. Instead, each version of the file has an opaque string, assigned by S3, identifying the Version ID.
If someone overwrites a file, you would then have two versions of the same file in the bucket -- the original one and the new one -- so you not only have evidence of tampering, you also have the original file, undisturbed. Any object with more than one version in the bucket has, by definition, been overwritten at some point.
If you also enable Multi-Factor Authentication (MFA) Delete, then none of the versions of any object can be removed without access to the hardware or virtual MFA device.
As a developer of AWS utilities, tools, and libraries (3rd party; I'm not affiliated with Amazon), I am highly impressed by Amazon's implementation of object versioning in S3, because it works in such a way that client utilities that are unaware of versioning or that versioning is enabled on the bucket should not be affected in any way. This means you should be able to activate versioning on a bucket without changing anything in your existing code. For example:
fetching an object without an accompanying version id in the request simply fetches the newest version of the object
objects in versioned buckets aren't really deleted unless you explicitly delete a particular version; however, you can still "delete an object," and get the expected response back. Subsequently fetching the "deleted" object without specifying an accompanying version id still returns a 404 Not Found, as in the non-versioned environment, with the addition of an unobtrusive x-amz-delete-marker: header included in the response to indicate that the "latest version" of the object is in fact a delete marker placeholder. The individual versions of the "deleted" object remain accessible to version-aware code, unless purged.
other operations that are unrelated to versioning, which work on non-versioned buckets, continue to work the same way they did before versioning was enabled on the bucket.
But, again... with code that is version-aware, including the AWS console (two new buttons appear when you're looking at a versioned bucket -- you can choose to view it with a versioning-aware console view or versioning-unaware console view) you can iterate through the different versions of an object and fetch any version that has not been permanently removed... but preventing unauthorized removal of objects is the point of MFA delete.
Additionally, of course, there's bucket logging, which is typically only delayed by a few minutes from real-time and could be used to detect unusual activity... the history of which would be preserved by the bucket versioning.
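The "more than one version means overwritten" rule lends itself to a periodic check. The sketch below works on a dict shaped like the response of boto3's `list_object_versions` (its `Versions` and `DeleteMarkers` lists); the function name and the sample data are invented for illustration, and a real bucket would need all result pages merged before the check.

```python
from collections import Counter


def find_tampered_keys(listing):
    """Report keys that were overwritten or deleted in a versioned bucket.

    `listing` is a dict with optional "Versions" and "DeleteMarkers" lists,
    each entry carrying at least a "Key".
    """
    version_counts = Counter(v["Key"] for v in listing.get("Versions", []))
    # More than one stored version of a key means it was overwritten.
    overwritten = sorted(k for k, n in version_counts.items() if n > 1)
    # A delete marker means someone issued a plain DELETE on the key.
    deleted = sorted({m["Key"] for m in listing.get("DeleteMarkers", [])})
    return {"overwritten": overwritten, "deleted": deleted}


# Made-up sample: one untouched log, one overwritten, one deleted.
sample = {
    "Versions": [
        {"Key": "logs/a.log", "VersionId": "v1", "IsLatest": True},
        {"Key": "logs/b.log", "VersionId": "v1", "IsLatest": False},
        {"Key": "logs/b.log", "VersionId": "v2", "IsLatest": True},
        {"Key": "logs/c.log", "VersionId": "v1", "IsLatest": False},
    ],
    "DeleteMarkers": [
        {"Key": "logs/c.log", "VersionId": "v2", "IsLatest": True},
    ],
}
print(find_tampered_keys(sample))
```

Run on a schedule, this gives an after-the-fact alert rather than a real-time one, which is the trade-off of relying on versioning instead of notifications.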

Amazon S3 pre signed URLs using Amazon Java SDK and extra / characters

I've been creating Presigned HTTP PUT URLs and everything was working great until I wanted to start using "folders" in S3; I wanted the key to have the character '/'.
Now I get Signature doesn't match when I send the HTTP PUT requests due to the fact the '/' probably changes to %2F... If I escape the character before creating the presigned URL it works great, but then the Amazon console management doesn't understand it and shows it as one file instead of subfolders.
Any idea?
P.s.
The HTTP PUT requests are sent using C++ with POCO NET library.
EDIT
I'm using Poco HttpRequest from C++ to my Java web server to generate a signed url (returned on the response).
C++ then uses this url to put a file in s3 using Poco again.
The problem was that the URLs returned from the web server were parsed through Poco URI objects that auto-decoded the S3 object key, thus changing it. With that in mind I was able to fix my problem.
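The effect of such automatic decoding can be reproduced in a few lines of Python. This is only an illustration of the mechanism: the HMAC below is a toy stand-in for a request signature, not the actual AWS signing algorithm, and the key and secret are placeholders.

```python
import hashlib
import hmac
from urllib.parse import quote, unquote


def toy_signature(secret, path):
    """Stand-in for a request signature: an HMAC over the exact path string."""
    return hmac.new(secret, path.encode("utf-8"), hashlib.sha256).hexdigest()


secret = b"not-a-real-secret"
key = "folder/sub folder/file.txt"

# The path as it was escaped before the URL was signed.
signed_path = "/" + quote(key, safe="")
# What a URI class that decodes percent-escapes hands back.
decoded_path = unquote(signed_path)

print(signed_path)
print(decoded_path)
# The two strings differ, so a signature over one does not match the other.
print(toy_signature(secret, signed_path) == toy_signature(secret, decoded_path))
```

Any component that parses and re-serializes the URL between the signer and S3 can change the signed string this way, so passing the pre-signed URL through as an opaque string avoids the mismatch.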
Tricky - I'll try to approach this bottom up.
Disclaimer: I got carried away visually inspecting the Poco libraries instead of actually debugging a code sample, which should yield more reliable results much faster, see below ;)
Analysis
If I escape the character before creating the presigned URL it works
great, but then the Amazon console management doesn't understand it
and shows it as one file instead of subfolders.
The latter stems from S3 not having a concept of folders on the storage level actually, see e.g. section Index Documents and Folders within Index Document Support:
Objects stored in Amazon S3 are stored within a flat container, i.e.,
an Amazon S3 bucket, and it does not provide any hierarchical
organization, similar to a file system's. However, you can create a
logical hierarchy using object key names and use these names to infer
logical folders that contain these objects.
That's exactly what the AWS Management Console is doing here as well:
The AWS Management Console also supports the concept of folders, by
using the same key naming convention used in the preceding sample.
However, your test of the assumption that / is encoded as %2F shows that this is indeed how Poco::Net encodes the URL when performing the HTTP PUT request.
(I'm actually a bit surprised that the AWS Java SDK seems to generate different URLs here for / vs. %2F, insofar as a recent analysis regarding Why is my S3 pre-signed request invalid when I set a response header override that contains a “+”? seems to indicate respective canonicalization by the AWS .NET SDK, see below for more on this.)
Potential Solution
In order for your scenario to work as desired, you'll need to figure out where the URL is encoded this way - I could think of two components in principle:
Poco::Net
Finding out why Poco::Net is encoding the URL differently than S3 (if at all, see below) is best done by debugging your code. Here's where I'd start:
Class HTTPRequest uses class URI in turn, which automatically performs a few normalizations on all URIs and URI parts passed to it, in particular percent-encoded characters are decoded. The other way round is handled by method encode(), which is where things get interesting and call for a breakpoint, see URI.cpp:
lines 575 ff. - here encode() does its magic, which indeed seems to be in place, insofar as neither the code within the function nor the various chars passed in via the reserved parameter contain the offending / (see lines 47 ff. for the respective constants in use)
consequently you might want to set a breakpoint in this function and backtrace the callstack to find out which code is actually doing the encoding upfront, which might not yield an offender at all, see below.
Java => C++ transition
You haven't specified yet which channel is actually used to communicate the pre-signed URL generated by the AWS Java SDK to C++. Given that the code review (mind you, visual inspection only, I haven't debugged this myself yet) of the Poco::Net functionality identifies no obvious offender in the library itself, it seems more likely that the URL already enters your C++ layer encoded (easily verified via debugging of course) - are you by chance using any kind of web service between these components for example?
Good luck!