I'm trying to figure out a concise way to get data from S3 via boto.
My current code looks like this (s3_manager is simply a class that does all the S3 setup for my app):
log.debug("generating downloader")
downloader = s3_manager()
log.debug("accessing bucket")
bucket_archive = downloader.s3_buckets['#archive']
log.debug("getting key")
key = bucket_archive.get_key(archive_filename)
log.debug("getting key into string")
source = key.get_contents_as_string()
The problem is that, looking at my debug logs, I'm making two requests to Amazon S3:
key = bucket_archive.get_key(archive_filename)
source = key.get_contents_as_string()
Looking at the docs [ http://boto.readthedocs.org/en/latest/ref/s3.html ], it seems that the call to get_key checks whether the key exists, while the second call gets the actual data. Does anyone know of a method to do both at once? A more concise way of doing this with one request is preferable for our app.
The get_key() method performs a HEAD request on the object to verify that it exists. If you are certain that the bucket and key exist and would prefer not to have the overhead of a HEAD request, you can simply create a Key object directly. Something like this would work:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket', validate=False)
key = bucket.new_key('myexistingkey')
contents = key.get_contents_as_string()
Passing validate=False to get_bucket similarly eliminates the GET request that would otherwise be made to validate that the bucket exists.
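Not part of the original answer, but for comparison: in boto3 (the newer AWS SDK for Python), a single get_object call fetches the data in one request. A minimal sketch, reusing the same placeholder bucket and key names:

import boto3

# One request: GET the object directly (no HEAD, no bucket validation).
# 'mybucket' and 'myexistingkey' are placeholders.
s3 = boto3.client('s3')
contents = s3.get_object(Bucket='mybucket', Key='myexistingkey')['Body'].read()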
Here is the code I use to create an S3 client and generate a presigned URL. It is fairly standard and has been running on the server for quite a while. I pulled the code out and ran it locally in a Jupyter notebook:
import os
import boto3

def get_s3_client():
    return get_s3(create_session=False)

def get_s3(create_session=False):
    session = boto3.session.Session() if create_session else boto3
    S3_ENDPOINT = os.environ.get('AWS_S3_ENDPOINT')
    if S3_ENDPOINT:
        AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
        AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']
        AWS_DEFAULT_REGION = os.environ["AWS_DEFAULT_REGION"]
        s3 = session.client('s3',
                            endpoint_url=S3_ENDPOINT,
                            aws_access_key_id=AWS_ACCESS_KEY_ID,
                            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                            region_name=AWS_DEFAULT_REGION)
    else:
        s3 = session.client('s3', region_name='us-east-2')
    return s3
s3 = get_s3_client()
BUCKET = "[my-bucket-name]"
OBJECT_KEY = "[my-object-name]"

signed_url = s3.generate_presigned_url(
    'get_object',
    ExpiresIn=3600,
    Params={
        "Bucket": BUCKET,
        "Key": OBJECT_KEY,
    }
)
print(signed_url)
When I tried to download the file using the URL in the browser, I got an error message saying "The specified key does not exist." I noticed in the error message that my object key becomes "[my-bucket-name]/[my-object-name]" rather than just "[my-object-name]".
Then I used the same bucket/key combination to generate a presigned URL with the AWS CLI, which worked as expected. I found out that somehow the boto3 S3 client method inserted [my-bucket-name] in front of [my-object-name] compared to the AWS CLI method. Here are the results.
From s3.generate_presigned_url()
https://[my-bucket-name].s3.us-east-2.amazonaws.com/[my-bucket-name]/[my-object-name]?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAV17K253JHUDLKKHB%2F20210520%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20210520T175014Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=5cdcc38e5933e92b5xed07b58e421e5418c16942cb9ac6ac6429ac65c9f87d64
From aws cli s3 presign
https://[my-bucket-name].s3.us-east-2.amazonaws.com/[my-object-name]?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAYA7K15LJHUDAVKHB%2F20210520%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20210520T155926Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=58208f91985bf3ce72ccf884ba804af30151d158d6ba410dd8fe9d2457369894
I've been working on this and searching for solutions for a day and a half, and I couldn't figure out what was wrong with my implementation. I guess I might have overlooked some basic but important setting when creating an S3 client with boto3, or something else. Thanks for the help!
OK, mystery solved: I shouldn't provide the endpoint_url=S3_ENDPOINT param when I create the S3 client; boto3 will figure it out. After I removed it, everything works as expected.
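A minimal sketch of what the corrected client creation might look like; the region fallback and environment handling are assumptions carried over from the code above rather than anything stated in the fix:

import os
import boto3

def get_s3(create_session=False):
    # Let boto3 resolve the S3 endpoint itself instead of passing endpoint_url.
    # The region/env handling below is assumed from the original code.
    session = boto3.session.Session() if create_session else boto3
    return session.client('s3', region_name=os.environ.get('AWS_DEFAULT_REGION', 'us-east-2'))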
I have an S3 bucket with this configuration:
I'm trying to create a bucket with this same configuration via CDK:
Bucket.Builder.create(this, "test1")
        .bucketName("com.myorg.test1")
        .encryption(BucketEncryption.KMS_MANAGED)
        .bucketKeyEnabled(true)
        .build()
But I'm getting this error:
Error: bucketKeyEnabled is specified, so 'encryption' must be set to KMS (value: MANAGED)
This seems like a bug to me, but I'm relatively new to CDK so I'm not sure. Am I doing something wrong, or is this indeed a bug?
I encountered this issue recently and found the answer. I want to share the findings here in case anyone else gets stuck.
Yes, this was a bug in the AWS-CDK. The fix was merged this month: https://github.com/aws/aws-cdk/pull/22331
According to the CDK doc (https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_s3.Bucket.html#bucketkeyenabled), if bucketKeyEnabled is set to true, S3 will use its own time-limited key instead, which helps reduce the cost (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-key.html); it's only relevant when Encryption is set to BucketEncryption.KMS or BucketEncryption.KMS_MANAGED.
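Not from the original answer, but for illustration: with an aws-cdk-lib version that includes the fix, the question's configuration should be accepted. A rough equivalent in Python CDK (construct id and bucket name are just the question's values; self is the enclosing stack):

from aws_cdk import aws_s3 as s3

# Sketch only: assumes a CDK release containing the fix from PR #22331,
# so KMS_MANAGED encryption can be combined with bucket_key_enabled=True.
bucket = s3.Bucket(
    self, "test1",
    bucket_name="com.myorg.test1",
    encryption=s3.BucketEncryption.KMS_MANAGED,
    bucket_key_enabled=True,
)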
The bucketKeyEnabled flag, straight from the docs:
Specifies whether Amazon S3 should use an S3 Bucket Key with
server-side encryption using KMS (SSE-KMS)
BucketEncryption has 4 options:
NONE - no encryption
MANAGED - KMS key managed by AWS
S3_MANAGED - key managed by S3
KMS - key managed by the user; a new KMS key will be created.
We don't need to set bucketKeyEnabled at all for this scenario. In this case, all we need is aws/s3, so:
bucketKeyEnabled: need not be set (it only applies to SSE-KMS).
encryption: should be set to BucketEncryption.S3_MANAGED.
Example:
const buck = new s3.Bucket(this, "my-bucket", {
  bucketName: "my-test-bucket-1234",
  encryption: s3.BucketEncryption.S3_MANAGED,
});
How do I delete log files in Amazon S3 according to date? I have log files in a logs folder inside my bucket.
string sdate = datetime.ToString("yyyy-MM-dd");
string key = "logs/" + sdate + "*";

AmazonS3 s3Client = AWSClientFactory.CreateAmazonS3Client();
DeleteObjectRequest delRequest = new DeleteObjectRequest()
    .WithBucketName(S3_Bucket_Name)
    .WithKey(key);
DeleteObjectResponse res = s3Client.DeleteObject(delRequest);
I tried this, but it doesn't seem to work. I can delete individual files if I put in the whole key name, but I want to delete all the log files created on a particular date.
You can use S3's Object Lifecycle feature, specifically Object Expiration, to delete all objects under a given prefix and over a given age. It's not instantaneous, but it beats having to make myriad individual requests. To delete everything, just make the age small.
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
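Not spelled out in the answer, but as an illustration of such a lifecycle rule; a minimal boto3 sketch where the bucket name, rule ID, and one-day age are placeholder assumptions:

import boto3

# Sketch: expire everything under the "logs/" prefix after 1 day.
s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire-old-logs',
                'Filter': {'Prefix': 'logs/'},
                'Status': 'Enabled',
                'Expiration': {'Days': 1},
            }
        ]
    },
)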
When I had my hard-coded access key and secret key present, I was able to generate authenticated URLs for users to view private files on S3. This was done with the following code:
import base64  # needed for base64.encodestring below
import hmac
import sha
import time
import urllib

filename = "".join([s3_url_prefix, filename])
expiry = str(int(time.time()) + 3600)

# Sign the string-to-sign ("GET\n\n\n<expiry>\n<path>") with the secret key
h = hmac.new(
    current_app.config.get("S3_SECRET_KEY", None),
    "".join(["GET\n\n\n", expiry, "\n", filename]),
    sha
)
signature = urllib.quote_plus(base64.encodestring(h.digest()).strip())

return "".join([
    "https://s3.amazonaws.com",
    filename,
    "?AWSAccessKeyId=",
    current_app.config.get("S3_ACCESS_KEY", None),
    "&Expires=",
    expiry,
    "&Signature=",
    signature
])
Which gave me something to the effect of https://s3.amazonaws.com/bucket_name/path_to_file?AWSAccessKeyId=xxxxx&Expires=5555555555&Signature=eBFeN32eBb2MwxKk4nhGR1UPhk%3D. Unfortunately, I am unable to store keys in config files for security reasons. For this reason, I switched over to IAM roles. Now, I get my keys using:
_iam = boto.connect_iam()
S3_ACCESS_KEY = _iam.access_key
S3_SECRET_KEY = _iam.secret_key
However, this gives me the error "The AWS Access Key Id you provided does not exist in our records." From my research, I understand that this is because my IAM keys aren't the actual keys, but are instead used together with a token. My question, therefore, is twofold:
How do I get the token programmatically? It doesn't seem that there is a simple iam property I can use.
How do I send the token in the signature? I believe my signature should end up looking something like "".join(["GET\n\n\n", expiry, "\n", token, filename]), but I can't figure out what format to use.
Any help would be greatly appreciated.
There was a change to the generate_url method in https://github.com/boto/boto/commit/99e14f3df039997f54a5377cb6aecc83f22c2670 (June 2012) that made it possible to sign a URL using session credentials. That means you would need to be using boto version 2.6.0 or later. If you are, you should be able to just do this:
import boto
s3 = boto.connect_s3()
url = s3.generate_url(expires_in=3600, method='GET', bucket='<bucket_name>', key='<key_name>')
What version are you using?
I have a collection of JSON messages in a file stored on S3 (one message per line). Each message has a unique key as part of the message. I also have a simple DynamoDB table where this key is used as the primary key. The table contains the name of the S3 file where the corresponding JSON message is located.
My goal is to extract a JSON message from the file given the key. Of course, the worst case scenario is when the message is the very last line in the file.
What is the fastest way of extracting the message from the file using the boto library? In particular, is it possible to somehow read the file line by line directly? Of course, I can read the entire contents to a local file using boto.s3.key.get_file() then open the file and read it line by line and check for the id to match. But is there a more efficient way?
Thanks much!
S3 cannot do this. That said, you have some other options:
Store the record's length and position (byte offset) instead of the line number in DynamoDB. This would allow you to retrieve just that record using the Range: header (see the sketch after this list).
Use a caching layer to store { S3 object key, line number } => { position, length } tuples. When you want to look up a record by { S3 object key, line number }, reference the cache. If you don't already have this data, you have to fetch the whole file as you do now -- but having fetched the file, you can calculate offsets for every line within it, and save yourself work down the line.
Store the JSON record in DynamoDB directly. This may or may not be practical, given the 64 KB item limit.
Store each JSON record in S3 separately. You could then eliminate the DynamoDB key lookup, and go straight to S3 for a given record.
Which is most appropriate for you depends on your application architecture, the way this data is accessed, concurrency issues (probably not significant given your current solution), and your sensitivity to latency and cost.
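Not part of the original answer, but a minimal boto3 sketch of the first option (byte offset plus Range header); the bucket, key, offset, and length values are placeholders that would in practice come from DynamoDB:

import boto3

# Sketch of option 1: fetch a single record by byte range instead of the whole file.
s3 = boto3.client('s3')
offset, length = 1024, 256  # placeholder values looked up from DynamoDB
resp = s3.get_object(
    Bucket='my-bucket',
    Key='messages.json',
    Range=f'bytes={offset}-{offset + length - 1}',  # inclusive byte range
)
record = resp['Body'].read().decode('utf-8')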
You can use the built-in readline module with streams:
const readline = require('readline');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const params = {Bucket: 'yourbucket', Key: 'somefile.txt'};
const readStream = s3.getObject(params).createReadStream();
const lineReader = readline.createInterface({
  input: readStream,
});
lineReader.on('line', (line) => console.log(line));
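The question was about boto, so here is a rough Python counterpart (not part of this answer): boto3's streaming body can be read line by line, assuming a botocore version where iter_lines is available:

import boto3

# Sketch: stream the object and process it line by line without downloading it all.
# Bucket and key names are placeholders.
s3 = boto3.client('s3')
body = s3.get_object(Bucket='yourbucket', Key='somefile.txt')['Body']
for line in body.iter_lines():
    print(line.decode('utf-8'))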
You can use S3 Select to accomplish this. It also works on Parquet files.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html
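Not included in the original answer, but a minimal boto3 sketch of S3 Select over line-delimited JSON; the bucket, key, and the 'id' field name are placeholder assumptions:

import boto3

# Sketch: ask S3 Select to return only the JSON record whose id matches,
# so the whole file never has to be downloaded.
s3 = boto3.client('s3')
resp = s3.select_object_content(
    Bucket='my-bucket',
    Key='messages.json',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s WHERE s.id = 'some-unique-key'",
    InputSerialization={'JSON': {'Type': 'LINES'}},
    OutputSerialization={'JSON': {}},
)
for event in resp['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))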