How to correctly configure auto-removal of objects? - amazon-s3

I have a bucket BUCKET that contains several folders. One of them, FOLDER, is where I store files that have an expire_date.
It seems I need to configure something else to activate auto-removal, because all the files still exist even though their expire_date < current_date.
I tried to create a Lifecycle rule, but it had no effect. I set the rule's prefix to "FOLDER/*" and checked the "Clean up expired object delete markers" and "Clean up incomplete multipart uploads" boxes.
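For reference, lifecycle rule prefixes are literal key prefixes rather than glob patterns, so a rule scoped to this folder would normally use "FOLDER/" without the "*", and expiration is driven by the rule's own settings rather than by per-object metadata. A minimal sketch of such a rule via boto3 (the bucket name, rule ID, and 30-day retention are placeholders):
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="BUCKET",  # placeholder bucket name from the question
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-folder-objects",
            "Filter": {"Prefix": "FOLDER/"},  # literal prefix, no "*" wildcard
            "Status": "Enabled",
            "Expiration": {"Days": 30},       # placeholder retention period
        }]
    },
)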

Related

Do triggers exist for folders in S3 bucket when all objects are deleted?

I have an S3 folder structure as below:
s3://ABC/sample-file.csv
I want to trigger a Lambda function whenever sample-file.csv lands in the ABC folder. The requirement is to move (copy and delete) the sample file. However, when I delete sample-file.csv using Python boto3, the ABC folder itself is deleted because it no longer contains any objects.
boto3_session = boto3.client('s3')
boto3_session.delete_object(Bucket = Bucketname, Key = 'ABC/sample-file.csv')
I want to trigger the Lambda function whenever I write the sample file to the ABC folder again. My question is: will the previously defined Lambda function still be triggered, given that the ABC folder will be created again?
Alternatively, can I retain ABC folder without any files?
Yes. Remember that there aren't actually folders in S3; the folder path is just part of the key name. Folders are just a friendly way to display the information in the console and on the CLI.
Once you create the S3 event notification, it will stay in effect even if you delete the last file in a "folder" and the folder appears to be gone.
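For reference, a prefix-scoped notification of this kind could be set up with boto3 roughly as below (a sketch; the bucket name and Lambda ARN are placeholders, and the function must already permit S3 to invoke it):
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-sample-file",  # placeholder
            "Events": ["s3:ObjectCreated:*"],
            # The filter matches on the key prefix, not on a "folder" object,
            # so it keeps firing even after the last object under ABC/ is deleted.
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "ABC/"}]}},
        }]
    },
)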

S3 eventual consistency for read after write

I have read a lot about different scenarios and questions concerning S3 eventual consistency and how to handle it to avoid 404 errors. But my use case/requirement is a little unusual. I write a bunch of files to a temp/transient folder in an S3 bucket (using a Spark job and making sure the job does not fail), then remove the main/destination folder if the previous step succeeded, and finally copy the files from the temp folder to the main folder in the same bucket. Here is part of my code:
# first writing objects into the tempPrefix here using pyspark
...
# delete the main folder (old data) here
...
# copy files from the temp to the main folder
for obj in bucket.objects.filter(Prefix=tempPrefix):
    # this function make sure the specific key is available for read
    # by calling HeadObject with retries - throwing exception otherwise
    waitForObjectToBeAvaiableForRead(bucketName, obj.key)
    copy_source = {
        "Bucket": bucketName,
        "Key": obj.key
    }
    new_key = obj.key.replace(tempPrefix, mainPrefix, 1)
    new_obj = bucket.Object(new_key)
    new_obj.copy(copy_source)
This seems to work to avoid any 404 (NoSuchKey) errors for an immediate read after write. My question is: will bucket.objects.filter always give me the newly written objects/keys? Can eventual consistency affect that as well? The reason I'm asking is that the HeadObject call (in the waitForObjectToBeAvaiableForRead function) sometimes returns a 404 error for a key that was returned by bucket.objects.filter. In other words, bucket.objects returns a key that is not yet available for read.
When you delete an object in S3, AWS writes a "delete marker" for the object (this assumes that the bucket is versioned). The object appears to be deleted, but that is a sort of illusion created by AWS.
So, if you are writing objects over previously-existing-but-now-deleted objects then you are actually updating an object which results in "eventual consistency" rather than "strong consistency."
Some helpful comments from AWS docs:
A delete marker is a placeholder (marker) for a versioned object that was named in a simple DELETE request. Because the object was in a versioning-enabled bucket, the object was not deleted. The delete marker, however, makes Amazon S3 behave as if it had been deleted.
If you try to get an object and its current version is a delete marker, Amazon S3 responds with:
A 404 (Object not found) error
A response header, x-amz-delete-marker: true
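For reference, you can see the delete markers under a prefix with list_object_versions (a sketch; the bucket and prefix names are placeholders):
import boto3

s3 = boto3.client("s3")
# Delete markers are listed separately from regular object versions.
resp = s3.list_object_versions(Bucket="my-bucket", Prefix="main/")
for marker in resp.get("DeleteMarkers", []):
    print("delete marker:", marker["Key"], "is latest:", marker["IsLatest"])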
Specific Answers
My question is will the bucket.objects.filter give me the newly written objects/keys always?
Yes, newly written objects/keys will be included, provided you have fewer than 1,000 objects in the bucket; the underlying list API returns at most 1,000 objects per request.
Can eventual consistency affect that as well?
Eventual consistency affects the availability of the latest version of an object, not the presence of the object in the filter results. The 404 errors come from trying to read newly written objects whose keys were previously deleted, before full consistency has been reached.
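For reference, a head-with-retries helper like the waitForObjectToBeAvaiableForRead function described in the question could also be approximated with boto3's built-in object_exists waiter (a sketch, not the asker's code; the delay and attempt counts are arbitrary):
import boto3

def wait_for_object(bucket_name, key):
    # Polls HeadObject until the key is readable; raises WaiterError on timeout.
    waiter = boto3.client("s3").get_waiter("object_exists")
    waiter.wait(Bucket=bucket_name, Key=key,
                WaiterConfig={"Delay": 2, "MaxAttempts": 30})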

Data Copy from s3 to Redshift: Manifest is in different bucket than files I need to download

I am trying to copy data from a large number of files in s3 over to Redshift. I have read-only access to the s3 bucket which contains these files. In order to COPY them efficiently, I created a manifest file that contains the links to each of the files I need copied over.
Bucket 1:
- file1.gz
- file2.gz
- ...
Bucket 2:
- manifest
Here is the command I've tried to copy data from bucket 1 using the manifest in bucket 2:
-- Load data from s3
copy data_feed_eval from 's3://bucket-2/data_files._manifest'
CREDENTIALS 'aws_access_key_id=bucket_1_key;aws_secret_access_key=bucket_1_secret'
manifest
csv gzip delimiter ',' dateformat 'YYYY-MM-DD' timeformat 'YYYY-MM-DD HH:MI:SS'
maxerror 1000 TRUNCATECOLUMNS;
However, when running this command, I get the following error:
09:45:32 [COPY - 0 rows, 7.576 secs] [Code: 500310, SQL State: XX000] [Amazon](500310) Invalid operation: Problem reading manifest file - S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid 901E02533CC5010D,ExtRid tEvf/TVfZzPfSNAFa8iTYjTBjvaHnMMPmuwss58SwopY/sZSkhUBe3yMGHTDyA0yDhDCD7ybX9gl45pV/eQ=,CanRetry 1
Details:
-----------------------------------------------
error: Problem reading manifest file - S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid 901E02533CC5010D,ExtRid tEvf/TVfZzPfSNAFa8iTYjTBjvaHnMMPmuwss58SwopY/sZSkhUBe3yMGHTDyA0yDhDCD7ybX9gl45pV/eQ=,CanRetry 1
code: 8001
context: s3://bucket-2/data_files._manifest
query: 2611231
location: s3_utility.cpp:284
process: padbmaster [pid=10330]
-----------------------------------------------;
I believe the issue here is that I'm passing bucket_1 credentials in my COPY command. Is it possible to pass credentials for multiple buckets (bucket_1 with the actual files, and bucket_2 with the manifest) to the COPY command? How should I approach this, assuming I don't have write access to bucket_1?
You have indicated that the bucket_1_key key (which belongs to an IAM user) has permissions limited to read-only access to bucket_1. If that is the case, the error occurs because that key has no permission to read from bucket_2. You mentioned this as a possible cause already, and it is exactly that.
There is no option to supply two sets of keys to the COPY command. But you should consider the following options:
Option 1
According to the documentation, "You can specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file."
If there is a common prefix for the set of files you want to load, you can use that prefix in bucket_1 in COPY command.
See http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html
You have mentioned you have read-only access to bucket 1. Make sure this is sufficient access as defined in http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html#copy-usage_notes-iam-permissions
All the other options require changes to your key/IAM user permissions or Redshift itself.
Option 2
Extend the permissions of the bucket_1_key key so that it can read from bucket_2 as well. You will have to make sure that bucket_1_key has LIST access to bucket_2 and GET access to the bucket_2 objects (as documented here).
This way you can continue using the bucket_1_key key in the COPY command. This method is referred to as Key-Based Access Control and uses a plain-text access key ID and secret access key. AWS recommends using Role-Based Access Control (option 3) instead.
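For reference, the additional bucket_2 permissions could be granted with an inline policy along these lines (a sketch; the user and policy names are placeholders):
import json
import boto3

iam = boto3.client("iam")
policy = {
    "Version": "2012-10-17",
    "Statement": [
        # LIST on the manifest bucket itself...
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": "arn:aws:s3:::bucket-2"},
        # ...and GET on the objects inside it.
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::bucket-2/*"},
    ],
}
iam.put_user_policy(UserName="redshift-copy-user",      # placeholder
                    PolicyName="read-manifest-bucket",   # placeholder
                    PolicyDocument=json.dumps(policy))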
Option 3
Use an IAM role in the COPY command instead of keys (option 2). This is referred to as Role-Based Access Control and is the strongly recommended authentication option for the COPY command.
This IAM role would need LIST access to buckets 1 and 2 and GET access to the objects in those buckets.
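For reference, with such a role attached to the cluster, the COPY no longer needs plain-text keys. A sketch of the same load using the iam_role parameter, issued here through psycopg2 (the role ARN and connection settings are placeholders):
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="dev", user="awsuser", password="placeholder")
with conn, conn.cursor() as cur:
    cur.execute("""
        copy data_feed_eval from 's3://bucket-2/data_files._manifest'
        iam_role 'arn:aws:iam::123456789012:role/redshift-copy-role'
        manifest
        csv gzip delimiter ',' dateformat 'YYYY-MM-DD' timeformat 'YYYY-MM-DD HH:MI:SS'
        maxerror 1000 TRUNCATECOLUMNS;
    """)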
More info about Key-Based and Role-Based Access Control is here.

BizTalk 2010 Dynamic FTP Send Port Output Directory and File Name Issue

I have a rather complex requirement: I have to drop a very specifically named file in an FTP location, and the trick is that both the directory name and the file name change each time, depending on the year, month, date, and time. For this purpose I chose a Dynamic Send Port, which I have configured using a MessageAssignment shape.
A file will be generated each day. I need to drop it in a remote location in this form:
sample-servername-stage/default/file/ftp/PaymentReports/YYYY/MM_[MonthName]/PaymentReportYYYYMMDD_HHMISS
For example, for a file posted on March, 2 2016 at 6:45pm, we would have:
sample-servername-stage/default/file/ftp/PaymentReports/2016/03_March/PaymentReport20160302_184500
Here's the code I have in the MessageAssignment Shape:
FTPSendPort1(Microsoft.XLANGs.BaseTypes.Address) = "ftp://sample-servername-stage:721";
FTPSendPort1(Microsoft.XLANGs.BaseTypes.TransportType) = "FTP";
Output(FTP.CommandLogFileName) = "D:\\BiztalkLogs\\FTPLog\\DynamicFTPLog.txt";
Output(FTP.UserName) = "sampleUsername";
Output(FTP.Password) = "samplePassword";
Output(FTP.BeforePut) = "MKD " + Variable_1 + ";CWD " + Variable_1;
FTPSendPort1 - the name of the Dynamic Send Port.
Output - the name of the Output message.
Variable_1 - the variable where I store the directory name to be created.
Here are the biggest issues:
I need to check whether the year directory already exists, then navigate into it and check whether the month directory exists. If they exist, I simply go in and drop the file; if not, I create them and then drop the file.
I need to name the file with the date-time specifics in the format shown above. In addition to the code shown above, I have tried a number of things, including setting the FILE.ReceivedFileName and FTP.ReceivedFileName properties, but nothing seems to work. This may be because I cannot use the %SourceFileName% macro anywhere. As a result, the file keeps being dropped into the location with a GUID name instead of the one I set; it behaves as though the statement where I set the file name is completely ignored.
I'm thoroughly confused at this point. I'm not sure how to mix condition checks (whether the folders already exist, etc.) with FTP commands, and especially not sure how to do this within an orchestration.
The file naming is done in the address property where you provide the FTP URL. In fact, you can even use macros in there. Try this:
FTPSendPort1(Microsoft.XLANGs.BaseTypes.Address) = "ftp://sample-servername-stage:721/SomeFolder/SomeFileName_%datetime%.xml"
For your other problem of checking whether folders exist on the FTP server and creating them, I think you'll have to write a custom pipeline component.
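As an aside, the folder and file naming pattern from the question is plain date formatting. A sketch of that logic in Python, purely to illustrate the format (in the orchestration the equivalent string would be built in the expression shape):
from datetime import datetime

# Example timestamp from the question: March 2, 2016 at 6:45 pm.
dt = datetime(2016, 3, 2, 18, 45, 0)
folder = f"PaymentReports/{dt:%Y}/{dt:%m}_{dt:%B}"  # PaymentReports/2016/03_March
filename = f"PaymentReport{dt:%Y%m%d_%H%M%S}"       # PaymentReport20160302_184500
print(f"{folder}/{filename}")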

Using sub-repo with hgwebdir difficulties in mercurial

Alright, I got myself into a deadlock with Mercurial and sub-repos... Here's what happened:
I had a large Mercurial repo that I serve via Apache and hgweb.cgi.
Due to the size of the repo I decided to move to sub-repositories and share these with hgwebdir.cgi.
Using the convert tool with the filemap option I created several sub-repositories:
/main/foo
/main/bar
Nicely created an entry for the sub-repositories in .hgsub:
foo = foo
bar = bar
And set hgwebdir.cgi up to show $/** as the root folder.
Now when I went to my site (foo.com/hg) I saw my sub-repositories with one empty repository among them (no name, no content), but I could not download it (archive location unknown):
empty_repo http://img707.imageshack.us/img707/8237/emptysubrepo.png
That was alright until I added a new sub-repository.
I could not push the new .hgsub file to foo.com/hg, since that page is served by hgwebdir.
The only workaround I currently have is to switch from hgwebdir to hgweb, commit .hgsubstate, and switch back to hgwebdir.
Does someone have a good setup for such a mess?
On the webserver your main and its subrepos should appear as siblings -- not with the subrepos inside main.
Main
ASCII
AlignDistribute
And the URLs in your .hgsub should look like:
ASCII = ../ASCII
AlignDistribute = ../AlignDistribute
Then you'll be able to push/pull to http://foo.com/hg/Main and when you clone it the clone/update will automatically attach and clone down the separate subrepos.
From what I've read on https://www.mercurial-scm.org/wiki/PublishingRepositories#multiple
The keys (on the left) and the values (on the right) are both filesystem paths
The keys should be prefixes of the values and are "subtracted" from the values in order to generate the URL paths to each repository
What I'm guessing happened is that in your hgweb(dir) configuration you specified the same value for a collection's key as for its value, so during the subtraction the repository ends up with a blank name and no way to reach it.
When I use [collections] to set /a/full/path = /a/full/path pointing directly at a repo, it ends up blank too, because that folder is read as a single repo (it is one) instead of each sub-directory being treated as an individual repo. After I removed the .hg folder, .hgsub, and everything else from the root of my collection entry, all the subfolders started showing up properly.
I originally used /path/to/my/project = /path/to/my/project in [paths], and since that is a single referenced repository, the value is subtracted from the key, leaving you once again with ''. Instead I used project = /path/to/my/project and it came out as 'project'.
Hopefully that URL or these descriptions will get you out of your pickle!