AWS S3 lifecycle configuration

I want to set a lifecycle rule on an S3 bucket so that each file in the bucket is deleted 7 days after it is generated.
If I set the lifecycle rule as follows (the below is Terraform code, similar to the console setting, so I just use it here), will all the files in the bucket "test" be removed 7 days from today, or will each file be deleted on a different date, since they were created on different days? I want them to be deleted on different dates, not together.
BTW, I guess I do not need to configure "Permanently delete previous versions" because my bucket does not have versioning enabled. Please correct me if I am wrong.
resource "aws_s3_bucket" "s3" {
bucket = "test"
lifecycle_rule {
id = "remove_after_7d"
enabled = true
expiration {
days = 7
}
}
}

The objects will be removed 7 days after their individual creation -- not 7 days after you create the rule. If, for example, all the objects in a bucket are at least 7 days old, they should all be gone within approximately 24 hours after you create the rule.
Note that the timing is not precise, because the deletion process runs in the background, so objects will usually linger a few hours longer than you might expect if you assume exactly 7 × 24 hours is how long they will remain in the bucket. It may take a day or two for the objects to disappear after the policy is first created. However, once the policy has been fully evaluated against all the objects, S3 stops billing you for storage of expired objects when their expiration time arrives, even if the delete process hasn't gotten around to actually removing them yet.
For non-versioned buckets, you are correct -- there is no previous version to delete. Using versioned buckets is generally a good idea, though, since it eliminates the risk of data loss from inadvertently deleting or overwriting an object, for whatever reason (like a bug in your application).
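For reference, the same rule can also be applied outside Terraform. Below is a minimal boto3 sketch of an equivalent bucket-wide 7-day expiration rule; the bucket name "test" is taken from the question, and the empty prefix in the filter is what scopes the rule to every object:
import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='test',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'remove_after_7d',
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},  # empty prefix = applies to all objects
                'Expiration': {'Days': 7},
            }
        ]
    },
)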

As far as I know, the above configuration applies the rule to every object in the bucket, 7 days after each object's creation. If you want to delete only a specific file a given number of days after that object's creation, you have to mention its path as a prefix, e.g. to delete log.txt, which is inside the log folder of the bucket:
resource "aws_s3_bucket" "bucket"
{
bucket = "<<bucket_name>>"
acl = "private"
lifecycle_rule {
id = "log"
enabled = true
prefix = "log/log.txt"
expiration {
days = 7
}
}
}
But I was facing an issue (a 409 error) while updating an existing bucket.

Related

Is multipart copy really needed to revert an object to a prior version?

For https://github.com/wlandau/gittargets/issues/6, I am trying to programmatically revert an object in an S3 bucket to an earlier version. From reading https://docs.aws.amazon.com/AmazonS3/latest/userguide/RestoringPreviousVersions.html, it looks like copying the object to itself (old version to current version) is recommended. However, I also read that there is a 5 GB limit for copying objects in S3. Does that limit apply to reverting an object to a previous version in the same bucket? A local download followed by a multi-part upload seems extremely inefficient for this use case.
You can create a multi-part transfer request that transfers from S3 to S3. It still takes time, but it doesn't require downloading the object's data and uploading it again, so in practice it tends to be considerably faster than other options:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('example-bucket')

bucket.copy(
    {
        'Bucket': 'example-bucket',
        'Key': 'test.dat',
        'VersionId': '0011223344',  # from a previous call to bucket.object_versions
    },
    Key='test.dat',
)
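To find the VersionId referenced in the comment above, you can enumerate the object's versions first. A short sketch (the bucket and key names follow the example above); for each key, versions are listed newest first:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('example-bucket')

# Pick the id of the version you want to restore from this listing
for version in bucket.object_versions.filter(Prefix='test.dat'):
    print(version.id, version.last_modified, version.is_latest)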

S3 eventual consistency for read after write

I have read a lot about different scenarios and questions concerning S3 eventual consistency and how to handle it so as not to get 404 errors. But here I have a slightly unusual use case/requirement! What I'm doing is writing a bunch of files to a temp/transient folder in an S3 bucket (using a Spark job, and making sure the job is not going to fail), then removing the main/destination folder if the previous step is successful, and finally copying the files over from the temp to the main folder in the same bucket. Here is part of my code:
# first, write objects into the tempPrefix here using pyspark
...
# delete the main folder (old data) here
...
# copy files from the temp to the main folder
for obj in bucket.objects.filter(Prefix=tempPrefix):
    # this function makes sure the specific key is available for read
    # by calling HeadObject with retries - throwing an exception otherwise
    waitForObjectToBeAvailableForRead(bucketName, obj.key)
    copy_source = {
        "Bucket": bucketName,
        "Key": obj.key
    }
    new_key = obj.key.replace(tempPrefix, mainPrefix, 1)
    new_obj = bucket.Object(new_key)
    new_obj.copy(copy_source)
This seems to work to avoid any 404 (NoSuchKey) errors for immediate reads after writes. My question is: will bucket.objects.filter always give me the newly written objects/keys? Can eventual consistency affect that as well? The reason I'm asking is that the HeadObject call (in the waitForObjectToBeAvailableForRead function) sometimes returns a 404 error for a key that was returned by bucket.objects.filter. In other words, bucket.objects returns a key that is not yet available for read!
When you delete an object in S3, AWS writes a "delete marker" for the object (this assumes that the bucket is versioned). The object appears to be deleted, but that is a sort of illusion created by AWS.
So, if you are writing objects over previously-existing-but-now-deleted objects, you are actually updating an object, which results in "eventual consistency" rather than "strong consistency."
Some helpful comments from the AWS docs:

A delete marker is a placeholder (marker) for a versioned object that was named in a simple DELETE request. Because the object was in a versioning-enabled bucket, the object was not deleted. The delete marker, however, makes Amazon S3 behave as if it had been deleted.

If you try to get an object and its current version is a delete marker, Amazon S3 responds with:

- A 404 (Object not found) error
- A response header, x-amz-delete-marker: true
Specific Answers
My question is will the bucket.objects.filter give me the newly written objects/keys always?
Yes, newly written objects/keys will be included. Note that the underlying API returns at most 1,000 objects per request, so for larger buckets the results must be paginated; boto3's bucket.objects collection handles that pagination automatically.
Can eventual consistency affect that as well?
Eventual consistency affects the availability of the latest version of an object, not the presence of the object in the filter results. The 404 errors are the result of trying to read newly written objects that were previously deleted (and full consistency has not yet been achieved).
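As an aside on the 1,000-object limit mentioned above: the underlying ListObjects call is paginated, and boto3's resource collections page through results transparently. If you use the low-level client instead, a paginator makes that explicit. A sketch, where the bucket and prefix names are placeholders:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Each page holds up to 1,000 keys; the paginator follows continuation tokens
for page in paginator.paginate(Bucket='example-bucket', Prefix='temp/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])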

How to invoke an on-demand BigQuery Data Transfer Service?

I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema sitting there waiting to be loaded into BQ. It would have been awesome to just set up a DTS schedule that picked up GCS files matching a pattern and loaded them into BQ. I like the built-in options to delete source files after the copy and to send an email in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could have lived with perhaps a 10-minute delay.
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes, but I can't figure out from the docs how to call it.
Also, what is my second-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
If I use a Cloud Function, how can I submit all the files in my GCS bucket at the time of invocation as one BQ load job?
Lastly, does anyone know if DTS will lower the limit to 10 minutes in the future?
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes, but I can't figure out from the docs how to call it.
StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use it will depend on your environment. For instance, you can use the Python client library (docs).
As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the dependencies):
import time
from google.cloud import bigquery_datatransfer_v1

client = bigquery_datatransfer_v1.DataTransferServiceClient()

PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc'  # alphanumeric ID you'll find in the UI

parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))

response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)
Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.
You should get a response such as:
runs {
  name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
  destination_dataset_id: "DATASET_NAME"
  schedule_time {
    seconds: 1579358571
    nanos: 922599371
  }
  ...
  data_source_id: "google_cloud_storage"
  state: PENDING
  params {
    ...
  }
  run_time {
    seconds: 1579358581
  }
  user_id: 28...65
}
and the transfer is triggered as expected.
Also, what is my second-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
With this, you can set up a cron job to execute your function every ten minutes. As discussed in the comments, the minimum interval is 60 minutes, so it won't pick up files less than one hour old (docs).
Apart from that, this is not a very robust solution, and here your follow-up questions come into play. I think these might be too broad to address in a single StackOverflow question, but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well.
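As an illustration of the Cloud Scheduler + Cloud Functions route, here is a hypothetical HTTP-triggered function that simply wraps the snippet above (using the same, older client library calls); PROJECT_ID and TRANSFER_CONFIG_ID are placeholders:
import time
from google.cloud import bigquery_datatransfer_v1

def trigger_transfer(request):
    # Hypothetical wrapper: start a manual DTS run whenever Cloud Scheduler
    # hits this endpoint (e.g. every 10 minutes)
    client = bigquery_datatransfer_v1.DataTransferServiceClient()
    parent = client.project_transfer_config_path('PROJECT_ID', 'TRANSFER_CONFIG_ID')
    start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
    response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
    return str(response)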
Dataflow would be best if you needed ETL, but it has a GCS connector that can watch a file pattern (example). With this you would skip the transfer, and set the watch interval and the load-job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline, as opposed to the previous approach, but a 10-minute watch period is possible.
If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
If I use a Cloud Function, how can I submit all the files in my GCS bucket at the time of invocation as one BQ load job?
You can use wildcards to match a file pattern when you create the transfer.
Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
Lastly, does anyone know if DTS will lower the limit to 10 minutes in the future?
There is already a Feature Request here. Feel free to star it to show your interest and receive updates.
You can now easily trigger a manual run of a BigQuery data transfer using the REST API:
HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
Regarding the {parent=projects/*/locations/*/transferConfigs/*} part, check the CONFIGURATION section of your transfer, where the full resource name is shown.
More here:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
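For completeness, here is a sketch of calling this REST endpoint from Python with google-auth; the resource path is a placeholder, and requestedRunTime is an RFC 3339 timestamp:
import time

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Uses Application Default Credentials
# (e.g. from gcloud auth application-default login)
credentials, _ = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
session = AuthorizedSession(credentials)

parent = 'projects/PROJECT_ID/locations/us/transferConfigs/TRANSFER_CONFIG_ID'
url = f'https://bigquerydatatransfer.googleapis.com/v1/{parent}:startManualRuns'
body = {'requestedRunTime': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}

response = session.post(url, json=body)
print(response.json())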
Following Guillem's answer and the API updates, this is my new code:
import time
from google.cloud import bigquery_datatransfer_v1 as datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp

client = datatransfer_v1.DataTransferServiceClient()

PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '34y....654'

parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = Timestamp(seconds=int(time.time()))

request = datatransfer_v1.types.StartManualTransferRunsRequest(
    {"parent": parent, "requested_run_time": start_time}
)
response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
For this to work, you need to know the correct TRANSFER_CONFIG_ID.
In my case, I wanted to list all the BigQuery scheduled queries to get a specific ID. You can do it like this:
from google.cloud import bigquery_datatransfer_v1

# Put your project ID here
PROJECT_ID = 'PROJECT_ID'

bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)

# Iterate over all results
for element in bq_transfer_client.list_transfer_configs(parent):
    # Print the display name of each scheduled query
    print(f'[Schedule Query Name]:\t{element.display_name}')
    # Print the full resource name of each element (it contains the ID)
    print(f'[Name]:\t\t{element.name}')
    # Extract the ID:
    TRANSFER_CONFIG_ID = element.name.split('/')[-1]
    print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')
    # You can print the entire element for debugging purposes
    print(element)

How to correctly configure auto-removal of objects?

I have a bucket BUCKET. This bucket contains several folders; one of them is named FOLDER. I use it to store files that have an expire_date.
It seems I need to configure something else to activate auto-removal, because all the files still exist even though their expire_date < current_date.
I tried to create a lifecycle rule, but with no result. I set the prefix for the rule to "FOLDER/*", and checked the boxes "Clean up expired object delete markers" and "Clean up incomplete multipart uploads".
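For what it's worth, S3 lifecycle prefixes are literal key prefixes, not glob patterns, so "FOLDER/" (without the "*") is the form the rule expects. A minimal boto3 sketch of a prefix-scoped expiration rule, with the bucket name and the 7-day period as placeholders:
import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='BUCKET',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire_folder',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'FOLDER/'},  # literal prefix, no wildcard
                'Expiration': {'Days': 7},        # placeholder period
            }
        ]
    },
)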

Alfresco: unable to backup alf_data

I am an Alfresco 3.3c user with an instance supporting more than 4 million objects. I'm starting to have problems with backup, because backing up the alf_data/contentstore folder, even in incremental mode, takes too long (it always needs to analyze all those files for changes).
I've noticed that alf_data/contentstore is organized internally by year. Could I assume that the older years (2012) are no longer changed? (If yes, I can just create an exception and remove those dirs from the backup process, obviously after a previous full backup.)
Thanks, kind regards.
Yes, you can assume that no objects will be created (and items are never updated) in old directories within your content store, although items may be removed by the repository's cleanup jobs after being deleted from Alfresco's trash can.
This is the section from org.alfresco.repo.content.filestore.FileContentStore which generates a new content URL. You can easily see that it always uses the current date and time.
/**
 * Creates a new content URL. This must be supported by all
 * stores that are compatible with Alfresco.
 *
 * @return Returns a new and unique content URL
 */
public static String createNewFileStoreUrl()
{
    Calendar calendar = new GregorianCalendar();
    int year = calendar.get(Calendar.YEAR);
    int month = calendar.get(Calendar.MONTH) + 1;  // 0-based
    int day = calendar.get(Calendar.DAY_OF_MONTH);
    int hour = calendar.get(Calendar.HOUR_OF_DAY);
    int minute = calendar.get(Calendar.MINUTE);
    // create the URL
    StringBuilder sb = new StringBuilder(20);
    sb.append(FileContentStore.STORE_PROTOCOL)
      .append(ContentStore.PROTOCOL_DELIMITER)
      .append(year).append('/')
      .append(month).append('/')
      .append(day).append('/')
      .append(hour).append('/')
      .append(minute).append('/')
      .append(GUID.generate()).append(".bin");
    String newContentUrl = sb.toString();
    // done
    return newContentUrl;
}
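If you act on this assumption, one way to skip the old year directories is to run the incremental copy only over recent years. A rough Python sketch, assuming the contentstore layout implied by the code above (contentstore/<year>/<month>/...) and that rsync is available; all paths are placeholders:
import datetime
import subprocess

CONTENTSTORE = '/opt/alfresco/alf_data/contentstore'  # placeholder path
BACKUP_TARGET = '/backup/contentstore'                # placeholder path

# Only directories for years that can still receive new content need an
# incremental scan; older years are covered by the initial full backup.
current_year = datetime.date.today().year
for year in (current_year - 1, current_year):
    subprocess.run(
        ['rsync', '-a', f'{CONTENTSTORE}/{year}/', f'{BACKUP_TARGET}/{year}/'],
        check=True,
    )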
Actually, no, you can't, because if the file is modified/updated in Alfresco, the filesystem path doesn't change (I'm talking about node properties, not the node content). Remember, you can hot-backup the content-store dir (not the Lucene index folder), and it's not necessary to check every single file for consistency. Just launch a shell/batch script executing a copy without checks, or use a tool like xxcopy.