Delete old files based on date extracted from filename by custom rules - amazon-s3

I am trying to find the best way to automatically delete my old SQL backup files on S3 based on the following rules:
keep all backups for the last 7 days
keep the last backup of each day for the last 6 months
keep the last backup of each week for the last 2 years
keep the last backup of each month for files more than 2 years old.
My file names contain the backup datetime, as in XX_backup_2016_12_09_150003_4066809.bak.
What do you recommend? AWS Lambda, or something else?

Consider using S3 lifecycle rules combined with object tagging.
You can use S3 event notifications to trigger a Lambda function for each PutObject. The Lambda can tag the S3 objects according to the rules you have outlined; the file name is available in the S3 event payload.
That is:
keep all backups for the last 7 days (default tag for 7-day retention)
keep the last backup of each day for the last 6 months (tag as 6-month retention)
keep the last backup of each week for the last 2 years (tag as 2-year retention)
keep the last backup of each month for files more than 2 years old (tag for x retention)
The Lambda can deal with edge cases, such as a file that is needed for both the 6-month and the 2-year rule. A default tag for the 7-day retention could be applied when no other tag fits.
Lifecycle rules with expiration actions can then be created and applied according to the tag.
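A minimal sketch of such a Lambda in Python, assuming an S3 PutObject notification and boto3's put_object_tagging. The tag key retention and its value 7d are placeholders that only need to match whatever filters your lifecycle rules use; promoting the "last backup of the day/week/month" to a longer retention tag is left as a separate step.

```python
import re
from datetime import datetime, timezone

import boto3  # bundled with the AWS Lambda Python runtime

s3 = boto3.client("s3")

# Matches the timestamp in names like XX_backup_2016_12_09_150003_4066809.bak
TIMESTAMP_RE = re.compile(r"_backup_(\d{4})_(\d{2})_(\d{2})_(\d{6})_")


def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        match = TIMESTAMP_RE.search(key)
        if not match:
            continue  # not a backup file in the expected naming scheme

        year, month, day, hms = match.groups()
        backup_time = datetime(
            int(year), int(month), int(day),
            int(hms[0:2]), int(hms[2:4]), int(hms[4:6]),
            tzinfo=timezone.utc,
        )

        # A freshly uploaded backup starts on the default 7-day tag. A separate
        # pass (or extra logic here) would re-tag the previous "last backup of
        # the day/week/month" to a longer retention class so the matching
        # lifecycle expiration rule leaves it alone.
        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "retention", "Value": "7d"}]},
        )
        print(f"tagged s3://{bucket}/{key} (backup taken {backup_time:%Y-%m-%d %H:%M:%S} UTC)")
```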

Related

Appsflyer Data Locker Override data BigQuery

I have an issue with setting up the Appsflyer Cost ETL with Google BigQuery. We get Parquet files each day.
The issue is the following: each day you get a file containing 10 dates.
The problem is that each day you have 6 dates that should overwrite yesterday's file. The task is how to set up data transfers or scheduled queries so that the data for each date present in the newer file overrides what was loaded before, keeping the data for a long period in one table.
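One common pattern is to load each day's drop into a staging table and then replace, in the target table, every date the new file contains. A rough sketch with the google-cloud-bigquery client; the project, dataset, table names, GCS path, and the report_date column are all assumptions to adapt, and the same delete-and-insert (or a MERGE) could also run as a scheduled query.

```python
from google.cloud import bigquery

# All names below (project, dataset, tables, the report_date column and the
# GCS path) are examples; adjust them to your setup.
PROJECT = "my-project"
STAGING = f"{PROJECT}.appsflyer.cost_staging"
TARGET = f"{PROJECT}.appsflyer.cost"

client = bigquery.Client(project=PROJECT)

# 1. Load today's Parquet drop into a staging table, replacing its contents.
client.load_table_from_uri(
    "gs://my-datalocker-bucket/cost/dt=2024-01-15/*.parquet",
    STAGING,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
).result()

# 2. Remove every date that appears in the new file, then append the new rows,
#    so a re-delivered date overrides whatever was loaded on earlier days.
client.query(f"""
    DELETE FROM `{TARGET}`
    WHERE report_date IN (SELECT DISTINCT report_date FROM `{STAGING}`)
""").result()

client.query(f"""
    INSERT INTO `{TARGET}`
    SELECT * FROM `{STAGING}`
""").result()
```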

Group data by weeks since the start of event in sql

I’m a data analyst in the insurance industry and we currently have a program in SAS EG that tracks catastrophe development week by week since the start of the event for all of the catastrophic events that are reported.
(i.e. week 1 is the catastrophe start date + 7 days, week 2 is the end of week 1 + 7 days, and so on). All transaction amounts (dollars) for a specific catastrophe are then grouped into the respective weeks based on the date each transaction was made.
The problem we're faced with is that we are moving away from SAS EG to GCP BigQuery, and the current process of calculating those weeks relies on a manually read-in list, which isn't very efficient and doesn't translate easily to BigQuery.
I'm curious if anybody has an idea that would allow me to calculate each week number, in periods of 7 days since the start of an event, in SQL, or has an idea specific to BigQuery? Each event has its own start date.
It is complex, I know, and I'm willing to give more explanation as needed. I'm open to any ideas, as I haven't been able to find anything.
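In BigQuery the week bucket can usually be computed directly, for example DIV(DATE_DIFF(transaction_date, event_start_date, DAY), 7) + 1 used as a GROUP BY key after joining each transaction to its event's start date. The arithmetic is just integer division by 7; a small Python sketch with made-up data, only to illustrate the calculation:

```python
from collections import defaultdict
from datetime import date


def week_number(event_start: date, transaction_date: date) -> int:
    """Week 1 covers the 7 days starting at the event start date, week 2 the next 7, etc."""
    return (transaction_date - event_start).days // 7 + 1


# Made-up transactions: (event_id, event_start, transaction_date, amount)
transactions = [
    ("CAT001", date(2023, 6, 1), date(2023, 6, 3), 1500.0),
    ("CAT001", date(2023, 6, 1), date(2023, 6, 10), 700.0),
    ("CAT002", date(2023, 7, 15), date(2023, 7, 16), 250.0),
]

totals = defaultdict(float)
for event_id, start, txn_date, amount in transactions:
    totals[(event_id, week_number(start, txn_date))] += amount

for (event_id, week), amount in sorted(totals.items()):
    print(f"{event_id} week {week}: {amount:,.2f}")
```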

Extract files from a particular date to process the data -- U-SQL

I have two pipelines which produce files in two different paths. In path 1 I have only one file, which holds all the data. In path 2 I have one file for each day.
Suppose the path 1 file has not been refreshed for the last 2 days, while path 2 has files for those 2 days. I then want to process those last 2 days in a U-SQL script. How can we do that?
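One possible approach, assuming the last date covered by the path 1 file can be read from pipeline metadata: compute the missing dates outside the script and pass them (or just the cutoff date) to the U-SQL job as parameters, where a file-set pattern with a date virtual column selects only the matching daily files from path 2. The Python below only illustrates the date arithmetic, with made-up dates.

```python
from datetime import date, timedelta

# Hypothetical values: the last date already covered by the path 1 (full) file
# and today's date. In practice the last covered date would come from the
# file's own content or from pipeline metadata.
last_refreshed = date(2024, 1, 13)
today = date(2024, 1, 15)

# The daily files in path 2 that are not yet reflected in path 1.
missing_dates = [
    last_refreshed + timedelta(days=n)
    for n in range(1, (today - last_refreshed).days + 1)
]

# These date strings could be passed to the U-SQL job as parameters and used
# to filter a file-set pattern such as "/path2/data_{date:yyyy}-{date:MM}-{date:dd}.csv"
# so that only the missing daily files are extracted.
print([d.isoformat() for d in missing_dates])
```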

Pentaho data integration issue with loading a kettle based on some condition

I have a Pentaho Data Integration job which has the following steps:
A Generate Rows step with an initial date (e.g. 2010-01-01) and a limit of 10*366 = 3660 rows for 10 years.
The next step has an incrementer to increment the number of days.
The next step uses this information (the initial date, the limit, and the incrementer) to generate dates for each day of the 10 years starting 2010-01-01, using JavaScript functions.
The final step loads a table with the generated dates.
All this works fine.
Now I have a requirement where I do not want this table to stay static with dates for 10 years only. If the max date in the date table is within 2 years of today, I want to load dates for 10 more years into the table.
For the above example, where the first load covers 10 years from 2010, I should be able to load 10 more years in 2018, the next 10 years in 2028, and so on.
What will be the best way to achieve this?
How can I:
1) Read the max date from my date table? - I know how to do this.
2) Compare that date against today and, if the max date is within 2 years of today, populate the table with the next 10 years?
I don't know how to do step 2 in Pentaho Data Integration. I would really appreciate any pointers on a way to resolve this.
You need to read the current date (today) into a variable, for example with a Get System Info step.
Then you can compare the two fields, max date and today, with a Filter Rows step.
As the previous step may give you more than one row, you need to use either a Unique Rows step (with no fields specified) or a Group By step (with no grouping field).
If any row gets through, you launch your generate-10-years process. As you cannot have a hop from a step into that second Generate Rows step, you must use a Transformation Executor to launch your existing transformation.
Now, if your requirement gets even a little more complex than that, I strongly suggest you use jobs to orchestrate your transformations.
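The condition the Filter Rows / Transformation Executor chain implements is essentially "if the max date is less than about two years away, generate the next ten years of dates". A small Python sketch of that decision, purely to make the logic explicit; inside PDI it stays in the steps described above, and the sample dates are made up.

```python
from datetime import date, timedelta


def needs_extension(max_date_in_table: date, today: date) -> bool:
    """True when the date table only covers up to roughly two years ahead of today."""
    return max_date_in_table <= today + timedelta(days=2 * 365)


def next_block(max_date_in_table: date, rows: int = 3660):
    """One row per day for the next ~10 years (3660 rows, as in the original job)."""
    return [max_date_in_table + timedelta(days=n) for n in range(1, rows + 1)]


if __name__ == "__main__":
    max_date = date(2019, 12, 31)   # pretend this was read from the date table
    today = date(2018, 6, 1)        # pretend this came from Get System Info
    if needs_extension(max_date, today):
        new_rows = next_block(max_date)
        print(f"loading {len(new_rows)} rows: {new_rows[0]} .. {new_rows[-1]}")
```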

Schedule algorithm for nightly SQL extract of data

I am looking for an algorithm to extract data from one system into another, but on a sliding scale. Here are the details:
Every two weeks, 80 weeks of data needs to be extracted.
Extracts take a long time and are resource intensive so we would like to distribute the load of the extract over time.
The first 8-12 weeks are the most important and need to be updated more often over the two-week window. Data further out can be updated less frequently, to the point where the last 40+ weeks could be extracted just once every two weeks.
Every two weeks, the start date shifts two weeks ahead and so two new weeks are extracted.
Extract procedure takes a start and end date (this is already made and should be treated like a black box). The procedure could be run for multiple date spans in a day if required but contiguous dates are faster than multiple blocks of dates.
Extract blocks should be no smaller than 2 weeks and probably no greater than 16 weeks. Longer blocks are possible, but at 16 weeks they are already a significant load on the system.
4 contiguous weeks of data takes approximately 1 hour, because the data needs to be generated/calculated.
Newly extracted data replaces the old data for that timespan. There is no need to merge or diff the data; it is simply replaced.
This algorithm needs to be built into a SQL job which will handle the daily process (triggered once a day only).
My initial thought was to create what is essentially a sliding schedule: rotate the first 4-week block every second day, the second 4-week block every 3 to 4 days, and extract the rest of the data in smaller blocks spread over the two-week period.
What I have in mind will work, but I wanted to spend some time seeing if there is a better way to approach the problem. I am mainly looking for an algorithm to produce the start/end date schedule for the daily extract.
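One way to formalize that sliding schedule is a small lookup table of week ranges, each with a refresh frequency and a phase offset, which the daily job consults to decide which spans to hand to the extract procedure. A rough Python sketch of the idea follows; the block boundaries, frequencies, and phases are only examples, and the same table could just as well live in a SQL table driven by the scheduled SQL job.

```python
from datetime import date, timedelta

# Each entry: (first_week, last_week_exclusive, refresh_every_n_days, phase).
# Week offsets are relative to the start of the current 80-week window; the
# phase staggers blocks so they do not all land on the same day. All of these
# numbers are illustrative only.
BLOCKS = [
    (0,  4,  2,  0),   # weeks 1-4:   every second day
    (4,  8,  4,  1),   # weeks 5-8:   twice a week
    (8,  12, 7,  2),   # weeks 9-12:  weekly
    (12, 28, 14, 3),   # weeks 13-28: once per two-week cycle
    (28, 44, 14, 5),
    (44, 60, 14, 7),
    (60, 76, 14, 9),
    (76, 80, 14, 11),  # weeks 77-80: once per cycle
]


def todays_extracts(window_start: date, today: date):
    """Return the (start_date, end_date) spans to hand to the extract procedure today.

    window_start is the first day of the current 80-week window; it moves
    forward two weeks at the start of every cycle.
    """
    day_in_cycle = (today - window_start).days % 14
    spans = []
    for first_week, last_week, every, phase in BLOCKS:
        if day_in_cycle % every == phase % every:
            start = window_start + timedelta(weeks=first_week)
            end = window_start + timedelta(weeks=last_week) - timedelta(days=1)
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    window_start = date(2024, 1, 1)
    for offset in range(14):          # preview one full two-week cycle
        day = window_start + timedelta(days=offset)
        print(day, todays_extracts(window_start, day))
```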