Extract files from a particular date to process the data -- U-SQL - azure-data-lake

I have two pipelines that write files to two different paths. Path 1 holds a single file containing all the data. Path 2 holds one file per day.
Suppose the path 1 file has not been refreshed for the last 2 days, while path 2 has files for those 2 days. I then want to process just those 2 days in a U-SQL script. How can we do that?

Related

Write 2 months historical data from S3 to aws timestream

I want to write 2 months of historical data from an S3 bucket, where each day has 4 to 5 Parquet files. I have converted the Parquet files to a data frame, but one day of data comes to around 3.5M rows.
I have created 100 batches to send the records, but the overall execution time is too long. Please help me.

AWS Quicksight monthly summarize in percentage from different csv files

1. I have a Lambda function that runs monthly. It runs an Athena query and exports the results as a CSV file to my S3 bucket.
2. I have a QuickSight dashboard that uses this CSV file as its dataset and visualizes all the rows from the report.
Everything works up to this point.
3. Every month I get a new CSV file in my S3 bucket, and I want to add a "Visual Type" to my main dashboard that shows the difference in % from the previous CSV file (previous month).
For example:
My dashboard focuses on the collection of missing updates.
In May I see 50 missing updates.
In June I got a CSV file with 25 missing updates.
Now I want my dashboard to show, with a "Visual Type", that this month we reduced the number of missing updates by 50%.
And in July I get a file with 20 missing updates, so I want to see that we reduced it by 60% compared with May.
Any idea how I can do it?
I'm not sure I fully understand your setup, but I'll assume that you have one S3 manifest that points to an S3 directory, rather than a separate manifest (and dataset) for each file.
If that's the case, you could try handling the comparison by creating a calculated field that uses the periodOverPeriodPercentDifference function.
Hope this helps!
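To make the arithmetic behind the question's figures concrete, here is a minimal Python sketch of a percent difference. It is not QuickSight syntax; the function name `percent_difference` is made up for illustration. It also shows that the July figure depends on the reference month: the question measures July against May, whereas a comparison with the immediately preceding month would measure it against June.

```python
def percent_difference(current, reference):
    """Percent change of `current` relative to `reference` (negative means a reduction)."""
    return (current - reference) * 100.0 / reference


# Monthly counts of missing updates, taken from the question.
missing_updates = {"May": 50, "June": 25, "July": 20}

june_vs_may = percent_difference(missing_updates["June"], missing_updates["May"])
july_vs_may = percent_difference(missing_updates["July"], missing_updates["May"])
# Comparing with the previous month instead of a fixed May baseline:
july_vs_june = percent_difference(missing_updates["July"], missing_updates["June"])

print(f"June vs May:   {june_vs_may:.0f}%")
print(f"July vs May:   {july_vs_may:.0f}%")
print(f"July vs June:  {july_vs_june:.0f}%")
```

If the dashboard should always compare against May, a month-over-month calculation alone would not give the 60% figure from the question.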

Delete old files based on date extracted from filename by custom rules

I am trying to find the best way to automatically delete my old SQL backup files on S3 based on the following rules:
keep all backups for the last 7 days
keep the last backup of each day for the last 6 months
keep the last backup of each week for the last 2 years
keep the last backup of each month for files more than 2 years old.
My file names contain the backup datetime, as in XX_backup_2016_12_09_150003_4066809.bak.
What do you recommend? AWS Lambda, or something else?
Consider using S3 Lifecycle Rules with ObjectTagging.
You can use S3 Events to trigger a Lambda for each PutObject. Your Lambda can create tags on the S3 objects based on the rules you have outlined. The file name is passed to the Lambda as part of the S3 Event.
That is:
keep all backups for the last 7 days (default tag for 7 day retention)
keep the last backup of each day for the last 6 months (tag as 6 month retention)
keep the last backup of each week for the last 2 years (tag as 2 year retention)
keep the last backup of each month for files more than 2 years old (tag for x retention)
The Lambda can handle edge cases, such as a file that is required for both the 6 month and the 2 year retention. The default 7 day retention tag can be used when no other tag applies.
Lifecycle rules with expiration can then be created and applied according to the tag.
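The tagging decision described above could look roughly like the following Python sketch. It only covers parsing the file name and choosing a tag; the S3 calls are left out. The tag names (`default`, `daily`, `weekly`, `monthly`) and the function names are hypothetical, and the sketch assumes that the `150003` part of the file name is the time as HHMMSS. Because a newly uploaded backup is always the latest of its day so far, the tag of an earlier file may need to be revised when a later one arrives; here the function simply evaluates a file against a known list of names.

```python
import re
from datetime import datetime

# Matches names like XX_backup_2016_12_09_150003_4066809.bak
# (assumption: the fourth numeric group is the time as HHMMSS).
BACKUP_NAME = re.compile(r"_backup_(\d{4})_(\d{2})_(\d{2})_(\d{6})_\d+\.bak$")


def backup_time(name):
    """Extract the backup datetime encoded in the file name."""
    match = BACKUP_NAME.search(name)
    if match is None:
        raise ValueError(f"unrecognised backup file name: {name}")
    return datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")


def retention_tag(name, all_names):
    """Return the longest-lived (hypothetical) tag this backup qualifies for."""
    when = backup_time(name)
    times = [backup_time(n) for n in all_names] + [when]

    def is_last(key):
        # True if no other backup in the same day/week/month is newer.
        return when == max(t for t in times if key(t) == key(when))

    if is_last(lambda t: (t.year, t.month)):
        return "monthly"   # last backup of its month: longest retention
    if is_last(lambda t: tuple(t.isocalendar())[:2]):
        return "weekly"    # last backup of its ISO week: 2 year retention
    if is_last(lambda t: t.date()):
        return "daily"     # last backup of its day: 6 month retention
    return "default"       # everything else: 7 day retention


names = [
    "XX_backup_2016_12_09_090000_1.bak",
    "XX_backup_2016_12_09_150003_4066809.bak",
    "XX_backup_2016_12_11_150000_2.bak",
    "XX_backup_2016_12_31_230000_3.bak",
]
for n in names:
    print(n, "->", retention_tag(n, names))
```

Checking the month first, then the week, then the day is one way to resolve the edge case where a file qualifies for more than one rule: it always receives the tag with the longest retention.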

XSLT to compare same nodeset A in 2 files and report specified nodes X,Y, Z where A differs

I want to compare 2 XML files made up of records which each contain an ID field.
File 1 is the raw data input, file 2 is the cleaned up version. I want to do a basic QA on the cleaning process.
Where there's any difference between the files in terms of
specific ID values that are in file 1 but not in file 2 (or vice versa)
multiple instances of ID values in either file (1 or 2)
...I want to report a couple of other nodes X, Y, Z from the offending "records" that contain the additional/missing/multiple IDs, to see why (and if) those records were cleaned going from file 1 to file 2.
I get this conceptually, my main question is how to write
the references to the separate input files;
the if statement(s).
XSLT 1.0 preferred.

How to delete a specific type of files after x amount of days?

I have a service that writes .txt files to a folder every 30 seconds. Is there any way to delete files that are older than x days?
In your service you can loop through the list of files, find the ones that are x number of days old and delete them.
See the File, FileInfo, Directory and DirectoryInfo classes on MSDN.
Alternatively, use the WMI service to find the collection of files with a certain creation date, then delete them.
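The loop suggested in the first answer can be sketched as follows. This is a Python illustration of the idea rather than the .NET classes mentioned above; the function name `delete_old_files` and its parameters are made up, and the sketch assumes that file age is judged by last modification time.

```python
import time
from pathlib import Path


def delete_old_files(folder, max_age_days, pattern="*.txt", now=None):
    """Delete files in `folder` matching `pattern` that are older than `max_age_days`.

    Age is measured from the last modification time (an assumption).
    Returns the sorted names of the deleted files.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    deleted = []
    for path in Path(folder).glob(pattern):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)
```

The service could call something like `delete_old_files("output", 7)` after each write, or on a timer, to keep only the last week of files; `"output"` here is a placeholder folder name.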