Incrementally copy files from S3 to local HDFS - amazon-s3

I have an app that writes data to S3 daily, hourly, or just at random times, and another app that reads data from S3 into local HBase. Is there any way to tell which file was the last one uploaded in the previous update and then read only the files after that, in other words, incrementally copy the files?
For example:
Day 1: App1 writes files 1, 2, 3 to folder 1; App2 reads those 3 files into HBase.
Day 4: App1 writes files 4 and 5 to folder 1 and files 6, 7, 8 to folder 2; App2 needs to read 4 and 5 from folder 1 and then 6, 7, 8 from folder 2.
Thanks

The LastModified field can be used to process data based on the upload date. This requires some bookkeeping logic on the client side that records what has already been processed. The simplest approach is to store the timestamp of the last run, so that everything arriving after it is considered new.
Example:
s3cmd ls s3://test
2012-07-24 18:29 36303234 s3://test/dl.pdf
Note the date at the front of each listing line.
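For illustration, here is a minimal boto3 (Python) sketch of that watermark idea. The bucket name comes from the s3cmd example above; the local last_run.json state file is just a hypothetical place to keep the last processed timestamp, and App2 could run something like this before loading the new objects into HBase.

import json
from datetime import datetime, timezone

import boto3

BUCKET = "test"                # bucket from the s3cmd example above
STATE_FILE = "last_run.json"   # hypothetical local file holding the watermark

def load_watermark():
    try:
        with open(STATE_FILE) as f:
            return datetime.fromisoformat(json.load(f)["last_modified"])
    except FileNotFoundError:
        return datetime.min.replace(tzinfo=timezone.utc)

def save_watermark(ts):
    with open(STATE_FILE, "w") as f:
        json.dump({"last_modified": ts.isoformat()}, f)

s3 = boto3.client("s3")
watermark = load_watermark()
newest = watermark

# Walk the full listing and keep only objects uploaded after the watermark.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if obj["LastModified"] > watermark:
            print("new object:", obj["Key"], obj["LastModified"])
            # ... download/process the object here before loading it into HBase
            newest = max(newest, obj["LastModified"])

save_watermark(newest)

Everything with a LastModified newer than the stored timestamp is treated as new, and the newest timestamp seen becomes the watermark for the next run.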

Related

Copy multiple files from multiple folders to a single folder in Azure Data Factory

I have a folder structure like this as a source
Source/2021/01/01/*.xlsx files
Source/2021/03/02/*.xlsx files
Source/2021/04/03/*.xlsx files
Source/2021/05/04/*.xlsx files
I want to drop all these Excel files into a different folder called Output.
Method 1:
When I tried this with a Copy activity, I was able to get the files, but with their folder structure (not a requirement), in the Output folder. I used the Binary file format.
Method 2:
I was also able to get the files into my Output folder, but named with some random id .xlsx. I used Flatten Hierarchy.
My requirement is to get the files with the same names as in the source.
This is what I suggest; I have implemented something similar in the past and I am pretty confident this should work.
Steps
Use a GetMetadata activity to list all the folders inside Source/2021/.
Use a ForEach (FE) loop and check the item type so that you keep folders only and no files (I know at this level you don't have files anyway).
Inside the If, add an Execute Pipeline activity; it should point to a new pipeline which takes a parameter like
Source/2021/01/01/
Source/2021/03/02/
The new pipeline should have its own GetMetadata activity and FE loop, and this time it should look for files only.
Inside that FE loop, add a Copy activity and use the full file name as the source file name, so the output keeps the same name as the source.

Ruby Compare Very Large Files

I have two CSV files that are lists of S3 bucket objects:
The first CSV file represents the objects in the source S3 bucket.
The second CSV file represents the objects in the destination S3 bucket.
I need to know which files to copy from the source S3 bucket to the destination bucket by finding the objects that aren't already in the destination bucket. Each CSV line holds the path, size, and modified date. If any one of these is different, I need the source object copied to the destination bucket.
Here's the first example CSV file:
folder1/sample/test1,55,2019-07-19 19:36:56 UTC
folder2/sample/test5,55,2019-07-19 19:34:31 UTC
folder3/sample/test9,55,2019-07-19 19:32:12 UTC
Here's the second example CSV file:
folder1/sample/test1,55,2019-07-16 19:32:58 UTC
folder2/sample/test5,55,2019-07-14 19:34:31 UTC
folder3/sample/test9,55,2019-07-19 19:32:12 UTC
In this example the first and second lines would be returned.
The following code works on these 3-line files but fails on randomly generated files of 1000+ lines:
f1 = File.open('file1.csv', 'r')
f2 = File.open('file2.csv', 'r')
f1.each.zip(f2.each).each do |line1, line2|
  if line1 != line2
    puts line1
  end
end
How can I accurately compare all lines - preferably with the least amount of CPU/Memory overhead?
You could load the destination list into an array (or set) in memory, then step through the source list one line at a time. If a source line is not in that collection, the file needs to be copied.
If even one file is too big to load into memory, and the files are sorted in filename order, then you could step through both files together and compare lines. You'll need to use the filenames to determine whether to skip over lines to stay in sync.
An alternative option is to use Amazon Athena, joining the data between files to find lines that don't match.
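For illustration, here is a minimal sketch of the first option, written with Python's built-in set for brevity (Ruby's standard-library Set class supports the same pattern); the file names are the ones from the question.

# Load every destination line into a set for O(1) membership checks,
# then stream the (possibly huge) source file one line at a time.
with open("file2.csv") as dest:
    dest_lines = {line.rstrip("\n") for line in dest}

with open("file1.csv") as src:
    for line in src:
        entry = line.rstrip("\n")
        # Any path/size/date combination missing from the destination
        # marks an object that still needs to be copied.
        if entry not in dest_lines:
            print(entry)

Unlike the zip-based version, this does not depend on the two files having the same length or line order, and memory use is proportional to the destination list only.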

Split CSV file into records and save in CSV file format - Apache NiFi

What I want to do is the following...
I want to divide the input file into records, convert each record into a separate file, and leave all the files in a directory.
My .csv file has the following structure:
ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
ERP,FRANK,DIETSCH,5064 E METAIRIE AVE.,BRANDSVILLA,MO,65687,252-5592,1176 E THAYER ST.,COLUMBIA,MO,65215,557,291-9571,217-38-5525,129-10-0407,1/13/35,M,
As you can see, it doesn't have a header row.
Here is my flow.
My problem is that when the split processor divides my CSV into flowfiles of 400 lines each, they aren't saved in my output directory.
It's my first time using NiFi, sorry.
Make sure your RecordReader controller service is configured correctly (delimiter, etc.) to read the incoming flowfile.
Set the Records Per Split value to 1.
You need to use an UpdateAttribute processor before the PutFile processor to change the filename to a unique value (such as a UUID), unless you have configured the PutFile processor's Conflict Resolution Strategy as Ignore.
The reason for changing the filename is that the SplitRecord processor keeps the same filename for all of the split flowfiles.
Flow:
I tried your case and the flow worked as expected. Use this template for reference: upload it to your NiFi instance and make changes as per your requirements.

Using Boto To Copy Multiple Paths/Files From S3 To S3

I have a bunch of S3 files and folders I need to copy locally, decrypt, and then copy into my S3 bucket.
The setup per S3 bucket varies but is basically like this:
S3 bucket name > event folder (A, B, C, D, E, for example) > country subfolder (UK, US, Germany, for example) > subfolder containing all the data 'runs' (2017-Jan, 2017-Feb, etc.) > files within that subfolder.
I need to copy a few events (e.g. just A and C) for a few countries (e.g. just the UK and Germany), for the latest 'data run', i.e. 2017-August. I need to do this monthly and, in reality, there are ~100 paths I need, so I really don't want to copy each one manually. Copying the entire bucket is also not an option, as it's way too big.
I am wondering if boto is the best tool for this, or if it will only allow me to copy one path at a time. The S3 files are in .gzip format, hence I can't copy them over to my S3 bucket directly (I have to decrypt them first). I have been trying to find an example but could not find one.
Edit: I had a look at the recursive function but that (I believe) only applies to files within the same folder. So if you have a folder > subfolder > subfolder > files you are screwed.
Thanks!
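If boto3 (the current AWS SDK for Python) is acceptable, a hedged sketch of the list-and-download step could look like the following. The bucket name, event and country lists, the run value, and the downloads directory are all placeholders, since the real values aren't given above.

import os

import boto3

BUCKET = "my-source-bucket"      # placeholder bucket name
EVENTS = ["A", "C"]              # events to pull
COUNTRIES = ["UK", "Germany"]    # countries to pull
RUN = "2017-August"              # latest data run (naming format assumed)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for event in EVENTS:
    for country in COUNTRIES:
        prefix = f"{event}/{country}/{RUN}/"
        # Listing by prefix walks everything below it, however deeply nested,
        # so folder > subfolder > subfolder > files structures are covered.
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                if obj["Key"].endswith("/"):
                    continue  # skip folder placeholder objects
                local_path = os.path.join("downloads", obj["Key"])
                os.makedirs(os.path.dirname(local_path), exist_ok=True)
                s3.download_file(BUCKET, obj["Key"], local_path)
                # ... decrypt/decompress local_path here, then push the result
                # to the destination bucket with s3.upload_file(...)

As far as I know there is no single boto call that copies a whole list of paths, but generating the ~100 prefixes in a loop like this avoids copying them by hand or pulling the entire bucket.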

Need SQL code for all file/folder properties including all subfolders, size in MB, modified/created/last accessed dates, path/location

I need SQL code to find the following details for a drive/path:
- Folder name and file name
- Folder size and file size in MB
- Date created, date modified, and date accessed
- File flag (1/0), yes or no, based on whether the line item is a file or a folder
- Path/URL/location of the file or folder
I need these details for all the folders, sub-folders, and files in the given path/URL.
I have multiple pieces of code which provide each of these details individually, but I am unable to merge them into a single query or join the results after execution.
Expected output:
http://s6.postimg.org/cn845zwdd/expected_output.jpg
Parts of the code are available in the blogs below:
http://sqljourney.wordpress.com/2010/06/08/get-list-of-files-from-a-windows-directory-to-sql-server/#comment-1674 (This link provides a function that gives date modified/path)
Get each file size inside a Folder using SQL (this link provides file size, file flag)
The above links provide parts of the query, but I am unable to join them to get everything into a single stored procedure.