Using Boto To Copy Multiple Paths/Files From S3 To S3 - amazon-s3

I have a bunch of S3 files and folders that I need to copy locally, decrypt, and then copy into my own S3 bucket.
The setup varies per S3 bucket but is basically like this:
S3 bucket name > event folder (A, B, C, D, E, for example) > country subfolder (UK, US, Germany, for example) > subfolder containing all the data 'runs' (2017-Jan, 2017-Feb, etc.) > files within the subfolder.
I need to copy a few events (e.g. just A and C) for a few countries (e.g. just the UK and Germany), for the latest 'data run', i.e. 2017-August. I need to do this monthly and, in reality, there are ~100 paths I need, so I really don't want to copy each one manually. Copying the entire bucket is also not an option as it's way too big.
I am wondering if boto is the best tool for this, or if it will only allow me to copy one path at a time. The S3 files are in .gzip format, hence I can't copy them over to my S3 bucket directly (I have to decrypt them first). I have been trying to find an example but could not find one.
Edit: I had a look at the recursive function, but that (I believe) only applies to files within the same folder. So if you have a folder > subfolder > subfolder > files, you are stuck.
Thanks!
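A minimal sketch of how this could be scripted with boto3, assuming the prefixes can be built from lists of events, countries, and the latest run; the bucket names and the decrypt_bytes helper below are placeholders rather than anything from the original setup:

import gzip
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "source-bucket"        # placeholder
DEST_BUCKET = "my-destination-bucket"  # placeholder
EVENTS = ["A", "C"]
COUNTRIES = ["UK", "Germany"]
LATEST_RUN = "2017-August"

def decrypt_bytes(data):
    # Placeholder: the files are gzipped and need some form of decryption;
    # swap in whatever decrypt step is actually required.
    return gzip.decompress(data)

paginator = s3.get_paginator("list_objects_v2")
for event in EVENTS:
    for country in COUNTRIES:
        prefix = f"{event}/{country}/{LATEST_RUN}/"
        # list_objects_v2 returns every key under the prefix, so nested
        # "subfolders" are picked up without any extra recursion.
        for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"].read()
                s3.put_object(Bucket=DEST_BUCKET, Key=obj["Key"],
                              Body=decrypt_bytes(body))

Because the prefixes are generated in a loop, ~100 paths are handled just as easily as one, which addresses the "one path at a time" concern.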

Related

How to get all files and folders recursively using GetMetadata activity

I need to delete many folders, with an exempt list containing the folders and files that I should not delete. So I tried the Delete activity and tried to use the if function from Add dynamic content to check whether the file or folder name matches one of the specified ones. But I do not know what the parameters of #if() should be. In other words, to use these functions, how do we get the file name or folder name?
It is difficult to get all files and folders using the GetMetadata activity.
As workarounds, you can try these approaches:
1. Create a Delete activity and select the List of files option, then create a file containing the paths of the files and folders that need to be deleted (relative to the path configured in the dataset).
2. Use the blob SDK to do this, as in the sketch below.
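A rough sketch of the blob SDK route using the azure-storage-blob Python package; the connection string, container name, prefix, and exempt list are all placeholders:

from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<your-storage-connection-string>"  # placeholder
CONTAINER = "mycontainer"                               # placeholder
EXEMPT = {"keep/this/file.csv", "keep/this/folder/"}    # placeholder exempt list

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER)

# list_blobs walks every blob under the prefix, so nested "folders" are covered.
for blob in container.list_blobs(name_starts_with="folder-to-clean/"):
    exempt_folder = any(blob.name.startswith(p) for p in EXEMPT if p.endswith("/"))
    if blob.name in EXEMPT or exempt_folder:
        continue  # skip anything on the exempt list
    container.delete_blob(blob.name)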

NiFi data insertion into s3 subdirectory

I have a flow where I am extracting data from a database, converting the Avro to CSV format, and pushing the CSV into an S3 bucket which has a subfolder in it. My S3 structure is like the following:
As you can see in the screenshot above, my files are going into a blank folder (highlighted in red) instead of going into a subfolder called 'Thermal'. Please see my PutS3Object settings:
The final S3 path I want my files to go into is: export-csv-vehicle-telemetry/vin11/Thermal
What settings should I change in my processor so the files go directly inside the 'Thermal' folder?
Use the Bucket name export-csv-vehicle-telemetry/vin15/Thermal instead of export-csv-vehicle-telemetry/vin15/Thermal/
The extra slash at the end is not required when specifying bucket names.
BTW, your image shows a vin11 directory instead of vin15. Check whether that is correct.

Ruby Compare Very Large Files

I have two CSV files that are lists of S3 bucket objects:
The first CSV file represents the objects in the source S3 bucket.
The second CSV file represents the objects in the destination S3 bucket.
I need to know which files to copy from the source S3 bucket to the destination bucket by finding the objects that aren't already in the destination bucket. Each line of the CSV contains the path, size, and modified date. If any one of these is different, I need the source object copied to the destination bucket.
Here's the first example CSV file:
folder1/sample/test1,55,2019-07-19 19:36:56 UTC
folder2/sample/test5,55,2019-07-19 19:34:31 UTC
folder3/sample/test9,55,2019-07-19 19:32:12 UTC
Here's the second example CSV file:
folder1/sample/test1,55,2019-07-16 19:32:58 UTC
folder2/sample/test5,55,2019-07-14 19:34:31 UTC
folder3/sample/test9,55,2019-07-19 19:32:12 UTC
In this example the first and second lines would be returned.
The following code works on these three-line examples but fails on randomly generated files of 1,000+ lines:
f1 = File.open('file1.csv', 'r')
f2 = File.open('file2.csv', 'r')

f1.each.zip(f2.each).each do |line1, line2|
  if line1 != line2
    puts line1
  end
end
How can I accurately compare all lines - preferably with the least amount of CPU/Memory overhead?
You could load the destination list into an array in memory, then step through the source list one line at a time. If a source line is not in the array, the file needs to be copied (see the sketch below).
If even one file is too big to load into memory, and the files are sorted in filename order, then you could step through both files together and compare lines. You'll need to use the filenames to decide whether to skip lines in one file or the other to stay in sync.
An alternative option is to use Amazon Athena, joining the data between files to find lines that don't match.
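A quick sketch of the first approach; it is written in Python here purely to illustrate the idea (a set gives constant-time membership checks, and Ruby's Set class works the same way):

# Load the destination listing into a set for fast membership checks,
# then stream the (possibly huge) source listing one line at a time.
with open('file2.csv') as dest:
    dest_lines = set(line.rstrip('\n') for line in dest)

with open('file1.csv') as src:
    for line in src:
        line = line.rstrip('\n')
        if line not in dest_lines:
            print(line)  # this object needs to be copied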

How do I save csv file to AWS S3 with specified name from AWS Glue DF?

I am trying to generate a file from a DataFrame that I have created in AWS Glue, and I am trying to give it a specific name. Most answers on Stack Overflow use filesystem modules, but here this particular CSV file is generated in S3. I also want to give the file a name while generating it, not rename it after it is generated. Is there any way to do that?
I have tried using df.save('s3://PATH/filename.csv'), which actually generates a new directory in S3 named filename.csv and then generates part-*.csv files inside that directory:
df.repartition(1).write.mode('append').format('csv').option("header", "true").save('s3://PATH')
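One common workaround (not from this thread, so treat it as a hedged sketch): let Spark write its part file to a temporary prefix, then use boto3 to copy that single object to the exact key you want and delete the temporary output. The bucket, prefixes, and final file name below are placeholders:

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                 # placeholder
TMP_PREFIX = "tmp/output/"           # where Spark wrote part-*.csv
FINAL_KEY = "reports/filename.csv"   # the name you actually want

# First write the DataFrame, e.g.:
# df.repartition(1).write.mode('overwrite').format('csv').option('header', 'true').save(f's3://{BUCKET}/{TMP_PREFIX}')

# Find the single part file Spark produced under the temporary prefix.
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=TMP_PREFIX)
part_key = next(o["Key"] for o in listing["Contents"] if o["Key"].endswith(".csv"))

# Copy it to the desired key, then clean up everything under the temporary prefix.
s3.copy_object(Bucket=BUCKET, Key=FINAL_KEY, CopySource={"Bucket": BUCKET, "Key": part_key})
for obj in listing.get("Contents", []):
    s3.delete_object(Bucket=BUCKET, Key=obj["Key"])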

Incrementally copy files from S3 to local HDFS

I have an app that writes data to S3 daily, hourly, or just randomly, and another app that reads data from S3 into local HBase. Is there any way to tell which file was uploaded last in the previous update, and then read only the files after that? In other words, incrementally copy the files?
For example:
Day 1: App1 writes files 1, 2, 3 to folder 1; App2 reads those 3 files into HBase.
Day 4: App1 writes files 4 and 5 to folder 1, and files 6, 7, 8 to folder 2; App2 needs to read 4 and 5 from folder 1 and then 6, 7, 8 from folder 2.
Thanks
The LastModified field can be used to process data based on the creation (upload) date. This requires built-in logic on the client side that keeps track of which items have already been processed. You can simply store the timestamp up to which you have processed, so everything that comes after it is considered new.
Example:
s3cmd ls s3://test
2012-07-24 18:29 36303234 s3://test/dl.pdf
Note the date at the front of each file's listing.
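A minimal boto3 sketch of that idea, assuming you persist the cutoff timestamp yourself between runs; the bucket name and state file are placeholders:

import json
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
BUCKET = "test"                      # placeholder bucket
STATE_FILE = "last_processed.json"   # placeholder local state file

# Load the timestamp of the last run (fall back to "process everything").
try:
    with open(STATE_FILE) as f:
        cutoff = datetime.fromisoformat(json.load(f)["cutoff"])
except FileNotFoundError:
    cutoff = datetime.min.replace(tzinfo=timezone.utc)

newest = cutoff
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if obj["LastModified"] > cutoff:
            # download / feed this object into HBase here
            print("new object:", obj["Key"], obj["LastModified"])
            newest = max(newest, obj["LastModified"])

# Remember how far we got for the next run.
with open(STATE_FILE, "w") as f:
    json.dump({"cutoff": newest.isoformat()}, f)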