Lambda trigger dynamic specific path s3 upload - amazon-s3

I am trying to create a Lambda function that is triggered when a folder is uploaded to an S3 bucket. The function will perform an operation that saves files back to the same folder. How can I do this without the function triggering itself?
I want to upload the following folder structure to the bucket:
Project_0001/input/inputs.csv
The outputs will be created and saved at:
Project_0001/output/outputs.csv
But my project number will change, so I can't simply assign a static prefix. Is there a way to change the prefix dynamically, something like:
Project_*/input/

From Shubham's comment I drafted my solution using the prefix and suffix.
In my case, I set the prefix to 'Project_', and for the suffix I chose one specific file as the trigger, so my suffix is '/input/myFile.csv'.
So every time I upload the structure Project_/input/allmyfiles_with_myFile.csv, it triggers the function, and I then save my output in the same project folder under the output folder, which does not trigger the function again.
I get the project name with the following code:
key = event['Records'][0]['s3']['object']['key']
project_id = key.split("/")[0]
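A slightly fuller sketch of the same idea, in case it helps. It assumes the prefix/suffix filter described above and the Project_0001/output/outputs.csv layout from the question; the names NOTIFICATION_FILTER and output_key_for are made up for illustration. S3 URL-encodes object keys in event notifications, so the key is decoded before it is split, and the function refuses to act on anything outside an input folder as an extra guard against a loop.

```python
from urllib.parse import unquote_plus

# Filter for the bucket notification (the "Filter" part of a Lambda
# function configuration): only keys that start with "Project_" and
# end with "/input/myFile.csv" invoke the function.
NOTIFICATION_FILTER = {
    "Key": {
        "FilterRules": [
            {"Name": "prefix", "Value": "Project_"},
            {"Name": "suffix", "Value": "/input/myFile.csv"},
        ]
    }
}


def output_key_for(event):
    """Return (project_id, output_key) for the first record, or None.

    Hypothetical helper: the output file name is the one from the question.
    """
    # Keys arrive URL-encoded in S3 events (e.g. spaces become "+").
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
    parts = key.split("/")
    # Extra guard: ignore anything that is not under <project>/input/.
    if len(parts) < 3 or parts[1] != "input":
        return None
    project_id = parts[0]
    return project_id, f"{project_id}/output/outputs.csv"
```

Because the output key ends in /output/outputs.csv, it does not match the suffix filter, so writing it should not invoke the function again.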

Related

Azure Data Factory - Switch Activity - File name startsWith

I need to create an Azure Data Factory pipeline that first formats the source file and then calls another pipeline. The pipeline would be triggered every time a new file is uploaded to the source blob storage. I want to reuse this pipeline for different source file formats.
For this I intend to use a Switch activity and, based on the source file name, call the corresponding Copy activity to create a formatted sink file. The issue is that the source files have standard prefixes followed by a timestamp, which means the file name is different every time, something like:
File 1:
ABCDEF_1233
ABCDEF_2244
File 2:
UVWXYZ_1222
UVWXYZ_2345
Can anyone help me understand how to do this?
I was thinking of using a Switch activity with @startsWith(triggerBody().fileName, ) in the expression, then providing the file name prefixes (ABCDEF, UVWXYZ, etc.) in the CASE statements and calling a Copy activity for each case.
But I am not sure how to specify the second argument of the startsWith() function.
Suppose you have the file name in a variable called filename. Use a Set Variable activity to assign the matching file prefix to another variable called prefix, with an expression like this:
@if(equals(indexOf(variables('filename'),'ABCDEF'),0),'ABCDEF',if(equals(indexOf(variables('filename'),'UVWXYZ'),0),'UVWXYZ',''))
indexOf returns 0 when the file name starts with the given prefix. After this Set Variable activity runs, prefix will hold either ABCDEF or UVWXYZ (or an empty string if neither matches).
Then you can use a Switch activity on the prefix variable, with these cases:
ABCDEF
UVWXYZ
For each case, you can have a Copy activity that does the related transformations.

Is there a way to list the directories using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the file names, and then pull a file if needed in a different cell. I can access the files just fine using Cyberduck, and it shows the names there.
Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both of those functions work, so I know my keys are correct, but the files take too long to load.
Since the list of files can be retrieved by a single node, you can just list the files in the directory. Look at this response.
wholeTextFiles returns (path, content) tuples, but I don't know whether the file content is loaded lazily, which is what you would need in order to read only the first element of each tuple cheaply.
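One way to list names without reading any file contents is to go through the Hadoop FileSystem API on the driver. The sketch below is only an assumption about how that could look: it relies on the private PySpark attributes sc._jvm and sc._jsc, and list_file_names is a made-up helper name.

```python
def list_file_names(sc, directory):
    """Return the full paths of the entries directly under `directory`.

    Only metadata is fetched; no file contents are read.
    `sc` is an existing SparkContext (e.g. the one a notebook provides).
    """
    hadoop_fs = sc._jvm.org.apache.hadoop.fs
    path = hadoop_fs.Path(directory)
    # Resolve the file system from the path's scheme, using the
    # credentials already present in the Spark Hadoop configuration.
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    return [status.getPath().toString() for status in fs.listStatus(path)]
```

In a notebook this would be called with the same kind of URI that already works for sc.textFile, pointing at the directory instead of a single file. The call is not recursive, so sub-directories such as sectionsofdate need their own call.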

dynamic path in Keystone file using keystone-storage-adapter-s3

How do you generate a path based on dynamic data inside the model the file is being saved to? An example would be having a User model with a fileAttachment field. If one instance of the User model has a registrationNumber of 123, I want to store their file at /123/fileName.pdf. If another user has a registrationNumber of 456, I want to store their file at 456/fileName.pdf. The path field accepts a string, and unfortunately during the time it is set, there's no access to the model fields. In addition, the file is named and uploaded to AWS by the time the pre('save', ...) hook is executed which prevents renaming there.

Search for a specific file in S3 using boto

I am using boto to parse S3 buckets. Basically, I want to find a certain file in the bucket (say *.header, or any other pattern or regular expression provided by the user). Since I could not find a function for that in boto, I tried to write a BFS routine that searches through the contents of each folder, but I couldn't find a method to get the contents of a folder by key/key.name (which I get from bucketObj.list()). Is there another way to do this?
For instance, let's say I have multiple folders in the bucket, like:
mybucket/A/B/C/x.txt
mybucket/A/B/D/y.jpg
mybucket/A/E/F/z.txt
and I want to find where the *.txt files are,
so the boto script should return the following result:
mybucket/A/B/C/x.txt
mybucket/A/E/F/z.txt
There is no way to do wildcard searches or file-globbing service-side with S3. The only filtering available via the API is a prefix. If you specify a prefix string, only results that begin with that prefix will be returned.
Otherwise, all filtering would have to happen on the client-side. Or, you could store your keys in a database and use that to do the searching and only retrieve the matches from S3.
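A minimal sketch of the client-side approach: list keys under a prefix, then match them against a glob pattern locally. Note that it uses boto3 rather than the older boto from the question, and that iter_keys and match_keys are names made up here.

```python
from fnmatch import fnmatchcase


def iter_keys(bucket, prefix=""):
    """Yield every key in `bucket` that begins with `prefix`."""
    import boto3  # imported here so match_keys works without AWS access

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def match_keys(keys, pattern):
    """Keep the keys that match a shell-style pattern such as '*.txt'.

    In fnmatch, '*' also matches '/', so '*.txt' finds files at any depth.
    """
    return [key for key in keys if fnmatchcase(key, pattern)]
```

For the example above, match_keys(iter_keys("mybucket"), "*.txt") would keep A/B/C/x.txt and A/E/F/z.txt. Passing a prefix such as "A/B/" narrows the listing on the S3 side before the local match runs.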

Single file versioning best practices?

The user selects rather hefty single XML files via an NSOpenPanel. The application makes moderate changes to the file, so I'd like to include an option to create a backup in a subfolder of the directory the original file was selected from. Creating the new subfolder is no problem, but does anybody have a good way to create a backup of said foo.xml? Is there an established practice for this, or is it as simple as creating a duplicate and renaming it foo.back01.xml?
I'm not sure how well this approach fits your requirement, but this is what I did:
-- Keep a directory in the system's temporary folder, on the assumption that all these files will be deleted once the application is closed.
-- To keep file names unique, generate them with a fixed pattern using a function such as [+(NSString *) generateFileNameForExtension:(NSString *)extension Create:(bool)bCreate].
Given .xml and false as input, it might return a file name like this:
AppName128908765445.xml, i.e. [AppName][UTCTimeStamp].[Fileextension]
-- Once you are done with a file, call a function such as [self addToDeleteList:(NSString *)fileName], which adds the file to a delete list.
-- A separate function starts a timer that fires every minute, reads all the files added to the delete list, and deletes them.
I can share the code if needed.