Search for a specific file in S3 using boto - amazon-s3

I am using boto to parse S3 buckets. Basically I want to find a certain file in the bucket (say *.header, or any other pattern provided by the user). Since I could not find any function for that in boto, I was trying to write a BFS routine to search through the contents of each folder, but I couldn't find any method to get the contents of a folder by key/key.name (which I am getting from bucketObj.list()). Is there any other method for doing this?
For instance, let's say I have multiple folders in the bucket, like:
mybucket/A/B/C/x.txt
mybucket/A/B/D/y.jpg
mybucket/A/E/F/z.txt
and I want to find where all the *.txt files are, so the boto script should return the following result:
mybucket/A/B/C/x.txt
mybucket/A/E/F/z.txt

There is no way to do wildcard searches or file-globbing server-side with S3. The only filtering available via the API is a prefix: if you specify a prefix string, only results whose keys begin with that prefix are returned.
Otherwise, all filtering has to happen on the client side. Alternatively, you could store your keys in a database, do the searching there, and only retrieve the matches from S3.
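For example, a client-side match could look like the sketch below, using the classic boto interface from the question plus fnmatch; the bucket name, prefix, and pattern are placeholders:

import fnmatch
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('mybucket')

# List the keys (optionally narrowed with a prefix) and glob-match them client-side.
matches = [key.name for key in bucket.list(prefix='A/')
           if fnmatch.fnmatch(key.name, '*.txt')]
print(matches)   # e.g. ['A/B/C/x.txt', 'A/E/F/z.txt']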

Related

Lambda trigger dynamic specific path s3 upload

I am trying to create a Lambda function that will get triggered once a folder is uploaded to an S3 bucket. But the Lambda will perform an operation that saves files back to the same folder; how can I do that without having a self-calling function?
I want to upload the following folder structure to the bucket:
Project_0001/input/inputs.csv
The outputs will create and be saved on:
Project_0001/output/outputs.csv
But my project number will change, so I can't simply assign a static prefix. Is there a way to dynamically change the prefix, something like:
Project_*/input/
From Shubham's comment I drafted my solution using the prefix and suffix.
For my case, I set the prefix to 'Project_' and for the suffix I chose one specific file as the trigger, so my suffix is '/input/myFile.csv'.
So every time I upload the structure Project_*/input/allmyfiles_with_myFile.csv it triggers the function, and then I save my output in the same project folder, under the output folder, thus not triggering the function again.
I get the project name with the following code:
key = event['Records'][0]['s3']['object']['key']
project_id = key.split("/")[0]
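For completeness, a minimal handler sketch along those lines; the bucket/key handling follows the standard S3 event format, the output body is a placeholder, and the write-back avoids retriggering only because of the '/input/myFile.csv' suffix filter on the trigger:

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Key looks like 'Project_0001/input/myFile.csv'
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    project_id = key.split("/")[0]
    output_key = project_id + "/output/outputs.csv"
    # ... build the output, then save it under the output/ folder;
    # the suffix filter on the trigger keeps this upload from firing the function again
    s3.put_object(Bucket=bucket, Key=output_key, Body=b"placeholder output")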

Get blobs from azure storage excluding subfolders

Trying to get blobs from an Azure storage container using CloudBlobContainer.ListBlobs, but I want to get only the blobs in a specific folder (using a prefix), not the sub-folders under that folder.
For example, let's say I have this:
Folder
  Sub-Folder
    image2.jpg
  image1.jpg
If I use Folder as the prefix, I want to get image1.jpg AND exclude image2.jpg (anything under sub-folders)
From the reference:
If you call the CloudBlobContainer.listBlobs() method, it will by default return a list of BlobItems that contains the blobs and directories immediately under the container. This is the default behavior with v8.
Even if you want to match a pattern, as mentioned by @Gaurav Mantri (reference):
There's limited support for server-side searching in Azure Blob Storage. The only thing you can filter on is the blob prefix, i.e. you can instruct the Azure Storage service to only return blobs whose names start with certain characters.
But if you want to filter based on file name, the example below might give you an idea:
var container = blobClient.GetContainerReference(containerName);
foreach (var file in container.ListBlobs(prefix: "Folder/filename.xml", useFlatBlobListing: true))
{ … }
In your case try with Folder/filename.jpg.
Note:
useFlatBlobListing: Setting this value to true ensures that only blobs are returned (including those inside any sub-folders under that directory), not directories.
So, if you only want blobs, you have to set the UseFlatBlobListing property to true.
Other references:
Ref 1, Ref 2

Is there a way to list the contents of a directory using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the filenames, and then pull a file if needed in a different cell. I can access the files just fine using Cyberduck and it shows the names there.
Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all in via sc.textFile or sc.wholeTextFiles. Both of those functions work, so I know my keys are correct, but it takes too long for them to load.
Considering that the list of files can be retrieved by a single node, you can just list the files in the directory. Look at this response.
wholeTextFiles returns a tuple (path, content), but I don't know whether the file content is read lazily enough to fetch only the first part of the tuple.
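If you are in a plain PySpark notebook, one common approach is to go through Hadoop's FileSystem API via the JVM gateway; the sketch below assumes a SparkSession named spark, uses the internal spark._jsc/_jvm handles, and keeps the placeholder path from the question:

path = "name:///mainfolder/date"  # placeholder path from the question

hadoop_conf = spark._jsc.hadoopConfiguration()
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(
    spark._jvm.java.net.URI(path), hadoop_conf)

# listStatus returns the directory's immediate children without reading file contents
for status in fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path)):
    print(status.getPath().toString())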

Bigquery Java API - Get sublisting within a cloud folder?

How do I get a list of objects within a sub-directory on Cloud Storage using the BigQuery Java API?
I can get a list of objects within a folder, but not within a second-level folder.
So if I have
/Folder1/Folder2/File1.csv
/Folder1/File2.csv
I can get Folder2 & File2.csv using the following command:
list = cloudstorage.objects().list("Folder1");
But how do I get the list of objects within Folder2?
I think you should use the setPrefix method to filter results to objects whose names begin with a given prefix. Assuming your bucket is Folder1, you should try
setPrefix("Folder2/")

How to get all the folder keys recursively?

I want to get all the folder keys recursively, without files.
Example:
FolderA
FolderA/FolderB
FolderA/FolderB/FolderC
FolderC
FolderD
Thanks.
Although S3 does not have a concept of "folder" (see here), it lets you perform hierarchical operations using "prefix" and "delimiter".
You can look at the Java API here.
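Conceptually, each listing with a delimiter of "/" returns the immediate "sub-folders" as common prefixes, and you recurse into each of them. A sketch of that idea with boto3 (the Python SDK; the Java API linked above exposes the same prefix/delimiter parameters), with the bucket name as a placeholder:

import boto3

s3 = boto3.client('s3')

def list_folder_keys(bucket, prefix=''):
    # Recursively collect "folder" keys by walking CommonPrefixes.
    folders = []
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        for cp in page.get('CommonPrefixes', []):
            folders.append(cp['Prefix'])                            # e.g. 'FolderA/FolderB/'
            folders.extend(list_folder_keys(bucket, cp['Prefix']))  # recurse one level down
    return folders

print(list_folder_keys('mybucket'))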