BigQuery Java API - Get sublisting within a cloud folder? - google-bigquery

How do I get a list of objects within a sub-directory on Cloud Storage using the BigQuery Java API?
I can get a list of objects within a folder, but not within a second-level folder.
So if I have
/Folder1/Folder2/File1.csv
/Folder1/File2.csv
I can get Folder2 & File2.csv using the following command:
list = cloudstorage.objects().list("Folder1");
But how do I get the list of objects within Folder2?

I think you should use the setPrefix method to filter the results to objects whose names begin with a given prefix. Assuming your bucket is Folder1, try:
list = cloudstorage.objects().list("Folder1").setPrefix("Folder2/");

Related

Get File Structure from Get Metadata in ADF

I want to get the column names for a parquet file. I have a Get Metadata activity in my pipeline, and it is using a parquet dataset with only the root folder provided. Because only the folder is provided, ADF is not letting me get the file structure that contains the column names. The file name is not provided because it can change. Can anyone provide some advice on how to approach this?
You will need 2 Get Metadata activities and a ForEach activity to get the file structure if your file name is not the same every time.
Source dataset:
Parameterize the file name, since the name changes frequently.
Preview of source data:
Get Metadata1:
In the first Get Metadata activity, get the file name dynamically.
You can also handle file names that follow a specific pattern by adding an expression in the filename, or use an asterisk (*) as a wildcard if you don't have a specific pattern or if more than one file in the folder needs to be processed.
Set the field list to Child items when you want to get the files from the folder.
Output of Get Metadata1: Get the file name from the folder.
ForEach activity:
Using the ForEach activity, you can iterate over the item names listed in the Get Metadata activity's output array.
Get Metadata2:
Add a Get Metadata activity inside the ForEach activity to get the file structure (the column list) of the current file from the folder. The loop runs once per item in the folder (1 or more).
Output of Get Metadata2:
In short: parameterize the file name in the dataset (or via the Get Metadata activity), get the list of files within the folder, and then use another Get Metadata activity to get the list of columns for those corresponding files.

Get blobs from Azure storage excluding subfolders

I'm trying to get blobs from an Azure storage container using CloudBlobContainer.ListBlobs, but I only want blobs from a specific folder (using a prefix), not from the sub-folders under the main folder.
For example, let's say I have this:
Folder
    Sub-Folder
        image2.jpg
    image1.jpg
If I use Folder as the prefix, I want to get image1.jpg AND exclude image2.jpg (anything under sub-folders)
From the reference:
If you call the CloudBlobContainer.listBlobs() method, it will by default return a list of BlobItems containing the blobs and directories immediately under the container. This is the default behavior with v8.
Even if you want to match a pattern, as mentioned by Gaurav Mantri (reference):
There's limited support for server-side searching in Azure Blob Storage. The only thing you can filter on is the blob prefix, i.e. you can instruct the Azure Storage service to only return blobs whose names start with certain characters.
But if you want to filter based on the file name, the example below might give you an idea:
var container = blobClient.GetContainerReference(containerName);
foreach (var file in container.ListBlobs(prefix: "Folder/filename.xml", useFlatBlobListing: true))
{ … }
In your case try with Folder/filename.jpg.
Note:
useFlatBlobListing: setting this value to true ensures that only blobs are returned (including blobs inside any sub-folders under that directory), and not directories.
So, if you only want blobs, you have to set the UseFlatBlobListing option to true.
Other references:
Ref 1, Ref 2

Is there a way to list the directories in a bucket using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the filenames, and then pull in a file if needed in a different cell. I can access the files just fine using Cyberduck, and it shows the names on there.
For example: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works. But I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all in via sc.textFile or sc.wholeTextFiles. Both of those functions work, so I know my keys are correct, but it takes too long to load them.
Considering that the list of files can be retrieved by a single node, you can just list the files in the directory, as in the sketch below. Look at this response.
wholeTextFiles returns a tuple (path, content), but I don't know whether the file content is read lazily enough that you could take only the first part of the tuple.
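A minimal sketch of that idea, listing paths from the driver through the Hadoop FileSystem API exposed by PySpark's JVM gateway (sc._jvm and sc._jsc are internal handles, and the URI below is just the placeholder path from the question):
# List file names under a directory without reading the file contents.
# Assumes a live SparkContext `sc`, as in a notebook; replace the URI with your real path.
hadoop_fs = sc._jvm.org.apache.hadoop.fs
conf = sc._jsc.hadoopConfiguration()
path = hadoop_fs.Path("name:///mainfolder/date")
fs = path.getFileSystem(conf)
for status in fs.listStatus(path):
    print(status.getPath().toString())
Because this runs on the driver and touches only metadata, nothing is loaded; any path it prints can then be passed to sc.textFile in a later cell.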

How to get all the items contained inside an Office 365 OneDrive

I want to get all the files and folders contained inside an Office 365 OneDrive folder in one REST API call. Is there any option to do this?
There isn't a specific API call to retrieve a flat representation of a Drive. However, you can achieve a similar effect using the drive's search method.
Simply pass an empty query string and it will return metadata for each file (regardless of its directory):
https://graph.microsoft.com/v1.0/me/drive/root/search(q='')
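For instance, a minimal Python sketch that calls this endpoint and follows @odata.nextLink for paging (the access token is a placeholder you would obtain through your own auth flow):
import requests

access_token = "<your Graph access token>"   # placeholder; acquire via your auth flow
url = "https://graph.microsoft.com/v1.0/me/drive/root/search(q='')"
headers = {"Authorization": "Bearer " + access_token}

while url:
    data = requests.get(url, headers=headers).json()
    for item in data.get("value", []):
        print(item["name"])                  # each entry is a driveItem with its metadata
    url = data.get("@odata.nextLink")        # absent on the last page, which ends the loop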
Ok, try this search request:
https://graph.microsoft.com/v1.0/me/drive/root/search(q='%2A')
Or:
https://api.onedrive.com:443/v1.0/drives/(driveid)/items/(itemid)/view.search?q=%2A
Here %2A is a URL-encoded asterisk, and itemid may be a root folder ID. Don't forget about pagination.
Or with OneDriveSDK:
_connection.SearchForItemsAsync(odFolder.ItemReference(), "*", ItemRetrievalOptions.Default)
Don't use the "expand" query option together with the search query.
This should return all items in the current folder recursively: sub-folders and sub-items.

Search for a specific file in S3 using boto

I am using boto to parse S3 buckets. Basically, I want to find a certain file in the bucket (say *.header, or any other regex expression provided by the user). Since I could not find any function for that in boto, I was trying to write a BFS routine to search through the contents of each folder, but I couldn't find any method to get the contents of a folder by key/key.name (which I am getting from bucketObj.list()). Is there any other method for doing this?
For instance, let's say I have multiple folders in the bucket, like
mybucket/A/B/C/x.txt
mybucket/A/B/D/y.jpg
mybucket/A/E/F/z.txt
and I want to find where the *.txt files are,
so the boto script should return the following result:
mybucket/A/B/C/x.txt
mybucket/A/E/F/z.txt
There is no way to do wildcard searches or file-globbing service-side with S3. The only filtering available via the API is a prefix. If you specify a prefix string, only results that begin with that prefix will be returned.
Otherwise, all filtering would have to happen on the client-side. Or, you could store your keys in a database and use that to do the searching and only retrieve the matches from S3.
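As a rough sketch of that approach with classic boto, combining a server-side prefix with a client-side regex (the bucket name, prefix, and pattern below are just the ones from the question):
import re
import boto

conn = boto.connect_s3()                      # assumes credentials are already configured
bucket = conn.get_bucket("mybucket")

pattern = re.compile(r".*\.txt$")             # stands in for the user-supplied pattern
for key in bucket.list(prefix="A/"):          # server-side: only keys under A/
    if pattern.match(key.name):               # client-side: regex on the full key name
        print(key.name)                       # e.g. A/B/C/x.txt and A/E/F/z.txt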