S3 bucket: listing only the files located in multiple folders, without listing the folders - amazon-s3

ObjectListing objectListing = s3.listObjects(
        new ListObjectsRequest()
                .withBucketName(bucketName)
                .withPrefix("img/1/abc/"));
for (S3ObjectSummary s3ObjectSummary : objectListing.getObjectSummaries()) {
    System.out.println("Listing all the files: " + s3ObjectSummary.getKey());
}
Can anybody help me? I want to list only the files available in multiple folders, for example:
img/1/abc/ios-2x/a#2X.jpg --- yes
img/1/abc/ios-2x_$folder$ --- no
img/1/abc/ios-3x/b#3X.jpg --- yes
img/1/abc/ios-3x_$folder$ --- no
img/1/abc/xxhdpi/c#2x.jpg --- yes
img/1/abc/xxhdpi/c.jpg --- yes
img/1/abc/xxhdpi_$folder$ --- no
I am using Java to list the objects in S3. I can iterate over the folders to get the desired result, but I was looking for a way to do the same with a single loop whose result does not include the folder entries, as in the sketch below.
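A minimal sketch of that single-loop filtering, assuming the AWS SDK for Java v1 client shown above; the "_$folder$" suffix and trailing "/" checks are taken from the listing above and may need adjusting to however your folders were created:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// Single pass: list everything under the prefix and skip folder markers inline.
static void listFilesOnly(AmazonS3 s3, String bucketName) {
    ObjectListing listing = s3.listObjects(
            new ListObjectsRequest()
                    .withBucketName(bucketName)
                    .withPrefix("img/1/abc/"));
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        String key = summary.getKey();
        // "_$folder$" placeholder objects and zero-byte keys ending in "/" represent folders.
        if (key.endsWith("_$folder$") || key.endsWith("/")) {
            continue;
        }
        System.out.println("Listing all the files: " + key);
    }
}
If there can be more than 1000 keys under the prefix, you would additionally loop with s3.listNextBatchOfObjects(listing) until isTruncated() returns false.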

Related

How to iterate through nodes while there is a relationship

I have nodes that are structured like folders, subfolders and files. Any folder can have a relationship with a subfolder, which can have a relationship with another subfolder, which can have a relationship with files. I'd like to iterate through every folder to find every subfolder and file inside a given folder.
In one query, I'd like to be able to get every file that is inside a folder or in its subfolders. I can't find any way to do it with Cypher. I saw FOREACH and UNWIND but I don't think they help me.
Assuming you have labelled the nodes accordingly as Folder and File, the following query will fetch all the files belonging to the starting folder, directly or through a chain of one or more sub-folders:
MATCH(ParentFolder:Folder)-[*]->(childFile:File)
WHERE ParentFolder.name='Folder1'
RETURN childFile
If you haven't used labels (I highly recommend using them), you can look for all the paths starting with the specified folder and find all the last nodes in each path.
MATCH(ParentFolder)-[*]->(childFile)
WHERE ParentFolder.name='Folder1' AND NOT (childFile)-->()
RETURN childFile
The second query will fetch all the terminal nodes, even if they are folders. You would have to use labels or add filters in the WHERE clause to ensure only files are fetched for childFile.
Both versions of the query rely on variable-length paths. The wildcard (*) matches paths of any length starting from ParentFolder.
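For completeness, a minimal sketch of running the labelled version of the query from Java with the Neo4j Java driver (4.x); the connection details are placeholders and it assumes File nodes carry a name property:
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class ListFilesInFolder {
    public static void main(String[] args) {
        // Placeholder connection details; replace with your own.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            Result result = session.run(
                    "MATCH (ParentFolder:Folder)-[*]->(childFile:File) "
                            + "WHERE ParentFolder.name = $name "
                            + "RETURN childFile.name AS name",
                    Values.parameters("name", "Folder1"));
            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("name").asString());
            }
        }
    }
}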

Get File Structure from Get Metadata in ADF

I want to get the column names for a parquet file. I have a Get Metadata module in my pipeline and it is using a parquet dataset with only the root folder provided. Because only the folder is provided ADF is not letting me get the file structure that contains the column names. The file name is not provided because that can change. Can anyone provide some advice on how to approach this?
You will need 2 Get Metadata activities and a ForEach activity to get the file structure if your file name is not the same every time.
Source dataset:
Parameterize the file name as the name changes frequently.
Preview of source data:
Get Metadata1:
In the first Get Metadata activity, get the file name dynamically.
You can also match a specific file-name pattern by adding an expression in the filename field, or use an asterisk (*) if there is no specific pattern or if more than one file in the folder needs to be processed.
Set the Field list to Child items when you want to get the files from the folder.
Output of Get Metadata1: the file names found in the folder.
ForEach activity:
Using the ForEach activity, you can iterate over the item names listed in the Get Metadata activity's output array.
Get Metadata2:
Add a Get Metadata activity inside the ForEach activity to get the file structure (column list) of the current file in the folder. The loop runs once per item in the folder (1 or more).
Output of Get Metadata2:
In short: parameterize the file name in the dataset, get the list of files within the folder via one Get Metadata activity, and then get the list of columns for each of those files via a second Get Metadata activity.

Is there a way to list the contents of a directory using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I wanted to just get the filenames, and then pull a file if needed in a different cell. I can access the files just fine using Cyberduck and it shows the names there.
Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/individual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both those functions work, so I know my keys are correct, but it takes too long for them to be loaded.
Considering that the list of files can be retrieved by one single node, you can just list the files in the directory on the driver. Look at this response.
wholeTextFiles returns an RDD of (path, content) tuples, but I don't know whether the content is read lazily enough to fetch only the first part of each tuple.
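One way to do that driver-side listing is through the Hadoop FileSystem API that Spark itself uses underneath; a minimal Java sketch, keeping the question's "name:///" URI as a placeholder for whatever filesystem the notebook points at (from PySpark the same calls can be reached through the JVM gateway):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
    public static void main(String[] args) throws Exception {
        // Placeholder URI taken from the question; substitute the real filesystem URI.
        URI dir = URI.create("name:///mainfolder/date");
        FileSystem fs = FileSystem.get(dir, new Configuration());
        for (FileStatus status : fs.listStatus(new Path(dir))) {
            // Only names are printed; no file content is read, so large files are not loaded.
            System.out.println((status.isDirectory() ? "[dir]  " : "[file] ") + status.getPath());
        }
    }
}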

How do I delete folders recursively using DQL?

I work on an application using Documentum.
Let's say I have the following structure:
MyCabinetName
|-> Folder1
|    |-> Folder2
|-> Folder3
I am trying to delete all the folders inside a cabinet.
I am running the following DQL query :
delete dm_folder objects where folder ('MyCabinetName', DESCEND);
But when I run the query, I get a DQL ERROR :
[DM_FOLDER_E_CANT_DESTROY]error : "Cannot destroy folder with path name /MyCabinetName/Folder1 as it is not empty
I thought my query would recursively delete all folders inside MyCabinetName, but that does not seem to be the case, for if I run:
delete dm_folder objects where folder ('MyCabinetName/Folder1/Folder2', DESCEND);
and then
delete dm_folder objects where folder ('MyCabinetName/Folder1', DESCEND);
delete dm_folder objects where folder ('MyCabinetName/Folder3', DESCEND);
then
delete dm_folder objects where folder ('MyCabinetName', DESCEND);
will work.
Problem is that in real life, I don't know what my folder tree looks like. I just know the name of the cabinet whose content I want to delete.
Is there any way to delete a cabinet and its content recursively without having to delete each folder one by one?
It is not possible to delete a deep folder structure with DQL alone.
But you can do it with a delete operation, which means writing a small tool in Java, Groovy, etc.
Here is an example of how to do that:
IDfDeleteOperation operation = new DfClientX().getDeleteOperation();
operation.setVersionDeletionPolicy(IDfDeleteOperation.ALL_VERSIONS);
operation.setDeepFolders(true);
operation.add("/MyCabinetName");
if (!operation.execute()) {
    IDfList errors = operation.getErrors();
    // process errors
}
This line operation.setDeepFolders(true) instructs the operation to delete the folder with all sub-folders and other objects contained in the structure.

BigQuery Java API - Get sublisting within a cloud folder?

How do I get a list of objects within a sub-directory on cloud storage using the BigQuery Java API?
I can get a list of objects within a folder but not within a second level folder.
So if I have
/Folder1/Folder2/File1.csv
/Folder1/File2.csv
I can get Folder2 & File2.csv using the following command:
list = cloudstorage.objects().list("Folder1");
But how do I get the list of objects within Folder2?
I think you should use the setPrefix method to filter results to objects whose names begin with a given prefix. Assuming your bucket is Folder1, you should try:
setPrefix("Folder2/")
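Put together, a minimal sketch assuming the same com.google.api.services.storage.Storage client as in the question (the cloudstorage variable) and that Folder1 is the bucket name:
import java.io.IOException;
import com.google.api.services.storage.Storage;
import com.google.api.services.storage.model.Objects;
import com.google.api.services.storage.model.StorageObject;

// List only the objects whose names start with "Folder2/" in the bucket "Folder1".
static void listFolder2(Storage cloudstorage) throws IOException {
    Storage.Objects.List listRequest = cloudstorage.objects().list("Folder1");
    listRequest.setPrefix("Folder2/");
    Objects response = listRequest.execute();
    if (response.getItems() != null) {
        for (StorageObject object : response.getItems()) {
            System.out.println(object.getName());
        }
    }
}
For large listings you would also page through results using the response's getNextPageToken() together with setPageToken() on the list request.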