How to get all the folder keys recursively? - amazon-s3

I want to get all the folder keys recursively, without files.
Example:
FolderA
FolderA/FolderB
FolderA/FolderB/FolderC
FolderC
FolderD
Thanks.

Although S3 does not have a concept of "folder" (see here), it lets you perform hierarchical operations using "prefix" and "delimiter".
You can look at the Java API here.
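If you are working from Python rather than Java, here is a minimal sketch of the same prefix/delimiter idea using boto3 (the bucket name "my-bucket" is a placeholder, and credentials are assumed to come from your environment): listing with a "/" delimiter returns only the immediate sub-prefixes ("folders"), and recursing into each of them yields the whole folder tree without the file keys.

import boto3

s3 = boto3.client("s3")  # credentials are assumed to come from your environment

def list_folder_keys(bucket, prefix=""):
    # Collect the "folder" prefixes directly under `prefix`, then descend into each one.
    folders = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for common in page.get("CommonPrefixes", []):
            folders.append(common["Prefix"])              # e.g. "FolderA/FolderB/"
            folders.extend(list_folder_keys(bucket, common["Prefix"]))
    return folders

print(list_folder_keys("my-bucket"))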

Related

How to iterate through node while there is a relationship

I have nodes that are structured like folders, subfolders and files. Any folder can have a relationship with a subfolder, which can have a relationship with another subfolder, which can have a relationship with files. I'd like to iterate through every folder to find every subfolder and every file inside a given folder.
In one query, I'd like to be able to get every file that is inside a folder or in its subfolders. I can't find any way to do it with Cypher. I saw FOREACH and UNWIND but I don't think they help me.
Assuming you have labelled the nodes accordingly as Folder and File, the following query will fetch all the files belonging to the starting folder, directly or through a chain of one or more sub-folders:
MATCH (ParentFolder:Folder)-[*]->(childFile:File)
WHERE ParentFolder.name = 'Folder1'
RETURN childFile
If you haven't used Labels (highly recommend using them), you can look for all the paths starting with the specified folder and find all the last nodes in that path.
MATCH (ParentFolder)-[*]->(childFile)
WHERE ParentFolder.name = 'Folder1' AND NOT (childFile)-->()
RETURN childFile
The second query will fetch all the terminal nodes, even if they are folders. You would have to use labels or add filters in the WHERE clause to ensure that only files are matched for childFile.
Both versions of the query rely on variable-length paths. The wildcard (*) matches paths of any length starting from ParentFolder.
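As a usage sketch (not part of the original answer): assuming the Folder/File labels above and a local Neo4j instance, the first query could be run from Python with the official neo4j driver roughly like this; the URI, credentials and folder name are placeholders.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
query = (
    "MATCH (ParentFolder:Folder)-[*]->(childFile:File) "
    "WHERE ParentFolder.name = $name "
    "RETURN childFile"
)
with driver.session() as session:
    # Fetch every file reachable from the starting folder, at any depth.
    for record in session.run(query, name="Folder1"):
        print(record["childFile"])
driver.close()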

Is there a way to list the directories using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I wanted to just get the filenames, and then pull in a file if needed in a different cell. I can access the files just fine using Cyberduck and it shows the names there.
Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both those functions work, so I know my keys are correct, but it takes too long for them to be loaded.
Considering that the list of files can be retrieved by a single node, you can just list the files in the directory. Look at this response.
wholeTextFiles returns an RDD of (path, content) pairs, but I don't know whether the content is read lazily enough that you could take only the first part of each tuple.
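One way to get just the names without loading any data is to go through the Hadoop FileSystem API that PySpark exposes via its JVM gateway. This is only a sketch, assuming an existing SparkContext called sc in the notebook and the URI scheme from the question; sc._jvm and sc._jsc are internal handles, so treat this as a workaround rather than a supported API.

# Sketch: list file names via the Hadoop FileSystem API; no file content is read.
hadoop_fs = sc._jvm.org.apache.hadoop.fs
conf = sc._jsc.hadoopConfiguration()
path = hadoop_fs.Path("name:///mainfolder/date")
fs = path.getFileSystem(conf)
for status in fs.listStatus(path):
    print(status.getPath().toString())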

Documentum - getting a list of sub folders

Is there a way in Documentum to get all sub folders of a folder? Can someone suggest a DQL query or something where I can specify a parent folder and get back the folder paths of all its sub folders.
select distinct r_folder_path from dm_folder where folder('/Folder1/Folder2', descend)
This will return all the folders and subfolders under /Folder1/Folder2.
One thing to keep in mind:
Documentum supports linking objects to multiple parent folders. This means that one folder can have multiple parent folders.
If you have a folder structure like this:
Cabinet1
  /Test1
    /Test3
  /Test2
    /Test3
where Test3 is a subfolder of Test1 but also of Test2 (as it can be linked to both).
Documentum accomplishes this using repeating attributes. r_folder_path is a repeating attribute of dm_folder (actually of dm_sysobject, which is its super type).
So, running a DQL query:
select distinct r_folder_path from dm_folder where folder('/Folder1/Folder2', descend)
will return all folder paths your folder is part of (linked to):
/Cabinet1/Test1/Test3
/Cabinet1/Test2/Test3
Which might not be what you are looking for!
As DQL does not allow you to specify which repeating attribute value should be returned (you cannot specify the index of a repeating attribute), there is no elegant (and fail-safe) way to do this in DQL.
What you can do is fetch the object_name of each subfolder and prefix it with the folder path of the parent folder you used in the search (but that requires some coding).
Check the Documentum Content Server System Object Reference guide (it is available on the EMC developer community or, for now, also here).

Search for a specific file in S3 using boto

I am using boto to parse S3 buckets. Basically I want to find a certain file in the bucket (say *.header, or any other pattern the user has provided). Since I could not find any function for that in boto, I was trying to write a BFS routine to search through the contents of each folder, but I couldn't find any method to get the contents of a folder by key/key.name (which I am getting from bucketObj.list()). Is there any other method for doing this?
For instance, let's say I have multiple folders in the bucket,
like
mybucket/A/B/C/x.txt
mybucket/A/B/D/y.jpg
mybucket/A/E/F/z.txt
and I want to find where the *.txt files are,
so the boto script should return the following result:
mybucket/A/B/C/x.txt
mybucket/A/E/F/z.txt
There is no way to do wildcard searches or file globbing server-side with S3. The only filtering available via the API is by prefix. If you specify a prefix string, only results whose keys begin with that prefix will be returned.
Otherwise, all filtering has to happen on the client side. Alternatively, you could store your keys in a database, do the searching there, and only retrieve the matches from S3.
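For example, with the original boto library you can narrow the listing with a prefix and apply the wildcard client-side. This is a sketch using the bucket, prefix and pattern from the question, with credentials assumed to come from your boto configuration.

import fnmatch
import boto

conn = boto.connect_s3()                 # uses credentials from ~/.boto or env vars
bucket = conn.get_bucket('mybucket')
# The prefix limits what S3 returns; the *.txt match happens client-side.
for key in bucket.list(prefix='A/'):
    if fnmatch.fnmatch(key.name, '*.txt'):
        print(key.name)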

proper way to organize files in jcr repository

What is a proper way of organizing files in a WCM that is using JCR? Let's say the total file count is 100,000+ files and the total size is about 50-70 GB.
Is it better to organize files by file type (and create subdirectories to further group the files by some category)?
What are the advantages? Does it make any difference for the query API, maintenance, or anything else?
Proposal 1:
--shared
------images
------pdf
------movies
--location1
------images
------pdf
------movies
--location2
------images
------pdf
------movies
Proposal 2:
--pdf
-------shared
-------location1
-------location2
--images
--------shared
--------location1
--------location2
.. etc
Take a look at this: David's Model: A guide for content modeling
Some highlights:
Data First, Structure Later. Maybe.
Drive the content hierarchy, don't let it happen.
Workspaces are for clone(), merge() and update().
Beware of Same Name Siblings.
References considered harmful.
Files are Files are Files.
IDs are evil.
Whatever you do, make sure you don't end up with more than 1000 child nodes under any given node.
Just as in any (real) file system, when you want to list a folder with a lot of files/subfolders in it, it can take some time.
By default, Jackrabbit 2.x now hashes the user space, i.e.:
/users/s/sa/sandra
/users/s/si/simong
...
I would personally go for your first proposal as it makes more sense.
We have a webapp where all our users can upload/delete/modify their files in JCR and did it this way:
/_users/s/si/simon/public
/_users/s/si/simon/public/My Pictures
/_users/s/si/simon/public/My Pictures/2010/06/Trip to the US
/_users/s/si/simon/public/My Pictures/2010/06/Trip to the US/DC1001.jpg
/_users/s/si/simon/private/account_details.txt
...
We're loosely following the way home folders are done in UNIX-like systems.
We try to hash all the things we (reasonably) can, like for example the user space (/s/si/simong), but also things like messages:
/_users/s/si/simong/messages/2009/12/25/ab34ed87dee
/_users/s/si/simong/messages/2010/03/12/e4f1de3cd48
...
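The "hashing" here is just deriving short prefix segments from the name so that no single folder accumulates too many children. A tiny illustrative sketch (the function name and segment lengths are invented for the example):

def hashed_path(root, name, segment_lengths=(1, 2)):
    # e.g. hashed_path('/_users', 'simong') -> '/_users/s/si/simong'
    segments = [name[:n] for n in segment_lengths]
    return '/'.join([root] + segments + [name])

print(hashed_path('/_users', 'simong'))
print(hashed_path('/_users', 'sandra'))   # -> /_users/s/sa/sandra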
However, it's up to the individual user not to have more than 1000 child files in a given folder (we do warn them, though).
Doing it this way also gives you the nice benefit of being able to apply access control.
i.e.: everything under ~/private is only readable and writable by the current user, while ~/public is readable by everybody.