Anyone have a large folder tree sample? - testing

I'm doing some testing that requires a large folder tree: thousands of folders, hundreds of thousands of files, at least a gigabyte but not over 5 GB (that's a little big; around 2 GB is fine).
Does anyone have one that they use for testing? I can provide storage and a transfer mechanism to share it if needed.

Well, if you don't want to generate random data, you can, for example, download the DMoz database: it's an enormous XML file containing a tree. Parse it and generate a directory tree that follows the structure of the DMoz directory, and you will have a meaningful and huge directory of files.
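If you do go the random-data route instead, here is a minimal sketch in Python; the folder and file counts, depth behavior, and file size below are assumptions you would tune to land near the ~2 GB target.

```python
import os
import random

# Assumed parameters: tune these to reach roughly 2 GB total.
ROOT = "test_tree"        # hypothetical root folder
NUM_DIRS = 2000           # thousands of folders
FILES_PER_DIR = 100       # ~200,000 files overall
FILE_SIZE = 10 * 1024     # ~10 KB per file -> roughly 2 GB in total

random.seed(42)
dirs = [ROOT]
for i in range(NUM_DIRS):
    # Attach each new folder under a randomly chosen existing one,
    # so the tree has varying depth rather than being flat.
    parent = random.choice(dirs)
    path = os.path.join(parent, f"dir_{i:05d}")
    os.makedirs(path, exist_ok=True)
    dirs.append(path)

for d in dirs[1:]:
    for j in range(FILES_PER_DIR):
        with open(os.path.join(d, f"file_{j:03d}.bin"), "wb") as f:
            f.write(os.urandom(FILE_SIZE))
```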

Related

Digital Asset Management tool for large files that are not photos or videos

Most DAMs that I have found are geared towards media like photos and videos. I need to manage large binary files like ISO and IMG files.
Does anybody know of a DAM that can manage non-media files? Specifically, something that is on premises? Going to a DAM in the cloud would be too expensive because of the amount of storage we would need and the bandwidth it would consume.
DAMs have specific functionality tailored towards visual content. For example, DAM systems will create previews for the stored files and may also extract metadata from the file itself. In addition, they provide options to transform and download content in various formats. Since all of these options are part of the DAM package, I would not expect too much from them with respect to previews, metadata extraction, and transformations when it comes to large binary files such as ISO and IMG files.
You can, however, use most DAMs to upload any file you want. The system will simply take it and allow you to tag metadata against it. An example is Elvis DAM, where you can simply upload content (I would use hot-folder-style uploads for large files) and tag it with metadata. You can create custom fields such as OS version, applications, etc., and store them against the ISO files. These become searchable, and the system will scale to hold all of this information and allow you to quickly find your content.
There might be other simpler and less expensive solutions out there that might just simply keep a file and assign metadata to it.
Try NeoFinder
Its original incarnation was a catalog program for CDs, but it supports extensive metadata for tagging, as well as pulling metadata from images.
https://www.cdfinder.de
We solved our need by using Git Large File Storage (LFS) to manage our large binary files. We tried out git-annex as well, which worked well, but in the end we went with Git LFS.

FileSystemWatcher.Created: how does it work?

I am working on a project that will copy files to a database every time something is added to a specific directory. The program works fine when I'm testing with a small set of data, but I was wondering if someone could explain how the FileSystemWatcher.Created event works.
My main concern is when I use this on a larger scale the program may slow down when it handles 100,000+ files.
If this is an issue, could anyone explain whether there is some sort of workaround to watching the original folder, let's call it "C:\folder", and maybe polling a temp folder instead?
I have not tested the watcher with 100,000 files. However, in most cases you should not have so many files in a folder awaiting processing. I recommend a structure like
C:\folder
C:\folder\processing
C:\folder\archive
C:\folder\error
As soon as you begin working on a given file, move it into processing. If you successfully process it, move the file again to archive. If there is an error while processing a file, instead move it into error.
This will make it easier for you to keep the files organized and diagnose problems that occur in production.
With that file structure, you will not run into issues with large numbers of files in the folder you are watching, unless you receive files in incredibly large bursts compared to the speed with which they can be moved into the processing state.
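A rough sketch of that flow, shown in Python rather than C# just to keep it self-contained; the folder names match the layout above, and process_file is a hypothetical stand-in for your "copy into the database" step.

```python
import os
import shutil

ROOT = r"C:\folder"
PROCESSING = os.path.join(ROOT, "processing")
ARCHIVE = os.path.join(ROOT, "archive")
ERROR = os.path.join(ROOT, "error")

for d in (PROCESSING, ARCHIVE, ERROR):
    os.makedirs(d, exist_ok=True)

def process_file(path):
    # Hypothetical stand-in for copying the file into the database.
    pass

def handle_new_file(path):
    """Move a newly detected file through processing -> archive/error."""
    name = os.path.basename(path)
    working = os.path.join(PROCESSING, name)
    shutil.move(path, working)          # claim the file immediately
    try:
        process_file(working)
        shutil.move(working, os.path.join(ARCHIVE, name))
    except Exception:
        shutil.move(working, os.path.join(ERROR, name))
```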

How to put files inside files

MS Word's .docx files contain a bunch of .xml files.
Setup.exe files spit out hundreds of files that a program uses.
Zips, rars etc also hold lots of compressed stuff.
So how are they made? What does MS Word or another program that produces these files have to do to put files inside files?
When I looked this up I just got a bunch of results about compression, but let's say I wanted to make a program that 'wraps' files inside a file without making the final result any smaller. What would I even have to write?
I'm not asking for or expecting any source code that does this; I just need a pointer. Is there something you think I'm misunderstanding, based on what I've asked here?
Even a simple link to an article or some documentation would be greatly appreciated.
Ok, I'll just come up with some headers for ordinary files and write them along with the bytes of the actual files into one custom-defined file. You guys were very helpful, thank you!
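A minimal sketch of that "headers plus raw bytes" idea in Python; the header layout used here (name length, then name, then size) is just one arbitrary choice for illustration, not any standard format.

```python
import os
import struct

def pack(paths, archive_path):
    """Concatenate files into one container: [name length][name][size][bytes]..."""
    with open(archive_path, "wb") as out:
        for p in paths:
            name = os.path.basename(p).encode("utf-8")
            with open(p, "rb") as src:
                data = src.read()
            out.write(struct.pack("<I", len(name)))   # 4-byte name length
            out.write(name)
            out.write(struct.pack("<Q", len(data)))   # 8-byte file size
            out.write(data)

def unpack(archive_path, dest_dir):
    """Read the container back and write each member file out again."""
    with open(archive_path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            name_len = struct.unpack("<I", header)[0]
            name = f.read(name_len).decode("utf-8")
            size = struct.unpack("<Q", f.read(8))[0]
            with open(os.path.join(dest_dir, name), "wb") as out:
                out.write(f.read(size))
```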
Historically, Windows had a number of technologies to support solutions like this. These were often called Compound Files or Structured Storage. However, I don't think the newer Office documents use these technologies; I think the Office file formats are essentially ZIP files with a different extension. If you change a file's extension from .docx to .zip and open it with your favorite compression tool, you'll see a bunch of folders and XML files.
Here are some links to descriptions of different file formats that create "files within files":
Zip file format
Compound File Binary Format (CFBF)
Structured Storage
Compound Document File Format
Office Open XML I: Exploring the Office Open XML Formats
At least on POSIX systems (e.g. Linux), a file is only a stream (i.e. a sequence) of bytes, and you can only grow (or shrink, i.e. truncate) it at the end; there is no way to insert bytes in the middle without copying the rest.
You need some conventions, and some additional software, to handle it otherwise.
You might be interested in SQLite, which gives you a library to handle a single file (e.g. *.sqlite) as an SQL database.
You could also use GDBM, a library giving you an indexed-file abstraction.
libtar is a library to manipulate tar archives. See also tardy, a tar file postprocessor.
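If you'd rather lean on an existing format than invent your own headers, the same idea can be done with tar; here is a short illustration using Python's standard tarfile module, uncompressed so the result isn't any smaller. The file and folder names are placeholders.

```python
import tarfile

# Wrap files into one archive without compression ("w" mode, not "w:gz").
with tarfile.open("bundle.tar", "w") as tar:
    tar.add("report.docx")
    tar.add("photos")          # directories are added recursively

# List and extract the members again.
with tarfile.open("bundle.tar", "r") as tar:
    print(tar.getnames())
    tar.extractall("unpacked")
```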

How to compare and find the differences between two XML files in Cocoa?

This is a bit of a two-part question, about working with 40 MB XML files.
• What’s a reasonable size to store in memory for a program running continually in the background?
• How do I find what has changed in an XML file?
So on the first read the XML is loaded into NSData, then uploaded to the server.
Now, instead of uploading a 40 MB XML file every time it changes, I would prefer to upload a "delta" file containing only what has changed. The program would monitor the file for changes and activate when it has been modified. From what I can see, I would need to parse an old version of the XML file and the modified XML file, then compare them? Is it unreasonable to hold 80 MB in memory like this every time the file is modified? I'm assuming this has to be done with a DOM parser, because I can't see how you could compare two files like that with a SAX parser, since it only has part of the file in memory at a time.
I'm a newbie at this so any help would be appreciated!
To compare two files:
There are many ways to do this (given the file sizes involved, I may not be correct):
sdiff file1.xml file2.xml (a Unix command)
You can use this command from AppleScript.
-[NSFileManager contentsEqualAtPath:andPath:]
This method checks whether the two files at the given paths are the same file, then compares their sizes, and finally compares their contents.
For the other part:
As for what size is reasonable for a background process, I don't think there is a fixed limit; it depends on the application. You can save the data into temporary files instead. Even Safari uses 130+ MB, as you can easily check in Activity Monitor.
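If the shell tools don't fit, here is a rough sketch of the compare step, shown in Python rather than Objective-C purely to illustrate the idea, using the standard xml.etree module; old.xml and new.xml are placeholder names, and this matches children by position only, so it is nowhere near a full XML diff.

```python
import xml.etree.ElementTree as ET

def diff(old, new, path=""):
    """Recursively compare two elements and yield a message per difference."""
    here = f"{path}/{old.tag}"
    if old.tag != new.tag:
        yield f"{here}: tag changed to {new.tag}"
        return
    if (old.text or "").strip() != (new.text or "").strip():
        yield f"{here}: text changed"
    if old.attrib != new.attrib:
        yield f"{here}: attributes changed"
    old_children, new_children = list(old), list(new)
    for o, n in zip(old_children, new_children):
        yield from diff(o, n, here)
    if len(old_children) != len(new_children):
        yield f"{here}: child count changed ({len(old_children)} -> {len(new_children)})"

old_root = ET.parse("old.xml").getroot()
new_root = ET.parse("new.xml").getroot()
for change in diff(old_root, new_root):
    print(change)
```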
NSXMLParser ended up being the most useful for this

Xcode group hierarchy isn't mirrored in the file structure

Using the Cocos2d project template (for Objective-C).
I've got a nice little hierarchy set up in Xcode, e.g. categorization with folders on multiple levels. However, in the file structure it's just one big directory of files. Should I leave it like that? Should I replicate the hierarchy in the file structure manually? Should it be automated? Did I do something wrong?
As Protheus says, the group structure in Xcode is for organizational purposes only and is not synced with the file system. For some projects, though, it might be worth keeping Xcode's group structure and the file system in sync. For example, I sometimes like to edit my files with Vim from the command line. On big projects with several hundred files, it is much easier to find specific files if you have a logical structure in the file system.
I discuss this in more detail in this blog article.
As far as I know, neither the structure in Xcode nor the structure in the project folder affects the program.
You structure it all for your own convenience.