NSURL Bookmarks: A Faster Alternative?

Context
My app's model is a tree of objects where each object represents a filesystem item (a folder or file) on disk beneath a given starting folder.
Periodically, I recursively walk this tree from the top down in order to "sync" it to the actual state of the filesystem. That is, I visit each object in the model and verify that the file/folder it represents still exists in the same location on disk.
If the file/folder has moved, I use an NSURL bookmark to ascertain the new location of the file/folder so that I can update my model's state. (I create an NSURL bookmark when I first create the model object and then store the bookmark data as a property of the object so that I can resolve it later.)
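For context, the create/resolve cycle I'm doing for each object looks roughly like this (a minimal sketch, where url is the item's file URL):

// Creating bookmark data when the model object is first created.
NSError *error = nil;
NSData *bookmark = [url bookmarkDataWithOptions:0
                 includingResourceValuesForKeys:nil
                                  relativeToURL:nil
                                          error:&error];

// Later, resolving it to find the item's current location.
BOOL isStale = NO;
NSURL *resolved = [NSURL URLByResolvingBookmarkData:bookmark
                                            options:NSURLBookmarkResolutionWithoutUI
                                      relativeToURL:nil
                                bookmarkDataIsStale:&isStale
                                              error:&error];
if (isStale && resolved) {
    // Per Apple's docs, re-create the bookmark from the resolved URL.
    bookmark = [resolved bookmarkDataWithOptions:0
                  includingResourceValuesForKeys:nil
                                   relativeToURL:nil
                                           error:&error];
}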
The Problem
NSURL bookmarks simply aren't performant enough. It's not uncommon for my model graph to have 20,000 nested objects. Each one has a bookmark. Here's what I'm seeing when I profile performance:
The recursivelyValidateExistingChildItemsOfParentItem:... method is what walks my model tree. 90% of the time involved is just resolving bookmarks (and, if they are stale, re-creating them as described in Apple's documentation).
The app takes almost 2 minutes to complete the walk thanks to this. So, I need a faster alternative to NSURL bookmarks.
What I've Considered
Extended File Attributes. I could add a UUID attribute to each file on disk. Instead of walking my model graph, I could walk the actual filesystem underneath the starting folder. When I find a new file, I could see if it has a UUID extended attribute. If so, I could then search my model graph for the object with that UUID to handle moved/relocated files. The trouble here is that many things clobber extended file attributes—they aren't guaranteed to stick around.
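A rough sketch of that idea using the BSD xattr calls (the attribute name is made up for illustration):

#include <sys/xattr.h>

// Tag the file with a UUID when the model object is created.
static const char *kUUIDAttr = "com.example.myapp.uuid";
NSString *uuid = [[NSUUID UUID] UUIDString];
setxattr(path.fileSystemRepresentation, kUUIDAttr,
         uuid.UTF8String, strlen(uuid.UTF8String), 0, 0);

// Later, while walking the filesystem, read the tag back (if it survived).
char buf[64] = {0};
ssize_t len = getxattr(path.fileSystemRepresentation, kUUIDAttr,
                       buf, sizeof(buf) - 1, 0, 0);
if (len > 0) {
    NSString *foundUUID = @(buf);
    // ...find the model object with this UUID to handle a moved file.
}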
BDAlias or NDAlias. I used to use BDAlias before I migrated to NSURL bookmarks, but that wasn't noticeably more performant.
Bottom Line
I need a faster alternative to NSURL's bookmarks. But I still need to be able to track files across launches of my app, so simply keeping file descriptors open or using file IDs won't work.
I don't care how low-level I have to get; I just need performance. Thanks!

I know the question is old. But this is my answer:
I only use bookmark resolution as a fallback. I save both the file path and the URL bookmark data in my model. When I want to open a file, I first check whether the file still exists at the previously known location. Only if it doesn't do I try to resolve the URL's bookmark data. This narrows calls to URL.init(resolvingBookmarkData:) down to a small subset of items. After resolving a bookmark, I update the model with the new path to keep performance reasonable.
If you need to be sure you are working with exactly the same file, you can check the file's date, size, or a specific extended attribute as an extra safeguard.
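In Objective-C terms, the whole lookup might look like this (a minimal sketch; item, lastKnownPath and bookmarkData are hypothetical model properties):

// Fast path: is the item still where we last saw it?
NSFileManager *fm = [NSFileManager defaultManager];
if ([fm fileExistsAtPath:item.lastKnownPath]) {
    return [NSURL fileURLWithPath:item.lastKnownPath];
}

// Slow path: only now pay for resolving the stored bookmark data.
BOOL isStale = NO;
NSError *error = nil;
NSURL *resolved = [NSURL URLByResolvingBookmarkData:item.bookmarkData
                                            options:NSURLBookmarkResolutionWithoutUI
                                      relativeToURL:nil
                                bookmarkDataIsStale:&isStale
                                              error:&error];
if (resolved != nil) {
    item.lastKnownPath = resolved.path;   // remember the new location
}
return resolved;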

Related

Attaching a specific piece of non-intrusive info to a file or folder to keep a connection to a program

This is going to be a question with a lot of hypotheticals, but it's been on my mind for a while now and I finally want to get some perspectives on how to tackle this "issue". For the sake of the question, I'll make up an example requirement of how the program I want to make would work on a conceptual level without too many specifics.
The Problem
I want to create a program to keep track of miscellaneous info for files and folders. This miscellaneous info can be anything from comments and authors to more specific info like the original source of the file (a URL, for example), categories, tags, and more. All of this info is kept in an SQLite database.
Now... how would you create a connection between the file (or folder) and the database? Whatever file is added to the program should continue to operate independently of the program, meaning you should be able to edit, copy, move, rename or do anything else with the file you would usually do with your OS of choice, even deleting it.
You should even be able to archive it, zip it, upload it somewhere or do other things that temporarily or permanently remove the file from your system, without losing the connection to the database. The program itself doesn't ever actually touch the files, except to generate a new entry in the database, but obviously there should be some kind of reference from the file to a database entry in the program.
Yes, I know that if you delete the file, you would have a dead entry in the database. For now, just treat this as an unfortunate reality that can't be solved unless you incorporate the file more closely into the program.
Possible solutions and why I decided against them
Reference inside Filename
Probably the most obvious choice, you could just have a reference inside the filename to point to a database entry, for example by including the id at the start of the filename:
#1 my-example-file.txt
#12814 this-is-one-of-many-files.txt
Obviously, that goes against what I established earlier, as you would be restricted from freely renaming the file. You would always have to keep in mind to not mess with the id inside the filename, or else the connection to your program is broken. Unfortunately, that is the best bet I currently have, but I would like to avoid using that approach if possible.
Alternate Data Streams (ADS)
A pretty cool feature I recently discovered that's available on NTFS file systems, ADS allows you to store different streams of data for your files, to grossly simplify it. You could attach a data stream to your file that saves the id for the database entry in the program, and a regular user would never be able to mess directly with that.
However, since this is a feature reserved for specific file systems, there are some ugly side effects to ADS, as you can easily lose that part of the file by:
moving/copying it to a file system that doesn't support ADS, such as the file systems most often used in removable drives
uploading it to a cloud then later downloading it
moving it to another OS that might not support ADS or might treat it in an unexpected way
zipping it
Thus I can't really rely on ADS either.

Why should applications read a PDF file backwards?

I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about the file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can - if my file is linearized. But that is optional and means some extra overhead both when writing and reading such a file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects for page 1, page 2, page 3...). But the people at Adobe probably had their reasons for putting it after the body. I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, though not just because of hardware limitations "back in the day", but because of scale. It's easy to think an invoice with a couple of pages of text could be handled differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end, you can write images/text/fonts to the file as they are processed and then discard them from memory, simply storing the file offset of each object so the cross-reference table can be written at the end.
If the trailer had to come first, you would have to read (or even generate, in the case of an embedded font) all of these objects just to get their sizes so you could write out the cross-reference table, and only then write all the objects to the file. So you would either be reading, sizing, discarding, and then reading again, or trying to hold everything in RAM until you could write it all out.
Write speed and RAM are still issues we contend with today, when we're running in a Docker container on a VM on shared hardware.
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF, one had to write data to the file early, much earlier than one had any idea how big the file or, as a consequence, the cross-reference table would become. Writing the cross references at the end was therefore a natural consequence.
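To make "read the file from its end" concrete, here is a minimal sketch (in Objective-C, error handling omitted) of how a reader might locate the cross-reference table:

// The tail of a PDF contains "startxref", the byte offset of the
// cross-reference table, and "%%EOF". Read the last 1 KB and parse it.
NSFileHandle *fh = [NSFileHandle fileHandleForReadingAtPath:pdfPath];
unsigned long long fileSize = [fh seekToEndOfFile];
unsigned long long tailLength = MIN(fileSize, (unsigned long long)1024);
[fh seekToFileOffset:fileSize - tailLength];
NSString *tail = [[NSString alloc] initWithData:[fh readDataOfLength:(NSUInteger)tailLength]
                                       encoding:NSISOLatin1StringEncoding];

// The number on the line after "startxref" is the xref table's offset.
NSRange r = [tail rangeOfString:@"startxref" options:NSBackwardsSearch];
long long xrefOffset = [[tail substringFromIndex:NSMaxRange(r)] longLongValue];
[fh seekToFileOffset:(unsigned long long)xrefOffset];   // jump straight to the xref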

Objective-C - Finding directory size without iterating contents

I need to find the size of a directory (and its sub-directories). I can do this by iterating through the directory tree and summing up the file sizes etc. There are many examples on the internet but it's a somewhat tedious and slow process, particularly when looking at exceptionally large directory structures.
I notice that Apple's Finder application can instantly display a directory size for any given directory. This implies that the operating system is maintaining this information in real time. However, I've been unable to determine how to access this information. Does anyone know where this information is stored and if it can be retrieved by an Objective-C application?
IIRC Finder iterates too. In the old days it used FSGetCatalogInfo (an old File Manager call) to do this quickly. I think there's a newer POSIX-level call these days that's the fastest, lowest-level API for this, especially if you're not interested in all the other info besides the size and really need blazing speed over easily maintainable code.
That said, if it is cached somewhere in a publicly accessible place, it is probably Spotlight. Have you checked whether the Spotlight info for a folder includes its size?
PS - One important thing to remember when determining the size of a file: Mac files can have two "forks": the data fork and the resource fork (where, e.g., Finder keeps the info when you set a particular file to open with an application other than the default for its file type, as well as custom icons assigned to files). So make sure you add up both forks' sizes, or your measurements will be off.
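For reference, the straightforward iterating approach looks like this with NSFileManager (a minimal sketch; NSURLTotalFileAllocatedSizeKey reports the size allocated on disk):

// Recursively sum the allocated on-disk size of everything under a folder.
NSURL *root = [NSURL fileURLWithPath:folderPath];
NSArray *keys = @[NSURLIsRegularFileKey, NSURLTotalFileAllocatedSizeKey];
NSDirectoryEnumerator *enumerator =
    [[NSFileManager defaultManager] enumeratorAtURL:root
                          includingPropertiesForKeys:keys
                                             options:0
                                        errorHandler:nil];
unsigned long long total = 0;
for (NSURL *url in enumerator) {
    NSNumber *isFile = nil, *size = nil;
    [url getResourceValue:&isFile forKey:NSURLIsRegularFileKey error:NULL];
    if (isFile.boolValue) {
        [url getResourceValue:&size forKey:NSURLTotalFileAllocatedSizeKey error:NULL];
        total += size.unsignedLongLongValue;
    }
}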

Is it possible to store files in Apache Lucene?

I'm new to Apache Lucene.
Is it possible to store files (e.g. PDF, DOC) in Apache Lucene and retrieve them later? Or do I have to store those files somewhere else and use Lucene just for indexing?
Technically you can, of course, store the contents of a file (e.g. in a StoredField or elsewhere), but I don't see any reason why you should. It brings no added value, just pain while serializing and deserializing the file contents, and you will still have to keep the file name indexed somewhere else. Apart from the serialization/deserialization pain, your app will likely have to block longer while Lucene merges index segments.
The best approach IMO is to store the path to the file relative to some file repository root: e.g. if your file is in /home/users/bob/files/123/file.txt, you might want to store the files/123/file.txt part without tokenization (using a StringField).

Hiding (or encrypting) app resources?

I'm developing a Cocoa app that has certain resources (images) which I wish to protect, but still display. Normally one would just place these in the Resources folder, but storing them there makes them quite easy to grab and use. Is there any way to keep these images hidden, but still access them within the app?
Simple solution:
Merge all files into one big data-file, optionally using 'salts'.
Then retrieve specific files with something like this:
// Load the combined data-file, then slice out one embedded resource.
NSData *dataFile = [NSData dataWithContentsOfFile:filePath];
// Note: NSMakeRange takes a start offset and a length, not an end offset.
NSData *theFile = [dataFile subdataWithRange:NSMakeRange(startPos, length)];
This does not really protect the files, but it prevents people from simply dragging the resources out of the app bundle. At the very least, the combined data-file is unusable on its own, certainly with salts.
Another solution:
Create an NSData object for every resource.
Add all the objects to an NSMutableArray.
Convert the array to one big NSData object.
Write that NSData object to a file and add it to the resources folder.
Your app can then read the data-file back and retrieve the array with the resources.
// Convert array to data
NSData* data=[NSKeyedArchiver archivedDataWithRootObject:theArray];
Use NSKeyedUnarchiver to retrieve the array again.
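For completeness, the read side might look like this (the file name "resources.dat" is just an example):

// Read the data-file back from the bundle and recover the resource array.
NSString *path = [[NSBundle mainBundle] pathForResource:@"resources" ofType:@"dat"];
NSData *data = [NSData dataWithContentsOfFile:path];
NSArray *theArray = [NSKeyedUnarchiver unarchiveObjectWithData:data];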
In order to protect the images in one big file, you can just dump the image data into an NSData object sequentially.
If you want, you can either use salts, as previously mentioned, or use AES encryption, as shown here.
Then, you will have to either save the image files structurally (using an NSArray or similar) or record the image offsets so you can retrieve the image data blocks correctly.
This has some drawbacks, especially if your images change over time; you will have to monitor those changes and restructure the file accordingly.
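A sketch of that offset-table variant (imageNames, sourceDir and the lookup key are made up for illustration):

// Pack all images into one blob, recording each one's offset and length.
NSMutableData *blob = [NSMutableData data];
NSMutableDictionary *table = [NSMutableDictionary dictionary];
for (NSString *name in imageNames) {
    NSData *img = [NSData dataWithContentsOfFile:
                      [sourceDir stringByAppendingPathComponent:name]];
    table[name] = @[@(blob.length), @(img.length)];   // offset, length
    [blob appendData:img];
}

// At runtime, slice one image back out of the blob.
NSArray *entry = table[@"logo.png"];
NSRange range = NSMakeRange([entry[0] unsignedIntegerValue],
                            [entry[1] unsignedIntegerValue]);
NSData *imageData = [blob subdataWithRange:range];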
Another option is to simply mask the image files by changing the name/extension to something of your choice. This will at least keep some users from touching them.
Finally, you can search for an archiving framework with zip-like functions and keep the images there (as Blizzard does with their MPQ format). This would be the best option (since it provides encryption methods and abstracts away the mechanics of encryption and archiving), but it may not be easy to find such a framework.
Why do you want to protect the images? It goes without saying that anything you display can be recorded with a screenshot, so if you're trying to protect the images from the person viewing them, there isn't much point.
If you still want to protect them (say, some images should only be available to certain people), encrypting them on disk might be an option. I'm not an Objective-C guy, but this seems like a good place to look.