I have a working implementation using Grand Central Dispatch queues that (1) opens a file and computes an OpenSSL DSA hash on "queue1", and (2) writes the hash out to a new "sidecar" file for later verification on "queue2".
I would like to open multiple files at the same time, but limited by some logic so that I don't "choke" the OS by having hundreds of files open and exceeding the hard drive's sustainable throughput. Photo-browsing applications such as iPhoto or Aperture seem to open and display multiple files, so I'm assuming this can be done.
I'm assuming the biggest limitation will be disk I/O, as the application can (in theory) read and write multiple files simultaneously.
Any suggestions?
TIA
You are correct in that you'll be I/O bound, most assuredly. And it will be compounded by the random-access nature of having multiple files open and actively being read at the same time.
Thus, you need to strike a bit of a balance. More likely than not, processing one file at a time is not the most efficient approach either, as you've observed.
Personally?
I'd use a dispatch semaphore.
Something like:
@property(nonatomic, assign) dispatch_queue_t dataQueue;
@property(nonatomic, assign) dispatch_semaphore_t execSemaphore;
And:
- (void) process:(NSData *)d {
    dispatch_async(self.dataQueue, ^{
        if (!dispatch_semaphore_wait(self.execSemaphore, DISPATCH_TIME_FOREVER)) {
            dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
                ... do calculation work here on d ...
                dispatch_async(dispatch_get_main_queue(), ^{
                    .... update main thread w/new data here ....
                });
                dispatch_semaphore_signal(self.execSemaphore);
            });
        }
    });
}
Where it is kicked off with:
self.dataQueue = dispatch_queue_create("com.yourcompany.dataqueue", NULL);
self.execSemaphore = dispatch_semaphore_create(3);
[self process: ...];
[self process: ...];
[self process: ...];
[self process: ...];
[self process: ...];
.... etc ....
You'll need to determine how best you want to handle the queueing. If there are many items and there is a notion of cancellation, enqueueing everything is likely wasteful. Similarly, you'll probably want to enqueue URLs to the files to process, and not NSData objects like the above.
In any case, the above will process three things simultaneously, regardless of how many have been enqueued.
I'd use NSOperation for this because of the ease of handling both dependencies and cancellation.
I'd create one operation each for reading the data file, computing the data file's hash, and writing the sidecar file. I'd make each write operation dependent on its associated compute operation, and each compute operation dependent on its associated read operation.
Then I'd add the read and write operations to one NSOperationQueue, the "I/O queue," with a restricted width. The compute operations I'd add to a separate NSOperationQueue, the "compute queue," with a non-restricted width.
The reason for the restricted width on the I/O queue is that your work will likely be I/O bound; you may want it to have a width greater than 1, but it's very likely to be directly related to the number of physical disks on which your input files reside. (Probably something like 2x; you'll want to determine this experimentally.)
The code would wind up looking something like this:
@implementation FileProcessor

static NSOperationQueue *FileProcessorIOQueue = nil;
static NSOperationQueue *FileProcessorComputeQueue = nil;

+ (void)initialize
{
    if (self == [FileProcessor class]) {
        FileProcessorIOQueue = [[NSOperationQueue alloc] init];
        [FileProcessorIOQueue setName:@"FileProcessorIOQueue"];
        [FileProcessorIOQueue setMaxConcurrentOperationCount:2]; // limit width

        FileProcessorComputeQueue = [[NSOperationQueue alloc] init];
        [FileProcessorComputeQueue setName:@"FileProcessorComputeQueue"];
    }
}
- (void)processFilesAtURLs:(NSArray *)URLs
{
    for (NSURL *URL in URLs) {
        __block NSData *fileData = nil;     // set by readOperation
        __block NSData *fileHashData = nil; // set by computeOperation

        // Create operations to do the work for this URL
        NSBlockOperation *readOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                fileData = CreateDataFromFileAtURL(URL);
            }];

        NSBlockOperation *computeOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                fileHashData = CreateHashFromData(fileData);
                [fileData release]; // created in readOperation
            }];

        NSBlockOperation *writeOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                WriteHashSidecarForFileAtURL(fileHashData, URL);
                [fileHashData release]; // created in computeOperation
            }];

        // Set up dependencies between operations
        [computeOperation addDependency:readOperation];
        [writeOperation addDependency:computeOperation];

        // Add operations to appropriate queues
        [FileProcessorIOQueue addOperation:readOperation];
        [FileProcessorComputeQueue addOperation:computeOperation];
        [FileProcessorIOQueue addOperation:writeOperation];
    }
}

@end
It's pretty straightforward; rather than deal with multiply-nested layers of sync/async as you would with the dispatch_* APIs, NSOperation allows you to define your units of work and your dependencies between them independently. For some situations this can be easier to understand and debug.
You have received excellent answers already, but I wanted to add a couple points. I have worked on projects that enumerate all the files in a file system and calculate MD5 and SHA1 hashes of each file (in addition to other processing). If you are doing something similar, where you are searching a large number of files and the files may have arbitrary content, then some points to consider:
As noted, you will be I/O bound. If you read more than one file simultaneously, you will have a negative impact on the performance of each calculation. Obviously, the goal of scheduling calculations in parallel is to keep the disk busy between files, but you may want to consider structuring your work differently. For example, set up one thread that enumerates and opens the files, and a second thread that gets open file handles from the first thread one at a time and processes them. The file system will cache catalog information, so the enumeration won't have a severe impact on reading the data, which actually has to hit the disk.
If the files can be arbitrarily large, Chris' approach may not be practical since the entire content is read into memory.
If you have no other use for the data than calculating the hash, then I suggest disabling file system caching before reading the data.
If using NSFileHandles, a simple category method will do this per-file:
#include <fcntl.h>

@interface NSFileHandle (NSFileHandleCaching)
- (BOOL)disableFileSystemCache;
@end

@implementation NSFileHandle (NSFileHandleCaching)
- (BOOL)disableFileSystemCache {
    return (fcntl([self fileDescriptor], F_NOCACHE, 1) != -1);
}
@end
If the sidecar files are small, you may want to collect them in memory and write them out in batches to minimize disruption of the processing.
The file system (HFS, at least) stores file records for files in a directory sequentially, so traverse the file system breadth-first (i.e., process each file in a directory before entering subdirectories).
The above is just suggestions, of course. You will want to experiment and measure performance to confirm the actual impact.
libdispatch actually provides APIs explicitly for this! Check out dispatch_io; it will parallelize I/O when appropriate, and otherwise serialize it to avoid thrashing the disk.
The following link is to a BitBucket project I set up that uses NSOperation and Grand Central Dispatch in a primitive file-integrity application.
https://bitbucket.org/torresj/hashar-cocoa
I hope it is of help/use.
I'm working on a framework, and in order to ensure non-blocking public methods, I'm using an NSOperationQueue that puts all the public method calls into an operation queue and returns immediately.
There are no relations or dependencies between different operations; the only thing that matters is that the operations are started in FIFO order, that is, in the same order as they were added to the queue.
Here is an example of my current implementation (sample project here):
@implementation Executor

- (instancetype)init {
    self = [super init];
    if (self) {
        _taskQueue = [[NSOperationQueue alloc] init];
        _taskQueue.name = @"com.d360.tasks";
    }
    return self;
}

- (void)doTask:(NSString *)taskName
{
    NSOperation *operation = [NSBlockOperation blockOperationWithBlock:^{
        NSLog(@"executing %@", taskName);
    }];
    [self.taskQueue addOperation:operation];
}
I realised, though, that the order in which the operations are started is not necessarily the order in which they were added to the queue. For instance, if I call
[self.executor doTask:@"Task 1"];
[self.executor doTask:@"Task 2"];
sometimes Task 2 is started before Task 1.
The question is how can I ensure a FIFO execution start?
I could achieve this using _taskQueue.maxConcurrentOperationCount = 1;, but that would allow only one operation at a time, which I don't want. One operation should not block any other operation, and they can run concurrently as long as they are started in the correct order.
I also looked into the NSOperationQueuePriority property, which would work if I knew the priorities of the calls, which I don't. In fact, even if I set the earlier-added operation to NSOperationQueuePriorityHigh and the second to NSOperationQueuePriorityNormal, the order is not guaranteed either:
[self.executor doTask:@"Task 1" withQueuePriority:NSOperationQueuePriorityHigh];
[self.executor doTask:@"Task 2" withQueuePriority:NSOperationQueuePriorityNormal];
Output is sometimes
executing Task 2
executing Task 1
Any ideas?
thanks,
Jan
When you create each task you could add a dependency on the previous task with NSOperation's -addDependency:. The complication is that a dependency isn't satisfied until the operation it depends on completes, which probably isn't what you want. You could work around that by creating another NSOperation inside each task and making the next queued task depend on that. This inner operation can just set a flag or something that says "hey, I've started!". Then, when that inner operation completes, it satisfies the dependency for the next task in the queue and allows it to start.
Seems like a convoluted way to do things, though, and I'm not sure the benefit is worth the extra complication - why does it matter what order the operations are started in, if they truly are independent operations? Once they've started, the OS decides which task gets CPU time, and you don't have much control over it anyway, so why not just queue them up and let the OS manage the start order?
I have a function that constructs an NSMutableDictionary using bk_apply, a method provided by the third-party block utility library BlocksKit. The function's test suite usually passes just fine, but once every couple of runs it crashes.
NSMutableDictionary *result = [[NSMutableDictionary alloc] init];
[inputSet bk_apply:^(NSString *property) {
    NSString *localValueName = propertyToLocalName[property];
    NSObject *localValue = [self valueForKey:localValueName];
    result[property] = localValue ?: defaults[property]; // Crash

    // Convert all dates in result to ISO 8601 strings
    if ([result[property] isKindOfClass:[NSDate class]]) { // Crash
        result[property] = ((NSDate *)result[property]).ISODateString; // Crash
    }
}];
The crash always happens on a line where result is referenced, but it's not the same line every time.
Examining the contents of result in the debugger, I've seen very strange values like
po result
{
val1 = "Some reasonable value";
val2 = "Also reasonable value";
(null) = (null);
}
It's impossible for an NSDictionary to have null keys or values, so clearly some invariant is being violated.
What is causing this crash and how do I fix it?
From the BlocksKit documentation for bk_apply:
Enumeration will occur on appropriate background queues. This will
have a noticeable speed increase, especially on dual-core devices, but
you must be aware of the thread safety of the objects you message
from within the block.
The code above is highly unsafe with respect to threading, because it reads from and writes to a mutable variable on multiple threads.
The intermittent nature of the crash comes from the fact that the thread scheduler is non-deterministic. The crash won't happen when several threads accessing shared memory happen to have their execution scheduled in sequence rather than in parallel. It is therefore possible to "get lucky" some or even most of the time, but the code is still wrong.
The debugger printout is a good example of the danger. The thread that's paused is most likely reading from result while another thread performs an insertion.
NSMutableDictionary insertions are likely not atomic; example steps might be,
allocate memory for the new entry
copy the entry's key into the memory
copy the entry's value into the memory
If you read the dictionary from another thread between steps 1 and 2, you will see an entry for which memory has been allocated, but the memory contains no values.
The simplest fix is to switch to bk_each. bk_each does the same thing as bk_apply but it's implemented in a way that guarantees sequential execution.
So I have these two methods:
-(void)importEvents:(NSArray *)allEvents {
    NSMutableDictionary *subjectAssociation = [[NSMutableDictionary alloc] init];
    for (id classHour in allEvents) {
        if ([classHour SubjectShort] && [classHour Subject]) {
            [subjectAssociation setObject:[classHour Subject] forKey:[classHour SubjectShort]];
        }
    }
    [self storeSubjects:subjectAssociation];
}

-(void)storeSubjects:(NSMutableDictionary *)subjects {
    NSArray *documentPaths = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES);
    NSString *documentsDir = [documentPaths objectAtIndex:0];
    NSString *subjectsList = [documentsDir stringByAppendingPathComponent:@"Subjects.plist"];
    [subjects writeToFile:subjectsList atomically:YES];
}
The first loops through an array of, let's say, 100 items and builds an NSMutableDictionary of about 10 unique key/value pairs.
The second method writes this dictionary to a file for reference elsewhere in my app.
The first method is called quite often, and so is the second. However, I know that once the dictionary is built and saved, its contents won't ever change, no matter how often I call these methods, since the number of possible values is limited.
Question: given the fact that the second method essentially needs to be executed only once, should I add some lines that check if the file already exists, essentially adding code that needs to be executed, or can I just leave it as is, overwriting an existing file over and over again?
Should I care? I should add that I don't seem to suffer from any performance issues, so this is more of a philosophical/hygienic question.
thanks
It depends.
You say
once the dictionary is built and saved, its contents won't ever change
until they do :-)
If your app is not suffering from any performance issues on this particular loop, I wouldn't try to cache, for this reason: unless you somehow remember that you have a once-only write on the file, you are storing up a bug for later.
This could be mitigated by using an intention revealing name on the method. i.e
-(void)storeSubjectsOnceOnlyPerLaunch:(NSDictionary*)subjects
If I got my time back for tracing down bugs caused by caching, I would have several days back in my life.
Your solution is totally over-engineered and has tons of potential to go wrong. What if the user's drive is full? Does this file get backed up? Does it need backing up, or are you wasting the user's time by backing it up? Can this fail? Are you handling it? You are concentrating on entering and storing data; you should be focusing on accessing that data.
I'd have a readwrite property allEvents and a property eventAssociations, declared readonly in the interface, but readwrite in the implementation file.
The allEvents setter stores allEvents and sets _eventAssociations to nil.
The eventAssociations getter checks whether _eventAssociations is nil and recalculates it when needed. A simple and bullet-proof pattern.
In my application there is a search bar. When text is input, it runs functionGrab, which grabs data from the internet and saves it to Core Data. For example, if we input "Hallo":
if ([[dict objectForKey:@"Category"] isNotEmpty] && [[[dict objectForKey:@"Category"] objectAtIndex:0] class] != [NSNull class]) {
    NSMutableArray *DownloadedTags = [dict objectForKey:@"Category"];
    NSMutableSet *TagsReturn = [NSMutableSet set];
    for (int i = 0; i < [DownloadedTags count]; i++) {
        NSString *Value = [DownloadedTags objectAtIndex:i];
        Tag *thisTag = (Tag *)[GrabClass getObjectWithStringOfValue:Value fromTable:@"Tag" withAttribut:@"Name"];
        [TagsReturn addObject:thisTag];
    }
    NSMutableSet *manyManagedObjects = [BusinessToSave mutableSetValueForKey:@"Tags"];
    [self removeDifferenceBetween2MutableManagedObjectSets:manyManagedObjects withDownloadedVersion:TagsReturn];
}
So each business has many categories. What happens in a multi-threaded application is that one thread inserts a category, and another thread inserts the same category before the first commits.
So [GrabClass getObjectWithStringOfValue:Value fromTable:@"Tag" withAttribut:@"Name"] creates a new object even though some other thread has already created the same object, without knowing it.
If I synchronized the whole thing, the code would run serially, and that would be slow.
functionGrab:"H"
functionGrab:"Ha"
functionGrab:"Hal"
functionGrab:"Hall"
functionGrab:"Hallo"
Something like the above: typing the word fires functionGrab five times.
I want to run functionGrab in the background, but the problem is that when I do that without synchronization it saves duplicate data, so I end up with five "Hallo" entries in my Core Data store. If I do it with synchronization, it takes far too long.
Is there any way to solve this?
I do not recommend having more than one thread "creating" the same types of data, for the exact reason you are running into.
I would suggest you funnel all of your "creates" into a single thread with a single NSManagedObjectContext to avoid merge or duplication issues.
The other option would be to make the app Lion-only and use the parent/child NSManagedObjectContext design; then your contexts will be more "aware" of each other.
Is it possible to encode an Objective-C block with an NSKeyedArchiver?
I don't think a Block object is NSCoding-compliant; therefore [coder encodeObject:block forKey:@"block"] does not work.
Any ideas?
No, it isn't possible for a variety of reasons. The data contained within a block isn't represented in any way similar to, say, instance variables. There is no inventory of state and, thus, no way to enumerate the state for archival purposes.
Instead, I would suggest you create a simple class to hold your data, instances of which carry the state used by the blocks during processing and which can be easily archived.
You might find the answer to this question interesting. It is related.
To expand, say you had a class like:
@interface MyData : NSObject
{
    ... ivars representing work to be done in block ...
}
- (void) doYourMagicMan;
@end
Then you could:
MyData *myWorkUnit = [MyData new];
... set up myWorkUnit here ...
[something doSomethingWithBlockCallback: ^{ [myWorkUnit doYourMagicMan]; }];
[myWorkUnit release]; // the block will retain it (callback *must* Block_copy() the block)
From there, you could implement archiving on MyData, save it away, etc... The key is treat the Block as the trigger for doing the computation and encapsulate said computation and the computation's necessary state into the instance of the MyData class.