How to compare and find the differences between two XML files in Cocoa? - objective-c

This is a bit of a two-part question about working with 40 MB XML files.
• What’s a reasonable size to store in memory for a program running continually in the background?
• How do I find what has changed in an XML file?
So on the first read the XML is loaded into NSData, then uploaded to the server.
Now, instead of uploading a 40 MB XML file every time it changes, I would prefer to upload a “delta” file containing only what has changed. The program would monitor the file for changes and activate when it has been modified. From what I can see, I would need to parse an old version of the XML file and the modified XML file, then compare them. Is it unreasonable to store 80 MB in memory like this every time the file is modified? I’m assuming this has to be done with a DOM parser, because I can’t see how you could compare two files like that with a SAX parser, since it only has part of the file in memory at any time.
I'm a newbie at this so any help would be appreciated!

To compare two files:
There are many ways to do this (given the file sizes involved, I may not be entirely correct):
sdiff file1.xml file2.xml (a Unix command)
You can run this command from AppleScript; a Cocoa sketch that drives diff directly with NSTask is at the end of this answer.
-[NSFileManager contentsEqualAtPath:andPath:]
This method first checks whether the two files at the given paths are the same file, then compares their sizes, and finally compares their contents.
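For example, a minimal sketch (the paths are placeholders):

NSFileManager *fm = [NSFileManager defaultManager];
// Cheap first check: has the file actually changed since the last upload?
BOOL same = [fm contentsEqualAtPath:@"/path/to/last-uploaded.xml"
                            andPath:@"/path/to/current.xml"];
if (!same) {
    // The files differ; work out the delta and upload it here.
}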
For the other part:
Is there a particular size limit for a background process? I don't think so; for an application it matters more. You can also save the data to temporary files instead of holding it all in memory. Even Safari uses 130+ MB, as you can easily check in Activity Monitor.
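Coming back to the comparison part, here is the NSTask sketch mentioned above. It shells out to /usr/bin/diff and captures its unified output, which could serve as the “delta” to upload; the function name and file paths are placeholders.

#import <Foundation/Foundation.h>

// Minimal sketch: run /usr/bin/diff on the old and new XML snapshots and return
// its unified-diff output. Paths and the function name are placeholders.
NSString *XMLDelta(NSString *oldPath, NSString *newPath)
{
    NSTask *task = [[NSTask alloc] init];
    [task setLaunchPath:@"/usr/bin/diff"];
    [task setArguments:[NSArray arrayWithObjects:@"-u", oldPath, newPath, nil]];

    NSPipe *pipe = [NSPipe pipe];
    [task setStandardOutput:pipe];

    [task launch];
    NSData *output = [[pipe fileHandleForReading] readDataToEndOfFile];
    [task waitUntilExit];
    [task release];

    // diff exits 0 if the files are identical, 1 if they differ, 2 on error.
    return [[[NSString alloc] initWithData:output
                                  encoding:NSUTF8StringEncoding] autorelease];
}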

NSXMLParser ended up being the most useful for this
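In case it helps anyone else, here is a rough sketch of the streaming approach (keying elements by name is a simplification of mine; a real delta needs a smarter key such as an element path or an id attribute):

// Stream one XML snapshot with NSXMLParser, collecting element text keyed by element
// name, so neither 40 MB file ever has to be held as a full DOM tree.
@interface ElementCollector : NSObject <NSXMLParserDelegate>
{
    NSMutableDictionary *values;      // element name -> text content
    NSMutableString     *currentText;
}
- (NSDictionary *)values;
@end

@implementation ElementCollector

- (id)init
{
    if ((self = [super init]))
        values = [[NSMutableDictionary alloc] init];
    return self;
}

- (NSDictionary *)values { return values; }

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)name
  namespaceURI:(NSString *)uri qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attrs
{
    [currentText release];
    currentText = [[NSMutableString alloc] init];
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)chars
{
    [currentText appendString:chars];
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)name
  namespaceURI:(NSString *)uri qualifiedName:(NSString *)qName
{
    if (currentText)
        [values setObject:[NSString stringWithString:currentText] forKey:name];
}

- (void)dealloc
{
    [values release];
    [currentText release];
    [super dealloc];
}

@end

Run one collector over each snapshot (for example via -[NSXMLParser initWithContentsOfURL:]) and compare the two -values dictionaries to build the delta, so only the collected values, not the documents themselves, sit in memory.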

Related

Why should applications read a PDF file backwards?

I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can - if my file is linearized. But that is optional and means some extra overhead both when writing and reading such file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects for page 1, page 2, page 3...). But the people at Adobe probably had their reasons for putting it at the end. I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, though not so much because of hardware limitations "back in the day" as because of scale. It's easy to think an invoice with a couple of pages of text could be done better differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end you can write images/text/fonts to the file as they are processed and then discard them from memory while simply storing the file offset of each object to be used to write the trailer.
If the trailer had to come first, you would have to read (or even generate, in the case of an embedded font) all of these objects just to get their sizes so you could write out the trailer, and only then write the objects to the file. So you would either be reading, sizing, discarding, and then reading again, or trying to hold everything in RAM until you could write it all to the file.
Write speed and RAM are still issues we contend with today, when we're running in a Docker container on a VM on shared hardware.
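To make the write-ordering point concrete, here is a purely illustrative sketch (in Cocoa terms only because that's handy here; the object source and the BuildXrefTable helper are hypothetical): each object is streamed to disk as soon as it is produced, only its byte offset is remembered, and the cross-reference table is emitted last.

// Purely illustrative, not a real PDF writer.
NSString *path = @"/tmp/out.pdf";                        // placeholder path
[[NSFileManager defaultManager] createFileAtPath:path contents:nil attributes:nil];
NSFileHandle *out = [NSFileHandle fileHandleForWritingAtPath:path];

NSMutableArray *offsets = [NSMutableArray array];
for (NSData *object in objectStream) {                   // objectStream is hypothetical
    [offsets addObject:[NSNumber numberWithUnsignedLongLong:[out offsetInFile]]];
    [out writeData:object];                              // object can now be discarded
}

// Only now, with every offset known, can the xref table and trailer be written.
[out writeData:BuildXrefTable(offsets)];                 // BuildXrefTable is hypothetical
[out closeFile];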
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF, one had to write data to the file early, much earlier than one had any idea how big the file or, as a consequence, the cross references would become. Writing the cross references at the end was therefore a natural consequence.

writing file data directly to disk/db with blobs (plone.app.blob/Archetypes, and plone.namedfile)

I have several pieces of data that need to be merged into one file (ATContentTypes blob file, Plone 4.1). The total amount of data is likely to be quite large so I really don't want to have to load it all into memory, concatenate it, and do something like o.setFile(data). If I were writing directly to the file system I could just do open(myfile, 'a') and write to it, but I'm not clear how I could do that with a blob supported content type. All of the docs and tests I've been able to look at just have it being set with a str or in-memory StringIO. Is there a way to append to this field without loading the whole thing into memory?
Similarly, I've also looked at using Dexterity with a plone.namedfile NamedBlobFile. It looks like that field just has a 'data' attribute that is basically a string. How could I append to that without loading the whole thing into memory?
It's quite old and the product has never been officially released, but it can help you: ore.bigfile.
It's well explained in this blog article: http://blog.jazkarta.com/2010/09/21/handling-large-files-in-plone-with-ore-bigfile/

How to remove the .efs file extension from thousands of recovered files in one folder

I recently recovered a 1.5 TB external HDD that crashed. The program I used to recover the files was Active Undelete Enterprise; it's excellent. When the files were successfully recovered they were all saved with a .efs extension, so files looked like mydocument.docx.efs. At first I thought they were encrypted and needed to be decrypted; I spent 10 minutes on it and realized I just needed to remove the .efs from the end of the filename, and then mydocument.docx works perfectly. The problem is that I now have over 55,000 files within hundreds of folders where I simply need to remove the .efs from the end of each filename. Does anyone know how to do this?
From a command prompt window, navigate to the top level directory where these files reside.
Type the command
DIR /S/B >>filelist.txt
This command will give you a bare format file listing of the current directory plus all nested subdirectories without any extraneous information. The list will be contained in the text file named "filelist.txt" or whatever else you choose to call it. I would then use this text file in a text editor to convert every line of text from, for example,
C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs
to
rename c:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs file1.png
to give a simple example of a file that I just found on my system using this method.
You will need to use a text editor with a columnar editing capability since you have to modify so many files. Old programmers' editors such as CodeWright made this really simple, while modern editors such as Eclipse or Notepad++ make this a little more difficult and may require a columnar editing plugin, depending on the version. You basically have to make a columnar copy of all of the text in the file, and then paste the copy off to the far right, far enough that a second column of filenames and paths won't overwrite any of the existing file names and paths. You can then use the columnar editing features to select and delete the path names of the text in the 2nd column, since the rename command requires that the 2nd argument be simply the base filename and extension without the path information. You can also use the columnar editing features to prepend every line with "RENAME ". If you attempt to do this without columnar editing features, you will find it slow going!
An alternate way to do this is to use a command formed from a "regular expression" to create the rename command. If you are not familiar with "regular expressions", ask a programmer friend as this is not an easy topic to learn from scratch. If you are familiar with regular expressions, this is probably the simplest way to perform this task. I haven't used them in many years and no longer recall the exact syntax to use or I would tell you myself.
Regardless of what kind of editor you use, the goal is to turn this ASCII file list of paths and filenames into a batch file (simply rename file1.txt to file1.bat when you are finished editing). You can then run the batch file by typing file1.bat at a command prompt.
I have just run into this same problem myself using the same really wonderful tool that you used. I am writing this while waiting for the undelete program to finish. That it restores files with this extra extension seems very counterintuitive, so I will look for an option to make it not do this when it finishes. If I find one, I will post a new answer here that is more specific to this tool. Otherwise, I am going to have to rename all kazillion files just as you had to.
You experienced this problem because the disk that you recovered your files to "does not support encryption", according to the Active@ UNDELETE documentation. The documentation offers no further explanation of what kinds of disks support encryption, etc.
They offer a Decrypt command that restores the files' proper names as a post-processing step. Unfortunately, this requires that you "include" each and every file to be decrypted, with no support for wildcards or for recursing into subdirectories, so that is a non-starter, in my opinion, given that both of us have hundreds of thousands of files to be renamed.
I did find that by selecting a normal fixed (non-removable) hard drive as the destination of the recovery effort, the resulting files do not end up "encrypted" (i.e., they are recovered with the proper file name and extension). I originally chose a large USB-based flash drive, and the files were stored in their "encrypted" state (not really encrypted, but flagged as potentially so, hence the .efs extension). Of course, this meant that I had to run the recovery all over again after switching to a regular hard drive (it takes about 16 hours to recover 80 GB worth of files due to the presence of many sector CRC errors).

Out of memory error when merging large numbers of PDFs using Zend_PDF

We're using the Zend_PDF module in SugarCRM to merge pdf invoices that our system generates. I have been able to successfully merge a number of PDFs (around 10 to 30 in my tests), but we're getting memory errors when we try to merge larger numbers of pdf files. The error looks something like this:
[30-Jan-2012 14:10:20] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted at /usr/local/src/php-5.3.8/Zend/zend_operators.c:1265 (tried to allocate 68134 bytes) in /srv/www/htdocs/sugar6_mf/Zend/Pdf/Element/Object/Stream.php on line 442
The above error was generated when we tried to merge 457 PDF files - that's files, not pages. We're eventually going to need to merge 5,000 or more at a time.
Can anyone offer any help/advice on how to address this?
If needed, ask, and I'll post the code on how the merged pdf is being generated.
Thanks.
I should preface this answer by saying that I know nothing about SugarCRM - my response is based solely on my knowledge of Zend_Pdf.
If my understanding is correct, you have a PHP script (hopefully not running inside Apache considering the length of time it will take to process 5,000 files) that is taking multiple PDF files as input using the Zend_Pdf::load() method and then iterating through the pages of each PDF object and adding them to one target instance of Zend_Pdf, which you are then writing to a file using the save() method.
Using this approach, even if you unset() each of the source PDF objects after you've added their pages to the target PDF object, you'll still need enough memory to store the entire output file. If you blew through 250 MB with only 457 files, then I'm guessing your input PDF files are probably about 500 KB each, so your output file is going to be absolutely huge, and you are still going to end up running out of memory.
My advice would be to ditch this method entirely and use pdftk instead, which you could invoke using the exec() function. I'm sure there's a limit to the size of the arguments you can provide to exec(), so it will probably be a multi-step process with several intermediate files, but ultimately I think this will be a faster, more robust solution.
And just to reiterate an earlier point, I would not run this process within Apache. I would set up a cron job that runs at appropriate intervals and drops the output file into a secure area on your web/file server.

Use ZIP-archives to store NSDocument data

I noticed that Apple started using zip archives to replace document packages (folders appearing as a single file in Finder) in the iWork applications. I'm considering doing the same, as I keep getting support emails about my document packages getting corrupted when they are copied to a Windows file server.
My question is: what would be the best way to do this in an NSDocument-based application?
I guess the easiest way would be to create a directory file wrapper, create an archive of it and return it in NSDocument's
- (NSFileWrapper *)fileWrapperOfType:(NSString *)typeName error:(NSError **)outError
But I fail to understand how to create a zip archive of the NSFileWrapper.
If you just want to make a zip file your format (i.e., "mydoc.myextension" is actually a zip file), there's no convenient, built-in Cocoa mechanism for creating zip archives in code. Take a look at this Google Code project: ziparchive. I don't believe a file wrapper will help in that case, though.
Since you cited iWork: I don't own iWork '09, but previous versions use a package format (i.e., NSFileWrapper would be ideal) and zip only the XML that describes the document's structure, while keeping attachments (like embedded media, images, etc.) in a resource folder, all within the package. I assume they do this because the XML can be quite large for big, complicated documents, but compresses very well because it's text. This results in an overall smaller document.
If indeed Apple has moved to making the entire document one big zip archive (which I would find odd), they'd either be extracting necessary resources to a temp folder somewhere or loading the whole thing into memory (a step backward from their package-based approach, IMO). These are considerations you'll need to take into account as well.
You’ll want to take the data from the file wrapper and feed it into something like ziparchive.
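If you'd rather not pull in a third-party library at all, a rough alternative sketch (the method name and temporary paths are my own, and error handling is mostly omitted) is to write the directory wrapper to a temporary folder, zip it with /usr/bin/zip via NSTask, and return the archive as a regular-file wrapper from fileWrapperOfType:error::

// Rough sketch: serialize the directory wrapper, zip it with /usr/bin/zip, and wrap
// the resulting archive in a regular-file NSFileWrapper. Temporary paths and the
// method name are placeholders; error handling is mostly omitted.
- (NSFileWrapper *)zippedWrapperForDirectoryWrapper:(NSFileWrapper *)dirWrapper
{
    NSFileManager *fm = [NSFileManager defaultManager];
    NSString *tmpDir  = NSTemporaryDirectory();
    NSString *pkgPath = [tmpDir stringByAppendingPathComponent:@"doc-contents"];
    NSString *zipPath = [tmpDir stringByAppendingPathComponent:@"doc.zip"];

    // Clear out leftovers from a previous save.
    [fm removeItemAtPath:pkgPath error:NULL];
    [fm removeItemAtPath:zipPath error:NULL];

    if (![dirWrapper writeToFile:pkgPath atomically:YES updateFilenames:YES])
        return nil;

    NSTask *zip = [[NSTask alloc] init];
    [zip setLaunchPath:@"/usr/bin/zip"];
    [zip setCurrentDirectoryPath:pkgPath];   // archive relative paths, not absolute ones
    [zip setArguments:[NSArray arrayWithObjects:@"-r", @"-q", zipPath, @".", nil]];
    [zip launch];
    [zip waitUntilExit];
    [zip release];

    NSData *zipped = [NSData dataWithContentsOfFile:zipPath];
    return zipped ? [[[NSFileWrapper alloc] initRegularFileWithContents:zipped] autorelease]
                  : nil;
}

Reading the document back would be the reverse: unzip into a temporary folder (or into memory) and rebuild the directory wrapper from there.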
Pierre-Olivier Latour has written an extension to NSData that deals with zip compression. You can get it here: http://code.google.com/p/polkit/
I know this is a little late to the party but I thought I'd offer up another link that could help anyone that comes across this post.
Looks like the ZipBrowser sample from Apple would be a good start: http://developer.apple.com/library/mac/#samplecode/ZipBrowser/Introduction/Intro.html
HTH