Out of memory error when merging large numbers of PDFs using Zend_PDF

We're using the Zend_PDF module in SugarCRM to merge the PDF invoices our system generates. I have been able to merge smaller batches successfully (around 10 to 30 files in my tests), but we get memory errors when we try to merge larger numbers of PDF files. The error looks something like this:
[30-Jan-2012 14:10:20] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted at /usr/local/src/php-5.3.8/Zend/zend_operators.c:1265 (tried to allocate 68134 bytes) in /srv/www/htdocs/sugar6_mf/Zend/Pdf/Element/Object/Stream.php on line 442
The above error was generated when we tried to merge 457 PDF files - that's files, not pages. Eventually we're going to need to merge 5,000 or more at a time.
Can anyone offer any help/advice on how to address this?
If needed, ask and I'll post the code that generates the merged PDF.
Thanks.

I should preface this answer by saying that I know nothing about SugarCRM - my response is based solely on my knowledge of Zend_Pdf.
If my understanding is correct, you have a PHP script (hopefully not running inside Apache, considering how long it will take to process 5,000 files) that loads multiple PDF files with Zend_Pdf::load(), iterates through the pages of each PDF object, adds them to a single target instance of Zend_Pdf, and finally writes that instance to a file with the save() method.
Using this approach, even if you unset() each source PDF object after you've added its pages to the target PDF object, you still need enough memory to hold the entire output file. If you blew through 256MB with only 457 files, then I'm guessing your input PDFs are around 500KB each, so your output file is going to be absolutely huge and you'll still end up running out of memory.
My advice would be to ditch this method entirely and use pdftk instead, which you could invoke using the exec() function. I'm sure there's a limit to the size of the arguments you can provide to exec(), so it will probably be a multi-step process with several intermediate files, but ultimately I think this will be a faster, more robust solution.
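To illustrate the batching idea, here is a rough sketch in Python rather than PHP (the same pattern maps onto a loop of exec() calls); pdftk being on the PATH, the batch size, and the temporary-file handling are all assumptions for illustration, not details from the question:
import os
import subprocess
import tempfile

def merge_pdfs(input_files, output_file, batch_size=100):
    # Merge in batches so no single pdftk command line gets too long.
    intermediates = []
    try:
        for i in range(0, len(input_files), batch_size):
            batch = input_files[i:i + batch_size]
            fd, tmp = tempfile.mkstemp(suffix=".pdf")
            os.close(fd)
            # pdftk <inputs...> cat output <merged.pdf>
            subprocess.run(["pdftk", *batch, "cat", "output", tmp], check=True)
            intermediates.append(tmp)
        # Merge the intermediate files into the final output.
        subprocess.run(["pdftk", *intermediates, "cat", "output", output_file], check=True)
    finally:
        for tmp in intermediates:
            os.remove(tmp)
With 5,000 inputs this only ever holds file names in memory; the heavy lifting happens inside the pdftk processes.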
And just to reiterate an earlier point, I would not run this process within Apache. I would set up a cron job that runs at the appropriate intervals and drops the output file into a secure area on your web/file server.

Related

Extract text from an Illustrator file without opening it

Any idea if it would be possible to extract text from an Illustrator file without opening it?
I have an AppleScript that currently extracts the text, but it takes a long time when I'm working on hundreds of files. I was wondering if it would be possible to get the information without opening the AI file.
+1 for "show your own code first".
If you’re only getting plain text, it should take a fraction of a second per document (opening the file will take longer):
tell application "Adobe Illustrator"
    get contents of every text frame of document 1
end tell
(i.e. Never iterate over individual application objects, querying each one, when a single query will do everything for you. Apple events are relatively expensive for apps to resolve; sending lots of them unnecessarily really kills performance.)
Be aware that AppleScript also has serious performance problems when iterating over large lists, but that's a separate issue whose solution should already be covered elsewhere.

Why should applications read a PDF file backwards?

I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about the file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can - if my file is linearized. But that is optional and means some extra overhead both when writing and reading such a file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects for page 1, page 2, page 3...). But the people at Adobe probably had their reasons to put it after the body. I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, but not because of hardware limitations "back in the day" - rather because of scale. It's easy to think an invoice with a couple of pages of text could be handled differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end, you can write images/text/fonts to the file as they are processed and then discard them from memory, simply storing the file offset of each object to be used later when writing the cross-reference table.
If the cross-reference table had to come first, you would have to read (or even generate, in the case of an embedded font) all of these objects just to get their sizes so you could write it out, and only then write the objects to the file. So you would either be reading, sizing, discarding, then reading again, or trying to hold everything in RAM until you could write it all to the file.
Write speed and RAM are still issues we contend with today, when we're running in a Docker container on a VM on shared hardware.
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in #joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF, one had to write data to the file early - much earlier than one had any idea how big the file or its cross-reference table would become. Writing the cross references at the end was therefore a natural consequence.
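To make this streaming argument concrete, here is a small Python sketch of the write pattern these answers describe: each object goes out to disk (and out of memory) as soon as it is ready, only its byte offset is retained, and the cross-reference table and trailer are appended last. It is a schematic illustration, not a conforming PDF writer:
def write_pdf_like(path, objects):
    # 'objects' is a list of ready-to-write byte strings (pages, images, fonts...).
    offsets = []
    with open(path, "wb") as f:
        f.write(b"%PDF-1.4\n")
        for num, body in enumerate(objects, start=1):
            offsets.append(f.tell())              # remember where this object starts
            f.write(f"{num} 0 obj\n".encode())
            f.write(body)
            f.write(b"\nendobj\n")
            # 'body' can be discarded now; only its offset stays in memory.
        xref_pos = f.tell()
        f.write(f"xref\n0 {len(objects) + 1}\n".encode())
        f.write(b"0000000000 65535 f \n")
        for off in offsets:
            f.write(f"{off:010d} 00000 n \n".encode())
        f.write(f"trailer\n<< /Size {len(objects) + 1} >>\n".encode())
        f.write(f"startxref\n{xref_pos}\n%%EOF\n".encode())
If the table had to come first, every object's exact size would be needed before the first byte of the body could be written, which is exactly the read-size-discard-read-again problem described above.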

Restrict file size using VB.Net

I have to create a subroutine in VB.Net that compresses some files into a "file.zip" file, but the problem is that this "file.zip" MUST have a maximum size of 2 MB.
I don't know how to do this, or even whether it's possible.
It would be nice if someone had an example to show me.
It is not possible to do this in the general case. For example, if you have a 2GB movie file, no lossless compression algorithm will ever get it to 2MB.
One solution is to "chunk" your ZIP file - that is, divide it into parts that are each no more than 2MB. 7-Zip has support for this, and it has a .NET API you can use from VB.Net. I'm not sure whether the API provides direct support for chunking; if not, you can start 7-Zip from your program using Process.Start().
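As a rough sketch of the command-line route (shown in Python here rather than VB.Net; the same arguments can be passed to Process.Start() in .NET), assuming the 7z executable is installed and that its -v volume switch is acceptable for your archive format - it produces a series of split files such as file.zip.001, file.zip.002, and so on:
import subprocess

def zip_in_2mb_volumes(archive_name, files):
    # 7z a -v2m <archive> <files...>  asks 7-Zip for volumes of at most 2 MB each.
    subprocess.run(["7z", "a", "-v2m", archive_name, *files], check=True)

zip_in_2mb_volumes("file.zip", ["report.pdf", "photos.tif"])
Note that split volumes have to be rejoined before extraction; if each 2MB piece must be independently usable, you would instead have to group the inputs into batches whose compressed size stays under the limit.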

FileSystemWatcher.Created - how does it work?

I am working on a project that will copy files to a database every time something is added to a specific directory. The program works fine when I'm testing with a small set of data, but I was wondering if someone could explain how the FileSystemWatcher.Created event works.
My main concern is when I use this on a larger scale the program may slow down when it handles 100,000+ files.
If this is an issue, could anyone explain whether there is some sort of workaround to polling the original folder (let's call it "C:\folder"), such as polling a temp folder instead?
I have not tested the watcher with 100,000 files. However, in most cases you should not have so many files in a folder awaiting processing. I recommend a structure like
C:\folder
C:\folder\processing
C:\folder\archive
C:\folder\error
As soon as you begin working on a given file, move it into processing. If you successfully process it, move the file again to archive. If there is an error while processing a file, instead move it into error.
This will make it easier for you to keep the files organized and diagnose problems that occur in production.
With that file structure, you will not run into issues with large numbers of files in the folder you are watching, unless you receive files in incredibly large bursts compared to the speed with which they can be moved into the processing state.
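A rough sketch of that move-through-folders workflow (in Python here rather than .NET, purely to illustrate the flow; the folder names follow the layout above and process_file() is a placeholder for the real database import):
import shutil
from pathlib import Path

BASE = Path(r"C:\folder")
PROCESSING = BASE / "processing"
ARCHIVE = BASE / "archive"
ERROR = BASE / "error"

def process_file(path):
    # Placeholder for the real work, e.g. copying the file's contents to the database.
    pass

def handle_new_file(path):
    work = PROCESSING / path.name
    shutil.move(str(path), str(work))                       # claim the file as soon as work begins
    try:
        process_file(work)
        shutil.move(str(work), str(ARCHIVE / work.name))    # success
    except Exception:
        shutil.move(str(work), str(ERROR / work.name))      # keep failures for diagnosis
Because each file is moved out of the watched folder as soon as it is picked up, the folder the watcher monitors stays small even under heavy load.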

How to compare and find the differences between two XML files in Cocoa?

This is a bit of a two-part question about working with 40MB XML files.
• What's a reasonable amount of data to hold in memory for a program running continually in the background?
• How do I find what has changed in an XML file?
So on the first read the XML is loaded into NSData, then uploaded to the server.
Now, instead of uploading a 40MB XML file every time it changes, I would prefer to upload a “delta” file containing only what has changed. The program would monitor the file for changes and activate when it’s been modified. From what I can see, I would need to parse an old version of the XML file and the modified XML file, then compare them? Is it unreasonable to store 80MB in memory like this every time the file is modified? I’m assuming this has to be done with a DOM parser, because I can’t see how you could compare two files like that with a SAX parser, since it only has part of the file stored at any time.
I'm a newbie at this so any help would be appreciated!
To compare two files:
There are several ways to do this (it depends on how the files need to be compared, so I may not be correct):
sdiff file1.xml file2.xml (a Unix command)
You can run this command from AppleScript.
-[NSFileManager contentsEqualAtPath:andPath:]
This method first checks whether the two paths refer to the same file, then compares their sizes, and finally compares their contents.
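Beyond checking whether the files are equal, one way to produce the "delta" the question asks about is a plain line-by-line diff. The sketch below is in Python rather than Cocoa, purely to illustrate the idea (the file names are examples); it still reads both files, but it keeps only their raw lines rather than two parsed DOM trees, and it produces a textual diff, not a structural XML diff:
import difflib

with open("old.xml") as old_file, open("new.xml") as new_file:
    delta = difflib.unified_diff(old_file.readlines(), new_file.readlines(),
                                 fromfile="old.xml", tofile="new.xml")

with open("changes.diff", "w") as out:
    out.writelines(delta)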
For the other part:
As for what size is reasonable for a background process: I don't think the size itself is the problem, though it does matter for a foreground application. You can save the data into temporary files instead of keeping it all in memory. Even Safari uses 130+ MB, as you can easily check in Activity Monitor.
NSXMLParser ended up being the most useful for this.