I'm working on an ANSI C application that produces file contents in reverse order. That is, the bytes at the end of the file are received first, and those at the beginning are received last. Because of the amounts of data that may be involved, I would prefer to write this data directly to the file without first arranging it in a separate memory buffer. Is that possible? How can this be accomplished in ANSI C? If it can only be done with a higher-level library that is not ANSI compliant, that would be acceptable.
The solution to this was actually much simpler than I had originally thought. I was able to use fseek to position the file pointer at the end of the file and then move it incrementally backward through the file before each write.
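For future reference, here is a minimal sketch of that approach in ANSI C. The total output size is assumed to be known up front, and receive_chunk() is a hypothetical stand-in for however the data actually arrives:

```c
#include <stdio.h>

/* Hypothetical source of the incoming data: fills buf with up to
 * max bytes of the next chunk (which belongs *before* the previous
 * one in the file) and returns its length, or 0 when done. */
size_t receive_chunk(char *buf, size_t max);

/* Write chunks that arrive in reverse order (end of file first)
 * directly to their final positions. The first write extends the
 * file to its full length; each later write lands just before it. */
int write_reversed(FILE *fp, long total_size)
{
    char chunk[4096];
    size_t len;
    long pos = total_size;      /* the next chunk ends at this offset */

    while ((len = receive_chunk(chunk, sizeof chunk)) > 0) {
        pos -= (long)len;       /* chunk occupies [pos, pos + len) */
        if (fseek(fp, pos, SEEK_SET) != 0)
            return -1;
        if (fwrite(chunk, 1, len, fp) != len)
            return -1;
    }
    return 0;
}
```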
It seems that every time I call spurt without :append, it opens the file, overwrites it, and then closes it automatically. I have been writing thousands of lines to a file in a routine using spurt, which now seems like a big waste of I/O resources. I guess if I need to write thousands of lines, I should always use open to get a file handle instead. Comments?
Yes, use open to get a file handle, and use print or say (or write for binary data) to append to it.
spurt is only useful for one-off operations, and is meant to relieve you of having to write open, print and close for a single logical write operation.
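spurt and open here are Raku builtins, but the underlying open-once, write-many, close-once pattern is language-neutral. A sketch in C, just to illustrate the principle (the buffered handle absorbs the cost of the individual writes):

```c
#include <stdio.h>

int main(void)
{
    int i;
    /* Open the file once and keep the handle, instead of paying
     * for an open/write/close cycle on every line. */
    FILE *fp = fopen("log.txt", "a");
    if (fp == NULL)
        return 1;

    for (i = 0; i < 10000; i++)
        fprintf(fp, "line %d\n", i);  /* cheap: goes to the stdio buffer */

    fclose(fp);   /* flushes the buffer and releases the handle */
    return 0;
}
```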
I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can, if my file is linearized. But that is optional and means some extra overhead both when writing and reading such a file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects for page 1, page 2, page 3...). But the people at Adobe probably had their reasons to put it after the body. I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, but not because of hardware limitations "back in the day"; rather, it's about scale. It's easy to think an invoice with a couple of pages of text could be done better differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end, you can write images/text/fonts to the file as they are processed and then discard them from memory, storing only the file offset of each object to be used later when writing the trailer.
If the trailer had to come first, then you would have to read (or even generate, in the case of an embedded font) all of these objects just to get their sizes so you could write out the trailer, and only then write the objects to the file. So you would either be reading, sizing, discarding, and reading again, or trying to hold everything in RAM until you could write it to the file.
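A rough C sketch of that write-as-you-go strategy, with dummy placeholder objects standing in for real page data: each object is streamed out the moment it is ready, only its byte offset is retained, and the cross-reference table is emitted last (a real writer would also need /Root and friends in the trailer):

```c
#include <stdio.h>

#define MAX_OBJECTS 1000

int main(void)
{
    long offsets[MAX_OBJECTS];  /* only the offsets stay in memory */
    long xref_pos;
    int i, count = 3;           /* placeholder: three dummy objects */
    FILE *fp = fopen("out.pdf", "wb");

    if (fp == NULL)
        return 1;

    fprintf(fp, "%%PDF-1.4\n");

    /* Stream each object to disk as soon as it is produced; remember
     * only where it started, then the object itself can be discarded. */
    for (i = 0; i < count; i++) {
        offsets[i] = ftell(fp);
        fprintf(fp, "%d 0 obj\n<< /Placeholder %d >>\nendobj\n", i + 1, i);
    }

    /* Only now can the xref be written, because only now are all
     * the offsets known. */
    xref_pos = ftell(fp);
    fprintf(fp, "xref\n0 %d\n", count + 1);
    fprintf(fp, "0000000000 65535 f \n");
    for (i = 0; i < count; i++)
        fprintf(fp, "%010ld 00000 n \n", offsets[i]);
    fprintf(fp, "trailer\n<< /Size %d >>\nstartxref\n%ld\n%%%%EOF\n",
            count + 1, xref_pos);

    fclose(fp);
    return 0;
}
```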
Write speed and RAM are still issues we contend with today, when we're running in a Docker container on a VM on shared hardware.
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF, one had to write data to the file early, much earlier than one had any idea how big the file or, as a consequence, the cross references would become. Writing the cross references at the end, therefore, was a natural consequence.
OK, it may sound fairly straightforward but I'm still not sure how to go about it.
I know it's possible to check file type based on file extensions, using UTIs (e.g. Get the type of a file in Cocoa).
However, I need to be able to get the file type (in more general terms, like "text", "image", "else"), depending on the content.
Is that possible?
Any ideas?
One route forward is to call the file command and parse its output, but that is fairly horrible, and I wouldn't do that as it's slow and you are susceptible to changes in the output.
The file command uses a pretty extensive database of byte patterns to test the contents of a file, and I would be tempted to implement my own internal version of it, or use this library (which I think might need some work before it runs under OS X).
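To illustrate the byte-pattern idea, here is a hand-rolled sketch covering a few well-known signatures; the real file(1) database matches thousands of patterns, often at offsets other than zero:

```c
#include <stdio.h>
#include <string.h>

/* Classify a file by its leading "magic" bytes, the same idea
 * file(1) applies with a far larger pattern database. */
const char *classify(const char *path)
{
    unsigned char buf[8];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return "unreadable";
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);

    if (n >= 8 && memcmp(buf, "\x89PNG\r\n\x1a\n", 8) == 0)
        return "image (PNG)";
    if (n >= 3 && memcmp(buf, "\xff\xd8\xff", 3) == 0)
        return "image (JPEG)";
    if (n >= 5 && memcmp(buf, "%PDF-", 5) == 0)
        return "document (PDF)";
    if (n >= 4 && memcmp(buf, "GIF8", 4) == 0)
        return "image (GIF)";
    return "unknown (possibly plain text)";
}

int main(int argc, char **argv)
{
    if (argc > 1)
        printf("%s: %s\n", argv[1], classify(argv[1]));
    return 0;
}
```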
This is a bit of a two-part question about working with 40 MB XML files.
• What’s a reasonable amount of data for a program running continually in the background to hold in memory?
• How can I find out what has changed in an XML file?
So on the first read the XML is loaded into NSData, then uploaded to the server.
Now, instead of uploading a 40 MB XML file every time it changes, I would prefer to upload a “delta” file containing only what has changed. The program would monitor the file and activate when it’s been modified. From what I can see, I would need to parse both the old version and the modified version of the XML file, then compare them. Is it unreasonable to store 80 MB in memory like this every time the file is modified? I’m assuming this has to be done with a DOM parser, because I can’t see how you could compare two files like that with a SAX parser, since it only has part of the file in memory at once.
I'm a newbie at this so any help would be appreciated!
To compare two files:
There are a few ways to do this (given the file sizes involved, not all of them may be suitable):
The Unix command sdiff file1.xml file2.xml.
You can use this command from AppleScript.
-[NSFileManager contentsEqualAtPath:andPath:]
This method checks whether the two paths refer to the same file, then compares their sizes, and finally compares their contents.
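That size-first, then-bytes strategy is also easy to reproduce by hand if you need more control; a plain C sketch:

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the two files have identical contents, 0 if not,
 * -1 on error: compare sizes first (cheap), then the bytes. */
int files_equal(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    int result = 1;

    if (a == NULL || b == NULL) {
        if (a) fclose(a);
        if (b) fclose(b);
        return -1;
    }

    fseek(a, 0, SEEK_END);
    fseek(b, 0, SEEK_END);
    if (ftell(a) != ftell(b)) {
        result = 0;                 /* different sizes, cannot match */
    } else {
        char ba[8192], bb[8192];
        size_t na, nb;
        rewind(a);
        rewind(b);
        do {
            na = fread(ba, 1, sizeof ba, a);
            nb = fread(bb, 1, sizeof bb, b);
            if (na != nb || memcmp(ba, bb, na) != 0) {
                result = 0;
                break;
            }
        } while (na == sizeof ba);
    }

    fclose(a);
    fclose(b);
    return result;
}
```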
For the other part:
As for what size is reasonable for a background process: I don't think there's a fixed limit, though it matters more for a foreground application. You could also save the data into temporary files instead of holding it all in memory. Even Safari uses 130+ MB, as you can easily check in Activity Monitor.
NSXMLParser ended up being the most useful for this.
We're using the Zend_PDF module in SugarCRM to merge PDF invoices that our system generates. I have been able to successfully merge a number of PDFs (around 10 to 30 in my tests), but we're getting memory errors when we try to merge larger numbers of PDF files. The error looks something like this:
[30-Jan-2012 14:10:20] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted at /usr/local/src/php-5.3.8/Zend/zend_operators.c:1265 (tried to allocate 68134 bytes) in /srv/www/htdocs/sugar6_mf/Zend/Pdf/Element/Object/Stream.php on line 442
The above error was generated when we tried to merge 457 PDF files - that's files, not pages. Eventually we're going to need to merge 5,000 or more at a time.
Can anyone offer any help/advice on how to address this?
If needed, ask, and I'll post the code on how the merged pdf is being generated.
Thanks.
I should preface this answer by saying that I know nothing about SugarCRM - my response is based solely on my knowledge of Zend_Pdf.
If my understanding is correct, you have a PHP script (hopefully not running inside Apache considering the length of time it will take to process 5,000 files) that is taking multiple PDF files as input using the Zend_Pdf::load() method and then iterating through the pages of each PDF object and adding them to one target instance of Zend_Pdf, which you are then writing to a file using the save() method.
Using this approach, even if you unset() each of the source PDF objects after you've added the pages to the target PDF object, you'll still need enough memory to store the entire output file. If you blew through 250MB with only 457 files, then I'm guessing your input PDF files are probably about 500KB, so your output file is going to be absolutely huge, so you are still going to end up running out of memory.
My advice would be to ditch this method entirely and use pdftk instead, which you could invoke using the exec() function. I'm sure there's a limit to the size of the arguments you can provide to exec(), so it will probably be a multi-step process with several intermediate files, but ultimately I think this will be a faster, more robust solution.
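A sketch of that multi-step batching, written in C for illustration; in PHP you would assemble the same command strings for exec(). The input filenames and batch size here are invented, and while pdftk's "in1.pdf in2.pdf cat output out.pdf" form is its documented merge syntax, verify it against your installed version:

```c
#include <stdio.h>
#include <stdlib.h>

#define BATCH 100   /* files per pdftk invocation, keeps argv short */

int main(void)
{
    int total = 457;            /* number of input PDFs */
    int i, start, len;
    char cmd[8192];

    for (start = 0; start < total; start += BATCH) {
        /* Seed each batch with the result of the previous one so the
         * merged output keeps accumulating. */
        len = snprintf(cmd, sizeof cmd, "pdftk %s",
                       start == 0 ? "" : "merged.pdf");
        for (i = start; i < total && i < start + BATCH; i++)
            len += snprintf(cmd + len, sizeof cmd - len, " in%d.pdf", i);
        snprintf(cmd + len, sizeof cmd - len, " cat output merged-next.pdf");

        if (system(cmd) != 0)
            return 1;
        /* pdftk won't write onto one of its own inputs, hence the
         * intermediate name and the rename. */
        rename("merged-next.pdf", "merged.pdf");
    }
    return 0;
}
```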
And just to reiterate an earlier point, I would not run this process within Apache. I would set up a cron job that runs at the appropriate intervals and drops the output file into a secure area on your web/file server.