Does the string <!-FTCACHE-1-> in a PDF file mean anything?

My program downloads a PDF file from a source location every day and compares it with the previous day's file using Windiff's binary comparison.
When I view the raw bytes of the PDF in Notepad, I find that the file sometimes has the string <!-FTCACHE-1-> at the end, and sometimes the string is missing.
99% of the time, Windiff reports a difference between the files only because one of them contains the string <!-FTCACHE-1-> at the end.
Does anyone know the reason behind this?
Thanks,
Praveen

<!--FTCACHE-1--> is generated by FatWire Content Server, a web content management solution that is probably generating your URL. FTCACHE means FutureTenseCache, the name of the original product component. The text is a "footer" flag that indicates to the caching module whether or not the page was properly generated. If the page is supposed to be cached, a 1 indicates that the page was properly built, and so is cacheable. If 0 is returned, it indicates that the page was corrupted and should not be cached. The Satellite Server caching engine is supposed to strip this footer once it reads it.
In other words, the flag that is there to ensure the cache is not corrupted is itself causing the corruption in your PDF.
This issue has been fixed in patches to FatWire Content Server for quite some time now.
For your purposes, just ignore the string - strip it if you can.
Sorry about that. That was my bug. :-)

The application that generates the PDF file has a bug: the FTCACHE tag should not be there, as it is not a valid PDF construct. Its presence actually damages the PDF file; for example, it invalidates the Fast Web View feature, as you have seen. It is safe to remove the tag before comparing the files.
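If you want to automate the stripping before your daily comparison, a minimal sketch in Java might look like the following. It assumes the footer is plain ASCII appended after the PDF's %%EOF marker; the class name is made up, and file paths come from the command line:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: truncate a trailing <!--FTCACHE-...--> footer so that
// two downloads of the same PDF compare as equal in Windiff.
public class StripFtcache {
    public static byte[] strip(byte[] pdf) {
        // The footer is appended after %%EOF, so scanning the last few
        // bytes of the file is enough.
        int from = Math.max(0, pdf.length - 32);
        String tail = new String(pdf, from, pdf.length - from, StandardCharsets.US_ASCII);
        int i = tail.lastIndexOf("FTCACHE");
        if (i < 0) {
            return pdf;                       // no footer present
        }
        int lt = tail.lastIndexOf('<', i);    // start of the comment-like tag
        int cut = from + (lt >= 0 ? lt : i);
        byte[] out = new byte[cut];
        System.arraycopy(pdf, 0, out, 0, cut);
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), strip(pdf));
    }
}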

"FT" could be FreeType, the open source font engine. The comment probably comes from the software that generates the PDF. If you can somehow identify that, you could (assuming it is open source) perhaps take a look through it and see what causes it to emit the comment.
FreeType has a source folder dedicated to caching; the root source file there is called ftcache.c. It doesn't do a lot, though: it just #includes (!) the other source files.
Googling for the string reveals several more or less random PDFs that seem to contain it.

Related

GhostScript creating extra page when font errors occur

I have a process that needs to write multiple PostScript and PDF files to a single PostScript file that is generated by, and will continue to be modified by, Word interop VB code. Each call to Ghostscript results in an extra blank page. I am using Ghostscript 9.27.
Since there are several technologies and factors here, I've narrowed it down: the problem can be demonstrated by converting a PostScript file to PostScript and then to PDF via the command line. The problem does not occur going directly from PostScript to PDF. Here's an example, followed by the errors it produces.
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=C:\testfont.ps C:\smallexample.ps
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=C:\testfont.pdf C:\testfont.ps
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Didn't find this font on the system!
Substituting font Times-Roman for TimesNewRomanPSMT.
I'm starting with the assumption that the font errors are the cause of the extra page (if only to rule that out; I know it is not certain). Since my ps->pdf test does not exhibit this problem and my ps->ps->pdf test does, I'm thinking Ghostscript is not writing font data that was in the original PostScript file to the one it is creating. I'm looking for a way to preserve/recreate that in the resulting PostScript file. Or, if that is not possible, I'll need a way to tell Ghostscript how to use those fonts. I did not have success attempting to include them as described in the GS documentation here: https://www.ghostscript.com/doc/current/Use.htm#CIDFontSubstitution.
Any help is appreciated.
I've made this an answer, even though I'm aware it doesn't answer the question, because it won't fit as a comment.
I think your assumption that the missing fonts are causing your problem is flawed. Many PDF files do not embed all the fonts they require; I've seen many such examples, and they do not emit extra pages.
You haven't been entirely clear in your description of what you are doing. You describe two processes: one going from PostScript to PDF, and one going from PostScript to PostScript (WHY?) and then to PDF.
You haven't described why you are processing PostScript into a PostScript file.
In particular, you haven't supplied an example file to look at. Without that, there's no way to tell whether your experience is in fact correct.
For example, it's entirely possible that you have set /Duplex true and have an odd number of pages in your file. This will cause an extra blank page (quite properly) to be emitted, because duplexing requires an even number of pages.
The documentation you linked to is for CIDFont substitution; it has nothing to do with Font substitution. CIDFonts and Fonts are different things in PDF and (more particularly) PostScript. But I honestly doubt that is your problem.
I'd suggest that you put (at the least) 'smallexample.ps' somewhere public and post the URL here, so that we can at least follow the same steps you are taking and probably tell you what's going on. An explanation of why you're doing this would be useful too. I would normally strongly suggest that you don't do extra steps like this; each step carries the risk of degrading the output in some way.
Thank you for the response. I am posting as an answer as well due to the comment length restrictions:
I think you are correct that my assumption about fonts is wrong. I have found the extra page in the second ps file and do not encounter the font errors until the second conversion.
I have a process that uses the VB MS Word interop libraries to print multiple documents to a single ps file, using a virtual printer set up with Ghostscript and RedMon. I am adding functionality to mix in PDF files too. It works, but results in an extra page.
To narrow down where the problem actually was, I tried much simpler test cases via the command line. I only get the extra page when Ghostscript is converting ps to ps (whether or not there is a PDF as well). Converting ps to pdf, I do not get the extra page. Interestingly, I can work around the problem by converting the ps to pdf and then both pdfs back to ps. That is slower and should not be necessary, however, so I would like to identify and resolve the extra-page issue.
I cannot share that particular file. I'll see if I can create an example I can share that also exhibits the problem. In the meantime, I can confirm that the source ps file is six pages and the duplexing settings are as follows. There is a duplex definition in the resulting ps file (the one with the extra page) as well. Might there be some other common culprits I could check for in the source ps? Thank you.
featurebegin{
%%BeginFeature: *DuplexUnit NotInstalled
%%EndFeature
}featurecleanup
featurebegin{
%%BeginFeature: *Duplex None
<</Duplex false /Tumble false>> setpagedevice
%%EndFeature
}featurecleanup
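For reference, the two-step workaround mentioned above can be reproduced from the command line along these lines (file names here are placeholders; the gs path and flags follow the question's own examples):

C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=C:\stage1.pdf C:\source.ps
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=C:\combined.ps C:\stage1.pdf C:\extra.pdf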

I'd like to recognize the text of all PDFs on my computer and save them without moving them from their locations. Is this possible?

I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I started this process and it asked for a directory, I chose C:\, my main hard drive.
It took hours to load, and when it did, the list of files it generated included Word documents as well. Adobe said I couldn't proceed until I removed the problem files.
Once I had removed all the PDFs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the Word documents in the list.
So I manually removed those too. But Adobe still said that I couldn't proceed until problem files were removed, and there weren't any remaining files in the list that Adobe had flagged as having issues.
My firm is trying to make sure all PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without removing them from their varied locations.
I think you can do this using a combination of:
- regular Java: to list all files in a directory that match a given criterion (e.g. their name ends with '.pdf')
- iText: to iterate over the PDF document and extract all images
- Tess4J: a Java wrapper for Tesseract (Google's OCR engine), to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you, but only for one PDF at a time. So you'd still need some Windows/Linux scripting to pipe in all files of a given directory.
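A rough sketch of that workflow, assuming iText 5 and Tess4J on the classpath (class and method names vary between library versions, and a real solution would also need to write the recognized text back into the PDF):

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// Sketch of the workflow above: walk a directory tree, pull the images out
// of each PDF with iText, and run Tesseract (via Tess4J) over them.
public class OcrAllPdfs {
    public static void main(String[] args) throws IOException {
        Tesseract tesseract = new Tesseract();
        // tesseract.setDatapath("/path/to/tessdata");  // adjust to your install
        Files.walk(Paths.get(args[0]))
             .filter(p -> p.toString().toLowerCase().endsWith(".pdf"))
             .forEach(p -> ocrImages(p.toFile(), tesseract));
    }

    static void ocrImages(File pdf, Tesseract tesseract) {
        try {
            PdfReader reader = new PdfReader(pdf.getAbsolutePath());
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                parser.processContent(page, new RenderListener() {
                    public void renderImage(ImageRenderInfo info) {
                        try {
                            BufferedImage img = info.getImage().getBufferedImage();
                            System.out.println(tesseract.doOCR(img));
                        } catch (IOException | TesseractException e) {
                            e.printStackTrace();   // e.g. unsupported image type
                        }
                    }
                    public void beginTextBlock() { }
                    public void endTextBlock() { }
                    public void renderText(TextRenderInfo info) { }
                });
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}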

Why should applications read a PDF file backwards?

I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table, and a trailer. In the official PDF Reference from Adobe, section 3.4.4 about the file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can if my file is linearized, but that is optional and means some extra overhead both when writing and reading such a file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects for page 1, page 2, page 3, and so on). But the people at Adobe probably had their reasons to put them after it; I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, but not because of hardware limitations "back in the day"; rather, it's about scale. It's easy to think an invoice with a couple of pages of text could be done differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end, you can write images/text/fonts to the file as they are processed and then discard them from memory, simply storing the file offset of each object so it can be used to write the trailer.
If the trailer had to come first, you would have to read (or even generate, in the case of an embedded font) all of these objects just to get their sizes so you could write out the trailer, and then write all the objects to the file. So you would either be reading, sizing, discarding, then reading again, or trying to hold everything in RAM until you could write it all to the file.
Write speed and RAM are still issues we contend with today, when we're running in a Docker container on a VM on shared hardware.
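To make that concrete, here is a toy illustration (not a conformant PDF writer; the object contents are made up) of how a producer can stream objects out as they are ready and keep only their byte offsets for the xref table at the end:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy illustration: stream each object out as soon as it is ready, remember
// only its byte offset, and emit the cross-reference table and trailer last.
public class XrefAtEnd {
    public static void main(String[] args) throws IOException {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        try (FileOutputStream out = new FileOutputStream("toy.pdf")) {
            pos += write(out, "%PDF-1.4\n");
            for (int n = 1; n <= 3; n++) {
                offsets.add(pos);   // all the xref needs to know about object n
                pos += write(out, n + " 0 obj\n<< /Example " + n + " >>\nendobj\n");
                // the object's data could be discarded from memory at this point
            }
            long xrefStart = pos;
            StringBuilder xref = new StringBuilder();
            xref.append("xref\n0 ").append(offsets.size() + 1).append('\n');
            xref.append("0000000000 65535 f \n");          // entry for object 0
            for (long off : offsets) {
                xref.append(String.format("%010d 00000 n \n", off));
            }
            xref.append("trailer\n<< /Size ").append(offsets.size() + 1)
                .append(" >>\nstartxref\n").append(xrefStart).append("\n%%EOF\n");
            write(out, xref.toString());
        }
    }

    // write ASCII text and report how many bytes went out
    static long write(FileOutputStream out, String s) throws IOException {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        out.write(b);
        return b.length;
    }
}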
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in #joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF, one had to write data to the file early, much earlier than one had any idea how big the file or, as a consequence, the cross-reference offsets would become. Writing the cross references at the end was therefore a natural consequence.
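The flip side is that readers get to the xref cheaply. A rough sketch, assuming an unencrypted file with a classic cross-reference table, of how a reader locates it from the end without touching the rest of the file:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Rough sketch: a reader only needs the last few bytes of the file to find
// "startxref <offset>", which points straight at the cross-reference table.
public class FindXref {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile pdf = new RandomAccessFile(args[0], "r")) {
            int tailLen = (int) Math.min(1024, pdf.length());
            byte[] buf = new byte[tailLen];
            pdf.seek(pdf.length() - tailLen);    // one seek, one small read
            pdf.readFully(buf);
            String tail = new String(buf, StandardCharsets.ISO_8859_1);
            int i = tail.lastIndexOf("startxref");
            if (i < 0) {
                throw new IOException("no startxref found");
            }
            // the byte offset follows the keyword, before the %%EOF marker
            String rest = tail.substring(i + "startxref".length()).trim();
            long xrefOffset = Long.parseLong(rest.split("\\s+")[0]);
            System.out.println("xref table starts at byte " + xrefOffset);
            // a real reader would now seek(xrefOffset) and parse the table
        }
    }
}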

Reducing PDF size - From 5MB to ~200KB

INTRO
I have this 2.7 MB PDF file.
It's a certificate with two fields that I have to fill: name and course.
After filling those fields I save it for later printing.
THE PROBLEM
After saving, the new file comes out at ~5 MB.
I have tried many saving options, but I only managed to reduce it to a final size of 4.7 MB (still larger than the original file).
For instance, I tried opening the original file (2.7 MB) and saving it right after opening (without making any changes). The result is the same: a new ~5 MB file.
That means the information I filled in (name and course) is not at fault.
SOLVING
At some point, while trying new methods of saving, I managed to save it at a size of 180 KB.
Unfortunately, I am not able to reproduce this result.
After several hours of trying to achieve this result again and not succeeding, I came here to ask for help :(
As you are in Acrobat, you might use "Save as optimized…" (where you already are, in order to show the space usage) and remove as much as possible: mainly structure information, private data (which means data allowing the original creating application to edit the file again), etc.
You might also start from a minimum-sized blank file and copy/paste the form fields into it (although I don't think that would bring much reduction, as, AFAIK, fonts used in form fields are counted in the Fonts item).

Extract embedded PDF file without a full parse

I want to build a utility to extract embedded files from a PDF (see section 7.11.4 of the spec). However I want the utility to be "small" and not depend on a full PDF parsing framework. I'm wondering if the file format is such that a simple tool could scan through the document for some token or sequence, and from that know where to start extracting the embedded file(s).
Potential difficulties include the possibility that the token or sequence you scan for could validly exist elsewhere in the document, leading to spurious or corrupt extraction.
I'm not that familiar with the PDF spec, so I'm looking for:
- confirmation that this is possible
- a general approach that would work
There are at least two scenarios that are going to make your life difficult: encrypted files, and object streams (a compressed object that contains a collection of objects inside).
Regarding the second item (object streams): some PDF generation tools will take most of the objects (dictionaries) inside a PDF file, put them inside a single object, and compress that single object (usually with deflate compression). This means that you cannot just skim through a PDF file looking for some particular token in order to extract some piece of information you need while ignoring the rest. You will need to actually interpret the structure of PDF files, at least partially.
Note that the embedded files you want to extract are very likely to be compressed as well, even if an object stream is not used.
Your program will need to be able to do at least the following:
- process xref tables
- process object streams
- apply decoding/decompression filters to a data stream
Once you are able to get all objects from the file, you could in theory go through all of them looking for dictionaries of type EmbeddedFile. This approach has the disadvantage that you might extract files that are no longer referenced from anywhere inside the document (because a user deleted them at some point in the file's history, for example).
Another approach is to actually navigate the structure of the file, looking for embedded files in the locations specified by the PDF spec; a sketch of this follows the list below. You can find embedded files in at least the following elements (this list is off the top of my head; there might be a lot more than these):
- Names dictionary
- Document outlines
- Page annotations
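If you can relax the "no framework" requirement a little, a library that already handles xref tables, object streams, and stream filters makes the second approach short. A sketch using iText 5 (mentioned elsewhere on this page; API names vary between versions), covering only the Names dictionary case and naively trusting the stored file names:

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;

import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfNameTree;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;

// Sketch: walk the document's EmbeddedFiles name tree and dump each file.
// The library takes care of xref tables, object streams, and decompression.
public class ExtractEmbedded {
    public static void main(String[] args) throws IOException {
        PdfReader reader = new PdfReader(args[0]);
        PdfDictionary names = reader.getCatalog().getAsDict(PdfName.NAMES);
        if (names == null) return;                       // no Names dictionary
        PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
        if (embedded == null) return;                    // no embedded files
        Map<String, PdfObject> tree = PdfNameTree.readTree(embedded);
        for (Map.Entry<String, PdfObject> entry : tree.entrySet()) {
            PdfDictionary filespec =
                    (PdfDictionary) PdfReader.getPdfObject(entry.getValue());
            PdfDictionary ef = filespec.getAsDict(PdfName.EF);
            if (ef == null) continue;
            PRStream stream = (PRStream) PdfReader.getPdfObject(ef.get(PdfName.F));
            try (FileOutputStream out = new FileOutputStream(entry.getKey())) {
                out.write(PdfReader.getStreamBytes(stream)); // filters applied here
            }
        }
        reader.close();
    }
}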