PDF special searching iOS - objective-c

I know there is a great project for PDF searching on iOS: PDFKitten.
But I have encountered some PDF files for which its search does not work. When I open these files in the Preview app on a Mac and search there, it works.
I uploaded one file here.
You can check by opening this file in the Preview app and searching for the word 'ra'. It works perfectly. But if you add the file to the PDFKitten project, configure the project to open it, and then try to search, it doesn't work.
I inspected the source. It handles all the text-showing operators: Tj, ', ", and TJ. I placed log statements in the callbacks for these operators and saw that the callbacks are never called.
Can you give me any suggestions or ideas?

If I understand the code correctly, PDFKitten looks for fonts only in the /Font entry of the /Resources dictionary of the page. At least that's my interpretation of the method fontCollectionWithPage of Scanner, whose result is queried by setFont in pdfScannerCallbacks to set the current font object.
Furthermore, there is no callback for the Do operator (i.e. the operator used to inject the contents of an XObject resource into the page content). Unless CGPDFScannerScan interprets this operator under the hood, the content of included XObjects is not scanned at all. This would match your observation that the text-showing operator callbacks never get called.
Your file mundo1.pdf, though, does not have a /Font entry directly in the /Resources dictionaries of its pages. Instead, the actual content of each page is wrapped in a single /XObject resource. These XObjects in turn have their own /Resources dictionaries, each containing a /Font entry that defines the fonts used on the respective page.
Thus, PDFKitten does not know anything about the fonts used in your file, especially about their encodings, and so cannot extract the text from the PDF contents. Maybe it does not even get to see the PDF contents to interpret.
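To illustrate the lookup that would be needed, here is a minimal sketch in Python rather than Objective-C. It models PDF dictionaries as plain Python dicts, so collect_fonts and the sample page_resources are hypothetical illustrations and not part of PDFKitten or Core Graphics. It descends into form XObjects and gathers the fonts declared in their own /Resources.

```python
def collect_fonts(resources, seen=None):
    """Return fonts declared in a resources dict and in nested form XObjects."""
    if seen is None:
        seen = set()
    fonts = dict(resources.get("Font", {}))
    for xobj in resources.get("XObject", {}).values():
        # Only form XObjects carry their own content stream and resources.
        if xobj.get("Subtype") != "Form" or id(xobj) in seen:
            continue
        seen.add(id(xobj))  # guard against XObjects that reference each other
        nested = collect_fonts(xobj.get("Resources", {}), seen)
        for name, font in nested.items():
            fonts.setdefault(name, font)
    return fonts


# Hypothetical page shaped like the ones described above: no /Font entry
# on the page itself, only inside the wrapping form XObject.
page_resources = {
    "XObject": {
        "Xf1": {
            "Subtype": "Form",
            "Resources": {"Font": {"F1": {"BaseFont": "Helvetica"}}},
        }
    }
}

top_level_only = page_resources.get("Font", {})
all_fonts = collect_fonts(page_resources)
print(top_level_only, all_fonts)
```

Note that merging by name is a simplification: in a real PDF a font name is only meaningful within the resources of the stream that uses it, so a real scanner would switch the font lookup when it enters the XObject through Do.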
I would, therefore, suggest you report this issue on the PDFKitten issue tracker.
By the way, this PDF construct fully conforms to the PDF specification. Nonetheless, it looks like an inappropriate use of the iText library. The author of the software using iText like that should review the code and switch to better-suited classes of the iText library.

Related

Embedding PDF graphics in PDF output file programmatically

I am looking for a rough overview of how one would go about embedding graphics (coming from a PDF file) into another PDF file when writing a C++ document processor.
Background: I work on the LilyPond music typesetter, and recently added Cairo output to the system. Now I would like to support adding externally provided graphics to the PDF files that we generate (e.g. adding a logo onto a laid-out page). This is trivial with EPS for PS output.
I can see how you could hook up Poppler to read the PDF and render the PDF contents onto a Cairo surface, but I wonder if there is a simpler shortcut (e.g. embed the PDF file as a binary stream, and then point directly to that stream).
If you can go via an external route, such as reading the PDF and drawing it into the existing PDF using Cairo, that would be simpler. To do it manually:
A PDF page consists of a stream of operators for drawing it, and a dictionary of external resources (fonts, images etc.). To stamp one PDF page onto another, you would need to:
a) Find all objects for external resources in the stamp which are needed, and add them to the destination PDF.
b) Convert the page to a "Form XObject", which is a sort of reusable piece of content. Add this to the /XObject entry of the destination page's resources, making sure to pick a fresh name.
c) Add some operators to the page content in the destination page to invoke the new XObject.
To see how this might work, you could play with -stamp-as-xobject and -postpend-content "/XObjName Do" from section 8.4 of the cpdf manual.
Making this work for arbitrary PDFs is really not for the faint of heart, I'm afraid.
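Steps a) to c) can be sketched on a toy object model. This is only an illustration under stated assumptions: the Ref type, the stamp_objects table, and the helper names are hypothetical, and a real implementation would use a PDF library to parse and renumber objects.

```python
from collections import namedtuple

# Hypothetical stand-in for an indirect reference such as "3 0 R".
Ref = namedtuple("Ref", "num")


def referenced_objects(value, objects, found=None):
    """Step a): collect every object number reachable from value."""
    if found is None:
        found = set()
    if isinstance(value, Ref):
        if value.num not in found:
            found.add(value.num)
            referenced_objects(objects[value.num], objects, found)
    elif isinstance(value, dict):
        for item in value.values():
            referenced_objects(item, objects, found)
    elif isinstance(value, list):
        for item in value:
            referenced_objects(item, objects, found)
    return found


def fresh_name(existing, prefix="X"):
    """Step b): pick an XObject name not yet used on the destination page."""
    n = 0
    while f"{prefix}{n}" in existing:
        n += 1
    return f"{prefix}{n}"


# Toy object table of the stamp PDF.
stamp_objects = {
    1: {"Type": "Page", "Resources": Ref(2), "Contents": Ref(5)},
    2: {"Font": {"F1": Ref(3)}},
    3: {"Type": "Font", "FontDescriptor": Ref(4)},
    4: {"Type": "FontDescriptor"},
    5: b"BT /F1 12 Tf (logo) Tj ET",
    6: {"Type": "Catalog"},  # not needed by the page resources
}

needed = referenced_objects(stamp_objects[1]["Resources"], stamp_objects)
name = fresh_name({"X0", "X1"})
# Step c): operators appended to the destination page content;
# q/Q save and restore the graphics state around the stamp.
invoke = f"q /{name} Do Q"
print(sorted(needed), name, invoke)
```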

How to merge PDFs into a PDFA1b with watermarks using iText5

Here is what I need to do:
Merge several PDF documents (which may or may not be PDFA) into one PDFA1b.
Add a watermark (a simple text label) on each page of the resulting PDF.
It has to be with iText 5
I have looked at this official merging example: http://developers.itextpdf.com/examples/merging-pdf-documents/adding-cover-page-existing-pdf
But can this method be used to create a PDFA, and also add watermarks?
Or am I stuck with using this other method which he specifically says not to use: http://developers.itextpdf.com/examples/merging-pdf-documents-itext5/how-not-merge-documents
You can create files that conform to PDF/A-1b with just about any PDF library including iText. PDF/A, in general, is a subset of ISO 32000 (PDF) so it's really just a matter of using the tool to do what you need to with the files but not adding anything that is forbidden by PDF/A-1b (in your case).
The thing to be aware of is that iText or any of the other libraries that "support" PDF/A, will not prevent you from modifying PDF in a way that is forbidden by PDF/A... you just need to know what those things are.
So... before merging, you'll want to be sure that the input files don't have any annotations or form fields or any other interactive content.
After merging, add your watermark as page content and be sure your XMP metadata is conforming and you should be OK.

Is it possible to have only a single page corrupted in a PDF file?

In a PDF with many pages, is it possible to have only a single page be corrupted?
I've done some digging and couldn't find anything, so I am not even sure if this is possible. Does anyone have knowledge about it? And if it is possible, how could I go about reproducing it? I've done some experimenting with editing hex values, but it always renders the whole PDF file corrupt.
A PDF is a complex object graph. A bit simplified you have
document
|
+ lots of objects with different purposes
+ page tree
  |
  + .. some page
    |
    + content stream ("page description language")
    + resources
Now, as @mkl and @Setasign mention, there's a lot that can be malformed in the serialization format of this network.
In your concrete document, the page reference, the content stream reference, the content stream content, a resource reference in the content, a resource's content, ... could be the reason for the failure. To debug, you will need a copy of the PDF reference, the invalid file, and a good PDF parser / browser tool.
Recreating a failure by blindly hacking hex in the document will most probably fail because:
The serialization format of the objects is indexed by byte offset, so adding or removing bytes in the middle of a page will damage not only the page but the whole document.
Blindly changing some value can damage any (structural) content, not only page content. You must restrict yourself to changing the content of items that are bound to /Contents entries. This is where, again, intimate knowledge of PDF is required.
In short: it's very unlikely that you can reproduce a page rendering error by haphazardly changing, adding, or removing hex in a PDF.
It's like having had a runtime error in one program and now wanting to recreate the error in another program by adding or removing characters...
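The offset problem can be shown with a small sketch. It assumes a hand-built, uncompressed four-object file with a classic cross-reference table. build_pdf and xref_is_consistent are hypothetical helpers written for this illustration, and the check is far simpler than what a real parser does.

```python
import re


def build_pdf(objects):
    """Serialize object bodies into a tiny PDF with a classic xref table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def xref_is_consistent(pdf):
    """Check that startxref and every in-use xref entry point where they claim."""
    start = int(re.search(rb"startxref\n(\d+)", pdf).group(1))
    if pdf[start:start + 4] != b"xref":
        return False  # the table itself is no longer where startxref says
    entries = re.findall(rb"(\d{10}) 00000 n", pdf[start:])
    return all(
        pdf.startswith(b"%d 0 obj" % num, int(off))
        for num, off in enumerate(entries, start=1)
    )


content = b"0 0 m 50 50 l S"
pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] /Contents 4 0 R >>",
    b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
])

same_length = pdf.replace(b"50 50 l", b"90 90 l")  # overwrite inside /Contents
longer = pdf.replace(b"50 50 l", b"150 150 l")     # two extra bytes

print(xref_is_consistent(pdf), xref_is_consistent(same_length),
      xref_is_consistent(longer))
```

The same-length overwrite inside the content stream leaves the index intact, while the two inserted bytes shift the cross-reference table away from the position recorded after startxref (and also make /Length stale).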

[Steganography] Hiding Data in PDF files

I'm trying to hide a file inside the code of a PDF file. I've already searched for information to help me. I uncompressed the PDF using pdftk ( pdftk pdf.pdf output uncompress.pdf uncompress ). Then I tried different things, such as:
Inserting a comment: I put " %TEXT_TO_HIDE " in the uncompressed PDF file's code.
Adding a new object: I put " 0 0 obj << TEXT_TO_HIDE << endobj " in the uncompressed PDF file's code.
Modifying an existing object.
Then I compressed it again using pdftk.
In each case, I get a new PDF that looks different from the original. It's not corrupted, but images have different colors and some of the original text is missing.
So, do you know of any rules for changing a PDF's code without anyone noticing?
(PS: Sorry if my English is bad ^^ )
You cannot modify a PDF file in a text editor and expect the file to be still compliant in general. PDF is a binary format and you need to read the PDF specification to figure out how to modify it.
That said, there are heaps of places where you can "hide" information in a PDF document, the real question is how much data you want to hide, and to what purpose. The purpose typically links to how secure exactly this needs to be.
As some examples:
1) PDF allows embedding complete files in the actual PDF file. This is not really secure as anyone with decent software can extract these files (but the file itself could still be secured of course).
2) PDF allows adding arbitrary objects anywhere (or almost anywhere) in the file. This is a great way to hide information, but someone with the right tools can browse the object tree (even if the file is compressed) and see what you did.
3) PDF allows adding for example white text on a white background or text behind other objects. Again, there are ways around this for people with the right software.
4) Adobe's PDF spec allows at least 1K of fluff after the %%EOF marker (although ISO 32000 does not). Keep in mind that this is visible to anyone opening the file with a decent text or binary editor. (Thanks Jongware).
In short, you need to define much better what exactly you want to accomplish and how "secure" secure is in your use case.
You should also consider how "robust" the method must be. Should someone be able to save your PDF file with Acrobat for example with the hidden code intact? Some of the above methods may not be robust enough to ensure that with absolute certainty.
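As an illustration of option 4 only, here is a minimal sketch that appends a payload after the last %%EOF marker and reads it back. The function names and the carrier bytes are hypothetical. Whether a given viewer tolerates the trailing data, and whether it survives a re-save, are exactly the robustness questions raised above.

```python
EOF_MARKER = b"%%EOF"


def hide_after_eof(pdf_bytes, payload):
    """Append payload after the last %%EOF marker of the carrier."""
    if EOF_MARKER in payload:
        raise ValueError("payload must not contain the %%EOF marker")
    end = pdf_bytes.rfind(EOF_MARKER)
    if end < 0:
        raise ValueError("carrier has no %%EOF marker")
    # Anything that followed the marker (usually a newline) is replaced.
    return pdf_bytes[:end + len(EOF_MARKER)] + b"\n" + payload


def extract_after_eof(pdf_bytes):
    """Return whatever follows the last %%EOF marker and its newline."""
    end = pdf_bytes.rfind(EOF_MARKER)
    return pdf_bytes[end + len(EOF_MARKER) + 1:]


# Stand-in for the bytes of a real PDF file.
carrier = b"%PDF-1.4\n% ... objects, xref, trailer ...\n%%EOF\n"
stego = hide_after_eof(carrier, b"TEXT_TO_HIDE")
recovered = extract_after_eof(stego)
print(recovered)
```

Because the payload sits after the end of the PDF structure, no offsets inside the file change, but it is plainly visible in any text or binary editor.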

Page Templates with Form XObject in PDF

I'm writing a PDF generation library and wanted to add the ability to use other PDFs as templates. The specification notes that a TemplateInstantiated property on pages, with the alias of the template object, should be all that is needed.
Here is a gist of the pdf content:
https://gist.github.com/tyre/89c12f8203181f078001
The template itself is stored in object 16 and the page in object 19.
qpdf --check reports the PDF as invalid:
WARNING: tmp/alpaca.pdf: file is damaged
WARNING: tmp/alpaca.pdf (file position 32089): xref not found
WARNING: tmp/alpaca.pdf: Attempting to reconstruct cross-reference table
checking tmp/alpaca.pdf
PDF Version: 1.7
File is not encrypted
File is not linearized
I'm afraid your PDF document is completely and utterly broken and that you have misunderstood a number of key concepts. You cannot simply incorporate a complete PDF file into another PDF file in the way you have done and expect that to work.
The template system you are referring to is intended to include "hidden" pages - not referenced in the pages tree in the PDF file - in the context of an interactive form document (or interactive document in general). That doesn't sound like what you are intending to do. And these pages need to be valid PDF pages. In other words, you cannot just include the original PDF document verbatim and expect the PDF reader to sort things out; you need to insert a syntactically correct PDF page object.
What you want to do is take the content of a document and apply that as a background to a document. This most commonly is done using XObjects. Pseudo-code for this could be:
Open the original PDF document
Open the "template" document
Read the template document and copy all elements from the template page into a newly created XObject in the original PDF document.
Modify the page contents of the pages in the original PDF document to paint the new XObject at the beginning of the page description of the existing pages.
It's important to note that again, you're not supposed to simply insert the template document into the stream for the newly created XObject. You will have to create a valid XObject that contains a properly formed resources dictionary referencing all resources needed by your XObject, and that contains the content stream from your template document.
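The pseudo-code above can be sketched on plain Python dicts standing in for PDF objects. This is only an illustration: page_to_form_xobject, apply_background, and the name Tpl0 are hypothetical, and a real library would also have to copy the referenced resource objects and serialize the stream properly.

```python
def page_to_form_xobject(template_page):
    """Wrap a template page's content and resources in a form XObject dict."""
    content = template_page["Contents"]
    return {
        "Type": "XObject",
        "Subtype": "Form",
        "BBox": list(template_page["MediaBox"]),
        # The XObject needs its own resources dictionary for its content.
        "Resources": template_page.get("Resources", {}),
        "Length": len(content),
        "stream": content,
    }


def apply_background(page, form, name="Tpl0"):
    """Register the form XObject and paint it before the existing content."""
    xobjects = page.setdefault("Resources", {}).setdefault("XObject", {})
    if name in xobjects:
        raise ValueError("XObject name already in use: " + name)
    xobjects[name] = form
    # q/Q keep the template's graphics state from leaking into the page.
    page["Contents"] = b"q /" + name.encode("ascii") + b" Do Q\n" + page["Contents"]


template_page = {
    "MediaBox": [0, 0, 612, 792],
    "Resources": {"Font": {"F1": {"BaseFont": "Helvetica"}}},
    "Contents": b"BT /F1 24 Tf 72 720 Td (Letterhead) Tj ET",
}
target_page = {
    "MediaBox": [0, 0, 612, 792],
    "Contents": b"BT /F2 12 Tf 72 600 Td (Body) Tj ET",
}

form = page_to_form_xobject(template_page)
apply_background(target_page, form)
print(target_page["Contents"])
```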
As already indicated in comments, the PDF presented by the OP is structurally defective: the cross-reference table position and entries are wrong. Furthermore, the transition from one PDF revision to the next update looks questionable. Essentially, therefore, the OP will have to provide a sample PDF which is at least syntactically correct.
That being said, though, the OP indicated he was
writing a PDF generation library and wanted to add the ability to use other PDFs as templates. The specification notes that a TemplateInstantiated property on pages, with the alias of the template object, should be all that is needed.
The Named Pages mechanism is not meant for something like that. Its main current use (if it is used at all) is in the context of spawning page templates by AcroForm actions.
For using pages from other PDFs, one can simply copy them (and the referenced other objects) from the source PDF if they are to be used as separate pages as is; and if multiple templates are to be put onto a single target page, one can wrap the copied sources into form xobjects and include them in the target page.