How to replace a specific image within a PDF?

I have a PDF with 3 images.
I want to find each image and replace it with another image.
I saw in the pdf the original paths under xmpMM:Ingredients:
I tried to change the paths in Notepad++, but it looks like the images are already embedded, so changing the path does nothing.
How can I find each image and replace it with another image?

The XMP data is informational only. The actual images are embedded as streams in the PDF file. Finding the correct streams and replacing them isn't a simple problem, and it can't be done in a text editor like Notepad++. You'll need a library or toolkit that can modify PDFs, like https://pdf-lib.js.org/ or similar.
The PDF file looks like an Illustrator file, which adds another layer of weirdness - Illustrator can write PDFs that have both PDF and Illustrator versions of the content, and you see one in Acrobat and the other in Illustrator.
It's probably easier to recreate the PDF from whatever source produced it.
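To illustrate the difference between the XMP metadata and the embedded image streams, here is a minimal Python sketch. It scans a small hand-written fragment in PDF object syntax (the `SAMPLE` bytes are made up for illustration and are not a complete PDF) and reports which stream objects are images. It is only a rough regex scan, assuming uncompressed objects and no nested dictionaries, so it is not a substitute for a real PDF library.

```python
import re

# Hand-written fragment in PDF object syntax (illustrative, not a valid PDF).
# Object 1 is an image stream; object 2 is an XML metadata stream.
SAMPLE = (
    b"1 0 obj\n<< /Type /XObject /Subtype /Image /Width 2 /Height 2 "
    b"/ColorSpace /DeviceGray /BitsPerComponent 8 /Length 4 >>\n"
    b"stream\n\x00\x7f\x7f\xff\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /Metadata /Subtype /XML /Length 8 >>\n"
    b"stream\n<x:xmp/>\nendstream\nendobj\n"
)

# Matches "N G obj << ... >> stream"; breaks on nested dictionaries.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj\s*<<(.*?)>>\s*stream\r?\n", re.DOTALL)


def find_image_objects(data):
    """Return a list of dicts describing stream objects marked /Subtype /Image."""
    found = []
    for match in OBJ_RE.finditer(data):
        header = match.group(3)
        if not re.search(rb"/Subtype\s*/Image\b", header):
            continue  # e.g. the metadata stream: not an image
        width = re.search(rb"/Width\s+(\d+)", header)
        height = re.search(rb"/Height\s+(\d+)", header)
        length = re.search(rb"/Length\s+(\d+)", header)
        found.append({
            "obj": int(match.group(1)),
            "width": int(width.group(1)),
            "height": int(height.group(1)),
            "length": int(length.group(1)),
            "data_offset": match.end(),  # where the raw stream bytes begin
        })
    return found


images = find_image_objects(SAMPLE)
for info in images:
    print(info)
```

Replacing an image means swapping the stream bytes and updating /Length, /Width, /Height, the filter, and the file's cross-reference table, which is why a PDF library is the practical route.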

Related

Embedding PDF graphics in PDF output file programmatically

I am looking for a rough overview of how one would go about embedding graphics (coming from a PDF file) into another PDF file when writing a C++ document processor.
Background: I work on the LilyPond music typesetter, and recently added Cairo output to the system. Now I would like to support adding externally provided graphics to the PDF files that we generate (e.g. adding a logo onto a laid-out page). This is trivial with EPS for PS output.
I can see how you could hook up Poppler to read the PDF and render the PDF contents onto a Cairo surface, but I wonder if there is a simpler shortcut (e.g. embed the PDF file as a binary stream, and then point directly to that stream).
If you can go via an external route, like reading the PDF and writing it into an existing PDF using Cairo, that would be simpler. To do it manually:
A PDF page consists of a stream of operators for drawing it, and a dictionary of external resources (fonts, images etc.). To stamp one PDF page onto another, you would need to:
a) Find all objects for external resources in the stamp which are needed, and add them to the destination PDF.
b) Convert the page to a "Form XObject", which is a sort of reusable piece of content. Add this to the /XObject entry in the resources of the destination page, making sure to pick a fresh name.
c) Add some operators to the page content in the destination page to invoke the new XObject.
To see how this might work, you could play with -stamp-as-xobject and -postpend-content "/XObjName Do" from section 8.4 of the cpdf manual.
Making this work for arbitrary PDFs is really not for the faint of heart, I'm afraid.
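As a rough illustration of the naming part of step b) and of step c), the Python sketch below picks an unused XObject name and builds the operators that would be appended to the destination page's content stream. The function names and the `Stamp` prefix are hypothetical, and the sketch does not copy resources or build the Form XObject itself (step a and the rest of step b).

```python
def fresh_xobject_name(existing, prefix="Stamp"):
    """Return a name that is not already a key in the page's /XObject dictionary."""
    n = 0
    while f"{prefix}{n}" in existing:
        n += 1
    return f"{prefix}{n}"


def stamp_operators(name, x, y, scale=1.0):
    """Build content-stream operators that draw form XObject `name` at (x, y)."""
    # q ... Q saves and restores the graphics state around the stamp;
    # cm sets the transformation matrix (uniform scale plus translation);
    # Do paints the named XObject.
    return f"q {scale:g} 0 0 {scale:g} {x:g} {y:g} cm /{name} Do Q\n"


# Names already present in the destination page's /XObject dictionary (made up).
existing_names = {"Im0", "Stamp0"}
name = fresh_xobject_name(existing_names)
ops = stamp_operators(name, 72, 720, scale=0.5)
print(ops)
```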

How to transfer OCR text from one PDF to another PDF?

I have two versions of the same scanned PDF. One of them has an OCR layer. How can I transfer the layer to the other one? I have already installed Ghostscript, but I don't know what to do next.
How to Use Ghostscript
There's no such thing as an 'OCR layer' in PDF.
Most likely what you have is a PDF file which has a scanned image and the text extracted from that image using OCR which has been drawn as 'invisible' text (text rendering mode 3).
In general you can't copy and paste text between PDF files, so it's very hard to do what you are asking. I don't know of any tools which will help you here, and I can say for certain that Ghostscript will not help you at all.
Most likely you will also need to copy the Font (or CIDFont) from the PDF file as well, and if it has a ToUnicode CMap you'll definitely also want that or search won't work (and there's little point in this sort of OCR otherwise).
Since you have a PDF file which includes the OCR'ed text, why not simply use that PDF? I can't see any reason why you would want to 'transfer' it to another PDF file.
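For reference, the 'invisible' text mentioned above is ordinary text drawn with the `3 Tr` operator. The Python sketch below builds such a content-stream fragment. The helper names are hypothetical, the font name `F1` is assumed to exist in the page's /Font resources, and the string handling assumes a simple single-byte font encoding, which real OCR output often does not use.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, font="F1", size=10):
    """Build content-stream operators that place `text` invisibly at (x, y)."""
    return (
        "BT\n"                                   # begin text object
        "3 Tr\n"                                 # rendering mode 3: no fill, no stroke
        f"/{font} {size:g} Tf\n"                 # select font and size
        f"{x:g} {y:g} Td\n"                      # move to the text position
        f"({escape_pdf_string(text)}) Tj\n"      # show the string
        "ET\n"                                   # end text object
    )


ops = invisible_text_ops("foo (bar)", 72, 700)
print(ops)
```

The text is still searchable and selectable because the viewer processes it like any other text; it simply paints nothing, so only the scanned image is visible.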

Problems with PDF fonts generated with ggsave under windows when linking in Illustrator

I run into problems with embedded (or not embedded?) fonts in PDFs of ggplots created with ggsave and linked into Illustrator files, for some reason on Windows only.
For my workflow I link plots into Illustrator, where I create figures with several plots. I don't embed the plots because, in case something changes in R, the plots are automatically updated when Illustrator is reopened.
The problem is that when trying to save such files I always get an error message that saving is not possible because the font "^1" could not be embedded. I can save the Illustrator files when I disable PDF compatibility, but I cannot save them as PDF, which is what I need.
I don't get this problem if I use ggsave(plot, device=cairo_pdf), but with cairo_pdf I run into other problems (e.g. with geom_rangeframe).
In previous posts I read about an issue with the Dingbats or AdobePiStd font, but using ggsave(plot, useDingbats=FALSE) does not solve it. Does anyone have an idea how to solve this?
After further research I could solve the problem with the embedFonts function. The problem seems to be that the fonts are not embedded by default. I wrote a small function to use instead of ggsave that automatically embeds the fonts into the same PDF file:
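library(ggplot2)
# Save the plot with ggsave, then embed its fonts into the same file.
# embedFonts (from grDevices) requires Ghostscript to be installed.
ggsave_embed <- function(fileN, ...) {
  ggsave(fileN, ...)
  grDevices::embedFonts(file = fileN, outfile = fileN)
}
# Example usage:
ggsave_embed("myfile.pdf", myPlot)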

PDF cannot display Chinese fonts in table of contents

I made a PDF file from LaTeX (using Texmaker).
Acrobat Reader is able to display BOTH the text and the table of contents in Linux.
But Acrobat Reader is unable to display the table of contents in Windows XP (the Chinese characters came out as boxes). However, the text is displayed correctly.
I tried to embed the fonts into the PDF but the various methods are not 100% successful, so I'm not sure if the fonts are embedded correctly or not. Anyway, the table of contents remains unreadable in Windows.
I wonder if it is really a font embedding problem? Or do I need to install these "Adobe Reader X Font Packs":
https://www.adobe.com/support/downloads/detail.jsp?ftpID=4883
My concern is that I'd like my PDF to be readable in Windows, including the table of contents (and preferably without further installations). If this is possible...
I suspect you are talking about "bookmarks", rather than saying that part of the text in the document is OK and part is not. PDF bookmarks are part of the UI of the application and are not rendered with embedded fonts. Therefore, the system you are running on needs to know how to handle fonts in the language(s) of choice.
See https://forums.adobe.com/thread/1144972?start=0&tstart=0
Embedding the fonts will have no effect on the bookmarks.

How can I overlay text on a TIFF image, creating something like a searchable pdf?

I would like to have an application where a user views an image of a document in TIFF Format.
If the words "foo" and "bar" appear on the page and a selection is made on the image that only contains "foo", then I would like to select only the word "foo".
Is there a format that lends itself to storing both the location of text and the text of an image?
Since you know about searchable PDF, and it implements exactly what you are describing, I assume that there is some reason why you can't use it. If not, you should use PDF: the format supports mixed content and overlaying text and images. All of the viewers that your users are likely to have will understand what to do with text beneath the image.
The TIFF format does not support this directly, but if you are making the viewer, and it only needs to work there, then you could try to store the text and positions in a custom tag.
Then your viewer would need to read this tag, interpret mouse positions, and look up the text that is being selected on the image. No other viewer would support your text tag, but they would show the TIFF.
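The lookup described above can be sketched in a few lines of Python. The `WordBox` type and `words_in_selection` function are hypothetical, and the sketch assumes pixel coordinates with the origin at the top-left of the image and word boxes already produced by OCR. How the boxes are serialized into a custom TIFF tag is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCR'ed word and its bounding box in image pixel coordinates."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def words_in_selection(words, left, top, right, bottom):
    """Return the text of every word lying entirely inside the selection rectangle."""
    return [
        w.text
        for w in words
        if w.left >= left and w.right <= right
        and w.top >= top and w.bottom <= bottom
    ]


# Made-up positions for the two words from the question.
page_words = [
    WordBox("foo", 100, 50, 160, 80),
    WordBox("bar", 180, 50, 240, 80),
]

# A selection rectangle that covers only the first word.
selected = words_in_selection(page_words, 90, 40, 170, 90)
print(selected)  # ['foo']
```

Requiring full containment is one possible design choice; a viewer could instead select any word whose box merely overlaps the rectangle.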
For either of these mechanisms, you will need OCR and a way to encode the data you get either into PDF or the custom TIFF tag. For open source OCR, take a look at Tesseract from Google.
Disclaimer: I work at Atalasoft. Our imaging SDK, DotImage, has add-ons for OCR that can make searchable PDF, and can add and edit TIFF tags.