I am looking for a rough overview of how one would go about embedding graphics (coming from a PDF file) into another PDF file when writing a C++ document processor.
Background: I work on the LilyPond music typesetter, and recently added Cairo output to the system. Now I would like to support adding externally provided graphics to the PDF files that we generate (e.g. putting a logo onto a page that has been laid out). This is trivial to do with EPS graphics for PostScript output.
I can see how you could hook up Poppler to read the PDF and render its contents onto a Cairo surface, but I wonder if there is a simpler shortcut (e.g. embed the PDF file as a binary stream, and then point directly to that stream).
If you can go via an external route, like reading the PDF and writing it into the existing PDF using Cairo, that would be simpler. To do it manually:
A PDF page consists of a stream of operators for drawing it, and a dictionary of external resources (fonts, images etc.). To stamp one PDF page onto another, you would need to:
a) Find all the objects for external resources that the stamp page needs, and add them to the destination PDF.
b) Convert the stamp page to a "form XObject", which is a sort of reusable piece of content. Add it, under a freshly picked name, to the /XObject entry of the destination page's resource dictionary.
c) Add some operators to the destination page's content stream to invoke the new XObject (see the sketch below).
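To make that concrete, here is roughly what the pieces look like inside the destination file. This is a hand-written sketch: the object number 12 and the name /Stamp0 are invented for the example.

    % (b) the stamp page repackaged as a form XObject, added as a new
    % object to the destination file:
    12 0 obj
    << /Type /XObject
       /Subtype /Form
       /BBox [0 0 200 100]       % taken from the stamp page's /MediaBox
       /Resources << ... >>      % (a) the resources the stamp page needs
       /Length ...
    >>
    stream
    ...the stamp page's original content stream...
    endstream
    endobj

    % (b) the entry in the destination page's resource dictionary,
    % under a freshly chosen name:
    /XObject << /Stamp0 12 0 R >>

    % (c) operators appended to the destination page's content stream;
    % q/Q save and restore the graphics state, cm positions the stamp:
    q 1 0 0 1 100 700 cm /Stamp0 Do Q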
To see how this might work, you could play with -stamp-as-xobject and -postpend-content "/XObjName Do" from section 8.4 of the cpdf manual.
Making this work for arbitrary PDFs is really not for the faint of heart, I'm afraid.
I'm using pdftotext to extract information from a PDF, currently with the -raw option. I have a few problems with the PDFs I'm working with: if I select the text from top to bottom, it selects in the following fashion.
PDF content:
A
B
C
It selects A, then C, and then B, so when I extract the text it comes out in the same order. Is there a way to reformat the PDF so I can select the content from top to bottom?
NOTE: I'm aware that if I omit the "raw" option the layout will be preserved, but that seems to be buggy when the document includes tables, so raw works better for me.
Yes, you can reformat the PDF so that the content is returned from top to bottom. However, this is not something that can be easily done in Adobe Acrobat or any other viewer that I am aware of, and here is why.
From the documentation of pdftotext, the -raw option is defined as
Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.
"content stream order" is the important piece in the description.
In PDFs, the content on a page does not have to be written in the content stream (the instructions that are interpreted to display the page) in the order in which a human would read it once the page is rendered. The internals of PDF do not care about the ordering; the format was designed to reproduce the same visual appearance of a document on a variety of platforms. Since all that matters is the visualization, applications and libraries that write PDF tend not to order the content stream in any meaningful way.
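To make this concrete, here is a simplified fragment of a content stream that would produce the A/B/C page above; the drawing order does not match the reading order:

    BT /F1 12 Tf 72 700 Td (A) Tj ET   % drawn first, near the top
    BT /F1 12 Tf 72 500 Td (C) Tj ET   % drawn second, lowest on the page
    BT /F1 12 Tf 72 600 Td (B) Tj ET   % drawn last, between A and C

The rendered page reads A, B, C from top to bottom, but -raw reports A, C, B because that is the order in the stream.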
While you can reorder the instructions in a content stream so that they match the order in which a human would read them, it is not an easy task to do by hand. Using a library that understands PDF to manipulate the content stream would be one way of doing it. Another is to look for a more advanced text-extraction tool (there are a number of tools that look at the placement of content on the page rather than just where it appears in the content stream).
I am not aware of anything that will reorder the content stream in the PDF based on where the content appears on the page automatically though.
How can I programmatically and reliably create PNG images from CHM and EPUB files? Only the first page is needed, as in "cover image thumbnail generation".
Could this even be done just from the command line?
I have already looked at the open-source CHM QuickLook plug-in for Mac OS X for source code that does this, and at Calibre (the latter to no avail).
In CHM, the default page is a web page (an .html file), so of course it can only contain a single page.
An extractor program is easy to build on chmlib or Free Pascal's libraries, but it will need to parse the HTML to find the names of the other files to extract. Roughly, the algorithm would be as follows (a code sketch follows the list):
use some "list" function of a command-line extraction tool to get the default page's name (this is stored in an internal record).
extract it, and parse it for img and other referencing tags.
extract the files those tags reference.
The biggest picture extracted in the previous step is probably "it"!
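A minimal sketch of the extraction step using chmlib is below. Note the assumptions: the path "/index.htm" is only a placeholder (the real default page name has to be read from the CHM's internal #SYSTEM record), and the HTML parsing and "biggest image" steps are only indicated as comments:

    #include <chm_lib.h>
    #include <cstdio>
    #include <vector>

    // Read one named object out of an open CHM file.
    // Returns an empty buffer on failure.
    static std::vector<unsigned char> extract(struct chmFile *chm, const char *path)
    {
        struct chmUnitInfo ui;
        if (chm_resolve_object(chm, path, &ui) != CHM_RESOLVE_SUCCESS)
            return {};
        std::vector<unsigned char> buf(ui.length);
        if (chm_retrieve_object(chm, &ui, buf.data(), 0, ui.length)
                != (LONGINT64)ui.length)
            return {};
        return buf;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        struct chmFile *chm = chm_open(argv[1]);
        if (!chm) return 1;

        // Placeholder name: the real default page is recorded in the
        // #SYSTEM internal file.
        std::vector<unsigned char> html = extract(chm, "/index.htm");
        std::printf("default page: %zu bytes\n", html.size());

        // ...parse the HTML for img and other referencing tags, call
        // extract() for each referenced file, keep the biggest image...

        chm_close(chm);
        return 0;
    }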
I have a PDF file containing layers.
For example, on some pages there are graphs, with additional data displayed on top of the graph when you click (layers).
Now I need to fetch all these layers out of the PDF file; to be precise, I need ALL the data from that PDF file, including the layers. The PDF file contains JavaScript to show/hide the layers when appropriate.
What is the best approach? Is there any tool that actually works for my purposes? Or should I write something myself (if that is even possible)?
Edit:
Here you can download the PDF file:
http://www.2shared.com/document/IutUfDfr/OR_erasmus.html
The password for viewing is: erasmus
I do not know of any tools per se, but if you cannot find one, you might do the following:
For each combination of on/off layers that you are interested in, walk all pages and collect the content streams. Tokenize those and cut out the content you do not want to see (the operators you need to monitor to determine this are BDC and EMC). Save the stream again with the clipped content (naturally, save each result to a different file). You need something that can read the PDF object structure and update some objects (there are lots of libraries for that), plus you need to be able to parse the content streams.
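For reference, a layered section in a content stream looks roughly like this (the name /oc1 is an example); everything between a BDC and its matching EMC belongs to the optional content group:

    % /oc1 resolves, through the page's /Resources /Properties dictionary,
    % to an optional content group (the "layer"):
    /OC /oc1 BDC
      ...operators that are drawn only when this layer is visible...
    EMC

Keep in mind that marked-content sections can nest, so the tokenizer has to maintain a nesting count rather than simply scanning for the next EMC.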
Now you will have a set of PDF files without layers (optional content), for which there are plenty of tools to render to HTML etc.
Note: the mapping between optional content and the layer switches in a PDF viewer is usually 1:1, but the standard supports a full n:m mapping. I would concentrate on the real optional content blocks that can be turned on and off, to keep things simple.
You can use this tool to extract images and text, even from locked PDFs:
http://download.cnet.com/Able2Extract/3000-2079_4-10249654.html
I use it myself sometimes, and it also has the ability to convert to HTML.
What are the alternatives for processing Illustrator files or PDFs into XAML? My current workflow works like this:
Open the PDF file in Adobe illustrator
Save the file as an .ai (Adobe Illustrator) file
Open in Expression Design
Do some processing, mainly separating elements to layers and removing unneeded parts.
Save as XAML
Add XAML to Blend project
My only problem is that this way the text gets converted to paths. I would like to keep my text in XAML as well instead of paths.
Is there any other way to do this, so I keep the text? Any other tools?
I think what you want is to have Glyphs elements instead of Paths.
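For reference, a Glyphs element looks something like this (all attribute values here are invented for the example):

    <Glyphs FontUri="Resources/Fonts/subset0.odttf"
            FontRenderingEmSize="12"
            OriginX="72" OriginY="100"
            UnicodeString="Hello"
            Indices="43;72;79;79;82"
            Fill="#FF000000" />

If a converter drops or garbles UnicodeString, all that remains are the glyph indices in Indices, which only have meaning relative to the referenced font file.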
The problem is that Glyphs elements require you to specify the URI of the font file. Also, Glyphs elements reference glyphs by their index into a font file (it may happen that a converter that generates Glyphs elements - like the Microsoft XPS Document Writer - uses indices into font subset files: so these indices may not be the right indices to the same glyphs as defined in the original font file). I have been able to "solve" this problem in two ways with my own PDF to XAML conversion tools.
Approach 1: embed the font subset file, Base64-encoded, in the generated XAML code, and have the application implement a class that, upon loading, extracts and decodes the embedded font subset to a temporary location and hands a valid URI to that temporary file back to the XAML loader.
Or, approach 2: have most font files already installed along with my application and, again, add some support in my application that replaces the font name with a URI to the installed font file when the XAML is loaded. The problem with this second approach is that the glyph indices need to be correctly mapped to the installed font file, which may not be all that trivial to do. (You can find a link to an example file that has been generated for this way of loading on my blog: in particular, take a peek at the file truncatedcone-xaml.txt.)
In short: both solutions require a special PDF to XAML converter and support by the loading application. The reason I wanted to do it this way instead of just having my PDFs converted to Paths only is that my application is a shared whiteboard: thus I want my vector graphics to be as small as possible. (Conversion to paths tends to blow up the XAML code by a factor of 10 or more in most cases).
I am contemplating a third approach: generate the outline for every glyph that is used (only once per glyph), and then add support in my application for transforming and positioning these glyph outlines in a way closely analogous to what the Glyphs elements that would otherwise be generated do. The advantage would be that the generated XAML is still relatively small (comparable to the second approach described above), without requiring the relevant font files to be installed along with the application and without having to map glyph indices from a subset file to the installed font file. The reason I have not yet tried to implement this in earnest is twofold: first, my current (second) approach already works very well for what I need; second, there might be performance problems with this third approach as regards loading and/or rendering.
There's a (free) Adobe Illustrator plugin to export to XAML. Not sure it does exactly what you are looking for, though.
Find it at http://www.mikeswanson.com/XAMLExport/
Well, an XPS file is actually a ZIP file, so if you open it with a ZIP archiver, or rename its extension to .zip, you can see what is inside. It already contains the pages as XAML code (those files have the form [pagenumber].fpage). However, that XAML code may refer to other files included in the ZIP archive, like raster images and font subset files (typically .odttf files, basically encrypted TrueType files). This means the XAML code you find in an XPS document may not be directly usable as pure XAML in your application. I have written Python scripts to convert XAML taken from XPS documents (generated by the Microsoft XPS Document Writer) into XAML files that my application can load (see approaches 1 and 2 above). I could send you copies of those Python scripts (they are not particularly great code, which is no problem for me since I am now using a different approach to convert PDFs to XAML anyway).
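As an illustration, listing such an XPS file shows entries along these lines (the exact layout varies between producers, and the font file name is a placeholder):

    FixedDocumentSequence.fdseq
    Documents/1/FixedDocument.fdoc
    Documents/1/Pages/1.fpage
    Documents/1/Pages/2.fpage
    Resources/Fonts/<some-guid>.odttf
    ...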
#gyurisc: Keeping the font file should work, but keeping the text might turn out to be a problem because, you see, glyphs are not characters. It might be possible to figure out the character by examining the font file that a given glyph belongs to, but that would involve parsing the font file. If you are unlucky (and this is very likely), your PDF to XPS converter does not even keep enough information in the font subset files to figure out which character a given glyph represents.
For example: If I convert a PDF file to XPS with the help of Microsoft's XPS Document Writer, and then try to select a piece of text from that XPS document, I can (only apparently) copy it to the clipboard. However, if I then paste it back into a Word document, I get garbage. Whereas if I select that same piece of text in the original PDF document and paste it into the same Word document, I get reasonably meaningful text. So Microsoft's XPS Document Writer apparently does not care about the interpretation of a "glyph run" as text, and thus it seems very likely to me that the link between the glyph indices that one finds in the generated XPS code and the characters they are meant to represent is already broken at that point. (But, admittedly, that's just a guess.)
A representation of text (as opposed to a run of glyphs) would be a TextBlock element in XAML, I suppose. However, my guess is that a typical PDF to XPS converter is unlikely to generate TextBlock elements. XPS is mainly meant to be rendered - on screen or on paper - it doesn't suggest itself as a file format that is particularly suitable for data exchange (exchange of text in your case).
I need to programmatically embed an existing PDF (a small graphic) onto a specific page of an existing PDF. Using iTextSharp I've been able to add a new page containing this embedded PDF, but what I need is to modify an existing page by adding this graphic. Is this possible using iTextSharp or any other PDF-generation library?
I tend to do this sort of thing using ConTeXt, which is a TeX-based layout tool integrated into the pdfTeX TeX/MetaPost engine. There's a learning curve involved, and installing ConTeXt isn't entirely trivial, but once you get the hang of it, it makes very general programmatic document processing involving PDFs easy.
For this problem, you'd define two overlays: the first overlay is the main PDF, which you set as a background; then, on the page you want to change, you define a foreground overlay with a \setlayer command, containing a single \framed box that superimposes the second PDF using an \externalfigure command.
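A rough sketch of what that could look like in ConTeXt; the layer name, coordinates, and file names are placeholders, and the details will need adjusting:

    \definelayer[stamp]
    \setupbackgrounds[page][background=stamp]

    \starttext
    % ...the pages of the main PDF...

    % On the page you want to change:
    \setlayer[stamp][x=2cm,y=3cm]
      {\framed[frame=off]{\externalfigure[graphic.pdf][width=3cm]}}
    \stoptext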
The nice thing about ConTeXt for this kind of task is that it works with PDF as its internal representation all the way through, so there is no unexpected blow-up in file size or deterioration in image quality, which you can get with other tools that convert between formats.