How do I extract vector graphics from a pdf document? - pdf

I want to make a tool that extracts vector graphics from a PDF file with the help of a human, e.g. a person opens the PDF document in the tool and then selects the objects they want to save as a vector drawing. Are there any tools out there already doing this, or any libraries that can be used to write my own tool? The language of the library can be (in decreasing order of preference) C#, VB.NET, Python, or C/C++.

Perhaps this is a tedious way, but if you print it using the XPS Document Writer, the vector graphics should be there as WPF XAML that you can use. The output document is just a zip archive containing the different document elements.
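Since the XPS output is a zip archive, you can get at the page markup with any zip library. Here is a minimal Python sketch, assuming the pages are stored as `.fpage` parts (the usual layout is something like `Documents/1/Pages/1.fpage`, but treat that path as an assumption about your particular file). The function names are just illustrative.

```python
import io
import zipfile


def list_fixed_pages(xps_source):
    """Return the names of the page parts (*.fpage) inside an XPS package.

    xps_source may be a file path or a seekable file-like object.
    """
    with zipfile.ZipFile(xps_source) as package:
        return sorted(
            name for name in package.namelist()
            if name.lower().endswith(".fpage")
        )


def read_part(xps_source, part_name):
    """Return the raw bytes of one part; the page markup is XML."""
    with zipfile.ZipFile(xps_source) as package:
        return package.read(part_name)


if __name__ == "__main__":
    # Build a tiny stand-in package in memory so the sketch runs without a real XPS file.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as package:
        package.writestr("Documents/1/Pages/1.fpage", "<FixedPage/>")
    for name in list_fixed_pages(demo):
        print(name, len(read_part(demo, name)), "bytes")
```

The path geometry in each page part lives in `Path` elements, which is what you would show to the user for selection.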

Related

Embedding PDF graphics in PDF output file programmatically

I am looking for a rough overview of how one would go about embedding graphics (coming from a PDF file) into another PDF file when writing a C++ document processor.
Background: I work on the LilyPond music typesetter, and recently added Cairo output to the system. Now I would like to support adding externally provided graphics to the PDF files that we generate (e.g. adding a logo onto a laid-out page). This is trivial with EPS for PS output.
I can see how you could hook up Poppler to read the PDF and render the PDF contents onto a Cairo surface, but I wonder if there is a simpler shortcut (e.g. embed the PDF file as a binary stream, and then point directly to that stream).
If you need to go via an external route, like reading the PDF and writing it into an existing PDF using Cairo, that would be simpler. To do it manually:
A PDF page consists of a stream of operators for drawing it, and a dictionary of external resources (fonts, images etc.). To stamp one PDF page onto another, you would need to:
a) Find all objects for external resources in the stamp which are needed, and add them to the destination PDF.
b) Convert the page to a "Form XObject", which is a sort of reusable piece of content. Add this to the /XObject entry in the destination page's resource dictionary, making sure to pick a fresh name.
c) Add some operators to the page content in the destination page to invoke the new XObject.
To see how this might work, you could play with -stamp-as-xobject and -postpend-content "/XObjName Do" from section 8.4 of the cpdf manual.
Making this work for arbitrary PDFs is really not for the faint of heart, I'm afraid.
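To give a feel for steps (b) and (c), here is a rough Python sketch that only builds the text involved: it picks an unused XObject name and emits the content-stream operators that place the form. It does not read or write PDF files, and the helper names and the `Stamp` prefix are made up for illustration.

```python
def fmt(value):
    """Format a number for a content stream (plain decimals, no exponent)."""
    text = f"{value:.4f}".rstrip("0").rstrip(".")
    return text or "0"


def fresh_xobject_name(existing_names, prefix="Stamp"):
    """Pick a name not already present in the page's /XObject dictionary."""
    n = 0
    while f"{prefix}{n}" in existing_names:
        n += 1
    return f"{prefix}{n}"


def invoke_xobject(name, tx=0.0, ty=0.0, scale=1.0):
    """Operators that draw the form XObject, scaled and translated.

    q/Q save and restore the graphics state so the cm matrix does not
    leak into whatever is drawn afterwards.
    """
    return (
        f"q {fmt(scale)} 0 0 {fmt(scale)} {fmt(tx)} {fmt(ty)} cm "
        f"/{name} Do Q\n"
    )


if __name__ == "__main__":
    name = fresh_xobject_name({"Im1", "Stamp0"})
    print(invoke_xobject(name, tx=72, ty=36, scale=0.5), end="")
```

The returned string is what you would append to the destination page's content stream once the form itself has been added under that name.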

Creating a searchable PDF from one already existing PDF and text (with coordinates)

My Situation:
I have an existing PDF with only images
I have a preprocessed OCR with all text identified and their respective coordinates
An application running in C#
I can use other programming languages if needed
My Question:
-> Do you know a way to create a searchable PDF using those existing resources (images and text with their coordinates)? <-
I've done a lot of research, but most of the results only show how to create a searchable PDF using some library (iText, PDFsharp, etc.) together with its built-in OCR engine. That is not my case: I already have the text and coordinates.
Thank you for any help and thought you can provide me.
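One commonly described approach, offered here only as a sketch, is to draw the OCR text over the image in invisible text rendering mode (`3 Tr`), so it can be searched and selected but is not painted. The Python below just generates the content-stream operators. It assumes OCR boxes are given as (text, left, top, height) in points with a top-left origin, that a font resource named `F1` exists on the page, and that the text is plain ASCII; all of those are assumptions, not something stated in the question.

```python
def fmt(value):
    """Format a number for a content stream (plain decimals, no exponent)."""
    text = f"{value:.4f}".rstrip("0").rstrip(".")
    return text or "0"


def escape_pdf_string(text):
    """Escape the characters that are special inside a (...) PDF string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, page_height, font_name="F1"):
    """Build operators that place each word invisibly at its OCR position.

    PDF user space has its origin at the bottom-left, so the top-left
    OCR coordinate is flipped to get the baseline position.
    """
    lines = ["BT", "3 Tr"]  # text render mode 3: neither fill nor stroke
    for text, left, top, height in words:
        baseline = page_height - top - height
        lines.append(f"/{font_name} {fmt(height)} Tf")
        lines.append(f"1 0 0 1 {fmt(left)} {fmt(baseline)} Tm")
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(invisible_text_ops([("Invoice (copy)", 72, 100, 12)], page_height=792), end="")
```

A PDF library would still be needed to append this stream to each page and register the font; the sketch leaves out matching the word width to the OCR box as well.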

How do I use an existing PDF as a container?

I have a PDF (created from Word) for a game I wrote for an old 8-bit computer, and I'd like to embed the code for that game (binary, less than 32k) into that PDF. This way, my emulator can load the program by reading the PDF, and the two can be stored and shared in one file.
You could call this a form of steganography.
I know a PDF has a tree structure and uses ASCII to define its components; is there a way to add inert, "orphan" elements that won't cause problems for PDF readers? I think that would be the easiest way to do it. But I'm not sure how to do it.
The simplest solution would be to use a document attachment or a file attachment annotation.
Most PDF tools that are available support these features as they are pretty basic.

How to make semi transparent layers in PDF printable on Adobe Acrobat?

I am using an online PDF generator to generate the attached PDF.
While the PDF opens and looks OK in Adobe Acrobat (I tested several different versions, including Reader and Pro), the transparent layers are printed as white boxes when sent to a printer (either a real printer or another PDF printer like PDFill PDF&Image Writer).
Any idea what's wrong with the transparent layers and how to fix them?
This is the file: https://dl.dropboxusercontent.com/u/18517313/flyer.pdf
There doesn't seem to be anything wrong with the file to me, and it appears to print correctly for me from Adobe Acrobat. How are you printing the file?
One workaround would be to open the file in Acrobat Pro and use the Flattening Preview (to be found among the Print Production Tools) to flatten the transparencies.
When you print a PDF (or any other format) from an application, several subsystems are involved. The application (such as Adobe Reader) makes calls to the graphics subsystem of the OS (such as GDI on Windows). The OS, in turn, passes these calls to the printer driver, which is responsible for converting them (draw line, fill path, etc.) into instructions that are understood by the printer you selected. These instructions are referred to as the page description language, or PDL. Examples of PDLs are PostScript and PCL. This abstraction is good because applications no longer need to ship their own printer drivers. The downside is that the API of the graphics subsystem and the PDL both put restrictions on the richness of your graphics.
Transparency is a typical feature that is present in PDF but has only limited support in PostScript. To achieve the same visual result, the feature is approximated. In the case of transparency this is called flattening, as Max Wyss points out.
By the way, applications (such as Adobe Acrobat) may choose to bypass the OS and driver and generate the PDL themselves. This is referred to as pass-through printing. Although this circumvents the limitations of the graphics subsystem, the output is still bound to the PDL of your printer.
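To make flattening a bit more concrete: in the simplest case (Normal blend mode over an opaque backdrop), a semi-transparent fill can be replaced by an opaque fill whose color is the alpha-weighted mix of the two. The Python sketch below shows only that arithmetic; real flatteners also have to split overlapping regions and may rasterize parts of the page, and the function name is made up.

```python
def flatten_over(source_rgb, alpha, backdrop_rgb):
    """Opaque color equivalent to drawing source_rgb at the given alpha
    over an opaque backdrop_rgb (components in the range 0..1)."""
    return tuple(
        alpha * s + (1.0 - alpha) * b
        for s, b in zip(source_rgb, backdrop_rgb)
    )


if __name__ == "__main__":
    # 50% black over white paper flattens to mid grey.
    print(flatten_over((0.0, 0.0, 0.0), 0.5, (1.0, 1.0, 1.0)))
```

White boxes on paper suggest that somewhere in the print path the transparent object was painted opaque instead of being mixed like this.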

EMF GDI hDC to vector PDF

I have a bunch of base EMF files that I play on a graphics surface, then use GDI to merge text onto the surface with DrawString to create single-page report forms.
The graphics object is then either sent to the printer or saved as a PNG and wrapped in a PDF (iTextSharp).
I'm looking for a way to keep the PDF vector based, but I can't find any open source way of getting direct access to the DC to be able to draw a metafile image.
My current PDFs are around 800k per page, whereas if I print the same image to a PDF printer (Amyuni) it's 23k. The only product that I've found is PDFTron, which creates a 200k vector-based file directly without printing, but it is way too expensive because of all its features.
Do any graphics experts have suggestions for an easy way to put a metafile directly into a PDF?
Thanks
Mike