My Situation:
I have an existing PDF with only images
I have a preprocessed OCR with all text identified and their respective coordinates
An application running in C#
I can use other programming languages if needed
My Question:
-> Do you know a way to create a searchable PDF from those existing resources (images, and text with its coordinates)? <-
I have done a lot of research, but most of the results only show how to create a searchable PDF with a library (iText, PDFsharp, etc.) that runs its own built-in OCR engine. That is not my case: I already have the text and its coordinates.
Thank you for any help and thoughts you can provide.
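One possible approach, sketched under stated assumptions: a searchable PDF usually keeps the scanned image visible and draws the OCR text on top of it in text rendering mode 3 (neither filled nor stroked), so the text is selectable but invisible. The sketch below only builds the content-stream operators for that text layer. The OCR tuple layout (text, left, top, height in pixels, top-left origin), the 300 DPI default and the font resource name F1 are all assumptions, and a PDF library would still be needed to attach the stream and the font to each page. Literal strings like this only work for simple Latin text; other scripts need an embedded font with a suitable encoding.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, page_height_pt, dpi=300, font_name="F1"):
    """Build content-stream operators that place each OCR word invisibly.

    words: iterable of (text, left_px, top_px, height_px) with a top-left
    origin. This layout is an assumption about the OCR results.
    """
    scale = 72.0 / dpi  # default PDF user space has 72 units per inch
    lines = ["BT", "3 Tr"]  # text rendering mode 3: neither fill nor stroke
    for text, left_px, top_px, height_px in words:
        size = height_px * scale
        x = left_px * scale
        # Flip the y axis: the PDF origin is the bottom-left corner.
        # The bottom of the OCR box is used as an approximate baseline.
        y = page_height_pt - (top_px + height_px) * scale
        lines.append(f"/{font_name} {size:.2f} Tf")
        lines.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")
    return "\n".join(lines)


if __name__ == "__main__":
    # One word, 1 inch from the left and 0.5 inch from the top of a Letter page.
    print(invisible_text_ops([("Hello (world)", 300, 150, 30)], page_height_pt=792))
```

The resulting stream would be appended to the page's existing content so that the image is drawn first and the invisible text sits over it.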
I am trying to laser cut multiple signs, and I need to combine PDFs into a single PDF for import into AutoCAD. The signs are all the same shape, but each one needs different text/images placed inside its frame.
I have experience with python, and I am open to learning a new tool/software to get this done in the easiest manner possible. I would love any guidance or advice on this project.
Here is a very basic picture of how I would like the final PDF to be.
The PDF toolkit (pdftk) can merge PDFs and change pagination and/or put multiple pages onto a single page.
https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
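Since the question mentions Python, here is a minimal sketch of driving pdftk from a script. It uses pdftk's `cat` operation, which concatenates the input files into one multi-page PDF; it does not arrange several signs on a single page. The file names are hypothetical, and the script does nothing if pdftk is not installed.

```python
import shutil
import subprocess


def pdftk_merge_command(inputs, output):
    """Return the argument list for: pdftk in1.pdf in2.pdf ... cat output out.pdf"""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return ["pdftk", *inputs, "cat", "output", output]


def merge_pdfs(inputs, output):
    """Run the merge. Returns False if pdftk is not on PATH, True on success."""
    if shutil.which("pdftk") is None:
        return False
    # check=True raises CalledProcessError if pdftk reports a failure.
    subprocess.run(pdftk_merge_command(inputs, output), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, one PDF per sign.
    print(pdftk_merge_command(["sign1.pdf", "sign2.pdf"], "all_signs.pdf"))
```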
I have an SVG file of a bar plot that I need to convert to a PDF. The bar plot was made in matplotlib, saved as a PDF and imported into Inkscape. I used Inkscape to add annotations to the figure and then export it back to a PDF to be used in a final document.
This is what the PDF file looks like going into Inkscape:
After adding text elsewhere on the figure and saving as a PDF, I get the same plot with these white lines:
These are not typical PDF rendering artifacts; a closer inspection shows that they have a gradient to them.
I think this is somehow a product of the SVG file. I have used an online SVG-to-PDF converter and the lines are still present. Additionally, I use this method to make all my figures (Matplotlib to Inkscape to PDF), and I have not had this issue with any other figure.
I've found that Inkscape does this when you import a bar graph whose shading type does not match any of the preset Inkscape patterns. I've seen this exact issue when importing graphs from the R programming language and from Excel, so I don't think it's specific to Matplotlib. I don't know the root cause, but since I run into this problem a lot, I'll share the workarounds I typically use. None is necessarily better than the others; which one I use depends on the situation.
Option 1) Convert the PDF to a .png bitmap image in some other program (GIMP, Photoshop, PowerPoint, ...), then embed the image in Inkscape. Make your changes, then export from Inkscape as a PDF. The disadvantage is that the graph will no longer be a vector graphic. Use option 2 or 3 to keep it as vectors.
Option 2) Import the PDF into Inkscape, ungroup the PDF object, delete the striped fill in the bar graph, then recreate the fill using an Inkscape-made fill. In the worst cases I've actually made custom bar graph patterns in Inkscape to exactly match the pattern I had before. This process is a pain.
Option 3) Create shapes that cover the artifacts, remove the border lines from the shapes, and use the eyedropper to make them exactly the same color as the good parts.
As I said, these workarounds don't come from a rigorous understanding of the problem that would let you avoid it, but I hope they help you accomplish your task.
I am working on PDFs using Leaves. I'm unable to figure out how to make annotations. I haven't used Quartz 2D much and would like some direction.
Adding write annotation support is hard.
Quartz 2D won't help you there.
You need to parse the PDF manually (e.g. with NSScanner) and build up the XRef tree of all the PDF objects. Then you write a new trailer that replaces the /Page object and attaches all the new annotation data. It's quite hard to get right, and the 2000-page PDF reference is not very helpful there. I worked the better part of a year on proper annotation support (Highlight, Underscore, Strikeout, Ink, Note, ...).
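To give a feel for the very first step of that parsing, here is a minimal sketch in Python (not the Objective-C/NSScanner code the answer refers to). A PDF file ends with the keyword `startxref`, the byte offset of the most recent cross-reference section, and `%%EOF`; locating that offset is where reading the XRef data begins. The function name is hypothetical, and the sketch deliberately stops there: it does not read the xref entries, follow /Prev links, or handle cross-reference streams.

```python
import re


def find_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset stored after the last `startxref` keyword."""
    tail = pdf_bytes[-1024:]  # the marker sits at the very end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near the end of the file")
    return int(matches[-1])


if __name__ == "__main__":
    # Tiny synthetic file: the header is 9 bytes, so the xref table starts at 9.
    sample = (b"%PDF-1.4\n"
              b"xref\n0 1\n0000000000 65535 f \n"
              b"trailer\n<< /Size 1 >>\n"
              b"startxref\n9\n%%EOF\n")
    offset = find_startxref(sample)
    print(offset, sample[offset:offset + 4])
```

Writing annotations would then mean appending the changed objects, a new cross-reference section and a new trailer after the existing bytes, which is the hard part the answer describes.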
And when you want highlight annotations, you also want text selection (else the user would have to free-draw a highlight - not a nice experience.) Getting the correct frames for the text glyphs for all PDF font types is another level of horror; in PDF there's no notion of a word or a column. Just single glyphs. The rest is algorithms and guessing.
I even spoke with some Apple engineers about how they did it [text selection, annotations], and they told me a three-person team worked about three years on their implementation.
Is there a library/tool which would list all colours used in a PDF document?
I'm sure Acrobat itself would do this but I would like an alternative (ideally something that could be scripted).
So the idea is that if you have a very simple PDF document with four colours in it, the output might say:
RGB(100,0,0)
RGB(105,0,0)
CMYK(0,0,0,1)
CMYK(1,1,1,1)
You could explore the insides with pdfbox, but you would have to write some code to find and catalog all those colors.
Most PDF tools have access to this information internally but expose no API for it. You could take any such tool and add one yourself.
Apago PDFspy generates an XML file containing all kinds of metadata extracted from PDF files. It reports color usage including spot colors.
We recently added a function called GetPageColorSpaces(0) to the Quick PDF Library (www.quickpdflibrary.com) that retrieves much of the ColorSpace info used in the document.
Here is some sample output.
Resource,"QuickPDFCS2eb0f578",Separation,"HKS 52 E",DeviceCMYK,0.95,0,0.55,0
Resource,"QuickPDFCSb7b05308",Separation,"Black",DeviceCMYK,0,0,0,1
Resource,"QuickPDFCSd9f10810",Separation,"Pantone 117 C",DeviceCMYK,0,0.18,1,0.15
Resource,"QuickPDFCS9314518c",Separation,"All",DeviceCMYK,0,1,0,0.5
Resource,"QuickPDFCS333d463d",Separation,"noplate",DeviceCMYK,1,0,0,0
Resource,"QuickPDFCSb41cafc4",Separation,"noprint",DeviceCMYK,0,1,0,0
Resource,"Cs10",DeviceN,Black,Colorant,-1,-1,-1,-1
Resource,"Cs10",DeviceN,P1495,Colorant,-1,-1,-1,-1
Resource,"Cs10",DeviceN,CalRGB,Colorant,-1,-1,-1,-1
Resource,"Cs10",Separation,"P1495",DeviceCMYK,0,0.31,0.69,0
XObject,"R29",Image,,DeviceRGB,-1,-1,-1,-1
Disclaimer: I work at Atalasoft.
Our product, DotImage with the PDF Reader add-on, can do this. The easiest way is to rasterize the page and then just use any of our image analysis tools to get the colors.
This example shows how to do it if you want to group similar colors. The deployed example only works for PNG and JPEG, but if you download the code, it's trivial to include the add-on and get PDF support as well (let me know if you need help).
Source here:
http://www.atalasoft.com/cs/blogs/31appsin31days/archive/2008/05/30/color-scheme-generator.aspx
Run it here:
http://www.atalasoft.com/31apps/ColorSchemeGenerator
If you are working with specific and simple PDF documents from a constrained source, then you may be able to find the colors by reading through the content stream. However, this cannot be a generic solution.
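For the constrained case, a minimal sketch of such a scan might look like the following. It assumes you already have a page's content stream decoded to text (the PDF library used for that is not shown) and only recognises the device color operators `g`/`G` (gray), `rg`/`RG` (RGB) and `k`/`K` (CMYK). It splits on whitespace, so it can be fooled by operators appearing inside text strings, and it ignores named color spaces, shadings, images and nested form XObjects. The function name is hypothetical.

```python
# Operand counts for the device color operators
# (lowercase sets the fill color, uppercase the stroke color).
_COLOR_OPS = {
    "g": ("Gray", 1), "G": ("Gray", 1),
    "rg": ("RGB", 3), "RG": ("RGB", 3),
    "k": ("CMYK", 4), "K": ("CMYK", 4),
}


def device_colors(content_stream: str):
    """Collect the distinct device colors set in a decoded content stream."""
    tokens = content_stream.split()
    found = set()
    for i, tok in enumerate(tokens):
        if tok not in _COLOR_OPS:
            continue
        space, n = _COLOR_OPS[tok]
        operands = tokens[max(0, i - n):i]
        try:
            values = tuple(float(t) for t in operands)
        except ValueError:
            continue  # operands were not plain numbers; skip this hit
        if len(values) == n:
            found.add((space, values))
    return sorted(found)


if __name__ == "__main__":
    stream = "1 0 0 rg 10 10 50 50 re f 0 0 0 1 k 0 0 0 1 K 0.5 G"
    for space, values in device_colors(stream):
        print(f"{space}({','.join(str(v) for v in values)})")
```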
For example PDF documents can contain gradients or transparency. If your document contains this type of construct then you are likely to end up with a wide range of colors rather than a specific set.
Similarly, many PDF documents contain bitmapped images. Because these need to be interpolated for display at different resolutions, the set of colors in a displayed PDF may be larger than or different from (though obviously broadly similar to) that of the embedded bitmap.
Likewise, many PDF documents contain constructs in multiple color spaces that are rendered into different color spaces. For example, a PDF might contain a DeviceRGB bitmap, a line in an ICC-based CMYK color and a Lab-based rectangle. The displayed version might be in sRGB for display or CMYK for print. Each of these will influence the precise set of colors.
So the only 100% valid answer is going to be related to a particular render of a PDF at a particular resolution to a particular color space. From the resultant bitmap you can determine the colors that have been used.
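Once a page has been rendered, cataloguing the colors is the easy part. The sketch below assumes the renderer (whichever library you use; that step is not shown) hands you the pixels as an iterable of color tuples, and the function names are hypothetical. It counts each distinct tuple and prints them in the style of the output the question asks for.

```python
from collections import Counter


def color_histogram(pixels):
    """Count how often each color tuple occurs in a rendered page."""
    return Counter(tuple(p) for p in pixels)


def format_colors(pixels, space="RGB"):
    """List the distinct colors, most frequent first, e.g. 'RGB(100,0,0)'."""
    counts = color_histogram(pixels)
    return [
        f"{space}({','.join(str(c) for c in color)})"
        for color, _ in counts.most_common()
    ]


if __name__ == "__main__":
    # Three pixels standing in for a rendered bitmap.
    rendered = [(100, 0, 0), (100, 0, 0), (105, 0, 0)]
    for line in format_colors(rendered):
        print(line)
```

On a real render with anti-aliasing, gradients or interpolated images, expect far more distinct values than the document nominally "uses", which is the point made above.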
There are a variety of PDF libraries that will do this type of render including DotImage (referenced in another answer) and ABCpdf .NET (on which I work).