PDF Colo(u)r Analysis (without Acrobat itself ?) - pdf

Is there a library/tool which would list all colours used in a PDF document ?
I'm sure Acrobat itself would do this but I would like an alternative (ideally something that could be scripted).
So the idea is if you have a very simple PDF document with four colours in it the output might say :
RGB(100,0,0)
RGB(105,0,0)
CMYK(0,0,0,1)
CMYK(1,1,1,1)

You could explore the insides with pdfbox, but you would have to write some code to find and catalog all those colors.

Most PDF tools have access to this information but no api to access it. You could take any tool and add it in

Apago PDFspy generates an XML file containing all kinds of metadata extracted from PDF files. It reports color usage including spot colors.

We recently added a function called GetPageColorSpaces(0) to the Quick PDF Library - www.quickpdflibrary.com to retrieve much of the ColorSpace info used in the document.
Here is some sample output.
Resource,\"QuickPDFCS2eb0f578\",Separation,\"HKS 52 E\",DeviceCMYK,0.95,0,0.55,0
Resource,\"QuickPDFCSb7b05308\",Separation,\"Black\",DeviceCMYK,0,0,0,1
Resource,\"QuickPDFCSd9f10810\",Separation,\"Pantone 117 C\",DeviceCMYK,0,0.18,1,0.15
Resource,\"QuickPDFCS9314518c\",Separation,\"All\",DeviceCMYK,0,1,0,0.5
Resource,\"QuickPDFCS333d463d\",Separation,\"noplate\",DeviceCMYK,1,0,0,0
Resource,\"QuickPDFCSb41cafc4\",Separation,\"noprint\",DeviceCMYK,0,1,0,0
Resource,\"Cs10\",DeviceN,Black,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",DeviceN,P1495,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",DeviceN,CalRGB,Colorant,-1,-1,-1,-1
Resource,\"Cs10\",Separation,\"P1495\",DeviceCMYK,0,0.31,0.69,0
XObject,\"R29\",Image,,DeviceRGB,-1,-1,-1,-1

Disclaimer: I work at Atalasoft.
Our product, DotImage with the PDF Reader add-on, can do this. The easiest way is to rasterize the page and then just use any of our image analysis tools to get the colors.
This example shows how to do it if you want to group similar colors -- the deployed example will only work for PNG and JPEG, but if you download the code, it's trivial to include the add-on and get PDF as well (let me know if you need help)
Source here:
http://www.atalasoft.com/cs/blogs/31appsin31days/archive/2008/05/30/color-scheme-generator.aspx
Run it here:
http://www.atalasoft.com/31apps/ColorSchemeGenerator

If you are working with specific and simple PDF documents from a constrained source then you may be able to find the colors by reading through the content stream. However this cannot be a generic solution.
For example PDF documents can contain gradients or transparency. If your document contains this type of construct then you are likely to end up with a wide range of colors rather than a specific set.
Similarly many PDF documents contain bitmapped images. Given that these will need to be interpolated to be displayed at different resolutions, the set of colors in a displayed PDF may be bigger or different to (though obviously broadly similar to) the embedded bitmap.
Similarly many PDF documents contain constructs in multiple color spaces that are rendered into different color spaces. For example a PDF might contain a DeviceRGB bitmap, a line in an ICC based CMYK color and a Lab based rectangle. The displayed version might be in sRGB for display or CMYK for print. Each of these will influence the precise set of colors.
So the only 100% valid answer is going to be related to a particular render of a PDF at a particular resolution to a particular color space. From the resultant bitmap you can determine the colors that have been used.
There are a variety of PDF libraries that will do this type of render including DotImage (referenced in another answer) and ABCpdf .NET (on which I work).

Related

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Who wants to continue my work can have a look on my source-files https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF Files. It's most important for me that the files will be usable for a very long time and if possible the OCR should be automatically applied for new files (which seems to be not really possible with Adobe Acrobat...).
For that I've been looking for different solutions how to OCR my PDF Files. I found three candidates which seems to be doing what they should do... (more or less). But all three variants have their pro&cons... But there seem to be different approaches how to store data in PDF Files.... for all three Variants... Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
a File OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default adobe preflight-script as it does not display any additional layers:
But far as I was able to reproduce these Files seems to have a Background-Text-Layer, which contains the OCRed Text, which is the underlying layer for the Image that is shown to the user at the end. Unfortunately this seems to be loaded separately and this is confusing while opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You correctly have found out that the approach of Abby Finereader is different from that of Adobe Acrobat and of Tesseract:
Abby creates a page content stream in which first the text is drawn normally on the page and eventually covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font GlyphLessFont it embeds a font program for into the result file. When rendered the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.

Batch check Adobe Acrobat .pdf's for files containing rotated text

Does anybody know if there is a way to check whether a list of Adobe Acrobat .pdf files contain rotated text (any text not at 0 degrees)?
I thought this would be simple, but I'm struggling to find an answer.
I am using ABBYY Recognition Server to OCR thousands of files and the results are quite poor where the text is rotated. I need to get a list of files that have rotated text to allow me to perform some pre-processing on them.
I usually use iTextSharp for .pdf automation and modification but don't seem to be able to find anything for checking text rotation.
Thanks
You could achieve your goal by extracting all words from these PDFs and checking if any of the words is rotated.
I would recommend you to use a PDF library higher level abilities for the task. Docotic.Pdf library is a good choice (of course, I am one of the developers of the library).
Here is an articles that shows how to extract words from PDFs with extra info about their position etc.
Each extracted word comes in PdfTextData object. The PdfTextData contains IsTransformed property to check if word is rotated, scaled, and / or flipped. You can also analyze PdfTextData.TransformationMatrix for more information about the transformation.

Extract font and its corresponding cmap in PDF

I am tried several ways to extract font from pdf viz. fontforge, mupdf, pdfparser in C# and also some pythone script. But am just confusing about get exact pair of a font and its cmap embeded in pdf. Please direct me the right approach by which i will get exact pairs of fonts and its cmaps.
As mentioned in my first comment, that should be easy using iText or iTextSharp or any other such library that allows you to access low-level PDF objects.
In case of iText(Sharp), ListUsedFonts.java and ListUsedFonts.cs can present starting points for you; they inspect all the font dictionaries in a PDF file accessible via at least one page. Instead of the simple output of those examples, simply export all the information you need. For this, ISO 32000-1:2008 should be your reference guide.

Processing Illustrator or pdf files into XAML

What are the alternatives to process illustrator files or PDFs into XAML. My Current workflow works like this:
Open the PDF file in Adobe illustrator
Save the file as .ai (Adobe Illustrator) file
Open in Expression Design
Do some processing, mainly separating elements to layers and removing unneeded parts.
Save as XAML
Add XAML to Blend project
My only problem is that this way the text gets converted to paths. I would like to keep my text in XAML as well instead of paths.
Is there any other way to do this, so I keep the text? Any other tools?
I think what you want is to have Glyphs elements instead of Paths.
The problem is that Glyphs elements require you to specify the URI of the font file. Also, Glyphs elements reference glyphs by their index into a font file (it may happen that a converter that generates Glyphs elements - like the Microsoft XPS Document Writer - uses indices into font subset files: so these indices may not be the right indices to the same glyphs as defined in the original font file). I have been able to "solve" this problem in two ways with my own PDF to XAML conversion tools.
1. approach: Embed the font-subset file, BASE64 coded, in the generated XAML code and have the application implement a class that, upon loading, extracts and decodes an embedded font-subset file to a temporary location and hands a valid URI to that temporary file back to the XAML loader.
or, 2. approach: Have most font files already installed along with my application and, again, adding some support by my application that replaces the font name by an URI to the installed font file upon loading of the XAML code. The problem with this second approach is that glyph indices need to be correctly mapped to the installed font file, which may not be all that trivial to do. (You can find a link to an example file that has been generated for this way of loading on my blog: in particular take a peek at the file truncatedcone-xaml.txt)
In short: both solutions require a special PDF to XAML converter and support by the loading application. The reason I wanted to do it this way instead of just having my PDFs converted to Paths only is that my application is a shared whiteboard: thus I want my vector graphics to be as small as possible. (Conversion to paths tends to blow up the XAML code by a factor of 10 or more in most cases).
I am contemplating the implementation of a third approach: this would consist in generating the outline for every glyph that is used only once and then add support by my application to transform and position these glyph outlines in a way closely analogous to what Glyphs elements do that would otherwise have to be generated. The advantage would be that the generated XAML would still be relatively small (comparable to the second approach described above) without requiring the relevant font files to be installed along with the application and without having to map glyph indices from a subset file to the installed font file. The reason I have not yet tried to implement this in earnest is twofold: first, my current (second) approach already works very well for what I currently need; second, there might be performance problems with this third approach as reagards loading and / or rendering.
There's a (free) Adobe Illustrator plugin to export to XAML. Not sure it does exactly what you are looking for, though.
Find it at http://www.mikeswanson.com/XAMLExport/
Well an XPS file is actually a ZIP file. So if you open it with a ZIP-archiver or if you rename its extension to ZIP you can see what is inside. It already contains the pages as XAML code (those files have the form [pagenumber].fpage). However, that XAML code may refer to other files (like raster images and font subset files, those are typically odttf files - basically encrypted true type files) that are included in that ZIP archive as well. Which means, that the XAML code that you find in an XPS document may not be directly usable as pure XAML in your application. I have written python scripts to do the conversion of XAML taken from XPS documents (generated by the Microsoft XPS Document Writer) to get XAML files that my application can load (see approaches 1 and 2 above). I could send you copies of those python scripts (they are not particularly great code, which is no problem for me since I am now using a different approach to convert PDFs to XAML anyway).
#gyurisc: Keeping the font file should work but keeping the text might turn out to be a problem, because, you see, glyphs are not characters. It might be that you could figure out the character by examining the font file that a given glyph is part of, but that would involve parsing the font file. If you are unlucky, your PDF to XPS converter does even not keep enough information in the font subset files to figure out the character a given glyph (very likely) represents.
For example: If I convert a PDF file to XPS with the help of Microsoft's XPS Document Writer, and then try to select a piece of text from that XPS document, I can (only apparently) copy it to the clipboard. However, if I then paste it back into a Word document, I get garbage. Whereas if I select that same piece of text in the original PDF document and paste it into the same Word document, I get reasonably meaningful text. So Microsoft's XPS Document Writer apparently does not care about the interpretation of a "glyph run" as text, and thus it seems very likely to me that the link between the glyph indices that one finds in the generated XPS code and the characters they are meant to represent is already broken at that point. (But, admittedly, that's just a guess.)
A representation of text (as opposed to a run of glyphs) would be a TextBlock element in XAML, I suppose. However, my guess is that a typical PDF to XPS converter is unlikely to generate TextBlock elements. XPS is mainly meant to be rendered - on screen or on paper - it doesn't suggest itself as a file format that is particularly suitable for data exchange (exchange of text in your case).

Best way to generate a custom document?

I am working on generating a document for printing. It should use a specific TTF font and everything must be printed with vector graphics (for quality). Some of the text should be replaced automatically (e.g. current time). Also it should include a custom-generated EPS image with a chart.
Ideally I would like to have some kind of document template where the text could be replaced easily, and it would be nice if it could import the image through path. But I am not sure which format could be good for this. Best I can come to think of is LaTeX, but I don't like that it's a lot of manual work to use it with TTF... any other ideas?
By the way, I am using OS X...
Memoir package is very flexible for your special layouts.
Xetex uses your system fonts (Installed together with TexLive).
You could blend most of those elements to an EPS using imagemagick or gimp script-fu
There are several products out there that will build you a PDF programmatically. I've only used the Coldfusion Report Builder myself and that may not be practical/affordable for your application. If your budget allows I'd look into a commercial reporting product. I know Adobe have several that will generate Flash, FlashPaper or PDF output.