When converting html into a PDF with wkhtmltopdf it seems that the text gets rendered to curves with the default options instead of getting a text-based PDF.
As a consequence it's not possible to select the text in the PDF (as it is a bunch of curves ressembling text) as well as having rendering problems (instead of delegating the rendering of the font to the PDF viewer).
Additional info
There's much more context in here:
https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2999
Questions
Q1) How can I tell wkhtmltopdf to render the document by placing text instead of converting text to curves?
Q2) How can I ensure that wkhtmltopdf embeds the needed fonts inside the document just in case the destination machine does not have it?
Related
I can easily convert a pdf to an odt file using:
soffice --infilter="writer_pdf_import" --convert-to odt a.pdf
But when I try to do:
soffice --infilter="writer_pdf_import" --convert-to odg a.pdf
I get an error:
no export filter
TL;DR the answer is at the bottom but do read the following as to why there can be issues
ODG is a multi-part graphics file usually a blank template, often similar to an ORA, however there are many ways they can be structured and converted TO a set of PDF page printouts, as they contain thumbnails, plus one or more high resolution images or scalable vector graphics. Common variants can be used with Inkscape, Krita possibly Scribus / OODraw and other more GRAPHIC apps.
PDF is a page document output format thus not a suitable candidate for converting to professional images with scalar graphics. *Except see the last comment
ODG or ORA may be done well in image conversion but the reverse is not usually true.
Open Office Graphic is like a DocX, a zip wrapper around a core object, here it is a Jpeg but could be PNG SVG etc.
However the contents of the zip are not simple potentially running to thousands of lines of coding. Thus you need to use a more appropriate method to hand build an ODG not simple command line conversion from cruder PDF.
The real strength of a EXPORT from draw as PDF is the hybrid use of embedding ODFG content thus opening such a PDF you can edit it in Draw.
And it will look just as good in any PDF viewer. However it is too specialist to be simply translated without the app settings. In reality the PDF is the chimera/polyglot ODG.
But if you wish to try with simple files the command line is for a.pdf to a.odg
soffice --infilter="draw_pdf_import" --convert-to odg a.pdf
// EDIT 26.03.2018 - Who wants to continue my work can have a look on my source-files https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF Files. It's most important for me that the files will be usable for a very long time and if possible the OCR should be automatically applied for new files (which seems to be not really possible with Adobe Acrobat...).
For that I've been looking for different solutions how to OCR my PDF Files. I found three candidates which seems to be doing what they should do... (more or less). But all three variants have their pro&cons... But there seem to be different approaches how to store data in PDF Files.... for all three Variants... Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
a File OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default adobe preflight-script as it does not display any additional layers:
But far as I was able to reproduce these Files seems to have a Background-Text-Layer, which contains the OCRed Text, which is the underlying layer for the Image that is shown to the user at the end. Unfortunately this seems to be loaded separately and this is confusing while opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You correctly have found out that the approach of Abby Finereader is different from that of Adobe Acrobat and of Tesseract:
Abby creates a page content stream in which first the text is drawn normally on the page and eventually covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font GlyphLessFont it embeds a font program for into the result file. When rendered the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.
How can I easily crop a PDF page in a given PDF file? I prefer using as little coding as possible, and guess border geometries as little as possible...
There are several options:
Crop by point-and-click using a GUI front-end:
pdf-quench
krop
briss
PDF scissors
Crop by using the command line:
pdfcrop command (provided by texlive-extra-utils), using the following arguments: pdfcrop --margins '-30 -30 -250 -150' --clip input.pdf output.pdf (-left -top -right -bottom format).
PDFCrop
convert -crop command (provided by imagemagick)
Ghostscript
Crop by writing your own script:
Python
LaTeX
For quick, GUI-aided PDF cropping tasks, try pdfarranger (available in Debian repos, formerly known as PDF-Shuffler).
For precise point-and-click cropping, one option is to use LibreOffice Draw.
The instructions below assume you want to crop part of a single-page PDF:
Start with a blank document
Select the Insert > Image... menu
Navigate to the PDF you wish to crop
The contents of the PDF will show up as an image
Right-click on the PDF content in your document and select the "Crop" menu item.
Use the handles to resize the viewable area of the PDF to the section you want to remain after cropping
Click outside of the PDF to disable the crop handles
Click again on the PDF content to position it however you want by:
Dragging it around the page
Using the arrow keys to move it
Use the Draw positioning tools to align or center the PDF content.
When you're happy with the result, save, export it to PDF, or print it.
For multi-page PDFs, You'll have to work page by page by first splitting the PDF into multiple pages using some other tool like PDF Arranger (or simply "Printing to PDF" each page of the PDF you want to crop in your PDF viewer), cropping them one by one with Draw, then recombining them into a single PDF (using PDF Arranger again).
You could try using the pdfCropMargins Python program (https://pypi.org/project/pdfCropMargins/) with the -pg option to select the particular page. The command-line program offers many options, and also has an optional GUI.
You can use Inkscape to losslessly crop PDFs. This uses Inkscape's built-in SVG-PDF conversion.
Open your file in Inkscape: File -> Open -> select your file -> Open
Resize PDF:
Using user-input values: File -> Document properties -> Page -> Custom size
Using auto resize to content: File -> Document properties -> Page -> Custom size -> Resize page to content... -> set desired margin -> Resize page to drawing or selection
Inkscape is a particularly good option as often PDF crop utilities (such as krop, mentioned in other answers) do not change the actual size of the object, instead adjusting how much of the object (e.g. an A4 page) is displayed.
E.g. from krop homepage:
Unfortunately, there is no simple way to eliminate
unnecessary/invisible parts of a PDF file. krop only adjusts which
parts of a PDF are displayed; the original content is still there in
the file and will, for instance, show up when editing the file in
inkscape
Editing directly in Inkscape does exactly what this says is impossible.
The list of tools provided by #sparkler was interesting, but did not help me very much.
Some of the tools provided, actually cropped my pages, but usually they involved some conversion to an image which made pdf files blurry and hard to read.
In the end I used podofocrop of PoDoFo tools which was able to retain all the graphics at full resolution and the text as real text.
It will crop all pages to the minimal size (i.e. without a border).
The command is: podofocrop input.pdf output.pdf
To install on MacOS use brew install podofo
I would like to have an application where a user views an image of a document in TIFF Format.
If the words "foo" and "bar" appear on the page. And a selection is made on the image that only contains "foo", then I would like to only select the word "foo".
Is there a format that lends itself to storing both the location of text and the text of an image?
Since you know about searchable PDF, and it perfectly implements what you are suggesting, I assume that there is some reason why you can't use it. If not, you should use PDF -- the format supports mixed-content and overlaying them. All of the viewers that your users are likely to have will understand what to do with text beneath the image.
The TIFF format does not support this directly, but if you are making the viewer, and it only needs to work there, then you could try to store the text and positions in a custom tag.
Then your viewer would need to read this tag, interpret mouse positions, and look up the text that is being selected on the image. No other viewer would support your text tag, but they would show the TIFF.
For either of these mechanisms, you will need OCR and a way to encode the data you get either into PDF or the custom TIFF tag. For open source OCR, take a look at Tesseract from Google.
Disclaimer: I work at Atalasoft. Our imaging SDK, DotImage, has add-ons for OCR that can make searchable PDF, and can add and edit TIFF tags.
What are the alternatives to process illustrator files or PDFs into XAML. My Current workflow works like this:
Open the PDF file in Adobe illustrator
Save the file as .ai (Adobe Illustrator) file
Open in Expression Design
Do some processing, mainly separating elements to layers and removing unneeded parts.
Save as XAML
Add XAML to Blend project
My only problem is that this way the text gets converted to paths. I would like to keep my text in XAML as well instead of paths.
Is there any other way to do this, so I keep the text? Any other tools?
I think what you want is to have Glyphs elements instead of Paths.
The problem is that Glyphs elements require you to specify the URI of the font file. Also, Glyphs elements reference glyphs by their index into a font file (it may happen that a converter that generates Glyphs elements - like the Microsoft XPS Document Writer - uses indices into font subset files: so these indices may not be the right indices to the same glyphs as defined in the original font file). I have been able to "solve" this problem in two ways with my own PDF to XAML conversion tools.
1. approach: Embed the font-subset file, BASE64 coded, in the generated XAML code and have the application implement a class that, upon loading, extracts and decodes an embedded font-subset file to a temporary location and hands a valid URI to that temporary file back to the XAML loader.
or, 2. approach: Have most font files already installed along with my application and, again, adding some support by my application that replaces the font name by an URI to the installed font file upon loading of the XAML code. The problem with this second approach is that glyph indices need to be correctly mapped to the installed font file, which may not be all that trivial to do. (You can find a link to an example file that has been generated for this way of loading on my blog: in particular take a peek at the file truncatedcone-xaml.txt)
In short: both solutions require a special PDF to XAML converter and support by the loading application. The reason I wanted to do it this way instead of just having my PDFs converted to Paths only is that my application is a shared whiteboard: thus I want my vector graphics to be as small as possible. (Conversion to paths tends to blow up the XAML code by a factor of 10 or more in most cases).
I am contemplating the implementation of a third approach: this would consist in generating the outline for every glyph that is used only once and then add support by my application to transform and position these glyph outlines in a way closely analogous to what Glyphs elements do that would otherwise have to be generated. The advantage would be that the generated XAML would still be relatively small (comparable to the second approach described above) without requiring the relevant font files to be installed along with the application and without having to map glyph indices from a subset file to the installed font file. The reason I have not yet tried to implement this in earnest is twofold: first, my current (second) approach already works very well for what I currently need; second, there might be performance problems with this third approach as reagards loading and / or rendering.
There's a (free) Adobe Illustrator plugin to export to XAML. Not sure it does exactly what you are looking for, though.
Find it at http://www.mikeswanson.com/XAMLExport/
Well an XPS file is actually a ZIP file. So if you open it with a ZIP-archiver or if you rename its extension to ZIP you can see what is inside. It already contains the pages as XAML code (those files have the form [pagenumber].fpage). However, that XAML code may refer to other files (like raster images and font subset files, those are typically odttf files - basically encrypted true type files) that are included in that ZIP archive as well. Which means, that the XAML code that you find in an XPS document may not be directly usable as pure XAML in your application. I have written python scripts to do the conversion of XAML taken from XPS documents (generated by the Microsoft XPS Document Writer) to get XAML files that my application can load (see approaches 1 and 2 above). I could send you copies of those python scripts (they are not particularly great code, which is no problem for me since I am now using a different approach to convert PDFs to XAML anyway).
#gyurisc: Keeping the font file should work but keeping the text might turn out to be a problem, because, you see, glyphs are not characters. It might be that you could figure out the character by examining the font file that a given glyph is part of, but that would involve parsing the font file. If you are unlucky, your PDF to XPS converter does even not keep enough information in the font subset files to figure out the character a given glyph (very likely) represents.
For example: If I convert a PDF file to XPS with the help of Microsoft's XPS Document Writer, and then try to select a piece of text from that XPS document, I can (only apparently) copy it to the clipboard. However, if I then paste it back into a Word document, I get garbage. Whereas if I select that same piece of text in the original PDF document and paste it into the same Word document, I get reasonably meaningful text. So Microsoft's XPS Document Writer apparently does not care about the interpretation of a "glyph run" as text, and thus it seems very likely to me that the link between the glyph indices that one finds in the generated XPS code and the characters they are meant to represent is already broken at that point. (But, admittedly, that's just a guess.)
A representation of text (as opposed to a run of glyphs) would be a TextBlock element in XAML, I suppose. However, my guess is that a typical PDF to XPS converter is unlikely to generate TextBlock elements. XPS is mainly meant to be rendered - on screen or on paper - it doesn't suggest itself as a file format that is particularly suitable for data exchange (exchange of text in your case).