OpenType Layout tables used in font ArialMT are not implemented in PDFBox - pdfbox

I'm using the CMS Magnolia in one of our projects. In the log files there are many errors like:
OpenType Layout tables used in font ArialMT are not implemented in PDFBox
What impact does this have on a PDF? Can it still be opened? Does it look right, or is it broken in some way?

This is logged at INFO level if you are using the current version (2.0.11). It is only relevant if you use PDFBox to create PDFs: it means that certain advanced font features (the GDEF, GSUB, and GPOS tables) are not (yet) supported. You'll need these for certain languages, e.g. Thai, Arabic, or Indic languages. They are also used for ligatures in Latin scripts (fl, fi, ffl, ffi).
Some work on this topic is being done in PDFBOX-4189, but there is still a lot to do.

As for Magnolia, PDFBox is used either for indexing PDF documents or for generating PDF previews. For the first use case the message is completely irrelevant; for the second it might mean that the preview is not as accurate as it could be. Nothing major either way. You can reconfigure log4j to stop seeing this message.
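For example, an entry along these lines in log4j.properties raises the threshold for the relevant loggers (the package names below are the standard PDFBox/FontBox ones, but check your own log output to confirm which logger actually emits the message):

# hypothetical log4j.properties snippet: hide INFO messages from PDFBox/FontBox font handling
log4j.logger.org.apache.pdfbox=WARN
log4j.logger.org.apache.fontbox=WARN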

Related

Creating ODT and PDF files as end result

I've been working on an app to create various document formats for a while now, and I've had limited success.
Ideally, I'd like to dynamically create a fairly simple ODT/PDF/DOC file. I've been focusing my efforts on ODT, because it is editable, and open enough that there are several tools which will convert it to any of the other formats I need.
The problem is that the ODT XML files are NOT simple, and there aren't any good-quality APIs I could find (especially in Python). So far, I've had the most success creating a template ODT file and then manipulating the DOM in Python as needed. This works OK in general, but it is quickly becoming inadequate and requires too much tweaking every single time I need to alter one of the templates.
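To illustrate that template approach, a minimal sketch (file names are hypothetical; an ODT file is just a ZIP archive whose body lives in content.xml):

import zipfile
import xml.dom.minidom as minidom

# read the document body out of the template
with zipfile.ZipFile("template.odt") as z:
    dom = minidom.parseString(z.read("content.xml"))

# ... manipulate the DOM here (replace placeholder text, add list items, etc.) ...

# write a new ODT, swapping in the modified content.xml and copying everything else
with zipfile.ZipFile("template.odt") as src, zipfile.ZipFile("output.odt", "w") as dst:
    for name in src.namelist():
        data = dom.toxml(encoding="UTF-8") if name == "content.xml" else src.read(name)
        dst.writestr(name, data)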
The requirements are:
1) Produce a simple document that will have lists, paragraphs, and the ability to draw simple graphics on the page (boxes, circles, etc...)
2) The ability to specify page size, and the different formats should generally print the exact same output when sent to a printer
My questions:
1) Are there any other ways I can produce ODT/PDF/DOC files?
2) Would LaTeX be acceptable? I've never really used it, does anyone have experience converting LaTeX files into other formats?
3) Would it be possible to use HTML? There are a lot of converters online. Technically you can specify dimensions in mm/cm, etc..., but I am worried that the printed output will differ between browsers/converters....
Any other ideas?
Have you tried pandoc? I've been using it with good success for converting different formats into each other. Why reinvent the wheel?
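For example (assuming pandoc is installed, plus a LaTeX engine for the PDF case; the file names are made up), a single source can be converted to several of the formats you mention:

pandoc report.md -o report.odt
pandoc report.md -o report.docx
pandoc report.md -o report.pdf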
I suppose that to be successful, you'd have to define how you want to input everything. Why don't you just use OpenOffice? It will save to ODT (duh...), PDF, and HTML (though it's not clean HTML, it's actually quite ugly).
In my recent experience, I've had success going from LaTeX -> XHTML via LaTeXML (I had to compile it from source). LaTeX is seeming more and more like a terminal format: it's great for PDF, but once you need some flexibility, it kind of fails. I should also note that there is no LaTeX -> DVI step in my workflow, so I can't comment on things like tex4ht that read out of a DVI file (I have too many graphics that don't work with DVI to switch them now).
Shortly I'll be moving everything into DocBook 4.5. I like the docbook-utils package, which supports LaTeX, HTML, and I even saw a converter to ODT. DocBook is super-heavy on the markup, which is annoying, but it will provide me with the flexibility I need going forward.
Since you're using Python, have you considered just using reStructuredText?
I've also really enjoyed publishing from Emacs' Org mode, which is a super lightweight markup that exports to a bunch of different formats.
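For reference, docutils ships command-line front ends for reStructuredText, e.g. (script names may be installed without the .py suffix depending on your distribution):

rst2odt.py document.rst document.odt
rst2html.py document.rst document.html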

Create print-ready PDF/X (with bleedbox, trimbox, mediabox, etc.) programmatically?

I was wondering if it is possible to programmatically create a PDF file with an acceptable quality for the production press, ideally using only open-source libraries.
Right now the process is like this:
-create texts and images
-merge them into a postscript file
-use Acrobat Distiller to convert it to PDF (Acrobat distiller helps you check all the parameters of the PDF)
-send the PDF to the press
What I want is something like:
-take all texts and pictures in this folder
-encode them into the press-ready PDF, something similar to what Distiller produces
-send them to the press
How would you do that?
Many thanks...
Are Ghostscript's gsdll32.dll and gswin32c.exe/gswin64c.exe, with their source code and the GPLv3, enough (or too much) of open source? They ship as part of all recent releases (the newest one currently being v8.71).
Ghostscript can create very good quality PDF. See here for the most recent documentation about its PDF/A and PDF/X support.
Note that until very recently this documentation was a bit misleading: it failed to mention the requirement to edit and adapt the PDFA_def.ps or PDFX_def.ps templates it refers to. If you followed the old documentation without editing the templates to specifically point to the ICC color profile you wanted to embed, your output would be valid PDF, but it would not pass all checks testing for compliance with the official PDF/A and PDF/X standards.
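For illustration, a PDF/X-3 conversion with Ghostscript looks roughly like this (the exact options vary between releases, so check the documentation that ships with your version, and edit PDFX_def.ps to point at your ICC output profile first):

gs -dPDFX -dBATCH -dNOPAUSE -dNOOUTERSAVE \
   -sDEVICE=pdfwrite -sProcessColorModel=DeviceCMYK \
   -sOutputFile=press-ready.pdf PDFX_def.ps input.ps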
You can generate PDFs using, for example, TeXML and XeLaTeX (the first one makes scripting easier, since TeX has lots of quirks in its syntax).
I also tried OpenJade and its DocBook support, but the quality was lower. TeX seems to do typesetting much better.
Both approaches use standalone programs, which you can run from shell scripts or call using system facilities.
You didn't mention which version of Distiller you're using. Recent versions do have a setting that lets you generate (different versions of) PDF/X. See also the *.joboptions files which ship with Distiller.

How to open PDF and read it?

How can I open a PDF file and read some of its contents with Python (this language is preferred, but Ruby, Perl, or PHP are fine too), in case the content is recognizable as text (not just an image), or report that it's impossible without OCR? TIA
Update: thanks for the solutions, I'm sure some of them will suit me fine.
@RichH, I have a PDF file and don't know whether it is image- or text-based. I'm looking for a tool to help me find that out and, in case it's text-based, extract some of its contents.
For Perl, check out these modules:
PDF::API2
CAM::PDF
Parsing PDF and making something useful out of it is hard, as the format is focused on preserving layout: text can be stored in a way that positions each letter individually, and depending on the font the text might even be stored as graphics.
Libraries to read PDFs that I know of include the Zend Framework, whose PDF component includes a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which produces quite usable results and offers bindings for several languages.
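In Python, a minimal sketch with the pdfminer.six library (not mentioned above, so treat it as just one option; assuming it is installed, and the path is hypothetical) that extracts text and reports when OCR would be needed:

from pdfminer.high_level import extract_text

text = extract_text("document.pdf")
if text.strip():
    print(text[:500])  # first part of the extracted text
else:
    print("No extractable text found; the PDF is probably scanned images and needs OCR.")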

Is there a reliable way to determine if a PDF was generated from a Powerpoint file?

Like the title says. Reason I ask is that we're converting PDFs to formatted ASCII text (using pdftotext) and only want to display the ones that look reasonably sane.
PPT files tend to have text over images, diagonal text and others things that don't translate to ASCII very well, so we'd like to filter them out if we can.
The creating application of a PDF is listed in its XMP metadata. You can see this quite easily in Acrobat 9 (and I believe earlier): go to File > Properties, click Additional Metadata..., then go to Advanced and it's listed under both XMP Core Properties and PDF Properties:
xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint
I'm guessing you want to find this programmatically, so you'll need to find a library that can read this metadata and works with your language. Here is a list of some XMP tools.
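As a sketch in Python (using the pypdf library, which is not one of the XMP tools referred to above; the file name is made up), checking the document information dictionary, which normally carries the same creator tool as the XMP packet, could look like this:

from pypdf import PdfReader

reader = PdfReader("slides.pdf")
info = reader.metadata  # the /Info dictionary: creator, producer, etc.
creator = (info.creator or "") if info else ""
producer = (info.producer or "") if info else ""
if "PowerPoint" in creator or "PowerPoint" in producer:
    print("Looks like it was generated from PowerPoint")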
Short answer:
No, I don't think so.
Long answer:
No, I don't think so, because there are many ways to convert a PowerPoint file to PDF, for example Adobe Acrobat, PDFCreator, and many others. It's up to the converters to embed specific information in the PDF file; even if you find a way to detect PowerPoint-sourced PDFs from one converter, the same method may not work for another.
Even longer answer:
No, I don't think so, because of the reasons described in the "long answer". And I don't think detecting the source of the PDF is the best approach to the problem you are trying to solve. PowerPoint is not the only thing that produces overlapping text and images. I think it's much better to detect the actual layout of the PDF file. If images and text overlap, then you do some filtering or pre-processing to cater for that.
Your reasoning is very arbitrary - there are surely plenty of PPT files without the features you describe, and plenty of PDF files with them, that were generated from another source.
In theory a better method would just be to detect when these "unwanted" situations occur. However, even though the PDF format is partly open (only for reading, apparently, so it's not truly an open format), extracting complex data like that would be incredibly difficult.
All PDFs can have this problem regardless of their source. Most desktop publishing suites are capable of outputting PDF and are often sold boasting their high quality and flashier PDF presentations ...
A "saner" method would be to use a PDF parser, ITextSharp, or pdfNet...etc, Using the library of your choice, find all image rectangles, and all text rectangles, SORT THE RECTANGLES, and then see if there is substantial overlap of text and image rects -- ignoring image to image overlaps. If so, reject the page and/or document.
That won't be perfect, but at least it's going to catch many PDFs that aren't sane, regardless of source. Other heuristics to add would include color analysis. (i.e. are the colors in the overlapping region sufficiently different to allow "sane" results?)
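A rough sketch of that idea in Python, using pdfminer.six instead of iTextSharp/PDFNet (the file name is made up, and "any intersection counts" is a deliberately crude stand-in for the "substantial overlap" test):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTImage, LTFigure

def overlaps(a, b):
    # bboxes are (x0, y0, x1, y1); true if the two rectangles intersect
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def image_boxes(container):
    # LTImage objects are usually nested inside LTFigure containers
    for obj in container:
        if isinstance(obj, LTImage):
            yield obj.bbox
        elif isinstance(obj, LTFigure):
            yield from image_boxes(obj)

for page in extract_pages("candidate.pdf"):
    texts = [obj.bbox for obj in page if isinstance(obj, LTTextContainer)]
    images = list(image_boxes(page))
    if any(overlaps(t, i) for t in texts for i in images):
        print("text overlaps an image on this page; consider rejecting it")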
Best of luck to you
It might put its name in the creator or producer info, but I don't have a copy to check this theory with.
In general, it is not an easy task to programmatically determine (reliably) where a file came from or how it was generated based on its contents. After all, a file is just a collection of bits.
Unless you have a lot of resources to expend building the heuristics to determine whether a file looks "reasonably sane" according to your needs, I would consider this a task for human beings.
Some converters from PPT to PDF preserve the creator in comments at the beginning of the PDF.
I think that PDFs generated by most applications look much the same. The file may have some metadata that you can read...

What are the relative merits of pdflatex?

Not sure this is a programming question, but we use LaTeX for all our API documentation and user documentation, so I hope it will go through.
Can someone please explain the relative merits of using pdflatex as opposed to the "classic" technique of
latex foo
dvips -Ppdf foo
ps2pdf foo.ps
From time to time I run into people who have difficulty because things don't work in pdflatex, and I know that using pdflatex gives up two things I have grown to value:
Can't use the very speedy xdvi viewer
Can't use the PStricks package
I should add that I typically get PDF with hyperlinks by using something on the order of
\usepackage[ps2pdf,colorlinks=true]{hyperref}
so it's not necessary to use pdflatex to get good PDF.
So
What are the advantages of pdflatex that I don't know about?
What are the disadvantages of the old tools that I've overlooked?
My favorite pdflatex feature is the microtype package, which is available only when using pdflatex to go directly to PDF, and really produces stunning results with no effort on my part. Apart from that, the only caveats I run into are image formats:
pdflatex supports PDF, PNG, and JPG images.
the postscript drivers support (at least) EPS.
Also, if you want to install fonts, the procedures are slightly different depending on what fonts that driver supports. (Hint: use XeTeX to instantly enable OpenType fonts.)
As it turns out, I recently read a post that shows the difference directly. Any document that uses tables or narrow columns will be improved automatically. I also find the inter-word spacing to be far more pleasing with pdflatex.
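For reference, enabling it requires nothing more than a line in the preamble (minimal sketch; the package does the rest automatically under pdflatex):

\documentclass{article}
\usepackage{microtype} % character protrusion and font expansion
\begin{document}
Narrow columns and tables benefit the most; no further markup is needed.
\end{document}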
Is xdvi much faster than xpdf? I find the edit, TeX, view cycle to be very quick with pdflatex.
Have you tried MetaPost or MetaFun for graphics? I tend to put graphics creation in the hands of the capable, but MetaFun would likely be the package I'd use. Just reading the manuals is a pleasure.
Also, pdftex is the engine that is being actively developed (towards luatex) and maintained; I'm not sure the DVI counterparts are as actively maintained.
PSTricks has largely been supplanted by TikZ.
I haven't used xdvi in years, so pardon the trollish rhetorical questions: Does xdvi display vector fonts? Does it support SyncTeX (jumping to and from the source)? Does it have the comfort of use of PDF readers like Skim?
Taco Hoekwater is working on Escrito, a Postscript interpreter written in Lua, which would allow you to use pstricks in Luatex. He has an impressive project completion record: maybe I should have used "will" rather than "would" in the previous sentence.
I used pdflatex to generate the PDF for my ICFP 2009 paper. (I still needed to use standard latex to generate the PostScript file.) I did so for two reasons:
I couldn't seem to get ps2pdf to generate Letter, rather than A4 output, no matter what command line options I used.
For the printers, I needed to produce a version 1.3 PDF file, not 1.4. pdflatex made this easy to do. I set the PDF author and title information while I was at it.
Both of these problems may be fixable in some way, but as a first-time latex user, I didn't find any obvious solutions, nor did more experienced users whom I'd asked.
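For anyone who hits the same issues, both are at least configurable nowadays (behaviour depends on your Ghostscript and pdfTeX versions, so treat these as hints rather than a recipe): ps2pdf accepts paper-size and compatibility-level switches, and reasonably recent pdflatex has a primitive for the PDF minor version.

ps2pdf -sPAPERSIZE=letter -dCompatibilityLevel=1.3 foo.ps foo.pdf

% in the pdflatex preamble, to emit PDF 1.3:
\pdfminorversion=3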