Gluing (Imposition) PDF documents - pdf

I have several A4 PDF documents which I would like (two into one) "glue" together into A3 format PDF document. So I will get from 2PDFs A4 a single one sided PDF A3.
I have found the excellent utility PDFToolkit and some others but none of them can be used to "glue" side by side two documents.

I just came across a nice tool on superuser.com called PDFjam that can do all of the above in a single command:
pdfjam --nup 2x1 file1.pdf file2.pdf --outfile DONESKI.pdf
It has other standard features like page size plus a nice syntax for more sophisticated collations of pages (the tricky page re-ordering necessary for true booklet-style page imposition).
It's built on top of TeX which is, whatever it is. Installing is a breeze on Ubuntu: you can just apt-get install pdfjam. On Mac OS, I recommend getting BasicTeX (google "mactex basictex"; SO thinks I'm a spammer and won't let me post the link).
This is a lot easier and more maintanable than installing both pdftk and Multivalent (on both Mac OS for dev and Ubuntu for deploy), which wasn't going so well for me anyway...!

Found the following (free and open-source) tool for doing Imposition called Impose (thanks danio for the tip). This solved my problem perfectly.
EDIT:
Here is how it's done:
Use PDF Toolkit to joint two PDF files into one (two A4)
pdftk File1.pdf File2.pdf cat output OutputFile.pdf
Create from this a single page (one A3):
java -cp Multivalent.jar tool.pdf.Impose -dim 2x1 -verbose -paper-size "42.2x29.9cm" -layout "1,2" OutputFile.pdf

I would like to advertise my pdftools
It's written in Python so should run on any platform. It's a wrapper to Latex (the pdfpages packages) but can do lot of things with a single command line: merge pdf files, nup them (multiple input pages per output page) and number the pages of the output file (you specify the location and the format of the number)
It still needs some work but I think it's quite stable to be usable right now :)

This puts two landscape letter pages onto a single portrait letter sheet, to be "bound" (i.e., folded) along the top.
pdftops $1 - |
psbook |
pstops -w11in -h8.5in '4:1#.65(.5in,0in)+0#.65(.5in,5.5in),2U#.65(8in,5.5in)+3#.65U(8in,11in)' |
ps2pdf - $(basename $1 .pdf).psbook.pdf
By the way, I do this often, so I'll probably submit more "answers" to this question just to keep track of successful pstops pagespecs. Let me know if this is an inappropriate use of SO.

A nice, powerful, open-source imposition tool is included
in the PoDoFo package:
http://podofo.sourceforge.net/
It works for me. Some imposition plans can be found at:
http://www.av8n.com/computer/prepress/
PoDoFo can do lots of other stuff, not just imposition.
Another useful imposition tool is Bookbinder (on the
quantumelephant site). It has a GUI that appeals to non-experts.
It is not as flexible or powerful as PoDoFo, but it can do
imposition.
pdftk is more-or-less essential to have, but it will not
do imposition.
pdfjam is useless to me, because there are a wide range of
valid pdf files that it cannot handle.
I've never been able to get multivalent to work, either.

What you want to do is imposition. There are commercial tools to impose PDFs such as ARTS crackerjack and Quite imposing but they are pretty expensive (US$500), require a copy of acrobat professional and are overkill for imposing 2 A4 pages to an A3 sheet.

On the Postscript side, a tool named pstops is able to rearrange pages of a Postscript file in any way you could imagine. I've not heard of such a tool for PDF. But pdf2ps and ps2pdf exist. So a not-so-ideal solution may be a combination of pdf2ps, pstops and ps2pdf.

I would combine the two A4 pages into one 2-page PDF using pdftk. Then Print to PDF using something like PrimoPDF, and tell it to print to A3 format, two pages per side.
I just tested this printing some slides from PowerPoint. It worked great. I selected A3 as my paper size in PowerPoint, and then chose to print 2 pages per side. Printed to Primo and voila, I have two A4 slides per A3.

You can put multiple input pages on one output page using BookletImposer.
And you can change page orders and combine multiple pdf files using PDF Mod.
With these two tools, you can do almost everything you want with pdf files (except editing their content).

I had a similar problem. I tried Impose but it was giving me an
Exception in thread "main" java.lang.NoClassDefFoundError: tool/pdf/Impose
Caused by: java.lang.ClassNotFoundException: tool.pdf.Impose
(...)
Could not find the main class: tool.pdf.Impose. Program will exit.
I then tried PDF Snake which isn't free or open source, but has a completely unrestricted 30-day trial version. It worked perfectly, after tweaking the parameters to achieve what I wanted. It's a great tool. I would definitely buy it if it wasn't so expensive! Anyway, I thought I'd leave my 2 cents in case anyone had the same problem I had with Impose.

look at this
http://sourceforge.net/projects/proposition/
It needs laTex to run,
but when it does, works really fine
Regards

Related

Replace words/phrases in existing PDF or docx with other words

I am trying to make a dynamic PDF generator as an .NET Core API. I want to take an existing PDF, or .docx file, and edit it so it replaces the current name (John Doe) with something that can be replaced like #NAME_PLACEHOLDER.
I then want to transform #NAME_PLACEHOLDER -> John Doe (or whatever is in the KeyValuePair or Dictionary<string, string>).
I am running this on a Docker environment, so I can easily execute commands and I am willing to do that as well.
So far I have tried a few things:
1) pdf2htmlEX
Executes as pdf2htmlEX file.pdf
Does the job pretty well
Can be converted back to PDF using Google Chrome headless or similar
Problem: Only the characters used in the PDF can be used to replace. So if I only use A, B, C as characters, it will make D into Times New Roman (or default font)
2) LibreOffice ODT to PDF
This was pretty nice, because I could simply unzip the .odt file, open content.xml, search and replace, then save it as an .odt file again
Could be converted into PDF rather easily using soffice --convert-to pdf
LibreOffice is quite nice
Problem 1: Microsoft Word -> Save as ODT tends to break the formatting, so we have to use LibreOffice to go and change it back again
Problem 2: We don't want to move away from Microsoft's Office suite
3) HTML to PDF using Chrome Headless
What you see is what you get
By far the best option, if we're all developers aaand have unlimited time
Problem 1: Only our developers can make changes, since our marketing department do not know HTML
Problem 2: Our existing PDFs would have to be rewritten in HTML
As you can see, I have tried a bunch of things. None of them, except Chrome Headless, has lived up to my expectations. What I really like about #3 is what you see is what you get. I can make the whole thing in HTML, press CTRL+P and see what it looks like as a finished PDF, basically.
I am looking for a better solution, though. It can be paid. It can be free. All I need is to change out words/phrases with other words dynamically, which apparently seems like a tough thing to do.
Thanks for specifying what you've already found clearly. It helps a lot providing a succinct answer.
The conversion is always tricky - I'm sure you know Word has trouble displaying/editing some Word documents itself.
I have experience regarding point #2 "LibreOffice ODT to PDF" and can suggest a few things to test:
Don't use Microsoft to do the docx->odt conversion. It's not good as you know. Use LibreOffice itself to do this step. The rest of your process remains the same.
For some documents, Libre Office does doc->odt much better. So, you can instead work with DOC format and get a better result without any other changes.
You won't be able to remove the devs from the process, but you can certainly reduce their role allowing your business/marketing teams to have more direct input simply by:
get the starting point document to the devs to run through the conversion process. The devs can "clean up" the document to make it convert nicely.
make this version of the document the "official" starting point. The business or technical teams can load it, adjust it, and put it back into the process.
if possible, expose a test-platform to the business teams so they can download, adjust, upload and render to PDF. This cycle means they will be able to achieve more and if they're good, do impressive stuff without any dev input.
the above steps simply mean don't expect perfect conversion of arbitrary complex documents. Starting from a (even complex) working baseline is great.
Some of that might show you that your #2 is actually going to get the best overall results.
I hope that helps.

Concatenate PDFs without ruining accessibility or PDF tags

Would appreciate your help with the following:
I have 2 partially accessible PDFs (containing tags), and I want to concatenate them using some command line tool (as PDFtk or Ghostscript, or any Perl module):
I've tried doing this with PDFtk and Ghostscript and both output a non accessible PDF without the original tags (each of the concatenated PDFs had tags).
Do you know of any way to implement this with one of the mentioned tools or some other command line tool for Linux?
(Not necessarily freeware)
Perl modules are also an option.
Thanks!
pdfunite in-1.pdf in-2.pdf in-n.pdf out.pdf
You can read more in a similar question
Solved - new version of iText works (the former, which was the newest when writing the message didn't work- only since 5.4.4 it works).
It's important to mention (was missing in documentation in the past) that
when concatenating documents in tagged mode, you must keep all readers
opened until the resultant document is closed, i.e.:
first:
document.close();
and only after this:
reader.close();

Rule based PDF text extraction for verious bills and invoices

I have to extract text from invoices and bills pdf files
The files layouts can get complex, though its mostly filled with tables.
I've read a few dozens articles already about the pdf format, how easy it is for our brain to grasp it and how hard it is for a machine to understand its structure.
Also downloaded a few tools like the python's pdfminer and some java tools, some even have rule based layout extraction, like LA-PDBtext these are all great libraries, leaving you the final step.
Adobe also has an online service called exportPdf but it can't be customized
Bottom line, I understand that in order to extract text from structured pdf files and convert it to XML for example, there should be some level of manual work.
I also found From Data Extractor, a non free tool with the ability to set extraction rules that claims to do the job, though its hard to find a proper manual and it runs only on windows.
I thought I may even try a to convert those files to images and try tesseract-ocr but decided to ask for advice here before I spend more time on it.
I'll be very grateful if someone with such experience give me a hint.
I've done a lot of PDF extraction and I can confirm as you've already discovered that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order, it could be draw 'word' at coord 20 then draw 'hello' at coord 10. Most PDF creators don't do this but still there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc) the more likely the text is going to be harder to get out. And actually, once a designer starts messing with fonts too much some programs will sometimes actually output words one character at a time, changing the font just slightly each time.
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last OCR one makes me cringe. I just can't imagine going through those hoops would get you better text representation than parsing the PDF. But then again, PDF is a visual format so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
It is not clear if you are looking for the development tool to automate the data extraction from bills and invoices or just for the one time tool (utility) that can be used by the non-developer?
Anyway here are some specialized tools including engines they use:
Tabula (open-source, especially designed to extract data from tables in PDF. Can export shell scripts for batch processing, runs as the localhost web service, powered by JRuby Tabula engine)
Viet OCR (open-source .NET desktop utility for text extraction from PDF and images, based on tesseract oct engine)
Bytescout PDF Viewer (freeware closed source .NET utility, detects and extracts tables, including scanned invoices, powered by PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.

Create print-ready PDF/X (with bleedbox, trimbox, mediabox, etc) programatically?

I was wondering if it is possible to programaticaly create a PDF file with an acceptable quality for the production press, ideally using only open-source libraries.
Right now the process is like this:
-create texts and images
-merge them into a postscript file
-use Acrobat Distiller to convert it to PDF (Acrobat distiller helps you check all the parameters of the PDF)
-send the PDF to the press
What I want is something like:
-take all texts and pictures in this folder
-encode them into the press-ready PDF, something similar to what Distiller produces
-send them to the press
How would you do that?
Many thanks...
Are the Ghostscript's gsdll32.dll and gswin{32,64}.c.exe with their source code and the GPL3 enough (or too much) of Open Source? They ship as part of all recent releases (newest one currently: v8.71).
Ghostscript can create very good quality PDF. See here for the most recent documentation about its PDF/A and PDF/X support.
Note, that this documentation until very recently was a bit misleading: it missed hinting at the requirement to edit+adapt the referred PDFA_def.ps or PDFX_def.ps templates. If you followed the old documentation without editing the templates to specifically point to the ICC color profile you wanted to embed, your output would be valid PDF, but would not pass all checks testing for compliancy with the official PDF/A+PDF/X standards.
You can generate pdfs using f.e. TeXML and XeLaTeX (first one to make scripting easier -- TeX has lots of quirks in syntax).
I also tried OpenJade and its DocBook support, but the quality was lower. TeX seems to do typesetting much better.
Both ways are using standalone programs... which you can use in shell scripts or call using system facilities.
You didn't mention which version of Distiller you're using. Recent versions do have a setting that lets you generate (different verions of) PDF/X. See also the *.joboptions files which ship with Distiller.

What are the relative merits of pdflatex?

Not sure this is a programming question, but we use LaTeX for all our API documentation and user documentation, so I hope it will go through.
Can someone please explain what are the relative merits of using pdflatex as opposed to the "classic" technique of
latex foo
dvips -Ppdf foo
ps2pdf foo.ps
From time to time I run into people who have difficulty because things don't work in pdflatex, and I know that using pdflatex gives up two things I have grown to value:
Can't use the very speedy xdvi viewer
Can't use the PStricks package
I should add that I typically get PDF with hyperlinks by using something on the order of
\usepackage[ps2pdf,colorlinks=true]{hyperref}
so it's not necessary to use pdflatex to get good PDF.
So
What are the advantages of pdflatex that I don't know about?
What are the disadvantages of the old tools that I've overlooked?
My favorite pdflatex feature is the microtype package, which is available only when using pdflatex to go directly to PDF, and really produces stunning results with no effort on my part. Apart from that, the only caveats I run into are image formats:
pdflatex supports PDF, PNG, and JPG images.
the postscript drivers support (at least) EPS.
Also, if you want to install fonts, the procedures are slightly different depending on what fonts that driver supports. (Hint: use XeTeX to instantly enable OpenType fonts.)
As it turns out, I recently read a post that shows the difference directly. Any document that uses tables or narrow columns will be improved automatically. I also find the inter-word spacing to be far more pleasing with pdflatex.
Is xdvi much faster than xpdf? I find the edit, TeX, view cycle to be very quick with pdflatex.
Have you tried MetaPost or MetaFun for graphics? I tend to put graphics creation in the hands of the capable, but MetaFun would likely be the package I'd use. Just reading the manuals is a pleasure.
Also pdftex is the engine under development (towards luatex) and maintenance. I'm not sure the DVI counterparts are as actively maintained.
PStricks is supplanted by Tikz.
I didn't use xdvi in years, so pardon the trollish rhetorical questions: Does xdvi display vector fonts? Does it support synctex (jumping to and from code)? Does it have the confort of use of PDF readers like Skim?
Taco Hoekwater is working on Escrito, a Postscript interpreter written in Lua, which would allow you to use pstricks in Luatex. He has an impressive project completion record: maybe I should have used "will" rather than "would" in the previous sentence.
I used pdflatex to generate the PDF for my ICFP 2009 paper. (I still needed to use standard latex to generate the PostScript file.) I did so for two reasons:
I couldn't seem to get ps2pdf to generate Letter, rather than A4 output, no matter what command line options I used.
For the printers, I needed to produce a version 1.3 PDF file, not 1.4. pdflatex made this easy to do. I set the PDF author and title information while I was at it.
Both of these problems may be fixable in some way, but as a first-time latex user, I didn't find any obvious solutions, nor did more experienced users whom I'd asked.