Concatenate PDFs without ruining accessibility or PDF tags

Concatenate PDFs without ruining accessibility or PDF tags - pdf

Would appreciate your help with the following:
I have 2 partially accessible PDFs (containing tags), and I want to concatenate them using some command line tool (as PDFtk or Ghostscript, or any Perl module):
I've tried doing this with PDFtk and Ghostscript and both output a non accessible PDF without the original tags (each of the concatenated PDFs had tags).
Do you know of any way to implement this with one of the mentioned tools or some other command line tool for Linux?
(Not necessarily freeware)
Perl modules are also an option.
Thanks!

pdfunite in-1.pdf in-2.pdf in-n.pdf out.pdf
You can read more in a similar question

Solved - new version of iText works (the former, which was the newest when writing the message didn't work- only since 5.4.4 it works).
It's important to mention (was missing in documentation in the past) that
when concatenating documents in tagged mode, you must keep all readers
opened until the resultant document is closed, i.e.:
first:
document.close();
and only after this:
reader.close();

Related

GhostScript creating extra page when font errors occur

I have a process that needs to write multiple postscript and pdf files to a single postscript file generated by, and that will continue to be modified by, word interop VB code. Each call to ghostscript results in an extra blank page. I am using GhostScript 9.27.
Since there are several technologies and factors here, I've narrowed it down: the problem can be demonstrated by converting a postscript file to postscript and then to pdf via command line. The problem does not occur going directly from postscript to pdf. Here's an example and an example of the errors.
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=C:\testfont.ps C:\smallexample.ps
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=C:\testfont.pdf C:\testfont.ps
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Didn't find this font on the system!
Substituting font Times-Roman for TimesNewRomanPSMT.
I'm starting with the assumption that the font errors are the cause of the extra page (if only to rule that out, I know it is not certain). Since my ps->pdf test does not exhibit this problem and my ps->ps->pdf does, I'm thinking ghostscript is not writing font data that was in the original postscript file to the one it is creating. I'm looking for a way to preserve/recreate that in the resulting postscript file. Or if that is not possible, I'll need a way to tell ghostscript how to use those fonts. I did not have success attempting to include them as described in the GS documentation here: https://www.ghostscript.com/doc/current/Use.htm#CIDFontSubstitution.
Any help is appreciated.

I've made this an answer, even though I'm aware it doesn't answer the question, becasue it won'f fit as a comment.
I think your assumption that the missing fonts are causing your problem is flawed. Many PDF files do not embed all the fonts they require, I've seen many such examples and they do not emit extra pages.
You haven't been entirely clear in your description of what you are doing. You describe two processes, one going from PostScript to PDF and one going from PostScript on to PostScript (WHY ?) and then to PDF.
You haven't described why you are processing PostScript into a PostScript file.
In particular you haven't supplied an example file to look at. Without that there's no way to tell whether your experience is in fact correct.
For example; its entirely possible that you have set /Duplex true and have an odd number of pages in your file. This will cause an extra blank page (quite properly) to be emitted, because duplexing requires an even number of pages.
The documentation you linked to is for CIDFont substitution, it has nothing to do with Font substitution, CIDFonts and Fonts are different things in PDF and (more particularly) PostScript. But I honestly doubt that is your problem.
I'd suggest that you put (at the least) 'smallexample.ps' somewhere public and post the URL here, that way we can at least follow the same steps you are doing. That way we can probably tell you what's going on. An explanation of why you're doing this would be useful too, I would normally strongly suggest that you don't do extra steps like this; each step carries the risk of degarding the output in some way.

Thank you for the response. I am posting as an answer as well due to the comment length restrictions:
I think you are correct that my assumption about fonts is wrong. I have found the extra page in the second ps file and do not encounter the font errors until the second conversion.
I have a process that uses VB MSWord Interop libraries to print multiple documents to a single ps file using a virtual printer set up with ghostscript and redmon. I am adding functionality to mix in pdf files too. It works, but results in an extra page. To narrow down where the problem actually was, I tried much simpler test cases via command line to identify the problem. I only get the extra page when ghostscript is converting ps to ps (whether or not there is a pdf as well). Converting ps to pdf I do not get the extra page. Interestingly, I can work around the problem by converting the ps to pdf and then both pdfs back to ps. That is a slower and should not be necessary however, so I would like to identify and resolve the extra page issue. I cannot share that particular file. I'll see if I can create an example I can share that also exhibits the problem. In the meantime, I can confirm that the source ps file is six pages and the duplexing settings are as follows. There is duplex definition in the resulting ps file with the extra page. Might there be some other common culprits I could check for in the source ps? Thank you.
featurebegin{
%%BeginFeature: *DuplexUnit NotInstalled
%%EndFeature
}featurecleanup
featurebegin{
%%BeginFeature: *Duplex None
<</Duplex false /Tumble false>> setpagedevice
%%EndFeature
}featurecleanup

ghostPCL: why is this file not converted properly to PDF?

I am using ghostpcl-9.18-win64. This is the script that I used to generate the pdf file:
gpcl6win64-9.18.exe -sDEVICE=pdfwrite -sOutputFile=%1.pdf -dNOPAUSE %1.txt
The file to test can be found here and the result of running ghostpcl can be found here.
If you take a look at the pdf file it contains only a page (there should be 2) and some of the text is missing. Why is that? I always pictured in my mind that ghostpcl would produce a pdf identical with a printout. Am I missing something, parameters perhaps?
As a matter of fact, when I used the lpr command to print the file on RHEL it printed exactly what I expected. I wonder how reliable is the ghostpcl tool in converting pcl files to PDF. And if it's not that reliable, a broader question is: is there another tool to do it? I am interested mainly in the linux version.
The txt file is based on a file generated using SQR.
Thanks

In fact the OP did raise a bug report (but didn't mention it here):
http://bugs.ghostscript.com/show_bug.cgi?id=696509
The opinion of our PCL maintainer is that the output is correct, inasmuch as it matches at least one HP printer. See the URL above for slightly more details.

Based on the discussion on the bug thread, the input file is invalid because it should have had CRLFs instead of LFs only.
If I convert the LFs to CRLFs then my input file is converted as expected to PDF. However, converting LFs to CRLFs is not a general solution. According to support LFs can be used for images. In that case converting such an LF to CRLF could break the image.
It seems that there is one thing I was wrong about on the bug thread, in our system, lpr includes carriage returns as well in the final file that gets sent to the printer. I followed the instructions here: https://wiki.ubuntu.com/DebuggingPrintingProblems, and the instructions in the 'Getting the data which would go to the printer' section to print to a file and the file includes Carriage returns.

Ghostscript skips characters when merging PDFs

I have a problem when using Ghostscript (version 8.71) on Ubuntu to merge PDF files created with wkhtmltopdf.
The problem I experience on random occasions is that some characters get lost in the merge process and replaced by nothing (or space) in the merged PDF. If I look at the original PDF it looks fine but after merge some characters are missing.
Note that one missing character, such as number 9 or the letter a, can be lost in one place in the document but show up fine somewhere else in the document so it is not a problem displaying it or a font issue as such.
The command I am using is:
gs \
-q \
-dNOPAUSE \
-sDEVICE=pdfwrite \
-sOutputFile=/tmp/outputfilename \
-dBATCH \
/var/www/documents/docs/input1.pdf \
/var/www/documents/docs/input2.pdf \
/var/www/documents/docs/input3.pdf
Anyone else that have experienced this, or even better know a solution for it?

I've seen this happening if the names for embedded font subsets are identical, but the real content of these subsets are different (containing different glyph sets).
Check all your input files for the fonts used. Use Poppler's pdffonts utility for this:
for i in input*.pdf; do
pdffonts ${i} | tee ${i}.pdffonts.txt
done
Look for the font names used in each PDF.
My theory/bet is on you seeing identical font names used (names which are similar to BAAAAA+ArialMT) by different input files.
The BAAAAA+ font name prefix to be used for subset fonts is supposed to be random (though the official specification is not very clear about this). Some applications use predictable prefixes, however, starting with BAAAAA+, CAAAAAA+ DAAAAA+ etc. (OpenOffice.org and LibreOffice are notorious for this). This means that the prefix BAAAAA+ gets used in every single file where at least one subset font is used...
It can easily happen that your input files do not use the exact same subset of characters. However the identical names used could make Ghostscript think that the font really is the same. It (falsely) 'optimizes' the merged PDF and embeds only one of the 2 font instances (both having the same name, for example BAAAAA+Arial). However, this instance may not include some glyphs which where part of the other instance(s).
This leads to some characters missing in merged output.
I know that more recent versions of Ghostscript have seen a heavy overhaul of their font handling code. Maybe you'll be more lucky with trying Ghostscript v9.06 (the most recent release to date).
I'm very much interested in investigating this in even bigger detail. If you can provide a sample of your input files (as well as the merged output given by GS v8.70), I can test if it works better with v9.06.
What you could do to avoid this problem
Try to always embed fonts as full sets, not subsets:
I don't know if and how you can control to have full font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs the commandline parameters to enforce full font embeddings are:
gs -o output.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.file
Some type of fonts cannot be embedded fully, but only subsetted (TrueType, Type3, CIDFontType0, CIDFontType1, CIDFontType2). See this answer to question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.
Do the following only if you are sure that no-one else gets to see or print or use your individual input files: Do not embed the fonts at all -- only embed when merging with Ghostscript the final result PDF from your inputs.
I don't know if and how you can control to have no font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs the commandline parameters to prevent font embedding are:
gs -o output.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=false -c "<</AlwaysEmbed [ ]>>setpagedevice" input.file
Some type of fonts cannot be embedded fully, but only subsetted (Type3, CIDFontType1). See this answer to question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.
Do not use Ghostscript, but rather use pdftk for merging PDFs. pdftk is a more 'dumb' utility than Ghostscript (at least older versions of pdftk are) when it comes to merging PDFs, and this dumbness can be an advantage...
Update
To answer once more, but this time more explicitly (following the extra question of #sacohe in the comments below. In many (not all) cases the following procedure will work:
Re-'distill' the input PDF files with the help of Ghostscript (preferably the most recent version from the 9.0x series).
The command to use is this (or similar):
gs -o redistilled-out.pdf -sDEVICE=pdfwrite input.pdf
The resulting output PDF should then be using different (unique) prefixes to the font names, even when the input PDF used the same name prefix for different font (subsets).
This procedure worked for me when I processed a sample of original input files provided to me by 'Mr R', the author of the original question. After that fix, the "skipped character problem" was gone in the final result (a merged PDF created from the fixed input files).

I wanted to give some feedback that unfortunately the re-processing trick doesn't seem to work with ghostscript 8.70 (in redhat/centos releases) and files exported as pdf from word 2010 (which seems to use ABCDEE+ prefix for everything). and i haven't been able to find any pre-built versions of ghostscript 9 for my platform.
you mention that older versions of pdftk might work. we moved away from pdftk (newer versions) to gs, because some pdf files would cause pdftk to coredump. #Kurt, do you think that trying to find an older version of pdftk might help? if so, what version do you recommend?
another ugly method that halfway works is to use:
-sDEVICE=pdfwrite -dCompatibilityLevel=1.2 -dHaveTrueType=false
which converts the fonts to bitmap, but it then causes the characters on the page to be a bit light (not a big deal), trying to select text is off by about one line height (mildly annoying), and worst is that even though the characters display ok, copy/paste gives random garbage in the text.
(I was hoping this would be a comment, but I guess I can't do that, is answer closed?)

From what I can tell, this issue is fixed in Ghostscript version 9.21. We were having a similar issue where merged PDFs were missing characters, and while #Kurt Pfeifle suggestion of re-distilling those PDFs did work, it seems a little infeasible/silly to us. Some of our merged PDFs consisted of up to 600 or more individual PDFs, and re-distilling every single one of those to merge them just seemed nuts
Our production version of Ghostscript was 9.10 which was causing this problem. But when I did some tests on 9.21 the problem seemed to vanish. I have been unable to produce a document with missing or mangled characters using GS 9.21 so I think that's the real solution here.

Create print-ready PDF/X (with bleedbox, trimbox, mediabox, etc) programatically?

I was wondering if it is possible to programaticaly create a PDF file with an acceptable quality for the production press, ideally using only open-source libraries.
Right now the process is like this:
-create texts and images
-merge them into a postscript file
-use Acrobat Distiller to convert it to PDF (Acrobat distiller helps you check all the parameters of the PDF)
-send the PDF to the press
What I want is something like:
-take all texts and pictures in this folder
-encode them into the press-ready PDF, something similar to what Distiller produces
-send them to the press
How would you do that?
Many thanks...

Are the Ghostscript's gsdll32.dll and gswin{32,64}.c.exe with their source code and the GPL3 enough (or too much) of Open Source? They ship as part of all recent releases (newest one currently: v8.71).
Ghostscript can create very good quality PDF. See here for the most recent documentation about its PDF/A and PDF/X support.
Note, that this documentation until very recently was a bit misleading: it missed hinting at the requirement to edit+adapt the referred PDFA_def.ps or PDFX_def.ps templates. If you followed the old documentation without editing the templates to specifically point to the ICC color profile you wanted to embed, your output would be valid PDF, but would not pass all checks testing for compliancy with the official PDF/A+PDF/X standards.

You can generate pdfs using f.e. TeXML and XeLaTeX (first one to make scripting easier -- TeX has lots of quirks in syntax).
I also tried OpenJade and its DocBook support, but the quality was lower. TeX seems to do typesetting much better.
Both ways are using standalone programs... which you can use in shell scripts or call using system facilities.

You didn't mention which version of Distiller you're using. Recent versions do have a setting that lets you generate (different verions of) PDF/X. See also the *.joboptions files which ship with Distiller.

Gluing (Imposition) PDF documents

I have several A4 PDF documents which I would like (two into one) "glue" together into A3 format PDF document. So I will get from 2PDFs A4 a single one sided PDF A3.
I have found the excellent utility PDFToolkit and some others but none of them can be used to "glue" side by side two documents.

I just came across a nice tool on superuser.com called PDFjam that can do all of the above in a single command:
pdfjam --nup 2x1 file1.pdf file2.pdf --outfile DONESKI.pdf
It has other standard features like page size plus a nice syntax for more sophisticated collations of pages (the tricky page re-ordering necessary for true booklet-style page imposition).
It's built on top of TeX which is, whatever it is. Installing is a breeze on Ubuntu: you can just apt-get install pdfjam. On Mac OS, I recommend getting BasicTeX (google "mactex basictex"; SO thinks I'm a spammer and won't let me post the link).
This is a lot easier and more maintanable than installing both pdftk and Multivalent (on both Mac OS for dev and Ubuntu for deploy), which wasn't going so well for me anyway...!

Found the following (free and open-source) tool for doing Imposition called Impose (thanks danio for the tip). This solved my problem perfectly.
EDIT:
Here is how it's done:
Use PDF Toolkit to joint two PDF files into one (two A4)
pdftk File1.pdf File2.pdf cat output OutputFile.pdf
Create from this a single page (one A3):
java -cp Multivalent.jar tool.pdf.Impose -dim 2x1 -verbose -paper-size "42.2x29.9cm" -layout "1,2" OutputFile.pdf

I would like to advertise my pdftools
It's written in Python so should run on any platform. It's a wrapper to Latex (the pdfpages packages) but can do lot of things with a single command line: merge pdf files, nup them (multiple input pages per output page) and number the pages of the output file (you specify the location and the format of the number)
It still needs some work but I think it's quite stable to be usable right now :)

This puts two landscape letter pages onto a single portrait letter sheet, to be "bound" (i.e., folded) along the top.
pdftops $1 - |
psbook |
pstops -w11in -h8.5in '4:1#.65(.5in,0in)+0#.65(.5in,5.5in),2U#.65(8in,5.5in)+3#.65U(8in,11in)' |
ps2pdf - $(basename $1 .pdf).psbook.pdf
By the way, I do this often, so I'll probably submit more "answers" to this question just to keep track of successful pstops pagespecs. Let me know if this is an inappropriate use of SO.

A nice, powerful, open-source imposition tool is included
in the PoDoFo package:
http://podofo.sourceforge.net/
It works for me. Some imposition plans can be found at:
http://www.av8n.com/computer/prepress/
PoDoFo can do lots of other stuff, not just imposition.
Another useful imposition tool is Bookbinder (on the
quantumelephant site). It has a GUI that appeals to non-experts.
It is not as flexible or powerful as PoDoFo, but it can do
imposition.
pdftk is more-or-less essential to have, but it will not
do imposition.
pdfjam is useless to me, because there are a wide range of
valid pdf files that it cannot handle.
I've never been able to get multivalent to work, either.

What you want to do is imposition. There are commercial tools to impose PDFs such as ARTS crackerjack and Quite imposing but they are pretty expensive (US$500), require a copy of acrobat professional and are overkill for imposing 2 A4 pages to an A3 sheet.

On the Postscript side, a tool named pstops is able to rearrange pages of a Postscript file in any way you could imagine. I've not heard of such a tool for PDF. But pdf2ps and ps2pdf exist. So a not-so-ideal solution may be a combination of pdf2ps, pstops and ps2pdf.

I would combine the two A4 pages into one 2-page PDF using pdftk. Then Print to PDF using something like PrimoPDF, and tell it to print to A3 format, two pages per side.
I just tested this printing some slides from PowerPoint. It worked great. I selected A3 as my paper size in PowerPoint, and then chose to print 2 pages per side. Printed to Primo and voila, I have two A4 slides per A3.

You can put multiple input pages on one output page using BookletImposer.
And you can change page orders and combine multiple pdf files using PDF Mod.
With these two tools, you can do almost everything you want with pdf files (except editing their content).

I had a similar problem. I tried Impose but it was giving me an
Exception in thread "main" java.lang.NoClassDefFoundError: tool/pdf/Impose
Caused by: java.lang.ClassNotFoundException: tool.pdf.Impose
(...)
Could not find the main class: tool.pdf.Impose. Program will exit.
I then tried PDF Snake which isn't free or open source, but has a completely unrestricted 30-day trial version. It worked perfectly, after tweaking the parameters to achieve what I wanted. It's a great tool. I would definitely buy it if it wasn't so expensive! Anyway, I thought I'd leave my 2 cents in case anyone had the same problem I had with Impose.

look at this
http://sourceforge.net/projects/proposition/
It needs laTex to run,
but when it does, works really fine
Regards

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas