Multipage background in PDF, using pdftk or other tool

How can I add a multipage background (e.g. different backgrounds for odd and even pages) to a 10,000-page PDF, while keeping the output file as small as possible?
I'm producing massively multipage documents (e.g. 10,000 pages in one document). Each page has a background, which I apply this way:
I have a lot of .dvi documents, which I join using dviconcat;
next I run dvipdf on the joined .dvi;
and finally I use pdftk to apply the background: pdftk infile.pdf background bg.pdf output outfile.pdf
This way I get a fairly small file, e.g. 200 MB, compared with the situation where I produce lots of .pdf files that already carry the background and join them with pdftk, which gives a file of e.g. 2 GB.
I think that's because the background is not repeated on every page; a single copy is stored in the PDF and the pages just reference it.
Unfortunately, I now need a two-page (two-sided) background: one background for odd pages and a different one for even pages. pdftk doesn't know how to do that. I could prepare a 10,000-page background file, but it would be huge (e.g. 1 GB).
Any suggestions on how to accomplish this without juggling multi-gigabyte files? Is it doable at all? If so, with pdftk or some other tool?

One solution would be to apply the background when you convert the PostScript to PDF. Using a BeginPage procedure you can paint the background before you paint the page contents, and by checking the page count in BeginPage you can choose which background to paint, so you can have different ones for even/odd/whatever pages.
If you specify each background as a PostScript form, your BeginPage procedure can be small. Also (and rather more importantly) the current version of Ghostscript, 9.14, will attempt to pass PostScript forms into the PDF file as PDF forms, and it can identify and consolidate duplicates, so it 'should' embed each form only once. This should result in the minimum possible file size.
However, this code is at an early stage of development and might not work for you; you'll also need to do some PostScript programming.
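A minimal sketch of that approach, assuming the input is a single PostScript file and Ghostscript's pdfwrite device produces the PDF; drawOddBG and drawEvenBG are hypothetical procedures you would define yourself (ideally as forms, per the note above):

    gs -o outfile.pdf -sDEVICE=pdfwrite -c '
      << /BeginPage {
           % BeginPage is called with the showpage count on the stack;
           % counts 0, 2, 4 ... correspond to pages 1, 3, 5 (the odd pages)
           2 mod 0 eq
             { drawOddBG }     % hypothetical: paint the odd-page background
             { drawEvenBG }    % hypothetical: paint the even-page background
           ifelse
         } >> setpagedevice
    ' -f input.ps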
I'm not familiar with pdftk, but would it be possible to produce all the even pages and add a background to them, produce all the odd pages and add a different background, and then use pdftk to merge and interleave the pages?
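For what it's worth, a recent pdftk build can do both the splitting and the interleave itself, via the odd/even page qualifiers and the shuffle operation. A sketch (bg_odd.pdf and bg_even.pdf are the two hypothetical one-page backgrounds):

    pdftk infile.pdf cat 1-endodd output odd_pages.pdf
    pdftk infile.pdf cat 1-endeven output even_pages.pdf
    pdftk odd_pages.pdf background bg_odd.pdf output odd_bg.pdf
    pdftk even_pages.pdf background bg_even.pdf output even_bg.pdf
    pdftk A=odd_bg.pdf B=even_bg.pdf shuffle A B output combined.pdf

Whether the shared background resources survive the final merge, keeping the file small, is worth verifying on a small sample first.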
NB Ghostscript doesn't handle .dvi files, so I'm rather at a loss to know how you use Ghostscript to 'join' them. Also, if you are somehow creating the PostScript files using Ghostscript, you would almost certainly be better off using Ghostscript to produce the PDF file directly. (I'm assuming here that you are using Ghostscript's ps2pdf, but even if you aren't, it'll still be quicker to produce the PDF in one step, and it will almost certainly produce better output too.)

Related

a background grid appears after using ps2pdf

Morning, everyone,
Quick question about ps2pdf. I use it to convert graphics that I produce directly in PostScript to PDF. While there is no visual problem with the PS files, I see a grid in my PDF viewer. At first I thought the problem was in the viewer, but it remains present when I compile my TeX files containing the figures with pdfLaTeX. Do you have any ideas for settings that could "fix" this display? Thanks in advance :)
Evince is independent of Ghostscript as far as PDF files are concerned, but I don't know how it can be viewing PostScript files.
I believe what you are seeing is an artefact of the PDF rendering engine in use, and the way the PDF file is constructed (which is itself dependent on the way the PostScript is constructed).
Much of the content is drawn by creating little rectangles which are intended to butt up against each other (and basically do). However, depending on the resolution, the precise numerical accuracy of the calculations and the accuracy of the co-ordinates, it can be the case that these rectangles do not quite touch ideally. There is a theoretical gap between them.
You can see this occur with Adobe Acrobat: zooming in and out changes where the lines appear, because it changes the effective resolution and thereby the calculations from user space to device space, i.e. to the actual pixels on screen.
I cannot say for sure that the same problem exists with Evince, but I expect it does. With Acrobat I can turn off anti-aliasing, which is where the problem really arises: Acrobat is attempting to insert an anti-aliased pixel between the two rectangles, which leads to these faint lines. Turning it off (in Acrobat X: Edit -> Preferences -> Page Display -> Smooth Line Art) makes the lines disappear.
Ghostscript doesn't apply anti-aliasing by default, so these lines don't appear when rendering either the PostScript or the PDF files, but if I turn on anti-aliasing (-dGraphicsAlphaBits=4) then Ghostscript renders the lines in both the PostScript and the PDF file.
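A quick way to see this for yourself is to rasterize the same file with and without anti-aliasing and compare the results (figure.pdf, the png16m device and the 150 dpi resolution are just example choices):

    # without anti-aliasing (the Ghostscript default): no hairlines
    gs -o page_%03d.png -sDEVICE=png16m -r150 figure.pdf
    # with graphics anti-aliasing enabled: the faint grid appears
    gs -o page_%03d.png -sDEVICE=png16m -r150 -dGraphicsAlphaBits=4 figure.pdf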
Essentially I think the problem is that your PDF viewer is using anti-aliasing and your PostScript viewer isn't, so they don't look the same.

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF files. It's most important to me that the files remain usable for a very long time and, if possible, that OCR is applied automatically to new files (which doesn't seem to be really possible with Adobe Acrobat...).
To that end I've been looking at different solutions for OCRing my PDF files. I found three candidates which seem to do what they should (more or less), but all three variants have their pros and cons, and each of the three seems to take a different approach to storing data in the PDF file. Let me explain:
A file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer), and after running a preflight script I'm able to see the text that is stored hidden.
A file OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers.
But as far as I was able to reproduce it, these files seem to have a background text layer containing the OCRed text, which lies underneath the image that is ultimately shown to the user. Unfortunately this seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
A file OCRed with Tesseract 4 (alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
also does some weird magic with the hidden text part.
But in all three cases I'm able to search for words in the files and to see the text using "Remove hidden information" and selecting "hidden text".
I'm seriously confused... Does anyone know how these programs really store their hidden text information?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You have correctly found out that the approach of Abby Finereader differs from that of Adobe Acrobat and of Tesseract:
Abby creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which draws nothing).
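You can verify this yourself by decompressing a file's streams and reading the page content stream in a text editor; a sketch using pdftk's uncompress filter (the operator sequence in the comment is illustrative, not copied from these exact files):

    pdftk "A4 sample_TESSERACT_oem2.pdf" output uncompressed.pdf uncompress
    # in the page content stream, look for something like
    #   BT 3 Tr /F1 8 Tf (recognized text) Tj ET
    # where "3 Tr" selects text rendering mode 3 (neither fill nor stroke)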
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts, for which a PDF viewer has font programs, so it renders them as normal glyphs.
Tesseract uses a font called GlyphLessFont, for which it embeds a font program into the result file. When rendered, the glyphs in this font do not show as normal Latin glyphs but merely as empty space.
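The font difference is easy to confirm with the pdffonts utility from poppler-utils, if you have it installed; the expected results here are inferred from the description above:

    pdffonts "A4 sample_ACROBAT.pdf"          # expect standard-14 names, 'emb' = no
    pdffonts "A4 sample_TESSERACT_oem2.pdf"   # expect GlyphLessFont, 'emb' = yes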
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste; they are used only in the invisible rendering mode anyway.

Adjusting format of PDF to print it faster

I am using a combination of iTextSharp and PdfSharp to assemble a large PDF file for printing to a Canon Oce VarioPrint 6000 series printer. The PDF is replacing a postscript file.
Both this new file and the old are transferred to the printer via an LPR command.
The postscript file would take maybe 10 minutes to rip to the printer. My PDF version of the same file is taking over 30 minutes to process before it is ready to print.
Can anyone give me pointers into ways I could change the way this file is written / created that would decrease the processing time on the Vario?
EDIT: I took the file that was ripping so slowly and ran it through Acrobat Preflight, and it found many RGB images that it wanted to convert to CMYK. When I look at the PDF, though, they are all black and white logos, so I had Preflight run a fixup to convert all images to print black and white.
I also noticed the Preflight was consolidating backgrounds. Half of the pages have the same logo on them, so leveraging this conversion is probably also helpful.
When I LPR'd that file, it copied and ripped in less than 5 minutes! So I guess the real question is: how can I do that programmatically?
I am modifying the title and tags.
Thanks!
An equivalent result to the preflight repair process in this case can be obtained by using iText (or in my case, iTextSharp). I replaced the PdfSharp method of aggregating the PDFs with the PdfSmartCopy class. Combined with iText's reader.RemoveUnusedObjects(), this brought down the size of the output PDF significantly, and my rip time to the printer dropped to the same as or below the previous rip times we had with the PostScript file. Very pleased.
So the RGB images that were probably contributing to the long processing time were reduced by PdfSmartCopy removing the duplicates.
More info on PdfSmartCopy can be found at: http://api.itextpdf.com/itext/com/itextpdf/text/pdf/PdfSmartCopy.html
and in Bruno's book, iText In Action, more specifically in Chapter 6.
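As an aside, if you want to test the effect of duplicate-resource consolidation before touching any code, Ghostscript's pdfwrite device offers a comparable deduplication pass from the command line (a different tool than the iTextSharp route described above, shown only for comparison):

    gs -o deduped.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true input.pdf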

Is there a way to programmatically remove all blank pages from a PDF file?

Nowadays it is more practical to purchase an ebook than the dead-tree version. But the PDFs frequently contain the blank pages used by the print edition. I typically see between 10 and 30 blank pages (or pages with the text "This page intentionally left blank.") per ebook. Is it possible to programmatically remove these blank pages? Currently I manually identify the blank pages and then run the file through this:
pdftops orig.pdf - | psselect "$range_of_non_blank_pages" | ps2pdf - new.pdf
So the hard part is identifying the blank pages. pdftotext would work for the most part, except where the page has only images and no text.
Also, even after removing many pages and seeing that the resulting file is smaller, when I shrink both the original file and the new version (using various methods found on the internets), the original file usually ends up smaller by several hundred KB or more. So it appears the method I'm using to remove the blank pages doesn't create an optimal PDF. I've also tried various GUI programs and see the same results in this respect.
Partial answer: you don't need to go via PostScript (this is probably the reason why you get a bigger file). One possibility is:
pdftk orig.pdf cat "$range_of_non_blank_pages" output new.pdf
To identify blank pages, you'd need to use a tool that can go beyond selecting and reassembling pages. Try a library for a scripting language, for example CAM::PDF or PDF::API2 in Perl.
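One command-line heuristic for finding candidate blank pages, which also works when a page contains only images, is Ghostscript's inkcov device, which reports per-page ink coverage. A sketch (the 0.0001 threshold is an assumption you may need to tune, and scanner noise on "blank" scanned pages will report non-zero coverage):

    # print the numbers of pages whose CMYK coverage is (near) zero
    gs -o - -sDEVICE=inkcov orig.pdf \
      | awk '/CMYK OK/ { n++; if ($1 + $2 + $3 + $4 < 0.0001) printf "%d ", n }'

The complement of that list is the $range_of_non_blank_pages to feed to the pdftk command above.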
I don't know of an open source solution that can detect and remove blank pages. However, Apago's commercial PDF Enhancer can automatically remove blank pages -- both vector and scanned. For scanned pages, it can remove scan artifacts such as black edges, hole punches and noise prior to determining whether a page is blank.

Batch Decollating Directory of PDFs and Bar Code Imposing

I have a directory of PDFs. They are all different, but they all have 5 pages. I need to insert a bar code on each page for each PDF. After this process I need to combine and decollate every PDF. Essentially there would be 5 different PDFs created. The first would contain all page ones from every PDF, the second the second page, etc.
I need to find a tool, or a toolset, that would allow me to accomplish this. I'm willing to program my own solution but I'm not even sure what would be the most efficient language to attack it with.
What I ended up doing was using Perl with the PDF::Reuse and PDF::Reuse::BarCode libraries to get all the PDFs in the directory, pull the pages I wanted, place the barcode, and save out to a new PDF.
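For anyone who wants to stay on the command line, the decollate-and-combine half can also be sketched with pdftk (the barcode stamping still needs a library such as the ones above). This assumes every input has exactly 5 pages and writes the results into out/:

    mkdir -p out
    for i in 1 2 3 4 5; do
      # pull page $i out of every source PDF in the directory
      for f in *.pdf; do
        pdftk "$f" cat "$i" output "out/$(basename "$f" .pdf)_p$i.pdf"
      done
      # concatenate all the page-$i extracts into one document
      pdftk out/*_p"$i".pdf cat output "out/position_$i.pdf"
      rm out/*_p"$i".pdf
    done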