Compressing/Optimizing Vectors in PDF - pdf

I have a PDF of a scanned book; the images are in JBIG2 format (B&W). I'd like to convert this to a vector PDF, which I can do easily by extracting the images and converting them to PDF vector graphics instructions with potrace.
The reason for this is that I want the PDF to display smoothly and quickly on an ebook reader device, such as a Kindle. With JBIG2 it is not doing this very well. Depending on the settings, the Kindle can't display the PDF, and even with that fixed it takes a long time to render each page. With a vector PDF the performance is much better, and the rendering very crisp.
The problem is that the resulting PDF is gigantic in filesize. Even with the streams gzcompressed to the max it is 300KB per page (original JBIG2 images were 30KB per page).
Is there any way I can optimize the vector graphics so that the filesize is much less?
Here is a segment of the vector drawing instructions:
0.100000 0.000000 0.000000 0.100000 0.000000 0.000000 cm
0 g
8277 29404 m
8263 29390 8270 29370 8289 29370 c
8335 29370 8340 29361 8340 29284 c
8340 29220 8338 29210 8323 29210 c
8194 29207 8141 29208 8132 29214 c
8125 29218 8120 29248 8120 29289 c
8120 29356 8121 29358 8150 29370 c
8201 29391 8184 29400 8095 29400 c
8004 29400 7986 29388 8033 29357 c
8056 29342 8057 29338 8057 29180 c
8058 29018 l
8029 29008 l
8012 29002 8001 28993 8003 28986 c
h
f
I would have thought that the numbers could be compressed down very easily, but apparently not. One page is 800KB uncompressed (as above) and 300KB gzcompressed. I would have thought that the compression ratio could be much better, considering how the instructions are all numbers in similar ranges.

I am afraid there's not much that can be done about this.
Of course, you might try to use LZW compression on PDF page streams (instead of Deflate) but it probably won't make much difference.
Other suggestions:
Smooth the source image as much as possible and remove as many details as possible. This might produce fewer curves (i.e. less data) during conversion.
Try to optimize the values in the PDF page stream. For example, you might try to use sophisticated combinations of scale / translate operators and changes to the data. The goal here is to reduce the length of the operands.
For example, you might try to divide all operands (using integer, not floating-point division) by, say, 100 and add scaling before the first operator. This approach will most probably degrade the visual quality, though.
And of course, if you are going to do this to only a handful of files then I would say it's not worth the time.
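The operand-shortening idea can be sketched in Python as follows. This is only an illustration under assumptions: the function name shrink_operands is made up, the stream is assumed to be already decompressed text with one operator per line, and path operands are assumed to be integers as in the sample from the question.

```python
def shrink_operands(stream: str, divisor: int) -> str:
    """Divide path coordinates by `divisor` and compensate in the cm matrix."""
    out = []
    for line in stream.splitlines():
        parts = line.split()
        if not parts:
            continue
        op = parts[-1]
        if op == "cm":
            a, b, c, d, e, f = (float(x) for x in parts[:-1])
            # Scale the linear part up so the drawing keeps its size on the page;
            # the translation (e, f) is applied after scaling and stays unchanged.
            nums = [a * divisor, b * divisor, c * divisor, d * divisor, e, f]
            out.append(" ".join(f"{n:g}" for n in nums) + " cm")
        elif op in ("m", "l", "c"):
            # Integer division: shorter operands, but this is where detail is lost.
            coords = (str(int(x) // divisor) for x in parts[:-1])
            out.append(" ".join(coords) + " " + op)
        else:
            out.append(line)
    return "\n".join(out)


sample = """0.100000 0.000000 0.000000 0.100000 0.000000 0.000000 cm
0 g
8277 29404 m
8263 29390 8270 29370 8289 29370 c
h
f"""

print(shrink_operands(sample, 10))
```

With a divisor of 10 every coordinate loses one digit, and the cm line shrinks as well because trailing zeros are dropped. Whether the loss of precision is visible depends on the scan resolution, so the divisor is something to experiment with.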

Related

How are quantized DCT coefficients serialised in JPEG?

I've read in dozens of articles, scientific papers, and toy implementations that the steps in JPEG compression are roughly as follows
Take 8x8 DCT
Divide by quantization matrix
Round to integers
Run-length & Huffman
And then the inverse is pretty much the same. What is left out in everything on the topic I've found so far is the magnitude of the data and the corresponding serialization.
It appears implicitly assumed that all the coefficients are stored as unsigned bytes. However, as I understand it, the DC coefficient is in the range 0-255, while the AC coefficients can be negative. Are the AC coefficients in the range ±255, or ±127, or something else?
What is the common way to store these coefficients in a compact way?
The first-hand source to read is of course the ITU-T T.81 standard document.
Looks like the first Google link leads to a paywall.. it's on the w3 site, though: https://www.w3.org/Graphics/JPEG/itu-t81.pdf
Take 8-bit input samples (0..255)
Subtract 128 (-128..127)
Do N*N fDCT, where N=8
Output can have log2(N)+8 bits = 11 bits (-1024..1023)
DC coefficients are stored as a difference, so they can have 12 bits.
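The ranges above can be checked numerically. The sketch below evaluates the forward DCT formula directly (slow, but easy to verify) on the two extreme level-shifted blocks. The function name fdct_8x8 is made up for illustration; it is not taken from any JPEG library.

```python
import math


def fdct_8x8(block):
    """Forward 8x8 DCT, evaluated straight from the textbook formula."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for v in range(8):
        for u in range(8):
            total = 0.0
            for y in range(8):
                for x in range(8):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[v][u] = 0.25 * c(u) * c(v) * total
    return out


# Level-shifted extremes: every sample is -128 (black) or +127 (white).
darkest = [[-128] * 8 for _ in range(8)]
brightest = [[127] * 8 for _ in range(8)]

dc_min = fdct_8x8(darkest)[0][0]    # about -1024
dc_max = fdct_8x8(brightest)[0][0]  # about 1016

# The DC difference between neighbouring blocks can span the whole range,
# which is why it needs one more bit than a single coefficient.
max_dc_diff = dc_max - dc_min       # about 2040
print(round(dc_min), round(dc_max), round(max_dc_diff))
```

The DC term of a constant block is 8 times the sample value, which is where the three extra bits come from.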
The encoding process depends upon whether you have a sequential scan or a progressive scan. The details of the encoding process are too complicated to fit within an answer here.
I highly recommend this book:
https://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_2?ie=UTF8&qid=1531091178&sr=8-2&keywords=JPEG&dpID=5168QFRTslL&preST=_SX258_BO1,204,203,200_QL70_&dpSrc=srch
It is the only source I know of that explains JPEG end-to-end in plain English.

Resizing multi-page mixed-format PDF with Ghostscript?

I have multi-page PDF-files with mixed formats A4 (portrait) - A0 (landscape).
Is Ghostscript capable of resizing the pages larger than A3 down to A3, while leaving the smaller pages (A4) untouched?
First, Ghostscript doesn't manipulate the input; you should read ghostpdl/doc/vectordevices.htm to see how Ghostscript and the pdfwrite device actually work.
Out of the box, no: Ghostscript and the pdfwrite device won't let you produce output whose media size differs from the input and varies from page to page (you can have it produce output sized to a single media size). It can, of course, be done, but it will involve some programming, and in PostScript at that.
You would probably want to look at the pdf_PDF2PS_matrix routine in ghostpdl/Resource/Init/pdf_main.ps:
% Compute the matrix that transforms the PDF->PS "default" user space
/pdf_PDF2PS_matrix { % <pdfpagedict> -- matrix
...
Which calculates the scale factors required when resizing content to fit the media.
Also pdfshowpage_setpage:
/pdfshowpage_setpage { % <pagedict> pdfshowpage_setpage <pagedict>
6 dict begin % for setpagedevice
% Stack: pdfpagedict
...
Which is where the selection of the media size takes place.
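Whatever tool ends up doing the work, the per-page decision itself is simple arithmetic. The Python sketch below is not Ghostscript code and does not touch a PDF; it only shows the scale factor a PostScript modification would have to compute. The function name and the rounded A3 dimensions in points are assumptions made for illustration.

```python
# A3 in PostScript points (1/72 inch), rounded to whole points.
A3_SHORT, A3_LONG = 842, 1191


def scale_to_fit_a3(width_pt, height_pt):
    """Return 1.0 for pages up to A3, else the factor that shrinks them to A3."""
    # Sorting the sides makes the check independent of portrait/landscape.
    short, long_ = sorted((width_pt, height_pt))
    if short <= A3_SHORT and long_ <= A3_LONG:
        return 1.0
    return min(A3_SHORT / short, A3_LONG / long_)


print(scale_to_fit_a3(595, 842))    # A4 portrait: left alone
print(scale_to_fit_a3(3370, 2384))  # A0 landscape: scaled down
```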
After spending a long time looking for a solution, I found a great - and yet affordable - tool capable of doing the resizing and a lot more: PStill (http://www.pstill.com/)

Animated GIF larger than source images

I'm using imagemagick to create an animated GIF out of ~60 JPG 640x427px photos. The combined size of the JPGs is about 4MB.
However, the output GIF is ~12MB. Is there a reason why the GIF is considerably bigger? Can I conceivably achieve a GIF size of ~4MB?
The command I'm using is:
# -channel RGB and -dispose Background: no improvement in size
# -layers Optimize: about 2MB improvement
convert -channel RGB \
-delay 2x10 \
-size 640 \
-loop 0 \
-dispose Background \
-layers Optimize \
portrait/*.jpg portrait.gif
Using gifsicle didn't seem to improve either.
JPG is lossy compression.
GIF is lossless compression.
A better comparison would be to convert all the source images to GIF first, then combine them..
First google hit for GIF compression is http://ezgif.com/optimize which claims lossy GIF compression, might work for you but I offer no warranty as I haven't tried it.
JPEG achieves its compression through a (lossy) transform, where a 16x16 / 8x8 block of pixels is transformed to a frequency representation and then quantized. Instead of selecting e.g. 256 levels (i.e. 8 bits) of red/green/blue per component, JPEG can ignore some frequency components, or use just 1 or 2 bits to represent them.
GIF on the other hand works by identifying repeated patterns in a paletted image (up to 256 entries), which occur exactly in the previously encoded/decoded stream. Both because of the JPEG compression, and the source of the images typically encoded by JPEG (natural full color), the probability of (long) exact matches is quite low.
60 RGB images of size 640x427 amount to about 16 million pixels. Representing that much in 4 MB requires compressing to about 2 bits per pixel. Achieving this with GIF would require a very lossy algorithm, one that quantizes each true-color pixel not simply to the closest entry in the target GIF palette, but also according to how good a dictionary of code words that choice would produce. The dictionary builds slowly, and to achieve 2 bits/pixel, the average decoded code word would have to map to 5.5 matching pixels in the close neighborhood.
By contrast, imagemagick has been able to compress the 16 million pixels (each selected from a palette of 256 elements) to 75% already!
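The figures in this answer can be reproduced with a few lines of arithmetic. The sketch below assumes 1 MB = 1,000,000 bytes and uses the sizes quoted in the question (about 12 MB actual, 4 MB desired); the variable names are only illustrative.

```python
frames, width, height = 60, 640, 427
pixels = frames * width * height        # total pixels across all frames

target_bytes = 4_000_000                # the size the asker hopes for
actual_bytes = 12_000_000               # the size ImageMagick produced

# Bits available per pixel if the GIF were to fit in the target size.
target_bits_per_pixel = target_bytes * 8 / pixels

# An uncompressed paletted frame costs one byte per pixel (256-entry palette).
paletted_bytes = pixels
actual_ratio = actual_bytes / paletted_bytes

print(pixels, round(target_bits_per_pixel, 2), round(actual_ratio, 2))
```

So the existing 12 MB file already sits at roughly three quarters of the raw paletted size, while the 4 MB target would need about 2 bits per pixel.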

Raw pdf color conversion (with known conversion formula) from RGB to CMYK

This question is related to
Script (or some other means) to convert RGB to CMYK in PDF?
however way more specific. Consider that I am not an expert in print production ;)
Situation: For printing I am only allowed to use two colors, Cyan and Black. The printery requests the final PDF to be in DeviceCMYK with only the Channels C and K used.
pdflatex automatically does that (with the xcolor package) for all fonts and drawn objects, however I have more than 100 sketches/figures in PDF format which are embedded in the manuscript. Due to an admittedly badly designed workflow (late realization that Inkscape cannot export CMYK PDFs), all these figures were created in Inkscape, and thus are RGB PDFs.
However, the only used colors within Inkscape were RGB complements of CMY(K), e.g. 100% Cyan is (0,255,255) RGB and 50% K is (127,127,127) etc.
Problem: I need to convert all these PDF figures from RGB to DeviceCMYK (or alternatively the whole PDF of the final manuscript) with a specific conversion formula.
I did a lot of google research and tried the often suggested ways of using e.g. Ghostscript or various print production tools in Adobe Acrobat, however all of the conversion techniques I found so far wanted to use ICC color profiles or used some other conversion strategy which filled the channels MY and spared some C and K, for example.
I know the exact conversion formula for the raw color numbers from our Inkscape-RGBs to the channels C and K, however I do not know or find any program or tool that allows me to manually specify conversion formulas.
Question: Is there any workflow to convert my PDFs from RGB to C(MY)K manually with my own specific conversion formula for the raw numbers with the converted PDF being in DeviceCMYK using a tool, script or Adobe product?
Due to the large number of figures I would prefer a batched solution which doesn't require too much coding from my side, but if it should be the only solution, I'd also be open minded for a workflow like "load/convert/save" within a program for every single figure or writing a small program with an easy-to-handle C++ PDF API for example.
Limitations and additional info: A different file format (like TikZ figures) is not possible any more since it does not work perfectly and the necessary adaptions to the figures would create too much overhead. A maybe helpful information: Since the figures are created in Inkscape, there are no raster images within the PDFs. I also do not want all figures to be converted to raster images during the color conversion.
Edit:
I have created an example of a RGB PDF-figure created with inkscape.
I also did a manual object-by-object color conversion to a CMYK-PDF with Illustrator, to show how the result should look like. Illustrator stores the axial shading in a DeviceN colorspace with the colors cyan and black, which is close enough^^
Here is an idea, I think it will work if your PDF files are using exclusively the colorspaces DeviceGray, DeviceRGB and DeviceCMYK:
1- Convert all your PDF files to Postscript (with pdf2ps from ghostscript for example)
2- Write a Postscript program that redefines the operators setrgbcolor, setgray and setcolor with your own implementation in the Postscript language, your implementation will internally use setcmykcolor and it will compute the values using your custom formula.
Here is an example for redefining the setgray operator:
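% The operator setcmykcolor expects 4 values in the stack
% When setgray is called, we can expect to have 1 value in the stack, we will
% use it for the black component of cmyk by adding 3 zeros and rolling the
% top 4 elements of the stack 3 times
% Note: gray 0 is black while K 0 means no ink, so a real formula would
% probably pass 1 minus the gray value (1 exch sub) rather than the value itself
/setgray { 0 0 0 4 3 roll setcmykcolor } bind def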
3- Paste your PostScript program at the beginning of each resulting ps file from step 1.
4- Convert all your files back to PDF (with ps2pdf for example)
See it in action by saving this piece of code as sample.ps:
/setgray { 0 0 0 4 3 roll setcmykcolor } bind def
0.5 setgray
0 0 moveto
600 600 lineto
stroke
showpage
Convert it to PDF with ghostscript using this command line (I used version 9.14):
gswin64c.exe -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=sample.pdf sample.ps
The resulting PDF will have the following page content:
q 0.1 0 0 0.1 0 0 cm
/R7 gs
10 w
% The K operator is the PDF equivalent of setcmykcolor in postscript
0 0 0 0.5 K
0 0 m
3000 3000 l
S
Q
As you can see, the ps -> pdf conversion will preserve the CMYK colors specified in PostScript with the setcmykcolor operator.
Maybe you can post your formula as a new question and someone could help you out translating it to postscript.
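For comparison, here is a different technique from the PostScript route above: rewriting the color operators directly in a decoded PDF content stream with Python. Everything in it is an assumption made for illustration. The formula in rgb_to_ck is the naive RGB to CMYK conversion with M and Y dropped (the asker's real formula is not given), and the function names are made up. It only handles the rg / RG operators with plain decimal operands; shadings and the sc / scn operators are not covered, and the PDF would still need its stream lengths repaired afterwards.

```python
import re


def rgb_to_ck(r, g, b):
    """Hypothetical formula: naive RGB -> CMYK, keeping only C and K."""
    k = 1.0 - max(r, g, b)
    if k >= 1.0:
        return 0.0, 1.0  # pure black: no cyan needed
    c = (1.0 - r - k) / (1.0 - k)
    return c, k


# Three numeric operands followed by the fill (rg) or stroke (RG) operator.
_RGB_OP = re.compile(
    r"(?<![\w.])(\d*\.?\d+)\s+(\d*\.?\d+)\s+(\d*\.?\d+)\s+(rg|RG)\b"
)


def rewrite_rgb_operators(content: str) -> str:
    """Replace 'r g b rg' with 'c 0 0 k k' (and RG with K) in a content stream."""
    def repl(match):
        r, g, b = (float(match.group(i)) for i in (1, 2, 3))
        c, k = rgb_to_ck(r, g, b)
        op = "k" if match.group(4) == "rg" else "K"
        return f"{c:.3g} 0 0 {k:.3g} {op}"

    return _RGB_OP.sub(repl, content)


print(rewrite_rgb_operators("0 1 1 rg\n0.5 0.5 0.5 RG"))
```

With this formula 100% cyan in RGB (0 1 1) maps to C=1, K=0 and mid grey maps to C=0, K=0.5, which matches the two examples in the question.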
Since you have access to Illustrator, you might want to try importing the PDF into Illustrator and using Illustrator's scripting capabilities to iterate over the elements and replace fill/stroke RGB colors with their CMYK replacement colors.
The difficulty will be with the shading patterns (Gradients) used in the PDF; if they are imported as GradientColor, then in theory it's a matter of digging into the GradientColor to find the base RGB colors and substitute their CMYK replacement.
A very similar problem was solved using the ActivePDF.dll with C++ (or C#??).

how to change all colours in a PDF to their respective complementary colours; how to make a PDF negative

How can all colours in a PDF be changed to their complements? So, I mean that a document consisting of black text on a white background would be changed to a document consisting of white text on a black background. Any red colours in a document would be changed to turquoise colours and so on. Is there some standard utility that could be used for this purpose or am I likely to have to contrive some awkward ImageMagick image conversions?
EDIT: Here's a very manual way of doing this using ImageMagick:
convert -density 300 -quality 100 "${fileName}" tmp.png
mogrify -flatten *.png
mogrify -negate *.png
convert *.png "${fileName}"_1.pdf
EDIT: I changed the wording for the purposes of clarity.
I can think of at least 3 ways to invert (negate, complement) the colors of a PDF page description -- I mean treating page content as a black box, therefore not counting direct diving into page content and messing around as per Dingo's answer. Unfortunately, ready-made free tools (Ghostscript, mainly) provide an incomplete solution and require manual intervention.
Note, that all specific terms used below require at least some knowledge of basics of PDF and Postscript Language References and are presented here somewhat as a simplification, please refer to manuals or google for thorough description.
The most obvious method is to use an inverting Transfer function. A Transfer function (TF) expects an argument in the range 0..1, which is an (additive) color component, and returns a color value, too. The negating TF is, of course, {1 sub neg} and is easy to inject:
gs -q -sDEVICE=pdfwrite -o out.pdf -c '{1 sub neg} settransfer' -f in.pdf
That's great, and Adobe Reader displays our out.pdf as (see below) negated. But here the 'greatness' ends. All other viewers ignore the TF, (probably) considering it to be device-dependent, present in the PDF only as compensation for output device peculiarities (non-linear printer response etc.), and therefore something to be ignored when displaying a PDF on-screen. Further, depending on Reader's version, negation of black-on-white text leads to either white-on-black or yellowish-on-black text. And that's not great.
Therefore, we need not only TF injection, but a way to properly apply the TF to PDF content before viewing. And, regardless of ps2pdf Ghostscript's manual saying:
Currently, the transfer function is always applied
(of three options: Apply, Preserve, Remove)
using the current 9.10 version, I couldn't make Ghostscript actually apply the TF (i.e. modify page description operators) when outputting to high-level devices (pdfwrite, as opposed to image output). Maybe I'm missing something here.
But, Adobe Distiller, with proper options set, does apply TF to input postscript file.
Somewhat related to TF is the use of inverting Device-Link color profiles,
which are simple identity DL profiles with inverting (input or output) curves.
That's an interesting use of interesting technology, but, again, Ghostscript currently doesn't support proper Color Management (and DL profiles) in PDF-2-PDF workflows. Moreover, Adobe Acrobat doesn't know what to do with DL profiles, their use within Acrobat requires expensive third-party plugins.
If PDF viewer (renderer) claims to support 1.4 and transparency (they all do, nowadays), that's another way to go. PDF Reference says, that if current Blending Mode is Difference and we paint with white, it effectively means inverting backdrop. So, we explicitly paint background with white (if there's no background then there's nothing to invert), then put our current content (treating it as black box), then set Blending Mode to Difference and paint on top with white. Is that clear? Again, I had no success setting Blending Mode using Ghostscript, with:
[ /BM /Difference /SetTransparency pdfmark
It works OK with Distiller but is ignored by Ghostscript. Maybe (again) I'm missing something.
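To see why painting white in Difference mode inverts the backdrop, here is the blend formula written out in Python. It is only the per-component arithmetic on colors in the 0..1 range, not anything a PDF library provides; the function name is made up for illustration.

```python
def difference_blend(backdrop, source):
    """Difference blend mode: absolute difference per color component."""
    return tuple(abs(b - s) for b, s in zip(backdrop, source))


white = (1.0, 1.0, 1.0)

# With a white source, |b - 1| is simply 1 - b, i.e. the inverted backdrop.
print(difference_blend((0.0, 0.0, 0.0), white))  # black backdrop becomes white
print(difference_blend((1.0, 0.0, 0.0), white))  # red backdrop becomes cyan
```

This is the same mapping as the {1 sub neg} transfer function, just applied by the renderer's transparency machinery instead.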
OK, to round up (the answer's getting somewhat long), here's a Perl solution for the 3rd method using a proper API (this is a programming site, isn't it? Any programming language and appropriate API will do):
use strict;
use warnings;
use PDF::API2;
use PDF::API2::Basic::PDF::Utils;
my $pdf = PDF::API2->open('adobe_supplement_iso32000.pdf');
for my $n (1..$pdf->pages()) {
    my $p = $pdf->openpage($n);
    $p->{Group} = PDFDict();
    $p->{Group}->{CS} = PDFName('DeviceRGB');
    $p->{Group}->{S} = PDFName('Transparency');
    my $gfx = $p->gfx(1); # prepend
    $gfx->fillcolor('white');
    $gfx->rect($p->get_mediabox());
    $gfx->fill();
    $gfx = $p->gfx(); # append
    $gfx->egstate($pdf->egstate->blendmode('Difference'));
    $gfx->fillcolor('white');
    $gfx->rect($p->get_mediabox());
    $gfx->fill();
}
$pdf->saveas('out.pdf');
Here I take one of Adobe documents and invert it.
What's important: the page should have its transparency blending space set to RGB explicitly, because Adobe Reader defaults to CMYK, and you probably don't want to invert colors in CMYK. Pure CMYK black 0-0-0-100 inverts to 100-100-100-0, that's (nearly) black, too. RGB black gives something like 70-60-50-70 CMYK that inverts to brown 30-40-50-30, and you don't want that. That's why I add the Group entry to the page dictionaries.
Your question seems to be very similar to this:
Change background color of pdf
but you also want to change the colour of the text.
so you can follow the workflow I suggested some time ago for the same task:
----------------
vector pdf background (meaning not raster image) in pdf files can be
easily changed in a couple of steps (see also my stackoverflow answer,
which I'll now extend and improve:
Change background color of pdf)
PRELIMINARY CHECK:
open your pdf file with an editor able to show the internal pdf structure,
like
notepad++
- http://notepad-plus-plus.org/download/v6.1.8.html
and verify if you can see code snippets like
0.000 0.000 0.000 rg (it means *black*)
1.000 1.000 1.000 rg (it means *white*)
and so on...
(the code snippets can vary; for instance, in pdfs produced by openoffice's
internal pdf exporting feature, the same code snippets take this form:
0 0 0 rg (it means *black*)
1 1 1 rg (it means *white*)
and so on...)
if you are able to see these code snippets, then you can start to change
values, otherwise, you need to decompress text streams
you can perform this task with
pdftk
http://www.pdflabs.com/docs/install-pdftk/
pdftk file.pdf output uncompressed.pdf uncompress
and recompress after finished changes
pdftk uncompressed.pdf output recompressed.pdf compress
now, if you see these code snippets, you can change values
STEP 1 (for pdf editing) -
the first thing you need is to find the right equivalence between RGB
color values of text and background and the internal pdf representation
of the same colors
Since it seems you are a windowsian inhabitant from the third planet in
the Microsoft constellation, you can use a free color picker like this
http://www.iconico.com/download.aspx?app=ColorPic&type=free
to identify the rgb values of text and background colors
once you have these values, you need to convert into special internal pdf
representation
to do this keep in mind this proportion:
1:255=x:color you selected
for instance: let's say you have this RGB triplet for the background:
30,144,255
to find the corresponding values in pdf, to insert in the code snippet that
changes the pdf background color, you do: (you can use http://www.wolframalpha.com/
to compute with precision)
1:255=x:30, so x = 30/255 = 0.117 (truncated to the first three decimals)
1:255=x:144, so x = 144/255 = 0.564 (truncated to the first three decimals)
1:255=x:255, so x = 255/255 = 1
so, the whole triplet in pdf, corresponding to RGB 30,144,255, will be:
0.117 0.564 1.000
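The same conversion can be done in Python instead of by hand. This is a small sketch with a made-up function name; it truncates to three decimals to match the worked example above. The program that produced a given pdf may have rounded instead (which would give 0.118 and 0.565 here), so check which form actually appears in the file before searching.

```python
import math


def pdf_rgb_operands(r, g, b, decimals=3):
    """Convert 0-255 RGB values to the 0..1 operands used by the rg operator."""
    factor = 10 ** decimals
    # Truncate (floor) rather than round, as in the worked example.
    values = (math.floor(v / 255 * factor) / factor for v in (r, g, b))
    return " ".join(f"{v:.{decimals}f}" for v in values)


print(pdf_rgb_operands(30, 144, 255))  # 0.117 0.564 1.000
```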
STEP 2 (for pdf editing)
we look for 0.117 0.564 1.000 in the pdf file with notepad++ (wrap around
and match whole word only need to be checked) and we find the internal
pdf representation of the background, which we can change from azure to, let's say,
white
1.000 1.000 1.000
or
1 1 1
but, since you wrote about black background, to be more precise, I
created a sample pdf with white background and black text
http://ge.tt/1N7Vuz91/v/0
since we know that 0.000 0.000 0.000 rg means black, we look for this
and we can change from 0.000 0.000 0.000 rg, to 1.000 1.000 1.000 rg
(white) BUT...
at the same time, if your text is black and you want to change its color to white, you also need to first change the text from black to another color, otherwise it will be invisible, white on white
so, we cannot simply change directly white background to black, at once,
since doing this, we have not a difference between color text and
background values
and then we act as follows:
we change white background from 1.000 1.000 1.000 into something like
0.5 0.5 0.5 (light grey)
http://ge.tt/1N7Vuz91/v/1 (resulting pdf - intermediate step)
then looking for
0.000 0.000 0.000 (black text) and change to **white**
1.000 1.000 1.000
resulting intermediate pdf file:
http://ge.tt/1N7Vuz91/v/2
finally, we change again the color of background from
0.5 0.5 0.5 (light grey)
to black
0.000 0.000 0.000
and we have now a vector pdf with white text and black background
http://ge.tt/1N7Vuz91/v/3
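If a script is an option, the intermediate grey step can be avoided by inverting every rg / RG operand in a single pass. The Python sketch below is an assumption-laden illustration, not part of the workflow above: it expects the text of an uncompressed content stream (for example from the pdftk uncompress step), uses a made-up function name, and only touches the rg and RG operators with plain decimal operands. The file still needs the pdftk repair and compress steps afterwards.

```python
import re

# Three numeric operands followed by the fill (rg) or stroke (RG) operator.
_RG = re.compile(
    r"(?<![\w.])(\d*\.?\d+)\s+(\d*\.?\d+)\s+(\d*\.?\d+)\s+(rg|RG)\b"
)


def invert_rgb_operators(content: str) -> str:
    """Replace each 'r g b rg' with its complement '1-r 1-g 1-b rg'."""
    def repl(match):
        inverted = (1.0 - float(match.group(i)) for i in (1, 2, 3))
        return " ".join(f"{v:.3f}" for v in inverted) + " " + match.group(4)

    return _RG.sub(repl, content)


stream = "1.000 1.000 1.000 rg\n0.000 0.000 0.000 rg"
print(invert_rgb_operators(stream))
```

Because each match is replaced independently, white and black swap in one go and there is no moment where text and background share a color.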
please, remember to
1 - compress this pdf again if you uncompressed it with pdftk before modifying it
2 - repair
pdftk file.pdf output fixed.pdf
there is another way, starting from postscript, to perform the same task,
but being you a windowsian, I guess the postscript way is the harder way
for you, but if someone (a linuxian from Torvald constellation) is
interested I can explain how do the same thing in postscript
not in this post to avoid to be too verbose
give a feedback, please, and feel free to ask more