How to get bounding boxes of elements in EPS files - pdf

I need to check if an EPS/PDF file contains any vector elements.
First I convert the PDF to EPS and remove all text elements and images from the file like this:
pdftocairo -f $page_number -l $page_number -eps $input - | sed '/BT/,/ET/ d' | sed '/^8 dict dup begin$/,/^Q$/ c Q' > $output
But how can I then check if any elements are written to the canvas?

What do you mean, exactly, by 'vector elements'? Anything except an actual bitmap image? Why do you care? Perhaps if you explained what you want to achieve it would be easier to help you.
Note that the approach you are using is by no means guaranteed to work; there can easily be 'elements' in the file which won't be removed by your rather basic approach to finding images.
You could use Ghostscript; run the file to a bitmap and specify -dFILTERTEXT and -dFILTERIMAGES. Then examine the pixels of the bitmap to see if any are non-white. If they are, then there was vector content in the file. You could probably use something like ImageMagick to count the colours and see if there's more than 1.
Or run the file to bitmap twice, once normally, and once with -dFILTERVECTOR. Compare the two bitmaps (MD5 on them would be sufficient). If there are no differences then there was no vector content.
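A minimal sketch of both approaches (untested; the device, resolution and file names are placeholders, and a multi-page file would need an output pattern such as page-%03d.png):
gs -o check.png -sDEVICE=png16m -r72 -dFILTERTEXT -dFILTERIMAGES input.pdf
identify -format '%k\n' check.png    # ImageMagick: count unique colours; more than 1 means something was drawn
gs -o full.png -sDEVICE=png16m -r72 input.pdf
gs -o novector.png -sDEVICE=png16m -r72 -dFILTERVECTOR input.pdf
md5sum full.png novector.png         # identical checksums => no vector content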

Any PDF that has vector elements will use at least one of the path painting operators. According to chapter 8 of the PDF standard, those are:
S, s, f, F, f*, B, B*, b, b*, n
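For example (an illustrative fragment, not taken from any particular file), a content stream that fills a rectangle and strokes a line would contain something like:
0 0 100 50 re
f
200 200 m
300 250 l
S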
Of course, since PDF files can be complex, you'll also need the file in a standard, uncompressed form. You can do that using the qpdf program's QDF format (apt install qpdf if you don't have it).
qpdf -qdf schedule.pdf - | egrep -m1 -q '\b[SsfFBbn]\*?$' && echo Yup
That'll print "Yup" if the file schedule.pdf has vector graphics in it.
Note: I think this will do the job for you, but it is not foolproof. It's possible to get false negatives if your PDF is loading vectors from an external file, embedding raw PostScript, or doing some other trickiness. And, of course, it can have false positives (for example, a file that draws a completely transparent, 0 pt dot in white ink on a white background).

Other answers have addressed identifying the drawing operators in a plain text stream. For the other question,
But how can I then check if any elements are written to the canvas?
For this, the elements need to be part of a content stream that is referred to
in the /Contents member of the Page object.
If you read in all of the pdf objects, there will be a tree connecting all the content streams to the Root object declared in the trailer.
Trailer: /Root is a reference to the Document Catalog object
Document Catalog: /Pages is an array of Page objects or Pages nodes
Page: /Contents is an array of references to Content Stream objects that draw the elements of the page
It is possible for there to be stray content stream objects that are not referenced in the Document tree. By traversing the Pages tree you could collect any and all actual content and then feed that result to one of the solutions from the other answers.
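With qpdf this traversal can be sketched from the command line; --show-pages lists each page object together with the content streams referenced by its /Contents, and --show-object with --filtered-stream-data dumps one of those streams decompressed (the object number 12,0 below is made up; take it from the --show-pages output):
qpdf --show-pages input.pdf
qpdf --show-object=12,0 --filtered-stream-data input.pdf | egrep -m1 -q '\b[SsfFBbn]\*?$' && echo Yup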

Related

Avoiding fragmenting of text extracted from PDF after processing with Ghostscript

After processing with Ghostscript, I sometimes see whitespace breaking up the words as seen with pdftotext or in a PDF viewer when searching or selecting. Possibly unrelated but the anomalies seem to correspond with kerning variations in the rendered font.
Is there a way to avoid this?
For example, from GS 9.23 (also occurred with earlier versions):
gs -sDEVICE=pdfwrite \
-dNOPAUSE -dQUIET -dPARANOIDSAFER -dBATCH \
-sOutputFile=./output.pdf input.pdf
Excerpt from pdftotext input.pdf:
Review this manual before
operating deep cleaner
while pdftotext output.pdf gives:
Re vie w t his m a nua l be fore
ope ra t ing de e p c le a ne r
Ghostscript and the pdfwrite device (as explained in VectorDevices.htm) do not simply 'fiddle' with the input when producing a PDF file. The input (from whatever source: PDF, PostScript, XPS, PCL, PCL-XL) is fully interpreted into marking operations, and those marking operations are sent to the device, which turns them back into PDF constructs.
So the low level (PDF) format describing the page need not bear any relation to the low level format of the input. In particular you cannot expect the PDF operations in the input to be reflected in the output.
The visual appearance will be the same (or should be, because that's the main goal), but the actual operations won't be.
The reason for the difference in the text output is that, basically, there is no 'metadata' in a PDF file that describes words, paragraphs, columns etc. When you extract text from a PDF file, what you actually get is a series of character codes and positions.
It's up to the text extraction code to try and make some sense of that. I'd guess that pdftotext is using the rather naive approach of assuming that text strings are words.
This is problematic because there are numerous different ways to handle kerning, justification and other spacing in PDF. You could do something like:
(Te) Tj
10 0 Td
(st) Tj
Or:
[(Te) 2 (st)] TJ
The pdfwrite device doesn't know what the original was, so what it emits could be either of those, depending on some heuristics. The chances of it matching the original are low.
I suspect that pdftotext would regard the first operation as "Te st" and the second operation as "Test".
One possible solution would be to use Ghostscript's txtwrite device to extract the text; it might do a better job.
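For example (file names are placeholders):
gs -sDEVICE=txtwrite -o extracted.txt input.pdf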
As with your other question, it would be best to supply examples when asking these kinds of questions, because without that it's pretty much guesswork.
TL;DR
Is there a way to avoid this?
No.

Replace colors in PDF using ghostscript

How can I convert a single color from a PDF document into another color, for example convert all instances of #ff0000 (red) to #ffffff (white)?
I've seen a number of ghostscript commands doing something similar (using setcolor, setcolortransfer), but I can't find a solution for this exact problem.
For example, the following will create an image-negative of the input PDF:
gs -o output.pdf -sDEVICE=pdfwrite -c "{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer" -f input.pdf
I'd like to move beyond this to a higher level of control, focusing on a single color being replaced with a different (not its negative) color.
Essentially, you can't (or at least not using Ghostscript).
Firstly, you seem to be assuming that the colours will be specified in RGB, when in fact they could be specified in CMYK, ICC, CalRGB or Lab. You also need to consider Indexed colour spaces.
Secondly, Ghostscript does not 'edit' PDF files; when you send a PDF file as input to Ghostscript, it is fully interpreted into graphics primitives and the primitives are processed.
When the output is PDF, the primitives are reassembled into a new PDF file. The goal of this process is that the visual appearance of the new PDF file should match the original. It is NOT the same PDF file; its internals will likely be completely different.
Finally, how do you plan to handle images? Will you process those byte by byte to massage the colours? Or do you plan to ignore them? The same goes for shadings, where the colours aren't even present in the PDF file directly, but are generated from functions.
Without knowing why you want to do this I can't even offer a different approach, other than: decompress the PDF file, read it, and replace the colours manually.
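If you do want to go the manual route, a rough sketch with qpdf and GNU sed might look like the following. It only catches colours set directly in content streams via the RGB operators rg/RG written exactly as '1 0 0'; images, shadings, patterns and other colour spaces are left untouched, and the file names are placeholders:
qpdf --qdf --object-streams=disable input.pdf uncompressed.pdf
sed -i 's/\b1 0 0 \(rg\|RG\)\b/1 1 1 \1/g' uncompressed.pdf
fix-qdf uncompressed.pdf > output.pdf
fix-qdf (shipped with qpdf) repairs the stream lengths and cross-reference data that the in-place edit may have invalidated.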

How to search my PDF with grep?

I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (listings below not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications that convert fonts to so-called 'outlines'. The reason for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name             type    encoding  emb sub uni object ID
---------------- ------- --------- --- --- --- ---------
TCQJEF+CMCSC10   Type 1  Builtin   yes yes no      96  0
VPAFLY+CMBX12    Type 1  Builtin   yes yes no      97  0
CWAIXW+CMTI12    Type 1  Builtin   yes yes no      98  0
OBMDLT+CMR12     Type 1  Builtin   yes yes no      99  0
In this case, text extraction (and your method of grepping for strings) should work:
Even though the column named uni (telling if a toUnicode map is embedded in the PDF file) says no for each single font, the encoding column does not contain custom but builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1).
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!

View source on pdf rendered inside a browser?

I have a report in Cognos. The output is rendered as a PDF inside the browser itself. Now the images are not showing up in the PDF. They show up fine in HTML. Now if they were not showing up in HTML, I would do a view source, check the image URL and go from there. But when a PDF is rendered inside a browser, is there a way to do some kind of a 'View Source'?
As already recommended in comments, use a PDF browser like RUPS (based on iText) or any other one. Select the desired page, open its /Contents value, select the stream and you'll see something like this
/T1_0 1 Tf
0.0004 Tc -0.0002 Tw 13.98 0 0 13.98 189.87 476.67 Tm
(Praise for the First Edition)Tj
/T1_1 1 Tf
0.056 Tw 9.99 0 0 9.99 108.18 437.34 Tm
[(Each aspect is explained with numer)19(ous ex)]TJ
where text is to be displayed. Commands ending with Tf select the font for the text, those ending with Tc or Tw select the character or word spacing, those ending with Tm manipulate the text matrix and so position, rotate, stretch, etc. the text to be printed, and those ending with Tj or TJ actually print text.
Or you'll see something like this:
533.352005 0 0 668.2319946 -1.2660065 -1.0559998 cm
/Im0 Do
where some XObject is to be displayed. Commands ending with cm manipulate the current transformation matrix (again for positioning, rotating, stretching, etc.), and those ending with Do print a XObject.
What a given XObject is can be seen in the /XObject entry in the /Resources of the page, e.g.:
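(The following is an illustration only; the name Im0, the object number and all dictionary values are made up.)
/Resources << /XObject << /Im0 15 0 R >> >>

15 0 obj
<< /Type /XObject /Subtype /Image
   /Width 800 /Height 600
   /ColorSpace /DeviceRGB /BitsPerComponent 8
   /Filter /DCTDecode /Length 123456 >>
stream
… image data …
endstream
endobj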
So the XObject is an image (see the value of /Subtype).
Therefore in your case
Now the images are not showing up in the pdf.
you should inspect the page in a likewise manner and search for something like the excerpts above. If you don't find an XObject referenced (and also don't find a command sequence BI … key-value pairs … ID … image data … EI in a contents stream; that sequence defines an inline image), there is no image on that PDF page. Otherwise there is an image which for some reason does not show up.
There actually can be a number of other commands, too, and also other kinds of XObjects. For more details have a look at the PDF specification ISO 32000-1:2008 (made available by Adobe here), especially chapters 8 and 9.
... or search the web for the exact problem
http://www-01.ibm.com/support/docview.wss?uid=swg21339267
Although it doesn't explain why it works for HTML but not for PDF, most searches indicate that it is a web server security problem, and enabling anonymous authentication in your images folder might fix it.

Unable to understand PDF Function (type 4) stream syntax

stream
{ 360 mul sin
2 div
exch 360 mul sin
2 div
add
}
endstream
Can someone please explain this syntax to me?
This doesn't look like PDF to me:
stream and endstream are PDF keywords, yes.
But the rest rather looks like PostScript.
So stream and endstream could also be PostScript variables or functions, defined elsewhere (before) in the same code...
As PostScript, the code means:
{ and } are just separators that structure the code into blocks
360 mul sin: multiply by 360 (multiply what? => the value that's topmost on the stack), compute the sine of the result (PostScript's sin works in degrees), and put this as the topmost value on the stack.
2 div: divide the topmost value on the stack by 2.
exch 360 mul sin: swap the 2 topmost items on the stack, multiply the item that's now topmost by 360, compute its sine and put the result back on the stack.
2 div: divide the topmost value on the stack by 2.
add: add the 2 topmost values on the stack.
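As a worked example (my own illustration, not from the file): per the PDF spec the input values are pushed in order, so with inputs x = 0.25 and y = 0 the value y ends up topmost; the stack is shown bottom -> top, and 360 mul maps the 0..1 input range onto a full period because sin works in degrees:
initial stack:   0.25 0
360 mul sin  ->  0.25 0      (sin(0 * 360°) = 0)
2 div        ->  0.25 0
exch         ->  0 0.25
360 mul sin  ->  0 1         (sin(0.25 * 360°) = sin 90° = 1)
2 div        ->  0 0.5
add          ->  0.5
So the whole procedure computes sin(360·x)/2 + sin(360·y)/2, which fits the 'spot function' reading below.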
Update:
Ouch!
I had completely forgotten about the details of the (very limited) PostScript function objects which the PDF spec allows inside PDF documents. These represent self-contained and static numerical transformations.
So my above explanation of the PostScript code as a calculator function is still valid, and it looks to me like describing a 'spot function' for a halftone screen. (However, stream and endstream in this context of course do keep their original meanings as PDF keywords, and the curly braces { and } are required to enclose the function definition.)
Since the PDF spec for these PostScript function objects does not allow the use of arrays, variables, names or strings, but only integers, reals and booleans as values, the processing of these code segments doesn't require a fully fledged PostScript interpreter, and this statement in the spec:
"PDF is not a programming language, and a PDF file is not a program."
still applies and still makes PDF very different from PostScript (which is a programming language, and PS files are programs).
Keeping in mind that PostScript is a stack-based language, and understanding its code by thinking of a pocket calculator that uses 'reverse Polish notation' conventions, will help you wrap your mind around this topic...
It's a PostScript program which executes on the raw data to provide the end values. You will need a PostScript interpreter (at least for the calculator subset used by type 4 functions) to handle it.