Remove all vector paths from PDF [closed]

I'm looking for a way to remove all path objects from PDF file.
I suspect this can probably be done with JavaScript in Adobe Acrobat, but I would really appreciate a tip for doing it with the Ghostscript or MuPDF tools.
In any case, any working solution is acceptable as a correct answer.

To do this with Ghostscript you would have to modify the pdfwrite device. In fact you would probably have to do something similar for any PDF interpreter.
What do you consider a 'path' object? A shfill, for example? How about text? How about text using a Type 3 font (which constructs paths)?
What about clip paths?
If you really want to pursue this I can tell you where to modify pdfwrite, provided you don't mind recompiling Ghostscript.
It's probably a dumb question, but why do you want to do this? Is it possible there might be another solution to your problem? If all you want to do is remove filled paths (or indeed stroked paths), one solution would be to run the file through ps2write to get PostScript, prepend code to redefine 'fill' and 'stroke' as no-ops, and then run the file back through pdfwrite to get a PDF.
[Added after reading comments]
PDF doesn't have a 'path' object, unlike XObject, which is a type of object. Paths are created by a series of operations such as 'newpath', 'moveto', 'curveto' and 'lineto'. Once you have built a path you then operate on it with 'fill' or 'stroke'. Note that PDF doesn't have a 'text' object type either.
This is why your approach doesn't work: you can't remove 'path objects' because there aren't any; the paths are created in the content stream. You can use a Form XObject to do something similar, but then the path construction is in the Form's content stream; it still isn't a separate object.
The same is true of PostScript; these are NOT object-oriented languages in any sense. You cannot 'detect a vector object of type path' in either language because there are no objects. In practice anything which isn't an image is a vector object and is constructed from a path (and with clipping, even some images might be considered paths).
The piece of PostScript you have highlighted adds a rectangle to a path (paths need not be contiguous in either PDF or PostScript) and then fills it. Note that, as is usual practice in PostScript, these lines do not use the PostScript operators directly; they execute procedures which use the operators. The procedures are defined in the program prologue.
By the way, it looks like you used the pswrite device here (I can't be sure with such a small sample). If this is the case you really want to start with ps2write instead, otherwise you are going to end up with an awful lot of things degenerating into tiny filled rectangles (pswrite does this with many image types).
I didn't suggest that you try to 'decrypt' the ps2write output (it isn't encrypted, it's compressed).
What I suggested was to create a PostScript file, redefine the 'stroke' and/or 'fill' operators so that they do nothing, and then run the resulting PostScript program back through Ghostscript using the pdfwrite device. This will produce a PDF file where all stroked and/or filled objects are ignored.
[final addition]
I picked up your sample file and examined it.
I presume the bug you are seeing is that the PDF file uses a /Separation colour (surely it cannot fail to fill a rectangle) with an ICCBased alternate and no device space tint transform. In that case the current version of ps2write may solve your problem. It (currently; this is due to change) does not preserve /Separation colours and instead emits them as a device colour, by default RGB. So simply converting the file to PostScript and back to PDF may completely resolve your problem.
If you knew what the problem was, it would have been quicker to tell us; I could have given you that information and the work-around in the first place.
Using ps2write I then created a PostScript version of the file (notice that the Separation colours are now RGB) and prefixed the PostScript program with two lines:
/fill {newpath} bind def
/stroke {newpath} bind def
Note that you must use an editor which preserves binary. Then, running that PostScript program back through Ghostscript using the pdfwrite device, I obtain a PDF file where the green 'decoration' which I think you are having a problem with is gone.
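For reference, the whole round trip can be scripted. Here is a minimal sketch in Python (my choice; it assumes 'gs' is on the PATH — on Windows the executable is gswin32c or gswin64c — and the file names are placeholders):

import subprocess

SRC, PS, OUT = "input.pdf", "intermediate.ps", "no_paths.pdf"

# 1. Convert the PDF to PostScript with ps2write.
subprocess.run(["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=ps2write",
                "-sOutputFile=" + PS, SRC], check=True)

# 2. Prepend the redefinitions in binary mode, so that any binary
#    sections in the PostScript are not corrupted.
prologue = b"/fill {newpath} bind def\n/stroke {newpath} bind def\n"
with open(PS, "rb") as f:
    body = f.read()
with open(PS, "wb") as f:
    f.write(prologue + body)

# 3. Convert the modified PostScript back to PDF with pdfwrite.
subprocess.run(["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
                "-sOutputFile=" + OUT, PS], check=True)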
So, there's a solution to your question, and a possibly better way to solve your problem as well.

Related

Generating PDF from scratch, how are glyphs mapped to character codes?

I want to generate Portable Document Format (PDF) files from an original program of mine.
I am going to experiment with an original typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible.
So it would be ideal to avoid using XeTeX, LuaTeX, and other engines.
And I want to store the glyph information internally in my program or my library.
But where should the character codes be specified in the PDF so that the viewer program knows what the glyphs are when they are copied or searched?
To generate glyphs, my naive approach is to save, in a local library, raster images or Bézier curve parameters that correspond to the characters.
According to the PDF Reference, that seems quite possible.
I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose; at least those can be dealt with later.
Initially I thought I might generate PostScript and use Ghostscript to convert that to PDF.
But it has been pointed out here that PostScript does not support Unicode, which I will certainly use.
My options are then reduced to generating PDF directly, from scratch.
My confusion is this: though my brute-force approach may render correctly, I guess the resulting PDF would be such that the viewer can neither copy nor search the text, since I would have specified the character codes nowhere.
In PDF Reference p.122, we see that there are several different objects.
What seems relevant are text objects, path objects, and image objects.
Is it possible to associate an image object with its character code?
As I recall, there are some scanned PDFs, for example the freely previewable parts of scanned Google Books, in which you can copy strings correctly.
What is the method or field that specifies this?
However, in the various tables in the PDF Reference, I see no suitable slot for a Unicode code point.
Similarly, it is not clear how to associate a path object with its character code.
If this can be done, the envisioned project would be easiest, since I could just extract some open source fonts' Bézier curve parameters (I believe that can be done) and translate them myself into the PDF-allowed format.
If neither image objects nor path objects can hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code.
Maybe a more correct way would be to embed a custom font, synthesized at runtime, in the PDF.
This is mentioned verbally and briefly on p.364, sec. 5.8, "Embedded Font Programs".
That does seem rather difficult and requires tremendous research.
I would like you to recommend some tutorials for embedding fonts, as they are not easy to find.
In fact, I find example PDF files themselves already scarce, as most of them seem to come as LZ-compressed binary files (I guess).
Indeed, I tried to compile a "Hello world" PDF in a non-Computer-Modern font and open it with a text editor, and all I saw were blanks, control characters, and Mojibake-like strings.
In summary, how do I (if possible) represent a glyph by a text object, image object, or path object so that its character code can be known?
For concreteness, can you generate a PDF so that: there is shown a circle, but when you copy that, you copy the character "A"?
The association between the curves and the character code is the font. There are several tables involved that do the mappings. The font has an Encoding vector which is indexed by the character code and yields a glyph name. For copying out of the document, there must also be a ToUnicode vector which maps to Unicode code points.
If you study a simple example of a PostScript Type 3 font, that should be very beneficial in understanding a PDF font. I have a short one in this calendar program.
To answer the bold question: if you convert gridcal.ps to PDF, copying the moon glyph results in the character 1, because it is in the ASCII position for 1 in the Encoding vector. Some others of the glyphs, notably sun, mars and venus, are recognized by Ghostscript, which produces a mapping to the Unicode character. This is very clever, but probably not sufficiently extensive to rely upon (indeed, moon, mercury, jupiter and saturn are not recognized).
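To make the bold question concrete, here is a hedged sketch (Python is my choice here; the object layout, the /circle glyph name, and the uncompressed formatting are all assumptions of the sketch, not the only way to do it). It writes a minimal PDF in which a Type 3 font draws a circle for character code 65, while the /ToUnicode CMap maps that code to U+0041, so copying the circle yields "A":

def stream(data):
    # Wrap raw bytes in a PDF stream object body.
    return b"<< /Length %d >>\nstream\n%s\nendstream" % (len(data), data)

tounicode = (b"/CIDInit /ProcSet findresource begin\n"
             b"12 dict begin\nbegincmap\n"
             b"/CMapName /CircleToUni def /CMapType 2 def\n"
             b"1 begincodespacerange <00> <FF> endcodespacerange\n"
             b"1 beginbfchar <41> <0041> endbfchar\n"
             b"endcmap CMapName currentdict /CMap defineresource pop\n"
             b"end end")

# Glyph program: set the advance width (d0), then approximate a circle
# centred at (500,500) with radius 400 using four Bezier curves, and fill.
glyph = (b"1000 0 d0 900 500 m "
         b"900 721 721 900 500 900 c "
         b"279 900 100 721 100 500 c "
         b"100 279 279 100 500 100 c "
         b"721 100 900 279 900 500 c f")

objs = {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: (b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 7 0 R >>"),
    4: (b"<< /Type /Font /Subtype /Type3 /FontBBox [0 0 1000 1000] "
        b"/FontMatrix [0.001 0 0 0.001 0 0] /CharProcs 5 0 R "
        b"/Encoding << /Type /Encoding /Differences [65 /circle] >> "
        b"/FirstChar 65 /LastChar 65 /Widths [1000] /ToUnicode 6 0 R >>"),
    5: b"<< /circle 8 0 R >>",
    6: stream(tounicode),
    7: stream(b"BT /F1 48 Tf 100 600 Td (A) Tj ET"),  # shows code 65
    8: stream(glyph),
}

out, offsets = bytearray(b"%PDF-1.4\n"), {}
for num in sorted(objs):
    offsets[num] = len(out)
    out += b"%d 0 obj\n%s\nendobj\n" % (num, objs[num])
xref = len(out)  # byte offset of the xref table
out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objs) + 1)
for num in sorted(objs):
    out += b"%010d 00000 n \n" % offsets[num]
out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF"
        % (len(objs) + 1, xref))

with open("circle_copies_as_A.pdf", "wb") as f:
    f.write(bytes(out))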

GhostScript creating extra page when font errors occur

I have a process that needs to write multiple PostScript and PDF files to a single PostScript file generated by, and that will continue to be modified by, Word interop VB code. Each call to Ghostscript results in an extra blank page. I am using Ghostscript 9.27.
Since there are several technologies and factors here, I've narrowed it down: the problem can be demonstrated by converting a PostScript file to PostScript and then to PDF via the command line. The problem does not occur going directly from PostScript to PDF. Here's an example, and an example of the errors.
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=C:\testfont.ps C:\smallexample.ps
C:\>"C:\Program Files (x86)\gs\gs9.27\bin\gswin32c.exe" -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=C:\testfont.pdf C:\testfont.ps
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file %rom%Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Didn't find this font on the system!
Substituting font Times-Roman for TimesNewRomanPSMT.
I'm starting with the assumption that the font errors are the cause of the extra page (if only to rule that out; I know it is not certain). Since my ps->pdf test does not exhibit this problem and my ps->ps->pdf one does, I'm thinking Ghostscript is not writing font data that was in the original PostScript file to the one it is creating. I'm looking for a way to preserve/recreate that in the resulting PostScript file. Or, if that is not possible, I'll need a way to tell Ghostscript how to use those fonts. I did not have success attempting to include them as described in the GS documentation here: https://www.ghostscript.com/doc/current/Use.htm#CIDFontSubstitution.
Any help is appreciated.
I've made this an answer, even though I'm aware it doesn't answer the question, because it won't fit as a comment.
I think your assumption that the missing fonts are causing your problem is flawed. Many PDF files do not embed all the fonts they require, I've seen many such examples and they do not emit extra pages.
You haven't been entirely clear in your description of what you are doing. You describe two processes, one going from PostScript to PDF and one going from PostScript on to PostScript (why?) and then to PDF.
You haven't described why you are processing PostScript into a PostScript file.
In particular, you haven't supplied an example file to look at. Without that there's no way to tell whether your experience is in fact correct.
For example, it's entirely possible that you have set /Duplex true and have an odd number of pages in your file. This will cause an extra blank page (quite properly) to be emitted, because duplexing requires an even number of pages.
The documentation you linked to is for CIDFont substitution; it has nothing to do with font substitution. CIDFonts and Fonts are different things in PDF and (more particularly) PostScript. But I honestly doubt that is your problem.
I'd suggest that you put (at the least) 'smallexample.ps' somewhere public and post the URL here; that way we can at least follow the same steps you are doing and can probably tell you what's going on. An explanation of why you're doing this would be useful too. I would normally strongly suggest that you don't do extra steps like this; each step carries the risk of degrading the output in some way.
Thank you for the response. I am posting as an answer as well due to the comment length restrictions:
I think you are correct that my assumption about fonts is wrong. I have found the extra page in the second ps file and do not encounter the font errors until the second conversion.
I have a process that uses the VB MS Word interop libraries to print multiple documents to a single ps file using a virtual printer set up with Ghostscript and RedMon. I am adding functionality to mix in PDF files too. It works, but results in an extra page.
To narrow down where the problem actually was, I tried much simpler test cases via the command line. I only get the extra page when Ghostscript is converting ps to ps (whether or not there is a pdf as well). Converting ps to pdf, I do not get the extra page. Interestingly, I can work around the problem by converting the ps to pdf and then both pdfs back to ps. That is slower and should not be necessary, however, so I would like to identify and resolve the extra page issue.
I cannot share that particular file. I'll see if I can create an example I can share that also exhibits the problem. In the meantime, I can confirm that the source ps file is six pages, and the duplexing settings are as follows (a quick check of the page count and duplex requests is sketched after the snippet). There is a duplex definition in the resulting ps file with the extra page. Might there be some other common culprits I could check for in the source ps? Thank you.
featurebegin{
%%BeginFeature: *DuplexUnit NotInstalled
%%EndFeature
}featurecleanup
featurebegin{
%%BeginFeature: *Duplex None
<</Duplex false /Tumble false>> setpagedevice
%%EndFeature
}featurecleanup
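For what it's worth, the duplex theory can be checked mechanically. Here is the quick diagnostic sketch mentioned above, in Python (it assumes the file carries DSC comments, as printer-driver output usually does; the file name matches the example above):

import re

with open("testfont.ps", "rb") as f:
    data = f.read()

# Count the DSC page markers and collect any duplex requests.
pages = len(re.findall(rb"^%%Page:", data, re.MULTILINE))
duplex = re.findall(rb"/Duplex\s+(true|false)", data)
print("page markers:", pages)
print("duplex requests:", [d.decode() for d in duplex])
# An odd page count combined with /Duplex true would explain a padded
# blank page; the snippet above shows this file requests /Duplex false.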

Software RIP must be able to read or recognize color information in PDF

I use a printer from ROLAND and its software RIP "VersaWorks Dual". The RIP must be able to recognize or read the color information / layers so that it is possible to correct color values.
When I create the PDF (from a PS file) with Acrobat Distiller and set the Distiller parameter "/ColorConversionStrategy /LeaveColorUnchanged" the color information can be read.
But I want to use GhostScript and the color information cannot be recognized if created with GhostScript.
I tried several settings (-sColorConversionStrategy=/LeaveColorUnchanged, -dAutoFilterColorImages=false, -dAutoFilterGrayImages=false, -dPreserveHalftoneInfo=true) but without any positive result.
Has anybody an idea?
Thank you.
This sounds like something you need to take up with the supplier of the RIP software, presumably ROLAND.
However, clearly there is still some colour information which the RIP can read; you would get an error or wildly incorrect output if the colour information had been discarded or corrupted.
Therefore it seems likely to me that the colour information is still present, but may have been 'altered' in some way. In general however, the Ghostscript pdfwrite device goes to some considerable lengths to retain colour unaltered unless you specifically tell it not to.
If you make an example PostScript file available, and state the exact Ghostscript command line used, Ghostscript version and operating system I can take a look at it.
Since you clearly have a working solution with Acrobat Distiller, why do you want to use Ghostscript anyway?

Is there a reliable way to determine if a PDF was generated from a Powerpoint file?

Like the title says. Reason I ask is that we're converting PDFs to formatted ASCII text (using pdftotext) and only want to display the ones that look reasonably sane.
PPT files tend to have text over images, diagonal text and others things that don't translate to ASCII very well, so we'd like to filter them out if we can.
The creating application of a PDF is listed in its XMP metadata. You can see this quite easily in Acrobat 9 (and I believe earlier): go to File > Properties, click Additional Metadata..., then go to Advanced and it's listed under both XMP Core Properties and PDF Properties:
xmp:CreatorTool: Microsoft PowerPoint
pdf:Creator: Microsoft PowerPoint
I'm guessing you want to find this programmatically, so you'll need to find a library that reads this metadata and works with your language. Here is a list of some XMP tools.
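For example, a hedged sketch with the pypdf library (my choice; any library exposing the document information dictionary or the XMP metadata would do, and the file name is a placeholder):

from pypdf import PdfReader

reader = PdfReader("slides.pdf")
info = reader.metadata  # may be None if there is no Info dictionary
if info is not None:
    print("Creator: ", info.creator)
    print("Producer:", info.producer)
    if info.creator and "PowerPoint" in info.creator:
        print("Probably generated from PowerPoint")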
Short answer:
No, I don't think so.
Long answer:
No, I don't think so, because there are many ways to convert a PowerPoint file to PDF, for example Adobe Acrobat, PDFCreator and many, many others. It's up to the converter to embed specific information in the PDF file; even if you find a way to detect PowerPoint-sourced PDFs from one converter, the same method may not work for another.
Even longer answer:
No, I don't think so, for the reasons described in the "long answer". And I don't think detecting the source of the PDF is the best approach to the problem you are trying to solve. PowerPoint is not the only producer of overlapped text and images. I think it's much better to detect the actual layout of the PDF file: if there is overlap of image and text, do some filtering or pre-processing to cater for that.
Your reasoning is very arbitrary - there are surely plenty of PPT files without the features you describe, and plenty of PDF files with them, that were generated from another source.
In theory a better method would just be to detect when these "unwanted" situations occur. However, even though the PDF format is partly open (only for reading, apparently, so it's not truly an open format), extracting complex data like that would be incredibly difficult.
All PDFs can have this problem regardless of their source. Most desktop publishing suites are capable of outputting PDF and are often sold boasting their high quality and flashier PDF presentations ...
A "saner" method would be to use a PDF parser, ITextSharp, or pdfNet...etc, Using the library of your choice, find all image rectangles, and all text rectangles, SORT THE RECTANGLES, and then see if there is substantial overlap of text and image rects -- ignoring image to image overlaps. If so, reject the page and/or document.
That won't be perfect, but at least it's going to catch many PDFs that aren't sane, regardless of source. Other heuristics to add would include color analysis. (i.e. are the colors in the overlapping region sufficiently different to allow "sane" results?)
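As a hedged illustration of that heuristic, here is a sketch using PyMuPDF (my substitution for the libraries named above; the 20% threshold and the file name are arbitrary placeholders):

import fitz  # PyMuPDF

def page_looks_sane(page, max_overlap=0.2):
    blocks = page.get_text("dict")["blocks"]
    texts = [fitz.Rect(b["bbox"]) for b in blocks if b["type"] == 0]
    images = [fitz.Rect(b["bbox"]) for b in blocks if b["type"] == 1]
    for t in texts:
        for im in images:
            inter = fitz.Rect(t)
            inter.intersect(im)  # clip to the common area
            # Reject when a text rect is substantially covered by an image.
            if not inter.is_empty and inter.get_area() > max_overlap * t.get_area():
                return False
    return True

doc = fitz.open("candidate.pdf")
if not all(page_looks_sane(page) for page in doc):
    print("rejecting: text overlaps images")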
Best of luck to you
It might put its name in the creator or producer info, but I don't have a copy to check this theory with.
In general, it is not an easy task to programmatically determine (reliably) where a file came from or how it was generated based on its contents. After all, a file is just a collection of bits.
Unless you have a lot of resources to expend building the heuristics to determine whether a file looks "reasonably sane" according to your needs, I would consider this a task for human beings.
Some converters from PPT to PDF preserve the creator in comments at the beginning of the PDF.
I think PDFs generated by most applications seem much the same. They may have some metadata that you can read from the file...

Structure of a PDF file? [closed]

For a small project I have to parse PDF files and extract a specific part of them (a simple chain of characters). I'd like to use Python to do this and I've found several libraries that are capable of doing what I want in some way.
But now, after some research, I'm wondering what the real structure of a PDF file is. Does anyone know if there is a spec or some explanation available online? I've found a link on Adobe's site but it seems to be a dead link :(
Here is a link to Adobe's reference material
http://www.adobe.com/devnet/pdf/pdf_reference.html
You should know though that PDF is only about presentation, not structure. Parsing will not come easy.
I found the GNU Introduction to PDF to be helpful in understanding the structure. It includes an easily readable example PDF file that they describe in complete detail.
Other helpful links:
The PDF Succinctly book is longer and has helpful pictures.
Introduction to the Insides of PDF is a presentation that isn't as in-depth but gives a quick overview and has lots of pictures.
When I first started working with PDF, I found the PDF Reference very hard to navigate.
It might help you to know that the overview of the file structure is found in the Syntax chapter, and that what Adobe calls the document structure is the object structure, not the file structure; that is also found in Syntax. The description of operators is hidden away in Appendix A, which is very useful for understanding what is happening in content streams. If you ever have the pain of working with colour spaces you will find that hidden in the Graphics chapter! Hopefully these pointers will help you find things more quickly than I did.
If you are using Windows, PDFTron's CosEdit allows you to browse the object structure to understand it. There is a free demo available that allows you to examine the file but not save it.
Here's the raw reference of PDF 1.7, and here's an article describing the structure of a PDF file. If you use Vim, the pdftk plugin is a good way to explore the document in an ever-so-slightly less raw form, and the pdftk utility itself (and its GPL source) is a great way to tease documents apart.
I'm trying to do pretty much the same thing. The PDF reference is a very difficult document to read. This tutorial is a better start I think.
This may help shed a little light:
(from page 11 of PDF32000.book)
PDF syntax is best understood by considering it as four parts, as shown in Figure 1:
• Objects. A PDF document is a data structure composed from a small set of basic types of data objects. Sub-clause 7.2, "Lexical Conventions," describes the character set used to write objects and other syntactic elements. Sub-clause 7.3, "Objects," describes the syntax and essential properties of the objects. Sub-clause 7.3.8, "Stream Objects," provides complete details of the most complex data type, the stream object.
• File structure. The PDF file structure determines how objects are stored in a PDF file, how they are accessed, and how they are updated. This structure is independent of the semantics of the objects. Sub-clause 7.5, "File Structure," describes the file structure. Sub-clause 7.6, "Encryption," describes a file-level mechanism for protecting a document's contents from unauthorized access.
• Document structure. The PDF document structure specifies how the basic object types are used to represent components of a PDF document: pages, fonts, annotations, and so forth. Sub-clause 7.7, "Document Structure," describes the overall document structure; later clauses address the detailed semantics of the components.
• Content streams. A PDF content stream contains a sequence of instructions describing the appearance of a page or other graphical entity. These instructions, while also represented as objects, are conceptually distinct from the objects that represent the document structure and are described separately. Sub-clause 7.8, "Content Streams and Resources," discusses PDF content streams and their associated resources.
Looks like navigating a PDF file will require a little more than a passing effort.
If you want to parse PDFs using Python, have a look at PDFMiner. It is one of the best libraries to date for parsing PDF files.
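As a minimal usage sketch (assuming the maintained pdfminer.six fork; the file name is a placeholder):

from pdfminer.high_level import extract_text

text = extract_text("some.pdf")  # concatenated text of all pages
print(text[:500])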
Didier Stevens has a tool to parse PDFs:
http://didierstevens.com/files/software/pdf-parser_V0_4_3.zip
or here:
http://blog.didierstevens.com/programs/pdf-tools/ which catalogs several related PDF-analysis tools.
Another tool is here:
http://mshahzadlatif.wordpress.com/2011/09/28/view-pdf-structure-using-adobe-acrobat-or-a-free-tool-called-pdfxplorer/
Extracting text from PDF is a hard problem because PDF has such a layout-oriented structure. You can see the docs and source code of my barely-successful attempt on CPAN (my implementation is in Perl). The PDF data structure is very cool and well designed, but it's easier to write than read.
One way to get some clues is to create a PDF file consisting of a blank page. I have CutePDF Writer on my computer; I made a blank WordPad document of one page, printed it to a .pdf file, and then opened the .pdf file using Notepad.
Next, use a copy of this file and eliminate lines or blocks of text that might be of interest, then reload it in Acrobat Reader. You'd be surprised at how little information is needed to make a working one-page PDF document.
I'm trying to make up a spreadsheet to create a PDF form from code.
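On the "how little is needed" point, here is a hedged sketch in Python (my choice; the output file name is a placeholder) that writes a working blank one-page PDF by hand, computing the cross-reference offsets as it goes:

objs = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
]
out, offsets = bytearray(b"%PDF-1.4\n"), []
for num, body in enumerate(objs, start=1):
    offsets.append(len(out))
    out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
xref = len(out)  # byte offset of the xref table
out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objs) + 1)
for off in offsets:
    out += b"%010d 00000 n \n" % off
out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF"
        % (len(objs) + 1, xref))
with open("blank.pdf", "wb") as f:
    f.write(bytes(out))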
You need the PDF Reference manual to start reading about the details and structure of PDF files. I suggest to start with version 1.7.
On Windows I used a free tool, PDF Analyzer, to see the internal structure of PDF files.
This will help in your understanding when reading the reference manual.
(I'm affiliated with PDF Analyzer, no intention to promote)
To extract text from a PDF, try this on Linux, BSD, etc. machine or use Cygwin if on Windows:
pdftotext -layout some_pdf_file.pdf
A plain text file named some_pdf_file.txt is created. The simpler the PDF file layout, the more straightforward the .txt file output will be.
Octal escape sequences are frequently present in the .txt file output and will look strange in text editors. These escapes usually represent curly single and double quotes, bullet points, hyphens, etc. from the PDF.
To see the context where the escape sequences appear, run this grep command, and keep the original PDF handy to see what character the codes represent in the PDF:
grep -a --color=always "\\\\[0-9][0-9][0-9]" some_pdf_file.txt
This will provide a unique list of the different octal codes in the document:
grep -ao "\\\\[0-9][0-9][0-9]" some_pdf_file.txt|sort|uniq
To convert these octal escapes to their ASCII equivalents, a combination of grep, sed, and bc can be used; I'll post the procedure to do that soon.
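In the meantime, here is a hedged alternative in Python (my substitution for the promised grep/sed/bc pipeline; the file names are placeholders) that decodes the octal escapes directly:

import re

with open("some_pdf_file.txt", "rb") as f:
    data = f.read()

# Replace each \ddd octal escape with the single byte it encodes.
decoded = re.sub(rb"\\([0-7]{3})",
                 lambda m: bytes([int(m.group(1), 8)]),
                 data)

with open("some_pdf_file_decoded.txt", "wb") as f:
    f.write(decoded)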