Use Ghostscript to set PDF natural language via pdfmarks

I'm setting metadata on PDFs using Ghostscript and pdfmarks. I'm able to set just about everything I need (e.g. Title, Author, Bookmarks) using pdfmarks. However, I cannot set the Natural Language. I'm sure I'm just missing the correct syntax, as I've looked over the Adobe documentation and see it listed in there.
This is what I have tried:
[ /Type /Catalog /Lang (en-US) /StPNE pdfmark
[ /Subtype /document /Lang (en-US) /StPNE pdfmark
Neither of these works, unfortunately. Does anyone know the correct syntax to add a language?

StPNE is a logical structure pdfmark, but the latest pdfmark reference I can find (version 9 from 2008) does not list /Lang as a legal attribute for a logical structure pdfmark.
I note that the PDF specification does permit /Lang to be a member of a logical structure element, but that doesn't mean there's a pdfmark for it. I think Adobe stopped updating the pdfmark reference with new content for new versions of the PDF specification.
/Type /Catalog won't be legal either.
Can you explain which part of the resulting PDF you are trying to add this to? Ghostscript only implements the pdfmarks listed in the pdfmark reference, and I don't think it fully implements all of those currently.
[EDIT]
I just checked and Ghostscript's pdfwrite device does not implement the StPNE pdfmark at all, so that's not going to do anything.
[further edit]
It may be (looking at the PDF specification) that what you want is to set a key called /Lang in the Catalog object of the PDF file. Obviously I'm not certain but....
[{Catalog} <</Lang (en-US)>> /PUT pdfmark
puts a key called /Lang in the Catalog dictionary, and assigns it the string value (en-US). That may be sufficient, I can't tell.
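As a rough sketch of how that pdfmark could be generated and passed to Ghostscript, the Python below builds the /PUT line and a matching command line. The file names (input.pdf, lang.pdfmark, output.pdf) and the helper names are assumptions for illustration; the pdfmark itself is the one shown above.

```python
from typing import List


def ps_string(text: str) -> str:
    """Return text as a PostScript literal string, escaping backslash, ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def catalog_lang_pdfmark(lang: str) -> str:
    """Build the /PUT pdfmark that adds a /Lang key to the Catalog dictionary."""
    return f"[{{Catalog}} <</Lang {ps_string(lang)}>> /PUT pdfmark\n"


def gs_command(src: str, marks: str, dst: str) -> List[str]:
    """Ghostscript arguments; the pdfmark file is listed after the input PDF."""
    return ["gs", "-sDEVICE=pdfwrite", "-o", dst, src, marks]


# Hypothetical file names; save the first line printed as lang.pdfmark.
print(catalog_lang_pdfmark("en-US"), end="")
print(" ".join(gs_command("input.pdf", "lang.pdfmark", "output.pdf")))
```

Whether a Catalog-level /Lang is enough for a given accessibility checker is not something this sketch can confirm.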

Related

Is there any pdf analyser / debugger to debug a pdf file?

Is there any validator or PDF analyser which can tell me what is wrong with a PDF I created with a hint or indicator which object in my PDF is wrong or something like that?
I would like to create and understand the PDF file format better. I think I am pretty close to a working PDF, but I cannot find the problem in it or work out why PDF readers are not able to read it.
Isn't there a program or an online service which can give me at least a hint what is wrong with my pdf structure or where the problem occurs or even tell me what is wrong? How to debug something like that?
Here is a link to the PDF (just the attached image converted to a PDF):
https://nonepatchwork.patchwork3d.de/create_pdf/created_pdf.pdf
Best regards and thank you very much in advance
Fuchur

The request for a validator or PDF analyser is off topic on Stack Overflow; it's better suited for the Software Recommendations site. In this answer, therefore, I'll focus on analyzing the provided example files.
created_pdf.pdf
Here a number of issues immediately leap to the eye, in particular:
Your page object 5 points to object 6 as content stream, but object 6 is not a content stream but an image xobject! (You probably meant to point at object 7.)
All your cross reference table offsets are wrong.
The Size entry in the trailer is wrong.
There is an /ID between trailer dictionary and startxref.
There probably are more issues, but start by fixing these.
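To illustrate the cross-reference and Size points, here is a minimal Python sketch that serializes a handful of objects and records each byte offset while writing, so the table cannot drift out of sync. The object contents are a made-up one-page example, not the objects from created_pdf.pdf. If an /ID is wanted, it belongs inside the trailer dictionary; it is simply omitted here.

```python
def build_pdf(objects):
    """Serialize object bodies (bytes) into a PDF with a classic xref table.

    objects[i] becomes object number i + 1, generation 0; object 1 is the Catalog.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of the "N 0 obj" keyword
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    size = len(objects) + 1  # /Size also counts the free entry for object 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"  # every xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out), offsets


content = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    # /Contents points at object 4, which really is a content stream.
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842]"
    b" /Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
]
pdf, offsets = build_pdf(objects)
print(offsets)
```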
created_pdf_2.pdf
Here you fixed the errors listed above but the file still does not display as expected, Adobe Acrobat Reader in particular says:
Looking at the image dictionary the cause becomes clear:
6 0 obj <<
/Type /XObject
/Subtype /Image
/Width 595.276
/Height 841.89
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Filter/DCTDecode
/Interpolate true
/Length 707
>>
...
The values of Width and Height are floats, which is invalid. Furthermore, inspecting the actual image data in the stream shows that the values are completely incorrect: the image is only 20×20 in size.
Thus, replace the Width and Height entries with
/Width 20
/Height 20
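One way to avoid guessing these values is to read them from the JPEG data itself, since a DCTDecode stream carries its dimensions in a start-of-frame (SOF) segment. The sketch below is a minimal parser under the assumption that the stream is a well-formed JPEG; the sample bytes are a hand-built header for demonstration, not the stream from the file discussed above.

```python
import struct

# Start-of-frame markers carry the dimensions (0xC4, 0xC8 and 0xCC are not SOF).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_size(data: bytes):
    """Return (width, height) read from the first SOF segment of a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before the real marker
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            # Segment layout: length(2) precision(1) height(2) width(2) ...
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + length  # skip marker bytes plus the segment
    raise ValueError("no SOF segment found")


# Hand-built header: SOI, a dummy APP0 segment, then SOF0 for a 20x20 image.
sample = (
    b"\xff\xd8"
    + b"\xff\xe0" + struct.pack(">H", 16) + b"\x00" * 14
    + b"\xff\xc0" + struct.pack(">HBHHB", 17, 8, 20, 20, 3) + b"\x00" * 9
)
print("/Width %d\n/Height %d" % jpeg_size(sample))
```

The returned values are integers, which is what the Width and Height entries require.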

ghostscript settings to open PDF as full page

I know from some colleagues, who design our leaflets in InDesign and store them as PDFs, that there is a setting to view a file in full page mode when opening it.
I wrote a script to "merge" some of these documents using the Ghostscript device -pdfwriter and the option -dPDFFitPage (edited after KenS' answer).
Here is my full command:
gs -dBATCH -sDEVICE=pdfwrite -dNO_PDFMARK_OUTLINES -dPDFFitPage -o output.pdf cover.pdf input1.pdf input2.pdf input3.pdf pdfmarks
But "-dPDFFitPage" does not do what I am expecting. The page width is fitted to the screen, but I would like the whole page to fit on the screen. I also heard that using "/FIT" in the pdfmarks would help, but it doesn't either.
If anybody can help me, I would be very thankful.
Best regards
Mike

OK, after a few more hours of reading the pdfmark reference and asking Google various questions, I arrived at my final, satisfying solution:
[ /PageMode /UseOutlines /Page 1 /View [ /Fit] /PageLayout /SinglePage /DOCVIEW pdfmark
So I simply added /PageLayout /SinglePage, and now it opens in full page mode in the reader window, showing bookmarks (/UseOutlines). When scrolling, it scrolls page by page, so every step of the mouse wheel is one page. This works perfectly now.
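For a merge script, the DOCVIEW pdfmark and the Ghostscript arguments can be generated together. This is only a sketch: the function names and default values are assumptions, the keys are the ones used in the pdfmark above, and the command mirrors the one from the question without -dPDFFitPage.

```python
def docview_pdfmark(page=1, view="/Fit", page_mode="/UseOutlines",
                    page_layout="/SinglePage"):
    """Build a DOCVIEW pdfmark that tells the viewer how to open the file."""
    lines = [
        f"[ /PageMode {page_mode}",
        f"  /Page {page} /View [{view}]",
        f"  /PageLayout {page_layout}",
        "  /DOCVIEW pdfmark",
    ]
    return "\n".join(lines) + "\n"


def merge_command(output, inputs, marks_file):
    """Ghostscript arguments: the inputs in order, then the pdfmark file last."""
    return ["gs", "-sDEVICE=pdfwrite", "-dNO_PDFMARK_OUTLINES",
            "-o", output, *inputs, marks_file]


# Save the printed pdfmark text as the file named "pdfmarks" before running gs.
print(docview_pdfmark(), end="")
print(" ".join(merge_command("output.pdf", ["cover.pdf", "input1.pdf"], "pdfmarks")))
```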

I found a solution to my problem; perhaps it will help others, so I am posting it as an answer. KenS' answer was a big help in solving my problem. Thanks to him.
[ /PageMode /UseOutlines
/Page 1 /View [/Fit]
/DOCVIEW pdfmark
This sets the magnification of the PDF file to "windows size". With Acrobat Reader and Acrobat standard, it works pretty well. Other readers are not tested.
Best regards
Mike

There is no option -dpdfwriter. The fact that PDFFitPage doesn't do what you want isn't surprising; it has no effect on what a PDF viewer will do. This option (which is described in the documentation) only has any effect when used with a pre-defined fixed media size. It creates a new PDF where the content of the original PDF is scaled so that it fits onto the fixed media size.
If you want to include directions to PDF viewers on how to open PDF documents, then you need to look at the pdfmark operator. Specifically, you will need to construct a DOCVIEW pdfmark as described on pages 29 and 30 of the version 9 pdfmark reference.

Build a TOC when Concatenating PDFs

I have a dozen essays as PDFs which I want to combine to one concatenated master PDF with a table of content where each entry is a clickable link to the first page of each essay. The TOC could be either a page with internal links or a proper PDF TOC.
The best would be a command line solution on Linux and macOS. So far I have used QPDF, which works great for concatenating the essay PDFs, but it does not build a TOC.
It is a one-off problem, so I am happy to write some (bash, Python or other) scripting code to generate this TOC. For utility it is important that the links are clickable.
Any idea how to do this?

As I already noted, you can create a TOC page manually and append or prepend it to the file.
To make the TOC clickable, you need to add link annotations to it. After some quick googling I made the following example using Ghostscript:
gs -o output.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf an.txt
And an.txt file contains the following:
[ /Subtype /Link
/SrcPg 1
/Rect [10 10 50 50]
/Page 2
/ANN pdfmark
Here SrcPg is the number of the page to put the annotation on, Rect is the area to make clickable, and Page is the destination page number.
You can find more details on annotation syntax here and here. Hope it helps.
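For a dozen essays, writing these annotations by hand gets tedious, so the file can be generated. The sketch below assumes a single TOC page prepended to the concatenated essays and one TOC line per essay. The layout numbers (left, top, width, line_height) and the page counts are hypothetical and would have to match how the TOC page is actually typeset.

```python
def toc_link_pdfmarks(page_counts, toc_pages=1, left=72, top=720,
                      width=400, line_height=20):
    """Return /ANN pdfmarks linking TOC lines to the first page of each essay.

    page_counts lists the pages per essay in concatenation order. The TOC is
    assumed to come first, so the first essay starts on page toc_pages + 1.
    """
    marks = []
    first_page = toc_pages + 1
    for index, count in enumerate(page_counts):
        upper = top - index * line_height  # the PDF y axis points upwards
        rect = (left, upper - line_height, left + width, upper)
        marks.append(
            "[ /Subtype /Link\n"
            "  /SrcPg 1\n"
            f"  /Rect [{rect[0]} {rect[1]} {rect[2]} {rect[3]}]\n"
            f"  /Page {first_page}\n"
            "  /ANN pdfmark\n"
        )
        first_page += count
    return "".join(marks)


# Three essays of 12, 8 and 15 pages behind a one-page TOC.
an_txt = toc_link_pdfmarks([12, 8, 15])
print(an_txt, end="")
```

The printed text can be saved as an.txt and passed to the gs command shown above.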

PDF special searching iOS

I know that there's a great library for PDF searching on iOS: PDFKitten.
But I have encountered some PDF files for which its search doesn't work. I tried opening these files in the 'Preview' app on Mac and searching there, and it works.
I uploaded one file here.
You can check by opening this file in the 'Preview' app and searching for the word 'ra'. It works perfectly. But if you drag this file into the PDFKitten project, configure it so that it opens the file, and then try to search, it doesn't work.
I inspected the source. It handles all the text-showing operators, including Tj, ', '' and TJ. I placed some log lines in the callbacks for these operators and saw that the callbacks are not called.
Can you give me some suggestions or ideas?

If I understand the code correctly, PDFKitten looks for fonts only in the /Font entry of the /Resources dictionary of the page. At least that's my interpretation of the method fontCollectionWithPage of Scanner, the result of which is queried by setFont in pdfScannerCallbacks to set the current font object.
Furthermore, there is no callback for the Do operator (i.e. the operator used to inject the contents of an XObject resource into the page content). Unless CGPDFScannerScan interprets this operator under the hood, the content of included XObjects is not scanned at all. This would match your observation that the text-showing operator callbacks never get called.
Your file mundo1.pdf, though, does not have a /Font entry directly in the /Resources dictionaries of its pages. Instead, the actual content of each page is wrapped in a single /XObject resource. These XObjects in turn have their own /Resources dictionaries, which contain a /Font entry defining the fonts used on the respective page.
Thus, PDFKitten does not know anything about the fonts used in your file, especially about their encodings, and so cannot extract the text from the PDF contents. Maybe it does not even get to see the PDF contents to interpret.
I would, therefore, propose you post this issue on the PDFKitten issue management site.
By the way, this PDF construct fully conforms to the PDF specification. Nonetheless, it looks like an inappropriate use of the iText library. The author of the software using iText this way should review their code and switch to better-suited classes of the iText library.
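The difference between a page-level font lookup and one that also descends into form XObjects can be sketched with plain dictionaries. This is a toy model in Python, not PDFKitten's Objective-C code, and the resource names (/Xf1, /F1) are invented. Font names are really scoped per resources dictionary, so merging them into one flat dictionary is a simplification.

```python
def collect_fonts(resources, seen=None):
    """Gather fonts from a resources dict, descending into form XObjects."""
    seen = set() if seen is None else seen
    fonts = dict(resources.get("/Font", {}))
    for xobject in resources.get("/XObject", {}).values():
        # Skip images and anything already visited (guards against cycles).
        if id(xobject) in seen or xobject.get("/Subtype") != "/Form":
            continue
        seen.add(id(xobject))
        fonts.update(collect_fonts(xobject.get("/Resources", {}), seen))
    return fonts


# A page shaped like the one described: no /Font at page level, one form XObject.
page_resources = {
    "/XObject": {
        "/Xf1": {
            "/Subtype": "/Form",
            "/Resources": {"/Font": {"/F1": "Helvetica"}},
        }
    }
}
print(page_resources.get("/Font", {}))  # what a page-level lookup finds
print(collect_fonts(page_resources))    # what a recursive lookup finds
```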

Adding internal hyperlink to a pdf

I have a PDF document to which I want to add internal hyperlinks.
Specifically, page 1 contains a table of contents which I want to make clickable.
My idea is to create rectangular boxes in predetermined locations on page 1, which should link to pages 2, 3, ...
I found this post which talks about adding internal hyperlinks using the method I described above.
http://bugs.ghostscript.com/show_bug.cgi?id=691531
However, when I try to use this technique in my file, the script just ADDS pages with the rectangle and hyperlink.
I need it to overlay the hyperlink on the existing contents of my first page.

You can do this with Ghostscript, using the pdfmark operator.
For some introduction to the pdfmark topic, see also Thomas Merz's PDFmark Primer.
For an example to achieve a similar thing, see this answer: Merge PDF's with PDFTK with Bookmarks?
Alternatively, you could...
...use qpdf to expand all (compressed) internal PDF streams into ASCII,
...edit the PDF source code (using the know-how acquired from the PDFmark Primer),
...use qpdf again to re-compress the PDF streams.

This is what I used:
Ghostscript function call from MATLAB:
-o output.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress original.pdf script.ps
Postscript code saved in script.ps:
[ /Rect [10 10 50 50]
/Page 2
/SrcPg 1
/Subtype /Link
/ANN pdfmark

There is currently (as of 2020) a piece of freeware for Windows that allows adding hyperlinks. PDF-XChange Editor, which has a free demo version, lets you manually draw hyperlinks on the page (arbitrary rectangles) and set the target location (page). It is offered at no cost, but it is not "free as in libre" software.
