Wrong encoding when updating PDF metadata using Ghostscript and pdfmark - pdf

I have a base PDF file, and I want to update the title to Chinese (UTF-8) text using Ghostscript and pdfmark, with a command like the one below:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=result.pdf base.pdf pdfmarks
And the pdfmarks file (encoded as UTF-8 without a BOM) is below:
[ /Title (敏捷开发)
/Author (Larry Cai)
/Producer (xdvipdfmx (0.7.8))
/DOCINFO pdfmark
The command executes successfully, but when I check the properties of result.pdf, the title has been changed to æŁ‘æ“·å¼•å‘.
Please give me hints on how to solve this. Are there any parameters in the gs command or pdfmark that would help?

The PDF Reference states that the Title entry in the document information dictionary is of type 'text string'. Text strings are defined as using either PDFDocEncoding or UTF-16BE with a byte order mark (see page 158 of the PDF Reference, version 1.7).
So you cannot specify a Title using UTF-8 without a BOM.
I would imagine that if you replace the Title string with one encoding the content as UTF-16BE with a BOM, it will work properly. I would suggest using a hex string rather than a regular PostScript string to specify the data, simply for ease of use.

Using the idea from Happyman Chiu, my solution is as follows. Get a UTF-16BE string with a BOM by:
echo -n '(敏捷开发)' | iconv -t utf-16 |od -x -A none | tr -d ' \n' | sed 's/./\U&/g;s/^/</;s/$/>/'
You will get <FEFF0028654F63775F0053D10029> (0028 and 0029 are the literal parentheses included by the echo). Substitute this for the title:
/Title <FEFF0028654F63775F0053D10029>
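As a variant, the same kind of hex string can be produced without relying on the host byte order by asking iconv for UTF-16BE explicitly and prepending the FEFF BOM by hand. A minimal sketch (the `to_pdf_hex` helper name is ours, not part of the commands above):

```shell
# Sketch: emit a PDF hex string (UTF-16BE with a FEFF BOM) for a UTF-8 input.
# Requesting utf-16be explicitly avoids depending on the host byte order,
# which the plain `iconv -t utf-16` + `od -x` pipeline above does.
to_pdf_hex() {
  printf '<FEFF%s>' "$(printf '%s' "$1" | iconv -f utf-8 -t utf-16be \
    | od -An -tx1 | tr -d ' \n' | tr 'a-f' 'A-F')"
}

to_pdf_hex '敏捷开发'
```

This yields <FEFF654F63775F0053D1>, i.e. the title without the surrounding parentheses that the echo-based pipeline above includes.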

Following the question 'pdfmark for docinfo metadata in pdf is not accepting accented characters in Keywords or Subject':
I use this function to create the string from UTF-8 for info.txt, to be used by the gs command.
function str_in_pdf($str) {
    // Convert the UTF-8 input to UTF-16 (BOM included) and dump it as hex words.
    $cmd = sprintf("echo '%s' | iconv -t utf-16 | od -x -A none", $str);
    exec($cmd, $out, $ret);
    // Whitespace is allowed inside a PDF hex string, so the od output
    // can be wrapped in angle brackets as-is.
    return "<" . implode("", $out) . ">";
}
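For instance, the function's output can be substituted for the literal strings in the pdfmarks file (a sketch; the hex value shown is the UTF-16BE title from the first question, and in practice would come from str_in_pdf's output):

```
[ /Title <FEFF654F63775F0053D1>
/Author (Larry Cai)
/DOCINFO pdfmark
```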

Related

Convert text file with ansi colours to pdf

I have a plain text file containing ANSI escape codes for colouring text. If it matters, I generated this file using the Python package colorama.
How can I convert this file to PDF with the colours properly rendered? I was hoping for a simple solution using e.g. pandoc, but any other command-line or programmatic solution would be fine as well. Example:
echo -e '\e[0;31mHello World\e[0m' > example.txt
# Convert to pdf
pandoc example.txt -o example.pdf # Gives error

ghostscript not retaining page level parameter while merging two postscripts

I have converted a PDF file to PostScript using Ghostscript. During conversion, I passed a page-level parameter for the duplex option as below.
gswin32c.exe -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=output.ps \
-c "<</PSPageOptions [ (<</Duplex false>> setpagedevice)
(<</Duplex true>> setpagedevice) (<</Duplex true>> setpagedevice) ]
/LockDistillerParams true>> setdistillerparams" -f input.pdf
Refer to this answer for the above command: https://stackoverflow.com/a/64128881/13696415
Now I have added the duplex parameter for two PDF files and converted them to two individual PostScript files. The problem is that when I merge these files with Ghostscript, it loses the page-level parameters I passed while converting to PS. I tried the answer suggested below to merge the PostScript files:
https://stackoverflow.com/a/3445325/13696415
Why is it losing the added parameters while merging? How can I retain page-level parameters while merging?
I can confirm the %%BeginPageSetup entries for setpagedevice are lost when merging two PostScript files. Even /LockDistillerParams fails to preserve the settings. Simply running the PostScript files through the Ghostscript ps2write device again causes the output to drop the previous settings. I suspect Ghostscript rewrites these on every pass and, if /PSPageOptions is missing, drops them. I don't know of a way to preserve the settings when merging.
I have tried two other techniques with good results.
(1) Merge the two PostScript files, then use the ps2write device to write the desired settings into the combined PostScript file.
gs -dBATCH -dNOPAUSE -sDEVICE=ps2write -sOutputFile=merged.ps -f file1.pdf file2.pdf
gs -dBATCH -dNOPAUSE -sDEVICE=ps2write -sOutputFile=merged-out.ps -c ' << /PSPageOptions [ (<</Duplex false>> setpagedevice) (<</Duplex true>> setpagedevice) (<</Duplex true>> setpagedevice) ] /LockDistillerParams true >> setdistillerparams ' -f merged.ps
(2) Use Ghostscript to merge the two PDF files using the ps2write device with the /PSPageOptions setdistillerparams included, for an all-in-one operation. I have found this only works for certain PDF files; it does not work if the PDFs were generated with the cairo graphics library (used by my Firefox, for example), even if they were redistilled with Ghostscript.
My test here was with two well-behaved 12-page PDF files. The results show the % page3 string at page 13, as desired (the five options cycle over the pages). The strings can be changed to use setpagedevice as needed:
gs -dBATCH -dNOPAUSE -sDEVICE=ps2write -sOutputFile=file1+2.ps -c '<< /PSPageOptions [(% page1)(% page2)(% page3)(% page4)(% page5)] /LockDistillerParams true >>setdistillerparams' -f file1.pdf file2.pdf
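The /PSPageOptions array is just one PostScript string per page, so for longer documents it can be generated rather than typed out. A minimal sketch (the `make_pspageoptions` helper name is ours; first page simplex, the rest duplex, mirroring the three-entry example earlier):

```shell
# Sketch: build the setdistillerparams fragment for n pages,
# first page simplex (Duplex false), remaining pages duplex (Duplex true).
make_pspageoptions() {
  n=$1
  opts='<< /PSPageOptions ['
  i=1
  while [ "$i" -le "$n" ]; do
    if [ "$i" -eq 1 ]; then
      opts="$opts (<</Duplex false>> setpagedevice)"
    else
      opts="$opts (<</Duplex true>> setpagedevice)"
    fi
    i=$((i + 1))
  done
  printf '%s ] /LockDistillerParams true >> setdistillerparams\n' "$opts"
}

make_pspageoptions 3
```

The output can then be passed to gs with -c, exactly as in the commands above.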
P.S. Note that the device must be specified as -sDEVICE=ps2write, with -sOutputFile naming the output file. The trailing backslash line continuation can also be omitted, depending on the shell.

Pandoc, markdown to pdf doesn't wrap long words in paragraphs

I'm trying to generate a clean PDF from Markdown using Pandoc and xelatex.
When I convert:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I end up with the long word running past the right margin instead of wrapping.
Here is the command I use to generate the PDF:
/usr/local/bin/pandoc --verbose \
--chapters --from=markdown+yaml_metadata_block -S \
--latex-engine=xelatex \
--listings -H listings-setup.tex \
--template template.pdf \
--toc --chapters \
-o test.pdf \
metadata.yml \
test.md
I use the document class report.
I have tried different things from inside the template and the extra header I'm using, but I have no idea which template Pandoc uses when generating paragraphs.
My template.pdf (extracted from Pandoc) contains the following, but it doesn't seem to apply here:
\setlength{\emergencystretch}{3em} % prevent overfull lines
You have a few possibilities. Since pandoc uses LaTeX for PDF generation, these are adapted from this LaTeX answer:
Annotate the proper language:
---
lang: en-GB
---
rest of document
Use soft hyphens inside a word to explicitly mark the allowed break points. You can either use the Unicode character or the HTML entity &shy;, which pandoc will convert automatically for LaTeX etc. For example: cryp­to­graphy
Specify exceptions via \hyphenation{cryp-to-graphy}
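For example, both the language annotation and the hyphenation exceptions can live in the document's YAML metadata block (a sketch; header-includes is a standard pandoc metadata field, and the word chosen is only an illustration):

```
---
lang: en-GB
header-includes:
  - \hyphenation{cryp-to-graphy}
---
```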

ghostscript - remove only specific text in PDF file

Ghostscript lets you generate a new PDF with all text removed from a source PDF with this simple command:
gs -o output_no_text.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
My goal is to delete just one specific fixed string from the source PDF, not all of the text. Is there a parameter I can set to do this?

Troubles with CP1251 printed file from Word

I have a bunch of PDF (1.4) files printed from Word with Adobe Distiller 6. The fonts are embedded (Tahoma and Times New Roman, which I have on my Linux machine) and the encoding says "ANSI" and "Identity-H". By ANSI, I assume the regional code page of the Windows machine is used, which is CP1251 (Cyrillic); as for "Identity-H", I assume that is something only Adobe knows about.
I want to extract only the text and index these files. The problem is that I get garbage output from pdftotext.
I tried exporting an example PDF file from Acrobat, and again got garbage, but additionally processing it with iconv gave me the right data:
iconv -f windows-1251 -t utf-8 Adobe-exported.txt
But the same trick doesn't work with pdftotext:
pdftotext -raw -nopgbrk sample.pdf - | iconv -f windows-1251 -t utf-8
which by default assumes UTF-8 encoding and outputs garbage, after which: iconv: illegal input sequence at position 77
pdftotext -raw -nopgbrk -enc Latin1 sample.pdf - | iconv -f windows-1251 -t utf-8
throws garbage again.
In /usr/share/poppler/unicodeMap I don't have a CP1251 map, and I couldn't find one with Google, so I tried to make my own. I created the file from the Wikipedia CP1251 data, and appended at the end of the file what the other maps had:
...
fb00 6666
fb01 6669
fb02 666c
fb03 666669
fb04 66666c
so that pdftotext does not complain, but the result of:
pdftotext -enc CP1251 sample.pdf -
is the same garbage again. hexdump does not reveal anything at first sight, and I thought I would ask about my problem here before desperately trying to conclude something from these hexdumps.