I have a bunch of PDF (1.4) files printed from Word with Adobe Distiller 6. Fonts are embedded (Tahoma and Times New Roman, which I have on my Linux machine) and the encoding says "ANSI" and "Identity-H". By "ANSI" I assume the regional code page of the Windows machine is used, which is CP-1251 (Cyrillic); "Identity-H" I assume is something only Adobe knows about.
I want to extract just the text and index these files. The problem is that I get garbage output from pdftotext.
I tried exporting an example PDF to text from Acrobat, and again got garbage - but additional processing with iconv got me the right data:
iconv -f windows-1251 -t utf-8 Adobe-exported.txt
But the same trick doesn't work with pdftotext:
pdftotext -raw -nopgbrk sample.pdf - | iconv -f windows-1251 -t utf-8
which by default assumes UTF-8 encoding, and outputs some garbage, after which iconv aborts with: iconv: illegal input sequence at position 77
pdftotext -raw -nopgbrk -enc Latin1 sample.pdf - | iconv -f windows-1251 -t utf-8
throws garbage again.
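(A sketch that may help with diagnosis: dump the first bytes pdftotext actually emits, to compare them against the CP1251 values you expect. od -t x1z prints hex bytes with a printable-character column alongside:
pdftotext -raw -nopgbrk sample.pdf - | head -c 64 | od -A d -t x1z
)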
In /usr/share/poppler/unicodeMap I don't have a CP1251 map, and I couldn't find one with Google, so I tried to make one. I created the file from the Wikipedia CP1251 table, and appended at the end what the other maps had:
...
fb00 6666
fb01 6669
fb02 666c
fb03 666669
fb04 66666c
so that pdftotext does not complain. But the result from:
pdftotext -enc CP1251 sample.pdf -
is the same garbage again. hexdump does not reveal anything at first sight, and I thought I'd ask about my problem here before desperately trying to conclude something from these hexdumps.
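(As an aside, such a map can also be generated mechanically rather than typed in from Wikipedia. A rough, untested sketch, assuming bash, and assuming the "unicode-hex output-byte-hex" line format the shipped maps use:
# derive a CP1251 unicodeMap from iconv's own conversion tables
for b in $(seq 128 255); do
  byte=$(printf '%02x' "$b")
  # decode the single CP1251 byte to its Unicode code point; skip undefined bytes
  cp=$(printf "\x$byte" | iconv -f windows-1251 -t UTF-32BE 2>/dev/null \
       | od -An -tx1 | tr -d ' \n')
  [ -n "$cp" ] && printf '%s %s\n' "${cp#0000}" "$byte"
done > CP1251
The ASCII range and the ligature entries would still need to be appended, as above.)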
Related
I have a plain text file containing ANSI escape codes for colouring text. If it matters, I generated this file using the Python package colorama.
How can I convert this file to PDF with the colours properly rendered? I was hoping for a simple solution using e.g. pandoc, but any other command-line or programmatic solution would be fine as well. Example:
echo -e '\e[0;31mHello World\e[0m' > example.txt
# Convert to pdf
pandoc example.txt -o example.pdf # Gives error
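One hedged approach (assuming the aha and wkhtmltopdf tools are available; both are commonly packaged) is to turn the ANSI escapes into coloured HTML first, then print the HTML to PDF:
aha < example.txt > example.html      # ANSI escape codes -> coloured HTML
wkhtmltopdf example.html example.pdf  # HTML -> PDF, colours preserved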
The program a2ps does not support UTF-8; at least my version only supports the latin-X encodings:
a2ps --list=encoding
Version:
GNU a2ps 4.14
How can I convert simple UTF-8 text to PostScript or PDF?
If what you actually want is to use a2ps or enscript (which is a similar tool), and your only need is to use them with some UTF-8 document, you just have to convert your document to ISO-8859-1 or some other supported encoding first. Various tools allow this. For instance, here is a workflow for enscript (you can surely do the same with a2ps):
cat document.txt | iconv -c -f utf-8 -t ISO-8859-1 | enscript -o document.ps
But you may lose some characters during the conversion because such encodings have a smaller range than UTF-8.
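For instance, the a2ps variant of the same pipeline might look like this (a sketch; latin1 is the encoding name a2ps --list=encoding typically reports, and a2ps reads stdin when no input file is given):
iconv -c -f utf-8 -t ISO-8859-1 document.txt | a2ps --encoding=latin1 -o document.ps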
On the other hand, if UTF-8 is a requirement, you may rather have to look for some recent tool that can convert UTF-8 to PDF. I wrote a Python program called txt2pdf myself; you may find it here. Also have a look at tools like pandoc, gimli, rst2pdf or wkhtmltopdf.
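For instance, a hedged pandoc invocation for plain UTF-8 text (this assumes pandoc 2.x with a xelatex installation; older pandoc spells the flag --latex-engine, and the font name is just an example of an installed Unicode font):
pandoc document.txt -o document.pdf --pdf-engine=xelatex -V mainfont="DejaVu Sans"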
You can use Vim. Open the file and execute the command :hardcopy > output.ps in normal mode. You can also do this directly from the shell. Executing
$ vim -c ":hardcopy > output.ps" -c ":quit" input.txt
in your shell will open Vim, generate the output.ps, and then close Vim.
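If the final target is a PDF, the generated PostScript can be chained straight into ghostscript's ps2pdf (a small sketch, assuming ghostscript is installed):
vim -c ":hardcopy > output.ps" -c ":quit" input.txt && ps2pdf output.ps output.pdf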
Use paps! For instance, I use it as follows:
paps --font="Monospace 10" input.txt > output.ps
and I have no problem with UTF-8 encoding.
If you need a PDF file, then run:
ps2pdf output.ps
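The two steps can also be combined into a single pipe, since ps2pdf reads PostScript from stdin when given - as the input file:
paps --font="Monospace 10" input.txt | ps2pdf - output.pdf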
I've gotten acceptable results (for printing code listings) from https://github.com/arsv/u2ps
https://gitlab.com/gnomify/u2ps is the replacement for gnome-u2ps.
If the text file is small, paps can convert the text to PostScript, which can then be fed to ps2pdf. The problem is that the .ps file from paps causes ps2pdf to create a very big PDF file; if that is acceptable, this approach works. Currently, I am getting a very large PDF from paps.
There's a utility based on gnome libraries named gnome-u2ps. It has less functionality than a2ps, and it seems it is no longer maintained.
In brief, I'm dealing with a problematic PDF, which:
Cannot be fully rendered in a document viewer like evince, because of missing font information;
However - ghostscript can fully render the same PDF.
Thus -- regardless of what ghostscript uses to fill in the blanks (maybe fallback glyphs, or a different method of accessing fonts) -- I'd like to be able to use ghostscript to produce ("distill") an output PDF where pretty much nothing is changed except that font information is added, so that evince can render the document the same way ghostscript does.
My question is thus: is this possible to do at all; and if so, what would the command line be to achieve something like that?
Many thanks in advance for any answers,
Cheers!
Details:
I'm actually on an older Ubuntu 10.04, and I might be experiencing - not a bug - but an installation problem with evince (lack of poppler-data package), as noted in Bug #386008 “Some fonts fail to display due to “Unknown font tag...” : Bugs : “poppler” package : Ubuntu.
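(A hedged aside: where the package is available in the release's repositories, installing it may be all that evince needs to pick up the CJK mappings; I haven't verified this on 10.04 itself:
sudo apt-get install poppler-data
)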
However, that is exactly what I'd like to handle, so I'll use the fontspec.pdf attached to that post ("PDF triggering the bug.") to demonstrate the problem.
evince
First, I open this pdf's page 3 in evince; and evince complains:
$ evince --page-label=3 fontspec.pdf
Error: Missing language pack for 'Adobe-Japan1' mapping
Error: Unknown font tag 'F5.1'
Error (7597): No font in show
Error: Unknown font tag 'F5.1'
Error (7630): No font in show
Error: Unknown font tag 'F5.1'
Error (7660): No font in show
Error: Unknown font tag 'F5.1'
...
The rendering looks like this:
... and it is obvious that some font shapes are missing.
Adobe acroread
Just a note on how Adobe's Acrobat Reader for Linux behaves; the following command line:
$ ./Adobe/Reader9/bin/acroread /a "page=3" fontspec.pdf
... generates no output to the terminal whatsoever (for more on the /a switch, see Man page acroread) -- and the program has absolutely no problem displaying the fonts.
Also, while I'd like to avoid the roundtrip to PostScript, note that acroread itself can be used to convert a PDF to PostScript:
$ ./Adobe/Reader9/bin/acroread -v
9.5.1
$ ./Adobe/Reader9/bin/acroread -toPostScript \
-rotateAndCenter -choosePaperByPDFPageSize \
-start 3 -end 3 \
-level3 -transQuality 5 \
-optimizeForSpeed -saveVM \
fontspec.pdf ./
Again, the above command line generates no output to the terminal; -optimizeForSpeed -saveVM are there because they apparently deal with fonts; the last argument ./ is the output directory (the output file is automatically called fontspec.ps).
Now, evince can display the previously missing fonts in the fontspec.ps output - but again complains:
$ evince fontspec.ps
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
...
... and furthermore, all text seems to be flattened to curves in the PostScript - so one cannot select the text of the .ps file in evince anymore (note that the .ps file cannot be opened in acroread). However, one can convert this .ps back into a .pdf again:
$ pstopdf fontspec.ps # note, `pstopdf` has no output filename option;
# it will automatically choose 'fontspec.pdf',
# and overwrite previous 'fontspec.pdf' in
# the same directory
... and now the text in the output of pstopdf is selectable in evince, all fonts are there, and evince doesn't complain anymore. However, as noted, I'd like to avoid the roundtrip to PostScript files altogether.
display (from imagemagick)
We can also observe the same page of the document with imagemagick's display (note that image panning from the command line using display is apparently still not available, so I've used -crop below to adjust the viewport):
$ display -density 150 -crop 740x450+280+200 fontspec.pdf[2]
**** Warning: considering '0000000000 00000 n' as a free entry.
...
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
... which generates some Ghostscript-ish errors - and results in something like this:
... where it's obvious that the missing fonts that evince couldn't render are now shown properly by imagemagick's display.
ghostscript
Finally, we can use ghostscript itself as an x11 viewer -- to observe the same page of the same document:
$ gs -sDEVICE=x11 -g740x450 -r150x150 -dFirstPage=3 \
-c '<</PageOffset [-120 520]>> setpagedevice' \
-f fontspec.pdf
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 74.
Page 3
>>showpage, press <return> to continue<<
^C
... and results in this output:
In conclusion: ghostscript (and apparently by extension, imagemagick) can seemingly find the missing font (or at least some replacement for it), and render a page with that -- even if evince fails at that for the same document.
I would, therefore, simply like to export a PDF version from ghostscript that would have the missing fonts embedded and no other processing applied; so I try this:
$ gs -dBATCH -dNOPAUSE -dSAFER \
-dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
-dAutoFilterMonoImages=false \
-dAutoFilterGrayImages=false \
-dAutoFilterColorImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-sDEVICE=pdfwrite \
-dFirstPage=3 -dLastPage=3 \
-sOutputFile=mypg3out.pdf -f fontspec.pdf
GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
**** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 3.
Page 3
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
... but it doesn't work - the output file mypg3out.pdf suffers from the exact same problems in evince as noted previously.
Note: While I'd like to avoid the postscript roundtrip, a good example of a gs command line for pdf-to-ps conversion with font embedding is here: (#277826) pdf - How to make GhostScript PS2PDF stop subsetting fonts; but the same command-line switches for .pdf to .pdf do not seem to have any effect on the problem described above.
OK, point 1: you CANNOT use Ghostscript and pdfwrite to create a PDF file 'without any additional processing'.
The way that pdfwrite and Ghostscript work is to fully interpret the incoming data (PostScript, PDF, XPS, PCL, whatever), creating a series of graphics primitives, which are passed to the pdfwrite device. The pdfwrite device then reassembles these into a brand new PDF file.
So it's not possible to take a PDF file as input and manipulate it; it will always create a new file.
Now, I would suggest that you upgrade your 9.02 Ghostscript to 9.05 to start with. Missing CIDFonts are much better handled in 9.05 (and will be further improved in 9.06 later this year). (The font you are missing 'Osaka Mono' is in fact a CIDFont, not a regular font)
Using the current bleeding edge Ghostscript code produces a PDF file for me which has the missing font embedded. I can't tell if this will work for you because my copy of evince renders the original file perfectly well.
Added later
Examining the original PDF file I see that the fonts there are indeed embedded (as I would expect, since they are subsets). So in fact as you say in your own answer above, the problem is not font embedding, but the use of CIDFonts.
My answer here will not help you, as pdfwrite will still produce a CIDFont in the output. Basically this is a flaw in your version or installation of evince.
The problem with 'remapping' the characters is that a Font is limited to 256 glyphs, while a CIDFont has effectively no limit. So there is no way to put a CIDFont into a Font. The only way to do this would be to create multiple Fonts, each of which contained a portion of the original, and then switch between them as required. Slow and clunky.
If you convert to PostScript using the ps2write device then it will do this for you, but you stand a great risk that in the process it will convert the vector glyph data into bitmaps, which will not scale well.
Fundamentally you can't really achieve what you want (convert 1 CIDFont into N regular Fonts) with Ghostscript, or in fact with any other tool that I know of. While it's technically possible, there is no real point, since all PDF consumers should be able to handle CIDFonts. If they can't, then it's a bug in the PDF consumer.
Right, I got a bit further on this (but not completely) - so I'll post a partial answer/comment here.
Essentially, this is not a problem about font embedding in PDF - this is a problem with font mapping.
To show that, let's analyse the mypg3out.pdf, which was extracted by gs in the OP (from the 3rd page of the fontspec.pdf document):
$ pdffonts mypg3out.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Error: Missing language pack for 'Adobe-Japan1' mapping
CAAAAA+Osaka-Mono-Identity-H         CID TrueType      yes yes yes     19  0
GBWBYF+CMMI9                         Type 1C           yes yes yes     28  0
FDFZUN+Skia-Regular_wght13333_wdth11999 TrueType       yes yes yes     16  0
ZRLTKK+Optima-Regular                TrueType          yes yes yes     30  0
ZFQZLD+FPLNeu-Bold                   Type 1C           yes yes yes      8  0
DDRFOG+FPLNeu-Italic                 Type 1C           yes yes no      22  0
HMZJAO+FPLNeu-Regular                Type 1C           yes yes yes     10  0
RDNKXT+FPLNeu-Regular                Type 1C           yes yes yes     32  0
GBWBYF+Skia-Regular_wght13333_wdth11999 TrueType       yes yes no      26  0
As the output shows, all fonts are indeed embedded, so something else is the problem. (It would have been more difficult to observe this in the complete fontspec.pdf, as there are a ton of fonts there, and a ton of error messages.)
The crucial point (I think) here, is that there is:
only one "Error: Missing language pack for 'Adobe-Japan1' mapping" message; and
only one CID TrueType font, which is CAAAAA+Osaka-Mono-Identity-H
There seems to be an obvious relationship between the CID TrueType font and the 'Adobe-Japan1' mapping error; I finally got that clarified by CID fonts - How to use Ghostscript:
CID fonts are PostScript resources containing a large number of glyphs (e.g. glyphs for Far East languages, Chinese, Japanese and Korean). Please refer to the PostScript Language Reference, third edition, for details.
CID font resources are a different kind of PostScript resource from fonts. In particular, they cannot be used as regular fonts. CID font resources must first be combined with a CMap resource, which defines specific codes for glyphs, before it can be used as a font. This allows the reuse of a collection of glyphs with different encodings.
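In PostScript terms, the combination the quote describes is done with the composefont operator; a minimal sketch with illustrative resource names (both the /H CMap and the /HeiseiMin-W3 CIDFont must actually be installed for this to run):
gs -dNODISPLAY -c '/HeiseiMin-W3-H /H [/HeiseiMin-W3] composefont pop quit'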
All good - except here we're dealing with PDF fonts, not PostScript fonts as such; let's demonstrate that a bit.
For instance, 5.3. Using Ghostscript To Preview Fonts - Making Fonts Available To Ghostscript - Font HowTo describes how the Ghostscript-installed script called prfont.ps can be used to render a table of fonts.
However, here it is easier to follow Listing Ghostscript Fonts [gs-devel] and use the resourcestatus operator to query for a specific font - which doesn't require a special .ps script:
$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c 'currentpagedevice (*) {=} 100 string /Font resourceforall'
...
Processing pages 1 through 1.
Page 1
URWAntiquaT-RegularCondensed
Palatino-Italic
Hershey-Gothic-Italian
...
$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c '/TimesNewRoman findfont pop [/TimesNewRoman /Font resourcestatus]'
....
Processing pages 1 through 1.
Page 1
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Loading TimesNewRomanPSMT font from /usr/share/fonts/truetype/msttcorefonts/times.ttf... 2549340 1142090 3496416 1237949 1 done.
We got a list of fonts; however, those are system fonts available to ghostscript - not the fonts embedded in the PDF!
(Basically,
gs -o /dev/null -dNODISPLAY -f mypg3out.pdf -c 'currentpagedevice (*) {=} 100 string /Font resourceforall' | grep -i osaka will return nothing, and
-c '/CAAAAA+Osaka-Mono-Identity-H findfont pop [/CAAAAA+Osaka-Mono-Identity-H /Font resourcestatus]' will conclude with "Didn't find this font on the system! Substituting font Courier for CAAAAA+Osaka-Mono-Identity-H.")
To list the fonts in the PDF, the pdf_info.ps script file from Ghostscript (not installed by default, but present in the sources) can be used:
$ wget "http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/toolbin/pdf_info.ps" -O pdf_info.ps
$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsNeeded pdf_info.ps
...
No system fonts are needed.
$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsUsed -dShowEmbeddedFonts pdf_info.ps
...
Font or CIDFont resources used:
CAAAAA+Osaka-Mono
DDRFOG+FPLNeu-Italic
FDFZUN+Skia-Regular_wght13333_wdth11999
GBWBYF+CMMI9
GBWBYF+Skia-Regular_wght13333_wdth11999
GTIIKZ+Osaka-Mono
HMZJAO+FPLNeu-Regular
RDNKXT+FPLNeu-Regular
ZFQZLD+FPLNeu-Bold
ZRLTKK+Optima-Regular
So finally we can observe the CAAAAA+Osaka-Mono in Ghostscript - although I wouldn't know how to query more specific information about it from within ghostscript.
In the end, I guess my question boils down to: how could ghostscript be used to map the glyphs from an embedded CID font into a font with a different "encoding" (or "character map"?), one that will not require additional language files?
Addendum
I have also experimented with these approaches:
pdffonts on the output here will not have Osaka-Mono listed, but it will still complain "Error: Missing language pack for 'Adobe-Japan1' mapping":
$ wget http://whalepdfviewer.googlecode.com/svn/trunk/cmaps/japanese/Adobe-Japan1-UCS2
$ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH -f mypg3out.pdf Adobe-Japan1-UCS2
Same as previously - this (via Ghostscript's "Use.htm") also makes Osaka-Mono disappear from the pdffonts list:
$ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
  -c '/CIDSystemInfo << /Registry (Adobe) /Ordering (Unicode) /Supplement 1 >>' \
  -f mypg3out.pdf
This crashes with Error: /undefinedresource in findresource:
$ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
  -c '/Osaka-Mono-Identity-H /H /CMap findresource [/Osaka-Mono-Identity /CIDFont findresource] == ' \
  -f mypg3out.pdf
Note finally that Ghostscript may automatically use some of the .ps scripts it installs; for instance, you can find gs_ttf.ps:
$ locate gs_ttf.ps
/usr/share/ghostscript/9.02/Resource/Init/gs_ttf.ps
... and then, using sudo nano $(locate gs_ttf.ps), you can add the statement (Hello from gs_ttf.ps\n) print at the beginning of the code; then, whenever one of the above gs commands is called, the printout will be visible on stdout.
References
Adding your own fonts - Fonts and font facilities supplied with Ghostscript
About "CIDFnmap" of Ghostscript - Features to support CJK CID-keyed in Ghostscript
Bug 689538 – GhostScript can not handle an embedded TrueType CID-Font
Bug 692589 – "Error CIDSystemInfo and CMap dict not compatible" when converting merged file to PDF/A - #1522
Adobe Forums: CMap resources versus PDF mapping resources: Please keep in mind that a CMap resource unidirectionally maps character codes to CIDs. Those other resources that Acrobat uses are best referred to as PDF mapping resources. Among them, there is a special category called ToUnicode mapping resources that unidirectionally map CIDs to UTF-16BE character codes
Adobe CIDs and glyphs in CJK TrueType font
Ghostscript and Japanese TrueType font
Installation guide: GS and CID font
Debian -- Filelist of package poppler-data/sid/all
I have a base PDF file and want to update the title to Chinese (UTF-8) using ghostscript and pdfmark, with a command like the one below:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=result.pdf base.pdf pdfmarks
And the pdfmarks file (encoded as UTF-8 without a BOM) is below:
[ /Title (敏捷开发)
/Author (Larry Cai)
/Producer (xdvipdfmx (0.7.8))
/DOCINFO pdfmark
The command executes successfully, but when I check the properties of the result.pdf,
The title is changed to æŁ‘æ“·å¼•å‘
Please give me hints on how to solve this; are there any parameters in the gs command or pdfmark?
The PDF Reference states that the Title entry in the document info dictionary is of type 'text string'. Text strings are defined as using either PDFDocEncoding or UTF-16BE with a Byte Order Mark (see page 158 of the 1.7 PDF Reference Manual).
So you cannot specify a Title using UTF-8 without a BOM.
I would imagine that if you replace the Title string with a string defining the content using UTF-16BE with a BOM then it will work properly. I would suggest you use a hex string rather than a regular PostScript string to specify the data, simply for ease of use.
Using the idea from Happyman Chiu, my solution is as follows. Get a UTF-16BE string with a BOM by:
echo -n '敏捷开发' | iconv -t utf-16 |od -x -A none | tr -d ' \n' | sed 's/./\U&/g;s/^/</;s/$/>/'
You will get <FEFF654F63775F0053D1>. (Don't wrap the text in the PostScript ( ) delimiters before encoding, or literal parentheses end up inside the title.) Substitute this for the title:
/Title <FEFF654F63775F0053D1>
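Putting it together, the corrected pdfmarks file reads:
[ /Title <FEFF654F63775F0053D1>
  /Author (Larry Cai)
  /Producer (xdvipdfmx (0.7.8))
  /DOCINFO pdfmark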
See pdfmark for docinfo metadata in pdf is not accepting accented characters in Keywords or Subject.
I use this function to create the string from UTF-8 for info.txt, to be used by the gs command:
function str_in_pdf($str){
    // printf avoids the trailing newline a bare echo would embed in the title
    $cmd = sprintf("printf '%%s' %s | iconv -t utf-16 | od -x -A none",
                   escapeshellarg($str));
    exec($cmd, $out, $ret);
    // strip od's word-separating spaces before wrapping as a PDF hex string
    return "<" . str_replace(' ', '', implode('', $out)) . ">";
}
Is it possible to search multiple PDF files using the 'grep' command? It doesn't seem to work; how do people search content in multiple PDF files?
Well, PDF is a binary format, but grep can search binary files as if they were text if you pass:
grep -a
or you can just use pdftotext (which comes with xpdf) like this:
pdftotext whee.pdf | grep pattern
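To cover several files at once, the same idea extends to a small loop (a sketch; --label is a GNU grep option that names the stdin stream in the output):
for f in *.pdf; do
  pdftotext "$f" - | grep -H --label="$f" pattern
done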
You don't mention which OS you're using, but under Mac OS X you can use mdfind from the command line:
mdfind -onlyin search/directory/path "kind:pdf search text"
Use something like Solr or CLucene; I think they can do what you want.
PDF is a binary format, which is why searching it with grep is not that helpful. You can search the strings in a PDF with grep like this:
ls dir_with_pdfs/*.pdf|xargs strings|grep "keyword"
Or you can use the pdftotext command on the PDFs and then search the result with grep.
This tool, pdfgrep, will do the job. It has a syntax similar to grep's. To search several files, just use a simple shell command. For example:
$> ls Documents/*.pdf | xargs pdfgrep -n -H "system"
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: designed episodic memory system
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: how ISAC's episodic memory system is
Documents/2005-DoddGutierrezRO-MAN1.pdf:1: cognitive system employs a combination
....
PDF is a binary dump of the objects used to display the pages. There may be some metadata you can grep, but the actual page text is in a PostScript-like stream and may be encoded in a variety of ways. It's also not guaranteed to be in any order. You need to think of PDF as more like a vector image file than a text file.
There is a short article explaining text in PDFs in more detail at http://pdf.jpedal.org/java-pdf-blog/bid/27187/Understanding-the-PDF-file-format-text-streams
If you have pdftotext installed via the poppler package, then try this Perl script:
#!/usr/bin/perl
# Usage: ./script.pl pattern file1.pdf file2.pdf ...
my $p = shift;                      # first argument is the pattern
foreach my $fn (@ARGV) {            # remaining arguments are PDF files
    open(F, "pdftotext $fn - |");   # pipe each file's text through pdftotext
    while (<F>) { print "$fn:$_" if /$p/; }
    close(F);
}