Strange behaviour of a pdf-to-text conversion - pdf

I'm trying to convert a PDF document to .txt using pdftotext on a Linux Mint machine. The document is written in English, but the output text looks something like this:
23!,&/$!%+!,#$!AB&017"*&7!"-M')(!-)!gE*X/-&$!$-&23!')!,#$!
(-.$1!/*/-223!(/-&-)E ,$$*!,#-,!,#$!%,#$&!C2-3$&!>'22!($,!
,#$!-9[-0$),!0%&)$&7!S/0#!-!*',/-,'%)!'*!*$$)!')!V'-E
(&-.!Z7!I,!'*!##',$8*!,/&)1!D/,!)%!.-,,$&!>#$&$!##',$!(%$
*1!^2-0M!>'22!D$!-D2$!,%!+2'C!,#$ 9'*0!%)!,#$!gE*X/-&$!N(
KO!-)9!,-M$!,#$!#<!0%&
Is there an encoding problem? Maybe a wrong option in the command line?
Edit: the problem is the same even if I try to copy a bunch of text from the PDF document and paste it into a text document.
Edit #2: The Producer PDF property is "Mac OS X 10.5.6 Quartz PDFContext", and the encoding for most of the fonts is WinAnsi or MacRoman. Maybe this helps.
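(One way to double-check those font properties from the command line is Poppler's pdffonts tool, which lists each font's encoding and whether it carries a ToUnicode map; the file name below is just a placeholder.)
pdffonts document.pdf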

Related

GROFF PDFPIC with a PDF converted by ImageMagick in a .ms document causes "troff: sample.ms:18: division by zero" and pushes the image to the far right of the PDF

I converted my original image to pdf with ImageMagick. If viewed independently, the pdf image looks perfectly normal.
sample.ms :
.PDFPIC Figure_1.pdf
Once I try to compile my .ms document with the following command:
groff -ms sample.ms -U -T pdf > sample.pdf
I get the following error from groff:
troff: sample.ms:1: division by zero
The document does compile, but the image ends up so far to the right of the page that it is sometimes almost completely off the page.
I was having the same problem, and it seems the PDFs that convert generates are corrupt in some way.
I ended up using convert img.png img.tiff and then tiff2pdf img.tiff > img.pdf. Including img.pdf then worked just fine.
I used tiff2pdf just because that's what I had installed, but any other program should work too if it generates valid PDF.
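For completeness, the whole workaround can be written out like this (file names are placeholders and the source image format may differ):
convert Figure_1.png Figure_1.tiff        # ImageMagick: convert the source image to TIFF
tiff2pdf Figure_1.tiff > Figure_1.pdf     # TIFF to a well-formed PDF for .PDFPIC
groff -ms sample.ms -U -T pdf > sample.pdf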

How to decode PDF file and encode it back?

My overall goal is to make some PDF files conform to the PDF/A standard for archival purposes. They fail one requirement, namely that some glyph mappings map to 0, which they should not.
My usual strategy was to use an old program called "Pdfedit" that decodes PDF files so that all the byte streams become human-readable, edit the relevant part of the PDF containing the glyph mappings, and then open the file with Adobe Acrobat, which automatically re-encodes it.
Now I have some PDFs that cause "Pdfedit" to crash upon opening. I tried using PDF-Parser but its output cannot be re-encoded by Adobe Acrobat.
Also, the relevant parts used to look like this when decoded:
/CMapType 2 def
1 begincodespacerange
<00><04>
endcodespacerange
5 beginbfchar
<00><0000>
<01><0000>
<02><263A>
<03><0000>
<04><0000>
endbfchar
endcmap
But now, when I use the command python3 pdf-parser.py -f -n /path/to/file.pdf > dump.txt, the relevant part inside dump.txt looks like this:
b'/CMapType 2 def\n1 begincodespacerange\n<00><04>\nendcodespacerange\n5 beginbfchar\n<00><0000>\n<01><0000>\n<02><263A>\n<03><0000>\n<04><0000>\nendbfchar\nendcmap\nCMapName currentdict/CMap defineresource pop end end'
So it is a byte string, and every line break is rendered literally as \n. The .txt file that contains this cannot be interpreted as a PDF by Adobe Acrobat.
I have now also realized that many elements such as %%EOF are delimited by ''.
The real issue is how to get Acrobat-readable output from pdf-parser.py, since the shell redirection with > does not work and the output on stdout is also faulty.
I will try out a few things but could really need some help on this!
Answering my own question in case this is relevant for someone down the line.
Didier Stevens, the dev behind the pdf-parser, answered that his tool is not made for this. He recommended qpdf instead.
That was indeed the solution. Make sure you use the flag --stream-data=uncompress so that compressed parts are also accessible in the output. The command to use with qpdf is:
qpdf old_file.pdf --stream-data=uncompress --decode-level=all new_file.txt
You can also output new_file as .pdf; either way you will be able to open it in a text editor. Once you have applied the changes you want, you can change the extension back to .pdf and process it further with Acrobat or any other conversion program.
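For reference, a rough round trip looks like this (file names are placeholders; edit the uncompressed file in a text editor between the two qpdf calls):
qpdf --stream-data=uncompress --decode-level=all old_file.pdf editable.pdf
# ... fix the glyph mappings in editable.pdf with a text editor ...
qpdf --stream-data=compress editable.pdf fixed.pdf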

Inkscape "PDF + Latex" export

I'm using Inkscape to produce vector figures, saving them in SVG format and later exporting them as "PDF + Latex", much in the vein of the TUG inkscape+pdflatex guide.
Trying to produce a simple figure, however, turns out to be extremely frustrating.
The first figure (shown here in PNG format) is an example of the kind of figure I would like to export as "PDF + Latex".
If I export this to a PDF figure without LaTeX macros, the PDF produced looks exactly the same, except for some minor differences in the fonts used to render the text.
When I try to export it using the "PDF + Latex" option, the PDF file produced consists of a two-page PDF document (again shown as .png here):
This, of course, does not look good when compiling my LaTeX document. So far the guide at TUG has been very helpful, but I still can't produce a working "PDF + Latex" export from Inkscape.
What am I doing wrong?
I worked around this by putting all the text in my drawing at the top:
select the text and then Object -> Raise to Top.
Inkscape only generates the separate pages if the text is below another object.
I asked this question on the Inkscape online discussion page and got some very helpful guidance from one of the users there.
This is a known bug https://bugs.launchpad.net/ubuntu/+bug/1417470 which was inadvertently introduced in Inkscape 0.91 in an attempt to fix a previous bug https://bugs.launchpad.net/inkscape/+bug/771957.
It seems this bug does two things:
The *.pdf_tex file will have an extra \includegraphics statement which needs to be deleted manually as described in the link to the bug above.
The *.pdf file may be split into multiple pages, regardless of the size of the image. In my case the line objects were split off onto their own page. I worked around this by turning off the text objects (opacity to zero) and then doing a standard PDF export.
If you can execute Linux commands, this works:
# Generate the .pdf and .pdf_tex files
inkscape -z -D --file="$SVGFILE" --export-pdf="$PDFFILE" --export-latex
# Fix the number of pages
sed -i 's/\\\\/\n/g' ${PDFFILE}_tex;
MAXPAGE=$(pdfinfo $PDFFILE | grep -oP "(?<=Pages:)\s*[0-9]+" | tr -d " ");
sed -i "/page=$(($MAXPAGE+1))/,\${/page=/d}" ${PDFFILE}_tex;
with:
$SVGFILE: path of the svg
$PDFFILE: path of the pdf
It is possible to include these commands in a script and execute it automatically when compiling your tex file (so that you don't have to manually export from inkscape each time you modify your svg).
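A minimal wrapper sketch along those lines (it assumes the Inkscape 0.9x command-line flags used above, SVG figures in a figures/ directory, and a main.tex; adjust paths as needed):
#!/bin/sh
# Sketch: re-export every SVG as PDF+LaTeX, fix the page count, then compile.
for SVGFILE in figures/*.svg; do
    PDFFILE="${SVGFILE%.svg}.pdf"
    inkscape -z -D --file="$SVGFILE" --export-pdf="$PDFFILE" --export-latex
    sed -i 's/\\\\/\n/g' "${PDFFILE}_tex"
    MAXPAGE=$(pdfinfo "$PDFFILE" | grep -oP "(?<=Pages:)\s*[0-9]+" | tr -d " ")
    sed -i "/page=$((MAXPAGE+1))/,\${/page=/d}" "${PDFFILE}_tex"
done
pdflatex main.tex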
Try it with an illustration that is less wide.
Alternatively, use a wider paperwidth setting.

Writing a basic PostScript script by hand

I wanted to try and manually code a PostScript file. Why? Why not. From Wikipedia, I copied and pasted their basic Hello World program for PostScript which is:
%!PS
/Courier % name the desired font
20 selectfont % choose the size in points and establish
% the font as the current one
72 500 moveto % position the current point at
% coordinates 72, 500 (the origin is at the
% lower-left corner of the page)
(Hello world!) show % stroke the text in parentheses
showpage % print all on the page
When I try to open it in GIMP, I get
Opening 'Hello World.ps' failed. Could not interpret file 'Hello World.ps'
I can use ImageMagick to convert the file
convert "Hello World.ps" "Hello World.pdf"
convert "Hello World.ps" "Hello World.eps"
The PDF opens successfully and displays 'Hello World' in Courier.
The EPS yields the same error as the PS.
Is there something wrong with the syntax of the PS file?
Are PS files just not meant to be viewed directly, and should instead be viewed in a containing format like PDF?
Is GIMP just not able to handle this particular format of PS file?
To answer your questions, one by one:
Your PostScript file is completely OK.
PostScript files can be viewed directly if you use a PostScript-capable viewer. (BTW: PDF may be regarded as a 'container format' -- but it never embeds a PostScript file for 'viewing'...)
For GIMP to be able to handle PS/EPS files, you need a working Ghostscript installation on your system.
The same as point 3 is true for your convert command: ImageMagick cannot handle PS/EPS or PDF input files unless there is a functional Ghostscript installation available on the local system. Ghostscript acts as a so-called 'delegate', employed by ImageMagick to handle file formats it cannot handle itself: the delegate converts such a format into a raster file, which ImageMagick can then take over for further processing.
To check for available ImageMagick delegates, run these commands:
convert -list delegate
convert -list delegate | grep -Ei --color '(eps|ps|pdf)'
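Once Ghostscript is installed, you can also rasterize the PostScript file directly with it; a minimal example (the device, resolution, and output name are just illustrative choices):
gs -dSAFER -r150 -sDEVICE=png16m -o hello.png "Hello World.ps"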

Why does pdftk chop the tops off the characters in my form fields?

When I populate acrobat form fields by importing an FDF file into NitroPDF, things look fine. When I type data into the form fields manually in Acrobat 8, things look fine. When I use pdftk (on Windows XP or 2K), the tops of the characters in each form field are chopped off. Is there a parameter I'm missing somewhere? There aren't that many settings in pdftk...
Here's what I'm running:
pdftk form.pdf fill_form data.fdf output out.pdf flatten
Digging deeper, it appears the supplied text:
<</T (A) /V (123)>>
Gets reworked to:
<</T (A) /V ([fe][ff][nul]1[nul]2[nul]3)>>
(I determined this by loading an "un-flattened" out.pdf into NitroPDF and exporting the FDF. The [fe][ff] prefix is the UTF-16BE byte order mark, so pdftk has re-encoded the value as UTF-16.)
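(As an alternative to the NitroPDF round trip, pdftk itself can export the stored field values; this is only a sketch, using the file names above and omitting 'flatten' so the fields stay live:)
pdftk form.pdf fill_form data.fdf output out_unflattened.pdf
pdftk out_unflattened.pdf generate_fdf output check.fdf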
I ended up proofing the document using pdftk and Acrobat Reader as I worked, instead of the import in NitroPDF. It appears that the baseline for the characters is different. To get the results I was after, I had to make each field about twice the height required by NitroPDF and overlap fields.
I would still recommend NitroPDF for the rest of its capabilities.
I believe PDFTK only supports earlier versions of the PDF standard (up to 1.4 I think) so maybe it's just starting to show its age?