Starting on 9/2, we noticed that the characters fi and fl are not showing up in the PDFs we generate. Any ideas on why this would start happening?
We use the Helvetica font when we generate these PDFs.
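One possible cause, which is only an assumption here since the question does not show the input text, is that the source text contains the single-character ligatures U+FB01 (fi) and U+FB02 (fl), and the font or encoding in use has no glyph for them. A minimal Python sketch that expands those two code points into plain letters before the text is handed to the PDF generator (the function name `expand_ligatures` is hypothetical):

```python
# Map the two ligature code points to their plain two-letter spellings.
# Assumption: the missing "fi"/"fl" are the Unicode ligatures U+FB01 and U+FB02.
LIGATURES = str.maketrans({"\ufb01": "fi", "\ufb02": "fl"})


def expand_ligatures(text):
    """Return text with the fi and fl ligature characters replaced by plain letters."""
    return text.translate(LIGATURES)


if __name__ == "__main__":
    print(expand_ligatures("\ufb01le \ufb02ow"))
```

If the text never contains those code points, this changes nothing, and the cause lies elsewhere.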
I have some PDFs with 2-3 passages on every page. Every passage is separated by a line gap, but when reading with PyMuPDF I cannot see any machine-readable separator between passages. Is there any other way, or another library, that can do this?
code:
import fitz  # PyMuPDF

doc = fitz.open('IT_past.pdf')
page = doc.load_page(0)  # put the page number here
text = page.get_text('text')
print(text)
Page screenshot: (image not shown)
pdf
Full pdf
There is no gap as such. Since it is much easier for the moment, let's look closer at the rendering in your linked viewer:
So let's replicate what is inside the real PDF (which has no web-side HTML <p> markers):
support, product design, HR Management, knowledge process outsourcing for
pharmaceutical companies and large complex projects.
Software exports make up 20 % of India's total export revenue in 2003-04, up from 4.9 %
in 1997.This figure is expected to go up to 44% of annual exports by 2010. Though India
See, there is "no gap", just left-aligned, non-justified (ragged) text. Each run of text needs a style, such as a font name, and explicit positions to hold it in place on a page devoid of line feeds and true carriage returns. (Occasionally there are some backspace or vertical/horizontal moves, but these are generally meaningless in line-printer text.) Even tabs, indents, and some spacing characters are normally discarded in a PDF printout.
If you want gaps or line-wrap you need to add them.
A good alternative is to export with -layout using pdftotext from poppler or xpdf. Here the output goes to - (the console); you can pipe it, or replace the - with a path/name.txt. Many other options are available, like -nopgbrk.
xpdf-tools-win-4.04\bin32>pdftotext -f 1 -l 1 -layout IT_past.pdf -
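Since the PDF carries positions rather than separators, one way to "add the gaps" yourself is to compare the vertical positions of consecutive lines. The following is a minimal Python sketch, assuming the passages really are set apart by a larger vertical distance than the normal line pitch. The names `page_lines`, `group_paragraphs`, and `gap_factor` are hypothetical, and the 1.5 threshold is a guess that would need tuning per document. `page_lines` assumes a PyMuPDF page object and the structure returned by `page.get_text("dict")`.

```python
from statistics import median


def page_lines(page):
    """Collect (y_top, text) for every text line of a PyMuPDF page."""
    out = []
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks carry no "lines"
            text = "".join(span["text"] for span in line["spans"])
            out.append((line["bbox"][1], text))
    return sorted(out)


def group_paragraphs(lines, gap_factor=1.5):
    """Split (y_top, text) lines into paragraphs at unusually large vertical steps."""
    if len(lines) < 2:
        return [text for _, text in lines]
    steps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    typical = median(steps)  # the normal line pitch on this page
    paragraphs = [[lines[0][1]]]
    for step, (_, text) in zip(steps, lines[1:]):
        if step > gap_factor * typical:
            paragraphs.append([text])  # large step: start a new paragraph
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]


if __name__ == "__main__":
    sample = [(100.0, "a"), (112.0, "b"), (124.0, "c"), (150.0, "d"), (162.0, "e")]
    print(group_paragraphs(sample))
```

With a real file, the input would come from something like `group_paragraphs(page_lines(doc.load_page(0)))`.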
So I am trying to extract English and Hindi text from a PDF file. The English text is extracted properly. But when I try to extract the Hindi Text, some characters are replaced by circle/squares.
I copied the Hindi text snippet directly from the PDF File to a Word document and I get the same squares for some characters.
PDFBox Version: 2.0.7
PDF Version: 1.6(Acrobat 7.x)
Security Details(PDF):
Font Details:
I cannot attach the PDF, but here is a snippet of the PDF File(Adobe Acrobat Reader).
Note: I have drawn the black bar as it contains the address of someone.
Output of text extracted using PDFBox:
पता: कालकाजी, दि ण िद ी, िद ी - 110019
As you can see from the output of PDFBox text extraction above, some of the characters are replaced by circles. The same happens when I manually copy from PDF to a word document.
I have also tried Tesseract OCR, but that gives even worse output. Are there any other options that I can try?
For instance, extracting the data using PDFBox not as text but as an image?
EDIT: I am also getting the following warnings.
03:58:38.711 [main] WARN o.a.pdfbox.pdmodel.font.PDType0Font - No
Unicode mapping for CID+26 (26) in font Lohit-Devanagari
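In the extracted output above, some vowel signs appear directly after a space, which suggests that the consonant they belong to is one of the glyphs with no Unicode mapping. As a rough diagnostic, and only a sketch under that assumption, the following Python function flags combining marks that start a word, which should not happen in well-formed text. The function name `orphaned_marks` is hypothetical. It does not repair anything; it only helps to measure how much of a document is affected.

```python
import unicodedata


def orphaned_marks(text):
    """Return (index, char) for combining marks that follow whitespace or start the text."""
    hits = []
    prev = " "  # treat the start of the text like a space
    for i, ch in enumerate(text):
        # Unicode categories Mn, Mc and Me are all combining marks.
        if unicodedata.category(ch).startswith("M") and prev.isspace():
            hits.append((i, ch))
        prev = ch
    return hits


if __name__ == "__main__":
    # Vowel sign I + DA, a space, then a lone vowel sign II.
    broken = "\u093f\u0926 \u0940"
    print(len(orphaned_marks(broken)))
```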
We have to construct a PostScript file that contains Arabic text as well as English text.
GhostScript shows the Arabic text correctly, but converting it to pdf does not show the Arabic letters.
PS file contains the following:
/TraditionalArabic findfont dup
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /kafinitialarabic put
Encoding 2 /behinitialarabic put
Encoding 3 /yehmedialarabic put
Encoding 4 /seenfinalarabic put
Encoding 5 /eacute put
Encoding 6 /a put
/ArabicTradDict currentdict definefont pop
end
%%Page: 1 1
%%BeginPageSetup
%%PageMedia: Color Weight Type
<< /MediaColor (Blue)/MediaWeight 75 /MediaType () /xx {2.803464567 mul} def /xx {2.83464567 mul} def /PageSize [240 xx 345 xx]>> setpagedevice
%%EndPageSetup
/ArabicTradDict 18 selectfont
72 xx 300 xx moveto
(\004\003\002\001) show
showpage
To run ghostScript: running it from command line to include all windows fonts:
gswin64.exe -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true
To convert the PS file to PDF file: running the following command:
gswin64.exe -dBATCH -dNOPAUSE -sOutputFile=c:/Users/mob/Desktop/TimesNewRomanPSMT.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dCompressFonts=false -dSubsetFonts=false -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true -f c:/Users/mob/Desktop/TimesNewRomanPSMT.ps
So when converting to PDF, the Arabic characters are not shown correctly; they appear as meaningless squares.
If I use the Adobe tool to convert to PDF, the result is the same, except that "eacute" (\005), if included in the PS file, shows up after conversion, whereas with the command line above none of the characters added through the Encoding are shown correctly.
Any help with that?
Thanks to KenS's hints I was able to solve my problem. The encoding used wrong character names like kafinitialarabic (by wrong I mean that the PDF conversion could not resolve them); every name that ended with arabic was wrong. The Traditional Arabic font does not have those names for its characters. To find out which names it really has, I converted the TTF font to AFM and PFA using the following command, which converts the TrueType font to a Type 42 font that will be understood once embedded in the PostScript file at conversion to PDF:
C:\Program Files\gs\gs9.10\bin>gswin64c.exe -dNODISPLAY -q -- ttf2pf.ps times timesPS timesAFM
where times is the TTF font name. I then checked the generated PFA file for the characters I wanted to add: instead of kafinitialarabic there was kafinitial, instead of kafmedialarabic there was kafmedial, and so on.
It works fine now when I add those names to the encoding, but I would like to avoid adding all those characters to the dictionary and instead use the font the way we normally do with setfont in PostScript, if that is possible.
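The manual check of glyph names can be automated against the generated AFM file. This is a minimal Python sketch, assuming the AFM uses the usual character-metric lines of the form `C code ; WX width ; N name ; B ... ;`. The function names and the sample data are hypothetical.

```python
def afm_glyph_names(afm_text):
    """Collect the glyph names from the character-metric lines of an AFM file."""
    names = set()
    for line in afm_text.splitlines():
        if not (line.startswith("C ") or line.startswith("CH ")):
            continue  # not a character-metric line
        for field in line.split(";"):
            parts = field.split()
            if len(parts) == 2 and parts[0] == "N":
                names.add(parts[1])
    return names


def missing_names(encoding_names, afm_text):
    """Return the encoding names that the font does not define."""
    available = afm_glyph_names(afm_text)
    return [name for name in encoding_names if name not in available]


if __name__ == "__main__":
    # Invented two-line sample standing in for a real AFM file.
    sample = (
        "C 1 ; WX 500 ; N kafinitial ; B 0 0 500 700 ;\n"
        "C -1 ; WX 400 ; N kafmedial ; B 0 0 400 700 ;\n"
    )
    print(missing_names(["kafinitialarabic", "kafinitial"], sample))
```

Any name it reports would need to be replaced by the spelling the font actually uses before it goes into the Encoding array.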
As already suggested, you need to ensure the glyph names you use are in the font you use, or create a new font.
I haven't found anything that will choose the correct glyph from the set of initial, medial, final, isolated, depending on context, though.
I resorted to writing a program which takes Unicode Arabic, reverses the Arabic characters, and then decides which form of each character to use based on its position in a word, and on whether the previous or next characters are forced into isolated or final forms. Unfortunately I had to embed quite a lot of intrinsic knowledge about the font in use and the glyph names it has, as well as the typos in them, into the program.
If that's of interest, I've stuck it on github, but it's very raw and initial.
It does work, though.
https://github.com/gbjk/arabic2ps
The font I used was a traditional arabic font, with quite a few idiosyncrasies.
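The form selection described above can be sketched roughly as follows in Python. This is not the code from the linked repository, only an illustration of the idea under simplifying assumptions. The letter table is tiny, the set of letters that never join forward is deliberately incomplete, and the glyph names (base name plus `isolated`, `initial`, `medial`, `final`) are hypothetical and would have to match whatever the font really defines.

```python
# Letters that connect only to the letter before them, never to the one after.
# Deliberately incomplete: alef, dal, thal, reh, zain, waw.
RIGHT_JOINING = set("\u0627\u062f\u0630\u0631\u0632\u0648")

# Hypothetical base glyph names for a handful of letters.
BASE_NAMES = {
    "\u0643": "kaf",
    "\u062a": "teh",
    "\u0627": "alef",
    "\u0628": "beh",
    "\u064a": "yeh",
    "\u0633": "seen",
}


def shape_word(word):
    """Return glyph names for one Arabic word, reversed into left-to-right drawing order."""
    names = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        # A letter joins backwards only if the previous letter can join forwards.
        joins_prev = prev in BASE_NAMES and prev not in RIGHT_JOINING
        joins_next = nxt in BASE_NAMES and ch not in RIGHT_JOINING
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        names.append(BASE_NAMES[ch] + form)
    names.reverse()  # PostScript show draws left to right
    return names


if __name__ == "__main__":
    # kaf, teh, alef, beh in logical (reading) order
    print(shape_word("\u0643\u062a\u0627\u0628"))
```

A letter outside the table raises a `KeyError`, which at least makes gaps in the table visible rather than silently drawing the wrong glyph.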
I have a generated PDF file with Japanese text. I use Arial Unicode MS as the font. Some letters are displayed correctly, but for others I see gibberish like this 〠ぇ》❥ (hope you can see it)
How do I get it to display Japanese characters instead?
It looks like Arial Unicode MS has two different versions, 0.84 and 1.01; the latter has better support for Japanese.
I am doing a massive set of file conversions and several of them happen to be ".dat" files. When I open them I see that the first line is "%!PS-Adobe". Here's an example.....
%!PS-Adobe
^M%c$in
^M/c$in {72.0 mul} def
^M%DEFINE MARGINS
^M/C$LMAR .2 c$in def %LEFT MARGIN
^M/C$RMAR 8.4 c$in def %RIGHT MARGIN
^M/C$TMAR 10.8 c$in def %TOP MARGIN
^M/C$BMAR .2 c$in def %BOTTOM MARGIN
^M/C$CF /Courier def %saves /Courier as C$CF
etc...
Am I correct in assuming that these are indeed Adobe Postscript files and ** NOT ** PDFs?
How hard is it to convert these to PDF? I was thinking command line perl via ImageMagick or something but right now I'm a little stumped about what's been handed to me.
Thank you SO Much...
Janie
You can convert these to PDF using Ghostscript and the pdfwrite device, or for simplicity the ps2pdf script supplied with Ghostscript.
Yup, that's Postscript. A PDF would start with "%PDF". If the text "^M" is literally there like that, then it was created on a Mac and got screwed up being copied to or edited on other platforms. (Maybe it was when the sample was pasted into the S.O. edit box?) It defines some variables with dollar signs in their names, which makes it look funny.
%!PS-Adobe is the signature for conforming Postscript files. (Non-conforming Postscript can get by with %!.) The signature for PDF files is %PDF-.
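For a large batch of ".dat" files, the signature check described above can be scripted. This is a minimal Python sketch; the function names are hypothetical, and the executable name `gs` is an assumption (on Windows it is typically `gswin64c`). The second function only builds a Ghostscript command line for the pdfwrite device and does not run it.

```python
def sniff_format(head):
    """Classify a file from its first bytes: 'pdf', 'postscript' or 'unknown'."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"%!"):  # covers both %!PS-Adobe and plain %!
        return "postscript"
    return "unknown"


def ghostscript_command(src, dst, executable="gs"):
    """Build (not run) a Ghostscript command that converts PostScript to PDF."""
    return [
        executable,
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst}",
        src,
    ]


if __name__ == "__main__":
    print(sniff_format(b"%!PS-Adobe\r%c$in"))
    print(ghostscript_command("report.dat", "report.pdf"))
```

In a batch job, the first few bytes of each file would be read in binary mode and passed to `sniff_format`, and the resulting command list could be handed to `subprocess.run`.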