Selection in PDF files

Selection in PDF files - pdf

I'm working on an ebook conversion script.
When I create a PDF file using emacs ps-print-buffer-with-faces and then ps2pdf, I can select the words one by one on my ebook (Sony PRS 600). Same when I use Microsoft Word to print to PDF.
Yet, when I create a PDF using pdflatex, or latex -> dvips -> ps2pdf, I can't select but blocks of words, separated by punctuation signs.
It seems that there is something in the structure of the PDF files generated by latex that ebooks don't understand -- but what?
Do you know a switch to tell latex to behave properly, or a workaround?
Thanks!
CFP.

Latex doesn't use a white space character to separate words. That's the problem.

Related

What's the best way to extract text from pdf in python without changing the layout and format?

I want text with exact format and layout from pdf.
If pdf to text is not the direct choice, is it possible to do pdf -> xml -> text?
I have already tried PyPDF2, pdfminer and pdftotxt. Even I've tried using AWS textract and got incorrect layout.
Basically if I can construct sentence from the text extracted from pdf, that's enough.
I used Zamzar API which gives exact output but they're quiet expensive.
Any possible solution?

If you are looking to keep the structure of the PDF but not the font, colour, size etc., then try the pdftables_api library. This should hold the layout of your PDF. Convert PDF to CSV as a CSV file is just a comma seperated text file.
If you are looking to keep font, colour etc., Zamzar API is probably your best option.

Creating pdf file in Rebol/Red

I need to create a pdf file of A4 sized page which will have a black and white image and some text. Can I create a pdf file in Rebol or Red programming language?
If not possible directly, what is the best way to do it- create an image file which can be printed by external programs? Thanks for your help.

There are some scripts available e.g. pdf-maker.r

There is also a Rebol to Haru PDF binding that you can try in addition to Gabriele's PDF-Maker2.
And some of us used a Postscript dialect that allowed us to generate PS directly, and then use something like Ghostscript to convert the PS to PDF.

Copy text from PDF with custom FONT

I am trying to copy some text from a PDF. But When I paste it in a word file, it is just some garbage. Something like മുഖവുര. The PDF is in Malayalam language. When I see File->Properties->Fonts, It says BRHMalayalam (Embedded Subset) as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0

Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones on the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.

After looking at the file, and confirming the answer already given by #KenS, the problem with this PDF document is in fact how it's constructed. Or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set as WinAnsiEncoding, which completely doesn't make sense.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no other choice when copying and pasting to assume the characters in the PDF are indeed Win Ansi encoding which means you end up with (garbled) latin characters.

Just convert the pdf file to word file and then edit or copy or modify the text present in the file simple :)
and after completion go to file -> save as -> and change the format of doc to pdf ..hope u understood :)

Recover text from PDF file when normal methods fail

I have a few hundred PDF files from which I need to extract sections of text. For many, pdftotext works fine, but for others, it misses large sections of text. If I open the PDF in Acrobat and select that text by hand and copy/paste into emacs and then view the file without an encoding, I get stuff like this:
Husband \364\200\200\272\364\200\201\213\364 etc.
How can I extract the text correctly?
I should mention that I've tried saving as text from Acrobat; also tried applying Acrobat's Document=>OCR feature before copying.

Why not convert the PDF to doc or txt first? See the guide:
http://www.aolor.com/pdf-converter/user-guide.html

How to convert WORD docs with Bookmarks to PDF using GhostScript?

I'm converting WORD docs to PDF programmatically using vb.net and ghostscript. This word doc I’m having problems with has hyperlinks to external URLs and also hyperlinks to bookmarks within the document. When the doc is converted to PDF the external URLs work but the links to the bookmarks do not.
I have searched for a solution to get these bookmarks to work on the output PDF but haven’t had any luck. Hopefully someone has done this and can share the solution.

Ghostscript only handles PDF or PostScript as an input, there are sibling products to handle XPS and PCL as well but none of them handle Word .doc files. So you must be converting the Word file into something else.
I'll hazard a guess that you are using the Windows PostScript printer driver to convert to PostScript and passing that to GS (possibly via the RedMon Port Monitor) to convert into PDF.
Now PostScript doesn't support hyperlinks, bookmarks, or any of the other paraphernalia of a viewing application, since its intended as a print language. To overcome this Adobe introduced an extension, the pdfmark operator, which can be used to create this kind of information. NOTE this is an extension which is only supported for conversion to PDF.
So, in order to get these inserted, you need to create pdfmarks in the PostScript. If you are printing from Word, this means that you have to insert PostScript into the file when printing. There is a 'pass through' mechanism for this purpose.
So what you need to do is create the appropriate Visual Basic script in Word which inserts the relevant pdfmarks when the document is printed. This is how the Adobe plug-in for Word (which used to be called PDFMaker a long time ago) works.

Have a look at this tool.
It does maintain bookmarks and hyperlinks.
http://www.transcom.de/transcom/en/2004_pdf-t-maker.htm

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Selection in PDF files - pdf

Latex doesn't use a white space character to separate words. That's the problem.

Related

What's the best way to extract text from pdf in python without changing the layout and format?

Creating pdf file in Rebol/Red

Copy text from PDF with custom FONT

Recover text from PDF file when normal methods fail

How to convert WORD docs with Bookmarks to PDF using GhostScript?

Categories

Resources