I have PDF files with text that should be replaced. More specificly, the text should be translated and replaced with the translated version.
It's important that the rest of the PDF structure stays intact. Note that the text is available in the PDFs and techniques like OCr are not needed. Also, it would be nice if font and other text attributes are kept.
Which libraries would you recommend for extracting the text to an easy to edit format (such as CSV) and put the new text back in again?
Assuming you are replacing text with a different language, you will have to choose a different font in most cases, and the font choice is non-trivial. I've used the Foxit libraries to change text or create PDFs with success.
Related
I am trying to copy some text from a PDF. But When I paste it in a word file, it is just some garbage. Something like മുഖവുര. The PDF is in Malayalam language. When I see File->Properties->Fonts, It says BRHMalayalam (Embedded Subset) as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones on the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file, and confirming the answer already given by #KenS, the problem with this PDF document is in fact how it's constructed. Or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set as WinAnsiEncoding, which completely doesn't make sense.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no other choice when copying and pasting to assume the characters in the PDF are indeed Win Ansi encoding which means you end up with (garbled) latin characters.
Just convert the pdf file to word file and then edit or copy or modify the text present in the file simple :)
and after completion go to file -> save as -> and change the format of doc to pdf ..hope u understood :)
I'm developing a new function to "my" program. This function is able to write PDF files by the simple way, making a simple text file with some codes of PDF standard.
I'm trying to understand how it works yet, but my first problem is about how to apply bold on some line of my document.
I've already downloaded the PDF REFERENCES GUIDE, but I've not found nothing about it.
Any idea?
PDF is not like HTML where you can apply formatting tags for emphasis. As you've read in the PDF reference, all that you do in PDF is to setup a graphics environment (colours used, fonts used, etc) and then put text on the page.
If you want to have something show in bold, use a font that is bold. If you want to have something show in italic, use a font that is italic.
Older software used dirty tricks to create "bold-alike" text, but the good (and easy) way to do it is to make sure you select the correct font before you start drawing text.
I'm sure this is just a google search away, but I can't find the right search terms to find what I'm looking for.
I've created a DataPackage that has both HTML annd plain text content. I've used this in my copy and my sharing code and it works fine. I now want to create RTF output as some apps don't seem to accept HTML clipboard content.
I'm looking for a good guide to making RTF text that can be added to the DataPackage. I just need simple formatting including changing the font family, font size, font weight and adding newlines. The data comes from a list of objects taht I want to serialise as RTF, not from a text control on the screen.
WordPad outputs fairly clean RTF and some other text editors do as well. If that's not enough, you can download the RTF Specification 1.9.1 although like any specification that's probably overkill for what you're doing.
You can also use the SaveToStream method on the Document property of a RichEditBox from a Metro style app to share out as well.
I have a piece of software called PDF2XL which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file though, gives me gibberish output that I can't even copy and paste into this textarea correctly. All sorts of unicode weirdness.
If I copy and paste as per normal into excel/notepad I get the same issue.
I assume it's something to do with a messed up character encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of SW to do it.
Thanks!
There are an increasing number of PDF files the used subsetted fonts which is basically a custom encoding. Normally the font descriptor in the PDF should have a ToUnicode table to allow the text extraction to decode the font encoding and return the correct text.
Some PDF producers are doing this on purpose to prevent easy PDF text extraction for things such as financial reports. If there is only one font then you could manually decode the font but in my experience I have seen PDF's with multiple random encodings which makes it nearly impossible to decode automatically.
One way to test for these types of PDF's is to open the file in Acrobat, select some text, copy it and then paste it into Notepad. If the text is garbled then the PDF is using a subsetted font and there is not much more you can do. If Acrobat can't extract the text correctly then nothing else can. It may as well be a page of hieroglyphs.
I am writing a Master's thesis - NLP system. I have one component - extractor.
It is extracting a plain text from PDF files. There are a few PDF files that can not be extracted correctly. Extractor (PDFBox library) returns a string like this:
"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"
or
"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"
I was checking each file that makes this extraction's problem and all these files' text also can not be copy-pasted from PDF Reader (Adobe Reader and FoxIt reader). Viewing them in this readers is enabled, but after selecting its content and copying to the clipboard I get the same wrong text (as described above - strings of not semantically correct chars or strings of digits and letters).
Could anybody help me???
Very often in such cases, where you can't select, copy'n'paste text from the Acrobat (Reader) window, there is another option which may work nevertheless:
Open 'File' menu,
select 'Save as...',
select 'Text (normal) (*.txt)',
browse to the target directory,
type the name you want to use for the text file.
You'll have all text from all pages in the file and need to locate the spot you wanted to copy'n'paste initially -- insofar it is not as comfortable as direct copy'n'paste. But it works more reliably....
It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).
Update
You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.
Here is an example output, which demonstrates where a problem for text extraction will very likely occur. It uses one of these hand-coded PDF files from a GitHub-Repository which was created to provide PDF sample files which are well commented and may easily be opened in a text editor:
$ pdffonts textextract-bad2.pdf
name type encoding emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica TrueType WinAnsi yes yes yes 12 0
CAAAAA+Helvetica-Bold TrueType WinAnsi yes yes no 13 0
How to interpret this table?
The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvtica-Bold.
Both fonts are of type TrueType.
Both fonts use a WinAnsi encoding (a font encoding maps char identifiers used in the PDF source code to glyphs that should be drawn).
However, only for font /Helvetica there is a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no in the uni-column).
The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.
A missing /ToUnicode table for a specific font is almost always a sure indicator that text strings using this font cannot be extracted or copied'n'pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete -- as seen in many real-world PDF files, and as also demonstrated by a few companion files in the above linked GitHub repository.)
If are able to successfully select and copy the text in Adobe Reader -- indicated that the PDF does contain text objects -- but you can't paste the copied text into Notepad without it looking like a bunch of garbage characters, then the problem is probably related to the CMap that the selected text uses.
The PDF specification provides many options for the display of textual content and the related extraction of the text content. A CMap specifies the mapping from character codes to character selectors. The PDF spec outlines some predefined CMaps, but other CMaps can also be embedded.
My guess is that either the CMap for this text is corrupt or that the PDFBox library doesn't support this particular CMap. I suggest trying a different SDK just to see if you get any different results.
When opened as a Gmail attachment in Chrome (the internal PDF browser) copying does copy normal readable characters!
It worked for me when I had this problem and for others as well. I think the Chrome PDF viewer uses the Google Drive OCR automatically... It's like magic!
What was the PDF created with. Some PDFs do not contain any encoding information, just the data to draw it. So there is no way to extract the data.
Select the text you wish to copy.
Right click
Choose option "Export Selection as"
In the dialog box, choose a file name and save the new file as Rich Text Format (RTF)
Open RTF to see your text!
The best way to deal with this is Convert PDF file to Word by using this website.
https://www.ilovepdf.com/pdf_to_word
The garbage issue will be fixed
The best way to deal with this is (assuming you have Adobe Acrobat, or something similar, not sure if Reader can do this) is save the doc as a JPEG. Then recompile all the images as a single pdf, then use the OCR function to find text in the pages, then you can copy and paste the text.
PDF is not a text document. It's more of a vector graphic format that sometimes can contain text. So there are some documents from which you can't extract text unless you are willing to do OCR. That's just the way it is.