PDF, PPT, DOC, etc. to TEXT

Maybe these should be separate questions, one for each format, but...
What are the most RELIABLE libraries (in any language), binaries (for any platform), or webservices (free or not free) for converting diverse "text-containing" formats into plain text?
By reliable, I mean near 100% ability to extract ALL of the human-readable text while NOT EXTRACTING "code" or "markup".
By text-containing formats, I mean: all the most common things like PDF, PPT, DOC, DOCX, RTF, HTML, ".PAGES", ".KEYNOTE", ODT, etc etc
Please suggest both packages/services that support many of these formats as well as those that only support one. In addition, are there software "stacks" that "tie together" many packages/services for the purpose of converting to text?

http://www.filebuzz.com/files/Ascii_Convert/1.html <-- This link takes you to a list of converters that can turn PDFs and other file types into ASCII (plain text).
For Word documents, you can do this without any extra software. When you click 'Save As', a dialog box opens with a 'Save as Type' drop-down list. Select 'Plain Text (*.txt)' and Word will save your file as plain text. Good luck!

In Java, the Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

If you're using Ruby, take a look at Yomu. It's a wrapper for Apache Tika and supports a variety of document formats, including the following:
Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWorks Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)

You can try Extract Text.
From the description: "Extract text from documents such as PDF and Microsoft Word files. It will save the extracted text in a file. Works with .pdf, .doc, .docx, .xls, .xlsx, .ppt, and many more." Requires Microsoft .NET Framework 4.0.

Related

Open pdf file in Microsoft Word using OLE

I am looking for a method of the Word OLE object that can open a PDF in Microsoft Word.
I want to copy all pages of the PDF into a doc/docx and add footers there.
Could anybody give me a hint on how to import a PDF?
PS: any sample code for this problem would be great.
Thanks,
Lilya
To convert a PDF into an editable document you may need an OCR (Optical Character Recognition) engine. PDF is a generic format and can store text as images, which makes conversion hard. SAP has no built-in OCR function for this. OpenText (if the customer uses it) may offer this functionality, but I don't have detailed information about it. Otherwise you need third-party tools: online services or command-line utilities can easily convert a PDF to a text file if the PDF contains real text; if it does not, you need a professional SDK (for example ABBYY FineReader).
I used Foxit PDF Reader to save the PDF as a text file and wrote a macro to read that text file. Of course, this way you only get the text and nothing else.

how do I extract the Arabic text of this PDF file correctly?

Today I tried to search for an Arabic word in a PDF file that contains Arabic content.
None of the PDF readers I tried could find any Arabic word in this file.
So I dragged the PDF file into Firefox, selected an area containing some words with Inspect Element, and saw this:
hw ½oiC instead of آخرین سخن
What encoding is used in this PDF file?
How can I convert this to normal text?
It's difficult to comment on the file without seeing it, but a good starting point is to open it in Acrobat. Either copying the text and pasting it into a text editor, or searching for the text, will reveal whether it can be extracted correctly.
If it can't be extracted properly, there's a good chance the font is lacking a ToUnicode entry (see Section 9.10.1 of the ISO 32000-1:2008 PDF specification for more information).

How to create and save a .rtf, .doc, .docx in Objective-C for iOS

I am looking to create and save either a rtf, doc or docx file on an iPad (iOS).
The scenario is that we'd like to assist a user in creating content on their iPad and then let them email this as an editable document cross-platform (OS X, WIN).
I am open to other solutions besides the rtf, doc or docx file format.
Thanks,
James
RTF is going to be the easiest, because it's a plain text format. It's kind of like HTML, but without closing tags. Here is a class for writing an RTF, but it requires a lot of dependencies from elsewhere in the framework.
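To illustrate how little a plain RTF file needs, here is a minimal sketch. It is written in Python only for brevity; the same string building should port to Objective-C. The names `escape_rtf` and `make_rtf` are hypothetical, the font choice is arbitrary, and only plain paragraphs are covered (no styling, tables, or images).

```python
def escape_rtf(text):
    """Escape RTF control characters and encode non-ASCII characters."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit decimal value per UTF-16 code unit;
            # the trailing "?" is the fallback for readers without Unicode support.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little", signed=True)
                out.append("\\u%d?" % unit)
    return "".join(out)


def make_rtf(text):
    """Wrap escaped text in a minimal RTF header with one font at 12 pt."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\\f0\\fs24 "
    return header + escape_rtf(text) + "}"
```

The result is pure ASCII, so it can be written to a `.rtf` file or attached to an e-mail as-is.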
DOCX would be rather difficult. It's actually a zip file, containing a few XML files. You can examine the format yourself by changing the .docx extension to .zip and unzipping it. But even though XML is a fairly easy to write format, the way the text attributes are organized is still rather complicated. Also, I recall that it has to be zipped in a very specific way to be read properly.
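The "zip file containing XML" structure can be seen from code as well. The following is a rough sketch (in Python, for brevity) that opens a .docx as a zip archive and pulls the paragraph text out of `word/document.xml`. The function name `docx_paragraphs` is hypothetical, and the sketch assumes a simple document: text inside text boxes or nested tables may be reported more than once, and headers, footers, and footnotes live in other XML parts that are not read here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each run keeps its text in <w:t>.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Writing a .docx is the reverse of this, but it needs several more parts in the archive (content types and relationships), which is where most of the difficulty mentioned above comes from.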
As for DOC, it will be very difficult because it's such a complex format. You could look into some open source projects, like Abiword or Word2x. Be careful using their code because the licenses may not agree with the App Store rules.
I've seen doc & docx readers for iPhone (App store entry linked here), but I don't know of any open source frameworks you can make use of.
RTF format should be pretty simple to write, if you're up to the challenge. There is no built in framework support for it (here's a related question, b.t.w.).
Maybe you could write out something in a regular TEXT format and e-mail that?
Docmosis has a cloud service that you can reach from iOS. You can ask it to render a doc in various formats (doc, rtf, pdf, odt etc) and email it off or stream it back - though you have to be connected. Previewing DOC on iOS is possible but a little flaky. One option is to stream PDF back for display on iOS and email editable document (which can be done in one call).

How to convert WORD docs with Bookmarks to PDF using GhostScript?

I'm converting WORD docs to PDF programmatically using vb.net and ghostscript. This word doc I’m having problems with has hyperlinks to external URLs and also hyperlinks to bookmarks within the document. When the doc is converted to PDF the external URLs work but the links to the bookmarks do not.
I have searched for a solution to get these bookmarks to work on the output PDF but haven’t had any luck. Hopefully someone has done this and can share the solution.
Ghostscript only handles PDF or PostScript as input. There are sibling products that handle XPS and PCL as well, but none of them handle Word .doc files. So you must be converting the Word file into something else first.
I'll hazard a guess that you are using the Windows PostScript printer driver to convert to PostScript and passing that to GS (possibly via the RedMon Port Monitor) to convert into PDF.
Now PostScript doesn't support hyperlinks, bookmarks, or any of the other paraphernalia of a viewing application, since it's intended as a print language. To overcome this, Adobe introduced an extension, the pdfmark operator, which can be used to create this kind of information. NOTE: this is an extension which is only supported for conversion to PDF.
So, in order to get these inserted, you need to create pdfmarks in the PostScript. If you are printing from Word, this means that you have to insert PostScript into the file when printing. There is a 'pass through' mechanism for this purpose.
So what you need to do is create the appropriate Visual Basic script in Word which inserts the relevant pdfmarks when the document is printed. This is how the Adobe plug-in for Word (which used to be called PDFMaker a long time ago) works.
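As a rough illustration of what such pdfmarks look like, the sketch below builds the PostScript snippets as strings. Python is used here only to show the text that would have to end up in the PostScript stream; in Word the same strings would be emitted from the macro. The helper names are hypothetical, the syntax follows my reading of Adobe's pdfmark reference and should be checked against it, and destination names are assumed to be valid PostScript names (no spaces or delimiters).

```python
def ps_string(text):
    """Escape text for use inside a PostScript (...) string literal."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def dest_pdfmark(name, page):
    """Named destination on a page: the target a link can jump to."""
    return ("[ /Dest /%s /Page %d /View [/XYZ null null null] /DEST pdfmark"
            % (name, page))


def link_pdfmark(rect, dest_name):
    """Clickable rectangle (llx, lly, urx, ury in points) to a named destination."""
    llx, lly, urx, ury = rect
    return ("[ /Rect [%g %g %g %g] /Border [0 0 0] /Dest /%s "
            "/Subtype /Link /ANN pdfmark" % (llx, lly, urx, ury, dest_name))


def outline_pdfmark(title, page):
    """Entry in the PDF outline (the 'bookmarks' pane of a viewer)."""
    return ("[ /Title (%s) /Page %d /View [/XYZ null null null] /OUT pdfmark"
            % (ps_string(title), page))
```

For a Word bookmark named Intro on page 2, `dest_pdfmark("Intro", 2)` would describe the target and `link_pdfmark(...)` the clickable area on the page that holds the hyperlink.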
Have a look at this tool.
It does maintain bookmarks and hyperlinks.
http://www.transcom.de/transcom/en/2004_pdf-t-maker.htm

Copy+pasting text from PDF results in garbage

I am writing a Master's thesis on an NLP system. One of its components is an extractor.
It extracts plain text from PDF files. There are a few PDF files that cannot be extracted correctly. The extractor (PDFBox library) returns a string like this:
"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"
or
"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"
I checked each file that causes this extraction problem, and the text of all these files also cannot be copy-pasted from a PDF reader (Adobe Reader and Foxit Reader). The files display correctly in these readers, but after selecting the content and copying it to the clipboard I get the same wrong text (as described above: strings of meaningless characters, or strings of digits and letters).
Could anybody help me???
Very often in such cases, where you can't select, copy'n'paste text from the Acrobat (Reader) window, there is another option which may work nevertheless:
Open 'File' menu,
select 'Save as...',
select 'Text (normal) (*.txt)',
browse to the target directory,
type the name you want to use for the text file.
You will get all text from all pages in the file and will need to locate the passage you originally wanted to copy, so it is less convenient than direct copy'n'paste. But it works more reliably.
It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).
Update
You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.
Here is an example output, which demonstrates where a problem for text extraction will very likely occur. It uses one of these hand-coded PDF files from a GitHub-Repository which was created to provide PDF sample files which are well commented and may easily be opened in a text editor:
$ pdffonts textextract-bad2.pdf
name                            type         encoding    emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
How to interpret this table?
The above PDF file uses two subsetted fonts, Helvetica and Helvetica-Bold (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column).
Both fonts are of type TrueType.
Both fonts use a WinAnsi encoding (a font encoding maps the character identifiers used in the PDF source code to the glyphs that should be drawn).
However, only font /Helvetica has a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no in the uni column.
The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.
A missing /ToUnicode table for a specific font is almost always a sign that text strings using this font cannot be extracted or copied'n'pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete -- as seen in many real-world PDF files, and as also demonstrated by a few companion files in the above linked GitHub repository.)
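The check described above can be automated. Below is a small sketch in Python that scans pdffonts output and lists the fonts whose uni column says no. The function name `fonts_without_tounicode` is hypothetical, and the parsing assumes the column layout shown above, read from the right-hand side because the type column can contain spaces (for example "Type 1"). A font name that itself contains spaces would be truncated to its first word.

```python
def fonts_without_tounicode(pdffonts_output):
    """Return names of fonts whose 'uni' column is 'no' in pdffonts output."""
    missing = []
    for line in pdffonts_output.splitlines():
        tokens = line.split()
        # Data rows end with: emb sub uni object-number generation-number.
        # The header, the dashed rule, and blank lines fail this check.
        if len(tokens) < 7:
            continue
        if not (tokens[-1].isdigit() and tokens[-2].isdigit()):
            continue
        uni = tokens[-3]
        if uni == "no":
            missing.append(tokens[0])
    return missing
```

One possible way to feed it is `subprocess.run(["pdffonts", path], capture_output=True, text=True).stdout`, assuming the pdffonts binary is installed and on the PATH.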
If you are able to successfully select and copy the text in Adobe Reader -- indicating that the PDF does contain text objects -- but the pasted text in Notepad looks like a bunch of garbage characters, then the problem is probably related to the CMap that the selected text uses.
The PDF specification provides many options for the display of textual content and the related extraction of the text content. A CMap specifies the mapping from character codes to character selectors. The PDF spec outlines some predefined CMaps, but other CMaps can also be embedded.
My guess is that either the CMap for this text is corrupt or that the PDFBox library doesn't support this particular CMap. I suggest trying a different SDK just to see if you get any different results.
When opened as a Gmail attachment in Chrome (the internal PDF browser) copying does copy normal readable characters!
It worked for me when I had this problem and for others as well. I think the Chrome PDF viewer uses the Google Drive OCR automatically... It's like magic!
What was the PDF created with? Some PDFs do not contain any encoding information, just the data to draw the glyphs. In that case there is no way to extract the text.
Select the text you wish to copy.
Right click
Choose option "Export Selection as"
In the dialog box, choose a file name and save the new file as Rich Text Format (RTF)
Open RTF to see your text!
The best way to deal with this is to convert the PDF file to Word using this website:
https://www.ilovepdf.com/pdf_to_word
The garbage issue will then be fixed.
The best way to deal with this (assuming you have Adobe Acrobat or something similar; I'm not sure if Reader can do this) is to save the document as JPEG images, recompile all the images into a single PDF, and then run the OCR function to recognize the text in the pages. After that you can copy and paste the text.
PDF is not a text document. It's more of a vector graphic format that sometimes can contain text. So there are some documents from which you can't extract text unless you are willing to do OCR. That's just the way it is.