Extract text from a PowerPoint (.ppt or .pptx) file? - api

I'm currently using a combination of OpenOffice macros and a pdf2text program to extract text, and would like to find an easier, more efficient way of getting the text out of a PowerPoint file.
I've tried the Apache POI library without much luck: I encountered numerous exceptions within the library when processing the files I'm looking at, and I don't particularly want to sift through the library's source code.
Is there an easy way to do this without using the aforementioned library?

If you have MS Office and save the PPT as RTF (Rich Text Format), the resulting file contains just the text from the presentation. You could then open the file in any editor that understands RTF files and save it as a text (TXT) file.
I expect this to work from OpenOffice too.
Since you mention an API, this may not be the way to go for you, but maybe it will give you new ideas on getting there. Say, you could use multiple macros to do the conversion in stages...
Edit: I got curious and did a short Google search.
This is what I found on one of the www.openoffice.org pages:
As people in this thread have pointed out, retrieving text from an OO
document isn't hard since it's just zipped xml that can be parsed with a
perl script. The problem is getting Microsoft Powerpoint documents into
a zipped XML format in the first place.
I've found that File -> Wizards -> Document Converter does exactly that.
Just tell it you want to convert Powerpoint documents, not templates,
point it to your source directory and where you want it to spit out the
result and you're away.
I then find unzip -p $file.sxi content.xml | perl -p -e
"s/<[^>]*>/\n/g;s/ +//;s/\n\n/\n/g;" -w
works rather well for extracting the text.
Sorry, I don't have OpenOffice handy to try any of that out.

pptx files are relatively easy to deal with, because they are just zipped xml - you can just unzip them and then strip all the xml tags from the content of the files in the 'ppt/slides' subdirectory of the unzipped stuff, yielding most of the pertinent text.
ppt files are a whole other ballgame, and the process is rendered even more painful because the canonical tool, catppt from the catdoc package, is susceptible to a buffer overflow that makes it nearly useless (it segfaults on a large percentage of ppt files).
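The unzip-and-strip approach for pptx files described above can be sketched in Python using only the standard library. This is a rough illustration, not a robust parser: the function names are made up for this example, it assumes the slide XML is UTF-8 encoded, and because every tag is replaced by a space, a word split across several formatting runs will come out with a space in the middle.

```python
import re
import zipfile
from html import unescape


def slide_number(name):
    # "ppt/slides/slide12.xml" -> 12, so that slides sort numerically
    return int(re.search(r"slide(\d+)\.xml$", name).group(1))


def extract_pptx_text(path):
    """Return a list with one string of text per slide."""
    texts = []
    with zipfile.ZipFile(path) as archive:
        slide_names = [
            name for name in archive.namelist()
            if re.fullmatch(r"ppt/slides/slide\d+\.xml", name)
        ]
        for name in sorted(slide_names, key=slide_number):
            xml = archive.read(name).decode("utf-8")
            # Replace every tag with a space, then decode entities like &amp;
            stripped = unescape(re.sub(r"<[^>]+>", " ", xml))
            # Collapse the runs of whitespace left behind by the tags
            texts.append(" ".join(stripped.split()))
    return texts
```

Calling extract_pptx_text("deck.pptx") (a hypothetical file name) would return the slide texts in slide order. Presenter notes live in a different subdirectory of the archive and are not covered by this sketch.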

In LibreOffice 5, File - Export - HTML includes both slide contents and presenter notes.
Then open the .html file in Firefox or another browser and use File - Save Page As - Text File (or use a utility such as pandoc, e.g. pandoc -t plain -o file.txt file.html).
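If neither a browser nor pandoc is available, the HTML-to-text step can also be approximated with Python's standard html.parser module. The sketch below is an assumption-laden illustration: the class and function names are invented here, and it emits one line per text fragment, so text broken up by inline tags such as bold or links ends up on separate lines.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

For an exported presentation you would read the .html file into a string first and pass it to html_to_text.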

Related

Open a .pdf file

I am trying to open a .pdf file within Excel like an iframe in HTML.
My requirement is:
Save the path of multiple PDF files in Excel.
Excel should open each .pdf file within Excel itself (no need to open that in a separate .pdf window).
It should be like an iframe in HTML: the user should be able to view the .pdf within Excel itself.
I know this is a little weird, but can anybody help me?
You could probably get the filenames via VBA.
Here are some answers that claim to work:
Loop through files in a folder using VBA?
As for opening a PDF inside Excel, that's kind of pushing it.
Since your request is exotic, I can think of an exotic workaround:
If you can spare the interactivity, you can simply make copies of your PDFs, convert them to Word formats, and load them that way. I've seen people convert PDFs to JPGs just to load them into other documents, but that's rudimentary and really fringe.
Otherwise you are facing a lot of custom coding to make this possible.

docx4j word/googledocs compatibility

I'm creating a program which extracts a docx file, displays it in a JavaFX graphical interface with buttons in place of flags put in the docx, and, when one clicks on a button, modifies the docx taken as input.
I'm using the docx4j API for extracting and modifying the document.
The problem is that the program fails if I give it a docx generated by Microsoft Word as input. I'm forced to use a workaround.
I take my docx made in Word, load it into Google Docs, and use the "Download in .docx format" option. If I put the docx from Word directly into my program, it fails.
I noticed my Word file was half the size after being passed through Google Docs. Likewise, if I take a docx file downloaded from Google Docs, open it in Word, modify one letter, and save it, it becomes twice as large. For the record, I use Word 2008.
That's it, so I'd like to know if someone knows what explains this difference.
Thanks
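One way to start investigating the size difference, sketched here under the assumption that both files are ordinary docx archives: a docx is a zip file, so listing the parts of each version with their sizes should show which parts Word adds or stores differently. The function name below is made up for illustration, and the snippet is in Python rather than Java only to keep it short.

```python
import zipfile


def list_parts(path):
    """Return (name, compressed_size, uncompressed_size) per part, largest first."""
    with zipfile.ZipFile(path) as archive:
        parts = [
            (info.filename, info.compress_size, info.file_size)
            for info in archive.infolist()
        ]
    # Sort by uncompressed size so the heaviest parts come first
    return sorted(parts, key=lambda part: part[2], reverse=True)
```

Running this on the Word-generated file and on the Google Docs download, then comparing the two lists, would show where the extra bytes are.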

How to search in PDF using VBA

Hi, I am a rookie in VBA.
Is it possible to search within PDF files using VBA, and return the search target together with the name of the file in which the match was found?
To give you a better understanding of what I do: I have a macro that goes to a web page (http://cetatenie.just.ro/ordine/articol-11/), loops through the hyperlinks, and downloads the files (PDF) that match the criteria. Now I need to search within them for a name and surname (e.g. BLANARI VITALIE) and know in which file/document it is published.
The only idea that comes to mind is to import the data, but then again the Excel file would get too huge.
Please help!
If you have Acrobat Professional installed, you can use Automation (Acrobat.AcroPDDoc) and query the PostScript.
Here is an example I found for you (by searching on Google):
get the data from PDF file into Excel sheet(s) or text file(s)
To use this code, you need the following References:
AcroPDFLib
Acrobat
On my PC they are in:
AcroPDFLib: C:\Program Files\Common Files\Adobe\Acrobat\ActiveX\AcroPDF.dll
Acrobat: C:\Program Files\Adobe\Reader 11.0\Reader\AcroRd32.dll
If you can't find them in your VB Editor References dialog, or by searching in C:\Program Files\Adobe, then you don't have the necessary components installed on your PC to do it the easy way.
The hard way is to strip the PostScript, read it into variables, and then search the variables!
HTH
Philip

How to create and save a .rtf, .doc, .docx in Objective-C for iOS

I am looking to create and save an RTF, DOC or DOCX file on an iPad (iOS).
The scenario is that we'd like to assist a user in creating content on their iPad and then let them email this as an editable document cross-platform (OS X, WIN).
I am open to other solutions besides the rtf, doc or docx file format.
Thanks,
James
RTF is going to be the easiest, because it's a plain text format. It's kind of like HTML, but without closing tags. Here is a class for writing an RTF, but it requires a lot of dependencies from elsewhere in the framework.
DOCX would be rather difficult. It's actually a zip file, containing a few XML files. You can examine the format yourself by changing the .docx extension to .zip and unzipping it. But even though XML is a fairly easy to write format, the way the text attributes are organized is still rather complicated. Also, I recall that it has to be zipped in a very specific way to be read properly.
As for DOC, it will be very difficult because it's such a complex format. You could look into some open source projects, like Abiword or Word2x. Be careful using their code because the licenses may not agree with the App Store rules.
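To illustrate why RTF is the easiest of the three to produce by hand, here is a minimal sketch of an RTF writer. It is written in Python only to show the format; the same string building would carry over to Objective-C. The function names, the Helvetica font, and the 12 pt size (\fs24 is in half-points) are assumptions made for the example.

```python
import struct


def rtf_escape(text):
    """Escape plain text for use in the body of an RTF document."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)      # characters with special meaning in RTF
        elif ch == "\n":
            out.append("\\par ")       # paragraph break
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit UTF-16 code unit; "?" is the
            # fallback shown by readers that do not understand \u
            for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
                out.append("\\u%d?" % unit)
    return "".join(out)


def make_rtf(text):
    """Wrap escaped text in a minimal RTF header with a single font."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\\f0\\fs24 "
    return header + rtf_escape(text) + "}"
```

The result is pure ASCII, so it can be written to a file with a .rtf extension or attached to an email as-is.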
I've seen doc & docx readers for iPhone (App store entry linked here), but I don't know of any open source frameworks you can make use of.
RTF format should be pretty simple to write, if you're up to the challenge. There is no built in framework support for it (here's a related question, b.t.w.).
Maybe you could write out something in a regular TEXT format and e-mail that?
Docmosis has a cloud service that you can reach from iOS. You can ask it to render a document in various formats (doc, rtf, pdf, odt etc.) and email it off or stream it back, though you have to be connected. Previewing DOC on iOS is possible but a little flaky. One option is to stream a PDF back for display on iOS and email an editable document (which can be done in one call).

Using Texmaker, I want to lock the PDF file created so others cannot copy text or print the file

I'm creating PDFs using Texmaker. I would like to create some of the PDF files so that when I give them to others, they are not able to print the file or copy the text. I know I can do this with some PDF creator applications, but can I do it from a command-line program that I already have with LaTeX, MiKTeX and Texmaker?
It wouldn't be effective anyway. There are bits in the PDF format that purport to forbid the user from doing this, but they are really just suggestions that the reader application may or may not act on. There is nothing to stop a user from removing the code that inspects those bits from a free/libre PDF reader, or from simply running a tool over the file to remove the restrictions.