python-pptx: unable extract slide title containing equation - title

I was trying to extract from a pptx file the slide titles. The method proposed here works fine if the title contains only text. However, as soon as the title contains an equation (i.e. generated with the office equation tool), the described method does not work, and instead it gives the following error (let us assume that the slide number is the 26th):
from pptx import Presentation
prs = Presentation('My_Presentation.pptx')
print(prs.slides[26].shapes.title.text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SlideShapes' object has no attribute 'text'
Example with a working slide:
print(prs.slides[1].shapes.title.text)
Outline
(on any other "normal title" slide this works just fine and the result is what I would expect)
Any work around for me would do, even completely ignore the equation part within the title as I have quite a few presentations to parse in this way and I cannot really go manual on this.
I am using Python 3.7
Edit: I fixed the code, I did a wrong copy-paste and added a working example

Related

openpyxl destroys functions on save

I'm trying to save pandas DF into an existing spreadsheet. I found an excellent answer at Writing Pandas DataFrame to Excel: How to auto-adjust column widths, which is really continuation of another question *)
The problem though is that when I use it, on trying to load the spreadsheet I get an error on "damaged content", complaining about a drawing - even though I have none in the spreadsheet, and all functions are gone. Static data are still there.
log is
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
-<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>error171360_05.xml</logFileName>
<summary>Errors were detected in file 'test.xlsx'</summary>
-<repairedRecords summary="Following is a list of repairs:">
<repairedRecord>Repaired Records: Drawing from /xl/drawings/drawing1.xml part (Drawing shape)</repairedRecord>
</repairedRecords>
</recoveryLog>
Any ideas?
Edit: I'm pretty sure now it's not caused by pandas, as opening workbook, adding an empty sheet, and saving it removes all the formulas.
workbook = load_workbook(file)
try:
sheet = workbook["Result"]
except KeyError:
sheet = workbook.create_sheet("Result")
# for r in dataframe_to_rows(result, index=False, header=True):
# sheet.append(r)
workbook.save(file)
It doesn't produce the error above though.
Edit2: There's a question from 2013 (Openpyxl: Formulas getting removed when saving file) which says OpenPyxl doesn't support it, with a feature requested to do so. But the link to the feature doesn't work, so I have no idea whether it works or not.
*) there is a small bug in the function in that answer, sheet_name is a param, but it also tries to look it up in **kwargs, which of course fails, so gets replaced by a default value even if passed into the function. I can't comment on the question, so maybe #maxU will read this and edit..

Lost lines when transfer text from a table to a shape in PowerPoint

I have been working on a PowerPoint presentation for a game show at my high school. I tried to import questions from a PowerPoint table like this:
using the code as follows:
Slide1.Shapes("question").TextFrame.TextRange.Text = ActivePresentation.Slides(6).Shapes(1).Table.Cell(k + x, 3).Shape.TextFrame.TextRange.Text
Slide1.Shapes("answer").TextFrame.TextRange.Text = ActivePresentation.Slides(6).Shapes(1).Table.Cell(k + x, 4).Shape.TextFrame.TextRange.Text
with k and x being variables for my question number calculating method (which is not related to the problem).
The problem that I encounter is some of the text transferred are missing while the code is running, for example:
This occasionally happens from time to time, with the text only appearing a few beginning lines and the rest are not shown.
When I escaped the slide show and then show it again (without the code running and the data is still there), the text appears just fine:
Even when the code is running and the error occurs on the slide show, when I Alt+Tab to see the editor window the text is still there fully.
I couldn't figure out anything for days now. Please help me, I'd be very grateful if you did so.

Open XML SDK 2.5 document validation: The 'smtClean' attribute is not declared

Our current work project involves opening a Microsoft PowerPoint file (.pptx format), changing some text and chart values, and then presenting the edited version to the end user.
This works rather well so far, but I'm puzzled by what happens when I try to validate the document afterwards. Using the DocumentFormat.OpenXml.Validation.OpenXmlValidator class, I run the Validate function with the PresentationDocument passed in as the only parameter.
Dim document As PresentationDocument = PresentationDocument.Open(templateFilePath, True)
Dim validator As OpenXmlValidator = New OpenXmlValidator()
Dim errors = validator.Validate(document)
For Each errInfo As ValidationErrorInfo In errors
Debug.Print("Error: """ & errInfo .Description & """")
Debug.Print("XPath: " & errInfo .Path.XPath)
Next
Validate() returns an array filled with instances of ValidationErrorInfo. Just about all of these give the same error description when debugging:
The 'smtClean' attribute is not declared.
The XPath for each error looks like this (numbers vary; there appears to be one error per piece of text):
/p:sldLayout[1]/p:cSld[1]/p:spTree[1]/p:sp[4]/p:txBody[1]/a:p[1]/a:fld[1]/a:rPr[1]
Every TableCell has a Paragraph, with child element Run, and this Run has child elements RunProperties and Text. I modify the Text in my scripts, but I do not touch anything else.
Searching for 'smtClean' gave me an MSDN entry for RunProperties which shows 'smtClean' as one of the possible values to be set, but if I create a new instance of DocumentFormat.OpenXml.Presentation.Drawing.RunProperties the 'smtClean' attribute is not available.
Looking around, I found threads where people mentioned merged documents to be one possible cause, but these errors occur even in an unmodified presentation with only a single slide and table in it. Using the Open XML SDK 2.5 Productivity Tool to Validate the base document, I get the same result.
The errors also occur no matter which format I ask the Validator to test for - both the 2007, 2010 and 2013 version of the PowerPoint format return the same amount of errors.
Finally: The file itself works just fine in PowerPoint, even after being modified. I am curious about why the validator returns so many errors, however.
Thanks in advance for any help.
we process Office Documents and remove this Attribute in all Types (Word, Powerpoint, Excel) without Side-Effects. Eric White has identified this as Bug: smtpClean attribute not supported
It is fixed in the current OpenXml SDK on Branch Office2016: https://github.com/OfficeDev/Open-XML-SDK/tree/Office2016
Regards...
Smart tags were deprecated in Office 2010, and the SDK v2.5 validator doesn't support smart tag elements and therefore marks them as invalid.
Please see this MSDN article for more information.
The current developer of the productivity tool says in this thread that the smtClean validation error was a bug in some situations and has been fixed in v3 of the tool.
v3 (the Office 2016 productivity tool) can be found here, however I'm not sure how compatible it is with older versions of Office.

Parsing text line by line across columns in Word document

Using VBA or Interop.Word I would like to simply parse the text in a Word document line by line, regardless of whether the text in that line spans multiple columns. As per the example below, I want:
Line 1 = "Line 1 Line 5"
Line 2 = "Line 2 Line 6"
Line 3 = "Line 3 Line 7"
etc.
I can't find any method, property or object in the Word Object Model that can facilitate this. I tried exporting to PDF and then opening that same file again in Word, but the conversion does not retain the original text line by line and gets very scrambled in places.
As per my comment above: a workaround is to try to get the document recreated using layout mode. In this case the Word file came from an export of an Adobe PDF scan of a printed document, so it's only applicable in these situations.

How to parse text from a plain text file and use the result to highlight a PDF file

Back in 2010, some guy claimed to be capable of doing this:
http://www.mobileread.com/forums/showthread.php?t=103847
"The Kindle stores its annotations in a Mobipocket (".mobi") file for each document and in one long text file named "My Clippings.txt." In this post I describe a system that synchronizes these annotations with PDF versions of the corresponding documents on a computer.
Overview
This system is embodied in an Applescript that parses the My Clippings file and controls the Skim PDF reader. The script first parses the clippings file. It then searches through the clippings and isolates any that come from documents on the kindle matching the filename of the currently open PDF file (the "pertinent clippings"). The script then iterates through each of the pertinent clippings, locating the matching text or location in the PDF document and applying highlights or adding notes where appropriate. The end result is an annotated, printable PDF document that matches the document on the kindle.
You can download the script here: http://dl.dropbox.com/u/2541109/KindleClippings.scpt. Before running the script, be sure to change the value of MyEmail to match your sending address and to verify that the Kindle mount point defined in MyClippingsFile is correct. You'll also need the free Skim PDF Reader.
To use it, send or copy a document file to your kindle. Remember, the kindle supports RTF, DOC, TXT and other common text formats and it will convert them into MobiPocket files internally for easier reading. Make some notes. Then take the same document that you just sent to the kindle and convert it to a PDF, e.g. by using the print to PDF feature in Mac OS X. Be sure to keep the filename the same. Open that same PDF in Skim and run the script. The highlights and notes should appear in the PDF.
If you're interested in how this works, read more on my blog here:
[not longer available]
Sadly, his script is no longer available, nor his blog.
Do you guys know if this is possible? I've been looking for this kind of functionality but can't find it anywhere.
This code, using python and PyMuPDF, works:
import fitz
# the document to annotate
doc = fitz.open("text_to_highlight.pdf")
# the text to be marked
text_list = [
"first piece of text",
"second piece of text",
"third piece of text"
]
for page in doc:
for text in text_list:
rl = page.search_for(text, quads = True)
page.add_highlight_annot(rl)
# save to a new PDF
doc.save("text_annotated.pdf")
The original 'My Clippings.txt' should be manipulated somehow, stringr could work but I found more useful to manipulate the text with multiple selections in Sublime Text---the goal is to have a list of highlights in the form of text_list above.
I am trying to do this using Python + a Windows macro creator (I'm a Win 7 user). You can use this approach to save the file as RTF, DOCX, PDF, etc. So far, it's been reasonably effective. Do note 2 things first:
1- the 'My Clippings' file only saves the text and the page, it does not save the location on the page (e.g., if you highlighted "mammals are animals" on page 15, it will give you this line and the page number, but if there are more than one "mammals are animals" on page 15, it's impossible to know which one you've highlighted). This is specially bad when you've highlighted a generic word, like "animals" or "the". And if you made comments by pressing on a word, this word is the only information you'll get about what in that page the comment refers to (e.g., I pressed on "animals" and the menu popped up, I selected 'Comment'. If "animals" appears 20 times on page 15, I cannot know to which of them my comment is refering).
2- The only way to retrieve the location on the page would be to analyze the *.pds and *.pdt files, inside the *.sdr folder in Kindle's drive ('Documents'). I can make no sense of these files.
In Python, you can run an easy code to extract the information you want from "My Clippings". Then you can use a macro creator to automate the process of copying the text and annotating it to the PDF (using Adobe Acrobat, for example), and then saving the PDF file.
Exemplifying with Adobe Acrobat:
Say I want to save all my highlights to the PDF file. First, I'll create a *.txt file on Python and run a script to copy all the strings related to the highlights to this new txt file (i.e., the highlighted text & the page number). Here's an example of such code (but first, copy and paste the "My Clippings.txt" file to the IDE start folder, e.g.: C:\Python27):
#for python 2.7.6
with open('My Clippings.txt','r') as rf:
with open('My Clippings Output.txt','w') as wf:
access = 0
bookTitle = 'Book Title'#put the book file's name as it's written in "My Clippings.txt"
for x in rf:
if access == 1:
wf.write(x)
if bookTitle in x:
access = 1
#for highlights only, instead of all annotations, include this if statement:
if (' | Added on ' in x) and ('- Your Note ' in x) or ('- Your Bookmark ' in x):
access = 0
if x == '==========\n':
access = 0
Then I'll create a macro to copy the page number in the "My Clippings Output.txt" file (it's inside the same folder you put the "My Clippings.txt" file), paste in Acrobat "page window", find (ctrl+f) the string in the page, then press "highlight". Done!
There's a catch in Acrobat though, the search/find function has a limit of ~28 chars, so your highlighted text can't be longer than that. I still don't know how to circumvent this limitation... I raised this problem here https://superuser.com/questions/884221/how-to-search-and-highlight-long-passages-in-a-pdf-file . As a bypass to the 28 chars limit on Acrobat, you can program the macro to copy using "shift"+"right arrow 28 times", and then use "cut" instead of "copy".
There are many free-to-use and libre macro creators out there, just google and choose the one you like best. For Windows, my favorite one is Pulover's Macro Creator. If you have any doubts about the process you can comment here or PM me. I'd prefer you to comment here, so that I can improve the answer