How to iterate over the list of words in a MS Word 2010 document with VBA? - vba

What I'm actually trying to do is implement a button that computes the Gunning fog index. What I would normally do in anything other than VBA is:
Provide a dictionary of words considered to be "complex" (to be compiled from professional jargon that should be used only when necessary)
Get the list of words in the document.
Determine the length of this list.
Get the number of sentences (possibly just the number of "dot whitespace" occurrences) and determine the average words/sentence
Filter the list of words for the "complex" words and compare the length of the "complex word list" with the length of the "word list".
What I don't know is how to get an object like "this documents.wordList", and what the "length" and "filter is-complex" methods would be.
This doesn't need to be especially elegant; it's for personal use only.

The .Find method, combined with a counter that adds 1 to itself each time a word is found, could provide a list of complex words in the document. The length would simply be the counter at the end of the search.
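The Words.Count property returns the number of items in the document's Words collection; similarly, Sentences.Count gives the number of sentences. Note that the Words collection also counts punctuation and paragraph marks as items, so the figure will be higher than the one in Word's Word Count dialog. ActiveDocument.ComputeStatistics(wdStatisticWords) should give the count that matches the dialog.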
This should get you pointed in the right direction. Visit the Word VBA help files for more info on this and other possibilities.
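For reference, the arithmetic the question describes can be sketched outside Word. This is only a rough illustration in Python, assuming the document text has already been pulled out as a plain string; the function name gunning_fog and the complex_words set are made up for the example. It follows the question's plan of using a jargon dictionary instead of the usual "three or more syllables" rule, and it counts sentences crudely by looking for sentence-ending punctuation followed by whitespace.

```python
import re

def gunning_fog(text, complex_words):
    """Fog index with a caller-supplied set of lower-case 'complex' words.

    Assumption from the question: complexity is decided by a jargon
    dictionary, not by counting syllables as the standard index does.
    """
    # Words: runs of letters and apostrophes, lower-cased for lookup.
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    if not words:
        return 0.0
    # Sentences: ., ! or ? followed by whitespace or the end of the text.
    sentences = len(re.findall(r"[.!?](?:\s|$)", text)) or 1
    complex_count = sum(1 for w in words if w in complex_words)
    avg_sentence_length = len(words) / sentences
    percent_complex = 100.0 * complex_count / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)
```

In VBA the same three numbers would come from the word count, the sentence count, and a counter incremented on each dictionary hit, as described above.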

Related

Using Text Mining in R to find a specific set of words in a set of PDFS

I am looking at a set of 10 PDFs, and I want to write code that will tell me the number of times a couple of words I've predetermined appear in the documents. So far, I've been using the pdftools and tm packages to find the frequency of the most common words in the documents, but I don't know how to look for specific words. Thanks!
You can start off with pdftotext, then send its output through your choice of OS string filter. On Windows, the best of several in this case is Findstr:
Note the string count is 13, but two lines have the same word more than once, so the word count would be 15. However, there are no objects called words in a PDF; that's a text thing. So just beware that short words may get you more matches than expected.
pdftotext filename.pdf %temp%\pdfout.txt &&echo/ &&Findstr /O /I "one word or more" %temp%\pdfout.txt
For multiple files, simply wrap that in a "for" loop. On Windows, see For /?
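If the short-word problem matters, one option is to count whole words in the pdftotext output rather than substrings. The following is a minimal sketch in Python, not part of the Findstr approach above; count_terms is a hypothetical helper name, and it assumes the extracted text has already been read into a string.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Count whole-word, case-insensitive occurrences of each term."""
    counts = Counter()
    for term in terms:
        # \b keeps "one" from matching inside "someone" or "money".
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts
```

The text argument would typically be the contents of the file written by pdftotext, read with open(...).read(), once per PDF.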

How can I find a word with a new line in the VBA editor using find and replace?

I would like to go through and find all of the "End" statements in my code but skipping all of the "End x" statements like "End If", "End Sub", "End function", etc.--Just the pure "End". My thought was to use pattern matching, but I am unsure of how to do that.
I already tried using "End\n" and "End[\n]".
Does anyone know how to search for words that end in new lines?
The "find" function in the VBA editor does not support this kind of parameter/functionality.
You will have to manually step through the results and skip the ones you don't want, or manually modify the "End" instances you don't want to catch, then search & replace, and finally restore those End instances back to what they were.
Apologies for answering so long after the question was asked, but thought this information would help future readers as this question is still being actively found.
@TylerH is right that the specific search requested by the user cannot be performed in the VBE Find tool. For information, when "Use Pattern Matching" is selected, the VBE Find tool supports the use of:
? - single character
* - zero or more characters (on the same line)
# - single digit (0 to 9)
[charlist] - any single character in charlist
[!charlist] - any single character not in charlist
... where charlist can be a range of characters (eg [A-Z]) but must be in order (eg [Z-A] is not valid), it can also include multiple ranges of characters (eg [A-BD-E] matches A, B, D or E). Also to match any of ?, * or # then enclose them in square brackets (eg [*] matches an asterisk).
This means the VBE Find tool performs very similarly (perhaps identically ... but I can't provide assurances, VB and VBA not being the same language) to the VB Like operator, which is covered in Microsoft's VB language documentation.
The alternative (which will perform the specific search in the question) is to use the 'Find Text' tool in the VBE add-in MZ-Tools, though note that MZ-Tools is a paid-for tool ... please note I am NOT in any way associated with MZ-Tools or its author. The search text to use in MZ-Tools for the specific search requested in the question is: end\r?$
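For anyone who can export the module to a text file, the same "bare End" search can also be done outside the VBE. This is just a sketch in Python under that assumption; find_bare_end_lines is a made-up name, and allowing a trailing comment after End is my own extension of the end\r?$ idea rather than something the MZ-Tools pattern does.

```python
import re

# A line whose only statement is End, optionally followed by a ' comment.
BARE_END = re.compile(r"[ \t]*End[ \t]*(?:'.*)?", re.IGNORECASE)

def find_bare_end_lines(source):
    """Return 1-based line numbers whose only statement is a bare End."""
    # splitlines() handles both \n and \r\n line endings.
    return [n for n, line in enumerate(source.splitlines(), start=1)
            if BARE_END.fullmatch(line)]
```

Lines such as "End If", "End Sub" or "End Function" are not reported, because fullmatch requires that nothing but whitespace or a comment follows End.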

VBA getcrossreferenceitems(wdRefTypeNumberedItem) Paragraph Cut Off?

I'm using Excel VBA to extract information from a Word document.
In the Word document, there are levels of numbered lists. For example:
1. ABC
1.1 DEF
1.1.1 ABCDEF
2. AAA
2.1 BBB
2.1.1. CCC
and I need to get the full context of each heading in each level and put them into an excel range, i.e. {"1.ABC", "1.1 DEF", "1.1.1 ABCDEF", "2. AAA", "2.1 BBB", "2.1.1. CCC"}
The function I use is:
For Each sec In objDoc.getcrossreferenceitems(wdRefTypeNumberedItem)
However, my headings are truncated if the headings are too long. For example, I have (random text is added for confidentiality reasons):
"5.2.11. Current References: As part of the evaluation process, XXX will conduct 2340AERTQ3493YR. When selecting ADT34534FDGSR, please ensure that they are AERA34AEFDS."
But only
5.2.11. Current References: As part of the evaluation process, XXX will conduct 234
is displayed, and the rest of the sentence is gone.
If anybody has an alternate solution, please let me know.
I can confirm this behavior. A workable, albeit elaborate, solution is to scan the document for all numbered items, which gives you the full text, and then cross-reference that result against the list returned by GetCrossReferenceItems. There's quite some work involved, but it works and gives you the ability to create one list with referable Headings and NumberedItems, which is what I did to build a more user-friendly alternative to Word's own implementation.
You'll have to match the formatting Word applies to the list returned by GetCrossReferenceItems, i.e. the indentation and removal of special characters.
Be careful with track changes. There is a bug in GetCrossReferenceItems: items (in my case headings) that have a tracked change at the beginning of their text are not returned by GetCrossReferenceItems, but internally they are still on the list, so the index is offset. If the item in question is item 11, then GetCrossReferenceItems returns the text of item 12 at position 11. A workaround is to accept all revisions before calling GetCrossReferenceItems and undo that afterwards.
It's not easy, but it works.
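The cross-referencing step can be sketched roughly as a prefix match between the (possibly truncated) items and the full paragraph texts. This is only an illustration in Python on plain strings; normalize and match_truncated are hypothetical names, and collapsing whitespace is an assumption standing in for whatever formatting Word actually applies (special characters are not handled here).

```python
def normalize(s):
    """Collapse whitespace so indentation differences do not block a match."""
    return " ".join(s.split())

def match_truncated(items, full_paragraphs):
    """Pair each truncated item with the first unused paragraph it prefixes."""
    normalized = [normalize(p) for p in full_paragraphs]
    used = set()
    result = []
    for item in items:
        key = normalize(item)
        hit = None
        for i, para in enumerate(normalized):
            if i not in used and para.startswith(key):
                used.add(i)          # each paragraph is matched at most once
                hit = full_paragraphs[i]
                break
        result.append(hit)           # None when no paragraph matches
    return result
```

A None in the result marks an item that could not be paired, which is one way to spot entries that need manual checking.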
I ran into a similar problem in MS Word. I found that the text of some paragraphs is shortened by the following code:
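Sub bug()
    Dim items As Variant, idx As Long
    ' GetCrossReferenceItems returns a 1-based array of strings
    items = ActiveDocument.GetCrossReferenceItems(wdRefTypeNumberedItem)
    For idx = 1 To UBound(items)
        MsgBox items(idx)   ' long paragraphs come back truncated
    Next idx
End Sub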
I had to use a rather long workaround (in Python, sorry, but it is easy to rewrite in VBA):
import win32com.client

varHeadings = []
for par in objDoc.Paragraphs:
    # keep only paragraphs that belong to an outline-numbered list
    if par.Range.ListFormat.ListType == win32com.client.constants.wdListOutlineNumbering:
        idx = par.Range.ListFormat.ListString  # the list number, e.g. "5.2.11."
        txt = par.Range.Text.strip('\r\n')     # full paragraph text without the trailing mark
        varHeadings.append('%s %s' % (idx, txt))
which does work. However, as I said, it is somewhat tedious. So did I miss some VBA function in MS Word, or does GetCrossReferenceItems have a known bug with no replacement available in VBA?

VBA Macro to extract strings from word doc

I have a Word document containing several strings. The first part of these strings is always the same, for example ABC_001, ABC_002, ABC_003. I need to search for the "ABC_" substring in the document, extract all the occurrences ("ABC_001", "ABC_002", "ABC_003") and copy them into an Excel sheet.
Can anyone help?
Thanks in advance.
You can add a reference to "Microsoft VBScript Regular Expressions 5.5" and use regular expressions to pull them out.
Have a look at http://www.macrostash.com/2011/10/08/simple-regular-expression-tutorial-for-excel-vba/
and http://txt2re.com/
and the question "VBA multiple matches within one string using regular expressions execute method".
EDIT:
Actually, it is probably easier to go to Data and "Get External Data", choose a delimiter and import, either manually or by recording a macro to get a feeling for the VBA structure.
This should get you all the entries in separate cells; then go over them with MID to get the part you need.
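To give an idea of what the regular-expression route looks like, here is a small sketch in Python rather than VBA. It assumes the part after the prefix is always digits, which the examples in the question suggest but do not guarantee; extract_codes is a name invented for the example.

```python
import re

def extract_codes(text, prefix="ABC_"):
    """Return every prefix-plus-digits token, in order of appearance."""
    # re.escape guards against prefixes containing regex metacharacters.
    return re.findall(re.escape(prefix) + r"\d+", text)
```

The same pattern, ABC_\d+, should also work with the VBScript RegExp object when its Global property is set to True, with each match then written to a cell.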

Read Index (Table of Content) of word document

I am interested in getting the topic headings (say, all lines with Heading 1 and Heading 2) from a Word document. Using VBA you can parse through every line in the document and check its style; however, this seems to be a tedious job. I believe there should be some easier way of doing it. Any pointers?
A pointer --->
tempD = ActiveDocument.GetCrossReferenceItems(wdRefTypeHeading)
gives you a list of the headings in the document.