Using Text Mining in R to find a specific set of words in a set of PDFs - text-mining

I am looking at a set of 10 PDFs, and I want to write code that will tell me the number of times a couple of words I've predetermined appear in each document. So far, I've been using the pdftools and tm packages to find the frequency of the most common words in the documents, but I don't know how to look for specific words. Thanks!

You can start off with pdftotext, then send its output through your choice of OS string filter. On Windows, the best of several options in this case is Findstr:
pdftotext filename.pdf %temp%\pdfout.txt &&echo/ &&Findstr /O /I "one word or more" %temp%\pdfout.txt
Note that the string count is 13, but two lines contain the same word more than once, so the word count would be 15. However, there are no objects called words in a PDF; that is a text concept. So beware that short words may give you more matches than expected.
For multiple files, simply wrap that in a "for" loop. On Windows see For /?
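If whole-word counts are wanted rather than matching lines, the counting step can be sketched in a few lines. This is a minimal illustration in Python (not R), assuming the text has already been extracted with pdftotext as above; the function name count_terms and the sample text are hypothetical.

```python
import re


def count_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each term in text."""
    counts = {}
    for term in terms:
        # \b anchors keep "word" from also matching inside "words".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    sample = "One word. Two words, one WORD."
    print(count_terms(sample, ["one", "word"]))  # {'one': 2, 'word': 2}
```

For several PDFs, the same function could be called once per extracted text file and the resulting dictionaries collected by file name.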

Related

Search for Multiple Strings in PDF and Word, Return Page Number(s) Where Strings Appear - Power Automate

I have a list of strings (e.g. "A3.11.2.3", "A3.2.1" and "A12.1.3(b)") and need to find a streamlined way to extract the page number(s) on which each string appears from PDF and Word files.
The list of strings is fixed/can be hardcoded, though it would be better if they were read from a particular Excel file. The Flow I am trying to create is:
1. When a file is created;
2. Search the file for the list of strings and return all page numbers on which the strings appear, with each page number separated by a comma;
3. Populate a Microsoft Word template with each string's page numbers (i.e. a template table will be created with one string on each row, and the page numbers will be populated in the column beside it).
Items 1 and 3 are easy; item 2 has been destroying my brain for how to implement.
The files to be searched are most often PDFs (always file-created, so no need to add OCR) but occasionally include Word documents.
All ideas welcome!
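One possible shape for item 2, sketched in Python rather than as a Power Automate action. It assumes the text of each page has already been extracted into a list (one entry per page, by whatever PDF or Word connector is available); the function name pages_containing and the sample pages are hypothetical.

```python
def pages_containing(pages, needles):
    """Map each search string to a comma-separated list of 1-based page numbers."""
    result = {}
    for needle in needles:
        hits = [
            str(number)
            for number, text in enumerate(pages, start=1)
            if needle in text  # plain substring test, case-sensitive
        ]
        result[needle] = ", ".join(hits)
    return result


if __name__ == "__main__":
    pages = ["See A3.2.1 here", "nothing relevant", "A3.2.1 and A12.1.3(b)"]
    print(pages_containing(pages, ["A3.11.2.3", "A3.2.1", "A12.1.3(b)"]))
```

Because this is a plain substring test, "A3.2.1" would also be found inside a longer reference such as "A3.2.11"; a stricter match would be needed if such references can occur.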

How can I find a word with a new line in the VBA editor using find and replace?

I would like to go through and find all of the "End" statements in my code while skipping all of the "End x" statements like "End If", "End Sub", "End Function", etc. Just the pure "End". My thought was to use pattern matching, but I am unsure of how to do that.
I already tried using "End\n" and "End[\n]".
Does anyone know how to search for words that end in new lines?
The "find" function in the VBA editor does not support this kind of parameter/functionality.
You will have to manually step through the results and skip the ones you don't want to catch, or manually modify the "End" instances you don't want to catch, then search & replace, and finally restore all the End instances back to what you want.
Apologies for answering so long after the question was asked, but thought this information would help future readers as this question is still being actively found.
@TylerH is right that the specific search requested by the user cannot be performed in the VBE Find tool. For information, when "Use Pattern Matching" is selected the VBE Find tool supports use of:
? - single character
* - zero or more characters (on the same line)
# - single digit (0 to 9)
[charlist] - any single character in charlist
[!charlist] - any single character not in charlist
... where charlist can be a range of characters (e.g. [A-Z]) but must be in order (e.g. [Z-A] is not valid). It can also include multiple ranges of characters (e.g. [A-BD-E] matches A, B, D or E). To match any of ?, * or #, enclose them in square brackets (e.g. [*] matches an asterisk).
This means the VBE Find tool performs very similarly (perhaps identically ... but I can't provide assurances, VB and VBA not being the same language) to the VB Like operator, which is covered in Microsoft's documentation for that operator.
The alternative (which will perform the specific search in the question) is to use the 'Find Text' tool in the VBE Add-In MZ-Tools, though note MZ-Tools is a paid-for tool ... please note I am NOT in any way associated with MZ-Tools or its author. The search text to use in MZ-Tools for the specific search requested in the question is: end\r?$
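For readers without MZ-Tools, the same expression can be tried on exported module source. The following is a minimal sketch in Python, not a description of how MZ-Tools works internally; the flags (case-insensitive, multi-line) and the helper name bare_end_lines are assumptions made for the illustration.

```python
import re

# The expression quoted above. IGNORECASE and MULTILINE are assumptions here,
# so that "$" matches at the end of every line and "End" matches in any case.
BARE_END = re.compile(r"end\r?$", re.IGNORECASE | re.MULTILINE)


def bare_end_lines(source):
    """Return 1-based line numbers of lines that finish with 'End'."""
    return [
        source.count("\n", 0, match.start()) + 1
        for match in BARE_END.finditer(source)
    ]


if __name__ == "__main__":
    code = "Sub Demo()\r\nIf x Then\r\nEnd\r\nEnd If\r\nEnd Sub\r\n"
    print(bare_end_lines(code))  # [3]
```

Note that the expression only checks how a line finishes, so a line ending in an identifier such as "Send" would also be reported.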

How to iterate over the list of words in a MS Word 2010 document with VBA?

What I'm actually trying to do is implement a button to compute the Gunning fog index. What I would normally do in anything other than VBA is:
Provide a dictionary of words considered to be "complex" (to be compiled from professional jargon that should be used only when necessary)
Get the list of words in the document.
Determine the length of this list.
Get the number of sentences (possibly just the number of "dot whitespace" occurrences) and determine the average words/sentence
Filter the list of words for the "complex" words and compare the length of the "complex word list" with the length of the "word list".
What I don't know is how to get an object like "this document's wordList", and what the "length" and "filter is-complex" methods would be.
This doesn't need to be especially elegant; it's for personal use only.
The .Find method, combined with a counter that adds 1 to itself each time a word is found, could provide a list of complex words in the document. The length would simply be the counter at the end of the search.
The Words.Count property will return the number of all words in the document. Similarly you could do Sentences.Count for the number of sentences.
This should get you pointed in the right direction. Visit the Word VBA help files for more info on this and other possibilities.
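The steps from the question can be sketched outside Word to check the arithmetic. This is an illustration in Python, not Word VBA; it uses the usual Gunning fog formula, 0.4 * (words per sentence + 100 * complex words / words), with the "complex" words taken from a supplied list as the question proposes. The tokenising expressions and the name fog_index are assumptions; in Word the two counts would come from Words.Count and Sentences.Count.

```python
import re


def fog_index(text, complex_words):
    """Gunning fog index using a supplied list of 'complex' words."""
    words = re.findall(r"[A-Za-z']+", text)
    # Rough sentence split on ., ! or ? (an assumption, not Word's own logic).
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    complex_set = {w.lower() for w in complex_words}
    complex_count = sum(1 for w in words if w.lower() in complex_set)
    words_per_sentence = len(words) / len(sentences)
    complex_percent = 100.0 * complex_count / len(words)
    return 0.4 * (words_per_sentence + complex_percent)


if __name__ == "__main__":
    sample = "We leverage synergy. It works."
    print(round(fog_index(sample, ["leverage", "synergy"]), 2))  # 17.0
```

In the sample there are 5 words, 2 sentences and 2 complex words, giving 0.4 * (2.5 + 40) = 17.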

Macro for adding spaces between merged words in Microsoft Word

Today I had to edit a terrible text in Microsoft Word that had many paragraphs where all the words were merged and the text was impossible to read, nightmare :)
So, I thought, maybe there is a solution to this?
I have this idea:
1) I go through the text (or paragraph) placing the cursor between words which should be separated (i.e. loremipsum should become lorem ipsum).
2) The macro remembers all positions where I placed the cursor.
3) The macro inserts the necessary spaces between words.
But inserting a space as soon as I place the cursor could also be good.
Any suggestions?
Thanks!
User939536, welcome to SO. Before you build a full-fledged Add-In to Word, a good chunk of your functionality is already there.
If you go to Word Options->Proofing->Autocorrect Options (your version may have a different method), you can see the list of words to automatically replace with other words. You can put your most commonly combined words here and let Word do the work for you.
The other route of course is to build an Add-In that will:
Have a database of commonly replaced words (probably a text file that's read on startup)
A search function to search the document
An interface to present the user with the choice to Replace or Skip.
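The lookup-and-replace core of such an Add-In can be sketched independently of Word. The Python below is only an illustration of the idea, with a hypothetical split_merged function and a replacement table whose keys are assumed to be lower-case; the Replace/Skip prompt and the Word integration are left out.

```python
import re


def split_merged(text, replacements):
    """Replace known merged words using a table like {"loremipsum": "lorem ipsum"}."""
    if not replacements:
        return text
    # Longest keys first so a longer merged word wins over a shorter prefix.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b", re.IGNORECASE
    )
    # Keys are assumed lower-case; the original capitalisation is not preserved.
    return pattern.sub(lambda m: replacements[m.group(0).lower()], text)


if __name__ == "__main__":
    table = {"loremipsum": "lorem ipsum"}
    print(split_merged("A loremipsum text", table))  # A lorem ipsum text
```

The table could be loaded from a text file at startup, as suggested in the list above.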

Script to modify outlook (2003) contacts

I'm trying to clean up my Outlook 2003 contacts, which have become a rather ugly mess of various formatting, etc.
Basically, I have a bunch of contacts, in the form of either:
0xxxxxxxxx [ten digits, starting with 0]
0xxxxxxxx [nine digits, starting with 0]
0xxxxxxxx (xxxxx) [the same nine digits above with the last five repeated in parentheses]
+xxxxxxx [some random "complete" number with an international dialing code, etc]
I want all of the numbers to match the last format. The algorithm is simple enough: for the first two types, drop the 0 and add +YYY where YYY is my country code. Ditto for the third, but drop everything in parentheses.
My problem is that I don't know how to go about doing this. I've written a million scripts in my life in Perl, but I'd rather not export everything to text, process it, and re-import; I'd like to have a one-click solution that can easily be re-run (such as when I import a new contact from my company's directory, which comes in one of the forms above). I suspect that VBScript is the way to go; I've seen a few references online to accessing contacts as objects, but I'm not really sure what the best way to get started is.
Any recommended resources?
This is a duplicate of https://superuser.com/questions/15913/script-to-modify-outlook-2003-contacts ; I'm not sure which site is a better location
I would say VBA, rather than VBScript.
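Sub GetContactsTel()
    Dim oFolder As Object
    Dim oItem As Object
    Dim i As Long
    Set oFolder = Application.GetNamespace("MAPI").GetDefaultFolder(olFolderContacts)
    ' Loop through all of the items in the folder.
    For i = 1 To oFolder.Items.Count
        Set oItem = oFolder.Items(i)
        ' Skip distribution lists, which have no telephone number fields.
        If oItem.Class = olContact Then Debug.Print oItem.BusinessTelephoneNumber
    Next i
End Sub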
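The rewriting rule described in the question (drop the leading 0, prepend the country code, discard anything in parentheses) can be checked separately before it is ported into the loop above. This is a minimal sketch in Python rather than VBA; the function name normalize_number and the country code "999" are hypothetical stand-ins for the question's YYY.

```python
import re


def normalize_number(raw, country_code):
    """Convert the local formats from the question to +<country code><number>."""
    # Third format: drop the repeated digits in parentheses.
    number = re.sub(r"\s*\(.*?\)\s*", "", raw).strip()
    if number.startswith("+"):
        return number  # already in the target format
    if re.fullmatch(r"0\d{8,9}", number):
        # First two formats: nine or ten digits starting with 0.
        return "+" + country_code + number[1:]
    return number  # unrecognised format, left untouched


if __name__ == "__main__":
    for raw in ["0123456789", "012345678", "012345678 (45678)", "+441234567"]:
        print(raw, "->", normalize_number(raw, "999"))
```

Leaving unrecognised numbers untouched is a deliberate choice here, so that re-running the clean-up does not damage entries that are already correct.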