MS Word 2010 VBA - how to parse non-latin characters? - vba

I have a VBA script that prints out the word frequency in a document.
Sub WordFrequency()
Const maxwords = 9000 'Maximum unique words allowed
Dim SingleWord As String 'Raw word pulled from doc
Dim Words(maxwords) As String 'Array to hold unique words
Dim Freq(maxwords) As Integer 'Frequency counter for unique words
Dim WordNum As Integer 'Number of unique words
Dim ByFreq As Boolean 'Flag for sorting order
Dim ttlwds As Long 'Total words in the document
Dim Excludes As String 'Words to be excluded
Dim Found As Boolean 'Temporary flag
Dim j, k, l, Temp As Integer 'Temporary variables
Dim ans As String 'How user wants to sort results
Dim tword As String '
' Set up excluded words
Excludes = "[the][a][of][is][to][for][by][be][and][are]"
' Find out how to sort
ByFreq = True
ans = InputBox("Sort by WORD or by FREQ?", "Sort order", "WORD")
If ans = "" Then End
If UCase(ans) = "WORD" Then
ByFreq = False
End If
Selection.HomeKey Unit:=wdStory
System.Cursor = wdCursorWait
WordNum = 0
ttlwds = ActiveDocument.Words.Count
' Control the repeat
For Each aword In ActiveDocument.Words
SingleWord = Trim(LCase(aword))
'Out of range?
If SingleWord < "a" Or SingleWord > "z" Then
SingleWord = ""
End If
'On exclude list?
If InStr(Excludes, "[" & SingleWord & "]") Then
SingleWord = ""
End If
If Len(SingleWord) > 0 Then
Found = False
For j = 1 To WordNum
If Words(j) = SingleWord Then
Freq(j) = Freq(j) + 1
Found = True
Exit For
End If
Next j
If Not Found Then
WordNum = WordNum + 1
Words(WordNum) = SingleWord
Freq(WordNum) = 1
End If
If WordNum > maxwords - 1 Then
j = MsgBox("Too many words.", vbOKOnly)
Exit For
End If
End If
ttlwds = ttlwds - 1
StatusBar = "Remaining: " & ttlwds & ", Unique: " & WordNum
Next aword
' Now sort it into word order
For j = 1 To WordNum - 1
k = j
For l = j + 1 To WordNum
If (Not ByFreq And Words(l) < Words(k)) _
Or (ByFreq And Freq(l) > Freq(k)) Then k = l
Next l
If k <> j Then
tword = Words(j)
Words(j) = Words(k)
Words(k) = tword
Temp = Freq(j)
Freq(j) = Freq(k)
Freq(k) = Temp
End If
StatusBar = "Sorting: " & WordNum - j
Next j
' Now write out the results
tmpName = ActiveDocument.AttachedTemplate.FullName
Documents.Add Template:=tmpName, NewTemplate:=False
Selection.ParagraphFormat.TabStops.ClearAll
With Selection
For j = 1 To WordNum
.TypeText Text:=Trim(Str(Freq(j))) _
& vbTab & Words(j) & vbCrLf
Next j
End With
System.Cursor = wdCursorNormal
j = MsgBox("There were " & Trim(Str(WordNum)) & _
" different words ", vbOKOnly, "Finished")
End Sub
However, this VBA script is not picking up non-latin characters in the document.
I am using Arial Unicode MS and another unicode font
What do I need in the macro to pick up these words?
-Thanks

It can't work with this...
If SingleWord < "a" Or SingleWord > "z" Then
SingleWord = ""
End If
Remove that, and you should at least see non-Latin words. Beyond that, it's more a question of what you are trying to filter out. If it's stuff like "'words' starting with a digit or a punctuation mark" then you probably need to have a look through the Unicode tables and work out a better set of things to filter out (bearing in mind that some transliterated scripts may even use digits at the beginnings of words).

Related

VBA Highlight multiple keywords with wildcard from text strings

Any help here would be appreciated please.
The included VBA code almost meets the intended purpose, however, I need a solution that enables the use of wildcards and highlights all parameters contained between "##", "%%" or potentially other special characters (special characters included).
For instance, lets say in the cell range B2:B10 we would find something like:
Checked at ##date1## and ##hour1##
But I want to be able to do a search and highlight using # * # or % * % within a selected determined cell range with the end result (bold being color):
Checked at ##date1## and ##hour1##
Sub HighlightStrings()
Application.ScreenUpdating = False
Dim Rng As Range
Dim cFnd As String
Dim xTmp As String
Dim x As Long
Dim m As Long
Dim y As Long
Dim xFNum As Integer
Dim xArrFnd As Variant
Dim xStr As String
cFnd = InputBox("Please enter the text, separate them by comma:")
If Len(cFnd) < 1 Then Exit Sub
xArrFnd = Split(cFnd, ",")
For Each Rng In Selection
With Rng
For xFNum = 0 To UBound(xArrFnd)
xStr = xArrFnd(xFNum)
y = Len(xStr)
m = UBound(Split(Rng.Value, xStr))
If m > 0 Then
xTmp = ""
For x = 0 To m - 1
xTmp = xTmp & Split(Rng.Value, xStr)(x)
.Characters(Start:=Len(xTmp) + 1, Length:=y).Font.ColorIndex = 3
xTmp = xTmp & xStr
Next
End If
Next xFNum
End With
Next Rng
Application.ScreenUpdating = True
End Sub
Thank you
Okay This seems to work for me. There is a limitation we can work on if required: the phrase to highlight must be padded with spaces on both sides.
Option Explicit
Option Base 0
Sub testreplace()
Dim I As Integer 'Iteration
Dim FlagNum As Integer 'Flag Number
Dim RG As Range 'Whole range
Dim CL As Range 'Each Cell
Dim FlagChar As String 'Flag characters
Dim ArrFlag 'Flag Char Array
Dim TextTemp As String 'Cell Contents
Set RG = Selection
FlagChar = "##"
FlagChar = InputBox("Enter 'Flag Characters' separated by a comma." & vbCrLf & vbCrLf & _
"Example:" & vbCrLf & vbCrLf & _
"##,%%,&&" & vbCrLf & _
"$$,XX", "Flag Characters to Highlight", "##,%%")
ArrFlag = Split(FlagChar, ",")
For Each CL In RG.Cells
TextTemp = CL.Value
For FlagNum = 0 To UBound(ArrFlag)
For I = 1 To Len(TextTemp)
'Debug.Print "<<" & Mid(TextTemp, I, Len(ArrFlag(Flagnum)) + 1) & _
"=" & " " & ArrFlag(Flagnum) & ">>"
If Mid(TextTemp, I, Len(ArrFlag(FlagNum)) + 1) = " " & ArrFlag(FlagNum) Then
CL.Characters(I + 1, InStr(I, TextTemp, ArrFlag(FlagNum) & " ") + _
Len(ArrFlag(FlagNum)) - I).Font.ColorIndex = 3
End If
Next I
Next FlagNum
Next CL
End Sub
Here's an example of it working:

How to extract numbers from a text string in VBA [duplicate]

This question already has answers here:
Excel UDF for capturing numbers within characters
(4 answers)
Closed 4 years ago.
I need to extract the numbers from a string of text and I'm not quite sure how to do it. The code I've attached below is very preliminary and most likely can be done more elegantly. A sample of the string I'm trying to parse is as follows:
"ID CSys ID Set ID Set Value Set Title 7026..Plate Top MajorPrn Stress 7027..Plate Top MinorPrn Stress 7033..Plate Top VonMises Stress"
I need to pull the numbers 7026, 7027, and 7033. The string will vary in length and the number of values that I'll need to pull will also vary. Any help would be much appreciated. Thanks!
Dim WrdArray() As String
Dim txtstrm As TextStream
Dim line As String
Dim clm As Long
Dim Rw As Long
'------------------------------------------------------------
Dim i As Long
Dim strPath As String
Dim strLine As String
Dim count, count1 As Integer
Dim holder As String
Dim smallSample As String
count = 0
count1 = 1
holder = ""
'Remove Filters and Add Custom Filters
Call Application.FileDialog(msoFileDialogOpen).Filters.Clear
Call Application.FileDialog(msoFileDialogOpen).Filters.Add("Text Files", "*.txt")
Call Application.FileDialog(msoFileDialogOpen).Filters.Add("Dat Files", "*.dat")
'only allow the user to select one file
Application.FileDialog(msoFileDialogOpen).AllowMultiSelect = False
'make the file dialog visible to the user
intChoice = Application.FileDialog(msoFileDialogOpen).Show
'determine what choice the user made
If intChoice <> 0 Then
'get the file path selected by the user
strPath = Application.FileDialog(msoFileDialogOpen).SelectedItems(1)
End If
'------------------------------------------------------------
If strPath <> "" Then
Set txtstrm = FSO.OpenTextFile(strPath)
Else
MsgBox "No file selected.", vbExclamation
Exit Sub
End If
Rw = 1
Do Until txtstrm.AtEndOfStream
line = txtstrm.ReadLine
clm = 1
WrdArray() = Split(line, " ") 'Change with ; if required
For Each wrd In WrdArray()
If Rw = 1 Then
Do While count <> Len(wrd)
smallSample = Left(wrd, 1)
If smallSample = "0" Or smallSample = "1" Or smallSample = "2" Or smallSample = "3" Or smallSample = "4" _
Or smallSample = "5" Or smallSample = "6" Or smallSample = "7" Or smallSample = "8" _
Or smallSample = "9" Then
holder = holder & smallSample
Else
If holder <> "" Then
Cells(count1, 1) = holder
count1 = count1 + 1
End If
holder = ""
End If
wrd = Right(wrd, Len(wrd) - 1)
clm = clm + 4
ActiveSheet.Cells(Rw, clm) = holder
Loop
Else
ActiveSheet.Cells(Rw, clm) = wrd
clm = clm + 1
End If
Next wrd
Rw = Rw + 1
Loop
txtstrm.Close
End Sub
You can use Regular Expressions.
Sub ExtractNumbers()
Dim str As String, regex As regExp, matches As MatchCollection, match As match
str = "ID CSys ID Set ID Set Value Set Title 7026..Plate Top MajorPrn Stress 7027..Plate Top MinorPrn Stress 7033..Plate Top VonMises Stress"
Set regex = New regExp
regex.Pattern = "\d+" '~~~> Look for variable length numbers only
regex.Global = True
If (regex.Test(str) = True) Then
Set matches = regex.Execute(str) '~~~> Execute search
For Each match In matches
Debug.Print match.Value '~~~> Prints: 7026, 7027, 7033
Next
End If
End Sub
Make sure you reference the VBA regex library:
Open VBA editor
Tools > References...
Check Microsoft VBScript Regular Expression 5.5
To exact numbers in the form you want, try something like:
Sub dural()
Dim s As String, i As Long, L As Long, c As String, temp As String
s = [A1]
L = Len(s)
temp = ""
For i = 1 To L
c = Mid(s, i, 1)
If c Like "[0-9]" Then
temp = temp & c
Else
temp = temp & " "
End If
Next i
temp = "'" & Application.WorksheetFunction.Trim(temp)
temp = Replace(temp, " ", ",")
[B1] = temp
End Sub
You can use this function that splits the "words and test for numeric:
Function numfromstring(str As String) As String
Dim strarr() As String
str = Replace(str, ".", " ")
strarr = Split(str)
Dim i As Long
For i = 0 To UBound(strarr)
If IsNumeric(strarr(i)) Then
numfromstring = numfromstring & "," & strarr(i)
End If
Next i
numfromstring = Mid(numfromstring, 2)
End Function
You would call it from the worksheet with a formula:
=numfromstring(A1)
Or from vba like this:
Sub try()
Dim str As String
str = "ID CSys ID Set ID Set Value Set Title 7026..Plate Top MajorPrn Stress 7027..Plate Top MinorPrn Stress 7033..Plate Top VonMises Stress"
Dim out As String
out = numfromstring(str)
Debug.Print out
End Sub
If you have Office 365 Excel you can use this array formula:
=TEXTJOIN(",",TRUE,IF(ISNUMBER(--TRIM(MID(SUBSTITUTE(SUBSTITUTE(A1,"."," ")," ",REPT(" ",99)),(ROW($1:$100)-1)*99+1,99))),TRIM(MID(SUBSTITUTE(SUBSTITUTE(A1,"."," ")," ",REPT(" ",99)),(ROW($1:$100)-1)*99+1,99)),""))
Being an array formula it needs to be confirmed with Ctrl-Shift-Enter instead of Enter when exiting edit mode:

Inserting Line Break to the nearest space after a set of character length is set

I have come across this code(not mine), what it actually does is insert a Line break after a character length has been determined.
Public Function LFNearSpace(InputStr As String, CharCnt As Long)
Dim SplitStrArr() As Variant
Dim SplitCnt As Long
Dim c As Long
Dim i As Long
Dim Lcnt As Long
Dim Rcnt As Long
Dim OutputStr As String
'Split string into Array
ReDim SplitStrArr(Len(InputStr) - 1)
For i = 1 To Len(InputStr)
SplitStrArr(i - 1) = Mid$(InputStr, i, 1)
Next
SplitCnt = 0
For c = LBound(SplitStrArr) To UBound(SplitStrArr)
SplitCnt = SplitCnt + 1
If SplitCnt = CharCnt Then
'get count to space nearest to the left and right of word
For i = c To LBound(SplitStrArr) Step -1
If SplitStrArr(i) = " " Then
Lcnt = i
Exit For
End If
Next i
For i = c To UBound(SplitStrArr)
If SplitStrArr(i) = " " Then
Rcnt = i
Exit For
End If
Next i
'add line feed to nearest space
If (Rcnt - c) < (c - Lcnt) Then
SplitStrArr(Lcnt) = Chr(10)
SplitCnt = c - Lcnt
ElseIf (Rcnt - c) = (c - Lcnt) Then
SplitStrArr(Rcnt) = Chr(10)
SplitCnt = c - Rcnt
End If
End If
Next c
'Finalize the output into a single string
LFNearSpace = Join(SplitStrArr, "")
End Function
So here's my condition:
Column Width: 75
Font Name: Arial
Font Size: 9
I am customizing it for a while to fit my conditions,as far as I can think of
Unfortunately, the function cuts(inserts line break) the word not in natural way for example:
I call it like this, well if I change the 105 value the output changes but I wanted to create a solution why the output is similar to the image below.
SomeStr = LFNearSpace(SomeStr, 105)
Worksheets("Sheet1").Range("A1").Value = SomeStr
Any thoughts? Thanks
Try this
With Columns(1)
.ColumnWidth = 75
.Font.Name = "Arial"
.Font.Size = 9
.WrapText = True
End With
below code will break string to two line on occurrence of space after 20 character.
dim inputstr as string = "This is my test input string. I hope it helps!"
dim breakafter as integer= 20
dim line1 as string,line2 as string
dim found as integer=InStr(breakafter, inputstr, " ", vbTextCompare) ' KNOW WHERE IS 1st space after 20 char(s)
line1= Left(inputstr,found ) ' get 1st part of text
line2 = Replace(inputstr, " ", environment.newline() , found, 1, vbTextCompare) ' get remaining text
msgbox line1 + iif(isnothing(line2),"",line2)

Adding tables in word with vba

I have been experimenting with adding tables in word from database.
So far I made a table in word document as a template
Then what I do is that I copy it into a new word document, search for DT and Dokumenttype and then replace it with the value I want. This is properly slow(however it seems to go extremly fast)and it would propberly be better to create it directly in word.
After creating the table i start adding rows to it, where the first column is to be hyperlinked. This is what seems to take time, it is only 235 rows total split on 11 tables but it takes almost a minute to create the 11 tables. So my question is how does you folks normally creates tables?
Do you create the header of the table, then keep adding rows?
Do you double loop, to find number of rows needed then create the whole table at one go?
Do you copy an array into the table to fill the rows? Then reloop to hyperlink the first column?
Output looks like this:
Below is my current code:
Option Explicit
Public Sub createDocOut(projectNumber As String, Optional reloadDatabase As Boolean = False)
Dim docOutArray() As String
Dim previousDokType As String
Dim doc_ As Document
Dim i As Long
Dim j As Integer
Dim k As Integer
Dim m As Integer
Dim sPercentage As Double
Dim numOfRows As Long
'Application.ScreenUpdating = False
docOutArray = Post.helpRequest("http://proto-ls/wordin.asp?Dok4=" & projectNumber)
If CPearson.IsArrayEmpty(docOutArray) Then
MsgBox "No document registered in database!"
End If
numOfRows = UBound(docOutArray, 1)
' creates a new document if needed otherwise it opens it
Set doc_ = NEwDocOut.createDocOutDocument(projectNumber)
If CustomProperties.getValueFromProperty(doc_, "_DocumentCount") = numOfRows And reloadDatabase = False Then
Application.ScreenUpdating = True
Exit Sub
Else
Selection.WholeStory
Selection.Delete
End If
'We add number of rows to document
Call CustomProperties.createCustomDocumentProperty(doc_, "_DocumentCount", numOfRows)
j = 0
previousDokType = ""
For i = LBound(docOutArray, 1) To numOfRows
'new table
If docOutArray(i, 1) <> previousDokType Then
If j > 0 Then
doc_.Tables(j).Select
Selection.Collapse WdCollapseDirection.wdCollapseEnd
Selection.MoveDown Unit:=wdLine, Count:=1
End If
j = j + 1
m = 2
Call NEwDocOut.addTable(doc_, docOutArray(i, 1), docOutArray(i, 2))
End If
'new row
With doc_.Tables(j)
.Rows(m).Select
Selection.InsertRowsBelow (1)
m = m + 1
' Hyper link the file
ActiveDocument.Hyperlinks.Add Anchor:=.Cell(m, 1).Range, _
Address:="z:\Prosjekt\" & projectNumber & docOutArray(i, 3), ScreenTip:="HyperLink To document", _
TextToDisplay:=FileHandling.GetFilenameFromPath(docOutArray(i, 3))
'loop through cells
For k = 3 To UBound(docOutArray, 2)
' .Cell(m, k - 2).Range.Font.Bold = False
' .Cell(m, k - 2).Range.Font.name = "Times New Roman"
' .Cell(m, k - 2).Range.Font.Size = 10
If k > 3 And k <> 8 Then
.Cell(m, k - 2).Range.Text = docOutArray(i, k)
End If
If k = 8 Then
.Cell(m, k - 2).Range.Text = Format(replace(docOutArray(i, k), ".", "/"), "mm.dd.yyyy")
End If
Next k
End With
previousDokType = docOutArray(i, 1)
Next i
'Application.ScreenUpdating = True
End Sub
'**********************************************************************
' ****** CREATE NEW DOCUMENT OUT **************************************
'**********************************************************************
Function createDocOutDocument(prosjektnumber As String) As Document
Dim dirName As String
Dim docOutname As String
Set createDocOutDocument = Nothing
' Hvis directory \Dokumentstyring\PFK ikke eksisterer, lag dette
dirName = "z:\Prosjekt\" & prosjektnumber
'change permision if needed
If Dir(dirName, vbDirectory) = "" And Not Settings.debugMy Then
MkDir dirName
End If
'filename of docOut
docOutname = dirName & "\" & prosjektnumber & "-Dokut.docx"
If FileHandling.doesFileExist(docOutname) Then
If FileHandling.openDocument(docOutname, True, True) Then
Set createDocOutDocument = ActiveDocument
Exit Function
End If
End If
'
' Add the tamplate for DocOut and save it to Doclist
'
Set createDocOutDocument = Documents.Add(Template:="Z:\Dokumentstyring\Config\DocOut.dotm", NewTemplate:=False)
createDocOutDocument.SaveAs filename:=docOutname
'Final check if document was created
If Not FileHandling.doesFileExist(docOutname) Then
Set createDocOutDocument = Nothing
End If
End Function
Function addTable(doc_ As Document, category As String, description As String)
doc_.Activate
'Insert out table
Selection.InsertFile filename:="Z:\Dokumentstyring\Config\Doklistut.docx", Range:="", _
ConfirmConversions:=False, link:=False, Attachment:=False
'Replace the DT with the category
If Not searchAll(doc_, "DT", category) Then
MsgBox "Failed to replace category in table"
End If
'Replace the Dokumenttype with the category
If Not searchAll(doc_, "Dokumenttype", description) Then
MsgBox "Failed to replace document type in table"
End If
End Function

Highlight (not delete) repeat sentences or phrases

I am getting the impression that this is not possible in word but I figure if you are looking for any 3-4 words that come in the same sequence anywhere in a very long paper I could find duplicates of the same phrases.
I copy and pasted a lot of documentation from past papers and was hoping to find a simple way to find any repeated information in this 40+ page document there is a lot of different formatting but I would be willing to temporarily get rid of formatting in order to find repeated information.
To highlight all duplicate sentences, you can also use ActiveDocument.Sentences(i). Here is an example
LOGIC
1) Get all the sentences from the word document in an array
2) Sort the array
3) Extract Duplicates
4) Highlight duplicates
CODE
Option Explicit
Sub Sample()
Dim MyArray() As String
Dim n As Long, i As Long
Dim Col As New Collection
Dim itm
n = 0
'~~> Get all the sentences from the word document in an array
For i = 1 To ActiveDocument.Sentences.Count
n = n + 1
ReDim Preserve MyArray(n)
MyArray(n) = Trim(ActiveDocument.Sentences(i).Text)
Next
'~~> Sort the array
SortArray MyArray, 0, UBound(MyArray)
'~~> Extract Duplicates
For i = 1 To UBound(MyArray)
If i = UBound(MyArray) Then Exit For
If InStr(1, MyArray(i + 1), MyArray(i), vbTextCompare) Then
On Error Resume Next
Col.Add MyArray(i), """" & MyArray(i) & """"
On Error GoTo 0
End If
Next i
'~~> Highlight duplicates
For Each itm In Col
Selection.Find.ClearFormatting
Selection.HomeKey wdStory, wdMove
Selection.Find.Execute itm
Do Until Selection.Find.Found = False
Selection.Range.HighlightColorIndex = wdPink
Selection.Find.Execute
Loop
Next
End Sub
'~~> Sort the array
Public Sub SortArray(vArray As Variant, i As Long, j As Long)
Dim tmp As Variant, tmpSwap As Variant
Dim ii As Long, jj As Long
ii = i: jj = j: tmp = vArray((i + j) \ 2)
While (ii <= jj)
While (vArray(ii) < tmp And ii < j)
ii = ii + 1
Wend
While (tmp < vArray(jj) And jj > i)
jj = jj - 1
Wend
If (ii <= jj) Then
tmpSwap = vArray(ii)
vArray(ii) = vArray(jj): vArray(jj) = tmpSwap
ii = ii + 1: jj = jj - 1
End If
Wend
If (i < jj) Then SortArray vArray, i, jj
If (ii < j) Then SortArray vArray, ii, j
End Sub
SNAPSHOTS
BEFORE
AFTER
I did not use my own DAWG suggestion, and I am still interested in seeing if someone else has a way to do this, but I was able to come up with this:
Option Explicit
Sub test()
Dim ABC As Scripting.Dictionary
Dim v As Range
Dim n As Integer
n = 5
Set ABC = FindRepeatingWordChains(n, ActiveDocument)
' This is a dictionary of word ranges (not the same as an Excel range) that contains the listing of each word chain/phrase of length n (5 from the above example).
' Loop through this collection to make your selections/highlights/whatever you want to do.
If Not ABC Is Nothing Then
For Each v In ABC
v.Font.Color = wdColorRed
Next v
End If
End Sub
' This is where the real code begins.
Function FindRepeatingWordChains(ChainLenth As Integer, DocToCheck As Document) As Scripting.Dictionary
Dim DictWords As New Scripting.Dictionary, DictMatches As New Scripting.Dictionary
Dim sChain As String
Dim CurWord As Range
Dim MatchCount As Integer
Dim i As Integer
MatchCount = 0
For Each CurWord In DocToCheck.Words
' Make sure there are enough remaining words in our document to handle a chain of the length specified.
If Not CurWord.Next(wdWord, ChainLenth - 1) Is Nothing Then
' Check for non-printing characters in the first/last word of the chain.
' This code will read a vbCr, etc. as a word, which is probably not desired.
' However, this check does not exclude these 'words' inside the chain, but it can be modified.
If CurWord <> vbCr And CurWord <> vbNewLine And CurWord <> vbCrLf And CurWord <> vbLf And CurWord <> vbTab And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbCr And CurWord.Next(wdWord, ChainLenth - 1) <> vbNewLine And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbCrLf And CurWord.Next(wdWord, ChainLenth - 1) <> vbLf And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbTab Then
sChain = CurWord
For i = 1 To ChainLenth - 1
' Add each word from the current word through the next ChainLength # of words to a temporary string.
sChain = sChain & " " & CurWord.Next(wdWord, i)
Next i
' If we already have our temporary string stored in the dictionary, then we have a match, assign the word range to the returned dictionary.
' If not, then add it to the dictionary and increment our index.
If DictWords.Exists(sChain) Then
MatchCount = MatchCount + 1
DictMatches.Add DocToCheck.Range(CurWord.Start, CurWord.Next(wdWord, ChainLenth - 1).End), MatchCount
Else
DictWords.Add sChain, sChain
End If
End If
End If
Next CurWord
' If we found any matching results, then return that list, otherwise return nothing (to be caught by the calling function).
If DictMatches.Count > 0 Then
Set FindRepeatingWordChains = DictMatches
Else
Set FindRepeatingWordChains = Nothing
End If
End Function
I have tested this on a 258 page document (TheStory.txt) from this source, and it ran in just a few minutes.
See the test() sub for usage.
You will need to reference the Microsoft Scripting Runtime to use the Scripting.Dictionary objects. If that is undesirable, small modifications can be made to use Collections instead, but I prefer the Dictionary as it has the useful .Exists() method.
I chose a rather lame theory, but it seems to work (at least if I got the question right cuz sometimes I'm a slow understander).
I load the entire text into a string, load the individual words into an array, loop through the array and concatenate the string, containing each time three consecutive words.
Because the results are already included in 3 word groups, 4 word groups or more will automatically be recognized.
Option Explicit
Sub Find_Duplicates()
On Error GoTo errHandler
Dim pSingleLine As Paragraph
Dim sLine As String
Dim sFull_Text As String
Dim vArray_Full_Text As Variant
Dim sSearch_3 As String
Dim lSize_Array As Long
Dim lCnt As Long
Dim lCnt_Occurence As Long
'Create a string from the entire text
For Each pSingleLine In ActiveDocument.Paragraphs
sLine = pSingleLine.Range.Text
sFull_Text = sFull_Text & sLine
Next pSingleLine
'Load the text into an array
vArray_Full_Text = sFull_Text
vArray_Full_Text = Split(sFull_Text, " ")
lSize_Array = UBound(vArray_Full_Text)
For lCnt = 1 To lSize_Array - 1
lCnt_Occurence = 0
sSearch_3 = Trim(fRemove_Punctuation(vArray_Full_Text(lCnt - 1) & _
" " & vArray_Full_Text(lCnt) & _
" " & vArray_Full_Text(lCnt + 1)))
With Selection.Find
.Text = sSearch_3
.Forward = True
.Replacement.Text = ""
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
Do While .Execute
lCnt_Occurence = lCnt_Occurence + 1
If lCnt_Occurence > 1 Then
Selection.Range.Font.Color = vbRed
End If
Selection.MoveRight
Loop
End With
Application.StatusBar = lCnt & "/" & lSize_Array
Next lCnt
errHandler:
Stop
End Sub
Public Function fRemove_Punctuation(sString As String) As String
Dim vArray(0 To 8) As String
Dim lCnt As Long
vArray(0) = "."
vArray(1) = ","
vArray(2) = ","
vArray(3) = "?"
vArray(4) = "!"
vArray(5) = ";"
vArray(6) = ":"
vArray(7) = "("
vArray(8) = ")"
For lCnt = 0 To UBound(vArray)
If Left(sString, 1) = vArray(lCnt) Then
sString = Right(sString, Len(sString) - 1)
ElseIf Right(sString, 1) = vArray(lCnt) Then
sString = Left(sString, Len(sString) - 1)
End If
Next lCnt
fRemove_Punctuation = sString
End Function
The code assumes a continuous text without bullet points.