Excel VBA InStr very slow and hanging - vba

I am using the below VBA code to compare 60K records (only a particular string) from 'Sheet1' with 7K records from 'Sheet2'. It's taking a very long time to complete and is unresponsive at times. Is there anyway to improve the performance of this subroutine?
Sub txtext()
Dim fnl As String, stl As String, env As String, typ As String
Dim ctm As String
Dim stdate As Date, enddate As Date
lr = Sheets("cid_match").Range("A" & Rows.Count).End(xlUp).Row
cr = Sheets("scme").Range("C" & Rows.Count).End(xlUp).Row
stdate = Now
m = 2
For Each e In Sheets("cid_match").Range("BI2:BI" & lr).Cells
stl = Worksheets("cid_match").Range("BI" & m).Cells.Value
typ = Worksheets("cid_match").Range("BK" & m).Cells.Value
s = 2
For Each r In Sheets("scme").Range("C2:C" & cr).Cells
ctm = Worksheets("scme").Range("B" & s).Cells.Value
fnl = r.Value
If InStr(fnl, stl) > 0 And ctm = typ Then
Worksheets("cid_match").Range("BJ" & m).Value = fnl
GoTo sss
End If
s = s + 1
Next r
sss:
m = m + 1
Next e
enddate = Now
MsgBox "Succesfully Completed!!! Started at " & stdate & " Ended at " & enddate
End Sub

In order to optimize searching for a record you want to firstly sort your data using either quicksort or bubblesort. Then you will be able to search through using binary search. This will significantly reduce your wait times. Lucky you, these functions have already been written in VBA by other developers.
Bubblesort
Sub BubbleSort(list())
' Sorts an array using bubble sort algorithm
Dim First As Integer, Last As Long
Dim i As Long, j As Long
Dim Temp As Long
First = LBound(list)
Last = UBound(list)
For i = First To Last - 1
For j = i + 1 To Last
If list(i) > list(j) Then
Temp = list(j)
list(j) = list(i)
list(i) = Temp
End If
Next j
Next i
End Sub
Binary Search
Function arrayFind(theArray() As Integer, target As Integer) As Boolean
Dim low As Integer
low = 0
Dim high As Integer
high = UBound(theArray)
Dim i As Integer
Dim result As Boolean
Do While low <= high
i = (low + high) / 2
If target = theArray(i) Then
arrayFind = True
Exit Do
ElseIf target < theArray(i) Then
high = (i - 1)
Else
low = (i + 1)
End If
Loop
If Not arrayFind Then
arrayFind = False
End If
End Function

Try the below: -
Sub Sample()
Dim AryLookup() As String
Dim DteStart As Date
Dim LngCounter As Long
Dim LngRow As Long
Dim WkSht As Worksheet
DteStart = Now
'Look at the lookup worksheet
Set WkSht = ThisWorkbook.Worksheets("scme")
'Make an array the same size as the dataset
ReDim AryLookup(WkSht.Range("C" & Rows.Count).End(xlUp).Row)
'Copy the dataset in
For LngRow = 2 To UBound(AryLookup, 1)
AryLookup(LngRow - 2) = WkSht.Range("C" & LngRow)
DoEvents
Next
Set WkSht = Nothing
'Look at the source worksheet
Set WkSht = ThisWorkbook.Worksheets("cid_match")
'Work from the bottom up so not to be falsly stopped by an empty row
'Step -1 means go backwards by one with each itteration of the loop
For LngRow = WkSht.Range("A" & Rows.Count).End(xlUp).Row To 2 Step -1
'Look through the loop for a match
For LngCounter = 0 To UBound(AryLookup, 1)
'If it matches then add it to the column and exit the loop
If InStr(1, WkSht.Range("BI" & LngRow), AryLookup(LngCounter)) > 0 Then
WkSht.Range("BJ" & LngRow).Value = AryLookup(LngCounter)
Exit For
End If
DoEvents
Next
DoEvents
Next
Set WkSht = Nothing
MsgBox "Succesfully Completed!!! " & vbNewLine & "Started at " & DteStart & vbNewLine & "Ended at " & Now
End Sub
Looking over your code was quite confusing to follow what was happening, in some cases I couldn't understand or see the benefit in what was happening in the code, so the above may not be correct. You should always write your code with the assumption that another developer will need to pick it up and support it one day, it'll help you in the future if you have to return to it years later and want to remember what is happening.

Related

alternative for nested for loops for faster run time

I have a nested for loop that first runs through 10-15k rows, compares a cell in that row to another table that is 40k+ rows, if it finds a match, it returns that match, otherwise "no record" is written in a cell. the code works fine, just investigating an alternative approach to make it run faster. currently, 13000 lines takes about 50 min to an hour to run. I've looked into arrays, but loading an array with 40k+ items seems like the wrong route to take. the report is often run bits at a time, so when it is first created it may have 2k rows, then 3k rows may be added to it later, the code below will skip over rows it has already checked and pick up where it left off. any help is appreciated
For i = 2 To lastRow
If Cells(i, 83).Value <> "" Then GoTo NextIteration:
Sheets("mft Rpt").Cells(i, 83) = "No Record"
model = Sheets("MFT RPT").Cells(i, 11).Value
trimModel = Replace(Replace(model, " ", ""), "-", "")
For j = 1 To lastCollateralRow
If trimModel = Sheets("Promosheet Table").Cells(j, 1).Value Then
Sheets("MFT RPT").Cells(i, 83) = Sheets("promosheet Table").Cells(j, 3).Value
End If
Next j
NextIteration:
Next i
This ia just a proof of concept, you will need to adjust variables and ranges to suit your needs.
Sub ProofOfConcept()
Dim rngList As Range
Dim rngMatch As Range
Dim arrList As Variant
Dim arrMatch As Variant
Set rngList = Range("A1:A50000")
arrList = Application.Transpose(rngList.Value)
Set rngMatch = Range("C1:D15000")
arrMatch = Application.Transpose(rngMatch.Value)
For a = 1 To 15000
For b = 1 To 50000
If arrMatch(1, a) = arrList(b) Then
arrMatch(2, a) = "Match found"
GoTo skip
End If
Next
skip:
Next
rngMatch = WorksheetFunction.Transpose(arrMatch)
End Sub
thanks #Michal
played with it a bit. I trimmed down the run time from almost an hour to about 7 or 8 min using this code. works beautifully!!
Dim promoList As Range
Dim rngMatch As Range
Dim arrList As Variant
Dim arrMatch As Variant
Dim z
Set promoList = Sheets("promosheet table").Range("A1:A" & lastcollateralRow)
arrList = Application.Transpose(promoList.Value)
Set rngMatch = Sheets("Mft rpt").Range("K2:K" & lastRow)
arrMatch = Application.Transpose(rngMatch.Value)
For z = LBound(arrMatch) To UBound(arrMatch)
arrMatch(z) = Replace(Replace(arrMatch(z), " ", ""), "-", "")
Next
For A = 1 To lastRow
If Cells(A + 1, 83).Value <> "" Then GoTo skip:
Sheets("mft rpt").Cells(A + 1, 83) = "No Record"
For b = 1 To lastcollateralRow + 1
If arrMatch(A) = promoList(b) Then
Sheets("mft rpt").Cells(A + 1, 83) = promoList(b, 3)
GoTo skip
End If
Next
skip:
Next

VBA - How to streamline the conversion from one time interval to another?

So I have 2 datasets, one with a time step of 1 minute, the other with a time step of fifteen minutes. I want to convert the 15min data set to the 1min data set by copying the values so that if 4:15 = x1 in the first set, 4:01 to 4:15 = X1 in the second data set.
I wrote the following code, which works fine. However, once past the first values the program slows down to a snail's pace, and takes multiple hours to complete the run.
So my question is: how can I improve the code I wrote so as to achieve completion quicker?
Example data:
Sub AddData()
Dim Query As Worksheet
Dim Time As String
Dim Time2 As String
Dim QueryRangeStart As Range
Dim QueryRangeFinish As Range
Dim QueryRange As Range
Dim Position As Range
Dim Position2 As Range
Set Query = Worksheets("Sheet1") 'Not needed in example, I conserved it so as not to have to rewrite my code
For Each Position In Range(Range("start"), Range("start").End(xlDown)).Cells 'Start is the first cell containing a timestamp in the 15min dataset
Time = Position.Text
Time2 = Position.Offset(1, 0).Text
Set QueryRangeStart = Nothing
Set QueryRangeFinish = Nothing
For Each Position2 In Range(Query.Range("A2"), Query.Range("A2").End(xlDown)).Cells
If InStr(1, Time, Position2.Text, 1) = 1 And QueryRangeStart Is Nothing Then
Set QueryRangeStart = Range(Position2.Address(rowabsolute:=False, columnabsolute:=False, external:=True))
End If
If InStr(1, Time2, Position2.Text, 1) = 1 And QueryRangeFinish Is Nothing Then
Test = False
Set QueryRangeFinish = Range(Position2.Offset(-1, 0).Address(rowabsolute:=False, columnabsolute:=False, external:=True))
Exit For
End If
Next
For Each Position2 In Range(QueryRangeStart, QueryRangeFinish).Cells
Position2.End(xlToRight).Offset(0, 1).Value = Position.Offset(0, 1).Value
Next
Next
End Sub
I had trouble following your code since I wasn't sure which columns your data was in, so I built a new procedure:
Sub split15to1()
Const colIn = 8 'column# where the input 15-mintue intervals are located (data next to it)
Const colOut = 11 'column# where the output 1-minute intervals should go (data next to it)
Const rwStart = 3 'row# where "everything" begins
Dim rwIn As Long, rwOut As Long, x As Long
rwIn = rwStart: rwOut = rwStart 'set start rows
Do
Debug.Print "Splitting: " & Cells(rwIn, colIn) & " : ";
For x = 0 To 14
Debug.Print x & ",";
Cells(rwOut, colOut) = Cells(rwIn, colIn) + TimeSerial(0, x, 0) 'add [x]mins to input time
Cells(rwOut, colOut + 1) = Cells(rwIn, colIn + 1) 'copy value from col next to input time
rwOut = rwOut + 1 'next output row
Next x
Debug.Print "Done."
rwIn = rwIn + 1 'next input row
Loop Until Cells(rwIn, colIn) = "" 'is there data on the next row?
MsgBox "Finished!"
End Sub
Example output:
Alternative
Slightly modified: No output time; place value next to existing data
Option Explicit
Sub split15to1()
Const colIn = 8 'column# where the input 15-mintue intervals are located (data next to it)
Const colOut = 4 'column# where the output 1-minute intervals should go (data next to it)
Const rwStart = 3 'row# where "everything" begins
Dim rwIn As Long, rwOut As Long, x As Long
rwIn = rwStart: rwOut = rwStart 'set start rows
Do
Debug.Print "Splitting: " & Cells(rwIn, colIn) & " : ";
For x = 0 To 14
Debug.Print x & ",";
Cells(rwOut, colOut) = Cells(rwIn, colIn + 1) 'put input data at output location
rwOut = rwOut + 1 'next output row
Next x
Debug.Print "Done."
rwIn = rwIn + 1 'next input row
Loop Until Cells(rwIn, colIn) = "" 'is there data on the next row?
MsgBox "Finished!"
End Sub
In case you're not aware the Debug.Print comments are for troubleshooting and print in the Immediate Window which you can open with CTRL+G from VBA.
If you're not using them, those lines can be deleted or ignored.

Split rows that have multiline text and single line text

I'm trying to figure out how to split rows of data where columns B,C,D in the row contain multiple lines and others do not. I've figured out how to split the multi-line cells if I copy just those columns into a new sheet, manually insert rows, and then run the macro below (that's just for column A), but I'm lost at coding the rest.
Here's what the data looks like:
So for row 2, I need it split into 6 rows (one for each line in cell B2) with the text in cell A2 in A2:A8. I also need columns C and D split the same as B, and then columns E:CP the same as column A.
Here is the code I have for splitting the cells in columns B,C,D:
Dim iPtr As Integer
Dim iBreak As Integer
Dim myVar As Integer
Dim strTemp As String
Dim iRow As Integer
iRow = 0
For iPtr = 1 To Cells(Rows.Count, col).End(xlUp).Row
strTemp = Cells(iPtr1, 1)
iBreak = InStr(strTemp, vbLf)
Range("C1").Value = iBreak
Do Until iBreak = 0
If Len(Trim(Left(strTemp, iBreak - 1))) > 0 Then
iRow = iRow + 1
Cells(iRow, 2) = Left(strTemp, iBreak - 1)
End If
strTemp = Mid(strTemp, iBreak + 1)
iBreak = InStr(strTemp, vbLf)
Loop
If Len(Trim(strTemp)) > 0 Then
iRow = iRow + 1
Cells(iRow, 2) = strTemp
End If
Next iPtr
End Sub
Here is a link to an example file (note this file has 4 rows, the actual sheet has over 600): https://www.dropbox.com/s/46j9ks9q43gwzo4/Example%20Data.xlsx?dl=0
This is a fairly interesting question and something I have seen variations of before. I went ahead and wrote up a general solution for it since it seems like a useful bit of code to keep for myself.
There are pretty much only two assumptions I make about the data:
Returns are represented by Chr(10) or which is the vbLf constant.
Data that belongs with a lower row has enough returns in it to make it line up. This appears to be your case since there are return characters which appear to make things line up like you want.
Pictures of the output, zoomed out to show all the data for A:D. Note that the code below processes all of the columns by default and outputs to a new sheet. You can limit the columns if you want, but it was too tempting to make it general.
Code
Sub SplitByRowsAndFillBlanks()
'process the whole sheet, could be
'Intersect(Range("B:D"), ActiveSheet.UsedRange)
'if you just want those columns
Dim rng_all_data As Range
Set rng_all_data = Range("A1").CurrentRegion
Dim int_row As Integer
int_row = 0
'create new sheet for output
Dim sht_out As Worksheet
Set sht_out = Worksheets.Add
Dim rng_row As Range
For Each rng_row In rng_all_data.Rows
Dim int_col As Integer
int_col = 0
Dim int_max_splits As Integer
int_max_splits = 0
Dim rng_col As Range
For Each rng_col In rng_row.Columns
'splits for current column
Dim col_parts As Variant
col_parts = Split(rng_col, vbLf)
'check if new max row count
If UBound(col_parts) > int_max_splits Then
int_max_splits = UBound(col_parts)
End If
'fill the data into the new sheet, tranpose row array to columns
sht_out.Range("A1").Offset(int_row, int_col).Resize(UBound(col_parts) + 1) = Application.Transpose(col_parts)
int_col = int_col + 1
Next
'max sure new rows added for total length
int_row = int_row + int_max_splits + 1
Next
'go through all blank cells and fill with value from above
Dim rng_blank As Range
For Each rng_blank In sht_out.Cells.SpecialCells(xlCellTypeBlanks)
rng_blank = rng_blank.End(xlUp)
Next
End Sub
How it works
There are comments within the code to highlight what is going on. Here is a high level overview:
Overall, we iterate through each row of the data, processing all of the columns individually.
The text of the current cell is Split using the vbLf. This gives an array of all the individual lines.
A counter is tracking the maximum number of rows that were added (really this is rows-1 since these arrays are 0-indexed.
Now the data can be output to the new sheet. This is easy because we can just dump the array that Split created for us. The only tricky part is getting it to the right spot on the sheet. To that end, there is a counter for the current column offset and a global counter to determine how many total rows need to be offset. The Offset moves us to the right cell; the Resize ensures that all of the rows are output. Finally, Application.Transpose is needed because Split returns a row array and we're dumping a column.
Update the counters. Column offset is incremented every time. The row offset is updated to add enough rows to cover the last maximum (+1 since this is 0-indexed)
Finally, I get to use my waterfall fill (your previous question) on all of the blanks cells that were created to ensure no blanks. I forgo error checking because I assume blanks exist.
Thank you for providing a sample. This task was so interesting that I thought of writing the code for that. You are more than welcome to tweak it to your satisfaction, and I hope your team gets to use an RDBMS to manage this kind of data in the future.
Sub OrganizeSheet()
Dim LastRow As Integer
LastRow = GetLastRow()
Dim Barray() As String
Dim Carray() As String
Dim Darray() As String
Dim LongestArray As Integer
Dim TempInt As Integer
Dim i As Integer
i = 1
Do While i <= LastRow
Barray = Split(Range("B" & i), Chr(10))
Carray = Split(Range("C" & i), Chr(10))
Darray = Split(Range("D" & i), Chr(10))
LongestArray = GetLongestArray(Barray, Carray, Darray)
If LongestArray > 0 Then
' reset the values of B, C and D columns
On Error Resume Next
Range("B" & i).Value = Barray(0)
Range("C" & i).Value = Carray(0)
Range("D" & i).Value = Darray(0)
Err.Clear
On Error GoTo 0
' duplicate the row multiple times
For TempInt = 1 To LongestArray
Rows(i & ":" & i).Select
Selection.Copy
Range(i + TempInt & ":" & i + TempInt).Select
Selection.Insert Shift:=xlDown
' as each row is copied, change the values of B, C and D columns
On Error Resume Next
Range("B" & i + TempInt).Value = Barray(TempInt)
If Err.Number > 0 Then Range("B" & i + TempInt).Value = ""
Err.Clear
Range("C" & i + TempInt).Value = Carray(TempInt)
If Err.Number > 0 Then Range("C" & i + TempInt).Value = ""
Err.Clear
Range("D" & i + TempInt).Value = Darray(TempInt)
If Err.Number > 0 Then Range("D" & i + TempInt).Value = ""
Err.Clear
On Error GoTo 0
Application.CutCopyMode = False
Next TempInt
' increment the outer FOR loop's counters
LastRow = LastRow + LongestArray
i = i + LongestArray
End If
i = i + 1
Loop
End Sub
' ----------------------------------
Function GetLongestArray(ByRef Barray() As String, ByRef Carray() As String, ByRef Darray() As String)
GetLongestArray = UBound(Barray)
If UBound(Carray) > GetLongestArray Then GetLongestArray = UBound(Carray)
If UBound(Darray) > GetLongestArray Then GetLongestArray = UBound(Darray)
End Function
' ----------------------------------
Function GetLastRow() As Integer
Worksheets(1).Select
Range("A1").Select
Selection.End(xlDown).Select
GetLastRow = Selection.Row
Range("A1").Select
End Function
Give it a shot!

How to Import CSV to Excel with VBA

I Have a Red Lion Data station that records temperatures from about 25 ovens and the data is stored on a network in CSV files. I would like to import this data into an Excel file once every two minutes, but only import the changes after the initial import. After the import, the VBA must release the CSV file for the 2 minutes so that it can be updated by the data station. I have searched the web, this site and the closest thing I have found is the following code. This code looks for changes but it does not import the file. any help would be appreciated.
Dim NextTime As Date
Function LastModTime(FileSpec As String) As Date
'Returns the date-time the file specified by FileSpec (path string) was last modified
Dim fs, f, s
Set fs = CreateObject("Scripting.FileSystemObject")
Set f = fs.GetFile(FileSpec)
LastModTime = f.DateLastModified
End Function
Sub Check4Changes()
'Checks the file FilePath for changes every 60 seconds
'If file has changed, pops up a message box. Stores the
'last modified time in cell M1 of Sheet1
ChDir "Q:\Manufacturing\Equipment\DispatchLogs\logs\7-DES"
Const FilePath As String = "Q:\Manufacturing\Equipment\DispatchLogs\logs\7-DES\*.csv"
Dim LastMod As Date
On Error GoTo ReSchedule
LastMod = LastModTime(FilePath)
With Worksheets("Sheet1").Range("C1")
If IsEmpty(.Value) Then
.Value = LastMod
GoTo ReSchedule
ElseIf .Value < LastMod Then
.Value = LastMod
MsgBox FilePath & " updated.", vbInformation, "Check4Changes"
End If
End With
ReSchedule:
'Reschedule this same routine to run in one minute.
NextTime = Now + 2 / 1440
Application.StatusBar = "Next check at " & NextTime
Application.OnTime NextTime, "Check4Changes"
End Sub
Sub CancelChecking()
Application.OnTime NextTime, "Check4Changes", Schedule:=False
Application.StatusBar = False
End Sub
I have written some code which will give you the basic idea
Sub Main()
Dim Wbk_CSV As Excel.Workbook
Dim Excel_Wbk As Excel.Workbook
Dim Var_WholeCSVData As Variant
Dim Var_ExcelData As Variant
Dim Var_ToUpdate As Variant
Dim NumOfRows As Long
Dim Last_Row As Long
Set Wbk_CSV = Workbooks.Open("PathWithFileName")
Wbk_CSV.Sheets(1).Activate
Last_Row = ActiveSheet.Cells.SpecialCells(xlCellTypeLastCell).Row
'Taking whole data in a variant
'Change the range as per the data in csv file
Var_WholeCSVData = Wbk_CSV.Sheets(1).Range("A2:D" & Last_Row).Value 'Considering first row as header row and there are 4 columns
'After taking whole data in varinat close csv file without saving
Wbk_CSV.Close savechanges:=False
Set Wbk_CSV = Nothing
'Now open excel file in which data will be updated
Set Excel_Wbk = Workbooks.Open("PathWithFileName")
Excel_Wbk.Sheets(1).Activate
Last_Row = ActiveSheet.Cells.SpecialCells(xlCellTypeLastCell).Row
Var_ExcelData = Excel_Wbk.Sheets(1).Range("A2:D" & Last_Row).Value 'Considering first row as header row
NumOfRows = 0
'This function will return count of updated rows and data to update
Var_ToUpdate = Delete_Duplicates(Var_WholeCSVData, Var_ExcelData, NumOfRows)
Excel_Wbk.Sheets(1).Activate
'paste data
If NumOfRows > 0 Then
ActiveSheet.Range("A" & Last_Row + 1 & ":D" & Last_Row + NumOfRows).Value = Var_ToUpdate
End If
Excel_Wbk.Close savechanges:=True
Set Excel_Wbk = Nothing
'result
MsgBox ("Number of rows imported: " & NumOfRows)
End Sub
Function Delete_Duplicates(Var_FromCSV As Variant, Var_FromExcel As Variant, ByRef NumberOfRowToupdate As Long) As Variant
'using dictinpary objects
Dim dict_Duplicates As Object
Dim i_AddToDict, i, j As Long
Dim Str_value As String
Dim Var_Temp As Variant
Dim lng_temp As Long
Set dict_Duplicates = CreateObject("Scripting.Dictionary")
ReDim Var_Temp(1 To UBound(Var_FromCSV, 1), 1 To UBound(Var_FromCSV, 2))
'Add excel data to dict. by concatenating
'All unique values will be added
For i_AddToDict = 1 To UBound(Var_FromCSV)
Str_value = CStr(Var_FromExcel(i_AddToDict, 1) & Var_FromExcel(i_AddToDict, 2) & Var_FromExcel(i_AddToDict, 3) & Var_FromExcel(i_AddToDict, 4))
If dict_Duplicates.exists(Str_value) Then
'do nothing
Else
dict_Duplicates.Add Str_value, 1
End If
Next i_AddToDict
'looking for values which are not available in excel file
For i = 1 To UBound(Var_FromCSV)
Str_value = CStr(Var_FromCSV(i_AddToDict, 1) & Var_FromCSV(i_AddToDict, 2) & Var_FromCSV(i_AddToDict, 3) & Var_FromCSV(i_AddToDict, 4))
If dict_Duplicates.exists(Str_value) Then
'do nothing
Else
'storing values in a variant
For j = 1 To 4
Var_Temp(lng_temp, j) = Var_FromCSV(i, j)
Next j
lng_temp = lng_temp + 1
dict_Duplicates.Add Str_value, 1
End If
Next i
NumberOfRowToupdate = lng_temp - 1
Delete_Duplicates = Var_Temp
End Function

Highlight (not delete) repeat sentences or phrases

I am getting the impression that this is not possible in word but I figure if you are looking for any 3-4 words that come in the same sequence anywhere in a very long paper I could find duplicates of the same phrases.
I copy and pasted a lot of documentation from past papers and was hoping to find a simple way to find any repeated information in this 40+ page document there is a lot of different formatting but I would be willing to temporarily get rid of formatting in order to find repeated information.
To highlight all duplicate sentences, you can also use ActiveDocument.Sentences(i). Here is an example
LOGIC
1) Get all the sentences from the word document in an array
2) Sort the array
3) Extract Duplicates
4) Highlight duplicates
CODE
Option Explicit
Sub Sample()
Dim MyArray() As String
Dim n As Long, i As Long
Dim Col As New Collection
Dim itm
n = 0
'~~> Get all the sentences from the word document in an array
For i = 1 To ActiveDocument.Sentences.Count
n = n + 1
ReDim Preserve MyArray(n)
MyArray(n) = Trim(ActiveDocument.Sentences(i).Text)
Next
'~~> Sort the array
SortArray MyArray, 0, UBound(MyArray)
'~~> Extract Duplicates
For i = 1 To UBound(MyArray)
If i = UBound(MyArray) Then Exit For
If InStr(1, MyArray(i + 1), MyArray(i), vbTextCompare) Then
On Error Resume Next
Col.Add MyArray(i), """" & MyArray(i) & """"
On Error GoTo 0
End If
Next i
'~~> Highlight duplicates
For Each itm In Col
Selection.Find.ClearFormatting
Selection.HomeKey wdStory, wdMove
Selection.Find.Execute itm
Do Until Selection.Find.Found = False
Selection.Range.HighlightColorIndex = wdPink
Selection.Find.Execute
Loop
Next
End Sub
'~~> Sort the array
Public Sub SortArray(vArray As Variant, i As Long, j As Long)
Dim tmp As Variant, tmpSwap As Variant
Dim ii As Long, jj As Long
ii = i: jj = j: tmp = vArray((i + j) \ 2)
While (ii <= jj)
While (vArray(ii) < tmp And ii < j)
ii = ii + 1
Wend
While (tmp < vArray(jj) And jj > i)
jj = jj - 1
Wend
If (ii <= jj) Then
tmpSwap = vArray(ii)
vArray(ii) = vArray(jj): vArray(jj) = tmpSwap
ii = ii + 1: jj = jj - 1
End If
Wend
If (i < jj) Then SortArray vArray, i, jj
If (ii < j) Then SortArray vArray, ii, j
End Sub
SNAPSHOTS
BEFORE
AFTER
I did not use my own DAWG suggestion, and I am still interested in seeing if someone else has a way to do this, but I was able to come up with this:
Option Explicit
Sub test()
Dim ABC As Scripting.Dictionary
Dim v As Range
Dim n As Integer
n = 5
Set ABC = FindRepeatingWordChains(n, ActiveDocument)
' This is a dictionary of word ranges (not the same as an Excel range) that contains the listing of each word chain/phrase of length n (5 from the above example).
' Loop through this collection to make your selections/highlights/whatever you want to do.
If Not ABC Is Nothing Then
For Each v In ABC
v.Font.Color = wdColorRed
Next v
End If
End Sub
' This is where the real code begins.
Function FindRepeatingWordChains(ChainLenth As Integer, DocToCheck As Document) As Scripting.Dictionary
Dim DictWords As New Scripting.Dictionary, DictMatches As New Scripting.Dictionary
Dim sChain As String
Dim CurWord As Range
Dim MatchCount As Integer
Dim i As Integer
MatchCount = 0
For Each CurWord In DocToCheck.Words
' Make sure there are enough remaining words in our document to handle a chain of the length specified.
If Not CurWord.Next(wdWord, ChainLenth - 1) Is Nothing Then
' Check for non-printing characters in the first/last word of the chain.
' This code will read a vbCr, etc. as a word, which is probably not desired.
' However, this check does not exclude these 'words' inside the chain, but it can be modified.
If CurWord <> vbCr And CurWord <> vbNewLine And CurWord <> vbCrLf And CurWord <> vbLf And CurWord <> vbTab And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbCr And CurWord.Next(wdWord, ChainLenth - 1) <> vbNewLine And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbCrLf And CurWord.Next(wdWord, ChainLenth - 1) <> vbLf And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbTab Then
sChain = CurWord
For i = 1 To ChainLenth - 1
' Add each word from the current word through the next ChainLength # of words to a temporary string.
sChain = sChain & " " & CurWord.Next(wdWord, i)
Next i
' If we already have our temporary string stored in the dictionary, then we have a match, assign the word range to the returned dictionary.
' If not, then add it to the dictionary and increment our index.
If DictWords.Exists(sChain) Then
MatchCount = MatchCount + 1
DictMatches.Add DocToCheck.Range(CurWord.Start, CurWord.Next(wdWord, ChainLenth - 1).End), MatchCount
Else
DictWords.Add sChain, sChain
End If
End If
End If
Next CurWord
' If we found any matching results, then return that list, otherwise return nothing (to be caught by the calling function).
If DictMatches.Count > 0 Then
Set FindRepeatingWordChains = DictMatches
Else
Set FindRepeatingWordChains = Nothing
End If
End Function
I have tested this on a 258 page document (TheStory.txt) from this source, and it ran in just a few minutes.
See the test() sub for usage.
You will need to reference the Microsoft Scripting Runtime to use the Scripting.Dictionary objects. If that is undesirable, small modifications can be made to use Collections instead, but I prefer the Dictionary as it has the useful .Exists() method.
I chose a rather lame theory, but it seems to work (at least if I got the question right cuz sometimes I'm a slow understander).
I load the entire text into a string, load the individual words into an array, loop through the array and concatenate the string, containing each time three consecutive words.
Because the results are already included in 3 word groups, 4 word groups or more will automatically be recognized.
Option Explicit
Sub Find_Duplicates()
On Error GoTo errHandler
Dim pSingleLine As Paragraph
Dim sLine As String
Dim sFull_Text As String
Dim vArray_Full_Text As Variant
Dim sSearch_3 As String
Dim lSize_Array As Long
Dim lCnt As Long
Dim lCnt_Occurence As Long
'Create a string from the entire text
For Each pSingleLine In ActiveDocument.Paragraphs
sLine = pSingleLine.Range.Text
sFull_Text = sFull_Text & sLine
Next pSingleLine
'Load the text into an array
vArray_Full_Text = sFull_Text
vArray_Full_Text = Split(sFull_Text, " ")
lSize_Array = UBound(vArray_Full_Text)
For lCnt = 1 To lSize_Array - 1
lCnt_Occurence = 0
sSearch_3 = Trim(fRemove_Punctuation(vArray_Full_Text(lCnt - 1) & _
" " & vArray_Full_Text(lCnt) & _
" " & vArray_Full_Text(lCnt + 1)))
With Selection.Find
.Text = sSearch_3
.Forward = True
.Replacement.Text = ""
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
Do While .Execute
lCnt_Occurence = lCnt_Occurence + 1
If lCnt_Occurence > 1 Then
Selection.Range.Font.Color = vbRed
End If
Selection.MoveRight
Loop
End With
Application.StatusBar = lCnt & "/" & lSize_Array
Next lCnt
errHandler:
Stop
End Sub
Public Function fRemove_Punctuation(sString As String) As String
Dim vArray(0 To 8) As String
Dim lCnt As Long
vArray(0) = "."
vArray(1) = ","
vArray(2) = ","
vArray(3) = "?"
vArray(4) = "!"
vArray(5) = ";"
vArray(6) = ":"
vArray(7) = "("
vArray(8) = ")"
For lCnt = 0 To UBound(vArray)
If Left(sString, 1) = vArray(lCnt) Then
sString = Right(sString, Len(sString) - 1)
ElseIf Right(sString, 1) = vArray(lCnt) Then
sString = Left(sString, Len(sString) - 1)
End If
Next lCnt
fRemove_Punctuation = sString
End Function
The code assumes a continuous text without bullet points.