Duplicate removal for VBA Word not working effectively - vba

I have a program to remove duplicates and everything is working properly. It is just freezing with large data sets i.e. 1 to 2.5 million words.
What is wrong with my approach? Is there a better one?
Sub DeleteDuplicateParagraphs()
Dim p1 As Paragraph
Dim p2 As Paragraph
Dim DupCount As Long
Dim StartTime As Double
Dim SecondsElapsed As Double
StartTime = Timer
For Each p1 In ActiveDocument.Paragraphs
If p1.range.Text <> vbCr Then
For Each p2 In ActiveDocument.Paragraphs
If p1.range.Text = p2.range.Text Then
DupCount = DupCount + 1
If p1.range.Text = p2.range.Text And DupCount > 1 Then p2.range.Delete
End If
Next p2
End If
DupCount = 0
Next p1
SecondsElapsed = Round(Timer - StartTime, 2)
MsgBox "This code ran successfully in " & SecondsElapsed & " seconds", vbInformation
DupCount = 0
End Sub

Try this (first add a reference to the Microsoft Scripting Runtime to your VBA project):
Sub DeleteDuplicateParagraphs()
Dim p As Paragraph
Dim d As New Scripting.Dictionary
Dim t As Variant
Dim i As Integer
Dim StartTime As Single
StartTime = Timer
' collect duplicates
For Each p In ActiveDocument.Paragraphs
t = p.Range.Text
If t <> vbCr Then
If Not d.Exists(t) Then d.Add t, New Scripting.Dictionary
d(t).Add d(t).Count + 1, p
End If
Next
' eliminate duplicates
Application.ScreenUpdating = False
For Each t In d
For i = 2 To d(t).Count
d(t)(i).Range.Delete
Next
Next
Application.ScreenUpdating = True
MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub
This makes use of the fact that the Scripting.Dictionary is a hash table that is geared towards very quickly associating unique keys with values. It is therefore very good at spotting duplicate keys. Dictionary keys have to be strings, conveniently we can use the paragraph texts for that.
For values we use more dictionary objects, solely for the fact that they work a lot better than VBA's arrays. In them we collect the references to the actual paragraph instances with the same text.
Actually deleting duplicate paragraphs is a very simple matter afterwards.
Note: The duplicate detection part in the above code is very fast. However, if Word becomes unresponsive in large documents then it's in the duplicate removal part, namely because of Word's undo buffer.
The culprit is that the paragraph ranges are deleted one after another, causing Word to build a very large undo buffer. Unfortunately there is no way (that I know of) to either
delete multiple separate ranges in one step (which would result in only a single entry in the undo buffer), or
disable the undo buffer altogether from VBA
Calling UndoClear periodically in the "eliminate duplicates" loop might help, disabling ScreenUpdating is also not a bad idea:
' eliminate duplicates
Dim x As Integer
Application.ScreenUpdating = False
For Each t In d
x = x + 1
For i = 2 To d(t).Count
d(t)(i).Range.Delete
Next
If x Mod 50 = 0 Then ActiveDocument.UndoClear
Next
ActiveDocument.UndoClear
Application.ScreenUpdating = True

First of all, Just wanted to thank you so much for the time and effort you have put in to helping me.
Your idea behind the method is really impressive. I did change the code slightly and would like you to peruse it when you have the time, to see if it is of optimal standard. Again, I truly thank you, the code ran 20 splits faster than the previous and that is not even over a larger data set.
> Sub DeleteDuplicateParagraphs()
>
> Dim p As Paragraph
> Set d = CreateObject("Scripting.Dictionary")
> Dim t As Variant
> Dim i As Integer
> Dim StartTime As Single
>
> StartTime = Timer
>
> ' collect duplicates For Each p In ActiveDocument.Paragraphs
> t = p.range.Text
> If t <> vbCr Then
> If Not d.Exists(t) Then d.Add t, CreateObject("Scripting.Dictionary")
> d(t).Add d(t).Count + 1, p
> End If Next
>
> ' eliminate duplicates For Each t In d
> For i = 2 To d(t).Count
> d(t)(i).range.Delete
> Next Next
>
> MsgBox "This code ran successfully in " & Round(Timer - StartTime,
> 2) & " seconds", vbInformation
>
> End Sub

Related

Word VBA Progress Bar with Unknown Number of Steps

I have a macro that loops through an unknown number of times. The number of times varies based on a total number of rows in multiple tables in a reference document, and that number of rows will vary across reference documents that may be used. The relevant snippet of code for the loop is below:
For Each oRow In oTbl.Rows
p = p + 1
Helper.ProgressIndicator_Code (p)
strPhrase = Split(Trim(oRow.Range.Cells(1).Range.Text), vbCr)(0)
strRule = Split(Trim(oRow.Cells(2).Range.Text), vbCr)(0)
If strPhrase <> "" Then
If Not strStartWord = vbNullString Then
'Process defined sections
arrEndWords = Split(strEndWord, "|")
For lngIndex = 0 To UBound(arrEndWords)
Set oRng = GetDocRange(strStartWord, arrEndWords(lngIndex))
If Not oRng Is Nothing Then Exit For
Next lngIndex
Else
'Process whole document
Set oRng = m_oDocCurrent.Range
End If
If Not oRng Is Nothing Then
Set oRngScope = oRng.Duplicate
With oRng.Find
.Text = strPhrase
Do While .Execute
If Not oRng.InRange(oRngScope) Then Exit For
oRng.HighlightColorIndex = wdTurquoise
If strRule <> "" Then
Set oComment = m_oDocCurrent.Comments.Add(Range:=oRng, Text:=strUsr & ": " & strRule)
oComment.Author = UCase("WordCheck")
oComment.Initial = UCase("WC")
End If
Loop
End With
End If
End If
Next oRow
The progress bar is a classic progress bar for which a label field width is updated using the below code based on a value of p as updated in the above code:
Sub progress(pctCompl As Integer)
ProgressIndicator.Text.Caption = pctCompl & "% Completed"
ProgressIndicator.Bar.Width = pctCompl * 2
DoEvents
End Sub
Here's my problem: The value of p varies based on which reference document is used, so my progress bar is never even approximately accurate with respect to the processing of the VBA macro. The progress bar doesn't have to be exact, merely close and to indicate that progress is being made and nothing has hung.
I'm not looking for written code, just would be very grateful for suggestions or advice as to approaches for making my progress bar more accurate so that I can learn (e.g., I just ran the macro for three different reference documents - one gave me 25%, one gave 44%, and one gave 82%; none showed even close to 100% when completed). Essentially I need to divide i by an unknown number to get my percentage, which is clearly impossible, so some function for a close approximation is needed.
Edit: New code based on #macropod suggestion.
Dim strCheckDoc As String, docRef As Document, projectPath As String, _
j As Integer, i As Integer, k As Integer, oNumRows as Long
j = 1
For i = 0 To UBound(strUsr)
strCheckDoc = [path to reference document unique to each strUsr]
Set docRef = Documents.Open(strCheckDoc, ReadOnly:=True, Visible:=False)
For k = 1 To docRef.Tables.Count
oNumRows = oNumRows + docRef.Tables(i).Rows.Count
Next k
Next i
Then the code to update the progress bar is:
Dim pctCompl As Single
pctCompl = Round((p / oNumRows) * 100)
ProgressIndicator.Text.Caption = pctCompl & "% Completed"
ProgressIndicator.Bar.Width = pctCompl * 2
DoEvents
The progress bar now gets to 64% when complete (i.e., it should be at 100%). I'm also working on a way to make oNumRows only count a row if the row has content in the first column.

Application.Match not exact value

Have a piece of code that looks for matches between 2 sheets (sheet1 is customer list and rData is copied pdf with invoices). It usually is exact match but in some cases I'm looking for 6 first characters that matches rData
Dim rData As Variant
Dim r As Variant
Dim r20 As Variant
Dim result As Variant
Dim i As Long
rData = ActiveWorkbook.Sheets(2).Range("A1:A60000")
r20 = ActiveWorkbook.Sheets(1).Range("C2:C33")
For Each r In r20
result = Application.Match(r, rData, 0)
If Not IsError(result) Then
For i = 1 To 5
If (result - i) > 0 Then
If (Left(Trim(rData(result - i, 1)), 3) = "418") Then
MsgBox "customer: " & r & ". invoice: " & rData(result - i, 1)
End If
End If
Next
For i = 1 To 15
If (result + i) > 0 Then
If (Left(Trim(rData(result + i, 1)), 3) = "418") Then
MsgBox "customer: " & r & ". invoice: " & rData(result + i, 1)
End If
End If
Next
End If
Next r
End Sub
Only part of this that is giving me a headache is this part result = Application.Match(r, rData, 0). How do it get match for not exact match?
Sample of Sheet1
This is what more or less looks like. Matching after CustomerNumber# is easy because they are the same every invoice. BUT sometimes invoice does not have it so I'm searching after CustomerName and sometimes they have uppercase letters, sometimes there is extra stuff behind it and therefore it cannot find exact match.
Hope it makes sense.
To match the customer name from your customer list to the customer name in the invoice even if it has extra characters appended, you can use the wildcard * in Match().
You also have a typo in the Match() function. r20 should be rData.
This is your code with the fixes applied:
Sub Test()
'v4
Dim rData As Variant
Dim r As Variant
Dim r20 As Variant
Dim result As Variant
Dim i As Long
rData = ActiveWorkbook.Sheets(2).Range("A1:A60000")
r20 = ActiveWorkbook.Sheets(1).Range("C2:C33")
For Each r In r20
result = Application.Match(r & "*", rData, 0) ' <~ Fixed here
If Not IsError(result) Then
For i = 1 To 5
If (result - i) > 0 Then
If (Left(Trim(rData(result - i, 1)), 3) = "418") Then
MsgBox "customer: " & r & ". invoice: " & rData(result - i, 1)
End If
End If
Next
For i = 1 To 15
If (result + i) > 0 Then
If (Left(Trim(rData(result + i, 1)), 3) = "418") Then
MsgBox "customer: " & r & ". invoice: " & rData(result + i, 1)
End If
End If
Next
End If
Next r
End Sub
Notes:
Match() is case insensitive, so it works with different capitalisations.
The data in Sheets(2) must all be text for Match() to work correctly with wildcards.
EDIT1: New better version
EDIT2: Refactored constants and made data ranges dynamic
EDIT3: Allows for any prefix to an invoice number of a fixed length
The following is a better, rewritten version of your code:
Sub MuchBetter()
'v3
Const s_InvoiceDataWorksheet As String = "Sheet2"
Const s_InvoiceDataColumn As String = "A:A"
Const s_CustomerWorksheet As String = "Sheet1"
Const s_CustomerStartCell As String = "C2"
Const s_InvoiceNumPrefix As String = "418"
Const n_InvoiceNumLength As Long = 8
Const n_InvScanStartOffset As Long = -5
Const n_InvScanEndOffset As Long = 15
Dim ƒ As Excel.WorksheetFunction: Set ƒ = Excel.WorksheetFunction ' Shortcut
With Worksheets(s_InvoiceDataWorksheet).Range(s_InvoiceDataColumn)
With .Parent.Range(.Cells(1), .Cells(Cells.Rows.Count).End(xlUp))
Dim varInvoiceDataArray As Variant
varInvoiceDataArray = ƒ.Transpose(.Cells.Value2)
End With
End With
With Worksheets(s_CustomerWorksheet).Range(s_CustomerStartCell)
With .Parent.Range(.Cells(1), .EntireColumn.Cells(Cells.Rows.Count).End(xlUp))
Dim varCustomerArray As Variant
varCustomerArray = ƒ.Transpose(.Cells.Value2)
End With
End With
Dim varCustomer As Variant
For Each varCustomer In varCustomerArray
Dim dblCustomerIndex As Double
dblCustomerIndex = Application.Match(varCustomer & "*", varInvoiceDataArray, 0)
If Not IsError(dblCustomerIndex) _
And varCustomer <> vbNullString _
Then
Dim i As Long
For i = ƒ.Max(dblCustomerIndex + n_InvScanStartOffset, 1) _
To ƒ.Min(dblCustomerIndex + n_InvScanEndOffset, UBound(varInvoiceDataArray))
Dim strInvoiceNum As String
strInvoiceNum = Right$(Trim$(varInvoiceDataArray(i)), n_InvoiceNumLength)
If (Left$(strInvoiceNum, Len(s_InvoiceNumPrefix)) = s_InvoiceNumPrefix) Then
MsgBox "customer: " & varCustomer & ". invoice: " & strInvoiceNum
End If
Next
End If
Next varCustomer
End Sub
Notes:
It is a good idea to use constants so all literal values are typed once only and kept grouped together.
Using the RVBA naming convention greatly increases the readability of the code, and reduces the likelihood of bugs.
Using long, appropriately named variables makes the code essentially self-documenting.
Using .Value2 whenever reading cell values is highly recommended (it avoids implicit casting, making it slightly faster as well as eliminating certain issues caused by the casting ).
Surprisingly, in VBA there are good reasons to put a variable declaration as close as possible to the first use of the variable. Two such reasons are 1) it improves readability, and 2) it simplifies future refactoring. Just remember that the variable is not reinitialised every time the Dim is encountered. Initialisation only occurs the first time.
The twin loops have been rolled into one according to the DRY principle.
Whilst the check for an empty customer name/number is not strictly necessary if you can guarantee it will never be so, it is good defensive programming as an empty value will cause erroneous results.
The negative index check inside the loop has been removed and replaced with the one-time use of the Max() worksheet function in the For statement.
The Min() worksheet function is also used in the For statement to avoid trying to read past the end of the array.
Always use worksheet functions on the WorksheetFunction object unless you are explicitly checking for errors, in which case use the Application object.

Adding name creates "the object invoked has disconnected from its clients" error

Often, but not always, when I work on/develop in this workbook and save it manually under a new file name to increment the version number, I get
"the object invoked has disconnected from its clients" error in the line of code noted below the next time I run it:
Sub copy_range_names_storage_to_values(wStorage As Workbook)
Dim n As Name
For Each n In ThisWorkbook.Names
If VBA.InStr(1, n.Name, "_FilterDatabase") = 0 Then
If VBA.InStr(1, n.Name, "_xlfn") = 0 Then
If VBA.InStr(1, n.Name, "Values!") <> 0 Or VBA.InStr(1, n.RefersTo, "Values!") <> 0 Then
n.Delete
End If
End If
End If
Next n
Set n = Nothing
Dim sNewRefersTo As String
For Each n In wStorage.Names
With ThisWorkbook.Sheets("Values").Names
If VBA.InStr(1, n.Name, "_xlfn") = 0 Then
If VBA.InStr(1, n.RefersTo, "=" & wStorage.Sheets("Sheet1").Name & "!") > 0 Then
sNewRefersTo = VBA.Replace(n.RefersTo, "Sheet1", "Values")
If VBA.InStr(1, sNewRefersTo, "#REF!") = 0 Then
Dim rn As String
rn = ""
rn = VBA.Replace(n.Name, wStorage.Sheets("Sheet1").Name & "!", "")
Dim ra As String
ra = ""
ra = "=" & "Values!" & ThisWorkbook.Sheets("Values").Range(sNewRefersTo).Address
'ThisWorkbook.Sheets("Values").Activate
.Add Name:=rn, RefersTo:=ra '<--- ERROR HERE!!!
Else
n.Delete
End If
End If
End If
End With
Next n
Application.StatusBar = "copy_range_names_storage_to_values: finished"
End Sub
The purpose of the sub above is to delete all range names from Thisworkbook.Sheets("Values") and copy the range names from wStorage.Sheets("Sheet1") and add them to Thisworkbook.Sheets("Values"). In other words, I remove the names from Values and update them with the names on Sheet1. There are some exceptions where I don't want to copy the names over and I don't want to delete the names, etc. (as can be seen in the If statements).
One other thing to note is that it always occurs with a couple range names in particular, which happen to be the largest ranges in the workbook (2000 rows by 1250 columns and 28 rows by 1250 columns). The fact that it only occurs with the larger ranges makes me think it is a memory issue, but I would expect the error to occur more frequently/consistently if it were a memory issue because the memory usage of this pc is pretty consistent.
This is a mission critical workbook for a hedge fund client so reliability is key.
The sub has been build piece mill over time as things have come up. I am still on the hunt for a elegant way to transfer named ranges.
In concept, I understand what the error means: I have disconnected the range from the sheet or workbook somehow. But how do I avoid that or avoid the error and achieve what I need?

Excel VBA: Looking for Advice Avoiding an Infinite Loop

Imgur Album with screens of worksheets: http://imgur.com/a/6rFWF
Long story short, I am writing an Excel VBA utility that will assign two types of security shifts (called coverages and weekend duties) to security staff members. Basically, I have a worksheet with all of the staff members and their various availability information in it (the top image in the imgur album) and a worksheet with all of the coverage dates in it (the bottom image in the imgur album). Note that I don't have an image of the weekend duty dates as it looks similar to the coverage dates (but with the Friday and Saturday shifts).
The utility basically assigns a random staff member to each date, checking to make sure it doesn't violate any of their availability requirements. Unfortunately, I realize that I am creating a large chance for an infinite loop to occur. In my own testing, there has only been 1 attempt out of around 15-16 that did not enter an infinite loop near the end. So I'm looking for your help to account for this so the utility doesn't eat itself.
Here is the "pseudo-code" for the procedure in question.
'Loop for Column A in the Coverage Slips sheet (image 2 in imgur album)
Do Until (CoverageRowNumber = LastCoverageSlipRow + 1)
Get a Random Staff Member by RNG
If staff member still needs more shifts (see Requirements columns) Then
If staff member does not have an "X" under the day of the week Then
If staff member does not have a matching date conflict Then
Assign the coverage
Increase CoverageRowNumber
End If
End If
End If
Loop
'Loop for Column B in the coverage slips sheet (image 2 in imgur album)
Do Until...
Same as the loop above
Loop
Edit: Disregard that I have the dates in two columns for now. I'll be fixing that once I solve the problem of this post...it's an easy fix and will cut the code almost in half.
The problem is that as the utility gets near the end of the list of dates, it often runs into the scenario where the only staff members left cannot sit that specific shift (whether because of day of the week or specific date). In the event that it runs into this scenario, I can see a couple of acceptable options (though I don't know how I'd go about programming them):
Undo all of the work that the utility did and start over until it can get lucky and find a solution that works. This would save me some time doing manual placements for the last few shifts but might take a very long time. Additionally, I'd have to store all of the original values and then paste them back into the spreadsheet anytime it starts over.
Simply stop assigning shifts and just exit the procedure. I will be able to manually place the last few shifts by moving a few people around. I sure is a lot less work than manually assigning 200 shifts by hand like I've been doing it the past few years.
Do you guys have any thoughts that could be of help here? I'm not even sure how I could have the procedure check to see if there are any available options or not, but either way there's got to be a way to detect (and deter) this infinite loop before it crashes the program.
Sorry for the novel, and thanks in advance for any help!
Edit: In an effort to provide a little more clarity, I figured I'd copy and paste the actual code below:
'------------------------------------------------------------'
'Create ws variables for each worksheet
Dim wsConflicts As Worksheet
Dim wsCoverageSlips As Worksheet
Dim wsWDSlips As Worksheet
Dim wsCoverageOutput As Worksheet
Dim wsWDOutput As Worksheet
'------------------------------------------------------------'
Public Function SetSheets()
'Assign the worksheets to the ws variables
Set wsConflicts = Worksheets("Conflicts")
Set wsCoverageSlips = Worksheets("Coverage Slips")
Set wsWDSlips = Worksheets("WD Slips")
Set wsCoverageOutput = Worksheets("Coverage Output")
Set wsWDOutput = Worksheets("WD Output")
'Display a message (debugging)
'MsgBox "The sheets have been assigned successfully"
End Function
'------------------------------------------------------------'
Public Function ColumnLetter(ColumnNumber As Integer) As String
Dim n As Long
Dim c As Byte
Dim s As String
n = ColumnNumber
Do
c = ((n - 1) Mod 26)
s = Chr(c + 65) & s
n = (n - c) \ 26
Loop While n > 0
ColumnLetter = s
End Function
'------------------------------------------------------------'
Sub AssignCoverages()
'Fill the ws variables
Call SetSheets
'Set the first and last row numbers
Dim FirstStaffMemberRow As Integer
FirstStaffMemberRow = 3
Dim LastStaffMemberRow As Integer
LastStaffMemberRow = wsConflicts.UsedRange.Rows.Count
'Count the number of required coverages and weekend duties
Dim RequiredCoverages As Integer
Dim RequiredWDs As Integer
For i = FirstStaffMemberRow To LastStaffMemberRow
RequiredCoverages = RequiredCoverages + wsConflicts.Range("B" & i).Value
RequiredWDs = RequiredWDs + wsConflicts.Range("C" & i).Value
Next i
'Display a message (debugging)
MsgBox "You currently have " & RequiredCoverages & " required coverages and " & RequiredWDs & " required weekend duties."
'Count the number of coverage slips and weekend duty slips
Dim FirstCoverageSlipRow As Integer
FirstCoverageSlipRow = 1
Dim LastCoverageSlipRow As Integer
LastCoverageSlipRow = wsCoverageSlips.UsedRange.Rows.Count
Dim NumCoverageSlips As Integer
NumCoverageSlips = (LastCoverageSlipRow - FirstCoverageSlipRow + 1)
Dim FirstWDSlipRow As Integer
FirstWDSlipRow = 1
Dim LastWDSlipRow As Integer
LastWDSlipRow = wsWDSlips.UsedRange.Rows.Count
Dim NumWDSlips As Integer
NumWDSlips = (LastWDSlipRow - FirstWDSlipRow + 1)
'Check to make sure there are enough required shifts for slips
If RequiredCoverages <> NumCoverageSlips Then
MsgBox "The number of shifts you require (Columns B & C on Conflicts sheet) does not match the number of slips you've entered. You have " & RequiredCoverages & " required coverages and " & NumCoverageSlips & " coverage slips. You have " & RequiredWDs & " required weekend duties and " & NumWDSlips & " weekend duty slips. Please correct this error and retry."
Exit Sub
Else
'Debugging
'MsgBox "The number of shifts you require (Columns B & C on Conflicts sheet) matches the number of slips you've entered. You have " & RequiredCoverages & " required coverages and " & NumCoverageSlips & " coverage slips. You have " & RequiredWDs & " required weekend duties and " & NumWDSlips & " weekend duty slips."
End If
'Massive loop to assign coverages to random staff members
Dim NumRemainingCoverages As Integer
NumRemainingCoverages = NumCoverageSlips
Dim SlipRowNumber As Integer
SlipRowNumber = FirstCoverageSlipRow
'Loop for Column A
Do Until (SlipRowNumber = LastCoverageSlipRow + 1)
'Get a random staff member row
StaffMemberRow = GetRandomStaffMemberRow(FirstStaffMemberRow, LastStaffMemberRow)
'Check to make sure the staff member has remaining required coverages
If wsConflicts.Range("B" & StaffMemberRow).Value > 0 Then
'Check to make sure the staff member can sit the day of the week
Dim CurrentDate As Date
CurrentDate = wsCoverageSlips.Range("A" & SlipRowNumber).Value
Dim CurrentDay As Integer
CurrentDay = Weekday(CurrentDate)
Dim CurrentDayColumn As String
If CurrentDay = 1 Then CurrentDayColumn = "D"
If CurrentDay = 2 Then CurrentDayColumn = "E"
If CurrentDay = 3 Then CurrentDayColumn = "F"
If CurrentDay = 4 Then CurrentDayColumn = "G"
If CurrentDay = 5 Then CurrentDayColumn = "H"
If CurrentDay = 6 Then CurrentDayColumn = "I"
If CurrentDay = 7 Then CurrentDayColumn = "J"
If wsConflicts.Range(CurrentDayColumn & StaffMemberRow).Value = "" Then
'Check to make sure the staff member does not have a date conflict
Dim ColumnNumber As Integer
Dim ColumnLetterText As String
Dim CoverageDateConflicts As Integer
CoverageDateConflicts = 0
For ColumnNumber = 11 To 20
ColumnLetterText = ColumnLetter(ColumnNumber)
Dim CoverageSlipDate As Date
If IsDate(wsConflicts.Range(ColumnLetterText & StaffMemberRow).Value) = True Then
CoverageSlipDate = wsConflicts.Range(ColumnLetterText & StaffMemberRow).Value
Else
CoverageSlipDate = DateValue("01/01/1900")
End If
If CurrentDate = CoverageSlipDate Then
CoverageDateConflicts = CoverageDateConflicts + 1
End If
Next ColumnNumber
If CoverageDateConflicts = 0 Then
'Assign the coverage
Dim BlankCoverageOutputRow As Integer
BlankCoverageOutputRow = wsCoverageOutput.UsedRange.Rows.Count + 1
wsCoverageOutput.Range("A" & BlankCoverageOutputRow).Value = wsConflicts.Range("A" & StaffMemberRow).Value
wsCoverageOutput.Range("B" & BlankCoverageOutputRow).Value = CurrentDate
'Reduce the staff member's required coverages by 1
Dim CurrentRequirements As Integer
CurrentRequirements = wsConflicts.Range("B" & StaffMemberRow).Value
wsConflicts.Range("B" & StaffMemberRow).Value = CurrentRequirements - 1
'Reduce the number of remaning coverages by 1
NumRemainingCoverages = NumRemainingCoverages - 1
'Increase the slip row number by 1
SlipRowNumber = SlipRowNumber + 1
'Message box for debugging
'MsgBox "Coverage Date (" & CurrentDate & ") assigned to " & wsConflicts.Range("A" & StaffMemberRow).Value & "."
End If 'End date check
End If 'End day check
End If 'End requirements check
Loop 'End loop for column A
End Sub
'------------------------------------------------------------'
Public Function GetRandomStaffMemberRow(FirstStaffMemberRow As Integer, LastStaffMemberRow As Integer)
'Pick a random number between the first staff member row and the last
Call Randomize
GetRandomStaffMemberRow = Int((LastStaffMemberRow - FirstStaffMemberRow + 1) * Rnd + FirstStaffMemberRow)
End Function
The question is too open for a detailed answer, so I try with some guidelines. I hope it helps.
I would use a class Solution with the following members:
Solution.ReadInputFromSheet() reads the table from the sheet into the class members
Solution.GenerateRandom() creates a new random solution. Try to find a balance between smart (add some logic to avoid totally random solutions) and speed (don't get stuck, exit after trying 10 or 50 random numbers that don't work), but speed is more important
Solution.Quality() As Double calculates the quality of the solution. For example a solution that is not valid returns 0, if Joe has 10 consecutive shifts returns 20, if the shifts are better distributed returns 100.
Solution.WriteOnSheet() write the data from the class members into the sheet.
Solution.Clone() As Solution() creates a new Solution instance with the same data
Make a cycle that creates a solution, checks if its quality is better than the best quality solution found so far, if it is better keep it, otherwise go and calculate another solution.
Set BestS = New Solution
BestS.ReadInputFromSheet
BestS.GenerateRandom()
Set S = New Solution
S.ReadInputFromSheet
For I = 1 To 10000
S.GenerateRandom()
If S.Quality() > BestS.Quality() Then Set BestS = S.Clone()
Next I
BestS.WriteOnSheet
Instead of 10000 you can use Timer to run it for a finite number of seconds, or make a button to interrupt it when you come back from lunch break.
A faster solution generator function is better than risking of getting stuck with one difficult (or impossible) solution.
For a smarter solution generator function I need more details on the rules.
So I went ahead and developed my own solution to this problem--it's not perfect and it's probably not the best way to handle the scenario. But it works, and it solved my problem in a matter of minutes instead of hours learning other methods.
Basically, I created two new "counter" variables. The first is FailedAttempts. Every time the procedure tries a random staff member but runs into a conflict, it increments FailedAttempts by 1. Every time the random staff member is a successful match (no conflicts), it resets FailedAttempts to 0. If at any time FailedAttempts = 100, it immediately exits the loop and starts over. In other words, if it tries 100 random staff members in a row without finding a match, I assume it's not going to find a match and just cut my losses.
The second variable, Assignments, is incremented by 1 every time that the procedure makes a successful assignment. When this number equals the number of shifts that the procedure is supposed to assign, it immediately exits the loop.
To do this, I had to use a couple of forbidden 'GoTo' commands (I wasn't sure how else to exit the loop. You can exit a For loop with Exit For but I believe this is invalid for Do While loops. I ended up only needing two GoTo's, one for exiting the loop and one to go back to the beginning of the procedure. I also made sure that the cells in the worksheet that change during the procedure are reset to their original state before it retries the assignment procedure.
I'll save everyone the trouble of reading through the extended version of the code, but in 'pseudo-code' form it looks like this:
Retry: 'Label for GoTo command
Do Until (CoverageRowNumber = LastCoverageSlipRow + 1)
Get a Random Staff Member by RNG
If staff member still needs more shifts (see Requirements columns) Then
If staff member does not have an "X" under the day of the week Then
If staff member does not have a matching date conflict Then
'Assign the coverage
'Increase CoverageRowNumber
Assignments = Assignments + 1
Else
FailedAttempts = FailedAttempts + 1
End If
Else
FailedAttempts = FailedAttempts + 1
End If
Else
FailedAttempts = FailedAttempts + 1
End If
If FailedAttempts > 100 Then
GoTo ExitLoop
End If
Loop
ExitLoop: 'Label for GoTo command
If Assignments <> NumCoverageSlips Then
GoTo Retry
End If
'Do rest of procedure
Again, there may be (and certainly is) a more elegant and "correct" way of accomplishing the task at hand. This method worked for me with the given environment. Thanks to those who provided solutions--even though I ended up going a different direction they provided great food for thought and helped me learn a bunch of new methods (especially the class idea from #stenci).
Thanks all.

Speedup VBA: Large tables from Word to Excel

I have large tables in RTF format, from 20-150 Mb in size. I first tried to export the RTF -> HTML -> Import to excel. It took about 35 minutes for a 60 Mb file. Next, I tried copying the table directly from Word -> excel. It always fails midway (everything gets pasted, by data is not in the right cell).
I tried a few more ways (importing all cells into memory before transferring to excel, and other permutations, and methods detailed in this and other sources) before settling on the .ConvertToText method.
This method is relatively faster, taking about 25 minutes for the same 60 Mb file (this is without displaying Word, setting repagination, events, dispayupdate and tableautofit to false).
Considering that these files can be entirely loaded into RAM memory in less than 10 seconds, I wonder why does it take 25 minutes to read data off a 60 Mb file. I understand that the table engine in Word is slow because of the change to HTML format, but reading a table cell by cell is atrociously slow. First few cells are super fast, Last cells are slower - I'm sure manual reading is faster than that. It defeats the whole purpose of automation. However, I do not have a choice.
The code is:
Dim oWord As Word.Application
Dim RTF As Word.Document
Set oWord = CreateObject("Word.Application")
Set RTF = oWord.Documents.Open(filename:=Fname, ConfirmConversions:=False, ReadOnly:=False) ', ReadOnly:=True)
Application.StatusBar = vbNullString
Application.EnableEvents = False
Application.ScreenUpdating = False
Application.DisplayAlerts = False
With oWord
Options.Pagination = False
Options.AllowReadingMode = False
Application.AutoRecover.Enabled = False
Options.SaveInterval = 0
Options.CheckGrammarAsYouType = False
Options.CheckGrammarWithSpelling = False
End With
With RTF
Options.Pagination = False
Options.AllowReadingMode = False
Application.AutoRecover.Enabled = False
Options.SaveInterval = 0
Options.CheckGrammarAsYouType = False
Options.CheckGrammarWithSpelling = False
End With
Dim AAF As Table
For Each AAF In RTF.Tables
AAF.AllowAutoFit = False
Next
oWord.Visible = False
Dim rng As Word.Range
Dim sData As String
Dim aData1() As String
Dim aData2() As String
Dim aDataAll() As String
Dim nrRecs As Long
Dim nrFields As Long
Dim lRecs As Long
Dim lFields As Long
Dim CTbl As Table 'Data Table
Dim oCell As Cell
'I'm not displaying the code which replaces all ^p with a spl character to maintain the table structure - it is staright forward, and does the job
Set rng = CTbl.ConvertToText(Separator:="$", NestedTables:=False)
sData = rng.Text 'This contains the entire table, delimited by vbCr and $...
Application.StatusBar = "Closing open files..."
RTF.Close (wdDoNotSaveChanges) 'All data has been extracted, hence quit word
oWord.Quit
Set oWord = Nothing
sData = Mid(sData, 1, Len(sData) - 1)
aData1() = Split(sData, vbCr)
nrRecs = UBound(aData1())
If Dbg Then MsgBox "The table contained " & nrRecs + 1 & " rows"
For lRecs = LBound(aData1()) To nrRecs 'Cycle through all rows
aData2() = Split(aData1(lRecs), "$") 'Split rows into arrays
Debug.Print aData1(lRecs)
nrFields = UBound(aData2()) 'Find out the number of columns
If lRecs = LBound(aData1()) Then 'If this is the first row/cycle,
ReDim Preserve aDataAll(nrRecs, 9) 'nrFields) 'Resize the array - currently I'm using a fixed size for the column since the first row of my table contains merged rows
End If
For lFields = LBound(aData2()) To nrFields 'Cycle through all columns
aDataAll(lRecs, lFields) = aData2(lFields) 'Collate the data in a single array
'If MsgBox(aDataAll(lRecs, lFields), vbYesNo, "Continue?") = vbNo Then Exit For
Next
Next 'All of this was slapped together from MS code samples and stackoverflow examples
Any suggestions to improve performance?
The conversion will go a lot faster if you first split the table (I assume there's one very large table) into smaller tables and then convert each table to text.
I tried this on a table with 10000 rows and 10 columns. The time to convert to text went from ~280 seconds to ~70 seconds (i.e. 4X faster).
I ran the code below directly from the document with the 10000 row table (as opposed to running from Excel) for simplicity.
Splt then convert:
Sub SplitThenConvert()
Dim t As Table
Set t = ActiveDocument.Tables.Item(1)
Dim rowCount As Integer
Dim index As Integer
Dim numSteps As Integer
Dim splitRow As Integer
Dim increment As Integer
Dim start_time, end_time
start_time = Now()
Application.ScreenUpdating = False
rowCount = t.Rows.Count
numSteps = 10
increment = rowCount / numSteps
splitRow = rowCount - increment
For index = 1 To numSteps
Debug.Print "Split #" + CStr(index)
ActiveDocument.Tables(1).Rows(splitRow).Select
Selection.SplitTable
splitRow = splitRow - increment
If splitRow < increment Then
Exit For
End If
Next index
index = ActiveDocument.Tables.Count
While index > 0
Debug.Print "Convert #" + CStr(index)
ActiveDocument.Tables(index).ConvertToText ","
index = index - 1
Wend
end_time = Now()
Application.ScreenUpdating = True
MsgBox (DateDiff("s", start_time, end_time))
End Sub
Convert entire table without splitting:
Sub ConvertAll()
Dim start_time, end_time
Application.ScreenUpdating = False
start_time = Now()
ActiveDocument.Tables(1).ConvertToText ","
end_time = Now()
Application.ScreenUpdating = True
MsgBox (DateDiff("s", start_time, end_time))
End Sub
I do agree with #KazJaw: reading/writing from/to MS Office programs (including .rtf because is treated as Word) is very computational expensive, better relying on other means as much as possible (just converting the .rtf file reading into a simple .txt file reading would improve the speed a lot). I have recently answered a post on these lines.
The other proposal I have is reducing the number of "live Office variables" as much as possible. Instead creating the RTF and the oDoc variables at the same time, better doing it one after the other (the same for Excel). What should be done only under exceptional circumstamces (because of being too computational expensive) is copying/pasting in real time between two different instances (for example, two different Word documents).
Thus, use the connection to Office programs for what it is intended, that is, top-level access to a file storing information in a pretty complex way: populate values, change formatting, perform complex actions (e.g., searching through the whole document); but intend to reduce the iterative behaviour (e.g., copying from one cell and pasting into another one over and over) as much as possible. See it in this way: copying/pasting in a .txt file involves just inspecting the input value/the target location and performing the action; doing it in Word involves the same than in the .txt file plus accounting for the huge amount of variables analysed while considering each record (formatting, references to other elements, special actions, etc.).