Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I need to create a csv file from data set matrix, in which I have materials as a rows, people as a columns and quantity of products on intersection. Here is an example of this data set (Order id #1000):
Materials Person1 Person2
563718 20 40
837563 15 35
As a first action I have to transform this data set to a linear structure in this way on additional sheet:
Orderid Material Person Qty
1000 563718 Person1 20
1000 837563 Person1 15
1000 563718 Person2 40
1000 837563 Person2 35
And from this linear structure I have to generate a csv file with Orders for another system based on unique Persons from the list above. Each Order should have one header line and details based on the number of materials he/she ordered. General structure is the following:
H,1000-1,OUT,20160830,Person1
l,1000-1,1,563718,20,EA
l,1000-1,2,837563,15,EA
H,1000-2,OUT,20160830,Person2
l,1000-2,1,563718,40,EA
l,1000-2,2,837563,15,EA
where "H" - means Header row, "1000-1" - first Sub-Order of a Global order 1000, "20160830" requested delivery date, "l" - line row, "1" - line number, "EA" - unit of measure.
Here's a macro that will get you most of the way. It takes the data in your first table and organizes it so that your date in like columns (person1 and person2) is separated into separate rows:
This script assumes that your fixed column(s) are on the left and the columns to be combined (and split out into multiple rows) follow on the right. I hope this helps!
Option Explicit
Sub MatrixConverter2_3()
' Macro created 11/16/2005 by Peter T Oboyski (updated 8/24/2006)
'
' *** Substantial changes made by Chris Brackett (updated 8/3/2016) ***
'
' You are welcome to redistribute this macro, but if you make substantial
' changes to it, please indicate so in this section along with your name.
' This Macro converts matrix-type spreadsheets (eg. plot x species data) into column data
' The new (converted) spreadsheet name is "DB of 'name of active spreadsheet'"
' The conversion allows for multiple header rows and columns.
'--------------------------------------------------
' This section declares variables for use in the script
Dim book, head, cels, mtrx, dbase, v, UserReady, columnsToCombine, RowName, DefaultRowName, DefaultColName1, DefaultColName2, ColName As String
Dim defaultHeaderRows, defaultHeaderColumns, c, r, selectionCols, ro, col, newro, newcol, rotot, coltot, all, rowz, colz, tot As Long
Dim headers(100) As Variant
Dim dun As Boolean
'--------------------------------------------------
' This section sets the script defaults
defaultHeaderRows = 1
defaultHeaderColumns = 2
DefaultRowName = "Activity"
'--------------------------------------------------
' This section asks about data types, row headers, and column headers
UserReady = MsgBox("Have you selected the entire data set (not the column headers) to be converted?", vbYesNoCancel)
If UserReady = vbNo Or UserReady = vbCancel Then GoTo EndMatrixMacro
all = MsgBox("Exclude zeros and empty cells?", vbYesNoCancel)
If all = vbCancel Then GoTo EndMatrixMacro
' UN-COMMENT THIS SECTION TO ALLOW FOR MULTIPLE HEADER ROWS
rowz = 1
' rowz = InputBox("How many HEADER ROWS?" & vbNewLine & vbNewLine & "(Usually 1)", "Header Rows & Columns", defaultHeaderRows)
' If rowz = vbNullString Then GoTo EndMatrixMacro
colz = InputBox("How many HEADER COLUMNS?" & vbNewLine & vbNewLine & "(These are the columns on the left side of your data set to preserve as is.)", "Header Rows & Columns", defaultHeaderColumns)
If colz = vbNullString Then GoTo EndMatrixMacro
'--------------------------------------------------
' This section allows the user to provide field (column) names for the new spreadsheet
selectionCols = Selection.Columns.Count ' get the number of columns in the selection
For r = 1 To selectionCols
headers(r) = Selection.Cells(1, r).Offset(rowOffset:=-1, columnOffset:=0).Value ' save the column headers to use as defaults for user provided names
Next r
colz = colz * 1
columnsToCombine = "'" & Selection.Cells(1, colz + 1).Offset(rowOffset:=-1, columnOffset:=0).Value & "' to '" & Selection.Cells(1, selectionCols).Offset(rowOffset:=-1, columnOffset:=0).Value & "'"
Dim Arr(20) As Variant
newcol = 1
For r = 1 To rowz
If r = 1 Then RowName = DefaultRowName
Arr(newcol) = InputBox("Field name for the fields/columns to be combined" & vbNewLine & vbNewLine & columnsToCombine, , RowName)
If Arr(newcol) = vbNullString Then GoTo EndMatrixMacro
newcol = newcol + 1
Next
For c = 1 To colz
ColName = headers(c)
Arr(newcol) = InputBox("Field name for column " & c, , ColName)
If Arr(newcol) = vbNullString Then GoTo EndMatrixMacro
newcol = newcol + 1
Next
Arr(newcol) = "Data"
v = newcol
'--------------------------------------------------
' This section creates the new spreadsheet, names it, and color codes the new worksheet tab
mtrx = ActiveSheet.Name
Sheets.Add After:=ActiveSheet
dbase = "DB of " & mtrx
'--------------------------------------------------
' If the proposed worksheet name is longer than 28 characters, truncate it to 29 characters.
If Len(dbase) > 28 Then dbase = Left(dbase, 28)
'--------------------------------------------------
' This section checks if the proposed worksheet name
' already exists and appends adds a sequential number
' to the name
Dim sheetExists As Variant
Dim Sheet As Worksheet
Dim iName As Integer
Dim dbaseOld As String
dbaseOld = dbase ' save the original proposed name of the new worksheet
iName = 0
sheetExists = False
CheckWorksheetNames:
For Each Sheet In Worksheets ' loop through every worksheet in the workbook
If dbase = Sheet.Name Then
sheetExists = True
iName = iName + 1
dbase = Left(dbase, Len(dbase) - 1) & " " & iName
GoTo CheckWorksheetNames
' Exit For
End If
Next Sheet
'--------------------------------------------------
' This section notify the user if the proposed
' worksheet name is already being used and the new
' worksheet was given an alternate name
If sheetExists = True Then
MsgBox "The worksheet '" & dbaseOld & "' already exists. Renaming to '" & dbase & "'."
End If
'--------------------------------------------------
' This section creates and names a new worksheet
On Error Resume Next 'Ignore errors
If Sheets("" & Range(dbase) & "") Is Nothing Then ' If the worksheet name doesn't exist
ActiveSheet.Name = dbase ' Rename newly created worksheet
Else
MsgBox "Cannot name the worksheet '" & dbase & "'. A worksheet with that name already exists."
GoTo EndMatrixMacro
End If
On Error GoTo 0 ' Resume normal error handling
Sheets(dbase).Tab.ColorIndex = 41 ' color the worksheet tab
'--------------------------------------------------
' This section turns off screen and calculation updates so that the script
' can run faster. Updates are turned back on at the end of the script.
Application.Calculation = xlCalculationManual
Application.ScreenUpdating = False
'--------------------------------------------------
'This section determines how many rows and columns the matrix has
dun = False
rotot = rowz + 1
Do
If (Sheets(mtrx).Cells(rotot, 1) > 0) Then
rotot = rotot + 1
Else
dun = True
End If
Loop Until dun
rotot = rotot - 1
dun = False
coltot = colz + 1
Do
If (Sheets(mtrx).Cells(1, coltot) > 0) Then
coltot = coltot + 1
Else
dun = True
End If
Loop Until dun
coltot = coltot - 1
'--------------------------------------------------
'This section writes the new field names to the new spreadsheet
For newcol = 1 To v
Sheets(dbase).Cells(1, newcol) = Arr(newcol)
Next
'--------------------------------------------------
'This section actually does the conversion
tot = 0
newro = 2
For col = (colz + 1) To coltot
For ro = (rowz + 1) To rotot 'the next line determines if data are nonzero
If ((Sheets(mtrx).Cells(ro, col) <> 0) Or (all <> 6)) Then 'DCB modified ">0" to be "<>0" to exclude blank and zero cells
tot = tot + 1
newcol = 1
For r = 1 To rowz 'the next line copies the row headers
Sheets(dbase).Cells(newro, newcol) = Sheets(mtrx).Cells(r, col)
newcol = newcol + 1
Next
For c = 1 To colz 'the next line copies the column headers
Sheets(dbase).Cells(newro, newcol) = Sheets(mtrx).Cells(ro, c)
newcol = newcol + 1
Next 'the next line copies the data
Sheets(dbase).Cells(newro, newcol) = Sheets(mtrx).Cells(ro, col)
newro = newro + 1
End If
Next
Next
'--------------------------------------------------
'This section displays a message box with information about the conversion
book = "Original matrix = " & ActiveWorkbook.Name & ": " & mtrx & Chr(10)
head = "Matrix with " & rowz & " row headers and " & colz & " column headers" & Chr(10)
cels = tot & " cells of " & ((rotot - rowz) * (coltot - colz)) & " with data"
'--------------------------------------------------
' This section turns screen and calculation updates back ON.
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True
MsgBox (book & head & cels)
'--------------------------------------------------
' This is an end point for the macro
EndMatrixMacro:
End Sub
Thanks #ChrisB for your answer. Actually I decided to do it my own way and here are the main steps I did:
I created an Excel file with several buttons to which I assigned below Subroutines. Also I have added some parameters, which user can modify (OrderId, Delivery Date and WH id).
I created a Subroutine ReadData(), which clears the Sheet "DATA" in original file and after reads column by column in the data file and generates a linear data set with all required fields on "DATA" Sheet.
After that I simply writes "DATA" sheet to external csv file.
the final code looks like this:
Global Const DAODBEngine = "DAO.DBEngine.36"
Global intColBeg As Integer 'Column Index with Data set to analyze
Global intRowBeg As Integer 'Row Index with Data set to analyze
Sub FileOpen()
Dim filePath As String
filePath = Application.GetOpenFilename()
If filePath = "False" Then Exit Sub
ThisWorkbook.Sheets("BASE").Cells(4, 3) = filePath
End Sub
Sub ClearData()
' Check if DATA Sheet exists
If Evaluate("ISREF('" & "DATA" & "'!A1)") Then
Application.DisplayAlerts = False
ThisWorkbook.Sheets("DATA").Delete
Application.DisplayAlerts = True
End If
Dim sheet As Worksheet
ThisWorkbook.Sheets.Add(After:=ThisWorkbook.Sheets(ThisWorkbook.Sheets.Count)).Name = "DATA"
End Sub
' This function reads data and adds it to DATA Sheet
Sub ReadData()
Dim i As Integer, l As Integer
Dim intColumn As Integer, intRow As Integer
Dim intAddRow As Integer
Dim wbCopyFrom As Workbook
Dim wbCopyTo As Workbook
Dim wsCopyFrom As Worksheet
Dim wsCopyTo As Worksheet
Dim dataLoc As String, wbLoc As String
Dim mandant As String
Dim orderId As String
Dim orderNum As Integer
Dim shipDate As Date
dataLoc = Trim(ThisWorkbook.Sheets("BASE").Cells(4, 3).Text)
Set wbCopyFrom = Workbooks.Open(dataLoc)
Set wsCopyFrom = wbCopyFrom.Worksheets(1)
ThisWorkbook.Activate
Call ClearData ' Clears all the data on DATA Sheet
Set wbCopyTo = ThisWorkbook
Set wsCopyTo = wbCopyTo.Sheets("DATA")
wbCopyTo.Activate
mandant = wbCopyTo.Sheets("BASE").Cells(11, 3).Text
orderId = wbCopyTo.Sheets("BASE").Cells(7, 3).Text
shipDate = wbCopyTo.Sheets("BASE").Cells(9, 3).Text
' Initial upper left row/column where matrix data begins
intColBeg = 4
intRowBeg = 4
intColumn = intColBeg
intRow = intRowBeg
intAddRow = 1 ' We will add data from this row
orderNum = 1
While Trim(wsCopyFrom.Cells(intRowBeg - 1, intColumn).Text) <> ""
' Header of an Order
wsCopyTo.Cells(intAddRow, 1) = "H;OUT;" & mandant & ";" & orderId & "/" & orderNum & ";" & _
";;" & Mid(shipDate, 7, 4) & Mid(shipDate, 4, 2) & Mid(shipDate, 1, 2) & ";" & _
Trim(wsCopyFrom.Cells(3, intColumn).Text) & ";" & Trim(wsCopyFrom.Cells(2, intColumn).Text) & _
";;;;;;;999;;"
Dim r As Integer
r = 1
intAddRow = intAddRow + 1
While Trim(wsCopyFrom.Cells(intRow, intColBeg - 1).Text) <> ""
If (Trim(wsCopyFrom.Cells(intRow, intColumn).Text) <> "") Then
If Round(CDbl(Trim(wsCopyFrom.Cells(intRow, intColumn).Value)), 0) > 0 Then
' Rows of an Order
wsCopyTo.Cells(intAddRow, 1) = "I;" & orderId & "/" & orderNum & ";" & r & ";" & _
Trim(wsCopyFrom.Cells(intRow, 1).Text) & ";" & Trim(wsCopyFrom.Cells(intRow, intColumn).Value) & _
";PCE;;;;;;;;;;;;;;;"
r = r + 1
intAddRow = intAddRow + 1
End If
End If
intRow = intRow + 1
Wend
intRow = intRowBeg
intColumn = intColumn + 1
orderNum = orderNum + 1
Wend
wbCopyFrom.Close
wbCopyTo.Sheets("BASE").Activate
End Sub
Sub Export()
Dim MyPath As String
Dim MyFileName As String
MyFileName = "Orders_" & Sheets("BASE").Cells(7, 3).Text & "_" & Format(Date, "ddmmyyyy")
If Not Right(MyFileName, 4) = ".csv" Then MyFileName = MyFileName & ".csv"
Sheets("DATA").Copy
With Application.FileDialog(msoFileDialogFolderPicker)
.Title = "Select a Folder"
.AllowMultiSelect = False
.InitialFileName = "" '<~~ The start folder path for the file picker.
If .Show <> -1 Then GoTo NextCode
MyPath = .SelectedItems(1) & "\"
End With
NextCode:
If MyPath <> "" Then
Application.DisplayAlerts = False
With ActiveWorkbook
.SaveAs fileName:=MyPath & MyFileName, AccessMode:=xlExclusive, FileFormat:=xlCSV, CreateBackup:=False, ConflictResolution:=Excel.XlSaveConflictResolution.xlLocalSessionChanges
.Close False
End With
Application.DisplayAlerts = True
Else
On Error Resume Next
ActiveWorkbook.Close SaveChanges:=False
If Err.Number = 1004 Then
On Error GoTo 0
End If
End If
End Sub
I believe that this code isn't optimal as I don't have any experience in VBA and it was a method of trying/changing/trying again in debugging mode and googling in case of issues.
If you can provide any suggestion how to optimise it - that would be great!
I have been trying to write a program that will loop through all the cells in an excel sheet and if one starts with a '#' it should display a message. here's the code:
(template is a worksheet variable)
Private Function processTemplate()
Dim total As Long
total = template.UsedRange.count
Dim count As Integer
count = 0
While count <= total
If template.Cells(count).Value Like "[#]*" Then 'Here I get a error
MsgBox "Found #"
End If
count = count + 1
Wend
End Function
I have isolated the error to using a variable inside of cells(). If I replace count with some number (like 8) it works fine. I am getting error 1004 on the line If template.Cells(count).Value Like "[#]*" Then
If I make total an Integer it has the same error at the same place. After about 2-3 hrs of research/banging my head on the wall I have no idea. I initially got the error when assigning template.cells(row, col).Value to a string variable.
Here's my code now:
Private Sub processTemplate()
MsgBox Len("")
Dim str As String
Dim rows As Long
Dim cols As Long
rows = template.UsedRange.Height
cols = template.UsedRange.Width
Dim row As Integer
row = 1
While row < rows
Dim col As Integer
col = 1
While col < cols
str = template.Cells(row, col).Text
If Len(str) > 0 Then
If Left(template.Cells(row, col).Text, 1) = "#" Then
MsgBox "Found IT"
End If
End If
Rem MsgBox template.Parent.Name & ": " & template.Name & ", Cell(" & row & ", " & col & "): " & template.Cells(row, col).Value
col = col + 1
Wend
row = row + 1
Wend
End Sub
Now I get the error on str = template.Cells(row, col).Text
We can use a sub rather than a function
We loop over all the cells in UsedRange looking for a # as the first character in the cell.
Sub FindThePound()
Dim r As Range, pound As String, template As Worksheet
pound = "#"
Set template = ActiveSheet
For Each r In template.UsedRange
If Left(r.Value, 1) = pound Then
MsgBox "Found # in " & r.Address(0, 0)
End If
Next r
End Sub
EDIT#1
This version loops over all the cells, but does not test cells containing formulas
Sub FindThePound()
Dim r As Range, pound As String, template As Worksheet
pound = "#"
Set template = ActiveSheet
For Each r In template.UsedRange
If r.HasFormula = False Then
If Left(r.Value, 1) = pound Then
MsgBox "Found # in " & r.Address(0, 0)
End If
End If
Next r
End Sub
You could use find/ find next function which i guess bit faster than looping through each cell and do string comparison.
With Worksheets(1).Range("a1:a500") 'Provide the search range
Set c = .Find(2, lookin:=xlValues) ' searching for 2 in cell value
If Not c Is Nothing Then
firstAddress = c.Address 'first occurance
Do
'do whatever you want to do with the matches even replace them
c.Value = 5
Set c = .FindNext(c)
Loop While Not c Is Nothing And c.Address <> firstAddress
End If
End With
Reference:
http://msdn.microsoft.com/en-us/library/office/ff196143(v=office.15).aspx
I have a VBA script that prints out the word frequency in a document.
Sub WordFrequency()
Const maxwords = 9000 'Maximum unique words allowed
Dim SingleWord As String 'Raw word pulled from doc
Dim Words(maxwords) As String 'Array to hold unique words
Dim Freq(maxwords) As Integer 'Frequency counter for unique words
Dim WordNum As Integer 'Number of unique words
Dim ByFreq As Boolean 'Flag for sorting order
Dim ttlwds As Long 'Total words in the document
Dim Excludes As String 'Words to be excluded
Dim Found As Boolean 'Temporary flag
Dim j, k, l, Temp As Integer 'Temporary variables
Dim ans As String 'How user wants to sort results
Dim tword As String '
' Set up excluded words
Excludes = "[the][a][of][is][to][for][by][be][and][are]"
' Find out how to sort
ByFreq = True
ans = InputBox("Sort by WORD or by FREQ?", "Sort order", "WORD")
If ans = "" Then End
If UCase(ans) = "WORD" Then
ByFreq = False
End If
Selection.HomeKey Unit:=wdStory
System.Cursor = wdCursorWait
WordNum = 0
ttlwds = ActiveDocument.Words.Count
' Control the repeat
For Each aword In ActiveDocument.Words
SingleWord = Trim(LCase(aword))
'Out of range?
If SingleWord < "a" Or SingleWord > "z" Then
SingleWord = ""
End If
'On exclude list?
If InStr(Excludes, "[" & SingleWord & "]") Then
SingleWord = ""
End If
If Len(SingleWord) > 0 Then
Found = False
For j = 1 To WordNum
If Words(j) = SingleWord Then
Freq(j) = Freq(j) + 1
Found = True
Exit For
End If
Next j
If Not Found Then
WordNum = WordNum + 1
Words(WordNum) = SingleWord
Freq(WordNum) = 1
End If
If WordNum > maxwords - 1 Then
j = MsgBox("Too many words.", vbOKOnly)
Exit For
End If
End If
ttlwds = ttlwds - 1
StatusBar = "Remaining: " & ttlwds & ", Unique: " & WordNum
Next aword
' Now sort it into word order
For j = 1 To WordNum - 1
k = j
For l = j + 1 To WordNum
If (Not ByFreq And Words(l) < Words(k)) _
Or (ByFreq And Freq(l) > Freq(k)) Then k = l
Next l
If k <> j Then
tword = Words(j)
Words(j) = Words(k)
Words(k) = tword
Temp = Freq(j)
Freq(j) = Freq(k)
Freq(k) = Temp
End If
StatusBar = "Sorting: " & WordNum - j
Next j
' Now write out the results
tmpName = ActiveDocument.AttachedTemplate.FullName
Documents.Add Template:=tmpName, NewTemplate:=False
Selection.ParagraphFormat.TabStops.ClearAll
With Selection
For j = 1 To WordNum
.TypeText Text:=Trim(Str(Freq(j))) _
& vbTab & Words(j) & vbCrLf
Next j
End With
System.Cursor = wdCursorNormal
j = MsgBox("There were " & Trim(Str(WordNum)) & _
" different words ", vbOKOnly, "Finished")
End Sub
However, this VBA script is not picking up non-latin characters in the document.
I am using Arial Unicode MS and another unicode font
What do I need in the macro to pick up these words?
-Thanks
It can't work with this...
If SingleWord < "a" Or SingleWord > "z" Then
SingleWord = ""
End If
Remove that, and you should at least see non-Latin words. Beyond that, it's more a question of what you are trying to filter out. If it's stuff like "'words' starting with a digit or a punctuation mark" then you probably need to have a look through the Unicode tables and work out a better set of things to filter out (bearing in mind that some transliterated scripts may even use digits at the beginnings of words).
Can someone please confirm why my macro jumps out of loop. I am not getting why its happening.
My Input looks like this http://i.imgur.com/Y6XRBai.jpg
What I am trying is split the text and write to textfile from column D2 onwards using while loop, First file writes properly but when it starts writing second file either for loop or if condition breaks out and macro comes to line where strDir starts
Sub SplitTextAndSave()
'Macro to split text and write to text file
'Full name of File name will be Single quote + Prefix from B2 + ( + filename from C2 + )'
'Application.DisplayAlerts = False
Dim Val, splitVal As String
Dim reqNumTxt, totLn, reqNum, remChr, i As Integer
Dim wb As Workbook
Dim strFile, fileNm, strDir As String
Set Sheet = Excel.ActiveSheet
' Select where to place the files
Dim obj As Object
Dim path As String
Set obj = CreateObject("Shell.Application").browseforfolder(0, "Please Select Folder where TWS scripts will be created", 0)
On Error GoTo error_trap:
path = obj.self.path & "\"
error_trap:
'this is where it starts again when the loop breaks
strDir = path
filepre = Sheet.Cells(2, 2).Value
reqNum = Sheet.Cells(3, 2).Value
reqNumTxt = 0
Sheet.Cells(2, 4).Activate
Do While ActiveCell.Value <> ""
Set nextcell = ActiveCell.Offset(1, 0)
fileNm = ActiveCell.Offset(0, -1).Value
FileFullNm = strDir & "'" & filepre & "(" & fileNm & ")'"
Open FileFullNm For Output As #1
Val = ActiveCell.Value
totLn = Int(Len(Val) / reqNum)
remChr = Len(Val) Mod reqNum
If Len(Val) <= reqNum Then
Print #1, Val
Close #1
Else
For i = 1 To totLn
'I observed sometimes loop breaks here
splitVal = Left(Right(Val, Len(Val) - reqNumTxt), reqNum)
Print #1, splitVal
reqNumTxt = reqNumTxt + reqNum
Next i
If remChr = 0 Then
Close #1
Else
'most of the time loop break here when writing second file
splitVal = Left(Right(Val, Len(Val) - reqNumTxt), reqNum)
Print #1, splitVal
Close #1
End If
End If
nextcell.Select
Set currentcell = nextcell
'Next
Loop
MsgBox "Done"
'Application.DisplayAlerts = True
End Sub
I added 2 lines to your code and it runs without error. I set splitVal to null and reqNumTxt to zero.
Val = ActiveCell.Value
totLn = Int(Len(Val) / reqNum)
remChr = Len(Val) Mod reqNum
**splitVal = ""**
If Len(Val) <= reqNum Then
Print #1, Val
Close #1
Else
For i = 1 To totLn
'I observed sometimes loop breaks here
splitVal = Left(Right(Val, Len(Val) - reqNumTxt), reqNum)
Print #1, splitVal
reqNumTxt = reqNumTxt + reqNum
Next i
If remChr = 0 Then
Close #1
Else
'most of the time loop break here when writing second file
splitVal = Left(Right(Val, Len(Val) - reqNumTxt), reqNum)
Print #1, splitVal
Close #1
End If
End If
nextcell.Select
Set currentcell = nextcell
**reqNumTxt = 0**
Loop
I am getting the impression that this is not possible in word but I figure if you are looking for any 3-4 words that come in the same sequence anywhere in a very long paper I could find duplicates of the same phrases.
I copy and pasted a lot of documentation from past papers and was hoping to find a simple way to find any repeated information in this 40+ page document there is a lot of different formatting but I would be willing to temporarily get rid of formatting in order to find repeated information.
To highlight all duplicate sentences, you can also use ActiveDocument.Sentences(i). Here is an example
LOGIC
1) Get all the sentences from the word document in an array
2) Sort the array
3) Extract Duplicates
4) Highlight duplicates
CODE
Option Explicit
Sub Sample()
Dim MyArray() As String
Dim n As Long, i As Long
Dim Col As New Collection
Dim itm
n = 0
'~~> Get all the sentences from the word document in an array
For i = 1 To ActiveDocument.Sentences.Count
n = n + 1
ReDim Preserve MyArray(n)
MyArray(n) = Trim(ActiveDocument.Sentences(i).Text)
Next
'~~> Sort the array
SortArray MyArray, 0, UBound(MyArray)
'~~> Extract Duplicates
For i = 1 To UBound(MyArray)
If i = UBound(MyArray) Then Exit For
If InStr(1, MyArray(i + 1), MyArray(i), vbTextCompare) Then
On Error Resume Next
Col.Add MyArray(i), """" & MyArray(i) & """"
On Error GoTo 0
End If
Next i
'~~> Highlight duplicates
For Each itm In Col
Selection.Find.ClearFormatting
Selection.HomeKey wdStory, wdMove
Selection.Find.Execute itm
Do Until Selection.Find.Found = False
Selection.Range.HighlightColorIndex = wdPink
Selection.Find.Execute
Loop
Next
End Sub
'~~> Sort the array
Public Sub SortArray(vArray As Variant, i As Long, j As Long)
Dim tmp As Variant, tmpSwap As Variant
Dim ii As Long, jj As Long
ii = i: jj = j: tmp = vArray((i + j) \ 2)
While (ii <= jj)
While (vArray(ii) < tmp And ii < j)
ii = ii + 1
Wend
While (tmp < vArray(jj) And jj > i)
jj = jj - 1
Wend
If (ii <= jj) Then
tmpSwap = vArray(ii)
vArray(ii) = vArray(jj): vArray(jj) = tmpSwap
ii = ii + 1: jj = jj - 1
End If
Wend
If (i < jj) Then SortArray vArray, i, jj
If (ii < j) Then SortArray vArray, ii, j
End Sub
SNAPSHOTS
BEFORE
AFTER
I did not use my own DAWG suggestion, and I am still interested in seeing if someone else has a way to do this, but I was able to come up with this:
Option Explicit
Sub test()
Dim ABC As Scripting.Dictionary
Dim v As Range
Dim n As Integer
n = 5
Set ABC = FindRepeatingWordChains(n, ActiveDocument)
' This is a dictionary of word ranges (not the same as an Excel range) that contains the listing of each word chain/phrase of length n (5 from the above example).
' Loop through this collection to make your selections/highlights/whatever you want to do.
If Not ABC Is Nothing Then
For Each v In ABC
v.Font.Color = wdColorRed
Next v
End If
End Sub
' This is where the real code begins.
Function FindRepeatingWordChains(ChainLenth As Integer, DocToCheck As Document) As Scripting.Dictionary
Dim DictWords As New Scripting.Dictionary, DictMatches As New Scripting.Dictionary
Dim sChain As String
Dim CurWord As Range
Dim MatchCount As Integer
Dim i As Integer
MatchCount = 0
For Each CurWord In DocToCheck.Words
' Make sure there are enough remaining words in our document to handle a chain of the length specified.
If Not CurWord.Next(wdWord, ChainLenth - 1) Is Nothing Then
' Check for non-printing characters in the first/last word of the chain.
' This code will read a vbCr, etc. as a word, which is probably not desired.
' However, this check does not exclude these 'words' inside the chain, but it can be modified.
If CurWord <> vbCr And CurWord <> vbNewLine And CurWord <> vbCrLf And CurWord <> vbLf And CurWord <> vbTab And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbCr And CurWord.Next(wdWord, ChainLenth - 1) <> vbNewLine And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbCrLf And CurWord.Next(wdWord, ChainLenth - 1) <> vbLf And _
CurWord.Next(wdWord, ChainLenth - 1) <> vbTab Then
sChain = CurWord
For i = 1 To ChainLenth - 1
' Add each word from the current word through the next ChainLength # of words to a temporary string.
sChain = sChain & " " & CurWord.Next(wdWord, i)
Next i
' If we already have our temporary string stored in the dictionary, then we have a match, assign the word range to the returned dictionary.
' If not, then add it to the dictionary and increment our index.
If DictWords.Exists(sChain) Then
MatchCount = MatchCount + 1
DictMatches.Add DocToCheck.Range(CurWord.Start, CurWord.Next(wdWord, ChainLenth - 1).End), MatchCount
Else
DictWords.Add sChain, sChain
End If
End If
End If
Next CurWord
' If we found any matching results, then return that list, otherwise return nothing (to be caught by the calling function).
If DictMatches.Count > 0 Then
Set FindRepeatingWordChains = DictMatches
Else
Set FindRepeatingWordChains = Nothing
End If
End Function
I have tested this on a 258 page document (TheStory.txt) from this source, and it ran in just a few minutes.
See the test() sub for usage.
You will need to reference the Microsoft Scripting Runtime to use the Scripting.Dictionary objects. If that is undesirable, small modifications can be made to use Collections instead, but I prefer the Dictionary as it has the useful .Exists() method.
I chose a rather lame theory, but it seems to work (at least if I got the question right cuz sometimes I'm a slow understander).
I load the entire text into a string, load the individual words into an array, loop through the array and concatenate the string, containing each time three consecutive words.
Because the results are already included in 3 word groups, 4 word groups or more will automatically be recognized.
Option Explicit
Sub Find_Duplicates()
On Error GoTo errHandler
Dim pSingleLine As Paragraph
Dim sLine As String
Dim sFull_Text As String
Dim vArray_Full_Text As Variant
Dim sSearch_3 As String
Dim lSize_Array As Long
Dim lCnt As Long
Dim lCnt_Occurence As Long
'Create a string from the entire text
For Each pSingleLine In ActiveDocument.Paragraphs
sLine = pSingleLine.Range.Text
sFull_Text = sFull_Text & sLine
Next pSingleLine
'Load the text into an array
vArray_Full_Text = sFull_Text
vArray_Full_Text = Split(sFull_Text, " ")
lSize_Array = UBound(vArray_Full_Text)
For lCnt = 1 To lSize_Array - 1
lCnt_Occurence = 0
sSearch_3 = Trim(fRemove_Punctuation(vArray_Full_Text(lCnt - 1) & _
" " & vArray_Full_Text(lCnt) & _
" " & vArray_Full_Text(lCnt + 1)))
With Selection.Find
.Text = sSearch_3
.Forward = True
.Replacement.Text = ""
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
Do While .Execute
lCnt_Occurence = lCnt_Occurence + 1
If lCnt_Occurence > 1 Then
Selection.Range.Font.Color = vbRed
End If
Selection.MoveRight
Loop
End With
Application.StatusBar = lCnt & "/" & lSize_Array
Next lCnt
errHandler:
Stop
End Sub
Public Function fRemove_Punctuation(sString As String) As String
Dim vArray(0 To 8) As String
Dim lCnt As Long
vArray(0) = "."
vArray(1) = ","
vArray(2) = ","
vArray(3) = "?"
vArray(4) = "!"
vArray(5) = ";"
vArray(6) = ":"
vArray(7) = "("
vArray(8) = ")"
For lCnt = 0 To UBound(vArray)
If Left(sString, 1) = vArray(lCnt) Then
sString = Right(sString, Len(sString) - 1)
ElseIf Right(sString, 1) = vArray(lCnt) Then
sString = Left(sString, Len(sString) - 1)
End If
Next lCnt
fRemove_Punctuation = sString
End Function
The code assumes a continuous text without bullet points.