VBA - Search and remove duplicates - vba

I'm looking for an algorithm for which I do not have the VBA knowledge to script myself. So I'm stuck. It isn't through lack of effort trying because I have given it a go (plus, this bit of code is the last remaining piece of my bigger VBA code) I simply lack the knowledge/experience/skill...
Basically, I have an Excel file. In this file is a sheet, "sheet1". Sheet1 contains many rows of data. The number of rows contained in sheet1 can vary from 1 to n. Sometimes, I may have 50 while other times I may have 30, etc. What is consistent is the layout of the book, i.e. I have codes in column A which identify a product in my database.
What I want to do is this:
1. Scan the sheet for empty rows (due to the way the workbook is generated, I sometimes have blank rows) and remove them. These blank rows are sometimes in-between rows with data while at other times may be trailing at the end of the sheet.
2. After removing the blank rows find the last used row. Store that to a variable. I have found this piece of code useful for doing that:
mylastrow = myBook.Sheets("Results").Cells.Find(what:="*", SearchOrder:=xlByRows, SearchDirection:=xlPrevious).Row
3. Starting from the row determined in (2), I want to take the product code in A(x where x = mylastrow) and find any other occurrences of it (in column A). If any are found, delete that entire row corresponding to it. Importantly, this loop must go in reverse. For example let's say mylastrow = 40, the loop will need to begin at A40 and on the next iteration do A39 (or 38 if a row has been removed?). This is because with any of the product numbers the corresponding data in the row contains more data further down the column (because of the way the sheet was generated). Essentially the entry closest to the last row is the most recent.
Hopefully I've been able to explain the situ properly. But if not and you're willing to take the challenge (my burden?) off me I would be very grateful.
QF

The only way to develop that knowledge and skill is to get in there and code! I'm sure someone may come in and write you the entire procedure, but in the meantime these resources should give you the tools to do it yourself.
First, check out the method here to delete blank rows. It relies on "Selection" for the range, so you can either manually select all the cells of the sheet, then run the macro, or replace it with the following:
Dim r as range
set r = Sheet1.Cells 'now use r instead of Selection
OR (even better) use your code for finding the last used row and set the range from row 1 to "mylastrow".
Next, beginning from "mylastrow", start adding the values in Column A to a Dictionary object (example here). You can use a row counter to decrement from "mylastrow" to 1. Here's an example of how it would work. The key is assumed to be in the 1st column ("A").
Dim dict As Object
Dim rowCount As Long
Dim strVal As String
Set dict = CreateObject("Scripting.Dictionary")
rowCount = Sheet1.Range("A1").CurrentRegion.Rows.Count
Do While rowCount > 1
strVal = Sheet1.Cells(rowCount, 1).Value2
If dict.exists(strVal) Then
Sheet1.Rows(rowCount).EntireRow.Delete
Else
dict.Add strVal, 0
End If
rowCount = rowCount - 1
Loop
Set dict = Nothing
Before:
After:
Note that the 1st row hasn't been touched since we stopped when rowCount is 1 (assumes there's a header).

Related

Why am I getting an object-defined error while adjust which rows I want to clear?

I have a specific template that starts it's printed array data in cell B9, but I just want rows 9-rows.count to be cleared of it's contents (not deleted because I have drop-down menus in there).
Rows(8).Offset(1, 0).Resize(Rows.Count - 1).EntireRow.ClearContents
doesn't seem to work, but it did work when I had Rows(1).Offset(1,0) etc... in there. why is it now not wanting to work when I adjusted the row number to start on?
Rows.Count gives the maximum numbers of rows in a worksheet (usually 1'048'576).
Splitting your command Rows(8).Offset(1, 0).Resize(Rows.Count - 1):
Rows(8) gives you the row 8 ($8:$8)
.Offset(1, 0) gives you the next row ($9:$9)
.Resize(Rows.Count - 1) asks the range that starts at row 9 to be resized so that it is 1'048'576 rows "high". However, that would mean row 9 until row 1'048'585, and that would be behind the maximum number of rows.
What you probably meant is to delete until the end of your data. Is is not an easy task to figure out the end of the data, usually it is done with a construct like
rowCount = Cells(Rows.Count, 2).End(xlUp).Row. This will (virtually) jump to cell B1048576, simulate the key "Ctrl+Up" and (again virtually) jumps to the last row holding data in that column. Change the 2 into any other column number (A would be 1...) if you want to check that column.
Try the following code:
Dim rowCount As Long
With ActiveSheet
rowCount = .Cells(.Rows.Count, 2).End(xlUp).Row
.Rows(8).Offset(1, 0).Resize(rowCount - 1).EntireRow.ClearContents
End With

Manipulating Excel spreadsheet, removing rows based on values in a column and then removing more rows based on values in another column

I have a rather complicated problem.
I have a log file that when put into excel the column "I" contains event IDs, and the column J contains a custom key that keeps a particular even grouped.
All i want to do is remove any rows that do not contain the value of say 102 in the event id column.
And THEN i need to check the custom key (column J) and remove rows that are duplicates since any duplicates will falsely show other statistics i want.
I have gotten as far as being able to retrieve the values from the columns using com objects and .entirecolumn cell value etc, but I am completely stumped as to how i can piece together a solid way to remove rows. I could not figure out how to get the row for each value.
To give a bit more clarity this is my thought process on what i need to do:
If cell value in Column I does not = 102 Then delete the row that cell contains.
Repeat for all rows in spreadsheet.
And THEN-
Read every cell in column J and remove all rows containing duplicates based on the values in column J.
Save spreadsheet.
Can any kind persons help me?
Additional Info:
Column I holds a string that is an event id number e.g = 1029
Column J holds a string that is a mix of numbers and letters = 1ASER0X3NEX0S
Ellz, I do agree with Macro Man in that your tags are misleading and, more importantly, I did indeed need to know the details of Column J.
However, I got so sick of rude posts today and yours was polite and respectful so I've pasted some code below that will do the trick ... provided Column J can be a string (the details of which you haven't given us ... see what Macro Man's getting at?).
There are many ways to test for duplicates. One is to try and add a unique key to a collection and see if it throws an error. Many wouldn't like that philosophy but it seemed to be okay for you because it also gives you a collection of all the unique (ie remaining) keys in Column J.
Sub Delete102sAndDuplicates()
Dim ws As Worksheet
Dim uniques As Collection
Dim rng As Range
Dim rowPair As Range
Dim iCell As Range
Dim jCell As Range
Dim delRows As Range
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set rng = Intersect(ws.UsedRange, ws.Range("I:J"))
Set uniques = New Collection
For Each rowPair In rng.Rows
Set iCell = rowPair.Cells(, 1)
Set jCell = rowPair.Cells(, 2)
On Error Resume Next
uniques.Add jCell.Value2, jCell.Text
If Err = 457 Or iCell.Value2 = 102 Then
On Error GoTo 0
If delRows Is Nothing Then
Set delRows = rowPair.EntireRow
Else
Set delRows = Union(delRows, rowPair.EntireRow)
End If
End If
Next
If Not delRows is Nothing then
MsgBox delRows.Address(False, False) & " deleted."
delRows.Delete
End If
End Sub
There are a number of ways in which this can be done, and which is best will depend on how frequently you perform this task and whether you want to have it fully automated. Since you've tagged your question with VBA I assume you'll be happy with a VBA-based answer:
Sub removeValues()
Range("I1").Select 'Start at the top of the I column
'We are going to go down the column until we hit an empty row
Do Until IsEmpty(ActiveCell.Value) = True
If ActiveCell.Value <> 102 Then
ActiveCell.EntireRow.Delete 'Then delete the row
Else
ActiveCell.Offset(1).Select 'Select the cell below
End If
Loop
'Now we have removed all non-102 values from the column, let`s remove the duplicates from the J column
Range("A:J").RemoveDuplicates Columns:=10, Header:=xlNo
End Sub
The key line there is Range("A:J").RemoveDuplicates. It will remove rows from the range you specify according to duplicates it finds in the column you specify. In that case, it will remove items from the A-J columns based on duplicates in column 10 (which is J). If your data extends beyond the J column, then you'll need to replace "A:J" with the appropriate range. Note that the Columns value is relative to the index of the first column, so while the J column is 10 when that range starts at A (1), it would be 2 for example if the range were only I:J. Does that make sense?
(Note: Using ActiveCell is not really best practice, but it's the method that most obviously translates to what you were trying to do and as it seems you're new to VBA I thought it would be the easiest to understand).

Copy row from one sheet and insert copied row under last row in another sheet

I am in need of your expert assistance.
I am trying to write some code that will copy rows and insert the copied row below the last row in another sheet.
I have a Global sheet that has the data i will be copying. It will need to look in column Q.
I think the problem will be when trying to copy the data, the data in column G is the text name of a Contract Code. But the sheets are name with the Number version.
for example i have a row that has BRREPAIRS in column Q, I need this to copy to Sheet 2870, then i have a row that has BRVOIDS in column Q, I need this to copy to Sheet 2781.
I could have multiple different Contract names so i think i might need to define the text to equal a sheet. So maybe Set BRVOIDS = Sheet.name("2781") Set BRREPAIRS = Sheet.name("2780") and so on until all sheets are defined.
When the data gets copied i need it to find the last row in column a that has data, when it is found it will insert the copied row into the sheet. for example EntireRow.Insert Shift:=xlDown.
I dont have any code at the moment. I would really appreciate all the assistance.
You don't need to do things like Set BRVOIDS = Sheet.name("2781"). In fact, that would be positively harmful since then you would need to run the data in Column Q through a possibly large Select statement to know what variable to use. Instead, you could write a function like
Function TargetSheet(ContractName As String) As Worksheet
'code which uses your secret list to determine target sheet
'Maybe a Select statement, Maybe a Vlookup -- who knows?
Set TargetSheet = 'sheet your code determined
End Function
Sounds like your code will be scanning down column Q, determining where to copy the corresponding row to. Once you get the above function working, you could combine it with something like this:
Function LastRow(TargetCol As Variant, Optional ws As Variant) As Range
'assumes TargetCol is something like 1 or "A"
Dim n As Long
If IsMissing(ws) Then Set ws = ActiveSheet
n = ws.Cells(1, TargetCol).EntireColumn.Rows.Count
Set LastRow = ws.Cells(n, TargetCol).End(xlUp).EntireRow
End Function
This returns as a range the last row containing data (or row 1 if the column is empty) in a specified column in a specified worksheet (which defaults to the Active sheet).
You haven't given enough to go on, but something along the lines of
LastRow("A",TargetSheet(Range("Q" & i).Value)).Insert Shift := xlDown
Might be what you are looking for. Why don't you try to work it out and ask another question (if need be) once you have some actual code?

VBA Macro: Trying to code "if two cells are the same, then nothing, else shift rows down"

My Goal: To get all data about the same subject from multiple reports (already in the same spreadsheet) in the same row.
Rambling Backstory: Every month I get a new datadump Excel spreadsheet with several reports of variable lengths side-by-side (across columns). Most of these reports have overlapping subjects, but not entirely. Fortunately, when they are talking about the same subject, it is noted by a number. This number tag is always the first column at the beginning of each report. However, because of the variable lengths of reports, the same subjects are not in the same rows. The columns with the numbers never shift (report1's numbers are always column A, report2's are always column G, etc) and numbers are always in ascending order.
My Goal Solution: Since the columns with the ascending numbers do not change, I've been trying to write VBA code for a Macro that compares (for example) the number of the active datarow with from column A with Column G. If the number is the same, do nothing, else move all the data in that row (and under it) from columns G:J down a line. Then move on to the next datarow.
I've tried: I've written several "For Each"s and a few loops with DataRow + 1 to and calling what I thought would make the comparisons, but they've all failed miserably. I can't tell if I'm just getting the syntax wrong or its a faulty concept. Also, none of my searches have turned up this problem or even parts of it I can maraud and cobble together. Although that may be more of a reflection of my googling skill :)
Any and all help would be appreciated!
Note: In case it's important, the columns have headers. I've just been using DataRow = Found.Row + 1 to circumvent. Additionally, I'm very new at this and self-taught, so please feel free to explain in great detail
I think I understand your objective and this should work. It doesn't use any of the methodology you were using as reading your explanation I had a good idea how to proceed. If it isn't what you are looking for my apologies.
It starts at a predefined column (see FIRST_ROW constant) and goes row by row comparing the two cells (MAIN_COLUMN & CHILD_COLUMN). If MAIN_COLUMN < CHILD_COLUMN it pushes everything between SHIFT_START & SHIFT_END down one row. It continues until it hits an empty row.
Sub AlignData()
Const FIRST_ROW As Long = 2 ' So you can skip a header row, or multiple rows
Const MAIN_COLUMN As Long = 1 ' this is your primary ID field
Const CHILD_COLUMN As Long = 7 ' this is your alternate ID field (the one we want to push down)
Const SHIFT_START As String = "G" ' the first column to push
Const SHIFT_END As String = "O" ' the last column to push
Dim row As Long
row = FIRST_ROW
Dim xs As Worksheet
Set xs = ActiveSheet
Dim im_done As Boolean
im_done = False
Do Until im_done
If WorksheetFunction.CountA(xs.Rows(row)) = 0 Then
im_done = True
Else
If xs.Cells(row, MAIN_COLUMN).Value < xs.Cells(row, CHILD_COLUMN).Value Then
xs.Range(Cells(row, SHIFT_START), Cells(row, SHIFT_END)).Insert Shift:=xlDown
Debug.Print "Pushed row: " & row & " down!"
End If
row = row + 1
End If
Loop
End Sub
I modified the code to work as a macro. You should be able to create it right from the macro dialog and run it from there also. Just paste the code right in and make sure the Sub and End Sub lines don't get duplicated. It no longer accepts a worksheet name but instead runs against the currently active worksheet.

Is there a way to check for duplicate values in Excel WITHOUT using the CountIf function?

A lot of the solutions here on SO involve using CountIf to find duplicates. When I have a list of 100,000+ values however, it will often take minutes for CountIf to search for duplicates.
Is there a quicker way to search for duplicates within an Excel column WITHOUT using CountIf?
Thanks!
EDIT #1:
After reading the comments and replies I realize I need to go into greater detail. Let's pretend I'm a birdwatcher, and after I return from a birdwatching trip I input anywhere from 1 to 25 or 50 new birds that I saw on my trip into my "Master List of Birds Seen". This is really a dynamically growing list, and with each addition I want to make sure I'm not duplicating something that already exists in my list.
So, in column A of my file are the names of the birds. Column B-M might contain other attributes of the birds. I want to know if a bird that I just added in column A after my latest birdwatching trip ALREADY exists somewhere ELSE in my list. And, if it does, I would manually merge the data of the 2 entries and throw away some and keep some after careful review. I clearly don't want to have duplicate entries of the same bird in my database.
So, ultimately I want some indication that there is or isn't a duplicate somewhere else, and if there is duplicate please tell me what row to look in (or highlight or color both of the duplicates).
The fastest way that I know of (in case you are using Excel 2007/2010/2011) is to use Data (In Ribbon) | Remove Duplicates to find the total number of duplicates OR to remove duplicates. You might want to move data to a temp sheet before you test this.
The 2nd fastest way is to use Countif. Now Countif can be used in many ways to find duplicates. Here are two main ways.
1) Inserting a New Column next to the data and putting the formula and simply copying it down.
2) Using Countif in Conditional formatting to highlight cells which are duplicates. For more details, please see this link.
suggestions for a macro to find duplicates in a SINGLE column
EDIT:
My Apologies :)
Countif is the 3rd fastest way!
The 2nd fastest way is to use Pivot Tables ;)
What exactly is your main purpose of finding duplicates? Do you want to delete them? Or Do you want to highlight them? Or something else?
FOLLOWUP
Seems like I made a typo in the formula. Yes for large number of rows, CountIf does take minutes as you suggested.
Let me see if I can come up with a VBA code to suit your exact needs.
Sid
You can use VBA - the following function returns a list of unique entries within a list of 100,000 in less than a second. Usage: select a range, type the formula (=getUniqueListFromRange(YourRange)) and validate with CTRL+SHIFT+ENTER.
Public Function getUniqueListFromRange(parRange As Range) As Variant
' Returns a (1 to n,1 to 1) array with all the values without duplicates
Dim i As Long
Dim j As Long
Dim locKey As Variant
Dim locData As Variant
Dim locUniqueDict As Variant
Dim locUniqueList As Variant
On Error GoTo error_handler
locData = Intersect(parRange.Parent.UsedRange, parRange)
Set locUniqueDict = CreateObject("Scripting.Dictionary")
On Error Resume Next
For i = 1 To UBound(locData, 1)
For j = 1 To UBound(locData, 2)
locKey = UCase(locData(i, j))
If locKey <> "" Then locUniqueDict.Add locKey, locData(i, j)
Next j
Next i
If locUniqueDict.Count > 0 Then
ReDim locUniqueList(1 To locUniqueDict.Count, 1 To 1) As Variant
i = 1
For Each locKey In locUniqueDict
locUniqueList(i, 1) = locUniqueDict(locKey)
i = i + 1
Next
getUniqueListFromRange = locUniqueList
End If
error_handler: 'Empty range
End Function
If using Excel 2007 or later (which is likely from the 100,000+ values) you can choose:
Home Tab | Conditional Formatting > Highlight Cell Rules > Duplicate Values...
Right-click a highlighted cell and filter by selected cell color to show just the duplicates (be aware however this can be slow with conditional formatting).
Alternatively run this code and filter for colored cells which takes only a second on 100,000 cells:
Sub HighlightDupes()
Dim i As Long, dic As Variant, v As Variant
Application.ScreenUpdating = False
Set dic = CreateObject("Scripting.Dictionary")
i = 1
For Each v In Selection.Value2
If dic.exists(v) Then dic(v) = "" Else dic.Add v, i
i = i + 1
Next v
Selection.Font.Color = 255
For Each v In dic
If dic(v) <> "" Then Selection(dic(v)).Font.Color = 0
Next v
End Sub
Addendum:
To select only duplicate values without code or formulas, i have found this method useful:
Data Tab | Advanced Filter... Filter in Place, Unique Records Only, OK.
Now select the range of unique values and press Alt+; (Goto Special... Visible cells only). With this selection clear the filter and you will see that all unselected cells are duplicates, you can then press Ctrl+9 (Hide Rows) to show just the duplicates. These rows can be copied to another sheet if needed or marked with an "X".
You do not mention what you want to do when you find them. If you merely want to see where they are...
Sub HighLightCells()
ActiveSheet.UsedRange.Cells.FormatConditions.Delete
ActiveSheet.UsedRange.Cells.FormatConditions.Add Type:=xlCellValue, Operator:=xlEqual, Formula1:=ActiveCell
ActiveSheet.UsedRange.Cells.FormatConditions(1).Interior.ColorIndex = 4
End Sub
Preventing Duplicates with Data Validation
You can use Data Validation to prevent you entering duplicate bird names. See Debra Dalgelish's site here
Handling existing duplicates
My free Duplicate Master addin will let you
Select
Colour
List
Delete
duplicates.
But more importantly it will let you run more complex matching than exact strings, ie
Case Insensitive / Case Sensitive searches (sample below)
Trim/Clean data
Remove all blank spaces (including CHAR(160)) see the " mapgie" and "magpie" example below
Run regular expression matches (for example the sample below replaces s$ with "" to remove plurals)
Match on any combination of columns (ie Column A, all columns, Column A&B etc)
I'm surprised that no one has mentioned the RemoveDuplicates method.
ActiveSheet.Range("A:A").RemoveDuplicates Columns:=1
This will simply remove any duplicate entries on the active worksheet in column A. It takes milliseconds to run (tested with 200k rows). Mind you, this will strictly delete all the duplicate entries. Although that isn't how the original question was worded, I do believe that this still serves your purpose.
One simple way of finding unique values is to use the advance filter and filter for unique values only and copy and paste them into other sheet as when the pivot is removed you will get the whole data with the duplicate in them.
Sort the range
and in next column put `=if(a2=a1;1;if(a2=a3;1;0))
"1" will be displayed for duplicates.