First question on StackOverflow, please tell me how to improve. Also new to programming so problem may be simple, took me multiple hours to solve
I have two columns contained in a table, column X and column Y. Column X contains keys, column Y contains items. The columns are related by row. The item on row Z in Y should be mapped to the key on the same row in X. There can be more than one unique item per key, but only one unique key per item. All values in both columns occur an unknown amount of times. The task is to come up with an efficient mapping that provides easy access to all unique items corresponding to a specific key.
Please see my solution below. The function is called teamgroup and is part of a larger program, so there are inputs from other functions. What has not been included in my code is that the table is sorted by column X and that unique keys from column X are stored in a separate array, keys. IsInArray is also a separate function.
'Set keys
For i = 1 To UBound(keys)
With dict
.Add keys(i, 1), tmp
End With
Next i
'Group values by key
For Each kee In dict.keys
For i = 2 To lastrow
If kee = tbl.Range(i, keycol).Value Then
If IsInArray(tbl.Range(i, itemcol).Value, dict.Item(kee)) Then
Else
ReDim Preserve tmp(1 To UBound(tmp) + 1)
tmp(UBound(tmp)) = tbl.Range(i, itemcol).Value
dict.Remove (kee)
dict.Add (kee), tmp
End If
Else
End If
Next i
ReDim tmp(0)
Next
Set teamgroup = dict
End Function
The data is as follows:
X Y
56 6070
56 6070
55 6021
55 6024
56 6054
0 6050
There is no required form of the output but it should allow easy access to the items mapped to a specific key. I believe the dictionary to be a good tool for that. In this instance, an input of "56" should output an array, or similar, containing 6070 and 6054.
The end result of my code is a dictionary with unique keys mapping to items containing arrays with unique values from column Y.
My feeling is that this is an inefficient and convoluted way to solve the problem, so I would like to receive input on how it could be done better.
Related
To start, I have a scraping function that scrapes a table from a web page, and stores the data in a 2D array.
The 2D array starts from row 0 to however many rows of data are on the page, and columns 1 to however many columns there are.
Row 0 simply contains all the column names.
My ReDim:
ReDim Addresses(0 To lngTotalRecords, 1 To columns.Count) As String
I've then stored this 2D array into a scripting dictionary called dictClients, as there are multiple clients that all have their own entries for the same table on the web page.
So in my dictionary I have something like the following to refer to a particular address table for a particular client:
dictClients(1)("Addresses")
dictClients(2)("Addresses")
I now want to be able to check if a cell in a certain row contains a specific value, however the web page allows the columns to be reorganized so that:
dictClients(1)("Addresses")(1,1) 'row 1 column 1
will not always refer to the "Street Number" column. The street number column could be the following for someone else for example:
dictClients(1)("Addresses")(1,3) ' row 1 column 3
Given that these cells:
dictClients(1)("Addresses")(0,1) '(0,2) (0,3) etc.
all refer to the column's names, what's the best way for me to find the position of a particular named column?
Example: I want to get the value of the Postal Code cell in row 1, so I need to look in
dictClients(1)("Addresses")(1, POSTALCODECOLUMN),
which isn't always in the same position on the web page.
I was thinking of using the following function:
Public Function column(strArr() As String, strColumnName As String)
Dim i As Long
For i = 1 To UBound(strArr, 2)
If strArr(0, i) = strColumnName Then column = i
Next
End Function
But it just feels so lengthy calling it like:
strPostalCode =
dictClients(1)("Addresses")(1,column(dictClients(1)("Addresses"), "Postal Code")
Is there a better and easier way to do this?
Thanks.
Im working on a project report for work and I'm trying to find a way to compare two lists of project codes i.e "123456" and see whether the 2nd list is missing any new values that would've been entered into the first list. The lists are thousands of records long and so far people have been doing it manually (it hurts me knowing this) so I'm trying to make it automatic.
What I have tried is using an Array with a Index(Match(CountIF))) formula but I just cant seem to get it working.
My problem is that when I get the array to fill with what i want I then can't get it to not duplicate values (I need it to check the masterlist so it doesnt output something more than once into the output list).
I've also tried to give it a go with other formulas - but the lists can be thousands of records long so I cant do a cell for cell match as the list would be huge (that or my excel knowledge isnt good enough to know the easy solution).
Any help would be hugely appreciated.
Array might not be the best solution
I've checked quite a few other solutions but they don't quite deal with my issue and I don't have the skill to adapt them.
Here is one approach using VBA and arrays which is quicker than doing via the sheet. It checks each item in H to see it is present in J (and not the other way round). I assume that's what you want.
Sub x()
Dim v1, v2, v3(), i As Long, j As Long
v1 = Range("H2", Range("H" & Rows.Count).End(xlUp)).Value
v2 = Range("J2", Range("J" & Rows.Count).End(xlUp)).Value
ReDim v3(1 To UBound(v1, 1))
For i = LBound(v1) To UBound(v1)
If IsError(Application.Match(v1(i, 1), v2, 0)) Then
j = j + 1
v3(j) = v1(i, 1)
End If
Next i
Range("K2").Resize(j) = Application.Transpose(v3)
End Sub
Using an input box
Sub x()
Dim v1, v2, v3(), i As Long, j As Long
v1 = Application.InputBox("First list", Type:=8)
v2 = Application.InputBox("Second list", Type:=8)
ReDim v3(1 To UBound(v1, 1))
For i = LBound(v1) To UBound(v1)
If IsError(Application.Match(v1(i, 1), v2, 0)) Then
j = j + 1
v3(j) = v1(i, 1)
End If
Next i
Range("K2").Resize(j) = Application.Transpose(v3)
End Sub
A formula solution.
Note that I turned the first two ranges into Tables and changed the names. The formula is using structured references. This enables the formula to auto update if you add rows in the future.
=IFERROR(INDEX(ProjList1[#Data],AGGREGATE(15,6,1/ISNA(MATCH(ProjList1[#Data],ProjList2[#Data],0))*ROW(ProjList1[#Data]),ROWS($1:1))-ROW(ProjList1[#Headers])),"")
How does it work? Briefly:
MATCH generates an array of #NA! errors or a number.
ISNA turns that into an array of TRUE/FALSE where TRUE indicates an entry in table 1 that is NOT in table 2
Multiplying that array by the array of project list rows returns an array of error message vs row number
AGGREGATE small function ignores the error returns to give an ascending list of row numbers
INDEX then returns the appropriate entry from Table 1
ROW(ProjList1[#Headers]) is a correction so that the table may be located anyplace on the worksheet, and still return the correct row.
Not sure if you're trying to set this up so it will autoupdate in future, but as a stopgap:
Countif column next to list 1 that checks whether they appear in list 2...
... Feeding into a pivot that only shows those where the countif value is 0, in the "row" field to remove duplication?
If you have variables and values in vba
is it possible to sort them from max to min, but also to match them with variables.
So for example if you have in VBA:
Sub sort()
...
x = 5
y = 3
z = 8
...
End sub
That in excel you get (in A column variables, in B column values):
A B
z 8
x 5
y 3
It's pretty easy to sort 8, 5, 3.
But how to add z, x, y in their proper positions?
AFAIK there isn't any way to do exactly what you are asking. Variable names are part of your code; using and accessing them as data would need reflection which is not something Excel VBA offers*.
But I don't think what you are asking is the best way to achieve what you want by a long way. What you seem to need is a way of storing (name, value) pairs and accessing both the name and the value. One straightforward way to do this is to use the Dictionary object.
There are alternatives if you need different functionality, e.g. using a pair of Arrays (one to hold the names, one to hold the values) - to make this neater you could write a class to keep the two together and implement whatever functions you need as methods.
Once you have the (name,value) pairs outputting them in a sorted list is straightforward. The simplest way is to write them to the spreadsheet and then use Excel's built-in Range.Sort (see MSDN documentation).
Putting it all together (note this needs a reference to the Microsoft Scripting Runtime library for the Dictionary object):
Dim dict As New Dictionary
Dim ws As Worksheet, rng As Range
Dim ky As Variant, itm As Variant
Set ws = ThisWorkbook.Worksheets(1)
'Add items to dictionary
dict.Add "x", 5
dict.Add "y", 3
dict.Add "z", 8
'You can use these in code like this:
For Each ky In dict.Keys
Debug.Print "The value of " & ky & " is " & dict(ky)
Next ky
'you can change values
dict.item("z") = 10
dict.item("z") = 8
'Output the values and keys (the key/value arrays are horizontal while
'the range is vertical, hence transpose)
Set rng = ws.Range(ws.Cells(1, 1), ws.Cells(dict.Count, 2))
rng.Columns(1).Value = Application.Transpose(dict.Keys)
rng.Columns(2).Value = Application.Transpose(dict.Items)
'Sort them
rng.Sort Key1:=ws.Range("B1"), Header:=xlNo
*not easily anyway - there are some objects accessible through VBProject which allow limited access to the code. But using that to do what you are asking for would be nigh on impossible.
In VBA there is no such thing as order between variables. You can determine if x is bigger than y, but that doesn't help you since you want them in a specific order.
My answer is not exactly what you are asking for, but I suspect it'll help you anyway.
Since you've got your output to the sheet, albeit unsorted, it is easier to sort it there. Simply sort columns A and B together with column B as primary.
Doing this in code is a bit more complex. You would need to put both name (x, y, z) and value(8, 5, 3) in an array and then sort your array. How to sort arrays have been describe several times before (like here: Excel VBA Quickest way to sort an array of numbers in descending order?)
I have the following data structure:
As you see in column J, I am trying to merge data into one column from columns A & C & E & G.
I am using this formula:
=IF(ROW()<=COUNTA($A:$A);INDEX($A:$C;ROW();COLUMN(A1));INDEX($A:$C;ROW()-COUNTA($A:$A)+1;COLUMN(C1)))
and I get the values in column K as you see. Currently this formula is merging only two columns. How to modify it to merge all four columns?
And how to only get those values starting from row 5?
The column height will vary constantly: sometimes there are 10 values in column A and sometimes there are 2 values.
Either any excel formula or any VBA code will be acceptable.
There is a fairly standard method for retrieving unique values from a column but not multiple columns. To achieve the retrieval from multiple columns you need to stack multiple formulas together with the processing being passed to successive columns one the earlier formula errors out.
The array formula¹ in J5 is,
=IFERROR(INDEX($A$5:$A$99, MATCH(0, IF(LEN($A$5:$A$99), COUNTIF(J$4:J4, $A$5:$A$99), 1), 0)),
IFERROR(INDEX($C$5:$C$99, MATCH(0, IF(LEN($C$5:$C$99), COUNTIF(J$4:J4, $C$5:$C$99), 1), 0)),
IFERROR(INDEX($E$5:$E$99, MATCH(0, IF(LEN($E$5:$E$99), COUNTIF(J$4:J4, $E$5:$E$99), 1), 0)),
IFERROR(INDEX($G$5:$G$99, MATCH(0, IF(LEN($G$5:$G$99), COUNTIF(J$4:J4, $G$5:$G$99), 1), 0)),
""))))
I have only included columns A, C, E and G as your sample data shows only duplicates in columns B, D, F, and H.
¹ Array formulas need to be finalized with Ctrl+Shift+Enter↵. If entered correctly, Excel with wrap the formula in braces (e.g. { and }). You do not type the braces in yourself. Once entered into the first cell correctly, they can be filled or copied down or right just like any other formula. Try and reduce your full-column references to ranges more closely representing the extents of your actual data. Array formulas chew up calculation cycles logarithmically so it is good practise to narrow the referenced ranges to a minimum. See Guidelines and examples of array formulas for more information.
This answer is another way of thinking about the formulas you could use for this sort of task. It gets to the point made by #Jeeped that it is difficult to find unique values in multiple columns. My first step then is to create a single column.
If you can live with a helper column, these formulas might be a tad easier to maintain than the nested IFERROR already proposed. They are equally difficult to understand though at first glance. The other upside is that it scales nicely if the number of columns involved increases.
It is possible using CHOOSE and some INDEX math to build a single column array of a group of separated columns. The trick is that CHOOSE will join discontinuous ranges side-by-side when given an array as the selecting parameter. If this starts with columns of the same size, you can then use division and mod math to turn it into a single column.
Picture of ranges shows the four groups of data with duplicates colored red.
Formula in F2:F31 is an array formula. This is combining all of the columns into an array and then back into a single column. I selected the columns out of order just to emphasize that it is handling a discontinuous range.
=INDEX(CHOOSE({1,2,3,4}, A2:A7,C2:C7,B2:B7,D2:D7), MOD(ROW(1:30)-1, ROWS(A2:A7))+1,INT((ROW(1:30)-1)/ROWS(A2:A7))+1)
The array formula in H2 and copied down is then the standard formula for unique values. The one exception is that instead of avoiding blanks like normal, I am avoiding 0 values.
=IFERROR(INDEX(F2:F31,MATCH(0,IF(F2:F31=0,1,COUNTIF($H$1:H1,F2:F31)),0)),"")
A couple of other comments about this approach:
In the CHOOSE, I am using {1,2,3,4}. This could be replaced with TRANSPOSE(ROWS(1:4)) or whatever number of columns you have.
There is also a ROWS(A2:A7) in 2 places, this could just be 2:7 or 1:6 or whatever size was used for the column size. I used one of the data ranges so that the coloring was simplified and to emphasize it needs to match the size of the block.
And the ROW(1:30) is used for the number of total items to collect. It really only needs to be 1:24 since there are 6*4 items, but I made it big while testing.
There are definitely a couple of downsides to this approach, but it may be a good trick to keep in the toolbox. Never know when you might want to make a column out of discontinuous ranges. The largest downside is that the columns of data all need to be the same size (and of course the helper column).
This code will do what you ask:
Sub MoveData()
START_ROW = 5
START_COL = 1
STEP_COL = 2
OUTPUT_ROW = 5
OUTPUT_COL = 10
Row = START_ROW
Col = START_COL
Out_Row = OUTPUT_ROW
While Col < OUTPUT_COL
While Cells(Row, Col).Value <> ""
Cells(Out_Row, OUTPUT_COL).Value = Cells(Row, Col).Value
Out_Row = Out_Row + 1
Row = Row + 1
Wend
Row = START_ROW
Col = Col + STEP_COL
Wend
End Sub
Think you guys are making this complicated. Just pull the range of data into power query , select all the columns and unpivot them this will bring all the data into a single column
A lot of the solutions here on SO involve using CountIf to find duplicates. When I have a list of 100,000+ values however, it will often take minutes for CountIf to search for duplicates.
Is there a quicker way to search for duplicates within an Excel column WITHOUT using CountIf?
Thanks!
EDIT #1:
After reading the comments and replies I realize I need to go into greater detail. Let's pretend I'm a birdwatcher, and after I return from a birdwatching trip I input anywhere from 1 to 25 or 50 new birds that I saw on my trip into my "Master List of Birds Seen". This is really a dynamically growing list, and with each addition I want to make sure I'm not duplicating something that already exists in my list.
So, in column A of my file are the names of the birds. Column B-M might contain other attributes of the birds. I want to know if a bird that I just added in column A after my latest birdwatching trip ALREADY exists somewhere ELSE in my list. And, if it does, I would manually merge the data of the 2 entries and throw away some and keep some after careful review. I clearly don't want to have duplicate entries of the same bird in my database.
So, ultimately I want some indication that there is or isn't a duplicate somewhere else, and if there is duplicate please tell me what row to look in (or highlight or color both of the duplicates).
The fastest way that I know of (in case you are using Excel 2007/2010/2011) is to use Data (In Ribbon) | Remove Duplicates to find the total number of duplicates OR to remove duplicates. You might want to move data to a temp sheet before you test this.
The 2nd fastest way is to use Countif. Now Countif can be used in many ways to find duplicates. Here are two main ways.
1) Inserting a New Column next to the data and putting the formula and simply copying it down.
2) Using Countif in Conditional formatting to highlight cells which are duplicates. For more details, please see this link.
suggestions for a macro to find duplicates in a SINGLE column
EDIT:
My Apologies :)
Countif is the 3rd fastest way!
The 2nd fastest way is to use Pivot Tables ;)
What exactly is your main purpose of finding duplicates? Do you want to delete them? Or Do you want to highlight them? Or something else?
FOLLOWUP
Seems like I made a typo in the formula. Yes for large number of rows, CountIf does take minutes as you suggested.
Let me see if I can come up with a VBA code to suit your exact needs.
Sid
You can use VBA - the following function returns a list of unique entries within a list of 100,000 in less than a second. Usage: select a range, type the formula (=getUniqueListFromRange(YourRange)) and validate with CTRL+SHIFT+ENTER.
Public Function getUniqueListFromRange(parRange As Range) As Variant
' Returns a (1 to n,1 to 1) array with all the values without duplicates
Dim i As Long
Dim j As Long
Dim locKey As Variant
Dim locData As Variant
Dim locUniqueDict As Variant
Dim locUniqueList As Variant
On Error GoTo error_handler
locData = Intersect(parRange.Parent.UsedRange, parRange)
Set locUniqueDict = CreateObject("Scripting.Dictionary")
On Error Resume Next
For i = 1 To UBound(locData, 1)
For j = 1 To UBound(locData, 2)
locKey = UCase(locData(i, j))
If locKey <> "" Then locUniqueDict.Add locKey, locData(i, j)
Next j
Next i
If locUniqueDict.Count > 0 Then
ReDim locUniqueList(1 To locUniqueDict.Count, 1 To 1) As Variant
i = 1
For Each locKey In locUniqueDict
locUniqueList(i, 1) = locUniqueDict(locKey)
i = i + 1
Next
getUniqueListFromRange = locUniqueList
End If
error_handler: 'Empty range
End Function
If using Excel 2007 or later (which is likely from the 100,000+ values) you can choose:
Home Tab | Conditional Formatting > Highlight Cell Rules > Duplicate Values...
Right-click a highlighted cell and filter by selected cell color to show just the duplicates (be aware however this can be slow with conditional formatting).
Alternatively run this code and filter for colored cells which takes only a second on 100,000 cells:
Sub HighlightDupes()
Dim i As Long, dic As Variant, v As Variant
Application.ScreenUpdating = False
Set dic = CreateObject("Scripting.Dictionary")
i = 1
For Each v In Selection.Value2
If dic.exists(v) Then dic(v) = "" Else dic.Add v, i
i = i + 1
Next v
Selection.Font.Color = 255
For Each v In dic
If dic(v) <> "" Then Selection(dic(v)).Font.Color = 0
Next v
End Sub
Addendum:
To select only duplicate values without code or formulas, i have found this method useful:
Data Tab | Advanced Filter... Filter in Place, Unique Records Only, OK.
Now select the range of unique values and press Alt+; (Goto Special... Visible cells only). With this selection clear the filter and you will see that all unselected cells are duplicates, you can then press Ctrl+9 (Hide Rows) to show just the duplicates. These rows can be copied to another sheet if needed or marked with an "X".
You do not mention what you want to do when you find them. If you merely want to see where they are...
Sub HighLightCells()
ActiveSheet.UsedRange.Cells.FormatConditions.Delete
ActiveSheet.UsedRange.Cells.FormatConditions.Add Type:=xlCellValue, Operator:=xlEqual, Formula1:=ActiveCell
ActiveSheet.UsedRange.Cells.FormatConditions(1).Interior.ColorIndex = 4
End Sub
Preventing Duplicates with Data Validation
You can use Data Validation to prevent you entering duplicate bird names. See Debra Dalgelish's site here
Handling existing duplicates
My free Duplicate Master addin will let you
Select
Colour
List
Delete
duplicates.
But more importantly it will let you run more complex matching than exact strings, ie
Case Insensitive / Case Sensitive searches (sample below)
Trim/Clean data
Remove all blank spaces (including CHAR(160)) see the " mapgie" and "magpie" example below
Run regular expression matches (for example the sample below replaces s$ with "" to remove plurals)
Match on any combination of columns (ie Column A, all columns, Column A&B etc)
I'm surprised that no one has mentioned the RemoveDuplicates method.
ActiveSheet.Range("A:A").RemoveDuplicates Columns:=1
This will simply remove any duplicate entries on the active worksheet in column A. It takes milliseconds to run (tested with 200k rows). Mind you, this will strictly delete all the duplicate entries. Although that isn't how the original question was worded, I do believe that this still serves your purpose.
One simple way of finding unique values is to use the advance filter and filter for unique values only and copy and paste them into other sheet as when the pivot is removed you will get the whole data with the duplicate in them.
Sort the range
and in next column put `=if(a2=a1;1;if(a2=a3;1;0))
"1" will be displayed for duplicates.