So I am creating a function to replace some manual index/match formulas. Note that this function works, but my problem is with speed. So I have a PivotTable with 6 columns and approx. 200.000 rows. I want this to find the value (and I don't use the pivotfunctions, meaning that this is just a table in pivot format) I found that this runs faster than having it in a regular data table. Both would be imported from a SQL table.
A single piece of this formula runs instantly, but the performance slows down when I have a few hundreds in the same sheet.
So any ideas on how to speed this up?
Function getnum2(ByVal Comp As String, Period As String, Measure As String, Optional BU As String, _
Optional Country As String, Optional Table As String, Optional TableSheet As String) As Double
Dim pTable As PivotTable, wTableSheet As Worksheet
If BU = "" Then
BU = "Group"
End If
If Country = "" Then
Country = "Total"
End If
If TableSheet = "" Then
Set wTableSheet = Worksheets("Data")
Else
Set wTableSheet = Worksheets(TableSheet)
End If
If Table = "" Then
Set pTable = wTableSheet.PivotTables("PivotTable1")
Else
Set pTable = wTableSheet.PivotTables(Table)
End If
'Find match
If Intersect(pTable.PivotFields("Bank").PivotItems(Comp).DataRange.EntireRow, _
pTable.PivotFields("Date").PivotItems(Period).DataRange.EntireRow, _
pTable.PivotFields("Business Unit").PivotItems(BU).DataRange.EntireRow, _
pTable.PivotFields("Country").PivotItems(Country).DataRange.EntireRow, _
pTable.PivotFields("Name").PivotItems(Measure).DataRange) Is Nothing Then
getnum2 = "No match"
ElseIf Intersect(pTable.PivotFields("Bank").PivotItems(Comp).DataRange.EntireRow, _
pTable.PivotFields("Date").PivotItems(Period).DataRange.EntireRow, _
pTable.PivotFields("Business Unit").PivotItems(BU).DataRange.EntireRow, _
pTable.PivotFields("Country").PivotItems(Country).DataRange.EntireRow, _
pTable.PivotFields("Name").PivotItems(Measure).DataRange).Count > 1 Then
getnum2 = "More than 1 match"
Else
getnum2 = Intersect(pTable.PivotFields("Bank").PivotItems(Comp).DataRange.EntireRow, _
pTable.PivotFields("Date").PivotItems(Period).DataRange.EntireRow, _
pTable.PivotFields("Business Unit").PivotItems(BU).DataRange.EntireRow, _
pTable.PivotFields("Country").PivotItems(Country).DataRange.EntireRow, _
pTable.PivotFields("Name").PivotItems(Measure).DataRange)
End If
End Function
Rather than calling the function three times, you could use a variable:
Function getnum2(ByVal Comp As String, Period As String, Measure As String, Optional BU As String, _
Optional Country As String, Optional Table As String, Optional TableSheet As String) As Double
Dim pTable As PivotTable, wTableSheet As Worksheet
Dim rgResult as Range
If BU = "" Then
BU = "Group"
End If
If Country = "" Then
Country = "Total"
End If
If TableSheet = "" Then
Set wTableSheet = Worksheets("Data")
Else
Set wTableSheet = Worksheets(TableSheet)
End If
If Table = "" Then
Set pTable = wTableSheet.PivotTables("PivotTable1")
Else
Set pTable = wTableSheet.PivotTables(Table)
End If
'Find match
Set rgResult = Intersect(pTable.PivotFields("Bank").PivotItems(Comp).DataRange.EntireRow, _
pTable.PivotFields("Date").PivotItems(Period).DataRange.EntireRow, _
pTable.PivotFields("Business Unit").PivotItems(BU).DataRange.EntireRow, _
pTable.PivotFields("Country").PivotItems(Country).DataRange.EntireRow, _
pTable.PivotFields("Name").PivotItems(Measure).DataRange)
if rgResult Is Nothing Then
getnum2 = "No match"
ElseIf rgResult.Count > 1 Then
getnum2 = "More than 1 match"
Else
getnum2 = rgResult.Value
End If
End Function
One very simple way to achieve this is by using two PivotTables.
In PivotTable 1, put all fields but the numeric one you want to
return in the ROWS area, and put the field that you want to return in
the VALUES area with aggregation set to COUNT.
In PivotTable 2, put all fields but the numeric one you want to
return in the ROWS area, and put the field that you want to return in
the VALUES area with aggregation set to SUM or MIN or MAX (It doesn't
matter which).
Then you can use a paramatized GETPIVOTDATA function to check PivotTable 1 to see if the thing you're looking up is unique (i.e. COUNT = 1) and if so, then look up the SUM/MIN/MAX of that item in PivotTable2. Given the item is unique, then the SUM/MIN/MAX is only operating on one number, and so does nothing to it.
Here's how that looks, using simplified data:
I've added conditional formatting to the two Pivots to highlight multiple occurances where we want to return the text 'Multiple Items', and as you can see, the formula that is populating column 6 of the Lookup table is only returning unique items as per your requirements.
Here's the formula, using Table notation as my Lookup range has been turned into an Excel Table:
=IF(GETPIVOTDATA("6",$A$3,"1",[#1],"2",[#2],"3",[#3],"4",[#4],"5",[#5])=1,GETPIVOTDATA("6",$H$3,"1",[#1],"2",[#2],"3",[#3],"4",[#4],"5",[#5]),"Multiple Items")
If I randomise the input cells in the Lookup table, you can see what happens when some items aren't in the PivotTable:
This approach works because the field you want to return is a numeric one, meaning you can add it to the VALUES pane of the Pivot. But you could still use this to return strings, by adding a unique ID to the source data, such as the row number, and putting that in the VALUES field, then retrieving it with the double GETPIVOTDATA lookup and using it to retrieve the associated string in the source data.
Another approach is to simply concatenate your columns into a primary key using a suitable delimiter such as the pipe character | and then use that as your lookup key. If you did a binary search on sorted data, this would be lightning fast. (I discuss this at http://dailydoseofexcel.com/archives/2015/04/23/how-much-faster-is-the-double-vlookup-trick/ ). The down side is that you wouldn't be warned in the event that there were multiple items. But it would be possible to do a second lookup using the match position returned by the first, to see if you get another result, and if so then return "Multiple Items". This would be super-fast.
Here's the fastest way to do this: using Binary Match on a sorted lookup table.
On the left I have 5 columns x 1048575 rows of random numbers between 1 and 10. These have been concatenated in column G to make a non-unique key, and then sorted ascending on that key.
(Because the concatenated key is text, it gets sorted alphabetical from left to right, which is why 1|1|1|1|10 appears between 1|1|1|1|1 and 1|1|1|1|2)
I gave the data in Column G the named range of Concat to simplify the formula. My lookup formula in J2 returns the row number of the lookup item if and only if that item is unique to the dataset. The formula is:
=IF(OR( AND(INDEX(Concat,MATCH(I2,Concat,1))=I2, MATCH(I2,Concat,1)=1), AND(INDEX(Concat,MATCH(I2,Concat,1))=I2,INDEX(Concat,MATCH(I2,Concat,1)-1)<>I2)),MATCH(I2,Concat,1),NA())
This executes in 0.01 milliseconds for one instance, for a lookup table of 1048576 rows. My double GETPIVOTDATA approach above took 6 milliseconds. So there you have it: a complex formula that gives a 600 times efficiency boost.
I can explain the formula later in need, but note that some of the complexity is due to the edge case where you may have a unique item appearing in row 1. If I leave out that edge case, then the formula is as follows:
=IF( AND(INDEX(Concat,MATCH(I3,Concat,1))=I3,INDEX(Concat,MATCH(I3,Concat,1)-1)<>I3),MATCH(I3,Concat,1),NA())
Related
In a column with hundreds or even 1-2 thousand strings of approximately 40 characters, with one string per cell and many repeating entries, what is the best way to populate the column to conduct the fastest possible search later? The search should return a row number so that the corresponding row can be deleted.
Is there some way to append metadata or label to a cell/row for faster search? Is there some other mechanism that can identify cells that will make searching easier?
I'm new to VBA, and I want to set out on the best path before I get too far into the project and have to search through thousands of strings.
edit: Someone requested an example cell: The cells will have email addresses in them. I can control the email addresses on the server, so they will roughly be 40 characters long each. They will contain alphanumeric characters only.
Example of a fast way to implement a dictionary lookup
Data is on Sheet1, and starts in column A
The strings are in column B
Option Explicit
Public Sub SearchStrings()
Dim ur As Variant, r As Long, d As Object
Const COL_ID = 2
Set d = CreateObject("Scripting.Dictionary") 'or Reference to Microsof Scripting Runtime
d.CompareMode = TextCompare 'Case insensitive, or "BinaryCompare" otherwise
ur = Sheet1.UsedRange.Columns(COL_ID) 'read strings from column COL_ID into array
For r = LBound(ur) To UBound(ur) 'populate dictionary; Key = string (unique)
If Not IsError(ur(r, 1)) Then d(CStr(ur(r, 1))) = r 'Item = row id
Next
Debug.Print d.Keys()(3) 'prints the string in row 3
Debug.Print d.Items()(3) 'prints the row number of the 3rd string
End Sub
If you want to store string duplicates use this:
If Not IsError(ur(r, 1)) Then d(COL_ID & "-" & r) = CStr(ur(r, 1))
which is Key = Column ID & "-" & row ID (2-5), and Item = String itself
I have two very large tables. One of them has 12 columns and 280K rows. The other ones has 12k rowns and 33 columns. I am using vlookup to look for matching values in large table to small one. Vlookups take forever to calculate. Is there an easy way to do this with a VBA code? Can someone share a sample code for me to replicate?
Thanks
You can use Collection object to quickly find matches. This will work very fast (if not faster than VLOOKUP) because when you add key parameter to Collection - it hashes / indexes its value with the specific goal of fast lookup later).
Moreover, for the large number of records you populate Collection once and keep reusing it, while VLOOKUP does search the entire target range repeatedly (which is way less efficient, although built-in formulas run in parallel on multiple cores plus Microsoft definitely built-in some caching for increased efficiency for repeated searches). Even then a single-threaded VBA collections should still be faster.
See example below with more information in the in-line comments.
"Big Table" is on Sheet1:
"Small Table" is on Sheet2:
And the code that matches records in small table to those in the big one:
Option Explicit
Sub matchRows()
' this is where the big table is
Dim w1 As Worksheet
Set w1 = Worksheets("Sheet1")
' this is where the small table is
Dim w2 As Worksheet
Set w2 = Worksheets("Sheet2")
Dim c As New Collection ' list of match keys in big table 1
Dim r As Range
' assume the match key is in col1 in both tables
' enumerate the keys in the big table
For Each r In w1.Range(w1.[a2], w1.[a2].End(xlDown))
c.Add r, r ' this stores the range (first param) and
' its key (second param - taken as string
' (value of the range), must be unique)
Next r
' now lets try to match / vlookup records in small table against
' big table
For Each r In w2.Range(w2.[a2], w2.[a2].End(xlDown))
If contains(c, CStr(r)) Then
' you didn't say what you want to do after a match, so
' I'll just display matched key value and row number in debug console
Debug.Print "Found match """ & r & """ at row number " & r.Row
Else
Debug.Print "No match found for """ & r & """ at row number " & r.Row
End If
Next r
End Sub
Function contains(col As Collection, key As String) As Boolean
On Error Resume Next
col.Item key
contains = (Err.Number = 0)
On Error GoTo 0
End Function
Result in Immediate Window:
Found match "data51" at row number 2
Found match "data61" at row number 3
No match found for "data81" at row number 4
Found match "data91" at row number 5
I have 2 different files which have different headers, for example:
OldfileHeaders | NewFileheaders
ID | Test ID
Date | New date
and so on. I am trying to compare the data in both sheets and see if they match. The rows of data may be in different order and the headers may also be in different order.
So what I am trying to do is:
1) define which headers match which headers between the 2 files
2) find the ID from the oldfile and see if it is in the new file, if it is then see if the data under each header matches. If it doesn't then export that row of data to a new sheet add a column and label it "Missing".
The Code So far:
Set testIdData = testIdData.Resize(testIdData.CurrentRegion.Rows.Count)
Do Until sourceId.Value = ""
datacopy = False
' Look for ID in test data
Set cellFound = testIdData.Find(What:=sourceId.Value, LookIn:=xlValues, LookAt:=xlWhole, MatchCase:=False)
If cellFound Is Nothing Then
' This entry not found, so copy to output
datacopy = True
outputRange.Resize(ColumnSize:=NUMCOLUMNS).Interior.Color = vbRed
Else
' This assumes that columns are in same order
For columnNum = 2 To NUM_COLUMNS_DATA
' No need to test the ID column
If sourceId.Cells(ColumnIndex:=columnNum).Value <> cellFound.Cells(ColumnIndex:=columnNum).Value Then
outputRange.Cells(ColumnIndex:=columnNum).Interior.Color = vbYellow
datacopy = True
End If
Next columnNum
End If
If datacopy Then
sourceId.Resize(ColumnSize:=NUMCOLUMNS).Copy
outputRange.PasteSpecial xlPasteValuesAndNumberFormats
Application.CutCopyMode = False
Set outputRange = outputRange.Offset(RowOffset:=1)
difference = difference + 1
End If
Set sourceId = sourceId.Offset(RowOffset:=1)
Loop
This code works depending on me formatting the sheets in the correct order and changing the header names.
I need help in defining which field names match which field names within the 2 sheets, and then searching the new sheet for each ID and seeing if the data in the corresponding cells match. If the ID is not in the sheet then output that row too a different sheet. If the id is present and there are differences in the cells then out put these to the shame sheet. I want to produce a tally of differences in each column.
Matching up data between data sets requires that you give the program some help. In this case, the help needed is which columns are related to each other. You have identified a small table of how headers are related. With this, you can do the various translations from data source 1 to data source 2. It requires heavy usage of Application.Match and Application.VLookup.
I will provide a base example which does the core of what you are trying to do. It is much easier to see it all on one sheet which is what I have done.
Picture of data shows three tables: rng_headers, rng_source, and rng_dest. One is the lookup for the headers, the second is the "source" data, and the third is the data source to compare against which I will call destination = "dest".
Code include steps to: iterate through all the IDs in the source data, check if they exist in the dest data, and, if so, check all the individual values for equality. This code checks the headers on every step (which is slow) but allows for the data to be out of order.
Sub ConfirmHeadersAndMatch()
Dim rng_headers As Range
Set rng_headers = Range("B3").CurrentRegion
Dim rng_dest As Range
Set rng_dest = Range("I2").CurrentRegion
Dim rng_source As Range
Set rng_source = Range("E2").CurrentRegion
Dim rng_id As Range 'first column, below header row
For Each rng_id In Intersect(rng_source.Columns(1).Offset(1), rng_source)
Dim str_header As Variant
str_header = Application.VLookup( _
Intersect(rng_id.EntireColumn, rng_source.Rows(1)), _
rng_headers, 2, False)
'get col number
Dim int_col_id As Integer
int_col_id = Application.Match(str_header, rng_dest.Rows(1), 0)
'find ID in the new column
Dim int_row_id As Variant
int_row_id = Application.Match(rng_id, rng_dest.Columns(int_col_id), 0)
If IsError(int_row_id) Then
'ID missing... do something
rng_id.Interior.Color = 255
Else
Dim rng_check As Range 'all values, same row
For Each rng_check In Intersect(rng_source, rng_id.EntireRow)
'get col number
str_header = Application.VLookup( _
Intersect(rng_check.EntireColumn, rng_source.Rows(1)), _
rng_headers, 2, False)
int_col_id = Application.Match(str_header, rng_dest.Rows(1), 0)
'check value
If rng_check.Value <> rng_dest.Cells(int_row_id, int_col_id).Value Then
'values did not match... do something
rng_dest.Cells(int_row_id, int_col_id).Interior.Color = 255
End If
Next rng_check
End If
Next
End Sub
Notes on the code
Ranges are built on CurrentRegion which picks out the blocks of data. You can swap these out for different ranges on different sheets.
Column header translation is done with Application.VLookup to check the source header and return the destination header. This String is then found in the destination header row using Application.Match. You could abstract this code into a Function to avoid repeating it twice.
Once the column is found, the ID is searched for in the destination table using Application.Match. This will return an error if the ID is not found.
If the ID is found, it then checks all of the other values in the same row, comparing them against the correct columns in the destination table. Non-matching results are colored red.
If all of the columns do not have pairs, you can add additional checks on the VLookup or the column Match to check this.
The vast majority of this code just handles getting to the correct spots in the data using Intersect, Rows, and Columns.
Results show some red values for the ID not found and the values that don't match.
I'm having trouble turning my thought process into tangible code and honestly I'm not sure where to start with the code. I have a data set with two applicable columns, for the sake of simplicity we'll say A and B. A contains a list of three initials followed by a number, ex. JFD3, JFD2, JFD6, EUW1, YMG2, YMG3. Column B has a value. I need to find the range of the highest to lowest values for each set of initials, which has me thinking a max - min solution. The list of initials isn't necessarily in order, and there could be one set of initials(with a net variance of 0, which is OK), or up to 8 sets of initials, with the numbers not necessarily being consecutive. I was thinking some sort of Match(Left(3)) but i don't think that would encompass everything.
Any ideas on where to start would be much appreciated. I'll be happy to clarify if theres any questions.
You can use dictionaries from the Scripting Runtime to do this easily. Use two of them with the initials as the keys, one holding the minimum values found and the other holding the maximum values found.
Add a reference to the Microsoft Scripting Runtime (Tools->Add reference..., then check the box next to "Microsoft Scripting Runtime") or late bind (see instructions below). Something like this should do the trick, assumes initials in column 1, values in column 2, no headers:
Private Sub MinMax()
Dim mins As Dictionary
Dim maxes As Dictionary
Dim sheet As Worksheet
Set sheet = ActiveSheet
Set mins = New Dictionary
Set maxes = New Dictionary
Dim row As Long
For row = 1 To sheet.UsedRange.Rows.Count
Dim key As Variant
Dim val As Integer
key = sheet.Cells(row, 1).Value2
If Len(key) >= 3 Then
key = Left$(sheet.Cells(row, 1).Value2, 3)
val = sheet.Cells(row, 2).Value2
If Not mins.Exists(key) Then
mins.Add key, val
Else
If mins(key) > val Then mins(key) = val
End If
If Not mins.Exists(key) Then
maxes.Add key, val
Else
If maxes(key) < val Then maxes(key) = val
End If
End If
Next row
For Each key In mins.Keys
Debug.Print key & ": Min = "; mins(key) & " Max = "; maxes(key)
Next key
End Sub
To use late binding, the code is exactly the same with these exceptions. Instead of declaring mins and maxes as Dictionary, declare them as Object:
Dim mins As Object
Dim maxes As Object
And instead of setting them as New Dictionary, use CreateObject:
Set sheet = ActiveSheet
Set mins = CreateObject("Scripting.Dictionary")
Set maxes = CreateObject("Scripting.Dictionary")
Use a Pivot Table. Put your Column A field* in the Row Labels, then put column B in the Values twice. Change one from Sum to Min, and the other from Sum to Max.
* Not sure if you need to group by JFD for all JFDx or by each JFDx. If you need them grouped by the 3 initials, make a column C =left("A1",3), then use that in your
An approach to this could be:
Sort the data in the range A-B by A in alphabetical order. To do this, you can record a macro while doing this action and edit the code to make it dynamically working every time. This is required to make the below solution work, more performing for many other kinds of similar approaches.
Use While blocks to run the solution. I let you take the time to build and test a working code, but this is the idea:
startSubset = 2 '<-- we start getting the key from row 2
'build the key to define the subset
keyStart = 1
currentKey = ""
Do While Not IsNumeric(Right(Left(Range("A" & startSubset),keyStart),1))
'while the last char of the key is not numeric, let's add it to the key
currentKey = currentKey & Right(Left(Range("A" & startSubset),keyStart),1)
keyStart = keyStart + 1
Loop
After the above, the key is stored in the variable currentKey. It will be JFD if the first cell is JFD213, etc. Hence, you loop until the end of this subset storing max and min in two variables:
min = 0
max = 0
Do While Left(Range("A" & startSubset),Len(currentKey)) = currentKey
If Range("B" & startSubset) < min Then min = Range("B" & startSubset)
If Range("B" & startSubset) > max Then max = Range("B" & startSubset)
startSubset = startSubset + 1
Loop
Once this is done, you just need to cast the values into a collection, for example:
myObs.Add(currentKey)
myObs.Add(min)
myObs.Add(max) '<-- you will get something like myObs = ("DJF", 0, 100)
Then cast this object into a bigger collection:
allValues.Add(myObs) '<-- at the end you will have something like this:
'allValues = [("DJF",0,100), ("ABC", 1, 75), ...]
and re-set the values to let them continue:
currentKey = ""
keyStart = 1
All the above, should be run in a While loop that will break when the data are over.
Please note the above code cannot work standing-alone, but it's rather a possible approach to the problem that you will need to re-work on your data to make it work in real life.
I am using VLookup function which looks up multiple values which are present in the column. This works very well but just takes a lot of time as I have 100,000 rows in the Excel sheet.
Is there any way to quicken this code?
The code basically looks up a particular value in a column and gets the offset. The difference between simple VLookup and this is that in case there are multiple rows with the same lookup value then it gets all the elements.
Function VLookupAll(ByVal lookup_value As String, _
ByVal lookup_column As Range, _
ByVal return_value_column As Long, _
Optional seperator As String = ", ") As String
Dim i As Long
Dim result As String
For i = 1 To lookup_column.Rows.Count
If Len(lookup_column(i, 1).Text) <> 0 Then
If lookup_column(i, 1).Text = lookup_value Then
result = result & (lookup_column(i).Offset(0, return_value_column).Text & seperator)
End If
End If
Next
If Len(result) <> 0 Then
result = Left(result, Len(result) - Len(seperator))
End If
VLookupAll = result
End Function
This is about 20-30x faster than a simple loop (tested over a column of 20k values, with 3 matches to the value being searched).
'rng: a single-column range to search for matches
'val: the value to match on
'col: offset from match in rng to concatenate values from (similar
' to the matching VLOOKUP argument)
Function MultiLookup(rng As Range, val As String, col As Long)
Dim i As Long, v, s
Dim r As Long
r = rng.Cells.Count
v = Application.Match(val, rng, 0)
s = ""
Do While Not IsError(v)
s = s & IIf(s <> "", ",", "") & rng.Cells(v).Offset(0, col - 1).Value
r = r - v
Set rng = rng.Offset(v, 0).Resize(r, 1)
v = Application.Match(val, rng, 0)
Loop
MultiLookup = s
End Function
http://www.excelhero.com/blog/2011/03/the-imposing-index.html says "Excel INDEX MATCH is significantly quicker than VLOOKUP"
You could try doing a Range.Find to see if the value exists at all in lookup column before proceeding. You are looping through every item in lookup column only to find it isn't there. If it were me, I would do a Range.find to see if lookup value is in lookup_column. If it is then you could do a countif to see how many occurrences there are...if there is only one occurrence, use plain old VLookup...and only fall back into your process if there is more than one occurrence.....may work....of course if Find fails, bail out of the function.
Another option is to load the lookup_column into any array...and process the array rather than the range.mnthat can sometimes help.
Summary:
Concate the values and do a vlookup on that new value.
For me I needed to have a formula and not a function to look up by 2 values. VLOOKUP could only work by a single value from what I've seen, so my solution was to concatenate the 2 values for a single primary key.
In my raw data tab I added a column called Lookup that simply concatenated the ID column with the Timestamp columns I had.
Then in my comparison tab I had
=VLOOKUP(CONCATENATE(A4, $F$1),'Historical Data'!$A:$G,3,FALSE)
Which took the ID column, concatenated with my lookup date at $F$1, and vlookup'ed into my data tab (Historical Data).