Excel VBA using SUMPRODUCT and COUNTIFS - issue of speed - vba

I have an issue of speed. (Apologies for the long post…). I am using Excel 2013 and 2016 for Windows.
I have a workbook that performs 10,000+ calculations on a 200,000 cell table (1000 rows x 200 columns).
Each calculation returns an integer (e.g. count of filtered rows) or more usually a percentage (e.g. sum of value of filtered rows divided by sum of value of rows). The structure of the calculation is variations of the SUMPRODUCT(COUNTIFS()) idea, along the lines of:
=IF($B6=0,
0,
SUMPRODUCT(COUNTIFS(
Data[CompanyName],
CompanyName,
Data[CurrentYear],
TeamYear,
INDIRECT(VLOOKUP(TeamYear&"R2",RealProgress,2,FALSE)),
"<>"&"",
Data[High Stage],
NonDom[NonDom]
))
/$B6
)
Explaining above:
the pair Data[Company Name] and CompanyName is the column in the table and the condition value for the first filter.
The pair Data[Current Year] and TeamYear are the same as above and constitute the second filter.
The third pair looks up a intermediary table and returns the name of the column, the condition ("<>"&"") is ‘not blank’, i.e. returns all rows that have a value in this column
Finally, the fourth pair is similar to 3 above but returns a set of values that matches the set of values in
Lastly, the four filters are joined together with AND statements.
It is important to note that across all the calculations the same principle is applied of using SUMPRODUCT(COUNTIFS()) – however there are many variations on this theme.
At present, using Calculate on a select range of sheets (rather than the slower calculating the whole workbook), yields a speed of calculation of around 30-40 seconds. Not bad, and tolerable as calculations aren’t performed all the time.
Unfortunately, the model is to be extended and now could approach 20,000 rows rather 1,000 rows. Calculation performance is directly linked to the number of rows or cells, therefore I expect performance to plummet!
The obvious solution [1] is to use arrays, ideally passing an array, held in memory, to the formula in the cell and then processing it along with the filters and their conditions (the lookup filters being arrays too).
The alternative solution [2] is to write a UDF using arrays, but reading around the internet the opinion is that UDFs are much slower than native Excel functions.
Two questions:
Is solution [1] possible, and the best way of doing this, and if so how would I construct it?
If solution [1] is not possible or not the best way, does anyone have any thoughts on how much quicker solution [2] might be compared with my current solution?
Are there other better solutions out there? I know about Power BI Desktop, PowerPivot and PowerQuery – however this is a commercial application for use by non-Excel users and needs to be presented in the current Excel ‘grid’ form of rows and columns.
Thanks so much for reading!
Addendum: I'm going to try running an array calculation for each sheet on the Worksheet.Activate event and see if there's some time savings.

Writing data to arrays is normally a good idea if looking to increase speed. Done like this:
Dim myTable As ListObject
Dim myArray As Variant
'Set path for Table variable
Set myTable = ActiveSheet.ListObjects("Table1")
'Create Array List from Table
myArray = myTable.DataBodyRange
(Source)

Related

Resolving performance Issues on Linq with "LIKE"

I have a recognition table containing 25,000 records, and an incoming table of strings that must be recognised using LIKE matching, typically between 200 and 4000 per batch. This used to be in sql server but I am trying to get it to go faster by doing it all in memory, however linq is much slower, taking 5 seconds instead of 250ms in sql when the incoming table has 200 rows.
The recognition table is declared as follows:
Private mRecognition377LK As New SortedDictionary(Of String, RecognitionItem)(StringComparer.CurrentCultureIgnoreCase)
The actual like comparison is here:
r = mRecognition377LK.FirstOrDefault(Function(v As KeyValuePair(Of String, RecognitionItem)) sTitle Like v.Key).Value
So this is executed for every incoming record and I thought that using v.key would enable the linq engine to not scan records that start with a different character, but it seems not.
I can reinvent the wheel and create a collection class that splits the recognition table into its constituent
E.g. if an incoming string is abcdef and we have a recognition record of "abc*" then I could store collection grouped by length of recognition item up to the first star (3), then inside that a collection of recognition items with that length, keyed on the text up to the first star (abc)
So abc* has a string length of 3 so:
r = Itemz(3).Recog("abc")
I think that will work and perform well but its a lot of faff and I'm sure that collection classes and linq would have been designed in a way that such a simple thing could be executed quickly without this performance drag.
So my question is is there a way to make this go fast without going to my proposed solution ?
DRAFT ANSWER
So having programmed up several iterations of TRIE and binary searches I realised that all this was excessive processing and that is because....
BOTH LISTS ARE SORTED
... that means we only need one loop to process both lists and join them, i.e. we are doing in C#/VB what Sql Server does when it performs a MERGE join. So now I am pursuing this as a solution and will update here as appropriate.
FINAL UPDATE
The solution is now finished, and you can indeed join as many lists as you like as long as they are all sorted ascending or all sorted descending on the attributes you are joining, and you can do this in a single loop (because they are sorted). My code is about 1000 lines and very specific, so I'm not going to post a code solution, but for anyone that hits this kind of problem in future, it seems there is nothing in linq that will help do a merge join which is not based on equality (we have LIKE matching) so writing your own merge join in a single loop is possible when the incoming data is sorted.
The basis of the algorithm is to loop through the table which is your "maintable", and advance a pointer into each other list until the text comparison becomes greater than or equal. When its equal, you don't advance this list again until it doesn't match the maintable list, since one item on the right could join many items on the left. This can be repeated for multiple arrays.
It would be nice to see a library where you can pass lambda functions to perform merge joins on multiple sorted arrays. I will consider writing one in future.
The solution runs in 0.007 seconds to join 200 records to a 70,000 record recognition list. With linq performing effectively an inner loop, it took 5 seconds. When joining 4000 records to the same 70,000 record recognition list, the performance degrades only slightly to around 0.01s, showing the great effectiveness of the merge join logic. Sql server took around 250ms to perform the join.

Looping through columns to conduct data manipulations in a data frame

One struggle I have with using Python Pandas is to repeat the same coding scheme for a large number of columns. For example, below is trying to create a new column age_b in a data frame called data. How do I easily loop through a long (100s or even 1000s) of numeric columns, do the exact same thing, with the newly created column names being the existing name with a prefix or suffix string such as "_b".
labels = [1,2,3,4,5]
data['age_b'] = pd.cut(data['age'],bins=5, labels=labels)
In general, I have many simply data frame column manipulations or calculations, and it's easy to write the code. However, so often I want to repeat the same process for dozens of columns, that's when I get bogged down, because most functions or manipulations would work for one column, but not easily repeatable to many columns. It would be nice if someone can suggest a looping code "structure". Thanks!

Latest date from a date list >180 days in past from a given date in same list

I have a "Appeared date" column A and next to it i have a ">180" date column B. There is also "CONCAT" column C and a "ATTR" column D.
What i want to do is find out the latest date 180 or more from past, and write it in ">180" column, for each date in "Appeared Date" column, where the Concat column values are same.
The Date in >180 column should be more than 180 days from "Appeared date" column in the past, but should also be an earliest date found only from the "Appeared date" column.
Based on this i would like to check if a particular product had "ATTR" = 'NEW' >180 earlier also i.e. was it launched 180 days or more ago and appearing again recently?
Is there an excel formula which can get the nearest dates (>180) picked from the Appeared date and show it in the ">180" column?
Will it involve a mix of SMALL(), FREQUENCY(), MATCH(), INDEX() etc?
Or a VBA procedure is required?
To do this efficiently with formulas, you can use something called Range Slicing to reduce the size of the arrays to be processed, by efficiently truncating them so that they contain just the subset of those 3,000 to 50,000 rows that could possibly hold the correct answer, and THEN doing the actual equality check. (As apposed to your MAX/Array approach, which does computationally expensive array operations on all the rows, even though most of the rows have no relationship with the current row that you seek an answer for).
Here's my approach. First, here's my table layout:
...and here's my formulas:
180: =[#Appeared]-180
Start: =MATCH([#CONCAT],[CONCAT],0)
End: =MATCH([#CONCAT],[CONCAT],1)
LastRow: =MATCH(1,--(OFFSET([Appeared],[#Start],,[#End]-[#Start])>[#180]),0)+[#Start]-1
LastItem: =INDEX([Appeared],[#LastRow])
LastDate > 180: =IF([#Appeared]-[#LastItem]>180,[#LastItem],"")
Days: =IFERROR([#Appeared]-[#[LastDate > 180]],"")
Even with this small data set, my approach is around twice as fast as your MAX approach. And as the size of the data grows, your approach is going to get exponentially slower, as more and more processing power is wasted on crunching rows that can't possibly contain the answer. Whereas mine will get slower in a linear fashion. We're probably talking a difference of minutes, or perhaps even an hour or so at the extremes.
Note that while you could do my approach with a single mega-formula, you would be wise not to: it won't be anywhere near as efficient. splitting your mega-formulas into separate cells is a good idea in any case because it may help speed up calculation due to something called multithreading. Here’s what Diego Oppenheimer, a former program manager for Microsoft Excel, had to say on the subject back in 2005 :
Multithreading enables Excel to spot formulas that can be calculated concurrently, and then run those formulas on multiple processors simultaneously. The net effect is that a given spreadsheet finishes calculating in less time, improving Excel’s overall calculation performance. Excel can take advantage of as many processors (or cores, which to Excel appear as processors) as there are on a machine—when Excel loads a workbook, it asks the operating system how many processors are available, and it creates a thread for each processor. In general, the more processors, the better the performance improvement.
Diego went on to outline how spreadsheet design has a direct impact on any performance increase:
A spreadsheet that has a lot of completely independent calculations should see enormous benefit. People who care about performance can tweak their spreadsheets to take advantage of this capability.
The bottom line: Splitting formulas into separate cells increases the chances of calculating formulas in parallel, as further outlined by Excel MVP and calculation expert Charles Williams at the following links:
Decision Models: Excel Calculation Process
Excel 2010 Performance: Performance and Limit Improvements
I think i found the answer. Earlier i was using the MIN function, though incorrectly, as the dates in the array formula (when you select and hit F9 key) were coming in descending order. So i finally used the MAX function to find the earliest date which was more than 180 in the past.
=IF(MAX(IF(--(A2-$A$2:$A$33>=180)*(--(C2=$C$2:$C$33))*(--
($D$2:$D$33="NEW")),$A$2:$A$33))=0,"",MAX(IF(--(A2-$A$2:$A$33>=180)*(--
(C2=$C$2:$C$33))*(--($D$2:$D$33="NEW")),$A$2:$A$33)))
Check the revised Sample.xlsx which is self-explanatory. I have added the Attr='NEW' criteria in the formula for the final workaround, to find if there were any new items that came 180 days or earlier.
Though still an ADO query alternative may be required to process the large amounts of data.

Clever way to check if value meets threshold in VBA

Disclaimer: Numbers below are randomly generated
What I'm trying to do is, purely in VBA, look at the ratio of [column B]/[column A] and checking whether or not the ratio in row 10 (=1,241/468) is below the minimum of the ratios or above the maximum of the ratios in rows 1 through 9 but only compared to the rows where there is a 1 in column C.
That is, compare Cell(B10)/Cell(A10) to Cell(B2)/Cell(A2), Cell(B3)/Cell(A3), etc. (only comparing against rows with a 1 in column C).
The workbook I'm working with has a lot more data and columns and I'm not allowed to explicitly edit the cells, so defining a new column is out of the question. Is there a way to do this in VBA such that it essentially returns a boolean depending whether or not the ratio in the last row violates the threshold defined above?
You can achieve the minimum and maximum ratios (with criteria) easily with the AGGREGATE¹ function's SMALL sub-function and LARGE sub-function.
        
The formulas in D13:E13 are,
=AGGREGATE(15, 6, ((B1:B9)/(A1:A9))/C1:C9, 1)
=AGGREGATE(14, 6, ((B1:B9)/(A1:A9))/C1:C9, 1)
The 6 is the AGGREGATE parameter for ignoring error values. By dividing the ratio
by the value in column C we are producing #DIV/0! errors for anything we do not want considered leaving them ignored. If the values in C were more diverse, we could divide by (C1:C9=1) to produce the same results.
Since we are using the SMALL and LARGE sub-functions, we can easily retrieve the second, third, etc. ratios by increasing the k parameter (the 1 off the back end).
I've modified some of the values in your sample slightly to demonstrate that the min and max with criteria are being picked up correctly.
These can be adapted to VBA with the WorksheetFunction object or Application.Evaluate method.
¹The AGGREGATE¹ function's was introduced with Excel 2010. It is not available in previous versions.

VBA: Performance of multidimensional List, Array, Collection or Dictionary

I'm currently writing code to combine two worksheets containing different versions of data.
Hereby I first want to sort both via a Key Column, combine 'em and subsequently mark changes between the versions in the output worksheet.
As the data amounts to already several 10000 lines and might some day exceed the lines-per-worksheet limit of excel, I want these calculations to run outside of a worksheet. Also it should perform better.
Currently I'm thinking of a Quicksort of first and second data and then comparing the data sets per key/line. Using the result of the comparison to subsequently format the cells accordingly.
Question
I'd just love to know, whether I should use:
List OR Array OR Collection OR Dictionary
OF Lists OR Arrays OR Collections OR Dictionaries
I have as of now been unable to determine the differences in codability and performance between this 16 possibilities. Currently I'm implementing an Array OF Arrays approach, constantly wondering whether this makes sense at all?
Thanks in advance, appreciate your input and wisdom!
Some time ago, I had the same problem with the macro of a client. Additionally to the really big number of rows (over 50000 and growing), it had the problem of being tremendously slow from certain row number (around 5000) when a "standard approach" was taken, that is, the inputs for the calculations on each row were read from the same worksheet (a couple of rows above); this process of reading and writing was what made the process slower and slower (apparently, Excel starts from row 1 and the lower is the row, the longer it takes to reach there).
I improved this situation by relying on two different solutions: firstly, setting a maximum number of rows per worksheet, once reached, a new worksheet was created and the reading/writing continued there (from the first rows). The other change was moving the reading/writing in Excel to reading from temporary .txt files and writing to Excel (all the lines were read right at the start to populate the files). These two modifications improved the speed a lot (from half an hour to a couple of minutes).
Regarding your question, I wouldn't rely too much on arrays with a macro (although I am not sure about how much information contains each of these 10000 lines); but I guess that this is a personal decision. I don't like collections too much because of being less efficient than arrays; and same thing for dictionaries.
I hope that this "short" comment will be of any help.