Can I write an array to a range and only recalculate changed cells? - vba

I've got a very large array of data that I'm writing to a range. However, sometimes only a few elements of the array change. I believe that since I am writing the entire array to the range, all of the cells are being re-calculated. Is there any way to efficiently write a subset of the elements - specifically, those that have changed?
Update: I'm essentially following this method to save on write time:
http://www.dailydoseofexcel.com/archives/2006/12/04/writing-to-a-range-using-vba/
In particular, I have a property collection that I populate with all of the objects (they are cells) whose data I need. Then I loop through all of the properties and write the values to an array, indexing the array so it matches the dimensions of the range that I want to write to. Finally, with TheRange.Value = TempArray I write the data in the array to the sheet. This last step overwrites the full range, which I believe causes recalculations even in cells whose actual values didn't change.
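For context, a minimal sketch of that pattern (the sheet name, range size, and the loop that fills the array are placeholders):

Dim TempArray() As Variant
Dim TheRange As Range

'Sheet name and dimensions are illustrative only
Set TheRange = Worksheets("Output").Range("A1:J1000")
ReDim TempArray(1 To TheRange.Rows.Count, 1 To TheRange.Columns.Count)

'... loop through the property collection and fill TempArray here ...

'Single write for the whole block: every cell in TheRange is written,
'so its dependents recalculate even where the value is unchanged
TheRange.Value = TempArray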

Let me start with a few basics:
When you write to a range of cells, even if the values are the same, Excel still sees it as a change and will recalculate accordingly. It does not matter if you have calculation turned off; the next time the range/sheet/workbook is calculated, everything that is dependent on that range will recalculate.
As you've discovered, writing an array to a range is much, much faster than writing cell-by-cell. It is also true that reading a range into an array is much faster than reading cell-by-cell.
As to your question of only writing the subset of data that has changed, you need a fast way to identify which data has changed. This is probably obvious, but it needs to be taken into account, because whatever that method is will also take some time.
To write only the changed data, you can do this two ways: either go back to writing cell-by-cell or break the array into smaller chunks. The only way to know if either of these is faster than writing the whole range is to try all three methods and time them with your data. If 90% of the data is changed, writing the entire block will certainly be faster than writing cell-by-cell. On the other hand, if the changed data only represents 5%, cell-by-cell may be better. The performance is dependent on too many variables to give a one-answer-fits-all solution.
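If you want to try the cell-by-cell variant, one hedged sketch is to read the sheet's current values in a single block and then write back only the differences (this assumes TheRange and TempArray from the question, sized identically):

Dim oldValues As Variant
Dim r As Long, c As Long

'One fast block read of what is currently on the sheet
oldValues = TheRange.Value

'Write back only the cells whose value actually differs
For r = 1 To UBound(TempArray, 1)
    For c = 1 To UBound(TempArray, 2)
        If oldValues(r, c) <> TempArray(r, c) Then
            TheRange.Cells(r, c).Value = TempArray(r, c)
        End If
    Next c
Next r

Timing this against the full-block write (and against writing smaller chunks) with your real data is the only way to know which wins.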

Related

Excel VBA using SUMPRODUCT and COUNTIFS - issue of speed

I have an issue of speed. (Apologies for the long post…). I am using Excel 2013 and 2016 for Windows.
I have a workbook that performs 10,000+ calculations on a 200,000 cell table (1000 rows x 200 columns).
Each calculation returns an integer (e.g. count of filtered rows) or more usually a percentage (e.g. sum of value of filtered rows divided by sum of value of rows). The structure of the calculation is variations of the SUMPRODUCT(COUNTIFS()) idea, along the lines of:
=IF($B6=0,
0,
SUMPRODUCT(COUNTIFS(
Data[CompanyName],
CompanyName,
Data[CurrentYear],
TeamYear,
INDIRECT(VLOOKUP(TeamYear&"R2",RealProgress,2,FALSE)),
"<>"&"",
Data[High Stage],
NonDom[NonDom]
))
/$B6
)
Explaining above:
The pair Data[CompanyName] and CompanyName is the column in the table and the condition value for the first filter.
The pair Data[CurrentYear] and TeamYear works the same way and constitutes the second filter.
The third pair looks up an intermediary table and returns the name of a column; the condition ("<>"&"") is ‘not blank’, i.e. it keeps all rows that have a value in that column.
Finally, the fourth pair is similar to the third, but keeps the rows whose Data[High Stage] value matches one of the values in NonDom[NonDom].
Lastly, the four filters are joined together with AND statements.
It is important to note that across all the calculations the same principle is applied of using SUMPRODUCT(COUNTIFS()) – however there are many variations on this theme.
At present, using Calculate on a selected set of sheets (rather than the slower calculation of the whole workbook) yields a calculation time of around 30-40 seconds. Not bad, and tolerable, as calculations aren't performed all the time.
Unfortunately, the model is to be extended and could now approach 20,000 rows rather than 1,000. Calculation performance is directly linked to the number of rows or cells, so I expect performance to plummet!
The obvious solution [1] is to use arrays, ideally passing an array, held in memory, to the formula in the cell and then processing it along with the filters and their conditions (the lookup filters being arrays too).
The alternative solution [2] is to write a UDF using arrays, but reading around the internet the opinion is that UDFs are much slower than native Excel functions.
Three questions:
Is solution [1] possible, and the best way of doing this, and if so how would I construct it?
If solution [1] is not possible or not the best way, does anyone have any thoughts on how much quicker solution [2] might be compared with my current solution?
Are there other better solutions out there? I know about Power BI Desktop, PowerPivot and PowerQuery – however this is a commercial application for use by non-Excel users and needs to be presented in the current Excel ‘grid’ form of rows and columns.
Thanks so much for reading!
Addendum: I'm going to try running an array calculation for each sheet on the Worksheet.Activate event and see if there's some time savings.
Writing data to arrays is normally a good idea if you are looking to increase speed. Done like this:
Dim myTable As ListObject
Dim myArray As Variant
'Point the table variable at the ListObject
Set myTable = ActiveSheet.ListObjects("Table1")
'Read the table's data body into a 2-D Variant array
myArray = myTable.DataBodyRange.Value
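Building on that, a rough sketch of what solution [2] could look like: a UDF that counts matches against an in-memory array instead of re-reading the sheet for every criterion. CountMatches, the sheet and table names, and the column positions are all hypothetical, and a real version would need to mirror all four COUNTIFS filters:

Public Function CountMatches(companyName As String, teamYear As String) As Long
    Dim data As Variant
    Dim i As Long, n As Long

    'Hypothetical names: read the source table once per call
    '(caching the array in a module-level variable would avoid re-reading it on every recalculation)
    data = ThisWorkbook.Worksheets("Data").ListObjects("Table1").DataBodyRange.Value

    For i = 1 To UBound(data, 1)
        'Columns 1 and 2 stand in for the CompanyName and CurrentYear columns
        If data(i, 1) = companyName And data(i, 2) = teamYear Then n = n + 1
    Next i

    CountMatches = n
End Function

Whether this beats the native SUMPRODUCT(COUNTIFS()) at 20,000 rows is something only timing both against the real workbook will show.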

Why does this one line take SO long?

I want to put a lot of identical formulas in to a medium sized range. And it is taking forever to process. The code I am using is:
.Range("M2:AZ1000").FormulaR1C1 = "=COUNTIFS(MOODLE!C[-5],""Y"",MOODLE!C3,RC3)"
but the line takes about five minutes to run.
I have tried switching calculation off, inserting the formulas, and then switching it back on (this INCREASES the time it takes, as it now seems to calculate twice!).
I can think of two reasons for your problem:
your formula itself requires lots of calculation. A full column has just over a million rows (2^20 in current Excel versions). In your formula you pass two whole columns as the criteria_range arguments, which means Excel has 2^20 + 2^20 cells to check, and that is for just one cell. You apply it to 39,960 cells. Try limiting the criteria_range inputs to a smaller range. (In my experience, when using COUNTIF with a criteria_range of 10,000 cells and applying the formula over 10,000 cells, things start to get very sluggish.)
you have other Excel open which also has many volatile functions. Try closing them.
Also, maybe turning calculation to Manual mode would ease the pain?
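If limiting the ranges is an option, one hedged sketch is to bound the criteria ranges to the rows actually in use and wrap the write in a manual-calculation block (the destination sheet name and the column used to find the last row are assumptions):

Dim wsMoodle As Worksheet, lastRow As Long

Set wsMoodle = Worksheets("MOODLE")
'Find the last used row so the criteria ranges are bounded rather than whole columns
lastRow = wsMoodle.Cells(wsMoodle.Rows.Count, "C").End(xlUp).Row

Application.Calculation = xlCalculationManual
Worksheets("Results").Range("M2:AZ1000").FormulaR1C1 = _
    "=COUNTIFS(MOODLE!R2C[-5]:R" & lastRow & "C[-5],""Y"",MOODLE!R2C3:R" & lastRow & "C3,RC3)"
Application.Calculation = xlCalculationAutomatic   'one recalculation happens here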

How to check if a value is in an array without the use of a For loop?

Is this possible and/or recommended? The issue I'm having is the processing time of this code, which checks an array of ~40 items for a value; once it finds the value it sets a boolean. This same for loop is called up to 20 times, so I was wondering if there is a way I could optimize this code so that I don't need several for loops checking for a single answer.
Here's an example of the code
For i = 0 To iCount 'iCount up to 40
    If name = UCase(m_Array(i, 1)) Then
        '<logic>
    End If
Next
Above is an example of what I'm looking at. This little chunk of code checks the array, which is prepopulated before this function runs and usually holds 30-40 items. With this being called up to 20 times, I feel I could reduce the run time if I could find another way to do it without so many for loops.
LINQ provides a Contains extension method, which returns a Boolean, but that won't work for multidimensional arrays. Even if it did work, if performance is the concern, then Contains wouldn't help much since, internally, all the Contains method does is loop through the items until it finds the matching item.
One way to make it faster, is to use an Exit For statement to exit the loop once the first matching item is found. At least then it won't continue searching through the rest of the items after it finds the one for which it was looking:
For i = 0 To iCount 'iCount up to 40
    If name = UCase(m_Array(i, 1)) Then
        ' logic...
        Exit For
    End If
Next
If you don't want it to have to search through the array at all, you would need to index your data. The simplest way to index your data is with a hash table. The Dictionary class is an easy to use implementation of a hash table. However, in the end, a hash table (just like any other indexing method) will only help performance if the situation is right. In this situation, where the array only contains 40 or so items, it's quite possible that a hash table will be slower. The only way to know for sure is to test it both ways and see if it makes any difference.
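For the VBA side of this, a Scripting.Dictionary is the usual hash-table option. A minimal sketch, assuming m_Array, iCount, and name from the question, and building the index once so the ~20 later checks are single lookups:

Dim lookup As Object
Dim i As Long

Set lookup = CreateObject("Scripting.Dictionary")

'Build the index once, up front
For i = 0 To iCount
    lookup(UCase(m_Array(i, 1))) = True
Next i

'Each later check is a hashed lookup instead of a scan
If lookup.Exists(name) Then
    ' logic...
End If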
You're currently searching the list repeatedly and doing something if a member has certain properties. Instead you could check the properties of an item once, when you add an item to the list, and perform the logic then. Instead of repeating the tests it's then only done once. No searching at all is better than even the fastest search.

VBA: Performance of multidimensional List, Array, Collection or Dictionary

I'm currently writing code to combine two worksheets containing different versions of data.
Hereby I first want to sort both via a Key Column, combine 'em and subsequently mark changes between the versions in the output worksheet.
As the data already amounts to several tens of thousands of lines and might some day exceed Excel's rows-per-worksheet limit, I want these calculations to run outside of a worksheet. It should also perform better.
Currently I'm thinking of a Quicksort of first and second data and then comparing the data sets per key/line. Using the result of the comparison to subsequently format the cells accordingly.
Question
I'd just love to know, whether I should use:
List OR Array OR Collection OR Dictionary
OF Lists OR Arrays OR Collections OR Dictionaries
I have so far been unable to determine the differences in codability and performance between these 16 possibilities. Currently I'm implementing an Array OF Arrays approach, while constantly wondering whether this makes sense at all.
Thanks in advance, appreciate your input and wisdom!
Some time ago, I had the same problem with a client's macro. In addition to the really big number of rows (over 50,000 and growing), it had the problem of becoming tremendously slow beyond a certain row (around 5,000) when a "standard approach" was taken, that is, when the inputs for the calculations on each row were read from the same worksheet (a couple of rows above); this reading and writing was what made the process slower and slower (apparently, Excel starts from row 1, and the lower the row, the longer it takes to reach it).
I improved this situation by relying on two different solutions: firstly, setting a maximum number of rows per worksheet; once it was reached, a new worksheet was created and the reading/writing continued there (from the first rows). The other change was moving from reading/writing in Excel to reading from temporary .txt files and writing to Excel (all the lines were read right at the start to populate the files). These two modifications improved the speed a lot (from half an hour to a couple of minutes).
Regarding your question, I wouldn't rely too much on arrays with a macro (although I am not sure how much information each of these 10,000 lines contains); but I guess this is a personal decision. I don't like collections much because they are less efficient than arrays, and the same goes for dictionaries.
I hope that this "short" comment is of some help.
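As a concrete illustration of the keyed comparison the question describes, here is a minimal sketch that reads both versions into arrays and indexes the first by its key column with a Scripting.Dictionary (the sheet names, the key sitting in column 1, and the header row are all assumptions):

Dim oldData As Variant, newData As Variant
Dim keyToRow As Object
Dim i As Long

oldData = Worksheets("VersionA").UsedRange.Value
newData = Worksheets("VersionB").UsedRange.Value

'Index the old version: key -> row number
Set keyToRow = CreateObject("Scripting.Dictionary")
For i = 2 To UBound(oldData, 1)        'row 1 assumed to hold headers
    keyToRow(oldData(i, 1)) = i
Next i

'Walk the new version once and look up each key directly
For i = 2 To UBound(newData, 1)
    If keyToRow.Exists(newData(i, 1)) Then
        'compare oldData(keyToRow(newData(i, 1)), ...) with newData(i, ...) and mark changes
    Else
        'key only exists in the new version
    End If
Next i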

What is the difference between the time complexity of these two ways of using loops in VBA?

I have a theoretical question; I will appreciate your advice here.
Say, we have these two pieces of code.
First one:
For Each cell In rng1
    collectionOfValues.Add (cell.Value)
Next
For Each cell In rng2
    collectionOfAddresses.Add (cell.Address)
Next
For i = 1 To collectionOfAddresses.Count
    Range(collectionOfAddresses.Item(i)) = collectionOfValues.Item(i)
Next i
Here we add addresses from one range to a certain collection, and values from another range to a second collection, and then fill cells on these addresses with the values.
Here is the second piece of code, which does the same thing:
For i = 1 To rng1.Rows.Count
    For j = 1 To rng1.Columns.Count
        rng2.Cells(i, j) = rng1.Cells(i, j)
    Next j
Next i
So, the question is - what is the time of execution in both cases? I mean, it's clear that the second case is O(n^2) (to make it easier we assume the range is square).
What about the first one? Is For Each considered a nested loop?
And if so, does it mean that the time of the first code is O(n^2) + O(n^2) + O(n^2) = 3*O(n^2) which makes pretty the same as the second code time?
In general, do these two pieces of code differ, apart from the fact that the first one takes additional memory when creating the collections?
Thanks a lot in advance.
Actually, your first example is O(n^4)!
That might sound surprising, but this is because indexing into a VBA Collection has linear, not constant, complexity. The VBA Collection essentially has the performance characteristics of a list - to get element N by index takes a time proportional to N. To iterate the whole thing by index takes a time proportional to N^2. (I switched cases on you to distinguish N, the number of elements in the data structure, from your n, the number of cells on the side of a square block of cells. So here N = n^2.)
That is one reason why VBA has the For...Each notation for iterating Collections. When you use For...Each, VBA uses an iterator behind the scenes so walking through the entire Collection is O(N) not O(N^2).
So, switching back to your n, your first two loops are using For...Each over a Range with n^2 cells, so they are each O(n^2). Your third loop is using For...Next over a Collection with n^2 elements, so that is O(n^4).
I actually don't know for sure about your last loop because I don't know exactly how the Cells property of a Range works - there could be some extra hidden complexity there. But I think Cells will have the performance characteristics of an array, so O(1) for random access by index, and that would make the last loop O(n^2).
This is a good example of what Joel Spolsky called "Shlemiel the painter's algorithm":
There must be a Shlemiel the Painter's Algorithm in there somewhere. Whenever something seems like it should have linear performance but it seems to have n-squared performance, look for hidden Shlemiels. They are often hidden by your libraries.
(See this article from way before stackoverflow was founded: http://www.joelonsoftware.com/articles/fog0000000319.html)
More about VBA performance can be found on Doug Jenkins's website:
http://newtonexcelbach.wordpress.com/2010/03/07/the-speed-of-loops/
http://newtonexcelbach.wordpress.com/2010/01/15/good-practice-best-practice-or-just-practice/
(I will also second what cyberkiwi said about not looping through Ranges just to copy cell contents, if this were a "real" program and not just a learning exercise.)
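To see the Collection-indexing cost described above for yourself, a small timing sketch (Timer is coarse, so use enough items to get measurable numbers):

Dim col As New Collection
Dim i As Long, v As Variant
Dim t As Double, total As Double

For i = 1 To 50000
    col.Add i
Next i

t = Timer
For Each v In col              'iterator: O(N) over the whole Collection
    total = total + v
Next v
Debug.Print "For Each:", Timer - t

t = Timer
For i = 1 To col.Count         'Item(i) walks from the start each time: O(N^2) overall
    total = total + col.Item(i)
Next i
Debug.Print "Indexed:", Timer - t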
You are right that the first is 3 x O(n^2), but remember that O-notation does not care about constants, so in terms of complexity, it is still an O(n^2) algorithm.
The first one is not considered a nested loop, even though it works over the same amount of data as the second. It is just a straight iteration over an N-item range in Excel. What makes it N^2 is the fact that you are defining N as the length of a side, i.e. the number of rows/columns (the range being square).
Just an Excel VBA note: you shouldn't be looping through cells nor storing addresses anyway. Neither of the approaches is optimal, but I think they serve to illustrate your question about O-notation.
rng1.Copy
rng2.Cells(1).PasteSpecial xlValues
Application.CutCopyMode = False
Remember not to confuse the complexity of YOUR code with the complexity of background Excel functions. Overall, the amount of work done is N^2 in both cases. However, in your first example YOUR code is actually only 3N (N for each of the three loops). The fact that a single statement in Excel can fill in multiple values does not change the complexity of your written code. A foreach loop is the same as a for loop - N complexity by itself. You only get N^2 when you nest loops.
To answer your question about which is better - generally it is preferable to use built in functions where you can. The assumption should be that internally Excel will run more efficiently than you could write yourself. However (knowing MS) - make sure you always check that assumption if performance is a priority.
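As a footnote to that, when only values are needed the clipboard in the earlier snippet can be skipped entirely (both ranges must be the same size):

rng2.Value = rng1.Value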