VBA: Big Data and the use of Arrays

I'm working with a large data set of 28,000+ rows, plus possibly 5 other spreadsheets to cross-reference against.
I keep being told that arrays are faster, but can someone explain why? They seem to be faster where you can read and write large chunks of data into the array at a time, which is where I can understand there might be a reduction in overhead.
Or is it right to say arrays are just plain faster than, say:
Worksheet.range("A1").Value=AOtherWorksheet.range("A1").Value
It just appears somewhat magical if that's the case. I could get why reading in blocks of Variants would be faster, but I don't necessarily get why reading off a sheet into an array and then off the array into a second sheet would be faster. Have I misunderstood? I'm just trying to tease that specific part out.
Any other tricks or comments for automating large spreadsheets are welcome, but I was mainly focused on understanding this titbit.

I think the magic is caused by complexity: each cell carries a lot of "baggage" with it.
There are hundreds of settings for its environment, and most of them are about cell formatting:
Height, Width, RowHeight, ColumnWidth, Column, Row, Font, IndentLevel, etc.
To see all properties, observe the watch window for Sheet1.Range("A1")
(properties with a + next to them are complex objects with their own set of properties)
The main reason for optimizing with arrays is to avoid all formatting
Each cell "knows" about all settings regardless of whether they have been changed, and carries all this "weight" around. Most users, most of the time, only care about the value in the cell and never touch the formatting. On rare occasions you may be stuck working directly with the Range object if you need to modify each individual cell's .Borders, .Interior.Color, .Font, etc., and even then there are ways of grouping similarly formatted cells and modifying the attributes of the entire group at once.
To continue with the baggage analogy (and this is stretching it a bit): at an airport, if I need to refill a pen for passenger "John Doe" from his luggage already on the plane, in a utility room at the back of the airport, I will be able to do it (I have all the info I need), but it'll take me time to go back and forth, carrying that luggage. And for one passenger it can be done in a reasonable amount of time, but how much longer would it take to refill 20K pens, or 100K, or a million ? (ONE - BY - ONE)
I view the Range <-> VBA interaction the same way: working with individual cells one at a time is like carrying each individual piece of luggage, for a million passengers, to the utility room at the back of the airport. This is what this statement does:
Sheet1.Range("A1:A1048576").Value = Sheet2.Range("A1:A1048576").Value
as opposed to extracting all pens from all suitcases at once, refilling them, and placing them all back.
Copying the Range object to an array isolates one of the properties of each cell, its Value ("the pen"), from all the other settings (Excel is extremely efficient about this). We now have an array of only the values, with no formatting settings. Modify each value individually in memory, then place them all back into the Range object:
Dim arr As Variant
arr = Sheet2.Range("A1:A1048576").Value 'Read all values from Sheet2 into an array
Sheet1.Range("A1:A1048576").Value = arr 'Write the whole array to Sheet1 in one operation
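To see the difference on your own data, a rough timing harness along these lines can help. It is only a sketch: the procedure name, the row count of 28,000, the use of columns B and C as scratch output, and the assumption that column A of Sheet1 holds numbers are all illustrative and not part of the original answer.

```vba
Option Explicit

' Hypothetical timing harness. Assumes Sheet1!A1:A28000 holds numbers
' (text values would raise a type mismatch on the multiplication).
Public Sub CompareCellLoopWithArray()
    Const ROW_COUNT As Long = 28000
    Dim i As Long, t As Double, arr As Variant

    ' 1) One read and one write per cell: 2 * ROW_COUNT trips between VBA and Excel
    t = Timer
    For i = 1 To ROW_COUNT
        Sheet1.Cells(i, 2).Value = Sheet1.Cells(i, 1).Value * 2
    Next i
    Debug.Print "Cell by cell: "; Format$(Timer - t, "0.000"); " sec"

    ' 2) One bulk read, all work done in memory, one bulk write
    t = Timer
    arr = Sheet1.Range("A1").Resize(ROW_COUNT, 1).Value   ' 2-D, 1-based Variant array
    For i = 1 To ROW_COUNT
        arr(i, 1) = arr(i, 1) * 2
    Next i
    Sheet1.Range("C1").Resize(ROW_COUNT, 1).Value = arr
    Debug.Print "Via array:    "; Format$(Timer - t, "0.000"); " sec"
End Sub
```

Both halves compute the same result; only the number of round trips between VBA and the worksheet differs, which is the part worth measuring.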
This is where the Copy / Paste parameters are different as well:
Sheet2.Range("A1:A1048576").Copy
Sheet1.Range("A1:A1048576").PasteSpecial xlPasteAll
Timers for Rows: 1,048,573
xlPasteAll - Time: 0.629 sec; (all values + all formatting)
xlPasteAllExceptBorders - Time: 0.791 sec
xlPasteAllMergingConditionalFormats - Time: 0.782 sec; (no merged cells)
xlPasteAllUsingSourceTheme - Time: 0.791 sec
xlPasteColumnWidths - Time: 0.004 sec
xlPasteComments - Time: 0.000 sec; (comments test is too slow)
xlPasteFormats - Time: 0.497 sec; (format only, no values, no brdrs)
xlPasteFormulas - Time: 0.718 sec
xlPasteFormulasAndNumberFormats - Time: 0.775 sec
xlPasteValidation - Time: 0.000 sec
xlPasteValues - Time: 0.770 sec; (conversion from formula to val)
xlPasteValuesAndNumberFormats - Time: 0.634 sec
Another aspect, beyond arrays, is the choice of data structure and how its items are indexed.
For most situations arrays are acceptable, but when better lookup performance is needed, there are the Dictionary and Collection objects.
One array inefficiency is that to find an element we need to iterate over the items, potentially all of them.
A more convenient option is to access a specific item directly by its key, which is a lot faster:
Dim d As Object 'Late bound; items are looked up by key, not by position
Set d = CreateObject("Scripting.Dictionary") 'From the Microsoft Scripting Runtime library, not built into VBA
d.Add Key:="John Doe", Item:="31" 'John Doe - 31 (age); index is based on Key
d.Add Key:="Jane Doe", Item:="33"
Debug.Print d("Jane Doe") 'Prints 33
Dictionaries also have a very useful and fast method for checking whether a key exists, d.Exists("John Doe"), which returns True or False without raising errors (Collections have no equivalent). With an array you'd have to loop over potentially all items to find out.
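For comparison, the usual workaround for a Collection is to probe for the key and trap the run-time error. The helper below is a hypothetical sketch (the name CollectionHasKey is not from the original answer); it shows why Exists on a Dictionary is the more comfortable option.

```vba
' Hypothetical helper: True if col holds an item stored under key.
' A Collection has no Exists method, so we try to read the item and trap the error.
Public Function CollectionHasKey(ByVal col As Collection, ByVal key As String) As Boolean
    Dim dummy As Boolean

    On Error Resume Next
    Err.Clear
    ' IsObject evaluates col.Item(key) without assigning it, so the probe
    ' works for both object items and plain values.
    dummy = IsObject(col.Item(key))
    CollectionHasKey = (Err.Number = 0)
    On Error GoTo 0
End Function
```

Usage would look like `If CollectionHasKey(myCol, "John Doe") Then ...`, where myCol is whatever Collection you are working with.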
I think one of the fastest ways to extract unique values for large columns is to combine arrays and dictionaries
Public Sub ShowUniques()
With Sheet1.UsedRange
GetUniques .Columns("A"), .Columns("B")
End With
End Sub
Public Sub GetUniques(ByRef dupesCol As Range, ByRef uniquesCol As Range)
    Dim arr As Variant, d As Object, i As Long, itm As Variant 'd is late bound, so no library reference is needed
    arr = dupesCol.Value 'Assumes dupesCol has more than one cell, so arr is a 2-D array
    Set d = CreateObject("Scripting.Dictionary")
    For i = 1 To UBound(arr)
        d(arr(i, 1)) = 0 'Shortcut to add new items to dictionary, ignoring dupes
    Next
    uniquesCol.Resize(d.Count) = Application.Transpose(d.Keys) 'Transpose may fail on very large arrays; see the alternative below
    'Or - Place d itms in new array (resized accordingly), and place array back on Range
    ' ReDim arr(1 To d.Count, 1 To 1)
    ' i = 1
    ' For Each itm In d
    '     arr(i, 1) = itm
    '     i = i + 1
    ' Next
    ' uniquesCol.Resize(d.Count) = arr
End Sub
From: Col A    To: Col B
1              1
2              2
1              3
3
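A Dictionary does not store duplicate keys: assigning d(key) = 0 for a key that already exists simply overwrites its item, whereas d.Add with an existing key would raise an error.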

Related

Do Collections need more computer power than arrays?

Are Collections in VBA less efficient than arrays when it comes to long lists of strings?
My VBA tool is not as fast as I want it to be. I use a lot of collections because I don't have to ReDim, and I also don't have to use additional counter variables.
For example (I want to unite the array a and the collection col in one list, but the tricky part is that for every array element there is a certain number of col elements):
For i = 1 To col.count
colSave.Add "==========================="
colSave.Add a(i - 1)
colSave.Add "==========================="
For k = 1 To colFilter.Item(i).count
colSave.Add col.Item(i).Item(k)
Next k
Next i
Is it more efficient to use an array in this case, with a third counting variable?
Probably the most efficient way is to list the strings in cells on a worksheet then read the range of those cells into an array. This is a very quick method (ranges of 100k cells read in milliseconds on a reasonably fast PC):
Sub test()
Dim a() As Variant
a = Range("A1:A1000").Value
End Sub
a will now contain those strings.
Note that this method produces a two-dimensional, 1-based array, not a zero-based one, so for example the first string in the above example would be at index (1, 1).
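If the strings are not on a worksheet, the loop from the question can also be rewritten to fill a pre-sized array with a running counter. This is a sketch under assumptions: the function name FlattenToArray is made up, a is taken to be a 0-based array of headings, and col a Collection of Collections with one inner Collection per heading. Whether it is noticeably faster depends on the data; index access on a Collection tends to get slower as the collection grows, which is why the inner loop uses For Each.

```vba
' Hypothetical rewrite: a is a 0-based array of headings, col is a
' Collection of Collections (one inner Collection per heading).
Public Function FlattenToArray(a As Variant, col As Collection) As String()
    Dim result() As String
    Dim total As Long, n As Long, i As Long
    Dim entry As Variant

    ' First pass: 3 heading lines per group plus the group's items
    For i = 1 To col.Count
        total = total + 3 + col.Item(i).Count
    Next i
    If total = 0 Then Exit Function
    ReDim result(1 To total)

    ' Second pass: fill the array, n is the running write position
    For i = 1 To col.Count
        n = n + 1: result(n) = "==========================="
        n = n + 1: result(n) = a(i - 1)
        n = n + 1: result(n) = "==========================="
        For Each entry In col.Item(i)
            n = n + 1: result(n) = entry
        Next entry
    Next i

    FlattenToArray = result
End Function
```

Sizing the array once up front avoids repeated ReDim Preserve calls, at the cost of one extra pass over the outer collection.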

Compare and Select ranges based off most up-to-date Reading Date VBA

I am working on an Excel workbook where the user imports text files into a "Data Importation Sheet"; the number of text files imported is dynamic. See image.
So here is what I need to happen:
1) Need to find the most up-to-date Reading Date (in this example 2016)
2) Need to copy and paste the range of Depth values of the most up-to-date Reading Date to a separate sheet (in this example I would want to copy and paste values 1-17.5).
3) Need to check if all other data sets contain this same range of Depth values. For the year 2014 you can see its depth goes from 0.5-17.5. I want to be able to just copy the data at the range of the most up-to-date Reading Date so the range of 1-17.5.
Here is my code to find the most up-to-date Reading date and to copy those depths to the other sheets.
Sub Copy_Depth()
Dim dataws As Worksheet, hiddenws As Worksheet, calcws As Worksheet
Dim tempDate As String, mostRecentDate As String
Dim datesRng As Range, recentCol As Range, headerRng As Range, dateRow As Range, cel As Range
Dim lRow As Long
Dim x As Double
Set dataws = Worksheets("Data Importation Sheet")
Set hiddenws = Worksheets("Hidden2")
Set calcws = Worksheets("Incre_Calc_A")
Set headerRng = dataws.Range(dataws.Cells(1, 1), dataws.Cells(1, dataws.Cells(1, Columns.Count).End(xlToLeft).Column))
'headerRng.Select
For Each cel In headerRng
If cel.Value = "Depth" Then
Set dateRow = cel.EntireColumn.Find(What:="Reading Date:", LookIn:=xlValues, lookat:=xlPart)
Set datesRng = dataws.Cells(dateRow.Row + 1, dateRow.Column)
'datesRng.Select
' Find the most recent date
tempDate = Left(datesRng, 10)
If tempDate > mostRecentDate Then
mostRecentDate = tempDate
Set recentCol = datesRng
End If
End If
Next cel
Dim copyRng As Range
With dataws
Set copyRng = .Range(.Cells(2, recentCol.Column), .Cells(.Cells(2, recentCol.Column).End(xlDown).Row, recentCol.Column))
End With
hiddenws.Range(hiddenws.Cells(2, 1), hiddenws.Cells(copyRng.Rows(copyRng.Rows.Count).Row, 1)).Value = copyRng.Value
calcws.Range(calcws.Cells(2, 1), calcws.Cells(copyRng.Rows(copyRng.Rows.Count).Row, 1)).Value = copyRng.Value
Worksheets("Incre_Calc_A").Activate
lRow = Cells(Rows.Count, 1).End(xlUp).Row
x = Cells(lRow, 1).Value
Cells(lRow + 1, 1) = x + 0.5
End Sub
Any tips/help would be greatly appreciated. I am fairly new to VBA and don't know how to go about comparing the depth ranges! Thanks in advance!
Assuming that your datasets are as regularly organised as your screenshot suggests then quite a lot of processing can be done in Excel.
The image below shows a possible approach based on the data shown in your example.
The approach exploits the fact that each data set occupies 7 columns of the importation worksheet. The =ADDRESS() function is used to build text strings which look like cell addresses and these are further manipulated to create text strings which look like range addresses. The approach also assumes that the reading date is always located in the third row following the final row of depth data.
The solution is slightly different to your problem, in that it identifies the common range of depth values across all datasets. For the example in the question this amounts to the same thing as identifying the depth values associated with the latest reading date.
This approach was taken as it is not clear from the question what would happen if, say, a dataset had depth values starting at 1.5 (so greater than the first value for the latest reading date) or ending at 17 (so less than the last value for the latest reading date). The approach can obviously be adapted if these possibilities will never occur.
The table shown in the image above has in its final column, a text representation of the ranges to be copied from the Data Importation Sheet. A simple bit of VBA can read this column, a cell at a time and use the text to assign an appropriate range object to which copy and paste methods can then be applied.
Additional bit of answer
The image above could be set-up as a "helper" worksheet. If there is always the same number of datasets on the Data Importation Worksheet then set up this helper sheet so that the number of rows in Table 2 is equal to this number of datasets. If the number of datasets is variable, then set up the helper sheet so that the number of rows in Table 2 is equal to the maximum number of datasets that is ever likely to be encountered. In this situation, when the number of datasets imported is fewer than this maximum, some rows of Table 2 will be unused and these unused rows will contain meaningless values in some columns.
Your VBA program should be organised to read the value in cell D2 of the helper sheet and then use this to determine how many rows of Table 2 to examine with the rest of your VBA code. This will allow unused rows (if any) to be ignored.
If your VBA code identifies a value of, say 10, in cell D2 of the helper sheet then you will want your code to read, one at a time, the 10 values in the range Q12:Q21 (so in a loop). Each of these cells holds, as a string, the range containing a single dataset's values and so can be assigned to a Range object using code such as
Set datasetRng = Range(datasetStr)
where datasetStr is the text string read from a cell in Q12:Q21.
Still within the loop, datasetRng can then be copied and pasted to your output worksheet.
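Put together, the loop described above might look roughly like the sketch below. The sheet names "Helper" and "Output", the procedure name, and placing the datasets side by side starting at A1 are assumptions for illustration; cell D2 and the starting cell Q12 come from the layout described in this answer. It also assumes each string in column Q is a plain address (no sheet name) on the Data Importation Sheet, and it transfers values directly instead of using copy and paste.

```vba
' Sketch only: "Helper" and "Output" are hypothetical sheet names.
Public Sub CopyDatasetsFromHelper()
    Dim helperWs As Worksheet, dataWs As Worksheet, outWs As Worksheet
    Dim datasetRng As Range
    Dim datasetStr As String
    Dim nDatasets As Long, i As Long, nextCol As Long

    Set helperWs = Worksheets("Helper")
    Set dataWs = Worksheets("Data Importation Sheet")
    Set outWs = Worksheets("Output")

    nDatasets = helperWs.Range("D2").Value   ' number of rows of Table 2 in use
    nextCol = 1

    For i = 1 To nDatasets
        ' Q12, Q13, ... each hold one dataset's range address as text
        datasetStr = helperWs.Range("Q12").Offset(i - 1, 0).Value
        Set datasetRng = dataWs.Range(datasetStr)

        ' Write the values in one operation, next to the previous dataset
        outWs.Cells(1, nextCol).Resize(datasetRng.Rows.Count, _
            datasetRng.Columns.Count).Value = datasetRng.Value
        nextCol = nextCol + datasetRng.Columns.Count
    Next i
End Sub
```

If formatting has to travel with the data, the value assignment inside the loop could be swapped for datasetRng.Copy with a Destination argument.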
Because the same helper worksheet can be re-used for each data importation, you should be able to incorporate it into your automation scheme. No need for copying and pasting formula down rows to create a different helper for each importation, just apply the same helper template to each data importation.
The approach adopted makes as much use of Excel as possible to determine relevant information about the imported data sets and summarises this information within the helper worksheet. This means VBA can be limited to automating the copy/paste operations on the datasets, and it reads information from the helper sheet to determine what to copy for each dataset.
It is of course possible to do everything in VBA but as you indicated you were fairly new to VBA it seemed sensible to tip the balance towards using less VBA and more Excel.
Incidentally, the problem of comparing the depth ranges is not really one of Excel or programming, it is one of analysis, i.e. looking at a range of cases, figuring out what needs to happen for each case, and distilling this into a set of processing rules (what some would call an algorithm). Only then should attempts be made to implement these processing rules (either via Excel formulas or VBA code). I have hinted at my analysis of the problem (finding the common range of depth values across all datasets) and you should be able to track through how I have implemented this in Excel to cater for cases where some datasets might contain Depth values which are less than the minimum of the common range or which are greater than its maximum (or possibly both).
End of additional bit
The formulas used are shown in the table below.

Find a value from a column and quickly return the row number of its cell

What I have
I have a file with part numbers and several suppliers for each part. There are 1500 parts with around 20 possible suppliers each. For the sake of simplicity let's say parts are listed in column A, with each supplier occupying a column after that. Values under the suppliers are entered manually but don't really matter.
In another sheet, I have a list of parts that is imported from an Access database. The parts list is imported, but not the supplier info. In both cases, each part appears only once.
What I want to do
I simply want to match the supplier info from the first sheet with the parts in the imported list. Right now, I have a function which goes through each part in the list with suppliers, copies the supplier information in an array, finds the part number in the imported part list (there is always a unique match) and copies the array next to it (with supplier info inside). It works. Unfortunately, the find function slows down considerably each time it is used. I know it is the culprit through various tests, and I can't understand why it slows down (starts at 200 loop iterations per second, slows down to 1 per second and Excel crashes) . I may have a leak of some sort? The file size remains 7mb throughout. Here it is:
Function LigneNum(numAHNS As String) As Integer
Dim oRange As Range, aCell As Range
Dim SearchString As String
Set oRange = f_TableMatrice.Range("A1:A1500")
SearchString = numAHNS
Set aCell = oRange.Find(What:=SearchString, LookIn:=xlValues, _
LookAt:=xlPart, SearchOrder:=xlByRows, SearchDirection:=xlNext, _
MatchCase:=False, SearchFormat:=False)
If Not aCell Is Nothing Then
'We have found the number by now:
LigneNum = aCell.Row
Exit Function
Else
MsgBox "Un numéro AHNS n'a pas été trouvé: " & SearchString
Debug.Print SearchString & " not found!"
LigneNum = 0
Exit Function
End If
End Function
The function simply returns the row number on which the value is found, or 0 if it doesn't find it which should never happen.
What I need help with
I'd like either to identify the cause of the slowdown, or find a replacement for the Find method. I have used Find before and it is the first time this happens to me. It was initially taken from Siddharth Rout's website: http://www.siddharthrout.com/2011/07/14/find-and-findnext-in-excel-vba/ What is strange is that it doesn't start slow, it just becomes sluggish as it goes on.
I think using Match could work, or maybe dumping the range to search (the part numbers) into an array and trying to match these with the imported parts number list could work. I am unsure how to do it, but my question is more about which one would be faster (as long as it remains under 15 seconds I don't really care, though, but looping over 1500 items 1500 times right out of the sheet is out of the question). Would anyone suggest match over the array solution / spending more hours fixing my code?
EDIT
Here is the loop it is being called from. I don't think it is problematic:
For Each cellToMatch In rngToMatch
Debug.Print cellToMatch.Row
'The cellsToMatch's values are the numbers I want, rngToMatch is the column where they are.
For i = 2 To nbSup + 1
infoSup(i - 2) = f_TableMatrice.Cells(cellToMatch.Row, i)
Next
'infoSup contains the required supplier data now
'I call the find function here to find the row where the number appears in the imported sheet
'To copy the array nbSup on that line
LigneAHNS = LigneNum(cellToMatch.Value) 'This is the Find function
If LigneAHNS = 0 Then Exit Sub
'This loop just empties the array in the right line.
For i = LBound(infoSup) To UBound(infoSup)
f_symix.Cells(LigneAHNS, debutsuppliers + i) = infoSup(i)
Next
Next
If I replace LigneAHNS = LigneNum by LigneAHNS = 20, for example, the code executes extremely fast. The leak therefore comes from the find function itself.
Another way to do it without using the find function might be something like this. Firstly, put the part IDs and their line numbers into a scripting dictionary. These are really quick to lookup from. Like this:
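Dim Dict As New Scripting.Dictionary 'Needs a reference to Microsoft Scripting Runtime
Dim ColA As Variant
Dim LastRow As Long, i As Long
LastRow = Range("A50000").End(xlUp).Row
ColA = Range("A1:A" & LastRow).Value
For i = 1 To LastRow
    Dict.Add ColA(i, 1), i 'Key: part ID, item: row number
Next i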
To further optimise, you could declare the Dict as a public variable, populate it once, and refer to it many times in your lookups. I expect this would be faster than running a cells.find over a range every time you do a lookup.
For syntax of looking up items in the dictionary, refer to Looping through a Scripting.Dictionary using index/item number
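As a rough illustration, a drop-in replacement for the Find-based function could look like the sketch below. The function name RowOfPart is hypothetical, and Dict is assumed to be the dictionary filled by the loop above (part ID as key, row number as item).

```vba
' Sketch: returns the row stored for partNumber, or 0 if it is not in Dict.
' Dict is assumed to map part IDs to row numbers, as built above.
Public Function RowOfPart(ByVal Dict As Object, ByVal partNumber As Variant) As Long
    If Dict.Exists(partNumber) Then
        RowOfPart = Dict(partNumber)
    Else
        RowOfPart = 0
    End If
End Function
```

Two differences from the original Find call are worth keeping in mind: a dictionary lookup is an exact match (Find was called with xlPart and MatchCase:=False), and keys are type-sensitive, so the number 123 and the string "123" are different keys. If the two sheets store part numbers differently, converting both sides with CStr before adding and looking up is one way to line them up.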
You could achieve this with only Excel cell formulas and no VB if you are willing to devote a separate column to each supplier on your main parts sheet. You could then use conditional formatting to make it more visually appealing. I've tried it with 1500 rows and it's very quick. Increasing it to 5000 rows becomes noticeably slower, but you say you have only 1500 rows for now, so it should be suitable.
On Sheet 1, define a part number column and a separate column for each supplier.
Create a separate sheet for each supplier with all part numbers available from that supplier listed in column A. Make sure the rows on the supplier sheets are ordered by part number.
Name each of the supplier sheets the same as the associated column heading shown on Sheet 1.
Assign the following formula in each cell beneath each supplier column heading on Sheet 1:
=NOT(ISNA(VLOOKUP($A2,INDIRECT("'"&B$1&"'!A:A"),1,FALSE)))
The following screen cap shows this implemented along with conditional formatting to highlight which suppliers have which parts:
If you wanted to show quantities available from suppliers, then you could always have a second column (B) on the supplier sheets containing last known quantities for each part and use VLOOKUP to retrieve column B instead of A.

Outputting rows into another sheet

I have two sets of data stored in two different sheets. I need to run an analysis which prints out the non-duplicate rows (i.e. row is present in one and not the other) found in the sheets and print them in a new sheet.
I can do the comparison fine: it is relatively simple with ranges and a For...Next loop. I currently store the non-duplicates in two different collections, one for each sheet. However, I am having trouble deciding how to proceed with pasting the non-duplicate rows on the new sheet.
I thought about storing the entire row into a collection but printing the row out of the collection in the new sheet seems non-trivial: I would have to determine the size of the collection, set the appropriate range and then iterate through the collection and print them out. I would also like to truncate this data which would add another layer of complexity.
The other method I thought was simply storing the row number and using Range.Select.Copy and PasteSpecial. The advantage of this is that I can truncate however much I wish, however this seems incredibly hacky to me (essentially using VBA to simulate user input) and I am not sure on performance hits.
What are the relative merits or is there a better way?
I have been tackling a similar problem at work this week. I have come up with two methods:
First you could simply iterate through each collection one row at a time, and copy the values to the new sheet:
Sub PasteRows1(ByRef srcRows As Collection, ByRef dst As Worksheet)
    Dim row As Range
    Dim curRow As Long 'Long, because Integer overflows past row 32,767
    curRow = 1
    For Each row In srcRows
        dst.Rows(curRow).Value = row.Value 'Assumes each item is an entire row
        curRow = curRow + 1
    Next
End Sub
This has the benefit of not using the Range.Copy method and so the user's clipboard is preserved. If you are not copying an entire row then you will have to create a range that starts at the first cell of the row and then resize it using Range.Resize. So the code inside the for loop would roughly be:
Dim firstCellInRow As Range
Set firstCellInRow = dst.Cells(curRow, 1)
firstCellInRow.Resize(1, row.Columns.Count).Value = row.Value
curRow = curRow + 1
The second method I thought of uses the Range.Copy. Like so:
Sub PasteRows2(ByRef srcRows As Collection, ByRef dst As Worksheet)
    Dim row As Range
    Dim disjointRange As Range
    For Each row In srcRows
        If disjointRange Is Nothing Then
            Set disjointRange = row
        Else
            Set disjointRange = Union(disjointRange, row)
        End If
    Next
    If disjointRange Is Nothing Then Exit Sub 'Nothing to copy
    disjointRange.Copy Destination:=dst.Range("A1") 'Paste at a known cell rather than the current selection
End Sub
While this does use the .Copy method, it also allows you to copy all of the rows in one shot, which is nice because you will avoid partial copies if Excel ever crashes in the middle of your macro.
Let me know if either of these methods satisfy your needs :)
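A third variation, not from the answer above but in the same spirit, is to gather the truncated rows into one 2-D array and write it to the sheet in a single assignment. The procedure name and the keepCols parameter are hypothetical; srcRows is assumed to hold one-row Range objects, as in the two methods above.

```vba
' Sketch: srcRows holds Range objects (one row each); keepCols is the
' number of leading columns to keep from every row.
Public Sub PasteRowsViaArray(ByVal srcRows As Collection, ByVal dst As Worksheet, _
                             ByVal keepCols As Long)
    Dim out() As Variant, vals As Variant
    Dim r As Range
    Dim i As Long, c As Long

    If srcRows.Count = 0 Or keepCols < 1 Then Exit Sub
    ReDim out(1 To srcRows.Count, 1 To keepCols)

    For Each r In srcRows
        i = i + 1
        vals = r.Resize(1, keepCols).Value      ' one read per source row
        If keepCols = 1 Then
            out(i, 1) = vals                    ' a single cell returns a scalar, not an array
        Else
            For c = 1 To keepCols
                out(i, c) = vals(1, c)
            Next c
        End If
    Next r

    ' One write for the whole block; the clipboard is not touched
    dst.Range("A1").Resize(srcRows.Count, keepCols).Value = out
End Sub
```

This keeps the clipboard-free behaviour of the first method while reducing the writes to the destination sheet to one.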

Iterating 100 cells takes too long

In my excel VBA code, I need to move some data from a range to another sheet.
As of now, I'm iterating through the range and copying the values like this:
For offset = 0 To 101
ActiveWorkbook.Sheets(Sheet).Range("C3").offset(offset, 0).Value = ActiveSheet.Range("D4").offset(offset, 0).Value
Next offset
However, it takes almost a minute to iterate and copy the values for the 100 cells.
Would I be better off using Copy-Paste programmatically, or is there a way to copy the entire range at once? Something like:
ActiveWorkbook.Sheets(Sheet).Range("C3:C102").Value = ActiveSheet.Range("D4:D104").Value
You can read the entire range at once into a Variant array, and then write it back to another range. This is also quick, flickerless, and has the added bonus that you can code some operations on the data if you are so inclined.
Dim varDummy As Variant
varDummy = ActiveSheet.Range("D4:D104").Value
' Can insert code to do stuff with varDummy here
ActiveWorkbook.Sheets(Sheet).Range("C3:C103").Value = varDummy
This I learned the hard way: Avoid Copy/Paste if at all possible! Copy and Paste use the clipboard. Other programs may read from / write to the clipboard while your code is running, which will cause wild, unpredictable results.
Also, it's generally a good idea to minimize the number of interactions between VBA and Excel, because they are slow. Having such interactions in a loop is multiply slow.
So, silly me did not try before posting here. Apparently, I can move data for an entire range this way (both ranges need to be the same size):
ActiveWorkbook.Sheets(Sheet).Range("C3:C103").Value = ActiveSheet.Range("D4:D104").Value
It's as fast as copy-paste, without the switching of sheets.
Iterating through the range using a For loop takes about 45 s for 100 cells, while the above two options are instant.
You can speed up code and stop flickering with:
Application.ScreenUpdating = False
'YOUR CODE
Application.ScreenUpdating = True
More: http://www.ozgrid.com/VBA/excel-macro-screen-flicker.htm
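One caveat with this pattern: if the code in the middle raises an error, ScreenUpdating stays off. A slightly more defensive wrapper is sketched below. The procedure name is made up, and switching off calculation and events as well is an addition that goes beyond the snippet above, so drop those lines if your code relies on recalculation or event handlers.

```vba
' Sketch: restores the application settings even if the work in the middle fails.
Public Sub RunWithFastSettings()
    Dim prevCalc As XlCalculation

    prevCalc = Application.Calculation
    On Error GoTo CleanUp

    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual   ' addition, not in the snippet above
    Application.EnableEvents = False                ' addition, not in the snippet above

    ' YOUR CODE

CleanUp:
    Application.EnableEvents = True
    Application.Calculation = prevCalc
    Application.ScreenUpdating = True
    If Err.Number <> 0 Then
        MsgBox "Error " & Err.Number & ": " & Err.Description
    End If
End Sub
```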
Columns("A:Z").Select
Selection.Copy
Sheets("Sheet2").Select
Range("A1").Select
ActiveSheet.Paste
That will copy columns A to Z from Sheet 1 to Sheet 2. This was generated by recording the macro. You can also apply it to ranges with something like this:
Range("D4:G14").Select
Selection.Copy
Sheets("Sheet2").Select
Range("D4").Select
ActiveSheet.Paste
Is this something like what you're after?
If you need anything specific and you can do it manually (e.g. copy and paste), record the macro to get the VBA code for it.
Copying and pasting has a decent amount of overhead in VBA, as does dealing with ranges like that. It's been a while since I have done VBA, but if I recall correctly the fastest way to do something like this is to write the values you want into an array and then use the Resize function. So something like this:
Option Base 0 'Must be at the top of the module, before any procedure
Dim firstrow As Long
Dim lastrow As Long
Dim valuesArray() As Variant 'Variant, so any cell content can be stored
Dim i As Long
'Set firstrow and lastrow however you deem appropriate
...
'Subtracting first row from last row gets you the upper bound of the 0-based array
ReDim valuesArray(lastrow - firstrow)
For i = 0 To (lastrow - firstrow)
    valuesArray(i) = Cells(i + firstrow, COLUMNNUMBER).Value
Next i
Of course replace COLUMNNUMBER with whatever column it is you are iterating over. This should fill your array with your desired values. Then pick your destination cell and use Resize to put the values in. So if your destination cell is D4:
Range("D4").Resize(UBound(valuesArray) + 1, 1).Value = Application.Transpose(valuesArray)
That writes all the values in the array starting at D4 and going down as many cells as there are elements in the array (a one-dimensional array has to be transposed to fill a column, and Resize needs a column count of 1, not 0). Slightly more complicated, but if you are going for speed I don't think I have ever come up with anything faster. Also, I did this off the top of my head, so please test and make sure that you don't cut off a cell here and there.
That OZGrid page has very useful info - http://www.ozgrid.com/VBA/SpeedingUpVBACode.htm
In my case, I need the formatting to be copied as well so I have been using this:
Sheet1.Range("A1:A200").Copy Destination:=Sheet2.Range("B1")
but was still having very slow execution, to the point of locking up the application. I finally found the problem: at some point in the past a number of empty text boxes got into my page, and while they were copied each time my code ran, they were not erased by my code to clear the working area. The result was something like 4,500 empty text boxes, each of which was copied and pasted by even the code above.
If you use Edit - Go To... - Special, then choose Objects, and nothing gets selected, that is good. If you see a bunch of objects that you were not aware of on your page, that is not good.
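The same check can be done from code. The sketch below is an illustration, not the poster's own code: it reports how many shapes sit on the active sheet and deletes the text boxes that contain no text. It only looks at shapes of type msoTextBox, so other object types (form controls, pictures) are left alone. Since deleting is destructive, try it on a copy of the workbook first.

```vba
' Sketch: count the shapes on the active sheet and remove empty text boxes.
Public Sub ReportAndRemoveEmptyTextBoxes()
    Dim ws As Worksheet
    Dim shp As Shape
    Dim i As Long, removed As Long

    Set ws = ActiveSheet
    Debug.Print "Shapes on "; ws.Name; ": "; ws.Shapes.Count

    ' Loop backwards so deleting a shape does not shift the remaining indexes
    For i = ws.Shapes.Count To 1 Step -1
        Set shp = ws.Shapes(i)
        If shp.Type = msoTextBox Then
            If Len(shp.TextFrame2.TextRange.Text) = 0 Then
                shp.Delete
                removed = removed + 1
            End If
        End If
    Next i

    Debug.Print "Empty text boxes removed: "; removed
End Sub
```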