I'm trying to calculate the 99.5th percentile for a data set of 100,000 values in an array (arr1) within VBA, using the Percentile function as follows:
Pctile = Application.WorksheetFunction.Percentile(arr1, 0.995)
Pctile = Application.WorksheetFunction.Percentile_Inc(arr1, 0.995)
Neither works and I keep getting a type mismatch (13).
The code runs fine if I limit the array size to a maximum of 65,536 elements. As far as I was aware, calculation has been limited only by available memory since Excel 2007, and array sizes when passing to a macro have been limited only by available memory since Excel 2000.
I'm using Excel 2010 on a high-performance server. Can anyone confirm this problem exists? Assuming so, I figure my options are to build a VBA function to calculate the percentile 'manually', or to output to a worksheet, calculate it there and read it back. Are there any alternatives, and which would be quickest?
The error occurs if arr1 is one-dimensional and has more than 65,536 elements (see Charles' answer in Array size limits passing array arguments in VBA). Dimension arr1 as a two-dimensional array with a single column:
Dim arr1(1 to 100000, 1 to 1)
This works in Excel 2013. Based on Charles' answer, it appears that it will work with Excel 2010.
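For example, a minimal sketch of that workaround (the Rnd fill is only placeholder data; replace it with your real values):

Dim arr1(1 To 100000, 1 To 1) As Double
Dim i As Long, Pctile As Double

'Fill the single-column array (placeholder values)
For i = 1 To 100000
    arr1(i, 1) = Rnd
Next i

'Percentile_Inc now accepts the two-dimensional array without a type mismatch
Pctile = Application.WorksheetFunction.Percentile_Inc(arr1, 0.995)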
Here is a Classic VBA example that mimics the Excel Percentile function.
Percentile and Confidence Level (Excel-VBA)
In light of Jean pointing out that the Straight Insertion method is inefficient, I've edited this answer with the following:
I read that QuickSelect performs well on large data sets and is quite efficient at doing so.
References:
Wikipedia.org: Quick Select
A C# implementation can be found at Fast Algorithm for computing percentiles to remove outliers, which should be easy to convert to VB.
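For illustration only (this is not the linked C# code, just a hedged VBA sketch of the same idea), a QuickSelect that returns the k-th smallest value of a 1-D Double array in expected linear time, reordering the array as it goes:

Public Function QuickSelect(a() As Double, ByVal k As Long) As Double
    'Returns the k-th smallest element (k is 1-based); the array is partially reordered
    Dim lo As Long, hi As Long, i As Long, j As Long, target As Long
    Dim pivot As Double, tmp As Double

    lo = LBound(a): hi = UBound(a)
    target = LBound(a) + k - 1

    Do While lo < hi
        pivot = a((lo + hi) \ 2)
        i = lo: j = hi
        'Hoare partition around the pivot
        Do While i <= j
            Do While a(i) < pivot
                i = i + 1
            Loop
            Do While a(j) > pivot
                j = j - 1
            Loop
            If i <= j Then
                tmp = a(i): a(i) = a(j): a(j) = tmp
                i = i + 1: j = j - 1
            End If
        Loop
        If target <= j Then
            hi = j          'k-th value lies in the left part
        ElseIf target >= i Then
            lo = i          'k-th value lies in the right part
        Else
            Exit Do         'already in its final position
        End If
    Loop

    QuickSelect = a(target)
End Function

To mimic PERCENTILE.INC with it, compute rank = 1 + p * (n - 1), take the two neighbouring order statistics and interpolate linearly between them.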
Related
I have a process built in Excel and am trying to transfer it over to SQL for efficiency purposes. I have figured much of it out, including random number generation and pieces involving exponential distributions.
The one piece I am unable to figure out in SQL is how to get the inverse of the lognormal distribution.
This is how the formula looks in Excel:
lognorm.inv(p,mu,sigma)
I have the equivalent to p, mu and sigma set up in my data and am trying to find the SQL equivalent (or workaround) to the lognorm.inv() Excel function.
Has anyone done something like this before?
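For what it's worth, the identity behind the Excel function is LOGNORM.INV(p, mu, sigma) = EXP(mu + sigma * NORM.S.INV(p)), so the only piece that actually needs replicating on the SQL side is an inverse standard-normal CDF; if your SQL dialect doesn't provide one built in, it would have to come from an approximation or a pre-computed lookup.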
I have an issue of speed. (Apologies for the long post…). I am using Excel 2013 and 2016 for Windows.
I have a workbook that performs 10,000+ calculations on a 200,000 cell table (1000 rows x 200 columns).
Each calculation returns an integer (e.g. count of filtered rows) or more usually a percentage (e.g. sum of value of filtered rows divided by sum of value of rows). The structure of the calculation is variations of the SUMPRODUCT(COUNTIFS()) idea, along the lines of:
=IF($B6=0,
0,
SUMPRODUCT(COUNTIFS(
Data[CompanyName],
CompanyName,
Data[CurrentYear],
TeamYear,
INDIRECT(VLOOKUP(TeamYear&"R2",RealProgress,2,FALSE)),
"<>"&"",
Data[High Stage],
NonDom[NonDom]
))
/$B6
)
Explaining above:
The pair Data[CompanyName] and CompanyName is the column in the table and the condition value for the first filter.
The pair Data[CurrentYear] and TeamYear are the same as above and constitute the second filter.
The third pair looks up an intermediary table and returns the name of a column; the condition ("<>"&"") is 'not blank', i.e. it returns all rows that have a value in this column.
Finally, the fourth pair is similar to the third above, but the condition is a set of values, so it returns the rows whose Data[High Stage] value matches any of the values in NonDom[NonDom].
Lastly, the four filters are joined together with AND statements.
It is important to note that the same SUMPRODUCT(COUNTIFS()) principle is applied across all the calculations; however, there are many variations on this theme.
At present, using Calculate on a select range of sheets (rather than the slower recalculation of the whole workbook) yields a calculation time of around 30-40 seconds. Not bad, and tolerable, as calculations aren't performed all the time.
Unfortunately, the model is to be extended and could now approach 20,000 rows rather than 1,000 rows. Calculation performance is directly linked to the number of rows or cells, therefore I expect performance to plummet!
The obvious solution [1] is to use arrays, ideally passing an array, held in memory, to the formula in the cell and then processing it along with the filters and their conditions (the lookup filters being arrays too).
The alternative solution [2] is to write a UDF using arrays, but reading around the internet the opinion is that UDFs are much slower than native Excel functions.
Two questions:
Is solution [1] possible, and the best way of doing this, and if so how would I construct it?
If solution [1] is not possible or not the best way, does anyone have any thoughts on how much quicker solution [2] might be compared with my current solution?
Are there other better solutions out there? I know about Power BI Desktop, PowerPivot and PowerQuery – however this is a commercial application for use by non-Excel users and needs to be presented in the current Excel ‘grid’ form of rows and columns.
Thanks so much for reading!
Addendum: I'm going to try running an array calculation for each sheet on the Worksheet.Activate event and see if there's some time savings.
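To make alternative [2] concrete, here is a rough sketch of the kind of UDF I have in mind (all table, column and parameter names below are made up); it reads the ListObject into a Variant array once per call and applies the filters in VBA rather than via SUMPRODUCT(COUNTIFS()):

Public Function CountFilteredRows(companyName As String, teamYear As String, _
                                  stageColumn As String) As Long
    Dim lo As ListObject, data As Variant
    Dim colCompany As Long, colYear As Long, colStage As Long
    Dim i As Long, n As Long

    Set lo = ThisWorkbook.Worksheets("Data").ListObjects("Data")
    data = lo.DataBodyRange.Value                 'one read of the whole table
    colCompany = lo.ListColumns("CompanyName").Index
    colYear = lo.ListColumns("CurrentYear").Index
    colStage = lo.ListColumns(stageColumn).Index  'column chosen per TeamYear

    For i = 1 To UBound(data, 1)
        If data(i, colCompany) = companyName _
           And data(i, colYear) = teamYear _
           And Len(data(i, colStage)) > 0 Then    'the "<>"&"" (not blank) condition
            n = n + 1
        End If
    Next i
    CountFilteredRows = n
End Function

In practice the array would need to be cached (e.g. in a module-level variable invalidated when the data changes) so that each of the 10,000+ calls doesn't re-read the whole table.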
Writing data to arrays is normally a good idea if you are looking to increase speed. It is done like this:
Dim myTable As ListObject
Dim myArray As Variant
'Point the table variable at the ListObject
Set myTable = ActiveSheet.ListObjects("Table1")
'Read the table's data body into a 1-based, two-dimensional Variant array
myArray = myTable.DataBodyRange
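Going the other way, once the array has been processed it can be written back to the sheet in a single operation (the destination range here is just an example):

'Write the (1-based, two-dimensional) array back in one block
ActiveSheet.Range("E2").Resize(UBound(myArray, 1), UBound(myArray, 2)).Value = myArray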
I need to do a VLOOKUP from another workbook on about 400,000 cells with VBA. These cells are all in one column, and the results shall be written into one column. I already know how the VLOOKUP works, but my runtime is much too high when using AutoFill. Do you have a suggestion for how I can improve it?
Don't use VLookup; use Index/Match: http://www.randomwok.com/excel/how-to-use-index-match/
If you are able to adjust what the data looks like a slight amount, you may be interested in using a binary search. It's been a while since I last used one (writing code for a group exercise check-in program). https://www.khanacademy.org/computing/computer-science/algorithms/binary-search/a/implementing-binary-search-of-an-array was helpful in setting up the idea behind it.
If you are able to sort the data in some order, say by last name (I'm not sure what data you are working with), then add an index of numbers to use for the binary search.
Edit:
The reasoning for a binary search is the computational time it takes. The number of iterations it would take is log2(400000) rather than 400000. So instead of up to 400,000 possible iterations, it would take at most 19 with a binary search; as you can see, the more data you use, the bigger the speed-up a binary search yields.
This would only be beneficial if you are able to manipulate the data in such a way that allows you to use a binary search (see the sketch below).
So, if you can give us a bit more background on what data you are using and any restrictions you have with that data, we would be able to give more constructive feedback.
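To make that concrete, here is a hedged sketch of a binary-search lookup in VBA (the two-column layout is an assumption): the lookup table is read into a Variant array sorted ascending on its first column, and the function returns the matching value from the second column, or #N/A if the key is absent.

Public Function BinaryLookup(key As Variant, tbl As Variant) As Variant
    Dim lo As Long, hi As Long, m As Long
    lo = LBound(tbl, 1): hi = UBound(tbl, 1)
    BinaryLookup = CVErr(xlErrNA)        'mimic VLOOKUP's #N/A when the key is missing
    Do While lo <= hi
        m = (lo + hi) \ 2
        If tbl(m, 1) = key Then
            BinaryLookup = tbl(m, 2)
            Exit Function
        ElseIf tbl(m, 1) < key Then
            lo = m + 1
        Else
            hi = m - 1
        End If
    Loop
End Function

Rather than AutoFilling formulas down the column, you would read the lookup table and the 400,000 source cells into arrays, call this in a loop, and write the results back to the sheet in one block.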
In an MS Access 2007 app, which manages contracts and changes for large construction projects, I need to create a Bell Curve representing a Contract Value, over a time period.
For example, a $500m contract runs for, say, 40 months, and I need a Bell Curve that distributes the Contract Value over these 40 months. The idea is to present a starting point for cashflow projections during the life of the contract.
Using VBA, I had thought to create the 'monthly' values and store them in a temp table, for later use in a report chart. However, I'm stuck trying to work out an algorithm.
Any suggestions on how I might tackle this would be most appreciated.
You will need the =NORMSDIST() function borrowed from Excel as follows:
'Requires a reference to the Microsoft Excel Object Library (Tools > References)
Public Function Normsdist(X As Double) As Double
    Normsdist = Excel.WorksheetFunction.Normsdist(X)
End Function
It requires some knowledge of statistics to use this function to distribute a cash flow over x periods, assuming a standard normal distribution. I created an Excel sheet to demonstrate how this function is used and posted it here:
Normal Distribution of a cash flow sample .XLSX
If for some reason you hate the idea of using an Excel function, you can pull out any statistics text or search for the formula that generates a series of normal values. In your case you want to distribute the cash flow over three standard deviations in each tail, so that's a total of six (6) standard deviations. Dividing 40 months over 6 standard deviations gives 6/40 = 0.15 standard deviations per data point (month). Use a For/Next/Step or similar loop to generate that into a temporary table, as you suggested, and graph it with a column chart (as seen in the Excel example above). It will take just a little VBA coding to make this adjustable for a user-supplied number of months and total contract value.
A standard-normal distribution has a mean of 0 and a standard deviation of 1. If you want a flatter bell curve, you can use the NormDist function instead, where you can specify the mean and standard deviation.
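As a hedged sketch of that loop (the procedure name is made up and Debug.Print stands in for the insert into your temp table), slicing the contract value into monthly bands of the standard-normal curve between -3 and +3 standard deviations could look like this; note the +/-3 window covers only about 99.7% of the area, so the slices sum to slightly less than the full value and may need rescaling:

Public Sub DistributeContract(ByVal contractValue As Double, ByVal months As Long)
    Dim i As Long
    Dim stepZ As Double, zPrev As Double, zCurr As Double, monthShare As Double

    stepZ = 6# / months            'six standard deviations spread across all months
    zPrev = -3#
    For i = 1 To months
        zCurr = -3# + i * stepZ
        'share of the contract value falling in this month's slice of the curve
        monthShare = contractValue * (Normsdist(zCurr) - Normsdist(zPrev))
        Debug.Print i, Format(monthShare, "#,##0")
        zPrev = zCurr
    Next i
End Sub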
I'm currently writing code to combine two worksheets containing different versions of data.
To do this, I first want to sort both via a key column, combine them and subsequently mark changes between the versions in the output worksheet.
As the data already amounts to several tens of thousands of lines and might some day exceed Excel's rows-per-worksheet limit, I want these calculations to run outside of a worksheet. It should also perform better.
Currently I'm thinking of a quicksort of the first and second data sets and then comparing them per key/line, using the result of the comparison to subsequently format the cells accordingly.
Question
I'd just love to know whether I should use:
List OR Array OR Collection OR Dictionary
OF Lists OR Arrays OR Collections OR Dictionaries
I have as of now been unable to determine the differences in codability and performance between these 16 possibilities. Currently I'm implementing an Array OF Arrays approach, constantly wondering whether this makes sense at all.
Thanks in advance, appreciate your input and wisdom!
Some time ago, I had the same problem with a client's macro. In addition to the really big number of rows (over 50,000 and growing), it had the problem of becoming tremendously slow from a certain row number (around 5,000) when a "standard approach" was taken, that is, when the inputs for the calculations on each row were read from the same worksheet (a couple of rows above); this process of reading and writing was what made it slower and slower (apparently, Excel starts from row 1, and the lower the row, the longer it takes to reach it).
I improved this situation with two different changes: first, setting a maximum number of rows per worksheet; once reached, a new worksheet was created and the reading/writing continued there (from the first rows). The other change was moving the reading in Excel to reading from temporary .txt files, and only writing to Excel (all the lines were read right at the start to populate the files). These two modifications improved the speed a lot (from half an hour to a couple of minutes).
Regarding your question, I wouldn't rely too much on arrays with a macro (although I am not sure how much information each of these 10,000 lines contains), but I guess that this is a personal decision. I don't like collections too much because they are less efficient than arrays, and the same goes for dictionaries.
I hope that this "short" comment will be of some help.
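For what it's worth, here is a minimal sketch of one of those combinations, two-dimensional arrays plus a Scripting.Dictionary index (sheet names, the key column and the compared column are all hypothetical), which matches the two versions without sorting at all:

Sub CompareVersions()
    Dim oldData As Variant, newData As Variant
    Dim idx As Object, i As Long, r As Long

    oldData = ThisWorkbook.Worksheets("VersionA").UsedRange.Value
    newData = ThisWorkbook.Worksheets("VersionB").UsedRange.Value

    'Index the new version by its key column (assumed to be column 1, row 1 = headers)
    Set idx = CreateObject("Scripting.Dictionary")
    For i = 2 To UBound(newData, 1)
        idx(CStr(newData(i, 1))) = i
    Next i

    'Walk the old version and flag differences in column 2 (example column)
    For i = 2 To UBound(oldData, 1)
        If idx.Exists(CStr(oldData(i, 1))) Then
            r = idx(CStr(oldData(i, 1)))
            If oldData(i, 2) <> newData(r, 2) Then
                ThisWorkbook.Worksheets("VersionA").Cells(i, 2).Interior.Color = vbYellow
            End If
        Else
            'key only exists in the old version - treat as removed
        End If
    Next i
End Sub

Whether the Dictionary overhead matters at tens of thousands of rows is worth profiling; sorting both arrays and walking them merge-style avoids it entirely, at the cost of implementing the sort.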