How to identify the indices for the nth smallest values in a multi-dimensional array in VBA?

I currently have a 3-dimensional array full of different values. I would like to find the indices corresponding to the "nth" smallest values in the array. For example, if the 3 smallest values were 0.1, 0.2, and 0.3, I would like to see, in order, the indices for these values. Any help would be greatly appreciated.

A possible way to approach this would be to add the original indices as an extra dimension of your 3D array, then sort by the values to find the smallest items and return their original indices. Take a look at this: VBA array sort function?
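A minimal sketch of that idea (written in Python here only to illustrate the approach, since it is language-agnostic; the array a and the count n are placeholder names): pair each value with its indices, sort the pairs by value, and keep the indices of the first n entries.
import numpy as np

def n_smallest_indices(a, n):
    # Pair every value with its (i, j, k) index, sort by value,
    # and return the indices of the n smallest values in order.
    pairs = [(a[i, j, k], (i, j, k))
             for i in range(a.shape[0])
             for j in range(a.shape[1])
             for k in range(a.shape[2])]
    pairs.sort(key=lambda p: p[0])
    return [idx for _, idx in pairs[:n]]

a = np.random.rand(4, 3, 2)
print(n_smallest_indices(a, 3))   # indices of the 3 smallest values, in order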

Related

Fastest way to find two minimum values in each column of NumPy array

If I want to find the minimum value in each column of a NumPy array, I can use the numpy.amin() function. However, is there a way to find the two minimum values in each column that is faster than sorting each column?
You can simply use np.partition along the columns to get the smallest N numbers:
N = 2
np.partition(a,kth=N-1,axis=0)[:N]
This doesn't actually sort the entire array; it simply partitions it into two sections so that the smallest N numbers end up in the first section. This is also called a partial sort.
Bonus (getting the top N elements): Similarly, to get the largest N numbers per column, simply use a negative kth value:
np.partition(a,kth=-N,axis=0)[-N:]
Along other axes and higher dim arrays
To use it along other axes, change the axis value. So, along rows, it would be axis=1 for a 2D array, and the same extends to higher-dimensional ndarrays.
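For example, a minimal runnable version of the above, with a made-up sample array:
import numpy as np

a = np.array([[7, 2, 9],
              [1, 8, 4],
              [5, 3, 6],
              [0, 5, 1]])

N = 2
smallest = np.partition(a, kth=N - 1, axis=0)[:N]   # rows 0..N-1 hold each column's N smallest values
largest = np.partition(a, kth=-N, axis=0)[-N:]      # last N rows hold each column's N largest values
print(smallest)   # note: within those rows the values are not necessarily sorted
print(largest)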
Use the min() method, and specify the axis you want to take the minimum over:
a = np.random.rand(10,3)
a.min(axis=0)
gives:
array([0.04435587, 0.00163139, 0.06327353])
a.min(axis=1)
gives:
array([0.01354386, 0.08996586, 0.19332211, 0.00163139, 0.55650945,
0.08409907, 0.23015718, 0.31463493, 0.49117553, 0.53646868])

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find out the % of null values each column possesses in the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
But I did not understand how this code works. Can anyone please explain it?
The code is doing the following:
xyz.drop( [...], 1)
removes the specified elements along a given axis, either by row or by column. In this particular case, df.drop( ..., 1) means you're dropping along axis 1, i.e., columns.
xyz.loc[:, ... ].columns
will return a list with the column names resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
This instruction counts the nulls in each column and normalizes by the number of rows, effectively computing the percentage of NaN in each column. Then the amount is rounded to 2 decimal places, and finally you get True if the percentage of NaN is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you're first producing a Boolean array that marks which columns have more than 70% NaN; then, using .loc, you use Boolean indexing to look only at the columns you want to drop (NaN % > 70%); then, using .columns, you recover the names of those columns, which are then passed to the .drop instruction.
Hopefully this clears things up!
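As an illustration, here is the same pipeline step by step on a small hypothetical frame (column names and values are made up):
import numpy as np
import pandas as pd

xyz = pd.DataFrame({'abc': [1, 2, np.nan, 4],
                    'def': [np.nan, 2, np.nan, 4],
                    'ghi': [np.nan, np.nan, np.nan, 4]})   # 'ghi' is 75% null

null_pct = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)
print(null_pct)                      # abc 25.0, def 50.0, ghi 75.0

mask = null_pct > 70                 # Boolean Series: True only for 'ghi'
cols_to_drop = xyz.loc[:, mask].columns
xyz = xyz.drop(cols_to_drop, axis=1)
print(xyz.columns.tolist())          # ['abc', 'def']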
If your code is hard to understand, you can just use dropna with thresh, since pandas already covers this case: with axis=1, thresh is the minimum number of non-null values a column needs in order to be kept, so requiring roughly 30% non-null values drops any column that is more than about 70% null.
df=df.dropna(axis=1,thresh=round(len(df)*0.3))

How to find the set difference of two sorted arrays in numpy?

I would like to use an array rows for indexing rows of another array x. Initially, rows contains the indices of all rows of x (and is therefore sorted). Throughout the program, some indices, collected in exclude, are chosen to be removed from rows. Like rows itself, exclude is a sorted array.
What is the best way of finding the set difference of rows and exclude?
I have thought of a few different options, but I think their complexities are more than O(n + m), where n is the length of rows and m is the length of exclude.
1) new_rows = [r for r in rows if r not in exclude]
This solution requires looking up exclude for every element of rows and therefore has O(mn) complexity.
2) new_rows = setdiff1d(rows, exclude, assume_unique=True)
This will probably take O(n log m), but I'm not sure.
3) Convert exclude to a dict and run 1). The problem with this approach is that it requires extra memory, but it meets the complexity requirement.
Here are outlines of two O(n+m) options:
1) heapq.merge will combine two sorted sequences in linear time. As the combined sequence is sorted, shared indices will sit next to each other.
2) As rows, as you describe it, is a "thinned out range", I assume that the max value of rows is not excessively large. You can therefore allocate an array E of that size (O(1) if we don't initialize it, i.e. use np.empty). Then you use rows and exclude to index into that array: for example, you write E[rows] = 1 and then E[exclude] = 0, and afterwards check E[rows] and remove all elements of rows at which E has changed from 1 to 0 (see the sketch below).
Option 2 also works if the two sets are not sorted.
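Here is a minimal sketch of option 2), assuming, as the question implies, that exclude is a subset of rows and that the maximum index is not excessively large; the sample arrays are made up:
import numpy as np

def sorted_setdiff(rows, exclude):
    # Marker array covering the full index range; np.zeros is used here for
    # clarity, although np.empty would avoid the initialization cost.
    E = np.zeros(rows.max() + 1, dtype=np.int8)
    E[rows] = 1                  # mark every index currently in rows
    E[exclude] = 0               # unmark the indices to remove
    return rows[E[rows] == 1]    # keep the rows whose marker survived

rows = np.array([0, 1, 2, 3, 4, 5, 6])
exclude = np.array([2, 5])
print(sorted_setdiff(rows, exclude))   # -> [0 1 3 4 6]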

How can I select the nth element from an array's 2nd dimension?

I have a 2-dimensional int array, and I'd like to get the 2nd element from every array in the 2nd dimension. So for example, I'd like to get 2, 4, and 6 from the array literal '{{1,2},{3,4},{5,6}}'. Is this possible? I've searched the docs but I haven't found anything that can do what I want.
unnest(arr[:][2:2]) will give you a table expression for what you want (where arr is the name of your array column).
If you want to get a 1-dimensional array of those elements, you can use array(select * from unnest(arr[:][2:2])) (because arr[:][2:2] is still a 2-dimensional one).
http://rextester.com/VLOJ18858

Speed Enhancements for a Sorted Vector in MATLAB

What is the fastest way to lookup the index of a value in sorted vector in MATLAB?
That is, is there a fast find(vector == myNumber, 1, 'first') for when vector is sorted?
I have a large matrix (200,000 x 4) of locations, each with a unique integer ID recorded in the first column. I want to find the location of a known ID, but thousands of searches can take a little while.
If you use ismembc2, the loc output should give you what you need. See this for more details:
http://www.mathworks.com/support/solutions/en/data/1-9NIE1N/index.html?product=ML&solution=1-9NIE1N
There are a number of submissions for this on FEX: http://www.mathworks.com/matlabcentral/fileexchange/?term=binary+search+vector
I do not know if it is faster, but you may want to try
result=vector(vector(:,1)==myNumber,:)
result will contain the 4-element row for which the first column of vector == myNumber