Fastest way to find two minimum values in each column of NumPy array

If I want to find the minimum value in each column of a NumPy array, I can use the numpy.amin() function. However, is there a way to find the two minimum values in each column that is faster than sorting each column?

You can simply use np.partition along the columns to get the smallest N numbers -
N = 2
np.partition(a,kth=N-1,axis=0)[:N]
This doesn't actually sort the entire array; it simply partitions it into two sections such that the smallest N numbers end up in the first section. This is also known as a partial sort.
Bonus (getting the top N elements): Similarly, to get the top N numbers per column, simply use a negative kth value -
np.partition(a,kth=-N,axis=0)[-N:]
Along other axes and higher dim arrays
To use it along other axes, change the axis value: along rows it would be axis=1 for a 2D array, and the same pattern extends to higher-dimensional ndarrays.
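As a concrete illustration, here is a minimal runnable sketch (the example array is arbitrary). Note that the N values returned per column are the smallest N, but they are not guaranteed to be sorted among themselves:

import numpy as np

a = np.array([[5, 1, 9],
              [2, 8, 3],
              [7, 4, 6],
              [0, 2, 1]])

N = 2
# Two smallest values per column (order within the two rows is not guaranteed)
smallest = np.partition(a, kth=N-1, axis=0)[:N]
# Two largest values per column
largest = np.partition(a, kth=-N, axis=0)[-N:]
print(smallest)  # e.g. [[0 1 1], [2 2 3]]
print(largest)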

Use the min() method, and specify the axis you want to take the minimum over:
a = np.random.rand(10,3)
a.min(axis=0)
gives:
array([0.04435587, 0.00163139, 0.06327353])
a.min(axis=1)
gives
array([0.01354386, 0.08996586, 0.19332211, 0.00163139, 0.55650945,
0.08409907, 0.23015718, 0.31463493, 0.49117553, 0.53646868])
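Note that extending this answer to the two smallest values per column requires a full sort, which is exactly the cost the original question hopes to avoid; as a baseline for comparison with np.partition:

import numpy as np

a = np.random.rand(10, 3)
# Full sort per column (O(n log n)), then keep the first two rows
two_smallest = np.sort(a, axis=0)[:2]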

Related

Subtract the mean of a group for a column away from a column value

I have a companies dataset with 35 columns. The companies can belong to one of 8 different groups. How do I, for each group, create a new dataframe that subtracts the mean of each column for that group from the original values?
Here is an example of part of the dataset.
So for example for row 1 I want to subtract the mean of BANK_AND_DEP for Consumer Markets away from the value of 7204.400207. I need to do this for each column.
I assume this is some kind of combination of a transform and a lambda, but I can't get the syntax right.
Although it might seem counter-intuitive for this to involve a loop at all, looping through the columns themselves lets you do this as a vectorized operation, which will be quicker than .apply(). To work out what to subtract, combine .groupby() and .transform() to get the per-group mean of a column. Then, just subtract it.
for column in df.columns:
    df['new_' + column] = df[column] - df.groupby('Cluster')[column].transform('mean')
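For illustration, a minimal runnable version with a made-up miniature dataset (the values and the second group name are assumptions; BANK_AND_DEP and Cluster come from the question and the snippet above):

import pandas as pd

# Hypothetical miniature of the companies dataset
df = pd.DataFrame({
    'Cluster': ['Consumer Markets', 'Consumer Markets', 'Banking'],
    'BANK_AND_DEP': [7204.400207, 1000.0, 500.0],
})

# Skip the grouping column itself when looping over columns
for column in df.columns.drop('Cluster'):
    df['new_' + column] = df[column] - df.groupby('Cluster')[column].transform('mean')

print(df)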

How to find the set difference of two sorted arrays in numpy?

I would like to use an array rows for indexing rows of another array x. Initially, rows contains the indices of all rows of x (and is therefore sorted). Throughout the program, some indices exclude are chosen to be removed from rows. Like rows itself, exclude is a sorted array.
What is the best way of finding the set difference of rows and exclude?
I have thought of a few different options, but I think their complexities are more than O(n + m), where n is the length of rows and m is the length of exclude.
1. new_rows = [r for r in rows if r not in exclude]
This solution requires a lookup in exclude for every element of rows and therefore has O(mn) complexity.
2. new_rows = setdiff1d(rows, exclude, assume_unique=True)
This will probably take O(n log m), but I'm not sure.
3. Convert exclude to a dict and run option 1. The problem with this approach is that it requires extra memory, but it meets the complexity requirement.
Here are outlines of two O(n+m) options:
1) heapq.merge will combine two sorted sequences in linear time. As the combined sequence is sorted, shared indices will sit next to each other.
2) As rows, as you describe it, is a "thinned-out range", I assume the max value of rows is not excessively large. You can therefore allocate an array E of that size (O(1) if we don't initialize it, i.e. use np.empty). Then use rows and exclude to index into E: write E[rows] = 1 and E[exclude] = 0, then check E[rows] again and remove all elements of rows at which E has changed from 1 to 0.
Option 2 also works if the two sets are not sorted.
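A minimal sketch of option 2, assuming rows and exclude are NumPy integer arrays (the helper name is mine, and np.zeros is used instead of np.empty so the code stays safe without careful re-initialization):

import numpy as np

def setdiff_marked(rows, exclude):
    # Indicator array over all indices up to max(rows)
    # (exclude is assumed to be a subset of rows, per the question)
    E = np.zeros(rows.max() + 1, dtype=bool)
    E[rows] = True        # mark every index we want to keep
    E[exclude] = False    # unmark the excluded indices
    return rows[E[rows]]  # keep the rows still marked

rows = np.array([0, 1, 2, 3, 4, 5])
exclude = np.array([1, 4])
print(setdiff_marked(rows, exclude))  # [0 2 3 5]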

How to identify the indices for the nth smallest values in a multi dimensional array in VBA?

I currently have a 3 dimensional array full of different values. I would like to find the indices corresponding to the "nth" smallest values in the array. For example... If the 3 smallest values were 0.1, 0.2 and 0.3 I would like to see, in order, the indices for these values. Any help would be greatly appreciated.
A possible way to approach this would be to add an original-index dimension to your 3D array, then sort by the current values to find the smallest items and return their original indices. Take a look at this: VBA array sort function?

Numpy maximum(arrays)--how to determine the array each max value came from

I have numpy arrays representing July temperature for each year since 1950.
I'd like to use something like numpy.maximum(temp1950, temp1951, temp1952, ..., temp2014)
to determine the maximum July temperature at each cell, but numpy.maximum(array1, array2) compares only two arrays at a time.
I need the maximum for each cell. How do I also determine the year that each max value came from?
Thanks to Praveen, the following works fine:
array1 = numpy.array( ([1,2],[3,4]) )
array2 = numpy.array( ([3,4],[1,2]) )
array3 = numpy.array( ([9,1],[1,9]) )
all_arrays = numpy.dstack((array1,array2,array3))
#maxvalues = numpy.maximum(all_arrays)#will not work
all_arrays.max(axis=2) #this returns the max from each cell location
max_indexes = numpy.argmax(all_arrays,axis=2)#this returns correct indexes
The answer is argmax, except that you need to do this along the required axis. If you have 65 years' worth of temperatures, it doesn't make sense to keep them in separate arrays.
Instead, put them all into a single 2D array using something like np.vstack and then take the argmax over rows.
alltemps = np.vstack((temp1950, temp1951, ..., temp2014))
maxindexes = np.argmax(alltemps, axis=0)
If your temperature arrays are already 2D for some reason, then you can use np.dstack to stack in depth instead. Then you'll have to take argmax over axis=2.
For the specific example in your question, you're looking for something like:
t = np.dstack((array1, array2))  # note the double parentheses: you need to pass a tuple to the function
maxindexes = np.argmax(t, axis=2)
PS: If you are getting the data out of a file, I suggest putting them in a single array to start with. It gets hard to handle 65 variable names.
You need to use Numpy's argmax
It would give you the index of the largest element in the array, which you can map to the year.
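For instance, mapping the argmax indices back to years (a sketch using the small arrays from the accepted approach above; the years array is a made-up correspondence):

import numpy as np

# Three hypothetical years of 2x2 temperature grids
temps = np.dstack((np.array([[1, 2], [3, 4]]),
                   np.array([[3, 4], [1, 2]]),
                   np.array([[9, 1], [1, 9]])))
years = np.array([1950, 1951, 1952])  # one entry per stacked array

idx = np.argmax(temps, axis=2)  # index of the max along the year axis
max_years = years[idx]          # year each cell's max came from
print(max_years)
# [[1952 1951]
#  [1950 1952]]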

Speed Enhancements for a Sorted Vector in MATLAB

What is the fastest way to lookup the index of a value in sorted vector in MATLAB?
That is, is there a fast find(vector == myNumber, 1, 'first') for when vector is sorted?
I have a large matrix (200,000 x 4) of locations, each with a unique integer ID recorded in the first column. I want to find the row location of a known ID, but thousands of searches take a noticeable amount of time.
If you use ismembc2, the loc output should give you what you need. See this for more details:
http://www.mathworks.com/support/solutions/en/data/1-9NIE1N/index.html?product=ML&solution=1-9NIE1N
There are a number of submissions for this on FEX: http://www.mathworks.com/matlabcentral/fileexchange/?term=binary+search+vector
I do not know if it is faster, but you may want to try
result = vector(vector(:,1)==myNumber,:)
result will contain the 4-element row for which the first column of vector equals myNumber.