I have a numpy.ndarray in which the maximum value will mostly occur more than once.
EDIT: This is subtly different from "numpy.argmax: how to get the index corresponding to the *last* occurrence, in case of multiple occurrences of the maximum values", because the author of that question says
Or, even better, is it possible to get a list of indices of all the occurrences of the maximum value in the array?
whereas in my case getting such a list may prove very expensive.
Is it possible to find the index of the last occurrence of the maximum value by using something like numpy.argmax? I want to find only the index of the last occurrence, not an array of all occurrences (since there may be several hundred of them).
For example, this will return the index of the first occurrence, i.e. 2:
import numpy as np
a=np.array([0,0,4,4,4,4,2,2,2,2])
print(np.argmax(a))
However I want it to output 5.
numpy.argmax only returns the index of the first occurrence. You could apply argmax to a reversed view of the array:
import numpy as np
a = np.array([0,0,4,4,4,4,2,2,2,2])
b = a[::-1]
i = len(b) - np.argmax(b) - 1
i # 5
a[i:] # array([4, 2, 2, 2, 2])
Note that numpy doesn't copy the array; instead it creates a view of the original with a negative stride that accesses it in reverse order.
id(a) == id(b.base) # True
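If you need this in several places, the trick can be wrapped in a small helper. A minimal sketch; last_argmax is just an illustrative name, not a NumPy function:
import numpy as np

def last_argmax(arr):
    """Index of the last occurrence of the maximum value (reversed-view trick)."""
    arr = np.asarray(arr)
    return arr.size - np.argmax(arr[::-1]) - 1

a = np.array([0, 0, 4, 4, 4, 4, 2, 2, 2, 2])
print(last_argmax(a))  # 5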
If your array contains integers and has fewer than about 1e15 elements, you can also sort this out by adding a "noise" term that increases linearly with position, so later occurrences win ties:
>>> import numpy as np
>>> a = np.array([0,0,4,4,4,4,2,2,2,2])
>>> noise = np.arange(len(a)) * 1e-15
>>> print(np.argmax(a + noise))
5
Related
I want to get the largest index where a condition is true, e.g.:
import numpy as np
a = np.arange(10, 0, -1)
i = np.max(np.argwhere(a > 5).ravel())
print(i)
which gives 4.
But I want to do this on a very large array, where np.argwhere is simply too costly.
How can I do this without allocating a (large) array, i.e. without np.argwhere?
Use argmax on the flipped mask; that gives the first index in reversed order, which is essentially the last index in the original order:
len(a)-np.argmax((a>5)[::-1])-1
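For completeness, here is the flipped-mask approach applied to the example array (a sketch; note the boolean mask is still allocated, but the intermediate index array from np.argwhere is not):
import numpy as np

a = np.arange(10, 0, -1)              # [10, 9, 8, ..., 1]
mask = a > 5                          # boolean mask for the condition
i = len(a) - np.argmax(mask[::-1]) - 1
print(i)                              # 4, matching the np.argwhere version above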
In pandas, I'd like to analyze groups that have at least one occurrence of a conditional value. I've included a sample dataframe with a first-step attempt at identifying such groups below. So, let's say I want to filter the original data frame to only the species of iris that ever had a sepal length greater than 6. In the last command, I'm counting the number of unique species groups that had a sepal length greater than 6 (so, at least I can count them).
But, what I really want is the original dataframe where I analyze rows only if the species had a sepal length greater than 6 (so, it would be a dataframe without the species "setosa" since they never have one).
The longer explanation is that I have a real dataset of users. Each user will have values in certain columns that may exceed a threshold value of interest. I haven't figured out how to analyze users who have these threshold values.
Perhaps a loop would be better: I might loop through each unique user name, check whether any row for that user ever exceeds a certain value, and then add some kind of new column (though I know loops are frowned upon in pandas, so I'm posting here to see if there's a well-known method of identifying groups by occurrence).
Thanks and let me know if I can make this question any more clear!
import pandas as pd
import seaborn as sns
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
iris = sns.load_dataset('iris')
iris['longsepal'] = iris['sepal_length'] > 7
iris['longpetal'] = iris['petal_length'] > 5
iris.groupby(['longsepal'])['species'].nunique()
Consider groupby().transform() to calculate an inline max aggregate per species, which can then be filtered on. Technically, the > 7 filter returns only one species, since versicolor's max sepal length only reaches 7.0 (which does not satisfy a strict >). Below are the operator and functional forms of the inequality logic.
iris['longsepal'] = iris.groupby(['species'])['sepal_length'].transform('max')
iris['longpetal'] = iris.groupby(['species'])['petal_length'].transform('max')
# DATA FILTERS
longsepal_iris = iris.loc[iris['longsepal'] > 7] # GREATER THAN OPERATOR FORM: >
longsepal_iris = iris.loc[iris['longsepal'].gt(7)] # GREATER THAN FUNCTIONAL FORM: gt()
longpetal_iris = iris.loc[iris['longpetal'] > 5] # GREATER THAN OPERATOR FORM: >
longpetal_iris = iris.loc[iris['longpetal'].gt(5)] # GREATER THAN FUNCTIONAL FORM: gt()
# SPECIES
longsepal_iris['species'].unique()
# ['virginica']
longpetal_iris['species'].unique()
# ['versicolor' 'virginica']
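As an aside, if the goal is the filtered original DataFrame rather than an extra per-row max column, GroupBy.filter can express "species that ever exceed the threshold" directly. A minimal sketch, using the question's threshold of 6 on the same iris dataset:
import seaborn as sns

iris = sns.load_dataset('iris')

# keep only rows belonging to species that ever have sepal_length > 6
ever_long = iris.groupby('species').filter(lambda g: (g['sepal_length'] > 6).any())

print(ever_long['species'].unique())   # ['versicolor' 'virginica'] -- 'setosa' is dropped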
I would like to calculate the SVD of a large matrix with Dask. However, I naively tried to create an empty 2D array and update it in a loop, but Dask does not allow mutating the array.
So, I'm looking for a workaround. I tried saving the large array (around 65,000 x 65,000, or even more) to HDF5 via h5py, but updating the array in a loop is quite inefficient. Should I be using a memory-mapped NumPy array (np.memmap) instead?
Below, I share sample code without any Dask implementation. Should I use dask.bag or dask.delayed for this operation?
The sample code takes in long strings and, with a window size of 8, generates combinations of two-letter words. In the actual data, the window size would be 20 and the words will be 8 letters long. The input string can be 3 GB long.
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.)
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16,16)) # in actual sample we expect 65000x65000 array
# sample sequences (these will be gigabytes long in actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"
# accumulate results
all_pairs=[]
def generate_pairs(sequence):
    pairs = []
    for i in range(len(sequence)-8+1):
        window = sequence[i:i+8]
        words = [window[j:j+2] for j in range(0, len(window), 2)]
        for pair in itertools.combinations(words, 2):
            pairs.append(pair)
    return pairs

# use function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))

# convert 1D array of pairs into 2D counts of pairs
# for each pair, lookup word index and increase corresponding cell
for j in all_pairs:
    counts[two_index[j[0]], two_index[j[1]]] += 1

print(counts)
EDIT: I may have phrased the question in an overly complicated way, so let me paraphrase it. I need to construct a single large 2D array of size ~65000x65000. The array needs to be filled by counting occurrences of (word1, word2) pairs. Since Dask does not allow item assignment/mutation of a Dask array, I cannot fill the array as pairs are processed. Is there a workaround to generate/fill such a large 2D array with Dask?
Here's simpler code to test:
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
seq = "AAAAACCATCGACTACGACTAC"
counts = np.zeros(shape=(16,16))
for i in range(len(seq)-8+1):
    window = seq[i:i+8]
    words = [window[j:j+2] for j in range(0, len(window), 2)]
    for pair in itertools.combinations(words, 2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1  # problematic part!
print(counts)
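One possible workaround, sketched under assumptions rather than a definitive answer: since Dask arrays don't support item assignment, let dask.bag aggregate the (word1, word2) counts and then build a scipy.sparse COO matrix from the aggregated tuples, so the large dense array is never mutated in place. This reuses the generate_pairs function and two_index mapping from the sample code above; the sequence list stands in for chunks of the real input.
import dask.bag as db
import scipy.sparse as sp

# count pairs in parallel with dask.bag instead of mutating a dense array
# (assumes generate_pairs(), two_index, seq1 and seq2 from the sample code above)
sequences = [seq1, seq2]                  # in practice: chunks of the real 3 GB input

pair_counts = (
    db.from_sequence(sequences)
      .map(generate_pairs)                # list of (word1, word2) pairs per sequence
      .flatten()                          # one long stream of pairs
      .frequencies()                      # lazy (pair, count) aggregation
      .compute()
)

rows = [two_index[w1] for (w1, _), _ in pair_counts]
cols = [two_index[w2] for (_, w2), _ in pair_counts]
vals = [count for _, count in pair_counts]

# sparse 16x16 here; the same construction would work for ~65000x65000
counts = sp.coo_matrix((vals, (rows, cols)), shape=(16, 16))
print(counts.toarray())
Since most word pairs never co-occur, a sparse representation also sidesteps the memory cost of materializing the full dense 65,000 x 65,000 array.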
I have the following minimal example:
a = np.zeros((5,5,5))
a[1,1,:] = [1,1,1,1,1]
print(a[1,:,range(4)])
I would expect as output an array with 5 rows and 4 columns, where we have ones on the second row. Instead it is an array with 4 rows and 5 columns with ones on the second column. What is happening here, and what can I do to get the output I expected?
This is an example of mixed basic and advanced indexing, as discussed in https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
The slice dimension has been appended to the end.
With one scalar index this is a marginal case of the ambiguity described there. It has been discussed in previous SO questions and in one or more bug/issue reports, for example:
Numpy sub-array assignment with advanced, mixed indexing
In this case you can replace the range with a slice, and get the expected order:
In [215]: a[1,:,range(4)].shape
Out[215]: (4, 5) # slice dimension last
In [216]: a[1,:,:4].shape
Out[216]: (5, 4)
In [219]: a[1][:,[0,1,3]].shape
Out[219]: (5, 3)
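For this particular case, a quick way to recover the layout you expected (a sketch, using the same a as in the question) is either the plain slice :4 or a transpose of the mixed-indexing result:
import numpy as np

a = np.zeros((5, 5, 5))
a[1, 1, :] = [1, 1, 1, 1, 1]

mixed = a[1, :, range(4)]       # advanced index axis moves to the front -> shape (4, 5)
expected = a[1, :, :4]          # pure basic indexing keeps the order    -> shape (5, 4)

# transposing the mixed result recovers the expected layout
print(np.array_equal(mixed.T, expected))   # True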
Is it possible to initialize a pandas SparseArray by providing only the dense entries? I could not figure this out from the documentation: http://pandas.pydata.org/pandas-docs/stable/sparse.html .
For example, say I want a length-1000 SparseArray with a one at index 9 and zeros everywhere else; how would I go about creating it? This is one way:
a = [0] * 1000
a[9] = 1
sparse_a = pd.SparseArray(data=a, fill_value=0)
But, in the above, we have to create the dense array before the sparse one. Is there a way to specify only the indices and the dense entries to create the SparseArray directly?
A length 10 SparseArray with a one at index 9 and zeros everywhere else:
pd.SparseArray(1, index= range(1), kind='block',
sparse_index= BlockIndex(10, [8], [1]),
fill_value=0)
Notes:
index could be any list as long as its length equals the number of non-sparse entries of the array (the smaller part of the data); in this case, the number of 1s in the sparse array
BlockIndex(10, [8], [1]) is the object pointing to the positions of the non-sparse part of the data. The first argument is the TOTAL length of the array (sparse + non-sparse), the second argument is a list of starting positions of the non-sparse blocks, and the third argument is a list of how long each non-sparse block lasts. Notice that the index length mentioned in the first note is the sum of all elements of the list in the third argument of this BlockIndex
So, a more general example: to make a length-20 SparseArray where the 2nd, 3rd, 6th, 7th, and 8th elements are 1 and the rest are 0:
pd.SparseArray(1, index= range(5), kind='block',
sparse_index= BlockIndex(20, [1,5], [2,3]),
fill_value=0)
or
pd.SparseArray(1, index= [None, 3, 2, 7, np.inf], kind='block',
sparse_index= BlockIndex(20, [1,5], [2,3]),
fill_value=0)
Sadly, I don't know a good way to pass an array of non-sparse data as the first argument to SparseArray. That doesn't mean it can't be done; this is only a disclaimer. I think that as long as you specify index=..., pandas will require a scalar for the first argument (the data).
Tested on Windows 7, pandas version 0.20.2, installed via Anaconda.
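If you need the "list of positions" form often, the BlockIndex construction can be wrapped in a small helper that converts positions into block starts and lengths. This is a sketch under the same assumptions as the answer (pandas 0.20-era API; BlockIndex lives in pandas internals, so the import path may differ between versions, and sparse_from_positions is a hypothetical helper, not a pandas function):
import pandas as pd
from pandas._libs.sparse import BlockIndex   # internal module; location may vary by pandas version

def sparse_from_positions(length, positions, value=1, fill_value=0):
    """Build a SparseArray of `length` with `value` at `positions` (hypothetical helper)."""
    positions = sorted(positions)
    starts, lengths = [], []
    for p in positions:
        if starts and p == starts[-1] + lengths[-1]:
            lengths[-1] += 1          # position extends the current contiguous block
        else:
            starts.append(p)          # position opens a new block
            lengths.append(1)
    return pd.SparseArray(value, index=range(len(positions)), kind='block',
                          sparse_index=BlockIndex(length, starts, lengths),
                          fill_value=fill_value)

# the length-20 example from above: 1s at positions 1, 2, 5, 6, 7
arr = sparse_from_positions(20, [1, 2, 5, 6, 7])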