Largest index where condition is true, without using argwhere - numpy

I want to get the largest index where a condition is true, e.g.:
import numpy as np
a = np.arange(10, 0, -1)
i = np.max(np.argwhere(a > 5).ravel())
print(i)
which gives 4.
But I want to do this on a very large array, where np.argwhere is simply too costly.
How can I do this without allocating a (large) array, i.e. without np.argwhere?

Use argmax on the flipped mask; the first True index in the reversed array corresponds to the last True index in the original order:
len(a)-np.argmax((a>5)[::-1])-1
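As a minimal sketch using the question's own example, the reversed-argmax approach gives the same result as the argwhere version while only allocating the boolean mask:
import numpy as np
a = np.arange(10, 0, -1)
mask = a > 5                              # boolean mask, no index array from argwhere
i = len(a) - np.argmax(mask[::-1]) - 1    # first True in the reversed mask == last True originally
print(i)  # 4
Note that np.argmax returns 0 for an all-False mask, so if the condition might never hold you may want to guard with mask.any() first.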

Related

How to build a numpy matrix one row at a time?

I'm trying to build a matrix one row at a time.
import numpy as np
f = np.matrix([])
f = np.vstack([ f, np.matrix([1]) ])
This is the error message.
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 1
As you can see, np.matrix([]) is NOT an empty list. I'm going to have to do this some other way. But what? I'd rather not do an ugly workaround kludge.
You have to give the initial matrix some dimensions. Either fill it with zeros or use np.empty():
f = np.empty(shape = [1,1])
f = np.vstack([f,np.matrix([1])])
Alternatively, you can use np.hstack for the first row, then use np.vstack iteratively.
arr = np.array([])
arr = np.hstack((arr, np.array([1,1,1])))
arr = np.vstack((arr, np.array([2,2,2])))
Now you can convert into a matrix.
mat = np.asmatrix(arr)
Good grief. It appears there is no way to do what I want. Kludgetown it is. I'll build an array with a bogus first entry, then when I'm done make a copy without the bogosity.
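For what it's worth, a minimal sketch of that kludge (the values below are illustrative, not from the thread) might look like this:
import numpy as np
f = np.zeros((1, 1))                       # bogus first row so vstack has a matching width
for value in (1, 2, 3):                    # stand-in for the rows being generated
    f = np.vstack([f, np.array([[value]])])
f = f[1:]                                  # drop the bogus first row when done
mat = np.asmatrix(f)                       # 3x1 matrix containing 1., 2., 3.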

generate large array in dask

I would like to compute the SVD of a large matrix with Dask. I naively tried to create an empty 2D array and update it in a loop, but Dask does not allow mutating the array.
So I'm looking for a workaround. I tried saving the large array (around 65,000 x 65,000, or even larger) to HDF5 via h5py, but updating it in a loop is quite inefficient. Should I be using a memory-mapped numpy array (np.memmap) instead?
Below I share sample code without any Dask implementation. Should I use dask.bag or dask.delayed for this operation?
The sample code takes long strings and, with a window size of 8, generates combinations of two-letter words. In the actual data the window size will be 20, the words will be 8 letters long, and the input string can be 3 GB long.
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.)
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases = ['A', 'C', 'G', 'T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x, y) in zip(all_two, range(len(all_two)))}
# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16, 16))  # in the actual sample we expect a 65000x65000 array
# sample sequences (these will be gigabytes long in the actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"
# accumulate results
all_pairs = []
def generate_pairs(sequence):
    # slide a window of size 8 over the sequence and split it into 2-letter words
    pairs = []
    for i in range(len(sequence) - 8 + 1):
        window = sequence[i:i + 8]
        words = [window[k:k + 2] for k in range(0, len(window), 2)]
        for pair in itertools.combinations(words, 2):
            pairs.append(pair)
    return pairs
# use the function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))
# convert the 1D list of pairs into 2D counts of pairs:
# for each pair, look up the word indices and increase the corresponding cell
for j in all_pairs:
    counts[two_index[j[0]], two_index[j[1]]] += 1
print(counts)
EDIT: I might have made the question more complicated than it needs to be, so let me paraphrase. I need to construct a single large 2D array of size ~65000x65000, filled by counting occurrences of (word1, word2) pairs. Since Dask does not allow item assignment/mutation on Dask arrays, I cannot fill the array as pairs are processed. Is there a workaround to generate/fill a large 2D array with Dask?
Here's simpler code to test:
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
bases = ['A', 'C', 'G', 'T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x, y) in zip(all_two, range(len(all_two)))}
seq = "AAAAACCATCGACTACGACTAC"
counts = np.zeros(shape=(16, 16))
for i in range(len(seq) - 8 + 1):
    window = seq[i:i + 8]
    words = [window[k:k + 2] for k in range(0, len(window), 2)]
    for pair in itertools.combinations(words, 2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1  # problematic part!
print(counts)
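This is not an answer from the thread, just an illustrative sketch: on a plain NumPy array, the per-element += in the loop can be replaced by collecting integer indices and doing a single unbuffered accumulation with np.add.at, which correctly handles repeated (row, col) pairs. Whether and how this fits into a Dask workflow is a separate question.
import itertools
import numpy as np
bases = ['A', 'C', 'G', 'T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {w: i for i, w in enumerate(all_two)}
seq = "AAAAACCATCGACTACGACTAC"
rows, cols = [], []
for i in range(len(seq) - 8 + 1):
    window = seq[i:i + 8]
    words = [window[k:k + 2] for k in range(0, len(window), 2)]
    for a, b in itertools.combinations(words, 2):
        rows.append(two_index[a])
        cols.append(two_index[b])
counts = np.zeros((16, 16))
np.add.at(counts, (np.array(rows), np.array(cols)), 1)  # accumulate all pairs in one call
print(counts)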

How do I append a column from a numpy array to a pd dataframe?

I have a numpy array of 100 predicted values called first_100. If I convert these to a dataframe they are indexed 0, 1, 2, etc. However, the predictions correspond to rows whose original indices are in random order (66, 201, 32, etc.). I want to put the actual values and the predictions in the same dataframe, but I'm really struggling.
The real values are in a dataframe called first_100_train.
I've tried the following:
pd.concat([first_100, first_100_train], axis=1)
This doesn't work; for some reason it returns the entire dataframe, indexed from 0, so there are lots of NaNs...
first_100_train['Prediction'] = first_100[0]
This is almost what I want, but again because the indexes are different the data doesn't match up. I'd really appreciate any suggestions.
EDIT: After managing to join the dataframes, the result has an extra final column that I'd like to drop (screenshots of the joined result, first_100.head() and first_100_train.head(), are omitted here). The problem is that index 2 from first_100 actually corresponds to index 480 of first_100_train.
Set default index values with DataFrame.reset_index(drop=True) so both DataFrames align correctly:
pd.concat([first_100.reset_index(drop=True),
           first_100_train.reset_index(drop=True)], axis=1)
Or, if the first DataFrame already has a default RangeIndex, the solution simplifies to:
pd.concat([first_100,
           first_100_train.reset_index(drop=True)], axis=1)
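A minimal sketch of why reset_index fixes the alignment (the values below are made up for illustration; only the names first_100 and first_100_train come from the question):
import pandas as pd
first_100 = pd.DataFrame({'Prediction': [0.1, 0.2, 0.3]})                    # default RangeIndex 0, 1, 2
first_100_train = pd.DataFrame({'Actual': [1, 0, 1]}, index=[66, 201, 32])   # shuffled original index
print(pd.concat([first_100, first_100_train], axis=1))                       # aligns on index labels -> NaNs
print(pd.concat([first_100, first_100_train.reset_index(drop=True)], axis=1))  # aligns positionally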

How to find if any column in an array has duplicate values

Let's say I have a numpy matrix A
A = np.array([[0.5, 0.5, 3.7],
              [3.8, 2.7, 3.7],
              [3.3, 1.0, 0.2]])
I would like to know whether there are at least two rows i and i' such that A[i, j] = A[i', j] for some column j.
In the example A, i=0 and i'=1 for j=2 and the answer is yes.
How can I do this?
I tried this:
def test(A, n):
    for j in range(n):
        i = 0
        while i < n:
            a = A[i, j]
            for s in range(i+1, n):
                if A[s, j] == a:
                    return True
            i += 1
    return False
Is there a faster/better way?
There are a number of ways of checking for duplicates. The idea is to use as few loops in the Python code as possible to do this. I will present a couple of ways here:
Use np.unique. You would still have to loop over the columns since it wouldn't make sense for unique to accept an axis argument because each column could have a different number of unique elements. While it still requires a loop, unique allows you to find the positions and other stats of repeated elements:
def test(A):
    for i in range(A.shape[1]):
        if np.unique(A[:, i]).size < A.shape[0]:
            return True
    return False
With this method, you basically check if the number of unique elements in a column is equal to the size of the column. If not, there are duplicates.
Use np.sort, np.diff and np.any. This is a fully vectorized solution that does not require any loops because you can specify an axis for each of these functions:
def test(A):
    return np.any(np.diff(np.sort(A, axis=0), axis=0) == 0)
This literally reads "if any of the column-wise differences in the column-wise sorted array are zero, return True". A zero difference in the sorted array means that there are identical elements. axis=0 makes sort and diff operate on each column individually.
You never need to pass in n since the size of the matrix is encoded in the attribute shape. If you need to look at the subset of a matrix, just pass in the subset using indexing. It will not copy the data, just return a view object with the required dimensions.
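As a quick check of both versions against the example matrix (these are the answer's two functions, renamed here to test_unique and test_sorted so they can coexist):
import numpy as np
A = np.array([[0.5, 0.5, 3.7],
              [3.8, 2.7, 3.7],
              [3.3, 1.0, 0.2]])
def test_unique(A):
    # a column has a duplicate if it has fewer unique values than rows
    for i in range(A.shape[1]):
        if np.unique(A[:, i]).size < A.shape[0]:
            return True
    return False
def test_sorted(A):
    # a zero difference between consecutive sorted values means a duplicate
    return np.any(np.diff(np.sort(A, axis=0), axis=0) == 0)
print(test_unique(A), test_sorted(A))  # True True (column 2 contains 3.7 twice)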
A solution without numpy would look like this: First, swap columns and rows with zip()
zipped = zip(*A)
then check whether any of the resulting rows has duplicates. You can check for duplicates by turning a row into a set, which discards duplicates, and comparing the lengths.
has_duplicates = any(len(set(row)) != len(row) for row in zip(*A))
This is most likely slower and more memory-intensive than the pure numpy solutions, but it may help with clarity.

How to find last occurrence of maximum value in a numpy.ndarray

I have a numpy.ndarray in which the maximum value will mostly occur more than once.
EDIT: This is subtly different from the question "numpy.argmax: how to get the index corresponding to the *last* occurrence, in case of multiple occurrences of the maximum values", because that author asks
Or, even better, is it possible to get a list of indices of all the occurrences of the maximum value in the array?
whereas in my case getting such a list may prove very expensive.
Is it possible to find the index of the last occurrence of the maximum value using something like numpy.argmax? I only want the index of the last occurrence, not an array of all occurrences (since there may be several hundred).
For example, this returns the index of the first occurrence, i.e. 2:
import numpy as np
a = np.array([0,0,4,4,4,4,2,2,2,2])
print(np.argmax(a))
However, I want it to output 5.
numpy.argmax only returns the index of the first occurrence. You could apply argmax to a reversed view of the array:
import numpy as np
a = np.array([0,0,4,4,4,4,2,2,2,2])
b = a[::-1]
i = len(b) - np.argmax(b) - 1
i # 5
a[i:] # array([4, 2, 2, 2, 2])
Note numpy doesn't copy the array but instead creates a view of the original with a stride that accesses it in reverse order.
id(a) == id(b.base) # True
If your array is made up of integers and has fewer than about 1e15 rows, you can also solve this by adding a small "noise" term that linearly increases the value of later occurrences.
>>> import numpy as np
>>> a = np.array([0,0,4,4,4,4,2,2,2,2])
>>> noise = np.array(range(len(a))) * 1e-15
>>> print(np.argmax(a + noise))
5
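The reversed-view idea is the same trick used in the first question above; as a small sketch, it can be wrapped in a helper (the name argmax_last is illustrative, not a numpy function):
import numpy as np
def argmax_last(a):
    # index of the last occurrence of the maximum, via argmax on the reversed view
    return len(a) - 1 - np.argmax(a[::-1])
a = np.array([0, 0, 4, 4, 4, 4, 2, 2, 2, 2])
print(argmax_last(a))  # 5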