Extract required rows from a numpy array

Extract required rows from a numpy array - numpy

I have a given numpy array as follows.
import numpy as np
data = np.array([[4,6,8,9,3,2,4,4,1], # no of 0s == 0
[4,6,8,9,3,0,0,4,0], # no of 0s == 3
[4,6,0,9,0,2,0,4,0], # no of 0s == 4
[4,6,8,0,3,0,0,0,0], # no of 0s == 5
[4,6,8,9,3,2,0,4,0]]) # no of 0s == 2
From the given array, data , I have to extract 3 rows which contain the least 0s.
So, the expected are, 1st, last, and second rows.
res = np.array([[4,6,8,9,3,2,4,4,1], # no of 0s == 0
[4,6,8,9,3,0,0,4,0], # no of 0s == 3
[4,6,8,9,3,2,0,4,0]]) # no of 0s == 2
How can I do it guys?

Sum on your condition and partition.
n = 3
c = (data == 0).sum(1)
mn = np.argpartition(c, n)[:n]
data[mn]
array([[4, 6, 8, 9, 3, 2, 4, 4, 1],
[4, 6, 8, 9, 3, 2, 0, 4, 0],
[4, 6, 8, 9, 3, 0, 0, 4, 0]])
If you need the rows sorted by original index value and not number of zeros, replace the last line with:
data[np.sort(mn)]

Related

Find duplicated sequences in numpy.array or pandas column

For example, I have got an array like this:
([ 1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5 ])
I need to find all duplicated sequences , not values, but sequences of at least two values one by one.
The result should be like this:
of length 2: [1, 5] with indexes (0, 16);
of length 3: [3, 3, 7] with indexes (6, 12); [7, 9, 4] with indexes (2, 8)
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
I can do it with pandas.apply function, but it calculates too slow, swifter did not help me.
And in real life I need to find all of them, with length from 10 up to 100 values one by one on database with 1500 columns with 700 000 values each. So i really do need a vectorized decision.
Is there a vectorized decision for finding all at once? Or at least for finding only 10-values sequences? Or only 4-values sequences? Anything, that will be fully vectorized?

One possible implementation (although not fully vectorized) that finds all sequences of size n that appear more than once is the following:
import numpy as np
def repeated_sequences(arr, n):
Na = arr.size
r_seq = np.arange(n)
n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
unique_seqs = np.unique(n_seqs, axis=0)
comp = n_seqs == unique_seqs[:, None]
M = np.all(comp, axis=-1)
if M.any():
matches = np.array(
[np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
)
repeated_inds = np.count_nonzero(matches, axis=-1) > n
repeated_matches = matches[repeated_inds]
idxs = np.argwhere(repeated_matches > 0)[::n]
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
else:
return [], []
return unique_seqs[repeated_inds], grouped_idxs
In theory, you could replace
matches = np.array(
[np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
)
with
matches = scipy.signal.convolve(
M, np.ones((1, n), dtype=int), mode="full"
).astype(int)
which would make the whole thing "fully vectorized", but my tests showed that this was 3 to 4 times slower than the for-loop. So I'd stick with that. Or simply,
matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
which does not have any significant speed-up, since it's basically a hidden loop (see this).
This is based off #Divakar's answer here that dealt with a very similar problem, in which the sequence to look for was provided. I simply made it so that it could follow this procedure for all possible sequences of size n, which are found inside the function with n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]; unique_seqs = np.unique(n_seqs, axis=0).
For example,
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> repeated_seqs, inds = repeated_sequences(a, n)
>>> for i, seq in enumerate(repeated_seqs[:10]):
...: print(f"{seq} with indexes {inds[i]}")
...:
[3 3 7] with indexes [ 6 12]
[7 9 4] with indexes [2 8]
Disclaimer
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
This is not directly taken into account and the sequence [5, 5] would appear more than once according to this algorithm. You could do something like this, based off #Paul's answer here, but it involves a loop:
import numpy as np
repeated_matches = np.array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
>>> print(grouped_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64),
array([ 7, 8, 9, 10], dtype=int64)]
# If there are consecutive numbers in grouped_idxs, that means that there is a long
# sequence that should be excluded. So, you'd have to check for consecutive numbers
filtered_idxs = []
for idx in grouped_idxs:
if not all((idx[1:] - idx[:-1]) == 1):
filtered_idxs.append(idx)
>>> print(filtered_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64)]
Some tests:
>>> n = 3
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> %timeit repeated_sequences(a, n)
414 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> n = 4
>>> a = np.random.randint(0, 10, (10000,))
>>> %timeit repeated_sequences(a, n)
3.88 s ± 54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> result, _ = repeated_sequences(a, n)
>>> result.shape
(2637, 4)
This is not the most efficient implementation by far, but it works as a 2D approach. Plus, if there aren't any repeated sequences, it returns empty lists.
EDIT: Full implementation
I vectorized the routine I added in the Disclaimer section as a possible solution to the long sequence problem and ended up with the following:
import numpy as np
# Taken from:
# https://stackoverflow.com/questions/53051560/stacking-numpy-arrays-of-different-length-using-padding
def stack_padding(it):
def resize(row, size):
new = np.array(row)
new.resize(size)
return new
row_length = max(it, key=len).__len__()
mat = np.array([resize(row, row_length) for row in it])
return mat
def repeated_sequences(arr, n):
Na = arr.size
r_seq = np.arange(n)
n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
unique_seqs = np.unique(n_seqs, axis=0)
comp = n_seqs == unique_seqs[:, None]
M = np.all(comp, axis=-1)
repeated_seqs = []
idxs_repeated_seqs = []
if M.any():
matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
repeated_inds = np.count_nonzero(matches, axis=-1) > n
if repeated_inds.any():
repeated_matches = matches[repeated_inds]
idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
# Additional routine
# Pad this uneven array with zeros so that we can use it normally
grouped_idxs = np.array(grouped_idxs, dtype=object)
padded_idxs = stack_padding(grouped_idxs)
# Find the indices where there are padded zeros
pad_positions = padded_idxs == 0
# Perform the "consecutive-numbers check" (this will take one
# item off the original array, so we have to correct for its shape).
idxs_to_remove= np.pad(
(padded_idxs[:, 1:] - padded_idxs[:, :-1]) == 1,
[(0, 0), (0, 1)],
constant_values=True,
)
pad_positions = np.argwhere(pad_positions)
i = pad_positions[:, 0]
j = pad_positions[:, 1] - 1 # Shift by one (shape correction)
idxs_to_remove[i, j] = True # Masking, since we don't want pad indices
# Obtain a final mask (boolean opposite of indices to remove)
final_mask = ~idxs_to_remove.all(axis=-1)
grouped_idxs = grouped_idxs[final_mask] # Filter the long sequences
repeated_seqs = unique_seqs[repeated_inds][final_mask]
# In order to get the correct indices, we must first limit the
# search to a shape (on axis=1) of the closest multiple of n.
# This will avoid taking more indices than we should to show where
# each repeated sequence begins
to = padded_idxs.shape[1] & (-n)
# Build the final list of indices (that goes from 0 - to with
# a step of n
idxs_repeated_seqs = [
grouped_idxs[i][:to:n] for i in range(grouped_idxs.shape[0])
]
return repeated_seqs, idxs_repeated_seqs
For example,
n = 2
examples = [
# First example is your original example array.
np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Second example has a long sequence of 5's, and since there aren't
# any [5, 5] anywhere else, it's not taken into account and therefore
# should not come out.
np.array([1, 5, 5, 5, 5, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Third example has the same long sequence but since there is a [5, 5]
# later, then it should take it into account and this sequence should
# be found.
np.array([1, 5, 5, 5, 5, 6, 5, 5, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
# Fourth example has a [5, 5] first and later it has a long sequence of
# 5's which are uneven and the previous implementation got confused with
# the indices to show as the starting indices. In this case, it should be
# 1, 13 and 15 for [5, 5].
np.array([1, 5, 5, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 5, 5, 5, 5, 5]),
]
for a in examples:
print(f"\nExample: {a}")
repeated_seqs, inds = repeated_sequences(a, n)
for i, seq in enumerate(repeated_seqs):
print(f"\t{seq} with indexes {inds[i]}")
Output (as expected):
Example: [1 5 7 9 4 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
[7 9] with indexes [2 8]
[9 4] with indexes [3 9]
Example: [1 5 5 5 5 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
Example: [1 5 5 5 5 6 5 5 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [ 0 16]
[5 5] with indexes [1 3 6]
Example: [1 5 5 9 4 6 3 3 7 9 4 0 3 5 5 5 5 5]
[5 5] with indexes [ 1 13 15]
[9 4] with indexes [3 9]
You can test it out yourself with more examples and more cases. Keep in mind this is what I understood from your disclaimer. If you want to count the long sequences as one, even if multiple sequences are in there (for example, [5, 5] appears twice in [5, 5, 5, 5]), this won't work for you and you'd have to come up with something else.

pandas sort values based other column's value for each row

Generally, I want to sort each ceil of some columns in pandas dataframe based on 1 column's value, That single column stores rank of other columns' value.
Suppose I have a dataframe like this, chrs has characters I want to sort, rank is the order of charaters for each row :
import pandas as pd
import numpy as np
import string
from operator import itemgetter
letters = list(string.ascii_lowercase)
np.random.seed(0)
# generate length for each row
data = pd.DataFrame({'col0': np.random.randint(2,10,10)})
# generate random string for each row
data['chrs'] = data.col0.apply(lambda x: ','.join(np.random.choice(letters) for i in range(x)))
# generate random rank for each row
data['rank_of_chr'] = data.col0.apply(lambda x: np.random.choice(x,x,replace = False))
data.iloc[:,1:]
chrs rank_of_chr
0 v,s,e,x,g,y [2, 3, 5, 1, 4, 0]
1 y,m,b,g,h,x,o,y,r [0, 4, 2, 3, 5, 6, 7, 1, 8]
2 f,z,n,i,j,u,t [4, 1, 5, 0, 6, 2, 3]
3 q,t [0, 1]
4 f,p,p,a,s [3, 0, 2, 1, 4]
5 d,y,r,t,t [1, 4, 2, 0, 3]
6 t,o,h,a,b [1, 2, 0, 3, 4]
7 j,z,a,k,u,x,d,l,s [7, 5, 1, 2, 3, 8, 6, 0, 4]
8 x,c,a [2, 0, 1]
9 a,e,v,f,g [0, 2, 3, 4, 1]
I want to sort chrs value base on rank_of_chr value for each row. For instance, for row 9, I want a,g,e,v,f(a,e,v,f,g with rank [0,2,3,4,1], rank is ascending just like rank() in sql).
Since the true data is 50,000,000 rows, I want to find the fastest methods for it.
What I have tried is:
use itertuple for each rows, use for loop to iter over each column I want to sort.
for each row, use np.argsort to get the index of sorted chr and then use itergetter to index original value of chrs
I revise dataframes' value inplace using dt.at[index,col_name] = new_value
cols_need_sort = ['chrs']
for i in data.itertuples():
this_order = np.argsort(list(map(int, data.loc[i.Index,'rank_of_chr'])))
for col_name in cols_need_sort:
data.at[i.Index, col_name] = itemgetter(*this_order)(data.loc[i.Index,col_name].split(','))
data.iloc[:,1:]
Any method to boost performance for this task?

How to delete rows from column which have matching values in the list Pandas

I am finding outliers from a column and storing them in a list. Now i want to delete all the values which
are present in my list from the column.
How can achieve this ?
This is my function for finding outliers
outlier=[]
def detect_outliers(data):
threshold=3
m = np.mean(data)
st = np.std(data)
for i in data:
#calculating z-score value
z_score=(i-m)/st
#if the z_score value is greater than threshold value than its a outlier
if np.abs(z_score)>threshold:
outlier.append(i)
return outlier
This is my column in data frame
df_train_11.AMT_INCOME_TOTAL

import numpy as np, pandas as pd
df = pd.DataFrame(np.random.rand(10,5))
outlier_list=[]
def detect_outliers(data):
threshold=0.5
for i in data:
#calculating z-score value
z_score=(df.loc[:,i]- np.mean(df.loc[:,i])) /np.std(df.loc[:,i])
outliers = np.abs(z_score)>threshold
outlier_list.append(df.index[outliers].tolist())
return outlier_list
outlier_list = detect_outliers(df)
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outliers of each column. outlier_list[0] gives you [1, 2, 4, 5, 6, 7, 9] which means that the rows 1,2,etc are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This willfilter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations.

Some array indexing in numpy

lookup = np.array([60, 40, 50, 60, 90])
The values in the following arrays are equal to indices of lookup.
a = np.array([1, 2, 0, 4, 3, 2, 4, 2, 0])
b = np.array([0, 1, 2, 3, 3, 4, 1, 2, 1])
c = np.array([4, 2, 1, 4, 4, 0, 4, 4, 2])
array 1st column elements lookup value
a 1 --> 40
b 0 --> 60
c 4 --> 90
Maximum is 90.
So, first element of result is 4.
This way,
expected result = array([4, 2, 0, 4, 4, 4, 4, 4, 0])
How to get it?
I tried as:
d = np.vstack([a, b, c])
print (d)
res = lookup[d]
res = np.max(res, axis = 0)
print (d[enumerate(lookup)])
I got error
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

Do you want this:
d = np.vstack([a,b,c])
# option 1
rows = lookup[d].argmax(0)
d[rows, np.arange(d.shape[1])]
# option 2
(lookup[:,None] == lookup[d].max(0)).argmax(0)
Output:
array([4, 2, 0, 4, 4, 4, 4, 4, 0])

Swap a subset of multi-values in numpy

Given a starting numpy array that looks like:
B = np.array( [1, 1, 1, 0, 2, 2, 1, 3, 3, 0, 4, 4, 4, 4] )
What it the most efficient way to swap one set of values for another when there are duplicates? For example, let
s1 = [1,2,4]
s2 = [4,1,2]
An inefficient swapping method would iterate through s1 and s2 as so:
B2 = B.copy()
for x,y in zip(s1,s2):
B2[B==x] = y
Giving as output
B2 -> [4, 4, 4, 0, 1, 1, 4, 3, 3, 0, 2, 2, 2, 2]
Is there a way to do this essentially in-place without the zip loop?

>>> B = np.array( [1, 1, 1, 0, 2, 2, 1, 3, 3, 0, 4, 4, 4, 4] )
>>> s1 = [1,2,4]
>>> s2 = [4,1,2]
>>> B2 = B.copy()
>>> c, d = np.where(B == np.array(s1)[:,np.newaxis])
>>> B2[d] = np.repeat(s2,np.bincount(c))
>>> B2
array([4, 4, 4, 0, 1, 1, 4, 3, 3, 0, 2, 2, 2, 2])

If you have only integers that are between 0 and n (if not its no problem to generalize to any integer range unless its very sparse), the most efficient way is the use of take/fancy indexing:
swap = np.arange(B.max() + 1) # all values in B
swap[s1] = s2 # replace the values you want to be replaced
B2 = swap.take(B) # or swap[B]
This is seems almost twice as fast for the small B given here, but with larger B it gets even more speedup repeating B to a length of about 100000 gives 8x already. This also avoids the == operation for every s1 element, so will scale much better as s1/s2 get large.
EDIT: you could also use np.put (also in the other answer) for some speedup for swap[s1] = s2. For these 1D problems take/put are simply faster.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Extract required rows from a numpy array - numpy

Related

Find duplicated sequences in numpy.array or pandas column

pandas sort values based other column's value for each row

How to delete rows from column which have matching values in the list Pandas

Some array indexing in numpy

Swap a subset of multi-values in numpy

Categories

Resources