I have a given numpy array as follows.
import numpy as np
data = np.array([[4,6,8,9,3,2,4,4,1], # no of 0s == 0
[4,6,8,9,3,0,0,4,0], # no of 0s == 3
[4,6,0,9,0,2,0,4,0], # no of 0s == 4
[4,6,8,0,3,0,0,0,0], # no of 0s == 5
[4,6,8,9,3,2,0,4,0]]) # no of 0s == 2
From the given array, data, I have to extract the 3 rows that contain the fewest 0s.
So the expected rows are the 1st, the last, and the 2nd.
res = np.array([[4,6,8,9,3,2,4,4,1], # no of 0s == 0
[4,6,8,9,3,0,0,4,0], # no of 0s == 3
[4,6,8,9,3,2,0,4,0]]) # no of 0s == 2
How can I do it guys?
Sum over your condition to count the zeros in each row, then partition on those counts.
n = 3
c = (data == 0).sum(1)
mn = np.argpartition(c, n)[:n]
data[mn]
array([[4, 6, 8, 9, 3, 2, 4, 4, 1],
[4, 6, 8, 9, 3, 2, 0, 4, 0],
[4, 6, 8, 9, 3, 0, 0, 4, 0]])
Note that np.argpartition does not guarantee any particular order among the n selected rows. If you need them in their original row order, replace the last line with:
data[np.sort(mn)]
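As a self-contained check of the steps above, here is a minimal sketch that wraps them in a helper. The name rows_with_fewest_zeros is made up for illustration, and the sketch assumes n is smaller than the number of rows, because np.argpartition(c, n) needs n to be a valid index into c.

```python
import numpy as np

def rows_with_fewest_zeros(data, n):
    """Return the n rows of `data` with the fewest zeros, in original row order.

    Hypothetical helper for illustration; assumes 0 < n < data.shape[0].
    """
    zero_counts = (data == 0).sum(axis=1)          # number of zeros in each row
    picked = np.argpartition(zero_counts, n)[:n]   # n smallest counts, arbitrary order
    return data[np.sort(picked)]                   # restore original row order

data = np.array([[4, 6, 8, 9, 3, 2, 4, 4, 1],
                 [4, 6, 8, 9, 3, 0, 0, 4, 0],
                 [4, 6, 0, 9, 0, 2, 0, 4, 0],
                 [4, 6, 8, 0, 3, 0, 0, 0, 0],
                 [4, 6, 8, 9, 3, 2, 0, 4, 0]])

res = rows_with_fewest_zeros(data, 3)
print(res)
```

If ties at the cut-off matter (several rows with the same zero count), which of them is kept is not specified by np.argpartition.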
Related
For example, I have got an array like this:
([ 1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5 ])
I need to find all duplicated sequences: not duplicated values, but runs of at least two consecutive values that appear more than once.
The result should be like this:
of length 2: [1, 5] with indexes (0, 16);
of length 3: [3, 3, 7] with indexes (6, 12); [7, 9, 4] with indexes (2, 8)
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
I can do it with pandas.apply, but it is too slow, and swifter did not help.
In practice I need to find all of them, with lengths from 10 up to 100 consecutive values, in a dataset with 1500 columns of 700 000 values each. So I really do need a vectorized solution.
Is there a vectorized solution for finding all of them at once? Or at least for finding only 10-value sequences? Or only 4-value sequences? Anything that is fully vectorized?
One possible implementation (although not fully vectorized) that finds all sequences of size n that appear more than once is the following:
import numpy as np
def repeated_sequences(arr, n):
    Na = arr.size
    r_seq = np.arange(n)
    n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
    unique_seqs = np.unique(n_seqs, axis=0)
    comp = n_seqs == unique_seqs[:, None]
    M = np.all(comp, axis=-1)
    if M.any():
        matches = np.array(
            [np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
        )
        repeated_inds = np.count_nonzero(matches, axis=-1) > n
        repeated_matches = matches[repeated_inds]
        idxs = np.argwhere(repeated_matches > 0)[::n]
        grouped_idxs = np.split(
            idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
        )
    else:
        return [], []
    return unique_seqs[repeated_inds], grouped_idxs
In theory, you could replace
matches = np.array(
    [np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]
)
with
matches = scipy.signal.convolve(
    M, np.ones((1, n), dtype=int), mode="full"
).astype(int)
which would make the whole thing "fully vectorized", but my tests showed that this was 3 to 4 times slower than the for-loop. So I'd stick with that. Or simply,
matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
which does not give any significant speed-up, since np.apply_along_axis is basically a hidden Python loop.
This is based on @Divakar's answer to a very similar problem, in which the sequence to look for was provided. I simply made it follow this procedure for all possible sequences of size n, which are found inside the function with n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]; unique_seqs = np.unique(n_seqs, axis=0).
For example,
>>> n = 3
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> repeated_seqs, inds = repeated_sequences(a, n)
>>> for i, seq in enumerate(repeated_seqs[:10]):
...     print(f"{seq} with indexes {inds[i]}")
...
[3 3 7] with indexes [ 6 12]
[7 9 4] with indexes [2 8]
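The window-building and matching steps can also be sketched without the (unique sequences x windows) comparison matrix, which is what dominates memory for long arrays. The following is a different technique from the function above, not a drop-in replacement: it assumes NumPy 1.20 or newer (for sliding_window_view), the name repeated_windows is made up for illustration, and it does not handle the "one long run" case discussed in the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def repeated_windows(arr, n):
    """Map each length-n window that occurs more than once to its start indices.

    Hypothetical helper for illustration; overlapping occurrences are all reported.
    """
    windows = sliding_window_view(arr, n)      # shape (arr.size - n + 1, n), a view
    uniq, inverse, counts = np.unique(
        windows, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()                  # guard: some NumPy versions return it 2-D
    result = {}
    for k in np.flatnonzero(counts > 1):       # loop only over the repeated windows
        result[tuple(uniq[k].tolist())] = np.flatnonzero(inverse == k).tolist()
    return result

a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
found = repeated_windows(a, 3)
print(found)
```

The remaining Python loop runs once per repeated window rather than once per unique window, so its cost depends on how many repeats the data contains.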
Disclaimer
The long sequences should be excluded, if they are not duplicated. ([5, 5, 5, 5]) should NOT be taken as [5, 5] on indexes (0, 1, 2)! It's not a duplicate sequence, it's one long sequence.
This is not directly taken into account, so the sequence [5, 5] would be reported as appearing more than once by this algorithm. You could do something like the following, based on @Paul's answer, but it involves a loop:
import numpy as np
repeated_matches = np.array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])
idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)
>>> print(grouped_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64),
array([ 7, 8, 9, 10], dtype=int64)]
# If there are consecutive numbers in grouped_idxs, that means that there is a long
# sequence that should be excluded. So, you'd have to check for consecutive numbers
filtered_idxs = []
for idx in grouped_idxs:
    if not all((idx[1:] - idx[:-1]) == 1):
        filtered_idxs.append(idx)
>>> print(filtered_idxs)
[array([ 6, 7, 8, 12, 13, 14], dtype=int64)]
Some tests:
>>> n = 3
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> %timeit repeated_sequences(a, n)
414 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> n = 4
>>> a = np.random.randint(0, 10, (10000,))
>>> %timeit repeated_sequences(a, n)
3.88 s ± 54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> result, _ = repeated_sequences(a, n)
>>> result.shape
(2637, 4)
This is not the most efficient implementation by far, but it works as a 2D approach. Plus, if there aren't any repeated sequences, it returns empty lists.
EDIT: Full implementation
I vectorized the routine I added in the Disclaimer section as a possible solution to the long sequence problem and ended up with the following:
import numpy as np
# Taken from:
# https://stackoverflow.com/questions/53051560/stacking-numpy-arrays-of-different-length-using-padding
def stack_padding(it):
    def resize(row, size):
        new = np.array(row)
        new.resize(size)
        return new

    row_length = max(it, key=len).__len__()
    mat = np.array([resize(row, row_length) for row in it])
    return mat
def repeated_sequences(arr, n):
    Na = arr.size
    r_seq = np.arange(n)
    n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
    unique_seqs = np.unique(n_seqs, axis=0)
    comp = n_seqs == unique_seqs[:, None]
    M = np.all(comp, axis=-1)
    repeated_seqs = []
    idxs_repeated_seqs = []
    if M.any():
        matches = np.apply_along_axis(np.convolve, -1, M, np.ones((n), dtype=int))
        repeated_inds = np.count_nonzero(matches, axis=-1) > n
        if repeated_inds.any():
            repeated_matches = matches[repeated_inds]
            idxs = np.argwhere(repeated_matches > 0)
            grouped_idxs = np.split(
                idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
            )
            # Additional routine
            # Pad this uneven array with zeros so that we can use it normally
            grouped_idxs = np.array(grouped_idxs, dtype=object)
            padded_idxs = stack_padding(grouped_idxs)
            # Find the indices where there are padded zeros
            pad_positions = padded_idxs == 0
            # Perform the "consecutive-numbers check" (this will take one
            # item off the original array, so we have to correct for its shape).
            idxs_to_remove = np.pad(
                (padded_idxs[:, 1:] - padded_idxs[:, :-1]) == 1,
                [(0, 0), (0, 1)],
                constant_values=True,
            )
            pad_positions = np.argwhere(pad_positions)
            i = pad_positions[:, 0]
            j = pad_positions[:, 1] - 1  # Shift by one (shape correction)
            idxs_to_remove[i, j] = True  # Masking, since we don't want pad indices
            # Obtain a final mask (boolean opposite of indices to remove)
            final_mask = ~idxs_to_remove.all(axis=-1)
            grouped_idxs = grouped_idxs[final_mask]  # Filter the long sequences
            repeated_seqs = unique_seqs[repeated_inds][final_mask]
            # In order to get the correct indices, we must first limit the
            # search to a shape (on axis=1) of the largest multiple of n that
            # fits. This will avoid taking more indices than we should to show
            # where each repeated sequence begins.
            # Note: the bit trick shape & (-n) only rounds down to a multiple
            # of n when n is a power of two, so use integer division instead.
            to = (padded_idxs.shape[1] // n) * n
            # Build the final list of indices (that goes from 0 to `to` with
            # a step of n)
            idxs_repeated_seqs = [
                grouped_idxs[i][:to:n] for i in range(grouped_idxs.shape[0])
            ]
    return repeated_seqs, idxs_repeated_seqs
For example,
n = 2
examples = [
    # First example is your original example array.
    np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
    # Second example has a long sequence of 5's, and since there aren't
    # any [5, 5] anywhere else, it's not taken into account and therefore
    # should not come out.
    np.array([1, 5, 5, 5, 5, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
    # Third example has the same long sequence but since there is a [5, 5]
    # later, then it should take it into account and this sequence should
    # be found.
    np.array([1, 5, 5, 5, 5, 6, 5, 5, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),
    # Fourth example has a [5, 5] first and later it has a long sequence of
    # 5's which are uneven and the previous implementation got confused with
    # the indices to show as the starting indices. In this case, it should be
    # 1, 13 and 15 for [5, 5].
    np.array([1, 5, 5, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 5, 5, 5, 5, 5]),
]
for a in examples:
    print(f"\nExample: {a}")
    repeated_seqs, inds = repeated_sequences(a, n)
    for i, seq in enumerate(repeated_seqs):
        print(f"\t{seq} with indexes {inds[i]}")
Output (as expected):
Example: [1 5 7 9 4 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
[7 9] with indexes [2 8]
[9 4] with indexes [3 9]
Example: [1 5 5 5 5 6 3 3 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [0 16]
[3 3] with indexes [6 12]
[3 7] with indexes [7 13]
Example: [1 5 5 5 5 6 5 5 7 9 4 0 3 3 7 8 1 5]
[1 5] with indexes [ 0 16]
[5 5] with indexes [1 3 6]
Example: [1 5 5 9 4 6 3 3 7 9 4 0 3 5 5 5 5 5]
[5 5] with indexes [ 1 13 15]
[9 4] with indexes [3 9]
You can test it out yourself with more examples and more cases. Keep in mind this is what I understood from your disclaimer. If you want to count the long sequences as one, even if multiple sequences are in there (for example, [5, 5] appears twice in [5, 5, 5, 5]), this won't work for you and you'd have to come up with something else.
Generally, I want to sort each cell of some columns in a pandas DataFrame based on the value in one other column, which stores the rank of the other columns' values.
Suppose I have a DataFrame like this, where chrs holds the characters I want to sort and rank_of_chr is the order of the characters for each row:
import pandas as pd
import numpy as np
import string
from operator import itemgetter
letters = list(string.ascii_lowercase)
np.random.seed(0)
# generate length for each row
data = pd.DataFrame({'col0': np.random.randint(2,10,10)})
# generate random string for each row
data['chrs'] = data.col0.apply(lambda x: ','.join(np.random.choice(letters) for i in range(x)))
# generate random rank for each row
data['rank_of_chr'] = data.col0.apply(lambda x: np.random.choice(x,x,replace = False))
data.iloc[:,1:]
chrs rank_of_chr
0 v,s,e,x,g,y [2, 3, 5, 1, 4, 0]
1 y,m,b,g,h,x,o,y,r [0, 4, 2, 3, 5, 6, 7, 1, 8]
2 f,z,n,i,j,u,t [4, 1, 5, 0, 6, 2, 3]
3 q,t [0, 1]
4 f,p,p,a,s [3, 0, 2, 1, 4]
5 d,y,r,t,t [1, 4, 2, 0, 3]
6 t,o,h,a,b [1, 2, 0, 3, 4]
7 j,z,a,k,u,x,d,l,s [7, 5, 1, 2, 3, 8, 6, 0, 4]
8 x,c,a [2, 0, 1]
9 a,e,v,f,g [0, 2, 3, 4, 1]
I want to sort the chrs value based on the rank_of_chr value for each row. For instance, for row 9, I want a,g,e,v,f (a,e,v,f,g with rank [0,2,3,4,1]; rank is ascending, just like rank() in SQL).
Since the real data has 50,000,000 rows, I want to find the fastest method for it.
What I have tried is:
use itertuples to go over the rows, and a for loop to iterate over each column I want to sort.
for each row, use np.argsort to get the index order of the sorted chrs, and then use itemgetter to index the original value of chrs
I revise the DataFrame's values in place using data.at[index, col_name] = new_value
cols_need_sort = ['chrs']
for i in data.itertuples():
    this_order = np.argsort(list(map(int, data.loc[i.Index,'rank_of_chr'])))
    for col_name in cols_need_sort:
        data.at[i.Index, col_name] = itemgetter(*this_order)(data.loc[i.Index,col_name].split(','))
data.iloc[:,1:]
Any method to boost performance for this task?
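One direction that may help, sketched under assumptions: since each rank is a permutation of 0..k-1, the characters can be placed directly at their rank positions, and the column can be rebuilt in one pass instead of calling data.loc / data.at per cell. The helper name reorder_by_rank is made up for illustration, the result is stored as a comma-joined string (the loop above stores a tuple), and this has not been benchmarked on the real data.

```python
import pandas as pd

def reorder_by_rank(chrs, rank):
    """Place each comma-separated item of `chrs` at the position given by `rank`.

    Hypothetical helper; assumes `rank` is a permutation of range(len(items)).
    """
    parts = chrs.split(',')
    ordered = [None] * len(parts)
    for part, r in zip(parts, rank):
        ordered[r] = part            # item with rank r goes to position r
    return ','.join(ordered)

# Two rows taken from the example table above (rows 9 and 8).
data = pd.DataFrame({
    'chrs': ['a,e,v,f,g', 'x,c,a'],
    'rank_of_chr': [[0, 2, 3, 4, 1], [2, 0, 1]],
})

# Build the whole column at once rather than writing cell by cell.
data['chrs'] = [
    reorder_by_rank(c, r) for c, r in zip(data['chrs'], data['rank_of_chr'])
]
print(data['chrs'].tolist())
```

Placing items by rank gives the same ordering as indexing with np.argsort(rank), because argsort of a permutation is its inverse.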
I am finding outliers in a column and storing them in a list. Now I want to delete from the column all the values that are present in my list.
How can I achieve this?
This is my function for finding outliers
outlier=[]
def detect_outliers(data):
    threshold=3
    m = np.mean(data)
    st = np.std(data)
    for i in data:
        #calculating z-score value
        z_score=(i-m)/st
        #if the absolute z-score is greater than the threshold, it is an outlier
        if np.abs(z_score)>threshold:
            outlier.append(i)
    return outlier
This is my column in data frame
df_train_11.AMT_INCOME_TOTAL
import numpy as np, pandas as pd
df = pd.DataFrame(np.random.rand(10,5))
def detect_outliers(data):
    threshold=0.5
    outlier_list=[]
    for i in data:
        #calculating z-score values for column i
        z_score=(data.loc[:,i]- np.mean(data.loc[:,i])) /np.std(data.loc[:,i])
        outliers = np.abs(z_score)>threshold
        #store the row labels of the outliers in this column
        outlier_list.append(data.index[outliers].tolist())
    return outlier_list
outlier_list = detect_outliers(df)
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outliers of each column. outlier_list[0] gives you [1, 2, 4, 5, 6, 7, 9] which means that the rows 1,2,etc are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This will filter the DataFrame, keeping only the rows where ONE column (e.g. 'B') is within three standard deviations of its mean.
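To connect this back to the original question (removing the outlier values from a single column), here is a minimal sketch with a made-up income Series. The helper name drop_outliers and the sample data are assumptions for illustration; ddof=0 is used so the standard deviation matches np.std from the question, whereas pandas' .std() defaults to ddof=1.

```python
import pandas as pd

def drop_outliers(series, threshold=3.0):
    """Return `series` without the values whose |z-score| exceeds `threshold`.

    Hypothetical helper; uses the population standard deviation (ddof=0).
    """
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() <= threshold]

# Made-up data: twenty ordinary values and one extreme value.
income = pd.Series([50.0] * 20 + [10_000.0])

cleaned = drop_outliers(income)

# If the outliers were already collected in a list, isin removes them directly.
outlier = [10_000.0]
cleaned_isin = income[~income.isin(outlier)]

print(len(income), len(cleaned), len(cleaned_isin))
```

Both results keep the original index labels, so they can be assigned back or used to filter the parent DataFrame with df.loc[cleaned.index].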
lookup = np.array([60, 40, 50, 60, 90])
The values in the following arrays are equal to indices of lookup.
a = np.array([1, 2, 0, 4, 3, 2, 4, 2, 0])
b = np.array([0, 1, 2, 3, 3, 4, 1, 2, 1])
c = np.array([4, 2, 1, 4, 4, 0, 4, 4, 2])
array   1st column element   lookup value
a       1                    --> 40
b       0                    --> 60
c       4                    --> 90
Maximum is 90.
So, first element of result is 4.
This way,
expected result = array([4, 2, 0, 4, 4, 4, 4, 4, 0])
How to get it?
I tried as:
d = np.vstack([a, b, c])
print (d)
res = lookup[d]
res = np.max(res, axis = 0)
print (d[enumerate(lookup)])
I got error
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Do you want this:
d = np.vstack([a,b,c])
# option 1
rows = lookup[d].argmax(0)
d[rows, np.arange(d.shape[1])]
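# option 2
# (returns the first index of lookup that holds the maximum value, so it can
# differ from option 1 when lookup contains duplicate values, such as 60 here)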
(lookup[:,None] == lookup[d].max(0)).argmax(0)
Output:
array([4, 2, 0, 4, 4, 4, 4, 4, 0])
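For reference, option 1 can be written as a self-contained sketch with np.take_along_axis, which makes the "pick one entry per column" step explicit. This uses the arrays from the question; nothing beyond standard NumPy calls is assumed.

```python
import numpy as np

lookup = np.array([60, 40, 50, 60, 90])
a = np.array([1, 2, 0, 4, 3, 2, 4, 2, 0])
b = np.array([0, 1, 2, 3, 3, 4, 1, 2, 1])
c = np.array([4, 2, 1, 4, 4, 0, 4, 4, 2])

d = np.vstack([a, b, c])              # shape (3, 9): one row per input array
values = lookup[d]                    # looked-up value for every entry of d
rows = values.argmax(axis=0)          # for each column, the row with the largest value
result = np.take_along_axis(d, rows[None, :], axis=0)[0]
print(result)                         # [4 2 0 4 4 4 4 4 0]
```

When several rows tie for the maximum in a column, argmax picks the first one, so the index from the earliest array (a, then b, then c) is returned.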
Given a starting numpy array that looks like:
B = np.array( [1, 1, 1, 0, 2, 2, 1, 3, 3, 0, 4, 4, 4, 4] )
What is the most efficient way to swap one set of values for another when there are duplicates? For example, let
s1 = [1,2,4]
s2 = [4,1,2]
An inefficient swapping method would iterate through s1 and s2 as so:
B2 = B.copy()
for x,y in zip(s1,s2):
    B2[B==x] = y
Giving as output
B2 -> [4, 4, 4, 0, 1, 1, 4, 3, 3, 0, 2, 2, 2, 2]
Is there a way to do this essentially in-place without the zip loop?
>>> B = np.array( [1, 1, 1, 0, 2, 2, 1, 3, 3, 0, 4, 4, 4, 4] )
>>> s1 = [1,2,4]
>>> s2 = [4,1,2]
>>> B2 = B.copy()
>>> c, d = np.where(B == np.array(s1)[:,np.newaxis])
>>> B2[d] = np.repeat(s2,np.bincount(c))
>>> B2
array([4, 4, 4, 0, 1, 1, 4, 3, 3, 0, 2, 2, 2, 2])
If you have only integers that are between 0 and n (if not, it is no problem to generalize to any integer range, unless it is very sparse), the most efficient way is to use take/fancy indexing:
swap = np.arange(B.max() + 1) # all values in B
swap[s1] = s2 # replace the values you want to be replaced
B2 = swap.take(B) # or swap[B]
This seems almost twice as fast for the small B given here, and the speedup grows with larger B: repeating B to a length of about 100000 already gives 8x. It also avoids the == operation for every s1 element, so it will scale much better as s1/s2 get large.
EDIT: you could also use np.put (also in the other answer) for some speedup in place of swap[s1] = s2. For these 1D problems take/put are simply faster.
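Put together, the lookup-table approach can be sketched as a small helper. The name swap_values is made up for illustration, and the sketch assumes B, s1 and s2 hold non-negative integers; the table is sized to cover both B and s1 so that indexing with s1 cannot go out of bounds.

```python
import numpy as np

def swap_values(B, s1, s2):
    """Return a copy of B in which every value s1[k] is replaced by s2[k].

    Hypothetical helper; assumes non-negative integer values throughout.
    """
    size = max(int(B.max()), max(s1)) + 1
    swap = np.arange(size)       # identity mapping: value v maps to v
    swap[s1] = s2                # overwrite only the values to be replaced
    return swap[B]               # same as swap.take(B)

B = np.array([1, 1, 1, 0, 2, 2, 1, 3, 3, 0, 4, 4, 4, 4])
s1 = [1, 2, 4]
s2 = [4, 1, 2]

B2 = swap_values(B, s1, s2)
print(B2)                        # [4 4 4 0 1 1 4 3 3 0 2 2 2 2]
```

Because all replacements are read from the table in one step, chained mappings such as 1 -> 4 and 4 -> 2 do not interfere with each other.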