Consider the following code:
aa = np.arange(16)
step = 4
bb = aa[::step]
This selects every 4th element. Is there a quick and easy numpy function to select the complement of bb? I'm looking for the following output:
array([1, 2, 3, 5, 6, 7, 9, 10, 11, 13, 14, 15])
Yes, I could generate indices and then do np.setdiff1d, but I'm looking for something more elegant than that.
If you're looking for a simple single-liner:
np.delete(aa, slice(None, None, 4))
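For the example above this reproduces the requested output exactly:
import numpy as np

aa = np.arange(16)
np.delete(aa, slice(None, None, 4))
# -> array([ 1,  2,  3,  5,  6,  7,  9, 10, 11, 13, 14, 15])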
Another solution (I don't know about elegant): you could define a boolean selection mask of ones, set every fourth element to False, and then index the original array:
o = np.ones_like(aa, dtype=bool)
o[::step] = False
aa[o]
A flexible way to select based on an arbitrary repeated position is to use a modulo; here, every element at position step-1 within each group of step elements is dropped:
bb = aa[np.arange(len(aa))%step != step-1]
Output:
array([ 0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14])
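Note that this drops positions 3, 7, 11, ...; to reproduce the exact output requested in the question (dropping positions 0, 4, 8, ...), compare against 0 instead:
aa[np.arange(len(aa)) % step != 0]
# -> array([ 1,  2,  3,  5,  6,  7,  9, 10, 11, 13, 14, 15])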
Related
I want to create a subset of my data by applying tf.data.Dataset filter operation. I have this data:
data = tf.convert_to_tensor([[1, 2, 1, 1, 5, 5, 9, 12], [1, 2, 3, 8, 4, 5, 9, 12]])
dataset = tf.data.Dataset.from_tensor_slices(data)
I want to retrieve a subset of 'dataset' which corresponds to all elements whose first column is equal to 1. So, the result should be:
[[1, 1, 1], [1, 3, 8]] # dtype : dataset
I tried this:
subset = dataset.filter(lambda x: tf.equal(x[0], 1))
But I don't get the correct result, since it just gives me back x[0].
Can someone help me?
I finally resolved it:
a = tf.convert_to_tensor([1, 2, 1, 1, 5, 5, 9, 12])
b = tf.convert_to_tensor([1, 2, 3, 8, 4, 5, 9, 12])
data_set = tf.data.Dataset.from_tensor_slices((a, b))
subset = data_set.filter(lambda x, y: tf.equal(x, 1))
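For reference, a minimal check of the filtered dataset (assuming TF 2.x eager execution):
for x, y in subset:
    print(x.numpy(), y.numpy())
# 1 1
# 1 3
# 1 8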
I have a dataframe of the following form.
df = pd.DataFrame({'X': [0, 3, 6, 7, 8, 11],
                   'Y1': [8, 5, 4, 3, 2, 1.5],
                   'Y2': [1, 2, 4, 5, 5, 5]})
I would like to create a new dataframe by interpolating, so that 'X' steps in fixed increments [0, 2, 4, 6, 8, 10].
To find the new 'Y' values I need to build f(X) = Y1 and then evaluate it at each step in X. But since I have many Y columns, I think there must be a cleverer way to do this.
The solution I found was the following (here b is my dataframe, which in my real data also has 'StepNo' and 'PointNo' columns):
step_size = 0.25
no_steps = int(np.floor(max(b['X']) / step_size))
for i in range(no_steps + 1):
    b = b.append({'X': 0.25 * i, 'StepNo': 10, 'PointNo': 23 + i}, ignore_index=True)
b = b.sort_values(['X'])
b = b.set_index(['X'])
c = b.interpolate('index')
c = c.reset_index()
c = c.sort_values(['PointNo'])
So first I define the step size, then I calculate the number of steps and append them to the dataframe. Then I sort the dataframe and set 'X' as the index so I can interpolate using 'index' as the method.
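Note that DataFrame.append is deprecated (and removed in pandas 2.0). A sketch of the same idea without the loop, using the small example frame above: union the original 'X' values with the fixed grid, interpolate on the index, then keep only the grid points.
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [0, 3, 6, 7, 8, 11],
                   'Y1': [8, 5, 4, 3, 2, 1.5],
                   'Y2': [1, 2, 4, 5, 5, 5]})

grid = np.arange(0, 12, 2)                    # fixed steps [0, 2, 4, 6, 8, 10]
out = (df.set_index('X')
         .reindex(np.union1d(df['X'], grid))  # insert the grid points as NaN rows
         .interpolate('index')                # linear in X for every Y column at once
         .loc[grid]                           # keep only the grid
         .rename_axis('X')
         .reset_index())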
I have two numpy arrays. One of them is 2D while the other is 1D.
>>> a = np.arange(0,20).reshape(2,10)
>>> a
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
>>> b = np.full( a.shape[1], 10 )
>>> b
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
I want to compare them column-wise:
If a column's elements in a are identical to the corresponding element of b, then store the row number(s) of a.
Else, find the closest match in a to b and store the row number(s).
In my example, the output from the comparison should be:
[ 1, [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1] ]
How do I do this in NumPy?
I was thinking of using np.where( a==b, run a function to get row(s) if same, run another function to get row(s) of diff )? Is this the way?
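A sketch of one possible reading (treating an exact match as distance zero and taking the row(s) at minimum absolute difference per column; this tie-handling is an assumption and may not match the expected output listed above exactly):
import numpy as np

a = np.arange(0, 20).reshape(2, 10)
b = np.full(a.shape[1], 10)

diff = np.abs(a - b)   # element-wise distance to b; an exact match has distance 0
result = [np.flatnonzero(col == col.min()).tolist() for col in diff.T]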
I am finding outliers in a column and storing them in a list. Now I want to delete all the values from the column which are present in my list.
How can I achieve this?
This is my function for finding outliers:
outlier = []
def detect_outliers(data):
    threshold = 3
    m = np.mean(data)
    st = np.std(data)
    for i in data:
        # calculate the z-score value
        z_score = (i - m) / st
        # if the z-score is greater than the threshold, it's an outlier
        if np.abs(z_score) > threshold:
            outlier.append(i)
    return outlier
This is my column in the data frame:
df_train_11.AMT_INCOME_TOTAL
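To drop the listed values from the column directly, you can boolean-index with isin (a sketch assuming the column and function names above):
outliers = detect_outliers(df_train_11.AMT_INCOME_TOTAL)
df_clean = df_train_11[~df_train_11.AMT_INCOME_TOTAL.isin(outliers)]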
import numpy as np, pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))

outlier_list = []
def detect_outliers(data):
    threshold = 0.5
    for i in data:  # iterating a DataFrame yields its column labels
        # z-score of every value in column i
        z_score = (df.loc[:, i] - np.mean(df.loc[:, i])) / np.std(df.loc[:, i])
        outliers = np.abs(z_score) > threshold
        outlier_list.append(df.index[outliers].tolist())
    return outlier_list

outlier_list = detect_outliers(df)
Sample output (the data is random, so yours will differ):
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outliers of each column: outlier_list[0] gives you [1, 2, 4, 5, 6, 7, 9], which means that rows 1, 2, etc. are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This will filter the DataFrame, keeping only the rows where ONE column (e.g. 'B') is within three standard deviations of its mean.
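To apply the same condition across every column at once (a common extension of this idiom, not part of the original answer):
df[(np.abs((df - df.mean()) / df.std()) < 3).all(axis=1)]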
Given a starting numpy array that looks like:
B = np.array( [1, 1, 1, 0, 2, 2, 1, 3, 3, 0, 4, 4, 4, 4] )
What is the most efficient way to swap one set of values for another when there are duplicates? For example, let
s1 = [1,2,4]
s2 = [4,1,2]
An inefficient swapping method would iterate through s1 and s2 like so:
B2 = B.copy()
for x, y in zip(s1, s2):
    B2[B == x] = y
Giving as output:
B2 -> [4, 4, 4, 0, 1, 1, 4, 3, 3, 0, 2, 2, 2, 2]
Is there a way to do this essentially in-place without the zip loop?
>>> B = np.array( [1, 1, 1, 0, 2, 2, 1, 3, 3, 0, 4, 4, 4, 4] )
>>> s1 = [1,2,4]
>>> s2 = [4,1,2]
>>> B2 = B.copy()
>>> c, d = np.where(B == np.array(s1)[:,np.newaxis])
>>> B2[d] = np.repeat(s2,np.bincount(c))
>>> B2
array([4, 4, 4, 0, 1, 1, 4, 3, 3, 0, 2, 2, 2, 2])
Here np.where returns, for each value in s1, the positions in B where it occurs (c is the s1 index, d the position in B, grouped by c), and np.repeat expands s2 by np.bincount(c) so each matched position gets its replacement value.
If you have only integers between 0 and n (if not, it's no problem to generalize to any integer range unless it's very sparse), the most efficient way is to use take/fancy indexing:
swap = np.arange(B.max() + 1) # all values in B
swap[s1] = s2 # replace the values you want to be replaced
B2 = swap.take(B) # or swap[B]
This seems to be almost twice as fast for the small B given here, but with larger B it gets even more of a speedup: repeating B to a length of about 100,000 already gives 8x. This also avoids the == operation for every s1 element, so it will scale much better as s1/s2 get large.
EDIT: you could also use np.put (as in the other answer) for some speedup over swap[s1] = s2. For these 1D problems take/put are simply faster.
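For reference, a sketch of the np.put variant mentioned above:
import numpy as np

B = np.array([1, 1, 1, 0, 2, 2, 1, 3, 3, 0, 4, 4, 4, 4])
s1, s2 = [1, 2, 4], [4, 1, 2]

swap = np.arange(B.max() + 1)  # identity lookup table covering all values in B
np.put(swap, s1, s2)           # same effect as swap[s1] = s2
B2 = swap.take(B)              # or swap[B]
# B2 -> array([4, 4, 4, 0, 1, 1, 4, 3, 3, 0, 2, 2, 2, 2])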