I would like to use np.where on an array (arr) in order to retrieve the indices of rows matching the following condition: the first column value must be 1 and the third column value must be 2. Here is my code so far:
arr = np.array([
[0,0,0],
[1,0,2],
[0,0,0],
[1,0,2]
])
print(np.where(arr[:,0]==1 and arr[:,2]==2))
But it produces this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any idea?
Thank you
Python's and cannot operate elementwise on arrays, which is what triggers the ValueError; use the bitwise & operator instead, and note that the parentheses are required because & has higher precedence than ==:
np.where((arr[:,0]==1) & (arr[:,2]==2))
A more generic method (imagine you have 20 columns to compare) would be to use:
np.where((arr[:,[0,2]]==[1,2]).all(1))
output: (array([1, 3]),)
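Putting both forms together as one runnable check (plain NumPy, using the question's array):

```python
import numpy as np

arr = np.array([[0, 0, 0],
                [1, 0, 2],
                [0, 0, 0],
                [1, 0, 2]])

# Elementwise AND of two boolean masks; the parentheses are required
# because & binds more tightly than ==
idx_mask = np.where((arr[:, 0] == 1) & (arr[:, 2] == 2))[0]

# Generic form: compare several columns against their targets at once,
# then keep rows where every comparison holds
idx_generic = np.where((arr[:, [0, 2]] == [1, 2]).all(axis=1))[0]
```

Both give the indices of rows 1 and 3.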
Maybe this little change would help:
arr = np.array([
[0,0,0],
[1,0,2],
[0,0,0],
[1,0,2]
])
#print(np.where(arr[:,0]==1 and arr[:,2]==2))
print(np.where(arr[:,0]==1), np.where(arr[:,2]==2))
Output:
(array([1, 3], dtype=int64),) (array([1, 3], dtype=int64),)
Related
I have a multidimensional np.array like: [[2, 55, 62], [3, 56, 63], [4, 57, 64], ...].
I intend to print only the rows whose first-column value is greater than 2, returning a print like: [[3, 56, 63], [4, 57, 64], ...]
How can I get it?
All you need to do is to select just the values you want to print.
Short answer:
import numpy as np
a = np.array([[1,2,3],[3,2,1]])
print(a[a>2])
What's going on?
Well, first, a>2 returns a boolean mask telling whether the condition is met at each position of the array. This is a numpy array with exactly the same shape as a, but with dtype=bool.
Then, this mask is used to select only the values where the mask is True, which are also those that meet your condition.
Finally, you just print them.
Step by step, you can write as follows:
import numpy as np
a = np.array([[1,2,3],[3,2,1]])
print(a.shape) # output is (2, 3)
mask = a > 2
print(mask.shape) # output is (2, 3)
print(mask.dtype) # output is bool
print(mask) # here you can see True only for those positions where condition is met
print(a[mask])
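Note that a[a>2] flattens the result to a 1-D array of values. If, as in the question, you want whole rows whose first-column value exceeds 2, build the mask from that column alone (a minimal sketch using the question's data):

```python
import numpy as np

a = np.array([[2, 55, 62], [3, 56, 63], [4, 57, 64]])

# Mask computed on the first column only; indexing with it selects
# whole rows, keeping the 2-D shape
rows = a[a[:, 0] > 2]
```

This prints [[3, 56, 63], [4, 57, 64]], matching the output asked for.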
I'm practicing on a Data Cleaning Kaggle exercise.
In the parsing-dates example I can't figure out what the [1] does at the end of the indices object.
Thanks.
# Finding indices corresponding to rows in different date format
indices = np.where([date_lengths == 24])[1]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]
As described in the documentation, numpy.where called with a single argument is equivalent to calling np.asarray([date_lengths == 24]).nonzero().
numpy.nonzero returns a tuple with as many items as the input array has dimensions, each holding the indices of the non-zero values along that axis.
>>> np.nonzero([1,0,2,0])
(array([0, 2]),)
Indexing with [1] retrieves the second element (i.e. the second dimension), but since the input was wrapped into […], this is equivalent to doing:
np.where(date_lengths == 24)[0]
>>> np.nonzero([1,0,2,0])[0]
array([0, 2])
It is an artefact of the extra [] around the condition. For example:
a = np.arange(10)
To find, for example, indices where a>3 can be done like this:
np.where(a > 3)
gives as output a tuple with one array
(array([4, 5, 6, 7, 8, 9]),)
So the indices can be obtained as
indices = np.where(a > 3)[0]
In your case, the condition is between [], which is unnecessary, but still works.
np.where([a > 3])
returns a tuple whose first element is an array of zeros (indices along the extra leading axis) and whose second element is the array of indices you want
(array([0, 0, 0, 0, 0, 0]), array([4, 5, 6, 7, 8, 9]))
so the indices are obtained as
indices = np.where([a > 3])[1]
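The equivalence is easy to verify in a runnable snippet (plain NumPy, same a as above):

```python
import numpy as np

a = np.arange(10)

# Wrapping the condition in [...] adds a leading length-1 axis, so
# np.where returns two index arrays: zeros for that extra axis,
# followed by the indices you actually want
wrapped = np.where([a > 3])
plain = np.where(a > 3)
```

wrapped[1] and plain[0] hold the same indices; wrapped[0] is all zeros.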
I am trying to use pipelines and column transformers from sklearn properly, but I always end up with an error. I reproduced it in the following example.
# Imports needed to reproduce
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Data to reproduce the error
X = pd.DataFrame([[1, 2 , 3, 1 ],
                  [1, '?', 2, 0 ],
                  [4, 5 , 6, '?']],
                 columns=['A', 'B', 'C', 'D'])
# SimpleImputer to replace the '?' values with the mode
impute = SimpleImputer(missing_values='?', strategy='most_frequent')
# Simple one-hot encoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
col_transfo = ColumnTransformer(transformers=[
('missing_vals', impute, ['B', 'D']),
('one_hot', ohe, ['A', 'B'])],
remainder='passthrough'
)
Then calling the transformer as follows:
col_transfo.fit_transform(X)
Returns the following error:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']
ColumnTransformer applies its transformers in parallel, not in sequence. So the OneHotEncoder sees the un-imputed column B and balks at the mixed types.
In your case, it's probably fine to just impute on all the columns, and then encode A, B:
encoder = ColumnTransformer(transformers=[
('one_hot', ohe, ['A', 'B'])],
remainder='passthrough'
)
preproc = Pipeline(steps=[
('impute', impute),
('encode', encoder),
# optionally, just throw the model here...
])
If it's important that future missing values in A,C cause errors, then similarly wrap impute into its own ColumnTransformer.
See also Apply multiple preprocessing steps to a column in sklearn pipeline
It's giving you an error because OneHotEncoder accepts just one type of data; in your case, it's a mixture of numbers and strings. To overcome this issue you can split the pipeline after the imputer and apply the astype method to the imputer's output before encoding. Something like:
ohe.fit_transform(impute.fit_transform(X[['A','B']]).astype(float))
The error is not coming from the ColumnTransformer but from the OneHotEncoder object
col_transfo = ColumnTransformer(transformers=[
('missing_vals', impute, ['B', 'D'])],
remainder='passthrough'
)
col_transfo.fit_transform(X)
array([[2, 1, 1, 3],
[2, 0, 1, 2],
[5, 0, 4, 6]], dtype=object)
ohe.fit_transform(X)
TypeError: argument must be a string or number
OneHotEncoder is throwing this error because it gets mixed types of values (int + string) to encode in the same column; you need to cast the numeric columns to string in order to apply it.
I have an array:
>>> arr1 = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> arr1
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
I want to retrieve a list (or 1d-array) of elements of this array by giving a list of their indices, like so:
indices = [[0,0], [0,2], [2,0]]
print(arr1[indices])
# result
[1,6,7]
But it does not work; I have been looking for a solution for a while, but I only found ways to select per row and/or per column (not by specific index pairs).
Does anyone have an idea?
Cheers
Aymeric
First make indices an array instead of a nested list:
indices = np.array([[0,0], [0,2], [2,0]])
Then, index the first dimension of arr1 using the first values of indices, likewise the second:
arr1[indices[:,0], indices[:,1]]
It gives array([1, 3, 7]) (which is correct, your [1, 6, 7] example output is probably a typo).
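Equivalently, you can transpose the pairs and convert them to a tuple, which performs the same row/column split in one step (a small sketch with the question's data):

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([[0, 0], [0, 2], [2, 0]])

# Transposing the (row, col) pairs gives one array of row indices and
# one of column indices, which is exactly what fancy indexing expects
result = arr1[tuple(indices.T)]
```

result is array([1, 3, 7]), one element per index pair.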
I want to have a function that can operate on either a row or a column of a 2D ndarray. Assume the array has C order. The function changes values in the 2D data.
Inside the function I want to have identical index syntax whether it is called with a row or column. A row slice is [n,:] and column slice [:,n] so they have different shapes. Inside the function this requires different indexing expressions.
Is there a way to do this that does not require moving or allocating memory? I am under the impression that using reshape will force a copy to make the data contiguous. Is there a way to use nditer in the function?
Do you mean like this:
In [74]: def foo(arr, n):
...: arr += n
...:
In [75]: arr = np.ones((2,3),int)
In [76]: foo(arr[0,:],1)
In [77]: arr
Out[77]:
array([[2, 2, 2],
[1, 1, 1]])
In [78]: foo(arr[:,1],[100,200])
In [79]: arr
Out[79]:
array([[ 2, 102, 2],
[ 1, 201, 1]])
In the first case I'm adding 1 to one row of the array, i.e. a row slice. In the second case I'm adding an array (list) to a column; in that case n has to have the right length.
Usually we don't worry about whether the values are C contiguous. Striding takes care of access either way.
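The no-copy point can be checked directly: both a row slice and a column slice are views into the same buffer, so in-place updates through either one modify the original array without any allocation (a small check using NumPy's shares_memory helper):

```python
import numpy as np

arr = np.ones((2, 3), int)
row = arr[0, :]   # contiguous view of a row
col = arr[:, 1]   # strided view of a column; still no copy

# Both slices share arr's buffer
assert np.shares_memory(arr, row) and np.shares_memory(arr, col)

# In-place update through the column view changes arr itself
col += 100
```

After this, column 1 of arr holds 101 in both rows, confirming the slices are views, not copies.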