How to use the pandas isin function with a 2d numpy array?

I have created a 2d numpy array with 2 rows and 5 columns.
import numpy as np
import pandas as pd
arr = np.zeros((2, 5))
arr[0] = [12, 94, 4, 4, 2]
arr[1] = [1, 3, 4, 12, 46]
I have also created a dataframe with two columns col1 and col2
list1 = [1,2,3,4,5]
list2 = [2,3,4,5,6]
df = pd.DataFrame({'col1': list1, 'col2': list2})
I used the pandas isin function with col1 and col2 to create a boolean Series, like this:
df['col1'].isin(df['col2'])
output
0 False
1 True
2 True
3 True
4 True
Now I want to use these bool values to slice the 2d array that I created before. I can do that for a single row, but not for the whole 2d array at once:
print(arr[0][df['col1'].isin(df['col2'])])
print(arr[1][df['col1'].isin(df['col2'])])
output:
[94. 4. 4. 2.]
[ 3. 4. 12. 46.]
but when I do something like this:
print(arr[df['col1'].isin(df['col2'])])
I get this error:
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2 but corresponding boolean dimension is 5
Is there a way to achieve this?

You should slice on the second dimension of the array:
arr[:, df['col1'].isin(df['col2'])]
output:
array([[94.,  4.,  4.,  2.],
       [ 3.,  4., 12., 46.]])
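Putting it all together (a minimal runnable version; the .to_numpy() call is just my preference for making the pandas-to-numpy handoff explicit):
import numpy as np
import pandas as pd

arr = np.zeros((2, 5))
arr[0] = [12, 94, 4, 4, 2]
arr[1] = [1, 3, 4, 12, 46]

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [2, 3, 4, 5, 6]})
mask = df['col1'].isin(df['col2']).to_numpy()

# the mask has length 5, matching axis 1 (the columns), so it must be
# applied there; arr[mask] tries to index axis 0, which has length 2
print(arr[:, mask])
# [[94.  4.  4.  2.]
#  [ 3.  4. 12. 46.]]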

Related

Why can I not use linspace on this basic function that contains a for-loop?

I have a larger, more complex function that essentially boils down to the below function: f(X). Why can't this function take in linspace integers as a parameter?
This error is given:
"TypeError: only integer scalar arrays can be converted to a scalar index"
import numpy as np
import matplotlib.pyplot as plt
def f(x):
    for i in range(x):
        x += 1
    return x
xlist = np.linspace(0, 100, 10)
ylist = f(xlist)
plt.figure(num=0)
linspace produces a numpy array, not a list. Don't confuse the two.
In [3]: x = np.linspace(0,100,10)
In [4]: x
Out[4]:
array([  0.        ,  11.11111111,  22.22222222,  33.33333333,
        44.44444444,  55.55555556,  66.66666667,  77.77777778,
        88.88888889, 100.        ])
The numbers will look nicer if we take into account that linspace includes the end points:
In [5]: x = np.linspace(0,100,11)
In [6]: x
Out[6]: array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90., 100.])
As an array, we can simply add a scalar; no need to write a loop:
In [7]: x + 1
Out[7]: array([ 1., 11., 21., 31., 41., 51., 61., 71., 81., 91., 101.])
range takes one, two, or three integers, but never an array (of more than one element):
In [8]: for i in range(5): print(i)
0
1
2
3
4
We can't usefully modify i inside such a loop (rebinding it has no effect on the iteration), nor the x used to create the range. But we can write a list comprehension:
In [9]: [i+1 for i in range(5)]
Out[9]: [1, 2, 3, 4, 5]
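Applying this to the original f: range(x) is evaluated once with the initial value of x, so the loop runs x times and each pass adds 1, meaning f(n) returns 2*n for a non-negative integer n. Under that reading of the code, the vectorized equivalent (a sketch, not the only possible fix) needs no loop at all:
import numpy as np

xlist = np.linspace(0, 100, 11)
ylist = 2 * xlist  # what f computes for an integer input, done array-wise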
references:
https://www.w3schools.com/python/ref_func_range.asp
https://numpy.org/doc/stable/reference/generated/numpy.linspace.html
As a one-liner, the x + 1 example above becomes:
xlist = np.linspace(0, 100, 10) + 1
Based on your comment it seems you are just trying to plot a few points. Your use of range and linspace is redundant, and you don't need a for loop to plot many points at all, unless you are plotting multiple curves or doing something fancy with each point. For example, this plots 10 points:
x = np.linspace(0, 100, 10)
y = x * x
plt.plot(x, y, '.')
I think this is a good place to start: https://matplotlib.org/3.5.1/tutorials/introductory/pyplot.html

Create a 3d tensor from a pandas dataframe (pytorch)

What is the easiest way (I am looking for the minimum number of code lines) to convert a pandas dataframe of 4 columns into a 3d tensor, padding the missing values along the way?
import pandas as pd
# initialize data of lists.
data = {'Animal': ['Cat', 'Dog', 'Dog', 'Dog'],
        'Country': ["USA", "Canada", "USA", "Canada"],
        'Likes': ['Petting', 'Hunting', 'Petting', 'Petting'],
        'Age': [1, 2, 3, 4]}
# there are no duplicate lines in terms of Animal, Country and Likes, so I do not need any aggregation function
# Create DataFrame
dfAnimals = pd.DataFrame(data)
dfAnimals
I want to create a 3d tensor with shape (2, 2, 2) --> (Animal, Country, Likes) and Age is the value. I also want to fill the missing values with 0
There might be a solution with fewer lines and more optimized library calls, but this seems to do the trick:
import pandas as pd
import numpy as np
import torch

data = ...  # the dict from the question above
df = pd.DataFrame(data)

CAT = df.columns.tolist()
CAT.remove("Age")

# encode categories as integers and extract the shape
shape = []
for c in CAT:
    shape.append(len(df[c].unique()))
    df[c] = df[c].astype("category").cat.codes
shape = tuple(shape)

# get indices as tuples and corresponding values
idx = [tuple(t) for t in df.values[:, :-1]]
values = df.values[:, -1]

# init final matrix with zeros and fill it from indices
A = np.zeros(shape)
for i, v in zip(idx, values):
    A[i] = v

# convert to pytorch tensor
A = torch.tensor(A)
print(A)
tensor([[[0., 0.],
         [0., 1.]],

        [[2., 4.],
         [0., 3.]]], dtype=torch.float64)
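For what it's worth, the fill loop can also be collapsed into a single advanced-indexing assignment. A sketch of the same approach (my own variant, not benchmarked; it assumes every row's Age should land at the position given by its three category codes):
import pandas as pd
import numpy as np
import torch

data = {'Animal': ['Cat', 'Dog', 'Dog', 'Dog'],
        'Country': ['USA', 'Canada', 'USA', 'Canada'],
        'Likes': ['Petting', 'Hunting', 'Petting', 'Petting'],
        'Age': [1, 2, 3, 4]}
df = pd.DataFrame(data)

cats = ['Animal', 'Country', 'Likes']
# integer codes per categorical column
codes = [df[c].astype('category').cat.codes.to_numpy() for c in cats]
shape = tuple(c.max() + 1 for c in codes)

A = np.zeros(shape)
A[tuple(codes)] = df['Age'].to_numpy()  # fills all entries at once
print(torch.tensor(A))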

pandas np.where based on multiindex level

I have a Dataframe with MultiIndex. The two Levels are 'Nr' and 'Price'. Is it possible to use np.where on Index Level 1 ('Price') to create a new column ('ZZ')?
'ZZ' should be column 'first' multiplied by 2 wherever level 1 ('Price') equals 'x'.
import pandas as pd
index = pd.MultiIndex.from_product([['s1', 's2', 's3'], ['x', 'y']])
df = pd.DataFrame([1, 2, 3, 4, 5, 6], index, columns=['first'])
df.index.names = ['Nr', 'Price']
df
I tried:
df['ZZ'] = np.where(df['Price']=='x',df['0']*2,np.nan)
I obtain a KeyError, since 'Price' is an index level, not a column.
Thank you!
You should use get_level_values:
np.where(df.index.get_level_values(1)=='x', df['first']*2, np.nan)
array([ 2., nan,  6., nan, 10., nan])
To assign it as the new column:
df['ZZ'] = np.where(df.index.get_level_values(1)=='x', df['first']*2, np.nan)
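get_level_values also accepts the level name instead of its position, which reads a bit more clearly here:
df['ZZ'] = np.where(df.index.get_level_values('Price')=='x', df['first']*2, np.nan)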

DataFrame.apply unintuitively changes int to float, breaking an index lookup

Problem description
The column 'a' has type integer, not float. The apply function should not change the type just because the dataframe has another, unrelated float column.
I understand why it happens: it detects the most suitable common type for the row Series. I still consider it unintuitive that when I select a group of columns to apply some function that only works on ints, not on floats, removing one unrelated column suddenly gets me an exception, because now all remaining columns are numeric and all the ints became floats.
>>> import pandas as pd
# This works.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': ['', '', '']}).apply(lambda row: row['a'], axis=1)
0 1
1 2
2 3
dtype: int64
# Here we also expect 1, 2, 3, as above.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: row['a'], axis=1)
0 1.0
1 2.0
2 3.0
# Why floats?!?!?!?!?!
# It's an integer column:
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})['a'].dtype
dtype('int64')
Expected Output
0 1
1 2
2 3
dtype: int64
Specifically, in my problem I am trying to use the value inside the apply function to get an element from a list. I need this to be performant, and recasting to int inside the apply is too slow.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: myList[row['a']], axis=1)
https://github.com/pandas-dev/pandas/issues/23230
That issue is the only source I could find describing the same problem.
It seems like your underlying problem is to index a list by the values in one of your DataFrame columns. This can be done by converting your list to an array, after which you can index it as usual:
Sample Data
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 0, 3], 'b': ['', '', '']})
myList = ['foo', 'bar', 'baz', 'boo']
Code:
np.array(myList)[df.a.to_numpy()]
#array(['bar', 'foo', 'boo'], dtype='<U3')
Or if you want the Series:
pd.Series(np.array(myList)[df.a.to_numpy()], index=df.index)
#0 bar
#1 foo
#2 boo
#dtype: object
Alternatively with a list comprehension this is:
[myList[i] for i in df.a]
#['bar', 'foo', 'boo']
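A pandas-only variant (my own addition, not from the linked issue) that avoids both NumPy and apply is Series.map with the list's __getitem__:
df.a.map(myList.__getitem__)
#0    bar
#1    foo
#2    boo
#dtype: object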
You are getting caught by pandas upcasting. Certain operations result in an upcast column dtype; the 0.24 docs describe this under "gotchas": https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#gotchas
Here are a few operations that trigger it:
import pandas as pd
import numpy as np
print(pd.__version__)

# float64 is the default dtype of an empty dataframe
df = pd.DataFrame({'a': [], 'b': []})
print(df['a'].dtype)  # float64

# even if the default is changed, this is a surprise,
# because assigning to all columns converts them to float
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# creates a row "a"; both columns become float
df.loc["a", "col1":"col2"] = np.int64(0)
print(df.dtypes)

df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# assigning a single column is not upcast
df.loc[:, "col1"] = np.int64(0)
print(df.dtypes)
Taking a shot at a performant answer that works around such upcasting behavior:
import pandas as pd
import numpy as np
print(pd.__version__)

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
# applying column by column keeps each column's dtype
df['a'] = df['a'].apply(lambda x: x + 1)
df['b'] = df['b'].apply(lambda x: x + 1)
print(df)
print(df['a'].dtype)
print(df['b'].dtype)
dtypes are preserved.
0.24.2
a b
0 2 1.0
1 3 1.0
2 4 1.0
int64
float64
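If the original goal of a fast row-wise list lookup still stands, another workaround (a sketch, not benchmarked) is itertuples, which preserves each column's dtype instead of packing the row into a single upcast Series:
import pandas as pd

myList = ['foo', 'bar', 'baz', 'boo']
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})

# row.a is still an integer here, so it can index the list directly
out = [myList[row.a] for row in df.itertuples(index=False)]
print(out)  # ['bar', 'baz', 'boo']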

Finding entries containing a substring in a numpy array?

I tried to find entries in an array containing a substring with np.where and an in condition:
import numpy as np
foo = "aa"
bar = np.array(["aaa", "aab", "aca"])
np.where(foo in bar)
this only returns an empty array.
Why is that so?
And is there a good alternative solution?
We can use np.core.defchararray.find to get the position of the string foo in each element of bar; it returns -1 wherever foo is not found. Checking the output of find against -1 thus tells us whether each element contains foo, and np.flatnonzero then gives us the indices of the matches. So, we would have an implementation like so -
np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Sample run -
In [91]: bar
Out[91]:
array(['aaa', 'aab', 'aca'],
dtype='|S3')
In [92]: foo
Out[92]: 'aa'
In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])
In [94]: bar[2] = 'jaa'
In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
Look at some examples of using in:
In [19]: bar = np.array(["aaa", "aab", "aca"])
In [20]: 'aa' in bar
Out[20]: False
In [21]: 'aaa' in bar
Out[21]: True
In [22]: 'aab' in bar
Out[22]: True
In [23]: 'aab' in list(bar)
Out[23]: True
It looks like in, when used with an array, works as though the array were a list. ndarray does have a __contains__ method, so in works, but it apparently tests only for whole-element equality.
In any case, note that in on a list does not check for substrings. A string's own __contains__ does the substring test, but I don't know of any builtin container class that propagates the test down to its component strings.
As Divakar shows there is a collection of numpy functions that applies string methods to individual elements of an array.
In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0, 0, -1])
Docstring:
This module contains a set of functions for vectorized string
operations and methods.
The preferred alias for defchararray is numpy.char.
For operations like this I think the np.char speeds are about the same as with:
In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)
In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
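A quick check of that claim (my own example):
In [51]: a = np.array(['aaa', 'aab', 'aca']).reshape(3, 1)
In [52]: 'aab' in a
Out[52]: True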
If using pandas is acceptable, the str.contains method does exactly this.
import numpy as np
entries = np.array(["aaa", "aab", "aca"])
import pandas as pd
pd.Series(entries).str.contains('aa') # <----
Results in:
0 True
1 True
2 False
dtype: bool
The method also accepts regular expressions for more complex patterns:
pd.Series(entries).str.contains(r'a.a')
Results in:
0 True
1 False
2 True
dtype: bool
The way you are trying to use np.where is incorrect. The first argument of np.where should be a boolean array, and you are simply passing it a boolean.
foo in bar
>>> False
np.where(False)
>>> (array([], dtype=int32),)
np.where(np.array([True, True, False]))
>>> (array([0, 1], dtype=int32),)
The problem is that numpy does not define the in operator as an element-wise boolean operation.
One way you could accomplish what you want is with a list comprehension.
foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]
bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]
You can also build the boolean mask with a list comprehension and use it to index the array directly:
mask = [foo in x for x in bar]
result = np.asarray(bar)[mask]
# result = array(['baa', 'aaf'], dtype='<U3')