pandas np.where based on MultiIndex level

I have a DataFrame with a MultiIndex. The two levels are 'Nr' and 'Price'. Is it possible to use np.where on index level 1 ('Price') to create a new column ('ZZ')?
'ZZ' should be calculated as column 'first' multiplied by 2 wherever level 1 ('Price') is equal to 'x'.
import pandas as pd
import numpy as np

index = pd.MultiIndex.from_product([['s1', 's2', 's3'], ['x', 'y']])
df = pd.DataFrame([1, 2, 3, 4, 5, 6], index, columns=['first'])
df.index.names = ['Nr', 'Price']
df
I tried:
df['ZZ'] = np.where(df['Price']=='x',df['0']*2,np.nan)
I obtain a KeyError, because 'Price' is an index level, not a column.
Thank you!

You should use get_level_values:
np.where(df.index.get_level_values(1)=='x', df['first']*2, np.nan)
array([ 2., nan, 6., nan, 10., nan])
#df['ZZ'] = np.where(df.index.get_level_values(1)=='x', df['first']*2, np.nan)
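For reference, an equivalent sketch that selects the level by name instead of by position and fills only the 'x' rows via boolean indexing (assuming the same df as above; mask is just an illustrative name):
mask = df.index.get_level_values('Price') == 'x'
df['ZZ'] = np.nan
df.loc[mask, 'ZZ'] = df.loc[mask, 'first'] * 2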

Related

How to use the pandas isin function on a 2d numpy array?

I have created a 2d numpy array with 2 rows and 5 columns.
import numpy as np
import pandas as pd
arr = np.zeros((2, 5))
arr[0] = [12, 94, 4, 4, 2]
arr[1] = [1, 3, 4, 12, 46]
I have also created a dataframe with two columns, col1 and col2:
list1 = [1,2,3,4,5]
list2 = [2,3,4,5,6]
df = pd.DataFrame({'col1': list1, 'col2': list2})
I used pandas isin function with col1 and col2 to create a boolean value list, like this:
df['col1'].isin(df['col2'])
output
0 False
1 True
2 True
3 True
4 True
Now I want to use these boolean values to slice the 2d array that I created before. I can do that for a single row, but not for the whole 2d array at once:
print(arr[0][df['col1'].isin(df['col2'])])
print(arr[1][df['col1'].isin(df['col2'])])
output:
[94. 4. 4. 2.]
[ 3. 4. 12. 46.]
But when I do it for the whole array at once, like this:
print(arr[df['col1'].isin(df['col2'])])
I get the error:
IndexError: boolean index did not match indexed array along dimension 0; dimension is 2 but corresponding boolean dimension is 5
Is there a way to achieve this?
You should slice on the second dimension of the array:
arr[:, df['col1'].isin(df['col2'])]
output:
array([[94., 4., 4., 2.],
[ 3., 4., 12., 46.]])
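If you prefer an explicit NumPy mask, the same selection can be written as the small sketch below (assuming arr and df from the question; mask and result are just illustrative names):
mask = df['col1'].isin(df['col2']).to_numpy()  # boolean array of length 5
result = arr[:, mask]  # keep both rows, select only the matching columns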

I have problems replacing inf and -inf with nan in pandas

I found code on the web to replace inf and -inf with np.nan, however it did not work on my computer.
df = pd.DataFrame({"A" : [4.6, 5., np.inf]})
new_dataframe = a_dataframe.replace([np.inf, -np.inf], np.nan)
df
My output
A
0 4.6
1 5.0
2 inf
Does somebody know a solution?
Import pandas and numpy, then assign the result of df.replace([np.inf, -np.inf], np.nan) back to the dataframe. Calling df.replace([np.inf, -np.inf], np.nan) on its own does not update the dataframe; the result needs to be assigned with = for the change to happen.
Also, in the code you provided there are new_dataframe and a_dataframe, which have nothing to do with df. Try the code below.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [4.6, 5., np.inf]})
df = df.replace([np.inf, -np.inf], np.nan)
print(df)
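If you prefer to modify the dataframe in place instead of reassigning, replace also accepts inplace=True (a minimal sketch, assuming the same df as above):
df.replace([np.inf, -np.inf], np.nan, inplace=True)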

Create a 3d tensor from a pandas dataframe (pytorch)

What is the easiest way (I am looking for the minimum number of code lines) to convert a pandas dataframe of 4 columns into a 3d tensor, padding the missing values along the way?
import pandas as pd
# initialize data of lists.
data = {'Animal': ['Cat', 'Dog', 'Dog', 'Dog'],
        'Country': ['USA', 'Canada', 'USA', 'Canada'],
        'Likes': ['Petting', 'Hunting', 'Petting', 'Petting'],
        'Age': [1, 2, 3, 4]}
# there are no duplicate lines in terms of Animal, Country and Likes, so I do not need any aggregation function
# Create DataFrame
dfAnimals = pd.DataFrame(data)
dfAnimals
I want to create a 3d tensor with shape (2, 2, 2) --> (Animal, Country, Likes), where Age is the value. I also want to fill the missing values with 0.
There might be a solution with fewer lines and more optimized library calls, but this seems to do the trick:
import pandas as pd
import numpy as np
import torch

data = ...  # the dict from the question above
df = pd.DataFrame(data)

CAT = df.columns.tolist()
CAT.remove("Age")

# encode categories as integers and extract the shape
shape = []
for c in CAT:
    shape.append(len(df[c].unique()))
    df[c] = df[c].astype("category").cat.codes
shape = tuple(shape)

# get indices as tuples and corresponding values
idx = [tuple(t) for t in df.values[:, :-1]]
values = df.values[:, -1]

# init final matrix with zeros and fill it from indices
A = np.zeros(shape)
for i, v in zip(idx, values):
    A[i] = v

# convert to pytorch tensor
A = torch.tensor(A)
print(A)
tensor([[[0., 0.],
         [0., 1.]],

        [[2., 4.],
         [0., 3.]]], dtype=torch.float64)
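A shorter alternative sketch, under the assumption that building the full MultiIndex product of the category labels and reshaping is acceptable (the sorted label order and the (2, 2, 2) shape come from the sample data; variable names are illustrative):
cols = ['Animal', 'Country', 'Likes']
df = pd.DataFrame(data)  # fresh frame with the original labels
s = df.set_index(cols)['Age']
full_index = pd.MultiIndex.from_product([sorted(df[c].unique()) for c in cols])
A = torch.tensor(s.reindex(full_index, fill_value=0).to_numpy().reshape(2, 2, 2))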

Set value of specific cell in pandas dataframe to sum of two other cells

I have a dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data={'X': [1.5, 6.777, 2.444, np.NaN],
          'Y': [1.111, np.NaN, 8.77, np.NaN],
          'Z': [5.0, 2.333, 10, 6.6666]})
I think this should work, but I get the following error:
df.at[1,'Z'] = (df.loc[[2],'X'] + df.loc[[0],'Y'])
ValueError: setting an array element with a sequence.
How can I achieve this?
This should work. df.loc[[2],'X'] and df.loc[[0],'Y'] return one-element Series rather than scalars, so their sum is a Series and cannot be assigned to a single cell; use scalar labels instead:
df.loc[1, 'Z'] = df.loc[2,'X'] + df.loc[0,'Y']
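The original .at-based form also works once scalar labels are used (a minimal variant of the line above):
df.at[1, 'Z'] = df.at[2, 'X'] + df.at[0, 'Y']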

DataFrame.apply unintuitively changes int to float, breaking an index lookup

Problem description
The column 'a' has type integer, not float. The apply function should not change the type just because the dataframe has another, unrelated float column.
I understand why it happens: it detects the most suitable common type for a Series. I still consider it unintuitive that I select a group of columns to apply a function that only works on ints, not on floats, and then, as soon as I remove one unrelated column, I get an exception, because now I only have numeric columns and all ints became floats.
>>> import pandas as pd
# This works.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': ['', '', '']}).apply(lambda row: row['a'], axis=1)
0 1
1 2
2 3
dtype: int64
# Here we also expect 1, 2, 3, as above.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: row['a'], axis=1)
0 1.0
1 2.0
2 3.0
# Why floats?!?!?!?!?!
# It's an integer column:
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})['a'].dtype
dtype('int64')
Expected Output
0 1
1 2
2 3
dtype: int64
Specifically, in my problem I am trying to use the value inside the apply function to get a value from a list. I am trying to do this in a performant way, and recasting to int inside the apply is too slow.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: myList[row['a']], axis=1)
https://github.com/pandas-dev/pandas/issues/23230
This is the only source I could find describing the same problem.
It seems like your underlying problem is to index a list by the values in one of your DataFrame columns. This can be done by converting your list to an array, after which you can slice normally:
Sample Data
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 0, 3], 'b': ['', '', '']})
myList = ['foo', 'bar', 'baz', 'boo']
Code:
np.array(myList)[df.a.to_numpy()]
#array(['bar', 'baz', 'boo'], dtype='<U3')
Or if you want the Series:
pd.Series(np.array(myList)[df.a.to_numpy()], index=df.index)
#0 bar
#1 foo
#2 boo
#dtype: object
Alternatively with a list comprehension this is:
[myList[i] for i in df.a]
#['bar', 'foo', 'boo']
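Another sketch, assuming the lookup only depends on column 'a': mapping the single integer column sidesteps the row-wise upcast entirely.
df.a.map(lambda i: myList[i])
#0    bar
#1    foo
#2    boo
#Name: a, dtype: object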
You are getting caught by pandas upcasting: certain operations result in an upcast column dtype. The pandas 0.24 documentation describes this among the gotchas: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#gotchas
A few examples of operations where this shows up:
import pandas as pd
import numpy as np

print(pd.__version__)

# float64 is the default dtype of an empty dataframe.
df = pd.DataFrame({'a': [], 'b': []})['a'].dtype
print(df)

try:
    df['a'] = [1, 2, 3, 4]
except TypeError as te:
    # good, the default dtype is float64
    print(te)
print(df)

# even if 'default' is changed, this is a surprise
# because referring to all columns does convert to float
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# creates an index, "a" is float type
df.loc["a", "col1":"col2"] = np.int64(0)
print(df.dtypes)

df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# not upcast
df.loc[:"col1"] = np.int64(0)
print(df.dtypes)
Taking a shot at a performant answer that works around such upcasting behavior:
import pandas as pd
import numpy as np
print(pd.__version__)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
df['a'] = df['a'].apply(lambda row: row+1)
df['b'] = df['b'].apply(lambda row: row+1)
print(df)
print(df['a'].dtype)
print(df['b'].dtype)
The dtypes are preserved:
0.24.2
a b
0 2 1.0
1 3 1.0
2 4 1.0
int64
float64