pandas rename_axis not taking inplace argument

I have the following code. I thought df should have index name of INDEX at the end, given I set the inplace argument. But that's not the case. What am I missing? Or is it a bug in Pandas?
>>> df = pd.DataFrame([[1,2],[3,4]])
>>> df
0 1
0 1 2
1 3 4
>>> df.rename_axis('INDEX', inplace=True)
0 1
INDEX
0 1 2
1 3 4
>>> df
0 1
0 1 2
1 3 4
>>>

It should be straightforward. I tried it myself and saw the same behavior, so congratulations, you've found a bug in Pandas.
Here is what you can do with your code for now:
>>> df = df.rename_axis('INDEX')
>>> df
0 1
INDEX
0 1 2
1 3 4
>>>
In case you're interested in contributing to Pandas: https://pandas.pydata.org/pandas-docs/stable/contributing.html#bug-reports-and-enhancement-requests

It seems to be a bug. Instead of rename_axis, you can use index.rename with the inplace=True parameter:
df.index.rename('INDEX', inplace=True)
Output of df:
0 1
INDEX
0 1 2
1 3 4
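Both workarounds above can be checked without relying on how rename_axis treats inplace in your pandas version. This is a minimal sketch, not a statement about which versions have the bug; the variable name renamed and the use of the index.name attribute are my own additions for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])

# rename_axis returns a new DataFrame, so keeping the result works
# regardless of what the inplace argument does.
renamed = df.rename_axis('INDEX')
print(renamed.index.name)  # INDEX
print(df.index.name)       # None (the original is untouched)

# Setting the name on the index object changes df itself.
df.index.name = 'INDEX'
print(df.index.name)       # INDEX
```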

Related

How to compare 2 dataframes columns and add a value to a new dataframe based on the result

I have 2 dataframes of the same length, and I'd like to compare specific columns between them. For each row, I'd like to take the value in the second column (class) from whichever dataframe has the bigger value in the first column, and assign it to a new dataframe.
See example. The first dataframe:
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
The second dataframe:
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
The new dataframe should look like:
class
0 0
1 0
2 1
3 1
4 1
Use numpy.where with DataFrame constructor:
df = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
Or DataFrame.where:
df = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print (df)
class
0 0
1 0
2 1
3 1
4 1
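For anyone who wants to run this end to end, here is a self-contained sketch of the numpy.where version using the sample data from the question. The name larger is just an illustrative helper, and I am assuming the first column is labelled with the integer 0, as the printed frames suggest.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the scalar 'class' value is broadcast to every row.
df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': 0})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 3.9], 'class': 1})

# True where df1 has the larger value in column 0.
larger = df1[0] > df2[0]

# Take df1's class where df1 is larger, otherwise df2's class.
result = pd.DataFrame({'class': np.where(larger, df1['class'], df2['class'])})
print(result['class'].tolist())  # [0, 0, 1, 1, 1]
```

Note that the comparison is positional here only because both frames share the same default index; with different indexes you would need to align them first.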
EDIT:
If there is another condition (for example, equal values), use numpy.select and, if necessary, numpy.isclose:
print (df2)
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 1.9 1
masks = [df1[0] == df2[0], df1[0] > df2[0]]
# if you need to compare floats within a tolerance
#masks = [np.isclose(df1[0], df2[0]), df1[0] > df2[0]]
vals = ['not_determined', df1['class']]
df = pd.DataFrame({'class': np.select(masks, vals, df2['class'])})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Or:
masks = [df1[0] == df2[0], df1[0] > df2[0]]
vals = ['not_determined', 0]
df = pd.DataFrame({'class': np.select(masks, vals, 1)})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
An out-of-the-box solution:
df = np.sign(df1[0].sub(df2[0])).map({1:0, -1:1, 0:'not_determined'}).to_frame('class')
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
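The sign-based variant can be tried in isolation with the second df2 from the EDIT (last value 1.9, so the final row is a tie). This is only a sketch under that assumption; I wrote the mapping keys as floats because np.sign returns floats here, which is my own choice rather than part of the answer above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': 0})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 1.9], 'class': 1})

# sign is 1.0 where df1 is larger, -1.0 where df2 is larger, 0.0 on a tie.
sign = np.sign(df1[0].sub(df2[0]))

# Map each sign to the class of the "winning" frame, or a label for ties.
df = sign.map({1.0: 0, -1.0: 1, 0.0: 'not_determined'}).to_frame('class')
print(df['class'].tolist())  # [0, 0, 1, 1, 'not_determined']
```

This relies on the classes being fixed at 0 for df1 and 1 for df2; if the class columns vary per row, the numpy.where or numpy.select versions are the safer choice.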
Since class is only ever 0 (in the first dataframe) or 1 (in the second), you could try:
df1[0].lt(df2[0]).astype(int)
For generic solutions, check jezrael's answer.
Try this one:
>>> import numpy as np
>>> import pandas as pd
>>> df_1
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
>>> df_2
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
>>> df_3=pd.DataFrame()
>>> df_3["class"]=np.where(df_1["0"]>df_2["0"], df_1["class"], df_2["class"])
>>> df_3
class
0 0
1 0
2 1
3 1
4 1

Warning with loc function with pandas dataframe

While working on an SO question I came across a warning when using loc. The precise details are below.
DataFrame Samples:
First dataFrame df1 :
>>> data1 = {'Sample': ['Sample_A','Sample_D', 'Sample_E'],
... 'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
... 'Year':[2012, 2014, 2015]}
>>> df1 = pd.DataFrame(data1)
>>> df1.set_index('Sample')
Location Year
Sample
Sample_A Bangladesh 2012
Sample_D Myanmar 2014
Sample_E Thailand 2015
Second dataframe df2:
>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
... 'Sample_A': [0,1,0,0,1],
... 'Sample_B':[0,0,1,0,0],
... 'Sample_C':[1,0,0,0,1],
... 'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num')
Sample_A Sample_B Sample_C Sample_D
Num
Value_1 0 0 1 0
Value_2 1 0 0 0
Value_3 0 1 0 1
Value_4 0 0 0 1
Value_5 1 0 1 0
>>> samples
['Sample_A', 'Sample_D', 'Sample_E']
When I use samples to select columns as follows, it works, but at the same time it produces a warning:
>>> df3 = df2.loc[:, samples]
>>> df3
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
Warnings:
indexing.py:1472: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_tuple(key)
I would like to know how to handle this in a better way.
Use reindex like:
df3 = df2.reindex(columns=samples)
print (df3)
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
Or, if you want only the columns that actually exist in df2, use Index.intersection:
df3 = df2[df2.columns.intersection(samples)]
#alternative
#df3 = df2[np.intersect1d(df2.columns, samples)]
print (df3)
Sample_A Sample_D
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
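To see the difference between the two options side by side, here is a small runnable sketch built from the question's df2 (without the Num column, which does not matter for column selection). The names with_missing, present_only and missing are hypothetical, chosen only for this example.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Sample_A': [0, 1, 0, 0, 1],
    'Sample_B': [0, 0, 1, 0, 0],
    'Sample_C': [1, 0, 0, 0, 1],
    'Sample_D': [0, 0, 1, 1, 0],
})
samples = ['Sample_A', 'Sample_D', 'Sample_E']

# reindex keeps every requested label and fills the missing ones with NaN.
with_missing = df2.reindex(columns=samples)
print(list(with_missing.columns))  # ['Sample_A', 'Sample_D', 'Sample_E']

# intersection keeps only the labels that are present in df2.
present_only = df2[df2.columns.intersection(samples)]
print(list(present_only.columns))  # ['Sample_A', 'Sample_D']

# The labels that caused the warning can be listed explicitly.
missing = pd.Index(samples).difference(df2.columns)
print(list(missing))  # ['Sample_E']
```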

Strange behavior of pandas DataFrame.agg

The Lambdas in the following code return the same Series, but the aggregation results are different. Why?
import pandas as pd
df=pd.DataFrame([1, 2])
print(df)
print(df.agg({0: lambda x: x.cumsum()}))
print(df.agg({0: lambda x: pd.Series([1, 3], name=0)}))
Which gives:
0
0 1
1 2
0
0 1
1 3
0
0 1
0 1 3
1 1 3

Pandas: Delete duplicated items in a specific column

I have a pandas DataFrame (here represented using Excel):
Now I would like to delete all duplicates (1) in a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated for boolean mask and then set NaNs by loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicated rows based on column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
'B': [1,2,1,3],
'A':[1,5,7,9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
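# alternative: drop the duplicated rows instead (run on the original df, before the NaN assignment)
df = df.drop_duplicates(subset=['B'])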
print (df)
A B
0 1 1
1 5 2
3 9 3
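The two approaches do different things, which is easy to miss when they are run one after the other on the same df. A small sketch that keeps them apart, using the sample data above (masked and deduped are illustrative names only):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9], 'B': [1, 2, 1, 3]})

# Option 1: keep every row, but blank out repeated values in B.
masked = df.copy()
masked['B'] = masked['B'].mask(masked['B'].duplicated())
print(masked['B'].isna().tolist())  # [False, False, True, False]

# Option 2: drop the rows whose B value has already appeared.
deduped = df.drop_duplicates(subset=['B'])
print(deduped.index.tolist())  # [0, 1, 3]
```

With option 1 the column becomes float because NaN is inserted; with option 2 the dtype stays integer but the row with index 2 is gone.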

create a dataframe from a list of length-unequal lists

I am trying to convert a list like this:
l = [[1, 2, 3, 17], [4, 19], [5]]
into a dataframe that has each number as the index and the position of its sublist as the value.
For example, 19 is in the second list, so I expect to get a row with "19" as the index and "1" as the value, and so on.
I managed to get it (see the boilerplate below), but I guess there is a simpler way:
>>> df=pd.DataFrame(l)
>>> df=df.unstack().reset_index(level=0,drop=True)
>>> df=df[df.notnull()==True] # remove NaN rows
>>> df=pd.DataFrame(df)
>>> df = df.reset_index().set_index(0)
>>> print(df)
index
0
1 0
4 1
5 2
2 0
19 1
3 0
17 0
Thanks in advance.
In [52]: pd.DataFrame([(item, i) for i, seq in enumerate(l)
for item in seq]).set_index(0)
Out[52]:
1
0
1 0
2 0
3 0
17 0
4 1
19 1
5 2
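If a Series is enough, the same idea can be written as a dict comprehension. This is a sketch of a variation, not the answer above restated: it assumes every number appears in only one sublist, because a dict keeps just the last position seen for a repeated key. The name positions is hypothetical.

```python
import pandas as pd

l = [[1, 2, 3, 17], [4, 19], [5]]

# Map each item to the position of the sublist that contains it.
positions = pd.Series({item: i for i, seq in enumerate(l) for item in seq})
print(positions[19])  # 1

# Convert to a one-column DataFrame if that shape is needed.
df = positions.to_frame('position')
print(df.index.tolist())  # [1, 2, 3, 17, 4, 19, 5]
```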