Warning with loc function with pandas dataframe

Warning with loc function with pandas dataframe - pandas

While working on a SO Question i came across a warning error using with loc, precise details are as belows:
DataFrame Samples:
First dataFrame df1 :
>>> data1 = {'Sample': ['Sample_A','Sample_D', 'Sample_E'],
... 'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
... 'Year':[2012, 2014, 2015]}
>>> df1 = pd.DataFrame(data1)
>>> df1.set_index('Sample')
Location Year
Sample
Sample_A Bangladesh 2012
Sample_D Myanmar 2014
Sample_E Thailand 2015
Second dataframe df2:
>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
... 'Sample_A': [0,1,0,0,1],
... 'Sample_B':[0,0,1,0,0],
... 'Sample_C':[1,0,0,0,1],
... 'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num')
Sample_A Sample_B Sample_C Sample_D
Num
Value_1 0 0 1 0
Value_2 1 0 0 0
Value_3 0 1 0 1
Value_4 0 0 0 1
Value_5 1 0 1 0
>>> samples
['Sample_A', 'Sample_D', 'Sample_E']
While i'm taking samples to preserve the column from it as follows it works but at the same time it produce warning ..
>>> df3 = df2.loc[:, samples]
>>> df3
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
Warnings:
indexing.py:1472: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_tuple(key)
Would like to know about to handle this to a better way!

Use reindex like:
df3 = df2.reindex(columns=samples)
print (df3)
Sample_A Sample_D Sample_E
0 0 0 NaN
1 1 0 NaN
2 0 1 NaN
3 0 1 NaN
4 1 0 NaN
Or if want only intersected columns use Index.intersection:
df3 = df2[df2.columns.intersection(samples)]
#alternative
#df3 = df2[np.intersect1d(df2.columns, samples)]
print (df3)
Sample_A Sample_D
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0

Related

How to compare 2 dataframes columns and add a value to a new dataframe based on the result

I have 2 dataframes with the same length, and I'd like to compare specific columns between them. If the value of the first column in one of the dataframe is bigger - i'd like it to take the value in the second column and assign it to a new dataframe.
See example. The first dataframe:
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
The second dataframe:
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
The new dataframe should look like:
class
0 0
1 0
2 1
3 1
4 1

Use numpy.where with DataFrame constructor:
df = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
Or DataFrame.where:
df = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print (df)
class
0 0
1 0
2 1
3 1
4 1
EDIT:
If there is another condition use numpy.select and if necessary numpy.isclose
print (df2)
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 1.9 1
masks = [df1[0] == df2[0], df1[0] > df2[0]]
#if need compare floats in some accuracy
#masks = [np.isclose(df1[0], df2[0]), df1[0] > df2[0]]
vals = ['not_determined', df1['class']]
df = pd.DataFrame({'class': np.select(masks, vals, df2['class'])})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Or:
masks = [df1[0] == df2[0], df1[0] > df2[0]]
vals = ['not_determined', 1]
df = pd.DataFrame({'class': np.select(masks, vals, 1)})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Solution for out of box:
df = np.sign(df1[0].sub(df2[0])).map({1:0, -1:1, 0:'not_determined'}).to_frame('class')
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined

Since class is 0 and 1, you could try,
df1[0].lt(df2[0]).astype(int)
For generic solutions, check jezrael's answer.

Try this one:
>>> import numpy as np
>>> import pandas as pd
>>> df_1
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
>>> df_2
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
>>> df_3=pd.DataFrame()
>>> df_3["class"]=np.where(df_1["0"]>df_2["0"], df_1["class"], df_2["class"])
>>> df_3
class
0 0
1 0
2 1
3 1
4 1

Pandas groupby with MultiIndex columns and different levels

I want to do a groupby on a MultiIndex dataframe, counting the occurrences for each column for every user2 in df:
>>> df
user1 user2 count
0 1 2
a x d a
0 2 6 0 1 0 0
1 4 6 0 0 0 3
2 21 76 2 0 1 0
3 5 18 0 0 0 0
Note that user1 and user2 are at the same level as count (side effect of merging).
Desired output:
user2 count
0 1 2
a x d a
0 6 0 1 0 1
1 76 1 0 0 0
3 18 0 0 0 0
I've tried
>>> df.groupby(['user2','count'])
but I get
ValueError: Grouper for 'count' not 1-dimensional
GENERATOR CODE:
df = pd.DataFrame({'user1':[2,4,21,21],'user2':[6,6,76,76],'param1':[0,2,0,1],'param2':['x','a','a','d'],'count':[1,3,2,1]}, columns=['user1','user2','param1','param2','count'])
df = df.set_index(['user1','user2','param1','param2'])
df = df.unstack([2,3]).sort_index(axis=1).reset_index()
df2 = pd.DataFrame({'user1':[2,5,21],'user2':[6,18,76]})
df2.columns = pd.MultiIndex.from_product([df2.columns, [''],['']])
final_df = df2.merge(df, on=['user1','user2'], how='outer').fillna(0)

IIUC, you want:
final_df.where(final_df>0).groupby('user2').count().drop('user1', axis=1).reset_index()
Output:
user2 count
0 1 2
a x d a
0 6 0 1 0 1
1 18 0 0 0 0
2 76 1 0 1 0
Avoid dropping columns, select only 'count', and changed function to sum:
final_df.where(final_df>0).groupby('user2').sum()[['count']].reset_index()
Output:
user2 count
0 1 2
a x d a
0 6 0.0 1.0 0.0 3.0
1 18 0.0 0.0 0.0 0.0
2 76 2.0 0.0 1.0 0.0
To void dropping user2 equal to zero values also.
final_df[['count']].where(final_df[['count']]>0)\
.groupby(final_df.user2).sum().reset_index()

How to find average of two tables in pandas?

I have one table with 1000s of rows that looks like this:
file1:
apples1 + hate 0 0 0 2 4 6 0 1
apples2 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 2 0 4 4 6 0 2
and another file with same headers in file2 - nb some headers are missing in file1:
apples1 + hate 0 0 0 1 4 6 0 2
apples2 + hate 0 1 0 6 4 6 0 2
apples3 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 1 0 3 4 3 0 1
I want to compare the two files in pandas and average across common columns. I do not want to print columns that are in one file only. So the resultant file would look like:
apples1 + hate 0 0 0 1.5 4 6 0 1.5
apples2 + hate 0 1.5 0 5 4 6 0 2
apples4 + hate 0 2 0 3.5 4 6 0 2

There are two steps in this solution.
concatenate all your dataframes by stacking them vertically (axis=0, the default) using pandas.concat(...) and specifying a join of 'inner' to only maintain columns that in all the dataframes.
call mean(...) function on resultant dataframe.
Example:
In [1]: df1 = pd.DataFrame([[1,2,3], [4,5,6]], columns=['a','b','c'])
In [2]: df2 = pd.DataFrame([[1,2],[3,4]], columns=['a','c'])
In [3]: df1
Out[3]:
a b c
0 1 2 3
1 4 5 6
In [4]: df2
Out[4]:
a c
0 1 2
1 3 4
In [5]: df3 = pd.concat([df1, df2], join='inner')
In [6]: df3
Out[6]:
a c
0 1 3
1 4 6
0 1 2
1 3 4
In [7]: df3.mean()
Out[7]:
a 2.25
c 3.75
dtype: float64

Let's try this:
df1 = pd.read_csv('file1', header=None)
df2 = pd.read_csv('file2', header=None)
Set index to first three columns ie.. "apple1 + hate"
df1 = df1.set_index([0,1,2])
df2 = df2.set_index([0,1,2])
Let's use merge to inner join datafiles on indexes, and the groupby columns with the same name and aggregate with mean:
df1.merge(df2, right_index=True, left_index=True)\
.pipe(lambda x: x.groupby(x.columns.str.extract('(\w+)\_[xy]', expand=False),
axis=1, sort=False).mean()).reset_index()
Output:
0 1 2 3 4 5 6 7 8 9 10
0 apples1 + hate 0.0 0.0 0.0 1.5 4.0 6.0 0.0 1.5
1 apples2 + hate 0.0 1.5 0.0 5.0 4.0 6.0 0.0 2.0
2 apples4 + hate 0.0 1.5 0.0 3.5 4.0 4.5 0.0 1.5

filter dataframe rows based on length of column values

I have a pandas dataframe as follows:
df = pd.DataFrame([ [1,2], [np.NaN,1], ['test string1', 5]], columns=['A','B'] )
df
A B
0 1 2
1 NaN 1
2 test string1 5
I am using pandas 0.20. What is the most efficient way to remove any rows where 'any' of its column values has length > 10?
len('test string1')
12
So for the above e.g., I am expecting an output as follows:
df
A B
0 1 2
1 NaN 1

If based on column A
In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
A B
0 1 2
1 NaN 1
If based on all columns
In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
A B
0 1 2
1 NaN 1

I had to cast to a string for Diego's answer to work:
df = df[df['A'].apply(lambda x: len(str(x)) <= 10)]

In [42]: df
Out[42]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
2 test string1 5 test string1test string1 2017-01-03
In [43]: df.dtypes
Out[43]:
A object
B int64
C object
D datetime64[ns]
dtype: object
In [44]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(1)]
Out[44]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02
Explanation:
df.select_dtypes(['object']) selects only columns of object (str) dtype:
In [45]: df.select_dtypes(['object'])
Out[45]:
A C
0 1 2
1 NaN NaN
2 test string1 test string1test string1
In [46]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10))
Out[46]:
A C
0 False False
1 False False
2 True True
now we can "aggregate" it as follows:
In [47]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)
Out[47]:
0 False
1 False
2 True
dtype: bool
finally we can select only those rows where value is False:
In [48]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[48]:
A B C D
0 1 2 2 2017-01-01
1 NaN 1 NaN 2017-01-02

Use the apply function of series, in order to keep them:
df = df[df['A'].apply(lambda x: len(x) <= 10)]

Pandas: Cannot assign NaN using a boolean mask on MultiIndex columns

First, create this DataFrame:
df = pd.DataFrame([[1,-2,3],[4,5,-6],[-7,8,9]],
columns=pd.MultiIndex.from_tuples(
[('foo', 'bar'), ('foo', 'baz'), ('ignore', 'other')]))
That is:
foo ignore
bar baz other
0 1 -2 3
1 4 5 -6
2 -7 8 9
Now, try to replace the negative values under foo with NAN:
df.foo[df.foo < 0] = np.nan
That doesn't do anything but print a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
OK, let's do that:
df.loc[:,'foo'][df.foo < 0] = np.nan
That doesn't print a warning, but it also does nothing!
But it works if we use a non-NAN value:
df.loc[:,'foo'][df.foo < 0] = 666
Now I have:
foo ignore
bar baz other
0 1 666 3
1 4 5 -6
2 666 8 9
But I want to fill with NAN, not 666. Is there an easy way that works?

You can use slicers with DataFrame.mask:
idx = pd.IndexSlice
sliced = df.loc[:, idx['foo',:]]
print (sliced)
foo
bar baz
0 1 -2
1 4 5
2 -7 8
df.loc[:, idx['foo',:]] = sliced.mask(sliced < 0)
print (df)
foo ignore
bar baz other
0 1.0 NaN 3
1 4.0 5.0 -6
2 NaN 8.0 9
Another solution with concat:
idx = pd.IndexSlice
df1 = df.loc[:, idx['foo',:]]
print (df1)
foo
bar baz
0 1 -2
1 4 5
2 -7 8
df1 = df1.mask(df1 < 0)
print (df1)
foo
bar baz
0 1.0 NaN
1 4.0 5.0
2 NaN 8.0
print (pd.concat([df1, df.drop('foo', axis=1, level=0)], axis=1))
foo ignore
bar baz other
0 1.0 NaN 3
1 4.0 5.0 -6
2 NaN 8.0 9

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Warning with loc function with pandas dataframe - pandas

Related

How to compare 2 dataframes columns and add a value to a new dataframe based on the result

Pandas groupby with MultiIndex columns and different levels

How to find average of two tables in pandas?

filter dataframe rows based on length of column values

Pandas: Cannot assign NaN using a boolean mask on MultiIndex columns

Categories

Resources