Drop largest element within level in Pandas

I'm trying to remove outliers in my data by dropping the largest element within an index level.
import pandas as pd
index = pd.MultiIndex.from_product([['A','B'],range(3)],names=['Letters','Numbers'])
s = pd.Series([0,2,1,2,0,2], index=index)
s
Out:
Letters  Numbers
A        0          0
         1          2
         2          1
B        0          2
         1          0
         2          2
dtype: int64
s.groupby('Letters').nlargest(-1)
Expected output:
Letters  Numbers
A        0          0
         2          1
B        1          0
         2          2
dtype: int64

Your solution should be changed by passing the group_keys=False parameter to Series.groupby, and then using Series.drop with the resulting index values:
s = s.drop(s.groupby('Letters', group_keys=False).nlargest(1).index)
print(s)
Letters  Numbers
A        0          0
         2          1
B        1          0
         2          2
dtype: int64

You can use idxmax and drop:
s.drop(s.groupby('Letters').idxmax())
# or
# s.drop(s.groupby(level=0).idxmax())
Output:
Letters  Numbers
A        0          0
         2          1
B        1          0
         2          2
dtype: int64
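If ties matter, note that both approaches drop only one row per group even when several rows share the maximum. A minimal mask-based sketch that instead drops every row tied for the group maximum (assuming that is the desired outlier behaviour):
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], range(3)],
                                   names=['Letters', 'Numbers'])
s = pd.Series([0, 2, 1, 2, 0, 2], index=index)

# transform('max') broadcasts each group's maximum back to the original
# shape, so the comparison keeps only rows strictly below that maximum.
# Unlike nlargest(1)/idxmax, this drops *all* rows tied for the max,
# e.g. both 2s in group B.
out = s[s < s.groupby('Letters').transform('max')]
print(out)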

Related

change all values in a dataframe with other values from another dataframe

I just started learning pandas.
I have 2 dataframes.
The first one is
   val  num
0    1    0
1    2    1
2    3    2
3    4    3
4    5    4
and the second one is
   0  1  2  3
0  1  2  3  4
1  5  3  2  2
2  2  5  3  2
I want to change my second dataframe so that each value in it is compared against the val column of the first dataframe, and every matching value is replaced with the corresponding value from the num column of dataframe 1. In the end I need to get the following dataframe:
   0  1  2  3
0  0  1  2  3
1  4  2  1  1
2  1  4  2  1
How do I do that in pandas?
You can use DataFrame.replace() to do this:
df2.replace(df1.set_index('val')['num'])
Explanation:
The first step is to set the val column of the first DataFrame as the index. This will change how the matching is performed in the third step.
Convert the first DataFrame to a Series by selecting the num column; after step 1, the val values are the index. It looks like this:
val
1    0
2    1
3    2
4    3
5    4
Name: num, dtype: int64
Next, use DataFrame.replace() to do the replacement in the second DataFrame. It looks up each value from the second DataFrame, finds a matching index in the Series, and replaces it with the value from the Series.
Full reproducible example:
import pandas as pd
import io
s = """ val num
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4"""
df1 = pd.read_csv(io.StringIO(s), delim_whitespace=True)
s = """ 0 1 2 3
0 1 2 3 4
1 5 3 2 2
2 2 5 3 2"""
df2 = pd.read_csv(io.StringIO(s), delim_whitespace=True)
print(df2.replace(df1.set_index('val')['num']))
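Note that recent pandas versions deprecate delim_whitespace=True; if that warning appears, sep=r'\s+' is the equivalent:
df1 = pd.read_csv(io.StringIO(s), sep=r'\s+')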
Create the mapping dict, then replace:
mpd = dict(zip(df1.val, df1.num))
df2.replace(mpd, inplace=True)
   0  1  2  3
0  0  1  2  3
1  4  2  1  1
2  1  4  2  1
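For completeness, a map-based sketch of the same idea, applied column by column; unlike replace, map turns values missing from the dict into NaN, so fillna is added as a safety net:
import pandas as pd

df1 = pd.DataFrame({'val': [1, 2, 3, 4, 5], 'num': [0, 1, 2, 3, 4]})
df2 = pd.DataFrame([[1, 2, 3, 4], [5, 3, 2, 2], [2, 5, 3, 2]])

mpd = dict(zip(df1.val, df1.num))

# map() applies the dict to each column; values missing from the dict
# become NaN, and fillna(df2) restores them unchanged.
out = df2.apply(lambda col: col.map(mpd)).fillna(df2).astype(int)
print(out)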

How to compare 2 dataframes columns and add a value to a new dataframe based on the result

I have 2 dataframes of the same length, and I'd like to compare a specific column between them. Whichever dataframe has the bigger value in its first column, I'd like to take the value of its second column and assign it to a new dataframe.
See example. The first dataframe:
     0  class
0  1.9      0
1  9.8      0
2  4.5      0
3  8.1      0
4  1.9      0
The second dataframe:
     0  class
0  1.4      1
1  7.8      1
2  8.5      1
3  9.1      1
4  3.9      1
The new dataframe should look like:
   class
0      0
1      0
2      1
3      1
4      1
Use numpy.where with the DataFrame constructor:
import numpy as np

df = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
Or DataFrame.where:
df = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print(df)
   class
0      0
1      0
2      1
3      1
4      1
EDIT:
If there is a third condition to handle (equal values), use numpy.select, and numpy.isclose if the floats need tolerance-based comparison:
print(df2)
     0  class
0  1.4      1
1  7.8      1
2  8.5      1
3  9.1      1
4  1.9      1
masks = [df1[0] == df2[0], df1[0] > df2[0]]
#if need compare floats in some accuracy
#masks = [np.isclose(df1[0], df2[0]), df1[0] > df2[0]]
vals = ['not_determined', df1['class']]
df = pd.DataFrame({'class': np.select(masks, vals, df2['class'])})
print(df)
            class
0               0
1               0
2               1
3               1
4  not_determined
Or:
masks = [df1[0] == df2[0], df1[0] > df2[0]]
vals = ['not_determined', 1]
df = pd.DataFrame({'class': np.select(masks, vals, 1)})
print(df)
            class
0               0
1               0
2               1
3               1
4  not_determined
An out-of-the-box solution using np.sign:
df = np.sign(df1[0].sub(df2[0])).map({1:0, -1:1, 0:'not_determined'}).to_frame('class')
print(df)
            class
0               0
1               0
2               1
3               1
4  not_determined
Since class is 0 and 1, you could try:
df1[0].lt(df2[0]).astype(int)
For generic solutions, check jezrael's answer.
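A minimal sketch of how that boolean trick expands to a full frame, assuming the example data from the question (to_frame('class') is just one convenient wrapper):
import pandas as pd

df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': 0})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 3.9], 'class': 1})

# lt() is True where df2's value wins; casting the booleans to int then
# yields exactly the right labels, because df1's class is 0 and df2's is 1.
out = df1[0].lt(df2[0]).astype(int).to_frame('class')
print(out)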
Try this one:
>>> import numpy as np
>>> import pandas as pd
>>> df_1
     0  class
0  1.9      0
1  9.8      0
2  4.5      0
3  8.1      0
4  1.9      0
>>> df_2
     0  class
0  1.4      1
1  7.8      1
2  8.5      1
3  9.1      1
4  3.9      1
>>> df_3=pd.DataFrame()
>>> df_3["class"]=np.where(df_1["0"]>df_2["0"], df_1["class"], df_2["class"])
>>> df_3
   class
0      0
1      0
2      1
3      1
4      1

pandas rename_axis not taking inplace argument

I have the following code. I thought df would have the index name INDEX at the end, given that I set the inplace argument, but that's not the case. What am I missing? Or is it a bug in Pandas?
>>> df = pd.DataFrame([[1,2],[3,4]])
>>> df
   0  1
0  1  2
1  3  4
>>> df.rename_axis('INDEX', inplace=True)
       0  1
INDEX
0      1  2
1      3  4
>>> df
   0  1
0  1  2
1  3  4
>>>
It should be straightforward. I tried it myself and ran into the same issue, so, congratulations, you've found a bug in Pandas.
Here is what you can do with your code for now:
>>> df = df.rename_axis('INDEX')
>>> df
       0  1
INDEX
0      1  2
1      3  4
>>>
In case you're interested to contribute to Pandas: https://pandas.pydata.org/pandas-docs/stable/contributing.html#bug-reports-and-enhancement-requests
It seems to be a bug. Instead of rename_axis you can use index.rename with the inplace=True parameter.
df.index.rename('INDEX', inplace=True)
Output of df:
       0  1
INDEX
0      1  2
1      3  4
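A minimal sketch putting both workarounds side by side, for copy-pasting:
import pandas as pd

# Workaround 1: reassign the result instead of relying on inplace=True.
df = pd.DataFrame([[1, 2], [3, 4]])
df = df.rename_axis('INDEX')

# Workaround 2: rename the index object directly; this does work in place.
df2 = pd.DataFrame([[1, 2], [3, 4]])
df2.index.rename('INDEX', inplace=True)

assert df.index.name == df2.index.name == 'INDEX'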

how to insert a new integer index ipython pandas

I made a value-count dataframe from another dataframe, for example:
         freq
0           2
0.33333    10
1.66667    13
Automatically, its indexes are 0, 0.33333, 1.66667, and the indexes can vary, because I intend to make many dataframes based on a specific value.
How can I insert an integer index?
like
           freq
0  0          2
1  0.33333   10
2  1.66667   13
thanks
The result you get back from value_counts is a Series, and to set a generic 0 … n-1 integer index, you can use reset_index:
In [4]: s = pd.Series([0,0.3,0.3,1.6])
In [5]: s.value_counts()
Out[5]:
0.3    2
1.6    1
0.0    1
dtype: int64
In [9]: s.value_counts().reset_index(name='freq')
Out[9]:
   index  freq
0    0.3     2
1    1.6     1
2    0.0     1
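If the leftover index column should carry a meaningful name as well, a small follow-up sketch (the name 'value' is an arbitrary choice, and default column names can differ between pandas versions):
import pandas as pd

s = pd.Series([0, 0.3, 0.3, 1.6])

# reset_index moves the counted values into a regular column (named
# 'index' by default here) and installs a fresh 0..n-1 integer index.
df = s.value_counts().reset_index(name='freq').rename(columns={'index': 'value'})
print(df)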

Pandas value_counts(sort=False) with large series doesn't work

By default, Series.value_counts sorts by the count, in descending order:
In [192]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]).value_counts()
Out[192]:
0    10
2     7
1     4
3     1
dtype: int64
If I pass sort=False, it appears to try and sort by the value key instead:
In [193]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]).value_counts(sort=False)
Out[193]:
0    10
1     4
2     7
3     1
dtype: int64
However, when I increase the length of the series, the ordering reverts to the original order:
In [194]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]*100).value_counts(sort=False)
Out[194]:
0    1000
2     700
1     400
3     100
dtype: int64
Any ideas what's going on here?
This is correct. You asked .value_counts() not to sort the result, so it doesn't. Below I emulate what sort=True actually does, which is simply a sort_values. If you don't sort, you get the counts in whatever order the underlying hash table produced, which is arbitrary.
In [39]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]).value_counts(sort=False).sort_values(ascending=False)
Out[39]:
0    10
2     7
1     4
3     1
dtype: int64
In [40]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]*100).value_counts(sort=False).sort_values(ascending=False)
Out[40]:
0    1000
2     700
1     400
3     100
dtype: int64
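If the by-value ordering that the short series happened to show is what was wanted, a minimal sketch that asks for it explicitly instead of relying on hash-table luck:
import pandas as pd

s = pd.Series([3, 0, 2, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
               2, 2, 2, 2, 2, 0, 0, 2] * 100)

# Sort the counts by their index (the original values) explicitly,
# rather than depending on the hash table's insertion order.
print(s.value_counts(sort=False).sort_index())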