Rank multiple columns in pandas

I have a dataset with values that I want to replace by their index (rank). The second column contains the same numbers as the first column, but in a different order.
Here's an example:
>>> df
      u   v    d
ind
0     5   7  151
1     7  20  151
2     8  40  151
3    20   5  151
This should turn into:
>>> df
     u  v    d
ind
0    1  2  151
1    2  4  151
2    3  5  151
3    4  1  151
I reindexed the values in column 'u' by creating a new column:
>>> df['new_index'] = range(1, len(df) + 1)
But how do I now replace the values of the second column with the corresponding indexes?
Thanks for any advice!

You can use Series.rank, but first create a single Series with unstack, and at the end rebuild the DataFrame with unstack again:
df[['u','v']] = df[['u','v']].unstack().rank(method='dense').astype(int).unstack(0)
print (df)
     u  v    d
ind
0    1  2  151
1    2  4  151
2    3  5  151
3    4  1  151
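For intuition, here is the same pipeline split into its three steps; a minimal sketch using the df above (the variable names s and r are mine):
s = df[['u','v']].unstack()              # MultiIndex Series: (column, ind) -> value
r = s.rank(method='dense').astype(int)   # one dense ranking over the pooled values 5, 7, 8, 20, 40
df[['u','v']] = r.unstack(0)             # pivot the column level back out to columns
Because both columns are ranked together, the same number always maps to the same rank in u and v.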
If you use only DataFrame.rank, each column is ranked independently, so the output in v is different (40 gets rank 4 instead of 5):
df[['u','v']] = df[['u','v']].rank(method='dense').astype(int)
print (df)
     u  v    d
ind
0    1  2  151
1    2  3  151
2    3  4  151
3    4  1  151

Related

pandas dataframe and how to find an element using row and column

Is there a way to find an element in a pandas DataFrame by using the row and column values? For example, if we have a list, L = [0,3,2,3,2,4,30,7], we can use L[2] and get the value 2 in return.
Use .iloc
df = pd.DataFrame({'L':[0,3,2,3,2,4,30,7], 'M':[10,23,22,73,72,14,130,17]})
    L    M
0   0   10
1   3   23
2   2   22
3   3   73
4   2   72
5   4   14
6  30  130
7   7   17
df.iloc[2]['L']    # positional row, then label lookup (chained indexing)
df.iloc[2:3, 0:1]  # positional slice, returns a 1x1 DataFrame
df.iat[2, 0]       # fast scalar access by position
2
df.iloc[6]['M']
df.iloc[6:7, 1:2]
df.iat[6, 1]
130
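For completeness, the label-based counterparts are .loc and .at; a minimal sketch (they match the positional calls here only because the default RangeIndex happens to equal the positions):
df.loc[2, 'L']   # scalar lookup by index label and column name -> 2
df.at[2, 'L']    # fast scalar access by label -> 2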

Keep the second entry in a dataframe

I am showing you below an example dataset and the desired output.
ID  number
1   50
1   49
1   48
2   47
2   40
2   31
3   60
3   51
3   42
Example output:
1   49
2   40
3   51
I want to keep the second entry for every group in my dataset. I have already grouped them by ID, but now I want, for each ID, to keep the second entry and remove the remaining duplicates of that ID.
Use GroupBy.nth with 1 for the second row, because Python counts from 0:
df1 = df.groupby('ID', as_index=False).nth(1)
print (df1)
   ID  number
1   1      49
4   2      40
7   3      51
Another solution with GroupBy.cumcount as a counter, filtering by boolean indexing:
df1 = df[df.groupby('ID').cumcount() == 1]
Details:
print (df.groupby('ID').cumcount())
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
dtype: int64
EDIT: Solution for the second maximal value - first sort, then take the second row (values have to be unique per group):
df = (df.sort_values(['ID','number'], ascending=[True, False])
        .groupby('ID', as_index=False)
        .nth(1))
print (df)
   ID  number
1   1      49
4   2      40
7   3      51
If you want the second maximal value when duplicates exist, add DataFrame.drop_duplicates:
print (df)
   ID  number
0   1      50 <- first max
1   1      50 <- first max
2   1      48 <- second max
3   2      47
4   2      40
5   2      31
6   3      60
7   3      51
8   3      42
df3 = (df.drop_duplicates(['ID','number'])
         .sort_values(['ID','number'], ascending=[True, False])
         .groupby('ID', as_index=False)
         .nth(1))
print (df3)
   ID  number
2   1      48
4   2      40
7   3      51
If that is the case, we can use duplicated + drop_duplicates:
df = df[df.duplicated('ID')].drop_duplicates('ID')
   ID  number
1   1      49
4   2      40
7   3      51
Flexible solution with cumcount:
df[df.groupby('ID').cumcount() == 1].copy()
   ID  number
1   1      49
4   2      40
7   3      51
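One caveat worth noting (my addition, not part of the answers above): both GroupBy.nth(1) and the cumcount mask silently drop any group with fewer than two rows, because there is no second entry to return. A minimal sketch:
df_small = pd.DataFrame({'ID': [1, 1, 2], 'number': [50, 49, 47]})
print (df_small[df_small.groupby('ID').cumcount() == 1])
   ID  number
1   1      49
ID 2 has a single row, so it contributes nothing to the result.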

Winsorize within groups of dataframe

I have a dataframe like this:
df = pd.DataFrame([[1, 2],
                   [1, 4],
                   [1, 5],
                   [2, 65],
                   [2, 34],
                   [2, 23],
                   [2, 45]], columns=['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
# fails: the built-in min/max cannot compare a Series with a scalar element-wise
You could use scipy's implementation of winsorize:
from scipy.stats.mstats import winsorize
df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01, 0.01]))
Output (note that with only 3-4 rows per group, a 1% limit trims zero elements, so the scores come back unchanged):
>>> df
   label  score  score_winsor
0      1      2             2
1      1      4             4
2      1      5             5
3      2     65            65
4      2     34            34
5      2     23            23
6      2     45            45
This works:
import numpy as np
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
   label  score  score_winsor
0      1      2          2.04
1      1      4          4.00
2      1      5          4.98
3      2     65         64.40
4      2     34         34.00
5      2     23         23.33
6      2     45         45.00
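A third option, not in the answers above: Series.clip does the same two-sided capping as the np.maximum/np.minimum pair and reads a little more naturally; a minimal sketch:
df['score_winsor'] = df.groupby('label')['score'].transform(
    lambda x: x.clip(lower=x.quantile(.01), upper=x.quantile(.99)))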

merging 3 dataframes on condition

I have a dataframe df:
id  value
1     100
2     200
3     500
4     600
5     700
6     800
I have another dataframe df2:
c_id  flag
2     Y
3     Y
5     Y
Similarly, df3:
c_id  flag
1     N
3     Y
4     Y
I want to merge these 3 dataframes and create a column in df such that my df looks like:
id  value  flag
1     100     N
2     200     Y
3     500     Y
4     600     Y
5     700     Y
6     800   NaN
I DON'T WANT to concatenate df2 and df3, e.g.:
final = pd.concat([df2, df3], ignore_index=False)
final.drop_duplicates(inplace=True)
I don't want to use this method; is there any other way?
Using pd.merge between df and the combined df2+df3:
In [1150]: df.merge(df2.append(df3), left_on=['id'], right_on=['c_id'], how='left')
Out[1150]:
   id  value  c_id flag
0   1    100   1.0    N
1   2    200   2.0    Y
2   3    500   3.0    Y
3   3    500   3.0    Y
4   4    600   4.0    Y
5   5    700   5.0    Y
6   6    800   NaN  NaN
Details (id 3 appears in both df2 and df3, which is why the merge above duplicates that row; dropping duplicates from the combined frame would avoid it):
In [1151]: df2.append(df3)
Out[1151]:
   c_id flag
0     2    Y
1     3    Y
2     5    Y
0     1    N
1     3    Y
2     4    Y
Using map, you could:
In [1140]: df.assign(flag=df.id.map(
               df2.set_index('c_id')['flag'].combine_first(
                   df3.set_index('c_id')['flag']))
           )
Out[1140]:
   id  value flag
0   1    100    N
1   2    200    Y
2   3    500    Y
3   4    600    Y
4   5    700    Y
5   6    800  NaN
Let me explain: set_index and combine_first create a mapping from id to flag (df2 takes priority where a c_id appears in both):
In [1141]: mapping = df2.set_index('c_id')['flag'].combine_first(
               df3.set_index('c_id')['flag'])
In [1142]: mapping
Out[1142]:
c_id
1    N
2    Y
3    Y
4    Y
5    Y
Name: flag, dtype: object
In [1143]: df.assign(flag=df.id.map(mapping))
Out[1143]:
   id  value flag
0   1    100    N
1   2    200    Y
2   3    500    Y
3   4    600    Y
4   5    700    Y
5   6    800  NaN
Merge on both df2 and df3:
df = pd.merge(pd.merge(df, df2, left_on='id', right_on='c_id', how='left'),
              df3, left_on='id', right_on='c_id', how='left')
Fill nulls (the two merges leave flag_x and flag_y columns):
df['flag'] = df['flag_x'].fillna(df['flag_y'])
Delete the helper columns:
df = df.drop(columns=['flag_x', 'flag_y', 'c_id_x', 'c_id_y'])
Or you could just append:
df4 = df2.append(df3)
pd.merge(df, df4, how='left', left_on='id', right_on='c_id')
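Note that DataFrame.append was deprecated and removed in pandas 2.0; pd.concat is the drop-in replacement for the append-based snippets above:
df4 = pd.concat([df2, df3], ignore_index=True)
df.merge(df4, left_on='id', right_on='c_id', how='left')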

How to plot aggregated DataFrame using two columns?

I have the following, using a DF that has two columns that I would like to aggregate by:
df2.groupby(['airline_clean','sentiment']).size()
airline_clean  sentiment
americanair    -1            14
                0            36
                1          1804
                2           722
                3           171
                4             1
jetblue        -1             2
                0             7
                1          1074
                2           868
                3           250
                4            11
southwestair   -1             4
                0            20
                1          1320
                2           829
                3           237
                4             4
united         -1             7
                0            74
                1          2467
                2          1026
                3           221
                4             5
usairways      -1             5
                0            62
                1          1962
                2           716
                3           155
                4             2
virginamerica  -1             2
                0             2
                1           250
                2           180
                3            69
dtype: int64
Plotting the aggregated view:
dfc=df2.groupby(['airline_clean','sentiment']).size()
dfc.plot(kind='bar', stacked=True,figsize=(18,6))
Results in a chart with one bar per (airline, sentiment) pair, not stacked by airline.
I would like to change two things:
- plot the data in a stacked chart by airline
- use % instead of raw numbers (by airline as well)
I am not sure how to achieve that. Any direction is appreciated.
The best way to plot this dataset is to convert to % values first and use unstack() for plotting:
# df3 is the raw tweet-level frame; tweet_count holds the count per tweet row
airline_sentiment = df3.groupby(['airline_clean', 'sentiment']).agg({'tweet_count': 'sum'})
airline = df3.groupby(['airline_clean']).agg({'tweet_count': 'sum'})
# divide each (airline, sentiment) total by the airline total, aligned on the airline level
p = airline_sentiment.div(airline, level='airline_clean') * 100
p.unstack().plot(kind='bar', stacked=True, figsize=(9, 6), title='Sentiment % distribution by airline')
This results in a nice stacked percentage chart.
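If you only have the size() counts from the question (no tweet_count column), the same percentages can be computed directly from the grouped Series; a minimal sketch, assuming df2 from the question:
counts = df2.groupby(['airline_clean', 'sentiment']).size()
totals = counts.groupby(level='airline_clean').transform('sum')  # per-airline totals, aligned to counts
p = counts / totals * 100
p.unstack().plot(kind='bar', stacked=True, figsize=(9, 6),
                 title='Sentiment % distribution by airline')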