How to plot aggregated DataFrame using two columns? - pandas

I have the following, using a DF that has two columns that I would like to aggregate by:
df2.groupby(['airline_clean','sentiment']).size()
airline_clean  sentiment
americanair    -1         14
                0         36
                1       1804
                2        722
                3        171
                4          1
jetblue        -1          2
                0          7
                1       1074
                2        868
                3        250
                4         11
southwestair   -1          4
                0         20
                1       1320
                2        829
                3        237
                4          4
united         -1          7
                0         74
                1       2467
                2       1026
                3        221
                4          5
usairways      -1          5
                0         62
                1       1962
                2        716
                3        155
                4          2
virginamerica  -1          2
                0          2
                1        250
                2        180
                3         69
dtype: int64
Plotting the aggregated view:
dfc=df2.groupby(['airline_clean','sentiment']).size()
dfc.plot(kind='bar', stacked=True,figsize=(18,6))
Results in:
I would like to change two things:
plot the data in a stacked chart by airline
using % instead of raw numbers (by airline as well)
I am not sure how to achieve that. Any direction is appreciated.

The best way to plot this dataset is to convert to % values first and use unstack() for plotting:
airline_sentiment = df3.groupby(['airline_clean', 'sentiment']).agg({'tweet_count': 'sum'})
airline = df3.groupby(['airline_clean']).agg({'tweet_count': 'sum'})
p = airline_sentiment.div(airline, level='airline_clean') * 100
p.unstack().plot(kind='bar',stacked=True,figsize=(9, 6),title='Sentiment % distribution by airline')
This results in a nice chart:
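Starting from the question's own size() aggregation (the answer above assumes a df3 with a tweet_count column), the per-airline percentages can be sketched like this; the sample frame is made up for illustration:

```python
import pandas as pd

# Made-up sample standing in for the question's df2.
df2 = pd.DataFrame({
    'airline_clean': ['jetblue'] * 3 + ['united'] * 3,
    'sentiment':     [1, 2, 2, 1, 1, 3],
})

counts = df2.groupby(['airline_clean', 'sentiment']).size()

# Divide each (airline, sentiment) count by that airline's total;
# level='airline_clean' aligns the division on the outer index level.
totals = counts.groupby(level='airline_clean').sum()
pct = counts.div(totals, level='airline_clean') * 100

# unstack() moves 'sentiment' into columns, so each airline becomes one
# stacked bar; with matplotlib installed the chart is drawn by:
# pct.unstack().plot(kind='bar', stacked=True, figsize=(9, 6),
#                    title='Sentiment % distribution by airline')
```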

Related

pandas dataframe and how to find an element using row and column

Is there a way to find an element in a pandas DataFrame by using the row and column values? For example, if we have a list, L = [0,3,2,3,2,4,30,7], we can use L[2] and get the value 2 in return.
Use .iloc
df = pd.DataFrame({'L':[0,3,2,3,2,4,30,7], 'M':[10,23,22,73,72,14,130,17]})
    L    M
0   0   10
1   3   23
2   2   22
3   3   73
4   2   72
5   4   14
6  30  130
7   7   17
df.iloc[2]['L']    # chained: positional row, then column label
df.iloc[2:3, 0:1]  # slice: returns a 1x1 DataFrame, not a scalar
df.iat[2, 0]       # fastest single-scalar access
2
df.iloc[6]['M']
df.iloc[6:7, 1:2]
df.iat[6, 1]
130
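For completeness, a small sketch of how the positional accessors above differ from their label-based counterparts (.at/.loc). With the default RangeIndex the two coincide, but after sorting they do not:

```python
import pandas as pd

df = pd.DataFrame({'L': [0, 3, 2, 3, 2, 4, 30, 7],
                   'M': [10, 23, 22, 73, 72, 14, 130, 17]})

pos = df.iat[2, 0]    # by integer position: third row, first column
lab = df.at[2, 'L']   # by label: index label 2, column name 'L'
# Same value here, because the default RangeIndex makes labels equal positions.

shuffled = df.sort_values('L')           # index labels no longer match positions
first_by_position = shuffled.iat[0, 0]   # the smallest L value
still_by_label = shuffled.at[2, 'L']     # the row originally labelled 2
```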

Winsorize within groups of dataframe

I have a dataframe like this:
df = pd.DataFrame([[1,2],
[1,4],
[1,5],
[2,65],
[2,34],
[2,23],
[2,45]], columns = ['label', 'score'])
Is there an efficient way to create a column score_winsor that winsorises the score column within the groups at the 1% level?
I tried this with no success:
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: max(x.quantile(.01), min(x, x.quantile(.99))))
You could use SciPy's implementation, scipy.stats.mstats.winsorize:
df["score_winsor"] = df.groupby('label')['score'].transform(lambda row: winsorize(row, limits=[0.01,0.01]))
Output
>>> df
   label  score  score_winsor
0      1      2             2
1      1      4             4
2      1      5             5
3      2     65            65
4      2     34            34
5      2     23            23
6      2     45            45
This works (the scipy version above changed nothing because, with so few rows per group, 1% of the observations rounds down to zero elements to clip; clipping to the interpolated quantiles does change the values):
df['score_winsor'] = df.groupby('label')['score'].transform(lambda x: np.maximum(x.quantile(.01), np.minimum(x, x.quantile(.99))))
Output
print(df.to_string())
   label  score  score_winsor
0      1      2          2.04
1      1      4          4.00
2      1      5          4.98
3      2     65         64.40
4      2     34         34.00
5      2     23         23.33
6      2     45         45.00
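A self-contained sketch of the quantile approach: Series.clip is equivalent to the np.maximum/np.minimum pair and reads a little more directly. The 1%/99% limits follow the question:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [1, 5],
                   [2, 65], [2, 34], [2, 23], [2, 45]],
                  columns=['label', 'score'])

def winsorize_group(s, lower=0.01, upper=0.99):
    # Clip each value into the [1st, 99th] percentile range of its own group.
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

df['score_winsor'] = df.groupby('label')['score'].transform(winsorize_group)
```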

Finding the maximum value for each column and the corresponding value of a common column

I am trying to get the maximum value from each column in a dataframe with their time that they occur.
l = [[1,6,2,6,7],[2,66,2,6,8],[3,44,2,44,8],[4,5,35,6,8],[5,3,9,6,95]]
dft = pd.DataFrame(l, columns=['Time','25','50','75','100'])
max_t = pd.DataFrame()
max_t['Max_f'] = dft.loc[:, ['25','50','75','100']].max(axis=0)
max_t
I managed to get the maximum value in each column, however, I could not figure out how to get the time.
IIUC:
In [48]: dft
Out[48]:
   Time  25  50  75  100
0     1   6   2   6    7
1     2  66   2   6    8
2     3  44   2  44    8
3     4   5  35   6    8
4     5   3   9   6   95
In [49]: dft.set_index('Time').agg(['max','idxmax']).T
Out[49]:
     max  idxmax
25    66       2
50    35       4
75    44       3
100   95       5
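The agg(['max','idxmax']).T one-liner can be unpacked into separate calls, which may make the mechanics clearer; this is just a restatement of the answer above:

```python
import pandas as pd

l = [[1, 6, 2, 6, 7], [2, 66, 2, 6, 8], [3, 44, 2, 44, 8],
     [4, 5, 35, 6, 8], [5, 3, 9, 6, 95]]
dft = pd.DataFrame(l, columns=['Time', '25', '50', '75', '100'])

indexed = dft.set_index('Time')  # so idxmax reports a Time, not a row number
max_t = pd.DataFrame({'max': indexed.max(),        # column maxima
                      'idxmax': indexed.idxmax()}) # Time at which each occurs
```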

Set value from another dataframe

Having a data frame exex as
    EXEX   I   J
1    702   2   3
2   3112   2   4
3   1360   2   5
4    702   3   2
5    221   3   5
6    591   3  11
7   3112   4   2
8    394   4   5
9   3416   4  11
10  1360   5   2
11   221   5   3
12   394   5   4
13   108   5  11
14   591  11   3
15  3416  11   4
16   108  11   5
is there a more efficient pandas approach to updating an existing dataframe df of zeros, setting the value at row exex.I and column exex.J to exex.EXEX? Is there also a way to update the data by specifying the labels instead of the positional row index? If the name fields change, positional indexing would point at different rows and could give an erroneous result.
I get it by:
df = pd.DataFrame(0, index=range(1, 908), columns=range(1, 908))
for index, row in exex12.iterrows():
    df.set_value(row[1], row[2], row[0])  # set_value is deprecated; df.at[...] = ... is the modern equivalent
Assign to df.values
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
   1     2    3     4     5
1  0     0    0     0     0
2  0     0  702  3112  1360
3  0   702    0     0   221
4  0  3112    0     0   394
5  0  1360  221   394     0
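An alternative sketch: DataFrame.pivot builds the I-by-J matrix in one step, and reindex fills out the full 1..907 grid with zeros. The column names and the 908 bound come from the question; the three sample rows here are made up:

```python
import pandas as pd

# Small stand-in for the question's exex frame.
exex = pd.DataFrame({'EXEX': [702, 3112, 702],
                     'I':    [2, 2, 3],
                     'J':    [3, 4, 2]})

full = range(1, 908)
df = (exex.pivot(index='I', columns='J', values='EXEX')  # I rows, J columns
          .reindex(index=full, columns=full)             # full 907x907 grid
          .fillna(0)                                     # missing pairs become 0
          .astype(int))
```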

Rank multiple columns in pandas

I have a dataset of a series with values that I want to replace by their index. The second column contains the same numbers as the first column, but in a different order.
here's an example:
>>> df
ind   u   v    d
0     5   7  151
1     7  20  151
2     8  40  151
3    20   5  151
this should turn out to:
>>> df
ind   u   v    d
0     1   2  151
1     2   4  151
2     3   5  151
3     4   1  151
I reindexed the values in column 'u' by creating a new column:
>>>df['new_index'] = range(1, len(numbers) + 1)
but how do I now replace values of the second column referring to the indexes?
Thanks for any advice!
You can use Series.rank, but you first need to create a single Series with unstack (so both columns are ranked against the pooled values) and then rebuild the DataFrame columns with unstack again:
df[['u','v']] = df[['u','v']].unstack().rank(method='dense').astype(int).unstack(0)
print (df)
     u  v    d
ind
0    1  2  151
1    2  4  151
2    3  5  151
3    4  1  151
If you use only DataFrame.rank, each column is ranked separately, so the output in v is different:
df[['u','v']] = df[['u','v']].rank(method='dense').astype(int)
print (df)
     u  v    d
ind
0    1  2  151
1    2  3  151
2    3  4  151
3    4  1  151
3 4 1 151
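A runnable version of the accepted approach, rebuilding the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'u': [5, 7, 8, 20],
                   'v': [7, 20, 40, 5],
                   'd': [151] * 4})

# unstack() stacks u and v into one Series, rank(method='dense') ranks the
# pooled values, and unstack(0) splits them back into the two columns.
df[['u', 'v']] = (df[['u', 'v']].unstack()
                                .rank(method='dense')
                                .astype(int)
                                .unstack(0))
```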