Delete row of a dataframe condition - dataframe

here is my first dataframe df1
269 270 271 346
0 1 153.00 2.14 1
1 1 153.21 3.89 2
2 1 153.90 2.02 1
3 1 154.18 3.02 1
4 1 154.47 2.30 1
5 1 154.66 2.73 1
6 1 155.35 2.82 1
7 1 155.70 2.32 1
8 1 220.00 15.50 1
9 0 152.64 1.44 1
10 0 152.04 2.20 1
11 0 150.48 1.59 1
12 0 149.88 1.73 1
13 0 129.00 0.01 1
here is my second dataframe df2
269 270 271 346
0 0 149.88 2.0 1
I would like the row at the index 12 to be remove because they have the same number in columns ['269'] & ['270']

Hope below solutions would match to your requirement
Using anti_join from dplyr
library(dplyr)
anti_join(df1, df2, by = c("269", "270"))
Using %in% operator
df1[!(df1$269 %in% df2$269 & df1$270 %in% df2$270),]

Related

Python - Sort column ascending - using groupby

The following code:
import pandas as pd
df_original=pd.DataFrame({\
'race_num':[1,1,1,2,2,2,2,3,3],\
'race_position':[2,3,0,1,0,0,2,3,0],\
'percentage_place':[77,55,88,50,34,56,99,12,75]
})
Gives an output of:
race_num
race_position
percentage_place
1
2
77
1
3
55
1
0
88
2
1
50
2
0
34
2
0
56
2
2
99
3
3
12
3
0
75
I need to mainpulate this dataframe to keep the race_num grouped but sort the percentage place in ascending order - and the race_position is to stay aligned with the original percentage_place.
Desired out is:
race_num
race_position
percentage_place
1
0
88
1
2
77
1
3
55
2
2
99
2
0
56
2
1
50
2
0
34
3
0
75
3
3
12
My attempt is:
df_new = df_1.groupby(['race_num','race_position'])\['percentage_place'].nlargest().reset_index()
Thank you in advance.
Look into sort_values
In [137]: df_original.sort_values(['race_num', 'percentage_place'], ascending=[True, False])
Out[137]:
race_num race_position percentage_place
2 1 0 88
0 1 2 77
1 1 3 55
6 2 2 99
5 2 0 56
3 2 1 50
4 2 0 34
8 3 0 75
7 3 3 12

Multindex add zero values if no in pandas dataframe

I have a pandas (v.0.23.4) dataframe with multindex('date', 'class').
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
4 7
In '2019-06-30' class 3 is missing because there are no data.
What I want is to add class 3 in the multindex and zero values in the Col_values column automatically.
Use DataFrame.unstack with fill_value=0 with DataFrame.stack:
df = df.unstack(fill_value=0).stack()
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7
Another solution is use DataFrame.reindex with MultiIndex.from_product:
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0)
print (df)
Col_values
date class
2019-04-30 0 324
1 6874
2 44
3 5
4 15
2019-05-31 0 393
1 6534
2 64
3 1
4 22
2019-06-30 0 325
1 5899
2 48
3 0
4 7

What's the problem of this one-hot encoding?

In [4]: data = pd.read_csv('student_data.csv')
In [5]: data[:10]
Out[5]:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
5 1 760 3.00 2
6 1 560 2.98 1
7 0 400 3.08 2
8 1 540 3.39 3
9 0 700 3.92 2
one_hot_data = pd.get_dummies(data['rank'])
# TODO: Drop the previous rank column
data = data.drop('rank', axis=1)
data = data.join(one_hot_data)
# Print the first 10 rows of our data
data[:10]
It always gives an error:
KeyError: 'rank'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-6a749c8f286e> in <module>()
1 # TODO: Make dummy variables for rank
----> 2 one_hot_data = pd.get_dummies(data['rank'])
3
4 # TODO: Drop the previous rank column
5 data = data.drop('rank', axis=1)
If get:
KeyError: 'rank'
it means there is no column rank. Obviously problem is with traling whitespace or encoding.
print (data.columns.tolist())
['admit', 'gre', 'gpa', 'rank']
Your solution should be simplify by DataFrame.pop - it select column and remove from original DataFrame:
data = data.join(pd.get_dummies(data.pop('rank')))
# Print the first 10 rows of our data
print(data[:10])
admit gre gpa 1 2 3 4
0 0 380 3.61 0 0 1 0
1 1 660 3.67 0 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 0 0 0 1
4 0 520 2.93 0 0 0 1
5 1 760 3.00 0 1 0 0
6 1 560 2.98 1 0 0 0
7 0 400 3.08 0 1 0 0
8 1 540 3.39 0 0 1 0
9 0 700 3.92 0 1 0 0
I tried your code and it works fine. You can need to rerun the previous cells which includes loading of the data

Set value from another dataframe

Having a data frame exex as
EXEX I J
1 702 2 3
2 3112 2 4
3 1360 2 5
4 702 3 2
5 221 3 5
6 591 3 11
7 3112 4 2
8 394 4 5
9 3416 4 11
10 1360 5 2
11 221 5 3
12 394 5 4
13 108 5 11
14 591 11 3
15 3416 11 4
16 108 11 5
is there a more efficient pandas approach to update the value of an existing dataframe df of 0 to the value exex.EXEX where the exex.I field is the index and the exex.J field is the column? Is there a way in where to update the data by specifing the name instead of the row index? This is because if the name fields change, the row index would be different and could lead to an erroneous result.
i get it by:
df = pd.DataFrame(0, index = range(1,908), columns=range(1,908))
for index, row in exex12.iterrows():
df.set_value(row[1],row[2],row[0])
Assign to df.values
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 702 3112 1360
3 0 702 0 0 221
4 0 3112 0 0 394
5 0 1360 221 394 0

How to plot aggregated DataFrame using two columns?

I have the following, using a DF that has two columns that I would like to aggregate by:
df2.groupby(['airline_clean','sentiment']).size()
airline_clean sentiment
americanair -1 14
0 36
1 1804
2 722
3 171
4 1
jetblue -1 2
0 7
1 1074
2 868
3 250
4 11
southwestair -1 4
0 20
1 1320
2 829
3 237
4 4
united -1 7
0 74
1 2467
2 1026
3 221
4 5
usairways -1 5
0 62
1 1962
2 716
3 155
4 2
virginamerica -1 2
0 2
1 250
2 180
3 69
dtype: int64
Plotting the aggragated view:
dfc=df2.groupby(['airline_clean','sentiment']).size()
dfc.plot(kind='bar', stacked=True,figsize=(18,6))
Results in:
I would like to change two things:
plot the data in a stacked chart by airline
using % instead of raw numbers (by airline as well)
I am not sure how to achieve that. Any direction is appreciated.
The best way is to plot this dataset is to convert to % values first and use unstack() for plotting:
airline_sentiment = df3.groupby(['airline_clean', 'sentiment']).agg({'tweet_count': 'sum'})
airline = df3.groupby(['airline_clean']).agg({'tweet_count': 'sum'})
p = airline_sentiment.div(airline, level='airline_clean') * 100
p.unstack().plot(kind='bar',stacked=True,figsize=(9, 6),title='Sentiment % distribution by airline')
This results in a nice chart: