Find the maximum value of each group within a pandas DataFrame

I have a question and hope you can give me some support. I searched the archive here and found a solution, but it takes a long time and is not elegant, since it relies on loops.
Suppose you have the following frame:
System Country_Key Name Bank_number_length Check rule for bank acct no.
PEM AD Andorra 8 2
PL1 AD Andorra 15 5
PPE AD Andorra 14 5
P11 AD Andorra 9 5
P16 AD Andorra 12 4
PEM AE Emirates 3 5
PL1 AE Emirates 15 4
PPE AE Emirates 15 5
P11 AE Emirates 15 6
P16 AE Emirates 13 5
I found the following approach for two columns: Get the max value from each group with pandas.DataFrame.groupby
However, in my case I have many columns and need to set the index on the first three columns: "System", "Country_Key" and "Name".
My desired output would be the following:
System Country_Key Name Bank_number_length Check rule for bank acct no.
PEM AD Andorra
PL1 15 5
PPE 5
P11 5
P16
PEM AE Emirates
PL1 15
PPE 15
P11 15 6
P16
So the idea is to drop every value except the maximum of its group. Any kind of hint would be really helpful.

You can mask every value that is not its group's maximum with an empty string, and then mask the duplicated key values with an empty string as well:
keys = ['Country_Key', 'Name']
cols = ['Bank_number_length', 'Check rule for bank acct no.']
df[cols] = df[cols].mask(df[cols].ne(df.groupby(keys)[cols].transform('max')), '')
df.loc[df.duplicated(keys), keys] = ''
print(df)
System Country_Key Name Bank_number_length Check rule for bank acct no.
0 PEM AD Andorra
1 PL1 15 5
2 PPE 5
3 P11 5
4 P16
5 PEM AE Emirates
6 PL1 15
7 PPE 15
8 P11 15 6
9 P16
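The masking step above can be reproduced end to end with the sample data from the question. This is a minimal sketch of a variant, not the answer's exact code: it keeps NaN instead of an empty string so the value columns stay numeric, and the names `group_max`, `is_max` and `result` are introduced here only for illustration.

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "System": ["PEM", "PL1", "PPE", "P11", "P16"] * 2,
    "Country_Key": ["AD"] * 5 + ["AE"] * 5,
    "Name": ["Andorra"] * 5 + ["Emirates"] * 5,
    "Bank_number_length": [8, 15, 14, 9, 12, 3, 15, 15, 15, 13],
    "Check rule for bank acct no.": [2, 5, 5, 5, 4, 5, 4, 5, 6, 5],
})

keys = ["Country_Key", "Name"]
cols = ["Bank_number_length", "Check rule for bank acct no."]

# Per-group maximum, broadcast back to the shape of the original rows.
group_max = df.groupby(keys)[cols].transform("max")

# True where a cell equals the maximum of its group.
is_max = df[cols].eq(group_max)

# Keep the maxima and replace everything else with NaN (assumption:
# NaN is an acceptable placeholder instead of the empty string).
result = df.copy()
result[cols] = df[cols].where(is_max)
print(result)
```

If the blank-string display from the answer is needed, `result[cols].fillna('')` can be applied at the very end, at the cost of turning the columns into object dtype.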

Related

How to get a value of a column from another column

I am new to pandas and have recently been stuck on a question.
I need to find the name of the person who has the lowest score, but I just don't know how.
df =
name score subject
0 Amy 100
1 Amy 99
3 Amy 95
4 Bob 98
5 Bob 88
6 Bob 85
7 Cathy 94
8 Cathy 87
9 Cathy 90
It would be great if anyone could help.
You can use the min() function to build a boolean mask and select the matching names:
df.loc[df.score==df.score.min(), 'name']
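As a self-contained sketch of the same idea, the frame below rebuilds the question's data (the empty `subject` column is left out and a default index is assumed). It also shows `idxmin`, which returns the label of the first minimum when only a single name is wanted.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Amy"] * 3 + ["Bob"] * 3 + ["Cathy"] * 3,
    "score": [100, 99, 95, 98, 88, 85, 94, 87, 90],
})

# All names tied for the lowest score, as a Series.
lowest = df.loc[df["score"] == df["score"].min(), "name"]

# Only the first row holding the minimum, as a scalar.
first_lowest = df.loc[df["score"].idxmin(), "name"]

print(lowest.tolist())
print(first_lowest)
```

The boolean-mask form is the safer choice when several rows can share the minimum, because it returns all of them.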

Smoothed Average over rows and columns with pandas

I am trying to create a function that averages over both row and column. For example:
State 1943 1944 1945 1946 1947 (1947_AVG) 1948 (1948_AVG)
Alaska 1 2 3 4 5 2 6 3
CA 234 234 234 6677 34
I want code that will give me an average for 1947 using 1943, 1944, and 1945, an average for 1948 using 1944, 1945, and 1946, and so on.
I currently have:
d3['pandas_SMA_Year'] = d3.iloc[:,1].rolling(window=3).mean()
But this only rolls down the rows, not across the columns, and it doesn't take into account that the window should end two years back. Please and thank you for any guidance!
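One possible approach, sketched under the assumption that the year columns are named with strings such as "1943": transpose the year columns so that `rolling` runs along the years, then `shift` the result by two positions so each average lands on the year two steps after its window ends. The names `years`, `lag` and `sma` are illustrative only.

```python
import pandas as pd

# The Alaska row from the question; the year labels are assumed to be strings.
d3 = pd.DataFrame({
    "State": ["Alaska"],
    "1943": [1], "1944": [2], "1945": [3],
    "1946": [4], "1947": [5], "1948": [6],
})

years = ["1943", "1944", "1945", "1946", "1947", "1948"]
lag = 2  # the window ends two years before the target year

# Transpose so years become rows, take a 3-year rolling mean,
# move each mean forward by `lag` years, then transpose back.
sma = d3[years].T.rolling(window=3).mean().shift(lag).T.add_suffix("_AVG")

result = d3.join(sma[["1947_AVG", "1948_AVG"]])
print(result)
```

For the Alaska row this gives 2.0 for 1947 (mean of 1, 2, 3) and 3.0 for 1948 (mean of 2, 3, 4), matching the values in the question's table.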

How can I merge two dataframes outside the intersection of the data?

I have a dataframe of presidential candidates, the donation amounts they received, and the states where the donations came from (contbr_st).
However, the state column includes non-state abbreviations such as AA, FF and FM, as shown below. I also have a single-column dataframe of 50 state abbreviations.
The dataframe below is "total":
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AB 2048.00 NaN
AE 42973.75 5680.00
AK 281840.15 86204.24
AL 543123.48 527303.51
AP 37130.50 1655.00
AR 359247.28 105556.00
AS 2955.00 NaN
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FF NaN 99030.00
FL 7318178.58 8338458.81
FM 600.00 NaN
The dataframe below holds the 50 states; it is called "state":
state
0 AL
1 AK
2 AZ
3 AR
4 CA
5 CO
6 CT
7 DC
8 DE
9 FL
10 GA
11 HI
12 ID
13 IL
14 IN
15 IA
16 KS
17 KY
18 LA
19 ME
20 MD
21 MA
22 MI
23 MN
24 MS
25 MO
26 MT
27 NE
28 NV
29 NH
30 NJ
31 NM
32 NY
33 NC
34 ND
35 OH
36 OK
37 OR
38 PA
39 RI
40 SC
41 SD
42 TN
43 TX
44 UT
45 VT
46 VA
47 WA
48 WV
49 WI
50 WY
Is there a simple way in pandas to merge these two dataframes so that the intersecting states are discarded and only the non-state data from the original dataframe ('total') is kept?
My expected output would include only the non-state abbreviation data, as below:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AP 37130.50 1655.00
FF NaN 99030.00
FM 600.00 NaN
.
.
The only way I can think of is to take the state list from each dataframe, convert both to sets, and use the difference() method, then convert the result to a dataframe and merge it with the "total" dataframe.
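A shorter route than the set difference, sketched here on a reduced copy of the data: since `contbr_st` is the index of "total", `Index.isin` can test each label against the state column, and negating the result keeps only the non-state rows. The shortened `state` frame and the name `non_state` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# A few rows from the question's "total" frame, indexed by contbr_st.
total = pd.DataFrame(
    {
        "Obama, Barack": [56405.00, 281840.15, 37130.50, np.nan, 7318178.58, 600.00],
        "Romney, Mitt": [135.00, 86204.24, 1655.00, 99030.00, 8338458.81, np.nan],
    },
    index=pd.Index(["AA", "AK", "AP", "FF", "FL", "FM"], name="contbr_st"),
)

# Shortened stand-in for the full state list.
state = pd.DataFrame({"state": ["AL", "AK", "FL"]})

# Keep the rows whose index label is NOT one of the state abbreviations.
non_state = total[~total.index.isin(state["state"])]
print(non_state)
```

No merge is needed for this; the boolean mask filters "total" directly and leaves its columns untouched.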

Need assistance with below query

I'm getting this error:
Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
Code:
import webbrowser
import pandas as pd
website = 'https://en.wikipedia.org/wiki/Winning_percentage'
webbrowser.open(website)
league_frame = pd.read_clipboard()
The error mentioned above appears after that.
I believe you need to use read_html. It returns a list of all parsed tables, and you can select a DataFrame by position:
website = 'https://en.wikipedia.org/wiki/Winning_percentage'
#select first parsed table
df1 = pd.read_html(website)[0]
print (df1.head())
Win % Wins Losses Year Team Comment
0 0.798 67 17 1882 Chicago White Stockings best pre-modern season
1 0.763 116 36 1906 Chicago Cubs best 154-game NL season
2 0.721 111 43 1954 Cleveland Indians best 154-game AL season
3 0.716 116 46 2001 Seattle Mariners best 162-game AL season
4 0.667 108 54 1975 Cincinnati Reds best 162-game NL season
#select second parsed table
df2 = pd.read_html(website)[1]
print (df2)
Win % Wins Losses Season Team \
0 0.890 73 9 2015–16 Golden State Warriors
1 0.110 9 73 1972–73 Philadelphia 76ers
2 0.106 7 59 2011–12 Charlotte Bobcats
Comment
0 best 82 game season
1 worst 82-game season
2 worst season statistically
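For context on the original error: `read_clipboard` passes the clipboard text on to `read_csv`, so rows with differing numbers of fields trigger the tokenizing error. The sketch below reproduces that kind of failure without a clipboard; the comma-separated sample text is an assumption chosen for simplicity and is not the Wikipedia content from the question.

```python
import io

import pandas as pd

# Header and first data row have 2 fields, the last row has 3.
text = "a,b\n1,2\n3,4,5\n"

try:
    pd.read_csv(io.StringIO(text))
    raised = False
    message = ""
except pd.errors.ParserError as exc:
    raised = True
    message = str(exc)

print(raised)
print(message)
```

A whole copied web page mixes prose and several tables, which is why parsing the HTML tables directly is more robust than reading the clipboard.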

Pandas: Error when merging two tables, Error with set_index

Thanks in advance for your help, here's my question:
I've successfully loaded my df into an IPython notebook and then ran a group by on it:
station_count = station.groupby('landmark').count()
which produced a table like this:
Now I'm trying to merge it with another table:
dock_count_by_station = station.groupby('landmark').sum()
that is also a simple group by on the same table, but the merge produces an error:
TypeError: cannot concatenate a non-NDFrame object
with this code:
dock_count_by_station.merge(station_count)
I think the problem is that I need to set the index of the two tables before merging them but I keep getting this error for the code below:
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()
KeyError: 'landmark'
station_count.set_index('landmark')
Using join
You can use join, which merges the tables on their index. You may also wish to specify the join type (e.g. 'outer', 'inner', 'left' or 'right'). You have overlapping column names (e.g. station_id), so you need to specify a suffix.
>>> dock_count_by_station.join(station_count, rsuffix='_rhs')
dockcount lat long station_id dockcount_rhs installation lat_rhs long_rhs name station_id_rhs
landmark
Mountain View 117 261.767433 -854.623012 210 7 7 7 7 7 7
Palo Alto 75 187.191873 -610.767939 180 5 5 5 5 5 5
Redwood City 115 262.406232 -855.602755 224 7 7 7 7 7 7
San Francisco 665 1322.569239 -4284.054814 2126 35 35 35 35 35 35
San Jose 249 560.039892 -1828.370075 200 15 15 15 15 15 15
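The join above can be tried on a small stand-in for the `station` table. The rows below are invented for illustration and only carry three of the original columns; they are not the asker's data.

```python
import pandas as pd

# Hypothetical miniature of the station table.
station = pd.DataFrame({
    "station_id": [1, 2, 3, 4],
    "landmark": ["San Jose", "San Jose", "Palo Alto", "Palo Alto"],
    "dockcount": [27, 15, 11, 19],
})

dock_count_by_station = station.groupby("landmark").sum()
station_count = station.groupby("landmark").count()

# Both frames are indexed by landmark, so join aligns them on the index.
# rsuffix renames the overlapping columns coming from the right-hand frame.
joined = dock_count_by_station.join(station_count, rsuffix="_rhs")
print(joined)
```

Without `rsuffix` (or `lsuffix`), `join` raises a ValueError here because both frames contain `station_id` and `dockcount`.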
Using merge
Note that your landmark index was set by default when you did the groupby. You can always use as_index=False if you don't want this to occur, but then you would have to use merge instead of join.
dock_count_by_station = station.groupby('landmark', as_index=False).sum()
station_count = station.groupby('landmark', as_index=False).count()
>>> dock_count_by_station.merge(station_count, on='landmark', suffixes=['_lhs', '_rhs'])
landmark dockcount_lhs lat_lhs long_lhs station_id_lhs dockcount_rhs installation lat_rhs long_rhs name station_id_rhs
0 Mountain View 117 261.767433 -854.623012 210 7 7 7 7 7 7
1 Palo Alto 75 187.191873 -610.767939 180 5 5 5 5 5 5
2 Redwood City 115 262.406232 -855.602755 224 7 7 7 7 7 7
3 San Francisco 665 1322.569239 -4284.054814 2126 35 35 35 35 35 35
4 San Jose 249 560.039892 -1828.370075 200 15 15 15 15 15 15
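The KeyError from `set_index('landmark')` in the question follows from the same point: after the groupby, `landmark` is already the index and no longer a column. A minimal sketch, again using invented rows for `station`, shows this and uses `reset_index` as an alternative to `as_index=False` before merging.

```python
import pandas as pd

# Hypothetical miniature of the station table.
station = pd.DataFrame({
    "station_id": [1, 2, 3, 4],
    "landmark": ["San Jose", "San Jose", "Palo Alto", "Palo Alto"],
    "dockcount": [27, 15, 11, 19],
})

station_count = station.groupby("landmark").count()
dock_count_by_station = station.groupby("landmark").sum()

# landmark is the index name, not a column, so set_index('landmark') fails.
is_column = "landmark" in station_count.columns
index_name = station_count.index.name

# reset_index turns the index back into a regular column for merge.
merged = dock_count_by_station.reset_index().merge(
    station_count.reset_index(), on="landmark", suffixes=("_lhs", "_rhs")
)
print(is_column, index_name)
print(merged)
```

Either route works; `as_index=False` avoids the extra step when the grouped frames are created, while `reset_index` is convenient when they already exist.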