Creating a new conditional column in pandas dataframe - pandas

I am trying to determine the result of a football match based on the goals scored. If the number of goals scored and the number of goals received are equal, the expected output should be a draw. If the number of goals scored is higher than the number of goals received, the expected output should be a win. If the number of goals scored is lower than the number of goals received, the output should be lost.
Football_data_match['result'] = if(Football_data_match['goal_scored'] > Football_data_match['goal_against']:
    Football_data_match['result'] = 'win'
elif (Football_data_match['goal_scored'<Football_data_match['goal_against']:
    Football_data_match['result'] 'lost'
else:
    Football_data_match['result'] = 'draw')
The code above gives a syntax error, but I'm not able to pinpoint the exact mistake. Could somebody help me fix this problem?

One way is using np.select:
import numpy as np
import pandas as pd

# Example data
df = pd.DataFrame({
    "goal_scored": np.random.randint(4, size=12),
    "goal_against": np.random.randint(4, size=12)
})

df["result"] = np.select(
    [
        df["goal_scored"] < df["goal_against"],
        df["goal_scored"] == df["goal_against"],
        df["goal_scored"] > df["goal_against"]
    ],
    ["lost", "draw", "win"]
)
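Since the three conditions are exhaustive, the last one could equally be dropped and "win" passed through np.select's default= argument instead.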
df:
goal_scored goal_against result
0 1 3 lost
1 0 1 lost
2 0 3 lost
3 3 2 win
4 1 3 lost
5 2 0 win
6 2 2 draw
7 2 2 draw
8 3 1 win
9 0 2 lost
10 2 3 lost
11 1 1 draw

You can also use DataFrame.apply:
import pandas as pd
import numpy as np
import itertools

teams = ['Arizona Cardinals', 'Atlanta Falcons', 'Baltimore Ravens',
         'Buffalo Bills', 'Carolina Panthers', 'Chicago Bears']
# 6 teams taken 2 at a time yields the 15 fixtures used as the index
k = pd.DataFrame(np.random.randint(20, high=30, size=(15, 2)),
                 index=list(itertools.combinations(teams, 2)),
                 columns=['goal_scored', 'goal_against'])
k['result'] = k.apply(lambda row: 'win' if row['goal_scored'] > row['goal_against']
                      else ('lost' if row['goal_scored'] < row['goal_against'] else 'draw'),
                      axis=1)
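Note that apply with axis=1 calls the lambda once per row in Python, so on large frames it is generally slower than the vectorized np.select approach above.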
k is:
goal_scored goal_against result
(Arizona Cardinals, Atlanta Falcons) 29 29 draw
(Arizona Cardinals, Baltimore Ravens) 20 26 lost
(Arizona Cardinals, Buffalo Bills) 21 24 lost
(Arizona Cardinals, Carolina Panthers) 20 25 lost
(Arizona Cardinals, Chicago Bears) 27 28 lost
(Atlanta Falcons, Baltimore Ravens) 26 24 win
(Atlanta Falcons, Buffalo Bills) 20 21 lost
(Atlanta Falcons, Carolina Panthers) 22 25 lost
(Atlanta Falcons, Chicago Bears) 26 22 win
(Baltimore Ravens, Buffalo Bills) 23 21 win
(Baltimore Ravens, Carolina Panthers) 29 22 win
(Baltimore Ravens, Chicago Bears) 21 27 lost
(Buffalo Bills, Carolina Panthers) 24 21 win
(Buffalo Bills, Chicago Bears) 28 26 win
(Carolina Panthers, Chicago Bears) 24 22 win
Your problem is that you need to think vectorized when using pandas. Your if...else... operates on scalars, whereas Football_data_match['goal_scored'] and Football_data_match['goal_against'] are whole columns.
You need to start from operations that act on the whole DataFrame or numpy.ndarray at once.
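For a three-way condition like this, nested np.where is another common vectorized idiom; a minimal sketch, assuming the column names from the question:
import numpy as np

# Each comparison is evaluated over the whole column at once; the inner
# np.where handles the 'lost'/'draw' split of the else branch
Football_data_match['result'] = np.where(
    Football_data_match['goal_scored'] > Football_data_match['goal_against'], 'win',
    np.where(Football_data_match['goal_scored'] < Football_data_match['goal_against'],
             'lost', 'draw'))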

Related

How to convert wide dataframe to long based on similar column

I have a pandas dataframe like this, and I want to convert it to the dataframe below. I am not sure how to use the pd.wide_to_long function here.
Below is the dataset for creating the dataframe:
Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34
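To reproduce the dataframe locally, one possible sketch reads the sample rows above with StringIO, keeping the irregular spacing in the header since the answer below relies on cleaning it up:
import pandas as pd
from io import StringIO

# Header and rows copied verbatim from the question, irregular spacing included
data = """Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34"""
df = pd.read_csv(StringIO(data))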
Convert the Date column to the index. For all other columns, remove possible trailing spaces with str.strip, then replace the remaining spaces with :, and finally split on one or more : into a MultiIndex. This makes the reshape possible with DataFrame.stack, followed by DataFrame.rename_axis for the new column names created by DataFrame.reset_index:
df1 = df.set_index('Date')
df1.columns = (df1.columns.str.strip()
                          .str.replace(r'\s+', ':', regex=True)
                          .str.split(r'[:]+', expand=True))
df1 = df1.stack([0, 1]).rename_axis(['Date', 'Symbol', 'Gender']).reset_index()
print(df1)
Date Symbol Gender Atronaut engineer teacher
0 20220405 GB Male 34 23 12
1 20220405 GB female 34 22 11
2 20220405 IN Male 5 29 25
3 20220405 IN female 23 23 41
4 20220404 GB Male 32 23 12
5 20220404 GB female 34 23 10
6 20220404 IN Male 4 29 21
7 20220404 IN female 22 23 40
pivot_longer from pyjanitor offers an easy way to abstract the reshaping; in this case it can be solved with a regular expression:
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
    index='Date',
    names_to=('symbol', 'gender', '.value'),
    names_pattern=r"(.+):\s*(.+)\s+(.+)",
    sort_by_appearance=True)
Date symbol gender teacher engineer Atronaut
0 20220405 IN Male 25 29 5
1 20220405 IN female 41 23 23
2 20220405 GB Male 12 23 34
3 20220405 GB female 11 22 34
4 20220404 IN Male 21 29 4
5 20220404 IN female 40 23 22
6 20220404 GB Male 12 23 32
7 20220404 GB female 10 23 34
The regular expression has three capture groups; any group paired with .value in names_to stays as a header, while the rest become column values.

Concatenate labels to an existing dataframe

I want to use a list of names "headers" to create a new column in my dataframe. In the initial table, the name of each division is positioned above the results for each team in that division. I want to add that header to each row entry for each division to make the data more identifiable, like this. I have the headers stored in the "headers" object in my code. How can I repeat each division header by the number of rows that appear in the division and append it to the dataset?
Edit: here is another snippet of what I want to get from the end product.
df3 = df.iloc[0:6]
df3.insert(0, 'Divisions', ['na', 'L5 Junior', 'L5 Junior', 'na',
                            'L5 Senior - Medium', 'L5 Senior - Medium'])
df3
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

# Import HTML
scr = ('https://tv.varsity.com/results/7361971-2022-spirit-unlimited-battle-at-the-'
       'boardwalk-atlantic-city-grand-ntls/31220')
scr1 = requests.get(scr)
soup = BeautifulSoup(scr1.text, "html.parser")

# List of names to append
table_MN = pd.read_html(scr)
sp3 = soup.find(class_="full-content").find_all("h2")
headers = [elt.text for elt in sp3]

# Extract text and header from division info
div = pd.DataFrame(headers)
div.columns = ["division"]
df = pd.concat(table_MN, ignore_index=True)
df.columns = df.iloc[0]
df
It is still not clear what output you are looking for. However, may I suggest the following, which selects the common headers from the tables in table_MN and then concatenates the results. If it is going in the right direction, please let me know, and indicate what else you want to extract from the resulting table:
# Promote each table's first row to its header, so all tables share column names
tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
pd.concat(tmn_1, axis=0, ignore_index=True)
output:
Rank Program Name Team Name Raw Score Deductions Performance Score Event Score
-- ------ --------------------------- ----------------- ----------- ------------ ------------------- -------------
0 1 Rockstar Cheer New Jersey City Girls 47.8667 0 95.7333 95.6833
1 2 Cheer Factor Xtraordinary 46.6667 0.15 93.1833 92.8541
2 1 Rockstar Cheer New Jersey City Girls 47.7667 0 95.5333 23.8833
3 2 Cheer Factor Xtraordinary 46.0333 0.2 91.8667 22.9667
4 1 Star Athletics Roar 47.5333 0.9 94.1667 93.9959
5 1 Prime Time All Stars Lady Onyx 43.9 1.35 86.45 86.6958
6 1 Prime Time All Stars Lady Onyx 44.1667 0.9 87.4333 21.8583
7 1 Just Cheer All Stars Jag 5 46.4333 0.15 92.7167 92.2875
8 1 Just Cheer All Stars Jag 5 45.8 0.6 91 22.75
9 1 Quest Athletics Black Ops 47.4333 0.45 94.4167 93.725
10 1 Quest Athletics Black Ops 46.5 1.35 91.65 22.9125
11 1 The Stingray Allstars X-Rays 45.3 0.95 89.65 88.4375
12 1 Vortex Allstars Lady Rays 45.7 0.5 90.9 91.1083
13 1 Vortex Allstars Lady Rays 45.8667 0 91.7333 22.9333
14 1 Upper Merion All Stars Citrus 46.4333 0 92.8667 92.7
15 2 Cheer Factor JUNIOR X 45.9 1.1 90.7 90.6542
16 3 NJ Premier All Stars Prodigy 44.6333 0.05 89.2167 89.8292
17 1 Upper Merion All Stars Citrus 46.1 0 92.2 23.05
18 2 NJ Premier All Stars Prodigy 45.8333 0 91.6667 22.9167
19 3 Cheer Factor JUNIOR X 45.7333 0.95 90.5167 22.6292
20 1 Virginia Royalty Athletics Dynasty 46.5 0 93 92.9
21 1 Virginia Royalty Athletics Dynasty 46.3 0 92.6 23.15
22 1 South Jersey Storm Lady Reign 47.2333 0 94.4667 93.4875
...
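If the goal is to tag each row with its division, one possible sketch, assuming each h2 header in headers corresponds one-to-one with a table in table_MN (a pairing the page layout suggests but that should be verified), repeats each header once per row of its table before concatenating:
import numpy as np

tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
out = pd.concat(tmn_1, axis=0, ignore_index=True)
# np.repeat expands each division name by its table's row count;
# this requires len(headers) == len(tmn_1)
out.insert(0, 'Division', np.repeat(headers, [len(t) for t in tmn_1]))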

Cumulative value counts of categorical data, with group by

In my data frame, I have a text column group with the group name, and a column drop_week holding a categorical value in the range [1, 4]. I want to store, for each group, the cumulative count of values 1 to 4 of drop_week. I'm doing this:
drop_data = all_data[['group', 'drop_week']].groupby('group')['drop_week'] \
.value_counts().unstack().transpose().fillna(0).cumsum().transpose()
and it works. But since it took me 2 hours of googling to come up with this solution, I was wondering if there is a better way to do it.
You could use pd.crosstab to create the frequency table. Then use cumsum(axis=1) to compute the cumulative sum across each row:
pd.crosstab(index=all_data['group'], columns=all_data['drop_week']).cumsum(axis=1)
# drop_week 1 2 3 4
# group
# 0 12 17 21 27
# 1 7 13 18 25
# 2 9 14 22 26
# 3 5 11 16 22
which agrees with
drop_data = (all_data[['group', 'drop_week']].groupby('group')['drop_week']
.value_counts().unstack().transpose().fillna(0).cumsum().transpose())
# drop_week 1 2 3 4
# group
# 0 12 17 21 27
# 1 7 13 18 25
# 2 9 14 22 26
# 3 5 11 16 22
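As an aside, the transpose-fillna-cumsum-transpose chain can be shortened: unstack accepts fill_value, and cumsum takes axis=1 directly, so an equivalent sketch is:
drop_data = (all_data.groupby('group')['drop_week']
             .value_counts()
             .unstack(fill_value=0)
             .cumsum(axis=1))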
The setup I used for this was:
import numpy as np
import pandas as pd

np.random.seed(2019)
N = 100
all_data = pd.DataFrame({'group': np.random.randint(4, size=N),
                         'drop_week': np.random.randint(1, 5, size=N)})
drop_data = (all_data[['group', 'drop_week']].groupby('group')['drop_week']
             .value_counts().unstack().transpose().fillna(0).cumsum().transpose())

Pandas: sorting by integer notation value

In this dataframe, the values in the key column correspond to the integer notation of each song's key.
df
track key
0 Last Resort 4
1 Casimir Pulaski Day 8
2 Glass Eyes 8
3 Ohio - Live At Massey Hall 1971 7
4 Ballad of a Thin Man 11
5 Can You Forgive Her? 11
6 The Only Thing 3
7 Goodbye Baby (Baby Goodbye) 4
8 Heart Of Stone 0
9 Ohio 0
10 the gate 2
11 Clampdown 2
12 Cry, Cry, Cry 4
13 What's Happening Brother 8
14 Stupid Girl 11
15 I Don't Wanna Play House 7
16 Inner City Blues (Make Me Wanna Holler) 11
17 The Lonesome Death of Hattie Carroll 4
18 Paint It, Black - (Original Single Mono Version) 5
19 Let Him Run Wild 11
20 Undercover (Of The Night) - Remastered 5
21 Between the Bars 7
22 Like a Rolling Stone 0
23 Once 2
24 Pale Blue Eyes 5
25 The Way You Make Me Feel - 2012 Remaster 1
26 Jeremy 2
27 The Entertainer 7
28 Pressure 9
29 Play With Fire - Mono Version / Remastered 2002 2
30 D-I-V-O-R-C-E 9
31 Big Shot 0
32 What's Going On 1
33 Folsom Prison Blues - Live 0
34 American Woman 1
35 Cocaine Blues - Live 8
36 Jesus, etc. 5
the notation is as follows:
'C' --> 0
'C#'--> 1
'D' --> 2
'Eb'--> 3
'E' --> 4
'F' --> 5
'F#'--> 6
'G' --> 7
'Ab'--> 8
'A' --> 9
'Bb'--> 10
'B' --> 11
What is specific about this notation is that 11 is closer to 0 than 2 is, for instance.
GOAL:
Given an input_notation = 0, I would like to sort according to closeness to key 0, or 'C'.
You can get the closest value by doing:
closest_key = (input_notation - 1) % 12
so I would like to sort according to this logic, having input_notation values on top and then the closest matches, like so:
8 Heart Of Stone 0
9 Ohio 0
22 Like a Rolling Stone 0
31 Big Shot 0
33 Folsom Prison Blues - Live 0
(...)
I have tried:
v = df[['key']].values
df = df.iloc[np.lexsort(np.abs(v - (input_notation - 1) %12 ).T)]
but this does not work. Any clues?
You can define the closeness first and then use argsort with iloc to sort the data frame:
input_notation = 0
# define the closeness or distance
diff = (df.key - input_notation).abs()
closeness = np.minimum(diff, 12 - diff)
# use argsort to calculate the sorting index, and iloc to reorder the data frame
closest_to_input = df.iloc[closeness.argsort(kind='mergesort')]
closest_to_input.head()
# track key
#8 Heart Of Stone 0
#9 Ohio 0
#22 Like a Rolling Stone 0
#31 Big Shot 0
#33 Folsom Prison Blues - Live 0
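Wrapped up as a small helper (a sketch; sort_by_key_closeness is a hypothetical name), the same logic works for any input_notation:
import numpy as np

def sort_by_key_closeness(df, input_notation):
    # Circular distance on the 12-key wheel, so keys 11 and 0 are only 1 apart
    diff = (df['key'] - input_notation).abs()
    closeness = np.minimum(diff, 12 - diff)
    # A stable sort keeps the original row order among equally close keys
    return df.iloc[closeness.argsort(kind='mergesort')]

closest_to_c = sort_by_key_closeness(df, 0)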

What type of graph can best show the correlation between 'Fare' (price) and "Survival" (Titanic)?

I'm playing around with Seaborn and Matplotlib, and I'm trying to find the best type of graph to show the correlation between fare values and chance of survival in the Titanic dataset.
The Titanic fare column has a lot of different values, ranging from 0 to around 500, and some of the values are repeated often.
Here is a sample of value_counts:
titanic.fare.value_counts()
8.0500 43
13.0000 42
7.8958 38
7.7500 34
26.0000 31
10.5000 24
7.9250 18
7.7750 16
0.0000 15
7.2292 15
26.5500 15
8.6625 13
7.8542 13
7.2500 13
7.2250 12
16.1000 9
9.5000 9
15.5000 8
24.1500 8
14.5000 7
7.0500 7
52.0000 7
31.2750 7
56.4958 7
69.5500 7
14.4542 7
30.0000 6
39.6875 6
46.9000 6
21.0000 6
.....
91.0792 2
106.4250 2
164.8667 2
The survived column, on the other hand, has only two values:
>>> titanic.survived.head(10)
271 1
597 0
302 0
633 0
277 0
413 0
674 0
263 0
466 0
A histogram would only show the frequency of fares in certain ranges.
For a scatter plot I would need two variables; having "survived" which has only two values would make for a strange variable.
Is there a way to show the rise of survivability as fare increases clearly through a line graph?
I know there is a correlation, as if I sort fare values in ascending order (0-500) and then do:
>>> titanic.head(50).survived.sum()
5
>>>titanic.tail(50).survived.sum()
37
I see a correlation.
Thanks.
This is what I did to show the correlation between the fare values and the chance of survival:
First, I created a new column Fare Groups, converting fare values to groups of fare ranges, using cut().
df['Fare Groups'] = pd.cut(df.Fare, [0,50,100,150,200,550])
Next, I created a pivot_table().
piv_fare = df.pivot_table(index='Fare Groups', columns='Survived', values = 'Fare', aggfunc='count')
Output:
Survived 0 1
Fare Groups
(0, 50] 484 232
(50, 100] 37 70
(100, 150] 5 19
(150, 200] 3 6
(200, 550] 6 14
Plot:
piv_fare.plot(kind='bar')
It seems those who had the cheapest tickets (0 to 50) had the lowest chance of survival. In fact, (0 to 50) is the only fare range where the chance of dying is higher than the chance of surviving; not just higher, but significantly higher.
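To show the rise in survival rate directly as a line, one possible follow-up sketch (assuming the same df with Fare and Survived columns) averages the 0/1 Survived flag per fare bin, since the mean of a 0/1 column is exactly the survival rate:
import pandas as pd

bins = [0, 50, 100, 150, 200, 550]
# Mean of the 0/1 Survived flag per bin = fraction who survived in that bin
survival_rate = df.groupby(pd.cut(df['Fare'], bins))['Survived'].mean()
survival_rate.plot(kind='line', marker='o')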