How to convert wide dataframe to long based on similar column - pandas

I have a pandas dataframe like this
and I want to convert it to the dataframe below.
I am not sure how to use the pd.wide_to_long function here.
Below is the dataset for creating the dataframe:
Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34

Convert the Date column to the index, then clean the remaining column names: remove possible trailing spaces with str.strip, replace runs of whitespace with :, and split on one or more : to get a MultiIndex. It is then possible to reshape with DataFrame.stack, using DataFrame.rename_axis to name the new columns created by DataFrame.reset_index:
df1 = df.set_index('Date')
df1.columns = df1.columns.str.strip().str.replace(r'\s+', ':', regex=True).str.split('[:]+', expand=True)
df1 = df1.stack([0,1]).rename_axis(['Date','Symbol','Gender']).reset_index()
print (df1)
Date Symbol Gender Atronaut engineer teacher
0 20220405 GB Male 34 23 12
1 20220405 GB female 34 22 11
2 20220405 IN Male 5 29 25
3 20220405 IN female 23 23 41
4 20220404 GB Male 32 23 12
5 20220404 GB female 34 23 10
6 20220404 IN Male 4 29 21
7 20220404 IN female 22 23 40
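For comparison, here is a minimal sketch of a different route to the same shape, using melt, str.extract and pivot instead of stack. It assumes the headers always look like "SYMBOL: Gender job" and uses only the Male columns from the question to keep the sample short. The names long_df, parts and Job are made up for the illustration:

```python
import io

import pandas as pd

# Trimmed sample from the question (Male columns only), header spacing kept as posted.
csv_text = """Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut
20220405,25,29,5,12,23,34
20220404,21,29,4,12,23,32
"""
df = pd.read_csv(io.StringIO(csv_text))

# One row per (Date, original column name).
long_df = df.melt(id_vars="Date", var_name="column", value_name="value")

# Split each header into its three parts, tolerating stray spaces.
parts = long_df["column"].str.extract(r"^\s*(\w+):\s*(\w+)\s+(\w+)\s*$")
parts.columns = ["Symbol", "Gender", "Job"]

# Spread the job names back out into columns.
out = (
    pd.concat([long_df[["Date", "value"]], parts], axis=1)
    .pivot(index=["Date", "Symbol", "Gender"], columns="Job", values="value")
    .reset_index()
)
out.columns.name = None
print(out)
```

Passing a list to the index argument of pivot needs a reasonably recent pandas (1.1 or later, as far as I know).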

pivot_longer from pyjanitor offers an easy way to abstract the reshaping; in this case it can be solved with a regular expression:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
    index='Date',
    names_to=('symbol', 'gender', '.value'),
    names_pattern=r"(.+):\s*(.+)\s+(.+)",
    sort_by_appearance=True,
)
Date symbol gender teacher engineer Atronaut
0 20220405 IN Male 25 29 5
1 20220405 IN female 41 23 23
2 20220405 GB Male 12 23 34
3 20220405 GB female 11 22 34
4 20220404 IN Male 21 29 4
5 20220404 IN female 40 23 22
6 20220404 GB Male 12 23 32
7 20220404 GB female 10 23 34
The regular expression has three capture groups, matched in order against names_to: any group paired with .value stays as a header, and the rest become column values.
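To see what each capture group picks up, the same pattern can be tried on a few headers with Python's re module. This only illustrates the regular expression, not how pyjanitor applies it internally. The headers are taken from the question with the outer spaces stripped, and the variable names are made up for the example:

```python
import re

# Same pattern as names_pattern above: symbol, gender, then the part kept as a header.
names_pattern = re.compile(r"(.+):\s*(.+)\s+(.+)")

column_names = ["IN:Male teacher", "IN: Male Atronaut", "GB:female engineer"]
parsed = {name: names_pattern.fullmatch(name).groups() for name in column_names}

for name, groups in parsed.items():
    print(name, "->", groups)
# IN:Male teacher -> ('IN', 'Male', 'teacher')
# IN: Male Atronaut -> ('IN', 'Male', 'Atronaut')
# GB:female engineer -> ('GB', 'female', 'engineer')
```

The third group is the one paired with .value, so teacher, engineer and Atronaut end up as column headers.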

Related

How to extract car model name from the car dataset?

Can anyone help me extract the car model names from the following sample dataframe?
index,Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type
0,Honda,Amaze 1.2 VX i-VTEC,505000,2017,87150,Petrol,Manual,Pune,Grey,First,Corporate
1,Maruti Suzuki,Swift DZire VDI,450000,2014,75000,Diesel,Manual,Ludhiana,White,Second,Individual
2,Hyundai,i10 Magna 1.2 Kappa2,220000,2011,67000,Petrol,Manual,Lucknow,Maroon,First,Individual
3,Toyota,Glanza G,799000,2019,37500,Petrol,Manual,Mangalore,Red,First,Individual
I have used this code:
model_name = df['Model'].str.extract(r'(\w+)')
However, I'm unable to get the car names for models such as WR-V or CR-V (or any name with a space or hyphen in it).
This is the detailed link of the dataset: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?select=car+details+v4.csv
Desired output should be:
index,0
0,Amaze
1,Swift
2,i10
3,Glanza
4,Innova
5,Ciaz
6,CLA
7,X1 xDrive20d
8,Octavia
9,Terrano
10,Elite
11,Kwid
12,Ciaz
13,Harrier
14,Polo
15,Celerio
16,Alto
17,Baleno
18,Wagon
19,Creta
20,S-Presso
21,Vento
22,Santro
23,Venue
24,Alto
25,Ritz
26,Creta
27,Brio
28,Elite
29,WR-V
30,Venue
Please help me!!
The exact logic is unclear, but assuming you want the first word (including special characters) or the first two words if the first word has only one or two characters:
df['Model'].str.extract(r'(\S{3,}|\S{1,2}\s+\S+)', expand=False)
Output:
0 Amaze
1 Swift
2 i10
3 Glanza
4 Innova
5 Ciaz
6 CLA
7 X1 xDrive20d
8 Octavia
9 Terrano
10 Elite
11 Kwid
12 Ciaz
13 Harrier
14 Polo
15 Celerio
16 Alto
17 Baleno
18 Wagon
19 Creta
20 S-Presso
21 Vento
22 Santro
23 Venue
24 Alto
25 Ritz
26 Creta
27 Brio
28 Elite
29 WR-V
... ...
Name: Model, dtype: object
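A small self-contained check of that pattern, as a sketch. The first four strings come from the sample in the question. The last two full model strings are made up for the illustration, since the question only shows the desired outputs for those rows:

```python
import pandas as pd

models = pd.Series([
    "Amaze 1.2 VX i-VTEC",
    "Swift DZire VDI",
    "i10 Magna 1.2 Kappa2",
    "Glanza G",
    "X1 xDrive20d",  # hypothetical input: first word has only two characters
    "WR-V VX",       # hypothetical input: hyphenated first word
])

# First alternative: a first word of 3+ non-space characters.
# Second alternative: a 1-2 character first word plus the word after it.
names = models.str.extract(r"(\S{3,}|\S{1,2}\s+\S+)", expand=False)
print(names.tolist())
# ['Amaze', 'Swift', 'i10', 'Glanza', 'X1 xDrive20d', 'WR-V']
```

Because \S matches any non-whitespace character, the hyphen in WR-V is kept, which is what \w+ was missing.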

Creating a new conditional column in a pandas dataframe

I am trying to determine the result of a football match based on the goals scored. If the number of goals scored and received are equal, the expected output should be a draw. If the number of goals scored is higher than the number received, the expected output should be a win. If the number of goals scored is lower than the number received, the output should be lost.
Football_data_match['result'] = if(Football_data_match['goal_scored'] > Football_data_match['goal_against']:
Football_data_match['result'] = 'win'
elif (Football_data_match['goal_scored'<Football_data_match['goal_against']:
Football_data_match['result'] 'lost'
else:
Football_data_match['result'] = 'draw')
The code above gives a syntax error, but I'm not able to pinpoint the exact mistake. Could somebody help me fix this problem?
One way is using np.select:
import numpy as np
import pandas as pd
# Example data
df = pd.DataFrame({
    "goal_scored": np.random.randint(4, size=12),
    "goal_against": np.random.randint(4, size=12)
})
df["result"] = np.select(
    [
        df["goal_scored"] < df["goal_against"],
        df["goal_scored"] == df["goal_against"],
        df["goal_scored"] > df["goal_against"]
    ], ["lost", "draw", "win"],
    # a string default avoids mixing the implicit integer default 0 with string choices
    default="unknown"
)
df:
goal_scored goal_against result
0 1 3 lost
1 0 1 lost
2 0 3 lost
3 3 2 win
4 1 3 lost
5 2 0 win
6 2 2 draw
7 2 2 draw
8 3 1 win
9 0 2 lost
10 2 3 lost
11 1 1 draw
You can also use DataFrame.apply:
import pandas as pd
import numpy as np
import itertools
teams = ['Arizona Cardinals', 'Atlanta Falcons', 'Baltimore Ravens', 'Buffalo Bills', 'Carolina Panthers', 'Chicago Bears']
k = pd.DataFrame(np.random.randint(20,high=30, size=(15,2)), index=itertools.combinations(teams, 2), columns=['goal_scored', 'goal_against'])
k['result'] = k.apply(lambda row: 'win' if row['goal_scored'] > row['goal_against'] else ('lost' if row['goal_scored'] < row['goal_against'] else 'draw'), axis=1)
k is:
goal_scored goal_against result
(Arizona Cardinals, Atlanta Falcons) 29 29 draw
(Arizona Cardinals, Baltimore Ravens) 20 26 lost
(Arizona Cardinals, Buffalo Bills) 21 24 lost
(Arizona Cardinals, Carolina Panthers) 20 25 lost
(Arizona Cardinals, Chicago Bears) 27 28 lost
(Atlanta Falcons, Baltimore Ravens) 26 24 win
(Atlanta Falcons, Buffalo Bills) 20 21 lost
(Atlanta Falcons, Carolina Panthers) 22 25 lost
(Atlanta Falcons, Chicago Bears) 26 22 win
(Baltimore Ravens, Buffalo Bills) 23 21 win
(Baltimore Ravens, Carolina Panthers) 29 22 win
(Baltimore Ravens, Chicago Bears) 21 27 lost
(Buffalo Bills, Carolina Panthers) 24 21 win
(Buffalo Bills, Chicago Bears) 28 26 win
(Carolina Panthers, Chicago Bears) 24 22 win
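Your problem is that you need to think in vectorized terms when using pandas. Your if...else... operates on scalars, whereas Football_data_match is a whole DataFrame, so each comparison yields a Series of booleans rather than a single True or False.
You need to build the result from whole-column operations on the DataFrame or numpy.ndarray.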
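A minimal sketch of the difference, using a three-row frame made up for the illustration. The plain if fails because a Series has no single truth value, while nested np.where calls evaluate the conditions row by row:

```python
import numpy as np
import pandas as pd

# Tiny made-up example with the column names from the question.
Football_data_match = pd.DataFrame({
    "goal_scored": [2, 0, 1],
    "goal_against": [1, 3, 1],
})
scored = Football_data_match["goal_scored"]
against = Football_data_match["goal_against"]

# A scalar-style `if` on a whole column raises instead of picking a branch.
try:
    if scored > against:
        pass
except ValueError as err:
    print("ValueError:", err)

# Vectorized version: each condition is evaluated for every row at once.
Football_data_match["result"] = np.where(
    scored > against, "win", np.where(scored < against, "lost", "draw")
)
print(Football_data_match)
```

With three outcomes np.where is still readable. For more branches, np.select as in the answer above scales better.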

Pandas replace function specifying the column [duplicate]

dataset = pd.read_csv('./file.csv')
dataset.head()
This gives:
age sex smoker married region price
0 39 female yes no us 250000
1 28 male no no us 400000
2 23 male no yes europe 389000
3 17 male no no asia 230000
4 43 male no yes asia 243800
I want to replace all yes/no values of smoker with 0 or 1, but I don't want to change the yes/no values of married. I want to use pandas replace function.
I did the following, but this obviously changes all yes/no values (from smoker and married column):
dataset = dataset.replace(to_replace='yes', value='1')
dataset = dataset.replace(to_replace='no', value='0')
age sex smoker married region price
0 39 female 1 0 us 250000
1 28 male 0 0 us 400000
2 23 male 0 1 europe 389000
3 17 male 0 0 asia 230000
4 43 male 0 1 asia 243800
How can I ensure that only the yes/no values from the smoker column get changed, preferably using Pandas' replace function?
Did you try calling replace on just that column?
dataset['smoker']=dataset['smoker'].replace({'yes':1, 'no':0})
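If you would rather call replace on the whole DataFrame, a nested dictionary limits the replacement to the named column. A minimal sketch on a cut-down copy of the sample data (only three of the columns, and converted is a name made up for the example):

```python
import pandas as pd

# Cut-down version of the sample from the question.
dataset = pd.DataFrame({
    "age": [39, 28, 23, 17, 43],
    "smoker": ["yes", "no", "no", "no", "no"],
    "married": ["no", "no", "yes", "no", "yes"],
})

# Outer key = column to look in, inner dict = value -> replacement.
converted = dataset.replace({"smoker": {"yes": 1, "no": 0}})
print(converted)
```

The married column is left untouched because it is not named in the outer dictionary. Depending on the pandas version, the smoker column may come back as object rather than an integer dtype, so an explicit astype(int) afterwards may be needed.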

ValueError: grouper for xxx not 1-dimensional with pandas pivot_table()

I am working on an Olympics dataset and want to create another dataframe that has the total number of athletes and the total number of medals won, by type, for each country.
Using the following pivot_table gives me the error "ValueError: Grouper for 'ID' not 1-dimensional":
pd.pivot_table(olymp, index='NOC', columns=['ID','Medal'], values=['ID','Medal'], aggfunc={'ID':pd.Series.nunique,'Medal':'count'}).sort_values(by='Medal')
The result should have one row for each country with columns for totalAthletes, gold, silver, bronze. I am not sure how to go about it using pivot_table. I can do this by merging crosstab results, but would like to use just one pivot_table statement.
Here is what the original df looks like.
Update
I would like to get the medal breakdown as well, e.g. gold, silver, bronze. I also need a unique count of athlete IDs, so I use nunique, since one athlete may participate in multiple events. The same goes for medals, ignoring NA values.
IIUC:
out = df.pivot_table('ID', 'NOC', 'Medal', aggfunc='count', fill_value=0)
out['ID'] = df[df['Medal'].notna()].groupby('NOC')['ID'].nunique()
Output:
>>> out
Medal Bronze Gold Silver ID
NOC
AFG 2 0 0 1
AHO 0 0 1 1
ALG 8 5 4 14
ANZ 5 20 4 25
ARG 91 91 92 231
.. ... ... ... ...
VIE 0 1 3 3
WIF 5 0 0 4
YUG 93 130 167 317
ZAM 1 0 1 2
ZIM 1 17 4 16
[149 rows x 4 columns]
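Here is the same idea on a tiny made-up frame, as a sketch, so the two steps are easy to follow. The data and the totalAthletes column name are assumptions for the illustration. Unlike the snippet above, this version counts every athlete of a country, not only those with a medal:

```python
import numpy as np
import pandas as pd

# Made-up miniature of the Olympics data: one row per athlete and event.
olymp = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3, 4],
    "NOC": ["USA", "USA", "USA", "GER", "GER", "GER"],
    "Medal": ["Gold", "Silver", np.nan, "Gold", "Gold", np.nan],
})

# Step 1: medal counts per country and medal type (rows without a medal drop out).
medals = olymp.pivot_table(
    index="NOC", columns="Medal", values="ID", aggfunc="count", fill_value=0
)

# Step 2: unique athletes per country, aligned on the NOC index.
medals["totalAthletes"] = olymp.groupby("NOC")["ID"].nunique()
print(medals)
```

The key point is that ID is used only as values in the pivot, never in both columns and values, which is what triggered the "not 1-dimensional" error.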
Old answer
You can't have the same column for columns and values:
out = olymp.pivot_table(index='NOC', values=['ID','Medal'],
                        aggfunc={'ID':pd.Series.nunique, 'Medal':'count'}) \
           .sort_values('Medal', ascending=False)
print(out)
# Output
ID Medal
NOC
USA 9653 5637
URS 2948 2503
GER 4872 2165
GBR 6281 2068
FRA 6170 1777
.. ... ...
GAM 33 0
GBS 15 0
GEQ 26 0
PNG 61 0
LBA 68 0
[230 rows x 2 columns]
Another way to get the result above:
out = olymp.groupby('NOC').agg({'ID': pd.Series.nunique, 'Medal': 'count'}) \
           .sort_values('Medal', ascending=False)
print(out)
# Output
ID Medal
NOC
USA 9653 5637
URS 2948 2503
GER 4872 2165
GBR 6281 2068
FRA 6170 1777
.. ... ...
GAM 33 0
GBS 15 0
GEQ 26 0
PNG 61 0
LBA 68 0
[230 rows x 2 columns]

Add 2nd column index to pandas dataframe using another dataframe column

I have these 2 dataframes
dfmonthtradedays
Out[57]:
Instrument AAPL.O AMZN.O FB.OQ GOOG.OQ GOOGL.OQ BHP.AX JPM.N MSFT.O \
Date
2016-04-30 21 21 21 21 21 21 21 21
2016-05-31 21 21 21 21 21 21 21 21
2016-06-30 22 22 22 22 22 22 22 22
and
rics
Out[60]:
0 1
0 AAPL.O US
1 MSFT.O US
2 AMZN.O US
3 BHP.AX AU
I am trying to add a second column index level to df1 using column 1 of df2, so that the AAPL.O column would also have 'US' as a column index, BHP.AX would have 'AU', etc. I am new to Python and programming and have tried for some time to get this working, without luck.
I have tried:
dfmonthtradedays.columns = pd.MultiIndex.from_arrays(dfmonthtradedays.columns, rics[1].tolist())
The number of columns in df1 equals the number of rows in df2.
Regards
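One possible approach, as a sketch rather than a tested answer for the full data: pd.MultiIndex.from_arrays expects a single list of arrays, and since the rows of rics are not in the same order as the columns, it is safer to look each instrument up by name than to rely on position. The frames below are small stand-ins built from the values shown above, and the names country_by_ric and Country are made up:

```python
import pandas as pd

# Stand-in for dfmonthtradedays with four of the instruments shown above.
dfmonthtradedays = pd.DataFrame(
    {"AAPL.O": [21, 21, 22], "AMZN.O": [21, 21, 22],
     "BHP.AX": [21, 21, 22], "MSFT.O": [21, 21, 22]},
    index=pd.to_datetime(["2016-04-30", "2016-05-31", "2016-06-30"]),
)
dfmonthtradedays.index.name = "Date"
dfmonthtradedays.columns.name = "Instrument"

# Stand-in for rics: column 0 is the instrument, column 1 the country.
rics = pd.DataFrame(
    [["AAPL.O", "US"], ["MSFT.O", "US"], ["AMZN.O", "US"], ["BHP.AX", "AU"]]
)

# Look up the country for each existing column by instrument name.
country_by_ric = rics.set_index(0)[1]
countries = dfmonthtradedays.columns.map(country_by_ric)

# from_arrays takes ONE list holding both levels.
dfmonthtradedays.columns = pd.MultiIndex.from_arrays(
    [dfmonthtradedays.columns, countries], names=["Instrument", "Country"]
)
print(dfmonthtradedays)
```

Any instrument missing from rics would get NaN as its country, so it is worth checking for that before relying on the second level.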