python pandas melt on combined columns - pandas

I have a dataframe like this. It has regular fields up to "state", followed by trailers (each group of 3 columns, tr1*, represents one trailer). I want to convert those trailers to rows. I tried the melt function, but I can only use it on one trailer column at a time. The example below shows what I mean.
Name number city state tr1num tr1acct tr1ct tr2num tr2acct tr2ct tr3num tr3acct tr3ct
DJ 10 Edison nj 1001 20345 Dew 1002 20346 Newca. 1003. 20347. pen
ND 20 Newark DE 2001 1985 flor 2002 1986 rodge
I am expecting the output like this.
Name number city state trnum tracct trct
DJ 10 Edison nj 1001 20345 Dew
DJ 10 Edison nj 1002 20346 Newca
DJ 10 Edison nj 1003 20347 pen
ND 20 Newark DE 2001 1985 flor
ND 20 Newark DE 2002 1986 rodge

You need to look at using pd.wide_to_long. However, you will need to do some column renaming first.
df = df.set_index(['Name','number','city','state'])
# tr1num -> trnum_1; regex=True is needed since pandas 2.0, where str.replace matches literally by default
df.columns = df.columns.str.replace(r'(\D+)(\d+)(\D+)', r'\1\3_\2', regex=True)
df = df.reset_index()
pd.wide_to_long(df, ['trnum','trct','tracct'],
['Name','number','city','state'], 'Code', sep='_', suffix=r'\d+')\
.reset_index()\
.drop('Code',axis=1)
Output:
Name number city state trnum trct tracct
0 DJ 10 Edison nj 1001.0 Dew 20345.0
1 DJ 10 Edison nj 1002.0 Newca. 20346.0
2 DJ 10 Edison nj 1003.0 pen 20347.0
3 ND 20 Newark DE 2001.0 flor 1985.0
4 ND 20 Newark DE 2002.0 rodge 1986.0
5 ND 20 Newark DE NaN NaN NaN
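The last row of this output is an empty trailer, since ND has no third trailer. A minimal sketch of one way to drop such rows, assuming a trailer counts as empty only when all three of its fields are missing. The sample frame is rebuilt by hand from the question's data, and `re.sub` inside `rename` is used here in place of the `str.replace` step:

```python
import re
import pandas as pd

# Sample data rebuilt from the question (values typed in by hand)
df = pd.DataFrame({
    "Name": ["DJ", "ND"], "number": [10, 20],
    "city": ["Edison", "Newark"], "state": ["nj", "DE"],
    "tr1num": [1001, 2001], "tr1acct": [20345, 1985], "tr1ct": ["Dew", "flor"],
    "tr2num": [1002, 2002], "tr2acct": [20346, 1986], "tr2ct": ["Newca", "rodge"],
    "tr3num": [1003, None], "tr3acct": [20347, None], "tr3ct": ["pen", None],
})

id_cols = ["Name", "number", "city", "state"]
stubs = ["trnum", "tracct", "trct"]

# tr1num -> trnum_1, so wide_to_long can split stub name and numeric suffix
wide = df.rename(columns=lambda c: re.sub(r"^(\D+)(\d+)(\D+)$", r"\1\3_\2", c))

long_df = (
    pd.wide_to_long(wide, stubnames=stubs, i=id_cols, j="Code",
                    sep="_", suffix=r"\d+")
    .reset_index()
    .drop(columns="Code")
    .dropna(subset=stubs, how="all")   # drop trailers with no data at all
    .reset_index(drop=True)
)
print(long_df)
```

Using `how="all"` keeps trailers that are only partly filled in; switch to `how="any"` if a partly filled trailer should be discarded as well.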

You could achieve this by renaming your columns a bit and applying the pandas wide_to_long method. Below is the code which produces your desired output.
df = pd.DataFrame({"Name":["DJ", "ND"], "number":[10,20], "city":["Edison", "Newark"], "state":["nj","DE"],
"trnum_1":[1001,2001], "tracct_1":[20345,1985], "trct_1":["Dew", "flor"], "trnum_2":[1002,2002],
"trct_2":["Newca", "rodge"], "trnum_3":[1003,None], "tracct_3":[20347,None], "trct_3":["pen", None]})
pd.wide_to_long(df, stubnames=['trnum', 'tracct', 'trct'], i='Name', j='dropme', sep='_').reset_index().drop('dropme', axis=1)\
.sort_values('trnum')
outputs
Name state city number trnum tracct trct
0 DJ nj Edison 10 1001.0 20345.0 Dew
1 DJ nj Edison 10 1002.0 NaN Newca
2 DJ nj Edison 10 1003.0 20347.0 pen
3 ND DE Newark 20 2001.0 1985.0 flor
4 ND DE Newark 20 2002.0 NaN rodge
5 ND DE Newark 20 NaN NaN None

Another option:
df = pd.DataFrame({'col1': [1,2,3], 'col2':[3,4,5], 'col3':[5,6,7], 'tr1':[0,9,8], 'tr2':[0,9,8]})
The df:
col1 col2 col3 tr1 tr2
0 1 3 5 0 0
1 2 4 6 9 9
2 3 5 7 8 8
Subset the frame into two DataFrames, rename each trailer column to a common name, and concatenate:
tr1_df = df[['col1', 'col2', 'col3', 'tr1']].rename(index=str, columns={"tr1":"tr"})
tr2_df = df[['col1', 'col2', 'col3', 'tr2']].rename(index=str, columns={"tr2":"tr"})
res = pd.concat([tr1_df, tr2_df])
result:
col1 col2 col3 tr
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
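The same subset-and-concat idea can be extended to the three-column trailers from the question. This is only a sketch: the sample frame is typed in from the question, and the loop assumes the trailers are numbered 1 to 3 and follow the `tr<N>num` / `tr<N>acct` / `tr<N>ct` naming exactly.

```python
import pandas as pd

# Sample data rebuilt from the question (values typed in by hand)
df = pd.DataFrame({
    "Name": ["DJ", "ND"], "number": [10, 20],
    "city": ["Edison", "Newark"], "state": ["nj", "DE"],
    "tr1num": [1001, 2001], "tr1acct": [20345, 1985], "tr1ct": ["Dew", "flor"],
    "tr2num": [1002, 2002], "tr2acct": [20346, 1986], "tr2ct": ["Newca", "rodge"],
    "tr3num": [1003, None], "tr3acct": [20347, None], "tr3ct": ["pen", None],
})

id_cols = ["Name", "number", "city", "state"]
parts = []
for n in (1, 2, 3):  # assumed trailer numbers
    mapping = {f"tr{n}num": "trnum", f"tr{n}acct": "tracct", f"tr{n}ct": "trct"}
    # Keep the id columns plus one trailer, renamed to the common names
    parts.append(df[id_cols + list(mapping)].rename(columns=mapping))

res = (
    pd.concat(parts, ignore_index=True)
    .dropna(subset=["trnum", "tracct", "trct"], how="all")  # skip empty trailers
    .sort_values(["Name", "trnum"])
    .reset_index(drop=True)
)
print(res)
```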

One option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
.pivot_longer(
index=slice("Name", "state"),
names_to=(".value", ".value"),
names_pattern=r"(.+)\d(.+)",
sort_by_appearance=True)
.dropna()
)
Name number city state trnum tracct trct
0 DJ 10 Edison nj 1001.0 20345.0 Dew
1 DJ 10 Edison nj 1002.0 20346.0 Newca.
2 DJ 10 Edison nj 1003.0 20347.0 pen
3 ND 20 Newark DE 2001.0 1985.0 flor
4 ND 20 Newark DE 2002.0 1986.0 rodge
The .value keeps the part of the column associated with it as header, and since we have multiple .value, they are combined into a single word. The .value is determined by the groups in the names_pattern, which is a regular expression.
Note that the multiple .value option is currently only available in the development version.

Related

Concatenate labels to an existing dataframe

I want to use a list of names, "headers", to create a new column in my dataframe. In the initial table, the name of each division is positioned above the results for each team in that division. I want to add that header to each row of its division to make the data easier to identify. I have the headers stored in the "headers" object in my code. How can I repeat each division header once per row in that division and add the result to the dataset?
Edit: here is another snippet showing what I want to get in the end product.
df3 = df.iloc[0:6]
df3.insert(0, 'Divisions', ['na','L5 Junior', 'L5 Junior', 'na',
'L5 Senior - Medium', 'L5 Senior - Medium'])
df3
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
# Import HTML
scr = 'https://tv.varsity.com/results/7361971-2022-spirit-unlimited-battle-at-the-boardwalk-atlantic-city-grand-ntls/31220'
scr1 = requests.get(scr)
soup = BeautifulSoup(scr1.text, "html.parser")
# List of names to append
sp3 = soup.find(class_="full-content").find_all("h2")
headers = [elt.text for elt in sp3]
table_MN = pd.read_html(scr)
# Extract text and header from division info
div = pd.DataFrame(headers)
div.columns = ["division"]
df = pd.concat(table_MN, ignore_index=True)
df.columns = df.iloc[0]
df
It is still not clear what output you are looking for. However, may I suggest the following, which promotes the first row of each table in table_MN to its header and then concatenates the results. If this is going in the right direction, please let me know, and indicate what else you want to extract from the resulting table:
tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
pd.concat(tmn_1, axis=0, ignore_index = True)
output:
Rank Program Name Team Name Raw Score Deductions Performance Score Event Score
-- ------ --------------------------- ----------------- ----------- ------------ ------------------- -------------
0 1 Rockstar Cheer New Jersey City Girls 47.8667 0 95.7333 95.6833
1 2 Cheer Factor Xtraordinary 46.6667 0.15 93.1833 92.8541
2 1 Rockstar Cheer New Jersey City Girls 47.7667 0 95.5333 23.8833
3 2 Cheer Factor Xtraordinary 46.0333 0.2 91.8667 22.9667
4 1 Star Athletics Roar 47.5333 0.9 94.1667 93.9959
5 1 Prime Time All Stars Lady Onyx 43.9 1.35 86.45 86.6958
6 1 Prime Time All Stars Lady Onyx 44.1667 0.9 87.4333 21.8583
7 1 Just Cheer All Stars Jag 5 46.4333 0.15 92.7167 92.2875
8 1 Just Cheer All Stars Jag 5 45.8 0.6 91 22.75
9 1 Quest Athletics Black Ops 47.4333 0.45 94.4167 93.725
10 1 Quest Athletics Black Ops 46.5 1.35 91.65 22.9125
11 1 The Stingray Allstars X-Rays 45.3 0.95 89.65 88.4375
12 1 Vortex Allstars Lady Rays 45.7 0.5 90.9 91.1083
13 1 Vortex Allstars Lady Rays 45.8667 0 91.7333 22.9333
14 1 Upper Merion All Stars Citrus 46.4333 0 92.8667 92.7
15 2 Cheer Factor JUNIOR X 45.9 1.1 90.7 90.6542
16 3 NJ Premier All Stars Prodigy 44.6333 0.05 89.2167 89.8292
17 1 Upper Merion All Stars Citrus 46.1 0 92.2 23.05
18 2 NJ Premier All Stars Prodigy 45.8333 0 91.6667 22.9167
19 3 Cheer Factor JUNIOR X 45.7333 0.95 90.5167 22.6292
20 1 Virginia Royalty Athletics Dynasty 46.5 0 93 92.9
21 1 Virginia Royalty Athletics Dynasty 46.3 0 92.6 23.15
22 1 South Jersey Storm Lady Reign 47.2333 0 94.4667 93.4875
...
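To attach a division name to every row, one possible approach is to label each table before concatenating. The sketch below rests on an assumption that is not confirmed by the thread: that `headers` and `table_MN` line up one-to-one, in the same order. The toy `headers` and `tables` are invented stand-ins so the snippet runs without fetching the page.

```python
import pandas as pd

# Invented stand-ins for `headers` and `table_MN`; the real page is not fetched
headers = ["L5 Junior", "L5 Senior - Medium"]
tables = [
    pd.DataFrame([["Rank", "Team Name"], [1, "City Girls"], [2, "Xtraordinary"]]),
    pd.DataFrame([["Rank", "Team Name"], [1, "Roar"]]),
]

labelled = []
for division, tbl in zip(headers, tables):  # assumes matching order and length
    body = tbl.iloc[1:].copy()              # data rows
    body.columns = tbl.iloc[0].tolist()     # first row holds the column names
    body.insert(0, "Divisions", division)   # repeat the header on every row
    labelled.append(body)

result = pd.concat(labelled, ignore_index=True)
print(result)
```

If the page holds more than one table per division, the pairing would need to be adjusted first, for example by repeating each header to match.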

Using matplotlib to plot multiple boxplots

I have data that looks like this:
A DataFrame with city names, their latitude and longitude:
import pandas as pd
city = {'Name': ['San Franciso', 'Paris', 'Tokyo', 'London', 'Barcelona'], 'Latitude': [50.69460297, 43.64984221, 60.5331547, 62.5331547, 63.5331547],'Longtitude': [41.43147227, 49.78045496691, 122.23536080538, 19.78045496691, 29.78045496691]}
city_df = pd.DataFrame(city)
A list of 5 DataFrames, each of which looks like this:
list1= [[1,"kids",0.00094], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00088], [6,"adult",0.00113], [7,"elderly",0.00105]]
l1 = pd.DataFrame(list1)
list2= [[1,"kids",0.00044], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00088], [6,"adult",0.00113], [7,"elderly",0.00105]]
l2 = pd.DataFrame(list2)
list3= [[1,"kids",0.00394], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00588], [6,"adult",0.00113], [7,"elderly",0.00105]]
l3 = pd.DataFrame(list3)
list4= [[1,"kids",0.00074], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00088], [6,"adult",0.00113], [7,"elderly",0.00105]]
l4 = pd.DataFrame(list4)
list5= [[1,"kids",0.00095], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00043], [6,"adult",0.00113], [7,"elderly",0.00105]]
l5 = pd.DataFrame(list5)
l = [l1, l2, l3, l4, l5]
I want to create a plot with, for each city, a boxplot of the values for a particular group; on the y-axis, the cities are sorted by latitude.
I tried to make this work with pd.concat and pd.melt (from: Plotting multiple boxplots in seaborn?).
It is a challenge for me. Thank you for your time.
You can concat with the city names as MultiIndex, and use seaborn.catplot to plot:
df = pd.concat(dict(zip(city_df['Name'], l)), names=['city']).reset_index(level=0)
import seaborn as sns
sns.catplot(data=df, row=1, x='city', y=2, kind='box', sharey=False)
output:
city 0 1 2
0 San Franciso 1 kids 0.00094
1 San Franciso 2 adult 0.00120
2 San Franciso 3 elderly 0.00114
3 San Franciso 5 kids 0.00088
4 San Franciso 6 adult 0.00113
5 San Franciso 7 elderly 0.00105
0 Paris 1 kids 0.00044
1 Paris 2 adult 0.00120
2 Paris 3 elderly 0.00114
3 Paris 5 kids 0.00088
4 Paris 6 adult 0.00113
5 Paris 7 elderly 0.00105
0 Tokyo 1 kids 0.00394
1 Tokyo 2 adult 0.00120
2 Tokyo 3 elderly 0.00114
3 Tokyo 5 kids 0.00588
4 Tokyo 6 adult 0.00113
5 Tokyo 7 elderly 0.00105
0 London 1 kids 0.00074
1 London 2 adult 0.00120
2 London 3 elderly 0.00114
3 London 5 kids 0.00088
4 London 6 adult 0.00113
5 London 7 elderly 0.00105
0 Barcelona 1 kids 0.00095
1 Barcelona 2 adult 0.00120
2 Barcelona 3 elderly 0.00114
3 Barcelona 5 kids 0.00043
4 Barcelona 6 adult 0.00113
5 Barcelona 7 elderly 0.00105
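The question also asks for the cities to be sorted by latitude, which the plot call above does not do. A small sketch of how that order could be derived from `city_df`; the commented-out `catplot` call is an untested assumption about how the order might then be applied with the cities on the y-axis.

```python
import pandas as pd

city_df = pd.DataFrame({
    "Name": ["San Franciso", "Paris", "Tokyo", "London", "Barcelona"],
    "Latitude": [50.69460297, 43.64984221, 60.5331547, 62.5331547, 63.5331547],
})

# City names sorted from lowest to highest latitude
order = city_df.sort_values("Latitude")["Name"].tolist()
print(order)

# Assumed usage with the concatenated frame `df` from the answer:
# sns.catplot(data=df, col=1, x=2, y="city", order=order, kind="box", sharex=False)
```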

Using groupby to create new dataframe

I have a Python DataFrame (df) which includes time series data (days of the year) for various countries. Timeseries is a column of the df, and is repeated by country. The file looks something like this:
Country  Date    Total Vaccinations  People Vaccinated Per Hundred
Israel   1/1/21  100                 .001
Israel   1/2/21  104                 .002
...      ...     ...                 .004
Israel   2/1/21  150                 .010
USA      1/1/21  0                   .000
USA      1/2/21  50                  .001
...      ...     ...
USA      2/1/21  500                 .05
etc,etc. Assume this is the case for 200+ countries.
I wish to create a new df which will contain data for a single feature (say "Total Vaccinations") by countries, using Date as the index.
So the engineered df becomes (selecting "Total Vaccinations"):
Date    Israel  USA  U.K.
1/1/21  100     0    0
1/2/21  104     50   25
...     ...     ...  ...
2/1/21  150     500  250
You can try pivot_table to achieve this -
>>> import pandas as pd
>>>
>>> d = ['1/1/21','1/2/21','1/3/21','1/4/21'] * 3
>>> v = [100,104,103,205] * 3
>>> c = ['USA','Isreal','UK'] * 4
>>> df = pd.DataFrame({'Country':c,'Total_Vaccinations':v,'Date':d})
>>> df
Country Total_Vaccinations Date
0 USA 100 1/1/21
1 Isreal 104 1/2/21
2 UK 103 1/3/21
3 USA 205 1/4/21
4 Isreal 100 1/1/21
5 UK 104 1/2/21
6 USA 103 1/3/21
7 Isreal 205 1/4/21
8 UK 100 1/1/21
9 USA 104 1/2/21
10 Isreal 103 1/3/21
11 UK 205 1/4/21
>>> pivot_df = pd.pivot_table(df,index=['Date'],values=['Total_Vaccinations'],columns=['Country'])
>>> pivot_df.columns = pivot_df.columns.droplevel()
>>> pivot_df = pivot_df.reset_index()
>>> pivot_df
Country Date Isreal UK USA
0 1/1/21 100 100 100
1 1/2/21 104 104 104
2 1/3/21 103 103 103
3 1/4/21 205 205 205
>>>
You can also remove the leftover "Country" label from the column axis.
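The "Country" label that remains above the index in the printed frame is the name of the column axis, not a real column. A minimal sketch of clearing it, using a small invented frame and passing `values` as a plain string so that no extra column level needs to be dropped:

```python
import pandas as pd

# Small invented sample in the same long layout
df = pd.DataFrame({
    "Country": ["Israel", "Israel", "USA", "USA"],
    "Date": ["1/1/21", "1/2/21", "1/1/21", "1/2/21"],
    "Total_Vaccinations": [100, 104, 0, 50],
})

pivot_df = df.pivot_table(index="Date", columns="Country",
                          values="Total_Vaccinations").reset_index()

# Clear the axis name so "Country" no longer appears in the header
pivot_df.columns.name = None
print(pivot_df)
```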
Use:
#create DatetimeIndex
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
#create subset by list of countries
need = ['Israel','USA']
df = df[df['Country'].isin(need)]
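The snippet above stops after filtering, so the reshaping step is still missing. One way it might be finished is with `DataFrame.pivot`, sketched here on a small invented sample; the explicit date format and the assumption that each (Date, Country) pair occurs only once are mine, not the answer's.

```python
import pandas as pd

# Small invented sample in the question's long layout
df = pd.DataFrame({
    "Country": ["Israel", "Israel", "USA", "USA", "U.K.", "U.K."],
    "Date": ["1/1/21", "1/2/21"] * 3,
    "Total Vaccinations": [100, 104, 0, 50, 0, 25],
})

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")

# Keep only the countries of interest
need = ["Israel", "USA"]
subset = df[df["Country"].isin(need)]

# One column per country, Date as the index; pivot raises on duplicate pairs
wide = subset.pivot(index="Date", columns="Country", values="Total Vaccinations")
print(wide)
```

If duplicate (Date, Country) pairs can occur, `pivot_table` with an explicit `aggfunc` would be the safer choice.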

Pandas dataframe long to wide grouping by column with duplicated element

Hello I imported a dataframe which has no headers.
I created some headers using
df=pd.read_csv(path, names=['Prim Index', 'Alt Index', 'Aka', 'Name', 'Unnamed9'])
Then, I only keep
df=df[['Prim Index', 'Name']]
My question is how do I make df from long to wide, as 'Prim Index' is duplicated, I would like to have each unique Prim Index in one row and their names in different columns.
Thanks in advance! I appreciate any help on this!
Current df
Prim Index Alt Index Aka Name Unnamed9
1 2345 aka Marcus 0
1 7634 aka Tiffany 0
1 3242 aka Royce 0
2 8765 aka Charlotte 0
2 4343 aka Sara 0
3 9825 aka Keith 0
4 6714 aka Jennifer 0
5 7875 aka Justin 0
5 1345 aka Diana 0
6 6591 aka Liz 0
Desired df
Prim Index Name1 Name2 Name3 Name4
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
Use GroupBy.cumcount for counter with DataFrame.set_index for MultiIndex, then reshape by Series.unstack and change columns names by DataFrame.add_prefix:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
.unstack(fill_value='')
.add_prefix('Name'))
print (df1)
Name1 Name2 Name3
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
If there always have to be 4 name columns, add DataFrame.reindex with a range:
df1 = (df.set_index(['Prim Index', df.groupby('Prim Index').cumcount().add(1)])['Name']
.unstack(fill_value='')
.reindex(range(1, 5), fill_value='', axis=1)
.add_prefix('Name'))
print (df1)
Name1 Name2 Name3 Name4
Prim Index
1 Marcus Tiffany Royce
2 Charlotte Sara
3 Keith
4 Jennifer
5 Justin Diana
6 Liz
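The key step in both versions is the counter produced by GroupBy.cumcount, which numbers the rows within each 'Prim Index' group and so supplies the future column labels. A short sketch on a shortened copy of the sample data shows the intermediate values:

```python
import pandas as pd

# Shortened copy of the question's data
df = pd.DataFrame({
    "Prim Index": [1, 1, 1, 2, 2, 3],
    "Name": ["Marcus", "Tiffany", "Royce", "Charlotte", "Sara", "Keith"],
})

# Position of each row within its group, starting at 1
counter = df.groupby("Prim Index").cumcount().add(1)
print(counter.tolist())  # [1, 2, 3, 1, 2, 1]

# The counter becomes the second index level, which unstack turns into columns
wide = (df.set_index(["Prim Index", counter])["Name"]
          .unstack(fill_value="")
          .add_prefix("Name"))
print(wide)
```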
Using pivot_table, you can get a solution similar to the one jezreal posted.
c = ['Prim Index','Name']
d = [[1,'Marcus'],[1,'Tiffany'],[1,'Royce'],
[2,'Charlotte'],[2,'Sara'],
[3,'Keith'],
[4,'Jennifer'],
[5,'Justin'],
[5,'Diana'],
[6,'Liz']]
import pandas as pd
df = pd.DataFrame(data = d,columns=c)
print (df)
df=(pd.pivot_table(df,index='Prim Index',
columns=df.groupby('Prim Index').cumcount().add(1),values='Name',aggfunc='sum',fill_value='')
.add_prefix('Name'))
df = df.reset_index()
print (df)
output of this will be:
Prim Index Name1 Name2 Name3
0 1 Marcus Tiffany Royce
1 2 Charlotte Sara
2 3 Keith
3 4 Jennifer
4 5 Justin Diana
5 6 Liz

Groupby sum of two column and create new dataframe in pandas

I have a dataframe as shown below
Player Goal Freekick
Messi 2 5
Ronaldo 1 4
Messi 1 4
Messi 0 5
Ronaldo 0 9
Ronaldo 1 8
Xavi 1 1
Xavi 0 7
From the above, I would like to group by Player and sum Goal and Freekick, as shown below.
Expected Output:
Player toatal_goals total_freekicks
Messi 3 14
Ronaldo 2 21
Xavi 1 8
I tried the code below:
df1 = df.groupby(['Player'])['Goal'].sum().reset_index().rename({'Goal':'toatal_goals'})
df1['total_freekicks'] = df.groupby(['Player'])['Freekick'].sum()
But the code above does not work. Please help me.
First aggregate sum by Player, then DataFrame.add_prefix and convert columns names to lowercase:
df = df.groupby('Player').sum().add_prefix('total_').rename(columns=str.lower)
print (df)
total_goal total_freekick
Player
Messi 3 14
Ronaldo 2 21
Xavi 1 8
You can use named aggregation to create the aggregations with customized column names.
(
df.groupby(by='Player')
.agg(toatal_goals=('Goal', 'sum'),
total_freekicks=('Freekick', 'sum'))
.reset_index()
)
Player toatal_goals total_freekicks
Messi 3 14
Ronaldo 2 21
Xavi 1 8
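For reference, the attempt in the question fails for two reasons: `rename` without `columns=` targets the row index, so 'Goal' is never renamed, and assigning a Series indexed by Player to a frame with a 0..n index aligns on labels and yields NaN. A sketch of a corrected version of that same two-step approach, where the use of `map` is my choice rather than something from the answers:

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["Messi", "Ronaldo", "Messi", "Messi",
               "Ronaldo", "Ronaldo", "Xavi", "Xavi"],
    "Goal": [2, 1, 1, 0, 0, 1, 1, 0],
    "Freekick": [5, 4, 4, 5, 9, 8, 1, 7],
})

# columns= makes rename act on the column labels instead of the index
df1 = (df.groupby("Player")["Goal"].sum()
         .reset_index()
         .rename(columns={"Goal": "toatal_goals"}))

# Look up each player's total by name instead of relying on index alignment
freekicks = df.groupby("Player")["Freekick"].sum()
df1["total_freekicks"] = df1["Player"].map(freekicks)
print(df1)
```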