I have data that looks like this.
A DataFrame with city names, their latitude and longitude:
import pandas as pd
city = {'Name': ['San Franciso', 'Paris', 'Tokyo', 'London', 'Barcelona'], 'Latitude': [50.69460297, 43.64984221, 60.5331547, 62.5331547, 63.5331547],'Longtitude': [41.43147227, 49.78045496691, 122.23536080538, 19.78045496691, 29.78045496691]}
city_df = pd.DataFrame(city)
A list of 5 DataFrames that look like this:
list1= [[1,"kids",0.00094], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00088], [6,"adult",0.00113], [7,"elderly",0.00105]]
l1 = pd.DataFrame(list1)
list2= [[1,"kids",0.00044], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00088], [6,"adult",0.00113], [7,"elderly",0.00105]]
l2 = pd.DataFrame(list2)
list3= [[1,"kids",0.00394], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00588], [6,"adult",0.00113], [7,"elderly",0.00105]]
l3 = pd.DataFrame(list3)
list4= [[1,"kids",0.00074], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00088], [6,"adult",0.00113], [7,"elderly",0.00105]]
l4 = pd.DataFrame(list4)
list5= [[1,"kids",0.00095], [2,"adult",0.0012], [3,"elderly",0.00114],[5,"kids",0.00043], [6,"adult",0.00113], [7,"elderly",0.00105]]
l5 = pd.DataFrame(list5)
l = [l1, l2, l3, l4, l5]
I want to create a plot like the following:
For each city, a boxplot with the values for a particular group, with the cities on the y-axis sorted by latitude.
I tried to make this work with pd.concat and pd.melt (from: Plotting multiple boxplots in seaborn?).
It is a challenge for me. Thank you for your time.
You can concat with the city names as MultiIndex, and use seaborn.catplot to plot:
df = pd.concat(dict(zip(city_df['Name'], l)), names=['city']).reset_index(level=0)
import seaborn as sns
sns.catplot(data=df, row=1, x='city', y=2, kind='box', sharey=False)
output:
city 0 1 2
0 San Franciso 1 kids 0.00094
1 San Franciso 2 adult 0.00120
2 San Franciso 3 elderly 0.00114
3 San Franciso 5 kids 0.00088
4 San Franciso 6 adult 0.00113
5 San Franciso 7 elderly 0.00105
0 Paris 1 kids 0.00044
1 Paris 2 adult 0.00120
2 Paris 3 elderly 0.00114
3 Paris 5 kids 0.00088
4 Paris 6 adult 0.00113
5 Paris 7 elderly 0.00105
0 Tokyo 1 kids 0.00394
1 Tokyo 2 adult 0.00120
2 Tokyo 3 elderly 0.00114
3 Tokyo 5 kids 0.00588
4 Tokyo 6 adult 0.00113
5 Tokyo 7 elderly 0.00105
0 London 1 kids 0.00074
1 London 2 adult 0.00120
2 London 3 elderly 0.00114
3 London 5 kids 0.00088
4 London 6 adult 0.00113
5 London 7 elderly 0.00105
0 Barcelona 1 kids 0.00095
1 Barcelona 2 adult 0.00120
2 Barcelona 3 elderly 0.00114
3 Barcelona 5 kids 0.00043
4 Barcelona 6 adult 0.00113
5 Barcelona 7 elderly 0.00105
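The plot call above does not yet order the cities by latitude. A minimal sketch of one way to get that order, assuming shortened stand-in frames and the hypothetical column names 'id', 'group' and 'value' (the original frames use the default 0, 1, 2):

```python
import pandas as pd

city_df = pd.DataFrame({
    'Name': ['San Franciso', 'Paris', 'Tokyo', 'London', 'Barcelona'],
    'Latitude': [50.69460297, 43.64984221, 60.5331547, 62.5331547, 63.5331547],
})

# Shortened stand-ins for l1..l5: one small frame per city (illustrative values).
kids = [0.00094, 0.00044, 0.00394, 0.00074, 0.00095]
frames = [
    pd.DataFrame([[1, 'kids', k], [2, 'adult', 0.0012], [3, 'elderly', 0.00114]],
                 columns=['id', 'group', 'value'])
    for k in kids
]

# Same concat as in the answer: the dict keys become a 'city' index level.
df = pd.concat(dict(zip(city_df['Name'], frames)), names=['city']).reset_index(level=0)

# City names ordered by latitude (ascending).
order = city_df.sort_values('Latitude')['Name'].tolist()

# Store the order in the column itself as an ordered categorical.
df['city'] = pd.Categorical(df['city'], categories=order, ordered=True)

# The list can also be passed explicitly to seaborn, e.g.:
# sns.catplot(data=df, row='group', x='city', y='value', kind='box',
#             order=order, sharey=False)
```

To put the cities on the y-axis instead, the x and y arguments of the plot call can be swapped.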
I have a dataframe:
country rating owner
0 England a John Smith
1 England b John Smith
2 France a Frank Foo
3 France a Frank foo
4 France a Frank Foo
5 France b Frank Foo
I'd like to produce a count of owners after grouping by country and rating, while
ignoring case
ignoring any spaces (leading, trailing or in between)
I am expecting:
country rating owner count
0 England a John Smith 1
1 England b John Smith 1
2 France a Frank Foo 3
3 France b Frank Foo 1
I have tried:
df.group_by(['rating','owner'])['owner'].count()
and
df.group_by(['rating','owner'].str.lower())['owner'].count()
Use title and replace to rework the string and groupby.size to aggregate:
out = (df.groupby(['country', 'rating',
df['owner'].str.title().str.replace(r'\s+', ' ', regex=True)])
.size().reset_index(name='count')
)
Output:
country rating owner count
0 England a John Smith 1
1 England b John Smith 1
2 France a Frank Foo 3
3 France b Frank Foo 1
Use Series.str.strip and Series.str.title, collapse repeated spaces with Series.str.replace, and aggregate with GroupBy.size:
DataFrameGroupBy.count counts only non-missing values, which does not seem necessary here.
df1 = (df.groupby(['country','rating',
df['owner'].str.strip().str.title().str.replace(r'\s+',' ', regex=True)])
.size()
.reset_index(name='count'))
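For reference, here is a self-contained sketch of the same approach. The sample frame is a stand-in based on the question, with extra spaces added to one owner (an assumption, to show the whitespace handling):

```python
import pandas as pd

# Stand-in data; row 4 has a doubled inner space and a trailing space.
df = pd.DataFrame({
    'country': ['England', 'England', 'France', 'France', 'France', 'France'],
    'rating': ['a', 'b', 'a', 'a', 'a', 'b'],
    'owner': ['John Smith', 'John Smith', 'Frank Foo', 'Frank foo',
              'Frank  Foo ', 'Frank Foo'],
})

# Normalize the owner: trim the ends, unify case, collapse runs of whitespace.
owner = (df['owner'].str.strip()
                    .str.title()
                    .str.replace(r'\s+', ' ', regex=True))

# Group by the two columns plus the normalized Series and count rows per group.
out = (df.groupby(['country', 'rating', owner])
         .size()
         .reset_index(name='count'))
print(out)
```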
Trying to learn pandas using English football scores.
Here is part of a list of football matches in date order.
"FTR" is the Full Time Result: "A" - win for the away team, "H" - win for the home team, "D"- a draw.
I created columns "HTWTD" - home team wins to date, and "ATWTD" - away team wins to date, to hold the number of wins the home and away teams have had up until that point. I populated the columns with 0s then put a 1 in the HTWTD when the FTR was H, and a 1 in the ATWTD where the FTR was A. This obviously only produces correct data for the first time each team plays.
When we get to row 9, Leeds wins a match having already won one in row 2. The HTWTD in row 9 should read 2, i.e. at this point Leeds has won 2 games.
To my untrained mind the process should be...
Look at the row above, if Leeds features, get the corresponding HTWTD or ATWTD score, add 1 to it and put it in the current row HTWTD or ATWTD column. If Leeds doesn't feature (and you are not at the first row) go up one row.
Having googled around, I haven't found anything about how to select only the rows above the current row and then alter an entry in the current row depending on a test on the selected rows.
I could probably write a little python function to do this, but is there a pandas way to go about it?
Row  Date        HomeTeam          AwayTeam     FTR  HTWTD  ATWTD
0    12/09/2020  Fulham            Arsenal      A    0      1
1    12/09/2020  Crystal Palace    Southampton  H    1      0
2    12/09/2020  Liverpool         Leeds        H    0      1
3    12/09/2020  West Ham          Newcastle    A    0      1
4    13/09/2020  West Brom         Leicester    A    0      1
5    13/09/2020  Tottenham         Everton      A    0      1
6    14/09/2020  Brighton          Chelsea      A    0      1
7    14/09/2020  Sheffield United  Wolves       A    0      1
8    19/09/2020  Everton           West Brom    H    1      0
9    19/09/2020  Leeds             Fulham       H    1      0
IIUC, you can use .eq() to return a boolean series of True or False for the condition and then use .cumsum() to cumulatively get the sum of the True values per HomeTeam and AwayTeam group result with a .groupby:
df['home_wins'] = df['FTR'].eq('H')
df['away_wins'] = df['FTR'].eq('A')
df['HTWTD'] = df.groupby('HomeTeam')['home_wins'].cumsum()
df['ATWTD'] = df.groupby('AwayTeam')['away_wins'].cumsum()
df.drop(['home_wins', 'away_wins'], axis=1)
Out[1]:
Row Date HomeTeam AwayTeam FTR HTWTD ATWTD
0 0 12/09/2020 Fulham Arsenal A 0 1
1 1 12/09/2020 Crystal Palace Southampton H 1 0
2 2 12/09/2020 Liverpool Leeds H 1 0
3 3 12/09/2020 West Ham Newcastle A 0 1
4 4 13/09/2020 West Brom Leicester A 0 1
5 5 13/09/2020 Tottenham Everton A 0 1
6 6 14/09/2020 Brighton Chelsea A 0 1
7 7 14/09/2020 Sheffield United Wolves A 0 1
8 8 19/09/2020 Everton West Brom H 1 0
9 9 19/09/2020 Leeds Fulham H 1 0
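Note that the code above counts home wins per home team and away wins per away team separately. If the goal is each team's total wins to date regardless of venue, one possible variation (not part of the answer above) is to stack home and away appearances into one long frame first. A rough sketch on made-up rows; the names 'match', 'side', 'won', long_df and wide are only illustrative:

```python
import pandas as pd

# Made-up sample: Leeds wins away in match 1 and at home in match 3.
df = pd.DataFrame({
    'HomeTeam': ['Fulham', 'Liverpool', 'Everton', 'Leeds'],
    'AwayTeam': ['Arsenal', 'Leeds', 'West Brom', 'Fulham'],
    'FTR': ['A', 'A', 'H', 'H'],
})

# One row per team per match, flagged with whether that team won.
home = pd.DataFrame({'match': df.index, 'team': df['HomeTeam'],
                     'won': df['FTR'].eq('H').astype(int)})
away = pd.DataFrame({'match': df.index, 'team': df['AwayTeam'],
                     'won': df['FTR'].eq('A').astype(int)})
long_df = (pd.concat([home, away], keys=['home', 'away'], names=['side'])
             .reset_index(level=0)
             .sort_values('match', kind='stable'))

# Running total of wins per team, in match order, across both venues.
long_df['wins_to_date'] = long_df.groupby('team')['won'].cumsum()

# Back to one row per match: a column for the home side and one for the away side.
wide = long_df.pivot(index='match', columns='side', values='wins_to_date')
df['HTWTD'] = wide['home']
df['ATWTD'] = wide['away']
print(df)
```

This assumes the rows are already in date order, as in the question.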
I want to use a list of names, "headers", to create a new column in my dataframe. In the initial table, the name of each division is positioned above the results for each team in that division. I want to add that header to each row entry for each division to make the data more identifiable, like this. I have the headers stored in the "headers" object in my code. How can I repeat each division header for the number of rows that appear in that division and append it to the dataset?
Edit: here is another snippet of what I want to get from the end product.
df3 = df.iloc[0:6]
df3.insert(0, 'Divisions', ['na','L5 Junior', 'L5 Junior', 'na',
'L5 Senior - Medium', 'L5 Senior - Medium'])
df3
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
# Import HTML
scr = 'https://tv.varsity.com/results/7361971-2022-spirit-unlimited-battle-at-the-boardwalk-atlantic-city-grand-ntls/31220'
scr1 = requests.get(scr)
soup = BeautifulSoup(scr1.text, "html.parser")
# List of names to append
sp3 = soup.find(class_="full-content").find_all("h2")
headers = [elt.text for elt in sp3]
table_MN = pd.read_html(scr)
# Extract text and header from division info
div = pd.DataFrame(headers)
div.columns = ["division"]
df = pd.concat(table_MN, ignore_index=True)
df.columns = df.iloc[0]
df
It is still not clear what output you are looking for. However, may I suggest the following, which promotes the first row of each table in table_MN to column headers and then concatenates the results. If it is going in the right direction, please let me know, and indicate what else you want to extract from the resulting table:
tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
pd.concat(tmn_1, axis=0, ignore_index = True)
output:
Rank Program Name Team Name Raw Score Deductions Performance Score Event Score
-- ------ --------------------------- ----------------- ----------- ------------ ------------------- -------------
0 1 Rockstar Cheer New Jersey City Girls 47.8667 0 95.7333 95.6833
1 2 Cheer Factor Xtraordinary 46.6667 0.15 93.1833 92.8541
2 1 Rockstar Cheer New Jersey City Girls 47.7667 0 95.5333 23.8833
3 2 Cheer Factor Xtraordinary 46.0333 0.2 91.8667 22.9667
4 1 Star Athletics Roar 47.5333 0.9 94.1667 93.9959
5 1 Prime Time All Stars Lady Onyx 43.9 1.35 86.45 86.6958
6 1 Prime Time All Stars Lady Onyx 44.1667 0.9 87.4333 21.8583
7 1 Just Cheer All Stars Jag 5 46.4333 0.15 92.7167 92.2875
8 1 Just Cheer All Stars Jag 5 45.8 0.6 91 22.75
9 1 Quest Athletics Black Ops 47.4333 0.45 94.4167 93.725
10 1 Quest Athletics Black Ops 46.5 1.35 91.65 22.9125
11 1 The Stingray Allstars X-Rays 45.3 0.95 89.65 88.4375
12 1 Vortex Allstars Lady Rays 45.7 0.5 90.9 91.1083
13 1 Vortex Allstars Lady Rays 45.8667 0 91.7333 22.9333
14 1 Upper Merion All Stars Citrus 46.4333 0 92.8667 92.7
15 2 Cheer Factor JUNIOR X 45.9 1.1 90.7 90.6542
16 3 NJ Premier All Stars Prodigy 44.6333 0.05 89.2167 89.8292
17 1 Upper Merion All Stars Citrus 46.1 0 92.2 23.05
18 2 NJ Premier All Stars Prodigy 45.8333 0 91.6667 22.9167
19 3 Cheer Factor JUNIOR X 45.7333 0.95 90.5167 22.6292
20 1 Virginia Royalty Athletics Dynasty 46.5 0 93 92.9
21 1 Virginia Royalty Athletics Dynasty 46.3 0 92.6 23.15
22 1 South Jersey Storm Lady Reign 47.2333 0 94.4667 93.4875
...
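If each entry in headers lines up with exactly one table in table_MN (an assumption; the page may list several tables per division, in which case the pairing has to be adjusted), the division name can be attached to every row before concatenating. A sketch with made-up stand-in tables in place of the scraped ones:

```python
import pandas as pd

# Stand-ins for the scraped objects; real values would come from the page.
headers = ['L5 Junior', 'L5 Senior - Medium']
tables = [
    pd.DataFrame([['Rank', 'Team Name'], ['1', 'Team A'], ['2', 'Team B']]),
    pd.DataFrame([['Rank', 'Team Name'], ['1', 'Team C']]),
]

labelled = []
for name, tbl in zip(headers, tables):
    body = tbl.iloc[1:].copy()           # drop the row that holds the column names
    body.columns = tbl.iloc[0].tolist()  # promote that row to the header
    body.insert(0, 'Divisions', name)    # scalar is repeated for every row
    labelled.append(body)

result = pd.concat(labelled, ignore_index=True)
print(result)
```

Because the scalar passed to insert is broadcast to the length of each table, there is no need to multiply the header by the row count manually.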
The starting point is this kind of dataframe.
df = pd.DataFrame({'author': ['Jack', 'Steve', 'Greg', 'Jack', 'Steve', 'Greg', 'Greg'], 'country':['USA', None, None, 'USA', 'Germany', 'France', 'France'], 'c':np.random.randn(7), 'd':np.random.randn(7)})
author country c d
0 Jack USA -2.594532 2.027425
1 Steve None -1.104079 -0.852182
2 Greg None -2.356956 -0.450821
3 Jack USA -0.910153 -0.734682
4 Steve Germany 1.025113 0.441512
5 Greg France 0.218085 1.369443
6 Greg France 0.254485 0.322768
The desired output is one or more columns with the countries of each author.
0 [USA]
1 [Germany]
2 [France]
3 [USA]
4 [Germany]
5 [France]
6 [France]
It does not have to be a list, but my closest solution so far gives a list as output.
It could be separate columns.
df.groupby('author')['country'].transform('unique')
0 [USA]
1 [None, Germany]
2 [None, France]
3 [USA]
4 [None, Germany]
5 [None, France]
6 [None, France]
Is there an easy way to remove None from this?
You can remove missing values with Series.dropna, call SeriesGroupBy.unique, and create a new column with Series.map:
df['new'] = df['author'].map(df['country'].dropna().groupby(df['author']).unique())
print (df)
author country c d new
0 Jack USA 0.453358 -1.983282 [USA]
1 Steve None 0.011792 0.383322 [Germany]
2 Greg None -1.551810 0.308982 [France]
3 Jack USA 1.646301 0.040245 [USA]
4 Steve Germany -0.211451 0.841131 [Germany]
5 Greg France 1.049269 -0.813806 [France]
6 Greg France -1.244549 1.009006 [France]
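Since the question says the result does not have to be a list: if each author has at most one distinct country, a plain string column may be enough. A small sketch of that variation, relying on GroupBy.first skipping missing values (the column name 'new' is reused from above; the 'c' and 'd' columns are left out for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'author': ['Jack', 'Steve', 'Greg', 'Jack', 'Steve', 'Greg', 'Greg'],
    'country': ['USA', None, None, 'USA', 'Germany', 'France', 'France'],
})

# 'first' returns the first non-missing value per group, broadcast to every row.
df['new'] = df.groupby('author')['country'].transform('first')
print(df)
```

If an author can have several countries, this keeps only the first one, so the list-based version above is the safer choice in that case.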
I have a dataframe like this. It has regular fields up to "state", followed by trailers (each group of 3 columns, tr1*, represents one trailer). I want to convert those trailers to rows. I tried the melt function, but I could only use it on one trailer column. The example below shows what I mean.
Name number city state tr1num tr1acct tr1ct tr2num tr2acct tr2ct tr3num tr3acct tr3ct
DJ 10 Edison nj 1001 20345 Dew 1002 20346 Newca. 1003. 20347. pen
ND 20 Newark DE 2001 1985 flor 2002 1986 rodge
I am expecting the output like this.
Name number city state trnum tracct trct
DJ 10 Edison nj 1001 20345 Dew
DJ 10 Edison nj 1002 20346 Newca
DJ 10 Edison nj 1003 20347 pen
ND 20 Newark DE 2001 1985 flor
ND 20 Newark DE 2002 1986 rodge
You need to look at using pd.wide_to_long. However, you will need to do some column renaming first.
df = df.set_index(['Name','number','city','state'])
df.columns = df.columns.str.replace(r'(\D+)(\d+)(\D+)', r'\1\3_\2', regex=True)
df = df.reset_index()
pd.wide_to_long(df, ['trnum','trct','tracct'],
['Name','number','city','state'], 'Code',sep='_',suffix=r'\d+')\
.reset_index()\
.drop('Code',axis=1)
Output:
Name number city state trnum trct tracct
0 DJ 10 Edison nj 1001.0 Dew 20345.0
1 DJ 10 Edison nj 1002.0 Newca. 20346.0
2 DJ 10 Edison nj 1003.0 pen 20347.0
3 ND 20 Newark DE 2001.0 flor 1985.0
4 ND 20 Newark DE 2002.0 rodge 1986.0
5 ND 20 Newark DE NaN NaN NaN
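The last row in that output is the empty third trailer for ND. A self-contained sketch of the same idea that also drops such all-empty trailers; the sample frame is rebuilt from the question, and the helper names ids, long_df and 'trailer' are only illustrative:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Name': ['DJ', 'ND'], 'number': [10, 20],
    'city': ['Edison', 'Newark'], 'state': ['nj', 'DE'],
    'tr1num': [1001, 2001], 'tr1acct': [20345, 1985], 'tr1ct': ['Dew', 'flor'],
    'tr2num': [1002, 2002], 'tr2acct': [20346, 1986], 'tr2ct': ['Newca', 'rodge'],
    'tr3num': [1003, None], 'tr3acct': [20347, None], 'tr3ct': ['pen', None],
})
ids = ['Name', 'number', 'city', 'state']

# tr1num -> trnum_1, tr2acct -> tracct_2, ...; other column names are untouched.
df = df.rename(columns=lambda c: re.sub(r'^(tr)(\d+)(\w+)$', r'\1\3_\2', c))

long_df = (pd.wide_to_long(df, stubnames=['trnum', 'tracct', 'trct'],
                           i=ids, j='trailer', sep='_')
             .dropna(how='all')                # drop trailers with no data at all
             .reset_index()
             .sort_values(['Name', 'trailer'])
             .drop(columns='trailer')
             .reset_index(drop=True))
print(long_df)
```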
You could achieve this by renaming your columns a bit and applying the pandas wide_to_long method. Below is the code, which produces the output shown after it.
df = pd.DataFrame({"Name":["DJ", "ND"], "number":[10,20], "city":["Edison", "Newark"], "state":["nj","DE"],
"trnum_1":[1001,2001], "tracct_1":[20345,1985], "trct_1":["Dew", "flor"], "trnum_2":[1002,2002],
"trct_2":["Newca", "rodge"], "trnum_3":[1003,None], "tracct_3":[20347,None], "trct_3":["pen", None]})
pd.wide_to_long(df, stubnames=['trnum', 'tracct', 'trct'], i='Name', j='dropme', sep='_').reset_index().drop('dropme', axis=1)\
.sort_values('trnum')
outputs
Name state city number trnum tracct trct
0 DJ nj Edison 10 1001.0 20345.0 Dew
1 DJ nj Edison 10 1002.0 NaN Newca
2 DJ nj Edison 10 1003.0 20347.0 pen
3 ND DE Newark 20 2001.0 1985.0 flor
4 ND DE Newark 20 2002.0 NaN rodge
5 ND DE Newark 20 NaN NaN None
Another option:
df = pd.DataFrame({'col1': [1,2,3], 'col2':[3,4,5], 'col3':[5,6,7], 'tr1':[0,9,8], 'tr2':[0,9,8]})
The df:
col1 col2 col3 tr1 tr2
0 1 3 5 0 0
1 2 4 6 9 9
2 3 5 7 8 8
Subset to create two DataFrames, then concatenate them:
tr1_df = df[['col1', 'col2', 'col3', 'tr1']].rename(index=str, columns={"tr1":"tr"})
tr2_df = df[['col1', 'col2', 'col3', 'tr2']].rename(index=str, columns={"tr2":"tr"})
res = pd.concat([tr1_df, tr2_df])
result:
col1 col2 col3 tr
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
One option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
.pivot_longer(
index=slice("Name", "state"),
names_to=(".value", ".value"),
names_pattern=r"(.+)\d(.+)",
sort_by_appearance=True)
.dropna()
)
Name number city state trnum tracct trct
0 DJ 10 Edison nj 1001.0 20345.0 Dew
1 DJ 10 Edison nj 1002.0 20346.0 Newca.
2 DJ 10 Edison nj 1003.0 20347.0 pen
3 ND 20 Newark DE 2001.0 1985.0 flor
4 ND 20 Newark DE 2002.0 1986.0 rodge
The .value keeps the part of the column associated with it as header, and since we have multiple .value, they are combined into a single word. The .value is determined by the groups in the names_pattern, which is a regular expression.
Note that currently the multiple .value option is available in dev.