Concatenate labels to an existing dataframe - pandas

I want to use a list of names "headers" to create a new column in my dataframe. In the initial table, the name of each division is positioned above the results for each team in that division. I want to add that header to each row entry for its division to make the data more identifiable, like this. I have the headers stored in the "headers" object in my code. How can I repeat each division header once for each row in that division and append it to the dataset?
Edit: here is another snippet of what I want to get from the end product.
df3 = df.iloc[0:6]
df3.insert(0, 'Divisions', ['na', 'L5 Junior', 'L5 Junior', 'na',
                            'L5 Senior - Medium', 'L5 Senior - Medium'])
df3
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

# Import HTML
scr = 'https://tv.varsity.com/results/7361971-2022-spirit-unlimited-battle-at-the-boardwalk-atlantic-city-grand-ntls/31220'
scr1 = requests.get(scr)
soup = BeautifulSoup(scr1.text, "html.parser")

# List of names to append
sp3 = soup.find(class_="full-content").find_all("h2")
headers = [elt.text for elt in sp3]
table_MN = pd.read_html(scr)

# Extract text and header from division info
div = pd.DataFrame(headers)
div.columns = ["division"]
df = pd.concat(table_MN, ignore_index=True)
df.columns = df.iloc[0]
df

It is still not clear what output you are looking for. However, may I suggest the following, which selects the common headers from the tables in table_MN and then concatenates the results. If it is going in the right direction, please let me know, and indicate what else you want to extract from the resulting table:
# promote each table's first row to its header, then stack all the tables
tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
pd.concat(tmn_1, axis=0, ignore_index=True)
output:
Rank Program Name Team Name Raw Score Deductions Performance Score Event Score
-- ------ --------------------------- ----------------- ----------- ------------ ------------------- -------------
0 1 Rockstar Cheer New Jersey City Girls 47.8667 0 95.7333 95.6833
1 2 Cheer Factor Xtraordinary 46.6667 0.15 93.1833 92.8541
2 1 Rockstar Cheer New Jersey City Girls 47.7667 0 95.5333 23.8833
3 2 Cheer Factor Xtraordinary 46.0333 0.2 91.8667 22.9667
4 1 Star Athletics Roar 47.5333 0.9 94.1667 93.9959
5 1 Prime Time All Stars Lady Onyx 43.9 1.35 86.45 86.6958
6 1 Prime Time All Stars Lady Onyx 44.1667 0.9 87.4333 21.8583
7 1 Just Cheer All Stars Jag 5 46.4333 0.15 92.7167 92.2875
8 1 Just Cheer All Stars Jag 5 45.8 0.6 91 22.75
9 1 Quest Athletics Black Ops 47.4333 0.45 94.4167 93.725
10 1 Quest Athletics Black Ops 46.5 1.35 91.65 22.9125
11 1 The Stingray Allstars X-Rays 45.3 0.95 89.65 88.4375
12 1 Vortex Allstars Lady Rays 45.7 0.5 90.9 91.1083
13 1 Vortex Allstars Lady Rays 45.8667 0 91.7333 22.9333
14 1 Upper Merion All Stars Citrus 46.4333 0 92.8667 92.7
15 2 Cheer Factor JUNIOR X 45.9 1.1 90.7 90.6542
16 3 NJ Premier All Stars Prodigy 44.6333 0.05 89.2167 89.8292
17 1 Upper Merion All Stars Citrus 46.1 0 92.2 23.05
18 2 NJ Premier All Stars Prodigy 45.8333 0 91.6667 22.9167
19 3 Cheer Factor JUNIOR X 45.7333 0.95 90.5167 22.6292
20 1 Virginia Royalty Athletics Dynasty 46.5 0 93 92.9
21 1 Virginia Royalty Athletics Dynasty 46.3 0 92.6 23.15
22 1 South Jersey Storm Lady Reign 47.2333 0 94.4667 93.4875
...
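If the goal is to label every row with its division, a hedged sketch (assuming each h2 in headers lines up positionally with one table in table_MN, which is how this results page is laid out) repeats each header once per row before prepending it as a column:

import numpy as np
import pandas as pd

# promote each table's first row to its header, as above
tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
out = pd.concat(tmn_1, axis=0, ignore_index=True)
# repeat each division name once per row of its own table (assumed 1:1 pairing)
out.insert(0, "Divisions", np.repeat(headers, [len(t) for t in tmn_1]))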

Related

How to use Pandas to find relative share of rooms in individual floors [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 12 days ago.
I have a dataframe df with the following structure
Floor Room Area
0 1 Living room 25
1 1 Kitchen 20
2 1 Bedroom 15
3 2 Bathroom 21
4 2 Bedroom 14
and I want to add a series floor_share with each room's share of its floor's total area, so that the dataframe becomes
Floor Room Area floor_share
0 1 Living room 18 0.30
1 1 Kitchen 18 0.30
2 1 Bedroom 24 0.40
3 2 Bathroom 10 0.33
4 2 Bedroom 20 0.67
If it is possible to do this with a one-liner (or any other idiomatic manner), I'll be very happy to learn how.
Current workaround
What I have done that produces the correct results is to first find the total floor areas by
floor_area_sums = df.groupby('Floor')['Area'].sum()
which gives
Floor
1 60
2 35
Name: Area, dtype: int64
I then initialize a new series to 0, and find the correct values while iterating through the dataframe rows.
df["floor_share"] = 0
for idx, row in df.iterrows():
df.loc[idx, 'floor_share'] = df.loc[idx, 'Area']/floor_area_sums[row.Floor]
IIUC use:
df["floor_share"] = df['Area'].div(df.groupby('Floor')['Area'].transform('sum'))
print (df)
Floor Room Area floor_share
0 1 Living room 18 0.300000
1 1 Kitchen 18 0.300000
2 1 Bedroom 24 0.400000
3 2 Bathroom 10 0.333333
4 2 Bedroom 20 0.666667
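For reference, an equivalent spelling normalizes inside the transform itself; a minimal sketch on the same df:

# divide each Area by the sum of its own group
df["floor_share"] = df.groupby('Floor')['Area'].transform(lambda x: x / x.sum())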

pandas duplicate values: Result visual inspection not duplicates

Hello, thanks in advance for all answers; I really appreciate the community's help.
Here is my dataframe, from a CSV containing scraped data from car classified ads:
Unnamed: 0 NameYear \
0 0 BMW 7 серия, 2007
1 1 BMW X3, 2021
2 2 BMW 2 серия Gran Coupe, 2021
3 3 BMW X5, 2021
4 4 BMW X1, 2021
Price \
0 520 000 ₽
1 от 4 810 000 ₽\n4 960 000 ₽ без скидки
2 2 560 000 ₽
3 от 9 259 800 ₽\n9 974 800 ₽ без скидки
4 от 3 130 000 ₽\n3 220 000 ₽ без скидки
CarParams \
0 187 000 км, AT (445 л.с.), седан, задний, бензин
1 2.0 AT (190 л.с.), внедорожник, полный, дизель
2 1.5 AMT (140 л.с.), седан, передний, бензин
3 3.0 AT (400 л.с.), внедорожник, полный, дизель
4 2.0 AT (192 л.с.), внедорожник, полный, бензин
url
0 https://www.avito.ru/moskva/avtomobili/bmw_7_s...
1 https://www.avito.ru/moskva/avtomobili/bmw_x3_...
2 https://www.avito.ru/moskva/avtomobili/bmw_2_s...
3 https://www.avito.ru/moskva/avtomobili/bmw_x5_...
4 https://www.avito.ru/moskva/avtomobili/bmw_x1_...
THE TASK - I want to know if there are duplicate rows, i.e. if the SAME car advertisement appears twice. The most reliable key is probably url because it should be unique; CarParams or NameYear can repeat, so I will check nunique and duplicated on the url column.
Screenshot to visually inspect the result of duplicated:
THE ISSUE: Visual inspection (sorry for the unprofessional jargon) shows these urls are not the SAME, but I wanted to catch exactly identical urls to check for repeated data. I tried setting keep=False as well.
Try:
df.duplicated(subset=["url"], keep=False)
df.duplicated() gives you a pd.Series with boolean values.
Here is an example that you could probably use:
from random import randint
import pandas as pd

urls = ['http://www.google.com',
        'http://www.stackoverfow.com',
        'http://bla.xy', 'http://bla.com']
d = []
for i, url in enumerate(urls):
    # append each url a random number of times to create potential duplicates
    for j in range(0, randint(1, i + 1)):
        d.append(dict(customer=str(randint(1, 100)), url=url))
df = pd.DataFrame(d)
df['dups'] = df['url'].duplicated(keep=False)
print(df)
resulting in the following df:
customer url dups
0 89 http://www.google.com False
1 43 http://www.stackoverfow.com False
2 36 http://bla.xy True
3 86 http://bla.xy True
4 32 http://bla.com False
The column dups shows you which urls exist more than once; in my example data it is only the url http://bla.xy.
The important thing is to check what the parameter keep does:
keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to mark.
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
In my case I used False to get all duplicated values.
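To inspect the repeated rows themselves rather than the boolean mask, filter the frame with the mask; a small sketch reusing the df above:

mask = df.duplicated(subset=["url"], keep=False)  # True for every occurrence of a repeated url
print(df[mask].sort_values("url"))                # sort so the repeats sit next to each other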

Converting categorical column into a single dummy variable column

Consider I have the following dataframe:
Survived Pclass Sex Age Fare
0 0 3 male 22.0 7.2500
1 1 1 female 38.0 71.2833
2 1 3 female 26.0 7.9250
3 1 1 female 35.0 53.1000
4 0 3 male 35.0 8.0500
I used the get_dummies() function to create dummy variables. The code and output are as follows:
one_hot = pd.get_dummies(dataset, columns = ['Sex'])
This will return:
Survived Pclass Age Fare Sex_female Sex_male
0 0 3 22 7.2500 0 1
1 1 1 38 71.2833 1 0
2 1 3 26 7.9250 1 0
3 1 1 35 53.1000 1 0
4 0 3 35 8.0500 0 1
What I would like to have is a single column for Sex having the values 0 or 1 instead of 2 columns.
Interestingly, when I used get_dummies() on a different dataframe, it worked just like I wanted.
For the following dataframe:
Category Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
With the code:
one_hot = pd.get_dummies(dataset, columns = ['Category'])
It returns:
Message ... Category_spam
0 Go until jurong point, crazy.. Available only ... ... 0
1 Ok lar... Joking wif u oni... ... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... ... 1
3 U dun say so early hor... U c already then say... ... 0
4 Nah I don't think he goes to usf, he lives aro... ... 0
Why does get_dummies() work differently on these two dataframes?
How can I make sure I get the 2nd output every time?
Here are multiple ways you can do it. (Note that get_dummies() did not actually behave differently here: the second printout is just column-truncated, and the ... hides a Category_ham column, so both calls produced one dummy column per category.)
# using sklearn's LabelEncoder
# (note: LabelEncoder assigns codes alphabetically, i.e. female -> 0, male -> 1,
#  which is the reverse of the explicit mapping below)
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
df['Sex_encoded'] = lbl.fit_transform(df['Sex'])

# using only pandas
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})
Survived Pclass Sex Age Fare Sex_encoded
0 0 3 male 22.0 7.2500 0
1 1 1 female 38.0 71.2833 1
2 1 3 female 26.0 7.9250 1
3 1 1 female 35.0 53.1000 1
4 0 3 male 35.0 8.0500 0
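If you specifically want get_dummies() to emit a single 0/1 column for a two-category variable, its drop_first parameter does that; a sketch, assuming the Titanic-style frame from the question:

# drops the alphabetically first category (female), leaving a single
# Sex_male column: 1 for male, 0 for female
one_hot = pd.get_dummies(dataset, columns=['Sex'], drop_first=True)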

Calculate average of non numeric columns in pandas

I have a df "data" as below
Name Quality city
Tom High A
nick Medium B
krish Low A
Jack High A
Kevin High B
Phil Medium B
I want to group it by city, create new columns based on the column "Quality", and calculate the averages as below:
city High Medium Low High_Avg Medium_Avg Low_Avg
A 2 0 1 66.67 0 33.33
B 1 2 0 33.33 66.67 0
I tried with the below script and I know it is completely wrong.
data_average = data_df.groupby(['city'], as_index = False).count()
Get a count of the frequencies, divide the outcome by the sum across columns, and finally concatenate the dataframes into one:
result = pd.crosstab(df.city, df.Quality)
averages = result.div(result.sum(1).array, axis=0).mul(100).round(2).add_suffix("_Avg")
# combine the dataframes
pd.concat((result, averages), axis=1)
Quality High Low Medium High_Avg Low_Avg Medium_Avg
city
A 2 1 0 66.67 33.33 0.00
B 1 0 2 33.33 0.00 66.67
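As an aside, pd.crosstab can compute the row percentages directly through its normalize parameter, which saves the manual division; a minimal sketch on the same df:

# normalize='index' divides each row by its row sum
averages = pd.crosstab(df.city, df.Quality, normalize='index').mul(100).round(2).add_suffix('_Avg')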

python pandas melt on combined columns

I have a dataframe like this. I have regular fields up to "state"; after that come trailers (each trailer is 3 columns, e.g. tr1* represents trailer 1). I want to convert those trailers to rows. I tried the melt function but was able to use only one trailer column. Kindly look at the example below to see what I mean.
Name number city state tr1num tr1acct tr1ct tr2num tr2acct tr2ct tr3num tr3acct tr3ct
DJ 10 Edison nj 1001 20345 Dew 1002 20346 Newca. 1003. 20347. pen
ND 20 Newark DE 2001 1985 flor 2002 1986 rodge
I am expecting the output like this.
Name number city state trnum tracct trct
DJ 10 Edison nj 1001 20345 Dew
DJ 10 Edison nj 1002 20346 Newca
DJ 10 Edison nj 1003 20347 pen
ND 20 Newark DE 2001 1985 flor
ND 20 Newark DE 2002 1986 rodge
You need to look at using pd.wide_to_long. However, you will need to do some column renaming first.
df = df.set_index(['Name','number','city','state'])
df.columns = df.columns.str.replace(r'(\D+)(\d+)(\D+)', r'\1\3_\2', regex=True)
df = df.reset_index()

pd.wide_to_long(df, ['trnum','trct','tracct'],
                ['Name','number','city','state'], 'Code',
                sep='_', suffix=r'\d+')\
  .reset_index()\
  .drop('Code', axis=1)
Output:
Name number city state trnum trct tracct
0 DJ 10 Edison nj 1001.0 Dew 20345.0
1 DJ 10 Edison nj 1002.0 Newca. 20346.0
2 DJ 10 Edison nj 1003.0 pen 20347.0
3 ND 20 Newark DE 2001.0 flor 1985.0
4 ND 20 Newark DE 2002.0 rodge 1986.0
5 ND 20 Newark DE NaN NaN NaN
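If the padding row for records with fewer trailers (row 5 above) is unwanted, it can be dropped afterwards; a sketch, assuming the chained result is bound to a (hypothetical) name out:

# drop rows where every trailer field is missing
out = out.dropna(subset=['trnum', 'trct', 'tracct'], how='all')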
You could achieve this by renaming your columns a bit and applying the pandas wide_to_long method. Below is the code which produces your desired output.
df = pd.DataFrame({"Name":["DJ", "ND"], "number":[10,20], "city":["Edison", "Newark"], "state":["nj","DE"],
"trnum_1":[1001,2001], "tracct_1":[20345,1985], "trct_1":["Dew", "flor"], "trnum_2":[1002,2002],
"trct_2":["Newca", "rodge"], "trnum_3":[1003,None], "tracct_3":[20347,None], "trct_3":["pen", None]})
pd.wide_to_long(df, stubnames=['trnum', 'tracct', 'trct'], i='Name', j='dropme', sep='_').reset_index().drop('dropme', axis=1)\
.sort_values('trnum')
outputs
Name state city number trnum tracct trct
0 DJ nj Edison 10 1001.0 20345.0 Dew
1 DJ nj Edison 10 1002.0 NaN Newca
2 DJ nj Edison 10 1003.0 20347.0 pen
3 ND DE Newark 20 2001.0 1985.0 flor
4 ND DE Newark 20 2002.0 NaN rodge
5 ND DE Newark 20 NaN NaN None
Another option:
df = pd.DataFrame({'col1': [1,2,3], 'col2':[3,4,5], 'col3':[5,6,7], 'tr1':[0,9,8], 'tr2':[0,9,8]})
The df:
col1 col2 col3 tr1 tr2
0 1 3 5 0 0
1 2 4 6 9 9
2 3 5 7 8 8
subsetting to create 2 df's:
tr1_df = df[['col1', 'col2', 'col3', 'tr1']].rename(index=str, columns={"tr1":"tr"})
tr2_df = df[['col1', 'col2', 'col3', 'tr2']].rename(index=str, columns={"tr2":"tr"})
res = pd.concat([tr1_df, tr2_df])
result:
col1 col2 col3 tr
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
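The same subsetting generalizes to any number of trailer columns with a comprehension; a sketch on the toy frame above:

id_cols = ['col1', 'col2', 'col3']
# build one sub-frame per trailer column, rename it to a common name, then stack
res = pd.concat(
    [df[id_cols + [c]].rename(columns={c: 'tr'}) for c in ['tr1', 'tr2']],
    ignore_index=True,
)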
One option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
 .pivot_longer(
     index=slice("Name", "state"),
     names_to=(".value", ".value"),
     names_pattern=r"(.+)\d(.+)",
     sort_by_appearance=True)
 .dropna()
)
Name number city state trnum tracct trct
0 DJ 10 Edison nj 1001.0 20345.0 Dew
1 DJ 10 Edison nj 1002.0 20346.0 Newca.
2 DJ 10 Edison nj 1003.0 20347.0 pen
3 ND 20 Newark DE 2001.0 1985.0 flor
4 ND 20 Newark DE 2002.0 1986.0 rodge
The .value keeps the part of the column associated with it as header, and since we have multiple .value, they are combined into a single word. The .value is determined by the groups in the names_pattern, which is a regular expression.
Note that currently the multiple .value option is available in dev.