I have a dataframe df with the following structure
Floor Room Area
0 1 Living room 25
1 1 Kitchen 20
2 1 Bedroom 15
3 2 Bathroom 21
4 2 Bedroom 14
and I want to add a column floor_share holding each room's share of the total area of its floor, so that the dataframe becomes
Floor Room Area floor_share
0 1 Living room 18 0.30
1 1 Kitchen 18 0.30
2 1 Bedroom 24 0.40
3 2 Bathroom 10 0.33
4 2 Bedroom 20 0.67
If it is possible to do this with a one-liner (or any other idiomatic manner), I'll be very happy to learn how.
Current workaround
What I have done that produces the correct results is to first find the total floor areas by
floor_area_sums = df.groupby('Floor')['Area'].sum()
which gives
Floor
1 60
2 35
Name: Area, dtype: int64
I then initialize a new series to 0, and find the correct values while iterating through the dataframe rows.
df["floor_share"] = 0.0  # float column, so that the ratios can be stored
for idx, row in df.iterrows():
    df.loc[idx, 'floor_share'] = df.loc[idx, 'Area']/floor_area_sums[row.Floor]
If I understand correctly, use:
df["floor_share"] = df['Area'].div(df.groupby('Floor')['Area'].transform('sum'))
print (df)
Floor Room Area floor_share
0 1 Living room 18 0.300000
1 1 Kitchen 18 0.300000
2 1 Bedroom 24 0.400000
3 2 Bathroom 10 0.333333
4 2 Bedroom 20 0.666667
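As a minimal, self-contained sketch of the same idea: the snippet below rebuilds the question's first table (Area values 25, 20, 15, 21, 14) and computes each room's share of its floor. The intermediate name floor_totals is only illustrative.

```python
import pandas as pd

# Data copied from the first table in the question
df = pd.DataFrame({
    "Floor": [1, 1, 1, 2, 2],
    "Room": ["Living room", "Kitchen", "Bedroom", "Bathroom", "Bedroom"],
    "Area": [25, 20, 15, 21, 14],
})

# transform('sum') returns one value per original row (the total of that
# row's group), so the result lines up with df and can be divided directly.
floor_totals = df.groupby("Floor")["Area"].transform("sum")
df["floor_share"] = df["Area"] / floor_totals

print(df.round(2))
```

Because transform keeps the original index, no loop or merge is needed, and the shares within each floor add up to 1.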
Hello, and thanks in advance for any answers; I really appreciate the community's help.
Here is my dataframe, loaded from a CSV of data scraped from car classified ads:
Unnamed: 0 NameYear \
0 0 BMW 7 серия, 2007
1 1 BMW X3, 2021
2 2 BMW 2 серия Gran Coupe, 2021
3 3 BMW X5, 2021
4 4 BMW X1, 2021
Price \
0 520 000 ₽
1 от 4 810 000 ₽\n4 960 000 ₽ без скидки
2 2 560 000 ₽
3 от 9 259 800 ₽\n9 974 800 ₽ без скидки
4 от 3 130 000 ₽\n3 220 000 ₽ без скидки
CarParams \
0 187 000 км, AT (445 л.с.), седан, задний, бензин
1 2.0 AT (190 л.с.), внедорожник, полный, дизель
2 1.5 AMT (140 л.с.), седан, передний, бензин
3 3.0 AT (400 л.с.), внедорожник, полный, дизель
4 2.0 AT (192 л.с.), внедорожник, полный, бензин
url
0 https://www.avito.ru/moskva/avtomobili/bmw_7_s...
1 https://www.avito.ru/moskva/avtomobili/bmw_x3_...
2 https://www.avito.ru/moskva/avtomobili/bmw_2_s...
3 https://www.avito.ru/moskva/avtomobili/bmw_x5_...
4 https://www.avito.ru/moskva/avtomobili/bmw_x1_...
THE TASK: I want to know whether there are duplicate rows, i.e. whether the SAME car advertisement appears twice. The url column is probably the most reliable key because it should be unique, whereas CarParams or NameYear can legitimately repeat, so I will check nunique and duplicated on the url column.
Screenshot for visually inspecting the result of duplicated:
THE ISSUE: Visual inspection (sorry for the unprofessional jargon) shows that these urls are not the SAME, but I wanted to find exactly identical urls in order to check for repeated data. I also tried setting keep=False.
Try:
df.duplicated(subset=["url"], keep=False)
df.duplicated() gives you a pd.Series of bool values.
Here is an example that you could probably use:
from random import randint
import pandas as pd
urls=['http://www.google.com',
'http://www.stackoverfow.com',
'http://bla.xy','http://bla.com']
d=[]
for i, url in enumerate(urls):
    for j in range(0,randint(1,i+1)):
        d.append(dict(customer=str(randint(1,100)), url=url))
df=pd.DataFrame(d)
df['dups']=df['url'].duplicated(keep=False)
print(df)
resulting in the following df:
customer url dups
0 89 http://www.google.com False
1 43 http://www.stackoverfow.com False
2 36 http://bla.xy True
3 86 http://bla.xy True
4 32 http://bla.com False
The column dups shows which urls occur more than once. In my example data that is only the url http://bla.xy.
The important thing is to check what the parameter keep does:
keep : {'first', 'last', False}, default 'first'
Determines which duplicates (if any) to mark.
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
In my case I used False to mark all duplicated values.
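To make the three keep settings concrete, here is a small deterministic sketch. The example.com urls and the names urls, marks and repeated are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical urls; "b" occurs three times, the others once
urls = pd.Series([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/b",
    "https://example.com/c",
    "https://example.com/b",
])

marks = pd.DataFrame({
    "url": urls,
    "first": urls.duplicated(keep="first"),  # every repeat except the first occurrence
    "last": urls.duplicated(keep="last"),    # every repeat except the last occurrence
    "all": urls.duplicated(keep=False),      # every row whose url occurs more than once
})
print(marks)

# Rows to inspect when checking for repeated advertisements
repeated = urls[urls.duplicated(keep=False)]
print(repeated)
```

If the mask from keep=False is all False, no url in the column repeats exactly, which would match what the visual inspection in the question suggests.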
Consider I have the following dataframe:
Survived Pclass Sex Age Fare
0 0 3 male 22.0 7.2500
1 1 1 female 38.0 71.2833
2 1 3 female 26.0 7.9250
3 1 1 female 35.0 53.1000
4 0 3 male 35.0 8.0500
I used the get_dummies() function to create dummy variables. The code and output are as follows:
one_hot = pd.get_dummies(dataset, columns = ['Sex'])
This will return:
Survived Pclass Age Fare Sex_female Sex_male
0 0 3 22 7.2500 0 1
1 1 1 38 71.2833 1 0
2 1 3 26 7.9250 1 0
3 1 1 35 53.1000 1 0
4 0 3 35 8.0500 0 1
What I would like to have is a single column for Sex having the values 0 or 1 instead of 2 columns.
Interestingly, when I used get_dummies() on a different dataframe, it worked just like I wanted.
For the following dataframe:
Category Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
With the code:
one_hot = pd.get_dummies(dataset, columns = ['Category'])
It returns:
Message ... Category_spam
0 Go until jurong point, crazy.. Available only ... ... 0
1 Ok lar... Joking wif u oni... ... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... ... 1
3 U dun say so early hor... U c already then say... ... 0
4 Nah I don't think he goes to usf, he lives aro... ... 0
Why does get_dummies() work differently on these two dataframes?
How can I make sure I get the 2nd output every time?
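Here are a couple of ways to do it:
# using scikit-learn; LabelEncoder numbers the labels in sorted order (female -> 0, male -> 1)
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
df['Sex_encoded'] = lbl.fit_transform(df['Sex'])
# using only pandas; the mapping is explicit (this overwrites the column created above)
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})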
Survived Pclass Sex Age Fare Sex_encoded
0 0 3 male 22.0 7.2500 0
1 1 1 female 38.0 71.2833 1
2 1 3 female 26.0 7.9250 1
3 1 1 female 35.0 53.1000 1
4 0 3 male 35.0 8.0500 0
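Another possibility, not mentioned above and offered only as a sketch: get_dummies accepts drop_first=True, which drops the first level of each encoded column, so a two-valued column such as Sex yields a single indicator column. The small frame below is a cut-down copy of the question's data, and dtype=int is an assumption made here to get 0/1 instead of booleans.

```python
import pandas as pd

# Cut-down copy of the question's data (only two columns kept)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
})

# drop_first=True drops the first category in sorted order ("female"),
# leaving one indicator column, Sex_male.
encoded = pd.get_dummies(df, columns=["Sex"], drop_first=True, dtype=int)
print(encoded)

# Equivalent explicit mapping when the 0/1 assignment should be chosen by hand
df["Sex_encoded"] = df["Sex"].map({"male": 0, "female": 1})
```

Note that the two approaches number the categories differently here: Sex_male is 1 for male rows, while the explicit mapping assigns 1 to female rows.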
I have a df "data" as below
Name Quality city
Tom High A
nick Medium B
krish Low A
Jack High A
Kevin High B
Phil Medium B
I want to group it by city, create new columns based on the values of the column "Quality", and calculate the average (percentage share) of each, as below
city High Medium Low High_Avg Medium_Avg Low_Avg
A 2 0 1 66.67 0 33.33
B 1 2 0 33.33 66.67 0
I tried with the below script and I know it is completely wrong.
data_average = data_df.groupby(['city'], as_index = False).count()
Get a count of the frequencies, divide the outcome by the sum across columns, and finally concatenate the dataframes into one:
result = pd.crosstab(df.city, df.Quality)
averages = result.div(result.sum(1).array, axis=0).mul(100).round(2).add_suffix("_Avg")
#combine the dataframes
pd.concat((result, averages), axis=1)
Quality High Low Medium High_Avg Low_Avg Medium_Avg
city
A 2 1 0 66.67 33.33 0.00
B 1 0 2 33.33 0.00 66.67
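A variant of the same approach, sketched under the assumption that row-wise percentages are what is wanted: pd.crosstab has a normalize parameter, and normalize="index" divides each row by its own total, which avoids the explicit div step. The names counts, shares and result are illustrative.

```python
import pandas as pd

# Data copied from the question
data = pd.DataFrame({
    "Name": ["Tom", "nick", "krish", "Jack", "Kevin", "Phil"],
    "Quality": ["High", "Medium", "Low", "High", "High", "Medium"],
    "city": ["A", "B", "A", "A", "B", "B"],
})

# Raw counts per city and quality level
counts = pd.crosstab(data["city"], data["Quality"])

# normalize="index" turns each row into fractions of that row's total
shares = (
    pd.crosstab(data["city"], data["Quality"], normalize="index")
    .mul(100)
    .round(2)
    .add_suffix("_Avg")
)

result = pd.concat([counts, shares], axis=1)
print(result)
```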
I have a dataframe like this. It has regular fields up to "state", followed by trailers (3 columns each; the tr1* columns represent one trailer). I want to convert those trailers to rows. I tried the melt function, but I was only able to use it on one trailer column. The example below should make this clear.
Name number city state tr1num tr1acct tr1ct tr2num tr2acct tr2ct tr3num tr3acct tr3ct
DJ 10 Edison nj 1001 20345 Dew 1002 20346 Newca. 1003. 20347. pen
ND 20 Newark DE 2001 1985 flor 2002 1986 rodge
I am expecting the output like this.
Name number city state trnum tracct trct
DJ 10 Edison nj 1001 20345 Dew
DJ 10 Edison nj 1002 20346 Newca
DJ 10 Edison nj 1003 20347 pen
ND 20 Newark DE 2001 1985 flor
ND 20 Newark DE 2002 1986 rodge
You need to look at using pd.wide_to_long. However, you will need to do some column renaming first.
df = df.set_index(['Name','number','city','state'])
# move the trailer number to the end: tr1num -> trnum_1 (regex=True is required in pandas 2.0+)
df.columns = df.columns.str.replace(r'(\D+)(\d+)(\D+)', r'\1\3_\2', regex=True)
df = df.reset_index()
pd.wide_to_long(df, ['trnum','trct','tracct'],
                ['Name','number','city','state'], 'Code', sep='_', suffix=r'\d+')\
  .reset_index()\
  .drop('Code',axis=1)
Output:
Name number city state trnum trct tracct
0 DJ 10 Edison nj 1001.0 Dew 20345.0
1 DJ 10 Edison nj 1002.0 Newca. 20346.0
2 DJ 10 Edison nj 1003.0 pen 20347.0
3 ND 20 Newark DE 2001.0 flor 1985.0
4 ND 20 Newark DE 2002.0 rodge 1986.0
5 ND 20 Newark DE NaN NaN NaN
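For reference, here is a self-contained sketch of the same rename-then-wide_to_long approach, starting from the question's original column names. The sample values are typed in from the question (with the stray trailing periods left out), the names id_cols, long_df and the j label "trailer" are illustrative, and the final dropna(how="all") is an added assumption to remove the empty third trailer of ND.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Name": ["DJ", "ND"], "number": [10, 20],
    "city": ["Edison", "Newark"], "state": ["nj", "DE"],
    "tr1num": [1001, 2001], "tr1acct": [20345, 1985], "tr1ct": ["Dew", "flor"],
    "tr2num": [1002, 2002], "tr2acct": [20346, 1986], "tr2ct": ["Newca", "rodge"],
    "tr3num": [1003, None], "tr3acct": [20347, None], "tr3ct": ["pen", None],
})
id_cols = ["Name", "number", "city", "state"]

# Move the trailer number to the end so each column reads <stub>_<n>,
# e.g. tr1num -> trnum_1; columns without a digit are left unchanged.
df = df.rename(columns=lambda c: re.sub(r"^(tr)(\d+)(\D+)$", r"\1\3_\2", c))

long_df = (
    pd.wide_to_long(df, stubnames=["trnum", "tracct", "trct"],
                    i=id_cols, j="trailer", sep="_", suffix=r"\d+")
    .dropna(how="all")   # drop trailers where every value is missing
    .sort_index()        # order by id columns, then trailer number
    .reset_index()
    .drop(columns="trailer")
)
print(long_df)
```

The numeric columns come out as floats because the missing third trailer forces NaN into them before the empty row is dropped.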
You could achieve this by renaming your columns a bit and applying the pandas wide_to_long method. Below is the code which produces your desired output.
df = pd.DataFrame({"Name":["DJ", "ND"], "number":[10,20], "city":["Edison", "Newark"], "state":["nj","DE"],
"trnum_1":[1001,2001], "tracct_1":[20345,1985], "trct_1":["Dew", "flor"], "trnum_2":[1002,2002],
"trct_2":["Newca", "rodge"], "trnum_3":[1003,None], "tracct_3":[20347,None], "trct_3":["pen", None]})
pd.wide_to_long(df, stubnames=['trnum', 'tracct', 'trct'], i='Name', j='dropme', sep='_').reset_index().drop('dropme', axis=1)\
.sort_values('trnum')
outputs
Name state city number trnum tracct trct
0 DJ nj Edison 10 1001.0 20345.0 Dew
1 DJ nj Edison 10 1002.0 NaN Newca
2 DJ nj Edison 10 1003.0 20347.0 pen
3 ND DE Newark 20 2001.0 1985.0 flor
4 ND DE Newark 20 2002.0 NaN rodge
5 ND DE Newark 20 NaN NaN None
Another option:
df = pd.DataFrame({'col1': [1,2,3], 'col2':[3,4,5], 'col3':[5,6,7], 'tr1':[0,9,8], 'tr2':[0,9,8]})
The df:
col1 col2 col3 tr1 tr2
0 1 3 5 0 0
1 2 4 6 9 9
2 3 5 7 8 8
Subset to create two dataframes, renaming each trailer column to a common name, then concatenate them:
tr1_df = df[['col1', 'col2', 'col3', 'tr1']].rename(index=str, columns={"tr1":"tr"})
tr2_df = df[['col1', 'col2', 'col3', 'tr2']].rename(index=str, columns={"tr2":"tr"})
res = pd.concat([tr1_df, tr2_df])
result:
col1 col2 col3 tr
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
One option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
 .pivot_longer(
     index=slice("Name", "state"),
     names_to=(".value", ".value"),
     names_pattern=r"(.+)\d(.+)",
     sort_by_appearance=True)
 .dropna()
)
Name number city state trnum tracct trct
0 DJ 10 Edison nj 1001.0 20345.0 Dew
1 DJ 10 Edison nj 1002.0 20346.0 Newca.
2 DJ 10 Edison nj 1003.0 20347.0 pen
3 ND 20 Newark DE 2001.0 1985.0 flor
4 ND 20 Newark DE 2002.0 1986.0 rodge
The .value keeps the part of the column associated with it as header, and since we have multiple .value, they are combined into a single word. The .value is determined by the groups in the names_pattern, which is a regular expression.
Note that the multiple .value option is currently available in the development version, hence the install from GitHub above.