merge multiple columns into one column from the same dataframe - pandas

My input dataset is as follows. I want to rename multiple columns to the shared names T1, T2, T3, T4 and then combine the columns that share a name into a single column.
df
ID Q3.4 Q3.6 Q3.8 Q3.18 Q4.4 Q4.6 Q4.8 Q4.12
1 NaN NaN NaN NaN 20 60 80 20
2 10 20 20 40 NaN NaN NaN NaN
3 30 40 40 40 NaN NaN NaN NaN
4 NaN NaN NaN NaN 50 50 50 50
Rename variables:
T1 = ['Q3.4', 'Q4.4']
T2 = ['Q3.6', 'Q4.6']
T3 = ['Q3.8', 'Q4.8']
T4 = ['Q3.18', 'Q4.12']
Step 1: I renamed the variables as follows (let me know if there is faster code, please):
df.rename(columns = {'Q3.4': 'T1',
                     'Q4.4': 'T1'},
          inplace = True)
df.rename(columns = {'Q3.6': 'T2',
                     'Q4.6': 'T2'},
          inplace = True)
df.rename(columns = {'Q3.8': 'T3',
                     'Q4.8': 'T3'},
          inplace = True)
df.rename(columns = {'Q3.18': 'T4',
                     'Q4.12': 'T4'},
          inplace = True)
ID T1 T2 T3 T4 T1 T2 T3 T4
1 NaN NaN NaN NaN 20 60 80 20
2 10 20 20 40 NaN NaN NaN NaN
3 30 40 40 40 NaN NaN NaN NaN
4 NaN NaN NaN NaN 50 50 50 50
How can I merge the columns into the following expected df?
ID T1 T2 T3 T4
1 20 60 80 20
2 10 20 20 40
3 30 40 40 40
4 50 50 50 50
Thanks!

Start with your original df and groupby with axis=1:
d = {'Q3.4': 'T1', 'Q4.4': 'T1',
     'Q3.6': 'T2', 'Q4.6': 'T2',
     'Q3.8': 'T3', 'Q4.8': 'T3',
     'Q3.18': 'T4', 'Q4.12': 'T4'}
df.set_index('ID').groupby(d, axis=1).first()
Out[80]:
T1 T2 T3 T4
ID
1 20.0 60.0 80.0 20.0
2 10.0 20.0 20.0 40.0
3 30.0 40.0 40.0 40.0
4 50.0 50.0 50.0 50.0
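Note that groupby(..., axis=1) is deprecated in recent pandas versions. A minimal equivalent sketch, assuming axis=1 is unavailable, transposes first so the same mapping groups the former column labels:
# group the transposed index labels through d; first() keeps the
# first non-null value in each group
df.set_index('ID').T.groupby(d).first().T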

How about this:
df.sum(level=0, axis=1)
Out[313]:
ID T1 T2 T3 T4
0 1.0 20.0 60.0 80.0 20.0
1 2.0 10.0 20.0 20.0 40.0
2 3.0 30.0 40.0 40.0 40.0
3 4.0 50.0 50.0 50.0 50.0
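sum(level=...) is deprecated in recent pandas versions, and here sum also treats ID as just another data column. A hedged equivalent that sets ID aside and groups the duplicated column names explicitly:
# sum fills the gaps because NaN counts as 0; use first() instead if both
# halves could be filled and should not be added together
df.set_index('ID').T.groupby(level=0).sum().T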

Try:
# set the index if not already set
df = df.set_index('ID')
# stack drops the NaNs, then unstack rebuilds one column per name
# (this assumes at most one non-null value per (ID, name) pair;
# otherwise unstack raises on duplicate entries)
df = df.stack().unstack().reset_index()
output:
ID T1 T2 T3 T4
0 1 20.0 60.0 80.0 20.0
1 2 10.0 20.0 20.0 40.0
2 3 30.0 40.0 40.0 40.0
3 4 50.0 50.0 50.0 50.0

Related

group/merge/pivot data by varied weight ranges in Pandas

Is there a way in Pandas to fill in values according to weight ranges when pivoting the dataframe? I have seen answers that set bins, but here the weight ranges vary depending on how the data is entered.
Here's my dataset.
import pandas as pd
df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A"],
                   'weight_start': [1,61,161,201,1,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
[image of pivot_df: location 2 appears only on its own 1-500 row]
Desired Output:
[image of the desired pivot: location 2's 0 spread across the tier 1 service A ranges]
Since location 2 has a 0 percent discount covering the range 1 to 500, I want it to populate 0 across the ranges prescribed for tier 1 service A instead of having its own row.
Edit: Mozway's answer works when there is one service. When I added a second service, the dataframe became ungrouped.
Here's the new dataset with service B.
import pandas as pd
df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"],
                   'weight_start': [1,61,161,201,1,1,61,161,201,1,1,81,101,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500,500,80,100,200,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3,1,2,2,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0,50,70,50,10,65,55,45,5]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 NaN 60.0
500 NaN 0.0 NaN
61 160 30.0 NaN 20.0
161 200 10.0 NaN 5.0
201 500 0.0 NaN 0.0
B 1 60 NaN NaN 65.0
80 NaN 70.0 NaN
500 50.0 NaN NaN
61 160 NaN NaN 55.0
81 100 NaN 50.0 NaN
101 200 NaN 10.0 NaN
161 200 NaN NaN 45.0
201 500 NaN NaN 5.0
Desired Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 60 50 70 65.0
80 50 70.0 55
61 160 50 NaN 55.0
81 100 50 50.0 55
101 200 50 10.0 NaN
161 200 50 10 45.0
201 500 50 NaN 5.0
This will work:
data = (df.set_index(['tier','services','weight_start','weight_end'])
          .pivot(columns='location')['discount']
          .reset_index()
          .rename_axis(None, axis=1))
IIUC, you can (temporarily) exclude the columns that hold only 0/NaN and check whether all remaining values in a row are NaN. If so, drop those rows:
mask = ~pivot_df.loc[:, pivot_df.any()].isna().all(axis=1)  # rows with data outside the all-0/NaN columns
out = pivot_df[mask].fillna(0)
output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
per group:
def drop(d):
    mask = ~d.loc[:, d.any()].isna().all(axis=1)
    return d[mask].fillna(0)
out = pivot_df.groupby(['services']).apply(drop)
output:
location 1 2 3
services tier services weight_start weight_end
A 1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 B 1 60 0.0 0.0 65.0
80 0.0 70.0 0.0
500 50.0 0.0 0.0
61 160 0.0 0.0 55.0
81 100 0.0 50.0 0.0
101 200 0.0 10.0 0.0
161 200 0.0 0.0 45.0
201 500 0.0 0.0 5.0
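The duplicated services level in the index comes from groupby.apply adding the group key. A small variation (a sketch, assuming a pandas version that supports group_keys) avoids it:
out = pivot_df.groupby('services', group_keys=False).apply(drop)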

Use condition in a dataframe to replace values in another dataframe with nan

I have a dataframe that contains concentration values for a set of samples as follows:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       20            20
A       30       23       20            nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       87            54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
I have a second dataframe that contains the proportion of concentration values that are not missing in the first dataframe:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       0.75     1        0.5           0.25
B       1        0.75     0.25          1
C       1        0        0             1
I would like to replace values in the first dataframe with NaN when the corresponding value in the second dataframe is 0.5 or less. The resulting dataframe would then look like the one below. Any help would be great!
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       nan           nan
A       30       23       nan           nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       nan           54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
Is this what you are looking for?
>>> df2.set_index('Sample').mask(lambda x: x <= 0.5) \
...     .mul(df1.set_index('Sample')).reset_index()
Sample Ethanol Acetone Formaldehyde Methane
0 A 15.0 20.00 NaN NaN
1 A 22.5 23.00 NaN NaN
2 A 15.0 23.00 NaN NaN
3 A NaN 20.00 NaN NaN
4 B 21.0 34.50 NaN 54.0
5 B 23.0 55.50 NaN 54.0
6 B 23.0 50.25 NaN 53.0
7 B 23.0 NaN NaN 33.0
8 C 23.0 NaN NaN 66.0
9 C 22.0 NaN NaN 88.0
10 C 22.0 NaN NaN 90.0
11 C 22.0 NaN NaN 88.0
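Note that the multiplication scales the surviving values by the proportions (e.g. 15.0 = 0.75 * 20), whereas the desired output keeps the original concentrations. A sketch that only masks, assuming df1 holds the concentrations and df2 the proportions:
keep = df2.set_index('Sample').gt(0.5)  # True where proportion > 0.5
vals = df1.set_index('Sample')
# reindex broadcasts each sample's keep row across its repeated rows in vals
out = vals.where(keep.reindex(vals.index)).reset_index()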

Groupby Year and other column and calculate average based on specific condition pandas

I have a data frame as shown below
Tenancy_ID Unit_ID End_Date Rental_value
1 A 2012-04-26 10
2 A 2012-08-27 20
3 A 2013-04-27 50
4 A 2014-04-27 40
1 B 2011-06-26 10
2 B 2011-09-27 30
3 B 2013-04-27 60
4 B 2015-04-27 80
From the above I would like to prepare below data frame
Expected Output:
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
A NaN 15 50 40 NaN
B 20 NaN 60 NaN 80
Steps:
Unit_ID = A, has two contracts in 2012 with rental value 10 and 20, Hence the average is 15.
Avg_2012 = Average rental value in 2012.
Use pivot_table directly with the year extracted via Series.dt.year:
# df['End_Date'] = pd.to_datetime(df['End_Date'])  # if End_Date is not already datetime
final = (df.pivot_table('Rental_value', 'Unit_ID', df['End_Date'].dt.year)
           .add_prefix('Avg_').reset_index().rename_axis(None, axis=1))
print(final)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
You can aggregate averages and reshape with Series.unstack, then rename the columns with DataFrame.add_prefix, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df1 = (df.groupby(['Unit_ID', df['End_Date'].dt.year])['Rental_value']
         .mean()
         .unstack()
         .add_prefix('Avg_')
         .reset_index()
         .rename_axis(None, axis=1))
print(df1)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
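For reference, a self-contained sketch of the pivot_table approach using the sample data above:
import pandas as pd

df = pd.DataFrame({
    'Tenancy_ID': [1, 2, 3, 4, 1, 2, 3, 4],
    'Unit_ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'End_Date': pd.to_datetime(['2012-04-26', '2012-08-27', '2013-04-27', '2014-04-27',
                                '2011-06-26', '2011-09-27', '2013-04-27', '2015-04-27']),
    'Rental_value': [10, 20, 50, 40, 10, 30, 60, 80],
})

# pivot_table averages duplicate (Unit_ID, year) pairs by default (aggfunc='mean')
final = (df.pivot_table('Rental_value', 'Unit_ID', df['End_Date'].dt.year)
           .add_prefix('Avg_').reset_index().rename_axis(None, axis=1))
print(final)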

expand mid year values to month in pandas

following from expand year values to month in pandas
I have:
df = pd.DataFrame({'comp': ['a','b'], 'period': ['20180331','20171231'], 'value': [12,24]})
comp period value
0 a 20180331 12
1 b 20171231 24
and would like to expand this to cover 201701 through 201812 inclusive. The value should be spread over the 12 months ending at the period month.
comp yyymm value
a 201701 na
a 201702 na
...
a 201705 12
a 201706 12
...
a 201803 12
a 201804 na
b 201701 24
...
b 201712 24
b 201801 na
...
Use:
# create monthly periods covering the full target range
r = pd.period_range('2017-01', '2018-12', freq='m')
# convert the column to monthly periods
df['period'] = pd.to_datetime(df['period']).dt.to_period('m')
# create a MultiIndex of all possible (comp, period) combinations
mux = pd.MultiIndex.from_product([df['comp'], r], names=('comp', 'period'))
# reindex to add the missing combinations
df = df.set_index(['comp', 'period'])['value'].reindex(mux).reset_index()
# back-fill up to 11 missing values per group (the 11 months preceding each period)
df['new'] = df.groupby('comp')['value'].bfill(limit=11)
print(df)
comp period value new
0 a 2017-01 NaN NaN
1 a 2017-02 NaN NaN
2 a 2017-03 NaN NaN
3 a 2017-04 NaN 12.0
4 a 2017-05 NaN 12.0
...
...
10 a 2017-11 NaN 12.0
11 a 2017-12 NaN 12.0
12 a 2018-01 NaN 12.0
13 a 2018-02 NaN 12.0
14 a 2018-03 12.0 12.0
15 a 2018-04 NaN NaN
16 a 2018-05 NaN NaN
17 a 2018-06 NaN NaN
18 a 2018-07 NaN NaN
19 a 2018-08 NaN NaN
20 a 2018-09 NaN NaN
21 a 2018-10 NaN NaN
22 a 2018-11 NaN NaN
23 a 2018-12 NaN NaN
24 b 2017-01 NaN 24.0
25 b 2017-02 NaN 24.0
26 b 2017-03 NaN 24.0
...
...
32 b 2017-09 NaN 24.0
33 b 2017-10 NaN 24.0
34 b 2017-11 NaN 24.0
35 b 2017-12 24.0 24.0
36 b 2018-01 NaN NaN
37 b 2018-02 NaN NaN
38 b 2018-03 NaN NaN
...
...
44 b 2018-09 NaN NaN
45 b 2018-10 NaN NaN
46 b 2018-11 NaN NaN
47 b 2018-12 NaN NaN
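If the yyyymm string format from the question is needed, the monthly period column can be formatted afterwards (a small sketch, assuming the df produced above):
df['yyyymm'] = df['period'].dt.strftime('%Y%m')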
See if this works:
# populate the full date range, formatted as year-month strings
dftime = (pd.DataFrame(pd.date_range('20170101', '20181231'), columns=['dt'])
            .apply(lambda x: x.dt.strftime('%Y-%m'), axis=1))
# drop the duplicate months produced by the daily range
dftime = dftime.assign(dt=dftime.dt.drop_duplicates().reset_index(drop=True)).dropna()
# add a matching year-month column to df for merging
df['dt'] = pd.to_datetime(df.period).apply(lambda x: x.strftime('%Y-%m'))
# populate the month grid for each company
target = (df.groupby('comp')
            .apply(lambda x: dftime.merge(x[['comp','dt','value']], on='dt', how='left')
                             .fillna({'comp': x.comp.unique()[0]}))
            .reset_index(drop=True))
This gives desired output:
print(target)
dt comp value
0 2017-01 a NaN
1 2017-02 a NaN
2 2017-03 a NaN
3 2017-04 a NaN
4 2017-05 a NaN
5 2017-06 a NaN
6 2017-07 a NaN
and so on.

Create a new ID column based on conditions in other column using pandas

I am trying to make a new column 'ID' that assigns a unique ID to each run of consecutive non-NaN values in the 'Data' column; rows where 'Data' is NaN get 0. I have shown below how the final ID column should look. Could anyone guide me on this?
Id Data
0 NaN
0 NaN
0 NaN
1 54
1 55
0 NaN
0 NaN
2 67
0 NaN
0 NaN
3 33
3 44
3 22
0 NaN
Use .groupby on the cumulative null count to get consecutive groups, masking the NaN rows with where; .ngroup then assigns consecutive IDs. Also possible with rank.
s = df.Data.isnull().cumsum().where(df.Data.notnull())
df['ID'] = df.groupby(s).ngroup()+1
# df['ID'] = s.rank(method='dense').fillna(0).astype(int)
Output:
Data ID
0 NaN 0
1 NaN 0
2 NaN 0
3 54.0 1
4 55.0 1
5 NaN 0
6 NaN 0
7 67.0 2
8 NaN 0
9 NaN 0
10 33.0 3
11 44.0 3
12 22.0 3
13 NaN 0
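To see why this works: the cumulative count of nulls is constant within each run of non-null values, so it serves as a group key (a worked sketch of the intermediate values for the sample data; in the pandas version used here, ngroup gave -1 for the NaN keys, which is what makes the +1 yield 0 on the null rows):
s = df.Data.isnull().cumsum().where(df.Data.notnull())
# s: NaN NaN NaN 3 3 NaN NaN 5 NaN NaN 7 7 7 NaN
# the three distinct keys (3, 5, 7) become groups 0, 1, 2; +1 gives IDs 1, 2, 3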
Using factorize
v = pd.factorize(df.Data.isnull().cumsum()[df.Data.notnull()])[0] + 1
df.loc[df.Data.notnull(), 'Newid'] = v
df.Newid.fillna(0, inplace=True)
df
Id Data Newid
0 0 NaN 0.0
1 0 NaN 0.0
2 0 NaN 0.0
3 1 54.0 1.0
4 1 55.0 1.0
5 0 NaN 0.0
6 0 NaN 0.0
7 2 67.0 2.0
8 0 NaN 0.0
9 0 NaN 0.0
10 3 33.0 3.0
11 3 44.0 3.0
12 3 22.0 3.0
13 0 NaN 0.0
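Another common idiom for numbering runs of non-null values (a sketch; it matches the expected output above):
# start a new group whenever a non-null value follows a null
notnull = df['Data'].notna()
df['ID'] = (notnull & ~notnull.shift(fill_value=False)).cumsum().where(notnull, 0)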