group/merge/pivot data by varied weight ranges in Pandas

Is there a way in Pandas to fill in values according to weight ranges when pivoting the dataframe? I have seen answers that set up bins, but here the weight ranges vary depending on how the data is entered.
Here's my dataset.
import pandas as pd

df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A"],
                   'weight_start': [1,61,161,201,1,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output: (image of the pivoted dataframe)
Desired Output: (image of the desired dataframe)
Since location 2 has a 0 percent discount covering the full range 1 to 500, I want it to populate 0 across the ranges already defined for tier 1, service A, instead of keeping its own 1-500 row.
Edit: Mozway's answer works when there is one service. When I added a second service, the rows no longer grouped as expected.
Here's the new dataset with service B.
import pandas as pd

df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"],
                   'weight_start': [1,61,161,201,1,1,61,161,201,1,1,81,101,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500,500,80,100,200,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3,1,2,2,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0,50,70,50,10,65,55,45,5]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
location                               1     2     3
tier services weight_start weight_end
1    A        1            60       70.0   NaN  60.0
                           500       NaN   0.0   NaN
              61           160      30.0   NaN  20.0
              161          200      10.0   NaN   5.0
              201          500       0.0   NaN   0.0
     B        1            60        NaN   NaN  65.0
                           80        NaN  70.0   NaN
                           500      50.0   NaN   NaN
              61           160       NaN   NaN  55.0
              81           100       NaN  50.0   NaN
              101          200       NaN  10.0   NaN
              161          200       NaN   NaN  45.0
              201          500       NaN   NaN   5.0
Desired Output:
location                               1     2     3
tier services weight_start weight_end
1    A        1            60       70.0   0.0  60.0
              61           160      30.0   0.0  20.0
              161          200      10.0   0.0   5.0
              201          500       0.0   0.0   0.0
     B        1            60         50    70  65.0
                           80         50  70.0    55
              61           160        50   NaN  55.0
              81           100        50  50.0    55
              101          200        50  10.0   NaN
              161          200        50    10  45.0
              201          500        50   NaN   5.0

This will also work to pivot location into columns:
data = (df.set_index(['tier','services','weight_start','weight_end'])
          .pivot(columns='location')['discount']
          .reset_index()
          .rename_axis(None, axis=1)
       )

IIUC, you can (temporarily) exclude the columns that hold only 0/NaN and check whether all remaining values in a row are NaN. If so, drop those rows:
mask = ~pivot_df.loc[:, pivot_df.any()].isna().all(1)  # keep rows with a value in at least one informative column
out = pivot_df[mask].fillna(0)
output:
location                               1    2     3
tier services weight_start weight_end
1    A        1            60       70.0  0.0  60.0
              61           160      30.0  0.0  20.0
              161          200      10.0  0.0   5.0
              201          500       0.0  0.0   0.0
per group:
def drop(d):
    mask = ~d.loc[:, d.any()].isna().all(1)
    return d[mask].fillna(0)

out = pivot_df.groupby(['services']).apply(drop)
output:
location                                        1     2     3
services tier services weight_start weight_end
A        1    A        1            60       70.0   0.0  60.0
                       61           160      30.0   0.0  20.0
                       161          200      10.0   0.0   5.0
                       201          500       0.0   0.0   0.0
B        1    B        1            60        0.0   0.0  65.0
                                    80        0.0  70.0   0.0
                                    500      50.0   0.0   0.0
                       61           160       0.0   0.0  55.0
                       81           100       0.0  50.0   0.0
                       101          200       0.0  10.0   0.0
                       161          200       0.0   0.0  45.0
                       201          500       0.0   0.0   5.0
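If the duplicated services index level added by apply is unwanted, passing group_keys=False should suppress it (a sketch, reusing the drop helper above):
out = pivot_df.groupby(['services'], group_keys=False).apply(drop)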

Related

Use condition in a dataframe to replace values in another dataframe with nan

I have a dataframe that contains concentration values for a set of samples as follows:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       20            20
A       30       23       20            nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       87            54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
I have a second dataframe that contains the proportion of concentration values that are not missing in the first dataframe:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       0.75     1        0.5           0.25
B       1        0.75     0.25          1
C       1        0        0             1
I would like to replace values in the first dataframe with nan when the corresponding proportion in the second dataframe is 0.5 or less. The resulting dataframe would then look like the one below. Any help would be great!
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       nan           nan
A       30       23       nan           nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       nan           54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
Is this what you are looking for?
>>> df2.set_index('Sample').mask(lambda x: x <= 0.5) \
...     .mul(df1.set_index('Sample')).reset_index()
Sample Ethanol Acetone Formaldehyde Methane
0 A 15.0 20.00 NaN NaN
1 A 22.5 23.00 NaN NaN
2 A 15.0 23.00 NaN NaN
3 A NaN 20.00 NaN NaN
4 B 21.0 34.50 NaN 54.0
5 B 23.0 55.50 NaN 54.0
6 B 23.0 50.25 NaN 53.0
7 B 23.0 NaN NaN 33.0
8 C 23.0 NaN NaN 66.0
9 C 22.0 NaN NaN 88.0
10 C 22.0 NaN NaN 90.0
11 C 22.0 NaN NaN 88.0
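Note that mul scales the surviving values by the proportions (e.g. 15.0 = 20 x 0.75). If the original concentrations should pass through unchanged, as in the desired output above, a where-based variant is one option (a sketch, assuming df1 holds the concentrations and df2 the proportions):
vals = df1.set_index('Sample')
# broadcast the per-sample condition onto every row of df1
keep = (df2.set_index('Sample') > 0.5).reindex(vals.index)
out = vals.where(keep).reset_index()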

How to extract a database based on a condition in pandas?

Please help me. Here is the problem:
write an expression to extract a new dataframe containing those days where the temperature reached at least 70 degrees, and assign that to the variable at_least_70. (You might need to think some about what the different columns in the full dataframe represent to decide how to extract the subset of interest.)
After that, write another expression that computes how many days reached at least 70 degrees, and assign that to the variable num_at_least_70.
This is the original DataFrame
Date Maximum Temperature Minimum Temperature \
0 2018-01-01 5 0
1 2018-01-02 13 1
2 2018-01-03 19 -2
3 2018-01-04 22 1
4 2018-01-05 18 -2
.. ... ... ...
360 2018-12-27 33 23
361 2018-12-28 40 21
362 2018-12-29 50 37
363 2018-12-30 37 24
364 2018-12-31 35 25
Average Temperature Precipitation Snowfall Snow Depth
0 2.5 0.04 1.0 3.0
1 7.0 0.03 0.6 4.0
2 8.5 0.00 0.0 4.0
3 11.5 0.00 0.0 3.0
4 8.0 0.09 1.2 4.0
.. ... ... ... ...
360 28.0 0.00 0.0 1.0
361 30.5 0.07 0.0 0.0
362 43.5 0.04 0.0 0.0
363 30.5 0.02 0.7 1.0
364 30.0 0.00 0.0 0.0
[365 rows x 7 columns]
The code I wrote for the above problem is:
at_least_70 = dfc.loc[dfc['Minimum Temperature']>=70,['Date']]
print(at_least_70)
num_at_least_70 = at_least_70.count()
print(num_at_least_70)
The results it shows:
Date
204 2018-07-24
240 2018-08-29
245 2018-09-03
Date 3
dtype: int64
But when I run the test case, it shows:
Incorrect!
You are not correctly extracting the subset.
As suggested by @HenryYik, filter on the Maximum Temperature column and keep the rows of interest rather than only Date:
at_least_70 = dfc.loc[dfc['Maximum Temperature'] >= 70,
                      ['Date', 'Maximum Temperature']]
num_at_least_70 = len(at_least_70)
Use boolean indexing, and count the Trues in the mask with sum (each True counts as 1):
mask = dfc['Maximum Temperature'] >= 70
at_least_70 = dfc[mask]
num_at_least_70 = mask.sum()
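As a quick sanity check of the pattern (a minimal sketch on made-up data, not the course dataset):
import pandas as pd

toy = pd.DataFrame({'Date': ['2018-07-01', '2018-07-02', '2018-07-03'],
                    'Maximum Temperature': [65, 72, 80]})
mask = toy['Maximum Temperature'] >= 70
print(toy[mask])    # the two qualifying rows, all columns kept
print(mask.sum())   # 2 -- each True counts as 1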

Add header to .data file in Pandas

Given a file with the extension .data, I have read it with pd.read_fwf("./input.data", sep=",", header=None):
Out:
0
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3...
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5...
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6...
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5...
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4...
... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2...
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2...
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4...
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2...
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0...
How can I add the following column names to it? Thanks.
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
Update:
pd.read_fwf("./input.data", names = col_names)
Out:
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
0 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
292 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
293 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
294 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
295 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
296 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
If you check read_fwf, the docs say: "Read a table of fixed-width formatted lines into DataFrame."
So, since the file actually uses a , separator, use read_csv instead:
col_names = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("input.data", names=col_names)
print (df)
age sex cp restbp chol fbs restecg thalach exang oldpeak \
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4
.. ... ... ... ... ... ... ... ... ... ...
292 57.0 0.0 4.0 140.0 241.0 0.0 0.0 123.0 1.0 0.2
293 45.0 1.0 1.0 110.0 264.0 0.0 0.0 132.0 0.0 1.2
294 68.0 1.0 4.0 144.0 193.0 1.0 0.0 141.0 0.0 3.4
295 57.0 1.0 4.0 130.0 131.0 0.0 0.0 115.0 1.0 1.2
296 57.0 0.0 2.0 130.0 236.0 0.0 2.0 174.0 0.0 0.0
slope ca thal num
0 3.0 0.0 6.0 0
1 2.0 3.0 3.0 1
2 2.0 2.0 7.0 1
3 3.0 0.0 3.0 0
4 1.0 0.0 3.0 0
.. ... ... ... ...
292 2.0 0.0 7.0 1
293 2.0 0.0 7.0 1
294 2.0 2.0 7.0 1
295 2.0 1.0 7.0 1
296 2.0 1.0 3.0 1
[297 rows x 14 columns]
Just do a read_csv without a header and pass col_names:
df = pd.read_csv('input.data', header=None, names=col_names)
Output (head):
age sex cp restbp chol fbs restecg thalach exang oldpeak slope ca thal num
-- ----- ----- ---- -------- ------ ----- --------- --------- ------- --------- ------- ---- ------ -----
0 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3 3 1
2 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
3 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0

How to prepare paneldata to machine learning in Python?

I have a panel data set (time series). I want to prepare the dataset for machine learning to predict next year's gcp. My data looks like this:
ID,year,age,area,debt_ratio,gcp
654001,2013,49,East,0.14,0
654001,2014,50,East,0.17,0
654001,2015,51,East,0.23,1
654001,2016,52,East,0.18,0
112089,2013,39,West,0.13,0
112089,2014,40,West,0.15,0
112089,2015,41,West,0.18,1
112089,2016,42,West,0.21,1
What I want is something like this:
ID,year,age,area,debt_ratio,gcp,gcp-1,gcp-2,gcp-3
654001,2013,49,East,0.14,0,NA,NA,NA
654001,2014,50,East,0.17,0,0,NA,NA
654001,2015,51,East,0.23,1,0,0,NA
654001,2016,52,East,0.18,0,1,0,0
112089,2013,39,West,0.13,0,NA,NA,NA
112089,2014,40,West,0.15,0,0,NA,NA
112089,2015,41,West,0.18,1,0,0,NA
112089,2016,42,West,0.21,1,1,0,0
I've tried Pandas' melt function, but it didn't work out. I searched online and found this post, which does exactly what I want, but it is done in R:
https://stackoverflow.com/questions/19813077/prepare-time-series-for-machine-learning-long-to-wide-format
Does anybody know how to do this in Python Pandas? Any suggestion would be appreciated!
Use DataFrameGroupBy.shift in a loop:
for i in range(1, 4):
    df[f'gcp-{i}'] = df.groupby('ID')['gcp'].shift(i)

print(df)
ID year age area debt_ratio gcp gcp-1 gcp-2 gcp-3
0 654001 2013 49 East 0.14 0 NaN NaN NaN
1 654001 2014 50 East 0.17 0 0.0 NaN NaN
2 654001 2015 51 East 0.23 1 0.0 0.0 NaN
3 654001 2016 52 East 0.18 0 1.0 0.0 0.0
4 112089 2013 39 West 0.13 0 NaN NaN NaN
5 112089 2014 40 West 0.15 0 0.0 NaN NaN
6 112089 2015 41 West 0.18 1 0.0 0.0 NaN
7 112089 2016 42 West 0.21 1 1.0 0.0 0.0
A more dynamic solution is to get the maximum group size and pass it to range:
N = df['ID'].value_counts().max()
for i in range(1, N):
    df[f'gcp-{i}'] = df.groupby('ID')['gcp'].shift(i)

print(df)
ID year age area debt_ratio gcp gcp-1 gcp-2 gcp-3
0 654001 2013 49 East 0.14 0 NaN NaN NaN
1 654001 2014 50 East 0.17 0 0.0 NaN NaN
2 654001 2015 51 East 0.23 1 0.0 0.0 NaN
3 654001 2016 52 East 0.18 0 1.0 0.0 0.0
4 112089 2013 39 West 0.13 0 NaN NaN NaN
5 112089 2014 40 West 0.15 0 0.0 NaN NaN
6 112089 2015 41 West 0.18 1 0.0 0.0 NaN
7 112089 2016 42 West 0.21 1 1.0 0.0 0.0
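The lag columns can also be built in one shot with a dict comprehension and concat, then joined back (a sketch, assuming the df defined above):
import pandas as pd

lags = pd.concat({f'gcp-{i}': df.groupby('ID')['gcp'].shift(i) for i in range(1, 4)},
                 axis=1)
df = df.join(lags)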

merge multiple columns into one column from the same dataframe pandas

My input dataset is as follows. I want to rename multiple columns to the shared names T1, T2, T3, T4, and then combine the columns that share a name into a single column.
df
ID Q3.4 Q3.6 Q3.8 Q3.18 Q4.4 Q4.6 Q4.8 Q4.12
1 NaN NaN NaN NaN 20 60 80 20
2 10 20 20 40 NaN NaN NaN NaN
3 30 40 40 40 NaN NaN NaN NaN
4 NaN NaN NaN NaN 50 50 50 50
rename vars
T1 = ['Q3.4', 'Q4.4']
T2 = ['Q3.6', 'Q4.6']
T3 = ['Q3.8', 'Q4.8']
T4 = ['Q3.18', 'Q4.12']
Step 1: I have renamed the variables by (let me know if there is any faster code please; a one-call version is sketched after the output below):
df.rename(columns={'Q3.4': 'T1', 'Q4.4': 'T1'}, inplace=True)
df.rename(columns={'Q3.6': 'T2', 'Q4.6': 'T2'}, inplace=True)
df.rename(columns={'Q3.8': 'T3', 'Q4.8': 'T3'}, inplace=True)
df.rename(columns={'Q3.18': 'T4', 'Q4.12': 'T4'}, inplace=True)
ID T1 T2 T3 T4 T1 T2 T3 T4
1 NaN NaN NaN NaN 20 60 80 20
2 10 20 20 40 NaN NaN NaN NaN
3 30 40 40 40 NaN NaN NaN NaN
4 NaN NaN NaN NaN 50 50 50 50
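The four rename calls above can be collapsed into a single call with one combined mapping (a sketch with the same pairs):
df.rename(columns={'Q3.4': 'T1', 'Q4.4': 'T1',
                   'Q3.6': 'T2', 'Q4.6': 'T2',
                   'Q3.8': 'T3', 'Q4.8': 'T3',
                   'Q3.18': 'T4', 'Q4.12': 'T4'}, inplace=True)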
How can I merge the columns into the following expected df?
ID T1 T2 T3 T4
1 20 60 80 20
2 10 20 20 40
3 30 40 40 40
4 50 50 50 50
Thanks!
Start with your original df and groupby with axis=1, passing a dict that maps each source column to its target name:
d = {'Q3.4': 'T1', 'Q4.4': 'T1',
     'Q3.6': 'T2', 'Q4.6': 'T2',
     'Q3.8': 'T3', 'Q4.8': 'T3',
     'Q3.18': 'T4', 'Q4.12': 'T4'}
df.set_index('ID').groupby(d, axis=1).first()
Out[80]:
T1 T2 T3 T4
ID
1 20.0 60.0 80.0 20.0
2 10.0 20.0 20.0 40.0
3 30.0 40.0 40.0 40.0
4 50.0 50.0 50.0 50.0
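Note that groupby(..., axis=1) is deprecated in recent pandas versions; transposing first should give the same result (a sketch reusing the dict d above):
out = df.set_index('ID').rename(columns=d).T.groupby(level=0).first().T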
How about this (after the Step 1 renaming):
df.sum(level=0, axis=1)
Out[313]:
ID T1 T2 T3 T4
0 1.0 20.0 60.0 80.0 20.0
1 2.0 10.0 20.0 20.0 40.0
2 3.0 30.0 40.0 40.0 40.0
3 4.0 50.0 50.0 50.0 50.0
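DataFrame.sum(level=...) was deprecated and has been removed in recent pandas; a rough modern equivalent is below (a sketch; note sum treats an all-NaN group as 0 rather than NaN, which does not matter for this data since each pair has at most one missing value):
out = df.T.groupby(level=0).sum().T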
Try stack/unstack. stack drops NaN cells by default, so the duplicated T-columns collapse to one non-missing value per ID, and unstack can then pivot them back without duplicate-label errors:
# set the index if not already set
df = df.set_index('ID')
# stack, then unstack
df = df.stack().unstack().reset_index()
output:
ID T1 T2 T3 T4
0 1 20.0 60.0 80.0 20.0
1 2 10.0 20.0 20.0 40.0
2 3 30.0 40.0 40.0 40.0
3 4 50.0 50.0 50.0 50.0