How to prepare panel data for machine learning in Python? - pandas

I have a panel data set/time series. I want to prepare the dataset for a machine learning model that predicts next year's gcp. My data looks like this:
ID,year,age,area,debt_ratio,gcp
654001,2013,49,East,0.14,0
654001,2014,50,East,0.17,0
654001,2015,51,East,0.23,1
654001,2016,52,East,0.18,0
112089,2013,39,West,0.13,0
112089,2014,40,West,0.15,0
112089,2015,41,West,0.18,1
112089,2016,42,West,0.21,1
What I want is something like this:
ID,year,age,area,debt_ratio,gcp,gcp-1,gcp-2,gcp-3
654001,2013,49,East,0.14,0,NA,NA,NA
654001,2014,50,East,0.17,0,0,NA,NA
654001,2015,51,East,0.23,1,0,0,NA
654001,2016,52,East,0.18,0,1,0,0
112089,2013,39,West,0.13,0,NA,NA,NA
112089,2014,40,West,0.15,0,0,NA,NA
112089,2015,41,West,0.18,1,0,0,NA
112089,2016,42,West,0.21,1,1,0,0
I've tried the Pandas melt function, but it didn't work out. I searched online and found this post that does exactly what I want, but in R:
https://stackoverflow.com/questions/19813077/prepare-time-series-for-machine-learning-long-to-wide-format
Does anybody know how to do this in Python Pandas? Any suggestion would be appreciated!
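For a runnable setup, the sample data from the question can be loaded into a DataFrame first (a minimal sketch using io.StringIO):
import io
import pandas as pd

# build the sample frame from the CSV text above
csv = """ID,year,age,area,debt_ratio,gcp
654001,2013,49,East,0.14,0
654001,2014,50,East,0.17,0
654001,2015,51,East,0.23,1
654001,2016,52,East,0.18,0
112089,2013,39,West,0.13,0
112089,2014,40,West,0.15,0
112089,2015,41,West,0.18,1
112089,2016,42,West,0.21,1"""
df = pd.read_csv(io.StringIO(csv))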

Use DataFrameGroupBy.shift in a loop:
for i in range(1, 4):
    df[f'gcp-{i}'] = df.groupby('ID')['gcp'].shift(i)
print(df)
ID year age area debt_ratio gcp gcp-1 gcp-2 gcp-3
0 654001 2013 49 East 0.14 0 NaN NaN NaN
1 654001 2014 50 East 0.17 0 0.0 NaN NaN
2 654001 2015 51 East 0.23 1 0.0 0.0 NaN
3 654001 2016 52 East 0.18 0 1.0 0.0 0.0
4 112089 2013 39 West 0.13 0 NaN NaN NaN
5 112089 2014 40 West 0.15 0 0.0 NaN NaN
6 112089 2015 41 West 0.18 1 0.0 0.0 NaN
7 112089 2016 42 West 0.21 1 1.0 0.0 0.0
A more dynamic solution is to get the size of the largest group and pass it to range:
N = df['ID'].value_counts().max()
for i in range(1, N):
    df[f'gcp-{i}'] = df.groupby('ID')['gcp'].shift(i)
print(df)
ID year age area debt_ratio gcp gcp-1 gcp-2 gcp-3
0 654001 2013 49 East 0.14 0 NaN NaN NaN
1 654001 2014 50 East 0.17 0 0.0 NaN NaN
2 654001 2015 51 East 0.23 1 0.0 0.0 NaN
3 654001 2016 52 East 0.18 0 1.0 0.0 0.0
4 112089 2013 39 West 0.13 0 NaN NaN NaN
5 112089 2014 40 West 0.15 0 0.0 NaN NaN
6 112089 2015 41 West 0.18 1 0.0 0.0 NaN
7 112089 2016 42 West 0.21 1 1.0 0.0 0.0
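Note that groupby(...).shift shifts by position within each group, so the rows must already be ordered by year within each ID; if that is not guaranteed, sort first:
# ensure chronological order within each ID before shifting
df = df.sort_values(['ID', 'year'])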

Related

Why can't I get a usual dataframe after using pivot()?

Under the variable names, there is an extra row that I do not want in my data set:
fdi_autocracy = fdi_autocracy.pivot(index=["Country", "regime", "Year"],
                                    columns="partner_regime",
                                    values=["FDI_outward", "FDI_inward", "total_fdi"],
                                    ).reset_index()
Country regime Year FDI_outward FDI_inward total_fdi
partner_regime 0.0 0.0 0.0
0 Albania 0.0 1995 NaN NaN NaN
1 Albania 0.0 1996 NaN NaN NaN
2 Albania 0.0 1997 NaN NaN NaN
3 Albania 0.0 1998 NaN NaN NaN
4 Albania 0.0 1999 NaN NaN NaN
What I want is following:
Country regime Year FDI_outward FDI_inward total_fdi
0 Albania 0.0 1995 NaN NaN NaN
1 Albania 0.0 1996 NaN NaN NaN
2 Albania 0.0 1997 NaN NaN NaN
3 Albania 0.0 1998 NaN NaN NaN
4 Albania 0.0 1999 NaN NaN NaN
IIUC, you don't need the partner_regime level? This removes that title:
fdi_autocracy.rename_axis(columns=[None, None])
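For completeness, a minimal sketch of the whole pipeline (assuming the pivot call from the question); note that rename_axis returns a new DataFrame, so assign the result back:
fdi_autocracy = (fdi_autocracy.pivot(index=["Country", "regime", "Year"],
                                     columns="partner_regime",
                                     values=["FDI_outward", "FDI_inward", "total_fdi"])
                              .reset_index()
                              .rename_axis(columns=[None, None]))  # drop both column-axis names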

group/merge/pivot data by varied weight ranges in Pandas

Is there a way in Pandas to fill in values according to weight ranges when pivoting the dataframe? I have seen answers that set bins, but here the weight ranges vary depending on how the data is entered.
Here's my dataset.
import pandas as pd
df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A"],
                   'weight_start': [1,61,161,201,1,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
(screenshot of the pivoted dataframe)
Desired output:
(screenshot of the desired dataframe)
Since location 2 has a 0 percent discount covering the range 1 to 500, I want it to populate 0 across the ranges prescribed for tier 1 service A instead of having its own row.
Edit: Mozway's answer works when there is one service. When I added a second service, the dataframe became ungrouped.
Here's the new dataset with service B.
import pandas as pd
df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"],
                   'weight_start': [1,61,161,201,1,1,61,161,201,1,1,81,101,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500,500,80,100,200,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3,1,2,2,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0,50,70,50,10,65,55,45,5]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 NaN 60.0
500 NaN 0.0 NaN
61 160 30.0 NaN 20.0
161 200 10.0 NaN 5.0
201 500 0.0 NaN 0.0
B 1 60 NaN NaN 65.0
80 NaN 70.0 NaN
500 50.0 NaN NaN
61 160 NaN NaN 55.0
81 100 NaN 50.0 NaN
101 200 NaN 10.0 NaN
161 200 NaN NaN 45.0
201 500 NaN NaN 5.0
Desired Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 60 50 70 65.0
80 50 70.0 55
61 160 50 NaN 55.0
81 100 50 50.0 55
101 200 50 10.0 NaN
161 200 50 10 45.0
201 500 50 NaN 5.0
This will work:
data = (df.set_index(['tier','services','weight_start','weight_end'])
          .pivot(columns='location')['discount']
          .reset_index()
          .rename_axis(None, axis=1)
        )
IIUC, you can (temporarily) exclude the columns that contain only 0/NaN and check whether all remaining values in a row are NaN. If so, drop those rows:
mask = ~pivot_df.loc[:, pivot_df.any()].isna().all(1)
out = pivot_df[mask].fillna(0)
output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
Per group:
def drop(d):
    mask = ~d.loc[:, d.any()].isna().all(1)
    return d[mask].fillna(0)

out = pivot_df.groupby(['services']).apply(drop)
output:
location 1 2 3
services tier services weight_start weight_end
A 1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 B 1 60 0.0 0.0 65.0
80 0.0 70.0 0.0
500 50.0 0.0 0.0
61 160 0.0 0.0 55.0
81 100 0.0 50.0 0.0
101 200 0.0 10.0 0.0
161 200 0.0 0.0 45.0
201 500 0.0 0.0 5.0
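Note that groupby(...).apply prepends the group key as an extra index level, which is why services appears twice in the index above. If that is unwanted, it can be dropped afterwards (a small follow-up sketch):
# drop the prepended 'services' group-key level from the index
out = out.droplevel(0)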

Select a row if two consecutive columns contain a negative value

From the given table of inflation rates below, I want to obtain the countries with negative inflation rates for two consecutive years.
2017 2018 2019 2020 2021 2022
Country
Turkey NaN 47.0 -7.0 -19.0 38.0 260.0
Argentina NaN 33.0 56.0 -22.0 15.0 8.0
Suriname NaN -68.0 -37.0 695.0 56.0 13.0
Zimbabwe NaN 106.0 2306.0 118.0 -83.0 -21.0
Lebanon NaN 2.0 -36.0 2826.0 82.0 39.0
Sudan NaN 96.0 -19.0 220.0 19.0 34.0
Venezuela NaN 1482.0 -70.0 -88.0 15.0 -89.0
I have seen some solutions on SO that use list comprehensions or loops. I wonder if this task is possible without them.
I attempted to convert the dataframe into 1s and 0s, in which 1.0 indicates negative inflation.
2017 2018 2019 2020 2021 2022
Country
Turkey NaN 0.0 1.0 1.0 0.0 0.0
Argentina NaN 0.0 0.0 1.0 0.0 0.0
Suriname NaN 1.0 1.0 0.0 0.0 0.0
Zimbabwe NaN 0.0 0.0 0.0 1.0 1.0
Lebanon NaN 0.0 1.0 0.0 0.0 0.0
Sudan NaN 0.0 1.0 0.0 0.0 0.0
Venezuela NaN 0.0 1.0 1.0 0.0 1.0
However, I am stuck at this point. I tried the np.prod function, but it returns 0 if at least one column has 0.0 data.
Any ideas about how to solve this problem?
You can first build an integer mask for the negative values (1 means negative), then compute a rolling min along axis 1: if the min is 1, all values in the window are negative. This generalizes to any number of consecutive columns.
N = 2
m1 = df.lt(0).astype(int)
m2 = m1.rolling(N, axis=1).min().eq(1).any(axis=1)
df[m2]
Output:
2017 2018 2019 2020 2021 2022
Country
Turkey NaN 47.0 -7.0 -19.0 38.0 260.0
Suriname NaN -68.0 -37.0 695.0 56.0 13.0
Zimbabwe NaN 106.0 2306.0 118.0 -83.0 -21.0
Venezuela NaN 1482.0 -70.0 -88.0 15.0 -89.0
NB. One needs to work with integers, as rolling is currently limited to numeric types.
Alternative with a single mask for N=2:
m = df.lt(0)
df[(m&m.shift(axis=1)).any(axis=1)]
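The shift-based mask can also be chained for longer runs; a sketch for N consecutive negative years (N=3 is an assumed example, not from the question):
from functools import reduce

N = 3  # assumed window length
neg = df.lt(0)
# AND the mask with its first N-1 column-shifts; fill_value=False treats
# values shifted in from outside the frame as "not negative"
shifted = (neg.shift(i, axis=1, fill_value=False) for i in range(N))
match = reduce(lambda a, b: a & b, shifted).any(axis=1)
df[match]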
Try this:
match = (df.lt(0) & df.shift(axis=1).lt(0)).any(axis=1)
df[match]
How it works:
df.lt(0): current year inflation is less than 0
df.shift(axis=1).lt(0): previous year inflation is less than 0
.any(axis=1): any such occurrence in the country.
Given your dataframe, this is what would work for me:
Set Country as the index so the df values contain only numbers.
Define a new column that checks for two sequential negatives across the columns using df.shift(axis=1).
So it would look like:
df.set_index('Country', inplace=True)
df['TwoNegatives'] = ((df.values < 0) & (df.shift(axis=1).values < 0)).any(axis=1)
Try with rolling (using lt(0), since a zero rate is not negative):
out = df[df.lt(0).T.rolling(window=2).sum().ge(2).any()]
Out[15]:
2017 2018 2019 2020 2021 2022
Country
Turkey NaN 47.0 -7.0 -19.0 38.0 260.0
Suriname NaN -68.0 -37.0 695.0 56.0 13.0
Zimbabwe NaN 106.0 2306.0 118.0 -83.0 -21.0
Venezuela NaN 1482.0 -70.0 -88.0 15.0 -89.0
def function1(ss: pd.Series):
    ss.loc['col1'] = ss.rolling(2).apply(lambda ss1: ss1.iloc[0] < 0 and ss1.iloc[1] < 0).eq(1).any()
    return ss

df1.set_index('Country').apply(function1, axis=1).query('col1')
Output:
2017 2018 2019 2020 2021 2022 col1
Country
Turkey NaN 47.0 -7.0 -19.0 38.0 260.0 True
Suriname NaN -68.0 -37.0 695.0 56.0 13.0 True
Zimbabwe NaN 106.0 2306.0 118.0 -83.0 -21.0 True
Venezuela NaN 1482.0 -70.0 -88.0 15.0 -89.0 True

Use condition in a dataframe to replace values in another dataframe with nan

I have a dataframe that contains concentration values for a set of samples as follows:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       20            20
A       30       23       20            nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       87            54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
I have a second dataframe that contains the proportion of concentration values that are not missing in the first dataframe:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       0.75     1        0.5           0.25
B       1        0.75     0.25          1
C       1        0        0             1
I would like to replace values in the first dataframe with nan when the corresponding proportion in the second dataframe is 0.5 or less. Hence, the resulting dataframe would look like the one below. Any help would be great!
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       nan           nan
A       30       23       nan           nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       nan           54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
Is this what you are looking for?
>>> df2.set_index('Sample').mask(lambda x: x <= 0.5) \
...     .mul(df1.set_index('Sample')).reset_index()
Sample Ethanol Acetone Formaldehyde Methane
0 A 15.0 20.00 NaN NaN
1 A 22.5 23.00 NaN NaN
2 A 15.0 23.00 NaN NaN
3 A NaN 20.00 NaN NaN
4 B 21.0 34.50 NaN 54.0
5 B 23.0 55.50 NaN 54.0
6 B 23.0 50.25 NaN 53.0
7 B 23.0 NaN NaN 33.0
8 C 23.0 NaN NaN 66.0
9 C 22.0 NaN NaN 88.0
10 C 22.0 NaN NaN 90.0
11 C 22.0 NaN NaN 88.0
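Note that the multiplication above rescales the surviving values (e.g. 15.0 = 20 * 0.75). If the goal is to keep the original concentrations untouched and only blank out the low-coverage cells, a where-based sketch (assuming df1 holds the concentrations and df2 the proportions) would be:
conc = df1.set_index('Sample')
prop = df2.set_index('Sample')
# broadcast each sample's proportions to all of its rows, then keep
# values only where the proportion exceeds 0.5
out = conc.where(prop.gt(0.5).reindex(conc.index)).reset_index()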

How to count months with at least 1 non NaN value?

I have this df:
CODE YEAR MONTH DAY TMAX TMIN PP
0 130 1991 1 1 32.6 23.4 0.0
1 130 1991 1 2 31.2 22.4 0.0
2 130 1991 1 3 32.0 NaN 0.0
3 130 1991 1 4 32.2 23.0 0.0
4 130 1991 1 5 30.5 22.0 0.0
... ... ... ... ... ... ...
20118 130 2018 9 30 31.8 21.2 NaN
30028 132 1991 1 1 35.2 NaN 0.0
30029 132 1991 1 2 34.6 NaN 0.0
30030 132 1991 1 3 35.8 NaN 0.0
30031 132 1991 1 4 34.8 NaN 0.0
... ... ... ... ... ... ...
45000 132 2019 10 5 35.5 NaN 21.1
46500 133 1991 1 1 35.5 NaN 21.1
I need to count the months that have at least 1 non-NaN value in the TMAX, TMIN and PP columns. If a month has all NaN values, that month doesn't count. I need to do this for each CODE.
Expected value:
CODE YEAR MONTH DAY TMAX TMIN PP JANUARY_TMAX FEBRUARY_TMAX MARCH_TMAX APRIL_TMAX etc
130 1991 1 1 32.6 23.4 0 23 25 22 27 …
130 1991 1 2 31.2 22.4 0 NaN NaN NaN NaN NaN
130 1991 1 3 32 NaN 0 NaN NaN NaN NaN NaN
130 1991 1 4 32.2 23 0 NaN NaN NaN NaN NaN
130 1991 1 5 30.5 22 0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
130 2018 9 30 31.8 21.2 NaN NaN NaN NaN NaN NaN
132 1991 1 1 35.2 NaN 0 21 23 22 22 …
132 1991 1 2 34.6 NaN 0 NaN NaN NaN NaN NaN
132 1991 1 3 35.8 NaN 0 NaN NaN NaN NaN NaN
132 1991 1 4 34.8 NaN 0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
132 2019 1 1 35.5 NaN 21.1 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
133 1991 1 1 35.5 NaN 21.1 25 22 22 21 …
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
For example: in CODE 130, for the TMAX column, I have 23 Januarys with at least 1 non-NaN value, 25 Februarys with at least 1 non-NaN value, etc.
Would you mind helping me? Thanks in advance.
This may not be super efficient, but here is how you can do it for one of the columns, TMAX in this case. Just repeat the process for the other columns.
# Count occurrences of each month when TMAX is not null
tmax_cts_long = (df[df.TMAX.notnull()]
                 .drop_duplicates(subset=['CODE', 'YEAR', 'MONTH'])
                 .groupby(['CODE', 'MONTH']).size()
                 .reset_index(name='COUNT'))
# Transpose the long table of counts to wide format
tmax_cts_wide = tmax_cts_long.pivot(index='CODE', columns='MONTH', values='COUNT')
# Merge table of counts with the original dataframe
final_df = df.merge(tmax_cts_wide, on='CODE', how='left')
# Keep the new columns only on the first row of each CODE; set the rest to NaN
mask = final_df.index.isin(df.groupby('CODE').head(1).index)
final_df.loc[~mask, [col for col in final_df.columns if isinstance(col, int)]] = None
# Rename new columns to follow the desired naming format
mon_dict = {1: 'JANUARY', 2: 'FEBRUARY', ...}
tmax_mon_dict = {k: v + '_TMAX' for k, v in mon_dict.items()}
final_df.rename(columns=tmax_mon_dict, inplace=True)
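A sketch generalizing the same steps to all three columns in one loop (assuming df and a complete mon_dict are defined as above); the first-row masking from the answer can then be applied once at the end:
final_df = df.copy()
for col in ['TMAX', 'TMIN', 'PP']:
    # months with at least one non-NaN value, counted once per (CODE, YEAR, MONTH)
    cts = (df[df[col].notna()]
           .drop_duplicates(subset=['CODE', 'YEAR', 'MONTH'])
           .groupby(['CODE', 'MONTH']).size()
           .unstack('MONTH'))
    cts.columns = [f'{mon_dict[m]}_{col}' for m in cts.columns]
    final_df = final_df.merge(cts, left_on='CODE', right_index=True, how='left')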