Under the variable names, there is an extra row that I do not want in my data set
fdi_autocracy = fdi_autocracy.pivot(index=["Country", "regime", "Year"],
columns="partner_regime",
values=["FDI_outward", "FDI_inward", "total_fdi"],
).reset_index()
Country regime Year FDI_outward FDI_inward total_fdi
partner_regime 0.0 0.0 0.0
0 Albania 0.0 1995 NaN NaN NaN
1 Albania 0.0 1996 NaN NaN NaN
2 Albania 0.0 1997 NaN NaN NaN
3 Albania 0.0 1998 NaN NaN NaN
4 Albania 0.0 1999 NaN NaN NaN
What I want is the following:
Country regime Year FDI_outward FDI_inward total_fdi
0 Albania 0.0 1995 NaN NaN NaN
1 Albania 0.0 1996 NaN NaN NaN
2 Albania 0.0 1997 NaN NaN NaN
3 Albania 0.0 1998 NaN NaN NaN
4 Albania 0.0 1999 NaN NaN NaN
IIUC, you don't need the partner_regime label. This removes that title:
fdi_autocracy = fdi_autocracy.rename_axis(columns=[None, None])
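For illustration, here is a minimal, self-contained sketch of the same fix; the toy frame below is a hypothetical stand-in, not the original fdi_autocracy data:
import pandas as pd

# Hypothetical stand-in data with the column names from the question
df = pd.DataFrame({"Country": ["Albania", "Albania"],
                   "regime": [0.0, 0.0],
                   "Year": [1995, 1996],
                   "partner_regime": [0.0, 0.0],
                   "FDI_outward": [1.0, 2.0],
                   "FDI_inward": [3.0, 4.0],
                   "total_fdi": [4.0, 6.0]})

wide = df.pivot(index=["Country", "regime", "Year"],
                columns="partner_regime",
                values=["FDI_outward", "FDI_inward", "total_fdi"]).reset_index()

# Clearing both column-level names removes the extra partner_regime header row
wide = wide.rename_axis(columns=[None, None])
print(wide)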
Is there a way in pandas to fill in the value according to weight ranges when pivoting the dataframe? I have seen some answers that set bins, but here the weight ranges vary depending on how the data is entered.
Here's my dataset.
import pandas as pd
df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1],
'services': ["A","A","A","A","A","A","A","A","A"],
'weight_start': [1,61,161,201,1,1,61,161,201],
'weight_end': [60,160,200,500,500,60,160,200,500],
'location': [1,1,1,1,2,3,3,3,3],
'discount': [70,30,10,0,0,60,20,5,0]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],columns='location',values='discount')
display(pivot_df)
Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 NaN 60.0
500 NaN 0.0 NaN
61 160 30.0 NaN 20.0
161 200 10.0 NaN 5.0
201 500 0.0 NaN 0.0
Desired Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
Since location 2 has a 0 percent discount covering the range 1 to 500, I want it to populate 0 across the ranges prescribed for tier 1 service A instead of having its own row.
Edit: Mozway's answer works when there is one service. When I added a second service, the dataframe ungrouped.
Here's the new dataset with service B.
import pandas as pd
df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
'services': ["A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"],
'weight_start': [1,61,161,201,1,1,61,161,201,1,1,81,101,1,61,161,201],
'weight_end': [60,160,200,500,500,60,160,200,500,500,80,100,200,60,160,200,500],
'location': [1,1,1,1,2,3,3,3,3,1,2,2,2,3,3,3,3],
'discount': [70,30,10,0,0,60,20,5,0,50,70,50,10,65,55,45,5]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],columns='location',values='discount')
display(pivot_df)
Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 NaN 60.0
500 NaN 0.0 NaN
61 160 30.0 NaN 20.0
161 200 10.0 NaN 5.0
201 500 0.0 NaN 0.0
B 1 60 NaN NaN 65.0
80 NaN 70.0 NaN
500 50.0 NaN NaN
61 160 NaN NaN 55.0
81 100 NaN 50.0 NaN
101 200 NaN 10.0 NaN
161 200 NaN NaN 45.0
201 500 NaN NaN 5.0
Desired Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 60 50 70 65.0
80 50 70.0 55
61 160 50 NaN 55.0
81 100 50 50.0 55
101 200 50 10.0 NaN
161 200 50 10 45.0
201 500 50 NaN 5.0
This will work
data = (df.set_index(['tier','services','weight_start','weight_end'])
.pivot(columns='location')['discount']
.reset_index()
.rename_axis(None, axis=1)
)
IIUC, you can (temporarily) exclude the columns that contain only 0/NaN and check whether all remaining values per row are NaN. If so, drop those rows:
mask = ~pivot_df.loc[:, pivot_df.any()].isna().all(axis=1)
out = pivot_df[mask].fillna(0)
output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
per group:
def drop(d):
    mask = ~d.loc[:, d.any()].isna().all(axis=1)
    return d[mask].fillna(0)
out = pivot_df.groupby(['services']).apply(drop)
output:
location 1 2 3
services tier services weight_start weight_end
A 1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 B 1 60 0.0 0.0 65.0
80 0.0 70.0 0.0
500 50.0 0.0 0.0
61 160 0.0 0.0 55.0
81 100 0.0 50.0 0.0
101 200 0.0 10.0 0.0
161 200 0.0 0.0 45.0
201 500 0.0 0.0 5.0
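Note that groupby(...).apply adds the grouping key as an extra index level, which is why services appears twice above. If that duplicated level is unwanted, it can be dropped afterwards; a small follow-up on the out from above:
out = out.droplevel(0)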
From the given table of inflation rates below, I want to obtain the countries with negative inflation rates for two consecutive years.
2017 2018 2019 2020 2021 2022
Country
Turkey NaN 47.0 -7.0 -19.0 38.0 260.0
Argentina NaN 33.0 56.0 -22.0 15.0 8.0
Suriname NaN -68.0 -37.0 695.0 56.0 13.0
Zimbabwe NaN 106.0 2306.0 118.0 -83.0 -21.0
Lebanon NaN 2.0 -36.0 2826.0 82.0 39.0
Sudan NaN 96.0 -19.0 220.0 19.0 34.0
Venezuela NaN 1482.0 -70.0 -88.0 15.0 -89.0
I have seen some solutions on SO that use list comprehensions or loops. I wonder if this task is possible without them.
I attempted to convert the dataframe into 1s and 0s, in which 1.0 indicates a negative inflation rate.
2017 2018 2019 2020 2021 2022
Country
Turkey NaN 0.0 1.0 1.0 0.0 0.0
Argentina NaN 0.0 0.0 1.0 0.0 0.0
Suriname NaN 1.0 1.0 0.0 0.0 0.0
Zimbabwe NaN 0.0 0.0 0.0 1.0 1.0
Lebanon NaN 0.0 1.0 0.0 0.0 0.0
Sudan NaN 0.0 1.0 0.0 0.0 0.0
Venezuela NaN 0.0 1.0 1.0 0.0 1.0
However, I am stuck at this point. I tried to use the np.prod function, but this returns 0 if at least one column has 0.0 in it.
Any ideas about how to solve this problem?
You can first set an integer mask for the negative values (1 means negative). Then compute a rolling min along axis 1: if the min is 1, all values in the window are negative. This is generalizable to any number of consecutive columns.
N = 2
m = df.lt(0).astype(int)
m2 = m.rolling(N, axis=1).min().eq(1).any(axis=1)
df[m2]
Output:
2017 2018 2019 2020 2021 2022
Country
Turkey NaN 47.0 -7.0 -19.0 38.0 260.0
Suriname NaN -68.0 -37.0 695.0 56.0 13.0
Zimbabwe NaN 106.0 2306.0 118.0 -83.0 -21.0
Venezuela NaN 1482.0 -70.0 -88.0 15.0 -89.0
NB. One needs to work with integers, as rolling is currently limited to numeric types.
Alternative with a single mask for N=2:
m = df.lt(0)
df[(m&m.shift(axis=1)).any(axis=1)]
Try this:
match = (df.lt(0) & df.shift(axis=1).lt(0)).any(axis=1)
df[match]
How it works:
df.lt(0): current year inflation is less than 0
df.shift(axis=1).lt(0): previous year inflation is less than 0
.any(axis=1): any such occurrence in the country.
Given your dataframe, this is what would work for me:
Set Country as the index so the df values contain only numbers.
Define a new column checking for 'two sequential negatives' across the year columns, using df.shift(axis=1).
So it would look like:
df.set_index('Country',inplace=True)
df['TwoNegatives'] = ((df.values < 0) & ((df.shift(axis=1)).values <0)).any(axis=1)
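To then keep only the matching countries, one possible follow-up using the new boolean column:
df[df['TwoNegatives']]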
Try with rolling:
out = df[df.lt(0).T.rolling(window=2).sum().ge(2).any()]
Out[15]:
2017 2018 2019 2020 2021 2022
Country
Turkey NaN 47.0 -7.0 -19.0 38.0 260.0
Suriname NaN -68.0 -37.0 695.0 56.0 13.0
Zimbabwe NaN 106.0 2306.0 118.0 -83.0 -21.0
Venezuela NaN 1482.0 -70.0 -88.0 15.0 -89.0
def function1(ss: pd.Series):
    ss.loc['col1'] = ss.rolling(2).apply(lambda ss1: ss1.iloc[0] < 0 and ss1.iloc[1] < 0).eq(1).any()
    return ss

df.set_index('Country').apply(function1, axis=1).query('col1')
Output:
2017 2018 2019 2020 2021 2022 col1
Country
Turkey NaN 47.0 -7.0 -19.0 38.0 260.0 True
Suriname NaN -68.0 -37.0 695.0 56.0 13.0 True
Zimbabwe NaN 106.0 2306.0 118.0 -83.0 -21.0 True
Venezuela NaN 1482.0 -70.0 -88.0 15.0 -89.0 True
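If the helper column is not wanted in the final result, it can be dropped afterwards, e.g.:
df.set_index('Country').apply(function1, axis=1).query('col1').drop(columns='col1')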
I have this df:
CODE YEAR MONTH DAY TMAX TMIN PP
0 130 1991 1 1 32.6 23.4 0.0
1 130 1991 1 2 31.2 22.4 0.0
2 130 1991 1 3 32.0 NaN 0.0
3 130 1991 1 4 32.2 23.0 0.0
4 130 1991 1 5 30.5 22.0 0.0
... ... ... ... ... ... ...
20118 130 2018 9 30 31.8 21.2 NaN
30028 132 1991 1 1 35.2 NaN 0.0
30029 132 1991 1 2 34.6 NaN 0.0
30030 132 1991 1 3 35.8 NaN 0.0
30031 132 1991 1 4 34.8 NaN 0.0
... ... ... ... ... ... ...
45000 132 2019 10 5 35.5 NaN 21.1
46500 133 1991 1 1 35.5 NaN 21.1
I need to count the months that have at least 1 non-NaN value in the TMAX, TMIN and PP columns. If a month has only NaN values, that month doesn't count. I need to do this for each CODE.
Expected output:
CODE YEAR MONTH DAY TMAX TMIN PP JANUARY_TMAX FEBRUARY_TMAX MARCH_TMAX APRIL_TMAX etc
130 1991 1 1 32.6 23.4 0 23 25 22 27 …
130 1991 1 2 31.2 22.4 0 NaN NaN NaN NaN NaN
130 1991 1 3 32 NaN 0 NaN NaN NaN NaN NaN
130 1991 1 4 32.2 23 0 NaN NaN NaN NaN NaN
130 1991 1 5 30.5 22 0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
130 2018 9 30 31.8 21.2 NaN NaN NaN NaN NaN NaN
132 1991 1 1 35.2 NaN 0 21 23 22 22 …
132 1991 1 2 34.6 NaN 0 NaN NaN NaN NaN NaN
132 1991 1 3 35.8 NaN 0 NaN NaN NaN NaN NaN
132 1991 1 4 34.8 NaN 0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
132 2019 1 1 35.5 NaN 21.1 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
133 1991 1 1 35.5 NaN 21.1 25 22 22 21 …
... ... ... ... ... ... ... NaN NaN NaN NaN NaN
For example: in code 130, for the TMAX column, I have 23 Januaries with at least 1 non-NaN value, 25 Februaries with at least 1 non-NaN value, etc.
Would you mind helping me? Thanks in advance.
This may not be super efficient, but here is how you can do it for one of the columns, TMAX in this case. Just repeat the process for the other columns; a sketch generalizing to all three columns follows after the snippet.
# Count occurrences of each month when TMAX is not null
tmax_cts_long = df[df.TMAX.notnull()].drop_duplicates(subset=['CODE', 'YEAR', 'MONTH']).groupby(['CODE', 'MONTH']).size().reset_index(name='COUNT')
# Transpose the long table of counts to wide format
tmax_cts_wide = tmax_cts_long.pivot(index='CODE', columns='MONTH', values='COUNT')
# Merge table of counts with the original dataframe
final_df = df.merge(tmax_cts_wide, on='CODE', how='left')
# Replace values in new columns in all rows after the first row with NaN
mask = final_df.index.isin(df.groupby(['CODE', 'MONTH']).head(1).index)
final_df.loc[~mask, [col for col in final_df.columns if isinstance(col, int)]] = None
# Rename new columns to follow the desired naming format
mon_dict = {1: 'JANUARY', 2: 'FEBRUARY', 3: 'MARCH', 4: 'APRIL', 5: 'MAY', 6: 'JUNE',
            7: 'JULY', 8: 'AUGUST', 9: 'SEPTEMBER', 10: 'OCTOBER', 11: 'NOVEMBER', 12: 'DECEMBER'}
tmax_mon_dict = {k: v + '_TMAX' for k, v in mon_dict.items()}
final_df.rename(columns=tmax_mon_dict, inplace=True)
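To repeat the process for TMIN and PP as well, the same steps can be wrapped in a loop. Here is a minimal sketch under the assumption that df has a default RangeIndex; it reuses mon_dict from above, and the intermediate names (cts_long, cts_wide, first_rows) are hypothetical:
# Generalize the steps above to all three columns in one loop
final_df = df.copy()
# Positions of the first row of each (CODE, MONTH) group; assumes a RangeIndex
first_rows = df.groupby(['CODE', 'MONTH']).head(1).index

for col in ['TMAX', 'TMIN', 'PP']:
    # Count months with at least one non-NaN value for this column
    cts_long = (df[df[col].notnull()]
                .drop_duplicates(subset=['CODE', 'YEAR', 'MONTH'])
                .groupby(['CODE', 'MONTH']).size().reset_index(name='COUNT'))
    # Rename before merging so the three columns do not collide on month numbers
    cts_wide = (cts_long.pivot(index='CODE', columns='MONTH', values='COUNT')
                .rename(columns={k: v + '_' + col for k, v in mon_dict.items()}))
    final_df = final_df.merge(cts_wide, on='CODE', how='left')
    # Blank out the counts everywhere except the first row of each group
    new_cols = list(cts_wide.columns)
    final_df.loc[~final_df.index.isin(first_rows), new_cols] = None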