Reindexing Multiindex dataframe - pandas

I have a MultiIndex DataFrame and I want to reindex it. However, I get a 'duplicate axis' error.
Product  Date            col1
A        September 2019  5
         October 2019    7
B        September 2019  2
         October 2019    4
How can I achieve output like this?
Product  Date            col1
A        January 2019    0
         February 2019   0
         March 2019      0
         April 2019      0
         May 2019        0
         June 2019       0
         July 2019       0
         August 2019     0
         September 2019  5
         October 2019    7
B        January 2019    0
         February 2019   0
         March 2019      0
         April 2019      0
         May 2019        0
         June 2019       0
         July 2019       0
         August 2019     0
         September 2019  2
         October 2019    4
First I tried this:
nested_df = nested_df.reindex(annual_date_range, level = 1, fill_value = 0)
Then I tried this:
nested_df = nested_df.reset_index().set_index('Date')
nested_df = nested_df.reindex(annual_date_range, fill_value = 0)

You should do the following for each month:
df.loc[('A', 'January 2019'), :] = (0)
df.loc[('B', 'January 2019'), :] = (0)

Let df1 be your first data frame with non-zero values. The approach is to create another data frame df with zero values and merge both data frames to obtain the result.
dates = ['{month}-2019'.format(month=month) for month in range(1,9)]*2
length = int(len(dates)/2)
products = ['A']*length + ['B']*length
Col1 = [0]*len(dates)
df = pd.DataFrame({'Dates': dates, 'Products': products, 'Col1':Col1}).set_index(['Products','Dates'])
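Now the date level of the MultiIndex is converted to a uniform '%m-%Y' string format. Note that set_levels returns a new index (its inplace argument was removed in pandas 2.0), so the result is assigned back:
df.index = df.index.set_levels(pd.to_datetime(df.index.get_level_values(1)[:8]).strftime('%m-%Y'), level=1)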
In df1 you have to do the same, i.e. change the date level of the MultiIndex to the same format (again assigning the result back rather than using inplace):
df1.index = df1.index.set_levels(pd.to_datetime(df1.index.get_level_values(1)[:2]).strftime('%m-%Y'), level=1)
I did this because otherwise (for example, if the dates are formatted like %B %Y) the MultiIndex sorts the months alphabetically instead of chronologically. Now it is sufficient to concatenate both data frames:
result = pd.concat([df1,df]).sort_values(['Products','Dates'])
The final step is to change the date format back:
result.index = result.index.set_levels(pd.to_datetime(result.index.get_level_values(1)[:10]).strftime('%B %Y'), level=1)
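An alternative to building and concatenating a zero-filled frame is to reindex against the full product of products and months. This is only a minimal sketch, assuming the Date level holds real datetimes (month starts) rather than strings. The sample frame is reconstructed from the question, and names such as months and full_index are illustrative, not from the original post:

```python
import pandas as pd

# Sample data reconstructed from the question (assumption: Date is a datetime level).
df1 = pd.DataFrame(
    {"col1": [5, 7, 2, 4]},
    index=pd.MultiIndex.from_product(
        [["A", "B"], pd.to_datetime(["2019-09-01", "2019-10-01"])],
        names=["Product", "Date"],
    ),
)

# Every month from January to October 2019, as month-start timestamps.
months = pd.date_range("2019-01-01", "2019-10-01", freq="MS")

# Target index: every (Product, Date) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [df1.index.get_level_values("Product").unique(), months],
    names=["Product", "Date"],
)

# Reindexing against a complete MultiIndex fills the missing rows with 0.
result = df1.reindex(full_index, fill_value=0)
print(result)
```

Because the target is a full MultiIndex of unique pairs, this avoids reindexing a single level that contains repeated dates, which is the situation in which the second attempt in the question runs into the duplicate axis error.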

Related

value_counts() returns removed/filtered-out data in "Categorical" datatype in Pandas

Could someone please clarify this for me:
df = pd.DataFrame({'years': [2015, 2016, 2017,2017, 2018, 2019, 2019, 2020]})
df['years'] = df['years'].astype('category')
print(df.dtypes)
years category
dtype: object
Now I create a new variable holding the years I want to keep:
subset_years = [2015, 2016, 2017, 2018]
Then I filter the years:
subset_df = df[df['years'].isin(subset_years)]
print(subset_df)
years
0 2015
1 2016
2 2017
3 2017
4 2018
Now I take the unique elements:
subset_df.years.unique()
and I get:
[2015, 2016, 2017, 2018]
Categories (4, int64): [2015, 2016, 2017, 2018]
But if I do subset_df.years.value_counts(), I get:
2015 1
2016 1
2017 2
2018 1
2019 0
2020 0
Name: years, dtype: int64
My question is: why does subset_df.years.value_counts() return 2019 and 2020 with a count of 0? Since I already filtered the years, weren't those supposed to be removed by the subset/filter?
Could someone please clarify what is happening?
This is because 2019 and 2020 are still among the column's categories; filtering rows does not change the categorical dtype. You can reset the categories before calling value_counts if you don't want the filtered-out years to show up:
subset_df.years.cat.set_categories(subset_years).value_counts()
#2017 2
#2015 1
#2016 1
#2018 1
#Name: years, dtype: int64
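If you would rather not pass the list of years again, dropping the unused categories should give the same counts. A small sketch reusing the question's data (the variable name counts is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"years": [2015, 2016, 2017, 2017, 2018, 2019, 2019, 2020]})
df["years"] = df["years"].astype("category")

subset_years = [2015, 2016, 2017, 2018]
subset_df = df[df["years"].isin(subset_years)]

# The filtered column still carries all six original categories,
# so drop the ones that no longer occur before counting.
counts = subset_df["years"].cat.remove_unused_categories().value_counts()
print(counts)
```

This only changes the categories of the Series being counted; subset_df itself keeps the original categorical dtype unless you assign the result back.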

Rolling Rows in pandas.DataFrame

I have a dataframe that looks like this:
year  month  valueCounts
2019  1      73.411285
2019  2      53.589128
2019  3      71.103842
2019  4      79.528084
I want the values of the valueCounts column to be rolled like this:
year  month  valueCounts
2019  1      53.589128
2019  2      71.103842
2019  3      79.528084
2019  4      NaN
I can do this by dropping the first row of the dataframe and setting the last row to NaN, but that doesn't look efficient. Is there a simpler way to do this?
Thanks.
Assuming your dataframe is already sorted, use shift:
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)
# Output
year month valueCounts
0 2019 1 53.589128
1 2019 2 71.103842
2 2019 3 79.528084
3 2019 4 NaN

Pandas Shift Column & Remove Row

I have a dataframe 'df1' that has 2 columns, and I need to shift the 2nd column down a row and then remove the entire top row of df1.
My data looks like this:
year ER12
0 2017 -2.05
1 2018 1.05
2 2019 -0.04
3 2020 -0.60
4 2021 -99.99
And, I need it to look like this:
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
We can try this:
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(df)
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
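This also works on your example:
import pandas as pd
df = pd.DataFrame({'year':[2017,2018,2019,2020,2021], 'ER12':[-2.05,1.05,-0.04,-0.6,-99.99]})
# Shifting 'year' up by one row is equivalent to shifting 'ER12' down by one row
df['year'] = df['year'].shift(-1)
# The shift introduces NaN (making 'year' float), so drop that row and cast back to int
df = df.dropna().astype({'year': int})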

Pandas 1.0 create column of months from year and date

I have a dataframe df with values as:
df.iloc[1:4, 7:9]
Year Month
38 2020 4
65 2021 4
92 2022 4
I am trying to create a new MonthIdx column as:
df['MonthIdx'] = pd.to_timedelta(df['Year'], unit='Y') + pd.to_timedelta(df['Month'], unit='M') + pd.to_timedelta(1, unit='D')
But I get the error:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Following is the desired output:
df['MonthIdx']
MonthIdx
38 2020/04/01
65 2021/04/01
92 2022/04/01
You can zero-pad the month values as strings, concatenate them with the year, and parse the result as a datetime. Building the strings from the columns themselves keeps the original index (38, 65, 92), so the assignment aligns correctly:
month = df.Month.astype(str).str.pad(width=2, side='left', fillchar='0')
df['MonthIdx'] = pd.to_datetime(df['Year'].astype(str) + month, format='%Y%m')
This will give you:
Year Month MonthIdx
0 2020 4 2020-04-01
1 2021 4 2021-04-01
2 2022 4 2022-04-01
You can then format the dates as strings to match your desired output exactly:
df['MonthIdx'] = df['MonthIdx'].dt.strftime('%Y/%m/%d')
Giving you:
Year Month MonthIdx
0 2020 4 2020/04/01
1 2021 4 2021/04/01
2 2022 4 2022/04/01
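Another option that avoids string padding is to let pd.to_datetime assemble the dates from year, month and day columns. A minimal sketch, assuming the frame looks like the excerpt in the question; the helper frame parts and the constant day column are introduced here only for illustration:

```python
import pandas as pd

# Frame reconstructed from the question's excerpt, including its index labels.
df = pd.DataFrame({"Year": [2020, 2021, 2022], "Month": [4, 4, 4]},
                  index=[38, 65, 92])

# to_datetime can assemble dates from columns named year, month and day.
parts = (
    df[["Year", "Month"]]
    .rename(columns={"Year": "year", "Month": "month"})
    .assign(day=1)  # first day of each month
)
df["MonthIdx"] = pd.to_datetime(parts)

# Optional: string form matching the format requested in the question.
as_text = df["MonthIdx"].dt.strftime("%Y/%m/%d")
print(df)
print(as_text)
```

Keeping MonthIdx as a datetime column and formatting only for display tends to be more convenient for later date arithmetic than storing the strings.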

Select Rows Where MultiIndex Is In Another DataFrame

I have one DataFrame (DF1) with a MultiIndex and many additional columns. In another DataFrame (DF2) I have 2 columns containing a set of values from the MultiIndex. I would like to select the rows from DF1 where the MultiIndex matches the values in DF2.
df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
'year': [2012, 2012, 2014, 2013, 2014],
'sale':[55, 17, 40, 84, 31]})
df1 = df1.set_index(['year','month'])
sale
year month
2012 1 55
2012 3 17
2014 4 40
2013 7 84
2014 10 31
df2 = pd.DataFrame({'year': [2012,2014],
'month': [1, 10]})
year month
0 2012 1
1 2014 10
I'd like to create a new DataFrame that would be:
sale
year month
2012 1 55
2014 10 31
I've tried many variations using .isin, .loc, slicing, but keep running into errors.
You could just set_index on df2 the same way and pass the index:
In[110]:
df1.loc[df2.set_index(['year','month']).index]
Out[110]:
sale
year month
2012 1 55
2014 10 31
A more readable version:
In[111]:
idx = df2.set_index(['year','month']).index
df1.loc[idx]
Out[111]:
sale
year month
2012 1 55
2014 10 31
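If some of the pairs in df2 might not exist in df1, a boolean mask built with isin may be a safer variant, since .loc with missing keys can raise a KeyError. A sketch using the question's data; the names idx and selected are just for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"month": [1, 3, 4, 7, 10],
                    "year": [2012, 2012, 2014, 2013, 2014],
                    "sale": [55, 17, 40, 84, 31]}).set_index(["year", "month"])
df2 = pd.DataFrame({"year": [2012, 2014], "month": [1, 10]})

# Build a MultiIndex of the wanted (year, month) pairs, in df1's level order.
idx = pd.MultiIndex.from_frame(df2[["year", "month"]])

# Keep the rows of df1 whose index tuple appears in idx.
selected = df1[df1.index.isin(idx)]
print(selected)
```

Note that the mask keeps the rows in df1's order, whereas .loc returns them in the order given by df2.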