Rolling Rows in pandas.DataFrame

I have a dataframe that looks like this:
year month valueCounts
2019 1 73.411285
2019 2 53.589128
2019 3 71.103842
2019 4 79.528084
I want the valueCounts column's values to be rolled up one row, like this:
year month valueCounts
2019 1 53.589128
2019 2 71.103842
2019 3 79.528084
2019 4 NaN
I can do this by dropping the first row of the dataframe and assigning NaN to the last row, but that doesn't look efficient. Is there a simpler method to do this?
Thanks.

Assuming your dataframe is already sorted, use shift:
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)
# Output
year month valueCounts
0 2019 1 53.589128
1 2019 2 71.103842
2 2019 3 79.528084
3 2019 4 NaN
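For a self-contained check, here is a minimal reproducible sketch of the same answer (the values are copied from the question):
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2019, 2019, 2019],
    'month': [1, 2, 3, 4],
    'valueCounts': [73.411285, 53.589128, 71.103842, 79.528084],
})

# shift(-1) moves every value up one row; the last row becomes NaN
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)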

Related

Cumulative Deviation of 2 Columns in Pandas DF

I have a rather simple request and have not found a suitable solution online. I need to add the cumulative deviation as a new column to my DF, which looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And I need the new cumulative deviation column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, with the data in columns Z3:Z14 and AA3:AA14, the first row's formula would be =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1, the next row's =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1, and so on, with the last row being =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1.
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
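To attach the result as the new column from the question, assign the expression directly. A minimal sketch, using only the first three rows of the question's data and the question's 'Cum Dev' column name:
import pandas as pd

df = pd.DataFrame({
    'Curr Yr': [667590.5985, 701655.5967, 667260.5368],
    'LT Avg': [594474.2003, 585753.1173, 575550.6112],
})

# cumsum() accumulates each column top to bottom, mirroring Excel's
# =SUM(Z$3:Zn)/SUM(AA$3:AAn)-1 pattern
df['Cum Dev'] = df['Curr Yr'].cumsum() / df['LT Avg'].cumsum() - 1
print(df)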

How to join columns in Julia?

I have opened a dataframe in Julia where I have 3 columns like this:
day month year
1 1 2011
2 4 2015
3 12 2018
How can I make a new column called date that goes:
day month year date
1 1 2011 1/1/2011
2 4 2015 2/4/2015
3 12 2018 3/12/2018
I was trying with this:
df[!,:date]= df.day.*"/".*df.month.*"/".*df.year
but it didn't work.
In R I would do:
df$date=paste(df$day, df$month, df$year, sep="/")
Is there anything similar?
Thanks in advance!
Julia has a built-in Date type in its standard library. (Your attempt failed because .* concatenates strings, but the day, month, and year columns contain integers, so there is no matching method.) Construct the dates directly:
julia> using Dates
julia> df[!, :date] = Date.(df.year, df.month, df.day)
3-element Vector{Date}:
2011-01-01
2015-04-02
2018-12-03

Pandas Shift Column & Remove Row

I have a dataframe 'df1' that has 2 columns, and I need to shift the 2nd column down a row and then remove the entire top row of df1.
My data looks like this:
year ER12
0 2017 -2.05
1 2018 1.05
2 2019 -0.04
3 2020 -0.60
4 2021 -99.99
And, I need it to look like this:
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
We can try this:
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(df)
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
This works on your example:
import pandas as pd
df = pd.DataFrame({'year':[2017,2018,2019,2020,2021], 'ER12':[-2.05,1.05,-0.04,-0.6,-99.99]})
df['year'] = df['year'].shift(-1)
df = df.dropna()
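One caveat with this second approach: shift introduces a NaN, which upcasts the year column to float (2017 becomes 2018.0, and so on). If you want integer years back, cast after the dropna, as in this small follow-up sketch:
# the NaN is gone after dropna, so casting back to int is safe
df['year'] = df['year'].astype(int)
print(df)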

Reindexing Multiindex dataframe

I have a MultiIndex dataframe and I want to reindex it. However, I get a 'duplicate axis' error.
Product  Date            col1
A        September 2019     5
         October 2019       7
B        September 2019     2
         October 2019       4
How can I achieve output like this?
Product  Date            col1
A        January 2019       0
         February 2019      0
         March 2019         0
         April 2019         0
         May 2019           0
         June 2019          0
         July 2019          0
         August 2019        0
         September 2019     5
         October 2019       7
B        January 2019       0
         February 2019      0
         March 2019         0
         April 2019         0
         May 2019           0
         June 2019          0
         July 2019          0
         August 2019        0
         September 2019     2
         October 2019       4
First I tried this:
nested_df = nested_df.reindex(annual_date_range, level = 1, fill_value = 0)
Then I tried:
nested_df = nested_df.reset_index().set_index('Date')
nested_df = nested_df.reindex(annual_date_range, fill_value = 0)
One option is to insert the missing rows one at a time, for each product and month:
df.loc[('A', 'January 2019'), :] = (0)
df.loc[('B', 'January 2019'), :] = (0)
Let df1 be your first data frame with the non-zero values. The approach is to create another data frame df filled with zeros and concatenate the two data frames to obtain the result.
dates = ['{month}-2019'.format(month=month) for month in range(1,9)]*2
length = int(len(dates)/2)
products = ['A']*length + ['B']*length
Col1 = [0]*len(dates)
df = pd.DataFrame({'Dates': dates, 'Products': products, 'Col1':Col1}).set_index(['Products','Dates'])
Now the date level of the MultiIndex is converted to a sortable format (set_levels no longer supports inplace=True in recent pandas, so reassign the index instead):
df.index = df.index.set_levels(pd.to_datetime(df.index.get_level_values(1)[:8]).strftime('%m-%Y'), level=1)
In df1 you have to do the same, i.e. convert its date level to the same format:
df1.index = df1.index.set_levels(pd.to_datetime(df1.index.get_level_values(1)[:2]).strftime('%m-%Y'), level=1)
This step matters because otherwise (for example, if the dates are formatted like %B %y) sorting the MultiIndex by month goes wrong. Now it is sufficient to concatenate both data frames:
result = pd.concat([df1, df]).sort_values(['Products', 'Dates'])
The final move is to change the datetime format back to month names:
result.index = result.index.set_levels(pd.to_datetime(result.index.get_level_values(1)[:10]).strftime('%B %Y'), level=1)
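An alternative worth considering: build the full index once with pd.MultiIndex.from_product and reindex against it, which avoids per-level reindexing. A minimal sketch, assuming nested_df is indexed by levels named Product and Date, with Date holding 'Month YYYY' strings as in the question:
import pandas as pd

# every month of 2019 up to October, rendered in the same 'Month YYYY' format
months = pd.date_range('2019-01-01', '2019-10-01', freq='MS').strftime('%B %Y')

# cartesian product of every product with every month
full_index = pd.MultiIndex.from_product(
    [nested_df.index.get_level_values('Product').unique(), months],
    names=['Product', 'Date'],
)

# rows missing from the original frame are filled with 0
nested_df = nested_df.reindex(full_index, fill_value=0)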

Incremental id based on another column's value

From this DataFrame:
car_id month
93829 September
27483 April
48372 October
93829 December
93829 March
48372 February
27483 March
How can I add a third column that is basically a new id for each car, but an incremental one, like this:
car_id month new_incremental_car_id
93829 September 0
27483 April 1
48372 October 2
93829 December 0
93829 March 0
48372 February 2
27483 March 1
Currently I'm doing it by using groupby('car_id') to create a new DataFrame, adding an incremental column to it, and then joining it back to the original DataFrame on car_id.
Is there a less cumbersome, more direct method to achieve this goal?
EDIT
The code I'm currently using:
cars_id = pd.DataFrame(list(car_sales.groupby('car_id')['car_id'].groups))
cars_id['car_short_id'] = cars_id.index
cars_id.set_index(0, inplace=True)
car_sales.join(cars_id, on='car_id', how='left')
Apart from pd.factorize, you can map a dict constructed from the unique values:
In [959]: df.car_id.map({x: i for i, x in enumerate(df.car_id.unique())})
Out[959]:
0 0
1 1
2 2
3 0
4 0
5 2
6 1
Name: car_id, dtype: int64
Or use the category dtype and its codes, though the ids will not be in order of first appearance:
In [954]: df.car_id.astype('category').cat.codes
Out[954]:
0 2
1 0
2 1
3 2
4 2
5 1
6 0
dtype: int8
Use the factorize method:
In [49]: df['new_incremental_car_id'] = pd.factorize(df.car_id)[0].astype(np.uint16)
In [50]: df
Out[50]:
car_id month new_incremental_car_id
0 93829 September 0
1 27483 April 1
2 48372 October 2
3 93829 December 0
4 93829 March 0
5 48372 February 2
6 27483 March 1
In [51]: df.dtypes
Out[51]:
car_id int64
month object
new_incremental_car_id uint16
dtype: object
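Another option in the same spirit: groupby(...).ngroup() numbers the groups directly, and sort=False keeps the ids in order of first appearance, matching the desired output:
df['new_incremental_car_id'] = df.groupby('car_id', sort=False).ngroup()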