Calculate a value based on another column's value at a previous row - pandas

Total beginner here; if my question is irrelevant, apologies in advance and I'll remove it. Using pandas, I want to calculate an evolution ratio for a week's data compared with the mean of the previous rolling 4 weeks:
df['rolling_mean_fourweeks'] = df.rolling(4).mean().round(decimals=1)
From here I want to create a new column for the evolution ratio, based on the week's data compared with the rolling mean at the previous week's row.
What is the best way to go here? (I don't have big data.) I have tried unsuccessfully with .shift(), but I am very unfamiliar with it. I should get NaN for week 3 (the fourth week) and ~47% for the fifth week.
Any suggestion for retrieving the value at the row one step back?
Thanks and have a good day!

Your idea about using shift works perfectly. shift(x) simply shifts a Series (a full column, in your case) by x steps.
A simple way to check whether rolling_mean_fourweeks is a good predictor is to shift Column1 and then check how it differs from rolling_mean_fourweeks:
df['column1_shifted'] = df['Column1'].shift(-1)  # next week's value on the current row
df['rolling_accuracy'] = ((df['column1_shifted'] - df['rolling_mean_fourweeks'])
                          / df['rolling_mean_fourweeks'])
resulting in a new rolling_accuracy column holding the relative error between each prediction and the actual value.
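For the evolution ratio the question actually asks for (this week's value against the rolling mean as of the previous week), a minimal sketch is to shift the rolling-mean column itself by one step; 'Column1' here stands in for whatever column holds the weekly values:
prev_mean = df['rolling_mean_fourweeks'].shift(1)  # rolling mean as of the previous week
df['evolution_ratio'] = (df['Column1'] - prev_mean) / prev_mean * 100
The first non-NaN ratio appears at the fifth week: the 4-week rolling mean is NaN for the first three rows, and the shift pushes the NaNs down one more row.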

Related

How to set Custom Business Day End Frequency in Pandas

I have a pandas dataframe with an unusual DatetimeIndex. The frame contains daily data (end of each day) from 1985 to 1990 but some "random" days are missing:
DatetimeIndex(['1985-01-02', '1985-01-03', '1985-01-04', '1985-01-07',
'1985-01-08', '1985-01-09', '1985-01-10', '1985-01-11',
'1985-01-14', '1985-01-15',
...
'1990-12-17', '1990-12-18', '1990-12-19', '1990-12-20',
'1990-12-21', '1990-12-24', '1990-12-26', '1990-12-27',
'1990-12-28', '1990-12-31'],
dtype='datetime64[ns]', name='date', length=1516, freq=None)
I often need operations like shifting an entire column such that a value on the last day of a month (which in my DatetimeIndex could be, e.g., '1985-05-30') is shifted to the last day of the next month (which in my DatetimeIndex could be, e.g., '1985-06-27').
While looking for a smart way to perform such shifts, I stumbled over the offset aliases provided by pandas.tseries.offsets. Among them are custom business day frequency (C) and custom business month end frequency (CBM). Looking at an example, it seems like this could provide exactly what I need:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

mth_us = pd.offsets.CustomBusinessMonthEnd(calendar=USFederalHolidayCalendar())
day_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_us)  # shifted by 1 month
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_us)  # shifted by 1 day
The problem is that my DatetimeIndex does not follow USFederalHolidayCalendar(). Can someone please tell me how I can use pd.offsets.CustomBusinessMonthEnd (and also pd.offsets.CustomBusinessDay) with my own custom DatetimeIndex?
If not, does anyone have an idea how to tackle this issue in a different way?
Thanks a lot for your help!
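One way to tackle this (a sketch, not from the original thread): CustomBusinessDay and CustomBusinessMonthEnd also accept an explicit holidays list, so every business day that never appears in your own index can be treated as a "holiday":
import pandas as pd

# business days spanning the index that are missing from it
all_bdays = pd.bdate_range(df.index.min(), df.index.max())
missing = all_bdays.difference(df.index)

day_custom = pd.offsets.CustomBusinessDay(holidays=missing)
mth_custom = pd.offsets.CustomBusinessMonthEnd(holidays=missing)
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_custom)  # month-end shift on your own calendar
This keeps the month-end logic aligned with the days that actually occur in the data rather than with a federal holiday calendar.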

Pandas to calculate best and worst rolling average returns

Given a list of market data (date, open, high, low, close), how would I create a list of best and worst returns for a given period of exact calendar time?
Where Return = df['close'] / df['close'] from exactly 6 years in the past (not just a set number of rows back).
The result - Top 6 Years of best rolling returns:
1/1/1975 - 1/1/1981, 345.2%
2/1/1990 - 1/31/1997, 331.5%
etc.
Then for the worst returns, the same thing
Date-Date, %
Date-Date, %
etc...
I can brute-force do it with just Python and not much Pandas, but I bet there is some cool Pandas way to more elegantly do it. Thanks in advance
Try using df.rolling(window=7, center=True).mean(). This creates a window over 7 rows of data and then averages them.
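For the exact-calendar-time comparison the question asks about, a minimal sketch (assuming df has a DatetimeIndex and a 'close' column; variable names are illustrative) is to shift the index by a DateOffset instead of by rows:
import pandas as pd

# move each close forward 6 years on the index, then let division align by date
past = df['close'].shift(freq=pd.DateOffset(years=6))
returns = (df['close'] / past - 1).dropna()  # NaN wherever no exact date match exists

best = returns.nlargest(6)    # six best 6-year returns, labeled by the period's end date
worst = returns.nsmallest(6)  # six worst 6-year returns
Market calendars are irregular, so an exact match 6 years later may not exist; reindexing past onto df.index with method='nearest' is one way to loosen that.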

Subtract the mean of a group for a column away from a column value

I have a companies dataset with 35 columns. The companies can belong to one of 8 different groups. How do I, for each group, create a new dataframe which subtracts the mean of each column for that group from the original value?
Here is an example of part of the dataset.
So for example, for row 1 I want to subtract the mean of BANK_AND_DEP for Consumer Markets from the value of 7204.400207. I need to do this for each column.
I assume this is some kind of combination of a transform and a lambda, but I can't get the syntax right.
Although it might seem counter-intuitive for this to involve a loop at all, looping through the columns themselves lets you do the subtraction as a vectorized operation, which will be quicker than .apply(). For what to subtract, combine .groupby() and .transform() to get each group's mean; then just subtract it:
for column in df.columns:
    if column == 'Cluster':  # skip the grouping column itself
        continue
    df['new_' + column] = df[column] - df.groupby('Cluster')[column].transform('mean')
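Since the question guesses at "some kind of combination of a transform and a lambda", the same result is indeed available without the explicit loop; a minimal sketch, assuming the group labels live in the 'Cluster' column and all remaining columns are numeric:
# subtract each group's mean from every non-group column in one shot
demeaned = df.groupby('Cluster').transform(lambda x: x - x.mean())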

How to check the highest score among specific columns and compute the average in pandas?

Help with homework problem: "Let us define the "data science experience" of a given person as the person's largest score among Regression, Classification, and Clustering. Compute the average data science experience among all MSIS students."
Beginner to coding here. I am trying to figure out how to compare specific columns against each other for the largest value per row, and then take the average of those values.
I greatly appreciate your help in advance!
Picture of the sample data set: https://i.stack.imgur.com/9OSjz.png
Provided Code:
import pandas as pd
df = pd.read_csv("cleaned_survey.csv", index_col=0)
df.drop(['ProgSkills','Languages','Expert'],axis=1,inplace=True)
What I have tried so far:
df[data_science_experience]=df[["Regression","Classification","Clustering"]].values.max()
df['z']=df[['Regression','Classification','Clustering']].apply(np.max,axis=1)
df[data_science_experience]=df[["Regression","Classification","Clustering"]].apply(np.max,axis=1)
If you want to get the highest score of column 'hw1', you can get it with:
df['hw1'].max()
df['hw1'] gives you a Series of all the values in that column, and max() returns the maximum. For the average, use mean():
df['hw1'].mean()
If you want to find the maximum of multiple columns, you can use:
maximum_list = list()
for col in df.columns:
    maximum_list.append(df[col].max())  # note the parentheses: call .max(), don't store the method
highest = max(maximum_list)                      # built-in max over the per-column maxima
average = sum(maximum_list) / len(maximum_list)  # plain-Python mean of the maxima
Hope this helps.
First, you want to get only the rows with MSIS in the Program column. That can be done in the following way:
df[df['Program'] == 'MSIS']
Next, you want to get only the Regression, Classification and Clustering columns. The previous query filtered only rows; we can add to that, like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']]
Now, for each row remaining, we want to take the maximum. That can be done by appending .max(axis=1) to the previous line (axis=1 because we want the maximum of each row, not each column).
At this point, we have a Series where each entry is the highest score of the three categories for one student. Now, all that's left to do is take the mean, which can be done with .mean(). The full code should therefore look like this:
df.loc[df['Program'] == 'MSIS', ['Regression', 'Classification', 'Clustering']].max(axis=1).mean()

Quick Delta Between Two Rows/Columns in GoodData

Right now, I see there are quick ways to get things like Sum/Avg/Max/Etc. for two or more rows or columns when building a table in GoodData.
(screenshot: the quick total options)
I am building a little table that shows last week and the week prior, and I'm trying to show the delta between them.
So if the first column is 100 and the second is 50, I want '-50'
If the first column is 25 and the second is 100, I want '75'.
Is there an easy way to do this?
Assuming the first column contains the result of metric #1 and the second column the result of metric #2, you can simply create a metric #3 defined as (metric #1 - metric #2), or vice versa.
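Given the examples in the question (first column 100, second 50 should yield -50), the order you want is metric #2 minus metric #1; in GoodData's MAQL this is typically a one-liner along the lines of SELECT Metric2 - Metric1, where Metric1 and Metric2 are placeholders for your actual metric names.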