Realign Pandas Rolling Average series in a dataframe - pandas

I am very new to pandas. I am importing stock data into a DataFrame and want to calculate a 10-day rolling average. That part I can figure out. The issue is that the result starts with 9 NaN values because of the 10-day window.
I want to realign the data so that the 10th value (the first valid rolling average) sits at the top of a new rolling-average column in the DataFrame. I tried moving the data with the following code:
small = pd.rolling_mean(df['Close'],10)
and then adding it to the df with the following code:
df['MA10D2'] = small[9:]
but it still gives 9 NaN at the top. Can anyone help me out?

Assignment is aligned on the index. small[9:] starts at index position 9, so the assigned values keep their original index labels and still land at row 9 and below.
The function you are searching for is called shift:
df['MA10D2'] = small.shift(-9)
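
To see the difference between slicing and shifting, here is a minimal sketch on synthetic data (the 20 "Close" values are made up for illustration). It uses the `Series.rolling(10).mean()` spelling, on the assumption that you are on a pandas version where the older `pd.rolling_mean` function is no longer available.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices 1.0 .. 20.0 (illustrative data only).
df = pd.DataFrame({"Close": np.arange(1.0, 21.0)})

# 10-period rolling mean: the first 9 entries are NaN.
small = df["Close"].rolling(10).mean()

# Slicing keeps the original index labels, so assignment re-aligns
# the values to rows 9.. and the NaN rows at the top remain.
df["sliced"] = small[9:]

# shift(-9) moves every value up 9 rows; the NaN rows end up at the bottom.
df["MA10D2"] = small.shift(-9)

print(df.head(3))
print(df.tail(3))
```

Note that after the shift, row 0 holds the average of rows 0 to 9, so each value now looks forward rather than backward.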

Related

Pandas pivot_table/groupby taking too long on very large dataframe

I am working on a dataframe of 18 million rows with the following structure:
I need to get a count of the subsystem for each suite as per the name_heuristic (there are 4 values for that column). So I need an output with columns for each type of name_heuristic with the suite as index and values will be count of subsystems as per each column.
I have tried using pivot_table with the following code:
df_table = pd.pivot_table(df, index='suite', columns='name_heuristics', values='subsystem', aggfunc=np.sum)
But even after an HOUR, it is not done computing. What is taking so long and how can I speed it up? I even tried a groupby alternative that is still running 15 minutes and counting:
df_table = df.groupby(['name_heuristics', 'suite']).agg({'subsystem': np.sum}).unstack(level='name_heuristics').fillna(0)
Any help is greatly appreciated! I have been stuck on this for hours.
It seems that pivoting on more than one categorical column crashes pandas. My solution to a similar problem was to convert the target columns from categorical to object:
step 1
df['col1'] = df['col1'].astype('object')
df['col2'] = df['col2'].astype('object')
step 2
df_pivot = pandas.pivot_table(df, columns=['col1', 'col2'], index=...
This was independent of dataframe size...
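
As a rough sketch of how this could look for the count-per-suite question above: the column names come from the question, but the data is invented and the dtypes are assumed. It applies the categorical-to-object conversion and then counts rows with `groupby(...).size()` rather than summing, which is one possible alternative and not necessarily what the original poster ended up using.

```python
import pandas as pd

# Tiny stand-in for the 18-million-row frame (invented data).
df = pd.DataFrame({
    "suite": ["s1", "s1", "s1", "s2", "s2"],
    "name_heuristics": pd.Categorical(["a", "b", "a", "a", "c"]),
    "subsystem": ["x", "y", "z", "x", "y"],
})

# Step 1: convert the categorical column to object, as suggested above.
df["name_heuristics"] = df["name_heuristics"].astype("object")

# Step 2: count rows per (suite, name_heuristics) pair and spread the
# heuristic values into columns; missing combinations become 0.
counts = (
    df.groupby(["suite", "name_heuristics"])
      .size()
      .unstack(fill_value=0)
)
print(counts)
```

`size()` counts rows; if the goal is the number of distinct subsystems, `df.groupby([...])["subsystem"].nunique()` would be the call to swap in.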

Pyspark dataframe: crosstab or other method to make row label as new columns

I have a pyspark dataframe as follows in the picture:
That is, I have four columns: year, word, count, frequency. The year ranges from 2000 to 2015.
I would like to apply some operation to the (pyspark) dataframe so that I get the result in the format shown in the following picture:
The new dataframe columns should be: word, frequency_2000, frequency_2001, frequency_2002, ..., frequency_2015,
with the frequency of each word in each year taken from the previous dataframe.
Any advice on how I could write efficient code?
Also, please rename the title if you can come up with something more informative.
After some research, I found a solution:
Now, the crosstab function can get the output directly :
topw_ys.crosstab("word", "year").toPandas()
Results:
  word_year  2000  2015
0    mining    10     6
1    system    11    12
...

Pandas: Get rolling metric with adaptive window size [duplicate]

I am not sure I understand the parameter min_periods in Pandas rolling functions : why does it have to be smaller than the window parameter?
I would like to compute (for instance) the rolling max minus rolling min with a window of ten values BUT I want to wait maybe 20 values before starting computations:
In[1]: import pandas as pd
In[2]: import numpy as np
In[3]: df = pd.DataFrame(columns=['A','B'], data=np.random.randint(low=0,high=100,size=(100,2)))
In[4]: roll = df['A'].rolling(window=10, min_periods=20)
In[5]: df['C'] = roll.max() - roll.min()
In[6]: roll
Out[6]: Rolling [window=10,min_periods=20,center=False,axis=0]
In[7]: df['C'] = roll.max()-roll.min()
I get the following error:
ValueError: Invalid min_periods size 20 greater than window 10
I thought that min_periods was there to tell how many values the function had to wait before starting computations. The documentation says:
min_periods : int, default None
Minimum number of observations in window required to have a value
(otherwise result is NA)
I had not paid attention to the "in window" detail here...
Then what would be the most efficient way to achieve what I am trying to achieve? Should I do something like:
roll = df.loc[20:,'A'].rolling(window=10)
df['C'] = roll.max() - roll.min()
Is there a more efficient way?
The min_periods = n option simply means that you require at least n valid observations in the window to compute your rolling stats.
For example, suppose min_periods = 5 and you have a rolling mean over the last 10 observations. What happens if 6 of the last 10 observations are missing values? Only 4 non-missing values remain, and since 4 < 5 (you require at least 5 non-missing observations), the rolling mean will be missing as well.
It's a very, very important option.
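
A small sketch of that situation, using an invented series with a run of missing values at the end:

```python
import numpy as np
import pandas as pd

# Six valid observations followed by six missing ones (invented data).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0] + [np.nan] * 6)

# Rolling mean over 10 observations, requiring at least 5 valid ones.
result = s.rolling(window=10, min_periods=5).mean()
print(result)
```

At position 10 the window still contains 5 valid values (2 through 6), so a mean is returned. At position 11 only 4 valid values remain (3 through 6), so the result is NaN.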
From the documentation
min_periods : int, default None Minimum number of observations in
window required to have a value (otherwise result is NA).
The min_periods argument is just a way to apply the function to a smaller sample than the rolling window. Say you want the rolling minimum over a window of 10: passing min_periods=5 lets pandas calculate the min of the first 5 data points, then the first 6, then 7, 8, 9 and finally 10. From then on pandas has at least 10 data points available, so it keeps rolling with a full window of 10.
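
The sketch below illustrates both points on random data. The first part shows the growing window with min_periods=5. The second part is one possible way to get what the question asks for (a window of 10 that only starts reporting at the 20th value): compute the rolling statistic normally and blank out the early rows. The `start` variable is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(0, 100, size=100)})

# Growing window: results appear from the 5th value on, using however
# many points are available until the window reaches its full size of 10.
grow = df["A"].rolling(window=10, min_periods=5).min()

# Delayed start: full 10-value window, but discard results before the
# 20th observation (hypothetical threshold taken from the question).
start = 20
roll = df["A"].rolling(window=10)
spread = roll.max() - roll.min()
spread.iloc[: start - 1] = np.nan
df["C"] = spread

print(grow.head(6))
print(df["C"].iloc[17:22])
```

Unlike slicing with df.loc[20:, 'A'] before rolling, this keeps the first reported value at the 20th row, with its window covering rows 10 to 19.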

pandas df.resample('D').sum() returns NaN

I've got a pandas data frame with electricity meter readings(cumulative). The df DatetimeIndex dtype='datetime64[ns]'. When I load the .csv file the dataframe does not contain any NaN values. I need to calculate both the monthly and daily energy generated.
To calculate monthly generation I use dfmonth = df.resample('M').sum() . This works fine.
To calculate daily generation I thought of using dfday = df.resample('D').sum(). This partially works, but for some index dates (with no data missing in the raw file) it returns NaN.
Please see the code below. Does anyone know why this happens? Any proposed solution?
df = pd.read_csv(file)
df = df.set_index(pd.DatetimeIndex(df['Reading Timestamp']))
df=df.rename(columns = {'Energy kWh':'meter', 'Instantaneous Power kW (approx)': 'kW'})
df.drop(df.columns[:10], axis=1, inplace=True) #Delete columns I don't need.
df['kWh'] = df['meter'].sub(df['meter'].shift())
dfmonth = df.resample('M').sum() #This works OK calculating kWh. dfmonth does not contain any NaN.
dfday = df.resample('D').sum() # This returns a total of 8 NaN out of 596 sampled points. Original df has 27929 DatetimeIndex rows
Thank you in advance.
A big apology to you all. The .csv I was given and the raw .csv I was checking against are not the same file. The data was somehow corrupted...
I've been banging my head against the wall until now; there is no problem with df.resample('D').sum().
Sorry again, consider the thread sorted.
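
For anyone debugging a similar symptom, here is a minimal sketch on invented cumulative readings with one day missing entirely. It assumes a reasonably recent pandas, where `sum()` over an empty bin returns 0 by default and the `min_count` argument can be used to get NaN instead, which makes days without any data easy to spot.

```python
import pandas as pd

# Invented cumulative meter readings; 2020-01-03 has no rows at all.
idx = pd.to_datetime([
    "2020-01-01 00:00",
    "2020-01-01 12:00",
    "2020-01-02 00:00",
    "2020-01-04 00:00",
])
df = pd.DataFrame({"meter": [100.0, 105.0, 112.0, 130.0]}, index=idx)

# Energy per interval is the difference between consecutive readings.
df["kWh"] = df["meter"].diff()

# Default: a day with no readings sums to 0.
daily_default = df["kWh"].resample("D").sum()

# min_count=1: a day with no valid readings becomes NaN instead.
daily_flagged = df["kWh"].resample("D").sum(min_count=1)

print(daily_default)
print(daily_flagged)
```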

Iterating through a pandas dataframe to create a chart that adds up to 100%

I have the following dataframe
I want to add two columns, "Stat total during quarter" (the total value of "stat", not broken down by param) and "% of quarter total", which would show how the proportions have been changing over time, and then build a stacked chart that adds up to 100%.
Unfortunately I am having trouble calculating "Stat total during quarter" in the "pandas way".
I ended up iterating through the dataframe cell by cell, which feels like a suboptimal solution, and then dividing one column by another to get the %:
for elements in df.index:
    df.ix[elements,3] = df[df['period']==df.ix[elements,0]]['stat'].sum()
df['% of quarter total'] = df.stat / df['stat total during quarter']
Would really appreciate your thoughts.
Use groupby/transform to compute the sum for each period. transform always returns a Series of the same length as the original DataFrame. Thus, you can assign the return value to a new column of df:
df['stat total during quarter'] = df.groupby('period')['stat'].transform('sum')
df['% of quarter total'] = df['stat'] / df['stat total during quarter']
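
A short end-to-end sketch on invented data follows. The "param" column name and the values are assumptions, since the original dataframe was only shown as an image. The final pivot is one possible way to shape the shares for a stacked chart.

```python
import pandas as pd

# Invented data: one row per (period, param) with its stat value.
df = pd.DataFrame({
    "period": ["Q1", "Q1", "Q2", "Q2", "Q2"],
    "param": ["a", "b", "a", "b", "c"],
    "stat": [30, 70, 10, 40, 50],
})

# Per-period total, broadcast back to every row of that period.
df["stat total during quarter"] = df.groupby("period")["stat"].transform("sum")
df["% of quarter total"] = df["stat"] / df["stat total during quarter"]

# One row per period, one column per param; each row sums to 1.
shares = df.pivot(index="period", columns="param", values="% of quarter total")
print(shares)
```

With matplotlib installed, shares.plot(kind="bar", stacked=True) should then draw bars that each add up to 100%.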