
How to calculate all aggregations at once without using a loop over indices?
%%time
import random
import pandas as pd

random.seed(1)
df = pd.DataFrame({'val': random.sample(range(10), 10)})
for j in range(10):
    for i in df.index:
        df.loc[i, 'mean_last_{}'.format(j)] = df.loc[(df.index < i) & (df.index >= i - j), 'val'].mean()
        df.loc[i, 'std_last_{}'.format(j)] = df.loc[(df.index < i) & (df.index >= i - j), 'val'].std()
        df.loc[i, 'max_last_{}'.format(j)] = df.loc[(df.index < i) & (df.index >= i - j), 'val'].max()
        df.loc[i, 'min_last_{}'.format(j)] = df.loc[(df.index < i) & (df.index >= i - j), 'val'].min()
        df.loc[i, 'median_last_{}'.format(j)] = df.loc[(df.index < i) & (df.index >= i - j), 'val'].median()

You could use the rolling method; for example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': np.random.random(100)})
for i in range(1, 10):  # a window of 0 rows would be empty, so start at 1
    agg = df["val"].rolling(i).aggregate(['mean', 'median'])
    df[[f"mean_{i}", f"median_{i}"]] = agg.values

I think what you're looking for is something like this:
import random
import pandas as pd

random.seed(1)
df = pd.DataFrame({'val': random.sample(range(10), 10)})
for j in range(1, 10):
    df[f'mean_last_{j}'] = df['val'].rolling(j, min_periods=1).mean()
    df[f'std_last_{j}'] = df['val'].rolling(j, min_periods=1).std()
    df[f'max_last_{j}'] = df['val'].rolling(j, min_periods=1).max()
    df[f'min_last_{j}'] = df['val'].rolling(j, min_periods=1).min()
    df[f'median_last_{j}'] = df['val'].rolling(j, min_periods=1).median()
However, my code is "off-by-one" relative to your example code. Do you intend for each aggregation to INCLUDE the value from the current row, or should it only use the previous j rows, without the current one? My code includes the current row, but yours does not; your code produces NaN values for the first group of aggregations.
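If you do want to exclude the current row, one option (a sketch, not part of the original answer) is to shift the series down by one before rolling, which reproduces the (df.index < i) condition from your code:
# inside the same loop: shift(1) drops the current row from every window
for j in range(1, 10):
    df[f'mean_last_{j}'] = df['val'].shift(1).rolling(j, min_periods=1).mean()
    df[f'median_last_{j}'] = df['val'].shift(1).rolling(j, min_periods=1).median()
With the shift, the first row comes out NaN, exactly as in the question's code.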
Edit: The answer from @Carlos uses rolling(j).aggregate() to specify a list of aggregations in one line. Here's what that looks like:
import random
import pandas as pd

random.seed(1)
df = pd.DataFrame({'val': random.sample(range(10), 10)})
aggs = ['mean', 'std', 'max', 'min', 'median']
for j in range(1, 10):
    stats = df["val"].rolling(j, min_periods=1).aggregate(aggs)
    df[[f"{a}_last_{j}" for a in aggs]] = stats.values

Related

Automatically assigning p-value position in ggplot loop

I am running an mapply loop on a huge set of data to graph 13 parameters for 19 groups. This is working great except for the p-value position. Because the data vary from plot to plot, I cannot assign the position with, for example, label.y = 125; in some plots that lands in the middle of the bar/error bar. However, I can't assign it higher without having it way too high on other graphs. Is there a way to adjust the position to the data and error bars?
This is my graphing function; as I said, the graphs look fine apart from the p-value position, specifically the stat_compare_means() ANOVA line.
ANOVA_plotter <- function(Variable, treatment, Grouping, df){
  Inputdf <- df %>%
    filter(Media == treatment, Group == Grouping) %>%
    ggplot(aes_(x = ~ID, y = as.name(Variable))) +
    geom_bar(aes(fill = ANOVA_Status), stat = "summary", fun = "mean", width = 0.9) +
    stat_summary(geom = "errorbar", fun.data = "mean_sdl", fun.args = list(mult = 1), size = 1) +
    labs(title = paste(Variable, "in", treatment, "in Group", Grouping, sep = " ")) +
    theme(legend.position = "none", axis.title.x = element_blank(), axis.text = element_text(face = "bold", size = 18), axis.text.x = element_text(angle = 45, hjust = 1)) +
    stat_summary(geom = "errorbar", fun.data = "mean_sdl", fun.args = list(mult = 1), width = 0.2) +
    stat_compare_means(method = "anova", label.y = 125) +
    stat_compare_means(label = "p.signif", method = "t.test", paired = FALSE, ref.group = "Control")
}
I get graphs that look like this (https://i.stack.imgur.com/hV9Ad.jpg), but I can't assign it to label.y = 200 because of plots like this (https://i.stack.imgur.com/uStez.jpg).

How to add new columns to pandas data frame using .loc

I am calculating very simple daily stock indicators in a data frame (e.g. SMA, VWAP, RSI). After I upgraded to Anaconda 3.0, my code stopped working and gives the following error. I don't have much experience with coding and need some help.
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['RSI', 'ZONE'], dtype='object'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
Here is the code.
import yfinance as yf
import pandas as pd

def convert_to_dataframe_daily(data):
    window = 10
    window20 = 20
    window50 = 50
    window100 = 100
    window200 = 200
    ema_time = 8
    #data = yf.download("googl", period="30d", interval="5m")
    #data = yf.download('TSLA', period='30d', interval='5m')
    pd.set_option('display.max_columns', None)
    #calculation for VWAP
    volumeC = data['Volume']
    priceC = data['Close']
    df = data.assign(VWAP=((volumeC * priceC).cumsum() / volumeC.cumsum()).ffill())
    #Convert the timezone to Chicago central
    #df.index = pd.DatetimeIndex(df.index.tz_convert('US/Central')) # aware --> aware
    #reset the dataframe index and separate time
    df.reset_index(inplace=True)
    #df.index.intersection
    #df2 = df[df.index.isin(dts)]
    #df['Date'] = pd.to_datetime(df['Datetime']).dt.date
    #df['Time'] = pd.to_datetime(df['Datetime']).dt.time
    # calculate stochastic
    df['low5'] = df['Low'].rolling(5).min()
    df['high5'] = df['High'].rolling(5).max()
    #k = 100 * (c - l) / (h - l)
    df['K'] = (df['Close'] - df['low5']) / (df['high5'] - df['low5'])
    #s.reindex([1, 2, 3])
    columns = df.columns.values.tolist()
    #df[columns[index]]
    #df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
    # this is the line that raises the KeyError: 'RSI' and 'ZONE' do not exist yet
    df = df.loc[:, ('Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'K', 'RSI', 'ZONE')]
    #df = df.reindex(['Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE'])
    df['RSI'] = calculate_rsi(df)  # calculate_rsi is defined elsewhere
    filter_Z1 = df['K'] <= 0.1
    filter_Z2 = (df['K'] > 0.1) & (df['K'] <= 0.2)
    filter_Z3 = (df['K'] > 0.2) & (df['K'] <= 0.3)
    filter_Z4 = (df['K'] > 0.3) & (df['K'] <= 0.4)
    filter_Z5 = (df['K'] > 0.4) & (df['K'] <= 0.5)
    filter_Z6 = (df['K'] > 0.5) & (df['K'] <= 0.6)
    filter_Z7 = (df['K'] > 0.6) & (df['K'] <= 0.7)
    filter_Z8 = (df['K'] > 0.7) & (df['K'] <= 0.8)
    filter_Z9 = (df['K'] > 0.8) & (df['K'] <= 0.9)
    filter_Z10 = (df['K'] > 0.9) & (df['K'] <= 1)
    #plug in stochastic zones
    df['ZONE'].where(~filter_Z1, 'Z1', inplace=True)
    df['ZONE'].where(~filter_Z2, 'Z2', inplace=True)
    df['ZONE'].where(~filter_Z3, 'Z3', inplace=True)
    df['ZONE'].where(~filter_Z4, 'Z4', inplace=True)
    df['ZONE'].where(~filter_Z5, 'Z5', inplace=True)
    df['ZONE'].where(~filter_Z6, 'Z6', inplace=True)
    df['ZONE'].where(~filter_Z7, 'Z7', inplace=True)
    df['ZONE'].where(~filter_Z8, 'Z8', inplace=True)
    df['ZONE'].where(~filter_Z9, 'Z9', inplace=True)
    df['ZONE'].where(~filter_Z10, 'Z10', inplace=True)
    df = df['Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE']
    return df

data = yf.download('ba', period='500d', interval='1d')
df = convert_to_dataframe_daily(data)
print(df)
A few lines need to be tweaked.
Instead of
df = df.loc[:, ('Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE')]
use
df = df[['Date','Open','High','Low', 'Close','Volume','VWAP','K']]
before
df['ZONE'].where(~filter_Z1, 'Z1', inplace=True)
...
put a line
df['ZONE'] = 0
The line before return df should be changed to
df = df[['Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE']]
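Putting the three edits together, the end of the function would look something like this (a sketch; calculate_rsi and the filter_Z definitions stay exactly as in the question):
df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'K']]  # only columns that exist at this point
df['RSI'] = calculate_rsi(df)  # RSI is now created as a new column
# ... the filter_Z1 .. filter_Z10 definitions, unchanged ...
df['ZONE'] = 0  # create the column before the .where() calls fill it
# ... the ten df['ZONE'].where(...) lines, unchanged ...
df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'K', 'RSI', 'ZONE']]
return df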

How to replace inplace with apply() function?

I have a pandas dataframe df with a column DIFF_HOURS.
I run this code:
for i in range(0, 72, 6):
    df.loc[(df['DIFF_HOURS'] > i) & (df['DIFF_HOURS'] <= (i+6))]['DIFF_HOURS'].apply(lambda x: i)
But how can I modify the df rows in place, respecting their indexes?
Try adding an assignment:
mask = (df['DIFF_HOURS'] > i) & (df['DIFF_HOURS'] <= (i + 6))
df.loc[mask, 'DIFF_HOURS'] = df.loc[mask, 'DIFF_HOURS'].apply(lambda x: i)
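Since every value in a bucket becomes the constant i, apply() isn't really needed. A vectorized alternative (a sketch, assuming the intent is to floor each value to the start of its 6-hour bucket (i, i + 6]) does the whole thing at once:
import numpy as np
import pandas as pd

# hypothetical sample data
df = pd.DataFrame({'DIFF_HOURS': [1, 6, 7, 13, 26, 71]})

# values in (i, i + 6] map to i, exactly like the loop;
# values outside (0, 72] are left untouched, as in the original
mask = (df['DIFF_HOURS'] > 0) & (df['DIFF_HOURS'] <= 72)
df.loc[mask, 'DIFF_HOURS'] = (6 * (np.ceil(df.loc[mask, 'DIFF_HOURS'] / 6) - 1)).astype(int)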

Is there a way to fit a normal curve to points?

As a small project I've made a program that throws nd dice nt times. At each throw it sums the results from the dice and appends the sum to a list. At the end the data is plotted with matplotlib.
import random
from collections import Counter
import matplotlib.pyplot as plt

nd = int(input("Insert number of dice: "))
nt = int(input("Insert number of throws: "))
print(nd, " dice thrown ", nt, " times")
print("Generating sums, please hold....")
c = 0
i = 0
sum = 0
sums = []
while i < nt:
    while c < nd:
        g = random.randint(1, 6)
        sum = sum + g
        c += 1
    sums.append(sum)
    i = i + 1
    c = 0
    sum = 0
    print("Throw ", i, " of ", nt)
sums.sort()
max = max(sums)
min = min(sums)
print("||Maximum result: ", max, " ||Minimum result: ", min)
print("Now ordering results")
f = Counter(sums)
y = list(f.values())
x = list(f.keys())
print("Representing results")
plt.plot(x, y)
plt.xlabel("Results")
plt.ylabel("Frequency")
plt.title("Distribution of the sums")
plt.grid(True)
plt.tight_layout()
plt.show()
The resultant graph looks something like this:
I would like to know how to fit a Gaussian curve to the points in order to make the graph clearer.
The mean and the standard deviation of the sums are the parameters needed for the Gaussian normal. The pdf of a distribution has an area of 1. To scale it to the same size as the histogram, it needs to be multiplied with the number of input values (len(sums)).
Converting the code to work with numpy arrays makes everything much faster:
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from scipy.stats import norm

nd = 10000  # int(input("Insert number of dice: "))
nt = 10000  # int(input("Insert number of throws: "))
print(nd, "dice thrown", nt, "times")
print("Generating sums, please hold....")
sums = np.zeros(nt, dtype=int)
for i in range(nt):
    sums[i] = np.sum(np.random.randint(1, 7, nd))
sums.sort()
xmax = sums.max()
xmin = sums.min()
print("||Maximum result: ", xmax, " ||Minimum result: ", xmin)
print("Now ordering results")
f = Counter(sums)
y = list(f.values())
x = list(f.keys())
print("Plotting results")
plt.plot(x, y)
mean = sums.mean()
std = sums.std()
xs = np.arange(xmin, xmax + 1)
plt.plot(xs, norm.pdf(xs, mean, std) * len(sums), color='red', alpha=0.7, lw=3)
plt.margins(x=0)
plt.xlim(xmin, xmax)
plt.ylim(bottom=0)
plt.tight_layout()
plt.show()
PS: Here is some code to add to the code of the question, using numpy only to calculate the mean and the standard deviation. (Note that because you use sum as a variable name, you'd get an error if you tried to call Python's sum() function afterwards. It is therefore highly recommended not to reuse built-in names such as sum and max for variables.)
import numpy as np
from scipy.stats import norm

def f(x):
    return norm.pdf(x, mean, std) * len(sums)

mean = np.mean(sums)
std = np.std(sums)
xmin, xmax = sums[0], sums[-1]  # sums was sorted earlier
xs = range(xmin, xmax + 1)
ys = [f(x) for x in xs]
plt.plot(xs, ys, color='red', lw=3)
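If you prefer to let scipy estimate the parameters directly, norm.fit returns the same maximum-likelihood estimates (a one-line sketch):
from scipy.stats import norm

mean, std = norm.fit(sums)  # identical to np.mean(sums) and np.std(sums)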

vectorization of loop in pandas

I've been trying to vectorize the following with no luck:
Consider two data frames. One is a list of dates:
import pandas as pd

cols = ['col1', 'col2']
index = pd.date_range('1/1/15', '8/31/18')
df = pd.DataFrame(columns=cols)
What I'm doing currently is looping through the dates and, for each one, counting the rows of my main (large) dataframe df_main whose n_date is less than or equal to the date in question:
for x in range(len(index)):
    active = len(df_main[df_main.n_date <= index[x]])
    temp_arr = [index[x], active]
    df = df.append(pd.Series(temp_arr, index=cols), ignore_index=True)
Is there a way to vectorize the above?
What about something like the following:
import pandas as pd

#initializing
mycols = ['col1', 'col2']
myindex = pd.date_range('1/1/15', '8/31/18')
mydf = pd.DataFrame(columns=mycols)
#create df_main (each of myindex's dates minus 10 days)
df_main = pd.DataFrame(data=myindex - pd.Timedelta(days=10), columns=['n_date'])
#wrap a dataframe around a list comprehension
mydf = pd.DataFrame([[x, len(df_main[df_main['n_date'] <= x])] for x in myindex], columns=mycols)
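The list comprehension still loops in Python, though. For large inputs, a fully vectorized variant (a sketch, reusing df_main and myindex from above) gets each count from a sorted array with searchsorted:
import numpy as np
import pandas as pd

# with the dates sorted, the number of rows <= x is simply x's
# insertion position from the right
dates = np.sort(df_main['n_date'].to_numpy())
counts = np.searchsorted(dates, myindex.to_numpy(), side='right')
mydf = pd.DataFrame({'col1': myindex, 'col2': counts})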