Python - Finance Matplotlib related - matplotlib

I'm new to Python and I'm testing the matplotlib finance module.
I need to get the price and date values where ma20 = ma50.
Can you give me a clue on how to do this?
Here is my code:
# Modules
import datetime
import numpy as np
import matplotlib.finance as finance
import matplotlib.mlab as mlab
import matplotlib.pyplot as plot
# Define quote
startdate = datetime.date(2005,1,1)
today = enddate = datetime.date.today()
ticker = 'nvda'
# Catch CSV
fh = finance.fetch_historical_yahoo(ticker, startdate, enddate)
# From CSV to recarray
r = mlab.csv2rec(fh); fh.close()
# Sort records (ascending by date)
r.sort()
### Methods Begin
def moving_average(x, n, type='simple'):
    """
    compute an n period moving average.

    type is 'simple' | 'exponential'
    """
    x = np.asarray(x)
    if type == 'simple':
        weights = np.ones(n)
    else:
        weights = np.exp(np.linspace(-1., 0., n))

    weights /= weights.sum()

    a = np.convolve(x, weights, mode='full')[:len(x)]
    a[:n] = a[n]
    return a
### Methods End
prices = r.adj_close
dates = r.date
ma20 = moving_average(prices, 20, type='simple')
ma50 = moving_average(prices, 50, type='simple')
plot.plot(prices)
plot.plot(ma20)
plot.plot(ma50)
plot.show()

Since you are using numpy, you can use numpy's boolean indexing for arrays:
equal = ma20==ma50
print(dates[equal])
print(prices[equal])
'equal' is a boolean array of the same length as dates and prices. Numpy then picks from dates and prices only those entries where equal==True, or, equivalently, ma20==ma50.
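For context, here is a minimal, self-contained sketch of that boolean indexing using toy arrays (the data below is made up). Note that with floating-point moving averages an exact ma20 == ma50 match is rare, so a tolerance (np.isclose) or a sign-change (crossover) test may be more practical in practice:

import numpy as np

# toy data standing in for dates / prices / moving averages
dates = np.array(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'], dtype='datetime64[D]')
prices = np.array([10.0, 11.0, 12.0, 13.0])
ma20 = np.array([10.5, 11.0, 11.5, 12.5])
ma50 = np.array([11.0, 11.0, 12.0, 12.0])

# exact equality, as in the answer above
equal = ma20 == ma50
print(dates[equal], prices[equal])

# a more tolerant comparison for floating-point data
close = np.isclose(ma20, ma50, atol=0.01)
print(dates[close], prices[close])

# detect crossovers: the sign of (ma20 - ma50) changes between consecutive samples
cross = np.where(np.diff(np.sign(ma20 - ma50)) != 0)[0] + 1
print(dates[cross], prices[cross])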

Related

np.dot - weights are not being applied to inputs

I'm trying to use the np.dot function to multiply annual returns by portfolio weights to get the overall portfolio performance.
import numpy as np
import pandas as pd
from pandas_datareader import data as pdr
import matplotlib.pyplot as plt
import yfinance as yf
yf.pdr_override()
y_symbols = ['PG','MSFT', 'F', 'GE']
from datetime import datetime
startdate = datetime(1995,1,3)
enddate = datetime(2017,3,24)
data = pdr.get_data_yahoo(y_symbols, start =startdate, end =enddate)['Adj Close']
returns = (data/data.shift(1)) - 1
annual_returns = returns.mean() * 252
annual_returns
F 0.118506
GE 0.127551
MSFT 0.197452
PG 0.129486
dtype: float64
weights = np.array([0.4, 0.4, 0.15, 0.05])
np.dot = (annual_returns, weights)
(F 0.118506
GE 0.127551
MSFT 0.197452
PG 0.129486
dtype: float64,
array([0.4, 0.4, 0.15, 0.05]))
I would expect to see a single number: the annual return of each stock multiplied by its weighting, summed up.
Any idea why I am not seeing one number here?
With np.dot = (annual_returns, weights) you are assigning a tuple to the name np.dot rather than calling the function. You should be making a function call instead: np.dot(annual_returns, weights).
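As a minimal sketch of the corrected call, using the annual returns printed above (the assumption that the weights line up with the alphabetical F, GE, MSFT, PG ordering is mine, not from the question):

import numpy as np
import pandas as pd

# annual returns as printed above; the weight ordering (F, GE, MSFT, PG) is an assumption
annual_returns = pd.Series({'F': 0.118506, 'GE': 0.127551, 'MSFT': 0.197452, 'PG': 0.129486})
weights = np.array([0.4, 0.4, 0.15, 0.05])

# the weighted sum of the individual annual returns gives one portfolio figure
portfolio_return = np.dot(annual_returns, weights)
print(portfolio_return)  # a single float, about 0.135 with these numbers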

How to calculate confidence intervals for predictions in regression, and how to plot them in Python?

Fig 7.1, An Introduction To Statistical Learning
I am currently studying the book An Introduction to Statistical Learning with Applications in R, and I am also converting the solutions to Python.
I am not able to work out how to compute the confidence intervals and plot them as shown in the image above (the dashed lines).
I have plotted the line. Here's my code for that -
(I am using polynomial regression with predictor 'age' and response 'wage'; the degree is 4)
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

poly = PolynomialFeatures(4)
X = poly.fit_transform(data['age'].to_frame())
y = data['wage']
# X.shape
model = sm.OLS(y, X).fit()
print(model.summary())
# So, what we want here is not only the fitted line, but also the standard error related to the line.
# To find that we need to calculate the predictions for some values of age.
test_ages = np.linspace(data['age'].min(), data['age'].max(), 100)
X_test = poly.transform(test_ages.reshape(-1, 1))
pred = model.predict(X_test)
plt.figure(figsize=(12, 8))
plt.scatter(data['age'], data['wage'], facecolors='none', edgecolors='darkgray')
plt.plot(test_ages, pred)
Here data is the Wage dataset which is available in R.
This is the resulting graph I get -
I have used bootstrapping to calculate the confidence intervals; for this I have used a small custom module -
import numpy as np
import pandas as pd
from tqdm import tqdm

class Bootstrap_ci:

    def boot(self, X_data, y_data, R, test_data, model):
        # refit the model on R bootstrap resamples and collect the predictions
        predictions = []
        for i in tqdm(range(R)):
            predictions.append(
                self.alpha(X_data, y_data, self.get_indices(X_data, 200), test_data, model))
        # the 2.5th and 97.5th percentiles give a 95% bootstrap interval
        return np.percentile(predictions, 2.5, axis=0), np.percentile(predictions, 97.5, axis=0)

    def alpha(self, X_data, y_data, index, test_data, model):
        # fit the model on one bootstrap resample and predict on the test grid
        X = X_data.loc[index]
        y = y_data.loc[index]
        lr = model
        lr.fit(pd.DataFrame(X), y)
        return lr.predict(pd.DataFrame(test_data))

    def get_indices(self, data, num_samples):
        # draw row indices with replacement
        return np.random.choice(data.index, num_samples, replace=True)
The above module can be used as -
poly = PolynomialFeatures(4)
X = poly.fit_transform(data['age'].to_frame())
y = data['wage']
X_test = np.linspace(min(data['age']),max(data['age']),100)
X_test_poly = poly.transform(X_test.reshape(-1,1))
from sklearn.linear_model import LinearRegression
from bootstrap import Bootstrap_ci

bootstrap = Bootstrap_ci()
li, ui = bootstrap.boot(pd.DataFrame(X), y, 1000, X_test_poly, LinearRegression())
This will give us the lower and upper bounds of the confidence interval.
To plot the graph -
plt.scatter(data['age'],data['wage'],facecolors='none', edgecolors='darkgray')
plt.plot(X_test,pred,label = 'Fitted Line')
plt.plot(X_test,ui,linestyle = 'dashed',color = 'r',label = 'Confidence Intervals')
plt.plot(X_test,li,linestyle = 'dashed',color = 'r')
The resulting graph is -
The following code gives a 95% confidence interval (for the RMSE of the predictions on a test set):
from scipy import stats
confidence = 0.95
squared_errors = (<<predicted values>> - <<true y_test values>>) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))
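As a runnable illustration, here is the same calculation with made-up arrays standing in for the placeholders above (note this is an interval for the model's RMSE on a test set, not a pointwise prediction band):

import numpy as np
from scipy import stats

# made-up stand-ins for <<predicted values>> and <<true y_test values>>
y_pred = np.array([3.0, 3.2, 4.0, 4.8, 4.1, 4.3])
y_test = np.array([3.1, 2.9, 4.2, 5.0, 3.8, 4.4])

confidence = 0.95
squared_errors = (y_pred - y_test) ** 2

# t-interval on the mean squared error, then square-root to get an RMSE interval
lower, upper = np.sqrt(stats.t.interval(confidence,
                                        len(squared_errors) - 1,
                                        loc=squared_errors.mean(),
                                        scale=stats.sem(squared_errors)))
print(lower, upper)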

Dask DataFrame: Defining meta for date diff in groupby

I'm trying to find inter-purchase times (i.e., days between orders) for customers. Although my code is working correctly without defining meta, I would like to get it working properly and no longer see the warning asking me to provide meta.
Also, I would appreciate any suggestions on how to use map or map_partitions instead of apply.
So far I've tried:
meta={'days_since_last_order': 'datetime64[ns]'}
meta={'days_since_last_order': 'f8'}
meta={'ORDER_DATE_DT':'datetime64[ns]','days_since_last_order': 'datetime64[ns]'}
meta={'ORDER_DATE_DT':'f8','days_since_last_order': 'f8'}
meta=('days_since_last_order', 'f8')
meta=('days_since_last_order', 'datetime64[ns]')
Here is my code:
import numpy as np
import pandas as pd
import datetime as dt
import dask.dataframe as dd
from dask.distributed import wait, Client
client = Client(processes=True)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
d = (end - start).days + 1
np.random.seed(0)
df = pd.DataFrame()
df['CUSTOMER_ID'] = np.random.randint(1, 4, 10)
df['ORDER_DATE_DT'] = start + pd.to_timedelta(np.random.randint(1, d, 10), unit='d')
print(df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT']))
print(df)
ddf = dd.from_pandas(df, npartitions=2)
# setting ORDER_DATE_DT as index to sort by date
ddf = ddf.set_index('ORDER_DATE_DT')
ddf = client.persist(ddf)
wait(ddf)
ddf = ddf.reset_index()
grp = ddf.groupby('CUSTOMER_ID')[['ORDER_DATE_DT']].apply(
    lambda df: df.assign(days_since_last_order=df.ORDER_DATE_DT.diff(1))
    # meta=????
)
# for some reason, I'm unable to print grp unless I reset_index()
grp = grp.reset_index()
print(grp.compute())
Here is the printout of df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT'])
Here is the printout of grp.compute()
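One hedged sketch (my assumption, not a verified answer): ORDER_DATE_DT.diff() yields timedeltas rather than datetimes, so a meta that matches the apply's output could be an empty DataFrame describing both columns with those dtypes:

# a sketch only: diff() of datetimes yields timedelta64[ns], not datetime64[ns],
# so the meta describes ORDER_DATE_DT plus a timedelta column
meta = pd.DataFrame({
    'ORDER_DATE_DT': pd.Series(dtype='datetime64[ns]'),
    'days_since_last_order': pd.Series(dtype='timedelta64[ns]'),
})

grp = ddf.groupby('CUSTOMER_ID')[['ORDER_DATE_DT']].apply(
    lambda df: df.assign(days_since_last_order=df.ORDER_DATE_DT.diff(1)),
    meta=meta,
)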

Use matplotlib to plot scikit learn linear regression results

How can you plot the linear regression results from scikit learn after the analysis to see the "testing" data (real values vs. predicted values) at the end of the program? The code below is close but I believe it is missing a scaling factor.
input:
import pandas as pd
import numpy as np
import datetime
pd.core.common.is_list_like = pd.api.types.is_list_like # temp fix
import fix_yahoo_finance as yf
from pandas_datareader import data, wb
from datetime import date
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing, cross_validation, svm
import matplotlib.pyplot as plt
df = yf.download('MMM', start = date (2012, 1, 1), end = date (2018, 1, 1) , progress = False)
df_low = df[['Low']] # create a new df with only the low column
forecast_out = int(5) # predicting some days into future
df_low['low_prediction'] = df_low[['Low']].shift(-forecast_out) # create a new column based on the existing col but shifted some days
X_low = np.array(df_low.drop(['low_prediction'], 1))
X_low = preprocessing.scale(X_low) # scaling the input values
X_low_forecast = X_low[-forecast_out:] # set X_forecast equal to last 5 days
X_low = X_low[:-forecast_out] # remove last 5 days from X
y_low = np.array(df_low['low_prediction'])
y_low = y_low[:-forecast_out]
X_low_train, X_low_test, y_low_train, y_low_test = cross_validation.train_test_split(X_low, y_low, test_size = 0.2)
clf_low = LinearRegression() # classifier
clf_low.fit(X_low_train, y_low_train) # training
confidence_low = clf_low.score(X_low_test, y_low_test) # testing
print("confidence for lows: ", confidence_low)
forecast_prediction_low = clf_low.predict(X_low_forecast)
print(forecast_prediction_low)
plt.figure(figsize = (17,9))
plt.grid(True)
plt.plot(X_low_test, color = "red")
plt.plot(y_low_test, color = "green")
plt.show()
image:
You plot X_low_test and y_low_test, while you should plot y_low_test against clf_low.predict(X_low_test) instead if you want to compare target and predicted values.
By the way, clf_low in your code is not a classifier, it is a regressor. It's better to use the name model instead of clf.
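A minimal sketch of that comparison, continuing from the question's variables (clf_low, X_low_test, y_low_test from the code above):

import matplotlib.pyplot as plt

# predictions for the held-out test rows
y_low_pred = clf_low.predict(X_low_test)

plt.figure(figsize=(10, 6))
plt.scatter(y_low_test, y_low_pred, s=10, color="steelblue")
# reference line: perfect predictions fall on y = x
lims = [min(y_low_test.min(), y_low_pred.min()), max(y_low_test.max(), y_low_pred.max())]
plt.plot(lims, lims, color="red", linestyle="--")
plt.xlabel("actual low (shifted)")
plt.ylabel("predicted low")
plt.show()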

Getting a score of zero using cross_val_score

I am trying to use cross_val_score on my dataset, but I keep getting zeros as the score:
This is my code:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)
# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = np.array(df.iloc[:, 0], dtype="S6")
logreg = LogisticRegression()
loo = LeaveOneOut()
scores = cross_val_score(logreg, X, y, cv=loo)
print(scores)
The features are categorical values, while the target value is a float value. I am not exactly sure why I am ONLY getting zeros.
The data looks like this before creating dummy variables
N level,species,Plant Weight(g)
L,brownii,0.3008
L,brownii,0.3288
M,brownii,0.3304
M,brownii,0.388
M,brownii,0.406
H,brownii,0.3955
H,brownii,0.3797
H,brownii,0.2962
Updated code where I am still getting zeros:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd
# Creating dummies for the non numerical features in the dataset
df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)
# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = df.iloc[:, 0]
forest = RandomForestRegressor()
loo = LeaveOneOut()
scores = cross_val_score(forest, X, y, cv=loo)
print(scores)
cross_val_score splits the data into train and test folds with the given CV iterator, fits the model on the train portion, and scores it on the test fold. For regressors, r2_score is the default scorer in scikit-learn.
You have specified LeaveOneOut() as your cv iterator, so each fold contains a single test case. In that case, R-squared will always be 0.
Looking at the formula for R^2 (for example, on Wikipedia):
R^2 = 1 - SS_res / SS_tot
where
SS_tot = sum((y_i - y_mean)^2)
For a single test case, y_mean equals that one y value, so SS_tot (the denominator) is 0 and R^2 is undefined (NaN). In this case, scikit-learn sets the score to 0 instead of NaN.
Changing LeaveOneOut() to any other CV iterator, such as KFold, will give you non-zero results, as you have already observed.
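A minimal, self-contained sketch of the difference, using made-up regression data in place of the Flaveria dataset:

import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.ensemble import RandomForestRegressor

# synthetic regression data standing in for the Flaveria dataset
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X @ np.array([0.5, 1.5, -1.0]) + 0.1 * rng.randn(60)

forest = RandomForestRegressor(n_estimators=50, random_state=0)

# R^2 with LeaveOneOut: every test fold has one sample, so each score is ill-defined
loo_scores = cross_val_score(forest, X, y, cv=LeaveOneOut())
print(loo_scores[:5])  # zeros (or NaN in newer scikit-learn versions)

# R^2 with KFold: each test fold has several samples, so R^2 is well-defined
kfold_scores = cross_val_score(forest, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(kfold_scores)

# alternatively, keep LeaveOneOut but use an error metric that works per sample
mse_scores = cross_val_score(forest, X, y, cv=LeaveOneOut(), scoring='neg_mean_squared_error')
print(mse_scores.mean())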