Pymc3 hierarchical model with interaction term - bayesian

I am trying to build a linear model in PyMC3 that uses age and an age*sex interaction term to model an output variable. However, since sex is a binary (0/1) categorical variable, the model can't effectively estimate both cov1_beta and cov2_beta. Any help is appreciated, thank you.
with pm.Model() as model_interaction:
    mu = pm.Normal("a", mu=0, sd=1)
    cov1_beta = pm.Normal("cov1_age", mu=0, sd=10, shape=1)
    cov2_beta = pm.Normal("cov2_age_sex", mu=0, sd=10, shape=2)
    mu = mu + cov1_beta*Age_mean_scaled + cov2_beta[Sex_w]*Age_mean_scaled
    # Model error
    eps = pm.HalfCauchy('eps',20)
    # Data likelihood
    mdl_lkl = pm.Normal('model', mu=mu, sd=eps, observed=X)

I'm not very familiar with PyMC3, but I can give you a few general pointers. You get the interaction of two variables by multiplying the vectors of the two variables and estimating a beta for the result. So for the interaction of age and sex, you multiply the vectors of age and sex and estimate only one beta for the interaction. You're also redefining mu in the model, which I would avoid because I don't know how PyMC3 deals with it. So assuming you have two vectors age and sex, your model should look like this:
age_sex = age * sex # elementwise product
with pm.Model() as model_interaction:
    intercept = pm.Normal("intercept", mu=0, sd=1)
    beta_age = pm.Normal("beta_age", mu=0, sd=10, shape=1)
    beta_age_sex = pm.Normal("beta_age_sex", mu=0, sd=10, shape=1)
    mu = intercept + beta_age * age + beta_age_sex * age_sex
    eps = pm.HalfCauchy('eps',20)
    mdl_lkl = pm.Normal('model', mu=mu, sd=eps, observed=X)
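
To see why one age slope plus a separate age slope per sex cannot all be pinned down by the data, it may help to look at the design matrix directly. The following is a minimal NumPy sketch on synthetic data, not a PyMC3 model: the variable names, the simulated coefficients, and the use of least squares in place of sampling are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(size=n)            # stand-in for Age_mean_scaled
sex = rng.integers(0, 2, size=n)    # 0/1 indicator, stand-in for Sex_w

# Question's parameterisation: age, age*(sex == 0), age*(sex == 1).
# The last two columns sum to the first, so the matrix is rank deficient.
X_over = np.column_stack([age, age * (sex == 0), age * (sex == 1)])
rank_over = np.linalg.matrix_rank(X_over)

# Answer's parameterisation: intercept, age, age*sex.
X_inter = np.column_stack([np.ones(n), age, age * sex])
rank_inter = np.linalg.matrix_rank(X_inter)

# Simulate an outcome with made-up coefficients and recover them.
true_beta = np.array([1.0, 2.0, 3.0])   # intercept, beta_age, beta_age_sex
y = X_inter @ true_beta + rng.normal(scale=0.1, size=n)
beta_hat, *_ = np.linalg.lstsq(X_inter, y, rcond=None)

print(rank_over, rank_inter)
print(beta_hat)
```

With three slope columns of rank 2, the data only determine the two sums (the slope for each sex), not the three coefficients separately. With a single interaction column, beta_age is the slope for sex = 0 and beta_age_sex is the difference in slope for sex = 1.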

Configuring Auto Arima for SARIMAX

I have a weekly time series from Jan 2018 to Dec 2021 that I am trying to build a model for. I am also using a weekly series of exogenous variables representing COVID cases.
I am not getting great results and I am unsure if it's because I've not considered something obvious (I'm new to time series prediction) or if it's because the data is just hard to predict. Would anyone be able to provide any advice on how to move forward in this situation?
Screenshots of the data (images not included): raw series, seasonal component, trend, and residual.
Below is the code I'm using to run auto_arima. I set the seasonal differencing to 1 to force seasonal differencing as per tips on the auto_arima site.
from pmdarima.arima import auto_arima
y = merged_ts['y']
exogenous = merged_ts['exogenous']
train_size = int(len(y) * 0.8)
train_y = y[:train_size]
test_y = y[train_size:]
train_exogenous = merged_ts['exogenous'][:train_size]
test_exogenous = merged_ts['exogenous'][train_size:]
exogenous_df = pd.DataFrame(train_exogenous)[:train_size]
step_wise = auto_arima(
    train_y,
    X=exogenous_df,
    m=52,
    D=1,
    seasonal=True,
    trace=True,
    stepwise=True,
    n_jobs=-1,
    error_action='ignore',
    suppress_warnings=True)
best_order = step_wise.order
best_seasonal_order = step_wise.seasonal_order
I get best_order = (0, 0, 0) and best_seasonal_order = (1, 1, 0, 52)
The AIC for the best model is 827.
I then configure SARIMAX as follows
from statsmodels.tsa.statespace.sarimax import SARIMAX
start = len(train_y)
end = len(train_y) + len(test_y)
model = SARIMAX(
    train_y,
    exogenous=exogenous_df,
    order=best_order,
    seasonal_order=best_seasonal_order,
    enforce_invertibility=False)
results = model.fit()
forecasting_window_for_validation = len(test_y)
forecast = results.predict(start, end, typ='levels')
forecast_based_on_forecasting_window = pd.DataFrame(
    forecast[:forecasting_window_for_validation])
forecast_based_on_forecasting_window.set_index(
    test_y.index[:forecasting_window_for_validation],
    inplace=True)
forecast_based_on_forecasting_window = pd.merge(
    forecast_based_on_forecasting_window,
    test_y,
    left_index=True,
    right_index=True,
    how='left')
forecast_based_on_forecasting_window.columns = ['Forecast', 'Actual']
forecast_based_on_forecasting_window['Forecast'] = forecast_based_on_forecasting_window[['Forecast']]
forecast_based_on_forecasting_window['Actual'] = forecast_based_on_forecasting_window[['Actual']]
forecast_based_on_forecasting_window.plot()
The mean squared error I get is 146.
Result
I would love some pointers on what I might be doing wrong or ways to improve it. My main issue is that I can't tell whether the problem is my lack of experience or the weak predictive power of the data, although I can see a seasonal pattern. I've tried a random walk approach, a simple moving average, and last-value models. It feels like a seasonal model should be doable, but I'm not sure.
Thank you for any tips at all!

Knn give more weight to specific feature in distance

I'm using the Kobe Bryant Dataset.
I wish to predict the shot_made_flag with KnnRegressor.
I've used game_date to extract year and month features:
# convert season to years
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile(r'(\d+)-').findall(x)[0]))
# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
and I wish to give the season, year and month features more weight in the distance function, so that events closer in date to the current event are closer neighbors, while still keeping reasonable distances to other data points. For example, I don't want an event on the same day to be the closest neighbor just because of the date features; the other features, such as shot_range, should still be taken into account.
To give them more weight I tried the metric argument with a custom distance function, but the function's arguments are plain NumPy arrays without the pandas column information, so I'm not sure how to implement what I'm trying to do.
EDIT:
Using larger weights for date features to find the optimal k with cv of 10 running on k from [1, 100]:
from IPython.display import display
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features
not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)
neighbors = [x for x in range(1, 100)]
cv_scores = []
weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
X.columns.get_loc("year"),
X.columns.get_loc("month")
]] = 5
weight = weight/weight.sum() #Normalize weights
def my_distance(x, y):
    dist = ((x-y)**2)
    return np.dot(dist, weight)
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))
# optimal k: ROC AUC is a score, so the best k has the highest value
optimal_k_index = cv_scores.index(max(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()
It runs really slowly. Any idea how to make it faster?
The idea of the weighted features is to find neighbors closer to the data point's date, to avoid data leakage, and to use cv to find the optimal k.
First, you have to prepare a numpy 1D weight array, specifying weight for each feature. You could do something like:
weight = np.ones((M,)) # M is no of features
weight[[1,7,10]] = 2 # Increase weight of 1st,7th and 10th features
weight = weight/weight.sum() #Normalize weights
You can use kobe_data_encoded.columns to find the indexes of the season, year and month features in your dataframe and use them in the second line above.
Now define a distance function which, by guideline, has to take two 1D numpy arrays.
def my_dist(x,y):
    global weight #1D array, same shape as x or y
    dist = ((x-y)**2) #1D array, same shape as x or y
    return np.dot(dist,weight) # a scalar float
And initialize KNeighborsRegressor as:
knn = KNeighborsRegressor(metric=my_dist)
EDIT:
To make things efficient, you can precompute the distance matrix and reuse it in KNN. This should bring a significant speedup by reducing calls to my_dist, since this non-vectorized custom Python distance function is quite slow. So now:
X_arr = np.asarray(X) # plain array, so X_arr[i] is row i (X[i] on a DataFrame would look up a column)
dist = np.zeros((len(X_arr),len(X_arr))) #Computing NXN distance matrix
for i in range(len(X_arr)): # You can halve this by using the fact that dist[i,j] = dist[j,i]
    for j in range(len(X_arr)):
        dist[i,j] = my_dist(X_arr[i],X_arr[j])
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed') #Note: metric='precomputed'
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc'))) #Note: passing dist instead of X
I couldn't test it, so let me know if something isn't alright.
Just to add to Shihab's answer regarding distance computation: you can use scipy's pdist, as suggested in this post, which is faster and more efficient.
from scipy.spatial.distance import pdist, squareform
# create the custom weight array
weight = ...
# calculate pairwise distances, using Minkowski norm (p=2) with custom weights
distances = pdist(X, metric='minkowski', p=2, w=weight)
# reformat the result as a square matrix
distances_as_2d_matrix = squareform(distances)
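
Another way to avoid the slow Python callable, assuming the distance is the weighted squared difference used in my_dist above, is to scale each feature by the square root of its weight and then use ordinary Euclidean distance. The sketch below checks that equivalence with NumPy on made-up data; X_demo and the choice of up-weighted columns are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((6, 4))            # hypothetical data: 6 samples, 4 features

weight = np.ones(X_demo.shape[1])
weight[[1, 3]] = 5                     # up-weight two (arbitrary) features
weight = weight / weight.sum()         # normalize weights

def my_dist(x, y):
    # weighted squared Euclidean distance, as in the answer above
    return np.dot((x - y) ** 2, weight)

# Scale each column by sqrt(weight); plain squared Euclidean distance on the
# scaled data then equals the weighted squared distance on the original data.
X_scaled = X_demo * np.sqrt(weight)

diff = X_scaled[:, None, :] - X_scaled[None, :, :]
dist_fast = (diff ** 2).sum(axis=-1)   # all pairs at once, no Python loop

dist_slow = np.array([[my_dist(a, b) for b in X_demo] for a in X_demo])
print(np.allclose(dist_fast, dist_slow))
```

Because ranking neighbors by squared distance or by distance gives the same order, the scaled array could be passed to KNeighborsClassifier with its default metric, which sidesteps the custom metric entirely.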

Minimizing negative log-likelihood of logistic regression, scipy returning warning: "Desired error not necessarily achieved due to precision loss."

I'm trying to sort out why scipy optimize isn't converging on a solution for the minimum negative-log-likelihood of the logistic regression function (as implemented below).
It seems to converge for smaller data sets, but for the larger data sets scipy returns the warning: "Desired error not necessarily achieved due to precision loss."
I thought this was a well-behaved optimization problem, so I'm anxious that I'm missing an obvious mistake.
Can anyone spot a mistake in my implementation or make a suggestion that I might try?
I'm using the default method, but I have had little luck with the other methods that minimize allows.
Many thanks!
Quick summary of the implementation. I'm minimizing the following statement:
with the caveat that since b is a constant, I'm using the exponent -(w*x + b). I think I've implemented that function correctly, but maybe I'm not seeing something. Since the data are constants with respect to the function being minimized, I just output a function definition that retains the data within it; thus, the function to be minimized only accepts the weights.
The data is a pandas dataframe of the format: rows == samples, columns == attributes, but LAST column == label (0 or 1). I've transformed all the data to make sure it is continuous, and I've normalized it to have a mean of 0 and a standard deviation of 1. I'm also starting with random weights between [0, 0.1], treating the first weight as 'b'.
def get_optimization_func_call(data, sheepda):
    #
    # Extract pos/neg data without label
    pos_df = data[data[LABEL] == 1].as_matrix()[:, :-1]
    neg_df = data[data[LABEL] == 0].as_matrix()[:, :-1]
    #
    # Def evaluation of positive terms by row
    def eval_pos_row(pos_row, w, b):
        cur_exponent = np.dot(w, pos_row) + b
        cur_val = expit(cur_exponent)
        if cur_val == 0:
            print("pos", cur_exponent)
        return (-1 * np.log(cur_val))
    #
    # Def evaluation of negative terms by row
    def eval_neg_row(neg_row, w, b):
        cur_exponent = np.dot(w, neg_row) + b
        cur_val = 1.0 - expit(cur_exponent)
        if cur_val == 0:
            print("neg", cur_exponent)
        return (-1 * np.log(cur_val))
    #
    # Define the function used for optimization
    def log_likelihood(weights):
        #
        # Separate weights
        w = weights[1:]
        b = weights[0]
        #
        # Get the norm of weights
        w_norm = np.dot(w, w)
        #
        # Sum over positive and negative examples
        pos_sum = np.sum(
            np.apply_along_axis(eval_pos_row, 1, pos_df, w, b)
        )
        neg_sum = np.sum(
            np.apply_along_axis(eval_neg_row, 1, neg_df, w, b)
        )
        #
        return (0.5 * w_norm) + sheepda * (pos_sum + neg_sum)
    return log_likelihood
w = uniform.rvs(size=20) / 10.0
LL = get_optimization_func_call(clean_test_data, 0.5)
res = minimize(LL, w, options={"maxiter": 1e4, "disp": True})
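
The same objective, as read from the code above (0.5 * w·w plus sheepda times the summed per-row log losses), can also be written in a vectorized form that avoids taking the log of a value that has rounded to 0. The following is only a sketch under that reading: make_stable_nll and the synthetic arrays are hypothetical names, and whether this removes the warning on the original data is not established here.

```python
import numpy as np

def make_stable_nll(X, y, lam):
    """Return f(weights) = 0.5*||w||^2 + lam * sum of per-sample log losses.

    X: (n_samples, n_features) array, y: array of 0/1 labels,
    lam: the multiplier called `sheepda` in the question.
    """
    signs = np.where(y == 1, 1.0, -1.0)

    def nll(weights):
        b, w = weights[0], weights[1:]
        z = X @ w + b
        # -log(sigmoid(z)) = log(1 + exp(-z)); -log(1 - sigmoid(z)) = log(1 + exp(z)).
        # np.logaddexp(0, t) evaluates log(1 + exp(t)) without overflow.
        losses = np.logaddexp(0.0, -signs * z)
        return 0.5 * np.dot(w, w) + lam * losses.sum()

    return nll

# Small synthetic check against the naive formula (values chosen arbitrarily).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.integers(0, 2, size=50)
w_demo = rng.uniform(0, 0.1, size=4)

f = make_stable_nll(X_demo, y_demo, 0.5)
z = X_demo @ w_demo[1:] + w_demo[0]
p = 1.0 / (1.0 + np.exp(-z))
naive = 0.5 * w_demo[1:] @ w_demo[1:] - 0.5 * np.sum(
    y_demo * np.log(p) + (1 - y_demo) * np.log(1 - p))
print(np.isclose(f(w_demo), naive))

# Extreme scores stay finite, where log(1 - expit(z)) would give inf.
big = make_stable_nll(np.array([[1000.0]]), np.array([0]), 1.0)
big_value = big(np.array([0.0, 1.0]))
print(big_value)
```

The returned function takes only the weight vector, with the first entry treated as b, so it could be handed to minimize in the same way as LL.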

Linear Regression overfitting

I'm taking course 2 of this Coursera specialization, on linear regression (https://www.coursera.org/specializations/machine-learning)
I've solved the exercise using graphlab but wanted to try out sklearn for the experience and learning. I'm using sklearn and pandas for this.
The model overfits the data. How can I fix this? This is the code.
These are the coefficients I'm getting.
[ -3.33628603e-13 1.00000000e+00]
poly1_data = polynomial_dataframe(sales["sqft_living"], 1)
poly1_data["price"] = sales["price"]
model1 = LinearRegression()
model1.fit(poly1_data, sales["price"])
print(model1.coef_)
plt.plot(poly1_data['power_1'], poly1_data['price'], '.',poly1_data['power_1'], model1.predict(poly1_data),'-')
plt.show()
The plotted line is like this. As you see it connects every data point.
and this is the plot of the input data
I wouldn't even call this overfit. I'd say you aren't doing what you think you should be doing. In particular, you forgot to add a column of 1's to your design matrix, X. For example:
# generate some univariate data
x = np.arange(100)
y = 2*x + x*np.random.normal(0,1,100)
df = pd.DataFrame([x,y]).T
df.columns = ['x','y']
You're doing the following:
model1 = LinearRegression()
X = df["x"].values.reshape(1,-1)[0] # reshaping data
y = df["y"].values.reshape(1,-1)[0]
model1.fit(X,y)
Which leads to:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(X[0], model1.predict(X)[0],'-')
plt.show()
Instead, you want to add a column of 1's to your design matrix (X):
X = np.column_stack([np.ones(len(df)), df["x"].values]) # shape (n_samples, 2)
y = df["y"].values # shape (n_samples,)
model1.fit(X,y)
And (after some reshaping) you get:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(df['x'].values, model1.predict(X),'-')
plt.show()
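
A separate thing that may be worth checking in the question's code: price is added to poly1_data before the fit, so the target is also one of the features, and the reported coefficients (about 0 and 1) are what that would produce. The sketch below illustrates the effect with NumPy least squares on made-up numbers; the sqft and price arrays are hypothetical, not the course data.

```python
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)                  # hypothetical living area
price = 250.0 * sqft + rng.normal(scale=5e4, size=200)   # hypothetical noisy price

# Features that (mistakenly) include the target, plus an intercept column.
X_leaky = np.column_stack([np.ones_like(sqft), sqft, price])
coef_leaky, *_ = np.linalg.lstsq(X_leaky, price, rcond=None)

# Features without the target.
X_clean = np.column_stack([np.ones_like(sqft), sqft])
coef_clean, *_ = np.linalg.lstsq(X_clean, price, rcond=None)

resid_leaky = price - X_leaky @ coef_leaky
resid_clean = price - X_clean @ coef_clean
print(coef_leaky[1:])          # slope on sqft, slope on price
print(np.abs(resid_leaky).max(), np.abs(resid_clean).max())
```

In the leaky fit the model simply copies the price column, which would explain a line that passes through every point. If that is what happened, fitting on the power_1 column alone would be the thing to try.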

My TensorFlow Gradient Descent diverges

import tensorflow as tf
import pandas as pd
import numpy as np
def normalize(data):
    return data - np.min(data) / np.max(data) - np.min(data)
df = pd.read_csv('sat.csv', skipinitialspace=True)
x_reading = df['reading_score']
x_math = df['math_score']
x_reading, x_math = np.array(x_reading[df.reading_score != 's']), np.array(x_math[df.math_score != 's'])
x_data = normalize(np.float32(np.array([x_reading, x_math])))
y_writing = df[['writing_score']]
y_data = normalize(np.float32(np.array(y_writing[df.writing_score != 's'])))
W = tf.Variable(tf.random_uniform([1, 2], -.5, .5)) #float32
b = tf.Variable(tf.ones([1]))
y = tf.matmul(W, x_data) + b
loss = tf.reduce_mean(tf.square(y - y_data.T))
optimizer = tf.train.GradientDescentOptimizer(0.005)
train = optimizer.minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    for step in range(1000):
        sess.run(train)
        print step, sess.run(W), sess.run(b), sess.run(loss)
Here's my code. My sat.csv contains SAT reading, writing and math scores. As you can guess, the difference between the features is not that big.
This is a part of sat.csv.
DBN,SCHOOL NAME,Num of Test Takers,reading_score,math_score,writing_score
01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363
01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366
01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370
01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359
01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384
01M515,LOWER EAST SIDE PREPARATORY HIGH SCHOOL,112,332,557,316
01M539,"NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND MATH HIGH SCHOOL",159,522,574,525
01M650,CASCADES HIGH SCHOOL,18,417,418,411
01M696,BARD HIGH SCHOOL EARLY COLLEGE,130,624,604,628
02M047,47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECONDARY SCHOOL,16,395,400,387
I've only used the math, writing and reading scores. My goal for the code above is to predict the writing score given the math and reading scores.
I've never seen TensorFlow's gradient descent diverge on such simple data. What could be wrong?
Here are a few options you could try:
Normalise your input and output data
Set smaller initial values for your weights
Use a lower learning rate
Divide your loss by the number of samples you have (not putting your data in a placeholder is already uncommon).
Let me know what (if any) of these options helped and good luck!
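
On the first option: the normalize function in the question has no parentheses, so Python evaluates it as data - (min / max) - min and the values stay in the hundreds, which may be enough to make gradient descent diverge at this learning rate. A minimal sketch of the difference follows; the function names are made up for illustration and the numbers are the reading scores from the sat.csv excerpt above.

```python
import numpy as np

def normalize_as_written(data):
    # Operator precedence: this is data - (min / max) - min
    return data - np.min(data) / np.max(data) - np.min(data)

def min_max_normalize(data):
    # Parenthesised min-max scaling into [0, 1]
    lo, hi = np.min(data), np.max(data)
    return (data - lo) / (hi - lo)

# reading_score values from the sat.csv excerpt
scores = np.float32([355, 383, 377, 414, 390, 332, 522, 417, 624, 395])

raw = normalize_as_written(scores)
scaled = min_max_normalize(scores)
print(raw.min(), raw.max())
print(scaled.min(), scaled.max())
```

Applying the same min-max scaling to the targets keeps inputs and outputs on a comparable scale.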