Scikit-learn prediction intervals for future values? - pandas

The numbers predicted by my code below are very specific and I do not get any exact matches, but some are pretty close. For example, on a certain date there were actually 388 events and this might predict 397.
Can I output a range of like 370 - 410? Or see the percentage chance that the value will be between a range? Or should I bin the values and check for accuracy that way?
Code:
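import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

def make_prediction(label, prediction):
    # df, col1, col2 and col3 are defined elsewhere
    X = df[[col1, col2, col3]].values
    y = df[label].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = linear_model.LinearRegression()
    clf.fit(X_train, y_train)
    # Predict for every existing row, not only for the test split
    output = clf.predict(X)
    result = np.c_[X, output]
    df_result = pd.DataFrame(result, columns=[col1, col2, col3, prediction])
    return df_result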
So the code above places a value on each row (each row is a date in this case, but I number them from 1 onward, starting from the first value in the data set). How do I predict future values? When I run the code above I only get the predicted values for existing data. How can I use that model on other data sets, or input future dates?

Assuming that you require binning on top of predicted values, you can use pandas cut() as follows:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([270,201,375,370,410,510], columns=['prediction'])
In [3]: bins = [0,370,420,600]
In [4]: group_labels = ['(0-370]', '(371-420]', '(421-600]']
In [5]: df['prediction_range'] = pd.cut(df.prediction, bins, labels=group_labels)
In [6]: df
Out[6]:
   prediction prediction_range
0         270          (0-370]
1         201          (0-370]
2         375        (371-420]
3         370          (0-370]
4         410        (371-420]
5         510        (421-600]
Reference: Binning Data In Pandas
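For the "range like 370 - 410" and "future values" parts of the question, here is a rough sketch of one simple option that the answer above does not cover: an empirical interval built from held-out residuals. The data is synthetic, and names such as X_future and half_width are made up for illustration. It assumes future rows behave like the held-out rows, so treat the interval as approximate rather than a guarantee.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the three feature columns and the event counts
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 50 + X @ np.array([3.0, -2.0, 5.0]) + rng.normal(0, 4, size=500)

# Hold out rows that the model never sees during fitting
X_train, X_held, y_train, y_held = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Absolute errors on the held-out rows; their 90th percentile becomes
# the half-width of an approximate 90% interval
abs_resid = np.abs(y_held - model.predict(X_held))
half_width = np.quantile(abs_resid, 0.90)

# "Future" input: any new row with the same three features works,
# it does not have to come from the original data set
X_future = np.array([[11.0, 2.0, 7.0]])
point = model.predict(X_future)
lower, upper = point - half_width, point + half_width
print(f"prediction {point[0]:.0f}, approx. 90% range {lower[0]:.0f} to {upper[0]:.0f}")
```

The same fitted model can be reused on any other data set by calling model.predict on rows that have the same columns in the same order.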

Related

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1]+target_row # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3) # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row : add(row, target_row), axis = 1)
print(time.time()-start)
So I have a dataframe of size 30'000 and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I have already read "Optimization when using Pandas". Are there further resources you can recommend? Thanks
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
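# Build a new array instead of using +=, because to_numpy() may return a view
# and an in-place addition could then modify df itself
sums = np_arr + target_row
check = np.sum(sums == 4, axis=1) - np.sum(sums == 3, axis=1)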
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
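As a sanity check, here is a small sketch that runs the loop and the vectorised version side by side on a reduced frame (200 rows instead of 30'000, chosen only to keep it quick) and compares the scores. The variable names loop_scores and vec_scores are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(200, 50)))
target = df.loc[5].to_numpy()

# Row-by-row version, same scoring rule as in the question
loop_scores = []
for _, row in df.iterrows():
    check = row.to_numpy() + target
    loop_scores.append(int(np.sum(check == 4) - np.sum(check == 3)))

# Vectorised version: broadcasting adds the target row to every row at once
sums = df.to_numpy() + target
vec_scores = (sums == 4).sum(axis=1) - (sums == 3).sum(axis=1)

print(loop_scores == vec_scores.tolist())
```

Broadcasting works here because target has shape (50,) and the data has shape (200, 50), so the addition is applied along each row.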

calculating the covariance matrix fast in python with some minor customizing

I have a pandas data frame and I'm trying to find the covariance of the percentage change of each column. For each pair of columns, I want rows with missing values to be dropped, and the percentage change to be calculated afterwards. That is, I want something like this:
import pandas as pd
import numpy as np
# create dataframe example
N_ROWS, N_COLS = 249, 3535
df = pd.DataFrame(np.random.random((N_ROWS, N_COLS)))
df.iloc[np.random.choice(N_ROWS, N_COLS), np.random.choice(10, 50)] = np.nan
cov_df = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
        cov_df.loc[col_i, col_j] = cov.iloc[0, 1]
The problem is that this is very slow. The code below gives results that are similar to (but not exactly) what I want, and it runs quite fast:
df.dropna(how='any', axis=0).pct_change().cov()
I am not sure why the second one runs so much faster. I want to speed up my code in the first, but I can't figure out how.
I have tried using combinations from itertools to avoid repeating the calculation for (col_i, col_j) and (col_j, col_i), and using map from multiprocessing to do the computations in parallel, but it still hasn't finished running after 90+ minutes.
Somehow this works fast enough, although I am not sure why:
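import numpy as np
from scipy.stats import pearsonr
# x is not defined in the post; presumably the data as a 2-D float array, e.g. x = df.to_numpy()
corr = np.zeros((x.shape[1], x.shape[1]))
for i in range(x.shape[1]):
    for j in range(i + 1, x.shape[1]):
        # keep only the rows where both columns are present, then take percentage changes
        y = x[:, [i, j]]
        y = y[~np.isnan(y).any(axis=1)]
        y = np.diff(y, axis=0) / y[:-1, :]
        if len(y) < 2:
            corr[i, j] = np.nan
            continue
        y = pearsonr(y[:, 0], y[:, 1])[0]
        corr[i, j] = y
# mirror the upper triangle and set the diagonal
corr = corr + corr.T
np.fill_diagonal(corr, 1)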
This finishes within 8 minutes, which is fast enough for my use case.
On the other hand, this has been running for 30 minutes and still isn't done:
corr = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        corr_ij = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().corr().iloc[0, 1]
        corr.loc[col_i, col_j] = corr_ij
t1 = time.time()
I don't know why this is, but the first one is a good enough solution for me for now.
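A plausible explanation (not measured here) is that the pandas loop builds a new small DataFrame and goes through dropna, pct_change and corr for every pair, and that per-call overhead adds up over millions of pairs, while the NumPy loop works on plain arrays. The sketch below only checks that the two per-pair computations agree on a tiny made-up frame. The helper names pair_corr_numpy and pair_corr_pandas are hypothetical, and np.corrcoef is swapped in for pearsonr to keep the example free of SciPy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((40, 4)) + 0.5)  # strictly positive values
df.iloc[[3, 10], 0] = np.nan
df.iloc[[7], 2] = np.nan

def pair_corr_numpy(x, i, j):
    """Correlation of percentage changes for columns i and j, on plain arrays."""
    y = x[:, [i, j]]
    y = y[~np.isnan(y).any(axis=1)]      # drop rows missing in either column
    y = np.diff(y, axis=0) / y[:-1, :]   # percentage change
    if len(y) < 2:
        return np.nan
    return np.corrcoef(y[:, 0], y[:, 1])[0, 1]

def pair_corr_pandas(frame, i, j):
    """Same quantity via the pandas chain used in the question."""
    return frame[[i, j]].dropna(how='any', axis=0).pct_change().corr().iloc[0, 1]

x = df.to_numpy()
for i in range(x.shape[1]):
    for j in range(i + 1, x.shape[1]):
        print(i, j, pair_corr_numpy(x, i, j), pair_corr_pandas(df, i, j))
```

Multiplying each pairwise correlation by the two standard deviations of the same filtered percentage changes would give the covariance the question originally asked for.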

running LinearRegression model using scikit-learn with different pandas dataframe (loop question)

I have a dataframe with cost, wind, solar and hour of day, and I would like to use the linear regression model from scikit-learn to find out how wind and solar impact the cost. I have labelled each hour P1 to P24 (24 hours a day), i.e. each row is assigned a P(1-24) depending on the hour of the day.
Therefore I have split the rows of wind/solar/cost into different dataframes according to the hour of the day.
The code runs okay and does everything I want. However, I struggle to build a for loop that runs repeatedly for every hour to find linreg.intercept_, linreg.coef_ and np.sqrt(metrics.mean_squared_error(y_test, y_pred)) from scikit-learn on the various pandas dataframes (P1 to P24).
So at the moment I have to manually change the P number 24 times to find the corresponding intercept/coefficient/root mean squared error for each hour.
I have some code below for the work, but I always struggle to build for loops.
I tried to build the for loop using for i in [P1,P2...], but the dataframe became a list and I also struggled to incorporate it into the scikit-learn part.
b is the original dataframe with columns: cost, Period (half-hourly, therefore I have periods 1 to 48), wind, solar
# import the dataframe
b = pd.read_csv('/Users/Downloads/cost_latest.csv')
To group the half-hourly periods into hours:
P1 = b[b['Period'].isin(['01','02'])]
P2 = b[b['Period'].isin(['03','04'])]...
The scikit-learn part:
feature_cols = ['wind','Solar']
X = P1[feature_cols]
y = P1['Price']
And here is my issue: I need to change P1 to P2...P24 before running the following code to get my parameters.
The following is the rest of the scikit-learn part:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
print(linreg.intercept_)
print(linreg.coef_)
list(zip(feature_cols, linreg.coef_))
y_pred = linreg.predict(X_test)
from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
I think there is a smarter way that avoids manually editing the following lines (the P value) and runs everything in one go:
X = P1[feature_cols]
y = P1['Price']
I welcome your advice and suggestions. Thanks
Just use this:
for P in [P1,P2, P3,P4,P5,P6,P7]:
    X = P[feature_cols]
    y = P['Price']
All together:
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
all_intercepts = []
all_coefs = []
for P in [P1,P2, P3,P4,P5,P6,P7]:
    X = P[feature_cols]
    y = P['Price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    print(linreg.intercept_)
    print(linreg.coef_)
    list(zip(feature_cols, linreg.coef_))
    y_pred = linreg.predict(X_test)
    print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    all_intercepts.append(linreg.intercept_)
    all_coefs.append(linreg.coef_)
print(all_intercepts)
print(all_coefs)
On each iteration, P is one of your dataframes P1, P2, ...
Put all Pn dataframes in a list and run your code.
all_P = [P1, P2, P3]
for P in all_P:
    X = P[feature_cols]
    y = P['Price']
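An alternative that avoids creating P1 to P24 by hand is to derive the hour from Period and loop over a groupby. This is only a sketch on synthetic data. It assumes Period holds zero-padded strings '01' to '48' and that the columns are named wind, Solar and Price as in the question's code. The hour column and the summary frame are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cost_latest.csv (assumed layout, see the note above)
rng = np.random.default_rng(0)
n = 4800
b = pd.DataFrame({
    'Period': [f'{p:02d}' for p in rng.integers(1, 49, size=n)],
    'wind': rng.uniform(0, 100, size=n),
    'Solar': rng.uniform(0, 50, size=n),
})
b['Price'] = 60 - 0.2 * b['wind'] - 0.1 * b['Solar'] + rng.normal(0, 2, size=n)

feature_cols = ['wind', 'Solar']
# Periods 1-2 -> hour 1, periods 3-4 -> hour 2, ..., periods 47-48 -> hour 24
b['hour'] = (b['Period'].astype(int) + 1) // 2

rows = []
for hour, P in b.groupby('hour'):
    X_train, X_test, y_train, y_test = train_test_split(
        P[feature_cols], P['Price'], random_state=1
    )
    linreg = LinearRegression().fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, linreg.predict(X_test)))
    rows.append({
        'hour': hour,
        'intercept': linreg.intercept_,
        **dict(zip(feature_cols, linreg.coef_)),
        'rmse': rmse,
    })

# One row per hour with intercept, coefficients and RMSE
summary = pd.DataFrame(rows).set_index('hour')
print(summary)
```

Collecting the results in one frame makes it easy to compare the coefficients across hours instead of reading 24 separate printouts.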

Cannot cast array data from dtype('<M8[ns]') to dtype('float64')

I am trying to predict ticket sales and receive the following error:
TypeError: Cannot cast array data from dtype('<M8[ns]') to dtype('float64') according to the rule 'safe'
My code is below. The error seems to occur when running pred_lr = linear_reg.predict(X_all). I assume I have to change the type somewhere, but I couldn't figure out what I am actually doing wrong.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load data
event_data = pd.read_csv('event_data.csv')
# Explore data
data = pd.DataFrame(event_data)
split_date = pd.datetime(2019,3,31)
data['created'] = pd.to_datetime(data['created'])
data_train = data[data.created < split_date]
data_test = data[data.created >= split_date]
# predict prices based on date
X_train = data_train.created[:, np.newaxis]
y_train = data_train.tickets_sold
linear_reg = LinearRegression().fit(X_train, y_train)
# predict on all data
X_all = event_data.created[:, np.newaxis]
pred_lr = linear_reg.predict(X_all)
Here is the head of my data:
   created  event_id  tickets_sold  tickets_sold_sum
0  3/12/19         1            90                90
1  3/13/19         1            40               130
2  3/14/19         1            13               143
3  3/15/19         1             8               151
4  3/16/19         1            13               164
The simplest way to deal with datetime values is to convert them into POSIX timestamps (whole seconds; the int64 view of a datetime64[ns] column is in nanoseconds, hence the division by 10**9).
X_train = data_train.created.astype("int64").values.reshape(-1, 1) // 10**9
and
X_all = data.created.astype("int64").values.reshape(-1, 1) // 10**9
(data is used here rather than event_data because its created column is the one parsed by pd.to_datetime.)
However, this way you are going to learn almost nothing useful for predicting future data, since the POSIX time values of the test set lie outside the range of the POSIX time values in the training set.
My suggestion is to modify X_train and X_all so as to get from the date multiple informative features (as categorical features using a one-hot encoding):
day of the week
day of the month
month of the year
year
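Both ideas can be sketched on the five rows shown in the question. The timestamp conversion below uses a subtraction from the epoch instead of astype("int64"), which is a swapped-in formulation chosen because it does not depend on the column's time resolution. The calendar frame and its column names are made up for this example.

```python
import pandas as pd

data = pd.DataFrame({
    'created': ['3/12/19', '3/13/19', '3/14/19', '3/15/19', '3/16/19'],
    'tickets_sold': [90, 40, 13, 8, 13],
})
data['created'] = pd.to_datetime(data['created'], format='%m/%d/%y')

# Option 1: POSIX timestamps in whole seconds
posix_seconds = (data['created'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

# Option 2: calendar features, one-hot encoded
calendar = pd.DataFrame({
    'day_of_week': data['created'].dt.dayofweek,  # Monday=0 ... Sunday=6
    'day_of_month': data['created'].dt.day,
    'month': data['created'].dt.month,
    'year': data['created'].dt.year,
})
X = pd.get_dummies(calendar, columns=list(calendar.columns))

print(posix_seconds.tolist())
print(X.columns.tolist())
```

Note that one-hot columns only exist for values seen in the data, so the training and prediction frames need the same set of columns (for example by encoding them together or reindexing).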

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some columns of binary (Y). BernoulliNB predicts the probability of X given Y in the test data based on the training data. I am trying to look up the probability of the observed class of each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})
# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})
# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns)-set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0
# Use the same column order as TEST (set iteration order is arbitrary)
TRAIN = TRAIN[TEST.columns]
# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST['X']
df_Tr_Y = TRAIN.drop('X', axis=1)
df_Te_Y = TEST.drop('X', axis=1)
# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)
# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)
# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R.join(TEST)
# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        return df.get(j, {}).get(i, np.nan)
    return lu
df_S['Pr'] = [*map(get_lu(df_R), df_S.T, df_S.X)]
This seemed to work, giving me the result (df_S):
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.
Ok, there are a couple of issues here. I have a full working example below, but first those issues, mainly the assertion that "This correctly gives a "NaN" for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue which is the line df_Te_Y = TEST .iloc[ : , 1 : ], that line should be df_Te_Y = TEST .iloc[ : , 2 : ], so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd
BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
                        'X1': [1,1,0,1,0,1,0,0,0],
                        'X2': [1,0,1,0,1,0,1,0,1],
                        'X3': [1,1,0,1,1,0,0,0,0],
                        'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1) # Known training data - all but 'Y' column.
Y = train_df['Y'] # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1) # Test data.
Y_te = test_df['Y'] # Only used to measure accuracy of prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te) # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_ # Rename as per class labels.
# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
print(df_R)
predicted_labels = df_R.idxmax(axis=1).values # For each row, take the column with the highest prob in that row.
print(predicted_labels) # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels)) # Percent accuracy of prediction.
print(BNB.fit(X, Y).predict(X_te)) # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_label is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.
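If the goal is still the original one, looking up the probability the model assigned to each row's observed label, that can be done after prediction without feeding the label to the classifier. The sketch below is one way to do it, not the only one. It reuses the data from the example above, and the names proba, col_pos and pr are made up here. Rows whose observed label never appeared in training get NaN, and that NaN comes from the lookup step, not from the model.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

train_df = pd.DataFrame({'Y': [1, 2, 3, 9],
                         'X1': [1, 1, 0, 0], 'X2': [0, 0, 0, 0],
                         'X3': [0, 0, 0, 0], 'X4': [1, 0, 0, 0]})
test_df = pd.DataFrame({'Y': [5, 0, 1, 1, 1, 2, 2, 2, 2],
                        'X1': [1, 1, 0, 1, 0, 1, 0, 0, 0],
                        'X2': [1, 0, 1, 0, 1, 0, 1, 0, 1],
                        'X3': [1, 1, 0, 1, 1, 0, 0, 0, 0],
                        'X4': [1, 1, 0, 1, 1, 0, 0, 0, 0]})
features = ['X1', 'X2', 'X3', 'X4']

model = BernoulliNB().fit(train_df[features], train_df['Y'])
proba = pd.DataFrame(model.predict_proba(test_df[features]),
                     columns=model.classes_, index=test_df.index)

# Column position of each observed label; -1 marks labels unseen in training
col_pos = pd.Index(model.classes_).get_indexer(test_df['Y'])

# Pick one probability per row, then blank out the rows with unseen labels
picked = proba.to_numpy()[np.arange(len(proba)), col_pos]
pr = np.where(col_pos >= 0, picked, np.nan)
print(pd.DataFrame({'Y': test_df['Y'], 'Pr': pr}))
```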