Cannot cast array data from dtype('<M8[ns]') to dtype('float64') - numpy

I am trying to predict ticket sales and receive the following error:
TypeError: Cannot cast array data from dtype('<M8[ns]') to dtype('float64') according to the rule 'safe'
My code is attached below. The error seems to occur when running pred_lr = linear_reg.predict(X_all). I assume I have to change the type somewhere, but I can't figure out what I am doing wrong.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load data
event_data = pd.read_csv('event_data.csv')
# Explore data
data = pd.DataFrame(event_data)
split_date = pd.Timestamp(2019, 3, 31)
data['created'] = pd.to_datetime(data['created'])
data_train = data[data.created < split_date]
data_test = data[data.created >= split_date]
# predict prices based on date
X_train = data_train.created[:, np.newaxis]
y_train = data_train.tickets_sold
linear_reg = LinearRegression().fit(X_train, y_train)
# predict on all data
X_all = event_data.created[:, np.newaxis]
pred_lr = linear_reg.predict(X_all)
Here is the head of my data:
   created  event_id  tickets_sold  tickets_sold_sum
0  3/12/19         1            90                90
1  3/13/19         1            40               130
2  3/14/19         1            13               143
3  3/15/19         1             8               151
4  3/16/19         1            13               164

The simplest way to deal with datetime values is to convert them into POSIX timestamps.
X_train = data_train.created.astype("int64").values.reshape(-1, 1) // 10**9
and
X_all = data.created.astype("int64").values.reshape(-1, 1) // 10**9
However, a model trained this way will learn almost nothing useful for predicting future data, because the POSIX time values in the test set lie entirely outside the range of the POSIX time values in the training set.
My suggestion is to modify X_train and X_all to extract several informative features from the date (as categorical features, using one-hot encoding):
day of the week
day of the month
month of the year
year
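A minimal sketch of that suggestion, assuming the created column has already been parsed with pd.to_datetime. The helper name date_features, the toy rows, and the split date are illustrative and not part of the original code. Encoding the whole column before splitting keeps the train and test matrices on the same set of columns.

```python
import pandas as pd

# Toy rows shaped like the question's data (values are illustrative).
data = pd.DataFrame({
    "created": ["3/12/19", "3/13/19", "3/14/19", "3/15/19", "3/16/19"],
    "tickets_sold": [90, 40, 13, 8, 13],
})
data["created"] = pd.to_datetime(data["created"], format="%m/%d/%y")


def date_features(dates: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: expand a datetime column into one-hot calendar features."""
    parts = pd.DataFrame({
        "dow": dates.dt.dayofweek,   # day of the week (Monday = 0)
        "dom": dates.dt.day,         # day of the month
        "month": dates.dt.month,     # month of the year
        "year": dates.dt.year,
    })
    return pd.get_dummies(parts, columns=["dow", "dom", "month", "year"])


# Encode once on all rows, then split by date.
X_all = date_features(data["created"])
split_date = pd.Timestamp(2019, 3, 15)  # illustrative split point
train_mask = data["created"] < split_date
X_train, X_test = X_all[train_mask], X_all[~train_mask]
```

Each row ends up with exactly one active column per calendar feature, so a linear model can learn a separate weight for, say, each weekday.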

Related

How to normalize test and train data when they have different numbers of rows

I have a dataframe d, and one of its columns is price (numerical), with 109248 rows. I divided the data into two parts, d_train and d_test. d_train has 73196 values and d_test has 36052 values. To normalize d_train['price'] and d_test['price'], I did something like this:
from sklearn.preprocessing import Normalizer
price_scalar = Normalizer()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(1, -1))
X_test_price = price_scalar.transform(d_test['price'].values.reshape(1, -1))
Now I get this error:
ValueError Traceback (most recent call last)
<ipython-input-20-ba623ca7bafa> in <module>()
3 X_train_price = price_scalar.fit_transform(X_train['price'].values.reshape(1, -1))
----> 4 X_test_price = price_scalar.transform(X_test['price'].values.reshape(1, -1))
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
394 if n_features != self.n_features_in_:
395 raise ValueError(
397 f"is expecting {self.n_features_in_} features as input."
398 )
ValueError: X has 36052 features, but Normalizer is expecting 73196 features as input.
Changing reshape(1, -1) to reshape(-1, 1) runs without error, but turns every price value into 1.
reshape(-1, 1) is correct. Getting 1 everywhere is the expected result when you use Normalizer from sklearn:
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.
scikit-learn always assumes that the data is organized with shape (n_points, n_features) (i.e., each row is a data point). Also, from the documentation, Normalizer normalizes "samples individually to unit norm". This means that each data point (i.e., row) is normalized, rather than along the column (i.e., all price values).
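A small sketch of the row-wise behaviour described above, using three made-up prices. With a column layout each row holds a single value, so every row is scaled to norm 1; with a row layout the whole price vector is treated as one sample.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

prices = np.array([10.0, 20.0, 40.0])  # illustrative values

# 3 samples with 1 feature each: every row is divided by its own norm.
as_column = Normalizer().fit_transform(prices.reshape(-1, 1))

# 1 sample with 3 features: the whole vector is scaled to unit L2 norm.
as_row = Normalizer().fit_transform(prices.reshape(1, -1))

print(as_column.ravel())
print(as_row.ravel(), np.linalg.norm(as_row))
```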
To normalize the values to the [0, 1] range, you should use the MinMaxScaler with the data reshaped into a column. That is,
from sklearn.preprocessing import MinMaxScaler
price_scalar = MinMaxScaler()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(-1, 1))
X_test_price = price_scalar.transform(d_test['price'].values.reshape(-1, 1))
It is worth noting that this does not guarantee that the price values in the test set all fall within the [0, 1] range. That is how it should be when training an ML model, but keep it in mind.
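To illustrate that last point, here is a short sketch with made-up prices. The scaler learns the minimum and maximum from the training column only, so test values below or above that range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_price = np.array([10.0, 20.0, 40.0]).reshape(-1, 1)  # illustrative
test_price = np.array([5.0, 25.0, 50.0]).reshape(-1, 1)    # illustrative

scaler = MinMaxScaler().fit(train_price)   # learns min=10, max=40
scaled_train = scaler.transform(train_price)
scaled_test = scaler.transform(test_price)  # (x - 10) / 30

print(scaled_train.ravel())
print(scaled_test.ravel())
```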
Here, you can call fit_transform() directly instead of calling fit() and transform() separately.
price_scalar = Normalizer()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(1, -1))
X_test_price = price_scalar.fit_transform(d_test['price'].values.reshape(1, -1))

LSTM Keras input and output dimensions

I have 30 time steps with 26 features, so I would imagine my input into the first layer would be of dimension #_samples x 30 x 26.
One problem I have is that my # of samples varies by the time step. Should I pad to make them uniform?
Another is that I am trying to create the time-indexed 3D array by separating out the dataset into their respective time steps and combining them into a 3D array, but all the different methods I've tried have failed so far.
Here's my latest attempt:
def lstm_data_processing(X_data, Y_data):
    num_time_steps = X_data['month_id'].nunique()
    month_ids = X_data['month_id'].unique()
    X_processed = []
    X_processed.reshape(X_data.shape[0], X_data.shape[1], num_time_steps)
    for i in range(num_time_steps):
        month_df = X_data.loc[X_data['month_id'] == month_ids[i]].copy()
        month_df.drop('month_id', axis=1, inplace=True)
        print(month_df.shape)
        np.stack(X_processed, month_df)
    print(X_processed.shape)
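One possible way to build the 3D array, sketched under a strong assumption: the k-th row of every month refers to the same sample, so months with fewer rows can be zero-padded at the end. The function name to_lstm_array and the toy columns f1 and f2 are made up for illustration. Zero padding is only one option, and whether it is appropriate depends on how the samples line up across months.

```python
import numpy as np
import pandas as pd


def to_lstm_array(X_data: pd.DataFrame, time_col: str = "month_id") -> np.ndarray:
    """Stack per-month blocks into (n_samples, n_time_steps, n_features), zero-padded.

    Assumption: row k of each month belongs to sample k.
    """
    month_ids = np.sort(X_data[time_col].unique())
    blocks = [
        X_data.loc[X_data[time_col] == m].drop(columns=time_col).to_numpy(dtype=float)
        for m in month_ids
    ]
    max_rows = max(len(block) for block in blocks)
    n_features = blocks[0].shape[1]
    out = np.zeros((max_rows, len(blocks), n_features))
    for t, block in enumerate(blocks):
        out[: len(block), t, :] = block  # rows beyond len(block) stay zero
    return out


# Toy data: month 1 has three rows, month 2 has two.
toy = pd.DataFrame({
    "month_id": [1, 1, 1, 2, 2],
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [10.0, 20.0, 30.0, 40.0, 50.0],
})
arr = to_lstm_array(toy)
print(arr.shape)
```

With the real data (30 months, 26 features) the same function would return an array of shape (max rows per month, 30, 26).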

Future Warning: Passing datetime64-dtype data to TimedeltaIndex is deprecated

I have a dataset of measured values and their corresponding timestamps in the format hh:mm:ss, where hh can be > 24 h.
For machine learning tasks, the data need to be interpolated, since the measured variables are each sampled at different timestamps.
For resampling and interpolation, I figured out that the index needs a datetime dtype.
For further data-processing and machine learning tasks, I would need the timedelta format again.
Here is some code:
Res_cont = Res_cont.set_index('t_a')  # t_a is the column of timestamps for the measured variable a
# Convert the index to datetime for resampling and interpolation; otherwise the resampled times are not aligned, e.g. 00:15:16 instead of 00:15:00
Res_cont.index = pd.to_datetime(Res_cont.index)
# First upsample to seconds, then interpolate linearly, and finally downsample to 15-minute steps
Res_cont = Res_cont.resample('s').interpolate(method='linear').resample('15T').asfreq().dropna()
Res_cont.index = pd.to_timedelta(Res_cont.index)  # This is where the problem occurs
Unfortunately, I get the following warning:
FutureWarning: Passing datetime64-dtype data to TimedeltaIndex is deprecated, will raise a TypeError in a future version
  Res_cont = pd.to_timedelta(Res_cont.index)
So there is clearly a problem with the last line of the code above. I would like to know how to change it to prevent a TypeError in a future version. Unfortunately, I have no idea how to fix it.
Maybe you can help?
EDIT: Here is some arbitrary sample data:
t_a = ['00:00:26', '00:16:16', '00:25:31', '00:36:14', '25:45:44']
a = [0, 1.3, 2.4, 3.8, 4.9]
Res_cont = pd.Series(data = a, index = t_a)
You can use DatetimeIndex.strftime to convert the resulting datetimes to HH:MM:SS format:
import pandas as pd
t_a = ['00:00:26', '00:16:16', '00:25:31', '00:36:14', '00:45:44']
a = [0, 1, 2, 3, 4]
Res_cont = pd.DataFrame({'t_a':t_a,'a':a})
print(Res_cont)
        t_a  a
0  00:00:26  0
1  00:16:16  1
2  00:25:31  2
3  00:36:14  3
4  00:45:44  4
Res_cont = Res_cont.set_index('t_a')
Res_cont.index = pd.to_datetime(Res_cont.index)
Res_cont=Res_cont.resample('s').interpolate(method='linear').resample('15T').asfreq().dropna()
Res_cont.index = pd.to_timedelta(Res_cont.index.strftime('%H:%M:%S'))
print(Res_cont)
                 a
00:15:00  0.920000
00:30:00  2.418351
00:45:00  3.922807
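The strftime('%H:%M:%S') round trip drops the date part, so it only suits durations under 24 hours, and as far as I know pd.to_datetime will not parse an hour field such as 25 in the first place. A possible alternative for the question's sample data (where hh can exceed 24) is sketched below: parse the strings as timedeltas, add an arbitrary reference timestamp to get a DatetimeIndex for resampling, and subtract the same reference afterwards. The name ref and the date chosen for it are assumptions made for illustration.

```python
import pandas as pd

t_a = ['00:00:26', '00:16:16', '00:25:31', '00:36:14', '25:45:44']
a = [0, 1.3, 2.4, 3.8, 4.9]

ref = pd.Timestamp("2000-01-01")  # arbitrary reference date (an assumption)

# Timedeltas + reference timestamp -> DatetimeIndex suitable for resampling.
Res_cont = pd.Series(a, index=pd.to_timedelta(t_a) + ref)

Res_cont = (
    Res_cont.resample("s").interpolate(method="linear")  # upsample to seconds
    .resample("15min").asfreq()                          # keep 15-minute marks
    .dropna()
)

# Subtract the reference again -> TimedeltaIndex, no datetime-to-timedelta cast.
Res_cont.index = Res_cont.index - ref
print(Res_cont.head())
```

Because the index is turned back into timedeltas by plain subtraction, pd.to_timedelta is never called on datetime64 data.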

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some columns of binary (Y). BernoulliNB predicts the probability of X given Y in the test data based on the training data. I am trying to look up the probability of the observed class of each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
'Y1': [1,1,0,0],
'Y4': [1,0,0,0]})
# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
'Y1': [1,1,0,1,0,1,0,0,0],
'Y2': [1,0,1,0,1,0,1,0,1],
'Y3': [1,1,0,1,1,0,0,0,0],
'Y4': [1,1,0,1,1,0,0,0,0]})
# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns)-set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0
# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST ['X']
df_Tr_Y = TRAIN .drop('X', axis=1)
df_Te_Y = TEST .drop('X', axis=1)
# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)
# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)
# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R .join(TEST)
# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        return df.get(j, {}).get(i, np.nan)
    return lu
df_S['Pr'] = [*map(get_lu(df_R), df_S .T, df_S .X)]
This seemed to work, giving me the result (df_S):
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.
Ok, there's a couple issues here. I have a full working example below, but first those issues. Mainly the assertion that "This correctly gives a "NaN" for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue which is the line df_Te_Y = TEST .iloc[ : , 1 : ], that line should be df_Te_Y = TEST .iloc[ : , 2 : ], so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd
BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
'X1': [1,1,0,1,0,1,0,0,0],
'X2': [1,0,1,0,1,0,1,0,1],
'X3': [1,1,0,1,1,0,0,0,0],
'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1) # Known training data - all but 'Y' column.
Y = train_df['Y'] # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1) # Test data.
Y_te = test_df['Y'] # Only used to measure accuracy of prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te) # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_ # Rename as per class labels.
# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
print(df_R)
predicted_labels = df_R.idxmax(axis=1).values # For each row, take the column with the highest prob in that row.
print(predicted_labels) # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels)) # Percent accuracy of prediction.
print(BNB.fit(X, Y).predict(X_te)) # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_label is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.
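If the goal is still the original lookup (the predicted probability of each row's observed label), one way to make the NaN handling explicit is sketched below. It reuses the toy data from this answer. The names proba, col_pos and the Pr column are illustrative, and labels that never appear in training are deliberately mapped to NaN after prediction, since the classifier itself cannot produce them.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

train_df = pd.DataFrame({'Y': [1, 2, 3, 9],
                         'X1': [1, 1, 0, 0], 'X2': [0, 0, 0, 0],
                         'X3': [0, 0, 0, 0], 'X4': [1, 0, 0, 0]})
test_df = pd.DataFrame({'Y': [5, 0, 1, 1, 1, 2, 2, 2, 2],
                        'X1': [1, 1, 0, 1, 0, 1, 0, 0, 0],
                        'X2': [1, 0, 1, 0, 1, 0, 1, 0, 1],
                        'X3': [1, 1, 0, 1, 1, 0, 0, 0, 0],
                        'X4': [1, 1, 0, 1, 1, 0, 0, 0, 0]})
features = ['X1', 'X2', 'X3', 'X4']

model = BernoulliNB().fit(train_df[features], train_df['Y'])
proba = pd.DataFrame(model.predict_proba(test_df[features]),
                     columns=model.classes_, index=test_df.index)

# Column position of each observed label among the trained classes (-1 if unseen).
col_pos = pd.Index(model.classes_).get_indexer(test_df['Y'])
row_pos = np.arange(len(test_df))

# Pick proba[row, observed label]; rows with unseen labels become NaN.
picked = proba.to_numpy()[row_pos, col_pos]
test_df['Pr'] = np.where(col_pos >= 0, picked, np.nan)
print(test_df[['Y', 'Pr']])
```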

Scikit-learn prediction intervals for future values?

The numbers predicted by my code below are very specific and I do not get any exact matches, but some are pretty close. For example, on a certain date there were actually 388 events and this might predict 397.
Can I output a range of like 370 - 410? Or see the percentage chance that the value will be between a range? Or should I bin the values and check for accuracy that way?
Code:
def make_prediction(label, prediction):
    X = df[[col1, col2, col3]].values
    y = df[label].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
    X_train.shape, X_test.shape
    clf = linear_model.LinearRegression()
    clf.fit(X_train, y_train)
    output = clf.predict(X)
    result = np.c_[X, output]
    df_result = pd.DataFrame(result, columns=[col1, col2, col3, prediction])
    return df_result
So the code above produces a value for each row (each row is a date in this case, but I number them from 1 onward starting from the first value in the data set). How do I predict future values? When I run the code above I only get predicted values for the existing data. How can I use the model on other data sets or on future dates?
Assuming that you require binning on top of predicted values, you can use pandas cut() as follows:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([270,201,375,370,410,510], columns=['prediction'])
In [3]: bins = [0,370,420,600]
In [4]: group_labels = ['(0-370]', '(371-420]', '(421-600]']
In [5]: df['prediction_range'] = pd.cut(df.prediction, bins, labels=group_labels)
In [6]: df
Out[6]:
   prediction prediction_range
0         270          (0-370]
1         201          (0-370]
2         375        (371-420]
3         370          (0-370]
4         410        (371-420]
5         510        (421-600]
Reference: Binning Data In Pandas
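For the other two parts of the question (predicting on new inputs and reporting a range rather than a single number), here is a rough sketch. It uses synthetic data in place of df[[col1, col2, col3]], and the range is a simple empirical band taken from held-out residual percentiles, not a formal statistical prediction interval. The names X_new, lo_off and hi_off are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's features and label (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 10, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Empirical 5th and 95th percentiles of the held-out residuals.
residuals = y_test - model.predict(X_test)
lo_off, hi_off = np.quantile(residuals, [0.05, 0.95])

# Predicting for unseen rows only needs the same feature columns.
X_new = np.array([[10.0, 20.0, 30.0], [50.0, 5.0, 80.0]])  # hypothetical future rows
point = model.predict(X_new)
lower, upper = point + lo_off, point + hi_off

for p, l, u in zip(point, lower, upper):
    print(f"prediction {p:.1f}, rough 90% band [{l:.1f}, {u:.1f}]")
```

The band assumes future errors behave like the held-out errors, which may not hold when extrapolating to dates far beyond the training range.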