LSTM Keras input and output dimensions

I have 30 time steps with 26 features, so I would imagine my input into the first layer would be of dimension #_samples x 30 x 26.
One problem is that my number of samples varies by time step. Should I pad to make them uniform?
Another is that I'm trying to build the time-indexed 3D array by splitting the dataset into its respective time steps and combining them, but every method I've tried so far has failed.
Here's my latest attempt:
import numpy as np

def lstm_data_processing(X_data, Y_data):
    month_ids = X_data['month_id'].unique()
    num_time_steps = len(month_ids)
    X_processed = []  # collects one 2D array (samples x features) per time step
    for i in range(num_time_steps):
        month_df = X_data.loc[X_data['month_id'] == month_ids[i]].copy()
        month_df.drop('month_id', axis=1, inplace=True)
        print(month_df.shape)
        X_processed.append(month_df.to_numpy())
    # np.stack requires every month to contribute the same number of rows
    X_processed = np.stack(X_processed, axis=1)
    print(X_processed.shape)  # (num_samples, num_time_steps, num_features)
    return X_processed
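If padding turns out to be the right call, here is a minimal sketch of what I would try (assuming zero rows are acceptable filler; pad_and_stack is just an illustrative name):

import numpy as np

def pad_and_stack(frames):
    # frames: list of 2D arrays, each (num_samples_i, num_features)
    max_rows = max(f.shape[0] for f in frames)
    # pad each month's array with zero rows up to the largest month
    padded = [np.pad(f, ((0, max_rows - f.shape[0]), (0, 0))) for f in frames]
    return np.stack(padded, axis=1)  # (max_rows, num_time_steps, num_features)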

How to select almost equally distributed classes in training, validation, test samples?

I am working on the MNIST Sign Language dataset to classify images using Keras. There are 24 different classes in the dataset, but the problem is that the distribution of classes is very uneven.
I used sklearn.model_selection.train_test_split with stratify=df['label'], but some classes still end up with 5% of the whole data while others have 3%. How can I draw a sample in which each class accounts for roughly 4%?
My test_df has 7172 rows and 785 columns, one of which is the label column; the remaining 784 are grayscale pixel values (28*28).
import pandas as pd
from sklearn.model_selection import train_test_split

test_df = pd.read_csv(TEST_PATH)
# shuffle the whole data, keep the first 2000 rows
test_df = test_df.sample(frac=1.0, random_state=SEED).iloc[:2000, :]
# stratify on the labels so that the class distributions are almost the same
val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=SEED, stratify=test_df['label'])
# extract pixels and labels for both validation and test data
X_val = val_df.drop('label', axis=1).values.reshape((val_df.shape[0], 28, 28)) / 255.0   # validation images
y_val = val_df['label'].values    # validation labels
X_test = test_df.drop('label', axis=1).values.reshape((test_df.shape[0], 28, 28)) / 255.0  # test images
y_test = test_df['label'].values  # test labels
The snippet below gives you a uniform class distribution across val and test; you can also play with the number of samples.
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42
n_classes = 24
test_df = pd.read_csv(TEST_PATH)
# draw the same number of rows from every class
test_df = [test_df.loc[test_df.label == i].sample(n=int(2000 / n_classes), random_state=SEED)
           for i in test_df.label.unique()]
test_df = pd.concat(test_df, axis=0, ignore_index=True)
val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=SEED, stratify=test_df['label'])
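As a quick sanity check (purely illustrative), every class should now account for roughly 1/24 ≈ 4.2% of each split:

print(val_df['label'].value_counts(normalize=True).round(3))
print(test_df['label'].value_counts(normalize=True).round(3))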

Cannot cast array data from dtype('<M8[ns]') to dtype('float64')

I am trying to predict ticket sales and receive the following error:
TypeError: Cannot cast array data from dtype('<M8[ns]') to dtype('float64') according to the rule 'safe'
I attached my code here. The error seems to occur when running pred_lr = linear_reg.predict(X_all). I assume I have to change the type somewhere, but I couldn't figure out what I'm actually doing wrong.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load data
event_data = pd.read_csv('event_data.csv')
data = pd.DataFrame(event_data)
split_date = pd.Timestamp(2019, 3, 31)  # train/test cutoff
data['created'] = pd.to_datetime(data['created'])
data_train = data[data.created < split_date]
data_test = data[data.created >= split_date]
# predict prices based on date
X_train = data_train.created[:, np.newaxis]
y_train = data_train.tickets_sold
linear_reg = LinearRegression().fit(X_train, y_train)
# predict on all data
X_all = event_data.created[:, np.newaxis]
pred_lr = linear_reg.predict(X_all)
Here is the head of my data:
created event_id tickets_sold tickets_sold_sum
0 3/12/19 1 90 90
1 3/13/19 1 40 130
2 3/14/19 1 13 143
3 3/15/19 1 8 151
4 3/16/19 1 13 164
The simplest way to deal with datetime values is to convert them into POSIX timestamps.
X_train = data_train.created.astype("int64").values.reshape(-1, 1) // 10**9
and
X_all = event_data.created.astype("int64").values.reshape(-1, 1) // 10**9
However, this way you will learn almost nothing useful for predicting future data, since the POSIX time values in the test set lie well outside the range of POSIX time values in the training set.
My suggestion is to modify X_train and X_all so as to derive multiple informative features from the date (as categorical features using a one-hot encoding; a sketch follows the list):
day of the week
day of the month
month of the year
year
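A minimal sketch of that idea (the helper name date_features is mine, purely illustrative):

import pandas as pd

def date_features(s: pd.Series) -> pd.DataFrame:
    feats = pd.DataFrame({
        'day_of_week': s.dt.dayofweek,
        'day_of_month': s.dt.day,
        'month': s.dt.month,
        'year': s.dt.year,
    })
    # one-hot encode the categorical columns as suggested above
    return pd.get_dummies(feats, columns=['day_of_week', 'day_of_month', 'month'])

X_train = date_features(data_train['created'])
X_all = date_features(data['created'])

Just make sure the train and prediction frames end up with the same dummy columns (for example, by encoding them together and then splitting) before fitting.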

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some columns of binary (Y). BernoulliNB predicts the probability of X given Y in the test data based on the training data. I am trying to look up the probability of the observed class of each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

BNB = BernoulliNB()
# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})
# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})
# Add the information that TRAIN has none of the missing columns
diff_cols = set(TEST.columns) - set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0
TRAIN = TRAIN[TEST.columns]  # align column order with TEST
# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST['X']
df_Tr_Y = TRAIN.drop('X', axis=1)
df_Te_Y = TEST.drop('X', axis=1)
# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)
# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)
# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R.join(TEST)
# Look up the predicted probability of the observed X,
# skipping X values that are not in the training data
def get_lu(df):
    def lu(i, j):
        return df.get(j, {}).get(i, np.nan)
    return lu

df_S['Pr'] = [*map(get_lu(df_R), df_S.T, df_S.X)]
This seemed to work, giving me the result (df_S):
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.
OK, there are a couple of issues here. I have a full working example below, but first those issues - mainly the assertion that "this correctly gives a NaN for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue, which is the line df_Te_Y = TEST.iloc[:, 1:]; that line should be df_Te_Y = TEST.iloc[:, 2:], so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd

BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
                        'X1': [1,1,0,1,0,1,0,0,0],
                        'X2': [1,0,1,0,1,0,1,0,1],
                        'X3': [1,1,0,1,1,0,0,0,0],
                        'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1)    # Known training data - all but the 'Y' column.
Y = train_df['Y']                 # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1)  # Test data.
Y_te = test_df['Y']               # Only used to measure accuracy of the prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te)  # Can be combined into a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_  # Rename as per class labels.
# Columns are class labels and rows are observations.
# Each entry is the probability of that observation being assigned to that class label.
print(df_R)
predicted_labels = df_R.idxmax(axis=1).values  # For each row, take the column with the highest prob in that row.
print(predicted_labels)  # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels))  # Percent accuracy of the prediction.
print(BNB.fit(X, Y).predict(X_te))  # [1 1 3 1 3 2 3 3 3] - one line if predicted labels are all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So the probabilities have changed.
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.

Numpy: stack arrays whose internal dimensions differ

I have a situation similar to the following:
import numpy as np
a = np.random.rand(55, 1, 3)
b = np.random.rand(55, 626, 3)
Here the shapes represent the number of observations, then the number of time slices per observation, then the number of dimensions per observation at the given time slice. So b holds a full 626-slice history of 3 dimensions for each of the 55 observations, and a holds each observation's 3 dimensions at one new time slice.
I'd like to stack a and b into an array with shape 55, 627, 3. How can one accomplish this in numpy? Any suggestions would be greatly appreciated!
To follow up on Divakar's answer above, the axis argument in numpy is the index of a given dimension within an array's shape. Here I want to stack a and b by virtue of their middle shape value, which is at index = 1:
import numpy as np
a = np.random.rand(5, 1, 3)
b = np.random.rand(5, 100, 3)
# create the desired result shape: (5, 101, 3)
stacked = np.concatenate((b, a), axis=1)
# validate that a was appended to the end of b
print(stacked[:, -1, :], '\n\n\n', a.squeeze())
This returns:
[[0.72598529 0.99395887 0.21811998]
[0.9833895 0.465955 0.29518207]
[0.38914048 0.61633291 0.0132326 ]
[0.05986115 0.81354865 0.43589306]
[0.17706517 0.94801426 0.4567973 ]]
[[0.72598529 0.99395887 0.21811998]
[0.9833895 0.465955 0.29518207]
[0.38914048 0.61633291 0.0132326 ]
[0.05986115 0.81354865 0.43589306]
[0.17706517 0.94801426 0.4567973 ]]
A purist might instead use np.all(stacked[:, -1, :] == a.squeeze()) to validate this equivalence. All glory to @Divakar!
Strictly for the curious: the use case for this concatenation is a kind of wonky data-preparation pipeline for a Long Short-Term Memory neural network. In that kind of network, the training data shape should be (number_of_observations, number_of_time_intervals, number_of_dimensions_per_observation). I am generating new predictions for each object at a new time interval, so those predictions have shape (number_of_observations, 1, number_of_dimensions_per_observation). To visualize the sequence of observations' positions over time, I want to append the new positions to the array of previous positions, hence the question above.

How does a 1D multi-channel convolutional layer (Keras) train?

I am working with time series EEG data recorded from 10 individual locations on the body to classify future behavior in terms of increasing heart activity. I would like to better understand how my labeled data corresponds to the training inputs.
So far, several RNN configurations as well as countless combinations of vanilla dense networks have not gotten me great results, so I figured a 1D convnet is worth a try.
The things I'm having trouble understanding are:
1) Feeding data into the model.
orig shape = (30000 timesteps, 10 channels)
array fed to layer = (300 slices, 100 timesteps, 10 channels)
Are the slices separated by 1 time step, giving me overlapping slices of 100 timesteps across the original array, or are they laid end to end? If the second is true, how could I create an array of (30000 - 100) slices separated by one ts that is also compatible with the 1D CNN layer?
2) Matching labels with the training and testing data
My understanding is that when you feed in a sequence of train_x_shape = (30000, 10), there are 30000 labels with train_y_shape = (30000, 2) (2 classes) associated with the train_x data.
So, when (300 slices of) 100 timesteps of train_x data with shape = (300, 100, 10) are fed into the model, does the label correspond to the entire 100 ts (one label per 100 ts, equal to the last time step's label), or is each of the 100 rows/vectors in the slice labeled, one for each ts?
Train input:
train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
n_timesteps = 100
n_channels = 10
layer : model.add(Convolution1D(filters = n_channels * 2, padding = 'same', kernel_size = 3, input_shape = (n_timesteps, n_channels)))
final layer : model.add(Dense(2, activation = 'softmax'))
I use categorical_crossentropy for loss.
Answer 1
This will really depend on how you got those slices.
The answer is totally dependent on what you're doing. So, what do you want?
If you have simply reshaped (array.reshape(...)) the original array from shape (30000,10) to shape (300,100,10), the model will see:
300 individual (and not connected) sequences
100 timesteps in each sequence
Sequence 1 goes from step 0 to 99;
Sequence 2 goes from step 100 to 199, and so on.
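For instance (a quick illustrative check, not part of the original answer):

import numpy as np

arr = np.arange(30000 * 10).reshape(30000, 10)  # stand-in for the real data
seqs = arr.reshape(300, 100, 10)                # plain reshape: no overlap
assert (seqs[1, 0] == arr[100]).all()           # sequence 2 starts at step 100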
Creating overlapping slices - Sliding window
If you want to create sequences shifted by only one timestep, make a loop for that.
import numpy as np

originalSequence = someArrayWithShape((30000, 10))  # placeholder for your data
windowLength = 100  # one window per slice of 100 timesteps
newSlices = []  # empty list
start = 0
end = start + windowLength
while end <= 30000:
    newSlices.append(originalSequence[start:end])
    start += 1
    end += 1
newSlices = np.asarray(newSlices)  # shape: (30000 - windowLength + 1, windowLength, 10)
Beware: if you do this in the input data, you will have to do a similar thing in your output data as well.
Answer 2
Again, that's totally up to you. What do you want to achieve?
Convolutional layers will keep the timesteps with these options:
If you use padding='same', the final length will be the same as the input
If you don't, the final length will be reduced depending on the kernel size you choose
Recurrent layers will keep the timesteps or not depending on:
Whether you use return_sequences=True - Output has timesteps
Or you use return_sequences=False - Output has no timesteps
If you want only one output for each sequence (not per timestep):
Recurrent models:
Use LSTM(...., return_sequences=True) until the last LSTM
The last LSTM will be LSTM(..., return_sequences=False)
Convolutional models:
At some point after the convolutions, choose one of these to add:
GlobalMaxPooling1D
GlobalAveragePooling1D
Flatten (but treat the number of channels later with a Dense(2))
Reshape((2,))
I think I'd go with GlobalMaxPooling1D if using convolutions, but recurrent models seem better for this. (Not a rule, though.)
You can choose to use intermediate MaxPooling1D layers to gradually reduce the length from 100 to 50, then to 25 and so on. This will probably reach a better output.
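A minimal sketch of the convolutional option (filter counts, kernel sizes, and the tensorflow.keras import path are assumptions of mine, not part of the original answer):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Conv1D(20, kernel_size=3, padding='same', activation='relu',
           input_shape=(100, 10)),   # (timesteps, channels)
    MaxPooling1D(pool_size=2),       # gradually reduce the length: 100 -> 50
    Conv1D(40, kernel_size=3, padding='same', activation='relu'),
    GlobalMaxPooling1D(),            # collapse the time axis
    Dense(2, activation='softmax'),  # one label per sequence
])
model.compile(loss='categorical_crossentropy', optimizer='adam')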
Remember to keep X and Y paired:
import numpy as np

train_x = someArrayWithShape((30000, 10))  # placeholder for your data
train_y = someArrayWithShape((30000, 2))   # placeholder for your labels
windowLength = 100
newXSlices = []  # empty list
newYSlices = []  # empty list
start = 0
end = start + windowLength
while end <= 30000:
    newXSlices.append(train_x[start:end])
    newYSlices.append(train_y[end-1:end])  # label each window by its last timestep
    start += 1
    end += 1
newXSlices = np.asarray(newXSlices)
newYSlices = np.asarray(newYSlices)