Low score in Linear Regression with discrete attributes - pandas

I'm trying to do a linear regression on my dataframe. The dataframe is about Apple applications, and I want to predict the rating ('nota') of each application. The ratings take the following values:
1.0
1.5
2.0
2.5
...
5.0
My code is:
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

atributos = ['size_bytes', 'price', 'rating_count_tot', 'cont_rating',
             'sup_devices_num', 'num_screenshots', 'num_lang', 'vpp_lic']
atrib_prev = ['nota']

X = np.array(data_regress.drop(columns=['nota']))  # features
y = np.array(data_regress['nota'])                 # target rating
X = preprocessing.scale(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LinearRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
But my accuracy is 0.046295306696438665. I think this happens because the linear model predicts real values, while my 'nota' is real-valued but only takes values at 0.5-step intervals. I don't know how to round these values before calling clf.score.

First, for regression models, clf.score() computes the R-squared value, not accuracy. So you need to decide whether you want to treat this as a classification problem (with a fixed number of target labels) or as a regression problem (with a real-valued target).
Second, if you want to stick with a regression model rather than classification, you can call clf.predict() to get the predicted values, round them off however you want, and then call r2_score() on the actual and rounded predicted labels. Something like:
from sklearn.metrics import r2_score

# Get the actual predictions
y_pred = clf.predict(X_test)
# Round off as you prefer; for example, to the nearest 0.5 step used by the ratings
y_pred_rounded = np.round(y_pred * 2) / 2
# Call the appropriate scorer
score = r2_score(y_test, y_pred_rounded)
You can look at the sklearn documentation for the metrics available in sklearn.
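If you decide to treat this as classification instead, here is a minimal sketch of that route (assuming the same data_regress DataFrame and 'nota' column from the question; the random forest is just one possible choice of classifier):
# Sketch only: each 0.5-step rating becomes a discrete class label,
# so the score below is a real accuracy rather than R-squared.
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X = preprocessing.scale(np.array(data_regress.drop(columns=['nota'])))
y = np.array(data_regress['nota'])  # nine discrete labels: 1.0, 1.5, ..., 5.0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))  # fraction of exactly correct ratings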

Related

Why can't I classify my data perfectly on this simple problem using a NN?

I have a set of observations made of 10 features, each of these features being a real number in the interval (0, 2). Say I wanted to train a simple neural network to classify whether the average of those features is above or below 1.0.
Unless I'm missing something, a two-layer network with one neuron in each layer should be enough. The activation functions would be a linear one (i.e. no activation function) on the first layer and a sigmoid on the output layer. An example of an NN with this architecture that would work is one that calculates the average on the first layer (i.e. all weights = 0.1 and bias = 0) and assesses whether that is above or below 1.0 in the second layer (i.e. weight = 1.0 and bias = -1.0).
When I implement this using TensorFlow (see code below), I obviously get very high accuracy quite quickly, but never reach 100% accuracy. I would like some help understanding conceptually why this is the case. I don't see why backpropagation does not reach a set of optimal weights (maybe this is related to the loss function I'm using, which has local minima?). I would also like to know whether 100% accuracy is achievable if I use different activations and/or a different loss function.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

X = [np.random.random(10) * 2.0 for _ in range(10000)]
X = np.array(X)
y = X.mean(axis=1) >= 1.0
y = y.astype('int')

train_ratio = 0.8
train_len = int(X.shape[0] * train_ratio)
X_train, X_test = X[:train_len, :], X[train_len:, :]
y_train, y_test = y[:train_len], y[train_len:]

def create_classifier(lr=0.001):
    classifier = tf.keras.Sequential()
    classifier.add(tf.keras.layers.Dense(units=1))
    classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    metrics = [tf.keras.metrics.BinaryAccuracy()]
    classifier.compile(optimizer=optimizer,
                       loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                       metrics=metrics)
    return classifier

classifier = create_classifier(lr=0.1)
history = classifier.fit(X_train, y_train, batch_size=1000, validation_split=0.1, epochs=2000)
Ignoring the fact that a neural network is an odd approach for this problem, and answering your specific question: it looks like your learning rate might be too high, which could explain the fluctuations around the optimal point.
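To back up the conceptual point in the question (this sketch is not part of the answer above): hard-coding the weights the question describes really does give essentially 100% accuracy, so the gap comes from the optimization with the high learning rate, not from the architecture. Reusing X_test and y_test from the question's code:
# Sketch: the hand-built two-neuron network from the question, with
# its weights set manually instead of trained.
manual = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(10,)),      # linear layer
    tf.keras.layers.Dense(1, activation='sigmoid'),   # threshold layer
])
manual.set_weights([
    np.full((10, 1), 0.1), np.zeros(1),   # layer 1: mean of the 10 features
    np.ones((1, 1)), np.array([-1.0]),    # layer 2: sigmoid(mean - 1.0)
])
manual.compile(loss='binary_crossentropy', metrics=['accuracy'])
print(manual.evaluate(X_test, y_test))    # accuracy should be (essentially) 1.0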

Electricity categorization

I'm trying to categorize which electronic devices are turned ON based only on the total electricity consumption of my apartment. I have a setup where I measure every watt-hour (each blink of an LED), so the current consumption in watts has a time resolution of about 10 seconds, which is great.
I am trying to do this in TensorFlow, and in the first iteration I want to use only one input (the total watts, e.g. 200 W) and one output per electronic device. I am also using dummy data for now to see how it works (and because it would be very troublesome to label every measurement in order to train the algorithm).
Here is my code now:
import tensorflow as tf
import numpy as np

LABELS = [
    'Nothing',
    'Toaster',  # Toaster uses 800W
    'Lamp']     # Lamp uses just 100W

DATA_LENGTH = 20000
np.random.seed(1)  # To be able to reproduce

# Create dummy data (1:s or 0:s)
nothing_data = np.array([1] * DATA_LENGTH)
toaster_data = np.random.randint(2, size=DATA_LENGTH)
lamp_data = np.random.randint(2, size=DATA_LENGTH)

labels = np.array(list(zip(nothing_data, toaster_data, lamp_data)))
x_train = (toaster_data * 800 + lamp_data * 100) / 900  # Normalize
y_train = labels

# Split up train and test data
x_test = x_train[15000:]
y_test = y_train[15000:]
x_train = x_train[:15000]
y_train = y_train[:15000]

# The model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, input_dim=1),
    tf.keras.layers.Dense(4, activation=tf.nn.relu),
    tf.keras.layers.Dense(4, activation=tf.nn.relu),
    tf.keras.layers.Dense(3, activation=tf.nn.sigmoid)
])
model.compile(optimizer='adadelta',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=10)
val_loss, val_acc = model.evaluate(x_test, y_test)
print(val_loss, val_acc)
Now to the problem, the val_acc is 1.0, 100%. (val_loss=0.059, val_acc=1.0)
Still, when I predict, the predictions are way off.
# Predict
predict_input = [0.88888, 0.111111, 1.0000, 0.222]
predict_output = model.predict(np.array(predict_input).reshape(-1, 1))
The first one should be toaster + nothing, but it also gives about 33% for the lamp. I would have liked binary outputs, if that were possible.
Do I need to have a "nothing" output?
You need to match the model type to your problem. You've applied what is basically a mixed linear regression prediction to a problem of binary classification. The model would be fine if you wanted to predict the wattage given the appliances that are turned on, but it's not so good in the opposite direction.
It's going to try all sorts of things given the paucity of data and the freedom inherent in the model. Note that you really have only four distinct training inputs: making multiple copies of them in equal amounts doesn't make your training any better.
Most of all, why are you not doing this with a "sum to target" (subset-sum) algorithm, a much simpler and more effective way to solve the problem (sketched after this answer)? The presented problem isn't really an ML sort of problem.
If you simply want to do this by training a model, then build one with multiple binary outputs. You can search for "multi-label classification" for leads on how to do so. If you're only doing it for a handful of appliances in your home, you might want to brute-force it with 2^n output states and not worry about the structural accuracy.
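For what the "sum to target" idea could look like in practice, here is a rough sketch (the wattage table and the 10 W tolerance are assumptions for illustration, not part of the answer): enumerate subsets of the known appliance wattages and keep the one that matches the measured total.
# Sketch of the "sum to target" (subset-sum) approach mentioned above.
from itertools import combinations

WATTS = {'Toaster': 800, 'Lamp': 100}   # assumed wattage table

def devices_on(total_watts, tolerance=10):
    appliances = list(WATTS)
    for r in range(len(appliances) + 1):
        for subset in combinations(appliances, r):
            if abs(sum(WATTS[a] for a in subset) - total_watts) <= tolerance:
                return list(subset) or ['Nothing']
    return None   # no combination of appliances matches within the tolerance

print(devices_on(800))   # ['Toaster']
print(devices_on(900))   # ['Toaster', 'Lamp']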

Tensorflow: my rnn always output same value, weights of rnn are not trained

I used TensorFlow to implement a simple RNN model to learn possible trends of time-series data and predict future values. However, the model always produces the same value after training. In effect, the best model it finds is:
y = b.
The RNN structure is:
InputLayer -> BasicRNNCell -> Dense -> OutputLayer
RNN code:
def RNN(n_timesteps, n_input, n_output, n_units):
    tf.reset_default_graph()
    X = tf.placeholder(dtype=tf.float32, shape=[None, n_timesteps, n_input])
    cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_units)]
    stacked_rnn = tf.contrib.rnn.MultiRNNCell(cells)
    stacked_output, states = tf.nn.dynamic_rnn(stacked_rnn, X, dtype=tf.float32)
    stacked_output = tf.layers.dense(stacked_output, n_output)
    return X, stacked_output
In training, n_timesteps=1, n_input=1, n_output=1, n_units=2, learning_rate=0.0000001, and the loss is the mean squared error.
The input is a sequence of values from consecutive days; the output is the value for the day after the input window.
(Maybe these are not good settings, but no matter how I change them the results are almost the same, so I just fixed them here to help show the behaviour.)
And I found out this is because the weights and bias of the BasicRNNCell are not trained: they stay the same from the beginning, and only the weights and bias of the Dense layer keep changing. So during training I got predictions like these:
In the beginning:
loss: 1433683500.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
After a while:
loss: 175372340.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
The orange line indicates the true data and the blue line the results of my code; through training, the blue line keeps rising until the model reaches a stable loss.
So I suspected I had made a mistake in my implementation and generated a group of data with y = 10x + 5 for testing. This time, my model learns the correct results.
(Plots of the predictions at the beginning and at the end of training on this test data are omitted.)
I have tried:
adding more layers of both BasicRNNCell and Dense
increasing the RNN cell hidden size (n_units) to 128
decreasing learning_rate to 1e-10
increasing timesteps to 60
None of them works.
So, my questions are:
Is it because my model is too simple? I think the trend of my data is not so complicated to learn; at least something like y = ax + b should produce a smaller loss than y = b.
What may lead to these results?
How should I go on debugging?
And now I wonder whether BasicRNNCell is not fully implemented and users are expected to implement some of its functionality themselves? I have no previous experience with TensorFlow.
It seems your net is just not a good fit for that kind of data, or, put differently, your data is badly scaled. When I add the four lines below after split_data, I get some sort of learning behaviour, similar to the one in the a*x + b case:
data = read_data(work_dir, input_file)
plot_data(data)
input_data, output_data, n_batches = split_data(data, n_timesteps, n_input, n_output)
# scale input and output data
input_data = input_data-input_data[0]
input_data = input_data/np.max(input_data)*1000
output_data = output_data-output_data[0]
output_data = output_data/np.max(output_data)*1000
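A common alternative to this manual rescaling (just a sketch, assuming the same input_data and output_data arrays and that their trailing dimensions can be flattened) is sklearn's StandardScaler, which gives zero mean and unit variance and can later map predictions back to the original scale:
# Sketch: standardize inputs and outputs instead of the manual scaling above.
from sklearn.preprocessing import StandardScaler

input_scaler = StandardScaler()
output_scaler = StandardScaler()

# StandardScaler expects 2-D arrays of shape (n_samples, n_features)
input_scaled = input_scaler.fit_transform(
    input_data.reshape(len(input_data), -1)).reshape(input_data.shape)
output_scaled = output_scaler.fit_transform(
    output_data.reshape(len(output_data), -1)).reshape(output_data.shape)

# After training, predictions can be mapped back with
# output_scaler.inverse_transform(pred.reshape(len(pred), -1))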

Making predictions with a TensorFlow model

I followed the given MNIST tutorials and was able to train a model and evaluate its accuracy. However, the tutorials don't show how to make predictions with the model. I'm not interested in accuracy; I just want to use the model to predict a new example and see all the results (labels) in the output, each with its assigned score (sorted or not).
In the "Deep MNIST for Experts" example, see this line:
We can now implement our regression model. It only takes one line! We
multiply the vectorized input images x by the weight matrix W, add the
bias b, and compute the softmax probabilities that are assigned to
each class.
y = tf.nn.softmax(tf.matmul(x,W) + b)
Just pull on node y and you'll have what you want.
feed_dict = {x: [your_image]}
classification = sess.run(y, feed_dict)
print(classification)
This applies to just about any model you create - you'll have computed the prediction probabilities as one of the last steps before computing the loss.
As @dga suggested, you need to run the new instance of your data through the already-trained model.
Here is an example:
Assume you went through the first tutorial and calculated the accuracy of your model (the model is this: y = tf.nn.softmax(tf.matmul(x, W) + b)). Now you grab your model and apply a new data point to it. In the following code I compute the prediction vector, get the position of its maximum value, show the image, and print that position.
from matplotlib import pyplot as plt
from random import randint
num = randint(0, mnist.test.images.shape[0])
img = mnist.test.images[num]
classification = sess.run(tf.argmax(y, 1), feed_dict={x: [img]})
plt.imshow(img.reshape(28, 28), cmap=plt.cm.binary)
plt.show()
print('NN predicted', classification[0])
TensorFlow 2.0 compatible answer: suppose you have built a Keras model as shown below:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Then train and evaluate the model using the code below:
model.fit(train_images, train_labels, epochs=10)
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
After that, if you want to predict the class of a particular image, you can do it using the code below. Note that model.predict expects a batch of inputs, so a single image needs a leading batch dimension first (e.g. img = np.expand_dims(img, 0)):
predictions_single = model.predict(img)
If you want to predict the classes of a set of images, you can use the code below:
predictions = model.predict(new_images)
where new_images is an array of images.
For more information, refer to this TensorFlow tutorial.
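If you want class labels rather than the raw softmax probabilities, here is a short sketch (using the predictions and predictions_single arrays returned by model.predict above):
# Sketch: convert the per-class probabilities into predicted labels.
import numpy as np

predicted_labels = np.argmax(predictions, axis=1)           # one label per image
predicted_label_single = np.argmax(predictions_single[0])   # label for the single image
print(predicted_labels[:10], predicted_label_single)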
The question is specifically about the Google MNIST tutorial, which defines a predictor but doesn't apply it. Using guidance from Jonathan Hui's TensorFlow Estimator blog post, here is code that exactly fits the Google tutorial and makes predictions:
import numpy as np
from matplotlib import pyplot as plt

images = mnist.test.images[0:10]
predict_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": images},
    num_epochs=1,
    shuffle=False)

for image, p in zip(images, mnist_classifier.predict(input_fn=predict_input_fn)):
    print(np.argmax(p['probabilities']))
    plt.imshow(image.reshape(28, 28), cmap=plt.cm.binary)
    plt.show()

Scikit ROC auc raises ValueError: Only one class present in y_true. ROC AUC score is not defined in that case

Trying to create a ROC curve.
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

model = RandomForestClassifier(n_estimators=500, n_jobs=-1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
probas = model.predict_proba(X_test)[:, 1]  # probability of the positive class

precision = metrics.precision_score(y_test, y_pred)  # returns 0.72
recall = metrics.recall_score(y_test.values, y_pred)  # returns 0.35
y_test.shape  # (39257, 1)

auc = metrics.roc_auc_score(y_test, probas)  # fails.
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
Ended up answering my own question:
I had imported y_test as a pandas DataFrame instead of a Series (I had saved it using to_csv and loaded it elsewhere with from_csv).
This confused scikit-learn on the ROC curve, although it seems quite happy with a DataFrame everywhere else.
I'll leave this here in the (unlikely) case someone runs into the same thing.
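For reference, a minimal sketch of the fix implied above (assuming y_test was loaded back as a single-column DataFrame): flatten it to one dimension before calling roc_auc_score.
# Sketch: turn the single-column DataFrame back into a 1-D Series/array.
y_test_1d = y_test.squeeze()            # or: y_test.values.ravel()
auc = metrics.roc_auc_score(y_test_1d, probas)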
Sometimes we face an imbalanced dataset. When splitting, there is a chance that one of the classes ends up missing from one of the resulting sets (typically the test set), so it is better to use stratified splitting (see the sketch below).
If you run into this while training an MLP model, you can also try increasing the batch_size.
I hope this is helpful.
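A minimal sketch of a stratified split (assuming raw features X and labels y), which keeps the class proportions identical in both subsets so the test set cannot end up with only one class:
# Sketch: stratified split so every class appears in both train and test sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)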