Dropping/skipping records when loading data - tensorflow

I've discovered some erroneous data in my training set (mis-labeled examples) and while I've fixed the source, I'd like to continue experimenting with the same dataset so I need to skip these records.
I'm using a TFRecordReader and loading with parse_single_example & shuffle_batch. Can I provide a filter somewhere?

The docs briefly mention how to do this with tf.train.shuffle_batch() and enqueue_many=True. If you can determine whether an example is mislabelled using graph operations, then you can filter the result like so (adapted from another SO answer):
X, y = tf.parse_single_example(...)
is_correctly_labelled = correctly_labelled(X, y)
X = tf.expand_dims(X, 0)
y = tf.expand_dims(y, 0)
empty = tf.constant([], tf.int32)
X, y = tf.cond(is_correctly_labelled,
               lambda: [X, y],
               lambda: [tf.gather(X, empty), tf.gather(y, empty)])
Xs, ys = tf.train.shuffle_batch(
    [X, y], batch_size, capacity, min_after_dequeue,
    enqueue_many=True)
The tf.gather is just a way to get a zero-sized slice. In numpy it would just be X[[], ...].
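For intuition, here is a small numpy sketch of that zero-sized slice (the array is just an illustrative stand-in for a parsed example with a leading batch dimension):
import numpy as np

X = np.arange(12).reshape(4, 3)       # stand-in for a batched example
empty = np.array([], dtype=np.int64)  # no indices selected
print(X[empty].shape)                 # (0, 3): zero rows, trailing dims preserved
print(X[[0]].shape)                   # (1, 3): the row is kept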

Related

Constant difference while transforming forecast from its "differenced" form

I am applying a "differencing" transformation to my data, but when I try to invert the operation on my forecast, I am getting predictions like this, this, and this.
How can I fix this "constant blank" issue?
Here is how I apply the difference transform to my dataset (pretty simple):
df['diff'] = df.loc[:,'RequestResponseLogDuration'].diff(1)
And here is how I am trying to revert this operation:
def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]

for x, y in train_data_multi.take(10):
    predictions = list()
    for i in range(len(y)):
        # predict
        X, T = y[i, 0:-1], y[i, -1]
        yhat = multi_step_model.predict(x)[i]
        # invert differencing
        yhat = inverse_difference(dataset, yhat, len(T)+1-i)
        # store forecast
        predictions.append(yhat)
    multi_step_plot(x[v], y[v], predictions[v])
UPDATE
I changed the code to this:
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()
for x, y in train_data_multi.take(1):
    predictions = list()
    for i in range(len(y)):
        # predict
        yhat = multi_step_model.predict(x)[i]
        # invert scaling
        # invert differencing
        yhat = inverse_difference(dataset, yhat, i)
        xx = True
        # store forecast
        predictions.append(yhat)
    multi_step_plot(x[v], y[v], predictions[v])
    print(predictions[v].ravel()-y[v].ravel())
Now this is the result
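For reference, here is a minimal sketch of how a first difference and its inversion behave on a toy 1-D array (the names series and diffed are illustrative; inverse_difference is the same helper as above):
import numpy as np

series = np.array([10., 12., 15., 14., 18.])
diffed = np.diff(series)  # [2., 3., -1., 4.], like df[...].diff(1) without the leading NaN

def inverse_difference(history, yhat, interval=1):
    # add back the observation `interval` steps before the forecast
    return yhat + history[-interval]

# undo the last difference: 14. + 4. == 18.
print(inverse_difference(series[:-1], diffed[-1]))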

Receiving the same (not random) augmentations of image dataset

dataset = tf.data.Dataset.range(1, 6)
def aug(y):
    x = np.random.uniform(0, 1)
    if x > 0.5:
        y = 100
    return y
dataset = dataset.map(aug)
print(list(dataset))
When I run this code, either all the elements in the dataset stay as they were or they all become 100. How do I make it so each element is transformed independently?
My more specific question below is essentially asking the same thing.
I create my segmentation training set by:
dataset = tf.data.Dataset.from_tensor_slices((image_paths, mask_paths))
I then apply my augmentation function to the dataset:
def augment(image_path, mask_path):
    # use tf.io.read_file and tf.io.decode_jpeg to convert the paths to image tensors
    x = np.random.choice([0, 1])
    if x == 1:
        image = tf.image.flip_up_down(image)
        mask = tf.image.flip_up_down(mask)
    return image, mask
training_dataset = dataset.map(augment)
BATCH_SIZE=2
training_dataset = training_dataset.shuffle(100, reshuffle_each_iteration=True)
training_dataset = training_dataset.batch(BATCH_SIZE)
training_dataset = training_dataset.repeat()
training_dataset = training_dataset.prefetch(-1)
However, when I visualise my training dataset, all the images have the same flip applied: they are all either flipped upside down or all left untouched, whereas I'm expecting a mix, with some flipped and some not.
Why is this happening?
You need to use TensorFlow operations (not numpy or plain Python) because tf.data.Dataset.map() traces the mapped function and executes it as a graph. During tracing, numpy and plain Python values are baked in as constants, so your augmentation function runs np.random.uniform(0,1) only once and stores the result as a constant in the graph.
Note that irrespective of the context in which map_func is defined (eager vs. graph), tf.data traces the function and executes it as a graph.
The source for the above is here.
One solution is to use TensorFlow operations, as in the example below. Note that the y value assigned in the if branch has to be cast to the same dtype as the input.
dataset = tf.data.Dataset.range(1, 6)
def aug(y):
    x = tf.random.uniform([], 0, 1)
    if x > 0.5:
        y = tf.cast(100, y.dtype)
    return y
dataset = dataset.map(aug)
print(list(dataset))
You can use a uniform random draw or any other probability distribution:
tf.random.uniform(
    shape, minval=0, maxval=None, dtype=tf.dtypes.float32, seed=None, name=None
)
You can also use the prebuilt flipping layer from TensorFlow/Keras:
tf.keras.layers.experimental.preprocessing.RandomFlip(
    mode=HORIZONTAL_AND_VERTICAL, seed=None, name=None, **kwargs
)
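Applied to the segmentation case above, a minimal sketch could look like this (it assumes image and mask are already decoded tensors, and it keeps the image/mask pair in sync):
def augment(image, mask):
    # sketch: image and mask are assumed to be decoded tensors already
    # draw the random number with a TensorFlow op, so it is evaluated per element
    flip = tf.random.uniform([], 0, 1) > 0.5
    image = tf.cond(flip, lambda: tf.image.flip_up_down(image), lambda: image)
    mask = tf.cond(flip, lambda: tf.image.flip_up_down(mask), lambda: mask)
    return image, mask

training_dataset = dataset.map(augment)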

Decision boundary in perceptron not correct

I was preparing some code for a lecture and re-implemented a simple perceptron: 2 inputs and 1 output. Aim: a linear classifier.
Here's the code that creates the data, setups the perceptron and trains it:
from ipywidgets import interact
import numpy as np
import matplotlib.pyplot as plt
# Two random clouds
x = [(1,3)]*10+[(3,1)]*10
x = np.asarray([(i+np.random.rand(), j+np.random.rand()) for i,j in x])
# Colors
cs = "m"*10+"b"*10
# classes
y = [0]*10+[1]*10
class Perceptron:
    def __init__(self):
        self.w = np.random.randn(3)
        self.lr = 0.01

    def train(self, x, y, verbose=False):
        errs = 0.
        for xi, yi in zip(x, y):
            x_ = np.insert(xi, 0, 1)
            r = self.w @ x_
            ######## HERE IS THE MAGIC HAPPENING #####
            r = r >= 0
            ##########################################
            err = float(yi) - float(r)
            errs += np.abs(err)
            if verbose:
                print(yi, r)
            self.w = self.w + self.lr * err * x_
        return errs

    def predict(self, x):
        return np.round(self.w @ np.insert(x, 0, 1, 1).T)

    def decisionLine(self):
        w = self.w
        slope = -(w[0]/w[2]) / (w[0]/w[1])
        intercept = -w[0]/w[2]
        return slope, intercept
p = Perceptron()
line_properties = []
errs = []
for i in range(20):
    errs.append(p.train(x, y, True if i == 999 else False))
    line_properties.append(p.decisionLine())
print(p.predict(x)) # works like a charm!
@interact
def showLine(i:(0, len(line_properties)-1, 1)=0):
    xs = np.linspace(1, 4)
    a, b = line_properties[i]
    ys = a * xs + b
    plt.scatter(*x.T)
    plt.plot(xs, ys, "k--")
At the end, I am calculating the decision boundary, i.e. the linear eq. separating class 0 and 1. However, it seems to be off. I tried inversion etc but have no clue what is wrong. Interestingly, if I change the learning rule to
self.w = self.w + self.lr * err / x_
i.e. dividing by x_, it works properly. I am totally confused. Does anyone have an idea?
Solved for real
I added one small but very important part to the Perceptron that I had simply forgotten (and maybe others forget it as well): the thresholded activation, r = r >= 0. With it the output is centred on 0 and everything works; this is basically the answer below. If you don't do this, you have to change the class labels to get the centre back at 0. I now prefer using the classes -1 and 1, as this gives a nicely centred decision line instead of one that sits very close to one of the data clouds.
Before and after plots (images omitted).
You are creating a linear regression (not logistic regression!) with targets 0 and 1. And the line you plot is the line where the model predicts 0, so it should ideally cut through the cloud of points labeled 0, as in your first plot.
If you don't want to implement the sigmoid for logistic regression, then at least you will want to display a boundary line that corresponds to a value of 0.5 rather than 0.
As for inverting the weights providing a plot that looks like what you want, I think that's just a coincidence of this data.
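For illustration, a minimal sketch of what the answer suggests, assuming the same 3-element weight vector w with w[0] as the bias: plot the line where the linear output equals 0.5 instead of 0.
# w0 + w1*x1 + w2*x2 = 0.5  =>  x2 = (0.5 - w0)/w2 - (w1/w2)*x1
w = p.w
slope = -w[1] / w[2]
intercept = (0.5 - w[0]) / w[2]
xs = np.linspace(1, 4)
plt.scatter(*x.T)
plt.plot(xs, slope * xs + intercept, "k--")
plt.show()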

numpy.atleast_2d(), RandomForestClassifier and numpy.hstack functions

I'm unsure about exactly how the numpy.atleast_2d(), RandomForestClassifier and numpy.hstack functions work.
I've read the documentation describing the purpose of all of the functions mentioned above, but it is still not clear to me.
Can someone please help me?!
The method I am dealing with is the one below:
def fit(self, X, Y):
    X, Y = map(np.atleast_2d, (X, Y))
    assert X.shape[0] == Y.shape[0]
    Ny = Y.shape[1]
    self.clfs = []
    for i in range(Ny):
        clf = RandomForestClassifier(*self.args, **self.kwargs, n_jobs=-1)
        Xi = np.hstack([X, Y[:, :i]])
        yi = Y[:, i]
        self.clfs.append(clf.fit(Xi, yi))
So let me explain step by step.
np.atleast_2d() converts its input to an array with at least two dimensions:
np.atleast_2d(1)
Out[5]: array([[1]])
np.atleast_2d([1,2])
Out[6]: array([[1, 2]])
So, as you can see, it converts the inputs to 2-D arrays. Similarly, in your code, X, Y = map(np.atleast_2d, (X, Y)) applies np.atleast_2d to both X and Y, so each of them becomes a 2-D array.
Next, regarding RandomForestClassifier
The line clf = RandomForestClassifier(*self.args, **self.kwargs, n_jobs=-1) initializes the model, i.e. the classifier, for you.
Next, regarding np.hstack:
a = np.arange(0,4,1).reshape(2,2)
Y = np.arange(5,9,1).reshape(2,2)
res = np.hstack([a,Y])
res
Out[10]:
array([[0, 1, 5, 6],
       [2, 3, 7, 8]])
res.shape
Out[11]: (2, 4)
See 2 rows and 4 columns
np.hstack just horizontally stacks the input arrays.
Xi = np.hstack([X, Y[:, :i]]) stacks the inputs X together with the first i label columns of Y, i.e. the outputs that have already been fitted.
clf.fit(Xi, yi) fits the data to your model. It's like I initialized a black box (a system) and now I am passing data into that system to train it to adapt to that data. Feel free to ask if you have more questions.
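To see the whole thing end to end, here is a minimal, self-contained sketch; the ChainedForest class and the toy data are hypothetical, but the fit body mirrors the one posted above:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class ChainedForest:
    # hypothetical wrapper: one forest per output column, chained on earlier columns
    def __init__(self, *args, **kwargs):
        self.args, self.kwargs = args, kwargs

    def fit(self, X, Y):
        X, Y = map(np.atleast_2d, (X, Y))   # make sure both are at least 2-D
        assert X.shape[0] == Y.shape[0]     # same number of samples
        self.clfs = []
        for i in range(Y.shape[1]):
            clf = RandomForestClassifier(*self.args, **self.kwargs, n_jobs=-1)
            Xi = np.hstack([X, Y[:, :i]])   # features plus previously fitted label columns
            self.clfs.append(clf.fit(Xi, Y[:, i]))
        return self

X = np.random.rand(20, 3)                   # 20 samples, 3 features
Y = np.random.randint(0, 2, size=(20, 2))   # 2 binary output columns
ChainedForest(n_estimators=10).fit(X, Y)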

plotting with meshgrid and imshow

Imshow and meshgrid are not working the way I thought. I have some function defined for a given (x,y) point in 2D that returns a scalar f(x,y). I want to visualize the function f using imshow.
x = np.linspace(0,4)
y = np.linspace(0,1)
X,Y = np.meshgrid(x,y)
Z = np.zeros((50,50))
for i in range(50):
    for j in range(50):
        Z[i,j] = f(X[i,j],Y[i,j])
fig = plt.figure()
plt.imshow(Z,extent=[0,4,1,0])
plt.show()
This works as expected, except for the extent: I think it should be [0,4,0,1]. Am I assigning Z[i,j] to each (x,y) pair incorrectly? An explanation of how this works would be great. Thanks!
As far as I am aware, imshow is normally used to display an image; the extent is then used to define how large it should be, say if you want to use an image as the background of a plot.
Instead, I think you will find it more intuitive to use pcolor; a demo can be found here. It works much the same as imshow, so you can just supply Z, but you can also give it the X and Y arrays. This way you can really check whether you're supplying the values correctly:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0,4)
y = np.linspace(0,1)
def f(x, y):
    return y * np.sin(x)
X, Y = np.meshgrid(x,y)
Z = np.zeros((50,50))
for i in range(50):
    for j in range(50):
        Z[i,j] = f(X[i,j],Y[i,j])
plt.pcolor(X, Y, Z)
plt.show()
I have added a function to show it works. Note that if your function is able to handle numpy arrays you can replace the initialisation of Z and the nested for loops with
X, Y = np.meshgrid(x,y)
Z = f(X, Y)
This is cleaner and will be faster to compute.
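If you do want to stick with imshow, a minimal sketch using the same Z as above: imshow puts row 0 at the top by default, so either flip the vertical extent as you did, or pass origin='lower' and give the extent in the usual [left, right, bottom, top] order:
plt.imshow(Z, origin='lower', extent=[0, 4, 0, 1], aspect='auto')
plt.colorbar()
plt.show()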