Shaping input labels for Tensorflow

Say I have a 1000x500 table, with 1000 rows and 500 columns.
Each row is a sample made up of 499 features and 1 label.
I want to feed this to a TensorFlow model, taking a batch of 20 samples at a time:
...
inputdata   # is filled and has a shape of 499x1000
inputlabel  # is filled and has a shape of 1x1000
y_ = tf.placeholder(tf.float32, [None, batchSize], name='Labels')
for j in range(numberOfRows // batchSize):
    sess.run(train_step, feed_dict={x: batch_xs[j], y_: np.reshape(inputlabel[j], (batchSize, 1))})
I've been trying to run my code for two days without any success, so I'd be grateful for any help with the y_ and reshaping part. What I don't understand is: when I read a batch of 20 data rows, how should I shape the labels y_?

First issue: put your batch_size dimension as your first dimension; that's the standard, and a fair number of computations in tensorflow assume as much.
Second, I don't see a placeholder for your data, X, but you're passing it as a variable to sess.run.
To keep things simple, I suggest you do all this reshaping outside of tensorflow, use numpy. Don't get me wrong, you can absolutely do this in tensorflow, but if slicing and merging are confusing you (they confused everyone the first time), tensorflow will only add to that confusion because you can't simply print the results of a slicing operation as conveniently in tensorflow as you can in numpy to debug your situation.
So to that end, let's do it:
# your data
mydata = np.random.rand(500,1000)
# tensorflow placeholders
X = tf.placeholder(tf.float32, [batchSize, 499], name='X')
y_ = tf.placeholder(tf.float32, [batchSize, 1], name='y_')
# let's transpose your data so the batch is the first dimension (1000 x 500)
mydata = mydata.T
# Let's split the labels from the data
data = mydata[:, 0:499]
labels = mydata[:, 499]
# Now train
for j in range(numOfRows // BatchSize):
    row_from = j * BatchSize
    row_to = j * BatchSize + BatchSize
    sess.run(train_step, feed_dict={
        X: data[row_from:row_to, :],
        y_: labels[row_from:row_to].reshape(BatchSize, 1)
    })
Don't forget to permute your data; we didn't do it here. I personally like np.random.permutation(1000) to get a random list of indexes: take the first BatchSize indexes, then np.roll the permutation. It's a super easy way to iterate through data sets without computing indexes by hand or special-casing the trailing batch that isn't a full size, as sketched below.
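As a minimal sketch of that trick, reusing the names from the snippet above (data, labels, numOfRows, BatchSize):
perm = np.random.permutation(numOfRows)
for j in range(numOfRows // BatchSize):
    idx = perm[:BatchSize]               # the first BatchSize random indexes
    sess.run(train_step, feed_dict={
        X: data[idx, :],
        y_: labels[idx].reshape(BatchSize, 1)
    })
    perm = np.roll(perm, BatchSize)      # rotate so the next batch differs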

Related

How to use the TensorFlow dataset API with unknown shapes properly?

I've been trying for several hours to complete this task with no success.
I have a very large dataset with the following structure:
I want to split this data into X and Y (and pass Y to tf.to_categorical) as in the picture, using the tf.data.Dataset API, but unfortunately every attempt I've made has ended in some kind of error.
How do I use tf.data.Dataset to:
Split each row to x and y.
Convert Y to categorical with tf.to_categorical.
Split the dataset into batches.
Feed my model with the dataset.
My current attempt:
def map_sequence():
    for sequence in input_sequences:
        yield sequence[:-1], keras.utils.to_categorical(sequence[-1], total_words)

dataset = tf.data.Dataset.from_generator(map_sequence,
                                         (tf.int32, tf.int32),
                                         (tf.TensorShape(title_length-1), tf.TensorShape(total_words)))
But when I try to train my model with the following code:
inputs = keras.layers.Input(shape=(title_length-1, ))
x = keras.layers.Embedding(total_words, 32)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
predictions = keras.layers.Dense(total_words, activation='softmax')(x)
model = keras.Model(inputs=inputs, outputs=predictions)
model.compile('Adam', 'categorical_crossentropy', metrics=['acc'])
model.fit(dataset)
I am getting this error: ValueError: Shapes (32954, 1) and (65, 32954) are incompatible
I think you have a similar problem as in this question. Keras expects the dataset that you give it to produce batches, not individual examples. Since you are giving it two one-dimensional vectors at a time, Keras interprets each of them as a batch of examples with one feature. So your X data, which has 65 elements, is interpreted as a batch of 65 examples with a single feature (a 65x1 tensor). This fixes the batch size to 65. The output of the model then has shape 65x32,954 (which I assume is the value of total_words). But your Y vector, with 32,954 elements, is again interpreted as a batch of 32,954 examples with one feature (a 32,954x1 tensor). These two things don't match, hence the error. You should be able to fix it by simply making a new dataset with batch before passing it to fit.
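For example, keeping your generator exactly as it is, the minimal fix would be something along these lines (batch_size being a value of your choosing):
dataset = dataset.batch(batch_size)  # produce batches, not single examples
model.fit(dataset)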
In any case, if your input_sequences is a NumPy array, as it seems to be, your method of producing the dataset is not really good, as using a generator will be really slow. This is a better way to do the same:
def map_sequence(sequence):
    # Using tf.one_hot instead of keras.utils.to_categorical
    # because we are working with TensorFlow tensors now
    return sequence[:-1], tf.one_hot(sequence[-1], total_words)

dataset = tf.data.Dataset.from_tensor_slices(input_sequences)
dataset = dataset.map(map_sequence)
dataset = dataset.batch(batch_size)

How to map an array of values for y_true to a single value in order to compare to y_pred in a Tensorflow loss function (Tensorflow/Tensorflow Quantum)

I am trying to implement the circuits listed on page 8 in the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 in the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ, I only used 1 observable Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I place this in quotes, because I am sure there is a better way to describe this. In Qiskit on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
    data_qubits = cirq.GridQubit.rect(4, 1)
    circuit = cirq.Circuit()
    ...
    return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                     cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]

model_circuit, model_readout = circuit()

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    # The PQC layer returns the expected value of the readout gate, range [-1,1].
    tfq.layers.PQC(model_circuit, model_readout),
])

# compile model
model.compile(
    loss=loss_mse,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    metrics=[])
In loss_mse (mean squared error), I receive a (32, 4) tensor for y_pred. One row could look like
[-0.2, 0.33, 0.6, 0.3]
This would first have to be mapped from [-1,1] to a binarized version in [0,1], so that it looks like:
[0, 1, 1, 1]
Now a table lookup needs to happen, which tells whether this combination is 0 or 1. Finally, the regular (y_true-y_pred)^2 can be computed for that row, followed by an np.sum over all rows. I tried to implement this:
def get_label(measurement):
    if measurement == [0,0,0,0]: return 0
    ...
    elif measurement == [1,1,1,1]: return 0
    else: return -1

def py_call(y_true, y_pred):
    # cast tensor to numpy
    y_pred_np = np.asarray(y_pred)
    loss = np.zeros((len(y_pred)))  # could be a single variable with += within the loop
    # evaluate all 32 samples
    for pred in range(len(y_pred_np)):
        # map, binarize and look up
        y_labelled = get_label([0 if y < 0 else 1 for y in y_pred_np[pred]])
        # regular loss comparison
        loss[pred] = (y_labelled - y_true[pred])**2
    # reduce
    loss = np.sum(loss)/len(y_true)
    return loss

@tf.function
def loss_mse(y_true, y_pred):
    external_list = []
    loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
    return loss
However, the system still appears to expect a (32, 4) tensor, whereas I would have thought I could simply provide a single loss value (float). My question: how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a TensorFlow loss function?
So it looks like there are a couple of things going on here. To answer your question:
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function?
What you might want is some kind of tf.reduce_* function like tf.reduce_mean or tf.reduce_sum. These functions apply a reduction operation across a given tensor axis, letting you convert a tensor of shape (32, 4) into one of shape (32,) or of shape (4,). Here is a quick snippet:
@tf.function
def my_loss(y_true, y_pred):
    # y_true is shape (32, 4)
    # y_pred is shape (32, 4)

    # Scale from [-1, 1] to [0, 1]
    y_true += 1
    y_true /= 2
    y_pred += 1
    y_pred /= 2

    # These are now both (32,), with the reduction of taking the mean
    # applied along the second axis.
    reduced_true = tf.reduce_mean(y_true, axis=1)
    reduced_pred = tf.reduce_mean(y_pred, axis=1)

    # Now a scalar loss.
    loss = tf.reduce_mean((reduced_true - reduced_pred) ** 2)
    return loss
Now the above isn't exactly what you want, since it's not super clear to me, at least, what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1.
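If the rule really is an arbitrary 16-entry table, one tensor-native sketch (assuming you fill in the table values yourself; the [0]*8 + [1]*8 contents below are placeholders) is to binarize, convert each row of bits to an integer index, and gather from a constant table. Note the thresholding step has no gradient, so this has the same trainability limitation as the py_function attempt:
# Hypothetical 16-entry table: index i holds the label for bit pattern i.
label_table = tf.constant([0.] * 8 + [1.] * 8)      # replace with your mapping

def lookup_labels(y_pred):
    bits = tf.cast(y_pred >= 0, tf.int32)           # (32, 4) in {0, 1}
    powers = tf.constant([8, 4, 2, 1])              # binary place values
    idx = tf.reduce_sum(bits * powers, axis=1)      # (32,) integers in [0, 15]
    return tf.gather(label_table, idx)              # (32,) labels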
Another thing I will also mention: if you want JUST the sum of these Pauli operators in cirq that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])], and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                   cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_operator)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatible as a readout operation in the PQC layer. Lastly, I would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
These give a pretty good description of the input and output signatures of the functions, as well as the shapes you can expect from them.

Feeding integer CSV data to a Keras Dense first layer in sequential model

The documentation for CSV Datasets stops short of showing how to use a CSV dataset for anything practical like using the data to train a neural network. Can anyone provide a straightforward example to demonstrate how to do this, with clarity around data shape and type issues at a minimum, and preferably considering batching, shuffling, repeating over epochs as well?
For example, I have a CSV file of M rows, each row being an integer class label followed by N integers from which I hope to predict the class label using an old-style 3-layer neural network with H hidden neurons:
model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N))
...
model.fit(train_ds, ...)
For my data, M > 50000 and N > 200. I have tried creating my dataset by using:
train_ds = tf.data.experimental.make_csv_dataset('mydata.csv', batch_size=B)
However... this leads to compatibility problems between the dataset and the model... and it's not clear where these problems lie: are they in the input shape, the integer (not float) data, or somewhere else?
This question may provide some help... although the answers mostly relate to Tensorflow V1.x
It is likely that CSV Datasets are not required for this task. Data of the size you indicate will probably fit in memory, and a tf.data.Dataset may wrap the data in more complexity than useful functionality. You can do it without datasets (as shown below), so long as ALL the data is integers.
If you persist with the CSV Dataset approach, understand that there are many ways CSVs are used, and different approaches to loading them (e.g. see here and here). Because CSVs can have a variety of column types (numerical, boolean, text, categorical, ...), the first step is usually to load the CSV data in a column-oriented format. This provides access to the columns via their labels, which is useful for pre-processing. However, you probably want to provide rows of data to your model, so translating from columns to rows may be one source of confusion (a sketch of that route appears at the end of this answer). At some point you will probably need to convert your integer data to float, but this may occur as a side-effect of certain pre-processing.
So long as your CSVs contain integers only, without missing data, and with a header row, you can do it without a tf.data.Dataset, step-by-step as follows:
import numpy as np
from numpy import genfromtxt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
train_data = genfromtxt('train set.csv', delimiter=',')
test_data = genfromtxt('test set.csv', delimiter=',')
train_data = np.delete(train_data, (0), axis=0) # delete header row
test_data = np.delete(test_data, (0), axis=0) # delete header row
train_labels = train_data[:,[0]]
test_labels = test_data[:,[0]]
train_labels = tf.keras.utils.to_categorical(train_labels)
# count labels used in training set; categorise test set on the same basis
# even if the test set only uses a subset of the categories learned in training
K = len(train_labels[ 0 ])
test_labels = tf.keras.utils.to_categorical(test_labels, K)
train_data = np.delete(train_data, (0), axis=1) # delete label column
test_data = np.delete(test_data, (0), axis=1) # delete label column
# Data will have been read in as float... but you may want scaling/normalization...
scale = lambda x: x/1000.0 - 500.0  # change to suit
train_data = scale(train_data)
test_data = scale(test_data)
N_train = len(train_data[0]) # columns in training set
N_test = len(test_data[0]) # columns in test set
if N_train != N_test:
print("Datasets have incompatible column counts: %d vs %d" % (N_train, N_test))
exit()
M_train = len(train_data) # rows in training set
M_test = len(test_data) # rows in test set
print("Training data size: %d rows x %d columns" % (M_train, N_train))
print("Test set data size: %d rows x %d columns" % (M_test, N_test))
print("Training to predict %d classes" % (K))
model = Sequential()
model.add(Dense(H, activation='relu', input_dim=N_train)) # H not yet defined...
...
model.compile(...)
model.fit( train_data, train_labels, ... ) # see docs for shuffle, batch, etc
model.evaluate( test_data, test_labels )
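For completeness, here is a hedged sketch of the Dataset route mentioned earlier. It assumes the CSV header's first column is named 'label' and all other columns are integer features ('label' is an assumed name; substitute your own). make_csv_dataset yields (dict-of-columns, labels) batches, so the map step stacks the per-column tensors into the row vectors the Dense layer expects:
B = 32  # batch size, chosen here for illustration
train_ds = tf.data.experimental.make_csv_dataset(
    'mydata.csv', batch_size=B, label_name='label',
    num_epochs=1, shuffle=True)

def columns_to_rows(features, labels):
    # Stack each (B,) column tensor into a (B, N) float matrix of rows.
    x = tf.stack([tf.cast(v, tf.float32) for v in features.values()], axis=1)
    return x, labels

train_ds = train_ds.map(columns_to_rows)
# model.fit(train_ds) then sees (B, N) feature batches, one row per sample.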

Tensorflow dimensions /placeholders

I want to run a neural network in tensorflow. I am trying to do email classification, so my training data is an array of count vectorized documents.
Im trying to understand the dimensions for how I should input data into tensorflow. I am creating placeholders like this:
X = tf.placeholder(tf.int64, [None, #features])
Y = tf.placeholder(tf.int64, [None, #labels])
Then later, I have to transform the actual y_train to have dimensionality (1, #observations), since I get dimensionality errors when I run the code otherwise.
Should the placeholders and the variables have the same dimensionality? What is the correspondence? I am also getting out-of-memory errors, so I am concerned that something is wrong with my input dimensions.
A little unsure as to what your "#" symbols refer to. This is often used to mean "number", in which case what you have written would be incorrect. To be clear, you want to define your placeholders for X and Y as
X = tf.placeholder(tf.int64, [None, input_dimensions])
Y = tf.placeholder(tf.int64, [None, 1])
Here the None values accommodate the number of samples in the training data you pass in: if you feed in 10 emails, None will be 10. input_dimensions means "how long is the vector that represents a single training example". In the case of a grey-scale image this would equal the number of pixels; in the case of your e-mail inputs, it should be the length of the longest vectorized email.
All of your email inputs will need to be the same length, and a common practice is to pad every vector shorter than the longest email up to the max length with zeros.
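As a minimal sketch of that padding, assuming emails is a list of 1-D integer NumPy arrays of varying lengths:
import numpy as np

max_len = max(len(e) for e in emails)
padded = np.zeros((len(emails), max_len), dtype=np.int64)
for i, e in enumerate(emails):
    padded[i, :len(e)] = e   # left-aligned; trailing positions stay zero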
When comparing Y to the training labels (y_train) they should both be tensors of the same shape. So as Y has shape (number_of_emails, 1), so should y_train. You can convert from (1, number_of_emails) to (number_of_emails, 1) using
y_train = tf.reshape(y_train, [-1,1])
Finally, the out-of-memory errors are unlikely to be caused by any dimension mismatch; more likely you are feeding too many emails into the network at once. Each time you feed in some emails as X, they must all be held in memory. If there are many emails, feeding them all in at once will exhaust the memory resources (particularly if training on a GPU). For this reason it is common practice to batch your inputs into smaller groups fed in sequentially. Tensorflow provides a guide to importing data, as well as specific help on batching.

Back-propagation exhibiting quadratic memory consumption

I'm running into a weird problem with TensorFlow. I've set up a very simple classification problem, four input variables, one binary output variable, one layer of weights and bias, output goes through a sigmoid to 0 or 1.
The problem is, memory consumption is quadratic in the number of records of training data! With only 5,000 records, it's already 900 megabytes; at 10,000, it runs into a few gigabytes. Since I want to end up using at least a few tens of thousands of records, this is a problem.
It is happening specifically in the back propagation step; when I just try to evaluate the cost function, memory consumption is linear in the number of records, as expected.
Code follows. What am I doing wrong?
import numpy as np
import os
import psutil
import tensorflow as tf
process = psutil.Process(os.getpid())
sess = tf.InteractiveSession()
# Parameters
learning_rate = 0.01
random_seed = 1
tf.set_random_seed(random_seed)
# Data
data = np.loadtxt('train.csv', delimiter=',', dtype=np.float32)
train_X = data[:, :-1]
train_Y = data[:, -1]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(np.float32, shape=(rows, cols))
Y = tf.placeholder(np.float32, shape=(rows,))
# Weights
W = tf.Variable(tf.random_normal((cols, 1)))
b = tf.Variable(tf.random_normal(()))
# Model
p = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Just one optimizer step is enough to demonstrate the problem
optimizer.run({X: train_X, Y: train_Y})
# Memory consumption is quadratic in number of rows
print('{0:,} bytes'.format(process.memory_info().peak_wset))
It turns out to be, again, a problem of shape. Using matmul the way I did generates output of shape (n,1). Using that in a context where shape (n,) was expected silently triggers broadcasting: in the cost expression, (p-Y) with p of shape (n,1) and Y of shape (n,) broadcasts to shape (n,n), hence the quadratic blowup.
The solution is squeeze. Specifically, tf.squeeze(tf.matmul(X, W)).
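Applied to the model above, the fix would look like this, making p shape (n,) so that (p - Y) stays elementwise:
p = tf.nn.sigmoid(tf.squeeze(tf.matmul(X, W)) + b)  # shape (n,), matches Y
cost = tf.reduce_sum((p - Y)**2 / rows)             # no (n, n) broadcast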
It makes sense that memory consumption blows up like that, since back-propagation requires extra memory to keep track of the gradients of each operation (though I can't figure out how it ends up being quadratic).
Solution: Mini-batches
This is usually the go-to method when it comes to training models. Split your training data into little mini-batches, each containing a fixed number of samples (rarely more than 200), and feed them to the optimizer one mini-batch at a time. So if your batch_size=64, the train_X and train_Y fed to the optimizer will have shapes (64, 4) and (64,) respectively.
I would try something like this (note that for batches to be fed, X and Y would need to be defined with a None first dimension, e.g. shape=(None, cols) and shape=(None,)):
batch_size = 64
for i in range(rows // batch_size):
    batch_X = train_X[i*batch_size : (i + 1)*batch_size]
    batch_Y = train_Y[i*batch_size : (i + 1)*batch_size]
    optimizer.run({X: batch_X, Y: batch_Y})