Why is CNTK using the embedding dimension for the decoder?

I have trained a good sequence-to-sequence model that I've tested on my local box, but now I'm trying to evaluate a lot of queries. I'm seeing this error, though:
02/08/2017 00:50:54: EXCEPTION occurred: Node 'decoderInput._' (If operation): Input dimensions [100] and [57408 x 3] are not compatible.
57408 is the vocabulary size. I'm guessing 100 is coming from the number of embedding dimensions, which is set to 100.
I'm confused about why this is not working, because the input and output are both marked "sparse" in the cntkReaderInputDef:
cntkReaderInputDef = { rawInput = { alias = "S0" ; dim = $inputVocabSize$ ; format = "sparse" } ; rawLabels = { alias = "S1" ; dim = $labelVocabSize$ ; format = "sparse" } }

Posted by William Darling:
Because you are using an embedding, you need to use a modified version of the CNTK.core.bs file. On line 1515, there is currently:
decoderFeedback = /*EmbedLabels*/ (tokens.word) # [embeddingDim x Dnew]
The next line is where your error is coming from:
delayedDecoderFeedback = Boolean.If (Loop.IsFirst (labelSentenceStartEmbeddedScattered), labelSentenceStartEmbeddedScattered, Loop.Previous (decoderFeedback))
The decoderFeedback has shape [W x Dnew], where W is the vocabulary size, but labelSentenceStartEmbeddedScattered has shape [E], where E is the embedding dimension. In BrainScript there isn't a good way to pass in the embedding macro used in the model definition, so you need to write it out explicitly. So, change line 1515 to:
decoderFeedback = TransposeTimes(modelAsTrained.Einput, tokens.word)
That will turn your decoderFeedback representation into something compatible with the embedding shape.
By the way, format = "sparse" in the reader definition only refers to how you formatted your CTF input file. With the sparse format, you write entries like 7:1, meaning a one-hot vector with a 1 at position 7, instead of writing out a whole bunch of zeros (which you would have to do with the dense format).
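For illustration, a sparse line in a CTF input file for the reader definition above might look like the following (hypothetical token indices; each |alias entry is an index:value pair rather than a full dense vector):
0 |S0 7:1 |S1 3:1
0 |S0 12:1 |S1 9:1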

Chris Basoglu says:
Actually, you should be able to just copy your CNTK.core.bs file next to your own config file. CNTK should look in that folder first.

Related

tf.io.decode_raw returns a tensor: how to make it bytes or string?

I've been struggling with this for a while. I searched Stack Overflow and checked the TF2 docs a bunch of times. There is one solution indicated, but I don't understand why my solution doesn't work.
In my case, I store a binary string (i.e., bytes) in TFRecords. If I iterate over the dataset via as_numpy_iterator() or directly call numpy() on each item, I can get the binary string back; iterating the dataset that way does work.
I'm not sure what exactly map() passes to test_callback. I see it doesn't have a numpy method or property, and the same goes for the type tf.io.decode_raw returns (it is a Tensor, but it has no numpy() either).
Essentially I need to take a binary string, parse it via my x = decoder.FromString(y), and then pass x to my encoder, which transforms the binary string into a tensor.
def test_callback(example_proto):
    # I tried to figure out whether I can use bytes.decode
    # directly, and what the most optimal solution is.
    parsed_features = tf.io.decode_raw(example_proto, out_type=tf.uint8)
    # tf.io.decode_raw returns a tensor with N bytes.
    x = creator.FromString(parsed_features.numpy)
    encoded_seq = midi_encoder.encode(x)
    return encoded_seq

raw_dataset = tf.data.TFRecordDataset(filenames=["main.tfrecord"])
raw_dataset = raw_dataset.map(test_callback)
Thank you, folks.
I found one solution but I would love to see more suggestions.
def test_callback(example_proto):
    from_string = creator.FromString(example_proto.numpy())
    encoded_seq = encoder.encoder(from_string)
    return encoded_seq

raw_dataset = tf.data.TFRecordDataset(filenames=["main.tfrecord"])
raw_dataset = raw_dataset.map(lambda x: tf.py_function(test_callback, [x], [tf.int64]))
My understanding is that tf.py_function has a performance penalty.
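That understanding is correct: tf.py_function runs the wrapped Python eagerly under the interpreter's GIL rather than as compiled graph ops. A small mitigation (a sketch, not a cure for the underlying cost) is to let tf.data overlap the calls:
raw_dataset = raw_dataset.map(
    lambda x: tf.py_function(test_callback, [x], [tf.int64]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)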

Testing a basic example to better understand .padded_batch in TensorFlow

I have some very simple data to test my understanding of the usage of padded_batch.
The text file is saved in .txt format:
I use tensorflow for this data
I will be testing
The current tensorflow data
Please note that I am using TensorFlow version 2.0, so I do not need to use tf.Session to initialize my variables.
dataset = tf.data.TextLineDataset("test.txt")
dataset = dataset.map(lambda string: tf.string_split([string]).values)
dataset = dataset.padded_batch(2)
for x in dataset:
    print(x.numpy())
Error that I received:
TypeError: padded_batch() missing 1 required positional argument: 'padded_shapes'
Expected output:
[[b'I' b'use' b'tensorflow' b'for' b'this' b'data']
[b'I' b'will' b'be' b'testing' b'unknown' b'unknown']]
[[b'The' b'current' b'tensorflow' b'data' b'unknown' b'unknown']]
How should I configure padded_shapes and padding_values? I wish to make the tensors the same length by inserting "unknown" for each empty element. (This might be a little confusing; the above shows my expected results.)
Please note that tf.data.Dataset.padded_batch expects padded_shapes, the shape of your inputs, and, since you want the padding value to be "unknown", the padding_values argument as well. Below is the code snippet you want to use.
dataset = tf.data.TextLineDataset("test.txt")
dataset = dataset.map(lambda string: tf.string_split([string]).values)
dataset = dataset.padded_batch(3, padded_shapes=[None], padding_values="unknown")
for x in dataset:
    print(x.numpy())
# [[b'I' b'use' b'tensorflow' b'for' b'this' b'data']
# [b'I' b'will' b'be' b'testing' b'unknown' b'unknown']
# [b'The' b'current' b'tensorflow' b'data' b'unknown' b'unknown']]
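As a side note: in TensorFlow 2.2 and later, padded_shapes is optional and defaults to padding every dimension to the longest element in the batch, so (assuming such a version) the call can be shortened to:
dataset = dataset.padded_batch(3, padding_values="unknown")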

Accessing learned weights of a DNN in CNTK

How can one access the learned weights of a DNN saved as follows:
lstm_network_output.save(model_path)
The weights/parameters of a network can be accessed by calling lstm_network_output.parameters, which returns a list of Parameter variable objects. The value of a Parameter can be obtained as a numpy array using the value property of the Parameter object, and it can be updated by assigning to .value.
If you used name= properties in creating your model, you can also identify layers by name. For example:
model = Sequential([Embedding(300, name='embed'), Recurrence(LSTM(500)), Dense(10)])
E = model.embed.E # accesses the embedding matrix of the embed layer
To know that the parameter is .E, please consult the docstring of the respective function (e.g. help(Embedding)). (In Dense and Convolution, the parameters would be .W and .b.)
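The same named-access pattern works for other layer types. For example, with a named Dense layer (a small sketch extending the model above):
model = Sequential([Embedding(300, name='embed'), Recurrence(LSTM(500)), Dense(10, name='out')])
W = model.out.W  # weight matrix of the Dense layer
b = model.out.b  # bias vector of the Dense layer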
The pattern above is for named layers, which are created using as_block(). You can also name intermediate variables, and access them in the same way. E.g.:
W = Parameter((13,42), init=0, name='W')
x = Input(13)
y = times(x, W, name='times1')
W_recovered = y.times1.W
# e.g. check the shape to see that they are the same
W_recovered.shape # --> (13, 42)
W.shape # --> (13, 42)
Technically, this will search all parameters that feed y. In case of a more complex network, you may end up having multiple parameters of the same name. Then an error will be thrown due to the ambiguity. In that case, you must work with the .parameters tuple mentioned in Anna's response.
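If you are unsure which parameters a model contains, a quick way to inspect it is to print every parameter's name and shape (a minimal sketch; 'model.dnn' stands in for your saved file):
import cntk as C
model = C.Function.load('model.dnn')
for p in model.parameters:
    print(p.name, p.shape)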
This Python code worked for me to visualize some weights:
import numpy as np
import cntk as C

dnnFile = C.cntk_py.Function.load(r'Models\ConvNet_MNIST_5.dnn')  # load the model from the MS example
layer8 = dnnFile.parameters()[8].value()
filter_num = 0
sliced = layer8.asarray()[filter_num][0]  # the filter that operates on the input image
print(sliced)

TensorFlow VocabularyProcessor

I am following the WildML blog on text classification using TensorFlow. I am not able to understand the purpose of max_document_length in this code statement:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
Also, how can I extract the vocabulary from the vocab_processor?
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat','This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])
## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))
## Extract word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping
## Sort the vocabulary dictionary on the basis of values(id).
## Both statements perform the same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
## Treat the id's as index into list and create a list of words in the ascending order of id's
## word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
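If you later need to map the word-id matrix back to tokens, VocabularyProcessor also provides a reverse() method (a small sketch reusing the objects above):
## Reverse the word-id matrix back into space-joined documents.
for document in vocab_processor.reverse(x):
    print(document)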
not able to understand the purpose of max_document_length
The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (and probably won't) all be the same length. For example, if you're working with sentences for sentiment analysis, they'll be of various lengths.
You provide this parameter to the VocabularyProcessor so that it can adjust the length of output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are
longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
    """Transform documents to word-id matrix.

    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.

    Args:
      raw_documents: An iterable which yield either str or unicode.

    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= self.max_document_length:
                break
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids
Note the line word_ids = np.zeros(self.max_document_length, np.int64): positions that are never filled keep the id 0, so id 0 effectively acts as the padding token.
Each row in the raw_documents variable will be mapped to a vector of length max_document_length.

How to read data from numpy files in TensorFlow? [duplicate]

I have read the CNN tutorial on TensorFlow and I am trying to use the same model for my project.
The problem is now in data reading. I have around 25000 images for training and around 5000 each for testing and validation. The files are in PNG format, and I can read them and convert them into numpy.ndarray.
The CNN example in the tutorials uses a queue to fetch the records from the file list provided. I tried to create my own binary file like that by reshaping my images into 1-D arrays and attaching a label value at the front of each. So my data looks like this:
[[1,12,34,24,53,...,105,234,102],
[12,112,43,24,52,...,115,244,98],
....
]
A single row of the above array is of length 22501, where the first element is the label.
I dumped the array to a file using pickle and then tried to read from the file using tf.FixedLengthRecordReader, as demonstrated in the example.
I am doing the same things as in cifar10_input.py to read the binary file and put the records into the record object.
Now when I read from the files, the labels and the image values are different. I can understand the reason: pickle dumps the extra information of braces and brackets into the binary file as well, and that changes the fixed-length record size.
The above example uses filenames and passes them to a queue to fetch the files, and then reads a single record at a time from the queue.
I want to know if I can pass the numpy array as defined above, instead of the filenames, to some reader so it can fetch records one by one from that array instead of from files.
Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:
Write out a binary file containing the contents of your numpy array.
import numpy as np

images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
images_and_labels_array.tofile("/tmp/images.bin")
This file is similar to the format used in CIFAR10 datafiles. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
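As a quick sanity check (a small sketch, using the 22501-byte record length from your question), you can read the raw bytes back with numpy and confirm the records round-trip:
recovered = np.fromfile("/tmp/images.bin", dtype=np.uint8).reshape(-1, 22501)
assert (recovered == images_and_labels_array).all()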
Write a modified version of read_cifar10() that handles your record format.
def read_my_data(filename_queue):
    class ImageRecord(object):
        pass
    result = ImageRecord()

    # Dimensions of the images in the dataset.
    label_bytes = 1
    # Set the following constants as appropriate.
    result.height = IMAGE_HEIGHT
    result.width = IMAGE_WIDTH
    result.depth = IMAGE_DEPTH
    image_bytes = result.height * result.width * result.depth
    # Every record consists of a label followed by the image, with a
    # fixed number of bytes for each.
    record_bytes = label_bytes + image_bytes
    assert record_bytes == 22501  # Based on your question.

    # Read a record, getting filenames from the filename_queue. No
    # header or footer in the binary, so we leave header_bytes
    # and footer_bytes at their default of 0.
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    result.key, value = reader.read(filename_queue)

    # Convert from a string to a vector of uint8 that is record_bytes long.
    record_bytes = tf.decode_raw(value, tf.uint8)

    # The first bytes represent the label, which we convert from uint8->int32.
    result.label = tf.cast(
        tf.slice(record_bytes, [0], [label_bytes]), tf.int32)

    # The remaining bytes after the label represent the image, which we reshape
    # from [depth * height * width] to [depth, height, width].
    depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                             [result.depth, result.height, result.width])

    # Convert from [depth, height, width] to [height, width, depth].
    result.uint8image = tf.transpose(depth_major, [1, 2, 0])

    return result
Modify distorted_inputs() to use your new dataset:
def distorted_inputs(data_dir, batch_size):
    """[...]"""
    filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                     # generated multiple files in step 1.
    for f in filenames:
        if not gfile.Exists(f):
            raise ValueError('Failed to find file: ' + f)

    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(filenames)

    # Read examples from files in the filename queue.
    read_input = read_my_data(filename_queue)
    reshaped_image = tf.cast(read_input.uint8image, tf.float32)

    # [...] (Maybe modify other parameters in here depending on your problem.)
This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
In your question, you specifically asked:
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
You can feed the numpy array to a queue directly, but it will be a more invasive change to the cifar10_input.py code than my other answer suggests.
As before, let's assume you have the following array from your question:
import numpy as np

images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)
You can then define a queue that contains the entire data as follows:
q = tf.FIFOQueue(capacity=25000,  # large enough to hold the whole training set from your question
                 dtypes=[tf.uint8, tf.uint8], shapes=[[], [22500]])
enqueue_op = q.enqueue_many([images_and_labels_array[:, 0], images_and_labels_array[:, 1:]])
...then call sess.run(enqueue_op) to populate the queue.
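Once the queue is populated, consuming it might look like this (a sketch; tf.train.batch assembles mini-batches from individually dequeued examples):
label, image = q.dequeue()
label_batch, image_batch = tf.train.batch([label, image], batch_size=128)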
Another—more efficient—approach would be to feed records to the queue, which you could do from a parallel thread (see this answer for more details on how this would work):
# [With q as defined above.]
label_input = tf.placeholder(tf.uint8, shape=[])
image_input = tf.placeholder(tf.uint8, shape=[22500])
enqueue_single_from_feed_op = q.enqueue([label_input, image_input])

# Then, to enqueue a single example `i` from the array:
sess.run(enqueue_single_from_feed_op,
         feed_dict={label_input: images_and_labels_array[i, 0],
                    image_input: images_and_labels_array[i, 1:]})
Alternatively, to enqueue a batch at a time, which will be more efficient:
label_batch_input = tf.placeholder(tf.uint8, shape=[None])
image_batch_input = tf.placeholder(tf.uint8, shape=[None, 22500])
enqueue_batch_from_feed_op = q.enqueue_many([label_batch_input, image_batch_input])

# Then, to enqueue a batch of examples `i` through `j-1` from the array:
sess.run(enqueue_batch_from_feed_op,
         feed_dict={label_batch_input: images_and_labels_array[i:j, 0],
                    image_batch_input: images_and_labels_array[i:j, 1:]})
I want to know if I can pass the numpy array as defined above instead of the filenames to some reader and it can fetch records one by one from that array instead of the files.
tf.py_func, which wraps a Python function and uses it as a TensorFlow operator, might help.
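A minimal sketch of that approach (the index placeholder and fetch_record helper are hypothetical; images_and_labels_array is the array from your question):
import numpy as np
import tensorflow as tf

index = tf.placeholder(tf.int32, shape=[])

def fetch_record(i):
    # Split one row of the numpy array into (label, image) byte vectors.
    row = images_and_labels_array[i]
    return row[:1], row[1:]

label, image = tf.py_func(fetch_record, [index], [tf.uint8, tf.uint8])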
However, since you've mentioned that your images are stored in png files, I think the simplest solution would be to replace this:
reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
result.key, value = reader.read(filename_queue)
with this:
result.key, value = tf.WholeFileReader().read(filename_queue)
value = tf.image.decode_png(value)  # your files are PNGs, so decode_png rather than decode_jpeg