CNTK error: Did not find a valid input name at offset 201303500 in the input file

I am trying to read the SVHN dataset using the CNTK CTFDeserializer. SVHN is distributed as .mat files, so I am loading them with scipy.io.loadmat and modifying https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_103A_MNIST_DataLoader.ipynb to read the data, flatten it, and store it as a txt file, and https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb to read and reshape the txt file and run the CNN model.
It throws the error "Did not find a valid input name at offset 201303500 in the input file".
My txt file contains 73257 lines in the following format:
|labels 1 0 0 0 0 0 0 0 0 0 |features 33 30 ... (3*32*32 values)

The error is probably happening near the end of the file (maybe not all output was flushed to the file?). To figure out what goes wrong, write a subset of the data to the file and manually inspect the end of the file. Searching the codebase for the error did not yield any results, but I can update this answer if you provide a more precise error message.
Another possibility, if you already have the dataset as numpy arrays, is to slice off minibatches yourself and feed them to CNTK. CNTK tutorials 104 and 106 show how to do this.
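To make the manual inspection easier, here is a minimal sketch (plain Python; the file name data.txt is an assumption, use whatever your loader actually writes) that checks the last few lines of the CTF text file for the expected |labels and |features markers:
# Hypothetical check of the tail of the CTF text file for malformed lines.
CTF_PATH = "data.txt"  # assumed name of the file written by the data loader

with open(CTF_PATH, "r") as f:
    lines = f.readlines()

# A missing flush or a partially written record would show up in the last lines.
for i, line in enumerate(lines[-5:], start=max(len(lines) - 5, 0)):
    ok = "|labels" in line and "|features" in line
    print(f"line {i}: {'OK' if ok else 'MALFORMED'} ({len(line.split())} tokens)")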

Tensorflow: Count number of examples in a TFRecord file -- without using deprecated `tf.python_io.tf_record_iterator`

Please read the post before marking it as a duplicate:
I was looking for an efficient way to count the number of examples in a TFRecord file of images. Since a TFRecord file does not save any metadata about the file itself, the user has to loop through the file in order to compute this information.
There are a few different questions on StackOverflow that answer this question. The problem is that all of them seem to use the deprecated tf.python_io.tf_record_iterator command, so this is not a stable solution. Here is a sample of the existing posts:
Obtaining total number of records from .tfrecords file in Tensorflow
Number of examples in each tfrecord
So I was wondering if there was a way to count the number of records using the new Dataset API.
There is a reduce method listed under the Dataset class. The documentation gives an example of counting records using this method:
import numpy as np
import tensorflow as tf

# Generate the dataset (batch size and repeat must be 1; avoid dataset manipulation such as map and shard).
ds = tf.data.Dataset.range(5)
# Count the examples with reduce.
cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)
## produces 5
I don't know whether this method is faster than @krishnab's for loop.
I got the following code to work without the deprecated command. Hopefully this will help others.
Using the Dataset API, I set up an iterator and then loop over it. I am not sure whether this is the fastest approach, but it works. MAKE SURE THE BATCH SIZE AND REPEAT ARE SET TO 1, otherwise the code returns the number of batches and not the number of examples in the dataset.
import tensorflow as tf

count_test = tf.data.TFRecordDataset('testing.tfrecord')
count_test = count_test.map(_parse_image_function)  # your own parsing function
count_test = count_test.repeat(1)
count_test = count_test.batch(1)
test_counter = count_test.make_one_shot_iterator()

c = 0
for ex in test_counter:
    c += 1
print(f"There are {c} testing records")
This seemed to work reasonably well even on a relatively large file.
The following works for me using TensorFlow version 2.1 (using the code found in this answer):
import os
import tensorflow as tf

def count_tfrecord_examples(
        tfrecords_dir: str,
) -> int:
    """
    Counts the total number of examples in a collection of TFRecord files.

    :param tfrecords_dir: directory that is assumed to contain only TFRecord files
    :return: the total number of examples in the collection of TFRecord files
        found in the specified directory
    """
    count = 0
    for file_name in os.listdir(tfrecords_dir):
        tfrecord_path = os.path.join(tfrecords_dir, file_name)
        count += sum(1 for _ in tf.data.TFRecordDataset(tfrecord_path))
    return count
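A minimal usage sketch (the directory name is just an illustration, not from the original answer):
num_examples = count_tfrecord_examples("data/tfrecords/test")  # hypothetical directory of TFRecord files
print(f"There are {num_examples} testing records")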

Keras/TensorFlow: How do I transform text to use as input?

I've been reading tutorials for the last few days, but they all seem to start at the step of "I have my data from this pre-prepared data set, let's go".
What I'm trying to do is take a set of emails I've tokenized, and figure out how to get them into a model as the training and evaluation data.
Example email:
0 0 0 0 0 0 0 0 0 0 0 0 32192 6675 16943 1380 433 8767 2254 8869 8155
I have a folder of emails (one file per email) for each of spam and not-spam:
/spam/
93451.txt
...
/not-spam/
112.txt
...
How can I get Keras to read this data?
Alternatively, how can I generate a CSV or some other format that it wants to use to input it?
There are many ways to do this; I'll list them in the order I would try them (a small sketch of option 1 follows the list):
1. You need to create a dictionary of all the words in the dataset and then assign a token (index) to each of them. When inputting to the network you can convert each text into a one-hot encoded form.
2. You can convert the input text by feeding it to a pretrained word-embedding model such as GloVe or word2vec and obtain an embedding vector.
3. You can use the one-hot vectors from 1 and train your own embeddings.
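For option 1, here is a minimal sketch using Keras' built-in tokenizer; the example emails, the vocabulary size, and the sequence length of 300 are assumptions for illustration (in practice you would read the texts from your spam / not-spam folders):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Made-up example emails; in practice these come from your spam / not-spam folders.
texts = ["win a free prize now", "meeting rescheduled to friday"]
labels = [1, 0]  # 1 = spam, 0 = not spam

tokenizer = Tokenizer(num_words=20000)            # builds the word -> index dictionary
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)   # each email becomes a list of word indexes
padded = pad_sequences(sequences, maxlen=300)     # pad/truncate every email to a fixed length

print(padded.shape)  # (2, 300), ready to feed to an Embedding layer or to one-hot encode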
As I understood from your task description (please correct me if I'm wrong), you need to classify texts into either the spam or the not-spam category.
Basically, if you want to create a universal text-data classification input pipeline, your data input stage should contain 3 steps:
1. Read the list of folders ("spam", "not spam" in your case) and iterate over each folder to get the list of files.
At the end you should have:
a) a dictionary mapping (label_id -> label_name), so in your case (0 -> spam, 1 -> not_spam);
b) a list of (file_content, label) pairs.
As you understand, this is out of scope of both Keras and TensorFlow; it is typical Python code (a small sketch follows).
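A minimal sketch of step 1, assuming the folder layout from the question (one folder per class, one file per email):
import os

label_names = {0: "spam", 1: "not-spam"}   # label_id -> label_name

data = []                                  # list of (file_content, label) pairs
for label_id, folder in label_names.items():
    for file_name in os.listdir(folder):
        with open(os.path.join(folder, file_name)) as f:
            data.append((f.read(), label_id))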
2. For each (file_content, label) pair you should process the first element, and that is usually the most interesting part.
In your example I can see 0 0 0 0 0 0 0 0 0 0 0 0 32192 6675 16943 1380 433 8767 2254 8869 8155, so you already have the indexes of the words, but they are in text form. All you need is to transform that string into an array of 300 items (the word indexes of your message).
For further text machine-learning projects, I suggest using raw text data as the source and transforming it into word indexes with tf.contrib.learn.preprocessing.VocabularyProcessor.
3. Transform the labels (categories) into one-hot vectors.
At the end of these steps you have pairs of (word_indexes_as_array, label_as_one_hot).
You can then use these data as input for training (a small sketch of steps 2 and 3 follows).
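A minimal sketch of steps 2 and 3, assuming the email content is a space-separated string of word indexes as in the question, and assuming a fixed sequence length of 300 and two classes:
import numpy as np

NUM_CLASSES = 2        # spam / not-spam
SEQUENCE_LENGTH = 300  # assumed fixed length of the word-index arrays

def to_training_pair(file_content, label):
    # Step 2: "0 0 0 ... 8155" -> integer array, padded/truncated to a fixed length.
    indexes = np.array(file_content.split(), dtype=np.int64)[:SEQUENCE_LENGTH]
    padded = np.zeros(SEQUENCE_LENGTH, dtype=np.int64)
    padded[:len(indexes)] = indexes

    # Step 3: label -> one-hot vector.
    one_hot = np.zeros(NUM_CLASSES, dtype=np.float32)
    one_hot[label] = 1.0
    return padded, one_hot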
Naturally, you would divide these pairs into two sets, treating the first 80% of the data as the training set and the remaining 20% as the test set (do not focus on the 80/20 numbers, it is just an example).
You may look at the text classification with Keras examples. They are rather straightforward and may be helpful for you, as they start from the data input step.
Also, please look at the load_data_and_labels() method in the data-input step example. It is a very similar case to yours (positive/negative).

libsvm fails for very small numbers with error code 'Wrong input on line'

I've searched the internet for hints on this one, but without success.
I am using libSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) and I've encountered this while training the SVM with rbf kernel.
If a feature contains a very small number, like feature 15 in the following line
0 1:4.25606e+07 2:4.2179e+07 3:5.1059e+07 4:7.72388e+06 5:7.72037e+06 6:8.87669e+06 7:4.40263e-06 8:0.0282494 9:819 10:2.34513e-05 11:21.5385 12:95.8974 13:179.117 14:9 15:6.91877e-310
libSVM fails to read the file with the error Wrong input at line <lineID>.
After some testing, I was able to confirm that changing such a small number to 0 appears to fix the error, i.e. this line is read correctly:
0 1:4.17077e+07 2:4.12838e+07 3:5.04597e+07 4:7.76011e+06 5:7.74881e+06 6:8.91813e+06 7:3.97472e-06 8:0.0284308 9:936 10:2.46506e-05 11:22.8714 12:100.969 13:186.641 14:17 15:0
Can anybody help me figure out why this is happening? My file contains a lot of numbers around that order of magnitude.
I am calling the SVM from the terminal on Ubuntu like this:
<path to>/svm-train -s 0 -t 2 -g 0.001 -c 100000 <path to features file> <path for output model file>
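For reference, here is a minimal sketch of the workaround described above, clamping near-zero features to 0 before writing the libSVM file; the threshold value and the helper function are assumptions, not part of the original setup:
# Hypothetical preprocessing: zero out features too small for libSVM to parse reliably.
THRESHOLD = 1e-300  # assumed cutoff; values such as 6.91877e-310 fall below it

def write_libsvm_line(label, features, out_file):
    # features: list of floats; libSVM feature indexes are 1-based
    parts = [str(label)]
    for i, value in enumerate(features, start=1):
        if abs(value) < THRESHOLD:
            value = 0.0
        parts.append(f"{i}:{value:g}")
    out_file.write(" ".join(parts) + "\n")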

Reading sequential data from TFRecords files within the TensorFlow graph?

I'm working with video data, but I believe this question should apply to any sequential data. I want to pass my RNN 10 sequential examples (video frames) from a TFRecords file. When I first start reading the file, I need to grab 10 examples and use them to create a sequence example, which is then pushed onto the queue for the RNN to take when it's ready. However, once I have those 10 frames, the next time I read from the TFRecords file I only need to take 1 example and shift the other 9 over. And when I hit the end of the first TFRecords file, I need to restart the process on the second TFRecords file.
It's my understanding that the cond op will process the ops required under each condition even if that condition is not the one that is to be used. This would be a problem when using a condition to check whether to read 10 examples or only 1. Is there any way to resolve this problem and still get the desired result outlined above?
You can use the Dataset.window() transformation, added in TensorFlow 1.12, to do this:
filenames = tf.data.Dataset.list_files(...)

# Define a function that will be applied to each filename, and return the sequences in that
# file.
def get_examples_from_file(filename):
    # Read and parse the examples from the file using the appropriate logic.
    examples = tf.data.TFRecordDataset(filename).map(...)

    # Selects a sliding window of 10 examples, shifting along 1 example at a time.
    sequences = examples.window(size=10, shift=1, drop_remainder=True)

    # Each element of `sequences` is a nested dataset containing 10 consecutive examples.
    # Use `Dataset.batch()` and get the resulting tensor to convert it to a tensor value
    # (or values, if there are multiple features in an example).
    return sequences.map(
        lambda d: tf.data.experimental.get_single_element(d.batch(10)))

# Alternatively, you can use `filenames.interleave()` to mix together sequences from
# different files.
sequences = filenames.flat_map(get_examples_from_file)
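To see what the sliding window produces, here is a small self-contained sketch on a toy dataset (not part of the original answer; it assumes TensorFlow 2.x eager execution):
import tensorflow as tf

# Toy stand-in for the parsed examples: the integers 0..6.
examples = tf.data.Dataset.range(7)
sequences = examples.window(size=3, shift=1, drop_remainder=True)
sequences = sequences.map(lambda d: tf.data.experimental.get_single_element(d.batch(3)))

for seq in sequences:
    print(seq.numpy())  # [0 1 2], [1 2 3], [2 3 4], [3 4 5], [4 5 6]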

Libsvm model file: support vector labels are different from class labels

I have a training file for a two-class problem, and the labels are +1 and -1. After I run svm-train, the generated model file has real-valued labels between -2 and +2.
Portion of Training file:
-1 1:-0.0902235 2:0.642459 3:-0.996008 4:-0.990354 5:-0.0415552 6:-0.559606 7:0.481824
-1 1:-0.53561 2:-0.739702 3:0.0719997 4:-0.0874957 5:-0.804345 6:-0.492728 7:-0.192003
1 1:-0.0607377 2:0.621136 3:-0.998019 4:-0.997149 5:0.0494642 6:-0.402682 7:0.128106
Corresponding support vectors in the model file:
-2 1:-0.0902235 2:0.642459 3:-0.996008 4:-0.990354 5:-0.0415552 6:-0.559606 7:0.481824
-0.962578101983108 1:-0.53561 2:-0.739702 3:0.0719997 4:-0.0874957 5:-0.804345 6:-0.492728 7:-0.19200
2 1:-0.0607377 2:0.621136 3:-0.998019 4:-0.997149 5:0.0494642 6:-0.402682 7:0.128106
They are in libsvm format.
I have not been able to figure out why this label alteration happens. Are the support vector labels important for tests?
I just got the answer. The label alteration is explained in the libSVM FAQ:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f402
In short, the leading value on each support-vector line of the model file is not a class label but the coefficient alpha_i * y_i of that support vector, which is why it can be any real value (bounded in magnitude by the parameter C).