Tensorflow data import - tensorflow

I just started to use tensorflow, but I failed to import the data properly to use with the DNNClassifier. I actually have two files in the hdf5 format, that I import with pandas. The feature vector has dimension 100 and there are 5 classes where the features can belong to. If I use for example the following code:
import pandas as pd
import numpy as np
import tensorflow as tf
#Data
train = pd.read_hdf("train.h5", "train")
test = pd.read_hdf("test.h5", "test")
Y=train.iloc[0:,0]
X=train.iloc[0:,1:]
X_t=test.iloc[0:,0:]
Y=np.array(Y.values).astype('int')
X=np.array(X.values).astype('double')
X_t=np.array(X_t.values).astype('double')
#Train
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=100)]
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
hidden_units=[10, 20],
n_classes=5,
model_dir="/tmp/model")
# Define the training inputs
def get_train_inputs():
x = tf.constant(X)
y = tf.constant(Y)
return x, y
#fit
classifier.fit(input_fn=get_train_inputs, steps=1000)
predictions = list(classifier.predict(input_fn=get_train_inputs))
print(predictions)
I get the error: InvalidArgumentError (see above for traceback): Shape in shape_and_slice spec [100,10] does not match the shape stored in checkpoint: [1,10]
[[Node: save/RestoreV2_2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_2/tensor_names, save/RestoreV2_2/shape_and_slices)]]
I don't get why this happens? How should I transform my data to apply to this classifier?

My Solution:-
Change your model_dir="/tmp/model" to
model_dir="/tmp/model-1
Note:- It need not to be model-1, replace it with any valid names like
model_dir="/tmp/model-a ..something like that..

Related

Not getting global steps in TensorFlow

When I run following code, I don't get global steps until 5000.
First,
When I run this code, output is what I expect.
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pandas as pd
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']
train_path = tf.keras.utils.get_file(
"iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
"iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")
train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
train_y = train.pop('Species')
test_y = test.pop('Species')
def input_fn(features, labels, training=True, batch_size=256):
# Convert the inputs to a Dataset.
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
# Shuffle and repeat if you are in training mode.
if training:
dataset = dataset.shuffle(1000).repeat()
return dataset.batch(batch_size)
my_feature_columns = []
for key in train.keys():
my_feature_columns.append(tf.feature_column.numeric_column(key=key))
#print(my_feature_columns)
But,
After adding classifier code, I don't get global steps.
This is addition of code after writing code above:
classifier = tf.estimator.DNNClassifier(
feature_columns=my_feature_columns,
# Two hidden layers of 30 and 10 nodes respectively.
hidden_units=[30, 10],
# The model must choose between 3 classes.
n_classes=3)
classifier.train(
input_fn=lambda: input_fn(train, train_y, training=True),
steps=5000)
Instead of getting global steps until 5000, I get this output
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpfl0bqu11
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 1061 vs previous value: 1061. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize
I write the same code that my teacher demo but he could run this code. I can't.
How could I get global steps?

Convert TensorFlow data to be used by ONNX inference

I'm trying to convert a LSTM model from TensorFlow into ONNX. The code for generating data for TensorFlow model training is as below:
def make_dataset(self, data):
data = np.array(data, dtype=np.float32)
ds = tf.keras.utils.timeseries_dataset_from_array(
data=data,
targets=None,
sequence_length=self.total_window_size,
sequence_stride=1,
shuffle=True,
batch_size=32, )
ds = ds.map(self.split_window)
The model training code is actually from the official tutorial. Then after conversion to ONNX, I try to perform prediction as follows:
import onnx
import onnxruntime as rt
from tf_lstm import WindowGenerator
import tensorflow as tf
wide_window = WindowGenerator(
input_width=24, label_width=24, shift=1,
label_columns=['T (degC)'])
model = onnx.load_model('models/onnx/tf-lstm-weather.onnx')
print(model)
sess = rt.InferenceSession('models/onnx/tf-lstm-weather.onnx')
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
pred = sess.run([label_name], {input_name: wide_window.test})[0]
But it throws this error:
RuntimeError: Input must be a list of dictionaries or a single numpy array for input 'lstm_input'.
I tried to convert wide_window.test into numpy array and use it instead as follows:
test_data = []
test_label = []
for x, y in wide_window.test:
test_data.append(x.numpy())
test_label.append(y.numpy())
test_data2 = np.array(test_data, dtype=np.float)
pred = sess.run([label_name], {input_name: test_data2})[0]
Then it gives this error:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (219,) + inhomogeneous part.
Any idea?
That's a numpy error. Each row you add to the input array has to have the same number of elements.
setting an array element with a sequence requested array has an inhomogeneous shape after 1 dimensions The detected shape was (2,)+inhomogeneous part

Getting TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] while doing multi class classification

from sklearn.naive_bayes import CategoricalNB
from sklearn.datasets import make_multilabel_classification
X, y = make_multilabel_classification(sparse = True, n_labels = 15,
return_indicator = 'sparse', allow_unlabeled = False)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
I tried using X.todense() but the error is still raised.
X_train = X_train.todense()
X_test = X_test.todense()
Training on the dataset
from skmultilearn.adapt import MLkNN
from sklearn.metrics import accuracy_score
classifier = MLkNN(k=20)
classifier.fit(X_train, y_train)
predicting the output of trained dataset.
y_pred = classifier.predict(X_test)
accuracy_score(y_test,y_pred)
np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)
You are trying to get the length from a matrix, which is ambigious:
len(y_pred)
Your matrix y_pred has the dimension (25,5), as seen with y_pred.shape.
So instead of len(y_pred), you could use y_pred.shape[0], which would return 25.
But then you will encounter a problem when you are using y_pred.reshape(y_pred.shape[0],1)
ValueError: cannot reshape array of size 125 into shape (25, 1)
(previously: y_pred.reshape(len(y_pred),1))
This error makes sense, because you are trying to reshape a matrix with 125 values into a matrix with only 25 values. You need to rethink your code here.

what's the meaning of 'input_length'?

the data have 4 timestamps,but the embedding's input_length=3,so what's the meaning of input_length?
from tensorflow import keras
import numpy as np
data = np.array([[0,0,0,0]])
emb = keras.layers.Embedding(input_dim=2, output_dim=3, input_length=3)
emb(data)
As per the official documentation here,
input_length: Length of input sequences, when it is constant. This
argument is required if you are going to connect Flatten then Dense
layers upstream (without it, the shape of the dense outputs cannot be
computed).
from tensorflow import keras
import numpy as np
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=2, output_dim=3, input_length=4))
# the model will take as input an integer matrix of size (batch, input_length).
input_array = np.array([[0,0,0,0]])
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
print(output_array)
Above works fine, but if you change input_length to 3, then you will get below error:
ValueError: Error when checking input: expected embedding_input to
have shape (3,) but got array with shape (4,)

Converting tokens to word vectors effectively with TensorFlow Transform

I would like to use TensorFlow Transform to convert tokens to word vectors during my training, validation and inference phase.
I followed this StackOverflow post and implemented the initial conversion from tokens to vectors. The conversion works as expected and I obtain vectors of EMB_DIM for each token.
import numpy as np
import tensorflow as tf
tf.reset_default_graph()
EMB_DIM = 10
def load_pretrained_glove():
tokens = ["a", "cat", "plays", "piano"]
return tokens, np.random.rand(len(tokens), EMB_DIM)
# sample string
string_tensor = tf.constant(["plays", "piano", "unknown_token", "another_unknown_token"])
pretrained_vocab, pretrained_embs = load_pretrained_glove()
vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
mapping = tf.constant(pretrained_vocab),
default_value = len(pretrained_vocab))
string_tensor = vocab_lookup.lookup(string_tensor)
# define the word embedding
pretrained_embs = tf.get_variable(
name="embs_pretrained",
initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
shape=pretrained_embs.shape,
trainable=False)
unk_embedding = tf.get_variable(
name="unk_embedding",
shape=[1, EMB_DIM],
initializer=tf.random_uniform_initializer(-0.04, 0.04),
trainable=False)
embeddings = tf.cast(tf.concat([pretrained_embs, unk_embedding], axis=0), tf.float32)
word_vectors = tf.nn.embedding_lookup(embeddings, string_tensor)
with tf.Session() as sess:
tf.tables_initializer().run()
tf.global_variables_initializer().run()
print(sess.run(word_vectors))
When I refactor the code to run as a TFX Transform Graph, I am getting the error the ConversionError below.
import pprint
import tempfile
import numpy as np
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as beam_impl
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
tf.reset_default_graph()
EMB_DIM = 10
def load_pretrained_glove():
tokens = ["a", "cat", "plays", "piano"]
return tokens, np.random.rand(len(tokens), EMB_DIM)
def embed_tensor(string_tensor, trainable=False):
"""
Convert List of strings into list of indices then into EMB_DIM vectors
"""
pretrained_vocab, pretrained_embs = load_pretrained_glove()
vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
mapping=tf.constant(pretrained_vocab),
default_value=len(pretrained_vocab))
string_tensor = vocab_lookup.lookup(string_tensor)
pretrained_embs = tf.get_variable(
name="embs_pretrained",
initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
shape=pretrained_embs.shape,
trainable=trainable)
unk_embedding = tf.get_variable(
name="unk_embedding",
shape=[1, EMB_DIM],
initializer=tf.random_uniform_initializer(-0.04, 0.04),
trainable=False)
embeddings = tf.cast(tf.concat([pretrained_embs, unk_embedding], axis=0), tf.float32)
return tf.nn.embedding_lookup(embeddings, string_tensor)
def preprocessing_fn(inputs):
input_string = tf.string_split(inputs['sentence'], delimiter=" ")
return {'word_vectors': tft.apply_function(embed_tensor, input_string)}
raw_data = [{'sentence': 'This is a sample sentence'},]
raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema({
'sentence': dataset_schema.ColumnSchema(
tf.string, [], dataset_schema.FixedColumnRepresentation())
}))
with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
transformed_dataset, transform_fn = ( # pylint: disable=unused-variable
(raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset # pylint: disable=unused-variable
pprint.pprint(transformed_data)
Error Message
TypeError: Failed to convert object of type <class
'tensorflow.python.framework.sparse_tensor.SparseTensor'> to Tensor.
Contents: SparseTensor(indices=Tensor("StringSplit:0", shape=(?, 2),
dtype=int64), values=Tensor("hash_table_Lookup:0", shape=(?,),
dtype=int64), dense_shape=Tensor("StringSplit:2", shape=(2,),
dtype=int64)). Consider casting elements to a supported type.
Questions
Why would the TF Transform step require an additional conversion/casting?
Is this approach of converting tokens to word vectors feasible? The word vectors might be multiple gigabytes in memory. How is Apache Beam handling the vectors? If Beam in a distributed setup, would it require N x vector memory with N the number of workers?
The SparseTensor related error is because you are calling string_split which returns a SparseTensor. Your test code does not call string_split so that's why it only happens with your Transform code.
Regarding memory, you are correct, the embedding matrix must be loaded into each worker.
One cannot put a SparseTensor into the dictionary, returned by the TFX Transform, in your case by the function "preprocessing_fn". The reason is that SparseTensor is not a Tensor, it is actually a small subgraph.
To fix your code, you can convert your SparseTensor into a Tensor. There is a number of ways to do so, I would recommend to use tf.serialize_sparse for regular SparseTensor and tf.serialize_many_sparse for batched one.
To consume such serialized Tensor in Trainer, you could call the function tf. deserialize_many_sparse.