When should I use GoldParse when training NER? - spacy

When looking at example code for training NER with spaCy, I see GoldParse used sometimes and sometimes not.
TRAINING_DATA = [
    ("How to preorder the iPhone X", {'entities': [(20, 28, 'GADGET')]}),
    # lots of other examples
]
(Then common stuff, adding labels to the NER pipe, disabling other pipes, etc.)
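By "common stuff" I mean roughly the usual spaCy v2 boilerplate, something like this (a sketch, not my exact code; the blank English model is an assumption):

import random
import spacy
from spacy.gold import GoldParse
from spacy.util import minibatch, compounding

nlp = spacy.blank('en')                      # or spacy.load(...) for an existing model
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
for _, annotations in TRAINING_DATA:
    for start, end, label in annotations['entities']:
        ner.add_label(label)

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):        # train only the NER component
    optimizer = nlp.begin_training()
    # ... the training loop (one of the two variants below) goes here ...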
Then I see two approaches:
for iteration in range(10):
    random.shuffle(TRAINING_DATA)
    losses = {}
    for text, annotations in TRAINING_DATA:
        doc = nlp.make_doc(text)
        entity_offsets = annotations["entities"]
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses)
    print('Losses with gold', losses)
OR
for iteration in range(10):
    random.shuffle(TRAINING_DATA)
    losses = {}
    batches = minibatch(TRAINING_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=0.5, sgd=optimizer, losses=losses)
    print('Losses without gold', losses)
What purpose (if any) is GoldParse serving in this example? The loss outputs are a bit different, but I don't feel like I'm really understanding what the difference is.

They should be identical underneath. If you comment out the shuffle, I would expect the losses to be identical.
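A minimal way to see this (a sketch, assuming spaCy v2.x, where nlp.update accepts either a GoldParse or the raw annotation dict and builds the GoldParse internally in the latter case):

from spacy.gold import GoldParse

text, annotations = TRAINING_DATA[0]

# Style 1: build the GoldParse explicitly
doc = nlp.make_doc(text)
gold = GoldParse(doc, entities=annotations['entities'])
nlp.update([doc], [gold], drop=0.0, sgd=optimizer, losses={})

# Style 2: pass the raw text and annotation dict and let spaCy build it
nlp.update([text], [annotations], drop=0.0, sgd=optimizer, losses={})

Starting from the same model state, either call should perform the same update.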


How to properly stack RNN layers?

I've been trying to implement a character-level language model in TensorFlow based on this tutorial.
I would like to extend the model by allowing multiple RNN layers to be stacked. So far I've come up with this:
class MyModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_type, rnn_units, num_layers, dropout):
        super().__init__(self)
        self.rnn_type = rnn_type.lower()
        self.num_layers = num_layers
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        if self.rnn_type == 'gru':
            rnn_layer = tf.keras.layers.GRU
        elif self.rnn_type == 'lstm':
            rnn_layer = tf.keras.layers.LSTM
        elif self.rnn_type == 'rnn':
            rnn_layer = tf.keras.layers.SimpleRNN
        else:
            raise ValueError(f'Unsupported RNN layer: {rnn_type}')
        setattr(self, self.rnn_type, rnn_layer(rnn_units, return_sequences=True, return_state=True, dropout=dropout))
        for i in range(1, num_layers):
            setattr(self, f'{self.rnn_type}_{i}', rnn_layer(rnn_units, return_sequences=True, return_state=True, dropout=dropout))
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = inputs
        x = self.embedding(x, training=training)
        rnn = self.get_layer(self.rnn_type)
        if states is None:
            states = rnn.get_initial_state(x)
        x, states = rnn(x, initial_state=states, training=training)
        for i in range(1, self.num_layers):
            layer = self.get_layer(f'{self.rnn_type}_{i}')
            x, states = layer(x, initial_state=states, training=training)
        x = self.dense(x, training=training)
        if return_state:
            return x, states
        else:
            return x
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_type='gru',
    rnn_units=512,
    num_layers=3,
    dropout=dropout)
When trained for 30 epochs on the dataset in the tutorial, this model generates some random gibberish. Now I don't know if I'm doing the stacking wrong or if the dataset is just too small.
There are multiple factors contributing to the bad predictions of your model:
The dataset is small
The model itself you are using is quite simple
The training time is very short
Predicting Shakespeare sonnets will produce random gibberish even if trained right
Try to train it for longer. This will ultimately lead to better results, although predicting correct speech based on text may be one of the hardest tasks in machine learning in general. For example, GPT-3, one of the models that solves this problem almost perfectly, consists of billions of parameters (see here).
EDIT: The reason your model performs worse than the one in the tutorial, even though you have more stacked RNN layers, may be that more layers need more training time. Simply increasing the number of layers will not necessarily improve prediction quality. As I said, try to increase the training time or play around with hyperparameters (learning rate, normalization layers, etc.).
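As an aside, a common way to express the stacking itself (just a sketch, not the tutorial's code; keeping the layers in a plain Python list and carrying one state per layer are my own choices) looks like this:

import tensorflow as tf

class StackedRNN(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units, num_layers, dropout):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # One GRU per level; Keras tracks layers stored in a list attribute.
        self.rnn_layers = [
            tf.keras.layers.GRU(rnn_units, return_sequences=True,
                                return_state=True, dropout=dropout)
            for _ in range(num_layers)
        ]
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = self.embedding(inputs, training=training)
        if states is None:
            states = [None] * len(self.rnn_layers)   # each layer builds zero states
        new_states = []
        for layer, state in zip(self.rnn_layers, states):
            x, state = layer(x, initial_state=state, training=training)
            new_states.append(state)
        x = self.dense(x, training=training)
        return (x, new_states) if return_state else x

This also avoids feeding the previous layer's final state in as the next layer's initial state, which is what the setattr/get_layer version in the question does.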

Debugging gradient computation in TensorFlow

I am using a frozen graph to extract features and then want to train a predictor on top to perform some inference.
Unfortunately, the gradients cannot be computed and my process is killed with RAM demands of more than 100 GB. I have checked several things:
1) Reducing input image sizes or batch sizes is not the problem.
2) I can use intermediate layers of my frozen network (a variant of ResNet) and train my small inference net. But using the later layers leads to huge memory demands (the process is killed). This confuses me because I keep the network static and there are no trainable variables in the ResNet, so I do not think the gradient should depend on which layer of the frozen net I extract.
This behavior is unexpected to me. What are ways to debug what causes this huge memory demand when calling sess.run(train_op, feed_dict)?
More information:
tf.reset_default_graph()
graph1 = tf.Graph()
graph1.__enter__()
input_tensor = tf.placeholder('float', shape=input_shape, name='image')
# Loading frozen graph and mapping inputs
f = gfile.FastGFile(pb_file, 'rb')
graph_def = tf.GraphDef()
# Parses a serialized binary message into the current message.
graph_def.ParseFromString(f.read())
f.close()
_ = tf.import_graph_def(graph_def , input_map={'input0': input_tensor})
# Get feature layer
output_feature = 'import/layer3.00/add:0'
feature_tensor = graph1.get_tensor_by_name(output_feature)
output = tf.contrib.layers.fully_connected(feature_tensor, 100, scope='readout_network')
def batch_data(batches):
    # e.g. batches = [[1, 2], [3, 4]]
    for batch in batches:
        images = [stimuli.stimuli[n] for n in batch]
        xs = []
        ys = []
        for i, n in enumerate(batch):
            xs.append(this_cache['xs'][n])
            ys.append(this_cache['ys'][n])
        yield prepare_input(images, xs, ys)
loss = ...
params = slim.get_variables_to_restore(include=['readout_network'])
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss, var_list=params)
for feed_dict in batch_data(batches):
    sess.run(train_op, feed_dict=feed_dict)
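One diagnostic that might help narrow this down (a sketch using the names from the snippets above, not something I have confirmed): explicitly cut the backward path at the frozen features with tf.stop_gradient, and list exactly which variables the optimizer differentiates.

# Cut the backward path at the frozen feature layer. If the memory blow-up
# disappears, something in the imported graph was still part of the backward pass.
feature_tensor = tf.stop_gradient(graph1.get_tensor_by_name(output_feature))
output = tf.contrib.layers.fully_connected(feature_tensor, 100, scope='readout_network')

# List what actually gets differentiated (and with what shapes).
grads = tf.gradients(loss, tf.trainable_variables())
for var, grad in zip(tf.trainable_variables(), grads):
    print(var.op.name, 'no gradient' if grad is None else grad.get_shape())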

How to define a TensorFlow graph with more than one input of different dimensions, and combine the differently sized layers into one layer?

After setting each layer's name, my code below runs well.
=================== old ===============
How do I define a TensorFlow graph with more than one input, each of a different dimension?
For example, I have inputs (X1, X2, X3) with different dimensions (d1, d2, d3).
How do I feed each input into its own differently sized hidden-1 layer, then combine the three hidden-1 layers into a hidden-2 layer, followed by an output layer?
Thanks for all!
I tried some code like this:
def model_fn(features, labels, mode, params):
    input_layers = [tf.feature_column.input_layer(features=features, feature_columns=params["feature_columns"][i])
                    for i, fi in enumerate(FEA_DIM)]
    hidden1 = [tf.layers.dense(input_layers[i], H1_DIM[i], tf.nn.selu) for i, _ in enumerate(FEA_DIM)]
    hidden1_c = tf.concat(hidden1, -1, "concat")
    hidden2 = tf.layers.dense(inputs=hidden1_c, units=32, activation=tf.nn.selu)
    predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=tf.nn.softmax)
    labels = tf.contrib.layers.one_hot_encoding(labels, NCLASS)
    loss = tf.losses.sigmoid_cross_entropy(labels, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate=1)
    train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
But it doesn't work: the accuracy does not change during training.
The TensorBoard model graph looks like this (the dense_xx nodes are the hidden1 tensors):
The biggest problem lies in these lines:
predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=tf.nn.softmax)
labels = tf.contrib.layers.one_hot_encoding(labels, NCLASS)
loss = tf.losses.sigmoid_cross_entropy(labels, predictions)
First, since you have multiple classes, you should use softmax_cross_entropy, or better, sparse_softmax_cross_entropy to dispense with the one-hot encoding.
Second, the input to softmax_cross_entropy or sigmoid_cross_entropy should be unnormalized scores, so activation=tf.nn.softmax is wrong. All deep learning frameworks combine the softmax/sigmoid with cross entropy in one step because the combined operation has better performance and numeric stability, so you should not calculate the softmax yourself first.
Third, your learning rate is too high. Even 0.0025 is, under most circumstances, still too high. You should start with 0.001 and then tune it up and down from there.
Finally, I don't understand why you apply dense layers first and then concatenate. Why not just concatenate all the features and then transform them together?
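Concretely, the loss part could look something like this (a sketch against the TF 1.x API used in the question; I am assuming labels holds integer class ids, as the one_hot_encoding call implies):

logits = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=None)       # raw scores, no softmax
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)   # no one-hot needed
predictions = tf.argmax(logits, axis=-1)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())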
As for how to concat the layers, here is my complete running code as an example:
input_layers = [tf.feature_column.input_layer(features=features, feature_columns=params["feature_columns"][i])
                for i, fi in enumerate(FEA_DIM)]
hidden1 = [tf.layers.dense(input_layers[i], H1_DIM[i], tf.nn.selu, name="h_1_%s" % i,
                           kernel_regularizer=tf.contrib.layers.l1_l2_regularizer(scale_l1=1e-3, scale_l2=1e-2),
                           kernel_initializer=tf.truncated_normal_initializer(stddev=1.0 / math.sqrt(H1_DIM[i] + FEA_DIM[i])))
           for i, _ in enumerate(FEA_DIM)]
hidden1_c = tf.concat(hidden1, -1, "concat")
hidden2 = tf.layers.dense(inputs=hidden1_c, units=128, activation=tf.nn.selu, name="h_2",
                          kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=1e-2),
                          kernel_initializer=tf.truncated_normal_initializer(stddev=1.0 / math.sqrt(128 + H1_DIM[i])))
predictions = tf.layers.dense(inputs=hidden2, units=NCLASS, activation=None,
                              kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=1e-2),
                              kernel_initializer=tf.truncated_normal_initializer(stddev=0.1), name="output")

Online oversampling in Tensorflow input pipeline

I have an input pipeline similar to the one in the Convolutional Neural Network tutorial. My dataset is imbalanced and I want to use minority oversampling to try to deal with this. Ideally, I want to do this "online", i.e. I don't want to duplicate data samples on disk.
Essentially, what I want to do is duplicate individual examples (with some probability) based on the label. I have been reading a bit about control flow in TensorFlow, and it seems tf.cond(pred, fn1, fn2) is the way to go. I am just struggling to find the right parameterisation, since fn1 and fn2 would need to output lists of tensors, and the lists would need to have the same size.
This is roughly what I have so far:
image = image_preprocessing(image_buffer, bbox, False, thread_id)
pred = tf.reshape(tf.equal(label, tf.convert_to_tensor([2])), [])
r_image = tf.cond(pred, lambda: [tf.identity(image), tf.identity(image)], lambda: [tf.identity(image),])
r_label = tf.cond(pred, lambda: [tf.identity(label), tf.identity(label)], lambda: [tf.identity(label),])
However, this raises the following error:
ValueError: fn1 and fn2 must return the same number of results.
Any ideas?
P.S.: this is my first Stack Overflow question. Any feedback on my question is appreciated.
After doing a bit more research, I found a solution for what I wanted to do. What I forgot to mention is that the code mentioned in my question is followed by a batch method, such as batch() or batch_join().
These functions take an argument that allows you to enqueue tensors containing a whole batch of examples rather than a single example. The argument is enqueue_many and should be set to True.
The following piece of code does the trick for me:
for thread_id in range(num_preprocess_threads):
    # Parse a serialized Example proto to extract the image and metadata.
    image_buffer, label_index = parse_example_proto(example_serialized)
    image = image_preprocessing(image_buffer, bbox, False, thread_id)
    # Convert the 3D tensor of shape [height, width, channels] to
    # a 4D tensor of shape [batch_size, height, width, channels]
    image = tf.expand_dims(image, 0)
    # Define the boolean predicate to be true when the class label is 1
    pred = tf.equal(label_index, tf.convert_to_tensor([1]))
    pred = tf.reshape(pred, [])
    oversample_factor = 2
    r_image = tf.cond(pred, lambda: tf.concat(0, [image] * oversample_factor), lambda: image)
    r_label = tf.cond(pred, lambda: tf.concat(0, [label_index] * oversample_factor), lambda: label_index)
    images_and_labels.append([r_image, r_label])

images, label_batch = tf.train.shuffle_batch_join(
    images_and_labels,
    batch_size=batch_size,
    capacity=2 * num_preprocess_threads * batch_size,
    min_after_dequeue=1 * num_preprocess_threads * batch_size,
    enqueue_many=True)

Gradients are always zero

I have written an algorithm using the TensorFlow framework and am facing the problem that tf.train.Optimizer.compute_gradients(loss) returns zeros for all weights. Another problem is that if I set the batch size larger than about 5, tf.histogram_summary for the weights throws an error that some of the values are NaN.
I cannot provide a reproducible example here, because my code is quite bulky and I am not good enough at TF to make it shorter. I will paste some fragments instead.
Main loop:
images_ph = tf.placeholder(tf.float32, shape=some_shape)
labels_ph = tf.placeholder(tf.float32, shape=some_shape)
output = inference(BATCH_SIZE, images_ph)
loss = loss(labels_ph, output)
train_op = train(loss, global_step)
session = tf.Session()
session.run(tf.initialize_all_variables())
for i in xrange(MAX_STEPS):
    images, labels = train_dataset.get_batch(BATCH_SIZE, yolo.INPUT_SIZE, yolo.OUTPUT_SIZE)
    session.run([loss, train_op], feed_dict={images_ph: images, labels_ph: labels})
train_op (here is where the problem occurs):
def train(total_loss, global_step):
    opt = tf.train.AdamOptimizer()
    grads = opt.compute_gradients(total_loss)
    # Here the gradients are zeros
    for grad, var in grads:
        if grad is not None:
            tf.histogram_summary("gradients/" + var.op.name, grad)
    return opt.apply_gradients(grads, global_step=global_step)
Loss (the loss is calculated correctly, since it changes from sample to sample):
def loss(labels, output):
    return tf.reduce_mean(tf.squared_difference(labels, output))
Inference: a stack of convolution layers with ReLU, followed by 3 fully connected layers with a sigmoid activation in the last layer. All weights are initialized from truncated normal random variables. All labels are fixed-length vectors of real numbers in the range [0, 1].
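For reference, a minimal way to inspect the gradient magnitudes directly would be something like this (a sketch; the names follow the snippets above, and grads is assumed to be exposed from train()):

import numpy as np

# Fetch the gradient tensors for one batch and print their magnitudes.
grad_tensors = [g for g, v in grads if g is not None]
grad_vars = [v for g, v in grads if g is not None]
grad_values = session.run(grad_tensors, feed_dict={images_ph: images, labels_ph: labels})
for var, value in zip(grad_vars, grad_values):
    print(var.op.name, 'max |grad| =', np.abs(value).max())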
Thanks in advance for any help! If you have a hypothesis about my problem, please share it and I will try it. I can also share the whole code if you like.