I'm trying to train a new spaCy model with custom entities and I'm running into issues when running it.
I only have one pipe (ner) and I am adding all my entity types as labels to it.
I figured out that adding a lot of distinct labels (~219 labels) to the ner pipe makes it crash on the first nlp.update call (Process finished with exit code -1073740791 (0xC0000409)).
I'm running spaCy 2.0.12 on a Windows 10 laptop with 16 GB of RAM and Python 3.7. Any idea why it crashes on the first nlp.update call once I add more labels, and how I can prevent that? With only ~100 labels it works fine.
Here's my code:
def __train_model(self, spacy_model, entity_types):
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    for entity_type in list(entity_types):
        ner.add_label(entity_type)

    optimizer = nlp.begin_training()

    # Start training
    for i in range(20):
        losses = {}
        index = 0
        random.shuffle(spacy_model)
        for statement, entities in spacy_model:
            nlp.update([statement], [entities], sgd=optimizer, losses=losses, drop=0.5)

    return nlp
spacy_model:
[
    ('Simply put I see no other conclusion than Comcast has actively blocked our Smart TVs from accessing Netflix on purpose.', {'entities': [(42, 49, 'ORGANIZATION:SERVICE_PROVIDER:COMMUNICATIONS'), (75, 80, 'DEVICE:COMMUNICATIONS:TV:FEATURE'), (100, 107, 'ORGANIZATION:SERVICE_PROVIDER:COMMUNICATIONS')]}),
    ...
]
EDIT: I tried on an Ubuntu 18.04 VM with 24 GB of RAM and 2 cores and encountered the following error:
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
EDIT2: Fixed here: https://github.com/explosion/spaCy/issues/2800#issuecomment-425057478
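For reference, a rough sketch of how the update loop above is usually batched in spaCy 2.x with spacy.util.minibatch and compounding. This is only an illustration against the same data format, not the fix for the crash itself (that landed upstream per the issue linked above):

import random
import spacy
from spacy.util import minibatch, compounding

def train_ner(train_data, entity_types, n_iter=20):
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    for label in entity_types:
        ner.add_label(label)

    optimizer = nlp.begin_training()
    for _ in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        # Grow the batch size gradually from 4 to 32 examples
        for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.5, losses=losses)
    return nlp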
Related
I am using preemptible v2-8 Google Cloud TPUs to perform large-scale hyperparameter optimization. I created the nodes using GKE with TensorFlow 2.3 (the latest version available for Cloud TPUs). Unfortunately, I keep encountering a memory leak on the TPU nodes during the search. This memory leak ultimately seems to cause an "Unavailable: Socket Closed" error (or sometimes an OOM error), after which the TPU becomes unable to perform any additional training or evaluation even after re-deploying the code. The problem does not occur when I test my code on either a CPU or a GPU.
This problem only occurs on the TPU worker node, not on the controller CPU. (At one point, I had been encountering another memory leak on the CPU due to a buildup of old models and unnecessary operations on the computation graph.) Methods such as tf.keras.backend.clear_session() and del model resolved the memory leak on the CPU, but it persists on the TPU. A graph of the TPU runtime memory usage shows steadily increasing usage (the decrease in memory at the end appears to occur after the TPU disconnects, because GKE deletes it automatically).
Ultimately, as the used memory increases on the TPU, I get the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 9 root error(s) found.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
         [[{{node cluster_train_function/_execute_4_0}}]]
0 successful operations.
0 derived errors ignored.
Occasionally, I instead get an "Unavailable: Socket Closed" error or an "Unable to destroy remote tensor handles" error.
This error typically only occurs after training several networks. I tried multiple methods suggested by other posts to fix it, such as typecasting my data to float32, not caching my dataset into memory, using a smaller mini-batch size to decrease memory consumption, and using "from_logits=True" in my cost function. I even tried using multiprocessing to perform the network training so memory would be cleared after each network evaluation, but for some reason the Cloud TPU fails to execute any of the for loops in my code or in the training code (a problem I did not have with either a GPU or a CPU, cloud or otherwise). Larger networks seem to cause the problem to occur much more quickly than smaller ones, which suggests to me that old, unused models are still kept in memory on the TPU. Is there any way to clear the memory on the TPU or reset its state to stop this memory leak?
Here is an MVE I wrote to duplicate the problem:
import os
import gc
import sys
import random
import numpy as np
import tensorflow as tf
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

h = 128
w = 128
channels = 1
mini_batch_size = 256
epochs = 15
using_tpu = True

if using_tpu:
    ## Get tpu name from arguments
    tpu_name = sys.argv[1]
    tpu_name = tpu_name.replace('--tpu=', '')

    ## Initialize TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.TPUStrategy(tpu)


def create_network():
    strategy = tf.distribute.TPUStrategy(tpu)
    with strategy.scope():
        ## Create random data
        x_train = np.random.randn(1024, 128, 128, 1).astype('float32')  # astype necessary to help prevent Connect to Socket Error
        y_train = np.random.randn(1024, 50).astype('float32')
        x_test = np.random.randn(256, 128, 128, 1).astype('float32')
        y_test = np.random.randn(256, 50).astype('float32')

        model = Sequential()
        model.add(InputLayer((h, w, channels)))
        layers = 5
        ks = [np.random.choice([3, 5, 7]) for l in range(layers)]
        filters = [np.random.choice([64, 128, 256]) for l in range(layers)]
        for l in range(layers):
            model.add(
                Conv2D(kernel_size=(ks[l], ks[l]), padding='same',
                       filters=filters[l], name='conv' + str(l), activation='relu'))
        model.add(Flatten())

        # Softmax output layer
        model.add(Dense(50))  # Don't need softmax activation because from_logits performs that operation automatically

        lr = 0.001
        opt = Adam(learning_rate=lr, decay=1e-6)
        model.compile(optimizer=opt, loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

        model.fit(x_train, y_train, epochs=epochs, batch_size=mini_batch_size, shuffle=True, verbose=1)

        ##### memory leak also occurs with dataset API:
        '''
        train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(mini_batch_size,
                                                                                     drop_remainder=True)
        model.fit(train_dataset, epochs=epochs, verbose=1, shuffle=shuffle,
                  steps_per_epoch=len(x_train) // mini_batch_size)
        '''
        #######

        y_pred = model(x_test)

    ## Attempt to clear memory
    print(gc.collect())
    del model
    tf.keras.backend.clear_session()


while True:
    create_network()
Thank you so much! Please let me know if I should include any other information.
A few things:
Your error message:
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
indicates an HBM (high-bandwidth memory) OOM rather than a host RAM OOM. The TPU has its own memory on its chips, and in this case you've exhausted that memory. If it were a host RAM OOM, you would likely see the Socket Closed error instead, which you saw as well.
That being said, what are your options? I suggest you go with the tf.data approach but with a few modifications:
def get_dataset(is_training: bool):
    def generate_data(_):
        return tf.random.normal([128, 128, 1], dtype=tf.bfloat16)

    dataset = tf.data.Dataset.range(1)
    dataset = dataset.repeat()
    dataset = dataset.map(generate_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.repeat().batch(mini_batch_size, drop_remainder=is_training)
    return dataset

train_dataset = get_dataset(is_training=True)
eval_dataset = get_dataset(is_training=False)
In this example we use bfloat16, which reduces the memory footprint on HBM, but you may need to further reduce your mini-batch size from 1024 to 512. Alternatively, you can go up from a v2-8 to a v3-8, which has 2x the HBM. I'm not sure if the numpy-based method contributes to the OOM/Socket Closed errors you see, but I don't think this approach should run into that. Of course you'll eventually use real data, and in that case you should use tf.data for optimal performance. More info here.
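As a related, hedged sketch: if you also want bfloat16 used for the model's activations (a separate knob from the tf.data change above), Keras mixed precision can do that. This assumes the TF 2.4+ API; in TF 2.3 the equivalent lives under tf.keras.mixed_precision.experimental.

import tensorflow as tf

# Compute in bfloat16 on the TPU while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy('mixed_bfloat16')

# Any layers built after this point (e.g. inside strategy.scope()) use the
# mixed_bfloat16 policy, roughly halving activation memory on HBM.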
IIUC, tf.keras.backend.clear_session() and gc.collect() will only clear memory on your host VM, not on the TPU server.
PS: You can also use the steps_per_execution flag to further improve performance. Please see here for more info. Basically this prevents execution from switching between the host and the TPU on every step. If you set it to the number of training steps in an epoch, you will get the best throughput.
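Roughly, reusing opt, mini_batch_size, and the loss from the MVE above (a sketch, not tested on your setup; note that in TF 2.3 the compile argument was still called experimental_steps_per_execution):

steps_per_epoch = 1024 // mini_batch_size  # the MVE generates 1024 random training examples

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
    steps_per_execution=steps_per_epoch,  # run a whole epoch per host-to-TPU dispatch (TF 2.4+ name)
)
# model.fit(...) is then called as before, with steps_per_epoch=steps_per_epoch.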
My situation is that saving a model is extremely slow in the Colab TPU environment.
I first encountered this issue when using the checkpoint callback, which caused training to get stuck at the end of the first epoch.
Then I tried taking out the callback and just saving the model using model.save_weights(), but nothing changed. Using the Colab terminal, I found that the saving speed is about ~100k every 5 minutes.
The TensorFlow version is 2.3.
My model-fitting code is here:
with tpu_strategy.scope():  # creating the model in the TPUStrategy scope means we will train the model on the TPU
    model = create_model()

    checkpoint = keras.callbacks.ModelCheckpoint('baseline_{epoch:03d}.h5',
                                                 save_weights_only=True, save_freq="epoch")

    hist = model.fit(get_train_ds().repeat(),
                     steps_per_epoch=100,
                     epochs=5,
                     verbose=1,
                     callbacks=[checkpoint])

    model.save_weights("epoch-test.h5", overwrite=True)
I found that the issue happened because I had explicitly switched to graph mode by writing
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
before

with tpu_strategy.scope():
    model.fit(...)
Though I still don't understand the cause, removing disable_eager_execution solved the issue.
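For anyone hitting a similar hang, a quick hedged check of which mode you are actually in before calling fit:

import tensorflow as tf

# True on TF 2.x unless something has disabled eager execution.
print("Eager execution enabled:", tf.executing_eagerly())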
Please help me find a solution to my problem. It's important for me to state first that I have successfully created my own custom dataset and successfully trained it using resnet101 on my own computer (16 GB RAM and a 4 GB NVIDIA 980).
The problem arose when I tried to switch the backbone to inception-resnet and nasnet. I got the following error:
"ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape ..."
I thought I didn't have enough resources on my computer, so I created an AWS EC2 instance with 60 GB of RAM and a 12 GB NVIDIA Tesla K80 (my workplace only provides this service) and trained the network there.
The training for inception-resnet worked well; however, that's not the case with nasnet. Even with 100 GB of memory I still get the OOM error.
I found one solution on the GitHub tensorflow/models page, issue #1817, and followed the instructions by adding the following lines to the nasnet config file:
train_config: {
  batch_size: 1
  batch_queue_capacity: 50
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 10
  ...
and the code ran well for a while ("top" screenshot omitted). However, I still got the OOM error after running around 6000 steps:
INFO:tensorflow:global step 6348: loss = 2.0393 (3.988 sec/step)
INFO:tensorflow:Saving checkpoint to path /home/ubuntu/crack-detection/structure-crack/models/faster_rcnn_nas_coco_2017_11_08/train/model.ckpt
INFO:tensorflow:global step 6349: loss = 0.9803 (3.980 sec/step)
2018-01-25 05:51:25.959402: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 79.73MiB. Current allocation summary follows.
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,17,17,4032]
[[Node: MaxPool2D/MaxPool = MaxPool[T=DT_FLOAT, data_format="NHWC", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1],
...
Is there anything else I can do to run this smoothly without any OOM errors? Thanks for your help.
EDIT #1: The errors come more frequently now; they show up after 1000-1500 steps.
EDIT #2: Based on issue #2668 and issue #3014, there's one more thing we can do to run the code without OOM errors: add second_stage_batch_size: 25 (the default is 50) in the model section of the config file. The file should then look like the following:
model {
  faster_rcnn {
    ...
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_batch_size: 25
  }
}
Hope this can help.
I would like to point out that the memory you are running out of is the GPU's, so I'm afraid those 100 GB of RAM are only useful for data wrangling outside of training. Also, without code, it's really difficult to figure out where the error is coming from.
That being said, if you can initialize the network with its weights, train for 6000 iterations, and then suddenly run out of GPU memory, my guess is that you are either somehow accumulating values in GPU memory or, if you have variable-size inputs, passing an input in that iteration that is too big memory-wise.
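On the variable-size-input point, for the Object Detection API specifically, one knob worth checking is the image resizer in the model config, which caps how large a resized image (and hence its activations) can get. A rough sketch with illustrative values, not the ones from your config:

model {
  faster_rcnn {
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024   # lowering this cap bounds per-image GPU memory
      }
    }
    ...
  }
}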
I am using the DNNClassifier Estimator to train a binary classifier. I want to log device info to verify whether my model is running on the GPU or the CPU.
Since we don't deal with a session when using an Estimator, how can I log device info?
Main problem: my 3-layer neural net with hidden units [100, 75, 50] runs faster on the CPU than on the GPU. I tried increasing the batch size up to 256, but it's still the same. Hence, I want to confirm whether it is actually using the GPU.
Use the config argument of tf.estimator.Estimator.__init__:
classifier = DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[100, 75, 50],
    config=tf.estimator.RunConfig(
        session_config=tf.ConfigProto(log_device_placement=True)))
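As a small hedged supplement, independent of where the Estimator places ops, you can sanity-check whether TensorFlow sees a GPU at all with the TF 1.x test utilities:

import tensorflow as tf

print(tf.test.is_gpu_available())   # True if a CUDA-enabled GPU is visible (TF 1.x API)
print(tf.test.gpu_device_name())    # e.g. '/device:GPU:0', or '' if no GPU is found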
I am running the Jupyter notebook from Google's TensorFlow object-detection API on an Ubuntu 16.04 Parallels desktop on my Mac. I wanted to test one of the non-default models (i.e., not SSD with MobileNet) to see how the accuracy of the bounding boxes might change on an object-detection task.
I changed the section on Model Preparation in the notebook as follows:
# What model to download.
MODEL_NAME = 'ssd_mobilenet_v1_coco_11_06_2017'
MODEL_NAME = 'ssd_inception_v2_coco_11_06_2017'
MODEL_NAME = 'rfcn_resnet101_coco_11_06_2017'
#MODEL_NAME = 'faster_rcnn_resnet101_coco_11_06_2017'
#MODEL_NAME = 'faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'
# Path to frozen detection graph. This is the actual model that is used for the object detection.
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'
# List of the strings that is used to add correct label for each box.
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')
NUM_CLASSES = 90
I then jump to executing the cell that loads the frozen TensorFlow model into memory. Unfortunately, if I try any of the last three models (rfcn_resnet101_coco_11_06_2017, faster_rcnn_resnet101_coco_11_06_2017, faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017), the notebook crashes in Firefox and I get the following error message:
The kernel appears to have died. It will restart automatically.
So I am unable to test out the last 3 models, even though I have downloaded the tar.gz files and extracted them in the object_detection folder. Could somebody please explain what I may be doing wrong?
Thank you for your time!
As it turns out, this issue was caused by my not allocating sufficient memory to Parallels. The script worked after I allocated more memory. Thanks for the tip, Jonathan!