NaN loss values when training an auto-encoder with multiple outputs - tensorflow

I am trying to optimize an autoencoder such that it also reproduces the calculated committor values simultaneously. The code looks like:
encoder_input = keras.Input(shape=(ncv,))
xencode = keras.layers.Dense(hidden_layers_encoder[0], activation='linear')(encoder_input)
for i in hidden_layers_encoder[1:]:
    xencode = keras.layers.Dense(i, activation='linear')(xencode)
    xencode = keras.layers.Dropout(0.1)(xencode)
encoder_output = keras.layers.Dense(n_bottleneck, activation='linear')(xencode)
encoder = keras.Model(encoder_input, encoder_output, name="encoder")

# decoder branch: reconstructs the input collective variables
decoder_input = keras.layers.Dense(hidden_layers_decoder[0], activation='tanh')(encoder_output)
xdecode = decoder_input
xdecode = keras.layers.Dropout(0.1)(xdecode)
for j in range(nhid - 1):
    xdecode = keras.layers.Dense(hidden_layers_decoder[j + 1], activation='tanh')(xdecode)
    xdecode = keras.layers.Dropout(0.1)(xdecode)
decoder_output = keras.layers.Dense(ncv, activation='linear', name='decoder')(xdecode)

opt = keras.optimizers.Adam(learning_rate=0.001)
auto_encoder = keras.Model(encoder_input, decoder_output, name="auto-encoder")

# second branch: predicts the committor values from the bottleneck
pb_input = keras.layers.Dense(hidden_layers_decoder[0], activation='sigmoid')(encoder_output)
pb_cal = keras.layers.Dropout(0.1)(pb_input)
for k in range(nhid - 1):
    pb_cal = keras.layers.Dense(hidden_layers_decoder[k + 1], activation='sigmoid')(pb_cal)
    pb_cal = keras.layers.Dropout(0.1)(pb_cal)
pb_output = keras.layers.Dense(npb, activation='sigmoid', name='pb_decoder')(pb_cal)
pbcoder = keras.Model(encoder_input, pb_output, name="pbcoder")

auto_encoder_pb = keras.Model(inputs=encoder_input, outputs=[decoder_output, pb_output], name="auto-encoder-pb")
auto_encoder_pb.compile(optimizer=opt, loss=['mse', 'mse'], metrics=['accuracy'])
history = auto_encoder_pb.fit(x_train, [x_train, y_train],
                              validation_data=(x_test, [x_test, y_test]),
                              batch_size=500, epochs=500)
The input dimension is 14, and I have used four hidden layers with 56 neurons each in all cases. I have varied the dimension of the bottleneck from 1 to 8. I have thoroughly checked my data file to make sure that there are no NaN/Inf values. But while fitting, it gives me:
Epoch 1/500
143/143 [==============================] - 1s 10ms/step - loss: nan - decoder_loss: nan - pb_decoder_loss: nan - decoder_accuracy: 0.0310 - pb_decoder_accuracy: 0.5448 - val_loss: nan - val_decoder_loss: nan - val_pb_decoder_loss: nan - val_decoder_accuracy: 0.0311 - val_pb_decoder_accuracy: 0.5421
Epoch 2/500
143/143 [==============================] - 1s 7ms/step - loss: nan - decoder_loss: nan - pb_decoder_loss: nan - decoder_accuracy: 0.0307 - pb_decoder_accuracy: 0.5448 - val_loss: nan - val_decoder_loss: nan - val_pb_decoder_loss: nan - val_decoder_accuracy: 0.0311 - val_pb_decoder_accuracy: 0.5421
Epoch 3/500
143/143 [==============================] - 1s 8ms/step - loss: nan - decoder_loss: nan - pb_decoder_loss: nan - decoder_accuracy: 0.0307 - pb_decoder_accuracy: 0.5448 - val_loss: nan - val_decoder_loss: nan - val_pb_decoder_loss: nan - val_decoder_accuracy: 0.0311 - val_pb_decoder_accuracy: 0.5421
How can I fix this?
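For what it's worth, the data check I mean is something like this minimal sketch (assuming the arrays are NumPy arrays named as in the fit call above):
import numpy as np

# confirm none of the input/target arrays contain NaN or Inf values
for name, arr in [('x_train', x_train), ('y_train', y_train),
                  ('x_test', x_test), ('y_test', y_test)]:
    print(name, np.isnan(arr).any(), np.isinf(arr).any())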

Related

Val_loss "Nan" After Decreasing Sample Size in the CSV File but Processed Data Is the Same

I have tried the following example, which works very well. In the example file, the values are stored at 10-minute intervals. However, since I need to bring in additional values that are only available hourly, I deleted from the database all rows that were not at a full hour. That is, there are now only 1/6 as many rows, plus three more columns that are not selected in this test run so far.
If I now execute the code exactly as before, the following step
path_checkpoint = "model_checkpoint.h5"
es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0, patience=5)
modelckpt_callback = keras.callbacks.ModelCheckpoint(
    monitor="val_loss",
    filepath=path_checkpoint,
    verbose=1,
    save_weights_only=True,
    save_best_only=True,
)
history = model.fit(
    dataset_train,
    epochs=epochs,
    validation_data=dataset_val,
    callbacks=[es_callback, modelckpt_callback],
)
always prints the following val_loss error for each epoch:
Epoch 1/10
871/871 [==============================] - ETA: 0s - loss: 0.4529
Epoch 1: val_loss did not improve from inf
871/871 [==============================] - 288s 328ms/step - loss: 0.4529 - val_loss: nan
I think it is related to this previous code block,
split_fraction = 0.715
train_split = int(split_fraction * int(df.shape[0]))
step = 6
past = 720
future = 72
learning_rate = 0.001
batch_size = 256
epochs = 10

def normalize(data, train_split):
    data_mean = data[:train_split].mean(axis=0)
    data_std = data[:train_split].std(axis=0)
    return (data - data_mean) / data_std
where the original author specifies that only every sixth record should be used. Since I already thinned the data to hourly (every sixth) records beforehand, it should now use all records. Therefore I already tried setting step = 1, but without success. It still shows the message that val_loss did not improve from inf.
Does anyone know what else I would need to adjust so the code accounts for there now being only one-sixth as many rows as originally intended? The result should initially end up with the same values as in the example, because I have not yet used the new data.
The issue was inside the .csv file.
In two of the 300000 rows, the date was formatted as 25.10.18, but in the other rows the date was 25.10.2018.
After editing the format to a consistent dd.mm.yyyy, the val_loss decreased as expected.
If you are facing the same issue, this code can help you to find wrong formatted rows:
date_time = pd.to_datetime(df.pop('Date Time'), format='%d.%m.%Y %H:%M:%S')
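For example, a minimal sketch that lists the mismatching rows directly (assuming the same 'Date Time' column and format as above):
import pandas as pd

# rows that do not match the expected format become NaT instead of raising an error
parsed = pd.to_datetime(df['Date Time'], format='%d.%m.%Y %H:%M:%S', errors='coerce')
print(df[parsed.isna()])   # shows the wrongly formatted rows, e.g. those using a two-digit year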

How do I add a line break to an external text log file from a pentaho transform?

I'm using Pentaho PDI (Spoon). I have a transform to compare 2 database tables (from a query selecting year and quarters within those tables); it then goes through a Merge rows (diff) step into a Filter rows step on whether the flagfield is identical. If the rows match it logs the matches, and if they don't match it logs the differences, both with Text file output steps...
my issue is my external log file gets appended and looks like this:
412542 - 21 - 4 - deleted - DOMAIN1
461623 - 22 - 1 - deleted - DOMAIN1
^failuresDOMAIN1 - 238388 - 12 - 4 - identical
DOMAIN1- 223016 - 13 - 1 - identical
DOMAIN1- 171764 - 13 - 2 - identical
DOMAIN1- 185569 - 13 - 3 - identical
DOMAIN1- 232247 - 13 - 4 - identical
DOMAIN1- 260057 - 14 - 1 - identical
^successes
I want this output:
412542 - 21 - 4 - deleted - DOMAIN1
461623 - 22 - 1 - deleted - DOMAIN1
^failures
DOMAIN1 - 238388 - 12 - 4 - identical
DOMAIN1- 223016 - 13 - 1 - identical
DOMAIN1- 171764 - 13 - 2 - identical
DOMAIN1- 185569 - 13 - 3 - identical
DOMAIN1- 232247 - 13 - 4 - identical
DOMAIN1- 260057 - 14 - 1 - identical
^successes
notice the line breaks between the successes and failures
I've tried adding a Data Grid with a "line_break" string field that is simply a new line, then passing that to each Text file output step so it writes the "line_break" column value, which helps, but I can't seem to sequence the transform steps because they run in parallel...

What is this batch in the verbose output during model training in tensorflow 2.3.0?

I have specified batch size = 16 and steps per epoch = 1000. The total number of samples in the dataset is 2628. The verbose output is shown below as well as in the image link.
Epoch 26/100
1000/1000 [==============================] - ETA: 0s - batch: 499.5000 - size: 16.0000 - loss: 0.2308 - Z_loss: 0.0200 - Z_logits_loss: 0.2108
lr = 0.001961161382496357, step = 26000
1000/1000 [==============================] - 3168s 3s/step - batch: 499.5000 - size: 16.0000 - loss: 0.2308 - Z_loss: 0.0200 - Z_logits_loss: 0.2108
What does batch = 499.5000 mean?
Model training Verbose output
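For a rough sanity check, if one assumes the verbose 'batch' field is the running average of the zero-based batch index within the epoch (and 'size' the average batch size), the printed numbers are consistent with the settings above; this interpretation is an assumption, not confirmed by the output itself:
import numpy as np

steps = 1000                      # steps per epoch as specified above
print(np.arange(steps).mean())    # 499.5 -> matches "batch: 499.5000"
print(np.full(steps, 16).mean())  # 16.0  -> matches "size: 16.0000"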

Unable to get AWS SageMaker to read RecordIO files

I'm trying to convert an object detection lst file to a rec file and train with it in SageMaker. My list looks something like this:
10 2 5 9.0000 1008.0000 1774.0000 1324.0000 1953.0000 3.0000 2697.0000 3340.0000 948.0000 1559.0000 0.0000 0.0000 0.0000 0.0000 0.0000 IMG_1091.JPG
58 2 5 11.0000 1735.0000 2065.0000 1047.0000 1300.0000 6.0000 2444.0000 2806.0000 1194.0000 1482.0000 1.0000 2975.0000 3417.0000 1739.0000 2139.0000 IMG_7000.JPG
60 2 5 12.0000 1243.0000 1861.0000 1222.0000 1710.0000 6.0000 2423.0000 2971.0000 1205.0000 1693.0000 0.0000 0.0000 0.0000 0.0000 0.0000 IMG_7061.JPG
80 2 5 1.0000 1865.0000 2146.0000 818.0000 969.0000 14.0000 1559.0000 1918.0000 1658.0000 1914.0000 6.0000 2638.0000 3042.0000 2125.0000 2490.0000 IMG_9479.JPG
79 2 5 13.0000 1556.0000 1812.0000 1440.0000 1637.0000 7.0000 2216.0000 2452.0000 1595.0000 1816.0000 0.0000 0.0000 0.0000 0.0000 0.0000 IMG_9443.JPG
Where the columns are
index, header length, object length, class id, xmin, ymin, xmax, ymax, (repeat any other ids...), image path
I then run the list through im2rec with
$ /incubator-mxnet/tools/im2rec.py my_lst.lst my_image_folder
I then upload the resultant .rec file to s3.
I then pull the necessary parts from this AWS sample notebook.
I think the only key piece is probably this:
def set_hyperparameters(num_epochs, lr_steps):
    num_classes = 16
    num_training_samples = 227
    print('num classes: {}, num training images: {}'.format(num_classes, num_training_samples))
    od_model.set_hyperparameters(base_network='resnet-50',
                                 use_pretrained_model=1,
                                 num_classes=num_classes,
                                 mini_batch_size=16,
                                 epochs=num_epochs,
                                 learning_rate=0.001,
                                 lr_scheduler_step=lr_steps,
                                 lr_scheduler_factor=0.1,
                                 optimizer='sgd',
                                 momentum=0.9,
                                 weight_decay=0.0005,
                                 overlap_threshold=0.5,
                                 nms_threshold=0.45,
                                 image_shape=512,
                                 label_width=350,
                                 num_training_samples=num_training_samples)

set_hyperparameters(100, '33,67')
Ultimately I get the error: Not enough label packed in img_list or rec file.
Can someone help me identify what parts I'm missing in order to properly train with SageMaker and RecordIO files?
Thanks for your help!
Also, if I instead use
$ /incubator-mxnet/tools/im2rec.py my_lst.lst my_image_folder --pass-through --pack-label
I get the error:
Expected number of batches: 14, did not match the number of batches processed: 5. This may happen when some images or annotations are invalid and cannot be parsed. Please check the dataset and ensure it follows the format in the documentation.
This may come late, but did you label your classes starting from 0 in the .lst file?
In the link you posted:
The classes should be labeled with successive numbers and start with 0.
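One quick way to verify this is a rough sketch like the following, which assumes the whitespace-separated layout shown in the question (index, header length, object length, repeated [class id, xmin, ymin, xmax, ymax] groups, image path last) and treats all-zero groups as padding:
class_ids = set()
with open('my_lst.lst') as f:
    for line in f:
        fields = line.split()
        obj_len = int(float(fields[2]))   # 5 values per object: class id + 4 coordinates
        labels = fields[3:-1]             # objects sit between the header and the image path
        for i in range(0, len(labels), obj_len):
            group = [float(v) for v in labels[i:i + obj_len]]
            if any(group):                # skip all-zero padding groups
                class_ids.add(int(group[0]))
print(sorted(class_ids))                  # should be consecutive and start at 0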

Printing particularities of pgf77 (FORTRAN 77 behaviour?)

I compile and run this simple FORTRAN 77 program:
      program test
      write(6,*) '- - - - - - - - - - - - - - - - - - - - - - - - - - ',
     & '- - - - - - - - - - - - - - - - - - - - - - - - - -'
      write(6,'(2G15.5)') 0.1,0.0
      end
with gfortran or f95 the output is:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.10000 0.0000
with pgf77 it is:
- - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - -
0.10000 0.00000E+00
and with g77 or ifort:
- - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - -
0.10000 0.0000
A couple of questions arise:
Why is 0.0 printed with four decimal places instead of five, as requested in the format G15.5? Is this spec-compliant? And why does pgf77 write it differently?
I guess the line break in the - - - - - - line with the last three compilers is due to some limitation in the output line length... Is there any way of increasing this limit, or otherwise forcing single-line writes, at compile time?
By the way, the desired output is
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.10000 0.00000
which matches none of the above.
Exactly what the G edit descriptor causes to be printed is a little complicated, but for the value 0.0 the standard (10.7.5.2.2 in the 2008 edition) states that the compiler should print a representation with d-1 (i.e. 4 in your example) digits in the fractional part of the number. So most of your compilers are behaving correctly; I think that pgf77 is incorrect, but it may be working to an earlier standard with a different requirement.
The simplest fix for this is probably to use an f edit descriptor instead, (2F15.5).
As for the printing of the lines of hyphens, your use of the * edit descriptor, which causes list-directed output, surrenders precise control of the output to the compiler. My opinion is that it is a little perverse of a compiler to print the two parts of the expression on two lines but it is not non-standard behaviour.
If you want the hyphens printed all on one line, take control of the output format; write(6,'(2A24)') or something similar ought to do it (I didn't count the hyphens for you, just guessed that there are 24 in each part of the output). If that doesn't appeal to you, simply write one string with all the hyphens in it; that will probably get written on one line even using list-directed output.