Tensorflow & Keras: LSTM performs bad on seq2seq problem with clear solution - tensorflow

I am learning about tensorflow, and seq2seq problems for machine translation.
For this I gave me the following task:
I created an Excel, containing random dates in different types, for example:
05.09.2192
martes, 07 de mayo de 2329
Friday, 30 December, 2129
In my dataset, each type is occuring 1000 times. These are my train (X) value.
My target (Y) values are in one half always in this type:
05.09.2192
07.03.2329
30.12.2129
And in another half in this type:
Samstag, 12. Juni 2669
Donnerstag, 1. April 2990
Freitag, 10. November 2124
To make the model beeing able to differentiate these two Y values, another context information (C) is given as text:
Ausgeschrieben (written out)
Datum (date)
So some rows look like this:
So my goal is, to create a model, which is able to "translate" any date type to the german date type e.g. 05.09.2192.
The dataset contains 34.000 pairs.
To solve this, I use a character based tokenizer to transform text into integers:
tokenizer = keras.preprocessing.text.Tokenizer(filters='', char_level=True, oov_token="|")
I use an LSTM encoder-decoder model and I expect it, to reach an perfect accuracy, since the dependency between X and Y can be solved perfectly.
However, I reach up to an maximum of 72% of accuracy. Even worse, the accuracy is only reaching that much, because the padding is generated well. E.g. most of the Y values are pretty short and are therefore padded. So 12.02.2001 becomes e.g. ||||||||||||||||||||12.02.2001. So the model learns well to generate the padding token, but not the expected value.
This is the model structure I used at my latest test:
from tensorflow.keras.layers import Concatenate
encoder_inputs = keras.layers.Input(batch_input_shape=[32,None], dtype=np.int32)
decoder_inputs = keras.layers.Input(batch_input_shape=[32,None], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, 1)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
encoder_0 = keras.layers.Dense(128)(encoder_embeddings)
encoder_0d = keras.layers.Dropout(0.4)(encoder_0)
encoder_0_1 = keras.layers.Dense(256)(encoder_0d)
encoder_0_1d = keras.layers.Dropout(0.2)(encoder_0_1)
encoder_0_2 = keras.layers.Dense(128)(encoder_0_1d)
encoder_0_2d = keras.layers.Dropout(0.05)(encoder_0_2)
encoder_0_3 = keras.layers.Dense(64)(encoder_0_2d)
encoder_1 = keras.layers.LSTM(64, return_state=True, return_sequences=True, recurrent_dropout=0.2)
encoder_lstm_bidirectional = keras.layers.Bidirectional(encoder_1)
encoder_output, state_h1, state_c1, state_h2, state_c2 = encoder_lstm_bidirectional(encoder_0_3)
encoder_state = [Concatenate()([state_h1, state_h2]), Concatenate()([state_c1, state_c2])]
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(64*2)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler, output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(decoder_embeddings, initial_state=encoder_state,
sequence_length=[sequence_length], training=True)
y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[y_proba])
If needed, I can deploy the whole notebook in github, but maybe there is a simple solution, I just did not see so far.
Thanks for your help!

So, in case this helps anyone in the future:
The model did exactly what I asked it to do.
BUT
You need to be careful, that your data preprocession does not lead to ambiguity.
So you have to prevent something like:
a -> b and also a -> c
While improving one equatation, the other one will lose. That was my problem.
See this example:
eq1: 26.04.1994 -> 26.04.1994
eq2: 26.04.1994 -> Tuesday, 26.04.1994
One the one hand, the models increases the accuracy for eq1. On the other hand, it decreases the loss for eq2. So 74% is kind of a compromise, the model found.
To solve that, I had to add another factor, which describes the data more specifically. So I added an extra condition, describing whether y is written out or just written as date type. So now I have a data structure like the following and my accuracy grew to 98%:
eq1: 26.04.1994, dateformat -> 26.04.1994
eq2: 26.04.1994, written_out -> Tuesday, 26.04.1994

Related

How do I get value function/critic values from Rllib's PPO algorithm for a range of observations?

Goal: I want to train a PPO agent on a problem and determine its optimal value function for a range of observations. Later I plan to work with this value function (economic inequality research). The problem is sufficiently complex so that dynamic programming techniques no longer work.
Approach: In order to check, whether I get correct outputs for the value function, I have trained PPO on a simple problem, whose analytical solution is known. However, the results for the value function are rubbish, which is why I suspect that I have done sth wrong.
The code:
from keras import backend as k_util
...
parser = argparse.ArgumentParser()
# Define framework to use
parser.add_argument(
"--framework",
choices=["tf", "tf2", "tfe", "torch"],
default="tf",
help="The DL framework specifier.",
)
...
def get_rllib_config(seeds, debug=False, framework="tf") -> Dict:
...
def get_value_function(agent, min_state, max_state):
policy = agent.get_policy()
value_function = []
for i in np.arange(min_state, max_state, 1):
model_out, _ = policy.model({"obs": np.array([[i]], dtype=np.float32)})
value = k_util.eval(policy.model.value_function())[0]
value_function.append(value)
print(i, value)
return value_function
def train_schedule(config, reporter):
rllib_config = config["config"]
iterations = rllib_config.pop("training_iteration", 10)
agent = PPOTrainer(env=rllib_config["env"], config=rllib_config)
for _ in range(iterations):
result = agent.train()
reporter(**result)
values = get_value_function(agent, 0, 100)
print(values)
agent.stop()
...
resources = PPO.default_resource_request(exp_config)
tune_analysis = tune.Tuner(tune.with_resources(train_schedule, resources=resources), param_space=exp_config).fit()
ray.shutdown()
So first I get the policy (policy = agent.get_policy()) and run a forward pass with each of the 100 values (model_out, _ = policy.model({"obs": np.array([[i]], dtype=np.float32)})). Then, after each forward pass I use the value_function() method to get the output of the critic network and evaluate the tensor via keras backend.
The results:
True VF (analytical solution)
VF output of Rllib
Unfortunately you can see that the results are not that promising. Maybe I have missed a pre- or postprocessing step? Does the value_function() method even return the last layer of the critic network?
I am very grateful for any help!
It's not part of your script, but I assume that you have trained the policy before you attempt to get useful values out of it.
You are correct in assuming that the value_function() returns the output of the last layer of the critic network in RLlib's implementations.
Have a look at the value function metrics to see if it's actually learning anything (RLlib logs .../learner_stats/vf_loss and .../learner_stats/vf_explained_var)!
After training the model, I'd also try to query the model directly. If that looks better, something is likely off with the code you posted here.

Getting nans for gradient

I am trying to create a search relevance model where I take the dot product between query vector and resulting documents. I add a positional bias term on top to take into account the fact that position 1 is more likely to be clicked on. The final (unnormalised) log likelihood calculation is as follows:
query = self.query_model(query_input_ids, query_attention_mask)
docs = self.doc_model(doc_input_ids, doc_attention_mask)
positional_bias = self.position_model()
if optimizer_idx is not None:
if optimizer_idx == 0:
docs = docs.detach()
positional_bias = positional_bias.clone().detach()
elif optimizer_idx == 1:
query = query.detach()
positional_bias = positional_bias.clone().detach()
else:
query = query.detach()
docs = docs.detach()
similarity = (docs # query.unsqueeze(-1)).squeeze()
click_log_lik = (similarity + positional_bias)\
.reshape(doc_mask.shape)\
.masked_fill_((1 - doc_mask).bool(), float("-inf"))
The query and doc model is simply a distilbert model with a projection layer on top of CLS token. The models can be seen here: https://pastebin.com/g21g9MG3
When inspecting the first gradient descent step, it has nans, but only for the query model and not the doc model. My hypothesis is that normalizing the return values for doc and query models (return F.normalize(out, dim=-1)) is somehow playing up with the gradients.
Does anyone know 1. If my hypothesis is true and more importantly 2. How can I rectify nan gradients?.
Additional Info:
None of the losses are inf or nan.
query is BS x 768
docs is BS x DOC_RESULTS x 768
positional_bias is DOC_RESULTS
DOC_RESULTS is 10 in my case.
The masked_fill in the last line is because occasionally I have less than 10 data points for a query.
Update 1
The following changes made no difference to nans:
Changing masked_fill from -inf to 1e5.
Changing the projection from F.normalize(out, dim=-1) to out / 100.
Removed positional bias altogether with again no luck.
If it helps anyone, and you come across this while using Transformers this is what I did:
So in the end the bug was due to the fact that I was masking away nan's. Since I had some documents with zero length, the output of the transformer was nan. I was hoping that masked_fill would fix this problem, but it doesn't. The solution in my case was to only put non-zero length sequences through transformers, and then append with zeros to fill the batch size.

How can I find a standard method of predicting next values of a stock market using Tensorflow?

Thank you for reading. I'm not good at English.
I am wondering how to predict and get future time series data after model training. I would like to get the values after N steps.
I wonder if the time series data has been properly learned and predicted.
How i do this right get the following (next) value?
I want to get the next value using like model.predict or etc
I have x_test and x_test[-1] == t, so the meaning of the next value is t+1, t+2, .... t+n,
In this example I want to get predictions of the next t+1, t+2 ... t+n
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is like below
The results from X_test[-20:] and the following 20 predictions looks like same.
I'm wondering if it's the correct train and predicted value.
I'm wondering if it was a right training and predict.
full source
The method I tried first did not work correctly.
Seconds
I realized something is wrong, I tried using another official data
So, I used the time series in the Tensorflow tutorial to practice predicting the model.
a = y_val[-look_back:]
for i in range(N-step prediction): # predict a new value n times.
tmp = model.predict(a.reshape(-1, look_back, num_feature)) # predicted value
a = a[1:] # remove first
a = np.append(a, tmp) # insert predicted value
The results were predicted in a linear regression shape very differently from the real data.
Output a linear regression that is independent of the real data:
full source (After the 25th line is my code.)
I'm really very curious what is a standard method of predicting next values of a stock market.
Thank you for reading the long question. I seek advice about your priceless opinion.
Q : "How can I find a standard method of predicting next values of a stock market...?"
First - salutes to C64 practitioner!
Next, let me say, there is no standard method - there cannot be ( one ).
Principally - let me draw from your field of a shared experience - one can easily predict the near future flow of laminar fluids ( a technically "working" market instrument - is a model A, for which one can derive a better or worse predictive tool )
That will never work, however, for turbulent states of the fluids ( just read the complexity of the attempts to formulate the many-dimensional high-order PDE for a turbulence ( and it still just approximates the turbulence ) ) -- and this is the fundamentally "working" market ( after some expected fundamental factor was released ( read NFP or CPI ) or some flash-news was announced in the news - ( read a Swiss release of currency-bonding of CHF to some USD parity or Cyprus one time state tax on all speculative deposits ... the financial Big Bangs follow ... )
So, please, do not expect one, the less any simple, model for reasonably precise predictions, working for both the laminar and turbulent fluidics - the real world is for sure way more complex than this :o)

How to add after each iteration in tensorflow

I am trying to achieve the following:
compute the losses in the previous 25 predictions and sum them before
computing the gradient. I have tried this:
loss_summation=tf.Variable(0,dtype=tf.dtypes.float32,name="loss")
xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=next_element[1],logits=logits2,name="xentropy")
loss=tf.math.reduce_sum(tf.reduce_mean(xentropy,name="loss"))
loss_summation=tf.assign(loss_summation,loss_summation+loss)
optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
gvs = optimizer.compute_gradients(loss_summation,[vars])
with tf.Session() as sess():
for i in range(25):
b=sess.run([loss_summation])
However optimizer.compute_gradients() complains that
None values not supported. How can go around this ?
I am actually trying to implement the following function(feedforward of LSTM) in tensorflow to predict the next word given the previous ones
def feedforward(self,x_s,hpre,targets,p_s):
fts,its,gts,css,ots,output,inputs=[],[],[],[],[],[],[]
losses=[]
hprev=hpre
hts=[hprev]
loss=0
losses=[]
previous_state=p_s
css.append(previous_state)
for x,y in zip(x_s,targets):
k=np.zeros((self.vocab_size,1))
k[x]=1
M_c=np.row_stack((hprev,k))
ft=self.sigmoid(np.dot(self.W1,M_c)+self.b1)
fts.append(ft)
it=self.sigmoid(np.dot(self.W2,M_c)+self.b2)
its.append(it)
gt=np.tanh(np.dot(self.W3,M_c)+self.b3)
gts.append(gt)
cs=(ft*previous_state)+(it*gt)
previous_state=cs
css.append(cs)
ot=self.sigmoid(np.dot(self.W4,M_c)+self.b4)
ots.append(ot)
ht=ot*np.tanh(cs)
hts.append(ht)
yt=self.softmax(np.dot(self.W5,ht)+self.b5)
hprev=ht
output.append(yt)
inputs.append(M_c)
loss+=-np.log(yt[y])
losses.append(loss)
return fts,its,gts,css,ots,output,hts,loss,hts[-1],css[-1],inputs
x_s is a list of integers representing words.
x_s=[0,1,2,3,4,5,6,7,8....,24]
targets is the list of integers expected i.e if x_s=0 then next letter is 1
targets=[1,2,3,4,5,6,7,8,9...,25]
The loss which is a summation of 25 losses is what will be minimized.
There are a few things you need to address here:
Is there a good reason not to just use larger batches? Are you trying to implement the lookahead optimizer or something?
You look like you're getting started with TensorFlow. Consider turning on eager execution with tf.enable_eager_execution(). TensorFlow 2.0 is coming soon, don't waste your time messing with tf.Sessions.
Variables are not differentiable. So accumulating the losses in a variable doesn't make any sense.
I would make a copy of all the model's variables, and accumulate new values there. Then, after N iterations assign those values back to the model. Something like:
model = tf.keras.Sequential(...)
vars = model.trainable_variables
weight_acc = [tf.Variable(var) for var in model.trainable_variables]
for n,(batch, label) in enumerate(dataset):
with tf.GradientTape() as tape:
pred = model(batch)
loss = cal_loss(batch, label)
grads = tape.gradients(loss, vars)
for g, a in zip(grad, weight_acc):
a.assign_add(learning_rate*g)
if n%25 == 0:
for a, v in zip(weight_acc, vars):
v.assign_add(lookahead_fraction*(a-v))

After quantisation in neural network, will the output need to be scaled with the inverse of the weight scaling

I'm currently writing a script to quantise a Keras model down to 8 bits. I'm doing a fairly basic linear scaling on the weights, by assuming a normal distribution of weights and biases, and then interpolating all the values within 2 standard deviations of the mean, to the range [-128, 127].
This all works, and I run the model through inference, but my image out is crazy bad. I know there will be a small performance hit, but I'm seeing roughly 10x performance degradation.
My question is, after this scaling of the weights, do I need to do the inverse scaling operation to my output? None of the papers I've been reading seem to mention this, but I'm unsure why else my results would be so bad.
The network is for image demosaicing. It takes in a RAW image, and is meant to output an image with very low noise, and no demosaicing artefacts. My full precision model is very good, with image PSNRs of around 40-43dB, but after quantisation, I'm getting 4-8dB, and incredibly bad looking images.
Code for anyone who's bothered to read it
for i in layer_index:
count = count+1
layer = model.get_layer(index = i);
weights = layer.get_weights();
weights_act = weights[0];
bias_act = weights[1];
std = np.std(weights_act)
if (std > max_std):
max_std = std
mean = np.mean(weights_act)
mean_of_mean = mean_of_mean + mean
mean_of_mean = mean_of_mean / count
max_bound = mean_of_mean + 2*max_std
min_bound = mean_of_mean - 2*max_std
print(max_bound, min_bound)
for i in layer_index:
layer = model.get_layer(index = i);
weights = layer.get_weights();
weights_act = weights[0];
bias_act = weights[1];
weights_shape = weights_act.shape;
bias_shape = bias_act.shape;
new_weights = np.empty(weights_shape, dtype = np.int8)
print(new_weights.dtype)
new_biass = np.empty(bias_shape, dtype = np.int8)
for a in range(weights_shape[0]):
for b in range(weights_shape[1]):
for c in range(weights_shape[2]):
for d in range(weights_shape[3]):
new_weight = (((weights_act[a,b,c,d] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
new_weights[a,b,c,d] = np.int8(new_weight)
#print(new_weights[a,b,c,d], weights_act[a,b,c,d])
for e in range(bias_shape[0]):
new_bias = (((bias_act[e] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
new_biass[e] = np.int8(new_bias)
new_weight_layer = (new_weights, new_biass)
layer.set_weights(new_weight_layer)
You dont do what you think you are doing, I'll explain.
If you wish to take pre-trained model and quantize it you have to add scales after each operation that involves weights, lets take for example the convolution operation.
As we know convolution operation is linear in my explantion i will ignore the bias for the sake of simplicity (adding him is relatively easy), Let's assume X is our input Y is our output and W is the weights, convolution can be written as:
Y=W*X
where '*' represent the convolution operation, what you are basically doing is taking the weights and multiple them by some scalar (lets call it 'a') and shift them by some other scalar (let's call it 'b') so in your model you use W' where: W'= Wa+b
So if we return to the convolution operation we get that in your quantized network you basically do the next operation: Y' = W'*X = (Wa+b)*X
Because convolution is linear we get: Y' = a(W*X) + b*X'
Don't forget that in your network you want to receive Y not Y' at the output of the convolution therefore you must do shift + re scale to get the correct answer.
So after that explanation (which i hope was clear enough) i hope you can understand what is the problem in your network, you do this scale and shift to all of weights and you never compensate for it, I think your confusion is because your read papers that trained models in quantized mode from the beginning and didn't take pretrained model quantized it.
For you problem i think tensorflow graph transform tool might help, take a look at:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md
If you wish to read more about quantizing pre trained model you can find more information in (for more academic info just go to scholar.google.com:
https://www.tensorflow.org/lite/performance/post_training_quantization