Neural network for AI playing Connect Four: how to encode inputs and outputs, and what kind of NN setup? [closed] - tensorflow

Trying to understand how to build and train a neural-network-based AI for games, and struggling to get some details straight.
I'm not even concerned yet with whether to use TensorFlow, something else, or build my own. First I'm trying to grasp some basics, such as what ranges to use (e.g. inputs between -1 and 1, or between 0 and 1), and how to represent input game situations and output moves.
Suppose I'm building a neural network to play Connect Four. Given a game situation, the AI is to generate an 'optimal move'. Let's say the optimal move is the move with the highest probability of eventually winning the game, assuming a reasonably smart opponent.
I guess the input would be 7 columns * 6 rows = 42 input neurons, each containing a value that represents either my color, or the opponent's color, or empty.
How do I encode this? Does it make a difference whether I use:
0 = empty, 1 = my piece, 2 = opponent's piece
0 = empty, 0.5 = my piece, 1 = opponent's piece
0 = empty, 1 = my piece, -1 = opponent's piece
Or should I even use two times 42 = 84 input neurons, all binary/boolean, where 0=empty, 1s in the first 42 neurons represent my pieces, and 1s in the second 42 neurons represent the opponent's pieces?
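To illustrate that last option, here is a minimal sketch of what I mean (assuming NumPy and a 6x7 board array holding 1 for my pieces, -1 for the opponent's, 0 for empty):
import numpy as np

def encode_two_planes(board):
    # board: 6x7 array, 1 = my piece, -1 = opponent's piece, 0 = empty
    mine = (board == 1).astype(np.float32).ravel()     # first 42 binary inputs
    theirs = (board == -1).astype(np.float32).ravel()  # second 42 binary inputs
    return np.concatenate([mine, theirs])              # 84 inputs in total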
What does the output look like? 7 neurons that each represent one column, getting an output value in the 0..1 range, and the AI's move would be the one with the highest value? Or just one output value ranging from 0 to 1, where roughly each 1/7th of the interval represents a particular column?
Also, how should I design my neural network, how many hidden layers? Classification or regression? Sigmoid or Relu or tanh as activation function?
Intuitively, based on my limited experience with neural networks, I would say 2 or 3 hidden layers, each with twice the number of input neurons. I have no idea about the other considerations; I would just be stabbing in the dark, relying on trial and error.
Any feedback or suggestions to get me in the right direction?

If you have not yet delved into AI gameplay, might I suggest OpenAI Gym as a starting place?
I believe starting with that foundation will allow you to 1) start building and learning with active participation right away, and 2) answer a lot of the foundational 'what should I do?' questions that you have now, so that you can come back with the 'how do I' questions that we can really help you with here.

Related

Use of DeepExplainer to get shap values for an MLP model in Keras with tensorflow backend

I am playing around with DeepExplainer to get SHAP values for deep learning models. By following some tutorials I can get some results, i.e. which variables are pushing the model prediction away from the base value, which is the average model output over the training set.
I have around 5,000 observations along with 70 features, and the performance of DeepExplainer is quite satisfactory. My code is:
import shap
from keras.models import load_model

# load the trained model and build the explainer on the scaled training data
model0 = load_model(model_p + 'health0.h5')
background = healthScaler.transform(train[healthFeatures])
e = shap.DeepExplainer(model0, background)
shap_values = e.shap_values(healthScaler.transform(test[healthFeatures]))

# copy the test features and replace them with their scaled values for plotting
test2 = test[healthFeatures].copy()
test2[healthFeatures] = healthScaler.transform(test[healthFeatures])
shap.force_plot(e.expected_value[0], shap_values[0][947, :], test2.iloc[947, :])
And the plot is the following:
[force plot omitted]
Here the base value is 0.012 (which can also be seen through e.expected_value[0]), very close to the output value of 0.01.
At this point I have some questions:
1) The output value is not identical to the prediction obtained through model0.predict(test[healthFeatures])[947] = -0.103. How should I assess the output value?
2) As can be seen, I am using the whole training set as the background to approximate the conditional expectations of the SHAP values. What is the difference between using random samples from the training set and using the entire set? Is it only a matter of performance?
Many thanks in advance!
Probably too late, but still a very common question that will benefit other beginners. To answer (1): the expected value and the output value will generally differ. The expected value is, as the name suggests, the average over the scores predicted by your model; e.g., if the model outputs probabilities, it is the average of the probabilities your model produces. For (2): as long as there are fewer than 5k background values, it won't change much; but with more than 5k, your calculations will take days to finish.
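As a minimal sketch of using a random subsample as the background (variable names follow the question; the sample size of 1000 is an arbitrary choice):
import numpy as np

idx = np.random.choice(len(train), 1000, replace=False)  # random subsample of the training set
background = healthScaler.transform(train[healthFeatures].iloc[idx])
e = shap.DeepExplainer(model0, background)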
See this (lines 21-25) for more comprehensive answers.

How to customize a Deep Learning model if the output is one-hot vectors? [closed]

I am trying to build a Deep Learning model with TensorFlow and Keras. This is a sequential model for tasks of Single-Instance Multi-Label, which is a simplified version of Multi-Instance Multi-Label.
Concretely, the input of my model is an array of fixed length, so it can be represented as a vector.
The output of my model is a sequence of letters drawn from an alphabet of fixed size. For example, an alphabet of {A,B,C,D} has only 4 possible members, so I can use a one-hot vector to represent each letter in a sequence.
The length of the sequences is variable, but for simplicity, I use a fixed length (equal to that of the longest sequence) to store all sequences.
If the length of a sequence is shorter than the fixed length, the sequence is represented by one-hot vectors (as many as the sequence's actual length) followed by zero vectors (filling the remaining length). For example, CADB is represented by a 4 * 5 matrix like this (rows ordered A, B, C, D):
0 1 0 0 0
0 0 0 1 0
1 0 0 0 0
0 0 1 0 0
Please note: the first 4 columns of this matrix are one-hot vectors, each of which has one and only one 1 entry, with all other entries 0.
But the entries of the last column are all 0s, which can be seen as zero padding, because the sequence of letters is not long enough.
So in one word, the input is a vector and the output is a matrix.
Different from the link posted above, the output matrix should be seen as a whole. So one input vector is assigned to a whole matrix, not to a row or column of this matrix.
My question is: how do I customize my deep learning model for this special output? For example:
What loss function and accuracy metric should I choose or design?
Do I need to customize a special layer at the very end of my model?
You should use a softmax activation on the output layer and categorical_crossentropy as the loss function.
However, as you can see in the links above, the problem is that these two functions are by default applied along the last axis (axis=-1), while in your situation it is the second-to-last axis (the columns of the matrix) that is one-hot encoded.
To use the right axis, one option is to define your own versions of these functions like so:
import tensorflow as tf

def softmax_columns(x):
    # apply softmax over the column axis (second-to-last) instead of the last
    return tf.keras.backend.softmax(x, axis=-2)

def categorical_crossentropy_columns(target, output):
    return tf.keras.backend.categorical_crossentropy(target, output, axis=-2)
Then, you can use these like so:
model.add(SomeLayer(..., activation=softmax_columns, ...)) # output layer
model.compile(loss=categorical_crossentropy_columns, ...)
One good alternative (in general, not only here) is to make use of from_logits=True in the categorical_crossentropy call. This effectively builds the softmax into the loss function, so that your model itself does not need (in fact: must not have) the final softmax activation anymore. This not only saves work, but is also more numerically stable.
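A minimal sketch of that alternative, mirroring the snippet above (SomeLayer is the same placeholder as before):
def categorical_crossentropy_columns_logits(target, output):
    # softmax is folded into the loss via from_logits=True
    return tf.keras.backend.categorical_crossentropy(
        target, output, from_logits=True, axis=-2)

model.add(SomeLayer(..., activation=None))  # note: no softmax on the output layer
model.compile(loss=categorical_crossentropy_columns_logits, ...)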

Building a multivariate, multi-task LSTM with Keras

Preamble
I am currently working on a Machine Learning problem where we are tasked with using past data on product sales in order to predict sales volumes going forward (so that shops can better plan their stocks). We essentially have time series data, where for each and every product we know how many units were sold on which days. We also have information like what the weather was like, whether there was a public holiday, if any of the products were on sales etc.
We've been able to model this with some success using an MLP with dense layers, and just using a sliding window approach to include sales volumes from the surrounding days. However, we believe we'll be able to get much better results with a time-series approach such as an LSTM.
Data
The data we have is essentially as follows:
[table screenshot omitted]
(EDIT: for clarity, the "Time" column in the picture above is not correct. We have inputs once per day, not once per month. But otherwise the structure is the same!)
So the X data is of shape:
(numProducts, numTimesteps, numFeatures) = (50 products, 1096 days, 90 features)
And the Y data is of shape:
(numProducts, numTimesteps, numTargets) = (50 products, 1096 days, 3 binary targets)
So we have data for three years (2014, 2015, 2016) and want to train on this in order to make predictions for 2017. (That's of course not 100% true, since we actually have data up to Oct 2017, but let's just ignore that for now)
Problem
I would like to build an LSTM in Keras that allows me to make these predictions. There are a few places where I am getting stuck, though. So I have six concrete questions (I know one is supposed to limit a Stack Overflow post to one question, but these are all intertwined).
Firstly, how would I slice up my data for the batches? Since I have three full years, does it make sense to simply push through three batches, each of size one year? Or does it make more sense to make smaller batches (say 30 days) and also to use sliding windows? I.e. instead of 36 batches of 30 days each, I use 36 * 6 batches of 30 days each, each time sliding by 5 days? Or is this not really the way LSTMs should be used? (Note that there is quite a bit of seasonality in the data, so I need to catch that kind of long-term trend as well.)
Secondly, does it make sense to use return_sequences=True here? In other words, I keep my Y data as is (50, 1096, 3) so that (as far as I've understood it) there is a prediction at every time step for which a loss can be calculated against the target data? Or would I be better off with return_sequences=False, so that only the final value of each batch is used to evaluate the loss (i.e. if using yearly batches, then in 2016 for product 1, we evaluate against the Dec 2016 value of (1,1,1)).
Thirdly, how should I deal with the 50 different products? They are different, but still strongly correlated, and we've seen with other approaches (for example an MLP with simple time windows) that the results are better when all products are considered in the same model. Some ideas that are currently on the table are:
change the target variable to be not just 3 variables, but 3 * 50 = 150; i.e. for each product there are three targets, all of which are trained simultaneously.
split up the results after the LSTM layer into 50 dense networks, which take as input the outputs from the LSTM, plus some features that are specific to each product - i.e. we get a multi-task network with 50 loss functions, which we then optimise together. Would that be crazy?
consider a product as a single observation, and include product-specific features already at the LSTM layer. Use just this one layer followed by an output layer of size 3 (for the three targets). Push through each product in a separate batch.
Fourthly, how do I deal with validation data? Normally I would just keep out a randomly selected sample to validate against, but here we need to keep the time ordering in place. So I guess the best is to just keep a few months aside?
Fifthly, and this is the part that is probably the most unclear to me - how can I use the actual results to perform predictions? Let's say I used return_sequences=False and I trained on all three years in three batches (each time up to Nov) with the goal of training the model to predict the next value (Dec 2014, Dec 2015, Dec 2016). If I want to use these results in 2017, how does this actually work? If I understood it correctly, the only thing I can do in this instance is to then feed the model all the data points for Jan to Nov 2017 and it will give me back a prediction for Dec 2017. Is that correct? However, if I were to use return_sequences=True, then trained on all data up to Dec 2016, would I then be able to get a prediction for Jan 2017 just by giving the model the features observed at Jan 2017? Or do I need to also give it the 12 months before Jan 2017? What about Feb 2017, do I in addition need to give the value for 2017, plus a further 11 months before that? (If it sounds like I'm confused, it's because I am!)
Lastly, depending on what structure I should use, how do I do this in Keras? What I have in mind at the moment is something along the following lines: (though this would be for only one product, so doesn't solve having all products in the same model):
Keras code
from keras.models import Sequential
from keras.layers import LSTM, Dense

trainX = trainingDataReshaped    # Data for Product 1, Jan 2014 to Dec 2016
trainY = trainingTargetReshaped
validX = validDataReshaped       # Data for Product 1, for ??? Maybe for a few months?
validY = validTargetReshaped

numSequences = trainX.shape[0]
numTimeSteps = trainX.shape[1]
numFeatures = trainX.shape[2]
numTargets = trainY.shape[2]

model = Sequential()
model.add(LSTM(100, input_shape=(None, numFeatures), return_sequences=True))
model.add(Dense(numTargets, activation="softmax"))
model.compile(loss=stackEntry.params["loss"],  # loss name taken from an external config
              optimizer="adam",
              metrics=['accuracy'])

history = model.fit(trainX, trainY,
                    batch_size=30,
                    epochs=20,
                    verbose=1,
                    validation_data=(validX, validY))

predictX = predictionDataReshaped  # Data for Product 1, Jan 2017 to Dec 2017
prediction = model.predict(predictX)
So:
Firstly, how would I slice up my data for the batches? Since I have three full years, does it make sense to simply push through three batches, each of size one year? Or does it make more sense to make smaller batches (say 30 days) and also to use sliding windows? I.e. instead of 36 batches of 30 days each, I use 36 * 6 batches of 30 days each, each time sliding by 5 days? Or is this not really the way LSTMs should be used? (Note that there is quite a bit of seasonality in the data, so I need to catch that kind of long-term trend as well.)
Honestly, modeling such data is really hard. First of all, I wouldn't advise you to use LSTMs: they are designed to capture a somewhat different kind of data (e.g. NLP or speech, where it's really important to model long-term dependencies, not seasonality), and they need a lot of data to learn. I would rather advise you to use either GRU or SimpleRNN, which are much easier to train and should be better for your task.
When it comes to batching, I would definitely advise you to use a fixed-window technique, as it produces many more data points than feeding in a whole year or a whole month at once. Treat the number of days as a hyperparameter, to be optimized by trying different values in training and choosing the most suitable one.
When it comes to seasonality: of course it is present here, but:
you might have far too few data points and years collected to provide a good estimate of seasonal trends,
Using any kind of recurrent neural network to capture such seasonalities is a really bad idea.
What I advise you to do instead is:
try adding seasonal features (e.g. a month variable, a day variable, a variable that is true if there is a certain holiday that day, or how many days remain until the next important holiday - this is an area where you can be really creative),
use aggregated data from the previous year as features - you could, for example, feed last year's results, or aggregations of them such as a running average of last year's results, their maximum, minimum, etc. (see the sketch below).
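A minimal sketch of such feature engineering, assuming pandas, a DataFrame df with a daily DatetimeIndex, and illustrative names ('sales', holiday_dates):
import pandas as pd

df['month'] = df.index.month
df['dayofweek'] = df.index.dayofweek
df['is_holiday'] = df.index.isin(holiday_dates).astype(int)  # holiday_dates: your own calendar
# aggregated last-year features: the raw value and a 30-day running average
df['sales_last_year'] = df['sales'].shift(365)
df['sales_last_year_avg'] = df['sales'].shift(365).rolling(30).mean()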
Secondly, does it make sense to use return_sequences=True here? In other words, I keep my Y data as is (50, 1096, 3) so that (as far as I've understood it) there is a prediction at every time step for which a loss can be calculated against the target data? Or would I be better off with return_sequences=False, so that only the final value of each batch is used to evaluate the loss (i.e. if using yearly batches, then in 2016 for product 1, we evaluate against the Dec 2016 value of (1,1,1)).
Using return_sequences=True might be useful, but only in the following cases:
When a given LSTM (or another recurrent layer) is followed by yet another recurrent layer.
In a scenario where you feed a shifted copy of the original series as the output, whereby you are simultaneously learning the model over different time windows, etc.
The approach described in the second point might be interesting, but keep in mind that it might be a little hard to implement, as you will need to rewrite your model in order to obtain a production result. What also might be harder is that you'll need to test your model against many types of time instabilities - and such an approach might make this totally infeasible.
Thirdly, how should I deal with the 50 different products? They are different, but still strongly correlated, and we've seen with other approaches (for example an MLP with simple time windows) that the results are better when all products are considered in the same model. Some ideas that are currently on the table are:
change the target variable to be not just 3 variables, but 3 * 50 = 150; i.e. for each product there are three targets, all of which are trained simultaneously.
split up the results after the LSTM layer into 50 dense networks, which take as input the outputs from the LSTM, plus some features that are specific to each product - i.e. we get a multi-task network with 50 loss functions, which we then optimise together. Would that be crazy?
consider a product as a single observation, and include product-specific features already at the LSTM layer. Use just this one layer followed by an output layer of size 3 (for the three targets). Push through each product in a separate batch.
I would definitely go for the first choice, but before providing a detailed explanation I will discuss the disadvantages of the 2nd and 3rd ones:
In the second approach: it wouldn't be mad, but you will lose a lot of the correlations between product targets,
In the third approach: you'll lose a lot of interesting patterns occurring in the dependencies between different time series.
Before getting to my choice, let's discuss yet another issue: redundancies in your dataset. I guess that you have two kinds of features:
product-specific ones (let's say there are m of them),
general features (let's say there are n of them).
Now you have a table of size (timesteps, m + n, products). I would transform it into a table of shape (timesteps, products * m + n), since the general features are the same for all products. This will save you a lot of memory and also make it feasible to feed to a recurrent network (keep in mind that recurrent layers in Keras have only one feature dimension, whereas you had two: a product one and a feature one).
So why is the first approach the best in my opinion? Because it takes advantage of many interesting dependencies in the data. Of course, this might harm the training process, but there is an easy trick to overcome this: dimensionality reduction. You could e.g. train PCA on your 150-dimensional target vector and reduce its size to a much smaller one; your dependencies are then modeled by PCA and your output has a much more feasible size.
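A minimal sketch of that trick, assuming scikit-learn (the 30 components and the name y_reduced_pred are illustrative):
from sklearn.decomposition import PCA

pca = PCA(n_components=30)        # arbitrary reduced dimensionality
y_reduced = pca.fit_transform(y)  # y: (num_samples, 150) target matrix
# ... train the network to predict y_reduced instead of y ...
y_pred = pca.inverse_transform(y_reduced_pred)  # map predictions back to 150 dims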
Fourthly, how do I deal with validation data? Normally I would just keep out a randomly selected sample to validate against, but here we need to keep the time ordering in place. So I guess the best is to just keep a few months aside?
This is a really important question. From my experience, you need to test your solution against many types of instabilities in order to be sure that it works well. So here are a few rules you should keep in mind:
There should be no overlap between your training sequences and test sequences. If there is, values from the test set are effectively fed to the model during training,
You need to test the model's stability against many kinds of time dependencies.
The last point might be a little vague, so here are some examples:
year stability - validate your model by training it on each possible combination of two years and testing it on the held-out one (e.g. 2015 and 2016 against 2017, 2015 and 2017 against 2016, etc.) - this will show you how year-to-year changes affect your model,
future prediction stability - train your model on a subset of weeks/months/years and test it on the following week/month/year (e.g. train it on January 2015, January 2016 and January 2017 and test it using February 2015, February 2016 and February 2017 data, etc.),
month stability - train the model while keeping a certain month in the test set.
Of course, you could try yet other hold-outs.
Fifthly, and this is the part that is probably the most unclear to me - how can I use the actual results to perform predictions? Let's say I used return_sequences=False and I trained on all three years in three batches (each time up to Nov) with the goal of training the model to predict the next value (Dec 2014, Dec 2015, Dec 2016). If I want to use these results in 2017, how does this actually work? If I understood it correctly, the only thing I can do in this instance is to then feed the model all the data points for Jan to Nov 2017 and it will give me back a prediction for Dec 2017. Is that correct? However, if I were to use return_sequences=True, then trained on all data up to Dec 2016, would I then be able to get a prediction for Jan 2017 just by giving the model the features observed at Jan 2017? Or do I need to also give it the 12 months before Jan 2017? What about Feb 2017, do I in addition need to give the value for 2017, plus a further 11 months before that? (If it sounds like I'm confused, it's because I am!)
This depends on how you've built your model:
if you used return_sequences=True, you need to rewrite it with return_sequences=False, or just take the output and consider only the last step of the result,
if you used a fixed window, then you just need to feed the window preceding the prediction to the model (see the sketch below),
if you used a varying length, you can feed any timesteps preceding your prediction period that you want (though I advise you to feed at least the 7 preceding days).
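A minimal sketch of the fixed-window case, assuming NumPy, a series of shape (timesteps, features), and an arbitrary window length of 30:
import numpy as np

def make_windows(series, window):
    # stack all overlapping (window, features) slices for training
    return np.stack([series[i:i + window]
                     for i in range(len(series) - window + 1)])

window = 30
x_last = series[-window:][np.newaxis, ...]  # shape (1, window, features)
prediction = model.predict(x_last)          # predict from the last known window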
Lastly, depending on what structure I should use, how do I do this in Keras? What I have in mind at the moment is something along the following lines: (though this would be for only one product, so doesn't solve having all products in the same model)
Here, more info on what kind of model you've chosen is needed.
Question 1
There are several approaches for this problem. The one that you propose seems to be a sliding window.
But in fact you don't need to slice the time dimension: you can input all 3 years at once. You may slice the products dimension, in case your batch gets too big for memory or speed.
You can work with a single array with shape (products, time, features)
Question 2
Yes, it makes sense to use return_sequences=True.
If I understood your question correctly, you have y predictions for every day, right?
Question 3
That is really an open question. All approaches have their advantages.
But if you're considering putting all the product features together, given that these features are of a different nature, you should probably expand all possible features as if there were one big one-hot vector covering all features of all products.
If each product has independent features that apply only to itself, the idea of creating individual models for each product doesn't seem insane to me.
You might also think of making the product id a one-hot vector input and using a single model (see the sketch below).
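A minimal sketch of that single-model idea, using the Keras functional API (all names and sizes are illustrative):
from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

num_features, num_products, num_targets = 90, 50, 3  # illustrative sizes
feat_in = Input(shape=(None, num_features))
prod_in = Input(shape=(None, num_products))  # one-hot product id, repeated per timestep
x = Concatenate()([feat_in, prod_in])
x = LSTM(100, return_sequences=True)(x)
out = Dense(num_targets, activation='sigmoid')(x)  # 3 independent binary targets
model = Model([feat_in, prod_in], out)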
Question 4
Depending on which approach you choose, you may:
Split some products as validation data
Leave the final portion of time steps as validation data
Try a cross-validation method leaving different lengths for training and test (the longer the test data, the bigger the error, though; you might want to crop this test data to a fixed length)
Question 5
There may also be many approaches here.
There are approaches where you use sliding windows. You train your model for fixed time lengths.
And there are approaches where you train the LSTM layers with the entire length. In this case you'd first predict the entire known part, and then start predicting the unknown part.
My question: is the X data known for the period where you have to predict Y? Or is X also unknown in this period, so that you also have to predict X?
Question 6
I recommend you to take a look at this question and its answer: How to deal with multi-step time series forecasting in multivariate LSTM in keras
See also this notebook that manages to demonstrate the idea: https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb
In this notebook, though, I used an approach that puts X and Y as inputs. And we predict future X and Y.
You can try creating a model (if that's the case) only to predict X. Then a second model to predict Y from X.
In another case (if you already have all X data, no need to predict X), you can create a model that only predicts Y from X. (You'd still follow part of the method in the notebook, where you first predict the already known Y just to make your model get adjusted to where in the sequence it is, then you predict the unknown Y) -- This can be done in one single full-length X input (which contains the training X at the beginning and the test X at the end).
Bonus answer
Knowing which approach and which kind of model to choose is probably the exact answer needed to win the competition... so there isn't a best answer for this question; every competitor is trying to find it out.

Should my seq2seq RNN idea work?

I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the output of the prediction into the next input timestep, like this: [diagram omitted]
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace the last 3 elements of the input with zeros to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements, the last 3 of which are zeros. The predictions I care about are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it does not add anything to the model's learning, and it reduces the model's capacity, because you force it to learn an identity mapping for the first 47 steps (50 - 3). Note that providing 0 as input is equivalent to providing no input to an RNN: a zero input, after multiplication by the weight matrix, is still zero, so the only sources of information are the bias and the output from the previous timestep - both of which are already there in the original formulation. As for the second addition, where we have outputs for the first 47 steps: there is nothing to be gained by learning the identity mapping, yet the network has to "pay the price" for it - it needs to use weights to encode this mapping in order not to be penalised.
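A toy NumPy illustration of the zero-input point, using a SimpleRNN-style update with random weights:
import numpy as np

W, U, b = np.random.randn(4, 4), np.random.randn(4, 4), np.random.randn(4)
h_prev = np.random.randn(4)
x_zero = np.zeros(4)
h_next = np.tanh(W @ x_zero + U @ h_prev + b)        # W @ x_zero contributes nothing
assert np.allclose(h_next, np.tanh(U @ h_prev + b))  # only bias and h_prev remain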
So in short: yes, your idea will work, but it is nearly impossible to get better results this way compared to the original approach (you do not provide any new information and do not really modify the learning dynamics, yet you limit capacity by requiring an identity mapping to be learned per step; and since that is an extremely easy relation to learn, gradient descent will discover it first, before even trying to "model the future").

What machine learning algorithm for this simple optimisation? [closed]

I'll formulate a simple problem that I'd like to solve with machine learning (in R or similar platforms): my algorithm takes 3 parameters (a,b,c), and returns a score s in range [0,1]. The parameters are all categorical: a has 3 options, b has 4, and c has 10.
Therefore my dataset has 3 * 4 * 10 = 120 cases.
High scores are desirable (close to 1), low scores are not (close to 0).
Let's treat the algorithm as a black box, taking a, b, c and returning a score s.
The dataset looks like this:
a, b, c, s
------------------
a1, b1, c1, 0.223
a1, b1, c2, 0.454
...
If I plot the density of s for each parameter, I get very wide distributions, in which some cases perform very well (s > 0.8) and others badly (s < 0.2).
If I look at the cases where s is very high, I can't see any clear pattern.
Parameter values that overall perform badly can perform very well in combination with specific parameters, and vice versa.
To measure how well a specific value performs (e.g. a1), I compute the median:
median(mydataset[mydataset$a == "a1", "s"])
For example, median(a1) = 0.5 and median(b3) = 0.9, but when I combine them, I get a lower result: s(a1, b3) = 0.3.
On the other hand, median(a2) = 0.3 and median(b1) = 0.4, but s(a2, b1) = 0.7.
Given that there aren't parameter values that perform always well, I guess I should look for combinations (of 2 parameters) that seem to perform well together, in a statistically significant way (i.e. excluding outliers that happen to have very high scores).
In other words, I want to obtain a policy to make the optimal parameter choice, e.g. the best performing combinations are (a1,b3), (a2,b1), etc.
Now, I guess that this is an optimisation problem that can be solved using machine learning.
What standard techniques would you recommend in this context?
EDIT: somebody suggested a linear programming solution with glpk, but I don't understand how to apply linear programming to this problem.
The most standard technique for this question is linear regression. You can predict the score for specific parameter values; more generally, you obtain a function of your 3 parameters and then look for the combination that gives you the maximum value (see the sketch below).
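A rough sketch of that suggestion, assuming Python with pandas and scikit-learn and a DataFrame df with columns a, b, c, s as in the question (note that a plain linear model on one-hot features cannot capture the interactions the question describes unless you add interaction terms):
from itertools import product
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
X = enc.fit_transform(df[['a', 'b', 'c']])  # one-hot encode the categorical parameters
reg = LinearRegression().fit(X, df['s'])

# score every one of the 3 * 4 * 10 = 120 combinations and pick the best
combos = pd.DataFrame(list(product(df['a'].unique(), df['b'].unique(), df['c'].unique())),
                      columns=['a', 'b', 'c'])
scores = reg.predict(enc.transform(combos))
best = combos.iloc[scores.argmax()]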