The PennTree Bank data seems difficult to understand. Below are two links
https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
https://www.tensorflow.org/tutorials/recurrent
My concern is as follows. The reader gives me around a million occurrences of 10000 words. I have written code to convert this data set into one-hot encoding. Thus, I have a million vectors of 10000 dimensions, each vector having a 1 at a single location. Now, I want to train a LSTM (long short term memory) model on this for prediction.
For simplicity let us assume that there are 30 (and not a million occurences) occurences, and the sequence length is equal to 10 for the LSTM (the number of timesteps that it unrolls). Let us denote these occurences by
X1,X2,....,X10,X11,...,X20,X21,...X30
Now, my concern is that should I use 3 data samples for training
X1,..X10 and X11,..,X20, and X21,..X30
or should I use 20 data samples for training
X1,..X10 and X2,...,X11, and X3,..,X12, so on until X21,..,X30
In case I go with the latter, then am I not breaking the i.i.d. assumption of training data sequence generations?
Related
In my first TensorFlow project, I have a big dataset (1M elements) which contains 8 categories of elements, with each category, has a different number of elements of course. I want to split the big dataset into 10 exclusive small datasets, with each of them having approximately 1/10 of each category. (This is for 10-fold cross-validation purposes.)
Here is how I do.
I wind up having 80 datasets, with each category having 10 small datasets, then I randomly sample data from 80 of them by using sample_from_datasets. However, after some steps, I met a lot of warning saying "DirectedInterleave selected an exhausted input:36" where 36 can be some other integer numbers.
The reason I want to do sample_from_datasets is that I tried to do shuffle the original dataset. Even though shuffle only 0.4 x total elements, it still takes a long long time to finish (about 20mins).
My questions are
1. based on my case, any good advice on how to structure the datasets?
2. is it normal to have a long shuffling time? any better solution for shuffling?
3. why do I get this DirectIngerleave selected an exhausted input: warning? and what does it mean?
thank you.
Split your whole datasets into Training, Testing and Validation categories. As you have 1M data, you can split like this: 60% training, 20% testing and 20% validation. Splitting of datasets is completely up to you and your requirements. But normally maximum data is used for training the model. Next, the rest of the datasets can be used for testing and validation.
As you have ten class datasets, split each category into Training, Testing and Validation categories.
Let, you have A, B, C and D categories data. Split your data "A", "B", "C", and "D" like below:
'A'- 60 % for training 20% testing and 20% validation
'B'- 60 % for training 20% testing and 20% validation
'C'- 60 % for training 20% testing and 20% validation
'D'- 60 % for training 20% testing and 20% validation
Finally merge all the A, B, C and D training, testing and validation datasets.
After training a RNN does it makes sense to save the final state so that it is then the initial state for testing?
I am using:
stacked_lstm = rnn.MultiRNNCell([rnn.BasicLSTMCell(n_hidden,state_is_tuple=True) for _ in range(number_of_layers)], state_is_tuple=True)
The state has a very specific meaning and purpose. This isn't a question of "advisable" or not, there's a right and wrong answer here, and it depends on your data.
Consider each timestep in your sequence of data. At the first time step your state should be initialized to all zeros. This value has a specific meaning, it tells the network that this is the beginning of your sequence.
At each time step the RNN is computing a new state. The MultiRNNCell implementation in tensorflow is hiding this from you, but internally in that function a new hidden state is computed at each time step and passed forward.
The value of state at the 2nd time step is the output of the state at the 1st time step, and so on and so forth.
So the answer to your question is yes only if the next batch is continuing in time from the previous batch. Let me explain this with a couple of examples where you do, and don't perform this operation respectively.
Example 1: let's say you are training a character RNN, a common tutorial example where your input is each character in the works of Shakespear. There are millions of characters in this sequence. You can't train on a sequence that long. So you break your sequence into segments of 100 (if you don't know why to do otherwise limit your sequences to roughly 100 time steps). In this example, each training step is a sequence of 100 characters, and is a continuation of the last 100 characters. So you must carry the state forward to the next training step.
Example 2: where this isn't use would be in training an RNN to recognize MNIST handwritten digits. In this case you split your image into 28 rows of 28 pixels and each training has only 28 time steps, one per row in the image. In this case each training iteration starts at the beginning of the sequence for that image and trains fully until the end of the sequence for that image. You would not carry the hidden state forward in this case, your hidden state must start with zero's to tell the system that this is the beginning of a new image sequence, not the continuation of the last image you trained on.
I hope those two examples illustrate the important difference there. Know that if you have sequence lengths that are very long (say over ~100 timesteps) you need to break them up and think through the process of carrying forward the state appropriately. You can't effectively train on infinitely long sequence lengths. If your sequence lengths are under this rough threshold then you won't worry about this detail and always initialize your state to zero.
Also know that even though you only train on say 100 timesteps at a time the RNN can still be expected to learn patterns that operate over longer sequences, Karpathy's fabulous paper/blog on "The unreasonable effectiveness of RNNs" demonstrates this beautifully. Those character level RNNs can keep track of important details like whether a quote is open or not over many hundreds of characters, far more than were ever trained on in one batch, specifically because the hidden state was carried forward in the appropriate manner.
I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the output of the prediction into the next input timestep like this:
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace last 3 elements of the input by zero to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements. The last 3 are zeros. The predictions I care are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it does not add anything to the model's learning, and it reduces its capacity, as you force your model to learn identity mapping for first 47 steps (50-3). Note, that providing 0 as inputs is equivalent of not providing input for an RNN, as zero input, after multiplying by a weight matrix is still zero, so the only source of information is bias and output from previous timestep - both are already there in the original formulation. Now second addon, where we have output for first 47 steps - there is nothing to be gained by learning the identity mapping, yet network will have to "pay the price" for it - it will need to use weights to encode this mapping in order not to be penalised.
So in short - yes, your idea will work, but it is nearly impossible to get better results this way as compared to the original approach (as you do not provide any new information, do not really modify learning dynamics, yet you limit capacity by requesting identity mapping to be learned per-step; especially that it is an extremely easy thing to learn, so gradient descent will discover this relation first, before even trying to "model the future").
I want to train a DNN model by training data with more than one billion feature dimensions. So the shape of the first layer weight matrix will be (1,000,000,000, 512). this weight matrix is too large to be stored in one box.
By now, is there any solution to deal with such large variables, for example partition the large weight matrix to multiple boxes.
Update:
Thanks Olivier and Keveman. let me add more detail about my problem.
The example is very sparse and all features are binary value: 0 or 1. The parameter weight looks like tf.Variable(tf.truncated_normal([1 000 000 000, 512],stddev=0.1))
The solutions kaveman gave seem reasonable, and I will update results after trying.
The answer to this question depends greatly on what operations you want to perform on the weight matrix.
The typical way to handle such a large number of features is to treat the 512 vector per feature as an embedding. If each of your example in the data set has only one of the 1 billion features, then you can use the tf.nn.embedding_lookup function to lookup the embeddings for the features present in a mini-batch of examples. If each example has more than one feature, but presumably only a handful of them, then you can use the tf.nn.embedding_lookup_sparse to lookup the embeddings.
In both these cases, your weight matrix can be distributed across many machines. That is, the params argument to both of these functions is a list of tensors. You would shard your large weight matrix and locate the shards in different machines. Please look at tf.device and the primer on distributed execution to understand how data and computation can be distributed across many machines.
If you really want to do some dense operation on the weight matrix, say, multiply the matrix with another matrix, that is still conceivable, although there are no ready-made recipes in TensorFlow to handle that. You would still shard your weight matrix across machines. But then, you have to manually construct a sequence of matrix multiplies on the distributed blocks of your weight matrix, and combine the results.
All,
I am doing Bayesian modeling using rjags. However, when the number of observation is larger than 1000. The graph size is too big.
More specifically, I am doing a Bayesian ranking problem. Traditionally, one observation means one X[i, 1:N]-Y[i] pair, where X[i, 1:N] means the i-th item is represented by a N-size predictor vector, and Y[i] is a response. The objective is to minimize the point-wise error of predicted values,for example, least square error.
A ranking problem is different. Since we more care about the order, we use a pair-wise 1-0 indicator to represent the order between Y[i] and Y[j], for example, when Y[i]>Y[j], I(i,j)=1; otherwise I(i,j)=0. We treat this 1-0 indicator as an observation. Therefore, assuming we have K items: Y[1:K], the number of indicator is 0.5*K*(K-1). Hence when K is increased from 500 to 5000, the number of observations is very large, i.e. from 500^2 to 5000^2. The garph size of the rjags model is large too, for example graph size > 500,000. And the log-posterior will be very small.
And it takes a long time to complete the training. I think the consumed time is >40 hours. It is not practical for me to do further experiment. Therefore, do you have any idea to speed up the rjags. I heard that the RStan is faster than Rjags. Any one who has similar experience?