Pair an input tensor with different (randomly chosen) elements of the output tensor in each epoch - tensorflow

I am looking to train a model with a cycle loss (similar to CycleGAN) on a different x/y paired dataset in each epoch. The aim is that, across many epochs, the model would be trained on many if not all of the admissible pairings of the elements of x with y.
E.g., suppose two tf.data Datasets: x_tf_data and y_tf_data. Each element of x_tf_data can be paired with one or more elements of y_tf_data. E.g., the first element of x_tf_data can be paired with the first 10 elements of y_tf_data. This is given by a list of vectors, denoted list_vectors, such that list_vectors[0] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and, in general, list_vectors[i-1] holds the indices of the y_tf_data elements that can be paired with the i'th element of x_tf_data.
In each epoch, the x/y pair presented to the model should be (potentially) different. E.g., in each epoch, the first element of x_tf_data can be paired with any of the first 10 elements of y_tf_data. This can be achieved by randomly selecting 1 element of list_vectors[i], for all i, in each epoch.
What may be a scalable solution?

After a lot of experimentation, what worked best was to create a set of N tf.data Datasets in which each element of x was paired with a randomly chosen admissible element of y, and then to sequentially concatenate those N Datasets into one humongous Dataset. This Dataset was then saved to file and read back in for Keras. This achieved two goals: it helped the model converge more quickly, because the pairings did not change every epoch, and it helped ensure that a sufficient number of pairings were used for each element of x to get robust results.
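A minimal sketch of that approach, assuming in-memory NumPy arrays for x and y and the hypothetical list_vectors from the question (the commented-out save call uses tf.data.Dataset.save, available in newer TensorFlow versions; older versions have tf.data.experimental.save):

import numpy as np
import tensorflow as tf

# Hypothetical data: 100 x elements, 1000 y elements, 8 features each.
x = np.random.rand(100, 8).astype("float32")
y = np.random.rand(1000, 8).astype("float32")
# list_vectors[i] holds the indices of the y elements admissible for x element i.
list_vectors = [list(range(10 * i, 10 * i + 10)) for i in range(100)]

N = 20  # number of independently sampled pairings to concatenate

def sample_pairing():
    # For every x element, draw one admissible y index at random.
    y_idx = np.array([np.random.choice(v) for v in list_vectors])
    return tf.data.Dataset.from_tensor_slices((x, y[y_idx]))

big_dataset = sample_pairing()
for _ in range(N - 1):
    big_dataset = big_dataset.concatenate(sample_pairing())

# Optionally persist the concatenated Dataset and load it for training in Keras.
# big_dataset.save("paired_dataset")   # tf.data.experimental.save(...) on older versions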

Related

How to see the indices of the split on the data that GridSearchCV used when it made the split?

When using GridSearchCV() to perform a k-fold cross-validation analysis on some data, is there a way to know which data were used for each split?
For example, assume the goal is to build a binary classifier of your choosing, named 'model'. There are 100 data points (rows) with 5 features each and an associated 1 or 0 target. 20 of the 100 data points are held out for testing after training and hyperparameter tuning; GridSearchCV will never see those 20 data points. The other 80 rows are put into the estimator as X and Y, so GridSearchCV only sees 80 rows of data. Various hyperparameters are tuned and laid out in the param_grid variable. For this case the cross-validation parameter cv is assigned a value of 3, as shown:
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X, Y)
Is there a way to see which data was used as the training data and as the cross validation data for each fold? Maybe seeing which indices were used for the split?
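One way to see this, assuming a classifier and the default behaviour when cv is an integer (scikit-learn then uses StratifiedKFold with shuffle=False, so the splits are deterministic and reproducible), is to build the same splitter yourself and inspect the indices:

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=3)   # same splits GridSearchCV(cv=3) uses for a classifier
for fold, (train_idx, val_idx) in enumerate(cv.split(X, Y)):
    print(f"fold {fold}: train rows {train_idx}, validation rows {val_idx}")

# Passing this cv object explicitly also works and makes the splits unambiguous:
# grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv)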

dimension of a tensor created by tf.zeros(n)

I'm confused by the dimension of a tensor created with tf.zeros(n). For instance, if I write tf.zeros(6).eval().shape, this returns (6,). What dimension is this? Is this a matrix of 6 rows and an arbitrary number of columns? Or is this a matrix of 6 columns with an arbitrary number of rows?
weights = tf.random_uniform([3, 6], minval=-1, maxval=1, seed=1) - this is a 3x6 matrix
b = tf.zeros(6).eval() - I'm not sure what dimension this is.
Why am I able to add the two like weights+b? If I understand correctly, in order for the two to be added, b needs to have dimension 3x1.
Why am I able to add the two like weights+b?
The + operator is the same as using tf.add() (<obj>.__add__() calls tf.add() or tf.math.add()), and if you read the documentation, it says:
NOTE: math.add supports broadcasting. AddN does not. More about broadcasting here
Now I'm quoting from numpy broadcasting rules (which are the same for tensorflow):
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
they are equal, or
one of them is 1
So you're able to add two tensors with different shapes because they have the same trailing dimension. If you change the shape of your weights tensor to, let's say, [3, 5], you will get an InvalidArgumentError exception because the trailing dimensions differ.
(6,) is Python syntax for a tuple with 6 as its single element. Hence the shape here is a one-dimensional vector of length 6.
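A small sketch of the broadcasting behaviour described above (written against the TF 2 API, tf.random.uniform, rather than the question's tf.random_uniform):

import tensorflow as tf

weights = tf.random.uniform([3, 6], minval=-1, maxval=1, seed=1)   # shape (3, 6)
b = tf.zeros(6)                                                    # shape (6,)

# Trailing dimensions match (6 == 6), so b is broadcast across the 3 rows.
print((weights + b).shape)   # (3, 6)

# With shape (3, 5) the trailing dimensions differ (5 vs 6),
# so the same addition raises an InvalidArgumentError:
# tf.random.uniform([3, 5]) + b   # -> error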

Tensorflow/Keras find two most similar filters

I have a tensorflow/keras CNN. It has layers and some are Conv2D. In a given layer I want to efficiently find the two filters in the Conv2D that are most similar.
The layer.weights is a list, filter_count long, of filters each of shape (height, width, depth).
I want to compare filters by the difference, or maybe sqrt(diff^2), between each element in (height, width, depth), and then sum so the difference is a single float value.
If T1 is thelayer.weights[idx1] and T2 is thelayer.weights[idx2]
then the comparison is tf.sqrt(tf.reduce_sum(tf.squared_difference(T1, T2)))
I want to compare every filter to every other filter and take the 3 lowest differences. (The first one will always be zero, where T1 and T2 are the same tensor, i.e. a filter compared with itself.)
Obviously I can do nested loops, but that is neither functional nor nifty.
Is there some built in tensorflow or keras function to do this fast and possibly in the GPU?
It's not quite clear from your description, but I assume the shape of the weights is [filter_count, height, width, depth]. If filter_count is along a different axis, the arguments to "reduce_sum" will have to be modified accordingly.
You can use broadcasting to parallelize this process.
differences = tf.sqrt(
    tf.reduce_sum(
        tf.squared_difference(
            tf.expand_dims(thelayer.weights, 0),   # shape (1, filter_count, h, w, d)
            tf.expand_dims(thelayer.weights, 1),   # shape (filter_count, 1, h, w, d)
        ),
        axis=(-1, -2, -3),                         # sum over height, width and depth
    )
)
This will result in a tensor of shape [filter_count, filter_count] where element differences[i, j] measures the difference between filter weights i and j.
You can then filter to find the desired elements.
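For completeness, a sketch of that filtering step, assuming eager execution and that the layer is a Keras Conv2D whose kernel is stored as (height, width, depth, filter_count), hence the transpose to match the [filter_count, ...] layout assumed above; `thelayer` is the hypothetical layer from the question:

import numpy as np
import tensorflow as tf

kernel = thelayer.weights[0]                   # Keras Conv2D kernel: (h, w, depth, filter_count)
filters = tf.transpose(kernel, [3, 0, 1, 2])   # -> (filter_count, h, w, depth)

differences = tf.sqrt(
    tf.reduce_sum(
        tf.math.squared_difference(
            tf.expand_dims(filters, 0),
            tf.expand_dims(filters, 1),
        ),
        axis=(-1, -2, -3),
    )
)

# Mask the zero diagonal (each filter compared with itself), then take the smallest distance.
d = differences.numpy()
np.fill_diagonal(d, np.inf)
i, j = np.unravel_index(np.argmin(d), d.shape)
print(f"most similar filters: {i} and {j} (distance {d[i, j]:.4f})")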

Understanding multidimensional full covariance of normal multivariate distribution in TensorFlow

Suppose I have, say, 3 identically distributed random vectors: w, v and x generally with different lengths. w is length 2, v is length 3 and x is length 4.
How should I define the full covariance matrix sigma of these vectors for tf.contrib.distributions.MultivariateNormalFullCovariance(mean, sigma)?
I think about full covariance in this case as a [(2 + 3 + 4) x (2 + 3 + 4)] square matrix (tensor of rank 2), where the diagonal elements are variances and the off-diagonal elements are cross-covariances between the components of the different vectors. How can I switch my mind to the terms of multidimensional covariance? What is it?
Or should I build the full covariance matrix by concatenating it from pieces (e.g. particular covariances and, for instance, assuming independence of these vectors, building a partitioned block-diagonal matrix) and cut (split) the results of sampling into the particular vectors I want to get? (I did that with R.) Or is there an easier way?
What I want is full control over all random vectors including their covariances and cross-covariances.
There is no special consideration about the dimensionality just because your random variables are distributed across multiple vectors. From a probabilistic point of view, three normally-distributed vectors of sizes 2, 3 and 4, a normally-distributed vector of size 9 and a normally-distributed matrix of size 3x3 are all the same: a 9-dimensional normal distribution. Of course, you could have three distributions of 2, 3 and 4 dimensions, but that's a different thing: it doesn't allow you to model correlations among variables of different vectors (just like having a one-dimensional normal distribution per number does not allow you to model any correlation at all); this may or may not be enough for your use case.
If you want to use a single distribution, you just need to establish a bijection between the domain of your problem (e.g. tuples of three vectors of sizes 2, 3 and 4) and the domain of the distribution (e.g. 9-dimensional vectors). In this case it is pretty obvious: just flatten (if necessary) and concatenate the vectors to obtain a distribution sample, and split a sample into three parts of sizes 2, 3 and 4 to obtain the vectors.
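A rough sketch of that bijection, using the tf.contrib API named in the question (in current code tfp.distributions.MultivariateNormalFullCovariance is the equivalent); the identity blocks below simply assume the three vectors are independent, and any valid 9x9 positive-definite matrix could be used instead to model cross-covariances:

import numpy as np
import tensorflow as tf

sizes = [2, 3, 4]                 # lengths of w, v and x
dim = sum(sizes)                  # the joint distribution is 9-dimensional

mean = np.zeros(dim, dtype=np.float32)

# Block-diagonal covariance: each block is the covariance of one vector,
# and the zero off-diagonal blocks encode independence between the vectors.
sigma = np.zeros((dim, dim), dtype=np.float32)
offset = 0
for s in sizes:
    sigma[offset:offset + s, offset:offset + s] = np.eye(s, dtype=np.float32)
    offset += s

dist = tf.contrib.distributions.MultivariateNormalFullCovariance(mean, sigma)
samples = dist.sample(5)          # shape (5, 9)

# Split each 9-dimensional sample back into the three vectors.
w, v, x = tf.split(samples, sizes, axis=-1)   # shapes (5, 2), (5, 3), (5, 4)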

Setting up the input on an RNN in Keras

So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model, with the rest being padded with 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows, each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k - sequence length) different arrays and feed all of those to the network?
Assuming that I want the first layer to be equivalent to the number of features per row, and that I'm looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50, 27)? And again, do I have to manually split the dataset up, or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the amount of steps for each sequence
Elements per step is how much info you have in each step of a sequence
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output, with 1 output per step.
So I'm also assuming that your output is a sequence, also with 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long; if it creates memory problems for you, it may be a good idea to use a `stateful=True` RNN, but I'm explaining the non-stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27:] #targets (keep the last axis so the shape is (1, 550000, 1))
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence of the same length, we will use return_sequences=True. (Else, you'd get only one result per sequence).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
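As an illustration only (the cell count of 32 and the optimizer/loss are hypothetical choices, not something prescribed by the question), a stacked non-stateful model matching the shapes above might look like:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

model = Sequential([
    LSTM(32, input_shape=(550000, 27), return_sequences=True),
    LSTM(1, return_sequences=True),   # last layer has 1 cell to match the 1 target per step
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, Y_train, epochs=...)   # X_train: (1, 550000, 27), Y_train: (1, 550000, 1)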
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train into smaller-length sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
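For reference, a minimal sketch of the stateful variant (the chunk length of 1000 steps is an arbitrary hypothetical choice; the single 550k-step sequence would be cut into consecutive chunks, fed in order, and the state reset between epochs):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

chunk_steps = 1000   # hypothetical reduced length per batch

model = Sequential([
    # batch size is fixed to 1 because there is only one (very long) sequence
    LSTM(32, batch_input_shape=(1, chunk_steps, 27),
         return_sequences=True, stateful=True),
    LSTM(1, return_sequences=True, stateful=True),
])
model.compile(optimizer="adam", loss="mse")

# for epoch in range(epochs):
#     for X_chunk, Y_chunk in chunks:   # shapes (1, 1000, 27) and (1, 1000, 1)
#         model.train_on_batch(X_chunk, Y_chunk)
#     model.reset_states()              # start the next pass from a clean state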
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000 - 49):
    X.append(originalX[i:i+50])   #then take care of the 28th element (the target column)
Y = originalX[49:, 27]            #it seems you just exclude the first 49 ones from the original (assuming the 28th column of originalX holds the targets)
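For what it's worth, newer TensorFlow versions ship a utility that builds these sliding windows for you; a hedged sketch, assuming originalX is the 550000 x 28 array from the question and that each 50-step window predicts the target at its last step:

import tensorflow as tf

features = originalX[:, :27]     # 27 input features per step
targets = originalX[49:, 27]     # target aligned with the last step of each 50-step window

dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=features,
    targets=targets,
    sequence_length=50,
    batch_size=128,
)
# Each batch yields X of shape (128, 50, 27) and Y of shape (128,).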