I am dying to understand one question that I can not find any answer to:
When doing Conv1D on a multivariate time-series - is the KERNEL convolved across ALL dimensions or for each dimension individually? IS the size of the kernel [kernel_size x 1] or [kernel_size x num_dims]?
The thing is that I input a 800 by 10 time series into a Conv1D(filter =16,kernel_size=6)
And I get 800 by 16 as output, whereas I would expect to get 800 by 16 by 10 , because each time series dimension is convolved with the filter individually.
What is the case?
Edit: Toy example for discussion:
We have a 3 input channels, 800 time steps long. We have a kernel of 6 time steps width meaning the effective kernel dimensions are [3,1,6].
Each time step, 6 timesteps in each channel are convolved with the kernel. Then all the kernels elements are summed.
If this is correct - what is 1D about this convolution, if the image of the convolution operation is clearly 2-dimensional with [3 x 6] ?
When you convolve an "image" with multiple channels you sum across all the channels and then stack up filters you use to get a new "image" with (# of filters) channels. The thing that's a bit difficult for some people to understand is that the filter itself is actually (kernel_size x 1 x Number of channels). In other words your filters have depth.
So given that you're inputting this as a 800 x 1 "image" with 10 channels, you will end up with an 800 x 1 x 16 image, since you stack 16 filters. Of course the 1s aren't really important for conv1d and can be ignored, so tl;dr 800 x 6 -> 800 x 16 in this case.
Response to part 2:
We have a 3 input channels, 800 time steps long. We have a kernel of 6 time steps width meaning the effective kernel dimensions are [3,1,6].
This is essentially correct.
Each time step, 6 timesteps in each channel are convolved with the kernel. Then all the kernels elements are summed.
Yes, this is essentially correct. We end up with a slightly smaller image as we'll repeat this operation each time we slide the kernel over this timestep, giving us a 700 and something by 1 by 1 new image. We the repeat this operation # of filters times, and stack these on top of each other. This is still in the third dimension, so we end up with 7xx by 1 by (# of filters).
If this is correct - what is 1D about this convolution, if the image of the convolution operation is clearly 2-dimensional with [3 x 6] ?
For something to require Conv2d, it needs to have a 2nd dimension value greater than 1. For example, a color photograph might be 224 x 224 and have 3 color channels so it'd be 224 x 224 by 3.
Notably when we perform Conv2D, we also are sliding our kernel in an additional direction, for example, up and down. This is not required when you simply add more channels, since they're just added to the sum for that cell. Since we're only sliding on one axis in your example (time), we only need Conv1D.
Related
This question has been asked multiple times but still I could not get what I was looking for. Imagine
data=np.random.rand(N,N) #shape N x N
kernel=np.random.rand(3,3) #shape M x M
I know convolution typically means placing the kernel all over the data. But in my case N and M are of the orders of 10000. So I wish to get the value of the convolution at a specific location in the data, say at (10,37) without doing unnecessary calculations at all locations. So the output will be just a number. The main goal is to reduce the computation and memory expenses. Is there any inbuilt function that does this with minimal adjustments?
Indeed, applying the convolution for a particular position coincides with the mere sum over the entries of a (pointwise) multiplication of the submatrix in data and the flipped kernel itself. Here, is a reproducible example.
Code
N = 1000
M = 3
np.random.seed(777)
data = np.random.rand(N,N) #shape N x N
kernel= np.random.rand(M,M) #shape M x M
# Pointwise convolution = pointwise product
data[10:10+M,37:37+M]*kernel[::-1, ::-1]
>array([[0.70980514, 0.37426475, 0.02392947],
[0.24387766, 0.1985901 , 0.01103323],
[0.06321042, 0.57352696, 0.25606805]])
with output
conv = np.sum(data[10:10+M,37:37+M]*kernel[::-1, ::-1])
conv
>2.45430578
The kernel is being flipped by definition of the convolution as explained in here and was kindly pointed Warren Weckesser. Thanks!
The key is to make sense of the index you provided. I assumed it refers to the upper left corner of the sub-matrix in data. However, it can refer to the midpoint as well when M is odd.
Concept
A different example with N=7 and M=3 exemplifies the idea
and is presented in here for the kernel
kernel = np.array([[3,0,-1], [2,0,1], [4,4,3]])
which, when flipped, yields
k[::-1,::-1]
> array([[ 3, 4, 4],
[ 1, 0, 2],
[-1, 0, 3]])
EDIT 1:
Please note that the lecturer in this video does not explicitly mention that flipping the kernel is required before the pointwise multiplication to adhere to the mathematically proper definition of convolution.
EDIT 2:
For large M and target index close to the boundary of data, a ValueError: operands could not be broadcast together with shapes ... might be thrown. To prevent this, padding the matrix data with zeros can prevent this (although it blows up the memory requirement). I.e.
data = np.pad(data, pad_width=M, mode='constant')
I am looking to train a model with a cycle loss (similar to CycleGAN) on a different x/y paired dataset in each epoch. The aim is that, across many epochs, the model would be trained on many if not all of the admissible pairings of the elements of x with y.
E.g., suppose 2 tf.data datasets: x_tf_data and y_tf_data. Each element of x_tf_data can be paired with 1 or more elements of y_tf_data. E.g., the first element of x_tf_data can be paired with the first 10 elements of y_tf_data. This is given by a list of vectors denoted list_vectors such that list_vectors[0] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and list_vectors[i-1] are the y_tf_data elements that can be paired with the i'th element of x_tf_data.
In each epoch, the x/y pair presented to the model should be (potentially) different. E.g., in each epoch, the first element of x_tf_data can be paired with any of the first 10 elements of y_tf_data. This can be achieved by randomly selecting 1 element of list_vectors[i], for all i, in each epoch.
What may be a scalable solution?
After a lot of experimentation, what worked best was to create set of N tf.data Datasets in which each element of x was paired with a randomly chosen element of y, and then to sequentially concatenate the set of N Datasets to form one humungous Dataset. This Dataset was then saved to file and read into Keras. This achieved two goals. It helped the model converge more quickly because all the data did not change each epoch and it helped to ensure that a sufficient number of pairings were used for each element of x so as to get robust results.
I’m trying to understand the NASNet-A architecture in detail, but can’t match the parameter counts in the paper.
For example, the paper says CIFAR-10 NASNet-A “6 # 768” model has 3.3M params, but by my calculations a single “sep 5x5” primitive in the final cell should alone have 2.9M params… which can’t be right!
Here’s how I derive this count…
The “6 # 768” notation means the “number of filters in the penultimate layer of the network” is 768, which I assume means the number of filters in each of the primitive operations in the cell is 768, and therefore the output depth of the concat operation (with 5 block inputs) is 5 * 768. Since shape is only changed by reduction cells, the input to the final cell (concat output from prior normal cell) will also be of depth 5 * 768.
So for a 5x5 separable convolution with 5 * 768 input channels and 768 output channels, the number of parameters is:
5x5x1 * (5 * 768) = 96,00 params for the 5x5 depthwise filters, plus
1x1x(5 * 768) x 768 = 2,949,128 params for the 1x1 pointwise filters
Where am I going wrong?!
The amount of output channels from each operation of cell's block is according to the defined num_conv_filters. In example for CIFAR NASNet-A is 32, and it doubles after each Reduction Cell.
Although they mention they have B=5 blocks and no residual connection it seems they have 6 concatenated chunks of filters, the last seem to come from the previous layer.
See: https://github.com/tensorflow/models/blob/d07447a3e34bc66acd9ba7267437ebe9d15b45c0/research/slim/nets/nasnet/nasnet_utils.py#L309
This is why for example you have 192 feature depth in the first cell:
6*32=192.
You can take a look on the expected depths here:
https://github.com/tensorflow/models/blob/d07447a3e34bc66acd9ba7267437ebe9d15b45c0/research/slim/nets/nasnet/nasnet_test.py#L127
So for example, for the last 5x5 separable convolution you can get:
5x5*768 + 768*128 = 117504 parametes
For more info about the separable convolution:
http://forums.fast.ai/t/how-depthwise-separable-convolutions-work/4249
So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k- sequence length) different arrays and feed all of those to the network?
Assuming that I want to the first layer to be equivalent to the number of features per row, and looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50,27), and again do I have to manually split the dataset up or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the amount of steps for each sequence
Elements per step is how much info you have in each step of a sequence
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output having 1 output per step.
So I'm also assuming that your output is a sequence with also 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long, if it creates memory problems to you, maybe it will be a good idea to use a `stateful=True` RNN, but I'm explaining the non stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence with same length, we will use return_sequences=True. (Else, you'd get only one result).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train in smaller length sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000-49):
X.append(originalX[i:i+50]) #then take care of the 28th element
Y = #it seems you just exclude the first 49 ones from the original
I am trying to use the extract_glimpse function of tensorflow but I encounter some difficulties with the offset parameter.
Let's assume that I have a batch of one single channel 5x5 matrix called M and that I want to extract a 3x3 matrix of it.
When I call extract_glimpse([M], [3,3], [[1,1]], centered=False, normalized=False), it returns the result I am expecting: the 3x3 matrix centered at the position (1,1) in M.
But when I call extract_glimpse([M], [3,3], [[2,1]], centered=False, normalized=False), it doesn't return the 3x3 matrix centered at the position (2,1) in M but it returns the same as in the first call.
What is the point that I don't get?
The pixel coordinates actually have a range of 2 times the size (not documented - so it is a bug indeed). This is at least true in the case of centered=True and normalized=False. With those settings, the offsets range from minus the size to plus the size of the tensor. I therefore wrote a wrapper that is more intuitive to numpy users, using pixel coordinates starting at (0,0). This wrapper and more details about the problem are available on the tensorflow GitHub page.
For your particular case, I would try something like:
offsets1 = [-5 + 3,
-5 + 3]
extract_glimpse([M], [3,3], [offsets1], centered=True, normalized=False)
offsets2 = [-5 + 3 + 2,
-5 + 3]
extract_glimpse([M], [3,3], [offsets2], centered=True, normalized=False)