Basically, I'm looking for the data structure Eigen uses when packing a matrix. For a convolution, Eigen packs the matrix and then performs the multiplication. Other optimized backends such as BLAS and DNN libraries also use packed convolution, but they do additional work during packet processing. When I run a U-Net model on Eigen and on the other backends, Eigen shows no data-structure overhead while the others do.
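To show what I mean by "packing", here is my rough mental model as a toy NumPy sketch: tiles of the operands are copied into small contiguous buffers before the actual multiplication. The block sizes and the function below are placeholders I made up for illustration, not Eigen's real internals (which is exactly what I'm asking about).

import numpy as np

def packed_matmul(A, B, mc=64, kc=64, nc=64):
    """Illustrative blocked GEMM: copy ("pack") tiles of A and B into
    contiguous buffers before multiplying them. Block sizes are arbitrary
    placeholders, not Eigen's real choices."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for k0 in range(0, K, kc):
        for m0 in range(0, M, mc):
            # Pack a panel of A into a contiguous buffer.
            A_pack = np.ascontiguousarray(A[m0:m0 + mc, k0:k0 + kc])
            for n0 in range(0, N, nc):
                # Pack a panel of B likewise.
                B_pack = np.ascontiguousarray(B[k0:k0 + kc, n0:n0 + nc])
                # Multiply the packed panels and accumulate into C.
                C[m0:m0 + mc, n0:n0 + nc] += A_pack @ B_pack
    return C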
I have a seasonal timeseries dataset containing 3 target variables and n feature variables. I am trying to apply a PCA algorithm before feeding the data to a simple LSTM.
The operations I do are the following:
1. Split into train, validation, and test sets
2. Standard-scale (force mean = 0 and std = 1) the train dataset (including target and features)
3. Apply PCA to the features of the train dataset only
4. Transform the feature variables of the validation and test sets through the PCA matrix fitted in step 3
Where I get lost: what do I do with the target variables of the validation and test sets?
... more neural networks pre-processing and building the architecture of the LSTM
My question is: how do I scale / normalize the target variables? Through a PCA too, or through an independent scaler (standard, mapminmax, etc.)? If I leave the original target values, I get overfitting in my LSTM.
The most disappointing part is that without the PCA, the LSTM I've built shows no overfitting.
Thanks a lot for your help!
I know this comes late...
As far as I know, you should not apply PCA to the target variables; PCA is used to reduce the dimensionality of the feature variables.
Just as you applied the PCA transformation fitted on the train dataset to the other splits, you can do the same with the scaler you used: fit it on the train data and reuse it to transform validation and test.
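A minimal sketch of that recipe with scikit-learn (the data below is random placeholder data and the names are mine): fit the scalers and the PCA on the train split only, reuse them on validation and test, and give the targets their own scaler instead of passing them through PCA.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Placeholder data with the same shape conventions as the question:
# n feature columns, 3 target columns, chronological split.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=(1000, 3))
X_train, X_val, X_test = X[:700], X[700:850], X[850:]
y_train, y_val, y_test = y[:700], y[700:850], y[850:]

# Fit every transform on the train split only.
feat_scaler = StandardScaler().fit(X_train)
targ_scaler = StandardScaler().fit(y_train)   # separate scaler for targets, no PCA
pca = PCA(n_components=0.95).fit(feat_scaler.transform(X_train))

def prepare(X_split, y_split):
    """Reuse the train-fitted transforms on any split."""
    return pca.transform(feat_scaler.transform(X_split)), targ_scaler.transform(y_split)

X_train_p, y_train_p = prepare(X_train, y_train)
X_val_p, y_val_p = prepare(X_val, y_val)
X_test_p, y_test_p = prepare(X_test, y_test)
# Predictions come back scaled; use targ_scaler.inverse_transform(...) to evaluate.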
This question comes from the fact that I've never had to debug my models in TF this deeply.
I'm running a variational inference with a full-rank Gaussian approximation using Tensorflow Probability. I noticed my optimization often explodes. Here is my loss curve.
I suspect numerical issues, since otherwise the losses and the optimization process look reasonable and I don't observe any NaNs.
I use tfp.distributions.MultivariateNormalTriL with a covariance parameter transformed by tfp.bijectors.FillScaleTriL with the default diagonal shift. The condition number of the covariance matrix is reasonable. The variational inference is performed with the fit_surrogate_posterior function.
I optimize with SGD with momentum, using 10 samples per iteration.
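For reference, a stripped-down sketch of the kind of setup I mean (the target distribution and all the names below are placeholders; only the TFP components mentioned above are the real ones I use):

import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

dim = 10  # placeholder dimensionality

# Toy target: a fixed MVN standing in for the real model's posterior.
target = tfd.MultivariateNormalDiag(loc=tf.zeros(dim),
                                    scale_diag=tf.linspace(0.1, 2.0, dim))

# Full-rank Gaussian surrogate: loc vector plus a lower-triangular scale
# built with FillScaleTriL (default diagonal shift), as described above.
loc = tf.Variable(tf.zeros(dim), name="loc")
scale_tril = tfp.util.TransformedVariable(
    tf.eye(dim), bijector=tfb.FillScaleTriL(), name="scale_tril")
surrogate = tfd.MultivariateNormalTriL(loc=loc, scale_tril=scale_tril)

losses = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=target.log_prob,
    surrogate_posterior=surrogate,
    optimizer=tf.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    num_steps=500,
    sample_size=10)  # 10 samples per iteration, as in my setup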
Internally, in the TensorFlow Probability source code, the minimization objective uses a gradient tape:
with tf.GradientTape(watch_accessed_variables=trainable_variables is None) as tape:
    for v in trainable_variables or []:
        tape.watch(v)
    loss = loss_fn()
In order to solve my issue, I would like to see the gradient flowing through every operation.
My question is: how can I get more insight into which operation's gradient is exploding? How can I get the value of the gradient at every tensor?
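To make that concrete, here is the sort of inspection I have in mind (a rough sketch only; the loss function and variables below are made-up stand-ins, not my actual model): read per-variable gradients from a tape, and optionally turn on TensorFlow's numeric checking so the first op producing an Inf/NaN is reported.

import tensorflow as tf

# Raise as soon as any op produces an Inf/NaN, pointing at the offending op.
tf.debugging.enable_check_numerics()

def inspect_gradients(loss_fn, variables):
    """Report per-variable gradient magnitudes for one step (placeholder helper)."""
    with tf.GradientTape() as tape:
        loss = loss_fn()
    grads = tape.gradient(loss, variables)
    for v, g in zip(variables, grads):
        peak = float("nan") if g is None else float(tf.reduce_max(tf.abs(g)))
        print(f"{v.name:>12s}  max|grad| = {peak:.3e}")
    return grads

# Example with made-up variables and loss:
w = tf.Variable(tf.ones([3]), name="w")
b = tf.Variable(0.0, name="b")
inspect_gradients(lambda: tf.reduce_sum(tf.exp(10.0 * w)) + b, [w, b])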
And if any of you faced a similar issue:
Is there a better way to prevent instabilities in the covariance matrix optimization?
Detailed explanations:
I observed that this explosion is caused by one parameter (though it is not always the same parameter that explodes). This can be checked simply by comparing the covariance matrix from two iterations before the explosion with the one from one iteration before the point where the loss explodes.
Note the last parameter. When I run the same optimization multiple times, it might happen that one of the "small" parameters (rows from 9 to the last) explodes at some point.
Thanks,
Mateusz
In my code for my convolutional neural network there is a step which involves a tensor contraction along 3 dimensions; in NumPy it looks like this (however I am planning on using raw BLAS):
y = np.einsum('abcijk,ijkd->abcd', x, f)
which is symbolic for y[a,b,c,d] = sum over i,j,k of x[a,b,c,i,j,k] * f[i,j,k,d].
Generally, CNN routines tackle large tensor contractions by converting higher-dimensional tensors into 2-D "images" and then using regular matrix multiplication; the cost and redundancy of the initial dimension reduction are offset by leveraging the speed of GEMM and other well-optimised matrix multiplication routines.
My method involves skipping the dimension reduction stage, and the only real computational effort required is in the tensor multiplication of X and F. Do there exist similarly fast routines for contracting the inner indices of tensors? How can I whittle this down to something that is well-established in BLAS etc.? The fact that only inner indices are being contracted suggests to me that there should be a way to take advantage of the locality of reference via the natural traversal order, rather than using generic routines such as "GETT".
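For what it's worth, my current understanding is that because (i, j, k) are the trailing axes of x and the leading axes of f, in row-major (C) layout this contraction collapses to a single ordinary GEMM on reshaped views, with no extra data movement; in BLAS terms it is one GEMM call with M = a*b*c, K = i*j*k, N = d. A NumPy sketch with made-up sizes:

import numpy as np

# Placeholder sizes; the real ones come from the network.
a, b, c, i, j, k, d = 4, 5, 6, 3, 3, 2, 8
x = np.random.rand(a, b, c, i, j, k)
f = np.random.rand(i, j, k, d)

# Both reshapes below are views on C-contiguous data (no copy), so the
# contraction over the inner indices is a single matrix multiplication.
y = x.reshape(a * b * c, i * j * k) @ f.reshape(i * j * k, d)
y = y.reshape(a, b, c, d)

# Sanity check against the einsum formulation.
assert np.allclose(y, np.einsum('abcijk,ijkd->abcd', x, f))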
We are trying to find a way to implement stochastic gradient descent (or coordinate gradient descent) on a very, very large and very sparse least-squares problem. That is to say, for the standard least-squares problem:
min_x ||y - Ax||^2
The matrix A has on the order of a billion rows and a billion columns, but it is 99% sparse. Similarly, the coefficient vector x is about 98% sparse, and the observation vector y is about 97% sparse.
However, although I know that TensorFlow has a stochastic gradient descent optimizer, it's unclear to me whether its gradient descent functions accept sparse matrices/vectors. And even if they do, it's unclear whether these libraries eventually convert the sparse matrices and vectors into dense representations behind the scenes, which would defeat the point of supplying the data in sparse format in the first place. If TensorFlow's gradient descent converts everything to dense format, we would blow up memory and also blow up performance, since we would need roughly 100 * 100 times more computation to handle all those zero values.
If the SGD algorithms implicitly convert sparse matrices and vectors to their dense counterparts, how hard would it be to change that logic to sparse-only logic? That is, could a reasonably knowledgeable Python/C++ engineer (with some TensorFlow knowledge) do it, or would it require a major architectural rewrite?
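For what it's worth, one way to stay sparse without relying on the built-in optimizers' handling of sparse inputs is to compute the least-squares gradient A^T(Ax - y) explicitly with tf.sparse ops and apply it manually. A toy sketch (the sizes and data are placeholders, not a recipe tuned for a billion-by-billion problem):

import tensorflow as tf

# Toy sparse system: A is m x n with a handful of non-zeros (placeholder sizes).
m, n = 1000, 500
indices = [[0, 1], [3, 2], [7, 499], [42, 0], [999, 250]]
values = tf.constant([1.0, -2.0, 0.5, 3.0, 1.5])
A = tf.sparse.SparseTensor(indices=indices, values=values, dense_shape=[m, n])
A = tf.sparse.reorder(A)
y = tf.random.normal([m, 1])

x = tf.Variable(tf.zeros([n, 1]))
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

for step in range(100):
    # Residual r = Ax - y and gradient g = A^T r, both via sparse-dense
    # matmuls, so A is never densified.
    r = tf.sparse.sparse_dense_matmul(A, x) - y
    g = tf.sparse.sparse_dense_matmul(A, r, adjoint_a=True)
    opt.apply_gradients([(g, x)])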
I have a large data set (~30 million data points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is to generate a generalized sequence, similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels in the next video frame from the pixels in the current video frame, in order to generate a new sequence of frames that approximates the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had in mind an RNN / LSTM trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering if a convolutional approach would be more suitable.
Note, I have not actually used TensorFlow beyond some simple tutorials, but I have previously used OpenCV ML functions. Please consider me a novice in your responses.
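For example, the kind of dense representation I'm imagining is something like the sketch below, where each time-step's active cluster IDs are mapped through a learned embedding table and averaged into one dense vector (all sizes, names, and the stand-in data are placeholders I made up):

import numpy as np
import tensorflow as tf

n_clusters = 200_000   # number of K-means clusters (vocabulary size)
max_active = 200       # at most 200 clusters present in any time-step
embed_dim = 64         # placeholder embedding size

# Stand-in data: two time-steps, each given as its list of active cluster IDs.
active_ids = np.random.randint(0, n_clusters, size=(2, max_active))

# Learnable dense representation per cluster; a time-step becomes the mean of
# the embeddings of its active clusters (a "bag of clusters").
embedding = tf.keras.layers.Embedding(input_dim=n_clusters, output_dim=embed_dim)
dense_steps = tf.reduce_mean(embedding(active_ids), axis=1)  # shape (2, embed_dim)
print(dense_steps.shape)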
Thank you.