Working with sparse matrices in numpy and sklearn - numpy

I have a time-series dataset generated from some electrophysiological data. It is a frequency dataset and the matrix is quite sparse but huge: it contains 0.005 s time bins for 2000+ neurons recorded over an hour. I am using this to train regressions in sklearn, and I was wondering if there are ways to represent the matrix more efficiently to speed up my code? Much of the data is taken up by 0 values.
Specifically, I will be using these two functions on the matrix:
https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html
Will scipy.sparse matrices work with these sklearn functions, and will training be faster using scipy.sparse or other options?
https://docs.scipy.org/doc/scipy/reference/sparse.html
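
As a minimal sketch (toy random data, not the actual spike matrix), ElasticNetCV is documented to accept a scipy.sparse CSR matrix for X directly; whether CCA accepts sparse input would need to be checked separately against its docs:

import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import ElasticNetCV

# Toy stand-in: 1,000 time bins x 2,000 neurons, ~0.5% non-zero entries
X = sp.random(1000, 2000, density=0.005, format="csr", dtype=np.float64)
y = np.random.randn(1000)

# ElasticNetCV's fit is documented to accept sparse matrices for X
model = ElasticNetCV(cv=5)
model.fit(X, y)
print(model.alpha_)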

Related

Tensorflow Sparse Tensor operation too slow

I'm trying to convert a scipy sparse matrix to a TensorFlow SparseTensor using the code below:
import numpy as np
import tensorflow as tf

coo = norm_adj_mat.tocoo().astype(np.float32)  ## norm_adj_mat is the scipy CSR matrix
indices = np.mat([coo.row, coo.col]).transpose()  ## (nnz, 2) array of (row, col) coordinates
A_tilde = tf.SparseTensor(indices, coo.data, coo.shape)
My original matrix is very large (>1 million rows and columns) and the conversion takes forever (>20 hours). It works fine with a toy matrix. Any inputs on how to speed up this step?
I'm using tensorflow 2.9.1 & scipy 1.9.
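
Not a definitive fix, but one variant worth timing (my own sketch, not from the post): build the index array with np.column_stack instead of np.mat and cast it to int64 up front, so only plain ndarrays are handed to tf.SparseTensor:

import numpy as np
import tensorflow as tf

coo = norm_adj_mat.tocoo().astype(np.float32)
# Build the (nnz, 2) index array without np.mat, with an explicit int64 dtype
indices = np.column_stack((coo.row, coo.col)).astype(np.int64)
A_tilde = tf.SparseTensor(indices, coo.data, coo.shape)
A_tilde = tf.sparse.reorder(A_tilde)  # ensure canonical row-major ordering of the indices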

How to train large dataset on tensorflow 2.x

I have a large dataset with about 2M rows and 6,000 columns. The input numpy arrays (X, y) hold the training data fine, but when it gets to model.fit() I get a GPU out-of-memory error. I am using tensorflow 2.2. According to its manual, model.fit_generator has been deprecated and model.fit is preferred.
Can someone outline the steps for training large datasets with tensorflow v2.2?
The best solution is to use tf.data.Dataset, which lets you batch your data easily with the .batch() method.
There are plenty of tutorials available; you may want to use from_tensor_slices() to work directly with numpy arrays.
Below are two excellent pieces of documentation to suit your needs, followed by a minimal sketch.
https://www.tensorflow.org/tutorials/load_data/numpy
https://www.tensorflow.org/guide/data
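
A minimal sketch of that approach (toy array sizes and a made-up model, not the asker's actual 2M x 6,000 data), assuming a Keras model compiled for regression:

import numpy as np
import tensorflow as tf

# Toy stand-in for the real (2M x 6,000) training data
X = np.random.rand(10_000, 6_000).astype(np.float32)
y = np.random.rand(10_000, 1).astype(np.float32)

# Stream the data to the GPU one batch at a time instead of all at once
dataset = (tf.data.Dataset.from_tensor_slices((X, y))
           .shuffle(10_000)
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(6_000,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(dataset, epochs=5)  # model.fit consumes the batched Dataset directly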

Multiplying sparse matrices in tensorflow

I have a very large sparse 2D matrix (200k x 100k) (sparsity around 0.002), which I would need to multiply with itself (after transposition).
A = 200k x 100k
and I need to calculate:
A x A.transpose()
The problem is, that in Tensorflow it seems there is no sparse matrix to sparse matrix multiplication.
tf.sparse_tensor_dense_matmul() requires one side to be turned into a dense matrix, which would hit the memory limit.
tf.sparse_matmul() seems to actually require dense matrices.
There is no way to fit the whole matrix (A) in GPU memory in dense form.
Is there a way to multiply sparse matrices with other sparse matrices in Tensorflow? I can't seem to find one.
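
Not a TensorFlow answer, but for comparison (a sketch of a CPU-side workaround, not something from the post): scipy.sparse supports sparse x sparse products and keeps the result sparse, so A x A.transpose() never has to be densified:

import numpy as np
import scipy.sparse as sp

# Toy stand-in for the 200k x 100k matrix at ~0.002 density
A = sp.random(2000, 1000, density=0.002, format="csr", dtype=np.float32)

# Sparse x sparse product; G is returned as a sparse CSR matrix
G = A @ A.T
print(G.shape, G.nnz)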

Number of training and test samples in tensor flow code?

If the number of input samples in TensorFlow code is 5,000,000, does that mean all of these samples are used for training? How can I find out the number of samples used for training and for testing separately?
You will have to choose how many samples are used for training and how many for testing. A general approach is to set aside a random 70% of the samples for training and the remaining 30% for testing. This can be done fairly simply as follows:
Let's assume you have a dataframe of 5,000,000 samples named df. The sample() function from pandas lets you select a specified fraction of random rows, which can be set aside for training. The remaining 30% are then selected by index and used for testing.
import pandas as pd

# Randomly sample 70% of the rows for training
train_set = df.sample(frac=0.7)
# Keep the remaining 30% (rows not in the training index) for testing
test_set = df.loc[~df.index.isin(train_set.index)]
Now you have two dataframes: one for training (3,500,000 samples) and one for testing (1,500,000 samples).
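
Equivalently (a sketch assuming scikit-learn is available and df is the same dataframe), sklearn's train_test_split does the split in one call:

from sklearn.model_selection import train_test_split

# 70/30 split with a fixed seed so the split is reproducible
train_set, test_set = train_test_split(df, test_size=0.3, random_state=42)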

Vector representation in multidimentional time-series prediction in Tensorflow

I have a large data set (~30 million data-points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time-series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is to generate a generalized sequence, similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels in the next video frame from the pixels in the current frame, in order to generate a new sequence of frames that approximates the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had in mind an RNN / LSTM trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering if a convolutional approach would be more suitable.
Note, I have not actually used tensorflow beyond some simple tutorials, but I have previously used OpenCV ML functions. Please consider me a novice in your responses.
Thank you.
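
One possible dense representation (a sketch of my own, with made-up sizes, assuming the at-most-200 active cluster IDs per time-step are padded into a fixed-length integer vector): look up a learned embedding for each active cluster and average the embeddings into one dense vector per time-step, which can then feed an RNN/LSTM or convolutional model.

import numpy as np
import tensorflow as tf

NUM_CLUSTERS = 200_000   # total number of K-means clusters
MAX_ACTIVE = 200         # at most 200 clusters are present in any one time-step
EMBED_DIM = 128          # size of the dense per-time-step vector (a free choice)

# Active cluster IDs for one time-step, padded with 0 (ID 0 reserved for padding)
cluster_ids = tf.keras.Input(shape=(MAX_ACTIVE,), dtype="int32")

# mask_zero=True lets the pooling layer ignore the padding entries
emb = tf.keras.layers.Embedding(NUM_CLUSTERS + 1, EMBED_DIM, mask_zero=True)(cluster_ids)
dense_step = tf.keras.layers.GlobalAveragePooling1D()(emb)  # -> (batch, EMBED_DIM)

encoder = tf.keras.Model(cluster_ids, dense_step)
sample = np.random.randint(1, NUM_CLUSTERS, size=(4, MAX_ACTIVE)).astype("int32")
print(encoder(sample).shape)  # (4, 128)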