Representing high precision floats in tensorflow - tensorflow

We intend to train an ANN with some data which do not fit into float64 or any other existing data types in TensorFlow. For example, an input neuron may receive:
1.13760089015656552359796023723738662029191734968655617689014826158791613333
I have two questions:
Is there a way to represent the data in TensorFlow or Keras?
If there is not a clear and convenient solution, do you think that casting the data into complex128 would help us? for example:
1.137600890156565523597960237237 + 73866202919173496865561768901 i
For our case, the more floating-point means more accuracy. So we need to keep them as much as possible.
We prefer to use a GPU so the solution should cover them.

Have you considered preprocessing the data? It's rather rare to see noticeably important divergence with so high precision. It might be also hard to calculate meaningful gradient, when you only operate on such flat space.

Related

A huge number of discrete features

I'm developing a regression model. But I ran into a problem when preparing the data. 17 out of 20 signs are categorical, and there are a lot of categories in each of them. Using one-hot-encoding, my data table is transformed into a 10000x6000 table. How should I prepare this type of data?
I used PCA, trying to reduce the dimension, but even 70% of the variance is in 2500 features. That's why I joined.
Unfortunately, I can't attach the dataset, as it is confidential
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data into linear space. This means that it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so is able to represent a greater amount of variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
It really depends on exactly what you are trying to do. Getting a covariance matrix (and also PCA decomp.) will give you great insight about which classes tend to come together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. SKLearn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benifits of random forest is that it is great for tabular data (as is the case here), and can easily be trained using numerical values for class features, meaning your data vector can only be of dimension 20!
Decision tree models (such as random forest) are being shown to out-preform deep-learning in many cases, and this may be one of them.
TLDR; If you use a random forest, it can take learn even with numerical values for categories, and you can avoid creating incredibly large vectors for data.

Isn't it dangerous to apply Min Max Scaling to the test set?

Here's the situation I am worrying about.
Let me say I have a model trained with min-max scaled data. I want to test my model, so I also scaled the test dataset with my old scaler which was used in the training stage. However, my new test data's turned out to be the newer minimum, so the scaler returned negative value.
As far as I know, minimum and maximum aren't that stable value, especially in the volatile dataset such as cryptocurrency data. In this case, should I update my scaler? Or should I retrain my model?
I happen to disagree with #Sharan_Sundar. The point of scaling is to bring all of your features onto a single scale, not to rigorously ensure that they lie in the interval [0,1]. This can be very important, especially when considering regularization techniques the penalize large coefficients (whether they be linear regression coefficients or neural network weights). The combination of feature scaling and regularization help to ensure your model generalizes to unobserved data.
Scaling based on your "test" data is not a great idea because in practice, as you pointed out, you can easily observe new data points that don't lie within the bounds of your original observations. Your model needs to be robust to this.
In general, I would recommend considering different scaling routines. scikitlearn's MinMaxScaler is one, as is StandardScaler (subtract mean and divide by standard deviation). In the case where your target variable, cryptocurrency price can vary over multiple orders of magnitude, it might be worth using the logarithm function for scaling some of your variables. This is where data science becomes an art -- there's not necessarily a 'right' answer here.
(EDIT) - Also see: Do you apply min max scaling separately on training and test data?
Ideally you should scale first and then only split into test and train. But its not preferable to use minmax scaler with data which can have dynamically varying min and max values with significant variance in realtime scenario.

Tensorflow / Keras: Normalize train / test / realtime Data or how to handle reality?

I started developing some LSTM-models and now have some questions about normalization.
Lets pretend I have some time series data that is roughly ranging between +500 and -500. Would it be more realistic to scale the Data from -1 to 1, or is 0 to 1 a better way, I tested it and 0 to 1 seemed to be faster. Is there a wrong way to do it? Or would it just be slower to learn?
Second question: When do I normalize the data? I split the data into training and testdata, do I have to scale / normalize this data seperately? maybe the trainingdata is only ranging between +300 to -200 and the testdata ranges from +600 to -100. Thats not very good I guess.
But on the other hand... If I scale / normalize the entire dataframe and split it after that, the data is fine for training and test, but how do I handle real new incomming data? The model is trained to scaled data, so I have to scale the new data as well, right? But what if the new Data is 1000? the normalization would turn this into something more then 1, because its a bigger number then everything else before.
To make a long story short, when do I normalize data and what happens to completely new data?
I hope I could make it clear what my problem is :D
Thank you very much!
Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from data coming from Gaussian Standard distribution (mean 0 and variance 1).
Techniques like Batch Normalization (simplifying), help neural net to have this trait throughout the whole network, so it's usually beneficial.
There are other approaches that you mentioned, to tell reliably what helps for which problem and specified architecture you just have to check and measure.
2. What about test data?
Mean to subtract and variance to divide each instance by (or any other statistic you gather by any normalization scheme mentioned previously) should be gathered from your training dataset. If you take them from test, you perform data leakage (info about test distribution is incorporated into training) and you may get false impression your algorithm performs better than in reality.
So just compute statistics over training dataset and use them on incoming/validation/test data as well.

Is there any obvious reason that Tensorflow uses COO format other than CSR for sparse matrix?

I'm trying to take performance advantages from built-in Sparse Matrix Multiplication API of Tensorflow.
And keveman recommended that tf.embedding_lookup_sparse is the right way.
But, it seems that the performance of embedding_lookup_sparse is somewhat disappointed in my experiments. Though it performs fairly small matrix multiplications, <1, 3196> and <3196, 1024>, sparse matmul with 0.1 sparsity fails to win the dense matrix multiplication.
If my implementation is correct, I think one of the reasons is that Tensorflow uses COO format which saves all index-nonzero pair. I'm not an expert on this domain but, isn't it widely known that CSR format is more performant on this kind of computation? Is there any obvious reason that Tensorflow internally uses COO format other than CSR for sparse matrix representation?
Just for the record, you say matrix multiplication, but one of your matrices is in fact a vector (1 x 3196). So this would make it a matrix-vector multiplication (different BLAS kernel). I will assume you mean matrix-vector multiplication for my answer.
Yes, CSR should theoretically be faster than COO for matrix-vector multiplication; this is because the storage size in CSR format is O(2nnz + n) vs O(3nnzs) and the sparse matrix vector multiplication is in many cases memory bound.
The exact performance difference compared to a dense matrix multiplication varies though based on the problem size, sparsity pattern, data type and implementation. It is difficult to say off the bat which should be faster, because the sparse storage format introduces indirection, which potentially leads to reduced locality and poor(er) utilisation of arithmetic units (e.g. no use of vectorisation).
Particularly when the matrix and vector size are so small that almost everything fits in cache, I would expect limited performance benefits. Sparse matrix structures are typically more useful for truly large matrices, ranging from 10sK x 10sK to 1B x 1B, which wouldn't even fit in main memory using a dense representation. For small problem sizes, in my experience, the storage advantage compared to dense formats is usually negated by the loss in locality and arithmetic efficiency. To some extent this is addressed by hybrid storage formats (such as Block CSR) which try to take the best of both worlds, and are very useful for some applications (doesn't look like tensorflow supports this).
In tensorflow, I would assume the COO format is used because it is more efficient for other operations, for example it supports O(1) updates, insertions and deletions from the data structure. It seems reasonable to trade ~50% performance in sparse matrix-vector multiply to improve performance on these operations.

How to create a synthetic dataset

I want to run some Machine Learning clustering algorithms on some big data.
The problem is that I'm having troubles to find interesting data for this purpose on the web.Also, usually this data might be inconvenient to use because the format won't fit for me.
I need a txt file which each line represents a mathematical vector, each element seperated by space, for example:
1 2.2 3.1
1.12 0.13 4.46
1 2 54.44
Therefore, I decided to first run those algorithms on some synthetic data which I'll create by my self. How can I do this in a smart way with numpy?
In smart way, I mean that it won't be generated uniformly, because it's a little bit boring. How can I generate some interesting clusters?
I want to have 5GB / 10GB of data at the moment.
You need to define what you mean by "clusters", but I think what you are asking for is several random-parameter normal distributions combined together, for each of your coordinate values.
From http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.randn.html#numpy.random.randn:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
And use <range> * np.random.rand(<howmany>) for each of sigma and mu.
There is no one good answer for such question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting or even well posed problem. Clustering as such has no well defineid evaluation, consequently each method is equally good/bad, as long as it has well defined internal objective. So k-means will always be good one to minimize inter-cluster euclidean distance and will struggle with sparse data, non-convex, imbalanced clusters. DBScan will always be the best in greedy density based sense and will strugle with diverse density clusters. GMM will be always great fitting on gaussian mixtures, and will strugle with clusters which are not gaussians (for example lines, squares etc.).
From the question one could deduce that you are at the very begining of work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at datasets generators, in particular accesible in scikit-learn (python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf