In deep-learning model training, inputs are generally passed in batches. For example, when training a model with a [512]-dimensional input feature vector and a batch size of 4, we would pass a [4, 512]-dimensional input. I am curious what the logical significance would be of passing the same input after flattening it across the batch and channel dimensions into [2048]. Logically the locality structure will be destroyed, but will it significantly speed up my implementation? And can it affect the performance?
In supervised learning, you would usually be working with data points (e.g. a feature vector or a multi-dimensional input such as an image) paired with some kind of ground truth (a label for classification tasks, or another multi-dimensional object altogether). Feeding your model a flattened tensor containing multiple data points would not make sense in terms of supervision. Assuming you do inference this way, what would the supervision signal be at the output level of your model? Would you combine the labels as well? All of this seems to depend heavily on the use case: is there some kind of temporal coherence between the elements of the batch?
Performance-wise, this has no implications whatsoever. Tensors are already 'flattened' by design, since their data is laid out in contiguous memory buffers. The idea of multi-dimensionality is an abstraction layer provided by those libraries (namely NumPy's arrays and Torch's tensors) to allow for easier and more flexible control over data.
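To see that flattening is only a change of view over the same buffer, here is a minimal sketch in PyTorch (NumPy's reshape behaves the same way); the shapes match the [4, 512] example from the question:

    import torch

    x = torch.randn(4, 512)       # a batch of 4 feature vectors
    flat = x.view(-1)             # shape [2048]; no data is copied

    # Both tensors share the same underlying memory buffer, so "flattening"
    # costs essentially nothing and cannot speed up the computation by itself.
    print(flat.shape)                       # torch.Size([2048])
    print(x.data_ptr() == flat.data_ptr())  # True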
I have a data model consisting only of categorical features and a categorical label.
So when I build that model manually in XGBoost, I would basically transform the features to binary columns (using LabelEncoder and OneHotEncoder), and the label into classes using LabelEncoder. I would then run a multiclass classification (multi:softmax).
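For reference, a minimal runnable sketch of this manual approach (the toy DataFrame and its column names are made up; the real dataset is confidential):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from xgboost import XGBClassifier

    # Toy frame standing in for the confidential dataset (columns are made up).
    df = pd.DataFrame({
        "color": ["red", "blue", "red", "green", "blue", "green"],
        "size":  ["S", "M", "L", "S", "L", "M"],
        "label": ["A", "B", "A", "C", "B", "C"],
    })

    # One-hot encode the categorical features and integer-encode the label.
    X = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["color", "size"]])
    y = LabelEncoder().fit_transform(df["label"])

    model = XGBClassifier(objective="multi:softmax")
    model.fit(X, y)
    print(model.predict(X))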
I tried that with my dataset and ended up with an accuracy of around 0.4 (unfortunately I can't share the dataset due to confidentiality).
Now, if I run the same dataset in Azure AutoML, I end up with an accuracy around 0.85 in the best experiment. But what is really interesting is that the AutoML uses SparseNormalizer, XGBoostClassifier, with reg:logistic as objective.
So if I interpret this right, AzureML just normalizes the data (somehow, from categorical data?) and then executes a logistic regression? Is this even possible / does this make sense with categorical data?
Thanks in advance.
TL;DR You're right that normalization doesn't make sense for training gradient-boosted decision trees (GBDTs) on categorical data, but it won't have an adverse impact. AutoML is an automated framework for modeling. In exchange for calibration control, you get ease-of-use. It is still worth verifying first that AutoML is receiving data with the columns properly encoded as categorical.
Think of an AutoML model as effectively a sklearn Pipeline, which is a bundled set of pre-processing steps along with a predictive Estimator. AutoML will attempt to sample from a large swath of pre-configured Pipelines such that the most accurate Pipeline will be discovered. As the docs say:
In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model.
To see this, you can call .named_steps on your fitted model. Also check out fitted_model.get_featurization_summary().
I especially empathize with your concern w.r.t. how LightGBM (Microsoft's GBDT implementation) is leveraged by AutoML. LightGBM accepts categorical columns directly and, instead of one-hot encoding them, partitions their categories into two subsets at every split. Despite this, AutoML will pre-process away the categorical columns by one-hot encoding, scaling, and/or normalization, so this unique categorical handling is never utilized in AutoML.
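For comparison, here is a small sketch of LightGBM's native categorical handling outside of AutoML (the toy DataFrame and its columns are made up; casting to pandas' category dtype is enough for LightGBM to treat them as categorical):

    import lightgbm as lgb
    import pandas as pd

    # Toy frame with made-up columns, cast to pandas' "category" dtype so that
    # LightGBM treats them natively as categorical (no one-hot encoding needed).
    df = pd.DataFrame({
        "color": pd.Categorical(["red", "blue", "red", "green", "blue", "green"]),
        "size":  pd.Categorical(["S", "M", "L", "S", "L", "M"]),
        "label": [0, 1, 0, 2, 1, 2],
    })

    # At each split, LightGBM partitions a column's categories into two subsets
    # instead of creating one binary column per category.
    model = lgb.LGBMClassifier(min_child_samples=1)
    model.fit(df[["color", "size"]], df["label"])
    print(model.predict(df[["color", "size"]]))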
If you're interested in "manual" ML in Azure ML, I highly suggest looking into Estimators and Azure ML Pipelines
If I am correct, stratified k-fold is used so that the ratio of the dependent variable in each split is similar to that of the original data.
What I want to understand is why is it necessary or important to retain that ratio.
Is it necessary for fraud-detection problems where the data is highly imbalanced?
If yes, why?
Taken from https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
Cross-validation article in Encyclopedia of Database Systems says:
Stratification is the process of rearranging the data as to ensure each fold is a good representative of the whole. For example in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold, each class comprises around half the instances.
About the importance of the stratification, Kohavi (A study of cross-validation and bootstrap for accuracy estimation and model selection) concludes that:
stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation.
All metrics are calculated against the true labels. If there is a bias in the system, say it predicts one label more often, then a fold containing more of that label would give artificially inflated results.
A methodology to take care of that is to ensure the true-label distribution is very similar in each fold. The aggregation of results is then more indicative of the system's performance.
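A small illustration with scikit-learn's StratifiedKFold on a toy, highly imbalanced label vector (the 95/5 split below is made up for the example):

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    # Toy, highly imbalanced labels: 95% negatives, 5% positives.
    y = np.array([0] * 95 + [1] * 5)
    X = np.zeros((100, 3))  # dummy features

    for name, splitter in [("KFold", KFold(5, shuffle=True, random_state=0)),
                           ("StratifiedKFold", StratifiedKFold(5, shuffle=True, random_state=0))]:
        ratios = [y[test].mean() for _, test in splitter.split(X, y)]
        print(name, "positive ratio per fold:", np.round(ratios, 2))

    # StratifiedKFold keeps the positive ratio close to 0.05 in every fold, while
    # plain KFold can easily produce folds with no positives at all, which distorts
    # any metric computed on those folds.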
Let's say I trained a model with a very complex computational graph tailored for training. After a lot of training, the best model was saved to a checkpoint file. Now, I want to use the learned parameters of this best model for inference. However, the computational graph used for training is not exactly the same as the one I intend to use for inference. Concretely, there is a module in the graph with several layers in charge of outputting embedding vectors for items (recommender system context). However, for the sake of computational performance, during inference time I would like to have all the item embedding vectors precomputed in advance, so that the only computation required per request would just involve a couple of hidden layers.
Therefore, what I would like to know is how to do the following:
1. Restore just the part of the network that outputs the item embedding vectors, in order to precompute these vectors for all items (this would happen off-line in some pre-processing script).
2. Once all item embedding vectors are precomputed, during on-line inference restore just the hidden layers in the later part of the network and make them receive the precomputed item embedding vectors instead.
How can the points above be accomplished? I think point 1 is the easier one, but my biggest concern is with point 2. In the computational graph used for training, in order to evaluate any layer I would have to provide values for the input placeholders. However, during on-line inference these placeholders would be obsolete, because a lot of the work would already be precomputed, and I don't know how to tell the hidden layers in the later part of the network that they should no longer depend on these obsolete placeholders but on the precomputed values instead.
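One way to approach both points with the TF 1.x graph/session API is sketched below; every tensor name, file path, and variable here is hypothetical and would need to be adapted to the actual graph. The key detail for point 2 is that feed_dict can override any tensor in the graph, not only placeholders, so the obsolete upstream placeholders simply never need to be fed:

    import tensorflow as tf  # TF 1.x-style graph/session API

    # --- Point 1 (off-line): run only the embedding sub-graph and precompute vectors ---
    graph = tf.Graph()
    with graph.as_default():
        saver = tf.train.import_meta_graph("model.ckpt.meta")      # hypothetical checkpoint
        item_ids = graph.get_tensor_by_name("item_ids:0")          # hypothetical tensor names
        item_embeddings = graph.get_tensor_by_name("item_embedding/output:0")

    with tf.Session(graph=graph) as sess:
        saver.restore(sess, "model.ckpt")
        precomputed = sess.run(item_embeddings, feed_dict={item_ids: all_item_ids})

    # --- Point 2 (on-line): feed the precomputed vectors directly into the later layers ---
    with tf.Session(graph=graph) as sess:
        saver.restore(sess, "model.ckpt")
        scores = sess.run(
            graph.get_tensor_by_name("head/scores:0"),
            feed_dict={item_embeddings: precomputed,   # overrides everything upstream of it
                       graph.get_tensor_by_name("user_features:0"): user_batch})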
This tutorial has the TensorFlow implementation of a batch normalization layer for the training and testing phases.
When using transfer learning, is it OK to use a batch normalization layer? Especially when the data distributions are different.
Because in the inference phase the BN layer just uses a fixed mini-batch mean and variance (which were calculated from the training distribution).
So if our model sees data with a different distribution, can it give wrong results?
With transfer learning, you're transferring the learned parameters from one domain to another.
Usually, this means that you keep the learned values of the convolutional layers fixed whilst adding new fully connected layers that learn to classify the features extracted by the CNN.
When you add batch normalization to every layer, you're injecting values sampled from the input distribution into the layer, in order to force the layer's output to be normally distributed.
To do that, you compute an exponential moving average of the layer's output during training and then, in the testing phase, subtract this value from the layer's output.
Although data dependent, these mean values (one per convolutional layer) are computed on the output of the layer, and thus on the learned transformation.
Thus, in my opinion, the various averages that the BN layer subtracts from its convolutional layer's output are general enough to be transferred: they are computed on the transformed data and not on the original data.
Moreover, the convolutional layers learn to extract local patterns, so they're more robust and harder to influence.
Thus, in short and in my opinion:
You can apply transfer learning to convolutional layers with batch norm applied. But for fully connected layers, the influence of the computed statistics (which see the whole input and not only local patches) can be too data dependent, and thus I'd avoid it.
However, as a rule of thumb: if you're unsure about something, just try it and see if it works!
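For what it's worth, a minimal sketch of this kind of setup with Keras (TF 2.x); the backbone, head sizes, and class count are arbitrary choices for illustration. When the backbone is frozen, its batch-norm layers run in inference mode and keep using their stored moving mean and variance:

    import tensorflow as tf

    # Frozen, batch-norm-heavy backbone; its BN layers use stored moving statistics.
    base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                          pooling="avg", input_shape=(224, 224, 3))
    base.trainable = False

    # New fully connected head trained on the target domain (10 classes is arbitrary).
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])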
I've read the XLA prerelease document here.
https://www.tensorflow.org/versions/master/resources/xla_prerelease#xla_accelerated_linear_algebra
It discusses datatypes of elements, but does not go into much detail about the data organization of the tensors themselves. How will operations on SparseTensor objects be handled once XLA is available?
The layouts restrict the data organization of input and output tensors and don't include sparse layouts, although as Jingyue suggests, they could be extended in the future. The internal representation of tensors in the AST can in principle be anything a backend wants, and it is expected that the compiler may reorganize the data to different layouts for the convenience of different operators implemented by different backends.
I am not aware that anyone has put much thought into how to do this efficiently for sparse tensors. In principle maybe it could be done as a compiler pass to infer sparsity and propagate it, with sparse implementations for all the relevant operators. Nothing like that exists today.
No, XLA focuses on dense tensors and doesn't deal with sparse tensors in an efficient way today.
It could be easily extended to allow users to express some sparsity using layouts (e.g. interior padding).
Sparse data is something we'd like to have working, though it has some challenges. E.g. currently XLA depends on knowing the exact size of every buffer statically. We could certainly find a way to deal with that, but have been focusing on dense data so far.
A few years later, XLA seems to have some support for sparse tensors, and it works well at that. My workflow involves sparse tensors for very high-dimensional data that would be prohibitive to keep in memory, then slicing and manipulating them, and finally performing math ops on a lower-dimensional dense tensor. For slicing sparse tensors I'm getting roughly a 4x speed-up with XLA.
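As a rough illustration of that kind of pipeline (a sketch only: the shapes and values are placeholders, the sparse slicing here runs eagerly, and only the final dense math is explicitly marked for XLA compilation via jit_compile):

    import tensorflow as tf

    # High-dimensional data kept sparse; indices/values/shape are placeholders.
    sp = tf.sparse.SparseTensor(indices=[[0, 0], [1, 2], [2, 4]],
                                values=[1.0, 2.0, 3.0],
                                dense_shape=[1000, 1000])

    sliced = tf.sparse.slice(sp, start=[0, 0], size=[3, 5])  # small sparse slice
    dense = tf.sparse.to_dense(sliced)                       # low-dimensional dense tensor

    @tf.function(jit_compile=True)  # ask TensorFlow to compile this function with XLA
    def heavy_math(x):
        return tf.linalg.matmul(x, x, transpose_b=True)

    result = heavy_math(dense)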