I have a dataframe data containing real values and some NaN values. I'm trying to perform locality sensitive hashing using random projections to reduce the dimension to 25 components, specifically with thesklearn.random_projection.GaussianRandomProjection class. However, when I run:
tx = random_projection.GaussianRandomProjection(n_components = 25)
data25 = tx.fit_transform(data)
I get Input contains NaN, infinity or a value too large for dtype('float64'). Is there a work-around to this? I tried changing all the NaN values to a value that is never present in my dataset, such as -1. How valid would my output be in this case? I'm not an expert behind the theory of locality sensitive hashing/random projections so any insight would be helpful as well. Thanks.
NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.
You don't want to just substitute a random value like -1. If you are inclined to do that, use one of the Imputer classes. Otherwise, you are likely to very substantially change the distances between points. You likely want to preserve distances as much as possible if you are using random projection:
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentlemen has specialized in studying.
If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is training a generative model like a Restricted Boltzman Machine and use that to fill in missing values:
rbm = sklearn.neural_network.BernoulliRBM().fit( data_with_no_nans )
mean_imputed_data = sklearn.preprocessing.Imputer().fit_transform( all_data )
rbm_imputation = rbm.gibbs( mean_imputed_data )
nan_mask = np.isnan( all_data )
all_data[ nan_mask ] = rbm_imputation[ nan_mask ]
Finally, you might consider imputing using nearest neighbors. For a given column, train a nearest neighbors model on all the variables except that column using all complete rows. Then, for a row missing that column, find the k nearest neighbors and use the average value among them. (This gets very costly, especially if you have rows with more than one missing value, as you will have to train a model for every combination of missing columns).
Related
I have dataset with 15 columns with below scenario
9 -columns are categorical use so I have convert the data one hot encoder
6 columns are numeric, out of 6 - 3 columns is having outlier since column values are different range, so I have chosen RobustScaler() as scaling features and other I chosen standard Scalar.
after that I have combined all the data frame and apply the Logistic Regression algorithm my model produced very low score even I got the good score with out scaling.
will any one can able to help on this ?
please apply column standardization to data frame and see the output..I guess since logistic regression is sensitive to outliers,you are facing problem
impute outliers properly and then apply column standardization
I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedded columns for categorical features with many categories (native_country, occupation).
I don't understand embedded columns. Is there a rule that clarifies when to use embedded columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.
nce_loss() asks for a static int value for num_true. That works well for problems where we have the same amount of labels per training example and we know it in advance.
When labels have a variable shape [None], and being batched and/or bucketed by bucket size with .padded_batch() + .group_by_window() it is necessary to provide a variable size num_true in order to accustom for all training examples. This is currently unsupported to my knowledge (correct me if I'm wrong).
In other words suppose we have either a dataset of images with an arbitrary amount of labels per each image (dog, cat, duck, etc.) or a text dataset with numerous multiple classes per sentence (class_1, class_2, ..., class_n). Classes are NOT mutually exclusive, and can vary in size between examples.
But as the amount of possible labels can be huge 10k-100k is there a way to do a sampling loss to improve performance (in comparison with a sigmoid_cross_entropy)?
Is there a proper way to do this or any other workarounds?
nce_loss = tf.nn.nce_loss(
weights=nce_weights,
biases=nce_biases,
labels=labels,
inputs=inputs,
num_sampled=num_sampled,
# Something like this:
# `num_true=(tf.shape(labels)[-1])` instead of `num_true=const_int`
# , would be preferable here
num_classes=self.num_classes)
I see two issues:
1) Work with NCE with different numbers of true values;
2) Classes that are NOT mutually exclusive.
To the first issue, as #michal said, there is an expectative of including this functionality in the future. I have tried almost the same thing: to use labels with shape=(None, None), i.e., true_values dimension None. The sampled_values parameter has the same problem: true_values number must be a fixed integer number. The recomended work around is to use a class (0 is the best one) representing <PAD> and complete the number of true_values. In my case, 0 is an special token that represents <PAD>. Part of code is here:
assert len(labels) <= (window_size * 2)
zeros = ((window_size * 2) - len(labels)) * [0]
labels = labels + zeros
labels.sort()
I sorted the label because considering another recommendation:
Note: By default this uses a log-uniform (Zipfian) distribution for
sampling, so your labels must be sorted in order of decreasing
frequency to achieve good results.
In my case, the special tokens and more frequent words have lower indexes, otherwise, less frequent words have higher indexes. I included all label classes associated to the input at same time and completed with zero till the true_values number. Of course, you must ignore the 0 class at the end.
Leaving that they are from two different binaries.
I know that series/dataframe can hold any data type, and ndarray is also heterogenous data.
And also all the slicing operations of numpy are applicable to series.
Is there any other difference between them?
After some research I found the answer to my question I asked above. For anyone who needs, here it is from pandas docs:
A key difference between Series and ndarray is that operations between
Series automatically align the data based on the label. Thus, you can
write computations without giving consideration to whether the Series
involved have the same labels.
An example:
s[1:] + s[:-1]
The result for above would produce NaN for both first and last index.
If a label is not found in one Series or the other, the result will be marked as missing NaN.
This question already has answers here:
np.mean() vs np.average() in Python NumPy?
(5 answers)
Closed 4 years ago.
NumPy has two different functions for calculating an average:
np.average()
and
np.mean()
Since it is unlikely that NumPy would include a redundant feature their must be a nuanced difference.
This was a concept I was very unclear on when starting data analysis in Python so I decided to make a detailed self-answer here as I am sure others are struggling with it.
Short Answer:
'Mean' and 'Average' are two different things. People use them interchangeably but shouldn't. np.mean() gives you the arithmetic mean where as np.average() allows you to get the arithmetic mean if you don't add other parameters, but can also be used to take a weighted average.
Long Answer and Background:
Statistics:
Since NumPy is mostly used for working with data sets it is important to understand the mathematical concept that causes this confusion. In simple mathematics and every day life we use the word Average and Mean as interchangeable words when this is not the case.
Mean: Commonly refers to the 'Arithmetic Mean' or the sum of a collection of numbers divided by the number of numbers in the collection1
Average: Average can refer to many different calculations, of which the 'Arithmetic Mean' is one. Others include 'Median', 'Mode', 'Weighted Mean, 'Interquartile Mean' and many others.2
What This Means For NumPy:
Back to the topic at hand. Since NumPy is normally used in applications related to mathematics it needs to be a bit more precise about the difference between Average() and Mean() than tools like Excel which use Average() as a function for finding the 'Arithmetic Mean'.
np.mean()
In NumPy, np.mean() will allow you to calculate the 'Arithmetic Mean' across a specified axis.
Here's how you would use it:
myArray = np.array([[3, 4], [5, 6]])
np.mean(myArray)
There are also parameters for changing which dType is used and which axis the function should compute along (the default is the flattened array).
np.average()
np.average() on the other hand allows you to take a 'Weighted Mean' in which different numbers in your array may have a different weight. For example, in the documentation we can see:
>>> data = range(1,5)
>>> data
[1, 2, 3, 4]
>>> np.average(data)
2.5
>>> np.average(range(1,11), weights=range(10,0,-1))
4.0
For the last function if you were to take a non-weighted average you would expect the answer to be 6. However, it ends up being 4 because we applied the weights too it.
If you don't have a good handle on what a 'weighted mean' we can try and simplify it:
Consider this a very elementary summary of our 'weighted mean' it isn't going to be quite mathematically accurate (which I hope someone will correct) but it should allow you to visualize what we're discussing.
A mean is the average of all numbers summed and divided by the total number of numbers. This means they all have an equal weight, or are counted once. For our mean sample this meant:
(1+2+3+4+5+6+7+8+9+10+11)/11 = 6
A weighted mean involves including numbers at different weights. Since in our above example it wouldn't include whole numbers it can be a bit confusing to visualize so we'll imagine the weighting fit more nicely across the numbers and it would look something like this:
(1+1+1+1+1+1+1+1+1+1+1+2+2+2+2+2+2+2+2+2+3+3+3+3+3+3+3+3+4+4+4+4+4+4+4+5+5+5+5+5+5+6+6+6+6+6+6+7+7+7+7+7+8+8+8+8+9+9+9+-11)/59 = 3.9~
Even though in the actual number set there is only one instance of the number 1 we're counting it at 10 times its normal weight. This can also be done the other way, we could count a number at 1/3 of its normal weight.
If you don't provide a weight parameter to np.average() it will simply give you the equal weighted average across the flattened axis which is equivalent to the np.mean().
Why Would I Ever Use np.mean()?
If np.average() can be used to find the flat arithmetic mean then you may be asking yourself "why would I ever use np.mean()?" np.mean() allows for a few useful parameters that np.average() does not. One of the key ones is the dType parameter which allows you to set the type used in the computation.
For example the NumPy docs give us this case:
Single point precision:
>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)
0.546875
Based on the calculation above it looks like our average is 0.546875 but if we use the dType parameter to float64 we get a different result:
>>> np.mean(a, dtype=np.float64)
0.55000000074505806
The actual average 0.55000000074505806.
Now, if you round both of these to two significant digits you get 0.55 in both cases. Where this accuracy becomes important is if you are doing multiple sets of operations on the number still, especially when dealing with very large (or very small numbers) that need a high accuracy.
For example:
((((0.55000000074505806*184.6651)^5)+0.666321)/46.778) =
231,044,656.404611
((((0.55000000074505806*184.6651)^5)+0.666321)/46.778) =
231,044,654.839687
Even in simpler equations you can end up being off by a few decimal places and that can be relevant in:
Scientific simulations: Due to lengthy equations, multiple steps and a high degree of accuracy needed.
Statistics: The difference between a few percentage points of accuracy can be crucial (for example in medical studies).
Finance: Continually being off by even a few cents in large financial models or when tracking large amounts of capital (banking/private equity) could result in hundreds of thousands of dollars in errors by the end of the year.
Important Word Distinction
Lastly, simply on interpretation you may find yourself in a situation where analyzing data where it is asked of you to find the 'Average' of a dataset. You may want to use a different method of average to find the most accurate representation of the dataset. For example, np.median() may be more accurate than np.average() in cases with outliers and so its important to know the statistical difference.