Data type already in int but receiving error: cannot convert float infinity to integer - pandas

I have a column: "Rented Bike Count" in my data frame, which is the dependent variable of my linear regression project. I found that its distribution is highly skewed, so I wanted to transform it into a more normal distribution as a data preparation step for linear regression.
But when I used Log transformation for the same using:
sns.distplot(np.log10(temp_df['Rented Bike Count']))
It showed me the following error:
cannot convert float infinity to integer
The "Rented Bike Count" column is already of int data type. Can anyone help?

This is speculative as no data is provided, but from the histogram you show I assume you have zero values in temp_df['Rented Bike Count']. You cannot calculate the logarithm of 0, which makes numpy return -inf (instead of throwing an error). In order to plot the data you would need to replace these infinite values first.
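As a minimal sketch (assuming temp_df is the frame from the question and the infinities really do come from zero counts), you can either shift the data before taking the log or drop the infinite values after:

import numpy as np
import seaborn as sns

# Option 1: log10(x + 1) keeps zero counts finite (they map to 0)
sns.distplot(np.log10(temp_df['Rented Bike Count'] + 1))

# Option 2: take the log, then drop the -inf values produced by log10(0)
logged = np.log10(temp_df['Rented Bike Count'])
sns.distplot(logged.replace(-np.inf, np.nan).dropna())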

Related

Vertex AI AutoML Batch Predict returns prediction scores as Array Float64 in BigQuery Table instead of just FLOAT64 values

So I have this tabular AutoML model in Vertex AI. It successfully ran batch predictions and outputs to BigQuery. However, when I try to query the prediction data based on the score being above a certain threshold, I get an error saying the datatype doesn't support float operations. When I tried to cast the scores to float, it said that the scores are a float64 array. This confuses me because they're just individual values of a column in the table. I don't understand why they aren't normal floats, nor do I know how to convert them. Any help would be greatly appreciated.
I tried casting the datatype to float, which obviously didn't work. I tried using different operators like BETWEEN and LIKE, but again they won't work because it says it's an array. I just don't understand why it's getting converted to an array. Each value should be its own value, just as the table shows it to be.
AutoML does store your result in a so-called RECORD, at least if you're doing classification. If that is the case for you, it stores two things within this RECORD: classes and scores. scores itself is also an array, consisting of the probability of class 0 and the probability of class 1. So to access it you have to do something like this:
prediction_variable.scores[offset(1)]
This will give you the value for the probability of your classification being 1.
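As a sketch of running such a filtered query from Python (the project, dataset, table, and column names below are placeholders, not taken from the question):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: substitute your own table path and the name of the
# prediction RECORD column from your batch-prediction output.
query = """
SELECT *
FROM `my_project.my_dataset.batch_prediction_results`
WHERE prediction_variable.scores[OFFSET(1)] > 0.8
"""
for row in client.query(query).result():
    print(row)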

AWS Cloudwatch Math Expressions: draw null / empty

I am trying to create a dashboard widget that says "if a metric's sample count is less than a certain number, don't draw the graph".
The only math expression that seems promising is IF; however, its value can only be a metric or a scalar. I'm trying to find a way to draw a null / no data point / empty value instead.
Any way?
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html#using-IF-expressions
CloudWatch will drop data points that are not numbers (NaN, +Infinity, -Infinity) when graphing the data. Also, metric math will evaluate basic operations in the expression, so you can divide by zero to get a non-number value.
So you can do something like this to trick it into dropping the values you don't want:
Have your metric in the graph as m1.
Have the sample count of your metric in the graph as m2.
Add an IF function to drop data points if the sample count is lower than some number (10 in this example): IF(m2 < 10, 1/0, m1)
Disable m1 and m2 on the graph and only show the expression.

I can't understand a numpy array concept in sklearn

My code:

import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# X (features): 2-D array, one row per sample, one column per feature
diabetes_x = np.array([[1], [2], [3]])
diabetes_x_train = diabetes_x
diabetes_x_test = diabetes_x

# y (targets): 1-D array, one value per sample
diabetes_y_train = np.array([3, 2, 4])
diabetes_y_test = np.array([3, 2, 4])

model = linear_model.LinearRegression()
model.fit(diabetes_x_train, diabetes_y_train)
diabetes_y_predict = model.predict(diabetes_x_test)

print("Mean squared error is:", mean_squared_error(diabetes_y_test, diabetes_y_predict))
print("weights:", model.coef_)
print("intercept:", model.intercept_)
In this code we pass diabetes_x as a 2-D array, but diabetes_y_train and diabetes_y_test are 1-D arrays. Why? Can someone please explain the concept behind diabetes_x and diabetes_y to me?
In machine learning terminology, X is regarded as the input variable and y as the output variable.
Suppose there is a dataset with 5 columns, where the last column is the result. The input will then consist of all the columns except the last, and the last column will be used to check whether the learned mapping is correct after training, or during validation to calculate the error.
As for the shapes: scikit-learn expects X to be a 2-D array of shape (n_samples, n_features), so even with a single feature each sample is a row of feature values, while y is a 1-D array of shape (n_samples,) with one target value per sample.
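A minimal sketch of that shape convention (the numbers are made up for illustration):

import numpy as np
from sklearn import linear_model

# X: 2-D, shape (n_samples, n_features) -> 3 samples, 1 feature each
X = np.array([[1], [2], [3]])
# y: 1-D, shape (n_samples,) -> one target value per sample
y = np.array([3, 2, 4])

model = linear_model.LinearRegression().fit(X, y)
print(X.shape, y.shape)   # (3, 1) (3,)

# A 1-D X would fail, because sklearn cannot tell samples from features:
# model.fit(np.array([1, 2, 3]), y)  # ValueError: Expected 2D array, got 1D array instead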

Explained variance calculation

My questions are specific to https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.
I don't understand why you square the eigenvalues here:
https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/decomposition/pca.py#L444
Also, explained_variance is only computed for the original data used to compute the eigenvectors, not for newly transformed data. Is that not normally done?
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)
pca.transform(Y)
In this case, wouldn't you calculate the explained variance separately for the data Y as well? For that purpose, I think we would have to use point 3 instead of the eigenvalues.
Explained variance can be also computed by taking the variance of each axis in the transformed space and dividing by the total variance. Any reason that is not done here?
Answers to your questions:
1) The square roots of the eigenvalues of the scatter matrix (e.g. XX.T) are the singular values of X (see here: https://math.stackexchange.com/a/3871/536826). So you square them. Important: the initial matrix X should be centered (data has been preprocessed to have zero mean) in order for the above to hold.
2) Yes this is the way to go. explained_variance is computed based on the singular values. See point 1.
3) It's the same but in the case you describe you HAVE to project the data and then do additional computations. No need for that if you just compute it using the eigenvalues / singular values (see point 1 again for the connection between these two).
Finally, keep in mind that not everyone really wants to project the data. One can compute just the eigenvalues and then immediately estimate the explained variance WITHOUT projecting the data. So that's the gold-standard way to do it.
EDIT 1:
Answer to edited Point 2
No. PCA is an unsupervised method. It only transforms the X data not the Y (labels).
Again, the explained variance can be computed quickly, easily, and with half a line of code using the eigenvalues/singular values, OR, as you said, using the projected data, e.g. by estimating the covariance of the projected data; the variances of the PCs will then be on the diagonal.
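A minimal sketch of both routes, using random data purely for illustration (nothing here comes from the question) and centering it first, since the equivalence only holds for centered data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
Xc = X - X.mean(axis=0)            # center the data

pca = PCA(n_components=2, svd_solver='full').fit(Xc)

# Route 1: from the singular values (what scikit-learn does internally)
var_from_svd = pca.singular_values_ ** 2 / (Xc.shape[0] - 1)

# Route 2: from the variance of each axis of the projected data
Z = pca.transform(Xc)
var_from_projection = Z.var(axis=0, ddof=1)

print(np.allclose(var_from_svd, pca.explained_variance_))   # True
print(np.allclose(var_from_svd, var_from_projection))       # True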

LIBSVM Data Preparation: Excel data to LIBSVM format

I want to study how to perform LIBSVM for regression and I'm currently stuck in preparing my data. Currently I have this form of data in .csv and .xlsx format and I want to convert it into libsvm data format.
So far, I understand that the data should be in this format so that it can be used in LIBSVM:
Based on what I read, for regression, "label" is the target value which can be any real number.
I am doing an electric load prediction study. Can anyone tell me what the label would be in my case? And finally, how should I organize my columns and rows?
The LIBSVM data format is given by:
<label> <index1>:<value1> <index2>:<value2> .........
As you can see, this forms a matrix [(IndexCount + 1) columns, LineCount rows], more precisely a sparse matrix. If you specify a value for each index, you have a dense matrix, but if you only specify a few indices like <label> 5:<value> 8:<value>, only indices 5 and 8 (and of course the label) carry a custom value; all other values are implicitly 0. This is just for notational simplicity and to save space, since datasets can be huge.
For the meaning of the tags, I cite the README file:
<label> is the target value of the training data. For classification,
it should be an integer which identifies a class (multi-class
classification is supported). For regression, it's any real
number. For one-class SVM, it's not used so can be any number.
<index> is an integer starting from 1, <value> is a real number. The indices
must be in an ascending order.
As you can see, the label is the value you want to predict. The index marks a feature of your data, together with its value. A feature is simply an indicator that your target value can be associated or correlated with, so that a better prediction can be made.
Totally Fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He found out, that the outside temperature from the day before is a good indicator for that, so he selects Temperature with index 1 as feature. Important: Indices always start at one, zero can sometimes cause strange LIBSVM behaviour. Then, he surprisingly notices, that the day of the week (Monday to Sunday or 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:
<myLoad_Value> <1:outsideTemperatureFromYesterday_Value> <2:dayOfTheWeek_Value>
Gabriel Luna (he is Batman at night) now captures these data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):
0.72 1:25 2:0
0.65 1:21 2:1
0.68 1:29 2:2
...
Notice that we could leave out 2:0 in the first row because of the sparse matrix format. This would be your training data to train a LIBSVM model. Then, we predict the load for tomorrow as follows. You know today's temperature, let us say 23°C, and today is Tuesday, which is 1, so tomorrow is 2. So this is the line or vector to use with the model:
0 1:23 2:2
Here, you can set the <label> value arbitrarily. It will be overwritten with the predicted value. I hope this helps.
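As a hedged sketch of the conversion step itself (the file name and column names below are assumptions for illustration, not taken from the question), scikit-learn's dump_svmlight_file writes exactly this format:

import pandas as pd
from sklearn.datasets import dump_svmlight_file

# Assumed file and column names; adjust them to your spreadsheet.
df = pd.read_csv("load_data.csv")              # or pd.read_excel("load_data.xlsx")
X = df[["temperature", "day_of_week"]].values  # features -> indices 1 and 2
y = df["load_kwh"].values                      # the <label>: the value to predict

# zero_based=False makes the feature indices start at 1, as LIBSVM expects
dump_svmlight_file(X, y, "load_data.libsvm", zero_based=False)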