AWS CloudWatch Math Expressions: draw null / empty - amazon-cloudwatch

I am trying to create a dashboard widget that says "if a metric's sample count is less than a certain number, don't draw the graph".
The only metric math expression that seems promising is IF; however, the returned value can only be a metric or a scalar. I'm trying to find a way to draw a null/no data point/empty value instead.
Any way?
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html#using-IF-expressions

CloudWatch will drop data points that are not numbers (NaN, +Infinity, -Infinity) when graphing the data. Metric math also evaluates basic arithmetic operations inside an expression, so you can divide by zero to produce a non-number value.
So you can do something like this to trick it into dropping the values you don't want:
Have your metric in the graph as m1.
Have the sample count of your metric in the graph as m2.
Add an IF function to drop data points if the sample count is lower than some number (10 in this example): IF(m2 < 10, 1/0, m1)
Disable m1 and m2 on the graph and only show the expression.
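For instance, here is a minimal boto3 sketch of the same trick against the GetMetricData API (the MyApp namespace and Latency metric name are made up for illustration). The m1 and m2 queries have ReturnData set to False, which is the API equivalent of disabling them on the graph, and only the IF expression is returned:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {"Namespace": "MyApp", "MetricName": "Latency"},  # hypothetical metric
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": False,  # hide the raw metric, like disabling m1 on the graph
        },
        {
            "Id": "m2",
            "MetricStat": {
                "Metric": {"Namespace": "MyApp", "MetricName": "Latency"},
                "Period": 300,
                "Stat": "SampleCount",
            },
            "ReturnData": False,  # hide the sample count as well
        },
        {
            "Id": "e1",
            "Expression": "IF(m2 < 10, 1/0, m1)",  # 1/0 is a non-number, so the point is dropped
            "Label": "Latency (only when sample count >= 10)",
            "ReturnData": True,
        },
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
)
print(response["MetricDataResults"])

The same IF expression can be pasted into a dashboard widget's metric math editor; the sketch just makes it easy to verify which data points survive.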

Related

Vertex AI AutoML Batch Predict returns prediction scores as ARRAY<FLOAT64> in BigQuery table instead of just FLOAT64 values

So I have this tabular AutoML model in Vertex AI. It successfully ran batch predictions and output them to BigQuery. However, when I try to query the prediction data based on the score being above a certain threshold, I get an error saying the data type doesn't support float operations. When I tried to cast the scores to float, it said that the scores are a FLOAT64 array. This confuses me because they're just individual values of a column in the table. I don't understand why they aren't normal floats, nor do I know how to convert them. Any help would be greatly appreciated.
I tried casting the data type to float, which obviously didn't work. I tried using different operators like BETWEEN and LIKE, but again they won't work because it says it's an array. I just don't understand why it's getting converted to an array. Each value should be its own value, just as the table shows it to be.
AutoML does store your result in a so-called RECORD, at least if you're doing classification. If that is the case for you, it stores two things within this RECORD: classes and scores. Scores is itself an array, consisting of the probability of class 0 and the probability of class 1. So to access it you have to do something like this:
prediction_variable.scores[offset(1)]
This will give you the value for the probability of your classification being 1.
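If it helps, here is a minimal sketch of that query run from Python with the google-cloud-bigquery client; the table name is hypothetical and prediction_variable stands for whatever RECORD column your batch prediction job actually produced:

from google.cloud import bigquery

client = bigquery.Client()

# prediction_variable is the placeholder RECORD column from the answer above;
# scores[OFFSET(1)] unnests the probability of class 1 so it can be compared as a FLOAT64.
query = """
SELECT
  *,
  prediction_variable.scores[OFFSET(1)] AS score_class_1
FROM
  `my_project.my_dataset.predictions`
WHERE
  prediction_variable.scores[OFFSET(1)] > 0.8
"""

for row in client.query(query).result():
    print(row["score_class_1"])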

Data type already in int but receiving error: cannot convert float infinity to integer

I have a column: "Rented Bike Count" in my data frame, which is the dependent variable of my linear regression project. I found that its distribution is highly skewed, so I wanted to transform it into a more normal distribution as a data preparation step for linear regression.
But when I applied a log transformation using:
sns.distplot(np.log10(temp_df['Rented Bike Count']))
It showed me the error from the question title ("cannot convert float infinity to integer"), even though "Rented Bike Count" is already of int data type. Can anyone help?
This is speculative as no data is provided, but from the histogram you show I assume you have zero values in temp_df['Rented Bike Count']. You cannot calculate the logarithm of 0, which makes numpy return -inf (instead of throwing an error). In order to plot the data you would need to replace these infinite values first.
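A quick numpy/seaborn sketch of both the problem and two common workarounds (the counts series below is made-up data standing in for temp_df['Rented Bike Count']):

import numpy as np
import pandas as pd
import seaborn as sns

counts = pd.Series([0, 3, 12, 0, 250, 48, 7])  # hypothetical data containing zeros

np.log10(counts)  # the rows that are 0 become -inf

# Option 1: shift by 1 so zeros map to log10(1) = 0
sns.histplot(np.log10(counts + 1))

# Option 2: replace/drop the infinite values before plotting
logged = np.log10(counts).replace(-np.inf, np.nan).dropna()
sns.histplot(logged)

histplot is used here because distplot (from the question) is deprecated in recent seaborn versions.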

How is the Gini-Index minimized in CART Algorithm for Decision Trees?

For neural networks, for example, I minimize the cost function by using the backpropagation algorithm. Is there something equivalent for the Gini index in decision trees?
The CART algorithm always states "choose the partition of set A that minimizes the Gini index", but how do I actually find that partition mathematically?
Any input on this would be helpful :)
For a decision tree, there are different methods for splitting continuous variables like age, weight, income, etc.
A) Discretize the continuous variable so it can be used as a categorical variable in all aspects of the DT algorithm. This can be done either only once at the start (keeping that discretization static), or at every stage where a split is required, using percentiles, interval ranges, or clustering to bucketize the variable.
B) Split at all possible distinct values of the variable and see where there is the highest decrease in the Gini Index. This can be computationally expensive. So, there are optimized variants where you sort the variables and instead of choosing all distinct values, choose the midpoints between two consecutive values as the splits. For example, if the variable 'weight' has 70, 80, 90 and 100 kgs in the data points, try 75, 85, 95 as splits and pick the best one (highest decrease in Gini or other impurities)
As for the exact split algorithm implemented in scikit-learn in Python, rpart in R, and MLlib in PySpark, and how they differ when splitting a continuous variable, I am not sure and am still researching. A small sketch of approach B follows below.
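As an illustration of approach B, here is a small self-contained Python sketch (not the exact scikit-learn/rpart/MLlib implementation) that tries the midpoints between consecutive sorted values and keeps the threshold with the lowest weighted Gini impurity:

import numpy as np

def gini(labels):
    # Gini impurity of a set of binary class labels
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_threshold(values, labels):
    # Sort by the continuous variable, evaluate the midpoints between
    # consecutive values, and return (threshold, weighted Gini) of the best split.
    order = np.argsort(values)
    values = np.asarray(values)[order]
    labels = np.asarray(labels)[order]
    midpoints = (values[:-1] + values[1:]) / 2.0
    best = (None, np.inf)
    n = len(labels)
    for t in np.unique(midpoints):
        left, right = labels[values <= t], labels[values > t]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (t, weighted)
    return best

weights = [70, 80, 90, 100]   # the 'weight' example from above
classes = [0, 0, 1, 1]
print(best_threshold(weights, classes))   # (85.0, 0.0): a perfect split at the midpoint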
Here there is a good example of the CART algorithm. Basically, we compute the Gini index like this:
For each attribute we have different values, and each value has a Gini index according to the classes of the records it covers. For example, if we had two classes (positive and negative), each value of an attribute will cover some records that belong to the positive class and some that belong to the negative class, so we can calculate the probabilities. Say an attribute called weather had two values (e.g. rainy and sunny), and we had this information:
rainy: 2 positive, 3 negative
sunny: 1 positive, 2 negative
we could say:
Gini(rainy) = 1 - (2/5)^2 - (3/5)^2 = 0.48
Gini(sunny) = 1 - (1/3)^2 - (2/3)^2 ≈ 0.444
Then we can take the weighted sum of the Gini indexes for weather (assuming we had a total of 8 records):
Gini(weather) = (5/8) × 0.48 + (3/8) × 0.444 ≈ 0.467
We do this for all the other attributes (like we did for weather) and at the end we choose the attribute with the lowest Gini index to be the one to split the tree on. We have to do all this at each split (unless we could classify the sub-tree without the need for splitting).
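To make the arithmetic above concrete, a few lines of Python reproduce the numbers for the hypothetical weather attribute:

def gini(pos, neg):
    # Gini index for a value covering `pos` positive and `neg` negative records
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

g_rainy = gini(2, 3)                          # 0.48
g_sunny = gini(1, 2)                          # ≈ 0.444
weighted = 5 / 8 * g_rainy + 3 / 8 * g_sunny  # weighted by how many records each value covers
print(g_rainy, g_sunny, weighted)             # 0.48  0.444...  0.466...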

Explained variance calculation

My questions are specific to https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.
I don't understand why you square the eigenvalues here: https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/decomposition/pca.py#L444
Also, explained_variance is not computed for newly transformed data, only for the original data used to compute the eigenvectors. Is that not normally done?
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)
pca.transform(Y)
In this case, won't you separately calculate the explained variance for data Y as well? For that purpose, I think we would have to use point 3 instead of using eigenvalues.
Explained variance can also be computed by taking the variance of each axis in the transformed space and dividing by the total variance. Any reason that is not done here?
Answers to your questions:
1) The square roots of the eigenvalues of the scatter matrix (e.g. XX.T) are the singular values of X (see here: https://math.stackexchange.com/a/3871/536826). So you square them. Important: the initial matrix X should be centered (data has been preprocessed to have zero mean) in order for the above to hold.
2) Yes this is the way to go. explained_variance is computed based on the singular values. See point 1.
3) It's the same but in the case you describe you HAVE to project the data and then do additional computations. No need for that if you just compute it using the eigenvalues / singular values (see point 1 again for the connection between these two).
Finally, keep in mind that not everyone really wants to project the data. Someone can compute only the eigenvalues and then immediately estimate the explained variance WITHOUT projecting the data. So that's the gold-standard way to do it.
EDIT 1:
Answer to edited Point 2
No. PCA is an unsupervised method. It only transforms the X data, not the Y (labels).
Again, the explained variance can be computed quickly, easily, and with half a line of code using the eigenvalues/singular values, OR, as you said, using the projected data, e.g. by estimating the covariance of the projected data, in which case the variances of the PCs will be on the diagonal.
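A short sketch with synthetic data shows that all three routes give the same numbers: sklearn's explained_variance_, the squared singular values divided by n - 1 (point 1), and the per-axis variance of the projected data (point 3):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 5) @ rng.randn(5, 5)   # toy data: 200 samples, 5 correlated features

pca = PCA(n_components=2, svd_solver='full')
Z = pca.fit_transform(X)                  # projected data

n = X.shape[0]
from_singular = pca.singular_values_ ** 2 / (n - 1)   # point 1: square the singular values
from_projection = Z.var(axis=0, ddof=1)               # point 3: variance of each projected axis

print(pca.explained_variance_)   # what sklearn reports
print(from_singular)             # same values
print(from_projection)           # same values again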

How to plot a Pearson correlation given a time series?

I am using the code from this website http://blog.chrislowis.co.uk/2008/11/24/ruby-gsl-pearson.html to compute a Pearson correlation given two time series, like so:
require 'gsl'
pearson_correlation = GSL::Stats::correlation(
  GSL::Vector.alloc(first_metrics), GSL::Vector.alloc(second_metrics)
)
This returns a number such as -0.2352461593569471.
I'm currently using the highcharts library and am feeding it two sets of timeseries data. Given that I have a finite time series for both sets, can I do something with this number (-0.2352461593569471) to create a third time series showing the slope of this curve? If anyone can point me in the right direction I'd really appreciate it!
No, the correlation doesn't tell you anything about the slope of the line of best fit. Its square (r²) tells you approximately what proportion of the variability in one variable (or one time series, in this case) can be explained by the other. There is a reasonably good description here: http://www.graphpad.com/support/faqid/1141/.
How you deal with the data in your specific case is highly dependent on what you're trying to achieve. Are you trying to show that variable X causes variable Y? If so, you could start by dropping the time-series-ness, and just treat the data as paired values, and use linear regression. If you're trying to find a model of how X and Y vary together over time, you could look at multivariate linear regression (I'm not very familiar with this, though).
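If you do go the linear-regression route, a minimal Python sketch looks like this (the two arrays are hypothetical stand-ins for your two metric series, treated as paired values rather than as series over time):

import numpy as np
from scipy.stats import linregress

first_metrics = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
second_metrics = np.array([2.1, 1.9, 2.6, 2.4, 3.1, 2.8])

result = linregress(first_metrics, second_metrics)
print(result.slope)        # slope of the line of best fit
print(result.intercept)    # intercept of the line of best fit
print(result.rvalue)       # Pearson correlation, the same quantity GSL::Stats::correlation returns
print(result.rvalue ** 2)  # r^2: proportion of variability explained

The slope and intercept together give you the line you could overlay on a scatter plot of the paired values; the correlation coefficient alone can't produce that third series.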