Calculating a t-test in Stata - optimization

I am currently trying to run a t-test on a variable to determine whether it is statistically significantly different from 1. Here is the code I am using:
ttest dm1=1
And it is spitting out this output:
I don't want my null hypothesis to be that the mean equals 1; I want it to be that the coefficient on dm1 equals 1. When I do the calculation by hand ((Beta(dm1) - 1)/SE(Beta(dm1))), I get that the t-statistic should be around -48.89. What is the code to test whether the coefficient is statistically different from one, if this is not the proper way? Also, here is an image of the regression model for reference:

The first t-test syntax tests the null hypothesis that the mean of dm1 is 1. It has nothing to do with the regression coefficients at all.
If I understand what you are asking, you want a Wald test:
sysuse auto
reg price mpg weight i.foreign
test mpg=1
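For completeness, and purely as a hedged sketch outside Stata: the same kind of coefficient test can be run in Python with statsmodels, whose t_test method accepts a constraint string much like Stata's test command. The data and variable names below are made up, not from the question:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; in the question this would be the model containing dm1.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 0.5 * df["x"] + rng.normal(size=200)

results = smf.ols("y ~ x", data=df).fit()
# Test H0: coefficient on x equals 1 (not the default 0), i.e. (beta_x - 1)/SE(beta_x).
print(results.t_test("x = 1"))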

Related

Predict a nonlinear array based on 2 features with scalar values using XGBoost or equivalent

So I have been looking at XGBoost as a place to start with this; however, I am not sure of the best way to accomplish what I want.
My data is set up something like this,
where every value, whether input or output, is numerical. The issue I'm facing is that I only have 3 input values for every set of several output values.
I have seen that XGBoost has a multi-output regression method; however, I have only really seen it used to predict around 2 outputs per input, whereas my data may have upwards of 50 output values needing to be predicted from only a handful of scalar input features.
I'd appreciate any ideas you may have.
For reference, I've been looking mainly at these two demos (they are the same idea; one uses scikit-learn and the other XGBoost):
https://machinelearningmastery.com/multi-output-regression-models-with-python/
https://xgboost.readthedocs.io/en/stable/python/examples/multioutput_regression.html
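No answer is included here, but as a rough sketch of the two setups those links describe, with made-up shapes and parameters (a handful of scalar inputs, ~50 outputs per row): a recent XGBRegressor can take a 2-D target directly, and scikit-learn's MultiOutputRegressor fits one independent model per output column.

import numpy as np
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))    # a handful of scalar input features
y = rng.normal(size=(200, 50))   # many output values per row

# Option 1: native multi-output support (recent XGBoost, 2-D y passed directly)
native = XGBRegressor(tree_method="hist", n_estimators=200)
native.fit(X, y)
print(native.predict(X[:5]).shape)   # (5, 50)

# Option 2: one independent XGBoost model per output column
wrapped = MultiOutputRegressor(XGBRegressor(n_estimators=200))
wrapped.fit(X, y)
print(wrapped.predict(X[:5]).shape)  # (5, 50)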

Auto Arima Fit Warning

I am trying to apply a SARIMAX model to predict monthly sales, but when I try to fit the model I get this warning:
Too few observations to estimate starting parameters for seasonal
ARMA. All parameters except for variances will be set to zeros.
This happens even though the dataset shows clear seasonality.
Seasonal_decompose plot:
I've used a stepwise search to find the best model orders, but I still get the warning and a pretty bad RMSE compared to the test data.
from pmdarima import auto_arima

stepwise_model = auto_arima(df_arima['sales_diff'],
                            start_p=1, start_q=1,
                            max_p=3, max_q=3, m=12,
                            start_P=0, seasonal=True,
                            d=1, D=1, trace=True,
                            error_action='ignore',
                            suppress_warnings=True,
                            stepwise=True)
PS: the original data is non-stationary, so I have to work with differencing to make it stationary.
Any tips to work around that?
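No answer is included here, but the warning itself points at the likely cause: with only a few seasonal cycles of monthly data, and D=1 removing another 12 observations, statsmodels cannot estimate starting values for the seasonal ARMA terms and falls back to zeros. A quick sanity check, sketched on a synthetic stand-in for df_arima['sales_diff'], is to count how many full cycles actually remain:

import numpy as np
import pandas as pd

# Synthetic stand-in for the question's monthly series (36 points = 3 years).
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
series = pd.Series(np.random.default_rng(0).normal(size=36), index=idx)

m = 12                                   # seasonal period passed to auto_arima
usable = len(series.dropna()) - m        # D=1 seasonal differencing removes m points
print(f"{len(series)} observations, {usable} left after seasonal differencing, "
      f"about {usable / m:.1f} seasonal cycles for the seasonal ARMA terms")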

How do I compare effectiveness of different linear regression models

I have a dataframe which contains three more or less significant correlations between the target column and other columns (LinearRegression's coef_ from sklearn shows 57, 97 and 79), and I don't know which model to choose: should I use only the most correlated column for the regression, or run the regression with all three predictors? Is there any way to compare the models' effectiveness? Sorry, I'm very new to data analysis and couldn't google any tools for this task.
Well, first of all, you must know that when we are choosing the best model to apply to new data, we want the model that best fits out-of-sample data, i.e. samples that were not present in the training process; after all, you want to predict new cases. In your case, a new number.
So, how can we do this? The best approach is to use metrics that help us decide which model is better for our dataset.
There are many kinds of metrics for regression:
MAE: Mean absolute error is the mean of the absolute value of the errors. This is the easiest metric to understand, since it's just the average error.
MSE: Mean squared error is the mean of the squared errors. It's more popular than mean absolute error because it puts more weight on large errors.
RMSE: Root mean squared error is the square root of the mean squared error. This is one of the most popular evaluation metrics because it is interpretable in the same units as the response variable y, which makes its value easy to relate to the data.
RAE: Relative absolute error takes the total absolute error and normalizes it by dividing by the total absolute error of a simple baseline predictor that always predicts the mean of y (y bar).
You can work with any of these, but I highly recommend using MSE and RMSE.
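As a minimal sketch of that comparison in scikit-learn (the data is synthetic and only reuses the 57/97/79 coefficients from the question as made-up ground truth): fit one model on the single strongest predictor and one on all three, then compare held-out MAE and RMSE.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                      # three candidate predictors
y = 57 * X[:, 0] + 97 * X[:, 1] + 79 * X[:, 2] + rng.normal(scale=10, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, cols in [("strongest predictor only", [1]), ("all three predictors", [0, 1, 2])]:
    model = LinearRegression().fit(X_train[:, cols], y_train)
    pred = model.predict(X_test[:, cols])
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: MAE={mae:.2f}  RMSE={rmse:.2f}")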

Is multiple regression the best approach for optimization?

I am being asked to take a look at a scenario where a company has many projects they wish to complete, but, as with any company, budget comes into play. There is a Y value, a predefined score, with multiple X inputs. There are also 3 main constraints: capital cost, expense cost, and time to completion in months.
The ask is whether an algorithmic approach could be used to optimize which projects should be done for the year given the 3 constraints. The approach should also give different results if the constraint values change. The suggested method is multiple regression, though I have looked into different approaches in detail. I would like to ask the wider community: has anyone dealt with a similar problem, and what approaches have you used?
The first thing we should understand is that a conclusion is not based on a single argument.
This comes from communication theory: every human builds a frame of knowledge (an understanding, a conclusion), and that frame is constructed from many pieces of knowledge/information.
The consequence is that we cannot use single-variable linear regression to create an ML/DL system.
We should use at least two different variables to form a sub-conclusion. If we insist on a single variable with simple linear regression (y = mx + c), it is like forcing the computer to predict something with low accuracy; whatever optimization method you pick, the accuracy stays low, because linear regression applied to real life amounts to predicting a 'habit' from data rather than calculating the real condition.
That means we should use multiple linear regression (y = m1*x1 + m2*x2 + ... + c) so the computer can form a conclusion / build a regression model. But it is not that simple: because the computer tries to draw a conclusion from data with multiple characteristics/variances, you must classify both the data and the conclusions.
For example, try to make the computer understand Pythagoras.
We know the Pythagorean formula is c = (a^2 + b^2)^(1/2), and we want the computer to predict the hypotenuse c from the two input values a and b. To do that, we build a model, a multiple linear regression formula, for the Pythagorean data.
Step 1: of course, we should build a multi-column dataset of Pythagorean values.
Here is an example:
a   b   c
3   4   5
8   6   10
3   14  etc.; try putting in 10 to 20 rows.
Then fit a multiple regression formula to predict c from the a and b values.
You will find that the fit is highly accurate for some values (above 98%) and not very accurate for others (under 90%); for example, a = 3 with b = 14 or b = 15 gives a low-accuracy result (under 90%).
So you must optimize. But how?
There are many optimization methods, but I found a manual way: if I exclude the data that give low-accuracy results, put them in a separate group, and then refit on that excluded group, I get a much better result. Repeat until you reach the accuracy target you want.
Each group of data, with its own new regression, is a new class.
That means I end up with several multiple regressions based on the data I input (one regression per group of data / class), and the accuracy is really high, 99% to 99.99%.
With several classes, each regression acts as a 'label' for its class; this is what happens in the background of the automated computation. With many libraries the user appears to supply a 'string' object as the label, but in truth that string object is bound to a regression that was constructed as the label.
With some conditional parameters you can get good ML with a minimal amount of training data.
Try it in Excel / LibreOffice before going any further.
Try to follow the tutorial from this video and implement it on simple data that is easy to construct in Excel, like the Pythagorean example; a Python sketch of the same idea follows below.
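As a rough Python sketch of the fitting step described above (my own illustration rather than anything from the answer, and a linear fit can of course only approximate the nonlinear Pythagorean formula):

import numpy as np
from sklearn.linear_model import LinearRegression

# A few Pythagorean-style rows: predict c from a and b with a multiple
# linear regression c ~ m1*a + m2*b + intercept, as described above.
a = np.array([3, 8, 5, 6, 9, 12, 7, 20, 3, 3], dtype=float)
b = np.array([4, 6, 12, 8, 12, 5, 24, 21, 14, 15], dtype=float)
c = np.sqrt(a**2 + b**2)

X = np.column_stack([a, b])
model = LinearRegression().fit(X, c)
pred = model.predict(X)

# Per-row "accuracy" in the loose sense used above: closeness of the
# prediction to the true hypotenuse, as a percentage.
accuracy = 100 * (1 - np.abs(pred - c) / c)
for row in zip(a, b, c, pred, accuracy):
    print("a=%g b=%g c=%.2f pred=%.2f acc=%.1f%%" % row)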
So the answer is yes: multiple regression is the best approach for optimization.

What is the output of XGboost using 'rank:pairwise'?

I use the Python implementation of XGBoost. One of the objectives is rank:pairwise, and it minimizes the pairwise loss (Documentation). However, the documentation does not say anything about the range of the output. I see numbers between -10 and 10, but can it in principle be anywhere from -inf to +inf?
Good question. You may have a look at this Kaggle competition:
Actually, in the learning-to-rank field, we are trying to predict the relative score of each document for a specific query. That is, this is not a regression problem or a classification problem. Hence, if a document attached to a query gets a negative predicted score, it means, and only means, that it is relatively less relevant to the query than the other document(s) with positive scores.
It gives a predicted score for ranking.
However, the scores are valid for ranking only within their own groups.
So we must set the groups for the input data.
For easy ranking, refer to my project xgboostExtension.
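A minimal sketch of what setting those groups looks like with the core XGBoost API (synthetic data; the group sizes and parameters here are made up):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))
y = rng.integers(0, 3, size=12)        # relevance labels per document

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([4, 4, 4])            # three queries with 4 documents each

params = {"objective": "rank:pairwise", "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=20)

# Scores are only comparable within a query group, not across groups.
scores = bst.predict(xgb.DMatrix(X))
print(scores.reshape(3, 4))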
If I understand your question correctly, you mean the output of the predict function on a model fitted using rank:pairwise.
Predict gives the predicted variable (y_hat).
This is the same as for reg:linear / binary:logistic etc. The only difference is that reg:linear builds trees to minimize RMSE(y, y_hat), while rank:pairwise builds trees to maximize MAP(rank(y), rank(y_hat)). However, the output is always y_hat.
Depending on the values of your dependent variable, the output can be anything, but I typically expect the output to have much smaller variance than the dependent variable. This is usually the case because it is not necessary to fit extreme data values; the trees just need to produce predictions that are large/small enough to be ranked first/last in the group.