I am trying to evaluate the influence of an intervention on a health outcome.
The intervention is binary and the health outcome is a count. My data were collected at different times of the day, so to account for the correlated observations I am using a negative binomial GEE approach.
I would appreciate any suggestions on two queries:
How do I interpret the interaction term in this negative binomial GEE regression?
How do I draw a difference-in-differences (DID) graph to show the impact of the intervention?
Thank you in advance.
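For concreteness, here is a minimal sketch of this kind of model in Python with statsmodels; the column names `count` (outcome), `intervention` (0/1 group), `post` (0/1 pre/post period) and `cluster` (grouping variable for the repeated measurements) are placeholders, not from the original data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Negative binomial GEE with an intervention-by-period interaction.
model = smf.gee(
    "count ~ intervention * post",
    groups="cluster",
    data=df,
    family=sm.families.NegativeBinomial(alpha=1.0),  # alpha is held fixed in GEE
    cov_struct=sm.cov_struct.Exchangeable(),
)
res = model.fit()

# Exponentiated coefficients are rate ratios; the exponentiated interaction
# is a ratio of rate ratios, i.e. the DID effect on the multiplicative scale.
print(np.exp(res.params))

# DID graph: predicted mean counts by group and period.
grid = pd.DataFrame({"intervention": [0, 0, 1, 1], "post": [0, 1, 0, 1]})
grid["pred"] = res.predict(grid)
for g, label in [(0, "control"), (1, "intervention")]:
    sub = grid[grid["intervention"] == g]
    plt.plot(sub["post"], sub["pred"], marker="o", label=label)
plt.xticks([0, 1], ["pre", "post"])
plt.ylabel("predicted count")
plt.legend()
plt.show()
```

On the log scale the interaction is an additive DID; after exponentiating, it tells you how much the intervention group's rate changed relative to the change in the control group.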
One part of my MCMC algorithm uses a Metropolis-Hastings (MH) step to update an $(n \times 1)$ vector of parameters $\boldsymbol{\delta}$. I think it is less computationally intensive to propose a new sample from a single $n$-dimensional multivariate proposal distribution ($n$ is large). However, it seems impossible to tune the acceptance rate towards some ideal interval, such as 0.2 to 0.5.
I have tried random walk updates based on multivariate normal and multivariate uniform proposal distributions. No matter how I adjust the step size, the pattern of the acceptance rate looks similar to the following figure.
[figure: acceptance rate pattern]
Does anyone have experience with this? Any suggestions are welcome!
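For illustration, one common remedy is to adapt the step size with a Robbins-Monro scheme that targets an acceptance rate near 0.234, the value often cited as asymptotically optimal for high-dimensional random-walk Metropolis. A minimal sketch, where `log_post` and `delta0` are placeholders for the model's log posterior and a starting vector:

```python
import numpy as np

def adaptive_rw_mh(log_post, delta0, n_iter=20000, target_acc=0.234):
    """Random-walk MH on the full delta vector, with Robbins-Monro
    adaptation of a single global step size toward `target_acc`."""
    delta = np.asarray(delta0, dtype=float)
    n = delta.size
    log_step = 0.0                       # log of the global step size
    cur_lp = log_post(delta)
    draws = np.empty((n_iter, n))

    for t in range(n_iter):
        prop = delta + np.exp(log_step) * np.random.randn(n)
        prop_lp = log_post(prop)
        accept = np.log(np.random.rand()) < prop_lp - cur_lp
        if accept:
            delta, cur_lp = prop, prop_lp
        # Decaying adaptation: grow the step size after acceptances and
        # shrink it after rejections, so the rate drifts toward target_acc.
        log_step += (float(accept) - target_acc) / (t + 1) ** 0.6
        draws[t] = delta
    return draws
```

Another option is to update $\boldsymbol{\delta}$ in smaller blocks, which usually allows a larger step size per block at the cost of more posterior evaluations.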
I have a dataframe that contains three more or less significant correlations between the target column and other columns (LinarRegressionModel.coef_ from sklearn shows 57, 97 and 79). I don't know which model to choose: should I use only the most correlated column for the regression, or a regression with all three predictors? Is there any way to compare the models' effectiveness? Sorry, I'm very new to data analysis and I couldn't find any tools for this task.
First of all, when choosing the best model to apply to new data, you want the model that best fits out-of-sample data, that is, samples that were not present during training; after all, you want to predict new cases. In your case, a new number.
So how can we do this? The best approach is to use evaluation metrics that help us decide which model fits our dataset better.
There are several common metrics for regression:
MAE: Mean absolute error is the mean of the absolute value of the errors. This is the easiest of the metrics to understand since it’s just the average error.
MSE: Mean squared error is the mean of the squared errors. It's more popular than mean absolute error because it puts more weight on large errors.
RMSE: Root mean squared error is the square root of the mean squared error. This is one of the most popular evaluation metrics because it is interpretable in the same units as the response variable y, which makes it easy to relate to the data.
RAE: Relative absolute error takes the total absolute error and normalizes it by dividing by the total absolute error of a simple predictor that always predicts the mean $\bar{y}$: $\mathrm{RAE} = \sum_i |y_i - \hat{y}_i| \,/\, \sum_i |y_i - \bar{y}|$.
You can work with any of these, but I highly recommend using MSE and RMSE.
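As a sketch of how to compare your two candidate models with cross-validated RMSE in scikit-learn (the names `target`, `feat_a`, `feat_b`, `feat_c` are placeholders for your dataframe columns):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

y = df["target"]
candidates = {
    "best single predictor": df[["feat_b"]],
    "all three predictors": df[["feat_a", "feat_b", "feat_c"]],
}

for name, X in candidates.items():
    # scoring is negated so that larger is better; flip the sign to get RMSE
    rmse = -cross_val_score(LinearRegression(), X, y, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: mean RMSE = {rmse.mean():.3f} (std {rmse.std():.3f})")
```

Cross-validation gives you an out-of-sample estimate of each metric, so the comparison is not biased toward the model with more predictors.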
I'm analysing longitudinal panel data, in which individuals transition between different states in a Markov chain. I'm modelling the transition rates between states using a series of multinomial logistic regressions. This means that I end up with a very large number of regression slopes.
For each regression slope, I obtain a posterior distribution (using WinBUGS). From the posterior distribution, we get the mean, standard deviation, and 95% credible interval associated with the slope in question.
The value I am ultimately interested in is the expected first passage time ('hitting time') through the Markov chain. This is a function of all the different predictor variables, and so is built from the many regression slopes produced by the multinomial logistic regressions.
A simple approach would be to take the mean of each posterior distribution as a point-estimate for each regression slope, and solve for the expected first passage time at a series of different values of the predictor variables. I have now done this, but it is potentially misleading because it doesn't show the uncertainty around the predicted values of expected first passage time.
My question is: how can I calculate a credible interval for the expected first passage time?
My first thought was to approximate the error via simulation, by sampling individual values for the regression slopes from each posterior distribution, obtaining the expected first passage time given those values, and then plotting the standard deviation of all these simulated values. However, I feel like (a) this would make a statistician scream and (b) it doesn't take into account the fact that different posterior distributions will be correlated (it samples from each one independently).
In WinBUGS, you can actually obtain the correlations between the posterior distributions. So if the simulation idea is appropriate, I could in theory simulate the regression slope coefficients incorporating these correlations.
Is there a more direct and less approximate way to find the uncertainty? Could I, for instance, use WinBUGS to find the posterior distribution of the expected first passage time for a given set of values of the predictor variables? Rather like the answer to this question: define a new node and monitor it. I would imagine defining a series of new nodes, where each one is for a different set of actual predictor values, and monitoring each one. Does this make good statistical sense?
Any thoughts about this would be really appreciated!
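As an illustration of the simulation idea: if you export the joint posterior draws of the slopes from WinBUGS (e.g. the CODA samples), each row is one simultaneous draw of all slopes, so their posterior correlations are preserved automatically, and a credible interval can be read off the quantiles of the propagated values. `expected_fpt` below is a placeholder for your own function that builds the transition matrix from the multinomial-logit slopes at predictor values x and returns the expected first passage time.

```python
import numpy as np

def fpt_credible_interval(slope_draws, x, expected_fpt, level=0.95):
    """slope_draws: (n_draws, n_slopes) array of joint posterior samples.
    Returns the posterior mean and equal-tailed credible interval of the
    expected first passage time at predictor values x."""
    fpt = np.array([expected_fpt(draw, x) for draw in slope_draws])
    lo, hi = np.percentile(fpt, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return fpt.mean(), (lo, hi)
```

Defining the first passage time as a monitored node inside the WinBUGS model, as you suggest, amounts to the same computation done inside the sampler.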
Using scikit-learn on balanced training data of around 50 million samples (50% one class, 50% the other, 8 continuous features in the interval (0,1)), all classifiers that I have been able to try so far (Linear/LogisticRegression, LinearSVC, RandomForestClassifier, ...) show a strange behavior:
When testing on the training data, the percentage of false positives is much lower than the percentage of false negatives (FNR). When correcting the intercept manually in order to increase the false-positive rate (FPR), the accuracy actually improves considerably.
Why do the classification algorithms not find a close-to-optimal intercept (that I guess would more or less be at fpr=fnr)?
I guess the idea is that there's no single definition of "optimal": for some applications you'll tolerate false positives much more than false negatives (e.g. detecting fraud or disease, where you don't want to miss a positive), whereas for other applications false positives are much worse (predicting equipment failures, crimes, or anything else where the cost of taking action is expensive). By default, predict just chooses 0.5 as the threshold; this is usually not what you want. You need to think about your application and then look at the ROC curve and the gains/lift charts to decide where to set the prediction threshold.
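For example, something along these lines, where `clf`, `X_val` and `y_val` are placeholders for a fitted classifier and a held-out validation set (LinearSVC has no predict_proba, so you would use decision_function scores there instead):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_curve

# Scores for the positive class on held-out data (tune the threshold here,
# not on the training set).
probs = clf.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, probs)

# One possible criterion: the threshold that maximizes validation accuracy.
# In practice you would weight false positives and false negatives by their
# actual costs instead.
accs = [accuracy_score(y_val, probs >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(accs))]
y_pred = (probs >= best_t).astype(int)
```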
I am implementing an email filtering application using the Naive Bayes algorithm. My application uses the Spambase Data Set from the UCI Machine Learning Repository. Since the attributes are continuous, I calculate the probability using a probability density function (PDF). However, when I evaluate the data using k-fold cross-validation, a training set may contain only 0 for one of its attributes. As a result, I get a standard deviation of 0, the PDF returns NaN, and a huge number of spam messages are not correctly classified with that training set. What should I do to fix the problem?
You could use a discrete PDF, which will always be bounded.
Alternatively, simply ignore any attribute with zero variance. There is no point in including distributions with zero variance, because they won't actually do anything. For example, you want to know how old I am, and then I tell you that I live on planet Earth. That shouldn't change your estimate, because every single piece of data you have is for people on planet Earth.
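A rough sketch of this second option for one cross-validation fold (the array names are placeholders):

```python
import numpy as np

# X_train / X_test are the feature matrices for this fold. Drop attributes
# whose variance is zero in the training fold, so the Gaussian PDF never
# receives a zero standard deviation; apply the same column mask to the test
# fold. (Strictly, you would check the variance within each class, since the
# naive Bayes likelihood is estimated per class.)
keep = X_train.var(axis=0) > 0
X_train_nb = X_train[:, keep]
X_test_nb = X_test[:, keep]
```

If you would rather keep the attribute, another common trick is to add a small constant to every estimated variance so the density never degenerates; scikit-learn's GaussianNB exposes this idea as its var_smoothing parameter.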