Add bootstrapped p-value to esttab - formatting

I bootstrapped my own p-values for a regression and now need to add them to my esttab table. I would ideally like the p-values to appear immediately below the coefficient point estimates. While I am able to place the new p-values into my table, they appear in the table as if they were new point estimates, with labels c1, c2, etc.
Here is the relevant simplified code (run after the regression), where the bootstrapped p-value is stored in the local newpvalue:
matrix pval = (`newpvalue')
estadd matrix Puse = pval
esttab , replace cells(b(star fmt(3)) Puse(fmt(3) par) )
This code produces the standard table, but with the p-value placed several rows below the coefficient estimate rather than immediately beneath it. Thanks very much for any help, and please let me know if the question is not clear.

An example that might help (with in-line comments):
clear
set more off
*----- example data -----
sysuse auto
keep price weight mpg
*----- what you want -----
//regress and store
reg price weight mpg
eststo m1
// create matrix of "scalars"
matrix s = (2.1 , 2.4 , 3.2)
// rename matrix columns to coincide with those of regression
mat colnames s = weight mpg _cons
// add
estadd matrix s
// print
estout m1, cells(b s)
Very similar to: Stata: combining regression results with other results, but notice the absence of quotes in the last line of code. In your case the p-values show up as new rows labeled c1, c2, etc. because the columns of pval have no names matching the model's coefficients; once the matrix columns are renamed to match (as with mat colnames above), esttab/estout can line them up directly below the corresponding point estimates.

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have generated some data with a very large number of points and stored it in the variable vals. I have then plotted a histogram of these values, restricted so that only values between 104 and 155 are taken into account. This has been done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel(r"$m_{\gamma\gamma} (GeV)$")
plt.ylabel("Number of entries")
plt.show()
This gives the plot shown above.
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, an upper limit of 120 GeV is set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator (for an exponential distribution this is simply the sample mean, λ = (Σ x_i)/N):
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, an upper limit of 120 GeV is set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
data_area= integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like the exponential curve overlaid on the histogram.
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you need either to pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial = 0.0)
or discard the first value of background_data:
plt.plot(background_data[1:], data_area)
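As for the original goal of finding A: one way to read the advice about "scaling the PDF to the data" is to pick A so that the area under A*exp(-x/λ) over the plotted range equals the total area of the histogram (counts times bin width). Here is a rough sketch of that idea (my own, not part of the answer above), assuming vals and lamb are already defined as in the question:
import numpy as np
from scipy import integrate
import matplotlib.pyplot as plt

# Histogram set up exactly as in the question (vals is assumed to exist already).
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2.0
bin_width = bin_edges[1] - bin_edges[0]

# "Area of the data": total histogram area = sum of counts times bin width.
data_area = bin_heights.sum() * bin_width

# Area under exp(-x/lamb) over the same x-range; choose A so the two areas match.
curve_area, _ = integrate.quad(lambda x: np.exp(-x / lamb), bin_edges[0], bin_edges[-1])
A = data_area / curve_area

# Overlay B(x) = A * exp(-x/lamb) on the histogram.
x = np.linspace(bin_edges[0], bin_edges[-1], 200)
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.plot(x, A * np.exp(-x / lamb), 'r-')
plt.show()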

How to find time-varying coefficients for a VAR model by using the Kalman Filter

I'm trying to write some code in R to reproduce the model I found in this article.
The idea is to model the signal as a VAR model, but to fit the coefficients with a Kalman filter. This would essentially enable me to build a robust time-varying VAR(p) model and, to a degree, analyse non-stationary data.
The model to track the coefficients is:
X(t) = F(t) X(t-1) + W(t)
Y(t) = H(t) X(t) + E(t),
where H(t) is the Kronecker product of an identity matrix with the lagged measurements of my time series Y, and X(t) plays the role of the regression coefficients. F(t) is taken to be the identity matrix, which amounts to assuming that the coefficients evolve as a random walk.
In the article, the state-noise covariance matrix Q(t) of W(t) is initially set to 10^-3 and then fitted by some iteration scheme. The observation-noise covariance R(t) of E(t) is substituted by the covariance of the part of the signal left unexplained by the model, Y(t) - H(t)Xhat(t).
I write the a priori covariance matrix of the estimation error (denoted Σ in the article) as Pmin and the a posteriori one as P (following other sources); after the time update, the propagated covariance Pplus becomes the a priori Pmin of the next recursion, if that makes sense.
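To be explicit, the recursion I am trying to implement, written with the variable names used in my code below (Pmin for the a priori covariance, P for the a posteriori one), is, as I understand it:
K(t) = Pmin(t) H(t)' ( H(t) Pmin(t) H(t)' + R(t) )^-1
Xhatplus(t) = Xhatminus(t) + K(t) ( Y(t) - H(t) Xhatminus(t) )
P(t) = Pmin(t) - K(t) H(t) Pmin(t)
Xhatminus(t+1) = F(t) Xhatplus(t)
Pmin(t+1) = F(t) P(t) F(t)' + Q(t)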
So far I've written the following, based on the article's Appendix A 1.2:
Y <- *my timeseries, for test purposes two channels of 3000 points*
F <- diag(8) # F is (m^2*p by m^2*p), where m = 2 dimensions and p = 2 lags
H <- diag(2) %x% t(vec(Y[,1:2])) # the Kronecker product of the identity with the vectorized lags Y(t-1) and Y(t-2)
Xhatminus <- matrix(1,8,1) # an arbitrary a priori coefficient vector
Q <- diag(8) %x% (10**-7) # a diagonal state-noise covariance with a very small value; I found this used in some examples
R <- 1 # didn't know what else to put here just yet
Pmin <- diag(8) # a priori error covariance, just ones on the diagonal
Now the recursion should start. To test it, I just took the first 3000 points of one trial of my data.
Xhatstorage <- matrix(0,8,3000)
for(j in 3:3000){
  H <- diag(2) %x% t(vec(Y[,(j-2):(j-1)]))
  K <- (Pmin %*% t(H)) %*% solve(H %*% Pmin %*% t(H) + R) ## solve() gives the inverse matrix ()^-1
  P <- Pmin - K %*% H %*% Pmin
  Xhatplus <- F %*% (Xhatminus + K %*% (Y[,j] - H %*% Xhatminus))
  Pplus <- (F %*% P %*% F) + Q
  Xhatminus <- Xhatplus
  Xhatstorage[,j] <- Xhatplus
  Pmin <- Pplus
}
I extracted the Xhatplus values into a storage matrix and used them to build this primitive VAR prediction:
Yhat <- array(0,3000)
for(t in 3:3000){
  Yhat[t] <- t(vec(Y[,(t-2)])) %*% Xhatstorage[c(1,3),t] + t(vec(Y[,(t-1)])) %*% Xhatstorage[c(2,4),t]
}
The resulting plot shows the VAR with the Kalman-filter-found coefficients as a blue line against the original data in black.
I'm having trouble understanding how I can better evaluate my coefficients. Why is the fit so far off?
How should I choose the first a priori and a posteriori estimates to start the recursion? I'm fairly sure that adding more lags to the VAR is not the issue; the problem is that I don't know how to choose the initial values for Pmin and Xhatminus. Most of the sources I pieced this together from start from arbitrary zero assumptions in toy models, but in this case setting any of those matrices to zero simply collapses the entire algorithm.
Lastly, is this recursion even a correct implementation of what Oya et al. describe in the article? I know I'm still missing the evaluation of R based on the previously unexplained errors (V(t) in Appendix A 1.2), but is it correct in general?

How to calculate slope of the line

I am trying to calculate the slope of the line for a 50-day EMA I created from the adjusted closing price of a few stocks I downloaded using the getSymbols function.
My EMA looks like this:
getSymbols("COLUM.CO")
COLUM.CO$EMA <- EMA(COLUM.CO[,6],n=50)
This gives me an extra column that contains the 50-day EMA of the adjusted closing price. Now I would like to include an additional column that contains the slope of this line. I'm sure it's a fairly easy answer, but I would really appreciate some help on this. Thank you in advance.
A good way to do this is with rolling least-squares regression. rollSFM does a fast and efficient job of computing the slope of a series. It usually makes sense to look at the slope in relation to units of price activity in time (bars), so x can simply be equally spaced points.
The only tricky part is working out an effective value of n, the length of the window over which you fit the slope.
library(quantmod)
getSymbols("AAPL")
AAPL$EMA <- EMA(Ad(AAPL),n=50)
# Compute slope over 50 bar lookback:
AAPL <- merge(AAPL, rollSFM(Ra = AAPL[, "EMA"],
                            Rb = 1:nrow(AAPL), n = 50))
The column labeled beta contains the rolling-window value of the slope (alpha contains the intercept, and r.squared contains the R^2 value).

Flop count for variable initialization

Consider the following pseudo code:
a <- [0,0,0] (initializing a 3d vector to zeros)
b <- [0,0,0] (initializing a 3d vector to zeros)
c <- a . b (Dot product of two vectors)
In the above pseudo code, what is the flop count (i.e., the number of floating-point operations)?
More generally, what I want to know is whether initialization of variables counts towards the total floating point operations or not, when looking at an algorithm's complexity.
In your case both vectors a and b are all zeros, and I don't think it is a good idea to use zeros to describe or explain the flop count.
I would say: given a vector a with entries a1, a2 and a3, and a vector b with entries b1, b2 and b3, the dot product of the two vectors is aTb, which gives
aTb = a1*b1 + a2*b2 + a3*b3
Here we have 3 multiplications (a1*b1, a2*b2, a3*b3) and 2 additions. In total we have 5 operations, or 5 flops.
If we generalize this example to n-dimensional vectors a_n and b_n, we have n multiplications and n-1 additions. In total we end up with n + (n-1) = 2n-1 operations, or flops.
I hope the example I used above gives you the intuition.
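To make this concrete, here is a small Python sketch (my own illustration, not part of the original pseudo code) that tallies the flops of a dot product explicitly. Note that initializing a and b only assigns constants and involves no floating-point arithmetic, so it is conventionally counted as zero flops:
def dot_with_flop_count(a, b):
    # Returns the dot product together with the number of flops it used:
    # n multiplications and n - 1 additions, i.e. 2n - 1 flops.
    assert len(a) == len(b) and len(a) > 0
    total = a[0] * b[0]       # 1 multiplication
    flops = 1
    for ai, bi in zip(a[1:], b[1:]):
        total += ai * bi      # 1 multiplication + 1 addition
        flops += 2
    return total, flops

a = [0.0, 0.0, 0.0]           # initialization: 0 flops
b = [0.0, 0.0, 0.0]           # initialization: 0 flops
print(dot_with_flop_count(a, b))  # (0.0, 5) -> 2*3 - 1 = 5 flops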

Bayesian estimation of log-normal using JAGS

I am trying to find the 95% credible interval of 50 sample means. Sample sizes range from 2 to 600, and the values in each sample are bounded between 1 and 5.
ex:
sample 1 = (1,3.5,2.8,5,4.6)
sample 2 = (1,5)
sample 3 = (4.1,1.1,5,3.5,2,2.4,...)
Samples of size 10 or more have a log-normal distribution, so I used JAGS for Bayesian estimation of the log-normal parameters, adapted from John K. Kruschke, with the model specification below:
modelstring = "
model {
  for ( i in 1:N ) {
    # likelihood: dlnorm is parameterized by the mean and the precision of log(y)
    y[i] ~ dlnorm( muOfLogY , 1/sigmaOfLogY^2 )
  }
  # broad priors (sdOfLogY and meanOfLogY are supplied as data)
  sigmaOfLogY ~ dunif( 0.001*sdOfLogY , 1000*sdOfLogY )
  muOfLogY ~ dunif( 0.001*meanOfLogY , 1000*meanOfLogY )
  # derived quantities on the original scale
  muOfY <- exp(muOfLogY+sigmaOfLogY^2/2)
  modeOfY <- exp(muOfLogY-sigmaOfLogY^2)
  sigmaOfY <- sqrt(exp(2*muOfLogY+sigmaOfLogY^2)*(exp(sigmaOfLogY^2)-1))
}
"
The model works fine with sample sizes > 10. However, with 3 <= sample size < 10 I get extreme values for the upper limit (e.g., 3000), which exceed the maximum possible value of the mean (i.e., 5).
With a sample size of 2, I get the error below:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
I am new to JAGS and can't figure out how to solve these issues. I think that for samples of size < 10 the distribution is no longer log-normal!
Any ideas?
Thank you
First a semantic note. You are not using JAGS to find sample means. You are using JAGS to find the means of the populations from which the samples arose. If you wanted to find the sample (log)means, you could just take the mean of the (logarithms of the) sample values.
Now, if the values in each sample are bounded between 1 and 5 (due to some external constraint), then the sample is NEVER drawn from a log-normal distribution, which inherently puts probability mass over values greater than five.
Let's imagine, for the sake of argument, that the samples do arise from log-normal sampling (and therefore aren't inherently bounded between 1 and 5). Then JAGS is simply telling you that there is not enough information in the sample to get a good estimate of the mean of the population from which it is drawn. I wouldn't worry about understanding the error when the sample size is two, because there is essentially no way to get good inference about a population mean from two observations. This is true even if you know that the population is indeed log-normally distributed. And since your populations are not actually log-normally distributed (they are bounded between 1 and 5), the entire inferential procedure is invalid anyway.