Slow sampling of Stan compared to BUGS

I am trying to switch from WinBUGS to Stan, since I like the Stan programming language better and Stan is easier to use from R than WinBUGS. I recoded my hierarchical Bayesian model in the Stan language and applied all the recommended settings I could find online. However, my WinBUGS sampler is over twenty times as fast as the Stan sampler (so I'm not talking about compilation, which I know is slower for Stan).
I have tried every speed-up I could find online: I vectorized my model, and I'm running my chains on 2 cores for Stan versus only 1 for BUGS. Every forum I can find says Stan shouldn't be much slower than BUGS. Does anyone have any idea why the speed difference is so big and how I can speed up my sampling?
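For reference, this is roughly how I launch the Stan chains from R; model.stan and stan_data stand in for my actual model file and data list, and the iteration counts are just illustrative:
library(rstan)
# run chains in parallel on the available cores
options(mc.cores = parallel::detectCores())
# model.stan and stan_data are placeholders for the actual model file and data list
fit <- stan(file = "model.stan", data = stan_data,
            chains = 2, iter = 2000, warmup = 1000)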
My models for BUGS and Stan are posted below. The first three lines of each model are its main part; the rest is just priors. T and N are both about 150.
Thank you!
BUGS:
model
{
  for (i in 1:N)
  {
    for (t in 1:T)
    {
      Y[i,t] ~ dnorm(mu[i,t], hy[i])
      mu[i,t] <- alpha1[i] + rho[i] * X0[i,t] + phi[i,t] * X1[i,t] + beta1[i] * X2[i,t] + beta2[i] * X3[i,t] + beta3[i] * X4[i,t] + beta4[i] * gt[i,t]
      phi[i,t] <- alpha2[i] + gamma1[i] * exp(gamma2[i] * pow(M[i] - P[i,t], 2)) + beta5[i] * X5[i,t] + beta6[i] * X6[i,t]
    }
    alpha1[i] ~ dflat()
    hy[i] ~ dgamma(0,0)
    rho[i] ~ dflat()
    beta1[i] ~ dflat()
    beta2[i] ~ dflat()
    beta4[i] ~ dflat()
    alpha2[i] ~ dnorm(mu_alpha2, sigma_alpha2)
    gamma1[i] ~ dnorm(mu_gamma1, sigma_gamma1)
    gamma2[i] ~ dnorm(mu_gamma2, sigma_gamma2)
    M[i] ~ dnorm(mu_M, sigma_M)
    beta3[i] ~ dnorm(mu_beta3, sigma_beta3)
    beta5[i] ~ dnorm(mu_beta5, sigma_beta5)
    beta6[i] ~ dnorm(mu_beta6, sigma_beta6)
  }
  mu_alpha2 ~ dflat()
  mu_gamma1 ~ dflat()
  mu_gamma2 ~ dunif(-1.5, -0.5)
  mu_M ~ dunif(0.7, 1.3)
  mu_beta3 ~ dflat()
  mu_beta5 ~ dflat()
  mu_beta6 ~ dflat()
  sigma_alpha2 ~ dgamma(0.1, 0.1)
  sigma_gamma1 ~ dgamma(0.1, 0.1)
  sigma_gamma2 ~ dgamma(0.1, 0.1)
  sigma_M ~ dgamma(0.1, 0.1)
  sigma_beta3 ~ dgamma(0.1, 0.1)
  sigma_beta5 ~ dgamma(0.1, 0.1)
  sigma_beta6 ~ dgamma(0.1, 0.1)
}
Stan:
transformed parameters {
  for (i in 1:N) {
    phi[i] <- alpha2[i] + gamma1[i] * exp(gamma2[i] * (M[i] - P[i]) .* (M[i] - P[i])) + beta4[i] * log(t) + beta5[i] * D4[i];
    mu[i] <- alpha1[i] + phi[i] .* X1[i] + rho[i] * X0[i] + beta1[i] * X2[i] + beta2[i] * X3[i] + beta3[i] * X4[i];
    sigma[i] <- sqrt(sigma_sq[i]);
  }
  sigma_beta3 <- sqrt(sigma_sq_beta3);
  sigma_gamma1 <- sqrt(sigma_sq_gamma1);
  sigma_gamma2 <- sqrt(sigma_sq_gamma2);
  sigma_M <- sqrt(sigma_sq_M);
  sigma_beta4 <- sqrt(sigma_sq_beta4);
  sigma_beta5 <- sqrt(sigma_sq_beta5);
  sigma_alpha2 <- sqrt(sigma_sq_alpha2);
}
model {
  // Priors
  alpha1 ~ normal(0, 100);
  rho ~ normal(0, 100);
  beta1 ~ normal(0, 100);
  beta2 ~ normal(0, 100);
  mu_beta3 ~ normal(0, 100);
  mu_alpha2 ~ normal(0, 100);
  mu_gamma1 ~ normal(0, 100);
  mu_gamma2 ~ uniform(-1.5, -0.5);
  mu_M ~ uniform(0.7, 1.3);
  mu_beta4 ~ normal(0, 100);
  mu_beta5 ~ normal(0, 100);
  sigma_sq ~ inv_gamma(0.001, 0.001);
  sigma_sq_beta3 ~ inv_gamma(0.001, 0.001);
  sigma_sq_alpha2 ~ inv_gamma(0.001, 0.001);
  sigma_sq_gamma1 ~ inv_gamma(0.001, 0.001);
  sigma_sq_gamma2 ~ inv_gamma(0.001, 0.001);
  sigma_sq_M ~ inv_gamma(0.001, 0.001);
  sigma_sq_beta4 ~ inv_gamma(0.001, 0.001);
  sigma_sq_beta5 ~ inv_gamma(0.001, 0.001);
  // hierarchical priors
  beta3 ~ normal(mu_beta3, sigma_beta3);
  alpha2 ~ normal(mu_alpha2, sigma_alpha2);
  gamma1 ~ normal(mu_gamma1, sigma_gamma1);
  gamma2 ~ normal(mu_gamma2, sigma_gamma2);
  M ~ normal(mu_M, sigma_M);
  beta4 ~ normal(mu_beta4, sigma_beta4);
  beta5 ~ normal(mu_beta5, sigma_beta5);
  // likelihood
  for (i in 1:N) {
    Y[i] ~ normal(mu[i], sigma[i]);
  }
}

Related

"Error: Attempt to redefine node" in Mixture that changes size every iteration

My data has three columns: Time, Interval, Count. I have a mixture of Poissons that goes like this:
mod_string = " model{
for(i in 2:length(Count)){
Count[i] ~ dpois(lambda.hacked[i]*z[i]+0.0001)
z[i] ~dbern(p)
lambda.hacked[i] <- mu[ clust[i] ]
Prob <- p^-(1:i) * (1-p) / p
mu <- (Time[1:i] - Interval[1:i])*lambda
clust[i] ~ dcat( Prob)
}
## Priors
lambda ~ dgamma(0.01,0.02)
p ~ dbeta(1,1)
}"
mu changes size at every iteration. As i grows, the number of clusters also grows.
How can I adapt this?

I have code in OpenBUGS, but I get the error "variable CR is not defined"

model
{
  for (i in 1:N) {
    dgf[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- a[subject[i]] + beta[1] * CR[i]
  }
  for (j in 1:94)
  {
    a[j] ~ dnorm(beta0, prec.tau)
  }
  beta[1] ~ dnorm(0.0, .000001)
  beta0 ~ dnorm(0.0, .000001)
  prec.tau ~ dgamma(0.001, .001)
  tau <- sqrt(1/prec.tau)
}
list(
n=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
dgf=c(0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0),
subject=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31,32,32,33,33,34,34,35,35,36,36,37,37,38,38,39,39,40,40,41,41,42,42,43,43,44,44,45,45,46,46,47,47,48,48,49,49,50,50,51,51,52,52,53,53,54,54,55,55,56,56,57,57,58,58,59,59,60,60,61,61,62,62,63,63,64,64,65,65,66,66,67,67,68,68,69,69,70,70,71,71,72,72,73,73,74,74,75,75,76,76,77,77,78,78,79,79,80,80,81,81,82,82,83,83,84,84,85,85,86,86,87,87,88,88,89,89,90,90,91,91,92,92,93,93,94,94),
CR=c(NA,NA,1.41,0.85,1.13,0.65,NA,NA,2.13,1.61,7.31,3.8,1.65,2.32,1.13,2.3,0.99,1.5,1.32,3.95,7.2,2.97,0.83,1.55,NA,6.5,0.89,1.2,1.52,7,8.68,7.41,NA,0.86,NA,1.92,NA,1.31,7.8,1.78,NA,1.67,NA,NA,NA,NA,NA,2.25,0.98,0.82,3.94,1.14,12,2.58,2.42,2.59,NA,NA,NA,NA,6.6,3.22,NA,2.02,2.43,1.96,0.82,1.64,1.81,1.53,1.01,5.21,8.33,1.14,1.49,6,5.6,2,3.33,4.08,NA,NA,1.14,1.25,0.85,5.42,0.85,0.65,1.02,1.33,1.1,1.12,NA,NA,1.53,1.76,2,0.85,2.9,5,4.09,2.68,0.98,1.48,0.66,0.57,5.72,2.34,0.93,2.39,1.39,1.44,4.77,2.39,1.79,1.2,0.81,1.25,4.69,1.22,1.92,1.48,2.46,NA,NA,2.53,1.12,1.74,3.45,1.22,1.27,2.61,1.75,0.82,NA,1.4,NA,5.1,1.24,1.5,1.94,1.24,1.04,1.24,NA,2.39,NA,2.07,2.19,1.6,6,6.38,1.17,1.2,5.62,6.39,1.82,1.31,NA,1.18,3.71,2.03,5.4,2.17,NA,1.94,1.57,1.44,1.35,1.63,1.24,1.54,1.5,NA,NA,NA,NA,1.44,NA,2.19,7.98,2.15,1.71,1.45,NA,0.98,2.37,1.58), N = 188)
I know that the error is because of the "NA" values in the variable "CR", but I don't know how to solve it. I'll appreciate any help.
You have two options:
1) Remove the missing CR values and the corresponding elements of n, dgf, and subject (and reduce N accordingly); a short R sketch of this is given after the answer.
2) Define a stochastic relation for CR within your model, so that the model estimates the missing CR values and uses these estimates in the logistic regression. Something like:
for (i in 1:N) {
  CR[i] ~ dnorm(CR_mu, CR_tau)
}
CR_mu ~ dnorm(0, 10^-6)
CR_tau ~ dgamma(0.001, 0.001)
CR_mu and CR_tau are probably not of interest but can be monitored if you want.
Note that both approaches assume the CR values are missing at random (and not e.g. censored); if they are missing not at random, this will give you biased results.
Matt
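Regarding option 1, a minimal R sketch of dropping the missing cases before passing the data to OpenBUGS (bugs_data is a hypothetical name for the resulting list; the variable names are taken from the data list in the question):
# keep only the observations where CR is observed
keep <- !is.na(CR)
bugs_data <- list(CR = CR[keep],
                  n = n[keep],
                  dgf = dgf[keep],
                  subject = subject[keep],
                  N = sum(keep))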

Spline in JAGS mixing badly

I have a model that fits a spline to mark-recapture survival data. The model runs fine, but the parameters that make up the spline are mixing very badly.
mean 2.5% 97.5% Rhat n.eff
...
m[1] 1.667899656 -0.555606 4.18479 2.8829 4
m[2] 1.293023680 -0.951046 3.90294 2.8476 4
m[3] 1.717855378 -0.484097 4.23105 2.8690 4
m[4] 1.723899423 -0.474260 4.23869 2.8686 4
m[5] 1.747050770 -0.456455 4.26314 2.8578 4
...
Basically, I'm calculating a recapture rate p composed of a species-specific effect p.sp and a sampling-effort effect p.effort. I also calculate a survival component phi with a species-specific term phi.sp, a year effect phi.year, a climate effect phi.sum.preci, and the spline m.
run.model <- function(d,            ## incoming data (packaged up in src/analyses.R)
                      ni = 1100,    ## number of iterations (draws per chain)
                      nt = 10,      ## thinning rate, to save disk space (see p. 61, Kéry)
                      nb = 100,     ## burn-in; should be large enough to discard the initial, unconverged part of the chains
                      nc = 3,       ## number of chains to run; multiple chains to check convergence
                      n.cluster = 3) {
  model.jags <- function() {
    ## Priors ------------------------------------------------------------------
    ## Random species-specific intercept (survival)
    mu.phi.sp ~ dnorm(0, 0.01)
    sigma.phi.sp ~ dunif(0, 10)
    tau.phi.sp <- 1/(sigma.phi.sp)^2
    ## Random effect for recapture rate
    mu.p.sp ~ dnorm(0, 0.01)
    ## Random effect of year and fixed effect of precipitation & abundance
    sigma.phi.year ~ dunif(0, 10)
    tau.phi.year <- 1/(sigma.phi.year)^2
    ## Fixed effect of effort
    p.effort ~ dnorm(0, 0.01)
    ## Fixed effect of precipitation per year
    phi.sum.preci ~ dnorm(0, 0.01)
    ## Prior for the spline -----------------------------------------------------
    ### BEGIN SPLINE ###
    ## prior distribution for the fixed-effects parameters
    for (l in 1:3) {
      beta[l] ~ dnorm(0, 0.1)
    }
    prior.scaleeps <- 1
    xi ~ dnorm(0, tau.xi)
    tau.xi <- pow(prior.scaleeps, -2)
    for (k in 1:nknotsb) {
      b[k] <- yi*etab[k]
      etab[k] ~ dnorm(0, tau.etab)    # hierarchical model for theta
    }                                 # closing k
    prior.scaleb <- 1
    yi ~ dnorm(0, tau.yi)
    tau.yi <- pow(prior.scaleb, -2)
    tau.etab ~ dgamma(.5, .5)         # chi^2 with 1 d.f.
    sigmab <- abs(xi)/sqrt(tau.etab)  # Cauchy = normal/sqrt(chi^2)
    ### END SPLINE ###
    for (sp in 1:nsp) {
      ## Random species-specific intercept
      phi.sp[sp] ~ dnorm(mu.phi.sp, tau.phi.sp)
      ## Random recapture rate
      p.sp[sp] <- mu.p.sp             # changed following a comment from Luke, Jan. 9 2017
    }
    for (yr in 1:nyear) {
      ## random year
      phi.year[yr] ~ dnorm(0, tau.phi.year)
    }
    ## Likelihood ---------------------------------------------------------------
    for (sp in 1:nsp) {               ## per species
      ## recapture rate
      for (yr in 1:nyear) {
        logit(p[sp,yr]) <-            # added logit here
          p.sp[sp] +
          p.effort*effort[yr]
      }                               ## closing for (yr in 1:nyear)
    }                                 ## closing for (sp in 1:nsp)
    ## Each individual ------------------------------------------------------------
    for (ind in 1:nind) {             ## nind = nrow(d$X)
      ### BEGIN SPLINE ###
      ## mean-function model
      m[ind] <- mfe[ind] + mre1[ind] + mre2[ind]
      ## fixed-effect part
      mfe[ind] <- beta[1]*Xfix[ind,1] + beta[2]*Xfix[ind,2] + beta[3]*Xfix[ind,3]
      mre1[ind] <- b[1]*Z[ind,1] + b[2]*Z[ind,2] + b[3]*Z[ind,3] + b[4]*Z[ind,4] + b[5]*Z[ind,5] + b[6]*Z[ind,6] + b[7]*Z[ind,7] + b[8]*Z[ind,8] + b[9]*Z[ind,9] + b[10]*Z[ind,10]
      mre2[ind] <- b[11]*Z[ind,11] + b[12]*Z[ind,12] + b[13]*Z[ind,13] + b[14]*Z[ind,14] + b[15]*Z[ind,15]
      ### END SPLINE ###
    }
    ## for each individual
    for (ind in 1:nind) {             ## nind = nrow(d$X)
      for (yr in 1:nyear) {
        logit(phi[ind,yr]) <-
          phi.sp[species[ind]] +            ## effect of species
          phi.year[yr] +                    ## effect of year
          m[ind] +                          ## spline (effect of the traits on survival)
          phi.sum.preci*sum.rainfall[yr]    ## effect of precipitation per sampling event
      }                               ## closing for (yr in 1:nyear)
      ## First occasion
      for (yr in 1:first[ind]) {
        z[ind,yr] ~ dbern(1)
      }                               ## closing for (yr in 1:first[ind])
      ## Subsequent occasions (indexing from year first+1 onwards)
      for (yr in (first[ind]+1):nyear) {
        mu.z[ind,yr] <- phi[ind,yr-1]*z[ind,yr-1]
        z[ind,yr] ~ dbern(mu.z[ind,yr])
        ## Observation process
        sight.p[ind,yr] <- z[ind,yr]*p[species[ind],yr]  ## probability of being seen
        X[ind,yr] ~ dbern(sight.p[ind,yr])               ## X matrix: individuals by years
      }                               ## closing for (yr ...)
    }                                 ## closing for (ind in 1:nind)
  }                                   ## closing model.jags
  ## Calling JAGS --------------------------------------------------------------
  jags.parallel(data = d$data,
                inits = d$inits,
                parameters.to.save = d$params,
                model.file = model.jags,
                n.chains = nc, n.thin = nt, n.iter = ni, n.burnin = nb,
                working.directory = NULL,
                n.cluster = n.cluster)
}                                     ## closing run.model
## Monitored parameters --------------------------------------------------------
get.params <- function()
  c('phi.sp', 'mu.phi.sp', 'sigma.phi.sp', 'mu.p.sp', 'sigma.p.sp', 'phi.year', 'phi', 'p', 'phi.sum.preci', 'p.sp', 'p.effort', 'z',
    ## spline parameters
    "m", "sigmab", "b", "beta")

Why are the JAGS and depmixS4 results sometimes different?

I have a data set like the following simulated data:
Pi = matrix(c(0.9, 0.1, 0.3, 0.7), 2, 2, byrow = TRUE)
delta = c(.5, .5)
z = sample(c(1,2), 1, prob = delta)
T = 365
for (t in 2:T) {
  z[t] = sample(x = c(1,2), 1, prob = Pi[z[t-1], ])
}
x <- sample(x = seq(-1, 1.5, length.out = T), T, replace = TRUE)
alpha = c(-1, -3.2)
Beta = c(-4, 3)
y <- NA
for (i in 1:T) {
  y[i] = rbinom(1, size = 10, prob = 1/(1+exp(-Beta[z[i]]*x[i] - alpha[z[i]])))
}
SimulatedBinomData <- data.frame('y' = y, 'x' = x, size = rep(10, T), 'z' = z)
yy <- NA
xx <- NA
for (i in 1:dim(SimulatedBinomData)[1]) {
  yy <- c(yy, c(rep(1, SimulatedBinomData$y[i]), rep(0, (SimulatedBinomData$size[i] - SimulatedBinomData$y[i]))))
  xx <- c(xx, rep(SimulatedBinomData$x[i], SimulatedBinomData$size[i]))
}
yy <- yy[-1]
xx <- xx[-1]
SimulatedBernolliData <- data.frame(y = yy, x = xx, tt = rep(c(1:T), rep(10, T)))
This is an HMM problem with two states, meaning that the hidden Markov chain z_t takes values in {1,2}. To estimate alpha and Beta in the two states, I can use the package 'depmixS4' to find the maximum likelihood estimates, or I can use MCMC via the 'rjags' package.
I would expect these two estimates to be almost the same, but when I run the following program on different simulated data sets, the answers are often very different!
library("rjags")
library("depmixS4")
mod <- depmix(cbind(y,(size-y))~x, data=SimulatedBinomData, nstates=2, family=binomial(logit))
fm <- fit(mod)
getpars(fm)
n <- length(SimulatedBernolliData$y)
T <- max(SimulatedBernolliData$tt)
cat("model {
# Transition Probability
Ptrans[1,1:2] ~ ddirch(a)
Ptrans[2,1:2] ~ ddirch(a)
# States
Pinit[1] <- 0.5 #failor
Pinit[2] <- 0.5 #success
state[1] ~ dbern(Pinit[2])
for (t in 2:T) {
state[t] ~ dbern(Ptrans[(state[t-1]+1),2])
}
# Parameters
alpha[1] ~ dunif(-1.e10, 1.e10)
alpha[2] ~ dunif(-1.e10, 1.e10)
Beta[1] ~ dunif(-1.e10, 1.e10)
Beta[2] ~ dunif(-1.e10, 1.e10)
# Observations
for (i in 1:n){
z[i] <- state[tt[i]]
y[i] ~ dbern(1/(1+exp(-(alpha[(z[i]+1)]+Beta[(z[i]+1)]*x[i]))))
}
}",
file="LeftBehindHiddenMarkov.bug")
jags <- jags.model('LeftBehindHiddenMarkov.bug',
                   data = list('x' = SimulatedBernolliData$x,
                               'y' = SimulatedBernolliData$y,
                               'tt' = SimulatedBernolliData$tt,
                               T = T, n = n, a = c(1,1)))
res <- coda.samples(jags, c('alpha', 'Beta', 'Ptrans', 'state'), 1000)
res.median = apply(res[[1]], 2, median)
res.median[1:8]
res.mean = apply(res[[1]], 2, mean)
res.mean[1:8]
res.sd = apply(res[[1]], 2, sd)
res.sd[1:8]
res.mode = apply(res[[1]], 2, function(x){ as.numeric(names(table(x))[which.max(table(x))]) })
res.mode[1:8]
You have a label-switching problem in your JAGS code: state z[i]=1 is not bound to the lower posterior value of Beta, nor z[i]=2 to the higher one, so the labels can switch from one MCMC iteration to the next. There are several ways to solve this problem. One of them is partial reordering: at every MCMC iteration, draw two independent values for Beta and order them so that Beta[1] < Beta[2].
You can do that by replacing
Beta[1] ~ dunif(-1.e10, 1.e10)
Beta[2] ~ dunif(-1.e10, 1.e10)
with
Beta[1:2] <- sort(Betaaux)
Betaaux[1] ~ dunif(-1.e10, 1.e10)
Betaaux[2] ~ dunif(-1.e10, 1.e10)
Of course, the ordering could also be done on the alpha parameters instead. The choice of which parameter to use for the partial reordering depends on the problem.
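If you prefer not to touch the model, a post-hoc version of the same idea can be sketched in R on the coda output from above; note this only relabels the Beta columns, so it illustrates the idea rather than performing a full relabeling of all parameters:
# sort the pair (Beta[1], Beta[2]) within each saved MCMC iteration
beta_draws  <- as.matrix(res[[1]])[, c("Beta[1]", "Beta[2]")]
beta_sorted <- t(apply(beta_draws, 1, sort))  # column 1 = smaller, column 2 = larger
colMeans(beta_sorted)                         # posterior means after reordering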

Counting the number of residues in a file

I have a file as follows. I would like to count the number of residues in each sequence.
>1DMLA
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK
>1DMLB
DDVAARLRAAGFGAVGAGATAEETRRMLHRAFDTLA
>2BHDC
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
I tried the following code.
awk '/^>/ { res=substr($0, 2); } /^[^>]/ { print res " - " length($0); }' <file
The output of the above code is
1DMLA - 80
1DMLA - 80
1DMLA - 80
1DMLA - 79
1DMLB - 36
2BHDC - 80
2BHDC - 80
2BHDC - 80
My desired output is
1DMLA - 319
1DMLB - 36
2BHDC - 240
How do I change the above code for getting my desired output?
Here's one way using awk:
awk '/^>/ && r { print r, "-", s; r=s="" } /^>/ { r = substr($0, 2); next } { s += length } END { print r, "-", s }' file
Results:
1DMLA - 319
1DMLB - 36
2BHDC - 240
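For comparison, a small base R sketch that produces the same per-record totals (assuming the sequences are in a file named file, as above):
lines   <- readLines("file")
headers <- grepl("^>", lines)
record  <- cumsum(headers)                  # record index for every line
name    <- sub("^>", "", lines[headers])    # sequence names without the ">"
len     <- tapply(nchar(lines[!headers]), record[!headers], sum)
cat(sprintf("%s - %d", name, len), sep = "\n")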
awk -vRS='>' '$1{gsub( "[\r]", "",$1 );
printf "%s - %d\n", $1, length($0) - length($1) - NF + 1}' input
This way:
awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile
Or formatted:
awk -F\> '/^>/ { if (seqlen != "")
print seqlen
printf("%s - ",$2)
seqlen=0
next }
seqlen != ""{seqlen+=length($0)}
END{
print seqlen}' infile
see:
Sequence length of FASTA file
Apart from the expected result, this will also handle unexpected file formats like the ones below.
$ cat infile
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK
>1DMLB
>2BHDC
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
$ awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile
1DMLB - 0
2BHDC - 240