Slow sampling of Stan compared to BUGS - Bayesian
I am trying to switch from WinBUGS to Stan, since I like the programming language better and Stan's R interface is nicer than WinBUGS's. I recoded my hierarchical Bayesian model in the Stan language and applied all the recommended settings I could find online. However, my WinBUGS sampler is more than twenty times as fast as the Stan sampler. (I am talking about sampling only, not compilation, which I know is slower for Stan.)
I have tried to speed up the sampling in every way I could find: I vectorized my model, and I am running my chains on 2 cores for Stan versus only 1 for BUGS. Every forum I can find says Stan shouldn't be much slower than BUGS. Does anyone have any idea why the speed difference is so large and how I can speed up my sampling?
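For reference, the parallel setup described above corresponds to an rstan call along these lines (a sketch only; the file name, data object, and iteration count are placeholders, not the code actually used):

library(rstan)
## run 2 chains in parallel on 2 cores
fit <- stan(file = "model.stan", data = stan_data,
            chains = 2, cores = 2, iter = 2000)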
My models for BUGS and Stan are posted below. The first three lines of each model body are the main part; the rest is just priors. T and N are both about 150.
Thank you!
BUGS:
model
{
  for (i in 1:N)
  {
    for (t in 1:T)
    {
      Y[i,t] ~ dnorm(mu[i,t], hy[i])
      mu[i,t] <- alpha1[i] + rho[i] * X0[i,t] + phi[i,t] * X1[i,t] + beta1[i] * X2[i,t] + beta2[i] * X3[i,t] + beta3[i] * X4[i,t] + beta4[i] * gt[i,t]
      phi[i,t] <- alpha2[i] + gamma1[i] * exp(gamma2[i] * pow(M[i] - P[i,t], 2)) + beta5[i] * X5[i,t] + beta6[i] * X6[i,t]
    }
    alpha1[i] ~ dflat()
    hy[i] ~ dgamma(0,0)
    rho[i] ~ dflat()
    beta1[i] ~ dflat()
    beta2[i] ~ dflat()
    beta4[i] ~ dflat()
    alpha2[i] ~ dnorm(mu_alpha2, sigma_alpha2)
    gamma1[i] ~ dnorm(mu_gamma1, sigma_gamma1)
    gamma2[i] ~ dnorm(mu_gamma2, sigma_gamma2)
    M[i] ~ dnorm(mu_M, sigma_M)
    beta3[i] ~ dnorm(mu_beta3, sigma_beta3)
    beta5[i] ~ dnorm(mu_beta5, sigma_beta5)
    beta6[i] ~ dnorm(mu_beta6, sigma_beta6)
  }
  mu_alpha2 ~ dflat()
  mu_gamma1 ~ dflat()
  mu_gamma2 ~ dunif(-1.5, -0.5)
  mu_M ~ dunif(0.7, 1.3)
  mu_beta3 ~ dflat()
  mu_beta5 ~ dflat()
  mu_beta6 ~ dflat()
  sigma_alpha2 ~ dgamma(0.1, 0.1)
  sigma_gamma1 ~ dgamma(0.1, 0.1)
  sigma_gamma2 ~ dgamma(0.1, 0.1)
  sigma_M ~ dgamma(0.1, 0.1)
  sigma_beta3 ~ dgamma(0.1, 0.1)
  sigma_beta5 ~ dgamma(0.1, 0.1)
  sigma_beta6 ~ dgamma(0.1, 0.1)
}
Stan:
transformed parameters {
  for (i in 1:N) {
    phi[i] <- alpha2[i] + gamma1[i] * exp(gamma2[i] * (M[i] - P[i]) .* (M[i] - P[i])) + beta4[i] * log(t) + beta5[i] * D4[i];
    mu[i] <- alpha1[i] + phi[i] .* X1[i] + rho[i] * X0[i] + beta1[i] * X2[i] + beta2[i] * X3[i] + beta3[i] * X4[i];
    sigma[i] <- sqrt(sigma_sq[i]);
  }
  sigma_beta3 <- sqrt(sigma_sq_beta3);
  sigma_gamma1 <- sqrt(sigma_sq_gamma1);
  sigma_gamma2 <- sqrt(sigma_sq_gamma2);
  sigma_M <- sqrt(sigma_sq_M);
  sigma_beta4 <- sqrt(sigma_sq_beta4);
  sigma_beta5 <- sqrt(sigma_sq_beta5);
  sigma_alpha2 <- sqrt(sigma_sq_alpha2);
}
model {
  // Priors
  alpha1 ~ normal(0, 100);
  rho ~ normal(0, 100);
  beta1 ~ normal(0, 100);
  beta2 ~ normal(0, 100);
  mu_beta3 ~ normal(0, 100);
  mu_alpha2 ~ normal(0, 100);
  mu_gamma1 ~ normal(0, 100);
  mu_gamma2 ~ uniform(-1.5, -0.5);
  mu_M ~ uniform(0.7, 1.3);
  mu_beta4 ~ normal(0, 100);
  mu_beta5 ~ normal(0, 100);
  sigma_sq ~ inv_gamma(0.001, 0.001);
  sigma_sq_beta3 ~ inv_gamma(0.001, 0.001);
  sigma_sq_alpha2 ~ inv_gamma(0.001, 0.001);
  sigma_sq_gamma1 ~ inv_gamma(0.001, 0.001);
  sigma_sq_gamma2 ~ inv_gamma(0.001, 0.001);
  sigma_sq_M ~ inv_gamma(0.001, 0.001);
  sigma_sq_beta4 ~ inv_gamma(0.001, 0.001);
  sigma_sq_beta5 ~ inv_gamma(0.001, 0.001);
  // Likelihoods
  beta3 ~ normal(mu_beta3, sigma_beta3);
  alpha2 ~ normal(mu_alpha2, sigma_alpha2);
  gamma1 ~ normal(mu_gamma1, sigma_gamma1);
  gamma2 ~ normal(mu_gamma2, sigma_gamma2);
  M ~ normal(mu_M, sigma_M);
  beta4 ~ normal(mu_beta4, sigma_beta4);
  beta5 ~ normal(mu_beta5, sigma_beta5);
  for (i in 1:N) {
    Y[i] ~ normal(mu[i], sigma[i]);
  }
}
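One general note for readers with the same problem: hierarchical terms such as beta3 ~ normal(mu_beta3, sigma_beta3) often give Stan's sampler funnel-shaped posteriors that force tiny step sizes, and a non-centered parameterization usually samples much faster. A minimal sketch for a single coefficient, keeping the old assignment syntax used above (the declarations are illustrative, not the poster's actual code):

parameters {
  real mu_beta3;
  real<lower=0> sigma_beta3;
  vector[N] beta3_raw;   // standard-normal innovations
}
transformed parameters {
  vector[N] beta3;
  // implies beta3[i] ~ normal(mu_beta3, sigma_beta3) without sampling it directly
  beta3 <- mu_beta3 + sigma_beta3 * beta3_raw;
}
model {
  beta3_raw ~ normal(0, 1);   // replaces beta3 ~ normal(mu_beta3, sigma_beta3)
}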
Related
"Error: Attempt to redefine node" in Mixture that changes size every iteration
My data has three columns: Time, Interval, Count. I have a mixture of Poissons that goes like this:

mod_string = " model{
  for(i in 2:length(Count)){
    Count[i] ~ dpois(lambda.hacked[i]*z[i] + 0.0001)
    z[i] ~ dbern(p)
    lambda.hacked[i] <- mu[ clust[i] ]
    Prob <- p^-(1:i) * (1-p) / p
    mu <- (Time[1:i] - Interval[1:i])*lambda
    clust[i] ~ dcat( Prob )
  }
  ## Priors
  lambda ~ dgamma(0.01,0.02)
  p ~ dbeta(1,1)
}"

mu changes size at every iteration: as i grows, the number of clusters also grows. How can I adapt this?
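For context on why this fails: in JAGS/BUGS every node may be defined exactly once, but Prob and mu here are redefined on each pass through the loop. The usual workaround is to give such quantities an extra loop index so each iteration owns its own nodes. A minimal sketch of the indexing idea only (it assumes N = length(Count) is passed as data, and it addresses the node-redefinition error, not whether the growing mixture itself is well posed):

for(i in 2:N){
  for(j in 1:i){
    Prob[i,j] <- pow(p, -j) * (1-p) / p       # one node per (i,j), never redefined
    mu[i,j] <- (Time[j] - Interval[j]) * lambda
  }
  clust[i] ~ dcat(Prob[i,1:i])                # dcat normalizes the weights itself
  lambda.hacked[i] <- mu[i, clust[i]]
  z[i] ~ dbern(p)
  Count[i] ~ dpois(lambda.hacked[i]*z[i] + 0.0001)
}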
I have code in OpenBUGS but get the error "variable CR is not defined"
model {
  for (i in 1:N) {
    dgf[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- a[subject[i]] + beta[1] * CR[i]
  }
  for (j in 1:94) {
    a[j] ~ dnorm(beta0, prec.tau)
  }
  beta[1] ~ dnorm(0.0, .000001)
  beta0 ~ dnorm(0.0, .000001)
  prec.tau ~ dgamma(0.001, .001)
  tau <- sqrt(1/prec.tau)
}
list(
  n = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1),
  dgf = c(0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,
          1,1,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,0,
          0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,
          1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,1,
          0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
          1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,
          0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,
          0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,
          1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
          1,0,0,0,0,0,0,0),
  subject = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,
              11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,20,20,
              21,21,22,22,23,23,24,24,25,25,26,26,27,27,28,28,29,29,30,30,
              31,31,32,32,33,33,34,34,35,35,36,36,37,37,38,38,39,39,40,40,
              41,41,42,42,43,43,44,44,45,45,46,46,47,47,48,48,49,49,50,50,
              51,51,52,52,53,53,54,54,55,55,56,56,57,57,58,58,59,59,60,60,
              61,61,62,62,63,63,64,64,65,65,66,66,67,67,68,68,69,69,70,70,
              71,71,72,72,73,73,74,74,75,75,76,76,77,77,78,78,79,79,80,80,
              81,81,82,82,83,83,84,84,85,85,86,86,87,87,88,88,89,89,90,90,
              91,91,92,92,93,93,94,94),
  CR = c(NA,NA,1.41,0.85,1.13,0.65,NA,NA,2.13,1.61,
         7.31,3.8,1.65,2.32,1.13,2.3,0.99,1.5,1.32,3.95,
         7.2,2.97,0.83,1.55,NA,6.5,0.89,1.2,1.52,7,
         8.68,7.41,NA,0.86,NA,1.92,NA,1.31,7.8,1.78,
         NA,1.67,NA,NA,NA,NA,NA,2.25,0.98,0.82,
         3.94,1.14,12,2.58,2.42,2.59,NA,NA,NA,NA,
         6.6,3.22,NA,2.02,2.43,1.96,0.82,1.64,1.81,1.53,
         1.01,5.21,8.33,1.14,1.49,6,5.6,2,3.33,4.08,
         NA,NA,1.14,1.25,0.85,5.42,0.85,0.65,1.02,1.33,
         1.1,1.12,NA,NA,1.53,1.76,2,0.85,2.9,5,
         4.09,2.68,0.98,1.48,0.66,0.57,5.72,2.34,0.93,2.39,
         1.39,1.44,4.77,2.39,1.79,1.2,0.81,1.25,4.69,1.22,
         1.92,1.48,2.46,NA,NA,2.53,1.12,1.74,3.45,1.22,
         1.27,2.61,1.75,0.82,NA,1.4,NA,5.1,1.24,1.5,
         1.94,1.24,1.04,1.24,NA,2.39,NA,2.07,2.19,1.6,
         6,6.38,1.17,1.2,5.62,6.39,1.82,1.31,NA,1.18,
         3.71,2.03,5.4,2.17,NA,1.94,1.57,1.44,1.35,1.63,
         1.24,1.54,1.5,NA,NA,NA,NA,1.44,NA,2.19,
         7.98,2.15,1.71,1.45,NA,0.98,2.37,1.58),
  N = 188)
I know that the error is because of the "NA" values in the variable "CR", but I don't know how to solve it. I'll appreciate any help.
You have 2 options:

1) Remove the missing CR values and the corresponding elements of n, dgf, and subject (and reduce N accordingly).

2) Define a stochastic relation for CR within your model, so that the model estimates the missing CR values and uses these estimates in the logistic regression. Something like:

for(i in 1:N){
  CR[i] ~ dnorm(CR_mu, CR_tau)
}
CR_mu ~ dnorm(0, 10^-6)
CR_tau ~ dgamma(0.001, 0.001)

CR_mu and CR_tau are probably not of interest but can be monitored if you want. Note that both approaches assume that CR is missing at random (and not e.g. censored); if CR is missing not at random, this will give you biased results.

Matt
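Putting option 2 into the asker's model, the whole thing would read something like this (a sketch; the CR priors follow the answer above and everything else is the original model unchanged):

model {
  for (i in 1:N) {
    dgf[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- a[subject[i]] + beta[1] * CR[i]
    CR[i] ~ dnorm(CR_mu, CR_tau)   # missing CR values are imputed from this distribution
  }
  for (j in 1:94) {
    a[j] ~ dnorm(beta0, prec.tau)
  }
  beta[1] ~ dnorm(0.0, .000001)
  beta0 ~ dnorm(0.0, .000001)
  prec.tau ~ dgamma(0.001, .001)
  tau <- sqrt(1/prec.tau)
  CR_mu ~ dnorm(0, 10^-6)
  CR_tau ~ dgamma(0.001, 0.001)
}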
Spline in JAGS mixing badly
I have a model that calculates a spline for mark-recapture data with survival data. The model is working fine, but the parameters that calculate the spline are mixing super badly.

       mean         2.5%      97.5%   Rhat   n.eff
...
m[1]   1.667899656  -0.555606 4.18479 2.8829 4
m[2]   1.293023680  -0.951046 3.90294 2.8476 4
m[3]   1.717855378  -0.484097 4.23105 2.8690 4
m[4]   1.723899423  -0.474260 4.23869 2.8686 4
m[5]   1.747050770  -0.456455 4.26314 2.8578 4
...

Basically, I'm calculating a recapture rate p composed of a species-specific effect p.sp and the sampling effort p.effort. I also calculate a fitness component phi with a species-specific term phi.sp, the effect of year phi.year, a climate factor phi.sum.preci, and the spline m.

run.model <- function(d,           ## incoming data (packaged up in src/analyses.R)
                      ni=1100,     ## number of iterations to run (number of draws per chain)
                      nt=10,       ## thinning rate, to save computer disk space (see p.61 Kéry)
                      nb=100,      ## burn-in; should be large enough to discard the initial part of the Markov chains that have not yet converged
                      nc=3,        ## number of chains to run; multiple chains to check convergence
                      n.cluster = 3) {

  model.jags <- function() {

    ## Priors ------------------------------------------------------------------
    ## Random effect species-specific intercept (survival)
    mu.phi.sp ~ dnorm(0,0.01)
    sigma.phi.sp ~ dunif(0,10)
    tau.phi.sp <- 1/(sigma.phi.sp)^2

    ## Random effect for recapture rate
    mu.p.sp ~ dnorm(0,0.01)

    ## Random effect of year and fixed effect of precipitation & abundance
    sigma.phi.year ~ dunif(0,10)
    tau.phi.year <- 1/(sigma.phi.year)^2

    ## fixed effect of effort
    p.effort ~ dnorm(0, 0.01) ## fixed effect

    ## Fixed precipitation per year
    phi.sum.preci ~ dnorm(0, 0.01) ## fixed effect

    ## Prior spline ------------------------------------------------------------
    ### BEGIN SPLINE ###
    ## prior distribution for the fixed effects parameters
    for (l in 1:3) {
      beta[l] ~ dnorm(0,0.1)
    }
    prior.scaleeps <- 1
    xi ~ dnorm(0, tau.xi)
    tau.xi <- pow(prior.scaleeps, -2)
    for (k in 1:nknotsb) {
      b[k] <- yi*etab[k]
      etab[k] ~ dnorm(0, tau.etab)  # hierarchical model for theta
    }  # closing k
    prior.scaleb <- 1
    yi ~ dnorm(0, tau.yi)
    tau.yi <- pow(prior.scaleb, -2)
    tau.etab ~ dgamma(.5, .5)  # chi^2 with 1 d.f.
    sigmab <- abs(xi)/sqrt(tau.etab)  # cauchy = normal/sqrt(chi^2)
    ### END SPLINE ###

    for(sp in 1:nsp) {
      ## Random species-specific intercept
      phi.sp[sp] ~ dnorm(mu.phi.sp, tau.phi.sp)
      ## Random recapture rate
      p.sp[sp] <- mu.p.sp  # Changed from a comment from Luke Jan. 9 2017
    }

    for (yr in 1:nyear) {
      ## random year
      phi.year[yr] ~ dnorm(0, tau.phi.year)
    }

    ## Likelihood!
    for(sp in 1:nsp) {  ## per species
      ## Rates -------------------------------------------------------------------
      ## recapture rate
      for (yr in 1:nyear) {
        logit(p[sp,yr]) <-  ## added logit here
          p.sp[sp] + p.effort*effort[yr]
      }  ## closing for (yr in 1:nyear)
    }  ## closing for (sp in 1:nsp)

    ## Each ID ----------------------------------------------------------------
    ## Likelihood!
    for(ind in 1:nind) {  ## nind = nrow(d$X)
      ### BEGIN SPLINE ###
      ## mean function model
      m[ind] <- mfe[ind] + mre1[ind] + mre2[ind]
      ## fixed effect part
      mfe[ind] <- beta[1] * Xfix[ind,1] + beta[2] * Xfix[ind,2] + beta[3] * Xfix[ind,3]
      mre1[ind] <- b[1]*Z[ind,1] + b[2]*Z[ind,2] + b[3]*Z[ind,3] + b[4]*Z[ind,4] + b[5]*Z[ind,5] + b[6]*Z[ind,6] + b[7]*Z[ind,7] + b[8]*Z[ind,8] + b[9]*Z[ind,9] + b[10]*Z[ind,10]
      mre2[ind] <- b[11]*Z[ind,11] + b[12]*Z[ind,12] + b[13]*Z[ind,13] + b[14]*Z[ind,14] + b[15]*Z[ind,15]
      ### END SPLINE ###
    }  ## for each individual

    for(ind in 1:nind) {  ## nind = nrow(d$X)
      for(yr in 1:nyear) {
        logit(phi[ind,yr]) <-
          phi.sp[species[ind]] +  ## effect of species
          phi.year[yr] +          ## effect of year
          ## Effect of the traits on survival values
          m[ind] +                ## spline
          phi.sum.preci*sum.rainfall[yr]  ## effect of precipitation per sampling event
      }  ## (yr in 1:nyear)

      ## First occasion
      for(yr in 1:first[ind]) {
        z[ind,yr] ~ dbern(1)
      }  ## (yr in 1:first[ind])

      ## Subsequent occasions (so, here, we're just indexing from year "first+1" onwards)
      for(yr in (first[ind]+1):nyear) {
        mu.z[ind,yr] <- phi[ind,yr-1]*z[ind,yr-1]
        z[ind,yr] ~ dbern(mu.z[ind,yr])
        ## Observation process
        sight.p[ind,yr] <- z[ind,yr]*p[species[ind],yr]  ## sight.p: probability of something being seen
        X[ind,yr] ~ dbern(sight.p[ind,yr])  ## X matrix: ind by years
      }  ## yr
    }  ## closing for(ind in 1:nind)
  }  ## closing model.jags function

  ## Calling Jags ------------------------------------------------------------
  jags.parallel(data = d$data,
                inits = d$inits,
                parameters.to.save = d$params,
                model.file = model.jags,
                n.chains = nc,
                n.thin = nt,
                n.iter = ni,
                n.burnin = nb,
                working.directory = NULL,
                n.cluster = n.cluster)
}  ## closing the run.model function

## Monitored parameters ----------------------------------------------------
get.params <- function()
  c('phi.sp','mu.phi.sp','sigma.phi.sp','mu.p.sp','sigma.p.sp','phi.year','phi','p',
    'phi.sum.preci','p.sp','p.effort','z',
    ## Spline parameters
    "m","sigmab","b","beta")
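One inexpensive check before touching the model: with ni=1100, nb=100, and nt=10, each chain contributes only 100 saved draws, so n.eff values this small are almost guaranteed regardless of mixing. A much longer run via the function's own arguments is the first thing to try (a usage sketch; the counts are arbitrary and d is the data object run.model already expects):

fit <- run.model(d, ni = 110000, nb = 10000, nt = 100, nc = 3, n.cluster = 3)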
Why are the JAGS result and depmixS4 sometimes different?
I have a data set like the following simulated data:

Pi = matrix(c(0.9,0.1,0.3,0.7),2,2,byrow=TRUE)
delta = c(.5,.5)
z = sample(c(1,2),1,prob=delta)
T = 365
for( t in 2:T){
  z[t] = sample(x=c(1,2),1,prob=Pi[z[t-1],])
}
x <- sample(x=seq(-1, 1.5, length.out=T),T,replace=TRUE)
alpha = c(-1, -3.2)
Beta = c(-4,3)
y <- NA
for(i in 1:T){
  y[i] = rbinom(1,size=10,prob=1/(1+exp(-Beta[z[i]]*x[i]-alpha[z[i]])))
}
SimulatedBinomData <- data.frame('y' = y, 'x' = x, size=rep(10,T), 'z' = z)

yy <- NA
xx <- NA
for(i in 1:dim(SimulatedBinomData)[1]){
  yy <- c(yy,c(rep(1,SimulatedBinomData$y[i]),rep(0,(SimulatedBinomData$size[i]-SimulatedBinomData$y[i]))))
  xx <- c(xx,rep(SimulatedBinomData$x[i],SimulatedBinomData$size[i]))
}
yy <- yy[-1]
xx <- xx[-1]
SimulatedBernolliData <- data.frame(y=yy, x=xx, tt=rep(c(1:T),rep(10,T)))

This is an HMM problem with two states, meaning that the hidden Markov chain z_t belongs to {1,2}. To estimate alpha and Beta in the two states I can use the package 'depmixS4' to find the maximum likelihood estimates, or I can use MCMC via the 'rjags' package. I expect these two estimates to be almost the same, yet when I run the following program on several different simulated data sets, the answers come out very different!

library("rjags")
library("depmixS4")

mod <- depmix(cbind(y,(size-y))~x, data=SimulatedBinomData, nstates=2, family=binomial(logit))
fm <- fit(mod)
getpars(fm)

n <- length(SimulatedBernolliData$y)
T <- max(SimulatedBernolliData$tt)

cat("model {

  # Transition Probability
  Ptrans[1,1:2] ~ ddirch(a)
  Ptrans[2,1:2] ~ ddirch(a)

  # States
  Pinit[1] <- 0.5  #failure
  Pinit[2] <- 0.5  #success
  state[1] ~ dbern(Pinit[2])
  for (t in 2:T) {
    state[t] ~ dbern(Ptrans[(state[t-1]+1),2])
  }

  # Parameters
  alpha[1] ~ dunif(-1.e10, 1.e10)
  alpha[2] ~ dunif(-1.e10, 1.e10)
  Beta[1] ~ dunif(-1.e10, 1.e10)
  Beta[2] ~ dunif(-1.e10, 1.e10)

  # Observations
  for (i in 1:n){
    z[i] <- state[tt[i]]
    y[i] ~ dbern(1/(1+exp(-(alpha[(z[i]+1)]+Beta[(z[i]+1)]*x[i]))))
  }
}", file="LeftBehindHiddenMarkov.bug")

jags <- jags.model('LeftBehindHiddenMarkov.bug',
                   data = list('x' = SimulatedBernolliData$x,
                               'y' = SimulatedBernolliData$y,
                               'tt' = SimulatedBernolliData$tt,
                               T = T, n = n, a = c(1,1)))

res <- coda.samples(jags, c('alpha', 'Beta', 'Ptrans', 'state'), 1000)

res.median = apply(res[[1]],2,median)
res.median[1:8]
res.mean = apply(res[[1]],2,mean)
res.mean[1:8]
res.sd = apply(res[[1]],2,sd)
res.sd[1:8]
res.mode = apply(res[[1]],2,function(x){as.numeric(names(table(x))[which.max(table(x))])})
res.mode[1:8]
You are having a problem of label switching in your JAGS code; that is, state z[i]=1 is not bound to the lower posterior value of Beta and z[i]=2 to the higher one, so in any given MCMC iteration they can switch. There are several ways to solve this problem. One of them is partial reordering: in every MCMC iteration, draw two independent values for Beta and order them so that Beta[1] < Beta[2]. You can do that by substituting

Beta[1] ~ dunif(-1.e10, 1.e10)
Beta[2] ~ dunif(-1.e10, 1.e10)

with

Beta[1:2] <- sort(Betaaux)
Betaaux[1] ~ dunif(-1.e10, 1.e10)
Betaaux[2] ~ dunif(-1.e10, 1.e10)

Of course, the ordering could also be done on the alpha parameters instead. The choice of which parameter to use for the partial reordering depends on the problem.
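Applied to the question's model file, the "# Parameters" block would then read as follows (a sketch of the substitution above; alpha is left unordered here):

  # Parameters
  alpha[1] ~ dunif(-1.e10, 1.e10)
  alpha[2] ~ dunif(-1.e10, 1.e10)
  Beta[1:2] <- sort(Betaaux)
  Betaaux[1] ~ dunif(-1.e10, 1.e10)
  Betaaux[2] ~ dunif(-1.e10, 1.e10)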
counting the number of residues in a file
I have a file as follows. I would like to count the total number of residues in each sequence.

>1DMLA
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK
>1DMLB
DDVAARLRAAGFGAVGAGATAEETRRMLHRAFDTLA
>2BHDC
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS

I tried the following code.

awk '/^>/ { res=substr($0, 2); } /^[^>]/ { print res " - " length($0); }' <file

The output of the above code is

1DMLA - 80
1DMLA - 80
1DMLA - 80
1DMLA - 79
1DMLB - 36
2BHDC - 80
2BHDC - 80
2BHDC - 80

My desired output is

1DMLA - 319
1DMLB - 36
2BHDC - 240

How do I change the above code to get my desired output?
Here's one way using awk:

awk '/^>/ && r { print r, "-", s; r=s="" } /^>/ { r = substr($0, 2); next } { s += length } END { print r, "-", s }' file

Results:

1DMLA - 319
1DMLB - 36
2BHDC - 240
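Expanded with comments, the same program reads (identical logic, just reformatted):

awk '
  /^>/ && r { print r, "-", s; r = s = "" }  # new header while a record is pending: flush it
  /^>/      { r = substr($0, 2); next }      # remember the header, stripping the leading ">"
            { s += length }                  # sequence line: add its length to the running total
  END       { print r, "-", s }              # flush the last record
' file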
awk -vRS='>' '$1{gsub( "[\r]", "",$1 ); printf "%s - %d\n", $1, length($0) - length($1) - NF + 1}' input
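Unpacked with comments, the same one-liner reads (only reformatted; -vRS is written as -v RS for readability):

awk -v RS='>' '
  $1 {                       # skip the empty record before the first ">"
    gsub("[\r]", "", $1)     # strip DOS carriage returns from the header
    # residue count = record length minus header length minus the NF-1 field separators
    printf "%s - %d\n", $1, length($0) - length($1) - NF + 1
  }' input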
This way:

awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile

Or formatted:

awk -F\> '
/^>/ {
  if (seqlen != "")
    print seqlen
  printf("%s - ", $2)
  seqlen = 0
  next
}
seqlen != "" { seqlen += length($0) }
END { print seqlen }' infile

See: Sequence length of FASTA file

Besides producing the expected result, this also handles unexpected file formats like the following, where a sequence appears before any header or a header has no sequence:

$ cat infile
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK
>1DMLB
>2BHDC
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
$ awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile
1DMLB - 0
2BHDC - 240