Substitution from specific lines of a lookup table - awk

I'm trying to get a script to automate some tasks using the GAMESS package, from which I'd hope to extrapolate to more complex cases later. Alas it would seem my Unix programming skills are not up to par.
I have a general GAMESS input file 'ion.inp' of the form:
<tag1> energy
Dnh 2
<tag2> <tag3> .0 .0 .0
I also have (as an MWE) a lookup table of parameters for 'ion.inp', called 'table.dat'; each line of the table supplies one set of values for the <tag#> placeholders.
<tag1> | <tag2> | <tag3> | <tag4> | <tag5> | <tag6>
Hidrogen | H | 1.0 | ROHF | 0 | 2
Hidrogen cation | H | 1.0 | RHF | 1 | 1
For portability, I'd like to get a solution using POSIX sh, sed or awk, but after some trials (using sh or sed, I'm not familiar with awk at all, even though I know it is a potential solution in this case) I couldn't get it to work.
The file 'ion.inp' can be edited in place because it will be run inside a sh loop. I already got everything else working, except for this supposedly simple substitution.
Any help would be much appreciated!

Here is an example using sed and awk. The script is named script.awk. Everything is output on stdout but you could redirect it to files using > inside the AWK script. See second solution below.
The idea is to drive the process from awk using table.dat, which contains the values to substitute. For each batch of substitutions (each line of the file), once we have every tag and its value, we use sed to perform the actual substitutions.
BEGIN { FS = "\\s*\\|\\s*" } changes the field separator to "optional spaces, then |, then optional spaces". That means $1, $2, ... give us the values for tags 1, 2, ... Note that \s is a GNU awk extension; with a strictly POSIX awk, [[:space:]] can be used instead.
NR == 1 { next } is used to skip the first line which is useless since tags are ordered from 1 without any gap. If it was not the case, we would have to adapt the AWK script.
{ ... } for each line we build the sed command and execute it. The output of sed becomes the output of awk for that specific line.
BEGIN { FS = "\\s*\\|\\s*" }
NR == 1 { next }
{
    s = ""
    for (i = 1; i <= NF; i++)
        s = s sprintf(";s/%s/%s/g", "<tag" i ">", $i)
    system("sed '" s "' ion.inp")
}
$ cat table.dat
<tag1> | <tag2> | <tag3> | <tag4> | <tag5> | <tag6>
Hidrogen | H | 1.0 | ROHF | 0 | 2
Hidrogen cation | H | 1.0 | RHF | 1 | 1
$ cat ion.inp
<tag1> energy
Dnh 2
<tag2> <tag3> .0 .0 .0
$ awk -f script.awk table.dat
Hidrogen energy
Dnh 2
H 1.0 .0 .0 .0
Hidrogen cation energy
Dnh 2
H 1.0 .0 .0 .0
To redirect each result to its own file, run awk -f script2.awk table.dat, where script2.awk is:
BEGIN { FS = "\\s*\\|\\s*" }
NR == 1 { next }
{
    s = ""
    for (i = 1; i <= NF; i++)
        s = s sprintf(";s/%s/%s/g", "<tag" i ">", $i)
    system("sed '" s "' ion.inp > " sprintf("output%02d.txt", NR-1))
}
$ cat output01.txt
Hidrogen energy
Dnh 2
H 1.0 .0 .0 .0
$ cat output02.txt
Hidrogen cation energy
Dnh 2
H 1.0 .0 .0 .0
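Since the question asks for POSIX tools, the same substitution can also be done in awk alone, without calling sed. The following is only a minimal sketch and not the method used above: the function name fill_template is hypothetical, and it assumes the table values contain no & or \ characters (both are special in a gsub replacement).

```shell
# fill_template TABLE TEMPLATE   (hypothetical helper name)
# Prints TEMPLATE once per data row of TABLE, with each <tagN>
# replaced by the N-th "|"-separated field of that row.
# Assumption: the values contain no & or \ characters.
fill_template() {
    awk '
        BEGIN { FS = "[ \t]*[|][ \t]*" }      # "|" with optional blanks around it
        NR == FNR { tpl[++n] = $0; next }     # first file: store the template lines
        FNR == 1  { next }                    # second file: skip the header row
        {
            for (l = 1; l <= n; l++) {
                line = tpl[l]
                for (i = 1; i <= NF; i++)
                    gsub("<tag" i ">", $i, line)
                print line
            }
        }
    ' "$2" "$1"
}
```

It would be called as fill_template table.dat ion.inp. The template is read first so that it is held in memory while the table rows are processed.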


awk duplicates lines when skipping lines starting with # symbol

In the awk below, is there a way to process only the lines below a pattern (#CHROM) while still printing all lines in the output? The problem I am having is that if I ignore all lines starting with #, they do print in the output, but the other lines without the # get duplicated. In my data file there are thousands of lines, but only the one format below is updated by the awk. Thank you :).
file tab-delimited
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224
awk '!/^#/
BEGIN {FS = OFS = "\t"}
NF == 10 {
    split($8, a, /[=;]/)
    $11 = $12 = $13 = $14 = $15 = $18 = "."
    $16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
    $17 = "homref"
}
1' out > ref
current output tab-delimited
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 --- duplicated line ---
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref . --- this line is correct ---
desired output tab-delimited
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref .
Your first statement:
!/^#/
says "print every line that does not start with #" and your last:
1
says "print every line". Hence the duplicate lines in the output.
To modify only the lines that don't start with # but still print all lines, use:
!/^#/ { do stuff }
1
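Applied to the script from the question, that might look like the sketch below. The function name annotate_vcf is hypothetical; the field assignments are copied from the question, and the only change is that the !/^#/ test now guards the action instead of standing alone as a print statement.

```shell
# annotate_vcf FILE   (hypothetical helper name)
# Header lines (starting with #) pass through untouched; 10-field data
# lines get the extra columns; every line is printed exactly once.
annotate_vcf() {
    awk '
        BEGIN { FS = OFS = "\t" }
        !/^#/ && NF == 10 {
            split($8, a, /[=;]/)
            $11 = $12 = $13 = $14 = $15 = $18 = "."
            $16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
            $17 = "homref"
        }
        1
    ' "$1"
}
```

With the file names from the question this would be run as annotate_vcf out > ref.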

awk to update value in field of out file using contents of another

In the out.txt below I am trying to use awk to update the contents of $9. The out.txt is created by the awk before the pipe |. If $9 contains a + or -, then $8 of out.txt is used as a key to look up in $2 of file2. When a match is found (there will always be one), the $3 value of that file2 line is appended to $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I cannot seem to fix. Thank you :).
out.txt tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
desired output tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
lines 1, 3, 5: `$9` updated with `:` and the value of `$3` in `file2`
lines 2 and 4 are skipped as these do not have a `+` or `-` in them
awk -v extra=50 -v OFS='\t' '
NR == FNR {
    count[$2] = $1
    for(i = 1; i <= $1; i++) {
        low[$2, i] = $(2 + 2 * i)
        high[$2, i] = $(3 + 2 * i)
        mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
    }
    next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
    for(i = 1; i <= count[$8]; i++)
        if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
            if($4 > mid[$8, i]) {
                sign = "+"
                value = high[$8, i]
            }
            else {
                sign = "-"
                value = low[$8, i]
            }
            diff = (value > $4) ? value - $4 : $4 - value
            $9 = (diff > 50) ? ">50" : (sign diff)
            break
        }
    if(i > count[$8]) {
        $9 = ">50"
    }
}
1
' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
As far as I can tell, your awk code is OK and your bash usage is wrong.
FS='[- ]' file2 FS='\t' file1 |
awk if($6 == "-" || $6 == "+")
printf ":" ;
'FNR==NR {a[$2]=$3; next}
a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
I don't know what that's supposed to do. One thing is certain, though: on the second line, the awk code needs to be quoted (awk 'if(....). The bash error message arises because bash is trying to interpret the unquoted awk code itself, and ( is not a valid shell token after if. Note also that a bare if statement is only valid inside an awk action, i.e. within { ... }.
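For illustration, a properly quoted second stage could look like the sketch below. This is a guess at the intent based on the desired output, not a confirmed fix: the function name append_transcript is hypothetical, and it assumes that "contains a + or -" means that $9 starts with a sign, since the desired output leaves NM_005101.3:c.-84C>G untouched.

```shell
# append_transcript FILE2 OUT   (hypothetical helper name)
# FILE2 is space-delimited (gene in $2, transcript ID in $3);
# OUT is tab-delimited with a header row.
append_transcript() {
    awk '
        # First file: map gene name to transcript ID.
        NR == FNR { a[$2] = $3; next }
        # Second file: append ":<ID>" when $9 starts with + or - and the gene is known.
        FNR > 1 && $9 ~ /^[+-]/ && ($8 in a) { $9 = $9 ":" a[$8] }
        1
    ' "$1" FS='\t' OFS='\t' "$2"
}
```

The whole awk program sits inside one pair of single quotes, so bash never sees the parentheses. The FS and OFS assignments are placed between the file names so that file2 is still split on whitespace.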

AWK prints empty line of NA's at end of file

I have an older script that has been bugging me for a while now, which has a small bug in it that I haven't really gotten around to fixing, but I think it's about time. The script basically appends the columns of different files based on the ID of the rows. For example...
a 3
b 2
a 5
b 9
... should yield a result of:
a 3 5
b 2 9
The script itself looks like this:
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$2; keys[$1] }
END {
    for (key in keys) {
        printf "%s%s", key, OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
        } printf "\n"
    }
}' $1 $2
... called as $ test1.txt test2.txt. The problem is that the result I get is not exactly what I should get:
a 3 5
b 2 9
... i.e. I get a row with NA's at the very end of the file. So far I've just been deleting this row manually, but it'd be nice to not have to do that. I don't really see where this weird functionality is coming from, though... Anybody got any ideas? I'm using GAWK on OSX, if that matters.
Here's some actual input (that's what I get for trying to make the question simple and to the point! =P)
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
I'm interested in the target_id and tpm columns, the others are unimportant. My full script:
FILES=$(find . -name 'data.txt' | xargs)
# Get replicate names for column header
printf "%s" 'ENSTID'
for file in $FILES; do
    printf "\t%s" $file3
done
printf "\n"
gawk 'BEGIN { OFS="\t" }
{ vals[$1,ARGIND]=$5; keys[$1] }
END {
    for (key in keys) {
        printf "%s%s", key, OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", vals[key,colNr], (colNr<ARGIND?OFS:ORS)
        } printf "\n"
    }
}' $FILES
(i.e. all the files are named data.txt, but are located in differently named subfolders.)
A simpler idiomatic way to do it would be
$ cat test1.txt
a 3
b 2
$ cat test2.txt
a 5
b 9
$ awk -v OFS="\t" 'NR==FNR{rec[$1]=$0;next}$1 in rec{print rec[$1],$2}' test1.txt test2.txt
a 3 5
b 2 9
For the actual input
$ cat test1.txt
target_id length eff_length est_counts tpm
ENST00000574176 596 282 6 0.825408
ENST00000575242 103 718 105 5.19804
ENST00000573052 291 291 21 2.61356
ENST00000312051 223 192 2559 46.8843
$ cat test2.txt
target_id length eff_length est_counts tpm
ENST00000574176 996 122 6 0.3634
ENST00000575242 213 618 105 7.277
ENST00000573052 329 291 89 2.0356
ENST00000312051 21 00 45 0.123
$ awk 'NR==FNR{rec1[$1]=$1;rec2[$1]=$5;next}$1 in rec1{printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5}' test1.txt test2.txt
target_id tpm tpm
ENST00000574176 0.825408 0.3634
ENST00000575242 5.19804 7.277
ENST00000573052 2.61356 2.0356
ENST00000312051 46.8843 0.123
Notes:
-v OFS="\t" makes the output fields tab-separated. The order in which the files are passed matters for the first solution.
Hard-coding the newline, as in
printf "%-20s %-15s %-15s\n", rec1[$1],rec2[$1],$5
ties the output to "\n" rather than to ORS. To honour ORS instead, you may replace it with
printf "%-20s %-15s %-15s", rec1[$1],rec2[$1],$5;print "" # print "" emits only ORS; a bare print would also output $0
Edit : for more than two files
$ shopt -s globstar
$ awk 'NR==FNR{rec1[$1]=$1" "$5;next}{if($1 in rec1){rec1[$1]=rec1[$1]" "$5}else{rec1[$1]=$1" "$5}}END{for(i in rec1){if(i != "target_id"){print rec1[i];}}}' **/test*.txt
ENST00000312051 46.8843 46.8843 0.123 46.8843 0.123
ENST00000573052 2.61356 2.61356 2.0356 2.61356 2.0356
ENST00000575242 5.19804 5.19804 7.277 5.19804 7.277
ENST00000574176 0.825408 0.825408 0.3634 0.825408 0.3634
ENST77777777777 01245
ENST66666666666 7.277 7.277
$ shopt -u globstar
As far as I can see, the only reason you would get an empty line at the end of the output (which is what I get with gawk on OS X) is that you have a printf "\n" at the end of the script, which will add a newline even though you've just printed ORS.
Since your bash script is essentially an awk script, I would make a proper awk script out of it. That would additionally save you the problem of having incorrect quoting of $1 and $2 in the shell script (would break on exotic filenames). This also gives you proper syntax highlighting in your favourite text editor, if it understands Awk:
#!/usr/bin/gawk -f
BEGIN { OFS = "\t" }
{
    vals[$1,ARGIND] = $2;
    keys[$1] = 1;
}
END {
    for (key in keys) {
        printf("%s%s", key, OFS);
        for (colNr = 1; colNr <= ARGIND; colNr++) {
            printf("%s%s", vals[key,colNr], (colNr < ARGIND ? OFS : ORS));
        }
    }
}
The same can be done with more complex sed editing scripts.
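For comparison, here is a sketch of the same column join that does not rely on the gawk-only ARGIND variable. It is an illustration under assumptions rather than a drop-in replacement: the function name join_columns and the counter nfiles are made up for this example, empty input files are not handled, and the output is piped through sort because the order of for (key in keys) is unspecified.

```shell
# join_columns FILE...   (hypothetical helper name)
# Prints one row per key ($1), followed by the $2 value from each file.
join_columns() {
    awk '
        BEGIN { OFS = "\t" }
        FNR == 1 { nfiles++ }                 # a new input file starts
        { vals[$1, nfiles] = $2; keys[$1] }   # remember the value per (key, file)
        END {
            for (key in keys) {
                printf "%s%s", key, OFS
                for (col = 1; col <= nfiles; col++)
                    printf "%s%s", vals[key, col], (col < nfiles ? OFS : ORS)
            }
        }
    ' "$@" | sort
}
```

Called as join_columns test1.txt test2.txt. Each row already ends with ORS after its last column, so no extra printf "\n" is needed and no empty line appears.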

Awk - Substring comparison

Working native bash code :
while read line
do
    # a and b hold the two substrings of $line being compared
    if [[ $a != "0000000" || $b != "0000000" ]]
    then
        echo "$line" >> FILE_OT_YHAV
    else
        echo "$line" >> FILE_OT_NHAV
    fi
done <$FILE_IN
I have the following file (it's a dummy); the substrings being checked are both in the 4th field, so never mind the exact numbers.
BBBBBBB AXXXXXX CC DDDDDDD 10101010000000000000
I'm trying to write an awk script that compares two specific substrings: if either one is not 0000000 it outputs the line into File A, and if both of them are 0000000 it outputs the line into File B. This is the code I have so far:
# Before first line.
BEGIN {
    print "Awk Started"
}
# For each line of input.
{
    # print "length = #" length($0) "#"
    print "length = #" length(fline) "#"
    print "##" substr($0,112,7) "##" substr($0,123,7) "##"
    if ( (substr($0,112,7) != "0000000") || (substr($0,123,7) != "0000000") )
        print $0 > FILE_OT_YHAV;
    else
        print $0 > FILE_OT_NHAV;
}
# After last line.
END {
    print "Awk Ended"
}
The problem is that when I run it, it:
a) Treats every line as having a different length
b) Therefore applies the substrings to different parts of each line (that is why I added the print length statements before the if, to check on it).
This is a sample output of the line length awk reads and the different substrings :
Awk Started
length = #130#
## ## ##
length = #136#
##0000000##22016 ##
length = #133#
##0000001##16 ##
length = #129#
##0010220## ##
length = #138#
length = #136#
##0000000##22016 ##
length = #134#
##0000000##016 ##
length = #137#
##0000000##022016 ##
Is there a reason why awk treats lines of the same length as having a different length? Does it have something to do with the spacing of the input file?
Thanks in advance for any help.
After the comments about cleaning the file up with sed, I got this output (and yes, now the lines have different sizes):
1 0M-DM-EM-G M-A.M-E. #DEH M-SM-TM-OM-IM-WM-EM-IM-A M-DM-V/M-DM-T/M-TM-AM-P 01022016 $
2 110000080103M-CM-EM-QM-OM-MM-TM-A M-A. 6M-AM-HM-GM-MM-A 1055801001102 0000120000012001001142 19500000120 0100M-D000000000000000000000001022016 $
3 110000106302M-TM-AM-QM-EM-KM-KM-A 5M-AM-AM-HM-GM-MM-A 1043801001101 0000100000010001001361 19500000100M-IM-SM-O0100M-D000000000000000000000001022016 $
4 110000178902M-JM-AM-QM-AM-CM-IM-AM-MM-MM-G M-KM-EM-KM-AM-S 71M-AM-HM-GM-MM-A 1136101001101 0000130000013001006061 19500000130 0100M-D000000000000000000000001022016 $
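The M-x sequences in the listing above stand for bytes above 127. One plausible explanation, offered as an assumption and not something confirmed in this thread, is that awk runs in a multibyte locale and counts characters rather than bytes, so fixed offsets drift on lines that contain such bytes. A minimal sketch that forces byte semantics with LC_ALL=C is shown below; the function name split_by_substr and its arguments are hypothetical.

```shell
# split_by_substr INPUT YES_FILE NO_FILE   (hypothetical helper name)
# Lines where either 7-byte substring is not "0000000" go to YES_FILE,
# all other lines go to NO_FILE.
split_by_substr() {
    # LC_ALL=C makes length() and substr() count bytes, not characters.
    LC_ALL=C awk -v yes="$2" -v no="$3" '
        {
            a = substr($0, 112, 7)
            b = substr($0, 123, 7)
            if (a != "0000000" || b != "0000000")
                print > yes
            else
                print > no
        }
    ' "$1"
}
```

The output file names are passed in with -v so that the awk program does not depend on shell variables being expanded inside its quotes.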

How to append column to existing file in awk?

I have a file named bt.B.1.log that looks like this:
Time in seconds = 260.37
Compiled procs = 1
Time in seconds = 260.04
Compiled procs = 1
and so on for 40 records of Time in seconds and Compiled procs (dots represent useless lines).
How do I add a single column with the value of Compiled procs (which is 1) to the result of the following two commands:
This prints the average of Time in seconds values (thanks to dawg for this one)
awk -F= '/Time in seconds/ {s+=$2; c++} END {print s/c}' bt.B.1.log > t1avg.dat
Desired output:
260.20 1
This prints all of the occurrences of Time in seconds, but there is a small problem with it: it prints an extra blank line at the beginning of the list.
awk 'BEGIN { FS = "Time in seconds =" } ; { printf $2 } {printf " "}' bt.B.1.log > t1.dat
Desired output:
260.37 1
In both cases I need the value of Compiled procs to appear only once, preferably in the first line, without using intermediate files.
What I managed to do so far prints all values of Time in seconds with the Compiled procs column appearing in every line and with a strange indentation:
awk '/seconds/ {printf $5} {printf " "} /procs/ {print $4}' bt.B.1.log > t1.dat
Please help!
Contents of file bt.B.1.log:
Start in 16:40:51--25/12/2014
NAS Parallel Benchmarks 3.3 -- BT Benchmark
No input file Using compiled defaults
Size: 102x 102x 102
Iterations: 200 dt: 0.0003000
Number of active processes: 1
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.1423359722929E+04 0.1423359722929E+04 0.7507984505732E-14
2 0.9933052259015E+02 0.9933052259015E+02 0.3147459568137E-14
3 0.3564602564454E+03 0.3564602564454E+03 0.4783990739472E-14
4 0.3248544795908E+03 0.3248544795908E+03 0.2309751522921E-13
5 0.3270754125466E+04 0.3270754125466E+04 0.8481098651866E-14
Comparison of RMS-norms of solution error
1 0.5296984714094E+02 0.5296984714094E+02 0.2682819657265E-15
2 0.4463289611567E+01 0.4463289611567E+01 0.1989963674771E-15
3 0.1312257334221E+02 0.1312257334221E+02 0.4060995034457E-15
4 0.1200692532356E+02 0.1200692532356E+02 0.2958887128106E-15
5 0.1245957615104E+03 0.1245957615104E+03 0.2281113665977E-15
Verification Successful
BT Benchmark Completed.
Class = B
Size = 102x 102x 102
Iterations = 200
Time in seconds = 260.37
Total processes = 1
Compiled procs = 1
Mop/s total = 2696.83
Mop/s/process = 2696.83
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 25 Dec 2014
Compile options:
MPIF77 = mpif77
FMPI_LIB = -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lo...
FMPI_INC = -I/usr/lib/openmpi/include -I/usr/lib/openm...
RAND = (none)
Please send the results of this run to:
NPB Development Team
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
Finish in 16:45:14--25/12/2014
Start in 16:58:50--25/12/2014
NAS Parallel Benchmarks 3.3 -- BT Benchmark
No input file Using compiled defaults
Size: 102x 102x 102
Iterations: 200 dt: 0.0003000
Number of active processes: 1
Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 0.1423359722929E+04 0.1423359722929E+04 0.7507984505732E-14
2 0.9933052259015E+02 0.9933052259015E+02 0.3147459568137E-14
3 0.3564602564454E+03 0.3564602564454E+03 0.4783990739472E-14
4 0.3248544795908E+03 0.3248544795908E+03 0.2309751522921E-13
5 0.3270754125466E+04 0.3270754125466E+04 0.8481098651866E-14
Comparison of RMS-norms of solution error
1 0.5296984714094E+02 0.5296984714094E+02 0.2682819657265E-15
2 0.4463289611567E+01 0.4463289611567E+01 0.1989963674771E-15
3 0.1312257334221E+02 0.1312257334221E+02 0.4060995034457E-15
4 0.1200692532356E+02 0.1200692532356E+02 0.2958887128106E-15
5 0.1245957615104E+03 0.1245957615104E+03 0.2281113665977E-15
Verification Successful
BT Benchmark Completed.
Class = B
Size = 102x 102x 102
Iterations = 200
Time in seconds = 260.04
Total processes = 1
Compiled procs = 1
Mop/s total = 2700.25
Mop/s/process = 2700.25
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 25 Dec 2014
Compile options:
MPIF77 = mpif77
FMPI_LIB = -L/usr/lib/openmpi/lib -lmpi -lopen-rte -lo...
FMPI_INC = -I/usr/lib/openmpi/include -I/usr/lib/openm...
RAND = (none)
Please send the results of this run to:
NPB Development Team
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
Finish in 17:03:12--25/12/2014
There are 40 entries in the log, but I've provided only 2 for abbreviation purposes.
To fix the first issue, replace:
awk -F= '/Time in seconds/ {s+=$2; c++} END {print s/c}' bt.B.1.log > t1avg.dat
with:
awk 'BEGIN { FS = "[ \t]*=[ \t]*" } /Time in seconds/ { s += $2; c++ } /Compiled procs/ { if (! CP) CP = $2 } END { print s/c, CP }' bt.B.1.log >t1avg.dat
A potential minor issue is that 260.205 1 might be output but the question does not address this as a weakness of the given script. Rounding with something like printf "%.2f %s\n", s/c, CP gives 260.21 1 though. To truncate the extra digit, use something like printf "%.2f %s\n", int (s/c * 100) / 100, CP.
To fix the second issue, replace:
awk 'BEGIN { FS = "Time in seconds =" } ; { printf $2 } {printf " "}' bt.B.1.log > t1.dat
with:
awk 'BEGIN { FS = "[ \t]*=[ \t]*" } /Time in seconds/ { printf "%s", $2 } /Compiled procs/ { if (CP) { printf "\n" } else { CP = $2; printf " %s\n", $2 } }' bt.B.1.log > t1.dat
BTW, the "strange indentation" is a result of failing to output a newline when using printf and failing to filter unwanted input lines from processing.
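Both commands can also be wrapped as small functions and checked on a short sample log. This is a sketch under assumptions rather than the commands above verbatim: the function names summarize_log and list_times are hypothetical, the average is rounded to two decimals with printf, and the list version collects the values and prints them in END instead of relying on the order of the matching lines.

```shell
# summarize_log LOGFILE   (hypothetical helper name)
# Prints the mean of all "Time in seconds" values (two decimals),
# followed by the first "Compiled procs" value.
summarize_log() {
    awk '
        BEGIN { FS = "[ \t]*=[ \t]*" }
        /Time in seconds/ { s += $2; c++ }
        /Compiled procs/  { if (cp == "") cp = $2 }
        END { if (c) printf "%.2f %s\n", s / c, cp }
    ' "$1"
}

# list_times LOGFILE   (hypothetical helper name)
# Prints every "Time in seconds" value on its own line; only the first
# line also carries the "Compiled procs" value.
list_times() {
    awk '
        BEGIN { FS = "[ \t]*=[ \t]*" }
        /Time in seconds/ { t[++n] = $2 }
        /Compiled procs/  { if (cp == "") cp = $2 }
        END {
            for (i = 1; i <= n; i++)
                print (i == 1 ? t[i] " " cp : t[i])
        }
    ' "$1"
}
```

For the log in the question these would be run as summarize_log bt.B.1.log > t1avg.dat and list_times bt.B.1.log > t1.dat.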