tcsh error: while loop

This is a basic program but since I'm a newbie, I'm not able to figure out the solution.
I have a file named rama.xvg in the following format:
-75.635 105.879 ASN-2
-153.704 64.7089 ARG-3
-148.238 -47.6076 GLN-4
-63.2568 -8.05441 LEU-5
-97.8149 -7.34302 GLU-6
-119.276 8.99017 ARG-7
-144.198 -103.917 SER-8
-65.4354 -10.3962 GLY-9
-60.6926 12.424 ARG-10
-159.797 -0.551989 PHE-11
65.9924 -48.8993 GLY-12
179.677 -7.93138 GLY-13
..........
...........
-70.5046 38.0408 GLY-146
-155.876 153.746 TRP-147
-132.355 151.023 GLY-148
-66.2679 167.798 ASN-2
-151.342 -33.0647 ARG-3
-146.483 41.3483 GLN-4
..........
..........
-108.566 0.0212432 SER-139
47.6854 33.6991 MET-140
47.9466 40.1073 ASP-141
46.4783 48.5301 SER-142
-139.17 172.486 LYS-143
58.9514 32.0602 SER-144
60.744 18.3059 SER-145
-94.0533 165.745 GLY-146
-161.809 177.435 TRP-147
129.172 -101.736 GLY-148
I need to extract all the lines containing "ASN-2" in one file all_1.dat and so on for all the 147 residues.
If I run the following command in the terminal, it gives the desired output for ASN-2:
awk '{if( NR%147 == 1 ) printf $0 "\n"}' rama.xvg > all_1.dat
To avoid doing it repeatedly for all the residues, I have written the following code.
#!/bin/tcsh
set i = 1
while ( $i < 148 )
    echo $i
    awk '{if( NR%147 == i ) printf $0 "\n"}' rama.xvg > all_"$i".dat
    @ i++
end
But this code prints the lines containing GLY-148 in all the output files.
Please let me know what is the error in this code. I think it is related to nesting.

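In your awk line, i is an awk variable, not the shell variable. The awk program is inside single quotes, so the shell never substitutes $i; inside awk, i is unset and compares equal to 0, and NR%147 == 0 is true only for every 147th line, which is the GLY-148 line. To pass the shell variable $i to awk, use -v:
awk -v i="$i" '{if( NR%147 == i%147 ) print}' rama.xvg > all_"$i".dat
(The i%147 on the right lets i = 147 match the lines where NR%147 is 0; with a plain == i, all_147.dat would stay empty.)
But I think it would be better to let awk do the whole job in a single pass:
awk '{ i = (NR-1)%147 + 1; print > ("all_" i ".dat") }' rama.xvg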
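A scaled-down sketch of the single-pass idea, assuming a hypothetical file rama_small.xvg with 3 residues per frame instead of 147 (the file name, the shortened values, and the variables n, dir and slot are made up for illustration). It maps line numbers to output files with (NR-1)%n + 1, so the last residue of each frame, where NR%n is 0, also gets its own file:

```shell
# Hypothetical miniature of rama.xvg: 3 residues per frame, 2 frames.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '-75.6 105.8 ASN-2' \
  '-153.7 64.7 ARG-3' \
  '-148.2 -47.6 GLN-4' \
  '-66.2 167.7 ASN-2' \
  '-151.3 -33.0 ARG-3' \
  '-146.4 41.3 GLN-4' > "$tmpdir/rama_small.xvg"

n=3   # residues per frame (147 in the question)

# (NR-1)%n + 1 maps line numbers 1,2,3,4,5,6 to slots 1,2,3,1,2,3.
awk -v n="$n" -v dir="$tmpdir" '
{ slot = (NR-1)%n + 1; print > (dir "/all_" slot ".dat") }
' "$tmpdir/rama_small.xvg"

cat "$tmpdir/all_3.dat"   # the two GLN-4 lines
```

Both n and dir reach awk through -v, which is the same mechanism as passing $i in the loop version.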

Related

An awk script without hard-coded field information

We have the following awk script that extracts fields 6, 7 and 14 from a CSV file:
awk -F, '{for (i=1; i<=NF; i++) if (i in [6, 7, 14]) printf "%s,", $i; print ""}' $input_file
The script works beautifully, except that the information about the fields of interest is hard-coded. We would like to be able to pass this information as a single command line argument (or even a series of command line arguments), to make the script more versatile. We tried a few things, including the following, but we keep getting a syntax error:
awk -F, '{for (i=1; i<=NF; i++) if (i in ['$2']) printf "%s,", $i; print ""}' $input_file
awk -F, '{for (i=1; i<=NF; i++) if (i in [6, 7, 14]) printf "%s,", $i; print ""}' $input_file
is not valid awk syntax (awk has no [6, 7, 14] list literal), which is one reason why
awk -F, '{for (i=1; i<=NF; i++) if (i in ['$2']) printf "%s,", $i; print ""}' $input_file
or any variation of it would also give you a syntax error.
This is probably what you're trying to do:
awk -F, -v vals="$2" '
BEGIN { split(vals,tmp); for (i in tmp) arr[tmp[i]] }
{ for (i=1; i<=NF; i++) if (i in arr) printf "%s,", $i; print "" }
' "$input_file"
assuming $2 contains a comma-separated string like 6,7,14 and your input file is a CSV with unquoted fields.
That would still print a trailing , on each line, and looping through every field of every input line just to discard the ones you don't want is inefficient. This solves both of those problems:
awk -v vals="$2" '
BEGIN { FS=OFS=","; n=split(vals,arr) }
{ for (i=1; i<=n; i++) printf "%s%s", $(arr[i]), (i<n ? OFS : ORS) }
' "$input_file"
Another option is to not use (g)awk, and use cut:
cut -d "," -f "6,7,14" inputfile
(or: a="6,7,14"; cut -d "," -f "$a" inputfile)
When input contains:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,2,25,26
output should look like:
f,g,n
6,7,14
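One practical difference between the two approaches, sketched on the sample row above: the split()-based awk version emits fields in the order they are requested, while cut always emits them in file order. The helper name pick is hypothetical, and the sketch assumes unquoted CSV fields as stated earlier.

```shell
# Sample row from the text above.
line='a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'

# pick LIST: print the comma-separated field numbers in LIST, in that order.
pick() {
  awk -v vals="$1" '
    BEGIN { FS = OFS = ","; n = split(vals, arr, ",") }
    { for (i = 1; i <= n; i++) printf "%s%s", $(arr[i]), (i < n ? OFS : ORS) }
  '
}

in_order=$(printf '%s\n' "$line" | pick "6,7,14")
reordered=$(printf '%s\n' "$line" | pick "14,6")
via_cut=$(printf '%s\n' "$line" | cut -d, -f14,6)

echo "$in_order"    # f,g,n
echo "$reordered"   # n,f
echo "$via_cut"     # f,n  (cut ignores the requested order)
```

If the output order never differs from the file order, cut is the simpler tool; the awk version is worth it when columns must be rearranged.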

How to get cardinality of fields with AWK?

I am trying to count the unique occurrences for each field in a txt file.
Sample:
2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA
Full dataset here: https://github.com/markgrover/cloudcon-hive (Flight delay dataset from 2008.)
For a single column we can do:
for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head |sort | uniq | wc -l ; done |tr '\n' ':' ; echo
Is there a way to do it in one go for all the columns?
I think the expected output looks like this:
1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:
For the entire dataset:
1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        vals[i][$i]
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
    }
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
and with any awk:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        if ( !seen[i,$i]++ ) {
            cnt[i]++
        }
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", cnt[i], (i<NF?OFS:ORS)
    }
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
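A minimal check of the portable seen[i,$i] approach on a tiny, made-up three-column input (the rows below are hypothetical, not taken from the dataset). As a small variation, this sketch remembers the widest record in nf instead of relying on NF still being set in the END block:

```shell
counts=$(printf '%s\n' '2008,DL,ATL' '2008,DL,IAH' '2008,AA,JAX' |
awk '
BEGIN { FS = ","; OFS = ":" }
{
    if (NF > nf) nf = NF               # remember the widest record seen
    for (i = 1; i <= NF; i++)
        if (!seen[i, $i]++) cnt[i]++   # first time this value shows up in column i
}
END {
    for (i = 1; i <= nf; i++)
        printf "%s%s", cnt[i] + 0, (i < nf ? OFS : ORS)
}')

echo "$counts"   # 1:2:3
```

Column 1 holds one distinct value, column 2 two, and column 3 three, which is what the counts report.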
In GNU awk:
$ awk '
BEGIN { FS=OFS="," }                            # delimiters to ,
{
    for(i=1;i<=NF;i++)                          # iterate over every field
        a[i][$i]                                # store unique values to 2d hash
}
END {                                           # after all the records
    for(i=1;i<=NF;i++)                          # iterate the unique values for each field
        for(j in a[i])
            c[i]++                              # count them and
    for(i=1;i<=NF;i++)
        printf "%s%s",c[i], (i==NF?ORS:OFS)     # output the values
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3
The output is not exactly the same; I'm not sure whether the mistake is yours or mine. Well, the last column has the values 79, 0 and NA, so mine is more accurate on that one.
Another awk:
This gives rolling counts; pipe it to tail -1 to get the last line for the overall counts.
$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++)
printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file
1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3

Append lines to a previous line

I am trying to append all lines that begin with > to the previous line that did not begin with >
cat tmp
ATAAACGGAAAAACACTACTTTAGCTTACGGGATCCGGT
>Aa_816
>Aa_817
>Aa_818
CCAAACGGAAAAACACTACTTGAGCTTACGGGATCCGGT
>Aa_940
>Aa_941
CTAAAAGGAAAAACACTACTTTAGCTTTTGGGATCCGGT
What I want is this:
ATAAACGGAAAAACACTACTTTAGCTTACGGGATCCGGT >Aa_816 >Aa_817 >Aa_818
CCAAACGGAAAAACACTACTTGAGCTTACGGGATCCGGT >Aa_940 >Aa_941
CTAAAAGGAAAAACACTACTTTAGCTTTTGGGATCCGGT
This almost gets me there:
cat tmp |awk '!/>/ {sub(/\\$/,""); getline t; print $0 t; next}; 1'
With awk:
awk '!/^>/{printf "%s%s", (NR==1)?"":RS,$0;next}{printf "%s", FS $0}END{print ""}' file
Using awk
awk '!/>/{printf (NR==1)?$0:RS $0;next}{printf FS $0}' file
If you don't mind an empty first line in the output (and no newline after the last line), here is a shorter one:
awk '{printf (/>/?FS $0:RS $0)}' file
I think all you need is a little sed:
sed ':a; N; $!ba; s/\n>/ >/g' file
Results:
ATAAACGGAAAAACACTACTTTAGCTTACGGGATCCGGT >Aa_816 >Aa_817 >Aa_818
CCAAACGGAAAAACACTACTTGAGCTTACGGGATCCGGT >Aa_940 >Aa_941
CTAAAAGGAAAAACACTACTTTAGCTTTTGGGATCCGGT
awk '/^[^>]/ { if (length(old) > 0) print old; old = $0 }
     /^>/    { old = old " " $0 }
     END     { if (length(old) > 0) print old }' file
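A compact sketch of the same buffering idea, run on shortened, made-up sequences so the result is easy to check by eye (the variable names buf and joined are illustrative): each sequence line is held back until the next sequence line or end of input, and every > line in between is glued onto it.

```shell
joined=$(printf '%s\n' 'ATAAACGG' '>Aa_816' '>Aa_817' 'CCAAACGG' '>Aa_940' 'CTAAAAGG' |
awk '
/^>/ { buf = buf " " $0; next }            # ">" line: append to the buffered line
     { if (NR > 1) print buf; buf = $0 }   # sequence line: flush the previous one
END  { if (NR > 0) print buf }             # flush whatever is still buffered
')

printf '%s\n' "$joined"
```

Unlike the printf-based one-liners, this never uses input data as a format string, so a % in the data would be harmless.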

missing field and extra space after using for loop in awk

I need to use an awk script to extract some information from a file.
I have a title line which has 11 fields, and I split it into an array called titleList.
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
After finding a matching line, I need to print each field preceded by its title. For example, if the result is:
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
I must print it in this way:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18
Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
I use a for loop to manage it:
for (i=0 ;i<=NF ;i++)
{
printf "%s %s %s %s",titleList[i],":",$i," "
}
Everything looks good except the result, which has two problems:
first, there is extra space between the parts of each result, and second, the last field of the matched line is missing:
Student Number : 92839342 Name : Robert Bloomingdale Lab1 : 9 Lab2 : 26
Lab3:18 Lab4 : 22 Lab5 : 9 Lab6 : 12 Exam1 : 25 Exam2 : 39 Final
What should I do?
Is there a problem with the \n at the end of the matched line?
The extra whitespace comes from the spaces between the %s specifiers in your format string; you can remove it by tightening the printf statement:
awk -F ":" 'NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
Contents of file.txt:
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
Results:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18 Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
EDIT:
Also, you're missing the last value because the file you're working with probably has Windows line endings. To fix this, run dos2unix file.txt before running your awk code. Alternatively, you can set awk's record separator so that it understands those line endings:
awk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
EDIT:
The above requires GNU awk, because a multi-character RS is not portable. Also, split() splits on FS by default, so there is no need to pass it as an argument; it's common to use "next" rather than specifying opposite conditions; and it's common to use print "" instead of printf "\n", so that you use the ORS setting rather than hard-coding its value in output statements. So, the above should be tweaked to:
gawk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array); next } { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; print "" }' file.txt
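If GNU awk is not available, one alternative (a different technique from setting RS, sketched here on a shortened, made-up header and row) is to strip a trailing carriage return from each record with sub() before using the fields. Modifying $0 makes awk re-split the record, so the last field no longer carries the \r:

```shell
result=$(printf 'Name:Lab1:Final\r\nRobert Bloomingdale:9:99\r\n' |
awk -F: '
{ sub(/\r$/, "") }                       # drop a trailing carriage return, if any
NR == 1 { n = split($0, title); next }   # header row -> title[1..n]
{
    for (i = 1; i <= NF; i++)
        printf "%s:%s%s", title[i], $i, (i < NF ? " " : ORS)
}')

echo "$result"   # Name:Robert Bloomingdale Lab1:9 Final:99
```

The separator expression (i < NF ? " " : ORS) also avoids the trailing space that printf "%s:%s " leaves at the end of each line.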

awk output format for average

I am computing the average of many values and printing it with awk, using the following script.
for j in `ls *.txt`; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
echo -n $j $i" "; cat $j | grep $i | awk '{ sum+=$2} END {print sum/NR}'
done;
echo ""
done
The problem is that it prints the value in a form like 1.2345e+05, which I do not want; I want it to print rounded values, but I am unable to find where to pass the output format.
EDIT: using {print "average,%3d = ",sum/NR}' in place of {print sum/NR}' does not help, because it prints "average,%3d 1.2345e+05".
You need printf instead of simply print. Print is a much simpler routine than printf is.
for j in *.txt; do
    for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
        # n counts only the matching lines; NR would count every line in the file
        awk -v i="$i" -v j="$j" '$0 ~ i {sum += $2; n++} END {if (n) printf "%s %s average %6d\n", j, i, sum/n}' "$j"
    done
    echo
done
You don't need ls - a glob will do.
Useless use of cat.
Quote all variables when they are expanded.
It's not necessary to use echo - AWK can do the job.
It's not necessary to use grep - AWK can do the job.
If you're getting numbers like 1.2345e+05 then %6d might be a better format string than %3d. Use printf in order to use format strings - print doesn't support them.
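A small illustration of the print versus printf difference, using a made-up value: print converts a non-integer number with OFMT, whose default is "%.6g", which is where the exponent notation comes from, while printf applies whatever format you give it.

```shell
out=$(awk 'BEGIN {
    avg = 1234567.8
    print avg              # non-integer value: formatted with OFMT, default "%.6g"
    printf "%d\n", avg     # fractional part dropped
    printf "%.0f\n", avg   # rounded to the nearest integer
}')

printf '%s\n' "$out"
# 1.23457e+06
# 1234567
# 1234568
```

Note that %d truncates while %.0f rounds, so the choice matters when "round figures" are wanted.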
The following all-AWK script might do what you're looking for and be quite a bit faster. Without seeing your input data I've made a few assumptions, primarily that the command name being matched is in column 1.
awk '
BEGIN {
    cmdstring = "emptyloop dd cp sleep10 gpid forkbomb gzip bzip2"
    n = split(cmdstring, cmdarray)
    for (i = 1; i <= n; i++) {
        cmds[cmdarray[i]]
    }
}
$1 in cmds {
    sums[$1, FILENAME] += $2
    counts[$1, FILENAME]++
    files[FILENAME]
}
END {
    for (file in files) {
        for (cmd in cmds) {
            # skip commands that never appeared in this file (avoids dividing by zero)
            if ((cmd, file) in counts)
                printf "%s %s %6d\n", file, cmd, sums[cmd, file]/counts[cmd, file]
        }
    }
}' *.txt