Awk script displaying incorrect output - awk

I'm facing an issue with an awk script: I need to generate a report containing the lowest, highest and average score for each assignment in the data file. The name of the assignment is located in column 3.
Input data is:
Student,Catehory,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Chelsey,Homework,H02,89,100
Chelsey,Homework,H03,77,100
Chelsey,Homework,H04,80,100
Chelsey,Homework,H05,82,100
Chelsey,Homework,H06,84,100
Chelsey,Homework,H07,86,100
Chelsey,Lab,L01,91,100
Chelsey,Lab,L02,100,100
Chelsey,Lab,L03,100,100
Chelsey,Lab,L04,100,100
Chelsey,Lab,L05,96,100
Chelsey,Lab,L06,80,100
Chelsey,Lab,L07,81,100
Chelsey,Quiz,Q01,100,100
Chelsey,Quiz,Q02,100,100
Chelsey,Quiz,Q03,98,100
Chelsey,Quiz,Q04,93,100
Chelsey,Quiz,Q05,99,100
Chelsey,Quiz,Q06,88,100
Chelsey,Quiz,Q07,100,100
Chelsey,Final,FINAL,82,100
Chelsey,Survey,WS,5,5
Sam,Homework,H01,19,100
Sam,Homework,H02,82,100
Sam,Homework,H03,95,100
Sam,Homework,H04,46,100
Sam,Homework,H05,82,100
Sam,Homework,H06,97,100
Sam,Homework,H07,52,100
Sam,Lab,L01,41,100
Sam,Lab,L02,85,100
Sam,Lab,L03,99,100
Sam,Lab,L04,99,100
Sam,Lab,L05,0,100
Sam,Lab,L06,0,100
Sam,Lab,L07,0,100
Sam,Quiz,Q01,91,100
Sam,Quiz,Q02,85,100
Sam,Quiz,Q03,33,100
Sam,Quiz,Q04,64,100
Sam,Quiz,Q05,54,100
Sam,Quiz,Q06,95,100
Sam,Quiz,Q07,68,100
Sam,Final,FINAL,58,100
Sam,Survey,WS,5,5
Andrew,Homework,H01,25,100
Andrew,Homework,H02,47,100
Andrew,Homework,H03,85,100
Andrew,Homework,H04,65,100
Andrew,Homework,H05,54,100
Andrew,Homework,H06,58,100
Andrew,Homework,H07,52,100
Andrew,Lab,L01,87,100
Andrew,Lab,L02,45,100
Andrew,Lab,L03,92,100
Andrew,Lab,L04,48,100
Andrew,Lab,L05,42,100
Andrew,Lab,L06,99,100
Andrew,Lab,L07,86,100
Andrew,Quiz,Q01,25,100
Andrew,Quiz,Q02,84,100
Andrew,Quiz,Q03,59,100
Andrew,Quiz,Q04,93,100
Andrew,Quiz,Q05,85,100
Andrew,Quiz,Q06,94,100
Andrew,Quiz,Q07,58,100
Andrew,Final,FINAL,99,100
Andrew,Survey,WS,5,5
Ava,Homework,H01,55,100
Ava,Homework,H02,95,100
Ava,Homework,H03,84,100
Ava,Homework,H04,74,100
Ava,Homework,H05,95,100
Ava,Homework,H06,84,100
Ava,Homework,H07,55,100
Ava,Lab,L01,66,100
Ava,Lab,L02,77,100
Ava,Lab,L03,88,100
Ava,Lab,L04,99,100
Ava,Lab,L05,55,100
Ava,Lab,L06,66,100
Ava,Lab,L07,77,100
Ava,Quiz,Q01,88,100
Ava,Quiz,Q02,99,100
Ava,Quiz,Q03,44,100
Ava,Quiz,Q04,55,100
Ava,Quiz,Q05,66,100
Ava,Quiz,Q06,77,100
Ava,Quiz,Q07,88,100
Ava,Final,FINAL,99,100
Ava,Survey,WS,5,5
Shane,Homework,H01,50,100
Shane,Homework,H02,60,100
Shane,Homework,H03,70,100
Shane,Homework,H04,60,100
Shane,Homework,H05,70,100
Shane,Homework,H06,80,100
Shane,Homework,H07,90,100
Shane,Lab,L01,90,100
Shane,Lab,L02,0,100
Shane,Lab,L03,100,100
Shane,Lab,L04,50,100
Shane,Lab,L05,40,100
Shane,Lab,L06,60,100
Shane,Lab,L07,80,100
Shane,Quiz,Q01,70,100
Shane,Quiz,Q02,90,100
Shane,Quiz,Q03,100,100
Shane,Quiz,Q04,100,100
Shane,Quiz,Q05,80,100
Shane,Quiz,Q06,80,100
Shane,Quiz,Q07,80,100
Shane,Final,FINAL,90,100
Shane,Survey,WS,5,5
awk script :
BEGIN {
FS=" *\\, *"
}
FNR>1 {
min[$3]=(!($3 in min) || min[$3]> $4 )? $4 : min[$3]
max[$3]=(max[$3]> $4)? max[$3] : $4
cnt[$3]++
sum[$3]+=$4
}
END {
print "Name\tLow\tHigh\tAverage"
for (i in cnt)
printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
Expected sample output:
Name Low High Average
Q06 77 95 86.80
L05 40 96 46.60
WS 5 5 5
Q07 58 100 78.80
L06 60 99 61
L07 77 86 64.80
When I run the script, I get a "Low" of 0 for all assignments which is not correct. Where am I going wrong? Please guide.

You can certainly do this with awk, but since you tagged this scripting as well, I'm assuming other tools are an option. For this sort of gathering of statistics on groups present in the data, GNU datamash often reduces the job to a simple one-liner. For example:
$ (echo Name,Low,High,Average; datamash --header-in -s -t, -g3 min 4 max 4 mean 4 < input.csv) | tr , '\t'
Name Low High Average
FINAL 58 99 85.6
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2
H04 46 80 65
H05 54 95 76.6
H06 58 97 80.6
H07 52 90 67
L01 41 91 75
L02 0 100 61.4
L03 88 100 95.8
L04 48 100 79.2
L05 0 96 46.6
L06 0 99 61
L07 0 86 64.8
Q01 25 100 74.8
Q02 84 100 91.6
Q03 33 100 66.8
Q04 55 100 81
Q05 54 99 76.8
Q06 77 95 86.8
Q07 58 100 78.8
WS 5 5 5
This says: for each group of rows sharing the same value in the 3rd column (-g3; -s sorts the input first, which datamash requires for grouping) of simple CSV input (-t,) with a header line (--header-in), display the minimum, maximum, and mean of the 4th column. The result is given a new header and piped to tr to turn the commas into tabs.

Your code works as-is with GNU awk. However, running it with the -t option to warn about non-portable constructs gives:
awk: foo.awk:6: warning: old awk does not support the keyword `in' except after `for'
awk: foo.awk:2: warning: old awk does not support regexps as value of `FS'
And running the script with a different implementation of awk (mawk in my case) does give 0's for the Low column. So, some tweaks to the script:
BEGIN {
FS=","
}
FNR>1 {
min[$3]=(cnt[$3] == 0 || min[$3]> $4 )? $4 : min[$3]
max[$3]=(max[$3]> $4)? max[$3] : $4
cnt[$3]++
sum[$3]+=$4
}
END {
print "Name\tLow\tHigh\tAverage"
PROCINFO["sorted_in"] = "@ind_str_asc" # gawk-ism for pretty output; ignored on other awks
for (i in cnt)
printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
and it works as expected on that other awk too.
The changes:
Using a plain comma as the field separator instead of a regex.
Changing the min conditional so that the current value is stored the first time an assignment is seen, which is detected by checking whether cnt[$3] is still 0 (it is on the first occurrence, because the counter is only incremented on a later line), or when the current min is greater than this value.
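The first-seen check described above can be tried in isolation. This is a minimal sketch, assuming a POSIX shell and any POSIX awk; the function name score_stats and the four-row inline sample are made up for illustration, and sort stands in for the gawk-only PROCINFO ordering:

```shell
# score_stats: hypothetical helper; reads "Student,Category,Assignment,Score,Possible"
# CSV on stdin and prints low, high and average per assignment (column 3).
score_stats() {
    awk -F, '
        NR > 1 {
            s = $4 + 0      # force a numeric value
            # cnt[$3] is still 0 (unset) the first time an assignment appears,
            # so the first score seeds both min and max.
            if (cnt[$3] == 0 || s < min[$3]) min[$3] = s
            if (cnt[$3] == 0 || s > max[$3]) max[$3] = s
            cnt[$3]++
            sum[$3] += s
        }
        END {
            for (k in cnt)
                printf "%s\t%d\t%d\t%.1f\n", k, min[k], max[k], sum[k] / cnt[k]
        }
    ' | sort
}

# Tiny made-up sample, not the full data set from the question.
printf '%s\n' \
    'Student,Category,Assignment,Score,Possible' \
    'Sam,Lab,L05,0,100' \
    'Ava,Lab,L05,55,100' \
    'Ava,Quiz,Q06,77,100' \
    'Sam,Quiz,Q06,95,100' |
    score_stats
```

Because the test looks at cnt rather than at min itself, it does not depend on whether the awk implementation creates min[$3] before or after evaluating the right-hand side of the assignment.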

Another similar approach:
$ awk -F, 'NR==1 {print "name","low","high","average"; next}
{k=$3; sum[k]+=$4; count[k]++}
!(k in min) {min[k]=max[k]=$4}
min[k]>$4 {min[k]=$4}
max[k]<$4 {max[k]=$4}
END {for(k in min) print k,min[k],max[k],sum[k]/count[k]}' file |
column -t
name low high average
Q06 77 95 86.8
L05 0 96 46.6
WS 5 5 5
Q07 58 100 78.8
L06 0 99 61
L07 0 86 64.8
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2

Related

Extract info from a column based on a range within another file

I've looked around, but what I found either does lookups within the same file or combines columns on exact matches. I won't have exact matches, and combining those two approaches is above my skill level right now. Basically, I need to add an extra column containing the gene name, chosen by checking which gene's range in a second file contains the chromosome position. I think awk is my best bet, possibly with FNR==NR.
File1 looks like this, where $1 is chromosome, $2 is position, the rest of the columns are sample coverage across that position:
chr1H 49525 47 41 60 74 93 34 117
chr1H 49526 48 41 62 74 94 34 118
chr1H 53978 48 40 61 73 94 33 117
chr1H 53979 48 40 62 72 94 33 116
File2 looks like this, where $1 is the chromosome, $2 is the start of the gene $3 is the end of the gene and $4 is the gene name:
chr1H 49525 49772 gene1
chr1H 50194 50649 gene2
chr1H 53978 54323 gene3
chr1H 76743 77373 gene4
Either by overwriting or by writing a new file, I want to end up with a file that looks like this:
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
Right now my code looks like this but I'm not sure how I specify the files (right now I've put in file1 or 2 so you know what my thinking is). So that the chromosomes match in both files and the position in the coverage file is between a range within start and end positions within the second file, then printing the entire line from file1 and the gene name from file2:
awk '{ if (file1$1 == file2$1 && file1$2 >= file2$2 && file1$2 <= file2$3) print file1$0, file2$4 }' file1 file2 > file3
Thanks for any help!
You can do this fairly simply in awk by reading the range values from file2 into arrays indexed by gene name. That gives you a range by gene name to compare against the 2nd field in file1. You can do:
awk '
NR == FNR { # reading file2
b[$4] = $2 # store b[] (begin) indexed by name
e[$4] = $3 # store e[] (end) indexed by name
next # get next record
}
{ # for all file1 records
for(i in b) { # loop over values by gene name
if ($2 >= b[i] && $2 <= e[i]) { # if in range b[] to e[]
printf "%s %s\n", $0, i # output with gene name at end
next # get next record
}
}
}
' file2 file1
Example Use/Output
With the values shown in file1 and file2 you would have:
$ awk '
> NR == FNR { # reading file2
> b[$4] = $2 # store b[] (begin) indexed by name
> e[$4] = $3 # store e[] (end) indexed by name
> next # get next record
> }
> { # for all file1 records
> for(i in b) { # loop over values by gene name
> if ($2 >= b[i] && $2 <= e[i]) { # if in range b[] to e[]
> printf "%s %s\n", $0, i # output with gene name at end
> next # get next record
> }
> }
> }
> ' file2 file1
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
You can try this solution, although I would suggest using a dedicated tool such as bedtools for tasks like this. You can run into situations where intuition fails, and it's good to have stress-tested tools when that happens.
$ awk 'NR==FNR{ chr[NR]=$1; st[NR]=$2; en[NR]=$3; gene[NR]=$4; x=NR }
NR!=FNR{ for(i=1;i<=x;i++){ if($1==chr[i] && $2>=st[i] && $2<=en[i]){
print $0,gene[i]; break } } }' f2 f1
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
Btw, this doesn't handle multiple matches. If you want to allow these you can remove the break statement and change the print to a printf.
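The multiple-match variant mentioned above might look like the following. This is only a sketch, assuming a POSIX shell and awk; the name annotate_all and the overlapping sample genes are invented for illustration, and matching gene names are joined with commas on one line rather than printed separately:

```shell
# annotate_all GENES COVERAGE: hypothetical helper that appends every gene
# whose [start, end] range on the same chromosome contains the position.
annotate_all() {
    awk '
        NR == FNR {                    # first file: gene ranges
            chr[NR] = $1; st[NR] = $2; en[NR] = $3; gene[NR] = $4
            n = NR
            next
        }
        {                              # second file: coverage rows
            hits = ""
            for (i = 1; i <= n; i++)   # no break: keep scanning after a hit
                if ($1 == chr[i] && $2 + 0 >= st[i] + 0 && $2 + 0 <= en[i] + 0)
                    hits = hits (hits == "" ? "" : ",") gene[i]
            if (hits != "") print $0, hits
        }
    ' "$1" "$2"
}

# Made-up data with two overlapping genes.
tmp=$(mktemp -d)
printf '%s\n' 'chr1H 100 200 geneA' 'chr1H 150 300 geneB' > "$tmp/genes"
printf '%s\n' 'chr1H 120 47' 'chr1H 180 48' 'chr1H 500 49' > "$tmp/cov"
annotate_all "$tmp/genes" "$tmp/cov"
rm -r "$tmp"
```

Rows that fall inside no gene are dropped here, as in the answer above; printing them unchanged would be a one-line change in the final if.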
You can use join to get half way there:
sort -k 2,2 file-1 > f1-sorted
sort -k 2,2 file-2 > f2-sorted
join -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.4 f1-sorted f2-sorted
#gives
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
join joins files on a column, but it must be sorted
-1 2 -2 2 means join on the second column of file 1 and file 2
-o specifies output format for the columns (1.1 is file 1, column 1, 2.4 is file 2, column 4)
But you have an unusual requirement of matching the next immediate position number with the previous gene name. For that I would use this awk:
awk '
FNR==NR {name[$2]=$4}
FNR!=NR && name[$2] {n = name[$2]}
FNR!=NR && name[$2-1] {n = name[$2-1]}
FNR!=NR {print $0,n}' file-2 file-1
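This gives your expected output exactly for the sample data. Note that it only recognizes a position that equals a gene's start or the position immediately after it, not the full start-to-end range. You can also use FNR!=NR && n {print $0,n} to print a record only when a gene name has been found.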

Using awk to select rows with a specific value in column greater than x

I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines between 98 - 98.99... were selected and lines with a value more than 98.99 not.
I would like to extract all lines with a value greater than 98 including 99, 100 and so on.
Here my code and my input format:
for i in *input.file; do awk '$3>98' $i >{i/input./output.}; done
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Expected Output
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
Okay, if you have a series of files, *input.file and you want to select those lines where $3 > 98 and then write the values to the same prefix, but with output.file as the rest of the filename, you can use:
awk '$3 > 98 {
match (FILENAME,/input\.file$/)
print $0 > (substr(FILENAME,1,RSTART-1) "output.file")
}' *input.file
Which uses match to find the index where input.file begins and then uses substr to get the part of the filename before that and appends "output.file" to the substring for the final output filename.
match() sets the RSTART value to the index where input.file begins in the current filename, which is then used by substr to truncate the current filename at that index. See GNU awk String Functions for complete details.
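The match/substr step can be tried on its own, without any input files. This is a small sketch assuming a POSIX awk; out_name is a hypothetical helper name, and the filename is passed in with -v rather than taken from FILENAME:

```shell
# out_name: hypothetical helper mapping "<prefix>input.file" to "<prefix>output.file".
out_name() {
    awk -v f="$1" 'BEGIN {
        # match() returns the 1-based start of the match (0 if none) and
        # also stores it in RSTART.
        if (match(f, /input\.file$/))
            print substr(f, 1, RSTART - 1) "output.file"
        else
            print f            # no match: leave the name unchanged
    }'
}

out_name v1input.file    # prints v1output.file
```

Checking the return value of match() guards against names that do not end in input.file, where RSTART would be 0.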
For example, if you had input files:
$ ls -1 *input.file
v1input.file
v2input.file
Both with your example content:
$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Running the awk command above would result in two output files:
$ ls -1 *output.file
v1output.file
v2output.file
Each contains the records where the third field was greater than 98:
$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20

AWK print next line of match between matches

Let's presume I have file test.txt with following data:
.0
41
0.0
42
0.0
43
0.0
44
0.0
45
0.0
46
0.0
START
90
34
17
34
10
100
20
2056
30
0.0
10
53
20
2345
30
0.0
10
45
20
875
30
0.0
END
0.0
48
0.0
49
0.0
140
0.0
With AWK, how would I print the lines after 10 and 20 between START and END?
So the output would be:
100
2056
53
2345
45
875
I was able to get the lines with 10 and 20 with
awk '/START/,/END/ {if($0==10 || $0==20) print $0}' test.txt
but how would I get the next lines?
I actually got what I wanted with
awk '/^START/,/^END/ {if($0==10 || $0==20) {getline; print} }' test.txt
Range in awk works fine, but is less flexible than using flags.
awk '/^START/ {f=1} /^END/ {f=0} f && /^(1|2)0$/ {getline;print}' file
100
2056
53
2345
45
875
Don't use ranges as they make trivial things slightly briefer but require a complete rewrite or duplicate conditions when things get even slightly more complicated.
Don't use getline unless it's an appropriate application and you have read and fully understand http://awk.info/?tip/getline.
Just let awk read your lines as designed:
$ cat tst.awk
/START/ { inBlock=1 }
/END/ { inBlock=0 }
foundTgt { print; foundTgt=0 }
inBlock && /^[12]0$/ { foundTgt=1 }
$ awk -f tst.awk file
100
2056
53
2345
45
875
Feel free to use single-character variable names and cram it all onto one line if you find that useful:
awk '/START/{b=1} /END/{b=0} f{print;f=0} b&&/^[12]0$/{f=1}' file
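For readers who keep the getline versions from the earlier answers, one pitfall is that getline leaves $0 unchanged when there is no next line, so an unguarded "getline; print" prints the current line again. Below is a minimal sketch of a guarded form, assuming a POSIX awk; next_after_marker is an illustrative name and the inline data is a shortened, made-up version of the question's file:

```shell
# next_after_marker: hypothetical helper; prints the line following each
# "10" or "20" line found between START and END on stdin.
next_after_marker() {
    awk '
        /^START/ { f = 1 }
        /^END/   { f = 0 }
        f && /^[12]0$/ {
            # getline returns 1 on success, 0 at end of input, -1 on error.
            if ((getline line) > 0) print line
        }
    '
}

printf '%s\n' START 10 100 20 2056 30 0.0 END 10 7 | next_after_marker
```

The line consumed by getline is never tested against the START and END patterns, which is one reason the flag-only version above is easier to extend.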

join the contents of files into a new file

I have some text files as shown below. I would like to join the contents of these files into one.
file A
>AXC
145
146
147
>SDF
1
8
67
>FGH
file B
>AXC
>SDF
12
65
>FGH
123
156
190
Desired output
new file
>AXC
145
146
147
>SDF
1
8
67
12
65
>FGH
123
156
190
Your help would be appreciated!
awk '
/^>/ { key=$0; if (!seen[key]++) keys[++numKeys] = key; next }
{ vals[key] = vals[key] ORS $0 }
END{ for (keyNr=1;keyNr<=numKeys;keyNr++) {key = keys[keyNr]; print key vals[key]} }
' fileA fileB
>AXC
145
146
147
>SDF
1
8
67
12
65
>FGH
123
156
190
If you really want the leading white space added to the ">SDF" values from fileA, tell us why that's the case for that one but not ">AXC" so we can code an appropriate solution.
A bit shorter than Ed's answer
awk '/^>/{a=$0;next}{x[a]=x[a]$0"\n"}END{for(i in x)printf"%s\n%s",i,x[i]}'
Blocks will be printed in an unspecified order.
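RS=">" separates records at each > character.
OFS="\n" puts each number on its own line.
a[i]=a[i] $0 appends the record's remaining fields to the array entry indexed by its first field.
rt=RT saves the matched separator (RT is gawk-specific) so that the > character can be prepended to the index.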
$ awk 'BEGIN{ RS=">"; OFS="\n" }
{i=rt $1; $1=""; a[i]=a[i] $0; rt=RT; next}
END { for (i in a) {print i a[i] }}' d6 d5
>SDF
12
65
1
8
67
>FGH
123
156
190
>AXC
145
146
147

awk + Need to print everything (all rest fields) except $1 and $2

I have the following file and I need to print everything except $1 and $2 by awk
File:
INFORMATION DATA 12 33 55 33 66 43
INFORMATION DATA 45 76 44 66 77 33
INFORMATION DATA 77 83 56 77 88 22
...
the desirable output:
12 33 55 33 66 43
45 76 44 66 77 33
77 83 56 77 88 22
...
Well, given your data, cut should be sufficient:
cut -d' ' -f3- infile
Although it adds an extra space at the beginning of each line compared to yael's expected output, here is a shorter and simpler awk based solution than the previously suggested ones:
awk '{$1=$2=""; print}'
or even:
awk '{$1=$2=""}1'
$ cat t
INFORMATION DATA 12 33 55 33 66 43
INFORMATION DATA 45 76 44 66 77 33
INFORMATION DATA 77 83 56 77 88 22
$ awk '{for (i = 3; i <= NF; i++) printf "%s ", $i; print ""}' t
12 33 55 33 66 43
45 76 44 66 77 33
77 83 56 77 88 22
danben's answer leaves trailing whitespace at the end of each line, so a cleaner way to do it would be:
awk '{for (i=3; i<NF; i++) printf "%s ", $i; print $NF}' filename
If the first two words don't change, probably the simplest thing would be:
awk -F 'INFORMATION DATA ' '{print $2}' t
Here's another awk solution, that's more flexible than the cut one and is shorter than the other awk ones. Assuming your separators are single spaces (modify the regex as necessary if they are not):
awk --posix '{sub(/([^ ]* ){2}/, ""); print}'
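The same prefix-stripping idea extends to any number of leading fields. This is a sketch, assuming single-space separators as above; drop_fields is a hypothetical name, and it calls sub() in a loop instead of using an interval expression so that it does not depend on the awk supporting {n}:

```shell
# drop_fields N: hypothetical helper; removes the first N space-separated
# fields from each stdin line and leaves the rest of the line untouched.
drop_fields() {
    awk -v n="$1" '{
        for (i = 0; i < n; i++)
            sub(/^[^ ]* /, "")     # strip one leading field plus its separator
        print
    }'
}

printf '%s\n' 'INFORMATION DATA 12 33 55 33 66 43' | drop_fields 2
# prints: 12 33 55 33 66 43
```

Since $0 is edited as a string and no field is reassigned, the spacing of the remaining fields is preserved and no leading space is added. A line with fewer than N+1 fields keeps its last field, because sub() only removes a field that is followed by a space.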
If Perl is an option:
perl -lane 'splice @F,0,2; print join " ",@F' file
These command-line options are used:
-n loop around every line of the input file, do not automatically print it
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
splice @F,0,2 cleanly removes columns 0 and 1 from the @F array
join " ",@F joins the elements of the @F array, using a space in-between each element
Variation for csv input files:
perl -F, -lane 'splice @F,0,2; print join " ",@F' file
This uses the -F field separator option with a comma