Using awk to select rows with a value in a specific column greater than x

I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines with values between 98 and 98.99... were selected; lines with a value of 99 or more were not.
I would like to extract all lines with a value greater than 98, including 99, 100 and so on.
Here are my code and my input format:
for i in *input.file; do awk '$3 > 98' "$i" > "${i/input./output.}"; done
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Expected Output
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20

Okay: if you have a series of files matching *input.file and you want to select the lines where $3 > 98, writing them to files with the same prefix but ending in output.file instead, you can use:
awk '$3 > 98 {
match(FILENAME, /input\.file$/)
print $0 > (substr(FILENAME, 1, RSTART-1) "output.file")
}' *input.file
This uses match to find the index where input.file begins, then uses substr to take the part of the filename before that index, appending "output.file" to the substring to form the final output filename.
match() sets RSTART to the index at which input.file begins in the current filename, which substr then uses to truncate the filename at that point. See GNU awk String Functions for complete details.
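To see the mechanics in isolation (a standalone sketch with a made-up filename):
$ awk 'BEGIN {
fn = "v1input.file"
match(fn, /input\.file$/)                    # sets RSTART to 3 here
print substr(fn, 1, RSTART-1) "output.file"  # "v1" then "output.file"
}'
v1output.file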
For example, if you had the input files:
$ ls -1 *input.file
v1input.file
v2input.file
Both with your example content:
$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Running the awk command above would result in two output files:
$ ls -1 *output.file
v1output.file
v2output.file
Each containing the records where the third field was greater than 98:
$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20

Related

How to insert column from a file to another file at multiple places

I would like to insert columns 1 and 2 from File2.txt into File1.txt after every second column, through to the last column.
File1.txt (tab-separated; columns range from 1-2400 and cells from 1-4500)
ID IMPACT ID IMPACT ID IMPACT
51 0.288 128 0.4557 156 0.85
625 0.858 15 -0.589 51 0.96
8 0.845 7 0.5891
File2.txt (consists of only two tab-separated columns, with 19000 rows)
ID IMPACT
18 -1
165 -1
41 -1
11 -1
Output file
ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT
51 0.288 18 -1 128 0.4557 18 -1 156 0.85 18 -1
625 0.858 165 -1 15 -0.589 165 -1 51 0.96 165 -1
8 0.845 41 -1 7 0.5891 41 -1 41 -1
11 -1 11 -1 11 -1
I tried the commands below, but they are not working.
paste <(cut -f 1,2 File1.txt) <(cut -f 1,2 File2.txt) <(cut -f 3,4 File1.txt) <(cut -f 1,2 File2.txt)......... > File3
Problem: it starts shifting the File2.txt column values into different columns once the last row of File1.txt is passed.
paste File1.txt File2.txt > File3.txt
awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $3 "\t" $4....}' File3.txt > File4.txt
This does the job; however, it mixes up values of File1.txt from one column to another.
I have tried everything but failed to succeed.
Any help would be appreciated; bash or pandas would be preferred. Thanks in advance.
$ awk '
BEGIN {
FS=OFS="\t" # tab-separated data
}
NR==FNR { # hash fields of file2
a[FNR]=$1 # index with record numbers FNR
b[FNR]=$2
next
}
{ # print file1 records with file2 fields
print $1,$2,a[FNR],b[FNR],$3,$4,a[FNR],b[FNR],$5,$6,a[FNR],b[FNR]
}
END { # in the end
for(i=(FNR+1);(i in a);i++) # deal with extra records of file2
print "","",a[i],b[i],"","",a[i],b[i],"","",a[i],b[i]
}' file2 file1
Output:
ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT
51 0.288 18 -1 128 0.4557 18 -1 156 0.85 18 -1
625 0.858 165 -1 15 -0.589 165 -1 51 0.96 165 -1
8 0.845 41 -1 7 0.5891 41 -1 41 -1
11 -1 11 -1 11 -1
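If the real File1.txt has more ID/IMPACT pairs than the three shown (the question mentions up to 2400 columns), the hard-coded print can be replaced by a loop over the pairs. A sketch along the same lines, taking the pair count from file1's header row:
$ awk '
BEGIN {
FS=OFS="\t" # tab-separated data
}
NR==FNR { # hash fields of file2, as before
a[FNR]=$1
b[FNR]=$2
next
}
FNR==1 { pairs=NF/2 } # number of ID/IMPACT pairs in file1
{ # append the file2 pair after every file1 pair
out=""
for(i=1;i<=pairs;i++)
out=out (i>1?OFS:"") $(2*i-1) OFS $(2*i) OFS a[FNR] OFS b[FNR]
print out
}
END { # extra records of file2, with empty file1 fields
for(r=FNR+1;(r in a);r++) {
out=""
for(i=1;i<=pairs;i++)
out=out (i>1?OFS:"") "" OFS "" OFS a[r] OFS b[r]
print out
}
}' file2 file1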

Awk script displaying incorrect output

I'm facing an issue with an awk script. I need to generate a report containing the lowest, highest, and average score for each assignment in the data file. The name of the assignment is in column 3.
Input data is:
Student,Category,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Chelsey,Homework,H02,89,100
Chelsey,Homework,H03,77,100
Chelsey,Homework,H04,80,100
Chelsey,Homework,H05,82,100
Chelsey,Homework,H06,84,100
Chelsey,Homework,H07,86,100
Chelsey,Lab,L01,91,100
Chelsey,Lab,L02,100,100
Chelsey,Lab,L03,100,100
Chelsey,Lab,L04,100,100
Chelsey,Lab,L05,96,100
Chelsey,Lab,L06,80,100
Chelsey,Lab,L07,81,100
Chelsey,Quiz,Q01,100,100
Chelsey,Quiz,Q02,100,100
Chelsey,Quiz,Q03,98,100
Chelsey,Quiz,Q04,93,100
Chelsey,Quiz,Q05,99,100
Chelsey,Quiz,Q06,88,100
Chelsey,Quiz,Q07,100,100
Chelsey,Final,FINAL,82,100
Chelsey,Survey,WS,5,5
Sam,Homework,H01,19,100
Sam,Homework,H02,82,100
Sam,Homework,H03,95,100
Sam,Homework,H04,46,100
Sam,Homework,H05,82,100
Sam,Homework,H06,97,100
Sam,Homework,H07,52,100
Sam,Lab,L01,41,100
Sam,Lab,L02,85,100
Sam,Lab,L03,99,100
Sam,Lab,L04,99,100
Sam,Lab,L05,0,100
Sam,Lab,L06,0,100
Sam,Lab,L07,0,100
Sam,Quiz,Q01,91,100
Sam,Quiz,Q02,85,100
Sam,Quiz,Q03,33,100
Sam,Quiz,Q04,64,100
Sam,Quiz,Q05,54,100
Sam,Quiz,Q06,95,100
Sam,Quiz,Q07,68,100
Sam,Final,FINAL,58,100
Sam,Survey,WS,5,5
Andrew,Homework,H01,25,100
Andrew,Homework,H02,47,100
Andrew,Homework,H03,85,100
Andrew,Homework,H04,65,100
Andrew,Homework,H05,54,100
Andrew,Homework,H06,58,100
Andrew,Homework,H07,52,100
Andrew,Lab,L01,87,100
Andrew,Lab,L02,45,100
Andrew,Lab,L03,92,100
Andrew,Lab,L04,48,100
Andrew,Lab,L05,42,100
Andrew,Lab,L06,99,100
Andrew,Lab,L07,86,100
Andrew,Quiz,Q01,25,100
Andrew,Quiz,Q02,84,100
Andrew,Quiz,Q03,59,100
Andrew,Quiz,Q04,93,100
Andrew,Quiz,Q05,85,100
Andrew,Quiz,Q06,94,100
Andrew,Quiz,Q07,58,100
Andrew,Final,FINAL,99,100
Andrew,Survey,WS,5,5
Ava,Homework,H01,55,100
Ava,Homework,H02,95,100
Ava,Homework,H03,84,100
Ava,Homework,H04,74,100
Ava,Homework,H05,95,100
Ava,Homework,H06,84,100
Ava,Homework,H07,55,100
Ava,Lab,L01,66,100
Ava,Lab,L02,77,100
Ava,Lab,L03,88,100
Ava,Lab,L04,99,100
Ava,Lab,L05,55,100
Ava,Lab,L06,66,100
Ava,Lab,L07,77,100
Ava,Quiz,Q01,88,100
Ava,Quiz,Q02,99,100
Ava,Quiz,Q03,44,100
Ava,Quiz,Q04,55,100
Ava,Quiz,Q05,66,100
Ava,Quiz,Q06,77,100
Ava,Quiz,Q07,88,100
Ava,Final,FINAL,99,100
Ava,Survey,WS,5,5
Shane,Homework,H01,50,100
Shane,Homework,H02,60,100
Shane,Homework,H03,70,100
Shane,Homework,H04,60,100
Shane,Homework,H05,70,100
Shane,Homework,H06,80,100
Shane,Homework,H07,90,100
Shane,Lab,L01,90,100
Shane,Lab,L02,0,100
Shane,Lab,L03,100,100
Shane,Lab,L04,50,100
Shane,Lab,L05,40,100
Shane,Lab,L06,60,100
Shane,Lab,L07,80,100
Shane,Quiz,Q01,70,100
Shane,Quiz,Q02,90,100
Shane,Quiz,Q03,100,100
Shane,Quiz,Q04,100,100
Shane,Quiz,Q05,80,100
Shane,Quiz,Q06,80,100
Shane,Quiz,Q07,80,100
Shane,Final,FINAL,90,100
Shane,Survey,WS,5,5
awk script:
BEGIN {
FS=" *\\, *"
}
FNR>1 {
min[$3]=(!($3 in min) || min[$3]> $4 )? $4 : min[$3]
max[$3]=(max[$3]> $4)? max[$3] : $4
cnt[$3]++
sum[$3]+=$4
}
END {
print "Name\tLow\tHigh\tAverage"
for (i in cnt)
printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
Expected sample output:
Name Low High Average
Q06 77 95 86.80
L05 40 96 46.60
WS 5 5 5
Q07 58 100 78.80
L06 60 99 61
L07 77 86 64.80
When I run the script, I get a "Low" of 0 for all assignments, which is not correct. Where am I going wrong? Please guide.
You can certainly do this with awk, but since you tagged this scripting as well, I'm assuming other tools are an option. For this sort of gathering of statistics on groups present in the data, GNU datamash often reduces the job to a simple one-liner. For example:
$ (echo Name,Low,High,Average; datamash --header-in -s -t, -g3 min 4 max 4 mean 4 < input.csv) | tr , '\t'
Name Low High Average
FINAL 58 99 85.6
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2
H04 46 80 65
H05 54 95 76.6
H06 58 97 80.6
H07 52 90 67
L01 41 91 75
L02 0 100 61.4
L03 88 100 95.8
L04 48 100 79.2
L05 0 96 46.6
L06 0 99 61
L07 0 86 64.8
Q01 25 100 74.8
Q02 84 100 91.6
Q03 33 100 66.8
Q04 55 100 81
Q05 54 99 76.8
Q06 77 95 86.8
Q07 58 100 78.8
WS 5 5 5
This says: for each group of rows sharing a value in the 3rd column (-g3, with -s to sort the input first, which the tool requires) of simple CSV input (-t,) with a header line (--header-in), display the minimum, maximum, and mean of the 4th column. The result is given a new header and piped through tr to turn the commas into tabs.
Your code works as-is with GNU awk. However, running it with the -t option to warn about non-portable constructs gives:
awk: foo.awk:6: warning: old awk does not support the keyword `in' except after `for'
awk: foo.awk:2: warning: old awk does not support regexps as value of `FS'
And running the script with a different implementation of awk (mawk in my case) does give 0's for the Low column. So, some tweaks to the script:
BEGIN {
FS=","
}
FNR>1 {
min[$3]=(cnt[$3] == 0 || min[$3]> $4 )? $4 : min[$3]
max[$3]=(max[$3]> $4)? max[$3] : $4
cnt[$3]++
sum[$3]+=$4
}
END {
print "Name\tLow\tHigh\tAverage"
PROCINFO["sorted_in"] = "#ind_str_asc" # gawk-ism for pretty output; ignored on other awks
for (i in cnt)
printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
and it works as expected on that other awk too.
The changes:
Using a plain comma as the field separator instead of a regex.
Changing the min conditional to set min[$3] to the current value the first time an assignment is seen, by checking whether cnt[$3] is 0 (which it is on the first occurrence, since that counter is only incremented on a later line), or whenever the stored minimum is greater than the current value. The original !($3 in min) test can fail on implementations that create the min[$3] element for the left-hand side of the assignment before evaluating the right-hand side, so the key already appears to exist; see the sketch below.
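The difference is visible in a one-liner (a sketch; which branch runs depends on the implementation):
$ echo x | awk '{ a[1] = (1 in a) ? "key existed" : "first time"; print a[1] }'
first time
gawk prints "first time" because it evaluates the right-hand side before creating a[1]; an awk that creates the array element first prints "key existed", which is how min[$3] can end up initialized from an empty (zero) value.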
Another, similar approach:
$ awk -F, 'NR==1 {print "name","low","high","average"; next}
{k=$3; sum[k]+=$4; count[k]++}
!(k in min) {min[k]=max[k]=$4}
min[k]>$4 {min[k]=$4}
max[k]<$4 {max[k]=$4}
END {for(k in min) print k,min[k],max[k],sum[k]/count[k]}' file |
column -t
name low high average
Q06 77 95 86.8
L05 0 96 46.6
WS 5 5 5
Q07 58 100 78.8
L06 0 99 61
L07 0 86 64.8
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2

How to loop awk command over row values

I would like to use awk to search for a particular word in the first column of a table and print the value in the 6th column. I understand how to do this searching one word at a time, using something along the lines of:
awk '$1 == "<insert-word>" { print $6 }' file.txt
But I was wondering if it is possible to loop this over a list of words in a row?
For example If I had a table like file1.txt below:
cat file1.txt
dna1 dna4 dna5
dna3 dna6 dna2
dna7 dna8 dna9
Could I loop over each value in row 1 and search for this word in column 1 of file2.txt below, each time printing the value of column 6? Then do this for row 2, 3 and so on...
cat file2
dna1 0 229 7 0 4 0 0
dna2 0 296 39 2 1 3 100
dna3 0 255 15 0 6 0 0
dna4 0 209 3 0 0 0 0
dna5 0 253 14 2 3 7 100
dna6 0 897 629 7 8 1 100
dna7 0 214 4 0 9 0 0
dna8 0 255 15 0 2 0 0
dna9 0 606 338 8 3 1 100
So, for example, looping the awk command over row 1 of file1 would return the numbers 4, 0 and 3.
Looping the command over row 2 would return the numbers 6, 8 and 1.
And finally, looping over row 3 would return the numbers 9, 2 and 3.
An example output might be
4 0 3
6 8 1
9 2 3
What I would really like to do is sum the numbers returned for each row. I just wasn't sure if this would be possible...
An example output of this would be
7
15
14
But I am not worried if this step isn't possible using awk as I could just do it separately
Hope this makes sense
Cheers
Ollie
Yes, you can give awk multiple input files. For your example:
awk 'NR==FNR{a[$1]=a[$2]=a[$3]=1;next}a[$1]{print $6}' file1 file2
I didn't test the above one-liner, but it should work. At least you get the idea.
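Traced by hand against the data above, it should print the 6th column of every file2 line whose first field appears somewhere in file1, in file2's own line order:
4
1
6
0
3
8
9
2
3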
If you don't know how many columns there are in your file1, you can do a loop, as you said:
awk 'NR==FNR{for(x=1;x<=NF;x++)a[$x]=1;next}a[$1]{print $6}' file1 file2
Update
Edit for the new requirement (summing each row):
awk 'NR==FNR{a[$1]=$6;next}{for(i=1;i<=NF;i++)s+=a[$i];print s;s=0}' f2 f1
The output of the above one-liner (taking f1 and f2 as your example file1 and file2):
7
15
14
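If you also want the intermediate per-row output shown in the question (4 0 3 and so on), the same lookup can print every value instead of summing them. A sketch along the same lines:
$ awk 'NR==FNR{a[$1]=$6;next}{for(i=1;i<=NF;i++)printf "%s%s", a[$i], (i<NF?OFS:ORS)}' f2 f1
4 0 3
6 8 1
9 2 3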

extract specific lines based on another file

I have a folder containing text files. I need to extract specific lines from the files in this folder based on another file, input.txt. I tried the following code, but it doesn't work.
awk '
NR==FNR{
if(NF>1)f=$3
else A[f,$1]
next
}
(FILENAME,$3) in A {
print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *
file1
PHE .0 10 .0 0
ASP 59.8 11 59.8 0
LEU 66.8 15 66.8 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
THR 33.2 50 33.2 0
SER 10.2 51 .9 9.3
input.txt
*file1
10
16
*file2
43
44
49
Desired output
file1
PHE 0 10 0 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
On line 3, $3 needs to be changed to $2: since the asterisk is the field separator for input.txt, the (empty, non-existent) string before it is counted as $1 and the file name that comes after it as $2.
awk '
NR==FNR{
if(NF>1)f=$2
else A[f,$1]
next
}
(FILENAME,$3) in A {
print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *
The rest of the script is unchanged. Note that awk processes the command-line assignments in order between file names, so FS is * while input.txt is read and reverts to the default whitespace splitting for the files that follow.

merge similar files in awk

I have the following files
File A
Kmax Event File - Text Format
1 6 1000
1 4143 9256 13645 16426 20490
49 4144 8820 14751 16529 20505
45 4146 8308 12303 16912 22715
75 4139 9049 14408 16447 20480
23 4137 8449 13223 16511 20498
22 4142 8795 14955 16615 20493
File B
Kmax Event File - Text Format
1 6 1000
42 4143 9203 13401 16475 20480
44 4140 8251 12302 16932 21872
849 6283 8455 12301 16415 20673
18 4148 8238 12757 16597 20484
19 4144 8268 12306 17110 21868
50 4134 8331 12663 16501 20606
988 5682 8296 12306 16577 20592
61 4147 8330 12307 16945 22497
0 4138 8333 12310 16871 22749
File C, File D, ... and all the other files have exactly the same format. The file names follow the pattern run, run%1, run%2, run%3, run%4, etc. The number of files can reach up to 30, i.e. run%30.
What I'd like to do is to merge the files in the following way
Kmax Event File - Text Format
1 6 1000
1 4143 9256 13645 16426 20490
49 4144 8820 14751 16529 20505
45 4146 8308 12303 16912 22715
75 4139 9049 14408 16447 20480
23 4137 8449 13223 16511 20498
22 4142 8795 14955 16615 20493
42 4143 9203 13401 16475 20480
44 4140 8251 12302 16932 21872
849 6283 8455 12301 16415 20673
18 4148 8238 12757 16597 20484
19 4144 8268 12306 17110 21868
50 4134 8331 12663 16501 20606
988 5682 8296 12306 16577 20592
61 4147 8330 12307 16945 22497
0 4138 8333 12310 16871 22749
I believe I can do it using
awk '{i=$1;sub(i,x);A[i]=A[i]$0} FILENAME==ARGV[ARGC-1]{print i A[i]}'
but that way the first two lines of the second file will also be present. Besides, I don't know whether the above line will work at all. The problem is that I need to merge many files at the same time. Any ideas on how to merge these almost identical files?
Using grouping braces in the shell:
{ cat run; sed -s '1,2d' run%*; } > c
Note the -s option (GNU sed), which treats each run%* file separately; without it, sed numbers the concatenated input as one continuous stream and would only strip the header of the first file.
You can use cat and tail:
cat A > C && tail -n +3 B >> C
This will merge files A and B into a new file named C.
Using awk:
awk 'FNR==NR{print; next} FNR>2' A B > C
If you have more than one file to merge into one, you can list them after A B in the awk version, e.g. A B D. In the awk version, C is the output file containing the merged data.
In the cat and tail version, you can repeat the tail part of the code for the other files, e.g.
cat A > C && tail -n +3 B >> C && tail -n +3 D >> C
or create a loop to iterate over the files, as sketched below.
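A minimal sketch of such a loop, using the run/run%* names from the question:
cat run > C                # first file, header lines included
for f in run%*; do
  tail -n +3 "$f" >> C     # append each remaining file without its first two lines
done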
Print all lines from the first file (NR==FNR) and only lines 3 onward from the rest of the files (FNR>2):
awk 'NR==FNR||FNR>2' run*
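One caveat shared by the glob-based versions: run%* expands in lexical order, so run%10 comes before run%2. If the relative order of the data rows matters, list the files explicitly instead, e.g. with brace expansion (a sketch; it assumes the files run%1 through run%30 all exist, since awk errors out on missing files):
awk 'NR==FNR||FNR>2' run run%{1..30}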