I have a script which works, but how do I make it more "elegant"? - awk

Some background. I have two files (A and B) which contain data I need to extract.
For file A, I only need the last two lines which look like this:
RMM: 17 -0.221674395053E+01 0.59892E-04 0.00000E+00 31 0.259E-03
1 F= -.22167440E+01 E0= -.22167440E+01 d E =-.398708E-10 mag= 2.0000
I need to extract the following numbers:
- 1st line, 2nd field (17)
- 1st line, 4th field (0.59892E-04)
- 2nd line, 1st field (1)
- 2nd line, 3rd field (-.22167440E+01)
- 2nd line, 5th field (-.22167440E+01)
- 2nd line, 8th field (-.398708E-10)
- 2nd line, 10th field (2.0000)
For file B, I only need the last 11 lines which look like this:
Total CPU time used (sec): 0.364
User time (sec): 0.355
System time (sec): 0.009
Elapsed time (sec): 1.423
Maximum memory used (kb): 9896.
Average memory used (kb): 0.
Minor page faults: 2761
Major page faults: 4
Voluntary context switches: 24
I need to extract the following numbers:
- 1st line, 6th field (0.364)
- 2nd line, 4th field (0.355)
- 3rd line, 4th field (0.009)
- 4th line, 4th field (1.423)
- 6th line, 5th field (9896.)
- 7th line, 5th field (0.)
My output should be like this:
mainfolder1[tab/space]subfolder1[tab/space][all the extracted info separated by tab]
mainfolder2[tab/space]subfolder2[tab/space][all the extracted info separated by tab]
mainfolder3[tab/space]subfolder3[tab/space][all the extracted info separated by tab]
...
mainfoldern[tab/space]subfoldern[tab/space][all the extracted info separated by tab]
Now this is my script:
for m in ./*/; do
main=$(basename "$m")
for s in "$m"*/; do
sub=$(basename "$s")
vdata=$(tail -n2 "./$main/$sub/A" | awk -F'[ =]+' 'NR==1{a=$2;b=$4;next}{print a,b,$2,$4,$6,$9,$11}')
ctime=$(tail -n11 "./$main/$sub/B" | head -n1 | awk '{print $6}')
utime=$(tail -n10 "./$main/$sub/B" | head -n1 | awk '{print $4}')
stime=$(tail -n9 "./$main/$sub/B" | head -n1 | awk '{print $4}')
etime=$(tail -n8 "./$main/$sub/B" | head -n1 | awk '{print $4}')
maxmem=$(tail -n6 "./$main/$sub/B" | head -n1 | awk '{print $5}')
avemem=$(tail -n5 "./$main/$sub/B" | head -n1 | awk '{print $5}')
c=$(echo "$sub" | cut -c 2-)
echo "$m $c $vdata $ctime $utime $stime $etime $maxmem $avemem"
done
done > output
Now, the fourth line, the vdata part, was actually a "recycled" line from a previous forum question. I do not fully understand it. I want my file-B code to be as elegant as that awk code for file A. How do I do it? Thank you! :)

awk 'NR==1{print $6} NR==2{print $4} NR==3{print $4} ...'
You could simplify a bit with:
NR==2 || NR==3 || NR==4
but that seems hard to maintain. Or you could use an array:
awk 'BEGIN{a[1]=6;a[2]=4...} NR in a{ print $a[NR]}'
But I think you really just want:
awk '{print $NF}' ORS=\\t
(You don't want the 6th field from row 1. You want the last field.)
Rather than trying to collect the output into variables just to be echoed, add ORS=\\t to get tab-separated output, and just let it print to the stdout of the script.
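For completeness, here is the array version spelled out with the field numbers from your file-B spec (lines 1, 2, 3, 4, 6, 7 of the last 11 carry fields 6, 4, 4, 4, 5, 5); a sketch, not tested against a real file:
tail -n11 B | awk 'BEGIN{f[1]=6; f[2]=4; f[3]=4; f[4]=4; f[6]=5; f[7]=5}
NR in f {printf "%s\t", $(f[NR])}   # print only the wanted field of the wanted lines
END {print ""}'                     # finish with a newline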

For file B try something like:
tail -n11 B | awk -F':' '{ print $2 }'
If you need to retain the values and then echo them, you could do something like:
array=($(tail -n11 B | awk -F':' '{ print $2 }'))
for value in "${array[@]}"
do
echo $value
done
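Or, to emit the retained values tab-separated in one go, replace the loop with the following (bash's printf reuses its format string for each remaining argument):
printf '%s\t' "${array[@]}"   # print every element followed by a tab
printf '\n'                   # end the record with a newline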

You should look into find and xargs, since every time you write a loop in shell just to manipulate text you likely have the wrong approach (a sketch of the find variant follows after the script below). BUT to keep it simple and retain your original structure, it sounds like you could use something like:
for m in ./*/; do
main=$(basename "$m")
for s in "$m"*/; do
sub=$(basename "$s")
fileA="${main}/${sub}/A"
fileB="${main}/${sub}/B"
awk -v sizeA="$(wc -l < "$fileA")" -v sizeB="$(wc -l < "$fileB")" '
NR==FNR {
if ( FNR == (sizeA-1) ) { split($0,p) }
if ( FNR == sizeA ) { split($0,a) }
next
}
{ b[sizeB + 1 - FNR] = $NF }
END {
sub(/^=/, "", a[8])   # the "d E =" label in file A leaves a leading "=" on this value; strip it
split(FILENAME,f,"/")
print f[1], f[2], p[2], p[4], a[1], a[3], a[5], a[8], a[10], b[11], b[10], b[9], b[8], b[6], b[5]
}
' "$fileA" "$fileB"
done
done > output
Note that the above opens each "B" file only once instead of six times.
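As for the find remark above, here is a minimal sketch of how the directory walk could be replaced (assuming GNU find, whose -printf '%P\n' prints each path relative to the starting point; the awk invocation itself would be unchanged):
find . -mindepth 2 -maxdepth 2 -type d -printf '%P\n' |
while IFS=/ read -r main sub; do      # each line is "main/sub"; split it on "/"
fileA="$main/$sub/A"
fileB="$main/$sub/B"
# ... same awk call as above ...
done > output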

Related

Countif like function in AWK with field headers

I am looking for a way to count the number of times a value in one field appears in a CSV file, much the same as COUNTIF in Excel, although I would like to use an awk command if possible.
Column 6 holds the values, and column 7 should hold the number of times each value appears in column 6, as per below.
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What I want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
# adds the header field "count" that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this from a previous/similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
a very similar question to this one
a similar Python-related Q, for my reference
I would harness GNU AWK for this task in the following way. Let file.txt content be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done with the 1st line. Then, during the first pass, i.e. where the global line number (NR) equals the line number within the file (FNR), I count the occurrences of the values in the 6th field, storing the counts in the array arr, and instruct GNU AWK to go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from the array arr.
(tested in GNU Awk 5.0.1)
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6], since it's the sixth field that has the key you're counting. Also note the END loop must skip i=1: since only lines with NR>1 are stored, l[1] is never set, and printing it would emit a stray "," line:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}'
You can also output the header with the new field name:
awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}'
One awk idea:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiters as comma
{ lines[NR]=$0
if (NR==1) next
col6[NR]=$6 # copy field 6 so we do not have to parse the contents of lines[] in the END block
cnt[$6]++
}
END { for (i=1;i<=NR;i++)
print lines[i], (i==1 ? "count" : cnt[col6[i]] )
}
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

Sum specific column value until a certain value is reached

I want to print lines until the first column's running sum reaches a certain value, like:
43 12.00 53888
29 10.00 36507
14 9.00 18365
8 8.00 10244
1 7.00 2079
1 9.50 1633
0 6.00 760
I would like the output to be:
val = 90
43 12.00 53888
29 10.00 36507
14 9.00 18365
Could you please try the following, written and tested with the shown samples. It explicitly puts exit in the condition where the 1st column's running sum exceeds the mentioned value, to avoid unnecessarily reading the rest of the Input_file.
awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val{print}{prev+=$1}' Input_file
OR
awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val; {prev+=$1}' Input_file
Explanation: a detailed explanation of the above.
awk -v val="90" ' ##Start the awk program, setting variable val to 90 here.
($1+prev)>val{ ##If the sum of the first field and prev is greater than val, then:
exit ##exit the program to save some time.
}
($1+prev)<=val; ##If the sum of $1 and prev is less than or equal to val, print the current line.
{
prev+=$1 ##Keep adding the 1st field to the prev variable here.
}
' Input_file ##Mention the Input_file name here.
Perl to the rescue!
perl -sape ' $s += $F[0] ; exit if $s > $vv' -- -vv=90 file
-s enables setting variables from the command line; -vv=90 sets the $vv variable to 90
-p processes the input line by line and prints each line after processing
-a splits each line on whitespace and populates the @F array
The variable $s holds the running sum. Each line is printed only while the sum does not exceed $vv; once the sum is too large, the program exits.
Consider this small one-line awk.
Revised (Sep 2020): modified to take Bruno's comments into account, going for a readable solution; see kvantour's answer for a compact one.
awk -v val=85 '{ s+= $1 ; if ( s > val ) exit ; print }'
Original post (Aug 2020):
awk -v val=85 '{ s += $1 ; if ( s <= val ) print }'
Or even
awk -v val=85 '{ s+= $1 } s <= val'
Consider an even smaller awk which is very much in line with the solution of dash-o
awk -v v=90 '((v-=$1)<0){exit}1' file
or the smallest:
awk -v v=90 '0<=(v-=$1)' file
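For reference, either compact form reproduces the requested output on the sample data:
$ awk -v v=90 '0<=(v-=$1)' file
43 12.00 53888
29 10.00 36507
14 9.00 18365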

Filter logs with awk for last 100 lines

I can filter the last 500 lines using tail and grep:
tail --lines 500 my_log | grep "ERROR"
What is the equivalent command using awk?
How can I add a number-of-lines limit to the command below?
awk '/ERROR/' my_log
awk doesn't know where the end of a file is until it reaches it, but you can read the file twice: the first pass to find the end, the second to treat the lines that are in scope. You could also keep the last X lines in a buffer, but that is a bit heavy on memory consumption and processing. Notice that the file needs to be mentioned twice at the end for this:
awk 'FNR==NR{LL=NR-500;next} FNR>LL && /ERROR/{print FNR":"$0}' my_log my_log
With explanation:
awk '# first reading
FNR==NR{
# the last 500 lines start after this line number
LL=NR-500
# go to the next line (of this file)
next
}
# on the second read (the block above only applies to the first pass)
# if the line number is past LL AND ERROR is in the line content, print it
FNR > LL && /ERROR/ { print FNR ":" $0 }
' my_log my_log
With GNU sed:
sed -n '$-500,$ {/ERROR/ p}' my_log
As you had no sample data to test with, I'll demonstrate with just numbers, using seq 1 10. This one stores the last n records and prints them out at the end:
$ seq 1 10 |
awk -v n=3 '{a[++c]=$0;delete a[c-n]}END{for(i=c-n+1;i<=c;i++)print a[i]}'
8
9
10
If you want to filter the data, add for example /ERROR/ before {a[++c]=$0; .... A spelled-out version follows after the explanation below.
Explained:
awk -v n=3 '{ # set wanted amount of records
a[++c]=$0 # store the record in array a
delete a[c-n] # delete the ones outside of the window
}
END { # in the end
for(i=c-n+1;i<=c;i++) # in order
print a[i] # output records
}'
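And spelled out with the /ERROR/ filter added, as suggested above (note the semantics: this keeps the last n matching lines, which is subtly different from matching within the last n lines as tail | grep does):
awk -v n=100 '/ERROR/{
a[++c]=$0 # buffer only the matching records
delete a[c-n] # keep a sliding window of n matches
}
END {
for(i=c-n+1;i<=c;i++) # in order
if (i in a) # guard against fewer than n matches
print a[i]
}' my_log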
Could you please try the following.
tac Input_file | awk 'FNR<=100 && /ERROR/' | tac
In case you want to print line numbers in the awk command, then try the following.
awk '/ERROR/{print FNR,$0}' Input_file

awk: print each column of a file into separate files

I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use
for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done
But I am getting errors
awk: illegal field $(), name "i"
input record number 1, file input.txt
source line number 1
How to correctly use $i inside the {print }?
The following single awk may help you here:
awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i >> ("file" i); close("file" i)}}' Input_file
An all-awk solution. First, the test data:
$ cat foo
11 12 13
21 22 23
Then the awk:
$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo
and results:
$ ls data*
data2 data3
$ cat data2
11 12
21 22
The for iterates from 2 to the last field. If there are more fields than you desire to process, change the NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, you'd need to put the print into a block and add a close call:
$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
If you want to fix your original approach:
for i in {2..99}; do
awk -v x="$i" '{print $1" " $x }' input.txt > "data${i}"
done
Note:
the -v switch of awk passes shell variables into awk
$x is the column whose number is held in the awk variable x
Note 2: this is not the fastest solution (a single awk call is fastest), but I just tried to correct your logic. Ideally, take the time to understand awk; it's never wasted time.

take out specific columns from multiple files

I have multiple files that look like the one below. They are tab-separated. For all the files I would like to take out column 1 and the column that starts with XF:Z:. This should give me output 1 below.
The file names are htseqoutput*.sam.sam, where * varies. I am not sure which awk command to use, or whether the for loop is correct.
for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done
input example
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 16 chr22 39715068 24 51M * 0 0 GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T31G0 YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16 chr19 4724687 40 33M * 0 0 AGGCGAATGTGATAACCGCTACACTAAGGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:26C6 YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16 chr13 45492060 23 21M * 0 0 GGTTCCCACACCGGGAGTCGA IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C20 YT:Z:UU XF:Z:tRNA
output 1:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Seems like you could just use sed for this:
sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file
This captures the part at the start of the line and the alphanumeric part following XF:Z: and outputs them, separated by a tab character. One potential advantage of this approach is that it will work independently of the position of the XF:Z: string.
Your loop looks OK (you can use this sed command in place of the awk part), but be careful with your quotes: straight double quotes (") should be used, not the curly “ and ” characters.
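With that fixed, the corrected loop might look like:
for f in htseqoutput*.sam.sam
do
sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' "$f" > "out${f#htseqoutput}"
done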
Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:
awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file
This optionally adds the XF:Z: part to the field separator, so that it is removed from the start of the last field.
You can try this, if the column with "XF:Z:" is always at the end:
awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam
you get,
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
or, if this column is at a variable position in each file:
awk 'BEGIN{OFS="\t"}
FNR==1{
for(i=1;i<=NF;i++){
if($i ~ /^XF:Z:/) break
}
}
{n=split($i,a,":"); print $1, a[n]}' file.sam