I have a script that works, but how do I make it more "elegant"? - awk
Some background. I have two files (A and B) which contain data I need to extract.
For file A, I only need the last two lines which look like this:
RMM: 17 -0.221674395053E+01 0.59892E-04 0.00000E+00 31 0.259E-03
   1 F= -.22167440E+01 E0= -.22167440E+01 d E =-.398708E-10 mag= 2.0000
I need to extract the following numbers:
-1st Line, 2nd field (17)
-1st Line 4th field (0.59892E-04)
-2nd Line, 1st field (1)
-2nd Line, 3rd field (-.22167440E+01)
-2nd Line, 5th field (-.22167440E+01)
-2nd Line, 8th field (-.398708E-10)
-2nd Line, 10th field (2.0000)
For file B, I only need the last 11 lines which look like this:
Total CPU time used (sec): 0.364
User time (sec): 0.355
System time (sec): 0.009
Elapsed time (sec): 1.423
Maximum memory used (kb): 9896.
Average memory used (kb): 0.
Minor page faults: 2761
Major page faults: 4
Voluntary context switches: 24
I need to extract the following numbers:
-1st line, 6th field (0.364)
-2nd line, 4th field (0.355)
-3rd line, 4th field (0.009)
-4th line, 4th field (1.423)
-6th line, 5th field (9896.)
-7th line, 5th field (0.)
My output should be like this:
mainfolder1[tab/space]subfolder1[tab/space][all the extracted info separated by tab]
mainfolder2[tab/space]subfolder2[tab/space][all the extracted info separated by tab]
mainfolder3[tab/space]subfolder3[tab/space][all the extracted info separated by tab]
...
mainfoldern[tab/space]subfoldern[tab/space][all the extracted info separated by tab]
Now this is my script code:
for m in ./*/; do
    main=$(basename "$m")
    for s in "$m"*/; do
        sub=$(basename "$s")
        vdata=$(tail -n2 ./$main/$sub/A | awk -F'[ =]+' NR==1'{a=$2;b=$4;next}{print s,a,$2,$4,$6,$9, $11}')
        ctime=$(tail -n11 ./$main/$sub/B |head -n1|awk '{print $6}')
        utime=$(tail -n10 ./$main/$sub/B |head -n1|awk '{print $4}')
        stime=$(tail -n9 ./$main/$sub/B |head -n1|awk '{print $4}')
        etime=$(tail -n8 ./$main/$sub/B |head -n1|awk '{print $4}')
        maxmem=$(tail -n6 ./$main/$sub/B |head -n1|awk '{print $5}')
        avemem=$(tail -n5 ./$main/$sub/B |head -n1|awk '{print $5}')
        c=$(echo $sub| cut -c 2-)
        echo "$m $c $vdata $ctime $utime $stime $etime $maxmem $avemem"
    done
done > output
Now, the vdata line was actually "recycled" from a previous forum question, and I do not fully understand it. I would like my file B code to be as elegant as that awk code for file A. How do I do it? Thank you! :)
awk 'NR==1{print $6} NR==2{print $4} NR==3{print $4} ...'
You could simplify a bit with:
NR==2 || NR==3 || NR==4
but that seems hard to maintain. Or you could use an array:
awk 'BEGIN{a[1]=6;a[2]=4...} NR in a{ print $a[NR]}'
But I think you really just want:
awk '{print $NF}' ORS=\\t
(You don't want the 6th field from row 1. You want the last field.)
Rather than trying to collect the output into variables just to be echoed, add ORS=\\t to get tab separated output, and just let it print to stdout of the script.
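For what it's worth, the question's vdata line can be decoded with a small standalone test. The two sample lines from file A are reproduced below; note that the second line apparently begins with leading spaces in the real file, which is why the `[ =]+` separator shifts its field numbers up by one (a regex FS, unlike the default, produces an empty first field before leading whitespace). The stray `s` in the original print is an unset variable; the sketch below prints the captured `b` instead, which is presumably what was intended:

```shell
# Rebuild the last two lines of file A (sample data from the question;
# the leading spaces on the second line are an assumption, deduced from
# the field numbers used in the original script)
printf 'RMM: 17 -0.221674395053E+01 0.59892E-04 0.00000E+00 31 0.259E-03\n'  > A_tail
printf '   1 F= -.22167440E+01 E0= -.22167440E+01 d E =-.398708E-10 mag= 2.0000\n' >> A_tail

# Fields are separated by runs of spaces and/or '=' characters.
# Line 1: a = 17, b = 0.59892E-04; line 2: fields shifted by the empty $1.
awk -F'[ =]+' 'NR==1{a=$2;b=$4;next}{print a,b,$2,$4,$6,$9,$11}' OFS='\t' A_tail
```

This prints exactly the seven numbers the question lists, tab-separated, on one line.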
For file B try something like:
tail -n11 B | awk -F':' '{ print $2 }'
If you need to retain the values and then echo them, you could do something like:
array=($(tail -n11 B | awk -F':' '{ print $2 }'))
for value in "${array[@]}"
do
    echo "$value"
done
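A quick way to try that array technique without the real files: rebuild the shown tail of file B by hand (sample data from the question; `B_tail` is a made-up file name) and collect everything after the colons:

```shell
# Stand-in for the tail of file B, using the 9 lines shown in the question
cat > B_tail <<'EOF'
Total CPU time used (sec): 0.364
User time (sec): 0.355
System time (sec): 0.009
Elapsed time (sec): 1.423
Maximum memory used (kb): 9896.
Average memory used (kb): 0.
Minor page faults: 2761
Major page faults: 4
Voluntary context switches: 24
EOF

# Everything after the first colon of each line, word-split into a bash array
array=($(awk -F':' '{print $2}' B_tail))

# In this 9-line sample the six wanted values come first
echo "${array[@]:0:6}"
```

The leading space that `-F':'` leaves in `$2` disappears during the array's word splitting, so the values come out clean.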
You should look into find and xargs, since a shell loop written just to manipulate text is usually the wrong approach. But to keep it simple and retain your original structure, it sounds like you could use something like:
for m in ./*/; do
    main=$(basename "$m")
    for s in "$m"*/; do
        sub=$(basename "$s")
        fileA="${main}/${sub}/A"
        fileB="${main}/${sub}/B"
        awk -v sizeA="$(wc -l < "$fileA")" -v sizeB="$(wc -l < "$fileB")" '
            NR==FNR {
                if ( FNR == (sizeA-1) ) { split($0,p) }
                if ( FNR == sizeA )     { split($0,a) }
                next
            }
            { b[sizeB + 1 - FNR] = $NF }
            END {
                split(FILENAME,f,"/")
                print f[1], f[2], p[2], p[4], a[1], a[3], a[5], a[8], a[10], b[11], b[10], b[9], b[8], b[6], b[5]
            }
        ' "$fileA" "$fileB"
    done
done > output
Note that the above only opens each "B" file 1 time instead of 6.
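The two-file idiom above (NR==FNR for the first file, reverse-indexed b[] for the second) can be exercised in isolation. A minimal sketch with toy stand-ins for A and B; all file names and values here are made up:

```shell
# Toy stand-ins: 2 lines play the role of file A, 4 lines the role of file B
printf 'p1 p2 p3 p4\na1 a2 a3\n' > toyA
seq 101 104 > toyB

awk -v sizeA="$(wc -l < toyA)" -v sizeB="$(wc -l < toyB)" '
NR==FNR {                              # first file (toyA) only
    if (FNR == sizeA-1) split($0, p)   # next-to-last line into p[]
    if (FNR == sizeA)   split($0, a)   # last line into a[]
    next
}
{ b[sizeB + 1 - FNR] = $NF }           # second file: index last fields from the end
END { print p[2], a[1], b[4], b[1] }   # b[4] = first line of toyB, b[1] = its last
' toyA toyB
```

This prints `p2 a1 101 104`: one pass per file, values addressed relative to each file's end, just as the real script needs.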
Related
Countif like function in AWK with field headers
I am looking for a way of counting the number of times a value in a field appears in a range of fields in a csv file, much the same as COUNTIF in Excel, although I would like to use an awk command if possible. Column 6 holds the values, and column 7 should get the number of times each value appears in column 6, as per below.

>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ

>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ

What I want is:

f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

This adds the field name count that I want:

awk -F, -v OFS=, 'NR==1{ print $0, "count"} NR>1{ print $0}' file3

How do I get the output I want? I have tried this from a previous/similar question but no joy:

>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,

(very similar question to this one; similar python related Q, for my ref)
I would harness GNU AWK for this task in the following way. Let file.txt content be

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ

then

awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt

gives output

f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done with the header. During the first pass, i.e. where the global line number (NR) equals the line number within the file (FNR), I count the occurrences of values in the 6th field, storing them in array arr, then instruct GNU AWK to go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from array arr. (Tested in GNU Awk 5.0.1.)
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6], since it is the sixth field that has the key you're counting:

awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}' file3

(note the loop starts at i=2, since l[1], the header line, is never stored). You can also output the header with the new field name:

awk -F, -v OFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}' file3
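Put together, the buffered single-pass approach can be checked end to end against the sample data from the question (the END loop here starts at 2 so the unstored header slot is skipped):

```shell
# Sample csv from the question
cat > file3 <<'EOF'
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
EOF

awk -F, -v OFS=, '
NR==1 { print $0, "count" }       # header gets the new column name
NR>1  { c[$6]++; l[NR]=$0 }       # count each key, remember the rows
END   { for (i=1; i++ < NR; ) {   # i runs 2..NR: l[1] (header) was never stored
            split(l[i], s, ",")
            print l[i], c[s[6]]   # row plus the count of its field-6 value
        } }' file3
```

The output matches the desired result exactly, including count 2 on both ASDQ rows.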
One awk idea:

awk '
BEGIN { FS=OFS="," }      # define input/output field delimiters as comma
{ lines[NR]=$0
  if (NR==1) next
  col6[NR]=$6             # copy field 6 so we do not have to parse the contents of lines[] in the END block
  cnt[$6]++
}
END { for (i=1;i<=NR;i++)
          print lines[i], (i==1 ? "count" : cnt[col6[i]])
}
' file3

This generates:

f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Sum specific column value until a certain value is reached
I want to print lines while the running sum of the first column stays under a certain value. Given:

43 12.00 53888
29 10.00 36507
14 9.00 18365
8 8.00 10244
1 7.00 2079
1 9.50 1633
0 6.00 760

with val = 90, I would like the output to be:

43 12.00 53888
29 10.00 36507
14 9.00 18365
Could you please try the following, written and tested with the shown samples. It explicitly exits once the running sum of the 1st column would exceed the given value, to avoid needlessly reading the rest of the Input_file.

awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val{print} {prev+=$1}' Input_file

OR

awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val; {prev+=$1}' Input_file

Explanation (detailed version of the above):

awk -v val="90" '   ##Start the awk program, setting variable val to 90.
($1+prev)>val{      ##If the sum of the first field and prev is greater than val:
  exit              ##exit the program to save some time.
}
($1+prev)<=val;     ##If the sum of $1 and prev is less than or equal to val, print the current line.
{
  prev+=$1          ##Keep adding the 1st field to prev.
}
' Input_file        ##Mention the Input_file name here.
Perl to the rescue!

perl -sape '$s += $F[0]; exit if $s > $vv' -- -vv=90 file

-s enables setting variables from the command line; -vv=90 sets the $vv variable to 90
-p processes the input line by line and prints each line after processing
-a splits each line on whitespace and populates the @F array

The variable $s holds the running sum. Each line is printed only while the sum stays at most $vv; once the sum is too large, the program exits (before the implicit print).
Consider a small one-line awk.

Revised (Sep 2020): modified to take Bruno's comments into account, going for a readable solution; see kvantour's answer for a compact one.

awk -v val=85 '{ s += $1; if (s > val) exit; print }'

Original post (Aug 2020):

awk -v val=85 '{ s += $1; if (s <= val) print }'

Or even:

awk -v val=85 '{ s += $1 } s <= val'
Consider an even smaller awk, very much in line with dash-o's solution:

awk -v v=90 '((v-=$1)<0){exit}1' file

or the smallest:

awk -v v=90 '0<=(v-=$1)' file
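The compact form treats v as a remaining budget: each line's first field is subtracted, and the line prints only while the budget is still non-negative. Checked directly against the question's sample data:

```shell
# Sample data from the question
cat > Input_file <<'EOF'
43 12.00 53888
29 10.00 36507
14 9.00 18365
8 8.00 10244
1 7.00 2079
1 9.50 1633
0 6.00 760
EOF

# Subtract each first field from the budget; print while it stays non-negative
awk -v v=90 '0<=(v-=$1)' Input_file
```

This prints the first three lines (running sums 43, 72, 86), and nothing from the fourth line on, since 86 + 8 blows the budget.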
Filter logs with awk for last 100 lines
I can filter the last 500 lines using tail and grep:

tail --lines 500 my_log | grep "ERROR"

What is the equivalent command using awk? How can I add a number-of-lines limit to the command below?

awk '/ERROR/' my_log
awk doesn't know where the end of a file is until it moves on to the next input file (or input ends), but you can read the file twice: the first pass to find the end, the second to treat the lines that are in scope. You could also keep the last X lines in a buffer, but that is a bit heavy on memory consumption and processing. Notice that the file needs to be mentioned twice at the end for this:

awk 'FNR==NR{LL=NR-500;next};FNR>=LL && /ERROR/{ print FNR":"$0}' my_log my_log

With explanation:

awk '# first reading
FNR==NR{
   # the wanted window starts 500 lines before the last
   LL=NR-500
   # go to next line (for this file)
   next
}
# at the second read (due to the previous section filtering),
# if the line number is at or after LL AND ERROR is in the line content, print it
FNR >= LL && /ERROR/ {
   print FNR ":" $0
}
' my_log my_log

GNU sed has no end-relative address such as $-500, but you can get the same effect by reversing the file with tac:

tac my_log | sed -n '1,500{/ERROR/p}' | tac
As you had no sample data to test with, I'll demonstrate with just numbers, using seq 1 10. This one stores the last n records and prints them out at the end:

$ seq 1 10 | awk -v n=3 '{a[++c]=$0;delete a[c-n]}END{for(i=c-n+1;i<=c;i++)print a[i]}'
8
9
10

If you want to filter the data, add for example /ERROR/ before {a[++c]=$0; ....

Explained:

awk -v n=3 '{       # set wanted amount of records
    a[++c]=$0       # store the record in a
    delete a[c-n]   # delete the ones outside of the window
}
END {               # in the end
    for(i=c-n+1;i<=c;i++)  # in order
        print a[i]         # output records
}'
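One caveat about placement: putting /ERROR/ in front of the storing block keeps the last n ERROR lines, whereas the question's tail | grep keeps the ERROR lines among the last n lines. To match the latter, filter inside the END loop instead. A sketch with a made-up log:

```shell
# Build a toy log (made-up data): ERROR on every 5th line, INFO elsewhere
seq 1 20 | awk '{print (NR%5 ? "INFO" : "ERROR") " line " NR}' > my_log

# Roll a window of the last n lines by count, then filter for ERROR at the end,
# which is equivalent to: tail -n 10 my_log | grep ERROR
awk -v n=10 '{a[++c]=$0; delete a[c-n]}
END{for(i=c-n+1;i<=c;i++) if(a[i] ~ /ERROR/) print a[i]}' my_log
```

With the window covering lines 11-20, only the ERROR lines 15 and 20 survive.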
Could you please try the following:

tac Input_file | awk 'FNR<=100 && /ERROR/' | tac

In case you want to print the line number of each match with awk, try:

awk '/ERROR/{print FNR,$0}' Input_file
awk: print each column of a file into separate files
I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use

for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done

But I am getting errors:

awk: illegal field $(), name "i"
 input record number 1, file input.txt
 source line number 1

How do I correctly use $i inside {print}?
The following single awk may help you here:

awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i > ("file"i);close("file"i)}}' Input_file
An all awk solution. First, test data:

$ cat foo
11 12 13
21 22 23

Then the awk:

$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo

and the results:

$ ls data*
data2 data3
$ cat data2
11 12
21 22

The for iterates from 2 to the last field. If there are more fields than you desire to process, change NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, you'd need to put the print into a block and add a close call:

$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
If you want to fix your own attempt:

for i in {2..99}; do
    awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done

Note the -v switch of awk, used to pass shell variables in; $x is then the n-th column, as defined by your variable x. Note 2: this is not the fastest solution (a single awk call is fastest), but I just tried to correct your logic. Ideally, take the time to understand awk; it's never time wasted.
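The -v fix can be seen on a small stand-in for input.txt (toy 3-column data, looping over columns 2-3 instead of 2-99):

```shell
# Toy stand-in for input.txt
printf '11 12 13\n21 22 23\n' > input.txt

# -v copies the shell loop variable into awk's x, so $x is a real field reference
for i in 2 3; do
    awk -v x="$i" '{print $1, $x}' input.txt > "data${i}"
done

cat data3
```

Each dataN file gets column 1 paired with column N; data3 here contains `11 13` and `21 23`.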
take out specific columns from multiple files
I have multiple files that look like the one below. They are tab-separated. For all the files, I would like to take out column 1 and the column that starts with XF:Z:. This will give me output 1. The file names are htseqoutput*.sam.sam, where * varies. I am not sure about the awk function to use, and whether the for-loop is correct.

for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done

input example:

AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 16 chr22 39715068 24 51M * 0 0 GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T31G0 YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16 chr19 4724687 40 33M * 0 0 AGGCGAATGTGATAACCGCTACACTAAGGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:26C6 YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16 chr13 45492060 23 21M * 0 0 GGTTCCCACACCGGGAGTCGA IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C20 YT:Z:UU XF:Z:tRNA

output 1:

AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Seems like you could just use sed for this:

sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file

This captures the part at the start of the line and the alphanumeric part following XF:Z:, and outputs them separated by a tab character. One potential advantage of this approach is that it works independently of the position of the XF:Z: string.

Your loop looks OK (you can use this sed command in place of the awk part), but be careful with your quotes: use straight quotes ", not the curly “ and ” characters.

Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:

awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file

This optionally includes the XF:Z: part in the field separator, so that it is stripped from the start of the last field.
You can try, if the column with "XF:Z:" is always at the end:

awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam

you get:

AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA

or, if this column is at a variable position in each file:

awk 'BEGIN{OFS="\t"}
FNR==1{
    for(i=1;i<=NF;i++){
        if($i ~ /^XF:Z:/) break
    }
}
{n=split($i,a,":"); print $1, a[n]}' file.sam
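The optional-tag field separator idea can be checked with one toy SAM-like line (the record below is invented and much shorter than the real ones, keeping only a couple of tags):

```shell
# One toy SAM-like record (made-up, tags truncated relative to the real files)
printf 'SEQ_11\t16\tchr22\t39715068\tYT:Z:UU\tXF:Z:SNORD43\n' > toy.sam

# Whitespace optionally followed by "XF:Z:" separates fields, so the tag
# prefix is consumed as part of the separator before the last field
awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' toy.sam
```

awk's leftmost-longest separator matching is what makes this work: at the tab before the tag, the longer match "tab plus XF:Z:" wins, leaving just SNORD43 as the last field.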