Set number of lines to variable in awk - awk

I am experimenting with an awk script (an independent file).
I want it to process a text file which looks like this:
value1: n
value2: n
value3: n
value4: n
value5: n
value6: n
value7: n
value1: n
:
The text file contains many of these blocks, with 7 values in each. I want the awk script to print some of these values (the name of the value and "n") to a new file or to the command line. I thought I'd process it with a while loop driven by a variable holding the total number of lines. But I just can't get the total number of lines in the file into a variable. It seems I have to process every line and do something with it until the end of the file to get the total. I'd like to have the total in a variable first, and then loop until that total is reached.
Do you have an idea?

Where $1 is the input parameter to your script (i.e. myscript textfile.txt):
count="`wc -l $1 | cut -d' ' -f1`"
echo "Number of lines in $1 is $count"
Then do your awk command utilising $count as your line count
Edit: courtesy of fedorqui
count="`wc -l <$1`"
echo "Number of lines in $1 is $count"
Edit 2: (forgive my awk command it's not something that I use much)
count="`wc -l </etc/fstab`"
echo "Number of lines in /etc/fstab is $count"
awk '{print $0,"\t","\tLine ",NR," of ","'$count'";}' /etc/fstab
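An alternative (not in the original answer) that avoids breaking out of the single quotes is to pass the shell variable in with awk's -v option; a sketch using a throwaway file:

```shell
# Pass the line count into awk via -v instead of splicing "'$count'" into the program
printf 'alpha\nbeta\ngamma\n' > /tmp/demo.txt
count=$(wc -l < /tmp/demo.txt)
awk -v count="$count" '{ print $0 "\tLine " NR " of " count }' /tmp/demo.txt
```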

Either do two passes over the file:
awk 'NR==FNR { lines = NR; next }
{ ... this happens on the second pass, use lines as you wish ... }' file file
or read the lines into an array and process it in END:
awk '{ a[NR] = $0 }
END { lines = NR; for(i=1; i<=lines; ++i) { line = a[i]; ... } }' file
The first consumes I/O, the second memory.
In more detail,
awk 'NR==FNR { count++; next }
{ print "Item " FNR " of " count ": " $0 }' file file
or similarly
awk '{ a[NR] = $0; }
END { for (i=1; i<=NR; ++i) print "Item " i " of " NR ": " a[i] }' file
If you need the line count outside of Awk, you will need to print it and capture it from your script. If you have an Awk script which performs something useful and produces the count as a side effect, print the line count to standard output, and take care that all other output is directed somewhere else. Perhaps like this:
lines=$(awk '{ sum += $1 }
END { print "Average: " sum/NR >"average.txt"; print NR }' file)
# Back in Bash
echo "Total of $lines processed, output in average.txt"
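A runnable version of that capture pattern, with invented file names and throwaway data:

```shell
printf '10\n20\n30\n' > /tmp/nums.txt
# stdout carries the line count back to the shell; the average goes to a file
lines=$(awk '{ sum += $1 }
             END { print "Average: " sum/NR > "/tmp/average.txt"; print NR }' /tmp/nums.txt)
echo "Total of $lines lines processed, output in /tmp/average.txt"
cat /tmp/average.txt
```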

Find/replace within a line only if line does not contain a certain string (awk)

I'm trying to reproduce an awk command using different syntax. I have a file (test.txt) that looks like this:
>NAME_123_CONSENSUS
GACTATACA
ATACTAGA
>NAME2_48_TEST
ATAGCGA
and I'm hoping to replace all occurrences of "A" with "1" using different awk syntax. I can solve this using the following line:
awk '!/_/{gsub("A", "1"); 1' test.txt
However, I cannot get the same result using a for loop,
awk '{for(j=1; j<=NF; j++) if ($j ~ "_") print; else print gsub("A","1")}' test.txt
nor using the following input
awk '{ if ($0 ~ "_") print $0; else print gsub("A", "1"); }' test.txt
Both of these last commands give the following output. Why are they giving different output and what am I missing to make both of the last two commands give the desired output?
>NAME_123_CONSENSUS
4
4
5
>NAME2_48_TEST
3
You are using the gsub() function incorrectly here. The sub()/gsub() functions return the number of substitutions made, not the modified string. They modify the string passed as the last argument (by default $0) in place, so you print that string afterwards:
awk '{ for(j=1; j<=NF; j++) if ($j ~ "_") print; else { gsub("A","1",$0); print } }'
That said, your first command is the most efficient/terse way of writing this. Notice you were missing a } in the OP; it should have been written as
awk '!/_/{ gsub("A", "1") }1'
Or use gensub(), available in GNU Awk, which returns the modified string that you can then print. See more about it under String Functions in the GNU Awk manual:
awk '{ for(j=1; j<=NF; j++) if ($j ~ "_") print; else print gensub(/A/, "1", "g") }'
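To see the difference between gsub()'s return value and its side effect on the target string (a standalone illustration, not the question's data):

```shell
# gsub() returns the number of substitutions; the modified text lands in $0
echo 'BANANA' | awk '{ n = gsub("A", "1"); print "count=" n, "line=" $0 }'
# → count=3 line=B1N1N1
```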

How to merge lines using awk command so that there should be specific fields in a line

I want to merge some rows in a file so that each line contains 22 fields separated by ~.
Input file looks like this.
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~
UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~
~76~23~2523~Y~UN~~223~223~~~~A~123
and so on.
The first line looks fine. The 2nd and 3rd lines need to be merged into one line with 22 fields; the 4th, 5th and 6th lines likewise, and so on.
Expected output:
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The file has 10 GB of data, but the code I wrote (using a while loop) takes too much time to execute. How can I solve this problem using an awk/sed command?
Code Used:
IFS=$'\n'
set -f
while read line
do
count_tild=`echo $line | grep -o '~' | wc -l`
if [ $count_tild == 21 ]
then
echo $line
else
checkLine
fi
done < file.txt
function checkLine
{
current_line=$line
read line1
next_line=$line1
new_line=`echo "$current_line$next_line"`
count_tild_mod=`echo $new_line | grep -o '~' | wc -l`
if [ $count_tild_mod == 21 ]
then
echo "$new_line"
else
line=$new_line
checkLine
fi
}
Using only the shell for this is slow, error-prone, and frustrating. Try Awk instead.
awk -F '~' 'NF==1 { next } # Hack; see below
NF<22 {
for(i=1; i<=NF; i++) f[++a]=$i }
a==22 {
for(i=1; i<=a; ++i) printf "%s%s", f[i], (i==22 ? "\n" : "~")
a=0 }
NF==22
END {
if(a) for(i=1; i<=a; i++) printf "%s%s", f[i], (i==a ? "\n" : "~") }' file.txt>file.new
This assumes that consecutive lines with too few fields will always add up to exactly 22 when you merge them. You might want to check this assumption (or perhaps accept this answer and ask a new question with more and better details). Or maybe just add something like
a>22 {
print FILENAME ":" FNR ": Too many fields " a >"/dev/stderr"
exit 1 }
The NF==1 block is a hack to bypass the weirdness of the completely empty line 5 in your sample.
Your attempt contained multiple errors and inefficiencies; for a start, try http://shellcheck.net/ to diagnose many of them.
$ cat tst.awk
BEGIN { FS="~" }
{
sub(/^[0-9]+\./,"")
gsub(/[[:space:]]+/,"")
$0 = prev $0
if ( NF == 22 ) {
print ++cnt "." $0
prev = ""
}
else {
prev = $0
}
}
$ awk -f tst.awk file
1.200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2.2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
3.209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The assumption above is that you never have more than 22 fields on 1 line nor do you exceed 22 in any concatenation of the contiguous lines that are each less than 22 fields, just like you show in your sample input.
You can try this awk
awk '
BEGIN {
FS=OFS="~"
}
{
while(NF<22) {
if(NF==0)
break
a=$0
getline
$0=a$0
}
if(NF!=0)
print
}
' infile
or this sed
sed -E '
:A
s/((.*~){21})([^~]*)/\1\3/
tB
N
bA
:B
s/\n//g
' infile
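The merge logic is easier to sanity-check on a toy input; here is the getline variant scaled down from 22 to 4 fields (sample data invented, split after a ~ just like the question's file):

```shell
printf 'a~b~c~d\ne~f~\ng~h\n' |
awk 'BEGIN { FS = OFS = "~" }
     {
       # keep appending lines until the record reaches 4 fields
       while (NF < 4 && (getline nxt) > 0)
         $0 = $0 nxt
       print
     }'
```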

Analysing two files using awk with if condition

I have two files. First contains names, numbers and days for all samples
sam_name.csv
Number,Day,Sample
171386,0,38_171386_D0_2-1.raw
171386,0,38_171386_D0_2-2.raw
171386,2,30_171386_D2_1-1.raw
171386,2,30_171386_D2_1-2.raw
171386,-1,40_171386_D-1_1-1.raw
171386,-1,40_171386_D-1_1-2.raw
The second includes information about batches (last column)
sam_batch.csv
Number,Day,Quar,Code,M.F,Status,Batch
171386,0,1,x,F,C,1
171386,1,1,x,F,C,2
171386,2,1,x,F,C,5
171386,-1,1,x,F,C,6
I would like to get the information about batches (matching on two conditions, number and day) and add it to the first file. I have used an awk command to do that, but I am getting results only for one time point (-1).
Here is my command:
awk -F"," 'NR==FNR{number[$1]=$1;day[$1]=$2;batch[$1]=$7; next}{if($1==number[$1] && $2==day[$1]){print $0 "," number[$1] "," day[$1] "," batch[$1]}}' sam_batch.csv sam_nam.csv
Here are my results: (a file sam_name, number and day from file sam_batch (just to check if a condition is working) and batch number (a value which I need)
Number,Day,Sample,Number,Day, Batch
171386,-1,40_171386_D-1_1-1.raw,171386,-1,6
171386,-1,40_171386_D-1_1-2.raw,171386,-1,6
175618,-1,08_175618_D-1_1-1.raw,175618,-1,2
Here I corrected your AWK code:
awk -F"," 'NR==FNR{
number_day = $1 FS $2
batch[number_day]=$7
next
}
{
number_day = $1 FS $2
print $0 "," batch[number_day]
}' sam_batch.csv sam_name.csv
Output:
Number,Day,Sample,Batch
171386,0,38_171386_D0_2-1.raw,1
171386,0,38_171386_D0_2-2.raw,1
171386,2,30_171386_D2_1-1.raw,5
171386,2,30_171386_D2_1-2.raw,5
171386,-1,40_171386_D-1_1-1.raw,6
171386,-1,40_171386_D-1_1-2.raw,6
(No need for double-checking if you understand how the script works.)
Here's another AWK solution (my original answer):
awk -v "b=sam_batch.csv" 'BEGIN {
FS=OFS=","
while(( getline line < b) > 0) {
n = split(line,a)
nd = a[1] FS a[2]
nd2b[nd] = a[n]
}
}
{ print $1,$2,$3,nd2b[$1 FS $2] }' sam_name.csv
Both solutions parse file sam_batch.csv first to build a dictionary mapping (number, day) -> batch. Then they parse sam_name.csv, printing the first three fields together with the "Batch" value looked up from the other file.
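The composite (number, day) key both variants build can be exercised on a tiny pair of made-up files:

```shell
cat > /tmp/batch_demo.csv <<'EOF'
Number,Day,Batch
1,0,A
1,2,B
EOF
cat > /tmp/name_demo.csv <<'EOF'
Number,Day,Sample
1,0,s1
1,2,s2
EOF
# key on $1 FS $2 so the same Number on different Days maps to different Batches
awk -F, 'NR==FNR { b[$1 FS $2] = $3; next }
         { print $0 "," b[$1 FS $2] }' /tmp/batch_demo.csv /tmp/name_demo.csv
```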

awk 1-liner to cover match and non-matching case

Is there a way in an awk one-liner to cover both the positive and negative match case with different print statements?
To illustrate. Let's say I have a file where I want to prepend a set of words with '#' but still want to print all the words in the file.
Something like :
awk '/red/||/green/ { print "# My mod : " $0 } else { print $0 }'
Of course the above won't work. But what's the simple way to do this in awk?
Cheers,
Gert
To cover the general case i.e. printing something completely different, you can use next:
awk '/red|green/ { print "foo"; next } { print "bar" }' file
The second block is only reached if the first pattern is false.
You can also use if/else within a single action block.
For your specific case, where you are just adding to the record and printing, you can use this:
awk '/red|green/ { $0 = "# My mod : " $0 } 1' file
The 1 at the end ensures that every record is printed.
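On a throwaway two-line input, the modify-then-1 form behaves like this:

```shell
printf 'red\nblue\n' |
awk '/red|green/ { $0 = "# My mod : " $0 } 1'
# → # My mod : red
#   blue
```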
How about "painful" long hand
awk '{if (/red/||/green/ ) { print "# My mod : " $0 } else { print $0 }}'
That works for me ;-), but you can save a few chars typing with
awk '{if (/red|green/ ) {print "# My mod : " $0 } else {print $0} }'
OR
awk '{print ($0~/red|green/ ? "# My mod : " $0 : $0)}'
Which is about the shortest amount of code I can think of to achieve the result you are looking for.
awk '{ print (/red|green/ ? "# My mod : " : "") $0 }' file
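Checked against a throwaway two-line input (made-up data), the ternary form prefixes only the matching line:

```shell
printf 'red fox\nblue sky\n' |
awk '{ print (/red|green/ ? "# My mod : " : "") $0 }'
# → # My mod : red fox
#   blue sky
```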

awk output format for average

I am computing the average of many values and printing it with awk using the following script.
for j in `ls *.txt`; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
echo -n $j $i" "; cat $j | grep $i | awk '{ sum+=$2} END {print sum/NR}'
done;
echo ""
done
but the problem is, it prints the value as 1.2345e+05, which I do not want; I want it to print the values as plain rounded numbers. But I am unable to find where to pass the output format.
EDIT: using {print "average,%3d = ",sum/NR} in place of {print sum/NR} does not help, because it prints "average,%3d 1.2345e+05".
You need printf instead of simply print. Print is a much simpler routine than printf is.
for j in *.txt; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
awk -v "i=$i" -v "j=$j" '$0 ~ i {sum += $2; n++} END {printf "%s %s average %6d\n", j, i, sum/n}' "$j"
done
echo
done
You don't need ls - a glob will do.
Useless use of cat.
Quote all variables when they are expanded.
It's not necessary to use echo - AWK can do the job.
It's not necessary to use grep - AWK can do the job.
If you're getting numbers like 1.2345e+05 then %6d might be a better format string than %3d. Use printf in order to use format strings - print doesn't support them.
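The e-notation comes from print, which formats non-integral numbers with OFMT (default %.6g); printf with an explicit format does not. A minimal illustration:

```shell
# print uses OFMT (%.6g) for non-integral values; printf "%d" prints a plain integer
awk 'BEGIN { x = 1234567.8; print x; printf "%d\n", x }'
# → 1.23457e+06
#   1234567
```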
The following all-AWK script might do what you're looking for and be quite a bit faster. Without seeing your input data I've made a few assumptions, primarily that the command name being matched is in column 1.
awk '
BEGIN {
cmdstring = "emptyloop dd cp sleep10 gpid forkbomb gzip bzip2";
n = split(cmdstring, cmdarray);
for (i = 1; i <= n; i++) {
cmds[cmdarray[i]]
}
}
$1 in cmds {
sums[$1, FILENAME] += $2;
counts[$1, FILENAME]++
files[FILENAME]
}
END {
for (file in files) {
for (cmd in cmds) {
if ((cmd, file) in counts)
printf "%s %s %6d\n", file, cmd, sums[cmd, file]/counts[cmd, file]
}
}
}' *.txt
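A scaled-down run of the same sums/counts scheme on invented data shows the per-command averaging (output piped through sort, since awk's in-array iteration order is unspecified):

```shell
cat > /tmp/avg_demo.txt <<'EOF'
dd 10
dd 20
cp 5
EOF
awk 'BEGIN { cmds["dd"]; cmds["cp"] }
     $1 in cmds { sums[$1] += $2; counts[$1]++ }
     END { for (c in counts) printf "%s %d\n", c, sums[c]/counts[c] }' /tmp/avg_demo.txt |
sort
# → cp 5
#   dd 15
```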