Comparison on the number from an awk result is not correct - awk

I use awk to get the number of fields for multiple files, and then use if [[ ]] to check whether the number of fields equals an exact number; if so, I print the file name.
The code is as follows:
for file in /root/TB_MOVIL_CDR/incorrect_files/*
do
    num=$(awk -F '|' '{print NF}' $file)
    if [[ $num -eq 24 ]]; then
        echo $file
    fi
done
But I found the result is not correct, and I am confused.
Is the syntax of if [[ $num -eq 24 ]] wrong?
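The [[ ]] syntax itself is fine. The real problem is that awk prints NF once for every line, so $num holds one number per line of the file, and comparing that multi-line string with -eq fails. As a minimal sketch, assuming every line of a matching file has the same field count, you could test just the first line:
for file in /root/TB_MOVIL_CDR/incorrect_files/*
do
    # NR==1 restricts awk to the first line, so $num is a single number
    num=$(awk -F '|' 'NR==1{print NF}' "$file")
    if [[ $num -eq 24 ]]; then
        echo "$file"
    fi
done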

There is no need for a shell loop when you are already using awk; try this:
awk -F'|' 'NF==24 {print FILENAME}' /root/TB_MOVIL_CDR/incorrect_files/*
This tests whether the number of fields is 24 on each line of each file, and when it finds such a line it prints the file name. NB: if a file has several lines with 24 fields, the file name is printed multiple times. This can be avoided if needed.
You can use this GNU awk to print the file name only once per matching file (GNU due to the ENDFILE pattern):
awk -F'|' 'NF==24 {f=1} ENDFILE {if (f) print FILENAME;f=0}' /root/TB_MOVIL_CDR/incorrect_files/*
A shorter GNU awk that prints the file name once per matching file:
awk -F'|' 'NF==24 {print FILENAME;nextfile}' /root/TB_MOVIL_CDR/incorrect_files/*
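For awks without ENDFILE or nextfile, a portable sketch of the same idea resets a per-file flag on each file's first line and prints the name at most once:
awk -F'|' 'FNR==1{f=0} NF==24 && !f {print FILENAME; f=1}' /root/TB_MOVIL_CDR/incorrect_files/*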

Related

Removing content of a column based on number of occurrences

I have a file (;-separated) with data like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 value that has fewer than 3 occurrences, so the outcome would be like:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only that $3 got deleted, as it had only a single occurrence.
Sadly I am not really sure whether (and thus how) this could be done relatively easily (doing the =COUNT.IF matching and manual deletion in Excel feels quite embarrassing).
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
This awk one-liner can help; it processes the file twice:
awk -F';' 'NR==FNR{a[$3]++;next} a[$3]<3{NF--} 7' file file
The trailing 7 is just a true condition (like the more common 1) that triggers the default action of printing the record; decrementing NF drops the last field, which works in gawk and other modern awks that rebuild the record when NF changes.
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b; do
    if [[ "$a" -lt "3" ]]; then
        sed -i "s/;$b$//" b.txt
    fi
done <<<"$(cut -d";" -f3 b.txt | sort | uniq -c)"
The loop is driven by the output of cut | sort | uniq -c, which counts the occurrences of each $3 value:
$ cut -d";" -f3 b.txt | sort | uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file twice to awk. On the first run you gather a statistic that you use in the second run:
script.awk
FNR == NR { stats[$3]++
            next
          }
{ if (stats[$3] < 3) print $1 FS $2
  else print
}
Run it like this: awk -F\; -f script.awk yourfile yourfile.
The condition FNR == NR is true during processing of the first filename given to awk. The next statement skips the second block.
Thus the second block is only used when processing the second filename given to awk (which here is the same as the first).
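A quick way to see the mechanics is to print both counters in a hypothetical run over the same file twice; NR keeps growing across files while FNR resets to 1 for each file, so FNR == NR only holds while the first file is read:
awk '{print FILENAME, NR, FNR}' yourfile yourfile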

awk: printing the second-to-last record of a file

I have a file set up like
Words on
many line
%
More Words
on many lines
%
Even More Words
on many lines
%
and I would like to output the second-to-last record of this file, where records are delimited by % after each block of text.
I have used:
awk -v RS=\% ' END{ print NR }' $f
to find the number of records (1136). Then I did
awk -v RS=\% '{ print $(NR-1) }' $f
and
awk -v RS=\% '{ print $(NR=1135) }' $f
Neither of these worked; instead, they displayed a record towards the beginning of the file and many blank lines.
OUTPUT:
"You know, of course, that the Tasmanians, who never committed adultery, are
now extinct."
-- M. Somerset Maugham
"The
is
what
that
This output had many, many more blank lines and contained a record from near the middle of the file.
awk -v RS=\% 'END{ print $(NR-1) }' $f
returns a blank line. The same command with different $(NR-x) values also returns a blank line.
Can someone help me to print the second-to-last record in this case?
Thanks
Note that $(NR-1) is a field reference in awk (field number NR-1 of the current record), not a record reference, which is why your attempts printed stray fields and blank lines. To keep the previous record around, you can do:
awk '{this=last;last=$0} END{print this}' file
Or, if you don't mind having the entire file in memory:
awk '{a[NR]=$0} END{print a[NR-1]}' file
Or, if it is just line count (or record count) based, you can keep a rolling deletion going so you are not too piggish on memory:
$ seq 999999 | tail -2
999998
999999
$ seq 999999 | awk '{a[NR]=$0; delete a[NR-3]} END{print a[NR-1]}'
999998
If they are blocks of text, the same method works as long as you can separate the blocks into delimited records.
Given:
$ echo "$txt"
Words on
many line
%
More Words
on many lines
%
Even More Words
on many lines
%
You can do:
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0} END{print a[NR-1]}'
Even More Words
on many lines
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0} END{print a[NR-2]}'
More Words
on many lines
If you do not want to print the leading and trailing \n, you can do:
$ echo "$txt" | awk 'BEGIN{RS="%\n"} {a[NR]=$0} END{printf a[NR-2]}'
Words on
many line
Finally, if you know the specific record you want to print, do it this way in awk:
$ seq 999999 | awk -v mrk=1135 'NR==mrk{print; exit}'
1135
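Applied to the %-delimited file from the question (which had 1136 records), the same pattern would print the second-to-last record directly; this is a sketch assuming you already know the record count:
awk -v RS=\% -v mrk=1135 'NR==mrk{print; exit}' $f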
If you want a random record, you can do:
$ awk -v min=1 -v max=1135 'BEGIN{srand()
RS="%\n"
tgt=int(min+rand()*(max-min+1))
}
NR==tgt{print; exit}' file
Does the solution have to be with awk? Just using head and tail would be simpler.
tail -2 file.txt | head -1 > justthatline.txt
(Note that this selects the second-to-last line, not the second-to-last %-delimited record.)
The best way for this would be to use the BEGIN construct to set the separators, keeping the previous record as you go:
awk 'BEGIN{RS=ORS="%\n"} {prev=cur; cur=$0} END{print prev}' file
RS and ORS set the input and output record separators respectively, so the second-to-last record is printed with its % terminator.

How to print the file name in an awk loop

How can I print the file name in the loop? I want to print the file name and the average value of column 4 on the same line:
for i in `ls *cov`
do
    awk '{sum +=$4;n++}END{print sum/n}' $i
done
I mean I want something like:
awk '{sum +=$4;n++}END{print $i\t sum/n}' $i
You can use bash variables in an awk script using the -v flag:
awk -v file="$i" '{sum+=$4; n++} END{print file "\t" sum/n}' $i
But there is also the built-in awk variable FILENAME:
awk '{sum+=$4; n++} END{print FILENAME "\t" sum/n}' $i
Which is much cleaner since you aren't passing around variables.
Lose the loop (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice) and just use:
awk -v OFS='\t' '{sum+=$4} ENDFILE{print FILENAME, (FNR>0 ? sum/FNR : 0); sum=0}' *cov
The above uses GNU awk for ENDFILE; there are simple tweaks for other awks (see the sketch after this list), but the important things are:
A surrounding shell loop is neither required nor desirable.
The variable n isn't needed since awk has builtin variables.
You have to protect yourself from divide by zero on empty files.
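One such tweak, as a sketch for POSIX awks without ENDFILE (assuming the files are non-empty), flushes the previous file's average whenever a new file starts, and once more in END:
awk -v OFS='\t' '
    FNR==1 && NR>1 { print fname, sum/cnt; sum=cnt=0 }  # flush the previous file
    { fname=FILENAME; sum+=$4; cnt++ }
    END { if (cnt) print fname, sum/cnt }               # flush the last file
' *cov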

Awk to find duplicates across columns

I have a bunch of DNS entries in a file:
a1.us.company.com ------ DO NOT PRINT
a2.us.us.company.com ------ PRINT ("us" is repeated)
a3.eu.a3.compamy.com ------ PRINT ("a3" is repeated)
a4.tx.a4.tx.company.com ------ PRINT ("a4" and "tx" are repeated)
awk 'BEGIN {FS="."; OFS="."} {if ($2==$3) print $1"."$2"."$NF}' device_list
awk 'BEGIN {FS="."; OFS="."} {if ($1==$3) print $1"."$2"."$NF}' device_list
I am using the 2 commands above. Can someone please give me an awk command that lists the rows with duplicated fields?
Some of the names are crazy, with as many as 7 or 8 .-separated fields.
$ cat file
a1.us.company.com
a2.us.us.company.com
a3.eu.a3.compamy.com
a4.tx.a4.tx.company.com
$ awk -F'.' '{delete seen; for (i=1;i<=NF;i++) if (seen[$i]++) {print; next} }' file
a2.us.us.company.com
a3.eu.a3.compamy.com
a4.tx.a4.tx.company.com
Note that using delete seen to clear a whole array is GNU-awk specific; with other awks you can clear it by doing split("",seen).
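That portable variant would look like this (the same logic, just clearing the array with split()):
awk -F'.' '{split("",seen); for (i=1;i<=NF;i++) if (seen[$i]++) {print; next} }' file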
$ awk -F. '{for(i=1;i<=NF;i++)if(x[$i]++){print;delete x;next}}' file
a2.us.us.company.com
a3.eu.a3.compamy.com
a4.tx.a4.tx.company.com
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

awk to read specific column from a file

I have a small problem and I would appreciate helping me in it.
In summary, I have a file:
1,5,6,7,8,9
2,3,8,5,35,3
2,46,76,98,9
I need to read specific columns from it and print them into another text document. I know I can use awk '{print $2, $3}' to print the second and third columns beside each other. However, I need to use two separate statements, awk '{print $2}' >> file.text and then awk '{print $3}' >> file.text, but then the two columns appear under each other and not beside each other.
How can I make them appear beside each other?
If you must extract the columns in separate processes, use paste to stitch them together. I assume your shell is bash/zsh/ksh, since the following uses process substitutions.
paste -d, <(awk -F, '{print $2}' file) <(awk -F, '{print $3}' file)
produces
5,6
3,8
46,76
Without the process substitutions:
awk -F, '{print $2}' file > tmp1
awk -F, '{print $3}' file > tmp2
paste -d, tmp1 tmp2 > output
Update based on your answer:
On first appearance, that's a confusing setup. Does this work?
for (( x=1; x<=$number_of_features; x++ )); do
    feature_number=$(sed -n "$x {p;q}" feature.txt)
    if [[ ! -f out.txt ]]; then
        cut -d, -f$feature_number file.txt > out.txt
    else
        paste -d, out.txt <(cut -d, -f$feature_number file.txt) > tmp &&
        mv tmp out.txt
    fi
done
That has to read the file.txt file a number of times. It would clearly be more efficient to only have to read it once:
awk -F, -v numfeat=$number_of_features '
# read the feature file into an array
NR==FNR {
colno[++i] = $0
next
}
# now, process the file.txt and emit the desired columns
{
sep = ""
for (i=1; i<=numfeat; i++) {
printf "%s%s", sep, $(colno[i])
sep = FS
}
print ""
}
' feature.txt file.txt > out.txt
Thanks all for contributing answers. I believe I should have been clearer in my question; sorry for that.
My code is as follow:
for (( x = 1; x <= $number_of_features ; x++ ))  # the number extracted from a text file
do
    feature_number=$(awk 'FNR == "'$x'" {print}' feature.txt)
    awk -F, '{print $"'$feature_number'"}' file.txt >> out.txt
done
Basically, I extract the feature number (which is the same as the column number) from a text document and then print that column. The text document may contain many feature numbers.
The thing is, each time I have different feature numbers (which reflect the column numbers), so the solutions above are not sufficient for this problem.
I hope it is clearer now.
Waiting for your comments please.
Thanks
Ahmad
Instead of using awk's file redirection, use shell redirection, e.g.
awk '{print $2,$3}' >> file
The comma is replaced with the value of the output field separator (a space by default).
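For the comma-separated input in this question, a sketch that emits comma-separated output as well just sets OFS to match:
awk -F, -v OFS=, '{print $2,$3}' file.txt >> out.txt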