How to calculate anomaly using awk

I have a file:
file.txt
1 32
2 34
3 32
4 43
5 25
6 34
7 65
8 34
9 23
10 44
I would like to find the anomaly (deviation from the mean) of the second column.
My script below prints the anomalies for rows 2 to 10 only. It does not print a value for row 1.
awk 'FNR==NR{
f=1;
if($1 >= 1 && $1 <= 10){
count++;
SUM+=$2;
};
next
}
FNR==1 && f==1{
AVG=SUM/count;
next
}
($1 >= 1 && $1 <= 10){
print $1, $2-AVG
}
' file.txt file.txt
My desired output:
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
I found a solution:
awk '{f=$1>=1 && $1<=10}f && NR==FNR{sum+=$2; c++; next}f{ print $1, $2-(sum/c) }' file.txt file.txt
I am still wondering why the first script does not give the correct answer.

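Since this is just a two-column file, this can also be done in a single pass with awk (note that the order in which for (i in map) visits the keys is unspecified, so the output may need sorting):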
awk '{map[$1] = $2; s += $2}
END {mean = s/NR; for (i in map) print i, map[i] - mean}' file
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4

The first script in the OP does not give the correct value because it skips the first line in the second pass over the file. This is caused by the rule FNR==1 && f==1 { AVG=SUM/count; next }. Because of the next statement, the deviation from the mean is never computed for the first record.
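A minimal sketch of that fix, assuming the same two-column layout as file.txt: the FNR==1 rule stays, but without next, so the first record falls through to the printing rule. The function name anomaly is hypothetical and only used for illustration; the sketch also assumes at least one row falls inside the 1..10 range (otherwise the division would fail).

```shell
# Hypothetical helper: two-pass deviation from the mean for rows with 1 <= $1 <= 10.
anomaly() {
    awk 'FNR == NR {                      # first pass: accumulate sum and count
             if ($1 >= 1 && $1 <= 10) { count++; SUM += $2 }
             next
         }
         FNR == 1 { AVG = SUM / count }   # second pass, first record: no "next" here
         ($1 >= 1 && $1 <= 10) { print $1, $2 - AVG }
    ' "$1" "$1"
}

# Demo with the first three rows of the data from the question.
tmp=$(mktemp)
printf '1 32\n2 34\n3 32\n' > "$tmp"
anomaly "$tmp"
rm -f "$tmp"
```

The file name is passed twice to awk inside the function, so the caller only supplies it once.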
This is an efficient computation of the deviation from the mean in a double pass:
awk '(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
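If the file contains values greater than 10 or smaller than 1 in the first column, but you only want the result for values in the range [1,10], then you can do the following. The range test is placed inside the rules rather than in a leading next, so the mean is still computed when the first record of the file is out of range:
awk '(NR==FNR){if ($1>=1 && $1<=10) {s+=$2;c++}; next}
(FNR==1){s/=c}
($1>=1 && $1<=10){print $1,$2-s}' file file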
There are still other optimizations that can be done, but these only become beneficial when working with extremely large files (many millions of lines).

Related

Sum column and count lines

I am trying to sum certain numbers grouped by column 2, and that works with my code. But I also want to count how many times the same value in column 2 is repeated and print that count in the last column.
file1
36 2605 1 2
36 2605 1 2
36 2603 1 2
36 2605 1 2
36 2605 1 2
36 2605 1 2
36 2606 1 2
Output Desired
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
I tried
awk '{a[$2]+=$1}{b[$2]+=$3}{c[$2]+=$4;count[$2]+=$2}END{for(i in a)print i,a[i],b[i],c[i],count[i]}' file1
Thanks in advance
Renamed the vars and added pretty print:
awk '
{
sum1[$2]+=$1
sum3[$2]+=$3
sum4[$2]+=$4
count[$2]++
len2=((l=length($2))>len2?l:len2)
len1=((l=length(sum1[$2]))>len1?l:len1)
len3=((l=length(sum3[$2]))>len3?l:len3)
len4=((l=length(sum4[$2]))>len4?l:len4)
len5=((l=length(count[$2]))>len5?l:len5)
}
END {
for(i in count) {
printf "%*d %*d %*d %*d %*d\n",
len2,i,len1,sum1[i],len3,sum3[i],len4,sum4[i],len5,count[i]
}
}' file
Output:
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Space chars are relatively inexpensive these days, you should really consider getting some for your code, especially if you want other people to read it to help you debug it! Here's the code you posted:
awk '{a[$2]+=$1}{b[$2]+=$3}{c[$2]+=$4;count[$2]+=$2}END{for(i in a)print i,a[i],b[i],c[i],count[i]}' file1
and here it is after having been run through a code beautifier (I used gawk -o):
{
a[$2] += $1
}
{
b[$2] += $3
}
{
c[$2] += $4
count[$2] += $2
}
END {
for (i in a) {
print i, a[i], b[i], c[i], count[i]
}
}
See how just by adding some white space it's now vastly easier to understand and so the bug in how count[$2] is being populated is glaringly obvious? Some meaningful variable names are always extremely useful too and I hear alphanumeric chars are on special right now!
FWIW here's how I'd do this:
$ cat tst.awk
BEGIN { keyFldNr = 2 }
{
numOutFlds = 0
for (i=1; i<=NF; i++) {
if (i != keyFldNr) {
sum[$keyFldNr,++numOutFlds] += $i
}
}
cnt[$keyFldNr]++
}
END {
for (key in cnt) {
printf "%s%s", key, OFS
for (i=1; i<=numOutFlds; i++) {
printf "%s%s", sum[key,i], OFS
}
print cnt[key]
}
}
$ awk -f tst.awk file
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
$ awk -f tst.awk file | column -t
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Notice that it'll work as-is no matter how many fields you have on each line and if you need to use a different field for the key that you count and sum on then you just change the value of keyFldNr in the BEGIN section from 2 to whatever you want it to be.
A non-awk approach, using the very useful GNU datamash, which is designed for tasks like this one:
$ datamash -Ws groupby 2 sum 1,3,4 count 2 < input.txt
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Read as: For each group of rows with the same value in column 2, display that value, the sums of columns 1, 3 and 4, and the number of rows in the group.
You've almost nailed it; you're just not incrementing count[$2] properly.
$ awk '{a[$2]+=$1;b[$2]+=$3;c[$2]+=$4;count[$2]++}
END{for(i in a) print i,a[i],b[i],c[i],count[i]}' file
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
No external program is needed, and it is faster (about 21 ms). Tried with plain GNU awk; note that asort is a gawk extension:
awk '{if($0~/^[A-Za-z0-9]/)a[NR]=$2" "$1" "$3" "$4}END{asort(a);$0="";for(;i++<NR;){split(a[i],b);if($1==""||b[1]==$1){$2+=b[2];$3+=b[3];$4+=b[4];$5++} else {print;$2=b[2];$3=b[3];$4=b[4];$5=1} $1=b[1]} print}' file1

Finding NR of row with specific conditions (using next line)

Guys I have a file like this
NR column
1 1
2 1
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 0
11 0
12 0
13 1
14 1
What I need is to find the NR values that tell me where the 1s are.
So my ideal output should tell me that there are 1s from NR=1 to 2, from NR=6 to 9, and from NR=13 to 14
or
1
2
6
9
13
14
Since I think it is easier not to include the first and last rows in the output, I expect the output to be
2
6
9
13
I've been trying to use getline, but without success.
I am sure there is an easy way to do this. Any help?
Thanks
Assuming your output above was incorrect (and it should really be the line number where the 0/1 or 1/0 transition happens - so the lines would be: "1, 3, 6, 10, 13"), then an awk oneliner is:
awk 'prev!=$0{print NR};{prev=$0}' file
which says:
for every line that doesn't match the prev line, print the line number, and
for every line, save the prev line
$ awk 'NR>1 && $0!=prev{print NR} {prev=$0}' file
3
6
10
13
or for your updated requirements:
$ awk '$1!=prev{print NR-prev} {prev=$1} END{if (prev) print NR}' file
1
2
6
9
13
14
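The NR-prev expression above relies on the values being exactly 0 and 1: when a run of 1s starts, prev is 0 (or unset), so the current line number is printed; when a run ends, prev is 1, so the previous line number is printed. A more explicit sketch of the same logic, assuming a single-column file of 0s and 1s (the function name run_bounds is hypothetical):

```shell
# Hypothetical helper: print the first and last line number of every run of 1s.
run_bounds() {
    awk '$1 == 1 && prev != 1 { print NR }       # a run of 1s starts on this line
         $1 != 1 && prev == 1 { print NR - 1 }   # the run ended on the previous line
         { prev = $1 }
         END { if (prev == 1) print NR }         # the input ended inside a run
    '
}

printf '1\n1\n0\n0\n1\n' | run_bounds
```

As in the one-liner, a run that is only one line long has its line number printed twice, once as the start and once as the end.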
awk to the rescue!
$ awk '!p&&$2==1{p=$1}
p&&!$2{print p"-"($1-1);p=0}
END{if(p) print p"-"$1}' file
1-2
6-9
13-14
{
if (NR > 1 && last != $0) {
print NR;
}
last = $0;
}
Another way
awk '$2!=x{x=$2;print NR-!($2)}END{if(x)print NR}' file
1
2
6
9
13
14

print a line from every 5 elements of a column

I am looking for a way to select a column (e.g. the eighth column) of a data file and write the first five numbers of that column in one row, the next five numbers in a second row, and so on.
I have been testing with awk and printf without success.
The awk way to do this is to choose between OFS and ORS as the separator printed after each value, using the modulus operator:
$ seq 1 20 | awk '{printf "%s", $1 (NR % 5 ? OFS : ORS)}'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
Change $1 to $8 for the eighth column, for example, and NR % 5 to NR % 10 for rows of 10 instead of 5. The seq command just generates a single column of numbers from 1 to 20 for demonstration.
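As a sketch of those two adjustments, the column and the row width can be passed in as awk variables. The function name wrap_column and the variable names col and n are hypothetical. This variant prints the separator before each value rather than after it, and adds an END rule so that an incomplete last row is still terminated by a newline.

```shell
# Hypothetical helper: print column "col" of the input, "n" values per output row.
wrap_column() {
    awk -v col="$1" -v n="$2" '
        {
            # Separator goes before every value except the first one of a row.
            printf "%s%s", ((NR - 1) % n ? OFS : ""), $col
            if (NR % n == 0) printf "%s", ORS    # row is full
        }
        END { if (NR % n) printf "%s", ORS }     # finish a partial last row
    '
}

seq 1 12 | wrap_column 1 5
```

For a real data file, something like wrap_column 8 5 < datafile would select the eighth column.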
I also find using xargs useful for this kind of thing:
$ seq 1 20 | awk '{print $1}' | xargs -n5
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
The awk isn't necessary for the example, as seq only produces a single column; for your question, however, change $1 to $8 to select only the eighth column of your input. With this approach you could also replace awk with cut.
This will also produce the format requested
seq 1 20 | awk '{printf("%s ", $1); if (NR % 5 == 0) printf("\n")}'
where $1 selects the column; change it to the column you need when passing a file to the awk command.

rearrange columns using awk or cut command

I have a large file with 1000 columns. I want to rearrange it so that the last column becomes the 3rd column. For this I used:
cut -f1-2,1000,3- file > out.txt
But this does not change the order.
Could anyone help using cut or awk?
Also, I want to rearrange columns 10 and 11 as shown below:
Example:
1 10 11 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
try this awk one-liner:
awk '{$3=$NF OFS $3;$NF=""}7' file
This moves the last column to the 3rd position; if you have 1000 columns, it moves the 1000th. Note that it leaves a trailing separator where the last column used to be.
EDIT
if the file is tab-delimited, you could try:
awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7' file
EDIT2
add an example:
kent$ seq 20|paste -s -d'\t'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7'
1 2 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
EDIT3
You didn't give an input example, so I assume there are no empty columns in the original file (no runs of consecutive tabs):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$10 FS $11 FS $3;$10=$11="";gsub(/\t+/,"\t")}7'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Alternatively, we could print those fields in a loop.
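A minimal sketch of the loop approach, assuming tab-separated input: the output line is built field by field, so no empty field or trailing tab is left behind. The function name last_to_third is hypothetical.

```shell
# Hypothetical helper: move the last tab-separated field to position 3.
last_to_third() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF < 3 { print; next }                   # nothing to move on short lines
         {
             out = $1 OFS $2 OFS $NF              # the new first three fields
             for (i = 3; i < NF; i++)             # then the old fields 3 .. NF-1
                 out = out OFS $i
             print out
         }'
}

printf 'a\tb\tc\td\te\n' | last_to_third
```

Building a string avoids modifying $0, so it works the same way in any POSIX awk and does not depend on NF-- being supported.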
I THINK what you want is:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; sub(OFS "[^" OFS "]*$","")}1' file
This might also work for you depending on your awk version:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1' file
Without the part after the semi-colon you'll have trailing tabs in your output.
Since many people are searching for this, and even the best awk solution is not really pretty or easy to use, I wanted to post my solution (mycut), written in Python:
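#!/usr/bin/env python3
import sys
from signal import signal, SIGPIPE, SIG_DFL
# Exit quietly when the reader closes the pipe (e.g. when piping into head).
signal(SIGPIPE, SIG_DFL)
# Example usage: cat file | mycut 3 2 1
# Column numbers are 0-based indexes into the split line.
columns = [int(x) for x in sys.argv[1:]]
delimiter = "\t"
for line in sys.stdin:
    # Strip the newline so the last column does not carry it into the output.
    parts = line.rstrip("\n").split(delimiter)
    print(delimiter.join(parts[col] for col in columns))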
I am thinking about adding the other features of cut, like changing the delimiter, and a feature that uses a * to print the remaining columns. But then it would deserve its own page.
A shell wrapper function for awk that uses simpler syntax:
# Usage: rearrange int_n [int_o int_p ... ] < file
rearrange ()
{
unset n;
n="{ print ";
while [ "$1" ]; do
n="$n\$$1\" \" ";
shift;
done;
n="$n }";
awk "$n" | grep '\w'
}
Examples...
echo foo bar baz | rearrange 2 3 1
bar baz foo
Using bash brace expansion, rearrange first and last 5 items in descending order:
echo {1..1000}a | tr '\n' ' ' | rearrange {1000..995} {5..1}
1000a 999a 998a 997a 996a 995a 5a 4a 3a 2a 1a
Sorted 3-letter shells in /bin:
ls -lLSr /bin/?sh | rearrange 5 9
150792 /bin/csh
154072 /bin/ash
771552 /bin/zsh
1554072 /bin/ksh

How to Add Column with Percentage

I would like to calculate each line's value as a percentage of the total over all lines and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I have tried to do it myself, but when I calculated total for all lines I didn't know how to preserve rest of line unchanged. Thanks a lot for help!
Here you go, a two-pass awk solution (the file is read twice):
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If a tab is required in the output, just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakdown of the pattern {action} statements:
The first pattern is NR==FNR. FNR is awk's built-in variable that counts the records (by default separated by newlines) read from the current file; it restarts for each input file, so it reaches 4 at the end of each pass. NR is similar to FNR but is never reset; it keeps growing across files, so it reaches 8 here.
This pattern is therefore true only for the first 4 records, which is exactly what we want. While going through those 4 records, we accumulate the total in the variable a. Notice that we did not initialize it; in awk we don't have to. However, the division would fail if the entire column 2 sums to 0. You can handle that with an if statement in the second action, i.e. do the division only if a != 0, and otherwise report a division by zero or something similar.
next is needed because we don't want the second pattern {action} statement to execute during the first pass. next tells awk to skip the remaining rules and move on to the next record.
Once the first four records are processed, the second pattern {action} statement takes over, which is pretty straightforward: it computes the percentage and prints columns 1 and 2 with the percentage next to them.
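A sketch of the guard described above, under the assumption that printing a placeholder is acceptable when the total is zero. The function name percent_of_total and the "n/a" placeholder are hypothetical choices; the sketch also uses printf with %.2f so that the output matches the two-decimal format asked for in the question.

```shell
# Hypothetical helper: two-pass percentage with a guard for a zero total.
percent_of_total() {
    awk -v OFS='\t' '
        NR == FNR  { total += $2; next }           # first pass: sum column 2
        total == 0 { print $1, $2, "n/a"; next }   # placeholder instead of dividing by zero
        { printf "%s%s%s%s%.2f\n", $1, OFS, $2, OFS, $2 / total * 100 }
    ' "$1" "$1"
}

# Demo on the data from the question, written to a temporary file.
tmp=$(mktemp)
printf '1\t10\n2\t10\n3\t20\n4\t40\n' > "$tmp"
percent_of_total "$tmp"
rm -f "$tmp"
```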
Note: As @lhf mentioned in the comments, this one-liner only works as long as the data set is in a file. It won't work if you pass the data through a pipe.
In the comments, there is a discussion about ways to make this awk one-liner take input from a pipe instead of a file. The only way I could think of was to store the column values in an array and then use a for loop to print each value along with its percentage.
Arrays in awk are associative, and the order in which for (i in b) visits the keys is unspecified, i.e. the values may not come out in the order they went in. If that is OK, the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
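If sorting afterwards is not an option (for example when the first column is not numeric), one alternative sketch is to index the buffered records by NR and replay them with a counting loop, which keeps the input order without relying on for (i in b). The function name percent_stream is hypothetical; like the array version above, it holds the whole input in memory.

```shell
# Hypothetical helper: single-pass percentage that keeps the input order.
percent_stream() {
    awk '{ line[NR] = $0; val[NR] = $2; sum += $2 }    # buffer every record
         END {
             for (i = 1; i <= NR; i++)                 # replay in input order
                 printf "%s%s%.2f\n", line[i], OFS, (sum ? val[i] / sum * 100 : 0)
         }'
}

printf '1 10\n2 10\n3 20\n4 40\n' | percent_stream
```

Because the original line is stored as a whole, any extra columns are preserved unchanged and the percentage is simply appended.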
You can do it in a couple of passes
#!/bin/bash
total=$(awk '{total=total+$2}END{print total}' file)
awk -v total=$total '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100)}' file
To print a literal percent sign with printf, you need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
Perhaps there is a better way, but I would read the file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
## Tab as field separator.
FS = "\t";
}
## First pass of input file. Get total from second field.
ARGIND == 1 {
total += $2;
next;
}
## Second pass of input file. Print each original line and percentage as third field.
{
printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Run the script on my Linux box:
gawk -f script.awk infile infile
And result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00