How to track lines in a large log file that don't appear in the expected order? - awk

I have a large log file which includes lines in the format
id_number message_type
Here is an example of a log file where all lines appear in the expected order:
1 A
2 A
1 B
1 C
2 B
2 C
However, not all lines appear in the expected order in my log file, and I'd like to get a list of all id numbers that don't appear in the expected order. For the following file:
1 A
2 A
1 C
1 B
2 B
2 C
I would like to get output indicating that id number 1 has lines that don't appear in the expected order. How can I do this using grep, sed, or awk?

This works for me:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {print $1}' logfile
When you run this, the ID number from each out-of-order line will be printed. If there are no out-of-order lines, then nothing is printed.
How it works
-v "a=ABC"
This defines the variable a with the list of characters in their expected order.
substr(a, b[$1]++ + 1, 1) != $2 {print $1}
For each ID number, the array b keeps track of where we are. Initially, b is zero for all IDs. With this initial value, that is b[$1]==0, the expression substr(a, b[$1] + 1, 1) returns A which is our first expected output. The condition substr(a, b[$1] + 1, 1) != $2 thus checks if the expected output, from the substr function, differs from the actual output shown in the second field, $2. If it does differ, then the ID value, $1, is printed.
After the substr expression is computed, the trailing ++ in the expression b[$1]++ increments the value of b[$1] by 1 so that the value of b[$1] is ready for the next time that ID $1 is encountered.
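For example, run on the second sample file above (saved as logfile), ID 1 is printed twice, once for the 1 C line and once for the 1 B line:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {print $1}' logfile
1
1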
Refinement
The above prints an ID number every time an out-of-order line is encountered. If you just want each bad ID printed once, not multiple times, use:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {bad[$1]++} END{for (n in bad) print n}' logfile

I am only on my iPad with no way to test this, but I can give you an idea of how to do it with awk since no one else is answering...
Something like this:
awk 'BEGIN{for(i=0;i<10000;i++)expected[i]=ord("A")}
{if(expected[$1]!=ord($2))
print "Out of order at line ", NR, $0;
expected[$1]=ord($2)+1
}' yourFile
You will need to paste in the ord() function from here.
Basically, the concept is to initialise an array called expected[] that keeps track of the next message type expected for each id and then, as each line is read, check it is the next expected value.
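The ord() function itself is not reproduced here; as a rough sketch of one possible implementation (an assumption on my part, not necessarily the linked code), you could paste something like this into the awk program:
function ord(s,    c) { c = substr(s, 1, 1); return _ord_[c] }    # hypothetical helper: map a character to its numeric code
BEGIN { for (j = 1; j < 256; j++) _ord_[sprintf("%c", j)] = j }   # build the lookup table once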

Using batch tools only (the last sort is not mandatory):
sort -k1n YourFile | tee file1 | sort -k2 > file2 && comm -23 file1 file2 | sort

Related

How to extract lines which have no duplicated values in the first column?

For some statistics research, I want to separate out the data that has duplicated values in the first column. I work with vim.
Suppose that a part of my data is like this:
Item_ID Customer_ID
123 200
104 134
734 500
123 345
764 347
1000 235
734 546
As you can see, some lines have equal values in the first column.
I want to generate two separate files: one containing the lines whose first-column value is repeated, and the other containing only the lines with non-repeated first-column values.
For the above example I want to have these two files:
The first one contains:
Item_ID Customer_ID
123 200
734 500
123 345
734 546
and the second one contains:
Item_ID Customer_ID
104 134
764 347
1000 235
Can anybody help me?
I think awk would be a better option here.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] > 1' input.txt input.txt > dup.txt
Prettier version of awk code:
FNR == NR {
seen[$1]++;
next
}
seen[$1] == 1
Overview
We loop over the text twice: by supplying the same file to our awk script twice, we are effectively making two passes over it. The first time through, we count the number of times we see each first-field value. The second time through, we output only the records whose first-field value has a count of 1. For the duplicate-line case, we instead output only the lines whose first-field value has a count greater than 1.
Awk primer
awk loops over lines (or records) in a text file/input and splits each line into fields: $1 for the first field, $2 for the second field, etc. By default, fields are separated by whitespace (this can be configured).
awk runs each line through a series of rules in the form condition { action }. Any time a condition matches, the action is taken.
Example of printing the first field of each line that matches foo:
awk '/foo/ { print $1 }' input.txt
Glory of Details
Let's take a look at finding only the lines whose first field appears exactly once.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
Prettier version for readability:
FNR == NR {
seen[$1]++;
next
}
seen[$1] == 1
awk 'code' input > output - run code over the input file, input, and then redirect the output to file, output
awk can take more than one input. e.g. awk 'code' input1.txt input2.txt.
Use the same input file, input.txt, twice to loop over the input twice
awk 'FNR == NR { code1; next } code2' file1 file2 is a common awk idiom which will run code1 for file1 and run code2 for file2
NR is the current record (line) number. This increments after each record
FNR is the current file's record number. e.g. FNR will be 1 for the first line in each file
next will stop executing any more actions and go to the next record/line
FNR == NR will only be true for the first file
$1 is the first field's data
seen[$1]++ - seen is an array/dictionary where we use the first field, $1, as our key and increment the value so we can get a count
$0 is the entire line
print ... prints out the given fields
print $0 will print out the entire line
just print is short for print $0
condition { print $0 } can be shortened to condition { print }, which can be shortened further to just condition
seen[$1] == 1 checks whether the first field's value count is equal to 1 and, if so, prints the line
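As a concrete check, if the sample data above (including its header line) is saved as input.txt, the two commands should produce something like:
cat uniq.txt
Item_ID Customer_ID
104 134
764 347
1000 235
cat dup.txt
123 200
734 500
123 345
734 546
Note that the header line ends up in uniq.txt only, because Item_ID occurs exactly once in the first column.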
Here is an awk solution:
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > (a[b[i]]==1?"single":"multiple")}' file
cat single
104 134
764 347
1000 235
cat multiple
123 200
734 500
123 345
734 546
P.S. I skipped the header line, but handling it could be added.
This way you get one file for single hits, one for doubles, one for triples, etc.
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > "file"a[b[i]]}'
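For the sample data above, that should give:
cat file1
104 134
764 347
1000 235
cat file2
123 200
734 500
123 345
734 546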
That would require some filtering of the list of lines in the buffer. If you're really into statistics research, I'd go search for a tool that is better suited than a general-purpose text editor, though.
That said, my PatternsOnText plugin has some commands that can do the job:
:2,$DeleteUniqueLinesIgnoring /\s\+\d\+$/
:w first
:undo
:2,$DeleteAllDuplicateLinesIgnoring /\s\+\d\+$/
:w second
As you want to filter on the first column, the commands' /{pattern}/ has to filter out the second column; \s\+\d\+$ matches the final number and its preceding whitespace.
:DeleteUniqueLinesIgnoring (from the plugin) gives you just the duplicates, :DeleteAllDuplicateLinesIgnoring just the unique lines. I simply :write them to separate files and :undo in between.

In a CSV file, subtotal 2 columns based on a third one, using AWK in KSH

Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples in this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
The "Problem":
I have a CSV file that looks like this:
c1,c2,c3,c4,c5,134.6,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0.18,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,416.09,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,12.1,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,480.64,,c8,c9,SERVER4,c11
c1,c2,c3,c4,c5,,83.65,c8,c9,SERVER5,c11
c1,c2,c3,c4,c5,,253.15,c8,c9,SERVER6,c11
c1,c2,c3,c4,c5,,18.84,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,8.12,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,22.45,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,117.81,c8,c9,SERVER8,c11
c1,c2,c3,c4,c5,,96.34,c8,c9,SERVER9,c11
Complementary facts:
1) File has 11 columns.
2) The data in columns 1, 2, 3, 4, 5, 8, 9 and 11 is irrelevant in this case. In other words, I will only work with columns 6, 7 and 10.
3) Column 10 will be typically alphanumeric strings (server names), though it may contain also "-" and/or "_".
4) Columns 6 and 7 will have exclusively numbers, with up to two decimal places (A possible value is 0). Only one of the two will have data per line, never both.
What I need as an output:
- A single occurrence of every string in column 10 (as column 1), then the sum (subtotal) of its values in column 6 (as column 2) and lastly, the sum (subtotal) of its values in column 7 (as column 3).
- If the total for a field is "0" the field must be left empty, but still must exist (its respective comma has to be printed).
- Note that the strings in column 10 will already be alphabetically sorted, so there is no need to do that part of the processing with AWK.
Output sample, using the sample above as an input:
SERVER1,134.6,,
SERVER2,0.18,,
SERVER3,428.19,,
SERVER4,480.64,,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,26.96
I've already found within these pages not one, but two AWK one-liners that PARTIALLY accomplish what I need:
awk -F "," 'NR==1{last=$10; sum=0;}{if (last != $10) {print last "," sum; last=$10; sum=0;} sum += $6;}END{print last "," sum;}' inputfile
awk -F, '{a[$10]+=$6;}END{for(i in a)print i","a[i];}' inputfile
My "problems" in both cases are the same:
- Subtotals of 0 are printed.
- I can only handle the sum of one column at a time. Whenever I try to add the second one, I get either a syntax error or it simply does not print the third column at all.
Thanks in advance for your support people!
Regards,
Martín
Something like this?
$ awk 'BEGIN{FS=OFS=","}
{s6[$10]+=$6; s7[$10]+=$7}
END{for(k in s6) print k,(s6[k]?s6[k]:""),(s7[k]?s7[k]:"")}' file | sort
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
Note that your treatment of commas is not consistent: you're adding an extra one when the last field is zero (count the commas).
Your posted expected output doesn't seem to match your posted sample input, so we're guessing, but this might be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="," }
$10 != prev {
if (NR > 1) {
print prev, sum6, sum7
}
sum6 = sum7 = ""
prev = $10
}
$6 { sum6 += $6 }
$7 { sum7 += $7 }
END { print prev, sum6, sum7 }
$ awk -f tst.awk file
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34

awk print row where column 1 matches and column three is highest value

I am looking to print columns 1 & 2 where column 1 matches and column 3 is the highest value. I am currently using awk and sort to get this type of output:
EXCEPTION 91 3
EXCEPTION 15 5
TEST 5 1
TEST 1 8
the end desired output I am looking for:
EXCEPTION 15
TEST 1
Here is an example file and the commands I am running to get the uniq counts. What I would really like is for sort to print the last record in the uniq sort,
EXCEPTION 15
so I don't have to do all the crazy uniq --count logic.
I.e.: I want to know if column 1 matches >= 3 times, and print the last recorded column-two value for that match.
cat /tmp/testing.txt |grep EXCEPTION
EXCEPTION 15
EXCEPTION 15
EXCEPTION 15
EXCEPTION 91
EXCEPTION 91
EXCEPTION 91
EXCEPTION 91
EXCEPTION 15
EXCEPTION 15
cat /tmp/testing.txt|awk '{print $1 " " $2}'|sed '/^$/d'| awk '$2 >= '1' '|sort |uniq --count|awk '{print $1" "$2" "$3}'|awk '$1 >= '3''|awk '{print $1" "$2" "$3}'|awk '{print $2" "$3" "$1}'
EXCEPTION 15 5
EXCEPTION 91 4
Just keep track of the maximums for any given 1st field and store its corresponding 2nd field:
awk '{if ($3>max[$1])
{max[$1]=$3; val[$1]=$2}
}
END {for (i in val) print i, val[i]}' file
Test
$ awk '{if ($3>max[$1]) {max[$1]=$3; val[$1]=$2}} END {for (i in val) print i, val[i]}' file
EXCEPTION 15
TEST 1
You said you didn't want horrible uniq logic... but in case you change your mind, this task does fit quite neatly into sort/uniq/cut's purview (though this isn't as efficient as the awk solution).
From your testing file you can get your desired output with
sort -k1,2n < testing |
uniq -c |
sort -k2,2 -k1rn,1 |
cut -c8- |
sort -u -k1,1
In order: sort the file by its first two fields so that identical lines end up next to each other.
Then count the occurrences of each line; uniq -c prepends the count, padded with whitespace, to each line.
Then sort primarily by the string (EXCEPTION, TEST), which is now the second field, and within each string by count, descending.
Remove the first 8 characters from each line (the counts).
Finally "sort" by the string and only output uniques. As the record you're interested in has been sorted to the top, this is the one it outputs. This can be thought of as "uniq by field."
(If you want to remove the trailing spaces from your input you can replace the cut command with sed 's/^ *[0-9]\+ *//')

Print lines containing the same second field for more than 3 times in a text file

Here is what I am doing.
The text file is comma-separated and has three fields,
and I want to extract all the lines containing the same second field
more than three times.
Text file (filename is "text"):
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
6,keyword2,content5
6,keyword2,content5
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
My command is below. It cats the whole text file inside awk, greps for the second field of each line, and counts the matching lines.
If the count is greater than 2, it prints the whole line.
The command:
awk -F "," '{ "cat text | grep "$2 " | wc -l" | getline var; if ( 2 < var ) print $0}' text
However, the command output contains only the first three consecutive lines,
instead of also printing the last three lines containing "keyword1", which occurs in the text six times.
Result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
My expected result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
Can somebody tell me what I am doing wrong?
It is relatively straight-forward to make just two passes over the file. In the first pass, you count the number of occurrences of each value in column 2. In the second pass, you print out the rows where the value in column 2 occurs more than your threshold value of 3 times.
awk -F, 'FNR == NR { count[$2]++ }
FNR != NR { if (count[$2] > 3) print }' text text
The first line of code handles the first pass; it counts the occurrences of each different value of the second column.
The second line of code handles the second pass; if the value in column 2 was counted more than 3 times, print the whole line.
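Run against the text file above, this prints exactly the six keyword1 lines from the expected result, since keyword1 occurs six times (more than 3) while keyword2 occurs only twice.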
This doesn't work if the input is only available on a pipe rather than as a file (so you can't make two passes over the data). Then you have to work much harder.
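If you do end up with pipe-only input, one possible workaround (a sketch of the usual trick, not part of the original answer; some_command is just a placeholder for whatever produces the data) is to buffer the lines in memory and do the second "pass" over the buffer in the END block:
some_command | awk -F, '
  { line[NR] = $0; key[NR] = $2; count[$2]++ }   # buffer every line and count its second field
  END { for (i = 1; i <= NR; i++) if (count[key[i]] > 3) print line[i] }
'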

awk script for finding smallest value from column

I am a beginner in AWK, so please help me to learn it. I have a text file named snd and its values are:
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row when the third column value is the minimum.
This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false. If you know that the 3rd column is always negative (not positive), you can also skip the NR == 1 ... test, as min's value at the beginning of the script is zero and any negative $3 will compare smaller.
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141
I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x{line=$0} - On each line, compare the third field against our stored value, and if the condition is true, store the line. (We could make this run only on NR>1, but it doesn't matter.)
END{print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.
A short answer for this would be:
sort -k3,3n temp|head -1
Since you have asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But I always prefer the shorter one.
For calculating the smallest value in any column, let's say the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
This will only print the smallest value of the column.
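For example, on the snd file above this prints the minimum of the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}' snd
141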
If the complete line is needed, it is better to use sort:
sort -r -n -t [delimiter] -k[column] [file name]
awk -F ";" '(NR==1){a=$NF;b=$0} {a=$NF<a?$NF:a;b=$NF>a?b:$0} END {print b}' filename
This will print the line with the smallest value (if several lines tie on the smallest value, the last one encountered wins).
awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
We use || a[$1]=="" because the first time a given value of field 1 is encountered, a[$1] is null.
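For a small hypothetical comma-separated input (not from the question) such as:
a,3
a,1
b,5
b,2
this prints each first-field value together with its minimum and maximum second-field value:
a,1,3
b,2,5
(Because of the for (i in a) loop, the output order is unspecified.)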