awk comparing two files gives unexpected output when swapping the position of the argument files

Below are my two files' content:
cat f1
9
5
3
cat f2
1
2
3
This is my code, which works perfectly and gives output as per my understanding:
awk 'NR==FNR {a[$0]; next} FNR in a' f1 f2
3
But when I swap the positions of the two argument files, the output is different from what I expected.
awk 'NR==FNR {a[$0]; next} FNR in a' f2 f1
9
5
3
I expected the output to be 3 again, as before, because f2 and f1 both have exactly 3 lines and the key 3 is stored in the array either way. Please explain how the second command works.

The output from the second example is, of course, correct.
Since f2 contains the values 1, 2, and 3, the array a ends up with elements a[1], a[2], and a[3]. When awk then processes f1, line 1 has FNR == 1, and 1 is an index in a, so line 1 (containing 9) is printed; similarly for lines 2 and 3, hence the output you see.
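Note that the first command only appeared to work: a holds the indices 9, 5, and 3 from f1's contents, and FNR in a happens to be true only at f2's third line, which contains 3. If the intent was to compare line contents rather than line numbers, test $0 instead; a minimal sketch:
$ awk 'NR==FNR {a[$0]; next} $0 in a' f1 f2
3
$ awk 'NR==FNR {a[$0]; next} $0 in a' f2 f1
3
This prints the lines common to both files, so the argument order no longer matters (apart from the output order when there are several matches).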

Related

How do I compare alphanumeric values in non-sequential order?

Currently I am using an awk script which compares numbers in non-sequential order and prints the difference. It works pretty well for numbers, but if I have alphanumeric values it doesn't seem to work well.
In its current state, apart from simply comparing the numbers, it does 2 things additionally:
It accounts for the zeros before a number or character, comparing the absolute values only and ignoring leading zeros
If the same number or character occurs multiple times in both files, it outputs the additional occurrence
I just want the script to work well for alphanumeric values too, as currently it only seems to work well with plain numbers. Can someone please edit the script to produce the desired output while also honoring the above 2 conditions?
Current script
awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
Example below
cat file1
1
01
001
8
2B
12
13C
027B
0027B
cat file2
1
2
08
12
13C
02B
9
27B
Expected output/result
1
1
2
9
27B
Explanation of expected output
In file1: "1", "01", "001" all evaluate to 1, i.e. 3 times
In file2: "1" is present only once
Hence "1" is present twice in the result (3-1 times)
"2" and "9" are present only in file2, so both simply form part of the output
In file1: "027B", "0027B" evaluate to 27B, i.e. 2 times
In file2: "27B" is present only once
Hence "27B" is present once in the result (2-1 times)
Explanation of matched items (the ones not forming part of the expected output)
"8" from file1 (line 4) is matched with "08" from file2 (line 3)
"12" from file1 (line 6) is matched with "12" from file2 (line 4)
"13C" from file1 (line 7) is matched with "13C" from file2 (line 5)
"2B" from file1 (line 5) is matched with "02B" from file2 (line 6)
Lastly, the items in the expected output should be in ascending order, as shown above; if, say, the example also had 3 in the expected output, it should read vertically as 1 1 2 3 9 27B
The problem is that $0+0 coerces every key to a number, so "2B" and "02B" both become 2 and would collide with a plain "2". It should be enough to remove leading zeros when forming the key instead (with a special case for all-zero values like 0000):
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}
$ awk -f a.awk file1 file2
2
9
27B
1
1
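Note that awk leaves the iteration order of for (k in a) unspecified, which is why the output above is unsorted. In GNU awk specifically you can force an order with PROCINFO["sorted_in"]; a gawk-only sketch for the END block, where string indices such as 27B compare by their numeric prefix:
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"  # gawk only: visit indices in ascending numeric order
    for (k in a)
        while (a[k]-- > 0)
            print k
}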
RE-EDIT
If you just want the values sorted numerically, pipe into sort:
$ awk -f a.awk file1 file2 | sort -n
1
1
2
3
4
5
9
27B
To output in the order as found in file2, you can remember the order in another array and then do all the printing in the END block. This version will output the values in the order of file2, with any values only in file1 printed last.
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
{ b[FNR] = k }
!(k in a && a[k]--) { a[k] = 1 }
END {
for (i=1; i<=FNR; ++i) {
k = b[i]
while(a[k]-->0) print k
}
for (k in a) {
while(a[k]-->0) print k
}
}
$ awk -f a.awk file1 file2
1
1
2
9
27B
3
4
5

awk with empty field in columns

Here is my file.dat:
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
Running awk '{print $2}' file.dat gives:
A
2
4
7
U
But I would like to keep the empty fields:
A

4

U
How to do it?
I must add that:
between columns 1 and 2 there is a 3-whitespace field separator;
between columns 2 and 3, and between columns 3 and 4, a one-whitespace field separator.
So in column 2 there are 2 fields missing (lines 2 and 4), and in column 4
there are also 2 fields missing (lines 3 and 5).
If this isn't all you need:
$ awk -F'[ ]' '{print $4}' file
A

4

U
then edit your question to provide a more truly representative example and clearer requirements.
If the input is fixed-width columns, you can use substr to extract the slice you want. I have assumed that you want a single character at index 5:
awk '{ print(substr($0,5,1)) }' file
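With the spacing described in the question (three spaces between columns 1 and 2), a run would look like this, the blank lines being the preserved empty fields:
$ awk '{ print(substr($0,5,1)) }' file
A

4

U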
Your awk code is missing field separators.
Your example file doesn't clearly show what that field separator is.
From observation your file appears to have 5 columns.
You need to determine what your field separator is first.
This example code expects \t which means <TAB> as the field separator.
awk -F'\t' '{print $3}' OFS='\t' file.dat
This outputs the 3rd column from the file. -F'\t' is the 'read in' field separator and OFS='\t' is the 'read out' one.
A
4
U
For GNU awk. It processes the file twice. On the first pass it examines all records to find which character positions contain only spaces in every line, and treats those continuous space runs as separators, building up the FIELDWIDTHS variable. On the second pass it uses that variable for fixed-width processing of the data.
The a[i] entries get values 0/1; with this input, h (the header mask) will be 100010101, which leads to FIELDWIDTHS="4 2 2 1":
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
|   | | |
100010101 - while(match(h,/10*/))
\  /\/\/|
 4  2 2 1
Script:
$ awk '
NR==FNR {
for(i=1;i<=length;i++) # all record chars
a[i]=((a[i]!~/^(0|)$/) || substr($0,i,1)!=" ") # keep track of all space places
if(--i>m)
m=i # max record length...
next
}
BEGINFILE {
if(NR!=0) { # only do this once
for(i=1;i<=m;i++) # ... used here
h=h a[i] # h=100010101
while(match(h,/10*/)) { # build FIELDWIDTHS
FIELDWIDTHS=FIELDWIDTHS " " RLENGTH # qnd
h=substr(h,RSTART+RLENGTH)
}
}
}
{
print $2 # and output
}' file file
And output:
A

4

U
You need to trim off the space from the fields, though.
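Once the widths are known, you can also skip the first pass entirely; a gawk-only sketch, assuming the FIELDWIDTHS="4 2 2 1" derived above and trimming the padding as just noted:
$ gawk 'BEGIN { FIELDWIDTHS = "4 2 2 1" }  # widths taken from the analysis above
        { sub(/ +$/, "", $2)               # trim the trailing padding off field 2
          print $2 }' file
A

4

U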

How to print all columns after matching on key field

How can I join all fields of each row from both files after matching on a key field? And how can I generalize this one-liner when the number of fields in f2 is unknown?
f2:
a 1 2
b 3 4
c 5 6
f3:
10 a x y z
11 g x y z
12 j x y z
observed:
a 10 x y z
a1 10 x y z
Desired:
a 1 2 10 x y z
These are my best attempts but are incorrect:
awk 'FNR==NR{a[$1]=$2;next} ($2 in a) {print a[$2],$0}' f2.txt f3.txt > f4.txt
awk 'FNR==NR{a[$1]=$2$3;next} ($2 in a) {print a[$2],$0}' f2.txt f3.txt > f4.txt
awk 'NR==FNR{a[$1]=$0;next} ($2 in a){print a[$2],$1,$3,$4,$5}' f2.txt f3.txt > f4.txt
Save the whole line as the value and column 1 as the key; when reading the 2nd file, check whether column 2 is in array a, and if it is, print a[$2] followed by the rest of the columns.
A shorter way (the disadvantage of this command is that there's one extra space between 10 and x):
awk 'NR==FNR{a[$1]=$0;next} ($2 in a){second=$2; $2="";print a[second],$0}' f2.txt f3.txt > f4.txt
This replaces $2 of the 2nd file with an empty string, then prints a[second] followed by the whole (modified) line $0.
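If the extra space matters, a sketch of the same idea that rebuilds the output field by field instead of blanking $2:
awk 'NR==FNR { a[$1] = $0; next }   # save each f2 line, keyed by its first field
     ($2 in a) {
         out = a[$2] OFS $1          # full f2 line, then f3 field 1
         for (i = 3; i <= NF; i++)   # append f3 fields 3..NF, skipping the key
             out = out OFS $i
         print out
     }' f2.txt f3.txt
This prints a 1 2 10 x y z with single spaces throughout.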
If your files are sorted on the keys, as in your example, join is the tool for this task:
join -1 1 -2 2 f2.txt f3.txt
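join requires each file to be sorted on its join field; if yours are not, bash process substitution keeps it a one-liner. The key comes first in join's output, followed by the remaining fields of each file:
$ join -1 1 -2 2 <(sort -k1,1 f2.txt) <(sort -k2,2 f3.txt)
a 1 2 10 x y z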
@mxttgen31: try:
awk 'FNR==NR{Q=$2;$2="";A[Q]=$0;next} ($1 in A){print $0,A[$1]}' f3 f2
Explanation of above command as follows:
awk 'FNR==NR{ ##### Checking condition FNR==NR here. FNR and NR both denote the current line number;
              the only difference is that, since awk can read multiple files, FNR is RESET at the
              start of each new file, while NR keeps increasing until all files have been read.
              So this condition is TRUE only while the first Input_file (which is f3 here) is being read.
Q=$2;         ##### Assigning the second field's value to variable Q.
$2="";        ##### Making the second field's value NULL now.
A[Q]=$0;      ##### Creating an array named A whose index is Q (the saved second field) and whose value is the current line.
next}         ##### next (awk's built-in keyword) skips all further statements and takes the cursor back to the start.
($1 in A)     ##### The cursor only comes here while the second Input_file is being read; check whether $1 (first field) is present in array A, and if so do the following.
{print $0,A[$1]} ##### Print the current line ($0) of the current file f2, then array A's value whose index is $1 of the current file f2.
' f3 f2       ##### Mentioning the Input_files here.

How to track lines in large log file that don't appear in the expected order?

I have a large log file which includes lines in the format
id_number message_type
Here is an example for a log file where all lines appear in the expected order
1 A
2 A
1 B
1 C
2 B
2 C
However, not all lines appear in the expected order in my log file and I'd like to get a list of all id numbers that don't appear in expected order. For the following file
1 A
2 A
1 C
1 B
2 B
2 C
I would like to get an output that indicates id number 1 has lines that don't appear in the expected order. How can I do this using grep, sed, or awk?
This works for me:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {print $1}' logfile
When you run this, the ID number from each out-of-order line will be printed. If there are no out-of-order lines, then nothing is printed.
How it works
-v "a=ABC"
This defines the variable a with the list of characters in their expected order.
substr(a, b[$1]++ + 1, 1) != $2 {print $1}
For each ID number, the array b keeps track of where we are. Initially, b is zero for all IDs. With this initial value, that is b[$1]==0, the expression substr(a, b[$1] + 1, 1) returns A which is our first expected output. The condition substr(a, b[$1] + 1, 1) != $2 thus checks if the expected output, from the substr function, differs from the actual output shown in the second field, $2. If it does differ, then the ID value, $1, is printed.
After the substr expression is computed, the trailing ++ in the expression b[$1]++ increments the value of b[$1] by 1 so that the value of b[$1] is ready for the next time that ID $1 is encountered.
Refinement
The above prints an ID number every time an out-of-order line is encountered. If you just want each bad ID printed once, not multiple times, use:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {bad[$1]++} END{for (n in bad) print n}' logfile
I am only on my iPad with no way to test this, but I can give you an idea of how to do it with awk, since no-one else is answering...
Something like this:
awk 'BEGIN { for (i = 0; i < 10000; i++) expected[i] = ord("A") }  # assume at most 10000 ids; each initially expects "A"
     {
       if (expected[$1] != ord($2))
           print "Out of order at line ", NR, $0
       expected[$1] = ord($2) + 1  # next expected message type for this id
     }' yourFile
You will need to paste in an ord() function; the GNU awk manual provides an example implementation you can copy.
Basically, the concept is to initialise an array called expected[] that keeps track of the next message type expected for each id and then, as each line is read, check that it is the next expected value.
Batch only, without awk (the last sort is not mandatory):
sort -k1n YourFile | tee file1 | sort -k2 > file2 && comm -23 file1 file2 | sort

awk: keep only the first and last value of a comma-separated field

Hi, I am trying to keep only the first and last values of a comma-separated field in my data. Here is what my input data looks like:
a 1 y 1,2,4,3,6,2,1
b 2 y 3,56,3,2,1
c 3 n 4,3,2,1,4
I just want to keep the first and last values in the 4th column of my data, so that it looks like this:
a 1 y 1,1
b 2 y 3,1
c 3 n 4,4
Can anyone help me with how to do this? Thank you.
Try this:
awk -F, -vOFS=, '{print $1,$NF}' input.txt
-F, input field separator
-vOFS=, output field separator
$1 the 1st field
$NF the last field
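Because the field separator is a comma, $1 is everything up to the first comma (a 1 y 1) and $NF is the value after the last one, so on the sample input this prints:
$ awk -F, -vOFS=, '{print $1,$NF}' input.txt
a 1 y 1,1
b 2 y 3,1
c 3 n 4,4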
Try this awk command:
awk '{size = split($4,numbers,",")} {print $1" "$2" "$3" "numbers[1]","numbers[size]}'
This splits the fourth field into an array, saves the size as size, prints the first 3 fields, then the first and last elements of the numbers array.
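On the sample input this produces the desired output:
$ awk '{size = split($4,numbers,",")} {print $1" "$2" "$3" "numbers[1]","numbers[size]}' file
a 1 y 1,1
b 2 y 3,1
c 3 n 4,4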
awk -F, '{ printf "%s,%s\n", $1, $NF}' should do the job!
If your other fields can contain commas:
$ awk '{sub(/,.*,/,",",$NF)}1' file
a 1 y 1,1
b 2 y 3,1
c 3 n 4,4
If not:
$ awk '{sub(/,.*,/,",")}1' file
a 1 y 1,1
b 2 y 3,1
c 3 n 4,4