Printing the compared result under last column header - awk

I'm comparing two files, file1 and file2, by matching the first column of file1 against the first column of file2 and retrieving the corresponding value from the 7th column.
awk -F, 'FNR==NR{a[$1]=$7;next} {print (($1 in a) ? $0","a[$1] : $0",NA");}' file2.txt file1.txt > tmp && mv tmp file1.txt
The comparison will run again the next day and append that day's result as well.
cat file1.txt
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB
cat file2.txt
DF,GH,MH,FR,FG,GH,NA
XX,ZZ,XC,EE,RR,BB,OK
awk -F, 'FNR==NR{a[$1]=$7;next} {print (($1 in a) ? $0","a[$1] : $0",NA");}' file2.txt file1.txt > tmp && mv tmp file1.txt
mv: overwrite `file1.txt'? y
cat file1.txt
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,NA ---> Header
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB,NA
After adding a new row:
DM,DF,GR,TH,EW
The problem is that the header line is also compared and gets a result appended, and for the newly inserted row in file1 the result is printed under header D1 instead of D10.
How can I get output like this, where the comparison excludes the header and the result lands under the last column header?
N1,N2,N3,N4,N5,N6,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10
XX,ZZ,XC,EE,RR,BB,OK,OK,OK,OK,OK,OK,OK,OK,OK,OK
XC,CF,FG,RG,GH,GH,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
DM,DF,GR,TH,EW,BB ,NA

To avoid updating the header, change the awk expression to the following:
'FNR==NR{a[$1]=$7;next} FNR==1{print $0; next} {print (($1 in a) ? $0","a[$1] : $0",NA");}'
In this case the 1st line of file1.txt is printed as is, without any changes.
But don't you also need to add the new day (like "D10" in the example) to the header on each run? Or do you do that elsewhere?
As to the 2nd question (printing the new value at the same position in the string for a shorter line as for a longer one), modify the awk script further:
'FNR==NR{a[$1]=$7;next} FNR==1{print $0; len=length($0); next} {printf "%s", $0; cont=(($1 in a) ? ","a[$1] : ",NA"); for (i=length($0)+1;i<=len-length(cont);i++) printf " " ; print cont;}'
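An alternative to padding with spaces is to pad with empty comma-separated fields, so the new value lands in the same field as the last header column. The following is only a sketch under stated assumptions: the header is assumed to already contain the column for the current run, the sample data is shortened and made up for illustration, and the variable name ncols is hypothetical.

```shell
tmpdir=$(mktemp -d)
# Shortened, made-up sample data (assumption: header already has the new day column)
printf '%s\n' 'N1,N2,N3,D1,D2,D3' 'XX,ZZ,XC,OK,OK' 'DM,DF,GR' > "$tmpdir/file1.txt"
printf '%s\n' 'XX,ZZ,XC,EE,RR,BB,OK' 'DF,GH,MH,FR,FG,GH,NA' > "$tmpdir/file2.txt"

result=$(awk -F, -v OFS=, '
  FNR==NR { a[$1] = $7; next }          # first file read (file2.txt): key -> 7th column
  FNR==1  { ncols = NF; print; next }   # header of file1.txt: remember its width, print unchanged
  {
    val = ($1 in a) ? a[$1] : "NA"
    $ncols = val   # assigning past NF creates the missing fields as empty strings
    print
  }' "$tmpdir/file2.txt" "$tmpdir/file1.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Note that a row which already has as many fields as the header would get its last field overwritten, so this only fits if the header is extended before each run.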

Related

selecting columns in awk discarding corresponding header

How do I properly select columns in awk after some processing? Here is my file:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but expected is (note B is now the third column as we added linenumber as first):
linenumber;B
1;6
2;5
3;2
[fixed and revised]
To get your expected output, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1)
}' file
Output:
linenumber;C
1;9
2;8
3;1
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And if you want only two columns in your output, why do you print three (FNR-1, $1 and $3)? Finally, the reason your output field separators are spaces instead of the expected ; is simply that you did not specify the output field separator (OFS). You can do this with a command-line variable assignment (OFS=\;), as shown in the second and third versions below, but also with the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}), as you wish (there are differences between these 3 methods but they don't matter here).
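As a small illustration of the OFS point, the sketch below runs the same print statement without OFS and then with each of the three ways of setting it. The two-line sample file and the shell variable names are made up for this demonstration.

```shell
demo=$(mktemp)
printf '%s\n' 'A;B;C' '9;6;7' > "$demo"

# No OFS set: the comma in print joins the fields with a single space.
out_default=$(awk -F';' '{print $1,$3}' "$demo")

# Three ways of setting OFS that behave the same for this script.
out_v=$(awk -F';' -v OFS=';' '{print $1,$3}' "$demo")
out_begin=$(awk 'BEGIN {FS=OFS=";"} {print $1,$3}' "$demo")
out_arg=$(awk -F';' '{print $1,$3}' OFS=';' "$demo")

printf '%s\n' "default:" "$out_default" "-v:" "$out_v" "BEGIN:" "$out_begin" "operand:" "$out_arg"
rm -f "$demo"
```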
[EDIT]: see a generic solution at the end.
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3

awk conditional statement based on a value between colon

I was just introduced to awk and I'm trying to retrieve rows from my file based on the value in column 10.
I need to filter the data on the third value obtained when column 10 (the last column) is split using ":" as the separator.
Here is example data from column 10: 0/1:1,9:10:15:337,0,15.
I was able to extract the third value using this command: awk '{print $10}' file.txt | awk -F ":" '/1/ {print $3}'
This returns the value 10, but how can I return whole rows (not just the value in column 10) when this third value is less than or greater than a specific number?
I tried this: awk '{if($10 -F ":" "/1/ ($3<10))" print $0;}' file.txt but it returns a syntax error.
Thanks!
Your code:
awk '{print $10}' file.txt | awk -F ":" '/1/ {print $3}'
should be just 1 awk script:
awk '$10 ~ /1/ { split($10,f,/:/); print f[3] }' file.txt
but I'm not sure that code is doing what you think it does. If you want to print the 3rd value of all $10s that contain :s, as it sounds like from your text, that'd be:
awk 'split($10,f,/:/) > 1 { print f[3] }' file.txt
and to print the rows where that value is less than 7 would be:
awk '(split($10,f,/:/) > 1) && (f[3] < 7)' file.txt
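To make the threshold adjustable, the same idea can be sketched with a value passed in through -v. The three sample rows, the placeholder dots and the variable names max and min are assumptions made up for this illustration; adding 0 to f[3] is a defensive way of forcing a numeric comparison.

```shell
data=$(mktemp)
# Made-up rows with 10 whitespace-separated columns; only column 10 matters here.
printf '%s\n' \
  'r1 . . . . . . . . 0/1:1,9:10:15:337,0,15' \
  'r2 . . . . . . . . 0/1:3,2:5:20:100,0,50' \
  'r3 . . . . . . . . ./.' > "$data"

# Rows whose third ":"-separated value in column 10 is below the threshold.
below=$(awk -v max=7 '(split($10, f, ":") > 1) && (f[3] + 0 < max)' "$data")

# Rows whose third value is above the threshold.
above=$(awk -v min=7 '(split($10, f, ":") > 1) && (f[3] + 0 > min)' "$data")

printf 'below:\n%s\nabove:\n%s\n' "$below" "$above"
rm -f "$data"
```

The row r3 has no colon in column 10, so split returns 1 and the row is skipped by both commands.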

printing multiple NR from one file based on the value from other file using awk

I want to print out multiple rows from one file based on the input values from the other.
Following is the representation of file 1:
2
4
1
Following is the representation of file 2:
MANCHKLGO
kflgklfdg
fhgjpiqog
fkfjdkfdg
fghjshdjs
jgfkgjfdk
ghftrysba
gfkgfdkgj
jfkjfdkgj
Based on the first column of the first file, the code should first print the second row of the second file followed by fourth row and then the first row of the second file. Hence, the output should be following:
kflgklfdg
fkfjdkfdg
MANCHKLGO
This is the code I tried:
awk 'NR==FNR{a[$1];next}FNR in a{print $0}' file1.txt file2.txt
However, as expected, the output is not in the right order: it prints the first row first, then the second, and the fourth row last. How can I print the rows of the second file in exactly the order given in the first file?
Try:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[$1]}' file2 file1
kflgklfdg
fkfjdkfdg
MANCHKLGO
How it works
NR==FNR{a[NR]=$0;next}
This saves the contents of file2 in array a.
print a[$1]
For each number in file1, we print the desired line of file2.
Solution to earlier version of question
$ awk 'NR==FNR{a[NR]=$0;next} {print a[2*$1];print a[2*$1+1]}' file2 file1
fkfjdkfdg
fghjshdjs
gfkgfdkgj
jfkjfdkgj
kflgklfdg
fhgjpiqog
Another take:
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt
This version stores fewer lines in memory, if your files are huge.
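A quick side-by-side run of the two approaches on the sample data from the question, as a sketch only. The temporary directory and the array name want (used here in place of a) are choices made for this illustration.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 2 4 1 > "$tmpdir/file1.txt"
printf '%s\n' MANCHKLGO kflgklfdg fhgjpiqog fkfjdkfdg fghjshdjs > "$tmpdir/file2.txt"

# Approach 1: hold all of file2 in memory, then look lines up in file1 order.
out1=$(awk 'NR==FNR {a[NR] = $0; next} {print a[$1]}' "$tmpdir/file2.txt" "$tmpdir/file1.txt")

# Approach 2: remember the requested order, keep only the requested lines.
out2=$(awk '
  NR==FNR     {want[$1]; order[n++] = $1; next}
  FNR in want {lines[FNR] = $0}
  END         {for (i = 0; i < n; i++) print lines[order[i]]}
' "$tmpdir/file1.txt" "$tmpdir/file2.txt")

printf '%s\n' "$out1"
rm -rf "$tmpdir"
```

Both commands print the same three lines here; they differ only in which file is read first and in how much of file2 is kept in memory.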

How to find the difference between two files using multiple conditions?

I have two files file1.txt and file2.txt like below -
cat file1.txt
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
cat file2.txt
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
The requirement is to fetch the records from file2.txt that are not present in file1.txt, matching on the following columns:
file1.txt                       file2.txt
col1 (date)                     col1 (date)
col2 (number: 4343250019)       col2 (last digit of that number: 9)
col3 (number)                   col3 (number)
col5 (alphanumeric)             col5 (alphanumeric)
Expected Output :
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
This output line is not present in file1.txt but is present in file2.txt, according to the matching criteria.
I was trying below steps to achieve this output -
###Replacing the space/tab from the file1.txt with pipe
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS="|" file1.txt > file1.txt1
### Looping on a combination of four column of file1.txt1 with combination of modified column of file2.txt and output in output.txt
awk 'BEGIN{FS=OFS="|"} {a[$1FS$2FS$3FS$5];next} {(($1 FS substr($2,length($2),1) FS $3 FS $5) in a) print $0}' file2.txt file1.txt1 > output.txt
###And finally, replace the "N" from column 8th and put "NULL" if the value is "N".
awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt > output.txt1
What is the issue?
My 2nd operation is not working, and I would like to combine all 3 operations into one.
awk -F'[|]|[[:blank:]]+' 'FNR==NR{E[$1($2%10)$3$5]++;next}!($1$2$3$5 in E)' file1.txt file2.txt
and your sample output is wrong; it should be (the field that differs is data45333)
2016-07-20-22|9|1003116|001|data45333|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2017-06-22-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Commented code
# field separator covering both files: blanks for the first, `|` for the second
awk -F'[|]|[[:blank:]]+' '
# for first file
FNR==NR{
# create an index entry based on the 4 fields. The format of the fields allows using them directly without a separator (the key stays unique)
E[ $1 ( $2 % 10 ) $3 $5 ]++
# for this line (file) don't go further
next
}
# for next file lines
# if not in the index list of entry, print the line (default action)
! ( ( $1 $2 $3 $5 ) in E ) { print }
' file1.txt file2.txt
Input
$ cat f1
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
$ cat f2
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Output
$ awk 'FNR==NR{a[$1,substr($2,length($2)),$3,$5];next}!(($1,$2,$3,$5) in a)' f1 FS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Explanation
awk ' # call awk.
FNR==NR{ # This is true when awk reads first file
a[$1,substr($2,length($2)),$3,$5] # array a where index being $1(field1), last char from $2, $3 and $5
next # stop processing go to next line
}
!(($1,$2,$3,$5) in a) # here we check index $1,$2,$3,$5 exists in array a by reading file f2
' f1 FS="|" f2 # Read f1 then
# set FS and then read f2
FNR==NR
True when the number of records read so far in the current file equals the number of records read so far across all files, which can only be the case for the first file read.
a[$1,substr($2,length($2)),$3,$5]
Populates array "a", indexed by the first field, the last character of the second field, the third field and the fifth field of the current record of file1.
next
Moves on to the next record, so that no processing intended for records from the second file is done.
!(($1,$2,$3,$5) in a)
If the index built from the fields ($1,$2,$3,$5) of the current record of file2 does not exist in array a, the expression is true (! is the logical NOT operator, which reverses the logical state of its operand), so awk performs its default action, print $0, on the record from file2.
f1 FS="|" f2
Reads file1 (f1), sets the field separator to "|" after reading the first file, and then reads file2 (f2).
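The comma-separated subscripts used above are joined by awk with the built-in SUBSEP string, which keeps keys distinct where plain concatenation could make them collide. A minimal sketch, with array names and string values made up purely for illustration:

```shell
out=$(awk 'BEGIN {
  # Plain concatenation: two different field pairs produce the same key "abc".
  plain["ab" "c"]
  collide = (("a" "bc") in plain) ? "yes" : "no"

  # Comma-separated subscripts are joined with SUBSEP, so the pairs stay distinct.
  multi["ab", "c"]
  distinct = (("a", "bc") in multi) ? "no" : "yes"
  same = (("ab" SUBSEP "c") in multi) ? "yes" : "no"

  print "concatenated keys collide: " collide
  print "SUBSEP keys stay distinct: " distinct
  print "a[x,y] equals a[x SUBSEP y]: " same
}')
printf '%s\n' "$out"
```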
--edit--
When the file is huge, around 60GB (900 million rows), it is not a good idea to process the file twice. The 3rd operation, replacing "N" with "NULL" in column 8 (awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt), can be done in the same pass:
$ awk 'FNR==NR{
a[$1,substr($2,length($2)),$3,$5];
next
}
!(($1,$2,$3,$5) in a){
sub(/N/,"NULL",$8);
print
}' f1 FS="|" OFS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
You can try this awk:
awk -F'[ |]*' 'NR==FNR{su=substr($2,length($2),1); a[$1":"su":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{print $0}' f1 f2
Here,
a[] - an associative array
$1":"su":"$3":"$5 - this forms the key for an array index. su is the last digit of field $2 (su=substr($2,length($2),1)). The value 1 is then assigned to this key.
NR==FNR{...;next} - this block works for processing f1.
Update:
awk 'NR==FNR{$2=substr($2,length($2),1); a[$1":"$2":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{gsub(/^N$/,"NULL",$8);print}' f1 FS="|" OFS='|' f2

awk to compare two file by identifier & output in a specific format

I have 2 large, pipe-delimited files that I need to compare.
file 1
a||d||f||a
1||2||3||4
file 2
a||d||f||a
1||1||3||4
1||2||r||f
Now I want to compare the files and print accordingly: any value updated in file 2 should be printed as updated_value#oldvalue, and any new line added to file 2 should be printed as well.
So the desired output is: (only the updated & new data)
1||1#2||3||4
1||2||r||f
What I have tried so far only gets the separate changed values:
awk -F '[||]+' 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;next}{for(i=1;i<=NF;i++)if(a[FNR,i]!=$i)print $i"#"a[FNR,i]}' file1 file2 >output
But I want to print the whole line. How can I achieve that?
I would say:
awk 'BEGIN{FS=OFS="|"}
FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next}
{for (i=1; i<=NF; i+=2)
if (a[FNR,i] && a[FNR,i]!=$i)
$i=$i"#"a[FNR,i]
}1' f1 f2
This stores file1 in a matrix a[line number, column]. Then it compares each value with the corresponding one in file2.
Note I am using the field separator | instead of || and looping in steps of two to pick the proper data fields. This is because, for example, gawk -F'||' '{print NF}' f1 gave me just 1, meaning that FS wasn't interpreted as intended. I'd be grateful if someone could point out the error here!
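A likely explanation for the FS problem: a field separator longer than one character is treated as a regular expression, and in a regular expression | is the alternation operator, so '||' does not mean two literal pipes. Putting each pipe in a bracket expression makes it literal. The sketch below assumes that fix and also prints only changed or new lines, as the question asks; the flag name changed is made up for this illustration.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'a||d||f||a' '1||2||3||4' > "$tmpdir/f1"
printf '%s\n' 'a||d||f||a' '1||1||3||4' '1||2||r||f' > "$tmpdir/f2"

# With literal pipes the first line splits into 4 fields.
nf=$(awk -F'[|][|]' 'NR==1 {print NF}' "$tmpdir/f1")

result=$(awk 'BEGIN { FS = "[|][|]"; OFS = "||" }
  FNR==NR { for (i = 1; i <= NF; i++) a[FNR,i] = $i; next }
  {
    changed = !((FNR,1) in a)   # line number not seen in the first file: new line
    for (i = 1; i <= NF; i++)
      if (((FNR,i) in a) && a[FNR,i] != $i) { $i = $i "#" a[FNR,i]; changed = 1 }
    if (changed) print
  }' "$tmpdir/f1" "$tmpdir/f2")

printf 'NF=%s\n%s\n' "$nf" "$result"
rm -rf "$tmpdir"
```

Using (FNR,i) in a for the existence test, instead of testing the stored value, also keeps fields holding 0 or an empty string from being skipped.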
Test
$ awk 'BEGIN{FS=OFS="|"} FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next} {for (i=1; i<=NF; i+=2) if (a[FNR,i] && a[FNR,i]!=$i) $i=$i"#"a[FNR,i]}1' f1 f2
a||d||f||b#a
1||1#2||3||4
1||2||r||f