Understanding two-file processing in awk

I am trying to understand how two-file processing works, so I have created an example here.
file1.txt
zzz pq Fruit Apple 10
zzz rs Fruit Car 50
zzz tu Study Book 60
file2.txt
aa bb Book 100
cc dd Car 200
hj kl XYZ 500
ee ff Apple 300
ff gh ABC 400
I want to compare the 4th column of file1 with the 3rd column of file2; if they match, print the 3rd, 4th and 5th columns of file1, followed by the 4th column of file2 and the sum of file1's 5th column and file2's 4th column.
Expected Output:
Fruit Apple 10 300 310
Fruit Car 50 200 250
Study Book 60 100 160
Here is what I have tried:
awk ' FNR==NR{ a[$4]=$5;next} ( $3 in a){ print $3, a[$4],$4}' file1.txt file2.txt
Code output:
Book 100
Car 200
Apple 300
I am facing a problem printing the file1 columns, and I don't know how to store the other columns of file1 in array a. Please guide me.

Could you please try the following.
awk 'FNR==NR{a[$4]=$3 OFS $4 OFS $5;b[$4]=$NF;next} ($3 in a){print a[$3],$NF,b[$3]+$NF}' file1.txt file2.txt
Output will be as follows.
Study Book 60 100 160
Fruit Car 50 200 250
Fruit Apple 10 300 310
Explanation: Adding an explanation for the above code now.
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file named file1.txt is being read.
a[$4]=$3 OFS $4 OFS $5 ##Creating an array named a whose index is $4 and whose value is the 3rd, 4th and 5th fields joined with OFS (by default OFS is a single space in awk).
b[$4]=$NF ##Creating an array named b whose index is $4 and whose value is $NF (the last field of the current line).
next ##The next keyword skips all further statements and moves on to the next line.
}
($3 in a){ ##Checking if the 3rd field of the current line (from file2.txt) is present in array a; if so, do the following.
print a[$3],$NF,b[$3]+$NF ##Printing the value of array a indexed by $3, the last field of the current line, and then the SUM of array b indexed by $3 and that last field.
}
' file1.txt file2.txt ##Mentioning Input_file names file1.txt and file2.txt
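If the FNR==NR idiom itself is new to you: FNR is the record number within the current file and resets for each new input file, while NR keeps counting across all files, so FNR==NR is true only while the first file is being read. A small throwaway sketch (using the same file names as this question):
awk '
FNR==NR{ print "first file :", FNR, NR; next }   ##Runs only for file1.txt, where FNR and NR are equal.
        { print "second file:", FNR, NR }        ##Runs for file2.txt; FNR restarts at 1 but NR keeps growing.
' file1.txt file2.txt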

Related

Join of two files introduces extraneous newline

Update: I figured out the reason for the extraneous newline. I created file1 and file2 on a Windows machine. Windows adds <cr><newline> to the end of each line. So, for example, the first record in file1 is not this:
Bill <tab> 25 <newline>
Instead, it is this:
Bill <tab> 25 <cr><newline>
So when I set a[Bill] to $2 I am actually setting it to $2<cr>.
I used a hex editor and removed all of the <cr> symbols in file1 and file2. Now the AWK program works as desired.
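For reference, the <cr> characters can also be stripped without a hex editor; a minimal sketch that writes cleaned copies (the .unix file names are arbitrary):
awk '{ sub(/\r$/, "") } 1' file1 > file1.unix   ##Delete a trailing carriage return, if any, then print the line.
awk '{ sub(/\r$/, "") } 1' file2 > file2.unix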
I have seen the SO posts on using AWK to do a natural join of two files. I took one of the solutions and am trying to get it to work. Alas, I have been unsuccessful. I am hoping you can tell me what I am doing wrong.
Note: I appreciate other solutions, but what I really want is to understand why my AWK program doesn't work (i.e., why/how an extraneous newline is being introduced).
I want to do a join of these two files:
file1 (name, tab, age):
Bill 25
John 24
Mary 21
file2 (name, tab, marital-status):
Bill divorced
Glenn married
John married
Mary single
When joined, I expect to see this (name, tab, age, tab, marital-status):
Bill 25 divorced
John 24 married
Mary 21 single
Notice that file2 has a person named Glenn, but file1 doesn't. No record in file1 joins to it.
My AWK program almost produces that result. But, for reasons I don't understand, the marital-status value is on the next line:
Bill 25
divorced
John 24
married
Mary 21
single
Here is my AWK program:
awk 'BEGIN { OFS = "\t" }
NR == FNR { a[$1] = ($1 in a? a[$1] OFS : "")$2; next }
$1 in a { $0 = $0 OFS a[$1]; delete a[$1]; print }' file2 file1 > joined_file1_file2
You may try this awk solution:
awk 'BEGIN {FS=OFS="\t"} {sub(/\r$/, "")}
FNR == NR {m[$1]=$2; next} {print $0, m[$1]}' file2 file1
Bill 25 divorced
John 24 married
Mary 21 single
Here:
Using sub(/\r$/, "") to remove any DOS line ending
If $1 doesn't exist in mapping m, then m[$1] is simply an empty string, which keeps the awk processing simple, as the quick sketch below shows.
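A throwaway example of that behaviour (not part of the solution):
awk 'BEGIN{ m["Bill"]="divorced"; print "[" m["Glenn"] "]" }'   ##Prints [] because m["Glenn"] was never assigned.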

Filter two files using AWK

First of all, thank you for your help. I have two files:
1 10 Tomatoea
2 20 Potatoes
3 30 Apples
4 10 Tomatoes
5 20 Potatoes
And
A 30
B 20
C 10
D 40
E 50
I want to filter both files using AWK so that, if $2 in the first file is equal to $2 in the second file, a new column from the matching line is added, and the result is written to a new file called combined.txt:
1 10 C Tomatoea
2 20 B Potatoes
3 30 A Apples
4 10 C Tomatoes
5 20 B Potatoes
I have tried this code:
awk 'FNR==NR{a[NR]=$0;next}{$2=a[FNR]}1' letters.txt numbers.txt >> combined.txt
awk 'FNR==NR {m[$2] = $1; next} $2 in m {$2 = m[$2]}1' letters.txt numbers.txt >> combined.txt
The problem is that the code only replaces one column with the other, whereas I want the new column to be added where the condition above matches. Also, I want to put the new column between the columns of the numbers.txt file.
The above are simplifications of my actual files. Below you can see them, in order: file2, file1 and combined.txt. As you will appreciate, file2 has a lot of rows, which is why only one species name appears in it.
file2
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463122.1_4111
Salmonella_enterica_subsp_enterica_Paratyphi_B >lcl|NC_010102.1_prot_WP_000389232.1_4169
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV25805.1_4154
Salmonella_enterica_subsp_enterica_Paratyphi_A >lcl|NZ_CP009559.1_prot_WP_000389229.1_110
Salmonella_enterica_subsp_enterica_Typhi >lcl|NZ_CP029897.1_prot_WP_000389235.1_4284
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21904.1_1
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21905.1_2
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21906.1_3
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21907.1_4
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21908.1_5
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV26199.1_6
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21909.1_7
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21910.1_8
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21911.1_9
file1
SiiA lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
Combined.txt
SiiA Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Paratyphi_B lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Infantis lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Paratyphi_A lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Typhi lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_bongori lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
EDIT: Since the OP's samples have now changed, adding edited code accordingly here.
awk '
FNR==NR{
second=$2
arr1[second]=$1
$1=$2=""
sub(/^ +/,"")
arr3[second]=$0
next
}
{
sub(/^>/,"",$2)
}
($2 in arr1){
print arr1[$2],$0,arr3[$2]
}
' file1 file2
Explanation: Adding a detailed explanation for the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file1 is being read.
second=$2 ##Creating second which has $2 in it.
arr1[second]=$1 ##Creating arr1 with index of second and value of $1 here.
$1=$2="" ##Nullifying 1st and 2nd fields here.
sub(/^ +/,"") ##Nullifying starting spaces with NULL here.
arr3[second]=$0 ##Creating arr3 with index of second and value of $0.
next ##next will skip all further statements from here.
}
{
sub(/^>/,"",$2) ##Substituting starting > in $2 with NULL.
}
($2 in arr1){ ##Checking condition: if $2 is present in arr1 then do the following.
print arr1[$2],$0,arr3[$2] ##Printing arr1 with $2, current line, arr3 with $2.
}
' file1 file2 ##Mentioning the Input_file names here.
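To produce the Combined.txt shown above, the same program can simply be redirected; condensed to one line (same logic as above):
awk 'FNR==NR{second=$2;arr1[second]=$1;$1=$2="";sub(/^ +/,"");arr3[second]=$0;next} {sub(/^>/,"",$2)} ($2 in arr1){print arr1[$2],$0,arr3[$2]}' file1 file2 > Combined.txt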
With your shown samples, could you please try the following.
awk 'FNR==NR{arr[$2]=$1;next} ($2 in arr){$2=($2 OFS arr[$2])} 1' file2 file1
Explanation: Adding a detailed explanation for the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be true when file2 is being read.
arr[$2]=$1 ##Creating array arr with index of $2 and value is $1 of current line.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if $2 is in arr then do following.
$2=($2 OFS arr[$2]) ##Re-assigning $2 so that it contains the original $2, then OFS, then the value from array arr.
}
1 ##Printing current line here.
' file2 file1 ##Mentioning Input_file names here.
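With the simplified samples saved as letters.txt and numbers.txt (the names used in your own attempts), running it and appending to combined.txt gives exactly the expected result:
$ awk 'FNR==NR{arr[$2]=$1;next} ($2 in arr){$2=($2 OFS arr[$2])} 1' letters.txt numbers.txt >> combined.txt
$ cat combined.txt
1 10 C Tomatoea
2 20 B Potatoes
3 30 A Apples
4 10 C Tomatoes
5 20 B Potatoes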
To join 2 files, the join command is available. join requires that both files are sorted on the join field, which makes the syntax a bit gnarly:
join -j 2 -t $'\t' -o 1.1,1.2,2.1,1.3 <(sort -k2,2 file1) <(sort -k2,2 file2)
outputs
1 10 C Tomatoea
4 10 C Tomatoes
2 20 B Potatoes
5 20 B Potatoes
3 30 A Apples
As you can see, the output is not in the same order as the input. If preserving the input order is a requirement, use awk.

matching columns in separate files and appending matches to a file

I'm trying to merge two files filtered on a single column using awk. What I'd then like to do is append the relevant columns from file2 to file1.
It's easier to explain with a dummy example.
File1
name fruit animal
bob apple dog
jim orange cat
gary mango snake
daisy peach mouse
File 2:
animal number shape
cat eight square
dog nine circle
mouse eleven sphere
Desired output:
name fruit animal shape
bob apple dog circle
jim orange cat square
gary mango snake NA
daisy peach mouse sphere
Step 1: I need to filter on column 3 in file1 and column 1 in file2:
awk -F'\t' 'NR==FNR{c[$3]++;next};c[$1] > 0' file1 file2
This gives me output:
cat eight square
dog nine circle
mouse eleven sphere
This helps me somewhat; however, I can't simply cut the third column (shape) from the output above and append it to file1, since there is no entry for 'snake' in file2. I need to append column 3 of the output to file1 where a match is successful, and put 'NA' where it is not. It's essential that all the lines in file1 are retained, so I can't just omit them. This is where I'm stuck!
I'd appreciate any help please....
Could you please try the following, written and tested in GNU awk with the shown samples.
awk '
BEGIN{
OFS="\t"
}
FNR==NR{
a[$1]=$NF
next
}
{
print $0,($3 in a?a[$3]:"NA")
}' Input_file2 Input_file1
Explanation: Adding a detailed explanation for the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
OFS="\t" ##Setting TAB as output field separator here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file file2 is being read.
a[$1]=$NF ##Creating array a with index $1 and value is $NF for current line.
next ##next will skip all further statements from here.
}
{
print $0,($3 in a?a[$3]:"NA") ##Printing the current line, then checking whether the 3rd field is present in array a; if so, print its value, otherwise print NA.
}' file2 file1 ##Mentioning Input_file names here.
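Run against the shown samples (saved here as file2 and file1), it reproduces the desired output, header row included, since the header's animal key simply maps to shape:
$ awk 'BEGIN{OFS="\t"} FNR==NR{a[$1]=$NF;next} {print $0,($3 in a?a[$3]:"NA")}' file2 file1
name fruit animal shape
bob apple dog circle
jim orange cat square
gary mango snake NA
daisy peach mouse sphere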

shell awk script to remove duplicate lines

I am trying to remove duplicate lines from a file, including the original occurrences, but the following command that I am trying changes the order of the lines, and I want them to stay in the same order as in the input file.
awk '{++a[$0]}END{for(i in a) if (a[i]==1) print i}' test.txt
Input:
123
aaa
456
123
aaa
888
bbb
Output I want:
456
888
bbb
Simpler code if you are okay with reading input file twice:
$ awk 'NR==FNR{a[$0]++; next} a[$0]==1' ip.txt ip.txt
456
888
bbb
With a single pass:
$ awk '{a[NR]=$0; b[$0]++} END{for(i=1;i<=NR;i++) if(b[a[i]]==1) print a[i]}' ip.txt
456
888
bbb
If you want to do this in awk only and are not worried about the order, could you please try the following.
awk '{a[$0]++};END{for(i in a){if(a[i]==1){print i}}}' Input_file
To get the unique values in the same order in which they occur in the Input_file, try the following.
awk '
!a[$0]++{
b[++count]=$0
}
{
c[$0]++
}
END{
for(i=1;i<=count;i++){
if(c[b[i]]==1){
print b[i]
}
}
}
' Input_file
Output will be as follows.
456
888
bbb
Explanation: Adding a detailed explanation for the above code.
awk ' ##Starting awk program from here.
!a[$0]++{ ##This condition is true only the first time the current line is seen; in that case do the following.
b[++count]=$0 ##Creating an array b indexed by count (incremented by 1 each time) whose value is the current line, preserving first-seen order.
}
{
c[$0]++ ##Creating an array c whose index is the current line and whose value is the number of times that line has occurred.
}
END{ ##Starting END block for this awk program here.
for(i=1;i<=count;i++){ ##Starting for loop from here.
if(c[b[i]]==1){ ##Checking whether the line stored in b[i] occurred exactly once (its count in array c is 1); if so, do the following.
print b[i] ##Printing value of array b.
}
}
}
' Input_file ##Mentioning Input_file name here.
awk '{ b[$0]++; a[n++]=$0; }END{ for (i in a){ if(b[a[i]]==1) print a[i] }}' input
Each line's count is kept in array b, and the order of the lines is kept in array a.
If, in the end, a line's count is 1, the line is printed.
Sorry, I misread the question at first and corrected the answer; it is now almost the same as @Sundeep's ...

Select current and previous line if values are the same in 2 columns

Check the values in columns 2 and 3; if the values are the same in the previous line and the current line (for example, lines 2-3 and 6-7), then print the two lines together on one line, comma-separated.
Input file
1 1 2 35 1
2 3 4 50 1
2 3 4 75 1
4 7 7 85 1
5 8 6 100 1
8 6 9 125 1
4 6 9 200 1
5 3 2 156 2
Desired output
2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1
I tried to modify this code, but got no results:
awk '{$6=$2 $3 - $p2 $p3} $6==0{print p0; print} {p0=$0;p2=p2;p3=$3}'
Thanks in advance.
$ awk -v OFS=',' '{$1=$1; cK=$2 FS $3} pK==cK{print p0, $0} {pK=cK; p0=$0}' file
2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1
With your own code and its mechanism updated:
awk '(($2=$2) $3) - (p2 p3)==0{printf "%s", p0; print} {p0=$0;p2=$2;p3=$3}' OFS="," file
2,3,4,50,12,3,4,75,1
8,6,9,125,14,6,9,200,1
But it has an underlying problem, so it is better to use this simplified/improved way:
awk '($2=$2) FS $3==cp{print p0,$0} {p0=$0; cp=$2 FS $3}' OFS=, file
The FS is needed; check the comments under Mr. Morton's answer.
Why your code fails:
String concatenation (what the space does) has lower precedence than the minus operator -, so the expression does not group the way you intended (see the tiny example after this list).
You used $6 to save the value you want to compare, and it then becomes part of $0, the line itself (as a new last column). You can use a temporary variable name instead.
You have a typo (p2=p2), and you used $p2 and $p3, which means taking p2's value and using it as a field number, so if p2==3 then $p2 is $3.
You didn't set OFS, so even if your code worked, the output would be separated by spaces.
print adds a trailing newline \n, so even if the problems above did not exist, you would get 4 lines instead of the 2 lines of output you wanted.
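A tiny example of that precedence pitfall (throwaway values, not from the question):
awk 'BEGIN{ print "x" 1 - 1 }'   ##Prints x0: the subtraction 1 - 1 happens first, and its result is then concatenated with "x".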
Could you please try the following too.
awk 'prev_2nd==$2 && prev_3rd==$3{$1=$1;print prev_line,$0} {prev_2nd=$2;prev_3rd=$3;$1=$1;prev_line=$0}' OFS=, Input_file
Explanation: Adding an explanation for the above code now.
awk '
prev_2nd==$2 && prev_3rd==$3{ ##Checking whether the previous line's prev_2nd and prev_3rd variables have the same values as the current line's 2nd and 3rd fields; if yes, do the following.
$1=$1 ##Re-assigning $1 to itself forces awk to rebuild the record with the comma output field separator.
print prev_line,$0 ##Printing value of previous line and current line here.
} ##Closing this condition block here.
{
prev_2nd=$2 ##Setting current line $2 to prev_2nd variable here.
prev_3rd=$3 ##Setting current line $3 to prev_3rd variable here.
$1=$1 ##Re-assigning $1 to itself so the record is rebuilt with commas.
prev_line=$0 ##Now saving the current (comma-separated) line in prev_line.
}
' OFS=, Input_file ##Setting OFS (output field separator) to a comma and mentioning the Input_file name here.
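Against the shown input this prints the same two comma-separated lines as the earlier one-liner:
2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1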