Filter two files using AWK
First of all, thank you for your help. I have 2 files which are
1 10 Tomatoea
2 20 Potatoes
3 30 Apples
4 10 Tomatoes
5 20 Potatoes
And
A 30
B 20
C 10
D 40
E 50
I want to filter both files using AWK so that, if $2 in the first file is equal to $2 in the second file, a new column with the matching letter is added, and the result is written to a new file called combined.txt:
1 10 C Tomatoea
2 20 B Potatoes
3 30 A Apples
4 10 C Tomatoes
5 20 B Potatoes
I have tried this code:
awk 'FNR==NR{a[NR]=$0;next}{$2=a[FNR]}1' letters.txt numbers.txt >> combined.txt
awk 'FNR==NR {m[$2] = $1; next} $2 in m {$2 = m[$2]}1' letters.txt numbers.txt >> combined.txt
The problem is that this code only replaces one column with the other. I want to add the column that matches the condition given above, and I want to put the new column between the columns from the numbers.txt file.
The above are simplifications of my actual files. Below you can see them, in order: file 2, file 1, and combined.txt. As you will appreciate, file 2 has a lot of rows, which is why a single species name appears repeatedly in it.
file 2
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 >lcl|NC_003197.2_prot_NP_463122.1_4111
Salmonella_enterica_subsp_enterica_Paratyphi_B >lcl|NC_010102.1_prot_WP_000389232.1_4169
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV25805.1_4154
Salmonella_enterica_subsp_enterica_Paratyphi_A >lcl|NZ_CP009559.1_prot_WP_000389229.1_110
Salmonella_enterica_subsp_enterica_Typhi >lcl|NZ_CP029897.1_prot_WP_000389235.1_4284
Salmonella_bongori >lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21904.1_1
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21905.1_2
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21906.1_3
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21907.1_4
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21908.1_5
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV26199.1_6
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21909.1_7
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21910.1_8
Salmonella_enterica_subsp_enterica_Infantis >lcl|CP052796.1_prot_QJV21911.1_9
file1
SiiA lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
Combined.txt
SiiA Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Paratyphi_B lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Infantis lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Paratyphi_A lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_enterica_subsp_enterica_Typhi lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA Salmonella_bongori lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
EDIT: Since the OP's samples have changed, adding edited code for the new samples here.
awk '
FNR==NR{
second=$2
arr1[second]=$1
$1=$2=""
sub(/^ +/,"")
arr3[second]=$0
next
}
{
sub(/^>/,"",$2)
}
($2 in arr1){
print arr1[$2],$0,arr3[$2]
}
' file1 file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file1 is being read.
second=$2 ##Creating second which has $2 in it.
arr1[second]=$1 ##Creating arr1 with index of second and value of $1 here.
$1=$2="" ##Nullifying 1st and 2nd fields here.
sub(/^ +/,"") ##Nullifying starting spaces with NULL here.
arr3[second]=$0 ##Creating arr3 with index of second and value of $0.
next ##next will skip all further statements from here.
}
{
sub(/^>/,"",$2) ##Substituting starting > in $2 with NULL.
}
($2 in arr1){ ##Checking condition if $2 is in arr1 then do following.
print arr1[$2],$0,arr3[$2] ##Printing arr1 with $2, current line, arr3 with $2.
}
' file1 file2 ##mentioning Input_file name here.
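As a quick smoke test of the program above, it can be run against two miniature stand-in files; the rows below (file1/file2, ids lcl|A_1 etc.) are made up for illustration and only mimic the shape of the real data:

```shell
# Miniature stand-in inputs (made-up rows, same shape as the real data).
cat > file1 <<'EOF'
SiiA lcl|A_1 100.000 100 SEQ1
SiiA lcl|B_2 99.000 100 SEQ2
EOF

cat > file2 <<'EOF'
SpeciesX >lcl|A_1
SpeciesY >lcl|B_2
SpeciesZ >lcl|C_3
EOF

# Same program as above: remember $1 and the trailing fields of file1 by id,
# then print the species lines from file2 that share that id.
result=$(awk '
FNR==NR{
  second=$2
  arr1[second]=$1
  $1=$2=""
  sub(/^ +/,"")
  arr3[second]=$0
  next
}
{
  sub(/^>/,"",$2)
}
($2 in arr1){
  print arr1[$2],$0,arr3[$2]
}
' file1 file2)

printf '%s\n' "$result"
```

The unmatched id (lcl|C_3) is silently dropped, which matches the behavior seen with the real samples.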
With your shown samples, could you please try following.
awk 'FNR==NR{arr[$2]=$1;next} ($2 in arr){$2=($2 OFS arr[$2])} 1' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be true when file2 is being read.
arr[$2]=$1 ##Creating array arr with index of $2 and value is $1 of current line.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if $2 is in arr then do following.
$2=($2 OFS arr[$2]) ##Re assigning $2 value which has $2 OFS and array arr value in it.
}
1 ##Printing current line here.
' file2 file1 ##Mentioning Input_file names here.
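To see the one-liner work end to end, the simplified samples from the top of the question can be written to letters.txt and numbers.txt and fed through it:

```shell
# Write the simplified samples from the question to disk.
cat > letters.txt <<'EOF'
A 30
B 20
C 10
D 40
E 50
EOF

cat > numbers.txt <<'EOF'
1 10 Tomatoea
2 20 Potatoes
3 30 Apples
4 10 Tomatoes
5 20 Potatoes
EOF

# letters.txt is read first (FNR==NR), building arr[number]=letter;
# for numbers.txt, $2 is extended with the matching letter, and 1 prints the line.
result=$(awk 'FNR==NR{arr[$2]=$1;next} ($2 in arr){$2=($2 OFS arr[$2])} 1' letters.txt numbers.txt)

printf '%s\n' "$result"
```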
To join 2 files, the join command is available. join requires that both files are sorted on the join field, which makes the syntax a bit gnarly:
join -j 2 -t $'\t' -o 1.1,1.2,2.1,1.3 <(sort -k2,2 file1) <(sort -k2,2 file2)
outputs
1 10 C Tomatoea
4 10 C Tomatoes
2 20 B Potatoes
5 20 B Potatoes
3 30 A Apples
As you can see, the output is not the same order as the input. If that's a requirement, use awk.
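If the real files are space-separated like the simplified samples, the tab delimiter is not needed; the following sketch adapts the join approach using plain temporary files instead of process substitution, and a final numeric sort to restore the input order (it assumes the first column is a sortable line number):

```shell
cat > numbers.txt <<'EOF'
1 10 Tomatoea
2 20 Potatoes
3 30 Apples
4 10 Tomatoes
5 20 Potatoes
EOF

cat > letters.txt <<'EOF'
A 30
B 20
C 10
D 40
E 50
EOF

# join needs both inputs sorted on the join field (field 2 of each file).
sort -k2,2 numbers.txt > numbers.sorted
sort -k2,2 letters.txt > letters.sorted

# -j 2: join on field 2 of both files; -o picks id, number, letter, name;
# the final numeric sort restores the original line order.
result=$(join -j 2 -o 1.1,1.2,2.1,1.3 numbers.sorted letters.sorted | sort -k1,1n)

printf '%s\n' "$result"
```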
Related
Count rows and columns for multiple CSV files and make new file
I have multiple large comma separated CSV files in a directory. But, as a toy example: one.csv has 3 rows, 2 columns; two.csv has 4 rows, 5 columns. This is what the files look like -

# one.csv
  a b
1 1 3
2 2 2
3 3 1

# two.csv
  c d e f g
1 4 1 1 4 1
2 3 2 2 3 2
3 2 3 3 2 3
4 1 4 4 1 4

The goal is to make a new .txt or .csv that gives the rows and columns for each:

one 3 2
two 4 5

To get the rows and columns (and dump it into a file) for a single file:

$ awk -F "," '{print NF}' *.csv | sort | uniq -c > dims.txt

But I'm not understanding the syntax to get counts for multiple files. What I've tried:

$ awk '{for (i=1; i<=2; i++) -F "," '{print NF}' *.csv$i | sort | uniq -c}'
With any awk, you could try following awk program.

awk '
FNR==1{
  if(cols && rows){
    print file,rows,cols
  }
  rows=cols=file=""
  file=FILENAME
  sub(/\..*/,"",file)
  cols=NF
  next
}
{
  rows=(FNR-1)
}
END{
  if(cols && rows){
    print file,rows,cols
  }
}
' one.csv two.csv

Explanation: Adding detailed explanation for above solution.

awk '                    ##Starting awk program from here.
FNR==1{                  ##Checking condition if this is first line of each file then do following.
  if(cols && rows){      ##Checking if cols AND rows are NOT NULL then do following.
    print file,rows,cols ##Printing file, rows and cols variables here.
  }
  rows=cols=file=""      ##Nullifying rows, cols and file here.
  file=FILENAME          ##Setting FILENAME value to file here.
  sub(/\..*/,"",file)    ##Removing everything from dot to till end of value in file.
  cols=NF                ##Setting NF values to cols here.
  next                   ##next will skip all further statements from here.
}
{
  rows=(FNR-1)           ##Setting FNR-1 value to rows here.
}
END{                     ##Starting END block of this program from here.
  if(cols && rows){      ##Checking if cols AND rows are NOT NULL then do following.
    print file,rows,cols ##Printing file, rows and cols variables here.
  }
}
' one.csv two.csv        ##Mentioning Input_file names here.
Using gnu awk you can do this in a single awk:

awk -F, 'ENDFILE { print gensub(/\.[^.]+$/, "", "1", FILENAME), FNR-1, NF-1 }' one.csv two.csv > dims.txt

cat dims.txt
one 3 2
two 4 5
You will need to iterate over all CSVs, printing the name and the dimensions for each file:

for i in *.csv; do awk -F "," 'END{print FILENAME, NR, NF}' $i; done > dims.txt

If you want to avoid awk you can also do it with wc -l for lines and grep -o "CSV-separator" | wc -l for fields.
I would harness GNU AWK's ENDFILE for this task as follows. Let the content of one.csv be

1,3
2,2
3,1

and two.csv be

4,1,1,4,1
3,2,2,3,2
2,3,3,2,3
1,4,4,1,4

then

awk 'BEGIN{FS=","}ENDFILE{print FILENAME, FNR, NF}' one.csv two.csv

output

one.csv 3 2
two.csv 4 5

Explanation: ENDFILE is executed after processing every file. I set FS to , assuming that fields are ,-separated and there is no , inside a field. FILENAME, FNR, NF are built-in GNU AWK variables: FNR is the number of the current row in the file, i.e. in ENDFILE the number of the last row; NF is the number of fields (again, of the last row). If you have files with headers use FNR-1; if you have rows prepended with a row number use NF-1.

edit: changed NR to FNR
Without GNU awk you can use the shell plus POSIX awk this way:

for fn in *.csv; do
  cols=$(awk '{print NF; exit}' "$fn")
  rows=$(awk 'END{print NR-1}' "$fn")
  printf "%s %s %s\n" "${fn%.csv}" "$rows" "$cols"
done

Prints:

one 3 2
two 4 5
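The POSIX shell loop above can be checked with small stand-in files. Note the sketch below adds -F',' to the column count, since the toy samples here are written as true comma-separated files with a header row (the file contents are hypothetical):

```shell
# Hypothetical comma-separated files, each with a header row.
printf 'a,b\n1,3\n2,2\n3,1\n' > one.csv
printf 'c,d,e,f,g\n4,1,1,4,1\n3,2,2,3,2\n2,3,3,2,3\n1,4,4,1,4\n' > two.csv

# Same loop as above, with -F',' added because these samples are comma-separated.
result=$(for fn in one.csv two.csv; do
  cols=$(awk -F',' '{print NF; exit}' "$fn")   # fields of the header row
  rows=$(awk 'END{print NR-1}' "$fn")          # line count minus the header
  printf '%s %s %s\n' "${fn%.csv}" "$rows" "$cols"
done)

printf '%s\n' "$result"
```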
Compare and print last column in a file
I have a file

(n34)); 1
Z(n2)); 1
(n52)); 2
(n35)); 3
(n67)); 3
(n19)); 4
(n68)); 4
(n20)); 5
(n36)); 5
(n53)); 5
(n69)); 5
N(n3)); 5
(n54)); 6
(n70)); 7
N(n4)); 7

I want output such that whenever we have the same number after the semicolon, those lines are printed on a single line with ; as the field separator. Output should be

(n34)); 1;Z(n2)); 1
(n52)); 2
(n35)); 3;(n67)); 3
(n19)); 4;(n68)); 4
(n20)); 5;(n36)); 5;(n53)); 5;(n69)); 5;N(n3)); 5
(n54)); 6
(n70)); 7;N(n4)); 7

I tried the code below

awk -F';' 'NR == FNR { count[$2]++;next}

but I am not getting how to print lines on the same line when the same numbers are present.
1st solution: Could you please try following, written and tested with shown samples in GNU awk and considering that your Input_file is sorted by 2nd column.

awk '
BEGIN{ OFS=";" }
prev!=$2{
  if(val){ print val }
  val=""
}
{
  val=(val?val OFS:"")$0
  prev=$2
}
END{
  if(val){ print val }
}
' Input_file

2nd solution: OR in case your 2nd field is not sorted then try following.

sort -nk2 Input_file |
awk '
BEGIN{ OFS=";" }
prev!=$2{
  if(val){ print val }
  val=""
}
{
  val=(val?val OFS:"")$0
  prev=$2
}
END{
  if(val){ print val }
}
'

Explanation of awk code:

awk '                    ##Starting awk program from here.
BEGIN{ OFS=";" }         ##Setting output field separator as semi colon here.
prev!=$2{                ##Checking condition if previous 2nd field is NOT equal to current 2nd field then do following.
  if(val){ print val }   ##If val is set then print value of val here.
  val=""                 ##Nullifying val here.
}
{
  val=(val?val OFS:"")$0 ##Creating val variable and keep adding values to it with OFS in between their values.
  prev=$2                ##Setting current 2nd field to prev to be checked in next line.
}
END{                     ##Starting END block for this program from here.
  if(val){ print val }   ##If val is set then print value of val here.
}
' Input_file             ##Mentioning Input_file name here.
Another awk:

$ awk -F\; '{a[$2]=a[$2] (a[$2]==""?"":";") $0}END{for(i in a)print a[i]}' file

Output:

(n34)); 1;Z(n2)); 1
(n52)); 2
(n35)); 3;(n67)); 3
(n19)); 4;(n68)); 4
(n20)); 5;(n36)); 5;(n53)); 5;(n69)); 5;N(n3)); 5
(n54)); 6
(n70)); 7;N(n4)); 7

Explained:

$ awk -F\; '{                         # set delimiter (probably useless)
    a[$2]=a[$2] (a[$2]==""?"":";") $0 # keep appending where $2s match
}
END {                                 # in the end
    for(i in a)                       # output
        print a[i]
}' file

Edit: for(i in a) will produce order that appears random. If you need to order it, you can pipe the output to:

$ awk '...' | sort -t\; -k2n
Perl to the rescue!

perl -ne '($x, $y) = split;
          $h{$y} .= "$x $y;";
          END { print $h{$_} =~ s/;$/\n/r for sort keys %h }
         ' -- file

It splits each line on whitespace, stores the value in a hash table %h keyed by the second column, and when the file has been read, it prints the remembered values, sorting them by the second column. We always store the semicolon at the end, so we need to replace the final one with a newline in the output.
I would harness GNU AWK array for that task following way. Let file.txt content be:

(n34)); 1
Z(n2)); 1
(n52)); 2
(n35)); 3
(n67)); 3
(n19)); 4
(n68)); 4
(n20)); 5
(n36)); 5
(n53)); 5
(n69)); 5
N(n3)); 5
(n54)); 6
(n70)); 7
N(n4)); 7

then:

awk '{data[$2]=data[$2] ";" $0}END{for(i in data){print substr(data[i],2)}}' file.txt

output is:

(n34)); 1;Z(n2)); 1
(n52)); 2
(n35)); 3;(n67)); 3
(n19)); 4;(n68)); 4
(n20)); 5;(n36)); 5;(n53)); 5;(n69)); 5;N(n3)); 5
(n54)); 6
(n70)); 7;N(n4)); 7

Explanation: I exploit the facts that GNU AWK arrays are lazy and remember insertion order (the latter is not guaranteed in all AWKs). For every line I concatenate the whole line, using ;, onto what is stored under the $2 key in array data. If there is no value stored yet, it is the same as an empty string. This leads to ; appearing at the beginning of every record in data, so I print each record starting at its 2nd character. Keep in mind this solution stores everything in data, so it might not work well for huge files. (tested in gawk 4.2.1)
datamash has a similar function built in:

<infile datamash -W groupby 2 collapse 1

Output:

1 (n34));,Z(n2));
2 (n52));
3 (n35));,(n67));
4 (n19));,(n68));
5 (n20));,(n36));,(n53));,(n69));,N(n3));
6 (n54));
7 (n70));,N(n4));
This might work for you (GNU sed):

sed -E ':a;N;s/( \S+)\n(.*\1)$/\1;\2/;ta;P;D' file

Append the following line to the current line. If both lines end in the same number (string), delete the intervening newline and repeat. Otherwise, print/delete the first line in the pattern space and repeat.
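The grouping approaches above can be checked with a small stand-in file; this sketch uses the sorted-input awk variant, and input.txt is a hypothetical subset of the question's sample data:

```shell
# A hypothetical subset of the sample data, already sorted on column 2.
cat > input.txt <<'EOF'
(n34)); 1
Z(n2)); 1
(n52)); 2
(n35)); 3
(n67)); 3
EOF

# Sorted-input variant: accumulate lines into val while $2 repeats,
# flush the group whenever $2 changes, and once more at END.
result=$(awk '
BEGIN{ OFS=";" }
prev!=$2{
  if(val){ print val }
  val=""
}
{
  val=(val?val OFS:"")$0
  prev=$2
}
END{
  if(val){ print val }
}
' input.txt)

printf '%s\n' "$result"
```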
Understanding two file processing in awk
I am trying to understand how two file processing works, so I created an example here.

file1.txt

zzz pq Fruit Apple 10
zzz rs Fruit Car 50
zzz tu Study Book 60

file2.txt

aa bb Book 100
cc dd Car 200
hj kl XYZ 500
ee ff Apple 300
ff gh ABC 400

I want to compare the 4th column of file1 to the 3rd column of file2; if they match, then print the 3rd, 4th, 5th columns of file1 followed by the 3rd, 4th columns of file2, with the sum of the 5th column of file1 and the 4th column of file2.

Expected Output:

Fruit Apple 10 300 310
Fruit Car 50 200 250
Study Book 60 100 160

Here is what I have tried:

awk ' FNR==NR{ a[$4]=$5;next} ( $3 in a){ print $3, a[$4],$4}' file1.txt file2.txt

Code output:

Book 100
Car 200
Apple 300

I am facing a problem in printing the file1 columns and in how to store the other columns of file1 in array a. Please guide me.
Could you please try following.

awk 'FNR==NR{a[$4]=$3 OFS $4 OFS $5;b[$4]=$NF;next} ($3 in a){print a[$3],$NF,b[$3]+$NF}' file1.txt file2.txt

Output will be as follows.

Study Book 60 100 160
Fruit Car 50 200 250
Fruit Apple 10 300 310

Explanation: Adding explanation for above code now.

awk '                       ##Starting awk program here.
FNR==NR{                    ##Checking condition FNR==NR which will be TRUE when first Input_file named file1.txt is being read.
  a[$4]=$3 OFS $4 OFS $5    ##Creating an array named a whose index is $4 and value is 3rd, 4th and 5th fields along with spaces(By default OFS value will be space for awk).
  b[$4]=$NF                 ##Creating an array named b whose index is $4 and value is $NF(last field of current line).
  next                      ##next keyword will skip all further lines from here.
}
($3 in a){                  ##Checking if 3rd field of current line(from file2.txt) is present in array a then do following.
  print a[$3],$NF,b[$3]+$NF ##Printing array a whose index is $3, last column value of current line and then SUM of array b with index $3 and last column value here.
}
' file1.txt file2.txt       ##Mentioning Input_file names file1.txt and file2.txt
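The two-file program above can be exercised end to end with the sample files exactly as shown in the question:

```shell
cat > file1.txt <<'EOF'
zzz pq Fruit Apple 10
zzz rs Fruit Car 50
zzz tu Study Book 60
EOF

cat > file2.txt <<'EOF'
aa bb Book 100
cc dd Car 200
hj kl XYZ 500
ee ff Apple 300
ff gh ABC 400
EOF

# a[] keeps columns 3-5 of file1 keyed by $4; b[] keeps the number to sum.
# Lines of file2 whose $3 has no match (XYZ, ABC) are silently skipped.
result=$(awk 'FNR==NR{a[$4]=$3 OFS $4 OFS $5;b[$4]=$NF;next} ($3 in a){print a[$3],$NF,b[$3]+$NF}' file1.txt file2.txt)

printf '%s\n' "$result"
```

Note the output order follows file2, not file1, because the printing happens while file2 is being read.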
Match values in two files and replace values in specific columns
The purpose is to check if the values of columns 2 and 3 in file1 match column 1 in file2. If any values match, then replace the values in file2 for columns 2 and 3 using the information in file1 columns 4 and 5.

file1

100,31431,37131,999991.70,2334362.30
100,31431,37471,111113.20,2334363.30
100,31433,36769,777775.60,2334361.90
102,31433,36853,333322.00,2334362.80

file2

3143137113 318512.50 2334387.50 100
3143137131 318737.50 2334387.50 100
3143137201 319612.50 2334387.50 100
3143137219 319837.50 2334387.50 100
3143137471 322987.50 2334387.50 100
3143137491 323237.50 2334387.50 100
3143336687 313187.50 2334412.50 100
3143336723 313637.50 2334412.50 100
3143336769 314212.50 2334412.50 100
3143336825 314912.50 2334412.50 100
3143336853 315262.50 2334412.50 102

Output desired

31431,37113,318512.50,2334387.50,100
31431,37131,999991.70,2334362.30,100
31431,37201,319612.50,2334387.50,100
31431,37219,319837.50,2334387.50,100
31431,37471,111113.20,2334363.30,100
31431,37491,323237.50,2334387.50,100
31433,36687,313187.50,2334412.50,100
31433,36723,313637.50,2334412.50,100
31433,36769,777775.60,2334361.90,100
31433,36825,314912.50,2334412.50,100
31433,36853,333322.00,2334362.80,102

I tried

awk -F[, ] 'FNR==NR{a[$1 $2]=$0;next}$1 in a{print $0 ,a[$1 $2]}' file1 file2

Thanks in advance
Could you please try following.

awk '
BEGIN{ OFS="," }
FNR==NR{
  a[$2 $3]=$2 OFS $3
  b[$2 $3]=$4
  c[$2 $3]=$5
  next
}
($1 in a){
  $2=b[$1]
  $3=c[$1]
  $1=a[$1]
  print
  next
}
{
  $1=$1
  sub(/^...../,"&,",$1)
  print
}
' FS="," file1 FS=" " file2

Output will be as follows.

31431,37113,318512.50,2334387.50,100
31431,37131,999991.70,2334362.30,100
31431,37201,319612.50,2334387.50,100
31431,37219,319837.50,2334387.50,100
31431,37471,111113.20,2334363.30,100
31431,37491,323237.50,2334387.50,100
31433,36687,313187.50,2334412.50,100
31433,36723,313637.50,2334412.50,100
31433,36769,777775.60,2334361.90,100
31433,36825,314912.50,2334412.50,100
31433,36853,333322.00,2334362.80,102
Try this:

$ awk -F, 'NR==FNR{tmp=$0;sub($1 FS,"",tmp);a[$2 $3]=tmp;next} $1 in a{print a[$1],$NF;next} {$1=substr($1,1,5) OFS substr($1,6,5);} 1' OFS=, file1 FS=' ' file2
31431,37113,318512.50,2334387.50,100
31431,37131,999991.70,2334362.30,100
31431,37201,319612.50,2334387.50,100
31431,37219,319837.50,2334387.50,100
31431,37471,111113.20,2334363.30,100
31431,37491,323237.50,2334387.50,100
31433,36687,313187.50,2334412.50,100
31433,36723,313637.50,2334412.50,100
31433,36769,777775.60,2334361.90,100
31433,36825,314912.50,2334412.50,100
31433,36853,333322.00,2334362.80,102

The above assumes $1 of file1 does not include regex characters, so to be accurate and safe, better use this:

awk -F, 'NR==FNR{$1="";a[$2 $3]=substr($0,2);next} $1 in a{print a[$1],$NF;next} {$1=substr($1,1,5) OFS substr($1,6,5);} 1' OFS=, file1 FS=' ' file2

However this one assumes the FS of file1 is 1 character only. And that leads to another change/efficiency improvement:

awk -F, 'NR==FNR{a[$2 $3]=substr($0,length($1 FS)+1);next} $1 in a{print a[$1],$NF;next} {$1=substr($1,1,5) OFS substr($1,6,5);} 1' OFS=, file1 FS=' ' file2
Select current and previous line if values are the same in 2 columns
Check the values in columns 2 and 3; if the values are the same in the previous line and the current line (for example, lines 2-3 and 6-7), then print the two lines separated by ,.

Input file

1 1 2 35 1
2 3 4 50 1
2 3 4 75 1
4 7 7 85 1
5 8 6 100 1
8 6 9 125 1
4 6 9 200 1
5 3 2 156 2

Desired output

2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1

I tried to modify this code, but got no results:

awk '{$6=$2 $3 - $p2 $p3} $6==0{print p0; print} {p0=$0;p2=p2;p3=$3}'

Thanks in advance.
$ awk -v OFS=',' '{$1=$1; cK=$2 FS $3} pK==cK{print p0, $0} {pK=cK; p0=$0}' file
2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1
With your own code and its mechanism updated:

awk '(($2=$2) $3) - (p2 p3)==0{printf "%s", p0; print} {p0=$0;p2=$2;p3=$3}' OFS="," file
2,3,4,50,12,3,4,75,1
8,6,9,125,14,6,9,200,1

But it has an underlying problem, so better use this simplified/improved way:

awk '($2=$2) FS $3==cp{print p0,$0} {p0=$0; cp=$2 FS $3}' OFS=, file

The FS is needed; check the comments under Mr. Morton's answer.

Why your code fails:

Concatenation (what the space does) has higher priority than minus -.
You used $6 to save the value you want to compare, and then it becomes a part of $0, the line (its last column). You can change it to a temporary variable name.
You have a typo (p2=p2), and you used $p2 and $p3, which means to take p2's value and find the corresponding column. So if p2==3 then $p2 equals $3.
You didn't set OFS, so even if your code worked, the output would be separated by spaces.
print adds a trailing newline \n, so even if the above problems didn't exist, you would get 4 lines instead of the 2-line output you wanted.
Could you please try following too.

awk 'prev_2nd==$2 && prev_3rd==$3{$1=$1;print prev_line,$0} {prev_2nd=$2;prev_3rd=$3;$1=$1;prev_line=$0}' OFS=, Input_file

Explanation: Adding explanation for above code now.

awk '
prev_2nd==$2 && prev_3rd==$3{ ##Checking if previous line's prev_2nd and prev_3rd variables have the same values as the current line's 2nd and 3rd fields; if yes then do following.
  $1=$1                       ##Resetting $1 value of current line to $1, because OP needs the output field separator as comma and to apply it we need to reset a field to its own value.
  print prev_line,$0          ##Printing value of previous line and current line here.
}                             ##Closing this condition block here.
{
  prev_2nd=$2                 ##Setting current line $2 to prev_2nd variable here.
  prev_3rd=$3                 ##Setting current line $3 to prev_3rd variable here.
  $1=$1                       ##Resetting value of $1 to $1 to make comma the applied separator.
  prev_line=$0                ##Now setting prev_line to the current line, re-separated with commas.
}
' OFS=, Input_file            ##Setting OFS(output field separator) value as comma here and mentioning Input_file name here.
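The consecutive-pair logic can be verified end to end with the question's sample data (the file name input.txt is a stand-in):

```shell
cat > input.txt <<'EOF'
1 1 2 35 1
2 3 4 50 1
2 3 4 75 1
4 7 7 85 1
5 8 6 100 1
8 6 9 125 1
4 6 9 200 1
5 3 2 156 2
EOF

# cK is the current $2/$3 key and pK the previous one; $1=$1 re-joins the
# line with commas so p0 and $0 are already comma-separated when printed.
result=$(awk -v OFS=',' '{$1=$1; cK=$2 FS $3} pK==cK{print p0, $0} {pK=cK; p0=$0}' input.txt)

printf '%s\n' "$result"
```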