Merge files where some columns matched


Match columns 1, 2 and 3 in both files.
Where they are equal, append the value of column 4 from file1 to the matching line of file2.
If there is no match, append NA.
file1
31431 37150 100 10100
31431 37201 100 12100
31431 37471 100 14100
file2
31431 37150 100 14100
31431 37131 100 14100
31431 37201 100 14100
31431 37478 100 14100
31431 37471 100 14100
Desired output:
31431 37150 100 14100 10100
31431 37131 100 14100 NA
31431 37201 100 14100 12100
31431 37478 100 14100 NA
31431 37471 100 14100 14100
I tried
awk '
FNR==NR{
a[$1 $2 $3]=$4
next
}
($1 in a){
$1=a[$1]
found=1
}
{
$0=found==1?$0",":$0",NA"
sub(/^...../,"&,")
$1=$1
found=""
}
1
' FS=" " file1 FS=" " OFS="," file2

$ awk ' {k=$1 FS $2 FS $3}
NR==FNR {a[k]=$4; next}
{$(NF+1)=k in a?a[k]:"NA"}1' file1 file2
31431 37150 100 14100 10100
31431 37131 100 14100 NA
31431 37201 100 14100 12100
31431 37478 100 14100 NA
31431 37471 100 14100 14100

Could you please try the following.
awk 'FNR==NR{a[$1,$2,$3]=$NF;next} {print $0,($1,$2,$3) in a?a[$1,$2,$3]:"NA"}' Input_file1 Input_file2
Or, creating a variable for the key fields, as per Ed's comment:
awk '{var=$1 OFS $2 OFS $3} FNR==NR{a[var]=$NF;next} {print $0,var in a?a[var]:"NA"}' Input_file1 Input_file2
Output will be as follows.
31431 37150 100 14100 10100
31431 37131 100 14100 NA
31431 37201 100 14100 12100
31431 37478 100 14100 NA
31431 37471 100 14100 14100
Explanation of the above code:
awk '
{
var=$1 OFS $2 OFS $3 ##Create a variable named var holding the first, second and third fields of the current line; this runs for both Input_file1 and Input_file2.
}
FNR==NR{ ##FNR==NR is TRUE only while the first file, Input_file1, is being read.
a[var]=$NF ##Store the last field of the current line in array a, indexed by var.
next ##next skips the remaining rules for this line and moves on to the next line.
}
{
print $0,var in a?a[var]:"NA" ##Print the current line followed by a[var] if var is present in array a, else by NA.
}' Input_file1 Input_file2 ##Mention the Input_file names here.
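One detail about the key is worth spelling out. The attempt in the question builds it as $1 $2 $3 with no separator, so different field triples can collapse into the same string. The answers avoid this by joining the fields with FS or OFS, or by writing a[$1,$2,$3], which joins them with SUBSEP. A minimal sketch, using made-up values rather than the question's data:

```shell
# Hypothetical values chosen so the three fields concatenate to the same string.
# Without a separator, "1 23 4" and "12 3 4" both produce the key "1234".
plain=$(printf '1 23 4\n12 3 4\n' |
    awk '{ seen[$1 $2 $3]++ } END { n = 0; for (k in seen) n++; print n }')

# With seen[$1,$2,$3] awk joins the parts with SUBSEP, so the keys stay distinct.
joined=$(printf '1 23 4\n12 3 4\n' |
    awk '{ seen[$1,$2,$3]++ } END { n = 0; for (k in seen) n++; print n }')

echo "distinct keys without separator: $plain"
echo "distinct keys with SUBSEP:       $joined"
```

The first count is 1 and the second is 2, which is why a separator belongs in any multi-field key.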

Related

For each different occurrence in field, print lines with max value associated

I have
ID=exon-XM_030285750.2 LOC100221041 7895
ID=exon-XM_030285760.2 LOC100221041 8757
ID=exon-XM_030285720.2 LOC100221041 8656
ID=exon-XM_030285738.2 LOC100221041 8183
ID=exon-XM_030285728.2 LOC100221041 8402
ID=exon-XM_030285733.2 LOC100221041 7398
ID=exon-XM_030285715.2 LOC100221041 8780
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_030285774.2 CMSS1 1440
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_012570104.3 FILIP1L 6371
ID=exon-XM_030285654.2 FILIP1L 6456
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XM_032751000.1 FILIP1L 5886
ID=exon-XM_030285671.2 FILIP1L 5622
ID=exon-XM_030285682.2 FILIP1L 5395
ID=exon-XR_004369230.1 LOC116808959 2289
I want to print, for each distinct value in $2, the line that has the highest value in $3:
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
I tried this
awk -f avg.sh test | awk 'BEGIN {OFS = "\t"} arr[$2]==0 {arr[$2]=$3} ($3 > arr[$2]) {arr[$2]=$3} END{for (i in arr) {print i, arr[i]}}'
from here
how to conditionally filter rows in awk
but I would like to also keep $1 in the output and keep the same ordering as in the input.
The answer to this
Computing averages of chunks of a column
shows how to build an array that keeps the original ordering, but I'm failing to put the two together.

Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
!arr1[$2]++{
found[++count]=$2
}
{
arr[$2]=(arr[$2]>$3?arr[$2]:$3)
val[$2 OFS $3]=$1
}
END{
for(i=1;i<=count;i++){
print val[found[i] OFS arr[found[i]]],found[i],arr[found[i]]
}
}' Input_file
Output will be as follows.
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
To get the output in TAB-separated form, try the following.
awk -v OFS="\t" '
!arr1[$2]++{
found[++count]=$2
}
{
arr[$2]=(arr[$2]>$3?arr[$2]:$3)
val[$2 OFS $3]=$1
}
END{
for(i=1;i<=count;i++){
print val[found[i] OFS arr[found[i]]],found[i],arr[found[i]]
}
}' Input_file |
column -t -s $'\t'

You may use this awk:
awk '!($2 in max) || $3 > max[$2] {
if(!($2 in max))
ord[++n] = $2
max[$2] = $3
rec[$2] = $0
}
END {
for (i=1; i<=n; ++i)
print rec[ord[i]]
}' file | column -t
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289

You can do this with sort and awk, if preserving the input order is optional.
$ sort -k2,2 -k3,3nr madza.txt | awk ' $2!=p2 { if(NR>1) print p; p=$0;p2=$2 } END { print p }'
ID=exon-XR_004369230.1 LOC116808959 2289
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
To keep the ordering, you can add sequence numbers first and remove them at the end.
$ awk ' { $(NF+1)=NR}1 ' madza.txt | sort -k2,2 -k3,3nr | awk ' $2!=p2 { if(NR>1) print p; p=$0;p2=$2 } END { print p }' | sort -k4 -n | awk ' {NF=NF-1}1 '
ID=exon-XM_030285707.2 LOC100221041 8963
ID=exon-XM_030285694.2 DCBLD2 5838
ID=exon-XM_012570107.3 CMSS1 1502
ID=exon-XM_030285647.2 FILIP1L 6488
ID=exon-XR_004369230.1 LOC116808959 2289
$

AWK failing to sum floats

I am trying to sum the last 12 values in a field in a particular csv file, but AWK is failing to correctly sum the values. If I output the data to a new file then run the same AWK statement against the new file it works.
Here are the contents of the original file. The fields are separated by ";"
I want to sum the values in the 3rd field
$ tail -12 OriginalFile.csv
02/02/2020 10:30:00;50727.421;0.264;55772.084;0.360;57110.502;0.384
02/02/2020 10:35:00;50727.455;0.408;55772.126;0.504;57110.548;0.552
02/02/2020 10:40:00;50727.489;0.408;55772.168;0.504;57110.593;0.540
02/02/2020 10:45:00;50727.506;0.204;55772.193;0.300;57110.621;0.336
02/02/2020 10:50:00;50727.541;0.420;55772.236;0.516;57110.667;0.552
02/02/2020 10:55:00;50727.566;0.300;55772.269;0.396;57110.703;0.432
02/02/2020 11:00:00;50727.590;0.288;55772.300;0.372;57110.737;0.408
02/02/2020 11:05:00;50727.605;0.180;55772.321;0.252;57110.762;0.300
02/02/2020 11:10:00;50727.621;0.192;55772.344;0.276;57110.786;0.288
02/02/2020 11:15:00;50727.659;0.456;55772.389;0.540;57110.835;0.588
02/02/2020 11:20:00;50727.681;0.264;55772.417;0.336;57110.866;0.372
02/02/2020 11:25:00;50727.704;0.276;55772.448;0.372;57110.900;0.408
I used the following code to print the original value and the summed value of field 3 for each record, but it just returns the same output for the summed value for each line
$ awk 'BEGIN { FS = ";" } ; { sum += $3 } { print $3, sum }' OriginalFile.csv | tail -12
0.264 2.00198e+09
0.408 2.00198e+09
0.408 2.00198e+09
0.204 2.00198e+09
0.420 2.00198e+09
0.300 2.00198e+09
0.288 2.00198e+09
0.180 2.00198e+09
0.192 2.00198e+09
0.456 2.00198e+09
0.264 2.00198e+09
0.276 2.00198e+09
If I output the contents of the file into a different file, the same code works as expected
$ tail -12 OriginalFile.csv > testfile2.csv
$ awk 'BEGIN { FS = ";" } ; { sum += $3 } { print $3, sum }' testfile2.csv
0.264 0.264
0.408 0.672
0.408 1.08
0.204 1.284
0.420 1.704
0.300 2.004
0.288 2.292
0.180 2.472
0.192 2.664
0.456 3.12
0.264 3.384
0.276 3.66
How can I get the correct output from the original file without having to create a new file?

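As @Shawn's excellent comment points out, the order in which you pipe your data is the problem. Your awk command sums field 3 over the whole file, and only afterwards does tail keep the last 12 lines, so by the time awk reaches the 12th line from the end, sum is already about 2.00198e+09. By default print shows non-integer numbers with six significant digits (OFMT is "%.6g"), so adding small fractions to a total that large does not change what is displayed, and it looks like "the same output".
Simply run tail first: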
tail -12 OriginalFile.csv | awk 'BEGIN { FS = ";" } ; { sum += $3 } { print $3, sum }'
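To see that the additions are only hidden by the default output format and not lost, here is a small sketch. The starting total of 2001980000 is an assumption picked to match the 2.00198e+09 shown in the question; the real total in OriginalFile.csv is not known.

```shell
# Assumed starting value, standing in for the total accumulated over the earlier lines.
# print formats non-integer numbers with OFMT (default "%.6g"): six significant digits.
default_fmt=$(awk 'BEGIN { sum = 2001980000; sum += 0.264; print sum }')

# printf with an explicit format shows that the addition did take place.
explicit_fmt=$(awk 'BEGIN { sum = 2001980000; sum += 0.264; printf "%.3f\n", sum }')

echo "print:  $default_fmt"
echo "printf: $explicit_fmt"
```

The print line shows 2.00198e+09 and the printf line shows 2001980000.264, so the sum itself is fine; it just includes every line of the file instead of only the last 12.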

AWK Print fields with new values if match, else print line

I have two files - FileA and FileB. FileA will be changed. FileB contains the new values. FileB has 3 fields. The first two fields will be compared with FileA's first two fields. If the fields match, Field3 should be changed. The code below is working in this manner: "If the two values match, change field3 and print the line. If there is no match, next." The behavior I want is, "If there is no match, print the line unchanged." The "else" part of the code is not working and I've tried so many variations.
awk -F'\t' -v OFS='\t' '
# first, read in data from file B
NR == FNR { values[$1 FS $2] = $3; next }
# then, output modified lines from matching lines in file A
($1 FS $2) in values { $3 = values[$1 FS $2]; print } else { print $0 }
' fileB fileA
FileA
PROVDSRJ02.RD.RI ae0.0 16
PROVDSRJ02.RD.RI ae1.1 1000
PROVDSRJ02.RD.RI ae2.0 5000
PROVDSRJ02.RD.RI ae3.0 5000
ASHBBBRJ01.RD.AS ae39.0 16
ASHBBPRJ01.RD.AS ae2.0 16
ASHBBPRJ02.RD.AS ae1.0 16
ASHBBPRJ02.RD.AS ae2.0 16
ASHBBBRJ01.RD.AS ae0.0 16
ASHBBBRJ01.RD.AS ae11.0 16
FileB
ASHBBBRJ01.RD.AS ae10.0 524
ASHBBBRJ01.RD.AS ae11.0 235
ASHBBBRJ01.RD.AS ae39.0 2096
ASHBBBRJ01.RD.AS ae6.0 183
ASHBBBRJ01.RD.AS ae7.0 1141
ASHBBBRJ02.RD.AS ae11.0 88
ASHBBBRJ02.RD.AS ae13.0 333
ASHBBBRJ02.RD.AS ae20.0 374
ASHBBBRJ02.RD.AS ae9.0 1885
Desired Output (** indicate changed lines and should not be included in code)
PROVDSRJ02.RD.RI ae0.0 16
PROVDSRJ02.RD.RI ae1.1 1000
PROVDSRJ02.RD.RI ae2.0 5000
PROVDSRJ02.RD.RI ae3.0 5000
**ASHBBBRJ01.RD.AS ae39.0 2096**
ASHBBPRJ01.RD.AS ae2.0 16
ASHBBPRJ02.RD.AS ae1.0 16
ASHBBPRJ02.RD.AS ae2.0 16
ASHBBBRJ01.RD.AS ae0.0 16
**ASHBBBRJ01.RD.AS ae11.0 235**

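Your syntax is off: in awk, else is only valid inside an action block, paired with an if statement; it cannot follow a pattern { action } rule. Check the tag info for some learning resources.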
In any case, you don't need an else as such. You can conditionally set $3 to the new value (as you already are doing), and then always print the line (which may have been modified or not).
Here we use the shortcut 1 to always print the line. 1 is an always-true pattern that invokes the default action, which is to print the current line. If that doesn't make sense now, it will soon.
$ awk 'BEGIN {FS=OFS="\t"}
NR == FNR {values[$1 FS $2] = $3; next}
($1 FS $2) in values {$3 = values[$1 FS $2]}1' fileB fileA
PROVDSRJ02.RD.RI ae0.0 16
PROVDSRJ02.RD.RI ae1.1 1000
PROVDSRJ02.RD.RI ae2.0 5000
PROVDSRJ02.RD.RI ae3.0 5000
ASHBBBRJ01.RD.AS ae39.0 2096
ASHBBPRJ01.RD.AS ae2.0 16
ASHBBPRJ02.RD.AS ae1.0 16
ASHBBPRJ02.RD.AS ae2.0 16
ASHBBBRJ01.RD.AS ae0.0 16
ASHBBBRJ01.RD.AS ae11.0 235
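If you would rather keep an explicit else, it has to live inside an action, attached to an if. A minimal sketch of that form follows; the host names and the temporary files are made up for illustration and only mimic the question's tab-separated layout.

```shell
# Hypothetical two-line samples in the question's tab-separated layout.
tmp=$(mktemp -d)
printf 'hostA\tae0.0\t16\nhostB\tae1.0\t16\n' > "$tmp/fileA"
printf 'hostB\tae1.0\t235\n' > "$tmp/fileB"

result=$(awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { values[$1 FS $2] = $3; next }   # first file: remember the new values
{
    if (($1 FS $2) in values) {             # else is legal here, inside an action
        $3 = values[$1 FS $2]
        print
    } else {
        print $0
    }
}' "$tmp/fileB" "$tmp/fileA")

printf '%s\n' "$result"
rm -r "$tmp"
```

Both branches end in a print, which is why the shorter version with the trailing 1 does the same job.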

awk to match field between two files and use conditions on match

I am trying to look for $2 of file1 (skipping the header) in $2 of file2, and if they match and the value in $10 is > 30 and $11 is > 49, then print the line to an output file. The awk below has syntax errors in it, though shellcheck didn't report any. Both the input and output are tab-delimited. I think the below is close, but I'm not sure what is wrong. Thank you :).
awk
awk -F'\t' -v OFS='\t' 'NR==FNR{A[$2];next}$2 in A
{if($10 >.5 OFS $11 > 49)
print ; next
' file1 file2
awk: cmd. line:2: {if($10 >.5 OFS $11 > 49)
awk: cmd. line:2: ^ syntax error
awk: cmd. line:3: print ; next
awk: cmd. line:3: ^ unexpected newline or end of string
file1
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
file2
chr2 166245425 SCN2A AMPL5155065355 SNP Het C/T C T 54 100 50 23 27
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
desired output
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
edit with new awk
awk -F'\t' -v OFS='\t' 'NR==FNR{A[$2];next}$2 in A {
if($10 >.5 OFS $11 > 49) >>> if($10 >.5 && $11 > 49)
print }
' file1 file2 > out
awk: cmd. line:2: if($10 >.5 OFS $11 > 49) >>> if($10 >.5 && $11 > 49)
awk: cmd. line:2: ^ syntax error

Here you go. The conditions are combined with &&, and a pattern with no action prints the matching line:
$ awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$2]; next}
($2 in a) && $10>30 && $11>49 ' file1 file2
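A quick way to check the behaviour is to run the same rules on a few throwaway lines. This is only a sketch: the file contents below are invented (placeholder columns c3 to c9) and just keep the key in $2 and the two thresholds in $10 and $11.

```shell
tmp=$(mktemp -d)
# Hypothetical lookup file: a header line, then one record whose 2nd field is the key.
printf 'header line\n2\t166848646\tG\tA\n' > "$tmp/file1"
# Hypothetical 11-column data; only the second line has a known key, $10 > 30 and $11 > 49.
printf 'chr2\t166245425\tc3\tc4\tc5\tc6\tc7\tc8\tc9\t54\t100\n'   >  "$tmp/file2"
printf 'chr2\t166848646\tc3\tc4\tc5\tc6\tc7\tc8\tc9\t52.9\t100\n' >> "$tmp/file2"
printf 'chr2\t166848646\tc3\tc4\tc5\tc6\tc7\tc8\tc9\t12\t100\n'   >> "$tmp/file2"

matched=$(awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { a[$2]; next }               # file1: store each 2nd field as a key
($2 in a) && $10 > 30 && $11 > 49       # file2: no action, so matching lines are printed
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$matched"
rm -r "$tmp"
```

Only the line with key 166848646 and 52.9 in $10 is printed; the first line fails the key lookup and the third fails the $10 test.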

Copy lines from one file and paste it to other n times

I have two files such as the following:
file1
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
END
file2
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
Is it possible to copy the t=10 and t=20 lines, which sit just above each HELLO in file1, and paste them at the same positions in file2, making it look like this:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
Of course, my real files are not this small; imagine that I need to do this over 100000 times in a file.
With the help of other members of the community I created this script, but it doesn't give the right result:
for frame in $(seq 1 1 2)
do
add=$(awk '/t=/{i++}i=='$frame' {print; exit}' $file1)
awk -v var="$add" 'NR>1 && NR%9==0 {print var} {print $0}' $file2
done
If anyone can help me, I would appreciate it.
Thanks in advance.

You can try the following awk script. While reading file1, it saves the line that precedes each HELLO in an indexed array; then, each time it meets a HELLO line in file2, it prints the next saved entry just before it:
awk '
NR == 1 { prev_line = $0 }
FNR == NR {
if ( $1 == "HELLO" ) {
hash[ i++ ] = prev_line
}
prev_line = $0
next
}
$1 == "HELLO" {
printf "%s\n", hash[ j++ ]
}
{ print }
' file1 file2
It yields:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END

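awk 'BEGIN{FS="\n";RS="END\n"}   # one END-terminated block per record, one line per field (multi-character RS needs gawk or mawk)
NR==FNR{for(i=2;i<=NF;i++) a[$1]=a[$1]==""?$i:a[$1] FS $i;next}   # file1: key is the "t=..." line, value is the lines that follow it
{for (i in a) {if ($0~a[i]) printf "%s%s%s%s", i, ORS, $0, RS}   # file2: print the key before the block that contains its lines
}' file1 file2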
Result:
t=10
HELLO
AAAAAA
BBBBBB
CCCCCC
DDDDDD
111111
222222
333333
END
t=20
HELLO
EEEEEE
FFFFFF
GGGGGG
HHHHHH
444444
555555
666666
END
