Compare Two Files Using awk Script - awk

I need to validate two files against a condition. Each record in a file is comma-separated.
File1
Prakash,10,20,3(Field Index)
File2
10,25,100
20,25,200
30,20,300
The last field of File1 is a field index. I need to sum the corresponding column of File2 (i.e. 100+200+300).

$ cat > file1
Prakash,10,20,3
$ awk -F, '
NR == FNR {f = $NF; next} # last field of the first file
{ sum += $f } # we're in the 2nd file here since NR != FNR
END { print "sum of field index " f " is " sum }
' file1 file2
sum of field index 3 is 600
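The same idea as a self-contained sketch that recreates the sample files in a temporary directory. The guard on `f` is an addition of mine, not part of the answer above: it assumes the last field of file1 must be a positive integer and stops otherwise.

```shell
# Recreate the sample data from the question in a scratch directory.
dir=$(mktemp -d)
printf 'Prakash,10,20,3\n' > "$dir/file1"
printf '10,25,100\n20,25,200\n30,20,300\n' > "$dir/file2"

result=$(awk -F, '
NR == FNR { f = $NF; next }          # file1: remember the field index
f !~ /^[1-9][0-9]*$/ {               # assumed guard: index must be a positive integer
    print "bad field index: " f > "/dev/stderr"
    bad = 1
    exit 1
}
{ sum += $f }                        # file2: add up column number f
END { if (!bad) print "sum of field index " f " is " sum }
' "$dir/file1" "$dir/file2")

echo "$result"
rm -rf "$dir"
```

With the sample data this prints the same line as the command above.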

Related

Compare two columns of two files, print the row if it matches and print zero in third column

I need to compare column 1 and column 2 of my file1.txt and file2.txt. If both columns match, print the entire row of file1.txt, but where a row in file1.txt is not present in file2.txt, also print that missing row in the output and add "0" as its value in third column.
# file1.txt #
AA ZZ
JB CX
CX YZ
BB XX
SU BY
DA XZ
IB KK
XY IK
TY AB
# file2.txt #
AA ZZ 222
JB CX 345
BB XX 3145
DA XZ 876
IB KK 234
XY IK 897
Expected output
# output.txt #
File1.txt
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0
I tried this code but couldn't figure out how to add rows that did not match and add "0" to it
awk 'BEGIN { while ((getline <"file2.txt") > 0) {REC[$1]=$0}}{print REC[$1]}' < file1.txt > output.txt
With your shown samples, could you please try following.
awk '
FNR==NR{
arr[$1 OFS $2]
next
}
(($1 OFS $2) in arr){
print
arr1[$1 OFS $2]
}
END{
for(i in arr){
if(!(i in arr1)){
print i,0
}
}
}
' file1.txt file2.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking FNR==NR condition which will be TRUE when file1.txt is being read.
arr[$1 OFS $2] ##Creating array with 1st and 2nd field here.
next ##next will skip all further statements from here.
}
(($1 OFS $2) in arr){ ##Checking condition if 1st and 2nd field of file2.txt is present in arr then do following.
print ##Print the current line here.
arr1[$1 OFS $2] ##Creating array arr1 with index of 1st and 2nd fields here.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through arr all elements from here.
if(!(i in arr1)){ ##Checking if an element/key is NOT present in arr1 then do following.
print i,0 ##Printing index and 0 here.
}
}
}
' file1.txt file2.txt ##Mentioning Input_file names here.
You may try this awk:
awk '
FNR == NR {
map[$1,$2] = $3
next
}
{
print $1, $2, (($1,$2) in map ? map[$1,$2] : 0)
}' file2.txt file1.txt
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0
$ awk '
{ key = $1 FS $2 }
NR==FNR { map[key]=$3; next }
{ print $0, map[key]+0 }
' file2.txt file1.txt
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0
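The last command prints 0 for rows missing from file2.txt because referencing an unset array element yields an empty string, and adding 0 coerces it to a number. A minimal sketch of that behavior, using keys taken from the sample data. Note that the lookup itself creates the element, which the `in` test in the previous answer avoids.

```shell
result=$(awk 'BEGIN {
    map["AA ZZ"] = 222
    print map["AA ZZ"] + 0      # stored value
    print map["CX YZ"] + 0      # unset element: "" + 0 is 0
    print ("CX YZ" in map)      # 1, because the lookup above created the key
    print ("SU BY" in map)      # 0, this key was never referenced
}')
echo "$result"
```

For a one-pass lookup like the one above the extra empty elements are harmless. They only matter if the array is later traversed with `for (k in map)`.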

manipulate input file and create a new file using awk

Input
1473697,5342715,256,0.3
1473697,7028427,256,0.1
1473697,5342716,256,0.3
1473697,5342715,257,0.3
1473697,7028427,257,0.1
1473610,7028427,256,0.1
1473610,5342715,256,0.3
1473610,7028422,256,0.1
Output
1473697,256,5342715 0.3 7028427 0.1 5342716 0.3
1473697,257,5342715 0.3 7028427 0.1
1473610,256,7028427 0.1 5342715 0.3 7028422 0.1
OFS and FS are both a comma.
Is there a way to find the unique lines based on columns 1 and 3,
then print each one with the details from columns 2 and 4?
It took a while to figure out what you want, but I think you're looking for:
awk '!a[$1 $3] {a[$1 $3] = $1","$3","}
{a[$1 $3] = a[$1 $3] " " $2 " " $4}
END {for(i in a) print a[i]}' FS=, input-file
or
awk '{a[$1","$3] = a[$1","$3] " " $2 " " $4}
END {for(i in a) print i","a[i]}' FS=, input-file
There are many variations on the theme.

awk to extract tags between two files

The awk below runs as is and produces the current output. I am trying to add a condition that extracts the text or
value after the tags AF=, FR=, HRUN=, LEN=, TYPE= for each line of file1 compared to file2. As is, the lines of
the two files are classified as Match, Missing in file1, or Missing in file2, but I cannot add the condition to extract up to the ; (semicolon).
There may not always be text after a tag, but each tag always ends with a ;. The decimal in $6 should also be shown to 3 significant figures to make it easier to read. It seems
close, but there are a couple of things I am not quite sure how to do. Thank you :).
file1
chr1 43814978 COSM27286 G A 86.92679999999999 PASS
AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43814981 COSM27287 G A 86.83350000000002 PASS
AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS
AF=0,0;AO=0,0;DP=5528;FAO=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp;VARB=0,0;HS;
file2
chr1 43814979 COSM27286 G A 86.92679999999999 PASS
AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43814981 COSM27287 G A 86.83350000000002 PASS
AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS
AF=0,0;AO=0,0;DP=5528;FAO=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp;VARB=0,0;HS;
desired output
Match:
chr1 43814981 COSM27287 G A 86.8 PASS
AF=0;FR=.;HRUN=1;LEN=1;TYPE=snp
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS
AF=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp
Missing in file1:
chr1 43814979 COSM27286 G A 86.9 PASS
AF=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp
Missing in file2:
chr1 43814978 COSM27286 G A 86.9 PASS
AF=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp
awk
awk 'FNR==1 { next }
FNR == NR { file1[$1,$2,$3,$4,$5,$6,$7] = $1 " " $2 " " $3 " " $4 " " $5 " " $6 " "$7 }
FNR != NR { file2[$1,$2,$3,$4,$5,$6,$7] = $1 " " $2 " " $3 " " $4 " " $5 " " $6 " "$7 }
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' file1 file2 > output
Current output
Match:
chr1 43814981 COSM27287 G A 86.83350000000002 PASS
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS
Missing in File1:
chr1 43814979 COSM27286 G A 86.92679999999999 PASS
Missing in File2:
chr1 43814978 COSM27286 G A 86.92679999999999 PASS
try:
awk 'FNR==NR{
a[$1,$2,$7]=$1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7;
next
}
(($1,$2,$7) in a){
val_match=val_match?val_match ORS a[$1,$2,$7]:a[$1,$2,$7];
delete a[$1,$2,$7];
next
}
{
val_mismatch_in_file1=val_mismatch_in_file1?val_mismatch_in_file1 ORS $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7:$1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7;
}
END{
for(i in a){
val_missing_in_file2=val_missing_in_file2?val_missing_in_file2 ORS a[i]:a[i]};
print "Match:" RS val_match RS "Missing in File1:" RS val_mismatch_in_file1 RS "Missing in File2:" RS val_missing_in_file2
}
' Input_file1 Input_file2
Output will be as follows.
Match:
chr1 43814981 COSM27287 G A 86.83350000000002 PASS
chr1 43815008 COSM29008;COSM43212;COSM19193;COSM27289;COSM28487 TGG AAA,AAG,AGG,CGG,GCG 70.3099 PASS
Missing in File1:
chr1 43814979 COSM27286 G A 86.92679999999999 PASS
Missing in File2:
chr1 43814978 COSM27286 G A 86.92679999999999 PASS
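The answer above classifies the lines but does not extract the tags or shorten $6. A hedged sketch of those two pieces follows. It assumes the tag string is field 8 on the same line as the coordinates (the listing in the question may simply be wrapped) and that there is no header line, so the `FNR==1 { next }` from the question is left out. The helper `pick` is a name I made up. Only two sample records per file are used here.

```shell
dir=$(mktemp -d)
# Sample records from the question, assuming the tag string is field 8.
cat > "$dir/file1" <<'EOF'
chr1 43814978 COSM27286 G A 86.92679999999999 PASS AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43814981 COSM27287 G A 86.83350000000002 PASS AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
EOF
cat > "$dir/file2" <<'EOF'
chr1 43814979 COSM27286 G A 86.92679999999999 PASS AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43814981 COSM27287 G A 86.83350000000002 PASS AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
EOF

result=$(awk '
# Hypothetical helper: keep only the wanted TAG=value items of a ";"-separated string.
function pick(info,    parts, n, i, tag, out) {
    n = split(info, parts, ";")
    for (i = 1; i <= n; i++) {
        tag = parts[i]
        sub(/=.*/, "", tag)                          # text before "="
        if (tag ~ /^(AF|FR|HRUN|LEN|TYPE)$/ && parts[i] ~ /=/)
            out = out (out == "" ? "" : ";") parts[i]
    }
    return out
}
{
    # Same comparison key as the question: fields 1 to 7, unrounded.
    key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 SUBSEP $6 SUBSEP $7
    # %.3g prints $6 with 3 significant figures; the tags go on a second line.
    line = sprintf("%s %s %s %s %s %.3g %s\n%s", $1, $2, $3, $4, $5, $6, $7, pick($8))
}
FNR == NR { file1[key] = line; next }
{ file2[key] = line }
END {
    print "Match:";            for (k in file1) if (k in file2)    print file1[k]
    print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
    print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' "$dir/file1" "$dir/file2")

echo "$result"
rm -rf "$dir"
```

Note that `%.3g` would also turn 70.3099 into 70.3, whereas the desired output in the question leaves that value untouched, so the rounding rule may need adjusting.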

awk: printing lines side by side when the first field is the same in the records

I have a file containing lines like
a x1
b x1
q xq
c x1
b x2
c x2
n xn
c x3
I would like to test on the first field in each line, and if there is a match I would like to append the matching lines to the first line. The output should look like
a x1
b x1 b x2
q xq
c x1 c x2 c x3
n xn
any help will be greatly appreciated
Using awk you can do this:
awk '{arr[$1]=arr[$1]?arr[$1] " " $0:$0} END {for (i in arr) print arr[i]}' file
n xn
a x1
b x1 b x2
c x1 c x2 c x3
q xq
To preserve input ordering:
$ awk '
{
if ($1 in vals) {
prev = vals[$1] " "
}
else {
prev = ""
keys[++k] = $1
}
vals[$1] = prev $0
}
END {
for (k=1;k in keys;k++)
print vals[keys[k]]
}
' file
a x1
b x1 b x2
q xq
c x1 c x2 c x3
n xn
What I ended up doing. (The answers by Ed Morton and Jonte are obviously more elegant.)
First I saved the 1st column of the input file in a separate file.
awk '{print $1}' input_file.txt > tmp0
Then saved the input file with every line that repeats an earlier $1 value removed.
awk 'BEGIN { FS = "\t" }; !x[$1]++ { print $0}' input_file.txt > tmp1
Then saved all the lines with duplicate $1 field.
awk 'BEGIN { FS = "\t" }; x[$1]++ { print $0}' input_file.txt >tmp2
Then saved the $1 fields of the non-duplicate file (tmp1).
awk '{ print $1}' tmp1 > tmp3
I used a for loop to pull in lines from the duplicate file (tmp2) and the duplicates removed file (tmp1) into an output file.
for i in $(cat tmp3)
do
if [ $(grep -w $i tmp0 | wc -l) = 1 ] #test for single instance in the 1st col of input file
then
echo "$(grep -w $i tmp1)" >> output.txt #if single then pull that record from no dupes
else
echo -e "$(grep -w $i tmp1) \t $(grep -w $i tmp2 | awk '{
printf $0"\t" }; END { printf "\n" }')" >> output.txt # if not single then pull that record from no_dupes first then all the records from dupes in a single line.
fi
done
Finally remove the tmp files
rm tmp* # remove all the tmp files

How to select lines in which the sum of negative numbers in the line is less than or equal to -3 (with awk)?

I have a sample file like this:
probeset_id submitted_id chr snp_pos alleleA alleleB 562_201 562_202 562_203 562_204 562_205 562_206 562_207 562_208 562_209 562_210
AX-75448119 Chr1_41908741 1 41908741 T C 0 -1 0 -1 0 0 0 0 0 -1
AX-75448118 Chr1_41908545 1 41908545 T A 2 -1 2 2 2 -1 -1 2 2 0
AX-75448118 Chr1_41908545 1 41908545 T A 1 2 -1 2 2 -1 2 -1 2 0
I want to exclude the lines whose negative numbers sum to -3 or less. I know how to calculate the sum of the negative numbers and print it with this code:
awk 'BEGIN{sum=0} NR >=2 {for (i=7;i<=NF;i++) if ($i ~ /^-/) sum += $i; print $1,$2,$3,$4,$5,$6,sum; sum=0}' test.txt > out.txt
But I don't want to do this. I just want to calculate the sum of the negative numbers and then select the lines where it is less than or equal to -3.
These are the commands that I wrote:
awk 'BEGIN{sum=0} NR >=2 {for (i=7;i<=NF;i++) if ($i ~ /^-/) sum += $i; sum=0}' test.txt | awk 'sum <= -3' > out.txt
I get no errors but the out.txt file is empty!
awk 'BEGIN{sum=0} NR >=2 {for (i=7;i<=NF;i++) if ($i ~ /^-/) sum += $i; if sum >= -3 pritn R; sum=0}' test.txt | wc -l
which I get:
^ syntax error
Also, how can I make sure that the first line (the header) is in my output file too?
I would like to have this output:
probeset_id submitted_id chr snp_pos alleleA alleleB 562_201 562_202 562_203 562_204 562_205 562_206 562_207 562_208 562_209 562_210
AX-75448119 Chr1_41908741 1 41908741 T C 0 -1 0 -1 0 0 0 0 0 -1
AX-75448118 Chr1_41908545 1 41908545 T A 2 -1 2 2 2 -1 -1 2 2 0
Try this:
awk '
NR == 1 {
print
next
}
{
negsum=0
for(i=7; i<=NF; i++) {
if ($i<0) {
negsum += $i
}
}
}
negsum <= -3
' test.txt
Your first try fails because you use two different invocations of awk. These are two different programs being run, and the second knows nothing about the sum variable in the first, so it uses the default value sum = 0.
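The second try has a misspelling: you used pritn instead of print. In addition, awk requires parentheses around the if condition, and R is not a defined variable, so a bare print is what you want there.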
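The point about separate awk invocations can be seen in a small sketch with a made-up one-line input. In the pipeline the second awk starts with `sum` unset, which compares as 0, so nothing passes the test. In the single program the pattern sees the value the action just computed.

```shell
# Two separate awk processes: the second one never sees "sum".
piped=$(printf 'a -1 -1 -1\n' | awk '{ sum = $2 + $3 + $4; print }' | awk 'sum <= -3')

# One awk process: the pattern can use the variable the action just set.
single=$(printf 'a -1 -1 -1\n' | awk '{ sum = $2 + $3 + $4 } sum <= -3')

echo "piped:  [$piped]"
echo "single: [$single]"
```

The first result is empty and the second contains the input line.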
What you described is easier to code with proper formatting. (Not that I'd always resort to using an editor when scripting awk...)
The first condition (NR == 1) just ensures we print the first line as is.
awk '
NR == 1 { print }
NR >= 2 {
sum = 0;
for (i=7;i<=NF;i++) {
if ($i < 0)
sum += $i;
}
if (sum <= -3)
print;
}
' test.txt > out.txt