Manipulate an input file and create a new file using awk

Input
1473697,5342715,256,0.3
1473697,7028427,256,0.1
1473697,5342716,256,0.3
1473697,5342715,257,0.3
1473697,7028427,257,0.1
1473610,7028427,256,0.1
1473610,5342715,256,0.3
1473610,7028422,256,0.1
Output
1473697,256,5342715 0.3 7028427 0.1 5342716 0.3
1473697,257,5342715 0.3 7028427 0.1
1473610,256,7028427 0.1 5342715 0.3 7028422 0.1
Both FS and OFS are a comma (,).
Is there a way to find the unique combinations of columns 1 and 3,
and then print each combination followed by the details from columns 2 and 4?

It took a while to figure out what you want, but I think you're looking for:
awk '!(($1,$3) in a) {a[$1,$3] = $1 FS $3 FS $2 " " $4; next}
{a[$1,$3] = a[$1,$3] " " $2 " " $4}
END {for (i in a) print a[i]}' FS=, input-file
or
awk '{k = $1 FS $3; a[k] = (k in a ? a[k] " " : "") $2 " " $4}
END {for (k in a) print k FS a[k]}' FS=, input-file
There are many variations on the theme.

syntax error in awk 'if-else if-else' with multiple operations

There's a weird thing with awk conditional statements:
when running awk 'if-else if-else' with a single operation after each condition, it works fine as below:
awk 'BEGIN {a=30; \
if (a==10) print "a = 10"; \
else if (a == 20) print "a = 20"; \
else print "a = 30"}'
output:
a = 30
However, when running awk 'if-else if-else' with multiple operations (properly braced) after 'else if', a syntax error occurred:
awk 'BEGIN {a=30; \
if (a==10) print "a = 10"; \
else if (a == 20) {print "a = 20"; print "b = 20"}; \
else print "a = 30"}'
output:
awk: cmd. line:4: else print "a = 30"}
awk: cmd. line:4: ^ syntax error
Can anyone tell me whether awk intrinsically disallows multiple operations in such cases, or whether this is just a syntax error on my part that can be corrected?
P.S. I looked through the relevant posts on awk 'if else' syntax errors, but none of them addresses this issue.
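Remove the semicolon after the closing brace at the end of the third line. After a braced block, that semicolon is an empty statement that ends the whole if statement, so the else on the next line has no if to attach to. After a simple statement such as print, the semicolon is only that statement's terminator, which is why the first version works.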
awk 'BEGIN {a=30; \
if (a==10) print "a = 10"; \
else if (a == 20) {print "a = 20"; print "b = 20"} \
else print "a = 30"}'
Output: a = 30
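The same program can also be laid out with plain newlines and braces around every branch, which makes a stray semicolon harder to introduce. A sketch (backslash continuations are not needed inside a single-quoted awk script; the out variable is only there to capture the result):

```shell
out=$(awk 'BEGIN {
    a = 30
    if (a == 10) {
        print "a = 10"
    } else if (a == 20) {
        # several statements in one branch, one per line, no trailing semicolons
        print "a = 20"
        print "b = 20"
    } else {
        print "a = 30"
    }
}')
printf '%s\n' "$out"
```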

Compare two columns of two files, print the row if it matches and print zero in third column

I need to compare columns 1 and 2 of file1.txt and file2.txt. If both columns match, print the row together with its third column from file2.txt; where a row in file1.txt is not present in file2.txt, print that row too, with "0" as its third column.
# file1.txt #
AA ZZ
JB CX
CX YZ
BB XX
SU BY
DA XZ
IB KK
XY IK
TY AB
# file2.txt #
AA ZZ 222
JB CX 345
BB XX 3145
DA XZ 876
IB KK 234
XY IK 897
Expected output
# output.txt #
File1.txt
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0
I tried this code, but couldn't figure out how to print the rows that did not match with "0" appended:
awk 'BEGIN { while ((getline <"file2.txt") > 0) {REC[$1]=$0}}{print REC[$1]}' < file1.txt > output.txt
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$1 OFS $2]
next
}
(($1 OFS $2) in arr){
print
arr1[$1 OFS $2]
}
END{
for(i in arr){
if(!(i in arr1)){
print i,0
}
}
}
' file1.txt file2.txt
Explanation: a detailed, commented version of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking FNR==NR condition which will be TRUE when file1.txt is being read.
arr[$1 OFS $2] ##Creating array with 1st and 2nd field here.
next ##next will skip all further statements from here.
}
(($1 OFS $2) in arr){ ##Checking condition if 1st and 2nd field of file2.txt is present in arr then do following.
print ##Print the current line here.
arr1[$1 OFS $2] ##Creating array arr1 with index of 1st and 2nd fields here.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through arr all elements from here.
if(!(i in arr1)){ ##Checking if an element/key is NOT present in arr1 then do following.
print i,0 ##Printing index and 0 here.
}
}
}
' file1.txt file2.txt ##Mentioning Input_file names here.
You may try this awk:
awk '
FNR == NR {
map[$1,$2] = $3
next
}
{
print $1, $2, (($1,$2) in map ? map[$1,$2] : 0)
}' file2.txt file1.txt
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0
$ awk '
{ key = $1 FS $2 }
NR==FNR { map[key]=$3; next }
{ print $0, map[key]+0 }
' file2.txt file1.txt
AA ZZ 222
JB CX 345
CX YZ 0
BB XX 3145
SU BY 0
DA XZ 876
IB KK 234
XY IK 897
TY AB 0

awk to extract tags between two files

The awk below runs as is and produces the current output. I am trying to add a condition that extracts the text or
value after the tags AF=, FR=, HRUN=, LEN=, TYPE= for each line of file1 compared to file2. As it stands, each line is classified as
a Match, Missing in file1, or Missing in file2, but I cannot work out how to extract each tag up to the ; (semicolon).
There may not always be text after a tag, but each one always ends with a ;. The decimal in $6 should also be rounded to 3 significant figures to make it easier to read. It seems
close, but there are a couple of things I am not quite sure how to do. Thank you :).
file1
chr1 43814978 COSM27286 G A 86.92679999999999 PASS
AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43814981 COSM27287 G A 86.83350000000002 PASS
AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS
AF=0,0;AO=0,0;DP=5528;FAO=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp;VARB=0,0;HS;
file2
chr1 43814979 COSM27286 G A 86.92679999999999 PASS
AF=0;AO=1;DP=5535;FAO=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43814981 COSM27287 G A 86.83350000000002 PASS
AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS
AF=0,0;AO=0,0;DP=5528;FAO=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp;VARB=0,0;HS;
desired output
Match:
chr1 43814981 COSM27287 G A 86.8 PASS
AF=0;FR=.;HRUN=1;LEN=1;TYPE=snp
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS
AF=0,0;FR=.,.,;HRUN=1,1;LEN=3,2,;TYPE=mnp,mnp
Missing in file1:
chr1 43814979 COSM27286 G A 86.9 PASS
AF=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp
Missing in file2:
chr1 43814978 COSM27286 G A 86.9 PASS
AF=0;FR=.,REALIGNEDx0.008;HRUN=1;LEN=1;TYPE=snp
awk
awk 'FNR==1 { next }
FNR == NR { file1[$1,$2,$3,$4,$5,$6,$7] = $1 " " $2 " " $3 " " $4 " " $5 " " $6 " "$7 }
FNR != NR { file2[$1,$2,$3,$4,$5,$6,$7] = $1 " " $2 " " $3 " " $4 " " $5 " " $6 " "$7 }
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' file1 file2 > output
Current output
Match:
chr1 43814981 COSM27287 G A 86.83350000000002 PASS
chr1 43815008 COSM29008;COSM43212 TGG AAA,AAG 70.3099 PASS
Missing in File1:
chr1 43814979 COSM27286 G A 86.92679999999999 PASS
Missing in File2:
chr1 43814978 COSM27286 G A 86.92679999999999 PASS
try:
awk 'FNR==NR{
a[$1,$2,$7]=$1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7;
next
}
(($1,$2,$7) in a){
val_match=val_match?val_match ORS a[$1,$2,$7]:a[$1,$2,$7];
delete a[$1,$2,$7];
next
}
{
val_mismatch_in_file1=val_mismatch_in_file1?val_mismatch_in_file1 ORS $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7:$1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7;
}
END{
for(i in a){
val_missing_in_file2=val_missing_in_file2?val_missing_in_file2 ORS a[i]:a[i]};
print "Match:" RS val_match RS "Missing in File1:" RS val_mismatch_in_file1 RS "Missing in File2:" RS val_missing_in_file2
}
' Input_file1 Input_file2
Output will be as follows.
Match:
chr1 43814981 COSM27287 G A 86.83350000000002 PASS
chr1 43815008 COSM29008;COSM43212;COSM19193;COSM27289;COSM28487 TGG AAA,AAG,AGG,CGG,GCG 70.3099 PASS
Missing in File1:
chr1 43814979 COSM27286 G A 86.92679999999999 PASS
Missing in File2:
chr1 43814978 COSM27286 G A 86.92679999999999 PASS
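The answer above classifies the lines but leaves the tag extraction and the rounding of $6 open. A minimal sketch of that step alone, assuming the tag string is the 8th whitespace-separated field on the same line as the other seven (the line variable and the part and info names are illustrative):

```shell
# One record from file1, with the tag string assumed to be field 8.
line='chr1 43814981 COSM27287 G A 86.83350000000002 PASS AF=0;AO=2;DP=5556;FAO=0;FR=.;HRUN=1;LEN=1;TYPE=snp;VARB=0;HS;'

out=$(printf '%s\n' "$line" | awk '
{
    n = split($8, part, ";")              # one array element per tag
    info = ""
    for (i = 1; i <= n; i++)
        if (part[i] ~ /^(AF|FR|HRUN|LEN|TYPE)=/)    # keep only the wanted tags
            info = info (info == "" ? "" : ";") part[i]
    $6 = sprintf("%.3g", $6)              # three significant figures
    print $1, $2, $3, $4, $5, $6, $7, info
}')
printf '%s\n' "$out"
```

The regular expression is anchored at the start of each tag, so FAO=0 is not mistaken for AF=0. The same block could be applied to each stored line before it is printed under Match or Missing.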

Compare Two Files Using awk Script

I need to validate two files against a condition; the fields in each record are separated by commas.
File1
Prakash,10,20,3(Field Index)
File2
10,25,100
20,25,200
30,20,300
Using the field index read from File1 (3 here), I need to sum the corresponding column in File2 (i.e. 100+200+300).
$ cat > file1
Prakash,10,20,3
$ awk -F, '
NR == FNR {f = $NF; next} # last field of the first file
{ sum += $f } # we're in the 2nd file here since NR != FNR
END { print "sum of field index " f " is " sum }
' file1 file2
sum of field index 3 is 600

awk: printing lines side by side when the first field is the same in the records

I have a file containing lines like
a x1
b x1
q xq
c x1
b x2
c x2
n xn
c x3
I would like to test the first field in each line, and if there is a match, append the matching lines to the first line with that field. The output should look like
a x1
b x1 b x2
q xq
c x1 c x2 c x3
n xn
Any help will be greatly appreciated.
Using awk you can do this:
awk '{arr[$1]=arr[$1]?arr[$1] " " $0:$0} END {for (i in arr) print arr[i]}' file
n xn
a x1
b x1 b x2
c x1 c x2 c x3
q xq
To preserve input ordering:
$ awk '
{
if ($1 in vals) {
prev = vals[$1] " "
}
else {
prev = ""
keys[++k] = $1
}
vals[$1] = prev $0
}
END {
for (k=1;k in keys;k++)
print vals[keys[k]]
}
' file
a x1
b x1 b x2
q xq
c x1 c x2 c x3
n xn
What I ended up doing. (The answers by Ed Morton and Jonte are obviously more elegant.)
First I saved the 1st column of the input file in a separate file.
awk '{print $1}' input_file.txt > tmp0
Then I saved the input file with the lines that repeat an earlier $1 value removed.
awk 'BEGIN { FS = "\t" }; !x[$1]++ { print $0}' input_file.txt > tmp1
Then I saved all the lines that repeat an earlier $1 value.
awk 'BEGIN { FS = "\t" }; x[$1]++ { print $0}' input_file.txt >tmp2
Then I saved the $1 fields of the de-duplicated file (tmp1).
awk '{ print $1}' tmp1 > tmp3
I used a for loop to pull lines from the duplicates file (tmp2) and the de-duplicated file (tmp1) into an output file.
for i in $(cat tmp3)
do
if [ "$(grep -cw "$i" tmp0)" = 1 ] # test for a single instance in the 1st col of the input file
then
grep -w "$i" tmp1 >> output.txt # if single, pull that record from the de-duplicated file
else
# if not single, pull that record from tmp1 first, then all the records from tmp2, on a single line
echo -e "$(grep -w "$i" tmp1) \t $(grep -w "$i" tmp2 | awk '{ printf "%s\t", $0 } END { printf "\n" }')" >> output.txt
fi
done
Finally remove the tmp files
rm tmp0 tmp1 tmp2 tmp3 # remove only the tmp files created above