awk to improve command: print Match and Non-Match cases - awk

Would like to read and compare the first field from two files, then print:
Match lines from both files (available in f11.txt and f22.txt) -> Op_Match.txt
Non-match lines from f11.txt (available in f11.txt, not available in f22.txt) -> Op_NonMatch_f11.txt
Non-match lines from f22.txt (available in f22.txt, not available in f11.txt) -> Op_NonMatch_f22.txt
Currently using the 3 separate commands below to achieve this.
f11.txt
10,03-APR-14,abc
20,02-JUL-13,def
10,19-FEB-14,abc
20,02-AUG-13,def
10,22-JAN-07,abc
10,29-JUN-07,abc
40,11-SEP-13,ghi
f22.txt
50,DL,3000~4332,ABC~XYZ
10,DL,5000~2503,ABC~XYZ
30,AL,2000~2800,DEF~PQZ
To print matching lines from both files:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} ($1 in a) {print $0,a[$1]}' f22.txt f11.txt> Op_Match.txt
10,03-APR-14,abc,10,DL,5000~2503,ABC~XYZ
10,19-FEB-14,abc,10,DL,5000~2503,ABC~XYZ
10,22-JAN-07,abc,10,DL,5000~2503,ABC~XYZ
10,29-JUN-07,abc,10,DL,5000~2503,ABC~XYZ
To print non-match lines from f11.txt:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} !($1 in a) {print $0}' f22.txt f11.txt > Op_NonMatch_f11.txt
20,02-JUL-13,def
20,02-AUG-13,def
40,11-SEP-13,ghi
To print non-match lines from f22.txt:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} !($1 in a) {print $0}' f11.txt f22.txt > Op_NonMatch_f22.txt
50,DL,3000~4332,ABC~XYZ
30,AL,2000~2800,DEF~PQZ
These 3 separate commands achieve the scenario described above. Is there a simpler way that avoids running 3 different commands? Any suggestions?

Something like this, untested:
awk '
BEGIN { FS=OFS="," }
NR==FNR {
    fname1 = FILENAME
    sub(/\.txt$/,"",fname1)   # strip the extension so the output name is Op_NonMatch_f11.txt, not Op_NonMatch_f11.txt.txt
    keys[NR] = $1
    recs[NR] = $0
    # remember every line number of the first file on which this key occurs
    key2nrs[$1] = ($1 in key2nrs ? key2nrs[$1] RS : "") NR
    next
}
{
    if ($1 in key2nrs) {
        # pair the current line with every first-file line that shares its key
        split(key2nrs[$1], nrs, RS)
        for (i=1; i in nrs; i++) {
            print recs[nrs[i]], $0 > "Op_Match.txt"
        }
        matched[$1]
    }
    else {
        fname2 = FILENAME
        sub(/\.txt$/,"",fname2)
        print > ("Op_NonMatch_" fname2 ".txt")
    }
}
END {
    # first-file lines whose key never appeared in the second file
    for (i=1; i in recs; i++) {
        if ( !(keys[i] in matched) ) {
            print recs[i] > ("Op_NonMatch_" fname1 ".txt")
        }
    }
}
' f11.txt f22.txt
The main difference between this and Kent's and Etan's answers is that theirs assume $1 can only appear once within f22.txt, while the above still works if, say, 10 occurred as the first field on multiple lines of f22.txt.
The other difference is that the above outputs lines in the same order they occurred in the input files, while the other answers output some of them in an effectively random order based on how they're stored internally in a hash table.
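For instance, if f22.txt also contained a second line keyed on 10 (hypothetical data, purely to illustrate):
10,DL,5000~2503,ABC~XYZ
10,AL,1111~2222,DEF~PQZ
then the script above would pair every 10 line of f11.txt with both of those lines in Op_Match.txt, while a single a[$1]=$0 lookup array would silently keep only one of them.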

I haven't checked @EdMorton's answer, but he has quite likely gotten it right.
My solution (which looks slightly less generic than his at first glance) is:
awk -F, -v OFS=, '
FNR==NR {
    a[$1] = $0
    next
}
($1 in a) {
    print $0, a[$1] > "Op_Match.txt"   # OFS=, keeps the joined match line comma-separated
    am[$1]++
}
!($1 in a) {
    print $0 > "Op_NonMatch_f11.txt"
}
END {
    for (i in a) {
        if (!(i in am)) {
            print a[i] > "Op_NonMatch_f22.txt"
        }
    }
}
' f22.txt f11.txt

Here is one:
awk -F, -v OFS="," 'NR==FNR{a[$1]=$0;next}
$1 in a{print $0,a[$1]>("common.txt");c[$1];next}
{print $0>("NonMatchFromFile1.txt")}
END{for(x in a)
if(!(x in c))
print a[x]>("NonMatchFromFile2.txt")}' f2 f1
With this you will get 3 files: common.txt, NonMatchFromFile1.txt, and NonMatchFromFile2.txt.

Related

Compare multiple columns from one file with multiple columns of another file using awk?

I want to compare the first 2 characters of col1 of file1 with col1 of file2, where col3 of file1 is the same as col3 of file2, provided col4 in file2 equals TRUE. I tried something:
awk -F'|' 'BEGIN{OFS=FS};(NR==FNR)
{a[substr($1,1,2),$3]=$1;next}(($1,$3)in a) && $4==TRUE ' file1 file2 > outfield
file 1
AE1267453617238|BIDKFXXXX|United Arab Emirates|
PL76UTYVJDYGHU9|ABSFXXJBW|Poland|
GB76UTRTSCLSKJ|FVDGXXXUY|Russia|
file 2
AE|^AE[0-9]{2}[0-9]{24}|United Arab Emirates|TRUE|
PL|^PL[0-9]{2}[A-Z]{10}[0-9]{4}|Poland|FALSE|
GB|^GB[0-9]{2}[A-Z]{5}[0-9]{3}|Europe|TRUE
Expected output:
AE1267453617238|BIDKFXXXX|United Arab Emirates|
You could simply cascade the multiple conditions with && as below. Remember, your expected output comes from the first file, so you need to process the second file first.
awk -F'|' '
FNR == NR { if ( $4 == "TRUE" ) m[$1] = $3; next }
{ k = substr($1,1,2) }
k in m && m[k] == $3
' file2 file1
The part m[$1] = $3 builds a hash map from $1 to the value of $3 while reading the second file; it is then used in the first file to compare against only the first two characters of $1, i.e. substr($1,1,2). To avoid redundant calls to substr(..), the value is extracted into a variable k and reused.
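As a quick check, running it against the sample files should print exactly the expected output quoted in the question:
$ awk -F'|' 'FNR == NR { if ( $4 == "TRUE" ) m[$1] = $3; next } { k = substr($1,1,2) } k in m && m[k] == $3' file2 file1
AE1267453617238|BIDKFXXXX|United Arab Emirates|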
If the matches must be on the same line number in each file:
awk -F \| '
FNR==NR && $4 == "TRUE" {a[NR,1]=$1; a[NR,3]=$3}
FNR!=NR && $3 == a[FNR,3] &&
$1 ~ "^"a[FNR,1]' file2 file1
If the matches can be on any line (every line of file1 is checked against every line of file2, duplicate matches aren't printed):
awk -F \| '
FNR==NR { ++l }
FNR==NR && $4 == "TRUE" { a[NR,1]=$1; a[NR,3]=$3 }
FNR!=NR {
    for (i=1; i<=l; ++i) {
        if ($3 == a[i,3] && $1 ~ "^"a[i,1])
            c[$0]   # referencing the element is enough to record the match
    }
}
END {
    for (i in c)
        print i
}' file2 file1
Note the order the files are given in: file2 (which contains TRUE and FALSE) goes first. I also used a regex instead of substr, so the key characters should be alphanumeric only; if not, go back to substr.
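To see why the regex form assumes alphanumeric keys (an illustrative one-liner, not part of the answer above): a key containing a metacharacter such as . would match more than its literal characters, because $1 ~ "^"a[i,1] treats the key as a dynamic regex:
$ awk 'BEGIN { if ("ABXYZ" ~ "^" "A.") print "matches" }'
matches
A substr comparison has no such surprise, hence the fallback.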
Regarding your code:
awk -F'|' 'BEGIN{OFS=FS};(NR==FNR)
{a[substr($1,1,2),$3]=$1;next}(($1,$3)in a) && $4==TRUE ' file1 file2 > outfield
newlines matter to awk. This:
NR==FNR
{ print }
is not the same as this:
NR==FNR { print }
The first one is actually the same as:
NR==FNR { print }
1 { print }
Also, when you want to output the contents of a file (file1 in your case), it's usually better to read the OTHER file into memory and then compare the values from the target file against that, so you can just print lines as you go. So you should be doing awk 'script' file2 file1, not awk 'script' file1 file2, and writing the script based on that.
Try this:
$ cat tst.awk
BEGIN { FS="|" }
NR==FNR {
    if ( $4 == "TRUE" ) {
        map[$1] = $3
    }
    next
}
{ key = substr($1,1,2) }
(key in map) && (map[key] == $3)
$ awk -f tst.awk file2 file1
AE1267453617238|BIDKFXXXX|United Arab Emirates|
awk -F\| '
NR==FNR {
    a[$3,1] = $1
    a[$3,4] = $4
    next
}
substr($1,1,2) == a[$3,1] && a[$3,4] == "TRUE" { print }
' file2.txt file1.txt
AE1267453617238|BIDKFXXXX|United Arab Emirates|

How to sum up all other columns based on column 1?

I have an example csv file like the one below (but with many more columns, numbering up to Sample100, and several more rows):
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76
and I would like a summed csv file as below, where all identical entries in column 1 are summed up:
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
I tried the following awk code but it didn't work
awk -F "," 'function SP() {n=split ($0, T); ID=$1}
function PR() {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T[i]; printf "\n"}
NR==1 {SP();next}
$1 != ID {PR(); SP(); next}
{for (i=2; i<=NF; i++) T[i]+=$i}
END {PR()}
' Filename.csv
I am also aware of doing something like below but it is impractical when there are hundreds of columns. Any help here would be appreciated.
awk -F "," ' NR==1 {print; next} NF {a[$1]+=$2; b[$1]+=$3; c[$1]+=$4; d[$1]+=$5; e[$1]+=$6; f[$1]++} END {for(i in a)print i, a[i], b[i], c[i], d[i], e[i], f[i]} ' Filename.csv
With your shown samples, please try the following awk program. You need not create that many arrays; you can easily do it with just one or two here.
awk '
BEGIN { FS=OFS="," }
FNR==1 {
    print
    next
}
{
    for (i=2; i<=NF; i++) {
        arr1[$1]
        arr2[$1,i] += $i
    }
}
END {
    for (i in arr1) {
        printf("%s,", i)
        for (j=2; j<=NF; j++) {
            printf("%s%s", arr2[i,j], j==NF ? ORS : OFS)
        }
    }
}
' Input_file
Output will be as follows:
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
Explanation: a detailed explanation of the above code.
awk '                                           ##Starting awk program from here.
BEGIN { FS=OFS="," }                            ##In BEGIN section setting FS and OFS to comma here.
FNR==1{                                         ##Checking if this is the first line; if so do following.
  print                                         ##Printing current line.
  next                                          ##next will skip further statements from here.
}
{
  for(i=2;i<=NF;i++){                           ##Running for loop from 2nd field till NF here.
    arr1[$1]                                    ##Creating arr1 array with index of 1st field.
    arr2[$1,i]+=$i                              ##Creating arr2 indexed by 1st field and current field number, keeping a running sum of the current field value.
  }
}
END{                                            ##Starting END block of this program from here.
  for(i in arr1){                               ##Traversing through all elements of arr1 one by one.
    printf("%s,",i)                             ##Printing its current index here.
    for(j=2;j<=NF;j++){                         ##Running for loop from 2nd field till NF here.
      printf("%s%s",arr2[i,j],j==NF?ORS:OFS)    ##Printing value of arr2 with index i and j; printing a newline if it is the last field.
    }
  }
}
' Input_file                                    ##Mentioning Input_file name here.
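One caveat worth adding: for (i in arr1) visits indices in an unspecified order, so the summed rows may not come out in input order. If that matters and you use GNU awk, you could pin the traversal order with a gawk-only tweak (shown as a sketch):
BEGIN { FS=OFS=","; PROCINFO["sorted_in"] = "@ind_str_asc" }   ##gawk: make "for (i in ...)" visit indices in ascending string order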
Here's another awk:
awk -v FS=',' -v OFS=',' '
NR == 1 {
    print
    next
}
{
    ids[$1]
    for (i = 2; i <= NF; i++)
        sums[i "," $1] += $i
}
END {
    for (id in ids) {
        out = id
        for (i = 2; i <= NF; i++)
            out = out OFS sums[i "," id]
        print out
    }
}
' Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
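Note that both awk answers read NF inside END, where it still holds the field count of the last record read; that works here because every row has the same number of columns. If rows could be ragged, a hedged tweak is to capture the width from the header and loop to that instead of NF:
NR == 1 { ncols = NF; print; next }   # remember the header's column count for the END loop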
You can also use a CSV-aware program that provides tools for data analysis.
Here's an example with Miller, which is available as a stand-alone executable:
IFS='' read -r csv_header < Filename.csv
mlr --csv \
stats1 -a sum -g "${csv_header%%,*}" -f "${csv_header#*,}" \
then rename -r '(.*)_sum,\1' \
Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
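For the sample header, the two shell parameter expansions resolve to "Genus" and "Sample1,Sample2,Sample3", so the command that actually runs is:
mlr --csv \
    stats1 -a sum -g "Genus" -f "Sample1,Sample2,Sample3" \
    then rename -r '(.*)_sum,\1' \
    Filename.csv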

awk to count occurrences then split into two files

Would like to count the number of occurrences based on the $2 field, then split the input file into two output files:
if the $2 field occurs more than 3 times, those lines are redirected into OpFile11.txt, else they are redirected into OpFile22.txt.
Input.csv
Des1,Location,Decs2
aaa,a123,xxx
bbb,b789,yyy
xxx,a123,aaa
aaa,a123,xxx
bbb,b789,yyy
ccc,c567,zzz
xxx,a123,aaa
ddd,d456,ddd
OpFile11.txt
aaa,a123,xxx
xxx,a123,aaa
aaa,a123,xxx
xxx,a123,aaa
OpFile22.txt
bbb,b789,yyy
bbb,b789,yyy
ccc,c567,zzz
ddd,d456,ddd
Step 1: Counting the number of occurrences:
awk -F, '{key=$2;++a[key]} END {for(i in a) print i","a[i]}' Input.csv
d456,1
b789,2
c567,1
a123,4
Step 2: Splitting the input file into two parts:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1]=$0;next} ($2 in a) { print $0 }' OccurGR3.csv Input.csv > OpFile11.txt
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1]=$0;next} !($2 in a) { print $0 }' OccurGR3.csv Input.csv > OpFile22.txt
where OccurGR3.csv
a123,4
Please suggest how to avoid these three steps; looking forward to your suggestions!
awk -F, '
NR==FNR { cnt[$2]++; next }
{ print > ( "OpFile" (cnt[$2]<3?22:11) ".txt" ) }
' Input.csv Input.csv
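Note that Input.csv is given twice: the first pass only counts the $2 values, and the second pass routes each line to the right file. If the data can't be read twice (say it arrives on a pipe), here is a hedged single-pass sketch that buffers the lines in memory instead (fine as long as the input fits in RAM):
awk -F, '
{ cnt[$2]++; line[NR] = $0; key[NR] = $2 }
END {
    for (i = 1; i <= NR; i++)
        print line[i] > ("OpFile" (cnt[key[i]]<3 ? 22 : 11) ".txt")
}
' Input.csv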

awk Print Skipping a field

In the case where type is "", print the 3rd field out of sequence and then print the whole line with the exception of the 3rd field.
Given a tab-separated line a b c d e, the idea is to print ba<tab>c<tab>a<tab>b<tab>d<tab>e.
Setting $3="" seems to cause the subsequent print statement to lose the tab field separators, and so is no good.
# $1 = year $2 = movie
BEGIN {FS = "\t"}
type=="" {printf "%s\t%s\t", $2 $1,$3; $3=""; print}
type!="" {printf "%s\t<%s>\t", $2 $1,type; print}
END {print ""}
Sticking in a for loop, which I like a lot less as a solution, results in a blank file.
# $1 = year $2 = movie
BEGIN {FS = "\t"}
type=="" {printf "%s\t%s\t%s\t%s\t", $2 $1,$3,$1,$2; for (i=4; i<=NF;i++) printf "%s\t",$i}
type!="" {printf "%s\t<%s>\t", $2 $1,type; print}
END {print ""}
You need to set OFS to a tab instead of its default single blank char, and you don't want to just set $3 to a blank char, as then you'll get 2 tabs between $2 and $4.
$ cat tst.awk
BEGIN { FS = OFS = "\t" }
{
    if (type == "") {
        val = $3
        for (i=3; i<NF; i++) {
            $i = $(i+1)
        }
        NF--
    }
    else {
        val = "<" type ">"
    }
    print $2 $1, val, $0
}
$
$ awk -f tst.awk file | tr '\t' '-'
ba-c-a-b-d-e
$
$ awk -v type="foo" -f tst.awk file | tr '\t' '-'
ba-<foo>-a-b-c-d-e
The |tr '\t' '-' is obviously just added to make visible where the tabs are.
If decrementing NF doesn't work in your awk to delete the last field in the record, replace it with sub(/\t[^\t]+$/,"").
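That is, the if branch would end like this instead (the same script, with just the portable substitution described above):
val = $3
for (i=3; i<NF; i++) {
    $i = $(i+1)
}
sub(/\t[^\t]+$/,"")   # portable replacement for NF--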
One way:
awk '{$3=""}1' OFS="\t" infile | column -t
Explanation:
{$3=""} sets the third column to nil.
1 is the same as print: print the line.
OFS="\t" sets the Output Field Separator variable to a tab; you may not need it, since the next command, column -t, redoes the formatting anyway.
column -t columnates lists with tabs.

awk: totally separate duplicates and non-duplicates

If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (SMILES):
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
The following attempt can separate the duplicates without any problem. However, the first occurrence of each duplicate is still included in the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b" }
{
    # Keep count of the values in the fourth column
    count[$4]++;
    # Save the line the first time we encounter a unique field
    if (count[$4] == 1)
        first[$4] = $0;
    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$4] == 2)
        print first[$4] > f1;
    # From the second time onward, always print because the field is
    # duplicated
    if (count[$4] > 1)
        print > f1;
    if (count[$4] == 1)   # if (count[$4] - count[$4] == 0) <= changing to this doesn't work
        print first[$4] > f2;
}
Duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non-duplicate output results from the attempt:
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
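For the sample input, that preprocessing step produces counts like these (the exact sort order may vary with locale; the header's SMILES value is counted too, harmlessly, since the FNR==1 rule skips the header during the second pass):
$ cut -d, -f4 input | sort | uniq -c
      1 SMILES
      3 c1ccccc1
      1 c1ccccc1N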
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2-pass solution that's virtually identical to my example above:
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN { FS="," }
NR>1 {
    if (cnt[$4]++) {
        dups[$4] = nonDups[$4] dups[$4] $0 ORS
        delete nonDups[$4]
    }
    else {
        nonDups[$4] = $0 ORS
    }
}
END {
    print "Duplicates:"
    for (key in dups) {
        printf "%s", dups[key]
    }
    print "\nNon Duplicates:"
    for (key in nonDups) {
        printf "%s", nonDups[key]
    }
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout(    f, i) {
    f = (cnt > 1) ? "dups" : "nondups"
    for (i = 1; i <= cnt; ++i)
        print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
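If the rows aren't already grouped by the 4th column, you could sort them first while keeping the header line in place (a hedged shell sketch; separate.awk is a hypothetical file holding the script above, which then reads from stdin):
{ head -n 1 file; tail -n +2 file | sort -t, -k4,4; } | awk -F, -f separate.awk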
A little late. My version in awk:
awk -F, 'NR>1 { a[$0] = $4; b[$4]++ }
END { d = "\n\nnondupe"; e = "dupe"
      for (i in a) { if (b[a[i]] == 1) d = d "\n" i; else e = e "\n" i }
      print e d }' file
Another one, built similar to glenn jackman's but all in awk:
awk -F, 'function r(f) { while ((getline < f) > 0) a[$4]++; close(f) }   # pre-count $4 by reading the file once in BEGIN
BEGIN { r(ARGV[1]) }
{ output = (a[$4] == 1 ? "nondup" : "dup"); print > output }' file