awk to count occurrences then split into two files

I would like to count the number of occurrences of each value in the $2 field and then split the input file into two output files:
if a $2 value occurs more than 3 times, the lines with that value are redirected into OpFile11.txt; otherwise they are redirected into OpFile22.txt.
Input.csv
Des1,Location,Decs2
aaa,a123,xxx
bbb,b789,yyy
xxx,a123,aaa
aaa,a123,xxx
bbb,b789,yyy
ccc,c567,zzz
xxx,a123,aaa
ddd,d456,ddd
OpFile11.txt
aaa,a123,xxx
xxx,a123,aaa
aaa,a123,xxx
xxx,a123,aaa
OpFile22.txt
bbb,b789,yyy
bbb,b789,yyy
ccc,c567,zzz
ddd,d456,ddd
Step #1: Counting the number of occurrences:
awk -F, '{key=$2;++a[key]} END {for(i in a) print i","a[i]}' Input.csv
d456,1
b789,2
c567,1
a123,4
Step #2: Splitting the input file into two parts:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1]=$0;next} ($2 in a) { print $0 }' OccurGR3.csv Input.csv > OpFile11.txt
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1]=$0;next} !($2 in a) { print $0 }' OccurGR3.csv Input.csv > OpFile22.txt
where OccurGR3.csv
a123,4
Please suggest how to avoid these three steps; looking forward to your suggestions!

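awk -F, '
NR==FNR { cnt[$2]++; next }   # 1st pass: count each $2 value
# 2nd pass: more than 3 occurrences -> OpFile11.txt, otherwise OpFile22.txt
{ print > ( "OpFile" (cnt[$2]>3?11:22) ".txt" ) }
' Input.csv Input.csv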
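If reading Input.csv twice is not an option (for example, when the data arrives on a pipe), a one-pass variant can buffer the lines in memory and do the split in the END block. This is only a sketch under a few assumptions: the whole file fits in memory, the first line is a header that should be skipped (the expected output in the question omits it), and "more than 3 times" means a count strictly greater than 3. The array names line, key and cnt are just illustrative.

```shell
# Sample data from the question.
cat > Input.csv <<'EOF'
Des1,Location,Decs2
aaa,a123,xxx
bbb,b789,yyy
xxx,a123,aaa
aaa,a123,xxx
bbb,b789,yyy
ccc,c567,zzz
xxx,a123,aaa
ddd,d456,ddd
EOF

awk -F, '
FNR > 1 {                 # skip the header line (assumption)
    line[++n] = $0        # remember every data line in input order
    key[n]    = $2        # ... together with its $2 value
    cnt[$2]++             # count occurrences of each $2 value
}
END {
    # Replay the buffered lines and route each one by its final count.
    for (i = 1; i <= n; i++)
        print line[i] > ("OpFile" (cnt[key[i]] > 3 ? 11 : 22) ".txt")
}' Input.csv
```

The trade-off is memory for I/O: the two-pass command keeps only the counts, while this one keeps every line until the end of input.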

Related

match two files with awk and output selected fields

I want to compare two files delimited with ; that share the same field 1, and output field 2 of file 1 followed by field 2 and field 1 of file 2.
File1:
16003-Z/VG043;204352
16003/C3;100947
16003/C3;172973
16003/PAB4L;62245
16003;100530
16003;101691
16003;144786
File2:
16003-Z/VG043;568E;0540575;2.59
16003/C3;568E;0000340;2.53
16003/PAB4L;568H;0606738;9.74
16003;568E;0000339;0.71
16003TN9/C3;568E;0042261;3.29
Desired output:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
My try:
awk -F\, '{FS=";"} NR==FNR {a[$1]; next} ($1) in a{ print a[$2]";"$2";"$3}' File1 File2 > Output
The above is not working, probably because awk is still obscure to me.
The problem is: what is driving the output? Which file do $1, $2, etc. refer to?
My intention was for a[$2] to be field 2 of file 1, but it is not.
What I get is:
;204352;16003-Z/VG043
;100947;16003/C3
;172973;16003/C3
;62245;16003/PAB4L
;100530;16003
;101691;16003
;144786;16003
thanks for helping
This might be what you are after:
awk -F";" '(NR==FNR) { a[$1] = ($1 in a ? a[$1] FS : "") $2; next }
($1 in a) { n = split(a[$1],b); for(i=1; i<=n; i++) print b[i] FS $2 FS $1 }' File1 File2
This outputs:
204352;568E;16003-Z/VG043
100947;568E;16003/C3
172973;568E;16003/C3
62245;568H;16003/PAB4L
100530;568E;16003
101691;568E;16003
144786;568E;16003
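On the question of what drives the output: awk runs every pattern/action pair once per input line, and $1, $2, ... always refer to the line currently being read, whichever file it came from. In the original attempt, a[$1] only creates a key and never stores a value, so any a[...] printed later is empty. The commented sketch below spells out the two phases on the sample data. It assumes, as in the sample, that each key occurs at most once in File2; the array name f2 is made up for the illustration. Unlike the command above, it reads File2 first so that File1 drives the output order.

```shell
# Sample data from the question.
cat > File1 <<'EOF'
16003-Z/VG043;204352
16003/C3;100947
16003/C3;172973
16003/PAB4L;62245
16003;100530
16003;101691
16003;144786
EOF
cat > File2 <<'EOF'
16003-Z/VG043;568E;0540575;2.59
16003/C3;568E;0000340;2.53
16003/PAB4L;568H;0606738;9.74
16003;568E;0000339;0.71
16003TN9/C3;568E;0042261;3.29
EOF

awk -F';' '
# Phase 1, first file on the command line (File2): NR==FNR is only true
# here, because NR counts lines over all files and FNR restarts per file.
NR == FNR { f2[$1] = $2; next }   # remember field 2 of File2 by key

# Phase 2, second file (File1): $1 and $2 now refer to the File1 line.
$1 in f2  { print $2 ";" f2[$1] ";" $1 }
' File2 File1 > Output

cat Output
```

The `next` in the first rule is what keeps File2 lines from ever reaching the second rule.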
This approach first reads the file named by the variable first (file_1.txt) into an associative array, table. (This is done to associate ids / values across files.) Then, looping over the second file, file_2.txt, I print the values in table that match the id field of this file along with the current value:
BEGIN {
    FS=OFS=";"
    while ((getline < first) > 0)        # stop on end of file and on read errors
        table[$1] = $2 FS table[$1]
}
$1 in table {
    len = split(table[$1], parts)
    for (i=1; i<len; i++)
        print parts[i], $2, $1
}
$ awk -v first=file_1.txt -f script.awk file_2.txt
204352;568E;16003-Z/VG043
172973;568E;16003/C3
100947;568E;16003/C3
62245;568H;16003/PAB4L
144786;568E;16003
101691;568E;16003
100530;568E;16003

awk to print the fields based on condition:

I would like to compare the first and third fields of Input.csv with the second and third fields of Master.csv.
If the fruit names match and the Amount in Input.csv is less than the Amount in Master.csv, then print the matching lines from both files.
For example,
Fruits==Apple, Amount<20 from Master.csv needs to be checked against Input.csv, so the output would be
Fruits,Region,Amount,Details
Apple,North,10,Abc
Apple,south,9,Abc
Input.csv
Fruits,Region,Amount,Details
Apple,North,10,Abc
Orange,East,115,Def
Apple,south,9,Abc
Apple,West,25,Abc
Orange,West,150,Def
Orange,North,200,Def
Mango,North,50,Ghi
Mango,East,75,Ghi
Master.csv
Details,Fruits,Amount
xxx,Apple,20
yyy,Mango,60
zzz,Cherry,80
Desired Output.csv
Fruits,Region,Amount,Details,Details,Fruits,Amount
Apple,North,10,Abc,xxx,Apple,20
Apple,south,9,Abc,xxx,Apple,20
Mango,North,50,Ghi,yyy,Mango,60
I have tried the command below:
awk -F "," 'FNR==NR {a[$1]; b[$3]; next} $2 in a && $3 < b' Input.csv Master.csv > Output.csv
The following awk may help you:
awk -F"," 'FNR==1 && FNR==NR{val=$0} FNR==NR{a[$2]=$3;b[$2]=$0;next} FNR==1 && FNR!=NR{for(i=NF;i>0;i--){val1=val1?$i OFS val1:$i};print val1,val} (($1 in a) && $3<a[$1]){print $0,b[$1]}' OFS=, Master.csv Input.csv
Adding a non-one-liner form of the solution now:
awk -F"," '
FNR==1 && FNR==NR{            # header line of Master.csv
  val=$0
}
FNR==NR{                      # Master.csv: index amount and full line by fruit
  a[$2]=$3;
  b[$2]=$0;
  next
}
FNR==1 && FNR!=NR{            # header line of Input.csv
  for(i=NF;i>0;i--){
    val1=val1?$i OFS val1:$i};
  print val1,val              # Input.csv columns first, like the data rows
}
(($1 in a) && $3<a[$1]){      # fruit is known and its amount is below the master amount
  print $0,b[$1]
}
' OFS=, Master.csv Input.csv
Output will be as follows:
Fruits,Region,Amount,Details,Details,Fruits,Amount
Apple,North,10,Abc,xxx,Apple,20
Apple,south,9,Abc,xxx,Apple,20
Mango,North,50,Ghi,yyy,Mango,60

awk Print Skipping a field

In the case where type is "" print the 3rd field out of sequence and then print the whole line with the exception of the 3rd field.
Given a tab separated line a b c d e the idea is to print ab<tab>c<tab>a<tab>b<tab>d<tab>e
Setting $3="" seems to cause the subsequent print statement to lose the tab field separators and so is no good.
# $1 = year $2 = movie
BEGIN {FS = "\t"}
type=="" {printf "%s\t%s\t", $2 $1,$3; $3=""; print}
type!="" {printf "%s\t<%s>\t", $2 $1,type; print}
END {print ""}
Sticking in a for loop which I like a lot less as a solution results in a blank file.
# $1 = year $2 = movie
BEGIN {FS = "\t"}
type=="" {printf "%s\t%s\t%s\t%s\t", $2 $1,$3,$1,$2; for (i=4; i<=NF;i++) printf "%s\t",$i}
type!="" {printf "%s\t<%s>\t", $2 $1,type; print}
END {print ""}
You need to set the OFS to a tab instead of its default single blank char, and you don't want to just set $3 to an empty string, as then you'll get 2 tabs between $2 and $4.
$ cat tst.awk
BEGIN {FS = OFS = "\t"}
{
    if (type == "") {
        val = $3
        for (i=3; i<NF; i++) {
            $i = $(i+1)
        }
        NF--
    }
    else {
        val = "<" type ">"
    }
    print $2 $1, val, $0
}
$
$ awk -f tst.awk file | tr '\t' '-'
ba-c-a-b-d-e
$
$ awk -v type="foo" -f tst.awk file | tr '\t' '-'
ba-<foo>-a-b-c-d-e
The |tr '\t' '-' is obviously just added to make visible where the tabs are.
If decrementing NF doesn't work in your awk to delete the last field in the record, replace it with sub(/\t[^\t]+$/,"").
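If you would rather not modify the record at all, the output line can also be assembled in a string, skipping field 3 while looping over the fields. This is a minimal sketch that covers only the type=="" case, on a one-line sample file; the variable name out is made up for the illustration.

```shell
# One tab-separated sample line: a b c d e
printf 'a\tb\tc\td\te\n' > file

awk -F'\t' -v OFS='\t' '
{
    out = $2 $1 OFS $3            # key, then field 3 moved to the front
    for (i = 1; i <= NF; i++)
        if (i != 3)               # append every field except the third
            out = out OFS $i
    print out
}' file | tr '\t' '-' > out.txt

cat out.txt
```

Because no field is assigned to, awk never rebuilds $0, so this version does not depend on NF-- being supported.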
One way:
awk '{$3=""}1' OFS="\t" infile | column -t
Explanation:
{$3=""} sets the third column to an empty string.
1 is an always-true pattern, so the default action runs: print the line.
OFS="\t" sets the Output Field Separator variable to tab. You may not need it, because the next command, column -t, lines the columns up again.
column -t formats its input into aligned columns (a table).

print unique lines based on field

I would like to print unique lines based on the first field: keep the first occurrence of each line and remove the other, duplicate occurrences.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
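For readers new to the idiom: `!seen[$1]++` is a pattern with no action, so awk's default action (print the line) runs whenever the pattern is true. A longhand sketch of the same logic, with the steps written out, might look like this:

```shell
# Sample data from the question.
cat > Input.csv <<'EOF'
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
EOF

awk -F, '
{
    if (seen[$1] == 0)   # an unset array element compares equal to 0
        print            # first time this key appears: print the line
    seen[$1]++           # count the key so later lines with it are skipped
}' Input.csv > Output.csv

cat Output.csv
```

The short form works because the post-increment returns the old count (0 on first sight), and `!` turns that 0 into true.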
This should give you what you want, although for (i in a) does not guarantee that the lines come out in input order:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' Input.csv

awk to improve command print Match and Non-Match case:

Would like to read and compare first field from two files then print
Match Lines from Both the files - ( Available in f11.txt and f22.txt) -> Op_Match.txt
Non- Match Lines from f11.txt ( Available in f11.txt Not-Available in f22.txt)-> Op_NonMatch_f11.txt
Non- Match Lines from f22.txt ( Available in f22.txt Not-Available in f11.txt)-> Op_NonMatch_f22.txt
I am using the 3 separate commands below to achieve the above scenarios.
f11.txt
10,03-APR-14,abc
20,02-JUL-13,def
10,19-FEB-14,abc
20,02-AUG-13,def
10,22-JAN-07,abc
10,29-JUN-07,abc
40,11-SEP-13,ghi
f22.txt
50,DL,3000~4332,ABC~XYZ
10,DL,5000~2503,ABC~XYZ
30,AL,2000~2800,DEF~PQZ
To get the matching lines from both files:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} ($1 in a) {print $0,a[$1]}' f22.txt f11.txt> Op_Match.txt
10,03-APR-14,abc,10,DL,5000~2503,ABC~XYZ
10,19-FEB-14,abc,10,DL,5000~2503,ABC~XYZ
10,22-JAN-07,abc,10,DL,5000~2503,ABC~XYZ
10,29-JUN-07,abc,10,DL,5000~2503,ABC~XYZ
To get the non-matching lines from f11.txt:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} !($1 in a) {print $0}' f22.txt f11.txt > Op_NonMatch_f11.txt
20,02-JUL-13,def
20,02-AUG-13,def
40,11-SEP-13,ghi
To get the non-matching lines from f22.txt:
awk ' BEGIN {FS = OFS = ","} FNR==NR {a[$1] = $0; next} !($1 in a) {print $0}' f11.txt f22.txt > Op_NonMatch_f22.txt
50,DL,3000~4332,ABC~XYZ
30,AL,2000~2800,DEF~PQZ
I am using the above 3 separate commands to achieve the mentioned scenarios. Is there a simpler way that avoids 3 different commands? Any suggestions?
Something like this, untested:
awk '
BEGIN{ FS=OFS="," }
NR==FNR {
    fname1 = FILENAME
    keys[NR] = $1
    recs[NR] = $0
    key2nrs[$1] = ($1 in key2nrs ? key2nrs[$1] RS : "") NR
    next
}
{
    if ($1 in key2nrs) {
        split (key2nrs[$1],nrs,RS)
        for (i=1; i in nrs; i++) {
            print recs[nrs[i]], $0 > "Op_Match.txt"
        }
        matched[$1]
    }
    else {
        # FILENAME already ends in .txt, so do not append it again
        print > ("Op_NonMatch_" FILENAME)
    }
}
END {
    for (i=1; i in recs; i++) {
        if (! (keys[i] in matched) ) {
            print recs[i] > ("Op_NonMatch_" fname1)
        }
    }
}
' f11.txt f22.txt
The main difference between this and Kent's and Etan's answers is that theirs assume that the $1 in f22.txt can only appear once within that file, while the above would work if, say, 10 occurred as the first field on multiple lines of f22.txt.
The other difference is that the above will output lines in the same order that they occurred in the input files while the other answers will output some of them in random order based on how they're stored internally in a hash table.
I haven't checked @EdMorton's answer but he will quite likely have gotten it right.
My solution (which looks slightly less generic than his at first glance) is:
awk -F, -v OFS=, '
FNR==NR {
    a[$1]=$0;
    next
}
($1 in a){
    print $0,a[$1] > "Op_Match.txt"
    am[$1]++
}
!($1 in a) {
    print $0 > "Op_NonMatch_f11.txt"
}
END {
    for (i in a) {
        if (!(i in am)) {
            print a[i] > "Op_NonMatch_f22.txt"
        }
    }
}
' f22.txt f11.txt
here is one:
awk -F, -v OFS="," 'NR==FNR{a[$1]=$0;next}
$1 in a{print $0,a[$1]>("common.txt");c[$1];next}
{print $0>("NonMatchFromFile1.txt")}
END{for(x in a)
if(!(x in c))
print a[x]>("NonMatchFromFile2.txt")}' f2 f1
With this, you will get 3 files: common.txt, NonMatchFromFile1.txt and NonMatchFromFile2.txt.