I'm having a very weird issue while working with two files.
It's meant to match rows whose first fields are equal in both files, then apply further conditionals on the other fields and check whether they match. This seems to work fine for all other fields; however, the second field, $2, of the first file fails to be populated.
My awk script:
#!/bin/awk -f
BEGIN {
FS=OFS=","
total = 0;
}
FNR==NR{
reg[$1] = $1;
reg_s[$2] = $2;
account[$3] = $3;
site_name[$4] = $4;
next;
}
{
if ($1 in reg)
if ( (($2 != "Yes") && (reg_s[$2] == "3")) || (($2 == "Yes") && (reg_s[$2] != "3")) ) {
print "Status Error";
total++;
}
}
END {
print " - DONE - " total" Errors"
}
Where am I going wrong?
file1:
abcd,3,Paper,go
abcde,3,stapler,staples
abb,0,pencil,sharpener
file2:
abcd,Yes,Paper,go
abcde,Yes,stapler,staples
abb,No,pencil,sharpener
to run it:
awk -f myscript.awk file1 file2
Here is something you can use...
$ join -t, <(sort file1) <(sort file2) |
awk -F, '($2==3) != ($5=="Yes"){count++} END{print count+0}'
This joins the files by the key (they need to be sorted first) and counts the records whose statuses disagree. Note that !a && b || a && !b is the definition of xor and can be written simply as a!=b, as I did above.
For the sample files this should print zero. (count+0 forces a numeric value in case the condition is never satisfied and count is still uninitialized.)
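To see why the two-sided condition collapses to a single comparison, here is a small sketch that enumerates all four truth-value combinations in awk and prints the long form next to the short one. The helper name xor_table is only illustrative and is not part of the answer above.

```shell
# xor_table is an illustrative helper name.
xor_table() {
    awk 'BEGIN {
        for (a = 0; a <= 1; a++)
            for (b = 0; b <= 1; b++)
                # columns: a, b, long xor form, short a != b form
                print a, b, ((!a && b) || (a && !b)), (a != b)
    }'
}
xor_table
```

The last two columns agree on every row, which is all the shortcut relies on.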
Run the script with the following debug modifications. It debugs the first part, when you populate the arrays:
#!/bin/awk -f
BEGIN {
FS=OFS=","
total = 0;
}
FNR==NR{
reg[$1] = $1;
reg_s[$2] = $2;
account[$3] = $3;
site_name[$4] = $4;
next;
}
{
print "----------reg----------------"
for (key in reg) { print key " : " reg[key] }
print "----------reg_s--------------"
for (key in reg_s) { print key " : " reg_s[key] }
print "----------account------------"
for (key in account) { print key " : " account[key] }
print "-----------site_name---------"
for (key in site_name) { print key " : " site_name[key] }
print "============================"
}
The output is:
----------reg----------------
abcd : abcd
abb : abb
abcde : abcde
----------reg_s--------------
0 : 0
3 : 3
----------account------------
stapler : stapler
Paper : Paper
pencil : pencil
-----------site_name---------
staples : staples
go : go
sharpener : sharpener
============================
As you can see, all arrays have three items except reg_s. That is because reg_s is assigned twice with the same key, "3"; in awk, assigning to an array with an existing key does not create a new item, it replaces the previous value.
That is why all the other arrays have three elements (their keys are all different), while reg_s, which was populated using only two different keys, "3" and "0", has two.
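One possible way around the key collision, sketched under the assumption that the first field is a unique ID in both files, is to store the status under $1 instead of $2 and look it up by ID when reading the second file. The function name check_status and the temp directory are made up for the demo; the sample rows are the ones from the question.

```shell
# Sample data copied from the question; the temp directory is only for this demo.
dir=$(mktemp -d)
printf '%s\n' 'abcd,3,Paper,go' 'abcde,3,stapler,staples' 'abb,0,pencil,sharpener' > "$dir/file1"
printf '%s\n' 'abcd,Yes,Paper,go' 'abcde,Yes,stapler,staples' 'abb,No,pencil,sharpener' > "$dir/file2"

# check_status is a hypothetical helper name.
check_status() {
    awk -F, '
        FNR == NR { reg_s[$1] = $2; next }   # first file: status keyed by the ID in $1
        ($1 in reg_s) && ((reg_s[$1] == "3") != ($2 == "Yes")) {
            print "Status Error: " $1        # "3" and "Yes" must go together
            total++
        }
        END { print " - DONE - " (total + 0) " Errors" }
    ' "$1" "$2"
}
check_status "$dir/file1" "$dir/file2"
```

With the sample files every status agrees, so only the summary line with 0 errors is printed.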
Hope this helps; I can edit and elaborate some more if you need.
I want to process a list of data files with awk. All records up to and including a marker record (an unknown number of records precede it), e.g.,
/10-12-2014 06:47:59/{p=1}
are to be skipped.
A brief template of one data file looks like this:
data_file_001
0; n records to be skipped
1;10-12-2014 06:47:59;
2;12-12-2014 10:17:44;
3;12-12-2014 10:37:44;
4;14-12-2014 10:00:32;
5;;movefield
6;16-12-2014 04:15:39;
needed Output ($2 datefield reformatted and $3 moved to $4):
colnum;date;col3;col4;col5
2;12.12.14;;
3;12.12.14;;
4;14.12.14;;
5;;;movefield;moved
6;16.12.14;;
My source file is this at the moment:
BEGIN { OFS=FS=";" ; print "colnum;date;col3;col4;col5"}
FNR == 1 { p=0 }
$3 == "movefield" { $4 = $3; $5 = "moved"; $3 = ""}
#(x=index($2," ") > 0) {DDMMYY = substr($2,1,x-1)}
$2=substr($2,1,11)
p!=0{print};
/10-12-2014 06:47:59/{p=1}
I have problems reformatting the date field: neither the pattern-action (x=index($2," ") > 0) {DDMMYY = substr($2,1,x-1)} nor $2=substr($2,1,11) works in conjunction with the movefield action. Notice that the record where the movefield field appears has no date field.
Please keep in mind that the awk script is meant to be run on a bunch of files (in a loop).
With GNU awk for inplace editing, no loop required:
awk -i inplace '
BEGIN { OFS=FS=";" ; print "colnum","date","col3","col4","col5" }
FNR==1 { next }
$3 == "movefield" { $4 = $3; $5 = "moved"; $3 = ""; print; next }
{ sub(/ .*/,"",$2); gsub(/-/,".",$2); print $0, ""}
' file*
Another in GNU awk:
$ awk '
function refmt(str) { # reformat date for comparing
split(str,d,"[ :-]")
return mktime(d[3] " " d[2] " " d[1] " " d[4] " " d[5] " " d[6])
}
BEGIN {
FS=OFS=";"
start=refmt("10-12-2014 06:47:59") # reformat the threshold date
print "colnum","date","col3","col4" # print header (why 5?)
}
refmt($2)>start || $2=="" { # if date > start or empty
sub(/ .*/,"",$2) # delete time part
gsub(/-/,".",$2) # replace - by .
$4=$3; $3="" # or $3 = OFS $3
print # output
}' file
colnum;date;col3;col4
2;12.12.2014;;
3;12.12.2014;;
4;14.12.2014;;
5;;;movefield
6;16.12.2014;;
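The needed output in the question shows a two-digit year (12.12.14), while the output above keeps all four digits. A minimal sketch of one way to shorten the year, assuming every non-empty date has the form dd-mm-yyyy hh:mm:ss; the helper name short_date is only for illustration.

```shell
# short_date is an illustrative helper name; it reads records on stdin.
short_date() {
    awk 'BEGIN { FS = OFS = ";" }
    $2 != "" {
        split($2, d, "[- ]")                      # d[1]=dd, d[2]=mm, d[3]=yyyy
        $2 = d[1] "." d[2] "." substr(d[3], 3)    # keep only the last two year digits
    }
    { print }'
}
printf '%s\n' '2;12-12-2014 10:17:44;' '5;;movefield' | short_date
```

Records without a date, such as the movefield line, pass through unchanged.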
aNumber|bNumber|startDate|timeZone|duration|currencyType|cost|
22677512549|778|2014-07-02 10:16:35.000|NULL|NULL|localCurrency|0.00|
22675557361|76457227|2014-07-02 10:16:38.000|NULL|NULL|localCurrency|10.00|
22677521277|778|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|0.00|
22676099496|77250331|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|1.00|
22667222160|22667262389|2014-07-02 10:16:43.000|NULL|NULL|localCurrency|10.00|
22665799922|70110055|2014-07-02 10:16:45.000|NULL|NULL|localCurrency|20.00|
22676239633|433|2014-07-02 10:16:48.000|NULL|NULL|localCurrency|0.00|
22677277255|76919167|2014-07-02 10:16:51.000|NULL|NULL|localCurrency|1.00|
This is the input (a sample of millions of lines) I have in a CSV file.
I want to sum up the duration by date.
My concern is that I want to sum up only the first 1000000 lines.
The awk program I'm using is:
test.awk
BEGIN { FS = "|" }
NR>1 && NR<=1000000
FNR == 1{ next }
{
sub(/ .*/,"",$3)
key=sprintf("%10s",$3)
duration[key] += $5 } END {
printf "%-10s %16s,"dAccused","Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}}
I run my script as:
$awk -f test.awk 'file'
The script doesn't honor my condition NR>1 && NR<=1000000.
Any suggestions, please?
You're looking for this:
BEGIN { FS = "|" }
1 < NR && NR <= 1000000 {
sub(/ .*/, "", $3)
key = sprintf("%10s",$3)
duration[key] += $5
}
END {
printf "%-10s %16s\n", "dAccused", "Duration"
for (i in duration) {
printf "%-4s %16.2f\n", i, duration[i]
}
}
A lot of errors become obvious with proper indentation.
The reason you saw 1,000,000 lines was due to this:
NR>1 && NR<=1000000
That is a condition with no action block. The default action is to print the current record if the condition is true. That's why you see a lot of awk one-liners end with the number 1: it is an always-true condition, so every record gets printed.
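A tiny demonstration of that default action, using three made-up input lines:

```shell
# A pattern with no action block prints every record it matches.
printf 'one\ntwo\nthree\n' | awk 'NR > 1 && NR <= 2'   # prints "two"

# A bare 1 is a pattern that is always true, so it prints every record.
printf 'one\ntwo\nthree\n' | awk '1'
```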
You didn't post any expected output and your duration field is always NULL, so it's still not clear what output you really want, but this is probably the right approach:
$ cat tst.awk
BEGIN { FS = "|" }
NR==1 { for (i=1;i<NF;i++) f[$i] = i; next }
{
sub(/ .*/,"",$(f["startDate"]))
sum[$(f["startDate"])] += $(f["duration"])
}
NR==1000000 { exit }
END { for (date in sum) print date, sum[date] }
$ awk -f tst.awk file
2014-07-02 0
Instead of discarding your header line, it uses it to create an array f[] that maps the field names to their order in each line so instead of having to hard-code that duration is field 4 (or whatever) you just reference it as $(f["duration"]).
Any time your input file has a header line, don't discard it - use it so your script is not coupled to the order of fields in your input file.
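To illustrate the decoupling, here is a sketch that runs the same header-mapping idea on made-up data where the columns appear in a different order and the durations are numeric. The sample rows and the helper name sum_by_date are assumptions for the demo, and the output is piped through sort only because the order of for (date in sum) is unspecified.

```shell
# sum_by_date is an illustrative helper name; it reads the data on stdin.
sum_by_date() {
    awk -F'|' '
        NR == 1 { for (i = 1; i <= NF; i++) f[$i] = i; next }   # map name -> column number
        {
            date = $(f["startDate"])
            sub(/ .*/, "", date)                 # drop the time part
            sum[date] += $(f["duration"])
        }
        END { for (date in sum) print date, sum[date] }
    ' | sort
}
# Hypothetical input: same field names as the question, different column order.
printf '%s\n' \
    'duration|startDate|aNumber' \
    '5|2014-07-02 10:16:35.000|111' \
    '7|2014-07-02 10:16:38.000|222' \
    '1|2014-07-03 09:00:00.000|333' | sum_by_date
```

The script never mentions a column number, so reordering the input columns does not require any change to it.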
If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (smiles)
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
The following attempt separates the duplicates without any problem. However, the first occurrence of each duplicated value is still written to the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
# Keep count of the fields in fourth column
count[$4]++;
# Save the line the first time we encounter a unique field
if (count[$4] == 1)
first[$4] = $0;
# If we encounter the field for the second time, print the
# previously saved line
if (count[$4] == 2)
print first[$4] > f1 ;
# From the second time onward. always print because the field is
# duplicated
if (count[$4] > 1)
print > f1;
if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
print first[$4] > f2;
}
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2 pass solution that's virtually identical to my example above
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
output = (count[$NF] == 1 ? "nondup" : "dup")
print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
if (cnt[$4]++) {
dups[$4] = nonDups[$4] dups[$4] $0 ORS
delete nonDups[$4]
}
else {
nonDups[$4] = $0 ORS
}
}
END {
print "Duplicates:"
for (key in dups) {
printf "%s", dups[key]
}
print "\nNon Duplicates:"
for (key in nonDups) {
printf "%s", nonDups[key]
}
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout( f, i) {
f = (cnt > 1) ? "dups" : "nondups"
for (i = 1; i <= cnt; ++i)
print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
Little late
My version in awk
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another, built similarly to glenn jackman's but all in awk:
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file
I need your help writing an awk script for the problem below. I have one source file and the required output for it.
Source File
a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
Output File
session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3
Notes:
Fields are not in a fixed order in the source file.
In the output file, fields are arranged in a specific order: for example, all a values are in the 2nd column, followed by b and then c.
The value c can occur any number of times, as in the second line; in the output its values are merged with a pipe symbol.
Please help.
Will work in any modern awk:
$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
split("",val) # or delete(val) in gawk
for (i=1;i<NF;i+=2) {
val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
}
for (i=1;i in order;i++) {
name = order[i]
printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
}
print ""
}
$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
Try with (GNU awk is needed for asorti):
awk '
BEGIN {
FS = "[,:]"
OFS = ","
}
{
for ( i = 1; i <= NF; i+= 2 ) {
if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
}
asorti( hash, hash_orig )
for ( i = 1; i <= length(hash); i++ ) {
printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
}
printf "\n"
delete hash
delete hash_orig
}
' infile
This splits each line on any comma or colon and traverses the odd-numbered fields, saving each name and its value in a hash that is printed at the end. It yields:
session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9
As far as I know, in awk $1 and $2 refer to the first and second fields of a record from the file. But can $1 and $2 be used to refer to the first and second fields of a variable? For example, if session=5 is stored in a variable, I would like $1 to refer to 'session' and $2 to '5'. Thank you.
Input File
session=123
process=90
customer=145
session=123
customer=198
process=90
CODE
awk '$1 ~ /^Session|^CustomerId/' hi|xargs -L 1 -I name '{if (!($1 SUBSEP $2 in a)) {ids[$1]++; a[$1, $2]}} END {for (id in ids) {print "Count of unique", id, " " ids[id]}}'
DETAILS
I will pass the output of the first command through a pipe to xargs, where each line is read into the "name" variable. Now $1 should correspond to the first field of that xargs variable; this is my question.
Output
Count of unique sessions=2
Count of unique customer=2
If you want to limit the script to only including "session" and "customer" all you have to do is add the regex to the main script as a selector:
awk -F= '$1 ~ /^(session|customer)$/ {if (!($1 SUBSEP $2 in a)) {ids[$1]++; a[$1, $2]}} END {for (id in ids) {print "Count of unique", id, " " ids[id]}}'
If what you're looking for is a count of unique customers and sessions, then this might do:
awk -F= '
$1~/^(session|customer)$/ && !seen[$0] {
seen[$0]=1;
count[$1]++;
}
END {
printf("Count of sessions: %d\n", count["session"]);
printf("Count of customers: %d\n", count["customer"]);
}' hi
In addition to keeping a count, this keeps an associative array of lines that have contributed a count, to avoid counting lines a second time - thus making it a unique count.
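The same "seen" bookkeeping can be folded into the pattern itself with the common !seen[$0]++ idiom. A short sketch on the question's sample input; the helper name count_unique and the name=count output format are choices made for the demo. Since session=123 appears twice in that input, a truly unique count gives 1 session and 2 customers.

```shell
# count_unique is an illustrative helper name; it reads name=value lines on stdin.
count_unique() {
    awk -F= '
        # seen[$0]++ is 0 (false) only the first time a given line appears
        $1 ~ /^(session|customer)$/ && !seen[$0]++ { count[$1]++ }
        END {
            printf "session=%d\n", count["session"]
            printf "customer=%d\n", count["customer"]
        }'
}
printf '%s\n' session=123 process=90 customer=145 session=123 customer=198 process=90 | count_unique
```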
Use the field separator, which can be specified inside the BEGIN block as FS="separator", or as a command-line option to awk via -F "separator". This answer covers only the point asked by the question; it does not address the final output.
awk -F"=" '$1 == "session" ||
$1 == "customer" { ids[$1]++ } # do whatever you need with the counters.
END { for (id in ids) {
print "Count", id "=" ids[id] }}' hi
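For the literal question of getting "fields" out of a value held in a variable rather than out of an input record, awk's split() function is one option. A minimal sketch, where the helper name split_pair and the awk variable pair are made up for the example:

```shell
# split_pair is an illustrative helper; the string to split is passed in with -v.
split_pair() {
    awk -v pair="$1" 'BEGIN {
        n = split(pair, part, "=")    # part[1] = name, part[2] = value, n = number of pieces
        print part[1], part[2]
    }'
}
split_pair 'session=5'    # prints "session 5"
```

Here part[1] and part[2] play the role that $1 and $2 play for an input record.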
Why don't you just try an all-awk solution? It's simpler:
awk -F "=" '$1 ~ /customer|session/ { name[$1]++ } END { for (var in name) print "Count of unique", var"="name[var] }' hi
Results:
Count of unique customer=2
Count of unique session=2
Is there some other reason you need to pipe to xargs?
HTH
Yet another alternative would be:
awk -F "=" '$1 ~ /customer|session/ {print $1}' hi | sort | uniq -c | awk '{print "Count of unique "$2"="$1}'
Here is the answer to the question you deleted:
This is self-contained AWK script based on an answer of mine to one of your earlier questions:
#!/usr/bin/awk -f
/^Customer=/ {
mc[$0, prev]++
if (!($0 in cseen)) {
cust[++custc] = $0
ids["Customer"]++
}
cseen[$0]
}
/^Merchant=/ {
prev = $0
if (!($0 in mseen)) {
merch[++merchc] = $0
ids["Merchant"]++
}
mseen[$0]++
}
END {
for (id in ids) {
print "Count of unique", id, ids[id]
}
for (i = 1; i <= merchc; i++) {
merchant = merch[i]
print "Customers under (" merchant ") is " mseen[merchant]
for (j = 1; j <= custc; j++) {
customer = cust[j]
if (customer SUBSEP merchant in mc) {
print "(" customer ") under (" merchant ") is " mc[customer, merchant]
}
}
}
}
Set it to be executable and run it:
$ chmod u+x customermerchant
$ ./customermerchant data.txt