Delete elements during loop awk semantics - awk

Assume the loop returns k==0 first (the iteration order is implementation-dependent according to the spec). How many times should the loop body run, once or twice? If twice, what should be printed for arr[1]?
BEGIN {
    arr[0] = "zero";
    arr[1] = "one";
    for (k in arr) {
        print "key " k " val " arr[k];
        delete arr[k+1]
    }
}
$ gawk --version
GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.1)
....
$ gawk 'BEGIN { arr[0] = "zero"; arr[1] = "one"; for (k in arr) { print "key " k " val " arr[k]; delete arr[k+1] } }'
key 0 val zero
key 1 val
$ goawk --version
v1.19.0
$ goawk 'BEGIN { arr[0] = "zero"; arr[1] = "one"; for (k in arr) { print "key " k " val " arr[k]; delete arr[k+1] } }'
key 0 val zero
gawk runs the loop twice, with arr[1] == "", while goawk runs it once. mawk (mawk 1.3.4 20200120) visits the keys in the order 1,0 but has the same fundamental behavior as gawk: it loops twice and prints the empty string for the deleted key. What is the POSIX-defined expected behavior of this program?
Essentially: should keys deleted in earlier iterations appear in later iterations?

According to the POSIX spec:
The results of adding new elements to array within such a for loop are
undefined
but it doesn't define what happens if you delete them other than:
The delete statement shall remove an individual array element
However, according to the GNU AWK manual:
As a point of information, gawk sets up the list of elements to be iterated over before the loop starts, and does not change it. But not all awk versions do so.
so the behavior is undefined by POSIX, defined for GNU AWK, and you'd have to check the man page for every other AWK to see what it does.
Decide which behavior you want; then, to get that behavior robustly and portably in all awks, write whichever one of these you want:
gawk's behavior:
BEGIN {
    arr[0] = "zero";
    arr[1] = "one";
    for (k in arr) {
        indices[k]
    }
    for (k in indices) {
        print "key " k " val " arr[k];
        delete arr[k+1]
    }
}
goawk's apparent behavior, from your example:
BEGIN {
    arr[0] = "zero";
    arr[1] = "one";
    for ( k in arr ) {
        indices[k]
    }
    for (k in indices) {
        if ( k in arr ) {
            print "key " k " val " arr[k];
            delete arr[k+1]
        }
    }
}
Notes on your code in general:
for ( k in ... ) can visit the indices in any order, so relying on delete arr[k+1] to delete an element of arr[] isn't robust: for example, if in decides to start with k set to the last index in the array, your first iteration through the loop tries to delete an index past the end of the array.
All builtin and generated awk arrays, fields, and strings start at index 1, not 0, so don't create your own arrays starting at 0. Start them at 1 so you don't have to remember which type of array it is when writing code to visit the indices, and inevitably trip over that difference at some point.
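One related point worth checking in practice: while adding elements during for (k in ...) is explicitly undefined, deleting the element the loop is currently visiting is documented as safe by gawk and behaves consistently in the common awks; it is deleting other elements mid-loop (as with delete arr[k+1] above) that varies. A minimal sketch of the empty-the-array idiom, assuming a POSIX-ish awk on PATH:

```shell
# Delete only the current element on each iteration; count iterations and
# then count what is left afterwards.
out=$(awk 'BEGIN {
    arr[0] = "zero"; arr[1] = "one"; arr[2] = "two"
    for (k in arr) {
        n++            # count loop iterations
        delete arr[k]  # delete the element currently being visited
    }
    m = 0
    for (k in arr) m++ # count surviving elements
    print n, m
}')
echo "$out"            # expect: 3 0 (all three visited, none left)
```

If you need this guaranteed by the standard rather than by each implementation's documentation, the copy-the-keys-first approach shown above is the safe choice.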

Apparently mawk-1 is rather unique in this respect:
mawk 1.3.4
% mawk 'BEGIN {
arr[0] = "zero";
arr[1] = "one";
for (k in arr) {
print "key " k " val " arr[k];
delete arr[k+1]
}
}'
key 1 val one <<<<<<
key 0 val zero
Furthermore, setting the WHINY_USERS environment variable changes its behavior (merely defining it, even with an empty value, suffices):
WHINY_USERS= mawk 'BEGIN {
arr[0] = "zero";
arr[1] = "one";
for (k in arr) {
print "key " k " val " arr[k];
delete arr[k+1]
}
}'
key 0 val zero
key 1 val
% mawk 'BEGIN {
arr[0] = "zero";
arr[1] = "one";
for (k in arr) {
print "key " k " val " arr[k];
delete arr[k+1]
}
}'
key 1 val one
key 0 val zero
mawk-2 (beta-1.9.9.6)
% mawk2 'BEGIN {
arr[0] = "zero";
arr[1] = "one";
for (k in arr) {
print "key " k " val " arr[k];
delete arr[k+1]
}
}'
key 0 val zero
key 1 val
gawk 5.2.0
gawk -e 'BEGIN {
arr[0] = "zero";
arr[1] = "one";
for (k in arr) {
print "key " k " val " arr[k];
delete arr[k+1]
}
}'
key 0 val zero
key 1 val
nawk 20200816
% nawk 'BEGIN {
arr[0] = "zero";
arr[1] = "one";
for (k in arr) {
print "key " k " val " arr[k];
delete arr[k+1]
}
}'
key 0 val zero

Related

Awk create a new array of unique values from another array

I have my array:
array = [1:"PLCH2", 2:"PLCH1", 3:"PLCH2"]
I want to loop over array to create a new array, unique, of unique values, and obtain:
unique = [1:"PLCH2", 2:"PLCH1"]
How can I achieve that?
EDIT: as per @Ed Morton's request, I show below how my array is populated. In fact, this post is the key solution to my previous post.
in my file.txt, I have:
PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L
INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
I use split to obtain array:
awk '{
    split($0,a,"&")
    for ( i in a ) {
        split(a[i], b, ":");
        array[i] = b[1];
    }
}' file.txt
This might be what you're trying to do:
$ cat tst.awk
BEGIN {
    split("PLCH2 PLCH1 PLCH2",array)
    printf "array ="
    for (i=1; i in array; i++) {
        printf " %s:\"%s\"", i, array[i]
    }
    print ""
    for (i=1; i in array; i++) {
        if ( !seen[array[i]]++ ) {
            unique[++j] = array[i]
        }
    }
    printf "unique ="
    for (i=1; i in unique; i++) {
        printf " %s:\"%s\"", i, unique[i]
    }
    print ""
}
$ awk -f tst.awk
array = 1:"PLCH2" 2:"PLCH1" 3:"PLCH2"
unique = 1:"PLCH2" 2:"PLCH1"
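The !seen[array[i]]++ test above is the standard awk dedup idiom: the first time a value is seen, seen[val] is 0, so !seen[val] is true and the element is kept; the post-increment makes every later occurrence test false. The same idiom works directly on input lines, as a minimal sketch:

```shell
# Keep only the first occurrence of each line.
out=$(printf 'PLCH2\nPLCH1\nPLCH2\n' | awk '!seen[$0]++')
echo "$out"   # prints PLCH2 then PLCH1; the duplicate PLCH2 is dropped
```

Because the pattern is true for first occurrences and has no action, awk applies its default action, print, exactly as in the unique[++j] loop above.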
EDIT: given your updated question, here's how I'd really approach that:
$ cat tst.awk
BEGIN { FS="[:&]" }
{
    delete vals      # clear both per-record arrays so a shorter record
    delete uniq      # doesn't inherit stale entries from the previous one
    numVals=0
    for (i=1; i<NF; i+=2) {
        vals[++numVals] = $i
    }
    print "vals =" arr2str(vals)
    delete seen
    numUniq=0
    for (i=1; i<=numVals; i++) {
        if ( !seen[vals[i]]++ ) {
            uniq[++numUniq] = vals[i]
        }
    }
    print "uniq =" arr2str(uniq)
}
function arr2str(arr, str, i) {
    for (i=1; i in arr; i++) {
        str = str sprintf(" %s:\"%s\"", i, arr[i])
    }
    return str
}
$ awk -f tst.awk file
vals = 1:"PLCH2" 2:"PLCH1" 3:"PLCH2"
uniq = 1:"PLCH2" 2:"PLCH1"
vals = 1:"INTS11" 2:"INTS11" 3:"INTS11" 4:"INTS11" 5:"INTS11"
uniq = 1:"INTS11"

how to optimize this awk script?

I browse 2 files with awk. I read the first file and store the columns I need in arrays, then use these arrays to compare against column 8 of the second file.
My script runs very slowly. I would like to know if there is a way to optimize it.
FNR==NR
{
a[$1];
ip[NR]=$1;
site[NR]=$2;
next
}
BEGIN{
FS="[\t,=]";
OFS="|";
}
sudo awk -f{
l=length(ip);
if($8 in a)
{
for(k=0;k<=l;k++)
{
if(ip[k]== $8)
{
if(NF <= 70)
{
print "siteID Ipam: "site[k],"siteID zsc: "$14,"date: " $4,"src: "$8,"dst: "$10,"role: "$22,"urlcategory: "$36, "urlsupercategory: "$38,"urlclass: "$40;
}
else
{
print "siteID Ipam: "site[k], "siteID zsc: "$14,"date: " $4, "src: " $8, "dst: " $10, "role: "$22, "urlcategory: " $37, "urlsupercategory: "$39, "urlclass: $41;
}
break;
}
}
}
else
{
print $8 " is not in referentiel ";
}
}
Here is the same code, better formatted, with the initial typo kept:
BEGIN {
FS = "[\t,=]";
OFS = "|";
}
FNR == NR {
a[$1];
ip[NR] = $1;
site[NR] = $2;
next;
}
sudo awk -f {
l = length(ip);
if($8 in a) {
for(k = 0; k <= l; k++) {
if(ip[k] == $8) {
if(NF <= 70) {
print "siteID Ipam: "site[k],"siteID zsc: "$14,"date: " $4,"src: "$8,"dst: "$10,"role: "$22,"urlcategory: "$36, "urlsupercategory: "$38,"urlclass: "$40;
}
else {
print "siteID Ipam: "site[k], "siteID zsc: "$14,"date: " $4, "src: " $8, "dst: " $10, "role: "$22, "urlcategory: " $37, "urlsupercategory: "$39, "urlclass: $41;
}
break;
}
}
} else {
print $8 " is not in referentiel ";
}
}
Suggestions:
fix sudo awk -f typo.
a[$1]; --> a[$1] = 1;
($8 in a) --> (a[$8])
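Putting those suggestions together: the real cost is the inner for loop, a linear scan over every stored ip for each line of the second file. Since column 1 of the first file is already used as an array key, the site value can be stored directly under that key and looked up in constant time, eliminating the loop entirely. A sketch with hypothetical sample data (the referentiel/log file names, field layout, and the assumption that each ip appears only once in the first file are all illustrative; a shortened print is used in place of the full field list):

```shell
# Hypothetical sample data: referentiel maps ip -> siteID; the log has the ip in $8.
cat > referentiel.txt <<'EOF'
10.0.0.1	site-A
10.0.0.2	site-B
EOF
printf 'a,b,c,d,e,f,g,10.0.0.1\na,b,c,d,e,f,g,10.9.9.9\n' > log.txt

out=$(awk '
BEGIN { FS = "[\t,=]"; OFS = "|" }
FNR == NR { site[$1] = $2; next }   # key the lookup by ip: O(1) per log line
{
    if ($8 in site)
        print "siteID Ipam: " site[$8], "src: " $8
    else
        print $8 " is not in referentiel"
}' referentiel.txt log.txt)
echo "$out"
```

If an ip can map to several sites, append to the value (site[$1] = site[$1] " " $2) instead of overwriting; the point is that a single hash lookup replaces the k loop.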

awk running total count and sum (Cont)

In continuation of the previous post: how to calculate the 80%-20% rule contribution of vendors on a daily basis ($1) AND by Region ($2)?
The input file is already sorted by Date & Region, and by Amount from highest to lowest.
Input.csv
Date,Region,Vendor,Amount
5-Apr-15,east,aa,123
5-Apr-15,east,bb,50
5-Apr-15,east,cc,15
5-Apr-15,south,dd,88
5-Apr-15,south,ee,40
5-Apr-15,south,ff,15
5-Apr-15,south,gg,10
7-Apr-15,east,ii,90
7-Apr-15,east,jj,20
From the above input, based on the Date ($1) AND Region ($2) fields, populate the running sum of Amount, then calculate the percentage of the running sum for that day & region:
Date,Region,Vendor,Amount,RunningSum,%RunningSum
5-Apr-15,east,aa,123,123,65%
5-Apr-15,east,bb,50,173,92%
5-Apr-15,east,cc,15,188,100%
5-Apr-15,south,dd,88,88,58%
5-Apr-15,south,ee,40,128,84%
5-Apr-15,south,ff,15,143,93%
5-Apr-15,south,gg,10,153,100%
7-Apr-15,east,ii,90,90,82%
7-Apr-15,east,jj,20,110,100%
Once that is derived, vendors up to and including the first line that reaches 80% count as the 80% contribution; the remaining line items count as the 20% contribution.
Date,Region,Countof80%Vendor, SumOf80%Vendor, Countof20%Vendor, SumOf20%Vendor
5-Apr-15,east,2,173,1,15
5-Apr-15,south,2,128,2,25
7-Apr-15,east,1,90,1,20
This awk script will do the first part for you; ask if you need clarification. It stores the values in arrays and prints out the requested info after parsing the document. (Note that it uses arrays of arrays, such as cities[i][counts[i]], which is a gawk extension.)
awk -F',' 'BEGIN{OFS=FS}
NR==1{print $0, "RunningSum", "%RunningSum"}
NR!=1{
    if (date == $1 && region == $2) {
        counts[i]++
        cities[i][counts[i]] = $3
        amounts[i][counts[i]] = $4
        rsum[i][counts[i]] = rsum[i][counts[i] - 1] + $4
    } else {
        date = $1; region = $2
        dates[++i] = $1
        regions[i] = $2
        counts[i] = 1
        cities[i][1] = $3
        amounts[i][1] = $4
        rsum[i][1] = $4
    }
}
END{
    for(j=1; j<=i; j++) {
        total = rsum[j][counts[j]];
        for (k=1; k<=counts[j]; k++) {
            print dates[j], regions[j], cities[j][k], amounts[j][k], rsum[j][k], int(rsum[j][k]/total*100) "%"
        }
        if (j != i) { print "" }
    }
}' yourfilename
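If gawk isn't available, the same first-part output can be produced with plain POSIX arrays by reading the file twice: the first pass collects per-(date,region) totals, the second prints the running sum and percentage. A sketch on the sample Input.csv; note it uses int(x + 0.5) rounding, which matches the 58%/84% figures in the requested output, whereas the int() truncation above would give 57%/83%:

```shell
cat > Input.csv <<'EOF'
Date,Region,Vendor,Amount
5-Apr-15,east,aa,123
5-Apr-15,east,bb,50
5-Apr-15,east,cc,15
5-Apr-15,south,dd,88
5-Apr-15,south,ee,40
5-Apr-15,south,ff,15
5-Apr-15,south,gg,10
7-Apr-15,east,ii,90
7-Apr-15,east,jj,20
EOF

out=$(awk -F, '
NR == FNR { if (FNR > 1) tot[$1,$2] += $4; next }   # pass 1: per-group totals
FNR == 1  { print $0 ",RunningSum,%RunningSum"; next }
{
    run[$1,$2] += $4
    printf "%s,%d,%d%%\n", $0, run[$1,$2], int(run[$1,$2] / tot[$1,$2] * 100 + 0.5)
}' Input.csv Input.csv)
echo "$out"
```

The (date,region) pair is used as a single subscript, which awk joins with SUBSEP, so no multidimensional arrays are needed.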
The second part can be done like this (using the output of the first awk script):
awk -F'[,%]' 'BEGIN{ OFS="," }
NR==1 || $0 ~ /^$/ {
    over = ""
    record = 1
}
! (NR==1 || $0 ~ /^$/) {
    if (record) {
        dates[++i] = $1
        regions[i] = $2
        record = ""
    }
    if (over) {
        twenty[i]++
        twenties[i] += $4
    } else {
        eighty[i]++
        eighties[i] += $4
    }
    if ($6 >= 80) {
        over = 1
    }
}
END {
    print "Date","Region","Countof80%Vendor", "SumOf80%Vendor", "Countof20%Vendor", "SumOf20%Vendor"
    for (j=1; j<=i; j++) {
        print dates[j], regions[j], eighty[j], eighties[j], twenty[j], twenties[j]
    }
}' output/file/of/first/script

awk '/range start/,/range end/' within script

How do I use the awk range pattern '/begin regex/,/end regex/' within a self-contained awk script?
To clarify, given program csv.awk:
#!/usr/bin/awk -f
BEGIN {
FS = "\""
}
/TREE/,/^$/
{
line="";
for (i=1; i<=NF; i++) {
if (i != 2) line=line $i;
}
split(line, v, ",");
if (v[5] ~ "FOAM") {
print NR, v[5];
}
}
and file chunk:
TREE
10362900,A,INSTL - SEAL,Revise
,10362901,A,ASSY / DETAIL - PANEL,Revise
,,-203,ASSY - PANEL,Qty -,Add
,,,-309,PANEL,Qty 1,Add
,,,,"FABRICATE FROM TEKLAM NE1G1-02-250 PER TPS-CN-500, TYPE A"
,,,-311,PANEL,Qty 1,Add
,,,,"FABRICATE FROM TEKLAM NE1G1-02-750 PER TPS-CN-500, TYPE A"
,,,-313,FOAM SEAL,1.00 X 20.21 X .50 THK,Qty 1,Add
,,,,"BMS1-68, GRADE B, FORM II, COLOR BAC706 (BLACK)"
,,,-315,FOAM SEAL,1.50 X 8.00 X .25 THK,Qty 1,Add
,,,,"BMS1-68, GRADE B, FORM II, COLOR BAC706 (BLACK)"
,PN HERE,Dual Lock,Add
,
10442900,IR,INSTL - SEAL,Update (not released)
,10362901,A,ASSY / DETAIL - PANEL,Revise
,PN HERE,Dual Lock,Add
I want to have this output:
27 FOAM SEAL
29 FOAM SEAL
What is the syntax for adding the command line form '/begin regex/,/end regex/' to the script to operate on those lines only? All my attempts lead to syntax errors and googling only gives me the cli form.
Why not use 2 steps:
% awk '/start/,/end/' < input.csv | awk -f csv.awk
Simply do:
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
}
/from/,/to/ {
    line="";
    for (i=1; i<=NF; i++) {
        if (i != 2) line=line $i;
    }
    split(line, v, ",");
    if (v[5] ~ "FOAM") {
        print NR, v[5];
    }
}
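The key difference from the question's script is that { follows the range pattern on the same line. A range pattern on a line by itself is a complete rule with the default print action, and the bare block that follows it is a separate, unconditional rule, which is why the original script processed every line. A minimal demonstration of the attached-action form (toy data, not the question's file):

```shell
printf 'skip\nTREE\ninside\n\nskip again\n' > sample.txt
# With the action attached, the block runs only from /TREE/ through the blank line:
out=$(awk '/TREE/,/^$/ { print NR ": " $0 }' sample.txt)
echo "$out"
```

Both endpoints of the range are included, so the blank line itself is printed (as "4: ") before the range closes.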
If the from/to regexes are dynamic:
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
    FROM = ARGV[1]
    TO = ARGV[2]
    if (ARGC == 3) { # only the two patterns were given, so force reading from standard input
        ARGV[1] = "-"
    } else {
        ARGV[1] = ARGV[3]
    }
    ARGC = 2         # leave a single input "file" for awk to read
}
{ if ($0 ~ FROM) { p = 1 ; l = 0 } }
{ if ($0 ~ TO)   { p = 0 ; l = 1 } }
{
    if (p == 1 || l == 1) {
        line="";
        for (i=1; i<=NF; i++) {
            if (i != 2) line=line $i;
        }
        split(line, v, ",");
        if (v[5] ~ "FOAM") {
            print NR, v[5];
        }
        l = 0
    }
}
Now you have to call it like: ./scriptname.awk "FROM_REGEX" "TO_REGEX" INPUTFILE. The last param is optional; if it is missing, STDIN is used.
HTH
You need to show us what you have tried. Is there something about /begin regex/ or /end regex/ you're not telling us? Otherwise your script with the additions should work, i.e.
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
}
/begin regex/,/end regex/{
    line="";
    for (i=1; i<=NF; i++) {
        if (i != 2) line=line $i;
    }
    split(line, v, ",");
    if (v[5] ~ "FOAM") {
        print NR, v[5];
    }
}
OR are you using an old Unix where there is old awk as /usr/bin/awk and new awk as /usr/bin/nawk? Also see if you have /usr/xpg4/bin/awk or gawk (the path could be anything).
Finally, show us the error messages you are getting.
I hope this helps.

Ignoring escaped delimiters (commas) with awk?

If I had a string with escaped commas like so:
a,b,{c\,d\,e},f,g
How might I use awk to parse that into the following items?
a
b
{c\,d\,e}
f
g
{
    split($0, a, /,/)                # split on every comma, escaped or not
    j=1
    for(i=1; i<=length(a); ++i) {    # length(array) is not POSIX but widely supported
        if(match(b[j], /\\$/)) {     # previous piece ended in a backslash:
            b[j]=b[j] "," a[i]       # rejoin it with the comma that was split away
        } else {
            b[++j] = a[i]
        }
    }
    for(k=2; k<=length(b); ++k) {
        print b[k]
    }
}
Split into array a, using ',' as delimiter
Build array b from a, merging lines that end in '\'
Print array b (Note: Starts at 2 since first item is blank)
This solution presumes (for now) that ',' is the only character that is ever escaped with '\'--that is, there is no need to handle any \\ in the input, nor weird combinations such as \\\,\\,\\\\,,\,.
{
    gsub("\\\\,", "!Q!")
    n = split($0, a, ",")
    for (i = 1; i <= n; ++i) {
        gsub("!Q!", "\\,", a[i])
        print a[i]
    }
}
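If a made-up marker like !Q! feels risky because it might occur in the data, the same hide-and-restore trick can use awk's built-in SUBSEP variable instead, which defaults to the control character \034 and is very unlikely to appear in text input. A sketch of that variation:

```shell
out=$(printf 'a,b,{c\\,d\\,e},f,g\n' | awk '{
    gsub(/\\,/, SUBSEP)              # hide each escaped comma behind \034
    n = split($0, a, ",")            # now only real delimiters remain
    for (i = 1; i <= n; ++i) {
        gsub(SUBSEP, "\\\\,", a[i])  # restore the literal \, in each field
        print a[i]
    }
}')
echo "$out"
```

The replacement text "\\\\," is deliberate: after string processing it is \\, and gsub then interprets \\ as one literal backslash, avoiding the undefined behavior of a lone backslash before an ordinary character in replacement text.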
I don't think awk has any built-in support for something like this. Here's a solution that's not nearly as short as DigitalRoss's, but should be in no danger of ever accidentally colliding with your data the way a made-up marker string (!Q!) could. Since it tests with an if, you could also extend it to be careful about whether you actually have \\ at the end of a field, which would be an escaped backslash rather than an escaped comma.
BEGIN {
    FS = ","
}
{
    curfield=1
    for (i=1; i<=NF; i++) {
        if (substr($i,length($i)) == "\\") {
            # field ends in a backslash: strip it and glue the next field back on
            fields[curfield] = fields[curfield] substr($i,1,length($i)-1) FS
        } else {
            fields[curfield] = fields[curfield] $i
            curfield++
        }
    }
    nf = curfield - 1
    for (i=1; i<=nf; i++) {
        printf("%d: %s ",i,fields[i])
    }
    printf("\n")
    split("", fields)   # reset so a second input line doesn't inherit these fields
}