RegEx help in an awk script - awk

So I have a log file that contains entries like this:
[STAT] - December 11, 2017 13:16:05.360
.\something.cpp(99): [Text] Code::Open Port 1, baud 9600, parity 0, Stop bits 0, flow control 0
[STAT] - December 11, 2017 13:20:24.637
.\something\more\code.cpp(100): [log]
fooBarBaz[32] = 32, 1, 2, 7, 3, 1092, 5, 196875, 6, 270592, 20, 196870, 8, 289, 30, 196867, 11, 1156, 5, 196875, 28, 278784, 5, 196874, 32, 266496, 30, 6866, 36, 147712, 5, 196874,
[STAT] - December 11, 2017 13:20:40.939
.\something\more\code.cpp(100): [log]
fooBarBaz[8] = 8, 1, 2, 1, 31, 532992, 5, 196875,
[STAT] - December 11, 2017 13:18:16.214
.\something\more\code.cpp(100): [log]
fooBarBaz[12] = 12, 1, 2, 2, 17, 296960, 10, 196872, 51, 1792, 50, 196878,
On the command line, I can do this:
gawk -F', *' '/fooBarBaz\[[^0].*\]/ {for (f=5; f<=NF; f+=4) print $f | "sort -n" }' log
Which produces an output like this:
3
6
8
11
17
28
31
32
36
51
I'd like to have an awk script do the same thing, but my efforts so far haven't
worked.
#!/usr/local/bin/gawk -f
BEGIN { print "lines"
FS=", *";
/fooBarBaz\[[^0].*\]/
}
{
{for (f=5; f<=NF; f+=4) print $f}
}
I don't think my regular expression statement is in the right place, because
running gawk -f script.awk prints lines not relevant to my data.
What am I doing wrong?
tl;dr: On lines with fooBarBaz and not [0], I want to pull out the digits at field 5 and then every 4th field after that, to the end of the line.

Optimized GNU awk solution:
parse_digits.awk script:
#!/usr/bin/gawk -f
# Requires gawk: PROCINFO["sorted_in"] is a GNU awk extension.
BEGIN {
    FS = ", *"
    PROCINFO["sorted_in"] = "#ind_num_asc"   # traverse arrays in ascending numeric index order
    print "lines"
}
/fooBarBaz\[[1-9][0-9]*\]/ {                 # skip fooBarBaz[0...] lines
    for (i = 5; i <= NF; i += 4)
        if ($i != "") a[$i]                  # collect values as array indices
}
END {
    for (i in a) print i                     # printed in ascending numeric order
}
Usage:
gawk -f parse_digits.awk inputfile
Note that gawk is required: PROCINFO["sorted_in"]="#ind_num_asc" makes the END loop traverse the array in ascending numeric index order, which replaces the external sort -n from the one-liner. Storing the values as array indices also de-duplicates them.
The output:
lines
3
6
8
11
17
28
31
32
36
51
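If your awk isn't GNU awk, a minimal portable sketch that keeps the external sort -n from the original one-liner (add -u to sort if you also want the de-duplication the array version gives you):
awk -F', *' '/fooBarBaz\[[1-9][0-9]*\]/ {
    for (f = 5; f <= NF; f += 4) print $f | "sort -n"
}' inputfile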

Related

How do I sum values of a column cumulatively with awk?

I have a sample.csv and want to sum it cumulatively by column, as below:
Input csv: Output csv:
01/01/2020, 0, 0, 2, 1 01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 1, 3 18/04/2022, 7, 5, 3, 4
01/05/2022, 8,21, 9, 4 01/05/2022,15,26,12, 8
I've tried
awk '{ for (i=1; i<=NF; ++i) {sum[i]+=$i; $i=sum[i] }; print $0}' sample.csv
But it returns this instead:
Input csv: Output csv:
01/01/2020, 0, 0, 2, 1 01/01/2020, 0, 0, 2, 1, 0, 0, 0, 0, 0
18/04/2022, 7, 5, 1, 3 18/04/2022, 7, 5, 1, 3, 0, 0, 0, 0, 0
01/05/2022, 8,21, 9, 4 01/05/2022, 8,21, 9, 4, 0, 0, 0, 0, 0
I'm at a loss as to how to resolve this.
Note: I am writing this in a bash script, not the terminal. And I'm not allowed to use any tools other than awk for this
I can't duplicate your output. Other than whitespace mangling, this seems to do what you want:
awk '{ for (i=2; i<=NF; i+=1) {
sum[i]+=$i; $(i)=sum[i];
}; print $0 }' FS=, OFS=, sample.csv
To get the whitespace you want, you could do:
awk '{
    for (i=2; i<=NF; i+=1) {
        sum[i]+=$i; $(i)=sum[i];
    }
    printf "%s,%2d,%2d,%2d,%2d\n", $1, $2, $3, $4, $5
}' FS=, sample.csv
If you don't know the number of columns, you could write that final printf in a loop.
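For instance, a sketch of that loop (assuming every numeric column keeps the same two-character padding):
awk '{
    printf "%s", $1                 # date column, unchanged
    for (i=2; i<=NF; i+=1) {
        sum[i]+=$i                  # running total per column
        printf ",%2d", sum[i]
    }
    printf "\n"
}' FS=, sample.csv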
Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.9.9.6, and macos nawk:
——————————————————————————————————————————————
# gawk profile, created Thu May 19 15:59:38 2022
function fmt(_) {
    return +_<=_^(_<_) \
        ? "" : $_ = sprintf("%5.f",___[_]+=$_)
}
BEGIN {
    split(sprintf("%0*.f",((__=++_)+(++_*_))^++_,!_),___,"")
    OFS = ", "
    FS = "[,][ ]*"
}
{
    _ = NF
    while(_!=__) { fmt(_--) }
}
_
——————————————————————————————————————————————
01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 3, 4
01/05/2022, 15, 26, 12, 8
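For readers who don't enjoy the golfing, here is my reading of what that profile boils down to (a sketch, not the author's own expansion):
awk '{
    for (i = 2; i <= NF; i++)               # field 1 is the date; leave it alone
        $i = sprintf("%5.f", sum[i] += $i)  # cumulative sum, right-aligned to width 5
} 1' FS="[,][ ]*" OFS=", " sample.csv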

This prime generating function using generateSequence in Kotlin is not easy to understand. :(

val primes = generateSequence(2 to generateSequence(3) { it + 2 }) {
    val currSeq = it.second.iterator()
    val nextPrime = currSeq.next()
    nextPrime to currSeq.asSequence().filter { it % nextPrime != 0 }
}.map { it.first }
println(primes.take(10).toList()) // prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
I've tried to understand how this function works, but it's not easy for me.
Could someone explain how it works? Thanks.
It generates an infinite sequence of primes using the "Sieve of Eratosthenes" (see here: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes).
This implementation uses a sequence of pairs to do this. The first element of every pair is the current prime, and the second element is a sequence of integers larger than that prime that are not divisible by any previous prime.
It starts with the pair 2 to [3, 5, 7, 9, 11, 13, 15, 17, ...], which is given by 2 to generateSequence(3) { it + 2 }.
Using this pair, we create the next pair of the sequence by taking the first element of the sequence (which is now 3), and then removing all numbers divisible by 3 from the sequence (removing 9, 15, 21 and so on). This gives us this pair: 3 to [5, 7, 11, 13, 17, ...]. Repeating this pattern will give us all primes.
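To see a single step of that in isolation (a standalone sketch; odds and afterThree are just illustrative names):
val odds = generateSequence(3) { it + 2 }     // 3, 5, 7, 9, 11, 13, 15, ...
val afterThree = odds.filter { it % 3 != 0 }  // drop the multiples of 3
println(afterThree.take(6).toList())          // [5, 7, 11, 13, 17, 19]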
After creating a sequence of pairs like this, we are finally doing .map { it.first } to pick only the actual primes, and not the inner sequences.
The sequence of pairs will evolve like this:
2 to [3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, ...]
3 to [5, 7, 11, 13, 17, 19, 23, 25, 29, ...]
5 to [7, 11, 13, 17, 19, 23, 29, ...]
7 to [11, 13, 17, 19, 23, 29, ...]
11 to [13, 17, 19, 23, 29, ...]
13 to [17, 19, 23, 29, ...]
// and so on

How to use match in AWK for a given set of data

I'm working on processing some results obtained from a curl command in AWK but despite reading about match and regexps I'm still having some issues. I've got everything written, but in a really hackish way that's using a lot of substr and really basic match usage without capturing anything with a regexp.
My real data is a bit more complicated, but here's a simplified version. Assume the following is stored in a string, str:
[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}][{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]
Some things to note about this data:
Note that there are 3 "sets" of data delimited by {} in the first brackets [] and 2 sets in the second brackets. The string always has at least one set of data in each set of brackets, and at least one set of brackets (i.e. it will never be the empty string and will always have SOME valid data in it)
Brackets are also used for the DataC data, so those need to be considered in some way
No punctuation will ever appear in the string aside from delimiters -- all actual data is alphanumeric
The fields DataA, DataBee, and DataC will always have those names
The data for DataC will always be exactly 5 numbers, separated by commas
What I'd like to do is write a loop that will go through the string and pull out the values -- a = whatever DataA is (200 in the first case), b = whatever DataBee is (63500 in the first case), and c[1] through c[5] containing the values from DataC.
I feel like if I could just get ideas about how to do this for the above data I could run with it to adapt it to my needs. As of right now the loop I have for this using substr is like 30 lines long :(
For fun, using awk:
I use "complex" FS and RS variables to split the JSON. This way, I have at most one value per column and one datum per line (DataA, DataBee, DataC).
To understand the usage of FS and RS, see how this command works:
awk -F",|\":\"|:\\\[" '
{$1=$1}1
' OFS="\t" RS="\",\"|},{|\\\]" file
(you can replace file with <(curl <your_url>) or <(echo <your_json_str>))
Returns:
[{"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
"DataA 190
DataBee 63100
DataC" 55 22 64 838 2
"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
}
[{"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
}
Now it looks like something I can use with awk:
awk -F",|\":\"|:\\\[" '
/DataA/{a=$2}
/DataBee/{b=$2}
/DataC/{for(i=2;i<=NF;i++){c[i-1]=$i}}
a!=""&&b!=""&&c[1]!=""{
print "a: ", a;
print "b: ", b;
printf "c: ";
for(i in c){
printf "%s, ", c[i]
};
print "";
a=""; b=""; c[1]=""
}
' RS="\",\"|},{|\\\]" file
This command stores the values in variables and prints them once a, b, and c are all set.
Returns:
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 190
b: 63100
c: 55, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
For fun, using awk, match, and this excellent answer (note that the three-argument form of match is a GNU awk extension):
awk '
function find_all(str, patt) {
    while (match(str, patt, a) > 0) {
        for (i=1; i in a; i++) print a[i]   # print each capture group
        str = substr(str, RSTART+RLENGTH)   # resume the search after this match
    }
}
{
    print "Catching DataA"
    find_all($0, "DataA\":\"([0-9]*)")
    print "Catching DataBee"
    find_all($0, "DataBee\":\"([0-9]*)")
    print "Catching DataC"
    find_all($0, "DataC\":.([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)")
}
' file
Returns
Catching DataA
200
190
200
200
200
Catching DataBee
63500
63100
63500
63500
63500
Catching DataC
3
22
64
838
2
55
22
64
838
2
3
22
64
838
2
3
22
64
838
2
3
22
64
838
2
Now that you've seen how ugly it is, see how easy it can be using Python:
import json

data_str = '[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}][{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]'
while data_str:
    data, index = json.JSONDecoder().raw_decode(data_str)
    for element in data:
        print("DataA: ", element["DataA"])
        print("DataBee: ", element["DataBee"])
        print("DataC: ", element["DataC"])
    data_str = data_str[index:]
Returns:
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 190
DataBee: 63100
DataC: [55, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
This solution is not only cleaner, it is more robust if you have unexpected results or unexpected formatting.
I would recommend using jq, e.g.:
jq -c '.[]' <<<"$str"
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
To extract DataC:
jq -c '.[] | .DataC' <<<"$str"
Output:
[3,22,64,838,2]
[55,22,64,838,2]
[3,22,64,838,2]
[3,22,64,838,2]
[3,22,64,838,2]
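And if you want all three fields per object in one pass, something along these lines should work (the exact format string is just an example):
jq -r '.[] | "\(.DataA) \(.DataBee) \(.DataC | map(tostring) | join(","))"' <<<"$str"
200 63500 3,22,64,838,2
190 63100 55,22,64,838,2
200 63500 3,22,64,838,2
200 63500 3,22,64,838,2
200 63500 3,22,64,838,2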

Using awk to perform an operation on the second value, only on lines with a specific string

OK, I have this, which works. Yay!
awk '/string/ { print $0 }' file1.csv | awk '$2=$2+(int(rand()*10))' > file2.csv
But I want the context of my file also printed into file2.csv. This program only prints the lines which contain my string, which is good; it's a start.
I can't simply apply the operation to the $2 value on every line, because the values of $2 on lines that don't contain my string are not to be changed.
So, I want the original (file1.csv) contents intact, with the only difference being an adjusted value at $2 on lines matching my string.
Can someone help me? Thank you.
And here are 4 lines from the original csv:
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30720, String, 0, 76, 100
2, 32620, String, 0, 76, 0
Expected output is the same aside from small variations to $2:
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30725, String, 0, 76, 100
2, 32621, String, 0, 76, 0
$ awk 'BEGIN{FS=OFS=", "} /String/{$2+=int(rand()*10)}1' file
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30722, String, 0, 76, 100
2, 32622, String, 0, 76, 0
You probably need to initialize the random seed with srand() in the BEGIN block as well; otherwise you'll always get the same sequence from rand().
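For example (a sketch: srand() with no argument seeds from the current time, so each run differs):
$ awk 'BEGIN{srand(); FS=OFS=", "} /String/{$2+=int(rand()*10)}1' file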
Something like this, maybe:
$ awk -F", " -v OFS=", " '$3 ~ /String/ { $2=$2+(int(rand()*10))} { print $0 }' file1.csv
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30722, String, 0, 76, 100
2, 32622, String, 0, 76, 0

Getting each field following regular pattern with awk

Having an input text file as below:
1234, aaa = 34 </T><AT/>X_CONST = 34 </T><AT/>AAA_Z = 3 </T><AT/>Y_CONST = 34 </T><AT/>FOUND_ME_1 = 5 </T><AT/>BBB_X = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 8 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>FOUND_ME_4 = 10 </T><AT/>X_CONST = 34
7844, aaa = 33 </T><AT/>X_CONST = 21 </T><AT/>AAA_Z = 3 </T><AT/>R_CONST = 34 </T><AT/>FOUND_ME_1 = 50 </T><AT/>BBB_X = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 81 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>X_CONST = 55
8888, aaa = 31 </T><AT/>X_CONST = 21 </T><AT/>AAA_Z = 3 </T><AT/>R_CONST = 34 </T><AT/>FOUND_ME_1 = 54 </T><AT/>BBB_Z = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 81 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>FOUND_ME_4 = 11 </T><AT/>X_CONST = 55 </T><AT/>FOUND_ME_5 = 8 </T><AT/>TTT_X = 8 </T><AT/>FOUND_ME_6 = 20
I need to extract all the values related to the field FOUND_ME_[0-9], possibly with awk. I know it would be easier after converting each field to a separate line, but I'm looking for a solution that works with the file as it is.
My goal is to have an output like the following (values separated by commas)
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
I'm trying the following but no luck:
awk '{for(i=1;i<=NF;i++){ if($i==".*FOUND_ME_[0-9]"){print $($i+2)} } }'
I also have problems with this regular expression pattern: FOUND_ME_[0-9]
This awk script gets you the output you want (although I'm guessing that file might have started out as XML once upon a time...):
$ cat script.awk
BEGIN { FS = "[[:space:]=]+" }
{
    s = ""
    for (i = 1; i <= NF; ++i)
        if ($i ~ /FOUND_ME_[0-9]/)
            s = s sprintf("%s, ", $(++i))
    print substr(s, 1, length(s) - 2)
}
$ awk -f script.awk file
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
It builds a string s from the field after each one matching the pattern. sprintf("%s, ", $(++i)) returns the value of the next field followed by a comma and a space. $(++i) increments the field number i and then returns the value of the field. In awk, strings are concatenated, so the string returned by sprintf is added to the existing value of s.
I set the field separator FS to one or more space or = characters, so the field you're interested is the one after the one matching the pattern. Note that I'm using ~ to match a regex pattern - you cannot use == as you were doing as this performs a string comparison.
The substr strips the trailing ", " from the string before it is printed.
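A quick way to convince yourself of the == versus ~ difference:
$ awk 'BEGIN { print ("abc" == "^a"), ("abc" ~ "^a") }'
0 1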
A much shorter option, inspired by Kent's use of FPAT on GNU awk (note that FPAT requires gawk >= 4.0):
$ awk -v FPAT="FOUND_ME_[0-9] *= *[0-9]+" -v OFS=", " '{$1=$1;gsub(/FOUND_ME_[0-9] *= */,"")}1' file
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
$1=$1 causes awk to "touch" each record, removing the parts which aren't matched by FPAT. gsub performs a global substitution, removing the part we aren't interested in. 1 at the end is always true, so the default action {print} is performed. Setting the OFS variable causes each field in the output to be comma-separated as desired.
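If FPAT is new to you, this tiny example (with illustrative input) shows how it defines fields by what they match rather than by separators:
$ echo 'x FOUND_ME_1 = 5 y FOUND_ME_2 = 8 z' |
  awk -v FPAT="FOUND_ME_[0-9] *= *[0-9]+" '{ print NF; print $1; print $2 }'
2
FOUND_ME_1 = 5
FOUND_ME_2 = 8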
gawk has FPAT, which we could use for this problem:
awk -v FPAT="FOUND_ME_[0-9] *= *[0-9]+" '
{for(i=1;i<=NF;i++){sub("FOUND_ME_[0-9] *= *","",$i);
printf "%s%s",$i,(NF==i?"\n":", ")}}' file
output:
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20