Having an input text file as below:
1234, aaa = 34 </T><AT/>X_CONST = 34 </T><AT/>AAA_Z = 3 </T><AT/>Y_CONST = 34 </T><AT/>FOUND_ME_1 = 5 </T><AT/>BBB_X = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 8 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>FOUND_ME_4 = 10 </T><AT/>X_CONST = 34
7844, aaa = 33 </T><AT/>X_CONST = 21 </T><AT/>AAA_Z = 3 </T><AT/>R_CONST = 34 </T><AT/>FOUND_ME_1 = 50 </T><AT/>BBB_X = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 81 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>X_CONST = 55
8888, aaa = 31 </T><AT/>X_CONST = 21 </T><AT/>AAA_Z = 3 </T><AT/>R_CONST = 34 </T><AT/>FOUND_ME_1 = 54 </T><AT/>BBB_Z = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 81 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>FOUND_ME_4 = 11 </T><AT/>X_CONST = 55 </T><AT/>FOUND_ME_5 = 8 </T><AT/>TTT_X = 8 </T><AT/>FOUND_ME_6 = 20
I need to extract all the values for the field FOUND_ME_[0-9], ideally with awk. I know it would be easier if I converted each field to a separate line, but I'm looking for a solution that works with the file as it is.
My goal is to have an output like the following (values separated by commas)
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
I'm trying the following but no luck:
awk '{for(i=1;i<=NF;i++){ if($i==".*FOUND_ME_[0-9]"){print $($i+2)} } }'
I'm also having trouble with the regular expression pattern FOUND_ME_[0-9].
This awk script gets you the output you want (although I'm guessing that file might have started out as XML once upon a time...):
$ cat script.awk
BEGIN { FS = "[[:space:]=]+" }
{
    s = ""
    for (i = 1; i <= NF; ++i)
        if ($i ~ /FOUND_ME_[0-9]/)
            s = s sprintf("%s, ", $(++i))
    print substr(s, 1, length(s) - 2)
}
$ awk -f script.awk file
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
It builds a string s from the field after each one matching the pattern. sprintf("%s, ", $(++i)) returns the value of the next field followed by a comma and a space. $(++i) increments the field number i and then returns the value of the field. In awk, strings are concatenated, so the string returned by sprintf is added to the existing value of s.
I set the field separator FS to one or more space or = characters, so the field you're interested in is the one after the one matching the pattern. Note that I'm using ~ to match a regex pattern - you cannot use == as you were doing, as that performs a string comparison.
The substr strips the trailing ", " from the string before it is printed.
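A small variant of the same idea avoids the final substr by only inserting the separator between values (same FS, same loop):
BEGIN { FS = "[[:space:]=]+" }
{
    s = ""
    for (i = 1; i <= NF; ++i)
        if ($i ~ /FOUND_ME_[0-9]/)
            # add ", " only before the second and later values
            s = s (s == "" ? "" : ", ") $(++i)
    print s
}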
A much shorter option, inspired by Kent's use of FPAT on GNU awk (note that this requires gawk >= 4.0):
$ awk -v FPAT="FOUND_ME_[0-9] *= *[0-9]+" -v OFS=", " '{$1=$1;gsub(/FOUND_ME_[0-9] *= */,"")}1' file
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
$1=$1 causes awk to "touch" each record, removing the parts which aren't matched by FPAT. gsub performs a global substitution, removing the part we aren't interested in. 1 at the end is always true, so the default action {print} is performed. Setting the OFS variable causes each field in the output to be comma-separated as desired.
gawk has FPAT, which we could use for this problem:
awk -v FPAT="FOUND_ME_[0-9] *= *[0-9]+" '
{for(i=1;i<=NF;i++){sub("FOUND_ME_[0-9] *= *","",$i);
printf "%s%s",$i,(NF==i?"\n":", ")}}' file
output:
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
I want to find matching values from two data frames and return a third value.
For example, if cpg_symbol["Gene_Symbol"] corresponds with diff_meth_kirp_symbol.index, I want to assign cpg_symbol.loc["Composite_Element_REF"] as index.
My code returned an empty dataframe.
diff_meth_kirp.index = diff_meth_kirp.merge(cpg_symbol, left_on=diff_meth_kirp.index, right_on="Gene_Symbol")[["Composite_Element_REF"]]
Example:
diff_meth_kirp
        Hello  My  name  is
First       0   1     2   3
Second      4   5     6   7
Third       8   9    10  11
Fourth     12  13    14  15
Fifth      16  17    18  19
Sixth      20  21    22  23
cpg_symbol
  Composite_Element_REF Gene_Symbol
0                   cg1       First
1                   cg2       Third
2                   cg3       Fifth
3                   cg4     Seventh
4                   cg5       Ninth
5                   cg6       First
Expected output:
     Hello  My  name  is
cg1      0   1     2   3
cg2      8   9    10  11
cg3     16  17    18  19
cg6      0   1     2   3
Your code works well for me but you can try this version:
out = (diff_meth_kirp.merge(cpg_symbol.set_index('Gene_Symbol'),
                            left_index=True, right_index=True)
       .set_index('Composite_Element_REF')
       .rename_axis(None).sort_index())
print(out)
# Output
Hello My name is
cg1 0 1 2 3
cg2 8 9 10 11
cg3 16 17 18 19
cg6 0 1 2 3
Input dataframes:
data1 = {'Hello': {'First': 0, 'Second': 4, 'Third': 8, 'Fourth': 12, 'Fifth': 16, 'Sixth': 20},
'My': {'First': 1, 'Second': 5, 'Third': 9, 'Fourth': 13, 'Fifth': 17, 'Sixth': 21},
'name': {'First': 2, 'Second': 6, 'Third': 10, 'Fourth': 14, 'Fifth': 18, 'Sixth': 22},
'is': {'First': 3, 'Second': 7, 'Third': 11, 'Fourth': 15, 'Fifth': 19, 'Sixth': 23}}
diff_meth_kirp = pd.DataFrame(data1)
data2 = {'Composite_Element_REF': {0: 'cg1', 1: 'cg2', 2: 'cg3', 3: 'cg4', 4: 'cg5', 5: 'cg6'},
'Gene_Symbol': {0: 'First', 1: 'Third', 2: 'Fifth', 3: 'Seventh', 4: 'Ninth', 5: 'First'}}
cpg_symbol = pd.DataFrame(data2)
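An equivalent join-based variant, as a sketch on the same sample dataframes (DataFrame.join with on= matches cpg_symbol's Gene_Symbol column against diff_meth_kirp's index, so the duplicated symbol 'First' still yields one row per match):
out = (cpg_symbol.join(diff_meth_kirp, on='Gene_Symbol', how='inner')
       .set_index('Composite_Element_REF')
       .rename_axis(None)
       .drop(columns='Gene_Symbol')
       .sort_index())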
I have a sample.csv and want to sum it cumulatively by column, as below:
Input csv: Output csv:
01/01/2020, 0, 0, 2, 1 01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 1, 3 18/04/2022, 7, 5, 3, 4
01/05/2022, 8,21, 9, 4 01/05/2022,15,26,12, 8
I've tried
awk '{ for (i=1; i<=NF; ++i) {sum[i]+=$i; $i=sum[i] }; print $0}' sample.csv
But it returns this instead:
Input csv: Output csv:
01/01/2020, 0, 0, 2, 1 01/01/2020, 0, 0, 2, 1, 0, 0, 0, 0, 0
18/04/2022, 7, 5, 1, 3 18/04/2022, 7, 5, 1, 3, 0, 0, 0, 0, 0
01/05/2022, 8,21, 9, 4 01/05/2022, 8,21, 9, 4, 0, 0, 0, 0, 0
I'm at a loss as to how to resolve this.
Note: I am writing this in a bash script, not the terminal. And I'm not allowed to use any tools other than awk for this
I can't duplicate your output. Other than whitespace mangling, this seems to do what you want:
awk '{ for (i=2; i<=NF; i+=1) {
           sum[i]+=$i; $(i)=sum[i];
       }; print $0 }' FS=, OFS=, sample.csv
To get the whitespace you want, you could do:
awk '{
    for (i=2; i<=NF; i+=1) {
        sum[i]+=$i; $(i)=sum[i];
    }
    printf "%s,%2d,%2d,%2d,%2d\n", $1, $2, $3, $4, $5
}' FS=, sample.csv
If you don't know the number of columns, you could write that final printf in a loop.
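For example, a sketch of that loop with the same two-character padding, the column count taken from NF:
awk '{
    for (i=2; i<=NF; i+=1) {
        sum[i]+=$i; $(i)=sum[i];
    }
    # print the date, then each running total right-aligned to 2 characters
    printf "%s", $1
    for (i=2; i<=NF; i+=1) printf ",%2d", $i
    print ""
}' FS=, sample.csv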
Tested and confirmed working on
gawk 5.1.1,
mawk 1.3.4,
mawk 1.9.9.6, and
macos nawk
——————————————————————————————————————————————
# gawk profile, created Thu May 19 15:59:38 2022
function fmt(_) {
    return +_<=_^(_<_) \
    ? "" : $_ = sprintf("%5.f",___[_]+=$_)
}
BEGIN {
    split(sprintf("%0*.f",((__=++_)+(++_*_))^++_,!_),___,"")
    OFS = ", "
    FS = "[,][ ]*"
}
{
    _ = NF
    while (_ != __) { fmt(_--) }
}
_
——————————————————————————————————————————————
01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 3, 4
01/05/2022, 15, 26, 12, 8
I know how to use the reduce and fold operations, but I don't get how to use them with a Map.
val numbers = listOf("one", "two", "three", "four", "five")
println(numbers.groupingBy { it.first() }.eachCount()) // Output:- {o=1, t=2, f=2}
groupingBy returns a Grouping (eachCount then turns it into a Map). So I need to figure out how to use fold and reduce with a Kotlin Grouping.
Any example is OK; I just need to use reduce and fold with grouping in Kotlin.
I've honestly never used this in a project, but the code below should make it clear.
it % 5 gives the possible remainders 0, 1, 2, 3, 4,
i.e. 5 % 5 = 0; 6 % 5 = 1; 7 % 5 = 2; 8 % 5 = 3; 9 % 5 = 4; 10 % 5 = 0
val numb = listOf(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
val nmap = numb.groupingBy { it % 5 }
println(nmap.eachCount())
println("map = ${nmap.reduce { key, accumulator, element ->
println("$key ($accumulator,$element)")
accumulator + element
}}")
Output
{1=4, 2=4, 3=4, 4=4, 0=4}
1 (1,6)
2 (2,7)
3 (3,8)
4 (4,9)
0 (5,10) ---> 5 +
1 (7,11)
2 (9,12)
3 (11,13)
4 (13,14)
0 (15,15) ----> 5 + 15 +
1 (18,16)
2 (21,17)
3 (24,18)
4 (27,19)
0 (30,20) -----> 5 + 15 + 30
map = {1=34, 2=38, 3=42, 4=46, 0=50}
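fold works the same way but takes an explicit initial value, so the first element of each group is not treated specially. A minimal sketch computing the same per-key sums as the reduce above:
val sums = numb.groupingBy { it % 5 }
    .fold(0) { accumulator, element -> accumulator + element }
println(sums) // {1=34, 2=38, 3=42, 4=46, 0=50}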
I'm working on processing some results obtained from a curl command in AWK but despite reading about match and regexps I'm still having some issues. I've got everything written, but in a really hackish way that's using a lot of substr and really basic match usage without capturing anything with a regexp.
My real data is a bit more complicated, but here's a simplified version. Assume the following is stored in a string, str:
[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}][{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]
Some things to note about this data:
Note that there are 3 "sets" of data delimited by {} in the first brackets [] and 2 sets in the second brackets. The string always has at least one set of data in each set of brackets, and at least one set of brackets (i.e. it will never be the empty string and will always have SOME valid data in it)
Brackets are also used for the DataC data, so those need to be considered in some way
No punctuation will ever appear in the string aside from delimiters -- all actual data is alphanumeric
The fields DataA, DataBee, and DataC will always have those names
The data for DataC will always be exactly 5 numbers, separated by commas
What I'd like to do is write a loop that will go through the string and pull out the values -- a = whatever DataA is (200 in the first case), b = whatever DataBee is (63500 in the first case), and c[1] through c[5] containing the values from DataC.
I feel like if I could just get ideas about how to do this for the above data I could run with it to adapt it to my needs. As of right now the loop I have for this using substr is like 30 lines long :(
For fun, using awk:
I use "complex" FS and RS variables to split the JSON. This way, I have at most one value per column and one data item per line (DataA, DataBee, DataC).
To understand the usage of FS and RS, see how this command works:
awk -F",|\":\"|:\\\[" '
{$1=$1}1
' OFS="\t" RS="\",\"|},{|\\\]" file
(you can replace file with <(curl <your_url>) or <(echo <your_json_str>))
Returns:
[{"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
"DataA 190
DataBee 63100
DataC" 55 22 64 838 2
"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
}
[{"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
}
Now it looks like something I can use with awk:
awk -F",|\":\"|:\\\[" '
/DataA/{a=$2}
/DataBee/{b=$2}
/DataC/{for(i=2;i<=NF;i++){c[i-1]=$i}}
a!=""&&b!=""&&c[1]!=""{
print "a: ", a;
print "b: ", b;
printf "c: ";
for(i in c){
printf "%s, ", c[i]
};
print "";
a=""; b=""; c[1]=""
}
' RS="\",\"|},{|\\\]" file
This command stores the values in variables and prints them once a, b, and c are all set.
Returns:
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 190
b: 63100
c: 55, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
For fun, using awk (GNU awk for the three-argument match), and this excellent answer:
awk '
function find_all(str, patt) {
    while (match(str, patt, a) > 0) {
        for (i=1; i in a; i++) print a[i]
        str = substr(str, RSTART+RLENGTH)
    }
}
{
    print "Catching DataA"
    find_all($0, "DataA\":\"([0-9]*)")
    print "Catching DataBee"
    find_all($0, "DataBee\":\"([0-9]*)")
    print "Catching DataC"
    find_all($0, "DataC\":.([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)")
}
' file
Returns
Catching DataA
200
190
200
200
200
Catching DataBee
63500
63100
63500
63500
63500
Catching DataC
3
22
64
838
2
55
22
64
838
2
3
22
64
838
2
3
22
64
838
2
3
22
64
838
2
Now that you've seen how ugly that is, see how easy it can be using Python:
import json
data_str = '[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}][{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]'
while data_str:
    data, index = json.JSONDecoder().raw_decode(data_str)
    for element in data:
        print("DataA: ", element["DataA"])
        print("DataBee: ", element["DataBee"])
        print("DataC: ", element["DataC"])
    data_str = data_str[index:]
Returns:
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 190
DataBee: 63100
DataC: [55, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
This solution is not only cleaner, it is also more robust against unexpected results or unexpected formatting.
I would recommend using jq, e.g.:
jq -c '.[]' <<<"$str"
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
To extract DataC:
jq -c '.[] | .DataC' <<<"$str"
Output:
[3,22,64,838,2]
[55,22,64,838,2]
[3,22,64,838,2]
[3,22,64,838,2]
[3,22,64,838,2]
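If you want all three fields per object on one line, jq string interpolation works too; a sketch (an interpolated array is emitted as its JSON text):
jq -r '.[] | "a: \(.DataA), b: \(.DataBee), c: \(.DataC)"' <<<"$str"
a: 200, b: 63500, c: [3,22,64,838,2]
a: 190, b: 63100, c: [55,22,64,838,2]
a: 200, b: 63500, c: [3,22,64,838,2]
a: 200, b: 63500, c: [3,22,64,838,2]
a: 200, b: 63500, c: [3,22,64,838,2]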
So I have a log file that contains entries like this:
[STAT] - December 11, 2017 13:16:05.360
.\something.cpp(99): [Text] Code::Open Port 1, baud 9600, parity 0, Stop bits 0, flow control 0
[STAT] - December 11, 2017 13:20:24.637
.\something\more\code.cpp(100): [log]
fooBarBaz[32] = 32, 1, 2, 7, 3, 1092, 5, 196875, 6, 270592, 20, 196870, 8, 289, 30, 196867, 11, 1156, 5, 196875, 28, 278784, 5, 196874, 32, 266496, 30, 6866, 36, 147712, 5, 196874,
[STAT] - December 11, 2017 13:20:40.939
.\something\more\code.cpp(100): [log]
fooBarBaz[8] = 8, 1, 2, 1, 31, 532992, 5, 196875,
[STAT] - December 11, 2017 13:18:16.214
.\something\more\code.cpp(100): [log]
fooBarBaz[12] = 12, 1, 2, 2, 17, 296960, 10, 196872, 51, 1792, 50, 196878,
On the command line, I can do this:
gawk -F', *' '/fooBarBaz\[[^0].*\]/ {for (f=5; f<=NF; f+=4) print $f | "sort -n" }' log
Which produces an output like this:
3
6
8
11
17
28
31
32
36
51
I'd like to have an awk script do the same thing, but my efforts so far haven't
worked.
#!/usr/local/bin/gawk -f
BEGIN { print "lines"
FS=", *";
/fooBarBaz\[[^0].*\]/
}
{
{for (f=5; f<=NF; f+=4) print $f}
}
I don't think my regular expression statement is in the right place, because
running gawk -f script.awk prints lines not relevant to my data.
What am I doing wrong?
tl;dr: On lines containing fooBarBaz but not [0], I want to collect every 4th field starting at field 5, through to the end of the line.
Optimized GNU awk solution:
parse_digits.awk script:
#!/bin/awk -f
BEGIN{
    FS=", *";
    # GNU awk only: traverse arrays in ascending numeric index order
    PROCINFO["sorted_in"]="#ind_num_asc";
    print "lines";
}
# only lines whose fooBarBaz[...] count is non-zero
/fooBarBaz\[[1-9][0-9]*\]/{
    # take every 4th field starting at field 5; using the values as
    # array indices also removes duplicates
    for (i=5; i <= NF; i+=4)
        if ($i != "") a[$i]
}
END{
    for (i in a) print i
}
Usage:
awk -f parse_digits.awk inputfile
The output:
lines
3
6
8
11
17
28
31
32
36
51
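As for what was going wrong in the original script: the /fooBarBaz.../ regex sat inside the BEGIN block, where it is evaluated once as a bare expression instead of filtering input lines, so the unguarded main rule printed fields from every record. Moving the regex into the pattern position of its own rule makes the command-line version work unchanged as a script; a minimal sketch keeping the external sort pipe:
#!/usr/local/bin/gawk -f
BEGIN {
    print "lines"
    FS = ", *"
}
# the regex must be the pattern of its own rule, not a statement in BEGIN
/fooBarBaz\[[^0].*\]/ {
    for (f = 5; f <= NF; f += 4)
        print $f | "sort -n"
}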