Split a string based on a regex in bash and construct a map of pattern to matched output - awk

I have a string something like
scope1/group1: health check failure; Refer section:1; scope2/group2: config validation failure; Refer section:2
I want the output in form of a map like
{scope1/group1, health check failure; Refer section:1}
{scope2/group2, config validation failure; Refer section:2}
The set of keys (scope1/group1, scope2/group2, etc.) is fixed and I think the best way to achieve the output is to split on the keys themselves. Open to suggestions if there's a better way.
To split on the keys, I tried awk's split function:
echo "$string" | awk '{ split($0, a, /(scope1\/group1:|scope2\/group2:)/); print a[2] }'
Output
health check failure; Refer section:1;
However, this doesn't tell me which pattern the value was matched against, which I need to construct the map.
gawk's split() also populates a separators array (see the documentation), which would work for finding the key associated with each matched value. However, I can't use gawk for other reasons and am looking for alternatives.

You can make semicolon the record separator and colon the field separator to let awk split the tokens for you. To deal with the last record, which ends with a newline character that doesn't belong to the value, include it as an alternation pattern in the field separator:
awk 'BEGIN{FS=": |\n";RS="; "}{print"{\""$1"\": \""$2"\"}"}'
Demo: https://ideone.com/UaqsFO
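In case the demo link goes stale, here is the one-liner run on a sample input (a simplified string whose values contain no embedded semicolons, the case this answer originally targeted). Note that a multi-character RS is itself an extension: POSIX awk only honours the first character of RS, but gawk, mawk and busybox awk all support it.

```shell
# Values without embedded semicolons, as in the original form of the question
s='scope1/group1: health check failure; scope2/group2: config validation failure'

# "; " separates records, ": " (or the trailing newline) separates key from value
echo "$s" | awk 'BEGIN{FS=": |\n";RS="; "}{print"{\""$1"\": \""$2"\"}"}'
# {"scope1/group1": "health check failure"}
# {"scope2/group2": "config validation failure"}
```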
EDIT: With your edited question now stating that the set of keys is fixed and known in advance, and that values may contain colons and semicolons, you can instead iterate through matches of the known keys, using RSTART and RLENGTH to extract each key and value, and removing the matched portion of the record before the next iteration:
awk '{
    for (;;) {
        match($0, "(; )?(scope1/group1|scope2/group2): |$")
        if (RSTART > 1) {
            value = substr($0, 1, RSTART - 1)
            sub("; $", "", value)
            print "{" key ", " value "}"
        }
        if (!RLENGTH)
            break
        key = substr($0, RSTART, RLENGTH - 2)
        sub("^; ", "", key)
        $0 = substr($0, RSTART + RLENGTH)
    }
}'
Demo: https://ideone.com/gBhVDs
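Likewise, a self-contained run of the same program on the sample string, in case the demo link rots:

```shell
s='scope1/group1: health check failure; Refer section:1; scope2/group2: config validation failure; Refer section:2'

echo "$s" | awk '{
    for (;;) {
        # leftmost match is either a "key: " marker (optionally preceded
        # by "; ") or, on the last pass, the end of the record
        match($0, "(; )?(scope1/group1|scope2/group2): |$")
        if (RSTART > 1) {
            value = substr($0, 1, RSTART - 1)
            sub("; $", "", value)
            print "{" key ", " value "}"
        }
        if (!RLENGTH)
            break
        key = substr($0, RSTART, RLENGTH - 2)   # strip trailing ": "
        sub("^; ", "", key)
        $0 = substr($0, RSTART + RLENGTH)
    }
}'
# {scope1/group1, health check failure; Refer section:1}
# {scope2/group2, config validation failure; Refer section:2}
```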

Your edit removing the quotes in the output made things a bit easier. If your field separator is consistently "; ", then you can use:
awk -F"; " '{ for (i=2; i<=NF; i+=2) { sub(/[:]/,",",$(i-1)); print "{"$(i-1)"; "$i"}" }}'
With your data, the result would be:
$ echo "scope1/group1: health check failure; Refer section:1; scope2/group2: config validation failure; Refer section:2" |
awk -F"; " '{ for (i=2; i<=NF; i+=2) { sub(/[:]/,",",$(i-1)); print "{"$(i-1)"; "$i"}" }}'
{scope1/group1, health check failure; Refer section:1}
{scope2/group2, config validation failure; Refer section:2}

Using any awk:
$ cat tst.awk
{
    while ( match($0,/[^;]+;[^;]+; /) ) {
        blocks[++numBlocks] = substr($0,RSTART,RLENGTH-2)
        $0 = substr($0,RSTART+RLENGTH)
    }
    blocks[++numBlocks] = $0
    for ( i=1; i<=numBlocks; i++ ) {
        sub(/:/,",",blocks[i])
        print "{" blocks[i] "}"
    }
}
$ awk -f tst.awk file
{scope1/group1, health check failure; Refer section:1}
{scope2/group2, config validation failure; Refer section:2}

Given your string:
s='scope1/group1: health check failure; Refer section:1; scope2/group2: config validation failure; Refer section:2'
You can use perl to capture the strings leading up to \w+:\d+, with an optional trailing ;:
perl -0777 -lpE 's/(.*?\w+:\d+)(?:;\s*)?/{$1}\n/g' <<<"$s"
Or, use your notion of splitting on scope:
perl -0777 -nE 's/\R//; for $gr (split /;\h+(?=scope)/) {say "{$gr}"}' <<<"$s"
Or GNU grep with sed:
grep -oP "scope.*?(?=scope|$)" <<<"$s" | sed 's/^/{/; s/;*[[:space:]]*$/}/'
Or POSIX sed pipeline:
sed -E 's/(.*:[[:digit:]][[:digit:]]*); /\1\n/g' <<<"$s" | sed 's/^/{/; s/$/}/'
Or (assuming just one split) you could use Bash:
printf "{${s%; scope*}}\n{${s#*:?; }}"
Any of these prints:
{scope1/group1: health check failure; Refer section:1}
{scope2/group2: config validation failure; Refer section:2}

Related

Separate lines with keys and store in different files

How do I get every line related to a hexadecimal uid key, and containing DEBUG, from a text file, and store each group of lines in a different file, where the key is in this format: "[ uid key]"?
I.e. ignore any lines that are not DEBUG.
in.txt:
[ uid 28fd4583833] DEBUG web.Action
[ uid 39fd5697944] DEBUG test.Action
[ uid 56866969445] DEBUG test2.Action
[ uid 76696944556] INFO test4.Action
[ uid 39fd5697944] DEBUG test7.Action
[ uid 85483e10256] DEBUG testing.Action
The output files are named as "out" + i + ".txt", where i = 1, 2, 3, 4.
i.e.
out1.txt:
[ uid 28fd4583833] DEBUG web.Action
out2.txt:
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
out3.txt:
[ uid 56866969445] DEBUG test2.Action
out4.txt:
[ uid 85483e10256] DEBUG testing.Action
I tried:
awk 'match($0, /uid ([^]]+)/, a) && /DEBUG/ {print > (a[1] ".txt")}' in.txt
If you are willing to change the output file names to include the keys (frankly, this seems more useful than a one-up counter in the names), you can do:
awk '/DEBUG/{print > ("out-" $3 ".txt")}' FS='[][ ]*' in.txt
This will put all lines that match the string DEBUG with key 85483e10256 into the file out-85483e10256.txt, etc.
If you do want to keep the one-up counter, you could do:
awk '/DEBUG/{if( ! a[$3] ) a[$3] = ++counter;
print > ("out" a[$3] ".txt")}' FS='[][ ]*' in.txt
Basically, the idea is to use the regex [][ ]* as the field separator, which matches a run of square brackets and/or spaces. This way, $1 is the (empty) text preceding the initial [, $2 is the string uid, and $3 is the key. This will (should!) correctly extract the key even from lines with slightly different whitespace. An associative array records which keys have already been seen, so the counter is incremented only for new keys. But it really is cleaner just to use the key in the output file name.
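A quick check of the key-named variant (the mktemp scratch directory and the shortened in.txt are just for this demo, not part of the answer):

```shell
# scratch directory so the generated out-*.txt files don't clutter anything
tmp=$(mktemp -d)
cd "$tmp"

cat > in.txt <<'EOF'
[ uid 28fd4583833] DEBUG web.Action
[ uid 39fd5697944] DEBUG test.Action
[ uid 76696944556] INFO test4.Action
[ uid 39fd5697944] DEBUG test7.Action
EOF

# split DEBUG lines into per-key files named after the key
awk '/DEBUG/{print > ("out-" $3 ".txt")}' FS='[][ ]*' in.txt

cat out-39fd5697944.txt
# [ uid 39fd5697944] DEBUG test.Action
# [ uid 39fd5697944] DEBUG test7.Action
```

The INFO line produces no file at all, since only /DEBUG/ lines are printed.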
Using GNU sort for -s (to guarantee retaining input line order for every key value) and any awk:
$ sort -sk3,3 in.txt |
awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'
$ head out*.txt
==> out1.txt <==
[ uid 28fd4583833] DEBUG web.Action
==> out2.txt <==
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
==> out3.txt <==
[ uid 56866969445] DEBUG test2.Action
==> out4.txt <==
[ uid 85483e10256] DEBUG testing.Action
If you don't have GNU sort then you can apply the DSU (Decorate/Sort/Undecorate) idiom using any sort:
$ awk -v OFS='\t' '{print NR, $0}' in.txt | sort -k4,4 -k1,1n | cut -f2- |
awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'
Note that with the above only sort has to handle all of the input in memory, and it's designed to use demand paging, etc. to handle extremely large amounts of input. awk only processes one line at a time, keeps almost nothing in memory, and has only one output file open at a time, so the above is far more likely to succeed for large files than an approach that stores a lot in memory in awk, or has many output files open concurrently.
If your file format is consistent as you show, you can just do:
awk '
    $4!="DEBUG" { next }
    !f[$3] { f[$3]=++i }
    { print > ("out" f[$3] ".txt") }
' in.txt
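For reference, a sketch of running this on the sample in.txt (the mktemp scratch directory is illustrative, not part of the answer):

```shell
tmp=$(mktemp -d)
cd "$tmp"

cat > in.txt <<'EOF'
[ uid 28fd4583833] DEBUG web.Action
[ uid 39fd5697944] DEBUG test.Action
[ uid 56866969445] DEBUG test2.Action
[ uid 76696944556] INFO test4.Action
[ uid 39fd5697944] DEBUG test7.Action
[ uid 85483e10256] DEBUG testing.Action
EOF

# default FS: $3 is the key (with trailing "]"), $4 is the log level;
# each new key among the DEBUG lines gets the next counter value
awk '
    $4!="DEBUG" { next }
    !f[$3] { f[$3]=++i }
    { print > ("out" f[$3] ".txt") }
' in.txt

head out*.txt
```

With the sample data this yields out1.txt through out4.txt; the skipped INFO line does not consume a counter value.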
1st solution: Using GNU awk, try the following single awk program, which relies on GNU awk's PROCINFO["sorted_in"] feature.
awk '
BEGIN{
    PROCINFO["sorted_in"] = "#ind_num_asc"
}
!/DEBUG/{ next }
match($0,/uid [a-zA-Z0-9]+/){
    ind=substr($0,RSTART,RLENGTH)
    arr[ind]=(arr[ind]?arr[ind] ORS:"") $0
}
END{
    for(i in arr){
        outputFile=("out"++count".txt")
        print arr[i] > (outputFile)
        close(outputFile)
    }
}
' Input_file
2nd solution: With any awk, please try the following for your shown samples. Change Input_file to your actual file name. GNU sort is used here with the -s option to keep the input order of lines with equal keys.
awk '
!/DEBUG/{ next }
match($0,/uid [0-9a-zA-Z]+/){
    print substr($0,RSTART,RLENGTH)";"$0
}' Input_file |
sort -sk2n |
cut -d';' -f2- |
awk '
match($0,/uid [0-9a-zA-Z]+/){
    if(prev!=substr($0,RSTART,RLENGTH)){
        count++
        close(outputFile)
    }
    outputFile="out"count".txt"
    print > (outputFile)
    prev=substr($0,RSTART,RLENGTH)
}
'
1st solution's Explanation: Adding detailed explanation for 1st solution:
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
    PROCINFO["sorted_in"] = "#ind_num_asc" ##Setting PROCINFO["sorted_in"] so arrays are traversed by numerically ascending index.
}
!/DEBUG/{ next } ##If a line does not contain DEBUG then jump to the next line.
match($0,/uid [a-zA-Z0-9]+/){ ##Using the match function to match "uid", a space, and an alphanumeric value here.
    ind=substr($0,RSTART,RLENGTH) ##Creating ind, which contains the substring matched by the match function.
    arr[ind]=(arr[ind]?arr[ind] ORS:"") $0 ##Creating array arr with index ind, appending the current line to that index.
}
END{ ##Starting END block of this program from here.
    for(i in arr){ ##Traversing through array arr here.
        outputFile=("out"++count".txt") ##Creating the output file name here as per OP requirement.
        print arr[i] > (outputFile) ##Printing the current array element into outputFile.
        close(outputFile) ##Closing the output file to avoid a "too many open files" error.
    }
}
' Input_file ##Mentioning Input_file name here.
A relatively portable awk-based solution with these highlights:
output rows do not truncate leading edge double space
output filenames adhere to stabilized input row order without the need to pre-sort rows, post-sort rows, or utilize gnu gawk-specific features
tested and confirmed working on
gawk 5.1.1, including -ce flag,
mawk 1.3.4,
mawk 1.9.9.6, and
macOS nawk 20200816
————————————————————————————————
# gawk profile, created Thu May 19 12:10:56 2022
BEGIN {
____ = "test_72297811_" # opt. filename prefix
OFS = FS = "^ [[] uid "
_+=_ = gsub("\\^|[[][]]", _, OFS)
_*= _--
} NF *= / DEBUG / {
print >> (__[___ = substr($NF,_~_,_)] ?__[___]:\
__[___]= ____ "out" length(__) ".txt" )
} END {
for (_ in __) { close(__[_]) } }
————————————————————————————————
==> test_72297811_out1.txt <==
[ uid 28fd4583833] DEBUG web.Action
==> test_72297811_out2.txt <==
[ uid 39fd5697944] DEBUG test.Action
[ uid 39fd5697944] DEBUG test7.Action
==> test_72297811_out3.txt <==
[ uid 56866969445] DEBUG test2.Action
==> test_72297811_out4.txt <==
[ uid 85483e10256] DEBUG testing.Action

How to replace all escape sequences with non-escaped equivalent with unix utilities (sed/tr/awk)

I'm processing a Wireshark config file (dfilter_buttons) for display filters and would like to print out the filter of a given name. The content of file is like:
Sample input
"TRUE","test","sip contains \x22Hello, world\x5cx22\x22",""
And the resulting output should have the escape sequences replaced, so I can use them later in my script:
Desired output
sip contains "Hello, world\x22"
My first pass is like this:
Current parser
filter_name=test
awk -v filter_name="$filter_name" 'BEGIN {FS="\",\""} ($2 == filter_name) {print $3}' "$config_file"
And my output is this:
Current output
sip contains \x22Hello, world\x5cx22\x22
I know I can handle these exact two escape sequences by piping to sed and matching them exactly, but is there a generic way to substitute all escape sequences? Future filters I build may use more escape sequences than just the quote and comma, and I would like to handle those too.
Using gnu-awk you can do this using split, gensub and strtonum functions:
awk -F '","' -v filt='test' '$2 == filt {n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps); for (i=1; i<n; ++i) printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2)); print subj[i]}' file
sip contains "Hello, world\x22"
A more readable form:
awk -F '","' -v filt='test' '
$2 == filt {
    n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps)
    for (i=1; i<n; ++i)
        printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2))
    print subj[i]
}' file
Explanation:
Using -F '","' we split input using delimiter ","
$2 == filt we filter input for $2 == "test" condition
Using /\\x[0-9a-fA-F]{2}/ as regex (that matches 2 digit hex strings) we split $3 and save split tokens into array subj and matched separators into array seps
Using substr we remove the first character (the backslash) and prepend 0, turning e.g. \x22 into 0x22
Using strtonum we convert that hex string to its numeric value
Using %c in printf we print corresponding ascii character
Last for loop joins $3 back using subj and seps array elements
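The gawk-only part of this is strtonum(); the %c conversion itself is plain POSIX awk and can be sanity-checked in isolation. The codes here are hardcoded just for the demo: 34 is the double quote (what \x22 encodes) and 65 is A.

```shell
# %c with a numeric argument prints the character with that code (POSIX awk)
awk 'BEGIN { printf "%c%c%c\n", 34, 65, 34 }'
# "A"
```

In the full solution, strtonum("0x22") supplies that 34 dynamically from each matched escape sequence.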
Using GNU awk for FPAT, gensub(), strtonum(), and the 3rd arg to match():
$ cat tst.awk
BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
$2 == ("\"" filter_name "\"") {
    gsub(/^"|"$/,"",$3)
    while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
        printf "%s%c", substr($3,1,RSTART-1), strtonum(gensub(/./,0,1,a[1]))
        $3 = a[2]
    }
    print $3
}
$ awk -v filter_name='test' -f tst.awk file
sip contains "Hello, world\x22"
The above assumes your escape sequences are always \x followed by exactly 2 hex digits. It isolates every \xHH string in the input, replaces \ with 0 in that string so that strtonum() can then convert the string to a number, then uses %c in the printf formatting string to convert that number to a character.
Note that GNU awk has a debugger (see https://www.gnu.org/software/gawk/manual/gawk.html#Debugger), so if you're ever not sure what any part of a program does, you can just run it in the debugger (-D) and trace it. For example, in the following I plant a breakpoint to tell awk to stop at line 1 of the script (b 1), then start running (r) and then step (s) through the script, printing the value of $3 (p $3) at each line so I can see how it changes after the gsub():
$ awk -D -v filter_name='test' -f tst.awk file
gawk> b 1
Breakpoint 1 set at file `tst.awk', line 1
gawk> r
Starting program:
Stopping in BEGIN ...
Breakpoint 1, main() at `tst.awk':1
1 BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
gawk> p $3
$3 = uninitialized field
gawk> s
Stopping in Rule ...
2 $2 == "\"" filter_name "\"" {
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
3 gsub(/^"|"$/,"",$3)
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
4 while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
gawk> p $3
$3 = "sip contains \\x22Hello, world\\x5cx22\\x22"

Can't replace string to multi-lined string with sed

I'm trying to replace a fixed placeholder ("replaceMe") in a text with multi-line text using sed.
My bash script goes as follows:
content=$(awk '{print $5}' < data.txt | sort | uniq)
target=$(cat install.sh)
text=$(sed "s/replaceMe/$content/" <<< "$target")
echo "${text}"
If content contains one line only, the replacement works, but if it contains several lines I get:
sed: ... unterminated `s' command
I read about matching multi-line content, but I couldn't find anything about inserting a multi-line replacement string.
You'll have more problems than that depending on the contents of data.txt, since sed doesn't understand literal strings (see Is it possible to escape regex metacharacters reliably with sed). Just use awk, which does:
text="$( awk -v old='replaceMe' '
    NR==FNR {
        if ( !seen[$5]++ ) {
            new = (NR>1 ? new ORS : "") $5
        }
        next
    }
    s = index($0,old) { $0 = substr($0,1,s-1) new substr($0,s+length(old)) }
    { print }
' data.txt install.sh )"
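A minimal self-contained run, with hypothetical stand-ins for data.txt and install.sh (the val1/val2 values and the x=replaceMe line are invented for the demo):

```shell
tmp=$(mktemp -d)
cd "$tmp"

# data.txt: field 5 holds the values; one duplicate shows the de-duplication
printf '%s\n' 'a b c d val1' 'a b c d val2' 'a b c d val1' > data.txt
# install.sh: contains the placeholder to be replaced
printf '%s\n' 'before' 'x=replaceMe' 'after' > install.sh

awk -v old='replaceMe' '
    NR==FNR {
        if ( !seen[$5]++ ) {
            new = (NR>1 ? new ORS : "") $5
        }
        next
    }
    s = index($0,old) { $0 = substr($0,1,s-1) new substr($0,s+length(old)) }
    { print }
' data.txt install.sh
# before
# x=val1
# val2
# after
```

The replacement is a literal string-index splice, so regex metacharacters in either the placeholder or the values cause no trouble.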

awk Joining n fields with delimiter

How can I use awk to join various fields, given that I don't know how many of them I have? For example, given the input string
aaa/bbb/ccc/ddd/eee
I use -F'/' as the delimiter, do some manipulation on aaa, bbb, ccc, ddd, eee (altering, removing...) and I want to join it back to print something like
AAA/bbb/ddd/e
Thanks
... given that I don't know how many of them I have?
Ah, but you do know how many you have. Or you will soon, if you keep reading :-)
Before giving you a record to process, awk will set the NF variable to the number of fields in that record, and you can use for loops to process them (comments aren't part of the script, I've just put them there to explain):
$ echo pax/is/a/love/god | awk -F/ '{
gsub (/god/,"dog",$5); # pax,is,a,love,dog
$4 = ""; # pax,is,a,,dog
$6 = $5; # pax,is,a,,dog,dog
$5 = "rabid"; # pax,is,a,,rabid,dog
printf $1; # output "pax"
for (i = 2; i <= NF; i++) { # output ".<field>"
if ($i != "") { # but only for non-blank fields (skip $4)
printf "."$i;
}
}
printf "\n"; # finish line
}'
pax.is.a.rabid.dog
This shows manipulation of the values, as well as insertion and deletion.
The following will show you how to process each field and do some example manipulations on them.
The only caveat of using the output field separator OFS is that "deleted" fields will still have delimiters as shown in the output below; however it makes the code much simpler if you can live with that.
awk '
BEGIN{FS=OFS="/"}
{
    for(i=1;i<=NF;i++){
        if($i == "aaa")
            $i=toupper($i)
        else if($i ~ /c/)
            $i=""
        else if($i ~ /^eee$/)
            $i="e"
    }
}1' <<<'aaa/bbb/ccc/ddd/eee'
Output
AAA/bbb//ddd/e
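If the doubled delimiter is not acceptable, one option (an addition to the answer above, not part of it) is to rebuild the record from the non-empty fields before printing:

```shell
awk '
BEGIN{FS=OFS="/"}
{
    for(i=1;i<=NF;i++){
        if($i == "aaa")
            $i=toupper($i)
        else if($i ~ /c/)
            $i=""
        else if($i ~ /^eee$/)
            $i="e"
    }
    # rebuild the record, skipping fields that were emptied above
    out=""
    for(i=1;i<=NF;i++)
        if($i != "")
            out = (out=="" ? "" : out OFS) $i
    print out
}' <<<'aaa/bbb/ccc/ddd/eee'
# AAA/bbb/ddd/e
```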
This might work for you:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'BEGIN{FS=OFS="/"}{sub(/../,"",$4);NF=4;print}'
aaa/bbb/ccc/d
To delete fields not at the end, use a function to shuffle the values:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'func d(n){for(x=n;x<=NF-1;x++){y=x+1;$x=$y}NF--};BEGIN{FS=OFS="/"}{d(2);print}'
aaa/ccc/ddd/eee
Deletes the second field.
awk -F'/' '{                      # add the fields to an array, like:
    for (i=1;i<=NF;i++) { a[i]=$i }
    # now manipulate your elements in the array,
    # then finally print them:
    n = asorti(a, dest)           # dest holds the sorted indices of a
    for (i=1;i<=n;i++) { output = output a[dest[i]] "/" }
    print gensub("/$","","g",output)
}' INPUTFILE
Doing it this way you can delete elements as well. Note deleting an item can be done like delete array[index].

Use Awk to Print every character as its own column?

I am in need of reorganizing a large CSV file. The first column, which is currently a 6-digit number, needs to be split up, using commas as the field separator.
For example, I need this:
022250,10:50 AM,274,22,50
022255,11:55 AM,275,22,55
turned into this:
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Let me know what you think!
Thanks!
It's a lot shorter in perl:
perl -F, -ane '$,=","; print split("",$F[0]), @F[1..$#F]' <file>
Since you don't know perl, a quick explanation. -F, indicates the input field separator is the comma (like awk). -a activates auto-split (into the array @F), -n implicitly wraps the code in a while (<>) { ... } loop, which reads input line-by-line. -e indicates the next argument is the script to run. $, is the output field separator (it gets set on every iteration of the loop this way, but oh well). split has an obvious purpose, and you can see how the array is indexed/sliced. print, when given a list of arguments like this, joins them with the output field separator.
In awk:
awk -F, '{n=split($1,a,""); for (i=1;i<=n;i++) {printf("%s,",a[i])}; for (i=2;i<NF;i++) {printf("%s,",$i)}; print $NF}' <file>
I think this might work. The split function (at least in the version I am running) splits the value into individual characters if the third parameter is an empty string.
BEGIN{ FS="," }
{
    n = split( $1, a, "" );
    for ( i = 1; i <= n; i++ )
        printf("%s,", a[i] );
    sep = "";
    for ( i = 2; i <= NF; i++ )
    {
        printf( "%s%s", sep, $i );
        sep = ",";
    }
    printf("\n");
}
here's another way in awk
$ awk -F"," '{gsub(".",",&",$1);sub("^,","",$1)}1' OFS="," file
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Here's a variation on a theme. One thing to note is it prints the remaining fields without using a loop. Another is that since you're looping over the characters in the first field anyway, why not just do it without using the null-delimiter feature of split() (which may not be present in some versions of AWK):
awk -F, 'BEGIN{OFS=","} {len=length($1); for (i=1;i<len; i++) {printf "%s,", substr($1,i,1)}; printf "%s", substr($1,len,1);$1=""; print $0}' filename
As a script:
BEGIN {FS = OFS = ","}
{
    len = length($1);
    for (i=1; i<len; i++)
        { printf "%s,", substr($1, i, 1) };
    printf "%s", substr($1, len, 1)
    $1 = "";
    print $0
}