Using .subst with a partial regex match - raku

my $book1 = "Don Quixote- Miguel de Cervantes";
my $book2 = "Les Misérables -Victor Hugo";
my $book3 = "War and Peace - Leo Tolstoy";
I want to use .subst to change "- " to " - " in $book1 and " -" to " - " in $book2. The problem is that I can't find the right regex to use with .subst. I could use something other than a regex, but I would like to use .subst. I can use different regexes for the two strings, but both should ignore the " - " in $book3.
Sorry for the probably basic question. I've been trying different things but I always destroy part of the text.

You can use the trans method:
my $book1 = "Don Quixote- Miguel de Cervantes";
my $book2 = "Les Misérables -Victor Hugo";
my $book3 = "War and Peace - Leo Tolstoy";
for ($book1, $book2, $book3) -> $b {
say $b.trans([/<wb> '- '/, /' -' <wb>/] => [' - ']);
}
wb is a word boundary.

TL;DR Another option to consider is using the <( and )> capture markers to pick out just the bit you want to replace.
A "literal" interpretation of your Q
Matching strictly per your examples:
/ \C[space] <( '- ' | ' -' )> \C[space] /
The syntax \c[...] specifies one or more characters by using their Unicode names inside the square brackets (in this case the classic ASCII space character).1
In this pattern I've used \C[...] (uppercase C, not lowercase c). There is a range of Raku "backslash" atoms and they all have lowercase and uppercase variants, where the uppercase variant matches any character except the one(s) matched by the lowercase variant. So \C[space] matches any character other than the ASCII space character. See \c / \C for more info.
The <( capture marker marks the start point of the regex's capture. Likewise )> marks the endpoint.
Without them, when the pattern matches, the whole match would be captured, which would include whatever characters matched the \C[space] atoms. We don't want that. So we use these markers to restrict what we capture.
Btw, each marker is independent. The above pattern matches \C[space] '- ' or '- ' \C[space]. If the pattern to the left of the | matches, only the <( has an impact, omitting whatever matched \C[space], and capturing until the end of the match, which for this pattern stops at the |. If the pattern to the right matches, capturing starts immediately after the | and ends at the )>.
The | is Raku's parallel (aka "longest token match" -- LTM) pattern alternation operator, an alternative to the traditional sequential pattern alternation operator (which in Raku is written ||). In this case the set of substrings that the two operators will and won't match is the same, so it makes no difference which is used. But | is shorter than ||; when the match set is the same it's typically faster; and when the match sets are different it's often | that's desirable. So I use it by default unless I know I need the traditional sequential alternation logic (try pattern on left of || first; if that fails, try the pattern on the right of the ||).
A "per its spirit?" interpretation of your Q
Matching more flexibly regarding whitespace:
/ \S <( '-' \s+ | \s+ '-' )> \S /
The \S atoms match any character that is not categorized by Unicode as being a whitespace character. (I use Raku, or tools such as this character property lookup web page, to explore what Unicode makes of a character.)
Comparing \C[space], \S, and <wb>:
\C[space] matches any character, including whitespace characters, with the sole exception of an ASCII space. My guess is it'll be the fastest of the three.
\S matches any non-whitespace. My guess is it'll be faster than <wb>.
<wb> matches between characters. It'll also match before the first character in a string and after the last one. So @chenyf's pattern would match and change '- foo...' to ' - foo...' and '...bar -' to '...bar - ', whereas the patterns with \C[space] or \S would not match at the start/end of those strings.
The \s+ atoms match one or more whitespace characters.
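For readers who want to sanity-check the behaviour without Raku installed, roughly the same whitespace-flexible normalization can be sketched in GNU sed. This is only an approximation (sed has no equivalent of the <( and )> markers, so capture groups stand in for them, and the /tmp path is just an example):

```shell
# Sample titles (assumed file location).
cat > /tmp/books.txt <<'EOF'
Don Quixote- Miguel de Cervantes
Les Misérables -Victor Hugo
War and Peace - Leo Tolstoy
EOF
# Approximates / \S <( '-' \s+ | \s+ '-' )> \S /:
# a non-space, then "-<spaces>" or "<spaces>-", then a non-space.
sed -E 's/([^[:space:]])(-[[:space:]]+|[[:space:]]+-)([^[:space:]])/\1 - \3/' /tmp/books.txt
```

As with the Raku pattern, $book3 is left untouched because its hyphen already has a space on both sides.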
Footnotes
1 The naming is case insensitive. Multiple characters are separated by commas. \c[...] also works in a double quoted string (but not \C[...]).

for ($book1, $book2, $book3, $book4, $book5, $book6) -> $b
{ say $b
.subst(/ \S <( (\-+) \h )> \S /, {" $0 "}, :global)
.subst(/ \S <( \h (\-+) )> \S /, {" $0 "}, :global)
.subst(/ \S <( (\-) \v )> \S /, {"$0"}, :global) #fixes hyphenated words w/embedded newlines
}
Sample Input:
my $book1 = "Don Quixote- Miguel de Cervantes";
my $book2 = "Les Misérables -Victor Hugo";
my $book3 = "War and Peace - Leo Tolstoy";
my $book4 = "Moby-Dick; or, The Whale- Herman Melville";
my $book5 = "Winnie-the-Pooh --A. A. Milne";
my $book6 = "Slaughterhouse-\nFive- Kurt Vonnegut";
Sample Output:
Don Quixote - Miguel de Cervantes
Les Misérables - Victor Hugo
War and Peace - Leo Tolstoy
Moby-Dick; or, The Whale - Herman Melville
Winnie-the-Pooh -- A. A. Milne
Slaughterhouse-Five - Kurt Vonnegut
For this problem I would probably start by asking how these erroneous entries found their way into the data at hand. Was it the product of concatenation? Or informal (manual) entry? The first is fixable, the second might be a primary application of the Raku programming language (i.e. making informal, manual text entries more formal). This answer follows the excellent examples already posted, but (in contrast) uses a $0 capture to re-position the "-" field separator. In brief:
The first .subst(...) command globally captures one-or-more hyphens when followed by a single horizontal whitespace, and places the equivalent number of hyphens between the Title and Author (hyphens surrounded by whitespace).
The second .subst(...) command globally captures one-or-more hyphens when preceded by a horizontal whitespace, and places the equivalent number of hyphens between the Title and Author (hyphens surrounded by whitespace).
The third .subst(...) command globally captures a single hyphen when followed by a single vertical whitespace (e.g. newline), and removes the vertical whitespace. Hyphens followed by horizontal whitespace remain untouched. Note, for the third .subst(...) command, the replacement can simply be written as "-" (i.e. no need to use $0).
Note: the first two .subst statements can be combined with | OR:
.subst(/ \S <( (\-+) \h | \h (\-+) )> \S /, {" "~$0~" "}, :global)
Why go to all this trouble? Well, the first reason is that a more 'pedestrian' approach is more robust to complicated input (e.g. hyphenated words). In fact, some answers already posted may not handle hyphenated book titles and/or author names, which are handled gracefully (above and below, note alternate replacement form):
~$ cat book_author.txt
Don Quixote- Miguel de Cervantes
Les Misérables -Victor Hugo
War and Peace - Leo Tolstoy
Moby-Dick; or, The Whale- Herman Melville
Winnie-the-Pooh --A. A. Milne
Slaughterhouse-
Five- Kurt Vonnegut
~$ cat book_author.txt | raku -e 'say lines.join("\n")
.subst(/ \S <( (\-+) \h )> \S /, {" "~$0~" "}, :global)
.subst(/ \S <( \h (\-+) )> \S /, {" "~$0~" "}, :global)
.subst(/ \S <( \- \v )> \S /, "-", :global);'
Don Quixote - Miguel de Cervantes
Les Misérables - Victor Hugo
War and Peace - Leo Tolstoy
Moby-Dick; or, The Whale - Herman Melville
Winnie-the-Pooh -- A. A. Milne
Slaughterhouse-Five - Kurt Vonnegut
The second reason is such an answer can be used to modify text with other separators, such as Title | Author data, wherein title is separated from author by a vertical bar. The third reason is capturing (e.g. using $0) is adapted to a wide variety of problems, such as making multiple identical separator characters like -- or || into single-character separators (note yet another way of writing the replacement, this time adding .comb[0]):
~$ cat book_bar_author.txt
Don Quixote| Miguel de Cervantes
Les Misérables |Victor Hugo
War and Peace | Leo Tolstoy
Moby-Dick; or, The Whale| Herman Melville
Winnie-the-Pooh ||A. A. Milne
Slaughterhouse-
Five| Kurt Vonnegut
~$ cat book_bar_author.txt | raku -e 'say lines.join("\n")
.subst(/ \S <( (\|+) \h )> \S /, {"",$0.comb[0],""}, :global)
.subst(/ \S <( \h (\|+) )> \S /, {"",$0.comb[0],""}, :global)
.subst(/ \S <( \- \v )> \S /, "-", :global);'
Don Quixote | Miguel de Cervantes
Les Misérables | Victor Hugo
War and Peace | Leo Tolstoy
Moby-Dick; or, The Whale | Herman Melville
Winnie-the-Pooh | A. A. Milne
Slaughterhouse-Five | Kurt Vonnegut


extracting a specific word between : using sed, awk or grep

I have a file that has the following contents and many more.
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
My ideal output is as shown below (i.e., the text between the first ":" and the second ":").
ab0820_1ab
I generally use python and use regular expression along the lines of below to get the result.
\s*set_property board_part trenz.biz:([a-zA-Z_0-9]+)
I wish to know how can it be done quickly and in a more generic way using commandline tools (sed, awk).
You might use GNU sed following way, let file.txt content be
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
garbage garbage garbage
then
sed -n '/^set_property board_part my.biz/ s/[^:]*:\([^:]*\):.*/\1/ p' file.txt
gives output
ab0820_1ab
Explanation: -n turns off default printing. The first part, between slashes, is a so-called address: the commands that follow are applied only to lines matching it. The first command is a substitution (s) which uses a capturing group denoted by \( and \). The regular expression reads as follows: zero-or-more non-: characters (i.e. everything before the 1st :), a :, then zero-or-more non-: characters (i.e. everything between the 1st and 2nd :) enclosed in the capturing group, a :, and zero-or-more of any character (i.e. everything after the 2nd :). This is replaced by the content of the 1st (and, in this case, sole) capturing group. After the substitution takes place, the p command is issued to prompt GNU sed to print the changed line.
(tested in GNU sed 4.2.2)
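Putting it together, here is a quick end-to-end check (the /tmp path is just an example; the address is anchored with ^ so the commented-out line is skipped):

```shell
# Sample input, including the commented-out line and a junk line.
cat > /tmp/props.txt <<'EOF'
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
garbage garbage garbage
EOF
# Select matching lines, keep only the text between the 1st and 2nd colon.
sed -n '/^set_property board_part my.biz/ s/[^:]*:\([^:]*\):.*/\1/ p' /tmp/props.txt
```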
Your example data has my.biz but your pattern tries to match trenz.biz
If gnu awk is available, you can use a capture group and then print the first captured value, which is available in a[1]:
awk 'match($0, /^\s*set_property board_part \w+\.biz:(\w+)/, a) {print a[1]}' file
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
set_property board_part Match literally
\w+\.biz: Match 1+ word chars followed by .biz (note to escape the dot to match it literally)
(\w+) Capture group 1, match 1+ word chars
Notes
If you just want to match trenz.biz then you can replace \w+\.biz with trenz\.biz
If the strings are not at the start of the string, you can change ^ to \s to match a whitespace char instead
Output
ab0820_1ab

sed: cut out a substring following various patterns

There is a list of identifiers I want to modify:
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
I want to remove all starting from first . and first _ after DRAFT (case-insensitive), i.e.:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
I am using sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g' but it ignores the first _ after DRAFT and does this:
3300000526_2,
200254578_11,
3300000110_1,
3300000062_1,
3300000558_1
P.S.
There can be various identifiers and I tried to show a little portion on their variance here, but they all keep same pattern.
I'd be grateful for corrections!
You could easily do this in awk; please try the following, based on the shown samples only.
awk -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
OR with GNU awk for handling case-insensitive try:
awk -v IGNORECASE="1" -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
To handle case insensitive without ignorecase option try:
awk -F'[.]|[dD][rR][aA][fF][tT]_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Explanation: simply set the field separator to . OR DRAFT_ per the OP's need. Then, in the main program, set the 2nd field to _ and substitute space(s)-underscore-space(s) with a single _. Finally, print the current line via 1.
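The last variant (case-insensitive without the IGNORECASE option) can be checked end-to-end against the sample identifiers (the /tmp path is just an example):

```shell
cat > /tmp/ids.txt <<'EOF'
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
EOF
# Split on "." or any-case "DRAFT_", blank out the middle field, re-glue with "_".
awk -F'[.]|[dD][rR][aA][fF][tT]_' '{$2="_";sub(/ +_ +/,"_")} 1' /tmp/ids.txt
```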
A workable solution
You can use:
sed 's/[.].*[dD][rR][aA][fF][tT]_/_/' data
You could also use \. in place of [.] but I'm allergic to unnecessary backslashes — you might be too if you'd had to spend time fighting whether 8 or 16 consecutive backslashes was the correct way to write to document markup (using troff).
For your sample data, it produces:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
What went wrong
Your command is:
sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g'
This matches:
any character (the leading .)
lower-case 'a'
any sequence of characters
an alphanumeric character
upper-case only DRAFT
underscore
any sequence of characters
underscore
an alphanumeric character
doing the match globally on each line
It then deletes all the matched material. You could rescue it by using:
sed 's/[.]a.*[a-zA-Z0-9]DRAFT\(_.*[^_]_[a-zA-Z0-9]\)/\1/'
This matches a dot rather than any character, and saves the material after DRAFT starting with the underscore (that's the \(…\)), replacing what was matched with what was saved (that's the \1). You can convert DRAFT to the case-insensitive pattern too, of course. However, the terms of the question refer to "from the first dot (.) up to (but not including) the underscore after a (case-insensitive) DRAFT", and detailing, saving, and replacing the material after the underscore is not necessary.
Laziness
I saved myself typing the elaborate case-insensitive string by using a program called mkpattern that (according to RCS) I created on 1990-08-23 (a while ago now). It's not rocket science. I use it every so often — I've actually used it a number of times in the last week, searching for some elusive documents on a system at work.
$ mkpattern DRAFT
[dD][rR][aA][fF][tT]
$
You might have to do that longhand in future.
try something like
{mawk/mawk2/gawk} 'BEGIN { FS = "[\056].+DRAFT_"; OFS = "_"; } (NF < 2) || ($1 = $1)'
It might not be the fastest but it's a relatively clean approach. Octal \056 is the period, which is less ambiguous to the reader when the next item is a ".+"; setting OFS to "_" restores the underscore that the field separator consumed.
This might work for you (GNU sed):
sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p' file
First, the -n and -E options turn off implicit printing and enable extended regexps, making the regexp easier to read.
Since we want the first occurrence of DRAFT we can not use a regexp that uses the .* idiom as this is greedy and may pass over it if there are two or more such occurrences. Thus we replace the DRAFT by a unique character which cannot occur in the line. The newline can only be introduced by the programmer and is the best choice.
Now we can use the .* idiom as the introduced newline can only exist if the previous substitution has matched successfully.
N.B. The i flag in the first substitution allows for any upper/lower case rendition of the string DRAFT, also the second substitution includes the p flag to print the successful substitution.
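An end-to-end run of this approach on the sample identifiers (the /tmp path is just an example):

```shell
cat > /tmp/ids2.txt <<'EOF'
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
EOF
# Mark the first (case-insensitive) DRAFT with a newline, then delete ".…\n".
sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p' /tmp/ids2.txt
```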

Printing lines with duplicate words

I am trying to print all line that can contain same word twice or more
E.g. with this input file:
cat dog cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Output should be
cat dog cat
apple peanut banana apple
car train car train
I have tried this code and it works but I think there must be a shorter way.
awk '{ a=0;for(i=1;i<=NF;i++){for(j=i+1;j<=NF;j++){if($i==$j)a=1} } if( a==1 ) print $0}'
Later I want to find all such duplicate words and delete all the duplicate entries except for 1st occurrence.
So input:
cat dog cat lion cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Desired output:
cat dog lion
dog cat deer
apple peanut banana
car bus train plane
car train
You can use this GNU sed command:
sed -rn '/(\b\w+\b).*\b\1\b/ p' yourfile
-r activate extended re and n deactivates the implicit printing of every line
the p command then prints only lines that match the preceding re (inside the slashes):
\b\w+\b matches a word: a nonempty sequence of word characters (\w) between word boundaries (\b); these are GNU extensions
such a word is "stored" in \1 for later reuse, due to the use of parentheses
then we try to match this word with \b\1\b again with something optional (.*) between those two places.
and that is the whole trick: match something, put it in parentheses so you can reuse it in the same re with \1
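A quick demonstration of the matching command on the sample input (the /tmp path is just an example):

```shell
printf 'cat dog cat\ndog cat deer\napple peanut banana apple\ncar bus train plane\ncar train car train\n' > /tmp/words.txt
# Print only lines where some word reappears later on the same line.
sed -rn '/(\b\w+\b).*\b\1\b/ p' /tmp/words.txt
```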
To answer the second part of the question, deleting the doubled words after the first, but print all lines (modifying only the lines with doubled words), you could use some sed s magic:
sed -r ':A; s/(.*)(\b\w+\b)(.*)\b\2\b(.*)/\1\2\3\4/g; t A'
here we use again the backreference trick.
but we have to account for the things before, between and after our doubled words, thus we have a \2 in the matching part of then s command and we have the other backreferences in the replacement part.
notice that only the \2 has no parens in the matching part and we use all groups in the replacement, thus we effectively deleted the second word of the pair.
for more repetitions of the word we need loop:
:A is a label
t A jumps to the label if there was a replacement done in the last s command
this builds a "while loop" around the s to delete the other repetitions, too
Here's a solution for printing only lines that contain duplicate words.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { print ; next }
seen[$i] = 1
}
}'
Here's a solution for deleting duplicate words after the first.
awk '{
delete seen
for (i=1;i<=NF;++i) {
if (seen[$i]) { continue }
printf("%s ", $i);
seen[$i] = 1
}
print "";
}'
Re your comment...
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski, 1997
With egrep you can use a so called back reference:
egrep '(\b\w+\b).*\b\1\b' file
(\b\w+\b) matches a word at word boundaries in capturing group 1. \1 references that matched word in the pattern.
I'll show solutions in Perl as it is probably the most flexible tool for text parsing, especially when it comes to regular expressions.
Detecting Duplicates
perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file
where
-n causes Perl to execute the expression passed via -e for each input line;
\b matches word boundaries;
\S+ matches one or more non-space characters;
.*? is a non-greedy match for zero or more characters;
\1 is a backreference to the first group, i.e. the word \S+;
g globally matches the pattern repeatedly in the string.
Removing Duplicates
perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file
where
-p causes Perl to print the line ($_), like sed;
1 while loop runs as long as the substitution replaces something;
\K keeps the part matching the previous expression;
Duplicate words (\s\1\b) are replaced with empty string (//g).
Why Perl?
Perl regular expressions are known to be very flexible, and regular expressions in Perl are actually more than just regular expressions. For example, you can embed Perl code into the substitution using the /e modifier. You can use the /x modifier that allows to write regular expressions in a more readable format and even use Perl comments in it, e.g.:
perl -pe '1 while (
s/ # Begins substitution: s/pattern/replacement/flags
\b (\S+) \b # A word
.*? # Ungreedy pattern for any number of characters
\K # Keep everything that matched the previous patterns
( # Group for the duplicate word:
\s # - space
\1 # - backreference to the word
\b # - word boundary
)
//xg
)' file
As you should have noticed, the \K anchor is very convenient, but is not available in many popular tools including awk, bash, and sed.

How to parse this string into kv in awk in a simple way

Now I have a str in awk like this:
str = "a='abc',b=1,c='http://xxxx,http://yyyy,http://zzz'"
How can I parse it to get this result:
(a abc)(b 1)(c http://xxxx,http://yyyy,http://zzz)
By now I still implement it in such an ugly way:
result = ""
while (match(str, /[^=]*=('[^']*'|[^,]*),/) != 0) {
subs = substr(str, RSTART, RLENGTH)
str = substr(str, RSTART + RLENGTH, length(str) - RSTART - RLENGTH + 1)
split(subs, vec, "=")
gsub(/'/, "", vec[1])
gsub(/'/, "", vec[2])
if (substr(vec[2], length(vec[2]), 1) == ",") {
vec[2] = substr(vec[2], 0, length(vec[2]) - 1)
}
result = result"("vec[1]" "vec[2]")"
}
I wonder if there exist some more elegant way.
Using awk
The trick here is that we need to treat quoted commas differently from unquoted commas. That can be done as follows:
$ echo "$str" | awk -F"'" -v OFS="" '{for (i=1;i<=NF;i+=2) gsub(",", ")(", $i)} {gsub("=", " "); print "("$0")"}'
(a abc)(b 1)(c http://xxxx,http://yyyy,http://zzz)
How it works
-F"'" -v OFS=""
This sets the input field separator to a single quote and the output separator to an empty string.
{for (i=1;i<=NF;i+=2) gsub(",", ")(", $i)}
This replaces unquoted commas (odd fields) with )(.
Even numbered fields represent the quoted strings and they are left unchanged here.
gsub("=", " ")
This replaces equal signs with spaces.
print "("$0")"
This adds parens to the beginning and end and prints the line.
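The odd/even rule is easy to see by printing the fields that result from splitting on single quotes (brackets added here only to make the empty field visible):

```shell
# Split the sample string on single quotes and show each field.
echo "a='abc',b=1,c='http://xxxx,http://yyyy,http://zzz'" |
awk -F"'" '{for (i=1;i<=NF;i++) printf "field %d: [%s]\n", i, $i}'
```

Odd-numbered fields (1, 3, 5) lie outside the quotes, so their commas are the real separators; even-numbered fields (2, 4) are the quoted values, which the gsub loop leaves alone.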
Using sed
$ echo "$str" | sed -r ":a; s/^(([^']*'[^']*')*[^']*'[^,']*),/\1\n/; ta; s/,/)(/g; s/^/(/; s/$/)/; s/\n/,/g; s/'//g; s/=/ /g"
(a abc)(b 1)(c http://xxxx,http://yyyy,http://zzz)
How it works
First, remember that sed processes input line-by-line. That means that, unless we put one in it, no line in sed's pattern space will contain a newline character.
This command works by replacing all quoted commas with newline characters. It then adds ( to the beginning of the line, ) to the end of the line, and replaces the remaining commas with )(. The newline characters are changed back to commas. Next the single-quotes are removed. Finally, the = signs are then replaced with spaces and we are done.
We can tell whether a comma is quoted or unquoted by whether is it is preceded by an odd or an even number of single-quotes.
In more detail:
sed -r
This starts sed with extended regular expressions.
:a; s/^(([^']*'[^']*')*[^']*'[^,']*),/\1\n/; ta
This converts all quoted commas into newline characters. The regex ^(([^']*'[^']*')*[^']*'[^,']*) matches, starting at the beginning of the line, any odd number of single-quotes and the text surrounding them up to the first comma afterward. The substitution command s/^(([^']*'[^']*')*[^']*'[^,']*),/\1\n/ consequently replaces the first quoted comma found with a newline, \n.
:a is a label. ta is a test: it branches back to label a if a substitution was made. Thus, as many substitutions are made as needed to replace all the quoted commas with newline characters.
s/,/)(/g; s/^/(/; s/$/)/
These three substitution commands put parens everywhere that we want one.
s/\n/,/g
Now that we have parens where we need them, this converts the newline characters that we added back to commas.
s/'//g
This removes all the single quotes.
s/=/ /g
This replaces the equal signs with spaces.

Using awk how do I reprint a found pattern with a new line character?

I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where "bcd" can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of ";...:" and then reprint this pattern without the initial ;
I concluded I would have to use awk's 'gsub' to do this, but have no idea how to replicate the pattern nor how to print the pattern again with this added new line character 1 character into my pattern.
Is this possible?
If not, can you please direct me in a way of tackling it?
We can't quite be sure of the variability in the aaa or bcd parts; presumably, each one could be almost anything.
You should probably be looking for:
a series of one or more non-colon, non-semicolon characters followed by colon,
with one or more repeats of:
a series of one or more non-colon, non-semicolon characters followed by a semi-colon
That makes up the unit you want to match.
/[^:;]+:([^:;]+;)+/
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
{
echo "aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;"
echo "aaz: xcd;ycd;bczdd;baa:bed;bid;bud;"
} |
awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
Example output
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
aaz: xcd;ycd;bczdd;
baa:bed;bid;bud;
Paraphrasing the question in a comment:
Why does the regular expression not include the characters before a colon (which is what it's intended to do, but I don't understand why)? I don't understand what "breaks" or ends the regex.
As I tried to explain at the top, you're looking for what we can call 'words', meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is [^:;]+, meaning one or more (+) of the negated character class — one or more non-colon, non-semicolon characters.
Let's pretend that spaces in a regex are not significant. We can space out the regex like this:
/ [^:;]+ : ( [^:;]+ ; ) + /
The slashes simply mark the ends, of course. The first cluster is a word; then there's a colon. Then there is a group enclosed in parentheses, tagged with a + at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What's inside the group? Well, a word followed by a semicolon. It doesn't have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a * in place of the +, of course.
The key to the regex stopping is that the aaa: in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the aaa: doesn't match the group. The gsub() therefore finds the first sequence, and replaces that text with the same material and a newline (that's the "&\n", of course). It (gsub()) then resumes its search directly after the end of the replacement material, and — lo and behold — there is a word followed by a colon and some words followed by semicolons, so there's a second match to be replaced with its original material plus a newline.
Note that the gsub() also appends a newline after the final match, which ends at the very end of the line, so $0 then ends with a newline. Without the sub() to remove that trailing newline, the print (implicitly of $0 plus a newline) generated a blank line I didn't want in the output, so I removed the extraneous newline(s).
This might work for you:
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1' file
Prepend a newline (\n) to any run of characters containing neither a ; nor a : that is followed by a :
Remove any newline prepended to the start of line.
Replace any : followed by none or many spaces with a : followed by a single space.
Print all lines.
Or this:
sed 's/;\([^;:]*: *\)/;\n\1 /g' file
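An end-to-end check of the awk one-liner above on the sample line:

```shell
# Break before each "word:" prefix, drop the leading break, normalize ": " spacing.
echo 'aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;' |
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1'
```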
Not sure how to do it in awk, but with sed this does what I think you want:
$ nl='
'
$ sed "s/\([^;]*:\)/\\${nl}\1/g" input
The first command sets the shell variable $nl to the string containing a single new line. Some versions of sed allow you to use \n inside the replacement string, but not all allow that. This keeps any whitespace that appears after the final ; and puts it at the start of the line. To get rid of that, you can do
$ sed "s/\([^;]*:\)/\\${nl}\1/g; s/\n */\\$nl/g" input
Ordinary awk gsub() and sub() don't allow you to specify components in the replacement strings. GNU awk ("gawk") supplies gensub(), which would allow gensub(/(;) (.+:)/, "\\1\n\\2", "g").