How to parse this string into kv in awk in a simple way - awk

I have a string in awk like this:
str = "a='abc',b=1,c='http://xxxx,http://yyyy,http://zzz'"
How can I parse it to get this result:
(a abc)(b 1)(c http://xxxx,http://yyyy,http://zzz)
So far I have implemented it in this ugly way:
result = ""
while (match(str, /[^=]*=('[^']*'|[^,]*),/) != 0) {
subs = substr(str, RSTART, RLENGTH)
str = substr(str, RSTART + RLENGTH, length(str) - RSTART - RLENGTH + 1)
split(subs, vec, "=")
gsub(/'/, "", vec[1])
gsub(/'/, "", vec[2])
if (substr(vec[2], length(vec[2]), 1) == ",") {
vec[2] = substr(vec[2], 0, length(vec[2]) - 1)
}
result = result"("vec[1]" "vec[2]")"
}
I wonder if there is a more elegant way.

Using awk
The trick here is that we need to treat quoted commas differently from unquoted commas. That can be done as follows:
$ echo "$str" | awk -F"'" -v OFS="" '{for (i=1;i<=NF;i+=2) gsub(",", ")(", $i)} {gsub("=", " "); print "("$0")"}'
(a abc)(b 1)(c http://xxxx,http://yyyy,http://zzz)
How it works
-F"'" -v OFS=""
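This sets the input field separator to a single quote and the output field separator to an empty string. Because modifying a field makes awk rebuild $0 using OFS, the single quotes disappear from the output.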
{for (i=1;i<=NF;i+=2) gsub(",", ")(", $i)}
This replaces unquoted commas (odd fields) with )(.
Even numbered fields represent the quoted strings and they are left unchanged here.
gsub("=", " ")
This replaces equal signs with spaces.
print "("$0")"
This adds parens to the beginning and end and prints the line.
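To see why stepping through the odd-numbered fields works, it can help to print every field that -F"'" produces for the sample string. This is only an illustrative sketch; the variable name fields is made up for the example.

```shell
str="a='abc',b=1,c='http://xxxx,http://yyyy,http://zzz'"

# Print every field with its index to see which ones were inside quotes.
fields=$(printf '%s\n' "$str" |
  awk -F"'" '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }')
printf '%s\n' "$fields"
# 1:[a=]
# 2:[abc]
# 3:[,b=1,c=]
# 4:[http://xxxx,http://yyyy,http://zzz]
# 5:[]
```

The even-numbered fields (2 and 4) hold the quoted text, so only fields 1, 3 and 5 are searched for commas.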
Using sed
$ echo "$str" | sed -r ":a; s/^(([^']*'[^']*')*[^']*'[^,']*),/\1\n/; ta; s/,/)(/g; s/^/(/; s/$/)/; s/\n/,/g; s/'//g; s/=/ /g"
(a abc)(b 1)(c http://xxxx,http://yyyy,http://zzz)
How it works
First, remember that sed processes input line-by-line. That means that, unless we put one in it, no line in sed's pattern space will contain a newline character.
This command works by replacing all quoted commas with newline characters. It then replaces the remaining commas with )(, adds ( to the beginning of the line, and adds ) to the end of the line. The newline characters are changed back to commas. Next, the single-quotes are removed. Finally, the = signs are replaced with spaces and we are done.
We can tell whether a comma is quoted or unquoted by whether it is preceded by an odd or an even number of single-quotes: a quoted comma is preceded by an odd number.
In more detail:
sed -r
This starts sed with extended regular expressions.
:a; s/^(([^']*'[^']*')*[^']*'[^,']*),/\1\n/; ta
This converts all quoted commas into newline characters. The regex ^(([^']*'[^']*')*[^']*'[^,']*), matches, starting at the beginning of the line, an odd number of single-quotes and the text surrounding them, up to the first comma afterward. The substitution command s/^(([^']*'[^']*')*[^']*'[^,']*),/\1\n/ consequently replaces the first quoted comma found with a newline, \n.
:a is a label. ta is a test: it branches back to label a if a substitution was made. Thus, as many substitutions are made as needed to replace all the quoted commas with newline characters.
s/,/)(/g; s/^/(/; s/$/)/
These three substitution commands put parens everywhere that we want them.
s/\n/,/g
Now that we have parens where we need them, this converts the newline characters that we added back to commas.
s/'//g
This removes all the single quotes.
s/=/ /g
This replaces the equal signs with spaces.
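The same quote-parity idea can also be sketched as a small character-by-character state machine in awk. This is not the sed command above, just an illustration of the principle; parse_kv and inq are hypothetical names, and "\047" is the octal escape for a single quote.

```shell
str="a='abc',b=1,c='http://xxxx,http://yyyy,http://zzz'"

# parse_kv is a hypothetical helper, not part of the answers above.
parse_kv() {
  awk '{
    out = "("; inq = 0                                 # inq is 1 while inside single quotes
    for (i = 1; i <= length($0); i++) {
      ch = substr($0, i, 1)
      if (ch == "\047") { inq = !inq }                 # a single quote: drop it and flip the state
      else if (ch == "," && !inq) { out = out ")(" }   # unquoted comma ends a pair
      else if (ch == "=" && !inq) { out = out " " }    # unquoted = separates key and value
      else { out = out ch }                            # everything else is copied as-is
    }
    print out ")"
  }'
}

result=$(printf '%s\n' "$str" | parse_kv)
printf '%s\n' "$result"
# (a abc)(b 1)(c http://xxxx,http://yyyy,http://zzz)
```

Unlike the sed command, this sketch leaves any = that appears inside quotes untouched.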

Related

Gawk matching one word - one unexpected match

I wanted to get all matches in Column 3 which have the exact word "aa" (case insensitive match) in the string in Column 3
The gawk command used in the awk file is:
$3 ~ /\<aa\>/
The BEGIN statement specifies: IGNORECASE = 1
The command returns 20 rows. What is puzzling is this value in Column 3 in the returned rows:
aA.AHAB
How do I avoid this row? It is not a word by itself, because there is a dot following the first two a's and not a space.
A is a word character. . is not a word character. \> matches the zero-width string at the end of a word. Such a zero-width string occurs between the second A and the dot.
To search for the string aa delimited by space characters (or start/end of field):
$3 ~ /(^|[ ])aa([ ]|$)/
Add any other characters that you care about inside the set ([ ]).
Note that by default, awk splits records into fields on whitespace, so you will not get any spaces in $3 unless you have changed the value of FS.
1st solution: to match exactly aa, try:
awk 'BEGIN{IGNORECASE=1} $3 ~ /^aa$/' Input_file
2nd solution: without the IGNORECASE option, try:
awk 'tolower($3)=="aa"' Input_file
Question: Why does the awk regex-pattern /\<aa\>/ match a string like "aa.bbb"?
We can quickly verify this with:
$ echo aa.bbb | awk '/\<aa\>/'
aa.bbb
The answer is simply found in the manual of gnu awk:
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\<: Matches the empty string at the beginning of a word. For example, /\<away/ matches "away" but not "stowaway".
\>: Matches the empty string at the end of a word. For example, /stow\>/ matches "stow" but not "stowaway".
Source: GNU awk manual, Section 3: Regular Expressions
So to come back to the example from above, the string "aa.bbb" contains two words, "aa" and "bbb", since the dot character is not part of the character set that can make up a word. The empty strings matched here are the empty string before "aa.bbb" and the empty string between the characters a and . (an empty string is really an empty string: length 0, 0 characters, commonly written as "").
Solution to the OP: Since FS is most likely the default value, the field $3 cannot have a space. So the following two solutions are possible:
$3 ~ /^aa$/
$3 == "aa"
If the field separator FS is defined in the code, the following might work:
(" " $3 " ") ~ / aa /
$3 ~ /(^|[ ])aa([ ]|$)/ # See solution of JHNC
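A minimal sketch of both cases, using tolower() instead of IGNORECASE so that it does not depend on gawk. The sample rows and the variable names exact and token are invented for the illustration.

```shell
# Default FS: a field cannot contain blanks, so compare the whole field.
exact=$(printf '%s\n' '1 x aA.AHAB' '2 x AA' '3 x aab' |
  awk 'tolower($3) == "aa" { print $1 }')
printf '%s\n' "$exact"
# 2

# Tab-separated input: the field may contain spaces, so pad it with
# spaces and look for aa as a space-delimited token.
token=$(printf '1\tx\tfoo aa bar\n2\tx\taA.AHAB\n3\tx\tAa\n' |
  awk -F'\t' '(" " tolower($3) " ") ~ / aa / { print $1 }')
printf '%s\n' "$token"
# 1
# 3
```

In both runs the aA.AHAB row is rejected, because the dot is neither the end of the field nor a space.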

awk - Rounding all floating-point numbers in multi-line text file

Assume a multi-line text file that contains multiple floating-point numbers as well as alphanumeric strings and special characters per line. The only consistency is that all floats are separated from any other string by a single whitespace. Further, assume that we wish to round each floating-point number to a maximum of n digits after the decimal point. All strings other than the floats shall remain in place and as is. Let us assume that n=5.
I know this can be implemented via awk easily. My current code (below) only rounds the last float of each line and swallows all strings that precede it. How do I improve it?
echo -e "\textit{foo} & 1234.123456 & -1234.123456\n1234.123456" |\
awk '{for(i=1;i<=NF;i++);printf("%.05f\n",$NF)}'
# -1234.12346
# 1234.12346
Using perl:
perl -i -pe 's/(\d+\.\d+)/sprintf "%.05f", $1/eg' file
One solution:
$ echo -e "\textit{foo} & 1234.123456 & -1234.123456\n1234.123456" |
awk '{for(i=1;i<=NF;i++){if ($i ~ /[0-9]+\.[0-9]+/){printf "%.05f\n", $i}}}'
Output:
1234.12346
-1234.12346
1234.12346
Is this what you're trying to do?
$ printf '\textit{foo} & 1234.123456 & -1234.123456\n1234.123456\n' |
awk -F'[ ]' '{for(i=1;i<=NF;i++) if ($i+0 == $i) $i = sprintf("%.05f",$i)} 1'
extit{foo} & 1234.12346 & -1234.12346
1234.12346
if ($i+0 == $i) is the idiomatic awk way to test whether a value is a number, since only a number could have the same value on the left and right side of that comparison.
I'm setting FS to a literal, single blank character ('[ ]') instead of its default which, confusingly, is also a single blank character (' '). The default is treated specially: every run of contiguous white space acts as one separator and leading/trailing blanks are ignored. When $0 is then rebuilt (e.g. as caused by assigning to any field), the original spacing is lost, so the default would not allow your formatting to be maintained in the output.
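The effect of the literal-blank separator can be sketched with a small wrapper. Note the assumptions: round_floats is a hypothetical helper name, n is passed as a parameter, and a regex test for "digits, dot, digits" is swapped in for the $i+0 == $i idiom so that only values with a decimal part are touched.

```shell
# round_floats N: round every float on stdin to N decimals, keeping spacing.
round_floats() {
  awk -F'[ ]' -v n="${1:-5}" '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[-+]?[0-9]+\.[0-9]+$/)      # field is an optionally signed float
        $i = sprintf("%." n "f", $i)         # builds a format such as %.5f
  } 1'
}

# The double space after the first & is deliberate.
rounded=$(printf '%s\n' 'foo &  1234.123456 & -1234.123456' | round_floats 5)
printf '%s\n' "$rounded"
# foo &  1234.12346 & -1234.12346
```

Because each single blank is its own separator, the empty field between the two spaces is preserved when $0 is rebuilt, so the double space survives.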

To find ";" then delete spaces up to next character

I have many lines starting with ";", then 1 or more spaces, followed by some other character(s) on the same line. I need to remove the spaces following the ";" up to but not including the characters that follow.
I tried a few variations of the following code because it worked great on lines with empty spaces, but I am not very familiar with awk.
awk '{gsub(/^ +| +$/,"")}1' filea>fileb
Sample input:
; 4
; group 452
; ring
Output wanted:
;4
;group 452
;ring
To remove any white space after the first semicolon, try:
$ awk '{sub(/^;[[:blank:]]+/, ";")} 1' filea
;4
;group 452
;ring
The regex ^;[[:blank:]]+ matches the first semicolon and any blanks or tabs which follow it. The function sub replaces this with ;. Since this only occurs once on the line (at the beginning), there is no need for gsub.
[[:blank:]] is the POSIX character class for spaces and tabs; it is locale-aware, unlike a hard-coded space.
awk '{sub(/^; +/,";")}1' file
;4
;group 452
;ring
sed would also do:
sed -E 's/^(;){1}([[:blank:]]+)/\1/' file
The parentheses act as capture groups, and a \number combination refers to the corresponding group.
In ^(;){1}([[:blank:]]+) we check for the start of line (^) and a ; that occurs one time ({1}), followed by one or more blank characters ([[:blank:]]+), and then replace the matched pattern with our first group.
The following defines the field separator FS to be the start of the record followed by a ; and zero or more spaces. It then redefines the output field separator OFS to be just a ;. The conversion from FS to OFS is done by reassigning $1 to itself.
awk 'BEGIN{FS="^; *";OFS=";"}{$1=$1}1'
or
awk -F'^; *' -vOFS=";" '{$1=$1}1'
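A quick check of the anchored sub() approach on input with several spaces, a tab, and a second semicolon later in the line. The sample lines are invented, and [ \t] is used here as a conservative stand-in for [[:blank:]] in case an older awk lacks POSIX character classes.

```shell
cleaned=$(printf ';   4\n;\tgroup  452\n; ring ; x\n' |
  awk '{ sub(/^;[ \t]+/, ";") } 1')
printf '%s\n' "$cleaned"
# ;4
# ;group  452
# ;ring ; x
```

Only the blanks directly after the leading semicolon are removed; the double space inside the second line and the "; x" in the third are left alone because the regex is anchored with ^.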

How to replace one character inside parentheses keep everything else as is

The data looks like this :
There is stuff here (word, word number phrases)
(word number anything, word phrases), even more
...
There are a lot of them in different files. There are different kinds of data all around them too that aren't in the same format. The data inside the parentheses can't change, and it's always on the same line. I do not have to deal with:
(stuff number,
maybe more here)
I would like to be able to replace the comma with a colon.
Desired output would be:
There is stuff here (word: word number phrases)
(word number anything: word phrases), even more
...
Here's a version for awk that uses the parentheses as record separators:
awk -v RS='[()]' 'NR%2 == 0 {sub(/,/,":")} {printf "%s%s", $0, RT}' file
The stuff between parentheses will be every even-numbered record. The RT variable holds the character that matched the RS pattern for this record.
Note that this only replaces the first comma of the parenthesized text. If you want to replace all of them, use gsub in place of sub.
Assuming there's only one comma to be replaced inside parentheses, this POSIX BRE sed expression will replace it with colon:
sed 's/(\(.*\),\(.*\))/(\1:\2)/g' file
If there is more than one comma, only the last one will be replaced.
In multiple-commas scenario, you can replace only the first one with:
sed 's/(\([^,]*\),\([^)]*\))/(\1:\2)/g' file
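The difference between the two expressions comes from greedy matching, which a few invented inputs can illustrate (the variable names last, first and span are only for this sketch):

```shell
# Greedy .* runs up to the LAST comma before the closing parenthesis.
last=$(printf '%s\n' '(a,b,c)' | sed 's/(\(.*\),\(.*\))/(\1:\2)/g')
printf '%s\n' "$last"
# (a,b:c)

# [^,]* cannot cross a comma, so the FIRST comma is the one replaced.
first=$(printf '%s\n' '(a,b,c)' | sed 's/(\([^,]*\),\([^)]*\))/(\1:\2)/g')
printf '%s\n' "$first"
# (a:b,c)

# With two groups on one line, the greedy version spans both of them.
span=$(printf '%s\n' 'x (a,b) y (c,d)' | sed 's/(\(.*\),\(.*\))/(\1:\2)/g')
printf '%s\n' "$span"
# x (a,b) y (c:d)
```

The last case shows that the .* form is only safe when a line holds a single parenthesized group.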
While @randomir's sed solution dwells on replacing a single comma inside parentheses, there is a way to replace multiple commas inside parentheses with sed, too.
Here is the code:
sed '/(/ {:a s/\(([^,()]*\),/\1:/; t a}'
or
sed '{:a;s/\(([^,()]*\),/\1:/;ta}'
or
sed -E '{:a;s/(\([^,()]*),/\1:/;ta}'
In all cases, the main part is between the curly braces. Here are the details for the POSIX ERE (sed with the -E option) pattern:
:a; - define a label named a
s/(\([^,()]*),/\1:/; - find a match, capture part of it into Group 1, and replace it:
(\([^,()]*) - Group 1: a ( char followed by zero or more chars other than ,, ( and ) (so only those commas are replaced that are between the closest ( and ) chars, not inside (..,.(...,.); remove ( from the bracket expression to also match in the latter patterns)
, - a comma
\1: - replace with the Group 1 contents plus a colon after it
ta - loop back to :a if the preceding substitution made a replacement.
Using awk
$ awk -v FS="" -v OFS="" '{ c=0; for(i=1; i<=NF; i++){ if( $i=="(" || $i ==")" ) c=1-c; if(c==1 && $i==",") $i=":" } }1' file
There is stuff here (word: word number phrases)
(word number anything: word phrases), even more
-v FS="" -v OFS="" sets FS to the empty string so that each character is treated as a field.
The variable c is reset to 0 for every line. The for loop iterates over each field and toggles the value of c whenever ( or ) is encountered.
If c==1 and the field is a comma, it is replaced with :.
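The awk answers above rely on a regex RS with RT (gawk-specific) or on an empty FS. Where neither is available, a loop around match() can do the same job. This is a sketch under that assumption, not one of the posted answers; colonize is a hypothetical name, and it handles non-nested parentheses only.

```shell
# colonize: replace every comma inside (...) with a colon.
colonize() {
  awk '{
    out = ""; rest = $0
    while (match(rest, /\([^()]*\)/)) {                # next innermost (...) group
      grp = substr(rest, RSTART, RLENGTH)
      gsub(/,/, ":", grp)                              # change commas in the group only
      out = out substr(rest, 1, RSTART - 1) grp        # keep the text before it unchanged
      rest = substr(rest, RSTART + RLENGTH)            # continue after the group
    }
    print out rest
  }'
}

colonized=$(printf '%s\n' 'i,j,k (a,b,c) bar (1,2)' | colonize)
printf '%s\n' "$colonized"
# i,j,k (a:b:c) bar (1:2)
```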
With perl
$ perl -pe 's/\([^()]+\)/$&=~s|,|:|gr/ge' ip.txt
There is stuff here (word: word number phrases)
(word number anything: word phrases), even more
$ echo 'i,j,k (a,b,c) bar (1,2)' | perl -pe 's/\([^()]+\)/$&=~s|,|:|gr/ge'
i,j,k (a:b:c) bar (1:2)
$ # since only single character is changed, can also use tr
$ echo 'i,j,k (a,b,c) bar (1,2)' | perl -pe 's/\([^()]+\)/$&=~tr|,|:|r/ge'
i,j,k (a:b:c) bar (1:2)
The e modifier allows Perl code in the replacement section.
\([^()]+\) matches non-nested () with one or more characters inside.
$&=~s|,|:|gr performs another substitution on the matched text; the r modifier returns the modified text.

Using awk how do I reprint a found pattern with a new line character?

I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where "bcd" can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of ";...:" and then reprint this pattern without the initial ;
I concluded I would have to use awk's 'gsub' to do this, but have no idea how to replicate the pattern nor how to print the pattern again with this added new line character 1 character into my pattern.
Is this possible?
If not, can you please direct me in a way of tackling it?
We can't quite be sure of the variability in the aaa or bcd parts; presumably, each one could be almost anything.
You should probably be looking for:
a series of one or more non-colon, non-semicolon characters followed by colon,
with one or more repeats of:
a series of one or more non-colon, non-semicolon characters followed by a semi-colon
That makes up the unit you want to match.
/[^:;]+:([^:;]+;)+/
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
{
echo "aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;"
echo "aaz: xcd;ycd;bczdd;baa:bed;bid;bud;"
} |
awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
Example output
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
aaz: xcd;ycd;bczdd;
baa:bed;bid;bud;
Paraphrasing the question in a comment:
Why does the regular expression not include the characters before a colon (which is what it's intended to do, but I don't understand why)? I don't understand what "breaks" or ends the regex.
As I tried to explain at the top, you're looking for what we can call 'words', meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is [^:;]+, meaning one or more (+) of the negated character class — one or more non-colon, non-semicolon characters.
Let's pretend that spaces in a regex are not significant. We can space out the regex like this:
/ [^:;]+ : ( [^:;]+ ; ) + /
The slashes simply mark the ends, of course. The first cluster is a word; then there's a colon. Then there is a group enclosed in parentheses, tagged with a + at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What's inside the group? Well, a word followed by a semicolon. It doesn't have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a * in place of the +, of course.
The key to the regex stopping is that the aaa: in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the aaa: doesn't match the group. The gsub() therefore finds the first sequence, and replaces that text with the same material and a newline (that's the "&\n", of course). It (gsub()) then resumes its search directly after the end of the replacement material, and — lo and behold — there is a word followed by a colon and some words followed by semicolons, so there's a second match to be replaced with its original material plus a newline.
Note that $0 does not include the newline that ended the input line; awk strips the record separator. The extra newline comes from the gsub(): it appends "\n" after every match, including the last one, and print then adds its own newline. Without the sub() to remove the trailing newline(s), that generated a blank line I didn't want in the output, so I removed the extraneous newline(s).
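To see exactly where each match starts and stops, the same regex can be run through a match() loop that prints every match in brackets. This is only a diagnostic sketch; the variable names groups and rest are made up.

```shell
groups=$(printf '%s\n' 'aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;' | awk '{
  rest = $0
  while (match(rest, /[^:;]+:([^:;]+;)+/)) {
    printf "[%s]\n", substr(rest, RSTART, RLENGTH)   # the text of this match
    rest = substr(rest, RSTART + RLENGTH)            # resume right after it
  }
}')
printf '%s\n' "$groups"
# [aaa: bcd;bcd;bcddd;]
# [aaa:bcd;bcd;bcd;]
```

The first match ends just before the second aaa, because that word is followed by a colon rather than a semicolon and so cannot extend the repeated group.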
This might work for you:
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1' file
Prepend a newline (\n) to any run of characters other than ; and : that is followed by a :
Remove any newline prepended to the start of line.
Replace any : followed by none or many spaces with a : followed by a single space.
Print all lines.
Or this:
sed 's/;\([^;:]*: *\)/;\n\1 /g' file
Not sure how to do it in awk, but with sed this does what I think you want:
$ nl='
'
$ sed "s/\([^;]*:\)/\\${nl}\1/g" input
The first command sets the shell variable $nl to the string containing a single new line. Some versions of sed allow you to use \n inside the replacement string, but not all allow that. This keeps any whitespace that appears after the final ; and puts it at the start of the line. To get rid of that, you can do
$ sed "s/\([^;]*:\)/\\${nl}\1/g; s/\n */\\$nl/g" input
Ordinary awk gsub() and sub() don't allow you to refer to parenthesized components of the match in the replacement string (only & for the whole match). GNU awk (gawk) supplies gensub(), which does, for example: gensub(/(;)([^;:]+:)/, "\\1\n\\2", "g")