With sed or awk, how to replace all occurrences of string between quotes? - awk

Given a file that looks like:
some text
no replace "text in quotes" no replace
more text
no replace "more text in quotes" no replace
even more text
no replace "even more text in quotes" no replace
etc
What sed or awk script would replace every e that is between quotes, and only those, so that something like the following is produced:
some text
no replace "t##$xt in quot##$s" no replace
more text
no replace "mor##$ t##$xt in quot##$s" no replace
even more text
no replace "##$v##$n mor##$ t##$xt in quot##$s" no replace
etc
There can be any number of e's between the quotes.

$ awk 'BEGIN{FS=OFS="\""} {gsub(/e/,"##$",$2)} 1' file
some text
no replace "t##$xt in quot##$s" no replace
more text
no replace "mor##$ t##$xt in quot##$s" no replace
even more text
no replace "##$v##$n mor##$ t##$xt in quot##$s" no replace
etc
Also consider multiple pairs of quotes on a line:
$ echo 'aebec"edeee"fegeh"eieje"kelem' |
awk 'BEGIN{FS=OFS="\""} {gsub(/e/,"##$",$2)} 1'
aebec"##$d##$##$##$"fegeh"eieje"kelem
$ echo 'aebec"edeee"fegeh"eieje"kelem' |
awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) gsub(/e/,"##$",$i)} 1'
aebec"##$d##$##$##$"fegeh"##$i##$j##$"kelem
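To see why the loop steps through the even-numbered fields, it can help to print every field with its index. The following is a minimal sketch (the helper name show_fields is made up for illustration): with a double quote as the field separator, quoted text should land in fields 2, 4, 6 and so on, assuming the quotes on the line are balanced.

```shell
# Hypothetical helper: print each field with its index, splitting on double quotes.
show_fields() {
  awk 'BEGIN{FS="\""} {for (i=1;i<=NF;i++) printf "%d:%s\n", i, $i}'
}

# Fields 2 and 4 hold the text that was inside quotes.
printf '%s\n' 'a"b"c"d"e' | show_fields
```

If a line has an odd number of quotes, the last even-numbered field is the unclosed tail, so the loop would change that too.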

This might work for you (GNU sed):
sed -r ':a;s/^([^"]*("[^"e]*"[^"]*)*"[^"e]*)e/\1##$/;ta' file
This regex looks from the start of the line for a series of non-double-quote characters, followed by zero or more pairs of double quotes with no e's within them (each followed by another possible series of non-double-quote characters), followed by a double quote and a possible series of characters that are neither double quotes nor e's. If the next character is an e, it replaces the match with \1 (everything up to the e) and ##$. If the substitution is successful, ta branches back to the label :a, so the process repeats until no further substitutions occur.
N.B. This caters for lines with multiple pairs of double quoted strings.
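As a rough check of the multiple-pairs claim, the same loop can be wrapped in a function and fed a line with two quoted strings. This is only a sketch: the function name is invented here, and -E with separate -e expressions is used on the assumption that it is accepted by both GNU and BSD sed.

```shell
# Hypothetical wrapper: replace every e inside double quotes with ##$.
# The label, substitution and branch are passed as separate -e expressions.
replace_e_in_quotes() {
  sed -E -e ':a' -e 's/^([^"]*("[^"e]*"[^"]*)*"[^"e]*)e/\1##$/' -e 'ta'
}

# The e's outside the quotes are left alone; both quoted strings are changed.
printf '%s\n' 'aebec"edeee"fegeh"eieje"kelem' | replace_e_in_quotes
```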

sed ':cycle
s/^\(\([^"]*\("[^"]*"\)*\)*"[^"]*\)e/\1##$/
t cycle' YourFile
POSIX version.
It works from the last e in a quoted string back to the first.
It also changes any e in an unclosed quoted string at the end of a line, and it would then go wrong on the next line if a quoted string could span lines (not present in your sample).

Related

To find ";" then delete spaces up to next character

I have many lines starting with ";", then 1 or more spaces, followed by some other character(s) on the same line. I need to remove the spaces following the ";" up to but not including the characters that follow.
I tried a few variations of the following code because it worked great on lines with empty spaces, but I am not very familiar with awk.
awk '{gsub(/^ +| +$/,"")}1' filea>fileb
Sample input:
; 4
; group 452
; ring
Output wanted:
;4
;group 452
;ring
To remove any white space after the first semicolon, try:
$ awk '{sub(/^;[[:blank:]]+/, ";")} 1' filea
;4
;group 452
;ring
The regex ^;[[:blank:]]+ matches the first semicolon and any blanks or tabs which follow it. The function sub replaces this with ;. Since this only occurs once on the line (at the beginning), there is no need for gsub.
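[[:blank:]] is the POSIX character class for spaces and tabs. It is interpreted according to the current locale, which makes it a unicode-safe way of specifying blank space.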
awk '{sub(/^; +/,";")}1' file
;4
;group 452
;ring
sed would also do :
sed -E 's/^(;){1}([[:blank:]]+)/\1/' file
The parentheses act as capture groups, and a \number combination refers back to the text matched by the corresponding group.
In ^(;){1}([[:blank:]]+) we match the start of the line (^), a ; that occurs exactly once ({1}), and one or more blank characters ([[:blank:]]+), and then replace the whole match with the first group.
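The {1} quantifier and the second group are not strictly required, since a bare ; already matches exactly once and the blanks are simply dropped. A possibly simpler sketch that should behave the same on the sample input (strip_blanks is a name invented here):

```shell
# Hypothetical helper: drop spaces and tabs that directly follow a leading ;
strip_blanks() {
  sed -E 's/^;[[:blank:]]+/;/'
}

# Leading-; lines lose their blanks; a ; elsewhere on a line is untouched.
printf ';  4\n;\tgroup 452\nkeep ; this\n' | strip_blanks
```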
This defines the field separator FS to be the start of the record followed by a ; and zero or more spaces. It then redefines the output field separator OFS to be just a ;. The conversion from FS to OFS is done by reassigning $1 to itself.
awk 'BEGIN{FS="^; *";OFS=";"}{$1=$1}1'
or
awk -F'^; *' -vOFS=";" '{$1=$1}1'
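The $1=$1 assignment is what triggers the rebuild of the record; without it, awk prints $0 as read and OFS is never applied. A small sketch with a simpler separator (the : and - below, and the function names, are arbitrary choices for illustration, not from the question):

```shell
# No field assignment: $0 is printed exactly as it was read.
no_rebuild() {
  awk 'BEGIN{FS=":"; OFS="-"} 1'
}

# Assigning to a field makes awk rebuild $0, joining the fields with OFS.
rebuild() {
  awk 'BEGIN{FS=":"; OFS="-"} {$1=$1} 1'
}

printf 'a:b:c\n' | no_rebuild
printf 'a:b:c\n' | rebuild
```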

Field separators-trouble delimiting command characters

I'm trying to parse through html source code. In my example I'm just echoing it in. But, I am reading html from a file in practice.
Here is a bit of code that works, syntactically:
echo "<td>Here</td> some dynamic text to ignore <garbage> is a string</table>more junk" |
awk -v FS="(<td>|</td>|<garbage>|</table>)" '{print $2, $4}'
In the FS declaration I create 4 delimiters, which work fine, and I output the 2nd and 4th fields.
However, the 3rd field delimiter I actually need to use contains awk command characters, literally:
')">
such that when I change the above statement to:
echo "<td>Here</td> some dynamic text to ignore ')\"> is a string</table>more junk" |
awk -v FS="(<td>|</td>|')\">|</table>)" '{print $2, $4}'
I've tried escaping one, all, and every combination of the characters in the offending string with the \ character, but nothing is working.
This might be what you're looking for:
$ echo "<td>Here</td> some dynamic text to ignore ')\"> is a string</table>more junk" |
awk -v FS='(<td>|</td>|\047\\)">|</table>)' '{print $2, $4}'
Here is a string
In shell, always enclose strings (and command line scripts) in single quotes unless you NEED to use double quotes to expose your string's contents to the shell, e.g. to let the shell expand a variable.
Per shell rules you cannot include a single quote within a single-quote-delimited string such as 'foo'bar' (no amount of backslashes will escape that mid-string '). So you need to either jump back out of the single quotes to provide a single quote and then come back in, e.g. with 'foo'\''bar', or use the octal escape sequence \047 (do not use the hex equivalent as it is error prone) wherever you want a single quote, e.g. 'foo\047bar'. You then need to escape the ) twice: once for when awk converts the string to a regexp and again for when awk uses it as a regexp.
If you had been using double quotes around the string you'd have needed one additional escape for when shell parsed the string but that's not needed when you surround your string in single quotes since that is blocking the shell from parsing the string.
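The two ways of getting a single quote into a single-quoted string, and the double escaping of ), can be tried in isolation. This is a minimal sketch; the variable names and the sample input are made up for illustration.

```shell
# Way 1: close the quoted string, add an escaped quote, reopen the string.
s1='foo'\''bar'
# Way 2: let awk turn the octal escape \047 into a single quote.
s2=$(awk 'BEGIN{print "foo\047bar"}')
printf '%s\n%s\n' "$s1" "$s2"

# The ) is escaped twice: awk first reduces \\) in the string to \),
# and the resulting regexp \) then matches a literal ).
field2=$(printf '%s\n' "left')\">right" | awk -v FS='\047\\)">' '{print $2}')
printf '%s\n' "$field2"
```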

SED: replace string with another string

I am new to scripting and am looking for help replacing <buildWrappers/> with <buildWrappers> Sample text </buildWrappers>
To replace string with another string in sed:
sed 's/replace_this/replace_with/g'
# g replaces all occurrences of replace_this on each line
For your particular case:
sed 's#<buildWrappers/>#<buildWrappers> Sample text </buildWrappers>#g' file
To change the file in place:
sed --in-place ...
You will need to escape special chars (if any in sample text). Those special chars are: # (delimiter) & \
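One way to do that escaping automatically is to run the replacement text through sed first. The sketch below assumes the text is a single line and that # is the delimiter, as in the command above; escape_replacement and the sample text are hypothetical.

```shell
# Hypothetical helper: backslash-escape #, & and \ in replacement text.
escape_replacement() {
  printf '%s' "$1" | sed 's/[#&\]/\\&/g'
}

text='R&D build #1'
safe=$(escape_replacement "$text")

# Double quotes are needed here so the shell expands $safe.
result=$(printf '%s\n' '<buildWrappers/>' |
  sed "s#<buildWrappers/>#<buildWrappers>$safe</buildWrappers>#g")
printf '%s\n' "$result"
```

Without the escaping step, the & in the text would be replaced by the whole match and the # would end the replacement early.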

How to replace pipe instead of comma in a csv file

I want to convert a CSV file from comma-separated to pipe-separated (|). But some fields in the file contain commas themselves.
My file
$ cat a.txt
"a","b","c,test","new","abc"
Expecting:
a|b|c,test|new|abc
This sed command will do:
sed 's/","/\|/g; s/"//g' File
The first command replaces every "," sequence with |. That leaves a " at each end of the line, which the second command removes.
Sample:
AMD$ cat File
"a","b","c,test","new","abc"
AMD$ sed 's/","/\|/g; s/"//g' File
a|b|c,test|new|abc
sed ':cycle
s/^\(\("[^"]*"[|]\{0,1\}\)*\),/\1|/
t cycle' YourFile
Recursive POSIX version. Note that it keeps the quotes around each field.
It takes a shortcut with [|]\{0,1\}, assuming the input never contains "foo"|, or "foo",, (empty fields are written as "").
Another assumption here: there is no double quote inside a quoted string (not even an escaped one).
CSV can be tricky to get right by hand. I'd use a language with a proper CSV parser. For example, with ruby:
$ ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' a.txt
a|b|c,test|new|abc
That loops over the lines of the file, parses it into an array using the defaults (comma separated, double quotes as the quote character), then generates a new CSV string using pipe as the separator. If a field were to contain a pipe character, that field would be quoted.
This parser cannot handle embedded newlines in a quoted field. Perl's Text::CSV can.
$ awk -F'","' -v OFS='|' '{$1=$1; gsub(/"/,"")} 1' a.txt
a|b|c,test|new|abc
You can use perl in the following way:
cat a.txt | perl -ne 's/^"//; s/"$//; @items = split /","/; print join("|", @items);'

Using awk how do I reprint a found pattern with a new line character?

I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where "bcd" can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of ";...:" and then reprint this pattern without the initial ;
I concluded I would have to use awk's 'gsub' to do this, but have no idea how to replicate the pattern nor how to print the pattern again with this added new line character 1 character into my pattern.
Is this possible?
If not, can you please direct me in a way of tackling it?
We can't quite be sure of the variability in the aaa or bcd parts; presumably, each one could be almost anything.
You should probably be looking for:
a series of one or more non-colon, non-semicolon characters followed by colon,
with one or more repeats of:
a series of one or more non-colon, non-semicolon characters followed by a semi-colon
That makes up the unit you want to match.
/[^:;]+:([^:;]+;)+/
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
{
echo "aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;"
echo "aaz: xcd;ycd;bczdd;baa:bed;bid;bud;"
} |
awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
Example output
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
aaz: xcd;ycd;bczdd;
baa:bed;bid;bud;
Paraphrasing the question in a comment:
Why does the regular expression not include the characters before a colon (which is what it's intended to do, but I don't understand why)? I don't understand what "breaks" or ends the regex.
As I tried to explain at the top, you're looking for what we can call 'words', meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is [^:;]+, meaning one or more (+) of the negated character class — one or more non-colon, non-semicolon characters.
Let's pretend that spaces in a regex are not significant. We can space out the regex like this:
/ [^:;]+ : ( [^:;]+ ; ) + /
The slashes simply mark the ends, of course. The first cluster is a word; then there's a colon. Then there is a group enclosed in parentheses, tagged with a + at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What's inside the group? Well, a word followed by a semicolon. It doesn't have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a * in place of the +, of course.
The key to the regex stopping is that the aaa: in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the aaa: doesn't match the group. The gsub() therefore finds the first sequence, and replaces that text with the same material and a newline (that's the "&\n", of course). It (gsub()) then resumes its search directly after the end of the replacement material, and — lo and behold — there is a word followed by a colon and some words followed by semicolons, so there's a second match to be replaced with its original material plus a newline.
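To see where each match starts and stops, the matches can be printed one per line with match(), RSTART and RLENGTH instead of being replaced. This is only an illustrative sketch (list_matches is an invented name); it uses the same regex as the gsub() above.

```shell
# Hypothetical helper: print each non-overlapping match of the regex.
list_matches() {
  awk '{
    s = $0
    while (match(s, /[^:;]+:([^:;]+;)+/)) {
      print substr(s, RSTART, RLENGTH)   # the text of this match
      s = substr(s, RSTART + RLENGTH)    # continue after the match
    }
  }'
}

printf '%s\n' 'aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;' | list_matches
```

The first match ends at the semicolon before the second aaa: because that word is followed by a colon, not a semicolon.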
The record in $0 does not include the newline that ended the input line, but the gsub() appends a newline after every match, including the last one on the line. Therefore, without the sub() to remove trailing newlines, the print (implicitly of $0 followed by a newline) generated a blank line I didn't want in the output, so I removed the extraneous newline(s).
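A quick way to check where the extra newline comes from, assuming a POSIX awk, is to compare the length of $0 before and after the gsub(). The input line and function names below are made up for the check.

```shell
# Length of the record as read: the terminating newline is not counted.
len_before() {
  awk '{ print length($0) }'
}

# Length after the gsub(): one "\n" has been appended per match.
len_after() {
  awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); print length($0) }'
}

printf 'a:b;c:d;\n' | len_before   # prints 8
printf 'a:b;c:d;\n' | len_after    # prints 10 (two matches, two added newlines)
```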
This might work for you:
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1' file
Prepend a newline (\n) to every run of characters other than ; and : that is followed by a :
Remove any newline prepended to the start of line.
Replace any : followed by none or many spaces with a : followed by a single space.
Print all lines.
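The same steps, laid out one per line and run on the sample from the question, might look like this (a sketch; split_records is an invented name):

```shell
# Hypothetical wrapper: the one-liner above with one step per line.
split_records() {
  awk '{
    gsub(/[^;:]*:/, "\n&")   # newline before each "name:" part
    sub(/^\n/, "")           # drop the newline added at the very start
    gsub(/: */, ": ")        # exactly one space after every colon
  } 1'
}

printf '%s\n' 'aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;' | split_records
```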
Or this:
sed 's/;\([^;:]*: *\)/;\n\1 /g' file
Not sure how to do it in awk, but with sed this does what I think you want:
$ nl='
'
$ sed "s/\([^;]*:\)/\\${nl}\1/g" input
The first command sets the shell variable $nl to the string containing a single new line. Some versions of sed allow you to use \n inside the replacement string, but not all allow that. This keeps any whitespace that appears after the final ; and puts it at the start of the line. To get rid of that, you can do
$ sed "s/\([^;]*:\)/\\${nl}\1/g; s/\n */\\$nl/g" input
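Ordinary awk gsub() and sub() don't let you refer to parenthesized components of the match in the replacement string (only & for the whole match). GNU awk ("gawk") supplies gensub(), which does, and which returns the modified string instead of changing its target, e.g. gensub(/(;)([^;:]+:)/, "\\1\n\\2", "g").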