sed: cut out a substring following various patterns - awk

There is a list of identifiers I want to modify:
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
I want to remove everything from the first . up to (but not including) the first _ after DRAFT (case-insensitive), i.e.:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
I am using sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g' but it ignores the first _ after DRAFT and does this:
3300000526_2,
200254578_11,
3300000110_1,
3300000062_1,
3300000558_1
P.S.
The identifiers vary, and I have tried to show a small sample of that variation here, but they all follow the same pattern.
I'd be grateful for corrections!

You could easily do this in awk; please try the following, which is based only on the samples shown.
awk -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Or, with GNU awk, to match DRAFT case-insensitively:
awk -v IGNORECASE="1" -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
To handle case-insensitivity without the IGNORECASE option, try:
awk -F'[.]|[dD][rR][aA][fF][tT]_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Explanation: The field separator is set to either . or DRAFT_, as the OP needs. The main program sets the 2nd field to _, which rebuilds the line with spaces between the fields, and then replaces the "spaces, underscore, spaces" sequence with a single _. Finally, 1 prints the current line.
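As a quick check of the field-separator idea, here is a minimal sketch that runs a variant of the case-insensitive command on the sample identifiers. The function name shorten_ids is only an illustration, and instead of rebuilding the record it prints the first and last fields joined by an underscore, which is a slightly different route to the same output.

```shell
# Sample identifiers from the question, one per line.
ids='3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1'

# shorten_ids is a hypothetical helper name.
# Fields are split on "." or on any-case "DRAFT_"; lines that split into
# at least three fields are printed as first field, "_", last field.
# Lines that do not split that way are passed through unchanged.
shorten_ids() {
  awk -F'[.]|[dD][rR][aA][fF][tT]_' '
    NF >= 3 { print $1 "_" $NF; next }
    { print }
  '
}

printf '%s\n' "$ids" | shorten_ids
```

Printing $1 and $NF directly avoids the intermediate spaces that assigning to $2 introduces, so no clean-up sub() is needed.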

A workable solution
You can use:
sed 's/[.].*[dD][rR][aA][fF][tT]_/_/' data
You could also use \. in place of [.] but I'm allergic to unnecessary backslashes — you might be too if you'd had to spend time fighting whether 8 or 16 consecutive backslashes was the correct way to write to document markup (using troff).
For your sample data, it produces:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
What went wrong
Your command is:
sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g'
This matches:
any character (the leading .)
lower-case 'a'
any sequence of characters
an alphanumeric character
upper-case only DRAFT
underscore
any sequence of characters
a character that is not an underscore
underscore
an alphanumeric character
doing the match globally on each line
It then deletes all the matched material. You could rescue it by using:
sed 's/[.]a.*[a-zA-Z0-9]DRAFT\(_.*[^_]_[a-zA-Z0-9]\)/\1/'
This matches a dot rather than any character, and saves the material after DRAFT starting with the underscore (that's the \(…\)), replacing what was matched with what was saved (that's the \1). You can convert DRAFT to the case-insensitive pattern too, of course. However, the terms of the question refer to "from the first dot (.) up to (but not including) the underscore after a (case-insensitive) DRAFT", and detailing, saving, and replacing the material after the underscore is not necessary.
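To see that the capture-group rescue and the shorter substitution agree, here is a small sketch that runs both on two of the sample lines. The function names rescue and simple are made up for the illustration, and only lines with an upper-case DRAFT preceded by an alphanumeric character are used, since that is what the rescued pattern requires.

```shell
# Two sample lines from the question (upper-case DRAFT only).
sample='3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
3300000110.a:BSg2DRAFT_c10006505_1,'

# Hypothetical wrapper: the rescued command, which saves the text after
# DRAFT in \( ... \) and puts it back with \1.
rescue() {
  sed 's/[.]a.*[a-zA-Z0-9]DRAFT\(_.*[^_]_[a-zA-Z0-9]\)/\1/'
}

# Hypothetical wrapper: the shorter command, which stops at the underscore
# after DRAFT and so has nothing to save.
simple() {
  sed 's/[.].*[dD][rR][aA][fF][tT]_/_/'
}

printf '%s\n' "$sample" | rescue
printf '%s\n' "$sample" | simple
```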
Laziness
I saved myself typing the elaborate case-insensitive string by using a program called mkpattern that (according to RCS) I created on 1990-08-23 (a while ago now). It's not rocket science. I use it every so often — I've actually used it a number of times in the last week, searching for some elusive documents on a system at work.
$ mkpattern DRAFT
[dD][rR][aA][fF][tT]
$
You might have to do that longhand in future.
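The mkpattern program itself is not shown, so the following is only a guess at something with the same visible behaviour: a shell function that wraps each letter in a two-character bracket expression. Everything about it beyond the name and the one sample run is an assumption.

```shell
# Hypothetical stand-in for mkpattern (the real program is not shown).
# Each letter becomes [lowerUPPER]; any other character is copied as-is,
# so regex metacharacters in the input are NOT escaped.
mkpattern() {
  printf '%s\n' "$1" | awk '{
    out = ""
    for (i = 1; i <= length($0); i++) {
      c  = substr($0, i, 1)
      lo = tolower(c)
      up = toupper(c)
      if (lo != up)
        out = out "[" lo up "]"
      else
        out = out c
    }
    print out
  }'
}

mkpattern DRAFT
```

The result can be spliced into a command with command substitution, for example sed "s/[.].*$(mkpattern DRAFT)_/_/" data.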

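Try something like:
{mawk/mawk2/gawk} 'BEGIN { FS = "[\056].+DRAFT_"; OFS = "_"; } (NF < 2) || ($1 = $1)'
It might not be the fastest, but it's a relatively clean approach. Octal \056 is the period, which is less ambiguous to the reader when the next item is ".+". OFS is set to "_" so that the rebuilt record joins the two remaining fields with the underscore that the separator consumed. As written, it matches upper-case DRAFT only; lines without it are printed unchanged.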

This might work for you (GNU sed):
sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p' file
First, -n turns off implicit printing and -E enables extended regular expressions, which makes the regexp easier to read.
Since we want the first occurrence of DRAFT, we cannot use a regexp built on the .* idiom: it is greedy and may pass over the first occurrence if there are two or more. Thus we replace DRAFT (and anything up to the next underscore) with a unique character that cannot occur in the line. A newline can only be introduced by the programmer, so it is the best choice.
Now we can use the .* idiom as the introduced newline can only exist if the previous substitution has matched successfully.
N.B. The i flag in the first substitution allows for any upper/lower case rendition of the string DRAFT, also the second substitution includes the p flag to print the successful substitution.
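The difference only shows up when DRAFT occurs more than once, which none of the sample lines do. The sketch below uses a made-up identifier with two occurrences and compares a greedy substitution with the newline-marker approach; it assumes GNU sed, as the answer does.

```shell
# Hypothetical identifier in which DRAFT occurs twice.
line='123.a:x_DRAFT_y_DRAFT_9_1'

# Greedy: ".*" runs on to the LAST "DRAFT_" in the line.
greedy=$(printf '%s\n' "$line" | sed 's/[.].*DRAFT_/_/')

# Marker approach (GNU sed): turn the FIRST DRAFT into a newline,
# then delete from the first dot through that newline.
first=$(printf '%s\n' "$line" | sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p')

printf 'greedy: %s\nfirst:  %s\n' "$greedy" "$first"
```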

Related

extracting a specific word between : using sed, awk or grep

I have a file that has the following contents and many more.
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
My ideal output is shown below (i.e., the text between the first ":" and the second ":").
ab0820_1ab
I generally use python and use regular expression along the lines of below to get the result.
\s*set_property board_part trenz.biz:([a-zA-Z_0-9]+)
I wish to know how can it be done quickly and in a more generic way using commandline tools (sed, awk).
You might use GNU sed following way, let file.txt content be
#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]
set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]
garbage garbage garbage
then
sed -n '/^[[:space:]]*set_property board_part my[.]biz/ s/[^:]*:\([^:]*\):.*/\1/ p' file.txt
gives output
ab0820_1ab
Explanation: -n turns off default printing. /^[[:space:]]*set_property board_part my[.]biz/ is a so-called address: the commands that follow are applied only to lines matching it, which here means lines that start with optional whitespace followed by set_property (so the commented-out line is skipped). The first command is a substitution (s) that uses a capturing group, denoted by \( and \). The regular expression reads as follows: zero or more non-: characters (everything before the 1st :), then :, then zero or more non-: characters (everything between the 1st and 2nd :) enclosed in the capturing group, then : and zero or more of any character (everything after the 2nd :). The whole match is replaced by the content of the 1st (and, in this case, only) capturing group. After the substitution, the p command tells GNU sed to print the changed line.
(tested in GNU sed 4.2.2)
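Because the wanted text sits between the first and second colon, plain field splitting is another option. This is a different technique from the capture group above, sketched here with awk and a hypothetical function name; it assumes the active lines begin with set_property, optionally preceded by whitespace.

```shell
# extract_board_part is a hypothetical helper name.
# With ":" as the field separator, the text between the first and second
# colon is simply field 2. The pattern skips commented-out and other lines.
extract_board_part() {
  awk -F: '/^[[:space:]]*set_property board_part / { print $2 }' "$@"
}

printf '%s\n' \
  '#set_property board_part my.biz:ab0860_1cf:part0:1.0 [current_project]' \
  'set_property board_part my.biz:ab0820_1ab:part0:1.0 [current_project]' \
  'garbage garbage garbage' | extract_board_part
```

Called with a file name, as in extract_board_part file.txt, it reads that file instead of standard input.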
Your example data has my.biz, but your pattern tries to match trenz.biz.
If GNU awk is available, you can use a capture group with match() and then print its value, which is available in a[1]:
awk 'match($0, /^\s*set_property board_part \w+\.biz:(\w+)/, a) {print a[1]}' file
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
set_property board_part Match literally
\w+\.biz: Match 1+ word chars followed by .biz (note to escape the dot to match it literally)
(\w+) Capture group 1, match 1+ word chars
Notes
If you just want to match trenz.biz then you can replace \w+\.biz with trenz\.biz
If the strings are not at the start of the line, you can change ^ to \s to match a whitespace char instead
Output
ab0820_1ab

Print out a line from a .txt file that has a keyword with parentheses in it

I'm trying to print out lines of a bunch of output files that contain the characters "g(tot)" in them.
awk '/g(tot)/{print}' ./*/*.out
However, this is not printing anything, and it seems to be due to the parentheses around the "tot". How can I get around this?
( and ) are interpreted as special characters in a regular expression.
Escape ( and ) with a \:
awk '/g\(tot\)/{print}' ./*/*.out
( and ) are special characters in a regex; they are used to group things:
(…)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, ‘|’. For example, ‘#(samp|code){[^}]+}’ matches both ‘#code{foo}’ and ‘#samp{bar}’. (These are Texinfo formatting control sequences. The ‘+’ is explained further on in this list.)
So /g(tot)/ actually matches gtot not g(tot).
You need to escape it, like this /g\(tot\)/.
Also you can remove the part {print}, it's implied after a condition, so that sums to:
awk '/g\(tot\)/' ./*/*.out
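However, for this simple task, I would suggest using grep instead. With -F the pattern is a fixed string, so nothing needs escaping:
grep -F 'g(tot)' ./*/*.out
And you can use sed too. In sed's default basic regular expressions, plain ( and ) are literal, and it is \( and \) that form a group:
sed -n '/g(tot)/p' ./*/*.out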
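A two-line test input makes the grouping behaviour easy to see. The sample text below is invented for the illustration; the three awk patterns are the unescaped form, the backslash-escaped form, and an equivalent bracket-expression form.

```shell
# Made-up input: one line with literal parentheses, one without.
sample='g(tot) = 1.5
gtot = 2.5'

# Unescaped: ( ) only group, so this pattern means "gtot".
grouped=$(printf '%s\n' "$sample" | awk '/g(tot)/')

# Escaped: matches the literal text "g(tot)".
escaped=$(printf '%s\n' "$sample" | awk '/g\(tot\)/')

# Bracket expressions are another way to make ( and ) literal.
bracket=$(printf '%s\n' "$sample" | awk '/g[(]tot[)]/')

printf 'grouped: %s\nescaped: %s\nbracket: %s\n' "$grouped" "$escaped" "$bracket"
```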

awk statement within sed

I have multiple occurrences of the pattern:
)0.[0-9][0-9][0-9]:
where [0-9] is any digit, in various text context but the pattern is unique as this regex. And I need to turn the decimal fraction into integer (percent values from 0 to 99).
A small example substring would be
=1:0.00055)0.944:0.02762)0.760:0 to turn into
=1:0.00055)94:0.02762)76:0
What I’m doing is :
cat file | sed -e "s/)\([0-9].[0-9][0-9][0-9]\):/)`echo "\1"|awk '{ r=int(100*$0); if((r>=0)&&(r<=100)){ print r; } else { print "error"; exit(-1); } }'`:/g"
but the output is )0:
where is the fault?...
Since you asked 'where is the fault' and not 'how to solve the problem':
Your backquoted pipeline echo ...|awk ... is executed FIRST, producing a single 0 which is then made part of the s/// command passed to sed and thus substituted everywhere the pattern matches. PS: using the newer (post-Reagan) and more flexible notation for command substitution $( ... ) instead of backquotes is preferred in all shells except csh family, and especially on Stack where backquotes are special to markdown and troublesome to show in text.
If you want to actually solve the problem, which you didn't describe clearly or completely, some pointers toward a better direction:
Standard sed can't execute a command to generate replacement text; GNU sed can with flag e but you need to make the whole patternspace the command and fiddle anything else into holdspace, which is tedious. perl can evaluate an expression in the replacement for s, including arithmetic; awk (even gawk) can't do so directly, but you can get the same effect by doing the match and the replace/rebuild as separate steps, depending on the unspecified and unclear details of exactly what you want to do; if you want to keep the rest of the line unchanged, something like:
awk 'match($0,/[)]0[.][0-9][0-9][0-9]:/){ print substr($0,1,RSTART) int(substr($0,RSTART+1,RLENGTH-2)*100) substr($0,RSTART+RLENGTH-1) }'
But you don't actually need arithmetic here if you're satisfied with truncating. Just discard the leading 0. and the last digit and keep the two digits in between (a value such as 0.055 then comes out as 05):
sed 's/)0[.]\([0-9][0-9]\)[0-9]:/)\1:/g'
Note . in regexp unless escaped or in a charclass (as I did) matches any character not just period, which may or may not be a problem since you didn't give the rest of your input.
And PS: negative numbers for process exit status don't work (except IIRC Plan 9). Use small (usually < 128) positive status values for errors; most common is to just use 1.
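For completeness, here is a sketch of the "match, then rebuild" idea extended to every occurrence on a line, using only standard awk. The function name to_percent is hypothetical, and the sketch assumes truncation is wanted; it divides the three digits by 10 as an integer rather than multiplying a fraction by 100, to stay clear of floating-point rounding.

```shell
# to_percent is a hypothetical helper name.
# Rewrites every ")0.ddd:" as ")N:" where N is the truncated percentage.
to_percent() {
  awk '{
    out  = ""
    rest = $0
    # Consume the line match by match; text before each match is copied.
    while (match(rest, /[)]0[.][0-9][0-9][0-9]:/)) {
      digits = substr(rest, RSTART + 3, 3)          # the three digits after "0."
      out  = out substr(rest, 1, RSTART) int(digits / 10) ":"
      rest = substr(rest, RSTART + RLENGTH)         # continue after the match
    }
    print out rest
  }' "$@"
}

printf '%s\n' '=1:0.00055)0.944:0.02762)0.760:0' | to_percent
```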
Check this perl one-liner command :
perl -pe 's/\)(\d+\.\d+):/sprintf ")%d:", $1 * 100/ge' file
Before :
=1:0.00055)0.944:0.02762)0.760:0
After :
=1:0.00055)94:0.02762)76:0
If you need to replace for real in editing mode, add -i switch :
perl -i -pe '...'

Awk understanding variables

What do the words (probably variables?) like NF, RF, FS mean in awk? I believe they have some predefined meaning and usage.
Please let me know if there are more such variables that a beginner must know.
-Thanks
These are built-in variables, in awk they have special meaning.
The GAWK reference manual has a section covering this topic.
In particular:
FS:
This is the input field separator (see Field Separators). The value
is a single-character string or a multi-character regular expression
that matches the separations between fields in an input record. If the
value is the null string (""), then each character in the record
becomes a separate field. (This behavior is a gawk extension. POSIX
awk does not specify the behavior when FS is the null string.
Nonetheless, some other versions of awk also treat "" specially.)
The default value is " ", a string consisting of a single space. As a
special exception, this value means that any sequence of spaces, TABs,
and/or newlines is a single separator. It also causes spaces, TABs,
and newlines at the beginning and end of a record to be ignored.
NF:
The number of fields in the current input record. NF is set each
time a new record is read, when a new field is created or when $0
changes (see Fields).
Unlike most of the variables described in this section, assigning a
value to NF has the potential to affect awk's internal workings. In
particular, assignments to NF can be used to create or remove fields
from the current record. See Changing Fields.
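A short sketch may help make FS and NF concrete. The input lines and the function name describe_fields are invented for the example; NR, the running record number, is another standard built-in that is not quoted above.

```shell
# describe_fields is a hypothetical helper name.
# -F: sets FS to ":"; NF is the field count of the current record,
# NR is the record number, and $NF is the last field.
describe_fields() {
  awk -F: '{ print NR ": " NF " fields, last = " $NF }'
}

printf '%s\n' 'alice:30:paris' 'bob:25' | describe_fields
```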

Using awk how do I reprint a found pattern with a new line character?

I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where "bcd" can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of ";...:" and then reprint this pattern without the initial ;
I concluded I would have to use awk's 'gsub' to do this, but have no idea how to replicate the pattern nor how to print the pattern again with this added new line character 1 character into my pattern.
Is this possible?
If not, can you please direct me in a way of tackling it?
We can't quite be sure of the variability in the aaa or bcd parts; presumably, each one could be almost anything.
You should probably be looking for:
a series of one or more non-colon, non-semicolon characters followed by colon,
with one or more repeats of:
a series of one or more non-colon, non-semicolon characters followed by a semi-colon
That makes up the unit you want to match.
/[^:;]+:([^:;]+;)+/
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
{
echo "aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;"
echo "aaz: xcd;ycd;bczdd;baa:bed;bid;bud;"
} |
awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
Example output
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
aaz: xcd;ycd;bczdd;
baa:bed;bid;bud;
Paraphrasing the question in a comment:
Why does the regular expression not include the characters before a colon (which is what it's intended to do, but I don't understand why)? I don't understand what "breaks" or ends the regex.
As I tried to explain at the top, you're looking for what we can call 'words', meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is [^:;]+, meaning one or more (+) of the negated character class — one or more non-colon, non-semicolon characters.
Let's pretend that spaces in a regex are not significant. We can space out the regex like this:
/ [^:;]+ : ( [^:;]+ ; ) + /
The slashes simply mark the ends, of course. The first cluster is a word; then there's a colon. Then there is a group enclosed in parentheses, tagged with a + at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What's inside the group? Well, a word followed by a semicolon. It doesn't have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a * in place of the +, of course.
The key to the regex stopping is that the aaa: in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the aaa: doesn't match the group. The gsub() therefore finds the first sequence, and replaces that text with the same material and a newline (that's the "&\n", of course). It (gsub()) then resumes its search directly after the end of the replacement material, and — lo and behold — there is a word followed by a colon and some words followed by semicolons, so there's a second match to be replaced with its original material plus a newline.
$0 itself does not contain the newline that ends the line; awk strips the record separator when it reads the record. The extra newline comes from the gsub(): the last match on each line runs to the end of the record, so "&\n" leaves a newline at the very end of $0. Without the sub() to remove trailing newlines, the print (implicitly of $0, followed by a newline) generated a blank line I didn't want in the output, so I removed the extraneous newline(s).
This might work for you:
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1' file
Prepend a newline (\n) to any string not containing a ; or a : followed by a :
Remove any newline prepended to the start of line.
Replace any : followed by none or many spaces with a : followed by a single space.
Print all lines.
Or this:
sed 's/;\([^;:]*: *\)/;\n\1 /g' file
Not sure how to do it in awk, but with sed this does what I think you want:
$ nl='
'
$ sed "s/\([^;]*:\)/\\${nl}\1/g" input
The first command sets the shell variable $nl to the string containing a single new line. Some versions of sed allow you to use \n inside the replacement string, but not all allow that. This keeps any whitespace that appears after the final ; and puts it at the start of the line. To get rid of that, you can do
$ sed "s/\([^;]*:\)/\\${nl}\1/g; s/\n */\\$nl/g" input
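Ordinary awk gsub() and sub() don't allow you to refer to captured components in the replacement string. GNU awk ("gawk") supplies gensub(), which does; for example, gensub(/;([^;:]+:)/, ";\n\\1", "g") puts a newline after each ; that is followed by a label ending in :. Note that gensub() returns the modified string rather than changing its target in place.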