Match regexp at the end of the string with AWK - awk

I am trying to match two different regexps against long strings with awk, removing the part of the string that matches within a 35-character window.
The problem is that the same piece of code works when I look for the first regexp (which matches at the beginning) but fails to match the second one (at the end of the string).
Input:
Regexp1(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)Regexp2
Desired output
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
So far I have used the code below, which extracts Regexp1 correctly but, unfortunately, fails to extract Regexp2 because the RSTART and RLENGTH indexes for Regexp2 are incorrect.
Code for extracting Regexp1 (correct output):
awk -v F="Regexp1" '{if (match(substr($1,1,35),F)) print substr($1,RSTART,RLENGTH)}' file
Code for extracting Regexp2 (wrong output)
awk -v F="Regexp2" '{if (match(substr($1,length($1)-35,35),F)) print substr($1,RSTART,RLENGTH)}' file
Although the indexes for Regexp1 are correct, the indexes for Regexp2 are wrong (RSTART=13). I cannot figure out how to extract the second Regexp.

Assuming your actual Input_file looks like the samples shown, please try the following. (A recent version of awk is preferable, since older versions may not support interval expressions such as {5} in a regex.)
awk '
match($0,/(\([0-9]+\)){5}.*(\([0-9]+\)){4}/){
print substr($0,RSTART,RLENGTH)
}' Input_file
If the number of parenthesized values is not fixed, you could do the following:
awk '
match($0,/(\([0-9]+\))+.*(\([0-9]+\))+/){
print substr($0,RSTART,RLENGTH)
}' Input_file

If this isn't all you need:
$ sed 's/Regexp1\(.*\)Regexp2/\1/' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
or using GNU awk for gensub():
$ awk '{print gensub(/Regexp1(.*)Regexp2/,"\\1",1)}' file
(1)(2)(3)(4)(5)xxxxxxxxxxxxxxx(20)(21)(22)(23)
then edit your question to be far clearer with your requirements and example.
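On the RSTART value in the question: match() was run on the result of substr(), so RSTART is counted from the start of that 35-character window, not from the start of $1. Also, substr($1,length($1)-35,35) stops one character short of the end of the string. Below is a minimal sketch of one way to correct both. The function name tail_match and the variables W and off are made up for illustration.

```shell
# tail_match REGEX WIDTH (hypothetical helper): print the part of each input
# line's first field that matches REGEX within its last WIDTH characters.
tail_match() {
    awk -v F="$1" -v W="$2" '{
        off = length($1) - W + 1      # index in $1 where the window starts
        if (off < 1) off = 1          # field is shorter than the window
        # RSTART is relative to the window, so shift it back by off - 1
        if (match(substr($1, off), F))
            print substr($1, off + RSTART - 1, RLENGTH)
    }'
}

printf '%s\n' 'Regexp1(1)(2)(3)(22)(23)Regexp2' | tail_match 'Regexp2' 35
```

The same shift (off + RSTART - 1) applies whenever match() is run on a substring rather than on the whole field.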

Related

replace new line with a space if next line starts with a word character

I have a large text file that looks like
some random : demo text for
illustration, can be long
and : some more
here is : another
one
I want an output like
some random : demo text for illustration, can be long
and : some more
here is : another one
I tried some strange, obviously faulty regex like %s/\w*\n/ /g but can't really get my head around it.
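With your shown samples, please try the following awk code. It sets RS (the record separator) to the empty string, which puts awk in paragraph mode: records are then separated by blank lines, and $1=$1 rebuilds each record with single spaces between its fields. This is based on your shown samples only.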
awk -v RS="" '{$1=$1} 1' Input_file
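A small sketch of the paragraph-mode behavior, assuming the groups in the input are separated by blank lines (the sample as reproduced here does not show them). The function name join_paragraphs is made up for illustration.

```shell
# join_paragraphs (hypothetical helper): collapse each blank-line-separated
# group of lines on stdin into a single line.
join_paragraphs() {
    # RS="" selects paragraph mode; newlines inside a record act as field
    # separators, and $1=$1 rebuilds the record using OFS (a space).
    awk -v RS="" '{$1=$1} 1'
}

printf 'some random : demo text for\nillustration, can be long\n\nand : some more\n\nhere is : another\none\n' |
join_paragraphs
```

Without blank lines between the groups the whole file would be read as one record and end up on a single line.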
Adding more solutions in case someone is looking for a way to do this with awk's printf function. The 1st solution above should be preferred, IMHO; these are only alternatives.
2nd solution: This checks whether a line starts with a letter and, if the previous line did too, joins it to the previous line.
awk '{printf("%s%s",$0~/^[a-zA-Z]/?(FNR>1 && prev~/^[a-zA-Z]/?OFS:""):ORS,$0);prev=$0} END{print ""}' Input_file
3rd solution: Note: This will work only if each record's first line contains a colon, as in the shown samples.
awk '{printf("%s%s",$0~/:/?(FNR>1?ORS:""):OFS,$0)} END{print ""}' Input_file
Explanation: This uses awk's printf function. If the current line contains : and its line number (FNR) is greater than 1, ORS is printed before it; if it is the first line, nothing is printed before it. If the line doesn't contain :, OFS is printed before it. The END block of this program prints a final newline.

Replacing columns of a CSV with a string using awk and gsub

I have an input csv file that looks something like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,4090b43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,40d0ced,Central
Eric,50,one-north,aeff08d,Military aeff08d
David,60,Bedok,5d1152d,Study
And I want to write a bash shell script using awk and gsub to replace 6-7 alpha numeric character long strings under the ID column with "xxxxx", with the output in a separate .csv file.
Right now I've got:
#!/bin/bash
awk -F ',' -v OFS=',' '{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", $4);}1' input.csv > output.csv
But the output I'm getting from running bash myscript.sh input.csv doesn't make any sense. The output.csv file looks like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,4xxxxx9xxxxxb43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,4xxxxxdxxxxxced,Central
Eric,50,one-north,aeffxxxxx8d,Military aeff08d
David,60,Bedok,5d1152d,Study
but the expected output csv should look like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,xxxxx,Production 4090b43
Scott,20,Bedok,xxxxx,Prevent
Ronald,30,one-north,xxxxx,Difference 86defac
Cindy,40,Punggol,xxxxx,Central
Eric,50,one-north,xxxxx,Military aeff08d
David,60,Bedok,xxxxx,Study
With your shown sample, please try the following code:
awk -F ',[[:space:]]+' -v OFS=',\t' '
{
sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
$1=$1
}
1
' Input_file | column -t -s $'\t'
Explanation: The field separator is set to a comma followed by whitespace, and the output field separator to a comma followed by a tab. Then, in the 4th field, a value consisting entirely of 6 to 7 alphanumeric characters (from start to end) is substituted with xxxxx. Finally the current line is printed. The output of the awk program is then sent to the column command to align it as in the OP's shown sample.
EDIT: In case your Input_file is separated by , only, as in the edited samples, then try the following.
awk -F ',' -v OFS=',' '
{
sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
}
1
' Input_file
Note: The OP upgraded from an older version of awk to the latest one, after which this code worked.
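For an awk that does not support {6,7} interval expressions, the same condition can be sketched without them by testing the field's length and characters separately. This is only an alternative under that assumption, not the code from the answer above. The function name mask_id is made up for illustration.

```shell
# mask_id (hypothetical helper): replace a 6-7 character alphanumeric value
# in column 4 with "xxxxx", skipping the header line.
mask_id() {
    awk -F',' -v OFS=',' '
    FNR > 1 && length($4) >= 6 && length($4) <= 7 && $4 !~ /[^a-zA-Z0-9]/ {
        $4 = "xxxxx"        # assigning to a field rebuilds $0 with OFS
    }
    1' "$@"
}

printf 'Name,Index,Location,ID,Message\nAlexis,10,Punggol,4090b43,Production 4090b43\n' | mask_id
```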
The short answer would be the following:
$ awk 'BEGIN{FS=OFS=","}(FNR>1){$4="xxxxxx"}1' file
This will replace all entries in column 4 by "xxxxxx".
If you only want to change the first 6 to 7 characters of column 4 (and not if there are only 5 of them), there are a couple of ways:
$ awk 'BEGIN{FS=OFS=","}(FNR>1)&&(length($4)>5){$4="xxxxxx" substr($4,8)}1' file
$ awk 'BEGIN{FS=OFS=","}(FNR>1){sub(/.......?/,"xxxxxx",$4)}1' file
Here, 123456abcde will be replaced by xxxxxxbcde (the first seven characters are replaced).
Why is your script failing:
Besides the fact that the approach is wrong, I'll try to explain what the following command does: gsub(/^([a-zA-Z0-9]){6,7}/g,"xxxxx",$4)
The notation /abc/g is valid awk syntax, but it does not do what you expect it to do. The notation /abc/ is an ERE token (an extended regular expression). The notation g is, at this point, nothing more than an undefined variable which defaults to an empty string or zero, depending on its usage. awk will now try to execute the operation /abc/g by first executing /abc/, which means: if my current record ($0) matches the regular expression "abc", return 1, otherwise return 0. So it converts /abc/g into 0g, which means to concatenate the content of g to the number 0. For this, it will convert the number 0 to a string "0" and concatenate it with the empty string g. In the end, your gsub command is equivalent to gsub("0","xxxxx",$4) and means to replace all the zeros by "xxxxx".
Why are you always getting gsub("0","xxxxx",$4) and never gsub("1","xxxxx",$4)? The reason is that your initial regular expression never matches anything in the full record/line ($0). Your regular expression reads /^([a-zA-Z0-9]){6,7}/, and while there are lines that start with 6 or 7 characters, it is likely that your awk does not recognize the extended regular expression notation '{m,n}', which makes it fail. If you use GNU awk, the output would be different when using --re-interval, which in old versions of GNU awk is not enabled by default.
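The evaluation described above can be observed directly. This is a minimal sketch, assuming an awk that accepts /abc/g as a concatenation (as the gawk used elsewhere in this thread does). The function name regex_then_g is made up for illustration.

```shell
# regex_then_g LINE (hypothetical helper): print the value awk computes
# for the expression /abc/g when the current record is LINE.
regex_then_g() {
    # /abc/ yields 1 or 0 for the match against $0; g is an unset variable,
    # so concatenating it leaves the string "1" or "0".
    printf '%s\n' "$1" | awk '{ v = /abc/g; print "[" v "]" }'
}

regex_then_g 'xabcx'
regex_then_g 'xyz'

# With a match result of 0, the gsub call from the question behaves like:
awk 'BEGIN { id = "4090b43"; gsub("0", "xxxxx", id); print id }'
```

The last command reproduces the 4xxxxx9xxxxxb43 value seen in the question's output.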
I tried to find out why your code behaves like that. For simplicity's sake I made an example concerning only the gsub you used:
awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", id);print id}' emptyfile.txt
output is
4xxxxx9xxxxxb43
after removing g in first argument
awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/, "xxxxx", id);print id}' emptyfile.txt
output is
xxxxx
So the regular expression followed by g caused the malfunction. I was unable to find a relevant passage in the GNU AWK manual describing what g after / is supposed to do.
(tested in gawk 4.2.1)

right pad regex with spaces using sed or awk

I have a file with two fields separated by :. Both fields vary in length, and the second field can contain all sorts of characters (user input). I want the first field to be right-padded with spaces to a fixed length of 15 characters; for the first field I have a working regex #.[A-Z0-9]{4,12}.
sample:
#ABC123:"wild things here"
#7X3Z:"":":#":";:*:-user input:""
#99999X999:"also, imagine: unicode, yay!"
desired output:
#ABC123        :"wild things here"
#7X3Z          :"":":#":";:*:-user input:""
#99999X999     :"also, imagine: unicode, yay!"
There are plenty of examples of how to zero-pad a number, but surprisingly few about padding a regex match or a field in general. Any help using (preferably) sed or awk?
Here is another awk solution that would work with any version of awk:
awk 'BEGIN {FS=OFS=":"} {$1 = sprintf("%-15s", $1)} 1' file
#ABC123        :"wild things here"
#7X3Z          :"":":#":";:*:-user input:""
#99999X999     :"also, imagine: unicode, yay!"
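If the width needs to vary, the format string can be built by concatenation. This is a sketch of the same sprintf approach with the width passed in as a variable. The function name pad_first_field and the variable w are made up for illustration.

```shell
# pad_first_field WIDTH (hypothetical helper): left-justify the first
# colon-separated field of each stdin line in a column WIDTH characters wide.
pad_first_field() {
    awk -v w="$1" 'BEGIN {FS=OFS=":"} {
        # "%-" w "s" becomes e.g. "%-15s"; assigning $1 rebuilds $0 with OFS,
        # so colons inside the second field are preserved.
        $1 = sprintf("%-" w "s", $1)
    } 1'
}

printf '%s\n' '#7X3Z:"":":#":";:*:-user input:""' | pad_first_field 15
```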
With perl:
$ perl -pe 's/^[^:]+/sprintf("%-15s",$&)/e' ip.txt
#ABC123        :"wild things here"
#7X3Z          :"":":#":";:*:-user input:""
#99999X999     :"also, imagine: unicode, yay!"
The e flag allows you to use Perl code in replacement section. $& will have the matched portion which gets formatted by sprintf.
With awk:
# should work with any awk
awk 'match($0, /^[^:]+/){printf "%-15s%s\n", substr($0,1,RLENGTH), substr($0,RLENGTH+1)}'
# can be simplified with GNU awk
awk 'match($0, /^[^:]+/, m){printf "%-15s%s\n", m[0], substr($0,RLENGTH+1)}'
# or
awk 'match($0, /^([^:]+)(.+)/, m){printf "%-15s%s\n", m[1], m[2]}'
substr($0,1,RLENGTH) or m[0] will give contents of first field. I have used 1 instead of the usual RSTART here since we are matching start of line
substr($0,RLENGTH+1) will give rest of the line contents (i.e. from the first :)
See awk manual: String-Manipulation for details about match function.
Adding one more way of padding the 1st column here; anubhava's answer with sprintf is the better one, this is just an option. Here I have created a variable named spaces that defines the width the first field should be padded to.
awk -v spaces="15" 'BEGIN{FS=OFS=":"} {sub(/:/,sprintf("%" (spaces-length($1)+1) "s",":"))} 1' Input_file
Explanation: Adding detailed explanation for above.
awk -v spaces="15" ' ##Starting awk program from here, setting spaces to 15 here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS=":" ##Setting FS and OFS as colon here.
}
{
sub(/:/,sprintf("%" (spaces-length($1)+1) "s",":")) ##Substituting the first colon with a right-justified colon, i.e. the missing spaces followed by the colon (the +1 accounts for the colon itself).
}
1 ##Printing current line here.
' Input_file ##Mentioning Input_file name here.
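I believe anubhava's solution of
awk 'BEGIN {FS=OFS=":"} {$1 = sprintf("%-15s", $1)} 1' file
can be simplified even further as:
awk -F: 'BEGIN{OFS=FS} $1=sprintf("%-15s",$1)'
The { } and the final 1 are optional here because the assignment itself serves as the pattern: its value is a non-empty string, which is always true, so the default action (print) runs.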

What does this Awk expression mean

I am working with bash script that has this command in it.
awk -F ‘‘ ‘/abc/{print $3}’|xargs
What is the meaning of this command?? Assume input is provided to awk.
The quick answer is it'll do different things depending on the version of awk you're running and how many fields of output the awk script produces.
I assume you meant to write:
awk -F '' '/abc/{print $3}'|xargs
not the syntactically invalid (due to "smart quotes"):
awk -F ‘’’/abc/{print $3}’|xargs
-F '' is undefined behavior per POSIX, so what it will do depends on the version of awk you're running. In some awks it'll split the current line into 1 character per field. In others it'll be ignored and the line will be split into fields at every sequence of white space. In other awks still it could do anything else.
/abc/ looks for a string matching the regexp abc on the current line and if found invokes the subsequent action, in this case {print $3}.
However it's split into fields, print $3 will print the 3rd such field.
xargs as used will just print chunks of the multi-line input it's getting all on 1 line so you could get 1 line of all-fields output if you don't have many fields being output or several lines of multi-field output if you do.
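A quick illustration of that xargs behavior: with no command given, xargs runs echo on the words it reads, so a short multi-line input comes back on one line.

```shell
# Three input lines are joined into a single space-separated line.
printf 'a\nb\nc\n' | xargs
```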
I suspect the intent of that code was to do what this code actually will do in any awk alone:
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
e.g.:
$ printf 'foo\nxabc\nyzabc\nbar\n' |
awk '/abc/{printf "%s%s", sep, substr($0,3,1); sep=OFS} END{print ""}'
b a

Output field separators in awk after substitution in fields

Is it always the case, after modifying a specific field in awk, that information on the output field separator is lost? What happens if there are multiple field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script, which takes account of the input field separator, that prints each line in my file, such as running
awk -F: '{print $0}' example
I will see the original line. If however I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line, rather I see the fields separated by the default whitespace separator, i.e:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
However, in the case where there are multiple potential field separators, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u
You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS matching strings because of how expensive it'd be in time and memory to store the string that matches FS every time you split a record into fields. Instead the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That is the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and gawk providers before all agreeing that this was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
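Where GNU awk is not available, the separators can be recorded by hand with match(). The following is only a sketch of that idea, not something taken from the gawk manual. The function name keep_seps and the arrays fld and sep are made up for illustration, and the separator set [:;] is hard-coded.

```shell
# keep_seps (hypothetical helper): append "!" to the 2nd field of each stdin
# line while keeping every original ":" or ";" separator in place.
keep_seps() {
    awk '{
        rest = $0; n = 0
        # Peel off one field and the separator that follows it per iteration.
        while (match(rest, /[:;]/)) {
            fld[++n] = substr(rest, 1, RSTART - 1)
            sep[n]   = substr(rest, RSTART, RLENGTH)
            rest     = substr(rest, RSTART + RLENGTH)
        }
        fld[++n] = rest; sep[n] = ""       # last field has no separator
        if (n >= 2) fld[2] = fld[2] "!"    # the modification from the question
        out = ""
        for (i = 1; i <= n; i++) out = out fld[i] sep[i]
        print out
    }'
}

printf 'a:e;i:o;u\n' | keep_seps
```

This trades the convenience of $2 and OFS for explicit arrays, which is roughly what the 4th argument to split() does for you in gawk.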