Replace string in upper or lower case with Awk - awk

How can I take a string like this:
sample="+TEST/TEST01/filetest01.txt"
And replace all occurrences of test01/TEST01 with test02/TEST02, keeping the text in the same case, so that the desired output would be:
"+TEST/TEST02/filetest02.txt"
If you were to pass the replacement string TEST03, then the desired output would be:
"+TEST/TEST03/filetest03.txt"
If the replacement text was Test04, the desired output would be:
"+TEST/TEST04/filetest04.txt"
I've tried this:
echo "$sample" | awk 'BEGIN{IGNORECASE=1}{gsub("test01", "test02");print}'
It replaces the lower case value but not the upper case.
I cannot use sed as the version I have doesn't support the /I switch to ignore case.
My end goal is to be able to use variables for the item to change, like this:
text2replace=test01
replacetext=test02

Try this using GNU awk (gawk), since IGNORECASE and gensub() are gawk extensions:
echo "$sample" | gawk 'BEGIN{IGNORECASE=1}{print gensub("test01", "test02", "g")}'
Output
+TEST/test02/filetest02.txt
As a last resort, lowercase everything first:
echo "$sample" |
tr '[[:upper:]]' '[[:lower:]]' |
awk '{gsub("test01", "test02");print}'
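If gawk is unavailable, here is a sketch of the OP's variable-driven end goal in portable POSIX awk: lowercase a copy of the line and use index() to find each occurrence, so old and new are treated as fixed strings (not regexes). The helper name ci_replace is hypothetical:

```shell
text2replace=test01
replacetext=test02

# hypothetical helper: case-insensitive fixed-string replacement, any POSIX awk
ci_replace() {
  awk -v old="$1" -v new="$2" '{
    out = ""; lo = tolower($0); olo = tolower(old)
    # walk the line, replacing every case-insensitive occurrence of old
    while ((i = index(lo, olo)) > 0) {
      out = out substr($0, 1, i - 1) new
      $0  = substr($0, i + length(old))
      lo  = substr(lo, i + length(old))
    }
    print out $0
  }'
}

echo "+TEST/TEST01/filetest01.txt" | ci_replace "$text2replace" "$replacetext"
# -> +TEST/test02/filetest02.txt
```

Note that, like the gsub answer, the replacement is inserted literally, so it does not restore the original case of the matched text.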

perl is good for this
$ perl -pe 's/test\K01/02/ig' <<< "+TEST/TEST01/filetest01.txt"
+TEST/TEST02/filetest02.txt
The \K directive instructs the regex engine to match what is on the left-hand side of it and then forget about it. It acts to position the "cursor" to the start of "01" only when it is preceded by "test".
I'm also using the i flag for case-insensitive matching.
More generally, if you're looking to increment the digits following "test" case-insensitively (and zero-pad to the same width):
perl -pe 's/test\K(\d+)/ sprintf "%0*d", length($1), $1+1 /eig' <<INPUT
+TEST/TEST01234/filetest00009.txt
INPUT
+TEST/TEST01235/filetest00010.txt

You say you don't have GNU sed with its I flag, but you can do it with POSIX sed:
$ sed 's/\([Tt][Ee][Ss][Tt]0\)1/\12/g' <<< '+TEST/TEST01/filetest01.txt'
+TEST/TEST02/filetest02.txt
[Tt] is the poor man's case-insensitive match for T or t; the case is preserved by using a capture group.
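The bracket-expression trick can also be generated from a variable, which serves the original variable-based end goal. A sketch for a POSIX shell (ci_pattern is a hypothetical helper name; it assumes the input contains no sed metacharacters):

```shell
# hypothetical helper: turn e.g. test01 into [Tt][Ee][Ss][Tt]01
ci_pattern() {
  str=$1 out=""
  while [ -n "$str" ]; do
    c=${str%"${str#?}"}    # first character
    str=${str#?}           # rest of the string
    case $c in
      [A-Za-z])
        u=$(printf '%s' "$c" | tr '[:lower:]' '[:upper:]')
        l=$(printf '%s' "$c" | tr '[:upper:]' '[:lower:]')
        out="$out[$u$l]" ;;
      *) out="$out$c" ;;
    esac
  done
  printf '%s\n' "$out"
}

echo '+TEST/TEST01/filetest01.txt' | sed "s/$(ci_pattern test01)/test02/g"
# -> +TEST/test02/filetest02.txt
```

Like gsub, this inserts the replacement literally, so it matches case-insensitively but does not preserve the original case.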

Related

Create a PCRE expression that matches exactly the strings described: strings that have exactly 3 vowels, in any position

the vowels do not have to appear consecutively. The string bonfire should
match, but beak and yearbook should not.
I tried using
grep -P '([aeiouy][aeiouy]*){3}'
but it only seems to work when the vowels are consecutive.
This job can be done easily with awk without writing an explicit match regex: just set the field separator to the vowel class, then require exactly 4 fields (3 separators means 3 vowels):
awk -F '[aeiouy]' 'NF == 4' file
PS: y is not really a vowel but included because of your shown example.
Use [^aeiouy] to match non-vowels. Put a sequence of these around each vowel pattern.
grep -x '[^aeiouy]*[aeiouy][^aeiouy]*[aeiouy][^aeiouy]*[aeiouy][^aeiouy]*' filename
Use the -x option to match the whole line, or anchor the regexp with ^ and $.
You don't need PCRE, this only uses traditional regexp patterns.
Using any awk:
$ echo 'bonfire' | awk 'gsub(/[aeiouy]/,"&")==3'
bonfire
$ echo 'beak' | awk 'gsub(/[aeiouy]/,"&")==3'
$ echo 'yearbook' | awk 'gsub(/[aeiouy]/,"&")==3'
$
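The counting trick above generalizes, since gsub() returns the number of substitutions it made; the required vowel count can then be passed in as a variable. A sketch (count_is is a hypothetical wrapper name; n=3 reproduces the question):

```shell
# hypothetical wrapper: print lines whose vowel count is exactly n
count_is() {
  awk -v n="$1" 'gsub(/[aeiouy]/, "&") == n'
}

printf '%s\n' bonfire beak yearbook | count_is 3
# -> bonfire   (beak has 2 vowels, yearbook has 5)
```

Replacing each vowel with itself ("&") leaves the line unchanged while still counting the matches.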
You don't need -P for Perl-compatible regular expressions.
If you want to match exactly 3 vowels using grep, you can use anchors and match 3 vowels surrounded by optional repetitions of the negated character class [^aeiouy]* matching any character except the vowels.
grep '^[^aeiouy]*\([aeiouy][^aeiouy]*\)\{3\}$' file
Or with -E
grep -E '^[^aeiouy]*([aeiouy][^aeiouy]*){3}$' file
If you want at least 3 vowels:
grep -E '([aeiouy][^aeiouy]*){2}[aeiouy]' file

Delete string from line that matches regex with AWK

I have a file that contains a lot of data like this, and I have to delete everything that matches the regex [-]+\d+(.*)
Input:
zxczxc-6-9hw7w
qweqweqweqweqwe-18-8c5r6
asdasdasasdsad-11-br9ft
Output should be:
zxczxc
qweqweqweqweqwe
asdasdasasdsad
How can I do this with AWK?
sed might be easier...
$ sed -E 's/-+[0-9].*//' file
note that .* covers +.*
AFAIK awk doesn't support \d, so use [0-9] instead. Your regex is otherwise correct; you just need to put it in the appropriate awk function:
awk '{sub(/-+[0-9].*/,"")} 1' Input_file
You don't need the extra + after [0-9], as it is covered by the .*
Generally, if you want to delete a string that matches a regular expression, all you need to do is substitute it with an empty string. The most straightforward tool is sed, as presented by karafka; the other option is awk, as presented by RavinderSingh13.
The overall syntax would look like this:
sed -e 's/ere//g' file
awk '{gsub(/ere/,"")}1' file
with ere the regular expression representation. Note I use g and gsub here to substitute all non-overlapping strings.
Due to the nature of the regular expression in the OP, i.e. it ends with .*, the g can be dropped. It also allows us to write a different awk solution which works with field separators:
awk -F '-+[0-9]' '{print $1}' file
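Run end to end on the question's input, the field-separator variant behaves exactly like the substitution ones (a here-doc stands in for the file here):

```shell
# treating the junk to delete as the field separator, $1 is what remains
awk -F '-+[0-9]' '{print $1}' <<'EOF'
zxczxc-6-9hw7w
qweqweqweqweqwe-18-8c5r6
asdasdasasdsad-11-br9ft
EOF
# -> zxczxc
#    qweqweqweqweqwe
#    asdasdasasdsad
```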

Regex match using awk for a line starting with a non-conforming string

I have a huge file, I want to only copy from it lines starting with
,H|756|F:BRN\
but when I do
awk '$1 ~ /^ ,H|756|F:BRN\/' file_1.txt > file_2.txt
I get:
awk: line 1: runaway regular expression /^ ,H|756|F ...
The meta-characters in the regex need to be properly escaped. In the Extended Regular Expressions (ERE) that awk supports by default, | has the special meaning of alternation, so you need to escape it to deprive it of that meaning and treat it literally; the same applies to \.
awk '/^,H\|756\|F:BRN\\/' file
Also, you don't need the explicit ~ match on $1. For a simple case like this, matching a pattern at the start of the line, the bare /regex/ approach is easier.
If the file is "huge", you can consider grep or ack or ag, which may bring you better performance.
grep '^,H|756|F:BRN\\' input > output
grep uses BRE by default, so you don't have to escape the pipe |, but you should escape the trailing backslash.
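If escaping metacharacters is a nuisance, a sketch that sidesteps regex entirely: awk's index() does fixed-string search, so a line starts with the literal prefix exactly when index() returns 1 (the sample file contents here are hypothetical):

```shell
# hypothetical sample: one line that starts with the prefix, one that doesn't
printf '%s\n' ',H|756|F:BRN\match' 'no,H|756|F:BRN\match' > file_1.txt

# "\\" inside the awk string is a single literal backslash;
# index(...) == 1 means "starts with"
awk 'index($0, ",H|756|F:BRN\\") == 1' file_1.txt > file_2.txt
```

Only the usual string-literal escaping of the backslash is needed; | is treated literally.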

Arithmetic on gensub substitution in gawk

I wonder whether the following is possible:
echo -e "0#1 1#1 0#0\n0#0 1#1 0#1" | awk '{print gensub(/([01])#([01])/, "\\1" + "\\2", "g")}'
It doesn't work the way it is; is that because the evaluation of "+" happens before the substitutions of "\1" and "\2"?
As output, I would expect 1, the result of arithmetic on \1 and \2, so for \1=0 and \2=1, the output should be 1.
Also, as per answer below, I am not looking for a solution on how to add 1 and 0 in "1#0"; this is just an example, I just wondered whether it is possible to do arithmetic on \1 and \2, since this works:
gensub(/blah blah/, 0 + 1, "g") gives 1.
You can't use gensub() for this, because it returns the captured groups as literal strings as its result.
For such a trivial requirement use # as the field separator and do the arithmetic computation as
echo "0#1" | awk -F# '{print ($1 + $2)}'
Or if you are worried about string values in the input string, force the numeric conversion using int() casting, or just add +0 to each of the operands, i.e. use (int($1) + int($2)) or (($1+0) + ($2+0))
As per the updated question/comments below, constant numeric arithmetic is not what gensub() is intended for: it performs a regexp-based search and replacement, where the replacement typically reuses the captured groups with some textual modification.
I think I understand what you want, and you can do it in Perl using the e modifier on a substitution which means it evaluates the replacement. Here's an example:
echo "7#302" | perl -nle 's/(\d+)#(\d+)/$1+$2/e && print'
309
Or, slightly more fun:
echo "The 200#109 cats sat on the 7#302 mats" | perl -nle 's/(\d+)#(\d+)/$1+$2/ge && print'
The 309 cats sat on the 309 mats
You could use sed with bc for the calculation, in the manner Mark used perl; this relies on GNU sed's e flag, which executes the pattern space as a shell command:
echo "7#302" | sed -E 's/([0-9]+)#([0-9]+)/echo "\1+\2"|bc/e'
When you write foo(bar()), you'll find that bar() is executed first whether it's a function or any expression so gensub(..., "\\1" + "\\2", ...) calls gensub() using the result of adding the 2 strings which is 0, i.e. gensub(..., 0, ...).
This isn't semantically identical to the code you wrote but the approach to do what you want is to use the 3rd arg to match():
$ echo "0#1" | awk 'match($0,/([01])#([01])/,a){print a[1] + a[2]}'
1
The above uses GNU awk for that 3rd arg to match() but you were already using that for gensub() anyway. If it's not clear how to use that on your real data then post a followup question that includes an example of your real data.
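The 3-arg match() above handles one pair per line, while the sample input has several. Here is a sketch of a loop that replaces every d#d pair on a line with its sum, using only POSIX awk (RSTART/RLENGTH instead of the gawk-only array argument):

```shell
printf '0#1 1#1 0#0\n0#0 1#1 0#1\n' | awk '{
  out = ""
  # consume the line match by match, appending each sum to the output
  while (match($0, /[01]#[01]/)) {
    split(substr($0, RSTART, RLENGTH), d, "#")
    out = out substr($0, 1, RSTART - 1) (d[1] + d[2])
    $0  = substr($0, RSTART + RLENGTH)
  }
  print out $0
}'
# -> 1 2 0
#    0 2 1
```

The unmatched text before each pair, and whatever trails the last pair, is copied through unchanged.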

awk to remove 5th column from N column with fixed delimiter

I have a file with N columns.
I want to remove the 5th column from the end.
The delimiter is "|".
I tested with simple example as shown below:
bash-3.2$ echo "1|2|3|4|5|6|7|8" | nawk -F\| '{print $(NF-4)}'
4
Expecting result:
1|2|3|5|6|7|8
How should I change my command to get the desired output?
If I understand you correctly, you want to use something like this:
sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/'
This matches a pipe character \| followed by any number of non-pipe characters [^|]*, then captures 4 more of the same pattern ((\|[^|]*){4}). The $ at the end matches the end of the line. The first part of the match (i.e. the fifth field from the end) is dropped.
Testing it out:
$ sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/' <<<"1|2|3|4|5|6|7"
1|2|4|5|6|7
You could achieve the same thing using GNU awk with gensub but I think that sed is the right tool for the job in this case.
If your version of sed doesn't support extended regex syntax with -E, you can modify it slightly:
sed 's/|[^|]*\(\(|[^|]*\)\{4\}\)$/\1/'
In basic mode, pipes are interpreted literally, but the parentheses for capture groups and the curly braces need to be escaped.
AWK is your friend :
Sample Input
A|B|C|D|E|F|G|H|I
A|B|C|D|E|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|D|O|R|Q|U|I
A|B|C|D|E|F|G|H|I|E|O|Q
A|B|C|D|E|F|G|H|I|X
A|B|C|D|E|F|G|H|I|J|K|L
Script
awk 'BEGIN{FS="|";OFS="|"}
{$(NF-5)="";sub(/\|\|/,"|");print}' file
Sample Output
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
What we did here
As you are aware, awk has special variables that store each field of the record, ranging from $1, $2 up to $(NF).
Excluding the 5th-from-last column is as simple as:
Emptying the column, i.e. $(NF-5)=""
Removing from the record the consecutive | formed by the above step, i.e. sub(/\|\|/,"|")
another alternative, using #sjsam's input file
$ rev file | cut -d'|' --complement -f6 | rev
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
Not sure whether you want the 5th from the last or the 6th, but it's easy to adjust.
Thanks for the help and guidance.
Below is what I tested:
bash-3.2$ echo "1|2|3|4|5|6|7|8|9" | nawk 'BEGIN{FS="|";OFS="|"} {$(NF-4)="!";print}' | sed 's/|!//'
Output: 1|2|3|4|6|7|8|9
I further tested it on the file extracted from the system, and it worked fine.
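One caveat on the sentinel/sub() variants: they misbehave if a real field can be empty or can contain the sentinel character. A sketch that avoids both issues by rebuilding the record and simply skipping $(NF-4):

```shell
echo '1|2|3|4|5|6|7|8' | awk -F'|' -v OFS='|' '{
  out = ""; first = 1
  for (i = 1; i <= NF; i++) {
    if (i == NF - 4) continue        # skip the 5th field from the end
    out = first ? $i : out OFS $i
    first = 0
  }
  print out
}'
# -> 1|2|3|5|6|7|8
```

The first flag (rather than testing out for emptiness) keeps genuinely empty fields intact.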