Awk - How to escape the | in sub?

I'd like to substitute a string which contains a |.
My STDIN:
13|Test|123|6232
14|Move|126|6692
15|Test|123|6152
I'd like to obtain:
13|Essai|666|6232
14|Move|126|6692
15|Essai|666|6152
I tried it like this:
{sub("|Test|123","|Essai|666") ;} {print;}
But I think the | is bothering me.... I really need to replace the complete string WITH the |.
What should I do to get this result?
Many thanks for your precious help.

You can use
awk '{sub(/\|Test\|123\|/,"|Essai|666|")}1' file
See the online demo.
Note:
/\|Test\|123\|/ is a regex that matches the |Test|123| substring (each | is escaped so it is matched literally rather than treated as alternation)
sub(/\|Test\|123\|/,"|Essai|666|") - replaces the first occurrence of the regex pattern in the whole record (since the target argument is omitted, $0 is assumed)
1 triggers the default print action, no need to explicitly call print here.
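If a line could contain the substring more than once, a gsub variant of the same command (a sketch, not part of the answer above; the same escaping applies) would replace every occurrence:
awk '{gsub(/\|Test\|123\|/,"|Essai|666|")}1' file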


Return lines with at least n consecutive occurrences of the pattern in bash [duplicate]

Might be a naive question, but I can't find an answer.
Given a text file, I'd like to find lines with at least (defined number) of occurrences of a certain pattern, say, AT[GA]CT.
For example, with n=2, from the file:
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
Only the second line should be returned.
I know how to use grep/awk to search for at least one instance of this degenerate pattern, and for some defined number of pattern instances occurring non-consecutively. But the issue is the pattern occurrences MUST be consecutive, and I can't figure out how to achieve that.
Any help appreciated, thank you very much in advance!
I would use GNU AWK for this task in the following way. Let file.txt content be
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
then
awk 'BEGIN{p="AT[GA]CT";n=2;for(i=1;i<=n;i+=1){pat=pat p}}$0~pat' file.txt
output
TAGATGCTATACTTGA
Explanation: I use a for loop to repeat p n times (building pat in the BEGIN block), then filter lines by checking whether the line ($0) matches the pattern built earlier.
Alternatively you might use string formatting function sprintf as follows:
awk 'BEGIN{n=2}$0~sprintf("(AT[GA]CT){%s}",n)' file.txt
Explanation: I used the sprintf function; %s in the first argument marks where to put n. If you want to know more about what might be used in the first argument of printf and sprintf, read Format Modifiers.
(both solutions tested in GNU Awk 5.0.1)
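If plain grep is preferred, an equivalent check (a sketch; any grep with ERE support, e.g. GNU grep, should accept it) is:
grep -E '(AT[GA]CT){2}' file.txt
The {2} interval on the group demands two consecutive matches; a line with more than two consecutive matches still contains two in a row, so this also satisfies the "at least n" requirement.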

Need solution for line-break issue in string

I have the string below, in which a newline character appears at random places; fields are separated by ~$~ and records end with ##&.
Please help me merge the broken lines into one.
In the string below, the newline occurs in the address field (4/79A).
-------String----------
23510053~$~ABC~$~4313708~$~19072017~$~XYZ~$~CHINNUSAMY~$~~$~R~$~~$~~$~~$~42~$~~$~~$~~$~~$~28022017~$~
4/79A PQR Marg, Mumbai 4000001~$~TN~$~637301~$~Owns~$~RAT~$~31102015~$~12345~$~##&
Thanks in advance.
Rupesh
Seems to be a (more or less) duplicate of https://stackoverflow.com/a/802439/3595749
Note, you should ask your client to remove the stray CRLF characters (rather than applying the code below).
Nevertheless, try this:
cat inputfile | tr -d '\n' | sed 's/##&/##\&\n/g' >outputfile
Explanation:
tr deletes the newline characters,
sed adds one back (only when ##& is encountered). s/##&/##\&\n/g substitutes "##&" with "##&\n" (a newline is appended, and "&" must be escaped because it is special in a sed replacement). This applies globally (the "g" flag at the end).
Note, depending on the platform (Unix or Windows), "\n" must be replaced by "\r\n" in some cases.
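An awk alternative (a sketch; it assumes GNU awk, since a multi-character record separator is a gawk extension) rebuilds the records by splitting on the ##& terminator and stripping any embedded newlines:
awk 'BEGIN{RS="##&"} {gsub(/\n/,"")} NF{print $0 "##&"}' inputfile > outputfile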

Grep for Multiple instances of string between a substring and a character?

Can you please tell me how to Grep for every instance of a substring that occurs multiple times on multiple lines within a file?
I've looked at
https://unix.stackexchange.com/questions/131399/extract-value-between-two-search-patterns-on-same-line
and How to use sed/grep to extract text between two words?
But my problem is slightly different - each substring will be immediately preceded by the string: name"> and will be terminated by a < character immediately after the last character of the substring I want.
So one line might be
<"name">Bob<125><adje></name><"name">Dave<123><adfe></name><"name">Fred<125><adfe></name>
And I would like the output to be:
Bob
Dave
Fred
Although awk is not the best tool for xml processing, it will help if your xml structure and data are simple enough.
$ awk -F"[<>]" '{for(i=1;i<NF;i++) if($i=="\"name\"") print $(++i)}' file
Bob
Dave
Fred
I doubt that the tag is <"name"> though. If it's <name> without the quotes, change the condition in the script to $i=="name"
With gawk:
awk -vRS='<"name">|<' '/^[A-Z]/' file
Bob
Dave
Fred
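Since the question mentions grep, a PCRE lookbehind does the same extraction (a sketch; it assumes GNU grep built with -P support):
grep -oP '(?<=name">)[^<]+' file
The -o flag prints each match on its own line, so the output is again Bob, Dave, Fred.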

Remove all occurrences of a list of words vim

Having a document whose first line is foo,bar,baz,qux,quux, is there a way to store these words in a variable as a list ['foo','bar','baz','qux','quux'] and remove all their occurrences in the document with vim?
Like a command :removeall in visual mode highlighting the list:
foo,bar,baz,qux,quux
hello foo how are you
doing foo bar baz qux
good quux
will change the text to:
hello how are you
doing good
A safer way is to write a function that checks each part of your "list" for anything that needs to be escaped, and then does the substitution (removal). A quick and dirty way to do it with your input is this mapping:
nnoremap <leader>R :s/,/\|/g<cr>dd:%s/\v<c-r>"<c-h>//g<cr>
Then in Normal mode, go to the line that contains the parts to delete (it must be in CSV format) and press <leader>R; you will get the expected output.
The substitution would fail if that line has regex special chars, like /, *, . or \ etc.
Something like this one-liner should work:
:for f in split(getline("."), ",") | execute "%s/" . f | endfor | 0d
Note that you'll end up with a lot of trailing spaces.
Edit
This version of the command above takes care of those pesky trailing spaces (but not the one on line 2 of your sample text):
:for f in split(getline("."), ",") | execute "%s/ *" . f | endfor | 0d
Result:
hello how are you
doing
good
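To address the concern about regex special characters raised above, a variant of the same one-liner (a sketch: \V switches the pattern to very-nomagic mode, escape() handles the delimiter and backslashes, and the e flag keeps the loop going when a word is not found) could be:
:for f in split(getline("."), ",") | execute "%s/ *\\V" . escape(f, '/\') . "//ge" | endfor | 0d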

How to pass a regular expression to a function in AWK

I do not know how to pass a regular expression as an argument to a function.
If I pass a string, it is OK.
I have the following awk file,
#!/usr/bin/awk -f
function find(name) {
    for (i = 0; i < NF; i++)
        if ($(i+1) ~ name)
            print $(i+1)
}
{
    find("mysql")
}
I do something like
$ ./fct.awk <(echo "$str")
This works OK.
But when I call in the awk file,
{
find(/mysql/)
}
This does not work.
What am I doing wrong?
Thanks,
Eric J.
You cannot (should not) pass a regex constant to a user-defined function; you have to use a dynamic regex in this case, like find("mysql").
If you do find(/mysql/), what awk actually does is find($0 ~ /mysql/), so it passes a 0 or 1 to your find(...) function.
See this question for details:
awk variable assignment statement explanation needed
also
http://www.gnu.org/software/gawk/manual/gawk.html#Using-Constant-Regexps
section: 6.1.2 Using Regular Expression Constants
gawk warns about this:
warning: regexp constant for parameter #1 yields boolean value
The regex gets evaluated (matching against $0) before it's passed to the function. You have to use strings.
Note: make sure you do proper escaping: http://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps
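As a small runnable sketch of the "use strings" point (the mysql.com pattern is hypothetical): a dynamic regex is written as a string, so backslashes must be doubled for the regex engine to see \. and match a literal dot:
awk 'BEGIN { if ("www.mysql.com" ~ "mysql\\.com") print "match" }'
In the asker's script the corresponding call would be find("mysql\\.com").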
If you use GNU awk, you can use a regular expression as a user-defined function parameter.
You have to define your regex as @/.../.
In your example, you would use it like this:
function find(regex) {
    for (i = 1; i <= NF; i++)
        if ($i ~ regex)
            print $i
}
{
    find(@/mysql/)
}
It's called a strongly typed regexp constant and it's available since GNU awk version 4.2 (Oct 2017).
Example here.
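As a runnable sketch of the same idea (assuming gawk >= 4.2 and a hypothetical input file named file):
gawk 'function find(regex,    i){for(i=1;i<=NF;i++) if($i ~ regex) print $i} {find(@/mysql/)}' file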
Use quotations and treat them as strings; this way it works for mawk, mawk2, and gnu-gawk. But you'll also need to double the backslashes, since making them strings will eat away one of them right off the bat.
In your example, just find("mysql") will suffice.
You can actually get it to pass arbitrary regex as you wish, and not be confined to just gnu-gawk, as long as you're willing to make them strings rather than the @/../ syntax others have mentioned. This is where the number of backslashes makes a difference.
You can even make regex out of arbitrary bytes too, preferably via octal codes. If you do "\342\234\234" as a regex, the system will convert that into actual bytes in the regex before matching.
While there's nothing wrong with that approach, if you want to be 100% safe and prefer not having arbitrary bytes flying around, write it as
"[\\342][\\234][\\234]" ----> ✜
Once initially read by awk to create an internal representation, it'll look like this:
[\342][\234][\234]
which will still match the identical objects you desire (in this case, some sort of cross-looking dingbat). This will spit out annoying warnings in gawk's unicode-aware mode, due to attempting to enclose non-ASCII bytes directly in square brackets. For that use case,
"\\342\\234\\234" ------(eqv to )---> /\342\234\234/
will keep gawk happy and quiet. Lately I've been filling the gaps in my own code and writing regexes that can mimic all the Unicode script classes that perl enjoys.
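A minimal sketch of the octal-byte approach described above (assuming a UTF-8 locale, where ✜ is the byte sequence \342\234\234):
awk 'BEGIN { if ("have a \342\234\234 day" ~ "\\342\\234\\234") print "dingbat matched" }'
The doubled backslashes in the second operand survive string processing, so the regex engine receives \342\234\234 and turns the octal escapes into the actual bytes before matching.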