How do I run a regex match on dynamic input that may contain brackets? I am supplying the input via the bash command line. The input comes from another program and sometimes contains brackets, and then my simple good old $0 ~ var construct fails.
Here is my input data:
hello there
this is monk
and this is a random data
piano (sense) is cool
which makes no (sense) to anyone
Command-1: worked, without brackets around the var. Eg: sense
awk -v var='sense' '$0 ~ var {print "worked"}' input
worked
Command-2: worked, when I used . (dot) in place of brackets ( and ).
awk -v var='no .sense.' '$0 ~ var{print "worked"}' input
worked
Command-3: Here I need to supply input containing the brackets ( and ). Things go wrong and I get no results; awk silently fails with a false negative.
awk -v var='no (sense)' '$0 ~ var {print "worked"}' input
I have already tried $0 ~ var and match($0, var); both exhibit the same behavior. I have also tried the following, but it failed as well. Since the input var is dynamic, I cannot do manual escaping, as it is coming from some other program.
awk -v var='no \(sense\)' 'match($0,var){print "worked"}' input
awk: warning: escape sequence `\(' treated as plain `('
awk: warning: escape sequence `\)' treated as plain `)'
The question is: how do I supply an input variable that may contain brackets to awk so that awk can still perform a sane regex operation on it? Is it simply impossible?
TLDR:
with the above sample input data, when var is no (sense), it should ONLY return which makes no (sense) to anyone
Better to ditch regex and use a plain string search with the index function:
awk -v var='no (sense)' 'index($0, var) {print "worked"; exit}' file
worked
btw if you want to escape then use \\ to escape special characters like this:
awk -v var='(^|[[:blank:]])no \\(sense\\)([[:blank:]]|$)' '
$0 ~ var {print "worked"; exit}' file
However, if you must use regex and you cannot pre-escape the content of var, then you can escape all special characters in the BEGIN block like this:
awk -v var='no (sense)' '
BEGIN {
gsub(/[^_[:alnum:] ]/, "\\\\&", var)
var = "(^|[[:blank:]])" var "([[:blank:]]|$)"
}
$0 ~ var {print "worked"; exit}
' file
worked
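To see what that escaping actually produces, here is a quick sanity check (a sketch; the sample var is arbitrary):
$ awk -v var='a.b (c)* [d]' 'BEGIN {
    gsub(/[^_[:alnum:] ]/, "\\\\&", var)  # prefix every non-word, non-space char with a backslash
    print var
}'
a\.b \(c\)\* \[d\]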
As an alternative to escaping the characters that have special meanings in ERE, you can consider using a bracket expression (character class):
$ awk -v var='no [(]sense[)]' '$0 ~ var {print "worked"}' file
worked
IMO, [] could be easier to read than escapes in some cases.
INPUT
hello there
this is monk
and this is a random data
which makes no (sense) to anyone
CODE
{m,n,g}awk -v __='no (sense)' '
BEGIN {
gsub("[[-\140!-/\\]{-~:-#]",
"[&]", __)
gsub(/[\\^]/, "\\\\&",__)
OFS = "worked"
FS = "^.*[^[:alpha:]]?"(__)".*$" } NF*=!_<NF'
OUTPUT
worked
To give a sense of what those two gsub() calls do to ASCII: anything from "!" to "~" that isn't alphanumeric gets safely "caged" in square brackets, regardless of whether it's considered a metacharacter or not (which differs among awk flavors).
[!] ["] [#] [$] [%] [&] ['] [(]
[)] [*] [+] [,] [-] [.] [/] 0
1 2 3 4 5 6 7 8
9 [:] [;] [<] [=] [>] [?] [@]
A B C D E F G H
I J K L M N O P
Q R S T U V W X
Y Z [[] [\\] []] [\^] [_] [`]
a b c d e f g h
i j k l m n o p
q r s t u v w x
y z [{] [|] [}] [~]
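For instance, a quick check (gawk here, since \140 is an octal string escape) of what the variable looks like after the two substitutions:
$ gawk -v __='no (sense)' 'BEGIN {
    gsub("[[-\140!-/\\]{-~:-@]", "[&]", __)  # cage every non-alphanumeric printable char
    gsub(/[\\^]/, "\\\\&", __)               # \ and ^ stay special inside [...], escape them
    print __
}'
no [(]sense[)]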
Related
Let's say I have a file like so:
test.txt
one
two
three
I'd like to get the following output: one|two|three
And am currently using this command: gawk -v ORS='|' '{ print $0 }' test.txt
Which gives: one|two|three|
How can I print it so that the last | isn't there?
Here's one way to do it:
$ seq 1 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1
$ seq 3 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1|2|3
With paste:
$ seq 1 | paste -sd'|'
1
$ seq 3 | paste -sd'|'
1|2|3
Convert one column to one row with field separator:
awk '{$1=$1} 1' FS='\n' OFS='|' RS='' file
Or in another notation:
awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1' file
Output:
one|two|three
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
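Since RS='' is paragraph mode, blank lines separate records; a quick sketch of what happens if the input contains one:
$ printf 'one\ntwo\n\nthree\nfour\n' | awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1'
one|two
three|four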
awk solutions work great. Here is a tr + sed solution:
tr '\n' '|' < file | sed 's/|$//'
1|2|3
just flatten it :
gawk/mawk 'BEGIN { FS = ORS; RS = "^[\n]*$"; OFS = "|"
} NF && ( $NF ? NF=NF : --NF )'
ASCII | = octal \174 = hex 0x7C. The reason for --NF is that, more often than not, the input includes a trailing newline, which makes the field count one too many and results in
1|2|3|
Both NF=NF and --NF are similar concepts to $1=$1. Empty inputs, regardless of whether trailing new lines exist or not, would result in nothing printed.
At the OFS spot, you can delimit it with any string combo you like instead of being constrained by tr, which has inconsistent behavior. For instance :
gtr '\012' '高' # UTF8 高 = \351\253\230 = xE9 xAB x98
on bsd-tr, \n gets replaced by the multi-byte character properly, giving 1高2高3高, but if you're on gnu-tr, it keeps only the leading byte of the character, resulting in
1 \351 2 \351 . . .
For unicode equiv-classes, bsd-tr works as expected while gtr '[=高=]' '\v' results in
gtr: ?\230: equivalence class operand must be a single character
and if you attempt equivalence classes with an arbitrary non-ASCII byte, bsd-tr does nothing, while gnu-tr gladly obliges, even if it means slicing straight through UTF-8-compliant characters:
g3bn 77138 | (g)tr '[=\224=]' '\v'
bsd-tr : 77138=Koyote 코요태 KYT✜ 高耀太
gnu-tr : 77138=Koyote ?
?
태 KYT✜ 高耀太
I would do it the following way, using GNU AWK. Let test.txt content be
one
two
three
then
awk '{printf NR==1?"%s":"|%s", $0}' test.txt
output
one|two|three
Explanation: if it is the first line, print the line content sans trailing newline; otherwise print | followed by the line content sans trailing newline. Note that I assumed test.txt has no trailing newline; if that is not the case, test this solution before applying it.
(tested in gawk 5.0.1)
Also you can try this with awk:
awk '{ORS = (NR%3 ? "|" : RS)} 1' file
one|two|three
% is the modulo operator and NR%3 ? "|" : RS is a ternary expression.
See Ed Morton's explanation here: https://stackoverflow.com/a/55998710/14259465
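Note that the 3 is the line count of the file: ORS is | for every line except the last, where it falls back to RS (a newline). With a different line count the modulus changes accordingly, e.g.:
$ seq 4 | awk '{ORS = (NR%4 ? "|" : RS)} 1'
1|2|3|4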
With GNU sed, you can pass the -z option so that line breaks become matchable, and thus all you need is to replace each newline except the one at the end of the string:
sed -z 's/\n\(.\)/|\1/g' test.txt
perl -0pe 's/\n(?!\z)/|/g' test.txt
perl -pe 's/\n/|/g if !eof' test.txt
See the online demo.
Details:
s - substitution command
\n\(.\) - an LF char followed with any one char captured into Group 1 (so \n at the end of string won't get matched)
|\1 - a | char and the captured char
g - all occurrences.
The first perl command matches any LF char (\n) not at the end of string ((?!\z)) after slurping the whole file into a single string input (again, to make \n visible to the regex engine).
The second perl command replaces an LF char at the end of each line except the one at the end of file (eof).
To make the changes inline add -i option (mind this is a GNU sed example):
sed -i -z 's/\n\(.\)/|\1/g' test.txt
perl -i -0pe 's/\n(?!\z)/|/g' test.txt
perl -i -pe 's/\n/|/g if !eof' test.txt
I am unable to read fields from an awk command in Tcl: the command works when run in a terminal, but not in the Tcl script.
I tried making syntax changes; it still works in the terminal, not in the script.
set a { A B C D E F G H I J K L M N O P Q R S T U V W X Y Z }
#store only cell var in file
exec grep -in "cell (?*" ./slow.lib | cut -d "(" -f2 | cut -d ")" -f1 > cells.txt
#take alphabets to loop
foreach b $a {
puts "$b\n"
if { [ exec cat cells.txt | awk ' $1 ~ /^$b/ ' ] } {
foreach cell [exec cat ./cells.txt] {
puts "$b \t $cell"
}
}
}
The condition should check the first character of each line in the file and yield a boolean.
The error is:
can't read "1": no such variable
while executing "exec cat cells.txt | awk ' $1 ~ /^$b/ ' "
Your problem is that Tcl attaches no special meaning at all to the ' character. It uses {…} (which nest better) for the same purpose. Your command:
exec cat cells.txt | awk ' $1 ~ /^$b/ '
should become:
exec cat cells.txt | awk { $1 ~ /^$b/ }
Except… you also want $b (but not $1) to be substituted in there. The easiest way to do that is with format:
exec cat cells.txt | awk [format { $1 ~ /^%s/ } $b]
It would be better still to omit the use of cat here:
exec awk [format { $1 ~ /^%s/ } $b] <cells.txt
You are aware that your whole script can be written in pure Tcl without any use of exec?
can't read "1": no such variable
The (Tcl) error message is very informative. Tcl feels responsible for substituting the value of a Tcl variable 1 for $1 (which was meant for awk, as part of the awk script). This is due to improper quoting of your awk scriptlet. At the same time, you want $b to be substituted from within Tcl.
Turn awk ' $1 ~ /^$b/ ' into awk [string map [list #b# $b] {{$1 ~ /^#b#/}}]. The curly braces preclude Tcl substitution of $1; #b# will already have been substituted before awk sees it, thanks to [string map].
exec cat cells.txt | awk [string map [list #b# $b] {{$1 ~ /^#b#/}}]
That written, I fail to see why you are going back and forth between grep, awk etc. and Tcl. All of this could be done in Tcl alone.
I was trying to mask a file with the tr and awk commands, but I am failing with the error fatal: cannot open pipe (Too many open pipes). The file has approximately 1,000,000 records, quite a huge number.
Below is the code I am trying:
awk - F "|" - v OFS="|" '{ "echo \""$1"\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\"" | getline $1}1' FILE.CSV > test.CSV
It is showing this error:
awk: (FILENAME=- FNR=1019) fatal: cannot open pipe `echo ""TTP_123"" | tr "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" "QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq"' (Too many open pipes)
Please let me know what I am doing wrong here.
Also note: any number of columns could be used for masking, at any positions. In this example I have taken column positions 1 and 2, but it could be 3 and 10, or 5, 7, and 25.
Thanks
AJ
First things first, you can't have a space between - and F or v.
I was going to suggest sed, but as you only want to translate the first column, that's not as easy.
Unfortunately, awk doesn't have built-in tr functionality, so you'd have to use the shell like you are and just close the pipe:
awk -F "|" -v OFS="|" '{
command="echo \"\\"$1"\\\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\""
command | getline $1
close(command)
}1' FILE.CSV > test.CSV
However, I suggest using perl, which can do field splitting and character translation:
perl -F'\|' -lane '$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/; print join("|", @F)' FILE.CSV > test.CSV
Or, for a shorter command line, just put the program into a file, drop the e in -lane and use the file name instead of the '...' command.
you can do the mapping in awk instead of making a system call for each line, or perhaps simply
paste -d'|' <(cut -d'|' -f1 file | tr '0-9' 'a-z') <(cut -d'|' -f2- file)
replace the tr arguments with yours.
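With the OP's actual mapping that might look like the following (an untested sketch, assuming bash for the process substitutions and that every line contains at least one |):
paste -d'|' <(cut -d'|' -f1 FILE.CSV |
        tr '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' \
           'QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq') \
    <(cut -d'|' -f2- FILE.CSV) > test.CSV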
This does not answer your question, but you can implement tr as an awk function, which would save having to spawn lots of external processes:
$ cat tr.awk
function tr(str, from, to, s,i,c,idx) {
s = ""
for (i=1; i<=length(str); i++) {
c = substr(str, i, 1)
idx = index(from, c)
s = s (idx == 0 ? c : substr(to, idx, 1))
}
return s
}
{
print $1, tr($1,
" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
}
Example:
$ printf "%s\n" hello wor-ld | awk -f tr.awk
hello KGCCN
wor-ld 3N8-CF
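Applied to the OP's case (mask only column 1, keeping | as the separator), it could be wired up like this (a self-contained sketch based on the tr() definition above):
awk -F'|' -v OFS='|' '
function tr(str, from, to,   s, i, c, idx) {
    s = ""
    for (i = 1; i <= length(str); i++) {
        c = substr(str, i, 1)
        idx = index(from, c)
        s = s (idx == 0 ? c : substr(to, idx, 1))
    }
    return s
}
{
    # translate the first field only; separators and other fields pass through
    $1 = tr($1,
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
        "QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
    print
}' FILE.CSV > test.CSV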
I have a file with the input text below (this is not the original file, just example input text) and I want to replace every 2-letter string with the numeric 100. In this file the FS can be :, | or " " (space); I have no choice but to treat all three of them as FS, and I want to preserve these field separators at their original positions (as in the input file) in the output.
A:B C|D
AA:C EE G
BB|FF XX1 H
DD:MM:YY K
I have tried
awk -F"[:| ]" '{gsub(/[A-Z]{2}/,"100");print}'
but this does not seem to work; please suggest.
Desired output:
A:B C|D
100:C 1000 G
100|100 1001 H
100:100:100 K
There is no functionality in POSIX awk to retain the strings that match the string defined by RS (POSIX) or the regexp defined by FS. Since in POSIX RS is just a string, there's no need for such functionality, and doing it for every FS-matching string would be unnecessarily inefficient given that it's rarely needed.
With GNU awk where RS can be a regexp, not just a string, you can retain the string that matched the regexp RS with RT but there is no functionality that retains the values that match FS for the same efficiency reason that POSIX doesn't do it. Instead in GNU awk they added a 4th arg to split() so you can retain the strings that match FS in an array yourself if you want it (seps[] below):
$ awk -v FS='[:| ]' '{
split($0,flds,FS,seps)
gsub(/[A-Z]{2}/,"100")
for (i=1;i<=NF;i++) {
printf "%s%s", $i, seps[i]
}
print ""
}' file
A:B C|D
100:C 100 G
100|100 1001 H
100:100:100 K
Look up split() in the GNU awk manual for more info.
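To illustrate what ends up in seps[] (gawk only, as the 4th split() argument is a gawk extension):
$ echo 'a:b|c' | gawk '{
    n = split($0, flds, /[:| ]/, seps)
    for (i = 1; i < n; i++) printf "flds[%d]=%s seps[%d]=%s\n", i, flds[i], i, seps[i]
    printf "flds[%d]=%s\n", n, flds[n]
}'
flds[1]=a seps[1]=:
flds[2]=b seps[2]=|
flds[3]=c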
in this case
sed 's/[A-Z]\{2\}/100/g' YourFile
awk '{gsub(/[A-Z]{2}/, "100"); print}' YourFile
There is no need for field separation in this case: just replace every group of two uppercase letters with "100", unless you have constraints beyond those in the OP (such as other elements in the string; you would then need to specify what is possible and, ideally, add a sample of the expected result to be unambiguous).
Your real data certainly has more going on, so this code will fail on things like ABC:DEF, turning it into 100C:100F, which is certainly not what you expect.
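A quick demonstration of that failure mode:
$ echo 'ABC:DEF' | sed 's/[A-Z]\{2\}/100/g'
100C:100F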
in this case
awk -F '[[:blank:]:|]+' '
{
split( $0, aS, /[^[:blank:]:|]+/)
for( i=1;i<=NF;i++){
if( $i ~ /^[A-Z][A-Z]$/) $i = "100"
printf( "%s%s", $i, aS[i+1])
}
printf( "\n" )
} ' YourFile
Give this sed one-liner a try:
kent$ sed -r 's/(^|[:| ])[A-Z][A-Z]([:| ]|$)/\1100\2/g' file
A:B C|D
100:C 100 G
100|FF XX1 H
100:MM:100 K
Note:
this searches for and replaces the pattern: exactly two [A-Z] between two delimiters. Since adjacent matches share a delimiter, some occurrences (FF, MM above) are left alone. If this is not what you want exactly, paste the desired output.
Your code seems to work just fine with my Gnu awk:
A:B C|D
100:C 100 G # even the typo in this record got fixed.
100|100 1001 H
100:100:100 K
I'd say the problem is that the regex /[A-Z]{2}/ should be written /[A-Z][A-Z]/, since not every awk supports interval expressions like {2}.
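In an awk without interval-expression support, the braces are typically matched as literal characters, so nothing is replaced; the spelled-out form works everywhere. A quick way to check which camp your awk falls into (a sketch):
echo 'AB' | awk '{gsub(/[A-Z]{2}/, "100")} 1'
# prints 100 if the awk supports interval expressions,
# prints AB  if {2} is taken literally (no match)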
If the field separator is the empty string, each character becomes a separate field
$ echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
5,h,e,l,l,o
However, if FS is a regex that can match the empty string, the same behaviour does not occur:
$ echo hello | awk -F ' *' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Anyone know why that is? I could not find anything in the gawk manual. Is FS="" just a special case?
I'm most interested in understanding why the 2nd case does not split the record into more fields. It's as if awk is treating FS=" *" like FS=" +"
Interesting question!
I just pulled GNU awk 4.1.0's source; I think the answer can be found in the file field.c.
line 371:
* re_parse_field --- parse fields using a regexp.
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is a regular
* expression -- either user-defined or because RS=="" and FS==" "
*/
static long
re_parse_field(lo...
also this line (line 425):
if (REEND(rp, scan) == RESTART(rp, scan)) { /* null match */
Here is the case of <space>* matching in your question: on a null match, the implementation doesn't increment nf; that is, it treats the whole line as one single field. Note this function is used by the do_split() function too.
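In other words, a zero-length match never splits the record; only positions where the regexp actually consumes characters do. A quick check (gawk):
$ echo 'hello' | gawk -F ' *' '{print NF}'   # " *" can only make null matches here
1
$ echo 'a  b' | gawk -F ' *' '{print NF}'    # the space run is a real (non-null) match
2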
First, if FS is the null string, gawk separates each character into its own field. gawk's documentation states this clearly, and we can see it in the code as well:
line 613:
* null_parse_field --- each character is a separate field
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is the null string.
*/
static long
null_parse_field(long up_to,
If FS is a single character, awk won't treat it as a regex. This is mentioned in the docs too. Also in the code:
#line 667
* sc_parse_field --- single character field separator
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is a single character
* other than space.
*/
static long
sc_parse_field(l
If we read the function, no regex match handling is done there.
The comments of re_parse_field() and sc_parse_field() say that do_split() invokes them too. That explains why the following command gives 1 instead of 3:
kent$ echo "foo"|awk '{split($0,a,/ */);print length(a)}'
1
Note: to avoid making the post too long, I didn't paste the complete code here; it can be found at:
http://git.savannah.gnu.org/cgit/gawk.git/
As was mentioned, an empty field separator generates undefined behavior; the same code will give different results on different platforms / flavors of awk. For example (all Mac OSX 10.8.5):
> echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
awk: field separator FS is empty
1,hello
So awk complains, but keeps going.
Let's look at some other examples:
> echo hello | awk -F '.' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
A . by itself is not considered a regular expression
> echo hello | awk -F '[.]' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Still nothing
> echo hello | awk -F '.?' -v OFS=, '{$1 = NF OFS $1} 1'
6,,,,,,
Now we have something like a regex: .? is "zero or one character". It is expanded to one character (which is consumed), so the output is "a whole lot of nothings"
> echo hello | awk -F '*' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Not a regular expression
> echo hello | awk -F '.*' -v OFS=, '{$1 = NF OFS $1} 1'
2,,
A regular expression that consumes the entire thing
> echo hello | awk -F 'l' -v OFS=, '{$1 = NF OFS $1} 1'
3,he,,o
Match the letter l twice - two empty strings
> echo hello | awk -F 'ell' -v OFS=, '{$1 = NF OFS $1} 1'
2,h,o
Match all of ell at once
> echo hello | awk -F '.?|' -v OFS=, '{$1 = NF OFS $1} 1'
awk: illegal primary in regular expression .?| at
input record number 1, file
source line number 1
Attempt to be clever: sometimes an | with empty string on one side will match "anything" but awk's regex engine doesn't like it.
Conclusion - the regular expressions cannot match "empty", and whatever is matched is consumed. Attempts to use (?:.) or even (?=.) generate errors.
It seems to be a special case in gawk.
Traditionally, the behavior of FS equal to "" was not defined. In this
case, most versions of Unix awk simply treat the entire record as only
having one field. (d.c.) In compatibility mode (see Options), if FS is
the null string, then gawk also behaves this way.
What POSIX has to say about this:
If FS is a null string, the behavior is unspecified.
So the gawk behaviour is implementation-specific and sort of explains why your two examples don't yield the same output.
Another data point: gawk and perl disagree on how to do this:
$ perl -E '$,=","; $s="hello"; $r=qr( *); @s=split($r,$s); say scalar(@s), @s'
5,h,e,l,l,o
$ gawk 'BEGIN {s="hello";r=" *";n=split(s,a,r); print n,a[n]; if (s~r) print "match"}'
1 hello
match
$ gawk 'BEGIN {s="hello";r=""; n=split(s,a,r); print n,a[n]; if (s~r) print "match"}'
5 o
match