How to use a variable that includes special symbols in awk?

In my case, if a certain pattern is found as the second field of a line in a file, I need to print the first two fields. It should also handle special symbols such as the backslash.
My solution is to first use sed to replace \ with \\, then pass the new variable to awk; awk then parses \\ as \ and matches field 2.
escaped_str=$( echo "$pattern" | sed 's/\\/\\\\/g')
input | awk -v awk_escaped_str="$escaped_str" '$2==awk_escaped_str { $0=$1 " " $2 " "}; { print } '
But this seems too complicated and cannot handle every case.
Is there a simpler way that also covers all the other special symbols?

The way to pass a shell variable to awk without backslashes being interpreted is to pass it in the arg list instead of populating an awk variable outside of the script:
$ shellvar='a\tb'
$ awk -v awkvar="$shellvar" 'BEGIN{ printf "<%s>\n",awkvar }'
<a b>
$ awk 'BEGIN{ awkvar=ARGV[1]; ARGV[1]=""; printf "<%s>\n",awkvar }' "$shellvar"
<a\tb>
and then you can search a file for it as a string using index() or ==:
$ cat file
a b
a\tb
$ awk 'BEGIN{ awkvar=ARGV[1]; ARGV[1]="" } index($0,awkvar)' "$shellvar" file
a\tb
$ awk 'BEGIN{ awkvar=ARGV[1]; ARGV[1]="" } $0 == awkvar' "$shellvar" file
a\tb
You need to set ARGV[1]="" after populating the awk variable to avoid the shell variable value also being treated as a file name. Unlike any other way of passing in a variable, ALL characters used in a variable this way are treated literally with no "special" meaning.
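Applied to the question, the ARGV technique could be sketched as below. This is only an illustration: the function name print_first_two is made up, and it assumes the goal is to print just the first two fields of each line whose second field equals the pattern exactly.

```shell
# print_first_two PATTERN FILE   (hypothetical helper name)
# The pattern travels through ARGV, so awk applies no escape processing to it.
print_first_two() {
    awk 'BEGIN { pat = ARGV[1]; ARGV[1] = "" }   # clear it so it is not read as a file
         $2 == pat { print $1, $2 }' "$1" "$2"
}
```

Called as print_first_two 'a\tb' file, it compares the second field with the five characters a\tb as typed, with no sed pre-escaping step.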

There are three variations you can try without needing to escape your pattern.
This one tests literal strings. Nothing is interpreted as a regex:
$2 == expr
This one tests whether a literal string is a substring:
index($2, expr)
This one tests a regex pattern:
$2 ~ pattern
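To see how the three tests differ, here is a small sketch that runs all of them against the same second field. The helper name compare_modes and the sample values are assumptions for illustration; the value contains a dot, which only the regex test treats specially.

```shell
# compare_modes VALUE FIELD2   (hypothetical helper name)
# Prints 1 or 0 for each of the three tests against the second field.
compare_modes() {
    printf 'x %s\n' "$2" |
    awk -v expr="$1" '{
        print "equal=" ($2 == expr), "index=" (index($2, expr) > 0), "regex=" ($2 ~ expr)
    }'
}
```

With the value a.b and the field axb, only the regex test succeeds, because the dot matches any character there.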

Related

Awk: gsub("\\\\", "\\\\") yields surprising results

Consider the following input:
$ cat a
d:\
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ cat a_double.awk
{ sub("\\\\", "\\\\"); print $0 }
Now running cat a | awk -f a.awk gives
d:\
and running cat a | awk -f a_double.awk gives
d:\\
and I expect exactly the other way around. How should I interpret this?
$ awk -V
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Yes, this is the expected behavior of awk. In sub("\\", "\\\\") the first argument is written inside "..." (double quotes) rather than /.../, so it is a string that awk parses once before using it as a regexp. To match one literal \ you therefore have to escape it in the regexp and then escape both of those backslashes again in the string, which gives \\\\:
\\ \\
|  |
|  the next 2 chars become the actual literal character \
the first 2 chars become the \ that escapes it
This does NOT happen in your first script, so there is no match and no substitution. Your second awk script does this escaping in the regexp argument of sub(), so it matches \ perfectly.
Let's see this by example, putting ... in the replacement to make any substitution visible.
When nothing happens, since there is no match:
awk '{sub("\\", "....\\\\"); print $0}' Input_file
d:\
When pattern matching happens:
awk '{sub("\\\\", "...\\\\"); print $0}' Input_file
d:...\\
From man awk:
gsub(r, s [, t])
For each substring matching the regular expression r in the string t,
substitute the string s, and return the number of substitutions.
How can we do the escaping only once (a single \ before the character)? Write the regexp in /../ as the first argument of sub; there we do NOT need to double-escape the \:
awk '{sub(/\\/,"&\\")} 1' Input_file
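As a variant of the command above, here is a minimal sketch that avoids backslashes in the replacement altogether. It relies on & standing for the matched text in the replacement of sub() and gsub(), so "&&" doubles every match. The function name double_backslashes is made up for illustration.

```shell
# double_backslashes STRING   (hypothetical helper name)
# Doubles every backslash in STRING; "&&" means "the matched text, twice".
double_backslashes() {
    printf '%s\n' "$1" | awk '{ gsub(/\\/, "&&") } 1'
}
```

Because the replacement contains no backslash, this form should not depend on how a particular awk treats backslashes in replacement strings.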
The first arg to *sub() is a regexp, not a string, so you should use regexp (/.../) rather than string ("...") delimiters. The former is a literal regexp which is used as-is while the latter defines a dynamic (or computed) regexp which forces awk to interpret the string twice, the first time to convert the string to a regexp and the second to use it as a regexp, hence double the backslashes needed for escaping. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps.
In the following we just need to escape the backslash once because we're using a literal, rather than dynamic, regexp:
$ cat a
d:\
$ awk '{sub(/\\/,"\\\\")}1' a
d:\\
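To check that the literal and the dynamic form describe the same regexp, a small sketch can test one line of input against both. The helper name backslash_forms is hypothetical; /\\/ and "\\\\" should always agree.

```shell
# backslash_forms STRING   (hypothetical helper name)
# Tests STRING for a literal backslash with a literal and a dynamic regexp.
backslash_forms() {
    printf '%s\n' "$1" |
    awk '{ print "literal=" ($0 ~ /\\/), "dynamic=" ($0 ~ "\\\\") }'
}
```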
Your first script would rightfully produce a syntax error in a more recent version of gawk (5.1.0) since "\\" in a dynamic regexp is equivalent to /\/ in a literal one and in that expression the backslash is escaping the final forward slash which means there is no final delimiter:
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ awk -f a.awk a
awk: a.awk:1: (FILENAME=a FNR=1) fatal: invalid regexp: Trailing backslash: /\/

Recognising backslash in awk field separator

Input is
AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999
The awk script sets the field separator to "\x0D" (I have tried with and without escaping the backslash).
The awk script is:
BEGIN {FS="\\x0D"}
{print NF}
It should output 3 because there are 2 occurrences of the field separator but it outputs 1 which indicates it is not being recognized.
There are 2 ways to provide a regexp in awk: a static regexp (aka regexp literal) written as /regexp/, and a dynamic regexp (aka computed regexp) written as "regexp" and used in a regexp context. A field separator is just a regexp with some additional behavior, so let's just consider regexps in general to explain what's going on in your example.
The split() function takes a field separator (a regexp for our purposes) as its third argument, so it provides a good test bed:
Using a static regexp:
$ awk '{print split($0,a,/\x0D/)}' file
1
The \ above is escaping the x, it's not a literal \. For that you need to escape the \ itself:
$ awk '{print split($0,a,/\\x0D/)}' file
3
What if we used a dynamic regexp instead of the above static regexp?
$ awk '{print split($0,a,"\x0D")}' file
1
$ awk '{print split($0,a,"\\x0D")}' file
1
$ awk '{print split($0,a,"\\\x0D")}' file
' is not a known regexp operator FNR=1) warning: regexp escape sequence `\
1
$ awk '{print split($0,a,"\\\\x0D")}' file
3
The behavior above is because awk first parses the string to convert it into a regexp (using up one layer of escape chars) and then parses it a second time when using it as a regexp (using up a second layer of escape chars).
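The two parsing layers can be made visible with a short sketch. The function name show_layers and the inline sample data are assumptions for illustration: it prints the string as it looks after the first (string) parse, then uses that string as the separator for split().

```shell
# show_layers   (hypothetical helper name)
show_layers() {
    awk 'BEGIN {
        fs = "\\\\x0D"                          # after string parsing: \\x0D
        print fs
        n = split("a\\x0Db\\x0Dc", parts, fs)   # the data string is: a\x0Db\x0Dc
        print n                                 # the regexp \\x0D matches the text \x0D
    }'
}
```

The first line of output shows two backslashes left over for the regexp layer, and the second line shows that the data splits into 3 pieces.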
Unfortunately when you specify a FS there is no option to specify it as a literal regexp, it's always specified using a string and thus is a dynamic regexp and so needs an extra layer of escaping:
$ awk -v FS='\x0D' '{print NF}' file
1
$ awk -v FS='\\x0D' '{print NF}' file
1
$ awk -v FS='\\\x0D' '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS='\\\\x0D' '{print NF}' file
3
Now - what if you were using the wrong type of quotes in the shell part of the script, i.e. " instead of '? Then you introduce even more pain because now you're inviting the shell to also parse the string even before awk gets to see and parse it twice:
$ awk -v FS="\\\\x0D" '{print NF}' file
1
$ awk -v FS="\\\\\x0D" '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS="\\\\\\x0D" '{print NF}' file
' is not a known regexp operatorence `\
1
$ awk -v FS="\\\\\\\x0D" '{print NF}' file
3
That's different from the case where the double quotes are used inside awk because that's all wrapped inside single quotes and so protected from the shell already:
$ awk 'BEGIN{FS="\\\\x0D"} {print NF}' file
3
So - in the shell always use the most restrictive quotes (' over " over none) unless you have a very specific reason not to, and when using regexps or field separators always use literal /.../ rather than dynamic "...", again unless you have a very specific reason not to.
The odd, truncated-looking error messages above are caused by the \r characters the tool is trying to print due to the escape sequence we're providing; they're really all: warning: regexp escape sequence '\^M' is not a known regexp operator
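The extra shell layer described above can be checked without awk at all. This sketch (the function name show_shell_layer is made up) prints what the shell hands over for the same four typed backslashes under each kind of quoting.

```shell
# show_shell_layer   (hypothetical helper name)
show_shell_layer() {
    printf '%s\n' '\\\\x0D'   # single quotes: all four backslashes are passed on
    printf '%s\n' "\\\\x0D"   # double quotes: the shell turns each \\ into \, leaving two
}
```

Only the single-quoted form still carries the four backslashes that awk needs for its own two layers.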
You need two backslashes for a literal backslash since \ is an escape character:
$ echo 'AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999' |
awk 'BEGIN{ FS="\\\\x0D" } { print NF }'
3

Strip last field

My script will be receiving input of various lengths, and I want to strip the last "/"-separated field. An example of the input I will be dealing with is:
this/that/and/more
But the issue I am running into is that the length of the input will vary like so:
this/that/maybe/more/and/more
or/even/this/could/be/it/and/maybe/more
short/more
In any case, the expected output should be the whole string minus the last "/more".
Note: The word "more" will not be a constant; these are arbitrary examples.
Example input:
this/that/and/more
this/that/maybe/more/and/more
Expected output:
this/that/and
this/that/maybe/more/and
What I know works for a string whose length you know would be
cut -d'/' -f[x]
What I need, I assume, is a '/'-delimited awk command like:
awk '{$NF=""; print $0}'
With awk as requested:
$ awk '{sub("/[^/]*$","")} 1' file
this/that/maybe/more/and
or/even/this/could/be/it/and/maybe
short
but this is the type of job sed is best suited for:
$ sed 's:/[^/]*$::' file
this/that/maybe/more/and
or/even/this/could/be/it/and/maybe
short
The above were run against this input file:
$ cat file
this/that/maybe/more/and/more
or/even/this/could/be/it/and/maybe/more
short/more
Depending on how you have the input in your script, bash's Shell Parameter Expansion may be convenient:
$ s1=this/that/maybe/more/and/more
$ s2=or/even/this/could/be/it/and/maybe/more
$ s3=short/more
$ echo ${s1%/*}
this/that/maybe/more/and
$ echo ${s2%/*}
or/even/this/could/be/it/and/maybe
$ echo ${s3%/*}
short
(Lots of additional info on parameter expansion at https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html)
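Inside a script, the same expansion could be wrapped as below. This is a minimal sketch with a made-up function name, strip_last; it quotes the expansion and uses printf so that spaces or glob characters in the input are not altered by the shell.

```shell
# strip_last STRING   (hypothetical helper name)
# Removes the shortest trailing match of /*, i.e. the last "/"-separated field.
strip_last() {
    printf '%s\n' "${1%/*}"
}
```

A string without any slash is printed unchanged, since the pattern /* does not match.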
In your script, you could write a loop that, on each iteration, removes the last character of the input string if it is not a slash. When the loop finds a slash character, exit the loop and then remove the final character (which should be the slash).
Pseudo-code:
while (lastCharacter != '/') {
removeLastCharacter();
}
removeLastCharacter(); # removes the slash
(Sorry, it's been a while since I wrote a bash script.)
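One way the pseudo-code above might be written in POSIX shell is sketched here. The function name strip_last_loop is hypothetical, and the extra emptiness check is an addition so that the loop also terminates on input without any slash.

```shell
# strip_last_loop STRING   (hypothetical helper name)
strip_last_loop() {
    s=$1
    # drop trailing characters until the last one is a slash (or nothing is left)
    while [ -n "$s" ] && [ "${s%/}" = "$s" ]; do
        s=${s%?}
    done
    s=${s%/}    # removes the slash
    printf '%s\n' "$s"
}
```

Note that input without a slash ends up empty here, whereas the sed and parameter-expansion answers leave such input unchanged.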
Another awk alternative using fields instead of regexes:
awk -F/ '{printf "%s", $1; for (i=2; i<NF; i++) printf "/%s", $i; printf "\n"}'
Here is an alternative shell solution:
while read -r path; do dirname "$path"; done < file

awk - Search pattern through variable

We have written a shell script that matches multiple file names against a search pattern.
File format:
<number>_<20180809>.txt
starting with a single number and ending with an 8-digit number
Command:
awk -v string="12_1234" -v serch="^[0-9]+_+[0-9][0-9][0-9][0-9]$" "BEGIN{ if (string ~/serch$/) print string }"
If the string matches, it should be returned.
You can just change your command in the following way and it will work:
awk -v string='12_1234' -v search='^[0-9]+_+[0-9][0-9][0-9][0-9]$' 'BEGIN{ if (string ~ search) print string }'
12_1234
You do not need the /.../ syntax for a regex when you use the ~ operator with a variable (inside /.../ the name serch is taken as literal text, not as the variable), and you also had one extra $. You were really close!
Then you must adapt the search regex to ^[0-9]_[0-9]{8}$ to match your <number>_<20180809> pattern exactly.
Also, if you are just extracting this information from the file, you can use grep:
$ awk -v string='1_12345678' -v search='^[0-9]_[0-9]{8}$' 'BEGIN{ if (string ~ search) print string }'
1_12345678
$ (search='^[0-9]_[0-9]{8}$'; echo '1_12345678')| grep -oE "$search"
1_12345678
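Interval expressions such as {8} are not supported by every awk (some older mawk builds, for example, lack them), so a more portable sketch spells the eight digits out. The function name matches_name is hypothetical; the regexp is still held in a string variable and applied with ~.

```shell
# matches_name STRING   (hypothetical helper name)
# Prints STRING if it is one digit, an underscore, then exactly eight digits.
matches_name() {
    awk -v string="$1" \
        -v search='^[0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$' \
        'BEGIN { if (string ~ search) print string }'
}
```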

AWK that reads up to the /

I have the following lines of text:
170311 005201 0433 DE(N) itemhandling itemAddBarCodeData: Barcode(1/1) <0157357069/OK> ##[ti=7672,
170311 005323 0433 DE(N) itemhandling itemAddBarCodeData: Barcode(1/1) </NOREAD> ##[ti=7672,
I have the following script:
grep "itemAddBarCodeData" %myItemHandling% | gawk -F "[<>]+" -v OFS=, "{for(i=1;i<=NF;++i){if($i~/Barcode/){print substr($1,5,2)substr($1,3,2)substr($1,1,2),substr($1,8,6),$(i+1)}}}" > %myOutputPath%%myFilename%
What I need is a script that reads only the /NOREAD and the /OK, so that the output looks like:
11-03-17,00:52:01,NOREAD
11-03-17,00:53:23,OK
any help would be greatly appreciated
Thanks
Complex gawk approach:
awk -F"[ />]" '{patsplit($1, a, /[0-9]{2}/); patsplit($2, b, /[0-9]{2}/);
printf("%s-%s-%s,%s:%s:%s,%s\n",a[3],a[2],a[1],b[1],b[2],b[3],$10)}' inputfile
The output:
11-03-17,00:52:01,OK
11-03-17,00:53:23,NOREAD
-F"[ />]" - "composite" field separator
patsplit(string, array [, fieldpat [, seps ] ])
Divide string into pieces defined by fieldpat and store the pieces in array and
the separator strings in the seps array.
You can use this following script:
script.awk
/\/[A-Z]+>/ { match($1"-"$2,/(..)(..)(..)-(..)(..)(..)/,ts)
dt=mktime( sprintf("20%s %s %s %s %s %s",
ts[1], ts[2], ts[3],
ts[4], ts[5], ts[6]) )
dtd = strftime( "%d-%m-%y", dt )
dts = strftime( "%H:%M:%S", dt )
match ( $0, /\/[A-Z]+>/) # set RSTART and RLENGTH
print dtd, dts, substr( $0, RSTART+1, RLENGTH-2)
}
Run it like this: awk -v OFS=, -f script.awk yourfile
The important part is the second match function call, which matches a string of capital letters ([A-Z]+) preceded by a / and followed by a >.
It should match the OK and NOREAD cases and not Barcode(1/1).
The variables RSTART and RLENGTH are set by the match function; we have to correct them by +1 and -2 because the matched RE includes the / and the >.
The first match, mktime, strftime and the sprintf function call are another way to format the date and time. The time functions are GNU AWK extensions.
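For awks without the GNU time functions, the same RSTART/RLENGTH idea can be combined with plain substr() calls for the date and time. The following is only a sketch: the function name barcode_status is made up, and it assumes the date and time are always the first two fields in YYMMDD and HHMMSS form, as in the sample lines.

```shell
# barcode_status [FILE...]   (hypothetical helper name; reads stdin if no file is given)
barcode_status() {
    awk '/itemAddBarCodeData/ && match($0, /\/[A-Z]+>/) {
        # RSTART points at the /, RLENGTH covers / and >, hence +1 and -2
        status = substr($0, RSTART + 1, RLENGTH - 2)
        printf "%s-%s-%s,%s:%s:%s,%s\n",
            substr($1, 5, 2), substr($1, 3, 2), substr($1, 1, 2),
            substr($2, 1, 2), substr($2, 3, 2), substr($2, 5, 2), status
    }' "$@"
}
```

It uses only the two-argument form of match(), so it does not depend on gawk's array-capturing third argument.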
Regular awk version:
awk '
{
d=$1$2
gsub(/../,"& ",d)
split(d,T)
split($8,R,"[/>]")
printf "%s-%s-%s,%s:%s:%s,%s\n",T[3],T[2],T[1],T[4],T[5],T[6],R[2]
}
' file
With script in file:
script.awk:
{
d=$1$2
gsub(/../,"& ",d)
split(d,T)
split($8,R,"[/>]")
printf "%s-%s-%s,%s:%s:%s,%s\n",T[3],T[2],T[1],T[4],T[5],T[6],R[2]
}
awk -f script.awk file
Crammed onto one line:
awk '{d=$1$2; gsub(/../,"& ",d); split(d,T); split($8,R,"[/>]"); printf "%s-%s-%s,%s:%s:%s,%s\n",T[3],T[2],T[1],T[4],T[5],T[6],R[2]}' file
You don't need grep when you're using awk. With GNU awk for gensub():
$ awk '/itemAddBarCodeData/{print gensub(/(..)(..)(..) (..)(..)(..).*\/([^>]+).*/,"\\3-\\2-\\1,\\4:\\5:\\6,\\7",1)}' file
11-03-17,00:52:01,OK
11-03-17,00:53:23,NOREAD
Here's a pragmatic combination of awk and sed that is conceptually relatively simple:
On Linux and BSD/macOS:
awk -F'[ />]' -v OFS=, '/itemAddBarCodeData/ {print $1, $2, $10}' file |
sed -E 's/^(..)(..)(..),(..)(..)(..)/\3-\2-\1,\4:\5:\6/'
On a Windows system, invoked from cmd.exe, different quoting and line continuation rules apply (assumes the presence of ported GNU utilities):
awk -F"[ />]" -v OFS=, "/itemAddBarCodeData/ {print $1, $2, $10}" file ^
| sed -E "s/^(..)(..)(..),(..)(..)(..)/\3-\2-\1,\4:\5:\6/"
Note how:
"..." strings rather than '...' strings must be used to protect the embedded content from interpretation by the shell
Unlike with "..." on Unix, $ has no special meaning to cmd.exe, so it can be used as-is.
^ as the very last character on a line serves as the explicit line-continuation character, and the line must be broken before the | (whereas on Unix a line ending in | is implicitly continued).
This is only used for readability here; of course, you can place your command on a single line.