changing locale (LC_ALL) for sprintf inside awk

I want to print integer values in the range of 129 to 255 to a string using sprintf("%c") and have a problem with the following statement mentioned in the "GNU Awk User's Guide":
NOTE: The POSIX standard says the first character of a string is
printed. In locales with multibyte characters, gawk attempts to
convert the leading bytes of the string into a valid wide character
and then to print the multibyte encoding of that character. Similarly,
when printing a numeric value, gawk allows the value to be within the
numeric range of values that can be held in a wide character. If the
conversion to multibyte encoding fails, gawk uses the low eight bits
of the value as the character to print.
This leads to the following output:
[:~]$ gawk 'BEGIN {retString = sprintf("%c%c%c", 129, 130, 131); print retString}' | od -x
0000000 81c2 82c2 83c2 000a
In front of every byte (0x81, 0x82, 0x83) an extra byte (0xc2) is added. I can avoid this by setting LC_ALL to C:
[:~]$ LC_ALL=C gawk 'BEGIN {retString = sprintf("%c%c%c", 129, 130, 131); print retString}' | od -x
0000000 8281 0a83
The question is now: How can I change the locale within awk without setting LC_ALL outside the awk script? I want to use this script on multiple systems and don't want that the output depends on the default locale settings.
Or is there another way to achieve the same result without the sprintf() call?

I think the simplest way is to create a wrapper script
$ cat cawk
LC_ALL=C gawk "$@"
and make it executable
$ chmod +x cawk
It works just like gawk
$ ./cawk -v a=42 'BEGIN {print a}'
42

Using AWK to print 2 columns in reverse order with comma as delimiter [duplicate]

The intent of this question is to provide an answer to the daily questions whose answer is "you have DOS line endings" so we can simply close them as duplicates of this one without repeating the same answers ad nauseam.
NOTE: This is NOT a duplicate of any existing question. The intent of this Q&A is not just to provide a "run this tool" answer but also to explain the issue such that we can just point anyone with a related question here and they will find a clear explanation of why they were pointed here as well as the tool to run to solve their problem. I spent hours reading all of the existing Q&A and they are all lacking in the explanation of the issue, alternative tools that can be used to solve it, and/or the pros/cons/caveats of the possible solutions. Also some of them have accepted answers that are just plain dangerous and should never be used.
Now back to the typical question that would result in a referral here:
I have a file containing 1 line:
what isgoingon
and when I print it using this awk script to reverse the order of the fields:
awk '{print $2, $1}' file
instead of seeing the output I expect:
isgoingon what
I get the field that should be at the end of the line appearing at the start of the line, overwriting some text at the start of the line:
whatngon
or I get the output split onto 2 lines:
isgoingon
what
What could the problem be and how do I fix it?
The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF and you are running a UNIX tool on it so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file while LF is \n and appears as $ with cat -vE.
So your input file wasn't really just:
what isgoingon
it was actually:
what isgoingon\r\n
as you can see with cat -v:
$ cat -vE file
what isgoingon^M$
and od -c:
$ od -c file
0000000   w   h   a   t       i   s   g   o   i   n   g   o   n  \r  \n
0000020
so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:
<what> <isgoingon\r>
Note the \r at the end of the second field. \r means Carriage Return which is literally an instruction to return the cursor to the start of the line so when you do:
print $2, $1
awk will print isgoingon and then will return the cursor to the start of the line before printing what which is why the what appears to overwrite the start of isgoingon.
To fix the problem, do either of these:
dos2unix file
sed 's/\r$//' file
awk '{sub(/\r$/,"")}1' file
perl -pe 's/\r$//' file
Apparently dos2unix is aka frodos in some UNIX variants (e.g. Ubuntu).
Be careful if you decide to use tr -d '\r' as is often suggested as that will delete all \rs in your file, not just those at the end of each line.
Note that GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:
gawk -v RS='\r\n' '...' file
but other awks will not allow that as POSIX only requires awks to support a single character RS and most other awks will quietly truncate RS='\r\n' to RS='\r'. You may need to add -v BINMODE=3 for gawk to even see the \rs though as the underlying C primitives will strip them on some platforms, e.g. cygwin.
One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:
"field1","field2.1
field2.2","field3"
is really:
"field1","field2.1\nfield2.2","field3"\r\n
so if you just convert \r\ns to \ns then you can no longer tell linefeeds within fields from linefeeds as line endings. If you want to do that, I recommend converting all of the intra-field linefeeds to something else first; e.g. this would convert all intra-field LFs to tabs and convert all line-ending CRLFs to LFs:
gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file
Doing similar without GNU awk is left as an exercise, but with other awks it involves combining lines that do not end in CR as they're read.
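That exercise might be sketched like this (assuming the file ends in CRLF and that, as above, intra-field LFs should become tabs):

```shell
# Sketch for non-GNU awks: accumulate physical lines until one ends in
# CR. A bare LF read before that point must have been an intra-field
# linefeed, so replace it with a tab; then strip the line-ending CR and
# print the logical record.
awk '
{
    buf = buf $0
    if (buf ~ /\r$/) {
        sub(/\r$/, "", buf)   # strip the line-ending CR
        print buf             # emit one logical (CRLF-terminated) record
        buf = ""
    } else {
        buf = buf "\t"        # the consumed intra-field LF becomes a tab
    }
}
' file
```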
Also note that though CR is part of the [[:space:]] POSIX character class, it is not one of the whitespace characters included as separating fields when the default FS of " " is used, whose whitespace characters are only tab, blank, and newline. This can lead to confusing results if your input can have blanks before CRLF:
$ printf 'x y \n'
x y
$ printf 'x y \n' | awk '{print $NF}'
y
$
$ printf 'x y \r\n'
x y
$ printf 'x y \r\n' | awk '{print $NF}'
$
That's because field-separator whitespace is ignored at the beginning and end of a line with LF line endings, but with CRLF line endings the \r itself becomes the final field if the character before it was whitespace:
$ printf 'x y \r\n' | awk '{print $NF}' | cat -Ev
^M$
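Stripping the CR before doing anything else (a minimal sketch) restores the expected behavior:

```shell
# Removing the line-ending CR before printing makes $NF the last real
# field again, instead of an invisible \r.
printf 'x y \r\n' | awk '{sub(/\r$/,"")} {print $NF}'
# prints "y"
```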
You can use the \R shorthand character class in PCRE for files with unknown line endings. There are even more line endings to consider with Unicode or other platforms, and \R is the character class recommended by the Unicode consortium to represent all forms of a generic newline.
So if you have an 'extra' line ending, the regex s/\R$/\n/ will find and remove it, normalizing any combination of line endings into \n. Alternatively, you can use s/\R/\n/g to capture any notion of 'line ending' and standardize it into a \n character.
Given:
$ printf "what\risgoingon\r\n" > file
$ od -c file
0000000 w h a t \r i s g o i n g o n \r \n
0000020
Perl and Ruby and most flavors of PCRE implement \R combined with the end of string assertion $ (end of line in multi-line mode):
$ perl -pe 's/\R$/\n/' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
$ ruby -pe '$_.sub!(/\R$/,"\n")' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
(Note the \r between the two words is correctly left alone)
If you do not have \R you can use the equivalent of (?>\r\n|\v) in PCRE.
With straight POSIX tools, your best bet is likely awk like so:
$ awk '{sub(/\r$/,"")} 1' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
Things that kinda work (but know their limitations):
tr deletes all \rs, even ones used in another context (granted, such use of \r is rare, and XML processing requires that \r be deleted, so tr is a great solution):
$ tr -d "\r" < file | od -c
0000000 w h a t i s g o i n g o n \n
0000016
GNU sed works, but not POSIX sed, since \r and \x0D are not supported in POSIX sed.
GNU sed only:
$ sed 's/\x0D//' file | od -c # also sed 's/\r//'
0000000 w h a t \r i s g o i n g o n \n
0000017
The Unicode Regular Expressions guide is probably the definitive treatment of what a "newline" is.
Run dos2unix. While you can manipulate the line endings with code you wrote yourself, there are utilities which exist in the Linux / Unix world which already do this for you.
On a Fedora system, dnf install dos2unix will put the dos2unix tool in place (should it not already be installed).
There is a similar dos2unix deb package available for Debian based systems.
From a programming point of view, the conversion is simple. Search all the characters in a file for the sequence \r\n and replace it with \n.
This means there are dozens of ways to convert from DOS to Unix using nearly every tool imaginable. One simple way is to use the command tr where you simply replace \r with nothing!
tr -d '\r' < infile > outfile

Attempting to pass an escape char to awk as a variable

I am using this command;
awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
which warns me;
awk: warning: escape sequence `\(' treated as plain `('
which prints;
new[[:blank:]]+File(
I would like the value to be;
new[[:blank:]]+File\(
I've tried amending the command to account for escape chars but always get the same warning
When you run:
$ awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
awk: warning: escape sequence `\(' treated as plain `('
Regex1 = new[[:blank:]]+File(
you're in shell assigning a string to an awk variable. When you use -v you're asking awk to interpret escape sequences in such an assignment, so that \t can become a literal tab char, \n a newline, etc., but the ( in your string has no special meaning when escaped, so \( is exactly the same as (, hence the warning message.
If you want to get a literal \ character you'd need to escape it so that \\ gets interpreted as just \:
$ awk -v regex1='new[[:blank:]]+File\\(' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(
You seem to be trying to pass a regexp to awk and in my opinion once you get to needing 2 escapes your code is clearer and simpler if you put the target character into a bracket expression instead:
$ awk -v regex1='new[[:blank:]]+File[(]' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File[(]
If you want to assign an awk variable the value of a literal string with no interpretation of escape sequences then there are other ways of doing so without using -v, see How do I use shell variables in an awk script?.
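For instance (a sketch of one such approach; the variable name is arbitrary), passing the value through the environment and reading ENVIRON avoids -v's escape processing entirely:

```shell
# The environment delivers the string byte-for-byte, and ENVIRON[] reads
# it with no escape-sequence interpretation, so no doubling is needed.
regex1='new[[:blank:]]+File\(' awk 'BEGIN {
    r = ENVIRON["regex1"]
    print "Regex1 =", r
}'
# prints: Regex1 = new[[:blank:]]+File\(
```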
If you use gnu awk then you can use a regexp literal with #/.../ format and avoid double escaping:
awk -v regex1='#/new[[:blank:]]+File\(/' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(
I think gawk and mawk 1/2 are also okay with the hideous but fool-proof octal method, like
-v regex1="new[[:blank:]]+File[\\050]" # note the double quotes
Once the engine takes out the first \\ layer, the regex being tested against is equivalent to
/new[[:blank:]]+File[\050]/
which is as safe as it gets. The reason this matters is that something like
/new[[:blank:]]+File[\(]/
is something mawk/mawk2 are totally cool with, but gawk will give an annoying warning message. Octals (or [\x28]) get rid of that cross-awk weirdness and allow the same custom string regex to be deployed across all 3
(I haven't tested against less popular variants like BWK original awk or NAWK etc.).
PS: since I'm on the subject of octal caveats, mawk/mawk2 and gawk in binary mode are cool with square-bracket octals for all bytes, meaning
"[\\302-\\364][\\200-\\277]+" # this happens to be a *very* rough proxy for UTF-8
is valid for all 3. If you really want to be the hex guy, that same regex becomes
"[\\xC2-\\xF4][\\x80-\\xBF]+"
However, gawk in Unicode mode will scream about locale whenever you attempt to put squares around any non-ASCII byte. To circumvent that, you'll have to just list them out with a bunch of ORs, like:
(\302|\303|\304.....|\364)(\200|\201......|\277)+
This way you can get gawk in Unicode mode to handle any arbitrary byte, handle binary input data (whatever the circumstances that caused it), and perform full base64 or URI-plus encoding/decoding from within (plus anything else you want, like SHA256 or LZMA etc.). So far I've even managed to get gawk in Unicode mode to base64-encode an MP4 file input without gawk spitting out the "illegal multi byte" error message,
and also to get gawk and mawk in binary modes to become mostly UTF-8 aware and safe.
The "mostly" caveat being that I haven't implemented minute details like directly doing normalization-form conversions from within (instead of dumping out to python3 and getting results back via getline), or keeping modifier linguistic marks with their intended character when doing a UC-safe-substring string-reversal.

Awk: gsub("\\\\", "\\\\") yields surprising results

Consider the following input:
$ cat a
d:\
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ cat a_double.awk
{ sub("\\\\", "\\\\"); print $0 }
Now running cat a | awk -f a.awk gives
d:\
and running cat a | awk -f a_double.awk gives
d:\\
and I expect exactly the other way around. How should I interpret this?
$ awk -V
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Yes, this is expected behavior of awk. When you run sub("\\", "\\\\") in your first script, inside sub's "..." (double quotes), since we are NOT using /.../ for the matching pattern, we need to escape \ (the actual literal character), and since that escaping is itself done with \ we need to escape it as well, hence matching a single literal \ becomes \\\\:
\\ \\
the first 2 chars denote the escaping, the next 2 chars denote the actual literal character \
This is NOT happening in your 1st case, hence NO match and no substitution; in your 2nd awk script you are doing it (escaping in the regex-matching section of sub), hence it matches \ perfectly.
Let's see this by example, putting ... into the replacement for checking purposes.
When nothing happens (since there is no match):
awk '{sub("\\", "....\\\\"); print $0}' Input_file
d:\
When pattern matching happens:
awk '{sub("\\\\", "...\\\\"); print $0}' Input_file
d:...\\
From man awk:
gsub(r, s [, t])
For each substring matching the regular expression r in the string t,
substitute the string s, and return the number of substitutions.
How could we perform the actual escaping where we need only one \ before the character? Put your regexp in /../ in the first argument of sub, as below; we need NOT double-escape \ there.
awk '{sub(/\\/,"&\\")} 1' Input_file
The first arg to *sub() is a regexp, not a string, so you should use regexp (/.../) rather than string ("...") delimiters. The former is a literal regexp which is used as-is while the latter defines a dynamic (or computed) regexp which forces awk to interpret the string twice, the first time to convert the string to a regexp and the second to use it as a regexp, hence double the backslashes needed for escaping. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps.
In the following we just need to escape the backslash once because we're using a literal, rather than dynamic, regexp:
$ cat a
d:\
$ awk '{sub(/\\/,"\\\\")}1' a
d:\\
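The escape counts can be checked side by side (a sketch): matching one literal backslash takes two backslashes in a /.../ literal regexp but four in a "..." dynamic regexp:

```shell
# Both substitutions find exactly one match on the input line "d:\",
# so each command prints 1 (gsub returns the number of substitutions).
printf 'd:\\\n' | awk '{ print gsub(/\\/,   "x") }'   # literal regexp: 2 backslashes
printf 'd:\\\n' | awk '{ print gsub("\\\\", "y") }'   # dynamic regexp: 4 backslashes
```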
Your first script would rightfully produce a syntax error in a more recent version of gawk (5.1.0) since "\\" in a dynamic regexp is equivalent to /\/ in a literal one and in that expression the backslash is escaping the final forward slash which means there is no final delimiter:
$ cat a.awk
{ sub("\\", "\\\\"); print $0 }
$ awk -f a.awk a
awk: a.awk:1: (FILENAME=a FNR=1) fatal: invalid regexp: Trailing backslash: /\/

Keep first 3 characters of every word containing a character

I have a large text file with lines like:
01 81118 9164.47 0/0:6,0:6:18:.:.:0,18,172:. 0/0:2,0:2:6:.:.:0,6,74:. 0/1:4,5:9:81:.:.:148,0,81:.
What I need is to keep just the first three characters of all the columns containing a colon, i.e.
01 81118 9164.47 0/0 0/0 0/1
Where the number of chars after the first 3 can vary. I started here by removing everything after a colon, but that removes the entire rest of the line, rather than per word:
sed 's/:.*//g' file.txt
Alternately, I've been trying to bring in the word boundary (\b) and hack away at removing everything after colons several times:
sed 's/\b:[^ ]//g' file.txt | sed 's/\b:[^ ]//g'
But this is not a good way to go about it. What's the best approach?
Using a sed that has -E to enable EREs (e.g. GNU or BSD/OSX sed):
$ sed -E 's/([^[:space:]]{3}):[^[:space:]]+/\1/g' file
01 81118 9164.47 0/0 0/0 0/1
With a POSIX sed:
$ sed 's/\([^[:space:]]\{3\}\):[^[:space:]]\{1,\}/\1/g' file
01 81118 9164.47 0/0 0/0 0/1
The above will work regardless of whether the spaces in your input are blanks or tabs or both.
Using awk: print only the first 3 characters of any field containing a colon, and print the rest as-is.
awk '{ for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file
substr() is a standard awk string function (documented among the GNU awk string functions).
The 1 at the end of the script is a true condition whose default action, {print}, prints the whole line.
Regarding output format, if input is tab separated and you want to keep the tabs, you can run:
awk 'BEGIN{OFS=FS="\t"} { for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file
or another idea is to pretty-print with column -t (does not insert real \t but appropriate number of spaces between fields)
awk '{ for (i=1;i<=NF;i++) if ($i ~/:/) $i=substr($i,1,3) } 1' file |column -t
If, as in your example, the colon is not part of the string which should be preserved, try
sed 's/\(\(^\| \)[^ :][^ :][^ :]\)[^ :]*:[^ ]*/\1/g' file
The literal spaces in the character classes may need to be augmented with tabs and possibly other whitespace characters.
(The regex could be prettier if your sed supports extended regex with -E or -r or some such nonstandard option; but this ugly sucker should be portable most anywhere.)
Using GNU sed with regular expression extensions, a one-liner could be:
sed -E 's/(\S{3})\S*:\S*/\1/g' file
\S matches non-whitespace characters (a GNU extension).
This might work for you (GNU sed):
sed -E 's/\S*:/\n&/g;s/\n(\S{3})\S*/\1/g;s/\n//g' file
Prepend a newline to any non-whitespaced strings which contains a :.
If these strings contain at least 3 non-whitespaced characters, remove all but the first 3 characters.
Clean up any strings with :'s which were not 3 non-whitespaced characters in length.
Optional: set _ = "[[:space:]]*" if you want to use the formal POSIX regex class.
echo "${input}" |
mawk 'BEGIN { __ = OFS ="\f\r\t"
FS = "^"(_ = "[ \t]*")"|(:"(_)")?"(_)
_ = sub("[(]..", "&^", FS) } $_ = __$_'
01
81118
9164.47
0/0
0/0
0/1
tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macos nawk
The ultra-brute-force method would be:
mawk NF=NF FS='(:[^ \t]*)?[ \t]*' OFS='\t'
01 81118 9164.47 0/0 0/0 0/1
to handle leading/trailing edge spaces+tabs in brute-force approach:
gawk NF=NF FS='(:[^ \t]*)?[ \t]*' OFS='\t' | column -t
01 81118 9164.47 0/0 0/0 0/1

Awk tolower a string that starts with an accent - support for foreign characters

I have a file with this string in a line: "Ávila"
And I want to get this output: "ávila".
The problem is that awk's tolower function only works when the string does not start with an accent, and I must use awk.
For example, if I do awk 'BEGIN { print tolower("Ávila") }' then I get "Ávila" instead of "ávila", that is what I expect.
But if I do awk 'BEGIN { print tolower("Castellón") }' then I get "castellón"
For a given awk implementation to work properly with non-ASCII characters (foreign letters), it must respect the active locale's character encoding, as reflected in the (effective) LC_CTYPE setting (run locale to see it).
These days, most locales use UTF-8 encoding, a multi-byte-on-demand encoding that is single-byte in the ASCII range, and uses 2 to 4 bytes to represent all other Unicode characters.
Thus, for a given awk implementation to recognize non-ASCII (accented, foreign) letters, it must be able to recognize multiple bytes as a single character.
Among the major awk implementations,
GNU Awk (gawk), the default on some Linux distros
BSD awk, as also used on OS X
Mawk (mawk), the default on Debian-based Linux distros such as Ubuntu
only GNU Awk properly handles UTF8-encoded characters (and presumably any other encoding if specified in the locale):
$ echo ÁvilA | gawk '{print tolower($0)}'
ávila # both Á and A lowercased
Conversely, if you expressly want to limit character processing to ASCII only, prepend LC_CTYPE=C:
$ echo ÁvilA | LC_CTYPE=C gawk '{print tolower($0)}'
Ávila # only ASCII char. A lowercased
Practical advice:
To determine what implementation your default awk is, run awk --version.
In the case of Mawk you'll get an error message, because it only supports printing version information with -W version, but that error message will contain the word mawk.
If possible, install and use GNU Awk (and optionally make it the default awk); it is available for most Unix-like platforms; e.g.:
On Debian-based platforms such as Ubuntu: sudo apt-get install gawk
On OS X, using Homebrew: brew install gawk.
If you must use either BSD Awk or Mawk, use the above LC_CTYPE=C approach to ensure that the multi-byte UTF-8 characters are at least passed through without modification[1], but foreign letters will NOT be recognized as letters (and thus won't be lowercased, in this case).
[1] BSD Awk and Mawk on OS X (the latter curiously not on Linux) treat UTF-8-encoded characters as follows:
Each byte is mistakenly interpreted as its own character.
If, after ignoring the high bit, the resulting byte value falls into the range of ASCII uppercase letters, 32 is added to the original byte value to obtain the lowercase counterpart.
In the case at hand, this means:
Á is Unicode codepoint U+00C1, whose UTF-8 encoding is the 2-byte sequence: 0xC3 0x81.
0xC3: Dropping the high bit (0xC3 & 0x7F) yields 0x43, which is interpreted as ASCII letter C, and 32 (0x20) is therefore added to the original value, yielding 0xE3 (0xC3 + 0x20).
0x81: Dropping the high bit (0x81 & 0x7F) yields 0x1, which is not in the range of ASCII uppercase letters (65-90, 0x41-0x5a), so the byte is left as-is.
Effectively, the first byte is modified from 0xC3 to 0xE3, while the 2nd byte is left untouched; since 0xE3 0x81 is not a properly UTF-8-encoded character, the terminal will print ? instead to signal that.
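The byte arithmetic above can be sketched in awk itself (195 and 129 are just 0xC3 and 0x81 in decimal):

```shell
# Reproduce the per-byte mangling described above: drop the high bit,
# and add 32 only when the remaining 7-bit value lands in ASCII A-Z.
awk 'BEGIN {
    n = split("195 129", bytes, " ")     # the two UTF-8 bytes of Á
    for (i = 1; i <= n; i++) {
        b = bytes[i]
        low7 = b % 128                   # "ignore the high bit"
        if (low7 >= 65 && low7 <= 90)    # looks like an ASCII uppercase letter?
            b += 32                      # byte-wise "lowercasing"
        printf "0x%02X -> 0x%02X\n", bytes[i], b
    }
}'
# 0xC3 -> 0xE3   (mangled)
# 0x81 -> 0x81   (untouched)
```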
I tried to comment on your reply, which is correct, but I need to be able to format what I'm adding, otherwise, it becomes gibberish.
Super useful and I would like to add the following for the ones that have problems with uppercase :
bash-3.2$ echo "TOMÀS VICENÇ ROMÀ" |LC_CTYPE=C gawk '{ print tolower($0)}'
tomÀs vicenÇ romÀ
bash-3.2$ echo "TOMÀS VICENÇ ROMÀ" |LC_CTYPE=C gawk '{ print $0}'|tr '[:upper:]' '[:lower:]'
tomàs vicenç romà