awk statement within sed - awk

I have multiple occurrences of the pattern:
)0.[0-9][0-9][0-9]:
where [0-9] is any digit, in various text context but the pattern is unique as this regex. And I need to turn the decimal fraction into integer (percent values from 0 to 99).
A small example substring would be
=1:0.00055)0.944:0.02762)0.760:0 to turn into
=1:0.00055)94:0.02762)76:0
What Iā€™m doing is :
cat file | sed -e "s/)\([0-9].[0-9][0-9][0-9]\):/)`echo "\1"|awk '{ r=int(100*$0); if((r>=0)&&(r<=100)){ print r; } else { print "error"; exit(-1); } }'`:/g"
but the output is )0:
where is the fault?...

Since you asked 'where is the fault' and not 'how to solve the problem':
Your backquoted pipeline echo ...|awk ... is executed FIRST, producing a single 0 which is then made part of the s/// command passed to sed and thus substituted everywhere the pattern matches. PS: using the newer (post-Reagan) and more flexible notation for command substitution $( ... ) instead of backquotes is preferred in all shells except csh family, and especially on Stack where backquotes are special to markdown and troublesome to show in text.
If you want to actually solve the problem, which you didn't describe clearly or completely, some pointers toward a better direction:
Standard sed can't execute a command to generate replacement text; GNU sed can with flag e but you need to make the whole patternspace the command and fiddle anything else into holdspace, which is tedious. perl can evaluate an expression in the replacement for s, including arithmetic; awk (even gawk) can't do so directly, but you can get the same effect by doing the match and the replace/rebuild as separate steps, depending on the unspecified and unclear details of exactly what you want to do; if you want to keep the rest of the line unchanged, something like:
awk 'match($0,/)0[.][0-9][0-9][0-9]:/){ print substr($0,1,RSTART) (substr($0,RSTART+1,RLENGTH-2)*100) substr($0,RSTART+RLENGTH-1) }'
But you don't actually need arithmetic here if you're satisified with truncating. Just discard the leading 0. and the last digit and keep the two digits in between:
sed 's/)0[.]\([0-9][0-9]\)[0-9]:/)0.\1:/g'
Note . in regexp unless escaped or in a charclass (as I did) matches any character not just period, which may or may not be a problem since you didn't give the rest of your input.
And PS: negative numbers for process exit status don't work (except IIRC Plan 9). Use small (usually < 128) positive status values for errors; most common is to just use 1.

Check this perl one-liner command :
perl -pe 's/\)(\d+\.\d+):/sprintf ")%d:", $1 * 100/ge' file
Before :
=1:0.00055)0.944:0.02762)0.760:0
After :
=1:0.00055)94:0.02762)76:0
If you need to replace for real in editing mode, add -i switch :
perl -i -pe '...'

Related

sed: cut out a substring following various patterns

There is a list of identifiers I want to modify:
3300000526.a:P_A23_Liq_2_FmtDRAFT_1000944_2,
200254578.a:CR_10_Liq_3_inCRDRAFT_100545_11,
3300000110.a:BSg2DRAFT_c10006505_1,
3300000062.a:IMNBL1DRAFT_c0010786_1,
3300000558.a:Draft_10335283_1
I want to remove all starting from first . and first _ after DRAFT (case-insensitive), i.e.:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
I am using sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g' but it ignores the first _ after DRAFT and does this:
3300000526_2,
200254578_11,
3300000110_1,
3300000062_1,
3300000558_1
P.S.
There can be various identifiers and I tried to show a little portion on their variance here, but they all keep same pattern.
I'd be grateful for corrections!
You could easily do this in awk, could you please try following once. Based on shown samples only.
awk -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
OR with GNU awk for handling case-insensitive try:
awk -v IGNORECASE="1" -F'[.]|DRAFT_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
To handle case insensitive without ignorecase option try:
awk -F'[.]|[dD][rR][aA][fF][tT]_' '{$2="_";sub(/ +_ +/,"_")} 1' Input_file
Explanation: Simply setting field separator as . OR DRAFT_ as per OP's need. Then in main program setting 2nd field to _ and then substituting spaces underscore spaces with only _. Finally printing the current line by 1.
A workable solution
You can use:
sed 's/[.].*[dD][rR][aA][fF][tT]_/_/' data
You could also use \. in place of [.] but I'm allergic to unnecessary backslashes ā€” you might be too if you'd had to spend time fighting whether 8 or 16 consecutive backslashes was the correct way to write to document markup (using troff).
For your sample data, it produces:
3300000526_1000944_2,
200254578_100545_11,
3300000110_c10006505_1,
3300000062_c0010786_1,
3300000558_10335283_1
What went wrong
Your command is:
sed 's/.a.*[a-zA-Z0-9]DRAFT_.*[^_]_[a-zA-Z0-9]//g'
This matches:
any character (the leading .)
lower-case 'a'
any sequence of characters
an alphanumeric character
upper-case only DRAFT
underscore
any sequence of characters
underscore
an alphanumeric character
doing the match globally on each line
It then deletes all the matched material. You could rescue it by using:
sed 's/[.]a.*[a-zA-Z0-9]DRAFT\(_.*[^_]_[a-zA-Z0-9]\)/\1/'
This matches a dot rather than any character, and saves the material after DRAFT starting with the underscore (that's the \(ā€¦\)), replacing what was matched with what was saved (that's the \1). You can convert DRAFT to the case-insensitive pattern too, of course. However, the terms of the question refer to "from the first dot (.) up to (but not including) the underscore after a (case-insensitive) DRAFT", and detailing, saving, and replacing the material after the underscore is not necessary.
Laziness
I saved myself typing the elaborate case-insensitive string by using a program called mkpattern that (according to RCS) I created on 1990-08-23 (a while ago now). It's not rocket science. I use it every so often ā€” I've actually used it a number of times in the last week, searching for some elusive documents on a system at work.
$ mkpattern DRAFT
[dD][rR][aA][fF][tT]
$
You might have to do that longhand in future.
try something like
{mawk/mawk2/gawk} 'BEGIN { FS = "[\056].+DRAFT_"; OFS = ""; } (NF < 2) || ($1 = $1)'
It might not be the fastest but it's relatively clean approach. octal \056 is the period, and it's less ambiguous to reader when the next item is a ".+"
This might work for you (GNU sed):
sed -nE 's/DRAFT[^_]*/\n/i;s/\..*\n//p' file
First turn on the -n and -E to turn off implicit printing and make regexp more easily seen.
Since we want the first occurrence of DRAFT we can not use a regexp that uses the .* idiom as this is greedy and may pass over it if there are two or more such occurrences. Thus we replace the DRAFT by a unique character which cannot occur in the line. The newline can only be introduced by the programmer and is the best choice.
Now we can use the .* idiom as the introduced newline can only exist if the previous substitution has matched successfully.
N.B. The i flag in the first substitution allows for any upper/lower case rendition of the string DRAFT, also the second substitution includes the p flag to print the successful substitution.

Replace character except between pattern using grep -o or sed (or others)

In the following file I want to replace all the ; by , with the exception that, when there is a string (delimited with two "), it should not replace the ; inside it.
Example:
Input
A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;
5cc0723b9b69581f14f64285;5cc0723b9b69581f14f64294;2;"5cc0723b9b69581f14f64285;16a4fbe3855";xpto;
5cc072579b69581f14f6428a;5cc072579b69581f14f64299;3;"5cc072579b69581f14f6428a;16a4fbea632";xpto;
output
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
For sed I have: sed 's/;/,/g' input.txt > output.txt but this would replace everything.
The regex for the " delimited string: \".*;.*\" .
(A regex for hexadecimal would be better -- something like: [0-9a-fA-F]+)
My problem is combining it all to make a grep -o / sed that replaces everything except for that pattern.
The file size is in the order of two digit Gb (max 99Gb), so performance is important. Relevant.
Any ideas are appreciated.
sed is for doing simple s/old/new on individual strings. grep is for doing g/re/p. You're not trying to do either of those tasks so you shouldn't be considering either of those tools. That leaves the other standard UNIX tool for manipulating text - awk.
You have a ;-separated CSV that you want to make ,-separated. That's simply:
$ awk -v FPAT='[^;]*|"[^"]+"' -v OFS=',' '{$1=$1}1' file
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
The above uses GNU awk for FPAT. See What's the most robust way to efficiently parse CSV using awk? for more details on parsing CSVs with awk.
If I get correctly your requirements, one option would be to make a three pass thing.
From your comment about hex, I'll consider nothing like # will come in the input so you can do (using GNU sed) :
sed -E 's/("[^"]+);([^"]+")/\1#\2/g' original > transformed
sed -i 's/;/,/g' transformed
sed -i 's/#/;/g' transformed
The idea being to replace the ; when within quotes by something else and write it to a new file, then replace all ; by , and then set back the ; in place within the same file (-i flag of sed).
The three pass can be combined in a single command with
sed -E 's/("[^"]+);([^"]+")/\1#\2/g;s/;/,/g;s/#/;/g' original > transformed
That said, there's probably a bunch of csv parser witch already handle quoted fields that you can probably use in the final use case as I bet this is just an intermediary step for something else later in the chain.
From Ed Morton's comment: if you do it in one pass, you can use \n as replacement separator as there can't be a newline in the text considered line by line.
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/' file
Replace ;'s inside double quotes with newlines, transpose ;'s to ,'s and then transpose newlines to ;'s.

Arithmetic on gensub substitution in gawk

I wonder whether the following is possible:
echo -e "0#1 1#1 0#0\n0#0 1#1 0#1" | awk '{print gensub(/([01])#([01])/, "\\1" + "\\2", "g")}'
It doesn't work the way it is; is that because the evaluation of "+" happens before the substitutions of "\1" and "\2"?
As output, I would expect 1, the result of arithmetic on \1 and \2, so for \1=0 and \2=1, the output should be 1.
Also, as per answer below, I am not looking for a solution on how to add 1 and 0 in "1#0"; this is just an example, I just wondered whether it is possible to do arithmetic on \1 and \2, since this works:
gensub(/blah blah/, 0 + 1, "g") gives 1.
You can't use gensub() for this, because it returns the captured groups as literal strings as its result.
For such a trivial requirement use # as the field separator and do the arithmetic computation as
echo "0#1" | awk -F# '{print ($1 + $2)}'
Or if you are worried about string values in the input string, force the numeric conversion using int() casting, or just add +0 to each of the operands, i.e. use (int($1) + int($2)) or (($1+0) + ($2+0))
As per the updated question/comments in the answer below, doing constant numeric arithmetic is not something gensub() is intended for, which is supposed to do a regexp based pattern search and replacement. The replacement part on most cases involves dealing with the captured groups from the search string and apply some modifications over it.
I think I understand what you want, and you can do it in Perl using the e modifier on a substitution which means it evaluates the replacement. Here's an example:
echo "7#302" | perl -nle 's/(\d+)#(\d+)/$1+$2/e && print'
309
Or, slightly more fun:
echo "The 200#109 cats sat on the 7#302 mats" | perl -nle 's/(\d+)#(\d+)/$1+$2/ge && print'
The 309 cats sat on the 309 mats
You could use sed w/bc for calculating, in the manner Mark used perl:
echo "7#302" | sed -E 's/([0-9]+)#([0-9]+)/echo "\1+\2"|bc/e'
When you write foo(bar()), you'll find that bar() is executed first whether it's a function or any expression so gensub(..., "\\1" + "\\2", ...) calls gensub() using the result of adding the 2 strings which is 0, i.e. gensub(..., 0, ...).
This isn't semantically identical to the code you wrote but the approach to do what you want is to use the 3rd arg to match():
$ echo "0#1" | awk 'match($0,/([01])#([01])/,a){print a[1] + a[2]}'
1
The above uses GNU awk for that 3rd arg to match() but you were already using that for gensub() anyway. If it's not clear how to use that on your real data then post a followup question that includes an example of your real data.

text processing: sed to work backwards to delete until string

My AWK script generates 1 of the following 2 outputs depending on what text file it is being used on.
49 1146.469387755102 mongodb 192.168.0.8:27017 -p mongodb.database
1 1243.0 jdbc:mysql 192.168.0.8:3306/ycsb -p db.user
I need a way of deleting everything past the IP address, including the port number.
sed 's/:[^:]*//2g'
Works apart from the fact it deletes from left to right and as one of the outputs contains 2 : 's it stops and deletes everything after that. Is there a way of reversing sed to work from right to left?
Just to be clear, desired output of each would be:
49 1146.469387755102 mongodb 192.168.0.8
1 1243.0 jdbc:mysql 192.168.0.8
You could use the below sed command.
sed 's/:[0-9]\{4\}.*//' file
OR
sed 's/:[^:]*$//' file
[^:]* negated character class which matches any char but not of :, zero or more times. $ matches the end of the line boundary. So :[^:]*$ matches all the chars from the last colon upto the end. Replacing those matched chars with empty string will give you the desired output.
You can take advantage of the greedy nature of the Kleene *:
sed 's/\(.*\):.*/\1/' file
The .* consumes as much as it can, while still matching the pattern. The captured part of the line is used in the replacement.
Alternatively, using awk (thanks to glenn jackman for setting me straight):
awk -F: -v OFS=: 'NF{NF--}1' file
Set the input and output field separators to a colon remove the final field by decrementing NF. 1 is true so the default action {print} is performed. The NF condition prevents empty lines from causing an error, which may not be necessary in your case but does no harm.
Output either way:
49 1146.469387755102 mongodb 192.168.0.8
1 1243.0 jdbc:mysql 192.168.0.8

In awk, how can I use a file containing multiple format strings with printf?

I have a case where I want to use input from a file as the format for printf() in awk. My formatting works when I set it in a string within the code, but it doesn't work when I load it from input.
Here's a tiny example of the problem:
$ # putting the format in a variable works just fine:
$ echo "" | awk -vs="hello:\t%s\n\tfoo" '{printf(s "bar\n", "world");}'
hello: world
foobar
$ # But getting the format from an input file does not.
$ echo "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
hello:\tworld\n\tfoobar
$
So ... format substitutions work ("%s"), but not special characters like tab and newline. Any idea why this is happening? And is there a way to "do something" to input data to make it usable as a format string?
UPDATE #1:
As a further example, consider the following using bash heretext:
[me#here ~]$ awk -vs="hello: %s\nworld: %s\n" '{printf(s, "foo", "bar");}' <<<""
hello: foo
world: bar
[me#here ~]$ awk '{s=$0; printf(s, "foo", "bar");}' <<<"hello: %s\nworld: %s\n"
hello: foo\nworld: bar\n[me#here ~]$
As far as I can see, the same thing happens with multiple different awk interpreters, and I haven't been able to locate any documentation that explains why.
UPDATE #2:
The code I'm trying to replace currently looks something like this, with nested loops in shell. At present, awk is only being used for its printf, and could be replaced with a shell-based printf:
#!/bin/sh
while read -r fmtid fmt; do
while read cid name addy; do
awk -vfmt="$fmt" -vcid="$cid" -vname="$name" -vaddy="$addy" \
'BEGIN{printf(fmt,cid,name,addy)}' > /path/$fmtid/$cid
done < /path/to/sampledata
done < /path/to/fmtstrings
Example input would be:
## fmtstrings:
1 ID:%04d Name:%s\nAddress: %s\n\n
2 CustomerID:\t%-4d\t\tName: %s\n\t\t\t\tAddress: %s\n
3 Customer: %d / %s (%s)\n
## sampledata:
5 Companyname 123 Somewhere Street
12 Othercompany 234 Elsewhere
My hope was that I'd be able to construct something like this to do the entire thing with a single call to awk, instead of having nested loops in shell:
awk '
NR==FNR { fmts[$1]=$2; next; }
{
for(fmtid in fmts) {
outputfile=sprintf("/path/%d/%d", fmtid, custid);
printf(fmts[fmtid], $1, $2) > outputfile;
}
}
' /path/to/fmtstrings /path/to/sampledata
Obviously, this doesn't work, both because of the actual topic of this question and because I haven't yet figured out how to elegantly make awk join $2..$n into a single variable. (But that's the topic of a possible future question.)
FWIW, I'm using FreeBSD 9.2 with its built in, but I'm open to using gawk if a solution can be found with that.
Why so lengthy and complicated an example? This demonstrates the problem:
$ echo "" | awk '{s="a\t%s"; printf s"\n","b"}'
a b
$ echo "a\t%s" | awk '{s=$0; printf s"\n","b"}'
a\tb
In the first case, the string "a\t%s" is a string literal and so is interpreted twice - once when the script is read by awk and then again when it is executed, so the \t is expanded on the first pass and then at execution awk has a literal tab char in the formatting string.
In the second case awk still has the characters backslash and t in the formatting string - hence the different behavior.
You need something to interpret those escaped chars and one way to do that is to call the shell's printf and read the results (corrected per #EtanReiser's excellent observation that I was using double quotes where I should have had single quotes, implemented here by \047, to avoid shell expansion):
$ echo 'a\t%s' | awk '{"printf \047" $0 "\047 " "b" | getline s; print s}'
a b
If you don't need the result in a variable, you can just call system().
If you just wanted the escape chars expanded so you don't need to provide the %s args in the shell printf call, you'd just need to escape all the %s (watching out for already-escaped %s).
You could call awk instead of the shell printf if you prefer.
Note that this approach, while clumsy, is much safer than calling an eval which might just execute an input line like rm -rf /*.*!
With help from Arnold Robbins (the creator of gawk), and Manuel Collado (another noted awk expert), here is a script which will expand single-character escape sequences:
$ cat tst2.awk
function expandEscapes(old, segs, segNr, escs, idx, new) {
split(old,segs,/\\./,escs)
for (segNr=1; segNr in segs; segNr++) {
if ( idx = index( "abfnrtv", substr(escs[segNr],2,1) ) )
escs[segNr] = substr("\a\b\f\n\r\t\v", idx, 1)
new = new segs[segNr] escs[segNr]
}
return new
}
{
s = expandEscapes($0)
printf s, "foo", "bar"
}
.
$ awk -f tst2.awk <<<"hello: %s\nworld: %s\n"
hello: foo
world: bar
Alternatively, this shoudl be functionally equivalent but not gawk-specific:
function expandEscapes(tail, head, esc, idx) {
head = ""
while ( match(tail, /\\./) ) {
esc = substr( tail, RSTART + 1, 1 )
head = head substr( tail, 1, RSTART-1 )
tail = substr( tail, RSTART + 2 )
idx = index( "abfnrtv", esc )
if ( idx )
esc = substr( "\a\b\f\n\r\t\v", idx, 1 )
head = head esc
}
return (head tail)
}
If you care to, you can expand the concept to octal and hex escape sequences by changing the split() RE to
/\\(x[0-9a-fA-F]*|[0-7]{1,3}|.)/
and for a hex value after the \\:
c = sprintf("%c", strtonum("0x" rest_of_str))
and for an octal value:
c = sprintf("%c", strtonum("0" rest_of_str))
Since the question explicitly asks for an awk solution, here's one which works on all the awks I know of. It's a proof-of-concept; error handling is abysmal. I've tried to indicate places where that could be improved.
The key, as has been noted by various commentators, is that awk's printf -- like the C standard function it is based on -- does not interpret backslash-escapes in the format string. However, awk does interpret them in command-line assignment arguments.
awk 'BEGIN {if(ARGC!=3)exit(1);
fn=ARGV[2];ARGC=2}
NR==FNR{ARGV[ARGC++]="fmt="substr($0,length($1)+2);
ARGV[ARGC++]="fmtid="$1;
ARGV[ARGC++]=fn;
next}
{match($0,/^ *[^ ]+[ ]+[^ ]+[ ]+/);
printf fmt,$1,$2,substr($0,RLENGTH+1) > ("data/"fmtid"/"$1)
}' fmtfile sampledata
(
What's going on here is that the 'FNR==NR' clause (which executes only on the first file) adds the values (fmtid, fmt) from each line of the first file as command-line assignments, and then inserts the data file name as a command-line argument. In awk, assignments as command line arguments are simply executed as though they were assignments from a string constant with implicit quotes, including backslash-escape processing (except that if the last character in the argument is a backslash, it doesn't escape the implicit closing double-quote). This behaviour is mandated by Posix, as is the order in which arguments are processed which makes it possible to add arguments as you go.
As written, the script must be provided with exactly two arguments: the formats and the data (in that order). There is some room for improvement, obviously.
The snippet also shows two ways of concatenating trailing fields.
In the format file, I assume that the lines are well behaved (no leading spaces; exactly one space after the format id). With those constraints, substr($0, length($1)+2) is precisely the part of the line after the first field and a single space.
Processing the datafile, it may be necessary to do this with fewer constraints. First, the builtin match function is called with the regular expression /^ *[^ ]+[ ]+[^ ]+[ ]+/ which matches leading spaces (if any) and two space-separated fields, along with the following spaces. (It would be better to allow tabs, as well.) Once the regex matches (and matching shouldn't be assumed, so there's another thing to fix), the variables RSTART and RLENGTH are set, so substr($0, RLENGTH+1) picks up everything starting with the third field. (Again, this is all Posix-standard behaviour.)
Honestly, I'd use the shell printf for this problem, and I don't understand why you feel that solution is somehow sub-optimal. The shell printf interprets backslash escapes in formats, and the shell read -r will do the line splitting the way you want. So there's no reason for awk at all, as far as I can see.
Ed Morton shows the problem clearly (edit: and it's now complete, so just go accept it): awk's string literal processing handled the escapes, and file I/O code isn't a lexical analyzer.
It's an easy fix: decide what escapes you want to support, and support them. Here's a one-liner form if you're doing special-purpose work that doesn't need to handle escaped backslashes
awk '{ gsub(/\\n/,"\n"); gsub(/\\t/,"\t"); printf($0 "bar\n", "world"); }' <<\EOD
hello:\t%s\n\tfoo
EOD
but for doit-and-forgetit peace of mind just use the full form in the linked answer.
#Ed Morton's answer explains the problem well.
A simple workaround is to:
pass the format-string file contents via an awk variable, using command substitution,
assuming that file is not too large to be read into memory in full.
Using GNU awk or mawk:
awk -v formats="$(tr '\n' '\3' <fmtStrings)" '
# Initialize: Split the formats into array elements.
BEGIN {n=split(formats, aFormats, "\3")}
# For each data line, loop over all formats and print.
{ for(i=1;i<n;++i) {printf aFormats[i] "\n", $1, $2, $3} }
' sampleData
Note:
The advantage of this solution is that it works generically - you don't need to anticipate specific escape sequences and handle them specially.
On FreeBSD awk, this almost works, but - sadly - split() still splits by newlines, despite being given an explicit separator - this smells like a bug. Observed on versions 20070501 (OS X 10.9.4) and 20121220 (FreeBSD 10.0).
The above solves the core problem (for brevity, it omits stripping the ID from the front of the format strings and omits the output-file creation logic).
Explanation:
tr '\n' '\3' <fmtStrings replaces actual newlines in the format-strings file with \3 (0x3) characters, so as to be able to later distinguish them from the \n escape sequences embedded in the lines, which awk turns into actual newlines when assigning to variable formats (as desired).
\3 (0x3) - the ASCII end-of-text char. - was arbitrarily chosen as an auxiliary separator that is assumed not to be present in the input file.
Note that using \0 (NUL) is NOT an option, because awk interprets that as an empty string, causing split() to split the string into individual characters.
Inside the BEGIN block of the awk script, split(formats, aFormats, "\3") then splits the combined format strings back into individual format strings.
I had to create another answer to start clean, I believe I've come to a good solution, again with perl:
echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"'
hi : hello
That bad boy s/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg will translate any meta character I can think of, let us take a look with cat -A :
echo '%10s\t:\t%10s\r\n' | perl -lne 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg; printf "$_","hi","hello"' | cat -A
hi^I:^I hello^M$
PS. I didn't create that regex, I googled unquote meta and found here
What you are trying to do is called templating. I would suggest that shell tools are not the best tools for this job. A safe way to go would be to use a templating library such as Template Toolkit for Perl, or Jinja2 for Python.
The problem lies in the non-interpretation of the special characters \t and \n by echo: it makes sure that they are understood as as-is strings, and not as tabulations and newlines. This behavior can be controlled by the -e flag you give to echo, without changing your awk script at all:
echo -e "hello:\t%s\n\tfoo" | awk '{s=$0; printf(s "bar\n", "world");}'
tada!! :)
EDIT:
Ok, so after the point rightfully raised by Chrono, we can devise this other answer corresponding to the original request to have the pattern read from a file:
echo "hello:\t%s\n\tfoo" > myfile
awk 'BEGIN {s="'$(cat myfile)'" ; printf(s "bar\n", "world")}'
Of course in the above we have to be careful with the quoting, as the $(cat myfile) is not seen by awk but interpreted by the shell.
This looks extremely ugly, but it works for this particular problem:
s=$0;
gsub(/'/, "'\\''", s);
gsub(/\\n/, "\\\\\\\\n", s);
"printf '%b' '" s "'" | getline s;
gsub(/\\\\n/, "\n", s);
gsub(/\\n/, "\n", s);
printf(s " bar\n", "world");
Replace all single quotes with shell-escaped single quotes ('\'').
Replace all escaped newline sequences that appear normally as \n with the sequence that appears as \\\\n. It would suffice to use \\\\n as the actual replacement string (meaning \\n would print if you printed it), but the version of gawk I have messes things up in POSIX mode.
Invoke the shell to execute printf '%b' 'escape'\''d format' and use awk's getline statement to retrieve the line.
Unescape \\n to yield a newline. This step wouldn't be necessary if gawk in POSIX mode played nicely.
Unescape \n to yield a newline.
Otherwise you're left to call the gsub function for each possible escape sequence, which is terrible for \001, \002, etc.
Graham,
Ed Morton's solution is the best (and perhaps only) one available.
I'm including this answer for a better explanation of WHY you're seeing what you're seeing.
A string is a string. The confusing part here is WHERE awk does the translation of \t to a tab, \n to a newline, etc. It appears NOT to be the case that the backslash and t get translated when used in a printf format. Instead, the translation happens at assignment, so that awk stores the tab as part of the format rather than translating when it runs the printf.
And this is why Ed's function works. When read from stdin or a file, no assignment is performed that will implement the translation of special characters. Once you run the command s="a\tb"; in awk, you have a three character string containing no backslash or t.
Evidence:
$ echo "a\tb\n" | awk '{ s=$0; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1 a
2 \
3 t
4 b
5 \
6 n
vs
$ awk 'BEGIN{s="a\tb\n"; for (i=1;i<=length(s);i++) {printf("%d\t%c\n",i,substr(s,i,1));} }'
1 a
2
3 b
4
And there you go.
As I say, Ed's answer provides an excellent function for what you need. But if you can predict what your input will look like, you can probably get away with a simpler solution. Knowing how this stuff gets parsed, if you have a limited set of characters you need to translate, you may be able to survive with something simple like:
s=$0;
gsub(/\\t/,"\t",s);
gsub(/\\n/,"\n",s);
That's a cool question, I don't know the answer in awk, but in perl you can use eval :
echo '%10s\t:\t%-10s\n' | perl -ne ' chomp; eval "printf (\"$_\", \"hi\", \"hello\")"'
hi : hello
PS. Be aware of code injection danger when you use eval in any language, no just eval any system call can't be done blindly.
Example in Awk:
echo '$(whoami)' | awk '{"printf \"" $0 "\" " "b" | getline s; print s}'
tiago
What if the input was $(rm -rf /)? You can guess what would happen :)
ikegami adds:
Why would even think of using eval to convert \n to newlines and \t to tabs?
echo '%10s\t:\t%-10s\n' | perl -e'
my %repl = (
n => "\n",
t => "\t",
);
while (<>) {
chomp;
s{\\(?:(\w)|(\W))}{
if (defined($2)) {
$2
}
elsif (exists($repl{$1})) {
$repl{$1}
}
else {
warn("Unrecognized escape \\$1.\n");
$1
}
}eg;
printf($_, "hi", "hello");
}
'
Short version:
echo '%10s\t:\t%-10s\n' | perl -nle'
s/\\(?:(n)|(t)|(.))/$1?"\n":$2?"\t":$3/seg;
printf($_, "hi", "hello");
'