awk bash repeat pattern - awk

I currently have this ORIGINAL string:
mklink .\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks $.\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks
#.......................................................^
....and need to replace as follows:
mklink .\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks ..\..\..\..\.\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks
#.......^......^...........^..........^.................^^^^^^^^^^^^
i.e. replace $ with "..\" 4 times based on the number of slashes in the 2nd column of the ORIGINAL string (".\Oracle\MOLT_HYB_01\110_Header\pkg_cdc_util.pks")
I can for example do the following individually:
awk -F '\\' '{print NF-1}' --> to print the number of occurrences of the backward slash
sed -e "s,\\\,$(printf '..\\\%.0s' {1..4}),g" --> to replace and repeat the string pattern
......but unsure how to string together in 1 command line.

An awk command that works with an arbitrary number of input lines (with the same field structure):
awk -v prefixUnit="..\\" '{
count = gsub("\\\\", "\\", $2) # count the number of "\"s
# Build the prefix composed of count prefixUnit instances
prefix = ""; for (i=1; i<=count; ++i) prefix = prefix prefixUnit
$3 = prefix substr($3, 2) # prepend the prefix to the 3rd field, replacing the "$"
print # print result
}' file
As a condensed one-liner:
awk -v pu="..\\" '{n=gsub("\\\\","\\",$2);p="";for(i=1;i<=n;++i)p=p pu;$3=p substr($3,2);print}' file

I'd use perl, although this is a bit "clever"
perl -ane '
$n = ($F[1] =~ tr/\\/\\/);
$F[2] =~ s{\$}{"..\\" x $n}e;
print join(" ", #F);
' file
where
-a - autosplit the line into the array #F
-n - execute the body for each line of "file"
$F[1] =~ tr/\\/\\/ - count the number of backslashes in the 2nd field
$F[2] =~ s{\$}{"..\\" x $n}e - replace the dollar with the right number of "..\"

awk one-liner, assuming that there is just a single line to process
awk -F\\ -v RS='$' -v ORS='' '{print} NR==1{for(i=1; i< NF; i++) printf("..\\")}'
Explanation
awk -F\\ -v RS='$' ' -v ORS='' ' # Specify separators
{print} # print the line
NR==1{ # for first record
for (i=1; i<NF; i++) # print multiple ..\
printf("..\\") # need to escape \\
}
'

Related

gawk - Delimit lines with custom character and no similar ending character

Let's say I have a file like so:
test.txt
one
two
three
I'd like to get the following output: one|two|three
And am currently using this command: gawk -v ORS='|' '{ print $0 }' test.txt
Which gives: one|two|three|
How can I print it so that the last | isn't there?
Here's one way to do it:
$ seq 1 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1
$ seq 3 | awk -v ORS= 'NR>1{print "|"} 1; END{print "\n"}'
1|2|3
With paste:
$ seq 1 | paste -sd'|'
1
$ seq 3 | paste -sd'|'
1|2|3
Convert one column to one row with field separator:
awk '{$1=$1} 1' FS='\n' OFS='|' RS='' file
Or in another notation:
awk -v FS='\n' -v OFS='|' -v RS='' '{$1=$1} 1' file
Output:
one|two|three
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
awk solutions work great. Here is tr + sed solution:
tr '\n' '|' < file | sed 's/\|$//'
1|2|3
just flatten it :
gawk/mawk 'BEGIN { FS = ORS; RS = "^[\n]*$"; OFS = "|"
} NF && ( $NF ? NF=NF : —-NF )'
ascii | = octal \174 = hex 0x7C. The reason for —-NF is that more often than not, the input includes a trailing new line, which makes field count 1 too many and result in
1|2|3|
Both NF=NF and --NF are similar concepts to $1=$1. Empty inputs, regardless of whether trailing new lines exist or not, would result in nothing printed.
At the OFS spot, you can delimit it with any string combo you like instead of being constrained by tr, which has inconsistent behavior. For instance :
gtr '\012' '高' # UTF8 高 = \351\253\230 = xE9 xAB x98
on bsd-tr, \n will get replaced by the unicode properly 1高2高3高 , but if you're on gnu-tr, it would only keep the leading byte of the unicode, and result in
1 \351 2 \351 . . .
For unicode equiv-classes, bsd-tr works as expected while gtr '[=高=]' '\v' results in
gtr: ?\230: equivalence class operand must be a single character
and if u attempt equiv-classes with an arbitrary non-ASCII byte, bsd-tr does nothing while gnu-tr would gladly oblige, even if it means slicing straight through UTF8-compliant characters :
g3bn 77138 | (g)tr '[=\224=]' '\v'
bsd-tr : 77138=Koyote 코요태 KYT✜ 高耀太
gnu-tr : 77138=Koyote ?
?
태 KYT✜ 高耀太
I would do it following way, using GNU AWK, let test.txt content be
one
two
three
then
awk '{printf NR==1?"%s":"|%s", $0}' test.txt
output
one|two|three
Explanation: If it is first line print that line content sans trailing newline, otherwise | followed by line content sans trailing newline. Note that I assumed that test.txt has not trailing newline, if this is not case test this solution before applying it.
(tested in gawk 5.0.1)
Also you can try this with awk:
awk '{ORS = (NR%3 ? "|" : RS)} 1' file
one|two|three
% is the modulo operator and NR%3 ? "|" : RS is a ternary expression.
See Ed Morton's explanation here: https://stackoverflow.com/a/55998710/14259465
With a GNU sed, you can pass -z option to match line breaks, and thus all you need is replace each newline but the last one at the end of string:
sed -z 's/\n\(.\)/|\1/g' test.txt
perl -0pe 's/\n(?!\z)/|/g' test.txt
perl -pe 's/\n/|/g if !eof' test.txt
See the online demo.
Details:
s - substitution command
\n\(.\) - an LF char followed with any one char captured into Group 1 (so \n at the end of string won't get matched)
|\1 - a | char and the captured char
g - all occurrences.
The first perl command matches any LF char (\n) not at the end of string ((?!\z)) after slurping the whole file into a single string input (again, to make \n visible to the regex engine).
The second perl command replaces an LF char at the end of each line except the one at the end of file (eof).
To make the changes inline add -i option (mind this is a GNU sed example):
sed -i -z 's/\n\(.\)/|\1/g' test.txt
perl -i -0pe 's/\n(?!\z)/|/g' test.txt
perl -i -pe 's/\n/|/g if !eof' test.txt

AWK Match & Split not finding string pattern

Passing the following commands I would expect the first to split the string (which is also a regex) into two array elements and the second command (match) to print [[:blank:]].
echo "new[[:blank:]]+File\(" | awk '{ split($0, a, "[[:blank:]]"); print a[1]}'
prints the whole string as it has not split
echo "new[[:blank:]]+File\(" | awk '{ match($0, /[[:blank:]]/, m)}END{print m[0]}'
prints nothing
What am I missing here?
UPDATE
I'm calling an awk script with the following command;
awk -v regex1=new[[:blank:]]+File\( -f parameterisedRegexAwkScript.awk "$file" >> "output.txt"
Then in the my script I attempt to split on the string literal with the following command;
len = split(regex1, regex, /[[:blank:]]/, seps
but when I print len it's value is 1 when I would have expected it to be 2
echo "new[[:blank:]]+File\(" | awk '{ split($0, a, "[[:blank:]]"); print a[1]}'
3rd argument for split works like setting FS in BEGIN, so in this case you instruct to split at any whitespace, you need to escape [ and ]. Let file.txt content be
new[[:blank:]]+File\(
then
awk '{split($0, a, "\\[\\[:blank:\\]\\]"); print a[1]}' file.txt
output
new
(tested in gawk 4.2.1)

Convert single column into three comma separated columns using awk

I have a single long column and want to reformat it into three comma separated columns, as indicated below, using awk or any Unix tool.
Input:
Xaa
Ybb
Mdd
Tmmn
UUnx
THM
THSS
THEY
DDe
Output:
Xaa,Ybb,Mdd
Tmmn,UUnx,THM
THSS,THEY,DDe
$ awk '{printf "%s%s",$0,NR%3?",":"\n";}' file
Xaa,Ybb,Mdd
Tmmn,UUnx,THM
THSS,THEY,DDe
How it works
For every line of input, this prints the line followed by, depending on the line number, either a comma or a newline.
The key part is this ternary statement:
NR%3?",":"\n"
This takes the line number modulo 3. If that is non-zero, then it returns a comma. If it is zero, it returns a newline character.
Handling files that end before the final line is complete
The assumes that the number of lines in the file is an integer multiple of three. If it isn't, then we probably want to assure that the last line has a newline. This can be done, as Jonathan Leffler suggests, using:
awk '{printf "%s%s",$0,NR%3?",":"\n";} END { if (NR%3 != 0) print ""}' file
If the final line is short of three columns, the above code will leave a trailing comma on the line. This may or may not be a problem. If we do not want the final comma, then use:
awk 'NR==1{printf "%s",$0; next} {printf "%s%s",(NR-1)%3?",":"\n",$0;} END {print ""}' file
Jonathan Leffler offers this slightly simpler alternative to achieve the same goal:
awk '{ printf("%s%s", pad, $1); pad = (NR%3 == 0) ? "\n" : "," } END { print "" }'
Improved portability
To support platforms which don't use \n as the line terminator, Ed Morton suggests:
awk -v OFS=, '{ printf("%s%s", pad, $1); pad = (NR%3?OFS:ORS)} END { print "" }' file
There is a tool for this. Use pr
pr -3ats,
3 columns width, across, suppress header, comma as separator.
xargs -n3 < file | awk -v OFS="," '{$1=$1} 1'
xargs uses echo as default action, $1=$1 forces rebuild of $0.
Using only awk I would go with this (which is similar to what proposed by #jonathan-leffler and #John1024)
{
sep = NR == 1 ? "" : \
(NR-1)%3 ? "," : \
"\n"
printf sep $0
}
END {
printf "\n"
}

awk - concatenate two string variable and assign to a third

In awk, I have 2 fields: $1 and $2.
They are both strings that I want to concatenate and assign to a variable.
Just use var = var1 var2 and it will automatically concatenate the vars var1 and var2:
awk '{new_var=$1$2; print new_var}' file
You can put an space in between with:
awk '{new_var=$1" "$2; print new_var}' file
Which in fact is the same as using FS, because it defaults to the space:
awk '{new_var=$1 FS $2; print new_var}' file
Test
$ cat file
hello how are you
i am fine
$ awk '{new_var=$1$2; print new_var}' file
hellohow
iam
$ awk '{new_var=$1 FS $2; print new_var}' file
hello how
i am
You can play around with it in ideone: http://ideone.com/4u2Aip
Could use sprintf to accomplish this:
awk '{str = sprintf("%s %s", $1, $2)} END {print str}' file
You can also concatenate strings from across multiple lines with whitespaces.
$ cat file.txt
apple 10
oranges 22
grapes 7
Example 1:
awk '{aggr=aggr " " $2} END {print aggr}' file.txt
10 22 7
Example 2:
awk '{aggr=aggr ", " $1 ":" $2} END {print aggr}' file.txt
, apple:10, oranges:22, grapes:7
Concatenating strings in awk can be accomplished by the print command AWK manual page, and you can do complicated combination. Here I was trying to change the 16 char to A and used string concatenation:
echo CTCTCTGAAATCACTGAGCAGGAGAAAGATT | awk -v w=15 -v BA=A '{OFS=""; print substr($0, 1, w), BA, substr($0,w+2)}'
Output: CTCTCTGAAATCACTAAGCAGGAGAAAGATT
I used the substr function to extract a portion of the input (STDIN). I passed some external parameters (here I am using hard-coded values) that are usually shell variable. In the context of shell programming, you can write -v w=$width -v BA=$my_charval. The key is the OFS which stands for Output Field Separate in awk. Print function take a list of values and write them to the STDOUT and glue them with the OFS. This is analogous to the perl join function.
It looks that in awk, string can be concatenated by printing variable next to each other:
echo xxx | awk -v a="aaa" -v b="bbb" '{ print a b $1 "string literal"}'
# will produce: aaabbbxxxstring literal

gawk FS to split record into individual characters

If the field separator is the empty string, each character becomes a separate field
$ echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
5,h,e,l,l,o
However, if FS is a regex that can possibly match zero times, the same behaviour does not occur:
$ echo hello | awk -F ' *' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Anyone know why that is? I could not find anything in the gawk manual. Is FS="" just a special case?
I'm most interested in understanding why the 2nd case does not split the record into more fields. It's as if awk is treating FS=" *" like FS=" +"
Interesting question!
I just pulled gnu-awk 4.1.0's codes, I think the answer we could find in the file field.c.
line 371:
* re_parse_field --- parse fields using a regexp.
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is a regular
* expression -- either user-defined or because RS=="" and FS==" "
*/
static long
re_parse_field(lo...
also this line: (line 425):
if (REEND(rp, scan) == RESTART(rp, scan)) { /* null match */
here is the case of <space>* matching in your question. The implementation didn't increment the nf, that is, it thinks the whole line is one single field. Note this function was used in do_split() function too.
First, if FS is null string, gawk separates each char into its own field. gawk's doc has clearly written this, also in codes, we could see:
line 613:
* null_parse_field --- each character is a separate field
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is the null string.
*/
static long
null_parse_field(long up_to,
If the FS has single character, awk won't consider it as regex. This was mentioned in doc too. Also in codes:
#line 667
* sc_parse_field --- single character field separator
*
* This is called both from get_field() and from do_split()
* via (*parse_field)(). This variation is for when FS is a single character
* other than space.
*/
static long
sc_parse_field(l
if we read the function, no regex match handling was done there.
In the comments of the function re_parse_field(), and sc_parse_field(), we see do_split invokes them too. It explains why we have 1 in following command instead of 3:
kent$ echo "foo"|awk '{split($0,a,/ */);print length(a)}'
1
Note, to avoid to make the post too long, I didn't paste the complete codes here, we can find the codes here:
http://git.savannah.gnu.org/cgit/gawk.git/
As was mentioned, an empty field separator generates undefined behavior; the same code will give different results on different platforms / flavors of awk. For example (all Mac OSX 10.8.5):
> echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
awk: field separator FS is empty
1,hello
So awk complains, but keeps going.
Let's look at some other examples:
> echo hello | awk -F '.' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
A . by itself is not considered a regular expression
> echo hello | awk -F '[.]' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Still nothing
> echo hello | awk -F '.?' -v OFS=, '{$1 = NF OFS $1} 1'
6,,,,,,
Now we have something like a regex: .? is "zero or one character". It is expanded to one character (which is consumed), so the output is "a whole lot of nothings"
> echo hello | awk -F '*' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello
Not a regular expression
> echo hello | awk -F '.*' -v OFS=, '{$1 = NF OFS $1} 1'
2,,
A regular expression that consumes the entire thing
> echo hello | awk -F 'l' -v OFS=, '{$1 = NF OFS $1} 1'
3,he,,o
Match the letter l twice - two empty strings
> echo hello | awk -F 'ell' -v OFS=, '{$1 = NF OFS $1} 1'
2,h,o
Match all of ell at once
> echo hello | awk -F '.?|' -v OFS=, '{$1 = NF OFS $1} 1'
awk: illegal primary in regular expression .?| at
input record number 1, file
source line number 1
Attempt to be clever: sometimes an | with empty string on one side will match "anything" but awk's regex engine doesn't like it.
Conclusion - the regular expressions cannot match "empty", and whatever is matched is consumed. Attempts to use (?:.) or even (?=.) generate errors.
It seems to be a special case in gawk.
Traditionally, the behavior of FS equal to "" was not defined. In this
case, most versions of Unix awk simply treat the entire record as only
having one field. (d.c.) In compatibility mode (see Options), if FS is
the null string, then gawk also behaves this way.
What POSIX has to say about this:
If FS is a null string, the behavior is unspecified.
So the gawk behaviour is implementation-specific and sort of explains why your two examples don't yield the same output.
Another data point: gawk and perl disagree on how to do this:
$ perl -E '$,=","; $s="hello"; $r=qr( *); #s=split($r,$s); say scalar(#s), #s'
5,h,e,l,l,o
$ gawk 'BEGIN {s="hello";r=" *";n=split(s,a,r); print n,a[n]; if (s~r) print "match"}'
1 hello
match
$ gawk 'BEGIN {s="hello";r=""; n=split(s,a,r); print n,a[n]; if (s~r) print "match"}'
5 o
match