Why doesn't awk split into fields correctly on an alternation involving "."?

It is OK to awk split on .:
>printf foo.bar | awk '{split($0, a, "."); print a[1]}'
foo
It is OK to awk split on an alternative:
>printf foo.bar | awk '{split($0, a, "b|a"); print a[1]}'
foo.
Then why is it not OK to split on an alternative involving .:
>printf foo.bar | awk '{split($0, a, ".|a"); print a[1]}'
(nothing printed)

Escape that period and I think you'll be golden:
printf foo.bar | awk '{split($0, a, "\\.|a"); print a[1]}'

JNevill showed how to get it working. As for why the escape is needed in one case but not the other, the answer is in the awk manual's summary of "how fields are split, based on the value of FS" (the same rules apply to the fieldsep given to split()).
The bottom line is that when FS is a single character it is not treated as a regular expression, but otherwise it is.
Hence split($0, a, ".") works as hoped, taking . literally, while split($0, a, ".|a") treats .|a as a regexp in which . matches any character. That is why the backslashes are needed to make the . literal. The manual's summary:
FS == " "
    Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.
FS == any single character
    Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences.
FS == regexp
    Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
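You can see that, despite the empty result, .|a really is doing something: it divides the line into eight empty fields, just as a line like ,,,,,,, would with FS set to , (the order in which for (i in a) visits the indices is unspecified, which is why the output does not start at 1).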
$ printf foo.bar | awk '{split($0, a, ".|a"); for (i in a) print i ": " a[i]; }'
4:
5:
6:
7:
8:
1:
2:
3:
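To see the three cases side by side, one option is to print the return value of split(), which is the number of pieces it produced. This is only a sketch on the same foo.bar input; the variable names and the [.] bracket-expression form (an alternative to the backslashes) are additions for illustration and do not come from the answers above.

```shell
# split() returns how many pieces it produced.
literal=$(printf 'foo.bar\n' | awk '{ print split($0, a, ".") }')      # single character: literal dot
anychar=$(printf 'foo.bar\n' | awk '{ print split($0, a, ".|a") }')    # regexp: dot matches every character
bracket=$(printf 'foo.bar\n' | awk '{ print split($0, a, "[.]|a") }')  # regexp with the dot made literal
echo "$literal $anychar $bracket"   # prints: 2 8 3
```

The 8 is the seven characters of foo.bar each acting as a separator; the 3 is foo, b and r.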

Related

Combine awk with sub to print multiple columns

Input:
MARKER POS EA NEA BETA SE N EAF STRAND IMPUTED
1:244953:TTGAC:T 244953 T TTGAC -0.265799 0.291438 4972 0.00133176 + 1
2:569406:G:A 569406 A G -0.17456 0.296652 4972 0.00128021 + 1
Desired output:
1 1:244953:TTGAC:T 0 244953
2 2:569406:G:A 0 569406
Column 1 in output file is first number from first column in input file
Tried:
awk '{gsub(/:.*/,"",$1);print $1,0,$2}' input
But it does not print $2 correctly
Thank you for any help
Your idea is right. It didn't work because gsub() modifies $1 in place and you had not saved the original value, so any later reference to $1 returns the shortened value. Save a copy first and run the substitution on the copy, as below. sub() is also sufficient here, since only one replacement is needed, and NR>1 skips the header line.
awk 'NR>1{backup=$1; sub(/:.*/,"",backup);print backup,$1,0,$2}' file
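The in-place behaviour is easy to confirm on a single line. A minimal sketch, assuming a shortened two-column version of the first data row; the names line, in_place, on_copy and key are illustrative only.

```shell
line='1:244953:TTGAC:T 244953'
# sub() on $1 rewrites the field itself, so the original marker is gone afterwards.
in_place=$(printf '%s\n' "$line" | awk '{ sub(/:.*/, "", $1); print $1, 0, $2 }')
# Working on a copy leaves $1 intact.
on_copy=$(printf '%s\n' "$line" | awk '{ key = $1; sub(/:.*/, "", key); print key, $1, 0, $2 }')
printf '%s\n' "$in_place" "$on_copy"
# prints:
# 1 0 244953
# 1 1:244953:TTGAC:T 0 244953
```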
Or use split() on the first column. The call returns the number of elements produced by splitting on the delimiter : and stores them in the array a. We then print the first element followed by the columns we need.
awk 'NR>1{n=split($1, a, ":"); print a[1],$1,"0", $2}' file
From GNU awk documentation under String functions
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string.
Pipe the result through column -t to align the columns and make the output easier to read:
awk 'NR>1{n=split($1, a, ":"); print a[1],$1,"0", $2}' file | column -t
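As a check, the split() approach can be run against the sample input from the question. This sketch feeds the three sample lines through printf instead of a file (an assumption made so the example is self-contained) and leaves out column -t.

```shell
out=$(printf '%s\n' \
  'MARKER POS EA NEA BETA SE N EAF STRAND IMPUTED' \
  '1:244953:TTGAC:T 244953 T TTGAC -0.265799 0.291438 4972 0.00133176 + 1' \
  '2:569406:G:A 569406 A G -0.17456 0.296652 4972 0.00128021 + 1' |
  awk 'NR > 1 { split($1, a, ":"); print a[1], $1, 0, $2 }')   # NR > 1 skips the header
printf '%s\n' "$out"
# prints:
# 1 1:244953:TTGAC:T 0 244953
# 2 2:569406:G:A 0 569406
```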
Could you please try the following and let me know if it helps?
awk -v s1=" " -F"[: ]" 'FNR>1{print $1 s1 $1 OFS $2 OFS $3 OFS $4 s1 "0" s1 $5}' OFS=":" Input_file

How can I replace all middle characters with '*'?

I would like to replace the middle of each word with asterisks, keeping the first two and last two characters.
For example :
ifbewofiwfib
wofhwifwbif
iwjfhwi
owfhewifewifewiwei
fejnwfu
fehiw
wfebnueiwbfiefi
Should become :
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
So far I managed to replace all but the first 2 chars with:
sed -e 's/./*/g3'
Or do it the long way:
grep -o '^..' file > start
cat file | sed 's:^..\(.*\)..$:\1:' | awk -F. '{for (i=1;i<=length($1);i++) a=a"*";$1=a;a=""}1' > stars
grep -o '..$' file > end
paste -d "" start stars > temp
paste -d "" temp end > final
I would use Awk for this, provided you have GNU Awk, which lets you set the field separator to an empty string (see How to set the field separator to an empty string?).
That way every character becomes a field, so you can loop through the characters and replace the desired ones with "*". In this case, replace from the 3rd to the 3rd-last:
$ awk 'BEGIN{FS=OFS=""}{for (i=3; i<=NF-2; i++) $i="*"} 1' file
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
If perl is okay:
$ perl -pe 's/..\K.*(?=..)/"*" x length($&)/e' ip.txt
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
..\K.*(?=..) matches the characters other than the first two and last two
See the regex lookarounds section for details
The e modifier allows Perl code to be used in the replacement section
"*" x length($&) uses the length function and the string repetition operator to build the replacement string
You can do it with a repetitive substitution, e.g.:
sed -E ':a; s/^(..)([*]*)[^*](.*..)$/\1\2*\3/; ta'
Explanation
This works by repeating the substitution until no change happens, which is what the :a; ...; ta bit does. The substitution consists of 3 matched groups and a non-asterisk character:
(..) the start of the string.
([*]*) any already replaced characters.
[^*] the character to be replaced next.
(.*..) any remaining characters to replace and the end of the string.
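To watch the loop at work, one can print the pattern space at the top of each pass. The following is a sketch rather than part of the answer: it adds -n and a p command, and splits the script into separate -e pieces so the label is not run together with the next command.

```shell
trace=$(printf 'iwjfhwi\n' |
  sed -nE -e ':a' -e 'p' -e 's/^(..)([*]*)[^*](.*..)$/\1\2*\3/' -e 'ta')
printf '%s\n' "$trace"
# prints:
# iwjfhwi
# iw*fhwi
# iw**hwi
# iw***wi
```

Each pass turns exactly one character into an asterisk; the loop stops when only the last two characters remain after the asterisks and the substitution can no longer match.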
Alternative GNU sed answer
You could also do this by using the hold space which might be simpler to read, e.g.:
h # save a copy to hold space
s/./*/g3 # replace all but 2 by *
G # append hold space to pattern space
s/^(..)([*]*)..\n.*(..)$/\1\2\3/ # reformat pattern space
Run it like this:
sed -Ef parse.sed input.txt
Output in both cases
if********ib
wo*******if
iw***wi
ow**************ei
fe***fu
fe*iw
wf***********fi
The following awk may help. It uses only length() and substr(), so it should work in any version of awk.
awk '{len=length($0);for(i=3;i<=(len-2);i++){val=val "*"};print substr($0,1,2) val substr($0,len-1);val=""}' Input_file
Here is the same solution in multi-line form.
awk '
{
len=length($0);
for(i=3;i<=(len-2);i++){
val=val "*"};
print substr($0,1,2) val substr($0,len-1);
val=""
}
' Input_file
Explanation: the same code with comments.
awk '
{
len=length($0); ##Store the length of the current line in len.
for(i=3;i<=(len-2);i++){ ##Loop once for every character between the first two and the last two.
val=val "*"}; ##Append one asterisk to val on each pass.
print substr($0,1,2) val substr($0,len-1);##Print the first 2 chars, then the asterisks, then the last 2 chars of the line.
val="" ##Reset val so the asterisks do not carry over to the next line.
}
' Input_file ##Name of the input file.
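The same substr() idea can be wrapped in a shell function with the number of visible characters as a parameter. This is a hedged sketch: the function name mask_middle and the keep parameter are hypothetical, and the choice to print lines of four characters or fewer unchanged is an assumption that the question does not specify.

```shell
# mask_middle [keep]: hide all but the first and last "keep" characters (default 2).
mask_middle() {
  awk -v keep="${1:-2}" '
    {
      len = length($0)
      if (len <= 2 * keep) { print; next }   # nothing in the middle to hide
      stars = ""
      for (i = 1; i <= len - 2 * keep; i++) stars = stars "*"
      print substr($0, 1, keep) stars substr($0, len - keep + 1)
    }'
}

printf 'ifbewofiwfib\nfehiw\nabcd\n' | mask_middle
# prints:
# if********ib
# fe*iw
# abcd
```

The explicit length check matters for short input: without it, a three-character line would have its middle character printed twice, because the two substr() calls overlap.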

AWK escape double quotes in pattern matching

I am using pattern matching as below:
awk -F"|" 'NR == 2 {
if (length($1) > 50) {print NR "\t" " Does not meet length requirements"}
if ($1 ~ /["&+%\\|"]/) {print NR "\t" "has an invalid character"}}' Skruggle.csv
The only reason I put double quotes in the pattern was to escape the pipe so that it is taken literally rather than as alternation. The code successfully matches &, +, % and |, but it also matches double quotes, which I don't want. How do I match the pipe literally without also matching double quotes in the same regular expression?
Any inputs will be appreciated.
BigBaby608
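One point that bears on this: inside a bracket expression the pipe is already literal, so neither quotes nor a backslash are needed to protect it. A minimal sketch, assuming the test is applied to the whole line (with -F"|" the first field can never contain a pipe); the helper name has_invalid is hypothetical.

```shell
has_invalid() {
  # [&+%|] is a bracket expression: every character in it, including |, is literal.
  printf '%s\n' "$1" | awk '{ print (($0 ~ /[&+%|]/) ? "invalid" : "ok") }'
}
has_invalid 'a|b'        # prints: invalid
has_invalid 'say "hi"'   # prints: ok
has_invalid 'a+b'        # prints: invalid
```

If a literal backslash should also be flagged, \\ can stay in the class, as in the original pattern.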

awk include column value in pattern

I am looking for a way to pattern match against another column in awk. For example, I wish to find rows for which the value in column 4 is nested in column 5.
Running awk '$4 ~ /$5/' doesn't work, because the dollar sign is interpreted as part of the regular expression. How do I get the column 5 value into this pattern match?
Many thanks!
If you're looking for a literal match rather than a regex, you can use
awk 'index($5,$4)' file
which prints the lines where $4 is a substring of $5.
> awk '$2 ~ $1' <<< "field another_field"
field another_field
This prints the lines where $2 matches the value of $1 used as a regex (a dynamic regex), i.e. where $2 contains something that matches $1.
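The two answers differ as soon as the field contains a regex metacharacter. A small sketch with made-up input, where $1 is a.c and $2 is abc; the variable names are illustrative.

```shell
# As a regexp, a.c matches abc; as a plain string, a.c does not occur in abc.
cmp=$(printf 'a.c abc\n' |
  awk '{ as_regex = ($2 ~ $1); as_literal = (index($2, $1) > 0); print as_regex, as_literal }')
echo "$cmp"   # prints: 1 0
```

So index() is the safer choice when the column holds arbitrary text, and ~ is the right one when the column is meant to hold a pattern.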

awk condition on whether pattern is matched

I am trying to have awk alter a given pattern if matched or return the original line. Here's my code
printf 'hello,"hru, bro"\nhi,bye\n' | gawk 'match($0, /"([^"]+)"/, m) {if (m[1] == "") {print $0} else {print gsub(/,/,"",m[1])}}'
-> 1
I expect match to return the captured group in m[1] and gsub to remove every ',' in m[1] when there is a match. Thus the result should be
-> hello,hru bro\nhi,bye
What am I missing here?
UPDATE
Following Tom's comment I replaced gsub with gensub, yet I now get the following result:
-> gawk: cmd. line:1: (FILENAME=- FNR=1) warning: gensub: third argument `hru, bro' treated as 1
hello"hru, bro"
gsub mutates the third argument and returns the number of substitutions made - in this case, 1.
I would suggest changing your code to something like this (it needs gawk, since the three-argument match() and gensub() are GNU extensions):
awk 'match($0, /([^"]*")([^"]+)(".*)/, m) {
$0 = m[1] gensub(/,/, "", "g", m[2]) m[3]
} 1'
If there is anything surrounded by quotes on the line, then rebuild it, using gensub to remove the commas from the middle captured group (i.e. the part between the double quotes).
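Note that gensub takes up to four arguments: the third says which matches to replace ("g" means all of them, a number n means only the n-th match), and the fourth is the target string. Unlike gsub, it returns the modified string and leaves the target unchanged.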
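For awks without gensub, a similar result can be had with the two-argument match(), which sets RSTART and RLENGTH, plus gsub on a copy of the quoted part. This is a sketch under that assumption, not the answer's method; the variable name quoted is illustrative, and like the answer above it keeps the surrounding double quotes.

```shell
fixed=$(printf 'hello,"hru, bro"\nhi,bye\n' | awk '
  match($0, /"[^"]+"/) {
    quoted = substr($0, RSTART, RLENGTH)   # the quoted part, quotes included
    gsub(/,/, "", quoted)                  # gsub edits quoted in place and returns a count
    $0 = substr($0, 1, RSTART - 1) quoted substr($0, RSTART + RLENGTH)
  }
  1')
printf '%s\n' "$fixed"
# prints:
# hello,"hru bro"
# hi,bye
```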