AWK escape double quotes in pattern matching - awk

I am using pattern matching as below:
awk -F"|" 'NR == 2 {
if (length($1) > 50) {print NR "\t" " Does not meet length requirements"}
if ($1 ~ /["&+%\\|"]/) {print NR "\t" "has an invalid character"}}' Skruggle.csv
The only reason for the double quotes in the pattern is to escape the pipe so it is used literally instead of as alternation. The code successfully matches &, +, % and |; however, it also matches double quotes, which I don't want. How do I match a pipe literally without also matching double quotes in the same regular expression?
Any inputs will be appreciated.
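A minimal sketch of one way to approach this, relying on the fact that inside a bracket expression the pipe is already an ordinary character, so the double quotes can simply be dropped. The sample lines are made up, and the pattern is tested against the whole line ($0) rather than $1: with -F"|" the pipe is the field separator, so $1 itself can never contain one.

```shell
# Hypothetical sample lines: a pipe, a double quote, an ampersand, and a clean line.
out=$(printf 'a|b\nc"d\ne&f\nplain\n' |
  awk '/[&+%\\|]/ { print NR "\thas an invalid character" }')
printf '%s\n' "$out"
```

Lines 1 and 3 are reported; the line holding only a double quote is not.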
BigBaby608

Related

awk / gawk printf when variable format string, changing zero to dash

I have a table of numbers I am printing in awk using printf.
The printf accomplishes some truncation for the numbers.
(cat <<E\OF
Name,Where,Grade
Bob,Sydney,75.12
Sue,Sydney,65.2475
George,Sydney,84.6
Jack,Sydney,35
Amy,Sydney,
EOF
)|gawk 'BEGIN{FS=","}
FNR==1 {print("Name","Where","Grade");next}
{if ($3<50) {$3=0}
printf("%s,%s,%d \n",$1,$2,$3)}'
This produces:
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,0
Amy,Sydney,0
What I want is to display scores which are less than 50, or missing, as a dash ("-").
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
This requires the third format specifier in printf to change from %d to %s.
So in some rows the third column should be a number, and in others it should be a string. How can I tell gawk to do this? Or should I just pipe through another awk to reformat?
$ gawk 'BEGIN{FS=","}
FNR==1 {print("Name","Where","Grade");next}
{if ($3<50) {$3="-"} else {$3=sprintf("%d", $3)}
printf("%s,%s,%s \n",$1,$2,$3)}' ip.txt
Name Where Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
Use if-else to assign a value to $3 as needed.
sprintf lets you assign the result of formatting to a variable.
For this case, you could use the int function as well.
printf can then use %s for $3 too.
Assuming you missed the commas in the header and the space after the third column is not needed, you could do this with a simple one-liner:
$ awk -F, -v OFS=, 'NR>1{$3 = $3 < 50 ? "-" : int($3)} 1' ip.txt
Name,Where,Grade
Bob,Sydney,75
Sue,Sydney,65
George,Sydney,84
Jack,Sydney,-
Amy,Sydney,-
The ?: ternary operator is an alternative to if-else.
1 is an awk idiom that prints the contents of $0.
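The one-liner treats the missing grade as below 50 only because an empty field is compared as a string. A hedged variant that spells out both conditions could look like this (a sketch, using a shortened copy of the sample data; the explicit empty check and the $3 + 0 coercion are additions, not part of the answer above):

```shell
# Shortened copy of the sample table from the question.
out=$(printf 'Name,Where,Grade\nBob,Sydney,75.12\nJack,Sydney,35\nAmy,Sydney,\n' |
  awk -F, -v OFS=, '
    # Header passes through untouched; data rows get "-" for missing or low grades.
    NR > 1 { $3 = ($3 == "" || $3 + 0 < 50) ? "-" : int($3) }
    1')
printf '%s\n' "$out"
```

Being explicit about the empty case avoids depending on how awk chooses between string and numeric comparison.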

awk include column value in pattern

I am looking for a way to pattern match against another column in awk. For example, I wish to find rows for which the value in column 4 is nested in column 5.
Performing awk '$4 ~ /$5/' doesn't work, as the dollar sign is interpreted as part of the regular expression. How do I get the column 5 value into this pattern match?
Many thanks!
If you're looking for a literal match rather than a regex match, you can use
awk 'index($5,$4)' file
This prints the lines where $4 is a substring of $5.
> awk '$2 ~ $1' <<< "field another_field"
field another_field
This prints lines where $2 matches the value of $1 used as a regular expression.
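The two approaches differ as soon as the column value contains a regex metacharacter. A small sketch with a made-up five-column line, where $4 is a.c and $5 is abc:

```shell
# Hypothetical input: $4 contains a dot, which is a metacharacter in a regex.
line='x y z a.c abc'

# Dynamic regex: the dot in $4 matches any character, so "abc" matches.
regex_hit=$(printf '%s\n' "$line" | awk '$5 ~ $4 { print "regex match" }')

# Literal substring test: "a.c" does not occur in "abc", so nothing is printed.
literal_hit=$(printf '%s\n' "$line" | awk 'index($5, $4) { print "literal match" }')

printf 'regex: [%s] literal: [%s]\n' "$regex_hit" "$literal_hit"
```

If the column may contain characters such as ., *, or [, index is probably the safer choice.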

Using awk to remove lines containing a string in two columns of line

I have my input like this:
gi|88193823|ref|NC_007795.1|:3070-3370 gi|387601291|ref|NC_017333.1|:297226-297526 0.361403508772
gi|387601291|ref|NC_017333.1|:216167-216467 gi|88193823|ref|NC_007795.1|:2735510-2735810 0.386440677966
gi|88193823|ref|NC_007795.1|:1278679-1278979 gi|88193823|ref|NC_007795.1|:2735510-2735810 0.392491467577
I want to remove every line that contains 007795 in both column 1 and column 2.
Expected output:
gi|88193823|ref|NC_007795.1|:3070-3370 gi|387601291|ref|NC_017333.1|:297226-297526 0.361403508772
gi|387601291|ref|NC_017333.1|:216167-216467 gi|88193823|ref|NC_007795.1|:2735510-2735810 0.386440677966
I tried
awk '! ( $1 == "/007795/" && $2 == "/007795/" )' 1.txt > 1.temp
I don't know where I am going wrong. Please help me
You don't need the double quotes, since you are using the slashes to delimit the regex literal, and you need to use a regex match instead of an equality comparison, since you want to test if the fields contain the string. The command should look like this:
awk '! ( $1 ~ /007795/ && $2 ~ /007795/ )' file
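As a quick check, here is the same command run against the three sample lines from the question (a sketch; the input is held in a shell variable instead of 1.txt):

```shell
# The three input lines from the question.
input='gi|88193823|ref|NC_007795.1|:3070-3370 gi|387601291|ref|NC_017333.1|:297226-297526 0.361403508772
gi|387601291|ref|NC_017333.1|:216167-216467 gi|88193823|ref|NC_007795.1|:2735510-2735810 0.386440677966
gi|88193823|ref|NC_007795.1|:1278679-1278979 gi|88193823|ref|NC_007795.1|:2735510-2735810 0.392491467577'

# Keep a line unless both whitespace-separated fields 1 and 2 contain 007795.
out=$(printf '%s\n' "$input" | awk '! ($1 ~ /007795/ && $2 ~ /007795/)')
printf '%s\n' "$out"
```

Only the third line has 007795 in both columns, so only that line is dropped.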

awk condition on whether pattern is matched

I am trying to have awk alter a given pattern if matched or return the original line. Here's my code:
printf 'hello,"hru, bro"\nhi,bye\n' | gawk 'match($0, /"([^"]+)"/, m) {if (m[1] == "") {print $0} else {print gsub(/,/,"",m[1])}}'
-> 1
I expect match to return the matched pattern in m[1] and gsub to remove all ',' in m[1] when there is a match. Thus the result should be
-> hello,hru bro\nhi,bye
What am I missing here?
UPDATE
Following Tom's comment, I replaced gsub with gensub, yet I now get the following result:
-> gawk: cmd. line:1: (FILENAME=- FNR=1) warning: gensub: third argument `hru, bro' treated as 1
hello"hru, bro"
gsub mutates the third argument and returns the number of substitutions made - in this case, 1.
I would suggest changing your code to something like this:
awk 'match($0, /([^"]*")([^"]+)(".*)/, m) {
$0 = m[1] gensub(/,/, "", "g", m[2]) m[3]
} 1'
If there is anything surrounded by quotes on the line, then rebuild it, using gensub to remove the commas from the middle captured group (i.e. the part between the double quotes).
Note that gensub takes up to four arguments (regexp, replacement, how, target), where the third specifies which matches to replace: "g" means all of them, and a number N means only the Nth.
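If gawk is not available, roughly the same effect can be had in plain awk with match, RSTART/RLENGTH, and gsub on a copy of the quoted part. This is a sketch under the same assumption of one quoted section per line; the variable name quoted is made up for illustration:

```shell
out=$(printf 'hello,"hru, bro"\nhi,bye\n' |
  awk 'match($0, /"[^"]+"/) {
         quoted = substr($0, RSTART, RLENGTH)   # the quoted part, quotes included
         gsub(/,/, "", quoted)                  # edits quoted in place; the return value is only a count
         $0 = substr($0, 1, RSTART - 1) quoted substr($0, RSTART + RLENGTH)
       }
       1')
printf '%s\n' "$out"
```

Like the gensub version, this keeps the double quotes and only removes the commas between them; lines without a quoted section pass through unchanged.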

why doesn't awk seem to work on splitting into fields based on alternative involving "."?

It is OK to awk split on .:
>printf foo.bar | awk '{split($0, a, "."); print a[1]}'
foo
It is OK to awk split on an alternative:
>printf foo.bar | awk '{split($0, a, "b|a"); print a[1]}'
foo.
Then why is it not OK to split on an alternative involving .:
>printf foo.bar | awk '{split($0, a, ".|a"); print a[1]}'
(nothing printed)
Escape that period and I think you'll be golden:
printf foo.bar | awk '{split($0, a, "\\.|a"); print a[1]}'
JNevill showed how to get it working. But to answer your question of why the escape is needed in one case but not the other, we can find the answer in the awk manual in the summary of "how fields are split, based on the value of FS." (And the same rules apply to the fieldsep given to the split command.)
The bottom line is that when FS is a single character it is not treated as a regular expression but otherwise it is.
Hence split($0, a, ".") works as we hope, taking . to be a literal ., but split($0, a, ".|a") takes .|a to be a regexp in which . has its special meaning of any character. That makes every character a separator, hence the need for the backslashes to have the . treated literally.
FS == " "
Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.
FS == any single character
Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences.
FS == regexp
Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
You can see that, despite the empty result, .|a really is doing something: it divides the line into eight empty fields, the same as a line like ,,,,,,, would with FS set to ,.
$ printf foo.bar | awk '{split($0, a, ".|a"); for (i in a) print i ": " a[i]; }'
4:
5:
6:
7:
8:
1:
2:
3:
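To see the three cases side by side, the field counts returned by split can be compared directly. A small sketch; the bracket expression [.] is just another way to keep the dot literal, alongside the "\\." shown earlier:

```shell
out=$(awk 'BEGIN {
  s = "foo.bar"
  print split(s, a, ".")       # single character: the dot is taken literally
  print split(s, b, ".|a")     # regexp: . matches any character, so every character separates
  print split(s, c, "[.]|a")   # regexp with the dot kept literal inside a bracket expression
  print c[1], c[2], c[3]
}')
printf '%s\n' "$out"
```

The counts come out as 2, 8, and 3, and the last line prints foo b r.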