awk partial matching not being printed - awk

awk -F"," -v var=$test '$1 ~ /^var$/{print}' alpha.txt
I tried hard-coding the value of my variable into the pattern and found that the code works. However, when I tried, for example, /^ppl$/ to search for a partial match of apple, nothing is displayed. Can someone give me some guidance on how to pass my variable into the command?

Try this:
awk -F"," -v var="$test" '$1 ~ "^"var"$"' alpha.txt

awk -F"," '$1 ~ /^'"$test"'$/{print}' alpha.txt

If you're anchoring the match at the beginning and the end, just use ==
awk -F"," -v var=$test '$1 == var' alpha.txt
Unless $test contains a regular expression, in which case @Kent has the right answer.
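To make the difference between these answers concrete, here is a minimal sketch. The sample rows and the value ppl are assumptions standing in for alpha.txt and $test; they do not come from the question. It contrasts an exact comparison, the anchored dynamic regex, and a literal partial match with index(), which is probably what the asker wanted when searching for ppl inside apple.

```shell
# Hypothetical sample rows standing in for alpha.txt
sample() { printf '%s\n' 'apple,1' 'ppl,2' 'grape,3'; }
test=ppl   # assumed value of the shell variable

# Exact match: the whole first field must equal the value
exact=$(sample | awk -F, -v var="$test" '$1 == var')

# Anchored dynamic regex: behaves like == when var has no regex metacharacters
anchored=$(sample | awk -F, -v var="$test" '$1 ~ "^" var "$"')

# Partial literal match: index() is non-zero when var occurs anywhere in $1
partial=$(sample | awk -F, -v var="$test" 'index($1, var)')

printf 'exact: %s\nanchored: %s\npartial:\n%s\n' "$exact" "$anchored" "$partial"
```

With this sample, the first two forms select only the ppl row, while index() also selects the apple row.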

Related

Chain awk regex matches like grep

I am trying to use awk to select/remove data based on cell entries in a CSV file.
How do I chain Awk commands to build up complex searches like I have done with grep? I plan to use Awk to select rows based on matching criteria in cells in multiple columns, not just the first column as in this example.
Test data
123,line1
123a,line2
abc,line3
G-123,line4
G-123a,line5
Separate Awk statements with intermediate files
awk '$1 !~ /^[[:digit:]]/ {print $0}' file.txt > output1.txt
awk '$1 !~ /^G-[[:digit:]]/ {print $0}' output1.txt > output2.txt
mv output2.txt output.txt
cat output.txt
Chained or multi-line grep version (I think limited to first column only)
grep -v \
-e "^[[:digit:]]" \
-e "^G-[[:digit:]]" \
file.txt > output.txt
cat output.txt
How can I rewrite the Awk command to avoid the intermediate files?
Generally, in awk there are boolean operators available (it's better than grep! :) )
awk '/match1/ || /match2/' file
awk '(/match1/ || /match2/ ) && /match3/' file
and so on ...
In your example you could use something like:
awk -F, '$1 ~ /^[[:digit:]]/ || $1 ~ /^G-[[:digit:]]/' input >> output
Note: This is just an example of how to use boolean operators. Also the regular expression itself could have been used here to express the alternative match:
awk -F, '$1 ~ /^(G-)?[[:digit:]]/' input >> output
In your awk commands and example, awk regards file.txt as having only one field because you have not defined FS, so the default whitespace field separator is used.
With that said, you can easily AND your two pattern matches together like this:
awk '($1 !~ /^[[:digit:]]/) && ($1 !~ /^G-[[:digit:]]/) {print $0}' file.txt
To make awk use comma as a field separator, you can define it in a BEGIN block. In this example, the output should be just line3
awk 'BEGIN {FS=","} ($1 !~ /^[[:digit:]]/) && ($1 !~ /^G-[[:digit:]]/) {print $2}' file.txt
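As a rough check of the boolean approach, the sketch below runs it against the test data from the question. The second command is an assumed illustration of combining criteria on more than one column, which the question mentions as the eventual goal; its particular conditions are made up for the example.

```shell
# Test data from the question
data() { printf '%s\n' '123,line1' '123a,line2' 'abc,line3' 'G-123,line4' 'G-123a,line5'; }

# Drop rows whose first column starts with a digit or with G-<digit>
kept=$(data | awk -F, '$1 !~ /^[[:digit:]]/ && $1 !~ /^G-[[:digit:]]/')

# Criteria on several columns at once (illustrative conditions):
# column 1 starts with G- and column 2 ends in 4
multi=$(data | awk -F, '$1 ~ /^G-/ && $2 ~ /4$/')

printf '%s\n' "$kept" "$multi"
```

No intermediate files are needed because every condition is evaluated against the same record in a single pass.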
I would suggest the literal translation of that grep command in awk is
awk '
/^[[:digit:]]/ {next}
/^G-[[:digit:]]/ {next}
{print}
' file.txt
But you have several examples of how to write it more concisely.
You can use
awk '$1 !~ /^(G-)?[[:digit:]]/' file.txt > output.txt
The pattern is tested against field 1 and looks for:
^ - start of string
(G-)? - an optional G- character sequence (the regex flavor in awk is POSIX ERE, so (...) denotes a group and ? denotes a zero-or-one quantifier)
[[:digit:]] - a digit.
If a match is found, the record (= line) is not printed. Otherwise, the line is printed.
To stick to your question, I would use:
awk '$1 !~ /^[[:digit:]]/ && $1 !~ /^G-[[:digit:]]/' file.txt > output.txt
But I like @Wiktor Stribiżew's regex approach!
With your shown samples, this can also be done in grep with a single regexp, so there is no need to chain different patterns. I am adding this solution in case you or anyone else needs it.
grep -v -E '^(G-)?[[:digit:]]' Input_file
Explanation: grep's -v option omits lines that match the given pattern, and -E enables ERE (extended regular expressions). The regex ^(G-)?[[:digit:]] matches lines that start with a digit, optionally preceded by G-, so those lines are not printed.

Regexp in gawk matches multiple ways

I have some text I need to split up to extract the relevant argument, and my [g]awk match command does not behave - I just want to understand why?! (I have written a less elegant way around it now...).
So the string is blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header
I want to output just the contents of msgcontent1=, so did
echo "blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header" | gawk '{ if (match($0,/msgcontent1=(.*)[|]/,a)) { print a[1]; } }'
Trouble is, instead of getting
HeaderUUIiewConsenFlagPSMessage
I get the match with everything from there to the last pipe of the string HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002
Now I accept this is because the regexp in /msgcontent1=(.*)[|]/ can match multiple ways, but HOW do I make it match the way I want it to??
With your shown samples, please try the following. Written and tested in GNU awk, it prints only the contents from msgcontent1= up to the first occurrence of |.
awk 'match($0,/msgcontent1=[^|]*/){print substr($0,RSTART+12,RLENGTH-12)}' Input_file
OR with echo + awk try:
echo "blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header" |
awk 'match($0,/msgcontent1=[^|]*/){print substr($0,RSTART+12,RLENGTH-12)}'
With FPAT option in GNU awk:
awk -v FPAT='msgcontent1=[^|]*' '{sub(/.*=/,"",$1);print $1}' Input_file
This is your input:
s='blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header'
You may use gnu awk like this to extract value after msgcontent1=:
awk -F= -v RS='|' '$1 == "msgcontent1" {print $2}' <<< "$s"
HeaderUUIiewConsenFlagPSMessage
or using this sed:
sed -E 's/^(.*\|)?msgcontent1=([^|]+).*/\2/' <<< "$s"
HeaderUUIiewConsenFlagPSMessage
Or using this gnu grep:
grep -oP '(^|\|)msgcontent1=\K[^|]+' <<< "$s"
HeaderUUIiewConsenFlagPSMessage
echo "blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header" | awk '{ if (match($0,/msgcontent1=([^\|]*)/,a)) print a[1] }'
this prints HeaderUUIiewConsenFlagPSMessage
The reason your regex captures HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002 is that matching is greedy ('hungry'), so .* always finds the longest possible match and runs up to the last | in the line.
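The greedy behaviour can be seen side by side in a small sketch. It uses the two-argument match() with RSTART and RLENGTH so that it does not depend on gawk's array extension; the offsets 12 and 13 are the length of msgcontent1= and that length plus the trailing pipe.

```shell
s='blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header'

# Greedy: .* extends to the LAST pipe in the line
greedy=$(printf '%s\n' "$s" |
  awk 'match($0, /msgcontent1=.*[|]/) { print substr($0, RSTART + 12, RLENGTH - 13) }')

# Bounded: [^|]* cannot cross a pipe, so it stops before the first one
bounded=$(printf '%s\n' "$s" |
  awk 'match($0, /msgcontent1=[^|]*/) { print substr($0, RSTART + 12, RLENGTH - 12) }')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

Replacing . with a negated bracket expression is the usual workaround, since POSIX ERE has no non-greedy quantifier.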
Also with awk:
echo 'blahblah|msgcontent1=HeaderUUIiewConsenFlagPSMessage|msgtype2=Blah002|msgcontent2=header' | awk -v FS='[=|]' '$2 == "msgcontent1" {print $3}'
HeaderUUIiewConsenFlagPSMessage

awk command to read a key value pair from a file

I have a file input.txt which stores information in KEY:VALUE form. I'm trying to read GOOGLE_URL from this input.txt, but it prints only https because the separator is :. What is the problem with my command, and how should I print the entire URL?
SCRIPT
$> cat script.sh
#!/bin/bash
URL=`grep -e '\bGOOGLE_URL\b' input.txt | awk -F: '{print $2}'`
printf " $URL \n"
INPUT_FILE
$> cat input.txt
GOOGLE_URL:https://www.google.com/
OUTPUT
https
DESIRED_OUTPUT
https://www.google.com/
Since there are multiple : in your input, getting $2 will not work in awk because it will just give you 2nd field. You actually need an equivalent of cut -d: -f2- but you also need to check key name that comes before first :.
This awk should work for you:
awk -F: '$1 == "GOOGLE_URL" {sub(/^[^:]+:/, ""); print}' input.txt
https://www.google.com/
Or this non-regex awk approach that allows you to pass key name from command line:
awk -F: -v k='GOOGLE_URL' '$1==k{print substr($0, length(k FS)+1)}' input.txt
Or using gnu-grep:
grep -oP '^GOOGLE_URL:\K.+' input.txt
https://www.google.com/
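The idea of splitting only at the first colon can be wrapped in a small helper. This is a sketch under assumptions: get_value and sample are hypothetical names, and the sample lines are modelled on input.txt with one extra line added to show that a key name appearing later in a line is ignored.

```shell
get_value() {
  # $1 = key name (hypothetical helper); reads KEY:VALUE lines from stdin
  awk -v k="$1" '{
    pos = index($0, ":")                    # position of the first colon only
    if (pos && substr($0, 1, pos - 1) == k) {
      print substr($0, pos + 1)             # the rest of the line, colons included
      exit
    }
  }'
}

# Sample input modelled on input.txt, plus a line where the key is not first
sample() {
  printf '%s\n' 'EXAMPLE_URL:http://www.example.com/' \
                'GOOGLE_URL:https://www.google.com/' \
                'KEY:GOOGLE_URL:'
}

url=$(sample | get_value GOOGLE_URL)
printf '%s\n' "$url"
```

Because the key is compared as a plain string, it does not need regex escaping.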
Could you please try the following, written and tested with the shown samples in GNU awk. It looks for the string GOOGLE_URL and then captures the URL, whether it starts with http or https. If you need only https, change http[s]? to https in the solution below.
awk '/^GOOGLE_URL:/{match($0,/http[s]?:\/\/.*/);print substr($0,RSTART,RLENGTH)}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^GOOGLE_URL:/{ ##Checking condition if line starts from GOOGLE_URL: then do following.
match($0,/http[s]?:\/\/.*/) ##Using match function to match http[s](s optional) : till last of line here.
print substr($0,RSTART,RLENGTH) ##Printing sub string of matched value from above function.
}
' Input_file ##Mentioning Input_file name here.
2nd solution: In case you need anything coming after first : then try following.
awk '/^GOOGLE_URL:/{match($0,/:.*/);print substr($0,RSTART+1,RLENGTH-1)}' Input_file
Take your pick:
$ sed -n 's/^GOOGLE_URL://p' file
https://www.google.com/
$ awk 'sub(/^GOOGLE_URL:/,"")' file
https://www.google.com/
The above will work using any sed or awk in any shell on every UNIX box.
I would use GNU AWK in the following way for that task.
Let the content of file.txt be:
EXAMPLE_URL:http://www.example.com/
GOOGLE_URL:https://www.google.com/
KEY:GOOGLE_URL:
Then:
awk 'BEGIN{FS="^GOOGLE_URL:"}{if(NF==2){print $2}}' file.txt
will output:
https://www.google.com/
Explanation: in GNU AWK, FS may be a pattern, so I set it to GOOGLE_URL: anchored (^) to the beginning of the line, so that GOOGLE_URL: in the middle or at the end of a line is not treated as a separator (consider the 3rd line of the input). With this FS there are either 1 or 2 fields in each line. The latter is the case only if the line starts with GOOGLE_URL:, so I check the number of fields (NF) and, if it is 2, print the 2nd field ($2), since the first field is empty in this case.
(tested in gawk 4.2.1)
Yet another awk alternative:
gawk -F'(^[^:]*:)' '/^GOOGLE_URL:/{ print $2 }' infile

Regex "^[[:digit:]]$" not working as expected in AWK/GAWK

My GAWK version on RHEL is:
gawk-3.1.5-15.el5
I wanted to print a line if the first field of it has all digits (no special characters, even space to be considered)
Example:
echo "123456789012345,3" | awk -F, '{if ($1 ~ /^[[:digit:]]$/) print $0}'
Output:
Nothing
Expected Output:
123456789012345,3
What is going wrong here ? Does my AWK version not understand the GNU character classes ? Kindly help
To match multiple digits with the [[:digit:]] character class, add a +, which means match one or more digits in $1.
echo "123456789012345,3" | awk -F, '{if ($1 ~ /^([[:digit:]]+)$/) print $0}'
123456789012345,3
which satisfies your requirement.
A more idiomatic way (as suggested in the comments) would be to drop the explicit if and print, and use the match directly as the pattern, which prints matching lines by default,
echo "123456789012345,3" | awk -F, '$1 ~ /^([[:digit:]]+)$/'
123456789012345,3
Some more examples which demonstrate the same,
echo "a1,3" | awk -F, '$1 ~ /^([[:digit:]]+)$/'
(and)
echo "aa,3" | awk -F, '$1 ~ /^([[:digit:]]+)$/'
do NOT produce any output, as per the requirement.
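To see why the original pattern printed nothing, the sketch below labels a few sample inputs. The classify helper and its labels are hypothetical, introduced only for illustration: without a quantifier, ^[[:digit:]]$ accepts exactly one digit, while + accepts one or more.

```shell
classify() {
  # $1 = a comma-separated line; reports how its first field matches
  printf '%s\n' "$1" | awk -F, '
    $1 ~ /^[[:digit:]]$/  { print "one-digit";  next }
    $1 ~ /^[[:digit:]]+$/ { print "all-digits"; next }
                          { print "other" }'
}

classify '7,3'                 # a single digit satisfies the original pattern
classify '123456789012345,3'   # needs the + quantifier
classify 'a1,3'                # contains a non-digit
classify ',3'                  # empty first field: + requires at least one digit
```

The last case is where + and * differ: a pattern written with * would also accept an empty first field.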
Another POSIX compliant way to do strict length checking of digits can be achieved with something like below, where {3} denotes the match length.
echo "123,3" | awk --posix -F, '$1 ~ /^[0-9]{3}$/'
123,3
(and)
echo "12,3" | awk --posix -F, '$1 ~ /^[0-9]{3}$/'
does not produce any output.
If you are using a relatively recent version of the bash shell, it supports a native regex operator, =~, which accepts POSIX character classes as above, something like
#!/bin/bash
while IFS=',' read -r row1 row2
do
[[ $row1 =~ ^([[:digit:]]+)$ ]] && printf "%s,%s\n" "$row1" "$row2"
done < file
For an input file say file
$ cat file
122,12
a1,22
aa,12
The script produces,
$ bash script.sh
122,12
Although this works, bash regex matching can be slower. A relatively straightforward alternative using string manipulation would be something like
while IFS=',' read -r row1 row2
do
[[ -z "${row1//[0-9]/}" ]] && printf "%s,%s\n" "$row1" "$row2"
done < file
The "${row1//[0-9]/}" strips all the digits from the row and the condition becomes true only if there are no other characters left in the variable.
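Where bash-specific expansion is not available, a similar check can be sketched with a plain case statement, which works in any POSIX shell. The is_all_digits name is hypothetical. Unlike the -z test above, this version treats an empty field as not numeric, which is an assumption about the desired behaviour.

```shell
is_all_digits() {
  case $1 in
    '' | *[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    *)             return 0 ;;
  esac
}

# Same sample rows as above, plus one row with an empty first field
while IFS=',' read -r row1 row2; do
  is_all_digits "$row1" && printf '%s,%s\n' "$row1" "$row2"
done <<EOF
122,12
a1,22
aa,12
,7
EOF
```

Only the 122,12 row passes the check.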
Here you are printing every line that matches a pattern. This is exactly the purpose of grep. Since @Inian brilliantly told you what was wrong with your code, let me propose an alternative grep-based answer that does exactly the same as the awk command (albeit much faster):
grep -E '^[[:digit:]]+,'
Could you please try following and let me know if this helps.
echo "123456789012345,3" | awk -F, '{if ($1 ~ /^([[:digit:]]*)$/) print $0}'
EDIT: Above code could be reduced a bit to as follows too.
echo "123456789012345,3" | awk -F, '($1 ~ /^[[:digit:]]*$/)'

How to search by variable in awk

I'm trying to get the N-th row after a given pattern with awk.
The problem is that awk searches pattern literally:
awk -v patt=${1} -v rows=${2} 'NR==p {print} /patt/ {p=NR+rows}'
How to escape the "patt" variable?
Use the awk matching operator instead of the slashes:
awk -v patt=${1} -v rows=${2} 'NR==p {print} $0 ~ patt {p=NR+rows}'
I've managed to get it to work with double quotes:
patt=${1}
awk -v rows=${2} "NR==p {print} /${patt}/ {p=NR+rows}" $3
There's nothing special about the string containing the awk program, so you can build it as usual in the shell, e.g.:
awk -v rows=${2} 'NR==p {print} /'"$1"'/ {p=NR+rows}'
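For comparison, here is a small sketch of the $0 ~ patt variant with both variables quoted, wrapped in a hypothetical nth_after helper and run on made-up input. Passing the pattern with -v keeps it out of the program text, so a / in the pattern does not break the script. Note that -v values go through awk's escape-sequence processing, so a pattern containing backslashes may need them doubled.

```shell
nth_after() {
  # $1 = pattern (treated as a regex), $2 = how many rows below the match
  awk -v patt="$1" -v rows="$2" 'NR == p { print } $0 ~ patt { p = NR + rows }'
}

# Made-up input: the second row after START is "c"
result=$(printf '%s\n' a START b c d | nth_after START 2)
printf '%s\n' "$result"
```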