Working with the awk line matching pattern - awk

The tool awk has line pattern matching like
/pattern/ { statements; }
Is there any way to get the string of pattern as a variable, for use in match expressions etc?
Or even better, directly get:
pattern matched text
pattern matched length
match groups if there are any (groups) in the pattern
within the {statements} block?

If you use GNU awk and, instead of using /pattern/ in the condition part, use match and its third argument match(string, regexp [, array]) you get access to matched text, start index, length and the groups:
$ echo foobar |
gawk 'match($0, /(fo*)(b.*)/, a) {
print a[0],a[0,"start"],a[0,"length"] # 0 index refers to whole matched text
print a[2],a[2,"start"],a[2,"length"] # 1, 2, etc. to matched groups
}'
foobar 1 6
bar 4 3
See GNU awk documentation for match for more info.

You could try the following.
1st: To get the matched text, match is the best option:
awk 'match($0,/regex/){print substr($0,RSTART,RLENGTH)}' Input_file
2nd: To get the length of the matched string:
awk 'match($0,/regex/){print RLENGTH}' Input_file
3rd: To get all matches in a line, call match in a while loop, each time on the part of the line that follows the previous match, until no further match is found.
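The loop described in the 3rd point could look roughly like this. It is only a sketch: all_matches is a made-up helper name, the regex is passed in as a string through -v (so backslashes in it would need doubling), and the RLENGTH > 0 guard is my addition to stop a regex that can match the empty string from looping forever.

```shell
# Print every match of a regex on each input line, one match per output line.
# all_matches is a hypothetical helper name used only for this sketch.
all_matches() {
  awk -v re="$1" '{
    rest = $0
    # match() sets RSTART and RLENGTH; stop when nothing (or only "") matches
    while (match(rest, re) && RLENGTH > 0) {
      print substr(rest, RSTART, RLENGTH)
      # keep searching in the text after the current match
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

printf 'a1b22c333\n' | all_matches '[0-9]+'
```

For the sample line this prints 1, 22 and 333 on separate lines. Only POSIX match() is used, so it should not need gawk.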

Related

filter unique parameters from file

I have a file that contains URLs with parameters, like the following:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
https://example.com/endpoint/endpoint2/?param1=123
https://example.com/endpoint/endpoint2/
https://example.com/endpoint/endpoint5/"//i.example.com/00/s/Nzk5WDEwMjQ=/z/47IAAOSwBu5hXIKF
and I need to keep only the URLs with unique params.
The desired output:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
I managed to keep only the URLs that have params with grep:
grep -E '(\?[a-zA-Z0-9]{1,9}\=)'
but I need to filter on the params at the same time, so I tried awk with the same regex,
but it gives an error:
awk '{sub(\?[a-zA-Z0-9]{1,9}\=)} !seen[$0]++'
Update
I am sorry for editing the desired output, but when I tried the scripts I found that there is a lot of garbage in my file that needs to be filtered out too.
I tried @James Brown's answer with some editing and it looks good, except that it does not filter out the last line:
awk -F '?|&' '$2&&!a[$2]++'
To be clearer about why that output is right for me:
it chose the 1st line because it has at least param1,
the 2nd line because it has at least param3,
the 3rd line because it has at least param2.
The comparison rule is: pick lines by unique parameter, whether or not it is joined to other parameters with the & character.
Edited version, after the requirements changed somewhat:
$ awk -F? '{ # ? as field delimiter
split($2,b,/&/) # split at & to get whats between ? and &
if(b[1]!=""&&!a[b[1]]++) # no ? means no $2
print
}' file
Output as expected. Original answer was:
A short one:
$ awk -F? '$2&&!a[$2]++' file
Explained: Split records at ? (-F?) and if there is a second field ($2) and (&&) it is unique this far by counting the instances of the parameters in the array a (!a[$2]++), output it.
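To make the !a[$2]++ idiom less cryptic, here is a spelled-out sketch of roughly the same logic. The function name first_per_query and the sample lines are made up for illustration; the explicit empty-string test stands in for the "$2 &&" check.

```shell
# first_per_query is a hypothetical name; it keeps the first line seen
# for each distinct query string (the text after the first "?").
first_per_query() {
  awk -F'?' '{
    key = $2                    # text after the first "?", empty if none
    if (key == "") next         # no query string: skip the line
    if (count[key] == 0) print  # first time this key shows up
    count[key]++                # remember it for later lines
  }'
}

printf '%s\n' 'u1/?x=1' 'u2/?x=1' 'u3/' 'u4/?y=2' | first_per_query
```

With the sample input only u1/?x=1 and u4/?y=2 are printed: u2 repeats the query string of u1, and u3 has none.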
EDIT: Following solution may help when query string has ? as well as & present in it and we want to consider both of them for removing duplicates.
awk '
/\?/{
match($0,/\?[^&]*/)
val=substr($0,RSTART,RLENGTH)
match($0,/&.*/)
if(!seen[val]++ && !seen[substr($0,RSTART,RLENGTH)]++){
print
}
}' Input_file
2nd solution: (Following solution may help when we don't have & parameters in query string) With your shown samples, please try following awk program.
awk 'match($0,/\?.*$/) && !seen[substr($0,RSTART,RLENGTH)]++' Input_file
OR the above could be shortened as follows (as per Ed's suggestion); the parentheses around the assignment are required because = binds more loosely than &&:
awk '(s=index($0,"?")) && !seen[substr($0,s)]++' Input_file
Explanation: match finds everything from the ? to the end of the line. The AND condition then makes sure that only the first line with each matched value is printed.
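A small self-contained check of the index-based variant, as a sketch with made-up sample lines. Note the parentheses around the assignment: without them awk parses the condition as s = (index(...) && ...), so s would hold the truth value rather than the position of the ?. uniq_by_query is a hypothetical name.

```shell
# uniq_by_query is a hypothetical helper: keep the first line for each
# distinct "?..." suffix, drop lines that have no "?" at all.
uniq_by_query() {
  # (s = index(...)) stores the position of "?" and is false (0) if absent
  awk '(s = index($0, "?")) && !seen[substr($0, s)]++'
}

printf '%s\n' 'a/?p=1' 'b/?p=1' 'c/' 'd/?q=2' | uniq_by_query
```

This prints a/?p=1 and d/?q=2 for the sample input.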
With gnu awk, you could also match the url till the first occurrence of the question mark, and then capture what follows using your initial pattern for the first parameter ([a-zA-Z0-9]{1,9}=[^&]+) followed by matching any character except the &
Then you can use the !seen[$0]++ part with the value of capture group 1.
awk '
match($0, /https?:\/\/[^?]+\?([a-zA-Z0-9]{1,9}=[^&]+)/, arr) && !seen[arr[1]]++
' file
Output
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
Using awk you can check that the string starts with the protocol and contains a question mark.
Then to get the first parameter only, you can split on ? and & and use the second part of the split for seen
awk '
/^https?:\/\/[^?]*\?/ && split($0, arr, /[?&]/) > 1 && !seen[arr[2]]++
' file

Match string pattern - Replacement via awk-gsub with another pattern

AIM
I want to be able to match a pattern in a string, this using its initial and final boundaries.
I further aim to replace the pattern with "ID=".
STRING
Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=TFCP2.Ca9750.2.YY2017.HT-SE2;strand=+;seq=TTCTGGTTGGGACCAGGA;score=7.62921;pval=6.53e-05;Averageconservationscore=1.77
DESIRED PATTERN OF THE STRING TO BE MATCHED WITH A COMMAND IN AWK
PATTERN
Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=
COMMAND
(/\Class=(.*);id=/)
AWK-GSUB
awk 'BEGIN{FS=OFS="\t"} {gsub(/\Class=(.*);id=/), "ID=", $4) 1'}
I am not sure about the (.*) use !
I commonly employed it in R to select part of a string.
Can this be employed as well in awk-gsub filtering?
Your separator appears to be ';' (not a tab).
To anchor the match at the start of the string, use '^' (not \) at the beginning of the regexp. (.*) works in awk as it does in R; the parentheses are optional here because awk's gsub cannot refer back to groups.
After the replacement, select the columns you want with $number.
awk 'BEGIN{FS=OFS=";"} {gsub(/^Class=(.*);id=/, "id="); print $1, $6}' file > outputfile
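If the goal is literally to replace the leading part with "ID=" as stated in the question, a minimal sketch could be the following. It assumes the string is the whole line (not column 4 of a tab-separated file), relabel_id is a made-up name, and the sample string is shortened from the one in the question.

```shell
# relabel_id is a hypothetical helper: replace everything from the leading
# "Class=" up to and including ";id=" with "ID=".
relabel_id() {
  # sub() is enough: a pattern anchored with ^ can match only once per line
  awk '{ sub(/^Class=.*;id=/, "ID="); print }'
}

printf '%s\n' 'Class=Grainyhead.domain.factors;Family=CP2-related.factors;id=TFCP2.Ca9750.2.YY2017.HT-SE2;strand=+' | relabel_id
```

For the shortened sample this prints ID=TFCP2.Ca9750.2.YY2017.HT-SE2;strand=+. Because .* is greedy, a line containing a second ";id=" further right would be cut at that later one.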

gsub: remove till first occurrence instead of last occurrence of a given character in a line

I have an HTML file from which I am trying to remove the first occurrence of <...> on each line with sub/gsub.
I used the awk regex operators . * + to match anything between < and >. However first occurrence of > is being escaped (?). I don't know if there is a workaround.
sample input file.txt (x is added not to print empty):
<div>fruit</div></td>x
<span>banana</span>x
<br/>apple</td>x
code:
awk '{gsub(/^<.*>/,""); print}' file.txt
current output:
x
x
x
desired output:
fruit</div></td>x
banana</span>x
apple</td>x
With your shown samples, please try the following awk code. It uses awk's sub function to replace the text from the leading < up to and including the first > ([^>]* matches any run of characters other than >) with an empty string, then prints the line, edited or not, via 1.
awk '{sub(/^<[^>]*>/,"")} 1' Input_file
2nd solution: Use awk's match function to match from the leading < up to the first >, then print the rest of the line.
awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}' Input_file
OR, in case you have lines that do not start with < and you want to print them as well, use the following:
awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH);next} 1' Input_file
However first occurrence of > is being escaped (?).
No, you get this result because, as the GNU AWK manual says,
awk (...) regular expressions always match the leftmost, longest
sequence of input characters that can match
This is called greedy matching in other languages' regular expressions, so for
<div>fruit</div></td>x
/^<.*>/ does match
<div>fruit</div></td>
so you end up with x. In languages that support so-called non-greedy matching you can use it in such a case, for example in ECMAScript
let str = "<div>fruit</div></td>x";
let out_str = str.replace(/^<.*?>/, "");
console.log(out_str);
output
fruit</div></td>x
As the GNU AWK manual says, matching in GNU AWK is always longest (greedy), so you have to use [^>], i.e. any character but >, to prevent the match from spanning from the first < to the last >.
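The difference can be seen directly in awk by comparing how much text each pattern consumes. This is just an illustrative sketch (tag_match_lengths is a made-up name); match() reports the matched length in RLENGTH.

```shell
# tag_match_lengths is a hypothetical helper: report how many characters
# the greedy pattern and the bounded pattern match at the start of a line.
tag_match_lengths() {
  awk '{
    match($0, /^<.*>/);    greedy  = RLENGTH   # leftmost, longest match
    match($0, /^<[^>]*>/); bounded = RLENGTH   # stops at the first ">"
    print "greedy=" greedy, "bounded=" bounded
  }'
}

printf '<div>fruit</div></td>x\n' | tag_match_lengths
```

For the sample line this prints greedy=21 bounded=5: the greedy pattern swallows everything up to the last >, the bounded one only <div>.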

Gawk matching one word - one unexpected match

I wanted to get all matches in Column 3 which have the exact word "aa" (case insensitive match) in the string in Column 3
The gawk command used in the awk file is:
$3 ~ /\<aa\>/
The BEGIN statement specifies: IGNORECASE = 1
The command returns 20 rows. What is puzzling is this value in Column 3 in the returned rows:
aA.AHAB
How do I avoid this row? It is not a word by itself, because there is a dot following the first two a's and not a space.
A is a word character. . is not a word character. \> matches the zero-width string at the end of a word. Such a zero-width string occurs between the second A and the dot.
To search for the string aa delimited by space characters (or start/end of field):
$3 ~ /(^|[ ])aa([ ]|$)/
Add any other characters that you care about inside the set ([ ]).
Note that by default, awk splits records into fields on whitespace, so you will not get any spaces in $3 unless you have changed the value of FS.
1st solution: to match exactly aa, try:
awk 'BEGIN{IGNORECASE=1} $3 ~ /^aa$/' Input_file
2nd solution: without the IGNORECASE option, try:
awk 'tolower($3)=="aa"' Input_file
Question: Why does the awk regex-pattern /\<aa\>/ match a string like: "aa.bbb"?
We can quickly verify this with:
$ echo aa.bbb | awk '/\<aa\>/'
aa.bbb
The answer is simply found in the manual of gnu awk:
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\<: Matches the empty string at the beginning of a word. For example, /\<away/ matches "away" but not "stowaway".
\>: Matches the empty string at the end of a word. For example, /stow\>/ matches "stow" but not "stowaway".
source: GNU awk manual: Section 3 :: Regular Expressions
So, to come back to the example above, the string "aa.bbb" contains two words, "aa" and "bbb", since the <dot> character is not part of the character set that can make up a word. The empty strings matched here are the one before "aa.bbb" and the one between the second a and the dot (an empty string really is empty: length 0, 0 characters, commonly written as "").
Solution to the OP: Since FS is most likely the default value, the field $3 cannot have a space. So the following two solutions are possible:
$3 ~ /^aa$/
$3 == "aa"
If the field separator FS is defined in the code, the following might work:
(" " $3 " ") ~ / aa /
$3 ~ /(^|[ ])aa([ ]|$)/ # See solution of JHNC
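A portable sketch of the padded-field idea, without gawk's IGNORECASE or \< \>. It assumes tab-separated input so that field 3 can contain spaces; the function name and the sample rows are made up for illustration.

```shell
# field3_has_word_aa is a hypothetical helper: print rows whose third
# tab-separated field contains "aa" as a space-delimited word, any case.
field3_has_word_aa() {
  # pad the lowercased field with spaces so that start and end of the field
  # behave like word boundaries, then look for " aa "
  awk -F'\t' '(" " tolower($3) " ") ~ / aa /'
}

printf 'r1\tc2\txx aa yy\nr2\tc2\taA.AHAB\nr3\tc2\tAA\n' | field3_has_word_aa
```

Rows r1 and r3 are printed; r2 (aA.AHAB) is not, because a dot rather than a space follows the two a's.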

awk and regular expressions confusion

Having never used awk before on Linux I am attempting to understand how it matches regular expressions. For example in the past based on my experience the regular expression /2/ would match 2 in all of the following lines.
This will match 2
This will not match 2
Now if I run the command awk '{if(NR~2)print}' sample.txt which has the contents
2 will be matched
This will not match 2
2 may be matched
The line that is matched is This will not match 2, which indicates that it is matching line 2, because if I replace the command with awk '{if(NR~3)print}' sample.txt it matches 2 may be matched. If I also run the command awk '{if(NR~/^2$/)print}' sample.txt, it matches the exact same line, i.e. line 2.
However the source I am referring to at http://www.youtube.com/watch?feature=player_detailpage&v=Htnno4CHVus#t=502s seems to indicate otherwise.
What am I missing and how is the command awk '{if(NR~2)print}' sample.txt different to that of awk '{if(NR~/^2$/)print}' sample.txt?
The condition NR~2 is checking whether the record number, NR, matches 2. For a 2 or 3 line input file, the expression is equivalent to:
if (NR == 2)
Similarly with NR~3, of course. Try:
awk '/2/'
That will print all lines where the text of the line ($0) contains a 2. By default, a regular expression matches against the whole line; you could limit it to a particular field with $3 ~ /3/, for example.
An awk program consists of patterns and actions, where either the pattern or the action may be omitted.
awk '{ if ($0 ~ /2/) print }
/2/
/2/ { if ($0 ~ /a.*z/) print "Matches a.*z"; }'
The first line has no pattern; the action in the { ... } is executed for each input line (but only some input lines will generate output because of the conditional). All lines that contain a 2 will be printed. (If there is no argument to print, it prints $0 followed by a newline.)
The second line has a pattern but no action; all lines that contain a 2 will be printed again. (The missing action is equivalent to { print }.)
The third line has both a pattern and an action; all lines that both contain a 2 and also contain an 'a' followed by a 'z' will be remarked upon.
How are these two commands different?
`awk '{if(NR~2)print}' sample.txt`
`awk '{if(NR~/^2$/)print}' sample.txt`
The first command will print line numbers 2, 12, 20..29, 32, 42, ... 102, 112, 120..129, ... 200..299, ...; all lines where the line number contains a 2.
The second command will print only line number 2 because the /^2$/ constrains the value to contain start of string, digit 2 and end of string.
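This is easy to check on a longer input. The sketch below generates 30 dummy lines and lists the record numbers each condition selects; the helper names are made up for illustration.

```shell
# gen_lines, nr_contains_2 and nr_is_exactly_2 are hypothetical helpers.
gen_lines() { awk 'BEGIN { for (i = 1; i <= 30; i++) print "line " i }'; }

# Collect the record numbers whose decimal form contains a 2.
nr_contains_2() {
  awk 'NR ~ 2 { sep = (out == "" ? "" : " "); out = out sep NR } END { print out }'
}

# Collect the record numbers that are exactly "2".
nr_is_exactly_2() {
  awk 'NR ~ /^2$/ { sep = (out == "" ? "" : " "); out = out sep NR } END { print out }'
}

gen_lines | nr_contains_2
gen_lines | nr_is_exactly_2
```

The first call prints 2 12 20 21 22 23 24 25 26 27 28 29, the second prints only 2.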
I take it that means that the source is wrong?
Now I've looked at the YouTube resource, I think you must have misunderstood what it is trying to teach. When it talks about {if (NR~2) print}, it should be saying it will print any line number which contains a 2; the video cites line numbers 2, 12, 20, 21, 22, etc. It should not be saying any line which contains a 2; I think the video does say that, but the video misspoke (and the text was accurate). The comparison against NR is not actually wrong, but it is unconventional; I'm not sure that I'd include regexes against NR in an introductory video describing awk. So, the video appears to have a glitch in the audio but the text on screen is accurate, I think. I may still have missed something.
The command awk '{ if ($0 ~ /2/) print }' against the file sample.txt with the contents I mentioned would only result in the output 2 will be matched. Is that correct?
That command, given the input:
2 will be matched
This will not match 2
2 may be matched
will print all three lines; they all contain the digit 2.
I also thought that the action was print and the pattern was $0 ~ /2/.
No; the pattern was empty (because there was nothing before the open brace) — so all lines match it — and the action was the part in braces { if ($0 ~ /2/) print }. Now, the action contains a conditional, but that's a separate issue.
Now the command awk '/2/' sample.txt would print all three lines. Is that correct?
Yes.
NR means the Number of the Record being processed...
You are matching against line number 2.