What order do location directives fire in?
From the HTTP core module docs:
1. Directives with the "=" prefix that match the query exactly. If found, searching stops.
2. All remaining directives with conventional strings. If this match used the "^~" prefix, searching stops.
3. Regular expressions, in the order they are defined in the configuration file.
4. If #3 yielded a match, that result is used. Otherwise, the match from #2 is used.
Example from the documentation:
location = / {
# matches the query / only.
[ configuration A ]
}
location / {
# matches any query, since all queries begin with /, but regular
# expressions and any longer conventional blocks will be
# matched first.
[ configuration B ]
}
location /documents/ {
# matches any query beginning with /documents/ and continues searching,
# so regular expressions will be checked. This will be matched only if
# regular expressions don't find a match.
[ configuration C ]
}
location ^~ /images/ {
# matches any query beginning with /images/ and halts searching,
# so regular expressions will not be checked.
[ configuration D ]
}
location ~* \.(gif|jpg|jpeg)$ {
# matches any request ending in gif, jpg, or jpeg. However, all
# requests to the /images/ directory will be handled by
# Configuration D.
[ configuration E ]
}
If it's still confusing, here's a longer explanation.
It fires in this order.
= (exactly)
location = /path
^~ (forward match)
location ^~ /path
~ (regular expression case sensitive)
location ~ /path/
~* (regular expression case insensitive)
location ~* \.(jpg|png|bmp)
/
location /path
There is a handy online tool for testing location priority now:
location priority testing online
Locations are evaluated in this order:
1. location = /path/file.ext {} Exact match
2. location ^~ /path/ {} Priority prefix match -> longest first
3. location ~ /Paths?/ {} (case-sensitive regexp) and location ~* /paths?/ {} (case-insensitive regexp) -> first match
4. location /path/ {} Prefix match -> longest first
The priority prefix match (number 2) works exactly like the common prefix match (number 4), but takes priority over any regexp.
For both prefix match types, the longest match wins.
Case-sensitive and case-insensitive regexps have the same priority. Among regexps, evaluation stops at the first one that matches, in configuration order.
Documentation says that all prefix rules are evaluated before any regexp, but if one regexp matches then no standard prefix rule is used. That's a little bit confusing and does not change anything for the priority order reported above.
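If it helps to see the selection rules as code, here is a minimal Python sketch of the order described above. It is only a simulation for reasoning about priority, not how nginx is implemented: the function name pick_location and the tuple layout are made up for illustration, and it ignores named locations, nested locations and URI normalization.

```python
import re

def pick_location(uri, locations):
    """Return the label of the location block chosen for `uri`.

    `locations` is a list of (modifier, pattern, label) tuples in
    configuration order; modifier is "=", "^~", "~", "~*" or "" (plain prefix).
    """
    # Step 1: an exact match wins immediately.
    for mod, pattern, label in locations:
        if mod == "=" and uri == pattern:
            return label

    # Step 2: remember the longest matching prefix (plain or ^~).
    best = None
    for mod, pattern, label in locations:
        if mod in ("", "^~") and uri.startswith(pattern):
            if best is None or len(pattern) > len(best[1]):
                best = (mod, pattern, label)
    # A ^~ winner stops the search before any regexp is tried.
    if best is not None and best[0] == "^~":
        return best[2]

    # Step 3: regexps in configuration order; the first match wins.
    for mod, pattern, label in locations:
        if mod == "~" and re.search(pattern, uri):
            return label
        if mod == "~*" and re.search(pattern, uri, re.IGNORECASE):
            return label

    # Step 4: no regexp matched, so fall back to the longest prefix (if any).
    return best[2] if best is not None else None

# The five blocks from the documentation example, labelled A to E.
LOCATIONS = [
    ("=", "/", "A"),
    ("", "/", "B"),
    ("", "/documents/", "C"),
    ("^~", "/images/", "D"),
    ("~*", r"\.(gif|jpg|jpeg)$", "E"),
]

for uri in ["/", "/index.html", "/documents/document.html",
            "/images/1.gif", "/documents/1.jpg"]:
    print(uri, "->", pick_location(uri, LOCATIONS))
```

Running it on the documentation example picks A, B, C, D and E for the five URIs, in that order.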
This one may have been answered but I can't seem to find one that fits my specifics. So apologies if it sounds old.
My rewrite consists of using a rewrite map and a redirect to another domain.
For example:
An old site has moved, and new IDs have been generated along with file name changes.
So this URL:
https://www.example.com/gallery/951/med_U951I1424251158.SEQ.0.jpg
would need to be redirected to
https://www.example2.com/gallery/5710/med_U5710I1424251158.SEQ.0.jpg
The new value 5710 comes from a rewrite map lookup where I pass 951.
There are two changes needed, and the 2nd change has two possible variations:
The first change is gallery/951 to gallery/5710
The 2nd change is the filename where the id can be delimited in either of 2 ways:
U951I - delimited between a U and an I
or
U951. - delimited between a U and a .
I started with something like this:
RewriteRule "^gallery/(\d+)/(\s+)$"
But that is as far as I can get.
You could do something like the following:
RewriteCond ${MapName:$1|XXX} (.+)
RewriteRule ^/?gallery/(\d+)/(.*U)\1([I.][^/]+)$ https://www.example2.com/gallery/%1/$2%1$3 [R=302,L]
I'm assuming the domain being redirected to resides on a different server, otherwise (if this is in .htaccess) you would need to check the requested hostname in an additional condition (RewriteCond directive).
Explanation:
The regex ^/?gallery/(\d+)/(.*U)\1([I.][^/]+)$ matches and captures the relevant parts of the source URL...
^/?gallery/ - Literal text (optional slash prefix if this is being used in a server context, as opposed to .htaccess)
(\d+) - Captures the "old" member ID (later available using the $1 backreference and used to lookup the "new" member ID from the RewriteMap)
/ - literal slash
(.*U) - Captures the first part of the filename before the "old" member ID. Later available in the $2 backreference.
\1 - An internal backreference matches the "old" member ID that occurred in the 2nd path segment - Not captured.
([I.][^/]+)$ - Captures the last part of the filename after the "old" member ID - Later available in the $3 backreference. The character class [I.] matches either an I or a literal . (the dot does not need to be escaped when used inside a character class). And [^/]+ matches everything thereafter to the end of the filename (URL-path).
The RewriteCond directive serves to lookup the new member ID (using the old member ID saved in the $1 backreference) in the rewrite map defined earlier in the server config. The result is then captured, which is later available in the %1 backreference. I set a default XXX so it would be easy to spot any lookups that fail.
Using the RewriteCond directive means we only lookup the rewrite map once. (Results are probably cached anyway, but it saves repetition at the very least.)
/%1/$2%1$3 - The substitution string is then constructed from the backreferences captured earlier:
%1 is the new member ID (captured in the preceding CondPattern)
$2 is the part of the filename before the member ID (captured by the RewriteRule pattern against the URL-path).
$3 is the part of the filename after the member ID.
Note that backreferences of the form $n refer to captured groups in the RewriteRule pattern and backreferences of the form %n refer to captured groups in the last matched CondPattern (RewriteCond directive).
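To sanity-check the pattern outside Apache, here is a small Python sketch that applies the same regex and builds the same substitution. It is only an approximation of what mod_rewrite does: the dict ID_MAP is a hypothetical stand-in for the RewriteMap, and redirect_target is a made-up helper name.

```python
import re

# Hypothetical stand-in for the RewriteMap: old member ID -> new member ID.
ID_MAP = {"951": "5710"}

# Same pattern as the RewriteRule; \1 must repeat the ID from the path segment.
PATTERN = re.compile(r"^/?gallery/(\d+)/(.*U)\1([I.][^/]+)$")

def redirect_target(url_path):
    """Return the redirect URL, or None if the rule would not apply."""
    m = PATTERN.match(url_path)
    if m is None:
        return None
    old_id, before, after = m.groups()
    # Same "XXX" default as the RewriteCond, so failed lookups are easy to spot.
    new_id = ID_MAP.get(old_id, "XXX")
    return f"https://www.example2.com/gallery/{new_id}/{before}{new_id}{after}"

print(redirect_target("/gallery/951/med_U951I1424251158.SEQ.0.jpg"))
print(redirect_target("/gallery/951/med_U951.jpg"))
```

The first call covers the U951I form and the second the U951. form. A path whose filename ID differs from the directory ID returns None, because the \1 backreference fails to match.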
I have a simple grammar, and I am using it to parse some text. The text is user input, but my program guarantees that it starts with a match to the grammar (i.e., if my grammar matched only a, the text might be abc or a or a_). However, when I use the .parse method on my grammar, it fails on any non-exact match. How can I perform a partial match?
In Raku, Grammar.parse has to match the whole string. This is what causes it to fail if your grammar would only match a in the string abc. To allow matching only part of the input string, you can use Grammar.subparse instead.
grammar Foo {
token TOP { 'a' }
}
my $string = 'abc';
say Foo.parse($string); # Nil
say Foo.subparse($string); # 「a」
The input string will need to start with the potential Match. Otherwise, you will get a failed match.
say Foo.subparse('cbacb'); # #<failed match>
You can work around this using a Capture marker.
grammar Bar {
token TOP {
<-[a]>* # Match 0 or more characters that are *not* a
<( 'a' # Start the match, and match a single 'a'
}
}
say Bar.parse('a'); # 「a」
say Bar.subparse('a'); # 「a」
say Bar.parse('abc'); # Nil
say Bar.subparse('abc'); # 「a」
say Bar.parse('cbabc'); # Nil
say Bar.subparse('cbabc'); # 「a」
This works because <-[a]>*, a character class that includes any character except the letter a, will consume all the characters before a potential a. However, the Capture marker will cause these to be dropped from the eventual Match object, leaving you with just the a you wanted to match.
TL;DR
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
# Partial match anchored to end of string:
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
Vocabulary
There are traditionally two takes on the general notion of text "matching":
"Parsing"
"Regexes"
Raku:
Provides a unified text pattern language and engine that do both jobs.
Makes it easy to stick to one perspective, or other, or blend them, or refactor between them, as suits an individual dev and/or individual use case.
Takes "parsing" to mean more or less a single match starting at the start of the input string whereas "regexes" are much more flexible.
What you've written in your question and your first comment on Tyil's answer reflects the inherent ambiguity of the topic. I'll provide two answers rather than one, to try to help you and other readers be clearer about Raku's vocabulary and about your options functionality-wise.
Limited "partial matching" via .parse et al
You began with:
Partial match in a grammar ... I have a simple grammar ... my program guarantees that it starts with a match to the grammar
With that in mind, here's your question:
How can I perform a partial match?
The phrases "guarantees that it starts" and "partial match" are ambiguous.
One take is that you want what I'll call a "prefix" match, matching one or more characters anchored from the start of the string, and not merely any sub-string starting and ending anywhere in the input string.
This nicely fits with "parsing", or at least Raku's use of the word in its grammar methods.
All the built in Grammar methods with parse in their name insert an anchor to the start of the string in whatever grammar rule they use to start the parsing process. You cannot remove that anchor. This reflects the choice of vocabulary; "parse" is taken to mean matching from the start no matter what else happens.
The parse method for this "prefix" scenario is .subparse:
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
See also:
Search of SO for "[raku] subparse".
raku doc for .subparse.
But perhaps "guarantees that it starts" and "partial match" did not mean that you wanted anchoring at the start. Your comment on Tyil's answer highlights this ambiguity:
Will .subparse only match at the start, or match anywhere in the string?
Tyil provides a workaround. You can do what Tyil shows, but it'll only match if the very first a encountered in the input string is the one that's at the start of the sub-string you want your "parse" to match.
If instead the first a was a false positive, and there was a second or a subsequent a you wanted the "parse" match to start at, then, at least in the Raku world, it's helpful to call that "regexing" rather than "parsing" and to use "regex" matching via the ~~ smartmatch operator.
Unlimited "partial matching" via ~~
Raku lets you do unlimited partial matching if you use its ~~ construct with a regex.
For example, you could write:
# End of match anchored at end of string by the trailing $:
say 'abcaa' ~~ token { a* $ } # 「aa」
~~ with a regex tells Raku to:
Try to match starting at the first character position in the string on the LHS;
If that fails, step forward one character, and try again, with the new position in the input string treated as a fresh starting point;
Repeat that until either matching once, or failing to find any match in the entire string.
Here I've left the start position of the match unspecified (which ~~ takes to mean it can be anywhere in the string) and anchored the end of the pattern to the end of the input string. So it successfully matches the aa at the end of the string.
This anchoring freedom illustrates just one of the many ways that ~~ smart matching provides much greater matching flexibility than using the parse methods.
If you have an existing grammar you can still use that:
grammar foo { token TOP { a* } }
# Anchor matching to end of string with the trailing $:
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
You have to name both the grammar and the rule within it you wish to invoke and put them inside <...>. And you need to insert a . to avoid a correspondingly named sub-capture, presuming you don't want that.
Here's another example:
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
"Parsing" in Raku always starts at the beginning of an input string and results in either no match or one match.
In contrast, a "regex" can match arbitrary fragments, and can match any number of fragments. (You can even match overlapping fragments.)
In my last example I used :g, which is short for :global, which is a well known feature among traditional regex engines. :g matches as many times as a match is found in the input string (but not overlapping).
The match operation then returns either Nil (no matches at all) or a list of match objects (one or more). I've applied a .max(*.chars) to yield the longest match (the first if there are multiple longest sub-strings).
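For readers more familiar with other regex engines, here is a loose analogy using Python's re module. This is not Raku and the correspondence is only approximate (Python has no grammars or tokens); it just mirrors the distinction between whole-string, start-anchored and search-anywhere matching for the pattern a*.

```python
import re

pattern = re.compile(r"a*")

# Roughly like .parse: the pattern must consume the whole string.
whole = pattern.fullmatch("abcaa")
# Roughly like .subparse: anchored at the start, may stop early.
prefix = pattern.match("abcaa")
# Roughly like ~~ with an end anchor: the start position is free.
at_end = re.search(r"a*$", "abcaa")
# Roughly like m:g followed by .max(*.chars): longest of all matches.
longest = max(pattern.findall("abcaaabcaabc"), key=len)

print(whole)            # None
print(prefix.group())   # a
print(at_end.group())   # aa
print(longest)          # aaa
```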
When I use a named regex, I can print its contents:
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ / <rgx> /;
say $<rgx>; # 「ab」
But if I want to match with :g or :ex adverb, so there is more than one match, it doesn't work. The following
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ m:g/ <rgx> /;
say $<rgx>; # incorrect
gives an error:
Type List does not support associative indexing.
in block <unit> at test1.p6 line 5
How should I modify my code?
UPD: Based on @piojo's explanation, I modified the last line as follows and that solved my problem:
say $/[$_]<rgx> for ^$/.elems;
The following would be easier, but for some reason it doesn't work:
say $_<rgx> for $/; # incorrect
It seems like :g and :overlap are special cases: if your match is repeated within the regex, like / <rgx>* /, then you would access the matches as $<rgx>[0], $<rgx>[1], etc. But in this case, the engine is doing the whole match more than once, so you access those matches through the top-level match variable, $/. In fact, $<foo> is just a shortcut for $/<foo>.
So based on the error message, we know that in this case, $/ is a list. So we can access your matches as $/[0]<rgx> and $/[1]<rgx>.
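The same shape shows up in other regex engines: a global match gives you a sequence of match objects, and the named capture lives on each element rather than on the sequence. A rough Python analogy (not Raku; the group name rgx is simply copied from the question):

```python
import re

# Each element is a separate match object, like the items of $/ under :g.
matches = list(re.finditer(r"(?P<rgx>\w\w)", "abcd"))

# The named group is looked up per match, not on the list itself.
captured = [m.group("rgx") for m in matches]
print(captured)  # ['ab', 'cd']
```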
I am trying to install drupal in a subdirectory on my bluehost hosted website...
It's a HUGE pain
I think the following lines from the .htaccess are the problem. When I navigate to mysite.com/subdir/install.php I get a 403 error. However, when I take out "deny" from the lines below, I no longer get that error, so I suspect that this line is causing all the trouble.
My question is, can someone help me understand what is happening in the following code? Especially if you can break it down by component.
<FilesMatch "\.(engine|inc|info|install|make|module|profile|test|po|sh|.*sql|theme|tpl(\.php)?|xtmpl)(|~|\.sw[op]|\.bak|\.orig|\.save)?$|^(\..*|Entries.*|Repository|Root|Tag|Template)$|^#.*#$|\.php(~|\.sw[op]|\.bak|\.orig\.save)$">
Order allow,deny
</FilesMatch>
FilesMatch allows you to match files using a regular expression.
Your FilesMatch pattern contains four alternatives separated by |, and the first one also carries an optional suffix group.
In effect, it forbids access (error 403) to any file whose name matches one of those alternatives. Because the block contains Order allow,deny and no Allow directive, every request for a matching file is denied.
For example:
\.(engine|inc ...)$|
This means: if the file name ends with .engine, .inc, and so on through the rest of the list, deny access to it.
The first alternative is followed by a |, which stands for OR: if the first alternative does not match, the second one is tried, and it is slightly different.
^(\..*|Entries.*|Repository)$
Here the pattern is anchored at both ends (^ and $), so it has to match the whole file name rather than just its ending. For example:
The file name starts with . followed by anything (.* means any characters), such as .htaccess; or it starts with Entries followed by anything; or it is exactly Repository, and so on through the list.
The next alternative, ^#.*#$, matches file names that start and end with a #; the # is treated literally.
The last alternative works like the first one: it matches file names ending in .php followed by one of the listed backup or swap suffixes (for example .php~ or .php.bak).
If you want to know more, I suggest reading up on Perl Compatible Regular Expressions (PCRE).
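One way to get a feel for the pattern is to try file names against it. Below is a small Python sketch that does that. It assumes Python's re behaves like PCRE for this particular pattern and that FilesMatch is applied to the bare file name; the helper name is_denied and the sample file names are made up for illustration.

```python
import re

# The pattern from the question, split across lines at its four alternatives.
FILES_MATCH = re.compile(
    r"\.(engine|inc|info|install|make|module|profile|test|po|sh|.*sql|theme|tpl(\.php)?|xtmpl)"
    r"(|~|\.sw[op]|\.bak|\.orig|\.save)?$"
    r"|^(\..*|Entries.*|Repository|Root|Tag|Template)$"
    r"|^#.*#$"
    r"|\.php(~|\.sw[op]|\.bak|\.orig\.save)$"
)

def is_denied(filename):
    """True if the file name matches any of the four alternatives."""
    return FILES_MATCH.search(filename) is not None

for name in ["system.module", ".htaccess", "Repository",
             "#notes#", "index.php.bak", "index.php"]:
    print(name, is_denied(name))
```

In this sketch every sample name except index.php matches: a plain .php file is not covered by any of the four alternatives, only its backup and swap variants are.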
I'm looking for a regex in order to transform something like
{test}hello world{/test} and {again}i'm coming back{/again} into hello world i'm coming back.
I tried {[^}]+} but with this regex, I can't get only the content inside the test and again tags. Is there a way to complete this regex?
Doing this properly is generally beyond the capabilities of regular expressions. However, if you can guarantee that those tags will never be nested and your input will never contain curly brackets that do not signify tags, then this regex could do the matching:
\{([^}]+)}(.*?)\{/\1}
Explanation:
\{ # a literal {
( # capture the tag name
[^}]+) # everything until the end of the tag (you already had this)
} # a literal }
( # capture the tag's value
.*?) # any characters, but as few as possible to complete the match
# note that the ? makes the repetition ungreedy, which is important if
# you have the same tag twice or more in a string
\{/ # a literal { followed by a literal /
\1 # use the tag's name again (capture no. 1)
} # a literal }
So this uses a backreference \1 to make sure that the closing tag contains the same word as the opening tag. Then you will find the tag's name in capture 1 and the tag's value/content in capture 2. From here you can do with these whatever you want (for instance, put the values back together).
Note that you should use the SINGLELINE or DOTALL option, if you want your tags to span multiple lines.
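As a quick illustration, here is how the regex could be used in Python (the question does not say which language is in use, so that choice is an assumption). re.DOTALL is Python's name for the DOTALL option mentioned above.

```python
import re

# Same regex as above; DOTALL lets a tag's content span multiple lines.
TAG = re.compile(r"\{([^}]+)}(.*?)\{/\1}", re.DOTALL)

text = "{test}hello world{/test} and {again}i'm coming back{/again}"

# Collect only the tag contents (capture 2) and join them.
contents = [m.group(2) for m in TAG.finditer(text)]
joined = " ".join(contents)
print(joined)    # hello world i'm coming back

# Or strip the tags in place, keeping the text between them.
stripped = TAG.sub(r"\2", text)
print(stripped)  # hello world and i'm coming back
```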