Regex reverse lookbehind - regex-lookarounds


I am having some trouble understanding how to use lookbehind in regex.
I need to match everything between the nearest preceding occurrence of myMethod and somethingelse.
Example:
https://regex101.com/r/lF8yT0/4
public myMethod
do something
private myMethod
do somethingelse
(?s)(?<=(myMethod){1})(.*)somethingelse
This selects everything from the top, while I only expect:
private myMethod
do somethingelse

You may use a tempered greedy token:
[^\r\n]*myMethod((?:(?!myMethod).)*?)somethingelse
                 ^^^^^^^^^^^^^^^^^^^
See the regex demo
The first [^\r\n]* matches 0+ chars other than CR/LF (since you expect the start of the line in your match result), and (?:(?!myMethod).)*? matches any 0+ chars (as few as possible) that do not start a myMethod substring.
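The behaviour described above can be checked with a minimal sketch in Python's re module. Python is an assumption here (the question does not name a regex flavor), and re.DOTALL is used to mirror the (?s) flag from the question so that . can cross line breaks.

```python
import re

# Sample text from the question.
text = (
    "public myMethod\n"
    "do something\n"
    "private myMethod\n"
    "do somethingelse"
)

# Tempered greedy token: (?:(?!myMethod).)*? consumes one character at a
# time, but only at positions where "myMethod" does not start.
# re.DOTALL is an assumption that mirrors the (?s) flag in the question.
pattern = re.compile(
    r"[^\r\n]*myMethod((?:(?!myMethod).)*?)somethingelse",
    re.DOTALL,
)

match = pattern.search(text)
result = match.group(0) if match else None
print(result)
```

Because the tempered token refuses to step over another myMethod, the attempt that starts on the first line fails and the match begins at the last myMethod line before somethingelse.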

Related

REGEXP_REPLACE URL BIGQUERY

I have two types of URLs that I need to clean. They look like this:
["//xxx.com/se/something?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]
["//www.xxx.com/se/car?p_color_car=White?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]
The outcome I want is:
SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"
I want to remove the brackets and everything up to SE. The URLs differ, so I want to remove:
First URL:
["//xxx.com/se/something?
Second URL:
["//www.xxx.com/se/car?p_color_car=White?
I can't get my head around it. I've tried .*\/ but it still keeps strings I don't want, such as:
First URL: something?
Second URL: car?p_color_car=White?

You can use
regexp_replace(FinalUrls, r'.*\?|"\]$', '')
See the regex demo
Details
.*\? - any zero or more chars other than line break chars, as many as possible, and then a ? char
| - or
"\]$ - a "] substring at the end of the string.
Mind the regexp_replace syntax: you can't omit the replacement argument. See the reference:
REGEXP_REPLACE(value, regexp, replacement)
Returns a STRING where all substrings of value that match regular
expression regexp are replaced with replacement.
You can use backslashed-escaped digits (\1 to \9) within the
replacement argument to insert text matching the corresponding
parenthesized group in the regexp pattern. Use \0 to refer to the
entire matching text.
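To see what the pattern removes, here is a small sketch using Python's re.sub as a stand-in for BigQuery's REGEXP_REPLACE. That substitution is an assumption (BigQuery uses the RE2 engine), and clean_url is a hypothetical helper name, but this particular pattern uses only syntax common to both.

```python
import re

def clean_url(value: str) -> str:
    # Hypothetical helper: strip everything up to the last "?" and a
    # trailing '"]', mirroring regexp_replace(FinalUrls, r'.*\?|"\]$', '').
    return re.sub(r'.*\?|"\]$', "", value)

urls = [
    '["//xxx.com/se/something?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]',
    '["//www.xxx.com/se/car?p_color_car=White?SE_{ifmobile:MB}{ifnotmobile:DT}_A_B_C_D_E_F_G_H"]',
]

cleaned = [clean_url(u) for u in urls]
for c in cleaned:
    print(c)
```

Since .* is greedy, the first alternative runs up to the last ? in the string, which is why the second URL loses p_color_car=White? as well.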

REGEXP_REPLACE explanation

Hi, may I know what the query below means?
REGEXP_REPLACE(number,'[^'' ''-/0-9:-@A-Z''[''-`a-z{-~]', 'xy') ext_number

part 1
In terms of explaining what the function call is doing:
It is a function call that analyses the input string 'number' with a regex (the 2nd argument) and replaces any parts of the string that match the regex with a specific string (the 3rd argument). As for the name after the parenthesis, I am not sure, but the documentation for the function is here
part 2
Sorry to be writing a question within an answer here but I cannot respond in comments yet (not enough rep)
Does this regex work? Unless SQL uses a different syntax, this appears to be a non-functional regex. There are some red flags, e.g.:
The entire regex is wrapped in square brackets, indicating a set of characters, but it seems to predominantly hold an expression.
There is a range indicator between a single quote and a character (an invalid range: if a literal dash was required in the match, it should be escaped with a backslash).
One set of square brackets is never closed.
After some minor tweaks this regex is valid syntax:
^'' ''\-\/0-9:-@A-Z''[''-`a-z{-~]
However, it does not match anything I can think of. It is important to know what string is being examined and what the context of the program is in order to identify what the regex might be attempting to do.

It seems like it is meant to replace all ASCII control characters in the column or variable number with xy.
[] encloses a class of characters. Any character in that class matches. [^] negates that, hence all characters match, that are not in the class.
- is a range operator, e.g. a-z means all characters from a to z, like abc...xyz.
It seems like characters enclosed in ' should be escaped (the second ' is there to escape the ' in the string itself). At least this would make some sense. But for none of the DBMSs I found that have a regexp_replace() function (Postgres, Oracle, DB2, MariaDB, MySQL) did I find anything in the docs that would indicate this escape mechanism. They all use \, but maybe I missed something? Unfortunately you didn't tag which DBMS you're actually using!
Now if you take an ASCII table you'll see that the ranges in the expression make up all printable characters (counting space as printable) in groups from space to /, 0 to 9, : to @, etc. Actually it might have been shorter to express it as '' ''-~, i.e. space to ~.
Given the negation, all these don't match. The ones left are from NUL to US and DEL. These match and get replaced by xy one by one.
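The reading above can be sketched in Python, using the shorter class suggested in this answer (space to ~) in place of the original SQL expression. Both the choice of Python's re module and the simplified class are assumptions; mark_non_printable is a hypothetical helper name.

```python
import re

def mark_non_printable(value: str) -> str:
    # [^ -~] matches any single character outside the range
    # space (0x20) to ~ (0x7E); each such character becomes "xy".
    return re.sub(r"[^ -~]", "xy", value)

sample = "ab\tc\x7f"          # contains a TAB and a DEL character
marked = mark_non_printable(sample)
print(repr(marked))
```

Note that in Python this class also matches non-ASCII characters, since they fall outside the space-to-~ range as well.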

Regex in Objective-C and regexpal

I created a regex and tested it with a string in Regexpal.
Then I tested it in Objective-C with the same string and got no result.
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"^Page(\\d+| \\d+)(:| :|)$" options:NSRegularExpressionCaseInsensitive error:&regError];
As you can see, I did add another '\' char before the 'd', but I get no result at all.
If I change the regex from "^Page(\d+| \d+)(:| :|)$" to "Page(\d+| \d+)(:| :|)", I get way too many results. It's as if my 'AND' statements were understood as 'OR' statements. Does anybody have an idea of what is happening?
EDIT:
The regex "^page(\d+| \d+)(:| :|)$" applied to the string "page 15 :" returns 3 results: "page 15 :", "15", and ":". I only want the first one. Like I said, it's as if my AND were transformed into an OR/AND. I would like the number and the colon (optional, as my regex says) to always be attached to 'page'.

Turn on the multi-line option; then the anchors ^ and $ will mean beginning and end of line.
By default, ^ and $ mean beginning and end of the entire string.
In Regexpal you can see that the "Match at line breaks (m)" option is checked.
Edit:
If you are getting too much sub-expression data, then you should replace the
capture groups with cluster (non-capturing) groups: (..) -> (?: ..).
This is an 'extended' construct.
If you can't do that, then just go with the data in group 0, which is the entire match, and ignore the rest. Not sure how to do this.

As pointed out by sln (who solved that issue), you don't match anything in Objective-C because you have to turn the multiline option on (the m flag, which is NSRegularExpressionAnchorsMatchLines in NSRegularExpression), and you do match in Regexpal because it is on there.
Regarding your regex, it could be improved with ^(Page\s*\d+\s*:?\s*)$. The question mark means that the preceding character is optional, and \s matches any whitespace character (space, tab, etc.).
Regarding your selection issue, parentheses in a regex are what capture values. So if you write (Page( \d+|\d+)) you'll have two different captures. What you wanted was (Page(?: \d+|\d+)), since (?: ) groups without capturing any value. But | usually isn't needed when a simple ? does the trick.
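Both points from the answers (line anchors and non-capturing groups) can be illustrated with a short sketch. It uses Python's re module rather than NSRegularExpression, which is an assumption made for the sake of a runnable example; the sample text is made up.

```python
import re

# Made-up multi-line input containing the "Page 15 :" line from the question.
text = "intro\nPage 15 :\noutro"

capturing = r"^Page(\d+| \d+)(:| :|)$"
non_capturing = r"^Page(?:\d+| \d+)(?::| :|)$"

# Without the multi-line flag, ^ only matches at the very start of the text.
no_flag = re.search(non_capturing, text, re.IGNORECASE)

# With the multi-line flag, ^ and $ match at line boundaries.
with_flag = re.search(non_capturing, text, re.IGNORECASE | re.MULTILINE)

# Capturing groups produce extra sub-matches; non-capturing groups do not.
groups = re.findall(capturing, text, re.IGNORECASE | re.MULTILINE)
whole = re.findall(non_capturing, text, re.IGNORECASE | re.MULTILINE)

print(no_flag, with_flag.group(0), groups, whole)
```

The extra "results" in the question correspond to the capture groups of a single match, not to separate matches.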

What does the \? (backslash question mark) escape sequence mean?

I'm writing a regular expression in Objective-C.
The escape sequence \w is illegal and emits a warning, so the regular expression /\w/ must be written as @"\\w"; the escape sequence \? is valid, apparently, and doesn't emit a warning, so the regular expression /\?/ must be written as @"\\?" (i.e., the backslash must be escaped).
Question marks aren't invisible like \t or \n, so why is \? a valid escape sequence?
Edit: To clarify, I'm not asking about the quantifier, I'm asking about a string escape sequence. That is, this doesn't emit a warning:
NSString *valid = @"\?";
By contrast, this does emit a warning ("Unknown escape sequence '\w'"):
NSString *invalid = @"\w";

It specifies a literal question mark. It is needed because of a little-known feature called trigraphs, where you can write a three-character sequence starting with question marks to substitute another character. If you have trigraphs enabled, in order to write "??" in a string, you need to write it as "?\?" in order to prevent the preprocessor from trying to read it as the beginning of a trigraph.
(If you're wondering "Why would anybody introduce a feature like this?": Some keyboards or character sets didn't include commonly used symbols like {, so they introduced trigraphs so you could write ??< instead.)

? in regex is a quantifier; it means 0 or 1 occurrences. When appended to the + or * quantifiers, it makes them "lazy".
For example, applying the regex o? to the string foo? would match o.
However, the regex o\? in foo? would match o?, because it searches for a literal question mark in the string instead of treating ? as a quantifier.
The greedy regex o* applied to foo? would match oo, whereas the lazy o*? matches as few o characters as possible (on its own, the empty string).
More info on quantifiers here.
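A quick way to compare these quantifiers is the sketch below, written with Python's re module (an assumption; the question itself is about Objective-C). A leading f is added to the patterns so that every match starts at the same position.

```python
import re

s = "foo?"

optional = re.match(r"fo?", s).group()    # ? as quantifier: zero or one "o"
greedy = re.match(r"fo*", s).group()      # * takes as many "o" as possible
lazy = re.match(r"fo*?", s).group()       # *? takes as few "o" as possible
literal = re.search(r"o\?", s).group()    # \? is a literal question mark

print(optional, greedy, lazy, literal)
```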

Xcode - replace function with regex and two-digit capture group (back reference)

I would like to use Xcode's find-in-project option to normalize the signatures of methods.
I wrote the find expression:
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)(\s*(:)\s*(\()\s*(\w+)\s*(\*?)\s*(\))\s*(\w+))?
and the replacement expression:
\1 \(\2\3\)\4\6\7\8\9\10\11
The test string is:
+(NSString *) testFunction : (NSInteger ) arg1
and the desired result:
+ (NSString*)testFunction:(NSInteger)arg1
Unfortunately, Xcode isn't able to recognize the two-digit capture group \10 and translates it to \1 followed by a '0' character, and so on. How can I solve this problem or bug?
Thanks in advance,
Michał

I believe @trojanfoe is correct; regexes can only have nine capture groups. This is waaay more than you need for your particular example, though.
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)(\s*(:)\s*(\()\s*(\w+)\s*(\*?)\s*(\))\s*(\w+))?
\1 \(\2\3\)\4\6\7\8\9\10\11
The first thing I notice is that you're not using \5, so there's no reason to capture it at all. Next, I notice that \6 corresponds to the regex (:), so you can avoid capturing it and replace \6 with : in the output. \7 corresponds to (\(), so you can replace \7 with ( in the output. ...Iterating this approach yields a much simpler pair of regexes: one for zero-argument methods and one for one-argument methods.
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)
\1 \(\2\3\)\4
^([+-] \(\w+\*?\)\w+)\s*:\s*\(\s*(\w+)\s*(\*?)\s*\)\s*(\w+)
\1:\(\2\3\)\4
Notice that I can capture the whole regex [+-] \(\w+\*?\)\w+ without all those noisy \s*s, because it's been normalized already by the first regex's pass.
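The two-pass idea can be tried outside Xcode with a rough sketch in Python's re module. The use of Python is an assumption, normalize is a hypothetical helper name, and the replacement strings drop the backslashes before the parentheses because Python's replacement syntax does not need them.

```python
import re

# Pass 1: normalize the return type and method name.
ZERO_ARG = re.compile(r"^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)")
# Pass 2: normalize a single argument, relying on pass 1's output format.
ONE_ARG = re.compile(
    r"^([+-] \(\w+\*?\)\w+)\s*:\s*\(\s*(\w+)\s*(\*?)\s*\)\s*(\w+)"
)

def normalize(signature: str) -> str:
    # Hypothetical helper that applies both passes in order.
    first = ZERO_ARG.sub(r"\1 (\2\3)\4", signature)
    return ONE_ARG.sub(r"\1:(\2\3)\4", first)

normalized = normalize("+(NSString *) testFunction : (NSInteger ) arg1")
print(normalized)
```

Each pass needs at most four back-references, so the two-digit group problem never arises.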
However, this whole idea is a huge mistake. Consider the following Objective-C method declarations:
-(const char *)toString;
-(id)initWithA: (A) a andB: (B) b andC: (C) c;
-(NSObject **)pointerptr;
-(void)performBlock: (void (^)(void)) block;
-(id)stringWithFormat: (const char *) fmt, ...;
None of these are going to be parsed correctly by your regex. The first one contains a two-word type const char instead of a single word; the second has more than one parameter; the third has a double pointer; the fourth has a very complicated type instead of a single word; and the fifth has not only const char but a variadic argument list. I could go on, through out parameters and arrays and __attribute__ syntax, but surely you're beginning to see why regexes are a bad match for this problem.
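As a rough check of this claim, the sketch below runs the zero-argument regex from this answer over the five declarations, again using Python's re module as an assumed stand-in for Xcode's engine. It only reports which declarations the regex fails to match at all; the remaining three match just the return type and method name, leaving their parameter lists untouched.

```python
import re

ZERO_ARG = re.compile(r"^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)")

declarations = [
    "-(const char *)toString;",
    "-(id)initWithA: (A) a andB: (B) b andC: (C) c;",
    "-(NSObject **)pointerptr;",
    "-(void)performBlock: (void (^)(void)) block;",
    "-(id)stringWithFormat: (const char *) fmt, ...;",
]

# Declarations whose return type the regex cannot parse at all.
unmatched = [d for d in declarations if ZERO_ARG.match(d) is None]
for d in unmatched:
    print("no match:", d)
```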
What you're really looking for is an indent program (named after GNU indent, which unfortunately doesn't do Objective-C). The best-known and best-supported Objective-C indent program is called uncrustify; get it here.
