I'm trying to handle a bunch of files, and I need to alter them to remove extraneous information in the filenames; notably, I'm trying to remove text inside parentheses. For example:
filename = "Example_file_(extra_descriptor).ext"
and I want to regex a whole bunch of files where the parenthetical expression might be in the middle or at the end, and of variable length.
What would the regex look like? Perl or Python syntax would be preferred.
s/\([^)]*\)//
So in Python, you'd do:
re.sub(r'\([^)]*\)', '', filename)
The pattern that matches substrings in parentheses having no other ( and ) characters in between (like (xyz 123) in Text (abc(xyz 123)) is
\([^()]*\)
Details:
\( - an opening round bracket (note that in POSIX BRE, ( should be used, see sed example below)
[^()]* - zero or more (due to the * Kleene star quantifier) characters other than those defined in the negated character class/POSIX bracket expression, that is, any chars other than ( and )
\) - a closing round bracket (no escaping in POSIX BRE allowed)
Removing code snippets:
JavaScript: string.replace(/\([^()]*\)/g, '')
PHP: preg_replace('~\([^()]*\)~', '', $string)
Perl: $s =~ s/\([^()]*\)//g
Python: re.sub(r'\([^()]*\)', '', s)
C#: Regex.Replace(str, @"\([^()]*\)", string.Empty)
VB.NET: Regex.Replace(str, "\([^()]*\)", "")
Java: s.replaceAll("\\([^()]*\\)", "")
Ruby: s.gsub(/\([^()]*\)/, '')
R: gsub("\\([^()]*\\)", "", x)
Lua: string.gsub(s, "%([^()]*%)", "")
Bash/sed: sed 's/([^()]*)//g'
Tcl: regsub -all {\([^()]*\)} $s "" result
C++ std::regex: std::regex_replace(s, std::regex(R"(\([^()]*\))"), "")
Objective-C: NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"\\([^()]*\\)" options:NSRegularExpressionCaseInsensitive error:&error]; NSString *modifiedString = [regex stringByReplacingMatchesInString:string options:0 range:NSMakeRange(0, [string length]) withTemplate:@""];
Swift: s.replacingOccurrences(of: "\\([^()]*\\)", with: "", options: [.regularExpression])
Google BigQuery: REGEXP_REPLACE(col, "\\([^()]*\\)" , "")
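Because \([^()]*\) only ever matches an innermost pair, applying it repeatedly until nothing changes also removes nested groups. Here is a minimal Python sketch of that idea, assuming the parentheses are reasonably balanced; the helper name strip_parens is hypothetical and not part of any library:

```python
import re

# Matches one innermost (...) group: no ( or ) allowed between the brackets.
INNERMOST = re.compile(r'\([^()]*\)')

def strip_parens(s):
    """Remove innermost (...) groups repeatedly until none remain."""
    while True:
        s, count = INNERMOST.subn('', s)
        if count == 0:
            return s

print(strip_parens("Text (abc(xyz 123))"))
print(strip_parens("Example_file_(extra_descriptor).ext"))
```

Unmatched brackets are left in place, so a stray ")" survives the loop.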
I would use:
\([^)]*\)
If you don't absolutely need to use a regex, consider using Perl's Text::Balanced to remove the parentheses.
use Text::Balanced qw(extract_bracketed);
my ($extracted, $remainder, $prefix) = extract_bracketed( $filename, '()', '[^(]*' );
{   no warnings 'uninitialized';
    $filename = (defined $prefix or defined $remainder)
                ? $prefix . $remainder
                : $extracted;
}
You may be thinking, "Why do all this when a regex does the trick in one line?"
$filename =~ s/\([^)]*\)//;
Text::Balanced handles nested parentheses. So $filename = 'foo_(bar(baz)buz)).foo' will be extracted properly. The regex-based solutions offered here will fail on this string: one will stop at the first closing paren, and the other will eat them all.
$filename =~ s/\([^)]*\)//;
# returns 'foo_buz)).foo'
$filename =~ s/\(.*\)//;
# returns 'foo_.foo'
# text balanced example returns 'foo_).foo'
If either of the regex behaviors is acceptable, use a regex--but document the limitations and the assumptions being made.
If a path may contain parentheses then the r'\(.*?\)' regex is not enough:
import os, re
def remove_parenthesized_chunks(path, safeext=True, safedir=True):
    dirpath, basename = os.path.split(path) if safedir else ('', path)
    name, ext = os.path.splitext(basename) if safeext else (basename, '')
    name = re.sub(r'\(.*?\)', '', name)
    return os.path.join(dirpath, name+ext)
By default the function preserves parenthesized chunks in the directory and extension parts of the path.
Example:
>>> f = remove_parenthesized_chunks
>>> f("Example_file_(extra_descriptor).ext")
'Example_file_.ext'
>>> path = r"c:\dir_(important)\example(extra).ext(untouchable)"
>>> f(path)
'c:\\dir_(important)\\example.ext(untouchable)'
>>> f(path, safeext=False)
'c:\\dir_(important)\\example.ext'
>>> f(path, safedir=False)
'c:\\dir_\\example.ext(untouchable)'
>>> f(path, False, False)
'c:\\dir_\\example.ext'
>>> f(r"c:\(extra)\example(extra).ext", safedir=False)
'c:\\\\example.ext'
For those who want to use Python, here's a simple routine that removes parenthesized substrings, including those with nested parentheses. Okay, it's not a regex, but it'll do the job!
def remove_nested_parens(input_str):
    """Returns a copy of 'input_str' with any parenthesized text removed. Nested parentheses are handled."""
    result = ''
    paren_level = 0
    for ch in input_str:
        if ch == '(':
            paren_level += 1
        elif (ch == ')') and paren_level:
            paren_level -= 1
        elif not paren_level:
            result += ch
    return result
remove_nested_parens('example_(extra(qualifier)_text)_test(more_parens).ext')
# returns 'example__test.ext'
If you can stand to use sed (possibly executed from within your program), it'd be as simple as:
sed 's/(.*)//g'
>>> import re
>>> filename = "Example_file_(extra_descriptor).ext"
>>> p = re.compile(r'\([^)]*\)')
>>> re.sub(p, '', filename)
'Example_file_.ext'
Java code:
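Pattern pattern1 = Pattern.compile("(\\_\\(.*?\\))");
Matcher matcher1 = pattern1.matcher(fileName); // Pattern and Matcher are in java.util.regex
if (matcher1.find())
    System.out.println(fileName.replace(matcher1.group(1), ""));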
I would like to match any Num from part of a text string. So far, this (stolen from https://docs.perl6.org/language/regexes.html#Best_practices_and_gotchas) does the job...
my token sign { <[+-]> }
my token decimal { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
    <sign>?
    <decimal>?
    '.'
    <decimal>
    <exponent>?
}
my regex int {
    <sign>?
    <decimal>
}
my regex num {
    <float>?
    <int>?
}
$str ~~ s/( <num>? \s*) ( .* )/$1/;
This seems like a lot of (error prone) reinvention of the wheel. Is there a perl6 trick to match built in types (Num, Real, etc.) in a grammar?
If you can make reasonable assumptions about the number, like that it's delimited by word boundaries, you can do something like this:
regex number {
    «       # left word boundary
    \S+     # actual "number"
    »       # right word boundary
    <?{ defined +"$/" }>
}
The final line in this regex stringifies the Match ("$/"), and then tries to convert it to a number (+). If it works, it returns a defined value, otherwise a Failure. This string-to-number conversion recognizes the same syntax as the Perl 6 grammar. The <?{ ... }> construct is an assertion, so it makes the match fail if the expression on the inside returns a false value.
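As a rough analogue in Python (not Perl 6), the same "match a candidate, then let the language's own number conversion decide" idea can be sketched as follows. It assumes candidates are whitespace-delimited tokens and that whatever float() accepts counts as a number; the helper name find_numbers is hypothetical:

```python
import re

def find_numbers(text):
    """Return the whitespace-delimited tokens that float() can convert."""
    numbers = []
    for token in re.findall(r'\S+', text):
        try:
            float(token)          # the conversion acts as the assertion
        except ValueError:
            continue              # not a number: skip this candidate
        numbers.append(token)
    return numbers

print(find_numbers("width 3.5 height -2e3 label x12"))
```

Note that float() is more permissive than a hand-written pattern: it also accepts tokens such as "nan", "inf" and "1_000", so the definition of "number" comes from the language rather than from the regex.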
In Edit distance: Ignore start/end, I offered a Perl 6 solution to a fuzzy matching problem. I had a grammar like this (although maybe I've improved it after Edit #3):
grammar NString {
    regex n-chars { [<.ignore>* \w]**4 }
    regex ignore { \s }
}
The literal 4 itself was the length of the target string in the example. But the next problem might be some other length. So how can I tell the grammar how long I want that match to be?
Although the docs don't show an example of using the $args parameter, I found one in S05-grammar/example.t in roast.
Specify the arguments in :args and give the regex an appropriate signature. Inside the regex, access the arguments in a code block:
grammar NString {
    regex n-chars ($length) { [<.ignore>* \w]**{ $length } }
    regex ignore { \s }
}
class NString::Actions {
    method n-chars ($/) {
        put "Found $/";
    }
}
my $string = 'The quick, brown butterfly';
loop {
    state $from = 0;
    my $match = NString.subparse(
        $string,
        :rule('n-chars'),
        :actions(NString::Actions),
        :c($from++),
        :args( \(5) )
    );
    last unless ?$match;
}
I'm still not sure about the rules for passing the arguments though. This doesn't work:
:args( 5 )
I get:
Too few positionals passed; expected 2 arguments but got 1
This works:
:args( 5, )
But that's enough thinking about this for one night.
I would like to use a regex for "spaces", "dashes (-)", "apostrophes (')", and "letters" in my objective-c app.
I have the following, but it does not allow spaces.
NSString *fullNameRegex = @"^[a-zA-Z'\\-]$";
Could someone help me add the spaces please? Thank you!
You can use
NSString *fullNameRegex = @"^[\\sa-zA-Z'-]*$";
                              ^^^         ^
Add the whitespace shorthand \s to the character class, and do not forget a quantifier after the class so the pattern can match more than one character: * for 0 or more, or + for 1 or more.
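The character class can be sanity-checked outside the app. The sketch below uses Python's re module rather than NSRegularExpression, and uses + instead of * on the assumption that an empty name should be rejected; the sample names are made up for illustration:

```python
import re

# Letters, whitespace, apostrophes and hyphens only; at least one character.
FULL_NAME = re.compile(r"^[\sa-zA-Z'-]+$")

def is_valid_name(name):
    """True if the whole string consists only of allowed characters."""
    return FULL_NAME.match(name) is not None

for name in ["Mary-Jane O'Neil", "John Smith", "R2D2", ""]:
    print(repr(name), is_valid_name(name))
```

The hyphen is placed last inside the brackets so that it is read as a literal character and not as a range.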
I was wondering if anyone might know what the regular expression would be to turn this:
West4thStreet
into this:
West 4th Street
I'm going to add the spaces to the string in Objective-C.
Thanks!
I don't know exactly where you want to put in spaces, but try something like [a-z.-][^a-z .-] and then put a space between the two characters in each match.
Something like this perl regex substitution would put a space before each group of capital letters or numbers. (You'd want to trim space before the string in this case also.) I assume you don't want it to break up eg: 45thStreet to 4 5th Street
Letters I'm less certain of.
s/([A-Z]+|[0-9]+)/ \1/g
I created a pattern to not match the beginning of the line for my personal amusement:
s/([^\^])([A-Z]+|[0-9]+)/\1 \2/g
This should work, if all your strings truly match the format of your example:
([A-Z][a-z]+)(\d+[a-z]+)([A-Z][a-z]+)
You can then separate the groups with spaces.
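Assuming the input really has the word, number-with-suffix, word shape of the example, the three groups can be joined with spaces in a single substitution. A minimal Python sketch (the question targets Objective-C, so this only illustrates the pattern; the helper name space_out is hypothetical):

```python
import re

# Three capture groups: capitalized word, digits plus suffix, capitalized word.
PATTERN = re.compile(r'([A-Z][a-z]+)(\d+[a-z]+)([A-Z][a-z]+)')

def space_out(s):
    """Insert a space between the three captured groups."""
    return PATTERN.sub(r'\1 \2 \3', s)

print(space_out("West4thStreet"))
```

Strings that do not fit the three-part shape are returned unchanged, which is the main limitation of this approach.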
Another option would be to not use RegExKit and instead loop through each character in the string, inserting a space before each capital letter and before the first digit of each number.
NSMutableString *myText2 = [[NSMutableString alloc] initWithString:@"The1stTest"];
bool isNumber=false;
for(int x=myText2.length-1;x>1;x--)
{
    bool isUpperCase = [[NSCharacterSet uppercaseLetterCharacterSet] characterIsMember:[myText2 characterAtIndex:x]];
    bool isLowerCase = [[NSCharacterSet lowercaseLetterCharacterSet] characterIsMember:[myText2 characterAtIndex:x]];
    if([[NSCharacterSet decimalDigitCharacterSet] characterIsMember:[myText2 characterAtIndex:x]])
        isNumber = true;
    if((isUpperCase || isLowerCase) && isNumber)
    {
        [myText2 insertString:@" " atIndex:x+1];
        isNumber=false;
    }
    if(isUpperCase)
        [myText2 insertString:@" " atIndex:x];
}
NSLog(@"%@",myText2); // Output: "The 1st Test"