I'm trying to handle a bunch of files, and I need to alter them to remove extraneous information in the filenames; notably, I'm trying to remove text inside parentheses. For example:
filename = "Example_file_(extra_descriptor).ext"
and I want to regex a whole bunch of files where the parenthetical expression might be in the middle or at the end, and of variable length.
What would the regex look like? Perl or Python syntax would be preferred.
s/\([^)]*\)//
So in Python, you'd do:
re.sub(r'\([^)]*\)', '', filename)
The pattern that matches substrings in parentheses with no other ( and ) characters in between (like (xyz 123) in Text (abc(xyz 123))) is
\([^()]*\)
Details:
\( - an opening round bracket (note that in POSIX BRE, ( should be used, see sed example below)
[^()]* - zero or more (due to the * Kleene star quantifier) characters other than those defined in the negated character class/POSIX bracket expression, that is, any chars other than ( and )
\) - a closing round bracket (again, use an unescaped ) in POSIX BRE)
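A quick Python illustration (Python being one of the languages the question prefers): in a single pass the pattern strips only the innermost pair of the nested sample above, and re-running the substitution until the string stops changing is a simple way to peel nested groups as well:

import re

s = "Text (abc(xyz 123))"
print(re.sub(r'\([^()]*\)', '', s))   # 'Text (abc)' - only the innermost pair goes in one pass

# Repeating the substitution until nothing changes also strips the now-unnested outer pair:
prev = None
while prev != s:
    prev, s = s, re.sub(r'\([^()]*\)', '', s)
print(s)                              # 'Text '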
Removal code snippets in various languages:
JavaScript: string.replace(/\([^()]*\)/g, '')
PHP: preg_replace('~\([^()]*\)~', '', $string)
Perl: $s =~ s/\([^()]*\)//g
Python: re.sub(r'\([^()]*\)', '', s)
C#: Regex.Replace(str, @"\([^()]*\)", string.Empty)
VB.NET: Regex.Replace(str, "\([^()]*\)", "")
Java: s.replaceAll("\\([^()]*\\)", "")
Ruby: s.gsub(/\([^()]*\)/, '')
R: gsub("\\([^()]*\\)", "", x)
Lua: string.gsub(s, "%([^()]*%)", "")
Bash/sed: sed 's/([^()]*)//g'
Tcl: regsub -all {\([^()]*\)} $s "" result
C++ std::regex: std::regex_replace(s, std::regex(R"(\([^()]*\))"), "")
Objective-C: NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"\\([^()]*\\)" options:NSRegularExpressionCaseInsensitive error:&error]; NSString *modifiedString = [regex stringByReplacingMatchesInString:string options:0 range:NSMakeRange(0, [string length]) withTemplate:@""];
Swift: s.replacingOccurrences(of: "\\([^()]*\\)", with: "", options: [.regularExpression])
Google BigQuery: REGEXP_REPLACE(col, "\\([^()]*\\)" , "")
I would use:
\([^)]*\)
If you don't absolutely need to use a regex, consider using Perl's Text::Balanced to remove the parentheses.
use Text::Balanced qw(extract_bracketed);
my ($extracted, $remainder, $prefix) = extract_bracketed( $filename, '()', '[^(]*' );
{ no warnings 'uninitialized';
    $filename = (defined $prefix or defined $remainder)
        ? $prefix . $remainder
        : $extracted;
}
You may be thinking, "Why do all this when a regex does the trick in one line?"
$filename =~ s/\([^)]*\)//;
Text::Balanced handles nested parentheses. So $filename = 'foo_(bar(baz)buz)).foo' will be extracted properly. The regex-based solutions offered here will fail on this string: one stops at the first closing paren, and the other eats them all.
$filename =~ s/\([^)]*\)//;
# returns 'foo_buz)).foo'
$filename =~ s/\(.*\)//;
# returns 'foo_.foo'
# text balanced example returns 'foo_).foo'
If either of the regex behaviors is acceptable, use a regex--but document the limitations and the assumptions being made.
If a path may contain parentheses then the r'\(.*?\)' regex is not enough:
import os, re
def remove_parenthesized_chunks(path, safeext=True, safedir=True):
    dirpath, basename = os.path.split(path) if safedir else ('', path)
    name, ext = os.path.splitext(basename) if safeext else (basename, '')
    name = re.sub(r'\(.*?\)', '', name)
    return os.path.join(dirpath, name + ext)
By default the function preserves parenthesized chunks in the directory and extension parts of the path.
Example:
>>> f = remove_parenthesized_chunks
>>> f("Example_file_(extra_descriptor).ext")
'Example_file_.ext'
>>> path = r"c:\dir_(important)\example(extra).ext(untouchable)"
>>> f(path)
'c:\\dir_(important)\\example.ext(untouchable)'
>>> f(path, safeext=False)
'c:\\dir_(important)\\example.ext'
>>> f(path, safedir=False)
'c:\\dir_\\example.ext(untouchable)'
>>> f(path, False, False)
'c:\\dir_\\example.ext'
>>> f(r"c:\(extra)\example(extra).ext", safedir=False)
'c:\\\\example.ext'
For those who want to use Python, here's a simple routine that removes parenthesized substrings, including those with nested parentheses. Okay, it's not a regex, but it'll do the job!
def remove_nested_parens(input_str):
    """Returns a copy of 'input_str' with any parenthesized text removed. Nested parentheses are handled."""
    result = ''
    paren_level = 0
    for ch in input_str:
        if ch == '(':
            paren_level += 1
        elif (ch == ')') and paren_level:
            paren_level -= 1
        elif not paren_level:
            result += ch
    return result
remove_nested_parens('example_(extra(qualifier)_text)_test(more_parens).ext')
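# -> 'example__test.ext'  (the nested group and the trailing group are both removed)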
If you can stand to use sed (possibly executed from within your program), it'd be as simple as:
sed 's/(.*)//g'
>>> import re
>>> filename = "Example_file_(extra_descriptor).ext"
>>> p = re.compile(r'\([^)]*\)')
>>> re.sub(p, '', filename)
'Example_file_.ext'
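To tie this back to the original bulk-rename task, here is a minimal sketch along the same lines; the current-directory walk and the overwrite check are my assumptions, not something from the question:

import os
import re

pattern = re.compile(r'\([^)]*\)')

for old_name in os.listdir('.'):
    new_name = pattern.sub('', old_name)
    # Only rename when something actually changed and we would not clobber an existing file
    if new_name != old_name and not os.path.exists(new_name):
        os.rename(old_name, new_name)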
Java code:
Pattern pattern1 = Pattern.compile("(\\_\\(.*?\\))");
Matcher matcher1 = pattern1.matcher(fileName);
if (matcher1.find()) System.out.println(fileName.replace(matcher1.group(1), ""));
Related
I have a list of strings and generate HTML from it dynamically with li tags. I want to assign each value to the id attribute as well. But the problem is that the string items contain special characters like :, ', é, ... I just want the output to include numbers (0-9) and letters (a-z) only.
// Input:
listStr = ["Pop & Suki", "PINK N' PROPER", "L'Oréal Paris"]
// Output:
result = ["pop_suki", "pink_n_proper", "loreal_paris"] ("loral_paris" is also good)
Currently, I've just lowercased the strings and replaced " " with _, but I don't know how to eliminate the special characters.
Many thanks!
Instead of thinking of it as eliminating special characters, consider the permitted characters – you want just lower-case alphanumeric characters.
Elm provides Char.isAlphaNum to test for alphanumeric characters, and Char.toLower to transform a character to lower case. It also provides the higher-order function String.foldl which you can use to process a String one Char at a time.
So for each character:
check if it's alphanumeric
if it is, transform it to lower case
if not and it is a space, transform it to an underscore
else drop the character
Putting this together, we create a function that processes a character and appends it to the string processed so far, then apply that to all characters in the input string:
transformNextCharacter : Char -> String -> String
transformNextCharacter nextCharacter partialString =
    if Char.isAlphaNum nextCharacter then
        partialString ++ String.fromChar (Char.toLower nextCharacter)
    else if nextCharacter == ' ' then
        partialString ++ "_"
    else
        partialString

transformString : String -> String
transformString inputString =
    String.foldl transformNextCharacter "" inputString
Online demo here.
Note: This answer simply drops special characters and thus produces "loral_paris" which is acceptable as per the OP.
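For readers following along in Python rather than Elm, a rough sketch of the same per-character idea (the function name is mine; the isascii check mirrors the fact that only plain ASCII letters and digits are kept, so accented characters are dropped):

def transform_string(input_string):
    result = []
    for ch in input_string:
        if ch.isascii() and ch.isalnum():
            result.append(ch.lower())
        elif ch == ' ':
            result.append('_')
        # every other character is simply dropped
    return ''.join(result)

print(transform_string("L'Oréal Paris"))  # 'loral_paris'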
The accepted answer is a lot more efficient than the code I have below; I just want to add my code as an alternative approach.
If you want to change accented characters to plain ones, you can install and use the elm-community/string-extra package, which provides a removeAccents function.
The code below is inefficient because each library call walks the whole string again, one character at a time.
Also, note that when you remove the & in the first item you end up with a double underscore, so you have to replace the double underscore with a single one.
import Html exposing (text)
import String
import List
import String.Extra
import Char
listStr = ["Pop & Suki", "PINK N' PROPER", "L'Oréal Paris"]
-- True if the character is a letter, digit or space; otherwise False.
isDigitAlphaSpace : Char -> Bool
isDigitAlphaSpace c =
    if Char.isAlpha c || Char.isDigit c || c == ' ' then
        True
    else
        False

main =
    List.map
        (\x ->
            String.Extra.removeAccents x            -- Remove accents first
                |> String.filter isDigitAlphaSpace  -- Drop anything that is not a digit, letter or space
                |> String.replace " " "_"           -- Replace space with _
                |> String.replace "__" "_"          -- Replace double __ with a single _
                |> String.toLower                   -- Lower-case the string
        )
        listStr
        |> Debug.toString
        |> Html.text
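Purely for comparison, a hedged Python sketch of the same pipeline; unicodedata's NFKD-normalize-then-ASCII-encode step plays the role of String.Extra.removeAccents here, which is an approximation rather than the same library:

import re
import unicodedata

def slugify(s):
    # Strip accents: decompose, then drop the combining marks by encoding to ASCII
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
    s = ''.join(ch for ch in s if ch.isalnum() or ch == ' ')  # keep digits, letters and spaces
    s = s.replace(' ', '_')
    s = re.sub(r'_+', '_', s)  # collapse runs of underscores (the "__" -> "_" step, generalized)
    return s.lower()

print([slugify(x) for x in ["Pop & Suki", "PINK N' PROPER", "L'Oréal Paris"]])
# ['pop_suki', 'pink_n_proper', 'loreal_paris']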
In Perl 6 grammars, as explained here (note: the design documents are not guaranteed to be up to date while the implementation is being finished), if an opening angle bracket is followed by an identifier, then the construct is a call to a subrule, method or function.
If the character following the identifier is an opening paren, then it's a call to a method or function, e.g. <foo('bar')>. As explained further down the page, if the first char after the identifier is a space, then the rest of the string up to the closing angle will be interpreted as a regex argument to the method - to quote:
<foo bar>
is more or less equivalent to
<foo(/bar/)>
What's the proper way to use this feature? In my case, I'm parsing line-oriented data and I'm trying to declare a rule that will instigate a separate search on the current line being parsed:
#!/usr/bin/env perl6

# use Grammar::Tracer ;

grammar G {
    my $SOLpos = -1 ;   # Start-of-line pos
    regex TOP { <line>+ }

    method SOLscan($regex) {
        # Start a new cursor
        my $cur = self."!cursor_start_cur"() ;

        # Set pos and from to start of the current line
        $cur.from($SOLpos) ;
        $cur.pos($SOLpos) ;

        # Run the given regex on the cursor
        $cur = $regex($cur) ;

        # If pos is >= 0, we found what we were looking for
        if $cur.pos >= 0 {
            $cur."!cursor_pass"(self.pos, 'SOLscan')
        }
        self
    }

    token line {
        { $SOLpos = self.pos ; say '$SOLpos = ' ~ $SOLpos }
        [
            || <word> <ws> 'two' { say 'matched two' } <SOLscan \w+> <ws> <word>
            || <word>+ %% <ws> { say 'matched words' }
        ]
        \n
    }

    token word { \S+ }
    token ws { \h+ }
}

my $mo = G.subparse: q:to/END/ ;
hello world
one two three
END
As it is, this code produces:
$ ./h.pl
$SOLpos = 0
matched words
$SOLpos = 12
matched two
Too many positionals passed; expected 1 argument but got 2
in method SOLscan at ./h.pl line 14
in regex line at ./h.pl line 32
in regex TOP at ./h.pl line 7
in block <unit> at ./h.pl line 41
$
Line 14 is $cur.from($SOLpos). If commented out, line 15 produces the same error. It appears as though .pos and .from are read only... (maybe :-)
Any ideas what the proper incantation is?
Note, any proposed solution can be a long way from what I've done here - all I'm really wanting to do is understand how the mechanism is supposed to be used.
This feature does not seem to be covered in the corresponding directory in roast (the Perl 6 test suite), so that would make it a "Not Yet Implemented" feature, I'm afraid.
To split e.g. mins-2 into its component parts of unit name and order, this does what I want:
sub split-order ( $string ) {
    my Str $i-s = '1';
    $string ~~ / ( <-[\-\d]>+ ) ( \-?\d? ) /;
    $i-s = "$1" if $1 ne '';
    return( "$0", +"$i-s".Int );
}
It seems that Perl 6 should be able to pack this into a much more concise phrasing. I need a default order of 1 where there is no trailing number.
I am probably being a bit lazy by not matching the line end with $. I'm trying to avoid returning Nil, as that is not useful to the caller.
Anyone with a better turn of phrase?
How about using good old split?
use v6;
sub split-order(Str:D $in) {
    my ($name, $qty) = $in.split(/ '-' || <?before \d> /, 2);
    return ($name, +($qty || 1));
}
say split-order('mins-2'); # (mins 2)
say split-order('foo42'); # (foo 42)
say split-order('bar'); # (bar 1)
This does not reproduce your algorithm exactly (and in particular doesn't produce negative numbers), but I suspect it's closer to what you actually want to achieve:
sub split-order($_) {
    /^ (.*?) [\-(\d+)]? $/;
    (~$0, +($1 // 1));
}
After learning how to pass regexes as arguments, I've tried to build my first regex using a sub, and I'm stuck once more. Sorry for the complex rules below; I've done my best to simplify them. I need at least some clues on how to approach this problem.
The regex should consist of alternations, each of them consisting of left, middle and right, where left and right should come in pairs and the variant of middle depends on which right is chosen.
An array of Pairs contains pairs of left and right:
my Pair @leftright =
    A => 'a',
    ...
    Z => 'z',
;
Middle variants are read from a hash:
my Regex %middle =
    z => / foo /,
    a => / bar /,
    m => / twi /,
    r => / bin /,
    ...
;
%middle<z> should be chosen if right is z, %middle<a> — if right is a, etc.
So, the resulting regex should be
my token word {
    | A <%middle[a]> a
    | Z <%middle[z]> z
    | ...
}
or, more generally
my token word {
    | <left=@leftright[0].key>
      <middle=%middle{@leftright[0].value}>
      <right=@leftright[0].value>
    | (the same for index == 1)
    | (the same for index == 2)
    | (the same for index == 3)
    ...
}
and it should match Abara and Zfooz.
How can I build token word (which can be used e.g. in a grammar) with a sub that takes every pair from @leftright, inserts the suitable %middle{} entry depending on the value of right, and then combines it all into one regex?
my Regex sub sub_word(Pair @l_r, Regex %m) {
    ...
}
my token word {
    <{sub_word(@leftright, %middle)}>
}
After the match I need to know the values of left, middle, and right:
"Abara" ~~ &word;
say join '|', $<left>, $<middle>, $<right> # A|bar|a
I was not able to do this using token yet, but here is a solution with EVAL and Regex (and also I am using %middle as a hash of Str and not a hash of Regex):
my Regex sub build_pattern (%middle, @leftright) {
    my $str = join '|', @leftright.map(
        {join ' ', "\$<left>='{$_.key}'", "\$<middle>='{%middle{$_.value}}'", "\$<right>='{$_.value}'"}
    );
    my Regex $regex = "rx/$str/".EVAL;
    return $regex;
}
my Regex $pat = build_pattern(%middle, @leftright);
say $pat;
my $res = "Abara" ~~ $pat;
say $res;
Output:
rx/$<left>='A' $<middle>='bar' $<right>='a'|$<left>='Z' $<middle>='foo' $<right>='z'/
「Abara」
left => 「A」
middle => 「bar」
right => 「a」
For more information on why I chose to use EVAL, see How can I interpolate a variable into a Perl 6 regex?
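Not Raku, but the same build-the-pattern-from-data idea can be sketched in Python; the sample data below is made up, since the question elides the full lists, and the index-suffixed group names work around Python's requirement that named groups be unique:

import re

leftright = [("A", "a"), ("Z", "z")]   # stand-in for @leftright
middle = {"a": "bar", "z": "foo"}      # stand-in for %middle

# One alternative per (left, right) pair, with its own left/middle/right groups
alternatives = [
    f"(?P<left{i}>{re.escape(l)})(?P<middle{i}>{middle[r]})(?P<right{i}>{re.escape(r)})"
    for i, (l, r) in enumerate(leftright)
]
word = re.compile("|".join(alternatives))

m = word.fullmatch("Abara")
if m:
    # Pick out whichever alternative actually matched
    i = next(i for i in range(len(leftright)) if m.group(f"left{i}") is not None)
    print(m.group(f"left{i}"), m.group(f"middle{i}"), m.group(f"right{i}"))  # A bar a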
I'm trying to gather all text that is not defined by a previous rule into a string and prefix it with a formatting string using lex. I'm wondering if there's a standard way of doing this.
For example, say I have the rules:
word1|word2|word3|word4 {printf("%s%s", "<>", yytext);}
[0-9]+ {printf("%s%s", "{}", yytext);}
everything else {printf("%s%s", "[]", yytext);}
And I attempt to lex the string:
word1 this is some other text ; word2 98 foo bar .
I would want this to produce the following when run through the lexer:
<>word1[] this is some other text ; <>word2[] {}98[] foo bar .
I attempted to do this using states, but realized I can't determine when to stop matching, like:
%x OTHER
%%
. {yymore(); BEGIN OTHER;}
<OTHER>.|\n yymore();
<OTHER>how to determine when to end? {printf("%s%s", "[]", yytext); BEGIN INITIAL;}
What is a good way to do this? Is there someway to continue as long as another rule isn't met?
AFAIK, there is no "standard" solution, but a simple one is to keep a bit of context (the prefix last printed) and use that to decide whether or not to print a new prefix. For example, you could use a custom printer like this:
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };

void print_with_prefix(enum OutputType type, const char* token) {
    static enum OutputType prev = NO_TOKEN;
    const char* prefix = "";
    switch (type) {
        case WORD:   prefix = "<>"; break;
        case NUMBER: prefix = "{}"; break;
        case OTHER:  if (prev != OTHER) prefix = "[]"; break;
        default:     assert(false);
    }
    prev = type;
    printf("%s%s", prefix, token);
}
Then you just need to change the calls to printf to invoke print_with_prefix instead (and, as written, to supply an enum value instead of a string).
For the OTHER case, you then don't need to do anything special to accumulate the token. Just
. { print_with_prefix(OTHER, yytext); }
(I'm skating over the handling of whitespace and newlines, but it's just conceptual.)
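A rough Python sketch of the same bookkeeping, just to make the idea concrete; the token classes and the regular expression are mine and only approximate the lex rules above:

import re

TOKEN_RE = re.compile(r"word1|word2|word3|word4|\d+|\s+|\S")

def lex(text):
    out, prev = [], None
    for m in TOKEN_RE.finditer(text):
        tok = m.group()
        if tok in ("word1", "word2", "word3", "word4"):
            kind, prefix = "WORD", "<>"
        elif tok[0].isdigit():
            kind, prefix = "NUMBER", "{}"
        else:
            kind = "OTHER"
            # Only emit the "[]" prefix when the previous token was not already OTHER
            prefix = "[]" if prev != "OTHER" else ""
        out.append(prefix + tok)
        prev = kind
    return "".join(out)

print(lex("word1 this is some other text ; word2 98 foo bar ."))
# <>word1[] this is some other text ; <>word2[] {}98[] foo bar .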