Normalizer not removing accents - kotlin

I'm using Normalizer followed by a regex to remove accents, but I'm getting back the same string with the accents still there.
import java.text.*
const val INPUT = "áéíóůø"
fun main() {
println(Normalizer.normalize(INPUT, Normalizer.Form.NFC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFKC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFKD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
}
Output:
áéíóůø
áéíóůø
áéíóůø
áéíóůø
Kotlin playground: https://pl.kotl.in/62l6rUEUm
I've read a dozen questions here that say the way to strip accent marks in Java/Kotlin is to use java.text.Normalizer plus some minor variation of the above regular expression (sometimes without square brackets, sometimes without the plus). Even Apache Common's stripAccent function uses Normalizer for its implementation (but apparently handles to special characters too).
What am I doing wrong?

You did not make "[\\p{InCombiningDiacriticalMarks}]+" a Regex.
println(
Normalizer.normalize(INPUT, Normalizer.Form.NFD)
.replace("[\\p{InCombiningDiacriticalMarks}]+".toRegex(), "")
)
This produces:
aeiouø
Notice that the stroke in ø is not a diacritic mark. It can be decomposed to neither
"o" and U+0338 COMBINING LONG SOLIDUS OVERLAY, or;
"o" and U+0337 COMBINING SHORT SOLIDUS OVERLAY
You can see that these three all look a bit different: o̸øo̷
Also notice that there are two more blocks in Unicode that contains combining diacritics, called "Combining Diacritical Marks Extended" and "Combining Diacritical Marks Supplement". Consider including those in your regex too.

Related

Escape hex like \u... in kotlin strings

I have a string "\ufffd\ufffd hello\n"
i have a code like this
fun main() {
val bs = "\ufffd\ufffd hello\n"
println(bs) // �� hello
}
and i want to see "\ufffd\ufffd hello", how can i escape \u for every hex values
UPD:
val s = """\uffcd"""
val req = """(?<!\\\\)(\\\\\\\\)*(\\u)([A-Fa-f\\d]{4})""".toRegex()
return s.replace(unicodeRegex, """$1\\\\u$3""")
(I'm interpreting the question as asking how to clearly display a string that contains non-printable characters.  The Kotlin compiler converts sequences of a \u followed by 4 hex digits in string literals into single characters, so the question is effectively asking how to convert them back again.)
Unfortunately, there's no built-in way of doing this.  It's fairly easy to write one, but it's a bit subjective, as there's no single definition of what's ‘printable‘…
Here's an extension function that probably does roughly what you want:
fun String.printable() = map {
when (Character.getType(it).toByte()) {
Character.CONTROL, Character.FORMAT, Character.PRIVATE_USE,
Character.SURROGATE, Character.UNASSIGNED, Character.OTHER_SYMBOL
-> "\\u%04x".format(it.toInt())
else -> it.toString()
}
}.joinToString("")
println("\ufffd\ufffd hello\n".printable()) // prints ‘\ufffd\ufffd hello\u000a’
The sample string in the question is a bad example, because \uFFFD is the replacement character — a black diamond with a question mark, usually shown in place of any non-displayable characters.  So the replacement character itself is displayable!
The code above treats it as non-displayable by excluding the Character.OTHER_SYMBOL type — but that will also exclude many other symbols.  So you'll probably want to remove it, leaving just the other 5 types.  (I got those from this answer.)
Because the trailing newline is non-displayable, that gets converted to a hex code too.  You could extend the code to handle the escape codes \t, \b, \n, \r and maybe \\ too if needed.  (You could also make it more efficient… this was done for brevity!)
Simply escape the \ in your strings by adding another backslash in front of it:
val bs = "\\ufffd\\ufffd hello\n"
You can also use raw strings with """ so you don't have to escape the backslashes (which is useful for regex):
val bs = """\ufffd\ufffd hello\n"""
Note that in that case the \n would also NOT be counted as an LF character, and will be literally printed as the 2 characters "\n".
You can add literal line breaks in your raw string if you want an actual line feed, though:
val bs = """\ufffd\ufffd hello
"""

How to split on unicode whitespace in kotlin

In Kotlin if we use:
string.split(Regex("\\s+"))
Then we can split a string into words separated by whitespace. However the string:
val string = "a\u2000b"
doesn't split since the regex doesn't match unicode whitespace characters.
I there a way to split the string on all whitespace characters?
Since Java 7 Pattern allows to specify the UNICODE_CHARACTER_CLASS-flag which would basically also work for your current issue:
Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS)
Unfortunately this isn't directly supported via RegexOption with Kotlins Regex yet. There is a known issue that also describes a workaround (KT-21094):
string.split("""(?U)\s+""".toRegex())
You (most probably) require Java 7+ for that to actually work. Alternatives could be to use other predefined character classes instead. However, you need to lookup the appropriate Pattern-javadoc for your Java version to ensure that it is actually working (or do it in a trial-error-manner ;-)).
I've used the following regex to match Unicode whitespace:
Regex("[\\p{javaWhitespace}\u00A0\u2007\u202F]+")
This works because while \s matches only Latin-1 whitespace, \p{javaWhitespace} matches everything for which Character.isWhitespace() is true.  For some reason, this doesn't include a few particular characters, which I've listed separately.
More info in the docs for Pattern.
Related fact: although java.lang.String.trim() doesn't remove non-breaking spaces or figure spaces, kotlin.String.trim() does!

Regular expression to extract a number of steps

I have a localized string that looks something like this in English:
"
5 Mile(s)
5,252 Step(s)
"
My app is localized both in left-to-right and right-to-left languages so I don't want to make assumptions either about the ordering of the step(s) or about the formatting of the number (e.g. 5,252 can be 5.252 depending on user locale). So I need to account for possibilities that can include things like
Step(s) 5.252
as well as what's above.
A few other caveats
All I know is that if the Step(s) line is in there, it will be on its own line (hence in my regex I require \n at each end of the string)
No guarantee that the Mile(s) information will be in the string at all, let alone whether it will be before or after Step(s)
Here's my attempt at pattern extraction:
NSString *patternString = [NSString stringWithFormat:#"\\n(([0-9,\\.]*)\s*%#|%#\s*([0-9,\\.]*))\\n",
NSLocalizedString(#"Step(s)",nil), NSLocalizedString(#"Step(s)",nil)];
There appear to be two problems with this:
XCode is indicating Unknown escape sequence '\s' for the second \s in the pattern string above
No matches are being found even for strings like the following:
0.2 Mile(s)
1,482 Step(s)
Ideally I would extract the 1,482 out of this string in a way that is localization friendly. How should I modify my regex?
as far as the regex, perhaps this approach might work - it simply matches (with named groups) each couplet of numbers in sequence, with the assumption the first is miles and the second is steps. Decimals in the . or , form are optional:
(?<miles>\d+(?:[.,]\d+)?).*?(?<steps>\d+(?:[.,]\d+)?)
(and i think it should be \\s) - i'm not an ios guy, but if you can use a regex literal it would be way more readable.
regular expression demo
First I'd like to ask - Why is Mile(s) mentioned in the question at all?
And now to my two bits - you could simply use a positive look-ahead:
^(?=.*Step\(s\))[^\d]*(\d+(?:[.,]\d+)?)
It makes sure the expected word is present on the line, and then captures the number on it, allowing for localized, optional, decimal separator and decimals. This way it doesn't matter if the numer is before, or after, the "word".
It doesn't take localization of the "word" into account, but that you seem to have handled by yourself ;)
See it here at regex101.
Your regex is close, although in Obj-C you need to double-escape the \s and (s):
^(([0-9,.]*)\\s*%#|%#\\s*([0-9,.]*))$
In your NSLocalizedString you likely also need to escape the parentheses enclosing (s):
NSString *patternString = [NSString stringWithFormat:#"^(([\\d,.]+)\\s%#|%#\\s([\\d,.]+))$",
NSLocalizedString(#"Step\\(s\\)",nil), NSLocalizedString(#"Step\\(s\\)",nil)];
If you don't escape (s) then the regex engine is probably going to interpret it as a capture group.
Looking at NSLog you can see what the pattern actually reads like:
NSLog(#"patternString: %#", patternString);
Output:
patternString: ^(([\d,.]+)\sStep\(s\)|Step\(s\)\s([\d,.]+))$
Since you mentioned the Mile(s) part may not be in the string at all I'm assuming it isn't relevant to the regular expression. As I understand from the question, you just need to capture the number of steps and nothing else. On this basis, here's a modified version of your existing regex:
NSString *patternString =
[NSString stringWithFormat:#"^(?:([0-9,.]*)\\s*%#|%#\\s*([0-9,.]*))$",
NSLocalizedString(#"Step\\(s\\)",nil), NSLocalizedString(#"Step\\(s\\)",nil)];
Demo:
https://www.regex101.com/r/Q6ff1b/1
This is based on the following tips/modifications:
Use the m (= UREGEX_MULTILINE) flag option when creating the regex to specify that ^ and $ match the start and end of each line. This is more sophisticated than using \n as it will also handle the start and end of the string where this might not be present. See here.
Always use a double backslash (\\) for regex escaping - otherwise NSString will interpret the single backslash to be escaping the next character and convert it before it gets to the regex.
Literal parentheses need to be escaped - e.g. Step\\(s\\) instead of Step(s).
Characters within a character class (i.e. anything within the [] square brackets) don't need to be escaped - so it would be . rather than \\. - the latter.
If you are using (x|y|...) as a choice and don't need it to be a capturing group, use ?: after the first parenthesis to ensure it doesn't get captured - i.e. (?:x|y|...).

What does the \? (backslash question mark) escape sequence mean?

I'm writing a regular expression in Objective-C.
The escape sequence \w is illegal and emits a warning, so the regular expression /\w/ must be written as #"\\w"; the escape sequence \? is valid, apparently, and doesn't emit a warning, so the regular expression /\?/ must be written as #"\\?" (i.e., the backslash must be escaped).
Question marks aren't invisible like \t or \n, so why is \? a valid escape sequence?
Edit: To clarify, I'm not asking about the quantifier, I'm asking about a string escape sequence. That is, this doesn't emit a warning:
NSString *valid = #"\?";
By contrast, this does emit a warning ("Unknown escape sequence '\w'"):
NSString *invalid = #"\w";
It specifies a literal question mark. It is needed because of a little-known feature called trigraphs, where you can write a three-character sequence starting with question marks to substitute another character. If you have trigraphs enabled, in order to write "??" in a string, you need to write it as "?\?" in order to prevent the preprocessor from trying to read it as the beginning of a trigraph.
(If you're wondering "Why would anybody introduce a feature like this?": Some keyboards or character sets didn't include commonly used symbols like {. so they introduced trigraphs so you could write ??< instead.)
? in regex is a quantifier, it means 0 or 1 occurences. When appended to the + or * quantifiers, it makes it "lazy".
For example, applying the regex o? to the string foo? would match o.
However, the regex o\? in foo? would match o?, because it is searching for a literal question mark in the string, instead of an arbitrary quantifier.
Applying the regex o*? to foo? would match oo.
More info on quantifiers here.

Import format to intellij idea from JSMin/JSFormat

Does anybody knows which formatting rules uses jsmin/jsformatter plugin of Notepad++? I need this because we are forced to use this formatter but I'm using intellij idea to write js code. So having this rules I can import it some how or, at least, apply manually.
Thanks everyone in advance!
The minimising rules applied are listed here:
http://www.crockford.com/javascript/jsmin.html
JSMin is a filter that omits or modifies some characters. This does
not change the behavior of the program that it is minifying. The
result may be harder to debug. It will definitely be harder to read.
JSMin first replaces carriage returns ('\r') with linefeeds ('\n'). It
replaces all other control characters (including tab) with spaces. It
replaces comments in the // form with linefeeds. It replaces comments
in the /* */ form with spaces. All runs of spaces are replaced with a
single space. All runs of linefeeds are replaced with a single
linefeed.
It omits spaces except when a space is preceded and followed by a
non-ASCII character or by an ASCII letter or digit, or by one of these
characters:
\ $ _
It is more conservative in omitting linefeeds, because linefeeds are
sometimes treated as semicolons. A linefeed is not omitted if it
precedes a non-ASCII character or an ASCII letter or digit or one of
these characters:
\ $ _ { [ ( + -
and if it follows a non-ASCII character or an ASCII letter or digit or
one of these characters:
\ $ _ } ] ) + - " '
No other characters are omitted or modified.
There are other custom formatting rules applied according to the plugin developer's page:
http://www.sunjw.us/jsminnpp/