I have a string "\ufffd\ufffd hello\n" in code like this:
fun main() {
val bs = "\ufffd\ufffd hello\n"
println(bs) // �� hello
}
and I want to see "\ufffd\ufffd hello" printed literally. How can I escape the \u for every hex value?
UPD:
val s = """\uffcd"""
val unicodeRegex = """(?<!\\\\)(\\\\\\\\)*(\\u)([A-Fa-f\\d]{4})""".toRegex()
return s.replace(unicodeRegex, """$1\\\\u$3""")
(I'm interpreting the question as asking how to clearly display a string that contains non-printable characters. The Kotlin compiler converts sequences of a \u followed by 4 hex digits in string literals into single characters, so the question is effectively asking how to convert them back again.)
Unfortunately, there's no built-in way of doing this. It's fairly easy to write one, but it's a bit subjective, as there's no single definition of what's ‘printable’…
Here's an extension function that probably does roughly what you want:
fun String.printable() = map {
    when (Character.getType(it).toByte()) {
        // General categories treated as non-printable
        Character.CONTROL, Character.FORMAT, Character.PRIVATE_USE,
        Character.SURROGATE, Character.UNASSIGNED, Character.OTHER_SYMBOL
            -> "\\u%04x".format(it.code) // it.code replaces the deprecated it.toInt() (Kotlin 1.5+)
        else -> it.toString()
    }
}.joinToString("")
println("\ufffd\ufffd hello\n".printable()) // prints ‘\ufffd\ufffd hello\u000a’
The sample string in the question is a bad example, because \uFFFD is the replacement character — a black diamond with a question mark, usually shown in place of any non-displayable characters. So the replacement character itself is displayable!
The code above treats it as non-displayable by excluding the Character.OTHER_SYMBOL type — but that will also exclude many other symbols. So you'll probably want to remove it, leaving just the other 5 types. (I got those from this answer.)
Because the trailing newline is non-displayable, that gets converted to a hex code too. You could extend the code to handle the escape codes \t, \b, \n, \r and maybe \\ too if needed. (You could also make it more efficient… this was done for brevity!)
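As a rough sketch of that extension, the variant below maps the common control characters to their short escape codes and falls back to a hex escape for the remaining five non-printable types. The name escaped() is made up for illustration, and Character.OTHER_SYMBOL is deliberately left out, so \uFFFD is printed as-is.

```kotlin
// Hypothetical helper (not a standard library function): renders control and
// other non-printable characters as escape sequences.
fun String.escaped(): String = buildString {
    for (ch in this@escaped) {
        when (ch) {
            // Short escape codes for the common cases
            '\t' -> append("\\t")
            '\b' -> append("\\b")
            '\n' -> append("\\n")
            '\r' -> append("\\r")
            '\\' -> append("\\\\")
            else -> when (Character.getType(ch).toByte()) {
                // Everything else that is non-printable becomes \uXXXX
                Character.CONTROL, Character.FORMAT, Character.PRIVATE_USE,
                Character.SURROGATE, Character.UNASSIGNED ->
                    append("\\u%04x".format(ch.code))
                else -> append(ch)
            }
        }
    }
}

fun main() {
    println("tab\there\n".escaped())
}
```

Note that, like the original, this works on individual UTF-16 code units, so each half of a surrogate pair (for example an emoji) is escaped separately.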
Simply escape the \ in your strings by adding another backslash in front of it:
val bs = "\\ufffd\\ufffd hello\n"
You can also use raw strings with """ so you don't have to escape the backslashes (which is useful for regex):
val bs = """\ufffd\ufffd hello\n"""
Note that in that case the \n would also NOT be counted as an LF character, and will be literally printed as the 2 characters "\n".
You can add literal line breaks in your raw string if you want an actual line feed, though:
val bs = """\ufffd\ufffd hello
"""
Related
I'm using Normalizer followed by a regex to remove accents, but I'm getting back the same string with the accents still there.
import java.text.*
const val INPUT = "áéíóůø"
fun main() {
println(Normalizer.normalize(INPUT, Normalizer.Form.NFC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFKC).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
println(Normalizer.normalize(INPUT, Normalizer.Form.NFKD).replace("[\\p{InCombiningDiacriticalMarks}]+", ""))
}
Output:
áéíóůø
áéíóůø
áéíóůø
áéíóůø
Kotlin playground: https://pl.kotl.in/62l6rUEUm
I've read a dozen questions here that say the way to strip accent marks in Java/Kotlin is to use java.text.Normalizer plus some minor variation of the above regular expression (sometimes without square brackets, sometimes without the plus). Even Apache Commons' stripAccents function uses Normalizer for its implementation (but apparently handles two special characters too).
What am I doing wrong?
You did not make "[\\p{InCombiningDiacriticalMarks}]+" a Regex.
println(
    Normalizer.normalize(INPUT, Normalizer.Form.NFD)
        .replace("[\\p{InCombiningDiacriticalMarks}]+".toRegex(), "")
)
This produces:
aeiouø
Notice that the stroke in ø is not a diacritical mark, so NFD leaves ø intact. It decomposes to neither
"o" and U+0338 COMBINING LONG SOLIDUS OVERLAY, nor
"o" and U+0337 COMBINING SHORT SOLIDUS OVERLAY
You can see that these three all look a bit different: o̸øo̷
Also notice that there are two more blocks in Unicode that contain combining diacritics, called "Combining Diacritical Marks Extended" and "Combining Diacritical Marks Supplement". Consider including those in your regex too.
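One way to avoid listing individual blocks is to match on the general category instead. The sketch below swaps the block-based pattern for \p{Mn} (nonspacing marks), which is a different technique from the one in the answer and may strip more than you want for some scripts. The function name stripAccents is just an illustrative choice.

```kotlin
import java.text.Normalizer

// Assumption: matching the Mn (nonspacing mark) category is acceptable in place
// of the named Unicode blocks.
private val nonSpacingMarks = Regex("\\p{Mn}+")

// Hypothetical helper: decompose first, then drop the combining marks.
fun stripAccents(input: String): String =
    nonSpacingMarks.replace(Normalizer.normalize(input, Normalizer.Form.NFD), "")

fun main() {
    // ø is left alone because it has no canonical decomposition
    println(stripAccents("áéíóůø"))
}
```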
I am stuck on the following code challenge in Kotlin:
Replace all the words in the string starting and ending with $ e.g. $lorem$ to <i>lorem</i>
var incomingString = "abc 123 $Lorem$, $ipsum$, $xyz$ 547"
// My non working code:
fun main(args: Array<String>) {
    val incomingString = "abc 123 \$Lorem$, \$ipsum$, \$xyz$ 547"
    var finalString = ""
    if (incomingString.contains("$")) {
        // Replaces every $ with an opening tag, then appends a single closing tag
        val intermediateString = incomingString.replace("\$", "<i>")
        finalString = "$intermediateString</i>"
    }
    println(finalString)
}
Output is:
abc 123 <i>Lorem<i>, <i>ipsum<i>, <i>xyz<i> 547</i>
Desired output:
abc 123 <i>Lorem</i>, <i>ipsum</i>, <i>xyz</i> 547
I am not going to do your homework for you, but the reason you have a challenge involving $ is that the symbol has two special purposes:
Read up about String Interpolation in Kotlin: https://kotlinlang.org/docs/idioms.html#string-interpolation ... you will need to take care to prevent the $ being used for interpolation ... and seems you already got that part
$ is also a special character in Regular Expressions. Regular Expressions are an esoteric area of programming - meaning very complicated to get your head around, but very very powerful. Worth the effort. Using Regular Expression (Regex) approach for this program is what will get you lots of marks if you can also be sure to escape the $. Here is the Kotlin Regex replace function docs:
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/-regex/replace.html
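For reference, here is a minimal sketch of how the regex approach could look. It assumes each marked word consists only of word characters (letters, digits, underscore), and the names dollarWord and italicizeWithRegex are invented for this example.

```kotlin
// The Kotlin literal below produces the regex \$(\w+)\$ :
// a literal $, one or more word characters (captured), then another literal $.
private val dollarWord = Regex("\\\$(\\w+)\\\$")

// Hypothetical helper: wraps every $word$ in <i>...</i>.
// The lambda form avoids having to escape $ in a replacement string.
fun italicizeWithRegex(input: String): String =
    dollarWord.replace(input) { match -> "<i>${match.groupValues[1]}</i>" }

fun main() {
    println(italicizeWithRegex("abc 123 \$Lorem\$, \$ipsum\$, \$xyz\$ 547"))
}
```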
Every time you come across a $ in your input string, you need to alternate between replacing it with <i> and replacing it with </i>. Therefore, you need to have a variable that tells you what state you're in, and every time you make a replacement, you flip that variable. You may find Kotlin's String.replaceFirst method useful.
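The alternating idea can be sketched with a single pass and a boolean flag. This version walks the characters rather than calling String.replaceFirst repeatedly, and it assumes the $ signs are balanced; italicizeByToggle is a made-up name.

```kotlin
// Hypothetical helper: each $ flips between emitting an opening and a closing tag.
fun italicizeByToggle(input: String): String {
    val out = StringBuilder()
    var insideTag = false // the state variable: are we currently between two $ signs?
    for (ch in input) {
        if (ch == '\$') {
            out.append(if (insideTag) "</i>" else "<i>")
            insideTag = !insideTag // flip after every replacement
        } else {
            out.append(ch)
        }
    }
    return out.toString()
}

fun main() {
    println(italicizeByToggle("abc 123 \$Lorem\$, \$ipsum\$, \$xyz\$ 547"))
}
```

With an odd number of $ signs the last tag would stay open, so real input may need a check for that.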
I am trying to unescape UTF_8 characters like "\u00f6" to their UTF-8 representation.
E.g. file contains "Aalk\u00f6rben" should become "Aalkörben".
val tmp = text.toByteArray(Charsets.UTF_8)
val escaped = tmp.decodeToString()
// or val escaped = tmp.toString(Charsets.UTF_8)
When I set the string manually to "Aalk\u00f6rben", this works fine. However, when reading the string from the file it is interpreted like "Aalk\\u00f6rben" with the slash escaped (two slashes) and the escaping fails.
Is there any way to convince Kotlin to convert the special characters? I would rather not use external libraries like from Apache.
I do not know how you read the file, but what happens most probably is that ...\u00f6... is read as six single characters and the backslash is probably being escaped. You could check in the debugger.
So my assumption is that in memory you have "Aalk\\u00f6rben". Try this replace:
val result = text
.replace("\\u00f6", "\u00f6")
.toByteArray(Charsets.UTF_8)
.decodeToString()
Edit: this should replace all escaped characters written as \u followed by 4 hex digits:
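// Requires: import java.util.regex.Matcher and java.util.regex.Pattern (replaceAll with a lambda needs Java 9+)
val text = Pattern
    .compile("\\\\u([0-9A-Fa-f]{4})")
    .matcher(file.readText())
    .replaceAll { matchResult ->
        // quoteReplacement keeps a decoded '$' or '\' from being read as a group reference
        Matcher.quoteReplacement(matchResult.group(1).toInt(16).toChar().toString())
    }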
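The same idea can also be written with Kotlin's own Regex class, which needs no Java imports. This is only a sketch: unescapeUnicode is a hypothetical name, and it does not treat a doubled backslash before the u as an escaped backslash.

```kotlin
// Raw string, so the regex source is \\u([0-9A-Fa-f]{4}):
// a literal backslash, the letter u, then exactly four hex digits.
private val unicodeEscape = Regex("""\\u([0-9A-Fa-f]{4})""")

// Hypothetical helper: turns each \uXXXX sequence into the character it names.
fun unescapeUnicode(text: String): String =
    unicodeEscape.replace(text) { match ->
        match.groupValues[1].toInt(16).toChar().toString()
    }

fun main() {
    // The doubled backslash puts a single literal backslash into the string,
    // mimicking what is read from the file.
    println(unescapeUnicode("Aalk\\u00f6rben"))
}
```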
I scanned a document into Kotlin and it has words, numbers, values, etc., but I only want the values that start with a $ and have 2 decimal places after the . (so the prices). Do I use a combination of substring with other string parsing?
Edit: I have looked into Regex and the problem I am having now is I am using this line
val reg = Regex("\$([0-9]*\.[0-9]*)")
to grab all the prices. However, the \. portion is flagged as an invalid escape, although this works just fine in other languages.
You have to use a double \ instead of a single \. It's because the \ is an escape character both in Regex and in Kotlin/Java strings. So when \ appears in a String, Kotlin expects it to be followed by a character that needs to be escaped. But you aren't trying to escape a String's character...you're trying to escape a Regex character. So you have to escape your backslash itself using another backslash, so the backslash is part of the computed String literal and can be understood by Regex.
You also need a double \ before your dollar sign for it to behave correctly. Technically, I think it should be a triple \ because $ is a special character in both Kotlin and in Regex and you want to escape it in both. However, Kotlin only treats $ as the start of a string template when an identifier or { follows it, so the double escape happens to work here. Rather than rely on that, I would use the triple escape.
val reg = Regex("\\\$([0-9]*\\.[0-9]*)")
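To pull out only the prices with exactly the form described in the question, something along these lines might work. It is a sketch that assumes a price is a $ followed by at least one digit, a dot, and two digits; extractPrices and priceRegex are illustrative names.

```kotlin
// The Kotlin literal below produces the regex \$[0-9]+\.[0-9]{2}
private val priceRegex = Regex("\\\$[0-9]+\\.[0-9]{2}")

// Hypothetical helper: returns every price-looking token found in the text.
fun extractPrices(text: String): List<String> =
    priceRegex.findAll(text).map { it.value }.toList()

fun main() {
    println(extractPrices("Total \$12.50, tax \$1.05, qty 3, code 4.5"))
}
```

A value with more than two decimals would still match on its first two decimal digits, so a trailing lookahead could be added if that matters.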
I tried many ways to get a single backslash from an executed (I don't mean an input from html).
I can get special characters as tab, new line and many others then escape them to \\t or \\n or \\(someother character) but I cannot get a single backslash when a non-special character is next to it.
I don't want something like:
str = "\apple"; // I want this, to return:
console.log(str); // \apple
and if I try to get character at 0 then I get a instead of \.
(See ES2015 update at the end of the answer.)
You've tagged your question both string and regex.
In JavaScript, the backslash has special meaning both in string literals and in regular expressions. If you want an actual backslash in the string or regex, you have to write two: \\.
The following string starts with one backslash, the first one you see in the literal is an escape character starting an escape sequence. The \\ escape sequence tells the parser to put a single backslash in the string:
var str = "\\I have one backslash";
The following regular expression will match a single backslash (not two); again, the first one you see in the literal is an escape character starting an escape sequence. The \\ escape sequence tells the parser to put a single backslash character in the regular expression pattern:
var rex = /\\/;
If you're using a string to create a regular expression (rather than using a regular expression literal as I did above), note that you're dealing with two levels: The string level, and the regular expression level. So to create a regular expression using a string that matches a single backslash, you end up using four:
// Matches *one* backslash
var rex = new RegExp("\\\\");
That's because first, you're writing a string literal, but you want to actually put backslashes in the resulting string, so you do that with \\ for each one backslash you want. But your regex also requires two \\ for every one real backslash you want, and so it needs to see two backslashes in the string. Hence, a total of four. This is one of the reasons I avoid using new RegExp(string) whenever I can; I get confused easily. :-)
ES2015 and ES2018 update
Fast-forward to 2015, and as Dolphin_Wood points out the new ES2015 standard gives us template literals, tag functions, and the String.raw function:
// Yes, this unlikely-looking syntax is actually valid ES2015
let str = String.raw`\apple`;
str ends up having the characters \, a, p, p, l, and e in it. Just be careful there are no ${ in your template literal, since ${ starts a substitution in a template literal. E.g.:
let foo = "bar";
let str = String.raw`\apple${foo}`;
...ends up being \applebar.
Try String.raw method:
str = String.raw`\apple` // "\apple"
Reference here: String.raw()
\ is an escape character. When followed by a non-special character, it doesn't become a literal \; the backslash is simply dropped. Instead, you have to double it: \\.
console.log("\apple"); //-> "apple"
console.log("\\apple"); //-> "\apple"
Before ES2015 added String.raw, there was no way to get the original, raw string definition or create a literal string without escape characters.
Please try the code below (Java). It works for me, and I get the output with the backslash:
String sss="dfsdf\\dfds";
System.out.println(sss);