Efficiently break a Kotlin string into fixed-length substrings without using regex

Obviously, split can be used to break a string into substrings at a specific character or delimiter string, but I was looking for an easy way to break a string into fixed-length substrings.
For example:
"abcde".splitAt(2) == listOf("ab", "cd", "e")
Any ideas?

Use the CharSequence.chunked(size: Int) function. It does exactly that:
println("abcde".chunked(2)) // [ab, cd, e]
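
For readers who want to see what fixed-length chunking amounts to, here is a minimal sketch in Python. It only illustrates the idea; the function name and the error handling are assumptions of this sketch, and Kotlin's own implementation of chunked may differ in detail.

```python
def chunked(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Step through the string in strides of `size`; the last slice may be shorter.
    return [text[i:i + size] for i in range(0, len(text), size)]


print(chunked("abcde", 2))  # ['ab', 'cd', 'e']
```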

Related

It is necessary to replace all characters of the string with the adjacent character (shift all characters to the right by 1)

It is necessary to replace each character in the string cyclically with the character adjacent to the right, and then collect it into a string again.
Instead of shifting the characters to the right, my code advances each character alphabetically:
fun main() {
    val message = "abcd1234"
    val messageSecond = message.map { char -> char + 1 }.joinToString("")
    println(messageSecond) // bcde2345
}
Your code is wrong because it increments the character code of each character. As you noted yourself, the result advances "alphabetically" (not exactly, just in ASCII order).
I can't write the code for you as it seems to be an assignment, but I can give you a hint.
You can actually solve this just by "moving" only one character.
Best of luck!

What's the best way to 'normalize' a string in Redshift?

Since my texts are in Portuguese, there are many words with accents and other special characters, like: "coração", "hambúrguer", "São Paulo".
Normally, I treat these names in Python with the following function:
from unicodedata import normalize

def string_normalizer(text):
    result = normalize("NFKD", text.lower()).encode("ASCII", "ignore").decode("ASCII")
    return result.replace(" ", "-")
This replaces blank spaces with '-', strips accents and other special characters, and converts the text to lowercase. The word "coração" becomes "coracao", "São Paulo" becomes "sao-paulo", and so on. Now, I'm not sure what the best way to do this in Redshift is. My solution would be to apply multiple replaces, something like this:
replace(replace(replace(lower(column), 'á', 'a'), 'ç', 'c')...
Even though this works, it doesn't look like the best solution. Is there an easy way to normalize my string?
In Redshift, you can use the translate function to normalize a string. The translate function takes three arguments: the source string, the characters to replace, and the replacement characters. You can use this function to replace all the special characters in your string with their ASCII equivalent.
For example, the following query uses the translate function to replace all the special characters in a string with their ASCII equivalent. Additionally, spaces are replaced with "-" characters.
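SELECT translate('São Paulo', ' áàãâäéèêëíìîïóòõôöúùûüçÁÀÃÂÄÉÈÊËÍÌÎÏÓÒÕÖÔÚÙÛÜÇ', '-aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC')
This query would return the string "Sao-Paulo", because the space is mapped to "-". Note that the two character lists are matched position by position, so they must stay aligned. You can additionally use the lower function to convert the result to lowercase.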
Here's an example of how you could use these functions together to normalize a string:
SELECT lower(translate('São Paulo', ' áàãâäéèêëíìîïóòõôöúùûüçÁÀÃÂÄÉÈÊËÍÌÎÏÓÒÕÖÔÚÙÛÜÇ', '-aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC'))
This query would return the string "sao-paulo".
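
Because translate maps characters by position, the two long literals are easy to get out of step when typed by hand. The following Python sketch generates both strings from a small table and checks the behaviour locally. The GROUPS table and the function names are assumptions made for illustration, and str.translate only mimics the positional mapping; it is not Redshift's implementation.

```python
# Hypothetical table: ASCII replacement -> accented variants it should absorb.
GROUPS = {
    "a": "áàãâä", "e": "éèêë", "i": "íìîï",
    "o": "óòõôö", "u": "úùûü", "c": "ç",
}


def build_translate_args(groups):
    """Return (from_chars, to_chars), equal in length, lowercase then uppercase."""
    from_chars, to_chars = " ", "-"  # space -> hyphen, as in the query above
    for case in (str.lower, str.upper):
        for ascii_char, variants in groups.items():
            from_chars += case(variants)
            to_chars += case(ascii_char) * len(variants)
    return from_chars, to_chars


FROM_CHARS, TO_CHARS = build_translate_args(GROUPS)
TABLE = str.maketrans(FROM_CHARS, TO_CHARS)


def normalize(text):
    # Same order of operations as lower(translate(...)) in the SQL.
    return text.translate(TABLE).lower()


print(normalize("São Paulo"))  # sao-paulo
print(normalize("coração"))    # coracao
```

The generated FROM_CHARS and TO_CHARS could then be pasted into the translate call as its second and third arguments.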

How can I add a string character based on a position in OpenRefine?

I have a column in OpenRefine, and I would like to insert a character string into each of its rows, based on a position in the string.
For example:
I have an 8-character numeric string, 85285296, and would like to add "-" after the fourth character: "8528-5296".
Can anyone help me find the specific function in OpenRefine?
Thanks
Tzipy
The simplest approach is to just use the expression language's built-in string indexing and concatenation:
value[0,4]+'-'+value[4,8]
or more generally, if you don't know that your value is exactly 8 characters long:
value[0,4]+'-'+value[4,999]
Possible solution (not sure if it's the most straightforward):
value.replace(/(\d{4})(.+)/, "$1-$2")
This means: $1 stands for the content of the first parenthesized group in the regular expression and $2 for the content of the second one, so each value in the column is replaced with $1-$2.
Some other options:
value.splitByLengths(4,4).join("-")
value.match(/(\d{4})(\d{4})/).join("-")
value.substring(0,4)+"-"+value.substring(4,8)
I think 'splitByLengths' is the neatest, but I might use 'match' instead because it fails with an error if your starting string isn't 8 digits. That means you don't accidentally process data that doesn't conform to your assumption about what is in the column. You could also use a facet/filter to check this with any of the others.
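
The trade-off between the lenient slicing approach and the strict match approach can be sketched outside OpenRefine in plain Python. This is only an illustration of the two behaviours: the function names are hypothetical, and the code is not GREL.

```python
import re


def insert_hyphen(value, position=4):
    """Lenient: slices at `position` regardless of what the value contains."""
    return value[:position] + "-" + value[position:]


def insert_hyphen_strict(value):
    """Strict: accepts exactly 8 digits and raises on anything else."""
    match = re.fullmatch(r"(\d{4})(\d{4})", value)
    if match is None:
        raise ValueError("expected exactly 8 digits, got %r" % value)
    return "-".join(match.groups())


print(insert_hyphen("85285296"))         # 8528-5296
print(insert_hyphen_strict("85285296"))  # 8528-5296
print(insert_hyphen("852"))              # 852-
```

The last call shows why the strict variant can be preferable: the lenient one silently produces a value for input that does not fit the expected shape.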

Hive convert a string to an array of characters

How can I convert a string to an array of characters, for example
"abcd" -> ["a","b","c","d"]
I know the split method:
SELECT split("abcd","");
-- ["a","b","c","d",""]
Is the trailing empty string a bug? Or are there any other ideas?
This is not actually a bug. Hive's split function simply calls the underlying Java String#split(String regexp, int limit) method with the limit parameter set to -1, which causes trailing empty strings to be returned.
I'm not going to dig into implementation details on why it's happening since there is already a brilliant answer that describes the issue. Note that str.split("", -1) will return different results depending on the version of Java you use.
A few alternatives:
Use "(?!\A|\z)" as a separator regexp, e.g. split("abcd", "(?!\\A|\\z)"). This will make the regexp matcher skip zero-width matches at the start and at the end positions of the string.
Create a custom UDF that uses either String#toCharArray(), or accepts limit as an argument of the UDF so you can use it as: SPLIT("", 0)
I don't know if it is a bug or that's how it works. As an alternative, you could use explode and collect_list, and exclude the blanks with a WHERE clause:
SELECT collect_list(l)
FROM ( SELECT EXPLODE(split('abcd','') ) as l ) t
WHERE t.l <> '';
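
The two workarounds above (a lookaround separator that skips the start and end positions, and filtering out empty strings after the split) can be tried out in Python as a rough sketch. It assumes Python 3.7 or later, where re.split accepts patterns that match the empty string. Python's regex engine is not Java's, so details differ (for example, Python spells the end-of-input anchor \Z), and the function names are hypothetical.

```python
import re


def split_chars_lookaround(text):
    """Split at every empty position except the very start and the very end."""
    if not text:
        return []
    return re.split(r"(?!\A|\Z)", text)


def split_chars_filter(text):
    """Split on the empty pattern, then drop the empty strings it produces."""
    return [piece for piece in re.split("", text) if piece != ""]


print(split_chars_lookaround("abcd"))  # ['a', 'b', 'c', 'd']
print(split_chars_filter("abcd"))      # ['a', 'b', 'c', 'd']
```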

Using groups in OpenRefine regex

I'm wondering if it is possible to use "groups" in regular expressions in OpenRefine's GREL syntax. I mean, I'd like to replace every dot that is preceded and followed by a character with the same preceding character and the dot, followed by a space and then the following character.
Something like:
s.replace(/(.{1})\..({1})/,/(1).\s(2)/)
It should, but your last argument needs to be a string, not a regular expression. Internally Refine uses Java's Matcher#replaceAll method which accepts a string argument.
I think I found out how to deal with this. You need to put $X in your string value to refer to the Xth capture group.
It should be like this:
s.replace(/.?(#capture group 1).?(#capture group 2).*?/, " some text $1 some text $2 some text")
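
The same idea (a regular expression as the pattern, a plain string with group references as the replacement) can be sketched in Python. Note that Python writes group references as \1 and \2 rather than $1 and $2. The function name and the choice of \w for "a character" are assumptions of this sketch.

```python
import re


def space_after_inner_dots(text):
    """Insert a space after each dot that sits between two word characters."""
    # Group 1 is the character before the dot, group 2 the character after it.
    return re.sub(r"(\w)\.(\w)", r"\1. \2", text)


print(space_after_inner_dots("Hello.World foo.bar"))  # Hello. World foo. bar
```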