Splitting an email address in Spark SQL

code:
case when length(neutral)>0 then regexp_extract(neutral, '(.*#)', 0) else '' end as neutral
The above query returns the value including the # symbol. For example, if the input is 1234#gmail.com, the output is 1234#. How can I remove the # symbol using the above query? The resulting output should also be checked for numbers: if it contains any non-numeric characters, it should be rejected.
Sample input: 1234#gmail.com, output: 1234
Sample input: 123adc#gmail.com, output: null

You could phrase the regex as ^[^#]+, which would match all characters in the email address up to, but not including, the # character:
REGEXP_EXTRACT(neutral, '^[^#]+', 0) AS neutral
Note that this approach is also clean and frees us from having to use the bulky CASE expression.
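The question also asks that a prefix containing non-numeric characters be rejected, which ^[^#]+ alone does not do. A minimal sketch of a stricter pattern, written in Python only to illustrate the regex (the helper name numeric_prefix is hypothetical, and the # separator follows the sample data in the question):

```python
import re

# Anchored pattern: one or more digits, immediately followed by '#'.
NUMERIC_PREFIX = re.compile(r"^([0-9]+)#")

def numeric_prefix(value):
    """Return the all-digit part before '#', or None if the prefix is not purely numeric."""
    if not value:
        return None
    match = NUMERIC_PREFIX.match(value)
    return match.group(1) if match else None

print(numeric_prefix("1234#gmail.com"))    # 1234
print(numeric_prefix("123adc#gmail.com"))  # None
```

The same pattern with group index 1 could be tried in REGEXP_EXTRACT. Keep in mind that REGEXP_EXTRACT returns an empty string rather than NULL when nothing matches, so an explicit NULL would still need a CASE or NULLIF around it.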

Try this code:
// The dot before the top-level domain is escaped so that it matches a literal "."
val pattern = """([0-9]+)#([a-zA-Z0-9]+\.[a-z]+)""".r
val correctEmail = "1234#gmail.com"
val wrongEmail = "1234abcd#gmail.com"
def parseEmail(email: String): Option[String] =
  email match {
    case pattern(id, domain) => Some(id)
    case _ => None
  }
println(parseEmail(correctEmail)) // prints Some(1234)
println(parseEmail(wrongEmail)) // prints None
Also, it is more idiomatic in Scala to use Option instead of null.

Related

Split character from string in Idiomatic way Kotlin

Hey, I am working in Kotlin. I have a string that I want to split into a list at a character that I provide. I'll explain in detail.
Example 1
val string = "Birth Control"
val searchText = "n"
Output
["Birth Co", "trol"]
Example 2
val string = "Bladder Infection"
val searchText = "i"
Actual output
["Bladder ", "nfect", "on"]
Expected output
["Bladder ", "nfection"]
I tried the code below. Example 1 works fine, but example 2 does not, because I only want to split at the first occurrence.
val splitList = title?.split(searchText, ignoreCase = true)?.toMutableList()
splitList?.remove(searchText)
Can someone guide me on how to solve this in an idiomatic way? Thanks
You are missing the limit parameter of the split function. If you give it a value of 2, the result list will have at most 2 entries:
val result = "Bladder Infection".split("i", ignoreCase = true, limit = 2)

How to replace string characters that are not in a reference list in kotlin

I have a reference string in which the allowed characters are listed. I also have input strings, in which characters that are not allowed should be replaced with a fixed character, in this example "0".
I can use filter, but it removes the characters altogether and does not offer a replacement. Please note that this is not about being alphanumeric: there are allowed non-alphanumeric characters and disallowed alphanumeric characters, and referenceStr is arbitrary.
var referenceStr = "abcdefg"
var inputStr = "abcqwyzt"
inputStr = inputStr.filter{it in referenceStr}
This yields:
"abc"
But I need:
"abc00000"
I also considered replace, but it seems better suited to cases where you have a complete list of characters that are NOT allowed. My case is the other way around.
Given:
val referenceStr = "abcd][efg"
val replacementChar = '0'
val inputStr = "abcqwyzt[]"
You can do this with a regex [^<referenceStr>], where <referenceStr> should be replaced with referenceStr:
val result = inputStr.replace("[^${Regex.escape(referenceStr)}]".toRegex(), replacementChar.toString())
println(result)
Note that Regex.escape is used to make sure that the characters in referenceStr are all interpreted literally.
Alternatively, use map:
val result = inputStr.map {
    if (it !in referenceStr) replacementChar else it
}.joinToString(separator = "")
In the lambda, decide whether the current char (it) should be mapped to replacementChar or kept as is. map creates a List<Char>, so you need joinToString to turn the result back into a String.
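For comparison, the same per-character idea can be sketched in Python. This is only an illustration of the logic, not part of the Kotlin answer; the function name replace_disallowed is hypothetical:

```python
def replace_disallowed(text, allowed, replacement="0"):
    """Replace every character of text that is not in allowed with replacement."""
    allowed_set = set(allowed)  # set membership keeps each lookup cheap
    return "".join(ch if ch in allowed_set else replacement for ch in text)

print(replace_disallowed("abcqwyzt", "abcdefg"))       # abc00000
print(replace_disallowed("abcqwyzt[]", "abcd][efg"))   # abc00000[]
```

Because this compares characters directly, no regex escaping of the reference string is needed.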

Scala Spark: Parse SQL string to Column

I have two functions, foo and bar, that I want to write as follows:
def foo(df : DataFrame, conditionString : String) = {
  val conditionColumn : Column = something(conditionString) //help me define "something"
  bar(df, conditionColumn)
}
def bar(df : DataFrame, conditionColumn : Column) = {
  df.where(conditionColumn)
}
Here conditionString is a SQL string like "person.age >= 18 AND person.citizen == true" or something similar.
Because reasons, I don't want to change the type signatures here. I feel this should work because if I could change the type signatures, I could just write:
def foobar(df : DataFrame, conditionString : String) = {
  df.where(conditionString)
}
As .where is happy to accept a sql string expression.
So, how can I turn a string representing a column expression into a column? If the expression were just the name of a single column in df I could just do col(colName), but that doesn't seem to take the range of expressions that .where does.
If you need more context for why I'm doing this: I'm working on a Databricks notebook that can only accept string arguments (and needs to take a condition as an argument), and it calls a library that I want to take Column-typed arguments.
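You can use functions.expr:
def expr(expr: String): Column
Parses the expression string into the column that it represents.
So in foo, something(conditionString) becomes expr(conditionString), after importing org.apache.spark.sql.functions.expr.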

Specific regex in Kotlin

I created two functions: checkUnit, which gets the unit from a string, and whatU, which converts the input value if the input contains one of the prefixes m, u, n, p, T, g, or k. But there is some mismatch.
My pattern examples:
"m(O|h|F|s|H|A|V)" - this is for the m before the unit; this part needs improvement
\b0-9.Ohm.(?<![0-9])\b" - this is for Ohm; this part is wrong
val pattern = Regex(whatToFind)
val result = pattern.containsMatchIn(whatToFind)
This is for all invalid characters in input [A-EI-LNP-SUW-Za-gi-jloq-tw-zvV/ '$&+,{}:;=_\[]|`~?##"<>^*()%!-£€¥¢©®™¿÷¦¬×§¶°]
How can I check more effectively, with a regex in Kotlin, that m comes before Ohm and after a number in the string 100mOhm?
You can take a look at the Pattern documentation. There are several predefined character classes, like \d for a digit; \d+ then matches one or more digits. In Java, you can use the following method:
boolean matched = Pattern.matches("\\d+mOhm", "100mOhm");
Or, if you want to reuse the pattern, you can compile it once:
Pattern pattern = Pattern.compile("\\d+mOhm");
Matcher matcher = pattern.matcher("100mOhm");
boolean matched = matcher.matches();
If you want to use only Kotlin, you can use the following code:
val regex = "\\d+mOhm".toRegex()
val matched = regex.matches("100mOhm")
I also found a nice tutorial with more information about Kotlin Regex.
This meets your requirement:
import java.util.Scanner
fun main() {
    val scanner = Scanner(System.`in`)
    val s = scanner.nextLine()
    println("\\d+mOhm".toRegex().matches(s))
}
Perhaps you can modify it from here.
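If the prefix itself is needed and not just a yes/no answer, capture groups can pull the number and the prefix apart. A minimal sketch in Python, used here only to illustrate the pattern. It assumes the input is an integer, an optional one-letter prefix from the list in the question, and then the unit Ohm; the name split_quantity is hypothetical:

```python
import re

# Group 1: the digits. Group 2: an optional prefix letter (may be empty).
QUANTITY = re.compile(r"(\d+)([munpTgk]?)Ohm")

def split_quantity(text):
    """Return (number, prefix) if the whole string looks like '100mOhm', else None."""
    match = QUANTITY.fullmatch(text)
    if match is None:
        return None
    number, prefix = match.groups()
    return int(number), prefix

print(split_quantity("100mOhm"))  # (100, 'm')
print(split_quantity("100Ohm"))   # (100, '')
print(split_quantity("mOhm"))     # None
```

The same pattern string should carry over to a Kotlin Regex, since the character class, the ? quantifier, and capture groups are common to both regex flavours.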

Vectorising .apply function to return 2 columns

How can I vectorise the following code (without changing the functions)?
sms['Digit'] = 0
sms['URL'] = 0
###THE FOR LOOP IS MAKING MY CODE VERY SLOW
for i in range(len(sms)):
    sms['Message'].iloc[i],sms['Digit'].iloc[i] = nm.remove_numbers(sms['Message'].iloc[i])
    sms['Message'].iloc[i],sms['URL'].iloc[i] = nm.trim_urls(sms['Message'].iloc[i])
sms['Message'] = sms['Message'].apply(nm.stem)
sms.head()
where the functions nm.remove_numbers and nm.trim_urls are the following
# removes large numbers from sms text
def remove_numbers(message):
    # identifies number with digits in [4,25]
    number_re = "(?<!\d)\d{4,25}(?!\d)"
    numbers = re.findall(number_re, message)
    for number in numbers:
        message = message.replace(number, '')
    return message, numbers.__len__() > 0
# trims all urls in sms text down to their domain names
def trim_urls(message):
    # identifies if string is url
    url_re = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+' #a better version of this
    urls = re.findall(url_re, message)
    for url in urls:
        trimmed_url = url.split("//")[-1].split("/")[0].split('?')[0].replace('www.', '')
        message = message.replace(url, trimmed_url)
    return message, urls.__len__() > 0
Hence I return a pair of values from the functions and wish to unpack that pair and assign them to sms['Message'] and sms['Digit'] for the first function (both are similar).
I tried unpacking using * but that throws an exception. So does any explicit assignment like
sms['Message'],sms['Digit'] = sms['Message'].apply(nm.remove_numbers)
Is there some way I can get rid of my for loop, and vectorise my code?
Of course if it can't be done, and my only option is to edit my main functions, then just help me with that option.
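One possible way to unpack the pairs without the explicit loop is to transpose them with zip(*...). The sketch below uses a plain list in place of the sms['Message'] column so that it runs without pandas. The sample messages are made up, and remove_numbers is a copy of the function from the question, written with a raw string:

```python
import re

def remove_numbers(message):
    # Same behaviour as the function in the question.
    number_re = r"(?<!\d)\d{4,25}(?!\d)"
    numbers = re.findall(number_re, message)
    for number in numbers:
        message = message.replace(number, "")
    return message, len(numbers) > 0

# Made-up stand-in for the sms['Message'] column.
messages = ["call 123456 now", "hello there"]

# map() yields one (message, flag) pair per row; zip(*...) transposes those
# pairs into one tuple of cleaned messages and one tuple of flags.
cleaned, has_digit = zip(*map(remove_numbers, messages))

print(cleaned)    # ('call  now', 'hello there')
print(has_digit)  # (True, False)
```

With a DataFrame, the same idea is usually written as sms['Message'], sms['Digit'] = zip(*sms['Message'].apply(nm.remove_numbers)), which is stated here as an assumption and not run against the original data. Note that apply still calls the function once per row, so this removes the slow .iloc assignments but is not true vectorisation.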