What's a good way to parse the incoming URL in NiFi? - api

When using HandleHttpRequest, I want to set up a structure to operate on different objects through the same handler:
/api/foo/add/1/2..
How do I easily parse that out into
object = foo
operation = add
arg1 = [1,2,...]
?

Why not use the Expression Language function getDelimitedField?
From the Expression Language documentation:
getDelimitedField
Description: Parses the Subject as a delimited line of text and returns just a single field from that delimited text.
Subject Type: String
Arguments:
index : The index of the field to return. A value of 1 will return the first field, a value of 2 will return the second field, and so on.
delimiter : Optional argument that provides the character to use as a field separator. If not specified, a comma will be used. This value must be exactly 1 character.
quoteChar : Optional argument that provides the character that can be used to quote values so that the delimiter can be used within a single field. If not specified, a double-quote (") will be used. This value must be exactly 1 character.
escapeChar : Optional argument that provides the character that can be used to escape the Quote Character or the Delimiter within a field. If not specified, a backslash (\) is used. This value must be exactly 1 character.
stripChars : Optional argument that specifies whether or not quote characters and escape characters should be stripped. For example, if we have a field value "1, 2, 3" and this value is true, we will get the value 1, 2, 3, but if this value is false, we will get the value "1, 2, 3" with the quotes. The default value is false. This value must be either true or false.
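
To see which index lines up with which path segment, here is a small Python sketch that mimics the 1-based indexing described above. The helper name delimited_field is hypothetical and this is not NiFi's implementation; in particular, treating the empty text before the leading slash as field 1 is an assumption that should be checked against a real flow.

```python
def delimited_field(subject, index, delimiter=","):
    """Return the 1-based field of `subject`, or '' when the index is out of range.

    Illustrative only: it ignores the quoteChar, escapeChar and stripChars options.
    """
    fields = subject.split(delimiter)
    return fields[index - 1] if 1 <= index <= len(fields) else ""


path = "/api/foo/add/1/2"
# Assumption: the text before the leading "/" counts as (empty) field 1,
# so "api" is field 2, the object is field 3 and the operation is field 4.
obj = delimited_field(path, 3, "/")
operation = delimited_field(path, 4, "/")
```

Under that assumption, and assuming the request path is available in a flow file attribute such as http.request.uri, the Expression Language form would be along the lines of ${http.request.uri:getDelimitedField(3, '/')}.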

This code is just an example that you can try in an ExecuteScript processor in NiFi and adapt to your case.
from urlparse import parse_qs, urlparse

def parse(uri2parse):
    # split the URI into its path and its query-string parameters
    o = urlparse(uri2parse)
    d = parse_qs(o.query)
    return (o.path[1:], d['year'][0], d['month'][0], d['day'][0])

# get the flow file from the incoming queue
flowfile = session.get()
if flowfile is not None:
    source_URI = flowfile.getAttribute('source_URI')
    destination_URI = flowfile.getAttribute('destination_URI')
    current_time = flowfile.getAttribute('current_time')
    # expand the URI into smaller pieces
    src_table, src_year, src_month, src_day = parse(source_URI)
    dst_table, dst_year, dst_month, dst_day = parse(destination_URI)
    flowfile = session.putAllAttributes(flowfile, {'src_table': src_table, 'src_year': src_year, 'src_month': src_month, 'src_day': src_day})
    flowfile = session.putAllAttributes(flowfile, {'dst_table': dst_table, 'dst_year': dst_year, 'dst_month': dst_month, 'dst_day': dst_day})
    session.transfer(flowfile, REL_SUCCESS)
else:
    flowfile = session.create()
    session.transfer(flowfile, REL_FAILURE)
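
The script above reads query-string parameters, while the question asks about path segments. A minimal sketch of splitting /api/foo/add/1/2 into object, operation and arguments could look like the following. The function name parse_api_path and the fixed "api" prefix are assumptions made for illustration; inside ExecuteScript the resulting values would still need to be written back with session.putAllAttributes as in the script above.

```python
def parse_api_path(path, prefix="api"):
    """Split '/api/<object>/<operation>/<arg>...' into its parts (hypothetical helper)."""
    # Drop empty segments produced by leading, trailing or doubled slashes.
    segments = [s for s in path.split("/") if s]
    if len(segments) < 3 or segments[0] != prefix:
        raise ValueError("unexpected path: %r" % path)
    return {
        "object": segments[1],
        "operation": segments[2],
        "args": segments[3:],  # remaining segments, kept as strings
    }


parsed = parse_api_path("/api/foo/add/1/2")
```

The arguments stay strings here; converting them to numbers is left to the operation that consumes them.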

Related

Convert String into list of Pairs: Kotlin

Is there an easier approach to convert an IntelliJ IDEA environment variable into a list of tuples?
My environment variable in IntelliJ is
GROCERY_LIST=[("egg", "dairy"),("chicken", "meat"),("apple", "fruit")]
The environment variable is read in the Kotlin file as a String.
val g_list = System.getenv("GROCERY_LIST")
Ideally I'd like to iterate over g_list, with the first element being ("egg", "dairy") and so on,
where ("egg", "dairy") is a tuple/pair.
I have tried to split g_list on commas that are NOT inside quotes, i.e.
val splitted_list = g_list.split(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*\$)".toRegex()).toTypedArray()
This gives me [("egg" as the first element, "dairy") as the second element, and so on.
I also created a data class and tried to map the string into it using jacksonObjectMapper, following this link:
val mapper = jacksonObjectMapper()
val g_list = System.getenv("GROCERY_LIST")
val myList: List<Shopping> = mapper.readValue(g_list)
data class Shopping(val a: String, val b: String)
You can create a regular expression to match all strings in your environment variable.
Regex::findAll()
Then loop through the strings while creating a list of Shopping objects.
// Raw data set.
val groceryList: String = "[(\"egg\", \"dairy\"),(\"chicken\", \"meat\"),(\"apple\", \"fruit\")]"
// Build regular expression.
val regex = Regex("\"([\\s\\S]+?)\"")
val matchResult = regex.findAll(groceryList)
val iterator = matchResult.iterator()
// Create a List of `Shopping` objects.
var first: String = ""
var second: String = ""
val shoppingList = mutableListOf<Shopping>()
var i = 0
while (iterator.hasNext()) {
    val value = iterator.next().value
    if (i % 2 == 0) {
        first = value
    } else {
        second = value
        shoppingList.add(Shopping(first, second))
        first = ""
        second = ""
    }
    i++
}
// Print Shopping List.
for (s in shoppingList) {
    println(s)
}
// Output.
/*
Shopping(a="egg", b="dairy")
Shopping(a="chicken", b="meat")
Shopping(a="apple", b="fruit")
*/
data class Shopping(val a: String, val b: String)
It is rarely a good idea to use a regex to match parentheses.
I would suggest a step-by-step approach:
You could first match the name and the value by
(\w+)=(.*)
There you get the name in group 1 and the value in group 2 without caring about any subsequent = characters that might appear in the value.
If you then want to split the value, I would first get rid of the opening and closing brackets and parentheses by matching with
(?<=\[\().*(?=\)\])
(or simply cut off the first two and last two characters of the string, if it is guaranteed to start with [( and end with )])
Then get the single list entries from splitting by
\),\(
(take care that the split operation also takes a regex, so you have to escape it)
And for each list entry you could split that simply by
,\s*
or, if you want the quote character to be removed, use a match with
\"(.*)\",\s*\"(.*)\"
where group 1 contains the first element of the pair (left of the comma) and group 2 the second element (right of the comma)
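
The steps above can be sketched as follows. This is a Python illustration rather than Kotlin, the function name parse_pairs is hypothetical, and it assumes the input has the NAME=[("a", "b"),...] shape from the question with no quotes or parentheses inside the values. When the value comes from System.getenv, the name is already stripped, so the first step would not be needed.

```python
import re


def parse_pairs(assignment):
    """Parse 'NAME=[("a", "b"),("c", "d")]' into (name, list of pairs)."""
    # Step 1: separate the name from the value at the first '='.
    name_value = re.fullmatch(r"(\w+)=(.*)", assignment)
    if name_value is None:
        raise ValueError("expected NAME=VALUE")
    name, value = name_value.groups()

    # Step 2: cut off the leading '[(' and the trailing ')]'.
    if not (value.startswith("[(") and value.endswith(")]")):
        raise ValueError("value must start with '[(' and end with ')]'")
    inner = value[2:-2]

    # Step 3: split into entries at '),(' and pull the two quoted parts out of each.
    pairs = []
    for entry in re.split(r"\),\(", inner):
        quoted = re.fullmatch(r'"(.*)",\s*"(.*)"', entry)
        if quoted is None:
            raise ValueError("malformed entry: %r" % entry)
        pairs.append((quoted.group(1), quoted.group(2)))
    return name, pairs


name, pairs = parse_pairs(
    'GROCERY_LIST=[("egg", "dairy"),("chicken", "meat"),("apple", "fruit")]'
)
```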

How to replace string characters that are not in a reference list in kotlin

I have a reference string that lists the allowed characters. I also have input strings in which every character that is not allowed should be replaced with a fixed character, in this example "0".
I can use filter, but it removes the characters altogether and does not offer a replacement. Please note that this is not about being alphanumeric: there are allowed non-alphanumeric characters and disallowed alphanumeric characters, since referenceStr is arbitrary.
var referenceStr = "abcdefg"
var inputStr = "abcqwyzt"
inputStr = inputStr.filter{it in referenceStr}
This yields:
"abc"
But I need:
"abc00000"
I also considered replace, but it seems better suited to the case where you have a complete list of the characters that are NOT allowed. My case is the other way around.
Given:
val referenceStr = "abcd][efg"
val replacementChar = '0'
val inputStr = "abcqwyzt[]"
You can do this with a regex [^<referenceStr>], where <referenceStr> should be replaced with referenceStr:
val result = inputStr.replace("[^${Regex.escape(referenceStr)}]".toRegex(), replacementChar.toString())
println(result)
Note that Regex.escape is used to make sure that the characters in referenceStr are all interpreted literally.
Alternatively, use map:
val result = inputStr.map {
    if (it !in referenceStr) replacementChar else it
}.joinToString(separator = "")
In the lambda, decide whether the current char (it) should be replaced by replacementChar or kept as is. map creates a List<Char>, so you need joinToString to turn the result back into a String.
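
For comparison, here is a sketch of the same two ideas in Python rather than Kotlin. The function names are hypothetical; re.escape plays the role that Regex.escape plays above, so characters such as ] in the reference string are taken literally, and a non-empty reference string is assumed.

```python
import re


def replace_with_regex(input_str, reference_str, replacement="0"):
    """Replace every character not in reference_str using a negated character class."""
    pattern = "[^" + re.escape(reference_str) + "]"
    return re.sub(pattern, replacement, input_str)


def replace_with_map(input_str, reference_str, replacement="0"):
    """Same result, decided character by character without a regex."""
    allowed = set(reference_str)
    return "".join(ch if ch in allowed else replacement for ch in input_str)


result_regex = replace_with_regex("abcqwyzt[]", "abcd][efg")
result_map = replace_with_map("abcqwyzt[]", "abcd][efg")
```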

Scala Spark: Parse SQL string to Column

I have two functions, foo and bar, that I want to write as follows:
def foo(df : DataFrame, conditionString : String) = {
  val conditionColumn : Column = something(conditionString) //help me define "something"
  bar(df, conditionColumn)
}
def bar(df : DataFrame, conditionColumn : Column) = {
  df.where(conditionColumn)
}
Where conditionString is a SQL string like "person.age >= 18 AND person.citizen == true" or something similar.
Because reasons, I don't want to change the type signatures here. I feel this should work because if I could change the type signatures, I could just write:
def foobar(df : DataFrame, conditionString : String) = {
  df.where(conditionString)
}
As .where is happy to accept a sql string expression.
So, how can I turn a string representing a column expression into a column? If the expression were just the name of a single column in df I could just do col(colName), but that doesn't seem to take the range of expressions that .where does.
If you need more context for why I'm doing this, I'm working on a databricks notebook that can only accept string arguments (and needs to take a condition as an argument), which calls a library I want to take column-typed arguments.
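You can use functions.expr (import org.apache.spark.sql.functions.expr), so that something(conditionString) in foo becomes expr(conditionString):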
def expr(expr: String): Column
Parses the expression string into the column that it represents

splitting of email-address in spark-sql

code:
case when length(neutral)>0 then regexp_extract(neutral, '(.*#)', 0) else '' end as neutral
The above query returns the output value including the # symbol. For example, if the input is 1234#gmail.com, then the output is 1234#. How can I remove the # symbol using the above query? The resulting output should also be checked for digits: if it contains any non-numeric characters, it should be rejected.
sample input:1234#gmail.com output: 1234
sample input:123adc#gmail.com output: null
You could phrase the regex as ^[^#]+, which would match all characters in the email address up to, but not including, the # character:
REGEXP_EXTRACT(neutral, '^[^#]+', 0) AS neutral
Note that this approach is also clean and frees us from having to use the bulky CASE expression.
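
Note that ^[^#]+ on its own also accepts non-numeric prefixes such as 123adc, so the numeric check from the question still has to be applied. A sketch of folding both requirements into one pattern, written in Python for illustration (the function name numeric_local_part is hypothetical, and None stands in for the SQL null in the sample output):

```python
import re


def numeric_local_part(address, separator="#"):
    """Return the part before the separator if it is all digits, else None."""
    # Digits only, anchored at the start, immediately followed by the separator.
    match = re.match(r"([0-9]+)" + re.escape(separator), address)
    return match.group(1) if match else None


accepted = numeric_local_part("1234#gmail.com")
rejected = numeric_local_part("123adc#gmail.com")
```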
Try this code:
val pattern = """([0-9]+)#([a-zA-Z0-9]+\.[a-z]+)""".r
val correctEmail = "1234#gmail.com"
val wrongEmail = "1234abcd#gmail.com"
def parseEmail(email: String): Option[String] =
  email match {
    case pattern(id, domain) => Some(id)
    case _ => None
  }
println(parseEmail(correctEmail)) // prints Some(1234)
println(parseEmail(wrongEmail)) // prints None
Also, it is more idiomatic to use Option instead of null.

How to remove all HTML characters in Snowflake without hardcoding every HTML special character in the query

I want to remove the following kinds of characters from a string. Please help.
'
&
You may try this one to remove any HTML special characters:
select REGEXP_REPLACE( 'abc&amp;def&sup3;&raquo;ghi', '&[^&]+;', '!' );
Explanation:
REGEXP_REPLACE uses a regular expression to search and replace. I search for "&[^&]+;" and replace it with "!" for demonstration. You can of course use '' to remove the matches instead. More info about the function:
https://docs.snowflake.com/en/sql-reference/functions/regexp_replace.html
About the regular expression string:
& is the first character of an HTML special character (entity)
[^&] means any character except &. This prevents the regex from replacing everything between the first '&' and the last ';'. It stops when it sees a second '&'
+ means match 1 or more of preceding token (any character except &)
; is the last character of a HTML special character
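
The behaviour of this pattern can be checked outside Snowflake with a short Python sketch. The sample string and the names below are assumptions for illustration. The stricter pattern is not part of the answer above; it is shown only because the loose pattern also removes ordinary text that happens to sit between an & and a later ;.

```python
import re

# The pattern from the answer: '&', then one or more non-'&' characters, then ';'.
ENTITY_LOOSE = re.compile(r"&[^&]+;")
# Assumed stricter alternative: only numeric or named entity bodies are accepted.
ENTITY_STRICT = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def strip_entities(text, pattern=ENTITY_LOOSE, replacement=""):
    """Replace every match of the entity pattern in text."""
    return pattern.sub(replacement, text)


sample = "abc&amp;def&sup3;&raquo;ghi"
removed = strip_entities(sample)
marked = strip_entities(sample, replacement="!")
# The loose pattern also eats plain text between '&' and a later ';'.
over_matched = strip_entities("Tom & Jerry; friends")
kept = strip_entities("Tom & Jerry; friends", pattern=ENTITY_STRICT)
```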
CREATE or REPLACE FUNCTION UDF_StripHTML(str varchar)
returns varchar
language javascript
strict
as
'var HTMLParsedText=""
var resultSet = STR.split(''>'')
var resultSetLength = resultSet.length
var counter = 0
while (resultSetLength > 0)
{
    if (resultSet[counter].indexOf(''<'') > 0)
    {
        var value = resultSet[counter]
        value = value.substring(0, resultSet[counter].indexOf(''<''))
        if (resultSet[counter].indexOf(''&'') >= 0 && resultSet[counter].indexOf('';'') >= 0)
        {
            value = value.replace(value.substring(resultSet[counter].indexOf(''&''), resultSet[counter].indexOf('';'')+1), '''')
        }
    }
    if (value)
    {
        value = value.trim();
        if (HTMLParsedText === "")
        {
            HTMLParsedText = value
        }
        else
        {
            if (value) {
                HTMLParsedText = HTMLParsedText + '' '' + value
            }
        }
        value = ''''
    }
    counter = counter + 1
    resultSetLength = resultSetLength - 1
}
return HTMLParsedText';
To call this UDF:
Select UDF_StripHTML(text)