UDF in Spark SQL DSL

UDF in Spark SQL DSL - sql

I am trying to use DSL over pure SQL in Spark SQL jobs but I cannot get my UDF works.
sqlContext.udf.register("subdate",(dateTime: Long)=>dateTime.toString.dropRight(6))
This doesn't work
rdd1.toDF.join(rdd2.toDF).where("subdate(rdd1(date_time)) === subdate(rdd2(dateTime))")
I also would like to add another join condition like in this working pure SQL
val results=sqlContext.sql("select * from rdd1 join rdd2 on rdd1.id=rdd2.idand subdate(rdd1.date_time)=subdate(rdd2.dateTime)")
Thanks for your help

SQL expression you pass to where method is incorrect at least for a few reasons:
=== is a Column method not a valid SQL equality. You should use single equality sign =
bracket notation (table(column)) is not a valid way to reference columns in SQL. In this context it will be recognized as a function call. SQL uses dot notation (table.column)
even if it was neither rdd1 nor rdd2 are valid table aliases
Since it looks like column names are unambiguous you could simply use following code:
df1.join(df2).where("subdate(date_time) = subdate(dateTime)")
If it wasn't the case using dot syntax wouldn't work without providing aliases first. See for example Usage of spark DataFrame "as" method
Moreover registering UDFs makes sense mostly when you use raw SQL all the way. If you want to use DataFrame API it is better to use UDF directly:
import org.apache.spark.sql.functions.udf
val subdate = udf((dateTime: Long) => dateTime.toString.dropRight(6))
val df1 = rdd1.toDF
val df2 = rdd2.toDF
df1.join(df2, subdate($"date_time") === subdate($"dateTime"))
or if column names were ambiguous:
df1.join(df2, subdate(df1("date_time")) === subdate(df2("date_time")))
Finally for simple functions like this it is better to compose built-in expressions than create UDFs.

Related

How to create where statement based on result of multiset

So, i would like to filter my query by exact match in result of multiset. Any ideas how to do it in JOOQ?
Example:
val result = dsl.select(
PLANT_PROTECTION_REGISTRATION.ID,
PLANT_PROTECTION_REGISTRATION.REGISTRATION_NUMBER,
PLANT_PROTECTION_REGISTRATION.PLANT_PROTECTION_ID,
multiset(
select(
PLANT_PROTECTION_APPLICATION.ORGANISM_ID,
PLANT_PROTECTION_APPLICATION.ORGANISM_TEXT
).from(PLANT_PROTECTION_APPLICATION)
.where(PLANT_PROTECTION_APPLICATION.REGISTRATION_ID.eq(PLANT_PROTECTION_REGISTRATION.ID))
).`as`("organisms")
).from(PLANT_PROTECTION_REGISTRATION)
// here i would like to filter my result only for records that their organisms contain specific
// organism id
.where("organisms.organism_id".contains(organismId))

I've explained the following answer more in depth in this blog post
About the MULTISET value constructor
The MULTISET value constructor operator is so powerful, we'd like to use it everywhere :) But the way it works is that it creates a correlated subquery, which produces a nested data structure, which is hard to further process in the same SQL statement. It's not impossible. You could create a derived table and then unnest the MULTISET again, but that would probably be quite unwieldy. I've shown an example using native PostgreSQL in that blog post
Alternative using MULTISET_AGG
If you're not nesting things much more deeply, how about using the lesser known and lesser hyped MULTISET_AGG alternative, instead? In your particular case, you could do:
// Using aliases to make things a bit more readable
val ppa = PLANT_PROTECTION_APPLICATION.as("ppa");
// Also, implicit join helps keep things more simple
val ppr = ppa.plantProtectionRegistration().as("ppr");
dsl.select(
ppr.ID,
ppr.REGISTRATION_NUMBER,
ppr.PLANT_PROTECTION_ID,
multisetAgg(ppa.ORGANISM_ID, ppa.ORGANISM_TEXT).`as`("organisms"))
.from(ppa)
.groupBy(
ppr.ID,
ppr.REGISTRATION_NUMBER,
ppr.PLANT_PROTECTION_ID)
// Retain only those groups which contain the desired ORGANISM_ID
.having(
boolOr(trueCondition())
.filterWhere(ppa.ORGANISM_ID.eq(organismId)))
.fetch()

How do I access dataframe column value within udf via scala

I am attempting to add a column to a dataframe, using a value from a specific column—-let’s assume it’s an id—-to look up its actual value from another df.
So I set up a lookup def
def lookup(id:String): String {
return lookupdf.select(“value”)
.where(s”id = ‘$id’”).as[String].first
}
The lookup def works if I test it on its own by passing an id string, it returns the corresponding value.
But I’m having a hard time finding a way to use it within the “withColumn” function.
dataDf
.withColumn(“lookupVal”, lit(lookup(col(“someId”))))
It properly complains that I’m passing in a column, instead of the expected string, the question is how do I give it the actual value from that column?

You cannot access another dataframe from withColumn . Think of withColumn can only access data at a single record level of the dataDf
Please use a join like
val resultDf = lookupDf.select(“value”,"id")
.join(dataDf, lookupDf("id") == dataDf("id"), "right")

Precedence in multiple DataWeave functions

I'm going through the Mule Dev 1 course and am stumped between module content and what I'm seeing in practice.
The module content states that:
"When using a series of functions, the last function in the chain is executed first."
So
filghts orderBy $.price filter ($.availableSeats > 30)
would "filter then orderBy".
However, I'm seeing that this statement:
payload.flights orderBy $.price filter $.price < 500 groupBy $.destination
actually does NOT execute groupBy first. In fact, placing the groupBy anywhere else throws an error (since the schema of the output after groupBy is changed).
Any thoughts here on why the module states the last function is executed first when that's clearly seems not the case?
Thanks!

The precedence is all the same for (orderBy, groupBy, etc).
So it will first do the orderBy by price then it will filter it by price and last it will groupBy destination.
This is the same for dw 1 (mule 3.x) and dw 2 ( mule 4.x). Now the difference between this to versions of DW is that in DW1 all this used to be lang operators but in DW 2 are just functions that are called using infix notation. So this mean that you can just write the same using the prefix notation
filter(
orderBy(filghts, (value, index) -> value.price),
(value, index) -> value.availableSeats > 30)
Just to show you this is the AST of this expression.

ERROR: function regexp_matches(jsonb, unknown) does not exist in Tableau but works elsewhere

I have a column called "Bakery Activity" whose values are all JSONs that look like this:
{"flavors": [
{"d4js95-1cc5-4asn-asb48-1a781aa83": "chocolate"},
{"dc45n-jnsa9i-83ysg-81d4d7fae": "peanutButter"}],
"degreesToCook": 375,
"ingredients": {
"d4js95-1cc5-4asn-asb48-1a781aa83": [
"1nemw49-b9s88e-4750-bty0-bei8smr1eb",
"98h9nd8-3mo3-baef-2fe682n48d29"]
},
"numOfPiesBaked": 1,
"numberOfSlicesCreated": 6
}
I'm trying to extract the number of pies baked with a regex function in Tableau. Specifically, this one:
REGEXP_EXTRACT([Bakery Activity], '"numOfPiesBaked":"?([^\n,}]*)')
However, when I try to throw this calculated field into my text table, I get an error saying:
ERROR: function regexp_matches(jsonb, unknown) does not exist;
Error while executing the query
Worth noting is that my data source is PostgreSQL, which Tableau regex functions support; not all of my entries have numOfPiesBaked in them; when I run this in a simulator I get the correct extraction (actually, I get "numOfPiesBaked": 1" but removing the field name is a problem for another time).
What might be causing this error?

In short: Wrong data type, wrong function, wrong approach.
REGEXP_EXTRACT is obviously an abstraction layer of your client (Tableau), which is translated to regexp_matches() for Postgres. But that function expects text input. Since there is no assignment cast for jsonb -> text (for good reasons) you have to add an explicit cast to make it work, like:
SELECT regexp_matches("Bakery Activity"::text, '"numOfPiesBaked":"?([^\n,}]*)')
(The second argument can be an untyped string literal, Postgres function type resolution can defer the suitable data type text.)
Modern versions of Postgres also have regexp_match() returning a single row (unlike regexp_matches), which would seem like the better translation.
But regular expressions are the wrong approach to begin with.
Use the simple json/jsonb operator ->>:
SELECT "Bakery Activity"->>'numOfPiesBaked';
Returns '1' in your example.
If you know the value to be a valid integer, you can cast it right away:
SELECT ("Bakery Activity"->>'numOfPiesBaked')::int;

I found an easier way to handle JSONB data in Tableau.
Firstly, make a calculated field from the JSONB field and convert the field to a string by using str([FIELD_name]) command.
Then, on the calculated field, make another calculated field and use function:
REGEXP_EXTRACT([String_Field_Name], '"Key_to_be_extracted":"?([^\n,}]*)')
The required key-value pair will form the second caluculated field.

ifelse & grepl commands when using dplyr for SQL in-db operations

In dplyr running on R data frames, it is easy to run
df <- df %>%
mutate(income_topcoded = ifelse(income > topcode, income, topcode)
I'm now working with a large SQL database, using dplyr to send commands to the SQL server. When I run the same command, I get back
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR:
function ifelse (boolean, numeric, numeric) does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
How would you suggest implementing ifelse() statements? I'd be fine with something in PivotalR (which seems to support ifelse(), but I don't know how to integrate it with dplyr and couldn't find any examples on SO), some piece of SQL syntax which I can use in-line here, or some feature of dplyr which I was unaware of.
(I have the same problem that I'd like to use grepl() as an in-db operation, but I don't know how to do so.)

Based on #hadley's reply on this thread, you can use an SQL-style if() statement inside mutate() on dplyr's in-db dataframes:
df <- df %>%
mutate( income_topcoded = if (income > topcode) income else topcode)
As far as using grepl() goes...well, you can't. But you can use the SQL like operator:
df <- df %>%
filter( topcode %like% "ABC%" )

I had a similar problem. The best I could do was to use an in-db operation as you suggest:
topcode <- 10000
queryString <- sprintf("UPDATE db.table SET income_topcoded = %s WHERE income_topcoded > %s",topcode,topcode)
dbGetQuery(con, queryString)
In my case, I was using MySQL with dplyr, but it wasn't able to translate my ifelse() into valid SQL.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

UDF in Spark SQL DSL - sql

Related

How to create where statement based on result of multiset

How do I access dataframe column value within udf via scala

Precedence in multiple DataWeave functions

ERROR: function regexp_matches(jsonb, unknown) does not exist in Tableau but works elsewhere

ifelse & grepl commands when using dplyr for SQL in-db operations

Categories

Resources