filter spark dataframe using udf

I have a student dataframe:
var student = Seq(("h123","078","Ryan"),("h789","078","John"),("h456","ad0","Mike")).toDF("id","div","name")
Now I want to filter student on the div column based on some logic; for this example, assume only rows with the value 078 should be kept.
For this, I have a udf defined as:
val filterudf = udf((div: String) => div == "078")
Currently, I am using the following approach to get my work done:
val allowedDivs = student.select(col("div")).distinct().filter(filterudf(col("div")))
  .collectAsList().asScala.map(row => row.getAs[String](0)).toList
val resultDF = student.filter(col("div").isInCollection(allowedDivs))
The actual table where I have to apply this filter is huge, and to improve performance I want to use a spark.sql query to benefit from codegen and other Tungsten optimizations.
This is what I have come up with, but this query is not working:
filterudf.registerTemplate("filterudf")
val resultDF = spark.sql("select * from student where div in (filterudf(select distinct div from student).div)")
Any help is appreciated.
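One possible direction, offered as a hedged sketch rather than a confirmed fix: as far as I know, the DataFrame API and spark.sql go through the same Catalyst planner and code generation, so moving to SQL by itself should not change performance, and the UDF stays opaque to the optimizer either way. The sketch below applies the UDF directly as the filter condition, and also shows the SQL form using spark.udf.register and a temp view. The object name FilterWithUdfSketch, the helper names viaApi and viaSql, and the local[*] master are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object FilterWithUdfSketch {
  // Same predicate as in the question.
  val filterudf = udf((div: String) => div == "078")

  // DataFrame API: apply the UDF directly as the filter condition,
  // without collecting the distinct values to the driver first.
  def viaApi(student: DataFrame): DataFrame =
    student.filter(filterudf(col("div")))

  // SQL: register the UDF and a temp view, then call the UDF in WHERE.
  def viaSql(spark: SparkSession, student: DataFrame): DataFrame = {
    spark.udf.register("filterudf", filterudf)
    student.createOrReplaceTempView("student")
    spark.sql("select * from student where filterudf(div)")
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch on one machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val student = Seq(("h123", "078", "Ryan"), ("h789", "078", "John"), ("h456", "ad0", "Mike"))
      .toDF("id", "div", "name")

    viaApi(student).show()
    viaSql(spark, student).show()
    spark.stop()
  }
}
```

If the real logic can be written with built-in column expressions (for this example, col("div") === "078"), that form is usually preferable, since the optimizer can see into it.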

Related

How to modify value in column typeorm

I have 2 tables contractPoint and contractPointHistory
ContractPointHistory
ContractPoint
I would like to get each contractPoint with its point adjusted by the matching pointChange values. For example: ContractPoint -> id: 3, point: 5
ContractPointHistory has contractPointId: 3 and pointChange: -5, so after the adjustment point in contractPoint should be 0.
I wrote this code, but it works only with getRawMany(), not with getMany():
const contractPoints = await getRepository(ContractPoint).createQueryBuilder('contractPoint')
  .addSelect('"contractPoint".point + COALESCE((SELECT SUM(cpHistory.point_change) FROM contract_point_history AS cpHistory WHERE cpHistory.contract_point_id = contractPoint.id), 0) AS points')
  .andWhere('EXTRACT(YEAR FROM contractPoint.validFrom) = :year', { year })
  .andWhere('contractPoint.contractId = :contractId', { contractId })
  .orderBy('contractPoint.grantedAt', OrderByDirection.Desc)
  .getMany();

The method getMany can be used to select all attributes of an entity. However, if one wants to select some specific attributes of an entity then one needs to use getRawMany.
As per the documentation -
There are two types of results you can get using select query builder:
entities or raw results. Most of the time, you need to select real
entities from your database, for example, users. For this purpose, you
use getOne and getMany. But sometimes you need to select some specific
data, let's say the sum of all user photos. This data is not an
entity, it's called raw data. To get raw data, you use getRawOne and
getRawMany.
From this, we can conclude that the query you want to generate cannot be produced with the getMany method, because the computed points value is raw data rather than an entity attribute.

ecto select fields with map function as alias or different name

iex(18)> fields = [:id]
[:id]
iex(19)> fields_map = %{id: :job_id}
%{id: :job_id}
iex(20)> query = from p in Qber.V1.JobModel, where: p.id == 1, select: map(p, ^fields)
#Ecto.Query<from j in Qber.V1.JobModel, where: j.id == 1, select: map(j, [:id])>
I want to select my fields dynamically, with AS names, based on the params I receive (which may involve some joins, as this app is going to have heavy admin dashboard queries), so I need a dynamic solution. Is there a way to pass a map instead of a list to the map function, so that Ecto returns the selected result with the custom names I passed in the map? Instead of
[%{id: 1}]
I want the result to be
[%{job_id: 1}]
Note: I have searched the web with different keywords and gone through almost all the docs many times, but have been unable to find a solution.

Picking max item by column from group by in Slick 3.x

I'm trying to write a Slick query to find the "max" element within a group and then continue querying based on that result, however I'm getting a massive error when I try what I thought was the obvious way:
val articlesByUniqueLink = for {
  (link, groupedArticles) <- historicArticles.groupBy(_.link)
  latestPerLink <- groupedArticles.sortBy(_.pubDate.desc).take(1)
} yield latestPerLink
Since this doesn't seem to work, I'm wondering if there's some other way to find the "latest" element out of "groupedArticles" above, assuming these come from an Articles table with a pubDate Timestamp and a link that can be duplicated. I'm effectively looking for HAVING articles.pub_date = max(articles.pub_date).
The other equivalent way to express it yields the same result:
val articlesByUniqueLink = for {
  (link, groupedArticles) <- historicArticles.groupBy(_.link)
  latestPerLink <- groupedArticles.filter(_.pubDate === groupedArticles.map(_.pubDate).max.get)
} yield latestPerLink
Both versions fail with: [SlickTreeException: Unreachable reference to s2 after resolving monadic joins] followed by 50 lines of Slick node trees.

The best way I have found to get the max, min, etc. per group in Slick is a self-join on the grouping result:
val articlesByUniqueLink = for {
  (article, _) <- historicArticles join historicArticles.groupBy(_.link)
    .map({case (link, group) => (link, group.map(_.pubDate).max)})
    on ((article, tuple) => article.link === tuple._1 &&
      article.pubDate === tuple._2)
} yield article
If the on condition can produce duplicates (for example, two articles with the same link and the same pubDate), drop the duplicates afterwards.
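To make the logic of that self-join easier to follow, here is a minimal in-memory sketch using plain Scala collections instead of Slick. It only illustrates the two steps (compute the max pubDate per link, then keep the articles matching a (link, max) pair); it is not the Slick query itself. The Article fields, the Long timestamps, and the sample rows are all assumptions made for illustration.

```scala
// Hypothetical row type; pubDate is a Long here instead of a Timestamp.
case class Article(link: String, pubDate: Long, title: String)

object LatestPerLinkSketch {
  // Made-up sample data: one link appears twice.
  val historicArticles: Seq[Article] = Seq(
    Article("a.example/1", 100L, "first"),
    Article("a.example/1", 200L, "first, updated"),
    Article("a.example/2", 150L, "second")
  )

  // Step 1: the grouping result, one (link -> max pubDate) entry per link.
  val maxPerLink: Map[String, Long] =
    historicArticles.groupBy(_.link).map { case (link, group) =>
      link -> group.map(_.pubDate).max
    }

  // Step 2: the "join", keeping articles whose (link, pubDate) matches an entry.
  val latestPerLink: Seq[Article] =
    historicArticles.filter(a => maxPerLink.get(a.link).contains(a.pubDate))

  def main(args: Array[String]): Unit =
    latestPerLink.foreach(println)
}
```

Note that if two articles share both link and the maximum pubDate, step 2 keeps both, which mirrors the duplicate case mentioned above.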

Union between optional and non-optional tables

I have two queries that select records where a union needs to be taken, one of which is a left join and one of which is a regular (i.e. inner) join.
Here's the left join case:
def regularAccountRecords = for {
  (customer, account) <- customers joinLeft accounts on (_.accountId === _.accountId) // + some other special conditions
} yield (customer, account)
Here's the regular join case:
def specialAccountRecords = for {
  (customer, account) <- customers join accounts on (_.accountId === _.accountId) // + some other special conditions
} yield (customer, account)
Now I want to take a union of the two record sets:
regularAccountRecords ++ specialAccountRecords
Obviously this doesn't work: the regular join case returns Query[(Customer, Account),...] while the left join case returns Query[(Customer, Rep[Option[Account]]),...], which results in a Type Mismatch error.
Now, if this were a regular column type (e.g. Rep[String]), I could convert it to an optional via the ? operator (i.e. record.?) and get Rep[Option[String]], but using it on a table (i.e. the accounts table) causes:
Error:(62, 85) value ? is not a member of com.test.Account
How do I work around this issue and do the union properly?

Okay, looks like this is what the '?' projection is for but I didn't realize it because I disabled the optionEnabled option in the Codegen. Here's what your codegen extension is supposed to look like:
class MyCodegen extends SourceCodeGenerator(inputModel) {
  override def TableClass = new TableClassDef {
    override def optionEnabled = true
  }
}
Alternatively, you can use implicit classes to tack this thing onto the generated TableClass yourself. Here is how that would look:
implicit class AccountExtensions(account:Account) {
  def ? = (Rep.Some(account.id), account.name).shaped.<>(
    {r=>r._1.map(_=> Account.tupled((r._2, r._1.get)))},
    (_:Any) => throw new Exception("Inserting into ? projection not supported."))
}
NOTE: be sure to check the field ordering, depending on how this
projection is done, the union query might put the ID field in the wrong
place in the output, use
println(query.result.statements.headOption) to debug the output
SQL to be sure.
Once you do that, you will be able to use account.? in the yield statement:
def specialAccountRecords = for {
  (customer, account) <- customers join accounts on (_.accountId === _.accountId)
} yield (customer, account.?)
...and then you will be able to take the union of the two queries correctly:
regularAccountRecords ++ specialAccountRecords
I really wish the Slick people would put a note on how the '?' projection is useful in the documentation beyond the vague statement 'useful for outer joins'.

Can I get a relational intersection or difference using Slick 2?

I'm using Slick version 2.0.0-M3. If I have two Query objects representing relations of the same type, I see there is a union operator to inclusively disjoin them, but I don't see a comparable operator for obtaining their intersection or their difference. Do such operators not exist in Slick?
I think the foregoing explains what I'm looking for, but if not, here's an example. I have the suppliers table:
case class Supplier(snum: String, sname: String, status: Int, city: String)
class Suppliers(tag: Tag) extends Table[Supplier](tag, "suppliers") {
  def snum = column[String]("snum")
  def sname = column[String]("sname")
  def status = column[Int]("status")
  def city = column[String]("city")
  def * = (snum, sname, status, city) <> (Supplier.tupled, Supplier.unapply _)
}
val suppliers = TableQuery[Suppliers]
If I want to know about suppliers that either are in a particular city or have a particular status, I see how to use Query.union for that:
scala> val thirtySuppliers = suppliers.filter(_.status === 30)
thirtySuppliers: scala.slick.lifted.Query[Suppliers,Suppliers#TableElementType] = scala.slick.lifted.WrappingQuery#166f63a
scala> val londonSuppliers = suppliers.filter(_.city === "London")
londonSuppliers: scala.slick.lifted.Query[Suppliers,Suppliers#TableElementType] = scala.slick.lifted.WrappingQuery#1bea855
scala> (thirtySuppliers union londonSuppliers).foreach(println)
Supplier(S1,Smith,20,London)
Supplier(S4,Clark,20,London)
Supplier(S3,Blake,30,Paris)
Supplier(S5,Adams,30,Athens)
No problem. But what if I want only the suppliers that are both in a particular city and have a particular status? Seems as if I ought to be able to do something like:
(thirtySuppliers intersect londonSuppliers).foreach(println)
Or if I want the suppliers in a particular city except the ones that have a particular status. Can I not do something like:
(thirtySuppliers except londonSuppliers).foreach(println)
SQL has UNION, INTERSECT, and EXCEPT operations, and Slick's Query class has a union method that builds an SQL query using SQL's UNION, but I'm not seeing Query methods in Slick for deriving intersections nor differences. Am I missing them?

There is a pull request that implements this. It will likely make it into 2.0 or 2.1. https://github.com/slick/slick/pull/242 We still need to figure out some details and clean up a bit.

When both queries are filters over the same table, the operations are pretty much composable: an intersect can just be two chained filters. For instance:
val intersect = suppliers.filter(_.status === 30).filter(_.city === "London")
or except (here, the London suppliers except those with status 30):
val except = suppliers.filter(_.city === "London").filterNot(_.status === 30)
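The equivalence can be checked with a small in-memory sketch that uses plain Scala sets instead of Slick queries. The four rows are the ones printed by the union in the question; the object name SetOpsSketch and the value names are assumptions for illustration. It only holds because both operands are filters over the same base set.

```scala
case class Supplier(snum: String, sname: String, status: Int, city: String)

object SetOpsSketch {
  // The four rows printed by the union in the question.
  val rows: Set[Supplier] = Set(
    Supplier("S1", "Smith", 20, "London"),
    Supplier("S4", "Clark", 20, "London"),
    Supplier("S3", "Blake", 30, "Paris"),
    Supplier("S5", "Adams", 30, "Athens")
  )

  val thirty: Set[Supplier] = rows.filter(_.status == 30)
  val london: Set[Supplier] = rows.filter(_.city == "London")

  // Intersection written as two chained filters on the base set.
  val bothFilters: Set[Supplier] =
    rows.filter(_.status == 30).filter(_.city == "London")

  // Difference (London minus status 30) written as filter plus filterNot.
  val londonNotThirty: Set[Supplier] =
    rows.filter(_.city == "London").filterNot(_.status == 30)

  def main(args: Array[String]): Unit = {
    // Each chained-filter form matches the corresponding set operation.
    assert((thirty intersect london) == bothFilters)
    assert((london diff thirty) == londonNotThirty)
    println(londonNotThirty.map(_.snum).toList.sorted)
  }
}
```

For queries that do not share a base table and predicates, chained filters are not a substitute, and a real INTERSECT or EXCEPT (or a join / not-exists formulation) would be needed.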
