Can I get a relational intersection or difference using Slick 2? - sql

I'm using Slick versian 2.0.0-M3. If I have two Querys representing relations of the same type, I see there is a union operator to inclusively disjoin them, but I don't see a comparable operator for obtaining their intersection nor their difference. Do such operators not exist in Slick?
I think the foregoing explains what I'm looking for, but if not, here's an example. I have the suppliers table:
case class Supplier(snum: String, sname: String, status: Int, city: String)
class Suppliers(tag: Tag) extends Table[Supplier](tag, "suppliers") {
def snum = column[String]("snum")
def sname = column[String]("sname")
def status = column[Int]("status")
def city = column[String]("city")
def * = (snum, sname, status, city) <> (Supplier.tupled, Supplier.unapply _)
}
val suppliers = TableQuery[Suppliers]
If I want to know about suppliers that either are in a particular city or have a particular status, I see how to use Query.union for that:
scala> val thirtySuppliers = suppliers.filter(_.status === 30)
thirtySuppliers: scala.slick.lifted.Query[Suppliers,Suppliers#TableElementType] = scala.slick.lifted.WrappingQuery#166f63a
scala> val londonSuppliers = suppliers.filter(_.city === "London")
londonSuppliers: scala.slick.lifted.Query[Suppliers,Suppliers#TableElementType] = scala.slick.lifted.WrappingQuery#1bea855
scala> (thirtySuppliers union londonSuppliers).foreach(println)
Supplier(S1,Smith,20,London)
Supplier(S4,Clark,20,London)
Supplier(S3,Blake,30,Paris)
Supplier(S5,Adams,30,Athens)
No problem. But what if I want only the suppliers that are both in a particular city and have a particular status? Seems as if I ought to be able to do something like:
(thirtySuppliers intersect londonSuppliers).foreach(println)
Or if I want the suppliers in a particular city except the ones that have a particular status. Can I not do something like:
(thirtySuppliers except londonSuppliers).foreach(println)
SQL has UNION, INTERSECT, and EXCEPT operations, and Slick's Query class has a union method that builds an SQL query using SQL's UNION, but I'm not seeing Query methods in Slick for deriving intersections nor differences. Am I missing them?

There is a pull request that implements this. It will likely make it into 2.0 or 2.1. https://github.com/slick/slick/pull/242 We still need to figure out some details and clean up a bit.

The operations are pretty much composable in that an intersect can just be two filters. For instance
val intersect = suppliers.filter(_.status === 30).filter(_.city === "London")
or except:
val except= suppliers.filter(_.city === "London").filterNot(_.status === 30)

Related

filter spark dataframe using udf

I have a student dataframe:
var student = Seq(("h123","078","Ryan"),("h789","078","John"),("h456","ad0","Mike")).toDF("id","div","name")
now I want to filter student on div column based on some logic, for this example assume only 078 value should be present.
For this, I have a udf defined as:
val filterudf = udf((div: String) => div == "078")
currently, I am using following approach to get my work done
val allowedDivs = student.select(col("div")).distinct().filter(filterudf(col("div")))
.collectAsList().asScala.map(row => row.getAs[String](0)).toList
val resultDF = student.filter(col("div").isInCollection(allowedDivs))
The actual table where I have to apply this filter is huge and in order to improve the performance I want to use spark.sql query to get benefit from codgen and other Tungsten optimizations.
This is want I have come to, but this query is not working
filterudf.registerTemplate("filterudf")
val resultDF = spark.sql("select * from student where div in (filterudf(select distinct div from student).div)")
Any help is appreciated.

Selecting related model: Left join, prefetch_related or select_related?

Considering I have the following relationships:
class House(Model):
name = ...
class User(Model):
"""The standard auth model"""
pass
class Alert(Model):
user = ForeignKey(User)
house = ForeignKey(House)
somevalue = IntegerField()
Meta:
unique_together = (('user', 'property'),)
In one query, I would like to get the list of houses, and whether the current user has any alert for any of them.
In SQL I would do it like this:
SELECT *
FROM house h
LEFT JOIN alert a
ON h.id = a.house_id
WHERE a.user_id = ?
OR a.user_id IS NULL
And I've found that I could use prefetch_related to achieve something like this:
p = Prefetch('alert_set', queryset=Alert.objects.filter(user=self.request.user), to_attr='user_alert')
houses = House.objects.order_by('name').prefetch_related(p)
The above example works, but houses.user_alert is a list, not an Alert object. I only have one alert per user per house, so what is the best way for me to get this information?
select_related didn't seem to work. Oh, and surely I know I can manage this in multiple queries, but I'd really want to have it done in one, and the 'Django way'.
Thanks in advance!
The solution is clearer if you start with the multiple query approach, and then try to optimise it. To get the user_alerts for every house, you could do the following:
houses = House.objects.order_by('name')
for house in houses:
user_alerts = house.alert_set.filter(user=self.request.user)
The user_alerts queryset will cause an extra query for every house in the queryset. You can avoid this with prefetch_related.
alerts_queryset = Alert.objects.filter(user=self.request.user)
houses = House.objects.order_by('name').prefetch_related(
Prefetch('alert_set', queryset=alerts_queryset, to_attrs='user_alerts'),
)
for house in houses:
user_alerts = house.user_alerts
This will take two queries, one for houses and one for the alerts. I don't think you require select related here to fetch the user, since you already have access to the user with self.request.user. If you want you could add select_related to the alerts_queryset:
alerts_queryset = Alert.objects.filter(user=self.request.user).select_related('user')
In your case, user_alerts will be an empty list or a list with one item, because of your unique_together constraint. If you can't handle the list, you could loop through the queryset once, and set house.user_alert:
for house in houses:
house.user_alert = house.user_alerts[0] if house.user_alerts else None

Union between optional and non-optional tables

I have two queries that select records where a union needs to be taken, one of which is a left join and one of which is a regular (i.e. inner) join.
Here's the left join case:
def regularAccountRecords = for {
(customer, account) <- customers joinLeft accounts on (_.accountId === _.accountId) // + some other special conditions
} yield (customer, account)
Here's the regular join case:
def specialAccountRecords = for {
(customer, account) <- customers join accounts on (_.accountId === _.accountId) // + some other special conditions
} yield (customer, account)
Now I want to take a union of the two record sets:
regularAccountRecords ++ specialAccountRecords
Obviously this doesn't work because in the regular join case it returns Query[(Customer, Account),...] and in the left join case it returns Query[(Customer, Rep[Option[Account]]),...] and this results in a Type Mismatch error.
Now, If this were a regular column type (e.g. Rep[String]) I could convert it to an optional via the ? operator (i.e. record.?) and get Rep[Option[String]] but using it on a table (i.e. the accounts table) causes:
Error:(62, 85) value ? is not a member of com.test.Account
How do I work around this issue and do the union properly?
Okay, looks like this is what the '?' projection is for but I didn't realize it because I disabled the optionEnabled option in the Codegen. Here's what your codegen extension is supposed to look like:
class MyCodegen extends SourceCodeGenerator(inputModel) {
override def TableClass = new TableClassDef {
override def optionEnabled = true
}
}
Alternatively, you can use implicit classes to tack this thing onto the generated TableClass yourself. Here is how that would look:
implicit class AccountExtensions(account:Account) {
def ? = (Rep.Some(account.id), account.name).shaped.<>({r=>r._1.map(_=> Account.tupled((r._2, r._1.get)))}, (_:Any) => throw new Exception("Inserting into ? projection not supported."))
}
NOTE: be sure to check the field ordering, depending on how this
projection is done, the union query might put the ID field in the wrong
place in the output, use
println(query.result.statements.headOption) to debug the output
SQL to be sure.
Once you do that, you will be able to use account.? in the yield statement:
def specialAccountRecords = for {
(customer, account) <- customers join accounts on (_.accountId === _.accountId)
} yield (customer, account.?)
...and then you will be able to unionize the tables correctly
regularAccountRecords ++ specialAccountRecords
I really wish the Slick people would put a note on how the '?' projection is useful in the documentation beyond the vague statement 'useful for outer joins'.

Can I do a NATURAL JOIN is Slick v2?

The title is self-explanatory. Using 2.0.0-M3, I'd like to avoid unnecessary verbosity is the form of explicitly naming the columns to be joined on, since they are appropriately named, and since NATURAL JOIN is part of the SQL standard. Not to mention, Wikipedia itself even says that "The natural join is arguably one of the most important operators since it is the relational counterpart of logical AND."
I think the foregoing ought to be clear enough, but just if not, read on. Suppose I want to know the supplier-name and part-number of each part. Assuming appropriate case classes not shown:
class Suppliers(tag: Tag) extends Table[Supplier](tag, "suppliers") {
def snum = column[String]("snum")
def sname = column[String]("sname")
def * = (snum, sname) <> (Supplier.tupled, Supplier.unapply _)
}
class Shipments(tag: Tag) extends Table[Shipment](tag, "shipments") {
def snum = column[String]("snum")
def pnum = column[String]("pnum")
def * = (snum, pnum) <> (Shipment.tupled, Shipment.unapply _)
}
val suppliers = TableQuery[Suppliers]
val shipments = TableQuery[Shipments]
Given that both tables have the snum column I want to join on, seems as if
( suppliers join shipments ).run
ought to return a Vector with my desired data, but I get a failed attempt at an INNER JOIN, failing (at run-time) since it's missing any join condition.
I know I can do
suppliers.flatMap( s => shipments filter (sp => sp.snum === s.snum) map (sp => (s.sname, sp.pnum)) )
but, even without the names of all the columns I omitted for clarity of this question, it's still quite a lot more typing (and proofreading) than simply
suppliers join shipments
or, for that matter
SELECT * FROM suppliers NATURAL JOIN shipments;
If the Scala code is messier than the SQL code, then I really start questioning things. Is there no way simply to do a natural join in Slick?
Currently not supported by Slick. Please submit a ticket or pull request.
To improve readability of query code, you can put your join conditions into re-usable values. Or you can put the whole join in a function or method extension of Query[Suppliers,Supplier].
Alternatively you could look at the AutoJoin pattern (which basically makes your join conditions implicit) described here http://slick.typesafe.com/docs/#20130612_slick_vs_orm_scaladays_2013 and implemented here https://github.com/cvogt/play-slick/blob/scaladays2013/samples/computer-database/app/util/autojoin.scala

Django annotate code

Just stumbled upon some guy code
He have models like this
class Country(models.Model):
name = models.CharField(max_length=100)
class TourDate(models.Model):
artist = models.ForeignKey("Artist")
date = models.DateField()
country = models.ForeignKey("Country")
And is querying like this
ireland = Country.objects.get(name="Ireland")
artists = Artist.objects.all().extra(select = {
"tourdate_count" : """
SELECT COUNT(*)
FROM sandbox_tourdate
JOIN sandbox_country on sandbox_tourdate.country_id = sandbox_country.id
WHERE sandbox_tourdate.artist_id = sandbox_artist.id
AND sandbox_tourdate.country_id = %d """ % ireland.pk,
}).order_by("-tourdate_count",)
My question is why He have underscores like sandbox_tourdate but it isn't in model field
Is that created automatically like some sort of pseudo-field?
sandbox_tourdate isn't the name of the field, it's the name of the table. Django's naming convention is to use appname_modelname as the table name, although this can be overridden. In this case, I guess the app is called 'sandbox'.
I don't really know why that person has used a raw query though, that is quite easily expressed in Django's ORM syntax.