How can I sort elements of a TypedPipe in Scalding?

I have not been able to find a way to sort elements of a TypedPipe in Scalding (when not performing a group operation). Here are the relevant parts of my program (replacing irrelevant parts with ellipses):
case class ReduceOutput(val slug : String, score : Int, json1 : String, json2 : String)

val pipe1 : TypedPipe[(String, ReduceFeatures)] = ...
val pipe2 : TypedPipe[(String, ReduceFeatures)] = ...

pipe1.join(pipe2).map { entry =>
  val (slug : String, (features1 : ReduceFeatures, features2 : ReduceFeatures)) = entry
  new ReduceOutput(
    slug,
    computeScore(features1, features2),
    features1.json,
    features2.json)
}
.write(TypedTsv[ReduceOutput](args("output")))
Is there a way to sort the elements on their score after the map but before the write?
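One approach I believe works with Scalding's typed API (a sketch, not something from the original post) is to funnel everything into a single group with groupAll, sort within that group, and take the values back out. Note that everything then flows through one reducer, so this only makes sense when the output is reasonably small:
pipe1.join(pipe2)
  .map { case (slug, (features1, features2)) =>
    ReduceOutput(slug, computeScore(features1, features2), features1.json, features2.json)
  }
  .groupAll          // Grouped[Unit, ReduceOutput]: a single group, hence a single reducer
  .sortBy(_.score)   // ascending by score; negate the key or reverse the Ordering for descending
  .values            // back to TypedPipe[ReduceOutput]
  .write(TypedTsv[ReduceOutput](args("output")))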

Related

Scala MatchError while joining a dataframe and a dataset

I have one dataframe and one dataset:
Dataframe 1:
+------------------------------+-----------+
|City_Name                     |Level      |
+------------------------------+-----------+
|{City -> Paris}               |86         |
+------------------------------+-----------+
Dataset 2:
+-----------------------------------+-----------+
|Country_Details                    |Temperature|
+-----------------------------------+-----------+
|{City -> Paris, Country -> France} |31         |
+-----------------------------------+-----------+
I am trying to join them by checking whether the map in the "City_Name" column is contained in the map in the "Country_Details" column.
I am using the following UDF to check the condition:
val mapEqual = udf((col1: Map[String, String], col2: Map[String, String]) => {
  if (col2.nonEmpty) {
    col2.toSet subsetOf col1.toSet
  } else {
    true
  }
})
And I am making the join this way:
dataset2.join(dataframe1 , mapEqual(dataset2("Country_Details"), dataframe1("City_Name"), "leftanti")
However, I get the following error:
terminated with error scala.MatchError: UDF(Country_Details#528) AS City_Name#552 (of class org.apache.spark.sql.catalyst.expressions.Alias)
Has anyone run into the same error before?
I am using Spark version 3.0.2 and SQLContext, with Scala.
There are two issues here. The first is that when you call your function you pass one extra parameter, leftanti (you meant to pass it to the join function, but you passed it to the udf instead).
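For reference, a minimal sketch of the call with "leftanti" moved to the join itself (keeping your original mapEqual udf):
dataset2.join(dataframe1, mapEqual(dataset2("Country_Details"), dataframe1("City_Name")), "leftanti")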
The second is that the udf logic won't work as expected; I suggest you use this:
val mapContains = udf { (col1: Map[String, String], col2: Map[String, String]) =>
  col2.keys.forall { key =>
    col1.get(key).exists(_ eq col2(key))
  }
}
Result:
scala> ds.join(df1 , mapContains(ds("Country_Details"), df1("City_Name")), "leftanti").show(false)
+----------------------------------+-----------+
|Country_Details                   |Temperature|
+----------------------------------+-----------+
|{City -> Paris, Country -> France}|31         |
+----------------------------------+-----------+

Join dataset with case class spark scala

I am converting a dataframe into a dataset using a case class that has a sequence of another case class:
case class IdMonitor(id: String, ipLocation: Seq[IpLocation])

case class IpLocation(
  ip: String,
  ipVersion: Byte,
  ipType: String,
  city: String,
  state: String,
  country: String)
Now I have another dataset of strings that contains just IPs. My requirement is to get all records from IpLocation where ipType == "home" or where the IP dataset contains the given IP from ipLocation. I am trying to use a Bloom filter built on the IP dataset to search through it, but it is inefficient and not working well in general. I want to join the IP dataset with IpLocation, but I'm having trouble since ipLocation is a Seq. I'm very new to Spark and Scala, so I'm probably missing something. Right now my code looks like this:
def buildBloomFilter(Ips: Dataset[String]): BloomFilter[String] = {
  val count = Ips.count
  val bloomFilter = Ips.rdd
    .mapPartitions { iter =>
      // build one Bloom filter per partition
      val b = BloomFilter.optimallySized[String](count, FP_PROBABILITY)
      iter.foreach(i => b += i)
      Iterator(b)
    }
    .treeReduce(_ | _) // merge the per-partition filters
  bloomFilter
}
val ipBf = buildBloomFilter(Ips)
val ipBfBroadcast = spark.sparkContext.broadcast(ipBf)

idMonitor.map { x =>
  x.ipLocation.filter(
    x => x.ipType == "home" && ipBfBroadcast.value.contains(x.ip)
  )
}
I just want to figure out how to join IpLocation and Ips
Sample:
Starting from your case classes,
case class IpLocation(
  ip: String,
  ipVersion: Byte,
  ipType: String,
  city: String,
  state: String,
  country: String
)

case class IdMonitor(id: String, ipLocation: Seq[IpLocation])
I have defined the sample data as follows:
val ip_locations1 = Seq(IpLocation("123.123.123.123", 12.toByte, "home", "test", "test", "test"), IpLocation("123.123.123.124", 12.toByte, "otherwise", "test", "test", "test"))
val ip_locations2 = Seq(IpLocation("123.123.123.125", 13.toByte, "company", "test", "test", "test"), IpLocation("123.123.123.124", 13.toByte, "otherwise", "test", "test", "test"))
val id_monitor = Seq(IdMonitor("1", ip_locations1), IdMonitor("2", ip_locations2))
val df = id_monitor.toDF()
df.show(false)
+---+------------------------------------------------------------------------------------------------------+
|id |ipLocation |
+---+------------------------------------------------------------------------------------------------------+
|1 |[{123.123.123.123, 12, home, test, test, test}, {123.123.123.124, 12, otherwise, test, test, test}] |
|2 |[{123.123.123.125, 13, company, test, test, test}, {123.123.123.124, 13, otherwise, test, test, test}]|
+---+------------------------------------------------------------------------------------------------------+
and the IPs:
val ips = Seq("123.123.123.125")
val df_ips = ips.toDF("ips")
df_ips.show()
+---------------+
| ips|
+---------------+
|123.123.123.125|
+---------------+
Join:
From the example data above, explode the ipLocation array of IdMonitor and join with the IPs.
df.withColumn("ipLocation", explode('ipLocation)).alias("a")
  .join(df_ips.alias("b"),
    col("a.ipLocation.ipType") === lit("home") || col("a.ipLocation.ip") === col("b.ips"),
    "inner")
  .select("ipLocation.*")
  .as[IpLocation].collect()
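This snippet assumes the usual Spark imports are in scope, for example:
import org.apache.spark.sql.functions.{col, explode, lit}
import spark.implicits._ // for the 'ipLocation symbol syntax, toDF and .as[IpLocation]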
Finally, the collected result is given as follows:
res32: Array[IpLocation] = Array(IpLocation(123.123.123.123,12,home,test,test,test), IpLocation(123.123.123.125,13,company,test,test,test))
You can explode the array sequence in your IdMonitor objects using the explode function, then use a left outer join to match IPs present in your Ips dataset, then filter on ipType == "home" or the IP being present in the Ips dataset, and finally rebuild your IpLocation sequence by grouping by id and using collect_list.
The complete code is as follows:
import org.apache.spark.sql.functions.{col, collect_list, explode}

val result = idMonitor.select(col("id"), explode(col("ipLocation")))
  .join(Ips, col("col.ip") === col("value"), "left_outer")
  .filter(col("col.ipType") === "home" || col("value").isNotNull)
  .groupBy("id")
  .agg(collect_list("col").as("value"))
  .drop("id")
  .as[Seq[IpLocation]]

micronaut-data and composite key mapping

I have an entity with a composite key
@Entity
data class Page(
    @EmbeddedId
    val pageId: PageId,
    ...
)

@Embeddable
data class PageId(
    @Column(name = "id")
    val id: UUID,
    @Column(name = "is_published")
    val isPublished: Boolean
)
I need to respect the existing column names in the db table, which are 'id' and 'is_published'.
But querying the db with a JDBCRepository I get the error:
SQL Error executing Query: ERROR: column page_.page_id_published does not exist
Is there any way that I can map the columns correctly?
Trial and error led me to the answer: somehow Micronaut does not like a Boolean to be named 'isPublished'. When I rename it to 'published', it works fine:
data class PageId(
    @MappedProperty(value = "id")
    val id: UUID,
    @MappedProperty(value = "is_published")
    val published: Boolean)

How to correctly build an object graph based on a multi-level join in Slick?

I have a model structure as follows:
Group -> Many Parties -> Many Participants
In one of the API calls I need to get a single group with its parties and their participants attached.
This whole structure is built on 4 tables:
group
party
party_participant
participant
Naturally, with SQL it's a pretty straightforward join that combines all of them. And this is exactly what I am trying to do with Slick.
My method in the DAO class looks something like this:
def findOneByKeyAndAccountIdWithPartiesAndParticipants(key: UUID, accountId: Int): Future[Option[JourneyGroup]] = {
  val joins = JourneyGroups.groups join
    Parties.parties on (_.id === _.journeyGroupId) joinLeft
    PartiesParticipants.relations on (_._2.id === _.partyId) joinLeft
    Participants.participants on (_._2.map(_.participantId) === _.id)

  val query = joins.filter(_._1._1._1.accountId === accountId).filter(_._1._1._1.key === key)

  val q = for {
    (((journeyGroup, party), partyParticipant), participant) <- query
  } yield (journeyGroup, party, participant)

  val result = db.run(q.result)

  result ????
}
The problem here is that the result is of type Future[Seq[(JourneyGroup, Party, Participant)]].
However, what I really need is Future[Option[JourneyGroup]].
Note: the JourneyGroup and Party case classes have sequences for their children defined:
case class Party(id: Option[Int] = None,
                 partyType: Parties.Type.Value,
                 journeyGroupId: Int,
                 accountId: Int,
                 participants: Seq[Participant] = Seq.empty[Participant])
and
case class JourneyGroup(id: Option[Int] = None,
                        key: UUID,
                        name: String,
                        data: Option[JsValue],
                        accountId: Int,
                        parties: Seq[Party] = Seq.empty[Party])
So they can both hold their descendants.
What is the correct way to convert the result into what I need? Or am I going in a completely wrong direction?
Also, is this statement correct:
Participants.participants on (_._2.map(_.participantId) === _.id) ?
I ended up doing something like this:
journeyGroupDao.findOneByKeyAndAccountIdWithPartiesAndParticipants(key, account.id.get) map { data =>
  val groupedByJourneyGroup = data.groupBy(_._1)
  groupedByJourneyGroup.map { case (group, rows) =>
    val parties = rows.map(_._2).distinct map { party =>
      val participants = rows.filter(r => r._2.id == party.id).flatMap(_._3)
      party.copy(participants = participants)
    }
    group.copy(parties = parties)
  }.headOption
}
where the DAO method's signature is:
def findOneByKeyAndAccountIdWithPartiesAndParticipants(key: UUID, accountId: Int): Future[Seq[(JourneyGroup, Party, Option[Participant])]]

scala: how to model a basic parent-child relation

I have a Brand class that has several products, and in the Product class I want to have a reference to the brand, like this:
case class Brand(val name:String, val products: List[Product])
case class Product(val name: String, val brand: Brand)
How can I populate these classes?
I mean, I can't create a Product unless I have a Brand,
and I can't create the Brand unless I have a list of Products (because Brand.products is a val).
What would be the best way to model this kind of relation?
I would question why you are repeating the information by recording which products relate to which brand both in the List and in each Product.
Still, you can do it:
class Brand(val name: String, ps: => List[Product]) {
  lazy val products = ps
  override def toString = "Brand(" + name + ", " + products + ")"
}

class Product(val name: String, b: => Brand) {
  lazy val brand = b
  override def toString = "Product(" + name + ", " + brand.name + ")"
}

lazy val p1: Product = new Product("fish", birdseye)
lazy val p2: Product = new Product("peas", birdseye)
lazy val birdseye = new Brand("BirdsEye", List(p1, p2))

println(birdseye)
//Brand(BirdsEye, List(Product(fish, BirdsEye), Product(peas, BirdsEye)))
By-name params don't seem to be allowed for case classes unfortunately.
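If case classes are a hard requirement, one workaround (my own sketch, not part of the suggestion above) is to take an explicit zero-argument function instead of a by-name parameter, at the cost of calling it with ():
case class Brand(name: String, products: () => List[Product])
case class Product(name: String, brand: () => Brand)

// As above, this wiring assumes the definitions live in an object or the REPL.
lazy val p1: Product = Product("fish", () => birdseye)
lazy val p2: Product = Product("peas", () => birdseye)
lazy val birdseye: Brand = Brand("BirdsEye", () => List(p1, p2))
// birdseye.products().map(_.name)  // List(fish, peas)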
See also this similar question: Instantiating immutable paired objects
Since your question is about modeling this relationship, I would ask: why not just model them the way we do in a database? Separate the entities from the relationship.
val productsOfBrand: Map[Brand, List[Product]] = {
  // Initialize your brand-to-products mapping here; using a var
  // or a mutable map to construct the relation is fine, since
  // it is limited to this scope and transparent to the outside
  // world.
}
case class Brand(val name: String) {
  def products = productsOfBrand.get(this).getOrElse(Nil)
}
case class Product(val name: String, val brand: Brand) // If you really need that brand reference
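To make that wiring concrete, here is a hypothetical, self-contained version of the idea (the object wrapper and the brand/product names are made up for illustration):
object Catalog {
  case class Brand(name: String) {
    def products: List[Product] = productsOfBrand.get(this).getOrElse(Nil)
  }
  case class Product(name: String, brand: Brand)

  // The relation lives outside the entities, as suggested above.
  val acme = Brand("Acme")
  val productsOfBrand: Map[Brand, List[Product]] =
    Map(acme -> List(Product("anvil", acme), Product("rocket", acme)))
}
// Catalog.acme.products.map(_.name)  // List(anvil, rocket)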