Join dataset with case class Spark Scala - dataframe

I am converting a dataframe into a dataset using a case class that has a sequence of another case class:
case class IdMonitor(id: String, ipLocation: Seq[IpLocation])

case class IpLocation(
    ip: String,
    ipVersion: Byte,
    ipType: String,
    city: String,
    state: String,
    country: String)
Now I have another dataset of strings that contains just IPs. My requirement is to get all records from IpLocation where ipType == "home" or where the IP dataset contains the given IP from ipLocation. I am trying to use a bloom filter built from the IP dataset to search through it, but it is inefficient and not working well in general. I want to join the IP dataset with IpLocation, but I'm having trouble since the locations are in a Seq. I'm very new to Spark and Scala, so I'm probably missing something. Right now my code looks like this:
def buildBloomFilter(Ips: Dataset[String]): BloomFilter[String] = {
  val count = Ips.count
  val bloomFilter = Ips.rdd
    .mapPartitions { iter =>
      val b = BloomFilter.optimallySized[String](count, FP_PROBABILITY)
      iter.foreach(i => b += i)
      Iterator(b)
    }
    .treeReduce(_ | _)
  bloomFilter
}
val ipBf = buildBloomFilter(Ips)
val ipBfBroadcast = spark.sparkContext.broadcast(ipBf)
idMonitor.map { x =>
  x.ipLocation.filter(
    x => x.ipType == "home" && ipBfBroadcast.value.contains(x.ip)
  )
}
I just want to figure out how to join IpLocation and Ips.

Sample:
Starting from your case classes,
case class IpLocation(
    ip: String,
    ipVersion: Byte,
    ipType: String,
    city: String,
    state: String,
    country: String
)

case class IdMonitor(id: String, ipLocation: Seq[IpLocation])
I have defined the sample data as follows:
val ip_locations1 = Seq(IpLocation("123.123.123.123", 12.toByte, "home", "test", "test", "test"), IpLocation("123.123.123.124", 12.toByte, "otherwise", "test", "test", "test"))
val ip_locations2 = Seq(IpLocation("123.123.123.125", 13.toByte, "company", "test", "test", "test"), IpLocation("123.123.123.124", 13.toByte, "otherwise", "test", "test", "test"))
val id_monitor = Seq(IdMonitor("1", ip_locations1), IdMonitor("2", ip_locations2))
val df = id_monitor.toDF()
df.show(false)
+---+------------------------------------------------------------------------------------------------------+
|id |ipLocation |
+---+------------------------------------------------------------------------------------------------------+
|1 |[{123.123.123.123, 12, home, test, test, test}, {123.123.123.124, 12, otherwise, test, test, test}] |
|2 |[{123.123.123.125, 13, company, test, test, test}, {123.123.123.124, 13, otherwise, test, test, test}]|
+---+------------------------------------------------------------------------------------------------------+
and the IPs:
val ips = Seq("123.123.123.125")
val df_ips = ips.toDF("ips")
df_ips.show()
+---------------+
| ips|
+---------------+
|123.123.123.125|
+---------------+
Join:
From the above example data, explode the ipLocation array of IdMonitor and join it with the IPs.
df.withColumn("ipLocation", explode('ipLocation)).alias("a")
  .join(df_ips.alias("b"),
    col("a.ipLocation.ipType") === lit("home") || col("a.ipLocation.ip") === col("b.ips"),
    "inner")
  .select("ipLocation.*")
  .as[IpLocation].collect()
Finally, the collected result is given as follows:
res32: Array[IpLocation] = Array(IpLocation(123.123.123.123,12,home,test,test,test), IpLocation(123.123.123.125,13,company,test,test,test))

You can explode the array sequence in your IdMonitor objects using the explode function, then use a left outer join to match IPs present in your Ips dataset, then filter on ipType == "home" or on the IP being present in the Ips dataset, and finally rebuild your IpLocation sequence by grouping by id and using collect_list.
Complete code is as follows:
import org.apache.spark.sql.functions.{col, collect_list, explode}

val result = idMonitor.select(col("id"), explode(col("ipLocation")))
  .join(Ips, col("col.ip") === col("value"), "left_outer")
  .filter(col("col.ipType") === "home" || col("value").isNotNull)
  .groupBy("id")
  .agg(collect_list("col").as("value"))
  .drop("id")
  .as[Seq[IpLocation]]

Related

Scala MatchError while joining a dataframe and a dataset

I have one dataframe and one dataset:
Dataframe 1:
+------------------------------+-----------+
|City_Name                     |Level      |
+------------------------------+-----------+
|{City -> Paris}               |86         |
+------------------------------+-----------+
Dataset 2:
+-----------------------------------+-----------+
|Country_Details                    |Temperature|
+-----------------------------------+-----------+
|{City -> Paris, Country -> France} |31         |
+-----------------------------------+-----------+
I am trying to join them by checking whether the map in the column "City_Name" is included in the map of the column "Country_Details".
I am using the following UDF to check the condition:
val mapEqual = udf((col1: Map[String, String], col2: Map[String, String]) => {
  if (col2.nonEmpty) {
    col2.toSet subsetOf col1.toSet
  } else {
    true
  }
})
And I am making the join this way:
dataset2.join(dataframe1 , mapEqual(dataset2("Country_Details"), dataframe1("City_Name"), "leftanti")
However, I get this error:
terminated with error scala.MatchError: UDF(Country_Details#528) AS City_Name#552 (of class org.apache.spark.sql.catalyst.expressions.Alias)
Has anyone previously got the same error ?
I am using Spark version 3.0.2 and SQLContext, with scala language.
There are two issues here. The first one is that when you're calling your function, you're passing one extra parameter, leftanti (you meant to pass it to the join function, but you passed it to the udf instead).
The second one is that the udf logic won't work as expected; I suggest you use this:
val mapContains = udf { (col1: Map[String, String], col2: Map[String, String]) =>
  col2.keys.forall { key =>
    col1.get(key).exists(_ eq col2(key))
  }
}
Result:
scala> ds.join(df1 , mapContains(ds("Country_Details"), df1("City_Name")), "leftanti").show(false)
+----------------------------------+-----------+
|Country_Details |Temperature|
+----------------------------------+-----------+
|{City -> Paris, Country -> France}|31 |
+----------------------------------+-----------+

Erlang ets:select sublist

Is there a way in Erlang to create a select query on an ets table that will get all the elements containing the searched text?
ets:select(Table,
           [{ %% Match spec for select query
             {'_', #movie_data{genre = "Drama" ++ '_', _ = '_'}}, % Match pattern
             [],                                                  % Guard
             ['$_']                                               % Result
           }]);
This code gives me only the data that starts with (is prefixed by) the required text (text = "Drama"), but I also need the results that contain the text, like this example:
#movie_data{genre = "Action, Drama" }
I tried to change the guard to something like this:
{'_', #movie_data{genre = '$1', _='_'}}, [string:str('$1', "Drama") > 0] ...
But the problem is that it isn't a qualified guard expression.
Thanks for the help!!
It's not possible. You need to design your data structure to be searchable by the guard expressions, for example:
-record(movie_data, {genre, name}).
-record(genre, {comedy, drama, action}).

example() ->
    Table = ets:new('test', [{keypos, 2}]),
    ets:insert(Table, #movie_data{name = "Bean",
                                  genre = #genre{comedy = true}}),
    ets:insert(Table, #movie_data{name = "Magnolia",
                                  genre = #genre{drama = true}}),
    ets:insert(Table, #movie_data{name = "Fight Club",
                                  genre = #genre{drama = true, action = true}}),
    ets:select(Table,
               [{#movie_data{genre = #genre{drama = true, _ = '_'}, _ = '_'},
                 [],
                 ['$_']
                }]).

Spark dataset filter elements

I have 2 Spark datasets: lessonDS and latestLessonDS.
These are my Spark dataset POJOs:
Lesson class:
  private List<SessionFilter> info;
  private lessonId;

LatestLesson class:
  private String id;

SessionFilter class:
  private String id;
  private String sessionName;
I want to get all Lesson data where an info.id in the Lesson class is not in the LatestLesson ids, something like this:
lessonDS.filter(explode(col("info.id")).notEqual(latestLessonDS.col("value"))).show();
latestLessonDS contains:
100A
200C
300A
400A
lessonDS contains:
1,[[100A,jon],[200C,natalie]]
2,[[100A,jon]]
3,[[600A,jon],[400A,Kim]]
result:
3,[[600A,jon]]
If your latestLessonDS dataset is reasonably small, you can collect it and broadcast it, and then a simple filter transformation on lessonDS will give you the desired result, like:
import scala.collection.JavaConversions._
import spark.implicits._
val bc = spark.sparkContext.broadcast(latestLessonDS.collectAsList().toSeq)

lessonDS.mapPartitions(itr => {
  val cache = bc.value
  itr.filter(x => {
    // check in cache
  })
})
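One possible way to fill in the // check in cache placeholder is sketched below. It assumes Java-bean getters on the POJOs (LatestLesson.getId, Lesson.getInfo, SessionFilter.getId; these names are not shown in the question), broadcasts only the latest ids as a Set, and uses Dataset.filter so no extra encoder is needed. A lesson is kept if at least one of its info ids is absent from latestLessonDS; trimming the matching info entries out of a kept lesson would additionally need a map with a bean encoder.
import scala.collection.JavaConverters._

// Broadcast just the ids as a Set for cheap membership checks
// (getId is an assumed getter name on LatestLesson).
val latestIds = spark.sparkContext.broadcast(
  latestLessonDS.collectAsList().asScala.map(_.getId).toSet)

// Keep a lesson if any of its info ids is NOT present in latestLessonDS
// (getInfo / getId are assumed getter names).
val filtered = lessonDS.filter { lesson =>
  lesson.getInfo.asScala.exists(info => !latestIds.value.contains(info.getId))
}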
Usually the array_contains function could be used as a join condition when joining lessonDs and latestLessonDs. But this function does not work here, as the join condition requires that all elements of lessonDs.info.id appear in latestLessonDs.
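For illustration, the kind of array_contains join being ruled out here would look roughly like the sketch below (column names follow the example data above, and latestLessonDs is assumed to expose a single id column); it matches a lesson as soon as any one of its info ids appears in latestLessonDs, which is not the required all-elements semantics:
import org.apache.spark.sql.functions.{array_contains, col}

// true when ANY element of info.id equals latestLessonDs.id,
// not when ALL elements are present, so it does not fit this requirement
lessonDs.join(latestLessonDs, array_contains(col("info.id"), col("id")))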
A way to get the result is to explode lessonDs, join it with latestLessonDs, and then check whether an entry in latestLessonDs exists for every element of lessonDs.info by comparing the number of info elements before and after the join:
lessonDs
  .withColumn("noOfEntries", size('info))
  .withColumn("id", explode(col("info.id")))
  .join(latestLessonDs, "id")
  .groupBy("lessonId", "info", "noOfEntries").count()
  .filter("noOfEntries = count")
  .drop("noOfEntries", "count")
  .show(false)
prints
+--------+------------------------------+
|lessonId|info |
+--------+------------------------------+
|1 |[[100A, jon], [200C, natalie]]|
|2 |[[100A, jon]] |
+--------+------------------------------+

How to correctly build an object graph based on a multi-level join in Slick?

I have a model structure as follows:
Group -> Many Parties -> Many Participants
In one of the API calls I need to get a single group with its parties and their participants attached.
This whole structure is built on 4 tables:
group
party
party_participant
participant
Naturally, with SQL it's a pretty straightforward join that combines all of them. And this is exactly what I am trying to do with Slick.
My method in the DAO class looks something like this:
def findOneByKeyAndAccountIdWithPartiesAndParticipants(key: UUID, accountId: Int): Future[Option[JourneyGroup]] = {
  val joins = JourneyGroups.groups join
    Parties.parties on (_.id === _.journeyGroupId) joinLeft
    PartiesParticipants.relations on (_._2.id === _.partyId) joinLeft
    Participants.participants on (_._2.map(_.participantId) === _.id)

  val query = joins.filter(_._1._1._1.accountId === accountId).filter(_._1._1._1.key === key)

  val q = for {
    (((journeyGroup, party), partyParticipant), participant) <- query
  } yield (journeyGroup, party, participant)

  val result = db.run(q.result)

  result ????
}
The problem here is that the result is of type Future[Seq[(JourneyGroup, Party, Participant)]].
However, what I really need is Future[Option[JourneyGroup]].
Note: the case classes of JourneyGroup and Party have sequences for their children defined:
case class Party(id: Option[Int] = None,
                 partyType: Parties.Type.Value,
                 journeyGroupId: Int,
                 accountId: Int,
                 participants: Seq[Participant] = Seq.empty[Participant])
and
case class JourneyGroup(id: Option[Int] = None,
                        key: UUID,
                        name: String,
                        data: Option[JsValue],
                        accountId: Int,
                        parties: Seq[Party] = Seq.empty[Party])
So they both can hold their descendants.
What is the correct way to convert to the result I need? Or am I going in a completely wrong direction?
Also, is this statement correct:
Participants.participants on (_._2.map(_.participantId) === _.id) ?
I ended up doing something like this:
journeyGroupDao.findOneByKeyAndAccountIdWithPartiesAndParticipants(key, account.id.get) map { data =>
  val groupedByJourneyGroup = data.groupBy(_._1)
  groupedByJourneyGroup.map { case (group, rows) =>
    val parties = rows.map(_._2).distinct map { party =>
      val participants = rows.filter(r => r._2.id == party.id).flatMap(_._3)
      party.copy(participants = participants)
    }
    group.copy(parties = parties)
  }.headOption
}
where the DAO method's signature is:
def findOneByKeyAndAccountIdWithPartiesAndParticipants(key: UUID, accountId: Int): Future[Seq[(JourneyGroup, Party, Option[Participant])]]

Slick query for one to optional one (zero or one) relationship

Given tables of:
case class Person(id: Int, name: String)
case class Dead(personId: Int)
and populated with:
Person(1, "George")
Person(2, "Barack")
Dead(1)
is it possible to have a single query that would produce a list of (Person, Option[Dead]) like so?
(Person(1, "George"), Some(Dead(1)))
(Person(2, "Barack"), None)
For Slick 3.0 it should be something like this:
val query = for {
  (p, d) <- persons joinLeft deads on (_.id === _.personId)
} yield (p, d)

val results: Future[Seq[(Person, Option[Dead])]] = db.run(query.result)
In Slick, the potentially missing side of an outer join is automatically wrapped in an Option type, which is why d ends up as Option[Dead] here. You can read more about joining here: http://slick.typesafe.com/doc/3.0.0/queries.html#joining-and-zipping
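For completeness, here is a minimal sketch of the table definitions the query above assumes; the class, table, and column names (PersonTable, DeadTable, persons, deads, person_id) are illustrative rather than taken from the original answer:
import slick.driver.H2Driver.api._ // Slick 3.0/3.1; on 3.2+ use slick.jdbc.H2Profile.api._

// Illustrative Slick table definitions matching the case classes above.
class PersonTable(tag: Tag) extends Table[Person](tag, "person") {
  def id   = column[Int]("id", O.PrimaryKey)
  def name = column[String]("name")
  def *    = (id, name) <> (Person.tupled, Person.unapply)
}

class DeadTable(tag: Tag) extends Table[Dead](tag, "dead") {
  def personId = column[Int]("person_id", O.PrimaryKey)
  def *        = personId <> (Dead.apply, Dead.unapply)
}

val persons = TableQuery[PersonTable]
val deads   = TableQuery[DeadTable]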