I have a setup using DataFrames with spark-cassandra-connector 1.6.2, and I am trying to perform some transformations with Cassandra. The DataStax Enterprise version is 5.0.5.
DataFrame df1 = sparkContext
.read().format("org.apache.spark.sql.cassandra")
.options(readOptions).load()
.where("field2 ='XX'")
.limit(limitVal)
.repartition(partitions);
List<String> distinctKeys = df1.getColumn("field3").collect();
values = some transformations to get IN query values;
String cassandraQuery = String.format("SELECT * FROM "
+ "table2 "
+ "WHERE field2 = 'XX' "
+ "AND field3 IN (%s)", values);
DataFrame df2 = sparkContext.cassandraSql(cassandraQuery);
String column1 = "field3";
String column2 = "field4";
List<String> columns = new ArrayList<>();
columns.add(column1);
columns.add(column2);
scala.collection.Seq<String> usingColumns =
scala.collection.JavaConverters.
collectionAsScalaIterableConverter(columns).asScala().toSeq();
DataFrame joined = df1.join(df2, usingColumns, "left_outer");
List<Row> collected = joined.collectAsList(); // doesn't work
Long count = joined.count(); // works
This is the exception log. It looks like Spark is creating the Cassandra source relation, and it cannot be serialized.
java.io.NotSerializableException: java.util.ArrayList$Itr
Serialization stack:
- object not serializable (class:
org.apache.spark.sql.cassandra.CassandraSourceRelation, value:
org.apache.spark.sql.cassandra.CassandraSourceRelation@1c11a496)
- field (class: org.apache.spark.sql.execution.datasources.LogicalRelation,
name: relation, type: class org.apache.spark.sql.sources.BaseRelation)
- object (class org.apache.spark.sql.execution.datasources.LogicalRelation,
Relation[fields]
org.apache.spark.sql.cassandra.CassandraSourceRelation@1c11a496
)
- field (class: org.apache.spark.sql.catalyst.plans.logical.Filter, name:
child, type: class org.apache.spark.sql.catalyst.plans.logical.LogicalPlan)
- object (class org.apache.spark.sql.catalyst.plans.logical.Filter, Filter
(field2#0 = XX)
+- Relation[fields]
org.apache.spark.sql.cassandra.CassandraSourceRelation@1c11a496
)
- field (class: org.apache.spark.sql.catalyst.plans.logical.Repartition, name:
child, type: class org.apache.spark.sql.catalyst.plans.logical.LogicalPlan)
- object (class org.apache.spark.sql.catalyst.plans.logical.Repartition,
Repartition 4, true
+- Filter (field2#0 = XX)
+- Relation[fields]
org.apache.spark.sql.cassandra.CassandraSourceRelation@1c11a496
)
- field (class: org.apache.spark.sql.catalyst.plans.logical.Join, name: left,
type: class org.apache.spark.sql.catalyst.plans.logical.LogicalPlan)
- object (class org.apache.spark.sql.catalyst.plans.logical.Join, Join
LeftOuter, Some(((field3#2 = field3#18) && (field4#3 = field4#20)))
:- Repartition 4, true
: +- Filter (field2#0 = XX)
: +- Relation[fields]
org.apache.spark.sql.cassandra.CassandraSourceRelation@1c11a496
+- Project [fields]
+- Filter ((field2#17 = YY) && field3#18 IN (IN array))
+- Relation[fields]
org.apache.spark.sql.cassandra.CassandraSourceRelation@7172525e
)
- field (class: org.apache.spark.sql.catalyst.plans.logical.Project, name:
child, type: class org.apache.spark.sql.catalyst.plans.logical.LogicalPlan)
- object (class org.apache.spark.sql.catalyst.plans.logical.Project, Project
[fields]
+- Join LeftOuter, Some(((field3#2 = field3#18) && (field4#3 = field4#20)))
:- Repartition 4, true
: +- Filter (field2#0 = XX)
: +- Relation[fields]
org.apache.spark.sql.cassandra.CassandraSourceRelation@1c11a496
+- Project [fields]
+- Filter ((field2#17 = XX) && field3#18 IN (IN array))
+- Relation[fields]
org.apache.spark.sql.cassandra.CassandraSourceRelation@7172525e
)
- field (class: org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4, name:
$outer, type: class org.apache.spark.sql.catalyst.trees.TreeNode)
- object (class org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4,
<function1>)
- field (class:
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$9,
name: $outer, type: class
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4)
- object (class
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$9,
<function1>)
- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: f$1,
type: interface scala.Function1)
- object (class scala.collection.immutable.Stream$$anonfun$map$1, <function0>)
- writeObject data (class: scala.collection.immutable.$colon$colon)
- object (class scala.collection.immutable.$colon$colon,
List(org.apache.spark.OneToOneDependency@17f43f4a))
- field (class: org.apache.spark.rdd.RDD, name:
org$apache$spark$rdd$RDD$$dependencies_, type: interface scala.collection.Seq)
- object (class org.apache.spark.rdd.MapPartitionsRDD, MapPartitionsRDD[32] at
collectAsList at RevisionPushJob.java:308)
- field (class: org.apache.spark.rdd.RDD$$anonfun$collect$1, name: $outer,
type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.rdd.RDD$$anonfun$collect$1, <function0>)
- field (class: org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12, name:
$outer, type: class org.apache.spark.rdd.RDD$$anonfun$collect$1)
- object (class org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12,
<function1>)
Is it possible to make it serializable? Why does the count operation work but the collect operation does not?
UPDATE:
After getting back to it, it turned out that in Java you first have to convert the Java Iterable to a Scala buffer and create a Scala Seq out of it. Otherwise it doesn't work. Thanks Russel for bringing my attention to the cause of the problem.
String attrColumn1 = "column1";
String attrColumn2 = "column2";
String attrColumn3 = "column3";
String attrColumn4 = "column4";
List<String> attrColumns = new ArrayList<>();
attrColumns.add(attrColumn1);
attrColumns.add(attrColumn2);
attrColumns.add(attrColumn3);
attrColumns.add(attrColumn4);
Seq<String> usingAttrColumns =
JavaConverters.asScalaBufferConverter(attrColumns).asScala().toList();
See the error message pointing to java.util.ArrayList$Itr being your unserializable bit, which I think may be a reference to
List<String> columns = new ArrayList<>();
columns.add(column1);
columns.add(column2);
Which in its implicit conversion may require the serialization of the ArrayList iterator? That's the only ArrayList I see, so it may be the culprit. It may also be in the code you removed for "values".
When you do the count it can discard column information, so that probably saves you, but I can't be sure.
So, TL;DR, my suggestion is to try removing things from the code and then building it back up again to find the unserializable bits.
Here I have a generic type, but I don't understand why, even though I didn't make the type T covariant, I can add subclasses of Player to my generic class and function, which has an upper bound T : Player. Why is the subtyping preserved without using the out keyword? Because of this I can, wrongly, add a BaseballPlayer and a GamesPlayer to a football team.
class Team<T : Player>(val name: String, private val players: MutableList<T>) {
fun addPlayers(player: T) {
if (players.contains(player)) {
println("Player: ${(player as Player).name} is already in the team.")
} else {
players.add(player)
println("Player: {(player as Player).name} was added to the team.")
}
}
}
open class Player(open val name: String)
data class FootballPlayer(override val name: String) : Player(name)
data class BaseballPlayer(override val name: String) : Player(name)
data class GamesPlayer(override val name: String) : Player(name)
val footballTeam = Team<Player>(
"Football Team",
mutableListOf(FootballPlayer("Player 1"), FootballPlayer("Player 2"))
)
val baseballPlayer = BaseballPlayer("Baseball Player")
val footballPlayer = FootballPlayer("Football Player")
val gamesPlayer = GamesPlayer("Games Player")
footballTeam.addPlayers(baseballPlayer)
footballTeam.addPlayers(footballPlayer)
footballTeam.addPlayers(gamesPlayer)
The mutable list you defined on this line:
mutableListOf(FootballPlayer("Player 1"), FootballPlayer("Player 2"))
is not a MutableList<FootballPlayer>. It is a MutableList<Player>, since you didn't specify its type, so the compiler used type inference to assume you want a MutableList<Player> to fit the constructor argument for your Team<Player> constructor.
So, it is valid to put any kind of Player into a MutableList<Player>, since it can only ever return items of type Player. It's still type safe.
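For example, a minimal sketch reusing the Player classes from the question:
val players: MutableList<Player> = mutableListOf() // the list's element type is Player
players.add(FootballPlayer("Player 1")) // fine: a FootballPlayer is a Player
players.add(BaseballPlayer("Player 2")) // also fine, for the same reason
val first: Player = players[0] // reads only ever give you a Player, so this stays type safe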
If you had been explicit about the type, it would have been a compile error:
val footballTeam = Team<Player>(
"Football Team",
mutableListOf<FootballPlayer>(FootballPlayer("Player 1"), FootballPlayer("Player 2"))
//error, expected MutableList<Player>
)
Or if you had omitted the type from the Team constructor, it would have assumed you wanted a Team<FootballPlayer> and you would have had an error when trying to add other types of player:
val footballTeam = Team(
"Football Team",
mutableListOf(FootballPlayer("Player 1"), FootballPlayer("Player 2"))
)
// ^^ is a Team<FootballPlayer> because of type inference.
footballTeam.addPlayers(BaseballPlayer("Foo")) // error, expected FootballPlayer
I am trying to create an embedded field. Eventually I need to have three levels of embedded items, but I can't even get this simple test case to work.
@Entity(tableName = "userItemsEntity")
@Parcelize
data class Item(
    var objecttype: String?,
    @PrimaryKey(autoGenerate = false)
    var objectid: Int?,
    var subtype: String?,
    var collid: Int?,
    @Embedded
    var name: Name?
) : Parcelable
@Parcelize
data class Name(
    var primary: Boolean? = true,
    var sortindex: Int? = null,
    var content: String? = null) : Parcelable
When I try to compile it, it complains about the DAO's updateItem():
SQL error or missing database (no such column: name)
DAO function
#Query("UPDATE userItemsEntity SET " +
"objecttype=:objecttype, objectid=:objectid, subtype=:subtype, collid=:collid, name=:name " +
"WHERE objectid=:objectid")
fun updateItem(
objecttype: String?,
objectid: Int,
subtype: String?,
collid: Int?,
name: Name?)
The reason is, as the error says, that there is no name column. Rather, the table consists of columns corresponding to the member variables of the embedded class (i.e. primary, sortindex and content).
i.e. the table create SQL is/will be :-
CREATE TABLE IF NOT EXISTS `userItemsEntity` (`objecttype` TEXT, `objectid` INTEGER, `subtype` TEXT, `collid` INTEGER, `primary` INTEGER, `sortindex` INTEGER, `content` TEXT, PRIMARY KEY(`objectid`))
Room knows to build the respective Name object from those columns when extracting rows.
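As an aside, a hypothetical read query along these lines (the function name is an assumption) would have Room rebuild the embedded Name from those columns :-
@Query("SELECT * FROM userItemsEntity WHERE objectid=:objectid")
fun getItem(objectid: Int): Item? // the returned Item has its Name built from the primary, sortindex and content columns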
So you could use :-
#Query("UPDATE userItemsEntity SET " +
"objecttype=:objecttype, objectid=:objectid, subtype=:subtype, collid=:collid, `primary`=:primary, sortindex=:sortindex, content=:content " +
"WHERE objectid=:objectid")
fun updateItem(
objecttype: String?,
objectid: Int,
subtype: String?,
collid: Int?,
primary: Boolean?,
sortindex: Int?,
content: String?
)
Note that primary is an SQLite keyword and is thus enclosed in grave accents (backticks) to ensure that it is not treated as a keyword. Otherwise you would get :-
There is a problem with the query: [SQLITE_ERROR] SQL error or missing database (near "primary": syntax error)
However, as you are using a WHERE clause based upon the primary key (objectid), the update will only ever apply to a single row, and as such you can simply use :-
@Update
fun update(item: Item): Int
Obviously the function's name need not be update; it could be any valid name that suits.
This has the advantage of not only being simpler but also of returning the number of rows updated (1 if the row exists, otherwise 0).
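For illustration, a minimal usage sketch (the dao variable and the sample values are assumptions, not from the question) :-
// dao is assumed to be an instance of the DAO obtained from your RoomDatabase
val item = Item("book", 1, "hardcover", 10, Name(primary = true, sortindex = 1, content = "Example"))
val rowsUpdated = dao.update(item) // 1 if a row with objectid 1 exists, otherwise 0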
Implementing a name column
If you want a name column, and for that name column to hold a Name object, then, as SQLite does not have storage/column types for objects, you would not embed the Name class.
You would have var name: Name? with an appropriate TypeConverter that would convert the Name object into a type that SQLite caters for :-
TEXT (String),
REAL (Float, Double...),
INTEGER (Long, Int ...) or
BLOB (ByteArray).
Typically String is used, and typically Gson is used to convert from an object to a JSON String.
SQLite does have a NUMERIC type. However, Room doesn't support its use, I believe because the other types cover all types of data and NUMERIC is a catch-all/default.
However, using a JSON representation of an object introduces bloat and reduces the usefulness of the converted data from an SQL aspect.
For example say you had :-
@Entity(tableName = "userOtherItemsEntity")
@Parcelize
data class OtherItem(
    var objecttype: String?,
    @PrimaryKey(autoGenerate = false)
    var objectid: Int?,
    var subtype: String?,
    var collid: Int?,
    var name: OtherName?) : Parcelable
@Parcelize
data class OtherName(
    var primary: Boolean? = true,
    var sortindex: Int? = null,
    var content: String? = null) : Parcelable
Then the underlying table does have the name column. The CREATE SQL, generated by Room, would be :-
CREATE TABLE IF NOT EXISTS `userOtherItemsEntity` (`objecttype` TEXT, `objectid` INTEGER, `subtype` TEXT, `collid` INTEGER, `name` TEXT, PRIMARY KEY(`objectid`))
However, you would need TypeConverters which could be :-
@TypeConverter
fun fromOtherName(othername: OtherName): String {
    return Gson().toJson(othername)
}

@TypeConverter
fun toOtherName(json: String): OtherName {
    return Gson().fromJson(json, OtherName::class.java)
}
The first uses Gson to convert the object to a JSON string, e.g. when inserting data; the second converts the JSON string back into an OtherName object.
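For completeness, a minimal sketch of how such converters might be registered; the Converters class, database name and DAO accessor are assumptions rather than code from the question :-
import androidx.room.Database
import androidx.room.RoomDatabase
import androidx.room.TypeConverter
import androidx.room.TypeConverters
import com.google.gson.Gson

class Converters {
    @TypeConverter
    fun fromOtherName(othername: OtherName): String = Gson().toJson(othername)

    @TypeConverter
    fun toOtherName(json: String): OtherName = Gson().fromJson(json, OtherName::class.java)
}

@Database(entities = [OtherItem::class], version = 1)
@TypeConverters(Converters::class) // makes the converters available throughout the database
abstract class TheDatabase : RoomDatabase() {
    // your @Dao accessor functions would be declared here
}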
Using Item with Name embedded, the data is stored across the separate primary, sortindex and content columns; whilst with OtherItem, with OtherName being converted, the (similar) data is stored as a single JSON string in the name column.
in the former the 3 Name columns would take up about (1 + 1 + 12) = 14 bytes.
in the latter, the OtherName columns (discounting the word Other whenever used) would take up some 55 bytes.
the latter may require more complex and resource-expensive searches if the components of the OtherName are to be included in searches.
e.g. #Query("SELECT * FROM userItemsEntity WHERE primary") as opposed to #Query("SELECT * FROM userOtherItemsEntity WHERE instr(name,'primary\":true') > 0")
The following error appears:
Entity class must be annotated with @Entity - androidx.room.Entity
error: Entities and POJOs must have a usable public constructor. You can have an empty constructor or a constructor whose parameters match the fields (by name and type). - androidx.room.Entity
error: An entity must have at least 1 field annotated with @PrimaryKey - androidx.room.Entity
error: [SQLITE_ERROR] SQL error or missing database (near ")": syntax error) - androidx.room.Entity
H:\Apps2021\app\build\tmp\kapt3\stubs\debug\de\tetzisoft\danza\data\DAO.java:17: error: There is a problem with the query: [SQLITE_ERROR] SQL error or missing database (no such table: geburtstag)
Entity looks like this
@Entity(tableName = "geburtstag")
data class Bday(
    @PrimaryKey(autoGenerate = true)
    var id : Int,
    @ColumnInfo(name="Name")
    var name : String,
    @ColumnInfo(name="Birthday")
    var birth : String
)
The issue would appear to be that you do not have the Bday class defined in the entities of the class that is annotated with @Database.
e.g. you appear to have :-
@Database(entities = [],version = 1)
abstract class TheDatabase: RoomDatabase() {
abstract fun getDao(): Dao
}
This produces the results you have shown, e.g.
E:\AndroidStudioApps\SO67560510Geburtstag\app\build\tmp\kapt3\stubs\debug\a\a\so67560510geburtstag\TheDatabase.java:7: error: @Database annotation must specify list of entities
public abstract class TheDatabase extends androidx.room.RoomDatabase {
                      ^
error: There is a problem with the query: [SQLITE_ERROR] SQL error or missing database (no such table: geburtstag) - a.a.so67560510geburtstag.Dao.getAll()
error: Not sure how to convert a Cursor to this method's return type (java.util.List<a.a.so67560510geburtstag.Bday>). - a.a.so67560510geburtstag.Dao.getAll()
error: a.a.so67560510geburtstag.Dao is part of a.a.so67560510geburtstag.TheDatabase but this entity is not in the database. Maybe you forgot to add a.a.so67560510geburtstag.Bday to the entities section of the @Database? - a.a.so67560510geburtstag.Dao.insert(a.a.so67560510geburtstag.Bday)
Whilst :-
@Database(entities = [Bday::class],version = 1)
abstract class TheDatabase: RoomDatabase() {
abstract fun getDao(): Dao
}
compiles successfully.
Note the change from entities = [] to entities = [Bday::class].
You should set default values for the constructor args. In this case Kotlin will generate a no-arg constructor. Try this:
@Entity(tableName = "geburtstag")
data class Bday(
    @PrimaryKey(autoGenerate = true)
    var id : Int = 0,
    @ColumnInfo(name="Name")
    var name : String = "",
    @ColumnInfo(name="Birthday")
    var birth : String = ""
)
I have the following models:
@RealmClass
open class Sport(@PrimaryKey open var sportId: Long = 0,
    //...
    open var events: RealmList<Event> = RealmList(),
) : RealmObject()

@RealmClass
open class Event(
    @PrimaryKey var id: Long = 0L,
    //...
    open var defaultGame: Game? = null
) : RealmObject()
and the following query:
fun getSports(): RealmResults<Sport> = realm.where(Sport::class.java)
.isNotNull("events.defaultGame")
.distinct("sportId")
.sort(orderSportFields, orderSport)
.findAll()
When I call the getSports() function, I receive:
java.lang.IllegalArgumentException: Illegal Argument: isNotNull() by nested query for link field is not supported.
How to change my query to get the list of sports where each sport has at least one event with defaultGame field being not null?
Thanks!
If I have the below data class
data class User(val name: String = "", val age: Int = 0)
How can I define a collection of it, like:
var user = User [] // this is not working
I need to be able to call the users by:
user[0].name // something like this!
Defining a List collection in Kotlin in different ways:
Immutable variable with immutable (read only) list:
val users: List<User> = listOf( User("Tom", 32), User("John", 64) )
Immutable variable with mutable list:
val users: MutableList<User> = mutableListOf( User("Tom", 32), User("John", 64) )
or without initial value - empty list and without explicit variable type:
val users = mutableListOf<User>()
//or
val users = ArrayList<User>()
you can add items to list:
users.add(anotherUser) or
users += anotherUser (under the hood it's users.add(anotherUser))
Mutable variable with immutable list:
var users: List<User> = listOf( User("Tom", 32), User("John", 64) )
or without initial value - empty list and without explicit variable type:
var users = emptyList<User>()
NOTE: you can add* items to list:
users += anotherUser - *it creates a new ArrayList and assigns it to users
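A quick sketch demonstrating that behaviour (the identity check is only for illustration):
var users: List<User> = listOf(User("Tom", 32))
val before = users
users += User("John", 64) // allowed because users is a var: plus() builds a new list
println(before === users) // false - users now refers to a new list
println(before.size) // 1 - the original list is unchanged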
Mutable variable with mutable list:
var users: MutableList<User> = mutableListOf( User("Tom", 32), User("John", 64) )
or without initial value - empty list and without explicit variable type:
var users = emptyList<User>().toMutableList()
//or
var users = ArrayList<User>()
NOTE: you can add items to list:
users.add(anotherUser)
but not using users += anotherUser
Error: Kotlin: Assignment operators ambiguity:
public operator fun Collection<String>.plus(element: String): List<String> defined in kotlin.collections
@InlineOnly public inline operator fun MutableCollection<in String>.plusAssign(element: String): Unit defined in kotlin.collections
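To get the user[0].name access asked about in the question, just index into whichever list you created, for example:
val users = listOf(User("Tom", 32), User("John", 64))
println(users[0].name) // prints "Tom"; [index] uses the list's get operator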
see also:
https://kotlinlang.org/docs/reference/collections.html
You do it like this in Kotlin:
val array = arrayOf(User("name1"), User("name2"))
If you want to create a collection without adding elements right away, use
val arrayList = ArrayList<User>()
In this case you have to specify the element type explicitly because there is nothing to infer it from.
From the ArrayList docu:
Provides a MutableList implementation, which uses a
resizable array as its backing storage