spark-sql on google dataproc from sbt scala - apache-spark-sql

Using a Google Dataproc Spark cluster, my sbt-built assembly jar can access Cassandra via SparkContext.
However, when I try to access it via the sqlContext, Spark SQL classes are not found on the remote cluster, even though I believe the Dataproc cluster is supposed to be provisioned for Spark SQL.
java.lang.NoClassDefFoundError: org/apache/spark/sql/types/UTF8String$
at org.apache.spark.sql.cassandra.CassandraSQLRow$$anonfun$fromJavaDriverRow$1.apply$mcVI$sp(CassandraSQLRow.scala:50)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala
my sbt file:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.0" % "provided",
"org.apache.spark" %% "spark-sql" % "1.5.0" % "provided",
"com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0"
)
Turning off "provided" on spark-sql puts me in jar duplicate merge hell.
Thx for any help.

It looks like you also need version 1.5.0 of the spark-cassandra-connector to ensure your classes are compatible. Here's the commit which upgraded the cassandra connector to 1.5.0, and you can see it removes the import of org.apache.spark.sql.types.UTF8String and adds import org.apache.spark.unsafe.types.UTF8String instead, changing the relevant lines in CassandraSQLRow.scala:
data(i) = GettableData.get(row, i)
data(i) match {
case date: Date => data.update(i, new Timestamp(date.getTime))
- case str: String => data.update(i, UTF8String(str))
+ case bigInt: BigInteger => data.update(i, new JBigDecimal(bigInt))
+ case str: String => data.update(i, UTF8String.fromString(str))
case set: Set[_] => data.update(i, set.toSeq)
case _ =>
}
Though it appears there are only "milestone" artifact types rather than "release" types in Maven Central for the Cassandra connector, you should still be able to get the latest milestone connector, 1.5.0-M2, to work with your code.
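For example, bumping the connector in the sbt build while keeping the Spark artifacts "provided" would look like this (a sketch based on the dependencies above):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.5.0" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2"
)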
EDIT: Additional link to the compatibility table from the Cassandra connector's GitHub README.md

Related

vertx hazelcast class serialization on OSGi karaf

I want to use a Vert.x cluster with Hazelcast on Karaf. When I try to write messages to the bus (after the cluster is formed) I get this serialization error. I was thinking about adding a class definition to Hazelcast to tell it where to find the Vert.x server ID class (io.vertx.spi.cluster.hazelcast.impl.HazelcastServerID), but I am not sure how.
On Karaf I had to wrap the vertx-hazelcast jar because it doesn't have a proper manifest file.
<bundle start-level="80">wrap:mvn:io.vertx/vertx-hazelcast/${vertx.version}</bundle>
Here is my error:
com.hazelcast.nio.serialization.HazelcastSerializationException: Problem while reading DataSerializable, namespace: 0, id: 0, class: 'io.vertx.spi.cluster.hazelcast.impl.HazelcastServerID', exception: io.vertx.spi.cluster.hazelcast.impl.HazelcastServerID
at com.hazelcast.internal.serialization.impl.DataSerializer.read(DataSerializer.java:130)[11:com.hazelcast:3.6.3]
at com.hazelcast.internal.serialization.impl.DataSerializer.read(DataSerializer.java:47)[11:com.hazelcast:3.6.3]
at com.hazelcast.internal.serialization.impl.StreamSerializerAdapter.read(StreamSerializerAdapter.java:46)[11:com.hazelcast:3.6.3]
at com.hazelcast.internal.serialization.impl.AbstractSerializationService.toObject(AbstractSerializationService.java:170)[11:com.hazelcast:3.6.3]
at com.hazelcast.map.impl.DataAwareEntryEvent.getOldValue(DataAwareEntryEvent.java:82)[11:com.hazelcast:3.6.3]
at io.vertx.spi.cluster.hazelcast.impl.HazelcastAsyncMultiMap.entryRemoved(HazelcastAsyncMultiMap.java:147)[64:wrap_file__C__Users_gadei_development_github_effectus.io_effectus-core_core.test_core.test.exam_target_paxexam_unpack_5bf4439f-01ff-4db4-bd3d-e3b6a1542596_system_io_vertx_vertx-hazelcast_3.4.0-SNAPSHOT_vertx-hazelcast-3.4.0-SNAPSHOT.jar:0.0.0]
at com.hazelcast.multimap.impl.MultiMapEventsDispatcher.dispatch0(MultiMapEventsDispatcher.java:111)[11:com.hazelcast:3.6.3]
at com.hazelcast.multimap.impl.MultiMapEventsDispatcher.dispatchEntryEventData(MultiMapEventsDispatcher.java:84)[11:com.hazelcast:3.6.3]
at com.hazelcast.multimap.impl.MultiMapEventsDispatcher.dispatchEvent(MultiMapEventsDispatcher.java:55)[11:com.hazelcast:3.6.3]
at com.hazelcast.multimap.impl.MultiMapService.dispatchEvent(MultiMapService.java:371)[11:com.hazelcast:3.6.3]
at com.hazelcast.multimap.impl.MultiMapService.dispatchEvent(MultiMapService.java:65)[11:com.hazelcast:3.6.3]
at com.hazelcast.spi.impl.eventservice.impl.LocalEventDispatcher.run(LocalEventDispatcher.java:56)[11:com.hazelcast:3.6.3]
at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:187)[11:com.hazelcast:3.6.3]
at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:171)[11:com.hazelcast:3.6.3]
Caused by: java.lang.ClassNotFoundException: io.vertx.spi.cluster.hazelcast.impl.HazelcastServerID
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)[:1.8.0_101]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)[:1.8.0_101]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)[:1.8.0_101]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)[:1.8.0_101]
at com.hazelcast.nio.ClassLoaderUtil.tryLoadClass(ClassLoaderUtil.java:137)[11:com.hazelcast:3.6.3]
at com.hazelcast.nio.ClassLoaderUtil.loadClass(ClassLoaderUtil.java:115)[11:com.hazelcast:3.6.3]
at com.hazelcast.nio.ClassLoaderUtil.newInstance(ClassLoaderUtil.java:68)[11:com.hazelcast:3.6.3]
at com.hazelcast.internal.serialization.impl.DataSerializer.read(DataSerializer.java:119)[11:com.hazelcast:3.6.3]
... 13 more
Any suggestions appreciated, thanks.
This normally happens when an object's serialization is asymmetric (it reads one property fewer or more than it writes). In that case you end up at the wrong stream position, which means you read the wrong datatype.
Another possible reason is multiple different Hazelcast versions on the classpath (please check that), or different versions on different nodes.
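To illustrate the first point, here is a minimal Scala sketch (a hypothetical class, not Vert.x's actual HazelcastServerID) of the kind of asymmetric readData/writeData described above:
```
import com.hazelcast.nio.{ObjectDataInput, ObjectDataOutput}
import com.hazelcast.nio.serialization.DataSerializable

// Hypothetical example: writeData emits two fields but readData consumes only one,
// so everything deserialized after this object is read from the wrong stream position.
class BrokenServerId(var host: String, var port: Int) extends DataSerializable {
  def this() = this(null, 0) // no-arg constructor required for deserialization

  override def writeData(out: ObjectDataOutput): Unit = {
    out.writeUTF(host)
    out.writeInt(port) // written here...
  }

  override def readData(in: ObjectDataInput): Unit = {
    host = in.readUTF()
    // ...but never read back: the next reader picks up this stale int instead
  }
}
```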
The solution involved classloading magic:
.setClassLoader(HazelcastClusterManager.class.getClassLoader())
I ended up rolling my own Hazelcast instance and configuring it the way the Vert.x documentation instructs, with the additional classloader configuration trick:
```
// Look up the HazelcastOSGiService published by the Hazelcast bundle
ServiceReference<HazelcastOSGiService> serviceRef = context.getServiceReference(HazelcastOSGiService.class);
log.info("Hazelcast OSGi Service Reference: {}", serviceRef);
hazelcastOsgiService = context.getService(serviceRef);
log.info("Hazelcast OSGi Service: {}", hazelcastOsgiService);
hazelcastOsgiService.getClass().getClassLoader();

// Semaphore config expected by the Vert.x cluster manager
Map<String, SemaphoreConfig> semaphores = new HashMap<>();
semaphores.put("__vertx.*", new SemaphoreConfig().setInitialPermits(1));

Config hazelcastConfig = new Config("effectus-instance")
    // The crucial part: give Hazelcast the classloader of the vertx-hazelcast bundle
    .setClassLoader(HazelcastClusterManager.class.getClassLoader())
    .setGroupConfig(new GroupConfig("dev").setPassword("effectus"))
    // .setSerializationConfig(new SerializationConfig().addClassDefinition()
    .addMapConfig(new MapConfig()
        .setName("__vertx.subs")
        .setBackupCount(1)
        .setTimeToLiveSeconds(0)
        .setMaxIdleSeconds(0)
        .setEvictionPolicy(EvictionPolicy.NONE)
        .setMaxSizeConfig(new MaxSizeConfig().setSize(0).setMaxSizePolicy(MaxSizeConfig.MaxSizePolicy.PER_NODE))
        .setEvictionPercentage(25)
        .setMergePolicy("com.hazelcast.map.merge.LatestUpdateMapMergePolicy"))
    .setSemaphoreConfigs(semaphores);

hazelcastOSGiInstance = hazelcastOsgiService.newHazelcastInstance(hazelcastConfig);
log.info("New Hazelcast OSGI instance: {}", hazelcastOSGiInstance);
hazelcastOsgiService.getAllHazelcastInstances().stream()
    .forEach(instance -> log.info("Registered Hazelcast OSGI Instance: {}", instance.getName()));

clusterManager = new HazelcastClusterManager(hazelcastOSGiInstance);
VertxOptions options = new VertxOptions().setClusterManager(clusterManager).setHAGroup("effectus");
Vertx.clusteredVertx(options, res -> {
    if (res.succeeded()) {
        Vertx v = res.result();
        log.info("Vertx is running in cluster mode: {}", v);
        // some more code...
    }
});
```
So the issue is that the Hazelcast instance doesn't have access to the classes inside the vertx-hazelcast bundle.
I am sure there is a shorter, cleaner way somewhere.
Any better suggestions would be great.

Cannot register new CQ query on Apache Geode

I am stuck trying to register a CQ query with ClientCache. I keep getting this exception:
CqService is not available.
java.lang.IllegalStateException: CqService is not available.
at org.apache.geode.cache.query.internal.cq.MissingCqService.start(MissingCqService.java:171)
at org.apache.geode.cache.query.internal.DefaultQueryService.getCqService(DefaultQueryService.java:777)
at org.apache.geode.cache.query.internal.DefaultQueryService.newCq(DefaultQueryService.java:486)
The client cache is created as follows:
def client(): ClientCache = new ClientCacheFactory()
  .setPdxPersistent(true)
  .setPdxSerializer(new ReflectionBasedAutoSerializer(false, "org.geode.importer.domain.FooBar"))
  .addPoolLocator(ConfigProvider.locator.host, ConfigProvider.locator.port)
  .setPoolSubscriptionEnabled(true)
  .create()
and the suggested solution does not help. The actual library version is:
"org.apache.geode" % "geode-core" % "1.0.0-incubating"
You will have to pull in geode-cq as a dependency. In Gradle:
compile 'org.apache.geode:geode-cq:1.0.0-incubating'
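Or, since the question uses sbt-style coordinates, the equivalent sbt line would presumably be:
"org.apache.geode" % "geode-cq" % "1.0.0-incubating"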

Spark Redshift saving into s3 as Parquet

I'm having issues saving a Redshift table into S3 as a Parquet file; the error is coming from the date field. I'm going to try to convert the column to a long and store it as a Unix timestamp for now.
Caused by: java.lang.NumberFormatException: multiple points
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1110)
at java.lang.Double.parseDouble(Double.java:540)
at java.text.DigitList.getDouble(DigitList.java:168)
at java.text.DecimalFormat.parse(DecimalFormat.java:1321)
at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1793)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1455)
at com.databricks.spark.redshift.Conversions$$anon$1.parse(Conversions.scala:54)
at java.text.DateFormat.parse(DateFormat.java:355)
at com.databricks.spark.redshift.Conversions$.com$databricks$spark$redshift$Conversions$$parseTimestamp(Conversions.scala:67)
at com.databricks.spark.redshift.Conversions$$anonfun$1.apply(Conversions.scala:122)
at com.databricks.spark.redshift.Conversions$$anonfun$1.apply(Conversions.scala:108)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at com.databricks.spark.redshift.Conversions$.com$databricks$spark$redshift$Conversions$$convertRow(Conversions.scala:108)
at com.databricks.spark.redshift.Conversions$$anonfun$createRowConverter$1.apply(Conversions.scala:135)
at com.databricks.spark.redshift.Conversions$$anonfun$createRowConverter$1.apply(Conversions.scala:135)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:241)
... 8 more
These are my gradle dependencies:
dependencies {
compile 'com.amazonaws:aws-java-sdk:1.10.31'
compile 'com.amazonaws:aws-java-sdk-redshift:1.10.31'
compile 'org.apache.spark:spark-core_2.10:1.5.1'
compile 'org.apache.spark:spark-sql_2.10:1.5.1'
compile 'com.databricks:spark-redshift_2.10:0.5.1'
compile 'com.fasterxml.jackson.module:jackson-module-scala_2.10:2.6.3'
}
EDIT 1: df.write.parquet("s3n://bucket/path/log.parquet") is how I'm saving the dataframe after I load in the redshift data using spark-redshift.
EDIT 2: I'm running all of this on my MacBook Air; maybe too much data corrupts the DataFrame? Not sure... It works when I add 'limit 1000', just not for the entire table, so the "query" option works but the "table" option doesn't in the spark-redshift options params.
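For reference, the load side looks roughly like this (a sketch using the spark-redshift read options; the URL, credentials, and tempdir are placeholders):
```
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS") // placeholder
  .option("tempdir", "s3n://bucket/tmp/")                                // placeholder
  .option("query", "select * from my_table limit 1000")                  // this works
  // .option("dbtable", "my_table")                                      // full-table load that fails
  .load()
```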
spark-redshift maintainer here. I believe that the error that you're seeing is caused by a thread-safety bug in spark-redshift (Java DecimalFormat instances are not thread-safe and we were sharing a single instance across multiple threads).
This has been fixed in the 0.5.2 release, which is available on Maven Central and Spark Packages. Upgrade to 0.5.2 and this should work!
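Assuming the same Gradle setup as above, the fix is just bumping the spark-redshift version:
compile 'com.databricks:spark-redshift_2.10:0.5.2'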

Spark Parallelize? (Could not find creator property with name 'id')

What causes this Serialization error in Apache Spark 1.4.0 when calling:
sc.parallelize(strList, 4)
This exception is thrown:
com.fasterxml.jackson.databind.JsonMappingException:
Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
Thrown from addBeanProps in Jackson: com.fasterxml.jackson.databind.deser.BeanDeserializerFactory#addBeanProps
The RDD is a Seq[String], and the #partitions doesn't seem to matter (tried 1, 2, 4).
There is no serialization stack trace as there normally is when the worker closure cannot be serialized.
What is another way to track this down?
@Interfector is correct. I ran into this issue as well; here's a snippet from my sbt file with the dependencyOverrides section that fixed it.
libraryDependencies ++= Seq(
"com.amazonaws" % "amazon-kinesis-client" % "1.4.0",
"org.apache.spark" %% "spark-core" % "1.4.0",
"org.apache.spark" %% "spark-streaming" % "1.4.0",
"org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.4.0",
"com.amazonaws" % "aws-java-sdk" % "1.10.2"
)
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)
I suspect that this is caused by the classpath providing you with a different version of Jackson than the one Spark is expecting (which is 2.4.4, if I'm not mistaken). You will need to adjust your classpath so that the correct Jackson is referenced first for Spark.
I had the same problem with a project built with Gradle, and I excluded the transitive dependency from the artifact that was causing the problem:
dependencies {
    compile('dependency.causing:problem:version') {
        exclude module: 'jackson-databind'
    }
    ....
}
That worked perfectly for me.
This worked for me (sbt syntax):
<dependency> excludeAll ExclusionRule(organization = "com.fasterxml.jackson.core")
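For example, applied to one of the dependencies above it might look like the following (which artifact actually pulls in the conflicting Jackson is an assumption; check your dependency graph first):
"com.amazonaws" % "amazon-kinesis-client" % "1.4.0" excludeAll ExclusionRule(organization = "com.fasterxml.jackson.core")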

ScalaQuery with play2.1.4

I am migrating a Play 2.0 app, which has a lot of ScalaQuery code in it, to Play 2.1.
With all the migration changes it finally compiles (I am not using Anorm), and the ScalaQuery queries are still there.
play compile and stage are successful, but it gives the following error:
java.lang.NoClassDefFoundError: scala/Right
org.scalaquery.ql.basic.BasicImplicitConversions$class.queryToQueryInvoker(BasicImplicitConversions.scala:26)
org.scalaquery.ql.extended.MySQLDriver$$anon$1.queryToQueryInvoker(MySQLDriver.scala:13)
models.SynonymMappings$$anonfun$updateCommonSynonymMappingTable$1.apply(SynonymMapping.scala:234)
models.SynonymMappings$$anonfun$updateCommonSynonymMappingTable$1.apply(SynonymMapping.scala:224)
org.scalaquery.session.Database.withSession(Database.scala:38)
models.SynonymMappings$.updateCommonSynonymMappingTable(SynonymMapping.scala:224)
Global$.onStart(Global.scala:48)
play.api.GlobalPlugin.onStart(GlobalSettings.scala:175)
play.api.Play$$anonfun$start$1$$anonfun$apply$mcV$sp$1.apply(Play.scala:85)
play.api.Play$$anonfun$start$1$$anonfun$apply$mcV$sp$1.apply(Play.scala:85)
scala.collection.immutable.List.foreach(List.scala:309)
play.api.Play$$anonfun$start$1.apply$mcV$sp(Play.scala:85)
play.api.Play$$anonfun$start$1.apply(Play.scala:85)
play.api.Play$$anonfun$start$1.apply(Play.scala:85)
play.utils.Threads$.withContextClassLoader(Threads.scala:18)
play.api.Play$.start(Play.scala:84)
SynonymMappings.scala
This is where I am getting the error:
def updateCommonSynonymMappingTable = database.withSession { implicit db: Session =>
  val q = for (m <- SynonymMappings) yield m.skill ~ m.synonyms ~ m.function ~ m.industry
  Logger.debug("Q for getting common syn mapping: " + q.selectStatement)
  var table: java.util.concurrent.ConcurrentHashMap[String, scala.Array[String]] = EfoundrySynonymEngine.getCommonSynonymMappingTable()
  var i = 0
  Logger.debug("Q for getting common syn mapping: " + q.selectStatement)
  var domainSpWords = 0
From this line the trace goes to org.scalaquery.session.Database.withSession.
As of Play Framework 2.1, ScalaQuery is deprecated, as it, along with Play Framework 2.0, only supports Scala 2.9.
Play Framework 2.1 supports Scala 2.10. The replacement for ScalaQuery is Slick, which also supports/requires Scala 2.10.
Slick web site: http://slick.typesafe.com/
It's a much cooler web site, so you should be happy.
So include that library in your web app and get rid of ScalaQuery, which should help you along on your migration effort.
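As a sketch, pulling Slick into the sbt build could look like this (the version shown is illustrative; use whichever Scala 2.10-compatible release is current):
libraryDependencies += "com.typesafe.slick" %% "slick" % "1.0.1"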
And if you're curious, that class not found error is because scala.Right is now scala.util.Right :)