Spark Scala Convert Sparse to Dense Feature - dataframe

I have the following output showing the DataFrame in which I'm trying to one-hot encode a String column:
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+-------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|      feature|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+-------------+
|  -122.28|   37.81|              52.0|      340.0|          97.0|     200.0|      87.0|       1.5208|          112500.0|     [NEAR BAY]|(5,[3],[1.0])|
|  -122.13|   37.67|              40.0|     1748.0|         318.0|     914.0|     317.0|       3.8676|          184000.0|     [NEAR BAY]|(5,[3],[1.0])|
|  -122.07|   37.67|              27.0|     3239.0|         671.0|    1469.0|     616.0|       3.2465|          230600.0|     [NEAR BAY]|(5,[3],[1.0])|
|  -122.13|   37.66|              19.0|      862.0|         167.0|     407.0|     183.0|       4.3456|          163000.0|     [NEAR BAY]|(5,[3],[1.0])|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+-------------+
As can be seen, the feature column has been computed from the ocean_proximity column. I now want to expand this feature column into a dense vector, and for that I tried something like this:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.mllib.linalg.Vector
import spark.implicits._
// Identify how many distinct values are in the OCEAN_PROXIMITY column
val distinctOceanProximities = dfRaw.select(col("ocean_proximity")).distinct().as[String].collect()
val oceanProximityAsArrayDF = dfRaw.withColumn("ocean_proximity", array("ocean_proximity"))
val countModel = new CountVectorizer().setInputCol("ocean_proximity").setOutputCol("feature").fit(oceanProximityAsArrayDF)
val transformedDF = countModel.transform(oceanProximityAsArrayDF)
transformedDF.show()
// Extract each vector element into its own column, one per distinct ocean_proximity value
def columnExtractor(idx: Int) = udf((v: Vector) => v(idx))
val featureCols = (0 until distinctOceanProximities.size).map(idx => columnExtractor(idx)($"feature").as(s"${distinctOceanProximities(idx)}"))
// Convert the sparse feature vector into a dense vector
val toDense = udf((v: Vector) => v.toDense)
val denseDF = transformedDF.withColumn("feature", toDense($"feature"))
denseDF.show()
This however fails with the following message:
org.apache.spark.sql.AnalysisException: Cannot up cast `input` from struct<type:tinyint,size:int,indices:array<int>,values:array<double>> to struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
The type path of the target object is:
- root class: "org.apache.spark.mllib.linalg.Vector"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
at org.apache.spark.sql.errors.QueryCompilationErrors$.upCastFailureError(QueryCompilationErrors.scala:137)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3438)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$36$$anonfun$applyOrElse$198.applyOrElse(Analyzer.scala:3467)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$36$$anonfun$applyOrElse$198.applyOrElse(Analyzer.scala:3445)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:318)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:318)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:323)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:377)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:438)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:438)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:323)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDown$1(QueryPlan.scala:94)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:127)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:137)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:137)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:85)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$36.applyOrElse(Analyzer.scala:3445)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$36.applyOrElse(Analyzer.scala:3441)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)

It was quite annoying, but the actual error was caused by using the wrong import. Instead of:
import org.apache.spark.mllib.linalg.Vector
using this:
import org.apache.spark.ml.linalg.Vector
solved the issue!
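For completeness, a minimal sketch of the dense-conversion step with the corrected import (applied to the same transformedDF as above):
import org.apache.spark.ml.linalg.Vector   // ml, not mllib
import org.apache.spark.sql.functions.udf

// Convert the sparse CountVectorizer output into a dense vector column
val toDense = udf((v: Vector) => v.toDense)
val denseDF = transformedDF.withColumn("feature", toDense($"feature"))
denseDF.show()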

Related

MapStruct Property has no write accessor in MetaData for target name

I have a FlatDTO that needs to be mapped to a nested Response containing InfoData and MetaData. The code for the response is generated by OpenAPI, so the definitions below can't be changed.
package org.mapstruct.example.kotlin.openapi
import com.fasterxml.jackson.annotation.JsonProperty
import javax.validation.Valid
data class Response(
    @field:Valid @field:JsonProperty("infoData", required = true) val infoData: InfoData,
    @field:Valid @field:JsonProperty("metaData", required = true) val metaData: MetaData
)
data class InfoData(
    @field:JsonProperty("id", required = true) val id: kotlin.String,
)
data class MetaData(
    @field:JsonProperty("firstProperty") val firstProperty: String? = null,
)
I have defined FlatDTO as follows.
package org.mapstruct.example.kotlin.models
data class FlatDTO(
    var id: String? = null,
    var firstProperty: String,
)
Here is my mapper class which maps FlatDTO to Response
package org.mapstruct.example.kotlin.mapper
import org.mapstruct.Mapper
import org.mapstruct.Mapping
import org.mapstruct.Mappings
import org.mapstruct.example.kotlin.models.FlatDTO
import org.mapstruct.example.kotlin.openapi.Response
@Mapper
interface DataMapper {
    @Mappings(
        Mapping(target = "infoData.id", source = "id"),
        Mapping(target = "metaData.firstProperty", source = "firstProperty")
    )
    fun flatToResponse(flatDTO: FlatDTO): Response
}
When I try to build the code using mvn clean install, I get the following error:
error: Property "firstProperty" has no write accessor in MetaData for target name "metaData.firstProperty".
[ERROR] @org.mapstruct.Mappings(value = {@org.mapstruct.Mapping(target = "infoData.id", source = "id"), @org.mapstruct.Mapping(target = "metaData.firstProperty", source = "firstProperty")})
I understand that this message is saying there is no setter for firstProperty because it is defined as val, but that code cannot be edited. I can write my own custom mapper, which works just fine.
I wanted to understand whether there is a way to use MapStruct in this scenario.

Stream Eventhub Fixed Length data to a streaming DataFrame

Summary: I have a streaming source (Event Hub) where the data is received in fixed-length format, and I want to read that streaming source into a Spark DataFrame.
Note: I am able to do this when the fixed-length data comes from a directory, creating substrings and categorizing them according to my needs. But how can I do it for a streaming source like Event Hub (since Event Hub delivers all the data as Body)?
Let's say my fixed-length file is sample.txt and contains: 00101292017you1235
My code is like below:
import org.apache.spark.eventhubs
import org.apache.spark.eventhubs.EventPosition._
import com.microsoft.azure.eventhubs.EventHubClientOptions
import org.apache.spark.eventhubs._
import com.microsoft.azure.eventhubs.impl.StringUtil
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val endpoint = "Endpoint=sb://XXXXXX.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=XXXXXXX"
val eventHub = "XXXX"
val connectionString = ConnectionStringBuilder(endpoint)
  .setEventHubName("XXXXX")
  .build
val ehConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)
  .setMaxEventsPerTrigger(500)
val ehStream = spark.readStream.format("eventhubs").options(ehConf.toMap).load
val messages = ehStream
  .withColumn("Offset", $"offset".cast(LongType))
  .withColumn("Time (readable)", $"enqueuedTime".cast(TimestampType))
  .withColumn("Timestamp", $"enqueuedTime".cast(LongType))
  .withColumn("Body", $"body".cast(StringType))
  .select("Offset", "Time (readable)", "Timestamp", "Body")
messages.writeStream
  .outputMode("append")
  .option("truncate", false)
  .format("console")
  .start()
  .awaitTermination()
For the above scenario, how can the body data received in Event Hub (00101292017you1235) be split into columns? Something like:
df.select(
df.value.substr(1,3).alias('id'),
df.value.substr(4,8).alias('date'),
df.value.substr(12,3).alias('string'),
df.value.substr(15,4).cast('integer').alias('integer')
).show
will result in:
+---+--------+------+-------+
| id|    date|string|integer|
+---+--------+------+-------+
|001|01292017|   you|   1234|
|002|01302017|    me|   5678|
+---+--------+------+-------+
This helps:
val messages =ehStream.withColumn("FirstColumn",$"body".substr(1,3).cast(StringType)).select("FirstColumn")

How to convert a Spark DataFrame[Double, String] to LabeledPoint?

Following is the code I am experimenting with. I am trying to convert SalesData in CSV to a DataFrame and then to LabeledPoints. However, in the last step I am getting the following compilation error:
package macros contains object and package with same name: blackbox
Can you please give me pointers on what I am doing wrong here? Thank you.
--EDIT--
The compilation issue was solved by adding the 2.11 mllib dependency to build.gradle, but mlData.show fails with:
ERROR: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.ml.linalg.Vector
val path = "SalesData.csv"
val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
  .set("spark.broadcast.compress", "false")
  .setAppName("local-spark-kafka-consumer-client")
val sparkSession = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()
val data = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
data.cache()
import org.apache.spark.sql.DataFrameNaFunctions
data.na.drop()
data.show
//get monthly sales totals
val summary = data.select("OrderMonthYear","SaleAmount").groupBy("OrderMonthYear").sum().orderBy("OrderMonthYear").toDF("OrderMonthYear","SaleAmount")
summary.show
// convert ordermonthyear to integer type
//val results = summary.map(df => (df.getAs[String]("OrderMonthYear").replace("-", "") , df.getAs[String]("SaleAmount"))).toDF(["OrderMonthYear","SaleAmount"])
import org.apache.spark.sql.functions._
val test = summary.withColumn("OrderMonthYear", (regexp_replace(col("OrderMonthYear").cast("String"),"-",""))).toDF("OrderMonthYear","SaleAmount")
test.printSchema()
test.show
import sparkSession.implicits._
val mlData = test.select("OrderMonthYear", "SaleAmount").
  map(row => org.apache.spark.ml.feature.LabeledPoint(
    row.getAs[Double](1),
    row.getAs[org.apache.spark.ml.linalg.Vector](0))).toDF
mlData.show
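The ClassCastException in the edit suggests that column 0 (OrderMonthYear) is still a String after regexp_replace, so getAs[Vector](0) cannot succeed. A minimal sketch of one possible fix, assuming OrderMonthYear is meant to be the single feature, is to build the vector explicitly:
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

// Read the label as Double and wrap the (string) feature in a dense vector
val mlData = test.select("OrderMonthYear", "SaleAmount")
  .map(row => LabeledPoint(
    row.getAs[Double](1),
    Vectors.dense(row.getAs[String](0).toDouble)))
  .toDF("label", "features")

mlData.show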

Facing issue while using SparkUDF with multiple arguments

I am trying to encrypt the data using SHA-256 by passing it as an argument to a Spark UDF, but I am getting the error below. Please find the program snippet and error details below.
Code Snippet:
package com.sample
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import java.security.MessageDigest
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction
import javax.xml.bind.DatatypeConverter;
import org.apache.spark.sql.Column
object Customer {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Customer-data").setMaster("local[2]").set("spark.executor.memory", "1g");
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
    //val hash_algm=sc.getConf.get("halgm")
    val hash_algm = "SHA-256"
    val df = spark.read.format("csv").option("header", "true").load("file:///home/tcs/Documents/KiranDocs/Data_files/sample_data")
    spark.udf.register("encriptedVal1", encriptedVal)
    //calling encription UDF function
    //val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df("customer_id"))).show()
    val resDF2 = df.withColumn(("ssn_number"), encriptedVal(array("customer_id", hash_algm))).show()
    println("data set" + resDF2)
    sc.stop()
  }

  def encriptedVal = udf((s: String, s1: String) => {
    val digest = MessageDigest.getInstance(s1)
    val hash = digest.digest(s.getBytes("UTF-8"))
    DatatypeConverter.printHexBinary(hash)
  })
}
Error details are below:
2019-01-21 19:42:48 INFO SparkContext:54 - Invoking stop() from shutdown hook
Exception in thread "main" java.lang.ClassCastException: com.sample.Customer$$anonfun$encriptedVal$1 cannot be cast to scala.Function1
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:104)
    at org.apache.spark.sql.expressions.UserDefinedFunction.apply(UserDefinedFunction.scala:85)
    at com.sample.Customer$.main(Customer.scala:26)
    at com.sample.Customer.main(Customer.scala)
The problem here is how you call the defined UDF. You should use it like the following:
val resDF1 = df.withColumn(("ssn_number"), encriptedVal(df.col("customer_id"), lit(hash_algm)))
because it accepts two Column objects (both columns must be of String type, as defined in your UDF).
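Put together, a minimal sketch of the corrected usage (reusing df and hash_algm from the snippet above; the UDF is defined as a val here for brevity):
import org.apache.spark.sql.functions.{lit, udf}
import java.security.MessageDigest
import javax.xml.bind.DatatypeConverter

// Same hashing logic as in the question
val encriptedVal = udf((s: String, s1: String) => {
  val digest = MessageDigest.getInstance(s1)
  DatatypeConverter.printHexBinary(digest.digest(s.getBytes("UTF-8")))
})

// Both arguments are Columns: the data column and the algorithm name wrapped in lit()
val resDF2 = df.withColumn("ssn_number", encriptedVal(df.col("customer_id"), lit(hash_algm)))
resDF2.show()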

sparkSql no such method error

I'm new to Spark SQL and I'm trying to run the examples provided by the Spark documentation, but I get a "no such method" error when running my program. What should I do?
Note: I'm using IDEA to edit my program.
All of the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.types._
object SqlTest1 {

  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._
    runBasicDataFrameExample(spark)
  }

  private def runBasicDataFrameExample(spark: SparkSession) = {
    val df = spark.read.json("resorces/people.json")
    df.show()
  }
}
val conf = new SparkConf(true).setAppName("appName")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").json(path)
It may be that the Scala version does not match. You can check the Scala version against the Spark version you are using.
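As an illustration, a minimal sbt sketch where the Scala version and the Spark artifacts agree (the version numbers here are assumptions; use the ones matching your installation):
// build.sbt -- %% appends the Scala binary suffix that matches scalaVersion,
// so the Spark artifacts and the compiler stay on the same Scala line
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.2",
  "org.apache.spark" %% "spark-sql"  % "3.1.2"
)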