How to retrieve the CrossValidator bestModel ALS regParam value? - apache-spark-ml

I'm training an ALS model with a CrossValidator:
val als = new ALS()
.setMaxIter(5)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val evaluator = new RegressionEvaluator()
.setMetricName("rmse")
.setLabelCol("rating")
.setPredictionCol("prediction")
val paramGrid = new ParamGridBuilder()
.addGrid(als.regParam, Array(0.001, 0.01, 0.1, 1))
.build()
val cv = new CrossValidator()
.setEstimator(als)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)
val cvModel = cv.fit(training)
I would like to inspect the chosen regParam value. I've tried this:
val bestRegParam = cvModel.bestModel.getRegParam()
However, I get the exception:
value getRegParam is not a member of org.apache.spark.ml.Model[_$5]

Usually you have to cast bestModel to a specific model type, e.g. ALSModel.
However, the ALSModel class doesn't expose a regParam field, so unfortunately I don't think there is a way to extract the chosen regParam from it; that's really a question for the Spark developers.
You could turn on logging for CrossValidator, as it logs the chosen best set of parameters.
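As a minimal illustration of that cast (using the cvModel from the question), this is what ALSModel does and does not expose:
import org.apache.spark.ml.recommendation.ALSModel

val bestAls = cvModel.bestModel.asInstanceOf[ALSModel]
println(bestAls.rank)   // rank and the userFactors/itemFactors DataFrames are available
// bestAls.getRegParam  // does not compile: regParam is not a member of ALSModel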

Apply a function into grouped dataframe using Scala Spark

I'm trying to run DBSCAN within each group of latitudes and longitudes from users. The implementation of this clustering algorithm was done by irvingc here. I bumped up all dependencies to make the code work properly in my env.
Describing the situation: I have a DataFrame composed of events from users; each event has an id, a lat, and a long, and you can see the columns in this case class. I transform the DataFrame to a Dataset so I can use the groupByKey and mapGroups methods to apply a function to the grouped data. However, the DBSCAN I'm using receives an RDD[linalg.Vector], so I have to transform each group into a Vector of lat/lon, and this transformation gives the error SPARK-28702. Can you give some advice on how to handle this issue?
case class StayDataset(objectID: Long, latitude: Double, longitude: Double, timeStart: Long, timeEnd: Long)
var dfs: Array[DataFrame] = Array()
val s = dataset.groupByKey(k => k.objectID).mapGroups {
  case (k, iter) => {
    val df = POIDetection.groupStayPointsFromUser(k, iter, dataset.sparkSession)
    dfs = dfs ++ Array(df)
    k
  }
}
def groupStayPointsFromUser(k: Long, dataset: Iterator[StayDataset], spark: SparkSession): DataFrame = {
  val points = dataset.map(row => Vectors.dense(Array(row.latitude, row.longitude))).toSeq
  val rddVector = spark.sparkContext.parallelize(points)
  val size = points.length
  val model = DBSCAN.train(rddVector, eps = 20, minPoints = (size * 0.18).toInt, maxPointsPerPartition = (size / 4).toInt)
  val pointRDD = new PointRDD(model.labeledPoints.map(p => {
    val point = POIDetection.geoFactory.createPoint(new Coordinate(p.x, p.y))
    point.setUserData(p.cluster.toString())
    point
  }))
  val df = Adapter.toDf(pointRDD, Seq("cluster"), spark)
    .select(col("cluster").cast("long"), col("geometry"))
  df
}
I think this problem arises whenever we want to apply something like KNN to grouped data. How can that be done?
I don't quite understand what you want to achieve, but first you need to create an RDD[linalg.Vector]. Assuming you already have the Dataset of StayDataset, to retrieve the RDD you first map the Dataset of StayDataset to linalg.Vector:
val dsVector = dataset.map(rec => linalg.Vectors.dense(rec.latitude, rec.longitude))
and then you retrieve the rdd[linalg.Vector]:
val rdd = dsVector.rdd
and you pass the rdd to your DBSCAN:
DBSCAN.train(rdd, ...)
That's what you need in order to get an RDD to train on.
I think you also need to do some aggregation beforehand; if that's the case, you'll have to manipulate the Dataset you have accordingly.
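Putting those pieces together, here is a minimal sketch. It assumes the Dataset[StayDataset] from the question is called dataset, that the third-party DBSCAN implementation expects the older mllib Vector type (an assumption on my part), and that the Kryo encoder and the eps / minPoints / maxPointsPerPartition values are just placeholders:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Encoders

implicit val vectorEncoder = Encoders.kryo[Vector] // encoder so Dataset.map can produce Vectors

val dsVector = dataset.map(rec => Vectors.dense(Array(rec.latitude, rec.longitude)))
val rdd = dsVector.rdd // RDD[Vector] to feed into DBSCAN
val model = DBSCAN.train(rdd, eps = 20, minPoints = 10, maxPointsPerPartition = 250)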

How to deserialize JSON from Kafka Consumer Record

I'm looking to access some fields on a Kafka ConsumerRecord. I'm able to receive the event data, which is a Java object, i.e. ConsumerRecord(topic = test.topic, partition = 0, leaderEpoch = 0, offset = 0, CreateTime = 1660933724665, serialized key size = 32, serialized value size = 394, headers = RecordHeaders(headers = [], isReadOnly = false), key = db166cbf1e9e438ab4eae15093f89c34, value = {"eventInfo":...}).
I'm able to access the eventInfo values, which come back as a JSON string. I'm fairly new to Kotlin and Kafka, so I'm not entirely sure this is correct, but I basically want to access the fields in value. However, I can't get rid of an error that appears when trying to use mapper.readValue, which is:
None of the following functions can be called with the arguments supplied.
import com.afterpay.shop.favorites.model.Product
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import org.apache.avro.generic.GenericData.Record
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.springframework.kafka.annotation.KafkaListener
import org.springframework.kafka.support.Acknowledgment
import org.springframework.stereotype.Component
@Component
class KafkaConsumer {
    @KafkaListener(topics = ["test.topic"], groupId = "group-id")
    fun consume(consumerRecord: ConsumerRecord<String, Any>, ack: Acknowledgment) {
        val mapper = jacksonObjectMapper()
        val value = consumerRecord.value()
        val record = mapper.readValue(value, Product::class.java)
        println(value)
        ack.acknowledge()
    }
}
Is this the correct way to accomplish this?
First, change ConsumerRecord<String, Any> to ConsumerRecord<String, Product>, then change value.deserializer in your consumer config/factory to use Spring Kafka's JsonDeserializer.
Then consumerRecord.value() will already be a Product instance, and you won't need an ObjectMapper at all.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#json-serde
Otherwise, if you stick with StringDeserializer, change Any to String so that the mapper.readValue argument types are correct.

How do I pull entries from an ArrayList at random in Kotlin?

This is my first Kotlin project. I am learning as I go and I have reached a roadblock.
I have an ArrayList of questions that I want to pull into the app in a random order. I've tried applying .random() at the point where the question is selected (right now it is set to mCurrentPosition - 1), but that only randomized the question text and didn't pull the correct answers along with it.
How do I either bundle the answers with the question, or is there a better way to shuffle the questions? I plan on having 50+ questions, but only 10 should show each time the test is taken, and I don't want the same 10 questions showing every time the user opens the test.
override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    setContentView(R.layout.activity_quiz_questions)
    mQuestionsList = Constants.getQuestions()
    setQuestion()
}
private fun setQuestion() {
    val question = mQuestionsList!![mCurrentPosition - 1]
    defaultOptionsView()
    if (mCurrentPosition == mQuestionsList!!.size) {
        submitBtn.text = "Finish"
    } else {
        submitBtn.text = "Submit"
    }
    progressBar.progress = mCurrentPosition
    tv_progress.text = "$mCurrentPosition" + "/" + progressBar.max
    tv_question.text = question!!.question
    test_image.setImageResource(question.image)
    tvOptionOne.text = question.optionOne
    tvOptionTwo.text = question.optionTwo
    tvOptionThree.text = question.optionThree
    tvOptionFour.text = question.optionFour
}
private fun defaultOptionsView() {
    val options = ArrayList<TextView>()
    options.add(0, tvOptionOne)
    options.add(1, tvOptionTwo)
    options.add(2, tvOptionThree)
    options.add(3, tvOptionFour)
Here is where my question list is built:
object Constants{
const val TOTAL_QUESTIONS: String = "total_questions"
const val CORRECT_ANSWERS: String = "correct_answers"
fun getQuestions(): ArrayList<Question>{
val questionsList = ArrayList<Question>()
val q1 = Question(
R.drawable.questionmark,
1,
"Who is Regional Manager of\n Dunder Mifflin Scranton?",
"Michael",
"Jim",
"Pam",
"Dwight",
1,
)
I appreciate any help at all. Thank you in advance.
Use list.shuffled().take(10). Also make your mQuestionsList property a List instead of an ArrayList, since you don't need to modify it after retrieval. You should probably make it lateinit or initialize it at its declaration site as well, so you won't have to make the type nullable and resort to !!, which is generally a code smell. So I would declare it as var mQuestionsList: List<Question> = emptyList() and, whenever you want new values, do mQuestionsList = Constants.getQuestions().shuffled().take(10).

Extract words from a string in spark hadoop with scala

I was using the code below to extract the strings I needed in Spark SQL. Now I am working with more data in Spark on Hadoop and I want to extract the same strings. I tried the same code, but it does not work.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern
//User Defined function to extract
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}
val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
How do I solve this problem?
First off, you're not using Spark the way it's meant to be used. Your DataFrame effectively isn't partitioned at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand" ...). That way, each batch of words is assigned to a different partition, and therefore to different JVMs and/or machines in the cluster (depending on the total number of partitions and the size of the data). In your solution, the entire sentence lands in a single partition, so there's no parallelism or distribution.
Second, you don't have to use a UDF (try to avoid UDFs in general).
To find the rows matching your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
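For example, here is a self-contained sketch of that word-per-row plus rlike approach (the app name and column name are just illustrative):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("hashtags").master("local[*]").getOrCreate()
import spark.implicits._

val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand")
val df = values.toDF("words")

// keep only the rows that look like hashtags; no UDF needed
df.filter(col("words") rlike "#\\w+").show()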
Hope it helps :-)

Evaluate my forecast using a Pipeline Model

Part of my code is below; I would like to know how I can evaluate my forecast.
If I want to know the importance of my features, is there a trick to use the featureImportances of RandomForestRegressionModel? Should I switch directly to a RandomForestRegressionModel and not use a PipelineModel?
I read that using a Pipeline could give better results, which is why I'm using it.
I tried using a RegressionEvaluator, but I don't get what I want.
Or should I just keep it simple, convert my DataFrame to an RDD, and use RegressionMetrics to get the mean squared error?
To summarize, I just need to know the best method to evaluate my forecast.
val assembler = new VectorAssembler()
.setInputCols(Array("customers", "year", "month", "dayOfMonth", "dayOfWeek", "weekOfYear", "dayOfYear"))
.setOutputCol("features")
val limitDate = "2017-04-01"
val trainingData = DF_2.filter(DF_2("time").lt(lit(limitDate)))
//trainingData.printSchema()
val rf = new RandomForestRegressor()
.setNumTrees(60)
.setMaxDepth(25)
.setMaxBins(100)
.setLabelCol("amount")
.setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(assembler, rf))
//Train the model
val model = pipeline.fit(trainingData)
//Make predictions
val predictions = model.transform(DF_2)
For those who need an answer, here is how I dealt with this problem.
You can "cast" a stage of your PipelineModel to the type you need using asInstanceOf, like this:
val pipeline = new Pipeline().setStages(Array(assembler, rf))
val newModel = model.stages(NumberStage).asInstanceOf[TheModelYouWant]
Replace NumberStage with the index of your algorithm in your Pipeline; in my Pipeline it was 1 (for rf).
Replace TheModelYouWant with the type of the model you need; in my case it was RandomForestRegressionModel.
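As a minimal sketch of that cast for the pipeline in the question (assembler at index 0, the random forest at index 1), which also covers the featureImportances part:
import org.apache.spark.ml.regression.RandomForestRegressionModel

val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(rfModel.featureImportances) // one weight per column fed to the VectorAssembler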
Then you can build whatever evaluation you want for your model.
If you want to transform your DataFrame into an RDD[(Double, Double)], you can use .rdd and .map:
val predictionsAndLabels = df.select("amount", "prediction").rdd.map { row => (row.getInt(0).toDouble, row.getDouble(1)) }
You now have an RDD[(Double, Double)] that you can pass to RegressionMetrics.
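For completeness, a minimal sketch of that evaluation, assuming the predictions DataFrame from the question with its "amount" label column; note that RegressionMetrics expects (prediction, observation) pairs:
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.functions.col

val predictionAndObservation = predictions
  .select(col("prediction"), col("amount").cast("double"))
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new RegressionMetrics(predictionAndObservation)
println(s"MSE  = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")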
I hope it will help someone.