Convert RDD[(String, String)] to RDD[(Int, Int)] - apache-spark-sql

I'm new to Spark and having trouble figuring out how to convert the data types of RDD elements. I have the following text file:
1 2
2 3
3 4
When I create a new RDD, it defaults to the String data type:
val exampleRDD = sc.textFile("example.txt").map(x => (x.split(" ")(0),x.split(" ")(1)))
exampleRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[5] at map at <console>:27
But I want it to be RDD[(Int, Int)]. I tried
val exampleRDD: RDD[(Int,Int)] = sc.textFile("example.txt").map(x => (x.split(" ")(0),x.split(" ")(1)))
but it gives the error:
error: not found: type RDD
Any help would be appreciated.

The error "not found: type RDD" occurs because you need to use the fully qualified class name, org.apache.spark.rdd.RDD (or import it). But that alone doesn't solve the problem: to get Int values, you have to convert the strings to Int.
val exampleRDD = sc.textFile("example.txt").map(x => (x.split(" ")(0).toInt,x.split(" ")(1).toInt))
Result:
exampleRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[36] at map at <console>:34
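If you also want the explicit type annotation from your attempt, import the RDD class (or use its fully qualified name). A minimal sketch combining both fixes, assuming the same space-separated example.txt:
import org.apache.spark.rdd.RDD
val exampleRDD: RDD[(Int, Int)] = sc.textFile("example.txt")
  .map(_.split(" "))
  .map(a => (a(0).toInt, a(1).toInt))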

sc.textFile("two.txt").map(_.split(" ")).map(ar => (ar(0).toInt, ar(1).toInt))
If you have a more complex format, spark-csv is a better choice for parsing the data.
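For instance, a rough sketch with the CSV reader (assuming Spark 2.x, where the spark-csv functionality ships built in as spark.read.csv; the column names and separator here are assumptions for this two-column file):
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
val df = spark.read
  .option("sep", " ")
  .schema(schema)
  .csv("example.txt")
// back to an RDD[(Int, Int)] if that is what you need
val pairRDD = df.rdd.map(r => (r.getInt(0), r.getInt(1)))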

Related

How to read Key Value pair in spark SQL?

How do I get this output using Spark SQL or Scala? I have a table with columns storing such values and need to split them into separate columns.
Input :
Output :
It pretty much depends on which libraries you want to use (as you mentioned, plain Scala or Spark).
Using Spark:
import spark.implicits._  // needed for .toDS below
val rawJson = """
{"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}
"""
spark.read.json(Seq(rawJson).toDS)
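The JSON keys become columns directly, which already gives you the separate columns asked for; a quick check (column names taken from the sample JSON above):
val parsed = spark.read.json(Seq(rawJson).toDS)
parsed.printSchema()
parsed.select("Name", "UploaddedById", "UploadedByName").show(false)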
Using common json libraries:
// play
import play.api.libs.json._
Json.parse(rawJson) match {
  case obj: JsObject =>
    val values = obj.values
    val keys = obj.keys
    // construct a dataframe from the keys and values
  case other => // handle other types (JsArray, etc.)
}
// circe
import io.circe._, io.circe.parser._
parse(rawJson) match {
  case Right(json) => // fetch key/values and construct a df, much like above
  case Left(parseError) => ...
}
You can use almost any JSON library to parse the JSON object and then convert it to a Spark DataFrame quite easily.
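As a rough sketch of that last step, assuming the Play parse above and a flat JSON object whose values are all strings (as in rawJson):
import spark.implicits._
import play.api.libs.json._
val kvDF = Json.parse(rawJson) match {
  case obj: JsObject =>
    // one (key, value) row per JSON field; .as[String] assumes string values
    obj.fields.toSeq.map { case (k, v) => (k, v.as[String]) }.toDF("key", "value")
  case _ => spark.emptyDataFrame
}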

Spark DataFrame CountVectorizerModel Error With DataType String

I have the following piece of code that tries to perform a simple action: converting a sparse vector to a dense vector. Here is what I have so far:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions._
import spark.implicits._
// Identify how many distinct values are in the OCEAN_PROXIMITY column
val distinctOceanProximities = dfRaw.select(col("ocean_proximity")).distinct().as[String].collect()
val cvmDF = new CountVectorizerModel(tags)
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")
  .transform(dfRaw)
val exprs = (0 until distinctOceanProximities.size).map(i => $"features".getItem(i).alias(s"${distinctOceanProximities(i)}"))
val vecToSeq = udf((v: Vector) => v.toArray)
val df2 = cvmDF.withColumn("features", vecToSeq($"sparseFeatures")).select(exprs: _*)
df2.show()
When I run this script, I get the following error:
java.lang.IllegalArgumentException: requirement failed: Column ocean_proximity must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type string.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:63)
at org.apache.spark.ml.feature.CountVectorizerParams.validateAndTransformSchema(CountVectorizer.scala:97)
at org.apache.spark.ml.feature.CountVectorizerParams.validateAndTransformSchema$(CountVectorizer.scala:95)
at org.apache.spark.ml.feature.CountVectorizerModel.validateAndTransformSchema(CountVectorizer.scala:272)
at org.apache.spark.ml.feature.CountVectorizerModel.transformSchema(CountVectorizer.scala:338)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.feature.CountVectorizerModel.transform(CountVectorizer.scala:306)
... 101 elided
I think it is expecting a Seq of String for the data type, but I have just a String. Any ideas on how to fix this?
It was pretty simple. All I had to do was convert the column from String to an Array of String, like this:
val oceanProximityAsArrayDF = dfRaw.withColumn("ocean_proximity", array("ocean_proximity"))
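A minimal sketch of the corrected flow, assuming the distinct values are also used as the vocabulary (the tags value from the question is not shown):
import org.apache.spark.sql.functions._
val oceanProximityAsArrayDF = dfRaw.withColumn("ocean_proximity", array("ocean_proximity"))
val cvmDF = new CountVectorizerModel(distinctOceanProximities)
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")
  .transform(oceanProximityAsArrayDF)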

From Column to Array Scala Spark

I am trying to apply a function to a Column in Scala, but I am encountering some difficulties.
I get this error:
found : org.apache.spark.sql.Column
required: Array[Double]
Is there a way to convert a Column to an Array?
Thank you
Update:
Thank you very much for your answer; I think I am getting closer to what I am trying to achieve. To give you a bit more context, here is the code:
object Targa_Indicators_Full {

  def get_quantile(variable: Array[Double], perc: Double): Double = {
    val sorted_vec: Array[Double] = variable.sorted
    val pos: Double = Math.round(perc * variable.length) - 1
    val quant: Double = sorted_vec(pos.toInt)
    quant
  }

  def main(args: Array[String]): Unit = {
    val get_quantileUDF = udf(get_quantile _)
    val plate_speed =
      trips_df.groupBy($"plate").agg(
        sum($"time_elapsed").alias("time"),
        sum($"space").alias("distance"),
        stddev_samp($"distance" / $"time_elapsed").alias("sd_speed"),
        get_quantileUDF($"distance" / $"time_elapsed", .75).alias("Quant_speed"))
      .withColumn("speed", $"distance" / $"time")
  }
}
Now I get this error:
type mismatch;
[error] found : Double(0.75)
[error] required: org.apache.spark.sql.Column
[error] get_quantileUDF($"distanza"/$"tempo_intermedio",.75).alias("IQR_speed")
^
[error] one error found
What can I do?
Thanks.
You cannot directly apply a plain Scala function to a DataFrame Column. You have to convert your existing function to a UDF; Spark lets you define custom user-defined functions (UDFs).
For example, say you have a DataFrame with an array column:
scala> val df=sc.parallelize((1 to 100).toList.grouped(5).toList).toDF("value")
df: org.apache.spark.sql.DataFrame = [value: array<int>]
You have defined a function to apply to the array-typed column:
def convert(arr: Seq[Int]): String = {
  arr.mkString(",")
}
You have to convert it to a UDF before applying it to the column:
val convertUDF = udf(convert _)
And then you can apply your function:
df.withColumn("new_col", convertUDF(col("value")))
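Regarding the type mismatch in your update: every argument passed to a UDF call must be a Column, so the raw .75 has to be wrapped with lit. A hedged sketch (note that for the quantile to make sense, the first argument would also need to be an array per group, e.g. via collect_list, and the UDF parameter may need to be Seq[Double] rather than Array[Double] depending on the Spark version):
import org.apache.spark.sql.functions.{collect_list, lit}
get_quantileUDF(collect_list($"distance" / $"time_elapsed"), lit(0.75)).alias("Quant_speed")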

How to pass a Map into a UDF in Spark

Here is my problem: I have a Map[Array[String], String], and I want to pass it into a UDF.
Here is my UDF:
def lookup(lookupMap: Map[Array[String],String]) =
  udf((input: Array[String]) => lookupMap.lift(input))
And here is my Map variable:
val srdd = df.rdd.map { row => (
  Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString),
  row.getString(7)
)}
Here is how I call the function:
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c","d"))
I first got an error about an immutable array, so I changed my array to an immutable type; then I got a type-mismatch error. I googled a bit, and apparently I can't pass a non-Column type directly into a UDF. Can somebody help? Kudos.
Update: So I converted everything to a wrapped array. Here is what I did:
val srdd = df.rdd.map{row => (WrappedArray.make[String](Array(row.getString(1),row.getString(5),row.getString(8))),row.getString(7))}
val lookupMap = srdd.collectAsMap()
def lookup(lookupMap:Map[collection.mutable.WrappedArray[String],String]) = udf((input:collection.mutable.WrappedArray[String]) => lookupMap.lift(input))
val combinedDF = dftemp.withColumn("a",lookup(lookupMap))(Array($"b",$"c",$"d"))
Now I am having an error like this:
required: Map[scala.collection.mutable.WrappedArray[String],String]
-ksh: Map[scala.collection.mutable.WrappedArray[String],String]: not found [No such file or directory]
I tried to do something like this:
val m = collection.immutable.Map(1->"one",2->"Two")
val n = collection.mutable.Map(m.toSeq: _*)
but then I was back to the column-type error.
First, you have to pass a Column as the argument of the UDF. Since you want this argument to be an array, you should use the array function in org.apache.spark.sql.functions, which creates an array Column from a series of other Columns. So the UDF call would be:
lookup(lookupMap)(array($"b",$"c",$"d"))
Now, since array columns are deserialized into mutable.WrappedArray, in order for the map lookup to succeed you'd best make sure that's the type used by your UDF:
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
  udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))
So altogether:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import scala.collection.mutable

// Create an RDD[(mutable.WrappedArray[String], String)]:
val srdd = df.rdd.map { row: Row => (
  mutable.WrappedArray.make[String](Array(row.getString(1), row.getString(5), row.getString(8))),
  row.getString(7)
)}

// collect it into a map (I assume this is what you're doing with srdd...);
// collectAsMap returns a collection.Map, so convert it to an immutable Map
val lookupMap: Map[mutable.WrappedArray[String], String] = srdd.collectAsMap().toMap

def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
  udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))

val combinedDF = dftemp.withColumn("a", lookup(lookupMap)(array($"b", $"c", $"d")))
Anna, your code for srdd/lookupMap produces an org.apache.spark.rdd.RDD[(Array[String], String)]:
val srdd = df.rdd.map { row => (
  Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString),
  row.getString(7)
)}
whereas the lookup method expects a Map as its parameter:
def lookup(lookupMap: Map[Array[String],String]) =
  udf((input: Array[String]) => lookupMap.lift(input))
That is why you are getting the type-mismatch error. To resolve it, first change srdd from an RDD of tuples to an RDD of Maps, and then convert that RDD into a Map:
val srdd = df.rdd.map { row => Map(
  Array(row.getString(1),row.getString(5),row.getString(8)).map(_.toString) ->
    row.getString(7)
)}

Read parquet file having mixed data type in a column

I want to read a Parquet file using Spark SQL in which one column has a mixed data type (string and integer).
val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.parquet("/tmp/data")
This throws an exception: Failed to merge incompatible data types IntegerType and StringType
Is there a way to explicitly cast the column during the read?
The only way I have found is to manually cast one of the fields so that they match. You can do this by reading the individual parquet files into a sequence and iteratively modifying them, like so:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import scala.util.{Try, Success, Failure}

def unionReduce(dfs: Seq[DataFrame]) = {
  dfs.reduce { (x, y) =>
    def schemaTruncate(df: DataFrame) = df.schema.map(field => field.name -> field.dataType)
    val diff = schemaTruncate(y).toSet.diff(schemaTruncate(x).toSet)
    val fixedX = diff.foldLeft(x) { case (df, (name, dataType)) =>
      Try(df.withColumn(name, col(name).cast(dataType))) match {
        case Success(newDf) => newDf
        case Failure(error) => df.withColumn(name, lit(null).cast(dataType))
      }
    }
    fixedX.select(y.columns.map(col): _*).unionAll(y)
  }
}
The above function first finds the columns that are in Y but not in X (differing by name or type). It then adds those columns to X, attempting to cast an existing column to the new type and, on failure, adding the column as a null literal of that type. Finally, it selects only Y's columns from the fixed X (in case X has columns not in Y) and returns the union of the two.
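A hedged usage sketch, with hypothetical paths, reading each part file into a sequence and merging:
// hypothetical paths to the individual parquet files
val paths = Seq("/tmp/data/file1.parquet", "/tmp/data/file2.parquet")
val dfs = paths.map(p => sqlContext.read.parquet(p))
val merged = unionReduce(dfs)
merged.printSchema()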