PySpark error loading data (JSON array) into a dataframe [duplicate]

I was trying to explode a column called phones, with the following schema and content:
customer_external_id: StringType
phones: StringType
customer_id phones
x8x46x5 [{"phone" : "(xx) 35xx4x80"},{"phone" : "(xx) xxxx46605"}]
xx44xx5 [{"phone" : "(xx) xxx3x8443"}]
4xxxxx5 [{"phone" : "(xx) x6xx083x3"},{"areaCode" : "xx"},{"phone" : "(xx) 3xxx83x3"}]
xx6564x [{"phone" : "(x3) x88xx344x"}]
xx8x4x0 [{"phone" : "(xx) x83x5x8xx"}]
xx0434x [{"phone" : "(0x) 3x6x4080"},{"areaCode" : "xx"}]
x465x40 [{"phone" : "(6x) x6x445xx"}]
x0684x8 [{"phone" : "(xx) x8x88x4x4"},{"phone" : "(xx) x8x88x4x4"}]
x84x850 [{"phone" : "(xx) 55x56xx4"}]
x0604xx [{"phone" : "(xx) x8x4xxx68"}]
4x6xxx0 [{"phone" : "(xx) x588x43xx"},{"phone" : "(xx) 5x6465xx"},{"phone" : "(xx) x835xxxx8"},{"phone" : "(xx) x5x6465xx"}]
x6x000x [{"phone" : "(xx) xxx044xx4"}]
5x65533 [{"phone" : "(xx) xx686x0xx"}]
x3668xx [{"phone" : "(5x) 33x8x3x4"},{"phone" : "(5x) 8040x8x6"}]
So I ran this code and got the following error:
df.select('customer_external_id', explode(df.phones))
AnalysisException: u"cannot resolve 'explode(`phones`)' due to data type mismatch: input to function explode should be array or map type, not StringType;;
'Project [customer_external_id#293, explode(phones#296) AS List()]\n+- Relation[order_id#292,customer_external_id#293,name#294,email#295,phones#296,phones_version#297,classification#298,locale#299] parquet\n"
From this error I found out my column was a StringType, so I ran this code to remove the brackets and parse the JSON:
phones = df.select('customer_external_id', 'phones').rdd\
.map(lambda x: str(x).replace('[','')\
.replace(']','')\
.replace('},{', ','))\
.map(lambda x: json.loads(x).get('phone'))\
.map(lambda x: Row(x))\
.toDF(df.select('customer_external_id','phones').schema)
phones.show()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 4 times, most recent failure: Lost task 0.3 in stage 38.0 (TID 2740, 10.112.80.248, executor 8): org.apache.spark.api.python.PythonException: Traceback (most recent call last)
Apparently I can't parse the string as JSON and I can't explode the column. So how can I properly deal with this kind of data to get this output:
+-----------+--------+--------------+
|customer_id|areaCode|phone |
+-----------+--------+--------------+
|x8x46x5 |null |(xx) 35xx4x80 |
|x8x46x5 |null |(xx) xxxx46605|
|xx44xx5 |null |(xx) xxx3x8443|
|4xxxxx5 |null |(xx) x6xx083x3|
|4xxxxx5 |xx |null |
|4xxxxx5 |null |(xx) 3xxx83x3 |
|xx6564x |null |(x3) x88xx344x|
|xx8x4x0 |null |(xx) x83x5x8xx|
|xx0434x |null |(0x) 3x6x4080 |
|xx0434x |xx |null |
|x465x40 |null |(6x) x6x445xx |
|x0684x8 |null |(xx) x8x88x4x4|
|x0684x8 |null |(xx) x8x88x4x4|
|x84x850 |null |(xx) 55x56xx4 |
|x0604xx |null |(xx) x8x4xxx68|
|4x6xxx0 |null |(xx) x588x43xx|
|4x6xxx0 |null |(xx) 5x6465xx |
|4x6xxx0 |null |(xx) x835xxxx8|
|4x6xxx0 |null |(xx) x5x6465xx|
|x6x000x |null |(xx) xxx044xx4|
|5x65533 |null |(xx) xx686x0xx|
|x3668xx |null |(5x) 33x8x3x4 |
|x3668xx |null |(5x) 8040x8x6 |
+-----------+--------+--------------+

What you want to do is use the from_json method to convert the string into an array and then explode:
from pyspark.sql.functions import *
from pyspark.sql.types import *
phone_schema = ArrayType(StructType([StructField("phone", StringType())]))
converted = inputDF\
.withColumn("areaCode", get_json_object("phones", "$[*].areaCode"))\
.withColumn("phones", explode(from_json("phones", phone_schema)))\
.withColumn("phone", col("phones.phone"))\
.drop("phones")\
.filter(~isnull("phone"))
converted.show()
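If you also need areaCode (as in the expected output above), one option is to declare both fields in the element schema instead of pulling areaCode out with get_json_object; that way areaCode-only entries survive as their own rows. This is only a sketch, assuming the same inputDF and column names as above and Spark 2.1+ for from_json:
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
# Each array element may carry an areaCode, a phone, or both.
full_schema = ArrayType(StructType([
    StructField("areaCode", StringType()),
    StructField("phone", StringType()),
]))
exploded = (inputDF
    .withColumn("entry", explode(from_json("phones", full_schema)))
    .select(
        col("customer_external_id").alias("customer_id"),
        col("entry.areaCode").alias("areaCode"),
        col("entry.phone").alias("phone"),
    ))
exploded.show(truncate=False)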

I think you should be able to call json.loads() directly, without the replace() calls you have shown:
1. Map the string to ArrayType(MapType()) using json.loads().
2. Call flatMap() to create a new Row() for each element of the array.
3. Map these rows to the desired output.
Take a look at the following example:
from StringIO import StringIO
from pyspark.sql import Row
import json
import pandas as pd
# mock up some sample data
data = StringIO("""customer_id\tphones
x8x46x5\t[{"phone" : "(xx) 35xx4x80"},{"phone" : "(xx) xxxx46605"}]
xx44xx5\t[{"phone" : "(xx) xxx3x8443"}]
4xxxxx5\t[{"phone" : "(xx) x6xx083x3"},{"areaCode" : "xx"},{"phone" : "(xx) 3xxx83x3"}]
xx6564x\t[{"phone" : "(x3) x88xx344x"}]
xx8x4x0\t[{"phone" : "(xx) x83x5x8xx"}]
xx0434x\t[{"phone" : "(0x) 3x6x4080"},{"areaCode" : "xx"}]
x465x40\t[{"phone" : "(6x) x6x445xx"}]
x0684x8\t[{"phone" : "(xx) x8x88x4x4"},{"phone" : "(xx) x8x88x4x4"}]
x84x850\t[{"phone" : "(xx) 55x56xx4"}]
x0604xx\t[{"phone" : "(xx) x8x4xxx68"}]
4x6xxx0\t[{"phone" : "(xx) x588x43xx"},{"phone" : "(xx) 5x6465xx"},{"phone" : "(xx) x835xxxx8"},{"phone" : "(xx) x5x6465xx"}]
x6x000x\t[{"phone" : "(xx) xxx044xx4"}]
5x65533\t[{"phone" : "(xx) xx686x0xx"}]
x3668xx\t[{"phone" : "(5x) 33x8x3x4"},{"phone" : "(5x) 8040x8x6"}]""")
pandas_df = pd.read_csv(data, sep="\t")
df = sqlCtx.createDataFrame(pandas_df) # convert pandas to spark df
# run the steps outlined above
df.rdd\
.map(lambda x: Row(customer_id=x['customer_id'], phones=json.loads(x['phones'])))\
.flatMap(lambda x: [Row(customer_id=x['customer_id'], phone=phone) for phone in x['phones']])\
.map(lambda x: Row(customer_id=x['customer_id'], phone=x['phone'].get('phone'), areaCode=x['phone'].get('areaCode')))\
.toDF()\
.select('customer_id', 'areaCode', 'phone')\
.show(truncate=False, n=100)
The output:
+-----------+--------+--------------+
|customer_id|areaCode|phone |
+-----------+--------+--------------+
|x8x46x5 |null |(xx) 35xx4x80 |
|x8x46x5 |null |(xx) xxxx46605|
|xx44xx5 |null |(xx) xxx3x8443|
|4xxxxx5 |null |(xx) x6xx083x3|
|4xxxxx5 |xx |null |
|4xxxxx5 |null |(xx) 3xxx83x3 |
|xx6564x |null |(x3) x88xx344x|
|xx8x4x0 |null |(xx) x83x5x8xx|
|xx0434x |null |(0x) 3x6x4080 |
|xx0434x |xx |null |
|x465x40 |null |(6x) x6x445xx |
|x0684x8 |null |(xx) x8x88x4x4|
|x0684x8 |null |(xx) x8x88x4x4|
|x84x850 |null |(xx) 55x56xx4 |
|x0604xx |null |(xx) x8x4xxx68|
|4x6xxx0 |null |(xx) x588x43xx|
|4x6xxx0 |null |(xx) 5x6465xx |
|4x6xxx0 |null |(xx) x835xxxx8|
|4x6xxx0 |null |(xx) x5x6465xx|
|x6x000x |null |(xx) xxx044xx4|
|5x65533 |null |(xx) xx686x0xx|
|x3668xx |null |(5x) 33x8x3x4 |
|x3668xx |null |(5x) 8040x8x6 |
+-----------+--------+--------------+
I'm not sure if this is the output you were hoping for, but this should help you get there.

Related

Append new rows to a Spark dataframe based on a condition

I need some help resolving this tricky transformation.
My Spark dataframe looks like this:
+---+---+--------+---------+-------+--------+---------+
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
+---+---+--------+---------+-------+--------+---------+
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
+---+---+--------+---------+-------+--------+---------+
I need to add new records (rows) for a combination of A and B (having edit_flag as false) if the item_nbr matches another A and B combination that has edit_flag as true.
The new row will have every column copied from its parent row except rcv_qty and rcvr_nbr. So the final output will look like:
+---+---+--------+---------+-------+--------+---------+
| A| B|rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
+---+---+--------+---------+-------+--------+---------+
|123| 1| 500| 10| 2| 1001| false|
|123| 1| 501| 10| 2| 1001| false|
|123| 1| 502| 10| 5| 1001| false|
|123| 4| 502| 60| 5| 1001| true|
|123| 2| 504| 40| 30| 1003| false|
|123| 2| 510| 40| 10| 1003| false|
|123| 5| 510| 10| 10| 1003| true|
+---+---+--------+---------+-------+--------+---------+
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import spark.implicits._
case class Source(
  A: Int,
  B: Int,
  rcvr_nbr: Int,
  order_qty: Int,
  rcv_qty: Int,
  item_nbr: Int,
  edit_flag: Boolean
)
val sourceDF = Seq(
  Source(123, 1, 500, 10, 2, 1001, false),
  Source(123, 1, 501, 10, 2, 1001, false),
  Source(123, 4, 502, 60, 5, 1001, true),
  Source(123, 2, 504, 40, 30, 1003, false),
  Source(123, 5, 510, 10, 10, 1003, true)
).toDF()
sourceDF.printSchema()
// root
// |-- A: integer (nullable = false)
// |-- B: integer (nullable = false)
// |-- rcvr_nbr: integer (nullable = false)
// |-- order_qty: integer (nullable = false)
// |-- rcv_qty: integer (nullable = false)
// |-- item_nbr: integer (nullable = false)
// |-- edit_flag: boolean (nullable = false)
sourceDF.show(false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |501 |10 |2 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+
val sourceDFTrueF = sourceDF.filter(col("edit_flag").equalTo(true))
val sourceDFTrue = sourceDFTrueF.columns.foldLeft(sourceDFTrueF) {
  (tmpDF, col) => tmpDF.withColumnRenamed(col, s"${col}_true")
}
val sourceDFFalse = sourceDF
  .filter(col("edit_flag").equalTo(false))
  .dropDuplicates("item_nbr")
val resDF =
  sourceDFFalse
    .join(
      sourceDFTrue,
      sourceDFFalse.col("item_nbr") === sourceDFTrue.col("item_nbr_true"),
      "inner"
    )
    .select(
      sourceDFFalse.col("A"),
      sourceDFFalse.col("B"),
      sourceDFTrue.col("rcvr_nbr_true").alias("rcvr_nbr"),
      sourceDFFalse.col("order_qty"),
      sourceDFTrue.col("rcv_qty_true").alias("rcv_qty"),
      sourceDFFalse.col("item_nbr"),
      sourceDFFalse.col("edit_flag")
    )
    .union(sourceDF)
    .orderBy(col("A"), col("item_nbr"), col("edit_flag"))
resDF.show(false)
// +---+---+--------+---------+-------+--------+---------+
// |A |B |rcvr_nbr|order_qty|rcv_qty|item_nbr|edit_flag|
// +---+---+--------+---------+-------+--------+---------+
// |123|1 |501 |10 |2 |1001 |false |
// |123|1 |500 |10 |2 |1001 |false |
// |123|1 |502 |10 |5 |1001 |false |
// |123|4 |502 |60 |5 |1001 |true |
// |123|2 |504 |40 |30 |1003 |false |
// |123|2 |510 |40 |10 |1003 |false |
// |123|5 |510 |10 |10 |1003 |true |
// +---+---+--------+---------+-------+--------+---------+

Counting longest sequence of specific elements in a list contained within a spark.sql database column

I have the following problem I would like to solve.
Here is the Dataframe I created from a query:
val temp = spark.sql("select Id, collect_list(from) as letter from f group by Id")
+---+---------------+
| Id|         letter|
+---+---------------+
|106|            [c]|
|101|            [p]|
|104|[c, c, c, t, u]|
|100|[d, t, j, j, c]|
|110|      [p, n, f]|
|113|[s, c, c, b, ..|
|115|[u, s, t, c, ..|
| 11|   [c, c, i, s]|
|117|   [d, d, p, s]|
|118|[a, s, c, t, ..|
|123|         [d, n]|
|125|         [n, b]|
|128|            [c]|
|131|   [c, t, c, u]|
|132|      [c, u, i]|
|134|[c, p, j, u, c]|
|136|[b, a, t, n, c]|
|137|         [b, a]|
|138|      [b, t, c]|
|141|            [s]|
I would like to create a new column called "n". This column would contain a numerical value representing the longest sequence of letters in a cell before a "c" appears; the longest sequence can be anywhere in the list.
For example, the solution column for this section (assuming nothing is cut off by the "..") would be:
0, 1, 3, 5, 3, 2, 4, 4, 4, 4, 2, 2, 1, 4, 2, 5, 5, 2, 3, 1
Any help would be greatly appreciated. Thank you!
Here is how you can do it with Spark's built-in functions; the plain Scala logic can be expressed with Spark functions as below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

df.withColumn("n_trip",
  array_max(
    transform(
      filter(
        split(array_join($"trip", " "), "co"),
        (col: Column) => (col =!= "" || col =!= null)
      ),
      (col: Column) => size(split(trim(col), " "))
    )
  ))
  .withColumn("n_trip", when($"n_trip".isNull, 0).otherwise($"n_trip"))
  .show(false)
Update: an easier-to-read version:
df.withColumn("split", split(array_join($"trip", " "), "co"))
.withColumn("filter", filter($"split", (col: Column) => col =!= "" || col =!= null))
.withColumn("n_trip", array_max(transform($"filter", (col: Column) => size(split(trim(col), " ")))))
.withColumn("n_trip", when($"n_trip".isNull, 0).otherwise($"n_trip"))
.drop("split", "filter")
.show(false)
Output:
+-----------+--------------------+------+
|passengerId|trip |n_trip|
+-----------+--------------------+------+
|10096 |[co] |0 |
|10351 |[pk] |1 |
|10436 |[co, co, cn, tj, us]|3 |
|1090 |[dk, tj, jo, jo, ch]|5 |
|11078 |[pk, no, fr] |3 |
|11332 |[sg, cn, co, bm] |2 |
|11563 |[us, sg, th, cn] |4 |
|1159 |[ca, cl, il, sg] |4 |
|11722 |[dk, dk, pk, sg] |4 |
|11888 |[au, se, ca, tj] |4 |
|12394 |[dk, nl] |2 |
|12529 |[no, be] |2 |
|12847 |[cn] |1 |
|13192 |[cn, tk, cg, uk] |4 |
|13282 |[co, us, iq] |2 |
|13442 |[cn, pk, jo, us, ch]|5 |
|13610 |[be, ar, tj, no, ch]|5 |
|13772 |[be, at] |2 |
|13865 |[be, th, cn] |3 |
|14157 |[sg] |1 |
+-----------+--------------------+------+
You could write a user defined function (udf) that computes what you wish. There are plenty of ways to compute the longest sequence. One simple way is to split the sequence on "co", compute the size of each subsequence and take the max.
val longuest_seq = udf((x: Seq[String]) => {
  x.reduce(_ + " " + _)
    .split(" *co *")
    .map(_.count(_ == ' ') + 1)
    .max
})
val df = Seq(
  (1, Array("x", "y", "co", "z")),
  (2, Array("x")),
  (3, Array("co", "t")),
  (4, Array("a", "b", "c", "d", "co", "e"))
).toDF("id", "trip")
df.withColumn("n_trips", longuest_seq('trip)).show
which yields
+---+-------------------+-------+
| id| trip|n_trips|
+---+-------------------+-------+
| 1| [x, y, co, z]| 2|
| 2| [x]| 1|
| 3| [co, t]| 1|
| 4|[a, b, c, d, co, e]| 4|
+---+-------------------+-------+

How do I pass parameters to selectExpr? SparkSQL-Scala

:)
When you have a data frame, you can add columns and fill their rows with the selectExpr method.
Something like this:
scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
| OlcM| h|999999999| J| 0|
| zOcQ| r|777777777| J| 1|
| kyGp| t|333333333| J| 2|
| BEuX| A|999999999| F| 3|
scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo")
tabla: org.apache.spark.sql.DataFrame = [idempr: string, tipperrd: string, codperrd: decimal(9,0), tipperrt: string, codperrt: decimal(9,0), Saludo: string]
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hola|
| zOcQ| r|777777777| J| 1| hola|
| kyGp| t|333333333| J| 2| hola|
| BEuX| A|999999999| F| 3| hola|
My point is:
I define a String and call a method which uses this String parameter to fill a column in the data frame. But I am not able to get the select expression to pick up the string (I tried $, +, etc.). I want to achieve something like this:
scala> var english = "hello"
scala> def generar_informe(df: DataFrame, tabla: String) {
var selectExpr_df = df.selectExpr(
"TIPPERSCON_BAS as TIP.PERSONA CONTACTABILIDAD",
"CODPERSCON_BAS as COD.PERSONA CONTACTABILIDAD",
"'tabla' as PUNTO DEL FLUJO" )
}
scala> generar_informe(df,english)
.....
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hello|
| zOcQ| r|777777777| J| 1| hello|
| kyGp| t|333333333| J| 2| hello|
| BEuX| A|999999999| F| 3| hello|
I tried:
scala> var result = tabl.selectExpr("A", "B", "$tabla as C")
scala> var abc = tabl.selectExpr("A", "B", ${tabla} as C)
<console>:31: error: not found: value $
var abc = tabl.selectExpr("A", "B", ${tabla} as C)
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
scala> sqlContext.sql("set tabla='hello'")
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
SAME ERROR:
java.lang.RuntimeException: [1.1] failure: identifier expected
${tabla} as C
^
at scala.sys.package$.error(package.scala:27)
Thanks in advance!
Can you try this?
def generar_informe(df: DataFrame, english: String) = {
  df.selectExpr(
    "transactionId", "customerId", "itemId", "amountPaid",
    s"""'${english}' as saludo""")
}

val english = "hello"
generar_informe(data, english).show()
This is the output I got.
17/11/02 23:56:44 INFO CodeGenerator: Code generated in 13.857987 ms
+-------------+----------+------+----------+------+
|transactionId|customerId|itemId|amountPaid|saludo|
+-------------+----------+------+----------+------+
| 111| 1| 1| 100.0| hello|
| 112| 2| 2| 505.0| hello|
| 113| 3| 3| 510.0| hello|
| 114| 4| 4| 600.0| hello|
| 115| 1| 2| 500.0| hello|
| 116| 1| 2| 500.0| hello|
| 117| 1| 2| 500.0| hello|
| 118| 1| 2| 500.0| hello|
| 119| 2| 3| 500.0| hello|
| 120| 1| 2| 500.0| hello|
| 121| 1| 4| 500.0| hello|
| 122| 1| 2| 500.0| hello|
| 123| 1| 4| 500.0| hello|
| 124| 1| 2| 500.0| hello|
+-------------+----------+------+----------+------+
17/11/02 23:56:44 INFO SparkContext: Invoking stop() from shutdown hook

Find date of each week from week number in Spark Dataframe

I want to add a column with the date of each corresponding week to the Dataframe (appending the Friday date for each week).
My Dataframe looks like this:
+----+------+---------+
|Week| City|sum(Sale)|
+----+------+---------+
| 29|City 2| 72|
| 28|City 3| 48|
| 28|City 2| 19|
| 27|City 2| 16|
| 28|City 1| 84|
| 28|City 4| 72|
| 29|City 4| 39|
| 27|City 3| 42|
| 26|City 3| 68|
| 27|City 1| 89|
| 27|City 4| 104|
| 26|City 2| 19|
| 29|City 3| 27|
+----+------+---------+
I need to convert it to the dataframe below:
+----+------+---------+------------------+
|Week|  City|sum(Sale)|  day (MM/dd/yyyy)|
+----+------+---------+------------------+
|  29|City 2|       72|Friday(07/21/2017)|
|  28|City 3|       48|Friday(07/14/2017)|
|  28|City 2|       19|Friday(07/14/2017)|
|  27|City 2|       16|Friday(07/07/2017)|
|  28|City 1|       84|Friday(07/14/2017)|
|  28|City 4|       72|Friday(07/14/2017)|
|  29|City 4|       39|Friday(07/21/2017)|
|  27|City 3|       42|Friday(07/07/2017)|
|  26|City 3|       68|Friday(06/30/2017)|
|  27|City 1|       89|Friday(07/07/2017)|
|  27|City 4|      104|Friday(07/07/2017)|
|  26|City 2|       19|Friday(06/30/2017)|
|  29|City 3|       27|Friday(07/21/2017)|
+----+------+---------+------------------+
please help me
You can write a simple UDF that gets the date by adding the week number to a base date.
Here is a simple example:
import java.time.LocalDate
import java.time.format.DateTimeFormatter

import org.apache.spark.sql.functions.udf
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
val getDateFromWeek = udf((week: Int) => {
  // create a default date for week 1
  val week1 = LocalDate.of(2016, 12, 30)
  val day = "Friday"
  // add week from the week column
  val result = week1.plusWeeks(week).format(DateTimeFormatter.ofPattern("MM/dd/yyyy"))
  // return result as Friday (date)
  s"${day} (${result})"
})
//use the udf and create a new column named day
data.withColumn("day", getDateFromWeek($"week")).show
Can anyone convert this to PySpark?
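In response to the comment above, here is a rough PySpark sketch of the same idea (untested; it assumes a DataFrame named data with an integer week column, and reuses the Friday 2016-12-30 anchor from the Scala answer):
from datetime import date, timedelta
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def week_to_friday(week):
    # anchor date taken from the Scala answer above (a Friday)
    week1 = date(2016, 12, 30)
    friday = week1 + timedelta(weeks=week)
    return "Friday ({})".format(friday.strftime("%m/%d/%Y"))

get_date_from_week = udf(week_to_friday, StringType())
data.withColumn("day", get_date_from_week(col("week"))).show()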

Exception on special characters on Spark Dataframe returned by DropNA

I am using na.drop in Scala to return a new dataframe with non-null and non-special-character values.
When I do an RDD conversion of the DataFrame, I can see the non-null values and the special characters correctly.
But when I do a map operation on the RDD to convert to Double, I hit an exception because there are still special characters in the RDD, and it crashes.
If I do the same operation on a DataFrame with no special characters, it works fine.
Please suggest a solution for this issue.
inputData.show
+------------+-----------+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+------------+-----------+------------+-----------+-------+
| 5.1| 3.5| 1.4| 0.2| |
| 4.9| 3.0| 1.4| 0.2| setosa|
| | null| 1.3| 0.1| setosa|
| 4.6| null| 1.5| ?| setosa|
| | 3.6| null| | setosa|
| 5.4| 3.9| 1.7| 0.4| setosa|
| 4.6| 3.4| 1.4| 0.3| setosa|
| 5| 3.4| 1.5| 0.2| setosa|
+------------+-----------+------------+-----------+-------+
Returned new Dataframe
val inputData1: DataFrame = inputData.na.drop()
+------------+-----------+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+------------+-----------+------------+-----------+-------+
| 5.1| 3.5| 1.4| 0.2| |
| 4.9| 3.0| 1.4| 0.2| setosa|
| 5.4| 3.9| 1.7| 0.4| setosa|
| 4.6| 3.4| 1.4| 0.3| setosa|
| 5| 3.4| 1.5| 0.2| setosa|
+------------+-----------+------------+-----------+-------+
val rdd_fromDf =inputData1.rdd
+------------+
|Sepal_Length|
+------------+
| 5.1|
| 4.9|
| 5.4|
| 4.6|
| 5|
+------------+
Mapping operation to convert to RDD[Double]
val mapReturnRDD = rdd_fromDf.map(row => row.getDouble(0))
The exception stack trace is as below:
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
at org.apache.spark.sql.Row$class.getDouble(Row.scala:243)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:192)
at $line286.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$pangea_Missing$$anonfun$5.apply(<console>:260)
at $line286.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$pangea_Missing$$anonfun$5.apply(<console>:260)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/02/17 15:13:01 ERROR Executor: Exception in task 1.0 in stage 417.0 (TID 654)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
(same stack trace as above)
16/02/17 15:13:01 WARN TaskSetManager: Lost task 1.0 in stage 417.0 (TID 654, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
(same stack trace as above)
16/02/17 15:13:01 ERROR TaskSetManager: Task 1 in stage 417.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 417.0 failed 1 times, most recent failure: Lost task 1.0 in stage 417.0 (TID 654, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
(same stack trace as above)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:910)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$pangea_Missing$.pangea_DFtoRDD_LabeledPoint(<console>:264)
(spark-shell REPL wrapper and spark-submit frames omitted)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
(same stack trace as above)
It is not clear how you load the data, but one thing is for sure: there is no such thing as a "special character" in Spark SQL. Since the output you've provided shows both characters (like "?") and empty fields (like "", which are neither NULL nor NaN), a safe assumption is that the columns of your DataFrame are simply Strings (just check inputData.schema or inputData.printSchema).
Typed methods on Row like getDouble or getInt are equivalent to:
get(i).asInstanceOf[T]
where get is of type Int => Any. This means these methods are not type safe and don't perform any casting. If you want to cast, you can either use a data-source-specific option (like inferSchema from spark-csv) or simply cast the columns to the required types:
import org.apache.spark.sql.functions.col
val exprs = Seq("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
exprs.foldLeft(inputData)((df, c) => df.withColumn(c, col(c).cast("double")))
The Column.cast method has the same semantics as a standard SQL cast, hence it is type safe (if the data cannot be cast to the given type, the result is undefined, in other words NULL).
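Since the question at the top of this page is about PySpark, here is a minimal sketch of the same column-casting approach in PySpark (assuming a DataFrame named inputData with the string columns shown above):
from functools import reduce
from pyspark.sql.functions import col

numeric_cols = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
# Cast each string column to double; values that don't parse (e.g. "" or "?") become NULL,
# mirroring the SQL cast semantics described above.
casted = reduce(lambda df, c: df.withColumn(c, col(c).cast("double")),
                numeric_cols, inputData)
casted.printSchema()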