mapping each column based on datatype in scala - sql

44,8602,37.19
35,5368,65.89
35,3391,40.64
44,6694,14.98
val sc = new SparkContext("local[*]", "TotalSpentByCustomer")
val input = sc.textFile("C:\\Spar\\cuscopy.csv")
val fields = input.map(x => (x.split("\t")(1).toInt, 1, 2.toFloat, 2))
val d = fields.reduceByKey((x,y) => x+y)
val results = d.collect()
results.foreach(print)
I'm getting this error:
value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int,
Int, Float, Int)]
val d = fields.reduceByKey((x,y) => x+y)
Is this the right way to parse the fields?

You could load the CSV file into a DataFrame with a given schema using the Spark SQL library and, if needed, convert it to an RDD.
// remove these lines when not using Jupyter
interp.load.ivy("org.apache.spark" %% "spark-sql" % "3.2.0")
interp.load.ivy("org.apache.spark" %% "spark-core" % "3.2.0")
import org.apache.spark.sql.types.{StructType, StructField, FloatType, IntegerType};
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate;
// create a new schema for reading the csv
val schema = new StructType()
.add("Field1", IntegerType, true)
.add("Field2", IntegerType ,true)
.add("Field3", FloatType, true)
val df = spark.read.format("csv")
.schema(schema)
.load("/vagrant/test/test.csv") //replace with desired path
// select only columns 1 & 2, not sure if this was intended by the questioner
val selected = df.select("Field1","Field2")
// convert your dataframe to a rdd
val d = selected.rdd
d.collect().foreach(println)
This outputs the following lines for the given input from the question:
[44,8602]
[35,5368]
[35,3391]
[44,6694]
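As a side note on the original error: reduceByKey is only defined for RDDs of two-element key/value tuples (it comes from PairRDDFunctions), which is why it is not a member of RDD[(Int, Int, Float, Int)]. If you want to stay with the plain RDD code from the question, here is a minimal sketch that totals the amount per customer, assuming column 0 is the customer id and column 2 is the amount spent in the comma-separated sample data:
// parse each line into a (customerId, amount) pair so reduceByKey becomes available
val pairs = input.map { line =>
  val cols = line.split(",")       // the sample data is comma-separated, not tab-separated
  (cols(0).toInt, cols(2).toFloat) // assumed meaning: (customer id, amount spent)
}
val totalByCustomer = pairs.reduceByKey(_ + _)
totalByCustomer.collect().foreach(println)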

Related

Check matching data in Dataframes in spark

How can I check whether two RDDs generated the way I did contain the same data, including the same number of rows?
I'm using ScalaTest to run the tests, with Spark 3 and Scala 2.12.12.
Below is my code, from the creation of the schemas for my two RDDs (plus the expected one) to the creation of all three RDDs with data.
-- CREATING SCHEMA FOR RDD AMOUNTS AND WAREHOUSE AND EXPECTED FINAL SCHEMA
val amountsSchema: StructType = StructType(
Seq(
StructField("positionId", LongType, nullable = true),
StructField("amount", DecimalType(10, 2), nullable = true),
StructField("eventTime",LongType, nullable = true),
)
)
val warehouseSchema: StructType = StructType(
Seq(
StructField("positionId", LongType, nullable = true),
StructField("warehouse", StringType, nullable = true),
StructField("product", StringType, nullable = true),
StructField("eventTime",LongType, nullable = true),
)
)
val expectedDfSchema: StructType = StructType(
Seq(
StructField("positionId", LongType, nullable = true),
StructField("warehouse", StringType, nullable = true),
StructField("product", StringType, nullable = true),
StructField("amount", DecimalType(10, 2), nullable = true),
)
)
--- CREATING DATA FOR RDD AMOUNTS RDD AND WAREHOUSE RDD AND EXPECTED FINAL RDD
val amounts_data = Seq(
Row("1", "5.00", "1528463387"),
Row("1", "7.20", "1528463005"),
Row("2", "5.00", "1528463097"),
Row("2", "7.20", "1528463007"),
Row("3", "6.00", "1528463078"),
Row("4", "24.20", "1528463008"),
Row("4", "15.00", "1528463100"),
)
val wh_data = Seq(
Row("1", "W-1", "P-1", "1528463098"),
Row("2", "W-2", "P-2", "1528463097"),
Row("3", "W-2", "P-3", "1528463078"),
Row("4", "W-1", "P-6", "1528463100"),
)
val expected_data = Seq(
Row("1", "W-1", "P-1", "5.00"),
Row("2", "W-2", "P-2", "5.00"),
Row("3", "W-2", "P-3", "6.00"),
Row("4", "W-1", "P-6", "15.00")
)
---- CREATING RDD WITH SCHEMAS AND DATA FOR DF_AMOUNTS AND DF_WAREHOUSE AND FOR THE EXPECTED RDD WITH EXPECTED_DATA
val df_amounts: DataFrame = spark.createDataFrame(
spark.sparkContext.parallelize(amounts_data),
amountsSchema
)
val df_wh: DataFrame = spark.createDataFrame(
spark.sparkContext.parallelize(wh_data),
warehouseSchema
)
val df_expected: DataFrame = spark.createDataFrame(
spark.sparkContext.parallelize(expected_data),
expectedDfSchema
)
---- USING THE FUNCTION get_amounts TO GENERATE A DATAFRAME
val resDf: DataFrame = get_amounts(df_amounts, df_wh)
---- TESTING IF THE resDf SCHEMA MATCHES THE EXPECTED SCHEMA - IT DOES, TEST PASSED
test("DataFrame Schema Test") {
assert(assertSchema(resDf.schema, df_expected.schema))
}
---- TESTING IF THE resDf DATA MATCHES THE EXPECTED DATA - IT DOESN'T MATCH
test("DataFrame Data Test") {
assert(assertData(resDf, df_expected))
}
}
The assertData function is used to match the data of the expected DataFrame against the one coming from my function get_amounts, but it fails the test.
def assertData(df1: DataFrame, df2: DataFrame): Boolean = {
df1.exceptAll(df2).rdd.isEmpty()
}
Thank You
The way you create the datasets is valid, and the test structure looks good as well.
I would suggest improving your assert method to see why the test case fails. Here are some thoughts on your testing approach:
exceptAll is not perfect for testing: if df2 contains an additional row, it will still report that the data matches. Consider the code below:
val df1 = Seq(
(1, "x"),
(2, "y")
).toDF("x", "y")
val df2 = Seq(
(1, "x"),
(2, "y"),
(3, "z")
).toDF("x", "y")
assert(df1.exceptAll(df2).rdd.isEmpty())
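If you do keep exceptAll, one way to catch the extra-row case is to check the difference in both directions; a minimal sketch:
import org.apache.spark.sql.DataFrame
def assertDataBothWays(df1: DataFrame, df2: DataFrame): Boolean = {
  // both differences must be empty, otherwise extra rows in either frame go unnoticed
  df1.exceptAll(df2).isEmpty && df2.exceptAll(df1).isEmpty
}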
"this function resolves columns by position (not by name)" (from Spark code scala docs), due to this sometimes you can get confused about your test result.
your assert method says nothing about what exactly mismatched
For testing purposes is not bad to collect (small amount of) data and match sequences.
You can consider using a method like this one:
def matchDF(resultDF: DataFrame, expectedDF: DataFrame): Unit = {
resultDF.printSchema()
expectedDF.printSchema()
assert(resultDF.schema == expectedDF.schema,
s"Schema does not match: ${resultDF.schema} != ${expectedDF.schema}")
val expected = expectedDF.collect().toSeq
val result = resultDF.collect().toSeq
assert(expected == result, s"Data does not match: $result != $expected")
}
It's not a perfect approach (still depends on the position in a row), but at least you will be able to find out what is going on and why your test fails.
For wrong data you'll see this:
assertion failed: Data does not match: WrappedArray([1,x], [2,y]) != WrappedArray([1,x], [3,y])
For wrong schema you'll get:
root
|-- x: integer (nullable = false)
|-- y: string (nullable = true)
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
Exception in thread "main" java.lang.AssertionError: assertion failed: Schema does not match
I hope this will help you understand what is going wrong.
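One more refinement worth considering: collect() gives no ordering guarantee, so if get_amounts does not sort its output, order both sides on a key column before collecting; a sketch assuming positionId is present in both frames:
// sort both frames on the same key so the collected sequences are comparable
val expected = expectedDF.orderBy("positionId").collect().toSeq
val result = resultDF.orderBy("positionId").collect().toSeq
assert(expected == result, s"Data does not match: $result != $expected")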

Schema for type org.apache.spark.sql.types.DataType is not supported

I'm trying to create an empty DataFrame with a schema:
val sparkConf = new SparkConf()
.setAppName("app")
.setMaster("local")
val sparkSession = SparkSession
.builder()
.config(sparkConf)
.getOrCreate()
val sparkContext = sparkSession.sparkContext
var tmpScheme = StructType(
StructField("source_id", StringType, true) :: Nil)
var df = conf.SparkConf.sparkSession.createDataFrame(tmpScheme)
and got Schema for type org.apache.spark.sql.types.DataType is not supported ...
I don't understand why - DataType does not even appear in my imports:
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructField, StructType}
What could be the problem here?
PS: Spark version:
"org.apache.spark" %% "spark-sql" % "3.2.2", // spark
"org.apache.spark" %% "spark-core" % "3.2.2", // spark
If you check the documentation, you can see that the fields argument of StructType is of type Array[StructField] and you are passing a StructField.
This means that you should wrap your StructField in an Array, for example:
val simpleSchema = StructType(Array(
StructField("source_id", StringType, true))
)
Good luck!
EDIT
The case with one parameter in createDataFrame:
// the case class provides both the schema and the column names via reflection
case class Data(number: Int, word: String)
val data = Seq(
Data(1, "test"),
Data(2, "test2")
)
val dataDf = spark.createDataFrame(data)
dataDf.show(10, false)
The case with two parameters in createDataFrame:
val someSchema = List(
StructField("number", IntegerType, true),
StructField("word", StringType, true)
)
val someData = Seq(Row(1, "test"), Row(2, "test2"))
val someDF = spark.createDataFrame(
spark.sparkContext.parallelize(someData),
StructType(someSchema)
)
The output for both cases is the same:
+------+-----+
|number|word |
+------+-----+
|1 |test |
|2 |test2|
+------+-----+
In your case, the schema is trying to be inferred from the attributes of the class (StructType) and to be populated with the StructField source_id. StructType extends DataType, and that is where your error comes from (Spark cannot resolve the type).
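Since the original goal was an empty DataFrame with that schema, a common pattern is to combine the schema with an empty RDD[Row]; a minimal sketch using the question's sparkSession:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val emptySchema = StructType(Array(StructField("source_id", StringType, true)))
// an empty RDD[Row] plus an explicit schema yields an empty DataFrame with the right columns
val emptyDf = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], emptySchema)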

in scala how to convert the result in sql query row to double

I'm trying to get the result of a Spark SQL query and do some calculations on it in Scala.
val sql_DF = spark.sql("SELECT count(distinct(my_id)) total_id FROM some_ids_table ")
val total_ids = sql_DF.select("total_id").first().toSeq.asInstanceOf[Seq[Double]][0]
val sql_DF01 = spark.sql("SELECT count(distinct(other_id)) other_ids FROM some_ids_table where column_value1 = 1")
val other_id_1 = sql_DF01.select("other_ids").first().toSeq.asInstanceOf[Seq[Double]][0]
println(other_id_1/total_ids)
I got this error:
error: identifier expected but integer literal found.
val total_ids = sql_DF.select("total_id").first().toSeq.asInstanceOf[Seq[Double]][0]
How do I convert the result in a SQL query row to a double so that I can do some math on it?
e.g.
other_ids / total_ids
thanks
You can use the CAST function, as below:
scala> val sql_DF = spark.sql("SELECT count(distinct(eid)) total_id FROM employee ")
sql_DF: org.apache.spark.sql.DataFrame = [total_id: bigint]
Using the CAST function, the bigint is converted to a double:
scala> val sql_DF = spark.sql("SELECT CAST(count(distinct(eid)) AS DOUBLE) total_id FROM employee ")
sql_DF: org.apache.spark.sql.DataFrame = [total_id: double]
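Once the column is cast to a double, the value can be pulled out of the single result row for further arithmetic; a small sketch:
// total_id is now a double column, so getDouble(0) works on the one-row result
val total_id: Double = sql_DF.first().getDouble(0)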
The question didn't seem quite clear to me. Here's how you generally handle this kind of situation.
import spark.implicits._
val df = Seq(("a",90.0),("b",30.0)).toDF("name","marks")
Collect the data in column "marks", which will be of type Array[Row]:
val rows: Array[Row] = df.select("marks").collect()
You specifically need to get each element using map and give it the type matching its datatype in the column:
val values: Array[Double] = rows.map(row => row.getAs[Double](0)/*row.getDouble(0)*/)
There are different points in your question. If you want to get the count of distinct 'my_id' and assign it to a value, you can do it directly with:
total_ids = sql_DF.dropDuplicates("my_id").count()
The error that you are getting is due to the way you access the data from the DataFrame (you need to access the Row object: https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html).
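For example, count(distinct ...) comes back as a bigint, so you can read it from the Row as a Long and convert it; a sketch using the question's names:
val total_ids: Double = sql_DF.select("total_id").first().getLong(0).toDouble
val other_id_1: Double = sql_DF01.select("other_ids").first().getLong(0).toDouble
println(other_id_1 / total_ids)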
Alternative: You can directly cast your DF using .as[DataType], like this:
val total_id = sql_DF.selectExpr("count(distinct(my_id)) as total_id").as[Long].first()

Update with inner join using spark dataframe/dataset/RDD

I am translating an MS SQL Server query's logic to Spark. The query to be transformed is as follows:
Update enc set PrUid=m.OriginalPrUid
FROM CachePatDemo enc
inner join #MergePreMap m on enc.PrUid=m.NewPrUid
WHERE StatusId is null
I am using DataFrames for the transformation, and I have the two tables in two DataFrames which I am joining with an inner join. I need to find a way to take all the columns of table 1 plus the updated columns (which are common to both tables).
I tried using this:
val result = CachePatDemo.as("df123").
join(MergePreMap.as("df321"), CachePatDemo("prUid") === MergePreMap("prUid"),"inner").where("StatusId is null")
select($"df123.pId",
$"df321.provFname".as("firstName"),
$"df123.lastName",
$"df123.prUid")
It doesn't seem to solve my issue. Can anyone please help?
On Spark 2.1 this works:
case class TestModel(x1: Int, x2: String, x3: Int)
object JoinDataFrames extends App {
import org.apache.spark.sql.{DataFrame, SparkSession}
val spark = SparkSession.builder.appName("GroupOperations").master("local[2]").enableHiveSupport.getOrCreate
import spark.implicits._
import org.apache.spark.sql.functions._
val list1 = (3 to 10).toList.map(i => new TestModel(i, "This is df1 " + i, i * 3))
val list2 = (0 to 5).toList.map(i => new TestModel(i, "This is df2 " + i, i * 13))
val df1: DataFrame = spark.sqlContext.createDataFrame[TestModel](list1)
val df2: DataFrame = spark.sqlContext.createDataFrame[TestModel](list2)
val res = df1.join(df2, Seq("x1"), "inner")
println("from DF1")
res.select(df1("x2")).show()
println("from DF2")
res.select(df2("x2")).show()
}
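If the goal is to reproduce the UPDATE ... FROM ... INNER JOIN semantics from the question, another option is a left join followed by a conditional column replacement. A sketch using the column names from the question's SQL (treat them as assumptions about the real schemas):
import org.apache.spark.sql.functions.{coalesce, col, when}
// left join keeps every CachePatDemo row; prUid is overwritten only where
// StatusId is null and a matching MergePreMap row exists (OriginalPrUid not null)
val updated = CachePatDemo
  .join(MergePreMap, CachePatDemo("prUid") === MergePreMap("NewPrUid"), "left")
  .withColumn("prUid",
    when(col("StatusId").isNull, coalesce(col("OriginalPrUid"), col("prUid")))
      .otherwise(col("prUid")))
  .drop("NewPrUid", "OriginalPrUid")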

(Scala) Convert String to Date in Apache Spark

I would like to read a .csv file with Spark and associate the columns with fitting Types.
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Name")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("date", DateType, true),
StructField("time",StringType, true),
StructField("am", DoubleType, true),
StructField("hum", DoubleType, true),
StructField("temp", DoubleType, true)
))
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("delimiter",";")
.schema(customSchema)
.load("data.csv")
A line of the .csv I am reading looks like this:
+----------+--------+-----+-----+-----+
| date| time| am| hum| temp|
+----------+--------+-----+-----+-----+
|04.10.2016|12:51:20|1.121|0.149|0.462|
+----------+--------+-----+-----+-----+
Spark will read the .csv and associate the types correctly if I set the type for the date column to String. If I keep the customSchema as in the code shown above, Spark will throw an exception due to the wrong date format (DateType expects yyyy-MM-dd while mine is dd.MM.yyyy).
Is there a way to re-format the date strings to yyyy-MM-dd and apply the schema afterwards? Or can I alter the DateType given by Spark by adding parameters?
Thanks in advance
Use the dateFormat option:
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("delimiter",";")
.option("dateFormat", "dd.MM.yyyy")
.schema(customSchema)
.load("data.csv")
I recommend parsing the dates afterwards. Also refer to this.
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
and, with java.time (assuming date is a java.time.LocalDate and DateTimeFormatter comes from java.time.format):
scala> date.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
res0: String = 2020.01.01
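Applied to the question's dd.MM.yyyy strings, a sketch with to_date (available since Spark 2.2), assuming the date column is read as a string first:
import org.apache.spark.sql.functions.{col, to_date}
// read "date" as StringType in the schema, then convert it to a proper DateType column
val withDate = df.withColumn("date", to_date(col("date"), "dd.MM.yyyy"))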
Also, as a side note: since Spark 2.0 you use the SparkSession object only and rely on encoders for inferring the schema (instead of sc, sqlContext, etc.). Something like this:
val spark = SparkSession.builder().getOrCreate() // configure master/app name as needed
import spark.implicits._ // needed for the .as[User] encoder
case class User(id: Int, city: String, loc: Array[Double], pop: Long, state: String)
val users = spark.read.option("inferSchema", "true").option("header", "true").csv("data/users1.csv").as[User]