Error while querying data in Spark 2.2 DataFrame

I am able to query REST API data and convert it to an RDD and a DataFrame, but when I query a column, instead of a columnar result I get a comma-separated value.
Am I missing something?
Code:
package Stream

import org.apache.spark.sql._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel

object SparkRestApi {
  def main(args: Array[String]): Unit = {
    val logger = Logger.getLogger("blah")
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)

    val spark = SparkSession.builder()
      .appName("blah")
      .config("spark.driver.memory", "2g")
      .master("local[*]")
      //.enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val url = "https://platform-api.opentargets.io/v3/platform/public/association/filter"
    val result2 = List(scala.io.Source.fromURL(url).mkString)
    val githubRdd2 = spark.sparkContext.makeRDD(result2)
    val gitHubDF2 = spark.read.json(githubRdd2)
    gitHubDF2.show()

    val mediandf = gitHubDF2.select(col("data.association_score.overall").as("association_score"))
    mediandf.show()

    spark.stop()
  }
}
But the result is not well formatted; the scores come back as a comma-separated array in a single row instead of one value per row:
+--------------------+
|   association_score|
+--------------------+
|[1.0, 1.0, 1.0, 1...|
+--------------------+
But I am expecting:
1.0
1.0
1.0

Check the code below.
Note: you will get duplicate rows if you explode array column values.
scala> val url = "https://platform-api.opentargets.io/v3/platform/public/association/filter"
scala> val df = spark.read.json(Seq(scala.io.Source.fromURL(url).mkString).toDS)
scala> df.withColumn("data",explode($"data")).select($"data.association_score.overall",$"data.disease.id".as("disease_id"),$"data.target.id").show(false)
+-------+-------------+---------------+
|overall|disease_id |id |
+-------+-------------+---------------+
|1.0 |MONDO_0045024|ENSG00000121879|
|1.0 |MONDO_0023370|ENSG00000121879|
|1.0 |EFO_0000616 |ENSG00000121879|
|1.0 |EFO_0010285 |ENSG00000121879|
|1.0 |MONDO_0045024|ENSG00000133703|
|1.0 |MONDO_0023370|ENSG00000133703|
|1.0 |MONDO_0045024|ENSG00000157764|
|1.0 |MONDO_0023370|ENSG00000157764|
|1.0 |EFO_0000616 |ENSG00000157764|
|1.0 |OTAR_0000018 |ENSG00000157764|
+-------+-------------+---------------+
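If you only want the scores themselves, one per row (the expected output above), the same explode can keep just that single field; a sketch reusing the df defined above:
scala> import org.apache.spark.sql.functions.explode
scala> df.withColumn("data", explode($"data")).select($"data.association_score.overall".as("association_score")).show(false)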

Related

Schema for type org.apache.spark.sql.types.DataType is not supported

I'm trying to create an empty DataFrame with a schema:
val sparkConf = new SparkConf()
  .setAppName("app")
  .setMaster("local")

val sparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()
val sparkContext = sparkSession.sparkContext

var tmpScheme = StructType(
  StructField("source_id", StringType, true) :: Nil)

var df = sparkSession.createDataFrame(tmpScheme)
and got: Schema for type org.apache.spark.sql.types.DataType is not supported ...
I don't understand why; there is no DataType anywhere in my imports:
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructField, StructType}
What can be the problem here?
PS: spark version
"org.apache.spark" %% "spark-sql" % "3.2.2", // spark
"org.apache.spark" %% "spark-core" % "3.2.2", // spark
If you check the documentation, you can see that the fields argument of StructType is of type Array[StructField], and you are passing a StructField.
This means that you should wrap your StructField in an Array, for example:
val simpleSchema = StructType(Array(
  StructField("source_id", StringType, true)
))
Good luck!
EDIT
The case with one parameter in createDataFrame (Data is a simple case class, defined here so the snippet runs):
case class Data(number: Int, word: String)

val data = Seq(
  Data(1, "test"),
  Data(2, "test2")
)
val dataDf = spark.createDataFrame(data)
dataDf.show(10, false)
The case with two parameters in createDataFrame:
import org.apache.spark.sql.Row

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)
val someData = Seq(Row(1, "test"), Row(2, "test2"))
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
The output for both cases is the same:
+------+-----+
|number|word |
+------+-----+
|1 |test |
|2 |test2|
+------+-----+
In your case, Spark tries to infer the schema from the attributes of the class (StructType) and to populate it with the StructField source_id. StructType extends DataType, and that is where your error comes from (Spark cannot resolve the type).
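For reference, here is a minimal sketch of actually building the empty DataFrame the question is after (reusing the sparkSession from the question): pass an empty RDD[Row] together with the schema, so nothing has to be inferred.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val tmpSchema = StructType(Array(StructField("source_id", StringType, true)))
// createDataFrame(RDD[Row], StructType) takes the schema explicitly, so no inference is involved.
val emptyDf = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], tmpSchema)
emptyDf.printSchema()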

Spark: convert string to Date

I'm using Spark/Scala.
I have a DataFrame with year/month/day columns whose values look like, for example, 2020/9/2. How can I add a column to the same DataFrame with the conversion to a date (yyyy-MM-dd)?
I found how to convert a date from String to Date format, but I can't find a solution for combining the values and converting them to a date.
Thanks for any advice or hint.
You can use the to_date function.
val df1 = Seq(
  ("2020/9/2"),
  ("2020/9/15"),
  ("2020/9/30")
).toDF("str")

val df2 = df1.withColumn("dt", to_date(col("str"), "y/M/d"))
df2.show()
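Not run here, but with the y/M/d pattern each string should parse into a proper DateType column, roughly:
+---------+----------+
|      str|        dt|
+---------+----------+
| 2020/9/2|2020-09-02|
|2020/9/15|2020-09-15|
|2020/9/30|2020-09-30|
+---------+----------+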
I did some tests; I think you can use my example to convert the date. I hope this helps.
package com.jackpan.spark.examples

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object SomeExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SomeExamples")
      .getOrCreate()

    val dataDF = spark.createDataFrame(Seq(("2022", "12", "09"), ("2022", "12", "19"),
      ("2022", "12", "15"))).toDF("year", "month", "day")

    dataDF.withColumn("dateStr",
        concat(col("year"), lit("-"), col("month"), lit("-"), col("day")))
      .withColumn("date", to_date(col("dateStr"), "yyyy-MM-dd"))
      .show(false)
  }
}
The result it displays looks like below:
+----+-----+---+----------+----------+
|year|month|day|dateStr |date |
+----+-----+---+----------+----------+
|2022|12 |09 |2022-12-09|2022-12-09|
|2022|12 |19 |2022-12-19|2022-12-19|
|2022|12 |15 |2022-12-15|2022-12-15|
+----+-----+---+----------+----------+
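If you are on Spark 3.0 or later, a shorter alternative (not part of the original answer, just a sketch) is make_date, which builds the date directly from the three columns without string concatenation:
import org.apache.spark.sql.functions.{col, make_date}

// make_date expects integer year/month/day values, so cast the string columns first.
dataDF.withColumn("date",
    make_date(col("year").cast("int"), col("month").cast("int"), col("day").cast("int")))
  .show(false)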

using numpy busday_count function in pyspark

Given that I have two columns in PySpark that are not null:
df2 = sqlContext.sql("SELECT AssignedDate,primaryid from parts_so where AssignedDate is not null ")
df2 = df2.withColumn('sdate', df2.AssignedDate )
df2 = df2.withColumn('edate',current_timestamp() )
When I run df2 = df2.withColumn('days_outstanding', lit(get_hours2(df2.sdate, df2.edate))) where get_hours2 is defined as
def get_hours2(sdate, edate):
    biz_days = np.busday_count(sdate, edate)
    return biz_days
I get the following error:
object of type 'Column' has no len()
First of all, let's create some random data:
import datetime
import random
import pandas as pd
import numpy as np

sdate = [datetime.datetime.now() + datetime.timedelta(i) for i in range(5)]
edate = [date + datetime.timedelta(random.random() + 3) for date in sdate]

data = {
    'sdate': sdate,
    'edate': edate
}
pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)
df.show()
+--------------------+--------------------+
| edate| sdate|
+--------------------+--------------------+
|2019-12-06 22:55:...|2019-12-03 08:14:...|
|2019-12-07 19:42:...|2019-12-04 08:14:...|
|2019-12-08 21:26:...|2019-12-05 08:14:...|
|2019-12-09 18:57:...|2019-12-06 08:14:...|
|2019-12-11 04:08:...|2019-12-07 08:14:...|
+--------------------+--------------------+
You cannot use a bare Python function to create another column in PySpark; we have to create a UDF in order to do that.
NOTE: Please remember to cast the result of the computation to int, because otherwise you might get a problem with pickling the numpy type.
import pyspark.sql.types as T
import pyspark.sql.functions as F

@F.udf(returnType=T.IntegerType())
def get_hours2(sdate, edate):
    biz_days = np.busday_count(sdate, edate)
    return int(biz_days)
Finally, we can use the UDF on the created DataFrame.
df = df.withColumn('days_outstanding', F.lit(get_hours2('sdate', 'edate')))
df.show()
+--------------------+--------------------+----------------+
| edate| sdate|days_outstanding|
+--------------------+--------------------+----------------+
|2019-12-06 22:55:...|2019-12-03 08:14:...| 3|
|2019-12-07 19:42:...|2019-12-04 08:14:...| 3|
|2019-12-08 21:26:...|2019-12-05 08:14:...| 2|
|2019-12-09 18:57:...|2019-12-06 08:14:...| 1|
|2019-12-11 04:08:...|2019-12-07 08:14:...| 2|
+--------------------+--------------------+----------------+
I hope this helps you.

Ungrouping the grouped table in Scala

I have the following data:
53,Male,11th,<=50K
53,Male,11th,<=50K
53,Male,11th,<=50K
20,Female,Masters,>50K
20,Female,Masters,>50K
33,Male,Bachelors,<=50K
Next I need to group the above data using select and group by, so it will be something like:
53,Male,11th,<=50K,3
20,Female,Masters,>50K,2
33,Male,Bachelors,<=50K,1
where the last number shows the count of similar records. Now I need to filter for equivalent records with a count > 2 and store them in a separate file.
I have grouped the data with SQL queries in Scala. To ungroup the data, I thought we could create a table and add the grouped data with insert commands, line by line. It works, but it is very, very slow and took around one hour for a few records. Are there any thoughts on doing this in Scala? Many thanks.
The command is shown below:
import spark.sqlContext.implicits._
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

case class Rating(age: Double, edu: String, sex: String, salary: String)

val Result = sc.textFile("hdfs://NameNode01:9000/input/adult.csv")
  .map(_.split(","))
  .map(p => Rating(p(0).trim.toDouble, p(1), p(2), p(3)))
  .toDF()
Result.registerTempTable("Start")

val sal1 = spark.sqlContext.sql("SELECT age,edu,sex,salary,count(*) as cnt from Start group by age,edu,sex,salary")
sal1.registerTempTable("adult")
val sal2 = spark.sqlContext.sql("SELECT age,edu,sex,salary,cnt from adult WHERE cnt>3")
sal2.registerTempTable("adult2")

var ag = sal2.map(age => "" + age(0)).collect()
var ed = sal2.map(edu => "" + edu(1)).collect()
var se = sal2.map(sex => "" + sex(2)).collect()
var sa = sal2.map(salary => "" + salary(3)).collect()
var cn = sal2.map(cnt => "" + cnt(4)).collect()

//convert age to double
val ages = ag.map(_.toDouble)
//convert the cnt to integer
val counts = cn.map(_.toInt)
//length of the array
var cnt_length = counts.size

//create a table and add the sal2 records in it
val adlt2 = spark.sqlContext.sql("CREATE TABLE adult3 (age double, edu string, sex string, salary string)")

//loop and enter the number of cn
var sql_querys = "query"
var i = 0
var j = 0
var loop_cnt = 0
for (i <- 0 to cnt_length - 1) {
  loop_cnt = counts(i)
  for (j <- 0 to loop_cnt - 1) {
    sql_querys = "INSERT into adult3 values (" + ages(i) + ",'" + ed(i) + "','" + se(i) + "','" + sa(i) + "')"
    val adlt3 = spark.sqlContext.sql("INSERT into adult3 values (" + ages(i) + ",'" + ed(i) + "','" + se(i) + "','" + sa(i) + "')")
  }
}
The main part is the loop at the end of the code.
Here is a shorter solution which uses only rdds:
val result = sc
.textFile("hdfs://NameNode01:9000/input/adult.csv")
.map({ (line: String) =>
val p = line.split(",")
(Rating(p(0).trim.toDouble,p(1),p(2),p(3)), 1)
})
.reduceByKey(_ + _)
.filter(_._2 > 2)
.flatMap(rating => Array.fill(rating._2)(rating._1))
It works as follows:
textFile loads the RDD from the file
map transforms the lines into pairs of the form (rating, 1)
reduceByKey groups the pairs by the rating and sums the 1s (i.e. counts the occurrences of each rating)
filter discards the ratings which appear fewer than 3 times
flatMap repeats each rating as many times as its count and then flattens all the results into a single RDD
Here are some reasons why the initial approach is not performant:
collect is used on the dataframe to read its contents on a local machine. This means that you directly lose all the parallelization and clustering benefits of Spark.
The for loop executes single insertions into the dataframe. The available transformations of Spark objects (e.g. map, filter, reduce, single SQL queries) are highly optimised to perform those actions in a distributed fashion. By using a for loop to perform single-row actions, you lose this advantage, and additionally you risk extreme overhead from dataframes being copied during each iteration of the loop.
(minor) Converting an RDD to a dataframe adds some additional computational cost. Thus, unless you plan to perform several operations which would benefit from the performance features of dataframes or datasets, I would suggest keeping it simple and using just RDDs.
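Since the question also asks to store the filtered records in a separate file, the resulting RDD can be written out directly; a minimal sketch (the output path below is just a placeholder):
result
  .map(r => Seq(r.age, r.edu, r.sex, r.salary).mkString(","))  // back to CSV lines
  .saveAsTextFile("hdfs://NameNode01:9000/output/adult_filtered")  // placeholder path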
You might want to consider ungrouping your dataframe using explode in accordance with the groupBy count:
import org.apache.spark.sql.functions._
case class Rating(age: Double, edu: String, sex: String, salary: String)
val Result = sc.textFile("/Users/leo/projects/spark/files/testfile.csv").
  map(_.split(",")).
  map(p => Rating(p(0).trim.toDouble, p(1).trim, p(2).trim, p(3).trim)).
  toDF
val saDF1 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt")
val saDF2 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt").where($"cnt" > 2)
// Create a UDF to fill array of 1's to be later exploded
val fillArr = (n: Int) => Array.fill(n)(1)
val fillArrUDF = udf(fillArr)
val expandedDF1 = saDF1.withColumn("arr", fillArrUDF($"cnt"))
expandedDF1.show
+----+------+---------+------+---+---------+
| age| edu| sex|salary|cnt| arr|
+----+------+---------+------+---+---------+
|33.0| Male|Bachelors| <=50K| 1| [1]|
|20.0|Female| Masters| >50K| 2| [1, 1]|
|53.0| Male| 11th| <=50K| 3|[1, 1, 1]|
+----+------+---------+------+---+---------+
// Ungroup dataframe using explode
val ungroupedDF1 = expandedDF1.withColumn("a", explode($"arr")).
  select("age", "edu", "sex", "salary")
ungroupedDF1.show
+----+------+---------+------+
| age| edu| sex|salary|
+----+------+---------+------+
|33.0| Male|Bachelors| <=50K|
|20.0|Female| Masters| >50K|
|20.0|Female| Masters| >50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
+----+------+---------+------+
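On Spark 2.4 and later the helper UDF can be avoided entirely: array_repeat builds the array of 1's to explode in a single built-in call. A sketch reusing saDF1 from above:
import org.apache.spark.sql.functions.{array_repeat, explode, lit}

val ungroupedDF2 = saDF1
  .withColumn("a", explode(array_repeat(lit(1), $"cnt".cast("int"))))
  .select("age", "edu", "sex", "salary")
This yields the same ungrouped rows as ungroupedDF1 above.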
From what I understood of your question, you want to filter out similar records whose count is greater than 2 and write the rest to a file. If so, the following can be your solution.
You must already have original dataframe as
+----+------+---------+------+
|age |edu |sex |salary|
+----+------+---------+------+
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|20.0|Female|Masters |>50K |
|20.0|Female|Masters |>50K |
|33.0|Male |Bachelors|<=50K |
+----+------+---------+------+
You don't need to write complex SQL queries to find the count; you can just use built-in functions:
val columnNames = Result.columns
val finalTemp = Result.groupBy(columnNames.map(col): _*).agg(count("salary").as("similar records"))
This should give output as
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|53.0|Male |11th |<=50K |3 |
+----+------+---------+------+---------------+
Now to filter, you can just use the filter function:
val finalTable = finalTemp.filter($"similar records" < 3)
final output is
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
+----+------+---------+------+---------------+
You can save it to a file:
finalTable.write.format("com.databricks.spark.csv").save("output path")
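On Spark 2.x and later the external com.databricks.spark.csv package is not needed; the built-in CSV writer does the same (a sketch, keeping the placeholder path):
finalTable.write.option("header", "true").csv("output path")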
If you want the original rows together with their filtered counts, you can simply use a join:
Result.join(finalTable, Seq(columnNames: _*)).show(false)
output is
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|20.0|Female|Masters |>50K |2 |
+----+------+---------+------+---------------+
You can save it to a file as above.
Note: you will need the following import for the above functions to work:
import org.apache.spark.sql.functions._

(Scala) Convert String to Date in Apache Spark

I would like to read a .csv file with Spark and associate the columns with fitting Types.
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("Name")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
  StructField("date", DateType, true),
  StructField("time", StringType, true),
  StructField("am", DoubleType, true),
  StructField("hum", DoubleType, true),
  StructField("temp", DoubleType, true)
))

val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(customSchema)
  .load("data.csv")
A line of the .csv I am reading looks like this
+----------+--------+-----+-----+-----+
| date| time| am| hum| temp|
+----------+--------+-----+-----+-----+
|04.10.2016|12:51:20|1.121|0.149|0.462|
+----------+--------+-----+-----+-----+
Spark will read the .csv and associate the Types correctly if I set the type for the date to String. If I keep the customSchema like in the code shown above, Spark will throw an exception due to the wrong date format (DateType will expect YYYY-MM-DD while mine is DD.MM.YYYY).
Is there a way to re-format the date Strings to YYYY-MM-DD and apply the schema afterwards? Or can I also alter the DateType given by Spark by adding parameters?
Thanks in advance
Use the dateFormat option:
val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .option("delimiter", ";")
  .option("dateFormat", "dd.MM.yyyy")
  .schema(customSchema)
  .load("data.csv")
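As a side note (assuming Spark 2.0 or later and a SparkSession named spark), the long datasource class name can be replaced by the built-in csv shortcut; the dateFormat option works the same way:
val df = spark.read
  .option("header", "true")
  .option("delimiter", ";")
  .option("dateFormat", "dd.MM.yyyy")
  .schema(customSchema)
  .csv("data.csv")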
I recommend parsing the dates afterwards. Also refer to this.
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
and, to format a date value back into a custom pattern with java.time:
scala> date.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
res0: String = 2020.01.01
Also, as a side note: since Spark 2.0 you use the SparkSession object only, and encoders for inferring the schema (instead of sc, sqlContext, etc.). Something like this:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class User(id: Int, city: String, loc: Array[Double], pop: Long, state: String)
val users = spark.read.option("inferSchema", "true").option("header", "true").csv("data/users1.csv").as[User]
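Finally, if the date column has already been read as a string (the fallback mentioned in the question), it can be converted afterwards with to_date and the matching pattern (to_date with an explicit format needs Spark 2.2 or later); a sketch assuming the column is called date:
import org.apache.spark.sql.functions.{col, to_date}

val withDate = df.withColumn("date", to_date(col("date"), "dd.MM.yyyy"))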