Spark: convert string to Date - dataframe

I'm using Spark/Scala.
I have a DataFrame. There are year/month/day columns with values like, for example, 2020/9/2. How can I add a column to the same DataFrame with the value converted to a date (yyyy-MM-dd)?
I found how to convert a date from String to Date format, but I can't find a solution for combining the values and converting them to a date.
Thanks for any advice or hints.

You can use the to_date function.
import org.apache.spark.sql.functions.{col, to_date}
import spark.implicits._ // for .toDF on a local Seq

val df1 = Seq(
  ("2020/9/2"),
  ("2020/9/15"),
  ("2020/9/30")
).toDF("str")

val df2 = df1.withColumn("dt", to_date(col("str"), "y/M/d"))
df2.show()

I did some tests; I think you can use my example to convert the date. I hope this helps you.
package com.jackpan.spark.examples

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object SomeExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SomeExamples")
      .getOrCreate()

    val dataDF = spark.createDataFrame(Seq(("2022", "12", "09"), ("2022", "12", "19"),
      ("2022", "12", "15"))).toDF("year", "month", "day")

    dataDF.withColumn("dateStr",
        concat(col("year"), lit("-"), col("month"), lit("-"), col("day")))
      .withColumn("date", to_date(col("dateStr"), "yyyy-MM-dd"))
      .show(false)
  }
}
This code displays the result shown below:
+----+-----+---+----------+----------+
|year|month|day|dateStr |date |
+----+-----+---+----------+----------+
|2022|12 |09 |2022-12-09|2022-12-09|
|2022|12 |19 |2022-12-19|2022-12-19|
|2022|12 |15 |2022-12-15|2022-12-15|
+----+-----+---+----------+----------+
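If you are on Spark 3.0 or later, a shorter alternative is the built-in make_date function, which builds the date directly from the three columns without an intermediate string. A minimal sketch, assuming the same spark session and string columns as in the example above:

import org.apache.spark.sql.functions._

val dataDF = spark.createDataFrame(Seq(("2022", "12", "09"), ("2022", "12", "19"), ("2022", "12", "15")))
  .toDF("year", "month", "day")

// make_date expects integer arguments, so cast the string columns first
dataDF
  .withColumn("date", make_date(col("year").cast("int"), col("month").cast("int"), col("day").cast("int")))
  .show(false)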

Related

Schema for type org.apache.spark.sql.types.DataType is not supported

I'm trying to create an empty DataFrame with a schema:
val sparkConf = new SparkConf()
  .setAppName("app")
  .setMaster("local")

val sparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()

val sparkContext = sparkSession.sparkContext

var tmpScheme = StructType(
  StructField("source_id", StringType, true) :: Nil)

var df = conf.SparkConf.sparkSession.createDataFrame(tmpScheme)
and got: Schema for type org.apache.spark.sql.types.DataType is not supported ...
I don't understand why; there is no DataType even in the imports:
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructField, StructType}
What can be the problem here?
PS: Spark version:
"org.apache.spark" %% "spark-sql" % "3.2.2", // spark
"org.apache.spark" %% "spark-core" % "3.2.2", // spark
If you check the documentation, you can see that the fields argument of StructType is of type Array[StructField], while you are passing a single StructField.
This means that you should wrap your StructField in an Array, for example:
val simpleSchema = StructType(Array(
  StructField("source_id", StringType, true)
))
Good luck!
EDIT
The case with one parameter in createDataFrame:
// case class assumed from the output columns shown below
case class Data(number: Int, word: String)

val data = Seq(
  Data(1, "test"),
  Data(2, "test2")
)
val dataDf = spark.createDataFrame(data)
dataDf.show(10, false)
The case with two parameters in createDataFrame:
val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)
val someData = Seq(Row(1, "test"), Row(2, "test2"))
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
The output for both cases is the same:
+------+-----+
|number|word |
+------+-----+
|1 |test |
|2 |test2|
+------+-----+
In your case, Spark tries to infer the schema from the attributes of the class you passed (StructType) and to populate it with the StructField source_id. StructType extends DataType, and that is where your error comes from (Spark cannot resolve the type).
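If the original goal was simply an empty DataFrame with that schema, a minimal sketch (reusing the sparkSession from the question) is to pass an empty RDD[Row] together with the schema, so nothing has to be inferred:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val simpleSchema = StructType(Array(StructField("source_id", StringType, true)))
// createDataFrame(rowRDD, schema) takes the schema explicitly, so no inference is attempted
val emptyDf = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], simpleSchema)
emptyDf.printSchema()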

Vectorized pandas udf in pyspark with dict lookup

I'm trying to learn to use pandas_udf in pyspark (Databricks).
One of the assignments is to write a pandas_udf to sort by day of the week. I know how to do this using a Spark UDF:
from pyspark.sql.functions import *

data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)
print('Original')
df.show()

@udf()
def udf(day: str) -> str:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day] + '-' + day

print('with spark udf')
final_df = df.select(col('avg_users'), udf(col('day')).alias('day')).sort('day')
final_df.show()
Prints:
Original
+---+-----------+
|day| avg_users|
+---+-----------+
|Sun| 282905.5|
|Mon| 238195.5|
|Thu| 264620.0|
|Sat| 278482.0|
|Wed| 227214.0|
+---+-----------+
with spark udf
+-----------+-----+
| avg_users| day|
+-----------+-----+
| 238195.5|1-Mon|
| 227214.0|3-Wed|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 282905.5|7-Sun|
+-----------+-----+
Trying to do the same with pandas_udf
import pandas as pd

@pandas_udf('string')
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day.str] + '-' + day.str

p_final_df = df.select(df.avg_users, p_udf(df.day))
print('with pandas udf')
p_final_df.show()
I get KeyError: <pandas.core.strings.accessor.StringMethods object at 0x7f31197cd9a0>. I think it's coming from dow[day.str], which kinda makes sense.
I also tried:
return dow[day.str.__str__()] + '-' + day.str # KeyError: .... StringMethods
return dow[str(day.str)] + '-' + day.str # KeyError: .... StringMethods
return dow[day.str.upper()] + '-' + day.str # TypeError: unhashable type: 'Series'
return f"{dow[day.str]}-{day.str}" # KeyError: .... StringMethods (but I think this is logically
# wrong, returning a string instead of a Series)
I've read:
API reference
PySpark equivalent for lambda function in Pandas UDF
How to convert Scalar Pyspark UDF to Pandas UDF?
Pandas UDF in pyspark
Using the .str accessor alone, without any actual vectorized transformation, is what gave you the error. Also, you cannot use the whole Series as a key for your dow dict. Use the map method of pandas.Series:
from pyspark.sql.functions import *
import pandas as pd

data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)

@pandas_udf("string")
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return day.map(dow) + '-' + day

df.select(df.avg_users, p_udf(df.day).alias("day")).show()
+---------+-----+
|avg_users| day|
+---------+-----+
| 282905.5|7-Sun|
| 238195.5|1-Mon|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 227214.0|3-Wed|
+---------+-----+
What about returning a DataFrame using grouped data and ordering it after you apply the UDF? Pandas sort_values is quite problematic within UDFs.
Basically, in the UDF I generate the numbers using Python and then concatenate them back onto the day column.
from pyspark.sql.functions import pandas_udf
import pandas as pd
from pyspark.sql.types import *
import calendar

def sortdf(pdf):
    day = pdf.day
    pdf = pdf.assign(day=(day.map(dict(zip(calendar.day_abbr, range(7)))) + 1).astype(str) + '-' + day)
    return pdf

df.groupby('avg_users').applyInPandas(sortdf, schema=df.schema).show()
+-----+---------+
| day|avg_users|
+-----+---------+
|3-Wed| 227214.0|
|1-Mon| 238195.5|
|4-Thu| 264620.0|
|6-Sat| 278482.0|
|7-Sun| 282905.5|
+-----+---------+

Error while querying Data in Spark 2.2 dataframe

I have been able to query REST API data and convert it to an RDD and a DataFrame,
but when I try to query a column, instead of a columnar result I get a comma-separated value.
Am I missing something?
Code:
package Stream

import org.apache.spark.sql._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.DefaultHttpClient
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel

object SparkRestApi {
  def main(args: Array[String]): Unit = {
    val logger = Logger.getLogger("blah")
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)

    val spark = SparkSession.builder()
      .appName("blah")
      .config("spark.driver.memory", "2g")
      .master("local[*]")
      //.enableHiveSupport()
      .getOrCreate()

    import spark.implicits._

    val url = "https://platform-api.opentargets.io/v3/platform/public/association/filter"
    val result2 = List(scala.io.Source.fromURL(url).mkString)
    val githubRdd2 = spark.sparkContext.makeRDD(result2)
    val gitHubDF2 = spark.read.json(githubRdd2)
    gitHubDF2.show()

    val mediandf = gitHubDF2.select(col("data.association_score.overall").as("association_score"))
    mediandf.show()

    spark.stop()
  }
}
But the result is not well formatted. It is comma separated instead of one value per row:
+--------------------+
|   association_score|
+--------------------+
|[1.0, 1.0, 1.0, 1...|
+--------------------+
But I am expecting:
1.0
1.0
1.0
Check the code below.
Note: you will get duplicate rows if you explode array column values.
scala> val url = "https://platform-api.opentargets.io/v3/platform/public/association/filter"
scala> val df = spark.read.json(Seq(scala.io.Source.fromURL(url).mkString).toDS)
scala> df.withColumn("data",explode($"data")).select($"data.association_score.overall",$"data.disease.id".as("disease_id"),$"data.target.id").show(false)
+-------+-------------+---------------+
|overall|disease_id |id |
+-------+-------------+---------------+
|1.0 |MONDO_0045024|ENSG00000121879|
|1.0 |MONDO_0023370|ENSG00000121879|
|1.0 |EFO_0000616 |ENSG00000121879|
|1.0 |EFO_0010285 |ENSG00000121879|
|1.0 |MONDO_0045024|ENSG00000133703|
|1.0 |MONDO_0023370|ENSG00000133703|
|1.0 |MONDO_0045024|ENSG00000157764|
|1.0 |MONDO_0023370|ENSG00000157764|
|1.0 |EFO_0000616 |ENSG00000157764|
|1.0 |OTAR_0000018 |ENSG00000157764|
+-------+-------------+---------------+

in scala how to convert the result in sql query row to double

I'm trying to get the result of a Spark SQL query and do some calculations on it in Scala.
val sql_DF = spark.sql("SELECT count(distinct(my_id)) total_id FROM some_ids_table ")
val total_ids = sql_DF.select("total_id").first().toSeq.asInstanceOf[Seq[Double]][0]
val sql_DF01 = spark.sql("SELECT count(distinct(other_id)) other_ids FROM some_ids_table where column_value1 = 1")
val other_id_1 = sql_DF01.select("other_ids").first().toSeq.asInstanceOf[Seq[Double]][0]
println(other_id_1/total_ids)
I got error:
error: identifier expected but integer literal found.
val total_ids = sql_DF.select("total_id").first().toSeq.asInstanceOf[Seq[Double]][0]
How can I convert the result in the SQL query row to a double so that I can do some math calculations on it?
e.g.
other_ids / total_ids
Thanks.
You can use the CAST function as below:
scala> val sql_DF = spark.sql("SELECT count(distinct(eid)) total_id FROM employee ")
sql_DF: org.apache.spark.sql.DataFrame = [total_id: bigint]
Using the CAST function, the bigint is converted to a double:
scala> val sql_DF = spark.sql("SELECT CAST(count(distinct(eid)) AS DOUBLE) total_id FROM employee ")
sql_DF: org.apache.spark.sql.DataFrame = [total_id: double]
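From there you still need a plain Scala value for the division. A minimal follow-up sketch, assuming the casted sql_DF above and an analogously casted sql_DF01 for the other count:

// first() returns a Row; getDouble(0) pulls the casted count out as a Scala Double
val total_ids: Double = sql_DF.first().getDouble(0)
val other_id_1: Double = sql_DF01.first().getDouble(0)
println(other_id_1 / total_ids)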
The question didn't seem quite clear to me. Here's how you generally handle this kind of situation.
import spark.implicits._
val df = Seq(("a",90.0),("b",30.0)).toDF("name","marks")
Collect the data in the column "marks", which will be of type Array[Row]:
val rows: Array[Row] = df.select("marks").collect()
You then need to get each element using map, giving it the type that matches the column's datatype:
val values: Array[Double] = rows.map(row => row.getAs[Double](0)/*row.getDouble(0)*/)
There are different points in your question. If you want to get the count of distinct 'my_id' and assign it to a value, you can do it directly with:
total_ids = sql_DF.dropDuplicates("my_id").count()
The error that you are getting is due to the way you access the data from the DataFrame (you need to access the Row object: https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Row.html).
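For example, without the cast you can read the count out of the Row yourself. A minimal sketch against the sql_DF from the question (count(distinct ...) comes back as a bigint, i.e. a Scala Long):

val total_ids: Double = sql_DF.select("total_id").first().getLong(0).toDouble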
Alternative: You can directly cast your DF using .as[DataType], like this:
val total_id = sql_DF.selectExpr("count(distinct(my_id)) as total_id").as[Long].first()

(Scala) Convert String to Date in Apache Spark

I would like to read a .csv file with Spark and associate the columns with fitting Types.
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("Name")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
  StructField("date", DateType, true),
  StructField("time", StringType, true),
  StructField("am", DoubleType, true),
  StructField("hum", DoubleType, true),
  StructField("temp", DoubleType, true)
))

val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(customSchema)
  .load("data.csv")
A line of the .csv I am reading looks like this
+----------+--------+-----+-----+-----+
| date| time| am| hum| temp|
+----------+--------+-----+-----+-----+
|04.10.2016|12:51:20|1.121|0.149|0.462|
+----------+--------+-----+-----+-----+
Spark will read the .csv and associate the Types correctly if I set the type for the date to String. If I keep the customSchema like in the code shown above, Spark will throw an exception due to the wrong date format (DateType will expect YYYY-MM-DD while mine is DD.MM.YYYY).
Is there a way to re-format the date Strings to YYYY-MM-DD and apply the schema afterwards? Or can I also alter the DateType given by Spark by adding parameters?
Thanks in advance
Use the dateFormat option:
val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .option("delimiter", ";")
  .option("dateFormat", "dd.MM.yyyy")
  .schema(customSchema)
  .load("data.csv")
I recommend parsing the dates afterwards. Also refer to this.
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
and, to format a parsed date back to a string with java.time (assuming date here is a java.time.LocalDate):
scala> import java.time.format.DateTimeFormatter
scala> date.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
res0: String = 2020.01.01
Also, as a side note: since Spark 2.0 you use the SparkSession object only, and use encoders for inferring the schema (instead of sc, sqlContext, etc.). Something like this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate() // session config elided in the original
import spark.implicits._ // needed for .as[User]
case class User(id: Int, city: String, loc: Array[Double], pop: Long, state: String)
val users = spark.read.option("inferSchema", "true").option("header", "true").csv("data/users1.csv").as[User]