I would like to read a .csv file with Spark and assign appropriate types to the columns.
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Name")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("date", DateType, true),
StructField("time",StringType, true),
StructField("am", DoubleType, true),
StructField("hum", DoubleType, true),
StructField("temp", DoubleType, true)
))
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("delimiter",";")
.schema(customSchema)
.load("data.csv")
A line of the .csv I am reading looks like this:
+----------+--------+-----+-----+-----+
| date| time| am| hum| temp|
+----------+--------+-----+-----+-----+
|04.10.2016|12:51:20|1.121|0.149|0.462|
+----------+--------+-----+-----+-----+
Spark will read the .csv and associate the Types correctly if I set the type for the date to String. If I keep the customSchema like in the code shown above, Spark will throw an exception due to the wrong date format (DateType will expect YYYY-MM-DD while mine is DD.MM.YYYY).
Is there a way to re-format the date Strings to YYYY-MM-DD and apply the schema afterwards? Or can I also alter the DateType given by Spark by adding parameters?
Thanks in advance
Use the dateFormat option:
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("delimiter",";")
.option("dateFormat", "dd.MM.yyyy")
.schema(customSchema)
.load("data.csv")
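As a quick sanity check of the pattern itself, here is a minimal plain-Scala sketch (no Spark involved) showing that dd.MM.yyyy matches the sample value from the question and yields the ISO form that a date column displays. It assumes that the pattern letters mean the same thing in the dateFormat option as in java.time.format.DateTimeFormatter, which should hold for a simple pattern like this one. The names inputPattern, parsed and iso are only illustrative.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Pattern for the question's format: day.month.year
val inputPattern = DateTimeFormatter.ofPattern("dd.MM.yyyy")

// Parse the sample value from the CSV line shown in the question
val parsed: LocalDate = LocalDate.parse("04.10.2016", inputPattern)

// LocalDate.toString uses the ISO-8601 form (yyyy-MM-dd)
val iso: String = parsed.toString
println(iso)
```

If the pattern were wrong (for example MM.dd.yyyy), the parse call would either throw or silently swap day and month, so checking one known value this way can save a debugging round in Spark.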
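I recommend reading the column as a string and parsing the dates afterwards:
import org.apache.spark.sql.functions.unix_timestamp
// toDF and the $"..." syntax need import spark.implicits._ (already in scope in spark-shell)
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")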
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
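and, to format a java.time date with a custom pattern:
scala> import java.time.LocalDate, java.time.format.DateTimeFormatter
scala> val date = LocalDate.of(2020, 1, 1) // assumed definition; the original does not show it
scala> date.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
res0: String = 2020.01.01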
Also, as a side note: since Spark 2.0 you use only the SparkSession object (instead of sc, sqlContext, etc.) and can use encoders together with schema inference. Something like this:
val spark = SparkSession.builder().getOrCreate() // configure master, appName, etc. as needed
import spark.implicits._ // provides the Encoder required by .as[User]
case class User(id:Int, city:String, loc:Array[Double], pop:Long, state:String)
val users = spark.read.option("inferSchema","true").option("header","true").csv("data/users1.csv").as[User]
I am trying to create an empty DataFrame with a schema:
val sparkConf = new SparkConf()
.setAppName("app")
.setMaster("local")
val sparkSession = SparkSession
.builder()
.config(sparkConf)
.getOrCreate()
val sparkContext = sparkSession.sparkContext
var tmpScheme = StructType(
StructField("source_id", StringType, true) :: Nil)
var df = conf.SparkConf.sparkSession.createDataFrame(tmpScheme)
and got: Schema for type org.apache.spark.sql.types.DataType is not supported ...
I don't understand why; DataType does not even appear in my imports:
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructField, StructType}
What can be the problem here?
PS: spark version
"org.apache.spark" %% "spark-sql" % "3.2.2", // spark
"org.apache.spark" %% "spark-core" % "3.2.2", // spark
If you check the documentation, you can see that the fields argument of StructType is of type Array[StructField].
You can wrap your StructField in an Array, for example:
val simpleSchema = StructType(Array(
StructField("source_id", StringType, true))
)
Good luck!
EDIT
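The case with one parameter in createDataFrame:
// Data is not defined in the original; this definition matches the output shown below
case class Data(number: Int, word: String)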
val data = Seq(
Data(1, "test"),
Data(2, "test2")
)
val dataDf = spark.createDataFrame(data)
dataDf.show(10, false)
The case with two parameters in createDataFrame:
val someSchema = List(
StructField("number", IntegerType, true),
StructField("word", StringType, true)
)
val someData = Seq(Row(1, "test"), Row(2, "test2"))
val someDF = spark.createDataFrame(
spark.sparkContext.parallelize(someData),
StructType(someSchema)
)
The output for both cases is the same:
+------+-----+
|number|word |
+------+-----+
|1 |test |
|2 |test2|
+------+-----+
In your case, Spark tries to infer the schema from the attributes of the class you passed in (StructType) and to populate it with the StructField source_id. StructType extends DataType, and that is where your error comes from (Spark cannot resolve the type).
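If the goal is an empty DataFrame that only carries the schema, one common approach is to pair the schema with an empty RDD[Row], using the two-argument form of createDataFrame. A minimal sketch, assuming a local SparkSession (the master and application name here are placeholders, and emptyDf is an illustrative name):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder session settings; adjust for your environment
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("empty-df-sketch")
  .getOrCreate()

// Same schema as in the question
val schema = StructType(StructField("source_id", StringType, true) :: Nil)

// The empty RDD[Row] supplies the (absent) data; the StructType supplies the schema
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

emptyDf.printSchema()
println(emptyDf.count())
```

The key difference from the question's code is that the StructType is passed as the schema argument rather than as the data.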
I'm using Spark with Scala.
I have a DataFrame. There are year/month/day columns with values such as 2020/9/2. How can I add a column to the same DataFrame that converts them to a date (yyyy-MM-dd)?
I found out how to convert a String to a Date, but I can't find a way to combine the values and convert the result.
Thanks for any advice or hints.
You can use the to_date function:
import org.apache.spark.sql.functions.{col, to_date}
val df1 = Seq(
("2020/9/2"),
("2020/9/15"),
("2020/9/30")
).toDF("str")
val df2 = df1.withColumn("dt", to_date(col("str"), "y/M/d"))
df2.show()
I ran some tests; I think you can use my example to convert the date. I hope it helps.
package com.jackpan.spark.examples
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
object SomeExamples {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("SomeExamples")
.getOrCreate()
val dataDF = spark.createDataFrame(Seq(("2022", "12", "09"), ("2022", "12", "19"),
("2022", "12", "15"))).toDF("year", "month", "day")
dataDF.withColumn("dateStr",
concat(col("year"), lit("-"),col("month"), lit("-"), col("day")))
.withColumn("date", to_date(col("dateStr"), "yyyy-MM-dd"))
.show(false)
}
}
The output of show looks like this:
+----+-----+---+----------+----------+
|year|month|day|dateStr |date |
+----+-----+---+----------+----------+
|2022|12 |09 |2022-12-09|2022-12-09|
|2022|12 |19 |2022-12-19|2022-12-19|
|2022|12 |15 |2022-12-15|2022-12-15|
+----+-----+---+----------+----------+
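One caveat applies if the month and day values are not zero-padded (as in 2020/9/2 from the question): a fixed-width pattern such as yyyy-MM-dd expects two-digit months and days. The sketch below illustrates this with plain java.time rather than Spark. It assumes Spark 3.x, whose date patterns are modelled on java.time's DateTimeFormatter, so the behaviour inside to_date may differ in detail (older versions or the legacy parser setting can be more lenient). The names unpadded, strict and lenient are illustrative.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// A concatenated value built from non-padded parts: year=2020, month=9, day=2
val unpadded = "2020-9-2"

// Fixed-width pattern: MM and dd each require exactly two digits
val strict = Try(LocalDate.parse(unpadded, DateTimeFormatter.ofPattern("yyyy-MM-dd")))

// Single-letter pattern: accepts one or more digits per field
val lenient = LocalDate.parse(unpadded, DateTimeFormatter.ofPattern("y-M-d"))

println(strict.isFailure) // whether the fixed-width pattern rejected the value
println(lenient)          // ISO form of the leniently parsed date
```

With data like this, using a single-letter pattern (or zero-padding the parts before concatenating them) avoids unexpected nulls or parse errors.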
44,8602,37.19
35,5368,65.89
35,3391,40.64
44,6694,14.98
val sc = new SparkContext("local[*]", "TotalSpentByCustomer")
val input = sc.textFile("C:\\Spar\\cuscopy.csv")
val fields = input.map(x => (x.split("\t")(1).toInt, 1, 2.toFloat, 2))
val d = fields.reduceByKey((x,y) => x+y)
val results = d.collect()
results.foreach(print)
I am getting this error:
value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int, Float, Int)]
val d = fields.reduceByKey((x,y) => x+y)
Is this the right way to parse the fields?
You could load the CSV file into a DataFrame with a given schema using the Spark SQL library and, if needed, convert it to an RDD.
// remove those lines when not using jupyter
interp.load.ivy("org.apache.spark" %% "spark-sql" % "3.2.0")
interp.load.ivy("org.apache.spark" %% "spark-core" % "3.2.0")
import org.apache.spark.sql.types.{StructType, StructField, FloatType, IntegerType};
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate;
// create a new schema for reading the csv
val schema = new StructType()
.add("Field1", IntegerType, true)
.add("Field2", IntegerType ,true)
.add("Field3", FloatType, true)
val df = spark.read.format("csv")
.schema(schema)
.load("/vagrant/test/test.csv") //replace with desired path
// select only column 1 & 2, not sure if this was intended by the questioner
val selected = df.select("Field1","Field2")
// convert your dataframe to a rdd
val d = selected.rdd
d.collect().foreach(println)
This outputs the following lines for the given input from the question:
[44,8602]
[35,5368]
[35,3391]
[44,6694]
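For context on the original compile error: reduceByKey is only available on RDDs whose elements are two-element (key, value) tuples, so an RDD[(Int, Int, Float, Int)] does not have it. Note also that the sample rows are comma-separated, while the question's code splits on a tab. The sketch below uses plain Scala collections instead of an RDD so that it runs without Spark. It assumes (from the application name TotalSpentByCustomer) that column 0 is a customer id and column 2 an amount, which the question does not state explicitly.

```scala
// Sample rows from the question
val lines = Seq(
  "44,8602,37.19",
  "35,5368,65.89",
  "35,3391,40.64",
  "44,6694,14.98"
)

// Build (key, value) pairs: (customerId, amount). The column meanings are assumed.
val pairs: Seq[(Int, Double)] = lines.map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(2).toDouble)
}

// Collection equivalent of reduceByKey(_ + _): group by key, then sum the values.
// On an RDD[(Int, Double)] the corresponding call is reduceByKey(_ + _).
val totals: Map[Int, Double] =
  pairs.groupBy(_._1).map { case (id, kvs) => id -> kvs.map(_._2).sum }

totals.toSeq.sortBy(_._1).foreach(println)
```

The important part is the shape of the mapped elements: once each record is a pair, the per-key reduction becomes available.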
How to read a csv file in Spark which has a structure like:
id,name,address
1,"ashu","building","street","area","city","state","pin"
When using a reader:
val df = spark.read.option("header",true).csv("input/input1.csv")
I only get the record up to the third value in the CSV:
+---+----+--------+
| id|name| address|
+---+----+--------+
| 1|ashu|building|
+---+----+--------+
How can I tell Spark to read all the values from the third one to the last one into the single DataFrame column address, like this:
+---+----+-----------------------------------------------+
| id|name| address |
+---+----+-----------------------------------------------+
| 1|ashu|"building","street","area","city","state","pin"|
+---+----+-----------------------------------------------+
I'm tailoring my answer to your requirement to use CSV. This is the least painful way to do what you want.
Modify your CSV file so that it uses "|" to separate fields instead of ",". This allows you to have "," inside your columns.
id|name|address
1|"ashu"|"building","street","area","city","state","pin"
Modify your code:
val df = spark.read
.option("header",true)
.option("delimiter", "|")
.csv("input/input1.csv")
If you can fix your input files to use a different delimiter character, then you should do that.
However, if you don't have that possibility, you can still read the file without header and specify a custom schema. Then, concatenate the 6 address columns to get the desired dataframe:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, concat_ws}
val schema = StructType(
Array(
StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("address1", StringType, true),
StructField("address2", StringType, true),
StructField("address3", StringType, true),
StructField("address4", StringType, true),
StructField("address5", StringType, true),
StructField("address6", StringType, true)
)
)
val input = spark.read.schema(schema).csv("input/input1.csv")
val df = input.filter("name != 'name'").withColumn(
"address",
concat_ws(", ", (1 to 6).map(n => col(s"address$n")):_*)
).select("id", "name", "address")
df.show(false)
//+---+----+----------------------------------------+
//|id |name|address |
//+---+----+----------------------------------------+
//|1 |ashu|building, street, area, city, state, pin|
//+---+----+----------------------------------------+
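Another option, if neither changing the delimiter nor hard-coding six address columns is practical, is to treat each line as text and split it on the first two commas only. Below is a minimal sketch of that split in plain Scala (no Spark), using the sample row from the question; splitRow is a hypothetical helper name. It assumes every data row has at least three fields and that the id and name never contain a comma.

```scala
// Split into at most 3 parts: id, name, and everything else as the address
def splitRow(line: String): (String, String, String) = {
  val parts = line.split(",", 3)
  // Strip the surrounding quotes from the name; keep the address untouched
  (parts(0), parts(1).replace("\"", ""), parts(2))
}

val row = splitRow("1,\"ashu\",\"building\",\"street\",\"area\",\"city\",\"state\",\"pin\"")
println(row)
```

The same function could be applied per line after loading the file as text, for example with spark.read.textFile, skipping the header line first.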
100000,20160214,93374987
100000,20160214,1925301
100000,20160216,1896542
100000,20160216,84167419
100000,20160216,77273616
100000,20160507,1303015
I want to read a CSV file that has no column names in the first row.
How can I read it and name the columns with my own names at the same time?
For now, I just rename the original columns like this:
df = spark.read.csv("user_click_seq.csv",header=False)
df = df.withColumnRenamed("_c0", "member_srl")
df = df.withColumnRenamed("_c1", "click_day")
df = df.withColumnRenamed("_c2", "productid")
Is there a better way?
You can import the csv file into a dataframe with a predefined schema. The way you define a schema is by using the StructType and StructField objects. Assuming your data is all IntegerType data:
from pyspark.sql.types import StructType, StructField, IntegerType
schema = StructType([
StructField("member_srl", IntegerType(), True),
StructField("click_day", IntegerType(), True),
StructField("productid", IntegerType(), True)])
df = spark.read.csv("user_click_seq.csv",header=False,schema=schema)
This should work.
For those who would like to do this in scala and may not want to add types:
val df = spark.read.format("csv")
.option("header","false")
.load("hdfs_filepath")
.toDF("var0","var1","var2","var3")
You can read the data with header=False and then pass the column names with toDF as below:
data = spark.read.csv('data.csv', header=False)
data = data.toDF('name1', 'name2', 'name3')
In my case, I had many columns, and creating a schema by hand was very tedious, especially since Spark already inferred the schema well.
So I opted to rename the columns using a select.
First, I create a list with the new names:
val listNameColumns: List[String] = List("name1", "name2" , "name3")
Then I combine the column names of the original dataframe with the above list and create a list of Column elements:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
val selectStament: Array[Column] = df.columns zip listNameColumns map { case(a, b) => col(a).as(b)}
Finally I make the select:
val dfRenamed = df.select(selectStament:_*)