Read CSV into a DataFrame with fewer header fields than values in a record

How can I read a CSV file in Spark that has a structure like:
id,name,address
1,"ashu","building","street","area","city","state","pin"
When using a reader:
val df = spark.read.option("header",true).csv("input/input1.csv")
I only get the record up to the third value:
+---+----+--------+
| id|name| address|
+---+----+--------+
| 1|ashu|building|
+---+----+--------+
How can I ask Spark to read all the values from the third one to the last into a single DataFrame column, address, like:
+---+----+-----------------------------------------------+
| id|name| address |
+---+----+-----------------------------------------------+
| 1|ashu|"building","street","area","city","state","pin"|
+---+----+-----------------------------------------------+

I'm making my answer fit your requirement to use CSV; this is the least painful way to do what you want.
Modify your CSV file so that it uses "|" to split fields instead of ",". This lets you keep ',' inside your columns:
id|name|address
1|"ashu"|"building","street","area","city","state","pin"
Then modify your code (note the delimiter must be a String, not a Char):
val df = spark.read
  .option("header", true)
  .option("delimiter", "|")
  .csv("input/input1.csv")
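As a plain-Python aside (not Spark), the stdlib csv module shows how a pipe delimiter keeps the commas inside the address field intact; the file content here is a hypothetical inline sample:

```python
import csv
import io

# Pipe-delimited content: the commas in the address are ordinary characters now.
content = "id|name|address\n1|ashu|building,street,area,city,state,pin\n"

reader = csv.reader(io.StringIO(content), delimiter="|")
header, record = list(reader)

print(header)  # ['id', 'name', 'address']
print(record)  # ['1', 'ashu', 'building,street,area,city,state,pin']
```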

If you can fix your input files to use another delimiter character, then you should do that.
However, if you don't have that possibility, you can still read the file without header and specify a custom schema. Then, concatenate the 6 address columns to get the desired dataframe:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, concat_ws}

val schema = StructType(
  Array(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("address1", StringType, true),
    StructField("address2", StringType, true),
    StructField("address3", StringType, true),
    StructField("address4", StringType, true),
    StructField("address5", StringType, true),
    StructField("address6", StringType, true)
  )
)
val input = spark.read.schema(schema).csv("input/input1.csv")
// The file is read without the header option, so filter out the header row itself.
val df = input.filter("name != 'name'").withColumn(
  "address",
  concat_ws(", ", (1 to 6).map(n => col(s"address$n")): _*)
).select("id", "name", "address")
df.show(false)
//+---+----+----------------------------------------+
//|id |name|address |
//+---+----+----------------------------------------+
//|1 |ashu|building, street, area, city, state, pin|
//+---+----+----------------------------------------+
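The "keep everything from the third field onward" idea can also be sketched in plain Python with a capped split (illustration only, not Spark):

```python
line = '1,"ashu","building","street","area","city","state","pin"'

# Split on at most the first two commas; the remainder stays together as the address.
record_id, name, address = line.split(",", 2)

print(record_id)  # 1
print(name)       # "ashu"
print(address)    # "building","street","area","city","state","pin"
```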

Related

Check matching data in Dataframes in spark

How can I check whether two DataFrames, generated the way I did below, contain the same data, including the number of rows?
I'm using ScalaTest to run the tests, with Spark 3 and Scala 2.12.12.
Below is my code, from the creation of the schemas of my two DataFrames (including the expected one) to the creation of all three with data.
// Create schemas for the amounts and warehouse data, and the expected final schema
val amountsSchema: StructType = StructType(
Seq(
StructField("positionId", LongType, nullable = true),
StructField("amount", DecimalType(10, 2), nullable = true),
StructField("eventTime",LongType, nullable = true),
)
)
val warehouseSchema: StructType = StructType(
Seq(
StructField("positionId", LongType, nullable = true),
StructField("warehouse", StringType, nullable = true),
StructField("product", StringType, nullable = true),
StructField("eventTime",LongType, nullable = true),
)
)
val expectedDfSchema: StructType = StructType(
Seq(
StructField("positionId", LongType, nullable = true),
StructField("warehouse", StringType, nullable = true),
StructField("product", StringType, nullable = true),
StructField("amount", DecimalType(10, 2), nullable = true),
)
)
// Create data for the amounts, warehouse, and expected DataFrames
// Values are typed to match the schemas above (Long, BigDecimal, Long)
val amounts_data = Seq(
  Row(1L, BigDecimal("5.00"), 1528463387L),
  Row(1L, BigDecimal("7.20"), 1528463005L),
  Row(2L, BigDecimal("5.00"), 1528463097L),
  Row(2L, BigDecimal("7.20"), 1528463007L),
  Row(3L, BigDecimal("6.00"), 1528463078L),
  Row(4L, BigDecimal("24.20"), 1528463008L),
  Row(4L, BigDecimal("15.00"), 1528463100L)
)
val wh_data = Seq(
  Row(1L, "W-1", "P-1", 1528463098L),
  Row(2L, "W-2", "P-2", 1528463097L),
  Row(3L, "W-2", "P-3", 1528463078L),
  Row(4L, "W-1", "P-6", 1528463100L)
)
val expected_data = Seq(
  Row(1L, "W-1", "P-1", BigDecimal("5.00")),
  Row(2L, "W-2", "P-2", BigDecimal("5.00")),
  Row(3L, "W-2", "P-3", BigDecimal("6.00")),
  Row(4L, "W-1", "P-6", BigDecimal("15.00"))
)
// Create the DataFrames df_amounts, df_wh, and the expected DataFrame from the schemas and data
val df_amounts: DataFrame = spark.createDataFrame(
spark.sparkContext.parallelize(amounts_data),
amountsSchema
)
val df_wh: DataFrame = spark.createDataFrame(
spark.sparkContext.parallelize(wh_data),
warehouseSchema
)
val df_expected: DataFrame = spark.createDataFrame(
spark.sparkContext.parallelize(expected_data),
expectedDfSchema
)
// Use the get_amounts function to generate a DataFrame
val resDf: DataFrame = get_amounts(df_amounts, df_wh)
// Test whether the resDf schema matches the expected schema - it does, the test passes
test("DataFrame Schema Test") {
assert(assertSchema(resDf.schema, df_expected.schema))
}
// Test whether the resDf data matches the expected data - it doesn't match
test("DataFrame Data Test") {
assert(assertData(resDf, df_expected))
}
}
The assertData function is used to compare the expected DataFrame with the one coming from my get_amounts function, but the test fails.
def assertData(df1: DataFrame, df2: DataFrame): Boolean = {
df1.exceptAll(df2).rdd.isEmpty()
}
Thank You
The way you create the datasets is valid, and the test structure looks good as well.
I would suggest improving your assert method so you can see why the test case fails. Here are some thoughts on your testing method:
exceptAll is not perfect for testing: if df2 contains additional rows, it will still report that the data matches. Consider the code below:
val df1 = Seq(
(1, "x"),
(2, "y")
).toDF("x", "y")
val df2 = Seq(
(1, "x"),
(2, "y"),
(3, "z")
).toDF("x", "y")
assert(df1.exceptAll(df2).rdd.isEmpty())
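The asymmetry can be sketched outside Spark with Python multisets (a hypothetical illustration of the same pitfall):

```python
from collections import Counter

df1 = [(1, "x"), (2, "y")]
df2 = [(1, "x"), (2, "y"), (3, "z")]

# One-sided difference, like df1.exceptAll(df2): empty, so the check "passes"
# even though df2 has an extra row.
one_sided = list((Counter(df1) - Counter(df2)).elements())

# A symmetric comparison catches the extra row.
symmetric_equal = Counter(df1) == Counter(df2)

print(one_sided)        # []
print(symmetric_equal)  # False
```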
"This function resolves columns by position (not by name)" (from the Spark Scala docs), so the test result can sometimes be misleading.
Your assert method says nothing about what exactly mismatched.
For testing purposes it is not bad to collect a (small amount of) data and compare sequences.
You can consider using a method like this one:
def matchDF(resultDF: DataFrame, expectedDF: DataFrame): Unit = {
  resultDF.printSchema()
  expectedDF.printSchema()
  assert(resultDF.schema == expectedDF.schema,
    s"Schema does not match: ${resultDF.schema} != ${expectedDF.schema}")
  val expected = expectedDF.collect().toSeq
  val result = resultDF.collect().toSeq
  assert(expected == result, s"Data does not match: $result != $expected")
}
It's not a perfect approach (it still depends on the position of values within a row), but at least you will be able to find out what is going on and why your test fails.
For wrong data you'll see this:
assertion failed: Data does not match: WrappedArray([1,x], [2,y]) != WrappedArray([1,x], [3,y])
For wrong schema you'll get:
root
|-- x: integer (nullable = false)
|-- y: string (nullable = true)
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
Exception in thread "main" java.lang.AssertionError: assertion failed: Schema does not match
I hope this will help you understand what is going wrong.

How do I execute a readStream of a tab-delimited file in Databricks?

I have a 1000-line csv file containing tab-separated values, and want to read them into a data frame.
Here is my schema:
schema = StructType([StructField("TEST1", StringType(), True),
StructField("TEST2", StringType(), True),
StructField("TEST3", StringType(), True),
StructField("TEST4", StringType(), True),
StructField("TEST5", StringType(), True)])
Here is my readStream statement:
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("delimiter", "\t")
      .option("path", "/mnt/data/data.tsv")
      .schema(schema)
      .load())
Running the readStream cell results in:
df: pyspark.sql.dataframe.DataFrame
TEST1: string
TEST2: string
TEST3: string
TEST4: string
TEST5: string
In the next cell I executed
display(df)
and ended up with
Query returned no results.
I think there is something wrong with my schema or readStream statements. Is the tab delimiter set correctly?
I found out what the problem was. The path should not have been a specific file; rather, it should have been a path containing wildcards. After I changed the command to this:
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("delimiter", "\t")
      .option("path", "/mnt/data/*.tsv")
      .schema(schema)
      .load())
it worked just fine.
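The wildcard matching itself can be illustrated with Python's fnmatch (plain Python, hypothetical paths):

```python
from fnmatch import fnmatch

paths = ["/mnt/data/data.tsv", "/mnt/data/more.tsv", "/mnt/data/readme.txt"]

# Only the .tsv files match the glob-style pattern used in the stream path.
matched = [p for p in paths if fnmatch(p, "/mnt/data/*.tsv")]

print(matched)  # ['/mnt/data/data.tsv', '/mnt/data/more.tsv']
```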

How to read csv without header and name them with names while reading in pyspark?

100000,20160214,93374987
100000,20160214,1925301
100000,20160216,1896542
100000,20160216,84167419
100000,20160216,77273616
100000,20160507,1303015
I want to read a CSV file that has no column names in the first row.
How can I read it and name the columns with my specified names at the same time?
For now, I just rename the original columns with my specified names like this:
df = spark.read.csv("user_click_seq.csv",header=False)
df = df.withColumnRenamed("_c0", "member_srl")
df = df.withColumnRenamed("_c1", "click_day")
df = df.withColumnRenamed("_c2", "productid")
Any better way?
You can import the csv file into a dataframe with a predefined schema. The way you define a schema is by using the StructType and StructField objects. Assuming your data is all IntegerType data:
from pyspark.sql.types import StructType, StructField, IntegerType
schema = StructType([
StructField("member_srl", IntegerType(), True),
StructField("click_day", IntegerType(), True),
StructField("productid", IntegerType(), True)])
df = spark.read.csv("user_click_seq.csv",header=False,schema=schema)
should work.
For those who would like to do this in scala and may not want to add types:
val df = spark.read.format("csv")
.option("header","false")
.load("hdfs_filepath")
.toDF("var0","var1","var2","var3")
You can read the data with header=False and then pass the column names with toDF, as below:
data = spark.read.csv('data.csv', header=False)
data = data.toDF('name1', 'name2', 'name3')
In my case I had many columns, so creating a schema was very tedious, especially since Spark inferred the schema well anyway.
So I opted to rename them using a select.
First I create a list with the new names:
val listNameColumns: List[String] = List("name1", "name2" , "name3")
Then I combine the column names of the original dataframe with that list and create a list of Column elements:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
val selectStatement: Array[Column] = df.columns zip listNameColumns map { case (a, b) => col(a).as(b) }
Finally I make the select:
val dfRenamed = df.select(selectStatement: _*)
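The zip-and-rename idea translates to plain Python directly (an illustration of the pairing, not Spark code):

```python
old_names = ["_c0", "_c1", "_c2"]
new_names = ["member_srl", "click_day", "productid"]

# Pair each original column with its new name, as the Scala zip/map does.
rename_pairs = list(zip(old_names, new_names))

print(rename_pairs)  # [('_c0', 'member_srl'), ('_c1', 'click_day'), ('_c2', 'productid')]
```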

(Scala) Convert String to Date in Apache Spark

I would like to read a .csv file with Spark and associate the columns with fitting Types.
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Name")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("date", DateType, true),
StructField("time",StringType, true),
StructField("am", DoubleType, true),
StructField("hum", DoubleType, true),
StructField("temp", DoubleType, true)
))
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("delimiter",";")
.schema(customSchema)
.load("data.csv")
A line of the .csv I am reading looks like this
+----------+--------+-----+-----+-----+
| date| time| am| hum| temp|
+----------+--------+-----+-----+-----+
|04.10.2016|12:51:20|1.121|0.149|0.462|
+----------+--------+-----+-----+-----+
Spark reads the .csv and associates the types correctly if I set the type of the date column to String. If I keep the customSchema as in the code shown above, Spark throws an exception due to the wrong date format (DateType expects yyyy-MM-dd while mine is dd.MM.yyyy).
Is there a way to re-format the date Strings to YYYY-MM-DD and apply the schema afterwards? Or can I also alter the DateType given by Spark by adding parameters?
Thanks in advance
Use dateFormat option:
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("delimiter",";")
.option("dateFormat", "dd.MM.yyyy")
.schema(customSchema)
.load("data.csv")
I recommend parsing the dates afterwards. Also refer to this.
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
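The equivalent format translation can be sketched in plain Python (Java's dd.MM.yyyy corresponds to Python's %d.%m.%Y):

```python
from datetime import datetime

# Parse the dd.MM.yyyy string and re-emit it in ISO yyyy-MM-dd form.
iso = datetime.strptime("04.10.2016", "%d.%m.%Y").strftime("%Y-%m-%d")

print(iso)  # 2016-10-04
```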
and, using java.time to re-format a date value:
scala> import java.time.format.DateTimeFormatter
scala> val date = java.time.LocalDate.of(2020, 1, 1)
scala> date.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
res0: String = 2020.01.01
Also, as a side note: since Spark 2.0 you use the SparkSession object and encoders for inferring the schema (instead of sc, sqlContext, etc.). Something like this:
val spark = SparkSession.builder().getOrCreate()
case class User(id: Int, city: String, loc: Array[Double], pop: Long, state: String)
import spark.implicits._
val users = spark.read.option("inferSchema", "true").option("header", "true").csv("data/users1.csv").as[User]

Pyspark: Convert a '\x01'-delimited file from S3 into a dataframe

Spark: 1.4.0
I have a flatfile from Amazon S3 which I loaded into HDFS (in the master node of my EC2 Spark cluster). The flatfile is a Hive output. Note: I couldn't change the context as it is already defined. The following codes are used in the pyspark shell:
Each 'row' corresponds to 1 row of data:
row = sc.textFile("/data/file")
row.first()
u'E8B98\x01John\x01Smith\x01Male\x01Gold\x0125'
Then I split each row using flatmap() since for some reason map() doesn't seem to delimit it (using '\x01' as the delimiter):
elements = row.flatMap(lambda x: x.split('\x01'))
elements.take(8)
[u'E8B98', u'John', u'Smith', u'Male', u'Gold', u'25', u'E8B99', u'Alice']
Since I know the data has 6 columns per row, how do I get it into a dataframe? I intend to sort by attribute, sum, etc.
I tried the following but it didn't work:
id = row.flatMap(lambda x: x.split('\x01')[0])
id.first()
E
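(As an aside, the stray 'E' happens because flatMap flattens whatever the function returns, and a string is an iterable of characters; a plain-Python sketch of the same behavior:)

```python
from itertools import chain

rows = ["E8B98\x01John\x01Smith", "E8B99\x01Alice\x01Doe"]

# flatMap(f) behaves like chaining f(row) for every row; here f returns a string,
# so the flattening iterates it character by character.
flat = list(chain.from_iterable(row.split("\x01")[0] for row in rows))

print(flat[:5])  # ['E', '8', 'B', '9', '8']
```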
There are many ways to transform an RDD into a dataframe in Python.
Consider the following RDD:
rdd = sc.parallelize(list(["E8B98\x01John\x01Smith\x01Male\x01Gold\x0125","E8B2\x01Joe\x01Smith\x01Female\x01Gold\x0125"]))
rdd.first()
Output:
'E8B98\x01John\x01Smith\x01Male\x01Gold\x0125'
Let's now create an RDD of lists (split returns a list per row):
rdd2 = rdd.map(lambda x : x.split("\x01"))
rdd2.first()
Output:
['E8B98', 'John', 'Smith', 'Male', 'Gold', '25']
We can now create a dataframe in one of the following ways.
Create it directly from the RDD of lists:
sqlContext.createDataFrame(rdd2).collect()
Output:
[Row(_1=u'E8B98', _2=u'John', _3=u'Smith', _4=u'Male', _5=u'Gold', _6=u'25'), Row(_1=u'E8B2', _2=u'Joe', _3=u'Smith', _4=u'Female', _5=u'Gold', _6=u'25')]
or create it from the same RDD, specifying the column names:
df = sqlContext.createDataFrame(rdd2, ['id', 'name', 'surname', 'gender', 'description', 'age'])
df.collect()
Output:
[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'), Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]
or create it with an explicit schema:
from pyspark.sql.types import *
schema = StructType([
StructField("id", StringType(), True),
StructField("name", StringType(), True),
StructField("surname", StringType(), True),
StructField("gender", StringType(), True),
StructField("description", StringType(), True),
StructField("age", StringType(), True)])
df2 = sqlContext.createDataFrame(rdd2, schema)
df2.collect()
Output:
[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'),Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]
or specify your own row class as follows:
from pyspark.sql import Row
Person = Row('id', 'name', 'surname', 'gender', 'description', 'age')
person = rdd2.map(lambda r: Person(*r))
df3 = sqlContext.createDataFrame(person)
df3.collect()
Output:
[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'), Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]
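The Row-class approach mirrors a plain-Python namedtuple (sketch only, outside Spark):

```python
from collections import namedtuple

# A named record type playing the role of pyspark.sql.Row('id', 'name', ...).
Person = namedtuple("Person", ["id", "name", "surname", "gender", "description", "age"])

raw = ["E8B98\x01John\x01Smith\x01Male\x01Gold\x0125"]
people = [Person(*line.split("\x01")) for line in raw]

print(people[0].name)  # John
```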
I hope this helps!
NB: Spark version >= 1.3.0