I've created a dataframe which contains 3 columns: zip, lat, lng.
I want to select the lat and lng values where zip = 00650.
So I tried:
sqlContext.sql("select lat,lng from census where zip=00650").show()
But it returns an ArrayIndexOutOfBoundsException because the result does not contain any values.
If I remove the where clause it runs fine.
Can someone please explain what I am doing wrong?
Update:
dataframe schema:
root
|-- zip: string (nullable = true)
|-- lat: string (nullable = true)
|-- lng: string (nullable = true)
The first rows are:
+-----+---------+-----------+
| zip| lat| lng|
+-----+---------+-----------+
|00601|18.180555| -66.749961|
|00602|18.361945| -67.175597|
|00603|18.455183| -67.119887|
|00606|18.158345| -66.932911|
|00610|18.295366| -67.125135|
|00612|18.402253| -66.711397|
|00616|18.420412| -66.671979|
|00617|18.445147| -66.559696|
|00622|17.991245| -67.153993|
|00623|18.083361| -67.153897|
|00624|18.064919| -66.716683|
|00627|18.412600| -66.863926|
|00631|18.190607| -66.832041|
|00637|18.076713| -66.947389|
|00638|18.295913| -66.515588|
|00641|18.263085| -66.712985|
|00646|18.433150| -66.285875|
|00647|17.963613| -66.947127|
|00650|18.349416| -66.578079|
+-----+---------+-----------+
As you can see in your schema, zip is of type String, so your query should be something like this:
sqlContext.sql("select lat, lng from census where zip = '00650'").show()
Update:
If you are using Spark 2, you can do this:
import sparkSession.implicits._
val dataFrame = Seq(("10.023", "75.0125", "00650"), ("12.0246", "76.4586", "00650"), ("10.023", "75.0125", "00651")).toDF("lat", "lng", "zip")
dataFrame.printSchema()
dataFrame.select("*").where(dataFrame("zip") === "00650").show()
// registerTempTable is deprecated in Spark 2; createOrReplaceTempView is the replacement
dataFrame.createOrReplaceTempView("census")
sparkSession.sql("SELECT lat, lng FROM census WHERE zip = '00650'").show()
Output:
root
|-- lat: string (nullable = true)
|-- lng: string (nullable = true)
|-- zip: string (nullable = true)
+-------+-------+-----+
| lat| lng| zip|
+-------+-------+-----+
| 10.023|75.0125|00650|
|12.0246|76.4586|00650|
+-------+-------+-----+
+-------+-------+
| lat| lng|
+-------+-------+
| 10.023|75.0125|
|12.0246|76.4586|
+-------+-------+
I resolved my issue using an RDD rather than a DataFrame. It gave me the desired results:
// split each CSV line into fields, then keep the first row whose fields contain "00650"
val data = sc.textFile("/home/ishan/Desktop/c").map(_.split(","))
val arr = data.filter(_.contains("00650")).take(1)
arr.foreach { a => a.foreach(println) }
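For what it's worth, the same lookup can be done without dropping to RDDs. Below is a minimal sketch, assuming Spark 2, that /home/ishan/Desktop/c is a headerless CSV, and that its columns come in the order zip, lat, lng (the path is taken from the RDD snippet above; the column order is an assumption):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
val spark = SparkSession.builder().appName("census").getOrCreate()
// Assumption: the file has exactly three columns in the order zip, lat, lng and no header.
// Reading every column as string keeps the leading zeros of the zip codes.
val census = spark.read
  .option("header", "false")
  .option("inferSchema", "false")
  .csv("/home/ishan/Desktop/c")
  .toDF("zip", "lat", "lng")
census.filter(col("zip") === "00650").select("lat", "lng").show()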
I have the following code and outputs.
import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._
val df = Seq(
("john", "tomato", 1.99),
("john", "carrot", 0.45),
("bill", "apple", 0.99),
("john", "banana", 1.29),
("bill", "taco", 2.59)
).toDF("name", "food", "price")
df.groupBy($"name")
.agg(collect_list(struct($"food", $"price")).as("foods"))
.show(false)
df.printSchema
Output and Schema:
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]] |
+----+---------------------------------------------+
root
|-- name: string (nullable = true)
|-- foods: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- food: string (nullable = true)
| | |-- price: double (nullable = false)
I want to filter based on df("foods.price") > 1.00. How do I filter this to get the output below?
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[banana,1.29], [tomato,1.99]]               |
|bill|[[taco,2.59]]                                |
+----+---------------------------------------------+
I have tried df.filter($"foods.food" > 1.00), but this does not work and I get an error. Is there anything else I can try?
You are trying to apply the filter on an array column, hence it throws an error because the syntax is wrong. You can apply the filter on price first and then do the aggregation as needed.
val cf = df.filter("price > 1.0").groupBy($"name").agg(collect_list(struct($"food", $"price")).as("foods"))
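If you are on Spark 2.4 or later (an assumption; the question does not state a version), another option is to keep the aggregation as it is and prune the array itself with the filter higher-order function. A sketch, reusing df and the implicits import from the question:
import org.apache.spark.sql.functions.{collect_list, expr, struct}
// Build the grouped array first, then drop the cheap items from each array in place.
// Assumption: Spark 2.4+ for the filter() higher-order function.
val grouped = df.groupBy($"name").agg(collect_list(struct($"food", $"price")).as("foods"))
grouped.withColumn("foods", expr("filter(foods, x -> x.price > 1.0)")).show(false)
Note that, unlike filtering before the groupBy, this keeps names whose arrays end up empty.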
I'm new to Scala/Spark and I'm trying to explode a dataframe that has an array column and an array-of-struct column, so that I end up with no arrays and no structs.
Here's an example
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(4,5,6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
df.printSchema
df.show
df has the following schema
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
and the data looks like
+---+---------+--------------------+
| id| before| after|
+---+---------+--------------------+
| 1|[4, 5, 6]|[[07:00, 07:30, 7...|
+---+---------+--------------------+
How do I explode the dataframe so I get the following schema
|-- id: string (nullable = true)
|-- before: integer (containsNull = false)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- after_area: string (nullable = true)
The resulting data should have 3 rows and 5 columns
+---+------+----------------+--------------+----+
| id|before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|  1|     4|           07:00|         07:30|  70|
|  1|     5|           08:00|         08:30|  80|
|  1|     6|           09:00|         09:30|  90|
+---+------+----------------+--------------+----+
I'm using Spark 2.3.0 (arrays_zip is not available), and the only solutions I can find are either for exploding two arrays of strings or one array of structs.
Use arrays_zip to combine the two arrays, then explode to turn the zipped array into rows, and use as to rename the required columns.
Since arrays_zip is not available in Spark 2.3, I created a UDF that performs the same operation:
val arrays_zip = udf((before:Seq[Int],after: Seq[Area]) => before.zip(after))
Execution time with the built-in (Spark 2.4.2) arrays_zip: 1146 ms
Execution time with the arrays_zip UDF: 1165 ms
Check the code below.
scala> df.show(false)
+---+---------+------------------------------------------------------------+
|id |before |after |
+---+---------+------------------------------------------------------------+
|1 |[4, 5, 6]|[[07:00, 07:30, 70], [08:00, 08:30, 80], [09:00, 09:30, 90]]|
+---+---------+------------------------------------------------------------+
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.printSchema
root
|-- id: string (nullable = true)
|-- before: integer (nullable = true)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- area: string (nullable = true)
Output
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.show(false)
+---+------+----------------+--------------+----+
|id |before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|1 |4 |07:00 |07:30 |70 |
|1 |5 |08:00 |08:30 |80 |
|1 |6 |09:00 |09:30 |90 |
+---+------+----------------+--------------+----+
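If you would rather avoid the UDF on Spark 2.3 altogether, a sketch using posexplode plus array indexing can achieve the same result (this assumes, as in the question, that before and after always have the same length):
import org.apache.spark.sql.functions.{col, expr, posexplode}
// posexplode emits the element position as "pos" and the struct element as "col";
// before[pos] then picks the before value that lines up with each after entry.
// Assumption: before and after have the same number of elements per row.
df.select(col("id"), col("before"), posexplode(col("after")))
  .select(
    col("id"),
    expr("before[pos]").as("before"),
    col("col.start_time").as("after_start_time"),
    col("col.end_time").as("after_end_time"),
    col("col.area").as("after_area")
  )
  .show(false)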
To handle the complex struct you can do the following: declare two beans, Area (input) and Area2 (output), then map each row to an Area2 bean.
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import scala.collection.mutable
object ExplodeTwoArrays {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
val df = Seq((
"1", Seq(4, 5, 6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
val outDf = df.map(row=> {
val id = row.getString(0)
val beforeArray : Seq[Int]= row.getSeq[Int](1)
val afterArray : mutable.WrappedArray[Area2] =
row.getAs[mutable.WrappedArray[GenericRowWithSchema]](2) // Need to map Array(Struct) to something compatible
.zipWithIndex // Require to iterate with indices
.map{ case(element,i) => {
Area2(element.getAs[String]("start_time"),
element.getAs[String]("end_time"),
element.getAs[String]("area"),
beforeArray(i))
}}
(id,afterArray) // Return row(id,Array(Area2(...)))
}).toDF("id","after")
outDf.printSchema()
outDf.show()
}
}
case class Area(start_time: String, end_time: String, area: String)
case class Area2(start_time: String, end_time: String, area: String, before: Int)
I used the Spark SQL functions arrays_zip and flatten to transform data from an array of structs whose fields are inner arrays of the same length into an array of structs. printSchema shows exactly what I want. However, the written output loses the original column names and replaces them with generic column names "0", "1", "2", etc., whether in Parquet or Avro format. I would like to output the original column names.
So as not to reveal my company's business, the following are similar but much simplified examples.
scala> c2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- month: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- num: array (nullable = true)
| | | |-- element: long (containsNull = true)
scala> c2.show(false)
+----------------------------------------------+
|cal |
+----------------------------------------------+
|[[[Jan, Feb, Mar], [1, 2, 3]], [[April], [4]]]|
+----------------------------------------------+
I would like to transform it to
scala> newC2.show(false)
+------------------------------------------+
|cal |
+------------------------------------------+
|[[Jan, 1], [Feb, 2], [Mar, 3], [April, 4]]|
+------------------------------------------+
with
scala> newC2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
I know arrays_zip only works well on top-level arrays, so I flatten them to the top level. The following code works in this example:
val newC2 = c2.withColumn("month", flatten(col("cal.month"))).withColumn("num", flatten(col("cal.num"))).withColumn("cal", arrays_zip(col("month"), col("num"))).drop("month", "num")
It generates exactly the data and schema I want. However, it writes all columns generically as "0", "1", "2", etc.
newC2.write.option("header", false).parquet("c2_parquet")
I tried another example that has the month array and num array at the top level. There I can arrays_zip without flatten and get the same schema and data as shown, and in that case the original field names are written correctly.
I tried adding an alias to the flattened data; that does not work. I even tried manipulating the columns like this (assume the field storing the result of arrays_zip is 'zipped'):
val columns: Array[Column] = inner.fields.map(_.name).map{x => col("zipped").getField(x).alias(x)}
val newB3 = newB2.withColumn("b", array(struct(columns:_*))).drop("zipped")
It ends up generating the original schema ("month" as array of string and "num" as array of long).
To reproduce the problem, you can use this JSON input:
{"cal":[{"month":["Jan","Feb","Mar"],"num":[1,2,3]},{"month":["April"],"num":[4]}]}
The following JSON is for the top-level arrays_zip case:
{"month":["Jan","Feb","Mar"],"num":[1,2,3]}
How does Spark internally decide what field names to use? How can I get it to work? Please advise.
Since Spark 2.4, the schema transformation can be achieved using higher-order functions. In Scala the query can look like this:
import org.apache.spark.sql.functions.{expr, flatten}
val result = df
.withColumn("cal", flatten(expr("TRANSFORM(cal, x -> zip_with(x.month, x.num, (month, num) -> (month,num)))")))
After applying it to your sample data, I get this schema:
result.printSchema()
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
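If you want to be explicit about the field names that end up in the written files, you can also spell them out with named_struct inside the lambda. A sketch under the same Spark 2.4+ assumption, reusing df from the answer above (the question's c2) and the c2_parquet path from the question:
import org.apache.spark.sql.functions.{expr, flatten}
// Assumption: Spark 2.4+ for TRANSFORM and zip_with.
val named = df.withColumn("cal", flatten(expr(
  "TRANSFORM(cal, x -> zip_with(x.month, x.num, (m, n) -> named_struct('month', m, 'num', n)))")))
// The explicit names 'month' and 'num' are carried into the written schema.
named.write.mode("overwrite").parquet("c2_parquet")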
We are reading data from DynamoDB, so we get the data type as string, but we want to write the string data as array(map(array)).
String data:
{"policy_details":[{"cdhid":" 123","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}, {"cdhid":" 1234","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}]}
Output required:
The string data type needs to be converted to ARRAY(MAP(ARRAY)).
We have tried the schema below:
ArrayType([
StructField("policy_num", MapType(ArrayType([
StructField("cdhid", StringType(), True),
StructField("role_id", StringType(), True),
StructField("role_desc", StringType(), True)
])))
])
We get the issue below:
elementType [StructField(cdhid,StringType,true),
StructField(role_id,StringType,true),
StructField(role_desc,StringType,true)] should be an instance of <class 'pyspark.sql.types.DataType'>
Regarding your data, the schema you want is not the one that fits.
The schema of your data is:
from pyspark.sql import types as T
schm = T.StructType([T.StructField("policy_details",T.ArrayType(T.StructType([
T.StructField("cdhid", T.StringType(), True),
T.StructField("p2cid", T.StringType(), True),
T.StructField("roleDesc", T.StringType(), True),
T.StructField("positionnum", T.StringType(), True),
])), True)])
Then, you just need to use the from_json function.
from pyspark.sql import functions as F
df.show()
+--------------------+
| db_data|
+--------------------+
|{"policy_details"...|
+--------------------+
new_df = df.select(F.from_json("db_data", schm).alias("data"))
new_df.printSchema()
root
|-- data: struct (nullable = true)
| |-- policy_details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- cdhid: string (nullable = true)
| | | |-- p2cid: string (nullable = true)
| | | |-- roleDesc: string (nullable = true)
| | | |-- positionnum: string (nullable = true)
EDIT: If you want to use MapType, you can replace the schema with:
schm = T.StructType([
T.StructField(
"policy_details",
T.ArrayType(T.MapType(
T.StringType(),
T.StringType()
)),
True
)
])
I am trying to output CSV from a PySpark df and then re-input it, but when I specify a schema for a column that is an array, it says that some of the rows are False.
Here is my df
avg(rating) belongs_to_collection budget \
0 2.909946 False 5000000
1 3.291962 False 18000000
2 3.239811 False 8000000
3 3.573318 False 1500000
4 3.516590 False 40000000
genres original_language
0 ['Drama', 'Romance'] en
1 ['Comedy'] en
2 ['Drama', 'Family'] en
3 ['Crime', 'Drama', 'Mystery', 'Thriller'] en
4 ['Crime', 'Drama', 'Thriller'] en
I first output to csv: df.drop('id').toPandas().to_csv('mergedDf.csv',index=False)
I tried reading using df = spark.read.csv('mergedDf.csv',schema=schema), but I get this error: 'CSV data source does not support array<string> data type.;'
So I tried reading with pandas and then converting to a Spark df, but it tells me that the column that contains a list has a boolean value:
df = pd.read_csv('mergedDf.csv')
df = spark.createDataFrame(df,schema=schema)
TypeError: field genres: ArrayType(StringType,true) can not accept object False in type <class 'bool'>
However, when I check whether any of the rows are equal to False, I find that none of them are. I checked:
df[df['genres']=="False"] and df[df['genres']==False]
Unfortunately, Spark's CSV reader doesn't yet support complex data types like arrays, so you have to handle the logic of casting the string column into an array column yourself.
Use pandas to write the Spark dataframe as CSV with a header:
df.drop('id').toPandas().to_csv('mergedDf.csv',index=False,header=True)
df1 = spark.read.option('header','true').option("inferSchema","true").csv('mergedDf.csv')
df1.printSchema()
df1.show(10,False)
When you read the CSV back with Spark, the array column comes back as a string:
root
|-- avg(rating): double (nullable = true)
|-- belongs_to_collection: boolean (nullable = true)
|-- budget: integer (nullable = true)
|-- genres: string (nullable = true)
|-- original_language: string (nullable = true)
+-----------+---------------------+--------+-----------------------------------------+-----------------+
|avg(rating)|belongs_to_collection|budget |genres |original_language|
+-----------+---------------------+--------+-----------------------------------------+-----------------+
|2.909946 |false |5000000 |['Drama', 'Romance'] |en |
|3.291962 |false |18000000|['Comedy'] |en |
|3.239811 |false |8000000 |['Drama', 'Family'] |en |
|3.573318 |false |1500000 |['Crime', 'Drama', 'Mystery', 'Thriller']|en |
|3.51659 |false |40000000|['Crime', 'Drama', 'Thriller'] |en |
+-----------+---------------------+--------+-----------------------------------------+-----------------+
Split the string column to create an array and get back your original format (strip the brackets and quotes, then split on the comma separator):
from pyspark.sql.functions import col, regexp_replace, split
df2 = df1.withColumn('genres', split(regexp_replace(col('genres'), r"[\[\]']", ''), ', '))
df2.printSchema()
root
|-- avg(rating): double (nullable = true)
|-- belongs_to_collection: boolean (nullable = true)
|-- budget: integer (nullable = true)
|-- genres: array (nullable = true)
| |-- element: string (containsNull = true)
|-- original_language: string (nullable = true)