Extract columns from a list of lists in pyspark - dataframe

I am stuck trying to extract columns from a list of lists and can't work out how to do it. I am fairly new to Spark and am running PySpark on Spark 2.4.3.
I have a JSON file organized like this:
{ "meta" : { ... },
"data" :
[[ "a", 0, null, "{ }"],
[ "b", 0, null, "{ }"],
[ "c", 0, null, "{ }"],
] }
I want to get the 'data' portion into columns, like
+------+------+------+------+
| col1 | col2 | col3 | col4 |
+------+------+------+------+
| a | 0 | None | "{ }"|
| b | 0 | None | "{ }"|
| c | 0 | None | "{ }"|
I have my dataframe read in and printSchema() shows this.
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true) ...
My rough shape is 70 columns by 650k rows.
I was able to explode the df to get just the data portion but am stuck there.

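Explode the rows first, and then select the array elements using [] in Python.
from pyspark.sql import functions as F
# explode() turns each inner array into its own row; data[i] then picks element i
df2 = df.select(F.explode('data').alias('data')) \
    .select(*[F.col('data')[i].alias('col%s' % (i + 1)) for i in range(4)])
df2.show()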
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a| 0|null| { }|
| b| 0|null| { }|
| c| 0|null| { }|
+----+----+----+----+

Why not just use the SparkSession.createDataFrame() method?
https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
You can pass the data and schema parameters to this method and get a Spark dataframe back.
Example:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.getOrCreate()
df = sparkSession.createDataFrame(data)
If Spark cannot infer the schema from the data, then the schema also needs to be provided:
from pyspark.sql.types import StructType
struct = StructType()
struct.add("col1", "string", True)
struct.add("col2", "integer", True)
struct.add("col3", "string", True)
struct.add("col4", "string", True)
df = sparkSession.createDataFrame(data=data, schema=struct)
In addition, you can use the PySpark type classes instead of type names given as strings.
https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#module-pyspark.sql.types
This module contains both simple types (StringType, IntegerType, ...) and complex types (ArrayType, MapType, ...).
Last note: the data cannot contain the JSON literal null; in Python it must be None. DataFrame.show() prints None values as null.
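The note about null and None can be seen in plain Python: json.loads already maps JSON null to None, so rows parsed this way are in the form createDataFrame() expects. A minimal sketch, assuming the raw text below (a cut-down, hypothetical stand-in for the real file) and positional column names col1..colN:

```python
import json

# Hypothetical, cut-down stand-in for the real file; "meta" is left empty here.
raw = '{"meta": {}, "data": [["a", 0, null, "{ }"], ["b", 0, null, "{ }"], ["c", 0, null, "{ }"]]}'

parsed = json.loads(raw)  # JSON null becomes Python None

# Each inner list becomes one row tuple.
rows = [tuple(inner) for inner in parsed["data"]]

# Positional names derived from the row width instead of being hard-coded.
column_names = ["col%d" % (i + 1) for i in range(len(rows[0]))]

print(column_names)
for row in rows:
    print(row)
```

rows and column_names could then be passed as the data and schema arguments of createDataFrame(). A column that contains only None gives Spark nothing to infer a type from, so an explicit StructType like the one above may still be needed.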

Related

Spark- Split or replace part of the column based on multiple delimiters [closed]

I have the dataframe shown as input below. Ultimately, I'm trying to get the output shown below, so that I can use df.filter(col("A").contains(col("B"))) to see if A contains B as a substring. Note that I'd like to check the order of the letters as well, so a set will probably not work. For example, "acb" should not be considered a substring of "abcd". I've tried to use split, but it only takes one delimiter. Could someone please help? I'm using Spark 2.4.
Input
+---+-------+-----------+
| id| A | B |
+---+-------+-----------+
| 1|abc d | bc, z |
| 2|abc-d | acb |
| 3|abcde | hj f |
+---+-------+-----------+
Output
+---+-------+-----------+
| id| A | B |
+---+-------+-----------+
| 1|abcd | bcz |
| 2|abcd | acb |
| 3|abcde | hjf |
+---+-------+-----------+
You can use regex for both split and replace.
If you want to split, your output is not right.
Split
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z"},
    {"id": 2, "A": "abc-d", "B": "acb"},
    {"id": 3, "A": "abcde", "B": "hj f"},
]
df = spark.createDataFrame(data)
# Split on an optional comma followed by whitespace, or on a hyphen
split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
+--------+-------+---+
|A |B |id |
+--------+-------+---+
|[abc, d]|[bc, z]|1 |
|[abc, d]|[acb] |2 |
|[abcde] |[hj, f]|3 |
+--------+-------+---+
Now you can create a UDF that will check if values in array B are substrings in values in array A.
Replace
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z"},
    {"id": 2, "A": "abc-d", "B": "acb"},
    {"id": 3, "A": "abcde", "B": "hj f"},
]
df = spark.createDataFrame(data)
# Remove an optional comma followed by whitespace, or a hyphen
replace_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.regexp_replace(F.col("A"), replace_regex, ""))
df = df.withColumn("B", F.regexp_replace(F.col("B"), replace_regex, ""))
Result:
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- id: long (nullable = true)
+-----+---+---+
|A |B |id |
+-----+---+---+
|abcd |bcz|1 |
|abcd |acb|2 |
|abcde|hjf|3 |
+-----+---+---+
This is a bit involved, and I would stick to split since here abcd contains both b and bc and there's no way for you to keep track of the whole words if you completely replace the delimiter.
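The difference between the two approaches can be checked outside Spark. The sketch below is a plain-Python model using the standard re module, not Spark code. It assumes an equivalent pattern written with a non-capturing group, because Python's re.split would otherwise include the captured delimiters in its result. The helper names are hypothetical.

```python
import re

# Same delimiters as the answer's regex: an optional comma followed by
# whitespace, or a hyphen. Non-capturing so re.split returns only the parts.
DELIMITER = re.compile(r"(?:,?\s|-)")

def split_parts(text):
    """Model of F.split: break the string at every delimiter."""
    return DELIMITER.split(text)

def strip_delimiters(text):
    """Model of F.regexp_replace with an empty replacement."""
    return DELIMITER.sub("", text)

rows = [("abc d", "bc, z"), ("abc-d", "acb"), ("abcde", "hj f")]

for a, b in rows:
    print(split_parts(a), split_parts(b), strip_delimiters(a), strip_delimiters(b))

# After stripping, a plain substring test respects letter order.
print(strip_delimiters("acb") in strip_delimiters("abc-d"))
```

With stripping, word boundaries are gone for good, which is the trade-off the answer describes; with splitting, each part can still be compared on its own.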

Spark- check intersect of two string columns

I have a dataframe below where colA and colB contain strings. I'm trying to check if colB contains any substring of the values in colA. The values can contain a comma or a space, but as long as any part of colB's string overlaps with colA's, it is a match. For example, row 1 below has an overlap ("bc"), and row 2 does not.
I was thinking of splitting the values into arrays but the delimiters are not constant. Could someone please help to shed some light on how to do this? Many thanks for your help.
+---+-------+-----------+
| id|colA | colB |
+---+-------+-----------+
| 1|abc d | bc, z |
| 2|abcde | hj f |
+---+-------+-----------+
You could split using a regex and then create a UDF to check for substrings.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z, d"},
    {"id": 2, "A": "abc-d", "B": "acb, abc"},
    {"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)
# Split on an optional comma followed by whitespace, or on a hyphen
split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))
def mapper(a, b):
    # Collect every element of b that is a substring of some element of a
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result
df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
|-- result: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+-----------+---+-------+
|A |B |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1 |[bc, d]|
|[abc, d]|[acb, abc] |2 |[abc] |
|[abcde] |[hj, f, ab]|3 |[ab] |
+--------+-----------+---+-------+
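One detail of mapper worth knowing: an element of B is appended once for every element of A that contains it, so it can appear several times in the result. A possible variant, shown here as plain Python and under the assumption that each match should be reported only once, uses any(). The name mapper_unique is hypothetical.

```python
def mapper_unique(a, b):
    """Return the elements of b that are substrings of at least one element of a."""
    result = []
    for ele_b in b:
        # Skip values already reported; any() stops at the first element of a
        # that contains ele_b, so each match is added a single time.
        if ele_b not in result and any(ele_b in ele_a for ele_a in a):
            result.append(ele_b)
    return result

print(mapper_unique(["abc", "d"], ["bc", "z", "d"]))
print(mapper_unique(["abc", "bcd"], ["bc"]))
```

It can be wrapped in F.udf with ArrayType(StringType()) in the same way as mapper.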
You can use a custom UDF to implement the intersect logic as below:
Data Preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import pandas as pd
# Create the SQLContext that is referred to as `sql` below
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
data = {"id": [1, 2],
        "colA": ["abc d", "abcde"],
        "colB": ["bc, z", "hj f"]}
mypd = pd.DataFrame(data)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
| 1|abc d|bc, z|
| 2|abcde| hj f|
+---+-----+-----+
UDF
def str_intersect(x, y):
    # Character-level intersection: set() discards order and duplicates
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None
str_intersect_udf = F.udf(lambda x, y: str_intersect(x, y), StringType())
sparkDF.withColumn('intersect', str_intersect_udf(F.col('colA'), F.col('colB'))).show()
+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
| 1|abc d|bc, z| bc |
| 2|abcde| hj f| null|
+---+-----+-----+---------+
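Note that set(x) & set(y) compares individual characters, so it ignores letter order and the order of the joined result is not guaranteed. If a contiguous overlap is wanted instead, one option (an assumption about the intent, not part of the answer above) is the longest common substring, which the standard library's difflib can find. The helper name longest_common_substring is hypothetical.

```python
from difflib import SequenceMatcher

def longest_common_substring(x, y):
    """Return the longest contiguous run shared by x and y, or None if there is none."""
    # autojunk=False disables the popularity heuristic applied to long sequences
    matcher = SequenceMatcher(None, x, y, autojunk=False)
    match = matcher.find_longest_match(0, len(x), 0, len(y))
    if match.size == 0:
        return None
    return x[match.a:match.a + match.size]

print(longest_common_substring("abc d", "bc, z"))
print(longest_common_substring("abcde", "hj f"))
```

The function can be registered with F.udf and StringType() in the same way as str_intersect.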

Scala Spark Dataframe: how to explode an array of Int and array of struct at the same time

I'm new to Scala/Spark and I'm trying to explode a dataframe that has an array column and an array-of-struct column, so that I end up with no arrays and no structs.
Here's an example
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(4,5,6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
df.printSchema
df.show
df has the following schema
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
and the data looks like
+---+---------+--------------------+
| id| before| after|
+---+---------+--------------------+
| 1|[4, 5, 6]|[[07:00, 07:30, 7...|
+---+---------+--------------------+
How do I explode the dataframe so I get the following schema
|-- id: string (nullable = true)
|-- before: integer (containsNull = false)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- after_area: string (nullable = true)
The resulting data should have 3 rows and 5 columns
+---+---------+--------------------+--------------------+--------+
| id| before| after_start_time| after_end_time| after_area|
+---+---------+--------------------+--------------------+--------+
| 1| 4| 07:00| 07:30| 70|
| 1| 5| 08:00| 08:30| 80|
| 1| 6| 09:00| 09:30| 90|
+---+---------+--------------------+--------------------+--------+
I'm using Spark 2.3.0 (arrays_zip is not available), and the only solutions I can find are for exploding either two arrays of strings or one array of structs.
Use arrays_zip to combine the two arrays, then explode to expand the zipped array column, and use as to rename the required columns.
As arrays_zip is not available in Spark 2.3, I created a UDF to perform the same operation.
val arrays_zip = udf((before:Seq[Int],after: Seq[Area]) => before.zip(after))
Execution time with built in (spark 2.4.2) arrays_zip - Time taken: 1146 ms
Execution time with arrays_zip UDF - Time taken: 1165 ms
Check below code.
scala> df.show(false)
+---+---------+------------------------------------------------------------+
|id |before |after |
+---+---------+------------------------------------------------------------+
|1 |[4, 5, 6]|[[07:00, 07:30, 70], [08:00, 08:30, 80], [09:00, 09:30, 90]]|
+---+---------+------------------------------------------------------------+
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.printSchema
root
|-- id: string (nullable = true)
|-- before: integer (nullable = true)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- area: string (nullable = true)
Output
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.show(false)
+---+------+----------------+--------------+----+
|id |before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|1 |4 |07:00 |07:30 |70 |
|1 |5 |08:00 |08:30 |80 |
|1 |6 |09:00 |09:30 |90 |
+---+------+----------------+--------------+----+
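As a plain-Python model of the zip-then-explode step (not Spark code), the sketch below pairs the two arrays positionally and emits one flat row per pair. It also illustrates a difference that matters when the arrays have unequal lengths: a Scala before.zip(after), like Python's zip, stops at the shorter array, whereas the built-in arrays_zip, to my understanding, fills the missing positions with nulls, which zip_longest imitates. The function name explode_zipped and its pad flag are hypothetical.

```python
from itertools import zip_longest

record = {
    "id": "1",
    "before": [4, 5, 6],
    "after": [
        {"start_time": "07:00", "end_time": "07:30", "area": "70"},
        {"start_time": "08:00", "end_time": "08:30", "area": "80"},
        {"start_time": "09:00", "end_time": "09:30", "area": "90"},
    ],
}

def explode_zipped(rec, pad=False):
    """Yield one flat row per position of the paired arrays."""
    if pad:
        pairs = zip_longest(rec["before"], rec["after"])  # pads with None
    else:
        pairs = zip(rec["before"], rec["after"])  # truncates to the shorter array
    for before, after in pairs:
        after = after or {}  # a padded slot has no struct to read from
        yield (rec["id"], before, after.get("start_time"),
               after.get("end_time"), after.get("area"))

rows = list(explode_zipped(record))
for row in rows:
    print(row)
```

For the sample data both variants give the same three rows, since before and after have the same length.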
To handle a more complex struct, you can:
Declare two beans, Area (input) and Area2 (output).
Map each row to an Area2 bean.
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import scala.collection.mutable
object ExplodeTwoArrays {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._
    val df = Seq((
      "1", Seq(4, 5, 6),
      Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
    )).toDF("id", "before", "after")
    val outDf = df.map(row => {
      val id = row.getString(0)
      val beforeArray: Seq[Int] = row.getSeq[Int](1)
      val afterArray: mutable.WrappedArray[Area2] =
        row.getAs[mutable.WrappedArray[GenericRowWithSchema]](2) // Need to map Array(Struct) to something compatible
          .zipWithIndex // Required to iterate with indices
          .map { case (element, i) => {
            Area2(element.getAs[String]("start_time"),
              element.getAs[String]("end_time"),
              element.getAs[String]("area"),
              beforeArray(i))
          }}
      (id, afterArray) // Return row(id, Array(Area2(...)))
    }).toDF("id", "after")
    outDf.printSchema()
    outDf.show()
  }
}
case class Area(start_time: String, end_time: String, area: String)
case class Area2(start_time: String, end_time: String, area: String, before: Int)

The output of Spark DataFrame replace column names with "0", "1" when arrays_zip data originated nested

I used the Spark SQL function arrays_zip combined with flatten to transform data from an array of structs holding inner arrays of the same length into an array of structs. printSchema shows exactly what I want. However, the written output loses the original column names and replaces them with the generic names "0", "1", "2", etc., whether in Parquet or Avro format. I would like to output the original column names.
So as not to reveal my company's business, the following are similar but much simplified examples.
scala> c2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- month: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- num: array (nullable = true)
| | | |-- element: long (containsNull = true)
scala> c2.show(false)
+----------------------------------------------+
|cal |
+----------------------------------------------+
|[[[Jan, Feb, Mar], [1, 2, 3]], [[April], [4]]]|
+----------------------------------------------+
I would like to transform it to
scala> newC2.show(false)
+------------------------------------------+
|cal |
+------------------------------------------+
|[[Jan, 1], [Feb, 2], [Mar, 3], [April, 4]]|
+------------------------------------------+
with
scala> newC2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
I know arrays_zip only works well on top-level arrays. Therefore, I flatten them to the top level. The following code works in this example:
val newC2 = c2.withColumn("month", flatten(col("cal.month"))).withColumn("num", flatten(col("cal.num"))).withColumn("cal", arrays_zip(col("month"), col("num"))).drop("month", "num")
It generates exactly the data and schema I want. However, it writes all columns out generically as "0", "1", "2", etc.
newC2.write.option("header", false).parquet("c2_parquet")
I tried another example in which the original month and num arrays are at the top level. I can apply arrays_zip without flatten and get the same schema and data as shown. However, the original field names are written correctly in this case.
I tried adding an alias to the flattened data. That does not work. I even tried manipulating the columns like this (assume the field storing the result of arrays_zip is 'zipped'):
val columns: Array[Column] = inner.fields.map(_.name).map{x => col("zipped").getField(x).alias(x)}
val newB3 = newB2.withColumn("b", array(struct(columns:_*))).drop("zipped")
It ends up generating the original schema ("month", array of string, and "num", array of long).
To reproduce the problem, you can use this JSON input:
{"cal":[{"month":["Jan","Feb","Mar"],"num":[1,2,3]},{"month":["April"],"num":[4]}]}
The following JSON is for the top-level arrays_zip:
{"month":["Jan","Feb","Mar"],"num":[1,2,3]}
How does Spark internally decide what field names to use? How can I get it to work? Please advise.
Since Spark 2.4, the schema transformation can be achieved using Higher Order Functions. In Scala the query can look like this:
import org.apache.spark.sql.functions.{expr, flatten}
val result = df
.withColumn("cal", flatten(expr("TRANSFORM(cal, x -> zip_with(x.month, x.num, (month, num) -> (month,num)))")))
After applying it to your sample data, I get this schema:
result.printSchema()
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
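What the TRANSFORM / zip_with / flatten expression computes can be modelled in plain Python (an illustration only, not Spark code): zip month and num inside each element of cal, then concatenate the per-element results. The sample data is the JSON input from the question.

```python
cal = [
    {"month": ["Jan", "Feb", "Mar"], "num": [1, 2, 3]},
    {"month": ["April"], "num": [4]},
]

# The inner zip corresponds to zip_with(x.month, x.num, ...); iterating over
# cal in the same comprehension corresponds to TRANSFORM followed by flatten.
flattened = [
    {"month": month, "num": num}
    for x in cal
    for month, num in zip(x["month"], x["num"])
]

for entry in flattened:
    print(entry)
```

Zipping inside each element keeps month and num paired per struct, rather than relying on two separately flattened arrays lining up.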

Extract Spark data frame from a nested structure

I have a DataFrame with nested structure (originally an Avro output from a mapreduce job). I would like to flatten it. The schema of original DataFrame looks like this (simplified):
|-- key: struct
| |-- outcome: boolean
| |-- date: string
| |-- age: int
| |-- features: map
| | |-- key: string
| | |-- value: double
|-- value: struct (nullable = true)
| |-- nullString: string (nullable = true)
In JSON representation, one row of the data looks like this:
{"key":
  {"outcome": false,
   "date": "2015-01-01",
   "age": 20,
   "features": {
     "f1": 10.0,
     "f2": 11.0,
     ...
     "f100": 20.1
   }
  },
 "value": null
}
The features map has the same structure for all rows, i.e. the key set is the same (f1, f2, ..., f100). By "flatten" I mean the following.
+----------+----------+---+----+----+-...-+------+
| outcome| date|age| f1| f2| ... | f100|
+----------+----------+---+----+----+-...-+------+
| true|2015-01-01| 20|10.0|11.0| ... | 20.1|
...
(truncated)
I am using Spark 2.1.0 and the spark-avro package from https://github.com/databricks/spark-avro.
The original dataframe is read in by
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
// it's nested
df.show()
+--------------------+------+
| key| value|
+--------------------+------+
|[false,2015... |[null]|
|[false,2015... |[null]|
...
(truncated)
Any help is greatly appreciated!
In Spark you can extract data from a nested Avro file. For example, the JSON you have provided:
{"key":
  {"outcome": false,
   "date": "2015",
   "features": {
     "f1": v1,
     "f2": v2,
     ...
   }
  },
 "value": null
}
after being read from Avro:
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
can be flattened with a select. For that, you can write something like this:
df.select("key.*").show
+----+------------+-------+
|date| features |outcome|
+----+------------+-------+
|2015| [v1,v2,...]| false|
+----+------------+-------+
...
(truncated)
df.select("key.*").printSchema
root
|-- date: string (nullable = true)
|-- features: struct (nullable = true)
| |-- f1: string (nullable = true)
| |-- f2: string (nullable = true)
| |-- ...
|-- outcome: boolean (nullable = true)
or something like this:
df.select("key.features.*").show
+---+---+---
| f1| f2|...
+---+---+---
| v1| v2|...
+---+---+---
...
(truncated)
df.select("key.features.*").printSchema
root
|-- f1: string (nullable = true)
|-- f2: string (nullable = true)
|-- ...
This works if that is the output you are expecting.
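The combined effect of selecting key.* and then key.features.* can be modelled in plain Python (an illustration, not Spark code). The sample values below are taken from the question, cut down to two features; the helper name flatten_row is hypothetical.

```python
def flatten_row(row):
    """Lift the scalar fields of row["key"] and every feature to the top level."""
    key = row["key"]
    # Keep every field of key except the nested features container
    flat = {name: value for name, value in key.items() if name != "features"}
    flat.update(key["features"])  # one column per feature name
    return flat

row = {
    "key": {
        "outcome": False,
        "date": "2015-01-01",
        "age": 20,
        "features": {"f1": 10.0, "f2": 11.0},
    },
    "value": None,
}

print(flatten_row(row))
```

In Spark itself, star expansion applies to struct columns. The question's schema lists features as a map, and in that case the keys would, as far as I know, have to be selected one by one by name instead of with *.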