Pivot spark dataframe (pyspark) when all column values are unique

Apologies for the potentially basic question but I have exhausted all possibilities in searching for a solution to this. I am fairly new to pyspark. I have managed to flatten the first level of my json file. I have a dataframe that looks like the following:
There is one row. Each column represents a day between 2019-01-01 and 2020-01-01.
| 2019-01-01 | 2019-01-02 |...
| -------- | -------- |...
| USD:1, | USD: 1, |...
| JPY:109.61 | JPY:109.62 |...
...
Schema as follows (with each day between 2019 and 2020 as a column header):
root
|-- 2019-01-01: struct (nullable = true)
| |-- AED: double (nullable = true)
| |-- AFN: double (nullable = true)
| |-- ALL: double (nullable = true)
| |-- ARS: double (nullable = true)
| |-- AUD: double (nullable = true)
| |-- BAM: double (nullable = true)
| |-- BBD: double (nullable = true)
| |-- BDT: double (nullable = true)
| |-- BGN: double (nullable = true)
| |-- BHD: double (nullable = true)
| |-- BIF: double (nullable = true)
etc.
I would like each value in each column to be in a separate row, and to pivot the table so that the dates are rows and each country code is a column heading, as follows:
| Date       | USD | JPY    |
| ---------- | --- | ------ |
| 2019-01-01 | 1   | 109.61 |
| 2019-01-02 | 1   | 109.62 |
etc.
Every solution I have found so far requires that I specify which column to pivot/unnest.

The solution here loads and flattens the json in Python before loading it into a PySpark DataFrame.
Set-Up
Here is a sample of the json.
json_input = {'2019-01-01': {'AED': 1.0,
                             'AFN': 1.0,
                             'ALL': 1.0,
                             'ARS': 1.0,
                             'AUD': 1.0,
                             'BAM': 1.0,
                             'BBD': 1.0,
                             'BDT': 1.0,
                             'BGN': 1.0,
                             'BHD': 1.0,
                             'BIF': 1.0},
              '2019-01-02': {'AED': 2.0,
                             'AFN': 2.0,
                             'ALL': 2.0,
                             'ARS': 2.0,
                             'AUD': 2.0,
                             'BAM': 2.0,
                             'BBD': 2.0,
                             'BDT': 2.0,
                             'BGN': 2.0,
                             'BHD': 2.0,
                             'BIF': 2.0}}
Logic
Next, I reformat the json so that the "Date" key sits at the same level as the country codes.
json_reformat = []
for date, country_data in json_input.items():
    json_flat = country_data
    json_flat["Date"] = date
    json_reformat.append(json_flat)
output
[{'AED': 1.0,
  'AFN': 1.0,
  'ALL': 1.0,
  'ARS': 1.0,
  'AUD': 1.0,
  'BAM': 1.0,
  'BBD': 1.0,
  'BDT': 1.0,
  'BGN': 1.0,
  'BHD': 1.0,
  'BIF': 1.0,
  'Date': '2019-01-01'},
 {'AED': 2.0,
  'AFN': 2.0,
  'ALL': 2.0,
  'ARS': 2.0,
  'AUD': 2.0,
  'BAM': 2.0,
  'BBD': 2.0,
  'BDT': 2.0,
  'BGN': 2.0,
  'BHD': 2.0,
  'BIF': 2.0,
  'Date': '2019-01-02'}]
Then load this newly formatted object into a PySpark DataFrame
df = spark.read.json(sc.parallelize(json_reformat))
df.show()
output
+---+---+---+---+---+---+---+---+---+---+---+----------+
|AED|AFN|ALL|ARS|AUD|BAM|BBD|BDT|BGN|BHD|BIF| Date|
+---+---+---+---+---+---+---+---+---+---+---+----------+
|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|2019-01-01|
|2.0|2.0|2.0|2.0|2.0|2.0|2.0|2.0|2.0|2.0|2.0|2019-01-02|
+---+---+---+---+---+---+---+---+---+---+---+----------+
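If the data has already been read into the one-row, struct-per-date DataFrame described in the question, a pure-Spark variant of the same idea is also possible. The sketch below is only illustrative: it assumes every date struct carries the same currency fields, and df_nested is a stand-in name for the question's DataFrame.
from functools import reduce
from pyspark.sql import functions as F

# One small DataFrame per date column: (Date, struct of currency rates).
per_date = [
    df_nested.select(F.lit(c).alias("Date"), df_nested[c].alias("rates"))
    for c in df_nested.columns
]
long_df = reduce(lambda a, b: a.unionByName(b), per_date)

# Expand the struct so each currency code becomes its own column.
result = long_df.select("Date", "rates.*")
result.show()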

Assuming all date columns are in the same order and have no missing values, you can write for loops.
data is a Spark DataFrame representing the following:
data.show()
+----------+----------+
|2019-01-01|2019-01-02|
+----------+----------+
| USD:1,| USD: 1,|
|JPY:109.61|JPY:109.62|
+----------+----------+
row_items = ()
rows = []
cols = ['Date']
for i in data.select('2019-01-01').collect():
    cols.append(i[0].split(':')[0])
for item in data.columns:
    row_items += (item,)
    for i in data.select(item).collect():
        row_items += (i[0].split(':')[1],)
    rows.append(row_items)
    row_items = ()
result = spark.createDataFrame(rows, cols)
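Since the collected values are still strings such as '1,' or '109.61', a small follow-up step can cast them to doubles. This is only a sketch and assumes the only noise is an optional trailing comma plus surrounding spaces; the column names come from the cols list built above.
from pyspark.sql import functions as F

# Trim whitespace, drop a trailing comma if present, and cast to double.
clean = result.select(
    "Date",
    *[F.regexp_replace(F.trim(F.col(c)), ",$", "").cast("double").alias(c)
      for c in result.columns if c != "Date"]
)
clean.show()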

Related

How to extract embeddings generated from sparknlp WordEmbeddingsModel to feed a RNN model using keras and tensorflow

I have a text classification problem.
I'm particularly interested in this embedding model in sparknlp because I have a dataset from Wikipedia in the 'sq' language and I need to convert its sentences into embeddings.
I do so with WordEmbeddingsModel; however, once the embeddings are generated I don't know how to prepare them as input for an RNN model using keras and tensorflow.
My dataset has two columns, 'text' and 'label'. Up to now I have been able to do the following steps:
# start spark session
spark = sparknlp.start(gpu=True)
# convert train df into spark df
spark_train_df = spark.createDataFrame(train)
+--------------------+-----+
| text|label|
+--------------------+-----+
|Joy Adowaa Buolam...| 0|
|Ajo themeloi "Alg...| 1|
|Buolamwini lindi ...| 1|
|Kur ishte 9 vjeç,...| 0|
|Si një studente u...| 1|
+--------------------+-----+
# define sparknlp pipeline
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
embeddings = WordEmbeddingsModel\
    .pretrained("w2v_cc_300d", "sq")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")
pipeline = Pipeline(stages=[document, tokenizer, embeddings])
# fit the pipeline to the training data
model = pipeline.fit(spark_train_df)
# apply the pipeline to the training data
result = model.transform(spark_train_df)
result.show()
+--------------------+-----+--------------------+--------------------+--------------------+
| text|label| document| token| embeddings|
+--------------------+-----+--------------------+--------------------+--------------------+
|Joy Adowaa Buolam...| 0|[{document, 0, 13...|[{token, 0, 2, Jo...|[{word_embeddings...|
|Ajo themeloi "Alg...| 1|[{document, 0, 13...|[{token, 0, 2, Aj...|[{word_embeddings...|
|Buolamwini lindi ...| 1|[{document, 0, 94...|[{token, 0, 9, Bu...|[{word_embeddings...|
|Kur ishte 9 vjeç,...| 0|[{document, 0, 12...|[{token, 0, 2, Ku...|[{word_embeddings...|
|Si një studente u...| 1|[{document, 0, 15...|[{token, 0, 1, Si...|[{word_embeddings...|
|Buolamwini diplom...| 1|[{document, 0, 11...|[{token, 0, 9, Bu...|[{word_embeddings...|
+--------------------+-----+--------------------+--------------------+--------------------+
The schema of result is:
result.printSchema()
root
|-- text: string (nullable = true)
|-- label: long (nullable = true)
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
|-- token: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
|-- embeddings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
The output I receive from result.schema["embeddings"].dataType is:
ArrayType(StructType([StructField('annotatorType', StringType(), True), StructField('begin', IntegerType(), False), StructField('end', IntegerType(), False), StructField('result', StringType(), True), StructField('metadata', MapType(StringType(), StringType(), True), True), StructField('embeddings', ArrayType(FloatType(), False), True)]), True)
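One possible way to prepare these for Keras (a sketch only, with its assumptions stated in the comments) is to project out just the per-token float vectors from the annotation structs, then collect and pad them into a fixed-shape numpy array:
import numpy as np
from pyspark.sql import functions as F

# Selecting a struct field through an array column yields an array of that field,
# so this gives array<array<float>>: one vector per token (w2v_cc_300d is assumed
# to be 300-dimensional).
vectors_df = result.select(
    F.col("label"),
    F.col("embeddings.embeddings").alias("token_vectors")
)

# Collect and pad to a (samples, max_len, dim) tensor; this assumes the dataset
# is small enough to fit on the driver.
collected = vectors_df.collect()
dim = 300
max_len = max(len(r["token_vectors"]) for r in collected)

X = np.zeros((len(collected), max_len, dim), dtype="float32")
for i, row in enumerate(collected):
    for j, vec in enumerate(row["token_vectors"]):
        X[i, j, :] = vec
y = np.array([r["label"] for r in collected])
# X and y can now be fed to a Keras RNN, e.g. model.fit(X, y).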

Explode nested values of (year, month, day, hour, minute, second) to a datetime type in one field in a PySpark DataFrame

I'm trying to convert the nested fields into one field of DATETIME type. When I use the explode function I get an error: cannot resolve 'explode(START_Time)' due to data type mismatch.
The data I have:
|-- MODEL: string (nullable = true)
|-- START_Time: struct (nullable = true)
| |-- day: string (nullable = true)
| |-- hour: string (nullable = true)
| |-- minute: string (nullable = true)
| |-- month: string (nullable = true)
| |-- second: string (nullable = true)
| |-- year: string (nullable = true)
|-- WEIGHT: string (nullable = true)
|-- REGISTRED: struct (nullable = true)
| |-- day: string (nullable = true)
| |-- hour: string (nullable = true)
| |-- minute: string (nullable = true)
| |-- month: string (nullable = true)
| |-- second: string (nullable = true)
| |-- year: string (nullable = true)
|-- TOTAL: string (nullable = true)
The result I'm looking to have, with START_Time and REGISTRED as a date type:
+---------+------------------+----------+-----------------+---------+
|MODEL    |START_Time        |WEIGHT    |REGISTRED        |TOTAL    |
+---------+------------------+----------+-----------------+---------+
|.........|yy-mm-dd-hh-mm-ss |WEIGHT    |yy-mm-dd-hh-mm-ss|TOTAL    |
+---------+------------------+----------+-----------------+---------+
I have tried:
df.withColumn('START_Time', concat(col('START_Time.year'), lit('-'), .....)
but when there are empty values in the nested fields the result contains (-----), and I get:
+---------+------------------+----------+-----------------+---------+
|MODEL    |START_Time        |WEIGHT    |REGISTRED        |TOTAL    |
+---------+------------------+----------+-----------------+---------+
|value    |-----             |value     |-----            |value    |
+---------+------------------+----------+-----------------+---------+
After concatenating, you can just cast the entire column to timestamp type; Spark will handle the missing (and invalid) data for you and return null instead.
from pyspark.sql import functions as F

(df
    .withColumn('raw_string_date', F.concat(
        F.col('START_TIME.year'),
        F.lit('-'),
        F.col('START_TIME.month'),
        F.lit('-'),
        F.col('START_TIME.day'),
        F.lit(' '),
        F.col('START_TIME.hour'),
        F.lit(':'),
        F.col('START_TIME.minute'),
        F.lit(':'),
        F.col('START_TIME.second'),
    ))
    .withColumn('date_type', F.col('raw_string_date').cast('timestamp'))
    .show(10, False)
)
# +------------------------------------+---------------+-------------------+
# |START_TIME |raw_string_date|date_type |
# +------------------------------------+---------------+-------------------+
# |{1, 2, 3, 4, 5, 2021} |2021-4-1 2:3:5 |2021-04-01 02:03:05|
# |{, , , , , } |-- :: |null |
# |{null, null, null, null, null, null}|null |null |
# +------------------------------------+---------------+-------------------+
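The same concatenate-and-cast idea can be wrapped in a small helper and applied to REGISTRED as well. This is just a sketch (struct_to_timestamp is an illustrative name), assuming both struct columns have the same year/month/day/hour/minute/second string fields as in the schema above:
from pyspark.sql import functions as F

def struct_to_timestamp(col_name):
    # Build a "year-month-day hour:minute:second" string from the nested fields
    # and cast it; empty or invalid parts come back as null.
    return F.concat(
        F.col(col_name + ".year"), F.lit("-"),
        F.col(col_name + ".month"), F.lit("-"),
        F.col(col_name + ".day"), F.lit(" "),
        F.col(col_name + ".hour"), F.lit(":"),
        F.col(col_name + ".minute"), F.lit(":"),
        F.col(col_name + ".second"),
    ).cast("timestamp")

df2 = (df
       .withColumn("START_Time", struct_to_timestamp("START_Time"))
       .withColumn("REGISTRED", struct_to_timestamp("REGISTRED")))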

Scala Spark Dataframe: how to explode an array of Int and array of struct at the same time

I'm new to Scala/Spark and I'm trying to explode a dataframe that has an array column and an array-of-struct column, so that I end up with no arrays and no structs.
Here's an example
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(4,5,6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
df.printSchema
df.show
df has the following schema
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
and the data looks like
+---+---------+--------------------+
| id| before| after|
+---+---------+--------------------+
| 1|[4, 5, 6]|[[07:00, 07:30, 7...|
+---+---------+--------------------+
How do I explode the dataframe so I get the following schema
|-- id: string (nullable = true)
|-- before: integer (containsNull = false)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- after_area: string (nullable = true)
The resulting data should have 3 rows and 5 columns
+---+------+----------------+--------------+----------+
| id|before|after_start_time|after_end_time|after_area|
+---+------+----------------+--------------+----------+
|  1|     4|           07:00|         07:30|        70|
|  1|     5|           08:00|         08:30|        80|
|  1|     6|           09:00|         09:30|        90|
+---+------+----------------+--------------+----------+
I'm using spark 2.3.0 (arrays_zip is not available). The only solutions I can find are either for exploding two Arrays of String or one Array of struct.
Use arrays_zip to combine the two arrays, then explode to explode the combined array column, and use as to rename the required columns.
Since arrays_zip is not available in spark 2.3, I created a UDF to perform the same operation.
val arrays_zip = udf((before:Seq[Int],after: Seq[Area]) => before.zip(after))
Execution time with built in (spark 2.4.2) arrays_zip - Time taken: 1146 ms
Execution time with arrays_zip UDF - Time taken: 1165 ms
Check the code below.
scala> df.show(false)
+---+---------+------------------------------------------------------------+
|id |before |after |
+---+---------+------------------------------------------------------------+
|1 |[4, 5, 6]|[[07:00, 07:30, 70], [08:00, 08:30, 80], [09:00, 09:30, 90]]|
+---+---------+------------------------------------------------------------+
scala>
df
  .select(
    $"id",
    explode(
      arrays_zip($"before", $"after")
        .cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
    ).as("before_after")
  )
  .select(
    $"id",
    $"before_after.before".as("before"),
    $"before_after.after.start_time".as("after_start_time"),
    $"before_after.after.end_time".as("after_end_time"),
    $"before_after.after.area"
  )
  .printSchema
root
|-- id: string (nullable = true)
|-- before: integer (nullable = true)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- area: string (nullable = true)
Output
scala>
df
  .select(
    $"id",
    explode(
      arrays_zip($"before", $"after")
        .cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
    ).as("before_after")
  )
  .select(
    $"id",
    $"before_after.before".as("before"),
    $"before_after.after.start_time".as("after_start_time"),
    $"before_after.after.end_time".as("after_end_time"),
    $"before_after.after.area"
  )
  .show(false)
+---+------+----------------+--------------+----+
|id |before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|1 |4 |07:00 |07:30 |70 |
|1 |5 |08:00 |08:30 |80 |
|1 |6 |09:00 |09:30 |90 |
+---+------+----------------+--------------+----+
To handle a more complex struct you can do the following: declare two beans, Area (input) and Area2 (output), and map each row to an Area2 bean.
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import scala.collection.mutable

object ExplodeTwoArrays {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._
    val df = Seq((
      "1", Seq(4, 5, 6),
      Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
    )).toDF("id", "before", "after")
    val outDf = df.map(row => {
      val id = row.getString(0)
      val beforeArray: Seq[Int] = row.getSeq[Int](1)
      val afterArray: mutable.WrappedArray[Area2] =
        row.getAs[mutable.WrappedArray[GenericRowWithSchema]](2) // Need to map Array(Struct) to something compatible
          .zipWithIndex // Required to iterate with indices
          .map { case (element, i) => {
            Area2(element.getAs[String]("start_time"),
              element.getAs[String]("end_time"),
              element.getAs[String]("area"),
              beforeArray(i))
          }}
      (id, afterArray) // Return row(id, Array(Area2(...)))
    }).toDF("id", "after")
    outDf.printSchema()
    outDf.show()
  }
}

case class Area(start_time: String, end_time: String, area: String)
case class Area2(start_time: String, end_time: String, area: String, before: Int)

Schema Conversion from String datatype to Array(Map(Array)) datatype in Pyspark

We are reading data from DynamoDB, so we are getting the data type as string, but we want to write the string data type as array(map(array)).
String data:
{"policy_details":[{"cdhid":" 123","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}, {"cdhid":" 1234","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}]}
Output required:
The string data type needs to be converted to ARRAY(MAP(ARRAY)).
We have tried with the below schema:
ArrayType([
    StructField("policy_num", MapType(ArrayType([
        StructField("cdhid", StringType(), True),
        StructField("role_id", StringType(), True),
        StructField("role_desc", StringType(), True)
    ])))
])
We are getting the below issue:
elementType [StructField(cdhid,StringType,true),
StructField(role_id,StringType,true),
StructField(role_desc,StringType,true)] should be an instance of < class 'pyspark.sql.types.DataType' >
Regarding your data, the schema you want is not the one that fits.
The schema of your data is:
from pyspark.sql import types as T

schm = T.StructType([T.StructField("policy_details", T.ArrayType(T.StructType([
    T.StructField("cdhid", T.StringType(), True),
    T.StructField("p2cid", T.StringType(), True),
    T.StructField("roleDesc", T.StringType(), True),
    T.StructField("positionnum", T.StringType(), True),
])), True)])
Then, you just need to use the from_json function.
from pyspark.sql import functions as F
df.show()
+--------------------+
| db_data|
+--------------------+
|{"policy_details"...|
+--------------------+
new_df = df.select(F.from_json("db_data", schm).alias("data"))
new_df.printSchema()
root
|-- data: struct (nullable = true)
| |-- policy_details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- cdhid: string (nullable = true)
| | | |-- p2cid: string (nullable = true)
| | | |-- roleDesc: string (nullable = true)
| | | |-- positionnum: string (nullable = true)
EDIT: If you want to use MapType, you can replace the schema with:
schm = T.StructType([
    T.StructField(
        "policy_details",
        T.ArrayType(T.MapType(
            T.StringType(),
            T.StringType()
        )),
        True
    )
])
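For illustration only (the alias names here are made up for this sketch): with the MapType variant, individual fields are looked up by key at query time instead of appearing in the schema.
from pyspark.sql import functions as F

# Parse the string column with the MapType schema, then fetch a value by key.
map_df = df.select(F.from_json("db_data", schm).alias("data"))
map_df.select(
    F.col("data.policy_details").getItem(0).getItem("cdhid").alias("first_cdhid")
).show()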

Extract Spark data frame from a nested structure

I have a DataFrame with a nested structure (originally Avro output from a mapreduce job). I would like to flatten it. The schema of the original DataFrame looks like this (simplified):
|-- key: struct
| |-- outcome: boolean
| |-- date: string
| |-- age: int
| |-- features: map
| | |-- key: string
| | |-- value: double
|-- value: struct (nullable = true)
| |-- nullString: string (nullable = true)
In JSON representation, one row of the data looks like this:
{"key":
    {"outcome": false,
     "date": "2015-01-01",
     "age": 20,
     "features": {
         "f1": 10.0,
         "f2": 11.0,
         ...
         "f100": 20.1
     }
    },
 "value": null
}
The features map has the same structure for all rows, i.e. the key set is the same (f1, f2, ..., f100). By "flatten" I mean the following.
+----------+----------+---+----+----+-...-+------+
| outcome| date|age| f1| f2| ... | f100|
+----------+----------+---+----+----+-...-+------+
| true|2015-01-01| 20|10.0|11.0| ... | 20.1|
...
(truncated)
I am using Spark 2.1.0 and the spark-avro package from https://github.com/databricks/spark-avro.
The original dataframe is read in by
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
// it's nested
df.show()
+--------------------+------+
| key| value|
+--------------------+------+
|[false,2015... |[null]|
|[false,2015... |[null]|
...
(truncated)
Any help is greatly appreciated!
In Spark you can extract data from a nested AVRO file. For example, the JSON you have provided:
{"key":
{"outcome": false,
"date": "2015",
"features": {
{"f1": v1,
"f2": v2,
...
}
},
"value": null
}
after being read from AVRO:
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
can provide flattened data from the nested structure. For that, you can write code like this:
df.select("key.*").show
+----+------------+-------+
|date| features |outcome|
+----+------------+-------+
|2015| [v1,v2,...]| false|
+----+------------+-------+
...
(truncated)
df.select("key.*").printSchema
root
|-- date: string (nullable = true)
|-- features: struct (nullable = true)
| |-- f1: string (nullable = true)
| |-- f2: string (nullable = true)
| |-- ...
|-- outcome: boolean (nullable = true)
or something like this:
df.select("key.features.*").show
+---+---+---
| f1| f2|...
+---+---+---
| v1| v2|...
+---+---+---
...
(truncated)
df.select("key.features.*").printSchema
root
|-- f1: string (nullable = true)
|-- f2: string (nullable = true)
|-- ...
Hopefully one of these is the output you are expecting.