How to extract embeddings generated by the Spark NLP WordEmbeddingsModel to feed an RNN model using Keras and TensorFlow

Aiha wants to draw more attention to this question:
I need an embedding matrix which can directly be used in an RNN model
I have a text classification problem.
I'm particularly interested in this embedding model in Spark NLP because I have a dataset from Wikipedia in the 'sq' (Albanian) language, and I need to convert its sentences into embeddings.
I do this with WordEmbeddingsModel; however, once the embeddings are generated I don't know how to prepare them as input for an RNN model using Keras and TensorFlow.
My dataset has two columns, 'text' and 'label'. So far I have been able to do the following steps:
# start spark session
spark = sparknlp.start(gpu=True)

# convert train df into spark df
spark_train_df = spark.createDataFrame(train)
+--------------------+-----+
| text|label|
+--------------------+-----+
|Joy Adowaa Buolam...| 0|
|Ajo themeloi "Alg...| 1|
|Buolamwini lindi ...| 1|
|Kur ishte 9 vjeç,...| 0|
|Si një studente u...| 1|
+--------------------+-----+
# define sparknlp pipeline
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel \
    .pretrained("w2v_cc_300d", "sq") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document, tokenizer, embeddings])
# fit the pipeline to the training data
model = pipeline.fit(spark_train_df)
# apply the pipeline to the training data
result = model.transform(spark_train_df)
result.show()
+--------------------+-----+--------------------+--------------------+--------------------+
| text|label| document| token| embeddings|
+--------------------+-----+--------------------+--------------------+--------------------+
|Joy Adowaa Buolam...| 0|[{document, 0, 13...|[{token, 0, 2, Jo...|[{word_embeddings...|
|Ajo themeloi "Alg...| 1|[{document, 0, 13...|[{token, 0, 2, Aj...|[{word_embeddings...|
|Buolamwini lindi ...| 1|[{document, 0, 94...|[{token, 0, 9, Bu...|[{word_embeddings...|
|Kur ishte 9 vjeç,...| 0|[{document, 0, 12...|[{token, 0, 2, Ku...|[{word_embeddings...|
|Si një studente u...| 1|[{document, 0, 15...|[{token, 0, 1, Si...|[{word_embeddings...|
|Buolamwini diplom...| 1|[{document, 0, 11...|[{token, 0, 9, Bu...|[{word_embeddings...|
+--------------------+-----+--------------------+--------------------+--------------------+
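To sanity-check what the annotator produced, you can unnest one annotation per row. This is an illustrative snippet, not part of the original question:

from pyspark.sql import functions as F

# one row per token: the token text and its 300-float vector
result.select(F.explode("embeddings").alias("e")) \
      .select(F.col("e.result").alias("token"),
              F.col("e.embeddings").alias("vector")) \
      .show(5, truncate=50)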
The schema of result is:
result.printSchema()
root
|-- text: string (nullable = true)
|-- label: long (nullable = true)
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
|-- token: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
|-- embeddings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)
The output I receive from result.schema["embeddings"].dataType is:
ArrayType(StructType([StructField('annotatorType', StringType(), True), StructField('begin', IntegerType(), False), StructField('end', IntegerType(), False), StructField('result', StringType(), True), StructField('metadata', MapType(StringType(), StringType(), True), True), StructField('embeddings', ArrayType(FloatType(), False), True)]), True)
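A minimal sketch of one way to bridge the gap to Keras (an illustration, not code from the question): select the nested embeddings field, which yields one array of per-token vectors per row, then pad to a fixed length. max_len is an assumed hyperparameter; collect() is only sensible for datasets that fit in driver memory.

import numpy as np
from pyspark.sql import functions as F

max_len, dim = 128, 300   # assumed max sequence length; w2v_cc_300d vectors are 300-d

# 'embeddings.embeddings' pulls the float vector out of each annotation struct,
# giving one array of per-token vectors per row
rows = result.select("label", F.col("embeddings.embeddings").alias("vectors")).collect()

X = np.zeros((len(rows), max_len, dim), dtype="float32")
y = np.array([r["label"] for r in rows])
for i, r in enumerate(rows):
    seq = np.asarray(r["vectors"][:max_len], dtype="float32").reshape(-1, dim)
    X[i, :len(seq)] = seq   # left-aligned sequence, zero padding on the right

From there, X and y can be fed to a Keras model whose first layers are, for example, keras.layers.Masking(mask_value=0.0) followed by keras.layers.LSTM(64), so the zero-padded timesteps are ignored.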

Related

Merge arrays of a nested struct after groupBy/aggregation

I have a schema like this:
root
|-- first_name: string (nullable = true)
|-- medical_group: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- phone_number: string (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- zip: string (nullable = true)
| | |-- secondary_phone_number: string (nullable = true)
| | |-- address1: string (nullable = true)
| |-- offices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- address: struct (nullable = true)
| | | | |-- phone_number: string (nullable = true)
| | | | |-- city: string (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- address2: string (nullable = true)
| | | | |-- zip: string (nullable = true)
| | | | |-- secondary_phone_number: string (nullable = true)
| | | | |-- address1: string (nullable = true)
| | | |-- name: string (nullable = true)
| |-- unique_id: string (nullable = true)
| |-- name: string (nullable = true)
|-- unique_id: string (nullable = true)
|-- last_name: string (nullable = true)
I'm trying to groupBy/aggregate the data by key1 and key2. The aggregation is primarily to remove duplicate rows in the DataFrame, as well as to aggregate the array of offices.
agg_list = [
    first("first_name").alias("first_name"),
    first("last_name").alias("last_name"),
    first("medical_group").alias("medical_group"),
    # Maybe this? collect_list("medical_group.offices").alias("mg.offices")
]

provider_structs_grouped = \
    provider_structs_structure \
        .groupBy(col('unique_id'), col('medical_group.unique_id')) \
        .agg(*agg_list)
I thought I could create a temporary column that uses collect_list, update the nested struct's offices value, and then drop the temporary column, but I was struggling to update the nested struct's value.
Question: How can I aggregate/collect_list the offices, and update the nested offices array with that latest value? (Or perhaps there's a better way?)
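A sketch of the temporary-column idea described in the question (an illustration, assuming Spark 3.1+ for Column.withField; names follow the schema above):

from pyspark.sql import functions as F

grouped = (
    provider_structs_structure
    .groupBy(F.col("unique_id"), F.col("medical_group.unique_id").alias("mg_unique_id"))
    .agg(
        F.first("first_name").alias("first_name"),
        F.first("last_name").alias("last_name"),
        F.first("medical_group").alias("medical_group"),
        # temporary column: all offices arrays in the group, flattened into one
        F.flatten(F.collect_list("medical_group.offices")).alias("all_offices"),
    )
    # overwrite the nested offices array, then drop the helper column
    .withColumn("medical_group",
                F.col("medical_group").withField("offices", F.col("all_offices")))
    .drop("all_offices")
)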

Flatten dataframe with nested struct ArrayType using pyspark

I have a dataframe with this schema
root
|-- AUTHOR_ID: integer (nullable = false)
|-- NAME: string (nullable = true)
|-- Books: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- BOOK_ID: integer (nullable = false)
| | |-- Chapters: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- NAME: string (nullable = true)
| | | | |-- NUMBER_PAGES: integer (nullable = true)
How can I flatten all columns into one level with PySpark?
Using the inline function:
df2 = (df.selectExpr("AUTHOR_ID", "NAME", "inline(Books)")
         .selectExpr("*", "inline(Chapters)")
         .drop("Chapters")
      )
Or using explode:
from pyspark.sql import functions as F

df2 = (df.withColumn("Books", F.explode("Books"))
         .select("*", "Books.*")
         .withColumn("Chapters", F.explode("Chapters"))
         .select("*", "Chapters.*")
         .drop("Books", "Chapters")  # drop the intermediate struct columns left behind
      )
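On the design choice: inline(Books) explodes an array of structs into one row per element and one column per struct field in a single step, whereas the explode variant produces intermediate struct columns that then have to be expanded and dropped separately.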

How to convert JSON Schema from Camel case to lower case

I have a JSON schema with keys in camel case, and I am trying to convert all the field names to lower case.
I am facing an issue with ArrayType.
import org.apache.spark.sql.types.{ArrayType, DataType, IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

var sample_schema = spark.read.json("path").schema

def columnsToLowercase(schema: StructType): StructType = {
  def recurRename(schema: StructType): Seq[StructField] =
    schema.fields.map {
      case StructField(name, dtype: StructType, nullable, meta) =>
        StructField(name.toLowerCase, StructType(recurRename(dtype)), nullable, meta)
      case StructField(name, dtype, nullable, meta) =>
        StructField(name.toLowerCase, dtype, nullable, meta)
    }
  StructType(recurRename(schema))
}

val jsonDFrame: DataFrame = spark.read.schema(columnsToLowercase(sample_schema)).json("path")
Sample Schema:
root
|-- id: string (nullable = true)
|-- master: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- provInfo: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- claimInfo: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- claimId: string (nullable = true)
| | | |-- demoInfo: struct (nullable = true)
| | | | |-- family: struct (nullable = true)
| | | | | |-- outOrder: struct (nullable = true)
| | | | | | |-- LocOut: boolean (nullable = true)
| | | | | | |-- found: boolean (nullable = true)
| |-- claimAddr: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- address: string (nullable = true)
|-- system: string (nullable = true)
You should be able to lowercase fields nested in an ArrayType by adding another case clause. For array columns, you also need to check the element type:
def columnsToLowercase(schema: StructType): StructType = {
  // ....
  case StructField(name, dtype: ArrayType, nullable, meta) =>
    dtype.elementType match {
      case s: StructType =>
        StructField(name.toLowerCase, ArrayType(StructType(recurRename(s)), true), nullable, meta)
      // keep the original ArrayType wrapper when the elements are not structs
      case _ => StructField(name.toLowerCase, dtype, nullable, meta)
    }
  // ....
}
Applying it to your schema:
df.printSchema
//root
// |-- id: string (nullable = true)
// |-- master: struct (nullable = true)
// | |-- provInfo: struct (nullable = true)
// | | |-- claimInfo: array (nullable = true)
// | | | |-- element: struct (containsNull = true)
// | | | | |-- claimId: string (nullable = true)
// | | |-- demoInfo: struct (nullable = true)
// | | | |-- family: struct (nullable = true)
// | | | | |-- outOrder: struct (nullable = true)
// | | | | | |-- LocOut: boolean (nullable = false)
// | | | | | |-- found: boolean (nullable = false)
// | |-- claimAddr: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- address: string (nullable = true)
// |-- system: string (nullable = true)
columnsToLowercase(df.schema).printTreeString()
//root
// |-- id: string (nullable = true)
// |-- master: struct (nullable = true)
// | |-- provinfo: struct (nullable = true)
// | | |-- claiminfo: array (nullable = true)
// | | | |-- element: struct (containsNull = true)
// | | | | |-- claimid: string (nullable = true)
// | | |-- demoinfo: struct (nullable = true)
// | | | |-- family: struct (nullable = true)
// | | | | |-- outorder: struct (nullable = true)
// | | | | | |-- locout: boolean (nullable = false)
// | | | | | |-- found: boolean (nullable = false)
// | |-- claimaddr: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- address: string (nullable = true)
// |-- system: string (nullable = true)

Dataframe struct fieldType to array of fields except the last field in PySpark

I have a spark dataframe with the following schema:
root
|-- stat_chiamate: struct (nullable = true)
| |-- chiamate_ricevute: struct (nullable = true)
| | |-- h_0: string (nullable = true)
| | |-- h_1: string (nullable = true)
| | |-- h_10: string (nullable = true)
| | |-- h_11: string (nullable = true)
| | |-- h_12: string (nullable = true)
| | |-- h_13: string (nullable = true)
| | |-- h_14: string (nullable = true)
| | |-- h_15: string (nullable = true)
| | |-- h_16: string (nullable = true)
| | |-- h_17: string (nullable = true)
| | |-- h_18: string (nullable = true)
| | |-- h_19: string (nullable = true)
| | |-- h_2: string (nullable = true)
| | |-- h_20: string (nullable = true)
| | |-- h_21: string (nullable = true)
| | |-- h_22: string (nullable = true)
| | |-- h_23: string (nullable = true)
| | |-- h_3: string (nullable = true)
| | |-- h_4: string (nullable = true)
| | |-- h_5: string (nullable = true)
| | |-- h_6: string (nullable = true)
| | |-- h_7: string (nullable = true)
| | |-- h_8: string (nullable = true)
| | |-- h_9: string (nullable = true)
| | |-- n_totale: string (nullable = true)
I want a dataframe like:
root
|-- stat_chiamate: struct (nullable = true)
| |-- chiamate_ricevute: array (nullable = true)
| | |-- element: string (containsNull = true)
where chiamate_ricevute is a list of the field values, for example:
h_0= 0
h_1= 1
h_2= 2
.
.
.
h_23=23
n_totale=412
I want:
[0,1,2....,23] <-- I don't want n_totale values
In my code I use df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()[:-1], but that gives me only the field names; how can I use them?
df = df.select(F.array(*[field for field in
                         df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()
                         if field.startswith("h_")]).alias("CIRCO"))
You could use the schema of the dataframe, and in particular the schema of your struct, to extract all the field names except n_totale and then wrap them into an array.
from pyspark.sql import functions as f

fields = ['chiamate_ricevute.' + field.name
          for field in df.schema[0].dataType
          if field.name != 'n_totale']
result = df.select(f.array(fields).alias("chiamate_ricevute"))

transformation of spark dataframe

I have a DataFrame with the schema below. The number of elements is unknown, but some of the elements (for example element1 and element3) must exist and be unique.
root
|-- context: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- data: struct (nullable = true)
| | |-- dimensions: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- element1: string (nullable = true)
| | | | |-- element2: string (nullable = true)
| | | | |-- element3: string (nullable = true)
| | | | |-- *** : string (nullable = true)
| | | | |-- elementN: string (nullable = true)
How can I transform it to a schema like this?
root
|-- context: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- element1: string (nullable = true)
| |-- element3: string (nullable = true)
Thanks a lot.
Can you please try the explode function? Here are some related links; please go through them:
Extract columns in nested Spark DataFrame
Extract value from structure within an array of arrays in spark using scala
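A rough sketch of the idea (an illustration, not taken from the linked answers; it assumes a dataframe named df with the schema above, that each wanted field appears in at most one dimensions element, and Spark 2.4+ for the filter higher-order function):

from pyspark.sql import functions as F

# pick the first array element that defines each wanted field,
# then rebuild the context struct with key, element1 and element3
result = df.select(
    F.struct(
        F.col("context.key").alias("key"),
        F.expr("filter(context.data.dimensions, d -> d.element1 is not null)[0].element1").alias("element1"),
        F.expr("filter(context.data.dimensions, d -> d.element3 is not null)[0].element3").alias("element3"),
    ).alias("context")
)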