How to join two dataframes on an array column when one's elements are nullable and the other's are not

I have two dataframes, df1 and df2.
df1
df1's Index column was generated by a UDF.
+------+--------+
| other| Index  |
+------+--------+
|     5| [1, 1] |
|     1| [0, 7] |
|     2| [1, 7] |
|     8| [2, 6] |
|     6| [2, 7] |
+------+--------+
|-- Index: array (nullable = true)
| |-- element: integer (containsNull = false)
df2
import scala.collection.mutable.ListBuffer
import spark.implicits._ // assumes a SparkSession named spark is in scope, as in spark-shell

val l = new ListBuffer[(Int, Array[Int])]()
l.append((5, Array(1, 1)))
l.append((6, Array(0, 7)))
l.append((1, Array(1, 7)))
l.append((3, Array(2, 6)))
l.append((7, Array(2, 7)))
val df2 = l.toDF("other2", "Index")
+------+--------+
|other2| Index  |
+------+--------+
|     5| [1, 1] |
|     6| [0, 7] |
|     1| [1, 7] |
|     3| [2, 6] |
|     7| [2, 7] |
+------+--------+
|-- Index: array (nullable = true)
| |-- element: integer (containsNull = true)
I want to join them on Index.
But as you can see, an error appears:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Join keys from two sides should have same types
df1's Index type is different from df2's: their containsNull settings differ.
So how can I work around this problem?
I would also like to know why df2's schema shows |-- element: integer (containsNull = true).
Thank you very much!

I'm able to do this. I'm not sure whether this is the desired result, as you have not posted the expected output.
val df1 = Seq(
  (5, Array(1, 1)),
  (1, Array(0, 7)),
  (2, Array(1, 7)),
  (8, Array(2, 6)),
  (6, Array(2, 7)),
  (100, Array(null.asInstanceOf[Int], null.asInstanceOf[Int]))
).toDF("Other", "Index")

val df2 = Seq(
  (5, Array(1, 1)),
  (6, Array(0, 7)),
  (1, Array(1, 7)),
  (3, Array(2, 6)),
  (7, Array(null.asInstanceOf[Int], null.asInstanceOf[Int]))
).toDF("Other2", "Index")
root
|-- Other: integer (nullable = false)
|-- Index: array (nullable = true)
| |-- element: integer (containsNull = false)
root
|-- Other2: integer (nullable = false)
|-- Index: array (nullable = true)
| |-- element: integer (containsNull = false)
df1.join(df2, df1("Index") <=> df2("Index")).show()
+-----+------+------+------+
|Other| Index|Other2| Index|
+-----+------+------+------+
| 5|[1, 1]| 5|[1, 1]|
| 1|[0, 7]| 6|[0, 7]|
| 2|[1, 7]| 1|[1, 7]|
| 8|[2, 6]| 3|[2, 6]|
| 100|[0, 0]| 7|[0, 0]|
+-----+------+------+------+
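A minimal sketch of an alternative (my own assumption, not part of the answer above): containsNull only records whether Spark thinks the array may hold null elements, and it depends on how the column was built (for example, a Scala UDF returning Array[Int] typically yields containsNull = false because Int is a primitive and cannot be null). Casting one side's Index column so the element nullability matches makes the join keys identical, after which a plain equi-join works:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, IntegerType}

// Align df1's element nullability (containsNull = false) with df2's (containsNull = true).
val df1Aligned = df1.withColumn(
  "Index",
  col("Index").cast(ArrayType(IntegerType, containsNull = true))
)

// Join on the single shared key column.
df1Aligned.join(df2, Seq("Index")).show()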

Related

How to flatten multiple structs and get the keys as one of the fields

I have this struct schema
|-- teams: struct (nullable = true)
| |-- blue: struct (nullable = true)
| | |-- has_won: boolean (nullable = true)
| | |-- rounds_lost: long (nullable = true)
| | |-- rounds_won: long (nullable = true)
| |-- red: struct (nullable = true)
| | |-- has_won: boolean (nullable = true)
| | |-- rounds_lost: long (nullable = true)
| | |-- rounds_won: long (nullable = true)
which I want to turn into this schema:
+----+-------+-----------+----------+
|team|has_won|rounds_lost|rounds_win|
+----+-------+-----------+----------+
|blue| 1| 13| 10|
| red| 0| 10| 13|
+----+-------+-----------+----------+
I already tried selectExpr(inline(array('teams.*'))), but I don't have any idea how to get the team into one of the fields. Thank you!
You can start by un-nesting the struct using * and then use stack to "un-pivot" the dataframe. Finally, un-nest the stats.
from pyspark.sql import Row

rows = [Row(teams=Row(blue=Row(has_won=1, rounds_lost=13, rounds_won=10),
                      red=Row(has_won=0, rounds_lost=10, rounds_won=13)))]
df = spark.createDataFrame(rows)

(df.select("teams.*")
   .selectExpr("stack(2, 'blue', blue, 'red', red) as (team, stats)")
   .selectExpr("team", "stats.*")
).show()
"""
+----+-------+-----------+----------+
|team|has_won|rounds_lost|rounds_won|
+----+-------+-----------+----------+
|blue| 1| 13| 10|
| red| 0| 10| 13|
+----+-------+-----------+----------+
"""

Spark - check intersect of two string columns

I have the dataframe below, where colA and colB contain strings. I'm trying to check whether colB contains any substring of the values in colA. The values can contain a comma or a space, but as long as any part of colB's string overlaps with colA's, it is a match. For example, row 1 below has an overlap ("bc"), and row 2 does not.
I was thinking of splitting the values into arrays, but the delimiters are not constant. Could someone please shed some light on how to do this? Many thanks for your help.
+---+-------+-----------+
| id|colA | colB |
+---+-------+-----------+
| 1|abc d | bc, z |
| 2|abcde | hj f |
+---+-------+-----------+
You could split using a regex and then create a UDF to check for substrings.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

data = [
    {"id": 1, "A": "abc d", "B": "bc, z, d"},
    {"id": 2, "A": "abc-d", "B": "acb, abc"},
    {"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)

split_regex = r"((,)?\s|[-])"

df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))

def mapper(a, b):
    # keep every element of B that is a substring of some element of A
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result

df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
|-- result: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+-----------+---+-------+
|A |B |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1 |[bc, d]|
|[abc, d]|[acb, abc] |2 |[abc] |
|[abcde] |[hj, f, ab]|3 |[ab] |
+--------+-----------+---+-------+
You can use a custom UDF to implement the intersect logic, as below.
Data Preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import pandas as pd

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)  # the original answer assumes an existing SQLContext named sql

data = {"id": [1, 2],
        "colA": ["abc d", "abcde"],
        "colB": ["bc, z", "hj f"]}

mypd = pd.DataFrame(data)

sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
| 1|abc d|bc, z|
| 2|abcde| hj f|
+---+-----+-----+
UDF
def str_intersect(x, y):
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None

str_intersect_udf = F.udf(lambda x, y: str_intersect(x, y), StringType())

sparkDF.withColumn('intersect', str_intersect_udf(F.col('colA'), F.col('colB'))).show()
+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
| 1|abc d|bc, z| bc |
| 2|abcde| hj f| null|
+---+-----+-----+---------+

Transposing a Spark dataframe based on a subcolumn

I have a spark dataframe which looks like this:
root
|-- 0000154d-7585-5eb283ff985c: struct (nullable = true)
| |-- collaborative_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- content_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- curated_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- discovery_score: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- original_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- recipe_id: array (nullable = true)
| | |-- element: long (containsNull = true)
|-- 00005426-2675-68085cd359c7: struct (nullable = true)
| |-- collaborative_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- content_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- curated_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- discovery_score: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- original_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- recipe_id: array (nullable = true)
| | |-- element: long (containsNull = true)
Each column is a user id, e.g. 0000154d-7585-5eb283ff985c, and each row holds 15,000 users (the rows come from JSON files that each contain 15,000 users).
I want to transpose it so that each user id is a row, and each sub-column (collaborative_rank, content_rank, curated_rank, discovery_score, original_rank and recipe_id) becomes a column whose value is the corresponding array. I'm new to Spark; is there any painless way to do this?
Edit:
For reference, an input .json file I'm reading from looks like this:
{"0000154d-7585-4096-a71a-5eb283ff985c": {"recipe_id": [1, 2, 3], "collaborative_rank": [1, 2, 3], "curated_rank": [1, 2, 3], "discovery_score": [1]}, "00005426-2675-4940-8394-e8085cd359c7": {"recipe_id": [] ... }
etc.
If you don't want to convert it to an RDD and apply a UDF, you can consider stacking the dataframe.
from pyspark.sql.functions import expr

df = spark.read.json(r'C:\stackoverflow\samples\inp.json')

stack_characteristics = str(len(df.columns)) + ',' + ','.join([f"'{v}',`{v}`" for v in df.columns])

df.select(expr(f'''stack({stack_characteristics})''').alias('userId', 'vals'))\
  .select('userId', 'vals.*').show()
+--------------------+------------------+------------+---------------+---------+
| userId|collaborative_rank|curated_rank|discovery_score|recipe_id|
+--------------------+------------------+------------+---------------+---------+
|0000154d-7585-409...| [1, 2, 3]| [1, 2, 3]| [1]|[1, 2, 3]|
|00005426-2675-494...| [1, 2, 3]| [1, 2, 3]| [1]|[1, 2, 3]|
+--------------------+------------------+------------+---------------+---------+
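For reference (an illustration, not shown in the original answer), with the two example user ids the expression handed to expr() expands to roughly:
stack(2,
  '0000154d-7585-4096-a71a-5eb283ff985c', `0000154d-7585-4096-a71a-5eb283ff985c`,
  '00005426-2675-4940-8394-e8085cd359c7', `00005426-2675-4940-8394-e8085cd359c7`)
Each quoted literal becomes a userId value and the matching backticked column reference supplies that user's struct of ranks, so stack turns the wide columns into one (userId, vals) row per user.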
AFAIK, the code below may solve your problem.
The input JSON considered:
{"0000154d-7585-4096-a71a-5eb283ff985c": {"recipe_id": [1, 2, 3], "collaborative_rank": [1, 2, 3], "curated_rank": [1, 2, 3], "discovery_score": [1] }}
from pyspark.sql import Row

# read an input data
df = spark.read.json("/home/sathya/Desktop/stackoverflo/input.json")

# method to extract keys to columns
def extract_json(row):
    out_array = []
    data_dict = row.asDict()
    for k in data_dict.keys():
        out_array.append(Row(k, data_dict[k][0], data_dict[k][1], data_dict[k][2], data_dict[k][3]))
    return Row(*out_array)

# flatmap columns and extracting the data
rdd = df.rdd.flatMap(extract_json)

# df creation
df1 = spark.createDataFrame(rdd)

df1.selectExpr("_1 as user_id", "_2 as recipe_id", "_3 as collaborative_rank", "_4 as curated_rank", "_5 as discovery_score").show(truncate=False)
/*
+------------------------------------+---------+------------------+------------+---------------+
|user_id |recipe_id|collaborative_rank|curated_rank|discovery_score|
+------------------------------------+---------+------------------+------------+---------------+
|0000154d-7585-4096-a71a-5eb283ff985c|[1, 2, 3]|[1, 2, 3] |[1] |[1, 2, 3] |
+------------------------------------+---------+------------------+------------+---------------+
*/

PySpark: Dataframe with nested fields to relational table

I have a PySpark dataframe of students with schema as follows:
Id: string
|-- School: array
| |-- element: struct
| | |-- Subject: string
| | |-- Classes: string
| | |-- Score: array
| | | |-- element: struct
| | | | |-- ScoreID: string
| | | | |-- Value: string
I want to extract a few fields from the dataframe and normalize it so that I can feed it into a database. The relational schema I expect consists of the fields Id, School, Subject, ScoreId, Value. How can I do this efficiently?
Explode the array to get flattened data and then select all the required columns.
Example:
df.show(10,False)
#+---+--------------------------+
#|Id |School |
#+---+--------------------------+
#|1 |[[b, [[A, 3], [B, 4]], a]]|
#+---+--------------------------+
df.printSchema()
#root
# |-- Id: string (nullable = true)
# |-- School: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- Classes: string (nullable = true)
# | | |-- Score: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- ScoreID: string (nullable = true)
# | | | | |-- Value: string (nullable = true)
# | | |-- Subject: string (nullable = true)
df.selectExpr("Id","explode(School)").\
selectExpr("Id","col.*","explode(col.Score)").\
selectExpr("Id","Classes","Subject","col.*").\
show()
#+---+-------+-------+-------+-----+
#| Id|Classes|Subject|ScoreID|Value|
#+---+-------+-------+-------+-----+
#| 1| b| a| A| 3|
#| 1| b| a| B| 4|
#+---+-------+-------+-------+-----+

How to convert a DataFrame map column to a struct column?

Suppose we have a DataFrame with a column of map type.
val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// +--------------------+
What is the most straightforward way to convert it to a struct (or, equivalently, define a new column with the same keys and values but as a struct type)? See the following spark-shell (2.4.5) session for an insanely inefficient way of going about it:
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Now, obviously collect() is very inefficient, and this is generally an awful way to do things in Spark. But what is the preferred way to accomplish this conversion? named_struct and struct both take a sequence of parameter values to construct the results, but I can't find any way to "unwrap" the map key/values to pass them to these functions.
I would use the explode function:
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
df.select(explode('mapColumn)).select(struct('*).as("struct"))
output:
+--------+
| struct|
+--------+
|[foo, 1]|
|[bar, 2]|
+--------+
root
|-- struct: struct (nullable = false)
| |-- key: string (nullable = false)
| |-- value: integer (nullable = false)
I see chlebek's answer, but in case the result should be kept in one row you can use a UDF:
scala> val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>]
scala> df.show
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
scala> case class KeyValue(key: String, value: String)
defined class KeyValue
scala> val toArrayOfStructs = udf((value: Map[String, String]) => value.map {
| case (k, v) => KeyValue(k, v)
| }.toArray )
toArrayOfStructs: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(key,StringType,true), StructField(value,StringType,true)),true),Some(List(MapType(StringType,StringType,true))))
scala> df.withColumn("alfa", toArrayOfStructs(col("mapColumn")))
res4: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>, alfa: array<struct<key:string,value:string>>]
scala> res4.show
+--------------------+--------------------+
| mapColumn| alfa|
+--------------------+--------------------+
|[foo -> 1, bar -> 2]|[[foo, 1], [bar, 2]]|
+--------------------+--------------------+
scala> res4.printSchema
root
|-- mapColumn: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- alfa: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
Your method doesn't seem to work with more rows, especially if they have different maps, like these:
val df = Seq(
  Map("foo" -> 1, "bar" -> 2),
  Map("foo" -> 3, "baz" -> 4)
).toDF("mapColumn")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// |{foo -> 3, baz -> 4}|
// +--------------------+
Your script would return...
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Solutions
map to columns:
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.withColumn("_c", from_json(json_col, json_schema)).select("_c.*")
df2.show()
// +----+----+---+
// | bar| baz|foo|
// +----+----+---+
// | 2|null| 1|
// |null| 4| 3|
// +----+----+---+
map to struct (field names: "key", "value"):
val df2 = df.select(explode(map_entries($"mapColumn")).as("struct"))
df2.show()
// +--------+
// | struct|
// +--------+
// |{foo, 1}|
// |{bar, 2}|
// |{foo, 3}|
// |{baz, 4}|
// +--------+
map to struct (field names: "foo", "bar", "baz"):
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).as("struct"))
df2.show()
// +------------+
// | struct|
// +------------+
// |{2, null, 1}|
// |{null, 4, 3}|
// +------------+
import org.apache.spark.sql.Encoders

// Define a case class
case class Bean56(foo: Int, bar: Int)

// Define a bean Encoder
val personEncoder = Encoders.bean[Bean56](classOf[Bean56])

val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")

// Map the output to the required bean
val Bean56s = df.map(row => {
  val map = row.getMap[String, Int](0)
  Bean56(map.getOrElse("foo", -1), map.getOrElse("bar", -1))
})(personEncoder) // supply the Encoder of the bean explicitly

Bean56s.foreach(println(_)) // print the bean
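A hedged aside (not part of the original answer): for a Scala case class, the product encoder provided by spark.implicits._ can be supplied implicitly instead of building an encoder by hand, so the same mapping could also be written along these lines:
import spark.implicits._

// The case-class (product) encoder for Bean56 is picked up implicitly.
val beans = df.map { row =>
  val m = row.getMap[String, Int](0)
  Bean56(m.getOrElse("foo", -1), m.getOrElse("bar", -1))
}

beans.show()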