Spark- check intersect of two string columns - dataframe

I have a dataframe below where colA and colB contain strings. I'm trying to check if colB contains any substring of the values in colA. The values can contain a comma or a space, but as long as any part of colB's string overlaps with colA's, it is a match. For example, row 1 below has an overlap ("bc"), and row 2 does not.
I was thinking of splitting the values into arrays, but the delimiters are not constant. Could someone please shed some light on how to do this? Many thanks for your help.
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
|  1|abc d|bc, z|
|  2|abcde| hj f|
+---+-----+-----+

You could split using a regex and then create a UDF to check for substrings.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z, d"},
    {"id": 2, "A": "abc-d", "B": "acb, abc"},
    {"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)

split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))

def mapper(a, b):
    # collect every element of B that appears as a substring of some element of A
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result

df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
|-- result: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+-----------+---+-------+
|A |B |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1 |[bc, d]|
|[abc, d]|[acb, abc] |2 |[abc] |
|[abcde] |[hj, f, ab]|3 |[ab] |
+--------+-----------+---+-------+
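If you only need a yes/no flag rather than the matching substrings themselves, one option (an assumption about the end goal, not part of the original answer) is to test whether the result array is non-empty:

# turn the matched-substrings array into a boolean overlap flag
df = df.withColumn("has_overlap", F.size(F.col("result")) > 0)
df.select("id", "result", "has_overlap").show(truncate=False)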

You can use a custom UDF to implement the intersect logic as below -
Data Preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

import pandas as pd

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)

data = {"id": [1, 2],
        "colA": ["abc d", "abcde"],
        "colB": ["bc, z", "hj f"]}

mypd = pd.DataFrame(data)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
| 1|abc d|bc, z|
| 2|abcde| hj f|
+---+-----+-----+
UDF
def str_intersect(x, y):
    # character-level intersection of the two strings
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None

str_intersect_udf = F.udf(lambda x, y: str_intersect(x, y), StringType())

sparkDF.withColumn('intersect', str_intersect_udf(F.col('colA'), F.col('colB'))).show()
+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
| 1|abc d|bc, z| bc |
| 2|abcde| hj f| null|
+---+-----+-----+---------+
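Note that set(x) & set(y) intersects the two strings character by character, so the "bc" above is really the shared characters b, c and the space rather than the token "bc". If you want to follow the original requirement more literally (any token of colB occurring as a substring of colA), a variant along these lines could work; this is my own sketch and the delimiter regex is an assumption:

import re

def substring_match(x, y):
    # tokens of colB that occur as substrings anywhere in colA
    tokens_b = [t for t in re.split(r"[,\s]+", y) if t]
    hits = [t for t in tokens_b if t in x]
    return ' '.join(hits) if hits else None

substring_match_udf = F.udf(substring_match, StringType())
sparkDF.withColumn('intersect', substring_match_udf(F.col('colA'), F.col('colB'))).show()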

Related

Spark- Split or replace part of the column based on multiple delimiters [closed]

I have a dataframe as input below. Ultimately, I'm trying to get the output shown below so I can use df.filter(col("A").contains(col("B"))) to check whether A contains B as a substring. Note that I'd like to check the order of the letters as well, so a set probably won't work; for example, "acb" should not be considered a substring of "abcd". I've tried to use split, but it only takes one delimiter. Could someone please help? I'm using Spark 2.4.
Input
+---+-----+-----+
| id|    A|    B|
+---+-----+-----+
|  1|abc d|bc, z|
|  2|abc-d|  acb|
|  3|abcde| hj f|
+---+-----+-----+
Output
+---+-----+----+
| id|    A|   B|
+---+-----+----+
|  1| abcd| bcz|
|  2| abcd| acb|
|  3|abcde| hjf|
+---+-----+----+
You can use a regex for both split and replace.
Note that if you want to split, your expected output above is not quite right, since split produces arrays of tokens rather than concatenated strings.
Split
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z"},
    {"id": 2, "A": "abc-d", "B": "acb"},
    {"id": 3, "A": "abcde", "B": "hj f"},
]
df = spark.createDataFrame(data)

split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
+--------+-------+---+
|A |B |id |
+--------+-------+---+
|[abc, d]|[bc, z]|1 |
|[abc, d]|[acb] |2 |
|[abcde] |[hj, f]|3 |
+--------+-------+---+
Now you can create a UDF that checks whether the values in array B are substrings of the values in array A.
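A minimal sketch of such a UDF (my own illustration, not part of the original answer; it assumes the split columns created above) could look like this:

from pyspark.sql.types import BooleanType

def any_substring(a, b):
    # True if any element of B occurs as a substring of some element of A
    return any(ele_b in ele_a for ele_b in b for ele_a in a)

any_substring_udf = F.udf(any_substring, BooleanType())
df = df.withColumn("match", any_substring_udf(F.col("A"), F.col("B")))
df.show(truncate=False)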
Replace
spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z"},
    {"id": 2, "A": "abc-d", "B": "acb"},
    {"id": 3, "A": "abcde", "B": "hj f"},
]
df = spark.createDataFrame(data)

replace_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.regexp_replace(F.col("A"), replace_regex, ""))
df = df.withColumn("B", F.regexp_replace(F.col("B"), replace_regex, ""))
Result:
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- id: long (nullable = true)
+-----+---+---+
|A |B |id |
+-----+---+---+
|abcd |bcz|1 |
|abcd |acb|2 |
|abcde|hjf|3 |
+-----+---+---+
This is a bit involved, and I would stick to split, since here abcd contains both b and bc and there's no way for you to keep track of the whole words once you completely remove the delimiters.
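That said, if you do go with replace, the contains filter from the question can be applied directly to the cleaned string columns (shown here only as an illustration of the asker's stated plan):

# keep only rows where the cleaned A contains the cleaned B as a substring
df.filter(F.col("A").contains(F.col("B"))).show(truncate=False)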

How to convert array<struct> to array<string>

I have a dataframe with the below schema
root
 |-- ColA: array
 |    |-- element: struct
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
The schema of the expected dataframe would be,
root
 |-- ColA: array
 |    |-- element: string (containsNull = true)
Is it possible to convert an array of struct to array of strings?
Any help would be much appreciated.
Thank you.
Assuming you need the field name, you can just select that array column as usual
import pyspark.sql.functions as F

(df
    .withColumn('ColA', F.col('ColA.name'))
    .printSchema()
)
# Input
+----------------+
| ColA|
+----------------+
|[{1, A}, {2, B}]|
+----------------+
# Output
+------+
| ColA|
+------+
|[A, B]|
+------+
You can use the TRANSFORM SQL expression:
import pyspark.sql.functions as f

expr = "TRANSFORM(ColA, x -> to_json(x))"
df = df.withColumn('ColA', f.expr(expr))
df.printSchema()
df.show(truncate=False)
Output:
root
|-- ColA: array (nullable = true)
| |-- element: string (containsNull = true)
+------------------------------------------------------+
|ColA |
+------------------------------------------------------+
|[{"id":"1","name":"Hello"}, {"id":"2","name":"World"}]|
+------------------------------------------------------+
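If you only need one field of the struct rather than the whole struct serialized to JSON, the same TRANSFORM pattern can extract that field instead. This is a small variation of my own, run against the original ColA (before the to_json conversion above), and it assumes the field is called name as in the question:

# keep only the name field of each struct, producing an array<string> of names
df = df.withColumn('ColA', f.expr("TRANSFORM(ColA, x -> x.name)"))
df.show(truncate=False)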

Extract columns from a list of lists in pyspark

I am stuck trying to extract columns from a list of lists but can't visualize how to do it. I am fairly new to Spark. Running PySpark on Spark 2.4.3.
I have a json organized like this:
{ "meta" : { ... },
"data" :
[[ "a", 0, null, "{ }"],
[ "b", 0, null, "{ }"],
[ "c", 0, null, "{ }"],
] }
I want to get the 'data' portion into columns, like
+------+------+------+------+
| col1 | col2 | col3 | col4 |
+------+------+------+------+
| a    | 0    | None | "{ }"|
| b    | 0    | None | "{ }"|
| c    | 0    | None | "{ }"|
+------+------+------+------+
I have my dataframe read in and printSchema() shows this.
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true) ...
My rough shape is 70 columns by 650k rows.
I was able to explode the df to get just the data portion but am stuck there.
Explode the rows first, and then select the array elements using [] in Python.
import pyspark.sql.functions as F

df2 = df.select(F.explode('data').alias('data')) \
        .select(*[F.col('data')[i].alias('col%s' % (i + 1)) for i in range(4)])
df2.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a| 0|null| { }|
| b| 0|null| { }|
| c| 0|null| { }|
+----+----+----+----+
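Since the question mentions roughly 70 columns, you probably don't want to hard-code range(4). One way (a sketch, assuming every inner list has the same length) is to derive the count from the first exploded row:

# infer the number of inner elements from the first exploded row
n_cols = len(df.select(F.explode('data').alias('data')).first()['data'])
df2 = df.select(F.explode('data').alias('data')) \
        .select(*[F.col('data')[i].alias('col%s' % (i + 1)) for i in range(n_cols)])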
Why don't you just use the SparkSession.createDataFrame() method?
https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
You can provide data and schema parameters to this method and get a Spark dataframe.
Example:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.getOrCreate()
df = sparkSession.createDataFrame(data)
If Spark cannot infer the schema from the data, then the schema also needs to be provided:
from pyspark.sql.types import StructType
struct = StructType()
struct.add("col1", "string", True)
struct.add("col2", "integer", True)
struct.add("col3", "string", True)
struct.add("col4", "string", True)
df = sparkSession.createDataFrame(data=data, schema=struct)
In addition, you can use PySpark type classes instead of Python primitive type names.
https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#module-pyspark.sql.types
The module contains both simple types (StringType, IntegerType, ...) and complex types (ArrayType, MapType, ...).
Last note: the data cannot contain null; it should be None in Python. Spark's DataFrame.show() will print None values as null.
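Putting the pieces together, a sketch of the full flow with the type classes might look like the following. This is my own illustration, not from the original answer; the file path input.json and the column names are assumptions.

import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# hypothetical path to the JSON file described in the question
with open("input.json") as fh:
    payload = json.load(fh)

schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", IntegerType(), True),
    StructField("col3", StringType(), True),
    StructField("col4", StringType(), True),
])

# payload["data"] is the list of lists; json.load already turns JSON null into None
df = spark.createDataFrame(payload["data"], schema=schema)
df.show()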

How to join two dataframes that both have an array column, where one's elements are nullable and the other's are not

I have two dataframes, df1 and df2.
df1
df1's Index column was generated by a UDF.
+------+--------+
|other | Index  |
+------+--------+
|     5|  [1, 1]|
|     1|  [0, 7]|
|     2|  [1, 7]|
|     8|  [2, 6]|
|     6|  [2, 7]|
+------+--------+
 |-- gridIndex: array (nullable = true)
 |    |-- element: integer (containsNull = false)
df2
import scala.collection.mutable.ListBuffer

var l = new ListBuffer[(Int, Array[Int])]
l.append((5, Array(1, 1)))
l.append((6, Array(0, 7)))
l.append((1, Array(1, 7)))
l.append((3, Array(2, 6)))
l.append((7, Array(2, 7)))
val df2 = l.toDF("other2", "Index")
+------+--------+
|other2| Index  |
+------+--------+
|     5|  [1, 1]|
|     6|  [0, 7]|
|     1|  [1, 7]|
|     3|  [2, 6]|
|     7|  [2, 7]|
+------+--------+
 |-- gridIndex: array (nullable = true)
 |    |-- element: integer (containsNull = true)
I want to join them by Index.
But as you can see, an error appears:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Join keys from two sides should have same types
df1.Index is different from df2.Index: their containsNull differs.
So how can I work around this problem?
I would also like to know why df2's schema has |-- element: integer (containsNull = true).
Thank you very much!
I'm able to do this.
Not sure if this was the desired result as you have not posted the expected output.
val df1 = Seq(
  (5, Array(1, 1)),
  (1, Array(0, 7)),
  (2, Array(1, 7)),
  (8, Array(2, 6)),
  (6, Array(2, 7)),
  (100, Array(null.asInstanceOf[Int], null.asInstanceOf[Int]))
).toDF("Other", "Index")

val df2 = Seq(
  (5, Array(1, 1)),
  (6, Array(0, 7)),
  (1, Array(1, 7)),
  (3, Array(2, 6)),
  (7, Array(null.asInstanceOf[Int], null.asInstanceOf[Int]))
).toDF("Other2", "Index")
root
|-- Other: integer (nullable = false)
|-- Index: array (nullable = true)
| |-- element: integer (containsNull = false)
root
|-- Other2: integer (nullable = false)
|-- Index: array (nullable = true)
| |-- element: integer (containsNull = false)
df1.join(df2, df1("Index") <=> df2("Index") ).show()
+-----+------+------+------+
|Other| Index|Other2| Index|
+-----+------+------+------+
| 5|[1, 1]| 5|[1, 1]|
| 1|[0, 7]| 6|[0, 7]|
| 2|[1, 7]| 1|[1, 7]|
| 8|[2, 6]| 3|[2, 6]|
| 100|[0, 0]| 7|[0, 0]|
+-----+------+------+------+

Convert Pyspark Dataframe column from array to new columns

I've a Pyspark Dataframe with this structure:
root
|-- Id: string (nullable = true)
|-- Q: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- pr: string (nullable = true)
| | |-- qt: double (nullable = true)
Something similar to:
+----+--------------------- ... --+
| Id | Q |
+----+---------------------- ... -+
| 001| [ [pr1,1.9], [pr3,2.0]...] |
| 002| [ [pr2,1.0], [pr9,3.9]...] |
| 003| [ [pr2,9.0], ... ] |
...
I would like to convert the Q array into columns (name pr, value qt).
I would also like to avoid duplicated columns by merging (adding) the values for the same column.
+----+-----+-----+------+ ... ----+
| Id | pr1 | pr2 | pr3 | ... prn |
+----+-----+-----+------+ ... ----+
| 001| 1.9 | 0.0 | 2.0 | ... |
| 002| 0.0 | 1.0 | 0 | ... |
| 003| 0.0 | 9.0 | ... | ... |
...
How can I perform this transformation?
Thank you in advance!
Julián.
You can do this with a combination of explode and pivot:
import pyspark.sql.functions as F
# explode to get "long" format
df=df.withColumn('exploded', F.explode('Q'))
# get the name and the name in separate columns
df=df.withColumn('name', F.col('exploded').getItem(0))
df=df.withColumn('value', F.col('exploded').getItem(1))
# now pivot
df.groupby('Id').pivot('name').agg(F.max('value')).na.fill(0)
Very interesting question. This is how I approached it.
test.csv
001,pr1:0.9,pr3:1.2,pr2:2.0
002,pr3:5.2,pr4:0.99
Pyspark
from pyspark.sql.functions import expr

file = sc.textFile("file:///test2.csv")

# get it into (key, value) pairs: [(u'001', u'pr1:0.9'), ...]
rdd1 = file.map(lambda r: r.replace(",", "\t", 1)) \
           .map(lambda r: r.split("\t")) \
           .map(lambda r: (r[0], r[1])) \
           .flatMapValues(lambda r: r.split(','))

# create a DF with 3 columns: [(u'001', u'pr1', u'0.9'), ...]
rdd2 = rdd1.map(lambda r: (r[0], r[1].split(":"))) \
           .map(lambda r: (r[0], r[1][0], r[1][1]))
df = rdd2.toDF()
df.show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|001|pr1| 0.9|
|001|pr3| 1.2|
|001|pr2| 2.0|
|002|pr3| 5.2|
|002|pr4|0.99|
+---+---+----+
# perform the magic
df.groupBy("_1").pivot("_2").agg(expr("coalesce(first(_3),0)")).show()
+---+---+---+---+----+
| _1|pr1|pr2|pr3| pr4|
+---+---+---+---+----+
|001|0.9|2.0|1.2| 0|
|002| 0| 0|5.2|0.99|
+---+---+---+---+----+