Convert Pyspark Dataframe column from array to new columns - dataframe

I have a PySpark DataFrame with this structure:
root
|-- Id: string (nullable = true)
|-- Q: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- pr: string (nullable = true)
| | |-- qt: double (nullable = true)
Something similar to:
+----+--------------------- ... --+
| Id | Q |
+----+---------------------- ... -+
| 001| [ [pr1,1.9], [pr3,2.0]...] |
| 002| [ [pr2,1.0], [pr9,3.9]...] |
| 003| [ [pr2,9.0], ... ] |
...
I would like to convert the Q array into columns (name from pr, value from qt).
I would also like to merge duplicated columns by adding their values rather than keeping duplicates.
+----+-----+-----+------+ ... ----+
| Id | pr1 | pr2 | pr3 | ... prn |
+----+-----+-----+------+ ... ----+
| 001| 1.9 | 0.0 | 2.0 | ... |
| 002| 0.0 | 1.0 | 0 | ... |
| 003| 0.0 | 9.0 | ... | ... |
...
How can I perform this transformation?
Thank you in advance!
Julián.

You can do this with a combination of explode and pivot:
import pyspark.sql.functions as F
# explode to get "long" format
df = df.withColumn('exploded', F.explode('Q'))
# pull the name and the value out of the struct into separate columns
df = df.withColumn('name', F.col('exploded').getItem('pr'))
df = df.withColumn('value', F.col('exploded').getItem('qt'))
# now pivot
df.groupby('Id').pivot('name').agg(F.max('value')).na.fill(0)
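The question also asks for duplicate pr entries to be merged by addition; here is a minimal end-to-end sketch of the same pipeline using sum instead of max (assuming a SparkSession named spark and a df matching the schema in the question):
from pyspark.sql import functions as F

# Explode the array of structs, flatten the struct fields, then pivot.
# F.sum merges duplicate pr entries per Id by adding their qt values.
result = (
    df.select('Id', F.explode('Q').alias('q'))
      .select('Id', F.col('q.pr').alias('name'), F.col('q.qt').alias('value'))
      .groupby('Id')
      .pivot('name')
      .agg(F.sum('value'))
      .na.fill(0)
)
result.show()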

Very interesting question. This is how I approached it.
test.csv
001,pr1:0.9,pr3:1.2,pr2:2.0
002,pr3:5.2,pr4:0.99
Pyspark
from pyspark.sql.functions import expr

file = sc.textFile("file:///test2.csv")
# get it into (key, value) pairs: [(u'001', u'pr1:0.9'), ...]
# an equivalent alternative:
# rdd1 = file.map(lambda r: r.replace(",", "\t", 1)).map(lambda r: r.split("\t")).map(lambda r: (r[0], r[1])).flatMapValues(lambda r: r.split(','))
rdd1 = file.map(lambda r: r.split(",", 1)).map(lambda r: (r[0], r[1])).flatMapValues(lambda r: r.split(','))
# create a DataFrame with 3 columns: [(u'001', u'pr1', u'0.9'), ...]
+---+---+----+
| _1| _2| _3|
+---+---+----+
|001|pr1| 0.9|
|001|pr3| 1.2|
|001|pr2| 2.0|
|002|pr3| 5.2|
|002|pr4|0.99|
+---+---+----+
rdd2 = rdd1.map(lambda r: (r[0], r[1].split(":"))).map(lambda r: (r[0], r[1][0], r[1][1]))
df = rdd2.toDF()
# perform the magic
df.groupBy("_1").pivot("_2").agg(expr("coalesce(first(_3),0)"))
+---+---+---+---+----+
| _1|pr1|pr2|pr3| pr4|
+---+---+---+---+----+
|001|0.9|2.0|1.2| 0|
|002| 0| 0|5.2|0.99|
+---+---+---+---+----+
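A hedged DataFrame-only sketch of the same idea, without dropping to RDDs (assumes Spark >= 2.4 for slice/size in SQL expressions; the file path is the same illustrative one as above):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.text("file:///test2.csv")   # one line per row, column "value"
pairs = (
    raw.withColumn("parts", F.split("value", ","))
       .withColumn("Id", F.col("parts").getItem(0))
       # drop the Id element, keep the remaining pr:qt pairs, and explode them
       .withColumn("pair", F.explode(F.expr("slice(parts, 2, size(parts) - 1)")))
       .withColumn("name", F.split("pair", ":").getItem(0))
       .withColumn("qt", F.split("pair", ":").getItem(1).cast("double"))
)
pivoted = pairs.groupBy("Id").pivot("name").agg(F.first("qt")).na.fill(0)
pivoted.show()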

Related

pyspark dataframe replace null in one column with another column by converting it from string to array

I would like to replace a null value in a PySpark dataframe column with another string column converted to an array.
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.printSchema()
new_customers.show(5, False)
new_customers = new_customers.withColumn("new_val", F.coalesce(F.col("val"), F.array(F.col("name"))))
new_customers.show(10, truncate=False)
But the result is:
root
|-- name: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: string (containsNull = true)
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[] |
|Cosimo|[d]|[d] |
+------+---+-------+
what I expect:
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
Did I miss something? Thanks.
The problem is that you have an array with a null element in it; the array itself is not null, so it will not test positive for an isNull check, and coalesce therefore keeps it.
First clean up the single-null-element arrays:
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
+------+------+
|name |val |
+------+------+
|Karen |[a] |
|Penny |[b] |
|John |[null]|
|Cosimo|[d] |
+------+------+
new_customers = new_customers.withColumn("val", F.filter(F.col("val"), lambda x: x.isNotNull()))
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
Then change your expression to check for an empty array instead of a null:
new_customers = new_customers.withColumn("new_val", F.when(F.size("val")>0, F.col("val")).otherwise(F.array(F.col("name"))))
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
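For reference, a minimal one-expression sketch that combines both steps (assumes Spark >= 3.1 for F.filter; it keeps the original array whenever it contains at least one non-null element):
from pyspark.sql import functions as F

new_customers = new_customers.withColumn(
    "new_val",
    F.when(
        F.size(F.filter("val", lambda x: x.isNotNull())) > 0,  # any real element left?
        F.col("val"),
    ).otherwise(F.array(F.col("name"))),
)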

Spark- check intersect of two string columns

I have the dataframe below where colA and colB contain strings. I'm trying to check whether colB contains any substring of the values in colA. The values can contain commas or spaces, but as long as any part of colB's string overlaps with colA's, it is a match. For example, row 1 below has an overlap ("bc"), and row 2 does not.
I was thinking of splitting the values into arrays, but the delimiters are not constant. Could someone please shed some light on how to do this? Many thanks for your help.
+---+-------+-----------+
| id|colA | colB |
+---+-------+-----------+
| 1|abc d | bc, z |
| 2|abcde | hj f |
+---+-------+-----------+
You could split by using regex and then create a UDF function to check substrings.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z, d"},
    {"id": 2, "A": "abc-d", "B": "acb, abc"},
    {"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)

split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))

def mapper(a, b):
    # collect every element of B that is a substring of some element of A
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result

df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
|-- result: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+-----------+---+-------+
|A |B |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1 |[bc, d]|
|[abc, d]|[acb, abc] |2 |[abc] |
|[abcde] |[hj, f, ab]|3 |[ab] |
+--------+-----------+---+-------+
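If you prefer to avoid a Python UDF, here is a hedged sketch with Spark's higher-order SQL functions that keeps the same substring semantics (assumes Spark >= 2.4 and reuses the df with the already-split A and B columns):
from pyspark.sql import functions as F

# keep every element of B that is a substring of some element of A
df = df.withColumn(
    "result",
    F.expr("filter(B, b -> exists(A, a -> instr(a, b) > 0))"),
)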
You can use a custom UDF to implement the intersect logic as below -
Data Preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import pandas as pd

sql = SQLContext(sc)  # assumes an existing SparkContext named sc

data = {"id": [1, 2],
        "colA": ["abc d", "abcde"],
        "colB": ["bc, z", "hj f"]}

mypd = pd.DataFrame(data)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
| 1|abc d|bc, z|
| 2|abcde| hj f|
+---+-----+-----+
UDF
def str_intersect(x, y):
    # character-level intersection of the two strings
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None

str_intersect_udf = F.udf(lambda x, y: str_intersect(x, y), StringType())

sparkDF.withColumn('intersect', str_intersect_udf(F.col('colA'), F.col('colB'))).show()
+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
| 1|abc d|bc, z| bc |
| 2|abcde| hj f| null|
+---+-----+-----+---------+

How to split a column by using length split and MaxSplit in Pyspark dataframe?

For Example
If I have a Column as given below by calling and showing the CSV in Pyspark
+--------+
| Names|
+--------+
|Rahul |
|Ravi |
|Raghu |
|Romeo |
+--------+
If I specify in my function values such as
Length = 2
Maxsplit = 3
then I should get the results as
+----------+-----------+----------+
|Col_1 |Col_2 |Col_3 |
+----------+-----------+----------+
| Ra | hu | l |
| Ra | vi | Null |
| Ra | gh | u |
| Ro | me | o |
+----------+-----------+----------+
Similarly, in Pyspark with
Length = 3
Maxsplit = 2
it should give me output such as
+----------+-----------+
|Col_1 |Col_2 |
+----------+-----------+
| Rah | ul |
| Rav | i |
| Rag | hu |
| Rom | eo |
+----------+-----------+
This is how it should look. Thank you.
Another way to go about this. Should be faster than any looping or udf solution.
from pyspark.sql import functions as F

def split(df, length, maxsplit):
    # "(?<=\G...)" makes Spark's Java regex split after every `length` characters
    return df.withColumn('Names', F.split("Names", "(?<=\\G{})".format('.' * length)))\
             .select(*((F.col("Names")[x]).alias("Col_" + str(x + 1)) for x in range(0, maxsplit)))
split(df,3,2).show()
#+-----+-----+
#|Col_1|Col_2|
#+-----+-----+
#| Rah| ul|
#| Rav| i|
#| Rag| hu|
#| Rom| eo|
#+-----+-----+
split(df,2,3).show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| Ra| hu| l|
#| Ra| vi| |
#| Ra| gh| u|
#| Ro| me| o|
#+-----+-----+-----+
Try this,
import pyspark.sql.functions as F

tst = sqlContext.createDataFrame([("Raghu", 1), ("Ravi", 2), ("Rahul", 3)], schema=["Name", "val"])

def fn(split, max_n, tst):
    # take max_n fixed-length substrings of `split` characters each
    for i in range(max_n):
        tst_loop = tst.withColumn("coln" + str(i), F.substring(F.col("Name"), (i * split) + 1, split))
        tst = tst_loop
    return tst

tst_res = fn(3, 2, tst)
The for loop can also be replaced by a list comprehension or reduce, but I felt that in your case a for loop looked neater; they have the same physical plan anyway.
The results
+-----+---+-----+-----+
| Name|val|coln0|coln1|
+-----+---+-----+-----+
|Raghu| 1| Rag| hu|
| Ravi| 2| Rav| i|
|Rahul| 3| Rah| ul|
+-----+---+-----+-----+
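A hedged sketch of the list-comprehension variant mentioned above (same output columns, assuming the tst dataframe from the snippet):
import pyspark.sql.functions as F

split, max_n = 3, 2
tst_res = tst.select(
    "*",
    *[F.substring("Name", i * split + 1, split).alias("coln" + str(i)) for i in range(max_n)]
)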
Try this
import pyspark.sql.functions as f

def split(data, length, maxSplit):
    start = 1
    for i in range(0, maxSplit):
        data = data.withColumn(f'col_{start}-{start+length-1}', f.substring('channel', start, length))
        start += length
    return data

df = split(data, 3, 2)
df.show()
+--------+----+-------+-------+
| channel|type|col_1-3|col_4-6|
+--------+----+-------+-------+
| web| 0| web| |
| web| 1| web| |
| web| 2| web| |
| twitter| 0| twi| tte|
| twitter| 1| twi| tte|
|facebook| 0| fac| ebo|
|facebook| 1| fac| ebo|
|facebook| 2| fac| ebo|
+--------+----+-------+-------+
Perhaps this is useful.
Load the test data
Note: written in Scala
import spark.implicits._

val Length = 2
val Maxsplit = 3
val df = Seq("Rahul", "Ravi", "Raghu", "Romeo").toDF("Names")
df.show(false)
/**
* +-----+
* |Names|
* +-----+
* |Rahul|
* |Ravi |
* |Raghu|
* |Romeo|
* +-----+
*/
Split the string column as per the length and offset:
import org.apache.spark.sql.functions.{lit, udf}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.RowFactory

val schema = StructType(Range(1, Maxsplit + 1).map(f => StructField(s"Col_$f", StringType)))
val split = udf((str: String, length: Int, maxSplit: Int) => {
  val splits = str.toCharArray.grouped(length).map(_.mkString).toArray
  RowFactory.create(splits ++ Array.fill(maxSplit - splits.length)(null): _*)
}, schema)
val p = df
  .withColumn("x", split($"Names", lit(Length), lit(Maxsplit)))
  .selectExpr("x.*")
p.show(false)
p.printSchema()
/**
* +-----+-----+-----+
* |Col_1|Col_2|Col_3|
* +-----+-----+-----+
* |Ra |hu |l |
* |Ra |vi |null |
* |Ra |gh |u |
* |Ro |me |o |
* +-----+-----+-----+
*
* root
* |-- Col_1: string (nullable = true)
* |-- Col_2: string (nullable = true)
* |-- Col_3: string (nullable = true)
*/
Dataset[Row] -> Dataset[Array[String]]
val x = df.map(r => {
  val splits = r.getString(0).toCharArray.grouped(Length).map(_.mkString).toArray
  splits ++ Array.fill(Maxsplit - splits.length)(null)
})
x.show(false)
x.printSchema()
/**
* +-----------+
* |value |
* +-----------+
* |[Ra, hu, l]|
* |[Ra, vi,] |
* |[Ra, gh, u]|
* |[Ro, me, o]|
* +-----------+
*
* root
* |-- value: array (nullable = true)
* | |-- element: string (containsNull = true)
*/

Multiple AND conditions on the same column in pyspark without join operation

I have a table of three columns [s, p, o]. I would like to remove the rows where, for a given entry in s, the p column does not include both of the values P625 and P36. For example
+----+----+-----+
|   s|   p|    o|
+----+----+-----+
| Q31| P36| Q239|
| Q31|P625|   51|
| Q45| P36| Q597|
| Q45|P625|  123|
| Q51|P625|   22|
| Q24|P625|   56|
+----+----+-----+
The end result should be
+----+----+-----+
|   s|   p|    o|
+----+----+-----+
| Q31| P36| Q239|
| Q31|P625|   51|
| Q45| P36| Q597|
| Q45|P625|  123|
+----+----+-----+
Using a join operation, the task above is easy:
df.filter(df.p=='P625').join(df.filter(df.p=='P36'),'s')
But is there a more elegant way to do this?
You need a window:
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, array_contains

winSpec = Window.partitionBy('s')

df.withColumn("p_list", collect_list("p").over(winSpec)) \
  .filter(array_contains(col("p_list"), "P625") & array_contains(col("p_list"), "P36")) \
  .drop("p_list")
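An alternative hedged sketch that avoids collecting lists, using boolean flags computed over the same window (the df is the one from the question):
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('s')

result = (
    df.withColumn('has_p625', F.max((F.col('p') == 'P625').cast('int')).over(w))
      .withColumn('has_p36', F.max((F.col('p') == 'P36').cast('int')).over(w))
      .filter((F.col('has_p625') == 1) & (F.col('has_p36') == 1))
      .drop('has_p625', 'has_p36')
)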
Forgive me, as I'm much more familiar with the Scala API, but perhaps you can easily convert it:
scala> val df = spark.createDataset(Seq(
| ("Q31", "P36", "Q239"),
| ("Q31", "P625", "51"),
| ("Q45", "P36", "Q597"),
| ("Q45", "P625", "123"),
| ("Q51", "P625", "22"),
| ("Q24", "P625", "56")
| )).toDF("s", "p", "o")
df: org.apache.spark.sql.DataFrame = [s: string, p: string ... 1 more field]
scala> (df.select($"s", struct($"p", $"o").as("po"))
| .groupBy("s")
| .agg(collect_list($"po").as("polist"))
| .as[(String, Array[(String, String)])]
| .flatMap(r => {
| val ps = r._2.map(_._1).toSet
| if(ps("P625") && ps("P36")) {
| r._2.flatMap(po => Some(r._1, po._1, po._2))
| } else {
| None
| }
| }).toDF("s", "p", "o")
| .show())
+---+----+----+
| s| p| o|
+---+----+----+
|Q31| P36|Q239|
|Q31|P625| 51|
|Q45| P36|Q597|
|Q45|P625| 123|
+---+----+----+
For reference, your join() command above would have returned:
scala> df.filter($"p" === "P625").join(df.filter($"p" === "P36"), "s").show
+---+----+---+---+----+
| s| p| o| p| o|
+---+----+---+---+----+
|Q31|P625| 51|P36|Q239|
|Q45|P625|123|P36|Q597|
+---+----+---+---+----+
Which can be worked into your final solution as well, perhaps with less code, but I'm not sure which method would be more efficient, as that's largely data dependent.
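For completeness, a hedged PySpark sketch of that join-based route, deriving the qualifying keys and joining them back onto the original df from the question:
from pyspark.sql import functions as F

# ids that have both a P625 row and a P36 row
keys = (
    df.filter(F.col('p') == 'P625').select('s')
      .intersect(df.filter(F.col('p') == 'P36').select('s'))
)

# keep all original rows for those ids
result = df.join(keys, 's')
result.show()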

How to extract values from key value map?

I have a column of type map, where the key and value changes. I am trying to extract the value and create a new column.
Input:
+---------------+
|symbols |
+---------------+
|[3pea -> 3PEA] |
|[barello -> BA]|
|[] |
|[] |
+---------------+
Expected output:
+-------+
|symbols|
+-------+
|3PEA |
|BA |
| |
| |
+-------+
Here is what I tried so far using a udf:
def map_value = udf((inputMap: Map[String, String]) => {
  inputMap.map(x => x._2)
})
java.lang.UnsupportedOperationException: Schema for type scala.collection.immutable.Iterable[String] is not supported
Since the Spark Scala API v2.3, the SQL API v2.3, and the PySpark API v2.4, you can use the Spark SQL function map_values.
The following is in PySpark; Scala would be very similar.
Setup (assuming working SparkSession as spark):
from pyspark.sql import functions as F

df = (
    spark.read.json(sc.parallelize(["""[
        {"key": ["3pea"], "value": ["3PEA"] },
        {"key": ["barello"], "value": ["BA"] }
    ]"""]))
    .select(F.map_from_arrays(F.col("key"), F.col("value")).alias("symbols"))
)
df.printSchema()
df.show()
root
|-- symbols: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------+
| symbols|
+---------------+
| [3pea -> 3PEA]|
|[barello -> BA]|
+---------------+
df.select((F.map_values(F.col("symbols"))[0]).alias("map_vals")).show()
+--------+
|map_vals|
+--------+
| 3PEA|
| BA|
+--------+
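The question's input also contains empty maps that should yield empty strings; a hedged follow-up building on the expression above (assumes default non-ANSI behavior, where indexing past the end of the values array returns null):
from pyspark.sql import functions as F

df.select(
    F.coalesce(F.map_values(F.col("symbols"))[0], F.lit("")).alias("symbols")
).show()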
import org.apache.spark.sql.functions._
import spark.implicits._
val m = Seq(Array("A -> abc"), Array("B -> 0.11856755943424617"), Array("C -> kqcams"))
val df = m.toDF("map_data")
df.show
// Simulate your data I think.
val df2 = df.withColumn("xxx", split(concat_ws("",$"map_data"), "-> ")).select($"xxx".getItem(1).as("map_val")).drop("xxx")
df2.show(false)
results in:
+--------------------+
| map_data|
+--------------------+
| [A -> abc]|
|[B -> 0.118567559...|
| [C -> kqcams]|
+--------------------+
+-------------------+
|map_val |
+-------------------+
|abc |
|0.11856755943424617|
|kqcams |
+-------------------+