How to extract values from a key-value map?

I have a column of type map whose keys and values vary from row to row. I am trying to extract the value into a new column.
Input:
+---------------+
|symbols        |
+---------------+
|[3pea -> 3PEA] |
|[barello -> BA]|
|[]             |
|[]             |
+---------------+
Expected output:
+-------+
|symbols|
+-------+
|3PEA   |
|BA     |
|       |
|       |
+-------+
Here is what I tried so far using a UDF:
def map_value = udf((inputMap: Map[String, String]) => {
  inputMap.map(x => x._2)
})
It fails with:
java.lang.UnsupportedOperationException: Schema for type scala.collection.immutable.Iterable[String] is not supported

Since Spark 2.3 (Scala and SQL APIs) or 2.4 (PySpark API), you can use the Spark SQL function map_values.
The following is in PySpark; Scala would be very similar.
Setup (assuming a working SparkSession named spark):
from pyspark.sql import functions as F

sc = spark.sparkContext
df = (
    spark.read.json(sc.parallelize(["""[
        {"key": ["3pea"], "value": ["3PEA"] },
        {"key": ["barello"], "value": ["BA"] }
    ]"""]))
    .select(F.map_from_arrays(F.col("key"), F.col("value")).alias("symbols"))
)
df.printSchema()
df.show()
root
 |-- symbols: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
+---------------+
| symbols|
+---------------+
| [3pea -> 3PEA]|
|[barello -> BA]|
+---------------+
df.select((F.map_values(F.col("symbols"))[0]).alias("map_vals")).show()
+--------+
|map_vals|
+--------+
| 3PEA|
| BA|
+--------+

import org.apache.spark.sql.functions._
import spark.implicits._
// Simulate your data, I think (note: these are arrays of strings, not a real MapType column).
val m = Seq(Array("A -> abc"), Array("B -> 0.11856755943424617"), Array("C -> kqcams"))
val df = m.toDF("map_data")
df.show
// Join each array into one string, split it on "-> ", and keep the part after the arrow.
val df2 = df.select(split(concat_ws("", $"map_data"), "-> ").getItem(1).as("map_val"))
df2.show(false)
This results in:
+--------------------+
| map_data|
+--------------------+
| [A -> abc]|
|[B -> 0.118567559...|
| [C -> kqcams]|
+--------------------+
+-------------------+
|map_val |
+-------------------+
|abc |
|0.11856755943424617|
|kqcams |
+-------------------+

Related

Transform list of map to dataframe

I have the following data:
d = Some(List(Map(id -> 1, n -> Hi), Map(id -> 2, n -> Hello)))
I would like to transform it into a dataframe like the following:
+---+-----+
|id |n    |
+---+-----+
|1  |Hi   |
|2  |Hello|
+---+-----+
I tried the following:
import spark.implicits._
val df = d
  .map(m => (m.get("id"), m.get("n")))
  .toDF("id", "n")
But I'm getting:
error: value get is not a member of Any
  .map(m => (m.get("id"), m.get("n")))
The top level of your data is an Option, and I think that is why you can't handle it with a single map. I managed to do it with something like this:
import spark.implicits._
val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
// Unwrap the Option (empty list if None), then turn each Map into a tuple of Options.
val data = d.fold(List.empty[(Option[String], Option[String])])(_.map(m => (m.get("id"), m.get("n"))))
val df = data.toDF("id", "n")
df.show()
Output:
+---+-----+
| id| n|
+---+-----+
| 1| Hi|
| 2|Hello|
+---+-----+

pyspark dataframe replace null in one column with another column by converting it from string to array

I would like to replace a null value of a pyspark dataframe column with another string column converted to array.
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.printSchema()
new_customers.show(5, False)
new_customers = new_customers.withColumn("new_val", F.coalesce(F.col("val"), F.array(F.col("name"))))
new_customers.show(10, truncate=False)
But the result is:
root
 |-- name: string (nullable = true)
 |-- val: array (nullable = true)
 |    |-- element: string (containsNull = true)
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[] |
|Cosimo|[d]|[d] |
+------+---+-------+
what I expect:
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
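Did I miss something? Thanks.
The problem is that you have an array with a null element in it, not a null array, so it will not test positive in an isNull check (which is what coalesce relies on).
First clean up the single-null-element arrays (F.filter with a Python lambda requires Spark 3.1 or later):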
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
+------+------+
|name |val |
+------+------+
|Karen |[a] |
|Penny |[b] |
|John |[null]|
|Cosimo|[d] |
+------+------+
new_customers = new_customers.withColumn("val", F.filter(F.col("val"), lambda x: x.isNotNull()))
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
Then change your expression to check for an empty array instead of a null:
new_customers = new_customers.withColumn("new_val", F.when(F.size("val")>0, F.col("val")).otherwise(F.array(F.col("name"))))
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
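The two steps above (drop null elements, then fall back to the name when nothing is left) can be modelled in plain Python to check the intended result without a Spark session. This is only a sketch of the logic; fill_empty_with_name is a hypothetical helper name, not a PySpark function.

```python
def fill_empty_with_name(name, val):
    """Drop None elements; if nothing is left (or val is None), fall back to [name]."""
    cleaned = [v for v in (val or []) if v is not None]
    return cleaned if cleaned else [name]

# Same sample rows as in the question.
rows = [("Karen", ["a"]), ("Penny", ["b"]), ("John", [None]), ("Cosimo", ["d"])]
for name, val in rows:
    print(name, fill_empty_with_name(name, val))
```

Only the John row changes, because its array contains nothing but a null element.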

Spark- check intersect of two string columns

I have the dataframe below, where colA and colB contain strings. I'm trying to check whether colB contains any substring of the values in colA. The values can contain commas or spaces, but as long as any part of colB's string overlaps with colA's, it is a match. For example, row 1 below has an overlap ("bc") and row 2 does not.
I was thinking of splitting the values into arrays but the delimiters are not constant. Could someone please help to shed some light on how to do this? Many thanks for your help.
+---+-------+-----------+
| id|colA   | colB      |
+---+-------+-----------+
|  1|abc d  | bc, z     |
|  2|abcde  | hj f      |
+---+-------+-----------+
You could split using a regex and then create a UDF to check for substrings.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z, d"},
    {"id": 2, "A": "abc-d", "B": "acb, abc"},
    {"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)
# Split on an optional comma followed by whitespace, or on a hyphen.
split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))

def mapper(a, b):
    # Collect every token of B that is a substring of some token of A.
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result

df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
Result:
root
 |-- A: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- B: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: long (nullable = true)
 |-- result: array (nullable = true)
 |    |-- element: string (containsNull = true)
+--------+-----------+---+-------+
|A |B |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1 |[bc, d]|
|[abc, d]|[acb, abc] |2 |[abc] |
|[abcde] |[hj, f, ab]|3 |[ab] |
+--------+-----------+---+-------+
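The token-matching logic is plain Python, so it can be tried outside Spark. The sketch below is an illustration, not part of the answer: tokenize and find_overlaps are hypothetical names, and the pattern uses a non-capturing group because Python's re.split would otherwise also return the captured delimiters. Unlike the nested loop above, it reports each token of B at most once.

```python
import re

# Non-capturing variant of the split pattern: optional comma + whitespace, or a hyphen.
SPLIT_PATTERN = re.compile(r"(?:,?\s|-)")

def tokenize(text):
    """Split a raw string into tokens, dropping empty pieces."""
    return [tok for tok in SPLIT_PATTERN.split(text) if tok]

def find_overlaps(a_tokens, b_tokens):
    """Return the tokens of B that occur as a substring of any token of A."""
    return [b for b in b_tokens if any(b in a for a in a_tokens)]

rows = [("abc d", "bc, z, d"), ("abc-d", "acb, abc"), ("abcde", "hj f ab")]
for col_a, col_b in rows:
    print(find_overlaps(tokenize(col_a), tokenize(col_b)))
```

An empty result list means there is no overlap, which corresponds to row 2 of the original question.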
You can use a custom UDF to implement the intersect logic, as shown below.
Data Preparation
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import pandas as pd

spark = SparkSession.builder.getOrCreate()
data = {"id": [1, 2],
        "colA": ["abc d", "abcde"],
        "colB": ["bc, z", "hj f"]}
mypd = pd.DataFrame(data)
sparkDF = spark.createDataFrame(mypd)
sparkDF.show()
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
| 1|abc d|bc, z|
| 2|abcde| hj f|
+---+-----+-----+
UDF
def str_intersect(x, y):
    # Intersect the two strings character by character (spaces included).
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None

str_intersect_udf = F.udf(lambda x, y: str_intersect(x, y), StringType())
sparkDF.withColumn('intersect', str_intersect_udf(F.col('colA'), F.col('colB'))).show()
+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
| 1|abc d|bc, z| bc |
| 2|abcde| hj f| null|
+---+-----+-----+---------+
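Note that set(x) & set(y) compares individual characters (spaces included), and the order in which a set is joined is not guaranteed. Here is a small plain-Python sketch of the same idea that sorts the shared characters to get a stable result; char_intersect is a hypothetical name used only for illustration.

```python
def char_intersect(x, y):
    """Return the characters common to both strings, sorted, or None if there are none."""
    common = set(x) & set(y)
    return "".join(sorted(common)) if common else None

print(repr(char_intersect("abc d", "bc, z")))  # shared characters: space, 'b' and 'c'
print(char_intersect("abcde", "hj f"))         # no shared characters
print(char_intersect("abc", "cxa"))            # shares 'a' and 'c', yet "ac" is not a substring of either
```

The last call shows the limitation: a character-level intersection can report a match even when the two strings share no multi-character substring.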

Extract columns from a list of lists in pyspark

I am stuck trying to extract columns from a list of lists but can't visualize how to do it. I am fairly new to spark. Running pyspark on Spark 2.4.3.
I have a json organized like this:
{ "meta" : { ... },
"data" :
[[ "a", 0, null, "{ }"],
[ "b", 0, null, "{ }"],
[ "c", 0, null, "{ }"],
] }
I want to get the 'data' portion into columns, like
+------+------+------+------+
| col1 | col2 | col3 | col4 |
+------+------+------+------+
| a | 0 | None | "{ }"|
| b | 0 | None | "{ }"|
| c | 0 | None | "{ }"|
I have my dataframe read in and printSchema() shows this.
root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- meta: struct (nullable = true)
 |    |-- view: struct (nullable = true)
 |    |    |-- approvals: array (nullable = true) ...
My rough shape is 70 columns by 650k rows.
I was able to explode the df to get just the data portion but am stuck there.
Explode the rows first, and then select the array elements using [] in Python.
import pyspark.sql.functions as F

df2 = df.select(F.explode('data').alias('data')) \
    .select(*[F.col('data')[i].alias('col%s' % (i + 1)) for i in range(4)])
df2.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a| 0|null| { }|
| b| 0|null| { }|
| c| 0|null| { }|
+----+----+----+----+
Why not just use the SparkSession.createDataFrame() method?
https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
You can pass the data and schema parameters to this method to get a Spark DataFrame.
Example:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.getOrCreate()
df = sparkSession.createDataFrame(data)
If Spark cannot infer the schema from the data, the schema must also be provided:
from pyspark.sql.types import StructType
struct = StructType()
struct.add("col1", "string", True)
struct.add("col2", "integer", True)
struct.add("col3", "string", True)
struct.add("col4", "string", True)
df = sparkSession.createDataFrame(data=data, schema=struct)
In addition, you can use PySpark type classes instead of type names given as strings.
https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#module-pyspark.sql.types
This module contains both simple types (StringType, IntegerType, ...) and complex types (ArrayType, MapType, ...).
Last note: data cannot contain null; it should be None in Python. Spark's DataFrame.show() prints None values as null.
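If the rows come from a JSON document, Python's standard json module already does the null-to-None conversion. The sketch below uses a made-up sample string shaped like the JSON in the question (raw and columns are illustrative names) and shows the list of lists you would then hand to createDataFrame.

```python
import json

# Hypothetical sample shaped like the question's JSON.
raw = '{"meta": {}, "data": [["a", 0, null, "{ }"], ["b", 0, null, "{ }"], ["c", 0, null, "{ }"]]}'

data = json.loads(raw)["data"]  # JSON null becomes Python None
columns = ["col%d" % (i + 1) for i in range(len(data[0]))]
rows = [dict(zip(columns, row)) for row in data]
print(rows[0])
```

Here data is a plain list of lists and columns holds the generated names col1 to col4.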

Convert Pyspark Dataframe column from array to new columns

I've a Pyspark Dataframe with this structure:
root
 |-- Id: string (nullable = true)
 |-- Q: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- pr: string (nullable = true)
 |    |    |-- qt: double (nullable = true)
Something similar to:
+----+--------------------- ... --+
| Id | Q |
+----+---------------------- ... -+
| 001| [ [pr1,1.9], [pr3,2.0]...] |
| 002| [ [pr2,1.0], [pr9,3.9]...] |
| 003| [ [pr2,9.0], ... ] |
...
I would like to convert the Q array into columns (column name from pr, value from qt).
I would also like to avoid duplicated columns by merging (adding) the values of columns with the same name.
+----+-----+-----+------+ ... ----+
| Id | pr1 | pr2 | pr3 | ... prn |
+----+-----+-----+------+ ... ----+
| 001| 1.9 | 0.0 | 2.0 | ... |
| 002| 0.0 | 1.0 | 0 | ... |
| 003| 0.0 | 9.0 | ... | ... |
...
How can I perform this transformation?
Thank you in advance!
Julián.
You can do this with a combination of explode and pivot:
import pyspark.sql.functions as F
# explode to get "long" format
df = df.withColumn('exploded', F.explode('Q'))
# get the name and the value in separate columns (Q holds structs, so access the fields by name)
df = df.withColumn('name', F.col('exploded')['pr'])
df = df.withColumn('value', F.col('exploded')['qt'])
# now pivot; F.max keeps the largest value per name, use F.sum instead to add duplicates together
df.groupby('Id').pivot('name').agg(F.max('value')).na.fill(0)
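To see what explode followed by pivot does to the data, here is a plain-Python sketch of the same reshaping on a few made-up records. The sample values, including the repeated pr2 entry, are assumptions for illustration; duplicates are summed, as the question asks.

```python
from collections import defaultdict

# Hypothetical sample: (Id, [(pr, qt), ...]); pr2 appears twice for Id 002.
records = [
    ("001", [("pr1", 1.9), ("pr3", 2.0)]),
    ("002", [("pr2", 1.0), ("pr9", 3.9), ("pr2", 0.5)]),
]

# "Explode": one (id, name, value) triple per array element.
long_rows = [(rid, pr, qt) for rid, q in records for pr, qt in q]

# "Pivot": sum the values per (id, name); names missing for an id default to 0.0.
names = sorted({pr for _, pr, _ in long_rows})
totals = defaultdict(float)
for rid, pr, qt in long_rows:
    totals[(rid, pr)] += qt
wide = {rid: {pr: totals.get((rid, pr), 0.0) for pr in names} for rid, _ in records}

print(names)
print(wide["002"])
```

Each entry of wide corresponds to one output row, with one key per distinct pr name.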
Very interesting question. This is how I approached it.
test.csv
001,pr1:0.9,pr3:1.2,pr2:2.0
002,pr3:5.2,pr4:0.99
PySpark
from pyspark.sql.functions import expr

file = sc.textFile("file:///test.csv")
# Get (key, value) pairs: [(u'001', u'pr1:0.9'), ...]
rdd1 = file.map(lambda r: r.split(",", 1)).map(lambda r: (r[0], r[1])).flatMapValues(lambda r: r.split(','))
# Create a DataFrame with 3 columns: [(u'001', u'pr1', u'0.9'), ...]
rdd2 = rdd1.map(lambda r: (r[0], r[1].split(":"))).map(lambda r: (r[0], r[1][0], r[1][1]))
df = rdd2.toDF()
df.show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|001|pr1| 0.9|
|001|pr3| 1.2|
|001|pr2| 2.0|
|002|pr3| 5.2|
|002|pr4|0.99|
+---+---+----+
# Pivot: one column per name, filling missing values with 0
df.groupBy("_1").pivot("_2").agg(expr("coalesce(first(_3),0)")).show()
+---+---+---+---+----+
| _1|pr1|pr2|pr3| pr4|
+---+---+---+---+----+
|001|0.9|2.0|1.2| 0|
|002| 0| 0|5.2|0.99|
+---+---+---+---+----+
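The line-parsing step can be checked without Spark. This is a minimal plain-Python sketch of how each CSV line becomes (id, name, value) triples before the pivot; parse_line is a hypothetical helper name.

```python
lines = ["001,pr1:0.9,pr3:1.2,pr2:2.0", "002,pr3:5.2,pr4:0.99"]

def parse_line(line):
    """Turn 'id,name:value,...' into a list of (id, name, value) string triples."""
    key, rest = line.split(",", 1)  # split off the id only
    return [(key, *pair.split(":", 1)) for pair in rest.split(",")]

triples = [t for line in lines for t in parse_line(line)]
for t in triples:
    print(t)
```

Splitting on the first comma only keeps the id separate from the remaining name:value pairs.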