pyspark dataframe column creation - dataframe

I'm a beginner in PySpark. I have a problem where I have a vector / list of values:
col = ["True", "False", "True"]
I want to create a column in a dataframe (which has 3 rows) from this vector / list of values. E.g., in pandas we can do df['col_name'] = col.
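For reference, a minimal sketch of the pandas behaviour I mean (assuming a three-row dataframe):
import pandas as pd

# Plain positional assignment in pandas: the list simply becomes a new column.
pdf = pd.DataFrame({'column': ['a', 'b', 'c']})
pdf['col_name'] = col
print(pdf)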

Unfortunately Spark doesn't have a function that works the way the pandas assignment does, but you can still achieve it with a join. Assuming you have a sorted dataframe and a list of new values that you want as a new column:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('a', ), ('b', ), ('c', )], ['column'])
df.show(3, False)
+------+
|column|
+------+
|a     |
|b     |
|c     |
+------+
You can add a row number to both dataframes and then join on it:
df = df.withColumn('row_number', func.row_number().over(Window.orderBy(func.lit(''))))
df.show(3, False)
+------+----------+
|column|row_number|
+------+----------+
|a     |1         |
|b     |2         |
|c     |3         |
+------+----------+
new_add_column = spark.createDataFrame([(True, ), (False, ), (True, )], ['new_create_column'])\
.withColumn('row_number', func.row_number().over(Window.orderBy(func.lit(''))))
new_add_column.show(3, False)
+-----------------+----------+
|new_create_column|row_number|
+-----------------+----------+
|true             |1         |
|false            |2         |
|true             |3         |
+-----------------+----------+
output = df.join(new_add_column, on='row_number', how='inner')
output.show(3, False)
+----------+------+-----------------+
|row_number|column|new_create_column|
+----------+------+-----------------+
|1         |a     |true             |
|2         |b     |false            |
|3         |c     |true             |
+----------+------+-----------------+
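Note that row_number over Window.orderBy(func.lit('')) moves every row into a single partition just to number it. As a rough alternative sketch (assuming the current row order is the one you want to keep, and reusing the small example above), you can attach the index with zipWithIndex on the underlying RDD instead:
# Hedged sketch, not from the original answer: 1-based index via the RDD, then a join on it.
base = spark.createDataFrame([('a', ), ('b', ), ('c', )], ['column'])
indexed = base.rdd.zipWithIndex().map(lambda r: (*r[0], r[1] + 1)).toDF(base.columns + ['row_number'])
new_vals = spark.createDataFrame([(i + 1, v) for i, v in enumerate([True, False, True])], ['row_number', 'new_create_column'])
indexed.join(new_vals, on='row_number', how='inner').show(3, False)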

Related

How to explode complex string with a mix of keys and arrays using pyspark or pandas

I am trying to transform a dataframe with a column having the following string with structure:
{
Key = [Value 1, Value 2, Value 3, Value 4, Value 1, Value 2, Value 3, Value 4 ...],
Key = [Value 1, Value 2, Value 3, Value 4, Value 1, Value 2, Value 3, Value 4 ...],
}
That is represented with values as below:
{
100=[800,0,100,0,2.168675, 800,0,100,0,.4954798],
400=[800,0,400,0,3.987227, 800,0,400,0,.7282956],
4000=[3200,0,4000,0,3.112903, 3200,0,4000,0,1.850587]
}
How to transform the above string into the below dataframe with rows exploded?
Here is the data in the data frame with the schema:
After applying the suggested solution, unfortunately the regex expressions starting with r'... gave an error saying "R Literals are not supported".
Thus, I replaced them accordingly to make it work, using only the following code.
The only issue left is to remove the [ character from the value1 column.
If you have suggestions, I'd appreciate it.
I will select the suggested solution as accepted...
Many thanks...
import pyspark.sql.functions as F
from pyspark.sql.functions import *
data_source = "/curated/SensorMEDCurated"
df = read_delta_from_datalake (global_definitions["curated_dl_zone"], data_source)
print("Original Data:")
df.select("rms").show(1, truncate = False)
extract_all_pattern = "\'(" + "\\\\d+=\\\\[[^\\]]+\\]" + ")\'"  ## \d+=\[[^\]]+\] ==> any integer = any values in a bracket
df = df.withColumn("rms", F.expr(f"regexp_extract_all(rms, {extract_all_pattern}, 1)")) \
.withColumn("rms", F.explode("rms"))
df =df.withColumn("outer_value", split(df["rms"], '=').getItem(0))
df =df.withColumn("values", split(df["rms"], '=').getItem(1))
df= df.withColumn("values", F.split("values", "[\\\\s*,\\\\s*]"))
df = df.select(["outer_value"] + [F.element_at("values", i).alias(f"value{i}") for i in range(1,16)])
df.show()
Results:
Text output:
{100=[800,0,100,0,2.168675, 800,0,100,0,.4954798, 160,0,20,0,.4119049, 48,20,26,0,.1014838, 96,26,38,0,.1790891, 496,38,100,0,.1671498], 400=[800,0,400,0,3.987227, 800,0,400,0,.7282956, 210,0,105,0,.5492065, 590,105,400,0,.4716012], 4000=[3200,0,4000,0,3.112903, 3200,0,4000,0,1.850587, 82,0,102,0,.5790547, 17,102,123,0,.1790891, 408,123,633,0,.6745689, 62,633,710,0,.1910284, 405,710,1216,0,.4178745, 202,1216,1468,0,.2387854, 330,1468,1880,0,.2507247, 758,1880,2828,0,1.361077, 936,2828,3998,0,.6089029]}
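(For the leftover [ in value1, a minimal sketch, assuming the value1 name produced by the select above: strip it with regexp_replace.)
# Hypothetical clean-up: drop the leading '[' left over from the split above.
df = df.withColumn("value1", F.regexp_replace("value1", r"^\[", ""))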
UPDATE
Updated the solution to work with newly provided input text:
df = spark.createDataFrame(
data=[["{100=[ 800,0,100,0,2.168675, 800,0 ,100,0,.4954798, 160 ,0,20,0,.4119049, 48,20,26,0,.1014838, 96,26,38,0,.1790891, 496,38,100,0,.1671498 ], 400=[800,0,400,0,3.987227, 800,0,400,0,.7282956, 210,0,105,0,.5492065, 590,105,400,0,.4716012 ], 4000=[3200,0,4000,0,3.112903, 3200,0,4000,0,1.850587, 82,0,102,0,.5790547, 17,102,123,0,.1790891, 408,123,633,0,.6745689, 62,633,710,0,.1910284, 405,710,1216,0,.4178745, 202,1216,1468,0,.2387854, 330,1468,1880,0,.2507247, 758,1880,2828,0,1.361077, 936,2828,3998,0,.6089029]}"]],
schema=["column_str"]
)
import pyspark.sql.functions as F
df = df.withColumn("column_str", F.regexp_replace("column_str", r"\s", "")) \
.withColumn("column_str", F.expr("regexp_extract_all(column_str, r'(\d+=\[[^\]]+\])', 1)")) \
.withColumn("column_str", F.explode("column_str")) \
.withColumn("outer_value", F.regexp_extract("column_str", r"(\d+)=\[[^\]]+\]", 1)) \
.withColumn("values", F.regexp_extract("column_str", r"\d+=\[([^\]]+)\]", 1)) \
.withColumn("values", F.split("values", r"\s*,\s*")) \
.withColumn("values", F.transform("values", lambda x, i: F.create_map(F.concat(F.lit("value"), F.lpad(i + 1, 2, "0")), x))) \
.withColumn("ID", F.monotonically_increasing_id()) \
.withColumn("values", F.explode("values")) \
.select("ID", "outer_value", F.explode("values")) \
.groupBy("ID", "outer_value") \
.pivot("key") \
.agg(F.first("value")) \
.drop("ID")
df.show(truncate=False)
+-----------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+
|outer_value|value01|value02|value03|value04|value05 |value06|value07|value08|value09|value10 |value11|value12|value13|value14|value15 |value16|value17|value18|value19|value20 |value21|value22|value23|value24|value25 |value26|value27|value28|value29|value30 |value31|value32|value33|value34|value35 |value36|value37|value38|value39|value40 |value41|value42|value43|value44|value45 |value46|value47|value48|value49|value50 |value51|value52|value53|value54|value55 |
+-----------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+
|100 |800 |0 |100 |0 |2.168675|800 |0 |100 |0 |.4954798|160 |0 |20 |0 |.4119049|48 |20 |26 |0 |.1014838|96 |26 |38 |0 |.1790891|496 |38 |100 |0 |.1671498|null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |
|400 |800 |0 |400 |0 |3.987227|800 |0 |400 |0 |.7282956|210 |0 |105 |0 |.5492065|590 |105 |400 |0 |.4716012|null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |
|4000 |3200 |0 |4000 |0 |3.112903|3200 |0 |4000 |0 |1.850587|82 |0 |102 |0 |.5790547|17 |102 |123 |0 |.1790891|408 |123 |633 |0 |.6745689|62 |633 |710 |0 |.1910284|405 |710 |1216 |0 |.4178745|202 |1216 |1468 |0 |.2387854|330 |1468 |1880 |0 |.2507247|758 |1880 |2828 |0 |1.361077|936 |2828 |3998 |0 |.6089029|
+-----------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+-------+-------+--------+
The logic is:
Use regex to split string in form "outer_value=[value1, value2, ...]".
Use regex to extract "outer_value".
Use regex to extract array "value1, value2, ...".
Split values by ,.
Select required value columns by index.
df = spark.createDataFrame(
data=[[
"""{
100=[800,0,100,0,2.168675, 800,0,100,0,.4954798],
400=[800,0,400,0,3.987227, 800,0,400,0,.7282956],
4000=[3200,0,4000,0,3.112903, 3200,0,4000,0,1.850587]
}"""
]],
schema=["column_str"]
)
import pyspark.sql.functions as F
df = df.withColumn("column_str", F.expr("regexp_extract_all(column_str, r'(\d+=\[[^\]]+\])', 1)")) \
.withColumn("column_str", F.explode("column_str")) \
.withColumn("outer_value", F.regexp_extract("column_str", r"(\d+)=\[[^\]]+\]", 1)) \
.withColumn("values", F.regexp_extract("column_str", r"\d+=\[([^\]]+)\]", 1)) \
.withColumn("values", F.split("values", r"\s*,\s*")) \
.select(["outer_value"] + [F.element_at("values", i).alias(f"value{i}") for i in range(1,11)])
df.show()
+-----------+------+------+------+------+--------+------+------+------+------+--------+
|outer_value|value1|value2|value3|value4| value5|value6|value7|value8|value9| value10|
+-----------+------+------+------+------+--------+------+------+------+------+--------+
| 100| 800| 0| 100| 0|2.168675| 800| 0| 100| 0|.4954798|
| 400| 800| 0| 400| 0|3.987227| 800| 0| 400| 0|.7282956|
| 4000| 3200| 0| 4000| 0|3.112903| 3200| 0| 4000| 0|1.850587|
+-----------+------+------+------+------+--------+------+------+------+------+--------+
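If the SQL parser rejects the r'...' raw-string literal (the "R Literals are not supported" error mentioned above), the same first step can be written with escaped backslashes instead, for example:
# Hedged variant: double the backslashes so the SQL string literal still yields the regex (\d+=\[[^\]]+\])
df = df.withColumn("column_str", F.expr("regexp_extract_all(column_str, '(\\\\d+=\\\\[[^\\\\]]+\\\\])', 1)"))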

Creating MAPTYPE field from multiple columns - Spark SQL

I have a use case wherein multiple keys are distributed across the dataset in a JSON format, which needs to be aggregated into a consolidated resultset for further processing.
I have been able to develop a code structure that achieves it using both the Python API (PySpark) and Spark SQL, but the latter involves a more convoluted and tedious way of doing it, with intermediate conversions that can lead to errors in the future.
Using the below snippets, is there a better way to achieve this using Spark SQL, by creating a MAP<STRING,ARRAY<STRING>> using key and value?
Data Preparation
from pyspark.sql.types import *
import pyspark.sql.functions as F
import pandas as pd
from io import StringIO
s = StringIO("""
id|json_struct
1|{"a":["tyeqb","",""],"e":["qwrqc","",""]}
1|{"t":["sartq","",""],"r":["fsafsq","",""]}
1|{"b":["puhqiqh","",""],"e":["hjfsaj","",""]}
2|{"b":["basajhjwa","",""],"e":["asfafas","",""]}
2|{"n":["gaswq","",""],"r":["sar","",""],"l":["sar","",""],"s":["rqqrq","",""],"m":["wrqwrq","",""]}
2|{"s":["tqqwjh","",""],"t":["afs","",""],"l":["fsaafs","",""]}
""")
df = pd.read_csv(s,delimiter='|')
sparkDF = spark.createDataFrame(df)
sparkDF.registerTempTable("INPUT")
sparkDF = sparkDF.withColumn('json_struct', F.from_json(F.col('json_struct')
,schema=MapType(StringType(),ArrayType(StringType()),True)
))
sparkDF.show(truncate=False)
+---+---------------------------------------------------------------------------------------+
|id |json_struct |
+---+---------------------------------------------------------------------------------------+
|1 |{a -> [tyeqb, , ], e -> [qwrqc, , ]} |
|1 |{t -> [sartq, , ], r -> [fsafsq, , ]} |
|1 |{b -> [puhqiqh, , ], e -> [hjfsaj, , ]} |
|2 |{b -> [basajhjwa, , ], e -> [asfafas, , ]} |
|2 |{n -> [gaswq, , ], r -> [sar, , ], l -> [sar, , ], s -> [rqqrq, , ], m -> [wrqwrq, , ]}|
|2 |{s -> [tqqwjh, , ], t -> [afs, , ], l -> [fsaafs, , ]} |
+---+---------------------------------------------------------------------------------------+
Python API (PySpark) - Implementation
As you can see, the key resulting from explode is natively a STRING type, and since PySpark has create_map (which is not available within Spark SQL), it can readily be used to generate the final json_struct column, ensuring a single key with a variable-length ARRAY<STRING> value.
sparkDF.select(
F.col('id')
,F.explode(F.col('json_struct'))
).withColumn('value',F.filter(F.col('value'), lambda x: x != '')\
).withColumn('value',F.concat_ws(',', F.col('value'))\
).groupBy('id', 'key'
).agg(F.collect_set(F.col('value')).alias('value')\
).withColumn('json_struct',F.to_json(F.create_map("key","value"))
).orderBy('id'
).show(truncate=False)
+---+---+---------------+------------------------+
|id |key|value |json_struct |
+---+---+---------------+------------------------+
|1 |a |[tyeqb] |{"a":["tyeqb"]} |
|1 |e |[hjfsaj, qwrqc]|{"e":["hjfsaj","qwrqc"]}|
|1 |r |[fsafsq] |{"r":["fsafsq"]} |
|1 |b |[puhqiqh] |{"b":["puhqiqh"]} |
|1 |t |[sartq] |{"t":["sartq"]} |
|2 |b |[basajhjwa] |{"b":["basajhjwa"]} |
|2 |n |[gaswq] |{"n":["gaswq"]} |
|2 |t |[afs] |{"t":["afs"]} |
|2 |s |[tqqwjh, rqqrq]|{"s":["tqqwjh","rqqrq"]}|
|2 |e |[asfafas] |{"e":["asfafas"]} |
|2 |l |[sar, fsaafs] |{"l":["sar","fsaafs"]} |
|2 |r |[sar] |{"r":["sar"]} |
|2 |m |[wrqwrq] |{"m":["wrqwrq"]} |
+---+---+---------------+------------------------+
Spark SQL - Implementation
Within this implementation, I have to take additional steps to ensure that both the key and value columns are arrays of consistent lengths, since map_from_arrays takes arrays as inputs.
Is there a way to bypass these and create a similar schema as depicted using Python API?
sql.sql("""
SELECT
id,
KEY,
VALUE,
TO_JSON(MAP_FROM_ARRAYS(KEY,VALUE)) as json_struct
FROM (
SELECT
id,
key,
ARRAY(COLLECT_SET( value )) as value -- <------- ### Ensuring Value is NESTED ARRAY
FROM (
SELECT
id,
SPLIT(k,'|',1) as key, -- <------- ### Ensuring Key is Array
CONCAT_WS(',',FILTER(v,x -> x != '')) as value
FROM (
SELECT
id,
EXPLODE(FROM_JSON(json_struct,'MAP<STRING,ARRAY<STRING>>')) as (k,v)
FROM INPUT
)
)
GROUP BY 1,2
)
ORDER BY 1
""").show(truncate=False)
+---+---+-----------------+------------------------+
|id |KEY|VALUE |json_struct |
+---+---+-----------------+------------------------+
|1 |[a]|[[tyeqb]] |{"a":["tyeqb"]} |
|1 |[e]|[[hjfsaj, qwrqc]]|{"e":["hjfsaj","qwrqc"]}|
|1 |[b]|[[puhqiqh]] |{"b":["puhqiqh"]} |
|1 |[r]|[[fsafsq]] |{"r":["fsafsq"]} |
|1 |[t]|[[sartq]] |{"t":["sartq"]} |
|2 |[n]|[[gaswq]] |{"n":["gaswq"]} |
|2 |[b]|[[basajhjwa]] |{"b":["basajhjwa"]} |
|2 |[t]|[[afs]] |{"t":["afs"]} |
|2 |[s]|[[tqqwjh, rqqrq]]|{"s":["tqqwjh","rqqrq"]}|
|2 |[e]|[[asfafas]] |{"e":["asfafas"]} |
|2 |[l]|[[sar, fsaafs]] |{"l":["sar","fsaafs"]} |
|2 |[r]|[[sar]] |{"r":["sar"]} |
|2 |[m]|[[wrqwrq]] |{"m":["wrqwrq"]} |
+---+---+-----------------+------------------------+
Spark SQL has map instead of create_map. Your PySpark code could be translated into this:
df = spark.sql("""
WITH
TBL2 AS (SELECT id, EXPLODE(FROM_JSON(json_struct, 'MAP<STRING,ARRAY<STRING>>')) FROM INPUT),
TBL3 AS (SELECT id, key, FLATTEN(COLLECT_SET(FILTER(value, x -> x != ''))) value
FROM TBL2
GROUP BY id, key)
SELECT *, TO_JSON(MAP(key, value)) json_struct
FROM TBL3
""")
df.show(truncate=0)
# +---+---+---------------+------------------------+
# |id |key|value |json_struct |
# +---+---+---------------+------------------------+
# |1 |a |[tyeqb] |{"a":["tyeqb"]} |
# |1 |e |[qwrqc, hjfsaj]|{"e":["qwrqc","hjfsaj"]}|
# |1 |b |[puhqiqh] |{"b":["puhqiqh"]} |
# |1 |r |[fsafsq] |{"r":["fsafsq"]} |
# |1 |t |[sartq] |{"t":["sartq"]} |
# |2 |b |[basajhjwa] |{"b":["basajhjwa"]} |
# |2 |n |[gaswq] |{"n":["gaswq"]} |
# |2 |s |[rqqrq, tqqwjh]|{"s":["rqqrq","tqqwjh"]}|
# |2 |t |[afs] |{"t":["afs"]} |
# |2 |e |[asfafas] |{"e":["asfafas"]} |
# |2 |l |[fsaafs, sar] |{"l":["fsaafs","sar"]} |
# |2 |r |[sar] |{"r":["sar"]} |
# |2 |m |[wrqwrq] |{"m":["wrqwrq"]} |
# +---+---+---------------+------------------------+
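As a quick standalone sanity check of the MAP + TO_JSON combination (a sketch independent of the data above):
spark.sql("SELECT TO_JSON(MAP('a', ARRAY('x', 'y'))) AS js").show(truncate=False)
# +---------------+
# |js             |
# +---------------+
# |{"a":["x","y"]}|
# +---------------+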

spark join with column multiple values in list

I have
Dataset A: uuid, listOfLocationsIds, name
Dataset B: locationId, latitude, longitude
A.listOfLocationIds can have multiple locationIds
How can I do a join on A and B with each value in listOfLocationsIds?
So if there are two values in listOfLocationIds, I would want the join to consider each locationId in the listOfLocationIds
A.join(B, A.listOfLocationsIds[0] == B.locationId, "left")
A.join(B, A.listOfLocationsIds[1] == B.locationId, "left")
Assume dataset A is called df with this content:
+----+-----------------+-----+
|uuid|listOfLocationsId|name |
+----+-----------------+-----+
|1 |[1, 2, 3] |name1|
|2 |[1, 3] |name1|
+----+-----------------+-----+
and dataset B is called df2 with this content:
+----------+--------+---------+
|locationId|latitude|longitude|
+----------+--------+---------+
|2 |5 |7 |
+----------+--------+---------+
And we do an array_contains join:
from pyspark.sql.functions import array_contains, col

df = df.join(df2,
    array_contains(col("listOfLocationsId"), col("locationId")), "left"
)
The final result:
+----+-----------------+-----+----------+--------+---------+
|uuid|listOfLocationsId|name |locationId|latitude|longitude|
+----+-----------------+-----+----------+--------+---------+
|1 |[1, 2, 3] |name1|2 |5 |7 |
|2 |[1, 3] |name1|null |null |null |
+----+-----------------+-----+----------+--------+---------+
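If you instead want one output row per location id (rather than one row per row of A), a rough alternative sketch, using df and df2 as defined above, is to explode the array first and then do a plain equi-join:
from pyspark.sql import functions as F

# Sketch: one row per (uuid, locationId) pair, then join on locationId.
exploded = df.withColumn("locationId", F.explode("listOfLocationsId"))
exploded.join(df2, on="locationId", how="left").show()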
Good luck!

How to add more rows in pyspark df by column value

I've been stuck on this problem for quite a while and am probably making it bigger than it really is. I will try to simplify it.
I'm using pyspark and data frame functions along my code.
I already have a df as:
+--+-----+---------+
|id|col1 |col2 |
+--+-----+---------+
|1 |Hello|Repeat |
|2 |Word |Repeat |
|3 |Aux |No repeat|
|4 |Test |Repeat |
+--+-----+---------+
What I want to achieve is to repeat the df's rows when col2 is 'Repeat', adding a new column col3 that appends an increasing counter (1 to 3) to col1's value:
+--+-----+---------+------+
|id|col1 |col2 |col3 |
+--+-----+---------+------+
|1 |Hello|Repeat |Hello1|
|1 |Hello|Repeat |Hello2|
|1 |Hello|Repeat |Hello3|
|2 |Word |Repeat |Word1 |
|2 |Word |Repeat |Word2 |
|2 |Word |Repeat |Word3 |
|3 |Aux |No repeat|Aux |
|4 |Test |Repeat |Test1 |
|4 |Test |Repeat |Test2 |
|4 |Test |Repeat |Test3 |
+--+-----+---------+------+
My first approach was to use the withColumn operator to create a new column with the help of a udf:
my_func = udf(lambda words: (words + str(i + 1 for i in range(3))), StringType())
df = df\
.withColumn('col3', when(col('col2') == 'No Repeat', col('col1'))
.otherwise(my_func(col('col1'))))
But when I evaluate this with df.show(10, False) it throws an error. My guess is that I just can't create more rows with the withColumn function in that way.
So I decided to go for another approach, also without success, using rdd.flatMap:
test = df.rdd.flatMap(lambda row: (row if (row.col2== 'No Repeat') else (row.col1 + str(i+1) for i in range(3))))
print(test.collect())
But here I'm losing the df schema, and I cannot emit the full row in the else condition; it only gives me the col1 words plus the iterator.
Do you know any proper way to solve this?
In the end, my problem is that I don't know a proper way to create more rows based on column values, because I'm quite new to this world, and the answers I found don't seem to fit this problem.
All help will be appreciated.
One way is to use a condition to assign an array, then explode:
import pyspark.sql.functions as F
(df.withColumn("test",F.when(df['col2']=='Repeat',
F.array([F.lit(str(i)) for i in range(1,4)])).otherwise(F.array(F.lit(''))))
.withColumn("col3",F.explode(F.col("test"))).drop("test")
.withColumn("col3",F.concat(F.col("col1"),F.col("col3")))).show()
A neater version of the same, as suggested by @MohammadMurtazaHashmi, would look like:
(df.withColumn("test",F.when(df['col2']=='Repeat',
F.array([F.concat(F.col("col1"),F.lit(str(i))) for i in range(1,4)]))
.otherwise(F.array(F.col("col1"))))
.select("id","col1","col2", F.explode("test"))).show()
+---+-----+---------+------+
| id| col1| col2| col3|
+---+-----+---------+------+
| 1|Hello| Repeat|Hello1|
| 1|Hello| Repeat|Hello2|
| 1|Hello| Repeat|Hello3|
| 2| Word| Repeat| Word1|
| 2| Word| Repeat| Word2|
| 2| Word| Repeat| Word3|
| 3| Aux|No repeat| Aux|
| 4| Test| Repeat| Test1|
| 4| Test| Repeat| Test2|
| 4| Test| Repeat| Test3|
+---+-----+---------+------+
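A further variant (hedged; it assumes Spark 2.4+ for sequence and transform) builds the suffixed values with SQL higher-order functions instead of a Python list comprehension:
import pyspark.sql.functions as F

# Sketch: ['Hello1','Hello2','Hello3'] for 'Repeat' rows, else [col1], then explode into col3.
result = (df.withColumn("col3", F.when(df['col2'] == 'Repeat',
          F.expr("transform(sequence(1, 3), i -> concat(col1, cast(i as string)))"))
          .otherwise(F.array(F.col("col1"))))
          .withColumn("col3", F.explode("col3")))
result.show()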

How to compare two identically structured dataframes to calculate the row differences

I have the following two identically structured dataframes with id in common.
val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000))
.toDF("id","name","city","credit_score","credit_limit")
scala> originalDF.show(false)
+---+------+---------+------------+------------+
|id |name |city |credit_score|credit_limit|
+---+------+---------+------------+------------+
|1 |gaurav|jaipur |550 |70000 |
|2 |sunil |noida |600 |80000 |
|3 |rishi |ahmedabad|510 |65000 |
+---+------+---------+------------+------------+
val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000))
.toDF("id","name","city","credit_score","credit_limit")
scala> changedDF.show(false)
+---+------+------+------------+------------+
|id |name |city |credit_score|credit_limit|
+---+------+------+------------+------------+
|1 |gaurav|jaipur|550 |70000 |
|2 |sunil |noida |650 |90000 |
|4 |Joshua|cochin|612 |85000 |
+---+------+------+------------+------------+
Hence I wrote a udf to calculate the change in column values.
val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
val somedf=changedDF.alias("a").join(originalDF.alias("b"), col("a.id") === col("b.id")).withColumn("diffcolumn", split(concat_ws(",",changedDF.columns.map(x => diff(lit(x), changedDF(x), originalDF(x))):_*),","))
scala> somedf.show(false)
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
|id |name |city |credit_score|credit_limit|id |name |city |credit_score|credit_limit|diffcolumn |
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
|1 |gaurav|jaipur|550 |70000 |1 |gaurav|jaipur|550 |70000 |[, , , , ] |
|2 |sunil |noida |650 |90000 |2 |sunil |noida |600 |80000 |[, , , credit_score, credit_limit]|
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
But I'm not able to get id and diffcolumn separately. If I do a
somedf.select('id) it gives me an ambiguity error because there are two id columns in the joined table.
I want to get, in an array, the names of the columns whose values have changed, together with the corresponding id. For example, in changedDF the credit_score and credit_limit of id=2 (name=sunil) have changed.
Hence I wanted the resultant dataframe to give me a result like:
+---+----------------------------------+
|id |diffcolumn                        |
+---+----------------------------------+
|2  |[, , , credit_score, credit_limit]|
+---+----------------------------------+
Can anyone suggest an approach to get the id and the changed columns separately in a dataframe?
For your reference, these kinds of diffs can easily be done with the spark-extension package.
It provides the diff transformation that builds that complex query for you:
import uk.co.gresearch.spark.diff._
val options = DiffOptions.default.withChangeColumn("changes") // needed to get the optional 'changes' column
val diff = originalDF.diff(changedDF, options, "id")
diff.show(false)
+----+----------------------------+---+---------+----------+---------+----------+-----------------+------------------+-----------------+------------------+
|diff|changes |id |left_name|right_name|left_city|right_city|left_credit_score|right_credit_score|left_credit_limit|right_credit_limit|
+----+----------------------------+---+---------+----------+---------+----------+-----------------+------------------+-----------------+------------------+
|N |[] |1 |gaurav |gaurav |jaipur |jaipur |550 |550 |70000 |70000 |
|I |null |4 |null |Joshua |null |cochin |null |612 |null |85000 |
|C |[credit_score, credit_limit]|2 |sunil |sunil |noida |noida |600 |650 |80000 |90000 |
|D |null |3 |rishi |null |ahmedabad|null |510 |null |65000 |null |
+----+----------------------------+---+---------+----------+---------+----------+-----------------+------------------+-----------------+------------------+
diff.select($"id", $"diff", $"changes").show(false)
+---+----+----------------------------+
|id |diff|changes |
+---+----+----------------------------+
|1 |N |[] |
|4 |I |null |
|2 |C |[credit_score, credit_limit]|
|3 |D |null |
+---+----+----------------------------+
While this is a simple example, diffing DataFrames can become complicated when wide schemas and null values are involved.
That package is well-tested, so you don't have to worry about getting that query right yourself.
Try this:
val aliasedChangedDF = changedDF.as("a")
val aliasedOriginalDF = originalDF.as("b")
val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
val somedf=aliasedChangedDF.join(aliasedOriginalDF, col("a.id") === col("b.id")).withColumn("diffcolumn", split(concat_ws(",",changedDF.columns.map(x => diff(lit(x), changedDF(x), originalDF(x))):_*),","))
somedf.select(col("a.id").as("id"),col("diffcolumn"))
Just change your join condition from col("a.id") === col("b.id") to "id"
Then, there will be only a single id column.
Further, you don't need the alias("a") and alias("b"). So your join simplifies from
changedDF.alias("a").join(originalDF.alias("b"), col("a.id") === col("b.id"))
to
changedDF.join(originalDF, "id")