How to transform a column with an array of json to separate columns based on the json keys in Spark?

I have a column like this:
column
[{"key":1,"value":"aaaaa"},{"key":2,"value":"bbbbb"},{"key":3,"value":"ccccc"}]
[{"key":1,"value":"abcde"},{"key":2,"value":"bcdef"}]
[{"key":1,"value":"edcba"},{"key":3,"value":"zxcvb"},{"key":4,"value":"qwert"}]
I want to split this column based on the keys, with each key getting its own column.
I've tried something like this but it didn't work:
test_schema = ArrayType(StructType([StructField("key", IntegerType()), StructField("value", StringType())]))
teste = (hits_raw
.withColumn("keys", get_json_object("hitsCustomDimensions", "$[*].key"))
.withColumn("teste", explode(from_json("hitsCustomDimensions", test_schema)))
).display()
The output I want is something like this:
+--------+--------+--------+--------+
|column_1|column_2|column_3|column_4|
+--------+--------+--------+--------+
|   aaaaa|   bbbbb|   ccccc|    null|
|   abcde|   bcdef|    null|    null|
|   edcba|    null|   zxcvb|   qwert|
+--------+--------+--------+--------+

Parse the string into an array of structs, then pivot on the key column. You'll need an ID column to group by for the pivot; here I use the monotonically_increasing_id function to add an id before inlining the array of structs.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Schema of each array element: a struct with a key and a value
test_schema = ArrayType(StructType([
    StructField("key", IntegerType()),
    StructField("value", StringType())
]))

df = spark.createDataFrame([
    ('[{"key":1,"value":"aaaaa"},{"key":2,"value":"bbbbb"},{"key":3,"value":"ccccc"}]',),
    ('[{"key":1,"value":"abcde"},{"key":2,"value":"bcdef"}]',),
    ('[{"key":1,"value":"edcba"},{"key":3,"value":"zxcvb"},{"key":4,"value":"qwert"}]',)
], ["column"])

test = (df.withColumn("column", F.from_json("column", test_schema))
        .withColumn("id", F.monotonically_increasing_id())
        .selectExpr("id", "inline(column)")
        .groupBy("id").pivot("key").agg(F.first("value"))
        .drop("id")
        )
test.show()
#+-----+-----+-----+-----+
#|    1|    2|    3|    4|
#+-----+-----+-----+-----+
#|aaaaa|bbbbb|ccccc| null|
#|abcde|bcdef| null| null|
#|edcba| null|zxcvb|qwert|
#+-----+-----+-----+-----+
You can then rename the columns to add the column_* prefix if you want:
test = test.select(*[F.col(c).alias(f"column_{c}") for c in test.columns])
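For completeness, the renamed DataFrame then matches the desired output above (row order after the pivot is not guaranteed, so treat this as a sketch):
test.show()
#+--------+--------+--------+--------+
#|column_1|column_2|column_3|column_4|
#+--------+--------+--------+--------+
#|   aaaaa|   bbbbb|   ccccc|    null|
#|   abcde|   bcdef|    null|    null|
#|   edcba|    null|   zxcvb|   qwert|
#+--------+--------+--------+--------+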

Related

Compare substrings of different columns

I have a dataframe like this:
+----------------------+--------------------------------------------------+-------------------+
| column_1             | column_2                                          | Required_column   |
+----------------------+--------------------------------------------------+-------------------+
|K12B-45-84-6 |K12B-02-36-504, I05O-21-65-312, A301-21-25-363 | True |
|J020-35-2-9 |P12K-05-31-602, M002-22-22-636,L630-51-32-544 | False |
|L006-85-00-694 |M10P-22-94-349,L006-85-00-694, I553-35-12-240 | True |
|M002-22-36-989 |U985-12-45-363, M002-19-14-964 | True |
+----------------------+--------------------------------------------------+-------------------+
Explanation: column_1 and column_2 are strings; for ease of understanding, let us call the values in the dataframe "switches".
column_1 always has exactly one switch value per row, but column_2 may contain multiple switch values. The result should be True or False, based only on comparing the first 4 characters (e.g. K12B == K12B, see row one).
Note: even though the switch values in column_2 are comma separated, the spacing is not consistent (sometimes there may be one space, two spaces, etc.).
The hint is that every switch value in column_1 and column_2 starts with a letter, so the logic should be based on that.
The aim is to produce the Required_column, which is either True or False. The solution is required in PySpark.
Thanks in advance.
Here is a solution using PySpark's substring and contains functions. This way, you don't have to worry about the cleanliness of column_2; you only need to make sure that column_1 is clean:
import pyspark.sql.functions as F

data = [
    ("K12B-45-84-6", "K12B-02-36-504, I05O-21-65-312, A301-21-25-363"),
    ("J020-35-2-9", "P12K-05-31-602, M002-22-22-636,L630-51-32-544"),
    ("L006-85-00-694", "M10P-22-94-349,L006-85-00-694, I553-35-12-240"),
    ("M002-22-36-989", "U985-12-45-363, M002-19-14-964")]
columns = ["column_1", "column_2"]
df = spark.createDataFrame(data=data, schema=columns)

# True if column_2 contains the first 4 characters of column_1 anywhere in the string
df = df.withColumn("Required_column", F.when(
    F.col("column_2").contains(F.substring(F.col("column_1"), 1, 4)), True
).otherwise(False)
)
df.show()
df.show()
Output:
+--------------+--------------------+---------------+
| column_1| column_2|Required_column|
+--------------+--------------------+---------------+
| K12B-45-84-6|K12B-02-36-504, I...| true|
| J020-35-2-9|P12K-05-31-602, M...| false|
|L006-85-00-694|M10P-22-94-349,L0...| true|
|M002-22-36-989|U985-12-45-363, ...| true|
+--------------+--------------------+---------------+
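If you want to be stricter and only match the prefix of each switch in column_2 (rather than the first 4 characters appearing anywhere in the string), a possible variant is sketched below; it assumes Spark 3.1+ for the exists higher-order function and is untested against your real data:
import pyspark.sql.functions as F

# Split column_2 on commas, trim stray spaces, and check whether any element
# starts with the first 4 characters of column_1.
df = df.withColumn(
    "Required_column",
    F.exists(
        F.split(F.col("column_2"), ","),
        lambda s: F.trim(s).startswith(F.substring(F.col("column_1"), 1, 4))
    )
)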

Extract key value from dataframe in PySpark

I have the below dataframe which I have read from a JSON file.
+-----------------------------+-------------------------+--------------------------+----------------------------+
|1                            |2                        |3                         |4                           |
+-----------------------------+-------------------------+--------------------------+----------------------------+
|{"todo":["wakeup", "shower"]}|{"todo":["brush", "eat"]}|{"todo":["read", "write"]}|{"todo":["sleep", "snooze"]}|
+-----------------------------+-------------------------+--------------------------+----------------------------+
I need my output to be an ID and value pair, as below. How do I do this? Do I need to create a schema?
+---+--------------+
| ID|          todo|
+---+--------------+
|  1|wakeup, shower|
|  2|    brush, eat|
|  3|   read, write|
|  4| sleep, snooze|
+---+--------------+
The key-value you refer to is a struct: the "keys" are struct field names, while the "values" are field values.
What you want to do is called unpivoting. One of the ways to do it in PySpark is using stack. The following is a dynamic approach, where you don't need to hard-code the existing column names.
Input dataframe:
df = spark.createDataFrame(
[((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
'`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
Script:
to_melt = [f"\'{c}\', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
df.show()
# +---+----------------+
# | ID| todo|
# +---+----------------+
# | 1|[wakeup, shower]|
# | 2| [brush, eat]|
# | 3| [read, write]|
# | 4| [sleep, snooze]|
# +---+----------------+
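If you want todo as a comma-separated string (to match the desired output exactly) rather than an array, one extra step with array_join should do it; a small sketch:
from pyspark.sql import functions as F

# Join the array elements into a single "a, b" string
df = df.withColumn("todo", F.array_join("todo", ", "))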
Use from_json to parse each JSON string, collect the values into one array, and explode to turn each element into its own row.
data
df = spark.createDataFrame(
    [(('{"todo":"[wakeup, shower]"}'), ('{"todo":"[brush, eat]"}'), ('{"todo":"[read, write]"}'), ('{"todo":"[sleep, snooze]"}'))],
    ('value1', 'values2', 'value3', 'value4'))
code
from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))  # from JSON string to individual rows
       .withColumn('todo', translate('todo', "[]", ''))  # remove the square brackets
       ).show(truncate=False)
outcome
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1 |values2 |value3 |value4 |todo |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+

Pyspark Multiple JOINS Column <> Row values: Reducing Actions

I have a master table 'Table 1' with 3 columns (shown below). Tables 2.1, 3.1 & 4.1 cover the 3 unique dates present in Table 1 and need to be populated into column 'Points 1'. Similarly, Tables 2.2, 3.2 & 4.2 cover the same 3 unique dates and need to be populated into column 'Points 2'.
Current Approach:
df1 = spark.table("Table1")
df2_1 = spark.table("table2.1")
df2_1 = df2_1.withColumn("Date", lit(3312019))
df3 = df1.join(df2_1, (df1.ID == df2_1.ID) & (df1.Date == df2_1.Date), 'left')
df4 = df3.withColumn('Points', when(df3.Category == 'A', col('A'))
                     .when(df3.Category == 'B', col('B'))
                     .when(df3.Category == 'C', col('C'))
                     .when(df3.Category == 'D', col('D'))
                     .otherwise(lit(None)))
This approach makes my code lengthy when implemented for all 6 tables; any suggestions to shorten it and reduce the number of actions?
I don't know if this is much shorter or "cleaner" than your version, but since you asked for help on this, I will post it as an answer. Please note that my answer is in regular Spark (Scala), not PySpark, but it shouldn't be too difficult to port it to PySpark if you find the answer useful :)
So here goes:
First a little helper function
def columns2rows(row: Row) = {
  val id = row.getInt(0)
  val date = row.getInt(1)
  val cols = Seq("A", "B", "C", "D")
  cols.indices.map(index => (id, cols(index), date, if (row.isNullAt(index+2)) 0 else row.getInt(index+2)))
}
Then union together the tables needed to populate "Points1"
val df1 = table21.withColumn("Date", lit(3312019))
  .unionByName(table31.withColumn("Date", lit(12312019)))
  .unionByName(table41.withColumn("Date", lit(5302020)))
  .select($"ID", $"Date", $"A", $"B", $"C", $"D")
  .flatMap(row => columns2rows(row))
  .toDF("ID", "Category", "Date", "Points1")
Then union together the tables needed to populate "Points2"
val df2 = table22.withColumn("Date", lit(3312019))
  .unionByName(table32.withColumn("Date", lit(12312019)))
  .unionByName(table42.withColumn("Date", lit(5302020)))
  .select($"ID", $"Date", $"A", $"B", $"C", $"D")
  .flatMap(row => columns2rows(row))
  .toDF("ID", "Category", "Date", "Points2")
Join them together and finally with the original table:
val joiningTable = df1.join(df2, Seq("ID", "Category", "Date"))
val res = table1.join(joiningTable, Seq("ID", "Category", "Date"))
...and voila - printing the final result:
res.show()
+---+--------+--------+-------+-------+
| ID|Category| Date|Points1|Points2|
+---+--------+--------+-------+-------+
|123| A| 3312019| 40| 20|
|123| B| 5302020| 10| 90|
|123| D| 5302020| 0| 80|
|123| A|12312019| 20| 10|
|123| B|12312019| 0| 10|
|123| B| 3312019| 60| 60|
+---+--------+--------+-------+-------+
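Since the question asked for PySpark, here is a rough PySpark port of the same idea, offered only as a sketch: it assumes the same hypothetical table variables as the Scala code above (table1, table21, table31, table41, table22, table32, table42, each with columns ID, A, B, C, D) and uses stack to unpivot the A-D columns instead of the flatMap helper.
from pyspark.sql import functions as F

def melt_points(df, date, points_col):
    # Add the literal Date, then unpivot the A-D columns into (Category, points_col) rows,
    # replacing missing points with 0 like the Scala helper does.
    return (df.withColumn("Date", F.lit(date))
              .selectExpr("ID", "Date",
                          f"stack(4, 'A', A, 'B', B, 'C', C, 'D', D) as (Category, {points_col})")
              .fillna(0, subset=[points_col]))

points1 = (melt_points(table21, 3312019, "Points1")
           .unionByName(melt_points(table31, 12312019, "Points1"))
           .unionByName(melt_points(table41, 5302020, "Points1")))

points2 = (melt_points(table22, 3312019, "Points2")
           .unionByName(melt_points(table32, 12312019, "Points2"))
           .unionByName(melt_points(table42, 5302020, "Points2")))

res = (table1.join(points1, ["ID", "Category", "Date"])
             .join(points2, ["ID", "Category", "Date"]))
res.show()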

Add single quotes to the dataFrame column values

My DataFrame has a column QUALIFY with values like below.
QUALIFY
=================
ColA|ColB|ColC
ColA
ColZ|ColP
The values in this column are separated by "|". I want the values in this column to look like 'ColA','ColB','ColC' ...
With the code below I am able to replace | with ','. How can I also add a single quote at the start and end of each value?
newDf = df_qualify.withColumn('QUALIFY2', regexp_replace('QUALIFY', "\\|", "\\','"))
Your solution is almost there - you just need to add a single quote to the start and end. You can achieve this using pyspark.sql.functions.concat:
from pyspark.sql.functions import col, concat, lit, regexp_replace
df.withColumn(
    "QUALIFY2",
    concat(lit("'"), regexp_replace(col('QUALIFY'), r"\|", r"','"), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
Alternatively, you can avoid regexp_replace and achieve the same using split and concat_ws:
from pyspark.sql.functions import split, concat_ws
df.withColumn(
    "QUALIFY2",
    concat(lit("'"), concat_ws("','", split("QUALIFY", r"\|")), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
Split the column on | and then join the resulting array back into a string:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def str_list(x):
    return str(x).replace("[", "").replace("]", "")

str_udf = F.udf(str_list, T.StringType())

df = df.withColumn("arr_split", F.split(F.col("QUALIFY"), r"\|"))  # escape character
df = df.withColumn("QUALIFY2", str_udf(F.col("arr_split")))
My sample output frame:
df.drop("arr_split").show() # Please ignore a and b columns
+---+---+--------------+--------------------+
| a| b| abc| QUALIFY2|
+---+---+--------------+--------------------+
| 1| 1|col1|col2|col3|'col1', 'col2', '...|
| 2| 2|col1|col2|col3|'col1', 'col2', '...|
| 3| 3|col1|col2|col3|'col1', 'col2', '...|
| 4| 4|col1|col2|col3|'col1', 'col2', '...|
| 5| 5|col1|col2|col3|'col1', 'col2', '...|
+---+---+--------------+--------------------+
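Side note: if you want to see the full quoted strings instead of the truncated '...' cells, pass truncate=False to show:
df.drop("arr_split").show(truncate=False)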
The code below worked for me; I added the square brackets back so the value looks like an array:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def str_list(x):
    return str(x).replace("[", "").replace("]", "")

str_udf = F.udf(str_list, T.StringType())

df = df.withColumn(column_name, str_udf(F.col(column_name)))
df = df.withColumn(column_name, F.expr("concat('[', " + column_name + ", ']')"))

convert pyspark groupedData object to spark Dataframe

I have to do a two-level grouping on a PySpark dataframe.
My attempt:
grouped_df=df.groupby(["A","B","C"])
grouped_df.groupby(["C"]).count()
But I get the following error:
'GroupedData' object has no attribute 'groupby'
I guess I should first convert the grouped object back into a PySpark DataFrame, but I cannot figure out how to do that.
Any suggestions?
I had the same issue. The way I got around it was by first doing a "count()" after the first groupby, because that returns a Spark DataFrame, rather than the GroupedData object. Then you can do another groupby on that returned DataFrame.
So try:
grouped_df=df.groupby(["A","B","C"]).count()
grouped_df.groupby(["C"]).count()
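A minimal illustration of that pattern with made-up data: count per (A, B, C) first, which returns a DataFrame, then group that result again by C.
df = spark.createDataFrame(
    [("a1", "b1", "c1"), ("a1", "b2", "c1"), ("a2", "b1", "c2")],
    ["A", "B", "C"])

level1 = df.groupby(["A", "B", "C"]).count()   # returns a DataFrame, not GroupedData
level1.groupby(["C"]).count().show()           # number of (A, B, C) groups per C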
The function DataFrame.groupBy(cols) returns a GroupedData object. In order to convert a GroupedData object back to a DataFrame, you will need to use one of the GroupedData functions such as mean(cols), avg(cols), or count(). An example using your data:
df = sqlContext.createDataFrame([['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']], schema=['A', 'B', 'C'])
df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| a| b| c|
| a| b| c|
| a| b| c|
+---+---+---+
gdf = df.groupBy('C').count()
gdf.show()
+---+-----+
| C|count|
+---+-----+
| c| 3|
+---+-----+
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData
pyspark.sql.GroupedData — Aggregation methods, returned by DataFrame.groupBy(). A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy().
You may use an aggregation function such as agg, avg, count, max, mean, min, pivot, sum, collect_list, collect_set, first, grouping, etc.
Pay attention to first: this function is an action, and it can make your script slower if you misuse it.
If you have a numeric column you can use aggregation functions such as min, max, mean, etc., but if you have a string column you may want to use one of the options below (a small, self-contained example follows them):
df.groupBy("ID").pivot("VAR").agg(concat_ws('', collect_list(col("VAL"))))
or
df.groupBy("ID").pivot("VAR").agg(collect_list(collect_list("VAL")[0]))
or
df.groupBy("ID").pivot("VAR").agg(first("VAL"))