I have a spark data frame as given below:
+---+---------+-------------------+
| id|     col1|               col2|
+---+---------+-------------------+
|  1|[{"a":1}]| [{"d": 3, "e": 4}]|
|  2|[{"a":2}]|[{"d": 5, "e": 10}]|
+---+---------+-------------------+
I want to obtain the following data frame:
+---+--------+
| id|col2_sum|
+---+--------+
|  1|       7|
|  2|      15|
+---+--------+
Datatypes:
id:StringType
col1:StringType
col2:StringType
Thanks in advance
Convert JSON string into map type using from_json then use aggregate function to sum the map values:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, """[{"a":1}]""", """[{"d": 3, "e": 4}]"""),
  (2, """[{"a":2}]""", """[{"d": 5, "e": 10}]""")
).toDF("id", "col1", "col2")

val df1 = (df
  .withColumn("col2", from_json(col("col2"), lit("array<map<string,int>>")))
  .withColumn("col2", flatten(expr("transform(col2, x -> map_values(x))")))
  .withColumn("col2_sum", expr("aggregate(col2, 0, (acc, x) -> acc + x)"))
  .drop("col1", "col2")
)
df1.show
//+---+--------+
//| id|col2_sum|
//+---+--------+
//| 1| 7|
//| 2| 15|
//+---+--------+
I have rows like this in my Snowflake database:
+-----+-----+-----+
| Foo | Bar | Baz |
+-----+-----+-----+
| A | a | [] |
| A | b | [] |
| B | a | [] |
| B | b | [] |
+-----+-----+-----+
I want to convert this into:
"A": {
"a": [],
"b": []
},
"B": {
"a": [],
"b": []
}
Snowflake allows you to achieve the desired effect with SQL:
CREATE OR REPLACE TABLE t
AS
SELECT 'A' AS foo, 'a' AS bar, PARSE_JSON('[]') AS Baz
UNION ALL SELECT 'A' AS foo, 'b' AS bar, PARSE_JSON('[]') AS Baz
UNION ALL SELECT 'B' AS foo, 'a' AS bar, PARSE_JSON('[]') AS Baz
UNION ALL SELECT 'B' AS foo, 'b' AS bar, PARSE_JSON('[]') AS Baz;
SELECT OBJECT_AGG(foo, s) AS result
FROM (SELECT foo, OBJECT_AGG(bar, baz) AS s
FROM t
GROUP BY foo) sub;
Output:
{
  "A": {
    "a": [],
    "b": []
  },
  "B": {
    "a": [],
    "b": []
  }
}
You can try using pandas to read the SQL data and convert it to nested JSON.
Refer to Convert Pandas Dataframe to nested JSON.
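As a rough sketch of that approach (not the linked answer verbatim): read the rows with pandas and nest them with groupby. The pd.read_sql call and connection are placeholders for your own Snowflake connection, and the uppercase column names are just how Snowflake usually returns them:
import json
import pandas as pd

# df = pd.read_sql("SELECT foo, bar, baz FROM t", snowflake_connection)  # placeholder connection
df = pd.DataFrame({
    "FOO": ["A", "A", "B", "B"],
    "BAR": ["a", "b", "a", "b"],
    "BAZ": ["[]", "[]", "[]", "[]"],
})

# Nest as {Foo: {Bar: Baz}}; BAZ holds JSON text, so parse it back into Python values
nested = {
    foo: dict(zip(grp["BAR"], (json.loads(v) for v in grp["BAZ"])))
    for foo, grp in df.groupby("FOO")
}
print(json.dumps(nested))
# {"A": {"a": [], "b": []}, "B": {"a": [], "b": []}}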
I have a table that looks like this:
   A       B     C
1  foo
2  foobar  blah
3
I want to count up the non-empty columns from A, B and C to get a summary column like this:
   A       B     C  sum
1  foo                1
2  foobar  blah       2
3                     0
Here is how I'm trying to do it:
import pandas as pd
df = { 'A' : ["foo", "foobar", ""],
       'B' : ["", "blah", ""],
       'C' : ["", "", ""]}
df = pd.DataFrame(df)
print(df)
df['sum'] = df[['A', 'B', 'C']].notnull().sum(axis=1)
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
These last two lines are different ways to get what I want but they aren't working. Any suggestions?
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
Worked. Thanks for the assistance. (The notnull() version counts every cell, because empty strings are not null.)
This one-liner worked for me :)
import numpy as np
df["sum"] = df.replace("", np.nan).T.count().reset_index().iloc[:,1]
I have a dataframe with a column called "traits" which is an integer composed of multiple flags.
I need to convert this column to a list of strings (for elastic search indexing). Conversion looks like this.
TRAIT_0 = 0
TRAIT_1 = 1
TRAIT_2 = 2
def flag_to_list(flag: int) -> List[str]:
    trait_list = []
    if flag & (1 << TRAIT_0):
        trait_list.append("TRAIT_0")
    if flag & (1 << TRAIT_1):
        trait_list.append("TRAIT_1")
    if flag & (1 << TRAIT_2):
        trait_list.append("TRAIT_2")
    return trait_list
What is the most efficient way of doing this transformation in pyspark? I saw lots of examples on how to do concatenation and splitting of strings, but not an operation like this.
Using pyspark version 2.4.5.
Input json looks like this:
{ "name": "John Doe", "traits": 5 }
Output json should look like this:
{ "name": "John Doe", "traits": ["TRAIT_0", "TRAIT_2"] }
IIUC, you can try Spark SQL built-in functions: (1) use conv + split to convert the integer (base 10) to binary (base 2), then to a string, and then to a reversed array of single-character strings; (2) use the 0/1 values and their array indices to transform and filter the array into the corresponding array of named traits:
from pyspark.sql.functions import expr
df = spark.createDataFrame([("name1", 5),("name2", 1),("name3", 0),("name4", 12)], ['name', 'traits'])
#DataFrame[name: string, traits: bigint]
traits = [ "Traits_{}".format(i) for i in range(8) ]
traits_array = "array({})".format(",".join("'{}'".format(e) for e in traits))
# array('Traits_0','Traits_1','Traits_2','Traits_3','Traits_4','Traits_5','Traits_6','Traits_7')
sql_expr = """
filter(
transform(
/* convert int -> binary -> string -> array of strings, and then reverse the array */
reverse(split(string(conv(traits,10,2)),'(?!$)')),
/* take the corresponding items from the traits_array when value > 0, else NULL */
(x,i) -> {}[IF(x='1',i,NULL)]
),
/* filter out NULL items from the array */
y -> y is not NULL
) AS trait_list
""".format(traits_array)
# filter(
# transform(
# reverse(split(string(conv(traits,10,2)),'(?!$)')),
# (x,i) -> array('Traits_0','Traits_1','Traits_2','Traits_3','Traits_4','Traits_5','Traits_6','Traits_7')[IF(x='1',i,NULL)]
# ),
# y -> y is not NULL
# )
df.withColumn("traits_list", expr(sql_expr)).show(truncate=False)
+-----+------+--------------------+
|name |traits|traits_list |
+-----+------+--------------------+
|name1|5 |[Traits_0, Traits_2]|
|name2|1 |[Traits_0] |
|name3|0 |[] |
|name4|12 |[Traits_2, Traits_3]|
+-----+------+--------------------+
Below is the result of running reverse(split(string(conv(traits,10,2)),'(?!$)')). Notice that the split pattern (?!$) is used to avoid a NULL being shown as the last array item.
df.selectExpr("*", "reverse(split(string(conv(traits,10,2)),'(?!$)')) as t1").show()
+-----+------+------------+
| name|traits| t1|
+-----+------+------------+
|name1| 5| [1, 0, 1]|
|name2| 1| [1]|
|name3| 0| [0]|
|name4| 12|[0, 0, 1, 1]|
+-----+------+------------+
We can define a UDF to wrap your function and then call it. This is some sample code:
from typing import List

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
TRAIT_0 = 0
TRAIT_1 = 1
TRAIT_2 = 2
def flag_to_list(flag: int) -> List[str]:
    trait_list = []
    if flag & (1 << TRAIT_0):
        trait_list.append("TRAIT_0")
    if flag & (1 << TRAIT_1):
        trait_list.append("TRAIT_1")
    if flag & (1 << TRAIT_2):
        trait_list.append("TRAIT_2")
    return trait_list

flag_to_list_udf = udf(lambda x: None if x is None else flag_to_list(x),
                       ArrayType(StringType()))
# Create dummy data to test
data = [
{ "name": "John Doe", "traits": 5 },
{ "name": "Jane Doe", "traits": 2 },
{ "name": "Jane Roe", "traits": 0 },
{ "name": "John Roe", "traits": 6 },
]
df = spark.createDataFrame(data, 'name STRING, traits INT')
df.show()
# +--------+------+
# | name|traits|
# +--------+------+
# |John Doe| 5|
# |Jane Doe| 2|
# |Jane Roe| 0|
# |John Roe| 6|
# +--------+------+
df = df.withColumn('traits_processed', flag_to_list_udf(df['traits']))
df.show()
# +--------+------+------------------+
# |    name|traits|  traits_processed|
# +--------+------+------------------+
# |John Doe|     5|[TRAIT_0, TRAIT_2]|
# |Jane Doe|     2|         [TRAIT_1]|
# |Jane Roe|     0|                []|
# |John Roe|     6|[TRAIT_1, TRAIT_2]|
# +--------+------+------------------+
If you don't want to create a new column, you can replace traits_processed with traits.
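For instance, reusing the flag_to_list_udf defined above to overwrite the column in place:
df = df.withColumn('traits', flag_to_list_udf(df['traits']))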
I would like to include null values in an Apache Spark join. By default, Spark does not include rows whose join key is null.
Here is the default Spark behavior.
val numbersDf = Seq(
("123"),
("456"),
(null),
("")
).toDF("numbers")
val lettersDf = Seq(
("123", "abc"),
("456", "def"),
(null, "zzz"),
("", "hhh")
).toDF("numbers", "letters")
val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
.join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
.drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
numbers = c("123", "456", NA, ""),
letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
  numbers numbers letters
1     456     456     def
2    <NA>    <NA>     zzz
3                     hhh
4     123     123     abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
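For instance, a small sketch with the PySpark numbers_df / letters_df from above (assuming an active SparkSession named spark): register the DataFrames as temp views and run the same query:
numbers_df.createOrReplaceTempView("numbers")
letters_df.createOrReplaceTempView("letters")
spark.sql("""
    SELECT * FROM numbers JOIN letters
    ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
""").show()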
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show //rename the columns back to the original names
Based on K L's idea, you could use foldLeft to generate the join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  leftDF.join(rightDF, fullExpr, joinType)
}
Then, you could call this function just like:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
Complementing the other answers: for PySpark < 2.3.0 you have neither Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still build the <=> operator with a SQL expression and include it in the join, as long as you define aliases for the join queries:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
numbers_df = spark.createDataFrame (["123","456",None,""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame ([("123", "abc"),("456", "def"),(None, "zzz"),("", "hhh") ]).\
toDF("numbers", "letters")
joined_df = numbers_df.alias("numbers").join(letters_df.alias("letters"),
F.expr('numbers.numbers <=> letters.numbers')).\
select('letters.*')
joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea, one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))

def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  val finalDF = leftDF.join(rightDF, fullExpr, joinType)
  val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
  filteredDF
}
Try the following method to include the null rows in the result of the JOIN operator:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  columns.drop(1).foreach(column => {
    columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
  })
  var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
  columns.foreach(column => {
    joinedDF = joinedDF.drop(leftDF(column))
  })
  joinedDF
}