I have a spark data frame as given below:
+---+---------+-------------------+
| id|     col1|               col2|
+---+---------+-------------------+
|  1|[{"a":1}]| [{"d": 3, "e": 4}]|
|  2|[{"a":2}]|[{"d": 5, "e": 10}]|
+---+---------+-------------------+
I want to obtain the following data frame:
+---+--------+
| id|col2_sum|
+---+--------+
|  1|       7|
|  2|      15|
+---+--------+
Datatypes:
id:StringType
col1:StringType
col2:StringType
Thanks in advance
Convert JSON string into map type using from_json then use aggregate function to sum the map values:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, """[{"a":1}]""", """[{"d": 3, "e": 4}]"""),
  (2, """[{"a":2}]""", """[{"d": 5, "e": 10}]""")
).toDF("id", "col1", "col2")

val df1 = (df
  .withColumn("col2", from_json(col("col2"), lit("array<map<string,int>>")))
  .withColumn("col2", flatten(expr("transform(col2, x -> map_values(x))")))
  .withColumn("col2_sum", expr("aggregate(col2, 0, (acc, x) -> acc + x)"))
  .drop("col1", "col2")
)
df1.show
//+---+--------+
//| id|col2_sum|
//+---+--------+
//| 1| 7|
//| 2| 15|
//+---+--------+
I have rows like this in my Snowflake database:
+-----+-----+-----+
| Foo | Bar | Baz |
+-----+-----+-----+
| A | a | [] |
| A | b | [] |
| B | a | [] |
| B | b | [] |
+-----+-----+-----+
I want to convert this into:
"A": {
"a": [],
"b": []
},
"B": {
"a": [],
"b": []
}
Snowflake allows you to achieve the desired effect with SQL:
CREATE OR REPLACE TABLE t
AS
SELECT 'A' AS foo, 'a' AS bar, PARSE_JSON('[]') AS Baz
UNION ALL SELECT 'A' AS foo, 'b' AS bar, PARSE_JSON('[]') AS Baz
UNION ALL SELECT 'B' AS foo, 'a' AS bar, PARSE_JSON('[]') AS Baz
UNION ALL SELECT 'B' AS foo, 'b' AS bar, PARSE_JSON('[]') AS Baz;
SELECT OBJECT_AGG(foo, s) AS result
FROM (SELECT foo, OBJECT_AGG(bar, baz) AS s
FROM t
GROUP BY foo) sub;
Output:
{
  "A": {
    "a": [],
    "b": []
  },
  "B": {
    "a": [],
    "b": []
  }
}
You can try using pandas to read the SQL data and convert it to nested JSON.
Refer to Convert Pandas Dataframe to nested JSON.
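As a rough sketch of that approach (not the linked answer verbatim): read the rows with pandas and nest them with groupby. The pd.read_sql call and connection are placeholders for your own Snowflake connection, and the uppercase column names are just how Snowflake usually returns them:
import json
import pandas as pd

# df = pd.read_sql("SELECT foo, bar, baz FROM t", snowflake_connection)  # placeholder connection
df = pd.DataFrame({
    "FOO": ["A", "A", "B", "B"],
    "BAR": ["a", "b", "a", "b"],
    "BAZ": ["[]", "[]", "[]", "[]"],
})

# Nest as {Foo: {Bar: Baz}}; BAZ holds JSON text, so parse it back into Python values
nested = {
    foo: dict(zip(grp["BAR"], (json.loads(v) for v in grp["BAZ"])))
    for foo, grp in df.groupby("FOO")
}
print(json.dumps(nested))
# {"A": {"a": [], "b": []}, "B": {"a": [], "b": []}}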
I have a table that looks like this:
   A       B     C
1  foo
2  foobar  blah
3
I want to count up the non-empty columns from A, B and C to get a summary column like this:
   A       B     C  sum
1  foo                1
2  foobar  blah       2
3                     0
Here is how I'm trying to do it:
import pandas as pd
df = { 'A' : ["foo", "foobar", ""],
       'B' : ["", "blah", ""],
       'C' : ["", "", ""]}
df = pd.DataFrame(df)
print(df)
df['sum'] = df[['A', 'B', 'C']].notnull().sum(axis=1)
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
These last two lines are different ways to get what I want but they aren't working. Any suggestions?
df['sum'] = (df[['A', 'B', 'C']] != "").sum(axis=1)
Worked. Thanks for the assistance. (The notnull() version counts every cell, because empty strings are not null.)
This one-liner worked for me :)
import numpy as np
df["sum"] = df.replace("", np.nan).T.count().reset_index().iloc[:,1]
I have a dataframe with a column called "traits" which is an integer composed of multiple flags.
I need to convert this column to a list of strings (for elastic search indexing). Conversion looks like this.
TRAIT_0 = 0
TRAIT_1 = 1
TRAIT_2 = 2
def flag_to_list(flag: int) -> List[str]:
    trait_list = []
    if flag & (1 << TRAIT_0):
        trait_list.append("TRAIT_0")
    if flag & (1 << TRAIT_1):
        trait_list.append("TRAIT_1")
    if flag & (1 << TRAIT_2):
        trait_list.append("TRAIT_2")
    return trait_list
What is the most efficient way of doing this transformation in pyspark? I saw lots of examples on how to do concatenation and splitting of strings, but not an operation like this.
Using pyspark version 2.4.5.
Input json looks like this:
{ "name": "John Doe", "traits": 5 }
Output json should look like this:
{ "name": "John Doe", "traits": ["TRAIT_0", "TRAIT_2"] }
IIUC, you can try Spark SQL built-in functions: (1) use conv + split to convert the integer (base 10) to binary (base 2), then to a string, and then to a reversed array of single-character strings; (2) use the 0/1 values and their array indices to transform and filter the array into the corresponding array of named traits:
from pyspark.sql.functions import expr
df = spark.createDataFrame([("name1", 5),("name2", 1),("name3", 0),("name4", 12)], ['name', 'traits'])
#DataFrame[name: string, traits: bigint]
traits = [ "Traits_{}".format(i) for i in range(8) ]
traits_array = "array({})".format(",".join("'{}'".format(e) for e in traits))
# array('Traits_0','Traits_1','Traits_2','Traits_3','Traits_4','Traits_5','Traits_6','Traits_7')
sql_expr = """
filter(
transform(
/* convert int -> binary -> string -> array of strings, and then reverse the array */
reverse(split(string(conv(traits,10,2)),'(?!$)')),
/* take the corresponding items from the traits_array when value > 0, else NULL */
(x,i) -> {}[IF(x='1',i,NULL)]
),
/* filter out NULL items from the array */
y -> y is not NULL
) AS trait_list
""".format(traits_array)
# filter(
# transform(
# reverse(split(string(conv(traits,10,2)),'(?!$)')),
# (x,i) -> array('Traits_0','Traits_1','Traits_2','Traits_3','Traits_4','Traits_5','Traits_6','Traits_7')[IF(x='1',i,NULL)]
# ),
# y -> y is not NULL
# )
df.withColumn("traits_list", expr(sql_expr)).show(truncate=False)
+-----+------+--------------------+
|name |traits|traits_list |
+-----+------+--------------------+
|name1|5 |[Traits_0, Traits_2]|
|name2|1 |[Traits_0] |
|name3|0 |[] |
|name4|12 |[Traits_2, Traits_3]|
+-----+------+--------------------+
Below is the result of running reverse(split(string(conv(traits,10,2)),'(?!$)')). Notice that the split pattern (?!$) is used to avoid a NULL being shown as the last array item.
df.selectExpr("*", "reverse(split(string(conv(traits,10,2)),'(?!$)')) as t1").show()
+-----+------+------------+
| name|traits| t1|
+-----+------+------------+
|name1| 5| [1, 0, 1]|
|name2| 1| [1]|
|name3| 0| [0]|
|name4| 12|[0, 0, 1, 1]|
+-----+------+------------+
We can define a UDF to wrap your function and then call it. This is some sample code:
from typing import List

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
TRAIT_0 = 0
TRAIT_1 = 1
TRAIT_2 = 2
def flag_to_list(flag: int) -> List[str]:
    trait_list = []
    if flag & (1 << TRAIT_0):
        trait_list.append("TRAIT_0")
    if flag & (1 << TRAIT_1):
        trait_list.append("TRAIT_1")
    if flag & (1 << TRAIT_2):
        trait_list.append("TRAIT_2")
    return trait_list

flag_to_list_udf = udf(lambda x: None if x is None else flag_to_list(x),
                       ArrayType(StringType()))
# Create dummy data to test
data = [
{ "name": "John Doe", "traits": 5 },
{ "name": "Jane Doe", "traits": 2 },
{ "name": "Jane Roe", "traits": 0 },
{ "name": "John Roe", "traits": 6 },
]
df = spark.createDataFrame(data, 'name STRING, traits INT')
df.show()
# +--------+------+
# | name|traits|
# +--------+------+
# |John Doe| 5|
# |Jane Doe| 2|
# |Jane Roe| 0|
# |John Roe| 6|
# +--------+------+
df = df.withColumn('traits_processed', flag_to_list_udf(df['traits']))
df.show()
# +--------+------+------------------+
# |    name|traits|  traits_processed|
# +--------+------+------------------+
# |John Doe|     5|[TRAIT_0, TRAIT_2]|
# |Jane Doe|     2|         [TRAIT_1]|
# |Jane Roe|     0|                []|
# |John Roe|     6|[TRAIT_1, TRAIT_2]|
# +--------+------+------------------+
If you don't want to create a new column, you can replace traits_processed with traits.
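For instance, reusing the flag_to_list_udf defined above to overwrite the column in place:
df = df.withColumn('traits', flag_to_list_udf(df['traits']))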
I would like to include null values in an Apache Spark join. By default, Spark does not include rows whose join key is null.
Here is the default Spark behavior.
val numbersDf = Seq(
("123"),
("456"),
(null),
("")
).toDF("numbers")
val lettersDf = Seq(
("123", "abc"),
("456", "def"),
(null, "zzz"),
("", "hhh")
).toDF("numbers", "letters")
val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
.join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
.drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
numbers = c("123", "456", NA, ""),
letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
  numbers numbers letters
1     456     456     def
2    <NA>    <NA>     zzz
3                     hhh
4     123     123     abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
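For instance, a small sketch with the PySpark numbers_df / letters_df from above (assuming an active SparkSession named spark): register the DataFrames as temp views and run the same query:
numbers_df.createOrReplaceTempView("numbers")
letters_df.createOrReplaceTempView("letters")
spark.sql("""
    SELECT * FROM numbers JOIN letters
    ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
""").show()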
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show //rename the columns back to the original names
Based on K L's idea, you could use foldLeft to generate the join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  leftDF.join(rightDF, fullExpr, joinType)
}
Then, you could call this function just like:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
Complementing the other answers: for PySpark < 2.3.0 you have neither Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still build the <=> operator with a SQL expression and include it in the join, as long as you define aliases for the join queries:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
numbers_df = spark.createDataFrame (["123","456",None,""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame ([("123", "abc"),("456", "def"),(None, "zzz"),("", "hhh") ]).\
toDF("numbers", "letters")
joined_df = numbers_df.alias("numbers").join(letters_df.alias("letters"),
F.expr('numbers.numbers <=> letters.numbers')).\
select('letters.*')
joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea, one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))

def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  val finalDF = leftDF.join(rightDF, fullExpr, joinType)
  val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
  filteredDF
}
Try the following method to include the null rows in the result of the JOIN operator:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  columns.drop(1).foreach(column => {
    columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
  })
  var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
  columns.foreach(column => {
    joinedDF = joinedDF.drop(leftDF(column))
  })
  joinedDF
}