I have rows like this in my Snowflake database:
+-----+-----+-----+
| Foo | Bar | Baz |
+-----+-----+-----+
| A   | a   | []  |
| A   | b   | []  |
| B   | a   | []  |
| B   | b   | []  |
+-----+-----+-----+
I want to convert this into:
"A": {
"a": [],
"b": []
},
"B": {
"a": [],
"b": []
}
Snowflake lets you achieve the desired result with SQL:
CREATE OR REPLACE TABLE t AS
SELECT 'A' AS foo, 'a' AS bar, PARSE_JSON('[]') AS baz
UNION ALL SELECT 'A' AS foo, 'b' AS bar, PARSE_JSON('[]') AS baz
UNION ALL SELECT 'B' AS foo, 'a' AS bar, PARSE_JSON('[]') AS baz
UNION ALL SELECT 'B' AS foo, 'b' AS bar, PARSE_JSON('[]') AS baz;

SELECT OBJECT_AGG(foo, s) AS result
FROM (
  SELECT foo, OBJECT_AGG(bar, baz) AS s
  FROM t
  GROUP BY foo
) sub;
Output:
{
  "A": {
    "a": [],
    "b": []
  },
  "B": {
    "a": [],
    "b": []
  }
}
You can try using pandas to read the SQL data and convert it to nested JSON. Refer to Convert Pandas Dataframe to nested JSON.
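For instance, here is a minimal sketch of the pandas route, assuming the rows have already been fetched into a DataFrame (e.g. via pd.read_sql or the Snowflake connector, both assumptions here) and Baz has been deserialized into Python objects:

import json

import pandas as pd

# toy stand-in for the table above; in practice this DataFrame would come from the database
df = pd.DataFrame({
    "Foo": ["A", "A", "B", "B"],
    "Bar": ["a", "b", "a", "b"],
    "Baz": [[], [], [], []],
})

# one outer key per Foo, mapping each Bar to its Baz value
nested = {foo: dict(zip(group["Bar"], group["Baz"])) for foo, group in df.groupby("Foo")}

print(json.dumps(nested))
# {"A": {"a": [], "b": []}, "B": {"a": [], "b": []}}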
Consider two dataframes:
>> import pandas as pd
>> df1 = pd.DataFrame({"category": ["foo", "foo", "bar", "bar", "bar"], "quantity": [1,2,1,2,3]})
>> print(df1)
  category  quantity
0      foo         1
1      foo         2
2      bar         1
3      bar         2
4      bar         3
>> df2 = pd.DataFrame({
     "category": ["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"],
     "item": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
   })
>> print(df2)
  category item
0      foo    A
1      foo    B
2      foo    C
3      foo    D
4      bar    E
5      bar    F
6      bar    G
7      bar    H
8      bar    I
9      bar    J
How can I create a new column in df1 (as a new dataframe called df3) that joins on the category column of df1 and allocates the item column of df2? That is, create something like:
>> df3 = pd.DataFrame({
     "category": ["foo", "foo", "bar", "bar", "bar"],
     "quantity": [1, 2, 1, 2, 3],
     "item": ["A", "B,C", "E", "F,G", "H,I,J"]
   })
  category  quantity   item
0      foo         1      A
1      foo         2    B,C
2      bar         1      E
3      bar         2    F,G
4      bar         3  H,I,J
You can create a helper DataFrame by repeating rows according to the quantity column with Index.repeat and DataFrame.loc, converting the index to a column so the original indices are not lost, and adding a helper column g in both DataFrames (via GroupBy.cumcount) for merging the duplicated categories. Then use DataFrame.merge followed by an aggregating join:
df11 = (df1.loc[df1.index.repeat(df1['quantity'])].reset_index()
           .assign(g=lambda x: x.groupby('category').cumcount()))
df22 = df2.assign(g=df2.groupby('category').cumcount())

df = (df11.merge(df22, on=['g', 'category'], how='left')
          .groupby(['index', 'category', 'quantity'])['item']
          .agg(lambda x: ','.join(x.dropna()))
          .droplevel(0)
          .reset_index())
print (df)
  category  quantity   item
0      foo         1      A
1      foo         2    B,C
2      bar         1      E
3      bar         2    F,G
4      bar         3  H,I,J
You can use an iterator with itertools.islice:
from itertools import islice
# aggregate the items as iterator
s = df2.groupby('category')['item'].agg(iter)
# for each category, allocate as many items as needed and join
df1['item'] = (df1.groupby('category', group_keys=False)['quantity']
                  .apply(lambda g: g.map(lambda x: ','.join(list(islice(s[g.name], x))))))
Output:
  category  quantity   item
0      foo         1      A
1      foo         2    B,C
2      bar         1      E
3      bar         2    F,G
4      bar         3  H,I,J
Note that if you don't have enough items, this will just use what is available.
Example using df2 truncated after F as input:
  category  quantity item
0      foo         1    A
1      foo         2  B,C
2      bar         1    E
3      bar         2    F
4      bar         3
def function1(dd: pd.DataFrame):
    # cumulative end position of each row's slice into this category's items
    col2 = dd.quantity.cumsum()
    # start position is the previous row's end (0 for the first row)
    col1 = col2.shift(fill_value=0)
    return dd.assign(col1=col1, col2=col2).apply(
        lambda ss: ",".join(
            df2.loc[df2.category == ss.category, "item"].iloc[ss.col1:ss.col2].tolist()
        ),
        axis=1,
    )

df1.assign(item=df1.groupby('category').apply(function1).droplevel(0))

Output:
  category  quantity   item
0      foo         1      A
1      foo         2    B,C
2      bar         1      E
3      bar         2    F,G
4      bar         3  H,I,J
I have a dataframe as input below. Ultimately, I'm trying to get the output shown below, so I can use df.filter(col("A").contains(col("B"))) to check whether A contains B as a substring. Note that I'd like to respect the order of the letters as well, so a set probably won't work; for example, "acb" should not be considered a substring of "abcd". I've tried to use split, but it only takes one delimiter. Could someone please help? I'm using Spark 2.4.
Input
+---+-------+-------+
| id| A     | B     |
+---+-------+-------+
|  1| abc d | bc, z |
|  2| abc-d | acb   |
|  3| abcde | hj f  |
+---+-------+-------+
Output
+---+-------+-----+
| id| A     | B   |
+---+-------+-----+
|  1| abcd  | bcz |
|  2| abcd  | acb |
|  3| abcde | hjf |
+---+-------+-----+
You can use a regex for both split and replace. Note that if you want to split, your expected output is not quite right, since splitting produces arrays of tokens rather than single strings.
Split
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    {"id": 1, "A": "abc d", "B": "bc, z"},
    {"id": 2, "A": "abc-d", "B": "acb"},
    {"id": 3, "A": "abcde", "B": "hj f"},
]
df = spark.createDataFrame(data)

split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
+--------+-------+---+
|A |B |id |
+--------+-------+---+
|[abc, d]|[bc, z]|1 |
|[abc, d]|[acb] |2 |
|[abcde] |[hj, f]|3 |
+--------+-------+---+
Now you can create a UDF that checks whether the values in array B appear as substrings of the values in array A.
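For example, here is a minimal sketch of such a UDF (the function name is just illustrative, and it assumes the split columns A and B produced above); it returns True when every token of B occurs, in order, inside some token of A:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

@F.udf(returnType=BooleanType())
def b_tokens_in_a(a_tokens, b_tokens):
    # substring check on each token, so letter order is preserved
    if a_tokens is None or b_tokens is None:
        return False
    return all(any(b in a for a in a_tokens) for b in b_tokens)

df.filter(b_tokens_in_a(F.col("A"), F.col("B"))).show(truncate=False)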
Replace
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    {"id": 1, "A": "abc d", "B": "bc, z"},
    {"id": 2, "A": "abc-d", "B": "acb"},
    {"id": 3, "A": "abcde", "B": "hj f"},
]
df = spark.createDataFrame(data)

replace_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.regexp_replace(F.col("A"), replace_regex, ""))
df = df.withColumn("B", F.regexp_replace(F.col("B"), replace_regex, ""))
Result:
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- id: long (nullable = true)
+-----+---+---+
|A |B |id |
+-----+---+---+
|abcd |bcz|1 |
|abcd |acb|2 |
|abcde|hjf|3 |
+-----+---+---+
This is a bit involved, and I would stick with split: here, for example, abcd contains both b and bc, and there is no way to keep track of whole words once you completely remove the delimiters.
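That said, if you do go with the replace route, the filter from the question can then be applied directly to the cleaned string columns; a minimal usage sketch:

from pyspark.sql import functions as F

# keeps only rows where the whole of the cleaned B is an ordered substring of the cleaned A
df.filter(F.col("A").contains(F.col("B"))).show(truncate=False)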
I have the following Dataframe View df_view:
+---+---+
| b | c |
+---+---+
| 1 | 3 |
+---+---+
I needed to select this data to form a key with a list of structs.
{
  "a": [
    {
      "b": 1,
      "c": 3
    }
  ]
}
With the select below, it only creates a struct, but not the list:
df = spark.sql(
    '''
    SELECT
      named_struct(
        'b', b,
        'c', c
      ) AS a
    FROM df_view
    '''
)
And after that I'll save it to the database:
(df.write
    .mode("overwrite")
    .format("com.microsoft.azure.cosmosdb.spark")
    .options(**cosmosConfig)
    .save())
How is it possible to create a struct inside a list in SQL?
You can wrap the struct in array():
df = spark.sql(
    '''
    SELECT
      array(named_struct(
        'b', b,
        'c', c
      )) AS a
    FROM df_view
    '''
)
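If you prefer the DataFrame API over SQL, the same shape can be built with array and struct; a minimal sketch, where df_src stands for whatever DataFrame df_view was registered from (the name is an assumption here):

from pyspark.sql import functions as F

# wrap the (b, c) struct in a single-element array column named "a"
df = df_src.select(F.array(F.struct(F.col("b"), F.col("c"))).alias("a"))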
I have a dataframe with a column called "traits" which is an integer composed of multiple flags.
I need to convert this column to a list of strings (for Elasticsearch indexing). The conversion looks like this:
from typing import List

TRAIT_0 = 0
TRAIT_1 = 1
TRAIT_2 = 2

def flag_to_list(flag: int) -> List[str]:
    trait_list = []
    if flag & (1 << TRAIT_0):
        trait_list.append("TRAIT_0")
    if flag & (1 << TRAIT_1):
        trait_list.append("TRAIT_1")
    if flag & (1 << TRAIT_2):
        trait_list.append("TRAIT_2")
    return trait_list
What is the most efficient way of doing this transformation in pyspark? I saw lots of examples on how to do concatenation and splitting of strings, but not an operation like this.
I'm using PySpark version 2.4.5.
Input json looks like this:
{ "name": "John Doe", "traits": 5 }
Output json should look like this:
{ "name": "John Doe", "traits": ["TRAIT_0", "TRAIT_2"] }
IIUC, you can try Spark SQL built-in functions: (1) use conv + split to convert the integer (base 10) to binary (base 2), then to a string, and then to a reversed array of strings; (2) based on the 0/1 values and their array indices, filter and transform that array into the corresponding array of named traits:
from pyspark.sql.functions import expr
df = spark.createDataFrame([("name1", 5),("name2", 1),("name3", 0),("name4", 12)], ['name', 'traits'])
#DataFrame[name: string, traits: bigint]
traits = [ "Traits_{}".format(i) for i in range(8) ]
traits_array = "array({})".format(",".join("'{}'".format(e) for e in traits))
# array('Traits_0','Traits_1','Traits_2','Traits_3','Traits_4','Traits_5','Traits_6','Traits_7')
sql_expr = """
filter(
transform(
/* convert int -> binary -> string -> array of strings, and then reverse the array */
reverse(split(string(conv(traits,10,2)),'(?!$)')),
/* take the corresponding items from the traits_array when value > 0, else NULL */
(x,i) -> {}[IF(x='1',i,NULL)]
),
/* filter out NULL items from the array */
y -> y is not NULL
) AS trait_list
""".format(traits_array)
# filter(
# transform(
# reverse(split(string(conv(traits,10,2)),'(?!$)')),
# (x,i) -> array('Traits_0','Traits_1','Traits_2','Traits_3','Traits_4','Traits_5','Traits_6','Traits_7')[IF(x='1',i,NULL)]
# ),
# y -> y is not NULL
# )
df.withColumn("traits_list", expr(sql_expr)).show(truncate=False)
+-----+------+--------------------+
|name |traits|traits_list |
+-----+------+--------------------+
|name1|5 |[Traits_0, Traits_2]|
|name2|1 |[Traits_0] |
|name3|0 |[] |
|name4|12 |[Traits_2, Traits_3]|
+-----+------+--------------------+
Below is the result after running reverse(split(string(conv(traits,10,2)),'(?!$)')); notice that the split pattern (?!$) is used to avoid a spurious empty entry as the last array item.
df.selectExpr("*", "reverse(split(string(conv(traits,10,2)),'(?!$)')) as t1").show()
+-----+------+------------+
| name|traits| t1|
+-----+------+------------+
|name1| 5| [1, 0, 1]|
|name2| 1| [1]|
|name3| 0| [0]|
|name4| 12|[0, 0, 1, 1]|
+-----+------+------------+
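For reference, here is the same bit logic in plain Python, just as a sanity check of the flag-to-name mapping (not part of the Spark solution):

def decode(flag, names):
    # bit i of `flag` selects names[i]
    return [name for i, name in enumerate(names) if flag & (1 << i)]

names = ["Traits_{}".format(i) for i in range(8)]
print(decode(5, names))   # ['Traits_0', 'Traits_2']
print(decode(12, names))  # ['Traits_2', 'Traits_3']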
We can define a UDF to wrap your function and then call it. This is some sample code:
from typing import List

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

TRAIT_0 = 0
TRAIT_1 = 1
TRAIT_2 = 2

def flag_to_list(flag: int) -> List[str]:
    trait_list = []
    if flag & (1 << TRAIT_0):
        trait_list.append("TRAIT_0")
    if flag & (1 << TRAIT_1):
        trait_list.append("TRAIT_1")
    if flag & (1 << TRAIT_2):
        trait_list.append("TRAIT_2")
    return trait_list

flag_to_list_udf = udf(lambda x: None if x is None else flag_to_list(x),
                       ArrayType(StringType()))
# Create dummy data to test
data = [
    { "name": "John Doe", "traits": 5 },
    { "name": "Jane Doe", "traits": 2 },
    { "name": "Jane Roe", "traits": 0 },
    { "name": "John Roe", "traits": 6 },
]
df = spark.createDataFrame(data, 'name STRING, traits INT')
df.show()
# +--------+------+
# | name|traits|
# +--------+------+
# |John Doe| 5|
# |Jane Doe| 2|
# |Jane Roe| 0|
# |John Roe| 6|
# +--------+------+
df = df.withColumn('traits_processed', flag_to_list_udf(df['traits']))
df.show()
# +--------+------+------------------+
# |    name|traits|  traits_processed|
# +--------+------+------------------+
# |John Doe|     5|[TRAIT_0, TRAIT_2]|
# |Jane Doe|     2|         [TRAIT_1]|
# |Jane Roe|     0|                []|
# |John Roe|     6|[TRAIT_1, TRAIT_2]|
# +--------+------+------------------+
If you don't want to create a new column, you can replace traits_processed with traits.
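For example:

# overwrite the original column instead of adding a new one
df = df.withColumn('traits', flag_to_list_udf(df['traits']))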
I want to join two dataframes based on a certain condition in Spark Scala. However, the catch is that if a row in df1 matches any row in df2, it should not try to match that same row of df1 with any other row in df2. Below are sample data and the outcome I am trying to get.
DF1
--------------------------------
Emp_id | Emp_Name | Address_id
1      | ABC      | 1
2      | DEF      | 2
3      | PQR      | 3
4      | XYZ      | 1

DF2
-----------------------
Address_id | City
1          | City_1
1          | City_2
2          | City_3
REST       | Some_City

Output DF
----------------------------------------
Emp_id | Emp_Name | Address_id | City
1      | ABC      | 1          | City_1
2      | DEF      | 2          | City_3
3      | PQR      | 3          | Some_City
4      | XYZ      | 1          | City_1
Note: REST is like a wildcard; any value can match REST. So in the above sample, emp_name "ABC" could match City_1, City_2, or Some_City. The output DF contains only City_1 because it is found first.
You seem to have custom logic for your join. Basically, I've come up with the UDF below. Note that you may want to change the UDF's logic as per your requirements.
import spark.implicits._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.first

// dataframe 1
val df_1 = Seq(("1", "ABC", "1"), ("2", "DEF", "2"), ("3", "PQR", "3"), ("4", "XYZ", "1"))
  .toDF("Emp_Id", "Emp_Name", "Address_Id")

// dataframe 2
val df_2 = Seq(("1", "City_1"), ("1", "City_2"), ("2", "City_3"), ("REST", "Some_City"))
  .toDF("Address_Id", "City_Name")

// UDF logic
val join_udf = udf((a: String, b: String) => {
  (a, b) match {
    case ("1", "1")  => true
    case ("1", _)    => false
    case ("2", "2")  => true
    case ("2", _)    => false
    case (_, "REST") => true
    case (_, _)      => false
  }
})

val dataframe_join = df_1.join(df_2, join_udf(df_1("Address_Id"), df_2("Address_Id")), "inner")
  .drop(df_2("Address_Id"))
  .orderBy($"City_Name")
  .groupBy($"Emp_Id", $"Emp_Name", $"Address_Id")
  .agg(first($"City_Name"))
  .orderBy($"Emp_Id")

dataframe_join.show(false)
Basically, after applying the UDF, what you get is all possible combinations of the matches. Then, when you apply groupBy and use the first aggregate function, you get only the filtered values you are looking for.
+------+--------+----------+-----------------------+
|Emp_Id|Emp_Name|Address_Id|first(City_Name, false)|
+------+--------+----------+-----------------------+
|1 |ABC |1 |City_1 |
|2 |DEF |2 |City_3 |
|3 |PQR |3 |Some_City |
|4 |XYZ |1 |City_1 |
+------+--------+----------+-----------------------+
Note that I've made use of Spark 2.3 and hope this helps!
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object JoinTwoDataFrame extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val df1 = Seq(
    (1, "ABC", "1"),
    (2, "DEF", "2"),
    (3, "PQR", "3"),
    (4, "XYZ", "1")
  ).toDF("Emp_id", "Emp_Name", "Address_id")

  val df2 = Seq(
    ("1", "City_1"),
    ("1", "City_2"),
    ("2", "City_3"),
    ("REST", "Some_City")
  ).toDF("Address_id", "City")

  val restCity: Option[String] =
    Some(df2.filter('Address_id.equalTo("REST")).select('City).first()(0).toString)

  val res = df1.join(df2, df1.col("Address_id") === df2.col("Address_id"), "left_outer")
    .select(
      df1.col("Emp_id"),
      df1.col("Emp_Name"),
      df1.col("Address_id"),
      df2.col("City")
    )
    .withColumn("city2", when('City.isNotNull, 'City).otherwise(restCity.getOrElse("")))
    .drop("City")
    .withColumnRenamed("city2", "City")
    .orderBy("Address_id", "City")
    .groupBy("Emp_id", "Emp_Name", "Address_id")
    .agg(collect_list("City").alias("cityList"))
    .withColumn("City", 'cityList.getItem(0))
    .drop("cityList")
    .orderBy("Emp_id")

  res.show(false)

  // +------+--------+----------+---------+
  // |Emp_id|Emp_Name|Address_id|City     |
  // +------+--------+----------+---------+
  // |1     |ABC     |1         |City_1   |
  // |2     |DEF     |2         |City_3   |
  // |3     |PQR     |3         |Some_City|
  // |4     |XYZ     |1         |City_1   |
  // +------+--------+----------+---------+
}