How to split an array of structs into multiple columns in Spark SQL

I have a PySpark dataframe containing an array of structs, as shown below.
-+-----------------------------------------------------------------------------------+-
| targeting_values |
-+-----------------------------------------------------------------------------------+-
| [('123', '123', '123'), ('abc', 'def', 'ghi'), ('jkl', 'mno', 'pqr'), (0, 1, 2)] |
-+-----------------------------------------------------------------------------------+-
I want 4 different columns, with one struct in each column, like below.
-+----------------------+----------------------+-----------------------+--------------------+-
| value1 | value2 | value3 | value4 |
-+----------------------+----------------------+-----------------------+--------------------+-
| ('123', '123', '123')|('abc', 'def', 'ghi') | ('jkl', 'mno', 'pqr') | (0, 1, 2) |
-+----------------------+----------------------+-----------------------+--------------------+-
I tried to achieve this using split(), but had no luck, and I have not found another way to solve it.
Is there a good way to do this?

You can do it by exploding the array and then pivoting it.
// first create the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

val arrayStructData = Seq(
  Row(List(Row("123", "123", "123"), Row("abc", "def", "ghi"), Row("jkl", "mno", "pqr"), Row("0", "1", "2"))),
  Row(List(Row("456", "456", "456"), Row("qsd", "fgh", "hjk"), Row("aze", "rty", "uio"), Row("4", "5", "6")))
)

val arrayStructSchema = new StructType()
  .add("targeting_values", ArrayType(new StructType()
    .add("_1", StringType)
    .add("_2", StringType)
    .add("_3", StringType)))

val df = spark.createDataFrame(spark.sparkContext
  .parallelize(arrayStructData), arrayStructSchema)
df.show(false)
+--------------------------------------------------------------+
|targeting_values |
+--------------------------------------------------------------+
|[{123, 123, 123}, {abc, def, ghi}, {jkl, mno, pqr}, {0, 1, 2}]|
|[{456, 456, 456}, {qsd, fgh, hjk}, {aze, rty, uio}, {4, 5, 6}]|
+--------------------------------------------------------------+
// Then combine posexplode, creating an id, and pivoting, like this:
df.withColumn("id2", monotonically_increasing_id())
.select(col("id2"), posexplode(col("targeting_values"))).withColumn("id", concat(lit("value"), col("pos") + 1))
.groupBy("id2").pivot("id").agg(first("col")).drop("id2")
.show(false)
+---------------+---------------+---------------+---------+
|value1 |value2 |value3 |value4 |
+---------------+---------------+---------------+---------+
|{123, 123, 123}|{abc, def, ghi}|{jkl, mno, pqr}|{0, 1, 2}|
|{456, 456, 456}|{qsd, fgh, hjk}|{aze, rty, uio}|{4, 5, 6}|
+---------------+---------------+---------------+---------+
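Since the question is tagged pyspark, here is a rough PySpark equivalent of the same posexplode + pivot idea (a sketch, assuming df has the targeting_values column shown above):

from pyspark.sql import functions as F

# Tag each row, position-explode the array, build a "valueN" label from the
# position, then pivot back so each position becomes its own column.
result = (
    df.withColumn("id2", F.monotonically_increasing_id())
      .select("id2", F.posexplode("targeting_values"))
      .withColumn("id", F.concat(F.lit("value"), F.col("pos") + 1))
      .groupBy("id2").pivot("id").agg(F.first("col"))
      .drop("id2")
)
result.show(truncate=False)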

You can also do it with a single select, assuming the array always has exactly four elements:
df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)])
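If the number of elements can vary per row, a hypothetical variation (assuming you are fine with first computing the maximum array size on the driver) is:

from pyspark.sql import functions as F

# Find the largest array size, then create one column per position.
n = df.select(F.max(F.size("targeting_values"))).first()[0]
result = df.select(*[F.col("targeting_values")[i].alias(f"value{i+1}") for i in range(n)])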

Related

Swap values between columns based on third column

I have a table like this:
src_id | src_source | dst_id | dst_source | metadata
--------------------------------------------------------
123 | A | 345 | B | some_string
234 | B | 567 | A | some_other_string
498 | A | 432 | A | another_one # this line should be ignored
765 | B | 890 | B | another_one # this line should be ignored
What I would like is:
A_id | B_id | metadata
-----------------------
123 | 345 | some_string
567 | 234 | some_other_string
Here's the data to replicate:
data = [
    ("123", "A", "345", "B", "some_string"),
    ("234", "B", "567", "A", "some_other_string"),
    ("498", "A", "432", "A", "another_one"),
    ("765", "B", "890", "B", "another_two"),
]
cols = ["src_id", "src_source", "dst_id", "dst_source", "metadata"]
df = spark.createDataFrame(data).toDF(*cols)
I am a bit confused as to how to do this - I got to here:
from pyspark.sql import functions as F

output = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .withColumn("A_id",
                F.when(F.col("src_source") == "A", F.col("src_id")))
    .withColumn("B_id",
                F.when(F.col("src_source") == "B", F.col("src_id")))
)
I think I figured it out: I need to split the df and union it again.
ab_df = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .filter((F.col("src_source") == "A") & (F.col("dst_source") == "B"))
    .select(F.col("src_id").alias("A_id"),
            F.col("dst_id").alias("B_id"),
            "metadata")
)
ba_df = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .filter((F.col("src_source") == "B") & (F.col("dst_source") == "A"))
    .select(F.col("src_id").alias("B_id"),
            F.col("dst_id").alias("A_id"),
            "metadata")
)
all = ab_df.unionByName(ba_df)
You can do it without a union, in a single select, without needing to write the same filter twice.
output = (
    df
    .filter(F.col("src_source") != F.col("dst_source"))
    .select(
        F.when(F.col("src_source") == "A", F.col("src_id")).otherwise(F.col("dst_id")).alias("A_id"),
        F.when(F.col("src_source") == "A", F.col("dst_id")).otherwise(F.col("src_id")).alias("B_id"),
        "metadata"
    )
)
output.show()
# +----+----+-----------------+
# |A_id|B_id| metadata|
# +----+----+-----------------+
# | 123| 345| some_string|
# | 567| 234|some_other_string|
# +----+----+-----------------+

Split row into multiple rows to limit length of array in column (spark / scala)

I have a dataframe that looks like this:
+--------------+--------------------+
|id | items |
+--------------+--------------------+
| 1|[a, b, .... x, y, z]|
+--------------+--------------------+
| 1|[q, z, .... x, b, 5]|
+--------------+--------------------+
| 2|[q, z, .... x, b, 5]|
+--------------+--------------------+
I want to split the rows so that the array in the items column is at most length 20. If an array has length greater than 20, I would want to make new rows and split the array up so that each array is of length 20 or less. So for the first row in my example dataframe, if we assume the length is 10 and I want at most length 3 for each row, I would like for it to be split like this:
+--------------+--------------------+
|id | items |
+--------------+--------------------+
| 1|[a, b, c] |
+--------------+--------------------+
| 1|[z, y, z] |
+--------------+--------------------+
| 1|[e, f, g] |
+--------------+--------------------+
| 1|[q] |
+--------------+--------------------+
Ideally, all rows should be of length 3 except the last row if the length of the array is not evenly divisible by the max desired length. Note - the id column is not unique
Using the higher-order functions transform + filter along with slice, you can split the array into sub-arrays of size 20 and then explode it:
import org.apache.spark.sql.functions.{explode, expr}

val l = 20

val df1 = df.withColumn(
  "items",
  explode(
    expr(
      s"filter(transform(items, (x,i)-> IF(i%$l=0, slice(items,i+1,$l), null)), x-> x is not null)"
    )
  )
)
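The same SQL expression also works unchanged from PySpark; a minimal sketch, assuming the same df with an items column:

from pyspark.sql import functions as F

l = 20  # maximum chunk size

# Build one slice per multiple of l, drop the null placeholders, then explode.
df1 = df.withColumn(
    "items",
    F.explode(
        F.expr(
            f"filter(transform(items, (x, i) -> IF(i % {l} = 0, slice(items, i + 1, {l}), null)), x -> x is not null)"
        )
    ),
)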
You could try this with pandas (note that it operates on a pandas DataFrame, not a Spark one):
import pandas as pd

max_item_length = 3

df = pd.DataFrame(
    {"fake_index": [1, 2, 3],
     "items": [["a", "b", "c", "d", "e"], ["f", "g", "h", "i", "j"], ["k", "l"]]}
)

# Split each items list into chunks of at most max_item_length, one row per chunk.
rows = []
for i in df.index:
    idx, items = df.loc[i, "fake_index"], df.loc[i, "items"]
    for start in range(0, len(items), max_item_length):
        rows.append({"fake_index": idx, "items": items[start:start + max_item_length]})

df = pd.DataFrame(rows)
print(df)
Input:
   fake_index            items
0           1  [a, b, c, d, e]
1           2  [f, g, h, i, j]
2           3           [k, l]
Output:
   fake_index      items
0           1  [a, b, c]
1           1     [d, e]
2           2  [f, g, h]
3           2     [i, j]
4           3     [k, l]
Since this requires a more complex transformation, I've used Datasets. This might not be as performant, but it will get you what you want.
Setup
Creating some sample data to mimic your data.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}
import spark.implicits._

val arrayData = Seq(
  Row(1, List(1, 2, 3, 4, 5, 6, 7)),
  Row(2, List(1, 2, 3, 4)),
  Row(3, List(1, 2)),
  Row(4, List(1, 2, 3))
)

val arraySchema = new StructType().add("id", IntegerType).add("values", ArrayType(IntegerType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData), arraySchema)
/*
+---+---------------------+
|id |values |
+---+---------------------+
|1 |[1, 2, 3, 4, 5, 6, 7]|
|2 |[1, 2, 3, 4] |
|3 |[1, 2] |
|4 |[1, 2, 3] |
+---+---------------------+
*/
Transformations
// encoder for the custom type produced by the transformation
implicit val encoder = ExpressionEncoder[(Int, Array[Array[Int]])]

// Here we are using a sliding window of size 3 and step 3.
// This can be made into a generic function for a window of size k.
val df2 = df.map(r => {
  val id = r.getInt(0)
  val a = r.getSeq[Int](1).toArray
  val arrays = a.sliding(3, 3).toArray
  (id, arrays)
})
/*
+---+---------------------------------------------------------------+
|_1 |_2 |
+---+---------------------------------------------------------------+
|1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7)]|
|2 |[WrappedArray(1, 2, 3), WrappedArray(4)] |
|3 |[WrappedArray(1, 2)] |
|4 |[WrappedArray(1, 2, 3)] |
+---+---------------------------------------------------------------+
*/
val df3 = df2
  .withColumnRenamed("_1", "id")
  .withColumnRenamed("_2", "values")
/*
+---+---------------------------------------------------------------+
|id |values |
+---+---------------------------------------------------------------+
|1 |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7)]|
|2 |[WrappedArray(1, 2, 3), WrappedArray(4)] |
|3 |[WrappedArray(1, 2)] |
|4 |[WrappedArray(1, 2, 3)] |
+---+---------------------------------------------------------------+
*/
Use explode
explode will create a new row for each array entry in the second column.
val df4 = df3.withColumn("values", functions.explode($"values"))
/*
+---+---------+
|id |values |
+---+---------+
|1 |[1, 2, 3]|
|1 |[4, 5, 6]|
|1 |[7] |
|2 |[1, 2, 3]|
|2 |[4] |
|3 |[1, 2] |
|4 |[1, 2, 3]|
+---+---------+
*/
Limitations
This approach is not without limitations.
Primarily, it will not be as performant on larger datasets, since this code no longer benefits from the DataFrame API's built-in optimizations. A pure DataFrame solution, however, might require window functions, which can also perform poorly depending on the size of the data. If it's possible to reshape this data at the source, that would be recommended.
This approach also requires defining an encoder for something more complex. If the data schema changes, then different encoders will have to be used.

Kusto query: how to iterate each row in a table as a parameter to query another table?

I have two tables, 'PlayersNames' and 'PlayerSpendMoney'.
How can I iterate over 'PlayersNames', get each player name, and then get how much money was spent on each player?
Does Kusto support this?
let PlayerName = datatable(name:string)
[
'player1',
'player2',
'player3',
];
let PlayerSpendMoney = datatable(name:string, spendMoney:int)
[
'player1', 1,
'player2', 3,
'player3', 4,
'player1', 1,
'player2', 5,
'player3', 1,
'player3', 1,
]
You could achieve that using the join operator: summarize the spend per player first, then left-outer-join the result onto the names table so that every player appears.
For example:
let PlayerName = datatable(name:string)
[
'player1',
'player2',
'player3',
]
;
let PlayerSpendMoney = datatable(name:string, spendMoney:int)
[
'player1', 1,
'player2', 3,
'player3', 4,
'player1', 1,
'player2', 5,
'player3', 1,
'player3', 1,
]
;
PlayerName
| join kind=leftouter (
PlayerSpendMoney
| summarize sum(spendMoney) by name
) on $left.name == $right.name
| project name, sum_spendMoney
| name | sum_spendMoney |
|---------|----------------|
| player1 | 2 |
| player2 | 8 |
| player3 | 6 |

How to flatten JSON values into frequency counts in SQL

I have a column with JSON values like so:
{'A': 'true', 'B': 'false', 'C': 'true'}
{'A': 'true', 'C': 'false'}
{'D': 'true'}
{'C': 'true', 'A': 'false'}
I would like to create an SQL query which counts the number of entries with each key-value combination in the json.
Note that the keys and values are unknown in advance.
So the output of the above would be:
2 A=true
1 A=false
1 B=false
2 C=true
1 C=false
1 D=true
How can I do that?
The following Presto/Trino query parses each JSON value into a map, unnests the key/value pairs, and counts each combination:
SELECT a1||':'||a2, count(*) FROM (
    SELECT map_entries(cast(json_parse(x) as MAP<VARCHAR, VARCHAR>)) row
    FROM (VALUES ('{"A": "true", "B": "false", "C": "true"}'), ('{"A": "true", "C": "false"}'), ('{"D": "true"}'), ('{"C": "true", "A": "false"}')) as t(x)
) as nested_data CROSS JOIN UNNEST(row) as nested_data(a1, a2)
GROUP BY 1;
_col0 | _col1
---------+-------
D:true | 1
B:false | 1
C:false | 1
C:true | 2
A:false | 1
A:true | 2
https://prestosql.io/docs/current/functions/map.html

SQL to JSON parent/child relationship

I have a table in my Microsoft SQL Server 2017 that looks like this:
+----+-------+----------+-------+-----------+
| ID | Level | ParentID | IsEnd | SomeText |
+----+-------+----------+-------+-----------+
| 1 | 1 | null | 1 | abc |
| 2 | 1 | null | 1 | asd |
| 3 | 2 | 1 | 1 | weqweq |
| 4 | 2 | 1 | 0 | lkjlkje |
| 5 | 3 | 4 | 1 | noonwqe |
| 6 | 3 | 4 | 0 | wet4t4 |
+----+-------+----------+-------+-----------+
And I would like to output a JSON string:
[{ ID: 1,
   SomeText: 'abc',
   Child2: [{
       ID: 3,
       SomeText: 'weqweq'
     }, {
       ID: 4,
       SomeText: 'lkjlkje',
       Child3: [{
           ID: 5,
           SomeText: 'noonwqe'
         }, {
           ID: 6,
           SomeText: 'wet4t4'
         }
       ]}
   ]
}]
IsEnd is a flag to know when you've reached the last level.
You can use a recursive scalar UDF (User Defined Function) that builds the hierarchy starting from the root.
Here is the stub of a UDF you can start from:
create function dbo.udf_create_json_tree(@currentId int)
returns varchar(max)
begin
    declare @json nvarchar(max)
    declare @id int, @parentId int, @someText varchar(50)

    select @id = [ID], @parentId = ParentID, @someText = SomeText
    from dbo.tmp
    where [ID] = @currentId

    set @json =
    (
        select [ID], SomeText, json_query(dbo.udf_create_json_tree([ID])) as Child
        from dbo.tmp
        where ParentID = @currentId
        for json auto
    );

    if (@parentId is null)
        set @json = concat(
            '[{"ID":' + cast(@id as nvarchar(50)),
            ',"SomeText":"', @someText,
            '","Child":', cast(@json as nvarchar(max)),
            '}]'
        )

    return @json
end
Populate a table with your input values:
create table tmp ([ID] int, [Level] int, ParentID int, IsEnd bit, SomeText varchar(50))
insert into tmp values
(1, 1, null,1, 'abc' )
,(2, 1, null,1, 'asd' )
,(3, 2, 1 ,1, 'weqweq' )
,(4, 2, 1 ,0, 'lkjlkje')
,(5, 3, 4 ,1, 'noonwqe')
,(6, 3, 4 ,0, 'wet4t4' )
Now you can call the UDF on the first node (with ID=1):
select dbo.udf_create_json_tree(1)
Formatted JSON result:
[{
"ID": 1,
"SomeText": "abc",
"Child": [{
"ID": 3,
"SomeText": "weqweq"
},
{
"ID": 4,
"SomeText": "lkjlkje",
"Child": [{
"ID": 5,
"SomeText": "noonwqe"
},
{
"ID": 6,
"SomeText": "wet4t4"
}]
}]
}]
If you really need to name each child node with the level number (Child2, Child3 and so on), you'll probably need to implement replace logic on the "Child" string.