Spark SQL running sum with over() function - summing an indicator - sql

I have the following data:
df = [{"Category": 'A', "date": '01/01/2022', "Indictor": 1},
{"Category": 'A', "date": '02/01/2022', "Indictor": 0},
{"Category": 'A', "date": '03/01/2022', "Indictor": 1},
{"Category": 'A', "date": '04/01/2022', "Indictor": 1},
{"Category": 'A', "date": '05/01/2022', "Indictor": 1},
{"Category": 'B', "date": '01/01/2022', "Indictor": 0},
{"Category": 'B', "date": '02/01/2022', "Indictor": 1},
{"Category": 'B', "date": '03/01/2022', "Indictor": 1},
{"Category": 'B', "date": '04/01/2022', "Indictor": 0},
{"Category": 'B', "date": '05/01/2022', "Indictor": 0},
{"Category": 'B', "date": '06/01/2022', "Indictor": 1}]
df = spark.createDataFrame(df)
I want to use a LEAD() function to group by 'Category' and then order by 'date' ascending. Then create a new field called 'consec_ind' which is a counter of the number of consecutive days that the indicator has been 1.
This is the code I have tried but it doesn't quite work.
df.createOrReplaceTempView('df')
%sql
select date, Indictor,
case when Indictor > 0 THEN
(sum(count(Indictor)) over (order by date)) else 0 end as running_total
from df
WHERE Category = 'A'
group by date, Indictor
order by date, Indictor;
This is what I would like the data to look like:
{"Category": 'A', "date": '02/01/2022', "Indictor": 0,"consec_ind":0},
{"Category": 'A', "date": '03/01/2022', "Indictor": 1,"consec_ind":1},
{"Category": 'A', "date": '04/01/2022', "Indictor": 1,"consec_ind":2},
{"Category": 'A', "date": '05/01/2022', "Indictor": 1,"consec_ind":3},
{"Category": 'B', "date": '01/01/2022', "Indictor": 0,"consec_ind":0},
{"Category": 'B', "date": '02/01/2022', "Indictor": 1,"consec_ind":1},
{"Category": 'B', "date": '03/01/2022', "Indictor": 1,"consec_ind":2},
{"Category": 'B', "date": '04/01/2022', "Indictor": 0,"consec_ind":0},
{"Category": 'B', "date": '05/01/2022', "Indictor": 0,"consec_ind":0},
{"Category": 'B', "date": '06/01/2022', "Indictor": 1,"consec_ind":1}]

Here is my solution.
First step: partition by Category.
Second step: partition by your custom condition within the first partitioning - I do this by incrementing a counter every time I meet a 0 while iterating over the first window.
Third step: calculate the sum within the final partitions.
Here is the PySpark version:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = [
{"Category": "A", "date": "01/01/2022", "Indictor": 1},
{"Category": "A", "date": "02/01/2022", "Indictor": 0},
{"Category": "A", "date": "03/01/2022", "Indictor": 1},
{"Category": "A", "date": "04/01/2022", "Indictor": 1},
{"Category": "A", "date": "05/01/2022", "Indictor": 1},
{"Category": "B", "date": "01/01/2022", "Indictor": 0},
{"Category": "B", "date": "02/01/2022", "Indictor": 1},
{"Category": "B", "date": "03/01/2022", "Indictor": 1},
{"Category": "B", "date": "04/01/2022", "Indictor": 0},
{"Category": "B", "date": "05/01/2022", "Indictor": 0},
{"Category": "B", "date": "06/01/2022", "Indictor": 1},
]
df = spark.createDataFrame(df)
windowSpec = Window.partitionBy("Category").orderBy("date")

df.withColumn(
    # cumulative count of zeros: a new group starts every time the streak is broken
    "partition_number",
    F.sum((F.col("Indictor") == 0).cast("int")).over(windowSpec),
).withColumn(
    # running sum of the indicator within each (Category, streak) group
    "part_sum",
    F.sum(F.col("Indictor")).over(
        Window.partitionBy("Category", "partition_number").orderBy("date")
    ),
).drop("partition_number").show()
And here is the SQL version:
select
  Category,
  Indictor,
  date,
  sum(Indictor) over (
    PARTITION BY Category, partition_number
    ORDER BY date
  ) as part_sum
from (
  select
    *,
    sum(case when Indictor == 0 then 1 else 0 end) over (
      PARTITION BY Category
      ORDER BY date
    ) as partition_number
  from df
)
Output is:
+--------+--------+----------+--------+
|Category|Indictor| date|part_sum|
+--------+--------+----------+--------+
| A| 1|01/01/2022| 1|
| A| 0|02/01/2022| 0|
| A| 1|03/01/2022| 1|
| A| 1|04/01/2022| 2|
| A| 1|05/01/2022| 3|
| B| 0|01/01/2022| 0|
| B| 1|02/01/2022| 1|
| B| 1|03/01/2022| 2|
| B| 0|04/01/2022| 0|
| B| 0|05/01/2022| 0|
| B| 1|06/01/2022| 1|
+--------+--------+----------+--------+
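One caveat that is not part of the original answer: both versions order the window by the string date column. That happens to sort correctly for this sample, but it is fragile in general (with day-first dates, for example, '02/02/2022' would sort before '25/01/2022' even though it is later). A minimal Spark SQL sketch of the same query, assuming the dates are day-first dd/MM/yyyy strings - swap the pattern if they are month-first:
select
  Category,
  Indictor,
  date,
  sum(Indictor) over (
    PARTITION BY Category, partition_number
    -- order by a real date instead of the raw string
    ORDER BY to_date(date, 'dd/MM/yyyy')
  ) as part_sum
from (
  select
    *,
    sum(case when Indictor == 0 then 1 else 0 end) over (
      PARTITION BY Category
      -- 'dd/MM/yyyy' is an assumed format; adjust if dates are MM/dd/yyyy
      ORDER BY to_date(date, 'dd/MM/yyyy')
    ) as partition_number
  from df
)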

Related

Make a property bag from a list of keys and values

I have a list containing the keys and another list containing values (obtained from splitting a log line). How can I combine the two to make a property bag in Kusto?
let headers = pack_array("A", "B", "C");
datatable(RawData:string)
[
"1,2,3",
"4,5,6",
]
| expand fields = split(RawData, ",")
| expand dict = ???
Expected:
dict
-----
{"A": 1, "B": 2, "C": 3}
{"A": 4, "B": 5, "C": 6}
Here's one option that uses a combination of:
mv-apply: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/mv-applyoperator
pack(): https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/packfunction
make_bag(): https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/make-bag-aggfunction
let keys = pack_array("A", "B", "C");
datatable(RawData:string)
[
"1,2,3",
"4,5,6",
]
| project values = split(RawData, ",")
| mv-apply with_itemindex = i key = keys to typeof(string) on (
summarize dict = make_bag(pack(key, values[i]))
)
values            | dict
["1", "2", "3"]   | {"A": "1", "B": "2", "C": "3"}
["4", "5", "6"]   | {"A": "4", "B": "5", "C": "6"}

Combining separate temporal measurement series

I have a data set that combines two temporal measurement series with one row per measurement
time: 1, measurement: a, value: 5
time: 2, measurement: b, value: false
time: 10, measurement: a, value: 2
time: 13, measurement: b, value: true
time: 20, measurement: a, value: 4
time: 24, measurement: b, value: true
time: 30, measurement: a, value: 6
time: 32, measurement: b, value: false
In a visualization using Vega-Lite, I'd like to combine the measurement series and encode measurements a and b in a single visualization - not simply layering their representations on a temporal axis, but representing their values in a single encoding spec.
Either measurement a values need to be interpolated and added as a new field to the rows of measurement b, e.g.:
time: 2, measurement: b, value: false, interpolatedMeasurementA: 4.6667
or the other way around, which leaves the question of how to interpolate a boolean - maybe the closest value by time, or simpler, the last value, e.g.:
time: 30, measurement: a, value: 6, lastValueMeasurementB: true
I suppose this could be done either on the query side, in which case this question would be about the InfluxDB Flux query language, or on the visualization side, in which case it would be about Vega-Lite.
There aren't any true linear interpolation schemes built into Vega-Lite (though the loess transform comes close), but you can achieve roughly what you want with a window transform.
Here is an example (view in editor):
{
  "data": {
    "values": [
      {"time": 1, "measurement": "a", "value": 5},
      {"time": 2, "measurement": "b", "value": false},
      {"time": 10, "measurement": "a", "value": 2},
      {"time": 13, "measurement": "b", "value": true},
      {"time": 20, "measurement": "a", "value": 4},
      {"time": 24, "measurement": "b", "value": true},
      {"time": 30, "measurement": "a", "value": 6},
      {"time": 32, "measurement": "b", "value": false}
    ]
  },
  "transform": [
    {
      "calculate": "datum.measurement == 'a' ? datum.value : null",
      "as": "measurement_a"
    },
    {
      "window": [
        {"op": "mean", "field": "measurement_a", "as": "interpolated"}
      ],
      "sort": [{"field": "time"}],
      "frame": [1, 1]
    },
    {"filter": "datum.measurement == 'b'"}
  ],
  "mark": "line",
  "encoding": {
    "x": {"field": "time"},
    "y": {"field": "interpolated"},
    "color": {"field": "value"}
  }
}
This first uses a calculate transform to isolate the values to be interpolated, then a window transform that computes the mean over adjacent values (frame: [1, 1]), then a filter transform to isolate interpolated rows.
If you wanted to go the other route, you could do a similar sequence of transforms targeting the boolean value instead.

How to build rows of JSON from a table with one one-to-many relationship

I have a table main like this:
create foreign table main (
"id" character varying not null,
"a" character varying not null,
"b" character varying not null
)
And I have another table, not_main, like this:
create foreign table not_main (
"id" character varying not null,
"fk" character varying not null,
"d" character varying not null,
"e" character varying not null
)
I want a query whose return looks like:
json
0 {"id": "id_main_0", "a": "a0", "b": "b0", "cs": [{"id": "id_not_main_0", "fk": "id_main_0", "d": "d0", "e": "e0"}, {"id": "id_not_main_1", "fk": "id_main_0", "d": "d1", "e": "e1"}]}
1 {"id": "id_main_1", "a": "a1", "b": "b1", "cs": [{"id": "id_not_main_2", "fk": "id_main_1", "d": "d2", "e": "e3"}, {"id": "id_not_main_3", "fk": "id_main_1", "d": "d3", "e": "e3"}]}
How should I do it?
I tried:
select
json_build_object(
'id', m."id",
'a', m."a",
'b', m."b",
'cs', json_build_array(
json_build_object(
'd', nm."d",
'e', nm."e"
)
)
)
from main m
left join not_main nm on
nm."requisitionId" = m.id;
But it returns only one element in cs:
json
0 {"id": "id_main_0", "a": "a0", "b": "b0", "cs": [{"id": "id_not_main_0", "fk": "id_main_0", "d": "d0", "e": "e0"}]}
1 {"id": "id_main_1", "a": "a1", "b": "b1", "cs": [{"id": "id_not_main_2", "fk": "id_main_1", "d": "d2", "e": "e3"}]}
OBS: assume that the constraints of and between main and not_main are properly modeled, e.g. that both id columns actually are PKs and that fk references the id column of main.
You want json array aggregation. Basically, you just need to change json_build_array() to json_agg(), and to add a group by clause:
select
  json_build_object(
    'id', m.id,
    'a', m.a,
    'b', m.b,
    'cs', json_agg(
      json_build_object(
        'd', nm.d,
        'e', nm.e
      )
    )
  )
from main m
left join not_main nm on
  nm.requisitionId = m.id
group by m.id, m.a, m.b
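One extra note that the question does not ask about: with a left join, main rows that have no match in not_main will end up with cs containing a single object of nulls rather than an empty array. If that matters, a common Postgres idiom is to filter the aggregate and fall back to an empty array. A rough sketch, joining on the fk column from the posted table definition (the query above joined on requisitionId instead):
select
  json_build_object(
    'id', m.id,
    'a', m.a,
    'b', m.b,
    -- aggregate only rows that actually matched; otherwise emit an empty array
    'cs', coalesce(
      json_agg(
        json_build_object('d', nm.d, 'e', nm.e)
      ) filter (where nm.id is not null),
      '[]'::json
    )
  )
from main m
left join not_main nm on
  nm.fk = m.id
group by m.id, m.a, m.b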

Accessing array in Vega-Lite

I need to perform an operation in Vega-Lite/Kibana 6.5 similar to the following: I need to divide the y axis by "data.values[0].b". How can I perform this operation?
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description": "A simple bar chart with embedded data.",
  "data": {
    "values": [
      {"a": "A", "b": 28}, {"a": "B", "b": 55}, {"a": "C", "b": 43},
      {"a": "D", "b": 91}, {"a": "E", "b": 81}, {"a": "F", "b": 53},
      {"a": "G", "b": 19}, {"a": "H", "b": 87}, {"a": "I", "b": 52}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "a", "type": "ordinal"},
    "y": {"field": "b", "type": "quantitative"}
  }
}
Please take a look at the Calculate transform topic in the Vega-Lite docs.
You can do:
"transform": [
{"calculate": "datum.a / datum.b", "as": "y"}
]
Notice that 'datum' is the keyword used to access the data set.

C3JS Access value shown on X axis

I have a simple bar chart.
Here is my C3JS code:
var chart = c3.generate({
    data: {
        json: [{"A": 67, "B": 10, "site": "Google", "C": 12}, {"A": 10, "B": 20, "site": "Amazon", "C": 12}, {"A": 25, "B": 10, "site": "Stackoverflow", "C": 8}, {"A": 20, "B": 22, "site": "Yahoo", "C": 12}, {"A": 76, "B": 30, "site": "eBay", "C": 9}],
        mimeType: 'json',
        keys: {
            x: 'site',
            value: ['A', 'B', 'C']
        },
        type: 'bar',
        selection: {
            enabled: true
        },
        onselected: function(d, element) {
            alert('selected x: ' + chart.selected()[0].x + ' value: ' + chart.selected()[0].value + ' name: ' + chart.selected()[0].name);
        },
        groups: [
            ['A', 'B', 'C']
        ]
    },
    axis: {
        x: {
            type: 'category'
        }
    }
});
After some chart element is selected (clicked), the alert shows the X, Value, and Name attributes of the first selected element - for example "selected x: 0 value: 67 name: A" after I click on the top-left chart element. How can I get the value shown on the X axis? In this case it is "Google".
The categories property is populated when the x-axis is declared to be of type category, as it is in this case. So to get the data from the x-axis you need to call the .categories() function.
onselected: function(d,element){alert(chart.categories()[d.index]);}
https://jsfiddle.net/4bos2qzx/1/