ClickHouse moving average

Input:
ClickHouse, table A with columns:
business_dttm (DateTime)
amount (Float)
I need to calculate a moving sum over 15 minutes (or over the last 3 records) at each business_dttm.
For example:
amount  business_dttm        moving sum
0.3     2018-11-19 13:00:00
0.3     2018-11-19 13:05:00
0.4     2018-11-19 13:10:00  1
0.5     2018-11-19 13:15:00  1.2
0.6     2018-11-19 13:15:00  1.5
0.7     2018-11-19 13:20:00  1.8
0.8     2018-11-19 13:25:00  2.1
0.9     2018-11-19 13:25:00  2.4
0.5     2018-11-19 13:30:00  2.2
Unfortunately, ClickHouse has no window functions and no joins without equality conditions.
How can I do this without a cross join and a WHERE condition?

If the window size is a small fixed number, you can do something like this:
SELECT
    sum(window.2) AS amount,    -- only the tuple emitted at the row's own id carries the amount
    max(dttm) AS business_dttm,
    sum(amt) AS moving_sum      -- sum over the current row and the two preceding rows
FROM
(
    -- emit each row under its own window id and under the two following ids
    SELECT
        arrayJoin([(rowNumberInAllBlocks(), amount), (rowNumberInAllBlocks() + 1, 0), (rowNumberInAllBlocks() + 2, 0)]) AS window,
        amount AS amt,
        business_dttm AS dttm
    FROM
    (
        SELECT
            amount,
            business_dttm
        FROM A
        ORDER BY business_dttm
    )
)
GROUP BY window.1
HAVING count() = 3
ORDER BY window.1;
The first two rows are ignored, as ClickHouse doesn't collapse incomplete aggregates into NULL (HAVING count() = 3 drops the partial windows); you can prepend them later.
Update:
It's still possible to compute the moving sum for arbitrary window sizes. Tune window_size as you like (3 for this example).
-- Note: rowNumberInAllBlocks is stateful, so it gives incorrect results if used inside the WITH block
WITH
(
SELECT arrayCumSum(groupArray(amount))
FROM
(
SELECT
amount
FROM A
ORDER BY business_dttm
)
) AS arr,
3 AS window_size
SELECT
amount,
business_dttm,
if(rowNumberInAllBlocks() + 1 < window_size, NULL, arr[rowNumberInAllBlocks() + 1] - arr[rowNumberInAllBlocks() + 1 - window_size]) AS moving_sum
FROM
(
SELECT
amount,
business_dttm
FROM A
ORDER BY business_dttm
)
Or this variant
SELECT
amount,
business_dttm,
moving_sum
FROM
(
WITH 3 AS window_size
SELECT
groupArray(amount) AS amount_arr,
groupArray(business_dttm) AS business_dttm_arr,
arrayCumSum(amount_arr) AS amount_cum_arr,
arrayMap(i -> if(i < window_size, NULL, amount_cum_arr[i] - amount_cum_arr[(i - window_size)]), arrayEnumerate(amount_cum_arr)) AS moving_sum_arr
FROM
(
SELECT *
FROM A
ORDER BY business_dttm ASC
)
)
ARRAY JOIN
amount_arr AS amount,
business_dttm_arr AS business_dttm,
moving_sum_arr AS moving_sum
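If you also want partial sums for the first window_size - 1 rows instead of NULL, a small tweak (my addition, not part of the original answer) is to return the cumulative sum itself in that branch; for the first query the expression becomes:
if(rowNumberInAllBlocks() + 1 < window_size, arr[rowNumberInAllBlocks() + 1], arr[rowNumberInAllBlocks() + 1] - arr[rowNumberInAllBlocks() + 1 - window_size]) AS moving_sum
The same change works in the arrayMap variant by returning amount_cum_arr[i] instead of NULL.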
Fair warning: both approaches are far from optimal, but they show off ClickHouse's array capabilities beyond standard SQL.

Starting from version 21.4, ClickHouse fully supports window functions. At the time of writing, they were marked as an experimental feature.
SELECT
amount,
business_dttm,
sum(amount) OVER (ORDER BY business_dttm ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sum
FROM (
SELECT data.1 AS amount, toDateTime(data.2) AS business_dttm
FROM (
SELECT arrayJoin([
(0.3, '2018-11-19 13:00:00'),
(0.3, '2018-11-19 13:05:00'),
(0.4, '2018-11-19 13:10:00'),
(0.5, '2018-11-19 13:15:00'),
(0.6, '2018-11-19 13:15:00'),
(0.7, '2018-11-19 13:20:00'),
(0.8, '2018-11-19 13:25:00'),
(0.9, '2018-11-19 13:25:00'),
(0.5, '2018-11-19 13:30:00')]) data)
)
SETTINGS allow_experimental_window_functions = 1
/*
┌─amount─┬───────business_dttm─┬────────────────sum─┐
│    0.3 │ 2018-11-19 13:00:00 │                0.3 │
│    0.3 │ 2018-11-19 13:05:00 │                0.6 │
│    0.4 │ 2018-11-19 13:10:00 │                  1 │
│    0.5 │ 2018-11-19 13:15:00 │                1.2 │
│    0.6 │ 2018-11-19 13:15:00 │                1.5 │
│    0.7 │ 2018-11-19 13:20:00 │                1.8 │
│    0.8 │ 2018-11-19 13:25:00 │ 2.0999999999999996 │
│    0.9 │ 2018-11-19 13:25:00 │                2.4 │
│    0.5 │ 2018-11-19 13:30:00 │                2.2 │
└────────┴─────────────────────┴────────────────────┘
*/
See https://altinity.com/blog/clickhouse-window-functions-current-state-of-the-art.
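The question also asked for a 15-minute time window rather than a fixed 3-row window. A rough sketch of that (my addition, not from the answers above; it assumes ClickHouse 21.4+ with window functions enabled and relies on DateTime being ordered in seconds, as in the RANGE examples further below) could look like:
SELECT
    amount,
    business_dttm,
    -- 900 seconds = 15 minutes
    sum(amount) OVER (ORDER BY business_dttm ASC RANGE BETWEEN 900 PRECEDING AND CURRENT ROW) AS moving_sum_15m
FROM A
SETTINGS allow_experimental_window_functions = 1
Note that a time-based window gives different numbers than the 3-row window when timestamps repeat or are unevenly spaced.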

Related

How can I fill empty values while summarizing over a frame?

I have a query that calculates moving sum over a frame:
SELECT "Дата",
"Износ",
SUM("Сумма") OVER (partition by "Износ" order by "Дата"
rows between unbounded preceding and current row) AS "Продажи"
FROM (
SELECT date_trunc('week', period) AS "Дата",
multiIf(wear_and_tear BETWEEN 1 AND 3, '1-3',
wear_and_tear BETWEEN 4 AND 10, '4-10',
wear_and_tear BETWEEN 11 AND 20, '11-20',
wear_and_tear BETWEEN 21 AND 30, '21-30',
wear_and_tear BETWEEN 31 AND 45, '31-45',
wear_and_tear BETWEEN 46 AND 100, '46-100',
'Новые') AS "Износ",
SUM(quantity) AS "Сумма"
FROM shinsale_prod.sale_1c sc
LEFT JOIN product_1c pc ON sc.product_id = pc.id
WHERE 1=1
-- AND partner != 'Наше предприятие'
-- AND wear_and_tear = 0
-- AND stock IN ('ShinSale Щитниково', 'ShinSale Строгино', 'ShinSale Кунцево', 'ShinSale Санкт-Петербург', 'Шиномонтаж Подольск')
AND seasonality = 'з'
-- AND (quantity IN {{quant}} OR quantity IN -{{quant}})
-- AND stock in {{Склад}}
GROUP BY "Дата", "Износ"
HAVING "Дата" BETWEEN '2021-06-01' AND '2022-01-08'
ORDER BY 'Дата'
The thing is that in some groups I have no rows dated between 2021-12-20 and 2022-01-03,
so the line that represents such a group has a gap on my chart.
Is there a way I can fill this gap with average values or something?
I tried to right-join my subquery to an empty range of dates, but then I get empty rows, the filters in my WHERE section kill the query, and I end up with an empty or nearly empty result.
You can generate mockup dates and construct a proper outer join like this:
SELECT
a.the_date,
sum(your_query.value) OVER (PARTITION BY 1 ORDER BY a.the_date ASC)
FROM
(
SELECT
number AS value,
toDate('2021-01-01') + value AS the_date
FROM numbers(10)
) AS your_query
RIGHT JOIN
(
WITH
toStartOfDay(toDate('2021-01-01')) AS start,
toStartOfDay(toDate('2021-01-14')) AS end
SELECT arrayJoin(arrayMap(x -> toDate(x), range(toUInt32(start), toUInt32(end), 24 * 3600))) AS the_date
) AS a ON a.the_date = your_query.the_date
Then the results will have no gaps:
┌─a.the_date─┬─sum(value) OVER (PARTITION BY 1 ORDER BY a.the_date ASC)─┐
│ 2021-01-01 │ 0 │
│ 2021-01-02 │ 1 │
│ 2021-01-03 │ 3 │
│ 2021-01-04 │ 6 │
│ 2021-01-05 │ 10 │
│ 2021-01-06 │ 15 │
│ 2021-01-07 │ 21 │
│ 2021-01-08 │ 28 │
│ 2021-01-09 │ 36 │
│ 2021-01-10 │ 45 │
│ 2021-01-11 │ 45 │
│ 2021-01-12 │ 45 │
│ 2021-01-13 │ 45 │
└────────────┴──────────────────────────────────────────────────────────┘
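Another sketch (my addition, assuming your ClickHouse version supports ORDER BY ... WITH FILL): let WITH FILL generate the missing dates inside a subquery, where the other columns default to 0, and compute the running sum over the gap-free series:
SELECT
    the_date,
    sum(value) OVER (ORDER BY the_date ASC) AS running_sum
FROM
(
    SELECT
        toDate('2021-01-01') + number AS the_date,
        number AS value
    FROM numbers(10)
    -- WITH FILL inserts the missing dates up to (but not including) the TO bound,
    -- with value defaulting to 0
    ORDER BY the_date ASC WITH FILL FROM toDate('2021-01-01') TO toDate('2021-01-14')
)
The filled rows carry value = 0, so the running sum simply repeats the last real value, as in the joined variant above.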

How to filter columns in dataframes?

I have a long list of dates (from 1942-01-01 00:00:00 to 2012-12-31 24:00:00). These are associated with some amounts (see below). Is there a way to first filter all amounts for one day separately, and then add them up?
For example, for 1942-01-01, how do I find all values (amounts) that occur on that day (from hour 0 to 24) and then sum them?
time amount
DateTime Float64
1942-01-01T00:00:00 7.0
1942-01-02T00:00:00 0.2
1942-01-03T00:00:00 2.1
1942-01-04T00:00:00 3.0
:
2012-12-31T23:00:00 4.0
2012-12-31T24:00:00 0.0
df = CSV.read(path, DataFrame)
for i in 1:24
filter(r ->hour(r.time) == i, df)
end
Load InMemoryDatasets.jl and use its format mechanism to aggregate daily:
using InMemoryDatasets
ds=Dataset(time=DateTime("1942-01-01"):Hour(1):DateTime("2012-12-31"))
ds.amount = rand(nrow(ds))
DateValue(x) = Date(x)
setformat!(ds, :time=>DateValue)
combine(gatherby(ds,:time), :amount=>IMD.sum)
I didn't get how the accepted answer answers the question, but let me give another answer using the IMD package:
using InMemoryDatasets
ds=Dataset(time=DateTime("1942-01-01"):Hour(1):DateTime("2012-12-31"))
ds.amount = rand(nrow(ds))
datefmt(x) = round(x, Hour, RoundDown)
setformat!(ds, :time=>datefmt)
combine(gatherby(ds,:time), :amount=>IMD.sum)
PS I'm one of the IMD's contributors.
There are many approaches you could use (and maybe some other commenters will propose alternatives). Here let me show you how to achieve what you want without any filtering:
julia> df = DataFrame(time=[DateTime(2020, 1, rand(1:2), rand(0:23)) for _ in 1:100], amount=rand(100))
100×2 DataFrame
Row │ time amount
│ DateTime Float64
─────┼────────────────────────────────
1 │ 2020-01-02T16:00:00 0.29325
2 │ 2020-01-02T02:00:00 0.376917
3 │ 2020-01-02T09:00:00 0.11849
4 │ 2020-01-02T04:00:00 0.462997
⋮ │ ⋮ ⋮
97 │ 2020-01-02T18:00:00 0.750604
98 │ 2020-01-01T13:00:00 0.179414
99 │ 2020-01-01T15:00:00 0.552547
100 │ 2020-01-01T02:00:00 0.769066
92 rows omitted
julia> transform!(df, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour)
100×4 DataFrame
Row │ time amount date hour
│ DateTime Float64 Date Int64
─────┼───────────────────────────────────────────────────
1 │ 2020-01-02T16:00:00 0.29325 2020-01-02 16
2 │ 2020-01-02T02:00:00 0.376917 2020-01-02 2
3 │ 2020-01-02T09:00:00 0.11849 2020-01-02 9
4 │ 2020-01-02T04:00:00 0.462997 2020-01-02 4
⋮ │ ⋮ ⋮ ⋮ ⋮
97 │ 2020-01-02T18:00:00 0.750604 2020-01-02 18
98 │ 2020-01-01T13:00:00 0.179414 2020-01-01 13
99 │ 2020-01-01T15:00:00 0.552547 2020-01-01 15
100 │ 2020-01-01T02:00:00 0.769066 2020-01-01 2
92 rows omitted
julia> unstack(df, :hour, :date, :amount, combine=sum, fill=0)
24×3 DataFrame
Row │ hour 2020-01-02 2020-01-01
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 16 1.06636 0.949414
2 │ 2 0.990913 1.43032
3 │ 9 0.183206 3.16363
4 │ 4 1.24055 0.57196
⋮ │ ⋮ ⋮ ⋮
21 │ 10 0.0 0.492397
22 │ 14 0.393438 0.0
23 │ 21 0.0 0.487992
24 │ 8 0.848852 0.0
16 rows omitted
The final result is a data frame that gives you aggregates for all hours (in rows) for all days (in columns). The data is presented in the order of appearance, so you might want to sort the result by hour:
julia> res = sort!(unstack(df, :hour, :date, :amount, combine=sum, fill=0), :hour)
24×3 DataFrame
Row │ hour 2020-01-02 2020-01-01
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 0 1.99143 0.150979
2 │ 1 1.25939 0.860835
3 │ 2 0.990913 1.43032
4 │ 3 3.83337 2.33696
⋮ │ ⋮ ⋮ ⋮
21 │ 20 1.73576 1.93323
22 │ 21 0.0 0.487992
23 │ 22 1.52546 0.651938
24 │ 23 1.03808 0.0
16 rows omitted
Now you can extract information for a specific day just by extracting a column corresponding to it, e.g.:
julia> res."2020-01-02"
24-element Vector{Float64}:
1.991425180864845
1.2593855803084226
0.9909134301068651
3.833369559458414
1.2405519797178841
1.4494215475119732
⋮
2.4509665509554157
0.0
1.7357636571508785
0.0
1.525457178008634
1.0380772820126043
For the amount of data you have there should be no problem with getting all the results in one shot (in this example I pre-sorted the source data frame on day and hour to make the final table nicely ordered both by rows and columns):
julia> @time big = DataFrame(time=[DateTime(rand(1942:2012), rand(1:12), rand(1:28), rand(0:23)) for _ in 1:10^7], amount=rand(10^7));
0.413495 seconds (99.39 k allocations: 310.149 MiB, 3.75% gc time, 5.54% compilation time)
julia> @time sort!(transform!(big, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour), [:date, :hour]);
5.049808 seconds (1.03 M allocations: 1.167 GiB, 0.81% gc time)
julia> @time unstack(big, :hour, :date, :amount, combine=sum, fill=0)
1.342251 seconds (21.58 M allocations: 673.052 MiB, 13.63% gc time)
24×23857 DataFrame
Row │ hour 1942-01-01 1942-01-02 1942-01-03 1942-01-04 1942-01-05 1942-01-06 1942-01-07 1942-01-08 1942-01-09 1942-01-10 1942-01-11 194 ⋯
│ Int64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Flo ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0 9.19054 8.00765 6.99379 9.63979 6.5088 11.6281 12.4928 6.86322 11.4453 12.6505 10.0583 1 ⋯
2 │ 1 8.78977 8.32879 6.29344 12.0815 9.83297 8.24592 10.349 10.1213 6.51192 6.1523 8.38962
3 │ 2 5.51566 9.97157 12.1064 8.28468 11.1929 8.274 8.25525 7.88186 4.65225 7.44625 6.62251 1
4 │ 3 7.25526 13.1635 4.75877 9.77418 11.5427 6.30625 6.2512 8.06394 8.77394 12.5935 9.09008
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
21 │ 20 8.46999 9.99227 11.1116 14.5478 11.8379 7.38414 11.0567 6.17652 10.6811 9.059 9.77321 ⋯
22 │ 21 7.02998 10.0908 5.5182 8.8145 9.81238 10.8413 8.65648 12.6846 12.1116 8.75566 11.2892 1
23 │ 22 9.17824 13.2115 10.589 9.87813 10.7258 7.97428 12.8137 10.3456 8.37605 9.54897 7.24197
24 │ 23 13.0214 10.2333 9.08972 11.8678 7.36996 9.80802 11.0031 6.0818 11.7789 4.3467 7.49586
23845 columns and 16 rows omitted
EDIT
Here is an example of how you can use filter. I assume we work on the big data frame created above and want information for 1942-02-03 only. I am also using Chain.jl to nicely chain the performed operations:
julia> @chain big begin
filter(:date => ==(Date("1942-02-03")), _)
groupby(:hour, sort=true)
combine(:amount => sum)
end
24×2 DataFrame
Row │ hour amount_sum
│ Int64 Float64
─────┼───────────────────
1 │ 0 6.22427
2 │ 1 8.33195
3 │ 2 9.26992
4 │ 3 13.7858
⋮ │ ⋮ ⋮
21 │ 20 6.59938
22 │ 21 6.07788
23 │ 22 6.68741
24 │ 23 7.59147
16 rows omitted
(if anything is unclear please comment)

How to calculate the difference of two sums in SQL

I'm using Grafana to show some data from Clickhouse. The data comes from a table containing itime, count and some other columns.
id method count itime
1 aaa 12 2021-07-20 00:07:06
2 bbb 9 2021-07-20 00:07:06
3 ccc 7 2021-07-20 00:07:07
...
Now I can execute the following SQL to get the sum of count between two itimes:
SELECT toUnixTimestamp(toStartOfMinute(itime)) * 1000 as t,
method,
sum(count) as c
FROM me.my_table
WHERE itime BETWEEN toDateTime(1631870605) AND toDateTime(1631874205)
and method like 'a%'
GROUP BY method, t
HAVING c > 500
ORDER BY t
It works as expected.
Now I want to select the sum(count) filtered by the difference between sum(count) and the sum(count) from 7 days ago, something like SELECT ... FROM ... WHERE ... HAVING c - c_7_days_ago >= 100, but I don't know how.
create table test(D Date, Key Int64, Val Int64) Engine=Memory;
insert into test select today(), number, 100 from numbers(5);
insert into test select today()-7, number, 110 from numbers(5);
-- arrayJoin emits each daily sum twice: at its own date and shifted 7 days forward,
-- so grouping by the shifted date d1 aligns the current sum (s) with the value from 7 days earlier (s1).
select sx.2 d1, Key, sumIf(sx.1, D = sx.2) s, sumIf(sx.1, D != sx.2) s1
from (
    select D, Key, arrayJoin([(s, D), (s, D + interval 7 day)]) sx
    from (select D, Key, sum(Val) s from test group by D, Key)
)
group by d1, Key
order by d1, Key;
┌─────────d1─┬─Key─┬───s─┬──s1─┐
│ 2021-09-10 │   0 │ 110 │   0 │
│ 2021-09-10 │   1 │ 110 │   0 │
│ 2021-09-10 │   2 │ 110 │   0 │
│ 2021-09-10 │   3 │ 110 │   0 │
│ 2021-09-10 │   4 │ 110 │   0 │
│ 2021-09-17 │   0 │ 100 │ 110 │
│ 2021-09-17 │   1 │ 100 │ 110 │
│ 2021-09-17 │   2 │ 100 │ 110 │
│ 2021-09-17 │   3 │ 100 │ 110 │
│ 2021-09-17 │   4 │ 100 │ 110 │
│ 2021-09-24 │   0 │   0 │ 100 │
│ 2021-09-24 │   1 │   0 │ 100 │
│ 2021-09-24 │   2 │   0 │ 100 │
│ 2021-09-24 │   3 │   0 │ 100 │
│ 2021-09-24 │   4 │   0 │ 100 │
└────────────┴─────┴─────┴─────┘
-- Window-function alternative: RANGE BETWEEN 7 PRECEDING AND 7 PRECEDING picks,
-- for each Key, the Val from exactly 7 days earlier.
SELECT
D,
Key,
Val,
any(Val) OVER (PARTITION BY Key ORDER BY D ASC RANGE BETWEEN 7 PRECEDING AND 7 PRECEDING) Val1
FROM test
┌──────────D─┬─Key─┬─Val─┬─Val1─┐
│ 2021-09-10 │   0 │ 110 │    0 │
│ 2021-09-17 │   0 │ 100 │  110 │
│ 2021-09-10 │   1 │ 110 │    0 │
│ 2021-09-17 │   1 │ 100 │  110 │
│ 2021-09-10 │   2 │ 110 │    0 │
│ 2021-09-17 │   2 │ 100 │  110 │
│ 2021-09-10 │   3 │ 110 │    0 │
│ 2021-09-17 │   3 │ 100 │  110 │
│ 2021-09-10 │   4 │ 110 │    0 │
│ 2021-09-17 │   4 │ 100 │  110 │
└────────────┴─────┴─────┴──────┘
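Applied to the table from the question, a rough sketch of the same idea (my addition, not from the answer above; it assumes window-function support, per-minute buckets, and an itime filter widened to cover both weeks; 604800 seconds = 7 days):
SELECT
    t,
    method,
    c,
    c - c_7d_ago AS diff
FROM
(
    SELECT
        toStartOfMinute(itime) AS t,
        method,
        sum(count) AS c,
        -- value of the same minute bucket exactly 7 days earlier (0 if absent)
        any(sum(count)) OVER (PARTITION BY method ORDER BY t ASC RANGE BETWEEN 604800 PRECEDING AND 604800 PRECEDING) AS c_7d_ago
    FROM me.my_table
    WHERE method LIKE 'a%'
    GROUP BY method, t
)
WHERE c - c_7d_ago >= 100
ORDER BY t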
I had a similar problem a while ago;
please check the SQLFiddle
(to see the result, press the buttons: first "Build Schema", then "Run SQL").
Naming
I assumed that for the same period A you selected, you want a period B seven days earlier to compare against (you need to be more specific about what you are really looking for).
period A = your selected time period (between from and to)
period B = your selected time period one week in the past
Problem
This is a really delicate question, if I understood it right.
Your example is grouped by minute inside period A. This means you really need to have data in period A for every minute you have data in period B; otherwise you will ignore period B data inside your chosen period.
As you can see in the SQLFiddle, I made two query strings. The first one works but ignores B data. The second one does a right join (sadly MySQL does not support full outer joins to show everything in one table) and shows 2 ignored entries.
Grouping by method makes it even worse.
(In this case, for the fiddle, you have to change the last line of the join and add:)
as b on a.unix_itime = b.unix_itime and a.method = b.method
This means you need minute-wise data for every selected method and period.
It would be better to group only by method and not by time, as you already use a time condition (period A) to keep the result small,
or make the stepping coarser, by hour or day.
This code should fit your environment (MySQL does not support toUnixTimestamp, toStartOfMinute, toDateTime):
SELECT
a.unix_itime * 1000 as t,
a.method,
a.sum AS c,
b.sum AS c2,
ifnull(a.sum,0) - ifnull(b.sum,0) as diff
FROM (select method, sum(count) as sum, toUnixTimestamp(toStartOfMinute(itime)) as unix_itime
from my_table
WHERE method like 'a%' and
itime BETWEEN toDateTime(1631870605)
AND toDateTime(1631874205)
GROUP BY method, unix_itime)
as a
LEFT JOIN (select method, sum(count) as sum, toUnixTimestamp(toStartOfMinute(itime + INTERVAL 7 DAY)) as unix_itime
from my_table
WHERE method like 'a%' and
itime BETWEEN toDateTime(1631870605)- INTERVAL 7 DAY
AND toDateTime(1631874205)- INTERVAL 7 DAY
GROUP BY method, unix_itime)
as b on a.unix_itime = b.unix_itime and a.method = b.method
ORDER BY a.unix_itime;
The logic is slightly ambiguous, but this could produce one possible meaning of the above. If you still want to return overall SUM(count), just add that to the select list.
SELECT toUnixTimestamp(toStartOfMinute(itime)) * 1000 AS t
, method
, SUM(count) AS c
, SUM(count) - SUM(CASE WHEN itime < current_date - INTERVAL 7 DAY THEN count END) AS c2
FROM me.my_table
WHERE method like 'a%'
GROUP BY method, t
HAVING c2 >= 100
ORDER BY t
;
Adjust as needed.
Maybe you didn't want to return the difference, just filter the groups returned. If so, try this:
SELECT toUnixTimestamp(toStartOfMinute(itime)) * 1000 AS t
, method
, SUM(count) AS c
FROM me.my_table
WHERE method like 'a%'
GROUP BY method, t
HAVING SUM(count) - SUM(CASE WHEN itime < current_date - INTERVAL 7 DAY THEN count END) >= 100
ORDER BY t
;

SQL Query (ClickHouse): group by where timediff between values less then X

I need a little help with an SQL query. I'm using ClickHouse, but maybe standard SQL syntax is enough for this task.
I've got the following table:
event_time; Text; ID
2021-03-16 09:00:48; Example_1; 1
2021-03-16 09:00:49; Example_2; 1
2021-03-16 09:00:50; Example_3; 1
2021-03-16 09:15:48; Example_1_1; 1
2021-03-16 09:15:49; Example_2_2; 1
2021-03-16 09:15:50; Example_3_3; 1
What I want to have at the end for this example is 2 rows:
Example_1Example_2Example_3
Example_1_1Example_2_2Example_3_3
Concatenation of the Text field based on ID. The problem is that this ID is not unique over a longer time interval; it's only unique for about a minute, for example. So I want to concatenate only the rows where the difference between the first and the last row is less than a minute.
Right now I've got a query like:
SELECT arrayStringConcat(groupArray(Text))
FROM (SELECT event_time, Text, ID
FROM Test_Table
ORDER by event_time asc)
GROUP BY ID;
What kind of condition should I add here?
Here is an example
create table X(event_time DateTime, Text String, ID Int64) Engine=Memory;
insert into X values ('2021-03-16 09:00:48','Example_1', 1), ('2021-03-16 09:00:49','Example_2', 1), ('2021-03-16 09:00:50','Example_3', 1), ('2021-03-16 09:01:48','Example_4', 1), ('2021-03-16 09:01:49','Example_5', 1), ('2021-03-16 09:15:48','Example_1_1', 1), ('2021-03-16 09:15:49','Example_2_2', 1),('2021-03-16 09:15:50','Example_3_3', 1);
SELECT * FROM X
┌──────────event_time─┬─Text────────┬─ID─┐
│ 2021-03-16 09:00:48 │ Example_1   │  1 │
│ 2021-03-16 09:00:49 │ Example_2   │  1 │
│ 2021-03-16 09:00:50 │ Example_3   │  1 │
│ 2021-03-16 09:01:48 │ Example_4   │  1 │
│ 2021-03-16 09:01:49 │ Example_5   │  1 │
│ 2021-03-16 09:15:48 │ Example_1_1 │  1 │
│ 2021-03-16 09:15:49 │ Example_2_2 │  1 │
│ 2021-03-16 09:15:50 │ Example_3_3 │  1 │
└─────────────────────┴─────────────┴────┘
What result is expected in this case?
CH 21.3
set allow_experimental_window_functions = 1;
SELECT
ID,
y,
groupArray(event_time),
groupArray(Text)
FROM
(
SELECT
ID,
event_time,
Text,
max(event_time) OVER (PARTITION BY ID ORDER BY event_time ASC RANGE BETWEEN CURRENT ROW AND 60 FOLLOWING) AS y
FROM X
)
GROUP BY
ID,
y
ORDER BY
ID ASC,
y ASC
┌─ID─┬───────────────────y─┬─groupArray(event_time)────────────────────────────────────────────────────────────────────┬─groupArray(Text)──────────────────────────────────┐
│ 1 │ 2021-03-16 09:01:48 │ ['2021-03-16 09:00:48'] │ ['Example_1'] │
│ 1 │ 2021-03-16 09:01:49 │ ['2021-03-16 09:00:49','2021-03-16 09:00:50','2021-03-16 09:01:48','2021-03-16 09:01:49'] │ ['Example_2','Example_3','Example_4','Example_5'] │
│ 1 │ 2021-03-16 09:15:50 │ ['2021-03-16 09:15:48','2021-03-16 09:15:49','2021-03-16 09:15:50'] │ ['Example_1_1','Example_2_2','Example_3_3'] │
└────┴─────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────┘
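To get the concatenated strings asked for in the question, a sketch on top of the same grouping (my addition, not part of the original answer) could be:
SELECT
    ID,
    y,
    -- groupArray follows the subquery's row order; add ORDER BY event_time in the
    -- subquery if a strict concatenation order is required
    arrayStringConcat(groupArray(Text)) AS concatenated
FROM
(
    SELECT
        ID,
        event_time,
        Text,
        max(event_time) OVER (PARTITION BY ID ORDER BY event_time ASC RANGE BETWEEN CURRENT ROW AND 60 FOLLOWING) AS y
    FROM X
)
GROUP BY ID, y
ORDER BY ID ASC, y ASC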

Julia Dataframe group by inside another group by

I have a dataframe like the following:
julia> DataFrame(val=1:10, percent=nothing)
10×2 DataFrame
Row │ val percent
│ Int64 Nothing
─────┼────────────────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
5 │ 5
6 │ 6
7 │ 7
8 │ 8
9 │ 9
10 │ 10
I want to apply this:
percent(df, threshold=0.33) = df / sum(df) .> threshold
which calculates, for each value in a column, its share of the column total and checks whether it is above the threshold,
to a DataFrame that is grouped twice.
I grouped it by USER_KEY, and then I want to group again by each other column and then combine / apply the percent function to each group.
It doesn't work; I get
ERROR: MethodError: no method matching combine(::GroupedDataFrame{DataFrame}, ::var"#64#65")
and I don't understand this error.
If someone can help, thank you very much.
EDIT:
There is a small difference with this example that I don't know how to reproduce easily: besides these two columns I also have a column user_key where some keys can have many rows. I want to group by user_key and then group by val.
I want the percent column to hold each value's percentage of the total of the val column,
so for this dataframe the total is 10 and I want the result to be like this:
10×2 DataFrame
 Row │ val    percent
     │ Int64  Float64
─────┼────────────────
   1 │     1      0.1
   2 │     2      0.2
   3 │     3      0.3
   4 │     4      0.4
Let me give an answer to the question in the edited part, but probably this is not all that you need; please comment on the question so I can learn what more you need.
So the simplest approach to your problem is:
julia> df = DataFrame(val=1:4)
4×1 DataFrame
Row │ val
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> df.percent = df.val / sum(df.val)
4-element Array{Float64,1}:
0.1
0.2
0.3
0.4
julia> df
4×2 DataFrame
Row │ val percent
│ Int64 Float64
─────┼────────────────
1 │ 1 0.1
2 │ 2 0.2
3 │ 3 0.3
4 │ 4 0.4
alternatively you can use transform!:
julia> df = DataFrame(val=1:4)
4×1 DataFrame
Row │ val
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> transform!(df, :val => (x -> x / sum(x)) => :percent)
4×2 DataFrame
Row │ val percent
│ Int64 Float64
─────┼────────────────
1 │ 1 0.1
2 │ 2 0.2
3 │ 3 0.3
4 │ 4 0.4