Julia Dataframe group by inside another group by

Julia Dataframe group by inside another group by - dataframe

i have a dataframe like the following :
julia> DataFrame(val=1:10, percent=nothing)
10×2 DataFrame
Row │ val percent
│ Int64 Nothing
─────┼────────────────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
5 │ 5
6 │ 6
7 │ 7
8 │ 8
9 │ 9
10 │ 10
i want to apply this :
percent(df, threshold=0.33) = df / sum(df) .> threshold
which calculate the percentage and check if it's above threshold of a each value in a column compared with the total of the same column
to a DataFrame grouped by two times.
i grouped it by USER_KEY and then i want to group by again for each other column and then combine / apply the percent function to each.
It doesnt work i get
ERROR: MethodError: no method matching combine(::GroupedDataFrame{DataFrame}, ::var"#64#65")
i don't understand this error ...,
If someone can help thank you very much
EDIT :
There is a little difference with this example and i don't know how to reproduce it easily , it's that with these 2 columns i also have a column user_key where some keys can have many lines , i want to group by user_key and then group by val .
I want the column percent to have the percentage of the total of the column val
so for this dataframe the total is 10 i want the result to be like that :
10×2 DataFrame
Row │ val percent
│ Int64 Float64
─────┼────────────────
1 │ 1. 0.1
2 │ 2. 0.2
3 │ 3. 0.3
4 │ 4 0.4

Let me give an answer to the question in the edited part. But probably this is not all that you need - please comment in the question for me to learn what you need more.
So the simplest approach to your problem is:
julia> df = DataFrame(val=1:4)
4×1 DataFrame
Row │ val
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> df.percent = df.val / sum(df.val)
4-element Array{Float64,1}:
0.1
0.2
0.3
0.4
julia> df
4×2 DataFrame
Row │ val percent
│ Int64 Float64
─────┼────────────────
1 │ 1 0.1
2 │ 2 0.2
3 │ 3 0.3
4 │ 4 0.4
alternatively you can use transform!:
julia> df = DataFrame(val=1:4)
4×1 DataFrame
Row │ val
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> transform!(df, :val => (x -> x / sum(x)) => :percent)
4×2 DataFrame
Row │ val percent
│ Int64 Float64
─────┼────────────────
1 │ 1 0.1
2 │ 2 0.2
3 │ 3 0.3
4 │ 4 0.4

Related

Calculate difference between consecutive rows per group in dataframe Julia

I have the following dataframe called df:
using DataFrames
df = DataFrame(group = ["A", "A", "A", "A", "B", "B", "B", "B"],
value = [2,1,4,3,3,5,2,1])
8×2 DataFrame
Row │ group value
│ String Int64
─────┼───────────────
1 │ A 2
2 │ A 1
3 │ A 4
4 │ A 3
5 │ B 3
6 │ B 5
7 │ B 2
8 │ B 1
I would like to calculate the difference with previous values of consecutive rows in column value per group. The offset should have NaN, 0, or missing. Here is the desired output:
8×3 DataFrame
Row │ group value diff
│ String Int64 Float64
─────┼────────────────────────
1 │ A 2 NaN
2 │ A 1 -1.0
3 │ A 4 3.0
4 │ A 3 -1.0
5 │ B 3 NaN
6 │ B 5 2.0
7 │ B 2 -3.0
8 │ B 1 -2.0
So I was wondering if anyone knows how to calculate the difference between consecutive rows per group in Julia?

Using DataFrames.jl (you can replace missing by any value you like):
julia> select(groupby(df, :group),
:value => (x -> [missing; diff(x)]) => :diff)
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Using DataFramesMeta.jl:
julia> #chain df begin
groupby(:group)
#select :diff = [missing; diff(:value)]
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Normally diff in Julia like in e.g. R would produce one less row (and the syntax would be simpler:
julia> combine(groupby(df, :group), :value => diff => :diff)
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
julia> #chain df begin
groupby(:group)
#combine :diff = diff(:value)
end
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
Yet another way would be to use lag from ShiftedArrays.jl:
julia> using ShiftedArrays: lag
julia> #chain df begin
groupby(:group)
#combine :diff = :value - lag(:value)
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1

Compare elements in Julia DataFrame and return value in new column

I need to compare elements by rows of c1 and c2 columns in the DataFrame and return higher value in new column.
Column "Result" should return [6,5,4,4,5]
df = DataFrame(c1=[1,2,3,4,5], c2=[6,5,4,3,2])
println(df)
if broadcast(.>, df.c1, df.c2)
df[:, "Result"] .= df.c1
else
df[:, "Result"] .= df.c2
end
println(df5)
ERROR: TypeError: non-boolean (BitVector) used in boolean context

An alternative is:
julia> df.Result = max.(df.c1, df.c2)
5-element Vector{Int64}:
6
5
4
4
5
(as some users prefer such code than higher order functions presented excellently by #Shayan)

Using eachrow
julia> maximum.(eachrow(df))
5-element Vector{Int64}:
6
5
4
4
5
or as DataFrame
julia> DataFrame(new = maximum.(eachrow(df)))
5×1 DataFrame
Row │ new
│ Int64
─────┼───────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
or as a new column the DataFrame
julia> df.Result = maximum.(eachrow(df))
julia> df
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5

You can use select:
julia> select(df, All() => ByRow(max) => :Result)
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
Another alternative is using DataFramesMeta.jl:
julia> #select(df, :Result = max.(:c1, :c2))
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
# Alternatively, you can use the following line to avoid mentioning the column names manually:
#select(df, :Result = $(max.(propertynames(df)...)))
# Gives the same result.
If you want to make the change in place, then use select! or #select!.
If you prefer the returned dataframe to contain c1 and c2 columns as well, then you can go for transform (its alternative in-place operator is transform!):
julia> transform(df, All() => ByRow(max) => :Result)
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
# And the same thing using DataFramesMeta.jl:
julia> #transform(df, :Result = $(max.(propertynames(df)...)))
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 3
5 │ 5 2 2

Using Julia and dataframes plot the median of one columns based on bins from another column

my apologies if this is a simple question but I couldn't find a direct answer on the internet and I think it's a useful problem for others. I'm using Julia and DataFrame and I want to bin on column and then take the median of those bins and plot them in a histogram style plot.
Many thanks if you can help on this!

This is a basic approach:
julia> using Plots, Statistics, DataFrames
julia> df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
15×2 DataFrame
Row │ x y
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ a 4
5 │ b 5
6 │ c 6
7 │ a 7
8 │ b 8
9 │ c 9
10 │ a 10
11 │ b 11
12 │ c 12
13 │ a 13
14 │ b 14
15 │ c 15
julia> res = combine(groupby(df, :x, sort=true), :y => median)
3×2 DataFrame
Row │ x y_median
│ String Float64
─────┼──────────────────
1 │ a 7.0
2 │ b 8.0
3 │ c 9.0
julia> bar(res.x, res.y_median, legend=false)
which gives you:

How to make a list of data frames in Julia?

Given that I have some data frames with a single dimension, how can I create a list of all the data frames? Is it really as simple as just making a list and adding them in?

You could also use vcat to combine these data frames into a single one with an extra column indicating the source like this:
julia> c = vcat(a, b, source=:source => ["a", "b"])
8×2 DataFrame
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
5 │ 1 b
6 │ 2 b
7 │ 3 b
8 │ 4 b
This form is often easier to work with later. In particular if you then groupby the c data frame by :source like this:
julia> groupby(c, :source)
GroupedDataFrame with 2 groups based on key: source
First Group (4 rows): source = "a"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
⋮
Last Group (4 rows): source = "b"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 2 b
3 │ 3 b
4 │ 4 b
As a result you also get a collection of data frames (like the list that was created in the other answer), but this time you can apply functions supporting the split-apply-combine to it, see https://dataframes.juliadata.org/stable/man/split_apply_combine/.

One possible option that appears to work is the straightforward, "just add them to the list" method mentioned above. This would look like:
julia> a = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> b = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> c = [a, b]
2-element Vector{DataFrame}:
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> typeof(c)
Vector{DataFrame} (alias for Array{DataFrame, 1})

Is there a diff() function in Julia DataFrames like pandas?

I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In python pandas, I would simply use df.series.diff(). Is there a Julia equivelant?
For example:
data
1
2
4
6
7
# in pandas
df['diff_data'] = df.data.diff()
data diff_data
1 NaN
2 1
4 2
6 2
7 1

You can use ShiftedArrays.jl like this.
Declarative style:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
Imperative style (in place):
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
julia> df
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
with diff you do not need extra packages and can do similarly the following:
julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
(the issue is that diff is a general purpose function that does change the length of vector from n to n-1 so you have to add missing manually in front)

Pandas df.diff() does it to the whole data frame at once and allows you to specify row-wise or column-wise. There might be a better way but this is what I used before (I like chaining or piping like in dplyr):
# using chain.jl
#chain df begin
eachcol()
diff.()
DataFrame(:auto)
rename!(names(df))
end
# OR base pipe
df |>
x -> eachcol(x) |>
x -> diff.(x) |>
x -> DataFrame(x, :auto) |>
x -> rename!(x, names(df)[2:end])
# OR without piping
rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))
You might need to insert the starting row, which will now have missing values.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Julia Dataframe group by inside another group by - dataframe

Related

Calculate difference between consecutive rows per group in dataframe Julia

Compare elements in Julia DataFrame and return value in new column

Using Julia and dataframes plot the median of one columns based on bins from another column

How to make a list of data frames in Julia?

Is there a diff() function in Julia DataFrames like pandas?

Categories

Resources