In Julia DataFrame, how can I do a group by and use the value of the next rows?
For example:
using DataFrames, DataFramesMeta
df = DataFrame(grp=["one", "one", "two", "two", "three"], val=[1, 2, 3, 4, 5])
# Row │ grp val
# │ String Int64
#─────┼───────────────
# 1 │ one 1
# 2 │ one 2
# 3 │ two 3
# 4 │ two 4
# 5 │ three 5
@combine(groupby(df, :grp),
:count = length(:val),
:first_val = first(:val),
# :next_val = next(:val)
)
#3×3 DataFrame
# Row │ grp count first_val
# │ String Int64 Int64
#─────┼──────────────────────────
# 1 │ one 2 1
# 2 │ two 2 3
# 3 │ three 1 5
# I would like to obtain:
# Row │ grp count first_val next_val
# │ String Int64 Int64
#─────┼──────────────────────────
# 1 │ one 2 1 2
# 2 │ two 2 3 4
# 3 │ three 1 5 NA
With Julia DataFrames.jl it would be e.g.:
julia> combine(groupby(df, :grp),
nrow => :count,
:val => first => :first_val,
:val => (x -> length(x) > 1 ? x[2] : missing) => :next_val)
3×4 DataFrame
Row │ grp count first_val next_val
│ String Int64 Int64 Int64?
─────┼────────────────────────────────────
1 │ one 2 1 2
2 │ two 2 3 4
3 │ three 1 5 missing
and if you accept additional packages then with ShiftedArrays.jl it would be:
julia> using ShiftedArrays
julia> combine(groupby(df, :grp),
nrow => :count,
:val => first => :first_val,
:val => first∘lead => :next_val)
3×4 DataFrame
Row │ grp count first_val next_val
│ String Int64 Int64 Int64?
─────┼────────────────────────────────────
1 │ one 2 1 2
2 │ two 2 3 4
3 │ three 1 5 missing
And here is the same but with auto-generated column names:
julia> combine(groupby(df, :grp), nrow, :val => first, :val => first∘lead)
3×4 DataFrame
Row │ grp nrow val_first val_first_lead
│ String Int64 Int64 Int64?
─────┼──────────────────────────────────────────
1 │ one 2 1 2
2 │ two 2 3 4
3 │ three 1 5 missing
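If you would rather not write the length check by hand, Base's `get` with a default value does the same job. A minimal sketch, where `second_or_missing` is a hypothetical helper name and not something DataFrames.jl provides:

```julia
# Hypothetical helper: return the second element of a vector,
# or `missing` when the vector has fewer than two elements.
second_or_missing(x::AbstractVector) = get(x, 2, missing)

second_or_missing([1, 2])   # 2
second_or_missing([5])      # missing
```

It can be passed in place of the anonymous function, e.g. `:val => second_or_missing => :next_val`.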
I have the following dataframe called df:
using DataFrames
df = DataFrame(group = ["A", "A", "A", "A", "B", "B", "B", "B"],
value = [2,1,4,3,3,5,2,1])
8×2 DataFrame
Row │ group value
│ String Int64
─────┼───────────────
1 │ A 2
2 │ A 1
3 │ A 4
4 │ A 3
5 │ B 3
6 │ B 5
7 │ B 2
8 │ B 1
I would like to calculate, per group, the difference between each value in column value and the value in the previous row. The first row of each group has no previous value, so it should be NaN, 0, or missing. Here is the desired output:
8×3 DataFrame
Row │ group value diff
│ String Int64 Float64
─────┼────────────────────────
1 │ A 2 NaN
2 │ A 1 -1.0
3 │ A 4 3.0
4 │ A 3 -1.0
5 │ B 3 NaN
6 │ B 5 2.0
7 │ B 2 -3.0
8 │ B 1 -1.0
So I was wondering if anyone knows how to calculate the difference between consecutive rows per group in Julia?
Using DataFrames.jl (you can replace missing by any value you like):
julia> select(groupby(df, :group),
:value => (x -> [missing; diff(x)]) => :diff)
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
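To make the "any value you like" remark concrete, here is a small sketch on a plain vector (the values of group "A") showing how the choice of filler affects the element type of the result. The variable names are illustrative only:

```julia
x = [2, 1, 4, 3]                    # the values of group "A"

with_missing = [missing; diff(x)]   # Vector{Union{Missing, Int64}}
with_nan     = [NaN; diff(x)]       # promoted to Vector{Float64}
with_zero    = [0; diff(x)]         # stays Vector{Int64}
```

Using NaN gives the Float64 column shown in the desired output of the question, while missing keeps the integers and yields an Int64? column.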
Using DataFramesMeta.jl:
julia> @chain df begin
groupby(:group)
@select :diff = [missing; diff(:value)]
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Normally diff in Julia, as in e.g. R, produces one row fewer per group (and the syntax is then simpler):
julia> combine(groupby(df, :group), :value => diff => :diff)
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
julia> @chain df begin
groupby(:group)
@combine :diff = diff(:value)
end
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
Yet another way would be to use lag from ShiftedArrays.jl:
julia> using ShiftedArrays: lag
julia> @chain df begin
groupby(:group)
@combine :diff = :value - lag(:value)
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
I need to compare the c1 and c2 columns of a DataFrame row by row and put the higher value in a new column.
The "Result" column should contain [6,5,4,4,5].
df = DataFrame(c1=[1,2,3,4,5], c2=[6,5,4,3,2])
println(df)
if broadcast(.>, df.c1, df.c2)
df[:, "Result"] .= df.c1
else
df[:, "Result"] .= df.c2
end
println(df5)
ERROR: TypeError: non-boolean (BitVector) used in boolean context
An alternative is:
julia> df.Result = max.(df.c1, df.c2)
5-element Vector{Int64}:
6
5
4
4
5
(some users prefer this kind of code to the higher-order functions presented excellently by @Shayan)
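The TypeError in the question arises because `if` needs a single Bool, while the comparison of two columns produces a BitVector with one Bool per row. If you want to keep the if/else shape of the original attempt, a sketch on plain vectors using the broadcast form of Base's `ifelse` could look like this (`mask` and `result` are illustrative names):

```julia
c1 = [1, 2, 3, 4, 5]
c2 = [6, 5, 4, 3, 2]

mask = c1 .> c2                  # BitVector, one Bool per row
result = ifelse.(mask, c1, c2)   # take c1 where mask is true, otherwise c2
```

For a plain maximum, `max.(c1, c2)` is shorter; `ifelse.` is mainly useful when the two branches are not simply the larger of the two values.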
Using eachrow
julia> maximum.(eachrow(df))
5-element Vector{Int64}:
6
5
4
4
5
or as DataFrame
julia> DataFrame(new = maximum.(eachrow(df)))
5×1 DataFrame
Row │ new
│ Int64
─────┼───────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
or as a new column of the DataFrame
julia> df.Result = maximum.(eachrow(df))
julia> df
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
You can use select:
julia> select(df, All() => ByRow(max) => :Result)
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
Another alternative is using DataFramesMeta.jl:
julia> @select(df, :Result = max.(:c1, :c2))
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
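# Note that max.(propertynames(df)...) cannot be used inside $() to avoid naming the columns: it compares the column names themselves (max(:c1, :c2) is :c2), so $(...) would refer to column c2 rather than compute the row-wise maximum.
# To cover all columns without listing them, use the All() => ByRow(max) form shown above.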
If you want to make the change in place, then use select! or @select!.
If you prefer the returned data frame to contain the c1 and c2 columns as well, use transform (its in-place counterpart is transform!):
julia> transform(df, All() => ByRow(max) => :Result)
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
# And the same thing using DataFramesMeta.jl:
julia> @transform(df, :Result = max.(:c1, :c2))
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
I am looking for a way how to shift DataFrame column by more rows.
Shifting by one row works fine:
df = DataFrame(A=[1,2,3,4], B=[9,8,7,6])
julia> transform(df, "A" => ShiftedArrays.lag => :A1)
4×3 DataFrame
Row │ A B A1
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 9 missing
2 │ 2 8 1
3 │ 3 7 2
4 │ 4 6 3
But I am not able to find out how to transform the entire column with a function with more arguments, something like this (neither works):
transform(df, "A" => x -> ShiftedArrays.lag(x, 2) => :A1)
or
transform(df, ["A", 2] => f => :A1)
I hope there is a more suitable solution than using a for loop :-)
You need additional parentheses around the anonymous function:
transform(df, "A" => (x -> ShiftedArrays.lag(x, 2)) => :A1)
Result:
Row │ A B A1
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 9 missing
2 │ 2 8 missing
3 │ 3 7 1
4 │ 4 6 2
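If you prefer to avoid the extra package, the same shift can be sketched with Base only. `lag_n` below is a hypothetical helper written for illustration, not part of ShiftedArrays.jl or DataFrames.jl, and it assumes 0 <= n <= length(x):

```julia
# Hypothetical helper: shift `x` down by `n` positions,
# padding the top with `missing`. Assumes 0 <= n <= length(x).
lag_n(x::AbstractVector, n::Integer) = vcat(fill(missing, n), x[1:end-n])

lag_n([1, 2, 3, 4], 2)   # [missing, missing, 1, 2]
```

It is used the same way as above, including the parentheses around the anonymous function: `transform(df, "A" => (x -> lag_n(x, 2)) => :A1)`. Unlike ShiftedArrays.lag, which returns a lazy view, this sketch allocates a new vector.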
Suppose I have the following data frame:
julia> using DataFrames
julia> df = DataFrame(id=["a", "b", "a", "b", "b"], v=[1, 1, 1, 1, 2])
5×2 DataFrame
Row │ id v
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 1
3 │ a 1
4 │ b 1
5 │ b 2
I wanted to compute the number of unique values in column :v per group defined by column :id. I tried the following:
julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = "a"
Row │ id v
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
⋮
Last Group (3 rows): id = "b"
Row │ id v
│ String Int64
─────┼───────────────
1 │ b 1
2 │ b 1
3 │ b 2
julia> combine(gdf, :v => x -> length(unique(x)) => :len)
2×2 DataFrame
Row │ id v_function
│ String Pair…
─────┼────────────────────
1 │ a 1=>:len
2 │ b 2=>:len
But it does not produce the expected result. How to fix the call to combine?
This is a common issue. The problem is how Julia parses your transformation specification:
julia> :v => x -> length(unique(x)) => :len
:v => var"#3#4"()
As you can see, because of Julia's operator precedence rules the whole x -> length(unique(x)) => :len part is treated as the definition of an anonymous function, one that returns a Pair. Instead you should wrap the definition of the anonymous function in parentheses like this:
julia> combine(gdf, :v => (x -> length(unique(x))) => :len)
2×2 DataFrame
Row │ id len
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
Note also that in this case you could have used the function composition operator ∘ like this:
julia> combine(gdf, :v => length∘unique => :len)
2×2 DataFrame
Row │ id len
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
in which case you do not have to define an anonymous function explicitly so parentheses are not needed.
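The precedence issue can be checked without DataFrames.jl at all, since the specification is just nested Pair objects. A small sketch (`wrong` and `right` are illustrative names):

```julia
wrong = :v => x -> length(unique(x)) => :len
right = :v => (x -> length(unique(x))) => :len

wrong.second isa Function    # true: the lambda body swallowed `=> :len`
wrong.second([1, 1, 2])      # 2 => :len, which explains the Pair… column above
right.second isa Pair        # true: (function => :len)
right.second.second          # :len
```

This is why the unparenthesized version yields a column named v_function holding Pair values: DataFrames.jl only ever sees a source column and a function, never the target name.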
I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In Python pandas, I would simply use df.series.diff(). Is there a Julia equivalent?
For example:
data
1
2
4
6
7
# in pandas
df['diff_data'] = df.data.diff()
data diff_data
1 NaN
2 1
4 2
6 2
7 1
You can use ShiftedArrays.jl like this.
Declarative style:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
Imperative style (in place):
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
julia> df
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
With diff you do not need extra packages and can get the same result like this:
julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
(the catch is that diff is a general-purpose function that shortens a vector from length n to n-1, so you have to prepend missing manually)
Pandas df.diff() does it to the whole data frame at once and allows you to specify row-wise or column-wise. There might be a better way but this is what I used before (I like chaining or piping like in dplyr):
# using Chain.jl
using Chain
@chain df begin
eachcol()
diff.()
DataFrame(:auto)
rename!(names(df))
end
# OR base pipe
df |>
x -> eachcol(x) |>
x -> diff.(x) |>
x -> DataFrame(x, :auto) |>
x -> rename!(x, names(df))
# OR without piping
rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))
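Because diff drops the first element, the result has one row fewer than df. If you need the original number of rows, you have to insert the starting row of missing values yourself.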
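One way to get the leading row of missing values for every column at once is DataFrames.jl's `mapcols`, which applies a function to each column and keeps the column names. A minimal sketch, assuming every column of df is numeric (`out` is an illustrative name):

```julia
using DataFrames

df = DataFrame(data = [1, 2, 4, 6, 7])

# Apply the same function to every column; each column keeps its name.
out = mapcols(x -> [missing; diff(x)], df)
```

Here out has the same number of rows as df, with missing in the first row of each column, which is close to what pandas df.diff() returns (pandas uses NaN instead of missing).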