I am looking for a way to shift a DataFrame column by more than one row.
Shifting by one row works fine:
using DataFrames, ShiftedArrays
df = DataFrame(A=[1,2,3,4], B=[9,8,7,6])
julia> transform(df, "A" => ShiftedArrays.lag => :A1)
4×3 DataFrame
Row │ A B A1
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 9 missing
2 │ 2 8 1
3 │ 3 7 2
4 │ 4 6 3
But I cannot figure out how to transform the entire column with a function that takes additional arguments. I tried something like this (neither attempt works):
transform(df, "A" => x -> ShiftedArrays.lag(x, 2) => :A1)
or
transform(df, ["A", 2] => f => :A1)
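I hope there is a more suitable solution than using a for loop :-)
You need additional parentheses around the anonymous function. Without them, the body of the -> function extends to the end of the expression, so => :A1 becomes part of the function body: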
transform(df, "A" => (x -> ShiftedArrays.lag(x, 2)) => :A1)
Result:
Row │ A B A1
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 9 missing
2 │ 2 8 missing
3 │ 3 7 1
4 │ 4 6 2
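The parenthesized anonymous function is one way to pass the extra argument. As a minimal sketch of two other spellings that sidestep the precedence problem (assuming DataFrames.jl and ShiftedArrays.jl are installed; the helper name lag2 is made up for illustration), you can use a named function or Base.Fix2:

```julia
using DataFrames, ShiftedArrays

df = DataFrame(A=[1, 2, 3, 4], B=[9, 8, 7, 6])

# Hypothetical helper: a named one-argument wrapper around a two-row lag.
lag2(x) = ShiftedArrays.lag(x, 2)
out_named = transform(df, :A => lag2 => :A1)

# Base.Fix2(f, n) builds the function x -> f(x, n), here x -> lag(x, 2).
out_fix2 = transform(df, :A => Base.Fix2(ShiftedArrays.lag, 2) => :A1)
```

Both calls should give the same A1 column as the parenthesized version; which one reads best is a matter of taste.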
I need to compare the c1 and c2 columns of a DataFrame row by row and return the higher value in a new column.
The "Result" column should contain [6,5,4,4,5].
df = DataFrame(c1=[1,2,3,4,5], c2=[6,5,4,3,2])
println(df)
if broadcast(.>, df.c1, df.c2)
df[:, "Result"] .= df.c1
else
df[:, "Result"] .= df.c2
end
println(df)
ERROR: TypeError: non-boolean (BitVector) used in boolean context
An alternative is:
julia> df.Result = max.(df.c1, df.c2)
5-element Vector{Int64}:
6
5
4
4
5
(some users prefer such code to the higher-order functions presented excellently by @Shayan)
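The error in the question arises because if expects a single Bool, while comparing two columns element-wise produces a BitVector. As a minimal sketch that stays close to the original if/else intent (assuming DataFrames.jl is installed), the broadcast form of ifelse picks a value per row:

```julia
using DataFrames

df = DataFrame(c1=[1, 2, 3, 4, 5], c2=[6, 5, 4, 3, 2])

# The comparison yields one Bool per row; ifelse. then selects
# the c1 value where it is true and the c2 value where it is false.
df.Result = ifelse.(df.c1 .> df.c2, df.c1, df.c2)
```

For a plain maximum, max.(df.c1, df.c2) is shorter; ifelse. is mainly useful when the two branches are not simply the larger of the two values.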
Using eachrow
julia> maximum.(eachrow(df))
5-element Vector{Int64}:
6
5
4
4
5
or as DataFrame
julia> DataFrame(new = maximum.(eachrow(df)))
5×1 DataFrame
Row │ new
│ Int64
─────┼───────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
or as a new column of the DataFrame
julia> df.Result = maximum.(eachrow(df))
julia> df
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
You can use select:
julia> select(df, All() => ByRow(max) => :Result)
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
Another alternative is using DataFramesMeta.jl:
julia> @select(df, :Result = max.(:c1, :c2))
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
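# Caution: to avoid naming the columns manually, do not write
# @select(df, :Result = $(max.(propertynames(df)...)))
# Here max is applied to the column names themselves (the Symbols :c1 and :c2) and picks :c2, so Result is just a copy of c2. Use the All() => ByRow(max) form shown above instead.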
If you want to make the change in place, then use select! or @select!.
If you prefer the returned data frame to contain the c1 and c2 columns as well, then you can use transform (its in-place counterpart is transform!):
julia> transform(df, All() => ByRow(max) => :Result)
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
# And the same thing using DataFramesMeta.jl:
julia> @transform(df, :Result = max.(:c1, :c2))
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
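Note that All() passes every column to max, which only makes sense when all columns are comparable. As a minimal sketch (assuming DataFrames.jl is installed; the extra label column is invented for illustration), listing the source columns explicitly keeps the row-wise maximum restricted to the columns you intend:

```julia
using DataFrames

# The label column is a hypothetical non-numeric column added for the example.
df = DataFrame(c1=[1, 2, 3, 4, 5], c2=[6, 5, 4, 3, 2],
               label=["a", "b", "c", "d", "e"])

# ByRow(max) receives one value from each listed column per row.
out = transform(df, [:c1, :c2] => ByRow(max) => :Result)
```

The same explicit column list also works with select if only the Result column is wanted.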
Suppose I have the following data frame:
julia> using DataFrames
julia> df = DataFrame(id=["a", "b", "a", "b", "b"], v=[1, 1, 1, 1, 2])
5×2 DataFrame
Row │ id v
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 1
3 │ a 1
4 │ b 1
5 │ b 2
I wanted to compute the number of unique values in column :v per group defined by column :id. I tried the following:
julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = "a"
Row │ id v
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
⋮
Last Group (3 rows): id = "b"
Row │ id v
│ String Int64
─────┼───────────────
1 │ b 1
2 │ b 1
3 │ b 2
julia> combine(gdf, :v => x -> length(unique(x)) => :len)
2×2 DataFrame
Row │ id v_function
│ String Pair…
─────┼────────────────────
1 │ a 1=>:len
2 │ b 2=>:len
But it does not produce the expected result. How to fix the call to combine?
This is a common issue. The problem is how Julia parses your transformation specification:
julia> :v => x -> length(unique(x)) => :len
:v => var"#3#4"()
As you can see, due to Julia's operator precedence rules the whole x -> length(unique(x)) => :len part is treated as the definition of an anonymous function. Instead, wrap the anonymous function in parentheses like this:
julia> combine(gdf, :v => (x -> length(unique(x))) => :len)
2×2 DataFrame
Row │ id len
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
Note also that in this case you could have used the function composition operator ∘ like this:
julia> combine(gdf, :v => length∘unique => :len)
2×2 DataFrame
Row │ id len
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
in which case you do not have to define an anonymous function explicitly so parentheses are not needed.
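The precedence issue can be checked in plain Julia without any packages. The following sketch (the variable names are made up for illustration) inspects what each spelling actually builds:

```julia
# Without parentheses: everything after -> is the function body,
# so the right-hand side of the outer pair is a single function.
spec_wrong = :v => x -> length(unique(x)) => :len

# With parentheses: the right-hand side is a (function => name) pair.
spec_right = :v => (x -> length(unique(x))) => :len

# With composition: no anonymous function, so no parentheses are needed.
spec_comp = :v => length ∘ unique => :len

# Calling the unparenthesized function returns a Pair, not a number.
wrong_value = spec_wrong.second([1, 1, 2])
right_value = spec_right.second.first([1, 1, 2])
comp_value = spec_comp.second.first([1, 1, 2])
```

This matches the v_function column of Pair values seen in the question: the function itself returned 2 => :len for each group.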
I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In Python pandas, I would simply use df.series.diff(). Is there a Julia equivalent?
For example:
data
1
2
4
6
7
# in pandas
df['diff_data'] = df.data.diff()
data diff_data
1 NaN
2 1
4 2
6 2
7 1
You can use ShiftedArrays.jl like this.
Declarative style:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
Imperative style (in place):
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
julia> df
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
With diff from Base you do not need extra packages and can do the following:
julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
(the caveat is that diff is a general-purpose function that shortens the vector from length n to n-1, so you have to prepend missing manually)
Pandas df.diff() operates on the whole data frame at once and lets you choose row-wise or column-wise differencing. There might be a better way, but this is what I have used before (I like chaining or piping as in dplyr):
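The padding idea generalizes to differences over more than one row, similar to the periods argument of pandas diff. A minimal Base-only sketch follows; the function name diff_padded is hypothetical, and it assumes an ordinary 1-based vector:

```julia
# Hypothetical helper: difference between each element and the one n rows
# earlier, padded with n leading missing values so the length is preserved.
function diff_padded(v::AbstractVector, n::Integer=1)
    1 <= n < length(v) || throw(ArgumentError("n must satisfy 1 <= n < length(v)"))
    return [fill(missing, n); v[n+1:end] .- v[1:end-n]]
end

data = [1, 2, 4, 6, 7]
d1 = diff_padded(data)      # same as [missing; diff(data)]
d2 = diff_padded(data, 2)   # difference to the value two rows earlier
```

The result can be assigned directly as a new column, for example df.data_diff = diff_padded(df.data).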
# using Chain.jl
@chain df begin
  eachcol()
  diff.()
  DataFrame(:auto)
  rename!(names(df))
end
# OR base pipe
df |>
  x -> eachcol(x) |>
  x -> diff.(x) |>
  x -> DataFrame(x, :auto) |>
  x -> rename!(x, names(df))
# OR without piping
rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))
You might need to insert the starting row, which will now have missing values.
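One way to get that leading row of missing values for every column at once is sketched below, assuming DataFrames.jl is installed and all columns are numeric (the example data is invented). mapcols applies a function to each column and keeps the column names, so no renaming is needed:

```julia
using DataFrames

# Invented example data with two numeric columns.
df = DataFrame(a=[1, 2, 4], b=[10, 20, 40])

# Prepend missing to each column's diff so the row count is unchanged.
out = mapcols(col -> [missing; diff(col)], df)
```

This mirrors the default column-wise behavior of pandas df.diff(), with missing taking the place of NaN.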
using DataFrames
df = DataFrame(a=1:3, b=1:3)
How do I create a new column c such that c = a+b element wise?
Can't figure it out by reading the transform doc.
I know that
df[!, :c] = df.a .+ df.b
works but I want to use transform in a chain like this
@chain df begin
  @transform(c = :a .+ :b)
  @where(...)
  groupby(...)
end
The above syntax doesn't work with DataFramesMeta.jl
This is an answer using DataFrames.jl.
To create a new data frame:
julia> transform(df, [:a,:b] => (+) => :c)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
and for an in-place operation:
julia> transform!(df, [:a,:b] => (+) => :c)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
or
julia> insertcols!(df, :c => df.a + df.b)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
The difference between transform! and insertcols! is that insertcols! will error if :c column is present in the data frame, while transform! will overwrite it.
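The difference can be seen in a small sketch (assuming DataFrames.jl is installed; the variable names overwritten and insert_failed are made up for illustration):

```julia
using DataFrames

df = DataFrame(a=1:3, b=1:3)

# First call creates :c; the second call silently overwrites it.
transform!(df, [:a, :b] => (+) => :c)
transform!(df, [:a, :b] => ((a, b) -> a .* b) => :c)
overwritten = df.c == [1, 4, 9]

# insertcols! refuses to add a column whose name already exists.
insert_failed = try
    insertcols!(df, :c => df.a + df.b)
    false
catch
    true
end
```

So insertcols! is the safer choice when an accidental overwrite would be a bug, and transform! is the convenient one when recomputing a column is intended.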
Suppose I have the following dataframe
using DataFrames
df = DataFrame(A = 1:10, B = ["a","a","b","b","b","c","c","c","c","d"])
grouped_df = groupby(df, "B")
I would have four groups. How can I drop the groups that have fewer than, say, 2 rows? For example, how can I keep only groups a, b, and c? I can easily do it with a for loop, but I don't think that is the optimal way.
If you want the result to be still grouped then filter is simplest:
julia> filter(x -> nrow(x) > 1, grouped_df)
GroupedDataFrame with 3 groups based on key: B
First Group (2 rows): B = "a"
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
⋮
Last Group (4 rows): B = "c"
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 6 c
2 │ 7 c
3 │ 8 c
4 │ 9 c
If you want to get a data frame as a result of one operation then do e.g.:
julia> combine(grouped_df, x -> nrow(x) < 2 ? DataFrame() : x)
9×2 DataFrame
Row │ B A
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 2
3 │ b 3
4 │ b 4
5 │ b 5
6 │ c 6
7 │ c 7
8 │ c 8
9 │ c 9