Suppose I create the following dataframe
using DataFrames
df = DataFrame(A = rand(500), B = repeat(1:10, inner=50), C = 1:500)
and I can do a groupby:
grouped_df = groupby(df,"B")
I will end up with 10 groups. How can I choose, say, the third element of each group, and combine them into a new dataframe? That is, I would like a new dataframe of 10 rows, with each row being the third element of each of the groups?
I looked into combine, but couldn't find a solution. Can I get a hint?
To get the third row from every group, groupby first and then combine using indexing:
julia> combine(groupby(df, :B), x->x[3, :])
10×3 DataFrame
Row │ B A C
│ Int64 Float64 Int64
─────┼─────────────────────────
1 │ 1 0.196572 3
2 │ 2 0.539942 53
3 │ 3 0.243455 103
4 │ 4 0.837491 153
5 │ 5 0.672861 203
6 │ 6 0.0220219 253
7 │ 7 0.303417 303
8 │ 8 0.409596 353
9 │ 9 0.165928 403
10 │ 10 0.752038 453
(I had initially misread the question and suggested logical indexing like df[df.B .== 3, :])
Related
I need to compare elements by rows of c1 and c2 columns in the DataFrame and return higher value in new column.
Column "Result" should return [6,5,4,4,5]
df = DataFrame(c1=[1,2,3,4,5], c2=[6,5,4,3,2])
println(df)
if broadcast(.>, df.c1, df.c2)
df[:, "Result"] .= df.c1
else
df[:, "Result"] .= df.c2
end
println(df5)
ERROR: TypeError: non-boolean (BitVector) used in boolean context
An alternative is:
julia> df.Result = max.(df.c1, df.c2)
5-element Vector{Int64}:
6
5
4
4
5
(as some users prefer such code than higher order functions presented excellently by #Shayan)
Using eachrow
julia> maximum.(eachrow(df))
5-element Vector{Int64}:
6
5
4
4
5
or as DataFrame
julia> DataFrame(new = maximum.(eachrow(df)))
5×1 DataFrame
Row │ new
│ Int64
─────┼───────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
or as a new column the DataFrame
julia> df.Result = maximum.(eachrow(df))
julia> df
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
You can use select:
julia> select(df, All() => ByRow(max) => :Result)
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
Another alternative is using DataFramesMeta.jl:
julia> #select(df, :Result = max.(:c1, :c2))
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
# Alternatively, you can use the following line to avoid mentioning the column names manually:
#select(df, :Result = $(max.(propertynames(df)...)))
# Gives the same result.
If you want to make the change in place, then use select! or #select!.
If you prefer the returned dataframe to contain c1 and c2 columns as well, then you can go for transform (its alternative in-place operator is transform!):
julia> transform(df, All() => ByRow(max) => :Result)
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
# And the same thing using DataFramesMeta.jl:
julia> #transform(df, :Result = $(max.(propertynames(df)...)))
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 3
5 │ 5 2 2
(New to Julia)
I'm trying to run this operation. Here's a minimal working example:
df = DataFrame(A = 1:4)
Row A
Int64
1 1
2 2
3 3
4 4
Just a dataframe with four values, 1-4. I want to add a new column where each value is equal to the element, plus the previous elements. In other words, I want:
Row A Row B
Int64 Int64
1 1 1
2 2 3
3 3 6
4 4 10
How can I do this?
I can write a function that calculates the desired number:
function first(j)
val = 0
while j != 0
val += df.A[j]
j -= 1
end
return val
end
Here j is the index of the element. This question also gives how to add a column after it's been calculated. However, I can't figure out how to turn these values into a new column. I suspect there should be an easier way than calculating the numbers, forming a column with it and then adding it to the dataframe, as well.
julia> df.B = cumsum(df.A);
julia> df
4×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 3
3 │ 3 6
4 │ 4 10
If you want to use your function here's a way. See list comprehensions and map.
using DataFrames
julia> df.B = [first(i) for i in 1:4]
4-element Vector{Int64}:
1
3
6
10
julia> df
4×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 3
3 │ 3 6
4 │ 4 10
my apologies if this is a simple question but I couldn't find a direct answer on the internet and I think it's a useful problem for others. I'm using Julia and DataFrame and I want to bin on column and then take the median of those bins and plot them in a histogram style plot.
Many thanks if you can help on this!
This is a basic approach:
julia> using Plots, Statistics, DataFrames
julia> df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
15×2 DataFrame
Row │ x y
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ a 4
5 │ b 5
6 │ c 6
7 │ a 7
8 │ b 8
9 │ c 9
10 │ a 10
11 │ b 11
12 │ c 12
13 │ a 13
14 │ b 14
15 │ c 15
julia> res = combine(groupby(df, :x, sort=true), :y => median)
3×2 DataFrame
Row │ x y_median
│ String Float64
─────┼──────────────────
1 │ a 7.0
2 │ b 8.0
3 │ c 9.0
julia> bar(res.x, res.y_median, legend=false)
which gives you:
I would like to plot the values Emean against T (like shown in the image below).
My guess is that there should be only two lines, since there are only two dataframes.
This means that the "connecting" line that I marked in yellow should not be there. Is there a way to "separate" the plots?
I assume you want to plot two lines as they are defined by grouping variable :L. If this is correct then you can do the following:
julia> using DataFrames
julia> using Plots
julia> using StatsPlots
julia> df = DataFrame(L=[1,1,1,2,2,2], T=[1,2,3,1,2,3], Emean=[1,2,3,4,5,6])
6×3 DataFrame
Row │ L T Emean
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 1
2 │ 1 2 2
3 │ 1 3 3
4 │ 2 1 4
5 │ 2 2 5
6 │ 2 3 6
julia> #df df plot(:T, :Emean, group=:L)
to get what you want.
Here I am using the functionality provided by the StatsPlots.jl package.
I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In python pandas, I would simply use df.series.diff(). Is there a Julia equivelant?
For example:
data
1
2
4
6
7
# in pandas
df['diff_data'] = df.data.diff()
data diff_data
1 NaN
2 1
4 2
6 2
7 1
You can use ShiftedArrays.jl like this.
Declarative style:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
Imperative style (in place):
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
julia> df
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
with diff you do not need extra packages and can do similarly the following:
julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
(the issue is that diff is a general purpose function that does change the length of vector from n to n-1 so you have to add missing manually in front)
Pandas df.diff() does it to the whole data frame at once and allows you to specify row-wise or column-wise. There might be a better way but this is what I used before (I like chaining or piping like in dplyr):
# using chain.jl
#chain df begin
eachcol()
diff.()
DataFrame(:auto)
rename!(names(df))
end
# OR base pipe
df |>
x -> eachcol(x) |>
x -> diff.(x) |>
x -> DataFrame(x, :auto) |>
x -> rename!(x, names(df)[2:end])
# OR without piping
rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))
You might need to insert the starting row, which will now have missing values.