Pick the nth element of every group in a grouped dataframe

Suppose I create the following dataframe
using DataFrames
df = DataFrame(A = rand(500), B = repeat(1:10, inner=50), C = 1:500)
and I can do a groupby:
grouped_df = groupby(df,"B")
I will end up with 10 groups. How can I choose, say, the third element of each group, and combine them into a new dataframe? That is, I would like a new dataframe of 10 rows, with each row being the third element of each of the groups?
I looked into combine, but couldn't find a solution. Can I get a hint?

To get the third row from every group, groupby first and then combine using indexing:
julia> combine(groupby(df, :B), x->x[3, :])
10×3 DataFrame
Row │ B A C
│ Int64 Float64 Int64
─────┼─────────────────────────
1 │ 1 0.196572 3
2 │ 2 0.539942 53
3 │ 3 0.243455 103
4 │ 4 0.837491 153
5 │ 5 0.672861 203
6 │ 6 0.0220219 253
7 │ 7 0.303417 303
8 │ 8 0.409596 353
9 │ 9 0.165928 403
10 │ 10 0.752038 453
(I had initially misread the question and suggested logical indexing like df[df.B .== 3, :])
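If some groups might have fewer than n rows, x[3, :] throws a BoundsError for those groups. Here is a minimal sketch of a guarded variant. The helper name nth_per_group is made up for illustration, and silently skipping short groups is an assumption about what you would want:

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): keep the n-th row of each group,
# skipping groups that have fewer than n rows.
function nth_per_group(df::AbstractDataFrame, col, n::Integer)
    # within each group, mark only the row whose position in the group equals n
    return subset(groupby(df, col), col => (v -> eachindex(v) .== n))
end

df = DataFrame(A = rand(500), B = repeat(1:10, inner=50), C = 1:500)
third = nth_per_group(df, :B, 3)   # one row per group, original column order
```

Unlike the combine version, this keeps the columns in their original order (A, B, C) rather than moving the grouping column to the front.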

Related

Compare elements in Julia DataFrame and return value in new column

I need to compare elements by rows of c1 and c2 columns in the DataFrame and return higher value in new column.
Column "Result" should return [6,5,4,4,5]
df = DataFrame(c1=[1,2,3,4,5], c2=[6,5,4,3,2])
println(df)
if broadcast(.>, df.c1, df.c2)
df[:, "Result"] .= df.c1
else
df[:, "Result"] .= df.c2
end
println(df5)
ERROR: TypeError: non-boolean (BitVector) used in boolean context

An alternative is:
julia> df.Result = max.(df.c1, df.c2)
5-element Vector{Int64}:
6
5
4
4
5
(some users prefer this style to the higher-order functions presented excellently by @Shayan)
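As for the error in the question: if needs a single Bool, while the broadcasted comparison returns a BitVector with one entry per row. If you want to keep the "compare, then pick" logic, a sketch using broadcasted ifelse could look like this:

```julia
using DataFrames

df = DataFrame(c1=[1,2,3,4,5], c2=[6,5,4,3,2])

# elementwise choice: take c1 where c1 > c2, otherwise c2
df.Result = ifelse.(df.c1 .> df.c2, df.c1, df.c2)
```

This gives the same column as max.(df.c1, df.c2); the ifelse form is mainly useful when the two branches are not simply the larger of the two values.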

Using eachrow
julia> maximum.(eachrow(df))
5-element Vector{Int64}:
6
5
4
4
5
or as DataFrame
julia> DataFrame(new = maximum.(eachrow(df)))
5×1 DataFrame
Row │ new
│ Int64
─────┼───────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
or as a new column of the DataFrame
julia> df.Result = maximum.(eachrow(df))
julia> df
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5

You can use select:
julia> select(df, All() => ByRow(max) => :Result)
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
Another alternative is using DataFramesMeta.jl:
julia> @select(df, :Result = max.(:c1, :c2))
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
# Passing column names programmatically with $ needs care: max.(propertynames(df)...)
# compares the names themselves (Symbols) and evaluates to :c2, so the line below
# copies column c2 instead of computing the row-wise maximum:
@select(df, :Result = $(max.(propertynames(df)...)))
If you want to make the change in place, then use select! or @select!.
If you prefer the returned dataframe to contain c1 and c2 columns as well, then you can go for transform (its alternative in-place operator is transform!):
julia> transform(df, All() => ByRow(max) => :Result)
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
# And the same thing using DataFramesMeta.jl:
julia> @transform(df, :Result = max.(:c1, :c2))
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
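One caveat with All(): it passes every column to max, so it stops working as soon as the data frame holds a column that cannot be compared with the others. A sketch that names the input columns explicitly and updates the data frame in place (the extra id column is made up for illustration):

```julia
using DataFrames

# id is a hypothetical extra column that should not take part in the comparison
df = DataFrame(id=["a","b","c","d","e"], c1=[1,2,3,4,5], c2=[6,5,4,3,2])

# only c1 and c2 are passed to max, row by row; df is modified in place
transform!(df, [:c1, :c2] => ByRow(max) => :Result)
```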

How do I calculate a new column and add to dataframe in Julia?

(New to Julia)
I'm trying to run this operation. Here's a minimal working example:
df = DataFrame(A = 1:4)
Row A
Int64
1 1
2 2
3 3
4 4
Just a dataframe with four values, 1-4. I want to add a new column where each value is equal to the element, plus the previous elements. In other words, I want:
Row A Row B
Int64 Int64
1 1 1
2 2 3
3 3 6
4 4 10
How can I do this?
I can write a function that calculates the desired number:
function first(j)
val = 0
while j != 0
val += df.A[j]
j -= 1
end
return val
end
Here j is the index of the element. This question also gives how to add a column after it's been calculated. However, I can't figure out how to turn these values into a new column. I suspect there should be an easier way than calculating the numbers, forming a column with it and then adding it to the dataframe, as well.

julia> df.B = cumsum(df.A);
julia> df
4×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 3
3 │ 3 6
4 │ 4 10
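If you would rather get a new data frame and leave the original untouched, the same computation can be written with transform. A minimal sketch:

```julia
using DataFrames

df = DataFrame(A = 1:4)

# cumsum receives the whole column A; the result is stored as column B of a copy
df2 = transform(df, :A => cumsum => :B)
```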

If you want to use your function here's a way. See list comprehensions and map.
using DataFrames
julia> df.B = [first(i) for i in 1:4]
4-element Vector{Int64}:
1
3
6
10
julia> df
4×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 3
3 │ 3 6
4 │ 4 10
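The same idea with map, as a sketch. Here the helper takes the vector as an argument instead of reading the global df, and it is called running_total (a made-up name) because a function named first may clash with Base.first:

```julia
using DataFrames

df = DataFrame(A = 1:4)

# hypothetical helper: sum of the first j elements of v
running_total(v, j) = sum(view(v, 1:j))

# apply the helper to every index of column A
df.B = map(j -> running_total(df.A, j), eachindex(df.A))
```

This recomputes each partial sum from scratch, so cumsum is still the better choice for long columns.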

Using Julia and dataframes plot the median of one columns based on bins from another column

My apologies if this is a simple question, but I couldn't find a direct answer on the internet and I think it's a useful problem for others. I'm using Julia and DataFrames and I want to bin on one column, then take the median of another column within those bins and plot the medians in a histogram-style plot.
Many thanks if you can help on this!

This is a basic approach:
julia> using Plots, Statistics, DataFrames
julia> df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
15×2 DataFrame
Row │ x y
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ a 4
5 │ b 5
6 │ c 6
7 │ a 7
8 │ b 8
9 │ c 9
10 │ a 10
11 │ b 11
12 │ c 12
13 │ a 13
14 │ b 14
15 │ c 15
julia> res = combine(groupby(df, :x, sort=true), :y => median)
3×2 DataFrame
Row │ x y_median
│ String Float64
─────┼──────────────────
1 │ a 7.0
2 │ b 8.0
3 │ c 9.0
julia> bar(res.x, res.y_median, legend=false)
which gives you a bar chart with one bar per group in x, showing the median of y.
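If the column you bin on is numeric rather than categorical, you first need to assign each row to a bin. A sketch with made-up data and assumed bin edges 0, 2, 4, 6, using only Base functions for the binning:

```julia
using DataFrames, Statistics

# made-up example data
df = DataFrame(x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5], y = [1, 2, 3, 4, 5, 6])

edges = 0:2:6   # assumed bin edges: [0,2), [2,4), [4,6)

# index of the bin each x falls into (position of the last edge <= x)
df.bin = searchsortedlast.(Ref(edges), df.x)

res = combine(groupby(df, :bin, sort=true), :y => median => :y_median)
```

res then has one row per bin and can be plotted with bar(res.bin, res.y_median, legend=false) in the same way as above.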

plotting two dataframes at the same time in Julia

I would like to plot the values Emean against T (like shown in the image below).
My guess is that there should be only two lines, since there are only two dataframes.
This means that the "connecting" line that I marked in yellow should not be there. Is there a way to "separate" the plots?

I assume you want to plot two lines as they are defined by grouping variable :L. If this is correct then you can do the following:
julia> using DataFrames
julia> using Plots
julia> using StatsPlots
julia> df = DataFrame(L=[1,1,1,2,2,2], T=[1,2,3,1,2,3], Emean=[1,2,3,4,5,6])
6×3 DataFrame
Row │ L T Emean
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 1
2 │ 1 2 2
3 │ 1 3 3
4 │ 2 1 4
5 │ 2 2 5
6 │ 2 3 6
julia> @df df plot(:T, :Emean, group=:L)
to get what you want.
Here I am using the functionality provided by the StatsPlots.jl package.
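If you prefer not to add StatsPlots.jl, the group keyword of Plots.jl also accepts a plain vector. A sketch on the same example data:

```julia
using DataFrames, Plots

df = DataFrame(L=[1,1,1,2,2,2], T=[1,2,3,1,2,3], Emean=[1,2,3,4,5,6])

# passing the grouping column as a vector draws one series per distinct value of L,
# so no line connects the last point of L=1 to the first point of L=2
p = plot(df.T, df.Emean, group = df.L, xlabel = "T", ylabel = "Emean")
```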

Is there a diff() function in Julia DataFrames like pandas?

I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In Python pandas, I would simply use df.series.diff(). Is there a Julia equivalent?
For example:
data
1
2
4
6
7
# in pandas
df['diff_data'] = df.data.diff()
data diff_data
1 NaN
2 1
4 2
6 2
7 1

You can use ShiftedArrays.jl like this.
Declarative style:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
Imperative style (in place):
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
julia> df
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
With diff you do not need extra packages and can similarly do the following:
julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
(the catch is that diff is a general-purpose function that shortens the vector from n to n-1 elements, so you have to add missing manually in front)
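The same padding trick also fits the declarative style, without ShiftedArrays.jl. A minimal sketch:

```julia
using DataFrames

df = DataFrame(data=[1, 2, 4, 6, 7])

# diff returns n-1 values, so prepend missing to keep the column length at n
res = transform(df, :data => (x -> [missing; diff(x)]) => :data_diff)
```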

Pandas df.diff() does it to the whole data frame at once and allows you to specify row-wise or column-wise. There might be a better way but this is what I used before (I like chaining or piping like in dplyr):
# using Chain.jl
@chain df begin
eachcol()
diff.()
DataFrame(:auto)
rename!(names(df))
end
# OR base pipe
df |>
x -> eachcol(x) |>
x -> diff.(x) |>
x -> DataFrame(x, :auto) |>
x -> rename!(x, names(df))
# OR without piping
rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))
You might need to insert the starting row, which will now have missing values.
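A sketch of the whole-frame version that keeps the starting row, assuming all columns are numeric (the example data is made up). mapcols applies a function to every column and keeps the column names, so no rename! is needed:

```julia
using DataFrames

# made-up example data
df = DataFrame(a = [1, 2, 4], b = [10, 30, 60])

# difference every column and pad the first row with missing
diffed = mapcols(col -> [missing; diff(col)], df)
```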
