Remove groups by condition

Remove groups by condition - dataframe

Suppose I have the following dataframe
using DataFrames
df = DataFrame(A = 1:10, B = ["a","a","b","b","b","c","c","c","c","d"])
grouped_df = groupby(df, "B")
I would have four groups. How can I drop the groups that have fewer than, say, 2 rows? For example, how can I keep only groups a, b, and c? I can easily do it with a for loop, but I don't think that is the optimal way.

If you want the result to remain grouped, then filter is simplest:
julia> filter(x -> nrow(x) > 1, grouped_df)
GroupedDataFrame with 3 groups based on key: B
First Group (2 rows): B = "a"
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
⋮
Last Group (4 rows): B = "c"
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 6 c
2 │ 7 c
3 │ 8 c
4 │ 9 c
If you want to get a data frame as a result of one operation then do e.g.:
julia> combine(grouped_df, x -> nrow(x) < 2 ? DataFrame() : x)
9×2 DataFrame
Row │ B A
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 2
3 │ b 3
4 │ b 4
5 │ b 5
6 │ c 6
7 │ c 7
8 │ c 8
9 │ c 9
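The two approaches above can also be chained: filter the groups first, then flatten whatever survives. The following is only a minimal sketch under that assumption; the names min_rows, kept, flat, and kept_keys are made up for illustration and are not part of the original answer.

```julia
using DataFrames

df = DataFrame(A = 1:10, B = ["a","a","b","b","b","c","c","c","c","d"])
grouped_df = groupby(df, :B)

# Hypothetical threshold: keep groups with at least this many rows.
min_rows = 2

# Each group is passed to the predicate as a SubDataFrame.
kept = filter(sdf -> nrow(sdf) >= min_rows, grouped_df)

# Flatten the surviving groups back into an ordinary data frame.
flat = DataFrame(kept)

# Inspect which group keys survived the filter.
kept_keys = [key.B for key in keys(kept)]
```

Keeping the threshold in a variable makes it easy to change the condition without touching the rest of the pipeline.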

Related

Calculate difference between consecutive rows per group in dataframe Julia

I have the following dataframe called df:
using DataFrames
df = DataFrame(group = ["A", "A", "A", "A", "B", "B", "B", "B"],
value = [2,1,4,3,3,5,2,1])
8×2 DataFrame
Row │ group value
│ String Int64
─────┼───────────────
1 │ A 2
2 │ A 1
3 │ A 4
4 │ A 3
5 │ B 3
6 │ B 5
7 │ B 2
8 │ B 1
I would like to calculate, per group, the difference between each value in column value and the previous row's value. The first row of each group, which has no previous value, should hold NaN, 0, or missing. Here is the desired output:
8×3 DataFrame
Row │ group value diff
│ String Int64 Float64
─────┼────────────────────────
1 │ A 2 NaN
2 │ A 1 -1.0
3 │ A 4 3.0
4 │ A 3 -1.0
5 │ B 3 NaN
6 │ B 5 2.0
7 │ B 2 -3.0
8 │ B 1 -1.0
So I was wondering if anyone knows how to calculate the difference between consecutive rows per group in Julia?

Using DataFrames.jl (you can replace missing by any value you like):
julia> select(groupby(df, :group),
:value => (x -> [missing; diff(x)]) => :diff)
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Using DataFramesMeta.jl:
julia> @chain df begin
           groupby(:group)
           @select :diff = [missing; diff(:value)]
       end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Normally diff in Julia, as in e.g. R, produces one row fewer per group (and the syntax is simpler):
julia> combine(groupby(df, :group), :value => diff => :diff)
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
julia> @chain df begin
           groupby(:group)
           @combine :diff = diff(:value)
       end
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
Yet another way would be to use lag from ShiftedArrays.jl:
julia> using ShiftedArrays: lag
julia> @chain df begin
           groupby(:group)
           @combine :diff = :value - lag(:value)
       end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
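Note that the select-based calls above return only the group and diff columns, while the desired output in the question also keeps value. A minimal sketch of one way to keep it, using transform from DataFrames.jl instead of select, is shown here. The helper name lagged_diff and its pad keyword are hypothetical and exist only for this illustration.

```julia
using DataFrames

df = DataFrame(group = ["A", "A", "A", "A", "B", "B", "B", "B"],
               value = [2, 1, 4, 3, 3, 5, 2, 1])

# Hypothetical helper: per-group differences with a leading filler value,
# so the result has the same length as the input vector.
lagged_diff(x; pad = missing) = [pad; diff(x)]

# transform keeps the original columns and row order and adds :diff.
with_missing = transform(groupby(df, :group), :value => lagged_diff => :diff)

# Using NaN as the filler promotes the column to Float64.
with_nan = transform(groupby(df, :group),
                     :value => (x -> lagged_diff(x; pad = NaN)) => :diff)
```

With pad = NaN the diff column becomes Float64, matching the column type in the question's desired output; with missing it stays an integer column that allows missing values.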

Compare elements in Julia DataFrame and return value in new column

I need to compare the elements of columns c1 and c2 row by row in the DataFrame and return the higher value in a new column.
Column "Result" should return [6,5,4,4,5]
df = DataFrame(c1=[1,2,3,4,5], c2=[6,5,4,3,2])
println(df)
if broadcast(.>, df.c1, df.c2)
df[:, "Result"] .= df.c1
else
df[:, "Result"] .= df.c2
end
println(df5)
ERROR: TypeError: non-boolean (BitVector) used in boolean context

An alternative is:
julia> df.Result = max.(df.c1, df.c2)
5-element Vector{Int64}:
6
5
4
4
5
(some users prefer such code to the higher-order functions presented excellently by @Shayan)
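As for why the code in the question fails: if expects a single Bool, but df.c1 .> df.c2 is a whole vector of them. If you want to keep the "pick one or the other" logic rather than call max, an element-wise ifelse is one option. This is only a sketch of that idea, not the approach taken in the answers here.

```julia
using DataFrames

df = DataFrame(c1 = [1, 2, 3, 4, 5], c2 = [6, 5, 4, 3, 2])

# The comparison yields one Bool per row, not a single Bool.
mask = df.c1 .> df.c2

# ifelse is broadcast over the rows: take c1 where the mask is true, else c2.
df.Result = ifelse.(mask, df.c1, df.c2)
```

This form generalizes to conditions that max cannot express, for example choosing between two columns based on a third.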

Using eachrow
julia> maximum.(eachrow(df))
5-element Vector{Int64}:
6
5
4
4
5
or as DataFrame
julia> DataFrame(new = maximum.(eachrow(df)))
5×1 DataFrame
Row │ new
│ Int64
─────┼───────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
or as a new column of the DataFrame
julia> df.Result = maximum.(eachrow(df))
julia> df
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5

You can use select:
julia> select(df, All() => ByRow(max) => :Result)
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
Another alternative is using DataFramesMeta.jl:
julia> @select(df, :Result = max.(:c1, :c2))
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
# To avoid listing the column names manually, use the All() => ByRow(max) form shown above.
# Note that $(max.(propertynames(df)...)) does not do this: max.(:c1, :c2) compares the
# two symbols and returns :c2, so the expression just copies column c2.
If you want to make the change in place, then use select! or @select!.
If you prefer the returned dataframe to contain c1 and c2 columns as well, then you can go for transform (its alternative in-place operator is transform!):
julia> transform(df, All() => ByRow(max) => :Result)
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
# And the same thing using DataFramesMeta.jl:
julia> @transform(df, :Result = max.(:c1, :c2))
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5

Order columns alphabetically in dataframe Julia

I have a dataframe like this:
using DataFrames
df = DataFrame(C = [1,2,3],
A = [1,1,1],
B = [3,2,1])
3×3 DataFrame
Row │ C A B
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 3
2 │ 2 1 2
3 │ 3 1 1
I would like to order these columns alphabetically, which would of course give A, B, and C. How can we do this with a data frame in Julia? There could be more than 3 columns. In R we could use the order function.

How about using names with sortperm
julia> df[:, sortperm(names(df))]
3×3 DataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 3 1
2 │ 1 2 2
3 │ 1 1 3

The first method I came up with is:
julia> select(df, sort(propertynames(df)))
3×3 DataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 3 1
2 │ 1 2 2
3 │ 1 1 3
and to mutate df into ordered form:
select!(df, sort(propertynames(df)))
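The same idea works with column names as strings, and sort accepts the usual keywords if you need a different order. A small sketch, where the variable names sorted_df and descending_df are just for illustration:

```julia
using DataFrames

df = DataFrame(C = [1, 2, 3],
               A = [1, 1, 1],
               B = [3, 2, 1])

# names(df) returns the column names as strings; sort orders them alphabetically.
sorted_df = select(df, sort(names(df)))

# Reverse alphabetical order via the rev keyword of sort.
descending_df = select(df, sort(names(df), rev = true))
```

Both calls return a new data frame and leave df itself untouched.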

Using Julia and dataframes plot the median of one column based on bins from another column

My apologies if this is a simple question, but I couldn't find a direct answer on the internet and I think it's a useful problem for others. I'm using Julia and DataFrames, and I want to bin on one column, then take the median of another column within each bin and plot the medians in a histogram-style plot.
Many thanks if you can help on this!

This is a basic approach:
julia> using Plots, Statistics, DataFrames
julia> df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
15×2 DataFrame
Row │ x y
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ a 4
5 │ b 5
6 │ c 6
7 │ a 7
8 │ b 8
9 │ c 9
10 │ a 10
11 │ b 11
12 │ c 12
13 │ a 13
14 │ b 14
15 │ c 15
julia> res = combine(groupby(df, :x, sort=true), :y => median)
3×2 DataFrame
Row │ x y_median
│ String Float64
─────┼──────────────────
1 │ a 7.0
2 │ b 8.0
3 │ c 9.0
julia> bar(res.x, res.y_median, legend=false)
which gives you a bar chart with one bar per value of x, whose height is the median of y in that group.
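The example above groups on a column that is already categorical. If the column to bin on is numeric, one possible approach is to compute a bin label first and group on that. The sketch below assumes fixed-width bins; the data, bin_width, and the bin column are all made up for illustration.

```julia
using DataFrames, Statistics

# Made-up numeric data: x is the column to bin on, y the column to summarize.
df = DataFrame(x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
               y = [1, 2, 3, 4, 5, 6])

# Hypothetical fixed bin width; each row is labelled with the left edge of its bin.
bin_width = 2.0
df.bin = floor.(df.x ./ bin_width) .* bin_width

# Median of y within each bin, with the bins in sorted order.
res = combine(groupby(df, :bin, sort = true), :y => median => :y_median)
```

The resulting res.bin and res.y_median columns can then be passed to bar in the same way as res.x and res.y_median above.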

How to make a list of data frames in Julia?

Given that I have some data frames with a single dimension, how can I create a list of all the data frames? Is it really as simple as just making a list and adding them in?

You could also use vcat to combine these data frames into a single one with an extra column indicating the source like this:
julia> c = vcat(a, b, source=:source => ["a", "b"])
8×2 DataFrame
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
5 │ 1 b
6 │ 2 b
7 │ 3 b
8 │ 4 b
This form is often easier to work with later. In particular if you then groupby the c data frame by :source like this:
julia> groupby(c, :source)
GroupedDataFrame with 2 groups based on key: source
First Group (4 rows): source = "a"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
⋮
Last Group (4 rows): source = "b"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 2 b
3 │ 3 b
4 │ 4 b
As a result you also get a collection of data frames (like the list that was created in the other answer), but this time you can apply functions supporting the split-apply-combine to it, see https://dataframes.juliadata.org/stable/man/split_apply_combine/.

One possible option that appears to work is the straightforward, "just add them to the list" method mentioned above. This would look like:
julia> a = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> b = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> c = [a, b]
2-element Vector{DataFrame}:
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> typeof(c)
Vector{DataFrame} (alias for Array{DataFrame, 1})
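If the data frames are not all known up front, you can also start from an empty typed vector and push! to it, then loop over it or stack it when needed. A brief sketch, with frames, sizes, and stacked as illustrative names only:

```julia
using DataFrames

# Start with an empty vector whose element type is DataFrame.
frames = DataFrame[]

# Add data frames one at a time.
push!(frames, DataFrame(A = 1:4))
push!(frames, DataFrame(A = 5:8))

# The vector can be iterated like any other collection.
sizes = [nrow(d) for d in frames]

# Stack all data frames vertically into a single one.
stacked = reduce(vcat, frames)
```

Declaring the element type keeps the container as a Vector{DataFrame} rather than a Vector{Any}.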
