Order columns alphabetically in dataframe Julia - dataframe

I have a dataframe like this:
using DataFrames
df = DataFrame(C = [1,2,3],
A = [1,1,1],
B = [3,2,1])
3×3 DataFrame
Row │ C A B
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 3
2 │ 2 1 2
3 │ 3 1 1
I would like to alphabetically order these columns which would be of course: A,B and C. How can we do this in a dataframe Julia? It could be more than 3 columns. In R we could use the order function.

How about using names with sortperm
julia> df[:, sortperm(names(df))]
3×3 DataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 3 1
2 │ 1 2 2
3 │ 1 1 3

The first method I came up with is:
julia> select(df, sort(propertynames(df)))
3×3 DataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 3 1
2 │ 1 2 2
3 │ 1 1 3
and to mutate df into ordered form:
select!(df, sort(propertynames(df)))

Related

Calculate difference between consecutive rows per group in dataframe Julia

I have the following dataframe called df:
using DataFrames
df = DataFrame(group = ["A", "A", "A", "A", "B", "B", "B", "B"],
value = [2,1,4,3,3,5,2,1])
8×2 DataFrame
Row │ group value
│ String Int64
─────┼───────────────
1 │ A 2
2 │ A 1
3 │ A 4
4 │ A 3
5 │ B 3
6 │ B 5
7 │ B 2
8 │ B 1
I would like to calculate the difference with previous values of consecutive rows in column value per group. The offset should have NaN, 0, or missing. Here is the desired output:
8×3 DataFrame
Row │ group value diff
│ String Int64 Float64
─────┼────────────────────────
1 │ A 2 NaN
2 │ A 1 -1.0
3 │ A 4 3.0
4 │ A 3 -1.0
5 │ B 3 NaN
6 │ B 5 2.0
7 │ B 2 -3.0
8 │ B 1 -2.0
So I was wondering if anyone knows how to calculate the difference between consecutive rows per group in Julia?
Using DataFrames.jl (you can replace missing by any value you like):
julia> select(groupby(df, :group),
:value => (x -> [missing; diff(x)]) => :diff)
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Using DataFramesMeta.jl:
julia> #chain df begin
groupby(:group)
#select :diff = [missing; diff(:value)]
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Normally diff in Julia like in e.g. R would produce one less row (and the syntax would be simpler:
julia> combine(groupby(df, :group), :value => diff => :diff)
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
julia> #chain df begin
groupby(:group)
#combine :diff = diff(:value)
end
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
Yet another way would be to use lag from ShiftedArrays.jl:
julia> using ShiftedArrays: lag
julia> #chain df begin
groupby(:group)
#combine :diff = :value - lag(:value)
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1

Count missing values per column in dataframe Julia

I would like to count the number of missing values per column in a dataframe like df:
Pkg.add("DataFrames")
using DataFrames
df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
This should return 0 for i, 2 for x and 2 for y column. So I was wondering if anyone knows how to count the number of missing values per column in Julia?
When writing the question I found an answer by using describe with :nmissing like this:
describe(df, :nmissing)
3×2 DataFrame
Row │ variable nmissing
│ Symbol Int64
─────┼────────────────────
1 │ i 0
2 │ x 2
3 │ y 2
If you wanted the output in columnar format you can write:
julia> mapcols(x -> count(ismissing, x), df)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 0 2 2

Split a DataFrame into a Vector of DataFrames

I have a DataFrame
df = DataFrame(a=[1,1,2,2],b=[6,7,8,9])
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
3 │ 2 8
4 │ 2 9
Is there a canonical way of splitting it into a Vector{DataFrame}s? I can do
[df[df.a .== i,:] for i in unique(df.a)]
2-element Vector{DataFrame}:
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
but is there maybe something more elegant?
Use:
julia> gdf = groupby(df, :a, sort=true)
GroupedDataFrame with 2 groups based on key: a
First Group (2 rows): a = 1
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
⋮
Last Group (2 rows): a = 2
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 8
2 │ 2 9
(you could omit sort=true, but sorting ensures that the output is ordered in ascending order of the lookup key).
Then you can just work with this object as a vector:
julia> gdf[1]
2×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
julia> gdf[2]
2×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 8
2 │ 2 9
This operation is non-allocating (it is a view into your original data frame).
If you really want Vector{DataFrame} (i.e. make copies of all groups) do:
julia> collect(DataFrame, gdf)
2-element Vector{DataFrame}:
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 8
2 │ 2 9

Julia: How to create a new column in DataFrames.jl by adding two columns using `transform` or `#transform`?

using DataFrames
df = DataFrame(a=1:3, b=1:3)
How do I create a new column c such that c = a+b element wise?
Can't figure it out by reading the transform doc.
I know that
df[!, :c] = df.a .+ df.b
works but I want to use transform in a chain like this
#chain df begin
#transform(c = :a .+ :b)
#where(...)
groupby(...)
end
The above syntax doesn't work with DataFramesMeta.jl
This is an answer using DataFrames.jl.
To create a new data frame:
julia> transform(df, [:a,:b] => (+) => :c)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
and for an in-place operation:
julia> transform!(df, [:a,:b] => (+) => :c)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
or
julia> insertcols!(df, :c => df.a + df.b)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
The difference between transform! and insertcols! is that insertcols! will error if :c column is present in the data frame, while transform! will overwrite it.

Remove groups by condition

Suppose I have the following dataframe
using DataFrames
df = DataFrame(A = 1:10, B = ["a","a","b","b","b","c","c","c","c","d"])
grouped_df = groupby(df, "B")
I would have four groups. How can I drop the groups that have fewer than, say, 2 rows? For example, how can I keep only groups a,b, and c? I can easily do it with a for loop, but I don't think the optimal way.
If you want the result to be still grouped then filter is simplest:
julia> filter(x -> nrow(x) > 1, grouped_df)
GroupedDataFrame with 3 groups based on key: B
First Group (2 rows): B = "a"
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
⋮
Last Group (4 rows): B = "c"
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 6 c
2 │ 7 c
3 │ 8 c
4 │ 9 c
If you want to get a data frame as a result of one operation then do e.g.:
julia> combine(grouped_df, x -> nrow(x) < 2 ? DataFrame() : x)
9×2 DataFrame
Row │ B A
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 2
3 │ b 3
4 │ b 4
5 │ b 5
6 │ c 6
7 │ c 7
8 │ c 8
9 │ c 9