Suppose I have the following data frame:
julia> using DataFrames
julia> df = DataFrame(id=["a", "b", "a", "b", "b"], v=[1, 1, 1, 1, 2])
5×2 DataFrame
 Row │ id      v
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ b           1
   3 │ a           1
   4 │ b           1
   5 │ b           2
I wanted to compute the number of unique values in column :v per group defined by column :id. I tried the following:
julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = "a"
 Row │ id      v
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           1
⋮
Last Group (3 rows): id = "b"
 Row │ id      v
     │ String  Int64
─────┼───────────────
   1 │ b           1
   2 │ b           1
   3 │ b           2
julia> combine(gdf, :v => x -> length(unique(x)) => :len)
2×2 DataFrame
 Row │ id      v_function
     │ String  Pair…
─────┼────────────────────
   1 │ a       1=>:len
   2 │ b       2=>:len
But it does not produce the expected result. How can I fix the call to combine?
This is a common issue. The problem is how Julia parses your transformation specification:
julia> :v => x -> length(unique(x)) => :len
:v => var"#3#4"()
As you can see, due to Julia's operator precedence rules, the whole x -> length(unique(x)) => :len part is treated as the definition of an anonymous function. Instead, wrap the anonymous function in parentheses like this:
julia> combine(gdf, :v => (x -> length(unique(x))) => :len)
2×2 DataFrame
 Row │ id      len
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ b           2
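The parsing difference can also be checked without DataFrames.jl. The sketch below uses only Base Julia; the names unparenthesized and parenthesized are made up for illustration:

```julia
# Without parentheses, `=> :len` becomes part of the anonymous function body.
unparenthesized = :v => x -> length(unique(x)) => :len
# With parentheses, the function is the first element of a nested Pair.
parenthesized = :v => (x -> length(unique(x))) => :len

# The second element is a function that returns a Pair when called.
println(unparenthesized.second isa Function)    # true
println(unparenthesized.second([1, 1, 2]))      # 2 => :len

# The second element is a `function => :len` Pair, which is what combine expects.
println(parenthesized.second isa Pair)          # true
println(parenthesized.second.first([1, 1, 2]))  # 2
println(parenthesized.second.second)            # len
```

This also explains the v_function column of Pair values in the failing call: combine received a function whose return value was itself a Pair.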
Note also that in this case you could have used the function composition operator ∘ like this:
julia> combine(gdf, :v => length∘unique => :len)
2×2 DataFrame
 Row │ id      len
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ b           2
In that case you do not define an anonymous function explicitly, so no parentheses are needed.
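The composition form works because ∘ binds more tightly than =>, so length∘unique is built first and only then paired with :len. A quick check in Base Julia (the names count_unique and spec are only illustrative):

```julia
# `∘` composes functions: (length ∘ unique)(x) is the same as length(unique(x)).
count_unique = length ∘ unique
println(count_unique([1, 1, 2]))       # 2

# `∘` has higher precedence than `=>`, so no parentheses are required here.
spec = :v => length∘unique => :len
println(spec.second.first([1, 1, 2]))  # 2
println(spec.second.second)            # len
```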
I would like to count the number of missing values per column in a dataframe like df:
using Pkg
Pkg.add("DataFrames")
using DataFrames
df = DataFrame(i=1:5,
               x=[missing, 4, missing, 2, 1],
               y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
 Row │ i      x        y
     │ Int64  Int64?   String?
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2        4  missing
   3 │     3  missing  c
   4 │     4        2  d
   5 │     5        1  e
This should return 0 for column i, 2 for x, and 2 for y. Does anyone know how to count the number of missing values per column in Julia?
While writing the question, I found an answer using describe with :nmissing, like this:
describe(df, :nmissing)
3×2 DataFrame
 Row │ variable  nmissing
     │ Symbol    Int64
─────┼────────────────────
   1 │ i                0
   2 │ x                2
   3 │ y                2
If you want the output in columnar format, you can write:
julia> mapcols(x -> count(ismissing, x), df)
1×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     0      2      2
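If you prefer a plain vector or a name-to-count mapping rather than a data frame, a minimal sketch using eachcol and names could look like this (the variable names counts and by_name are just for illustration):

```julia
using DataFrames

df = DataFrame(i=1:5,
               x=[missing, 4, missing, 2, 1],
               y=[missing, missing, "c", "d", "e"])

# One count per column, in column order.
counts = [count(ismissing, col) for col in eachcol(df)]

# Pair each column name (a String) with its count.
by_name = Dict(names(df) .=> counts)

println(counts)        # [0, 2, 2]
println(by_name["x"])  # 2
```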
How can I get the column types of a Julia DataFrame?
using DataFrames
df = DataFrame(a = 1:4, b = ["a", "b", "c", "d"])
4×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c
   4 │     4  d
Some additional options (keeping the result in a data frame):
julia> mapcols(eltype, df)
1×2 DataFrame
 Row │ a         b
     │ DataType  DataType
─────┼────────────────────
   1 │ Int64     String
julia> mapcols(typeof, df)
1×2 DataFrame
 Row │ a              b
     │ DataType       DataType
─────┼───────────────────────────────
   1 │ Vector{Int64}  Vector{String}
julia> describe(df, :eltype)
2×2 DataFrame
 Row │ variable  eltype
     │ Symbol    DataType
─────┼────────────────────
   1 │ a         Int64
   2 │ b         String
EDIT: describe reports the element type of each column with Missing stripped. I forgot to add this comment earlier.
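To illustrate what "Missing stripped" means, here is a small sketch on a column that allows missing values. It only applies eltype and Base's nonmissingtype to the columns themselves and makes no claim about how describe is implemented; the data frame dfm is made up for this example:

```julia
using DataFrames

dfm = DataFrame(x=[1, missing, 3], y=["a", "b", "c"])

# eltype keeps the Missing part of the union.
println(eltype(dfm.x) === Union{Missing, Int})        # true

# nonmissingtype removes Missing from the element type.
println(nonmissingtype(eltype(dfm.x)) === Int)        # true

# For a column without missing values both are identical.
println(eltype(dfm.y) === nonmissingtype(eltype(dfm.y)))  # true
```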
For each column I can get the element type like this:
eltype.(eachcol(df))
The same can be achieved (and I like this even better) with
df |> eachcol .|> eltype
2-element Vector{DataType}:
Int64
String
The actual type of the column can be retrieved with
df |> eachcol .|> typeof
2-element Vector{DataType}:
Vector{Int64} (alias for Array{Int64, 1})
Vector{String} (alias for Array{String, 1})
I am looking for a way to shift a DataFrame column by more than one row.
Shifting by one row works fine:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(A=[1,2,3,4], B=[9,8,7,6]);
julia> transform(df, "A" => ShiftedArrays.lag => :A1)
4×3 DataFrame
 Row │ A      B      A1
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      9  missing
   2 │     2      8        1
   3 │     3      7        2
   4 │     4      6        3
But I cannot find out how to transform the entire column with a function that takes more arguments. I tried something like this (neither works):
transform(df, "A" => x -> ShiftedArrays.lag(x, 2) => :A1)
or
transform(df, ["A", 2] => f => :A1)
I hope there is a more suitable solution than using a for loop :-)
You need additional parentheses around the anonymous function:
transform(df, "A" => (x -> ShiftedArrays.lag(x, 2)) => :A1)
Result:
 Row │ A      B      A1
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      9  missing
   2 │     2      8  missing
   3 │     3      7        1
   4 │     4      6        2
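The same pattern works with any function that takes extra arguments, not only ShiftedArrays.lag. A minimal sketch, assuming a hand-written helper lag_by (hypothetical, not part of any package) that expects a one-based vector:

```julia
using DataFrames

# Hypothetical helper: shift `x` down by `n` rows and pad the top with
# `missing`. Assumes `x` uses one-based indexing, as a data frame column does.
lag_by(x::AbstractVector, n::Integer) =
    [i > n ? x[i - n] : missing for i in eachindex(x)]

df = DataFrame(A=[1, 2, 3, 4], B=[9, 8, 7, 6])

# The anonymous function is wrapped in parentheses so that `=> :A2`
# is not parsed as part of its body.
out = transform(df, :A => (x -> lag_by(x, 2)) => :A2)

# Base.Fix2 fixes the second argument, so no anonymous function is needed.
out2 = transform(df, :A => Base.Fix2(lag_by, 2) => :A2)

println(isequal(out, out2))  # true
```

Base.Fix2(f, 2) behaves like x -> f(x, 2), which avoids the precedence pitfall altogether.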
using DataFrames
df = DataFrame(a=1:3, b=1:3)
How do I create a new column c such that c = a + b element-wise?
Can't figure it out by reading the transform doc.
I know that
df[!, :c] = df.a .+ df.b
works but I want to use transform in a chain like this
@chain df begin
    @transform(c = :a .+ :b)
    @where(...)
    groupby(...)
end
The above syntax doesn't work with DataFramesMeta.jl
This is an answer using DataFrames.jl.
To create a new data frame:
julia> transform(df, [:a,:b] => (+) => :c)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      2      4
   3 │     3      3      6
and for an in-place operation:
julia> transform!(df, [:a,:b] => (+) => :c)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      2      4
   3 │     3      3      6
or
julia> insertcols!(df, :c => df.a + df.b)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      2      4
   3 │     3      3      6
The difference between transform! and insertcols! is that insertcols! throws an error if column :c is already present in the data frame, while transform! overwrites it.
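A short sketch of that difference, run on a fresh copy of the data frame. The second transform! call uses ByRow with a made-up row-wise function, purely to make the overwrite visible:

```julia
using DataFrames

df = DataFrame(a=1:3, b=1:3)

# transform! adds :c, and a second call overwrites it without complaint.
transform!(df, [:a, :b] => (+) => :c)
transform!(df, [:a, :b] => ByRow((a, b) -> 10a + b) => :c)
println(df.c)  # [11, 22, 33]

# insertcols! refuses to add a column whose name already exists.
threw = try
    insertcols!(df, :c => df.a + df.b)
    false
catch
    true
end
println(threw)  # true
```

Note that (+) receives whole columns here, while ByRow applies its function to one row at a time.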
In Julia DataFrame, how can I do a group by and use the value of the next rows?
For example:
using DataFrames, DataFramesMeta
df = DataFrame(grp=["one", "one", "two", "two", "three"], val=[1, 2, 3, 4, 5])
# Row │ grp     val
#     │ String  Int64
#─────┼───────────────
#   1 │ one         1
#   2 │ one         2
#   3 │ two         3
#   4 │ two         4
#   5 │ three       5
@combine(groupby(df, :grp),
    count = length(:val),
    first_val = first(:val),
    #next_val = next(:val)
)
#3×3 DataFrame
# Row │ grp     count  first_val
#     │ String  Int64  Int64
#─────┼──────────────────────────
#   1 │ one         2          1
#   2 │ two         2          3
#   3 │ three       1          5
# I would like to obtain:
# Row │ grp     count  first_val  next_val
#     │ String  Int64  Int64
#─────┼──────────────────────────
#   1 │ one         2          1         2
#   2 │ two         2          3         4
#   3 │ three       1          5        NA
With Julia DataFrames.jl it would be e.g.:
julia> combine(groupby(df, :grp),
               nrow => :count,
               :val => first => :first_val,
               :val => (x -> length(x) > 1 ? x[2] : missing) => :next_val)
3×4 DataFrame
 Row │ grp     count  first_val  next_val
     │ String  Int64  Int64      Int64?
─────┼────────────────────────────────────
   1 │ one         2          1         2
   2 │ two         2          3         4
   3 │ three       1          5   missing
If you accept additional packages, then with ShiftedArrays.jl it would be:
julia> using ShiftedArrays
julia> combine(groupby(df, :grp),
               nrow => :count,
               :val => first => :first_val,
               :val => first∘lead => :next_val)
3×4 DataFrame
 Row │ grp     count  first_val  next_val
     │ String  Int64  Int64      Int64?
─────┼────────────────────────────────────
   1 │ one         2          1         2
   2 │ two         2          3         4
   3 │ three       1          5   missing
And here is the same but with auto-generated column names:
julia> combine(groupby(df, :grp), nrow, :val => first, :val => first∘lead)
3×4 DataFrame
 Row │ grp     nrow   val_first  val_first_lead
     │ String  Int64  Int64      Int64?
─────┼──────────────────────────────────────────
   1 │ one         2          1               2
   2 │ two         2          3               4
   3 │ three       1          5         missing
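As a variation on the DataFrames.jl-only version above, the bounds check can be delegated to Base's get, which returns a default when the index does not exist. This is only a sketch of an alternative, not a replacement for the answers above:

```julia
using DataFrames

df = DataFrame(grp=["one", "one", "two", "two", "three"], val=[1, 2, 3, 4, 5])

# get(x, 2, missing) returns x[2] if the group has a second row,
# and `missing` otherwise.
res = combine(groupby(df, :grp),
              nrow => :count,
              :val => first => :first_val,
              :val => (x -> get(x, 2, missing)) => :next_val)

println(res)
```

Changing the index passed to get picks a later row of each group in the same way.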