Get Column Types of Julia DataFrames

How can I get the column types of a Julia DataFrame?
using DataFrames
df = DataFrame(a = 1:4, b = ["a", "b", "c", "d"])
4×2 DataFrame
Row │ a b
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
4 │ 4 d

Some additional options (keeping the result in a data frame):
julia> mapcols(eltype, df)
1×2 DataFrame
Row │ a b
│ DataType DataType
─────┼────────────────────
1 │ Int64 String
julia> mapcols(typeof, df)
1×2 DataFrame
Row │ a b
│ DataType DataType
─────┼───────────────────────────────
1 │ Vector{Int64} Vector{String}
julia> describe(df, :eltype)
2×2 DataFrame
Row │ variable eltype
│ Symbol DataType
─────┼────────────────────
1 │ a Int64
2 │ b String
EDIT: note that describe reports the element type of a column with Missing stripped; I forgot to mention this earlier.
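To make the note about Missing concrete, here is a minimal sketch. The data frame `dfm` and its columns are hypothetical (they are not part of the question); the point is only to compare what `eltype` and `describe` report for a column that allows missing values.

```julia
using DataFrames

# Hypothetical example data: column x allows missing values.
dfm = DataFrame(x = [1, missing, 3], y = ["a", "b", "c"])

# eltype reports the full element type, including Missing.
full_types = eltype.(eachcol(dfm))

# describe reports the element type with Missing stripped.
described_types = describe(dfm, :eltype).eltype

# nonmissingtype from Base performs the same stripping explicitly.
stripped_types = nonmissingtype.(full_types)
```

So if you need to know whether a column can hold missing values, `eltype` on the column itself is the one to check.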

For each column I can get the element type like this:
eltype.(eachcol(df))
The same can be achieved (and I like this even better) with
df |> eachcol .|> eltype
2-element Vector{DataType}:
Int64
String
The actual type of the column can be retrieved with
df |> eachcol .|> typeof
2-element Vector{DataType}:
Vector{Int64} (alias for Array{Int64, 1})
Vector{String} (alias for Array{String, 1})
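If the types are easier to use when keyed by column name, the approaches above can be combined with `names`. This is a small sketch rather than part of the original answers; the variable names `coltypes` and `real_cols` are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = 1:4, b = ["a", "b", "c", "d"])

# Map each column name to its element type.
coltypes = Dict(names(df) .=> eltype.(eachcol(df)))

# Select the names of columns whose element type is a subtype of Real.
real_cols = names(df, Real)
```

Passing a type to `names` is handy when you only need to pick out, say, the numeric columns rather than inspect every type.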

Related

Count missing values per column in dataframe Julia

I would like to count the number of missing values per column in a dataframe like df:
using Pkg
Pkg.add("DataFrames")
using DataFrames
df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
This should return 0 for i, 2 for x and 2 for y column. So I was wondering if anyone knows how to count the number of missing values per column in Julia?
When writing the question I found an answer by using describe with :nmissing like this:
describe(df, :nmissing)
3×2 DataFrame
Row │ variable nmissing
│ Symbol Int64
─────┼────────────────────
1 │ i 0
2 │ x 2
3 │ y 2
If you want the output in columnar format, you can write:
julia> mapcols(x -> count(ismissing, x), df)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 0 2 2
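If a plain vector of counts is enough, the same idea works without building a data frame for the result. This is a sketch under the assumption that you want the counts in column order; `nmissing` and `nmissing_by_name` are illustrative names.

```julia
using DataFrames

df = DataFrame(i = 1:5,
               x = [missing, 4, missing, 2, 1],
               y = [missing, missing, "c", "d", "e"])

# Count missing values in each column; the result follows column order.
nmissing = [count(ismissing, col) for col in eachcol(df)]

# Pair each count with its column name for lookup by name.
nmissing_by_name = Dict(names(df) .=> nmissing)
```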

Failing to execute column transformation in DataFrames.jl

Suppose I have the following data frame:
julia> using DataFrames
julia> df = DataFrame(id=["a", "b", "a", "b", "b"], v=[1, 1, 1, 1, 2])
5×2 DataFrame
Row │ id v
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 1
3 │ a 1
4 │ b 1
5 │ b 2
I wanted to compute the number of unique values in column :v per group defined by column :id. I tried the following:
julia> gdf = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = "a"
Row │ id v
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
⋮
Last Group (3 rows): id = "b"
Row │ id v
│ String Int64
─────┼───────────────
1 │ b 1
2 │ b 1
3 │ b 2
julia> combine(gdf, :v => x -> length(unique(x)) => :len)
2×2 DataFrame
Row │ id v_function
│ String Pair…
─────┼────────────────────
1 │ a 1=>:len
2 │ b 2=>:len
But it does not produce the expected result. How can I fix the call to combine?
This is a common issue. The problem is how Julia interprets your transformation specification:
julia> :v => x -> length(unique(x)) => :len
:v => var"#3#4"()
As you can see, due to Julia's operator precedence rules the whole x -> length(unique(x)) => :len part is treated as the definition of an anonymous function. Instead, wrap the anonymous function in parentheses like this:
julia> combine(gdf, :v => (x -> length(unique(x))) => :len)
2×2 DataFrame
Row │ id len
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
Note also that in this case you could have used the function composition operator ∘ like this:
julia> combine(gdf, :v => length∘unique => :len)
2×2 DataFrame
Row │ id len
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
in which case you do not have to define an anonymous function explicitly so parentheses are not needed.
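The precedence issue can be checked without DataFrames.jl at all, since it is purely about how Julia parses `=>` and `->`. The sketch below uses only Base; the variable names are illustrative.

```julia
# Without parentheses, everything after `->` is the body of the anonymous function.
unparenthesized = :v => x -> length(unique(x)) => :len

# The second element is a function that returns a Pair when called.
body_result = unparenthesized.second([1, 1, 2])

# With parentheses, the function ends before `=> :len`.
parenthesized = :v => (x -> length(unique(x))) => :len

# `=>` is right-associative, so this is :v => (function => :len).
fun, target = parenthesized.second
count_result = fun([1, 1, 2])
```

Here `body_result` is a pair ending in `:len`, which is exactly the kind of value that ended up in the `v_function` column above, while `count_result` is a plain integer.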

Julia: How to create a new column in DataFrames.jl by adding two columns using `transform` or `@transform`?

using DataFrames
df = DataFrame(a=1:3, b=1:3)
How do I create a new column c such that c = a+b element wise?
Can't figure it out by reading the transform doc.
I know that
df[!, :c] = df.a .+ df.b
works but I want to use transform in a chain like this
@chain df begin
@transform(c = :a .+ :b)
@where(...)
groupby(...)
end
The above syntax doesn't work with DataFramesMeta.jl.
This is an answer using DataFrames.jl.
To create a new data frame:
julia> transform(df, [:a,:b] => (+) => :c)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
and for an in-place operation:
julia> transform!(df, [:a,:b] => (+) => :c)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
or
julia> insertcols!(df, :c => df.a + df.b)
3×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 2
2 │ 2 2 4
3 │ 3 3 6
The difference between transform! and insertcols! is that insertcols! will error if :c column is present in the data frame, while transform! will overwrite it.
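The difference between overwriting and erroring can be seen in a short sketch. The second transformation (an elementwise product) is made up purely to show that the existing column gets replaced, and the `try` block only records whether `insertcols!` threw, without assuming a specific exception type.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 1:3)

# First call creates column c.
transform!(df, [:a, :b] => (+) => :c)

# Second call overwrites the existing column c without complaint.
transform!(df, [:a, :b] => ((a, b) -> a .* b) => :c)

# insertcols! refuses to add a column name that already exists.
insert_failed = try
    insertcols!(df, :c => df.a + df.b)
    false
catch
    true
end
```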

Replace missing values with values from another column in Julia Dataframe

I have a data frame where some columns have missing values. I would like that if missing values are found, an alternative from a second column is picked.
For example, in:
df = DataFrame(x = [0, missing, 2], y=[2, 4, 6])
I would like missing to be substituted with 4.
At the moment I am solving the problem with this solution:
for row in eachrow(df)
if ismissing(row[:x])
row[:x] = row[:y]
end
end
But I wonder if a better solution that avoids for-loops can be found🤔.
I tried replace(A, old_new::Pair...; [count::Integer]), but it seems that the pair accepts only scalars, and I had no success with broadcasting either.
Do you have any suggestions?
You can use coalesce:
julia> df = DataFrame(x = [0, missing, 2], y=[2, 4, 6])
3×2 DataFrame
Row │ x y
│ Int64? Int64
─────┼────────────────
1 │ 0 2
2 │ missing 4
3 │ 2 6
julia> df.x .= coalesce.(df.x, df.y)
3-element Array{Union{Missing, Int64},1}:
0
4
2
julia> df
3×2 DataFrame
Row │ x y
│ Int64? Int64
─────┼───────────────
1 │ 0 2
2 │ 4 4
3 │ 2 6
or if you like piping-aware functions:
julia> df = DataFrame(x = [0, missing, 2], y=[2, 4, 6])
3×2 DataFrame
Row │ x y
│ Int64? Int64
─────┼────────────────
1 │ 0 2
2 │ missing 4
3 │ 2 6
julia> transform!(df, [:x, :y] => ByRow(coalesce) => :x)
3×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 0 2
2 │ 4 4
3 │ 2 6
and this does the same without requiring you to remember coalesce:
julia> df = DataFrame(x = [0, missing, 2], y=[2, 4, 6])
3×2 DataFrame
Row │ x y
│ Int64? Int64
─────┼────────────────
1 │ 0 2
2 │ missing 4
3 │ 2 6
julia> transform!(df, [:x, :y] => ByRow((x,y) -> ismissing(x) ? y : x) => :x)
3×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 0 2
2 │ 4 4
3 │ 2 6
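Note that the outputs above differ in the column type: the in-place `.=` version keeps `Int64?`, while the `transform!` versions produce `Int64`. The sketch below reproduces that difference with plain vectors from Base, so no data frame is needed; the names `x`, `y`, and `filled` are illustrative.

```julia
x = Union{Missing, Int}[0, missing, 2]
y = [2, 4, 6]

# Allocating a new vector: the element type narrows to Int.
filled = coalesce.(x, y)

# Writing in place: values are replaced, but the container still allows missing.
x .= coalesce.(x, y)
```

If downstream code relies on the column no longer allowing missing values, prefer a variant that allocates a new column.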

Julia - DataFrames Insert a Row at Specific index in Julia 1.1

How can I insert a row into a data frame in Julia at a specific index? (Julia version 1.1)
I have found this related question. However, the code given in the answer no longer works in Julia 1.1.
I know how to push! a row into a data frame or concatenate two data frames, but what about inserting at a specific index?
It also doesn't seem to be explained in the Julia DataFrames documentation.
This is a non-standard operation. The recommendation given there is still valid, so:
df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
foreach((v,n) -> insert!(df[n], 2, v), [4, "d"], names(df))
works. A shorter version to write it under Julia 1.0 would be:
insert!.(eachcol(df, false), 2, [4, "d"])
(passing false as the second argument will no longer be needed in the future, as we are currently in a deprecation period)
The difference is that the getproperty method can be overloaded since Julia 1.0, so df.columns no longer works.
I have also updated the other answer, so you can close this question if you prefer.
EDIT
The instructions above are no longer valid (unless you use a very old version of DataFrames.jl).
In DataFrames.jl 1.4 use insert!, push!, or pushfirst! functions depending on where you want to add the row:
julia> using DataFrames
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> insert!(df, 2, (100, "new line"))
4×2 DataFrame
Row │ x y
│ Int64 String
─────┼─────────────────
1 │ 1 a
2 │ 100 new line
3 │ 2 b
4 │ 3 c
julia> push!(df, (200, "last line"))
5×2 DataFrame
Row │ x y
│ Int64 String
─────┼──────────────────
1 │ 1 a
2 │ 100 new line
3 │ 2 b
4 │ 3 c
5 │ 200 last line
julia> pushfirst!(df, (300, "first line"))
6×2 DataFrame
Row │ x y
│ Int64 String
─────┼───────────────────
1 │ 300 first line
2 │ 1 a
3 │ 100 new line
4 │ 2 b
5 │ 3 c
6 │ 200 last line
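If you would rather not mutate the original data frame, or `insert!` is not available in your DataFrames.jl version, one possible alternative is to build a new data frame with `vcat`. This is a sketch that is not taken from the answer above; `pos`, `new_row`, and `inserted` are illustrative names.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])
new_row = DataFrame(x = 100, y = "new line")

# Build a new data frame: rows before the index, the new row, then the rest.
pos = 2
inserted = vcat(df[1:pos-1, :], new_row, df[pos:end, :])
```

This copies all rows, so for repeated insertions into a large data frame the in-place functions shown above are likely the better choice.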