How to convert a GroupedDataFrame to a DataFrame in Julia? - dataframe

I have performed calculations on subsets of a DataFrame by using the groupby function:
using RDatasets
iris = dataset("datasets", "iris")
describe(iris)
iris_grouped = groupby(iris,:Species)
iris_avg = map(:SepalLength => mean,iris_grouped::GroupedDataFrame)
Now I would like to plot the results, but I get an error message for the following plot:
#df iris_avg bar(:Species,:SepalLength)
Only tables are supported
What would be the best way to plot the data? My idea would be to create a single DataFrame and go from there. How would I do this, ie how do I convert a GroupedDataFrame to a single DataFrame? Thanks!

To convert GroupedDataFrame into a DataFrame just call DataFrame on it, e.g.:
julia> DataFrame(iris_avg)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
in your case.
You could also have written:
julia> combine(:SepalLength => mean, iris_grouped)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
on an original GroupedDataFrame or
julia> by(:SepalLength => mean, iris, :Species)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
on an original DataFrame.
I write the transformation as the first argument here, but typically, you would write it as the last (as then you can pass multiple transformations), e.g.:
julia> by(iris, :Species, :SepalLength => mean, :SepalWidth => minimum)
3×3 DataFrame
│ Row │ Species │ SepalLength_mean │ SepalWidth_minimum │
│ │ Categorical… │ Float64 │ Float64 │
├─────┼──────────────┼──────────────────┼────────────────────┤
│ 1 │ setosa │ 5.006 │ 2.3 │
│ 2 │ versicolor │ 5.936 │ 2.0 │
│ 3 │ virginica │ 6.588 │ 2.2 │

I think you might be better off using the by function to get to your iris_avg directly. by iterates through a DataFrame, and then applies the given function to the the results. Often, it's used with a do block.
julia> by(iris, :Species) do df
DataFrame(sepal_mean = mean(df.SepalLength))
end
3×2 DataFrame
│ Row │ Species │ sepal_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
Or equivalently,
julia> by(iris, :Species, SepalLength_mean = :SepalLength => mean)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
See here for more details/examples.
Alternatively, you can do it in several steps as you've done, then use DataFrame constructor to convert to a proper DataFrame:
julia> iris_grouped = groupby(iris,:Species);
julia> iris_avg = map(:SepalLength => mean,iris_grouped::GroupedDataFrame);
julia> DataFrame(iris_avg)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │

Related

Return copy of a DataFrame that contains only rows with missing data in Julia

I am looking for the opposite of the dropmissing function in DataFrames.jl so that the user knows where to look to fix their bad data. It seems like this should be easy, but the filter function expects a column to be specified and I cannot get it to iterate over all columns.
julia> df=DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> filter(x -> ismissing(eachcol(x)), df)
ERROR: MethodError: no method matching eachcol(::DataFrameRow{DataFrame,DataFrames.Index})
julia> filter(x -> ismissing.(x), df)
ERROR: ArgumentError: broadcasting over `DataFrameRow`s is reserved
I am basically trying to recreate the disallowmissing function, but with a more useful error message.
Here are two ways to do it:
julia> df = DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> df[.!completecases(df), :] # this will be faster
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> #view df[.!completecases(df), :]
2×2 SubDataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df, view=true) # requires DataFrames.jl 0.22
2×2 SubDataFrame
Row │ a b
│ Int64? Int64?
─────┼──────────────────
1 │ missing 5
2 │ 3 missing

transform function on all columns of dataframe

I have a dataframe df and I am trying to apply a function to each of the cells. According to the documentation I should use the transform function.
The function should be applied to each column so I use [:] as a selector for all columns
transform(
df, [:] .=> ByRow(x -> (if (x > 1) x else zero(Float64) end)) .=> [:]
)
but it yields an exception
ArgumentError: Unrecognized column selector: Colon() => (DataFrames.ByRow{Main.workspace293.var"#1#2"}(Main.workspace293.var"#1#2"()) => Colon())
although when I am using a single column, it works fine
transform(
df, [:K0] .=> ByRow(x -> (if (x > 1) x else zero(Float64) end)) .=> [:K0]
)
The simplest way to do it is to use broadcasting:
julia> df = DataFrame(2*rand(4,3), [:x1, :x2, :x3])
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼──────────┼──────────┤
│ 1 │ 0.945879 │ 1.59742 │ 0.882428 │
│ 2 │ 0.0963367 │ 0.400404 │ 0.599865 │
│ 3 │ 1.23356 │ 0.807691 │ 0.547917 │
│ 4 │ 0.756098 │ 0.595673 │ 0.29678 │
julia> #. ifelse(df > 1, df, 0.0)
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 1.59742 │ 0.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 1.23356 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │
you can also transform for it if you prefer:
julia> transform(df, names(df) .=> ByRow(x -> ifelse(x>1, x, 0.0)) .=> names(df))
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 1.59742 │ 0.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 1.23356 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │
Also looking at the linked pandas solution DataFrames.jl seems faster in this case:
julia> df = DataFrame(2*rand(2,3), [:x1, :x2, :x3])
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼────────────────────────────
1 │ 1.48781 1.20332 1.08071
2 │ 1.55462 1.66393 0.363993
julia> using BenchmarkTools
julia> #btime #. ifelse($df > 1, $df, 0.0)
6.252 μs (58 allocations: 3.89 KiB)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.48781 1.20332 1.08071
2 │ 1.55462 1.66393 0.0
(in pandas for 2x3 data frame it was ranging from 163 µs to 2.26 ms)

Different datatypes of dataframe columns are not support for Impute(handling missing value method) in Julia

I did one small experiment and I got to know that it is just because the different data types of columns include in CSV. please see the following code
julia> using DataFrames
julia> df = DataFrame(:a => [1.0, 2, missing, missing, 5.0], :b => [1.1, 2.2, 3, missing, 5],:c => [1,3,5,missing,6])
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Int64? │
├─────┼──────────┼──────────┼─────────┤
│ 1 │ 1.0 │ 1.1 │ 1 │
│ 2 │ 2.0 │ 2.2 │ 3 │
│ 3 │ missing │ 3.0 │ 5 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 5.0 │ 5.0 │ 6 │
julia> df
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Int64? │
├─────┼──────────┼──────────┼─────────┤
│ 1 │ 1.0 │ 1.1 │ 1 │
│ 2 │ 2.0 │ 2.2 │ 3 │
│ 3 │ missing │ 3.0 │ 5 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 5.0 │ 5.0 │ 6 │
julia> using Impute
julia> Impute.interp(df)
ERROR: InexactError: Int64(5.5)
Stacktrace:
[1] Int64 at ./float.jl:710 [inlined]
[2] convert at ./number.jl:7 [inlined]
[3] convert at ./missing.jl:69 [inlined]
[4] setindex! at ./array.jl:826 [inlined]
[5] (::Impute.var"#58#59"{Int64,Array{Union{Missing, Int64},1}})(::Impute.Context) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors/interp.jl:67
[6] (::Impute.Context)(::Impute.var"#58#59"{Int64,Array{Union{Missing, Int64},1}}) at /home/synerzip/.julia/packages/Impute/GmIMg/src/context.jl:227
[7] _impute!(::Array{Union{Missing, Int64},1}, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors/interp.jl:49
[8] impute!(::Array{Union{Missing, Int64},1}, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:84
[9] impute!(::DataFrame, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:172
[10] #impute#17 at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:76 [inlined]
[11] impute at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:76 [inlined]
[12] _impute(::DataFrame, ::Type{Impute.Interpolate}) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:58
[13] #interp#105 at /home/synerzip/.julia/packages/Impute/GmIMg/src/Impute.jl:84 [inlined]
[14] interp(::DataFrame) at /home/synerzip/.julia/packages/Impute/GmIMg/src/Impute.jl:84
[15] top-level scope at REPL[15]:1
and this error does not occur when I run the following code
julia> df = DataFrame(:a => [1.0, 2, missing, missing, 5.0], :b => [1.1, 2.2, 3, missing, 5])
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ missing │ 3.0 │
│ 4 │ missing │ missing │
│ 5 │ 5.0 │ 5.0 │
julia> Impute.interp(df)
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ 3.0 │ 3.0 │
│ 4 │ 4.0 │ 4.0 │
│ 5 │ 5.0 │ 5.0 │
now I know the reason but confused about how to solve it. I can not use eltype while reading CSV because in my dataset contains 171 columns and it typically has either Int or Float. stuck for how to convert all columns in Float64.
I assume you want:
something simple, that does not have to be maximally efficient
all your columns are numeric (possibly having missing values)
Then just write:
julia> df
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Int64? │
├─────┼──────────┼──────────┼─────────┤
│ 1 │ 1.5 │ 1.65 │ 1 │
│ 2 │ 3.0 │ 3.3 │ 3 │
│ 3 │ missing │ 4.5 │ 5 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 7.5 │ 7.5 │ 6 │
julia> float.(df)
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Float64? │
├─────┼──────────┼──────────┼──────────┤
│ 1 │ 1.5 │ 1.65 │ 1.0 │
│ 2 │ 3.0 │ 3.3 │ 3.0 │
│ 3 │ missing │ 4.5 │ 5.0 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 7.5 │ 7.5 │ 6.0 │
It is possible to be more efficient (i.e. convert only the columns that are integer in the source data frame, but it requires more code - please comment if you need such a solution)
EDIT
Also note that CSV.jl has typemap keyword argument that should allow to handle this issue when reading the data in.

Julia: how to compute a particular operation on certain columns of a Dataframe

I have the following Dataframe
using DataFrames, Statistics
df = DataFrame(name=["John", "Sally", "Kirk"],
age=[23., 42., 59.],
children=[3,5,2], height = [180, 150, 170])
print(df)
3×4 DataFrame
│ Row │ name │ age │ children │ height │
│ │ String │ Float64 │ Int64 │ Int64 │
├─────┼────────┼─────────┼──────────┼────────┤
│ 1 │ John │ 23.0 │ 3 │ 180 │
│ 2 │ Sally │ 42.0 │ 5 │ 150 │
│ 3 │ Kirk │ 59.0 │ 2 │ 170 │
I can compute the mean of a column as follow:
println(mean(df[:4]))
166.66666666666666
Now I want to get the mean of all the numeric column and tried this code:
x = [2,3,4]
for i in x
print(mean(df[:x[i]]))
end
But got the following error message:
MethodError: no method matching getindex(::Symbol, ::Int64)
Stacktrace:
[1] top-level scope at ./In[64]:3
How can I solve the problem?
You are trying to access the DataFrame's column using an integer index specifying the column's position. You should just use the integer value without any : before i, which would create the symbol :i but you do not a have column named i.
x = [2,3,4]
for i in x
println(mean(df[i])) # no need for `x[i]`
end
You can also index a DataFrame using a Symbol denoting the column's name.
x = [:age, :children, :height];
for c in x
println(mean(df[c]))
end
You get the following error in your attempt because you are trying to access the ith index of the symbol :x, which is an undefined operation.
MethodError: no method matching getindex(::Symbol, ::Int64)
Note that :4 is just 4.
julia> :4
4
julia> typeof(:4)
Int64
Here is a one-liner that actually selects all Number columns:
julia> mean.(eachcol(df[findall(x-> x<:Number, eltypes(df))]))
3-element Array{Float64,1}:
41.333333333333336
3.3333333333333335
166.66666666666666
For many scenarios describe is actually more convenient:
julia> describe(df)
4×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼─────────┼──────┼────────┼───────┼─────────┼──────────┼──────────┤
│ 1 │ name │ │ John │ │ Sally │ 3 │ │ String │
│ 2 │ age │ 41.3333 │ 23.0 │ 42.0 │ 59.0 │ │ │ Float64 │
│ 3 │ children │ 3.33333 │ 2 │ 3.0 │ 5 │ │ │ Int64 │
│ 4 │ height │ 166.667 │ 150 │ 170.0 │ 180 │ │ │ Int64 │
In the question println(mean(df[4])) works as well (instead of println(mean(df[:4]))).
Hence we can write
x = [2,3,4]
for i in x
println(mean(df[i]))
end
which works

Convert a Julia DataFrame column with String to one with Int and missing values

I need to convert the following DataFrame
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
which looks like
3×2 DataFrame
│ Row │ A │ B │
│ │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I would like to convert A column from Array{String,1} to array of Int with missing values.
I tried
julia> df.A = tryparse.(Int, df.A)
3-element Array{Union{Nothing, Int64},1}:
nothing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Union… │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
julia> eltype(df.A)
Union{Nothing, Int64}
but I'm getting A column with elements of type Union{Nothing, Int64}.
nothing (of type Nothing) and missing (of type Missing) seems to be 2 differents kind of value.
So I wonder how I can A columns with missing values instead?
I also wonder if missing and nothing leads to different performance.
I would have done the following:
julia> df.A = map(x->begin val = tryparse(Int, x)
ifelse(typeof(val) == Nothing, missing, val)
end, df.A)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I think missing is more suitable for dataframes which indeed have missing values, instead of nothing, because the latter is more considered as a void in C, or None in Python, see here.
As a side note, Missing type has some Julia functionalities.
Replacing nothing by missing can simply be done using replace:
julia> df.A = replace(df.A, nothing=>missing)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
an other solution is to use tryparsem function defined as following
tryparsem(T, str) = something(tryparse(T, str), missing)
and use it like
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
julia> df.A = tryparsem.(Int, df.A)