Select numerical columns of Julia DataFrame with missing values

I want to select all columns of a DataFrame in which the datatype is a subtype of Number. However, since there are columns with missing values, the numerical column datatypes can be something like Union{Missing, Int64}.
So far, I came up with:
using DataFrames
df = DataFrame([["a", "b"], [1, missing], [2, 5]])
df_numerical = df[typeintersect.(colwise(eltype, df), Number) .!= Union{}]
This yields the expected result.
Question
Is there a simpler, more idiomatic way of doing this? Possibly similar to:
df.select_dtypes(include=[np.number])
in pandas as taken from an answer to this question?

julia> df[(<:).(eltypes(df),Union{Number,Missing})]
2×2 DataFrame
│ Row │ x2      │ x3 │
├─────┼─────────┼────┤
│ 1   │ 1       │ 2  │
│ 2   │ missing │ 5  │
Please note that the . is the broadcasting operator, hence I had to use the <: operator in its functional form.

Another way to do it could be:
df_numerical = df[[i for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: Number]]
This retrieves all the columns whose element type is a subtype of Number, regardless of whether they contain missing data or not.
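For reference, newer DataFrames.jl releases (roughly 0.21 and later; this goes beyond the answers above, so check your version) let names take a type selector, which turns this into a one-liner:
using DataFrames
df = DataFrame(x1 = ["a", "b"], x2 = [1, missing], x3 = [2, 5])
# names(df, T) returns the names of columns whose eltype is a subtype of T,
# so including Missing in the union also keeps columns that allow missing values.
df_numerical = df[:, names(df, Union{Missing, Number})]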


ArgumentError: columns argument must be a vector of AbstractVector objects

I want to make a DataFrame in Julia with one column, but I get an error:
julia> using DataFrames
julia> r = rand(3);
julia> DataFrame(r, ["col1"])
ERROR: ArgumentError: columns argument must be a vector of AbstractVector objects
Why?
Update:
I figured out that I could say the following:
julia> DataFrame(reshape(r, :, 1), ["col1"])
3×1 DataFrame
 Row │ col1
     │ Float64
─────┼──────────
   1 │ 0.800824
   2 │ 0.989024
   3 │ 0.722418
But it's not straightforward. Is there any better way? Why can't I easily create a DataFrame object from a Vector?
Why can't I easily create a DataFrame object from a Vector?
Because it would be ambiguous with the positional-argument syntax you tried: many popular table types are themselves vectors, so DataFrame could not tell whether a bare vector is meant as a single column or as a whole table.
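For instance (a small illustration added here, not part of the original answer), a vector of NamedTuples is itself a valid Tables.jl table:
julia> DataFrame([(col1 = 1,), (col1 = 2,)])
2×1 DataFrame
 Row │ col1
     │ Int64
─────┼───────
   1 │     1
   2 │     2
so a bare vector passed positionally could mean either a single column or an entire table.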
However, what you can write is just:
julia> r = rand(3);
julia> DataFrame(col1=r)
3×1 DataFrame
 Row │ col1
     │ Float64
─────┼────────────
   1 │ 0.00676619
   2 │ 0.554207
   3 │ 0.394077
to get what you want.
An alternative more similar to your code would be:
julia> DataFrame([r], ["col1"])
3×1 DataFrame
 Row │ col1
     │ Float64
─────┼────────────
   1 │ 0.00676619
   2 │ 0.554207
   3 │ 0.394077
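Yet another spelling (an added note; it assumes a DataFrames.jl version that supports the Pair constructor, and newer versions also accept a string name such as "col1" => r) gives the same 3×1 result:
julia> DataFrame(:col1 => r)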

Keep variables type after using data frame

I'm trying to use the kproto() function from the R package clustMixType to cluster mixed-type data in Julia, but I'm getting the error No numeric variables in x! Try using kmodes() from package.... My data should have 3 variables: 2 continuous and 1 categorical. It seems that after I used DataFrame() all the variables became categorical. Is there a way to avoid changing the variable types when using DataFrame(), so that I have mixed-type data (continuous and categorical) to use with kproto()?
using RCall, DataFrames, Distributions
@rlibrary clustMixType
# group 1 variables
x1=rand(Normal(0,3),10)
x2=rand(Normal(1,2),10)
x3=["1","1","2","2","0","1","1","2","2","0"]
g1=hcat(x1,x2,x3)
# group 2 variables
y1=rand(Normal(0,4),10)
y2=rand(Normal(-1,6),10)
y3=["1","1","2","1","1","2","2","0","2","0"]
g2=hcat(y1,y2,y3)
#create the data
df0=vcat(g1,g2)
df1 = DataFrame(df0)
#use R function
R"kproto($df1, 2)"
I don't know anything about the R package and what kind of input it expects, but the issue is probably how you construct the data matrix from which you construct your DataFrame, not the DataFrame constructor itself.
When you concatenate a numerical and a string column, Julia falls back on the element type Any for the resulting matrix:
julia> g1=hcat(x1,x2,x3)
10×3 Matrix{Any}:
0.708309 -4.84767 "1"
0.566883 -0.214217 "1"
...
That means your df0 matrix is:
julia> #create the data
df0=vcat(g1,g2)
20×3 Matrix{Any}:
0.708309 -4.84767 "1"
0.566883 -0.214217 "1"
...
and the DataFrame constructor will just carry this lack of type information through rather than trying to infer column types.
julia> DataFrame(df0)
20×3 DataFrame
 Row │ x1        x2         x3
     │ Any       Any        Any
─────┼──────────────────────────────
   1 │ 0.708309  -4.84767   1
   2 │ 0.566883  -0.214217  1
...
A simple way of getting around this is to just not concatenate your columns into a single matrix, but to construct the DataFrame from the columns:
julia> DataFrame([vcat(x1, y1), vcat(x2, y2), vcat(x3, y3)])
20×3 DataFrame
 Row │ x1        x2         x3
     │ Float64   Float64    String
─────┼──────────────────────────────
   1 │ 0.708309  -4.84767   1
   2 │ 0.566883  -0.214217  1
...
As you can see, we now have two Float64 numerical columns x1 and x2 in the resulting DataFrame.
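Equivalently (an added note, not part of the original answer), the keyword-argument constructor gives the same result with explicit column names:
julia> DataFrame(x1 = vcat(x1, y1), x2 = vcat(x2, y2), x3 = vcat(x3, y3))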
As an addition to the nice answer by Nils (as the problem is indeed when a matrix is constructed not when DataFrame is created) there is this little trick:
julia> df = DataFrame([1 1.0 "1"; 2 2.0 "2"], [:int, :float, :string])
2×3 DataFrame
 Row │ int  float  string
     │ Any  Any    Any
─────┼─────────────────────
   1 │ 1    1.0    1
   2 │ 2    2.0    2
julia> identity.(df)
2×3 DataFrame
 Row │ int    float    string
     │ Int64  Float64  String
─────┼─────────────────────────
   1 │ 1      1.0      1
   2 │ 2      2.0      2
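Applied to the data from the question (a sketch, reusing the df0 matrix defined above), the same trick narrows the Any columns produced by hcat/vcat:
julia> df1 = identity.(DataFrame(df0))   # columns become Float64, Float64, String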

Julia data frame - need to select rows based upon multiple columns being missing

Help for the Julia newbie
I have joined 2 dataframes and need to select the rows in which certain columns are missing.
The following pulls from one column, but I need to pull from multiple columns.
I need to pull the rows where :md5, :md5_1, :md5_2, ... are missing.
@where(bwjoinout_1_2, findall(x -> (ismissing(x)), :md5)) # works
This pulls rows that have :md5 as missing.
I am syntactically challenged!!
Regards and stay safe
Bryan Webb
Not sure I completely understand what you want to do, but would this work for you?
julia> df = DataFrame(id = 1:3, x=[1, missing, 3], y=[1, 2, missing])
3×3 DataFrame
 Row │ id     x        y
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        1
   2 │     2  missing        2
   3 │     3        3  missing
julia> df[ismissing.(df.x) .| ismissing.(df.y), :]
2×3 DataFrame
 Row │ id     x        y
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2  missing        2
   2 │     3        3  missing
or
filter(row -> any(ismissing, row[names(df, r"^md5")]), df)
which will leave in df all rows that have a missing value in any of the columns whose name starts with "md5". This is not the most efficient way to do it, but I think it is simplest conceptually.
If you need maximum performance, go along with what François Févotte proposed, but it currently requires you to explicitly list the columns you want to filter on (this PR will allow doing it more cleanly).
I used
bwmissows = bwjoinout_1_2[ismissing.(bwjoinout_1_2.md5) .| ismissing.(bwjoinout_1_2.md5_1), :]
and it worked like a charm: it pulled the rows that had a missing md5 or md5_1.
Thanks for your help; I couldn't get the syntax right myself.
Stay safe!
Regards,
Bryan
I'd do bwjoinout_1_2[.!completecases(bwjoinout_1_2, r"md5"), :].
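For completeness, here is a minimal sketch with made-up column names (it assumes a DataFrames.jl version where completecases accepts a column selector); it keeps exactly the rows where at least one md5* column is missing:
julia> df = DataFrame(id = 1:3, md5 = [1, missing, 3], md5_1 = [1, 2, missing])
julia> df[.!completecases(df, r"md5"), :]
2×3 DataFrame
 Row │ id     md5      md5_1
     │ Int64  Int64?   Int64?
─────┼──────────────────────────
   1 │     2  missing        2
   2 │     3        3  missing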

Plotting Julia DataFrame columns that have whitespace in their names with Matplotlib

I have DataFrames that have whitespace in their column names, because the CSV files they were generated from had whitespace in the names as well. The DataFrames were generated with the lines
csvnames::Array{String,1} = filter(x -> endswith(x, ".csv"), readdir(CSV_DIR))
dfs::Dict{String, DataFrame} = Dict( csvnames[i] => CSV.File(CSV_DIR * csvnames[i]) |> DataFrame for i in 1:length(csvnames))
The DataFrames have column names such as "Tehtävä 1", but none of the following expressions work when I try to access the column (here ecols is a dataframe):
plot = axes.plot(ecols[Symbol("Tehtävä 1")]) produces the error TypeError("float() argument must be a string or a number, not 'PyCall.jlwrap'")
plot = axes.plot(ecols[:Tehtävä_1]) produces the error ERROR: LoadError: ArgumentError: column name :Tehtävä_1 not found in the data frame; existing most similar names are: :Tehtävä 1
plot = axes.plot(ecols[:Tehtävä 1]) raises the error ERROR: LoadError: MethodError: no method matching typed_hcat(::DataFrame, ::Symbol, ::Int64)
It therefore seems that I have no way of plotting DataFrame columns that have spaces in their names. Printing them works just fine, as the line
println(ecols[Symbol("Tehtävä 1")])
produces an array of floats: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], as it is supposed to. Is Matplotlib just incompatible with DataFrames that have whitespace in their column names, and if so, how could I remove all whitespace from the column names of a Julia DataFrame?
EDIT
I forgot to mention one very crucial point: the DataFrame contains missing values, which Matplotlib can't comprehend. This was causing error 1. I would still very much like to know if there is a way of getting rid of any whitespace in the table column names, possibly during the construction of the DataFrame.
The first approach works just fine, but it seems you are not using PyPlot.jl correctly (in particular, you create a variable called plot, which shadows the plot function from PyPlot.jl).
To see that it works run:
julia> df = DataFrame(Symbol("Tehtävä 1") => 1.0:5.0)
5×1 DataFrame
│ Row │ Tehtävä 1 │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ 1.0       │
│ 2   │ 2.0       │
│ 3   │ 3.0       │
│ 4   │ 4.0       │
│ 5   │ 5.0       │
julia> plot(df[Symbol("Tehtävä 1")])
1-element Array{PyCall.PyObject,1}:
PyObject <matplotlib.lines.Line2D object at 0x000000003F9EE0B8>
and a plot is shown as expected.
EDIT
If you want to remove whitespace from column names of data frame df write:
names!(df, Symbol.(replace.(string.(names(df)), Ref(r"\s"=>""))))
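As a side note going beyond the original answer (and assuming a newer DataFrames.jl, roughly 0.21 and later), columns can also be indexed by plain strings, which sidesteps the whitespace problem entirely, and rename! accepts string pairs:
julia> df[!, "Tehtävä 1"]                       # index the column by its string name
julia> rename!(df, "Tehtävä 1" => "Tehtävä_1")  # or rename just the offending column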

Convert missing to a numerical value in Julia 1

I am trying to convert all missing values in a df to a numerical value, e.g. 0 (yes, I know what I am doing...).
In Julia 0.6 I can write:
julia> df = DataFrame(
cat = ["green","blue","white"],
v1 = [1.0,missing,2.0],
v2 = [1,2,missing]
)
julia> [df[ismissing.(df[i]), i] = 0 for i in names(df)]
And get:
julia> df
3×3 DataFrames.DataFrame
│ Row │ cat   │ v1  │ v2 │
├─────┼───────┼─────┼────┤
│ 1   │ green │ 1.0 │ 1  │
│ 2   │ blue  │ 0.0 │ 2  │
│ 3   │ white │ 2.0 │ 0  │
If I try it in Julia 0.7 I get instead a very weird error:
MethodError: Cannot convert an object of type Float64 to an object
of type String
I can't see what I am trying to convert to a string. Any explanation (and workaround)?
The reason for this problem is that the broadcasting mechanism changed between Julia 0.6 and Julia 1.0 (and it is used in the insert_multiple_entries! function in DataFrames.jl). In the end fill! is called, and it tries to do a conversion before checking whether the collection is empty.
Actually, if you want to do a fully general in-place replacement (and I understand that is what you want), it is a bit more complex and less efficient than what you have in Base. The reason is that you cannot rely on checking the types of the elements in a vector: e.g. you can assign an Int to a vector of Float64, even though they have different types.
# Replace every missing value in vec with val, in place.
function myreplacemissing!(vec, val)
    for i in eachindex(vec)
        ismissing(vec[i]) && (vec[i] = val)
    end
end
And now you are good to go:
foreach(col -> myreplacemissing!(col[2], 0), eachcol(df))
While I appreciate the answer of Bogumil Kaminski (also because now I understand the reasons behind the failure), the proposed solution fails if missing elements happen to exist in non-numeric columns, e.g.:
df = DataFrame(
cat = ["green","blue",missing],
v1 = [1.0,missing,2.0],
v2 = [1,2,missing]
)
What I can do instead is to use (either both or only one, depending on my needs):
[df[ismissing.(df[i]), i] = 0 for i in names(df) if typeintersect(Number, eltype(df[i])) != Union{}]
[df[ismissing.(df[i]), i] = "" for i in names(df) if typeintersect(String, eltype(df[i])) != Union{}]
The advantage is that I can pick the replacement value per column type (e.g. 0 for a numeric column or "" for a string column).
EDIT:
Maybe more readable, thanks again to Bogumil's answer:
[df[ismissing.(df[i]), i] = 0 for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: Number]
[df[ismissing.(df[i]), i] = "" for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: String]
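A further alternative not covered above is to broadcast Base's coalesce column by column (a sketch; the replacement value should match the column's non-missing type, and df.col assignment assumes a reasonably recent DataFrames.jl):
julia> df.v1 = coalesce.(df.v1, 0.0)   # Float64 column: replace missing with 0.0
julia> df.v2 = coalesce.(df.v2, 0)     # Int column: replace missing with 0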