How to deal with missing values in the ifelse function in Julia (DataFrame)

I am using Julia and I have a data frame with 42 values, 2 of which are missing.
These values are prices that range from 0.23 to 0.3.
I am trying to create a new column that says whether each price is cheap or expensive, using an ifelse statement.
The ifelse should be:
df.x_category=ifelse.(df.x .< mean(df.x),"cheap", "expensive")
but I get the following error:
ERROR: TypeError: non-boolean (Missing) used in boolean context
Is there a way to skip those missing values?
I tried with:
df.x_category=ifelse.(skipmissing(df.x) .< mean(skipmissing(df.x)),"cheap", "expensive")
but get this error:
ERROR: ArgumentError: New columns must have the same length as old columns
I can't just delete the missing observations.
How can I do this?
Thanks in advance!

ifelse can handle only two outcomes, and you need to handle three: true, false, and missing.
Assuming that you have
df = DataFrame(x=rand([0.23,0.3,missing], 10))
then mean(df.x) yields missing whenever any of the values is missing. You need to use mean(skipmissing(df.x)) instead.
Hence the code could be:
julia> map(x -> ismissing(x) ? missing : ifelse(x,"cheap", "expensive"), df.x .< mean(skipmissing(df.x)))
10-element Vector{Union{Missing, String}}:
missing
missing
"cheap"
missing
"expensive"
missing
missing
missing
"cheap"
"cheap"
Here I have combined ifelse with map to handle the missing values. There are other ways, but each one requires nesting some conditional function.
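To connect this back to the original question, here is a minimal sketch that writes the result into a new column. It assumes DataFrames.jl is installed; the six prices are made-up illustration data, not the asker's 42 values.

```julia
using DataFrames, Statistics

# Hypothetical toy data: six prices, two of them missing.
df = DataFrame(x = [0.23, missing, 0.30, 0.25, missing, 0.29])

# Compute the mean over the non-missing prices only.
m = mean(skipmissing(df.x))

# Map every row: missing stays missing, everything else is classified.
df.x_category = map(v -> ismissing(v) ? missing : ifelse(v < m, "cheap", "expensive"), df.x)
```

The new column has the same length as df.x, so the "New columns must have the same length as old columns" error does not occur.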

I would do it with a function that returns "cheap", "expensive", or missing:
using Statistics
data = ifelse.(rand(Bool,100),missing,100*rand(100)) #generator for the data
meandata = mean(skipmissing(data)) #mean of the data
function category_select(x)
    ismissing(x) && return missing #short-circuit operator
    return ifelse(x<meandata,"cheap","expensive")
end
category_select2(x) = ismissing(x) ? missing : (x < meandata ? "cheap" : "expensive") #parentheses are optional
#broadcast values
x_category = category_select.(data)
x_category = category_select2.(data)
Now, what is happening? There are two things to know about the ifelse function:
1. It evaluates both branches before choosing one, so if one branch can error, it will error. Take this example:
maybelog(x) = ifelse(x<0,zero(x),log(x)) #ifelse
maybelog2(x) = begin if x<0; zero(x); else log(x); end end #full if expression
maybelog3(x) = x<0 ? zero(x) : log(x) #ternary operator
maybelog fails with x = -1, whereas maybelog2 and maybelog3 do not.
2. The first argument must always be a Bool. In your initial expression, the result of df.x .< mean(df.x) can be true, false, or missing, so ifelse also fails there.
In your modified expression, the length of skipmissing(df.x) differs from the length of df.x, because the former does not count the missing values in df.x. The result is a vector shorter than the number of rows in your data frame.
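The points above can be checked with a short sketch that uses only Base Julia. The vector x and the threshold 0.25 are made-up values for illustration.

```julia
x = [0.23, missing, 0.30]

# Comparisons propagate missing, so the mask holds true, missing, and false.
mask = x .< 0.25

# skipmissing drops entries, so fewer values remain than there are rows.
n_all = length(x)
n_present = count(!ismissing, x)

# ifelse is an ordinary function: both branch arguments are evaluated first.
maybelog(v) = ifelse(v < 0, zero(v), log(v))
# The ternary operator evaluates only the branch that is selected.
maybelog3(v) = v < 0 ? zero(v) : log(v)

# Record whether maybelog throws a DomainError for a negative input.
threw = try
    maybelog(-1.0)
    false
catch err
    err isa DomainError
end
```

Here n_all is 3 while n_present is 2, which is the length mismatch behind the ArgumentError in the question.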

If you are using DataFrames.jl (which it looks like you are), then I recommend learning the metapackages that simplify the syntax for such scenarios. Here is how you can write your query using DataFrameMacros.jl:
@transform!(df,
    @subset(!ismissing(:x)),
    :x_category = @c ifelse.(:x .< mean(:x), "cheap", "expensive"))
This is the simplest approach, I think.

You can try something like this. Using toy data.
First get your string values from ifelse into a vector.
Then prepare the string vector by converting it to a Union of strings and missing to hold missing values.
Finally put the missing values into the vector.
julia> using DataFrames, Random
julia> vec = ifelse.(df.d[ismissing.(df.d) .== false] .> 0.5,"higher","lower")
40-element Vector{String}:
"higher"
"lower"
"lower"
etc...
julia> vec = convert(Vector{Union{Missing,String}}, vec)
40-element Vector{Union{Missing, String}}
julia> for i in findall(ismissing.(df.d)) insert!(vec, i, missing) end
julia> df.x = vec
julia> df
42×2 DataFrame
Row │ d x
│ Float64? String?
─────┼──────────────────────────
1 │ 0.533183 higher
2 │ 0.454029 lower
3 │ 0.0176868 lower
4 │ 0.172933 lower
5 │ 0.958926 higher
6 │ 0.973566 higher
7 │ 0.30387 lower
8 │ 0.176909 lower
9 │ 0.956916 higher
10 │ 0.584284 higher
11 │ 0.937466 higher
12 │ missing missing
13 │ 0.422956 lower
etc...
Data
julia> Random.seed!(42)
MersenneTwister(42)
julia> data = Random.rand(42)
42-element Vector{Float64}:
0.5331830160438613
0.4540291355871424
etc...
julia> data = convert(Vector{Union{Missing,Float64}}, data)
42-element Vector{Union{Missing, Float64}}
julia> data[[12,34]] .= missing
2-element view(::Vector{Union{Missing, Float64}}, [12, 34]) with eltype Union{Missing, Float64}:
missing
missing
julia> df = DataFrame(d=data)
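A variation on the same idea, sketched here as an alternative rather than as the method above: preallocate a vector of missing values and fill in only the positions that have data, which avoids the insert! loop. The names d, labels, and keep are hypothetical, and the four values are toy data.

```julia
# Toy column with one missing entry.
d = Union{Missing,Float64}[0.53, 0.45, missing, 0.96]

# Start with every label missing.
labels = Vector{Union{Missing,String}}(missing, length(d))

# Boolean mask of the rows that actually hold a value.
keep = .!ismissing.(d)

# Classify only those rows and write them back in place.
labels[keep] = ifelse.(d[keep] .> 0.5, "higher", "lower")
```

Because labels is created with the full length up front, it can be assigned directly as a data frame column.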

Related

How to prevent Columns from being Decimals in DataFrames when retrieved from Postgres

I have a DataFrame df which I retrieved from a Postgres database as follows
using DataFrames, LibPQ
con = LibPQ.Connection(con_string)
result = execute(con, "SELECT * FROM [table]")
df = DataFrame(result)
close(con)
Sorry, I cannot make this reproducible.
Now, either DataFrames or LibPQ is turning NUMERIC Postgres columns into type Decimals.Decimal. This might be nice for maximum accuracy, but it causes problems when I try to plot anything with these columns.
eltype.(eachcol(df))
5-element Vector{Union}:
Union{Missing, String}
Union{Missing, TimeZones.ZonedDateTime}
Union{Missing, Int32}
Union{Missing, Date}
Union{Missing, Decimals.Decimal}
As very nicely explained here by Bogumił Kamiński I can change the columns of a specific type to some other type. The caveat is that I cannot even test whether a column is of type Union{Missing, Decimals.Decimal}, because the Decimals package is not loaded. OK, I thought, let's load the Decimals package then - but it doesn't work, because the package must be installed first...
Is there some other way to turn these columns into Float64s? Without having to install the entire package? I know that I could change the column types by using the column names, like
df.my_column = Float64.(df.my_column)
but I will not know the relevant column names in advance.
You can use Union{Missing, AbstractFloat} as the type selector, because Decimal <: AbstractFloat.
Since Union{Missing, AbstractFloat} is not a concrete type, you need to write eltype(col) <: Union{Missing, AbstractFloat} to check the subtyping condition.
By the way, if you have LibPQ.jl installed, then you also have access to Decimals.jl:
julia> LibPQ.Decimals.Decimal
Decimals.Decimal
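A minimal sketch of the type-selector approach, assuming DataFrames.jl is installed. BigFloat stands in for Decimals.Decimal here so that the example runs without a database (both are subtypes of AbstractFloat). The column names and the tofloat helper are hypothetical.

```julia
using DataFrames

# BigFloat is a stand-in for Decimals.Decimal (an assumption for illustration).
df = DataFrame(name = Union{Missing,String}["a", missing],
               amount = Union{Missing,BigFloat}[big"1.5", missing],
               n = Union{Missing,Int32}[1, 2])

# Hypothetical helper: convert to Float64 but let missing pass through.
tofloat(v) = ismissing(v) ? missing : Float64(v)

# Select columns by element type, so no column names are needed in advance.
for name in names(df, Union{Missing,AbstractFloat})
    df[!, name] = tofloat.(df[!, name])
end
```

Only the amount column is converted; the String and Int32 columns are left untouched.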
You can broadcast identity over the DataFrame to give every column its narrowest element type:
julia> df=DataFrame(A=Number[1,2],B=Union{Missing,AbstractFloat}[3,4])
2×2 DataFrame
Row │ A B
│ Number Abstract…?
─────┼────────────────────
1 │ 1 3.0
2 │ 2 4.0
julia> identity.(df)
2×2 DataFrame
Row │ A B
│ Int64 Float64
─────┼────────────────
1 │ 1 3.0
2 │ 2 4.0

If I have a vector that was loaded from a CSV and that number has commas separating

I'm working with a CSV in which one column of numbers is separated with commas (ex. 1,000,000 = 1000000)
Is there a way I can replace the entire column? When I try:
replace(df2.Volume, "," => "")
it gives me back the entire column as if nothing has changed.
... and when I tried:
julia> parse(Int, replace("df2.Volume",","=>"") )
ERROR: ArgumentError: invalid base 10 digit 'd' in "df2.Volume"
Stacktrace:
[1] tryparse_internal(#unused#::Type{Int64}, s::String, startpos::Int64, endpos::Int64, base_::Int64, raise::Bool)
@ Base .\parse.jl:137
[2] parse(::Type{Int64}, s::String; base::Nothing)
@ Base .\parse.jl:241
[3] parse(::Type{Int64}, s::String)
@ Base .\parse.jl:241
[4] top-level scope
@ REPL[263]:1
The data is all numbers in the millions, so how can I remove these commas??
I appreciate your help!
Source: https://testdataframesjl.readthedocs.io/en/readthedocs/subsets/
A column of a DataFrame in Julia is a Vector. Hence if you want to do something with the entire column you usually need to vectorize the operation using the dot (.) operator.
julia> df = DataFrame(Volume=["1,000","1,000,000","1,000,000,0000"]);
julia> df.VolumeOK = replace.(df.Volume, "," => "");
julia> df
3×2 DataFrame
Row │ Volume VolumeOK
│ String String
─────┼─────────────────────────────
1 │ 1,000 1000
2 │ 1,000,000 1000000
3 │ 1,000,000,0000 10000000000
Note the dot (.) after replace.
You can of course further parse it to Int using the vectorized parse function, such as parse.(Int, df.VolumeOK),
or parse directly to Float64:
df.VolumeOK = parse.(Float64,replace.(df.Volume, "," => ""))
You can do something like:
df.Volume = [parse(Int, replace(v, ","=>"")) for v in df.Volume]
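The following sketch, using only Base Julia and a made-up volume vector, shows why the undotted call in the question appeared to do nothing. Called on a vector, replace substitutes whole elements that are equal to ",", so no string is edited. Broadcasting with the dot applies the string version of replace to each element.

```julia
volume = ["1,000", "1,000,000", "12,345,678"]

# Collection-level replace: looks for elements equal to ",", finds none.
unchanged = replace(volume, "," => "")

# Element-wise replace: strips the commas inside each string.
cleaned = replace.(volume, "," => "")

# Element-wise parse to integers.
parsed = parse.(Int, cleaned)
```

The second attempt in the question fails for a different reason: "df2.Volume" in quotes is a literal string, not the column.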

Find a subset of columns of a data frame that have some missing values

Given the following data frame from DataFrames.jl:
julia> using DataFrames
julia> df = DataFrame(x1=[1, 2, 3], x2=Union{Int,Missing}[1, 2, 3], x3=[1, 2, missing])
3×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64? Int64?
─────┼────────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 3 3 missing
I would like to find the columns that contain a missing value.
I have tried:
julia> names(df, Missing)
String[]
but this is incorrect, because the names function, when passed a type, looks for columns whose element type is a subtype of the passed type.
If you want to find the columns that actually contain a missing value, use:
julia> names(df, any.(ismissing, eachcol(df)))
1-element Vector{String}:
"x3"
In this approach we iterate over each column of the df data frame and check whether it contains at least one missing value.
If you want to find the columns that can potentially contain a missing value, you need to check their element type:
julia> names(df, [eltype(col) >: Missing for col in eachcol(df)]) # using a comprehension
2-element Vector{String}:
"x2"
"x3"
julia> names(df, .>:(eltype.(eachcol(df)), Missing)) # using broadcasting
2-element Vector{String}:
"x2"
"x3"

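The difference between the two checks in the answer above can be seen on plain vectors, without DataFrames.jl. This is a small sketch; the helper names allows_missing and has_missing are hypothetical.

```julia
x1 = [1, 2, 3]
x2 = Union{Int,Missing}[1, 2, 3]
x3 = [1, 2, missing]

# Type-level check: can this vector hold a missing value at all?
allows_missing(col) = eltype(col) >: Missing

# Value-level check: does it hold one right now?
has_missing(col) = any(ismissing, col)

# The subtype test that a bare Missing selector performs; false for all three.
is_all_missing_type = [eltype(col) <: Missing for col in (x1, x2, x3)]
```

x2 allows missing values but contains none, which is why the two approaches return different column lists.
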
DataFrames : no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)

When I try to use the dot operator (element-wise operation) on a DataFrame with a function that returns a tuple, I get the following error.
Here is a toy example,
df = DataFrame()
df[:, :x] = rand(5)
df[:, :y] = rand(5)
#Function that returns two values in the form of a tuple
add_minus_two(x,y) = (x-y,x+y)
df[:,"x+y"] = add_minus_two.(df[:,:x], df[:,:y])[2]
#Out > ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
#However removing the dot operator works fine
df[:,"x+y"] = add_minus_two(df[:,:x], df[:,:y])[2]
#Out > 5 x 3 DataFrame
#Furthermore if its just one argument either dot or not, works fine as well
add_two(x,y) = x+y
df[:, "x+y"] = add_two(df[:,:x], df[:,:y])
df[:, "x+y"] = add_two.(df[:,:x], df[:,:y])
#out > 5 x 3 DataFrame
Is there a reason for this? I thought you need the dot operator for element-wise operations.
Also, for my actual problem (where a function returns 2 values in a tuple), not using the dot operator gives
ERROR: MethodError: no method matching compute_T(::Vector{Float64}, ::Vector{Float64})
and using the dot operator gives,
ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
and returning a single value, as in the toy example, works fine as well.
Any clue what I am doing incorrectly here?
This is not a DataFrames.jl issue; it is how Julia Base works.
I will concentrate only on the RHS, as the LHS is irrelevant (and the RHS is unrelated to DataFrames.jl).
First, how to write what you want. Initialization:
julia> using DataFrames
julia> df = DataFrame()
0×0 DataFrame
julia> df[:, :x] = rand(5)
5-element Vector{Float64}:
0.6146045473316457
0.6319531776216596
0.599267794937812
0.40864382019544965
0.3738682778395166
julia> df[:, :y] = rand(5)
5-element Vector{Float64}:
0.07891853567296825
0.2143545316544586
0.5943274462916335
0.2182702556068421
0.5810132720450707
julia> add_minus_two(x,y) = (x-y,x+y)
add_minus_two (generic function with 1 method)
And now you get:
julia> add_minus_two(df[:,:x], df[:,:y])
([0.5356860116586775, 0.417598645967201, 0.004940348646178538, 0.19037356458860755, -0.2071449942055541], [0.693523083004614, 0.8463077092761182, 1.1935952412294455, 0.6269140758022917, 0.9548815498845873])
julia> add_minus_two.(df[:,:x], df[:,:y])
5-element Vector{Tuple{Float64, Float64}}:
(0.5356860116586775, 0.693523083004614)
(0.417598645967201, 0.8463077092761182)
(0.004940348646178538, 1.1935952412294455)
(0.19037356458860755, 0.6269140758022917)
(-0.2071449942055541, 0.9548815498845873)
julia> add_minus_two(df[:,:x], df[:,:y])[2]
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
julia> add_minus_two.(df[:,:x], df[:,:y])[2]
(0.417598645967201, 0.8463077092761182)
julia> getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2) # this is probably what you want
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
Now the point is that when you write:
df[:,"x+y"] = whatever_you_pass
The whatever_you_pass part must be an AbstractVector with an appropriate number of elements (one per row). This means that what will work is:
add_minus_two.(df[:,:x], df[:,:y])
add_minus_two(df[:,:x], df[:,:y])[2]
getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2)
and what will fail is (as in these cases a Tuple, not an AbstractVector, is produced):
add_minus_two(df[:,:x], df[:,:y])
add_minus_two.(df[:,:x], df[:,:y])[2]
Out of the working syntaxes, just pick the one you want.
The general recommendation is that when you do an assignment, always inspect the RHS on its own and check that it has the proper structure.
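As a compact sketch of inspecting the RHS on its own, the following uses only Base Julia with small made-up vectors in place of the data frame columns. The variable names are hypothetical.

```julia
add_minus_two(x, y) = (x - y, x + y)

x = [1.0, 2.0, 3.0]
y = [0.5, 0.5, 0.5]

# Broadcast call: a vector with one (difference, sum) tuple per element.
results = add_minus_two.(x, y)

# Indexing that vector with [2] returns the second tuple, not a column.
second_row = results[2]

# Broadcasting getindex extracts the second component from every tuple.
sums = getindex.(results, 2)

# Undotted call: one tuple of two vectors, so [2] is the vector of sums.
sums_from_vectors = add_minus_two(x, y)[2]
```

second_row is a Tuple, which is exactly what setindex! rejects in the error message, while sums and sums_from_vectors are vectors with one element per row.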
Also, notably, this will work:
julia> transform(df, [:x, :y] => ByRow(add_minus_two) => ["x-y", "x+y"])
5×4 DataFrame
Row │ x y x-y x+y
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────
1 │ 0.614605 0.0789185 0.535686 0.693523
2 │ 0.631953 0.214355 0.417599 0.846308
3 │ 0.599268 0.594327 0.00494035 1.1936
4 │ 0.408644 0.21827 0.190374 0.626914
5 │ 0.373868 0.581013 -0.207145 0.954882
(You have not asked about it, but maybe this is what you are actually looking for. As opposed to setindex!, this syntax is DataFrames.jl-specific.)

Julia data frame - need to select rows based upon multiple columns being missing

Help for the Julia newbie
I have joined 2 dataframes and need to select rows that have columns that are missing.
The following pulls from one column, but I need to pull multiples.
I need to pull :md5 and :md5_1 and :md5_2.... that are missing.
@where(bwjoinout_1_2, findall(x -> (ismissing(x)), :md5)) # works
This pulls rows that have :md5 as missing.
I am syntactically challenged!!
Regards and stay safe
Bryan Webb
Not sure I completely understand what you want to do, but would this work for you?
julia> df = DataFrame(id = 1:3, x=[1, missing, 3], y=[1, 2, missing])
3×3 DataFrame
Row │ id x y
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 1
2 │ 2 missing 2
3 │ 3 3 missing
julia> df[ismissing.(df.x) .| ismissing.(df.y), :]
2×3 DataFrame
Row │ id x y
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 missing 2
2 │ 3 3 missing
or
filter(row -> any(ismissing, row[names(df, r"^md5")]), df)
which will keep all rows of df that have a missing value in any of the columns whose name starts with "md5". This is not the most efficient way to do it, but I think it is the simplest conceptually.
If you need maximum performance, follow what François Févotte proposed, but it currently requires you to explicitly list the columns you want to filter on (this PR will allow doing it more cleanly).
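Since the question can be read as asking either for rows where any of the md5 columns is missing or for rows where all of them are, here is a sketch of both row masks. It uses only Base Julia; the two vectors are made-up stand-ins for the :md5 and :md5_1 columns.

```julia
# Toy stand-ins for two joined columns.
md5   = ["a1", missing, "c3", missing]
md5_1 = ["x1", "y2", missing, missing]
cols = (md5, md5_1)

# One Boolean mask per column, marking its missing entries.
masks = map(c -> ismissing.(c), cols)

# Rows where at least one column is missing (element-wise OR).
any_missing = reduce((a, b) -> a .| b, masks)

# Rows where every column is missing (element-wise AND).
all_missing = reduce((a, b) -> a .& b, masks)

rows_any = findall(any_missing)
rows_all = findall(all_missing)
```

Either mask can be used as the row index of a data frame, in the same way as the ismissing.(df.x) .| ismissing.(df.y) expression above.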
I used
bwmissows = bwjoinout_1_2[ismissing.(bwjoinout_1_2.md5) .| ismissing.(bwjoinout_1_2.md5_1), :]
and it worked like a charm: it pulled the rows that had a missing md5 or md5_1.
Thanks for your help. Stay safe!
I couldn't get the syntax.
Regards,
Bryan
I'd do bwjoinout_1_2[.!completecases(bwjoinout_1_2, r"md5"), :].