How to handle missing in boolean context in Julia? - dataframe

I'm trying to create a categorical variable based on ranges of values from other (numerical) column. However, the code don't work when I have missings in the numerical column
Here is a replicable example:
using RDatasets;
using DataFrames;
using Pipe;
using FreqTables;
df = dataset("datasets","iris")
#lowercase columns just for convenience
#pipe df |> rename!(_, [lowercase(k) for k in names(df)]);
#without this line, the code works fine
#pipe df |> allowmissing!(_, :sepallength) |> replace!(_.sepallength, 4.9 => missing);
df[:size] = #. ifelse(df[:sepallength]<=4.7, "small", missing)
df[:size] = #. ifelse((df[:sepallength]>4.7) & (df[:sepallength]<=4.9), "avg", df[:size])
df[:size] = #. ifelse((df[:sepallength]>4.9) & (df[:sepallength]<=5), "large", df[:size])
df[:size] = #. ifelse(df[:sepallength]>5, "huge", df[:size])
println(#pipe df |> freqtable(_, :size))
Output:
TypeError: non-boolean (Missing) used in boolean context
I would like to ignore the missing cases in the numerical variable but I cannot just drop de missings because this will drop other important informations in my dataset. Moreover, if I drop just the missings in sepallength the column df[:size] would have a different length than the original dataframe.

Use the coalesce function like this:
julia> x = [1,2,3,missing,5,6,7]
7-element Array{Union{Missing, Int64},1}:
1
2
3
missing
5
6
7
julia> #. ifelse(coalesce(x < 4.7, false), "small", missing)
7-element Array{Union{Missing, String},1}:
"small"
"small"
"small"
missing
missing
missing
missing
As a side note do not write df[:size] (this syntax has been deprecated for over 2 years now and soon it will error) but rather df.size or df."size" to access the column of the data frame (the df."size" is for cases when your column names contain characters like spaces etc., e.g. df."my fancy column!").

I think Bogumil's approach is correct and probably best for most situations, but one other option that I like to use is to define my own comparison operators that can deal with missings by returning false if a missing is encountered. Using the unicode capabilities of Julia makes this quite pleasant in my opinion:
julia> ==ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x == y;
julia> >=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x >= y;
julia> <=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x <= y;
julia> <ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x < y;
julia> >ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x > y;
julia> x = rand([missing; 1:10], 50)
julia> x .> 10
50-element Array{Union{Missing, Bool},1}
...
julia> x .>ₘ 10
50-element BitArray{1}
...
There are of course downsides to defining such an elementary operator in your own code, particularly using Unicode as well, in terms of your code being harder for other people to read (and potentially even to display correctly!), so I probably wouldn't advocate for this as the standard approach, or something to be used in library code. I do think though that for explorative work it makes life easier.

Related

i am new, program gives error "there are no type variables left in list"

How the game works is that there is a 3-digit number, and you have to guess it. If you guess a digit in the right spot, you get a strike, and if you guess a digit but in the wrong spot you get a ball. I've coded it like this.
x = random.randint(1, 9)
y = random.randint(1, 9)
z = random.randint(1, 9)
userguessunlisted = input('What number do you want to guess?')
numbertoguess = list[x, y, z]
userguess = list(userguessunlisted)
b = 0
s = 0
while 0 == 0:
if userguess[0] == numbertoguess[0]:
s = s + 1
if userguess[0] == numbertoguess[1]:
b = b + 1
if userguess[0] == numbertoguess[2]:
b = b + 1
if userguess[1] == numbertoguess[0]:
b = b + 1
if userguess[1] == numbertoguess[1]:
s = s + 1
if userguess[1] == numbertoguess[2]:
b = b + 1
if userguess[2] == numbertoguess[0]:
b = b + 1
if userguess[2] == numbertoguess[1]:
b = b + 1
if userguess[2] == numbertoguess[2]:
s = s + 1
print(s + "S", b + "B")
if s != 3:
b = 0
s = 0
else:
print('you win!')
break
When you said list[x, y, z] on line 5, you used square brackets, which python interprets to be a type annotation. For example, if I wanted to specify that a variable is a list of ints, I could say
my_list_of_ints: list[int] = [1, 2, 3]
I think what you meant to do is create a new list from x, y, and z. One way to do this is
numbertoguess = list([x, y, z])
which is probably what you meant to write. This is valid because the list function takes an iterable as its one and only argument.
However, the list portion is redundant; square brackets on the right-hand side of an assignment statement already means "create a list with this content," so instead you should simply say
numbertoguess = [x, y, z]
A few other notes:
input will return a string, but you are comparing that string to integers further down, so none of the comparisons will ever be true. What you want to say is something like the following:
while True:
try:
userguessunlisted = int(input('What number do you want to guess?'))
except:
continue
break
What this code does is attempts to parse the string returned from input into an int. If it fails to do so, which would happen if the user inputted something other than a valid integer, an exception would be thrown, and the except block would be entered. continue means go to the top of the loop, so the input line runs repeatedly until a valid int is entered. When that happens, the except block is skipped, so break runs, which means "exit the loop."
userguessunlisted is only ever going to contain 1 number as written, so userguess will be a list of length 1, and all of the comparisons using userguess[1] and userguess[2] will throw an IndexError. Try to figure out how to wrap the code from (1) in another loop to gather multiple guesses from the user. Hint: use a for loop with range.
It might also be that you meant for the user to input a 3-digit number all at once. In that case, you can use a list comprehension to grab each character from the input and parse it into a separate int. This is probably a bit complicated for a beginner, so I'll help you out:
[int(char) for char in input('What number do you want to guess?')]
print(s + "S", b + "B") will throw TypeError: unsupported operand type(s) for +: 'int' and 'str'. There are lots of ways to combine non-string types with strings, but the most modern way is using f-strings. For example, to combine s with "S", you can say f"{s}S".
When adding some amount to a variable, instead of saying e.g. b = b + 1, you can use the += operator to more concisely say b += 1.
It's idiomatic in python to use snake_case for variables and Pascal case for classes. So instead of writing e.g. numbertoguess, you should use number_to_guess. This makes your code more readable and familiar to other python programmers.
Happy coding!

Vectorize a function with a condition

I would like to vectorize a function with a condition, meaning to calculate its values with array arithmetic. np.vectorize handles vectorization, but it does not work with array arithmetic, so it is not a complete solution
An answer was given as the solution in the question "How to vectorize a function which contains an if statement?" but did not prevent errors here; see the MWE below.
import numpy as np
def myfx(x):
return np.where(x < 1.1, 1, np.arcsin(1 / x))
y = myfx(x)
This runs but raises the following warnings:
<stdin>:2: RuntimeWarning: divide by zero encountered in true_divide
<stdin>:2: RuntimeWarning: invalid value encountered in arcsin
What is the problem, or is there a better way to do this?
I think this could be done by
Getting the indices ks of x for which x[k] > 1.1 for each k in ks.
Applying np.arcsin(1 / x[ks]) to the slice x[ks], and using 1 for the rest of the elements.
Recombining the arrays.
I am not sure about the efficiency, though.
The statement np.where(x < 1.1, 1, np.arcsin(1 / x)) is equivalent to
mask = x < 1.1
a = 1
b = np.arcsin(1 / x)
np.where(mask, a, b)
Notice that you're calling np.arcsin on all the elements of x, regardless of whether 1 / x <= 1 or not. Your basic plan is correct. You can do the operations in-place on an output array using the where keyword of np.arcsin and np.reciprocal, without having to recombine anything:
def myfx(x):
mask = (x >= 1.1)
out = np.ones(x.shape)
np.reciprocal(x, where=mask, out=out) # >= 1.1 implies != 0
return np.arcsin(out, where=mask, out=out)
Using np.ones ensures that the unmasked elements of out are initialized correctly. An equivalent method would be
out = np.empty(x.shape)
out[~mask] = 1
You can always find an arithmetic expression that prevents the "divide by zero".
Example:
def myfx(x):
return np.where( x < 1.1, 1, np.arcsin(1/np.maximum(x, 1.1)) )
The values where x<1.1 in the right wing are not used, so it's not an issue computing np.arcsin(1/1.1) where x < 1.1.

Julia InexactError: Int64

I am new to Julia. Got this InexactError . To mention that I have tried to convert to float beforehand and it did not work, maybe I am doing something wrong.
column = df[:, i]
max = maximum(column)
min = minimum(column)
scaled_column = (column .- min)/max # This is the error, I think
df[:, i] = scaled_column
julia> VERSION
v"1.4.2"
Hard to give a sure answer without a minimal working example of the problem, but in general an InexactError happens when you try to convert a value to an exact type (like integer types, but unlike floating-point types) in which the original value cannot be exactly represented. For example:
julia> convert(Int, 1.0)
1
julia> convert(Int, 1.5)
ERROR: InexactError: Int64(1.5)
Other programming languages arbitrarily chose some way of rounding here (often truncation but sometimes rounding to nearest). Julia doesn't guess and requires you to be explicit. If you want to round, truncate, take a ceiling, etc. you can:
julia> floor(Int, 1.5)
1
julia> round(Int, 1.5)
2
julia> ceil(Int, 1.5)
2
Back to your problem: you're not calling convert anywhere, so why are you getting a conversion error? There are various situations where Julia will automatically call convert for you, typically when you try to assign a value to a typed location. For example, if you have an array of Ints and you assign a floating point value into it, it will be converted automatically:
julia> v = [1, 2, 3]
3-element Array{Int64,1}:
1
2
3
julia> v[2] = 4.0
4.0
julia> v
3-element Array{Int64,1}:
1
4
3
julia> v[2] = 4.5
ERROR: InexactError: Int64(4.5)
So that's likely what's happening to you: you get non-integer values by doing (column .- min)/max and then you try to assign it into an integer location and you get the error.
As a side note you can use transform! to achieve what you want like this:
transform!(df, i => (x -> (x .- minimum(x)) ./ maximum(x)) => i)
and this operation will replace the column.

vectorize join condition in pandas

This code is working correctly as expected. But it takes a lot of time for large dataframes.
for i in excel_df['name_of_college_school'] :
for y in mysql_df['college_name'] :
if SequenceMatcher(None, i.lower(), y.lower() ).ratio() > 0.8:
excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y
I guess, I can not use a function on join clause to compare values like this.
How do I vectorize this?
Update:
Is it possible to update with the highest score? This loop will overwrite the earlier match and it is possible that the earlier match was more relevant than current one.
What you are looking for is fuzzy merging.
a = excel_df.as_matrix()
b = mysql_df.as_matrix()
for i in a:
for j in b:
if SequenceMatcher(None,
i[college_index_a].lower(), y[college_index_b].lower() ).ratio() > 0.8:
i[dupmark_index] = j
Never use loc in a loop, it has a huge overhead. And btw, get the index of the respective columns, (the numerical one). Use this -
df.columns.get_loc("college name")
You could avoid one of the loops using apply and instead of MxN .loc operations, now it'll be M operations.
for y in mysql_df['college_name']:
match = excel_df['name_of_college_school'].apply(lambda x: SequenceMatcher(
None, x.lower(), y.lower()).ratio() > 0.8)
excel_df.loc[match, 'dupmark4'] = y

parse input to rational in Julia

I want to read the user input and store it as a Rational, whatever the type: integer, float ot rational. For instance:
5 --> store it as 5//1
2.3 --> store it as 23//10
4//7 --> store it as 4//7
At the moment I wrote the following:
a = convert(Rational,parse(Float64,readline(STDIN)))
which is fine if I input an integer, like 5.
But if I input 2.3, a stores 2589569785738035//1125899906842624
.
And if I input a fraction (whether in the form 4/7 or the form 4//7) I get an ArgumentError: invalid number format for Float64.
How to solve the Float&Rational problems?
One way is to parse the raw input to an Expr (symbols), eval the expression, convert it to a Float64 and use rationalize to simplify the rational generated:
julia> rationalize(convert(Float64, eval(parse("5"))))
5//1
julia> rationalize(convert(Float64, eval(parse("2.3"))))
23//10
julia> rationalize(convert(Float64, eval(parse("4/7"))))
4//7
julia> rationalize(convert(Float64, eval(parse("4//7"))))
4//7
rationalize works with approximate floating point number and you could specify the error in the parameter tol.
tested with Julia Version 0.4.3
Update: The parse method was deprecated in Julia version >= 1.0. There are two methods that should be used: Base.parse (just for numbers, and it requires a Type argument) and Meta.parse (for expressions):
julia> rationalize(convert(Float64, eval(parse(Int64, "5"))))
5//1
julia> rationalize(convert(Float64, eval(parse(Float64, "2.3"))))
23//10
Multiplication (then division) by a highly composite integer works pretty well.
julia> N = 2*2 * 3*3 * 5*5 * 7 * 11 * 13
900900
julia> a = round(Int, N * parse(Float64, "2.3")) // N
23//10
julia> a = round(Int, N * parse(Float64, "5")) // N
5//1
julia> a = round(Int, N * parse(Float64, "9.1111111111")) // N
82//9
You could implement your own parse:
function Base.parse(::Type{Rational{Int}}, x::ASCIIString)
ms, ns = split(x, '/', keep=false)
m = parse(Int, ms)
n = parse(Int, ns)
return m//n
end
Base.parse(::Type{Rational}, x::ASCIIString) = parse(Rational{Int}, x)