Issue with Left Outer Join in Julia DataFrame

This one has me stumped.
I'm trying to join two dataframes in Julia, but I get this weird 'nothing' error. The same code works on a different machine, so I'm thinking it could be a package problem. I Pkg.rm()'d everything and re-installed, but no luck.
Julia v1.2
using PyCall;
using DataFrames;
using CSV;
using Statistics;
using StatsBase;
using Random;
using Plots;
using Dates;
using Missings;
using RollingFunctions;
# using Indicators;
using Pandas;
using GLM;
using Impute;
a = DataFrames.DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])
b = DataFrames.DataFrame(x = [1, 2, 3, 4], z = ["d", "e", "f", "g"])
join(a, b, on=:x, kind=:left)
yields
ArgumentError: `nothing` should not be printed; use `show`, `repr`, or custom output instead.
Stacktrace:
[1] print(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Nothing) at ./show.jl:587
[2] print_to_string(::String, ::Vararg{Any,N} where N) at ./strings/io.jl:129
[3] string at ./strings/io.jl:168 [inlined]
[4] #join#543(::Symbol, ::Symbol, ::Bool, ::Nothing, ::Tuple{Bool,Bool}, ::typeof(join), ::DataFrames.DataFrame, ::DataFrames.DataFrame) at /Users/username/.julia/packages/DataFrames/3ZmR2/src/deprecated.jl:298
[5] (::getfield(Base, Symbol("#kw##join")))(::NamedTuple{(:on, :kind),Tuple{Symbol,Symbol}}, ::typeof(join), ::DataFrames.DataFrame, ::DataFrames.DataFrame) at ./none:0
[6] top-level scope at In[15]:4
kind=:inner works fine but :left, :right, and :outer don't.

This problem is caused by the way Julia 1.2 prints nothing (i.e. it errors when trying to print it). If you switch to Julia 1.4.1, the problem will disappear.
However, I can see you are on DataFrames.jl 0.21. In this version the join function is deprecated. You should use the innerjoin, leftjoin, rightjoin, outerjoin, etc. functions instead. Then everything will also work on Julia 1.2, e.g.:
julia> leftjoin(a, b, on=:x)
3×3 DataFrame
│ Row │ x     │ y      │ z       │
│     │ Int64 │ String │ String? │
├─────┼───────┼────────┼─────────┤
│ 1   │ 1     │ a      │ d       │
│ 2   │ 2     │ b      │ e       │
│ 3   │ 3     │ c      │ f       │

Related

ArgumentError: columns argument must be a vector of AbstractVector objects

I want to make a DataFrame in Julia with one column, but I get an error:
julia> using DataFrames
julia> r = rand(3);
julia> DataFrame(r, ["col1"])
ERROR: ArgumentError: columns argument must be a vector of AbstractVector objects
Why?
Update:
I figured out that I could say the following:
julia> DataFrame(reshape(r, :, 1), ["col1"])
3×1 DataFrame
Row │ col1
│ Float64
─────┼──────────
1 │ 0.800824
2 │ 0.989024
3 │ 0.722418
But it's not straightforward. Is there any better way? Why can't I easily create a DataFrame object from a Vector?
Why can't I easily create a DataFrame object from a Vector?
Because it would be ambiguous with the constructor that takes a table as a positional argument: many popular table types are themselves vectors (e.g. a vector of rows), so a plain vector cannot safely be interpreted as a single column.
However, what you can write is just:
julia> r = rand(3);
julia> DataFrame(col1=r)
3×1 DataFrame
Row │ col1
│ Float64
─────┼────────────
1 │ 0.00676619
2 │ 0.554207
3 │ 0.394077
to get what you want.
An alternative more similar to your code would be:
julia> DataFrame([r], ["col1"])
3×1 DataFrame
Row │ col1
│ Float64
─────┼────────────
1 │ 0.00676619
2 │ 0.554207
3 │ 0.394077

How to create non-alphabetically ordered Categorical column in Polars Dataframe?

In Pandas, you can create an "ordered" Categorical column from existing string column as follows:
column_values_with_custom_order = ["B", "A", "C"]
df["Column"] = pd.Categorical(df.Column, categories=column_values_with_custom_order, ordered=True)
In Polars documentation, I couldn't find such way to create ordered columns. However, I could reproduce this by using pl.from_pandas(df) so I suspect that this is possible with Polars as well.
What would be the recommended way to do this?
I tried to create a new column with polars_df.with_columns(col("Column").cast(pl.Categorical)), but I don't know how to include the custom ordering in this.
I also checked In polars, can I create a categorical type with levels myself?, but I would prefer not to add another column to my Dataframe only for ordering.
Say you have
df = pl.DataFrame(
    {"cats": ["z", "z", "k", "a", "b"], "vals": [3, 1, 2, 2, 3]}
)
and you want to make cats a categorical but you want the categorical ordered as
myorder=["k", "z", "b", "a"]
There are two ways to do this. One is with pl.StringCache(), as in the question you reference; the other is messier. The former does not require you to add any columns to your df, and it's actually very succinct.
with pl.StringCache():
    pl.Series(myorder).cast(pl.Categorical)
    df = df.with_columns(pl.col('cats').cast(pl.Categorical))
What happens is that everything in the StringCache gets the same key values, so when the myorder list is cast, that fixes which key gets allocated to each string value. When your df is cast under the same cache, it gets the same key/string mapping, which is in the order you wanted.
The other way to do this is as follows:
You have to sort your df by the desired ordering and then do set_ordering('physical'). If you want to maintain your original row order, just use with_row_count at the beginning so you can restore it afterwards.
Putting it all together, it looks like this:
df = df.with_row_count('i').join(
        pl.from_dicts([{'order': x, 'cats': y} for x, y in enumerate(myorder)]), on='cats') \
    .sort('order').drop('order') \
    .with_columns(pl.col('cats').cast(pl.Categorical).cat.set_ordering('physical')) \
    .sort('i').drop('i')
You can verify by doing:
df.select(['cats',pl.col('cats').to_physical().alias('phys')])
shape: (5, 2)
┌──────┬──────┐
│ cats ┆ phys │
│ --- ┆ --- │
│ cat ┆ u32 │
╞══════╪══════╡
│ z ┆ 1 │
│ z ┆ 1 │
│ k ┆ 0 │
│ a ┆ 3 │
│ b ┆ 2 │
└──────┴──────┘
Use:
polars_df.with_columns(pl.col("Column").cast(pl.Categorical).cat.set_ordering("lexical"))
See the Polars documentation on cat.set_ordering. For example:
df = pl.DataFrame(
    {"cats": ["z", "z", "k", "a", "b"], "vals": [3, 1, 2, 2, 3]}
).with_columns(
    [
        pl.col("cats").cast(pl.Categorical).cat.set_ordering("lexical"),
    ]
)
df.sort(["cats", "vals"])

python-polars casting string to numeric

When applying pandas.to_numeric, the return dtype is float64 or int64 depending on the data supplied: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
Is there an equivalent way to do this in Polars?
I have seen How to cast a column with data type List[null] to List[i64] in polars, but I don't want to cast each column individually. I've got a couple of string columns I want to turn numeric; these could be int or float values.
# code to show casting with pandas.to_numeric
import pandas as pd
df = pd.DataFrame({"col1":["1","2"], "col2":["3.5", "4.6"]})
print("DataFrame:")
print(df)
df[["col1", "col2"]] = df[["col1", "col2"]].apply(pd.to_numeric)
print(df.dtypes)
Unlike Pandas, Polars is quite picky about datatypes and tends to be rather unaccommodating when it comes to automatic casting. (Among the reasons is performance.)
You can create a feature request for a to_numeric method (but I'm not sure how enthusiastic the response will be.)
That said, here's some easy ways to accomplish this.
Create a method
Perhaps the simplest way is to write a method that attempts the cast to integer and then catches the exception. For convenience, you can even attach this method to the Series class itself.
def to_numeric(s: pl.Series) -> pl.Series:
    try:
        result = s.cast(pl.Int64)
    except pl.exceptions.ComputeError:
        result = s.cast(pl.Float64)
    return result
pl.Series.to_numeric = to_numeric
Then to use it:
(
    pl.select(
        s.to_numeric()
        for s in df
    )
)
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞══════╪══════╡
│ 1 ┆ 3.5 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 4.6 │
└──────┴──────┘
Use the automatic casting of csv parsing
Another method is to write your columns to a csv file (in a string buffer), and then have read_csv try to infer the types automatically. You may have to tweak the infer_schema_length parameter in some situations.
from io import StringIO
>>> pl.read_csv(StringIO(df.write_csv()))
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞══════╪══════╡
│ 1 ┆ 3.5 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 4.6 │
└──────┴──────┘

Apparent issues with DataFrame string values

I am not sure if this is an actual problem or if I am just not doing something the correct way, but at the moment it appears a little bizarre to me.
When using DataFrames I came across an issue where, if you copy a DataFrame to another variable, any changes made through either variable change both. This goes for individual columns too. For example:
julia> x = DataFrame(A = ["pink", "blue", "green"], B = ["yellow", "red", "purple"]);
julia> y = x;
julia> x[x.A .== "blue", :A] = "red";
julia> x
3×2 DataFrame
│ Row │ A     │ B      │
├─────┼───────┼────────┤
│ 1   │ pink  │ yellow │
│ 2   │ red   │ red    │
│ 3   │ green │ purple │
julia> y
3×2 DataFrame
│ Row │ A     │ B      │
├─────┼───────┼────────┤
│ 1   │ pink  │ yellow │
│ 2   │ red   │ red    │
│ 3   │ green │ purple │
A similar thing happens with columns: if I were to, say, set up a DataFrame like the one above but use B = A before incorporating both into the data frame, then when the values in one column are changed, the other changes automatically as well.
This seems odd to me. Maybe it is a feature of other programming languages, but I have done the same thing in R many times when making a backup of a data table or swapping data between columns, and have never seen this behaviour. So the question is: is it working as designed, and is there a correct way to copy values between data frames?
I am using Julia version 0.7.0, since I originally installed 1.0.0 through the Manjaro repository and had issues with Is_windows() when trying to build Tk.
The command y = x does not create a new object; it just creates a new name (reference) for the same DataFrame.
You can create a copy by calling y = copy(x). In your case this still isn't enough, as it only copies the DataFrame itself, not the column vectors inside it.
If you want a completely independent new object, use y = deepcopy(x). In this case, y shares no references with x.
See this thread for a more detailed discussion:
https://discourse.julialang.org/t/what-is-the-difference-between-copy-and-deepcopy/3918/2

Convert missing to a numerical value in Julia 1

I am trying to convert all missing values in a df to a numerical value, e.g. 0 (yes, I know what I am doing..).
In Julia 0.6 I can write:
julia> df = DataFrame(
           cat = ["green","blue","white"],
           v1 = [1.0,missing,2.0],
           v2 = [1,2,missing]
       )
julia> [df[ismissing.(df[i]), i] = 0 for i in names(df)]
And get:
julia> df
3×3 DataFrames.DataFrame
│ Row │ cat   │ v1  │ v2 │
├─────┼───────┼─────┼────┤
│ 1   │ green │ 1.0 │ 1  │
│ 2   │ blue  │ 0.0 │ 2  │
│ 3   │ white │ 2.0 │ 0  │
If I try it in Julia 0.7 I get instead a very weird error:
MethodError: Cannot convert an object of type Float64 to an object
of type String
I can't see what I am trying to convert to a string. Any explanation (and workaround)?
The reason for this problem is that the broadcasting mechanism has changed between Julia 0.6 and Julia 1.0 (it is used in the insert_multiple_entries! function in DataFrames.jl). In the end, fill! is called, and it tries to do a conversion before checking whether the collection is empty.
Actually, if you want to do a fully general replacement in place (and I understand you do), this is a bit complex and less efficient than what you have in Base. The reason is that you cannot rely on checking the types of elements in vectors, as e.g. you can assign an Int to a vector of Float64 although they have different types:
function myreplacemissing!(vec, val)
    for i in eachindex(vec)
        ismissing(vec[i]) && (vec[i] = val)
    end
end
And now you are good to go:
foreach(col -> myreplacemissing!(col[2], 0), eachcol(df))
While I appreciate the answer of Bogumil Kaminski (also because now I understand the reasons behind the failure), his proposed solution fails if missing elements happen to exist in non-numeric columns, e.g.:
df = DataFrame(
    cat = ["green","blue",missing],
    v1 = [1.0,missing,2.0],
    v2 = [1,2,missing]
)
What I can do instead is to use (either or just one of them, depending on my needs):
[df[ismissing.(df[i]), i] = 0 for i in names(df) if typeintersect(Number, eltype(df[i])) != Union{}]
[df[ismissing.(df[i]), i] = "" for i in names(df) if typeintersect(String, eltype(df[i])) != Union{}]
The advantage is that I can select the type of value I need as "missing replacement" for different type of column (e.g. 0 for a number or "" for a string).
EDIT:
Maybe more readable, thanks again to Bogumil's answer:
[df[ismissing.(df[i]), i] = 0 for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: Number]
[df[ismissing.(df[i]), i] = "" for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: String]