Julia dataframe where a column is an array of arrays?

I'm trying to create a table where each row has time-series data associated with a particular test-case.
julia> df = DataFrame(var1 = Int64[], var2 = Int64[], ts = Array{Array{Int64, 1}, 1})
0×3 DataFrames.DataFrame
I'm able to create the data frame. Each var1, var2 pair is intended to have an associated time series.
I want to generate data in a loop and want to append to this dataframe using push!
I've tried
julia> push!(df, [1, 2, [3,4,5]])
ERROR: ArgumentError: Length of iterable does not match DataFrame column count.
in push! at /Users/stro/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:871
and
julia> push!(df, (1, 2, [3,4,5]))
ERROR: ArgumentError: Error adding [3,4,5] to column :ts. Possible type mis-match.
in push! at /Users/stro/.julia/v0.4/DataFrames/src/dataframe/dataframe.jl:883
What's the best way to go about this? Is my intended approach even the right path?

You've accidentally passed the type of a vector instead of an actual vector. This declaration will work:
df = DataFrame(var1 = Int64[], var2 = Int64[], ts = Array{Int64, 1}[])
Note the change from Array{Array{Int64, 1}, 1}, which is a type, to Array{Int64, 1}[], which is an actual vector with that type.
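As a quick sanity check on any reasonably recent Julia, the difference between the two spellings is visible directly in the REPL:

```julia
# Array{Array{Int64, 1}, 1} is a *type*; suffixing a type with [] constructs
# an empty vector whose element type is that type.
Array{Array{Int64, 1}, 1} isa Type            # true: this is a type, not a value
Array{Int64, 1}[] isa Vector{Vector{Int64}}   # true: an actual (empty) vector of Int vectors
```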
Then things work:
julia> push!(df, (1, 2, [3,4,5]))
julia> df
1×3 DataFrames.DataFrame
│ Row │ var1 │ var2 │ ts │
┝━━━━━┿━━━━━━┿━━━━━━┿━━━━━━━━━┥
│ 1 │ 1 │ 2 │ [3,4,5] │
Note that your other example, using [1, 2, [3,4,5]], still does not work. This is because of a quirk in Julia's array syntax: the comma operator performs concatenation, so [1, 2, [3,4,5]] in fact means [1, 2, 3, 4, 5]. This behaviour is weird and will be fixed in Julia 0.5, but it is preserved in 0.4 for backwards compatibility.
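The 0.4-era flattening can be contrasted with current behaviour: on Julia 0.5 and later the same literal keeps the nested vector intact as a single element.

```julia
# Julia 0.4: [1, 2, [3,4,5]] concatenated to [1, 2, 3, 4, 5] (5 elements).
# Julia 0.5+: the inner vector stays a single element (3 elements total).
v = [1, 2, [3, 4, 5]]   # Vector{Any} on modern Julia
length(v)               # 3
v[3]                    # [3, 4, 5]
```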

Related

How to create non-alphabetically ordered Categorical column in Polars Dataframe?

In Pandas, you can create an "ordered" Categorical column from existing string column as follows:
column_values_with_custom_order = ["B", "A", "C"]
df["Column"] = pd.Categorical(df.Column, categories=column_values_with_custom_order, ordered=True)
In the Polars documentation I couldn't find such a way to create ordered columns. However, I could reproduce this by using pl.from_pandas(df), so I suspect it is possible with Polars as well.
What would be the recommended way to do this?
I tried to create a new column with polars_df.with_columns(col("Column").cast(pl.categorical)), but I don't know how to apply the custom ordering to it.
I also checked In polars, can I create a categorical type with levels myself?, but I would prefer not to add another column to my Dataframe only for ordering.
Say you have
df = pl.DataFrame(
{"cats": ["z", "z", "k", "a", "b"], "vals": [3, 1, 2, 2, 3]}
)
and you want to make cats a categorical but you want the categorical ordered as
myorder=["k", "z", "b", "a"]
There are two ways to do this. One, using pl.StringCache() as in the question you reference, is very succinct and does not require adding any columns to your df; the other is messier.
with pl.StringCache():
    pl.Series(myorder).cast(pl.Categorical)
    df = df.with_columns(pl.col("cats").cast(pl.Categorical))
Everything cast under the same StringCache shares one set of key values, so casting the myorder list first fixes which key is allocated to each string value. When your df is then cast under the same cache, its strings receive those same keys, which are in the order you wanted.
The other way to do this is as follows:
You have to sort your df by the desired ordering, then call set_ordering('physical'). If you want to keep the original row order, add with_row_count at the beginning so you can restore it afterwards.
Putting it all together, it looks like this:
df = (
    df.with_row_count("i")
    .join(pl.from_dicts([{"order": x, "cats": y} for x, y in enumerate(myorder)]), on="cats")
    .sort("order")
    .drop("order")
    .with_columns(pl.col("cats").cast(pl.Categorical).cat.set_ordering("physical"))
    .sort("i")
    .drop("i")
)
You can verify by doing:
df.select(['cats',pl.col('cats').to_physical().alias('phys')])
shape: (5, 2)
┌──────┬──────┐
│ cats ┆ phys │
│ --- ┆ --- │
│ cat ┆ u32 │
╞══════╪══════╡
│ z ┆ 1 │
│ z ┆ 1 │
│ k ┆ 0 │
│ a ┆ 3 │
│ b ┆ 2 │
└──────┴──────┘
From the docs, use:
polars_df.with_columns(col("Column").cast(pl.Categorical).cat.set_ordering("lexical"))
For example:
df = pl.DataFrame(
{"cats": ["z", "z", "k", "a", "b"], "vals": [3, 1, 2, 2, 3]}
).with_columns(
[
pl.col("cats").cast(pl.Categorical).cat.set_ordering("lexical"),
]
)
df.sort(["cats", "vals"])

Find a subset of columns of a data frame that have some missing values

Given the following data frame from DataFrames.jl:
julia> using DataFrames
julia> df = DataFrame(x1=[1, 2, 3], x2=Union{Int,Missing}[1, 2, 3], x3=[1, 2, missing])
3×3 DataFrame
 Row │ x1     x2      x3
     │ Int64  Int64?  Int64?
─────┼────────────────────────
   1 │     1       1        1
   2 │     2       2        2
   3 │     3       3  missing
I would like to find columns that contain missing value in them.
I have tried:
julia> names(df, Missing)
String[]
but this is incorrect, as the names function, when passed a type, selects columns whose element type is a subtype of the passed type, and no column here has an element type that is a subtype of Missing (e.g. Union{Int64, Missing} is not a subtype of Missing).
If you want to find columns that actually contain missing value use:
julia> names(df, any.(ismissing, eachcol(df)))
1-element Vector{String}:
"x3"
In this approach we iterate over each column of the df data frame and check whether it contains at least one missing value.
If you want to find columns that potentially can contain missing value you need to check their element type:
julia> names(df, [eltype(col) >: Missing for col in eachcol(df)]) # using a comprehension
2-element Vector{String}:
"x2"
"x3"
julia> names(df, .>:(eltype.(eachcol(df)), Missing)) # using broadcasting
2-element Vector{String}:
"x2"
"x3"

Keep variables type after using data frame

I'm trying to use the kproto() function from the R package clustMixType to cluster mixed-type data in Julia, but I'm getting the error No numeric variables in x! Try using kmodes() from package.... My data should have 3 variables: 2 continuous and 1 categorical. It seems that after I used DataFrame() all the variables became categorical. Is there a way to avoid changing the variable types when using DataFrame(), so that I have mixed-type data (continuous and categorical) to pass to kproto()?
using RCall, DataFrames, Distributions
@rlibrary clustMixType
# group 1 variables
x1=rand(Normal(0,3),10)
x2=rand(Normal(1,2),10)
x3=["1","1","2","2","0","1","1","2","2","0"]
g1=hcat(x1,x2,x3)
# group 2 variables
y1=rand(Normal(0,4),10)
y2=rand(Normal(-1,6),10)
y3=["1","1","2","1","1","2","2","0","2","0"]
g2=hcat(y1,y2,y3)
#create the data
df0=vcat(g1,g2)
df1 = DataFrame(df0)
#use R function
R"kproto($df1, 2)"
I don't know anything about the R package and what kind of input it expects, but the issue is probably how you construct the data matrix from which you construct your DataFrame, not the DataFrame constructor itself.
When you concatenate a numerical and a string column, Julia falls back on the element type Any for the resulting matrix:
julia> g1=hcat(x1,x2,x3)
10×3 Matrix{Any}:
0.708309 -4.84767 "1"
0.566883 -0.214217 "1"
...
That means your df0 matrix is:
julia> df0 = vcat(g1, g2)
20×3 Matrix{Any}:
0.708309 -4.84767 "1"
0.566883 -0.214217 "1"
...
and the DataFrame constructor will just carry this lack of type information through rather than trying to infer column types.
julia> DataFrame(df0)
20×3 DataFrame
Row │ x1 x2 x3
│ Any Any Any
─────┼───────────────────────────
1 │ 0.708309 -4.84767 1
2 │ 0.566883 -0.214217 1
...
A simple way of getting around this is to just not concatenate your columns into a single matrix, but to construct the DataFrame from the columns:
julia> DataFrame([vcat(x1, y1), vcat(x2, y2), vcat(x3, y3)])
20×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 String
─────┼───────────────────────────────
1 │ 0.708309 -4.84767 1
2 │ 0.566883 -0.214217 1
...
As you can see, we now have two Float64 numerical columns x1 and x2 in the resulting DataFrame.
As an addition to the nice answer by Nils (the problem is indeed in how the matrix is constructed, not in how the DataFrame is created), there is this little trick:
julia> df = DataFrame([1 1.0 "1"; 2 2.0 "2"], [:int, :float, :string])
2×3 DataFrame
Row │ int float string
│ Any Any Any
─────┼────────────────────
1 │ 1 1.0 1
2 │ 2 2.0 2
julia> identity.(df)
2×3 DataFrame
Row │ int float string
│ Int64 Float64 String
─────┼────────────────────────
1 │ 1 1.0 1
2 │ 2 2.0 2

Plotting Julia DataFrame columns that have whitespace in their names with Matplotlib

I have DataFrames that have whitespace in their column names, because the CSV files they were generated from had whitespace in the names as well. The DataFrames were generated with the lines
csvnames::Array{String,1} = filter(x -> endswith(x, ".csv"), readdir(CSV_DIR))
dfs::Dict{String, DataFrame} = Dict( csvnames[i] => CSV.File(CSV_DIR * csvnames[i]) |> DataFrame for i in 1:length(csvnames))
The DataFrames have column names such as "Tehtävä 1", but none of the following expressions work when I try to access the column (here ecols is a dataframe):
plot = axes.plot(ecols[Symbol("Tehtävä 1")]) produces the error TypeError("float() argument must be a string or a number, not 'PyCall.jlwrap'")
plot = axes.plot(ecols[:Tehtävä_1]) produces the error ERROR: LoadError: ArgumentError: column name :Tehtävä_1 not found in the data frame; existing most similar names are: :Tehtävä 1
plot = axes.plot(ecols[:Tehtävä 1]) raises the error ERROR: LoadError: MethodError: no method matching typed_hcat(::DataFrame, ::Symbol, ::Int64)
It therefore seems that I have no way of plotting DataFrame columns that have spaces in their names. Printing them works just fine, as the line
println(ecols[Symbol("Tehtävä 1")])
produces an array of floats: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], as it is supposed to. Is Matplotlib just incompatible with DataFrames that have whitespace in their column names, and if so, how could I remove all whitespace from the columns of a Julia DataFrame?
EDIT
I forgot to mention one very crucial point: the DataFrame contains missing values, which Matplotlib can't comprehend. This was causing error 1. I would still very much like to know if there is a way of getting rid of any whitespace in the table column names, possibly during the construction of the DataFrame.
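On the missing-value part: Matplotlib cannot handle Julia's missing, so one hedged workaround is to drop (or otherwise impute) missings before plotting. skipmissing lives in Base, and collect materializes a plain Float64 vector that PyPlot can pass through:

```julia
# Drop missings from a column before handing it to Matplotlib.
col = [1.0, missing, 1.0]
ys = collect(skipmissing(col))   # [1.0, 1.0] — a plain Vector{Float64}
```

Note that dropping rows changes the x-axis alignment, so for multi-column plots you would filter whole rows instead.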
The first approach works just fine, but it seems you are not using PyPlot.jl correctly (in particular, you create a variable called plot, which shadows the plot function from PyPlot.jl).
To see that it works run:
julia> df = DataFrame(Symbol("Tehtävä 1") => 1.0:5.0)
5×1 DataFrame
│ Row │ Tehtävä 1 │
│ │ Float64 │
├─────┼───────────┤
│ 1 │ 1.0 │
│ 2 │ 2.0 │
│ 3 │ 3.0 │
│ 4 │ 4.0 │
│ 5 │ 5.0 │
julia> plot(df[Symbol("Tehtävä 1")])
1-element Array{PyCall.PyObject,1}:
PyObject <matplotlib.lines.Line2D object at 0x000000003F9EE0B8>
and a plot is shown as expected.
EDIT
If you want to remove whitespace from column names of data frame df write:
names!(df, Symbol.(replace.(string.(names(df)), Ref(r"\s"=>""))))
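In more recent DataFrames.jl versions names! was replaced by rename! and names returns strings, so the equivalent would be rename!(df, replace.(names(df), Ref(r"\s" => ""))) (untested sketch; check your DataFrames.jl version). The whitespace-stripping itself needs only Base:

```julia
# r"\s" matches any whitespace character; Ref makes the Pair broadcast as a scalar.
cols = ["Tehtävä 1", "Tehtävä 2"]
replace.(cols, Ref(r"\s" => ""))   # ["Tehtävä1", "Tehtävä2"]
```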

Select numerical columns of Julia DataFrame with missing values

I want to select all columns of a DataFrame in which the datatype is a subtype of Number. However, since there are columns with missing values, the numerical column datatypes can be something like Union{Missing, Int64}.
So far, I came up with:
using DataFrames
df = DataFrame([["a", "b"], [1, missing], [2, 5]])
df_numerical = df[typeintersect.(colwise(eltype, df), Number) .!= Union{}]
This yields the expected result.
Question
Is there a simpler, more idiomatic way of doing this? Possibly similar to:
df.select_dtypes(include=[np.number])
in pandas as taken from an answer to this question?
julia> df[(<:).(eltypes(df),Union{Number,Missing})]
2×2 DataFrame
│ Row │ x2 │ x3 │
├─────┼─────────┼────┤
│ 1 │ 1 │ 2 │
│ 2 │ missing │ 5 │
Please note that the . is the broadcasting operator and hence I had to use <: operator in a functional form.
Another way to do it could be:
df_numerical = df[[i for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: Number]]
to retrieve all the columns whose element type is a subtype of Number, regardless of whether they hold missing data or not.
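The core of both answers is pure type logic, which can be checked in Base alone; and since the document's names(df, T) selector matches columns whose element type is a subtype of T, a plausible (untested here) modern spelling would be df[:, names(df, Union{Missing, Number})]:

```julia
# A numeric column with missings has eltype Union{Missing, Int64}, which is a
# subtype of Union{Missing, Number}; a String column is not.
Union{Missing, Int64} <: Union{Missing, Number}   # true  -> selected
String <: Union{Missing, Number}                  # false -> excluded
```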