How to do prop.table() in Julia with a DataFrame

I am trying to move from R to Julia.
I have a dataset with 2 columns of prices and 2 conditional columns telling me if the price is "cheap" or "expensive".
I want to count how many "cheap" and "expensive" entries there are.
Using the package DataStructures I got this:
using DataStructures
counter(df.p_orellana)
Accumulator{Union{Missing, String}, Int64} with 3 entries:
"expensive" => 18
missing => 2
"cheap" => 22
This would be the same as the table() function in R.
Is there any way to turn these counts into proportions?
In R I would use the prop.table() function, but I am not sure how to do it in Julia.
I would like to have:
Accumulator{Union{Missing, String}, Int64} with 3 entries:
"expensive" => 0.4285
missing => 0.0476
"cheap" => 0.5238
Thanks in advance!

Use the FreqTables.jl package.
Here is an example:
julia> using FreqTables
julia> data = [fill("expensive", 18); fill(missing, 2); fill("cheap", 22)];
julia> freqtable(data)
3-element Named Vector{Int64}
Dim1 │
──────────┼───
cheap │ 22
expensive │ 18
missing │ 2
julia> proptable(data)
3-element Named Vector{Float64}
Dim1 │
──────────┼─────────
cheap │ 0.52381
expensive │ 0.428571
missing │ 0.047619
The results are shown in sorted order. If you would like a different order, additionally use the CategoricalArrays.jl package and set an appropriate ordering of levels:
julia> using CategoricalArrays
julia> cat_data = categorical(data, levels=["expensive", "cheap"]);
julia> freqtable(cat_data)
3-element Named Vector{Int64}
Dim1 │
────────────┼───
"expensive" │ 18
"cheap" │ 22
missing │ 2
julia> proptable(cat_data)
3-element Named Vector{Float64}
Dim1 │
────────────┼─────────
"expensive" │ 0.428571
"cheap" │ 0.52381
missing │ 0.047619
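If the goal is a key-to-proportion mapping like the one sketched in the question, and adding a package is not an option, something along these lines should work in base Julia. The name propdict is hypothetical (it is not part of any package); it counts each distinct value, including missing, and divides by the number of entries.

```julia
# Hypothetical helper (not from any package): map each distinct value,
# including `missing`, to its share of the entries.
function propdict(data::AbstractVector)
    counts = Dict{eltype(data), Int}()
    for x in data
        counts[x] = get(counts, x, 0) + 1   # `missing` is a valid Dict key
    end
    n = length(data)
    return Dict(k => c / n for (k, c) in counts)
end

data = [fill("expensive", 18); fill(missing, 2); fill("cheap", 22)]
props = propdict(data)

props["cheap"]    # 22/42
props[missing]    # 2/42
```

Unlike the named vectors returned by FreqTables.jl, a Dict has no defined ordering, so sort the keys yourself if the display order matters.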

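Adding a base Julia approach. The function tableprop can be put into ~/.julia/config/startup.jl so that it loads automatically. It returns a DataFrame if DataFrames is loaded and falls back to a plain matrix otherwise.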
function tableprop(data::Vector)
uniq = unique(data)
res = [sum(data .=== i) for i in uniq]
try
DataFrame(data=uniq, count=res, prop=res/sum(res))
catch
hcat(uniq, res, res/sum(res))
end
end
julia> using DataFrames # just for pretty print
julia> tableprop(df)
3×3 DataFrame
Row │ data count prop
│ String? Int64 Float64
─────┼───────────────────────────
1 │ cheap 5 0.5
2 │ expensive 3 0.3
3 │ missing 2 0.2
Data
julia> df = ["cheap","expensive",missing,"cheap","cheap",
"expensive","expensive","cheap",missing,"cheap"]
10-element Vector{Union{Missing, String}}:
"cheap"
"expensive"
missing
"cheap"
"cheap"
"expensive"
"expensive"
"cheap"
missing
"cheap"

ArgumentError: columns argument must be a vector of AbstractVector objects

I want to make a DataFrame in Julia with one column, but I get an error:
julia> using DataFrames
julia> r = rand(3);
julia> DataFrame(r, ["col1"])
ERROR: ArgumentError: columns argument must be a vector of AbstractVector objects
Why?
Update:
I figured out that I could say the following:
julia> DataFrame(reshape(r, :, 1), ["col1"])
3×1 DataFrame
Row │ col1
│ Float64
─────┼──────────
1 │ 0.800824
2 │ 0.989024
3 │ 0.722418
But it's not straightforward. Is there any better way? Why can't I easily create a DataFrame object from a Vector?
Why can't I easily create a DataFrame object from a Vector?
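Because it would be ambiguous with the positional-argument syntax you tried. Many common table types are themselves vectors (for example, a vector of NamedTuples), so a vector passed positionally cannot simply be assumed to be a single column.
However, what you can write is just: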
julia> r = rand(3);
julia> DataFrame(col1=r)
3×1 DataFrame
Row │ col1
│ Float64
─────┼────────────
1 │ 0.00676619
2 │ 0.554207
3 │ 0.394077
to get what you want.
An alternative more similar to your code would be:
julia> DataFrame([r], ["col1"])
3×1 DataFrame
Row │ col1
│ Float64
─────┼────────────
1 │ 0.00676619
2 │ 0.554207
3 │ 0.394077
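The ambiguity is easier to see next to a case where a positional vector already has a meaning. A small sketch, assuming a recent DataFrames.jl (1.x); the variable names are illustrative only:

```julia
using DataFrames

# A vector of NamedTuples is already a table (one row per element),
# so the constructor reads it as rows, not as a single column.
rows = [(a = 1, b = 2.0), (a = 3, b = 4.0)]
df_rows = DataFrame(rows)          # 2×2 with columns a and b

# A single column is therefore spelled out by name.
r = rand(3)
df_kw   = DataFrame(col1 = r)      # keyword form
df_pair = DataFrame("col1" => r)   # pair form, handy when the name is a string
```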

Keep variables type after using data frame

I'm trying to use the kproto() function from the R package clustMixType to cluster mixed-type data in Julia, but I'm getting the error "No numeric variables in x! Try using kmodes() from package....". My data should have 3 variables: 2 continuous and 1 categorical. It seems that after I call DataFrame(), all the variables become categorical. Is there a way to keep the variable types when using DataFrame(), so that I have mixed-type data (continuous and categorical) to pass to kproto()?
using RCall
using DataFrames    # DataFrame
using Distributions # Normal
#rlibrary clustMixType
# group 1 variables
x1=rand(Normal(0,3),10)
x2=rand(Normal(1,2),10)
x3=["1","1","2","2","0","1","1","2","2","0"]
g1=hcat(x1,x2,x3)
# group 2 variables
y1=rand(Normal(0,4),10)
y2=rand(Normal(-1,6),10)
y3=["1","1","2","1","1","2","2","0","2","0"]
g2=hcat(y1,y2,y3)
#create the data
df0=vcat(g1,g2)
df1 = DataFrame(df0)
#use R function
R"kproto($df1, 2)"
I don't know anything about the R package and what kind of input it expects, but the issue is probably how you construct the data matrix from which you construct your DataFrame, not the DataFrame constructor itself.
When you concatenate a numerical and a string column, Julia falls back on the element type Any for the resulting matrix:
julia> g1=hcat(x1,x2,x3)
10×3 Matrix{Any}:
0.708309 -4.84767 "1"
0.566883 -0.214217 "1"
...
That means your df0 matrix is:
julia> #create the data
df0=vcat(g1,g2)
20×3 Matrix{Any}:
0.708309 -4.84767 "1"
0.566883 -0.214217 "1"
...
and the DataFrame constructor will just carry this lack of type information through rather than trying to infer column types.
julia> DataFrame(df0)
20×3 DataFrame
Row │ x1 x2 x3
│ Any Any Any
─────┼───────────────────────────
1 │ 0.708309 -4.84767 1
2 │ 0.566883 -0.214217 1
...
A simple way of getting around this is to just not concatenate your columns into a single matrix, but to construct the DataFrame from the columns:
julia> DataFrame([vcat(x1, y1), vcat(x2, y2), vcat(x3, y3)])
20×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 String
─────┼───────────────────────────────
1 │ 0.708309 -4.84767 1
2 │ 0.566883 -0.214217 1
...
As you can see, we now have two Float64 numerical columns x1 and x2 in the resulting DataFrame.
As an addition to the nice answer by Nils (the problem indeed arises when the matrix is constructed, not when the DataFrame is created), there is this little trick:
julia> df = DataFrame([1 1.0 "1"; 2 2.0 "2"], [:int, :float, :string])
2×3 DataFrame
Row │ int float string
│ Any Any Any
─────┼────────────────────
1 │ 1 1.0 1
2 │ 2 2.0 2
julia> identity.(df)
2×3 DataFrame
Row │ int float string
│ Int64 Float64 String
─────┼────────────────────────
1 │ 1 1.0 1
2 │ 2 2.0 2
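The same effect can be reproduced without DataFrames, which may make the mechanism clearer. In this sketch (variable names are illustrative), hcat on a numeric and a string vector yields a Matrix{Any}, and broadcasting identity over one of its columns rebuilds the vector with an element type taken from the values actually stored in it.

```julia
nums = [0.5, 1.5, 2.5]
labels = ["1", "2", "0"]

m = hcat(nums, labels)     # no common element type, so Julia falls back to Any
eltype(m)                  # Any

col = m[:, 1]              # still a Vector{Any}, although every value is a Float64
narrowed = identity.(col)  # broadcasting rebuilds the vector from the actual values
eltype(narrowed)           # Float64
```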

Issue with Left Outer Join in Julia DataFrame

This one has me stumped.
I'm trying to join two dataframes in Julia but I get this weird 'nothing' error. This works on a different machine, so I'm thinking it could be a package problem. I ran Pkg.rm() on everything and re-installed, but no go.
Julia v1.2
using PyCall;
using DataFrames;
using CSV;
using Statistics;
using StatsBase;
using Random;
using Plots;
using Dates;
using Missings;
using RollingFunctions;
# using Indicators;
using Pandas;
using GLM;
using Impute;
a = DataFrames.DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])
b = DataFrames.DataFrame(x = [1, 2, 3, 4], z = ["d", "e", "f", "g"])
join(a, b, on=:x, kind =:left)
yields
ArgumentError: `nothing` should not be printed; use `show`, `repr`, or custom output instead.
Stacktrace:
[1] print(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Nothing) at ./show.jl:587
[2] print_to_string(::String, ::Vararg{Any,N} where N) at ./strings/io.jl:129
[3] string at ./strings/io.jl:168 [inlined]
[4] #join#543(::Symbol, ::Symbol, ::Bool, ::Nothing, ::Tuple{Bool,Bool}, ::typeof(join), ::DataFrames.DataFrame, ::DataFrames.DataFrame) at /Users/username/.julia/packages/DataFrames/3ZmR2/src/deprecated.jl:298
[5] (::getfield(Base, Symbol("#kw##join")))(::NamedTuple{(:on, :kind),Tuple{Symbol,Symbol}}, ::typeof(join), ::DataFrames.DataFrame, ::DataFrames.DataFrame) at ./none:0
[6] top-level scope at In[15]:4
kind=:inner works fine but :left, :right, and :outer don't.
This is a problem caused by the way Julia 1.2 prints nothing (i.e. it throws an error when trying to print it). If you switch to Julia 1.4.1, the problem will disappear.
However, I can see you are on DataFrames.jl 0.21. In this version the join function is deprecated. You should use the innerjoin, leftjoin, rightjoin, outerjoin, etc. functions instead. Then everything will also work on Julia 1.2, e.g.:
julia> leftjoin(a, b, on=:x)
3×3 DataFrame
│ Row │ x │ y │ z │
│ │ Int64 │ String │ String? │
├─────┼───────┼────────┼─────────┤
│ 1 │ 1 │ a │ d │
│ 2 │ 2 │ b │ e │
│ 3 │ 3 │ c │ f │
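For reference, the dedicated function for each kind value of the deprecated join call is sketched below. This assumes DataFrames.jl 0.21 or later and reuses the a and b frames from the question; the result names are illustrative.

```julia
using DataFrames

a = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])
b = DataFrame(x = [1, 2, 3, 4], z = ["d", "e", "f", "g"])

inner = innerjoin(a, b, on = :x)   # keys present in both frames: 3 rows
left  = leftjoin(a, b, on = :x)    # every row of a: 3 rows
right = rightjoin(a, b, on = :x)   # every row of b: 4 rows, y is missing for x = 4
outer = outerjoin(a, b, on = :x)   # union of the keys: 4 rows
```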

Apparent issues with DataFrame string values

I am not sure if this is an actual problem or if I am just not doing something the correct way, but at the moment it appears a little bizarre to me.
When using DataFrames I came across an issue where if you copy a DataFrame to another variable, then any changes made to either of the variables changes both. This goes for the individual columns too. For example:
julia> x = DataFrame(A = ["pink", "blue", "green"], B = ["yellow", "red", "purple"]);
julia> y = x;
julia> x[x.A .== "blue", :A] = "red";
julia> x
3×2 DataFrame
│ Row │ A │ B │
├─────┼───────┼────────┤
│ 1 │ pink │ yellow │
│ 2 │ red │ red │
│ 3 │ green │ purple │
julia> y
3×2 DataFrame
│ Row │ A │ B │
├─────┼───────┼────────┤
│ 1 │ pink │ yellow │
│ 2 │ red │ red │
│ 3 │ green │ purple │
A similar thing happens with columns too: if I were to set up a DataFrame like the one above but use B = A before incorporating both into a data frame, then changing the values in one column automatically changes the other as well.
This seems odd to me, and maybe it is a feature of other programming languages but I have done the same thing as above in R many times when making a backup of a data table or swapping data between columns, and have never seen this issue. So the question is, is it working as designed and is there a correct way for copying values between data frames?
I am using Julia version 0.7.0 since I originally installed 1.0.0 through the Manjaro repository and had issues with the Is_windows() when trying to build Tk.
The command y = x does not create a new object; it just creates a new reference (or name) for the same DataFrame.
You can create a copy by calling y = copy(x). In your case, this still doesn't work, as it copies only the data frame itself, not the column vectors stored in it.
If you want a completely independent new object, you can use y = deepcopy(x). In this case, y will have no references to x.
See this thread for a more detailed discussion:
https://discourse.julialang.org/t/what-is-the-difference-between-copy-and-deepcopy/3918/2
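The three cases can be told apart in base Julia. In this sketch a Dict of column vectors stands in for the data frame, which is a simplification for illustration only; how copy treats the columns of an actual DataFrame may differ between DataFrames.jl versions.

```julia
# A Dict holding one column vector stands in for a data frame here
# (an assumption for illustration, not the DataFrames.jl internals).
x = Dict(:A => ["pink", "blue", "green"])

alias   = x             # another name for the same object
shallow = copy(x)       # new Dict, but it holds the same column vector
deep    = deepcopy(x)   # new Dict and a new column vector

x[:A][2] = "red"        # mutate the column in place

alias[:A][2]     # "red"
shallow[:A][2]   # "red"
deep[:A][2]      # "blue"
```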

Select numerical columns of Julia DataFrame with missing values

I want to select all columns of a DataFrame in which the datatype is a subtype of Number. However, since there are columns with missing values, the numerical column datatypes can be something like Union{Missing, Int64}.
So far, I came up with:
using DataFrames
df = DataFrame([["a", "b"], [1, missing] ,[2, 5]])
df_numerical = df[typeintersect.(colwise(eltype, df), Number) .!= Union{}]
This yields the expected result.
Question
Is there a simpler, more idiomatic way of doing this? Possibly similar to:
df.select_dtypes(include=[np.number])
in pandas as taken from an answer to this question?
julia> df[(<:).(eltypes(df),Union{Number,Missing})]
2×2 DataFrame
│ Row │ x2 │ x3 │
├─────┼─────────┼────┤
│ 1 │ 1 │ 2 │
│ 2 │ missing │ 5 │
Please note that the . is the broadcasting operator, and hence I had to use the <: operator in its functional form.
Another way to do it could be:
df_numerical = df[[i for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: Number]]
This retrieves all the columns whose element type is a subtype of Number, regardless of whether they contain missing data.
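Both type checks can be tried out on plain vectors, independent of the DataFrames.jl version in use. In this sketch a NamedTuple of columns stands in for the data frame, and all names are illustrative.

```julia
# Columns held in a NamedTuple stand in for the data frame (illustrative only).
cols = (x1 = ["a", "b"], x2 = [1, missing], x3 = [2, 5])

types = [eltype(c) for c in cols]   # String, Union{Missing, Int64}, Int64

# Broadcast `<:` in its function-call form, as in the first answer.
mask = (<:).(types, Union{Number, Missing})

# Equivalent check that strips Missing from the element type first.
mask2 = [Base.nonmissingtype(T) <: Number for T in types]

# Keep the names of the columns that passed the check.
numeric_names = [k for (k, keep) in zip(keys(cols), mask) if keep]   # [:x2, :x3]
```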