What is pandas' transpose equivalent in Julia?

What is the Julia equivalent of pandas' transpose? Thanks.
I would like to transpose a data frame, but the transpose function isn't working.

It is permutedims. It turns a data frame on its side, so that rows become columns and the values in a chosen column become the new column names.

Note the difference between transpose and permutedims in Julia Base: permutedims only affects the outermost container, while transpose is recursive.
Here are the consequences:
julia> using Random
julia> x = [randstring(3) for _ in 1:3, _ in 1:3]
3×3 Matrix{String}:
"nTa" "QBM" "3dJ"
"RsL" "mD1" "3jq"
"dFp" "bfB" "k6P"
julia> permutedims(x)
3×3 Matrix{String}:
"nTa" "RsL" "dFp"
"QBM" "mD1" "bfB"
"3dJ" "3jq" "k6P"
julia> transpose(x)
3×3 transpose(::Matrix{String}) with eltype Union{}:
Error showing value of type LinearAlgebra.Transpose{Union{}, Matrix{String}}:
ERROR: MethodError: no method matching transpose(::String)
julia> y = [rand(3) for _ in 1:3, _ in 1:3]
3×3 Matrix{Vector{Float64}}:
[0.446435, 0.653228, 0.0857836] [0.378189, 0.505487, 0.0504642] [0.0570918, 0.462984, 0.800813]
[0.801857, 0.75505, 0.714087] [0.253316, 0.458364, 0.80242] [0.93742, 0.699745, 0.140957]
[0.419783, 0.22946, 0.748267] [0.445365, 0.563222, 0.363561] [0.088825, 0.0869342, 0.311187]
julia> permutedims(y)
3×3 Matrix{Vector{Float64}}:
[0.446435, 0.653228, 0.0857836] [0.801857, 0.75505, 0.714087] [0.419783, 0.22946, 0.748267]
[0.378189, 0.505487, 0.0504642] [0.253316, 0.458364, 0.80242] [0.445365, 0.563222, 0.363561]
[0.0570918, 0.462984, 0.800813] [0.93742, 0.699745, 0.140957] [0.088825, 0.0869342, 0.311187]
julia> transpose(y) # note that inside we have 1x3 objects not vectors
3×3 transpose(::Matrix{Vector{Float64}}) with eltype LinearAlgebra.Transpose{Float64, Vector{Float64}}:
[0.446435 0.653228 0.0857836] [0.801857 0.75505 0.714087] [0.419783 0.22946 0.748267]
[0.378189 0.505487 0.0504642] [0.253316 0.458364 0.80242] [0.445365 0.563222 0.363561]
[0.0570918 0.462984 0.800813] [0.93742 0.699745 0.140957] [0.088825 0.0869342 0.311187]
In DataFrames.jl we decided that this recursive behavior (which makes sense in a linear algebra context) is not desirable. You can even read this in the docstring of transpose, which states:
This operation is intended for linear algebra usage - for general data manipulation see permutedims, which is non-recursive.
Additionally, in DataFrames.jl permutedims requires you to specify the column whose values will become the column names after the operation (this requirement is specific to DataFrames.jl). You also need to be careful because eltype promotion is performed (this issue is not visible for matrices, which have a common eltype for all elements, while in a data frame each column might have a different eltype):
julia> df1 = DataFrame(rowkey=["x", "y"], b=[1.0, 2.0], c=[3, 4], d=[true, false])
2×4 DataFrame
Row │ rowkey b c d
│ String Float64 Int64 Bool
─────┼───────────────────────────────
1 │ x 1.0 3 true
2 │ y 2.0 4 false
julia> df2 = permutedims(df1, :rowkey)
3×3 DataFrame
Row │ rowkey x y
│ String Float64 Float64
─────┼──────────────────────────
1 │ b 1.0 2.0
2 │ c 3.0 4.0
3 │ d 1.0 0.0
julia> permutedims(df2, :rowkey)
2×4 DataFrame
Row │ rowkey b c d
│ String Float64 Float64 Float64
─────┼───────────────────────────────────
1 │ x 1.0 3.0 1.0
2 │ y 2.0 4.0 0.0

Related

ArgumentError: columns argument must be a vector of AbstractVector objects

I want to make a DataFrame in Julia with one column, but I get an error:
julia> using DataFrames
julia> r = rand(3);
julia> DataFrame(r, ["col1"])
ERROR: ArgumentError: columns argument must be a vector of AbstractVector objects
Why?
Update:
I figured out that I could say the following:
julia> DataFrame(reshape(r, :, 1), ["col1"])
3×1 DataFrame
Row │ col1
│ Float64
─────┼──────────
1 │ 0.800824
2 │ 0.989024
3 │ 0.722418
But it's not straightforward. Is there any better way? Why can't I easily create a DataFrame object from a Vector?
Why can't I easily create a DataFrame object from a Vector?
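Because it would be ambiguous with the syntax where you pass positional arguments the way you tried. Many popular table types are themselves vectors (for example, a vector of NamedTuples), so a vector in the first position cannot be assumed to be a single column.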
However, what you can write is just:
julia> r = rand(3);
julia> DataFrame(col1=r)
3×1 DataFrame
Row │ col1
│ Float64
─────┼────────────
1 │ 0.00676619
2 │ 0.554207
3 │ 0.394077
to get what you want.
An alternative more similar to your code would be:
julia> DataFrame([r], ["col1"])
3×1 DataFrame
Row │ col1
│ Float64
─────┼────────────
1 │ 0.00676619
2 │ 0.554207
3 │ 0.394077
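To illustrate the ambiguity mentioned above, here is a minimal sketch (the column names a, b and col1 and the values are made up for illustration). A plain vector of NamedTuples is already a valid row-oriented table, which is why a lone vector cannot simply be read as one column:

```julia
using DataFrames

# A vector of NamedTuples is a row-oriented table: each element is one row.
rows = [(a = 1, b = "x"), (a = 2, b = "y")]
df_rows = DataFrame(rows)          # data frame with columns a and b

# A single column therefore has to be named (or wrapped) explicitly.
r = [0.1, 0.2, 0.3]
df_col = DataFrame(col1 = r)       # data frame with the single column col1
```

Both calls take a vector, but only the second one is meant as "one column", so the constructor needs the keyword (or the [r] wrapping) to tell the cases apart.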

Julia DataFrame of correlation matrix, how to extract high correlated cell values and columns?

New to Julia. I'm working on a correlation matrix. I've converted it into a data frame to include feature names. To find which features are highly correlated, I need the names of the features and the corresponding values.
I get the positions of the high values using the following:
corr_matrix_df=cor(Matrix(df))
idx_hcorr=findall(x->abs.(x)>0.6, corr_matrix_df)
But I don't know how to get the column names.
If I sort it column-wise, the feature rows get shuffled incorrectly. Any ideas?
Here is how you can do it:
julia> using DataFrames, Random
julia> Random.seed!(1234)
MersenneTwister(1234)
julia> df = DataFrame(rand(5, 5), :auto)
5×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────
1 │ 0.590845 0.854147 0.648882 0.112486 0.950498
2 │ 0.766797 0.200586 0.0109059 0.276021 0.96467
3 │ 0.566237 0.298614 0.066423 0.651664 0.945775
4 │ 0.460085 0.246837 0.956753 0.0566425 0.789904
5 │ 0.794026 0.579672 0.646691 0.842714 0.82116
julia> using Statistics
julia> cm = cor(Matrix(df))
5×5 Matrix{Float64}:
1.0 0.101686 -0.420953 0.562488 0.2127
0.101686 1.0 0.378276 0.00772785 0.100182
-0.420953 0.378276 1.0 -0.327604 -0.791489
0.562488 0.00772785 -0.327604 1.0 -0.0746962
0.2127 0.100182 -0.791489 -0.0746962 1.0
julia> high = findall(x -> abs(x) > 0.6, cm)
7-element Vector{CartesianIndex{2}}:
CartesianIndex(1, 1)
CartesianIndex(2, 2)
CartesianIndex(3, 3)
CartesianIndex(5, 3)
CartesianIndex(4, 4)
CartesianIndex(3, 5)
CartesianIndex(5, 5)
julia> [[names(df, idx.I[1]); names(df, idx.I[2])] for idx in high]
7-element Vector{Vector{String}}:
["x1", "x1"]
["x2", "x2"]
["x3", "x3"]
["x5", "x3"]
["x4", "x4"]
["x3", "x5"]
["x5", "x5"]
Is this what you wanted? (I added one step after your last step.)
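The result above also lists the diagonal (every feature correlates perfectly with itself) and each pair twice. As a possible follow-up, here is a sketch on a small made-up data frame (the columns a, b, c and the 0.6 threshold are only for illustration) that keeps each pair once, together with its correlation value:

```julia
using DataFrames, Statistics

# Small made-up data: a and b move together, c alternates in sign.
df = DataFrame(a = [1.0, 2.0, 3.0, 4.0],
               b = [2.0, 4.1, 5.9, 8.0],
               c = [1.0, -1.0, 1.0, -1.0])

cm = cor(Matrix(df))
cols = names(df)

# Visit only entries above the diagonal (i < j), so self-correlations
# and mirrored duplicates are skipped.
high_pairs = [(cols[i], cols[j], cm[i, j])
              for j in axes(cm, 2) for i in 1:j-1
              if abs(cm[i, j]) > 0.6]
```

Each element of high_pairs is a tuple of two column names and the correlation between them.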

Find a subset of columns of a data frame that have some missing values

Given the following data frame from DataFrames.jl:
julia> using DataFrames
julia> df = DataFrame(x1=[1, 2, 3], x2=Union{Int,Missing}[1, 2, 3], x3=[1, 2, missing])
3×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64? Int64?
─────┼────────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 3 3 missing
I would like to find the columns that contain missing values.
I have tried:
julia> names(df, Missing)
String[]
but this does not give the desired result, because the names function, when passed a type, selects the columns whose element type is a subtype of the passed type.
If you want to find the columns that actually contain a missing value, use:
julia> names(df, any.(ismissing, eachcol(df)))
1-element Vector{String}:
"x3"
In this approach we iterate over each column of the df data frame and check whether it contains at least one missing value.
If you want to find the columns that can potentially contain a missing value, you need to check their element type:
julia> names(df, [eltype(col) >: Missing for col in eachcol(df)]) # using a comprehension
2-element Vector{String}:
"x2"
"x3"
julia> names(df, .>:(eltype.(eachcol(df)), Missing)) # using broadcasting
2-element Vector{String}:
"x2"
"x3"
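The subtype relations behind these results can be checked directly in Base Julia. A minimal sketch (the name T is only for illustration):

```julia
T = Union{Int, Missing}      # element type of a column that allows missing

T <: Missing                 # false: T also admits Int, so names(df, Missing) skips it
Missing <: T                 # true:  Missing is one member of the union
T >: Missing                 # true:  the same check written with >:
Int >: Missing               # false: a plain Int column cannot hold missing
```

This is why names(df, Missing) returned an empty vector, while the >: Missing check picks up both x2 and x3.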

Keep variables type after using data frame

I'm trying to use the kproto() function from the R package clustMixType to cluster mixed-type data in Julia, but I'm getting the error No numeric variables in x! Try using kmodes() from package.... My data should have 3 variables: 2 continuous and 1 categorical. It seems that after I used DataFrame() all the variables became categorical. Is there a way to keep the variable types when using DataFrame(), so that I have mixed-type data (continuous and categorical) to pass to kproto()?
using RCall, Distributions, DataFrames
#rlibrary clustMixType
# group 1 variables
x1=rand(Normal(0,3),10)
x2=rand(Normal(1,2),10)
x3=["1","1","2","2","0","1","1","2","2","0"]
g1=hcat(x1,x2,x3)
# group 2 variables
y1=rand(Normal(0,4),10)
y2=rand(Normal(-1,6),10)
y3=["1","1","2","1","1","2","2","0","2","0"]
g2=hcat(y1,y2,y3)
#create the data
df0=vcat(g1,g2)
df1 = DataFrame(df0, :auto)
#use R function
R"kproto($df1, 2)"
I don't know anything about the R package and what kind of input it expects, but the issue is probably how you construct the data matrix from which you construct your DataFrame, not the DataFrame constructor itself.
When you concatenate a numerical and a string column, Julia falls back on the element type Any for the resulting matrix:
julia> g1=hcat(x1,x2,x3)
10×3 Matrix{Any}:
0.708309 -4.84767 "1"
0.566883 -0.214217 "1"
...
That means your df0 matrix is:
julia> #create the data
df0=vcat(g1,g2)
20×3 Matrix{Any}:
0.708309 -4.84767 "1"
0.566883 -0.214217 "1"
...
and the DataFrame constructor will just carry this lack of type information through rather than trying to infer column types.
julia> DataFrame(df0, :auto)
20×3 DataFrame
Row │ x1 x2 x3
│ Any Any Any
─────┼───────────────────────────
1 │ 0.708309 -4.84767 1
2 │ 0.566883 -0.214217 1
...
A simple way of getting around this is to just not concatenate your columns into a single matrix, but to construct the DataFrame from the columns:
julia> DataFrame([vcat(x1, y1), vcat(x2, y2), vcat(x3, y3)], :auto)
20×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 String
─────┼───────────────────────────────
1 │ 0.708309 -4.84767 1
2 │ 0.566883 -0.214217 1
...
As you can see, we now have two Float64 numerical columns x1 and x2 in the resulting DataFrame.
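The same idea can be written with keyword arguments, which also lets you name the columns. A minimal sketch with made-up short vectors (the column names num and cat are hypothetical):

```julia
using DataFrames

# Made-up stand-ins for the two groups from the question.
x1 = [0.1, 0.2]; y1 = [0.3, 0.4]     # continuous variable
x3 = ["1", "2"]; y3 = ["0", "1"]     # categorical variable stored as strings

# Concatenate each variable separately so every column keeps its own eltype.
df = DataFrame(num = vcat(x1, y1), cat = vcat(x3, y3))

eltype.(eachcol(df))                 # Float64 for num, String for cat
```

Because no mixed-type matrix is ever built, nothing falls back to Any.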
As an addition to the nice answer by Nils (the problem indeed arises when the matrix is constructed, not when the DataFrame is created), there is this little trick: broadcasting identity over the data frame narrows the element type of each column.
julia> df = DataFrame([1 1.0 "1"; 2 2.0 "2"], [:int, :float, :string])
2×3 DataFrame
Row │ int float string
│ Any Any Any
─────┼────────────────────
1 │ 1 1.0 1
2 │ 2 2.0 2
julia> identity.(df)
2×3 DataFrame
Row │ int float string
│ Int64 Float64 String
─────┼────────────────────────
1 │ 1 1.0 1
2 │ 2 2.0 2
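The trick relies on how broadcasting works in Base Julia: the result is a freshly allocated container whose element type is computed from the values actually produced. A minimal sketch on a plain vector (the names v and w are only for illustration):

```julia
v = Any[1, 2, 3]         # eltype is Any although every value is an Int
w = identity.(v)         # broadcasting allocates a new, more narrowly typed vector

eltype(v)                # Any
eltype(w)                # Int
```

Note that this copies the data, so for very large tables building the columns with the right types from the start is cheaper.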

Select numerical columns of Julia DataFrame with missing values

I want to select all columns of a DataFrame in which the datatype is a subtype of Number. However, since there are columns with missing values, the numerical column datatypes can be something like Union{Missing, Int64}.
So far, I came up with:
using DataFrames
df = DataFrame([["a", "b"], [1, missing] ,[2, 5]])
df_numerical = df[typeintersect.(colwise(eltype, df), Number) .!= Union{}]
This yields the expected result.
Question
Is there a simpler, more idiomatic way of doing this? Possibly similar to:
df.select_dtypes(include=[np.number])
in pandas as taken from an answer to this question?
julia> df[(<:).(eltypes(df),Union{Number,Missing})]
2×2 DataFrame
│ Row │ x2 │ x3 │
├─────┼─────────┼────┤
│ 1 │ 1 │ 2 │
│ 2 │ missing │ 5 │
Please note that the . is the broadcasting operator, hence I had to use the <: operator in its functional form.
Another way to do it could be:
df_numerical = df[[i for i in names(df) if Base.nonmissingtype(eltype(df[i])) <: Number]]
This retrieves all the columns whose element type is a subtype of Number, regardless of whether they hold missing data or not.
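The snippets above were written for an older DataFrames.jl. As far as I know, colwise, eltypes and single-argument indexing like df[cols] are not available in the 1.x releases. A sketch of the same selection, assuming DataFrames.jl 1.x and made-up column names x1 to x3:

```julia
using DataFrames

df = DataFrame(x1 = ["a", "b"], x2 = [1, missing], x3 = [2, 5])

# names(df, T) returns the columns whose eltype is a subtype of T,
# so Union{Missing, Number} matches numeric columns with or without missing.
num_cols = names(df, Union{Missing, Number})

df_numerical = df[:, num_cols]
```

One caveat: a column whose eltype is exactly Missing would also match, since Missing is a subtype of Union{Missing, Number}.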