Julia DataFrame manipulation: concatenate datarow contents - dataframe

I am trying to concatenate the contents of the rows of a DataFrame similar to this one:
DataFrame(a=["aa","ab","ac"], year=[2015,2016,2017])
a year
aa 2015
ab 2016
ac 2017
The desired output would be a concatenation of the contents of the row cells converted to string
output
aa2015
ab2016
ac2017
I have found this code working in the right direction:
df[:c] = map((x,y) -> string(x, y), df[:a], df[:year])
However my input can be variable as I can have a different number of columns and I want all their contents to be concatenated row by row.
Any suggestion on how to achieve this? It doesn't matter if the column gets added to the original Dataframe, if that helps.
Thanks a lot

You can use eachrow to achieve this:
julia> df = DataFrame(rand('a':'z', 5,5))
5×5 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
│ │ Char │ Char │ Char │ Char │ Char │
├─────┼──────┼──────┼──────┼──────┼──────┤
│ 1 │ 'l' │ 'p' │ 'y' │ 't' │ 'n' │
│ 2 │ 'p' │ 'y' │ 'y' │ 'r' │ 's' │
│ 3 │ 'y' │ 'a' │ 'o' │ 'c' │ 'a' │
│ 4 │ 'k' │ 't' │ 'q' │ 's' │ 'q' │
│ 5 │ 'a' │ 'c' │ 'w' │ 'f' │ 'v' │
julia> join.(eachrow(df))
5-element Array{String,1}:
"lpytn"
"pyyrs"
"yaoca"
"ktqsq"
"acwfv"
(here I just created a new vector - you can of course add it to a DataFrame if you want)

Related

Return copy of a DataFrame that contains only rows with missing data in Julia

I am looking for the opposite of the dropmissing function in DataFrames.jl so that the user knows where to look to fix their bad data. It seems like this should be easy, but the filter function expects a column to be specified and I cannot get it to iterate over all columns.
julia> df=DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> filter(x -> ismissing(eachcol(x)), df)
ERROR: MethodError: no method matching eachcol(::DataFrameRow{DataFrame,DataFrames.Index})
julia> filter(x -> ismissing.(x), df)
ERROR: ArgumentError: broadcasting over `DataFrameRow`s is reserved
I am basically trying to recreate the disallowmissing function, but with a more useful error message.
Here are two ways to do it:
julia> df = DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> df[.!completecases(df), :] # this will be faster
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> #view df[.!completecases(df), :]
2×2 SubDataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df, view=true) # requires DataFrames.jl 0.22
2×2 SubDataFrame
Row │ a b
│ Int64? Int64?
─────┼──────────────────
1 │ missing 5
2 │ 3 missing

Julia: how to compute a particular operation on certain columns of a Dataframe

I have the following Dataframe
using DataFrames, Statistics
df = DataFrame(name=["John", "Sally", "Kirk"],
age=[23., 42., 59.],
children=[3,5,2], height = [180, 150, 170])
print(df)
3×4 DataFrame
│ Row │ name │ age │ children │ height │
│ │ String │ Float64 │ Int64 │ Int64 │
├─────┼────────┼─────────┼──────────┼────────┤
│ 1 │ John │ 23.0 │ 3 │ 180 │
│ 2 │ Sally │ 42.0 │ 5 │ 150 │
│ 3 │ Kirk │ 59.0 │ 2 │ 170 │
I can compute the mean of a column as follow:
println(mean(df[:4]))
166.66666666666666
Now I want to get the mean of all the numeric column and tried this code:
x = [2,3,4]
for i in x
print(mean(df[:x[i]]))
end
But got the following error message:
MethodError: no method matching getindex(::Symbol, ::Int64)
Stacktrace:
[1] top-level scope at ./In[64]:3
How can I solve the problem?
You are trying to access the DataFrame's column using an integer index specifying the column's position. You should just use the integer value without any : before i, which would create the symbol :i but you do not a have column named i.
x = [2,3,4]
for i in x
println(mean(df[i])) # no need for `x[i]`
end
You can also index a DataFrame using a Symbol denoting the column's name.
x = [:age, :children, :height];
for c in x
println(mean(df[c]))
end
You get the following error in your attempt because you are trying to access the ith index of the symbol :x, which is an undefined operation.
MethodError: no method matching getindex(::Symbol, ::Int64)
Note that :4 is just 4.
julia> :4
4
julia> typeof(:4)
Int64
Here is a one-liner that actually selects all Number columns:
julia> mean.(eachcol(df[findall(x-> x<:Number, eltypes(df))]))
3-element Array{Float64,1}:
41.333333333333336
3.3333333333333335
166.66666666666666
For many scenarios describe is actually more convenient:
julia> describe(df)
4×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼─────────┼──────┼────────┼───────┼─────────┼──────────┼──────────┤
│ 1 │ name │ │ John │ │ Sally │ 3 │ │ String │
│ 2 │ age │ 41.3333 │ 23.0 │ 42.0 │ 59.0 │ │ │ Float64 │
│ 3 │ children │ 3.33333 │ 2 │ 3.0 │ 5 │ │ │ Int64 │
│ 4 │ height │ 166.667 │ 150 │ 170.0 │ 180 │ │ │ Int64 │
In the question println(mean(df[4])) works as well (instead of println(mean(df[:4]))).
Hence we can write
x = [2,3,4]
for i in x
println(mean(df[i]))
end
which works

Convert a Julia DataFrame column with String to one with Int and missing values

I need to convert the following DataFrame
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
which looks like
3×2 DataFrame
│ Row │ A │ B │
│ │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I would like to convert A column from Array{String,1} to array of Int with missing values.
I tried
julia> df.A = tryparse.(Int, df.A)
3-element Array{Union{Nothing, Int64},1}:
nothing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Union… │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
julia> eltype(df.A)
Union{Nothing, Int64}
but I'm getting A column with elements of type Union{Nothing, Int64}.
nothing (of type Nothing) and missing (of type Missing) seems to be 2 differents kind of value.
So I wonder how I can A columns with missing values instead?
I also wonder if missing and nothing leads to different performance.
I would have done the following:
julia> df.A = map(x->begin val = tryparse(Int, x)
ifelse(typeof(val) == Nothing, missing, val)
end, df.A)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I think missing is more suitable for dataframes which indeed have missing values, instead of nothing, because the latter is more considered as a void in C, or None in Python, see here.
As a side note, Missing type has some Julia functionalities.
Replacing nothing by missing can simply be done using replace:
julia> df.A = replace(df.A, nothing=>missing)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
an other solution is to use tryparsem function defined as following
tryparsem(T, str) = something(tryparse(T, str), missing)
and use it like
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
julia> df.A = tryparsem.(Int, df.A)

Rename Dataframe column names julia v1.0

In 0.6 I was using:
colnames = ["Date_Time","Date_index","Time_index"]
names!(data1_date_time_index.colindex, map(parse, colnames))
What is the syntax for v1.0 - right now .colindex is not found.
Per DataFrames docs:
rename!(data1_date_time_index, f => t for (f, t) =
zip([:x1, :x1_1, :x1_2],
[:Date_Time, :Date_index, :Time_index]))
Assuming data1_date_time_index is a DataFrame that has three columns use:
colnames = ["Date_Time","Date_index","Time_index"]
names!(data1_date_time_index, Symbol.(colnames))
I am not 100% sure if this is what you want, as your example was not fully reproducible (so if actually you needed something else can you please submit full code that can be run).
The problem with data1_date_time_index.colindex is that currently . is used to access columns of a DataFrame by their name (and not fields of DataFrame type). In general you are not recommended to use colindex as it is not part of exposed API and might change in the future. If you really need to reach it use getfield(data_frame_name, :colindex).
EDIT
In DataFrames 0.20 you should write:
rename!(data1_date_time_index, Symbol.(colnames))
and in DataFrames 0.21 (which will be released before summer 2020) also passing strings directly will most probably be allowed like this:
rename!(data1_date_time_index, colnames)
(see here for a related discussion)
You can rename column through select also
For Ex:
df = DataFrame(col1 = 1:4, col2 = ["John", "James", "Finch", "May"])
│ Row │ col1 │ col2 │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ John │
│ 2 │ 2 │ James │
│ 3 │ 3 │ Finch │
│ 4 │ 4 │ May │
select(df, "col1" => "Id", "col2" => "Name")
│ Row │ Id │ Name │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ John │
│ 2 │ 2 │ James │
│ 3 │ 3 │ Finch │
│ 4 │ 4 │ May │
Rename columns:
names!(df, [:c1,:c2,:c3]) #(all)
rename!(df, Dict(:oldCol => :newCol)) # (a selection)
(from: https://syl1.gitbook.io/julia-language-a-concise-tutorial/useful-packages/dataframes )

Indexing Dataframe with variable in Julia

I want to create a indexed subset of a DataFrame and use a variable inside it. In this case i want to change all -9999 values of the first column to NA's. If I do: df[df[:1] .== -9999, :1] = NA it works like it should.. But if i use a variable as the indexer it througs an error (LoadError: KeyError: key :i not found):
i = 1
df[df[:i] .== -9999, :i] = NA
:i is actually a symbol in julia:
julia> typeof(:i)
Symbol
you can define a variable binding to a symbol like this:
julia> i = Symbol(2)
Symbol("2")
then you can simply use df[df[i] .== 1, i] = 123:
julia> df
10×1 DataFrames.DataFrame
│ Row │ 2 │
├─────┼─────┤
│ 1 │ 123 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ 5 │
│ 6 │ 6 │
│ 7 │ 7 │
│ 8 │ 8 │
│ 9 │ 9 │
│ 10 │ 10 │
It's worth noting that in your example df[df[:1] .== -9999, :1], :1 is NOT a symbol:
julia> :1
1
In fact, the expression is equal to df[df[1] .== -9999, 1] which works in that there is a corresponding getindex method whose argument (col_ind) can accept a common index:
julia> #which df[df[1].==1, 1]
getindex{T<:Real}(df::DataFrames.DataFrame, row_inds::AbstractArray{T,1}, col_ind::Union{Real,Symbol})
Since you just want to change the first (n) column, there is no difference between Symbol("1") and 1 as long as your column names are regularly arranged as:
│ Row │ 1 │ 2 │ 3 │...
├─────┼─────┤─────┼─────┤
│ 1 │ │ │ │...