I want to create a indexed subset of a DataFrame and use a variable inside it. In this case i want to change all -9999 values of the first column to NA's. If I do: df[df[:1] .== -9999, :1] = NA it works like it should.. But if i use a variable as the indexer it througs an error (LoadError: KeyError: key :i not found):
i = 1
df[df[:i] .== -9999, :i] = NA
:i is actually a symbol in julia:
julia> typeof(:i)
Symbol
you can define a variable binding to a symbol like this:
julia> i = Symbol(2)
Symbol("2")
then you can simply use df[df[i] .== 1, i] = 123:
julia> df
10×1 DataFrames.DataFrame
│ Row │ 2 │
├─────┼─────┤
│ 1 │ 123 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ 5 │
│ 6 │ 6 │
│ 7 │ 7 │
│ 8 │ 8 │
│ 9 │ 9 │
│ 10 │ 10 │
It's worth noting that in your example df[df[:1] .== -9999, :1], :1 is NOT a symbol:
julia> :1
1
In fact, the expression is equal to df[df[1] .== -9999, 1] which works in that there is a corresponding getindex method whose argument (col_ind) can accept a common index:
julia> #which df[df[1].==1, 1]
getindex{T<:Real}(df::DataFrames.DataFrame, row_inds::AbstractArray{T,1}, col_ind::Union{Real,Symbol})
Since you just want to change the first (n) column, there is no difference between Symbol("1") and 1 as long as your column names are regularly arranged as:
│ Row │ 1 │ 2 │ 3 │...
├─────┼─────┤─────┼─────┤
│ 1 │ │ │ │...
Related
I have the following Dataframe
using DataFrames, Statistics
df = DataFrame(name=["John", "Sally", "Kirk"],
age=[23., 42., 59.],
children=[3,5,2], height = [180, 150, 170])
print(df)
3×4 DataFrame
│ Row │ name │ age │ children │ height │
│ │ String │ Float64 │ Int64 │ Int64 │
├─────┼────────┼─────────┼──────────┼────────┤
│ 1 │ John │ 23.0 │ 3 │ 180 │
│ 2 │ Sally │ 42.0 │ 5 │ 150 │
│ 3 │ Kirk │ 59.0 │ 2 │ 170 │
I can compute the mean of a column as follow:
println(mean(df[:4]))
166.66666666666666
Now I want to get the mean of all the numeric column and tried this code:
x = [2,3,4]
for i in x
print(mean(df[:x[i]]))
end
But got the following error message:
MethodError: no method matching getindex(::Symbol, ::Int64)
Stacktrace:
[1] top-level scope at ./In[64]:3
How can I solve the problem?
You are trying to access the DataFrame's column using an integer index specifying the column's position. You should just use the integer value without any : before i, which would create the symbol :i but you do not a have column named i.
x = [2,3,4]
for i in x
println(mean(df[i])) # no need for `x[i]`
end
You can also index a DataFrame using a Symbol denoting the column's name.
x = [:age, :children, :height];
for c in x
println(mean(df[c]))
end
You get the following error in your attempt because you are trying to access the ith index of the symbol :x, which is an undefined operation.
MethodError: no method matching getindex(::Symbol, ::Int64)
Note that :4 is just 4.
julia> :4
4
julia> typeof(:4)
Int64
Here is a one-liner that actually selects all Number columns:
julia> mean.(eachcol(df[findall(x-> x<:Number, eltypes(df))]))
3-element Array{Float64,1}:
41.333333333333336
3.3333333333333335
166.66666666666666
For many scenarios describe is actually more convenient:
julia> describe(df)
4×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼─────────┼──────┼────────┼───────┼─────────┼──────────┼──────────┤
│ 1 │ name │ │ John │ │ Sally │ 3 │ │ String │
│ 2 │ age │ 41.3333 │ 23.0 │ 42.0 │ 59.0 │ │ │ Float64 │
│ 3 │ children │ 3.33333 │ 2 │ 3.0 │ 5 │ │ │ Int64 │
│ 4 │ height │ 166.667 │ 150 │ 170.0 │ 180 │ │ │ Int64 │
In the question println(mean(df[4])) works as well (instead of println(mean(df[:4]))).
Hence we can write
x = [2,3,4]
for i in x
println(mean(df[i]))
end
which works
I need to convert the following DataFrame
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
which looks like
3×2 DataFrame
│ Row │ A │ B │
│ │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I would like to convert A column from Array{String,1} to array of Int with missing values.
I tried
julia> df.A = tryparse.(Int, df.A)
3-element Array{Union{Nothing, Int64},1}:
nothing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Union… │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
julia> eltype(df.A)
Union{Nothing, Int64}
but I'm getting A column with elements of type Union{Nothing, Int64}.
nothing (of type Nothing) and missing (of type Missing) seems to be 2 differents kind of value.
So I wonder how I can A columns with missing values instead?
I also wonder if missing and nothing leads to different performance.
I would have done the following:
julia> df.A = map(x->begin val = tryparse(Int, x)
ifelse(typeof(val) == Nothing, missing, val)
end, df.A)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I think missing is more suitable for dataframes which indeed have missing values, instead of nothing, because the latter is more considered as a void in C, or None in Python, see here.
As a side note, Missing type has some Julia functionalities.
Replacing nothing by missing can simply be done using replace:
julia> df.A = replace(df.A, nothing=>missing)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
an other solution is to use tryparsem function defined as following
tryparsem(T, str) = something(tryparse(T, str), missing)
and use it like
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
julia> df.A = tryparsem.(Int, df.A)
I'm attempting to construct a DataFrame from a Dict in Julia 1.1. The keys in the dictionary are the column names and the values are vectors containing, well, the values that the column should have. I though it would be as straightforward as
df = DataFrame()
for (key,value) in datadict
df[key] = value
end
but this throws a ERROR: LoadError: MethodError: no method matching setindex!(::DataFrame, ::Array{String,1}, ::String). Instead of using tha variable key directly, I also tried passing a symbol :key as the column name, as in df[:key] = value, which removes the error message but only inserts the first key-value-pair into the dataframe as a column, with key as the column name:
10×1 DataFrame
│ Row │ key │
│ │ String │
├─────┼────────────┤
│ 1 │ 2019-03-04 │
│ 2 │ 2019-03-05 │
│ 3 │ 2019-03-06 │
│ 4 │ 2019-03-07 │
│ 5 │ 2019-03-08 │
│ 6 │ 2019-03-09 │
│ 7 │ 2019-03-10 │
│ 8 │ 2019-03-11 │
│ 9 │ 2019-03-12 │
│ 10 │ 2019-03-13 │
This is obviously not what I want. What am I doing wrong here?
This code should work :
using DataFrames
datadict = Dict(1 => ["2019-03-04", "2019-03-04"], 2 => ["1996-26-12", "1996-25-12"])
df = DataFrame()
for (key, value) in datadict
df[Symbol(key)] = value
end
You have to create a Symbol of your key with Symbol(key).
df3[10, :A] = missing
df3[15, :B] = missing
df3[15, :C] = missing
Even NA is not working.
I am getting an error
MethodError: Cannot convert an object of type Missings.Missing to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
Stacktrace:
[1] setindex!(::Array{Int64,1}, ::Missings.Missing, ::Int64) at ./array.jl:583
[2] insert_single_entry!(::DataFrames.DataFrame, ::Missings.Missing, ::Int64, ::Symbol) at /home/jrun/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:361
[3] setindex!(::DataFrames.DataFrame, ::Missings.Missing, ::Int64, ::Symbol) at /home/jrun/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:448
[4] include_string(::String, ::String) at ./loading.jl:522
Use allowmissing! function.
julia> using DataFrames
julia> df = DataFrame(a=[1,2,3])
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df.a[1] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
julia> allowmissing!(df)
3×1 DataFrame
│ Row │ a │
│ │ Int64⍰ │
├─────┼────────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df.a[1] = missing
missing
julia> df
3×1 DataFrame
│ Row │ a │
│ │ Int64⍰ │
├─────┼─────────┤
│ 1 │ missing │
│ 2 │ 2 │
│ 3 │ 3 │
You can see which columns in a DataFrame allow missing because they are highlighted with ⍰ after type name under column name.
You can also use allowmissing function to create a new DataFrame.
Both functions optionally accept columns that are to be converted.
Finally there is a disallowmissing/disallowmissing! pair that does the reverse (i.e. strips optional Missing union from eltype if a vector actually contains no missing values).
Suppose I have a DataFrame like this:
julia> df = DataFrame(a = [1,2,3], b = [3,4,5])
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼───┼───┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
│ 3 │ 3 │ 5 │
How do I subsequently change the order of columns so that column :b comes before column :a?
These are the recommendations for DataFrames.jl 0.21 or later.
If you want to be minimally faster you can write
df[!, [2, 1]]
If you want to update df in place you can do it in two steps:
df[!, 1], df[!, 2] = df[!, 2], df[!, 1]
rename!(df, [:b, :a])
which is yet faster.
Also you can use select! like this:
select!(df, [:b, :a])
It's simple enough but it took a while to dawn on me so I thought I'd post it here:
julia> df = df[!, [:b, :a]]
3×2 DataFrames.DataFrame
│ Row │ b │ a │
├─────┼───┼───┤
│ 1 │ 3 │ 1 │
│ 2 │ 4 │ 2 │
│ 3 │ 5 │ 3 │
By using either select or select! from DataFrames.jl. For example:
julia> select(df, [:b, :a])
3×2 DataFrames.DataFrame
│ Row │ b │ a │
├─────┼───┼───┤
│ 1 │ 3 │ 1 │
│ 2 │ 4 │ 2 │
│ 3 │ 5 │ 3 │