I thought that the exclamation mark ! is the symbol for the logical NOT operator. Now, learning indexing in the DataFrames package, I came across this: data[!,:Treatment]. Which seems to be the same as using the known colon symbol :
data[:,:Treatment]==data[!,:Treatment] is true.
Why this redundancy then?
! in indexing is specific to DataFrames, and signals that you want a reference to the underlying vector storing the data, rather than a copy of it. You can read all about indexing DataFrames here. In your example the are both == because all values are identical, but they are not === since df[:, :Treatment] gives you a copy of the underlying data.
Example:
julia> using DataFrames
julia> df = DataFrame(y = [1, 2, 3]);
julia> df[:, :y] == df[!, :y] # true because all values are equal
true
julia> df[:, :y] === df[!, :y] # false because they are not the same vector
false
Quoting the documentation of DataFrames.jl:
Columns can be directly (i.e. without copying) accessed via df.col or df[!, :col]. [...] Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.
An example might make this clearer:
julia> using DataFrames
julia> df = DataFrame(x = rand(5), y=rand(5))
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `a` is a copy of `df.x`: modifying it will not affect `df`
julia> a = df[:, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> a[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `b` is a view of `df.x`: any change made to it will be reflected in df
julia> b = df[!, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> b[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 1.0 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
Note that, since the indexing with ! does not involve any data copy, it will generally be more efficient.
Related
I am looking for the opposite of the dropmissing function in DataFrames.jl so that the user knows where to look to fix their bad data. It seems like this should be easy, but the filter function expects a column to be specified and I cannot get it to iterate over all columns.
julia> df=DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> filter(x -> ismissing(eachcol(x)), df)
ERROR: MethodError: no method matching eachcol(::DataFrameRow{DataFrame,DataFrames.Index})
julia> filter(x -> ismissing.(x), df)
ERROR: ArgumentError: broadcasting over `DataFrameRow`s is reserved
I am basically trying to recreate the disallowmissing function, but with a more useful error message.
Here are two ways to do it:
julia> df = DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> df[.!completecases(df), :] # this will be faster
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> #view df[.!completecases(df), :]
2×2 SubDataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df, view=true) # requires DataFrames.jl 0.22
2×2 SubDataFrame
Row │ a b
│ Int64? Int64?
─────┼──────────────────
1 │ missing 5
2 │ 3 missing
I have a DataFrame and I want to compute a bunch of group-level summary statistics. Some of those statistics are derived from other statistics I want to compute first.
df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])
df2 = df |>
#groupby(_.a) |>
#map({a = key(_),
bm = mean(_.b),
cs = sum(_.b),
d = _.bm + _.cs}) |>
DataFrame
ERROR: type NamedTuple has no field bm
The closest I can get is this, which works, but gets very repetitive as the number of initial statistics I want to carry forward into the computation of derived statistics grows:
df2 = df |>
#groupby(_.a) |>
#map({a=key(_), bm=mean(_.b), cs=sum(_.b)}) |>
#map({a=_.a, bm=_.bm, cs=_.cs, d=_.bm + _.cs}) |>
DataFrame
3×4 DataFrame
│ Row │ a │ bm │ cs │ d │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼───────┼─────────┤
│ 1 │ 1 │ 4.5 │ 9 │ 13.5 │
│ 2 │ 2 │ 6.0 │ 6 │ 12.0 │
│ 3 │ 3 │ 8.0 │ 8 │ 16.0 │
Another option is to create a new DataFrame of first-order results, run a new #map on that to compute the second-order results, and then join the two afterward. Is there any way in Query, DataFramesMeta, or even bare DataFrames to do it in one relatively concise step?
Just for reference, the "create multiple DataFrames" approach:
df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])
df2 = df |>
#groupby(_.a) |>
#map({a=key(_), bm=mean(_.b), cs=sum(_.b)}) |>
DataFrame
df3 = df2 |>
#map({a=_.a, d=_.bm + _.cs}) |>
DataFrame
df4 = innerjoin(df2, df3, on = :a)
3×4 DataFrame
│ Row │ a │ bm │ cs │ d │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼───────┼─────────┤
│ 1 │ 1 │ 4.5 │ 9 │ 13.5 │
│ 2 │ 2 │ 6.0 │ 6 │ 12.0 │
│ 3 │ 3 │ 8.0 │ 8 │ 16.0 │
I have the following Dataframe
using DataFrames, Statistics
df = DataFrame(name=["John", "Sally", "Kirk"],
age=[23., 42., 59.],
children=[3,5,2], height = [180, 150, 170])
print(df)
3×4 DataFrame
│ Row │ name │ age │ children │ height │
│ │ String │ Float64 │ Int64 │ Int64 │
├─────┼────────┼─────────┼──────────┼────────┤
│ 1 │ John │ 23.0 │ 3 │ 180 │
│ 2 │ Sally │ 42.0 │ 5 │ 150 │
│ 3 │ Kirk │ 59.0 │ 2 │ 170 │
I can compute the mean of a column as follow:
println(mean(df[:4]))
166.66666666666666
Now I want to get the mean of all the numeric column and tried this code:
x = [2,3,4]
for i in x
print(mean(df[:x[i]]))
end
But got the following error message:
MethodError: no method matching getindex(::Symbol, ::Int64)
Stacktrace:
[1] top-level scope at ./In[64]:3
How can I solve the problem?
You are trying to access the DataFrame's column using an integer index specifying the column's position. You should just use the integer value without any : before i, which would create the symbol :i but you do not a have column named i.
x = [2,3,4]
for i in x
println(mean(df[i])) # no need for `x[i]`
end
You can also index a DataFrame using a Symbol denoting the column's name.
x = [:age, :children, :height];
for c in x
println(mean(df[c]))
end
You get the following error in your attempt because you are trying to access the ith index of the symbol :x, which is an undefined operation.
MethodError: no method matching getindex(::Symbol, ::Int64)
Note that :4 is just 4.
julia> :4
4
julia> typeof(:4)
Int64
Here is a one-liner that actually selects all Number columns:
julia> mean.(eachcol(df[findall(x-> x<:Number, eltypes(df))]))
3-element Array{Float64,1}:
41.333333333333336
3.3333333333333335
166.66666666666666
For many scenarios describe is actually more convenient:
julia> describe(df)
4×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼─────────┼──────┼────────┼───────┼─────────┼──────────┼──────────┤
│ 1 │ name │ │ John │ │ Sally │ 3 │ │ String │
│ 2 │ age │ 41.3333 │ 23.0 │ 42.0 │ 59.0 │ │ │ Float64 │
│ 3 │ children │ 3.33333 │ 2 │ 3.0 │ 5 │ │ │ Int64 │
│ 4 │ height │ 166.667 │ 150 │ 170.0 │ 180 │ │ │ Int64 │
In the question println(mean(df[4])) works as well (instead of println(mean(df[:4]))).
Hence we can write
x = [2,3,4]
for i in x
println(mean(df[i]))
end
which works
I need to convert the following DataFrame
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
which looks like
3×2 DataFrame
│ Row │ A │ B │
│ │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I would like to convert A column from Array{String,1} to array of Int with missing values.
I tried
julia> df.A = tryparse.(Int, df.A)
3-element Array{Union{Nothing, Int64},1}:
nothing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Union… │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
julia> eltype(df.A)
Union{Nothing, Int64}
but I'm getting A column with elements of type Union{Nothing, Int64}.
nothing (of type Nothing) and missing (of type Missing) seems to be 2 differents kind of value.
So I wonder how I can A columns with missing values instead?
I also wonder if missing and nothing leads to different performance.
I would have done the following:
julia> df.A = map(x->begin val = tryparse(Int, x)
ifelse(typeof(val) == Nothing, missing, val)
end, df.A)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I think missing is more suitable for dataframes which indeed have missing values, instead of nothing, because the latter is more considered as a void in C, or None in Python, see here.
As a side note, Missing type has some Julia functionalities.
Replacing nothing by missing can simply be done using replace:
julia> df.A = replace(df.A, nothing=>missing)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
an other solution is to use tryparsem function defined as following
tryparsem(T, str) = something(tryparse(T, str), missing)
and use it like
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
julia> df.A = tryparsem.(Int, df.A)
df3[10, :A] = missing
df3[15, :B] = missing
df3[15, :C] = missing
Even NA is not working.
I am getting an error
MethodError: Cannot convert an object of type Missings.Missing to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.
Stacktrace:
[1] setindex!(::Array{Int64,1}, ::Missings.Missing, ::Int64) at ./array.jl:583
[2] insert_single_entry!(::DataFrames.DataFrame, ::Missings.Missing, ::Int64, ::Symbol) at /home/jrun/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:361
[3] setindex!(::DataFrames.DataFrame, ::Missings.Missing, ::Int64, ::Symbol) at /home/jrun/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:448
[4] include_string(::String, ::String) at ./loading.jl:522
Use allowmissing! function.
julia> using DataFrames
julia> df = DataFrame(a=[1,2,3])
3×1 DataFrame
│ Row │ a │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df.a[1] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
julia> allowmissing!(df)
3×1 DataFrame
│ Row │ a │
│ │ Int64⍰ │
├─────┼────────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> df.a[1] = missing
missing
julia> df
3×1 DataFrame
│ Row │ a │
│ │ Int64⍰ │
├─────┼─────────┤
│ 1 │ missing │
│ 2 │ 2 │
│ 3 │ 3 │
You can see which columns in a DataFrame allow missing because they are highlighted with ⍰ after type name under column name.
You can also use allowmissing function to create a new DataFrame.
Both functions optionally accept columns that are to be converted.
Finally there is a disallowmissing/disallowmissing! pair that does the reverse (i.e. strips optional Missing union from eltype if a vector actually contains no missing values).