transform function on all columns of a DataFrame

I have a DataFrame df and I am trying to apply a function to each of its cells. According to the documentation I should use the transform function.
The function should be applied to each column, so I use [:] as a selector for all columns:
transform(
df, [:] .=> ByRow(x -> (if (x > 1) x else zero(Float64) end)) .=> [:]
)
but it yields an exception
ArgumentError: Unrecognized column selector: Colon() => (DataFrames.ByRow{Main.workspace293.var"#1#2"}(Main.workspace293.var"#1#2"()) => Colon())
although when I use a single column, it works fine:
transform(
df, [:K0] .=> ByRow(x -> (if (x > 1) x else zero(Float64) end)) .=> [:K0]
)

The simplest way to do it is to use broadcasting:
julia> df = DataFrame(2*rand(4,3), [:x1, :x2, :x3])
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼──────────┼──────────┤
│ 1 │ 0.945879 │ 1.59742 │ 0.882428 │
│ 2 │ 0.0963367 │ 0.400404 │ 0.599865 │
│ 3 │ 1.23356 │ 0.807691 │ 0.547917 │
│ 4 │ 0.756098 │ 0.595673 │ 0.29678 │
julia> @. ifelse(df > 1, df, 0.0)
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 1.59742 │ 0.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 1.23356 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │
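If you prefer to update df in place rather than create a new data frame, broadcasted assignment should also work (a small sketch, assuming all columns are Float64 as above so the 0.0 fill is valid):
# In-place variant: assign the broadcasted result back into the existing columns.
df .= ifelse.(df .> 1, df, 0.0)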
You can also use transform if you prefer:
julia> transform(df, names(df) .=> ByRow(x -> ifelse(x>1, x, 0.0)) .=> names(df))
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 1.59742 │ 0.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 1.23356 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │
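Another option that avoids spelling out the column names is mapcols, which applies a function to every column and returns a new data frame (a sketch, again assuming all columns are numeric):
mapcols(col -> ifelse.(col .> 1, col, 0.0), df)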
Also, looking at the linked pandas solution, DataFrames.jl seems faster in this case:
julia> df = DataFrame(2*rand(2,3), [:x1, :x2, :x3])
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼────────────────────────────
1 │ 1.48781 1.20332 1.08071
2 │ 1.55462 1.66393 0.363993
julia> using BenchmarkTools
julia> @btime @. ifelse($df > 1, $df, 0.0)
6.252 μs (58 allocations: 3.89 KiB)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.48781 1.20332 1.08071
2 │ 1.55462 1.66393 0.0
(in pandas, for a 2×3 data frame, it ranged from 163 µs to 2.26 ms)

Return copy of a DataFrame that contains only rows with missing data in Julia

I am looking for the opposite of the dropmissing function in DataFrames.jl so that the user knows where to look to fix their bad data. It seems like this should be easy, but the filter function expects a column to be specified and I cannot get it to iterate over all columns.
julia> df=DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> filter(x -> ismissing(eachcol(x)), df)
ERROR: MethodError: no method matching eachcol(::DataFrameRow{DataFrame,DataFrames.Index})
julia> filter(x -> ismissing.(x), df)
ERROR: ArgumentError: broadcasting over `DataFrameRow`s is reserved
I am basically trying to recreate the disallowmissing function, but with a more useful error message.
Here are two ways to do it:
julia> df = DataFrame(a=[1, missing, 3], b=[4, 5, missing])
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ 4 │
│ 2 │ missing │ 5 │
│ 3 │ 3 │ missing │
julia> df[.!completecases(df), :] # this will be faster
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> @view df[.!completecases(df), :]
2×2 SubDataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 5 │
│ 2 │ 3 │ missing │
julia> filter(row -> any(ismissing, row), df, view=true) # requires DataFrames.jl 0.22
2×2 SubDataFrame
Row │ a b
│ Int64? Int64?
─────┼──────────────────
1 │ missing 5
2 │ 3 missing
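Since the end goal is a disallowmissing-style check with a more informative error message, the pieces above can be wrapped in a small helper; a minimal sketch (the name assert_no_missing is made up for illustration):
function assert_no_missing(df)
    # rows that are not complete, i.e. contain at least one missing value
    bad = df[.!completecases(df), :]
    nrow(bad) == 0 || error("found $(nrow(bad)) row(s) with missing values:\n$(bad)")
    return df
end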

Query.jl - create a new column and use it immediately

I have a DataFrame and I want to compute a bunch of group-level summary statistics. Some of those statistics are derived from other statistics I want to compute first.
df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])
df2 = df |>
    @groupby(_.a) |>
    @map({a = key(_),
          bm = mean(_.b),
          cs = sum(_.b),
          d = _.bm + _.cs}) |>
    DataFrame
ERROR: type NamedTuple has no field bm
The closest I can get is this, which works, but gets very repetitive as the number of initial statistics I want to carry forward into the computation of derived statistics grows:
df2 = df |>
    @groupby(_.a) |>
    @map({a=key(_), bm=mean(_.b), cs=sum(_.b)}) |>
    @map({a=_.a, bm=_.bm, cs=_.cs, d=_.bm + _.cs}) |>
    DataFrame
3×4 DataFrame
│ Row │ a │ bm │ cs │ d │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼───────┼─────────┤
│ 1 │ 1 │ 4.5 │ 9 │ 13.5 │
│ 2 │ 2 │ 6.0 │ 6 │ 12.0 │
│ 3 │ 3 │ 8.0 │ 8 │ 16.0 │
Another option is to create a new DataFrame of first-order results, run a new @map on that to compute the second-order results, and then join the two afterward. Is there any way in Query, DataFramesMeta, or even bare DataFrames to do it in one relatively concise step?
Just for reference, the "create multiple DataFrames" approach:
df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])
df2 = df |>
    @groupby(_.a) |>
    @map({a=key(_), bm=mean(_.b), cs=sum(_.b)}) |>
    DataFrame
df3 = df2 |>
    @map({a=_.a, d=_.bm + _.cs}) |>
    DataFrame
df4 = innerjoin(df2, df3, on = :a)
3×4 DataFrame
│ Row │ a │ bm │ cs │ d │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼───────┼─────────┤
│ 1 │ 1 │ 4.5 │ 9 │ 13.5 │
│ 2 │ 2 │ 6.0 │ 6 │ 12.0 │
│ 3 │ 3 │ 8.0 │ 8 │ 16.0 │
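For completeness, in bare DataFrames.jl (no Query.jl) the derived statistic can be computed in a single call by passing several transformations to combine; a sketch, assuming DataFrames.jl 0.21 or newer:
using DataFrames, Statistics

df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])
# each Pair is source column => function => target column name
combine(groupby(df, :a),
        :b => mean => :bm,
        :b => sum => :cs,
        :b => (b -> mean(b) + sum(b)) => :d)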

Duplicated columns in Julia Dataframes

In Python Pandas and R one can get rid of duplicated columns easily: just load the data, assign the column names, and select those that are not duplicated.
What are the best practices for dealing with such data in Julia DataFrames? Assigning duplicated column names is not allowed here. I understand that the only way would be to massage the incoming data more and get rid of such data before constructing a DataFrame?
The thing is that it is almost always easier to deal with duplicated columns in an already constructed data frame than in the incoming data.
UPD: I meant duplicated column names. I build the data frame from raw data, where column names (and thus data) could be repeated.
UPD2: Python example added.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.hstack([np.zeros((4,1)), np.ones((4,2))]), columns=["a", "b", "b"])
>>> df
a b b
0 0.0 1.0 1.0
1 0.0 1.0 1.0
2 0.0 1.0 1.0
3 0.0 1.0 1.0
>>> df.loc[:, ~df.columns.duplicated()]
a b
0 0.0 1.0
1 0.0 1.0
2 0.0 1.0
3 0.0 1.0
I build my Julia DataFrame from a Float32 matrix and then assign column names from a vector. That is where I need to get rid of columns that have duplicated names (already present in the DataFrame). That is the nature of the underlying data: sometimes it has duplicates, sometimes not, and I have no control over its creation.
Is this something you are looking for? (I was not 100% sure from your description; if this is not what you want, then please update the question with an example.)
julia> df = DataFrame([zeros(4,3) ones(4,5)])
4×8 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │ x8 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 0.0 │ 0.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │
│ 3 │ 0.0 │ 0.0 │ 0.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │
julia> DataFrame(unique(last, pairs(eachcol(df))))
4×2 DataFrame
│ Row │ x1 │ x4 │
│ │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ 0.0 │ 1.0 │
│ 2 │ 0.0 │ 1.0 │
│ 3 │ 0.0 │ 1.0 │
│ 4 │ 0.0 │ 1.0 │
EDIT
To deduplicate column names, use the makeunique keyword argument:
julia> DataFrame(rand(3,4), [:x, :x, :x, :x], makeunique=true)
3×4 DataFrame
│ Row │ x │ x_1 │ x_2 │ x_3 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼──────────┼──────────┼───────────┤
│ 1 │ 0.410494 │ 0.775563 │ 0.819916 │ 0.0520466 │
│ 2 │ 0.0503997 │ 0.427499 │ 0.262234 │ 0.965793 │
│ 3 │ 0.838595 │ 0.996305 │ 0.833607 │ 0.953539 │
EDIT 2
So you seem to have access to column names when creating a data frame. In this case I would do:
julia> mat = [ones(3,1) zeros(3,2)]
3×3 Array{Float64,2}:
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
julia> cols = ["a", "b", "b"]
3-element Array{String,1}:
"a"
"b"
"b"
julia> df = DataFrame(mat, cols, makeunique=true)
3×3 DataFrame
│ Row │ a │ b │ b_1 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 1.0 │ 0.0 │ 0.0 │
│ 2 │ 1.0 │ 0.0 │ 0.0 │
│ 3 │ 1.0 │ 0.0 │ 0.0 │
julia> select!(df, unique(cols))
3×2 DataFrame
│ Row │ a │ b │
│ │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ 1.0 │ 0.0 │
│ 2 │ 1.0 │ 0.0 │
│ 3 │ 1.0 │ 0.0 │
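The two steps can also be combined into one expression if you prefer (a sketch, using the same mat and cols as above):
# keep only the first occurrence of each column name
df = select(DataFrame(mat, cols, makeunique=true), unique(cols))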

How to convert a GroupedDataFrame to a DataFrame in Julia?

I have performed calculations on subsets of a DataFrame by using the groupby function:
using RDatasets
iris = dataset("datasets", "iris")
describe(iris)
iris_grouped = groupby(iris,:Species)
iris_avg = map(:SepalLength => mean,iris_grouped::GroupedDataFrame)
Now I would like to plot the results, but I get an error message for the following plot:
@df iris_avg bar(:Species, :SepalLength)
Only tables are supported
What would be the best way to plot the data? My idea would be to create a single DataFrame and go from there. How would I do this, ie how do I convert a GroupedDataFrame to a single DataFrame? Thanks!
To convert a GroupedDataFrame into a DataFrame, just call DataFrame on it, e.g.:
julia> DataFrame(iris_avg)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
in your case.
You could also have written:
julia> combine(:SepalLength => mean, iris_grouped)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
on an original GroupedDataFrame or
julia> by(:SepalLength => mean, iris, :Species)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
on an original DataFrame.
I write the transformation as the first argument here, but typically, you would write it as the last (as then you can pass multiple transformations), e.g.:
julia> by(iris, :Species, :SepalLength => mean, :SepalWidth => minimum)
3×3 DataFrame
│ Row │ Species │ SepalLength_mean │ SepalWidth_minimum │
│ │ Categorical… │ Float64 │ Float64 │
├─────┼──────────────┼──────────────────┼────────────────────┤
│ 1 │ setosa │ 5.006 │ 2.3 │
│ 2 │ versicolor │ 5.936 │ 2.0 │
│ 3 │ virginica │ 6.588 │ 2.2 │
I think you might be better off using the by function to get to your iris_avg directly. by groups a DataFrame and then applies the given function to each group. Often, it's used with a do block.
julia> by(iris, :Species) do df
DataFrame(sepal_mean = mean(df.SepalLength))
end
3×2 DataFrame
│ Row │ Species │ sepal_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
Or equivalently,
julia> by(iris, :Species, SepalLength_mean = :SepalLength => mean)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
See here for more details/examples.
Alternatively, you can do it in several steps as you've done, then use the DataFrame constructor to convert the result to a proper DataFrame:
julia> iris_grouped = groupby(iris,:Species);
julia> iris_avg = map(:SepalLength => mean,iris_grouped::GroupedDataFrame);
julia> DataFrame(iris_avg)
3×2 DataFrame
│ Row │ Species │ SepalLength_mean │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼──────────────────┤
│ 1 │ setosa │ 5.006 │
│ 2 │ versicolor │ 5.936 │
│ 3 │ virginica │ 6.588 │
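Note that by (and the combine(fun, gdf) argument order used above) comes from older DataFrames.jl releases; in more recent versions the same result is obtained by calling combine on a GroupedDataFrame with Pair-style transformations, e.g. a sketch:
using DataFrames, Statistics, RDatasets

iris = dataset("datasets", "iris")
combine(groupby(iris, :Species),
        :SepalLength => mean => :SepalLength_mean,
        :SepalWidth => minimum => :SepalWidth_minimum)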

Convert a Julia DataFrame column with String to one with Int and missing values

I need to convert the following DataFrame
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
which looks like
3×2 DataFrame
│ Row │ A │ B │
│ │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I would like to convert the A column from Array{String,1} to an array of Int with missing values.
I tried
julia> df.A = tryparse.(Int, df.A)
3-element Array{Union{Nothing, Int64},1}:
nothing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Union… │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
julia> eltype(df.A)
Union{Nothing, Int64}
but I'm getting an A column with elements of type Union{Nothing, Int64}.
nothing (of type Nothing) and missing (of type Missing) seem to be two different kinds of values.
So I wonder how I can get an A column with missing values instead?
I also wonder whether missing and nothing lead to different performance.
I would have done the following:
julia> df.A = map(x -> begin
           val = tryparse(Int, x)
           ifelse(typeof(val) == Nothing, missing, val)
       end, df.A)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I think missing is more suitable than nothing for data frames which indeed have missing values, because the latter is more like void in C or None in Python; see here.
As a side note, the Missing type comes with some dedicated functionality in Julia.
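For example, a few of the Base functions that make missing convenient to work with (a small illustrative sketch):
x = [1, missing, 3]

ismissing.(x)            # Bool mask marking the missing entries
collect(skipmissing(x))  # drop missing values: [1, 3]
coalesce.(x, 0)          # replace missing with a default: [1, 0, 3]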
Replacing nothing by missing can simply be done using replace:
julia> df.A = replace(df.A, nothing=>missing)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
Another solution is to use a tryparsem function defined as follows:
tryparsem(T, str) = something(tryparse(T, str), missing)
and use it like this:
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
julia> df.A = tryparsem.(Int, df.A)
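For context on why this works: tryparse returns nothing when parsing fails, and something returns its first argument that is not nothing, so failed parses become missing. A quick sketch:
something(tryparse(Int, ""), missing)   # missing
something(tryparse(Int, "2"), missing)  # 2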