I have a data frame where some columns have missing values. When a value is missing, I would like to substitute the value from a second column in the same row.
For example, in:
df = DataFrame(x = [0, missing, 2], y=[2, 4, 6])
I would like missing to be substituted with 4.
At the moment I am solving the problem with this solution:
for row in eachrow(df)
if ismissing(row[:x])
row[:x] = row[:y]
end
end
But I wonder if there is a better solution that avoids for-loops.
I tried replace(A, old_new::Pair...; [count::Integer]), but the pairs seem to accept only scalars, and I could not get it to work with broadcasting either.
Do you have any suggestions?
You can use coalesce:
julia> df = DataFrame(x = [0, missing, 2], y=[2, 4, 6])
3×2 DataFrame
 Row │ x        y
     │ Int64?   Int64
─────┼────────────────
   1 │       0      2
   2 │ missing      4
   3 │       2      6
julia> df.x .= coalesce.(df.x, df.y)
3-element Array{Union{Missing, Int64},1}:
 0
 4
 2
julia> df
3×2 DataFrame
 Row │ x       y
     │ Int64?  Int64
─────┼───────────────
   1 │      0      2
   2 │      4      4
   3 │      2      6
or if you like piping-aware functions:
julia> df = DataFrame(x = [0, missing, 2], y=[2, 4, 6])
3×2 DataFrame
 Row │ x        y
     │ Int64?   Int64
─────┼────────────────
   1 │       0      2
   2 │ missing      4
   3 │       2      6
julia> transform!(df, [:x, :y] => ByRow(coalesce) => :x)
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     0      2
   2 │     4      4
   3 │     2      6
and this is the same, but not requiring you to remember about coalesce:
julia> df = DataFrame(x = [0, missing, 2], y=[2, 4, 6])
3×2 DataFrame
 Row │ x        y
     │ Int64?   Int64
─────┼────────────────
   1 │       0      2
   2 │ missing      4
   3 │       2      6
julia> transform!(df, [:x, :y] => ByRow((x,y) -> ismissing(x) ? y : x) => :x)
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     0      2
   2 │     4      4
   3 │     2      6
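For reference, coalesce returns its first argument that is not missing, which is why broadcasting it over the two columns fills the gaps in x from y. A minimal sketch on plain vectors (no DataFrames needed; the variable names are only for illustration):

```julia
# coalesce returns the first argument that is not `missing`.
first_present = coalesce(missing, 4)   # falls through to the second argument
kept = coalesce(0, 2)                  # the first argument wins when it is present

# Broadcasting applies the same rule element by element.
x = [0, missing, 2]
y = [2, 4, 6]
filled = coalesce.(x, y)
```

The same rule extends to more than two arguments, so coalesce.(df.x, df.y, 0) would fall back to 0 where both columns are missing.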
I would like to count the number of missing values per column in a dataframe like df:
using Pkg
Pkg.add("DataFrames")
using DataFrames
df = DataFrame(i=1:5,
               x=[missing, 4, missing, 2, 1],
               y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
 Row │ i      x        y
     │ Int64  Int64?   String?
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2        4  missing
   3 │     3  missing  c
   4 │     4        2  d
   5 │     5        1  e
This should return 0 for i, 2 for x and 2 for y column. So I was wondering if anyone knows how to count the number of missing values per column in Julia?
When writing the question I found an answer by using describe with :nmissing like this:
describe(df, :nmissing)
3×2 DataFrame
 Row │ variable  nmissing
     │ Symbol    Int64
─────┼────────────────────
   1 │ i                0
   2 │ x                2
   3 │ y                2
If you wanted the output in columnar format you can write:
julia> mapcols(x -> count(ismissing, x), df)
1×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     0      2      2
I have the following dataframe called df:
df = DataFrame(i=1:5,
               x=[missing, missing, missing, missing, missing],
               y=[missing, missing, 1, 3, 6])
5×3 DataFrame
 Row │ i      x        y
     │ Int64  Missing  Int64?
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2  missing  missing
   3 │     3  missing        1
   4 │     4  missing        3
   5 │     5  missing        6
I would like to remove the columns where all values are missing. In this case column x should be removed because it contains only missing values. dropmissing removes all the rows, which is not what I want. Does anyone know how to remove only the columns where all values are missing from a DataFrame in Julia?
A mediocre answer would be:
df1 = DataFrame()
# copy every column that is not entirely missing into df1
foreach(
    x -> all(ismissing, df[!, x]) ? nothing : (df1[!, x] = df[!, x]),
    propertynames(df)
)
df1
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6
But a slightly better one would be using the slicing approach:
df[:, map(x->!all(ismissing, df[!, x]), propertynames(df))]
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6
# OR
df[!, map(x->!all(ismissing, x), eachcol(df))]
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6
# OR
df[!, Not(names(df, all.(ismissing, eachcol(df))))]
# I omitted the result to keep this answer from becoming too long.
# OR
df[!, Not(all.(ismissing, eachcol(df)))]
I almost forgot the deleteat! function:
deleteat!(permutedims(df), all.(ismissing, eachcol(df))) |> permutedims
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6
You can use the select! function, as Dan noted:
select!(df, [k for (k,v) in pairs(eachcol(df)) if !all(ismissing, v)])
# 5×2 DataFrame
#  Row │ i      y
#      │ Int64  Int64?
# ─────┼────────────────
#    1 │     1  missing
#    2 │     2  missing
#    3 │     3        1
#    4 │     4        3
#    5 │     5        6
The names function accepts a type as an input to select columns of a specific type, so I would do:
julia> select(df, Not(names(df, Missing)))
5×2 DataFrame
 Row │ i      y
     │ Int64  Int64?
─────┼────────────────
   1 │     1  missing
   2 │     2  missing
   3 │     3        1
   4 │     4        3
   5 │     5        6
Without benchmarking this I would guess that it is also significantly faster, as it doesn't have to check each element of each column but as far as I know simply queries the type information for each column readily available in the DataFrame:
julia> dump(df)
DataFrame
columns: Array{AbstractVector}((3,))
1: Array{Int64}((5,)) [1, 2, 3, 4, 5]
2: Array{Missing}((5,))
1: Missing missing
2: Missing missing
3: Missing missing
4: Missing missing
5: Missing missing
3: Array{Union{Missing, Int64}}((5,))
The downside of this approach is that it relies on the type information to be correct, which might not be the case after a transformation:
julia> df2 = df[1:2, :]
2×3 DataFrame
 Row │ i      x        y
     │ Int64  Missing  Int64?
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2  missing  missing
This can be fixed by calling identity to narrow column types, but this is again potentially expensive:
julia> identity.(df2)
2×3 DataFrame
 Row │ i      x        y
     │ Int64  Missing  Missing
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2  missing  missing
So I'd say that if you're creating a DataFrame from scratch, such as reading it in via XLSX.jl (as people loooove putting empty columns in their Excel sheets), or are creating whole columns in your workflow, select(df, Not(names(df, Missing))) is the way to go. For analysis on subsets of DataFrames it is only guaranteed to work after narrowing the column types with identity, so the other approaches mentioned, which check every cell, are viable alternatives.
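The two steps above (narrow the column types, then drop the Missing-typed columns) can be combined. A minimal sketch, assuming DataFrames.jl is installed; drop_all_missing is a hypothetical helper name, not part of the package:

```julia
using DataFrames

df = DataFrame(i = 1:5,
               x = fill(missing, 5),
               y = [missing, missing, 1, 3, 6])

# A row subset keeps the declared element types, so y is still Union{Missing, Int64}
# even though both of its remaining values are missing.
df2 = df[1:2, :]

# Hypothetical helper: narrow the column types first, then drop every column
# whose element type is Missing.
function drop_all_missing(d::AbstractDataFrame)
    narrowed = identity.(d)
    return select(narrowed, Not(names(narrowed, Missing)))
end

full_result = drop_all_missing(df)     # x is dropped
subset_result = drop_all_missing(df2)  # x and y are both dropped
```

The narrowing step copies the data, so on large frames the cell-checking approaches may be just as cheap.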
Another simple option is to use
df[!, any.(!ismissing, eachcol(df))]
5×2 DataFrame
 Row │ i      y
     │ Int64  Int64?
─────┼────────────────
   1 │     1  missing
   2 │     2  missing
   3 │     3        1
   4 │     4        3
   5 │     5        6
and if the DataFrame is created from scratch, there is another fast option using the column type. Since any column with all missing entries isa Vector{Missing}, we can use this type information to skip these columns. The drawback of this fast method, as @NilsGudat pointed out, is that it fails if the DataFrame column types have been changed by some transformation.
df[!, (!isa).(eachcol(df), Vector{Missing})]
5×2 DataFrame
 Row │ i      y
     │ Int64  Int64?
─────┼────────────────
   1 │     1  missing
   2 │     2  missing
   3 │     3        1
   4 │     4        3
   5 │     5        6
I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In Python pandas, I would simply use df.series.diff(). Is there a Julia equivalent?
For example:
data
1
2
4
6
7
# in pandas
df['diff_data'] = df.data.diff()
data  diff_data
   1        NaN
   2          1
   4          2
   6          2
   7          1
You can use ShiftedArrays.jl like this.
Declarative style:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
 Row │ data
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     4
   4 │     6
   5 │     7
julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
 Row │ data   data_diff
     │ Int64  Int64?
─────┼──────────────────
   1 │     1    missing
   2 │     2          1
   3 │     4          2
   4 │     6          2
   5 │     7          1
Imperative style (in place):
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
 Row │ data
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     4
   4 │     6
   5 │     7
julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
  missing
 1
 2
 2
 1
julia> df
5×2 DataFrame
 Row │ data   data_diff
     │ Int64  Int64?
─────┼──────────────────
   1 │     1    missing
   2 │     2          1
   3 │     4          2
   4 │     6          2
   5 │     7          1
With diff you do not need extra packages, and you can do something similar:
julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
(the issue is that diff is a general-purpose function that changes the length of the vector from n to n-1, so you have to add missing manually at the front)
Pandas df.diff() does it to the whole data frame at once and allows you to specify row-wise or column-wise. There might be a better way but this is what I used before (I like chaining or piping like in dplyr):
# using Chain.jl
@chain df begin
    eachcol()
    diff.()
    DataFrame(:auto)
    rename!(names(df))
end
# OR base pipe
df |>
    x -> eachcol(x) |>
    x -> diff.(x) |>
    x -> DataFrame(x, :auto) |>
    x -> rename!(x, names(df))
# OR without piping
rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))
You might need to insert the starting row, which will now have missing values.
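One way to get that starting row back is to pad each column's diff with a leading missing, so the result keeps the original number of rows. A minimal sketch, assuming DataFrames.jl is installed and every column is numeric; the example data is made up for illustration:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 4, 6, 7], b = [10, 20, 40, 60, 70])

# Apply a padded diff to every column. mapcols keeps the column names,
# so no rename! step is needed afterwards.
diffed = mapcols(col -> [missing; diff(col)], df)
```

This mirrors the default behaviour of pandas df.diff(), with missing in place of NaN in the first row.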
using DataFrames
df = DataFrame(a=1:3, b=1:3)
How do I create a new column c such that c = a+b element wise?
Can't figure it out by reading the transform doc.
I know that
df[!, :c] = df.a .+ df.b
works but I want to use transform in a chain like this
@chain df begin
@transform(c = :a .+ :b)
@where(...)
groupby(...)
end
The above syntax doesn't work with DataFramesMeta.jl.
This is an answer using DataFrames.jl.
To create a new data frame:
julia> transform(df, [:a,:b] => (+) => :c)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      2      4
   3 │     3      3      6
and for an in-place operation:
julia> transform!(df, [:a,:b] => (+) => :c)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      2      4
   3 │     3      3      6
or
julia> insertcols!(df, :c => df.a + df.b)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      2      4
   3 │     3      3      6
The difference between transform! and insertcols! is that insertcols! will error if a :c column is already present in the data frame, while transform! will overwrite it.
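That difference can be checked directly. A minimal sketch, assuming DataFrames.jl is installed; the exact error type and message are not relied on here, only the fact that the call throws:

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 1:3)
transform!(df, [:a, :b] => (+) => :c)   # creates :c as a + b

# transform! silently overwrites the existing :c (now a - b) ...
transform!(df, [:a, :b] => (-) => :c)

# ... while insertcols! refuses to add a column whose name is already taken.
threw = try
    insertcols!(df, :c => df.a + df.b)
    false
catch
    true
end
```

So insertcols! is the safer choice when an accidental overwrite would be a bug, and transform! is the convenient one when recomputing a column is intended.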
How can I insert a row into a DataFrame in Julia at a specific index? (Julia version 1.1)
I have found this related question. However, the code given in the answer no longer works in Julia 1.1.
I know how to push! a row into a DataFrame or concatenate two DataFrames, but what about inserting at a specific index?
It also doesn't seem to be explained in the Julia DataFrames documentation.
This is a non-standard operation. The recommendation given there is still valid, so:
df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
foreach((v,n) -> insert!(df[n], 2, v), [4, "d"], names(df))
works. A shorter version to write it under Julia 1.0 would be:
insert!.(eachcol(df, false), 2, [4, "d"])
(passing false as the second argument will no longer be needed in the future, as we are in the deprecation period now)
The difference is that getproperty method can be overloaded since Julia 1.0 so df.columns does not work.
I have also updated the other answer, so you can close this question if you prefer.
EDIT
The instructions above are no longer valid (unless you use very old DataFrames.jl version).
In DataFrames.jl 1.4 use insert!, push!, or pushfirst! functions depending on where you want to add the row:
julia> using DataFrames
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c
julia> insert!(df, 2, (100, "new line"))
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼─────────────────
   1 │     1  a
   2 │   100  new line
   3 │     2  b
   4 │     3  c
julia> push!(df, (200, "last line"))
5×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼──────────────────
   1 │     1  a
   2 │   100  new line
   3 │     2  b
   4 │     3  c
   5 │   200  last line
julia> pushfirst!(df, (300, "first line"))
6×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────────
   1 │   300  first line
   2 │     1  a
   3 │   100  new line
   4 │     2  b
   5 │     3  c
   6 │   200  last line