Remove columns with all values missing in dataframe Julia

I have the following dataframe called df:
df = DataFrame(i=1:5,
x=[missing, missing, missing, missing, missing],
y=[missing, missing, 1, 3, 6])
5×3 DataFrame
Row │ i x y
│ Int64 Missing Int64?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 missing missing
3 │ 3 missing 1
4 │ 4 missing 3
5 │ 5 missing 6
I would like to remove the columns where all values are missing. In this case it should remove column x because it contains only missing values. Using dropmissing removes all rows, which is not what I want. So I was wondering if anyone knows how to remove only the columns where all values are missing in a Julia dataframe?

A mediocre answer would be:
df1 = DataFrame()
foreach(
x->all(ismissing, df[!, x]) ? nothing : df1[!, x] = df[!, x],
propertynames(df)
)
df1
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
But a slightly better one would be using the slicing approach:
df[:, map(x->!all(ismissing, df[!, x]), propertynames(df))]
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
# OR
df[!, map(x->!all(ismissing, x), eachcol(df))]
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
#Or
df[!, Not(names(df, all.(ismissing, eachcol(df))))]
# I omitted the result to prevent this answer from becoming extensively lengthy.
#Or
df[!, Not(all.(ismissing, eachcol(df)))]
I almost forgot the deleteat! function:
deleteat!(permutedims(df), all.(ismissing, eachcol(df))) |> permutedims
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6
You can use the select! function, as Dan noted:
select!(df, [k for (k,v) in pairs(eachcol(df)) if !all(ismissing, v)])
# 5×2 DataFrame
# Row │ i y
# │ Int64 Int64?
# ─────┼────────────────
# 1 │ 1 missing
# 2 │ 2 missing
# 3 │ 3 1
# 4 │ 4 3
# 5 │ 5 6

The names function accepts a type as an input to select columns of a specific type, so I would do:
julia> select(df, Not(names(df, Missing)))
5×2 DataFrame
Row │ i y
│ Int64 Int64?
─────┼────────────────
1 │ 1 missing
2 │ 2 missing
3 │ 3 1
4 │ 4 3
5 │ 5 6
Without benchmarking this, I would guess that it is also significantly faster: it doesn't have to check each element of each column but, as far as I know, simply queries the type information for each column, which is readily available in the DataFrame:
julia> dump(df)
DataFrame
columns: Array{AbstractVector}((3,))
1: Array{Int64}((5,)) [1, 2, 3, 4, 5]
2: Array{Missing}((5,))
1: Missing missing
2: Missing missing
3: Missing missing
4: Missing missing
5: Missing missing
3: Array{Union{Missing, Int64}}((5,))
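The speed guess above could be checked with a rough timing sketch like the following. The table big, its dimensions, and the helper names by_type and by_scan are made up for illustration; only the column selectors are timed, and @elapsed gives a coarse single-run number rather than a proper benchmark.

```julia
using DataFrames

# Hypothetical test table: 200 columns of 10_000 rows, every odd column entirely missing.
nrows, ncols = 10_000, 200
big = DataFrame([isodd(j) ? fill(missing, nrows) : rand(nrows) for j in 1:ncols], :auto)

# Hypothetical helpers: both return the names of the all-missing columns.
by_type(df) = names(df, Missing)                        # looks at column element types only
by_scan(df) = names(df, all.(ismissing, eachcol(df)))   # looks at the stored values

by_type(big); by_scan(big)        # run once so that compilation is not timed
t_type = @elapsed by_type(big)
t_scan = @elapsed by_scan(big)
println("type-based: ", t_type, " s, scan-based: ", t_scan, " s")
```

Both selectors agree on this table because every all-missing column was created as a Vector{Missing}.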
The downside of this approach is that it relies on the type information to be correct, which might not be the case after a transformation:
julia> df2 = df[1:2, :]
2×3 DataFrame
Row │ i x y
│ Int64 Missing Int64?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 missing missing
This can be fixed by calling identity to narrow column types, but this is again potentially expensive:
julia> identity.(df2)
2×3 DataFrame
Row │ i x y
│ Int64 Missing Missing
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 missing missing
So I'd say that if you're creating a DataFrame from scratch, such as reading it in via XLSX.jl (as people loooove putting empty columns in their Excel sheets), or are creating whole columns in your workflow, select(df, Not(names(df, Missing))) is the way to go. For analysis on subsets of DataFrames it is only guaranteed to work after narrowing the column types with identity, so the other approaches mentioned, which check every cell, are viable alternatives.
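One way to combine the two ideas is sketched below: check the element type first and fall back to scanning the values only when the type is not conclusive. The function name drop_allmissing_cols is hypothetical, not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical helper: keep a column unless every value in it is missing.
# The eltype check is a cheap shortcut for Vector{Missing} columns; the scan
# catches columns such as Vector{Union{Missing, Int64}} that only hold missing.
function drop_allmissing_cols(df::AbstractDataFrame)
    keep = [!(eltype(col) === Missing || all(ismissing, col)) for col in eachcol(df)]
    return df[:, keep]
end

df = DataFrame(i=1:5, x=fill(missing, 5), y=[missing, missing, 1, 3, 6])
df2 = df[1:2, :]                    # y keeps eltype Union{Missing, Int64} here

names(drop_allmissing_cols(df))     # ["i", "y"]
names(drop_allmissing_cols(df2))    # ["i"]
```

Note that all(ismissing, col) is true for an empty column, so a data frame with zero rows would lose all of its columns under this sketch.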

Another simple option is to use
df[!, any.(!ismissing, eachcol(df))]
5×2 DataFrame
Row │ i y
│ Int64 Int64?
─────┼────────────────
1 │ 1 missing
2 │ 2 missing
3 │ 3 1
4 │ 4 3
5 │ 5 6
If the DataFrame is created from scratch, there is another fast option using the column type. Since a column constructed from only missing values isa Vector{Missing}, we can use this type information to skip these columns. The drawback of this fast method, as @NilsGudat pointed out, is that it fails if the column contents have changed through some transformation without the column types being narrowed.
df[!, (!isa).(eachcol(df), Vector{Missing})]
5×2 DataFrame
Row │ i y
│ Int64 Int64?
─────┼────────────────
1 │ 1 missing
2 │ 2 missing
3 │ 3 1
4 │ 4 3
5 │ 5 6

Related

Remove rows with all missing values for columns that start with certain name in dataframe Julia

I have the following dataframe:
using DataFrames
df = DataFrame(
group = ["A", "A", "A", "B", "B", "B"],
V1 = [1, missing, missing, 3, missing, missing],
V2 = [missing, missing, missing, 2, missing, missing],
V3 = [missing, missing, 4, missing, 1, missing],
Z1 = [3, missing, missing, 3, missing, missing],
Z2 = [3, 1, 5, 2, missing, 3],
Z3 = [missing, missing, 2, missing, missing, missing])
6×7 DataFrame
Row │ group V1 V2 V3 Z1 Z2 Z3
│ String Int64? Int64? Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────────────────
1 │ A 1 missing missing 3 3 missing
2 │ A missing missing missing missing 1 missing
3 │ A missing missing 4 missing 5 2
4 │ B 3 2 missing 3 2 missing
5 │ B missing missing 1 missing missing missing
6 │ B missing missing missing missing 3 missing
I would like to remove the rows with all values missing, but only across the columns whose names start with "V". This means that rows 2 and 6 should be removed because all their values are missing across the columns that start with "V". The desired output should look like this:
4×7 DataFrame
Row │ group V1 V2 V3 Z1 Z2 Z3
│ String Int64? Int64? Int64? Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────────────────
1 │ A 1 missing missing 3 3 missing
2 │ A missing missing 4 missing 5 2
3 │ B 3 2 missing 3 2 missing
4 │ B missing missing 1 missing missing missing
So I was wondering if anyone knows how to remove rows where all values are missing across columns that start with certain column name in a dataframe Julia?
You can use the deleteat! function to drop the rows of the given data frame with the given indexes:
deleteat!(df, all.(ismissing, eachrow(df[!, r"V"])))
# 4×7 DataFrame
# Row │ group V1 V2 V3 Z1 Z2 Z3
# │ String Int64? Int64? Int64? Int64? Int64? Int64?
# ─────┼──────────────────────────────────────────────────────────────
# 1 │ A 1 missing missing 3 3 missing
# 2 │ A missing missing 4 missing 5 2
# 3 │ B 3 2 missing 3 2 missing
# 4 │ B missing missing 1 missing missing missing
Another way is following this approach (slicing by a mask):
mask = map(x->!all(ismissing, x), eachrow(df[!, r"V.*"]))
df[mask, :]
# 4×7 DataFrame
# Row │ group V1 V2 V3 Z1 Z2 Z3
# │ String Int64? Int64? Int64? Int64? Int64? Int64?
# ─────┼──────────────────────────────────────────────────────────────
# 1 │ A 1 missing missing 3 3 missing
# 2 │ A missing missing 4 missing 5 2
# 3 │ B 3 2 missing 3 2 missing
# 4 │ B missing missing 1 missing missing missing
# Or
mask = broadcast(~, all.(ismissing, eachrow(df[!, r"V"])))
df[mask, :]
# Or
df[Not(all.(ismissing, eachrow(df[!, r"V"]))), :]
The r"V.*" is a Regex, which DataFrames.jl allows as a column selector. It is matched anywhere in the column name, not only at the start. Its interpretation:
V: The literal letter V.
.: Any single character.
.*: Any character, repeated zero or more times.
To select only columns whose names start with V, anchor the pattern as r"^V". For this data frame r"V", r"V.*" and r"^V" all select the same columns, so the plain r"V" is enough.
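The difference between an anchored and an unanchored pattern only shows up when a column name contains V somewhere other than at the start. A minimal sketch, using a made-up data frame demo with a hypothetical column AV2:

```julia
using DataFrames

# Hypothetical data frame: AV2 contains a V but does not start with it.
demo = DataFrame(V1 = [1], AV2 = [2], Z1 = [3])

unanchored = names(demo, r"V")    # ["V1", "AV2"]
anchored   = names(demo, r"^V")   # ["V1"]
```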
Following this approach, another way is to create a mask DataFrame:
maskdf = select(df, AsTable(r"V") => ByRow(x-> !all(ismissing, x)) => :mask)
# 6×1 DataFrame
# Row │ mask
# │ Bool
# ─────┼───────
# 1 │ true
# 2 │ false
# 3 │ true
# 4 │ true
# 5 │ true
# 6 │ false
df[maskdf.mask, :]
# returns the desired result.
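The mask column can also be skipped by passing the same AsTable and ByRow pair to subset, which keeps the rows where the function returns true. This is a sketch on a shortened copy of the example data (the Z1 and Z3 columns are left out for brevity), and it assumes a DataFrames.jl version that provides subset.

```julia
using DataFrames

df = DataFrame(
    group = ["A", "A", "A", "B", "B", "B"],
    V1 = [1, missing, missing, 3, missing, missing],
    V2 = [missing, missing, missing, 2, missing, missing],
    V3 = [missing, missing, 4, missing, 1, missing],
    Z2 = [3, 1, 5, 2, missing, 3])

# Keep a row when at least one of its V columns is not missing.
kept = subset(df, AsTable(r"^V") => ByRow(row -> !all(ismissing, row)))
nrow(kept)   # 4
```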

Count missing values per column in dataframe Julia

I would like to count the number of missing values per column in a dataframe like df:
using Pkg
Pkg.add("DataFrames")
using DataFrames
df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
Row │ i x y
│ Int64 Int64? String?
─────┼─────────────────────────
1 │ 1 missing missing
2 │ 2 4 missing
3 │ 3 missing c
4 │ 4 2 d
5 │ 5 1 e
This should return 0 for i, 2 for x and 2 for y column. So I was wondering if anyone knows how to count the number of missing values per column in Julia?
When writing the question I found an answer by using describe with :nmissing like this:
describe(df, :nmissing)
3×2 DataFrame
Row │ variable nmissing
│ Symbol Int64
─────┼────────────────────
1 │ i 0
2 │ x 2
3 │ y 2
If you wanted the output in columnar format you can write:
julia> mapcols(x -> count(ismissing, x), df)
1×3 DataFrame
Row │ i x y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 0 2 2
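If a plain lookup table is more convenient than a data frame, the same counts can be collected into a Dict. A small sketch (the variable name nmissing is only illustrative):

```julia
using DataFrames

df = DataFrame(i=1:5,
               x=[missing, 4, missing, 2, 1],
               y=[missing, missing, "c", "d", "e"])

# Pair each column name with the number of missing values in that column.
nmissing = Dict(names(df) .=> count.(ismissing, eachcol(df)))
nmissing["x"]   # 2
```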

How to make a list of data frames in Julia?

Given that I have some data frames with a single dimension, how can I create a list of all the data frames? Is it really as simple as just making a list and adding them in?
You could also use vcat to combine these data frames into a single one with an extra column indicating the source like this:
julia> c = vcat(a, b, source=:source => ["a", "b"])
8×2 DataFrame
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
5 │ 1 b
6 │ 2 b
7 │ 3 b
8 │ 4 b
This form is often easier to work with later. In particular if you then groupby the c data frame by :source like this:
julia> groupby(c, :source)
GroupedDataFrame with 2 groups based on key: source
First Group (4 rows): source = "a"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
⋮
Last Group (4 rows): source = "b"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 2 b
3 │ 3 b
4 │ 4 b
As a result you also get a collection of data frames (like the list that was created in the other answer), but this time you can apply functions supporting the split-apply-combine to it, see https://dataframes.juliadata.org/stable/man/split_apply_combine/.
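As a small illustration of what split-apply-combine allows on the stacked data frame, the sketch below sums column A within each source. The output column name total is chosen here only for the example.

```julia
using DataFrames

a = DataFrame(A = 1:4)
b = DataFrame(A = 1:4)
c = vcat(a, b, source = :source => ["a", "b"])

# One row per source, with the sum of A in that group.
totals = combine(groupby(c, :source), :A => sum => :total)
```

Here totals has two rows, one for "a" and one for "b", each with total equal to 10.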
One possible option that appears to work is the straightforward, "just add them to the list" method mentioned above. This would look like:
julia> a = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> b = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> c = [a, b]
2-element Vector{DataFrame}:
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> typeof(c)
Vector{DataFrame} (alias for Array{DataFrame, 1})
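When the data frames are produced one at a time, the list can also be built incrementally with push!. A minimal sketch, where the loop and the offset values are invented for the example:

```julia
using DataFrames

frames = DataFrame[]                 # empty Vector{DataFrame}
for offset in (0, 10)                # hypothetical loop producing one data frame per step
    push!(frames, DataFrame(A = (1:4) .+ offset))
end

length(frames)                       # 2
stacked = reduce(vcat, frames)       # stack the list back into a single data frame
```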

Is there a diff() function in Julia DataFrames like pandas?

I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In Python pandas, I would simply use df.series.diff(). Is there a Julia equivalent?
For example:
data
1
2
4
6
7
# in pandas
df['diff_data'] = df.data.diff()
data diff_data
1 NaN
2 1
4 2
6 2
7 1
You can use ShiftedArrays.jl like this.
Declarative style:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
Imperative style (in place):
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
Row │ data
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 4
4 │ 6
5 │ 7
julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
julia> df
5×2 DataFrame
Row │ data data_diff
│ Int64 Int64?
─────┼──────────────────
1 │ 1 missing
2 │ 2 1
3 │ 4 2
4 │ 6 2
5 │ 7 1
With diff you do not need extra packages and can similarly do the following:
julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
(The issue is that diff is a general-purpose function that changes the length of the vector from n to n-1, so you have to add missing manually in front.)
Pandas df.diff() does it to the whole data frame at once and allows you to specify row-wise or column-wise. There might be a better way but this is what I used before (I like chaining or piping like in dplyr):
# using Chain.jl
using Chain
@chain df begin
eachcol()
diff.()
DataFrame(:auto)
rename!(names(df))
end
# OR base pipe
df |>
x -> eachcol(x) |>
x -> diff.(x) |>
x -> DataFrame(x, :auto) |>
x -> rename!(x, names(df))
# OR without piping
rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))
You might need to insert the starting row, which will now have missing values.
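One way to keep the original number of rows is to prepend missing inside the per-column function, for example with mapcols. This is a sketch that assumes every column is numeric; the second column other is made up for the example.

```julia
using DataFrames

df = DataFrame(data = [1, 2, 4, 6, 7], other = [10, 13, 13, 20, 21])

# Difference every column and put missing in the first row,
# so the result has as many rows as df.
diffs = mapcols(col -> [missing; diff(col)], df)
```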

How to convert numeric values to missing in a Julia DataFrame column?

Trying to replace values of 99 to missing:
df_miss = DataFrame(a=[1,2,99,99,5,6,99])
allowmissing!(df_miss)
df_miss.a .= replace.(df_miss.a, 99 => missing)
But getting this error:
MethodError: no method matching similar(::Int64, ::Type{Union{Missing, Int64}})
Using:
Julia Version 1.5.3
DataFrames Version 0.22.7
You do not need to use broadcasting here. Just write:
julia> df_miss = DataFrame(a=[1,2,99,99,5,6,99])
7×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 99
4 │ 99
5 │ 5
6 │ 6
7 │ 99
julia> df_miss.a = replace(df_miss.a, 99 => missing)
7-element Vector{Union{Missing, Int64}}:
1
2
missing
missing
5
6
missing
julia> df_miss
7×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ 2
3 │ missing
4 │ missing
5 │ 5
6 │ 6
7 │ missing
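As far as I can tell, the original attempt fails because replace. applies replace to each element separately, and replace is not defined for a single Int64. If you want to keep the allowmissing! step and modify the column in place, a sketch with replace! would be:

```julia
using DataFrames

df_miss = DataFrame(a = [1, 2, 99, 99, 5, 6, 99])

allowmissing!(df_miss, :a)             # column eltype becomes Union{Missing, Int64}
replace!(df_miss.a, 99 => missing)     # in-place replacement, no broadcasting
```

After this, df_miss.a holds missing in rows 3, 4 and 7, the same result as the assignment form above.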