I'd like to know how I can permanently delete multiple rows from a dataframe in Julia.
Here is the dataframe example:
 Row │ Group   Variable1  Variable2
     │ String  Float64    Float64
─────┼──────────────────────────────
   1 │ B       -0.661256   0.265538
   2 │ B        0.111651   0.837895
   3 │ A        0.197754   0.987195
   4 │ A        1.35057    0.696815
   5 │ A       -1.20899    0.496407
   6 │ B        0.813047   0.324904
I'd like to delete rows 2, 4, and 6 from my dataframe. Is there an easy function to do that?
If you want to delete in place:
delete!(df, [2, 4, 6])
In case you want a new df without the selected rows:
df[Not([2, 4, 6]), :]
As of DataFrames.jl version 1.3, delete! has been deprecated and replaced with deleteat!. This change was made to more correctly reflect the difference between Base.delete! and Base.deleteat!. You can compare the docstring for Base.deleteat! to the docstring for Base.delete!. In fact, the DataFrames deleteat! function is a method extension of Base.deleteat!.
Here's an example of using deleteat! on a data frame:
julia> using DataFrames
julia> df = DataFrame(a=1:4, b=5:8)
4×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     2      6
   3 │     3      7
   4 │     4      8
julia> deleteat!(df, [2, 3])
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     4      8
Here's the DataFrames documentation for deleteat!:
https://dataframes.juliadata.org/stable/lib/functions/#Base.deleteat!
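The distinction between Base.delete! and Base.deleteat! that motivated the rename can be seen on plain Base collections, without DataFrames. This is only a minimal sketch; the variable names v and d are made up for illustration:

```julia
# Base.deleteat! removes elements by position from an ordered collection.
v = [10, 20, 30, 40, 50, 60]
deleteat!(v, [2, 4, 6])      # positions 2, 4 and 6 are removed in place

# Base.delete! removes an entry by key from a keyed collection.
d = Dict("a" => 1, "b" => 2)
delete!(d, "a")              # the entry stored under key "a" is removed in place

println(v)
println(d)
```

Deleting rows of a data frame by row number is a positional operation, which is why it lines up with deleteat! rather than delete!.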
I would like to count the number of missing values per column in a dataframe like df:
using Pkg
Pkg.add("DataFrames")
using DataFrames
df = DataFrame(i=1:5,
x=[missing, 4, missing, 2, 1],
y=[missing, missing, "c", "d", "e"])
5×3 DataFrame
 Row │ i      x        y
     │ Int64  Int64?   String?
─────┼─────────────────────────
   1 │     1  missing  missing
   2 │     2        4  missing
   3 │     3  missing  c
   4 │     4        2  d
   5 │     5        1  e
This should return 0 for i, 2 for x and 2 for y column. So I was wondering if anyone knows how to count the number of missing values per column in Julia?
When writing the question I found an answer by using describe with :nmissing like this:
describe(df, :nmissing)
3×2 DataFrame
 Row │ variable  nmissing
     │ Symbol    Int64
─────┼────────────────────
   1 │ i                0
   2 │ x                2
   3 │ y                2
If you wanted the output in columnar format you can write:
julia> mapcols(x -> count(ismissing, x), df)
1×3 DataFrame
 Row │ i      x      y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     0      2      2
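If you would rather have the counts as ordinary Julia values than as a data frame, one possible sketch is to iterate over the columns yourself. The names nmissing and counts below are only illustrative:

```julia
using DataFrames

df = DataFrame(i = 1:5,
               x = [missing, 4, missing, 2, 1],
               y = [missing, missing, "c", "d", "e"])

# Count the missing entries of each column, in column order.
nmissing = [count(ismissing, col) for col in eachcol(df)]

# Pair each column name (a String) with its count.
counts = Dict(zip(names(df), nmissing))

println(nmissing)
println(counts)
```

This relies only on eachcol, names, and Base's count and ismissing, so it should give the same numbers as describe(df, :nmissing).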
I would like to plot the values Emean against T (like shown in the image below).
My guess is that there should be only two lines, since there are only two dataframes.
This means that the "connecting" line that I marked in yellow should not be there. Is there a way to "separate" the plots?
I assume you want to plot two lines as they are defined by grouping variable :L. If this is correct then you can do the following:
julia> using DataFrames
julia> using Plots
julia> using StatsPlots
julia> df = DataFrame(L=[1,1,1,2,2,2], T=[1,2,3,1,2,3], Emean=[1,2,3,4,5,6])
6×3 DataFrame
 Row │ L      T      Emean
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      1
   2 │     1      2      2
   3 │     1      3      3
   4 │     2      1      4
   5 │     2      2      5
   6 │     2      3      6
julia> @df df plot(:T, :Emean, group=:L)
to get what you want.
Here I am using the functionality provided by the StatsPlots.jl package.
I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In Python pandas, I would simply use df.series.diff(). Is there a Julia equivalent?
For example:
data
1
2
4
6
7
# in pandas
df['diff_data'] = df.data.diff()
data  diff_data
   1        NaN
   2          1
   4          2
   6          2
   7          1
You can use ShiftedArrays.jl like this.
Declarative style:
julia> using DataFrames, ShiftedArrays
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
 Row │ data
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     4
   4 │     6
   5 │     7
julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
 Row │ data   data_diff
     │ Int64  Int64?
─────┼──────────────────
   1 │     1    missing
   2 │     2          1
   3 │     4          2
   4 │     6          2
   5 │     7          1
Imperative style (in place):
julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
 Row │ data
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     4
   4 │     6
   5 │     7
julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
julia> df
5×2 DataFrame
 Row │ data   data_diff
     │ Int64  Int64?
─────┼──────────────────
   1 │     1    missing
   2 │     2          1
   3 │     4          2
   4 │     6          2
   5 │     7          1
With diff you do not need any extra packages; you can do the following instead:
julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
missing
1
2
2
1
(the catch is that diff is a general-purpose function that shortens the vector from n to n-1 elements, so you have to prepend missing manually)
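If you need this in several places, the padding step can be wrapped in a small function. The name lagged_diff below is hypothetical (it is not part of Base or DataFrames.jl); this is just a sketch of the idea above using only Base:

```julia
# Hypothetical helper: first differences padded with a leading `missing`,
# so that the result has the same length as the input vector.
lagged_diff(v::AbstractVector) = [missing; diff(v)]

d = lagged_diff([1, 2, 4, 6, 7])
println(d)
```

With a data frame you could then write df.data_diff = lagged_diff(df.data).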
Pandas df.diff() does it to the whole data frame at once and allows you to specify row-wise or column-wise. There might be a better way but this is what I used before (I like chaining or piping like in dplyr):
# using Chain.jl (requires `using Chain`)
@chain df begin
    eachcol()
    diff.()
    DataFrame(:auto)
    rename!(names(df))
end
# OR base pipe
df |>
x -> eachcol(x) |>
x -> diff.(x) |>
x -> DataFrame(x, :auto) |>
x -> rename!(x, names(df))
# OR without piping
rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))
You might need to insert the starting row, which will now have missing values.
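One way to keep the original number of rows is to prepend missing to each column's differences while mapping over the columns. The following is a sketch under that assumption, using mapcols from DataFrames.jl on a small made-up data frame:

```julia
using DataFrames

# Small illustrative data frame (values invented for the example).
df = DataFrame(a = [1, 2, 4], b = [10.0, 12.5, 13.0])

# Apply diff to every column and pad with a leading `missing`,
# so the result has as many rows as `df` and keeps the column names.
df_diff = mapcols(col -> [missing; diff(col)], df)

println(df_diff)
```

Because mapcols keeps the column names, no rename! step is needed in this variant.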
Suppose I create the following dataframe
using DataFrames
df = DataFrame(A = rand(500), B = repeat(1:10, inner=50), C = 1:500)
and I can do a groupby:
grouped_df = groupby(df,"B")
I will end up with 10 groups. How can I choose, say, the third element of each group, and combine them into a new dataframe? That is, I would like a new dataframe of 10 rows, with each row being the third element of each of the groups?
I looked into combine, but couldn't find a solution. Can I get a hint?
To get the third row from every group, groupby first and then combine using indexing:
julia> combine(groupby(df, :B), x->x[3, :])
10×3 DataFrame
 Row │ B      A          C
     │ Int64  Float64    Int64
─────┼─────────────────────────
   1 │     1  0.196572       3
   2 │     2  0.539942      53
   3 │     3  0.243455     103
   4 │     4  0.837491     153
   5 │     5  0.672861     203
   6 │     6  0.0220219    253
   7 │     7  0.303417     303
   8 │     8  0.409596     353
   9 │     9  0.165928     403
  10 │    10  0.752038     453
(I had initially misread the question and suggested logical indexing like df[df.B .== 3, :])
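Note that x[3, :] assumes every group has at least three rows. If some groups may be shorter, one option is to return an empty slice for them so they are simply skipped. The helper name nth_row below is hypothetical and the data is made up to keep the example deterministic:

```julia
using DataFrames

# Hypothetical helper: the n-th row of a group as a one-row data frame,
# or a zero-row data frame if the group has fewer than n rows.
nth_row(sdf, n) = nrow(sdf) >= n ? sdf[n:n, :] : sdf[1:0, :]

# Every group has four rows here, so each one contributes its third row.
df = DataFrame(B = repeat(1:3, inner = 4), C = 1:12)
third = combine(groupby(df, :B), sdf -> nth_row(sdf, 3))

# Group 2 has only two rows here, so it is dropped from the result.
ragged = DataFrame(B = [1, 1, 1, 2, 2], C = 1:5)
third_ragged = combine(groupby(ragged, :B), sdf -> nth_row(sdf, 3))

println(third)
println(third_ragged)
```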
How can I insert a row in a dataframe in Julia at a specific index? (Julia version 1.1)
I have found this related question. However, the code given in the answer no longer works in Julia 1.1.
I know how to push! a row into a dataframe or concatenate two dataframes, but what about inserting at a specific index?
It also doesn't seem to be explained in the Julia DataFrames documentation.
This is a non-standard operation. The recommendation given there is still valid, so:
df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
foreach((v,n) -> insert!(df[n], 2, v), [4, "d"], names(df))
works. A shorter version to write it under Julia 1.0 would be:
insert!.(eachcol(df, false), 2, [4, "d"])
(passing false as the second argument will no longer be needed in the future, as we are currently in the deprecation period)
The difference is that the getproperty method can be overloaded since Julia 1.0, so df.columns does not work.
I have also updated the other answer, so you can close this question if you prefer.
EDIT
The instructions above are no longer valid (unless you use a very old DataFrames.jl version).
In DataFrames.jl 1.4 use insert!, push!, or pushfirst! functions depending on where you want to add the row:
julia> using DataFrames
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c
julia> insert!(df, 2, (100, "new line"))
4×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼─────────────────
   1 │     1  a
   2 │   100  new line
   3 │     2  b
   4 │     3  c
julia> push!(df, (200, "last line"))
5×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼──────────────────
   1 │     1  a
   2 │   100  new line
   3 │     2  b
   4 │     3  c
   5 │   200  last line
julia> pushfirst!(df, (300, "first line"))
6×2 DataFrame
 Row │ x      y
     │ Int64  String
─────┼───────────────────
   1 │   300  first line
   2 │     1  a
   3 │   100  new line
   4 │     2  b
   5 │     3  c
   6 │   200  last line