Skipping last N rows of Julia DataFrame

I have a problem with removing the last N rows from a DataFrame in Julia.
N_SKIP = 3
df = DataFrame(:col1=>1:10,:col2=>21:30)
N = nrow(df)
Original example Dataframe:
10×2 DataFrame
│ Row │ col1 │ col2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 21 │
│ 2 │ 2 │ 22 │
│ 3 │ 3 │ 23 │
│ 4 │ 4 │ 24 │
│ 5 │ 5 │ 25 │
│ 6 │ 6 │ 26 │
│ 7 │ 7 │ 27 │
│ 8 │ 8 │ 28 │
│ 9 │ 9 │ 29 │
│ 10 │ 10 │ 30 │
I want to get the first N - N_SKIP rows; in this example, the rows with id in the 1:7 range.
Result I'm trying to achieve with N_SKIP = 3:
7×2 DataFrame
│ Row │ col1 │ col2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 21 │
│ 2 │ 2 │ 22 │
│ 3 │ 3 │ 23 │
│ 4 │ 4 │ 24 │
│ 5 │ 5 │ 25 │
│ 6 │ 6 │ 26 │
│ 7 │ 7 │ 27 │
I could use first(df::AbstractDataFrame, n::Integer) and pass the number of rows to keep as the second argument. It works, but it doesn't feel like the right way to do it.
julia> N_SKIP = 3
julia> df = DataFrame(:col1=>1:10,:col2=>21:30)
julia> N = nrow(df)
julia> first(df,N - N_SKIP)
7×2 DataFrame
│ Row │ col1 │ col2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 21 │
│ 2 │ 2 │ 22 │
│ 3 │ 3 │ 23 │
│ 4 │ 4 │ 24 │
│ 5 │ 5 │ 25 │
│ 6 │ 6 │ 26 │
│ 7 │ 7 │ 27 │

There are three ways to do it, depending on what you want.
Create a new data frame:
julia> df[1:end-3, :]
7×2 DataFrame
Row │ col1 col2
│ Int64 Int64
─────┼──────────────
1 │ 1 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
6 │ 6 26
7 │ 7 27
julia> first(df, nrow(df) - 3)
7×2 DataFrame
Row │ col1 col2
│ Int64 Int64
─────┼──────────────
1 │ 1 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
6 │ 6 26
7 │ 7 27
Create a view of a data frame:
julia> first(df, nrow(df) - 3, view=true)
7×2 SubDataFrame
Row │ col1 col2
│ Int64 Int64
─────┼──────────────
1 │ 1 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
6 │ 6 26
7 │ 7 27
julia> @view df[1:end-3, :]
7×2 SubDataFrame
Row │ col1 col2
│ Int64 Int64
─────┼──────────────
1 │ 1 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
6 │ 6 26
7 │ 7 27
Update the source data frame in place (alternatively deleteat! could be used depending on what is more convenient for you):
julia> keepat!(df, 1:nrow(df)-3)
7×2 DataFrame
Row │ col1 col2
│ Int64 Int64
─────┼──────────────
1 │ 1 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
6 │ 6 26
7 │ 7 27
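The practical difference between the three options is who owns the data. Here is a minimal sketch, assuming a recent DataFrames.jl release (one that has deleteat!); the names n_skip, copied, viewed and trimmed are only illustrative:

```julia
using DataFrames

n_skip = 3
df = DataFrame(col1 = 1:10, col2 = 21:30)

copied = df[1:end-n_skip, :]          # new DataFrame with its own columns
viewed = @view df[1:end-n_skip, :]    # SubDataFrame that shares memory with df

# Mutating the parent shows which result is independent of it.
df.col1[1] = 100
copied.col1[1]   # still 1, because the copy has its own data
viewed.col1[1]   # 100, because the view reads from df

# In-place alternative: drop the last n_skip rows with deleteat!.
# Done on a copy here so that the view above stays valid.
trimmed = copy(df)
deleteat!(trimmed, (nrow(trimmed) - n_skip + 1):nrow(trimmed))
```

A view is cheap to create, but it is tied to its parent, so it is best not to resize the parent while the view is still in use.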

Related

How to nest / unnest data frames in Julia?

Does Julia have any analogues of the nest and unnest functions from the tidyr R package? Particularly, is there a way to make efficient nesting / unnesting operations using DataFrames.jl?
Suppose you have the following DataFrame:
julia> d = DataFrame(g=[1,1,1,2,2,3,3,], val1=1:7, val2 = 'a':'g')
7×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 1 │ 'a' │
│ 2 │ 1 │ 2 │ 'b' │
│ 3 │ 1 │ 3 │ 'c' │
│ 4 │ 2 │ 4 │ 'd' │
│ 5 │ 2 │ 5 │ 'e' │
│ 6 │ 3 │ 6 │ 'f' │
│ 7 │ 3 │ 7 │ 'g' │
and assume that you want to sample one element from each group defined by the g column.
This can be achieved by:
julia> DataFrame([rand(eachrow(gr)) for gr in groupby(d,:g)])
3×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 2 │ 'b' │
│ 2 │ 2 │ 4 │ 'd' │
│ 3 │ 3 │ 6 │ 'f' │
Hope this is what you need.
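The same per-group sampling can probably also be written with combine and a do-block, which returns a DataFrame directly. This is only a sketch of that variant; the seed is there just to make the run repeatable and the name sampled is illustrative:

```julia
using DataFrames, Random

Random.seed!(1234)  # only so that repeated runs pick the same rows

d = DataFrame(g = [1, 1, 1, 2, 2, 3, 3], val1 = 1:7, val2 = 'a':'g')

# For every group, pick one random row; combine stacks the picked rows.
sampled = combine(groupby(d, :g)) do gr
    gr[rand(1:nrow(gr)), :]   # a single DataFrameRow from this group
end
```

The result has one row per group, and every sampled row is one of the rows of d.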
EDIT
If you want a different element count from each group you could do something like this:
julia> g_to_rows=Dict(1=>4,2=>3,3=>7); # desired element counts
julia> [ gr[rand(1:nrow(gr),g_to_rows[gr.g[1]]), :] for gr in groupby(d,:g)]
3-element Array{DataFrame,1}:
4×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 1 │ 'a' │
│ 2 │ 1 │ 1 │ 'a' │
│ 3 │ 1 │ 3 │ 'c' │
│ 4 │ 1 │ 2 │ 'b' │
3×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 2 │ 5 │ 'e' │
│ 2 │ 2 │ 5 │ 'e' │
│ 3 │ 2 │ 5 │ 'e' │
7×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 3 │ 7 │ 'g' │
│ 2 │ 3 │ 6 │ 'f' │
│ 3 │ 3 │ 6 │ 'f' │
│ 4 │ 3 │ 7 │ 'g' │
│ 5 │ 3 │ 7 │ 'g' │
│ 6 │ 3 │ 6 │ 'f' │
│ 7 │ 3 │ 6 │ 'f' │
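As for the nest / unnest part of the question itself, one possible sketch (not necessarily the most efficient way, and not taken from the answer above) is to collect each group's values into a vector cell with combine and to expand them again with flatten. The names nested and unnested are illustrative:

```julia
using DataFrames

d = DataFrame(g = [1, 1, 1, 2, 2, 3, 3], val1 = 1:7, val2 = 'a':'g')

# "Nest": one row per group; each cell holds that group's values as a vector.
# Returning a 1-element vector makes combine produce exactly one row per group.
nested = combine(groupby(d, :g),
                 [:val1, :val2] .=> (x -> [collect(x)]) .=> [:val1, :val2])

# "Unnest": expand the vector cells back into one row per element.
unnested = flatten(nested, [:val1, :val2])
```

With this data nested has 3 rows and unnested has the same contents as d.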

Julia Dataframes Groupby chain using combine

I'm trying to do the following:
For each column, I want to group by the key together with that column, count the number of occurrences, and keep only the most frequent value per key (I don't want to keep the count, just the value that corresponds to it).
I have many columns to group with the key one after another, and I wanted to know if there is a way to chain this together.
Here is an example:
│ Row │ KEY │ A │
│ │ String │ String │
├──────────┼──────────────────────────────────┼───────────────┤
│ 1 │ 44473 │ ROCK │
│ 2 │ 4f4ef │ CLASSICAL │
│ 3 │ 0b8bd │ POP │
│ 4 │ 57c94 │ POP │
│ 5 │ a7070 │ RAP - HIP HOP │
│ 6 │ 1d9a3 │ JAZZ │
│ 7 │ 947fd │ POP │
Here I do:
per_key = DataFrames.groupby(test, [:KEY, :A])
combine(per_key, nrow)
which gives me:
│ Row │ KEY │ A │ nrow │
│ │ String │ String │ Int64 │
├──────┼──────────────────────────────────┼────────────────────┼───────┤
│ 1 │ 44473ff │ ROCK │ 2 │
│ 2 │ 4f4effc │ CLASSICAL │ 12 │
│ 3 │ 0b8bd64 │ POP │ 2 │
│ 4 │ 57c94f5 │ POP │ 2 │
│ 5 │ a7070e4 │ RAP - HIP HOP │ 1 │
│ 6 │ 1d9a3c7 │ JAZZ │ 1 │
How do I get, for each KEY, the maximum "nrow" and keep the corresponding value in "A"?
I also have to do this for many other columns, one by one.
Thank you
I am not 100% sure what you wanted, but I assume this is what you are looking for:
julia> using DataFrames, StatsBase
julia> df = DataFrame(key=rand(1:10, 10^6),
A = rand(1:10, 10^6),
B = rand(1:10, 10^6),
C = rand(1:10, 10^6));
julia> gdf = groupby(df, :key);
julia> combine(gdf, valuecols(gdf) .=>
(x -> last(maximum(reverse, countmap(x)))) .=>
valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
(on master you can add renamecols=false kwarg to avoid last .=> valuecols(gdf)).
The key function here is countmap, which gives you counts of occurrences of different values in a vector, e.g.:
julia> countmap(gdf[1].A)
Dict{Int64,Int64} with 10 entries:
7 => 10028
4 => 10130
9 => 10007
10 => 9841
2 => 10090
3 => 9985
5 => 10022
8 => 10262
6 => 10103
1 => 10128
The rest is just a wrapper around it. You need reverse to swap each key => value pair into value => key order so that maximum picks the group with the highest count (note that your problem will not have a unique solution if there are two groups with the same count), and then last extracts the group (as you did not want to keep the count).
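To see the three steps on something small, here is a sketch with a made-up vector of genres (the data and variable names are illustrative, not taken from the question):

```julia
using StatsBase

genres = ["POP", "ROCK", "POP", "JAZZ"]

counts = countmap(genres)                # value => number of occurrences
flipped_max = maximum(reverse, counts)   # largest count => value pair
most_common = last(flipped_max)          # keep only the value: "POP"

# StatsBase also has mode, which returns a most frequent element directly.
also_most_common = mode(genres)
```

Here "POP" occurs twice and everything else once, so there is no tie to worry about.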
EDIT:
Now I realized that argmax works for dictionaries so you can just write:
julia> combine(gdf, valuecols(gdf) .=>
argmax∘countmap .=>
valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
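For string columns like the ones in the question the same pattern should apply unchanged. A small sketch with invented data, assuming a DataFrames.jl version that already has the renamecols keyword (songs and most_common are illustrative names):

```julia
using DataFrames, StatsBase

songs = DataFrame(KEY = ["k1", "k1", "k1", "k2", "k2"],
                  A   = ["POP", "POP", "ROCK", "JAZZ", "JAZZ"],
                  B   = ["x", "y", "y", "z", "z"])

gdf = groupby(songs, :KEY)

# For every non-key column keep the most frequent value within each KEY.
# renamecols = false keeps the original column names (A, B).
most_common = combine(gdf, valuecols(gdf) .=> argmax∘countmap; renamecols = false)
```

Each KEY ends up with a single row holding the most frequent value of every other column.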

How to pivot two columns at a time?

Consider this input; I wish to pivot val1 and val2 using cname as the column name.
wide = DataFrame(x = 1:12,
a = 2:13,
b = 3:14,
val1 = randn(12),
val2 = randn(12),
cname = repeat(["c", "d"], inner =6)
)
12×6 DataFrame
│ Row │ x │ a │ b │ val1 │ val2 │ cname │
│ │ Int64 │ Int64 │ Int64 │ Float64 │ Float64 │ String │
├─────┼───────┼───────┼───────┼───────────┼───────────┼────────┤
│ 1 │ 1 │ 2 │ 3 │ 1.51014 │ -1.18548 │ c │
│ 2 │ 2 │ 3 │ 4 │ 0.0845411 │ -0.370083 │ c │
│ 3 │ 3 │ 4 │ 5 │ 0.826283 │ -1.00423 │ c │
│ 4 │ 4 │ 5 │ 6 │ -0.53175 │ -1.16659 │ c │
│ 5 │ 5 │ 6 │ 7 │ -1.77975 │ 0.336333 │ c │
│ 6 │ 6 │ 7 │ 8 │ 0.632577 │ 0.236621 │ c │
│ 7 │ 7 │ 8 │ 9 │ -0.681532 │ 1.14869 │ d │
│ 8 │ 8 │ 9 │ 10 │ -0.775619 │ 0.393475 │ d │
│ 9 │ 9 │ 10 │ 11 │ -0.533034 │ 0.059624 │ d │
│ 10 │ 10 │ 11 │ 12 │ 0.496152 │ -1.23507 │ d │
│ 11 │ 11 │ 12 │ 13 │ 0.834099 │ 2.12115 │ d │
│ 12 │ 12 │ 13 │ 14 │ 0.532357 │ -0.369267 │ d │
In the tidyverse, the pivot_wider function can do this
wide %>% pivot_wider(names_from = cname, values_from = c(val1,val2))
=== === === ========== ========== ========== ==========
x a b val1_c val1_d val2_c val2_d
=== === === ========== ========== ========== ==========
1 2 3 1.0174232 NA -0.6611959 NA
2 3 4 0.6590795 NA -2.0954505 NA
3 4 5 1.2939581 NA 1.6350356 NA
4 5 6 -1.9395356 NA 0.7813238 NA
5 6 7 0.3558087 NA 0.9789414 NA
6 7 8 0.9859100 NA -0.9803336 NA
7 8 9 NA 0.4949224 NA -0.0659333
8 9 10 NA 0.5024755 NA -0.2317832
9 10 11 NA 1.6926897 NA -0.3840687
10 11 12 NA -0.4324705 NA -0.0901276
11 12 13 NA -0.6415260 NA 0.0014151
12 13 14 NA 1.2406868 NA -2.1959740
=== === === ========== ========== ========== ==========
Julia's DataFrames.unstack does not work for this, for example:
using DataFrames
unstack(wide, [:x, :a,:b], :cname, [:val1,:val2])
What is the R (data.table)/Scala/etc solution for comparison? Solutions from any language welcome.
But I don't want any python solutions as it clashes with another.
Unfortunately this currently is not supported in DataFrames.jl in one call (but the feature was requested to be added to unstack, so it will be available in the future). Currently you have to do:
tmp = [unstack(wide, [:x, :a, :b], :cname, v, renamecols=x->v*"_"*x) for v in ["val1", "val2"]]
result = [tmp[1] tmp[2][:, 4:end]]
or
innerjoin([unstack(wide, [:x, :a, :b], :cname, v, renamecols=x->v*"_"*x) for v in ["val1", "val2"]]..., on=[:x, :a, :b])
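Here is the join variant on a small deterministic table, so the result is easy to check by eye. This is a sketch with made-up values; parts and result are illustrative names, and the final sort is only there because I would not rely on the row order after a join:

```julia
using DataFrames

wide = DataFrame(x = 1:4, a = 2:5, b = 3:6,
                 val1 = [1.0, 2.0, 3.0, 4.0],
                 val2 = [10.0, 20.0, 30.0, 40.0],
                 cname = repeat(["c", "d"], inner = 2))

# Unstack each value column separately, prefixing the new column names
# with the name of the value column (val1_c, val1_d, ...).
parts = [unstack(wide, [:x, :a, :b], :cname, v, renamecols = x -> v * "_" * x)
         for v in ["val1", "val2"]]

# Glue the pieces back together on the row keys.
result = innerjoin(parts..., on = [:x, :a, :b])
sort!(result, :x)
```

Cells with no matching cname come out as missing, which corresponds to the NA entries in the pivot_wider output.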

Get elements of one column with certain values of another column with DataFrames.jl

Having a dataframe df with columns :a and :b, how can I get all elements in column :a that are in a row with e.g. b = 0.5?
Can this be done with DataFrames alone or is a meta package needed?
df[df.b .== 5, :]
Example
julia> df = DataFrame(a=11:17, b=vcat([5,5],1:5))
7×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 11 │ 5 │
│ 2 │ 12 │ 5 │
│ 3 │ 13 │ 1 │
│ 4 │ 14 │ 2 │
│ 5 │ 15 │ 3 │
│ 6 │ 16 │ 4 │
│ 7 │ 17 │ 5 │
julia> df[df.b .== 5, :]
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 11 │ 5 │
│ 2 │ 12 │ 5 │
│ 3 │ 17 │ 5 │
If you want just the column a:
julia> df[df.b .== 5, :].a
3-element Array{Int64,1}:
11
12
17
Yet another option is to use filter with a lambda function (this is slightly faster and uses less memory):
julia> filter(row -> row[:b] == 5, df)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 11 │ 5 │
│ 2 │ 12 │ 5 │
│ 3 │ 17 │ 5 │
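A few more spellings that give only column a, as a sketch assuming DataFrames.jl 1.0 or later (for subset); the variable names are illustrative:

```julia
using DataFrames

df = DataFrame(a = 11:17, b = vcat([5, 5], 1:5))

a1 = df[df.b .== 5, :a]                    # row mask and column in one indexing call
a2 = df.a[df.b .== 5]                      # index the column vector directly
a3 = filter(:b => ==(5), df).a             # filter that is passed only column b
a4 = subset(df, :b => ByRow(==(5))).a      # subset with a row-wise predicate
```

All four give the same vector. If b holds floating point values such as 0.5, comparing with isapprox instead of == may be the safer choice.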

Concatenate Julia DataFrames adding a categorical column

Say I have the following
a = DataFrame(x = [1,2,3,4], y = [10,20,30,40])
b = DataFrame(x = [1,2,3,4], y = [50,60,70,80])
is there a way of getting [a;b] with an additional categorical column to obtain something like the following?
8×3 DataFrames.DataFrame
│ Row │ x │ y │ c │
├─────┼───┼────┼───┤
│ 1 │ 1 │ 10 │ 1 │
│ 2 │ 2 │ 20 │ 1 │
│ 3 │ 3 │ 30 │ 1 │
│ 4 │ 4 │ 40 │ 1 │
│ 5 │ 1 │ 50 │ 2 │
│ 6 │ 2 │ 60 │ 2 │
│ 7 │ 3 │ 70 │ 2 │
│ 8 │ 4 │ 80 │ 2 │
For two dataframes, something like
using DataFramesMeta
[@transform(a, c = 1); @transform(b, c = 2)]
works, but what if I have more than a few DataFrames?
You could use enumerate if you want to build the result from an array of DataFrames. For example:
l = [a,b]
vcat([transform(x,c=i) for (i,x) in enumerate(l)]...)
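With plain DataFrames.jl (no DataFramesMeta) the same idea can be sketched in two ways, assuming a 1.x release: via the source keyword of vcat, or by adding the column to a copy of each data frame before concatenating. The names dfs, combined and combined2 are illustrative:

```julia
using DataFrames

a = DataFrame(x = [1, 2, 3, 4], y = [10, 20, 30, 40])
b = DataFrame(x = [1, 2, 3, 4], y = [50, 60, 70, 80])
dfs = [a, b]

# Option 1: let vcat add an identifier column named c, one id per data frame.
combined = vcat(dfs...; source = :c => collect(1:length(dfs)))

# Option 2: add the column to a copy of each data frame, then concatenate.
combined2 = vcat([insertcols!(copy(x), :c => i) for (i, x) in enumerate(dfs)]...)
```

Both give an 8-row data frame whose column c is 1 for the rows from a and 2 for the rows from b. If a truly categorical column is needed, it could be converted afterwards, e.g. with the CategoricalArrays package.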