How to nest / unnest data frames in Julia?

Does Julia have any analogues of the nest and unnest functions from the tidyr R package? Particularly, is there a way to make efficient nesting / unnesting operations using DataFrames.jl?

Suppose you have the following DataFrame:
julia> d = DataFrame(g=[1,1,1,2,2,3,3,], val1=1:7, val2 = 'a':'g')
7×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 1 │ 'a' │
│ 2 │ 1 │ 2 │ 'b' │
│ 3 │ 1 │ 3 │ 'c' │
│ 4 │ 2 │ 4 │ 'd' │
│ 5 │ 2 │ 5 │ 'e' │
│ 6 │ 3 │ 6 │ 'f' │
│ 7 │ 3 │ 7 │ 'g' │
and assume that you want to sample one element from each group defined by the g column.
This can be achieved by:
julia> DataFrame([rand(eachrow(gr)) for gr in groupby(d,:g)])
3×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 2 │ 'b' │
│ 2 │ 2 │ 4 │ 'd' │
│ 3 │ 3 │ 6 │ 'f' │
Hope this is what you need.
EDIT
If you want a different element count from each group you could do something like this:
julia> g_to_rows=Dict(1=>4,2=>3,3=>7); # desired element counts
julia> [ gr[rand(1:nrow(gr),g_to_rows[gr.g[1]]), :] for gr in groupby(d,:g)]
3-element Array{DataFrame,1}:
4×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 1 │ 'a' │
│ 2 │ 1 │ 1 │ 'a' │
│ 3 │ 1 │ 3 │ 'c' │
│ 4 │ 1 │ 2 │ 'b' │
3×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 2 │ 5 │ 'e' │
│ 2 │ 2 │ 5 │ 'e' │
│ 3 │ 2 │ 5 │ 'e' │
7×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 3 │ 7 │ 'g' │
│ 2 │ 3 │ 6 │ 'f' │
│ 3 │ 3 │ 6 │ 'f' │
│ 4 │ 3 │ 7 │ 'g' │
│ 5 │ 3 │ 7 │ 'g' │
│ 6 │ 3 │ 6 │ 'f' │
│ 7 │ 3 │ 6 │ 'f' │
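
For the nest / unnest part of the question itself, here is a minimal sketch, assuming a reasonably recent DataFrames.jl in which combine unwraps a Ref returned by a function and flatten accepts several columns. The names nested and unnested are only illustrative:

```julia
using DataFrames

d = DataFrame(g = [1, 1, 1, 2, 2, 3, 3], val1 = 1:7, val2 = 'a':'g')

# "Nest": one row per group; wrapping each group's values in Ref makes
# combine store the whole vector as a single cell instead of expanding it.
nested = combine(groupby(d, :g),
                 :val1 => (x -> Ref(copy(x))) => :val1,
                 :val2 => (x -> Ref(copy(x))) => :val2)

# "Unnest": expand the vector-valued cells back into one row per element.
unnested = flatten(nested, [:val1, :val2])
```

Here nested has one row per value of g, and flatten restores the long layout. This is not the same operation as the per-group sampling shown above; it only illustrates the round trip.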

Related

Pivot table over multiple columns in Julia

I'd like to do a pivot table on a DataFrame in julia. From the documentation, I know I can do that with by and unstack. E.g.
julia> using DataFrames, Random
julia> Random.seed!(42);
julia> df = DataFrame(
Step = rand(1:3, 15) |> sort,
Label1 = rand('A':'B', 15) .|> Symbol,
Label2 = rand('Q':'R', 15) .|> Symbol
)
15×3 DataFrame
│ Row │ Step │ Label1 │ Label2 │
│ │ Int64 │ Symbol │ Symbol │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ A │ Q │
│ 2 │ 1 │ A │ Q │
│ 3 │ 1 │ B │ R │
│ 4 │ 1 │ B │ R │
│ 5 │ 1 │ B │ Q │
│ 6 │ 2 │ B │ Q │
│ 7 │ 2 │ B │ Q │
│ 8 │ 2 │ B │ R │
│ 9 │ 2 │ B │ R │
│ 10 │ 3 │ B │ R │
│ 11 │ 3 │ B │ Q │
│ 12 │ 3 │ B │ R │
│ 13 │ 3 │ A │ R │
│ 14 │ 3 │ B │ R │
│ 15 │ 3 │ B │ Q │
julia> unstack(by(df, [:Step, :Label1, :Label2], nrow), :Label1, :nrow)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64? │ Int64? │
├─────┼───────┼────────┼─────────┼────────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ missing │ 2 │
│ 3 │ 2 │ Q │ missing │ 2 │
│ 4 │ 2 │ R │ missing │ 2 │
│ 5 │ 3 │ Q │ missing │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
Now, how do I pivot on two columns, here Label1 and Label2, so that I get the row counts for each combination of the elements of these two columns? The expected output would be something like
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64? │ Int64? │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 1 │ 2 │ missing │ 1 │ 2 │
│ 3 │ 2 │ missing │ missing │ 2 │ 2 │
│ 5 │ 3 │ missing │ 1 │ 2 │ 3 │
Thanks in advance!
Tim
First, by is deprecated (the manual will be updated in a few days to reflect that), so one should write:
julia> unstack(combine(groupby(df, [:Step, :Label1, :Label2]), nrow), :Label1, :nrow)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64? │ Int64? │
├─────┼───────┼────────┼─────────┼────────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ missing │ 2 │
│ 3 │ 2 │ Q │ missing │ 2 │
│ 4 │ 2 │ R │ missing │ 2 │
│ 5 │ 3 │ Q │ missing │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
However, if you want row counts I would rather do something like:
julia> gdf = groupby(df, [:Step, :Label2], sort=true);
julia> lev = unique(df.Label1)
2-element Array{Symbol,1}:
:A
:B
julia> combine(gdf, :Label1 .=> [x -> count(==(l), x) for l in lev] .=> lev)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64 │ Int64 │
├─────┼───────┼────────┼───────┼───────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ 0 │ 2 │
│ 3 │ 2 │ Q │ 0 │ 2 │
│ 4 │ 2 │ R │ 0 │ 2 │
│ 5 │ 3 │ Q │ 0 │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
so you get 0 rather than missing for combinations that do not occur.
This pattern generalizes to multiple groups:
julia> gdf = groupby(df, :Step, sort=true);
julia> l1 = unique(df.Label1)
2-element Array{Symbol,1}:
:A
:B
julia> l2 = unique(df.Label2)
2-element Array{Symbol,1}:
:Q
:R
julia> combine(gdf, [[:Label1, :Label2] =>
((x,y) -> count(((x,y),) -> x==v1 && y==v2, zip(x,y))) =>
Symbol(v1, v2) for v1 in l1 for v2 in l2])
3×5 DataFrame
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 0 │ 1 │ 2 │
│ 2 │ 2 │ 0 │ 0 │ 2 │ 2 │
│ 3 │ 3 │ 0 │ 1 │ 2 │ 3 │
Another way to do it, staying closer to your original code, would be:
julia> unstack(combine(groupby(select(df, :Step, [:Label1, :Label2] => ByRow(Symbol) => :Label), [:Step, :Label]), nrow), :Label, :nrow)
3×5 DataFrame
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64? │ Int64? │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┼────────┼────────┤
│ 1 │ 1 │ 2 │ missing │ 1 │ 2 │
│ 2 │ 2 │ missing │ missing │ 2 │ 2 │
│ 3 │ 3 │ missing │ 1 │ 2 │ 3 │
However, I agree this is not super easy. This issue is tracked in https://github.com/JuliaData/DataFrames.jl/issues/2148 and relatedly https://github.com/JuliaData/DataFrames.jl/issues/2205.
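
If the unstack route is preferred but zeros are wanted instead of missing, the gaps can be filled afterwards. This is a small sketch on a hand-made data frame (the data and the names long, wide and counts are made up for illustration), relying on broadcasting coalesce over a whole DataFrame:

```julia
using DataFrames

df = DataFrame(Step   = [1, 1, 1, 2, 2],
               Label1 = [:A, :A, :B, :B, :B],
               Label2 = [:Q, :Q, :Q, :Q, :R])

# Count rows per (Step, Label1, Label2) combination.
long = combine(groupby(df, [:Step, :Label1, :Label2]), nrow)

# Merge the two labels into a single key column, e.g. :A and :Q become :AQ.
long.Label = Symbol.(long.Label1, long.Label2)

# Pivot to wide form; combinations that never occur show up as missing.
wide = unstack(select(long, :Step, :Label, :nrow), :Label, :nrow)

# Replace every missing by 0.
counts = coalesce.(wide, 0)
```

Only combinations that appear at least once in the data get a column this way (there is no AR column here), which is one reason the count-based combine shown earlier can be preferable.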

How to change only one column name in julia

If I have a dataframe like:
test = DataFrame(A = [1,2,3] , B= [4,5,6])
and I want to change only the name of A, what can I do? I know that I can change the name of all columns together by rename! but I need to rename one by one. The reason is that I'm adding new columns by hcat in a loop and need to give them unique names each time.
Use the Pair syntax:
julia> test = DataFrame(A = [1,2,3] , B= [4,5,6])
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> rename!(test, :A => :newA)
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> test
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
With strings it is the same:
julia> test = DataFrame(A = [1,2,3] , B= [4,5,6])
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> rename!(test, "A" => "newA")
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> test
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
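
For the use case mentioned in the question (adding columns in a loop with unique names), it may be simpler to skip hcat and rename! altogether and assign each column under a computed name. A minimal sketch, where the prefix "new" and the column contents are placeholders:

```julia
using DataFrames

test = DataFrame(A = [1, 2, 3], B = [4, 5, 6])

for i in 1:3
    # Build the column name from the loop index and assign the column directly.
    test[!, Symbol("new", i)] = test.A .* i
end
```

After the loop, test has the columns A, B, new1, new2 and new3.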

Julia Dataframes Groupby chain using combine

I'm trying to do the following:
For each column, I want to group by the key together with that column, count the number of occurrences, and keep only the most frequent value per key (I don't want to keep the count itself, just the value it corresponds to).
I have many columns to group with the key, one after another, and I wanted to know if there is a way to chain this together.
Here is an example:
│ Row │ KEY │ A │
│ │ String │ String │
├──────────┼──────────────────────────────────┼───────────────┤
│ 1 │ 44473 │ ROCK │
│ 2 │ 4f4ef │ CLASSICAL │
│ 3 │ 0b8bd │ POP │
│ 4 │ 57c94 │ POP │
│ 5 │ a7070 │ RAP - HIP HOP │
│ 6 │ 1d9a3 │ JAZZ │
│ 7 │ 947fd │ POP │
Here I do:
per_key = DataFrames.groupby(test, [:KEY, :A])
combine(per_key, nrow)
which gives me:
│ Row │ KEY │ A │ nrow │
│ │ String │ String │ Int64 │
├──────┼──────────────────────────────────┼────────────────────┼───────┤
│ 1 │ 44473ff │ ROCK │ 2 │
│ 2 │ 4f4effc │ CLASSICAL │ 12 │
│ 3 │ 0b8bd64 │ POP │ 2 │
│ 4 │ 57c94f5 │ POP │ 2 │
│ 5 │ a7070e4 │ RAP - HIP HOP │ 1 │
│ 6 │ 1d9a3c7 │ JAZZ │ 1 │
For each KEY, how do I get the maximum "nrow" and keep the corresponding value in "A"?
I also have to do this for many other columns, one by one.
Thank you
I am not 100% sure what you wanted, but I assume this is what you are looking for:
julia> using DataFrames, StatsBase
julia> df = DataFrame(key=rand(1:10, 10^6),
A = rand(1:10, 10^6),
B = rand(1:10, 10^6),
C = rand(1:10, 10^6));
julia> gdf = groupby(df, :key);
julia> combine(gdf, valuecols(gdf) .=>
(x -> last(maximum(reverse, countmap(x)))) .=>
valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
(on master you can add renamecols=false kwarg to avoid last .=> valuecols(gdf)).
The key function here is countmap, which gives you the counts of occurrences of the different values in a vector, e.g.:
julia> countmap(gdf[1].A)
Dict{Int64,Int64} with 10 entries:
7 => 10028
4 => 10130
9 => 10007
10 => 9841
2 => 10090
3 => 9985
5 => 10022
8 => 10262
6 => 10103
1 => 10128
The rest is just a wrapper around it. You need reverse to change the key => value order to value => key, so that maximum compares counts first and picks the right group (note that your problem will not have a unique solution if two groups have the same count), and then we use last to extract the group (as you did not want to keep the count).
EDIT:
Now I realized that argmax works for dictionaries so you can just write:
julia> combine(gdf, valuecols(gdf) .=>
argmax∘countmap .=>
valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
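
To connect this back to the string data in the question, here is a small sketch on made-up KEY and A values (chosen without ties, so the result is unambiguous). The helper name most_common is hypothetical:

```julia
using DataFrames, StatsBase

df = DataFrame(KEY = ["k1", "k1", "k1", "k2", "k2", "k2"],
               A   = ["POP", "POP", "ROCK", "JAZZ", "RAP", "JAZZ"])

# countmap builds a value => count dictionary; argmax returns the key
# (here: the value of A) with the largest count.
most_common(x) = argmax(countmap(x))

res = combine(groupby(df, :KEY), :A => most_common => :A)
```

If two values tie within a key, which one is returned depends on dictionary iteration order, so it should not be relied upon.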

julia DataFrame select rows based values of one column belonging to a set

Using a DataFrame in Julia, I want to select rows on the basis of the value taken in a column.
With the following example
using DataFrames, DataFramesMeta
DT = DataFrame(ID = [1, 1, 2,2,3,3, 4,4], x1 = rand(8))
I want to extract the rows with ID taking the values 1 and 4.
For the moment, I came up with this solution:
@where(DT, findall(x -> (x==4 || x==1), DT.ID))
When using only two values, it is manageable.
However, I want to make it applicable to a case with many rows and a large set of ID values to be selected. This solution is therefore unrealistic, since I would need to write out every value to be selected.
Any fancier solution to make this selection generic?
Damien
Here is a way to do it using standard DataFrames.jl indexing and using @where from DataFramesMeta.jl:
julia> DT
8×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 2 │ 0.365919 │
│ 4 │ 2 │ 0.325169 │
│ 5 │ 3 │ 0.0495252 │
│ 6 │ 3 │ 0.637568 │
│ 7 │ 4 │ 0.391051 │
│ 8 │ 4 │ 0.436209 │
julia> DT[in([1,4]).(DT.ID), :]
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
julia> @where(DT, in([1,4]).(:ID))
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
In code that is not performance critical you can also use filter, which is, at least for me, a bit simpler to digest (its drawback is that it is slower than the methods discussed above):
julia> filter(row -> row.ID in [1,4], DT)
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
Note that in the approach you mention in your question you could omit DT in front of ID like this:
julia> @where(DT, findall(x -> (x==4 || x==1), :ID))
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
(this is the beauty of DataFramesMeta.jl: it knows the context of the DataFrame you want to refer to)
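
When the set of IDs to keep is large, the same in(...) pattern can be given a Set, so that each membership test does not scan a vector. A minimal sketch, assuming DataFrames.jl 0.21 or later for the filter(:ID => ..., DT) form; the variable name wanted and the x1 values are placeholders:

```julia
using DataFrames

DT = DataFrame(ID = [1, 1, 2, 2, 3, 3, 4, 4], x1 = 1.0:8.0)

# Collect the IDs to keep in a Set for fast lookup.
wanted = Set([1, 4])

# Broadcast the membership test over the ID column and index the rows.
sel = DT[in(wanted).(DT.ID), :]

# Equivalent selection with filter applied to a single column.
sel2 = filter(:ID => in(wanted), DT)
```

Both sel and sel2 contain the rows whose ID is 1 or 4, in their original order.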

How do I drop DataFrame rows with missing values?

E.g. if the original example were
│ Row │ a │ b │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │
│ 3 │ missing │ 3 │
│ 4 │ 4 │ 4 │
And I want:
│ Row │ a │ b │
├─────┼───┼─────────┤
│ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │
│ 3 │ 4 │ 4 │
Is there a nice function to do this?
Found the answer in the docs. In this case, it would be:
dropmissing(df, :a)
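
As a small sketch of the difference between restricting the check to one column and checking all columns, using the data from the question (the names only_a and all_cols are illustrative):

```julia
using DataFrames

df = DataFrame(a = [1, 2, missing, 4], b = [missing, 2, 3, 4])

# Drop only the rows where column a is missing; missings in b are kept.
only_a = dropmissing(df, :a)

# Drop the rows where any column is missing.
all_cols = dropmissing(df)
```

dropmissing returns a new data frame; dropmissing! is the in-place variant.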