Julia DataFrames: chaining groupby with combine

I'm trying to do the following:
For each column, I want to group by the key together with that column, count the number of occurrences, and keep only the most frequent value per key (I don't want to keep the count itself, just the value it corresponds to).
I have many columns to group by with the key, one after another, and I would like to know if there is a way to chain these operations together.
Here is an example:
│ Row │ KEY │ A │
│ │ String │ String │
├──────────┼──────────────────────────────────┼───────────────┤
│ 1 │ 44473 │ ROCK │
│ 2 │ 4f4ef │ CLASSICAL │
│ 3 │ 0b8bd │ POP │
│ 4 │ 57c94 │ POP │
│ 5 │ a7070 │ RAP - HIP HOP │
│ 6 │ 1d9a3 │ JAZZ │
│ 7 │ 947fd │ POP │
Here is what I do:
per_key = DataFrames.groupby(test, [:KEY, :A])
combine(per_key, nrow)
which gives me:
│ Row │ KEY │ A │ nrow │
│ │ String │ String │ Int64 │
├──────┼──────────────────────────────────┼────────────────────┼───────┤
│ 1 │ 44473ff │ ROCK │ 2 │
│ 2 │ 4f4effc │ CLASSICAL │ 12 │
│ 3 │ 0b8bd64 │ POP │ 2 │
│ 4 │ 57c94f5 │ POP │ 2 │
│ 5 │ a7070e4 │ RAP - HIP HOP │ 1 │
│ 6 │ 1d9a3c7 │ JAZZ │ 1 │
How do I get, for each KEY, the maximum "nrow" and keep the corresponding value in "A"?
I also have to do this for many other columns, one by one.
Thank you

I am not 100% sure what you wanted, but I assume this is what you are looking for:
julia> using DataFrames, StatsBase
julia> df = DataFrame(key=rand(1:10, 10^6),
A = rand(1:10, 10^6),
B = rand(1:10, 10^6),
C = rand(1:10, 10^6));
julia> gdf = groupby(df, :key);
julia> combine(gdf, valuecols(gdf) .=>
(x -> last(maximum(reverse, countmap(x)))) .=>
valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
(On master you can pass the renamecols=false keyword argument instead of the trailing .=> valuecols(gdf).)
The key function here is countmap, which gives you the number of occurrences of each distinct value in a vector, e.g.:
julia> countmap(gdf[1].A)
Dict{Int64,Int64} with 10 entries:
7 => 10028
4 => 10130
9 => 10007
10 => 9841
2 => 10090
3 => 9985
5 => 10022
8 => 10262
6 => 10103
1 => 10128
The rest is just a wrapper around it. You need reverse to change the key => value order to value => key, so that maximum compares counts first and picks the right group (note that your problem does not have a unique solution if two groups have the same count). Then last extracts the group value, since you did not want to keep the count.
EDIT:
I have now realized that argmax works for dictionaries, so you can just write:
julia> combine(gdf, valuecols(gdf) .=>
argmax∘countmap .=>
valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
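To see what the two per-column functions above do to a single group, here is a minimal sketch on a small vector. It uses only Base, so count_values is a hypothetical stand-in for StatsBase.countmap, and the genres vector is made-up illustration data. One difference worth keeping in mind: with tied counts, the reverse/maximum version falls back to comparing the values themselves, while argmax depends on the dictionary's iteration order.

```julia
# Hypothetical stand-in for StatsBase.countmap, written with Base only.
function count_values(xs)
    counts = Dict{eltype(xs),Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

# Made-up values of one column within one group.
genres = ["POP", "ROCK", "POP", "JAZZ", "POP", "ROCK"]
counts = count_values(genres)

# First approach: flip value => count into count => value,
# take the largest pair, and keep only the value.
mode_via_reverse = last(maximum(reverse, counts))

# Second approach: argmax on a dictionary returns the key with the largest value.
mode_via_argmax = argmax(counts)
```

Both variables hold the most frequent value of the vector, which is what combine stores for each group and column.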

Related

Pivot table over multiple columns in Julia

I'd like to build a pivot table from a DataFrame in Julia. From the documentation, I know I can do that with by and unstack. E.g.
julia> using DataFrames, Random
julia> Random.seed!(42);
julia> df = DataFrame(
Step = rand(1:3, 15) |> sort,
Label1 = rand('A':'B', 15) .|> Symbol,
Label2 = rand('Q':'R', 15) .|> Symbol
)
15×3 DataFrame
│ Row │ Step │ Label1 │ Label2 │
│ │ Int64 │ Symbol │ Symbol │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ A │ Q │
│ 2 │ 1 │ A │ Q │
│ 3 │ 1 │ B │ R │
│ 4 │ 1 │ B │ R │
│ 5 │ 1 │ B │ Q │
│ 6 │ 2 │ B │ Q │
│ 7 │ 2 │ B │ Q │
│ 8 │ 2 │ B │ R │
│ 9 │ 2 │ B │ R │
│ 10 │ 3 │ B │ R │
│ 11 │ 3 │ B │ Q │
│ 12 │ 3 │ B │ R │
│ 13 │ 3 │ A │ R │
│ 14 │ 3 │ B │ R │
│ 15 │ 3 │ B │ Q │
julia> unstack(by(df, [:Step, :Label1, :Label2], nrow), :Label1, :nrow)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64? │ Int64? │
├─────┼───────┼────────┼─────────┼────────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ missing │ 2 │
│ 3 │ 2 │ Q │ missing │ 2 │
│ 4 │ 2 │ R │ missing │ 2 │
│ 5 │ 3 │ Q │ missing │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
Now, how do I pivot on two columns, here Label1 and Label2, so that I get the row counts for each combination of the elements of these two columns? The expected output would be something like
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64? │ Int64? │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 1 │ 2 │ missing │ 1 │ 2 │
│ 3 │ 2 │ missing │ missing │ 2 │ 2 │
│ 5 │ 3 │ missing │ 1 │ 2 │ 3 │
Thanks in advance!
Tim
First, by is deprecated (the manual will be updated in a few days to reflect that), so one should write:
julia> unstack(combine(groupby(df, [:Step, :Label1, :Label2]), nrow), :Label1, :nrow)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64? │ Int64? │
├─────┼───────┼────────┼─────────┼────────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ missing │ 2 │
│ 3 │ 2 │ Q │ missing │ 2 │
│ 4 │ 2 │ R │ missing │ 2 │
│ 5 │ 3 │ Q │ missing │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
However, if you want row counts I would rather do something like:
julia> gdf = groupby(df, [:Step, :Label2], sort=true);
julia> lev = unique(df.Label1)
2-element Array{Symbol,1}:
:A
:B
julia> combine(gdf, :Label1 .=> [x -> count(==(l), x) for l in lev] .=> lev)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64 │ Int64 │
├─────┼───────┼────────┼───────┼───────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ 0 │ 2 │
│ 3 │ 2 │ Q │ 0 │ 2 │
│ 4 │ 2 │ R │ 0 │ 2 │
│ 5 │ 3 │ Q │ 0 │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
so you get 0 rather than missing for combinations that do not occur in the data.
This pattern generalizes to multiple groups:
julia> gdf = groupby(df, :Step, sort=true);
julia> l1 = unique(df.Label1)
2-element Array{Symbol,1}:
:A
:B
julia> l2 = unique(df.Label2)
2-element Array{Symbol,1}:
:Q
:R
julia> combine(gdf, [[:Label1, :Label2] =>
((x,y) -> count(((x,y),) -> x==v1 && y==v2, zip(x,y))) =>
Symbol(v1, v2) for v1 in l1 for v2 in l2])
3×5 DataFrame
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 0 │ 1 │ 2 │
│ 2 │ 2 │ 0 │ 0 │ 2 │ 2 │
│ 3 │ 3 │ 0 │ 1 │ 2 │ 3 │
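The anonymous function passed to combine above is applied to each group's Label1 and Label2 vectors. As a rough sketch of what it computes, here is the same counting logic in plain Julia for the Step == 1 group, with the label vectors copied by hand from the example table; count_pair is a hypothetical helper name, not part of DataFrames.jl.

```julia
# Label columns of the Step == 1 group, copied from the example table.
label1 = [:A, :A, :B, :B, :B]
label2 = [:Q, :Q, :R, :R, :Q]

# Hypothetical helper: number of rows where both labels match the requested pair.
count_pair(x, y, v1, v2) = count(((a, b),) -> a == v1 && b == v2, zip(x, y))

# One count per combination, keyed by the concatenated name (:AQ, :AR, :BQ, :BR).
step1_counts = Dict(Symbol(v1, v2) => count_pair(label1, label2, v1, v2)
                    for v1 in (:A, :B) for v2 in (:Q, :R))
```

The resulting counts correspond to the first row of the 3×5 table: a combination that never occurs gets 0, because count simply finds no matching rows.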
Another way to do it, building on your original code, would be:
julia> unstack(combine(groupby(select(df, :Step, [:Label1, :Label2] => ByRow(Symbol) => :Label), [:Step, :Label]), nrow), :Label, :nrow)
3×5 DataFrame
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64? │ Int64? │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┼────────┼────────┤
│ 1 │ 1 │ 2 │ missing │ 1 │ 2 │
│ 2 │ 2 │ missing │ missing │ 2 │ 2 │
│ 3 │ 3 │ missing │ 1 │ 2 │ 3 │
However, I agree this is not super easy. This issue is tracked in https://github.com/JuliaData/DataFrames.jl/issues/2148 and relatedly https://github.com/JuliaData/DataFrames.jl/issues/2205.

How to change only one column name in julia

If I have a dataframe like:
test = DataFrame(A = [1,2,3] , B= [4,5,6])
and I want to change only the name of A, what can I do? I know that I can change the names of all columns at once with rename!, but I need to rename them one by one. The reason is that I'm adding new columns with hcat in a loop and need to give them unique names each time.
Use the Pair syntax:
julia> test = DataFrame(A = [1,2,3] , B= [4,5,6])
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> rename!(test, :A => :newA)
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> test
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
With strings it is the same:
julia> test = DataFrame(A = [1,2,3] , B= [4,5,6])
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> rename!(test, "A" => "newA")
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> test
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
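Since the question mentions adding columns in a loop, here is a small sketch combining the single-column rename with one possible way to give each new column a unique name at creation time. The "extra" prefix and the column contents are assumptions made purely for illustration; assigning columns directly is a swapped-in alternative to the hcat approach from the question.

```julia
using DataFrames

test = DataFrame(A = [1, 2, 3], B = [4, 5, 6])

# Rename a single column in place; all other columns keep their names.
rename!(test, :A => :newA)

# Hypothetical loop: add three columns, each with a generated unique name,
# by assigning them directly instead of calling hcat and renaming afterwards.
for i in 1:3
    test[!, Symbol("extra", i)] = test.newA .* i
end

column_names = names(test)
```

Building the name with Symbol("extra", i) avoids name clashes up front, so no rename is needed after each step.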

Columns misaligned in attempt to delimit tabular data in Julia

Here is what my tabular data file looks like
P(days) growth
0 0.67150E+01 -0.11654E-02
1 0.47166E+01 -0.15529E-02
2 0.35861E+01 -0.12327E+00
3 0.28754E+01 -0.30987E+00
4 0.23721E+01 -0.48377E+00
5 0.20062E+01 -0.63666E+00
6 0.17097E+01 -0.17122E+01
7 0.16867E+01 -0.86038E+00
8 0.14523E+01 -0.55203E+00
9 0.12864E+01 -0.37704E+00
I am attempting to read this into a data frame. I tried this:
LINAData = DataFrame(CSV.File(LINAFile, skipto = 2, header = 1, delim=' ', ignorerepeated=true))
But as you can see:
│ Row │ P(days) │ growth │
│ │ Int64? │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ 0 │ 6.715 │
│ 2 │ missing │ 1.0 │
│ 3 │ missing │ 2.0 │
│ 4 │ missing │ 3.0 │
│ 5 │ missing │ 4.0 │
│ 6 │ missing │ 5.0 │
│ 7 │ missing │ 6.0 │
│ 8 │ missing │ 7.0 │
│ 9 │ missing │ 8.0 │
│ 10 │ missing │ 9.0 │
Is there an issue with how I am delimiting?
Your header is missing a column name for the first column, so you have to supply the names manually:
julia> LINAData = CSV.read(LINAFile, DataFrame, skipto = 2, header = ["","P(days)", "growth"], delim=' ', ignorerepeated=true)
10×3 DataFrame
│ Row │ │ P(days) │ growth │
│ │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼────────────┤
│ 1 │ 0 │ 6.715 │ -0.0011654 │
│ 2 │ 1 │ 4.7166 │ -0.0015529 │
│ 3 │ 2 │ 3.5861 │ -0.12327 │
│ 4 │ 3 │ 2.8754 │ -0.30987 │
│ 5 │ 4 │ 2.3721 │ -0.48377 │
│ 6 │ 5 │ 2.0062 │ -0.63666 │
│ 7 │ 6 │ 1.7097 │ -1.7122 │
│ 8 │ 7 │ 1.6867 │ -0.86038 │
│ 9 │ 8 │ 1.4523 │ -0.55203 │
│ 10 │ 9 │ 1.2864 │ -0.37704 │
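A quick way to spot this kind of mismatch before parsing is to compare the number of whitespace-separated fields in the header line and in the first data line. The sketch below uses only Base and the two lines from the file shown in the question; the placeholder name "index" is an assumption (the answer above uses an empty string instead).

```julia
header_line = "P(days) growth"
first_data_line = "0 0.67150E+01 -0.11654E-02"

# split with no delimiter splits on runs of whitespace.
header_fields = split(header_line)
data_fields = split(first_data_line)

# Number of leading columns that have data but no name.
missing_names = length(data_fields) - length(header_fields)

# Hypothetical fix-up: pad the header with a placeholder for each unnamed column.
full_header = vcat(fill("index", missing_names), String.(header_fields))
```

A vector like full_header could then be passed as the header keyword argument, in the same way as the hand-written vector in the answer.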

Julia DataFrame: select rows based on values of one column belonging to a set

Using a DataFrame in Julia, I want to select rows on the basis of the value taken in a column.
With the following example
using DataFrames, DataFramesMeta
DT = DataFrame(ID = [1, 1, 2,2,3,3, 4,4], x1 = rand(8))
I want to extract the rows with ID taking the values 1 and 4.
For the moment, I have come up with this solution:
@where(DT, findall(x -> (x==4 || x==1), DT.ID))
With only two values, this is manageable.
However, I want to apply it to a case with many rows and a large set of ID values to select. This solution is impractical if I have to write out every value to be selected.
Is there a neater way to make this selection generic?
Damien
Here is a way to do it using standard DataFrames.jl indexing and using @where from DataFramesMeta.jl:
julia> DT
8×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 2 │ 0.365919 │
│ 4 │ 2 │ 0.325169 │
│ 5 │ 3 │ 0.0495252 │
│ 6 │ 3 │ 0.637568 │
│ 7 │ 4 │ 0.391051 │
│ 8 │ 4 │ 0.436209 │
julia> DT[in([1,4]).(DT.ID), :]
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
julia> @where(DT, in([1,4]).(:ID))
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
In code that is not performance critical you can also use filter, which, at least for me, is a bit simpler to digest (its drawback is that it is slower than the methods discussed above):
julia> filter(row -> row.ID in [1,4], DT)
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
Note that in the approach you mention in your question you could omit DT in front of ID like this:
julia> @where(DT, findall(x -> (x==4 || x==1), :ID))
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
(This is the beauty of DataFramesMeta.jl: it knows the context of the DataFrame you want to refer to.)
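For a large set of IDs, the same in(...) pattern works with a Set, which tends to make each membership test cheaper than scanning a vector. A minimal Base-only sketch of the row mask follows, using the ID column from the question; the variable names are assumptions for illustration.

```julia
# ID column from the question's example.
ids = [1, 1, 2, 2, 3, 3, 4, 4]

# Hypothetical collection of IDs to keep; a Set keeps lookups fast for many values.
wanted = Set([1, 4])

# in(wanted) is a one-argument membership test; the dot broadcasts it over ids.
mask = in(wanted).(ids)

# Row numbers that would be selected.
selected_rows = findall(mask)
```

A mask built this way can be used as the row index, as in DT[mask, :], which corresponds to the first indexing approach in the answer.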

How do I drop DataFrame rows with missing values?

E.g. if the original example were
│ Row │ a │ b │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │
│ 3 │ missing │ 3 │
│ 4 │ 4 │ 4 │
And I want:
│ Row │ a │ b │
├─────┼───┼─────────┤
│ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │
│ 3 │ 4 │ 4 │
Is there a nice function to do this?
Found the answer in the docs. In this case, it would be:
dropmissing(df, :a)
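As a short sketch of the difference between restricting dropmissing to one column and applying it to all columns, the example below rebuilds the table from the question; the variable names only_a and all_cols are assumptions for illustration.

```julia
using DataFrames

# The table from the question.
df = DataFrame(a = [1, 2, missing, 4], b = [missing, 2, 3, 4])

# Drop only the rows where column a is missing; missing values in b are kept.
only_a = dropmissing(df, :a)

# Without a column argument, rows with a missing value in any column are dropped.
all_cols = dropmissing(df)
```

Both calls return a new data frame and leave df unchanged; dropmissing! is the in-place variant.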