Julia Dataframes Groupby chain using combine

I'm trying to do the following:
For each column, I want to group by the key together with that column, count the number of occurrences, and keep only the most frequent value per key (I don't want to keep the count itself, just the value it corresponds to).
I have many columns to group with the key one after another, and I wanted to know if there is a way to chain this together.
Here is an example:
│ Row │ KEY │ A │
│ │ String │ String │
├──────────┼──────────────────────────────────┼───────────────┤
│ 1 │ 44473 │ ROCK │
│ 2 │ 4f4ef │ CLASSICAL │
│ 3 │ 0b8bd │ POP │
│ 4 │ 57c94 │ POP │
│ 5 │ a7070 │ RAP - HIP HOP │
│ 6 │ 1d9a3 │ JAZZ │
│ 7 │ 947fd │ POP │
Here I do:
per_key = DataFrames.groupby(test, [:KEY, :A])
combine(per_key, nrow)
which gives me:
│ Row │ KEY │ A │ nrow │
│ │ String │ String │ Int64 │
├──────┼──────────────────────────────────┼────────────────────┼───────┤
│ 1 │ 44473ff │ ROCK │ 2 │
│ 2 │ 4f4effc │ CLASSICAL │ 12 │
│ 3 │ 0b8bd64 │ POP │ 2 │
│ 4 │ 57c94f5 │ POP │ 2 │
│ 5 │ a7070e4 │ RAP - HIP HOP │ 1 │
│ 6 │ 1d9a3c7 │ JAZZ │ 1 │
For each KEY, how do I get the maximum "nrow" and keep the corresponding value in "A"?
I also have to do this for many other columns, one by one.
Thank you

I am not 100% sure what you wanted, but I assume this is what you are looking for:
julia> using DataFrames, StatsBase
julia> df = DataFrame(key=rand(1:10, 10^6),
A = rand(1:10, 10^6),
B = rand(1:10, 10^6),
C = rand(1:10, 10^6));
julia> gdf = groupby(df, :key);
julia> combine(gdf, valuecols(gdf) .=>
(x -> last(maximum(reverse, countmap(x)))) .=>
valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
(on master you can add the renamecols=false kwarg to avoid the trailing .=> valuecols(gdf)).
The key function here is countmap, which gives you the counts of occurrences of the different values in a vector, e.g.:
julia> countmap(gdf[1].A)
Dict{Int64,Int64} with 10 entries:
7 => 10028
4 => 10130
9 => 10007
10 => 9841
2 => 10090
3 => 9985
5 => 10022
8 => 10262
6 => 10103
1 => 10128
The rest is just a wrapper around it. You need reverse to turn each key => value pair into value => key, so that maximum compares the counts first and picks the right group (note that your problem does not have a unique solution if two groups have the same count), and then last extracts the group (as you did not want to keep the count).
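To see the individual steps on a small vector, here is a minimal sketch. The vector x and its values are made up for illustration, and it assumes StatsBase is installed:

```julia
using StatsBase  # provides countmap

# Hypothetical data: one column's values within a single group.
x = ["POP", "ROCK", "POP", "JAZZ", "POP", "ROCK"]

# Step 1: value => count for every distinct value.
counts = countmap(x)

# Step 2: reverse each pair to count => value and take the largest pair.
# Pairs compare on their first element first, so the count decides.
best = maximum(reverse, counts)

# Step 3: keep only the value and drop the count.
most_common = last(best)
```

With tied counts the comparison of the reversed pairs falls back to the value itself, so the largest value among the tied ones is returned.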
EDIT:
Now I realized that argmax works for dictionaries so you can just write:
julia> combine(gdf, valuecols(gdf) .=>
argmax∘countmap .=>
valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
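Applied to string data shaped like the table in the question, the same call might look as follows. This is only a sketch: the KEY values and the extra column B are invented, and when counts are tied the result depends on the order in which the dictionary happens to iterate.

```julia
using DataFrames, StatsBase

# Hypothetical data in the shape of the question's table.
test = DataFrame(KEY = ["k1", "k1", "k1", "k2", "k2"],
                 A   = ["POP", "POP", "ROCK", "JAZZ", "JAZZ"],
                 B   = ["x", "y", "y", "z", "z"])

gdf = groupby(test, :KEY)

# For every non-key column, keep the most frequent value per KEY.
res = combine(gdf, valuecols(gdf) .=> (argmax ∘ countmap) .=> valuecols(gdf))
```

Here res has one row per KEY, and each of the columns A and B holds the most frequent value for that key.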

Related

Pivot table over multiple columns in Julia

I'd like to do a pivot table on a DataFrame in julia. From the documentation, I know I can do that with by and unstack. E.g.
julia> using DataFrames, Random
julia> Random.seed!(42);
julia> df = DataFrame(
Step = rand(1:3, 15) |> sort,
Label1 = rand('A':'B', 15) .|> Symbol,
Label2 = rand('Q':'R', 15) .|> Symbol
)
15×3 DataFrame
│ Row │ Step │ Label1 │ Label2 │
│ │ Int64 │ Symbol │ Symbol │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ A │ Q │
│ 2 │ 1 │ A │ Q │
│ 3 │ 1 │ B │ R │
│ 4 │ 1 │ B │ R │
│ 5 │ 1 │ B │ Q │
│ 6 │ 2 │ B │ Q │
│ 7 │ 2 │ B │ Q │
│ 8 │ 2 │ B │ R │
│ 9 │ 2 │ B │ R │
│ 10 │ 3 │ B │ R │
│ 11 │ 3 │ B │ Q │
│ 12 │ 3 │ B │ R │
│ 13 │ 3 │ A │ R │
│ 14 │ 3 │ B │ R │
│ 15 │ 3 │ B │ Q │
julia> unstack(by(df, [:Step, :Label1, :Label2], nrow), :Label1, :nrow)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64? │ Int64? │
├─────┼───────┼────────┼─────────┼────────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ missing │ 2 │
│ 3 │ 2 │ Q │ missing │ 2 │
│ 4 │ 2 │ R │ missing │ 2 │
│ 5 │ 3 │ Q │ missing │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
Now, how do I pivot on two columns, here Label1 and Label2, so that I get the row counts for each combination of the elements of these two columns? The expected output would be something like
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64? │ Int64? │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 1 │ 2 │ missing │ 1 │ 2 │
│ 3 │ 2 │ missing │ missing │ 2 │ 2 │
│ 5 │ 3 │ missing │ 1 │ 2 │ 3 │
Thanks in advance!
Tim

First - by is deprecated (the manual will be updated in a few days to reflect that) so one should write:
julia> unstack(combine(groupby(df, [:Step, :Label1, :Label2]), nrow), :Label1, :nrow)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64? │ Int64? │
├─────┼───────┼────────┼─────────┼────────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ missing │ 2 │
│ 3 │ 2 │ Q │ missing │ 2 │
│ 4 │ 2 │ R │ missing │ 2 │
│ 5 │ 3 │ Q │ missing │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
However, if you want row counts I would rather do something like:
julia> gdf = groupby(df, [:Step, :Label2], sort=true);
julia> lev = unique(df.Label1)
2-element Array{Symbol,1}:
:A
:B
julia> combine(gdf, :Label1 .=> [x -> count(==(l), x) for l in lev] .=> lev)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64 │ Int64 │
├─────┼───────┼────────┼───────┼───────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ 0 │ 2 │
│ 3 │ 2 │ Q │ 0 │ 2 │
│ 4 │ 2 │ R │ 0 │ 2 │
│ 5 │ 3 │ Q │ 0 │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
so you get 0 instead of missing wherever a combination does not occur.
This pattern generalizes to multiple groups:
julia> gdf = groupby(df, :Step, sort=true);
julia> l1 = unique(df.Label1)
2-element Array{Symbol,1}:
:A
:B
julia> l2 = unique(df.Label2)
2-element Array{Symbol,1}:
:Q
:R
julia> combine(gdf, [[:Label1, :Label2] =>
((x,y) -> count(((x,y),) -> x==v1 && y==v2, zip(x,y))) =>
Symbol(v1, v2) for v1 in l1 for v2 in l2])
3×5 DataFrame
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 0 │ 1 │ 2 │
│ 2 │ 2 │ 0 │ 0 │ 2 │ 2 │
│ 3 │ 3 │ 0 │ 1 │ 2 │ 3 │
Another way to do it, using your original code, would be:
julia> unstack(combine(groupby(select(df, :Step, [:Label1, :Label2] => ByRow(Symbol) => :Label), [:Step, :Label]), nrow), :Label, :nrow)
3×5 DataFrame
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64? │ Int64? │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┼────────┼────────┤
│ 1 │ 1 │ 2 │ missing │ 1 │ 2 │
│ 2 │ 2 │ missing │ missing │ 2 │ 2 │
│ 3 │ 3 │ missing │ 1 │ 2 │ 3 │
However, I agree this is not super easy. This issue is tracked in https://github.com/JuliaData/DataFrames.jl/issues/2148 and relatedly https://github.com/JuliaData/DataFrames.jl/issues/2205.
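If you prefer the unstack route but want zeros instead of missing, one possible sketch is below. It assumes a DataFrames.jl release in which unstack accepts the fill keyword argument (to my knowledge it was added after this answer was written), and the small data frame is made up for illustration:

```julia
using DataFrames

# Hypothetical data with the same column layout as in the question.
df = DataFrame(Step   = [1, 1, 1, 2, 2],
               Label1 = ["A", "A", "B", "B", "B"],
               Label2 = ["Q", "Q", "R", "Q", "R"])

# Count rows per (Step, Label1, Label2) combination.
long = combine(groupby(df, [:Step, :Label1, :Label2]), nrow)

# Build the combined label that will become the new column names.
long.Label = long.Label1 .* long.Label2

# Pivot; combinations that never occur get 0 (assumes `fill` is supported).
wide = unstack(long, :Step, :Label, :nrow; fill = 0)
```

The result has one row per Step and one count column per label combination that appears in the data.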

How to change only one column name in julia

If I have a dataframe like:
test = DataFrame(A = [1,2,3] , B= [4,5,6])
and I want to change only the name of A, what can I do? I know that I can change the name of all columns together by rename! but I need to rename one by one. The reason is that I'm adding new columns by hcat in a loop and need to give them unique names each time.

Use the Pair syntax:
julia> test = DataFrame(A = [1,2,3] , B= [4,5,6])
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> rename!(test, :A => :newA)
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> test
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
With strings it is the same:
julia> test = DataFrame(A = [1,2,3] , B= [4,5,6])
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> rename!(test, "A" => "newA")
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
julia> test
3×2 DataFrame
│ Row │ newA │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │
│ 2 │ 2 │ 5 │
│ 3 │ 3 │ 6 │
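For the loop mentioned in the question, a possible alternative to hcat followed by a rename is to give each new column its name at the moment it is added. This is only a sketch; the names col1, col2, col3 and the computed values are invented for illustration:

```julia
using DataFrames

test = DataFrame(A = [1, 2, 3], B = [4, 5, 6])

# Rename a single column with the Pair syntax.
rename!(test, :A => :newA)

# Add columns in a loop, building a unique name on each iteration.
for i in 1:3
    test[!, "col$i"] = test.B .* i
end

names(test)
```

After the loop, names(test) lists the renamed column, B, and the three added columns.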

Columns misaligned in attempt to delimit tabular data in Julia

Here is what my tabular data file looks like
P(days) growth
0 0.67150E+01 -0.11654E-02
1 0.47166E+01 -0.15529E-02
2 0.35861E+01 -0.12327E+00
3 0.28754E+01 -0.30987E+00
4 0.23721E+01 -0.48377E+00
5 0.20062E+01 -0.63666E+00
6 0.17097E+01 -0.17122E+01
7 0.16867E+01 -0.86038E+00
8 0.14523E+01 -0.55203E+00
9 0.12864E+01 -0.37704E+00
I am attempting to read this into a data frame. I tried this:
LINAData = DataFrame(CSV.File(LINAFile, skipto = 2, header = 1, delim=' ', ignorerepeated=true))
But as you can see:
│ Row │ P(days) │ growth │
│ │ Int64? │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ 0 │ 6.715 │
│ 2 │ missing │ 1.0 │
│ 3 │ missing │ 2.0 │
│ 4 │ missing │ 3.0 │
│ 5 │ missing │ 4.0 │
│ 6 │ missing │ 5.0 │
│ 7 │ missing │ 6.0 │
│ 8 │ missing │ 7.0 │
│ 9 │ missing │ 8.0 │
│ 10 │ missing │ 9.0 │
Is there an issue with how I am delimiting?

Your header is missing a column name for the first column, so you have to supply the names manually:
julia> LINAData = CSV.read(LINAFile, DataFrame, skipto = 2, header = ["","P(days)", "growth"], delim=' ', ignorerepeated=true)
10×3 DataFrame
│ Row │ │ P(days) │ growth │
│ │ Int64 │ Float64 │ Float64 │
├─────┼───────┼─────────┼────────────┤
│ 1 │ 0 │ 6.715 │ -0.0011654 │
│ 2 │ 1 │ 4.7166 │ -0.0015529 │
│ 3 │ 2 │ 3.5861 │ -0.12327 │
│ 4 │ 3 │ 2.8754 │ -0.30987 │
│ 5 │ 4 │ 2.3721 │ -0.48377 │
│ 6 │ 5 │ 2.0062 │ -0.63666 │
│ 7 │ 6 │ 1.7097 │ -1.7122 │
│ 8 │ 7 │ 1.6867 │ -0.86038 │
│ 9 │ 8 │ 1.4523 │ -0.55203 │
│ 10 │ 9 │ 1.2864 │ -0.37704 │
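A self-contained way to try this out is to write a few lines of the file to a temporary path first. This is a sketch under assumptions: the name "index" for the first column is my own choice (the call above uses an empty string), and only the first three data rows are reproduced:

```julia
using CSV, DataFrames

# Write a shortened copy of the file; the header names only two of three columns.
path = tempname()
write(path, """
P(days) growth
0 0.67150E+01 -0.11654E-02
1 0.47166E+01 -0.15529E-02
2 0.35861E+01 -0.12327E+00
""")

# Supply all three names explicitly and start reading data at line 2.
cols = ["index", "P(days)", "growth"]  # "index" is a hypothetical name
LINAData = CSV.read(path, DataFrame;
                    skipto = 2, header = cols,
                    delim = ' ', ignorerepeated = true)
```

Passing the names yourself means the incomplete header line in the file is skipped rather than used to decide how many columns there are.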

Julia DataFrame: select rows based on values of one column belonging to a set

Using a DataFrame in Julia, I want to select rows on the basis of the value taken in a column.
With the following example
using DataFrames, DataFramesMeta
DT = DataFrame(ID = [1, 1, 2,2,3,3, 4,4], x1 = rand(8))
I want to extract the rows with ID taking the values 1 and 4.
For the moment, I came up with this solution:
@where(DT, findall(x -> (x==4 || x==1), DT.ID))
When using only two values, it is manageable.
However, I want to apply it to a case with many rows and a large set of ID values to select. This solution is therefore unrealistic if I need to write out every value to be selected.
Is there a fancier solution that makes this selection generic?
Damien

Here is a way to do it using standard DataFrames.jl indexing and using @where from DataFramesMeta.jl:
julia> DT
8×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 2 │ 0.365919 │
│ 4 │ 2 │ 0.325169 │
│ 5 │ 3 │ 0.0495252 │
│ 6 │ 3 │ 0.637568 │
│ 7 │ 4 │ 0.391051 │
│ 8 │ 4 │ 0.436209 │
julia> DT[in([1,4]).(DT.ID), :]
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
julia> @where(DT, in([1,4]).(:ID))
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
In non-performance-critical code you can also use filter, which is, at least for me, a bit simpler to digest (its drawback is that it is slower than the methods discussed above):
julia> filter(row -> row.ID in [1,4], DT)
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
Note that in the approach you mention in your question you could omit DT in front of ID like this:
julia> @where(DT, findall(x -> (x==4 || x==1), :ID))
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
(the beauty of DataFramesMeta.jl is that it knows the context of the DataFrame you want to refer to)
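When the list of IDs to keep is large, it may help to store it in a Set so that each membership test stays cheap. A minimal sketch using only DataFrames.jl follows; the variable name wanted and the x1 values are placeholders:

```julia
using DataFrames

DT = DataFrame(ID = [1, 1, 2, 2, 3, 3, 4, 4], x1 = collect(1.0:8.0))

# The IDs to keep; in practice this could hold thousands of values.
wanted = Set([1, 4])

# Row indexing with a Boolean mask built by broadcasting `in(wanted)`.
by_index = DT[in(wanted).(DT.ID), :]

# The same selection with filter, passing only the ID column to the predicate.
by_filter = filter(:ID => in(wanted), DT)
```

Both results contain the four rows whose ID is 1 or 4, in their original order.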

How do I drop DataFrame rows with missing values?

E.g. if the original example were
│ Row │ a │ b │
├─────┼─────────┼─────────┤
│ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │
│ 3 │ missing │ 3 │
│ 4 │ 4 │ 4 │
And I want:
│ Row │ a │ b │
├─────┼───┼─────────┤
│ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │
│ 3 │ 4 │ 4 │
Is there a nice function to do this?

Found the answer in the docs. In this case, it would be:
dropmissing(df, :a)
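As a quick check on the example from the question, here is a small sketch that rebuilds the data frame by hand and compares restricting the check to column a with checking every column:

```julia
using DataFrames

df = DataFrame(a = [1, 2, missing, 4], b = [missing, 2, 3, 4])

# Drop only the rows where column a is missing; missing values in b are kept.
only_a = dropmissing(df, :a)

# Without a column argument, rows with a missing value in any column are dropped.
all_cols = dropmissing(df)
```

Here only_a keeps three rows (including the one where b is missing), while all_cols keeps only the two complete rows.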
