Concatenate Julia DataFrames adding a categorical column

Say I have the following
a = DataFrame(x = [1,2,3,4], y = [10,20,30,40])
b = DataFrame(x = [1,2,3,4], y = [50,60,70,80])
Is there a way of getting [a;b] with an additional categorical column, to obtain something like the following?
8×3 DataFrames.DataFrame
│ Row │ x │ y │ c │
├─────┼───┼────┼───┤
│ 1 │ 1 │ 10 │ 1 │
│ 2 │ 2 │ 20 │ 1 │
│ 3 │ 3 │ 30 │ 1 │
│ 4 │ 4 │ 40 │ 1 │
│ 5 │ 1 │ 50 │ 2 │
│ 6 │ 2 │ 60 │ 2 │
│ 7 │ 3 │ 70 │ 2 │
│ 8 │ 4 │ 80 │ 2 │
For two dataframes, something like
using DataFramesMeta
[@transform(a, c = 1); @transform(b, c = 2)]
works, but what if I have more than a few DataFrames?

You could use enumerate to build the result from an array of DataFrames, splatting the tagged frames into vcat (calling vcat on a single array would just return that array). For example:
l = [a,b]
vcat([transform(x,c=i) for (i,x) in enumerate(l)]...)
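In recent DataFrames.jl releases (1.0 and later, as far as I know), vcat also accepts a source keyword that adds the identifying column directly, so no per-frame transform is needed. A minimal sketch, reusing a and b from the question; the column name :c and the string labels are only illustrative choices:

```julia
using DataFrames

a = DataFrame(x = [1, 2, 3, 4], y = [10, 20, 30, 40])
b = DataFrame(x = [1, 2, 3, 4], y = [50, 60, 70, 80])
dfs = [a, b]  # works the same way for any number of data frames

# `source = :c` adds a column :c holding the position of the source frame
result = reduce(vcat, dfs; source = :c)

# Custom identifiers can be passed as a pair: column name => labels
labelled = vcat(a, b; source = :c => ["first", "second"])
```

If a true categorical column is wanted, the resulting :c column can be converted afterwards, for example with categorical from CategoricalArrays.jl.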

Related

How to nest / unnest data frames in Julia?

Does Julia have any analogues of the nest and unnest functions from the tidyr R package? Particularly, is there a way to make efficient nesting / unnesting operations using DataFrames.jl?

Suppose you have the following DataFrame:
julia> d = DataFrame(g=[1,1,1,2,2,3,3,], val1=1:7, val2 = 'a':'g')
7×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 1 │ 'a' │
│ 2 │ 1 │ 2 │ 'b' │
│ 3 │ 1 │ 3 │ 'c' │
│ 4 │ 2 │ 4 │ 'd' │
│ 5 │ 2 │ 5 │ 'e' │
│ 6 │ 3 │ 6 │ 'f' │
│ 7 │ 3 │ 7 │ 'g' │
and assume that you want to sample one element from each group defined by the g column.
This can be achieved by:
julia> DataFrame([rand(eachrow(gr)) for gr in groupby(d,:g)])
3×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 2 │ 'b' │
│ 2 │ 2 │ 4 │ 'd' │
│ 3 │ 3 │ 6 │ 'f' │
Hope this is what you need.
EDIT
If you want a different element count from each group you could do something like this:
julia> g_to_rows=Dict(1=>4,2=>3,3=>7); # desired element counts
julia> [ gr[rand(1:nrow(gr),g_to_rows[gr.g[1]]), :] for gr in groupby(d,:g)]
3-element Array{DataFrame,1}:
4×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 1 │ 'a' │
│ 2 │ 1 │ 1 │ 'a' │
│ 3 │ 1 │ 3 │ 'c' │
│ 4 │ 1 │ 2 │ 'b' │
3×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 2 │ 5 │ 'e' │
│ 2 │ 2 │ 5 │ 'e' │
│ 3 │ 2 │ 5 │ 'e' │
7×3 DataFrame
│ Row │ g │ val1 │ val2 │
│ │ Int64 │ Int64 │ Char │
├─────┼───────┼───────┼──────┤
│ 1 │ 3 │ 7 │ 'g' │
│ 2 │ 3 │ 6 │ 'f' │
│ 3 │ 3 │ 6 │ 'f' │
│ 4 │ 3 │ 7 │ 'g' │
│ 5 │ 3 │ 7 │ 'g' │
│ 6 │ 3 │ 6 │ 'f' │
│ 7 │ 3 │ 6 │ 'f' │
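For the nesting question itself, one possible approach (a sketch, not an official tidyr equivalent) is to collapse each group's value columns into vectors with combine, and to expand them again with flatten. The data frame d is the one defined above; the names nested and unnested are only illustrative:

```julia
using DataFrames

d = DataFrame(g = [1, 1, 1, 2, 2, 3, 3], val1 = 1:7, val2 = 'a':'g')

# "Nest": one row per group; each value column holds a vector per group.
# Wrapping the collected values in a 1-element vector keeps them in one cell.
nested = combine(groupby(d, :g),
                 [:val1, :val2] .=> (x -> [collect(x)]) .=> [:val1, :val2])

# "Unnest": expand the vector-valued columns back into one row per element
unnested = flatten(nested, [:val1, :val2])
```

Here nested has one row per value of g, and flatten restores the original long layout.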

Pivot table over multiple columns in Julia

I'd like to build a pivot table from a DataFrame in Julia. From the documentation, I know I can do that with by and unstack. E.g.
julia> using DataFrames, Random
julia> Random.seed!(42);
julia> df = DataFrame(
           Step = rand(1:3, 15) |> sort,
           Label1 = rand('A':'B', 15) .|> Symbol,
           Label2 = rand('Q':'R', 15) .|> Symbol
       )
15×3 DataFrame
│ Row │ Step │ Label1 │ Label2 │
│ │ Int64 │ Symbol │ Symbol │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ A │ Q │
│ 2 │ 1 │ A │ Q │
│ 3 │ 1 │ B │ R │
│ 4 │ 1 │ B │ R │
│ 5 │ 1 │ B │ Q │
│ 6 │ 2 │ B │ Q │
│ 7 │ 2 │ B │ Q │
│ 8 │ 2 │ B │ R │
│ 9 │ 2 │ B │ R │
│ 10 │ 3 │ B │ R │
│ 11 │ 3 │ B │ Q │
│ 12 │ 3 │ B │ R │
│ 13 │ 3 │ A │ R │
│ 14 │ 3 │ B │ R │
│ 15 │ 3 │ B │ Q │
julia> unstack(by(df, [:Step, :Label1, :Label2], nrow), :Label1, :nrow)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64? │ Int64? │
├─────┼───────┼────────┼─────────┼────────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ missing │ 2 │
│ 3 │ 2 │ Q │ missing │ 2 │
│ 4 │ 2 │ R │ missing │ 2 │
│ 5 │ 3 │ Q │ missing │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
Now, how do I pivot on two columns, here Label1 and Label2, so that I get the row counts for each combination of the elements of these two columns? The expected output would be something like
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64? │ Int64? │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 1 │ 2 │ missing │ 1 │ 2 │
│ 3 │ 2 │ missing │ missing │ 2 │ 2 │
│ 5 │ 3 │ missing │ 1 │ 2 │ 3 │
Thanks in advance!
Tim

First - by is deprecated (the manual will be updated in a few days to reflect that) so one should write:
julia> unstack(combine(groupby(df, [:Step, :Label1, :Label2]), nrow), :Label1, :nrow)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64? │ Int64? │
├─────┼───────┼────────┼─────────┼────────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ missing │ 2 │
│ 3 │ 2 │ Q │ missing │ 2 │
│ 4 │ 2 │ R │ missing │ 2 │
│ 5 │ 3 │ Q │ missing │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
However, if you want row counts I would rather do something like:
julia> gdf = groupby(df, [:Step, :Label2], sort=true);
julia> lev = unique(df.Label1)
2-element Array{Symbol,1}:
:A
:B
julia> combine(gdf, :Label1 .=> [x -> count(==(l), x) for l in lev] .=> lev)
6×4 DataFrame
│ Row │ Step │ Label2 │ A │ B │
│ │ Int64 │ Symbol │ Int64 │ Int64 │
├─────┼───────┼────────┼───────┼───────┤
│ 1 │ 1 │ Q │ 2 │ 1 │
│ 2 │ 1 │ R │ 0 │ 2 │
│ 3 │ 2 │ Q │ 0 │ 2 │
│ 4 │ 2 │ R │ 0 │ 2 │
│ 5 │ 3 │ Q │ 0 │ 2 │
│ 6 │ 3 │ R │ 1 │ 3 │
so you get 0 rather than missing wherever a combination does not occur.
This pattern generalizes to multiple groups:
julia> gdf = groupby(df, :Step, sort=true);
julia> l1 = unique(df.Label1)
2-element Array{Symbol,1}:
:A
:B
julia> l2 = unique(df.Label2)
2-element Array{Symbol,1}:
:Q
:R
julia> combine(gdf, [[:Label1, :Label2] =>
                     ((x,y) -> count(((x,y),) -> x==v1 && y==v2, zip(x,y))) =>
                     Symbol(v1, v2) for v1 in l1 for v2 in l2])
3×5 DataFrame
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 2 │ 0 │ 1 │ 2 │
│ 2 │ 2 │ 0 │ 0 │ 2 │ 2 │
│ 3 │ 3 │ 0 │ 1 │ 2 │ 3 │
Another way to do it, building on your original code, would be:
julia> unstack(combine(groupby(select(df, :Step, [:Label1, :Label2] => ByRow(Symbol) => :Label), [:Step, :Label]), nrow), :Label, :nrow)
3×5 DataFrame
│ Row │ Step │ AQ │ AR │ BQ │ BR │
│ │ Int64 │ Int64? │ Int64? │ Int64? │ Int64? │
├─────┼───────┼─────────┼─────────┼────────┼────────┤
│ 1 │ 1 │ 2 │ missing │ 1 │ 2 │
│ 2 │ 2 │ missing │ missing │ 2 │ 2 │
│ 3 │ 3 │ missing │ 1 │ 2 │ 3 │
However, I agree this is not super easy. This issue is tracked in https://github.com/JuliaData/DataFrames.jl/issues/2148 and relatedly https://github.com/JuliaData/DataFrames.jl/issues/2205.
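The combined-label approach can be sketched on a small deterministic table as follows. This assumes a DataFrames.jl version whose unstack accepts the fill keyword (so absent combinations become 0 instead of missing); the data and the names labelled, counts and wide are made up for illustration:

```julia
using DataFrames

df = DataFrame(Step   = [1, 1, 1, 2, 2, 3],
               Label1 = [:A, :A, :B, :B, :B, :A],
               Label2 = [:Q, :Q, :R, :Q, :R, :R])

# Merge the two label columns into a single key such as :AQ or :BR
labelled = select(df, :Step, [:Label1, :Label2] => ByRow(Symbol) => :Label)

# Count the rows for every (Step, Label) combination
counts = combine(groupby(labelled, [:Step, :Label]), nrow => :n)

# Spread the combined labels into columns; missing combinations get 0
wide = sort!(unstack(counts, :Step, :Label, :n; fill = 0), :Step)
```

Each row of wide corresponds to one Step, and each remaining column to one Label1/Label2 combination that occurs in the data.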

Julia Dataframes Groupby chain using combine

I'm trying to do the following:
For each column, I want to group by the key together with that column, count the number of occurrences, and keep only the most frequent value per key (I don't want to keep the count itself, just the value that corresponds to it).
I have many columns to group with the key one after another, and I wanted to know if there is a way to chain this together.
Here is an example:
│ Row │ KEY │ A │
│ │ String │ String │
├──────────┼──────────────────────────────────┼───────────────┤
│ 1 │ 44473 │ ROCK │
│ 2 │ 4f4ef │ CLASSICAL │
│ 3 │ 0b8bd │ POP │
│ 4 │ 57c94 │ POP │
│ 5 │ a7070 │ RAP - HIP HOP │
│ 6 │ 1d9a3 │ JAZZ │
│ 7 │ 947fd │ POP │
Here I do:
per_key = DataFrames.groupby(test, [:KEY, :A])
combine(per_key, nrow)
which gives me:
│ Row │ KEY │ A │ nrow │
│ │ String │ String │ Int64 │
├──────┼──────────────────────────────────┼────────────────────┼───────┤
│ 1 │ 44473ff │ ROCK │ 2 │
│ 2 │ 4f4effc │ CLASSICAL │ 12 │
│ 3 │ 0b8bd64 │ POP │ 2 │
│ 4 │ 57c94f5 │ POP │ 2 │
│ 5 │ a7070e4 │ RAP - HIP HOP │ 1 │
│ 6 │ 1d9a3c7 │ JAZZ │ 1 │
How do I get, for each KEY, the maximum nrow and keep the corresponding value of A?
I also have to do this for many other columns, one by one.
Thank you

I am not 100% sure what you wanted, but I assume this is what you are looking for:
julia> using DataFrames, StatsBase
julia> df = DataFrame(key=rand(1:10, 10^6),
                      A = rand(1:10, 10^6),
                      B = rand(1:10, 10^6),
                      C = rand(1:10, 10^6));
julia> gdf = groupby(df, :key);
julia> combine(gdf, valuecols(gdf) .=>
                    (x -> last(maximum(reverse, countmap(x)))) .=>
                    valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
(on master you can add renamecols=false kwarg to avoid last .=> valuecols(gdf)).
The key function here is countmap, which gives you the counts of occurrences of the different values in a vector, e.g.:
julia> countmap(gdf[1].A)
Dict{Int64,Int64} with 10 entries:
7 => 10028
4 => 10130
9 => 10007
10 => 9841
2 => 10090
3 => 9985
5 => 10022
8 => 10262
6 => 10103
1 => 10128
The rest is just a wrapper around it. You need reverse to change the key => value order to value => key, so that maximum picks the group with the highest count (note that your problem will not have a unique solution if two groups have the same count), and then last extracts the group (as you did not want to keep the count).
EDIT:
Now I realized that argmax works for dictionaries so you can just write:
julia> combine(gdf, valuecols(gdf) .=>
                    argmax∘countmap .=>
                    valuecols(gdf))
10×4 DataFrame
│ Row │ key │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 8 │ 8 │ 1 │
│ 2 │ 10 │ 1 │ 2 │ 9 │
│ 3 │ 2 │ 6 │ 7 │ 3 │
│ 4 │ 4 │ 10 │ 7 │ 4 │
│ 5 │ 3 │ 8 │ 3 │ 1 │
│ 6 │ 7 │ 9 │ 7 │ 8 │
│ 7 │ 8 │ 2 │ 3 │ 2 │
│ 8 │ 5 │ 4 │ 3 │ 9 │
│ 9 │ 9 │ 3 │ 2 │ 10 │
│ 10 │ 6 │ 8 │ 4 │ 10 │
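To see the same idea on data small enough to check by hand, here is a minimal sketch closer to the question's KEY/A layout. The helper most_frequent is a hypothetical stand-in for argmax∘countmap that avoids the StatsBase dependency, and the sample values are invented:

```julia
using DataFrames

# Hypothetical helper: return the most frequent value of a vector.
# Ties are broken arbitrarily, as with the countmap-based version.
function most_frequent(v)
    counts = Dict{eltype(v), Int}()
    for x in v
        counts[x] = get(counts, x, 0) + 1
    end
    return argmax(counts)  # key associated with the largest count
end

df = DataFrame(key = [1, 1, 1, 2, 2, 2],
               A   = ["POP", "POP", "ROCK", "JAZZ", "RAP", "JAZZ"],
               B   = [1, 2, 2, 3, 3, 3])

gdf = groupby(df, :key)

# Apply the helper to every non-key column, keeping the column names
res = sort!(combine(gdf, valuecols(gdf) .=> most_frequent; renamecols = false), :key)
```

Because valuecols(gdf) lists every non-grouping column, adding more columns to df requires no change to the combine call.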

Get elements of one column with certain values of another column with DataFrames.jl

Having a dataframe df with columns :a and :b, how can I get all elements in column :a that are in a row with e.g. b = 0.5?
Can this be done with DataFrames alone or is a meta package needed?

df[df.b .== 5, :]
Example
julia> df = DataFrame(a=11:17, b=vcat([5,5],1:5))
7×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 11 │ 5 │
│ 2 │ 12 │ 5 │
│ 3 │ 13 │ 1 │
│ 4 │ 14 │ 2 │
│ 5 │ 15 │ 3 │
│ 6 │ 16 │ 4 │
│ 7 │ 17 │ 5 │
julia> df[df.b .== 5, :]
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 11 │ 5 │
│ 2 │ 12 │ 5 │
│ 3 │ 17 │ 5 │
If you want just the column a:
julia> df[df.b .== 5, :].a
3-element Array{Int64,1}:
11
12
17
Yet another option is to use filter with a lambda function (this is slightly faster and uses less memory):
julia> filter(row -> row[:b] == 5, df)
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 11 │ 5 │
│ 2 │ 12 │ 5 │
│ 3 │ 17 │ 5 │
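If only the values of column a are needed, the row mask and the column can be given in a single indexing call, and filter also accepts a column => predicate pair. A short sketch on the same example data (the names a_vals and a_vals2 are illustrative):

```julia
using DataFrames

df = DataFrame(a = 11:17, b = vcat([5, 5], 1:5))

# Boolean row mask plus a single column name returns a plain vector
a_vals = df[df.b .== 5, :a]

# Same rows via filter with a `column => predicate` pair, then take column a
a_vals2 = filter(:b => ==(5), df).a
```

For a floating-point column such as the b = 0.5 case in the question, an approximate comparison like isapprox.(df.b, 0.5) may be safer than ==.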

Julia DataFrame: select rows based on values of one column belonging to a set

Using a DataFrame in Julia, I want to select rows on the basis of the value taken in a column.
With the following example
using DataFrames, DataFramesMeta
DT = DataFrame(ID = [1, 1, 2,2,3,3, 4,4], x1 = rand(8))
I want to extract the rows with ID taking the values 1 and 4.
For the moment, I came up with this solution:
@where(DT, findall(x -> (x==4 || x==1), DT.ID))
When using only two values, it is manageable.
However, I want to apply it to a case with many rows and a large set of ID values to select. This solution becomes impractical if I need to write out every value to be selected.
Is there a more generic way to make this selection?
Damien

Here is a way to do it using standard DataFrames.jl indexing and using @where from DataFramesMeta.jl:
julia> DT
8×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 2 │ 0.365919 │
│ 4 │ 2 │ 0.325169 │
│ 5 │ 3 │ 0.0495252 │
│ 6 │ 3 │ 0.637568 │
│ 7 │ 4 │ 0.391051 │
│ 8 │ 4 │ 0.436209 │
julia> DT[in([1,4]).(DT.ID), :]
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
julia> @where(DT, in([1,4]).(:ID))
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
In non-performance-critical code you can also use filter, which is, at least for me, a bit simpler to digest (its drawback is that it is slower than the methods discussed above):
julia> filter(row -> row.ID in [1,4], DT)
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
Note that in the approach you mention in your question you could omit DT in front of ID like this:
julia> @where(DT, findall(x -> (x==4 || x==1), :ID))
4×2 DataFrame
│ Row │ ID │ x1 │
│ │ Int64 │ Float64 │
├─────┼───────┼──────────┤
│ 1 │ 1 │ 0.433397 │
│ 2 │ 1 │ 0.963775 │
│ 3 │ 4 │ 0.391051 │
│ 4 │ 4 │ 0.436209 │
(this is the beauty of DataFramesMeta.jl: it knows the context of the DataFrame you want to refer to)
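For a large collection of IDs, one option is to store the wanted values in a Set, so that each membership test is a hash lookup rather than a scan of a vector. A minimal sketch using only DataFrames.jl; the variable wanted and the result names are illustrative, and subset assumes DataFrames.jl 1.0 or later:

```julia
using DataFrames

DT = DataFrame(ID = [1, 1, 2, 2, 3, 3, 4, 4], x1 = rand(8))

wanted = Set([1, 4])  # could hold thousands of IDs

# Broadcast `in` over the column; Ref keeps the set from being broadcast over
sel = DT[in.(DT.ID, Ref(wanted)), :]

# Equivalent selections with filter and subset
sel2 = filter(:ID => in(wanted), DT)
sel3 = subset(DT, :ID => ByRow(in(wanted)))
```

All three calls return the same rows; which one reads best is largely a matter of taste.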
