Julia DataFrame rollup by groups (aka subtotals) - dataframe

What is a concise way to express rollup aggregations in DataFrames.jl?
Example dataset:
+---+----------+-----+---------+------+
| id| date_col|group| item|amount|
+---+----------+-----+---------+------+
| 1|2020-03-11| A|BOO00OXXX| 1.0|
| 2|2020-03-11| A|BOO00OXXY| 2.0|
| 3|2020-03-11| B|BOO00OXXZ| 17.0|
| 4|2020-03-12| B|BOO00OXXA| 9.0|
| 5|2020-03-12| B|BOO00OXXB| 1.0|
| 6|2020-03-12| B|BOO00OXXY| 5.0|
| 7|2020-03-13| C|BOO00OXXY| 2.0|
| 8|2020-03-13| C|BOO00OXXX| 1.0|
| 9|2020-03-13| C|BOO00OXXY| 2.0|
+---+----------+-----+---------+------+
# desired output
+------+---------+
|group |total_amt|
+------+---------+
|ROLLUP| 40.0|
| A | 3.0|
| B | 32.0|
| C | 5.0|
+------+---------+
I commonly need to summarize a dataset, sometimes for sharing reports, which aggregates values over certain columns with subtotals and grand totals. These are called 'rollups' or 'subtotals'/'grand totals' in Excel.
In Spark these are conveniently generated with rollup or cube aggregations.
How can I produce a similar table with Julia DataFrames.jl? For reference, the result above is generated with the following Spark API call.
// scala spark
df.rollup("group")
.agg(sum("amount").as("total_amt"))
.orderBy("group")
.show()
+-----+---------+
|group|total_amt|
+-----+---------+
| null| 40.0|
| A| 3.0|
| B| 32.0|
| C| 5.0|
+-----+---------+
// note the aggregated column label is null for the subtotal (aka rollup)
NOTE: I am able to produce the result with multiple julia groupby() and combine() operations, and then union or vcat the result into a single dataframe. I need and want a concise and readable idiom.
EDIT: adding a specific julia implementation to show why I want something more concise.
using DataFrames, Dates
df = DataFrame(id = [1,2,3,4,5,6,7,8,9]
, date_col = Date.(["2020-03-11","2020-03-11","2020-03-11","2020-03-12","2020-03-12","2020-03-12","2020-03-13","2020-03-13","2020-03-13"])
, group = ["A","A","B","B","B","B","C","C","C"]
, amount = [1.0,2.0,17.0,9.0,1.0,5.0,2.0,1.0,2.0]
)
# replicate the spark.rollup example
df1 = combine(groupby(df, :group), :amount => sum => :total_amt);
df2 = combine(df, :amount => sum => :total_amt);
df2[:, :group] = [missing];
df_result = sort(vcat(df1, df2, cols = :setequal), rev = true)
4×2 DataFrame
Row │ group total_amt
│ String? Float64
─────┼────────────────────
1 │ missing 40.0
2 │ C 5.0
3 │ B 32.0
4 │ A 3.0
Adding a version of @bkamins's answer, sticking with combine()
I think I prefer this answer so far, as it maintains a bit of symmetry and if made into a function is easier to see where the arguments would go.
using Chain
@chain df begin
groupby(:group)
combine(:amount => sum => :total_amt)
append!(insertcols!(combine(df, :amount => sum => :total_amt), :group => "ROLLUP"))
sort(:total_amt, rev = true)
end

This is how I would do it:
julia> using DataFrames, Chain
julia> df = DataFrame(group=["A", "A", "B", "B", "C", "C"], amount=1:6)
6×2 DataFrame
Row │ group amount
│ String Int64
─────┼────────────────
1 │ A 1
2 │ A 2
3 │ B 3
4 │ B 4
5 │ C 5
6 │ C 6
julia> @chain df begin
groupby(:group)
combine(:amount => sum => :total_amount)
push!(_, (missing, sum(_.total_amount)), promote=true)
sort(:total_amount, rev=true)
end
4×2 DataFrame
Row │ group total_amount
│ String? Int64
─────┼───────────────────────
1 │ missing 21
2 │ C 11
3 │ B 7
4 │ A 3
This will be efficient and hopefully you find it readable.
As @jling commented, we do not have a built-in rollup.
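Since there is no built-in rollup, the pattern above can be wrapped in a small function. The following is only a sketch: rollup_sum and its total_name keyword are hypothetical names introduced here, not part of DataFrames.jl, and it handles a single grouping column with sum as the only aggregation.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): per-group sums plus one
# grand-total row whose group label is `missing`, mirroring Spark's null.
function rollup_sum(df::AbstractDataFrame, group::Symbol, value::Symbol;
                    total_name::Symbol = :total)
    # Subtotals: one row per group, in order of first appearance.
    subtotals = combine(groupby(df, group), value => sum => total_name)
    # Grand total: a single row computed over the whole table.
    grand = combine(df, value => sum => total_name)
    # Add the grouping column as the first column so both tables line up.
    insertcols!(grand, 1, group => missing)
    # vcat widens the group column to allow `missing`.
    return vcat(subtotals, grand)
end

df = DataFrame(group = ["A", "A", "B", "B", "C", "C"], amount = 1:6)
res = rollup_sum(df, :group, :amount)
```

The grand-total row is appended last; pass the result to sort if a different order is needed.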

Here is an answer with DataFramesMeta.jl
julia> using DataFramesMeta;
julia> @chain df begin
groupby(:group)
@combine :total_amount = sum(:amount)
@aside df2 = @combine df :total_amount = sum(:amount)
vcat(df2; cols = :union)
end
4×2 DataFrame
Row │ group total_amount
│ String? Int64
─────┼───────────────────────
1 │ A 3
2 │ B 7
3 │ C 11
4 │ missing 21

julia> df
5×2 DataFrame
Row │ g amt
│ Int64 Int64
─────┼──────────────
1 │ 0 2
2 │ 1 1
3 │ 1 1
4 │ 0 1
5 │ 1 1
julia> combine(groupby(df, :g), :amt => sum => :total_amt)
2×2 DataFrame
Row │ g total_amt
│ Int64 Int64
─────┼─────────────────────
1 │ 0 3
2 │ 1 3
#alternative do-block syntax:
julia> combine(groupby(df, :g)) do sub_df
(total_amt = sum(sub_df.amt),)
end
2×2 DataFrame
Row │ g total_amt
│ Int64 Int64
─────┼──────────────────
1 │ 0 3
2 │ 1 3
does this more or less do what you want? btw the relevant docs: https://dataframes.juliadata.org/stable/man/split_apply_combine/
I feel like we would need a few iterations to cover everything you might want to do in Spark, and SO is a hard place for that kind of back and forth.

Related

Add thousands separator to column in dataframe in julia

I have a dataframe with two columns a and b. At the moment both look like column a, but I want to add separators so that column b looks like below. I have tried using the package Format.jl, but I haven't gotten the result I'm after. It may be worth mentioning that both columns are Int64 and the column names a and b are of type Symbol.
a | b
150000 | 1500,00
27 | 27,00
16614 | 166,14
Is there some other way to solve this than using format.jl? Or is format.jl the way to go?
Assuming you want the commas in their typical positions rather than how you wrote them, this is one way:
julia> using DataFrames, Format
julia> f(x) = format(x, commas=true)
f (generic function with 1 method)
julia> df = DataFrame(a = [1000000, 200000, 30000])
3×1 DataFrame
Row │ a
│ Int64
─────┼─────────
1 │ 1000000
2 │ 200000
3 │ 30000
julia> transform(df, :a => ByRow(f) => :a_string)
3×2 DataFrame
Row │ a a_string
│ Int64 String
─────┼────────────────────
1 │ 1000000 1,000,000
2 │ 200000 200,000
3 │ 30000 30,000
If you instead want the column replaced, use transform(df, :a => ByRow(f), renamecols=false).
If you just want the output vector rather than changing the DataFrame, you can use format.(df.a, commas=true)
You could write your own function f to achieve the same behavior, but you might as well use the one someone already wrote inside the Format.jl package.
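For reference, a hand-written version of such an f might look like the sketch below. The name with_commas is hypothetical, and the function only covers plain integers; Format.jl handles many more cases.

```julia
# Hypothetical helper: insert a comma before every group of three digits,
# counted from the right, using only Base Julia.
function with_commas(n::Integer)
    digits_str = string(abs(n))
    # Zero-width match: a position preceded by a digit and followed by
    # a whole number of three-digit groups up to the end of the number.
    grouped = replace(digits_str, r"(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))" => ",")
    return n < 0 ? "-" * grouped : grouped
end

formatted = with_commas.([1000000, 200000, 30000, 500])
```

It can be applied to a column in the same way as f, for example with ByRow(with_commas).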
However, once you transform your data to Strings as above, you won't be able to filter/sort/analyze the numerical data in the DataFrame. I would suggest that you apply the formatting in the printing step (rather than modifying the DataFrame itself to contain strings) by using the PrettyTables package. This can format the entire DataFrame at once.
julia> using DataFrames, PrettyTables
julia> df = DataFrame(a = [1000000, 200000, 30000], b = [500, 6000, 70000])
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼────────────────
1 │ 1000000 500
2 │ 200000 6000
3 │ 30000 70000
julia> pretty_table(df, formatters = ft_printf("%'d"))
┌───────────┬────────┐
│ a │ b │
│ Int64 │ Int64 │
├───────────┼────────┤
│ 1,000,000 │ 500 │
│ 200,000 │ 6,000 │
│ 30,000 │ 70,000 │
└───────────┴────────┘
(Edited to reflect the updated specs in the question)
julia> df = DataFrame(a = [150000, 27, 16614]);
julia> function insertdecimalcomma(n)
if n < 100
return string(n) * ",00"
else
return replace(string(n), r"(..)$" => s",\1")
end
end
insertdecimalcomma (generic function with 1 method)
julia> df.b = insertdecimalcomma.(df.a)
julia> df
3×2 DataFrame
Row │ a b
│ Int64 String
─────┼─────────────────
1 │ 150000 1500,00
2 │ 27 27,00
3 │ 16614 166,14
Note that the b column will necessarily be a String after this change, as integer types cannot store formatting information in them.
If you have a lot of data and find that you need better performance, you may also want to use the InlineStrings package:
julia> # same as before up to the function definition
julia> using InlineStrings
julia> df.b = inlinestrings(insertdecimalcomma.(df.a))
3-element Vector{String7}:
"1500,00"
"27,00"
"166,14"
This stores the b column's data as fixed-size strings (String7 type here), which are generally treated like normal Strings, but can be significantly better for performance.

How to extract column_name String and data Vector from a one-column DataFrame in Julia?

I was able to extract the column of a DataFrame that I want using a regular expression, but now I want to extract from that DataFrame column a String with the column name and a Vector with the data. How can I construct f and g below? Alternate approaches also welcome.
julia> df = DataFrame("x (in)" => 1:3, "y (°C)" => 4:6)
3×2 DataFrame
Row │ x (in) y (°C)
│ Int64 Int64
─────┼────────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> y = df[:, r"y "]
3×1 DataFrame
Row │ y (°C)
│ Int64
─────┼────────
1 │ 4
2 │ 5
3 │ 6
julia> y_units = f(y)
"°C"
julia> y_data = g(y)
3-element Vector{Int64}:
4
5
6
f(df) = only(names(df))
g(df) = only(eachcol(df)) # or df[!, 1] if you do not need to check that this is the only column
(only is used to check that the data frame actually has only one column)
An alternate approach to get the column name without creating an intermediate data frame is just writing:
julia> names(df, r"y ")
1-element Vector{String}:
"y (°C)"
to extract out the column name (you need to get the first element of this vector)
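Note that f as defined above returns the full column name ("y (°C)"), not just the unit shown in the question. If only the text inside the parentheses is wanted, a regular expression can pull it out. This is a sketch; units_of is a hypothetical helper, and it assumes the unit is always written in parentheses in the column name.

```julia
using DataFrames

df = DataFrame("x (in)" => 1:3, "y (°C)" => 4:6)

# Hypothetical helper: return the text between parentheses in a column name,
# or `missing` when the name contains no parenthesised unit.
function units_of(name::AbstractString)
    m = match(r"\((.*)\)", name)
    return m === nothing ? missing : String(m.captures[1])
end

y_name = only(names(df, r"y "))   # errors unless exactly one column matches
y_units = units_of(y_name)
y_data = df[!, y_name]            # the column vector itself, no copy
```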

Julia DataFrame combine specific calculations and transpose

I need to do something quite specific and I'm trying to do it the right way; in particular, I want it to be optimized.
So I have a DataFrame that looks like this:
v = ["x","y","z"][rand(1:3, 10)]
df = DataFrame(Any[collect(1:10), v, rand(10)], [:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED])
Row │ USER_ID GENRE_MAIN TOTAL_LISTENED
│ Int64 String Float64
─────┼─────────────────────────────────────
1 │ 1 x 0.237186
12 │ 1 y 0.237186
13 │ 1 x 0.254486
2 │ 2 z 0.920804
3 │ 3 y 0.140626
4 │ 4 x 0.653306
5 │ 5 x 0.83126
6 │ 6 x 0.928973
7 │ 7 y 0.519728
8 │ 8 x 0.409969
9 │ 9 z 0.798064
10 │ 10 x 0.701332
I want to aggregate it by user (I have many rows per user_id) and do many calculations.
I need to calculate the top 1, 2, 3, 4, 5 genre, album name, and artist name per user_id and their respective values (the corresponding total_listened), and it has to look like this:
USER_ID │ ALBUM1_NAME │ ALBUM2_NAME | ALBUM1_NAME_VALUE | ALBUM2_NAME_VALUES | ......│ GENRE1 │ GENRE2
One line per user_id .
I got this solution that fits 90% of what I wanted, but I can't modify it to also include the values of total_listened:
using DataFrames, Pipe, Random, Pkg
Pkg.activate(".")
Pkg.add("DataFrames")
Pkg.add("Pipe")
Random.seed!(1234)
df = DataFrame(USER_ID=rand(1:10, 80),
GENRE_MAIN=rand(string.("genre_", 1:6), 80),
ALBUM_NAME=rand(string.("album_", 1:6), 80),
ALBUM_ARTIST_NAME=rand(string.("artist_", 1:6), 80))
function top5(sdf, col, prefix)
return @pipe groupby(sdf, col) |>
combine(_, nrow) |>
sort!(_, :nrow, rev=true) |>
first(_, 5) |>
vcat(_[!, 1], fill(missing, 5 - nrow(_))) |>
DataFrame([string(prefix, i) for i in 1:5] .=> _)
end
@pipe groupby(df, :USER_ID) |>
combine(_,
x -> top5(x, :GENRE_MAIN, "genre"),
x -> top5(x, :ALBUM_NAME, "album"),
x -> top5(x, :ALBUM_ARTIST_NAME, "artist"))
An example :
For user 1 of the DataFrame just above, I want the result to be:
Row │ USER_ID GENRE1 GENRE2 GENRE1_VALUE GENRE2_VALUE ......
│ Int64 String String Float64 Float64
─────┼─────────────────────────────────────────────────────
1 │ 1 x y 0.491672 0.237186. ......
I took only GENRE here, but I also want it for ALBUM_NAME and ALBUM_ARTIST_NAME.
Afterwards I also want to compute a top rank %:
order the users by total_listened and calculate their percentile,
to rank them into the top 5%, top 10%, top 20% of the total.
I can calculate the targeted quantile I want with
x = .05
quantile(df.TOTAL_LISTENED, x)
and then just keep all the users whose total_listened is above this quantile,
but I don't know how to calculate it properly in the combine...
Thank you
As commented in the previous post, I would recommend asking a specific question rather than redoing your whole project on StackOverflow. If you need that kind of help, https://discourse.julialang.org/ is a good place to discuss, especially since you need many steps of analysis and they require a precise definition of what exactly you want. It would also be best to share your full data set there, as the sample you provide here is too small for a proper analysis.
Here is an example how to add totals columns (I assume that you want data to be ordered by the totals):
julia> using Random, DataFrames, Pipe
julia> Random.seed!(1234);
julia> df = DataFrame([rand(1:10, 100), rand('a':'k', 100), rand(100)],
[:USER_ID, :GENRE_MAIN, :TOTAL_LISTENED]);
julia> function top5(sdf, col, prefix)
@pipe groupby(sdf, col) |>
combine(_, :TOTAL_LISTENED => sum => :SUM) |>
sort!(_, :SUM, rev=true) |>
first(_, 5) |>
vcat(_[!, 1], fill(missing, 5 - nrow(_)),
_[!, 2], fill(missing, 5 - nrow(_))) |>
DataFrame([[string(prefix, i) for i in 1:5];
[string(prefix, i, "_VALUE") for i in 1:5]] .=> _)
end;
julia> @pipe groupby(df, :USER_ID) |>
combine(_, x -> top5(x, :GENRE_MAIN, "genre"))
10×11 DataFrame
Row │ USER_ID genre1 genre2 genre3 genre4 genre5 genre1_VALUE genre2_VALUE genre3_VALUE genre4_VALUE genre5_VALUE
│ Int64 Char Char Char Char Char? Float64 Float64 Float64 Float64 Float64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 d b j e i 2.34715 2.014 1.68587 0.693472 0.377869
2 │ 4 b e d c missing 0.90263 0.589418 0.263121 0.107839 missing
3 │ 8 c d i k j 1.55335 1.40416 0.977785 0.779468 0.118024
4 │ 2 a e f g k 1.34841 0.901507 0.87146 0.797606 0.669002
5 │ 10 a e f i d 1.60554 1.07311 0.820425 0.757363 0.678598
6 │ 7 f i g c a 2.59654 1.49654 1.15944 0.670488 0.258173
7 │ 9 i b e a g 1.57373 0.954117 0.603848 0.338918 0.133201
8 │ 5 f g c k d 1.33899 0.722283 0.664457 0.54016 0.507337
9 │ 3 d c f h e 1.63695 0.919088 0.544296 0.531262 0.0540101
10 │ 6 d g f j i 1.68768 0.97688 0.333207 0.259212 0.0636912
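The percentile part of the question is not covered above. One possible reading is: sum TOTAL_LISTENED per user, compute a quantile of those per-user totals, and flag the users at or above it. The sketch below follows that reading on a tiny made-up table; the column names TOTAL and TOP25 and the 0.75 cutoff are assumptions chosen for illustration.

```julia
using DataFrames, Statistics

# Made-up sample data, only to illustrate the steps.
df = DataFrame(USER_ID = [1, 1, 2, 3, 3, 4],
               TOTAL_LISTENED = [0.2, 0.3, 0.9, 0.1, 0.1, 0.6])

# One row per user with the summed listening time.
totals = combine(groupby(df, :USER_ID), :TOTAL_LISTENED => sum => :TOTAL)

# The 75th percentile of the per-user totals; users at or above it
# form the top 25%.
cutoff = quantile(totals.TOTAL, 0.75)
totals.TOP25 = totals.TOTAL .>= cutoff
```

Other bands (top 5%, top 10%, ...) follow by changing the probability passed to quantile.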

R's table function in Julia (for DataFrames)

Is there something like R's table function in Julia? I've read about xtab, but do not know how to use it.
Suppose we have an R data.frame rdata whose col6 is of the factor type.
R sample code:
rdata <- read.csv("mycsv.csv") #1
table(rdata$col6) #2
In order to read data and make factors in Julia I do it like this:
using DataFrames
jldata = readtable("mycsv.csv", makefactors=true) #1 :col6 will be now pooled.
..., but how to build R's table like in julia (how to achieve #2)?
You can use the countmap function from StatsBase.jl to count the entries of a single variable. General cross tabulation and statistical tests for contingency tables are lacking at this point. As Ismael points out, this has been discussed in the issue tracker for StatsBase.jl.
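A minimal sketch of the countmap approach, using a made-up vector in place of the col6 column from the question:

```julia
using StatsBase

# Made-up values standing in for a categorical column.
gender = ["female", "male", "male", "female", "male"]

# countmap returns a Dict mapping each distinct value to its count.
counts = countmap(gender)
```

Because the result is a Dict, the order of the entries is not guaranteed; sort the keys if a stable display order matters.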
I came to the conclusion that a similar effect can be achieved using by:
Suppose jldata contains a :gender column.
julia> by(jldata, :gender, nrow)
3x2 DataFrames.DataFrame
| Row | gender | x1 |
|-----|----------|-------|
| 1 | NA | 175 |
| 2 | "female" | 40254 |
| 3 | "male" | 58574 |
Of course it's not a table but at least I get the same data type as the datasource. Surprisingly by seems to be faster than countmap.
I believe by is no longer available: it was removed from DataFrames.jl (on Julia 1.5.3 it says: ERROR: ArgumentError: by function was removed from DataFrames.jl).
So here are some alternatives: we can use split-apply-combine to do cross tabs as well, or use FreqTables.
Using Split Combine:
Example 1 - SingleColumn:
using RDatasets
using DataFrames
mtcars = dataset("datasets", "mtcars")
## To do a table on cyl column
gdf = groupby(mtcars, :Cyl)
combine(gdf, nrow)
Output:
# 3×2 DataFrame
# Row │ Cyl nrow
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 6 7
# 2 │ 4 11
# 3 │ 8 14
Example 2 - CrossTabs Between 2 columns:
## we have to just change the groupby code a little bit and rest is same
gdf = groupby(mtcars, [:Cyl, :AM])
combine(gdf, nrow)
Output:
#6×3 DataFrame
# Row │ Cyl AM nrow
# │ Int64 Int64 Int64
#─────┼─────────────────────
# 1 │ 6 1 3
# 2 │ 4 1 8
# 3 │ 6 0 4
# 4 │ 8 0 12
# 5 │ 4 0 3
# 6 │ 8 1 2
As a side note, if you don't like nrow as the column name, you can use:
combine(gdf, nrow => :Count)
to change the name to Count
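The two-column result above is in long format, whereas R's table prints a wide cross tab. One way to get a wide layout is unstack, sketched here on a small made-up table instead of mtcars; combinations that never occur come back as missing and are replaced with 0 via coalesce.

```julia
using DataFrames

# Made-up data standing in for the Cyl and AM columns of mtcars.
df = DataFrame(Cyl = [4, 4, 6, 8, 8, 8], AM = [1, 0, 1, 0, 0, 1])

# Long format: one row per (Cyl, AM) combination that occurs.
long = combine(groupby(df, [:Cyl, :AM]), nrow => :Count)

# Wide format: one row per Cyl, one column per AM value.
wide = unstack(long, :Cyl, :AM, :Count)

# Absent combinations are `missing`; turn them into zero counts.
wide = coalesce.(wide, 0)
```

The new column names are the string forms of the AM values ("0" and "1").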
Alternate way: Using FreqTables
You can also use the FreqTables package to compute counts and proportions very easily. To install it, run Pkg.add("FreqTables"):
using FreqTables
## Cross tab between cyl and am
freqtable(mtcars.Cyl, mtcars.AM)
## Proportion between cyl and am
prop(freqtable(mtcars.Cyl, mtcars.AM))
## column-wise proportions, like R's margin argument: margins=2
prop(freqtable(mtcars.Cyl, mtcars.AM), margins=2)
## row-wise proportions: margins=1
prop(freqtable(mtcars.Cyl, mtcars.AM), margins=1)
Outputs:
## count cross tabs
#3×2 Named Array{Int64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────
#4 │ 3 8
#6 │ 4 3
#8 │ 12 2
## proportion wise (overall)
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼─────────────────
#4 │ 0.09375 0.25
#6 │ 0.125 0.09375
#8 │ 0.375 0.0625
## Column wise proportion
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────────────────
#4 │ 0.157895 0.615385
#6 │ 0.210526 0.230769
#8 │ 0.631579 0.153846
## Row wise proportion
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────────────────
#4 │ 0.272727 0.727273
#6 │ 0.571429 0.428571
#8 │ 0.857143 0.142857

julia create an empty dataframe and append rows to it

I am trying out the Julia DataFrames module. I am interested in it so I can use it to plot simple simulations in Gadfly. I want to be able to iteratively add rows to the dataframe and I want to initialize it as empty.
The tutorials/documentation on how to do this is sparse (most documentation describes how to analyse imported data).
To append to a nonempty dataframe is straightforward:
df = DataFrame(A = [1, 2], B = [4, 5])
push!(df, [3 6])
This returns.
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 1 | 4 |
| 2 | 2 | 5 |
| 3 | 3 | 6 |
But for an empty init I get errors.
df = DataFrame(A = [], B = [])
push!(df, [3, 6])
Error message:
ArgumentError("Error adding 3 to column :A. Possible type mis-match.")
while loading In[220], in expression starting on line 2
What is the best way to initialize an empty Julia DataFrame such that you can iteratively add items to it later in a for loop?
A zero length array defined using only [] will lack sufficient type information.
julia> typeof([])
Array{None,1}
To avoid that problem, simply indicate the element type.
julia> typeof(Int64[])
Array{Int64,1}
And you can apply that to your DataFrame problem
julia> df = DataFrame(A = Int64[], B = Int64[])
0x2 DataFrame
julia> push!(df, [3 6])
julia> df
1x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 3 | 6 |
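For the for-loop use case in the question, the same typed-empty-DataFrame idea can be sketched as follows. The column names step and value are made up for illustration, and rows are pushed as NamedTuples so each value is matched to its column by name.

```julia
using DataFrames

# Empty DataFrame with concrete column types.
df = DataFrame(step = Int[], value = Float64[])

# Append one row per iteration; the NamedTuple keys must match the columns.
for i in 1:3
    push!(df, (step = i, value = i^2 / 2))
end
```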
using Pkg, CSV, DataFrames
iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"))
new_iris = similar(iris, nrow(iris))
head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1 │ missing │ missing │ missing │ missing │ missing │
# │ 2 │ missing │ missing │ missing │ missing │ missing │
for (i, row) in enumerate(eachrow(iris))
new_iris[i, :] = row[:]
end
head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
# │ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
The answer from @waTeim already answers the initial question. But what if I want to dynamically create an empty DataFrame and append rows to it? E.g. what if I don't want hard-coded column names?
In this case, df = DataFrame(A = Int64[], B = Int64[]) is not sufficient.
The NamedTuple (A = Int64[], B = Int64[]) needs to be created dynamically.
Let's assume we have a vector of column names col_names and a vector of column types col_types from which to create an empty DataFrame.
col_names = [:A, :B] # needs to be a vector of Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A=Int64[], ....) by doing
named_tuple = (; zip(col_names, type[] for type in col_types )...)
df = DataFrame(named_tuple) # 0×2 DataFrame
Alternatively, the NamedTuple could be created with
# or by doing
named_tuple = NamedTuple{Tuple(col_names)}(type[] for type in col_types )
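Another way to express the same idea, sketched here as an alternative rather than a replacement, is to skip the NamedTuple and pass a vector of name => empty-vector pairs to the DataFrame constructor. The pushed row below is made up to show that the column types are enforced.

```julia
using DataFrames

col_names = [:A, :B]
col_types = [Int64, Float64]

# One `name => T[]` pair per column; each empty vector carries its element type.
df = DataFrame([name => T[] for (name, T) in zip(col_names, col_types)])

# Rows can then be appended; values are converted to the column types.
push!(df, (1, 2.5))
```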
I think at least in the latest version of Julia you can achieve this by creating a pair object without specifying type
df = DataFrame("A" => [], "B" => [])
push!(df, [5,'f'])
1×2 DataFrame
Row │ A B
│ Any Any
─────┼──────────
1 │ 5 f
As seen in this post by @Bogumił Kamiński, where multiple columns are needed, something like this can be done:
entries = ["A", "B", "C", "D"]
df = DataFrame([ name =>[] for name in entries])
julia> push!(df,[4,5,'r','p'])
1×4 DataFrame
Row │ A B C D
│ Any Any Any Any
─────┼────────────────────
1 │ 4 5 r p
Or, as pointed out by @Antonello below, if you know the type you can do:
df = DataFrame([name => Int[] for name in entries])
which is also in @Bogumił Kamiński's original post.