How to filter columns in dataframes?

I have a long list of timestamps (from 1942-01-01 00:00:00 to 2012-12-31 24:00:00), each associated with an amount (see below). Is there a way to first filter all amounts for a single day, and then add them up?
For example, for 1942-01-01, how do I find all values (amounts) that occur on that day (from hour 0 to 24) and then sum them?
time amount
DateTime Float64
1942-01-01T00:00:00 7.0
1942-01-02T00:00:00 0.2
1942-01-03T00:00:00 2.1
1942-01-04T00:00:00 3.0
:
2012-12-31T23:00:00 4.0
2012-12-31T24:00:00 0.0
df = CSV.read(path, DataFrame)
for i in 1:24
filter(r ->hour(r.time) == i, df)
end

Load InMemoryDatasets.jl and use a format to aggregate daily:
using InMemoryDatasets
ds=Dataset(time=DateTime("1942-01-01"):Hour(1):DateTime("2012-12-31"))
ds.amount = rand(nrow(ds))
DateValue(x) = Date(x)
setformat!(ds, :time=>DateValue)
combine(gatherby(ds,:time), :amount=>IMD.sum)

I didn't get how the accepted answer answers the question, so let me give another answer using the IMD package:
using InMemoryDatasets
ds=Dataset(time=DateTime("1942-01-01"):Hour(1):DateTime("2012-12-31"))
ds.amount = rand(nrow(ds))
datefmt(x) = round(x, Hour, RoundDown)
setformat!(ds, :time=>datefmt)
combine(gatherby(ds,:time), :amount=>IMD.sum)
PS: I'm one of IMD's contributors.

There are many approaches you could use (and maybe some other commenters will propose alternatives). Here let me show you how to achieve what you want without any filtering:
julia> df = DataFrame(time=[DateTime(2020, 1, rand(1:2), rand(0:23)) for _ in 1:100], amount=rand(100))
100×2 DataFrame
Row │ time amount
│ DateTime Float64
─────┼────────────────────────────────
1 │ 2020-01-02T16:00:00 0.29325
2 │ 2020-01-02T02:00:00 0.376917
3 │ 2020-01-02T09:00:00 0.11849
4 │ 2020-01-02T04:00:00 0.462997
⋮ │ ⋮ ⋮
97 │ 2020-01-02T18:00:00 0.750604
98 │ 2020-01-01T13:00:00 0.179414
99 │ 2020-01-01T15:00:00 0.552547
100 │ 2020-01-01T02:00:00 0.769066
92 rows omitted
julia> transform!(df, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour)
100×4 DataFrame
Row │ time amount date hour
│ DateTime Float64 Date Int64
─────┼───────────────────────────────────────────────────
1 │ 2020-01-02T16:00:00 0.29325 2020-01-02 16
2 │ 2020-01-02T02:00:00 0.376917 2020-01-02 2
3 │ 2020-01-02T09:00:00 0.11849 2020-01-02 9
4 │ 2020-01-02T04:00:00 0.462997 2020-01-02 4
⋮ │ ⋮ ⋮ ⋮ ⋮
97 │ 2020-01-02T18:00:00 0.750604 2020-01-02 18
98 │ 2020-01-01T13:00:00 0.179414 2020-01-01 13
99 │ 2020-01-01T15:00:00 0.552547 2020-01-01 15
100 │ 2020-01-01T02:00:00 0.769066 2020-01-01 2
92 rows omitted
julia> unstack(df, :hour, :date, :amount, combine=sum, fill=0)
24×3 DataFrame
Row │ hour 2020-01-02 2020-01-01
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 16 1.06636 0.949414
2 │ 2 0.990913 1.43032
3 │ 9 0.183206 3.16363
4 │ 4 1.24055 0.57196
⋮ │ ⋮ ⋮ ⋮
21 │ 10 0.0 0.492397
22 │ 14 0.393438 0.0
23 │ 21 0.0 0.487992
24 │ 8 0.848852 0.0
16 rows omitted
The final result is a data frame that gives you aggregates for all hours (in rows) for all days (in columns). The data is presented in the order of their appearance, so you might want to sort the result by hour:
julia> res = sort!(unstack(df, :hour, :date, :amount, combine=sum, fill=0), :hour)
24×3 DataFrame
Row │ hour 2020-01-02 2020-01-01
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 0 1.99143 0.150979
2 │ 1 1.25939 0.860835
3 │ 2 0.990913 1.43032
4 │ 3 3.83337 2.33696
⋮ │ ⋮ ⋮ ⋮
21 │ 20 1.73576 1.93323
22 │ 21 0.0 0.487992
23 │ 22 1.52546 0.651938
24 │ 23 1.03808 0.0
16 rows omitted
Now you can extract information for a specific day just by extracting a column corresponding to it, e.g.:
julia> res."2020-01-02"
24-element Vector{Float64}:
1.991425180864845
1.2593855803084226
0.9909134301068651
3.833369559458414
1.2405519797178841
1.4494215475119732
⋮
2.4509665509554157
0.0
1.7357636571508785
0.0
1.525457178008634
1.0380772820126043
For the amount of data you have there should be no problem with getting all the results in one shot (in this example I pre-sorted the source data frame on day and hour to make the final table nicely ordered both by rows and columns):
julia> @time big = DataFrame(time=[DateTime(rand(1942:2012), rand(1:12), rand(1:28), rand(0:23)) for _ in 1:10^7], amount=rand(10^7));
0.413495 seconds (99.39 k allocations: 310.149 MiB, 3.75% gc time, 5.54% compilation time)
julia> @time sort!(transform!(big, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour), [:date, :hour]);
5.049808 seconds (1.03 M allocations: 1.167 GiB, 0.81% gc time)
julia> @time unstack(big, :hour, :date, :amount, combine=sum, fill=0)
1.342251 seconds (21.58 M allocations: 673.052 MiB, 13.63% gc time)
24×23857 DataFrame
Row │ hour 1942-01-01 1942-01-02 1942-01-03 1942-01-04 1942-01-05 1942-01-06 1942-01-07 1942-01-08 1942-01-09 1942-01-10 1942-01-11 194 ⋯
│ Int64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Flo ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0 9.19054 8.00765 6.99379 9.63979 6.5088 11.6281 12.4928 6.86322 11.4453 12.6505 10.0583 1 ⋯
2 │ 1 8.78977 8.32879 6.29344 12.0815 9.83297 8.24592 10.349 10.1213 6.51192 6.1523 8.38962
3 │ 2 5.51566 9.97157 12.1064 8.28468 11.1929 8.274 8.25525 7.88186 4.65225 7.44625 6.62251 1
4 │ 3 7.25526 13.1635 4.75877 9.77418 11.5427 6.30625 6.2512 8.06394 8.77394 12.5935 9.09008
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
21 │ 20 8.46999 9.99227 11.1116 14.5478 11.8379 7.38414 11.0567 6.17652 10.6811 9.059 9.77321 ⋯
22 │ 21 7.02998 10.0908 5.5182 8.8145 9.81238 10.8413 8.65648 12.6846 12.1116 8.75566 11.2892 1
23 │ 22 9.17824 13.2115 10.589 9.87813 10.7258 7.97428 12.8137 10.3456 8.37605 9.54897 7.24197
24 │ 23 13.0214 10.2333 9.08972 11.8678 7.36996 9.80802 11.0031 6.0818 11.7789 4.3467 7.49586
23845 columns and 16 rows omitted
EDIT
Here is an example of how you can use filter. I assume we work on the big data frame created above and want information for 1942-02-03 only. I am also using Chain.jl to chain the operations neatly:
julia> @chain big begin
filter(:date => ==(Date("1942-02-03")), _)
groupby(:hour, sort=true)
combine(:amount => sum)
end
24×2 DataFrame
Row │ hour amount_sum
│ Int64 Float64
─────┼───────────────────
1 │ 0 6.22427
2 │ 1 8.33195
3 │ 2 9.26992
4 │ 3 13.7858
⋮ │ ⋮ ⋮
21 │ 20 6.59938
22 │ 21 6.07788
23 │ 22 6.68741
24 │ 23 7.59147
16 rows omitted
(if anything is unclear please comment)
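If only the total per calendar day is needed (as the question literally asks), a grouped sum over the date may be enough. The following is a minimal sketch, not part of the answer above: the four rows are made-up illustration data, and the output column name `total` is an arbitrary choice.

```julia
using DataFrames, Dates

# Small illustrative data set (values are made up for this sketch)
df = DataFrame(
    time = [DateTime(1942, 1, 1, 0), DateTime(1942, 1, 1, 13),
            DateTime(1942, 1, 2, 5), DateTime(1942, 1, 2, 23)],
    amount = [7.0, 0.5, 0.25, 2.0],
)

# Derive the calendar date of every timestamp
transform!(df, :time => ByRow(Date) => :date)

# One row per day holding the sum of all amounts recorded on that day
daily = combine(groupby(df, :date, sort=true), :amount => sum => :total)
```

A single day can then be looked up with an ordinary row filter, for example `filter(:date => ==(Date(1942, 1, 1)), daily)`.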

Related

Calculate difference between consecutive rows per group in dataframe Julia

I have the following dataframe called df:
using DataFrames
df = DataFrame(group = ["A", "A", "A", "A", "B", "B", "B", "B"],
value = [2,1,4,3,3,5,2,1])
8×2 DataFrame
Row │ group value
│ String Int64
─────┼───────────────
1 │ A 2
2 │ A 1
3 │ A 4
4 │ A 3
5 │ B 3
6 │ B 5
7 │ B 2
8 │ B 1
I would like to calculate, per group, the difference between each value and the previous one in column value. The first row of each group, which has no predecessor, should get NaN, 0, or missing. Here is the desired output:
8×3 DataFrame
Row │ group value diff
│ String Int64 Float64
─────┼────────────────────────
1 │ A 2 NaN
2 │ A 1 -1.0
3 │ A 4 3.0
4 │ A 3 -1.0
5 │ B 3 NaN
6 │ B 5 2.0
7 │ B 2 -3.0
8 │ B 1 -1.0
So I was wondering if anyone knows how to calculate the difference between consecutive rows per group in Julia?
Using DataFrames.jl (you can replace missing by any value you like):
julia> select(groupby(df, :group),
:value => (x -> [missing; diff(x)]) => :diff)
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Using DataFramesMeta.jl:
julia> @chain df begin
groupby(:group)
@select :diff = [missing; diff(:value)]
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Normally diff in Julia, as in e.g. R, would produce one row fewer (and the syntax would be simpler):
julia> combine(groupby(df, :group), :value => diff => :diff)
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
julia> @chain df begin
groupby(:group)
@combine :diff = diff(:value)
end
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
Yet another way would be to use lag from ShiftedArrays.jl:
julia> using ShiftedArrays: lag
julia> @chain df begin
groupby(:group)
@combine :diff = :value - lag(:value)
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1

Using Julia and DataFrames, plot the median of one column based on bins from another column

My apologies if this is a simple question, but I couldn't find a direct answer on the internet and I think it's a useful problem for others. I'm using Julia and DataFrames, and I want to bin one column, take the median of another column within each bin, and plot the medians in a histogram-style plot.
Many thanks if you can help with this!
This is a basic approach:
julia> using Plots, Statistics, DataFrames
julia> df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
15×2 DataFrame
Row │ x y
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ a 4
5 │ b 5
6 │ c 6
7 │ a 7
8 │ b 8
9 │ c 9
10 │ a 10
11 │ b 11
12 │ c 12
13 │ a 13
14 │ b 14
15 │ c 15
julia> res = combine(groupby(df, :x, sort=true), :y => median)
3×2 DataFrame
Row │ x y_median
│ String Float64
─────┼──────────────────
1 │ a 7.0
2 │ b 8.0
3 │ c 9.0
julia> bar(res.x, res.y_median, legend=false)
which gives you a bar chart with one bar per group in x, whose height is the median of y in that group.
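The example above groups on a column that is already categorical. If the binning column is numeric, the bins have to be derived first. Here is a minimal sketch under stated assumptions: the data and the fixed `binwidth` are made up for illustration, and each bin is labelled by its left edge.

```julia
using DataFrames, Statistics

# Made-up numeric data for this sketch
df = DataFrame(x = [0.1, 0.4, 1.2, 1.9, 2.5, 2.6], y = [1, 3, 5, 7, 9, 10])

binwidth = 1.0  # hypothetical bin width, chosen only for illustration

# Label every row with the left edge of the bin its x value falls into
transform!(df, :x => ByRow(v -> floor(v / binwidth) * binwidth) => :bin)

# Median of y within each bin, ordered by bin
res = combine(groupby(df, :bin, sort=true), :y => median => :y_median)
```

The result can be plotted the same way as before, e.g. `bar(res.bin, res.y_median, legend=false)` once Plots is loaded.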

Split a DataFrame into a Vector of DataFrames

I have a DataFrame
df = DataFrame(a=[1,1,2,2],b=[6,7,8,9])
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
3 │ 2 8
4 │ 2 9
Is there a canonical way of splitting it into a Vector{DataFrame}s? I can do
[df[df.a .== i,:] for i in unique(df.a)]
2-element Vector{DataFrame}:
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 8
2 │ 2 9
but is there maybe something more elegant?
Use:
julia> gdf = groupby(df, :a, sort=true)
GroupedDataFrame with 2 groups based on key: a
First Group (2 rows): a = 1
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
⋮
Last Group (2 rows): a = 2
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 8
2 │ 2 9
(you could omit sort=true, but sorting ensures that the output is ordered in ascending order of the lookup key).
Then you can just work with this object as a vector:
julia> gdf[1]
2×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
julia> gdf[2]
2×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 8
2 │ 2 9
This operation is non-allocating (it is a view into your original data frame).
If you really want Vector{DataFrame} (i.e. make copies of all groups) do:
julia> collect(DataFrame, gdf)
2-element Vector{DataFrame}:
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 7
2×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 2 8
2 │ 2 9
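When the groups need to be looked up by key value rather than by position, one option is to collect them into a dictionary. This is a sketch rather than part of the answer above; the name `by_key` is made up, and it relies on `pairs` of a GroupedDataFrame yielding group-key and sub-data-frame pairs.

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=[6, 7, 8, 9])
gdf = groupby(df, :a, sort=true)

# Map each value of the grouping column to a copy of its group.
# `k` is a GroupKey (so `k.a` is the key value); `g` is a SubDataFrame view.
by_key = Dict(k.a => DataFrame(g) for (k, g) in pairs(gdf))
```

As with `collect(DataFrame, gdf)`, calling `DataFrame(g)` copies each group, so the stored data frames are independent of `df`.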

Understanding the behavior of the colon in julia DataFrames.select()

I have some data with many columns that I want to reorder and, in some cases, rename. Because of the number of columns I don't want to select and rename every single one of them. But when I use the : operator to select the remaining columns, I get a result that I did not expect. The columns that I renamed are included twice:
julia> data = [2 1 3 50
52 51 53 100]
julia> names = ["col 2","col 1", "col_3", "col_50"]
julia> df = DataFrame(data, names)
2×4 DataFrame
Row │ col 2 col 1 col_3 col_50
│ Int64 Int64 Int64 Int64
─────┼─────────────────────────────
1 │ 2 1 3 50
2 │ 52 51 53 100
julia> select(df, "col 1" => :col_1, "col 2" => :col_2, :)
2×6 DataFrame
Row │ col_1 col_2 col 1 col 2 col_3 col_50
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────
1 │ 1 2 1 2 3 50
2 │ 51 52 51 52 53 100
I was hoping for/expecting this
julia> select(df, "col 1" => :col_1, "col 2" => :col_2, :)
2×4 DataFrame
Row │ col_1 col_2 col_3 col_50
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 3 50
2 │ 51 52 53 100
What do I misunderstand about the : operator?
Is there a/another way to achieve the transformation I want?
Turns out there is. Funny how one (I am answering my own question here) can fixate on a single function when I could simply have used rename!() and then reordered the columns with select!():
julia> rename!(df, "col 1" => :col_1, "col 2" => :col_2)
2×4 DataFrame
│ Row │ col_2 │ col_1 │ col_3 │ col_50 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼────────┤
│ 1 │ 2 │ 1 │ 3 │ 50 │
│ 2 │ 52 │ 51 │ 53 │ 100 │
julia> select!(df, :col_1, :col_2, :)
2×4 DataFrame
│ Row │ col_1 │ col_2 │ col_3 │ col_50 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼────────┤
│ 1 │ 1 │ 2 │ 3 │ 50 │
│ 2 │ 51 │ 52 │ 53 │ 100 │
Or using a Pipe:
julia> using Pipe
julia> @pipe df |>
rename!(_, "col 1" => :col_1, "col 2" => :col_2) |>
select!(_, :col_1, :col_2, :)
2×4 DataFrame
│ Row │ col_1 │ col_2 │ col_3 │ col_50 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼────────┤
│ 1 │ 1 │ 2 │ 3 │ 50 │
│ 2 │ 51 │ 52 │ 53 │ 100 │
Regarding the behaviour of the : operator, I have to thank bkamins for providing me with the answer on GitHub. What : does is:
add in the place where : is placed all columns of the source data frame that have not been added to the result; the adding is based on column name (not contents)
Why it works like this:
In general we allow for potentially very complex transformations in select etc. - columns can be created, renamed, added in any order. In order to keep the rules simple (so that users can build a correct mental model of what is going on and not too much magic happens) the approach is that columns are processed left to right and are identified by their name in the target data frame.
I agree that in your particular case it seems better to do what you propose, but if you consider a wider context (i.e. that in one select you can have dozens of different transformations combined) keeping the rules consistent without any special cases is I believe better.
I thought I'd share it here as well.
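Given that rule, a non-mutating alternative is to replace : with a selector that excludes the source columns that were renamed. The sketch below is not from the original answer; it assumes the `Not` selector that DataFrames.jl re-exports, and the helper variables `colnames` and `renamed` are made up for illustration.

```julia
using DataFrames

data = [2 1 3 50
        52 51 53 100]
colnames = ["col 2", "col 1", "col_3", "col_50"]
df = DataFrame(data, colnames)

# Source columns that get a new name; they must not be picked up again
renamed = ["col 1", "col 2"]

# Rename and reorder first, then append every column except the renamed sources
res = select(df, "col 1" => :col_1, "col 2" => :col_2, Not(renamed))
```

Unlike `rename!` followed by `select!`, this leaves `df` untouched and returns a new data frame.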

Julia Dataframe group by inside another group by

I have a data frame like the following:
julia> DataFrame(val=1:10, percent=nothing)
10×2 DataFrame
Row │ val percent
│ Int64 Nothing
─────┼────────────────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
5 │ 5
6 │ 6
7 │ 7
8 │ 8
9 │ 9
10 │ 10
I want to apply this:
percent(df, threshold=0.33) = df / sum(df) .> threshold
which calculates each value's share of the column total and checks whether that share is above the threshold,
to a DataFrame that is grouped twice.
I grouped it by USER_KEY, and then I want to group again by each of the other columns and combine / apply the percent function to each.
It doesn't work; I get:
ERROR: MethodError: no method matching combine(::GroupedDataFrame{DataFrame}, ::var"#64#65")
I don't understand this error.
If someone can help, thank you very much.
EDIT:
There is a small difference from this example that I don't know how to reproduce easily: besides these two columns I also have a column user_key, where a key can span many rows. I want to group by user_key and then group by val.
I want the column percent to hold each value's share of the total of column val,
so for this data frame, where the total is 10, I want the result to look like this:
4×2 DataFrame
Row │ val percent
│ Int64 Float64
─────┼────────────────
1 │ 1 0.1
2 │ 2 0.2
3 │ 3 0.3
4 │ 4 0.4
Let me answer the question in the edited part. This is probably not all that you need, though; please comment on the question so I can learn what else you need.
So the simplest approach to your problem is:
julia> df = DataFrame(val=1:4)
4×1 DataFrame
Row │ val
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> df.percent = df.val / sum(df.val)
4-element Array{Float64,1}:
0.1
0.2
0.3
0.4
julia> df
4×2 DataFrame
Row │ val percent
│ Int64 Float64
─────┼────────────────
1 │ 1 0.1
2 │ 2 0.2
3 │ 3 0.3
4 │ 4 0.4
Alternatively, you can use transform!:
julia> df = DataFrame(val=1:4)
4×1 DataFrame
Row │ val
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> transform!(df, :val => (x -> x / sum(x)) => :percent)
4×2 DataFrame
Row │ val percent
│ Int64 Float64
─────┼────────────────
1 │ 1 0.1
2 │ 2 0.2
3 │ 3 0.3
4 │ 4 0.4
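The same transformation can be applied per group, which may be closer to what the question describes. The sketch below is an assumption-laden illustration: the column name `user_key` comes from the question's description, the data are made up, the 0.33 threshold is the default from the question's percent function, and the output column `above` is a hypothetical name.

```julia
using DataFrames

# Made-up data; user_key is the grouping column mentioned in the question
df = DataFrame(user_key = ["u1", "u1", "u2", "u2", "u2"],
               val = [1, 3, 2, 2, 4])

# Share of each value within the total of its own user_key group
res = transform(groupby(df, :user_key), :val => (x -> x ./ sum(x)) => :percent)

# Flag the rows whose share exceeds the threshold from the question
transform!(res, :percent => ByRow(>(0.33)) => :above)
```

Calling `transform` on a grouped data frame keeps all rows in their original order, so the result lines up row by row with `df`.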