How to convert numeric values to missing in a Julia DataFrame column?

I am trying to replace values of 99 with missing:
df_miss = DataFrame(a=[1,2,99,99,5,6,99])
allowmissing!(df_miss)
df_miss.a .= replace.(df_miss.a, 99 => missing)
But getting this error:
MethodError: no method matching similar(::Int64, ::Type{Union{Missing, Int64}})
Using:
Julia Version 1.5.3
DataFrames Version 0.22.7

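You do not need to use broadcasting here. replace.(df_miss.a, 99 => missing) calls replace on each Int64 element rather than on the vector, which is what raises the similar(::Int64, ...) MethodError. Just write: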
julia> df_miss = DataFrame(a=[1,2,99,99,5,6,99])
7×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 99
4 │ 99
5 │ 5
6 │ 6
7 │ 99
julia> df_miss.a = replace(df_miss.a, 99 => missing)
7-element Vector{Union{Missing, Int64}}:
1
2
missing
missing
5
6
missing
julia> df_miss
7×1 DataFrame
Row │ a
│ Int64?
─────┼─────────
1 │ 1
2 │ 2
3 │ missing
4 │ missing
5 │ 5
6 │ 6
7 │ missing

Related

Calculate difference between consecutive rows per group in dataframe Julia

I have the following dataframe called df:
using DataFrames
df = DataFrame(group = ["A", "A", "A", "A", "B", "B", "B", "B"],
value = [2,1,4,3,3,5,2,1])
8×2 DataFrame
Row │ group value
│ String Int64
─────┼───────────────
1 │ A 2
2 │ A 1
3 │ A 4
4 │ A 3
5 │ B 3
6 │ B 5
7 │ B 2
8 │ B 1
I would like to calculate, per group, the difference between each value in column value and the value in the previous row. The first row of each group has no previous value, so it should contain NaN, 0, or missing. Here is the desired output:
8×3 DataFrame
Row │ group value diff
│ String Int64 Float64
─────┼────────────────────────
1 │ A 2 NaN
2 │ A 1 -1.0
3 │ A 4 3.0
4 │ A 3 -1.0
5 │ B 3 NaN
6 │ B 5 2.0
7 │ B 2 -3.0
8 │ B 1 -1.0
So I was wondering if anyone knows how to calculate the difference between consecutive rows per group in Julia?
Using DataFrames.jl (you can replace missing by any value you like):
julia> select(groupby(df, :group),
:value => (x -> [missing; diff(x)]) => :diff)
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Using DataFramesMeta.jl:
julia> @chain df begin
groupby(:group)
@select :diff = [missing; diff(:value)]
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1
Normally diff in Julia, as in e.g. R, produces one fewer row than its input (and the syntax is simpler):
julia> combine(groupby(df, :group), :value => diff => :diff)
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
julia> @chain df begin
groupby(:group)
@combine :diff = diff(:value)
end
6×2 DataFrame
Row │ group diff
│ String Int64
─────┼───────────────
1 │ A -1
2 │ A 3
3 │ A -1
4 │ B 2
5 │ B -3
6 │ B -1
Yet another way would be to use lag from ShiftedArrays.jl:
julia> using ShiftedArrays: lag
julia> @chain df begin
groupby(:group)
@combine :diff = :value - lag(:value)
end
8×2 DataFrame
Row │ group diff
│ String Int64?
─────┼─────────────────
1 │ A missing
2 │ A -1
3 │ A 3
4 │ A -1
5 │ B missing
6 │ B 2
7 │ B -3
8 │ B -1

Compare elements in Julia DataFrame and return value in new column

I need to compare the c1 and c2 columns of the DataFrame row by row and return the higher value in a new column.
Column "Result" should return [6,5,4,4,5]
df = DataFrame(c1=[1,2,3,4,5], c2=[6,5,4,3,2])
println(df)
if broadcast(.>, df.c1, df.c2)
df[:, "Result"] .= df.c1
else
df[:, "Result"] .= df.c2
end
println(df5)
ERROR: TypeError: non-boolean (BitVector) used in boolean context
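The condition of if must be a single Bool, but the broadcast comparison returns a BitVector, hence the error. An alternative is: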
julia> df.Result = max.(df.c1, df.c2)
5-element Vector{Int64}:
6
5
4
4
5
(some users prefer such code to the higher-order functions presented excellently by @Shayan)
Using eachrow
julia> maximum.(eachrow(df))
5-element Vector{Int64}:
6
5
4
4
5
or as a DataFrame
julia> DataFrame(new = maximum.(eachrow(df)))
5×1 DataFrame
Row │ new
│ Int64
─────┼───────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
or as a new column of the DataFrame
julia> df.Result = maximum.(eachrow(df))
julia> df
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
You can use select:
julia> select(df, All() => ByRow(max) => :Result)
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
Another alternative is using DataFramesMeta.jl:
julia> @select(df, :Result = max.(:c1, :c2))
5×1 DataFrame
Row │ Result
│ Int64
─────┼────────
1 │ 6
2 │ 5
3 │ 4
4 │ 4
5 │ 5
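# Note: a variant such as @select(df, :Result = $(max.(propertynames(df)...))) does not avoid naming the columns correctly.
# max.(:c1, :c2) compares the column names themselves and evaluates to :c2, so Result would simply be a copy of c2.
# To avoid listing column names, use the All() => ByRow(max) form shown above.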
If you want to make the change in place, then use select! or @select!.
If you prefer the returned dataframe to contain c1 and c2 columns as well, then you can go for transform (its alternative in-place operator is transform!):
julia> transform(df, All() => ByRow(max) => :Result)
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5
# And the same thing using DataFramesMeta.jl:
julia> @transform(df, :Result = max.(:c1, :c2))
5×3 DataFrame
Row │ c1 c2 Result
│ Int64 Int64 Int64
─────┼──────────────────────
1 │ 1 6 6
2 │ 2 5 5
3 │ 3 4 4
4 │ 4 3 4
5 │ 5 2 5

Using Julia and DataFrames, plot the median of one column based on bins from another column

My apologies if this is a simple question, but I couldn't find a direct answer on the internet and I think it's a useful problem for others. I'm using Julia and DataFrames, and I want to bin on one column, take the median of another column within each bin, and plot the medians in a histogram-style plot.
Many thanks if you can help on this!
This is a basic approach:
julia> using Plots, Statistics, DataFrames
julia> df = DataFrame(x=repeat(["a","b","c"], 5), y=1:15)
15×2 DataFrame
Row │ x y
│ String Int64
─────┼───────────────
1 │ a 1
2 │ b 2
3 │ c 3
4 │ a 4
5 │ b 5
6 │ c 6
7 │ a 7
8 │ b 8
9 │ c 9
10 │ a 10
11 │ b 11
12 │ c 12
13 │ a 13
14 │ b 14
15 │ c 15
julia> res = combine(groupby(df, :x, sort=true), :y => median)
3×2 DataFrame
Row │ x y_median
│ String Float64
─────┼──────────────────
1 │ a 7.0
2 │ b 8.0
3 │ c 9.0
julia> bar(res.x, res.y_median, legend=false)
which gives you a bar chart with one bar per group (a, b, c) at heights 7, 8, and 9.

How to make a list of data frames in Julia?

Given that I have some data frames with a single dimension, how can I create a list of all the data frames? Is it really as simple as just making a list and adding them in?
You could also use vcat to combine these data frames into a single one with an extra column indicating the source like this:
julia> c = vcat(a, b, source=:source => ["a", "b"])
8×2 DataFrame
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
5 │ 1 b
6 │ 2 b
7 │ 3 b
8 │ 4 b
This form is often easier to work with later. In particular if you then groupby the c data frame by :source like this:
julia> groupby(c, :source)
GroupedDataFrame with 2 groups based on key: source
First Group (4 rows): source = "a"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
⋮
Last Group (4 rows): source = "b"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 2 b
3 │ 3 b
4 │ 4 b
As a result you also get a collection of data frames (like the list that was created in the other answer), but this time you can apply functions supporting the split-apply-combine to it, see https://dataframes.juliadata.org/stable/man/split_apply_combine/.
One possible option that appears to work is the straightforward, "just add them to the list" method mentioned above. This would look like:
julia> a = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> b = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> c = [a, b]
2-element Vector{DataFrame}:
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> typeof(c)
Vector{DataFrame} (alias for Array{DataFrame, 1})

Return clickhouse array as column

Is it possible with ClickHouse to get a result in which a pair of arrays is transformed into columns?
From this result:
┌─f1──┬f2───────┬f3─────────────┐
│ 'a' │ [1,2,3] │ ['x','y','z'] │
│ 'b' │ [4,5,6] │ ['x','y','z'] │
└─────┴─────────┴───────────────┘
to :
┌─f1──┬x──┬y──┬z──┐
│ 'a' │ 1 │ 2 │ 3 │
│ 'b' │ 4 │ 5 │ 6 │
└─────┴───┴───┴───┘
The idea is to not have to repeat the header values for each line.
In my case, the "header" array f3 is unique per query and is joined to f1 and f2.
You can do it with the help of the indexOf function.
SELECT *
FROM test_sof
┌─f1─┬─f2──────┬─f3────────────┐
│ a │ [1,2,3] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
┌─f1─┬─f2────────┬─f3────────────────┐
│ c │ [7,8,9,0] │ ['x','y','z','n'] │
└────┴───────────┴───────────────────┘
┌─f1─┬─f2─────────┬─f3────────────────┐
│ d │ [7,8,9,11] │ ['x','y','z','n'] │
└────┴────────────┴───────────────────┘
┌─f1─┬─f2──────┬─f3────────────┐
│ b │ [4,5,6] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
4 rows in set. Elapsed: 0.001 sec.
Then:
SELECT
f1,
f2[indexOf(f3, 'x')] AS x,
f2[indexOf(f3, 'y')] AS y,
f2[indexOf(f3, 'z')] AS z,
f2[indexOf(f3, 'n')] AS n
FROM test_sof
ORDER BY
f1 ASC,
x ASC
┌─f1─┬─x─┬─y─┬─z─┬──n─┐
│ a │ 1 │ 2 │ 3 │ 0 │
│ b │ 4 │ 5 │ 6 │ 0 │
│ c │ 7 │ 8 │ 9 │ 0 │
│ d │ 7 │ 8 │ 9 │ 11 │
└────┴───┴───┴───┴────┘
4 rows in set. Elapsed: 0.002 sec.
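Keep in mind the situation where a name from the header array has no matching entry in the data array, or vice versa. In the output above, rows a and b have no 'n' header, and their n column shows 0.
UPD: here is a way to get the data without knowing the "headers" in advance.
You will get three columns, the third one containing the headers.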
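The caveat about absent headers can be illustrated outside the database. The following is a minimal Python sketch, not ClickHouse code: the helper name lookup and the sample rows are assumptions made for illustration. It mimics the f2[indexOf(f3, name)] pattern and falls back to 0 when a header is absent, matching the n column in the output above.

```python
def lookup(values, headers, name, default=0):
    """Return the value paired with `name`, or `default` when the header
    is absent or the value list is shorter than the header list."""
    try:
        pos = headers.index(name)
    except ValueError:
        # Header not present in this row.
        return default
    return values[pos] if pos < len(values) else default


# Hypothetical sample rows shaped like (f1, f2, f3) in the question.
rows = [
    ("a", [1, 2, 3], ["x", "y", "z"]),
    ("d", [7, 8, 9, 11], ["x", "y", "z", "n"]),
]

for f1, f2, f3 in rows:
    print(f1, [lookup(f2, f3, h) for h in ("x", "y", "z", "n")])
```

Choosing an explicit default makes it obvious which cells were filled in rather than read from the data; a sentinel such as None may be preferable when 0 is a legitimate value.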
SELECT
f1,
f2[num] AS f2_el,
f3[num] AS f3_el
FROM test_sof
ARRAY JOIN arrayEnumerate(f2) AS num
ORDER BY f1 ASC
┌─f1─┬─f2_el─┬─f3_el─┐
│ a │ 1 │ x │
│ a │ 2 │ y │
│ a │ 3 │ z │
│ b │ 4 │ x │
│ b │ 5 │ y │
│ b │ 6 │ z │
│ c │ 7 │ x │
│ c │ 8 │ y │
│ c │ 9 │ z │
│ c │ 0 │ n │
│ d │ 7 │ x │
│ d │ 8 │ y │
│ d │ 9 │ z │
│ d │ 11 │ n │
└────┴───────┴───────┘
14 rows in set. Elapsed: 0.006 sec.
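For readers less familiar with ARRAY JOIN, here is a rough Python sketch of what the query above does to each row. It is only an illustration under assumptions: the function name array_join and the sample rows are hypothetical, and None is used in this sketch to mark a position that has no header.

```python
def array_join(rows):
    """Yield one (f1, f2_el, f3_el) tuple per position of f2,
    similar in spirit to ARRAY JOIN arrayEnumerate(f2)."""
    for f1, f2, f3 in rows:
        for num in range(len(f2)):
            # None marks a missing header in this sketch.
            f3_el = f3[num] if num < len(f3) else None
            yield f1, f2[num], f3_el


# Hypothetical sample rows shaped like (f1, f2, f3).
rows = [
    ("a", [1, 2, 3], ["x", "y", "z"]),
    ("c", [7, 8, 9, 0], ["x", "y", "z", "n"]),
]

expanded = list(array_join(rows))
for row in expanded:
    print(row)
```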
This is a fun puzzle. As pointed out already, the indexOf() function seems to be the best way to pivot array columns inside ClickHouse, but it requires explicit selection of array positions. If you are using Python and your result set is not absurdly large, you can solve the problem in a more general way by flipping the array values into rows in SQL, then pivoting columns f2 and f3 in Python. Here's how it works.
First, use clickhouse-sqlalchemy and pandas to expand the matching arrays into rows as follows. (This example uses Jupyter Notebook running on Anaconda.)
# Load SQLAlchemy and connect to ClickHouse
from sqlalchemy import create_engine
%load_ext sql
%sql clickhouse://default:@localhost/default
# Use ARRAY JOIN to flip corresponding positions in f2, f3 to rows.
result = %sql select * from f array join f2, f3
df = result.DataFrame()
print(df)
The data frame appears as follows:
f1 f2 f3
0 a 1 x
1 a 2 y
2 a 3 z
3 b 4 x
4 b 5 y
5 b 6 z
Now we can pivot f2 and f3 into a new data frame.
dfp = df.pivot(columns='f3', values='f2', index='f1')
print(dfp)
The new dataframe dfp appears as follows:
f3 x y z
f1
a 1 2 3
b 4 5 6
This solution requires you to work outside the database but has the advantage that it works generally for any set of arrays as long as the names and values match. For instance if we add another row with different values and properties the same code gets the right answer. Here's a new row.
insert into f values ('c', [7,8,9,10], ['x', 'y', 'aa', 'bb'])
The pivoted data frame will appear as follows. NaN corresponds to missing values.
f3 aa bb x y z
f1
a NaN NaN 1.0 2.0 3.0
b NaN NaN 4.0 5.0 6.0
c 9.0 10.0 7.0 8.0 NaN
For more information on this solution see https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html and https://github.com/xzkostyan/clickhouse-sqlalchemy.
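To make the pivot step concrete without pandas, here is a minimal stdlib-only sketch of the same idea. The function name pivot and the literal rows are assumptions for illustration. Absent cells come back as None, where pandas shows NaN.

```python
def pivot(long_rows):
    """Turn (f1, value, header) rows into {f1: {header: value}}
    and return it together with the sorted list of all headers."""
    table = {}
    headers = set()
    for f1, value, header in long_rows:
        table.setdefault(f1, {})[header] = value
        headers.add(header)
    return table, sorted(headers)


# Rows as produced by the ARRAY JOIN step, including the extra row 'c'.
long_rows = [
    ("a", 1, "x"), ("a", 2, "y"), ("a", 3, "z"),
    ("c", 7, "x"), ("c", 8, "y"), ("c", 9, "aa"), ("c", 10, "bb"),
]

table, headers = pivot(long_rows)
print(headers)
for f1, cells in table.items():
    # dict.get returns None for headers this row does not have.
    print(f1, [cells.get(h) for h in headers])
```

Like df.pivot, this sketch needs no advance knowledge of the header names; unlike df.pivot, it silently keeps the last value if the same (f1, header) pair occurs twice.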