Understanding the behavior of the colon in julia DataFrames.select()


I have some data with many columns that I want to reorder and, in some cases, rename. Because of the number of columns, I don't want to select and rename every single one of them. But when I use the : operator to select the remaining columns, I get a result I did not expect: the columns that I renamed are included twice:
julia> data = [2 1 3 50
52 51 53 100]
julia> names = ["col 2","col 1", "col_3", "col_50"]
julia> df = DataFrame(data, names)
2×4 DataFrame
Row │ col 2 col 1 col_3 col_50
│ Int64 Int64 Int64 Int64
─────┼─────────────────────────────
1 │ 2 1 3 50
2 │ 52 51 53 100
julia> select(df, "col 1" => :col_1, "col 2" => :col_2, :)
2×6 DataFrame
Row │ col_1 col_2 col 1 col 2 col_3 col_50
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────
1 │ 1 2 1 2 3 50
2 │ 51 52 51 52 53 100
I was hoping for/expecting this:
julia> select(df, "col 1" => :col_1, "col 2" => :col_2, :)
2×4 DataFrame
Row │ col_1 col_2 col_3 col_50
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 3 50
2 │ 51 52 53 100
What do I misunderstand about the : operator?
Is there another way to achieve the transformation I want?

It turns out there is. (I am answering my own question here.) I was so focused on doing everything with one function that I overlooked the simpler route: rename the columns with rename!() and then reorder them with select!():
julia> rename!(df, "col 1" => :col_1, "col 2" => :col_2)
2×4 DataFrame
│ Row │ col_2 │ col_1 │ col_3 │ col_50 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼────────┤
│ 1 │ 2 │ 1 │ 3 │ 50 │
│ 2 │ 52 │ 51 │ 53 │ 100 │
julia> select!(df, :col_1, :col_2, :)
2×4 DataFrame
│ Row │ col_1 │ col_2 │ col_3 │ col_50 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼────────┤
│ 1 │ 1 │ 2 │ 3 │ 50 │
│ 2 │ 51 │ 52 │ 53 │ 100 │
Or using a Pipe:
julia> using Pipe
julia> @pipe df |>
rename!(_, "col 1" => :col_1, "col 2" => :col_2) |>
select!(_, :col_1, :col_2, :)
2×4 DataFrame
│ Row │ col_1 │ col_2 │ col_3 │ col_50 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼────────┤
│ 1 │ 1 │ 2 │ 3 │ 50 │
│ 2 │ 51 │ 52 │ 53 │ 100 │
Regarding the behaviour of the : operator, I have to thank bkamins for providing the answer on GitHub. What : does is:
add in the place where : is placed all columns of the source data frame that have not been added to the result; the adding is based on column name (not contents)
Why it works like this:
In general we allow for potentially very complex transformations in select etc. - columns can be created, renamed, added in any order. In order to keep the rules simple (so that users can build a correct mental model of what is going on and not too much magic happens) the approach is that columns are processed left to right and are identified by their name in the target data frame.
I agree that in your particular case it seems better to do what you propose, but if you consider a wider context (i.e. that in one select you can have dozens of different transformations combined) keeping the rules consistent without any special cases is I believe better.
I thought I'd share it here as well.
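The left-to-right rule quoted above can be modelled in a few lines. The following is a minimal sketch in plain Python, not DataFrames.jl code: resolve_columns and its selector format (a name, an (old, new) tuple, or the string ":") are hypothetical and exist only for illustration. The model rejects an explicitly named column that is already in the result; how the real library treats that case is not covered here.

```python
def resolve_columns(source_names, selectors):
    """Toy model of name-based, left-to-right column resolution.

    selectors may contain (all hypothetical, for illustration only):
      - a string name: keep that source column under its own name
      - an (old, new) tuple: copy source column `old` under the name `new`
      - the string ":": every source column whose name is not yet in the result
    Returns (result_name, source_name) pairs in output order.
    """
    result = []   # (result_name, source_name) pairs
    seen = set()  # names already present in the result

    def add(result_name, source_name):
        if result_name in seen:
            raise ValueError(f"duplicate result column: {result_name}")
        seen.add(result_name)
        result.append((result_name, source_name))

    for sel in selectors:
        if sel == ":":
            # Compare by NAME only: a renamed copy does not hide its source column.
            for name in source_names:
                if name not in seen:
                    add(name, name)
        elif isinstance(sel, tuple):
            old, new = sel
            add(new, old)
        else:
            add(sel, sel)
    return result


# Renaming inside the selection: the old names are still "not yet added".
in_select = resolve_columns(["a", "b", "c", "d"],
                            [("b", "b_new"), ("a", "a_new"), ":"])
print([name for name, _ in in_select])
# ['b_new', 'a_new', 'a', 'b', 'c', 'd']

# Renaming first, then selecting: the old names no longer exist in the source.
after_rename = resolve_columns(["a_new", "b_new", "c", "d"],
                               ["b_new", "a_new", ":"])
print([name for name, _ in after_rename])
# ['b_new', 'a_new', 'c', 'd']
```

The second call mirrors the rename!() followed by select!() approach: once the source itself carries the new names, the colon has nothing left to duplicate.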

Related

Julia: how to compute a particular operation on certain columns of a Dataframe

I have the following Dataframe
using DataFrames, Statistics
df = DataFrame(name=["John", "Sally", "Kirk"],
age=[23., 42., 59.],
children=[3,5,2], height = [180, 150, 170])
print(df)
3×4 DataFrame
│ Row │ name │ age │ children │ height │
│ │ String │ Float64 │ Int64 │ Int64 │
├─────┼────────┼─────────┼──────────┼────────┤
│ 1 │ John │ 23.0 │ 3 │ 180 │
│ 2 │ Sally │ 42.0 │ 5 │ 150 │
│ 3 │ Kirk │ 59.0 │ 2 │ 170 │
I can compute the mean of a column as follows:
println(mean(df[:4]))
166.66666666666666
Now I want to get the mean of all the numeric columns and tried this code:
x = [2,3,4]
for i in x
print(mean(df[:x[i]]))
end
But got the following error message:
MethodError: no method matching getindex(::Symbol, ::Int64)
Stacktrace:
[1] top-level scope at ./In[64]:3
How can I solve the problem?

You are trying to access the DataFrame's column using an integer index specifying the column's position. You should just use the integer value i without any : before it. Writing :i would create the symbol :i, but you do not have a column named i.
x = [2,3,4]
for i in x
println(mean(df[i])) # no need for `x[i]`
end
You can also index a DataFrame using a Symbol denoting the column's name.
x = [:age, :children, :height];
for c in x
println(mean(df[c]))
end
You get the following error in your attempt because you are trying to access the ith index of the symbol :x, which is an undefined operation.
MethodError: no method matching getindex(::Symbol, ::Int64)
Note that :4 is just 4.
julia> :4
4
julia> typeof(:4)
Int64

Here is a one-liner that actually selects all Number columns:
julia> mean.(eachcol(df[findall(x-> x<:Number, eltypes(df))]))
3-element Array{Float64,1}:
41.333333333333336
3.3333333333333335
166.66666666666666
For many scenarios describe is actually more convenient:
julia> describe(df)
4×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼─────────┼──────┼────────┼───────┼─────────┼──────────┼──────────┤
│ 1 │ name │ │ John │ │ Sally │ 3 │ │ String │
│ 2 │ age │ 41.3333 │ 23.0 │ 42.0 │ 59.0 │ │ │ Float64 │
│ 3 │ children │ 3.33333 │ 2 │ 3.0 │ 5 │ │ │ Int64 │
│ 4 │ height │ 166.667 │ 150 │ 170.0 │ 180 │ │ │ Int64 │

In the question println(mean(df[4])) works as well (instead of println(mean(df[:4]))).
Hence we can write
x = [2,3,4]
for i in x
println(mean(df[i]))
end
which works

Convert a Julia DataFrame column with String to one with Int and missing values

I need to convert the following DataFrame
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
which looks like
3×2 DataFrame
│ Row │ A │ B │
│ │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I would like to convert column A from Array{String,1} to an array of Int with missing values.
I tried
julia> df.A = tryparse.(Int, df.A)
3-element Array{Union{Nothing, Int64},1}:
nothing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Union… │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
julia> eltype(df.A)
Union{Nothing, Int64}
but I'm getting column A with elements of type Union{Nothing, Int64}.
nothing (of type Nothing) and missing (of type Missing) seem to be two different kinds of value.
So I wonder how I can get column A with missing values instead?
I also wonder whether missing and nothing lead to different performance.

I would have done the following:
julia> df.A = map(x->begin val = tryparse(Int, x)
ifelse(typeof(val) == Nothing, missing, val)
end, df.A)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
I think missing is more suitable than nothing for data frames that really do have missing values, because nothing is closer to void in C or None in Python.
As a side note, Julia has dedicated support for missing, for example ismissing, skipmissing and coalesce.

Replacing nothing by missing can simply be done using replace:
julia> df.A = replace(df.A, nothing=>missing)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
julia> df
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64⍰ │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1.1 │
│ 2 │ 2 │ 2.2 │
│ 3 │ 3 │ 3.3 │
Another solution is to define a tryparsem function as follows:
tryparsem(T, str) = something(tryparse(T, str), missing)
and use it like this:
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
julia> df.A = tryparsem.(Int, df.A)

Return clickhouse array as column

Is it possible with ClickHouse to get a result in which a pair of arrays is transformed into columns?
From this result:
┌─f1──┬f2───────┬f3─────────────┐
│ 'a' │ [1,2,3] │ ['x','y','z'] │
│ 'b' │ [4,5,6] │ ['x','y','z'] │
└─────┴─────────┴───────────────┘
to :
┌─f1──┬x──┬y──┬z──┐
│ 'a' │ 1 │ 2 │ 3 │
│ 'b' │ 4 │ 5 │ 6 │
└─────┴───┴───┴───┘
The idea is to not have to repeat the header values for each line.
In my case, the "header" array f3 is unique per query and is joined to f1 and f2.

You can do it with the help of the indexOf function.
SELECT *
FROM test_sof
┌─f1─┬─f2──────┬─f3────────────┐
│ a │ [1,2,3] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
┌─f1─┬─f2────────┬─f3────────────────┐
│ c │ [7,8,9,0] │ ['x','y','z','n'] │
└────┴───────────┴───────────────────┘
┌─f1─┬─f2─────────┬─f3────────────────┐
│ d │ [7,8,9,11] │ ['x','y','z','n'] │
└────┴────────────┴───────────────────┘
┌─f1─┬─f2──────┬─f3────────────┐
│ b │ [4,5,6] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
4 rows in set. Elapsed: 0.001 sec.
Then:
SELECT
f1,
f2[indexOf(f3, 'x')] AS x,
f2[indexOf(f3, 'y')] AS y,
f2[indexOf(f3, 'z')] AS z,
f2[indexOf(f3, 'n')] AS n
FROM test_sof
ORDER BY
f1 ASC,
x ASC
┌─f1─┬─x─┬─y─┬─z─┬──n─┐
│ a │ 1 │ 2 │ 3 │ 0 │
│ b │ 4 │ 5 │ 6 │ 0 │
│ c │ 7 │ 8 │ 9 │ 0 │
│ d │ 7 │ 8 │ 9 │ 11 │
└────┴───┴───┴───┴────┘
4 rows in set. Elapsed: 0.002 sec.
Keep in mind the situation where a name from the header array has no matching entry in the data array, or vice versa.
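The caveat above can be made concrete with a small model. This is a plain-Python sketch, not ClickHouse code: index_of and element_at are hypothetical helpers, and they assume what the result table above suggests, namely that a header that is not found yields index 0 and that index 0 yields the default value 0.

```python
def index_of(arr, item):
    """1-based position of item in arr, or 0 when it is absent."""
    try:
        return arr.index(item) + 1
    except ValueError:
        return 0


def element_at(arr, idx, default=0):
    """1-based lookup that falls back to `default` for index 0 or out of range."""
    return arr[idx - 1] if 1 <= idx <= len(arr) else default


# Two rows taken from the example table: (f2, f3) keyed by f1.
rows = {
    "a": ([1, 2, 3], ["x", "y", "z"]),            # no 'n' header at all
    "c": ([7, 8, 9, 0], ["x", "y", "z", "n"]),    # 'n' is present, value 0
}

for f1, (f2, f3) in rows.items():
    pos = index_of(f3, "n")
    print(f1, element_at(f2, pos), pos > 0)
# a 0 False
# c 0 True
```

Both rows report 0 for n, so the value alone cannot tell "header absent" from "header present with value 0". If that distinction matters, the position returned by the lookup has to be checked as well.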
UPD: here is a way to get the data without knowing the "headers" in advance.
You will get three columns, the third one holding the headers.
SELECT
f1,
f2[num] AS f2_el,
f3[num] AS f3_el
FROM test_sof
ARRAY JOIN arrayEnumerate(f2) AS num
ORDER BY f1 ASC
┌─f1─┬─f2_el─┬─f3_el─┐
│ a │ 1 │ x │
│ a │ 2 │ y │
│ a │ 3 │ z │
│ b │ 4 │ x │
│ b │ 5 │ y │
│ b │ 6 │ z │
│ c │ 7 │ x │
│ c │ 8 │ y │
│ c │ 9 │ z │
│ c │ 0 │ n │
│ d │ 7 │ x │
│ d │ 8 │ y │
│ d │ 9 │ z │
│ d │ 11 │ n │
└────┴───────┴───────┘
14 rows in set. Elapsed: 0.006 sec.

This is a fun puzzle. As pointed out already, the indexOf() function seems to be the best way to pivot array columns inside ClickHouse, but it requires explicit selection of array positions. If you are using Python and your result set is not absurdly large, you can solve the problem in a more general way by flipping the array values into rows in SQL, then pivoting columns f2 and f3 in Python. Here's how it works.
First, use clickhouse-sqlalchemy and pandas to expand the matching arrays into rows as follows. (This example uses Jupyter Notebook running on Anaconda.)
# Load SQL Alchemy and connect to ClickHouse
from sqlalchemy import create_engine
%load_ext sql
%sql clickhouse://default:@localhost/default
# Use ARRAY JOIN to flip corresponding positions in f2, f3 to rows.
result = %sql select * from f array join f2, f3
df = result.DataFrame()
print(df)
The data frame appears as follows:
f1 f2 f3
0 a 1 x
1 a 2 y
2 a 3 z
3 b 4 x
4 b 5 y
5 b 6 z
Now we can pivot f2 and f3 into a new data frame.
dfp = df.pivot(columns='f3', values='f2', index='f1')
print(dfp)
The new dataframe dfp appears as follows:
f3 x y z
f1
a 1 2 3
b 4 5 6
This solution requires you to work outside the database but has the advantage that it works generally for any set of arrays as long as the names and values match. For instance if we add another row with different values and properties the same code gets the right answer. Here's a new row.
insert into f values ('c', [7,8,9,10], ['x', 'y', 'aa', 'bb'])
The pivoted data frame will appear as follows. NaN corresponds to missing values.
f3 aa bb x y z
f1
a NaN NaN 1.0 2.0 3.0
b NaN NaN 4.0 5.0 6.0
c 9.0 10.0 7.0 8.0 NaN
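For readers who want to see what the pivot does without pandas or a database at hand, the same reshaping can be sketched in plain Python. pivot_pairs is a hypothetical helper, the rows literal stands in for the query result, and missing values are represented by None rather than NaN.

```python
def pivot_pairs(rows):
    """Turn (key, values, names) triples into a header list and a row table.

    rows must be a list, because it is traversed twice: once to collect
    every header name, once to fill in the values.
    """
    headers = sorted({name for _, _, names in rows for name in names})
    table = {}
    for key, values, names in rows:
        lookup = dict(zip(names, values))          # name -> value for this row
        table[key] = [lookup.get(h) for h in headers]  # None when absent
    return headers, table


rows = [
    ("a", [1, 2, 3], ["x", "y", "z"]),
    ("b", [4, 5, 6], ["x", "y", "z"]),
    ("c", [7, 8, 9, 10], ["x", "y", "aa", "bb"]),
]

headers, table = pivot_pairs(rows)
print(headers)
# ['aa', 'bb', 'x', 'y', 'z']
for key, values in table.items():
    print(key, values)
# a [None, None, 1, 2, 3]
# b [None, None, 4, 5, 6]
# c [9, 10, 7, 8, None]
```

The header list is the sorted union of all names, which is why a row with new properties widens the table instead of breaking it.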
For more information on this solution see https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html and https://github.com/xzkostyan/clickhouse-sqlalchemy.

Rename Dataframe column names julia v1.0

In 0.6 I was using:
colnames = ["Date_Time","Date_index","Time_index"]
names!(data1_date_time_index.colindex, map(parse, colnames))
What is the syntax for v1.0 - right now .colindex is not found.
Per DataFrames docs:
rename!(data1_date_time_index, f => t for (f, t) =
zip([:x1, :x1_1, :x1_2],
[:Date_Time, :Date_index, :Time_index]))

Assuming data1_date_time_index is a DataFrame that has three columns, use:
colnames = ["Date_Time","Date_index","Time_index"]
names!(data1_date_time_index, Symbol.(colnames))
I am not 100% sure if this is what you want, as your example was not fully reproducible (so if actually you needed something else can you please submit full code that can be run).
The problem with data1_date_time_index.colindex is that . is currently used to access the columns of a DataFrame by name (and not the fields of the DataFrame type). In general, using colindex is not recommended, as it is not part of the exposed API and might change in the future. If you really need to reach it, use getfield(data_frame_name, :colindex).
EDIT
In DataFrames 0.20 you should write:
rename!(data1_date_time_index, Symbol.(colnames))
and in DataFrames 0.21 (which will be released before summer 2020) also passing strings directly will most probably be allowed like this:
rename!(data1_date_time_index, colnames)
(see here for a related discussion)

You can also rename columns through select.
For example:
df = DataFrame(col1 = 1:4, col2 = ["John", "James", "Finch", "May"])
│ Row │ col1 │ col2 │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ John │
│ 2 │ 2 │ James │
│ 3 │ 3 │ Finch │
│ 4 │ 4 │ May │
select(df, "col1" => "Id", "col2" => "Name")
│ Row │ Id │ Name │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ John │
│ 2 │ 2 │ James │
│ 3 │ 3 │ Finch │
│ 4 │ 4 │ May │

Rename columns:
names!(df, [:c1,:c2,:c3]) #(all)
rename!(df, Dict(:oldCol => :newCol)) # (a selection)
(from: https://syl1.gitbook.io/julia-language-a-concise-tutorial/useful-packages/dataframes )

Indexing Dataframe with variable in Julia

I want to create an indexed subset of a DataFrame and use a variable inside it. In this case I want to change all -9999 values of the first column to NA. If I do df[df[:1] .== -9999, :1] = NA, it works as it should. But if I use a variable as the indexer, it throws an error (LoadError: KeyError: key :i not found):
i = 1
df[df[:i] .== -9999, :i] = NA

:i is actually a symbol in Julia:
julia> typeof(:i)
Symbol
You can bind a variable to a symbol like this:
julia> i = Symbol(2)
Symbol("2")
Then you can simply use df[df[i] .== 1, i] = 123:
julia> df
10×1 DataFrames.DataFrame
│ Row │ 2 │
├─────┼─────┤
│ 1 │ 123 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ 5 │
│ 6 │ 6 │
│ 7 │ 7 │
│ 8 │ 8 │
│ 9 │ 9 │
│ 10 │ 10 │
It's worth noting that in your example df[df[:1] .== -9999, :1], :1 is NOT a symbol:
julia> :1
1
In fact, the expression is equal to df[df[1] .== -9999, 1] which works in that there is a corresponding getindex method whose argument (col_ind) can accept a common index:
julia> @which df[df[1].==1, 1]
getindex{T<:Real}(df::DataFrames.DataFrame, row_inds::AbstractArray{T,1}, col_ind::Union{Real,Symbol})
Since you just want to change the first (n) column, there is no difference between Symbol("1") and 1 as long as your column names are regularly arranged as:
│ Row │ 1 │ 2 │ 3 │...
├─────┼─────┼─────┼─────┤
│ 1 │ │ │ │...
