Is it possible in ClickHouse to get a result in which a pair of arrays is transformed into columns?
From this result:
┌─f1──┬f2───────┬f3─────────────┐
│ 'a' │ [1,2,3] │ ['x','y','z'] │
│ 'b' │ [4,5,6] │ ['x','y','z'] │
└─────┴─────────┴───────────────┘
to:
┌─f1──┬x──┬y──┬z──┐
│ 'a' │ 1 │ 2 │ 3 │
│ 'b' │ 4 │ 5 │ 6 │
└─────┴───┴───┴───┘
The idea is to avoid repeating the header values on every row.
In my case, the "header" array f3 is unique per query and is joined to f1 and f2.
You can do it with the help of the indexOf function.
SELECT *
FROM test_sof
┌─f1─┬─f2──────┬─f3────────────┐
│ a │ [1,2,3] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
┌─f1─┬─f2────────┬─f3────────────────┐
│ c │ [7,8,9,0] │ ['x','y','z','n'] │
└────┴───────────┴───────────────────┘
┌─f1─┬─f2─────────┬─f3────────────────┐
│ d │ [7,8,9,11] │ ['x','y','z','n'] │
└────┴────────────┴───────────────────┘
┌─f1─┬─f2──────┬─f3────────────┐
│ b │ [4,5,6] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
4 rows in set. Elapsed: 0.001 sec.
Then:
SELECT
f1,
f2[indexOf(f3, 'x')] AS x,
f2[indexOf(f3, 'y')] AS y,
f2[indexOf(f3, 'z')] AS z,
f2[indexOf(f3, 'n')] AS n
FROM test_sof
ORDER BY
f1 ASC,
x ASC
┌─f1─┬─x─┬─y─┬─z─┬──n─┐
│ a │ 1 │ 2 │ 3 │ 0 │
│ b │ 4 │ 5 │ 6 │ 0 │
│ c │ 7 │ 8 │ 9 │ 0 │
│ d │ 7 │ 8 │ 9 │ 11 │
└────┴───┴───┴───┴────┘
4 rows in set. Elapsed: 0.002 sec.
Keep in mind the case where a header from the header array has no matching element in the data array, or vice versa.
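To make that caveat concrete, here is a minimal Python sketch of the same lookup, not ClickHouse code. The helper name `pivot_row` and the `default` parameter are hypothetical; the sketch assumes the behaviour visible in the result above, where a header missing from f3 produces 0.

```python
def pivot_row(values, headers, wanted, default=0):
    """Return {header: value} for each wanted header.

    A header that is absent from `headers`, or whose position lies past
    the end of `values`, falls back to `default`.
    """
    out = {}
    for name in wanted:
        if name in headers:
            pos = headers.index(name)
            out[name] = values[pos] if pos < len(values) else default
        else:
            out[name] = default
    return out


rows = [
    ("a", [1, 2, 3], ["x", "y", "z"]),
    ("c", [7, 8, 9, 0], ["x", "y", "z", "n"]),
]
wanted = ["x", "y", "z", "n"]
result = {f1: pivot_row(f2, f3, wanted) for f1, f2, f3 in rows}
print(result["a"])  # {'x': 1, 'y': 2, 'z': 3, 'n': 0}
print(result["c"])  # {'x': 7, 'y': 8, 'z': 9, 'n': 0}
```

Note that row a (no 'n' header) and row c (a real value of 0 for 'n') end up indistinguishable, which is exactly the situation to watch for.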
UPD: here is a way to get the data without knowing the "headers" in advance.
You will get three columns, the third one containing the headers.
SELECT
f1,
f2[num] AS f2_el,
f3[num] AS f3_el
FROM test_sof
ARRAY JOIN arrayEnumerate(f2) AS num
ORDER BY f1 ASC
┌─f1─┬─f2_el─┬─f3_el─┐
│ a │ 1 │ x │
│ a │ 2 │ y │
│ a │ 3 │ z │
│ b │ 4 │ x │
│ b │ 5 │ y │
│ b │ 6 │ z │
│ c │ 7 │ x │
│ c │ 8 │ y │
│ c │ 9 │ z │
│ c │ 0 │ n │
│ d │ 7 │ x │
│ d │ 8 │ y │
│ d │ 9 │ z │
│ d │ 11 │ n │
└────┴───────┴───────┘
14 rows in set. Elapsed: 0.006 sec.
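For comparison, the same position-by-position expansion can be sketched in plain Python. This is only an illustration of the idea, with a hypothetical helper name `explode`. One assumption to note: `zip` stops at the shorter array, whereas the query above iterates over every position of f2.

```python
def explode(rows):
    """Yield (f1, value, header) for each position shared by f2 and f3."""
    for f1, f2, f3 in rows:
        # zip pairs elements by position and stops at the shorter array
        for value, header in zip(f2, f3):
            yield (f1, value, header)


rows = [
    ("a", [1, 2, 3], ["x", "y", "z"]),
    ("b", [4, 5, 6], ["x", "y", "z"]),
]
long_rows = list(explode(rows))
for row in long_rows:
    print(row)
```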
This is a fun puzzle. As already pointed out, the indexOf() function seems to be the best way to pivot array columns inside ClickHouse, but it requires explicit selection of array positions. If you are using Python and your result set is not absurdly large, you can solve the problem more generally by flipping the array values into rows in SQL, then pivoting columns f2 and f3 in Python. Here's how it works.
First, use clickhouse-sqlalchemy and pandas to expand the matching arrays into rows as follows. (This example uses Jupyter Notebook running on Anaconda.)
# Load SQL Alchemy and connect to ClickHouse
from sqlalchemy import create_engine
%load_ext sql
%sql clickhouse://default:@localhost/default
# Use ARRAY JOIN to flip corresponding positions in f2, f3 to rows.
result = %sql select * from f array join f2, f3
df = result.DataFrame()
print(df)
The data frame appears as follows:
f1 f2 f3
0 a 1 x
1 a 2 y
2 a 3 z
3 b 4 x
4 b 5 y
5 b 6 z
Now we can pivot f2 and f3 into a new data frame.
dfp = df.pivot(columns='f3', values='f2', index='f1')
print(dfp)
The new dataframe dfp appears as follows:
f3 x y z
f1
a 1 2 3
b 4 5 6
This solution requires you to work outside the database, but it has the advantage of working for any set of arrays as long as the names and values match up by position. For instance, if we add another row with different values and properties, the same code still gets the right answer. Here's a new row.
insert into f values ('c', [7,8,9,10], ['x', 'y', 'aa', 'bb'])
The pivoted data frame will appear as follows. NaN corresponds to missing values.
f3 aa bb x y z
f1
a NaN NaN 1.0 2.0 3.0
b NaN NaN 4.0 5.0 6.0
c 9.0 10.0 7.0 8.0 NaN
For more information on this solution see https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html and https://github.com/xzkostyan/clickhouse-sqlalchemy.
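If pandas is not at hand, the pivot step itself is simple enough to sketch with the standard library. This is a rough stand-in for `df.pivot`, not a replacement for it: the function name `pivot` is hypothetical, and it uses None where pandas would show NaN.

```python
def pivot(long_rows):
    """Turn (index, value, column) triples into {index: {column: value}}.

    Cells with no matching triple are left as None.
    """
    columns = sorted({col for _, _, col in long_rows})
    wide = {}
    for idx, value, col in long_rows:
        if idx not in wide:
            wide[idx] = dict.fromkeys(columns)  # every column starts as None
        wide[idx][col] = value
    return columns, wide


long_rows = [
    ("a", 1, "x"), ("a", 2, "y"), ("a", 3, "z"),
    ("c", 7, "x"), ("c", 8, "y"), ("c", 9, "aa"), ("c", 10, "bb"),
]
columns, wide = pivot(long_rows)
print(columns)    # ['aa', 'bb', 'x', 'y', 'z']
print(wide["c"])  # {'aa': 9, 'bb': 10, 'x': 7, 'y': 8, 'z': None}
```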
Given that I have some data frames with a single dimension, how can I create a list of all the data frames? Is it really as simple as just making a list and adding them in?
You could also use vcat to combine these data frames into a single one with an extra column indicating the source like this:
julia> c = vcat(a, b, source=:source => ["a", "b"])
8×2 DataFrame
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
5 │ 1 b
6 │ 2 b
7 │ 3 b
8 │ 4 b
This form is often easier to work with later. In particular, you can then group the c data frame by :source like this:
julia> groupby(c, :source)
GroupedDataFrame with 2 groups based on key: source
First Group (4 rows): source = "a"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 a
3 │ 3 a
4 │ 4 a
⋮
Last Group (4 rows): source = "b"
Row │ A source
│ Int64 String
─────┼───────────────
1 │ 1 b
2 │ 2 b
3 │ 3 b
4 │ 4 b
As a result, you also get a collection of data frames (like the list created in the other answer), but this time you can apply functions that support the split-apply-combine strategy to it; see https://dataframes.juliadata.org/stable/man/split_apply_combine/.
One possible option that appears to work is the straightforward, "just add them to the list" method mentioned above. This would look like:
julia> a = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> b = DataFrame(A = 1:4)
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> c = [a, b]
2-element Vector{DataFrame}:
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
4×1 DataFrame
Row │ A
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
julia> typeof(c)
Vector{DataFrame} (alias for Array{DataFrame, 1})
I understand if I want to filter a column between two numbers I can use BETWEEN:
SELECT a
FROM table
WHERE a BETWEEN 1 AND 5
Is there a way of mapping the filtering to an array of values, for instance, if the array was [1, 10, ... , N]:
SELECT a
FROM table
WHERE (a BETWEEN 1 AND 1+4) OR (a BETWEEN 10 AND 10+4) OR ... OR (a BETWEEN N AND N+4)
Try this query:
WITH
[1, 10, 75] AS starts_from,
4 AS step,
arrayMap(x -> (x, x + step), starts_from) AS intervals
SELECT number
FROM numbers(100)
WHERE arrayFirstIndex(x -> number >= x.1 AND number <= x.2, intervals) != 0
/*
┌─number─┐
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 10 │
│ 11 │
│ 12 │
│ 13 │
│ 14 │
│ 75 │
│ 76 │
│ 77 │
│ 78 │
│ 79 │
└────────┘
*/
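The logic of the query is easy to check outside the database. The following Python sketch mirrors it: `range(100)` stands in for `numbers(100)`, and the helper name `in_any_interval` is made up for illustration.

```python
starts_from = [1, 10, 75]
step = 4
# Same shape as the arrayMap call: one (start, end) pair per starting value
intervals = [(s, s + step) for s in starts_from]


def in_any_interval(n, intervals):
    """True when n falls inside at least one closed interval [lo, hi]."""
    return any(lo <= n <= hi for lo, hi in intervals)


selected = [n for n in range(100) if in_any_interval(n, intervals)]
print(selected)
# [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 75, 76, 77, 78, 79]
```

A number is kept as soon as it matches any one interval, which is what `arrayFirstIndex(...) != 0` expresses in the query.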
I'm trying to figure out how to produce, in ClickHouse, the column named "What I want" in the table below:
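┌─Category─┬─Row Number─┬─What I have─┬─What I want─┐
│ A        │ 1          │ 0           │ 0           │
│ A        │ 2          │ 1           │ 1           │
│ B        │ 3          │ 0           │ 1           │
│ B        │ 4          │ 0           │ 1           │
│ A        │ 5          │ 3           │ 3           │
│ B        │ 6          │ 0           │ 3           │
│ B        │ 7          │ 0           │ 3           │
│ A        │ 8          │ 2           │ 2           │
│ B        │ 9          │ 0           │ 2           │
└──────────┴────────────┴─────────────┴─────────────┘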
There are two categories A and B.
And I want B category to 'remember' the latest value from A category.
There's a column by which all records are ordered: Row Number.
I've found the function arrayFill, which looks promising, but unfortunately my server version (19.14.11.16) doesn't support it and there's no chance it will be updated soon.
I guess there should be some trick with ClickHouse arrays, but I haven't managed to find one. Is there any ClickHouse ninja who could give me a hint on how to deal with this?
P.S. In fact, category B isn't zero-filled; I show it that way just to simplify the problem a little.
create table z(c String, rn Int64, hv Int64) Engine=Memory;
insert into z values ('A',1,0)('A',2,1)('B',3,0)('B',4,0)('A',5,3)('B',6,0)('B',7,0)('A',8,2)('B',9,0);
select (arrayJoin(flatten(arrayMap( j -> arrayMap(m -> if(m.1 = 'B', (m.1, m.2, ga1[j-1][-1].3), m) , ga1[j]),
arrayEnumerate(arraySplit(k,i -> ga[i].1 <> ga[i-1].1 , (groupArray( (c, rn, hv) ) as ga), arrayEnumerate(ga)) as ga1)))) as r).1 _c,
r.2 _rn, r.3 _n
from (select * from z order by rn)
┌─_c─┬─_rn─┬─_n─┐
│ A │ 1 │ 0 │
│ A │ 2 │ 1 │
│ B │ 3 │ 1 │
│ B │ 4 │ 1 │
│ A │ 5 │ 3 │
│ B │ 6 │ 3 │
│ B │ 7 │ 3 │
│ A │ 8 │ 2 │
│ B │ 9 │ 2 │
└────┴─────┴────┘
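The array expression is dense, so here is the intended behaviour as a plain Python sketch. It is not a translation of the query, only of the rule "B rows take the latest A value". The function name `carry_last_a` is hypothetical, and it assumes that a B row appearing before any A row keeps its own value.

```python
def carry_last_a(rows):
    """rows: (category, row_number, value) triples.

    Returns the rows ordered by row_number, with each B row's value
    replaced by the value of the most recent A row.
    """
    last_a = None
    out = []
    for cat, rn, value in sorted(rows, key=lambda r: r[1]):
        if cat == "A":
            last_a = value      # remember the latest A value
        elif last_a is not None:
            value = last_a      # B row inherits it
        out.append((cat, rn, value))
    return out


z = [("A", 1, 0), ("A", 2, 1), ("B", 3, 0), ("B", 4, 0), ("A", 5, 3),
     ("B", 6, 0), ("B", 7, 0), ("A", 8, 2), ("B", 9, 0)]
filled = carry_last_a(z)
print([value for _, _, value in filled])
# [0, 1, 1, 1, 3, 3, 3, 2, 2]
```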
In 0.6 I was using:
colnames = ["Date_Time","Date_index","Time_index"]
names!(data1_date_time_index.colindex, map(parse, colnames))
What is the syntax for v1.0? Right now .colindex is not found.
Per DataFrames docs:
rename!(data1_date_time_index, f => t for (f, t) =
zip([:x1, :x1_1, :x1_2],
[:Date_Time, :Date_index, :Time_index]))
Assuming data1_date_time_index is a DataFrame that has three columns use:
colnames = ["Date_Time","Date_index","Time_index"]
names!(data1_date_time_index, Symbol.(colnames))
I am not 100% sure this is what you want, as your example was not fully reproducible (if you actually need something else, please post complete code that can be run).
The problem with data1_date_time_index.colindex is that currently . is used to access columns of a DataFrame by their name (and not fields of DataFrame type). In general you are not recommended to use colindex as it is not part of exposed API and might change in the future. If you really need to reach it use getfield(data_frame_name, :colindex).
EDIT
In DataFrames 0.20 you should write:
rename!(data1_date_time_index, Symbol.(colnames))
and in DataFrames 0.21 (which will be released before summer 2020) also passing strings directly will most probably be allowed like this:
rename!(data1_date_time_index, colnames)
(see here for a related discussion)
You can also rename columns through select.
For example:
df = DataFrame(col1 = 1:4, col2 = ["John", "James", "Finch", "May"])
│ Row │ col1 │ col2 │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ John │
│ 2 │ 2 │ James │
│ 3 │ 3 │ Finch │
│ 4 │ 4 │ May │
select(df, "col1" => "Id", "col2" => "Name")
│ Row │ Id │ Name │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ John │
│ 2 │ 2 │ James │
│ 3 │ 3 │ Finch │
│ 4 │ 4 │ May │
Rename columns:
names!(df, [:c1,:c2,:c3]) #(all)
rename!(df, Dict(:oldCol => :newCol)) # (a selection)
(from: https://syl1.gitbook.io/julia-language-a-concise-tutorial/useful-packages/dataframes )
I want to create an indexed subset of a DataFrame and use a variable inside it. In this case I want to change all -9999 values of the first column to NA. If I do df[df[:1] .== -9999, :1] = NA, it works as it should. But if I use a variable as the indexer, it throws an error (LoadError: KeyError: key :i not found):
i = 1
df[df[:i] .== -9999, :i] = NA
:i is actually a Symbol in Julia:
julia> typeof(:i)
Symbol
You can define a variable bound to a symbol like this:
julia> i = Symbol(2)
Symbol("2")
Then you can simply use df[df[i] .== 1, i] = 123:
julia> df
10×1 DataFrames.DataFrame
│ Row │ 2 │
├─────┼─────┤
│ 1 │ 123 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ 5 │
│ 6 │ 6 │
│ 7 │ 7 │
│ 8 │ 8 │
│ 9 │ 9 │
│ 10 │ 10 │
It's worth noting that in your example df[df[:1] .== -9999, :1], :1 is NOT a symbol:
julia> :1
1
In fact, the expression is equal to df[df[1] .== -9999, 1] which works in that there is a corresponding getindex method whose argument (col_ind) can accept a common index:
julia> @which df[df[1].==1, 1]
getindex{T<:Real}(df::DataFrames.DataFrame, row_inds::AbstractArray{T,1}, col_ind::Union{Real,Symbol})
Since you just want to change the first (n) column, there is no difference between Symbol("1") and 1 as long as your column names are regularly arranged as:
│ Row │ 1 │ 2 │ 3 │...
├─────┼─────┤─────┼─────┤
│ 1 │ │ │ │...