I'm trying to figure out how one can create in ClickHouse a column with the name "What I want" in the table below:
┌─Category─┬─Row Number─┬─What I have─┬─What I want─┐
│ A        │          1 │           0 │           0 │
│ A        │          2 │           1 │           1 │
│ B        │          3 │           0 │           1 │
│ B        │          4 │           0 │           1 │
│ A        │          5 │           3 │           3 │
│ B        │          6 │           0 │           3 │
│ B        │          7 │           0 │           3 │
│ A        │          8 │           2 │           2 │
│ B        │          9 │           0 │           2 │
└──────────┴────────────┴─────────────┴─────────────┘
There are two categories, A and B, and I want the B category to 'remember' the latest value from the A category. There's a column by which all records are ordered: Row Number.
I've found the function arrayFill, which looks promising, but unfortunately it isn't supported by my server version (19.14.11.16) and there's no chance it will be updated soon.
I guess there should be some trick with ClickHouse arrays, but I didn't manage to find a way. Is there any ClickHouse ninja who could give me a hint on how to deal with this?
P.S. In fact the B category isn't zero-filled; I've presented it this way just to simplify the problem a little.
create table z(c String, rn Int64, hv Int64) Engine=Memory;
insert into z values ('A',1,0)('A',2,1)('B',3,0)('B',4,0)('A',5,3)('B',6,0)('B',7,0)('A',8,2)('B',9,0);
select
    (arrayJoin(flatten(arrayMap(
        j -> arrayMap(m -> if(m.1 = 'B', (m.1, m.2, ga1[j - 1][-1].3), m), ga1[j]),
        arrayEnumerate(
            arraySplit(
                (k, i) -> ga[i].1 <> ga[i - 1].1,
                (groupArray((c, rn, hv)) as ga),
                arrayEnumerate(ga)
            ) as ga1
        )
    ))) as r).1 _c,
    r.2 _rn,
    r.3 _n
from (select * from z order by rn)
┌─_c─┬─_rn─┬─_n─┐
│ A │ 1 │ 0 │
│ A │ 2 │ 1 │
│ B │ 3 │ 1 │
│ B │ 4 │ 1 │
│ A │ 5 │ 3 │
│ B │ 6 │ 3 │
│ B │ 7 │ 3 │
│ A │ 8 │ 2 │
│ B │ 9 │ 2 │
└────┴─────┴────┘
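Conceptually, the nested array functions above perform a per-category forward fill: each 'B' row takes the hv of the most recent 'A' row in rn order. A minimal Python sketch of the same logic (illustrative only, not ClickHouse code):

```python
rows = [('A', 1, 0), ('A', 2, 1), ('B', 3, 0), ('B', 4, 0),
        ('A', 5, 3), ('B', 6, 0), ('B', 7, 0), ('A', 8, 2), ('B', 9, 0)]

def fill_from_a(rows):
    # Carry the latest value seen on an 'A' row; 'B' rows inherit it.
    last_a = 0
    out = []
    for c, rn, hv in rows:
        if c == 'A':
            last_a = hv
            out.append((c, rn, hv))
        else:
            out.append((c, rn, last_a))
    return out

for c, rn, n in fill_from_a(rows):
    print(c, rn, n)
```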
I have a ClickHouse database with a simple table with two fields (pagePath string, pageviews int)
I want to sum visits for each filter + value, to know the pageviews per filter (most used filter). Filters are separated by commas.
Example data
┌─pagePath───────────────────────────────────────────┬─pageviews─┐
│ /url1/filter1.value1                               │         1 │
│ /url1/filter1.value2                               │         2 │
│ /url1/filter1.value3,filter1.value2                │         3 │
│ /url1/filter1.value4,filter3.value2                │         4 │
│ /url1/filter1.value5,filter3.value2,filter1.value2 │         5 │
│ /url1/filter2.value1,filter3.value2,filter1.value2 │         6 │
│ /url1/filter2.value2                               │         7 │
│ /url-2/filter2.value3                              │         8 │
│ /url-2/filter2.value4                              │         9 │
│ /url-11/filter3.value1                             │        10 │
│ /url-21/filter3.value1                             │        11 │
│ /url1/filter3.value2                               │        12 │
│ /url1/filter3.value3                               │        13 │
│ /url1/filter3.value4                               │        14 │
└────────────────────────────────────────────────────┴───────────┘
create table T (pagePath String , pageviews Int64) Engine=Memory;
insert into T values ('/url1/filter1.value1',1);
insert into T values ('/url1/filter1.value2',2);
insert into T values ('/url1/filter1.value3,filter1.value2',3);
insert into T values ('/url1/filter1.value4,filter3.value2',4);
insert into T values ('/url1/filter1.value5,filter3.value2,filter1.value2',5);
insert into T values ('/url1/filter2.value1,filter3.value2,filter1.value2',6);
insert into T values ('/url1/filter2.value2',7);
insert into T values ('/url-2/filter2.value3',8);
insert into T values ('/url-2/filter2.value4',9);
insert into T values ('/url-11/filter3.value1',10);
insert into T values ('/url-21/filter3.value1',11);
insert into T values ('/url1/filter3.value2',12);
insert into T values ('/url1/filter3.value3',13);
insert into T values ('/url1/filter3.value4',14);
And I want to get the sum of pageviews for each filter + value:
┌─Filters────────┬─pageviews──────────┐
│ filter1.value1 │ 1                  │
│ filter1.value2 │ 16 (2 + 3 + 5 + 6) │
│ filter1.value3 │ 3                  │
│ filter1.value4 │ 4                  │
│ filter1.value5 │ 5                  │
│ filter2.value1 │ 6                  │
│ filter2.value2 │ 7                  │
│ filter2.value3 │ 8                  │
│ filter2.value4 │ 9                  │
│ filter3.value1 │ 10                 │
│ filter3.value1 │ 11                 │
│ filter3.value2 │ 12 (4+5+6+12)      │
│ filter3.value3 │ 13                 │
│ filter3.value4 │ 14                 │
└────────────────┴────────────────────┘
And I also want to get the sum of pageviews for each filter (without values):
┌─Filters─┬─pageviews───────────────────┐
│ filter1 │ 15 (1 + 2 + 3 + 4 + 5)      │
│ filter2 │ 30 (6 + 7 + 8 + 9)          │
│ filter3 │ 60 (10 + 11 + 12 + 13 + 14) │
└─────────┴─────────────────────────────┘
I tried with:
select
    arrJoinFilters,
    sum(PV) as totales
from
(
    SELECT
        arrJoinFilters,
        splitByChar(',', replaceRegexpAll(pagePath, '^/url.*/(.*\..*)$', '\\1')) arrFilter,
        pageviews as PV
    FROM T
    ARRAY JOIN arrFilter AS arrJoinFilters
    GROUP by arrFilter, PV, arrJoinFilters
)
group by arrJoinFilters
order by arrJoinFilters
But I think there is something wrong, and I don't get the second desired result.
Thanks!
Note: 4 + 5 + 6 + 12 = 27, not 12 as in your expected output.
SELECT
arrayJoin(splitByChar(',',replaceRegexpAll(pagePath,'^/url.*/(.*\..*)$','\\1'))) f,
sum(pageviews) as PV
FROM T
GROUP by f
order by f
┌─f──────────────┬─PV─┐
│ filter1.value1 │ 1 │
│ filter1.value2 │ 16 │
│ filter1.value3 │ 3 │
│ filter1.value4 │ 4 │
│ filter1.value5 │ 5 │
│ filter2.value1 │ 6 │
│ filter2.value2 │ 7 │
│ filter2.value3 │ 8 │
│ filter2.value4 │ 9 │
│ filter3.value1 │ 21 │
│ filter3.value2 │ 27 │
│ filter3.value3 │ 13 │
│ filter3.value4 │ 14 │
└────────────────┴────┘
select splitByChar('.',f)[1] x, sum(PV), groupArray(PV), groupArray(f)
from (
SELECT
arrayJoin(splitByChar(',',replaceRegexpAll(pagePath,'^/url.*/(.*\..*)$','\\1'))) f,
sum(pageviews) as PV
FROM T
GROUP by f
order by f) group by x
┌─x───────┬─sum(PV)─┬─groupArray(PV)─┬─groupArray(f)──────────────────────────────────────────────────────────────────────────┐
│ filter2 │ 30 │ [6,7,8,9] │ ['filter2.value1','filter2.value2','filter2.value3','filter2.value4'] │
│ filter3 │ 75 │ [21,27,13,14] │ ['filter3.value1','filter3.value2','filter3.value3','filter3.value4'] │
│ filter1 │ 29 │ [1,16,3,4,5] │ ['filter1.value1','filter1.value2','filter1.value3','filter1.value4','filter1.value5'] │
└─────────┴─────────┴────────────────┴────────────────────────────────────────────────────────────────────────────────────────┘
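For clarity, the same two aggregations can be sketched in plain Python; the regex and comma split mirror the replaceRegexpAll/splitByChar calls above (a sketch over the sample data, not production code):

```python
import re
from collections import defaultdict

data = [('/url1/filter1.value1', 1), ('/url1/filter1.value2', 2),
        ('/url1/filter1.value3,filter1.value2', 3),
        ('/url1/filter1.value4,filter3.value2', 4),
        ('/url1/filter1.value5,filter3.value2,filter1.value2', 5),
        ('/url1/filter2.value1,filter3.value2,filter1.value2', 6),
        ('/url1/filter2.value2', 7), ('/url-2/filter2.value3', 8),
        ('/url-2/filter2.value4', 9), ('/url-11/filter3.value1', 10),
        ('/url-21/filter3.value1', 11), ('/url1/filter3.value2', 12),
        ('/url1/filter3.value3', 13), ('/url1/filter3.value4', 14)]

# Per filter.value: strip the /url.../ prefix, split on commas, sum views.
by_value = defaultdict(int)
for path, views in data:
    for f in re.sub(r'^/url.*/(.*\..*)$', r'\1', path).split(','):
        by_value[f] += views

# Per filter (without value): re-group by the part before the dot.
by_filter = defaultdict(int)
for f, views in by_value.items():
    by_filter[f.split('.')[0]] += views

print(by_value['filter1.value2'])  # 16
print(by_filter['filter3'])        # 75
```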
How can I make sure that with this join I'll only receive the sum of results and not the product?
I have a project entity which contains two one-to-many relations, disposals and supplies. If I query disposals with the following query:
SELECT *
FROM projects
JOIN disposals disposal on projects.project_id = disposal.disposal_project_refer
WHERE (projects.project_name = 'Höngg')
I get following result:
project_id,project_name,disposal_id,depository_refer,material_refer,disposal_date,disposal_measurement,disposal_project_refer
1,Test,1,1,1,2020-08-12 15:24:49.913248,123,1
1,Test,2,1,2,2020-08-12 15:24:49.913248,123,1
1,Test,7,2,1,2020-08-12 15:24:49.913248,123,1
1,Test,10,3,4,2020-08-12 15:24:49.913248,123,1
The same number of results is returned by the same query for supplies.
type Project struct {
    ProjectID   uint       `gorm:"primary_key" json:"ProjectID"`
    ProjectName string     `json:"ProjectName"`
    Disposals   []Disposal `gorm:"ForeignKey:disposal_project_refer"`
    Supplies    []Supply   `gorm:"ForeignKey:supply_project_refer"`
}
If I query both tables I would like to receive the sum of both single queries. Currently I am receiving 16 results (4 supply results multiplied by 4 disposal results).
The combined query:
SELECT *
FROM projects
JOIN disposals disposal ON projects.project_id = disposal.disposal_project_refer
JOIN supplies supply ON projects.project_id = supply.supply_project_refer
WHERE (projects.project_name = 'Höngg');
I have tried achieving my goal with union queries but I was not successful. What else should I try to achieve my goal?
It is your case (simplified):
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222))
select * from a join b on (a.x=b.x) join c on (b.x=c.x);
┌───┬───┬───┬────┬───┬─────┐
│ x │ y │ x │ z │ x │ t │
├───┼───┼───┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 11 │ 1 │ 111 │
│ 1 │ 1 │ 1 │ 11 │ 1 │ 222 │
│ 1 │ 1 │ 1 │ 22 │ 1 │ 111 │
│ 1 │ 1 │ 1 │ 22 │ 1 │ 222 │
└───┴───┴───┴────┴───┴─────┘
It produces a cartesian join because the value used for joining is the same in all tables. You need some additional condition for joining your data. For example (tests for various cases):
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬────┬───┬────┬────┬───┬─────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼────┼───┼────┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
└───┴───┴────┴───┴────┴────┴───┴─────┘
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22),(1,33)), c(x,t) as (values(1,111),(1,222))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬────┬───┬─────┬──────┬──────┬──────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼────┼───┼─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
│ 1 │ 1 │ 3 │ 1 │ 33 │ ░░░░ │ ░░░░ │ ░░░░ │
└───┴───┴────┴───┴─────┴──────┴──────┴──────┘
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222),(1,333))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬──────┬──────┬──────┬────┬───┬─────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼──────┼──────┼──────┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
│ 1 │ 1 │ ░░░░ │ ░░░░ │ ░░░░ │ 3 │ 1 │ 333 │
└───┴───┴──────┴──────┴──────┴────┴───┴─────┘
db<>fiddle
Note that there is no obvious relation between disposals and supplies (b and c in my example), so the order of both could be random. In my opinion, a better solution for this task is to aggregate the data from those tables using JSON, for example:
with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22),(1,33)), c(x,t) as (values(1,111),(1,222))
select
*,
(select json_agg(to_json(b.*)) from b where a.x=b.x) as b,
(select json_agg(to_json(c.*)) from c where a.x=c.x) as c
from a;
┌───┬───┬──────────────────────────────────────────────────┬────────────────────────────────────┐
│ x │ y │ b │ c │
├───┼───┼──────────────────────────────────────────────────┼────────────────────────────────────┤
│ 1 │ 1 │ [{"x":1,"z":11}, {"x":1,"z":22}, {"x":1,"z":33}] │ [{"x":1,"t":111}, {"x":1,"t":222}] │
└───┴───┴──────────────────────────────────────────────────┴────────────────────────────────────┘
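The row multiplication itself is easy to reproduce outside the database; a small Python sketch of what the double join does with the a/b/c sample data:

```python
# Joining one parent row to two independent one-to-many child lists
# yields the cross product of the children, not their union.
a = [(1, 1)]              # (x, y) - the project
b = [(1, 11), (1, 22)]    # (x, z) - e.g. disposals
c = [(1, 111), (1, 222)]  # (x, t) - e.g. supplies

joined = [(ay, bz, ct)
          for ax, ay in a
          for bx, bz in b if bx == ax
          for cx, ct in c if cx == ax]

print(len(joined))  # 4 = 2 disposals x 2 supplies
```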
Is it possible with ClickHouse to have a result containing a pair of arrays transformed into columns?
From this result:
┌─f1──┬─f2──────┬─f3────────────┐
│ 'a' │ [1,2,3] │ ['x','y','z'] │
│ 'b' │ [4,5,6] │ ['x','y','z'] │
└─────┴─────────┴───────────────┘
to:
┌─f1──┬─x─┬─y─┬─z─┐
│ 'a' │ 1 │ 2 │ 3 │
│ 'b' │ 4 │ 5 │ 6 │
└─────┴───┴───┴───┘
The idea is to not have to repeat the header values for each line. In my case, the "header" array f3 is unique per query and is joined to f1 and f2.
You can do it with the help of the indexOf function.
SELECT *
FROM test_sof
┌─f1─┬─f2──────┬─f3────────────┐
│ a │ [1,2,3] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
┌─f1─┬─f2────────┬─f3────────────────┐
│ c │ [7,8,9,0] │ ['x','y','z','n'] │
└────┴───────────┴───────────────────┘
┌─f1─┬─f2─────────┬─f3────────────────┐
│ d │ [7,8,9,11] │ ['x','y','z','n'] │
└────┴────────────┴───────────────────┘
┌─f1─┬─f2──────┬─f3────────────┐
│ b │ [4,5,6] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
4 rows in set. Elapsed: 0.001 sec.
Then:
SELECT
f1,
f2[indexOf(f3, 'x')] AS x,
f2[indexOf(f3, 'y')] AS y,
f2[indexOf(f3, 'z')] AS z,
f2[indexOf(f3, 'n')] AS n
FROM test_sof
ORDER BY
f1 ASC,
x ASC
┌─f1─┬─x─┬─y─┬─z─┬──n─┐
│ a │ 1 │ 2 │ 3 │ 0 │
│ b │ 4 │ 5 │ 6 │ 0 │
│ c │ 7 │ 8 │ 9 │ 0 │
│ d │ 7 │ 8 │ 9 │ 11 │
└────┴───┴───┴───┴────┘
4 rows in set. Elapsed: 0.002 sec.
Keep in mind the situation where an index from the header array is not present in the data array, or vice versa.
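The key subtlety is that ClickHouse arrays are 1-based, indexOf returns 0 for a missing element, and arr[0] yields the type's default value (0 for numbers), which is why the missing 'n' header produces 0 above. A minimal plain-Python model of those semantics (illustrative, not ClickHouse code):

```python
def index_of(arr, x):
    """1-based position of x in arr, or 0 if absent (like ClickHouse indexOf)."""
    return arr.index(x) + 1 if x in arr else 0

def at(arr, i, default=0):
    """arr[i] with ClickHouse semantics: 1-based, default value when i == 0."""
    return arr[i - 1] if 1 <= i <= len(arr) else default

f2, f3 = [1, 2, 3], ['x', 'y', 'z']
print(at(f2, index_of(f3, 'x')))  # 1
print(at(f2, index_of(f3, 'n')))  # 0  (header absent -> default)
```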
UPD: here is how to get the data without knowing the "headers".
You will get three columns, the third one containing the headers.
SELECT
f1,
f2[num] AS f2_el,
f3[num] AS f3_el
FROM test_sof
ARRAY JOIN arrayEnumerate(f2) AS num
ORDER BY f1 ASC
┌─f1─┬─f2_el─┬─f3_el─┐
│ a │ 1 │ x │
│ a │ 2 │ y │
│ a │ 3 │ z │
│ b │ 4 │ x │
│ b │ 5 │ y │
│ b │ 6 │ z │
│ c │ 7 │ x │
│ c │ 8 │ y │
│ c │ 9 │ z │
│ c │ 0 │ n │
│ d │ 7 │ x │
│ d │ 8 │ y │
│ d │ 9 │ z │
│ d │ 11 │ n │
└────┴───────┴───────┘
14 rows in set. Elapsed: 0.006 sec.
This is a fun puzzle. As pointed out already, the indexOf() function seems to be the best way to pivot array columns inside ClickHouse, but it requires explicit selection of array positions. If you are using Python and your result set is not absurdly large, you can solve the problem in a more general way by flipping the array values into rows in SQL, then pivoting columns f2 and f3 in Python. Here's how it works.
First, use clickhouse-sqlalchemy and pandas to expand the matching arrays into rows as follows. (This example uses Jupyter Notebook running on Anaconda.)
# Load SQL Alchemy and connect to ClickHouse
from sqlalchemy import create_engine
%load_ext sql
%sql clickhouse://default:#localhost/default
# Use JOIN ARRAY to flip corresponding positions in f2, f3 to rows.
result = %sql select * from f array join f2, f3
df = result.DataFrame()
print(df)
The data frame appears as follows:
f1 f2 f3
0 a 1 x
1 a 2 y
2 a 3 z
3 b 4 x
4 b 5 y
5 b 6 z
Now we can pivot f2 and f3 into a new data frame.
dfp = df.pivot(columns='f3', values='f2', index='f1')
print(dfp)
The new dataframe dfp appears as follows:
f3 x y z
f1
a 1 2 3
b 4 5 6
This solution requires you to work outside the database but has the advantage that it works generally for any set of arrays as long as the names and values match. For instance if we add another row with different values and properties the same code gets the right answer. Here's a new row.
insert into f values ('c', [7,8,9,10], ['x', 'y', 'aa', 'bb'])
The pivoted data frame will appear as follows. NaN corresponds to missing values.
f3 aa bb x y z
f1
a NaN NaN 1.0 2.0 3.0
b NaN NaN 4.0 5.0 6.0
c 9.0 10.0 7.0 8.0 NaN
For more information on this solution see https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html and https://github.com/xzkostyan/clickhouse-sqlalchemy.
I have three tables in Google Bigquery:
t1) ID1, ID2
t2) ID1, Keywords (500,000 rows)
t3) ID2, Keywords (3 million rows)
The observations of ID1 have been matched/linked with observations in ID2, each observation has a number of keywords.
I want to know about the overlap in keywords between the matched ID1's and ID2's.
t1
┌─────┬─────┐
│ ID1 │ ID2 │
├─────┼─────┤
│ 1   │ A   │
│ 1   │ B   │
│ 1   │ C   │
│ 1   │ D   │
│ 2   │ E   │
│ 2   │ F   │
│ 2   │ G   │
│ 2   │ H   │
│ 3   │ I   │
│ 3   │ J   │
│ 3   │ K   │
│ 3   │ L   │
│ 4   │ M   │
│ 4   │ N   │
│ 4   │ O   │
│ 4   │ P   │
└─────┴─────┘
t2
┌─────┬────────────┐
│ ID1 │ KEYWORD    │
├─────┼────────────┤
│ 1   │ KEYWORD 1  │
│ 1   │ KEYWORD 2  │
│ 1   │ KEYWORD 3  │
│ 1   │ KEYWORD 4  │
│ 2   │ KEYWORD 2  │
│ 2   │ KEYWORD 3  │
│ 2   │ KEYWORD 6  │
│ 2   │ KEYWORD 8  │
│ 3   │ KEYWORD 10 │
│ 3   │ KEYWORD 64 │
│ 3   │ KEYWORD 42 │
│ 3   │ KEYWORD 39 │
│ 4   │ KEYWORD 18 │
│ 4   │ KEYWORD 33 │
│ 4   │ KEYWORD 52 │
│ 4   │ KEYWORD 24 │
└─────┴────────────┘
t3
┌─────┬─────────────┐
│ ID2 │ KEYWORD     │
├─────┼─────────────┤
│ A   │ KEYWORD 1   │
│ A   │ KEYWORD 2   │
│ A   │ KEYWORD 54  │
│ A   │ KEYWORD 34  │
│ B   │ KEYWORD 32  │
│ B   │ KEYWORD 876 │
│ B   │ KEYWORD 632 │
│ B   │ KEYWORD 2   │
│ K   │ KEYWORD 53  │
│ K   │ KEYWORD 43  │
│ K   │ KEYWORD 10  │
│ K   │ KEYWORD 64  │
│ P   │ KEYWORD 56  │
│ P   │ KEYWORD 44  │
│ P   │ KEYWORD 322 │
│ P   │ KEYWORD 99  │
└─────┴─────────────┘
As the tables show, ID1 (1) is matched to ID2 (A). Both have KEYWORD 1 and KEYWORD 2, so a total of 2 keywords overlap between the matched observations, which in this case (as ID1 (1) has 4 keywords in total) is 50% overlap.
I am looking to make the following table, where every row in t1 gets additional columns MATCH COUNT and MATCH PERCENTAGE.
┌───────────────────────────────────────────────┐
│ RESULT │
├───────────────────────────────────────────────┤
│ ID │ ID2 │ MATCH COUNT │ MATCH PERCENTAGE │
│ 1 │ A │ 2 │ 50% │
│ 1 │ B │ 1 │ 25% │
│(...) │(...)│ (...) │ (...) │
│ 3 │ K │ 2 │ 50% │
│ 4 │ P │ 0 │ 0% │
└────────┴─────┴─────────────┴──────────────────┘
I know it is good etiquette to show what I've already done, but honestly this one is way over my head and I don't even know where to start. I am hoping that somebody can point me in the right direction.
You can do this using join and group by:
select t1.id1, t1.id2,
       count(t3.keyword) as num_matches,
       count(t3.keyword) / count(*) as proportion_matches
from t1 left join
     t2
     on t1.id1 = t2.id1 left join
     t3
     on t1.id2 = t3.id2 and
        t2.keyword = t3.keyword
group by t1.id1, t1.id2;
This assumes that the keywords are unique for each id.
I think this is a solution:
select Id1, Id2, Sum(Match) Match, Sum(Match) / Sum(Total) as Perc
from (
select t2.Id1, t2.Id2, Decode(t1.Keyword, t3.Keyword, 1, 0) Match, 1 Total
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t2.Id2 = t3.Id2)
)
group by Id1, Id2
If you don't have the Decode function, you can use CASE:
case when t1.Keyword = t3.Keyword then 1 else 0 end
Easier:
select t1.Id1, t1.Id2,
       Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) Match,
       Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) / Count(1) Perc
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2
BigQuery has the function CountIf, which you can also use:
select t1.Id1, t1.Id2, CountIf(t2.Keyword = t3.Keyword) Match, CountIf(t2.Keyword = t3.Keyword) / Count(1) Perc
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2
Below is for BigQuery Standard SQL
#standardSQL
SELECT t1.id1, t1.id2,
COUNTIF(t2.keyword = t3.keyword) match_count,
COUNTIF(t2.keyword = t3.keyword) / COUNT(DISTINCT t2.keyword) match_percentage
FROM t2 CROSS JOIN t3
JOIN t1 ON t1.id1 = t2.id1 AND t1.id2 = t3.id2
GROUP BY t1.id1, t1.id2
-- ORDER BY t1.id1, t1.id2
with the result as below:
Row id1 id2 match_count match_percentage
1 1 A 2 0.5
2 1 B 1 0.25
3 3 K 2 0.5
4 4 P 0 0.0
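The logic shared by these answers (count the matching keywords per matched pair, then divide by the number of keywords of ID1) can be sketched in plain Python over the sample rows (illustrative only):

```python
t1 = [(1, 'A'), (1, 'B'), (3, 'K'), (4, 'P')]
t2 = {1: {'KEYWORD 1', 'KEYWORD 2', 'KEYWORD 3', 'KEYWORD 4'},
      3: {'KEYWORD 10', 'KEYWORD 64', 'KEYWORD 42', 'KEYWORD 39'},
      4: {'KEYWORD 18', 'KEYWORD 33', 'KEYWORD 52', 'KEYWORD 24'}}
t3 = {'A': {'KEYWORD 1', 'KEYWORD 2', 'KEYWORD 54', 'KEYWORD 34'},
      'B': {'KEYWORD 32', 'KEYWORD 876', 'KEYWORD 632', 'KEYWORD 2'},
      'K': {'KEYWORD 53', 'KEYWORD 43', 'KEYWORD 10', 'KEYWORD 64'},
      'P': {'KEYWORD 56', 'KEYWORD 44', 'KEYWORD 322', 'KEYWORD 99'}}

results = []
for id1, id2 in t1:
    overlap = t2[id1] & t3[id2]          # keywords shared by the pair
    results.append((id1, id2, len(overlap), len(overlap) / len(t2[id1])))

for row in results:
    print(row)
```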