I'm trying to unpivot multiple columns in my dataset. Here's what my data looks like.
CREATE TABLE T5 (idnum NUMBER,
                 f1 NUMBER(10,5), f2 NUMBER(10,5), f3 NUMBER(10,5),
                 e1 NUMBER(10,5), e2 NUMBER(10,5),
                 h1 NUMBER(10,5), h2 NUMBER(10,5));
INSERT INTO T5 (IDNUM,F1,F2,F3,E1,E2,H1,H2)
VALUES (1,'10.2004','5.009','7.330','9.008','8.003','.99383','1.43243');
INSERT INTO T5 (IDNUM,F1,F2,F3,E1,E2,H1,H2)
VALUES (2,'4.2004','6.009','9.330','4.7008','4.60333','1.993','3.3243');
INSERT INTO T5 (IDNUM,F1,F2,F3,E1,E2,H1,H2)
VALUES (3,'10.2040','52.6009','67.330','9.5008','8.003','.99383','1.43243');
INSERT INTO T5 (IDNUM,F1,F2,F3,E1,E2,H1,H2)
VALUES (4,'9.20704','45.009','17.330','29.008','5.003','3.9583','1.243');
COMMIT;
select * from t5;
IDNUM  F1       F2       F3     E1      E2       H1       H2
    1  10.2004  5.009    7.33   9.008   8.003    0.99383  1.43243
    2  4.2004   6.009    9.33   4.7008  4.60333  1.993    3.3243
    3  10.204   52.6009  67.33  9.5008  8.003    0.99383  1.43243
    4  9.20704  45.009   17.33  29.008  5.003    3.9583   1.243
I'm unpivoting like so...
select *
from (select IDNUM,F1,F2,F3,E1,E2,H1,H2,
null as E3,null as H3
from T5)
UnPivot((F,E,H) for sk in ((F1,E1,H1) as 1,
(F2,E2,H2) as 2,
(F3,E3,H3) as 3))
order by IDNUM,SK;
IDNUM SK F E H
----- -- ------- ------- -------
1 1 10.2004 9.008 .99383
1 2 5.009 8.003 1.43243
1 3 7.33 null null
2 1 4.2004 4.7008 1.993
2 2 6.009 4.60333 3.3243
2 3 9.33 null null
3 1 10.204 9.5008 .99383
3 2 52.6009 8.003 1.43243
3 3 67.33 null null
4 1 9.20704 29.008 3.9583
4 2 45.009 5.003 1.243
4 3 17.33 null null
But what I really need is as follows...
IDNUM SK F E H F_COL_NAME
----- -- ------- ------- ------- ----------
1 1 10.2004 9.008 .99383 F1
1 2 5.009 8.003 1.43243 F2
1 3 7.33 null null F3
2 1 4.2004 4.7008 1.993 F1
2 2 6.009 4.60333 3.3243 F2
2 3 9.33 null null F3
3 1 10.204 9.5008 .99383 F1
3 2 52.6009 8.003 1.43243 F2
3 3 67.33 null null F3
4 1 9.20704 29.008 3.9583 F1
4 2 45.009 5.003 1.243 F2
4 3 17.33 null null F3
How can I do this?
Change your UNPIVOT to be like this:
select *
from (
select IDNUM,F1,F2,F3,E1,E2,H1,H2,
null as E3,null as H3
from T5
) A
UnPivot(
(F,E,H) for sk in (
(F1,E1,H1) as 'F1',
(F2,E2,H2) as 'F2',
(F3,E3,H3) as 'F3')
)
order by IDNUM,SK
This should do the trick.
Just select idnum, sk, f, e, h, 'F'||SK as col_name ... You need to list the columns explicitly instead of using an asterisk.
Like this: http://sqlfiddle.com/#!4/12446/21
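A minimal sketch of that suggestion against the question's original query, keeping the numeric SK labels ('F'||SK relies on Oracle's implicit number-to-string conversion):
select idnum, sk, f, e, h, 'F' || sk as f_col_name
from (select IDNUM, F1, F2, F3, E1, E2, H1, H2,
             null as E3, null as H3
      from T5)
unpivot ((F, E, H) for sk in ((F1, E1, H1) as 1,
                              (F2, E2, H2) as 2,
                              (F3, E3, H3) as 3))
order by idnum, sk;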
If you need to store the result of the UNPIVOT, you could use INSERT ALL:
CREATE TABLE T5_unpiv(IDNUM NUMBER, SK NUMBER, F NUMBER, E NUMBER, H NUMBER,
                      F_COL_NAME VARCHAR2(100));
INSERT ALL
INTO T5_unpiv(IDNUM,SK,F,E,H,F_COL_NAME) VALUES(idnum,1,f1,e1,h1,'F1')
INTO T5_unpiv(IDNUM,SK,F,E,H,F_COL_NAME) VALUES(idnum,2,f2,e2,h2,'F2')
INTO T5_unpiv(IDNUM,SK,F,E,H,F_COL_NAME) VALUES(idnum,3,f3,NULL,NULL,'F3')
SELECT * FROM T5;
SELECT * FROM T5_unpiv;
Output:
┌───────┬────┬─────────┬─────────┬─────────┬────────────┐
│ IDNUM │ SK │ F │ E │ H │ F_COL_NAME │
├───────┼────┼─────────┼─────────┼─────────┼────────────┤
│ 1 │ 1 │ 10.2004 │ 9.008 │ .99383 │ F1 │
│ 1 │ 2 │ 5.009 │ 8.003 │ 1.43243 │ F2 │
│ 1 │ 3 │ 7.33 │ null │ null │ F3 │
│ 2 │ 1 │ 4.2004 │ 4.7008 │ 1.993 │ F1 │
│ 2 │ 2 │ 6.009 │ 4.60333 │ 3.3243 │ F2 │
│ 2 │ 3 │ 9.33 │ null │ null │ F3 │
│ 3 │ 1 │ 10.204 │ 9.5008 │ .99383 │ F1 │
│ 3 │ 2 │ 52.6009 │ 8.003 │ 1.43243 │ F2 │
│ 3 │ 3 │ 67.33 │ null │ null │ F3 │
│ 4 │ 1 │ 9.20704 │ 29.008 │ 3.9583 │ F1 │
│ 4 │ 2 │ 45.009 │ 5.003 │ 1.243 │ F2 │
│ 4 │ 3 │ 17.33 │ null │ null │ F3 │
└───────┴────┴─────────┴─────────┴─────────┴────────────┘
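If the target table doesn't exist yet, a CTAS sketch would also work (a hypothetical T5_unpiv2 here; the CASTs let Oracle infer column types for the NULL placeholders, and 'F' || sk builds the column name as above):
CREATE TABLE T5_unpiv2 AS
SELECT idnum, sk, f, e, h, 'F' || sk AS f_col_name
FROM (SELECT idnum, f1, f2, f3, e1, e2, h1, h2,
             CAST(NULL AS NUMBER) AS e3, CAST(NULL AS NUMBER) AS h3
      FROM t5)
UNPIVOT ((f, e, h) FOR sk IN ((f1, e1, h1) AS 1,
                              (f2, e2, h2) AS 2,
                              (f3, e3, h3) AS 3));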
Try this:
select *
from (select IDNUM, F1, F2, F3, E1, E2, H1, H2,
             null as E3, null as H3
      from T5)
unpivot ((F, E, H) for sk in ((F1, E1, H1) as 'F1',
                              (F2, E2, H2) as 'F2',
                              (F3, E3, H3) as 'F3'))
order by IDNUM, SK;
I'm trying to figure out how one can build in ClickHouse the column named "What I want" in the table below:
┌──────────┬────────────┬─────────────┬─────────────┐
│ Category │ Row Number │ What I have │ What I want │
├──────────┼────────────┼─────────────┼─────────────┤
│ A        │ 1          │ 0           │ 0           │
│ A        │ 2          │ 1           │ 1           │
│ B        │ 3          │ 0           │ 1           │
│ B        │ 4          │ 0           │ 1           │
│ A        │ 5          │ 3           │ 3           │
│ B        │ 6          │ 0           │ 3           │
│ B        │ 7          │ 0           │ 3           │
│ A        │ 8          │ 2           │ 2           │
│ B        │ 9          │ 0           │ 2           │
└──────────┴────────────┴─────────────┴─────────────┘
There are two categories, A and B, and I want the B category to 'remember' the latest value from the A category.
There's a column by which all records are ordered: Row Number.
I've found a function, arrayFill, which looks promising, but unfortunately it isn't supported by my server version (19.14.11.16) and there's no chance it'll be updated soon.
I guess there should be some trick with ClickHouse arrays, but I didn't manage to find a way. Is there any ClickHouse ninja who could give me a hint on how to deal with this?
P.S. In fact the B category isn't zero-filled; I present it that way just to simplify the problem a little.
create table z(c String, rn Int64, hv Int64) Engine=Memory;
insert into z values ('A',1,0),('A',2,1),('B',3,0),('B',4,0),('A',5,3),('B',6,0),('B',7,0),('A',8,2),('B',9,0);
select (arrayJoin(flatten(arrayMap(
           j -> arrayMap(m -> if(m.1 = 'B', (m.1, m.2, ga1[j - 1][-1].3), m), ga1[j]),
           arrayEnumerate(arraySplit((k, i) -> ga[i].1 <> ga[i - 1].1,
                                     (groupArray((c, rn, hv)) as ga),
                                     arrayEnumerate(ga)) as ga1)))) as r).1 _c,
       r.2 _rn, r.3 _n
from (select * from z order by rn)
┌─_c─┬─_rn─┬─_n─┐
│ A │ 1 │ 0 │
│ A │ 2 │ 1 │
│ B │ 3 │ 1 │
│ B │ 4 │ 1 │
│ A │ 5 │ 3 │
│ B │ 6 │ 3 │
│ B │ 7 │ 3 │
│ A │ 8 │ 2 │
│ B │ 9 │ 2 │
└────┴─────┴────┘
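On a server where arrayFill is available (it isn't on 19.14, as noted in the question), a much shorter sketch does the same thing, filling hv for the 'B' rows from the previous row in rn order:
SELECT p.1 AS _c, p.2 AS _rn, n AS _n
FROM
(
    SELECT
        groupArray((c, rn)) AS pairs,
        -- replace hv[i] with the already-filled hv[i-1] wherever the category is 'B'
        arrayFill((v, cat) -> cat <> 'B', groupArray(hv), groupArray(c)) AS filled
    FROM (SELECT * FROM z ORDER BY rn)
)
ARRAY JOIN pairs AS p, filled AS n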
Let us say that I have a table with user_id of type Int32 and login_time as a DateTime in UTC. user_id is not unique, so SELECT user_id, login_time FROM some_table; gives the following result:
┌─user_id─┬──login_time─┐
│ 1 │ 2021-03-01 │
│ 1 │ 2021-03-01 │
│ 1 │ 2021-03-02 │
│ 2 │ 2021-03-02 │
│ 2 │ 2021-03-03 │
└─────────┴─────────────┘
If I run SELECT COUNT(*) as count, toDate(login_time) as l FROM some_table GROUP BY l, I get the following result:
┌─count───┬──login_time─┐
│ 2 │ 2021-03-01 │
│ 2 │ 2021-03-02 │
│ 1 │ 2021-03-03 │
└─────────┴─────────────┘
I would like to reformat the result to show COUNT at a weekly level instead of daily, as I currently do.
My result for the above example could look something like this:
┌──count──┬──year─┬──month──┬─week ordinal┐
│ 5 │ 2021 │ 03 │ 1 │
│ 0 │ 2021 │ 03 │ 2 │
│ 0 │ 2021 │ 03 │ 3 │
│ 0 │ 2021 │ 03 │ 4 │
└─────────┴───────┴─────────┴─────────────┘
I have gone through the documentation and found some interesting functions, but did not manage to make them solve my problem.
I have never worked with ClickHouse before and am not very experienced with SQL, which is why I ask here for help.
Try this query:
select count() count, toYear(start_of_month) year, toMonth(start_of_month) month,
toWeek(start_of_week) - toWeek(start_of_month) + 1 AS "week ordinal"
from (
select *, toStartOfMonth(login_time) start_of_month,
toStartOfWeek(login_time) start_of_week
from (
/* emulate test dataset */
select data.1 user_id, toDate(data.2) login_time
from (
select arrayJoin([
(1, '2021-02-27'),
(1, '2021-02-28'),
(1, '2021-03-01'),
(1, '2021-03-01'),
(1, '2021-03-02'),
(2, '2021-03-02'),
(2, '2021-03-03'),
(2, '2021-03-08'),
(2, '2021-03-16'),
(2, '2021-04-01')]) data)
)
)
group by start_of_month, start_of_week
order by start_of_month, start_of_week
/*
┌─count─┬─year─┬─month─┬─week ordinal─┐
│ 1 │ 2021 │ 2 │ 4 │
│ 1 │ 2021 │ 2 │ 5 │
│ 5 │ 2021 │ 3 │ 1 │
│ 1 │ 2021 │ 3 │ 2 │
│ 1 │ 2021 │ 3 │ 3 │
│ 1 │ 2021 │ 4 │ 1 │
└───────┴──────┴───────┴──────────────┘
*/
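If plain calendar ISO weeks are acceptable instead of a per-month week ordinal, a simpler hedged variant (same table and column names as the question) would be:
select count() as count,
       toISOYear(login_time) as year,
       toISOWeek(login_time) as week
from some_table
group by year, week
order by year, week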
How can I make sure that with this join I'll only receive the sum of results and not the product?
I have a project entity which contains two one-to-many relations, disposals and supplies. If I query disposals
with the following query:
SELECT *
FROM projects
JOIN disposals disposal on projects.project_id = disposal.disposal_project_refer
WHERE (projects.project_name = 'Höngg')
I get the following result:
project_id,project_name,disposal_id,depository_refer,material_refer,disposal_date,disposal_measurement,disposal_project_refer
1,Test,1,1,1,2020-08-12 15:24:49.913248,123,1
1,Test,2,1,2,2020-08-12 15:24:49.913248,123,1
1,Test,7,2,1,2020-08-12 15:24:49.913248,123,1
1,Test,10,3,4,2020-08-12 15:24:49.913248,123,1
The same query for supplies returns the same number of rows.
type Project struct {
ProjectID uint `gorm:"primary_key" json:"ProjectID"`
ProjectName string `json:"ProjectName"`
Disposals []Disposal `gorm:"ForeignKey:disposal_project_refer"`
Supplies []Supply `gorm:"ForeignKey:supply_project_refer"`
}
If I query both tables I would like to receive the sum of both single queries. Currently I am receiving 16 results (4 supply results multiplied by 4 disposal results).
The combined query:
SELECT *
FROM projects
JOIN disposals disposal ON projects.project_id = disposal.disposal_project_refer
JOIN supplies supply ON projects.project_id = supply.supply_project_refer
WHERE (projects.project_name = 'Höngg');
I have tried achieving my goal with union queries but I was not successful. What else should I try to achieve my goal?
This is your case, simplified:
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222))
select * from a join b on (a.x=b.x) join c on (b.x=c.x);
┌───┬───┬───┬────┬───┬─────┐
│ x │ y │ x │ z │ x │ t │
├───┼───┼───┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 11 │ 1 │ 111 │
│ 1 │ 1 │ 1 │ 11 │ 1 │ 222 │
│ 1 │ 1 │ 1 │ 22 │ 1 │ 111 │
│ 1 │ 1 │ 1 │ 22 │ 1 │ 222 │
└───┴───┴───┴────┴───┴─────┘
It produces a cartesian product because the join value is the same in all tables. You need some additional condition for joining your data. For example (tests for various cases):
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬────┬───┬────┬────┬───┬─────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼────┼───┼────┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
└───┴───┴────┴───┴────┴────┴───┴─────┘
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22),(1,33)), c(x,t) as (values(1,111),(1,222))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬────┬───┬─────┬──────┬──────┬──────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼────┼───┼─────┼──────┼──────┼──────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
│ 1 │ 1 │ 3 │ 1 │ 33 │ ░░░░ │ ░░░░ │ ░░░░ │
└───┴───┴────┴───┴─────┴──────┴──────┴──────┘
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222),(1,333))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬──────┬──────┬──────┬────┬───┬─────┐
│ x │ y │ rn │ x │ z │ rn │ x │ t │
├───┼───┼──────┼──────┼──────┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 1 │ 11 │ 1 │ 1 │ 111 │
│ 1 │ 1 │ 2 │ 1 │ 22 │ 2 │ 1 │ 222 │
│ 1 │ 1 │ ░░░░ │ ░░░░ │ ░░░░ │ 3 │ 1 │ 333 │
└───┴───┴──────┴──────┴──────┴────┴───┴─────┘
Note that there is no obvious relation between disposals and supplies (b and c in my example), so the order of both could be random. In my opinion, a better solution for this task could be to aggregate the data from those tables, using JSON for example:
with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22),(1,33)), c(x,t) as (values(1,111),(1,222))
select
*,
(select json_agg(to_json(b.*)) from b where a.x=b.x) as b,
(select json_agg(to_json(c.*)) from c where a.x=c.x) as c
from a;
┌───┬───┬──────────────────────────────────────────────────┬────────────────────────────────────┐
│ x │ y │ b │ c │
├───┼───┼──────────────────────────────────────────────────┼────────────────────────────────────┤
│ 1 │ 1 │ [{"x":1,"z":11}, {"x":1,"z":22}, {"x":1,"z":33}] │ [{"x":1,"t":111}, {"x":1,"t":222}] │
└───┴───┴──────────────────────────────────────────────────┴────────────────────────────────────┘
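Applied to the original schema, the JSON approach might look like this (a sketch; table and column names taken from the question):
SELECT p.*,
       (SELECT json_agg(to_json(d.*)) FROM disposals d
         WHERE d.disposal_project_refer = p.project_id) AS disposals,
       (SELECT json_agg(to_json(s.*)) FROM supplies s
         WHERE s.supply_project_refer = p.project_id) AS supplies
FROM projects p
WHERE p.project_name = 'Höngg';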
Is it possible with ClickHouse to have a result containing a pair of arrays transformed into columns?
From this result:
┌─f1──┬f2───────┬f3─────────────┐
│ 'a' │ [1,2,3] │ ['x','y','z'] │
│ 'b' │ [4,5,6] │ ['x','y','z'] │
└─────┴─────────┴───────────────┘
to:
┌─f1──┬x──┬y──┬z──┐
│ 'a' │ 1 │ 2 │ 3 │
│ 'b' │ 4 │ 5 │ 6 │
└─────┴───┴───┴───┘
The idea is to not have to repeat the header values for each line.
In my case, the "header" array f3 is unique per query and joined to f1 and f2.
You can do it with the help of the indexOf function.
SELECT *
FROM test_sof
┌─f1─┬─f2──────┬─f3────────────┐
│ a │ [1,2,3] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
┌─f1─┬─f2────────┬─f3────────────────┐
│ c │ [7,8,9,0] │ ['x','y','z','n'] │
└────┴───────────┴───────────────────┘
┌─f1─┬─f2─────────┬─f3────────────────┐
│ d │ [7,8,9,11] │ ['x','y','z','n'] │
└────┴────────────┴───────────────────┘
┌─f1─┬─f2──────┬─f3────────────┐
│ b │ [4,5,6] │ ['x','y','z'] │
└────┴─────────┴───────────────┘
4 rows in set. Elapsed: 0.001 sec.
Then:
SELECT
f1,
f2[indexOf(f3, 'x')] AS x,
f2[indexOf(f3, 'y')] AS y,
f2[indexOf(f3, 'z')] AS z,
f2[indexOf(f3, 'n')] AS n
FROM test_sof
ORDER BY
f1 ASC,
x ASC
┌─f1─┬─x─┬─y─┬─z─┬──n─┐
│ a │ 1 │ 2 │ 3 │ 0 │
│ b │ 4 │ 5 │ 6 │ 0 │
│ c │ 7 │ 8 │ 9 │ 0 │
│ d │ 7 │ 8 │ 9 │ 11 │
└────┴───┴───┴───┴────┘
4 rows in set. Elapsed: 0.002 sec.
Keep in mind the situation where an index from the header array is not present in the data array, or vice versa.
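For example (a sketch on the same test_sof table): indexOf returns 0 for a missing header, and f2[0] silently yields the type's default value (the 0 shown for row a above), so you can guard explicitly and return NULL instead:
SELECT
    f1,
    if(indexOf(f3, 'n') = 0, NULL, f2[indexOf(f3, 'n')]) AS n
FROM test_sof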
UPD: here is how to get the data without knowing the "headers".
You will get three columns, the third one containing the headers.
SELECT
f1,
f2[num] AS f2_el,
f3[num] AS f3_el
FROM test_sof
ARRAY JOIN arrayEnumerate(f2) AS num
ORDER BY f1 ASC
┌─f1─┬─f2_el─┬─f3_el─┐
│ a │ 1 │ x │
│ a │ 2 │ y │
│ a │ 3 │ z │
│ b │ 4 │ x │
│ b │ 5 │ y │
│ b │ 6 │ z │
│ c │ 7 │ x │
│ c │ 8 │ y │
│ c │ 9 │ z │
│ c │ 0 │ n │
│ d │ 7 │ x │
│ d │ 8 │ y │
│ d │ 9 │ z │
│ d │ 11 │ n │
└────┴───────┴───────┘
14 rows in set. Elapsed: 0.006 sec.
This is a fun puzzle. As pointed out already, the indexOf() function seems to be the best way to pivot array columns inside ClickHouse, but it requires explicit selection of array positions. If you are using Python and your result set is not absurdly large, you can solve the problem in a more general way by flipping the array values into rows in SQL, then pivoting columns f2 and f3 in Python. Here's how it works.
First, use clickhouse-sqlalchemy and pandas to expand the matching arrays into rows as follows. (This example uses Jupyter Notebook running on Anaconda.)
# Load SQL Alchemy and connect to ClickHouse
from sqlalchemy import create_engine
%load_ext sql
%sql clickhouse://default:@localhost/default
# Use ARRAY JOIN to flip corresponding positions in f2, f3 to rows.
result = %sql select * from f array join f2, f3
df = result.DataFrame()
print(df)
The data frame appears as follows:
f1 f2 f3
0 a 1 x
1 a 2 y
2 a 3 z
3 b 4 x
4 b 5 y
5 b 6 z
Now we can pivot f2 and f3 into a new data frame.
dfp = df.pivot(columns='f3', values='f2', index='f1')
print(dfp)
The new dataframe dfp appears as follows:
f3 x y z
f1
a 1 2 3
b 4 5 6
This solution requires you to work outside the database but has the advantage that it works generally for any set of arrays as long as the names and values match. For instance if we add another row with different values and properties the same code gets the right answer. Here's a new row.
insert into f values ('c', [7,8,9,10], ['x', 'y', 'aa', 'bb'])
The pivoted data frame will appear as follows. NaN corresponds to missing values.
f3 aa bb x y z
f1
a NaN NaN 1.0 2.0 3.0
b NaN NaN 4.0 5.0 6.0
c 9.0 10.0 7.0 8.0 NaN
For more information on this solution see https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html and https://github.com/xzkostyan/clickhouse-sqlalchemy.
I have three tables in Google Bigquery:
t1) ID1, ID2
t2) ID1, Keywords (500,000 rows)
t3) ID2, Keywords (3 million rows)
The observations of ID1 have been matched/linked with observations in ID2; each observation has a number of keywords.
I want to know about the overlap in keywords between the matched ID1's and ID2's.
t1
┌──────┬──────┐
│ ID1  │ ID2  │
├──────┼──────┤
│ 1    │ A    │
│ 1    │ B    │
│ 1    │ C    │
│ 1    │ D    │
│ 2    │ E    │
│ 2    │ F    │
│ 2    │ G    │
│ 2    │ H    │
│ 3    │ I    │
│ 3    │ J    │
│ 3    │ K    │
│ 3    │ L    │
│ 4    │ M    │
│ 4    │ N    │
│ 4    │ O    │
│ 4    │ P    │
└──────┴──────┘
t2
┌──────────────────────┐
│ TABLE 2 │
├──────────────────────┤
│ ID1 │ KEYWORD │
│ 1 │ KEYWORD 1 │
│ 1 │ KEYWORD 2 │
│ 1 │ KEYWORD 3 │
│ 1 │ KEYWORD 4 │
│ 2 │ KEYWORD 2 │
│ 2 │ KEYWORD 3 │
│ 2 │ KEYWORD 6 │
│ 2 │ KEYWORD 8 │
│ 3 │ KEYWORD 10 │
│ 3 │ KEYWORD 64 │
│ 3 │ KEYWORD 42 │
│ 3 │ KEYWORD 39 │
│ 4 │ KEYWORD 18 │
│ 4 │ KEYWORD 33 │
│ 4 │ KEYWORD 52 │
│ 4 │ KEYWORD 24 │
└─────────┴────────────┘
t3
┌───────────────────────┐
│ TABLE 3 │
├───────────────────────┤
│ ID2 │ KEYWORD │
│ A │ KEYWORD 1 │
│ A │ KEYWORD 2 │
│ A │ KEYWORD 54 │
│ A │ KEYWORD 34 │
│ B │ KEYWORD 32 │
│ B │ KEYWORD 876 │
│ B │ KEYWORD 632 │
│ B │ KEYWORD 2 │
│ K │ KEYWORD 53 │
│ K │ KEYWORD 43 │
│ K │ KEYWORD 10 │
│ K │ KEYWORD 64 │
│ P │ KEYWORD 56 │
│ P │ KEYWORD 44 │
│ P │ KEYWORD 322 │
│ P │ KEYWORD 99 │
└─────────┴─────────────┘
As the tables show, ID1 (1) is matched to ID2 (A). Both have KEYWORD 1 and KEYWORD 2, so there's a total of 2 keywords that overlap between the matched observations, which in this case (as ID1 (1) has 4 keywords in total) is 50% overlap.
I am looking to make the following table, where every row in t1 gets additional columns MATCH COUNT and MATCH PERCENTAGE.
┌───────────────────────────────────────────────┐
│ RESULT │
├───────────────────────────────────────────────┤
│ ID │ ID2 │ MATCH COUNT │ MATCH PERCENTAGE │
│ 1 │ A │ 2 │ 50% │
│ 1 │ B │ 1 │ 25% │
│(...) │(...)│ (...) │ (...) │
│ 3 │ K │ 2 │ 50% │
│ 4 │ P │ 0 │ 0% │
└────────┴─────┴─────────────┴──────────────────┘
I know it is good etiquette to show what I've already done, but honestly this one is way over my head and I don't even know where to start. I am hoping that somebody can point me in the right direction.
You can do this using join and group by:
select t1.id1, t1.id2,
       count(t3.keyword) as num_matches,
       count(t3.keyword) / count(*) as proportion_matches
from t1 left join
     t2
     on t1.id1 = t2.id1 left join
     t3
     on t1.id2 = t3.id2 and
        t2.keyword = t3.keyword
group by t1.id1, t1.id2;
This assumes that the keywords are unique for each id.
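If they are not unique, a hedged variant counts distinct keywords on both sides (SAFE_DIVIDE also guards against ids with no keywords at all):
select t1.id1, t1.id2,
       count(distinct t3.keyword) as num_matches,
       safe_divide(count(distinct t3.keyword),
                   count(distinct t2.keyword)) as proportion_matches
from t1 left join
     t2
     on t1.id1 = t2.id1 left join
     t3
     on t1.id2 = t3.id2 and
        t2.keyword = t3.keyword
group by t1.id1, t1.id2;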
I think this is a solution:
select Id1, Id2, Sum(Match) Match, Sum(Match) / Count(distinct Keyword) as Perc
from (
  select t2.Id1, t1.Id2, t2.Keyword, Decode(t2.Keyword, t3.Keyword, 1, 0) Match
  from t2
  inner join t1 on (t2.Id1 = t1.Id1)
  inner join t3 on (t1.Id2 = t3.Id2)
)
group by Id1, Id2
If you don't have the Decode function you can use CASE:
case when t2.Keyword = t3.Keyword then 1 else 0 end
Easier:
select t1.Id1, t1.Id2,
       Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) Match,
       Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) / Count(distinct t2.Keyword) Perc
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2
BigQuery also has the CountIf function, so you can use:
select t1.Id1, t1.Id2,
       CountIf(t2.Keyword = t3.Keyword) Match,
       CountIf(t2.Keyword = t3.Keyword) / Count(distinct t2.Keyword) Perc
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2
Below is for BigQuery Standard SQL
#standardSQL
SELECT t1.id1, t1.id2,
COUNTIF(t2.keyword = t3.keyword) match_count,
COUNTIF(t2.keyword = t3.keyword) / COUNT(DISTINCT t2.keyword) match_percentage
FROM t2 CROSS JOIN t3
JOIN t1 ON t1.id1 = t2.id1 AND t1.id2 = t3.id2
GROUP BY t1.id1, t1.id2
-- ORDER BY t1.id1, t1.id2
with the result as below:
Row  id1  id2  match_count  match_percentage
1    1    A    2            0.5
2    1    B    1            0.25
3    3    K    2            0.5
4    4    P    0            0.0
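On the stated table sizes (500,000 and 3 million rows) the keyword-level CROSS JOIN above can get expensive. A hedged alternative joins on the keyword itself and derives the per-id1 totals separately (still assuming keywords are unique per id):
#standardSQL
SELECT t1.id1, t1.id2,
  COUNT(t3.keyword) AS match_count,
  COUNT(t3.keyword) / ANY_VALUE(k.total) AS match_percentage
FROM t1
JOIN t2 ON t2.id1 = t1.id1
-- each t2 keyword matches at most one t3 row for the pair
LEFT JOIN t3 ON t3.id2 = t1.id2 AND t3.keyword = t2.keyword
-- total keywords per id1, used as the denominator
JOIN (SELECT id1, COUNT(DISTINCT keyword) AS total FROM t2 GROUP BY id1) k
  ON k.id1 = t1.id1
GROUP BY t1.id1, t1.id2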