Say I have a ClickHouse cluster with 3 shards, with a distributed table and a local table on each node.
I have two subqueries, q1 and q2, against the distributed table. Can I instruct ClickHouse to perform the join on the final results of q1 and q2?
In my understanding, if I write a query like the one below, the join happens on each node and the left table will be read from the local table instead of the distributed table.
with q1 as (select * from distributed_table ...), q2 as (select * from distributed_table ...) select * from q1 GLOBAL INNER JOIN q2 on(...)
The reason I ask is that I have a use case where I need to detect a sequence of events, so I need the global order of events when I join.
select * from q1 GLOBAL INNER JOIN q2
1. fetch the result of q2 from all nodes to the initiator and put it into temporary table T
2. resend temporary table T to all nodes of q1
3. on each node, join the local part of q1 with T
4. gather results at the initiator
select * from (select * from q1) as q1' INNER JOIN q2
1. fetch the result of q2 from all nodes to the initiator
2. fetch the result of q1 from all nodes to the initiator
3. join q1 and q2 at the initiator
select * from q1 INNER JOIN q2 settings distributed_product_mode='allow'
1. each shard executes q2 and fetches its full result to itself
2. each shard joins its local q1 with q2 (from step 1)
3. gather results at the initiator
select * from q1 INNER JOIN q2 settings distributed_product_mode='local'
1. each shard joins its local q1 with its local q2 (both tables must be sharded the same way)
2. gather results at the initiator
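To make the difference between these plans concrete, here is a toy simulation in plain Python. It is not ClickHouse code: the shards are just lists of (key, value) rows, and the function names and sample data are made up for illustration. It only shows which rows can meet each other under each plan.

```python
from itertools import chain

# Hypothetical data: three "shards", each holding part of q1 and part of q2.
q1_shards = [[(1, "a")], [(2, "b")], [(3, "c")]]
q2_shards = [[(2, "x")], [(3, "y")], [(1, "z")]]

def inner_join(left, right):
    """Hash join on the first field of each row."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

def global_join(left_shards, right_shards):
    # The initiator collects all of q2 into T and ships T to every shard.
    t = list(chain.from_iterable(right_shards))
    # Each shard joins its local q1 with T; the initiator gathers the pieces.
    return sorted(chain.from_iterable(inner_join(s, t) for s in left_shards))

def local_join(left_shards, right_shards):
    # Like distributed_product_mode='local': a shard only sees its own q2 rows.
    return sorted(chain.from_iterable(
        inner_join(l, r) for l, r in zip(left_shards, right_shards)))

def initiator_join(left_shards, right_shards):
    # Both sides are gathered in full, then joined once at the initiator.
    return sorted(inner_join(list(chain.from_iterable(left_shards)),
                             list(chain.from_iterable(right_shards))))
```

With this sample data the matching keys live on different shards, so `local_join` finds nothing, while `global_join` and `initiator_join` return the same three rows. Only the initiator-side join sees all of q1 in one place, which is what matters if the join depends on a global ordering of events.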
I have 2 tables : Calls (10,000 rows) , CRM (25 million rows)
I want to do Calls left join CRM.
select *
from calls a
left join crm b
on (
(a.customerID = b.customerID)
OR
(a.Number1 in (b.Number_A,b.Number_B))
OR
(a.Number2 in (b.Number_A,b.Number_B))
);
When I do just the customerID join, it runs fine. But the above query times out and crashes.
I would suggest multiple left joins:
select c.*,
coalesce(cc.col1, c1a.col1, c1b.col1, c2a.col1, c2b.col1)
from calls c left join
crm cc
on c.customerID = cc.customerID left join
crm c1a
on c.Number1 = c1a.Number_A left join
crm c1b
on c.Number1 = c1b.Number_B left join
crm c2a
on c.Number2 = c2a.Number_A left join
crm c2b
on c.Number2 = c2b.Number_B;
This can then take advantage of indexes on crm(customerID), crm(Number_A), and crm(Number_B).
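As a rough check of the multiple-left-join idea, here is a small sketch using Python's built-in sqlite3 module as a stand-in database. The sample rows, the `id` column on calls, and the index names are assumptions made up for the example; `col1` stands for whatever crm column you actually want.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls (id INTEGER, customerID INTEGER, Number1 TEXT, Number2 TEXT);
    CREATE TABLE crm (customerID INTEGER, Number_A TEXT, Number_B TEXT, col1 TEXT);
    INSERT INTO calls VALUES (1, 10, '111', '222'),  -- matches crm by customerID
                             (2, 99, '333', '000'),  -- matches crm by Number1 = Number_B
                             (3, 98, '000', '000');  -- no match at all
    INSERT INTO crm VALUES (10, '555', '666', 'by-id'),
                           (20, '777', '333', 'by-number');
    -- One single-column index per join condition on the crm side.
    CREATE INDEX crm_customer ON crm (customerID);
    CREATE INDEX crm_number_a ON crm (Number_A);
    CREATE INDEX crm_number_b ON crm (Number_B);
""")

# Each LEFT JOIN uses a plain equality, so each can use its own index.
rows = conn.execute("""
    SELECT c.id, COALESCE(cc.col1, c1a.col1, c1b.col1, c2a.col1, c2b.col1) AS col1
    FROM calls c
    LEFT JOIN crm cc  ON c.customerID = cc.customerID
    LEFT JOIN crm c1a ON c.Number1 = c1a.Number_A
    LEFT JOIN crm c1b ON c.Number1 = c1b.Number_B
    LEFT JOIN crm c2a ON c.Number2 = c2a.Number_A
    LEFT JOIN crm c2b ON c.Number2 = c2b.Number_B
    ORDER BY c.id
""").fetchall()
```

Note that if several crm rows match a call on the same condition, this form returns one row per combination of matches, so it fits best when each condition matches at most one crm row.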
Sometimes, replacing one query that contains two conditions joined with OR by two queries glued together with UNION results in a better execution plan. I have never understood why DBMS optimizers don't take this into consideration themselves. And I don't know whether this is true for PostgreSQL or not. But it may be worth a try.
In your case there is an outer join in the query. That complicates the matter. With the separate queries we may get both outer joined and matching crm rows for a call and must get rid of the former in that case.
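select *
from
(
-- alias the ids so the outer query can reference them; add the other columns you need
select calls.id as call_id, crm.id as crm_id from calls left join crm on crm.customerID = calls.customerID
union
select calls.id as call_id, crm.id as crm_id from calls left join crm on crm.number_a = calls.number1
union
select calls.id as call_id, crm.id as crm_id from calls left join crm on crm.number_a = calls.number2
union
select calls.id as call_id, crm.id as crm_id from calls left join crm on crm.number_b = calls.number1
union
select calls.id as call_id, crm.id as crm_id from calls left join crm on crm.number_b = calls.number2
) data
order by rank() over (partition by call_id order by case when crm_id is null then 2 else 1 end)
fetch first row with ties;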
For this to work fast you should have one index per column in the query, i.e. six single-column indexes.
Whether this is faster than your original query depends on a lot of things. Mainly: the fewer matches the better.
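Here is a small sketch of the UNION approach using Python's sqlite3 module. SQLite has no FETCH FIRST ... WITH TIES, so this version swaps in a `rnk = 1` filter on the same RANK() expression, which keeps the same rows. The sample data and the `call_id` / `crm_id` aliases are assumptions for the example, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls (id INTEGER, customerID INTEGER, number1 TEXT, number2 TEXT);
    CREATE TABLE crm (id INTEGER, customerID INTEGER, number_a TEXT, number_b TEXT);
    INSERT INTO calls VALUES (1, 10, '111', '222'), (2, 99, '333', '000'), (3, 98, '000', '000');
    INSERT INTO crm VALUES (100, 10, '555', '666'), (200, 20, '777', '333');
""")

# One outer-join branch per original OR condition.
conditions = [
    "crm.customerID = calls.customerID",
    "crm.number_a = calls.number1",
    "crm.number_a = calls.number2",
    "crm.number_b = calls.number1",
    "crm.number_b = calls.number2",
]
branches = " UNION ".join(
    f"SELECT calls.id AS call_id, crm.id AS crm_id FROM calls LEFT JOIN crm ON {cond}"
    for cond in conditions
)

# Rank matched rows ahead of the NULL rows produced by the outer joins,
# then keep only the best-ranked rows for each call.
rows = conn.execute(f"""
    SELECT call_id, crm_id
    FROM (
        SELECT call_id, crm_id,
               RANK() OVER (PARTITION BY call_id
                            ORDER BY CASE WHEN crm_id IS NULL THEN 2 ELSE 1 END) AS rnk
        FROM ({branches}) AS data
    ) AS ranked
    WHERE rnk = 1
    ORDER BY call_id
""").fetchall()
```

Calls 1 and 2 keep only their matching crm row, and call 3 keeps its single outer-joined row with a NULL crm id.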
I'm trying to delete using a left join in SQL Server Management Studio. How do I get the list of IDs that are deleted as part of the left join? I would also like to compare the difference between the sums from both tables.
Table A:
ID NAME LOC SUM
4 abc NY 500
5 seq CA 100
15 juv TX 120
Table B:
ID NAME LOC SUM INFO
5 seq CA 90 x
18 jay AL 94 x
15 juv CL 190 x
I want to get the number of rows that are removed as part of the left join, and I want to see the difference in the sums.
DELETE MYDB
FROM MYDB.A
LEFT JOIN MYDB.B
ON A.ID=B.ID
WHERE A.ID=B.ID
It is not clear why you would be using a LEFT JOIN for the JOIN. Your WHERE clause -- which is otherwise redundant -- is turning the outer join into an inner join.
I would suggest using exists:
delete from mydb.a
where exists (select 1 from mydb.b where b.id = a.id);
For a count, you can use:
select count(*)
from mydb.a
where exists (select 1 from mydb.b where b.id = a.id);
Do note: If you run these as two separate operations, the underlying data can change between the operations.
After running the delete, you can use @@ROWCOUNT to get the number of records deleted.
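For illustration, here is the same idea run against the sample tables with Python's sqlite3 module, which is only a stand-in for SQL Server. The cursor's `rowcount` attribute plays the role of @@ROWCOUNT, and the SUM column is renamed `total` here (an assumption of this sketch) to avoid clashing with the SUM() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT, loc TEXT, total INTEGER);
    CREATE TABLE b (id INTEGER, name TEXT, loc TEXT, total INTEGER, info TEXT);
    INSERT INTO a VALUES (4, 'abc', 'NY', 500), (5, 'seq', 'CA', 100), (15, 'juv', 'TX', 120);
    INSERT INTO b VALUES (5, 'seq', 'CA', 90, 'x'), (18, 'jay', 'AL', 94, 'x'),
                         (15, 'juv', 'CL', 190, 'x');
""")

# List the ids that are about to be deleted, with the difference in totals.
# On a real server, run this and the DELETE inside one transaction so the
# data cannot change in between.
doomed = conn.execute("""
    SELECT a.id, a.total - b.total AS diff
    FROM a JOIN b ON b.id = a.id
    ORDER BY a.id
""").fetchall()

cur = conn.execute("DELETE FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)")
deleted = cur.rowcount  # number of rows removed by the DELETE
conn.commit()

remaining = [row[0] for row in conn.execute("SELECT id FROM a ORDER BY id")]
```

With the sample data, ids 5 and 15 are deleted (differences of 10 and -70), and only id 4 remains in A.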
I currently have a large SQL query (not mine) which I need to modify. I have a transaction and valuation table. The transaction has a one-to-many relationship with valuations. The two tables are being joined via a foreign key.
I've been asked to prevent any transactions (along with their subsequent valuations) from being returned if no valuations for a transaction exist past a certain date. The way I thought I would achieve this would be to use an inner query, but I need to make the inner query aware of the outer query and the transaction. So something like:
SELECT * FROM TRANSACTION_TABLE T
INNER JOIN VALUATION_TABLE V WHERE T.VAL_FK = V.ID
WHERE (SELECT COUNT(*) FROM V WHERE V.DATE > <GIVEN DATE>) > 1
Obviously the above wouldn't work as the inner query is separate and I can't reference the outer query V reference from the inner. How would I go about doing this, or is there a simpler way?
This wouldn't just be a case of setting WHERE V.DATE > <GIVEN DATE> in the outer query, because the decision has to be made per transaction: it depends on whether ANY of its valuations exceed the specified date, and it must then apply to all valuations of that transaction, not just the ones that do.
Many thanks for any help you can offer.
You may be looking for this:
SELECT *
FROM TRANSACTION_TABLE T
INNER JOIN VALUATION_TABLE V1 ON T.VAL_FK = V1.ID
WHERE (SELECT COUNT(*)
FROM VALUATION_TABLE V2
WHERE V2.ID = V1.ID AND V2.DATE > <GIVEN DATE>) > 0
or this:
SELECT *
FROM TRANSACTION_TABLE T
INNER JOIN VALUATION_TABLE V1 ON T.VAL_FK = V1.ID
WHERE V1.ID IN ( SELECT ID
FROM VALUATION_TABLE
WHERE DATE > <GIVEN DATE>
)
If execution time is important, you may want to test the various solutions on your actual data and see which works best in your situation.
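A correlated EXISTS is another form worth testing. The sketch below uses Python's sqlite3 module and assumes a hypothetical schema in which each valuation row carries a `transaction_id` column pointing at its transaction (the question's actual key layout may differ). It returns every valuation of a transaction, but only for transactions that have at least one valuation after the given date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transaction_table (id INTEGER PRIMARY KEY);
    CREATE TABLE valuation_table (id INTEGER PRIMARY KEY, transaction_id INTEGER, val_date TEXT);
    INSERT INTO transaction_table VALUES (1), (2);
    INSERT INTO valuation_table VALUES
        (10, 1, '2020-01-01'), (11, 1, '2021-06-01'),  -- transaction 1 has a late valuation
        (20, 2, '2019-03-01'), (21, 2, '2019-09-01');  -- transaction 2 has none
""")

def valuations_for_active_transactions(given_date):
    """Return (transaction id, valuation id) pairs for qualifying transactions."""
    return conn.execute("""
        SELECT t.id, v.id
        FROM transaction_table t
        JOIN valuation_table v ON v.transaction_id = t.id
        WHERE EXISTS (SELECT 1
                      FROM valuation_table v2
                      WHERE v2.transaction_id = t.id AND v2.val_date > ?)
        ORDER BY v.id
    """, (given_date,)).fetchall()
```

Calling `valuations_for_active_transactions('2021-01-01')` returns both valuations of transaction 1, including the older one, and nothing for transaction 2.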
I have two tables: a main table and a work-in-progress (WIP) table. Any inserts/updates go into the WIP table while the record is being manipulated, which allows for validation checks and the like. I want to create a view that combines the two tables, showing the WIP table data whenever it exists and the main table data when there is no WIP data.
I have figured out a way to do this but it seems that it's not the most elegant solution. I would like to know if there are other ideas or better solutions?
Example illustrating the situation:
select mt.id, wt.id wip_id, isnull(wt.name,mt.name) name,
isnull(wt.address, mt.address) address
from main_table mt full outer join
wip_table wt on mt.id = wt.orig_id;
So that will pull results from the WIP table when they exist; if they don't, it will pull results from the main table. This is a simple example, but the tables could have many rows.
if you want data either from one table or the other:
select top 1 *
from
(
select 1 as prio, wt.name, wt.address, .... from wip_table wt where ...
union
select 2 as prio, mt.name, mt.address, .... from main_table mt where ...
) x
order by prio
otherwise, like you have done (checking individual columns), but maybe using a left outer join rather than a full one:
select
mt.id
, wt.id wip_id
, isnull(wt.name,mt.name) name
, isnull(wt.address, mt.address) address
from main_table mt left outer join wip_table wt
on mt.id = wt.orig_id;
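As a quick check of the left-join version wrapped in a view, here is a sketch using Python's sqlite3 module. SQLite has no isnull(), so COALESCE is used instead (it also works in SQL Server). The view name `combined` and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE wip_table (id INTEGER PRIMARY KEY, orig_id INTEGER, name TEXT, address TEXT);
    INSERT INTO main_table VALUES (1, 'Ann', '1 Main St'), (2, 'Bob', '2 Oak Ave');
    INSERT INTO wip_table VALUES (100, 2, 'Robert', NULL);  -- edit in progress for id 2
    CREATE VIEW combined AS
        SELECT mt.id, wt.id AS wip_id,
               COALESCE(wt.name, mt.name) AS name,
               COALESCE(wt.address, mt.address) AS address
        FROM main_table mt
        LEFT JOIN wip_table wt ON mt.id = wt.orig_id;
""")

rows = conn.execute("SELECT * FROM combined ORDER BY id").fetchall()
```

Two things to keep in mind: a NULL in a WIP column falls back to the main table's value column by column, and the left join drops WIP rows that have no main-table row yet, which is the case the full outer join in the question covers.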
I have a table, horribly designed (not my doing thankfully), that stores data in a fashion similar to the following:
[key], [lease_id], [building_name], ~20 more columns of data
A lease_id can and will exist for a centre as well as head office. I've been asked to find all instances where data in a building for a lease doesn't match data in head office for the same lease.
I can do this, quite easily, with a self join. The challenge here is that there are about 20 columns to compare and although I could type each one in manually I was wondering if there's a better way to do this (which would also mean the query can be used in future, accounting for any table changes).
In syntactically ridiculous pseudocode, I want to do something similar to what the following would do if it were to work:
select lp.*
from lease_proposal lp
inner join
(
select *
from lease_proposal lp2
where building_id = '001' -- assume 001 is head office for sake of example
) lp2
on lp2.lease_id = lp.lease_id
where lp.* <> lp2.*
You could do an INTERSECT operation to find all rows where all data matched, then LEFT JOIN that result and select only the rows where there wasn't an intersection:
SELECT
a.*
FROM
lease_proposal a
LEFT JOIN
(
SELECT *
FROM lease_proposal
INTERSECT
SELECT *
FROM lease_proposal
WHERE building_id = '001'
) b ON a.lease_id = b.lease_id
WHERE
b.lease_id IS NULL
If SQL Server supported it, you could also use a NATURAL LEFT JOIN like so:
SELECT
a.*
FROM
lease_proposal a
NATURAL LEFT JOIN
(
SELECT *
FROM lease_proposal
WHERE building_id = '001'
) b
WHERE b.lease_id IS NULL
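If the goal is to avoid typing the column list and to survive table changes, one option is to build the list from the catalog at run time. The sketch below uses Python's sqlite3 module and swaps in EXCEPT rather than the INTERSECT plus LEFT JOIN shown above. The columns `row_key`, `rent`, and `term_months` and the sample rows are hypothetical; on SQL Server you would read the column names from INFORMATION_SCHEMA.COLUMNS instead of PRAGMA table_info.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lease_proposal (row_key INTEGER PRIMARY KEY, lease_id TEXT,
                                 building_id TEXT, rent INTEGER, term_months INTEGER);
    INSERT INTO lease_proposal VALUES
        (1, 'L1', '001', 1000, 12),  -- head office
        (2, 'L1', '002', 1000, 12),  -- same data as head office
        (3, 'L1', '003', 1200, 12),  -- rent differs from head office
        (4, 'L2', '001', 500, 24),   -- head office
        (5, 'L2', '002', 500, 24);   -- same data as head office
""")

def mismatched_rows(conn, head_office="001", ignore=("row_key", "building_id")):
    # Read the column names from the catalog so new columns are compared automatically.
    cols = [row[1] for row in conn.execute("PRAGMA table_info(lease_proposal)")
            if row[1] not in ignore]
    col_list = ", ".join(f'"{c}"' for c in cols)
    # Building rows minus head-office rows: whatever is left has no exact match.
    sql = f"""
        SELECT {col_list} FROM lease_proposal WHERE building_id <> ?
        EXCEPT
        SELECT {col_list} FROM lease_proposal WHERE building_id = ?
    """
    return conn.execute(sql, (head_office, head_office)).fetchall()
```

Here `mismatched_rows(conn)` returns only the L1 row with rent 1200. The key and building columns have to be left out of the comparison, since they always differ; to see which building a mismatch belongs to, join the result back to lease_proposal.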