Large Table With Multiple Outer Apply Row Compare Performance


I have a large table with a sample query like below to retrieve matched results.
Select col1,col2,col3
from
Table1 T1
OUTER APPLY (select col2 from Table2 Where t2id=T1.id)
OUTER APPLY (select col3 from Table3 Where t3id=T1.id)
Where col3>0
The problem is that it runs extremely slowly once I add the Where clause column value check.
I have tried different approaches, including CROSS APPLY, without any improvement in performance.
Any ideas?

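Try moving the filter inside the APPLY subquery. This should leave fewer rows to process and therefore give quicker results. Note that with OUTER APPLY, Table1 rows without a matching col3 > 0 are still returned (with NULL in col3); use CROSS APPLY for that subquery if they should be dropped, as in the original query.
Select col1,col2,col3
from
Table1 T1
OUTER APPLY (select col2 from Table2 Where t2id=T1.id) T2
OUTER APPLY (select col3 from Table3 Where t3id=T1.id and col3>0) T3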

Related

sql - ignore duplicates while joining

I have two tables.
Table1 is 1591 rows. Table2 is 270 rows.
I want to fetch specific column data from Table2 based on a condition between the two tables, and also exclude the duplicates in Table2. That is, I want to join the tables but get only one value from Table2 even if the condition matches more than once. The result should be exactly 1591 rows.
I tried Left, Right and Inner joins, but the result has either more or fewer than 1591 rows.
Example
Table1
type,address,name
40,blabla,Adam
20,blablabla,Joe
Table2
type,currency
40,usd
40,gbp
40,omr
Joining on 'type'
Result
type,address,name,currency
40,blabla,Adam,usd
20,blablabla,Joe,null

Try this. The left join keeps Table1 rows that have no match in Table2:
select *
from
Table1 h
left join
(select type,currency,ROW_NUMBER()over (partition by type order by
currency) as rn
from
Table2
) sr on
sr.type=h.type
and rn=1

Try this. Apart from the square-bracket quoting, it is standard SQL, so it should work on most RDBMSs. The subquery is correlated on type so that only the maximum currency of the matching type is joined.
select * from Table1 AS t
LEFT OUTER JOIN Table2 AS y ON t.[type] = y.[type] and y.currency = (SELECT MAX(m.currency) FROM Table2 AS m WHERE m.[type] = t.[type])
If you want to control which currency is joined, consider altering Table2 by adding a new active/inactive column and modifying the JOIN clause accordingly.

You can use outer apply if it's supported.
select a.type, a.address, a.name, b.currency
from Table1 a
outer apply (
select top 1 currency
from Table2
where Table2.type = a.type
) b

A typical way to do this uses a correlated subquery. This guarantees that all rows in the first table are kept. A scalar subquery generates an error if it returns more than one row, which is why the query below limits it to one.
So:
select t1.*,
(select t2.currency
from table2 t2
where t2.type = t1.type
fetch first 1 row only
) as currency
from table1 t1;
You don't specify what database you are using, so this uses standard syntax for returning one row. Some databases use limit or top instead.
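As a quick way to try the correlated-subquery idea on the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. SQLite spells the row limit as LIMIT 1, and the ORDER BY inside the subquery is my own assumption to make the chosen currency deterministic (alphabetical), not something the question asked for.

```python
import sqlite3

def one_currency_per_row(conn):
    """Return every Table1 row with at most one currency from Table2."""
    return conn.execute(
        """
        SELECT t1.type, t1.address, t1.name,
               (SELECT t2.currency
                  FROM Table2 t2
                 WHERE t2.type = t1.type
                 ORDER BY t2.currency   -- assumption: pick alphabetically first
                 LIMIT 1) AS currency
          FROM Table1 t1
         ORDER BY t1.name
        """
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Table1 (type INTEGER, address TEXT, name TEXT);
    CREATE TABLE Table2 (type INTEGER, currency TEXT);
    INSERT INTO Table1 VALUES (40, 'blabla', 'Adam'), (20, 'blablabla', 'Joe');
    INSERT INTO Table2 VALUES (40, 'usd'), (40, 'gbp'), (40, 'omr');
    """
)

# One output row per Table1 row; Joe has no match, so his currency is None.
rows = one_currency_per_row(conn)
```

The row count always equals the row count of Table1, because the subquery sits in the select list rather than in a join.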

Comparing two tables for equality in HIVE

I have two tables, table1 and table2. Each with the same columns:
key, c1, c2, c3
I want to check to see if these tables are equal to each other (they have the same rows). So far I have these two queries (<> = not equal in HIVE):
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key
where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3
And
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t2.key is null
So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.

The first one excludes rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null, because a <> comparison with NULL evaluates to NULL rather than true. That means that you are effectively doing an inner join.
The second one will find rows that exist in t1 but not in t2.
To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:
select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
or t2.key is null /* this condition matches rows that only exist in t1 */
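To see why the two queries from the question can disagree, here is a small sketch using Python's sqlite3 module (SQLite treats NULL in comparisons the same way for this purpose). The single-row data is made up for illustration: the rows share a key, but c1 is NULL on one side and 'x' on the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    INSERT INTO table1 VALUES (1, NULL, 'b', 'c');
    INSERT INTO table2 VALUES (1, 'x',  'b', 'c');
    """
)

# First query: column comparisons in WHERE. NULL <> 'x' is NULL, not true,
# so the differing row is silently skipped.
filter_in_where = conn.execute(
    """
    SELECT COUNT(*) FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t1.key = t2.key
    WHERE t2.key IS NULL OR t1.c1 <> t2.c1 OR t1.c2 <> t2.c2 OR t1.c3 <> t2.c3
    """
).fetchone()[0]

# Second query: column comparisons in ON. NULL = 'x' is not true, so the join
# finds no partner and the row is reported via t2.key IS NULL.
filter_in_on = conn.execute(
    """
    SELECT COUNT(*) FROM table1 t1
    LEFT OUTER JOIN table2 t2
      ON t1.key = t2.key AND t1.c1 = t2.c1 AND t1.c2 = t2.c2 AND t1.c3 = t2.c3
    WHERE t2.key IS NULL
    """
).fetchone()[0]
```

With this data the first query counts nothing and the second counts one row, which matches the zero versus non-zero result described in the question.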

If you want to check for duplicates and the tables have exactly the same structure and the tables do not have duplicates within them, then you can do:
select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
(select t2.*, 2 as which from table2 t2)
) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;
There are various ways that you can relax the conditions in the first paragraph, if necessary.
Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
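A minimal sketch of the UNION ALL plus GROUP BY check, run through Python's sqlite3 module. The sample rows are invented; one pair of rows contains a NULL on purpose, to show that GROUP BY puts NULLs in the same group so matching rows with NULLs are not reported.

```python
import sqlite3

def differing_rows(conn):
    """Rows that do not appear exactly once in each table."""
    return conn.execute(
        """
        SELECT t.key, t.c1, t.c2, t.c3, COUNT(*) AS cnt
          FROM (SELECT key, c1, c2, c3 FROM table1
                UNION ALL
                SELECT key, c1, c2, c3 FROM table2) t
         GROUP BY t.key, t.c1, t.c2, t.c3
        HAVING COUNT(*) <> 2
         ORDER BY t.key, t.c2
        """
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    INSERT INTO table1 VALUES (1, 'a', NULL, 'z'), (2, 'b', 'x', 'z');
    INSERT INTO table2 VALUES (1, 'a', NULL, 'z'), (2, 'b', 'y', 'z');
    """
)

# Key 1 matches (including its NULL); key 2 differs in c2 and shows up twice.
diff = differing_rows(conn)
```

An empty result means the tables hold the same rows, under the stated assumption that neither table contains internal duplicates.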

Well, the best way is to calculate the hash sum of each table and compare the two sums.
No matter how many columns there are or what their data types are, as long as the two tables have the same schema, you can use the following queries to do the comparison:
select sum(hash(*)) from t1;
select sum(hash(*)) from t2;
Then you just need to compare the return values. Equal sums strongly suggest, but do not strictly prove, identical contents, since hashes can collide.

I used EXCEPT statement and it worked.
select * from Original_table
EXCEPT
select * from Revised_table
This shows all the rows of the original table that are not in the revised table. To catch rows that exist only in the revised table, run it again with the two tables swapped.
If your table is partitioned you will have to provide a partition predicate.
FYI, partition values don't need to be provided if you use Presto and query via SQL Lab.

I would recommend not using JOINs to compare tables:
it is quite an expensive operation when tables are big (which is often the case in Hive)
it can cause problems when some rows/"join keys" are repeated
(and it can also be impractical when data are in different clusters/datacenters/clouds).
Instead, I think computing a checksum for each table and comparing the two checksums is best.
I have developed a Python script that makes such a comparison easy and shows the differences in a web browser:
https://github.com/bolcom/hive_compared_bq
I hope that can help you!

1: First get the row counts for both tables, C1 and C2. C1 and C2 should be equal. Each count can be obtained with a query like the following:
select count(*) from table1
If C1 and C2 are not equal, then the tables are not identical.
2: Find the distinct counts for both tables, DC1 and DC2. DC1 and DC2 should be equal. The number of distinct records can be found using the following query:
select count(*) from (select distinct * from table1) d
If DC1 and DC2 are not equal, the tables are not identical.
3: Now get the number of records obtained by performing a union on the 2 tables. Let it be U. Use the following query to get the number of records in a union of 2 tables:
SELECT count(*)
FROM
(SELECT *
FROM table1
UNION
SELECT *
FROM table2) u
You can say that the data in the 2 tables is identical if the distinct count for each table is equal to the number of records obtained by performing a union of the 2 tables, i.e. DC1 = U and DC2 = U.
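The three steps above can be sketched as one helper, here with Python's sqlite3 module. The function name looks_identical and the extra table3 are hypothetical, added only for the demo. One caveat I would keep in mind: two tables that contain the same distinct rows but distribute their duplicates differently would still pass this check.

```python
import sqlite3

def scalar(conn, sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

def looks_identical(conn, left, right):
    """Apply the C1/C2, DC1/DC2 and U checks described above.

    Table names are interpolated directly, so only pass trusted names.
    """
    c1 = scalar(conn, f"SELECT COUNT(*) FROM {left}")
    c2 = scalar(conn, f"SELECT COUNT(*) FROM {right}")
    dc1 = scalar(conn, f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {left}) d")
    dc2 = scalar(conn, f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {right}) d")
    u = scalar(
        conn,
        f"SELECT COUNT(*) FROM (SELECT * FROM {left} UNION SELECT * FROM {right}) u",
    )
    return c1 == c2 and dc1 == u and dc2 == u

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (key INTEGER, c1 TEXT);
    CREATE TABLE table2 (key INTEGER, c1 TEXT);
    CREATE TABLE table3 (key INTEGER, c1 TEXT);  -- hypothetical, differs in one row
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table2 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table3 VALUES (1, 'a'), (2, 'c');
    """
)

same = looks_identical(conn, "table1", "table2")
different = looks_identical(conn, "table1", "table3")
```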

Another variant:
select c1-c2 "different row counts"
, c1-c3 "mismatched rows"
from
( select count(*) c1 from table1)
,( select count(*) c2 from table2 )
,(select count(*) c3 from table1 t1, table2 t2
where t1.key= t2.key
and T1.c1=T2.c1 )

Try with WITH Clause:
With cnt as(
select count(*) cn1 from table1
)
select 'X' from dual,cnt where cnt.cn1 = (select count(*) from table2);

One easy solution is an inner join. Suppose we have two Hive tables, table1 and table2, with the same columns col1, col2 and col3, and the same number of rows. Then the command would be as follows:
select count(*) from table1
inner join table2
on table1.col1 = table2.col1
and table1.col2 = table2.col2
and table1.col3 = table2.col3 ;
If the output equals the number of rows in table1 and table2, then all the columns have the same values. If the output count is lower, some of the data differs.

Use a MINUS operator:
SELECT count(*) FROM
(SELECT t1.c1, t1.c2, t1.c3 from table1 t1
MINUS
SELECT t2.c1, t2.c2, t2.c3 from table2 t2)

SQL aggregate function returning inflated values on joined table

I'm breaking my head here where I'm going wrong.
The following query:
SELECT SUM(table1.col1) FROM table1
returns value x.
And the following query:
SELECT SUM(table1.col1) FROM table2 RIGHT OUTER JOIN table1 ON table2.ID = table1.ID
returns value y. (I need the Join for the other data of table2). Why is the 2nd example returning a different value than in the first?

Make life easier on yourself, your colleagues that will support your code, and your clients by temporarily ignoring the existence of RIGHT OUTER JOIN. Use Table1 as the "from table" instead of table2.
Then, if aggregating, you will often find it necessary to do this BEFORE joining, so that the numbers are accurate, e.g.
SELECT T1.SUMCOL1
FROM (
SELECT id, SUM(col1) as SUMCOL1 FROM Table1 GROUP BY id
) T1
LEFT OUTER JOIN table2 T2 on T1.id = T2.ID

The obvious answer is that table2 is many to table1's one. That is, there are multiple rows in table2 for one ID in table1, so those table1 rows are repeated in the join and summed more than once. (The RIGHT OUTER JOIN keeps every row of table1, so rows are not being dropped.)
Compare:
SELECT COUNT(*) FROM table1
To:
SELECT COUNT(*) FROM table2 RIGHT OUTER JOIN table1 ON table2.ID = table1.ID
If the second count is higher, you're aggregating duplicates.
If you want to avoid this, you'll need to use a subquery.
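Here is a small sketch of the inflation and of one possible subquery fix, using Python's sqlite3 module. The data is made up. I wrote the join as table1 LEFT OUTER JOIN table2, which is equivalent to the RIGHT OUTER JOIN in the question and also runs on older SQLite versions. Collapsing table2 to one row per ID with SELECT DISTINCT is just one way to do it; which table2 columns you actually need will decide the real subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (ID INTEGER, col1 INTEGER);
    CREATE TABLE table2 (ID INTEGER, note TEXT);
    INSERT INTO table1 VALUES (1, 10), (2, 20);
    INSERT INTO table2 VALUES (1, 'a'), (1, 'b');  -- two rows for ID 1
    """
)

def scalar(sql):
    return conn.execute(sql).fetchone()[0]

# Value x from the question: the plain sum.
plain_sum = scalar("SELECT SUM(table1.col1) FROM table1")

# Value y: ID 1 matches two table2 rows, so its col1 is added twice.
joined_sum = scalar(
    """
    SELECT SUM(table1.col1)
      FROM table1 LEFT OUTER JOIN table2 ON table2.ID = table1.ID
    """
)

# Reduce table2 to one row per ID before joining, so nothing is repeated.
deduped_sum = scalar(
    """
    SELECT SUM(table1.col1)
      FROM table1
      LEFT OUTER JOIN (SELECT DISTINCT ID FROM table2) t2 ON t2.ID = table1.ID
    """
)
```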

Optimization of DB2 query which uses joins and takes 1.5 hours to execute

When I run a SELECT statement on my view it takes around 1.5 hours. What can I do to optimize it?
Below is a sample of how my view is structured:
CREATE VIEW SCHEMANAME.VIEWNAME
(
COL, COL1, COL2, COL3 )
AS SELECT
COST.ETA,
CASE
WHEN VOL.CURR IS NOT NULL
THEN COALESCE (VOL.COMM,0)
END,
CASE
WHEN...
END
FROM TABLE1 t1 inner join TABLE2 t2 ON t1.ETA=t2.ETA
INNER JOIN TABLE3 t3 on t2.ETA=t3.ETA
LEFT OUTER JOIN TABLE4 t4 on t2.ETA=t4.ETA

This is your query:
SELECT COST.ETA,
(CASE WHEN VOL.CURR IS NOT NULL THEN COALESCE(VOL.COMM, 0)
END) as ??,
. . .
FROM TABLE1 t1 inner join
TABLE2 t2
ON t1.ETA = t2.ETA INNER JOIN
TABLE3 t3
on t2.ETA = t3.ETA LEFT OUTER JOIN
TABLE4 t4
on t2.ETA = t4.ETA;
First, I will ignore the fact that the select clause references tables that are not in the from clause. I assume this is a typo.
Second, you should be able to use indexes to improve this query: table1(eta), table2(eta), table3(eta), and table4(eta).
Third, I am highly suspicious on seeing the same column used for joining so many tables. I suspect that you might have cartesian products occurring, because there are multiple values of any given eta in several tables. If that is the case, you need to fix the query to better reflect what you really need. If so, ask another question with sample data and desired results, because your query is probably not correct.
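To check the cartesian-product suspicion on your own data, something like the following sketch could help. It uses Python's sqlite3 module with tiny invented tables (only an ETA column, which is an assumption about the real schema) to show how repeated join keys multiply rows and how a GROUP BY / HAVING query exposes the repeats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TABLE1 (ETA INTEGER);
    CREATE TABLE TABLE2 (ETA INTEGER);
    INSERT INTO TABLE1 VALUES (1), (1);        -- ETA 1 appears twice
    INSERT INTO TABLE2 VALUES (1), (1), (1);   -- ETA 1 appears three times
    """
)

# 2 rows x 3 rows with the same ETA produce 6 joined rows.
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM TABLE1 t1 INNER JOIN TABLE2 t2 ON t1.ETA = t2.ETA"
).fetchone()[0]

# Diagnostic: list join-key values that occur more than once in a table.
repeated_keys = conn.execute(
    "SELECT ETA, COUNT(*) FROM TABLE2 GROUP BY ETA HAVING COUNT(*) > 1"
).fetchall()
```

Running the diagnostic query against each of the four tables should show quickly whether the joins are multiplying rows.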

How to do a SUM across two unrelated tables?

I'm trying to sum on two unrelated tables with postgres. With MySQL, I would do something like this :
SELECT SUM(table1.col1) AS sum_1, SUM(table2.col1) AS sum_2 FROM table1, table2
This should give me a table with two columns named sum_1 and sum_2. However, postgres doesn't give me any result for this query.
Any ideas?

SELECT (SELECT SUM(table1.col1) FROM table1) AS sum_1,
(SELECT SUM(table2.col1) FROM table2) AS sum_2;
You can also write it as:
SELECT t1.sum_c1, t1.sum_c2, t2.sum_t2_c1
FROM
(
SELECT SUM(col1) sum_c1,
SUM(col2) sum_c2
FROM table1
) t1
FULL OUTER JOIN
(
SELECT SUM(col1) sum_t2_c1
FROM table2
) t2 ON 1=1;
The FULL JOIN is used with a dud condition so that either subquery could produce no results (empty) without causing the greater query to have no result.
I don't think the query as you have written it would have produced the result you expected, because it does a CROSS JOIN between table1 and table2, which inflates each SUM by the count of rows in the other table. Note that if either table1 or table2 is empty, the CROSS JOIN of X rows by 0 rows leaves no rows to aggregate, so both sums come back as NULL.
Look at this SQL Fiddle and compare the results.
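For a self-contained comparison, here is a minimal sketch using Python's sqlite3 module with invented numbers. It runs the comma-join query from the question next to the scalar-subquery version, so the inflation is visible directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (col1 INTEGER);
    CREATE TABLE table2 (col1 INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3);   -- true sum: 6
    INSERT INTO table2 VALUES (10), (20);      -- true sum: 30
    """
)

# Query from the question: every table1 row is paired with every table2 row,
# so each sum is multiplied by the other table's row count.
cross_join_sums = conn.execute(
    """
    SELECT SUM(table1.col1) AS sum_1, SUM(table2.col1) AS sum_2
      FROM table1, table2
    """
).fetchone()

# Scalar subqueries aggregate each table on its own.
subquery_sums = conn.execute(
    """
    SELECT (SELECT SUM(table1.col1) FROM table1) AS sum_1,
           (SELECT SUM(table2.col1) FROM table2) AS sum_2
    """
).fetchone()
```

Here the cross join reports 6 x 2 and 30 x 3, while the subquery version reports the actual sums.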

To combine multiple aggregates from multiple tables, use CROSS JOIN:
SELECT sum_1, sum_2, sum_3, sum_4
FROM
(SELECT sum(col1) AS sum_1, sum(col2) AS sum_2 FROM table1) t1
CROSS JOIN
(SELECT sum(col3) AS sum_3, sum(col4) AS sum_4 FROM table2) t2
There is always exactly one row from either of the subqueries, even with no rows in the source tables. So a CROSS JOIN (or even just a lowly comma between the subqueries - being the not so easy to read shorthand for a cross join with lower precedence) is the simplest way.
Note that this produces a cross join between single aggregated rows, not a cross join between individual rows of multiple tables like your incorrect statement in the question would - thereby multiplying each other.

I suggest something like the following, although I haven't tried it.
select sum1, sum2
from
(select sum(col1) sum1 from table1) t1,
(select sum(col1) sum2 from table2) t2;
The idea is to create two inline views, each with one row in it, and then do a cartesian join on them.

SELECT SUM(table1_column1 + table2_column1)
FROM table1
JOIN table2
ON table1_id= table2_id
WHERE account_no='${account_no}'
Express-JS with PostgreSQL via postman API
