Large inner join - sql

We have 2 tables with English words: words_1 and words_2 with fields(word as VARCHAR, ref as INT), where word - it's an english word, ref - reference on another(third) table(it's not important).
In each table all words are unique. First table contains some words that are not in second one(and on the contrary second table contains some unique words).
But most words in two tables are same.
Need to get: Result table with all distinct words and ref's.
Initial conditions
Ref's for same tables can be different( dictionaries were loaded from different places).
Words count 300 000 in each table, so inner join is not convinient
Examples
words_1
________
Health-1
Car-3
Speed-5
words_2
_________
Health-2
Buty-6
Fast-8
Speed-9
Result table
_____________
Health-1
Car-3
Speed-5
Buty-6
Fast-8

select word,min(ref)
from (
select word,ref
from words_1
union all
select word,ref
from words_2
) t
group by word

Try using a full outer join:
select coalesce(w1.word, w2.word) as word, coalesce(w1.ref, w2.ref) as ref
from words_1 w1 full outer join
words_2 w2
on w1.word = w2.word;
The only time this will not work is if ref can be NULL in either table. In that case, change the on to:
on w1.word = w2.word and w1.ref is not null and w2.ref is not null
If you want to improve performance, just create an index on the tables:
create index idx_words1_word_ref on words_1(word, ref);
create index idx_words2_word_ref on words_2(word, ref);
A join is quite doable and even without the index, SQL Server should be smart enough to come up with a reasonable implementation.

Related

Left Join is filtering rows out of my query in MySQL 5.7 without any left join columns in the where clause

I have a query that joins 4 tables. It returns 35 rows every time I run it. Here it is..
SELECT Lender.id AS LenderId,
Loans.Loan_ID AS LoanId,
Parcels.Parcel_ID AS ParcelId,
tr.Tax_ID AS TaxRecordId,
tr.Tax_Year AS TaxYear
FROM parcels
INNER JOIN Loans ON (Parcels.Loan_ID = Loans.Loan_ID AND Parcels.Escrow = 1)
INNER JOIN Lender ON (Lender.id = Loans.Bank_ID)
INNER JOIN Tax_Record tr ON (tr.Parcel_ID = Parcels.Parcel_ID AND tr.Tax_Year = :taxYear)
WHERE Loans.Active = 1
AND Loans.Date_Submitted IS NOT NULL
AND Parcels.Municipality = :municipality
AND Parcels.County = :county
AND Parcels.State LIKE :stateCode
If I left join a table (using a subquery in the on clause of the join), MySQL does some very unexpected things. Here's the modified query with the left join...
SELECT Lender.id AS LenderId,
Loans.Loan_ID AS LoanId,
Parcels.Parcel_ID AS ParcelId,
tr.Tax_ID AS TaxRecordId,
tr.Tax_Year AS TaxYear
FROM parcels
INNER JOIN Loans ON (Parcels.Loan_ID = Loans.Loan_ID AND Parcels.Escrow = 1)
INNER JOIN Lender ON (Lender.id = Loans.Bank_ID)
INNER JOIN Tax_Record tr ON (tr.Parcel_ID = Parcels.Parcel_ID AND tr.Tax_Year = :taxYear)
LEFT OUTER JOIN taxrecordpayment trp ON trp.taxRecordId = tr.Tax_ID AND trp.paymentId = (
SELECT p.id
FROM taxrecordpayment trpi
JOIN payments p ON p.id = trpi.paymentId
WHERE trpi.taxRecordId = tr.Tax_ID AND p.isFullYear = 0
ORDER BY p.dueDate, p.paymentSendTo
LIMIT 1
)
WHERE Loans.Active = 1
AND Loans.Date_Submitted IS NOT NULL
AND Parcels.Municipality = :municipality
AND Parcels.County = :county
AND Parcels.State LIKE :stateCode
I would like to note that the left join table does not appear in the where clause of the query at all, and I am not using the left join table in the select clause. In real life, I actually use the left join records in the select clause, but in my effort to get to the essential elements causing this problem, I have simplified the query and removed everything but the essential parts that cause trouble.
Here's what is happening...
Where I used to get 35 records, now I get a random number of records approaching 35. Sometimes, I get 33. Other times, I get 27, or 29, or 31, and so on. I would never expect a left join like this to filter out any records from my result set. A left join should only add additional columns to the result set, particularly when - as is the case here - the left join table is not part of the where clause.
I have determined that the problem really only happens if the subquery has a non-deterministic sort. In other words, if I have two taxrecordpayment records that match the subquery and both have the same due date and the same "paymentSendTo" value, then I see the issue. If the inner subquery has a deterministic sort, the issue goes away.
I would imagine that some people will look at my simplified example and recommend that I simply remove the subquery. If my query were this simple in real life, that would be the way to go.
In reality, the entire query is more complicated, is hitting a LOT of data, and modifying it is possible, but costly. Removing the subquery is even more costly.
Has anyone seen this sort of behavior before? I would expect a non-deterministic subquery to simply produce inconsistent results and I would never expect a left join like this to actually filter records out when the left joined table is not used at all in the where clause.
Here is the query plan, as provided by EXPLAIN...
id
select_type
table
partitions
type
possible_keys
key
key_len
ref
rows
filtered
Extra
1
PRIMARY
parcels
NULL
range
PRIMARY,Loan_ID,state_county,ParcelsCounty,county_state,Location,CountyLoan
county_state
106
NULL
590
1
Using index condition; Using where
1
PRIMARY
tr
NULL
eq_ref
parcel_year,ParcelsTax_Record,Year
parcel_year
8
infoexchange.parcels.Parcel_ID,const
1
100
Using index
1
PRIMARY
Loans
NULL
eq_ref
PRIMARY,Bank_ID,Bank,DateSub,loan_number
PRIMARY
4
infoexchange.parcels.Loan_ID
1
21.14
Using where
1
PRIMARY
Lender
NULL
eq_ref
PRIMARY
PRIMARY
8
infoexchange.Loans.bank_id
1
100
Using index
1
PRIMARY
trp
NULL
eq_ref
taxRecordPayment_key,IDX_trp_pymtId_trId
taxRecordPayment_key
8
infoexchange.tr.Tax_ID,func
1
100
Using where; Using index
2
DEPENDENT SUBQUERY
trpi
NULL
ref
taxRecordPayment_key,IDX_trp_pymtId_trId
taxRecordPayment_key
4
infoexchange.tr.Tax_ID
1
100
Using index; Using temporary; Using filesort
2
DEPENDENT SUBQUERY
p
NULL
eq_ref
PRIMARY
PRIMARY
4
infoexchange.trpi.paymentId
1
10
Using where
I have attempted to recreate this with a contrived data setup and an analogous query, but with my contrived data set, I cannot get the subquery behave non-deterministically even though it suffers from the same problem as my subquery above (there are multiple records that match the subquery and the order by is not unique for those records).
This seems to require a massive data set to start misbehaving. It happens on multiple distinct instances of MySQL 5.7, while a MySQL 5.6 instance does not demonstrate the problem at all. I am hoping someone can spot something in the above query plan to help me understand why the subquery is non-deterministic and - more importantly - why that causes records to get dropped from the result set.
I feel like this is either a data set issue (perhaps we need to do a table optimize or do some maintenance on our tables), or a bug in MySQL.
I have submitted a bug for this behavior.
https://bugs.mysql.com/bug.php?id=104824
You can recreate this behavior as follows...
CREATE TABLE tableA (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(10)
);
CREATE TABLE tableB (
id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT,
tableAId INTEGER NOT NULL,
name VARCHAR(10),
CONSTRAINT tableBFKtableAId FOREIGN KEY (tableAId) REFERENCES tableA (id)
);
INSERT INTO tableA (name)
VALUES ('he'),
('she'),
('it'),
('they');
INSERT INTO tableB (tableAId, name)
VALUES (1, 'hat'),
(2, 'shoes'),
(4, 'roof');
Run this query multiple times and the number of rows returned will vary:
SELECT COALESCE(b.id, -1) AS tableBId,
a.id AS tableAId
FROM tableA a
LEFT JOIN tableB b ON (b.tableAId = a.id AND 0.5 > RAND());

Efficent HIVE Method for spatial join & intersect

I have a HIVE table called "favoriteshop"(shopname, wkt) with 10 locations and its wkt(Well-known text). I also have another table called "city"(cityname, wkt) with all cities and the city's full wkt. I want to do spatial joins to the 2 tables to see if they spatially intersects each other. Below is my query:
SELECT a.shopname, a.wkt, b.cityname
FROM favoriteshop a, city b
WHERE ST_Intersects(ST_GeomFromText(a.wkt), ST_GeomFromText(b.wkt)) = true
Is there a more efficient way of doing this? it feels like it is a full table scan on city table which is problematic as city can be huge(and let's pretend it can be over millions or billions of record). Thanks for your suggestions!
calculate ST_GeomFromText in the subqueries and move condition to the ON clause:
SELECT a.shopname, a.wkt, b.cityname
FROM ( select a.shopname, a.wkt, ST_GeomFromText(a.wkt) Geom from favoriteshop a ) a
INNER JOIN
(selectb.cityname, ST_GeomFromText(b.wkt) Geom from city b ) b
on ST_Intersects(a.Geom, b.Geom) = true;

Oracle 11 SQL lists and two table comparison

I have a list of 12 strings (strings of numbers) that I need to compare to an existing table in Oracle. However I don't want to create a table just to do the compare; that takes more time than it is worth.
select
column_value as account_number
from
table(sys.odcivarchar2list('27001', '900480', '589358', '130740',
'807958', '579813', '1000100462', '656025',
'11046', '945287', '18193', '897603'))
This provides the correct result.
Now I want to compare that list of 12 against an actual table with account numbers to find the missing values. Normally with two tables I would do a left join table1.account_number = table2.account_number and table two results will have blanks. When I attempt that using the above, all I get are the results where the two records are equal.
select column_value as account_number, k.acct_num
from table(sys.odcivarchar2list('27001',
'900480',
'589358',
'130740',
'807958',
'579813',
'1000100462',
'656025',
'11046',
'945287',
'18193',
'897603'
)) left join
isi_owner.t_known_acct k on column_value = k.acct_num
9 match, but 3 should be included in table1 and blank in table2
Thoughts?
Sean

Joining tables on multiple conditions

I have a little problem - since im not very experienced in SQL - about joining the same table on multiple values. Imagine there is table 1 (called Strings):
id value
1 value1
2 value2
and then there is table 2 (called Maps):
id name description
1 1 2
so name is reference into the Strings table, as is description. Without the second field referencing the Strings table it would be no problem, id just do an inner join on Strings.id = Maps.name. But now id like to obtain the actual string also for description. What would be the best approach for a SELECT that returns me both? Right now it looks like this:
SELECT Maps.id, Strings.value AS mapName FROM Maps INNER JOIN Strings ON Strings.id = Maps.name;
But that obviously only covers one of the localized names. Thank you in advance.
You can do this with two joins to the same table:
SELECT m.id, sname.value AS mapName, sdesc.value as description
FROM Maps m INNER JOIN
Strings sname
ON sname.id = m.name INNER JOIN
Strings desc
ON sdesc.id = m.description;
Note the use of table aliases to distinguish between the two tables.
As long as you want to get a single value from another table, you can use subqueries to do these lookups:
SELECT id,
(SELECT value FROM Strings WHERE id = Maps.name) AS name,
(SELECT value FROM Strings WHERE id = Maps.description) AS description
FROM Maps

How to build virtual columns?

Sorry if this is a basic question. I'm fairly new to SQL, so I guess I'm just missing the name of the concept to search for.
Quick overview.
First table (items):
ID | name
-------------
1 | abc
2 | def
3 | ghi
4 | jkl
Second table (pairs):
ID | FirstMember | SecondMember Virtual column (pair name)
-------------------------------------
1 | 2 | 3 defghi
2 | 1 | 4 abcjkl
I'm trying to build the virtual column shown in the second table
It could be built at the time any entry is made in the second table, but if done that way, the data in that column would get wrong any time one of the items in the first table is renamed.
I also understand that I can build that column any time I need it (in either plain requests or stored procedures), but that would lead to code duplication, since the second table can be involved in multiple different requests.
So is there a way to define a "virtual" column, that could be accessed as a normal column, but whose content is built dynamically?
Thanks.
Edit: this is on MsSql 2008, but an engine-agnostic solution would be preferred.
Edit: the example above was oversimplified in multiple ways - the major one being that the virtual column content isn't a straight concatenation of both names, but something more complex, depending on the content of columns I didn't described. Still, you've provided multiple paths that seems promising - I'll be back. Thanks.
You need to join the items table twice:
select p.id,
p.firstMember,
p.secondMember,
i1.name||i2.name as pair_name
from pairs as p
join items as i1 on p.FirstMember = i1.id
join items as i2 on p.SecondMember = i2.id;
Then put this into a view and you have your "virtual column". You would simply query the view instead of the actual pairs table wherever you need the pair_name column.
Note that the above uses inner joins, if your "FirstMember" and "SecondMember" columns might be null, you probably want to use an outer join instead.
You can use a view, which creates a table-like object from a query result, such as the one with a_horse_with_no_name provided.
CREATE VIEW pair_names AS
SELECT p.id,
p.firstMember,
p.secondMember,
CONCAT(i1.name, i2.name) AS pair_name
FROM pairs AS p
JOIN items AS i1 ON p.FirstMember = i1.id
JOIN items AS i2 ON p.SecondMember = i2.id;
Then to query the results just do:
SELECT id, pair_name FROM pair_names;
You could create a view for your 'virtual column', if you wanted to, like so:
CREATE VIEW aView AS
SELECT
p.ID,
p.FirstMember,
p.SecondMember,
a.name + b.name as 'PairName'
FROM
pairs p
LEFT JOIN
items a
ON
p.FirstMember = a.ID
LEFT JOIN
items b
ON
p.SecondMember = b.ID
Edit:
Or, of course, you could just use a similar select statement every time.
When selecting from tables you can name the results of a column using AS.
SELECT st.ID, st.FirstMember, st.SecondMember, ft1.Name + ft2.Name AS PairName
FROM Second_Table st
JOIN First_Table ft1 ON st.FirstMember = ft1.ID
JOIN First_Table ft2 ON st.SecondMember = ft2.ID
Should give you something like what you are after.