Efficient HIVE method for spatial join & intersect - hive

I have a HIVE table called "favoriteshop" (shopname, wkt) with 10 locations and their WKT (well-known text). I also have another table called "city" (cityname, wkt) with all cities and each city's full WKT. I want to do a spatial join on the 2 tables to see if they spatially intersect each other. Below is my query:
SELECT a.shopname, a.wkt, b.cityname
FROM favoriteshop a, city b
WHERE ST_Intersects(ST_GeomFromText(a.wkt), ST_GeomFromText(b.wkt)) = true
Is there a more efficient way of doing this? It feels like it is a full table scan on the city table, which is problematic because city can be huge (and let's pretend it can be over millions or billions of records). Thanks for your suggestions!

Calculate ST_GeomFromText in subqueries and move the condition to the ON clause:
SELECT a.shopname, a.wkt, b.cityname
FROM ( select a.shopname, a.wkt, ST_GeomFromText(a.wkt) Geom from favoriteshop a ) a
INNER JOIN
( select b.cityname, ST_GeomFromText(b.wkt) Geom from city b ) b
on ST_Intersects(a.Geom, b.Geom) = true;


Postgres - How to find ids that are not used in multiple tables (inactive ids) - badly written query

I have a table towns which is the main table. This table contains so many rows and became so 'dirty' (someone inserted 5 million rows) that I would like to get rid of unused towns.
There are 3 referencing tables that use my town_id as a reference to towns.
And I know there are many towns that are not used in these tables; only if a town_id is not found in any of these 3 tables do I consider it inactive, and I would like to remove that town (because it's not used).
As you can see, towns is used in these 2 different tables:
employees
offices
and for the table vendors there is a vendor_id in the table towns, since one vendor can have multiple towns.
So if vendor_id in towns is null and the town_id is not found in either of these 2 tables, it is safe to remove it :)
I created a query which might work, but it is taking too much time to execute. It looks something like this:
select count(*)
from towns
where vendor_id is null
and id not in (select town_id from banks)
and id not in (select town_id from employees)
So basically I said: if vendor_id is null, this town is definitely not related to vendors, and if at the same time that town is not in banks or employees, then it is safe to remove it. But the query took too long and never executed successfully, since towns has 5 million rows, and that is the reason it is so dirty.
In fact I'm not able to execute the given query since the server terminated abnormally.
Here is the full error message:
ERROR: server closed the connection unexpectedly This probably means
the server terminated abnormally before or while processing the
request.
Any kind of help would be awesome
Thanks!
You can join the tables using LEFT JOIN to identify the town_id values for which there is no row in the tables banks and employees, filtering in the WHERE clause:
WITH list AS
( SELECT t.town_id
FROM towns AS t
LEFT JOIN tbl.banks AS b ON b.town_id = t.town_id
LEFT JOIN tbl.employees AS e ON e.town_id = t.town_id
WHERE t.vendor_id IS NULL
AND b.town_id IS NULL
AND e.town_id IS NULL
LIMIT 1000
)
DELETE FROM tbl.towns AS t
USING list AS l
WHERE t.town_id = l.town_id ;
Before launching the DELETE, you can check the indexes on your tables.
Adding an index as follows can be useful:
CREATE INDEX town_id_nulls ON towns (town_id NULLS FIRST) ;
Last but not least, you can add a LIMIT clause in the CTE to limit the number of rows you delete each time you execute the DELETE and avoid the unexpected termination. As a consequence, you will have to relaunch the DELETE several times until there are no more rows to delete, for example with a loop like the sketch below.
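A minimal sketch of such a relaunch loop, assuming the table and column names above (schema qualification omitted) and PostgreSQL 11+ so that COMMIT is allowed inside a DO block:
DO $$
DECLARE
    deleted integer;
BEGIN
    LOOP
        -- delete at most 1000 inactive towns per iteration
        WITH list AS
        ( SELECT t.town_id
          FROM towns AS t
          LEFT JOIN banks AS b ON b.town_id = t.town_id
          LEFT JOIN employees AS e ON e.town_id = t.town_id
          WHERE t.vendor_id IS NULL
            AND b.town_id IS NULL
            AND e.town_id IS NULL
          LIMIT 1000
        )
        DELETE FROM towns AS t
        USING list AS l
        WHERE t.town_id = l.town_id;

        GET DIAGNOSTICS deleted = ROW_COUNT;
        EXIT WHEN deleted = 0;  -- nothing left to delete
        COMMIT;                 -- end the batch so each run stays small
    END LOOP;
END $$;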
You can try a JOIN on the big tables; it would be faster than two IN subqueries.
You could also try UNION ALL and live with the duplicates, as it is faster than UNION.
Finally, you can use a combined index on id and vendor_id to speed up the query.
CREATE TABLE towns (id int, vendor_id int);
CREATE TABLE banks (town_id int);
CREATE TABLE employees (town_id int);

select count(*)
from towns t1
JOIN (select town_id from banks UNION select town_id from employees) t2 on t1.id <> t2.town_id
where vendor_id is null;

count
-----
    0

fiddle
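A sketch of the combined index suggested above (the index name is illustrative):
CREATE INDEX idx_towns_id_vendor_id ON towns (id, vendor_id);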
The trick is to first make a list of all the town_id values you want to keep and then start removing those that are not in it.
By looking in 2 tables you're making life harder for the server, so let's just create 1 single list first.
-- build empty temp-table
CREATE TEMPORARY TABLE TEMP_must_keep
AS
SELECT town_id
FROM tbl.towns
WHERE 1 = 2;
-- get id's from first table
INSERT INTO TEMP_must_keep (town_id)
SELECT DISTINCT town_id
FROM tbl.banks;
-- add index to speed up the EXCEPT below
CREATE UNIQUE INDEX idx_uq_must_keep_town_id ON TEMP_must_keep (town_id);
-- add new ones from second table
INSERT INTO TEMP_must_keep (town_id)
SELECT town_id
FROM tbl.employees
EXCEPT -- auto-distincts
SELECT town_id
FROM TEMP_must_keep;
-- rebuild index simply to ensure little fragmentation
REINDEX TABLE TEMP_must_keep;
-- optional, but might help: create a temporary index on the towns table to speed up the delete
CREATE INDEX idx_towns_town_id_where_vendor_null ON tbl.towns (town_id) WHERE vendor_id IS NULL;
-- Now do actual delete
-- You can do a `SELECT COUNT(*)` rather than a `DELETE` first if you feel like it, both will probably take some time depending on your hardware.
DELETE
FROM tbl.towns as del
WHERE vendor_id is null
AND NOT EXISTS ( SELECT *
FROM TEMP_must_keep mk
WHERE mk.town_id = del.town_id);
-- cleanup
DROP INDEX tbl.idx_towns_town_id_where_vendor_null;
DROP TABLE TEMP_must_keep;
The idx_towns_town_id_where_vendor_null index is optional and I'm not sure if it will actually lower the total time, but IMHO it will help with the DELETE operation, if only because the index should give the query optimizer a better view of what volumes to expect.

Create New SQL Table w/o duplicates

I'm learning how to create tables in SQL, pulling data from existing tables in two different databases. I am trying to create a table combining two tables without duplicates. I've seen some say to use UNION, but I could not get that to work.
Say TABLE 1 has 2 columns (IdNumber, Material) and TABLE 2 has 3 columns (IdNumber, Size, Description).
How can I create a new table (named TABLE3) that combines those two but only shows the columns (PartDescription, Weight, Color), without duplicates?
What I have done so far is as follows:
CREATE TABLE #Materialsearch (IdNumber varchar(30), Material varchar(30))
CREATE TABLE #Sizesearch (idnumber varchar(30), Size varchar(30), Description varchar(50))
INSERT INTO #Materialsearch (IdNumber, Material)
SELECT [IdNumber],[Material]
FROM [datalist].[dbo].[Table1]
WHERE Material LIKE 'Steel' AND IdNumber NOT LIKE 'Steel'
INSERT INTO #Sizesearch (idnumber, Size, Description)
SELECT [idNumber],[itemSize], [ShortDesc]
FROM [515dap].[dbo].[Table2]
WHERE itemSize LIKE '1' AND idnumber NOT LIKE 'Steel'
SELECT DISTINCT #Materialsearch.IdNumber, #Materialsearch.Material,
#Sizesearch.Size, #Sizesearch.Description
FROM #Materialsearch
INNER JOIN #Sizesearch
ON #Materialsearch.IdNumber = #Sizesearch.idnumber
ORDER BY #Materialsearch.IdNumber
DROP TABLE #Materialsearch
DROP TABLE #Sizesearch
This would show all items that are made from steel but do not have steel as their item IDs.
Thanks for your help
I'm not 100% sure what you're after - but you may find this useful.
You could use a FULL OUTER JOIN, which takes all rows from both tables, matches the ones it can, then reports all rows.
I'd suggest (for your understanding) running
SELECT A.*, B.*
FROM #Materialsearch AS A
FULL OUTER JOIN #Sizesearch AS B ON A.[IdNumber] = B.[IdNumber]
Then to get the relevant data, you just need some tweaks on that e.g.,
SELECT
ISNULL(A.[IdNumber], B.[IdNumber]) AS [IdNumber],
A.Material,
B.Size,
B.Description
FROM #Materialsearch AS A
FULL OUTER JOIN #Sizesearch AS B ON A.[IdNumber] = B.[IdNumber]
Edit: Changed the typoed INNER JOINs to FULL OUTER JOINs. Oops :( Thank you very much @Thorsten for finding it!
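If the goal is to materialize that result as an actual TABLE3, a minimal sketch (assuming SQL Server, since the temp tables above use the # prefix) could be:
SELECT
    ISNULL(A.[IdNumber], B.[IdNumber]) AS [IdNumber],
    A.Material,
    B.Size,
    B.Description
INTO TABLE3
FROM #Materialsearch AS A
FULL OUTER JOIN #Sizesearch AS B ON A.[IdNumber] = B.[IdNumber];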

Assigning a value from one table to another table

There are two tables, Table A and Table B. They contain the same columns, cost and item. Table B contains the list of items and their corresponding costs, whereas Table A contains only the list of items.
Now we need to check the items of Table A: if they are present in Table B, then the corresponding item cost should be assigned to the item's cost in Table A.
Can someone help me out by writing a query for this?
Consider the tables as shown:
Table A:
item cost
-------------
pen null
book null
watch null
Table B:
item cost
-------------
watch 1000
book 50
Expected output
Table A:
item cost
pen 0
book 50
watch 1000
Just add a foreign key (the primary key of Table A) in Table B, say the Table A ID, and then add a join (a right join, maybe) in the query to get or assign the prices of the respective items.
The join would be like:
SELECT a.item, b.cost
FROM table_a a
RIGHT JOIN table_b b ON a.item = b.item;
Edit:
Just edit the table names to match yours, then run it.
I would structure the update like this:
with cost_data as (
select
item,
max (cost) filter (where item = 'watch') as watch,
max (cost) filter (where item = 'book') as book
from table_b
group by item
)
update table_a a
set
watch = c.watch,
book = c.book
from cost_data c
where
a.item = c.item and
(a.watch is distinct from c.watch or
a.book is distinct from c.book)
In essence, I am doing a common table expression to do a poor man's pivot table on the Table B to get the rows into columns. One caveat here -- if there are multiple costs listed for the same item, this may not do what you want, but then you would need to know how to handle that in almost any case.
Then I am doing an "update A from B" against the CTE.
The last part is not critical, per se, but it is helpful -- to limit the query to only execute on rows that need to change. It's best to limit DML if it doesn't need to occur (the best way to optimize something is to not do it).
There are plenty of ways you could do this. If you are taking table b to be the one containing the price, then a left outer join will do the trick.
SELECT
    table_a.item,
    CASE
        WHEN table_b.cost IS NULL THEN 0
        ELSE table_b.cost
    END AS cost
FROM table_a
LEFT OUTER JOIN table_b ON table_a.item = table_b.item
The result also appears to suggest that pen, which is not in table b, should have a price of 0 (this is bad practice), but for the sake of returning the desired result you will want a case statement to assign a value when it is null.
In order to update the table, as per the comment
update table_a
set cost = some_alias.cost
from (
    SELECT
        table_a.item,
        CASE
            WHEN table_b.cost IS NULL THEN 0
            ELSE table_b.cost
        END AS cost
    FROM table_a
    LEFT OUTER JOIN table_b ON table_a.item = table_b.item
) some_alias
where table_a.item = some_alias.item
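A shorter variant of the same idea, sketched under the assumption that item is unique in table_b, uses a correlated subquery with COALESCE:
UPDATE table_a a
SET cost = COALESCE(
    (SELECT b.cost FROM table_b b WHERE b.item = a.item),  -- NULL when the item has no match
    0
);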

Large inner join

We have 2 tables of English words, words_1 and words_2, with fields (word as VARCHAR, ref as INT), where word is an English word and ref is a reference to another (third) table (it's not important).
In each table all words are unique. The first table contains some words that are not in the second one (and, conversely, the second table contains some unique words).
But most words in the two tables are the same.
Need to get: a result table with all distinct words and refs.
Initial conditions
Refs for the same words can be different (the dictionaries were loaded from different places).
The word count is 300 000 in each table, so an inner join is not convenient.
Examples
words_1
________
Health-1
Car-3
Speed-5
words_2
_________
Health-2
Buty-6
Fast-8
Speed-9
Result table
_____________
Health-1
Car-3
Speed-5
Buty-6
Fast-8
select word,min(ref)
from (
select word,ref
from words_1
union all
select word,ref
from words_2
) t
group by word
Try using a full outer join:
select coalesce(w1.word, w2.word) as word, coalesce(w1.ref, w2.ref) as ref
from words_1 w1 full outer join
words_2 w2
on w1.word = w2.word;
The only time this will not work is if ref can be NULL in either table. In that case, change the on to:
on w1.word = w2.word and w1.ref is not null and w2.ref is not null
If you want to improve performance, just create an index on the tables:
create index idx_words1_word_ref on words_1(word, ref);
create index idx_words2_word_ref on words_2(word, ref);
A join is quite doable and even without the index, SQL Server should be smart enough to come up with a reasonable implementation.
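If the merged words need to be persisted as an actual result table, one possible sketch (SQL Server syntax; result_words is an assumed name) is:
SELECT COALESCE(w1.word, w2.word) AS word,
       COALESCE(w1.ref, w2.ref) AS ref
INTO result_words
FROM words_1 w1
FULL OUTER JOIN words_2 w2 ON w1.word = w2.word;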

Join on two dissimilar columns

I need to join one column of a table to a column of another table.
These two columns contain geographic region data, but the issue is that the columns don't have exactly the same strings of data.
For example, Latin America in one column and LATM in another.
If the data in the tables had been the same strings this would be the simplest of joins, but these two values mean the same thing while being different strings. What do I use to accomplish my task?
What I need to do is:
Select * from Table1 Inner Join Table2 on table1.region = table2.region
You would need to create a mapping table which maps every possible region in Table1.region to every possible region in Table2.region.
For example, your mapping table could look like:
MappingTable
--------------------------
Region1        | Region2
--------------------------
Latin America  | LATM
Europe         | EUR
.....
Then you can create a join like:
Select *
from Table1
inner join MappingTable on Table1.region = MappingTable.Region1
inner join Table2 on MappingTable.Region2 = Table2.region
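A minimal sketch of that mapping table, with illustrative column types and the example values above:
CREATE TABLE MappingTable (
    Region1 varchar(100),   -- region name as it appears in Table1
    Region2 varchar(100)    -- region name as it appears in Table2
);

INSERT INTO MappingTable (Region1, Region2) VALUES
    ('Latin America', 'LATM'),
    ('Europe', 'EUR');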
You need to make another table which contains the information for joining the two tables, like 'Latin America' = 'LATM', and then use this table in the join.