geometry matching and update - sql

I am trying to match geometries from two tables and update one table based on the match, but this is taking a huge amount of time.
Table1
+--------+----------+-----------+
| Column | Type     | Modifiers |
|--------+----------+-----------|
| id     | bigint   |           |
| jid    | integer  |           |
| geom   | geometry |           |
+--------+----------+-----------+
Indexes:
    "points_geom_gix" gist (geom)
    "points_jid_idx" btree (jid)
Table2
+--------+----------+-----------+
| Column | Type     | Modifiers |
|--------+----------+-----------|
| id     | integer  |           |
| geom   | geometry |           |
+--------+----------+-----------+
Indexes:
    "jxn_geom_idx" gist (geom)
I tried the queries below.
UPDATE table1 SET jid = a.id from table2 a WHERE st_equals(geom,a.geom);
and
UPDATE table1 SET jid = b.id from table1 as a JOIN table2 b on st_equals(a.geom,b.geom);
But both queries take a huge amount of time (hours).
If I just count the matching geometries in the two tables, the count comes back within seconds.
UPDATE
I am using PostgreSQL 9.5.7 and PostGIS 2.2.1.

If you require only a bounding-box-level comparison, use "=" on the geometry columns instead of st_equals; it's fast. Like a.geom = b.geom.
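If you do need exact matching, a common pattern (a minimal sketch against the schema above) is to pre-filter on the bounding box with the && operator, which can use the GiST indexes, and let st_equals verify the survivors:

UPDATE table1
SET jid = a.id
FROM table2 a
WHERE table1.geom && a.geom                -- index-assisted bounding-box pre-filter
  AND st_equals(table1.geom, a.geom);      -- exact geometry comparison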


array clustering with unique identifier for file datasets

I have a dataset with a big int array column in S3 and I want to filter rows efficiently based on the array values. I know we can use a GIN index in a SQL table, but I need a solution that works on the S3 dataset. I am planning to use a cluster id for each combination of elements in the array (their cardinality is not huge, max 2500) and then store it as a new column to which a filter can later be applied.
Example,
Table A
+------+------+-----------+
| Col1 | Col2 | Col3      |
+------+------+-----------+
| 1    | 101  | [123,234] |
| 2    | 102  | [123]     |
| 3    | 103  | [234,345] |
+------+------+-----------+
I am trying to add a new column, like:
Table B (column Col3 will be removed from actual schema)
+------+------+-----------+-----+
| Col1 | Col2 | Col3      | Cid |
+------+------+-----------+-----+
| 1    | 101  | [123,234] | 1   |
| 2    | 102  | [123]     | 2   |
| 3    | 103  | [234,345] | 3   |
+------+------+-----------+-----+
and there will be another table mapping Col3 to Cid, like:
Table C
+-----------+-----+
| Col3      | Cid |
+-----------+-----+
| [123,234] | 1   |
| [123]     | 2   |
| [234,345] | 3   |
+-----------+-----+
Table C will get a new entry whenever a new combination is created, and B will be updated whenever an array element gets added or removed. The goal is to be able to filter records from Table A efficiently based on values in the array column. Queries like 123 = ANY(Col3) can then be served as Cid IN (1, 2), and queries like [123, 345] = ANY(Col3) (any of the listed values) can be served as Cid IN (1, 2, 3).
Is there a better way to solve this problem?
I am also thinking of creating the required combinations at runtime to limit the number of combinations. Is it a good idea to create the minimum set of combinations?
In Postgres, you can create the mapping table and use a join to calculate the values:
create table array_dim as
select col3 as arr, row_number() over (order by min(col1)) as array_id
from t
group by col3;
You can then add the new column:
select a.*, ad.array_id
from a
join array_dim ad
    on a.col3 = ad.arr;
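A filter can then be answered by first looking up the matching cluster ids in the mapping table. A sketch using the names above (assuming the fact table b carries the new cid column populated from array_dim.array_id):

-- all rows whose array contains 123
select b.*
from b
where b.cid in (select array_id from array_dim where 123 = any(arr));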

How do I update a column from a table with data from another column of the same table?

I have a table "table1" like this:
+----+-------------+-----+
| id | barcode     | lot |
+----+-------------+-----+
| 0  | ABC-123-456 |     |
| 1  | ABC-123-654 |     |
| 2  | ABC-789-EFG |     |
| 3  | ABC-456-EFG |     |
+----+-------------+-----+
I have to extract the number in the center of the "barcode" column, like with this query:
SELECT SUBSTR(barcode, 5, 3) AS ToExtract FROM table1;
The result:
+-----------+
| ToExtract |
+-----------+
| 123       |
| 123       |
| 789       |
| 456       |
+-----------+
And insert this into the column "lot" .
Follow along these lines:
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
i.e. in your case:
UPDATE table1
SET lot = SUBSTR(barcode, 5, 3)
WHERE condition; -- (if any)
UPDATE table1 SET lot = SUBSTR(barcode, 5, 3)
-- WHERE ...;
Many databases support generated (aka "virtual"/"computed" columns). This allows you to define a column as an expression. The syntax is something like this:
alter table table1 add column lot varchar(3) generated always as (SUBSTR(barcode, 5, 3))
Using a generated column has several advantages:
It is always up-to-date.
It generally does not occupy any space.
There is no overhead when creating the table (although there is overhead when querying the table).
I should note that the syntax varies a bit among databases. Some don't require the type specification. Some use just as instead of generated always as.
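For example, hedged sketches of two common dialects (the exact clause names depend on your database and version):

-- MySQL: VIRTUAL (computed on read) is the default
ALTER TABLE table1 ADD COLUMN lot varchar(3)
    GENERATED ALWAYS AS (SUBSTR(barcode, 5, 3)) VIRTUAL;

-- PostgreSQL 12+: only STORED generated columns are supported
ALTER TABLE table1 ADD COLUMN lot varchar(3)
    GENERATED ALWAYS AS (SUBSTR(barcode, 5, 3)) STORED;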
CREATE TABLE Table1 (id INT, barcode varchar(255), lot varchar(255));

INSERT INTO Table1 VALUES
    (0, 'ABC-123-456', NULL),
    (1, 'ABC-123-654', NULL),
    (2, 'ABC-789-EFG', NULL),
    (3, 'ABC-456-EFG', NULL);

UPDATE a
SET a.lot = SUBSTRING(b.barcode, 5, 3)
FROM Table1 a
INNER JOIN Table1 b ON a.id = b.id
WHERE a.lot IS NULL;
 id | barcode     | lot
 -: | :---------- | :--
  0 | ABC-123-456 | 123
  1 | ABC-123-654 | 123
  2 | ABC-789-EFG | 789
  3 | ABC-456-EFG | 456

postgresql update column with least value of column from another table based on condition

I'm trying to run an update query on the answer_date column of a table P. I want to fill each row of answer_date in P with the unique date from the create_date column of H where H.postid matches P.acceptance_id and where P.acceptance_id is not empty.
The query takes a long while to run, so I checked the interim changes in answer_date, but the entire column is as empty as when it was created.
B-tree indexes exist on all the mentioned columns.
Is there something wrong with the query?
UPDATE P
SET answer_date = subquery.date
FROM (SELECT DISTINCT H.create_date as date
      FROM H, P
      WHERE H.postid = P.acceptance_id
     ) AS subquery
WHERE P.acceptance_id is not null;
Table schema is as follows:
Table "public.P"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+-----------+----------+--------------+-------------
id | integer | not null | plain | |
acceptance_id | integer | | plain | |
answer_date | timestamp without time zone | | plain | |
Indexes:
"posts_pkey" PRIMARY KEY, btree (id)
"posts_accepted_answer_id_idx" btree (acceptance_id) WITH (fillfactor='100')
and
Table "public.H"
Column | Type | Modifiers | Storage | Stats target | Description
-------------------+-----------------------------+-----------+----------+--------------+-------------
id | integer | not null | plain | |
postid | integer | | plain | |
create_date | timestamp without time zone | not null | plain | |
Indexes:
"H_pkey" PRIMARY KEY, btree (id)
"ph_creation_date_idx" btree (create_date) WITH (fillfactor='100')
Table P has 70 million rows and H has 220 million rows.
The Postgres version is 9.6.
The hardware is a Windows laptop with 8 GB of RAM.
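For what it's worth, the subquery above is not correlated with the outer P, so the UPDATE pairs every row with the same set of dates. A minimal sketch of a correlated rewrite, assuming the intended match is H.postid = P.acceptance_id and "least value" means the earliest create_date:

UPDATE P
SET answer_date = subquery.min_date
FROM (SELECT postid, min(create_date) AS min_date
      FROM H
      GROUP BY postid) AS subquery   -- aggregate H once, keyed by postid
WHERE P.acceptance_id = subquery.postid
  AND P.acceptance_id IS NOT NULL;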

Rewriting this subquery?

I am trying to build a new table with the values from an existing table that are NOT contained in another table (though the query below obviously checks for containment). Following is my table structure:
mysql> explain t1;
+-------+---------------------+------+-----+---------+-------+
| Field | Type                | Null | Key | Default | Extra |
+-------+---------------------+------+-----+---------+-------+
| id    | int(11)             | YES  |     | NULL    |       |
| point | bigint(20) unsigned | NO   | MUL | 0       |       |
+-------+---------------------+------+-----+---------+-------+
mysql> explain whitelist;
+------------+---------------------+------+-----+---------+----------------+
| Field      | Type                | Null | Key | Default | Extra          |
+------------+---------------------+------+-----+---------+----------------+
| id         | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| x          | bigint(20) unsigned | YES  |     | NULL    |                |
| y          | bigint(20) unsigned | YES  |     | NULL    |                |
| geonetwork | linestring          | NO   | MUL | NULL    |                |
+------------+---------------------+------+-----+---------+----------------+
My query looks like this:
SELECT point
FROM t1
WHERE EXISTS (SELECT source
              FROM whitelist
              WHERE MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))));
Explain:
+----+--------------------+-----------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type        | table     | type  | possible_keys | key   | key_len | ref  | rows | Extra                    |
+----+--------------------+-----------+-------+---------------+-------+---------+------+------+--------------------------+
|  1 | PRIMARY            | t1        | index | NULL          | point | 8       | NULL | 1001 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | whitelist | ALL   | _geonetwork   | NULL  | NULL    | NULL | 3257 | Using where              |
+----+--------------------+-----------+-------+---------------+-------+---------+------+------+--------------------------+
The query takes 6 seconds to execute for 1000 records in t1, which is unacceptable for me. How can I rewrite this query using joins (or perhaps in some faster way, if one exists) given that I don't have a column to join on? Even a stored procedure is acceptable, I guess, in the worst case. My goal is to finally create a new table containing entries from t1. Any suggestions?
Unless the query optimizer is failing, a WHERE EXISTS construct should result in the same plan as a join with a GROUP BY clause. Look at optimizing MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))); that's probably where your query is spending all its time. I don't have a suggestion for that, but here's your query written with a JOIN:
Select t1.point
from t1
join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))
group by t1.point
;
or to get the points in t1 not in whitelist:
Select t1.point
from t1
left join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))
where whitelist.id is null
;
This seems like a case where de-normalizing t1 might be beneficial. Adding a GeomFrmTxt column with a value of GeomFromText(CONCAT('POINT(', t1.point, ' 0)')) could speed up the query you already have.
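A sketch of that denormalization (the column and index names here are illustrative, not from the original schema):

ALTER TABLE t1 ADD COLUMN pt POINT;

UPDATE t1
SET pt = GeomFromText(CONCAT('POINT(', point, ' 0)'));

-- a SPATIAL index requires a NOT NULL geometry column
-- (MyISAM, or InnoDB from MySQL 5.7 on)
ALTER TABLE t1 MODIFY pt POINT NOT NULL;
ALTER TABLE t1 ADD SPATIAL INDEX (pt);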

Why does select statement influence query execution and performance in MySQL?

I'm encountering a strange behavior of MySQL.
Query execution (i.e. the usage of indexes as shown by explain [QUERY]) and the time needed for execution depend on the elements of the select clause.
Here is a query where the problem occurs:
select distinct
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1, ent_leng el1, rel_c r1, _tax_c t1, rel_c r2, _tax_c t2
where el1.fk_ent=e1.idx
and r1.fk_ent=e1.idx and ((r1.fk_cat=43) or (r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and r2.fk_ent=e1.idx and ((r2.fk_cat=10) or (r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
The corresponding explain output is:
| id | select_type | table | type   | possible_keys           | key     | key_len | ref           | rows  | Extra                                                 |
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+-------------------------------------------------------+
|  1 | SIMPLE      | el1   | index  | fk_ent                  | fk_ent  | 4       | NULL          | 15002 | Using index; Using temporary                          |
|  1 | SIMPLE      | e1    | eq_ref | PRIMARY                 | PRIMARY | 4       | DB.el1.fk_ent | 1     | Using index                                           |
|  1 | SIMPLE      | r1    | ref    | fk_ent,fk_cat,fks       | fks     | 4       | DB.e1.idx     | 1     | Using where; Using index                              |
|  1 | SIMPLE      | r2    | ref    | fk_ent,fk_cat,fks       | fks     | 4       | DB.el1.fk_ent | 1     | Using index                                           |
|  1 | SIMPLE      | t1    | index  | fk_cat1,fk_cat2,fk_cats | fk_cats | 8       | NULL          | 69    | Using where; Using index; Distinct; Using join buffer |
|  1 | SIMPLE      | t2    | index  | fk_cat1,fk_cat2,fk_cats | fk_cats | 8       | NULL          | 69    | Using where; Using index; Distinct; Using join buffer |
As you can see a one-column index has the same name as the column it belongs to. I also added some useless indexes along with the used ones, just to see if they change the execution (which they don't).
The execution takes ~4.5 seconds.
When I add the column el1.name to the select part (nothing else changed), the index fk_ent on el1 can no longer be used:
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra           |
+----+-------------+-------+------+---------------+------+---------+------+-------+-----------------+
|  1 | SIMPLE      | el1   | ALL  | fk_ent        | NULL | NULL    | NULL | 15002 | Using temporary |
The execution now takes ~8.5 seconds.
I always thought that the select part of a query does not influence the usage of indexes by the engine and doesn't affect performance in such a way.
Leaving out the attribute isn't a solution, and there are even more attributes that I have to select.
Even worse, the query I actually use is a bit more complex still, which makes the performance issue a big problem.
So my questions are:
1) What is the reason for this strange behavior?
2) How can I solve the performance problem?
Thanks for your help!
Gred
It's the DISTINCT restriction. You can think of that as another WHERE restriction. When you change the select list, you are really changing the WHERE clause for the DISTINCT restriction, and now the optimizer decides that it has to do a table scan anyway, so it might as well not use your index.
EDIT:
Not sure if this helps, but if I am understanding your data correctly, I think you can get rid of the DISTINCT restriction like this:
select
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1
Inner Join ent_leng el1 ON el1.fk_ent=e1.idx
Inner Join rel_c r1 ON r1.fk_ent=e1.idx
Inner Join rel_c r2 ON r2.fk_ent=e1.idx
where
((r1.fk_cat=43) or Exists(Select 1 From _tax_c t1 Where r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and
((r2.fk_cat=10) or Exists(Select 1 From _tax_c t2 Where r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
MySQL will return data directly from an index when possible, saving the entire row from being loaded. In this way, the selected columns can influence the index selection.
With this in mind, it can be much more efficient to add all required columns to an index, especially in the case of only selecting a small subset of columns.
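For the query above, that suggests a covering index on ent_leng, so that el1 can still be served from an index after el1.name joins the select list. A sketch (the index name is made up):

ALTER TABLE ent_leng ADD INDEX fk_ent_name (fk_ent, name);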