My SQL query is taking too long, is there another approach?

I want to store 3D vector images in my MariaDB, but I'm finding that retrieving the data is taking far too long to be practical.
I have a few tables:
a points table containing the x, y and z coordinates plus the entity id,
an entity table containing a unique id, an entity type (text, line, polyline, etc.) and other common attributes such as colour and linetype,
and some auxiliary tables containing additional values like text, text height, line thickness and flags, split into separate tables based on field type (varchar, int or float).
I am accessing the data through PHP as follows:
if ($result = mysqli_query($conn, "SELECT entityID,X,Y,Z FROM dwgpoints WHERE drawing=".$DrawingID." AND blockID=".$blockID.";"))
{
    $previous_eID = 0;
    while ($row = mysqli_fetch_array($result))
    {
        $eID = $row['entityID'];
        if ($previous_eID != $eID)
        {
            if ($previous_eID) // confirm it's not zero
                renderEntity($image_handle, $DrawingID, $previous_eID, $etype, $colour, $ltype, $points, $transformation, $clip);
            $previous_eID = $eID;
            // one additional query per entity to fetch its attributes
            if ($eResult = mysqli_query($conn, "SELECT colour,ltype,etype FROM entity WHERE drawing=".$ID." AND eID=".$eID.";")) {
                $erow   = mysqli_fetch_assoc($eResult);
                $colour = $erow['colour'];
                $ltype  = $erow['ltype'];
                $etype  = $erow['etype'];
                $points = [[$row['X'], $row['Y'], $row['Z']]];
            }
        } else {
            $points[] = [$row['X'], $row['Y'], $row['Z']];
        }
    }
}
This process is taking up to ten minutes, but I know that OpenStreetMap, for example, renders tiles from similar amounts of data.
The result of the EXPLAIN directive is as follows:
MariaDB [wptest_11]> EXPLAIN SELECT entityID,X,Y,Z FROM dwgpoints WHERE drawing=2 AND blockID=-1;
+------+-------------+-----------+------+---------------+--------+---------+-------------+-------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+------+---------------+--------+---------+-------------+-------+-----------------------+
| 1 | SIMPLE | dwgpoints | ref | idx_id | idx_id | 9 | const,const | 24939 | Using index condition |
+------+-------------+-----------+------+---------------+--------+---------+-------------+-------+-----------------------+
Is it possible to streamline my data searches to make the process time manageable? I'm on a cheap VPS, so there may be hardware performance issues, in which case would an upgrade make much difference? Or do I need to rethink my approach?
Any advice would be most welcome.

For reference, I have run millions of records on a Raspberry Pi3 with better performance. Hardware helps, but I would first look at:
Minimizing your calls, if it makes sense for your application
Adding indexes on the fields in your WHERE clause to improve performance
Single Query
Adjust the following query to fit your needs. Then see what the timing is. For 26k results, you should be well under a second.
SELECT
a.entityID
, a.X
, a.Y
, a.Z
, b.colour
, b.ltype
, b.etype
, b.drawing
FROM dwgpoints a
LEFT JOIN entity b
ON b.eID = a.entityID
AND b.drawing = a.drawing
WHERE
a.drawing = ".$ID."
AND a.blockID = ".$blockID.
;
Indexes
Create some indexes:
The combination of drawing and blockID for the dwgpoints table.
The combination of drawing and eID for the entity table.
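For example, with MariaDB syntax (the index names are just placeholders; skip any index that already exists):
ALTER TABLE dwgpoints ADD INDEX idx_drawing_block (drawing, blockID);
ALTER TABLE entity ADD INDEX idx_drawing_eid (drawing, eID);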
Re-run the above query; there should be an improvement.
I would also look at adding a foreign key between the tables, which will help improve data integrity of your database.
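A sketch of such a constraint, assuming entity.eID is (or starts) a unique key and both tables use InnoDB; the constraint name is made up:
ALTER TABLE dwgpoints
    ADD CONSTRAINT fk_dwgpoints_entity
    FOREIGN KEY (entityID) REFERENCES entity (eID);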

Related

Query a dynamic deduplicated table

I am using BigQuery to give my colleagues access to aggregated data in our system.
I have a raw_orders table where I store orders data. The thing is that the lines in this table are subject to change over time. When a change occurs, I add a new line to this table. So my table looks like this:
+-----+-------+---------------------+---------------------+
| id | total | created_at | updated_at |
+-----+-------+---------------------+---------------------+
| ABC | 15.76 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| ABC | 12.43 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
| DEF | 19.03 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| DEF | 12.03 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
+-----+-------+---------------------+---------------------+
To allow my collaborators to query on a deduplicated table easily, I created a view of deduplicated lines using:
CREATE OR REPLACE VIEW xxx.orders as
select ro2.*
from (
    select ro.id, max(ro.updated_at) as max_updated_at
    from xxx.raw_orders ro
    group by ro.id
) tmp
inner join xxx.raw_orders ro2
    on ro2.id = tmp.id and ro2.updated_at = tmp.max_updated_at
order by ro2.created_at desc
This works great, but I feel that I am spending too much budget on simple requests like:
SELECT * FROM rubee.orders WHERE created_at > '2020-11-01 00:00:00';
If I understand well, because of the view step, BigQuery must scan a lot of data to deduplicate lines before returning a single result.
Am I doing something wrong here? How do you give access to deduplicated data without spending too much storage? Would you have a better strategy for what I try to do?
Ideally you would use a materialized view for this purpose, but right now BigQuery has limited support for materialized views: you cannot create one that replaces the view you are using.
It is possible to create a materialized view for the inner query, which may make the whole query less expensive, but please read on.
Cost. There is no simple answer as to whether you are "spending too much budget" on the query.
If you're on the pay-per-query plan and charged by "processed bytes", then although the query is more expensive for BigQuery to process, you're charged no more than scanning the whole table once (although technically the table was scanned more than once). In other words, the deduplication is free. However, if your query pattern would allow you to cluster/partition the table to avoid scanning the whole of it, then this "self-join" view does prevent you from saving budget.
If you have a reservation of slots, then you will benefit from making the query faster.
Suggestions. Given that the situation differs case by case, the general suggestions are:
If possible, separate the data into "archived" and "active" so that the "archived" data stays deduplicated (and partitioned/clustered to allow efficient search), and you only need a view to dedup the "active" data.
Creating a materialized view (on the inner "GROUP BY" query) may speed the query up a bit but will not necessarily make it "cheaper"; you may be charged for the size of the base table plus the mview.
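A sketch of that materialized view over the inner GROUP BY query (the name raw_orders_latest is just an example; whether it is accepted depends on BigQuery's current materialized-view restrictions):
CREATE MATERIALIZED VIEW xxx.raw_orders_latest AS
SELECT ro.id, MAX(ro.updated_at) AS max_updated_at
FROM xxx.raw_orders ro
GROUP BY ro.id;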

What is the shortest notation for updating a column in an internal table?

I have an ABAP internal table. Structured, with several columns (e.g. 25). Names and types are irrelevant. The table can get pretty large (e.g. 5,000 records).
| A | B | ... |
| --- | --- | --- |
| 7 | X | ... |
| 2 | CCC | ... |
| 42 | DD | ... |
Now I'd like to set one of the columns (e.g. B) to a specific constant value (e.g. 'Z').
What is the shortest, fastest, and most memory-efficient way to do this?
My best guess is a LOOP REFERENCE INTO. This is pretty efficient as it changes the table in-place, without wasting new memory. But it takes up three statements, which makes me wonder whether it's possible to get shorter:
LOOP AT lt_table REFERENCE INTO DATA(ls_row).
  ls_row->b = 'Z'.
ENDLOOP.
Then there is the VALUE operator which reduces this to one statement but is not very efficient because it creates new memory areas. It also gets longish for a large number of columns, because they have to be listed one by one:
lt_table = VALUE #( FOR ls_row IN lt_table ( a = ls_row-a
                                             b = 'Z' ) ).
Are there better ways?
The following code sets PRICE = 0 for all lines at a time. Theoretically, it should be the fastest way to update all the lines of one column, because it's a single statement. Note that it's impossible to omit the WHERE, so I use a simple trick to update all lines.
DATA flights TYPE TABLE OF sflight.
DATA flight TYPE sflight.
SELECT * FROM sflight INTO TABLE flights.
flight-price = 0.
MODIFY flights FROM flight TRANSPORTING price WHERE price <> flight-price.
Reference: MODIFY itab - itab_lines
If you have a work area declared...
workarea-field = 'Z'.
MODIFY lt_table FROM workarea TRANSPORTING field WHERE field <> workarea-field.
Edit:
After being able to check that syntax in my current system, I could confirm (to myself) that the WHERE clause must be added or a dump is raised.
Thanks Sandra Rossi.

For SQL performance: several equals or one between?

For a new development, I will have a big SQL table (~100M rows).
4 fields will be used to query the data.
Is it better to query one concatenated field with BETWEEN or several equals?
Example:
MainTable
PkId | Label | FkId1 | FkId2 | FkId3 | FkId4
1 | test | 1 | 4 | 3 | 1
Data in the FK tables is static, for example:
FkTable1
Id | Value
1 | a
2 | b
3 | c
To query the data, the classic SQL query is:
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where FkId1=1 and FkId2=2 and FkId3 in(2, 3)
The idea to optimize performance is to add one field, "UniqueId", calculated in the backend before the insert:
UniqueId = FkId1*1000000 + FkId2*10000 + FkId3*100 + FkId4
PkId | Label | FkId1 | FkId2 | FkId3 | FkId4 | UniqueId
1 | test | 1 | 4 | 3 | 1 | 1040301
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where UniqueId between 1020200 and 1040000
Moreover, with the UniqueId field, an index on this field alone should be sufficient.
What do you think?
Thanks
For this query:
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where FkId1 = 1 and FkId2 = 2 and FkId3 in (2, 3)
The optimal index is on MainTable(FkID1, FkId2, FkId3). You can also add Label and FkId4 to the index if you want a covering index (so the index can handle the entire query without referring to the original data pages).
There is no need for a computed field for the example you provided.
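For example (index names are placeholders and the exact syntax may vary by RDBMS):
CREATE INDEX idx_main_fk ON MainTable (FkId1, FkId2, FkId3);
-- covering variant that also includes the selected columns
CREATE INDEX idx_main_fk_covering ON MainTable (FkId1, FkId2, FkId3, FkId4, Label);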
Since you will have 100M rows, thinking about optimisations from the start seems sensible to me.
However, your proposed solution needs care on two points:
Each FK must get a different factor in the formula, i.e. a different power of 10; if two FKs share the same factor, the encoding becomes ambiguous.
Your SELECT example has an IN clause (FkId3 IN (2, 3)). This will only work if just one of the FKs is queried this way, and that FK should be the one with no factor in the formula for computing UniqueId (i.e. the one supplying the least significant digits of UniqueId).
Having now seen Gordon's answer, I agree with him: a combined index may be good enough for you (though your solution would probably be slightly faster). However, the combined index has a similar constraint: the FK field being queried with the IN clause should be the last field in the index.

How can I optimize this query...?

I have two tables, one for routes and one for airports.
Routes contains just over 9000 rows and I have indexed every column.
Airports contains only 2000 rows and I have also indexed every column.
When I run this query it can take up to 35 seconds to return 300 rows:
SELECT routes.* , a1.name as origin_name, a2.name as destination_name FROM routes
LEFT JOIN airports a1 ON a1.IATA = routes.origin
LEFT JOIN airports a2 ON a2.IATA = routes.destination
WHERE routes_build.carrier = "Carrier Name"
Running it with "DESCRIBE" I get the followinf info, but I'm not 100% sure on what it's telling me.
id | Select Type | Table | Type | possible_keys | Key | Key_len | ref | rows | Extra
--------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | routes_build | ref | carrier,carrier_2 | carrier | 678 | const | 26 | Using where
--------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | a1 | ALL | NULL | NULL | NULL | NULL | 5389 |
--------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | a2 | ALL | NULL | NULL | NULL | NULL | 5389 |
--------------------------------------------------------------------------------------------------------------------------------------
The only alternative I can think of is to run two separate queries and join them up with PHP, although I can't believe something like this could kill a MySQL server. So, as usual, I suspect I'm doing something stupid. SQL is my number 1 weakness.
Personally, I would start by removing the left joins and replacing them with inner joins as each route must have a start and end point.
It's telling you that it's not using an index for joining on the airports table. See how the "rows" column is so huge, 5000-odd? That's how many rows it's having to read to answer your query.
I don't know why, as you have claimed you have indexed every column. What is IATA? Is it unique? I believe that if MySQL decides an index is inefficient, it may ignore it.
EDIT: if IATA is a unique string, maybe try indexing half of it only? (You can select how many characters to index.) That may give MySQL an index it can use.
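A sketch of MySQL's prefix-index syntax (the index name and prefix length here are just examples):
ALTER TABLE airports ADD INDEX idx_iata_prefix (IATA(2));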
SELECT routes_build.*, a1.name AS origin_name, a2.name AS destination_name
FROM routes_build
LEFT JOIN airports a1 ON a1.IATA = routes_build.origin
LEFT JOIN airports a2 ON a2.IATA = routes_build.destination
WHERE routes_build.carrier = "Carrier Name"
From your EXPLAIN PLAN I can see that you don't have an index on airports.IATA.
You should create it for the query to work fast.
Name also suggests that it should be a UNIQUE index, since IATA codes are unique.
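For example (a sketch, assuming the column currently contains no duplicate values):
ALTER TABLE airports ADD UNIQUE INDEX idx_airports_iata (IATA);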
Update:
Please post your table definition. Issue this query to show it:
SHOW CREATE TABLE airports
Also, I should note that your FULLTEXT index on IATA is useless unless you have set ft_max_word_len in the MySQL configuration to 3 or less.
By default, it's 4.
IATA codes are 3 characters long, and MySQL doesn't search for such short words using FULLTEXT with the default settings.
After you implement Martin Robins's excellent advice (i.e. remove every instance of the word LEFT from your query), try giving routes_build a compound index on carrier, origin, and destination.
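For example (the index name is just a placeholder):
ALTER TABLE routes_build ADD INDEX idx_carrier_orig_dest (carrier, origin, destination);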
It really depends on what information you're trying to get to. You probably don't need to join airports twice and you probably don't need to use left joins. Also, if you can search on a numeric field rather than a text field, that would speed things up as well.
So what are you trying to fetch?

SQL magic - query shouldn't take 15 hours, but it does

OK, so I have one really monstrous MySQL table (900k records, 180 MB total), and I want to extract from subgroups the records with the highest date_updated and calculate a weighted average in each group. The calculation runs for ~15 hours, and I have a strong feeling I'm doing it wrong.
First, monstrous table layout:
category
element_id
date_updated
value
weight
source_prefix
source_name
The only key here is on element_id (BTREE, ~8k unique elements).
And the calculation process:
Make a hash for each group and subgroup.
CREATE TEMPORARY TABLE `temp1` (INDEX ( `ds_hash` ))
SELECT `category`,
`element_id`,
`source_prefix`,
`source_name`,
`date_updated`,
`value`,
`weight`,
MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`,
MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash`
FROM `bigbigtable` WHERE `date_updated` <= '2009-04-28'
I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, I presume.
Find the maximum date for each subgroup
CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
SELECT MAX(`date_updated`) AS `maxdate` , `subcat_hash`
FROM `temp1`
GROUP BY `subcat_hash`;
Join temp1 with temp2 to find weighted average values for categories
CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
SELECT `temp1`.`element_id`,
`temp1`.`category`,
`temp1`.`source_prefix`,
`temp1`.`source_name`,
`temp1`.`date_updated`,
AVG(`temp1`.`value`) AS `avg_value`,
SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`
FROM `temp1` LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
AND `temp1`.`date_updated` = `temp2`.`maxdate`
GROUP BY `temp1`.`cat_hash`;
(Now that I have looked through it and written it all down, it seems to me that I should use an INNER JOIN in that last query, to avoid a 900k × 900k temp table.)
Still, is there a normal way to do so?
UPD: some picture for reference:
removed dead ImageShack link
UPD: EXPLAIN for proposed solution:
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| 1 | SIMPLE | cur | ALL | NULL | NULL | NULL | NULL | 893085 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | next | ref | prefix | prefix | 1074 | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id | 1 | 100.00 | Using where |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
Using hashes is one of the ways in which a database engine can execute a join. It should be very rare that you'd have to write your own hash-based join; this certainly doesn't look like one of those cases, with a 900k-row table and some aggregates.
Based on your comment, this query might do what you are looking for:
SELECT cur.source_prefix,
cur.source_name,
cur.category,
cur.element_id,
MAX(cur.date_updated) AS DateUpdated,
AVG(cur.value) AS AvgValue,
SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
ON next.date_updated < '2009-05-01'
AND next.source_prefix = cur.source_prefix
AND next.source_name = cur.source_name
AND next.element_id = cur.element_id
AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name,
cur.category, cur.element_id
The GROUP BY performs the calculations per source+category+element.
The JOIN is there to filter out old entries. It looks for later entries, and then the WHERE statement filters out the rows for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
There are many ways of filtering out old entries, but this one tends to perform reasonably well.
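A sketch of that index, using the table name from the query above (the index name is made up):
ALTER TABLE eev0 ADD INDEX idx_latest_entry (source_prefix, source_name, element_id, date_updated);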
OK, so 900k rows isn't a massive table; it's reasonably big, but your queries really shouldn't be taking that long.
First things first, which of the 3 statements above is taking the most time?
The first problem I see is with your first query: your WHERE clause doesn't include an indexed column, which means it has to do a full scan of the entire table.
Create an index on the date_updated column, then run the query again and see what that does for you.
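For example (table name taken from the question's first query; the index name is just a placeholder):
ALTER TABLE bigbigtable ADD INDEX idx_date_updated (date_updated);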
If you don't need the hashes and are only using them for the dark magic, then remove them completely.
Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.
Edit: My SQL is a little rusty, but are you joining twice in the third SQL statement? Maybe it won't make a difference, but shouldn't it be:
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
or
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1, temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;