I am running a query on localhost, and I am extremely unfamiliar with SQL. I am using a golang library to generate the query statement. This is for an enterprise app, so I don't have time to evaluate and code for every possible performance case. I'd prefer good performance for the largest possible queries:
up to 6 query parameters, e.g. BETWEEN 'created' AND 'abandoned', BETWEEN X AND Y, IN (1, 2, 3, ..., 25), IN ('A', 'B', 'C', ..., 'Z')
JOINs/subqueries across 2-5 tables
returning between 50K and 5M records (LAT and LNG)
Currently I am using JOIN to find the lat/lng for a record, plus some query parameters. Should I join differently (LEFT, RIGHT)? Should the FROM table be the record or the relation? Subqueries?
Is this query performance reasonable from a UI perspective? This is on localhost (Docker) on a fairly low-performance laptop under WSL (16 GB RAM / 6-core CPU @ 2.2 GHz).
-- [2547.438ms] [rows:874731]
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies ON Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
WHERE Lithologies.Colour IN
(
'NULL',
'Gray','White','Dark','Black','Dark Gray','Dark Brown','Dark Red','Dark Blue',
'Dark Green','Dark Yellow','Bluish Green','Brownish Gray','Brownish Green','Brownish Yellow',
'Light','Light Gray','Light Red','Greenish Gray','Light Yellow','Light Green','Light Blue',
'Light Brown','Blue Gray','Greenish Yellow','Greenish Gray'
);
The UI is a heatmap. I haven't really hit performance issues returning 1 million rows.
Angular is the framework. I'm breaking the HTTP response into 10K-record chunks.
My initial impression was that 3+ seconds is too long for a UI to start populating data. I was already breaking the response to the UI into chunks, and that portion was efficient and async. It never occurred to me to simply break the SQL requests into smaller chunks with LIMIT and OFFSET, so the server can start responding with data immediately (<200 ms) even if it takes 5+ seconds to completely finish loading.
I'll write an answer to this effect.
Thanks and best regards,
schmorrison
A few things.
where someColumn in (null, ...)
This will not return rows where the value of someColumn is null, because x in (a, b, c, ...) translates to x = a or x = b or x = c, and null is not equal to null. (Also note that in your query 'NULL' is quoted, so it is just a string literal that matches the text 'NULL', not a missing value.)
You need a construct like this instead:
where someColumn is null or someColumn in (...)
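Applied to the query in the question, that would look something like this (I dropped the quoted 'NULL' entry, since as a string it only matches rows whose Colour is literally the text 'NULL'):
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies ON Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
WHERE Lithologies.Colour IS NULL
   OR Lithologies.Colour IN
(
'Gray','White','Dark','Black','Dark Gray','Dark Brown','Dark Red','Dark Blue',
'Dark Green','Dark Yellow','Bluish Green','Brownish Gray','Brownish Green','Brownish Yellow',
'Light','Light Gray','Light Red','Greenish Gray','Light Yellow','Light Green','Light Blue',
'Light Brown','Blue Gray','Greenish Yellow'
);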
Second, you mentioned you're returning between 50k and 5M rows to the UI. I question the sanity of this... how is the UI rendering 5 million sets of coordinates for the user to see/use? I suppose there could be some extreme edge cases where this is really what you need to do, but it just seems really unlikely.
Third, if your concern is UI responsiveness, the proper way to handle that is to make asynchronous requests. I don't know anything about golang, but I did find this page with a quick Google search. Study up on that kind of technique.
Finally, if you really do need to work with data sets this big, the critical point will be to identify the common search criteria and talk to your DBA about appropriate indexes. We can't provide much help in this regard without a lot more schema information, but if you have a specific query that is taking a long time with a particular set of parameters, you can come back and create a question for that query, along with providing the query plan, and we can help you out.
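For illustration only (the index names are made up, and the right choice depends on your schema and the actual query plan), these are the kinds of composite indexes a DBA would likely consider for the query in the question, covering the filter column and the join keys:
CREATE INDEX IX_Lithologies_Colour_Report
    ON Lithologies (Colour, Well_Report_ID);

CREATE INDEX IX_Well_Reports_Report_Well
    ON Well_Reports (Well_Report_ID, Well_ID);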
As pointed out by The Impaler,
"If you want good performance for dynamic queries that return 5
million rows, then you'll need to learn the intricacies of database
engines, not only SQL. A high level knowledge unfortunately won't cut
it."
and I expected as much.
I was simply querying the SQL DB for the whole set at once; I know now that this was wrong.
The changes I made generate the following statements (I know the specifics have changed):
-- [145.955ms] [rows:1]
SELECT count(*)
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
);
Calculated total(69977) chunk (17494) currentpage(3)
-- [117.195ms] [rows:17494]
SELECT "Easting","Northing"
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
)
ORDER BY (SELECT NULL)
OFFSET 34988 ROWS
FETCH NEXT 17494 ROWS ONLY;
Chunk count: 17494
JSON size: 385135
Request time: 278.6612ms
The request and response to the UI look like this:
GET /data/wells/xxxx/?page=3&completed.start=1800-01-01T06:58:36.000Z&completed.end=2021-09-12T06:00:00.000Z&status=supply,research,geothermal,other,unknown&use=domestic,commercial,industrial,municipal,irrigation,agriculture&depth.start=0&depth.end=248&rate.start=0&rate.end=3321&page=3 HTTP/1.1 200 385135
{
"data":[[0.0, 0.0],......],
"count": 87473,
"total": 874731,
"pages": 11,
"page": 1
}
In this case, the DBs are essentially immutable (I may get an updated dataset once per year). I figure I can predefine a set of chunk-size variations for each DB based on query result size rather than just DB size. I am also retrieving the next page before the client requests it, and I am caching requests at the browser and client layers. I only perform the COUNT(*) request if the client doesn't provide the total and it isn't in the cache.
I did find that running the requests concurrently simply overburdened my CPU: all the requests returned almost simultaneously, but each took 5+ seconds rather than ~1 second.
I have a situation where I need to execute a patch script against millions of rows of data. The current execution time does not meet expectations even for a small number of rows (18,000), which takes around 4 hours (test data, before deploying to live).
The patch script selects millions of rows of data in a loop and updates them according to the specification. I just wonder how long it could take for millions of rows, since it takes around 4 hours for just 18,000 rows.
To overcome this problem I decided to create a temp table that holds the entire result of the SELECT statement and run the patch process against that temp table, where the process should be a bit faster than selecting and updating row by row (a rough sketch of the idea is below).
Are there any other ways I can handle this situation? Any suggestions and ways to solve this are welcome.
(Due to company policy I am unable to post the PL/SQL script here.)
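Since I can't post the real script, here is a rough sketch of the staging-table idea (all table and column names are placeholders): stage the SELECT result once, then apply the patch in a single set-based statement instead of a row-by-row loop.
-- stage the rows to be patched (placeholder names)
CREATE TABLE patch_stage AS
SELECT t.row_id, t.some_value
FROM   source_table t
WHERE  t.needs_patch = 'Y';

-- apply the patch rule in one set-based pass instead of looping
MERGE INTO source_table tgt
USING patch_stage stg
ON    (tgt.row_id = stg.row_id)
WHEN MATCHED THEN
  UPDATE SET tgt.some_value = 'PATCHED';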
Since it seems no one here can answer my question, I'll post my own answer.
In Oracle there is Parallel Execution, which allows spreading the processing of a single SQL statement across multiple threads.
By using this method I brought my long-running query down from 4 hours to about 6 minutes.
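As a rough sketch of what this looks like (the table, column, and degree of parallelism here are placeholders, not my actual script):
-- enable parallel DML for the session, then request a parallel degree via a hint
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(t, 8) */ patch_target t
SET    t.status_code = 'PATCHED'
WHERE  t.status_code = 'PENDING';

COMMIT;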
For more information :
https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
http://www.oracle.com/technetwork/articles/database-performance/geist-parallel-execution-1-1872400.html
I have a table challenge containing about 12,000 rows. Every point connects to the four points around it; for example, 100 connects to 99, 101, 11 and 189. I tried this with a small-scale table and it worked just fine, but as I increased the size of the table the query became exponentially slower, and now it won't even finish. Here's my query:
SELECT level, origin, destination
FROM challenge
WHERE destination = 2500
START WITH origin = 1
CONNECT BY NOCYCLE PRIOR destination = origin;
Any advice on how to optimize this query would be greatly appreciated.
So you're finding every path from node 1 to node 2500 in a degree-4 graph (rectangular lattice?) of thousands of nodes. I expect there'll be quite a lot of them. Did the challenge just ask you to count them? Because I think the point was that you have to figure out how many there are by doing math, not brute force computation.
For example, if it's a 50x50 rectangular grid with node 1 and node 2500 in opposite corners, then the minimum path length is 100 steps. A path of 100 steps will have 50 of them horizontal and 50 of them vertical, and they can come in any order. Figure out how many ways you can arrange a string of 50 H's and 50 V's and you might find it's a number that even the mighty Oracle will have a bit of a problem with. (Generating the rows, that is. Doing the calculation just requires large integer arithmetic, which Oracle can probably do quite quickly once you tell it the formula.)
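Concretely, for that 50x50 example the number of shortest paths alone is

\[
\binom{100}{50} = \frac{100!}{50!\,50!} \approx 1.01 \times 10^{29}.
\]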
And your query is actually worse than that. It doesn't ask only for minimum-length paths. So it will also return all the paths of length 102 that take a step away from the destination somewhere along the way. And paths of length 104 that take 2 backward steps. And paths of length 2498 that visit almost all of the nodes! Counting those paths is more complicated than counting the short paths because you have to exclude the ones that cross themselves.
I'm in serious trouble: I have a huge, subtle query that takes a very long time to execute. It actually freezes Access, and sometimes I have to kill it. The query looks like this:
SELECT
ITEM.*,
ERA.*,
ORDR.*,
ITEM.COnTY1,
(SELECT TOP 1 New FROM MAPPING WHERE Old = ITEM.COnTY1) AS NewConTy1,
ITEM.COnValue1,
(SELECT TOP 1 KBETR FROM GEN_KUMV WHERE KNUMV = ERA.DOCCOND AND KSCHL = (SELECT TOP 1 New FROM MAPPING WHERE Old = ITEM.COnTY1)) AS NewCOnValue1
--... etc: this continues until ConTy40
FROM
GEN_ITEMS AS ITEM,
GEN_ORDERS AS ORDR,
GEN_ERASALES AS ERA
WHERE
ORDR.ORDER_NUM = ITEM.ORDER_NUM AND -- link between ITEM and ORDR
ERA.concat = ITEM.concat -- link between ERA and ITEM
I won't provide you with the table schemas since the query works; what I'd like to know is whether there's another technique for adding NewConTy1 and NewConValue1 that would make it more efficient. The thing is that the Con* fields go from 1 to 40, so I have to align them (NewConTy1 next to ConTy1, NewConValue1 next to ConValue1, and so on up to 40).
ConTy# and ConTyValue# are in ITEMS (each in a field)
NewConty# and NewConValue# are in ERA (each in a record)
I really hope my explanation is enough to figure out my issue.
Looking forward to hearing from you guys.
EDIT:
Ignore the TOP 1 in the SELECTs; it's there because the current data dumps I have aren't accurate, and it's going to be removed later.
EDIT 2:
Another thing: my query also returns up to 230 fields, lol.
Thanks
Miloud
Have you considered a union query to normalize items?
SELECT "ConTy1" As CTName, Conty1 As CTVal,
"ConTyValue1" As CTVName, ConTyValue1" As CTVVal
FROM ITEMS
UNION ALL
SELECT "ConTy2" As CTName, Conty2 As CTVal,
"ConTyValue2" As CTVName, ConTyValue2" As CTVVal
FROM ITEMS
<...>
UNION ALL
SELECT "ConTy40" As CTName, Conty40 As CTVal,
"ConTyValue40" As CTVName, ConTyValue40" As CTVVal
FROM ITEMS
This can either be a separate query that links into your main query, or a subquery of your main query, if that is more convenient. It should then be easy enough to draw in the relationship to the NewConty# and NewConValue# in ERA.
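For example, if the union is saved as a query called qryItemsNormalized (the name is just for illustration), the MAPPING lookup from your original query becomes a plain join:
SELECT N.CTName, N.CTVal, M.New AS NewCTVal
FROM qryItemsNormalized AS N
LEFT JOIN MAPPING AS M ON M.Old = N.CTVal;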
Remou's answer gives what you want - a significantly different approach. It's been a while since I've meddled with MS Access query optimization and I have forgotten the details of its planner, but you might want to try a trivial suggestion: rewrite your WHERE join conditions as INNER JOIN ... ON conditions.
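In Access that would look something like this (the tables and join keys are taken from your query; Access wants the nested parentheses):
SELECT ITEM.*, ERA.*, ORDR.*
FROM (GEN_ITEMS AS ITEM
      INNER JOIN GEN_ORDERS AS ORDR
              ON ORDR.ORDER_NUM = ITEM.ORDER_NUM)
      INNER JOIN GEN_ERASALES AS ERA
              ON ERA.concat = ITEM.concat;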
You are firing 40-ish correlated subqueries, so the above probably will not help (again, Remou's answer takes a significantly different approach and you might see real improvements there), but do let us know, as it is trivial to test.
Another approach you can take is to materialize the expensive parts: take Remou's idea, but split it into different pieces that you can join on directly.
For example, your first subquery is correlated on ITEM.COnTY1, and your second is correlated on ERA.DOCCOND and ITEM.COnTY1.
If you classify your subqueries according to their correlated keys, then you can save them as queries (or materialize them as make-table queries) and join on them (or on the newly created tables), which might perform much faster (and in the case of make-tables will perform much faster, at the expense of materialization: you'll have to run some queries before getting the latest data, which can be encapsulated in a macro or a VBA function/sub).
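For example, the ITEM.COnTY1 lookup could be saved (or made into a table) once and then joined directly; the name tblMapFirst and the use of First() are just illustrative, playing the role of your TOP 1:
-- make-table query: one New value per Old key
SELECT MAPPING.Old, First(MAPPING.New) AS New
INTO tblMapFirst
FROM MAPPING
GROUP BY MAPPING.Old;

-- then join instead of firing a correlated subquery per row
SELECT ITEM.COnTY1, tblMapFirst.New AS NewConTy1
FROM GEN_ITEMS AS ITEM
LEFT JOIN tblMapFirst ON tblMapFirst.Old = ITEM.COnTY1;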
Otherwise (for example, if you run the above query regularly as part of your normal business use case), redesign your DB.
As part of a data analysis project, I will be issuing some long-running queries against a MySQL database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all of the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and the SQL query?
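To illustrate the GROUP BY point, a query like the following (the table and columns are made up) cannot return its first row until every row has been scanned and grouped; the LIMIT only trims the finished result:
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
LIMIT 10;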
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table-create privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work out your queries on, and you can inner join sample_table against the other tables to speed up testing and query results. Thanks to the sampling, your query results should be roughly representative of what you would get on the full data. Note that the number you're modding by has to be prime, otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of its original size (0.0987% to be exact).
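For example (other_table and its columns are placeholders), once the sample exists you join against it instead of against the full table X:
SELECT s.unique_id, s.data_value, o.other_value
FROM sample_table s
INNER JOIN other_table o
        ON o.unique_id = s.unique_id;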
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output (as can happen for queries with GROUP BY, ORDER BY, or HAVING clauses), then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by setting "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; I think in the C API you have to call mysql_use_result() instead of mysql_store_result() after executing the query.