SQL query performance - UI responsiveness concerns

I am running a query on localhost, and I am extremely unfamiliar with SQL. I am using a golang library to generate the query statement. This is for an enterprise app, so I don't have time to evaluate and code for every possible performance case. I'd prefer good performance for the largest possible queries:
up to 6 query parameters, e.g. BETWEEN 'created' AND 'abandoned', BETWEEN X AND Y, IN (1,2,3.....25), IN ('A', 'B', 'C'....'Z')
JOINs/subqueries between 2-5 tables
returning between 50K-5M records (LAT and LNG)
Currently I am using JOIN to find the lat, lng for a record, along with some query parameters. Should I join differently (LEFT, RIGHT)? Should the FROM table be the record or the relation? Subqueries?
Is this query performance reasonable from a UI perspective? This is on localhost (Docker) on a fairly low-performance laptop, under WSL (16GB RAM / 6-core CPU [2.2GHz]).
-- [2547.438ms] [rows:874731]
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies ON Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
WHERE Lithologies.Colour IN
(
'NULL',
'Gray','White','Dark','Black','Dark Gray','Dark Brown','Dark Red','Dark Blue',
'Dark Green','Dark Yellow','Bluish Green','Brownish Gray','Brownish Green','Brownish Yellow',
'Light','Light Gray','Light Red','Greenish Gray','Light Yellow','Light Green','Light Blue',
'Light Brown','Blue Gray','Greenish Yellow','Greenish Gray'
);
The UI is a heatmap. I haven't really hit performance issues returning 1 million rows.
Angular is the framework. I'm breaking the HTTP response into 10K-record chunks.
My initial impression was that 3+ seconds is too long for a UI to start populating data. I was already breaking the response to the UI into chunks, and that portion was efficient and async. It never occurred to me to simply break the SQL requests into smaller chunks with LIMIT and OFFSET, so the server can start responding with data immediately (<200ms) even if it takes 5+ seconds to completely finish loading.
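For example, something along these lines against the original query (a sketch only; the exact paging syntax depends on the SQL dialect, and a deterministic ORDER BY is needed so consecutive chunks don't overlap):
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies ON Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
WHERE Lithologies.Colour IN ('Gray','White','Dark')  -- abbreviated filter list
ORDER BY Wells.Well_ID                               -- stable order so chunks don't overlap
LIMIT 10000 OFFSET 0;                                -- next chunks: OFFSET 10000, 20000, ...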
I'll write an answer to this effect.
Thanks and best regards,
schmorrison

A few things.
where someColumn in (null, ...)
This will not return rows where the value of someColumn is null, because x in (a, b, c ...) translates to x = a or x = b or x = c, and null is not equal to null. (Note that the query in the question actually lists the string literal 'NULL', which only matches rows whose Colour is literally the text 'NULL', not rows where the value is missing.)
You need a construct like this instead
where someColumn is null or someColumn in (...)
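Applied to the query from the question, that would look something like this (sketch only; the names are taken from the original query and the colour list is abbreviated):
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies ON Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
WHERE Lithologies.Colour IS NULL            -- rows with no colour recorded
   OR Lithologies.Colour IN ('Gray','White','Dark' /* ... rest of the list ... */);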
Second, you mentioned you're returning between 50k and 5M rows to the UI. I question the sanity of this... how is the UI rendering 5 million sets of coordinates for the user to see/use? I suppose there could be some extreme edge cases where this is really what you need to do, but it just seems really unlikely.
Third, if your concern is UI responsiveness, the proper way to handle that is to make asynchronous requests. I don't know anything about golang, but I did find this page with a quick google search. Study up on that kind of technique.
Finally, if you really do need to work with data sets this big, the critical point will be to identify the common search criteria and talk to your DBA about appropriate indexes. We can't provide much help in this regard without a lot more schema information. But if you have a specific query that is taking a long time with a particular set of parameters, you can come back and create a question for that query, provide the query plan, and we can help you out.
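As a rough illustration only (the index names are made up and the key choices are assumptions, not a recommendation for this schema), supporting indexes for the query in the question might look like:
CREATE INDEX IX_Lithologies_Colour
    ON Lithologies (Colour, Well_Report_ID);   -- supports the Colour filter and the join back to Well_Reports
CREATE INDEX IX_Wells_WellID
    ON Wells (Well_ID, Longitude, Latitude);   -- lets the join to Wells be answered from the index alone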

As pointed out by The Impaler:
"If you want good performance for dynamic queries that return 5 million rows, then you'll need to learn the intricacies of database engines, not only SQL. A high level knowledge unfortunately won't cut it."
and I expected as much.
I was simply querying the SQL DB for the whole set at once; this is wrong and I know that.
The changes I made generate the following statements (I know the specifics have changed):
-- [145.955ms] [rows:1]
SELECT count(*)
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
);
Calculated total(69977) chunk (17494) currentpage(3)
-- [117.195ms] [rows:17494]
SELECT "Easting","Northing"
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
)
ORDER BY (SELECT NULL)
OFFSET 34988 ROWS
FETCH NEXT 17494 ROWS ONLY;
Chunk count: 17494
JSON size: 385135
Request time: 278.6612ms
The response to the UI looks like this:
GET /data/wells/xxxx/?page=3&completed.start=1800-01-01T06:58:36.000Z&completed.end=2021-09-12T06:00:00.000Z&status=supply,research,geothermal,other,unknown&use=domestic,commercial,industrial,municipal,irrigation,agriculture&depth.start=0&depth.end=248&rate.start=0&rate.end=3321&page=3 HTTP/1.1 200 385135
{
    "data": [[0.0, 0.0], ......],
    "count": 87473,
    "total": 874731,
    "pages": 11,
    "page": 1
}
In this case, the DBs are essentially immutable (I may get an updated dataset once per year). I figure I can predefine a set of chunk size variations for each DB based on query result size rather than just DB size. I am also retrieving the next page before the client requests it. I am caching requests at the browser and client layers. I only perform the Count(*) request if the client doesn't provide it, and it isn't in the cache.
I did find that running the requests concurrently simply over-burdened my CPU: all the requests returned almost simultaneously, but each took 5+ seconds rather than ~1 second.
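For what it's worth, since both the count and the page query filter on the same columns, a single covering index could serve both; a sketch in SQL Server syntax (the index name is made up, and whether the INCLUDE columns are worth the extra space is a judgment call):
CREATE INDEX IX_tblWellLogs_Filters
    ON "tblWellLogs" (DateWellCompleted, WaterUseL, FinalStatusOfWellL, wyRate, TotalOrFinishedDepth)
    INCLUDE (Easting, Northing);   -- covers SELECT "Easting","Northing" without touching the base table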

Related

Query fast with literal but slow with variable

I am using TypeORM for SQL Server in my application. When I pass a native query like connection.query("select * from user where id = 1"), the performance is really good; it takes less than 0.5 seconds.
If we use the findOne or QueryBuilder method, it takes around 5 seconds to get a response.
On further debugging, we found that passing the value directly into the query like this,
return getConnection("vehicle").createQueryBuilder()
    .select("vehicle")
    .from(Vehicle, "vehicle")
    .where("vehicle.id = '" + id + "'")
    .getOne();
is faster than
return getConnection("vehicle").createQueryBuilder()
    .select("vehicle")
    .from(Vehicle, "vehicle")
    .where("vehicle.id = :id", { id: id })
    .getOne();
Is there any optimization we can do to fix the issue with parameterized query?
I don't know TypeORM, but the difference seems clear to me. In one case you query the database for the whole table and filter it locally, and in the other you send the filter to the database and it filters the data before sending it back to the client.
Depending on the size of the table this has a big impact. Consider picking one record out of 10 million: just transferring all of that data to the local machine is roughly 10 million times more work than transferring a single row.
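To make that concrete, the two shapes of work look roughly like this (illustrative only; the actual statements TypeORM emits may differ, and the table name is assumed):
-- Filter applied on the server: only the matching row crosses the network.
SELECT * FROM vehicle WHERE id = @id;
-- Whole table pulled back and filtered on the client: every row crosses the network first.
SELECT * FROM vehicle;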

Optimizing Geometry.STIntersect Query

I am attempting to move simple Geoprocessing routines from ESRI based processes to SQL Server. My assumption is that it will be far more efficient. For my initial test I am working on an intersection routine to associate overlapping linear data.
In my WCASING table I have 1610 records. I am trying to associate these Casings with their associated mains. I have ~277,000 Mains.
I am running the query below to get a general sense of how long it will take to find individual matches. This query returned 5 valid intersections in 40 seconds.
SELECT TOP 5 [WCASING].[OBJECTID] AS CasingOBJECTID,
       [WPUMPPRESSUREMAIN].[OBJECTID] AS MainObjectID,
       [WCASING].[Shape]
FROM [dbo].[WPUMPPRESSUREMAIN]
JOIN [WCASING]
  ON [WCASING].[Shape].STIntersects([WPUMPPRESSUREMAIN].[Shape]) = 1
My primary questions:
Will this process faster depending on the search order, i.e.
finding 'A' inside of 'B' vs.
finding 'B' inside of 'A'?
(The initial return on 5 records from these datasets suggests it does not matter.)
Will this process faster if I first buffer to limit to a smaller main set and then search?
Can I use SQL Server tuning to work with geometry-based queries?
I will be working on these processes over the next few weeks. In the meantime I would greatly appreciate insight and pointers to white papers associated with these tuning options. Thus far I have not found a great resource.
Thank You,
Rick
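(For what it's worth, the usual starting point for tuning STIntersects in SQL Server is a spatial index on both Shape columns; a minimal sketch with placeholder bounding-box values that would have to be replaced by the real data extent:)
CREATE SPATIAL INDEX SIdx_WCASING_Shape
    ON dbo.WCASING (Shape)
    USING GEOMETRY_GRID
    WITH (BOUNDING_BOX = (0, 0, 1000000, 1000000));   -- placeholder extent; must enclose all geometries
CREATE SPATIAL INDEX SIdx_WPUMPPRESSUREMAIN_Shape
    ON dbo.WPUMPPRESSUREMAIN (Shape)
    USING GEOMETRY_GRID
    WITH (BOUNDING_BOX = (0, 0, 1000000, 1000000));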

How to do server paging in SQL correctly?

My situation: My application is slow. As slow as it gets... mostly because I have the feeling my server paging for my dataTables / grids is wrongly implemented.
Let's start:
I have a SQL Server 2008 database with one table holding all the information: 10 columns, at the moment 19K rows.
My application is based on JavaScript with an ASP.NET backend.
My SQL query is:
WITH Ordered AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Created DESC) AS 'RowNumber'
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
AND [xxx] LIKE '%1%'
AND [yyy] LIKE '%2%'
)
SELECT *
FROM Ordered
WHERE RowNumber BETWEEN 1 AND 41;
So at the moment this query runs around 27 to 32 seconds, and anything over 30 seconds gives me a timeout... on 19K rows accumulated in one year... which means that within a month at the latest every query will time out...
As far as I understand it, the ordering in this query is the problem: there is no index supporting it.
The query first sorts, then numbers every row manually, then selects only 40... (of course on page 2 of my grid it gets rows 41 to 81...)
I COULD add an index on my "Created desc" and the query would be much, much faster, BUT every column is sortable in my grid, which means the sort column could be any other column of my table, in either DESC or ASC order!
So, how to improve this?
//Edit:
Sorry to forget that:
The inner query (Inner Select) runs 6 seconds, while the total query runs 31 seconds...
Which means the "WITH Ordered AS" is the problem here!
First things first: you have a performance problem; approach it with a proper methodology and measure appropriately. "The inner query (Inner Select) runs 6 seconds, while the total query runs 31 seconds... Which means ..." is amateurism. Read How to analyse SQL Server performance for correct ways to measure performance. And before we continue: if you start from 6 seconds, you have already lost the game.
Now, on to the question.
WHERE State in('Appointed','Accepted') AND [xxx] LIKE '%1%' AND [yyy] LIKE '%2%'
This expression is basically non-indexable. Even if you add an index on State it will not help because of the low cardinality (few values with many rows each). And LIKE '% ... %' is unindexable because it searches for values in the middle of the text.
You could try to replace LIKE '% ... %' with a full-text search using CONTAINS ..., which will be faster, provided you search for specific enough terms. But it does require you to deploy and configure the full-text indexes properly.
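A sketch of what that might look like (assuming a full-text index has already been created on the column; '1' here just mirrors the placeholder term from the question):
SELECT *
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
  AND CONTAINS([xxx], '"1"');   -- full-text predicate instead of [xxx] LIKE '%1%'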
As for the paging, I do not much favor the ROW_NUMBER approach. Even when a sort column exists, it involves a scan-and-count to skip the required number of rows, and it gets slower and slower as you move to higher pages. I much prefer the key-based (keyset) approach:
SELECT TOP (page size) ...
WHERE keys > <last row>
ORDER BY...
but this approach is more difficult to implement as it requires keeping track of keys rather than the page number.
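Applied to the Meetings table from the question, a keyset page might look something like this (sketch only; it assumes an Id primary key, which is not shown in the question, as a tie-breaker for equal Created values):
SELECT TOP (40) *
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
  AND (Created < @lastCreated
       OR (Created = @lastCreated AND Id < @lastId))   -- resume just after the last row of the previous page
ORDER BY Created DESC, Id DESC;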
But expect no miracles. You are asking a relational OLTP system to do the work of Elasticsearch/Solr. It will never work as you expect. Use a tool appropriate for the job (a search engine). Also read Dynamic Search Conditions in T-SQL for a more thorough discussion, but again, expect no miracles.

Querying SQL Server with an index for a range

I am using SQL Server 2012 running on Windows Server 2012 Datacenter.
I have a database with a table that is built as follows:
[ID] (pk,int not null)
[Start] (float,null)
[End] (float, null)
[CID] (int,null) --country id
I have a web service that gets an IP, translates it to decimal
(you may refer to this: IP address conversion to decimal and vice versa) and asks the database server for the country ID.
The table mentioned above contains ~200K rows, with start and end values representing IP ranges as decimals and a country ID related to each range.
I encountered really high CPU usage under some heavy traffic we have been dealing with, so I added indexes on the start and end columns. Afterwards the CPU got a little better, but I think it should have improved much more; this is simply supposed to work like a search in a sorted list, which should be extremely fast. The results I expected from adding the index were far from reality.
I suppose it is because it's not searching a list but searching a range.
What would be the best way to handle this situation efficiently? I am sure this simple action is taking way more resources than it should.
Here is a picture from the activity monitor now (lower traffic, after indexing):
This is running on an Azure ExtraLarge VM (8 cores, 14GB memory). The VM is doing nothing but running a SQL Server with 1 table that only serves this 1 request! The VM CPU at this lower traffic is ~30%, and ~70% at higher traffic. I am sure some structural/logical changes should let a really small server/service handle this easily.
SELECT TOP 1 *
FROM IP
WHERE StartIP <= yourIP
ORDER BY StartIP DESC
This gets you the range with the greatest StartIP at or below the given IP. You then need to test whether the EndIP also matches. So:
SELECT *
FROM (
    SELECT TOP 1 *
    FROM IP
    WHERE StartIP <= yourIP
    ORDER BY StartIP DESC
) x
WHERE EndIP >= yourIP
This amounts to a single-row index seek. Perfect performance.
The reason SQL Server cannot automatically do this is that it cannot know that IP ranges are ordered, meaning that the next StartIP is always greater than the current EndIP. We could have ranges of the form (100, 200), (150, 250). That is clearly invalid but it could be in the table.
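For that seek to really be a single-row operation, the supporting index matters; a sketch (the index name is made up and the column names follow the answer rather than the schema in the question):
CREATE INDEX IX_IP_StartIP
    ON IP (StartIP)
    INCLUDE (EndIP, CID);   -- seek on StartIP, read EndIP and the country id without a key lookup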
In my opinion your main problem is the lack of "parameterization", because (a) query compilation is/can be expensive and (b) these "unparameterized" queries seem to have a lot of executions. And the available screenshot shows two things regarding this aspect:
1) The recent expensive queries aren't "parameterized".
2) High values for "Plan count":
Plan Count: The number of cached query plans for this query. A large number might indicate a need for explicit query parameterization. For more information, see Specifying Query Parameterization Behavior by Using Plan Guides. (Source)
So, I would try to use parameters for these queries:
SELECT TOP(1) CountryId FROM [IP] WHERE Column1 <= @param AND @param <= Column2
If you can't change the application (how SQL requests are sent to SQL Server), then you could try plan guides:
http://technet.microsoft.com/en-US/library/ms191275(v=sql.90).aspx

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data has to be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and the SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table-create privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, in nearly any database:
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to improve the speed of testing / query results. Thanks to the sampling, your query results should be roughly representative of what you will get. Note that the number you're modding by should be prime; otherwise it won't give a good sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
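In MySQL (which this question is about), one lightweight alternative is filtering on RAND(); a sketch (the sample fraction is arbitrary, and the result changes on every run unless you materialize it as below):
CREATE TABLE sample_table AS
SELECT unique_id, data_value
FROM X
WHERE RAND() < 0.001;   -- keep roughly 0.1% of the rows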
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output, such as might happen for queries with GROUP BY or ORDER BY or HAVING clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted using "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; I think in the C interface, you have to use an unbuffered version of the function that executes the query.