Querying SQL Server with an index for a range - sql

I am using SQL Server 2012 running on Windows Server 2012 Datacenter, and
I have a database with a table that is built as follows:
[ID] (pk,int not null)
[Start] (float,null)
[End] (float, null)
[CID] (int,null) --country id
I have a web service that receives an IP, translates it to decimal
(see: IP address conversion to decimal and vice versa) and asks the database server for the country ID.
The table mentioned above contains ~200K rows, with Start and End values representing IP ranges as decimals and a country ID related to each range.
I encountered really high CPU usage under some heavy traffic we have been handling, so I added indexes on the Start and End columns. Afterwards the CPU improved a little, but far less than I expected; this should simply work like a search in a sorted list, which ought to be extremely fast, yet the result I got from adding the index was far from that.
I suppose it is because it is not searching a list but searching a range.
What would be the best way to make this efficient? I am sure this simple operation is consuming far more resources than it should.
Here is a picture from the activity monitor now (lower traffic, after indexing):
This is running on an Azure ExtraLarge VM (8 cores, 14 GB memory). The VM is doing nothing but running a SQL Server with one table that only serves this one request! The VM CPU on this lower traffic is ~30%, and ~70% on higher traffic. I am sure some structural/logical changes would let a really small server/service handle this easily.

SELECT TOP 1 *
FROM IP
WHERE StartIP <= yourIP
ORDER BY StartIP DESC
This gets you the range with the nearest StartIP at or below the given IP (note the DESC; without it you would just get the lowest StartIP in the table). You then need to test whether the EndIP also matches. So:
SELECT *
FROM (
SELECT TOP 1 *
FROM IP
WHERE StartIP <= yourIP
ORDER BY StartIP DESC
) x
WHERE EndIP >= yourIP
This amounts to a single-row index seek. Perfect performance.
The reason SQL Server cannot do this automatically is that it cannot know that the IP ranges are ordered and non-overlapping, meaning that the next StartIP is always greater than the current EndIP. We could have ranges of the form (100, 200), (150, 250). That is clearly invalid data, but it could be in the table.
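For the seek above to be a single-row operation it also needs a supporting index; here is a minimal sketch using the answer's column names (the question's table uses Start/End/CID instead, and the country column name here is an assumption):
CREATE INDEX IX_IP_StartIP ON IP (StartIP) INCLUDE (EndIP, CountryId);
With such an index, the TOP 1 ... ORDER BY StartIP DESC pattern is answered by a single seek on StartIP without touching the base table.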

In my opinion your main problem is the lack of parameterization, because (a) query compilation is/can be expensive and (b) these unparameterized queries seem to have a lot of executions. The available screenshot shows two things regarding this aspect:
1) The recent expensive queries aren't "parameterized".
2) High values for "Plan count":
Plan Count: The number of cached query plans for this query. A large number might indicate a need for explicit query parameterization. For more information, see Specifying Query Parameterization Behavior by Using Plan Guides. (Source)
So, I would try to use parameters for these queries:
SELECT TOP(1) CountryId FROM [IP] WHERE Column1 <= @param AND @param <= Column2
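If you control the SQL text, one way to send it parameterized is sp_executesql; a sketch, assuming the decimal IP arrives as a BIGINT and using the question's Start/End/CID column names (the table name and sample value are assumptions):
DECLARE @ip BIGINT = 3221225985;  -- example decimal IP, value assumed
EXEC sp_executesql
    N'SELECT TOP (1) CID FROM [IP] WHERE [Start] <= @ip AND @ip <= [End]',
    N'@ip BIGINT',
    @ip = @ip;
This produces one cached, reusable plan instead of a new plan per distinct IP value.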
If you can't change the application (how the SQL requests are sent to SQL Server), then you could try plan guides:
http://technet.microsoft.com/en-US/library/ms191275(v=sql.90).aspx

Related

SQL Query performance - UI responsiveness concerns

I am running a query on localhost and I am extremely unfamiliar with SQL. I am using a golang library to generate the query statement. This is for an enterprise app, so I don't have time to evaluate and code for all possible performance cases. I'd prefer good performance for the largest possible queries:
up to 6 query parameters, e.g. BETWEEN 'created' AND 'abandoned', BETWEEN X AND Y, IN (1,2,3.....25), IN ('A', 'B', 'C'....'Z')
JOINs/subqueries between 2-5 tables
returning between 50K-5M records (LAT and LNG)
Currently I am using a JOIN to find the lat/lng for a record, plus some query parameters. Should I join differently (LEFT, RIGHT)? Should the FROM table be the record or the relation? Subqueries?
Is this query performance reasonable from a UI perspective? This is on localhost (Docker) on a fairly low-performance laptop under WSL (16GB RAM / 6-core CPU [2.2GHz]).
-- [2547.438ms] [rows:874731]
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies on Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
where Lithologies.Colour IN
(
'NULL',
'Gray','White','Dark','Black','Dark Gray','Dark Brown','Dark Red','Dark Blue',
'Dark Green','Dark Yellow','Bluish Green','Brownish Gray','Brownish Green','Brownish Yellow',
'Light','Light Gray','Light Red','Greenish Gray','Light Yellow','Light Green','Light Blue',
'Light Brown','Blue Gray','Greenish Yellow','Greenish Gray'
);
The UI is a heatmap. I haven't really hit performance issues returning 1 million rows.
Angular is the framework. I'm breaking the HTTP response into 10K-record chunks.
My initial impression was that 3+ seconds is too long for the UI to start populating data. I was already breaking the response to the UI into chunks, and that portion was efficient and async. It never occurred to me to simply break the SQL requests into smaller chunks with LIMIT and OFFSET, so the server can start responding with data immediately (<200ms) even if it takes 5+ seconds to completely finish loading.
I'll write an answer to this effect.
Thanks and best regards,
schmorrison
A few things.
where someColumn in (null, ...)
This will not return rows where the value of someColumn is null, because x in (a, b, c ...) translates to x = a or x = b or x = c, and null is not equal to null.
You need a construct like this instead
where someColumn is null or someColumn in (...)
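Applied to the query above, that becomes something like this (a sketch, with the colour list shortened here):
where Lithologies.Colour is null
   or Lithologies.Colour in ('Gray', 'White', 'Dark', 'Black', 'Dark Gray' /* ...rest of the original colour list... */)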
Second, you mentioned you're returning between 50k and 5M rows to the UI. I question the sanity of this... how is the UI rendering 5 million sets of coordinates for the user to see/use? I suppose there could be some extreme edge cases where this is really what you need to do, but it just seems really unlikely.
Third, if your concern is UI responsiveness, the proper way to handle that is to make asynchronous requests. I don't know anything about golang, but I did find this page with a quick google search. Study up on that kind of technique.
Finally, if you really do need to work with data sets this big, the critical point will be to identify the common search criteria and talk to your DBA about appropriate indexes. We can't provide much help in this regard without a lot more schema information, but if you have a specific query that is taking a long time with a particular set of parameters, you can come back and create a question for that query, along with providing the query plan, and we can help you out.
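For example, if the colour filter and the two joins above are typical access paths, indexes along these lines would be the starting point of that conversation (index names and column choices here are assumptions, not a tested recommendation for your schema):
-- support filtering Lithologies by Colour and joining back to Well_Reports
CREATE INDEX IX_Lithologies_Colour ON Lithologies (Colour, Well_Report_ID);
-- support the join from Well_Reports to Wells
CREATE INDEX IX_Well_Reports_Well_ID ON Well_Reports (Well_ID, Well_Report_ID);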
As pointed out by The Impaler:
"If you want good performance for dynamic queries that return 5 million rows, then you'll need to learn the intricacies of database engines, not only SQL. A high level knowledge unfortunately won't cut it."
and I expected as much.
I was simply querying the SQL DB for the whole set at once, this is wrong and I know that.
The changes I made generate the following statements (I know the specifics have changed):
-- [145.955ms] [rows:1]
SELECT count(*)
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
);
Calculated total(69977) chunk (17494) currentpage(3)
-- [117.195ms] [rows:17494]
SELECT "Easting","Northing"
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
)
ORDER BY (SELECT NULL)
OFFSET 34988 ROWS
FETCH NEXT 17494 ROWS ONLY;
Chunk count: 17494
JSON size: 385135
Request time: 278.6612ms
The response to the UI is as so:
GET /data/wells/xxxx/?page=3&completed.start=1800-01-01T06:58:36.000Z&completed.end=2021-09-12T06:00:00.000Z&status=supply,research,geothermal,other,unknown&use=domestic,commercial,industrial,municipal,irrigation,agriculture&depth.start=0&depth.end=248&rate.start=0&rate.end=3321&page=3 HTTP/1.1 200 385135
{
"data":[[0.0, 0.0],......],
"count": 87473,
"total": 874731,
"pages": 11,
"page": 1
}
In this case, the DBs are essentially immutable (I may get an updated dataset once per year). I figure I can predefine a set of chunk-size variations for each DB based on query result size rather than just DB size. I am also retrieving the next page before the client requests it. I am caching requests at the browser and client layers. I only perform the COUNT(*) request if the client doesn't provide it and it isn't in the cache.
I did find that running the requests concurrently simply over-burdened my CPU; all the requests returned almost simultaneously, but each took 5+ seconds rather than ~1 second.

Apache Geode : Improve query performance when order by clause is provided in the query

We are currently facing performance issues when order by clause is provided as a part of the query.
Current Specs:
We are running two Geode servers with a capacity of 20 GB (max heap size) each. Geode has around 3.1 million records and the table has 1.48 million.
Query:
query --query="SELECT DISTINCT cashFlowId,upstreamSystem,upstreamSystemTxnDate,valueDate,amount,status FROM WHERE AND account IN SET ('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual' ORDER BY amount DESC LIMIT 100"
The above query retrieves the output in 13-15 seconds after being run 2-3 times.
Actual Result Set: 666553
No of Records in the table: 1.49 million
What have we tried/observed so far?
We found that the index (type: range) is being picked correctly.
No improvement even after allocating more memory to the JVM.
Verified that the IN operator has no impact on the query performance; we tried the same using the OR operator.
On removing the ORDER BY clause, the query completes in 2 seconds. We figured that sorting is eating most of the time.
Could you please guide or shed some information in improving the query performance?
Server Metrics:
Category | Metric | Value
--------- | --------------------- | ------------
cluster | totalHeapSize | 47135
cache | totalRegionEntryCount | 3100429
Like Urizen said, check the number of GCs going on, but there is more. Here is the code, and it looks fairly tight: Geode Order By Comparator. There is another factor, related to the nature of distributed sort order, that has little to do with Geode as a product. Each node does its own ordering, but when the results come back from each node, they need to be merged with the results from the other nodes. In other words, given a set of {2,4,3,1,6,5}, node 1 can sort {2,5,6} and node 2 can sort {1,3,4}, but the controlling node needs to do a merge for you to get {1,2,3,4,5,6}. I suspect there's some of that going on as well. This has nothing to do with Geode per se, just with distributed ORDER BYs. In database performance optimization theory, the database is the worst place to do an ORDER BY.
I'm wondering here if the better way to do this is to return 2 result sets: 1) the result set that you want, but unsorted, and 2) a small KV collection of items where K is amount and V is the key. Then on the client you sort the small KV collection and iterate over it, reading your larger result set in that order.
If you didn't want to write a function to do that, you could do one additional query up front to select amount, key FROM ..., wrap that in a sorted collection, and then do your full unsorted query. This should be really quick, since your 2 seconds is partially being consumed by the network on such a large result set.
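For the "one additional query up front" idea, the extra OQL query might look something like this (the region name was elided in the original question, so /YourRegion is a placeholder, and cashFlowId is assumed to be the entry key):
SELECT amount, cashFlowId FROM /YourRegion WHERE account IN SET ('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual'
Sort that small (amount, key) collection on the client, then read the full, unsorted result set in that order.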
Jason may have some technical insights, but removing the load from the server may be the answer if you have large result sets like you do.

Amazon Redshift queries mysteriously dying

Why is my Amazon Redshift query sometimes working, sometimes getting killed, and sometimes running out of memory?
This is a simple query:
dev=# EXPLAIN SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
QUERY PLAN
------------------------------------------------------------------------------------
XN Seq Scan on annotated_apache_logs (cost=0.00..114376.71 rows=9150137 width=207)
Filter: (date = '2015-09-15'::date)
Pulling about 9 million rows:
dev=# SELECT count(*) FROM annotated_apache_logs WHERE date = '2015-09-15';
count
---------
9150137
(1 row)
And choking:
dev=# SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
out of memory
Sometimes the SQL says Killed. Sometimes it works. Sometimes I get out of memory. No idea why. The table looks like this (I've removed columns not in the above query):
CREATE TABLE IF NOT EXISTS annotated_apache_logs (
row_number double precision,
browser_cookie character varying(240),
timestamp integer,
request_path character varying(2500),
status character varying(12),
outcome character varying(128),
duration integer,
referrer character varying(2500)
)
DISTKEY (date)
SORTKEY (browser_cookie);
And I've worked very hard to get all of those columns as small as I can to reduce memory usage. What do I look for now? If I read the EXPLAIN output correctly, this might return a couple of gigabytes of data. Not much data, no joins, nothing fancy. For a "petabyte-scale data warehouse" that's trivial, so I'm assuming I'm missing something fundamental here.
You should use cursors to fetch the result set in chunks. See http://docs.aws.amazon.com/redshift/latest/dg/declare.html
If your client application uses an ODBC connection and your query creates a result set that is too large to fit in memory, you can stream the result set to your client application by using a cursor. When you use a cursor, the entire result set is materialized on the leader node, and then your client can fetch the results incrementally.
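A minimal sketch of that cursor pattern for the query in question (cursor name and fetch size are arbitrary; Redshift cursors must run inside a transaction block):
BEGIN;
DECLARE log_cur CURSOR FOR
SELECT row_number, browser_cookie, "timestamp", request_path,
       status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
FETCH FORWARD 100000 FROM log_cur;  -- repeat until no more rows come back
CLOSE log_cur;
COMMIT;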
Edit:
This assumes that you want the entire result set rather than filtering it with WHERE/LIMIT.
If your query is actually running out of memory, check the concurrency of the WLM queue under which this query runs. Try to increase the available memory for this queue or reduce its concurrency; this will allow your query to have more memory.
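If you cannot change the WLM configuration itself, a session can also borrow extra slots from its queue for a single heavy query (a sketch; the right slot count depends on the queue's concurrency setting):
SET wlm_query_slot_count TO 3;  -- claim 3 slots' worth of the queue's memory for this session
SELECT row_number, browser_cookie, "timestamp", request_path,
       status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
SET wlm_query_slot_count TO 1;  -- give the slots back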
P.S:
When it says "petabyte scale", it does not mean it has a petabyte of RAM for you. There are a lot of factors which decide how much memory your query actually gets during execution:
What is the node type you are using?
How many nodes?
What other queries are running when you are running this query?

How to do server-side paging in SQL correctly?

My situation: my application is slow. As slow as it gets... mostly because I have the feeling my server-side paging for my dataTables/grids is wrongly implemented.
Let's start:
I have a SQL Server 2008 database with one table holding all the information: 10 columns, at the moment 19K rows.
My application has a JavaScript front end and ASP.NET backend code.
My SQL query is:
WITH Ordered AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Created DESC) AS 'RowNumber'
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
AND [xxx] LIKE '%1%'
AND [yyy] LIKE '%2%'
)
SELECT *
FROM Ordered
WHERE RowNumber BETWEEN 1 AND 41;
At the moment this query runs around 27 to 32 seconds, and anything over 30 seconds gives me a timeout... on 19K rows accumulated in one year... which means within a month at the latest every query will run into that timeout.
As far as I understand, the ordering is the problem for this query: no index is used here.
Because the query first sorts, then selects everything with a manual row number, then selects only 40 rows... (of course on page 2 of my grid it gets rows 41 to 81...)
I COULD put an index on "Created desc" and the query would be much, much faster, BUT every column is sortable in my grid, which means the sort column could be any other column of my table, in either descending or ascending order!
So, how to improve this?
//Edit:
Sorry, I forgot to mention:
The inner query (inner SELECT) runs in 6 seconds, while the total query runs in 31 seconds...
Which means the "WITH Ordered AS" is the problem here!
First things first: you have a performance problem, so approach it with a proper methodology and measure appropriately. "The inner query (inner SELECT) runs in 6 seconds, while the total query runs in 31 seconds... Which means ..." is amateurism. Read How to analyse SQL Server performance for correct ways to measure performance. And before we continue: if you start from 6 seconds, you have already lost the game.
Now, on to the question.
WHERE State in('Appointed','Accepted') AND [xxx] LIKE '%1%' AND [yyy] LIKE '%2%'
This expression is basically non-indexable. Even if you add an index on State it will not help, because of the low cardinality (few distinct values, with many rows each). And LIKE '% ... %' is unindexable because it searches for values in the middle of the text.
You could try to replace LIKE '% ... %' with a full-text search using CONTAINS, which will be faster, provided you search for specific enough terms. But it does require you to deploy and properly configure full-text indexes.
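As a sketch of what that rewrite could look like, assuming a full-text index has been created on the [xxx] column and the user searches for word prefixes rather than arbitrary substrings (the search term here is a placeholder):
SELECT *
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
  AND CONTAINS([xxx], '"searchterm*"');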
As for the paging, I am not a big fan of the ROW_NUMBER approach. Even when a sort column exists, it involves a scan and count to skip the requested number of rows, and it gets slower and slower as you go to higher pages. I much prefer the key-based approach:
SELECT TOP (page size) ...
WHERE keys > <last row>
ORDER BY...
but this approach is more difficult to implement, as it requires keeping track of keys rather than the page number.
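A sketch of the key-based version against the Meetings table, assuming the grid pages by Created DESC and that there is a unique Id column to break ties (the Id column is an assumption):
-- next page: @lastCreated / @lastId come from the last row of the previous page
SELECT TOP (40) *
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
  AND (Created < @lastCreated
       OR (Created = @lastCreated AND Id < @lastId))
ORDER BY Created DESC, Id DESC;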
But expect no miracles. You are asking a relational OLTP system to do the work of ElasticSearch/Solr. It will never work the way you expect. Use a tool appropriate for the job (a search engine). Also read Dynamic Search Conditions in T-SQL for a more thorough discussion, but again, expect no miracles.

SQL Wildcard Search - Efficiency?

There has been a debate at work recently about the most efficient way to search an MS SQL database using LIKE and wildcards. We are comparing using %abc%, %abc, and abc%. One person has said that you should always have the wildcard at the end of the term (abc%). So, according to them, if we wanted to find something that ended in "abc" it would be most efficient to use reverse(column) LIKE reverse('%abc').
I set up a test using SQL Server 2008 (R2) to compare each of the following statements:
select * from CLMASTER where ADDRESS like '%STREET'
select * from CLMASTER where ADDRESS like '%STREET%'
select * from CLMASTER where ADDRESS like reverse('TEERTS%')
select * from CLMASTER where reverse(ADDRESS) like reverse('%STREET')
CLMASTER holds about 500,000 records; there are about 7,400 addresses that end in "Street", and about 8,500 addresses that have "Street" in them but not necessarily at the end. Each test run took 2 seconds, and they all returned the same number of rows except for %STREET%, which found an extra 900 or so results because it picked up addresses that had an apartment number after "Street".
Since the SQL Server test didn't show any difference in execution time I moved into PHP where I used the following code, switching in each statement, to run multiple tests quickly:
<?php
require_once("config.php");
$connection = odbc_connect($connection_string, $U, $P);
$totaltime = array();
for ($i = 0; $i < 500; $i++) {
    // start timer (microtime() returns "msec sec" as a string)
    $m_time = explode(" ", microtime());
    $m_time = $m_time[0] + $m_time[1];
    $starttime = $m_time;
    // run the statement under test and pull the first field
    $Message = odbc_exec($connection, "select * from CLMASTER where ADDRESS like '%STREET%'");
    $Message = odbc_result($Message, 1);
    // stop timer and record the elapsed time for this run
    $m_time = explode(" ", microtime());
    $m_time = $m_time[0] + $m_time[1];
    $endtime = $m_time;
    $totaltime[] = ($endtime - $starttime);
}
odbc_close($connection);
echo "<b>Test took an average of:</b> ".round(array_sum($totaltime)/count($totaltime),8)." seconds per run.<br>";
echo "<b>Test took a total of:</b> ".round(array_sum($totaltime),8)." seconds to run.<br>";
?>
The results of this test were about as ambiguous as the results when testing in SQL Server.
%STREET completed in 166.5823 seconds (.3331 average per query), and averaged 500 results found in .0228 seconds.
%STREET% completed in 149.4500 seconds (.2989 average per query), and averaged 500 results found in .0177 seconds. (Faster time per result because it finds more results than the others in similar time.)
reverse(ADDRESS) like reverse('%STREET') completed in 134.0115 seconds (.2680 average per query), and averaged 500 results found in .0183 seconds.
reverse('TEERTS%') completed in 167.6960 seconds (.3354 average per query), and averaged 500 results found in .0229 seconds.
We expected this test to show that %STREET% would be the slowest overall, while it was actually one of the fastest to run and had the best average time to return 500 results. The suggested reverse('%STREET') approach was the fastest to run overall, but was a little slower in time to return 500 results.
Extra fun: a coworker ran Profiler on the server while we were running the tests and found that the use of the double wildcard produced a significant increase in CPU usage, while the other tests were within 1-2% of each other.
Are there any SQL efficiency experts out there who can explain why having the wildcard at the end of the search string would be better practice than at the beginning, and perhaps why searching with wildcards at both the beginning and end of the string was faster than having the wildcard just at the beginning?
Having the wildcard at the end of the string, as in 'abc%', helps if that column is indexed, because the engine can seek directly to the records that start with 'abc' and ignore everything else. Having the wildcard at the beginning means it has to look at every row, regardless of indexing.
Good article here with more explanation.
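A minimal illustration of the difference, assuming an index is added on the ADDRESS column (the index name is arbitrary):
CREATE INDEX IX_CLMASTER_Address ON CLMASTER (ADDRESS);
-- trailing wildcard: the optimizer can seek to the 'STREET' prefix range in the index
SELECT * FROM CLMASTER WHERE ADDRESS LIKE 'STREET%';
-- leading wildcard: there is no usable prefix, so every row (or index entry) must be examined
SELECT * FROM CLMASTER WHERE ADDRESS LIKE '%STREET';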
Only wildcards at the end of a Like character string will use an index.
You should look at using FTS Contains if you want to improve speed of wildcards at the front and back of a character string. Also see this related SO post regarding Contains versus Like.
From Microsoft: it is more efficient to use only the closing wildcard, because the engine can then use an index (if one exists) rather than performing a scan. Think about how the search might work: if you have no idea what comes before the term, you have to scan everything, but if only the tail end is wildcarded, you can keep the rows in order and even possibly (depending on what you're looking for) do a quasi-binary search.
Some operators in joins or predicates tend to produce resource-intensive operations. The LIKE operator with a value enclosed in wildcards ("%a value%") almost always causes a table scan. This type of table scan is a very expensive operation because of the preceding wildcard. LIKE operators with only the closing wildcard can use an index because the index is part of a B+ tree, and the index is traversed by matching the string value from left to right.
So, the quote above also explains why there was a huge processor spike when running the two wildcards. It completed faster only by happenstance, because there is enough horsepower to cover up the inefficiency. When trying to determine the performance of a query, you want to look at the execution of the query rather than the resources of the server, because those can be misleading. If I have a server with enough horsepower to serve a weather vane and I'm running queries on tables as small as 500,000 rows, the results are going to be misleading.
Beyond the fact that Microsoft's documentation was quoted in the answer above: when doing performance analysis, consider taking the dive into learning how to read the execution plan. It's an investment and very dry, but it will be worth it in the long run.
In short, though: whoever indicated that the trailing-wildcard-only form is more efficient is correct.
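If suffix matches like '%STREET' really are a common requirement, one workaround (a sketch, not something tested against the CLMASTER table above) is to persist and index a reversed copy of the column, so the pattern becomes a trailing wildcard:
ALTER TABLE CLMASTER ADD AddressReversed AS REVERSE(ADDRESS) PERSISTED;
CREATE INDEX IX_CLMASTER_AddressReversed ON CLMASTER (AddressReversed);
-- 'TEERTS%' is REVERSE('%STREET'), so this predicate can seek on the new index
SELECT * FROM CLMASTER WHERE AddressReversed LIKE 'TEERTS%';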
In MS SQL, if you want the names that end with one of the characters 'A', 'B', or 'C', you can use a query like the one below (suppose the table name is student):
select * from student where student_name like '%[ABC]'
This will give the names which end with 'A', 'B', or 'C'.
2) If you want the names which start with 'A', 'B', or 'C':
select * from student where student_name like '[ABC]%'
3) If you want the names which have 'A', 'B', or 'C' somewhere in the middle:
select * from student where student_name like '%[ABC]%'