I have written a query to fetch polygon data from a SQL Server database. The query is as follows:
SELECT ZIP,
NAME,
STABB,
AREA,
TYPE,
orgZc.OrganizationId,
orgZc.[ZipCode] AS ORGzip,
REPLACE(REPLACE(REPLACE(REPLACE(GEOM.STAsText(),'POLYGON ((',' '),'MULTIPOLYGON (((',' '),'))',''),')))','') AS WKT
FROM USZIP
INNER JOIN ORGANIZATION_ZIP_CODES orgZc ON orgZc.[ZipCode] = USZIP.zip
WHERE orgZc.OrganizationId=#ORGANIZATION_ID
On this table I have already added a spatial index, as shown below:
CREATE SPATIAL INDEX SIndx_SpatialTable_geometry_col1
ON USZIP(GEOM) WITH ( BOUNDING_BOX = ( -90, -180, 90, 180 ) );
But it takes 38 seconds to fetch 2,483 records. Can anyone help me optimize this query?
My guess is that the important part of your query is the FROM and WHERE clauses. However, you can test this by removing the line:
REPLACE(REPLACE(REPLACE(REPLACE(GEOM.STAsText(),'POLYGON ((',' '),'MULTIPOLYGON (((',' '),'))',''),')))','') AS WKT
to see if that processing is taking up a lot of the time.
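A minimal version of that test, based on the original query, might look like this:

SELECT ZIP, NAME, STABB, AREA, TYPE,
       orgZc.OrganizationId,
       orgZc.[ZipCode] AS ORGzip
FROM USZIP
INNER JOIN ORGANIZATION_ZIP_CODES orgZc ON orgZc.[ZipCode] = USZIP.zip
WHERE orgZc.OrganizationId = #ORGANIZATION_ID;

If this version returns quickly, the STAsText() conversion and the nested REPLACE calls are the bottleneck.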
For this part of the query:
FROM USZIP INNER JOIN
ORGANIZATION_ZIP_CODES orgZc
ON orgZc.[ZipCode] = USZIP.zip
WHERE orgZc.OrganizationId = #ORGANIZATION_ID;
You say that the zip code is "a primary column". However, it has to be the first column in a composite index (or primary key) in order to be used for the join. So, you really want an index on USZIP(zip) for the join to work. (I'm guessing this is true based on the name of the table, but I want to be explicit.)
Second, your WHERE clause is limited to one OrganizationId, presumably of many. If so, you want an index on ORGANIZATION_ZIP_CODES(OrganizationId). Or, better yet, on ORGANIZATION_ZIP_CODES(OrganizationId, ZipCode).
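As a sketch, those suggested indexes (the names are illustrative) would be:

CREATE INDEX IX_USZIP_zip ON USZIP (zip);
CREATE INDEX IX_ORG_ZIP_CODES_Org_Zip ON ORGANIZATION_ZIP_CODES (OrganizationId, ZipCode);

The second index covers both the WHERE filter and the join column, so that side of the join can be satisfied from the index alone.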
I found the solution. I added a new column and populated it with the result of REPLACE(REPLACE(REPLACE(REPLACE(GEOM.STAsText(),'POLYGON ((',' '),'MULTIPOLYGON (((',' '),'))',''),')))',''). Now I can fetch from the new column directly without doing any manipulation, and it takes 3 seconds to fetch the 2,483 records.
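A sketch of that approach, assuming a new column named WKT_TEXT (the column name is hypothetical):

ALTER TABLE USZIP ADD WKT_TEXT varchar(max);

UPDATE USZIP
SET WKT_TEXT = REPLACE(REPLACE(REPLACE(REPLACE(GEOM.STAsText(),
    'POLYGON ((',' '),'MULTIPOLYGON (((',' '),'))',''),')))','');

The SELECT then reads WKT_TEXT directly instead of converting the geometry at query time.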
I have to optimize the following query with the help of indexes.
SELECT f.*
FROM first f
JOIN second s on f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;
Further info:
attributex_id is a foreign key pointing to second.id
attributey_id is a foreign key pointing to another table not used in the query
Changing the query is not an option
For most entries (98%) in first, the condition f.attributex_id IS NOT NULL will be true. The same holds for the second condition, f.attributey_id IS NULL.
I tried to add an index as follows.
CREATE INDEX index_for_first
ON first (attributex_id, attributey_id)
WHERE attributex_id IS NOT NULL AND (attributey_id IS NULL)
But the index is not used (checked via EXPLAIN ANALYZE) when executing the query. What kind of indexes would I need to optimize the query, and what am I doing wrong with the above index?
Does an index on s.month make sense, too (month is unique)?
Based on the query text and the fact that nearly all records in first satisfy the where clause, what you're essentially trying to do is
identify the 100 second records with the lowest month value
output the contents of the related records in the first table.
To achieve that, you can create indexes on:
second.month
first.attributex_id
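For example (the index names here are just placeholders):

CREATE INDEX idx_second_month ON second (month);
CREATE INDEX idx_first_attributex_id ON first (attributex_id);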
Caveats
Since this query must be optimized, it's safe to say there are many rows in both tables. Since there are only 12 months in the year, the output of the query is probably not deterministic (i.e., it may return a different set of rows each time it's run, even if there is no activity in either table between runs) since many records likely share the same value for month. Adding "tie breaker" column(s) to the index on second may help, though your order by only includes month, so no guarantees. Also, if second.month can have null values, you'll need to decide whether those null values should collate first or last among values.
Also, this particular query is not the only one being run against your data. These indexes will take up disk space and incrementally slow down writes to the tables. If you have a dozen queries that perform poorly, you might fall into a trap of creating a couple indexes to help each one individually and that's not a solution that scales well.
Finally, you stated that
changing the query is not an option
Does that mean you're not allowed to change the text of the query, or the output of the query?
I personally feel like re-writing the query to select from second and then join first makes the goal of the query more obvious. The fact that your initial instinct was to add indexes to first lends credence to this idea. If the query were written as follows, it would have been more obvious that the thing to do is facilitate efficient access to the tiny set of rows in second that you're interested in:
...
from second s
join first f ...
where ...
order by s.month asc limit 100;
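Written out in full against the original query (keeping its semantics), the rewrite might look like:

SELECT f.*
FROM second s
JOIN first f ON f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL
  AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;

Note that f.attributex_id IS NOT NULL is redundant given the inner join, but it is kept here for fidelity to the original.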
I have a query in SQL Server with a join that is taking forever. I'm hoping someone might have a tip to speed it up.
I think the problem is that I'm joining on a field called Reseller_CSN, which has values like '_0070000050'.
I've tried using the substring function in the join to return everything but the underscore, for example '0070000050', but I keep getting an error when I try to cast or convert the result to int or bigint.
Any tips would be greatly appreciated, the query is below:
SELECT
t1.RESELLER_CSN
,t1.FIRST_YEAR_RENEWAL
,t1.SEAT_SEGMENT
,t2.Target_End_Date_CY
,t2.Target_End_Date_PQ
,t2.Target_End_Date_CY_1
,t2.Target_End_Date_CY_2
,t1.ASSET_SUB_END_DATE
FROM dbo.new_all_renewals t1
LEFT JOIN dbo.new_all_renewals_vwTable t2
ON SUBSTRING(t1.RESELLER_CSN,2,11) = SUBSTRING(t2.RESELLER_CSN,2,11)
A join on processed columns invariably takes more effort than a join on raw columns. In this case, you can improve performance by using computed columns. For instance, on the first table, you can do:
alter table new_all_renewals add CSN_num as SUBSTRING(RESELLER_CSN, 2, 11);
create index ix_new_all_renewals_CSN_num on new_all_renewals (CSN_num);
This will generate an index on the column, which should speed the query. (Note: you'll need to reference the computed column rather than actually using the function.)
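Presumably the same treatment applies to the other table, after which the join can use the computed columns directly (a sketch; the column and index names are illustrative, and if new_all_renewals_vwTable is in fact a view, the computed column would need to go on its underlying table instead):

alter table new_all_renewals_vwTable add CSN_num as SUBSTRING(RESELLER_CSN, 2, 11);
create index ix_new_all_renewals_vw_CSN_num on new_all_renewals_vwTable (CSN_num);

SELECT ...
FROM dbo.new_all_renewals t1
LEFT JOIN dbo.new_all_renewals_vwTable t2
    ON t1.CSN_num = t2.CSN_num;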
I am using SQL Server and have a table with 2 columns:
myId varchar(80)
cchunk varchar(max)
Basically it stores large chunks of text, so that's why I need varchar(max).
My problem is when I do a query like this:
select *
from tbchunks
where
(CHARINDEX('mystring1',tbchunks.cchunk)< CHARINDEX('mystring2',tbchunks.cchunk))
AND
CHARINDEX('mystring2',tbchunks.cchunk) - CHARINDEX('mystring1',tbchunks.cchunk) <=10
It takes about 3 seconds to complete, the table has only about 500,000 records, and the data returned by the above query is anywhere between 0 and 800 rows.
I have a nonclustered index on the myId column; it helped make SELECT COUNT(*) fast, but it didn't help with the above query.
I tried using full-text search, but it was slow. I tried splitting the text in cchunk into smaller parts and adding an id column to connect all the split chunks, but I ended up with a table of 2 million records of split chunks of text (I did that so I could add an index), and the query was even slower.
EDIT:
modified the table to include a primary key (int)
created a full-text catalog with "Accent Sensitive = true"
created a full-text index on my table on column "cchunk"
ran the same query as above, and it ended up taking 22 seconds, which is much slower
UPDATE
Thanks everyone for suggesting full-text search (@Aaron Bertrand, thanks!). I converted my query to this:
SELECT *
FROM tbchunks AS FT_TBL
INNER JOIN CONTAINSTABLE(tbchunks, cchunk, '(mystring1 NEAR mystring2)') AS KEY_TBL
    ON FT_TBL.cID = KEY_TBL.[KEY]
By the way, cID is the primary key I added later.
Anyway, I am getting broad results, and I notice that the higher the returned RANK column value, the better the results. My question is: when does RANK start to get accurate?
An index isn't going to help with CHARINDEX at all. An index on a particular column can only quickly find rows where the value in that column exactly matches an indexed value. I'm actually quite surprised that query only takes 3 seconds, given that it has to read every single row four times (or at the very least, twice).
Well, as good as the ideas presented here were, nobody managed to completely solve my problem; rather, they provided helpful tips that led me to the solution, which I would love to share.
Using full-text search really was the answer, as many mentioned, but I managed to use CONTAINS in combination with NEAR so it could entirely replace my original SQL query and provide excellent speed.
CONTAINS(cchunk, 'NEAR((mystring1, mystring2), 3, TRUE)')
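Used as a predicate in a full query, that would look something like this (a sketch based on the table and column names above):

SELECT *
FROM tbchunks
WHERE CONTAINS(cchunk, 'NEAR((mystring1, mystring2), 3, TRUE)');

The 3 is the maximum distance allowed between the terms, and TRUE requires them to appear in the specified order.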
I have a huge table in my database that contains distances between cities. This enables my application to find nearby cities around the world when a starting city is selected.
It contains 4 columns:
ID, StartCityID, EndCityID, Distance
and contains about 120 million rows.
I've got indexes set up on StartCityID, EndCityID, another one on both, and one each on StartCityID + Distance and EndCityID + Distance (this is my first real experience with indexes, so I'm not 100% sure I'm doing it correctly).
Anyway - I do the following 2 queries:
Select distinct StartCityID
From Distances where EndCityID = 23485
and
Select distinct EndCityID
From Distances where StartCityID = 20045
They both return the same number of city IDs, but the top one takes 35 seconds while the bottom one returns results immediately. When I look at the indexes, they seem to be set up to serve StartCityID and EndCityID in the same way.
Anyone know why they might be acting differently? I'm at a loss...
NB: this may offer more insight, but for the one that takes 35 seconds, if I press execute again straight away with the same ID, it also returns results immediately that time.
Unfortunately that isn't good enough for my website, but it may be useful information.
Thanks
The second query is fast because it is served by a covering index: you have an index on (StartCityID, EndCityID).
The index on EndCityID is not covering (it doesn't include StartCityID), so the query either has to join with other indexes to get the data or has to do a key lookup, and that takes time. It also has to do a hash distinct or a sort-based distinct, whereas the second query doesn't, because the data is already sorted in EndCityID order for a given StartCityID. Also, why use DISTINCT at all? Will you have duplicate data for StartCityID and EndCityID? If there is no duplicate data, remove the DISTINCT.
Check the plans: the first one should show an index seek on the EndCityID + Distance index and then, most probably, a key lookup (it could be a clustered index scan as well, depending on the selectivity of EndCityID), followed by a hash or sort distinct.
The second one should show just an index seek on the (StartCityID, EndCityID) index.
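A sketch of a covering index that would help the first query (the index name is illustrative):

CREATE NONCLUSTERED INDEX IX_Distances_EndCity_StartCity
ON Distances (EndCityID, StartCityID);

With StartCityID in the index, the query filtering on EndCityID can be answered from the index alone, with no key lookups, and the StartCityID values arrive already sorted for the distinct.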
You mentioned that the second time you ran it, it returned immediately; that is because the data was already in cache. To test cold-cache performance, try the following:
dbcc dropcleanbuffers
dbcc freeproccache
and then run the slow query again.
CAUTION: Do not use these on a PROD server or other critical servers. Try this on a machine where it won't impact other users.
All you have to do is think about it...
Does your table have a primary key? What is it? What does it mean (to have a primary key)?
What does the DISTINCT keyword ask for?
Try these queries (avoiding the DISTINCT keyword):
Select StartCityID From Distances where EndCityID = 23485 group by StartCityID
Select EndCityID From Distances where StartCityID = 20045 group by EndCityID
There is a simple SQL JOIN statement below:
SELECT
REC.[BarCode]
,REC.[PASSEDPROCESS]
,REC.[PASSEDNODE]
,REC.[ENABLE]
,REC.[ScanTime]
,REC.[ID]
,REC.[Se_Scanner]
,REC.[UserCode]
,REC.[aufnr]
,REC.[dispatcher]
,REC.[matnr]
,REC.[unitcount]
,REC.[maktx]
,REC.[color]
,REC.[machinecode]
,P.PR_NAME
,N.NO_NAME
,I.[inventoryID]
,I.[status]
FROM tbBCScanRec as REC
left join TB_R_INVENTORY_BARCODE as R
ON REC.[BarCode] = R.[barcode]
AND REC.[PASSEDPROCESS] = R.[process]
AND REC.[PASSEDNODE] = R.[node]
left join TB_INVENTORY as I
ON R.[inventid] = I.[id]
INNER JOIN TB_NODE as N
ON N.NO_ID = REC.PASSEDNODE
INNER JOIN TB_PROCESS as P
ON P.PR_CODE = REC.PASSEDPROCESS
The table tbBCScanRec has 556,553 records, the table TB_R_INVENTORY_BARCODE has 260,513 records, and the table TB_INVENTORY has 7,688. However, the last two tables (TB_NODE and TB_PROCESS) both have fewer than 30 records.
Incredibly, when it runs in SQL Server 2005, it takes 8 hours to return the result set.
Why does it take so much time to execute?
If the two inner joins are removed, it takes just ten seconds to finish running.
What is the matter?
There are at least two UNIQUE NONCLUSTERED INDEXes.
One is IX_INVENTORY_BARCODE_PROCESS_NODE on the table TB_R_INVENTORY_BARCODE, which covers four columns (inventid, barcode, process, and node).
The other is IX_BARCODE_PROCESS_NODE on the table tbBCScanRec, which covers three columns (BarCode, PASSEDPROCESS, and PASSEDNODE).
Well, standard answer to questions like this:
Make sure you have all the necessary indexes in place, i.e., indexes on N.NO_ID, REC.PASSEDNODE, P.PR_CODE, and REC.PASSEDPROCESS.
Make sure that the types of the columns you join on are the same, so that no implicit conversion is necessary.
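As a sketch (the index names are illustrative), those indexes might be:

CREATE INDEX IX_TB_NODE_NO_ID ON TB_NODE (NO_ID);
CREATE INDEX IX_TB_PROCESS_PR_CODE ON TB_PROCESS (PR_CODE);
CREATE INDEX IX_tbBCScanRec_NODE_PROCESS ON tbBCScanRec (PASSEDNODE, PASSEDPROCESS);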
You are working with around 500 million rows (556,553 × 30 × 30).
You probably have to add indexes on your tables.
If you are using SQL Server, you can look at the query plan to see where you are losing time.
See the documentation here : http://msdn.microsoft.com/en-us/library/ms190623(v=sql.90).aspx
The query plan will help you to create indexes.
When you check the indexing, there should be clustered indexes as well: the nonclustered indexes use the clustered index, so not having one would render the nonclustered indexes useless. Outdated statistics could also be a problem.
However, why do you need to fetch ALL of the data? What is the purpose of that? You should have WHERE clauses restricting the result set to only what you need.
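For example, a purely illustrative filter (the date window and the choice of ScanTime as the filter column are assumptions):

...
FROM tbBCScanRec as REC
...
WHERE REC.ScanTime >= DATEADD(day, -30, GETDATE());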