Azure SQL - Spatial Performance Issues

I've recently set up an Azure SQL Database with the intention of building high-performance spatial applications.
Unfortunately, when comparing Azure SQL to an on-prem server, I'm getting very poor performance when executing geospatial queries such as intersections of polygon boundaries.
Server Config:
On-Prem
SQL Server 2022
Xeon E5-1630 CPU // 64GB 2133MHz DDR4 RAM // Samsung 870 EVO SSD
Azure SQL
General Purpose - Serverless: Standard-series (Gen5) 80 vCore max, 80 vCore min
I also tested the S3 (100 DTU) model. Not only did it perform poorly, it isn't financially feasible either.
Dataset:
[dbo].[AddressGeocodes] -> Lat, Long Points, stored as geometry.
[dbo].[SA1_GDA2020] -> Multi-polygon geo-spatial boundaries, also in geometry
Replicated across systems (incl. clustered PK + spatial indexes with bounding boxes and auto-grid)
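For reference, the spatial index on the boundary table was created along these lines on both systems (a sketch only; the index name, bounding-box extents, and grid settings here are illustrative, not the exact definitions):
CREATE SPATIAL INDEX SIX_SA1_GDA2020_geom
    ON [dbo].[SA1_GDA2020] ([geom])
    USING GEOMETRY_AUTO_GRID
    WITH (BOUNDING_BOX = (xmin = 96.0, ymin = -44.0, xmax = 168.0, ymax = -9.0),  -- illustrative extents
          CELLS_PER_OBJECT = 16);                                                 -- illustrative setting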
Query:
SELECT *
FROM [dbo].[AddressGeocodes] GEO
INNER JOIN [dbo].[SA1_GDA2020] SA1 ON GEO.[geom].STIntersects(SA1.[geom]) = 1
The estimated and actual execution plans within SSMS are identical, recognising the two clustered PKs and the spatial index.
Results (after 60 seconds):
On-prem -> 68,000 Records
Azure 80 vCores -> 17,000 Records
Conclusion:
What I don't understand is that, within the Azure portal, CPU usage is only 2% while the query runs.
Could anyone please help me understand how there is such a dramatic difference?
There are also very few resources available on spatial performance in Azure.
Thanks heaps!

With S3 you only have two vCores available and the max degree of parallelism is one (1), so all your query plans run serially. I would guess your local SQL Server simply has more computing power.
If you already find S3 expensive and cannot afford to scale up to find out which service tier better matches the performance you expect, you can try storing the lat/long points as varchar and converting them to geography only when needed, for example:
UPDATE [dbo].[AddressGeocodes]  -- table and column names here are illustrative
SET [GeoLocation] = geography::STPointFromText('POINT(' + CAST([Longitude] AS VARCHAR(20)) + ' ' +
                                               CAST([Latitude] AS VARCHAR(20)) + ')', 4326);

Related

Comparison between Azure SQL cost vs DocumentDB/CosmosDB cost

Has anyone run a comparison between Azure SQL cost and DocumentDB/CosmosDB cost? The RU (Request Unit) figure presented in Azure Cosmos DB pricing is not clear to me. For example, one request against a 1 TB database cannot cost the same as one request against a 1 GB database.
First, you cannot reliably generalize a cost comparison between relational Azure SQL and NoSQL Cosmos DB, because they are significantly different things. They are not interchangeable, and they require different data modelling depending on planned usage and the chosen optimization points. The total cost (development + Azure bill + future maintenance) can vary a lot depending on your load and usage.
So the general answer is the rather useless "it depends".
See more about SQL vs NoSQL differences, and also some considerations about switching from SQL to NoSQL.
The best way to get a better understanding of what an RU is, is to experiment: generate realistic data, run realistic queries against it, and derive the target cost from that. If you get your document by id or through a selective-enough index (and you should never scan in DocumentDB), then the RU cost will most likely be similar whether the database is 1 GB or 1 TB.
If you lack the time to test with realistic data/queries, then you could play with https://www.documentdb.com/capacityplanner .
NB! Please note that both approaches require that you already have some idea of how you would lay out your data in NoSQL. NoSQL documents are not equivalent to SQL rows or tables. See the "Modeling Data for NoSQL Document Databases" presentation by Ryan CrawCour and David Makogon for ideas on what to consider when designing for NoSQL.

Select statement low performance on simple table

Using Management Studio, I have a table with the following six columns on my SQL Server:
FileID - int
File_GUID - nvarchar(258)
File_Parent_GUID - nvarchar(258)
File Extension - nvarchar(50)
File Name - nvarchar(100)
File Path - nvarchar(400)
It has a primary key on FileID.
This table has around 200M rows.
If I try to process the full table at once, I receive a memory error.
So I decided to load the data in partitions, using a SELECT statement for every 20M rows, splitting on the FileID number (a sketch of one such ranged SELECT is shown after the query below).
These SELECTs take forever; the retrieval of rows is extremely slow and I have no idea why. There are no calculations whatsoever, just a pull of data using a SELECT.
When I run the query analyzer, I see:
Select cost = 0%
Clustered Index Cost = 100%
Do you have any idea why this could be happening, or maybe some tips that I can apply?
My query:
Select * FROM Dim_TFS_File
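The partitioned version looks roughly like this (the FileID boundaries shown are just an example of one 20M-row slice):
SELECT *
FROM Dim_TFS_File
WHERE FileID > 0
  AND FileID <= 20000000;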
Thank you!!
Monitor the query while it's running to see if it's blocked or waiting on resources. If you can't easily see where the bottleneck is while monitoring the database and client machines, I suggest you run a couple of simple tests to help identify where you should focus your efforts. Ideally, run the tests with no other significant activity and a cold cache.
First, run the query on the database server and discard the results. This can be done from SSMS with the discard-results option (Query --> Query Options --> Results --> Grid --> Discard results after execution). Alternatively, use a PowerShell script like the one below:
$connectionString = "Data Source=YourServer;Initial Catalog=YourDatabase;Integrated Security=SSPI;Application Name=Performance Test"; #replace the server/database names for your environment
$connection = New-Object System.Data.SqlClient.SqlConnection($connectionString);
$command = New-Object System.Data.SqlClient.SqlCommand("SELECT * FROM Dim_TFS_File;", $connection);
$command.CommandTimeout = 0; #no timeout; a full scan of 200M rows can take a long time
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$connection.Open();
[void]$command.ExecuteNonQuery(); #this will discard returned results
$connection.Close();
$sw.Stop();
Write-Host "Done. Elapsed time is $($sw.Elapsed.ToString())";
Repeat the above test on the client machine. The elapsed time difference reflects network data-transfer overhead. If the client machine test is significantly faster than the application, focus your efforts on the app code. Otherwise, take a closer look at the database and network. Below are some notes that might help remediate performance issues.
This trivial query will likely perform a full clustered index scan. The limiting performance factors on the database server will be:
CPU: Throughput of this single-threaded query will be limited by the speed of a single CPU core.
Storage: The SQL Server storage engine will use read-ahead reads during large scans to fetch data asynchronously so that data will already be in memory by the time it is needed by the query. Sequential read performance is important to keep up with the query.
Fragmentation: Fragmentation will result in more disk head movement against spinning media, adding several milliseconds per physical disk IO. This is typically a consideration only for large sequential scans on single-spindle or low-end local storage arrays, not SSD or enterprise-class SAN. Fragmentation can be eliminated by reorganizing or rebuilding the clustered index. Be sure to specify MAXDOP 1 for rebuilds for maximum benefit, as sketched below.
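For example, a serial rebuild of the clustered index might look like this (the index name here is hypothetical):
-- hypothetical index name; MAXDOP 1 keeps the rebuild serial so pages are laid out contiguously
ALTER INDEX PK_Dim_TFS_File ON dbo.Dim_TFS_File
REBUILD WITH (MAXDOP = 1);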
SQL Server streams results as fast as they can be consumed by the client app, but the client may be constrained by network bandwidth and latency. It seems you are returning many GB of data, which will take quite some time. You can reduce bandwidth needs considerably with different data types. For example, assuming the GUID-named columns actually contain GUIDs, using uniqueidentifier instead of nvarchar will save about 80 bytes per row over the network and on disk. Similarly, use varchar instead of nvarchar if you don't actually need Unicode characters, cutting that data size in half (see the sketch after this paragraph).
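As a sketch only, using the column names as listed in the question (this assumes every existing value really is a valid GUID string; adjust nullability to match the current definition and test on a copy first):
ALTER TABLE dbo.Dim_TFS_File ALTER COLUMN File_GUID uniqueidentifier NULL;
ALTER TABLE dbo.Dim_TFS_File ALTER COLUMN File_Parent_GUID uniqueidentifier NULL;
-- and varchar instead of nvarchar where Unicode isn't needed, e.g.:
-- ALTER TABLE dbo.Dim_TFS_File ALTER COLUMN [File Path] varchar(400) NULL;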
Client processing time: The time to process 20M rows by the app code will be limited by CPU and code efficiency (especially memory management). Since you ran out of memory, it seems you are either loading all rows into memory or have a leak. Even without an outright out of memory error, high memory usage can result in paging and greatly slow throughput. Importantly, the database and network performance is moot if the app code can't process rows as fast as data are returned.

Slow spatial comparison when using cross join

I'm using U-SQL to select all objects which are inside one or more of the shapes. The code works but is really slow. Is there some way to make it more performant?
#rs1 =
SELECT DISTINCT aisdata.imo,
portshape.unlocode
FROM #lsaisdata AS aisdata
CROSS JOIN
#portsshape AS portshape
WHERE Geometry.STMPolyFromText(new SqlChars(portshape.shape.Replace("Z", "").ToCharArray()), 0).STContains(Geometry.Point(aisdata.lon, aisdata.lat, 0)).IsTrue;
Added more information about my issue:
I've registered Microsoft.SqlServer.Types.dll and SqlServerSpatial130.dll to be able to use spatial functions in U-SQL
I'm running my job in Data Lake Analytics using two AUs. Initially I used 10 AUs, but the Diagnostics tab stated that the job was 8 AUs over-allocated and max useful AUs was 2.
The job takes about 27 minutes to run with the UDT code below and the cross join takes almost all of this time
The input is one csv file (66 Mb) and one wkt file (2.4 Mb)
I'm using Visual Studio 2015 with Azure Data Lake Tools v2.2.5000.0
I tried encapsulating some of the spatial code in UDTs and that improved the performance to 27 minutes:
#rs1 =
SELECT DISTINCT aisdata.imo,
portshape.unlocode
FROM #lsaisdata AS aisdata
CROSS JOIN
#portsshape AS portshape
WHERE portshape.geoShape.GeoObject.STContains(SpatialUSQLApp.CustomFunctions.GetGeoPoint(aisdata.lon, aisdata.lat).GeoObject).IsTrue;
First, a CROSS JOIN will always explode your data into an NxM matrix. Depending on the number of rows, this can make the query very expensive and possibly make it hard to estimate the correct degree of parallelism.
Secondly, I assume the spatial join you do is an expensive operation. For example, if you used SQL Server 2012's spatial capabilities (2016 has a native implementation of the type that may be a bit faster), I assume you would probably see similar performance behavior. Most of the time you need a spatial index to get better performance. U-SQL does not support spatial indexes, but you could probably approximate the same behavior with an abstraction (such as tessellating the objects and checking whether their cells overlap) to provide a faster pre-filter/join before you test the exact condition to weed out the false positives.

SQL server query requesting too much memory

I am running a fairly simple query (LEFT JOINing a few tables on their primary keys), but SQL Server is asking for a large amount of memory to run it.
This results in only a few queries being able to run at a time. Is there a way to force less memory to be granted, or less memory to be requested, per query?
I am using SQL Server 2012 (not the Enterprise edition), and I have tried reducing the max degree of parallelism and increasing the parallelism (cost) threshold without much change in the memory requirements; the settings I mean are sketched below.
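For context, these are the instance-level settings referred to above; the values shown are purely illustrative, not what I ended up using:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 2;         -- example value only
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;   -- example value only
RECONFIGURE;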

Azure Database high DTU - High IO Avg

I'm trying to work out the cause of high DTU on a database (it's an S2 that is also geo-replicated), on a server where I'm not sure whether it's V12 or the older version (a different problem).
Friday last week and this Friday we saw a spike in DTU.
Looking at the resource stats:
SELECT TOP 1000 *
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC
Avg CPU hovers around 3-5% during the peak, but most significantly avg_data_io_percentage roams between about 72% and 90%.
How can I track down the IO further?
Query Performance Insight is quite useful, but could execution count and CPU be misleading in this case?
The portal screenshots (not reproduced here) showed the top 5 queries by CPU consumption, and the top 5 during that odd period.
Are the likely offenders the queries that appear differently in those two top-five lists?
Is there a better way to see the IO graph or data? Am I looking at the wrong thing? :D
Thanks in advance.
You can use SSMS and the built-in Query Store reports (Query Performance Insight in the Azure portal shows similar data) to look at IO-intensive queries. I suggest connecting to the database using SSMS and looking at the most resource-intensive queries by logical reads, logical writes, and physical reads. You should find your offender in one of these; a DMV query along the lines of the sketch below is one way to start.
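As a starting point, this sketch (standard DMVs, not specific to Azure SQL) lists statements by logical reads; swap the ORDER BY column to rank by writes or physical reads instead:
SELECT TOP (5)
       qs.execution_count,
       qs.total_logical_reads,
       qs.total_logical_writes,
       qs.total_physical_reads,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;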
Thanks,
Torsten