SQL Server 2008 Performance on nullable geography column with spatial index

I'm seeing some strange performance issues on SQL Server 2008 with a nullable geography column with a spatial index. Each null value is stored as a root node within the spatial index.
E.g. a table with 5 000 000 addresses where 4 000 000 have a coordinate stored.
Every time I query the index I have to scan through every root node, meaning I have to scan through 1 000 001 level 0 nodes (1 root node for all the valid coordinates + 1 000 000 nulls).
I cannot find this mentioned in the documentation, and I cannot see why SQL allows this column to be nullable if the indexing is unable to handle it.
For now I have bypassed this by storing only the existing coordinates in a separate table, but I would like to know: what is the best practice here?
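A minimal sketch of that workaround, assuming hypothetical table and index names (the address table keeps its nullable geography column, but the spatial index lives on a side table that only holds the rows that actually have a coordinate):

CREATE TABLE dbo.AddressLocation (
    AddressId INT NOT NULL PRIMARY KEY,   -- points back to the address row
    Location  GEOGRAPHY NOT NULL          -- only rows with a coordinate end up here
);

CREATE SPATIAL INDEX SIX_AddressLocation_Location
    ON dbo.AddressLocation (Location);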
EDIT: (case closed)
I got some help on the sql spatial msdn forum, and there is a blog post about this issue:
http://www.sqlskills.com/BLOGS/BOBB/post/Be-careful-with-EMPTYNULL-values-and-spatial-indexes.aspx
Also the MSDN documentation does in fact mention this, but in a very sneaky manner.
"NULL and empty instances are counted at level 0 but will not impact performance. Level 0 will have as many cells as NULL and empty instances at the base table. For geography indexes, level 0 will have as many cells as NULL and empty instances + 1 cell, because the query sample is counted as 1."
Nowhere in the text is it promised that nulls do not affect performance for geography.
Only geometry is supposed to be unaffected.
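One way to see this for yourself is the built-in spatial index diagnostic procedure. A hedged sketch, assuming a hypothetical dbo.Addresses table with a geography column and a spatial index named SIX_Addresses_Location (the query sample polygon is also just illustrative):

DECLARE @q geography = geography::STGeomFromText(
    'POLYGON((-122.4 47.5, -122.2 47.5, -122.2 47.7, -122.4 47.7, -122.4 47.5))', 4326);

EXEC sp_help_spatial_geography_index
    @tabname       = 'dbo.Addresses',
    @indexname     = 'SIX_Addresses_Location',
    @verboseoutput = 1,
    @query_sample  = @q;
-- The verbose output reports per-level cell counts for the index, so NULL rows
-- piling up at level 0 show up directly in the numbers.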

Just a follow-up note - this issue has been fixed in SQL Server Denali with the new AUTO_GRID indexes (which are now the default). NULL values will no longer be populated in the root index node.

Related

Do columns with the value NULL impact the performance of Microsoft SQL Server?

I have a data table with over 200 columns; however, in over half of those columns, the majority of my data rows have the value NULL.
Do those NULL values decrease the performance of my SQL Server, or are fields with NULL values irrelevant for all actions on the data table?
Performance of the table is basically a function of I/O. The way that SQL Server lays out rows on data pages means that NULL values might or might not take up space -- depending on the underlying data type. SQL Server data pages contain a list of nullability bits for each column (even NOT NULL columns) to keep the NULL information.
Variable length strings simply use the NULL bits, so they occupy no additional space in each row. Other data types do occupy space, even for NULL values (this includes fixed-length strings, I believe).
What impact does this have on performance? If you have 200 NULL integer fields, that is 800 bytes on the data page. That limits the number of records stored on a given page to no more than 10 records. So, if you want to read 100 records, the query has to read (at least) 10 data pages. If the table did not have these columns, then it might be able to read only one data page.
Whether or not this is important for a given query or set of queries depends on the queries. But yes, columns that are NULL could have an impact on performance, particularly on the I/O side of queries.
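If you want to verify this on your own table rather than take the arithmetic on faith, here is a hedged sketch for SQL Server (dbo.wide_table is a placeholder name for your 200-column table):

-- Reports the average on-disk row size and page count per index level.
-- Space consumed by fixed-length NULL columns is included in avg_record_size_in_bytes.
SELECT index_id,
       index_level,
       avg_record_size_in_bytes,
       page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.wide_table'), NULL, NULL, 'DETAILED');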
Aside from taking up space, and what little impact that may have on performance, they don't have any other effect.

What is the maximum number of table names under a single schema in a DB2 subsystem?

What is the maximum number of results possible from the following SQL query for DB2 on z/OS?
SELECT NAME FROM SYSIBM.SYSTABLES WHERE TYPE='T' AND CREATOR=? ORDER BY NAME ASC
This query is intended to fetch a list of all table names under a specific schema/creator in a DB2 subsystem.
I am having trouble finding a definitive answer. According to IBM's "Limits in DB2 for z/OS" article, the maximum number of internal objects for a DB2 database is 32767. Objects include views, indexes, etc.
I would prefer a more specific answer for maximum number of table names under one schema. For instance, here is an excerpt from an IDUG thread for a related question:
Based on the limit of 32767 objects in one database, where each tablespace takes two entries, and tables and indexes take one entry each, then the theoretical max would seem to be, with one tablespace per database,
32767 - 2 (for the single tablespace) = 32765 / 2 = 16382 tables, assuming you need at least one index per table.
Are these assumptions valid (each tablespace takes two entries, at least one index per table)?
assuming you need at least one index per table.
That assumption doesn't seem valid. Tables don't always have indexes. And you are thinking about edge cases where someone is already doing something weird, so I definitely wouldn't presume there will be indexes on each table.*
If you really want to handle all possible cases, I think you need to assume that you can have up to 32765 tables (two object identifiers are needed for a table space, as mentioned in the quote).
*Also, the footnote in the documentation you linked indicates that an index takes up two internal object descriptors. So the math is also incorrect in that quote. It would actually be 10921 tables if they each had an index. But I don't think that is relevant anyway.
I'm not sure your assumptions are appropriate, because there are just too many possibilities to consider, and in the grand scheme of things it probably doesn't make much difference to the answer from your point of view.
I'll rephrase your question to make sure I understand you correctly: you are after the maximum number of rows, i.e. the worst-case scenario, that could possibly be returned by your SQL query?
DB2 System Limits
Maximum databases: limited by system storage and EDM pool size
Maximum number of databases: 65217
Maximum number of internal objects for each database: 32767
The number of internal object descriptors (OBDs) for external objects is as follows:
Table space: 2 (minimum required)
Table: 1
Therefore the maximum number of rows from your SQL query:
65217 * (32767 - 2) = 2,136,835,005
N.B. DB2 for z/OS does not have a 1:1 ratio between schemas and databases
N.N.B. This figure assumes 32,765 tables per table space per database, i.e. a 32765:1:1 ratio
I'm sure ±2 billion rows is NOT a "reasonable" expectation for the maximum number of table names that might show up under a schema, but it is possible
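If the practical concern is how many rows to expect rather than the theoretical ceiling, a hedged alternative is to ask the catalog for the actual counts first (same SYSIBM.SYSTABLES catalog as the query above):

-- Actual table counts per creator/schema in this subsystem.
SELECT CREATOR, COUNT(*) AS TABLE_COUNT
FROM SYSIBM.SYSTABLES
WHERE TYPE = 'T'
GROUP BY CREATOR
ORDER BY TABLE_COUNT DESC;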

Proper way to create dynamic 1:M SQL table

Simplified example: Two tables - people and times. Goal is to keep track of all the times a person walks through a doorway.
A person could have between 0 and 50 entries in the times table daily.
What is the proper and most efficient way to keep track of these records? Is it
times table
-----------
person_id
timestamp
I'm worried that this table can get well over a million records rather quickly. Insertion and retrieval times are of utmost importance.
ALSO: Obviously non-normalized but would it be a better idea to do
times table
-----------
person_id
serialized_timestamps_for_the_day
date
We need to access each individual timestamp for the person but ONLY query records on date or the person's id.
The second solution has some problems:
Since you need to access individual timestamps¹, serialized_timestamps_for_the_day cannot be considered atomic and would violate 1NF, causing a bunch of problems.
On top of that, you are introducing a redundancy: the date can be inferred from the contents of serialized_timestamps_for_the_day, and your application code would need to make sure they never become "desynchronized", which is vulnerable to bugs.²
Therefore go with the first solution. If properly indexed, a modern database on modern hardware can handle much more than mere "well over a million records". In this specific case:
A composite index on {person_id, timestamp} will allow you to query for person or combination of person and date by a simple index range scan, which can be very efficient.
If you need just "by date" query, you'll need an index on {timestamp}. You can easily search for all timestamps within a specific date by searching for a range 00:00 to 24:00 of the given day.
¹ Even if you don't query for individual timestamps, you still need to write them to the database one by one. If you have a serialized field, you first need to read the whole field to append just one value, and then write the whole result back to the database, which may become a performance problem rather quickly. And there are other problems, as mentioned in the link above.
² As a general rule, what can be inferred should not be stored, unless there is a good performance reason to do so, and I don't see any here.
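A minimal sketch of the first solution (SQL Server syntax assumed; table, column, and index names are illustrative):

CREATE TABLE times (
    person_id INT NOT NULL,
    walked_at DATETIME2 NOT NULL   -- one row per doorway event
);

CREATE INDEX ix_times_person_time ON times (person_id, walked_at);
CREATE INDEX ix_times_time ON times (walked_at);

-- "All of person 42's entries on 2012-06-01" becomes a simple index range scan:
SELECT walked_at
FROM times
WHERE person_id = 42
  AND walked_at >= '2012-06-01'
  AND walked_at <  '2012-06-02';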
Consider what we are talking about here. Accounting for just the raw data (event_time, user_id), this would be (4 + 4) * 1M ~ 8 MB per 1M rows. Let's try to roughly estimate this in a DB.
One integer is 4 bytes, a timestamp is 4 bytes; the row header, say, 18 bytes -- this brings the first estimate of the row size to 4 + 4 + 18 = 26 bytes. Using a page fill factor of about 0.7, that gives 26 / 0.7 ~ 37 bytes per row.
So, for 1M rows that would be about 37 MB. You will need an index on (user_id, event_time), so let's simply double the original: 37 * 2 = 74 MB.
This brings the very rough, inaccurate estimate to 74 MB per 1M rows.
So, to keep this in memory all the time, you would need 0.074 GB for each 1M rows of this table.
To get a better estimate, simply create a table, add the index and fill it with a few million rows.
Given the expected data volume, this can all easily be tested with 10M rows even on a laptop -- testing always beats speculating.
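For example, one rough way to run that experiment on SQL Server, reusing the times(person_id, walked_at) table sketched above (the row count and value ranges are arbitrary):

-- Generate ~10M synthetic rows, then see what the table and its indexes actually cost.
INSERT INTO times (person_id, walked_at)
SELECT TOP (10000000)
       ABS(CHECKSUM(NEWID()) % 100000),                                   -- fake person ids
       DATEADD(SECOND, ABS(CHECKSUM(NEWID()) % 31536000), '2012-01-01')   -- random moment within a year
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
CROSS JOIN sys.all_objects c;

EXEC sp_spaceused 'times';   -- reports row count, reserved, data and index size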
P.S. Your option 2 does not look like an "obviously better idea" to me at all.
I think the first option would be the better one.
Even if you go for the second option, the size of the index might not be reduced. In fact, there will be an additional column.
And since the data for different users is not related, you can shard the database based on person_id. I.e. let's say your data cannot fit on a single database server node and requires two nodes; then data for half the users will be stored on one node, and the rest of the data will be stored on the other node.
This can be done using RDBMS like MySQL or Document oriented databases like MongoDB and OrientDB as well.

Many-to-Many Relationship Against a Single, Large Table

I have a geometric diagram that consists of 5,000 cells, each of which is an arbitrary polygon. My application will need to save many such diagrams.
I have determined that I need to use a database to make indexed queries against this map. Loading all the map data is far too inefficient for quick responses to simple queries.
I've added the cell data to the database. It has a fairly simple structure:
CREATE TABLE map_cell (
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    ...
    PRIMARY KEY (map_id, cell_index)
)
5,000 rows per map is quite a few, but queries should remain efficient into the millions of rows because the main join indexes can be clustered. If it gets too unwieldy, it can be partitioned on map_id bounds. Despite the large number of rows per map, this table would be quite scalable.
The problem comes with storing the data that describes which cells neighbor each other. The cell-neighbor relationship is a many-to-many relationship against the same table. There are also a very large number of such relationships per map. A normalized table would probably look something like this:
CREATE TABLE map_cell_neighbors (
    id INT NOT NULL AUTO_INCREMENT,
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    neighbor_index INT,
    ...
    INDEX IX_neighbors (map_id, cell_index)
)
This table requires a surrogate key that will never be used in a join, ever. Also, this table includes duplicate entries: if cell 0 is a neighbor with cell 1, then cell 1 is always a neighbor of cell 0. I can eliminate these entries, at the cost of some extra index space:
CREATE TABLE map_cell_neighbors (
    id INT NOT NULL AUTO_INCREMENT,
    map_id INT NOT NULL,
    neighbor1 INT NOT NULL,
    neighbor2 INT NOT NULL,
    ...
    INDEX IX_neighbor1 (map_id, neighbor1),
    INDEX IX_neighbor2 (map_id, neighbor2)
)
I'm not sure which one would be considered more "normalized", since option 1 includes duplicate entries (including duplicating any properties the relationship has), and option 2 is some pretty weird database design that just doesn't feel normalized. Neither option is very space efficient. For 10 maps, option 1 used 300,000 rows taking up 12M of file space. Option 2 was 150,000 rows taking up 8M of file space. On both tables, the indexes are taking up more space than the data, considering the data should be about 20 bytes per row, but it's actually taking 40-50 bytes on disk.
The third option wouldn't be normalized at all, but would be incredibly space- and row-efficient. It involves putting a VARBINARY field in map_cell, and storing a binary-packed list of neighbors in the cell table itself. This would take 24-36 bytes per cell, rather than 40-50 bytes per relationship. It would also reduce the overall number of rows, and the queries against the cell table would be very fast due to the clustered primary key. However, performing a join against this data would be impossible. Any recursive queries would have to be done one step at a time. Also, this is just really ugly database design.
Unfortunately, I need my application to scale well and not hit SQL bottlenecks with just 50 maps. Unless I can think of something else, the latter option might be the only one that really works. Before I committed such a vile idea to code, I wanted to make sure I was looking clearly at all the options. There may be another design pattern I'm not thinking of, or maybe the problems I'm foreseeing aren't as bad as they appear. Either way, I wanted to get other people's input before pushing too far into this.
The most complex queries against this data will be path-finding and discovery of paths. These will be recursive queries that start at a specific cell and that travel through neighbors over several iterations and collect/compare properties of these cells. I'm pretty sure I can't do all this in SQL, there will likely be some application code throughout. I'd like to be able to perform queries like this of moderate size, and get results in an acceptable amount of time to feel "responsive" to user, about a second. The overall goal is to keep large table sizes from causing repeated queries or fixed-depth recursive queries from taking several seconds or more.
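For what it's worth, a fixed-depth neighbor expansion over the first (normalized) table can be expressed directly in SQL on databases that support recursive CTEs; a hedged sketch in SQL Server syntax, with the map id, starting cell and depth hard-coded purely for illustration:

-- All cells reachable from cell 42 on map 1 within 3 hops.
WITH reachable (cell_index, depth) AS (
    SELECT CAST(42 AS INT), 0            -- starting cell
    UNION ALL
    SELECT n.neighbor_index, r.depth + 1
    FROM reachable AS r
    JOIN map_cell_neighbors AS n
      ON  n.map_id = 1
      AND n.cell_index = r.cell_index
    WHERE r.depth < 3                    -- fixed depth cap keeps the recursion finite
)
SELECT DISTINCT cell_index
FROM reachable;
-- Note: this sketch does not track visited cells, so cells may be re-expanded;
-- the depth cap bounds the work, but real path-finding would prune revisits.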
Not sure which database you are using, but you seem to be re-inventing what a spatially enabled database already supports.
If SQL Server, for example, is an option, you could store your polygons as geometry types, use the built-in spatial indexing, and the OGC compliant methods such as "STContains", "STCrosses", "STOverlaps", "STTouches".
SQL Server spatial indexes, after decomposing the polygons into various b-tree layers, also use tessellation to index which neighboring cells a given polygon touches at a given layer of the tree index.
There are other mainstream databases which support spatial types as well, including MySQL.
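A minimal sketch of that approach in SQL Server (a variant of the asker's map_cell table with the polygon stored in a geometry column; the names and the bounding box are illustrative):

CREATE TABLE map_cell (
    map_id     INT NOT NULL,
    cell_index INT NOT NULL,
    shape      GEOMETRY NOT NULL,        -- the cell's polygon
    PRIMARY KEY (map_id, cell_index)
);

-- Geometry spatial indexes need a bounding box that covers your coordinate space.
CREATE SPATIAL INDEX SIX_map_cell_shape
    ON map_cell (shape)
    WITH (BOUNDING_BOX = (0, 0, 1000, 1000));

-- Neighbors of cell 42 on map 1: the polygons that touch it.
SELECT n.cell_index
FROM map_cell AS c
JOIN map_cell AS n
  ON  n.map_id = c.map_id
  AND n.cell_index <> c.cell_index
  AND n.shape.STTouches(c.shape) = 1
WHERE c.map_id = 1
  AND c.cell_index = 42;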

Effect of NULL values on storage in SQL Server?

If you have a table with 20 rows that contains 12 NULL columns and 8 columns with values, what are the implications for storage and memory usage?
Is NULL unique, or is it stored in memory at the same location each time and just referenced? Do a ton of NULLs take up a ton of space? Does a table full of NULLs take up the same amount of space as a table of the same size full of int values?
This is for SQL Server.
This depends on database engine as well as column type.
At least SQLite stores each null column as a "null type" which takes up NO additional space (each record is serialized to a single blob for storage so there is no space reserved for a non-null value in this case). With optimizations like this a NULL value has very little overhead to store. (SQLite also has optimizations for the values 0 and 1 -- the designers of databases aren't playing about!) See 2.1 Record Format for the details.
Now, things can get much more complex, especially with updates and potential index fragmentation. In SQL Server, for instance, space may be reserved for the column data depending on the type: an int null will still reserve space for the integer (as well as have an "is null" flag somewhere), whereas varchar(100) null doesn't seem to reserve the space (this last bit is from memory, so be warned!).
Happy coding.
Starting with SQL Server 2008, you can define a column as SPARSE when you have a "ton of nulls". This will save some space, but it requires a large enough portion of the values in the column to be NULL; exactly how much depends on the type.
See the Estimated Space Savings by Data Type tables in the article Using Sparse Columns, which tell you what percentage of the values need to be NULL for a net saving of 40%.
For example, according to those tables, 98% of the values in a bit column must be NULL in order to get a savings of 40%, while only 43% of the values in a uniqueidentifier column need to be NULL to net you the same percentage.
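Declaring a sparse column is just an extra keyword on the column definition; a minimal sketch with hypothetical table and column names:

CREATE TABLE dbo.sensor_readings (
    id        INT IDENTITY(1,1) PRIMARY KEY,
    reading   DECIMAL(10,2) SPARSE NULL,   -- expected to be NULL in the vast majority of rows
    comment   NVARCHAR(200) SPARSE NULL,
    logged_at DATETIME NOT NULL
);
-- Sparse columns must be nullable and cost a little extra storage for non-NULL values,
-- so they only pay off when the NULL percentages above are actually met.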