Many-to-Many Relationship Against a Single, Large Table - sql

I have a geometric diagram that consists of 5,000 cells, each of which is an arbitrary polygon. My application will need to save many such diagrams.
I have determined that I need to use a database to make indexed queries against this map. Loading all the map data is far too inefficient for quick responses to simple queries.
I've added the cell data to the database. It has a fairly simple structure:
CREATE TABLE map_cell (
map_id INT NOT NULL ,
cell_index INT NOT NULL ,
...
PRIMARY KEY (map_id, cell_index)
)
5,000 rows per map is quite a few, but queries should remain efficient into the millions of rows because the main join indexes can be clustered. If it gets too unwieldy, it can be partitioned on map_id bounds. Despite the large number of rows per map, this table would be quite scalable.
The problem comes with storing the data that describes which cells neighbor each other. The cell-neighbor relationship is a many-to-many relationship against the same table. There are also a very large number of such relationships per map. A normalized table would probably look something like this:
CREATE TABLE map_cell_neighbors (
id INT NOT NULL AUTO_INCREMENT ,
map_id INT NOT NULL ,
cell_index INT NOT NULL ,
neighbor_index INT ,
...
PRIMARY KEY (id) ,
INDEX IX_neighbors (map_id, cell_index)
)
This table requires a surrogate key that will never be used in a join, ever. Also, this table includes duplicate entries: if cell 0 is a neighbor with cell 1, then cell 1 is always a neighbor of cell 0. I can eliminate these entries, at the cost of some extra index space:
CREATE TABLE map_cell_neighbors (
id INT NOT NULL AUTO_INCREMENT ,
map_id INT NOT NULL ,
neighbor1 INT NOT NULL ,
neighbor2 INT NOT NULL ,
...
PRIMARY KEY (id) ,
INDEX IX_neighbor1 (map_id, neighbor1),
INDEX IX_neighbor2 (map_id, neighbor2)
)
I'm not sure which one would be considered more "normalized", since option 1 includes duplicate entries (including duplicating any properties the relationship has), and option 2 is some pretty weird database design that just doesn't feel normalized. Neither option is very space efficient. For 10 maps, option 1 used 300,000 rows taking up 12M of file space. Option 2 was 150,000 rows taking up 8M of file space. On both tables, the indexes take up more space than the data: each row should be about 20 bytes, but actually takes 40-50 bytes on disk.
The third option wouldn't be normalized at all, but would be incredibly space- and row-efficient. It involves putting a VARBINARY field in map_cell, and storing a binary-packed list of neighbors in the cell table itself. This would take 24-36 bytes per cell, rather than 40-50 bytes per relationship. It would also reduce the overall number of rows, and the queries against the cell table would be very fast due to the clustered primary key. However, performing a join against this data would be impossible. Any recursive queries would have to be done one step at a time. Also, this is just really ugly database design.
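To make option 3 concrete, here is a minimal sketch of what it might look like, assuming cell indexes fit in two bytes each; the 32-byte column size (room for 16 neighbors) is only an illustrative guess:
-- Packed list of 2-byte neighbor indexes (size is an assumption).
ALTER TABLE map_cell
ADD COLUMN neighbors VARBINARY(32);

-- The application packs and unpacks the list; SQL only ever fetches the blob:
SELECT neighbors
FROM map_cell
WHERE map_id = 1 AND cell_index = 42;
Each hop of a traversal then becomes one indexed point lookup per cell, performed in application code.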
Unfortunately, I need my application to scale well and not hit SQL bottlenecks with just 50 maps. Unless I can think of something else, the latter option might be the only one that really works. Before I committed such a vile idea to code, I wanted to make sure I was looking clearly at all the options. There may be another design pattern I'm not thinking of, or maybe the problems I'm foreseeing aren't as bad as they appear. Either way, I wanted to get other people's input before pushing too far into this.
The most complex queries against this data will be path-finding and path discovery. These will be recursive queries that start at a specific cell, travel through neighbors over several iterations, and collect/compare properties of these cells. I'm pretty sure I can't do all of this in SQL; there will likely be some application code throughout. I'd like to be able to perform moderately sized queries like this and get results quickly enough to feel "responsive" to the user, about a second. The overall goal is to keep large table sizes from causing repeated queries or fixed-depth recursive queries to take several seconds or more.
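For what it's worth, if the database supports recursive CTEs (MySQL 8+, SQL Server, PostgreSQL), a fixed-depth traversal over the option-1 table could be sketched roughly like this (MySQL 8 syntax); whether it stays within a one-second budget at this scale is something only testing would show:
-- Cells reachable from cell 0 of map 1 within 3 hops (option-1 schema).
WITH RECURSIVE reachable (cell_index, depth) AS (
    SELECT 0, 0
    UNION ALL
    SELECT n.neighbor_index, r.depth + 1
    FROM reachable r
    JOIN map_cell_neighbors n
      ON n.map_id = 1 AND n.cell_index = r.cell_index
    WHERE r.depth < 3
)
SELECT DISTINCT cell_index FROM reachable;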

Not sure which database you are using, but you seem to be re-inventing what a spatially enabled database already supports.
If SQL Server, for example, is an option, you could store your polygons as geometry types, use the built-in spatial indexing, and the OGC-compliant methods such as "STContains", "STCrosses", "STOverlaps", "STTouches".
SQL Server spatial indexes, after decomposing the polygons into various B-tree layers, also use tessellation to index which neighboring cells a given polygon touches at a given layer of the tree index.
There are other mainstream databases that support spatial types as well, including MySQL.
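As a rough illustration of the SQL Server route (the table name, column names, and bounding box below are placeholders, not anything from the question):
-- Store each cell as a geometry polygon and let the spatial index answer
-- "which cells touch this one" instead of maintaining a neighbor table.
CREATE TABLE map_cell_geo (
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    shape GEOMETRY NOT NULL,
    PRIMARY KEY (map_id, cell_index)
);

CREATE SPATIAL INDEX SIX_map_cell_geo_shape
    ON map_cell_geo (shape)
    WITH (BOUNDING_BOX = (0, 0, 1000, 1000));  -- assumed coordinate range

-- Neighbors of cell 42 on map 1: polygons sharing a boundary point.
SELECT c2.cell_index
FROM map_cell_geo c1
JOIN map_cell_geo c2
  ON c2.map_id = c1.map_id
 AND c2.cell_index <> c1.cell_index
 AND c1.shape.STTouches(c2.shape) = 1
WHERE c1.map_id = 1 AND c1.cell_index = 42;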

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, a code smell and arguably a suboptimal design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes, unless it has column markers within a row. I understand there might be optimization techniques, but is that the default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Compression will also help save space and hopefully improve performance as well.
For MySQL: indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY (in your case) needs to be accessed. For example, for a million rows that is about 3 levels, hence 3 blocks. Within the leaf block there are probably dozens of rows, with all their columns (unless a column is "too big", but that is a different discussion).
For MariaDB's ColumnStore: the contents of one column for 64K rows are held in a packed, compressed structure that varies in size and layout. Before getting to that, the clump of 64K rows must be located. After getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromise between speed and space for both simple and complex queries.
Your simple query is easy and efficient for a regular RDBMS to do, but messier in a columnstore. Columnstore engines are a niche market in which your query is atypical.
Be aware that fetching blocks is typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.
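Before reaching for a column store, it may be worth checking whether a covering index already solves the narrow queries. Here is a sketch for the example query from the question (SQL Server syntax); whether this pays off for the real workload depends on which columns the queries actually touch:
-- With first_name carried in the index, the query below is answered from
-- the index alone and never reads the 100-column row.
CREATE NONCLUSTERED INDEX IX_EMP_id_first_name
    ON EMP (id)
    INCLUDE (first_name);

SELECT first_name FROM EMP WHERE id = 10;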

SQL Structure, Dynamic Two Columns or Unique Columns

I'm not sure which is faster. I need to store lists of possible data.
Currently I have an SQL table with the following structure, accessed with PHP.
boxID
place
name -- (serialNum, itemNum, idlock, etc, etc)
data
--(Note: The Primary Key here would be boxId, place, name, and data, to prevent duplicate data.)
The reason I set it up like this was to avoid creating a column per named data item. It's possible that in the future there will be 5-10 different named data items or more. It's also possible to store 1,000 - 10,000 entries of data in one week for just one named data item. It will be searched as well, like when I get the place for a specific serialNum, then get all data related to that place (a specific serialNum, itemNum, idLock, etc.).
But my concern is that my structure could be slower than just creating a named column for each named data item. For example:
boxID
place
serialNum
itemNum
idLock
etc
etc
--(Note: Not even sure how to add keys to this if I did it this way.)
To sum it up: Which is faster and better practice? (Keep in mind I'm still a novice with SQL.)
The best practice is to model your data as entities with specific attributes. Typically an entity has at most a few dozen attributes. The entities typically turn into tables, and the attributes typically turn into columns. That is, the physical model and the logical model are often very similar.
There may be other considerations. For instance, there is a limit on the number of columns a row can have -- and if you have more columns, you need another solution. Similarly, if the data is sparse (that is, most values are NULL), then having lots of unused columns may be a waste of space. That is, it is more efficient to store it in another format. SQL Server offers sparse columns for this reason.
My suggestion is that you design your table in an intuitive way with named columns. A volume of 1,000 - 10,000 rows per week is not that much data. That turns into 50,000 - 500,000 rows per year, a volume SQL Server should easily be able to handle. You don't say how many named entities you have, but tables with millions or tens of millions of rows are quite reasonable for modern databases.
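Since the question mentions not being sure how to key the named-column layout, here is one possible sketch (MySQL syntax, since the question mentions PHP); the column names come from the question, while the types and the choice of key are assumptions to adapt:
-- One possible keyed layout for the "named columns" option.
CREATE TABLE box (
    boxID INT NOT NULL,
    place VARCHAR(50) NOT NULL,
    serialNum VARCHAR(50) NULL,
    itemNum VARCHAR(50) NULL,
    idLock VARCHAR(50) NULL,
    PRIMARY KEY (boxID, place),
    INDEX IX_box_serialNum (serialNum)  -- supports "find the place for a given serialNum"
);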

Correlation between number of rows and number of columns in database performance

Is there a correlation between the number of rows/columns used and its impact on (MS)SQL database performance?
A little more background:
We have to store lots of data from measurement devices. These devices ping a string of data back to us around 100 times a day. These strings contain roughly 300 fields. Assuming we have 100 devices in operation, we get 10,000 records back every day. At our back-end we split these data strings and have to put them into the database. If these data strings were fixed, that would mean adding around 10,000 new rows to the database each day. No big deal.
However, the contents of these data strings may change over time. There are two options we are considering:
Using vertical tables to store the data dynamically
Using horizontal tables and add a new column now and then when it's needed.
From the perspective of ease we'd like to choose the first approach. However, that means we're adding 100*100*300 = 3,000,000 rows each day. Data has to be stored for a year and a month (395 days), so we're at around 1.2 billion rows, not counting expected growth.
From a performance perspective, is it smarter to use a 'vertical' or a 'horizontal' approach?
When choosing the 'vertical' solution, how can we actually optimize performance by using PKs/FKs wisely?
When choosing the 'horizontal' solution, are there recommendations for adding columns to the table?
I have a vertical DB with 275 million rows in the "values" table. We took this approach because we couldn't accurately define the schema at the outset either. Inserts are fantastic. Selects suck. To be fair, we throw in a couple of extra doohickies the typical vertical schema doesn't have to deal with.
Have a search for EAV, aka Entity-Attribute-Value models. You'll find a lot of heat on both sides of the debate. Two good articles on making it work are:
What is so bad about EAV, anyway?
dave’s guide to the eav
My guess is these sensors don't just start sending you extra fields. You have to release new sensors or sensor code for this to happen. That's your chance to do change control on your schema and add the extra columns. If external parties can connect sensors without notifying you, this argument is null and void and you may be stuck with an EAV.
For the horizontal option you can split tables, putting the frequently used columns in one table and the less used in a second; both tables have the same primary key values, so you can link the less-used columns to the more-used ones. You can also use the RDBMS's built-in partitioning functionality to separate each day's (or week's or month's) data from the rest.
Generally, you can tune a table more for inserts (or any DML) or for queries. Improving one side comes at the expense of the other. Usually, it's a balancing act.
First of all, 10K inserts a day is not really a large number. Sure, it's not insignificant, but it doesn't even come close to what would be considered "large" nowadays. So, while we don't want to make inserts downright sluggish, this gives you some wiggle room.
Creating an index on the device id and/or entry timestamp will do some logical partitioning of the data for you. The exact makeup of your index(es) will depend on your queries. Are you looking for all entries for a given date or date range? Then index the timestamp column. Are you looking for all entries received from a particular device? Then index the device id column. Are you looking for entries from a particular device on a particular date or date range or sorted by the date? Then create an index on both columns.
So if you ask for the entries for device x on date y, then you are going out to the table and looking only at the rows you need. The fact that the table is much larger than the small subset you query is incidental. It's as if the rest of the table doesn't even exist. The total size of the table need not be intimidating.
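For example, if the common query is "all entries for device X in a date range", a composite index along these lines would do the logical partitioning described above (the table and column names are placeholders):
CREATE INDEX IX_measurement_device_ts
    ON measurement (device_id, entry_timestamp);

-- Only the matching slice of the table is touched:
SELECT *
FROM measurement
WHERE device_id = 42
  AND entry_timestamp >= '2015-01-01'
  AND entry_timestamp < '2015-01-08';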
Another option: As it looks like the data is written to the table and never altered after that, then you may want to create a data warehouse schema for the data. New entries can be moved to the warehouse every day or several times a day. The point is, the warehouse schema can have the data sliced, diced, quartered and cubed to make queries much more efficient. So you can have the existing table tuned for more efficient inserts and the warehouse tuned for more efficient queries. That is, after all, what data warehouses are for.
You also imply that some of each entry is (or can be) duplicated from one entry to the next. See if you can segment the data into three types:
Type 1: Data that never changes (the device id, for example)
Type 2: Data that rarely changes
Type 3: Data that changes often
Now all you have is a normalization problem, something a lot easier to solve. Let's say the row is equally split between the types. So you have one table with 100 rows of 33 columns. That's it. It never changes. Linked to that is a table with at least 100 rows of 33 columns, to which maybe several new rows are added each day. Finally, linked to the second table is a table with rows of 33 columns that possibly grows by the full 10K every day.
This minimizes the growth space required by the online database. The warehouse could then denormalize it back into one huge table for ease of querying.
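A minimal sketch of that three-way split, with placeholder names and only the linking columns shown:
-- Type 1: per-device data that never changes (about 100 rows total).
CREATE TABLE device_static (
    device_id INT PRIMARY KEY
    -- the ~33 static columns go here
);

-- Type 2: rarely-changing data; a new row only when something actually changes.
CREATE TABLE device_config (
    config_id INT PRIMARY KEY,
    device_id INT NOT NULL REFERENCES device_static (device_id),
    valid_from DATETIME NOT NULL
    -- the ~33 rarely-changing columns go here
);

-- Type 3: the frequently-changing measurements, ~10K rows per day.
CREATE TABLE device_reading (
    reading_id BIGINT PRIMARY KEY,
    config_id INT NOT NULL REFERENCES device_config (config_id),
    read_at DATETIME NOT NULL
    -- the ~33 frequently-changing columns go here
);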

Organizing lots of timestamped values in a DB (sql / nosql)

I have a device I'm polling for lots of different fields, every x milliseconds
the device returns a list of ids and values which I need to store with a time stamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or query the last timestamp for each value.
A simple approach would be to define a MySQL table with
id,value_id,timestamp,value
and let users select
SELECT value FROM t WHERE value_id = x ORDER BY timestamp DESC LIMIT 1
and just push everything there with an index on timestamp and id. But my question is: what's the best approach, performance- and size-wise, for designing the schema? Or should I use NoSQL? Can anyone comment on possible design trade-offs? Will such a design scale to millions of records?
When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have [timestamp] column in the WHERE clause, you do not need an index on [value] column if there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
EDIT:
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
If you rarely need to sort multiple products but are most often selecting based on a single product-code, then a non-traditional DBMS that uses a hashed-key sparse table rather than a b-tree might be a very viable, even superior, alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
YET ANOTHER EDIT SINCE THE SITE IS NOT LETTING ME ADD ANOTHER ANSWER:
The simplest way is to have two tables: one with the ongoing history, which always has new values inserted, and the other, containing only 250 records (one per part), where the latest value overwrites/replaces the previous one.
UPDATE latest
SET value = x
WHERE id = ?
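A sketch of that two-table arrangement in MySQL terms, assuming readings_history and readings_latest tables with readings_latest keyed (PRIMARY KEY or UNIQUE) on value_id; all names are placeholders:
-- Append every poll to the history table.
INSERT INTO readings_history (value_id, ts, value)
VALUES (?, NOW(), ?);

-- Upsert into the one-row-per-value "latest" table.
INSERT INTO readings_latest (value_id, ts, value)
VALUES (?, NOW(), ?)
ON DUPLICATE KEY UPDATE ts = VALUES(ts), value = VALUES(value);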
You have a choice of
indexes (composite; covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes, composite and non-composite; also be aware that there are quite a few significantly different ways to get 'max per group' (search SO, especially the MySQL version with variables; see the sketch after this list)
triggers - you might use triggers to maintain max row values in another table (best performance of further selects; this is redundant and could be kept in memory)
lazy statistics/triggers: since your database is updated quite often, you can save cycles if you update your statistics periodically (if you can allow the stats to be y seconds old and you poll 1000 / x times a second, then you save up to y * 1000 / x updates; this can be noticeable, especially in terms of scalability)
The above is true if you are looking for the last bit of performance; if not, keep it simple.
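Here is one of the 'max per group' variants mentioned above, using the schema from the question; a composite index on (value_id, timestamp) lets both the inner grouping and the join run off the index:
CREATE INDEX IX_t_value_ts ON t (value_id, timestamp);

-- Latest row for every value_id.
SELECT t.*
FROM t
JOIN (
    SELECT value_id, MAX(timestamp) AS max_ts
    FROM t
    GROUP BY value_id
) latest
  ON latest.value_id = t.value_id
 AND latest.max_ts = t.timestamp;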

Indexing Speed in SQL

When creating indexes for an SQL table, if I had an index on 2 columns and changed it to be on 4 columns, what would be a reasonable increase to expect in the time taken to save, say, 1 million rows?
I know that the answer to this question will vary depending on a lot of factors, such as foreign keys, other indexes, etc, but I thought I'd ask anyway. Not sure if it matters, but I am using MS SQLServer 2005.
EDIT: Ok, so here's some more information that might help get a better answer. I have a table called CostDependency. Inside this table are the following columns:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
DistributionID as UniqueIdentifier (FK)
IsValid as Bit
At the moment there is one Unique index involving ParentPriceID, DependantPriceID, LocationID and DistributionID. The reason for this index is to guarantee that the combination of those four columns is unique. We are not doing any searching on these four columns together. I can however normalise this table and make it into three tables:
CostDependancyID as UniqueIdentifier (PK)
ParentPriceID as UniqueIdentifier (FK)
DependantPriceID as UniqueIdentifier (FK)
Unique Index on ParentPriceID and DependantPriceID
and
ExtensionID as UniqueIdentifier (PK)
CostDependencyID (FK)
DistributionID as UniqueIdentifier (FK)
Unique Index on CostDependencyID and DistributionID
and
ID as UniqueIdentifier (PK)
ExtensionID as UniqueIdentifier (FK)
LocationID as UniqueIdentifier (FK)
IsValid as Bit
Unique Index on ExtensionID and LocationID
I am trying to work out if normalising this table and thus reducing the number of columns in the indexes will mean speed improvements when adding a large number of rows (i.e. 1 million).
Thanks, Dane.
With all the new info available, I'd like to suggest the following:
1) If a few of the GUID (UniqueIdentifier) columns are such that a) there are relatively few different values and b) relatively few new values are added after the initial load (for example, LocationID may represent a store, and we only see a few new stores every day), it would be profitable to spin these off to separate lookup table(s) mapping GUID -> LocalId (an INT or some other small column), and use this LocalId in the main table.
==> Doing so will greatly reduce the overall size of the main table and its associated indexes, at the cost of slightly complicating the update logic (but not its performance), because of the lookup(s) and the need to maintain the lookup table(s) with new values.
2) Unless a particularly important/frequent search case could make [good] use of a clustered index, we could make the clustered index on the main table the 4-column unique composite key (see the sketch after this list). This would avoid replicating that much data in a separate non-clustered index, and, as counterintuitive as it seems, it would save time for the initial load and with new inserts. The trick would be to use a relatively low fill factor so that node splitting and balancing etc. would be infrequent. BTW, if we make the main record narrower with the use of local IDs, we can more readily afford "wasting" space in the fill factor, and more new records will fit in this space before requiring node balancing.
3) link664 could provide an order of magnitude for the total number of records in the "main" table and the number expected daily/weekly/whenever updates are scheduled. These two parameters could confirm the validity of the approach suggested above, as well as provide hints as to the possibility of dropping the indexes (or some of them) prior to big batch inserts, as suggested by Philip Kelley. Doing so, however, would be contingent on operational considerations such as the need to keep the search service available while new data is inserted.
4) Other considerations, such as SQL partitioning, storage architecture, etc., can also be put to work to improve load and/or retrieval performance.
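A sketch of point 2 in SQL Server terms; it assumes the existing primary key on CostDependancyID is created as nonclustered so the composite key can take the clustered slot, and the fill factor value is only an example:
-- Make the four-column unique key the clustered index and leave free space
-- in each page so new inserts cause fewer page splits.
CREATE UNIQUE CLUSTERED INDEX CIX_CostDependency_Composite
    ON CostDependency (ParentPriceID, DependantPriceID, LocationID, DistributionID)
    WITH (FILLFACTOR = 80);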
It depends pretty much on whether the wider index forms a covering index for your queries (and to a lesser extent on the ratio of reads to writes on that table). I suggest you post your execution plan(s) for the query workload you are trying to improve.
I'm a bit confused over your goals. The (post-edit) question reads that you're trying to optimize data (row) insertion, comparing one table of six columns and a four-column compound primary key against a "normalized" set of three tables of three or four columns each, and each of the three with a two-column compound key. Is this your issue?
My first question is, what are the effects of the "normalization" from one table to three? If you had 1M rows in the single table, how many rows are you likely to have in the three normalized ones? Normalization usually removes redundant data, does it do so here?
Inserting 1M rows into a four-column-PK table will take more time than into a two-column-PK table--perhaps a little, perhaps a lot (see next paragraph). However, if all else is equal, I believe that inserting 1M rows into three two-column-PK tables will be slower than into the single four-column one. Testing is called for.
One thing that is certain is that if the data to be inserted is not loaded in the same order as it will be stored in, it will be a LOT slower than if the data being inserted were already sorted. Multiply that by three, and you'll have a long wait indeed. The most common work-around to this problem is to drop the index, load the data, and then recreate the index (sounds like a time-waster, but for large data sets it can be faster than inserting into an indexed table). A more abstract work-around is to load the table into an unindexed partition, (re)build the index, then switch the partition into the "live" table. Is this an option you can consider?
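For the drop-load-recreate work-around, SQL Server 2005 can disable and rebuild a nonclustered index around the bulk load, roughly like this (the index name is a placeholder; disabling a clustered index would make the table unavailable, so only do this for nonclustered ones):
ALTER INDEX IX_CostDependency_Unique ON CostDependency DISABLE;

-- ... bulk insert the 1M rows here ...

ALTER INDEX IX_CostDependency_Unique ON CostDependency REBUILD;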
By and large, people are not overly concerned with performance when data is being inserted--generally they sweat over data retrieval performance. If it's not a warehouse situation, I'd be interested in knowing why insert performance is your apparent bottleneck.
The query optimizer will look at the index and determine if it can use the leading column. If the first column is not in the query, then the index won't be used, period. If the index can be used, then it will check to see if the second column can be used. If your query contains 'where A=? and C=?' and your index is on A,B,C,D, then only the 'A' column will be used in the query plan.
Adding columns to an index can sometimes be useful to avoid the database having to go from the index page to the data page. If your query is 'select D from table where a=? and b=? and c=?', then column 'D' will be returned from the index, saving you a bit of I/O by not having to go to the data page.
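For instance (SQL Server syntax; the table and column names are placeholders), an index keyed on A, B, C that carries D along covers that query entirely:
CREATE NONCLUSTERED INDEX IX_MyTable_ABC_incl_D
    ON MyTable (A, B, C)
    INCLUDE (D);

-- Answered from the index alone, no data-page lookup:
SELECT D FROM MyTable WHERE A = ? AND B = ? AND C = ?;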