How can you create Clustered Indexes with Fluent NHibernate? - nhibernate

I am using Fluent-NHibernate (with automapping) to generate my tables but would like to choose a different clustered index than the ID field which is used by default. How can you create clustered indexes with Fluent NHibernate on a field other than the default Primary Key field?
The primary reasoning behind this is simple. I am using Guids for my primary key fields. By default, NHibernate creates clustered indexes on the primary key fields. Since Guids are usually not sequential, clustering on the primary key field causes a performance issue.
As we all know, appending records at the end of a table is a much cheaper operation than inserting records within the table. Also, the records in the table are physically stored in the order of the items in the clustered index. Since Guids are somewhat "random" and are not sequential, new Guids may be generated that are less than the value of other Id Guids already in the table--resulting in table inserts rather than appends.
To minimize this, I have a column called CreatedOn which is of type DateTime. I need for the table to be clustered on this CreatedOn column so that all new records are appended rather than inserted.
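The append-versus-insert argument above can be illustrated with a small simulation. This is only a sketch: a sorted Python list stands in for the clustered index, `count_mid_table_inserts` is a hypothetical helper, and Python's ordering of UUID values is not the same as SQL Server's ordering of uniqueidentifier values. The point is only that random keys rarely sort after every existing key, while an ever-increasing value such as CreatedOn always does.

```python
import bisect
import random
import uuid


def count_mid_table_inserts(keys):
    """Count keys that would land before the current end of a sorted index."""
    index = []  # stands in for the clustered index, kept sorted
    mid_inserts = 0
    for key in keys:
        position = bisect.bisect_left(index, key)
        if position < len(index):
            mid_inserts += 1  # key sorts before an existing row: insert, not append
        index.insert(position, key)
    return mid_inserts


rng = random.Random(42)  # fixed seed so the run is repeatable
random_guids = [uuid.UUID(int=rng.getrandbits(128)) for _ in range(1000)]
created_on = list(range(1000))  # stand-in for ever-increasing CreatedOn values

guid_mid_inserts = count_mid_table_inserts(random_guids)
created_on_mid_inserts = count_mid_table_inserts(created_on)
print(guid_mid_inserts, created_on_mid_inserts)
```

Nearly every random key lands somewhere in the middle of the existing data, whereas the increasing values are always appended.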
Any ideas for how to accomplish this are welcome!!!
Note: I realize that I could use Sequential Guids but prefer not to go down that path for security reasons.
Note: I still do not have an answer for this post but I have a few ideas I am pondering at the moment.
1. Using NHibernate without Fluent, I think it may be possible to create clustered indexes directly in NHibernate. I don't yet know enough about NHibernate to know how to do this. I am just pretty (as in almost absolutely) sure it can be done.
2. Fluent-NHibernate used to include a way to set attributes (e.g. like a clustered index) on a SQL object before the recent rewrite. Now that option appears to have gone away. I will probably post a question somewhere to see if that option is still available. If so, I could probably use that to set the clustered index.
3. Fluent-NHibernate provides the ability to expose a configuration for manual editing once it has been fluently built. I have not tried this functionality but expect it may offer the level of granularity that is needed to set clustered indexes.
4. Worst case scenario, I can write a SQL script to change the clustered indexes on all my tables once they are generated. However, I have a couple of questions regarding this approach. A. Since I am using automatic schema generation, will NHibernate "undo" my clustered index changes the next time it evaluates the configuration? B. Will NHibernate raise an error if it detects the clustered index has been changed? I need to test this but have not done so just yet. I really hate this solution though. I am testing my DB against SQL Server 2008 and MySQL. Part of the beauty of NHibernate is that it is database agnostic. Once we introduce scripts, all bets are off.
5. There is an interface used in fluent conventions called IPropertyInstance. Classes which inherit from this interface have an Index property which allows an index to be created on the field. The problem is that there is no flag or other option to allow the index to be created as clustered. The simplest solution would be to add a property to this interface to allow clustered indexes to be created. I think I may suggest this to the Fluent-NHibernate developers.

This is an old post, but I hope this can help someone else. This comes from my experience with MS SQL Server. I believe different platforms require different solutions, but this should be a good starting point.
NHibernate doesn't set the CLUSTERED index on the primary key. That is the default behavior of SQL Server. As there can be only one CLUSTERED index per table, we first need to avoid the creation of a CLUSTERED index on the primary key.
The only way I found to accomplish this is to create a custom Dialect, overriding the property PrimaryKeyString. NHibernate's default comes from Dialect.cs:
public virtual string PrimaryKeyString
{
    get { return "primary key"; }
}
For SQL Server, override it in the custom dialect:
public override string PrimaryKeyString
{
    get { return "primary key nonclustered"; }
}
This will force SQL Server to create a NONCLUSTERED primary key.
Now you can add your own CLUSTERED index on your favorite column through the <database-object> tag in the XML mapping file.
<database-object>
    <create>
        create clustered index IX_CustomClusteredIndexName on TableName (ColumnName ASC)
    </create>
    <drop>
        drop index IX_CustomClusteredIndexName ON TableName
    </drop>
</database-object>

I can't answer that specifically, but I'll give you some database info since I'm here.
You'll need to tell NHibernate to create the primary key as a non-clustered index. There can be only one clustered index per table, so you need to create the table as a heap, and then put a clustered index on it.

As you said yourself, another option is to switch to the guid.comb ID generation strategy, where the PK values are built from a part which is a Guid and a part which ensures that the generated IDs are sequential.
Check out more info in a post by Jeffrey Palermo here.
But you mention that you do not want to do that for security reasons. Why is that?

As @abx78 said, this is an old post, but I would like to share my knowledge of a solution for this problem as well. I built the solution for idea 3, "Fluent NHibernate exposes mappings":
After the configuration has been built (and thus the mappings have been parsed), Fluent NHibernate gives us the opportunity to look into the actual mappings with configuration.ClassMappings and configuration.CollectionMappings. The latter is used in the example below to set a composite primary key, resulting in a clustered index in SQL Server (as @abx78 points out):
foreach (var collectionMapping in configuration.CollectionMappings) {
    // Fetch the columns (in this example: build the columns in a hacky way)
    const string columnFormat = "{0}_id";
    var leftColumn = new Column(string.Format(
        columnFormat,
        collectionMapping.Owner.MappedClass.Name));
    var rightColumn = new Column(string.Format(
        columnFormat,
        collectionMapping.GenericArguments[0].Name));
    // Fetch the actual table of the many-to-many collection
    var manyToManyTable = collectionMapping.CollectionTable;
    // Shorten the name just like NHibernate does
    var shortTableName = (manyToManyTable.Name.Length <= 8)
        ? manyToManyTable.Name
        : manyToManyTable.Name.Substring(0, 8);
    // Create the primary key and add the columns
    // >> This part could be changed to a UniqueKey or to an Index
    var primaryKey = new PrimaryKey {
        Name = string.Format("PK_{0}", shortTableName),
    };
    primaryKey.AddColumn(leftColumn);
    primaryKey.AddColumn(rightColumn);
    // Set the primary key to the junction table
    manyToManyTable.PrimaryKey = primaryKey;
    // <<
}
Source: Fluent NHibernate: How to create a clustered index on a Many-to-Many Join Table?

Related

Identifying primary key for a vote table

I am working on a voting table design using Postgres 9.5 (but maybe the question itself is applicable to sql in general). My vote table should be like:
-------------------------
object | user | timestamp
-------------------------
Where object and user are foreign keys to the ids corresponding to their own tables. I have a problem identifying what actually should be a primary key.
At first I thought of making primary_key(object, user), but I use Django on the server and it doesn't support multi-column primary keys. I am also not sure about the performance, since I may access a row using only one of those two columns (i.e. object or user). The advantage of this idea is that it automatically works as a unique key, since the same user shouldn't vote twice for the same object, and I don't need any additional indexes.
The other idea is to introduce an auto or serial id field. I can't think of any advantage of this approach, especially when the table gets bigger. I would also need to introduce at least a unique_key(object, user), which adds to the computational complexity and data storage. I am not even sure about the performance when I select using one of the two columns; maybe I also need two additional indexes on object and user to accelerate the select operation, since I need this heavily.
Is there something I am missing here? or is there a better idea?
The Django project itself recognises that the "natural primary key" in this case is not supported. So your gut feeling is right, but Django doesn't support it.
https://code.djangoproject.com/wiki/MultipleColumnPrimaryKeys
Relational database designs use a set of columns as the primary key
for a table. When this set includes more than one column, it is known
as a “composite” or “compound” primary key. (For more on the
terminology, here is an article discussing database keys).
Currently Django models only support a single column in this set,
denying many designs where the natural primary key of a table is
multiple columns. Django currently can't work with these schemas; they
must instead introduce a redundant single-column key (a “surrogate”
key), forcing applications to make arbitrary and otherwise-unnecessary
choices about which key to use for the table in any given instance.
I'm less familiar with Django personally. One option might be to form an extra column as a primary key by concatenating object and user.
Remember that there is nothing special about a primary key. You can always add a UNIQUE KEY on the pair of columns and make them both NOT NULL.
You might find this example useful.
https://thecuriousfrequency.wordpress.com/2014/11/11/make-primary-key-with-two-or-more-field-in-django/
The correct solution would be to have a PRIMARY KEY (object, user) and an additional index on user. The primary key index can also be used for searches on object alone.
From a database point of view, your problem is that you use an inadequate middleware if it does not support composite primary keys.
You'll probably have to introduce an artificial primary key constraint and in addition have a unique constraint on (object, user) and an index on user, but your gut feelings that that is not the best solution from a database perspective are absolutely true.
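The design recommended above (a composite primary key plus one extra index on user) can be sketched as follows. SQLite is used here only so the example is self-contained and runnable; the question is about Postgres, and the table and column names (`vote`, `object_id`, `user_id`, `voted_at`) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vote (
        object_id INTEGER NOT NULL,
        user_id   INTEGER NOT NULL,
        voted_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (object_id, user_id)  -- also enforces one vote per user per object
    );
    -- The primary key index already serves lookups by object_id alone;
    -- this extra index serves lookups by user_id alone.
    CREATE INDEX vote_user_idx ON vote (user_id);
""")

conn.execute("INSERT INTO vote (object_id, user_id) VALUES (1, 10)")
try:
    # Same user voting on the same object a second time.
    conn.execute("INSERT INTO vote (object_id, user_id) VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

vote_count = conn.execute("SELECT COUNT(*) FROM vote").fetchone()[0]
print(duplicate_rejected, vote_count)
```

With a surrogate id column instead, the same uniqueness guarantee would need a separate UNIQUE constraint on the pair of columns.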

Why do most SQL databases allow defining the same index twice?

Why do most SQL databases allow defining the same index (or constraint) twice?
For example in MySQL I can do:
CREATE TABLE testkey(id VARCHAR(10) NOT NULL, PRIMARY KEY(id));
ALTER TABLE testkey ADD KEY (id);
ALTER TABLE testkey ADD KEY (id);
SHOW CREATE TABLE testkey;
CREATE TABLE `testkey` (
`id` varchar(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `id` (`id`),
KEY `id_2` (`id`)
)
I do not see any use case for having the same index or constraint twice, and I would like SQL databases not to allow me to do so.
I also do not see the point of naming indexes or constraints, as I could reference them for deletion just as I created them.
Several reasons come to mind. In the case of a database product which supports multiple index types it is possible that you might want to have the same field or combination of fields indexed multiple times, with each index having a different type depending on intended usage. For example, some (perhaps most) database products have a tree-structured index which is good for both direct lookup (e.g. KEY_FIELD = 1) and range scans (e.g. KEY_FIELD > 0 AND KEY_FIELD < 5). In addition, some (but definitely not all) database products also support a hashed index type, which is only useful for direct lookups but which is very fast (e.g. it would work for a comparison such as KEY_FIELD = 1 but could not be used for a range comparison). If you need very fast direct lookup times but still need to provide for ranged comparisons, it might be useful to create both a tree-structured index and a hashed index.
Some database products do prevent you from having multiple primary key constraints on a table. However, preventing all possible duplicates might require more effort on the part of the database vendor than they feel can be justified. In the case of an open source database the principal developers might take the view that if a given feature is a big enough deal to a given user it should be up to that user to send in a code patch to enable whatever feature it is that is wanted. Open source is not a euphemism for "I use your open-source product; therefore, you are now my slave and must implement every feature I might ever want!".
In the end I think it's fair to say that a product which is intended for use by software developers can take it as a given that the user should be expected to exercise reasonable care when using the product.
All programming languages allow you to write redundancies:
<?php
$foo = 'bar';
$foo = 'bar';
That's just an example, you could obviously have duplicate code, duplicate functions, or duplicate data structures that are much more wasteful.
It's up to you to write good code, and this depends on the situation. Maybe there's a good reason in some rare case to write something that seems redundant. In that case, you'd be just as put out if the technology didn't allow you to do it.
You might be interested in a tool called Maatkit, which is a collection of indispensable tools for MySQL users. One of its tools checks for duplicate keys:
http://www.maatkit.org/doc/mk-duplicate-key-checker.html
If you're a MySQL developer, novice or expert, you should download Maatkit right away and set aside a full day to read the docs, try out each tool in the set, and learn how to integrate them into your daily development tasks. You'll kick yourself for not doing it sooner.
As for naming indexes, it allows you to do this:
ALTER TABLE testkey DROP KEY `id`, DROP KEY `id_2`;
If they weren't named, you'd have no way to drop individual indexes. You'd have to drop the whole table and recreate it without the indexes.
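The idea behind a duplicate-key checker like the one mentioned above can be sketched in a few lines. This is not how Maatkit works internally; it is a minimal illustration against SQLite (chosen only because it ships with Python), and `find_duplicate_indexes` is a hypothetical helper. It groups the indexes of one table by the exact column list they cover and reports any group with more than one member.

```python
import sqlite3
from collections import defaultdict


def find_duplicate_indexes(conn, table):
    """Group index names on `table` by the exact column list they cover."""
    by_columns = defaultdict(list)
    # PRAGMA index_list rows are (seq, name, unique, ...).
    for row in conn.execute(f"PRAGMA index_list('{table}')").fetchall():
        index_name = row[1]
        # PRAGMA index_info rows are (seqno, cid, column_name).
        columns = tuple(
            info[2]
            for info in conn.execute(f"PRAGMA index_info('{index_name}')").fetchall()
        )
        by_columns[columns].append(index_name)
    return {cols: sorted(names) for cols, names in by_columns.items() if len(names) > 1}


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE testkey (id VARCHAR(10) NOT NULL, note TEXT);
    CREATE INDEX id_1 ON testkey (id);
    CREATE INDEX id_2 ON testkey (id);      -- same column list as id_1
    CREATE INDEX note_idx ON testkey (note);
""")

duplicates = find_duplicate_indexes(conn, "testkey")
print(duplicates)
```

Because each index has a name, the redundant one can then be dropped individually, which is the point made above about naming.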
There are only two good reasons - that I can think of - for allowing the same index to be defined twice:
1. For compatibility with existing scripts that do define the same index twice.
2. Changing the implementation would require work that I am neither willing to do nor pay for.
I can see that some databases prevent duplicate indexes. Oracle Database prevents duplicate indexes https://www.techonthenet.com/oracle/errors/ora01408.php while other databases like MySQL and PostgreSQL do not have duplicate index prevention.
You shouldn't be in a scenario where you have so many indexes on a table that you can't just quickly look and see if the index is there.
As for naming constraints and indexes, I only really ever name constraints. I will name a constraint FK_CurrentTable_ForeignKeyedColumn, just so things are more visible when quickly looking through lists of them.
Because of databases that support composite (multi-column) indexes, often loosely called covering indexes: Oracle, MySQL, SQL Server and others. Such an index is defined on two or more columns, and the columns are processed left to right in that column list in order to use the index.
So if I define a covering index on columns 1, 2 and 3 - my queries need to use, at a minimum, column 1 to use the index. The next possible combination is column 1 & 2, and finally 1,2 and 3.
So what about my queries that only use column 3? Without the other two columns, the covering index can't be used. It's the same issue for only column 2 use... Either case, that's a situation where I would consider separate indexes on columns 2 and 3.
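The left-to-right behaviour described above can be observed with a query planner. The sketch below uses SQLite, which is an assumption made for runnability (the answer talks about Oracle, MySQL and SQL Server, whose planners differ in detail); the table `t`, the index `t_123` and the helper `plan` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (col1 INTEGER, col2 INTEGER, col3 INTEGER, payload TEXT);
    CREATE INDEX t_123 ON t (col1, col2, col3);
""")


def plan(sql):
    """Return SQLite's query-plan text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last field is the plan detail


plan_col1 = plan("SELECT payload FROM t WHERE col1 = 1")
plan_col1_col2 = plan("SELECT payload FROM t WHERE col1 = 1 AND col2 = 2")
plan_col3 = plan("SELECT payload FROM t WHERE col3 = 3")
print(plan_col1, plan_col1_col2, plan_col3, sep="\n")
```

The first two plans mention the index, while the query filtering only on the third column falls back to scanning the table, which is the case where a separate index on that column may be worth considering.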

Multiple "ID" columns in SQL Server database?

Via this link, I know that a GUID is not good as a clustered index, but it can be uniquely created anywhere. It is required for some advanced SQL Server features like replication, etc.
Is it considered bad design if I want to have a GUID column as a typical Primary Key ? Also this assumes a separate int identity column for my clustering ID, and as an added bonus a "user friendly" id?
update
After viewing your feedback, I realise I didn't really word my question right. I understand that a Guid makes a good (even if it's overkill) PK, but a bad clustering index (in general). My question, asked more directly, is: is it bad to add a second "int identity" column to act as the clustering index?
I was thinking that the Guid would be the PK and that I would use it to build all relationships/joins etc. Then, instead of using a natural key for the clustered index, I would add an additional "ID" that is not data-specific. What I'm wondering is: is that bad?
If you are going to create the identity field anyway, use that as the primary key. Think about querying this data. Ints are faster for joins and much easier to specify when writing queries.
Use the GUID if you must for replication, but don't use it as a primary key.
What are you intending to accomplish with the GUID? The int identity column will also be unique within that table. Do you actually need or expect to need the ability to replicate? If so, is using a GUID actually preferable in your architecture over handling identity columns through one of the identity range management options?
If you like the "pretty" ids generated using the Active Record pattern, then I think I'd try to use it instead of GUIDs. If you do need replication, then use one of the replication strategies appropriate for identity columns.
Consider using only GUID, but get your GUIDs using the NEWSEQUENTIALID method (which allocates sequential values and so doesn't have the same clustering performance problems as the NEWID method).
A problem with using a secondary INT key as an index is that, if it's an adequate index, why use a GUID at all? If a GUID is necessary, how can you use an INT index instead? I'm not sure whether you need a GUID, and if so then why: are you doing replication and/or merging between multiple databases? And if you do need a GUID then you haven't specified exactly how you intend to use the non-globally-unique INT index in that scenario.
Sounds like what you are saying is that I have not made a good case for using a Guid at all, and I agree I know its overkill, but my question I guess would be is it too much overkill?
I think it's convenient to use GUID instead of INT for the primary key, if you have a use case for doing so (e.g. multiple databases) and if you can tolerate the constant-factor loss of performance caused simply by using a bigger (16-byte) key (which results in fewer index entries per page of memory).
The bigger worry is the way in which using a (random) GUID could affect performance when it's used for clustering. To counter-act that:
Either, use something else (e.g. one of the record's natural keys) as the clustered index, even if you still use a GUID for the primary key
Or, let the clustered index be the same field as the GUID primary key, but use NewSequentialId() instead of NewId() to allocate the GUID values.
is it bad to insert an additional artifical "id" for clustering, since I'm not sure I'll have a good natural ID candidate for clustering?
I don't understand why you wouldn't prefer to instead use just the GUID with NewSequentialId(), which I think is provided for exactly this reason.
Using a GUID is lazy -- i.e., the DBA can't be bothered to model his data properly. Also it offers very bad join performance -- typically (16-byte type with poor locality).
Is it a bad design, if I want to have a GUID column as my typical Primary Key, and a separate, int identity column for my clustering ID, and as an added bonus a "user friendly" id?
Yes it is very bad -- firstly you don't want more than one "artificial" candidate key for your table. Secondly, if you want a user friendly id to use as keys just use a fixed length type such as char(8) or binary(8) -- preferably binary as the sort won't use the locale; you could use 16-byte types, however you will notice a deterioration in performance -- though not as bad as with GUIDs. You can use these fixed types to build your own user-friendly allocation scheme that preserves some locality but generates sensible and meaningful IDs.
As an Example:
Suppose you are writing some sort of CRM system (let's say online insurance quotes) and you want an extremely user friendly type, for example an insurance quote reference (QR) that looks like "AD CAR MT 122299432".
In this case -- since the quote reference is long -- I would create a separate LUT/symbol table to resolve the quote reference to the actual identifier used. But I will divorce this LUT from the rest of the model; I will never use the quote reference anywhere else in the model, especially not in the table representing the QRs.
CREATE TABLE QRLut
(
    bigint_id bigint,
    QR char(32)
);
Now if my model has one table that represents the QR and 20 other tables featuring the bigint QR as a foreign key -- the fact that a bigint is used will allow my DB to scale well -- the wider the join predicates the more contention is caused on the memory bus -- and the amount of contention on the memory bus determines how well your CPU's can be saturated (multiple CPU's).
You might think with this example that you could just place the user-friendly QR in the table that actually represents the quote. However, keep in mind that SQL Server gathers statistics on tables and indices, and you don't want to let the server make caching decisions based on the user-friendly QR, since it is huge and wasteful.
I think it is bad design to do it that way but I don't know if it is bad otherwise. Remember, SQL Server automatically assigns the clustered index to the primary key. You would have to remove it after making the GUID the primary key. Also, you usually want your identity column to be your primary key, so doing what you are saying would confuse anyone who reads your code without looking closely. I would suggest you make the ID column your primary key and identity column, and put the clustered index on it. Then make your GUID column a unique key, making it a non-clustered index and not allowing nulls. That in effect will do what you want but will follow more of the standard.
Personally, I would go this way:
An internally known identity field for your PK (one that isn't known to the end-user because they will inevitably want to control it somehow).
A user-friendly "ID" that is unique with respect to some business rule (enforced either in your app code or as a constraint).
A GUID in the future if it's ever deemed necessary (like if it's required for replication).
Now with respect to the clustered index, which you may or may not be confused about, consider this guide from MS for SQL Server 2000.
You are right that GUIDs make good object identifiers, which are implemented in a database as primary keys. Additionally, you are right that primary keys do not need to be the clustered indices.
GUIDs share the same characteristics for clustered indexes as INT IDENTITY columns, provided that the GUIDs are sequential. There is a NewSequentialID specific to SQL Server, but there is also a generic algorithm for creating them called COMB GUID, based on combining the current datetime with random bytes in a way that retains a large degree of randomness while retaining sequentiality.
One thing to keep in mind, if you intend to use NHibernate at some point, is that NHibernate natively knows how to use the COMB GUID strategy - and NHibernate can even use it to do batch-inserts, something that cannot be done with INT IDENTITY or NewSequentialID. If you are inserting multiple objects with NHibernate, then it will be faster to use the COMB GUID strategy than either of the other two methods.
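The COMB idea described above (random bytes combined with the current date and time) can be sketched as follows. This is an illustration of the general technique only, not NHibernate's actual guid.comb implementation, whose byte layout and time resolution may differ; `comb_guid` is a hypothetical function. The timestamp goes into the last six bytes on the assumption that the target is SQL Server, which gives those bytes the most weight when ordering uniqueidentifier values.

```python
import os
import time
import uuid


def comb_guid(timestamp_ms=None):
    """Build a COMB-style GUID: 10 random bytes followed by a 6-byte timestamp."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    random_part = os.urandom(10)                 # keeps most of the value unpredictable
    time_part = timestamp_ms.to_bytes(6, "big")  # 48 bits of milliseconds
    return uuid.UUID(bytes=random_part + time_part)


# Two values generated one millisecond apart (fixed timestamps for repeatability).
earlier = comb_guid(1_700_000_000_000)
later = comb_guid(1_700_000_000_001)
print(earlier, later)
```

Values generated later carry a larger trailing timestamp, so under that ordering they sort after earlier ones while the leading ten bytes stay random.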
It is not bad design at all, an int Identity for your clustering key gives you a number of good benefits (Narrow,Unique,Ascending) whilst keeping the GUID for functionality purposes very separate and acting as your primary key.
If anything I would suggest you have the right approach, although the "user friendly" ID is the most questionable part - as in what purpose is it there to serve.
Addendum: I should put in the obligatory link to (possibly?) the most read article about the topic, by Kimberly Tripp. http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx

Fluent-NHibernate table mapping with no primary key

I am trying to create a mapping to a database table that has no primary keys/references.
public class TestMap : ClassMap<Test> {
    public TestMap() {
        WithTable("TestTable");
        Map(x => x.TestColumn);
    }
}
This fails and expects id or composite-id. Is this possible in fluent nhibernate?
In Oracle at least, I have used "ROWID" for this. For mssql you might use the "ROW_NUMBER()" builtin function for readonly access to the table, but I haven't tried that...
No. You'll have to add a surrogate primary key, such as an identity column in SQL Server, to map this table. As far as I know, this isn't supported by NHibernate itself.
Why don't you have a primary key on this table?
This functionality isn't supported by nhibernate as far as I know. As a general rule of thumb, however, you should really always have some kind of ID and if you find yourself in a situation where you think you don't need one you should assess your data model. An ID, whether it be a table-specific primary key, or a surrogate key from another table, should exist. This not only ensures that nhibernate can process the table, but helps performance via indexing.
Before you start assuming nhibernate isn't going to fulfill your needs, consider why you don't have a key on the table and what kind of sense it makes not to have one.
If the table has no primary key/identity column but we can pick a column from it to act as the identifier, then we can use Fluent as below:
Id(x => x.TempID).Column("TempID");
If the table contains data that belongs to another entity, you could map it as a collection of components. Components are not identified by themselves, but they belong to another entity, which is identified.
You can map an entity to a table without keys defined in the database. I do so in legacy SQL Server databases. However, the table must have a candidate key (some set of columns that actually stores a unique combination of values). The concept of entity involves the notion of some kind of identity.
Instead of this, what you're trying in your code is to map an entity without identity, which isn't possible.

Moving from ints to GUIDs as primary keys

I use several referenced tables with integer primary keys. Now I want to change ints to GUIDs leaving all references intact. What is the easiest way to do it?
Thank you!
Addition
I do understand the process in general, so I need more detailed advice, for example, how to fill the new GUID column. Using a default value of newid() is correct, but what about already existing rows?
Create a new column for the guid value in the master table. Use the uniqueidentifier data type, make it not null with a newid() default so all existing rows will be populated.
Create new uniqueidentifier columns in the child tables.
Run update statements to build the guid relationships using the existing int relationships to reference the entities.
Drop the original int columns.
In addition, leave some space in your data/index pages (specify fillfactor < 100) as guids are not sequential like int identity columns are. This means inserts can be anywhere in the data range and will cause page splits if your pages are 100% full.
Firstly: Dear God why?!?!?
Secondly, you're going to have to add the GUID column to all your tables first, then populate them based on the int value. Once done you can set the GUIDs to primary/foreign keys then drop the int columns.
To update the value you'd do something like
Set the new GUIDs in the primary key table.
Run this:
UPDATE f
SET f.guidCol = p.guidCol
FROM foreignTable f
JOIN primaryTable p ON p.intCol = f.intCol
This is relevant in a system that implements the distributed computing model. If the system is required to know the primary key at the time when you persist information in the system, the use of an auto-incrementing primary key maintained by ONE handler will slow down the system. Instead, you need a mechanism like a GUID generator to create the primary key (keep in mind that the true feature of a primary key is its uniqueness). So, I can scale up with multiple services, each creating its primary keys independently of the others.
I had the dubious privilege of doing this before, and basically what I had to do was to export the whole damned database into XML. Next, I had a Java application that used java.util.Random's nextLong() function to replace the primary keys with their new guid keys. After that I imported the whole thing back into the database.
Of course, the first time I tried to import the XML files back, I forgot to turn off the auto-number feature of the primary key field, so do learn from my mistakes. I'm sure that there are better ways of doing it, but this was a fast and dirty way of doing it ... and it worked. In case you're wondering, the project was to make the application scale.
Yeah, I'm with Glenn... I was actually hesitating on posting the same thing before he posted it....
Why would you not want an auto increment int primary key separate from your GUID? it's a lot more flexible, and you can just have the GUID column indexed so you have good performance on your queries...
As for the flexibility, I like to keep my id's as autoincrement ints because then the other seemingly unique and primary-key worthy item can change.
A great case of the flexibility is if you use usernames as a primary key. Even if they are unique, it is nice to be able to change them. What if users use an email address as their username? Being able to change the username and have it not affect all your queries is a big plus, and I suspect the same could be true with your GUIDs....
I think you must do it manually, or you can write a utility for it. The scenario should be:
Duplicate the "int" PK/FK columns with new "guid" columns.
Generate new values for "guid" PK columns.
Update values in "guid" FK columns with specified values ( you find the records via "int" PK ).
Remove references ( relations ) with "int" PK/FK columns.
Create similar references ( relations ) with "guid" PK/FK columns.
Remove "int" PK/FK columns.
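The first steps of the scenario above (add guid columns, generate values, carry them over to the child rows via the old int relationship) can be sketched like this. SQLite and Python's uuid module are used only to keep the example runnable; on SQL Server the columns would be uniqueidentifier with a newid() default. The `parent` and `child` tables and their columns are assumptions made for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT);
    INSERT INTO parent (id, name) VALUES (1, 'a'), (2, 'b');
    INSERT INTO child (id, parent_id, label) VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
""")

# Duplicate the int PK/FK columns with new guid columns.
conn.execute("ALTER TABLE parent ADD COLUMN guid TEXT")
conn.execute("ALTER TABLE child ADD COLUMN parent_guid TEXT")

# Generate a new guid value for every existing parent row.
for (parent_id,) in conn.execute("SELECT id FROM parent").fetchall():
    conn.execute("UPDATE parent SET guid = ? WHERE id = ?", (str(uuid.uuid4()), parent_id))

# Fill the guid FK column by finding each parent through the old int key.
conn.execute("""
    UPDATE child
    SET parent_guid = (SELECT guid FROM parent WHERE parent.id = child.parent_id)
""")

# Swapping the constraints and dropping the int columns is engine-specific and omitted here.
rows = conn.execute("""
    SELECT child.label, parent.name
    FROM child JOIN parent ON parent.guid = child.parent_guid
    ORDER BY child.label
""").fetchall()
print(rows)
```

Joining on the new guid columns returns the same parent for each child as the original int relationship did, which is the check to make before the int columns are removed.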
It's a very good choice. I switched from longs to UUID for one of my applications and I don't regret it. If you use MS SQL Server it is included in standard (I use postgresql and it's only included in standard from 8.3 on).
Like mentioned by Glenn Slaven, you can recreate UUIDs from the keys you have in your current records. Be aware that they will not be unique though but that way it's easy to keep the relationships intact. New records you create after the move will be unique.
DON'T DO IT! We started out using GUIDs, and now we've almost finished moving to INTs as PKs; we're retaining the GUID for logging purposes (and for some tables of, er, "negotiable relational integrity" ;) ), but the speed increase of using ints has been phenomenal.
This only really became apparent when the table rowcounts crossed into millions, mind you.
Our biggest folly by far was using a NEWID() as the PK of our (sequential) log table - there was much head-smacking when we realised our error.