SSAS dimension key as compound key vs char field

For anyone working with SSAS 2008, a question:
I have a rather large dimension whose key attribute is a combination of two integer fields. I have the key attribute's Key Columns set up as a collection consisting of the two integer fields, and for the name column I have a WChar field which concatenates the two integer fields like so ("Field1 - Field2"). My question is: would I get better performance using the WChar field as the Key Column rather than the compound key? Or are two integer fields still better than one WChar field when it comes to Key Columns?
Thanks

In theory, a single integer "surrogate key" would be fastest. However, I suspect that since the concatenated field is a relatively small string, there won't be much difference between using the compound key and the concatenated field. It would probably begin to make a difference if the concatenated string were significantly larger.
Another problem you might run into with large dimensions that have large string keys is that the Analysis Services key store has a 4 GB limit.
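If you do go with the concatenated field, one common way to build it is as a named calculation in the data source view. A minimal T-SQL sketch of that expression, where Field1, Field2, and the table name are illustrative placeholders:

-- Named calculation producing the "Field1 - Field2" name column
SELECT
    Field1,
    Field2,
    CAST(Field1 AS varchar(10)) + ' - ' + CAST(Field2 AS varchar(10)) AS KeyName
FROM dbo.BigDimension;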
Check out this whitepaper; it has a lot of good information about optimizing dimensional design and general performance tuning:
http://sqlcat.com/whitepapers/archive/2009/02/15/the-analysis-services-2008-performance-guide.aspx
This book has some of the best coverage of the Analysis Services storage engine and physical data structures:
http://www.pearson.ch/1471/9780672330018/Microsoft-SQL-Server-2008-Analysis.aspx
Hope this helps

Related

Use string as primary key to store words

I am planning to create one huge table to store, for personal experimentation, all words that could possibly exist (whether part of an official dictionary, urban slang, or otherwise).
Does it make sense to use the word itself as a primary key?
It is 100% certain that words MUST be unique; moreover, they will not change.
The purpose in the end is to also use this PK as FK in related tables to get more information on these words.
I am not too familiar with table scaling, so I wonder whether I could get into trouble:
Performance-wise
If the table becomes too large and has to be partitioned (?)
If I want to move the database to SQLite to use as an embedded data store
Tagging this question with postgres (my current DB), but I may migrate to SQLite.
I would be surprised if there were enough words that you needed to partition the table. Of course if your "words" are really genetic sequences or something, I might be off there.
In any case, one of the primary purposes of a primary key is to support foreign key relationships. So, if there is any possibility that another table might refer to this table, then you want to take that into account.
Integer foreign keys are generally preferable, because they are a fixed length -- and that is a little more efficient for indexes. In addition, four-byte integers are probably smaller than the average word length, so they save on storage of the foreign key as well.
That would be balanced against an additional 4 bytes in the words table itself. On balance, I usually add synthetic primary keys.
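To make that concrete, here is a minimal Postgres sketch of the synthetic-key design (table and column names are illustrative):

CREATE TABLE words (
    word_id serial PRIMARY KEY,   -- 4-byte synthetic key
    word    text NOT NULL UNIQUE  -- natural uniqueness still enforced
);

CREATE TABLE word_notes (
    note_id serial PRIMARY KEY,
    word_id integer NOT NULL REFERENCES words (word_id),
    note    text NOT NULL
);

The UNIQUE constraint keeps the natural key enforced, while every foreign key stays at a fixed 4 bytes.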
Another Idea:
Make 2 columns
Column 1: Initial Letter
Column 2: The Word
(if the word is APPLE: Column 1 --> A, Column 2 --> Apple)
Benefits:
you can query faster for tasks like counting words per letter (e.g., the number of words starting with A)
it gives you simple rules for sharding (e.g., all words with 'A' in Column 1 can be assigned to a particular dedicated shard), as sketched below
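A minimal sketch of that layout in Postgres syntax (names are illustrative):

CREATE TABLE words (
    initial_letter char(1) NOT NULL,  -- Column 1: the first letter
    word           text    NOT NULL,  -- Column 2: the word itself
    PRIMARY KEY (initial_letter, word)
);

-- Counting words per letter can then use the leading key column:
SELECT count(*) FROM words WHERE initial_letter = 'A';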

SQL Index - Difference Between char and int

I have a table on Sql Server 2005 database.
The primary key field of the table is a code number.
As a standard, the code must contain exactly 4 numeric digits. For example: 1234, 7834, ...
Would you suggest char(4), int, or numeric(4) for that field, in terms of efficient SELECT operations?
Would indexing the table on any type of these differ from any other?
Integer / Identity columns are often used for primary keys in database tables for a number of reasons. Primary key columns must be unique, should not be updatable, and really should be meaningless. This makes an identity column a pretty good choice because the server will get the next value for you, they must be unique, and integers are relatively small and useable (compared to a GUID).
Some database architects will argue that other data types should be used for primary key values and the "meaningless" and "not updatable" criteria can be argued convincingly on both sides. Regardless, integer / identity fields are pretty convenient and many database designers find that they make suitable key values for referential integrity.
The best choice for a primary key is an integer data type, since integer values are processed faster than character data type values. A character data type (as a primary key) needs to be converted to its ASCII-equivalent values before processing.
Fetching a record by primary key will be faster with integers as primary keys, because more index records fit on a single page, so the total search time decreases. Joins will also be faster. But this applies when your query uses a clustered index seek rather than a scan, and when only one table is used. In the case of a scan, not having the additional column means more rows fit on one data page.
Hopefully this will help you!
I advocate a SMALLINT column, simply because it is the most sensible datatype that fits the required range (up to 65535, in excess of 4 digits). Use a check constraint to enforce the 4-digit limitation and a computed column to return the char(4) representation, as sketched below.
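A sketch of that design in SQL Server syntax (table and column names are illustrative):

CREATE TABLE Codes (
    CodeId   smallint NOT NULL PRIMARY KEY,
    -- computed char(4) representation with leading zeros
    CodeChar AS RIGHT('0000' + CAST(CodeId AS varchar(4)), 4),
    CONSTRAINT CK_Codes_FourDigits CHECK (CodeId BETWEEN 0 AND 9999)
);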
If I remember correctly, ints take up less storage than chars, so you should go with int.
These two links say the same:
http://www.eggheadcafe.com/software/aspnet/31759030/varcharschars-vs-intbigint-as-keys.aspx
http://sql-server-performance.com/Community/forums/p/16020/94489.aspx
"It depends"
In this case, char(4) captures the data stored correctly with no storage overhead (4 bytes each). And 0001 is not the same as 1 of course.
You do have some overhead for processing collation, etc., if you allow non-numeric characters, but it shouldn't matter for reasonably sized databases. And with a 4-digit code you do have an upper bound on the number of rows, especially if numeric (10k).
If your new codes are not strictly increasing, then you get the page-split issue associated with GUID clustered keys.
If they are strictly increasing, then use int and add a computed column to add the leading zeros.

What is the best way to store categorical references in SQL tables?

I'm wanting to store a wide array of categorical data in MySQL database tables. Let's say, for instance, that I want to store information on "widgets" and want to categorize attributes in certain ways, i.e. by shape category.
For instance, the widgets could be classified as: round, square, triangular, spherical, etc.
Should these categories be stored within a table so they can best be referenced from an application? Another possibility, I would imagine, would be to add a shape column to widgets containing a tinyint. That way my application could search shapes by that value and then use a corresponding enum type that maps the int values to shape meanings.
Which would be best? Or is there another solution that I'm not thinking of yet?
Define a category table for each attribute grouping, e.g.:
WIDGET_SHAPE_TYPE_CODES
WIDGET_SHAPE_TYPE_CODE (primary key)
DESCRIPTION
Then use a foreign key reference in the WIDGETS table:
WIDGETS
WIDGET_ID (primary key)
...
WIDGET_SHAPE_TYPE_CODE (foreign key)
This has the benefit of being portable to other databases, and more obvious relationships which means simpler maintenance.
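A concrete sketch of that design in MySQL syntax, using the names above:

CREATE TABLE widget_shape_type_codes (
    widget_shape_type_code tinyint unsigned NOT NULL PRIMARY KEY,
    description            varchar(50) NOT NULL
);

CREATE TABLE widgets (
    widget_id              int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
    widget_shape_type_code tinyint unsigned NOT NULL,
    FOREIGN KEY (widget_shape_type_code)
        REFERENCES widget_shape_type_codes (widget_shape_type_code)
) ENGINE=InnoDB;  -- InnoDB is assumed, since MyISAM does not enforce foreign keys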
What I would do is start with a Widgets table that has a numeric category field. If you also use a category table, the numeric category is a foreign key that relates to a row in that table. A numeric type is nice and small, for better performance.
Optionally you can add a category table containing a numeric primary key value and a text description. This matches the numeric value up with a human-friendly text value. This table can be used to convert the numbers to text if you just want to run reports directly from the database. The nice thing about having this table is that you don't need to update an executable if you add a new category. I would add such a table to my design.
MySQL's ENUM is handy, but it stores the value in the table as a string, so it uses up more space in the table than is really needed. However, it does have the advantage of preventing unrecognized values from being stored. Preventing the storage of invalid numeric values is possible, but not as elegant as with ENUM. The other problem with ENUM is that, because the value is regarded as a string, the database must do more work when you select by it: instead of comparing a single number, multiple characters have to be compared.
If you really want to, you can have an enumeration in your code that converts the numeric category back into something more application-code-friendly, but you make your code more difficult to maintain by doing this. However, it can have a performance advantage, because fewer bytes have to be returned when you run a query. I would try to avoid it, because it requires updating the application code every time a category is added to the database. If you really need to squeeze performance out of the database, you could select the whole category table, select the widgets table, and merge them in application code, but that is a rare circumstance, since the DB client almost always has a fast connection to the DB server and a few more bytes over the network are insignificant.
I think the best way is to use ENUM; there is a predefined ENUM type in MySQL: http://dev.mysql.com/doc/refman/5.0/en/enum.html
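A minimal sketch of the ENUM approach, using the shape values from the question:

CREATE TABLE widgets (
    widget_id int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
    shape     ENUM('round', 'square', 'triangular', 'spherical') NOT NULL
);

-- Values outside the list are rejected in strict SQL mode
-- (or stored as the empty-string error value otherwise).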

Should I use integer primary IDs?

For example, I always generate an auto-increment field for the users table, but I also specify a UNIQUE index on their usernames. There are situations where I first need to get the userId for a given username and then execute the desired query, or use a JOIN in the desired query. That's two trips to the database, or a JOIN vs. a varchar index.
Should I use integer primary IDs?
Is there a real performance benefit on INT over small VARCHAR indexes?
There are several advantages of having a surrogate primary key, including:
When you have a foreign key in another table, if it is an integer it takes up only a few bytes extra space and can be joined quickly. If you use the username as the primary key it will have to be stored in both tables - taking up more space and it takes longer to compare when you need to join.
If a user wishes to change their username, you will have big problems if you have used it as a primary key. While it is possible to update a primary key, it is very unwise to do so and can cause all sorts of problems as this key might have been sent out to all sorts of other systems, used in links, saved in backups, logs that have been archived, etc. You can't easily update all these places.
It's not just about performance. You should never key on a meaningful value, for reasons that are well documented elsewhere.
By the way, I often scale the type of int to the size of the table. When I know that a table will not exceed 255 rows, I use a tinyint key, and the same for smallint.
In addition to what others have said, you need to think about the clustering of the table.
In SQL Server, for instance (and possibly other vendors), if the primary key is also used as the clustered index of the table (which is quite common), an incrementing integer has an advantage over other field types. This is because new rows are entered with a primary key that is always greater than that of the previous rows, meaning the new row can be stored at the end of the table instead of in the middle (the same scenario can be created with other field types for the primary key, but an integer type lends itself better to it).
Compare this with a GUID primary key: new rows have to be inserted into the middle of the table because GUIDs are non-sequential, making inserts very inefficient.
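A sketch of the append-friendly pattern in SQL Server syntax (table and column names are illustrative):

CREATE TABLE orders (
    -- always increasing, so new rows go at the end of the clustered index
    order_id   int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    created_at datetime NOT NULL DEFAULT GETDATE()
);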
First, as is obvious, on small tables it will make no difference with respect to performance. Only on very large tables (how large depends on numerous factors) can it make a difference, for a handful of reasons:
Using a 32-bit integer will consume only 4 bytes of space. Presumably, your usernames will be longer than four non-Unicode characters and will thus consume more than 4 bytes of space. The more space used, the fewer pieces of data fit on a page, the fatter the index, and the more IO you incur.
Your character columns are going to require the use of varchar over char, unless you force everyone to have usernames of identical size. This too will have a tiny performance and storage impact.
Unless you are using a binary sort collation, the system has to do relatively sophisticated matching when comparing two strings. Do the two columns use the same collation? For each character, are they cased the same? What are the casing and accent rules in terms of matching? And so on. While this can be done quickly, it is more work, which, in very large tables, can make a difference in comparison to matching on an integer.
I'm not sure why you would ever have to do two trips to the database or join on a varchar column. Why couldn't you do one trip to the database (where creation returns your new PK) where you join to the users table on the integer PK?
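For illustration, a sketch of that single-trip pattern in SQL Server syntax (table and column names are assumptions):

-- One round trip: insert and get the new integer PK back
INSERT INTO users (username)
OUTPUT inserted.user_id
VALUES ('alice');

-- Later queries join on the integer PK rather than the varchar username:
SELECT o.order_id, o.total
FROM orders AS o
JOIN users AS u ON u.user_id = o.user_id
WHERE u.username = 'alice';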

Why is the Primary Key often an integer in a Relational Database Management System?

While developing a database design, it's habitual in most scenarios to set an integer type for the primary key's unique identifier in a table. Why not use a string or float for primary keys? Does this affect the accessibility of values or, in plain words, retrieval speed? Are there any specific reasons?
An integer will use less disk space than a string, thus giving you a smaller index file to search through. This is important for large tables where you want to have as much of the index as possible cached in RAM.
Also, they can be autoincremented so you don't need to write your own routines to generate keys.
You often want to have a technical key (also called a surrogate key), a key that is only used to identify the row and not used for anything else. Most data may change sooner or later for reasons you can't control and you don't want to update it everywhere. Even such seemingly static data as a nation-assigned personal id number can change (if you get a new identity) or there may be laws prohibiting their use. A key generated by you, however, is in your own control. For such surrogate keys it's useful to have a small key that is easily generated.
As for "floats as primary keys": Don't do this. A primary key should uniquely identify a row. Floats have no equality relation, which means you cannot safely compare two float values for equality. This is an inherent shortcoming of floating-point values. If you need decimals, use a fixed-point number type instead.
The primary key is supposed to be an index that can provide a unique way to access a specific row in a table. Primary keys can be most data types (in practical applications, float/double won't work too well), and primary keys can also be compound keys (comprised of several columns.)
If you carefully examine the data in the table, you might be able to find a data item that will be unique for every row in the table, thereby eliminating the requirement that you fabricate a key like the autoincrement integer that you find in some schemas.
If you're in a manufacturing environment it might be an alphanumeric field like part number or assembly identifier. Retail or warehousing applications might have a stock number or combination of stock number/shipment/manufacturer.
Generally, if some data in your table is supposed to be a unique identifier, it will probably serve well as a primary key for your table.
Using data that already exists in the table completely eliminates the requirement to "make up" a value (such as the autoincrement column) and use it as the primary key. This saves space, since it's one less column in the table and one less index on the table.
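A sketch of the manufacturing example with a natural key (names are illustrative):

CREATE TABLE parts (
    part_number varchar(20)  NOT NULL PRIMARY KEY,  -- natural alphanumeric identifier
    description varchar(100) NOT NULL
);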
Yes, in my experience integer keys are almost always faster, since it's more efficient for the database engine to compare integers than to compare strings. Depending on the "uniqueness" of the data (technically called cardinality: http://en.wikipedia.org/wiki/Cardinality_(SQL_statements)), the effect of character vs. integer keys may be nominal.
Character keys may degrade performance depending on the number of characters the database needs to compare to determine whether keys are equal or not. In the pathological case, imagine hundred-character values that differ only at the right-hand end: one row's key is 100 A's, and we need to compare it to a key with 99 A's and a B as the last character. Conceptually, databases compare character fields just like strcmp() (strncmp() if you prefer), from left to right.
good luck!
The only reason is for performance.
A logical database design should specify which "real" columns are unique, but when the logical design is transformed into a physical design, it is traditional to not use any of these "natural" keys as the primary key; instead, a meaningless integer column is added for this purpose - called a "surrogate key".
Normally the designer will add further unique constraints for the "real" uniqueness business rules as specified in the logical design.
This is because most DBMS's have trouble updating a primary key (e.g. due to performance issues when cascading the update to child tables). Some DBMS's might not be able to support non-integer primary keys at all.
Some side notes:
There's no theoretical reason why primary keys should be immutable. This has nothing to do with normalization, which happens in the logical model (which should never have surrogate keys).
Also, note that the idea of a "primary" key is not a relational concept; it is simply a way of denoting the "preferred" uniqueness constraint, perhaps for referential integrity, but there's nothing in the RM that says you must use the same key for each child table.
I've created natural keys as "primary keys" in Oracle databases before, albeit rarely. I've even had them used for foreign key constraints. Admittedly, they were either immutable or I hand-wrote the update-cascade code; and I had trouble with one front-end application where the PK included a date column.
Bottom line: there is no theoretical requirement for surrogate keys, but they're much more practical than the alternative.
I suspect that it is because we can auto-increment integer values so it's easy to generate a new unique key for every insert.
Many common ORM (Object Relational Mapping) tools either force you to use, or at least recommend using, an integer as the primary key.
An integer primary key also saves space compared to a string, and an integer primary key is in some cases faster. Sequences or auto-increment fields make integer primary key generation easy, at least if you do not work with distributed databases.
These are some of the main reasons why I think we have integers/numbers as primary keys:
1. Primary keys should be able to uniquely define your row and should be immutable. One of the problems with using real attributes (name etc.) is that they could change over time. Maintaining referential integrity in such a case would be very difficult, as the change would need to cascade to all the child records.
2. The size of the table, and thereby of the index, will be smaller if we use a number as the key for the table.
3. Since these are automatically generated using a sequence, we can be sure that the values will be unique under all circumstances.
Check this.
http://forums.oracle.com/forums/thread.jspa?messageID=3916511&#3916511