As per the documentation, the primary index (the index on the document key) is optional in Couchbase. How does Couchbase efficiently ensure the uniqueness of document keys without an index?
The primary index that the documentation refers to is only for N1QL queries and has nothing to do with enforcing uniqueness.
Instead, uniqueness is enforced by the key/value data service. From the "Data" overview documentation:
Each value (binary or JSON) is identified by a unique key, defined by
the user or application when the item is saved. The key is immutable:
once the item is saved, the key cannot be changed.
I am not an expert on Couchbase internals, but unique keys are fundamental to how Couchbase stores/retrieves/shards data. Check out Understanding vBuckets for more information ('vBucket' is analogous to 'shard'). Here's a snippet:
Items are written to and retrieved from vBuckets by means of a CRC32
hashing algorithm, which is applied to the item’s key, and so produces
the number of the vBucket in which the item resides.
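As a rough sketch of that key-to-vBucket mapping (simplified for illustration, not Couchbase's exact implementation; 1024 is the default vBucket count):

    import zlib

    NUM_VBUCKETS = 1024  # Couchbase's default number of vBuckets

    def vbucket_for_key(key):
        # A CRC32 hash of the document key decides which vBucket (shard) owns it,
        # so any lookup or write for that key lands on the same vBucket.
        return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

    print(vbucket_for_key("user::42"))

Because every operation on a given key is routed to the same vBucket, the key/value service can detect an existing key on write without consulting any separate index.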
Related
I am working on a voting table design using Postgres 9.5 (but the question is probably applicable to SQL in general). My vote table should look like:
-------------------------
object | user | timestamp
-------------------------
Where object and user are foreign keys to the ids corresponding to their own tables. I have a problem identifying what actually should be a primary key.
I thought at first of making primary_key(object, user), but since I use Django as a server, multicolumn primary keys are simply not supported. I am also not sure about the performance, since I may access a row using only one of those two columns (i.e. object or user). The advantage is that this works automatically as a unique key, since the same user shouldn't vote twice for the same object, and I don't need any additional indexes.
The other idea is to introduce an auto/serial id field. I can't really see any advantage to this approach, especially as the table gets bigger. I would also need to introduce at least unique_key(object, user), which adds computational complexity and storage. I'm not even sure about the performance when I select using one of the two columns; maybe I also need two additional indexes on object and user to speed up selects, since I do this heavily.
Is there something I am missing here? or is there a better idea?
Django itself recognises that the "natural primary key" in this case is not supported. So your gut feeling is right, but Django doesn't support it.
https://code.djangoproject.com/wiki/MultipleColumnPrimaryKeys
Relational database designs use a set of columns as the primary key
for a table. When this set includes more than one column, it is known
as a “composite” or “compound” primary key. (For more on the
terminology, here is an article discussing database keys).
Currently Django models only support a single column in this set,
denying many designs where the natural primary key of a table is
multiple columns. Django currently can't work with these schemas; they
must instead introduce a redundant single-column key (a “surrogate”
key), forcing applications to make arbitrary and otherwise-unnecessary
choices about which key to use for the table in any given instance.
I'm less familiar with Django personally. One option might be to form an extra column to use as a primary key by concatenating object and user.
Remember that there is nothing special about a primary key. You can always add a UNIQUE KEY on the pair of columns and make them both NOT NULL.
You might find this example useful.
https://thecuriousfrequency.wordpress.com/2014/11/11/make-primary-key-with-two-or-more-field-in-django/
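A minimal sketch of that workaround as a Django model (the referenced model names are placeholders; UniqueConstraint needs Django 2.2+, older versions can use unique_together instead):

    from django.db import models

    class Vote(models.Model):
        # Django adds an auto-incrementing "id" surrogate primary key by itself.
        object = models.ForeignKey("VotedObject", on_delete=models.CASCADE)  # NOT NULL by default
        user = models.ForeignKey("auth.User", on_delete=models.CASCADE)      # NOT NULL by default
        timestamp = models.DateTimeField(auto_now_add=True)

        class Meta:
            constraints = [
                # one vote per (object, user), enforced by the database
                models.UniqueConstraint(fields=["object", "user"],
                                        name="unique_vote_per_object_and_user"),
            ]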
The correct solution would be to have PRIMARY KEY (object, user) and an additional index on user. The primary key's index can also be used for searches on object alone.
From a database point of view, your problem is that you are using inadequate middleware if it does not support composite primary keys.
You'll probably have to introduce an artificial primary key and, in addition, a unique constraint on (object, user) and an index on user; but your gut feeling that this is not the best solution from a database perspective is absolutely right.
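If your tooling does allow it, the schema this answer describes is straightforward; here is a runnable sketch using Python's built-in sqlite3 module (foreign keys omitted for brevity; the same DDL pattern applies in Postgres):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE vote (
            object     INTEGER NOT NULL,
            "user"     INTEGER NOT NULL,
            timestamp  TEXT    NOT NULL,
            PRIMARY KEY (object, "user")    -- composite natural key
        );
        -- The primary-key index already covers searches on the leading
        -- column (object); one extra index lets searches on "user" alone seek.
        CREATE INDEX vote_user_idx ON vote ("user");
    """)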
I am trying to read up on best practices on DynamoDB. I saw that DynamoDB has two PK types:
Hash Key
Hash and Range Key
From what I read, it appears the latter is like the former but supports sorting and indexing of a finite set of columns.
So my question is why ever use only a hash key without a range key? Is it a viable choice only when the table is not searched?
It'd also be great to have some general guidelines on when to use what key type. I've read several guides (including Amazon's own documentation on DynamoDB) but none of them appear to directly address this question.
Thanks
The choice of which key to use comes down to your Use Cases and Data Requirements for a particular scenario. For example, if you are storing User Session Data it might not make much sense using the Range Key since each record could be referenced by a GUID and accessed directly with no grouping requirements. In general terms once you know the Session Id you just get the specific item querying by the key. Another example could be storing User Account or Profile data, each user has his own and you most likely will access it directly (by User Id or something else).
However, if you are storing Order Items then the Range Key makes much more sense since you probably want to retrieve the items grouped by their Order.
In terms of the Data Model, the Hash Key allows you to uniquely identify a record from your table, and the Range Key can be optionally used to group and sort several records that are usually retrieved together. Example: If you are defining an Aggregate to store Order Items, the Order Id could be your Hash Key, and the OrderItemId the Range Key. Whenever you would like to search the Order Items from a particular Order, you just query by the Hash Key (Order Id), and you will get all your order items.
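A sketch of those two access patterns with boto3 (the table and attribute names follow the examples above and are only illustrative):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")

    # Hash key only: a session record is fetched directly by its id,
    # with no grouping requirement.
    sessions = dynamodb.Table("Sessions")
    session = sessions.get_item(Key={"SessionId": "b6f1c2"})["Item"]

    # Hash + range key: every item belonging to one order, grouped
    # under the OrderId hash key.
    order_items = dynamodb.Table("OrderItems")
    items = order_items.query(
        KeyConditionExpression=Key("OrderId").eq("order-1001")
    )["Items"]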
You can find below a formal definition for the use of these two keys:
"Composite Hash Key with Range Key allows the developer to create a
primary key that is the composite of two attributes, a 'hash
attribute' and a 'range attribute.' When querying against a composite
key, the hash attribute needs to be uniquely matched but a range
operation can be specified for the range attribute: e.g. all orders
from Werner in the past 24 hours, or all games played by an individual
player in the past 24 hours." [VOGELS]
So the Range Key adds a grouping capability to the Data Model; however, the use of these two keys also has an implication on the Storage Model:
"Dynamo uses consistent hashing to partition its key space across its
replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution assuming
the access distribution of keys is not highly skewed."
[DDB-SOSP2007]
Not only does the Hash Key allow you to uniquely identify a record, it is also the mechanism that ensures load distribution. The Range Key (when used) helps indicate the records that will mostly be retrieved together; therefore, the storage can also be optimized for that need.
Choosing the correct keys to represent your data is one of the most critical aspects of your design process, and it directly impacts how well your application will perform and scale, and how much it will cost.
Footnotes:
The Data Model is the model through which we perceive and manipulate our data. It describes how we interact with the data in the database [FOWLER]. In other words, it is how you abstract your data model, the way you group your entities, the attributes that you choose as primary keys, etc
The Storage Model describes how the database stores and manipulates the data internally [FOWLER]. Although you cannot control this directly, you can certainly optimize how the data is retrieved or written by knowing how the database works internally.
I have a working SQLite database that holds information about video files. The current design is as pictured below. However, the boss has decided to make some changes.
The FileProperties table currently uses the file name as the primary key. However, the PK now must be a compound key of both fileName and (file) location, which makes more sense anyway.
If this is done, what would be the best way to reference this compound key as a foreign key in the other tables? I was thinking of either creating a separate table that holds an auto-incrementing primary key, fileName and location. Then the PK can be used as a foreign key reference with all the other tables.
Or, make fileName and location a composite key in the current FileProperties table and add a new field that can be used as a reference and this field must be auto-incrementing and unique in the table.
I haven't had much practical experience with designing databases so any advice with my problem or my current design would be very welcome.
Absolutely use an auto-incrementing primary key. To ensure data integrity, create a unique index across the (filename,location) columns.
The following wiki article talks briefly about the pros and cons of a natural key. A natural key is a key taken directly from the data. In your case, that would be the composite key of (filename,location). In short, a natural key reduces physical space required by the data, at the cost of propagating changes to the key across all relations.
I (nearly) always have an auto-incrementing id on a table, even if there is a natural key available to be used.
Add auto-incremented FileId primary key.
Add unique constraint for Location + FileName.
Avoid using compound primary keys.
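A sketch of that layout for the SQLite case, using Python's sqlite3 module (the Tag table is just a hypothetical example of a referencing table):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript("""
        CREATE TABLE FileProperties (
            FileId    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
            fileName  TEXT NOT NULL,
            location  TEXT NOT NULL,
            UNIQUE (fileName, location)                   -- data integrity
        );
        -- Other tables reference the narrow single-column surrogate key.
        CREATE TABLE Tag (
            TagId   INTEGER PRIMARY KEY AUTOINCREMENT,
            FileId  INTEGER NOT NULL REFERENCES FileProperties (FileId),
            tag     TEXT NOT NULL
        );
    """)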
In most database designs it is habitual to use an integer type for the primary key, the unique identifier in a table. Why not use a string or a float for primary keys? Does this affect the accessibility of values or, in plain words, retrieval speed? Are there any specific reasons?
An integer will use less disk space than a string, thus giving you a smaller index file to search through. This is important for large tables where you want to have as much of the index as possible cached in RAM.
Also, they can be autoincremented so you don't need to write your own routines to generate keys.
You often want to have a technical key (also called a surrogate key), a key that is only used to identify the row and not used for anything else. Most data may change sooner or later for reasons you can't control and you don't want to update it everywhere. Even such seemingly static data as a nation-assigned personal id number can change (if you get a new identity) or there may be laws prohibiting their use. A key generated by you, however, is in your own control. For such surrogate keys it's useful to have a small key that is easily generated.
As for "floats as primary keys": Don't do this. A primary key should uniquely identify a row. Floats have no equality relation, which means you cannot safely compare two float values for equality. This is an inherent shortcoming of floating-point values. If you need decimals, use a fixed-point number type instead.
The primary key is supposed to be an index that can provide a unique way to access a specific row in a table. Primary keys can be most data types (in practical applications, float/double won't work too well), and primary keys can also be compound keys (comprised of several columns.)
If you carefully examine the data in the table, you might be able to find a data item that will be unique for every row in the table, thereby eliminating the requirement that you fabricate a key like the autoincrement integer that you find in some schemas.
If you're in a manufacturing environment it might be an alphanumeric field like part number or assembly identifier. Retail or warehousing applications might have a stock number or combination of stock number/shipment/manufacturer.
Generally, if some data in your table is supposed to be a unique identifier, it will probably serve well as a primary key for your table.
Using data that exists in the table already completely eliminates the requirement to "make up" a value (such as the autoincrement column) and use it as the primary key. This saves space since it's one less column in the table and one less index on the table.
Yes, in my experience integer keys are almost always faster, since it's more efficient for the database engine to compare integers than strings. Depending on the "uniqueness" of the data (technically called cardinality: http://en.wikipedia.org/wiki/Cardinality_(SQL_statements)), the effect of character vs. integer keys may be nominal.
Character keys may degrade performance depending on the number of characters the database needs to compare to determine whether keys are equal or not equal. In the pathological case, imagine hundred-character keys which differ only at the right-hand end: one row has 100 A's, and we need to compare it to a key with 99 A's and a B as the last character. Conceptually, databases compare character fields just like strcmp() (strncmp() if you prefer), from left to right.
good luck!
The only reason is for performance.
A logical database design should specify which "real" columns are unique, but when the logical design is transformed into a physical design, it is traditional to not use any of these "natural" keys as the primary key; instead, a meaningless integer column is added for this purpose - called a "surrogate key".
Normally the designer will add further unique constraints for the "real" uniqueness business rules as specified in the logical design.
This is because most DBMS's have trouble updating a primary key (e.g. due to performance issues when cascading the update to child tables). Some DBMS's might not be able to support non-integer primary keys at all.
Some side notes:
There's no theoretical reason why primary keys should be immutable. This is nothing to do with normalization, which happens in the logical model (which should never have surrogate keys).
Also, note that the idea of a "primary" key is not a relational concept - it is simply a way of denoting the "preferred" uniqueness constraint, perhaps for relational integrity - but there's nothing in the RM that says you must use the same key for each child table.
I've created natural keys as "Primary Keys" in Oracle databases before, albeit rarely. I've even had them used for foreign key constraints. Admittedly, they were either immutable, or I hand-wrote the update-cascade code; and I had trouble with one front-end application where the PK included a date column.
Bottom line: there is no theoretical requirement for surrogate keys, but they're much more practical than the alternative.
I suspect that it is because we can auto-increment integer values so it's easy to generate a new unique key for every insert.
Many common ORM (Object Relational Mapping) tools either force you to use, or at least recommend using, an integer as the primary key.
An integer primary key also saves space compared to a string, and in some cases it is faster as well. Sequences or auto-increment fields make integer primary key generation easy, at least if you do not work with distributed databases.
These are some of the main reasons why I think we have integers/numbers as primary keys.
1. Primary keys should be able to uniquely define your row and should be immutable. One of the problems with using real attributes (name etc.) is that they could change over time. Maintaining relational integrity in such a case would be very difficult, as the change needs to cascade to all the child records.
2. The size of the table, and thereby the index, would be smaller if we use a number as the key for the table.
3. Since these are automatically generated using a sequence, we can be sure that the values will be unique under all circumstances.
Check this.
http://forums.oracle.com/forums/thread.jspa?messageID=3916511
What are the downsides of using a composite/compound primary key?
Could cause more problems for normalisation (2NF, "Note that when a 1NF table has no composite candidate keys (candidate keys consisting of more than one attribute), the table is automatically in 2NF")
More unnecessary data duplication. If your composite key consists of 3 columns, you will need to create the same 3 columns in every table, where it is used as a foreign key.
Generally avoidable with the help of surrogate keys (read about their advantages and disadvantages)
I can imagine a good scenario for composite key -- in a table representing a N:N relation, like Students - Classes, and the key in the intermediate table will be (StudentID, ClassID). But if you need to store more information about each pair (like a history of all marks of a student in a class) then you'll probably introduce a surrogate key.
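A sketch of that Students-Classes case in SQLite syntax via Python's sqlite3 module (table and column names are illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- N:N link table: the natural composite key is enough on its own.
        CREATE TABLE Enrollment (
            StudentID INTEGER NOT NULL,
            ClassID   INTEGER NOT NULL,
            PRIMARY KEY (StudentID, ClassID)
        );
    """)
    # If each (StudentID, ClassID) pair later needs child rows of its own
    # (e.g. a history of marks), adding a surrogate EnrollmentId and
    # referencing that instead keeps the child table's foreign keys narrow.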
There's nothing wrong with having a compound key per se, but a primary key should ideally be as small as possible (in terms of number of bytes required). If the primary key is long then this will cause non-clustered indexes to be bloated.
Bear in mind that the order of the columns in the primary key is important. The first column should be as selective as possible i.e. as 'unique' as possible. Searches on the first column will be able to seek, but searches just on the second column will have to scan, unless there is also a non-clustered index on the second column.
I think this is a specialisation of the synthetic key debate (whether to use meaningful keys or an arbitrary synthetic primary key). I come down almost completely on the synthetic key side of this debate for a number of reasons. These are a few of the more pertinent ones:
You have to keep dependent child tables on the end of a foreign key up to date. If you change the value of one of the primary key fields (which can happen - see below) you have to somehow change all of the dependent tables where their PK value includes these fields. This is a bit tricky because changing key values will invalidate FK relationships with child tables, so you may (depending on the constraint validation options available on your platform) have to resort to tricks like copying the record to a new one and deleting the old records.
On a deep schema the keys can get quite wide - I've seen 8 columns once.
Changes in primary key values can be troublesome to identify in ETL processes loading off the system. The example I once had occasion to see was an MIS application extracting from an insurance underwriting system. On some occasions a policy entry would be re-used by the customer, changing the policy identifier. This was part of the primary key of the table. When this happens the warehouse load is not aware of what the old value was, so it cannot match the new data to it. The developer had to go searching through audit logs to identify the changed value.
Most of the issues with non-synthetic primary keys revolve around what happens when PK values of records change. The most useful applications of non-synthetic keys are where a database schema is intended to be queried directly, such as an M.I.S. application where report writers use the tables themselves. In that case, short values with fixed domains such as currency codes or dates might reasonably be placed directly on the table for convenience.
I would recommend a generated primary key in those cases with a unique not null constraint on the natural composite key.
If you use the natural key as the primary key, then you will most likely have to reference both values in foreign key references to make sure you are identifying the correct record.
Take the example of a table with two candidate keys: one simple (single-column) and one compound (multi-column). Your question in that context seems to be, "What disadvantage may I suffer if I choose to promote one key to be 'primary' and I choose the compound key?"
First, consider whether you actually need to promote a key at all: "the very existence of the PRIMARY KEY in SQL seems to be an historical accident of some kind. According to author Chris Date the earliest incarnations of SQL didn't have any key constraints and PRIMARY KEY was only later added to the SQL standards. The designers of the standard obviously took the term from E.F.Codd who invented it, even though Codd's original notion had been abandoned by that time! (Codd originally proposed that foreign keys must only reference one key - the primary key - but that idea was forgotten and ignored because it was widely recognised as a pointless limitation)." [source: David Portas' Blog: Down with Primary Keys?]
Second, what criteria would you apply to choose which key in a table should be 'primary'?
In SQL, the choice of which key is the PRIMARY KEY is arbitrary and product specific. In ACE/Jet (a.k.a. MS Access) the two main and often competing factors are whether you want to use PRIMARY KEY to favour clustering on disk, or whether you want the columns comprising the key to appear in bold in the 'Relationships' picture in the MS Access user interface; I'm in the minority in thinking that index strategy trumps a pretty picture :) In SQL Server, you can specify the clustered index independently of the PRIMARY KEY, and there seems to be no product-specific advantage afforded. The only remaining advantage seems to be the fact that you can omit the columns of the PRIMARY KEY when creating a foreign key in SQL DDL, which is SQL-92 Standard behaviour and anyhow doesn't seem like such a big deal to me (perhaps another one of the things they added to the Standard because it was a feature already widespread in SQL products?). So, it's not a case of looking for drawbacks; rather, you should be looking to see what advantage, if any, your SQL product gives the PRIMARY KEY. Put another way, the only drawback to choosing the wrong key is that you may be missing out on a given advantage.
Third, are you rather alluding to using an artificial/synthetic/surrogate key to implement in your physical model a candidate key from your logical model because you are concerned there will be performance penalties if you use the natural key in foreign keys and table joins? That's an entirely different question and largely depends on your 'religious' stance on the issue of natural keys in SQL.
Need more specificity.
Taken too far, it can overcomplicate inserts (every part of the key must exist) and documentation, and your joined reads could be suspect if the join is incomplete.
Sometimes it can indicate a flawed data model (is a composite key REALLY what's described by the data?)
I don't believe there is a performance cost...it just can go really wrong really easily.
When you see it on a diagram, it is less readable.
When you use it in a query join, it is less readable.
When you use it as a foreign key, you have to add a check constraint that the attributes are either all null or all not null (if only one of them is null, the key is not checked).
It usually needs more storage when used as a foreign key.
Some tools don't handle composite keys.
The main downside of using a compound primary key, is that you will confuse the hell out of typical ORM code generators.