I have a table with id-token as a key-value pair. Sometimes I need to get the id by token (when a user is logging in).
How can I do this?
I can create a separate table (namespace) with a token as key and id as data, but it doesn't seem like a good approach.
I heard about secondary indexes as a solution, but I can't find out how to create one, or what the difference between the two approaches is here.
Which one should I use for my task?
I can create a separate table with a token as key and id as data
This is a way of creating a secondary index, and in most cases it is the better solution: you won't need any other dependency, and you keep full control over the index (data).
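The question doesn't name the database, so here is a minimal sketch of the pattern in generic SQL; table and column names are illustrative. In a relational store the same effect is usually achieved with a single unique index on token, but the two-table version mirrors the key-value approach described above:

-- Base mapping: id -> token
CREATE TABLE user_token (
    id    bigint PRIMARY KEY,
    token varchar(64) NOT NULL
);

-- Hand-rolled "secondary index": token -> id, for login lookups
CREATE TABLE token_to_id (
    token varchar(64) PRIMARY KEY,
    id    bigint NOT NULL
);

-- Both tables must be kept in sync by the application,
-- ideally within one transaction.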
Related
I need to understand how one can search attributes of a DynamoDB table that are part of an array.
So, when denormalising a table (say, a person that has many email addresses), I would create an array in the person table to store the email addresses.
Now, as the email address is not part of the sort key, if I need to search on an email address to find the person record, I need to index the email attribute.
Can I create an index on the email address, which has a 1-to-many relationship with a person record and, as I understand it, is stored as an array in DynamoDB?
Would this secondary index be global or local? Assuming I have billions of person records?
If I could create it as either LSI or GSI, please explain the pros/cons of each.
thank you very much!
It's worth getting the terminology right to start with. DynamoDB's supported data types are:
Scalar - String, Number, Binary, Boolean
Document - List, Map
Sets - String Set, Number Set, Binary Set
I think you are suggesting you have an attribute that contains a list of emails. The attribute might look like this:
Emails: ["one@email.com", "two@email.com", "three@email.com"]
There are a couple of relevant points about key attributes described here. Firstly, keys must be top-level attributes (they can't be nested in JSON documents). Secondly, they must be of scalar types (i.e. String, Number or Binary).
As your list of emails is not a scalar type, you cannot use it in a key or index.
Given this schema, you would have to perform a Scan, setting the FilterExpression on your Emails attribute using the CONTAINS operator.
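For illustration, here is how that scan could look in PartiQL, DynamoDB's SQL-compatible query language (the table name Person is assumed; note this still scans the entire table rather than using an index):

-- Full-table scan; contains() checks list membership for a List attribute
SELECT * FROM "Person"
WHERE contains("Emails", 'one@email.com')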
Stu's answer has some great information in it and he is right: you can't use an array itself as a key.
What you CAN sometimes do is concatenate several variables (or an array) into a single string with a known separator (maybe '_' for example), and then use that string as a Sort Key.
I used this concept to create a composite Sort Key that consisted of multiple ISO 8601 date values (DynamoDB stores dates as ISO 8601 strings in String type attributes). I also used several attributes that were not dates but were integers with a fixed character length.
By using the BETWEEN comparison I am able to individually query each of the variables that are concatenated into the Sort Key, or construct a complex query that matches against all of them as a group.
In other words a data object could use a Sort Key like this:
email@gmail.com_email@msn.com_email@someotherplace.com
Then you could query that (assuming you knew the partition key) with something like this SQL-flavoured pseudocode (DynamoDB itself has no LIKE operator):
SELECT * FROM Users
WHERE User='Bob' AND Emails LIKE '%email@msn.com%'
YOU MUST know the partition key in order to perform a Query no matter what you choose as your Sort Key and no matter how that Sort Key is constructed.
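As a hypothetical example of the BETWEEN technique, a PartiQL query could range-scan one partition on a composite, fixed-width Sort Key; the table, attribute names and values here are assumptions, not from the original answer:

-- String comparison on fixed-width ISO 8601 segments makes
-- prefix ranges behave like date ranges
SELECT * FROM "Events"
WHERE "UserId" = 'Bob'
  AND "DateKey" BETWEEN '2013-01-01' AND '2013-12-31'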
I think the real question you are asking is what should my sort keys and partition keys be? That will depend on exactly which queries you want to make and how frequently each type of query is used.
I have found that I have way more success with DynamoDB if I think about the queries I want to make first, and then go from there.
A word on Secondary Indexes (GSI / LSI)
The issue here is that you still need to 'know' the Partition Key for your secondary data structure. GSI / LSI help you avoid needing to create additional DynamoDB tables for the sole purpose of improving data access.
From Amazon:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html
To me it sounds more like the issue is selecting the Keys.
LSI (Local Secondary Index)
If (for your Query case) you don't know the Partition Key to begin with (as it seems you don't) then a Local Secondary Index won't help — since it has the SAME Partition Key as the base table.
GSI (Global Secondary Index)
A Global Secondary Index could help in that you can have a DIFFERENT Partition Key and Sort Key (presumably a partition key that you could 'know' for this query).
So you could use the Email attribute (perhaps composite) as the Sort Key on your GSI and then something like a service name, or sign-up stage, as your Partition Key. This would let you 'know' what partition that user would be in based on their progress or the service they signed up from (for example).
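A hedged sketch of what querying such a GSI could look like in PartiQL (the index name EmailIndex and the attribute names are illustrative, not from the answer):

-- Query the GSI directly: partition key is the service name,
-- sort key is the (possibly composite) email string
SELECT * FROM "Person"."EmailIndex"
WHERE "SignupService" = 'web'
  AND begins_with("Email", 'email@msn.com')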
One thing to keep in mind: unlike the base table's primary key, GSI / LSI key values do not have to be unique!
I am working on a voting table design using Postgres 9.5 (though the question probably applies to SQL in general). My vote table should look like:
-------------------------
object | user | timestamp
-------------------------
Where object and user are foreign keys to the ids in their respective tables. I have a problem identifying what the primary key should actually be.
I thought at first to make a primary_key(object, user), but since I use Django as a server, multicolumn primary keys simply aren't supported. I am also not sure about the performance, since I may access a row using only one of those two columns (i.e. object or user). The advantage is that this works automatically as a unique key, since the same user shouldn't vote twice for the same object, and I don't need any additional indexes.
The other idea is to introduce an auto-increment/serial id field. I really can't think of any advantage to this approach, especially when the table gets bigger. I would also need to introduce at least a unique_key(object, user), which adds computational complexity and storage. I'm also not sure about the performance when I select using one of the two columns; maybe I'd need two additional indexes on object and user to accelerate those selects, since I need them heavily.
Is there something I am missing here? or is there a better idea?
Django themselves recognise that the "natural primary key" in this case is not supported. So your gut feeling is right, but Django doesn't support it.
https://code.djangoproject.com/wiki/MultipleColumnPrimaryKeys
Relational database designs use a set of columns as the primary key for a table. When this set includes more than one column, it is known as a “composite” or “compound” primary key. (For more on the terminology, here is an article discussing database keys).
Currently Django models only support a single column in this set, denying many designs where the natural primary key of a table is multiple columns. Django currently can't work with these schemas; they must instead introduce a redundant single-column key (a “surrogate” key), forcing applications to make arbitrary and otherwise-unnecessary choices about which key to use for the table in any given instance.
I'm less familiar with Django personally. One option might be to form an extra column as a primary key by concatenating object and user.
Remember that there is nothing special about a primary key. You can always add a UNIQUE KEY on the pair of columns and make them both NOT NULL.
You might find this example useful.
https://thecuriousfrequency.wordpress.com/2014/11/11/make-primary-key-with-two-or-more-field-in-django/
The correct solution would be to have a PRIMARY KEY (object, user) and an additional index on user. The primary key index can also be used for searches on object alone.
From a database point of view, your problem is that you are using inadequate middleware if it does not support composite primary keys.
You'll probably have to introduce an artificial primary key constraint and, in addition, have a unique constraint on (object, user) and an index on user; but your gut feeling that this is not the best solution from a database perspective is absolutely right.
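A minimal Postgres sketch of that recommendation (assuming parent tables object and "user" already exist; names are illustrative):

CREATE TABLE vote (
    object_id integer NOT NULL REFERENCES object,
    user_id   integer NOT NULL REFERENCES "user",
    cast_at   timestamptz NOT NULL DEFAULT now(),
    -- doubles as the "one vote per user per object" rule
    PRIMARY KEY (object_id, user_id)
);

-- The PK index already serves lookups by object_id alone;
-- lookups by user_id alone need this extra index:
CREATE INDEX vote_user_idx ON vote (user_id);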
I want to use an id as the primary key for my table. In each record, I am also storing an id from another source, but these ids are in no way sequential.
Should I add an (auto-incremented) column with a "new" id? It is very important that queries by the id are as fast as possible.
Some info:
The content of my table is only stored temporarily; the table often gets cleared (TRUNCATE) and then filled with new content.
It's SQL Server 2008.
After writing content to the table, I create an index for the id column
Thanks!
As long as you are sure the supplied id's are unique, there's no need to create another (surrogate) id to use as primary key.
Under most circumstances, an index on the existing id should be sufficient. You can make it slightly faster by declaring it as a primary key.
From what you describe a new id is not necessary for performance. If you do add one, the table will be slightly larger, which has a (very small) negative effect on performance.
If the existing id is not numeric (or not an integer), then there might be a small gain from using a more efficient type for the index. But, your best bet is to make the existing id a primary key (although this might affect load performance).
Note: I usually prefer synthetic primary keys, so this answer is very specific to your question.
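A minimal T-SQL sketch of that advice, assuming the supplied id is an integer (names are illustrative):

-- The externally supplied id is unique, so use it directly as the PK;
-- no surrogate identity column is needed
CREATE TABLE dbo.Items (
    ExternalId int NOT NULL PRIMARY KEY,
    Payload    nvarchar(max) NULL
);

-- To keep the bulk load fast, you can instead load into a heap first
-- and add the index afterwards, as the question already describes.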
If you are after speed, I would join the two IDs together (either in the application or in a stored proc) and then put them in one column.
I have a table:
id:int
revision:int
text:ntext
In general I will need to retrieve the latest revision of text for a particular id, or (considerably less frequently) add a new row containing a new revision for a particular id. Bearing this in mind, it would be a good idea to put indexes on the id and revision columns. I don't have a problem with implementing this, but I'm wondering if this is a situation where it would be sensible to use a composite (multi-field) index/key composed of both id and revision, or if there is any other strategy that would be appropriate for my use case?
I don't think the performance difference between a composite index and two separate indexes would be noticeable, but, as usual, I suggest trying both and profiling if the absolute best performance is needed.
You are likely to always be querying on both fields, with a definite id and an unknown revision occasionally (when needing to find the max revision for an id). If your composite index is (id,revision) then this use case is supported by the index. Querying on id alone with no care for revision also works.
If it is ever likely that you will be querying on revision only without regard to id then you will need two separate indexes.
You will also want to analyze the impact that either index has on insert performance. The composite index will cluster on both fields, whereas the two separate indexes will cluster only on id.
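A T-SQL sketch of the composite approach and the typical lookup (table and index names are illustrative):

CREATE INDEX IX_Doc_Id_Revision ON dbo.Doc (id, revision);

-- Latest revision of the text for one id: a single seek on the
-- composite index, reading backwards along revision
SELECT TOP (1) [text]
FROM dbo.Doc
WHERE id = @id
ORDER BY revision DESC;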
It seems in the majority of cases you will be selecting the record based on both id and revision; therefore, for the quickest lookups, you should make (id, revision) your composite primary key.
If id is the primary key, it's already indexed (I don't use SQL Server).
It seems that your revision is unique too.
So I think it would be better to use separate indexes and put a unique constraint on revision (if required).
Via this link, I know that a GUID is not good as a clustered index, but it can be uniquely created anywhere. It is required for some advanced SQL Server features like replication, etc.
Is it considered bad design if I want to have a GUID column as a typical primary key? This assumes a separate int identity column for my clustering ID, which as an added bonus is a "user friendly" id.
update
After viewing your feedback, I realise I didn't really word my question right. I understand that a GUID makes a good (even if it's overkill) PK, but a bad clustering index (in general). My question, more directly asked, is: is it bad to add a second "int identity" column to act as the clustering index?
I was thinking that the GUID would be the PK, and I would use it to build all relationships/joins etc. Then, instead of using a natural key for the clustered index, I would add an additional "ID" that is not data-specific. What I'm wondering is: is that bad?
If you are going to create the identity field anyway, use that as the primary key. Think about querying this data. Ints are faster for joins and much easier to specify when writing queries.
Use the GUID if you must for replication, but don't use it as a primary key.
What are you intending to accomplish with the GUID? The int identity column will also be unique within that table. Do you actually need, or expect to need, the ability to replicate? If so, is using a GUID actually preferable in your architecture to handling identity columns through one of the identity range management options?
If you like the "pretty" ids generated using the Active Record pattern, then I think I'd try to use it instead of GUIDs. If you do need replication, then use one of the replication strategies appropriate for identity columns.
Consider using only GUID, but get your GUIDs using the NEWSEQUENTIALID method (which allocates sequential values and so doesn't have the same clustering performance problems as the NEWID method).
A problem with using a secondary INT key as an index is that, if it's an adequate index, why use a GUID at all? If a GUID is necessary, how can you use an INT index instead? I'm not sure whether you need a GUID, and if so then why: are you doing replication and/or merging between multiple databases? And if you do need a GUID then you haven't specified exactly how you intend to use the non-globally-unique INT index in that scenario.
Sounds like what you are saying is that I have not made a good case for using a GUID at all, and I agree; I know it's overkill, but my question, I guess, would be: is it too much overkill?
I think it's convenient to use a GUID instead of an INT for the primary key, if you have a use case for doing so (e.g. multiple databases) and if you can tolerate the constant-factor loss of performance caused simply by using a bigger (16-byte) key (which results in fewer index entries per page of memory).
The bigger worry is the way in which using a (random) GUID could affect performance when it's used for clustering. To counter-act that:
Either, use something else (e.g. one of the record's natural keys) as the clustered index, even if you still use a GUID for the primary key
Or, let the clustered index be the same field as the GUID primary key, but use NewSequentialId() instead of NewId() to allocate the GUID values.
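A sketch of the second option in T-SQL (names are illustrative; note NEWSEQUENTIALID() may only appear as a column default):

CREATE TABLE dbo.Customer (
    CustomerId uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_Id DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
    Name nvarchar(100) NOT NULL
);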
Is it bad to insert an additional artificial "id" for clustering, since I'm not sure I'll have a good natural ID candidate for clustering?
I don't understand why you wouldn't prefer to instead use just the GUID with NewSequentialId(), which is I think is provided for exactly this reason.
Using a GUID is lazy -- i.e., the DBA can't be bothered to model his data properly. It also offers very bad join performance, typically, being a 16-byte type with poor locality.
Is it a bad design, if I want to have a GUID column as my typical Primary Key, and a separate, int identity column for my clustering ID, and as an added bonus a "user friendly" id?
Yes, it is very bad. Firstly, you don't want more than one "artificial" candidate key for your table. Secondly, if you want a user-friendly id to use as a key, just use a fixed-length type such as char(8) or binary(8) -- preferably binary, as the sort won't use the locale. You could use 16-byte types, but you will notice a deterioration in performance -- though not as bad as with GUIDs. You can use these fixed types to build your own user-friendly allocation scheme that preserves some locality but generates sensible and meaningful ids.
As an Example:
If you are writing some sort of CRM system (let's say online insurance quotes) and you want an extremely user-friendly type, for example an insurance quote reference (QR) that looks like "AD CAR MT 122299432".
In this case -- since the quote reference is huge -- I would create a separate LUT/symbol table to resolve the quote reference to the actual identifier used. But I would divorce this LUT from the rest of the model: I would never use the quote reference anywhere else in the model, especially not in the table representing the QRs.
Create Table QRLut
(
    bigint_id bigint,
    QR char(32)
);
Now if my model has one table that represents the QR and 20 other tables featuring the bigint QR id as a foreign key, the fact that a bigint is used will allow my DB to scale well. The wider the join predicates, the more contention is caused on the memory bus, and the amount of contention on the memory bus determines how well your CPUs can be saturated (with multiple CPUs).
You might think, with this example, that you could just place the user-friendly QR in the table that actually represents the quote; however, keep in mind that SQL Server gathers statistics on tables and indices, and you don't want to let the server make caching decisions based on the user-friendly QR, since it is huge and wasteful.
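To make the divorce concrete, a hedged example of the only place the QR would ever be used: resolving it to the internal id at the application boundary (the value is the illustrative one from above):

-- Resolve the user-facing quote reference once, then use bigint_id
-- everywhere else in the model
SELECT bigint_id
FROM QRLut
WHERE QR = 'AD CAR MT 122299432';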
I think it is bad design to do it that way, but I don't know if it is bad otherwise. Remember, SQL Server automatically assigns the clustered index to the primary key; you would have to remove it after making the GUID the primary key. Also, you usually want your identity column to be your primary key, so doing what you describe would confuse anyone who reads your code without looking closely. I would suggest you make the ID column your primary key and identity column, and put the clustered index on it. Then make your GUID column a unique key, making it a non-clustered index and not allowing nulls. That in effect will do what you want but will follow more of the standard.
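A minimal T-SQL sketch of that arrangement (names are illustrative):

CREATE TABLE dbo.Orders (
    -- identity column: primary key and clustered index
    OrderId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,
    -- GUID: NOT NULL, unique, non-clustered
    OrderGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_Guid DEFAULT NEWID()
        CONSTRAINT UQ_Orders_Guid UNIQUE NONCLUSTERED
);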
Personally, I would go this way:
- An internally known identity field for your PK (one that isn't known to the end-user, because they will inevitably want to control it somehow).
- A user-friendly "ID" that is unique with respect to some business rule (enforced either in your app code or as a constraint).
- A GUID in the future if it's ever deemed necessary (like if it's required for replication).
Now with respect to the clustered index, which you may or may not be confused about, consider this guide from MS for SQL Server 2000.
You are right that GUIDs make good object identifiers, which are implemented in a database as primary keys. Additionally, you are right that primary keys do not need to be the clustered indices.
GUIDs share the same characteristics for clustered indexes as INT IDENTITY columns, provided that the GUIDs are sequential. There is a NewSequentialID specific to SQL Server, but there is also a generic algorithm for creating them called COMB GUID, based on combining the current datetime with random bytes in a way that retains a large degree of randomness while retaining sequentiality.
One thing to keep in mind, if you intend to use NHibernate at some point, is that NHibernate natively knows how to use the COMB GUID strategy - and NHibernate can even use it to do batch-inserts, something that cannot be done with INT IDENTITY or NewSequentialID. If you are inserting multiple objects with NHibernate, then it will be faster to use the COMB GUID strategy than either of the other two methods.
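For reference, the classic COMB construction in T-SQL overwrites the last six bytes of a random GUID with the current datetime, so values generated over time sort roughly sequentially (a sketch of the well-known pattern, not NHibernate's exact code):

-- Keep the first 10 random bytes of NEWID(), append 6 datetime bytes
SELECT CAST(
    CAST(NEWID() AS binary(10)) + CAST(GETDATE() AS binary(6))
    AS uniqueidentifier) AS CombGuid;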
It is not bad design at all. An int identity for your clustering key gives you a number of good benefits (narrow, unique, ascending) whilst keeping the GUID, for functionality purposes, very separate and acting as your primary key.
If anything I would suggest you have the right approach, although the "user friendly" ID is the most questionable part, as in: what purpose is it there to serve?
Addendum: I should put in the obligatory link to (possibly?) the most-read article about the topic, by Kimberly Tripp. http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx