Primary Key Type Guid or Int? [closed] - sql

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I am wondering what is the recommended type for PK in sql server? I remember reading a long time ago this article but now I am wondering if it is still a wise decision to use GUID still.
One reason that got me thinking about it is, these days many sites use the id in the url for instance Course/1 would get the information about that record.
You can't really do that with a guid, which would mean you would need some new column that would be unique and use that, what is more work as you got to make sure each record has a unique number.

There is never a "one solution fits all". You have to carefully design your architecture and select the best options for your scenario. Both INT and GUID types are valid options like they've always been.
You can absolutely use GUID in a URL. In fact, in most scenarios, it is better to use a GUID (or another random ID) in the URL than a sequential numeric ID for security reason. If you use sequential ID, your site visitors will be able to easily guess other users' IDs and potentially access their contents. For example, if my profile URL is /Profiles/111, I can try Profile/112 and see if I can access it. If my reservation URL is Reservation/444, I can try Reservation/441 and see what happens. I can easily guess other IDs in the system. Of course, you must have strong permissions, so I should not be able to see those other pages that don't belong to my account, but if there is any issues or holes in your permissions and security, a breach can happen. While with GUID and other random IDs, there is no way to guess other IDs in the system, so such a breach is much more difficult.
Another issue with sequential IDs is that your users can guess how many accounts or records you have and their order in your database. If my ID is 50269, I know that you must have almost this number of records. If my Id is 4, then I know that you had a very few accounts when I registered. For that reason, many developers start the first ID at some random high number like 1529 instead of 1. It doesn't solve the issue entirely, but it avoid the issues with small IDs. How important all that guessing is depends on the system, so you have to evaluate your scenario carefully.
That's on the top of the benefits mentioned in the article that you mentioned in your question. But still, an integer is better in some areas, so choose the best option for your scenario.
EDIT To answer the point that you raised in your comment about user-friendly URLs. In those scenarios, sequential numbers is the wrong answer. A better solution is a unique string in the URL which is linked to your numeric ID. For example, the Cars movie has this URL on IMDB:
https://www.imdb.com/title/tt0317219/
Now, compare that to the URL of the same movie on Wikipedia, Rotten Tomatoes, Plugged In, or Facebook:
https://en.wikipedia.org/wiki/Cars_(film)
https://www.rottentomatoes.com/m/cars/
https://www.pluggedin.ca/movie-reviews/cars/
https://www.facebook.com/PixarCars
We must agree that those URLs are much friendlier than the one from IMDB.

I've worked on small, medium, and large scale implementations(100k+ users) with SQL and Oracle. The major of the time PK type of INT is used when needed. The GUID was more popular 10-15 years ago, but even at its height was not as populate as the INT. Unless you see a need for it I would recommend INT.

My experience has been that the only time a GUID is needed is if your data is on the move or merged with other databases. For example, say you have three sites running the same application and you merge those three systems for reporting purposes.
If your data is stationary or running a single instance, int should be sufficient.

According to the article you mention:
GUIDs are unique across every table, every database, every server
Well... this is a great promise, but fails to deliver. GUID are supposed to be unique snowflakes. However, reality is much more complicated than that, and there are numerous reasons why they end up not being unique.
One of the main reasons is not related to the UUID/GUID specification, but by poor implementations of it. For example some Javascript implementations rank as the worst ones, using pseudo random numbers that are quite predictable. Other implementations are much more decent.
So, bottom line, study the specific implementation of UUID/GUID you are and will be using. Don't just read and trust the specification. Otherwise you may be up for a surprise, when you get called at 3 am on a Saturday night by angry customers.

Related

Resource modelling in a REST API ( problems with timeseries data& multiple identifiers) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I have some trouble modeling the resources in a domain to fit with a REST API. The example is obviously contrived and simplified, but it illustrates 2 points where I'm stuck.
I know that:
a user has pets
a pet has multiple names - one by each member of the family
a pet has: a date of birth, a date of death and a type (dog,cat...)
I need to be able to query based on dates (actually the date, or range of dates is mandatory when asking about pets). E.g.: tell me what pets I have now; tell me what pets grandma says we had 5 years ago until 3 years ago.
How should I handle dates?
a. in the query string: /pets/dogs/d123?from=10102010&to=10102015 (but as I understand, query string is mostly for optional parameters; and date/range of dates is needed. I was thinking of having the current date as default, if there's nothing in the query string. Any thoughts on this?)
b. somewhere in the path. Before /pets? This seems a bit weird when I change between a date and a range of dates. And my real path is already kind of long
How should I handle multiple names?
The way I see it, I must specify who uses the name I'm searching for.
/pets/dogs/rex -> I want to know about the dog called rex (by whom, me or grandma?). But where to put grandma?
I've seen some people say not to worry about the url, and use hypermedia But the way I understood that(and it's possible I just got it wrong) is that you have to always start from the root (here /pets )and follow the links provided in the response. And then I'm even more stuck(since the date makes for a really really long list of possibilities).
Any help is appreciated. Thanks
What might be useful in such scenarios is a kind of resource query language. It don't know the technology stack that you use, but a JavaScript example can be found here.
Absolutely do not put any dates in the path. This is considered as a bad style and users may be confused since, they most of them may be not used to such strange design and simply will not know how to use the API. Passing dates via query string is perfectly fine. You can introduce a default state - which is not a bad idea - but you need to describe the state (e.g. include dates) in the response. You can also return 400 Bad Request status code when dates range is missing in request. Personally, I'd go for default state and dates via query string.
In a such situation the only thing that comes to my mind is to reverse the relation, so it would be:
/users/grandma/dogs/rex
or:
/dogs/rex/owners/grandma
What can be done also is to abandon REST rules and introduce new endpoint /dogs/filter which will accept POST request with filter in the body. This way it will be much easier to describe the whole query as well to send it. As I mentioned this is not RESTful approach, however it seems reasonable in this situation. Such filtering can be also modeled with pure REST design - filter will become a resource as well.
Hypermedia seems not the way to go in this particular scenario - and to be honest I don't like hypermedia design very much.
You can use the query string if you want, there is no restriction about that. The path contains the hierarchical, while the query contains the non-hierarchical part, but that is not mandatory either.
By the queries I suggest you to think about the parameters and about what will be in the response. For example:
I want to know about the dog called rex (by whom, me or grandma?)
The params are: rex and grandma and you need dogs in the response.
So the hyperlink will be something like GET /pets/dogs/?owner=grandma&name=rex or GET /pets/dogs/owner:grandma/name:rex/, etc... The URI structure does not really matter if you attach some RDF metadata to the hyperlink and the params e.g. you can use the https://schema.org/AnimalShelter vocab. Ofc. this is not the best fit, because it does not concern about multiple names given by multiple persons, but it is a good start to create your own vocab if you decide to use RDF.

Performance gains vs Normalizing your tables? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Ok ok I know you probably all going to kill me for asking this, however I got into an friendly programmer argument with a co-worker about one of our database tables and he asked a question which I know the answer to but I couldn't explain it is the better way.
I will simplify the situation for the simplicity of the question, We have a fairly large table of people / users. Now amongst other data being stored the data in question is as follows: we have a simNumber, cellNumber and the ipAddress of that sim.
Now I am saying that we should make a table lets call it SimTable and put those 3 entries in the sim table, and then put a FK in the UsersTable linking the two. Why? Because that's what I have always been taught NORMALISE your tables!!! Ok so all is good in that regard.
But now my friend says to me yes, but now when you want to query a users phone number, SQL now has to go and:
search for the user
search for the sim fk
search for the correct sim row in the sim database
get the phone number
Now when I go and request 10000 users phone numbers, the number of operations done seriously grows in size.
Vs the other approach
search for the user
find the phone number
Now the argument is purely performance based. As much as I understand why we do normalize the data (to remove redundant data, maintainability, make changes to data in one table which propagate up etc.. ) It does appear to me that the approach with the data in one table will be faster or will at least less tasks/ operations to give me the data I want?
So what is the case in this situation? I do hope that I have not asked anything insanely silly , it is early in the morning so do forgive me if im not thinking clearly
The technology involved in MS SQL server 2012
[EDIT]
This article below also touches on some pf the concepts I have mentioned above
http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
The goal of normalization is not performance. The goal is to model your data correctly with minimum redundancy so you avoid data anomalies.
Say for example two users share the same phone. If you store the phones in the user table, you'd have sim number, IP address, and cell number stored one each user's row.
Then you change the IP address on one row but not the other. How can one sim number have two IP addresses? Is that even valid? Which one is correct? How would you fix such discrepancies? How would you even detect them?
There are times when denormalization is worthwhile, if you really need to optimize data access for one query that you run very frequently. But denormalization comes at a cost, so be prepared to commit yourself to a lot more manual work to take responsibility for data integrity. More code, more testing, more cleanup tasks. Do those count when considering "performance" of the project overall?
Re comments:
I agree with #JoelBrown, as soon as you implement your first case of denormalization, you compromise on data integrity.
I'll expand on what Joel mentions as "well-considered." Denormalization benefits specific queries. So you need to know which queries you have in your app, and which ones you need to optimize for. Do this conservatively, because while denormalization can help a specific query, it harms performance for all other uses of the same data. So you need to know whether you need to query the data in different ways.
Example: suppose you are designing a database for StackOverflow, and you want to support tags for questions. Each question can have a number of tags, and each tag can apply to many questions. The normalized way to design this is to create a third table, pairing questions with tags. That's the physical data model for a many-to-many relationship:
Questions ----<- QuestionsTagged ->---- Tags
But you figure you don't want to do the join to get tags for a given question, so you put tags into a comma-separated string in the questions table. This makes it quicker to query a given question and its associated tags.
But what if you also want to query for one specific tag and find its related questions? If you use the normalized design, it's simply a query against the many-to-many table, but on the tag column.
But if you denormalize by storing tags as a comma-separated list in the Questions table, you'd have to search for tags as substrings within that comma-separated list. Searching for substrings can't be indexed with a standard B-tree style index, and therefore searching for related questions becomes a costly table-scan. It's also more complex and inefficient to insert and delete a tag, or to apply constraints like uniqueness or foreign keys.
That's what I mean by denormalization making an improvement for one type of query at the expense of other uses of the data. That's why it's a good idea to start out with everything in normal form, and then refactor to denormalized designs later on a case by case basis as your bottlenecks reveal themselves.
This goes back to old wisdom:
"Premature optimization is the root of all evil" -- Donald Knuth
In other words, don't denormalize until you can demonstrate during load testing that (a) it makes a real improvement to performance that justifies the loss of data integrity, and (b) it does not degrade performance of other cases unacceptably.
It sounds like you already understand the benefits of normalisation, so I won't cover these.
There are a couple of considerations here:
1. Does a user always have one and only phone number?
If so, then it is still normalised to add these to the user table. However, if the user can have either no phone number or multiple phone numbers, then the phone details should be held in a seperate table.
Assuming you have these in seperate tables, but after conducting performance tests you found that joining on these 2 tables was having a significant effect on performance, then you may choose to deliberately denormalise the tables for performance gains.
Others have already provided some good points and you may also want to take a look at this.
I'd just like to mention one more aspect that is often overlooked: I/O tends to be the greatest component of the cost of most queries, and denormalization generally increases the storage size of data, therefore making the DBMS cache "smaller".
If your normalized database fits into cache and denormalized doesn't, you may actually observe a performance decrease for the latter.
And you won't be able to spot that in development, unless you actually have the amount of data that is similar to production. This is one of many reasons why you should never, ever denormalize without solid measurements (on representative amounts of data) to justify it.

Database Design For Users\Downloads [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I need to design a database for something like a downloads site . I want to keep track of users , the programs each users downloaded and also allow users to rate+comment said programs.The things I need from this database - get average rating for a program , get all comments for a program , know exactly what program was downloaded by whom(I dont care how many times each program was downloaded but I want to know for each users what programs he has downloaded),maybe also count number of comments for each program and thats about it(it's a very small project for personal use that I want to keep simple)
I come up with these entities -
User(uid,uname etc)
Program(pid,pname)
And the following relationships-
UserDownloadedProgram(uid,pid,timestamp)
UserCommentedOnProgram(uid,pid,commentText,timestamp)
UserRatedProgram(uid,pid,rating)
Why I chose it this way - the relationships (user downloads , user comments and rates) are many to many . A user downloads many programs and a program is downloaded by many users. Same goes for the comments (A user comments on many programs and a program is commented or rated by many users). The best practice as far as I know is to create a third table which is one to many (a relationship table).
. I suppose that in this design the average rating and comment retrieval is done by join queries or something similar.
I'm a total noob in database design but I try to adhere to best practices , is this design more or less ok or am I overlooking something ?
I can definitely think of other possibilities - maybe comment and\or rating can be an entity(table) by itself and the relationships are between 3 entities. I'm not really sure what the benefits\drawbacks of that are: I know that I don't really care about the comments or the ratings , I only want to display them where appropriate and maintain them(delete when needed) , so how do I know if they better become an entity themselves?
Any thoughts?
You would create new entities as dictated by the rules of normalization. There is no particular reason to make an additional (separate) table for comments because you already have one. Who made the comment and which program the comment applied to are full-fledged attributes of a comment. The foreign keys representing these relationships (which are many-to-one, from the perspective of the comment table) belong right where you've put them.
The tables you've proposed are in third normal form which is acceptable according to best practices. I would add that you seem to be tracking data on a transactional basis (i.e. recording events as and when they occur). That is a good practice too because you can always figure out whatever you want to based on detailed information.
Calculating number of downloads or number of comments is a simple matter of using SQL Aggregate Functions with filters on the foreign key(s) that apply to your query - e.g. where pid=1234 etc.
I would do an entity for Downloads with its own id. You could have download status, you may have multiple download of the same program for one user. you may need to associate your download to an order or something else,..

Would anyone ever recommend storing dates and numbers in the same field? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
As background, I'm one of two developers in my department. I got into computers my freshman year in high school (1986) and have no formal education. I got into MS Access a little bit in 1994 and more seriously beginning in 2003. I'm self-educated, have always tried to learn as much as I can about database design, and while I believe I know a lot I also know I don't know everything.
The other developer in my department, according to his resume, has a degree in computer science and has been doing IT work, including web design and database design, for about 8 years. He was hired into my department last December. I've been very surprised by what I see as a very fundamental lack of knowledge about the basics of database design and SQL and have been trying to figure out if at least part of the problem is I'm expecting too much or maybe don't know as much as I think I do.
Hence my question. Please note we are 100% MS Access, but I believe this question applies to about any SQL database. This developer was tasked to take a spreadsheet and convert it into a database. Part of the spreadsheet involved tracking inventory for batteries. In the spreadsheet, the column titles were Date and Count. But the data in the date column was a mix of dates and batch numbers. So this developer created a table with a numeric field to contain both the batch number and the date and a second boolean field called IsDate to indicate what value was in the field.
I disagree with this approach and would have created two separate fields, a date field for the date and a numeric field for the batch number. When I suggested this approach, he seemed to not only not understand why but also to get a bit angry about having to change his design.
Which approach would you recommend? Also, assuming everyone agrees with my approach - of course you will! ;) - if you had a developer with this supposed level of experience, would you consider him worth keeping and worth investing the time and effort to educate him?
My own rule of thumb here is:
Always keep data in a native datatype.
This helps comparing, sorting, finding and grouping - especially in a database - and makes your storage less prone to query errors. Moreover, you're not required to use another predicate (AND isdate) when accessing the data. Hence, I think your approach is correct.
Your colleague's approach seems not to be a matter of high education, but one of a personal approach. I've seen workers with PhD who could well listen to a well-reasoned argument, and freshmen who made grave mistakes and would not listen to a polite advice.
I'd most definitely store the date and the batch number in different fields of the appropriate type - setting each with the relevant content or as NULL if no value was available. By doing this you'd be able to see what data you actually have available and perform meaningful operations on that data.
In terms of you second question, I guess it would really depend on what the developer in question said when you asked them why they'd chosen the approach they did.
You are right.
Only under severe memory restrictions might (note might) this kind of architecture be acceptable.
As to dealing with him, I would first talk to him and fiugre out why he chose the given approach, this is something that might have been common in Access Databases 10 years ago (but even then there was enough disk and memory space to not have to do these kind of tricks).
His reluctance to talk about his design is a worse indicator of his abilities than the design itself. Even the most misguided design should have been based on a structured approach or idea. In my mind it is not a bad thing to be wrong, it is a bad thing to create random structures. But not knowing your requirements it is hard to suggest whether it is worth keeping him or not.
Is one of you the 'senior' hierarchy wise or are you sharing responsibilities ?
Point out that he is breaking first normal form by doing so. Be able to describe 1NF 2NF and 3NF before trying to impress him with you fancy pants knowledge.

Generate webpages directly from database or cache?

[I'm not asking about the architecture of SO, but it would be helpful to the question.]
On SO, when a user clicks on his/her name and clicks on "responses" they see other users responses to comment threads, questions, and answers in which they have participated. I've had the sneaking suspicion that I've missed certain responses out there, which made me wonder: if you had to build that thing, would you pull everything dynamically from the database every time a user requested it? Or would you modify it when there is new related activity in the application? Or would you build it in a nightly daemon process?
I imagine that the real answer is that it's dynamically constructed every time, but that the tables are denormalized in such a way so as to make the thing less time-consuming. How would you build it?
I'm asking about any platform, of course, not only on .Net.
I would pull it dynamically from the database every time. I think this gives you the best result from a user experience and then I would apply the principal that premature optimization is evil. Later if there were performance issues I would look into caching.
I think doing it as a daemon/push process would actually result in more overall work being done. That is the updates would happen more frequently than the users are requesting the info.
Obviously, when an answer or comment is posted, you'll want to identify the user that should be informed in their responses tab. Then just add a row to a responses table containing the response text, timestamp, and the user to which it belongs. That way you can dynamically generate the tab with a simple
select * from responses where user=<userid> order by time desc limit 30
or something like that.
p.s. Extra credit to anyone that can write a query that will remove old responses - assume that each person should have the last 30 responses in their responses tab.
I expect that userid would be a natural option for the clustered index. If you have an "Active" boolean field then you don't need to worry much about locks; the table could be write-only except to update the (unindexed) Active column. I bet it already works that way, since it appears that everything is recoverable.
Don't need no stinking extra-credit response remover.
I would assume this is denormalized in the database. The Comment table probably has both and answer_id and an answer_uid so the SQL to find comments on you answers just run against the comment table. The same setup would work on the Answer table. Each answer has a question_id and a question_uid.
Having said that, these are probably the same table and you have response_to_id and response_to_uid and that makes lots of code simpler and makes the "recent" tab a single select as well. In fact the difference between the two selects is one uses the uid and one uses the response_to_uid.
I'd say that your UI and your database should both be driven by your Application Domain; so they will reflect each other based on their common provenance there.
Some quick notes to illustrate, using simplified Object Role Modeling as discussed by Fowler et al.
Entities
Users
Questions
Answers
Comments
Entity Roles
(Note: In Object Role Modeling, most Roles are reflexive. Some, e.g. booleans here, are monopolar)
Question has User
Question has QuestionVersions
Question as Answers
Question has Comments
Answer has AnswerVersions
Answer has Comments
Question has User
QuestionVersion has Text
QuestionVersion has Timestamp
QuestionVersion has IsDeleted (could be inferred from nonNULL timestamp eg)
QuestionVersion has DeltedByUser
QuestionVersion has DeletedTimestamp
Answer has User
AnswerVersion has Text
AnswerVersion has Timestamp
AnswerVersion has IsDeleted
AnswerVersion has DeltedByUser
AnswerVersion has DeletedTimestamp
Comment has Text
Comment has User
Comment has Timestamp
Comment IsDeleted (boolean)
(note - no versions on comments)
I think that's the basics. These assertions drive ERDs in ORM. Hopefully it's self-evident how they drive the User Stories as well.
I don't think an implementation of a normalized design like this would require denormalization - especially since I think it's clear (from behavior) that queries => UI displays are cached to be refreshed 1X per minute.