Would anyone ever recommend storing dates and numbers in the same field? [closed] - sql

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
As background, I'm one of two developers in my department. I got into computers my freshman year in high school (1986) and have no formal education. I got into MS Access a little bit in 1994 and more seriously beginning in 2003. I'm self-educated, have always tried to learn as much as I can about database design, and while I believe I know a lot I also know I don't know everything.
The other developer in my department, according to his resume, has a degree in computer science and has been doing IT work, including web design and database design, for about 8 years. He was hired into my department last December. I've been very surprised by what I see as a very fundamental lack of knowledge about the basics of database design and SQL and have been trying to figure out if at least part of the problem is I'm expecting too much or maybe don't know as much as I think I do.
Hence my question. Please note we are 100% MS Access, but I believe this question applies to just about any SQL database. This developer was tasked with taking a spreadsheet and converting it into a database. Part of the spreadsheet involved tracking inventory for batteries. In the spreadsheet, the column titles were Date and Count. But the data in the date column was a mix of dates and batch numbers. So this developer created a table with a numeric field to contain both the batch number and the date, and a second boolean field called IsDate to indicate which kind of value was in the numeric field.
I disagree with this approach and would have created two separate fields, a date field for the date and a numeric field for the batch number. When I suggested this approach, he seemed to not only not understand why but also to get a bit angry about having to change his design.
Which approach would you recommend? Also, assuming everyone agrees with my approach - of course you will! ;) - if you had a developer with this supposed level of experience, would you consider him worth keeping and worth investing the time and effort to educate him?

My own rule of thumb here is:
Always keep data in a native datatype.
This helps with comparing, sorting, finding, and grouping - especially in a database - and makes your storage less prone to query errors. Moreover, you're not required to add another predicate (AND IsDate) every time you access the data. Hence, I think your approach is correct.
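To make the comparison concrete, here is a minimal sketch of the two-field design, using SQLite via Python purely for illustration (the table and column names are hypothetical; the same idea applies in Access or any SQL database). Each value lives in its native type, so date comparisons work directly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE BatteryInventory (
        id         INTEGER PRIMARY KEY,
        entry_date TEXT,     -- ISO date, NULL when the row is a batch entry
        batch_no   INTEGER,  -- NULL when the row is a date entry
        count      INTEGER NOT NULL
    )
""")
con.executemany(
    "INSERT INTO BatteryInventory (entry_date, batch_no, count) VALUES (?, ?, ?)",
    [("2020-01-15", None, 40), (None, 20031, 12), ("2020-02-01", None, 35)],
)
# Date comparisons work natively -- no "AND IsDate" predicate needed,
# and rows holding batch numbers are simply excluded by the NULL.
rows = con.execute(
    "SELECT count FROM BatteryInventory WHERE entry_date >= '2020-02-01'"
).fetchall()
```

With the mixed-field design, the same query would need an extra `AND IsDate = True` everywhere, and the values stored as numbers would never sort or compare as dates.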
Your colleague's approach seems to be less a matter of formal education than of personal attitude. I've seen workers with PhDs who could listen to a well-reasoned argument, and freshmen who made grave mistakes and would not listen to polite advice.

I'd most definitely store the date and the batch number in different fields of the appropriate type - setting each with the relevant content or as NULL if no value was available. By doing this you'd be able to see what data you actually have available and perform meaningful operations on that data.
In terms of your second question, I guess it would really depend on what the developer in question said when you asked them why they'd chosen the approach they did.

You are right.
Only under severe memory restrictions might (note might) this kind of architecture be acceptable.
As to dealing with him, I would first talk to him and figure out why he chose the given approach. This is something that might have been common in Access databases 10 years ago (but even then there was enough disk and memory space not to have to resort to these kinds of tricks).
His reluctance to talk about his design is a worse indicator of his abilities than the design itself. Even the most misguided design should have been based on a structured approach or idea. In my mind it is not a bad thing to be wrong, it is a bad thing to create random structures. But not knowing your requirements it is hard to suggest whether it is worth keeping him or not.
Is one of you the 'senior' hierarchy-wise, or are you sharing responsibilities?

Point out that he is breaking first normal form by doing so. Be able to describe 1NF, 2NF, and 3NF before trying to impress him with your fancy-pants knowledge.


SQL database to track monthly bills (Beginner) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I am interested in learning SQL. I spun up a Server 2019 instance and installed SQL Express on it. I am currently going through a basic SQL course on Codecademy but figured it would be a lot better to practice with something hands-on. I came up with the idea of creating a database to track my monthly bills and finances. Once I have enough data, I can leverage some graphing tools to visually present it.
Has anyone done something similar before? Any suggestions as how I should design the database/tables? Keep in mind that my SQL knowledge is still very limited.
Thanks in advance!
This is a very broad question and nobody can give a definitive answer.
I agree with your approach of trying it yourself in your own situation - that's a definite thumbs-up from me (indeed, when we review new employees, this is one of the best traits to see). People so often learn a great deal by 'getting their hands dirty', so to speak (i.e., going in and trying things themselves).
However, I suggest starting with the examples they provide to get the general concepts down - they usually choose at least decent examples.
Alternatively though, you could give it a shot. Just be prepared to be wrong and start again. But don't worry - in terms of value, having a shot and getting it wrong is worth much more than reading something and only half-understanding it.
If you are familiar with spreadsheets, I suggest
Imagine how you would keep this on spreadsheets e.g., one sheet with bills that are due, and one sheet with your payments
Each one of those sheets would represent a table in your database.
If you pay all your bills with one payment only (e.g., no installments), then it would be easier to do it with one spreadsheet (e.g., just listing all the bills on the left side, and their payment information on the right). In this case it may not be the best case for teaching yourself databases. On the other hand, if you do pay by installments, then this could be useful.
The big difference in approach is that in databases, the rows are not inherently sorted. Instead, you typically give the rows an identifier (e.g., Bill_ID or Payment_ID). The tables are then linked: for a given row in the Payments table, you'd also include the Bill_ID to indicate which bill the payment was for.
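As a rough sketch of that bills/payments layout (the table and column names are my own invention, shown with SQLite via Python for illustration rather than SQL Server syntax):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Bills (
        Bill_ID     INTEGER PRIMARY KEY,
        Description TEXT,
        Amount_Due  REAL
    );
    CREATE TABLE Payments (
        Payment_ID INTEGER PRIMARY KEY,
        Bill_ID    INTEGER REFERENCES Bills(Bill_ID), -- which bill this pays
        Paid_On    TEXT,
        Amount     REAL
    );
""")
con.execute("INSERT INTO Bills VALUES (1, 'Electricity', 120.0)")
# Two installments against the same bill, linked by Bill_ID:
con.execute("INSERT INTO Payments VALUES (1, 1, '2022-03-01', 60.0)")
con.execute("INSERT INTO Payments VALUES (2, 1, '2022-04-01', 60.0)")
# Join the tables to see how much of each bill has been paid.
paid = con.execute("""
    SELECT b.Description, SUM(p.Amount)
    FROM Bills b JOIN Payments p ON p.Bill_ID = b.Bill_ID
    GROUP BY b.Bill_ID
""").fetchall()
```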
Update: More examples
To choose a relevant thing to try on databases, I suggest choosing things that are related to each other, but are separate from each other (e.g., not linked 1-to-1).
In the bills/payments above, if you paid each bill with one payment, they didn't need to be on separate tables. However, you could try other things e.g.,
You live in a sharehouse where people pay for various things in a 'kitty' system (e.g., on each person's payday, they put in the amount they owe). In this case you may have a Bill table (which includes how each bill is split up and when it was paid), and a Person_Payments table which records when people put money into the kitty
You have a family with kids and chores. You have a Kids table (with their name, etc), a Chores table (listing chores and how much they are worth in pocket money) and then a Kids_Chores table listing the Kid_ID, Chore_ID and date. Whenever they do a chore it goes into Kids_Chores and that is used to determine their pocket money.
You play various computer games and you want to track your win rate on them over time. You have one table for Game (with info about your user ID, etc), one for Game_Mode (which indicates, for a given game, what mode you were playing e.g., casual vs league, easy vs hard), then one for Game_Stats recording the date you played, the game and game_mode, and the number of games and number of wins.
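The kids-and-chores idea above is a classic many-to-many shape, so it makes a good first exercise. A minimal sketch, again with hypothetical names and SQLite via Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Kids   (Kid_ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Chores (Chore_ID INTEGER PRIMARY KEY, Name TEXT, Worth REAL);
    CREATE TABLE Kids_Chores (          -- one row per chore actually done
        Kid_ID   INTEGER REFERENCES Kids(Kid_ID),
        Chore_ID INTEGER REFERENCES Chores(Chore_ID),
        Done_On  TEXT
    );
""")
con.execute("INSERT INTO Kids VALUES (1, 'Alex')")
con.executemany("INSERT INTO Chores VALUES (?, ?, ?)",
                [(1, 'Dishes', 2.0), (2, 'Lawn', 5.0)])
con.executemany("INSERT INTO Kids_Chores VALUES (?, ?, ?)",
                [(1, 1, '2022-05-01'), (1, 2, '2022-05-02'), (1, 1, '2022-05-03')])
# Pocket money = sum of the worth of each chore done.
money = con.execute("""
    SELECT k.Name, SUM(c.Worth)
    FROM Kids_Chores kc
    JOIN Kids k   ON k.Kid_ID = kc.Kid_ID
    JOIN Chores c ON c.Chore_ID = kc.Chore_ID
    GROUP BY k.Kid_ID
""").fetchall()
```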

Primary Key Type Guid or Int? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I am wondering what the recommended type for a PK is in SQL Server. I remember reading this article a long time ago, but now I am wondering whether using a GUID is still a wise decision.
One reason that got me thinking about it is that these days many sites use the ID in the URL; for instance, Course/1 would get the information about that record.
You can't really do that with a GUID, which would mean you would need some new column that is unique and use that instead, which is more work, as you have to make sure each record has a unique number.
There is never a "one solution fits all". You have to carefully design your architecture and select the best options for your scenario. Both INT and GUID types are valid options like they've always been.
You can absolutely use a GUID in a URL. In fact, in most scenarios, it is better to use a GUID (or another random ID) in the URL than a sequential numeric ID, for security reasons. If you use sequential IDs, your site visitors will be able to easily guess other users' IDs and potentially access their content. For example, if my profile URL is /Profiles/111, I can try /Profiles/112 and see if I can access it. If my reservation URL is /Reservations/444, I can try /Reservations/441 and see what happens. I can easily guess other IDs in the system. Of course, you must have strong permissions, so I should not be able to see pages that don't belong to my account, but if there are any holes in your permissions and security, a breach can happen. With GUIDs and other random IDs, there is no way to guess other IDs in the system, so such a breach is much more difficult.
Another issue with sequential IDs is that your users can guess how many accounts or records you have and their order in your database. If my ID is 50269, I know that you must have roughly that many records. If my ID is 4, then I know that you had very few accounts when I registered. For that reason, many developers start the first ID at some random high number like 1529 instead of 1. It doesn't solve the issue entirely, but it avoids the problem of small IDs. How important all this guessing is depends on the system, so you have to evaluate your scenario carefully.
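A small Python sketch of the difference: sequential IDs let a visitor enumerate neighbouring records, while a random (version 4) UUID carries about 122 bits of randomness, so two records' identifiers share no guessable relationship (the profile-ID numbers are just the examples from above):

```python
import uuid

# Sequential IDs are trivially guessable: knowing my own ID, I can
# enumerate my neighbours' URLs (/Profiles/110, /Profiles/112, ...).
my_profile_id = 111
guessable_neighbours = [my_profile_id - 1, my_profile_id + 1]

# A version 4 UUID is generated from ~122 random bits, so neighbouring
# records' identifiers have no predictable relationship at all.
a = uuid.uuid4()
b = uuid.uuid4()
```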
That's on top of the benefits mentioned in the article you linked in your question. But still, an integer is better in some areas, so choose the best option for your scenario.
EDIT To answer the point you raised in your comment about user-friendly URLs: in those scenarios, sequential numbers are the wrong answer. A better solution is a unique string in the URL which is linked to your numeric ID. For example, the Cars movie has this URL on IMDB:
https://www.imdb.com/title/tt0317219/
Now, compare that to the URL of the same movie on Wikipedia, Rotten Tomatoes, Plugged In, or Facebook:
https://en.wikipedia.org/wiki/Cars_(film)
https://www.rottentomatoes.com/m/cars/
https://www.pluggedin.ca/movie-reviews/cars/
https://www.facebook.com/PixarCars
We must agree that those URLs are much friendlier than the one from IMDB.
I've worked on small, medium, and large scale implementations (100k+ users) with SQL Server and Oracle. The majority of the time, a PK type of INT is used where needed. The GUID was more popular 10-15 years ago, but even at its height it was not as popular as the INT. Unless you see a need for it, I would recommend INT.
My experience has been that the only time a GUID is needed is if your data is on the move or merged with other databases. For example, say you have three sites running the same application and you merge those three systems for reporting purposes.
If your data is stationary or running a single instance, int should be sufficient.
According to the article you mention:
GUIDs are unique across every table, every database, every server
Well... this is a great promise, but it fails to deliver. GUIDs are supposed to be unique snowflakes. However, reality is much more complicated than that, and there are numerous reasons why they end up not being unique.
One of the main reasons is not the UUID/GUID specification itself, but poor implementations of it. For example, some JavaScript implementations rank among the worst, using pseudo-random numbers that are quite predictable. Other implementations are much more decent.
So, bottom line: study the specific implementation of UUID/GUID you are using or will be using. Don't just read and trust the specification. Otherwise you may be in for a surprise when you get called at 3 am on a Saturday night by angry customers.

Performance gains vs Normalizing your tables? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Ok ok, I know you're probably all going to kill me for asking this; however, I got into a friendly programmer argument with a co-worker about one of our database tables, and he asked a question which I know the answer to, but I couldn't explain why it is the better way.
I will simplify the situation for the sake of the question. We have a fairly large table of people/users. Among other data being stored, the data in question is as follows: we have a simNumber, cellNumber, and the ipAddress of that sim.
Now I am saying that we should make a table, let's call it SimTable, put those 3 entries in it, and then put an FK in the UsersTable linking the two. Why? Because that's what I have always been taught: NORMALISE your tables!!! Ok, so all is good in that regard.
But now my friend says to me yes, but now when you want to query a users phone number, SQL now has to go and:
search for the user
search for the sim fk
search for the correct sim row in the sim database
get the phone number
Now when I go and request 10000 users' phone numbers, the number of operations performed grows seriously.
Vs the other approach
search for the user
find the phone number
Now the argument is purely performance-based. As much as I understand why we normalize data (to remove redundancy, for maintainability, so that changes to data in one table propagate, etc.), it does appear to me that the approach with the data in one table will be faster, or will at least require fewer tasks/operations to give me the data I want.
So what is the case in this situation? I do hope that I have not asked anything insanely silly; it is early in the morning, so do forgive me if I'm not thinking clearly.
The technology involved is MS SQL Server 2012.
[EDIT]
This article below also touches on some of the concepts I have mentioned above:
http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
The goal of normalization is not performance. The goal is to model your data correctly with minimum redundancy so you avoid data anomalies.
Say, for example, two users share the same phone. If you store the phones in the user table, you'd have the sim number, IP address, and cell number stored on each user's row.
Then you change the IP address on one row but not the other. How can one sim number have two IP addresses? Is that even valid? Which one is correct? How would you fix such discrepancies? How would you even detect them?
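A minimal sketch of the normalized layout the question proposes (a SimTable plus an FK from the users table), using SQLite via Python with made-up sample data. Because the sim's details live on exactly one row, fixing the IP address once fixes it for every user of that sim, so the discrepancy described above cannot arise:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SimTable (
        Sim_ID     INTEGER PRIMARY KEY,
        SimNumber  TEXT,
        CellNumber TEXT,
        IpAddress  TEXT
    );
    CREATE TABLE UsersTable (
        User_ID INTEGER PRIMARY KEY,
        Name    TEXT,
        Sim_ID  INTEGER REFERENCES SimTable(Sim_ID)
    );
""")
con.execute("INSERT INTO SimTable VALUES (1, '8944-1234', '555-0100', '10.0.0.7')")
# Two users sharing one phone: the sim's details exist on exactly one row.
con.execute("INSERT INTO UsersTable VALUES (1, 'Ann', 1)")
con.execute("INSERT INTO UsersTable VALUES (2, 'Bob', 1)")
# Changing the IP address once updates it for every user of that sim.
con.execute("UPDATE SimTable SET IpAddress = '10.0.0.8' WHERE Sim_ID = 1")
ips = [r[0] for r in con.execute("""
    SELECT s.IpAddress
    FROM UsersTable u JOIN SimTable s ON s.Sim_ID = u.Sim_ID
    ORDER BY u.User_ID
""")]
```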
There are times when denormalization is worthwhile, if you really need to optimize data access for one query that you run very frequently. But denormalization comes at a cost, so be prepared to commit yourself to a lot more manual work to take responsibility for data integrity. More code, more testing, more cleanup tasks. Do those count when considering "performance" of the project overall?
Re comments:
I agree with #JoelBrown, as soon as you implement your first case of denormalization, you compromise on data integrity.
I'll expand on what Joel mentions as "well-considered." Denormalization benefits specific queries. So you need to know which queries you have in your app, and which ones you need to optimize for. Do this conservatively, because while denormalization can help a specific query, it harms performance for all other uses of the same data. So you need to know whether you need to query the data in different ways.
Example: suppose you are designing a database for StackOverflow, and you want to support tags for questions. Each question can have a number of tags, and each tag can apply to many questions. The normalized way to design this is to create a third table, pairing questions with tags. That's the physical data model for a many-to-many relationship:
Questions ----<- QuestionsTagged ->---- Tags
But you figure you don't want to do the join to get tags for a given question, so you put tags into a comma-separated string in the questions table. This makes it quicker to query a given question and its associated tags.
But what if you also want to query for one specific tag and find its related questions? If you use the normalized design, it's simply a query against the many-to-many table, but on the tag column.
But if you denormalize by storing tags as a comma-separated list in the Questions table, you'd have to search for tags as substrings within that comma-separated list. Searching for substrings can't be indexed with a standard B-tree style index, and therefore searching for related questions becomes a costly table-scan. It's also more complex and inefficient to insert and delete a tag, or to apply constraints like uniqueness or foreign keys.
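A small sketch of the normalized tag design, using SQLite via Python with hypothetical data. Finding all questions for a given tag is an indexable equality join on the many-to-many table, rather than a substring search over a comma-separated list:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Questions (Question_ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Tags      (Tag_ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE QuestionsTagged (   -- many-to-many pairing table
        Question_ID INTEGER REFERENCES Questions(Question_ID),
        Tag_ID      INTEGER REFERENCES Tags(Tag_ID)
    );
""")
con.execute("INSERT INTO Questions VALUES (1, 'Mixed date/number field?')")
con.execute("INSERT INTO Questions VALUES (2, 'GUID or INT primary key?')")
con.executemany("INSERT INTO Tags VALUES (?, ?)",
                [(1, 'sql'), (2, 'database-design')])
con.executemany("INSERT INTO QuestionsTagged VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 1)])
# All questions carrying the 'sql' tag: a plain equality join.
titles = [r[0] for r in con.execute("""
    SELECT q.Title
    FROM QuestionsTagged qt
    JOIN Questions q ON q.Question_ID = qt.Question_ID
    JOIN Tags t      ON t.Tag_ID = qt.Tag_ID
    WHERE t.Name = 'sql'
    ORDER BY q.Question_ID
""")]
```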
That's what I mean by denormalization making an improvement for one type of query at the expense of other uses of the data. That's why it's a good idea to start out with everything in normal form, and then refactor to denormalized designs later on a case by case basis as your bottlenecks reveal themselves.
This goes back to old wisdom:
"Premature optimization is the root of all evil" -- Donald Knuth
In other words, don't denormalize until you can demonstrate during load testing that (a) it makes a real improvement to performance that justifies the loss of data integrity, and (b) it does not degrade performance of other cases unacceptably.
It sounds like you already understand the benefits of normalisation, so I won't cover them.
There are a couple of considerations here:
1. Does a user always have one and only one phone number?
If so, then it is still normalised to add these columns to the user table. However, if a user can have either no phone number or multiple phone numbers, then the phone details should be held in a separate table.
Assuming you do have these in separate tables, but after conducting performance tests you find that joining the 2 tables has a significant effect on performance, then you may choose to deliberately denormalise the tables for performance gains.
Others have already provided some good points and you may also want to take a look at this.
I'd just like to mention one more aspect that is often overlooked: I/O tends to be the greatest component of the cost of most queries, and denormalization generally increases the storage size of data, therefore making the DBMS cache "smaller".
If your normalized database fits into cache and denormalized doesn't, you may actually observe a performance decrease for the latter.
And you won't be able to spot that in development, unless you actually have the amount of data that is similar to production. This is one of many reasons why you should never, ever denormalize without solid measurements (on representative amounts of data) to justify it.

How to deal with stupidly designed databases? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
So we just started building a web application for company X. The application has to calculate a lot of information, like the work each worker has done, how long they worked, how long a device worked, device speed, device quality, parts quality, uptime, downtime, running time, waste, etc. The problem is that the database is stupidly designed: there are no IDs (I'm joining on multiple columns, but it's so slow), there are a lot of calculations inside view tables (I am going to have nightmares about this), and the database has a lot, and I mean a lot, of tables with millions of records. So my question is: how should I approach this situation? Try to get a grip on the database and do my job, even if it takes half a year to make everything work? Or maybe they should hire a database designer and change the whole system... (but I guess they won't, even if I ask). Is there software I could use to quickly get a grip on a database? They are using Microsoft SQL Server 2012.
P.S. Don't judge my English writing skills; I don't compile it very often.
EDIT:
1. There is no integrity between some tables, so I have to work my way around that. Also, the server is always busy and crashes from time to time. Sometimes it takes 20 min to get 1000 rows from a view table. 2. Some expensive query is executed every time I query something.
EDIT:
There is a lot of data repeated in different tables.
EDIT:
Is there a way to make the database more efficient?
Let's walk through each point here:
no IDs(I joining it on multiple columns, but it's so slow)
Do you actually mean you have no referential integrity between tables and there are no columns that would form a primary key? If that is what you mean, then yes, I agree a non-normalized table is quite bad. However, if there is referential integrity (which I would presume there is), this is not an issue. You proceed to say it is slow; define slow. If it takes 10 seconds to query over 2 trillion records, I would hardly call that slow. If, however, it takes 10 seconds to query over 5 rows, then yes, that is slow.
a lot of calculations inside view tables
Now, is this a materialized view? Meaning that the calculation is only executed once and the table is built off of that expensive query? Or do you mean some expensive query is executed every time the view is targeted? In the latter case that is bad; in the former that is correct.
database have a lot of and I mean a lot of tables with millions of records
And your point is? Millions of records in 2013 are not that many. Further, if you are melting down over millions of records, it may be time to hang it up. There will only be more data, barring some insane magnetic storm that destroys all technology as we know it.
So my question is how to approach this situation?
Learn set theory and relational design.
You need to understand that changing the database is not trivial. What you need to do is understand this database structure well. Chances are you are not happy with it because you don't know it well. If you get to understand it, you can design views and canned queries for common every day tasks. Once you are comfortable with the database, you could then begin to make a list of what is wrong with the current design and what the business needs are. May be then you could draft a version 1.0 ERD and estimate the cost of building the new system based on business needs and your expertise in the current system.
Actually, contrary to popular belief, missing artificial keys do not automatically make a database "stupidly designed".
So yes, you should try to get a grip on the database and try to do your job. Even if it takes you half a year to make everything work, it will probably still be cheaper than adapting the application that generates the data.
Whether your system can be improved by modifying the database can only be determined with an analysis by an expert. It is out of scope for this site.
Make sure that the DB structure is really as bad as you think. Perhaps there is some logic to the design you have missed? Better to check; it will save you time in the long run.
Also, is the database normalised? If there is a lot of data repeated in various tables, then it's not. If there is some attempt to normalise the database (minimising data duplication), then there is some intelligence in the design. Otherwise, you might be right.

maintaining query-oriented applications [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I am currently building a kind of reporting system. The figures, tables, and graphs are all based on the results of queries. Somehow I find that complex queries are not easy to maintain, especially when there is a lot of filtering; this makes the queries very long and hard to understand. Also, queries with similar filters are sometimes executed, creating a lot of redundant code. E.g., when I am going to select something between '2010-03-10' and '2010-03-15' where the location is 'US' and the customer group is 'ZZ', I need to rewrite these conditions each time I make a query in this scope. Does the DBMS (in my case, MySQL) support any kind of "scope/context" to make the code more maintainable as well as faster?
Also, is there an industry standard or best practice for designing such applications?
I guess what I am doing is called data mining, right?
Learn how to create views to eliminate redundant code from queries: http://dev.mysql.com/doc/refman/5.0/en/create-view.html
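For instance, a view can capture the repeated date/location/customer-group filter from the question once, so individual reports just select from the view. A sketch with SQLite via Python (the table, columns, and view name are made up; MySQL's CREATE VIEW has the same shape):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Sales (
        Sale_Date TEXT, Location TEXT, Customer_Group TEXT, Amount REAL
    );
    -- Define the repeated filter once, in a view, instead of in every query.
    CREATE VIEW US_ZZ_March AS
        SELECT * FROM Sales
        WHERE Sale_Date BETWEEN '2010-03-10' AND '2010-03-15'
          AND Location = 'US'
          AND Customer_Group = 'ZZ';
""")
con.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ('2010-03-12', 'US', 'ZZ', 100.0),
    ('2010-03-12', 'UK', 'ZZ', 200.0),  # filtered out: wrong location
    ('2010-04-01', 'US', 'ZZ', 300.0),  # filtered out: outside date range
])
# Reports now query the view without restating the conditions.
total = con.execute("SELECT SUM(Amount) FROM US_ZZ_March").fetchone()[0]
```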
No, this isn't data mining; it's plain old reporting, sometimes called "decision support". The bread and butter of information technology. Ultimately, plain old reporting is the reason we write software: someone needs information to make a decision and take action.
Data mining is a little more specialized in that the relationships aren't easily defined yet. Someone is trying to discover the relationships so they can then write a proper query to make use of the relationship they found.
You won't make a very flexible reporting tool if you are hand-coding the queries. Every time a requirement changes, you are up to your neck in fiddly code trying to satisfy it - that way lies madness.
Instead you should start thinking about a meta-layer above your query infrastructure, generating the SQL in response to criteria expressed by the user. You could present them with a set of choices from which you generate your queries. If you give a bit of thought to making those choices extensible, you'll be well on your way down the path of the many, many BI and reporting products that already exist.
You might also want to start looking for infrastructure that does this already, such as Crystal Reports (swallowed by Business Objects, swallowed by SAP) or Eclipse's BIRT. Depending on whether you are after a programming exercise or a solution to your users' reporting problems you might just want to grab an off the shelf product which has already had tens of thousands of man years of development, such as one of those above or even Cognos (swallowed by IBM) or Hyperion (swallowed by Oracle).
Best of luck.