SQL database to track monthly bills (Beginner) [closed] - sql

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I am interested in learning SQL. I span up a Server 2019 and installed SQL Express on it. I am currently going through a basic SQL course on Codecademy but figured it would be a lot better to practice with some hands on. I came up with an idea to create a database to track my monthly bills and finances. Once I have enough data then I can leverage some graphing tools to visually present the data.
Has anyone done something similar before? Any suggestions as how I should design the database/tables? Keep in mind that my SQL knowledge is still very limited.
Thanks in advance!

This is a very broad question and nobody can give a definitive answer.
I agree with your approach of trying it yourself in your own situation - that's a definite thumbs-up from me (indeed, when we get new employees, this is one of the best traits when reviewing them). So many times people learn so much by 'getting their hands dirty' so to speak (e.g., going on and trying it themselves).
However, I suggest starting with the examples they have to get the general concepts down - the usually choose at least decent examples.
Alternatively though, you could give it a shot. Just be prepared to be wrong and start again. But don't worry - in terms of value, having a shot and getting it wrong is worth much more than reading something and only half-understanding it.
If you are familiar with spreadsheets, I suggest
Imagine how you would keep this on spreadsheets e.g., one sheet with bills that are due, and one sheet with your payments
Each one of those sheets would represent a table in your database.
If you pay all your bills with one payment only (e.g., no installments), then it would be easier to do it with one spreadsheet (e.g., just listing all the bills on the left side, and their payment information on the right). In this case it may not be the best case for teaching yourself databases. On the other hand, if you do pay by installments, then this could be useful.
The big difference difference in approach, is that in databases, the rows are not inherently sorted. Instead, you typically give the rows an identifier (e.g., Bill_ID, or Payment_ID). And then the tables are linked e.g., for a given row in Payment table, you'd also include the Bill_ID to represent which bill the payment was for.
Update: More examples
To choose a relevant thing to try on databases, I suggest choosing things that are related to each other, but are separate from each other (e.g., not linked 1-to-1).
In the bills/payments above, if you paid each bill with one payment, they didn't need to be on separate tables. However, you could try other things e.g.,
You live in a sharehouse where people pay for various things in a 'kitty' sysem (e.g., on each person's payday, they put in the amount they owe). In this case you may have a Bill table (which includes how it is split up, and when it was paid), and Person_Payments table which includes when people put money into the kitty
You have a family with kids and chores. You have a Kids table (with their name, etc), a Chores table (listing chores and how much they are worth in pocket money) and then a Kids_Chores table listing the Kid_ID, Chore_ID and date. Whenever they do a chore it goes into Kids_Chores and that is used to determine their pocket money.
You play various computer games and you want to track your win rate on them over time. You have one table for Game (with info about your user ID, etc), one for Game_Mode (which indicates, for a given game, what mode you were playing e.g., casual vs league, easy vs hard), then one for Game_Stats recording the date you played, the game and game_mode, and the number of games and number of wins.

Related

Performance gains vs Normalizing your tables? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Ok ok I know you probably all going to kill me for asking this, however I got into an friendly programmer argument with a co-worker about one of our database tables and he asked a question which I know the answer to but I couldn't explain it is the better way.
I will simplify the situation for the simplicity of the question, We have a fairly large table of people / users. Now amongst other data being stored the data in question is as follows: we have a simNumber, cellNumber and the ipAddress of that sim.
Now I am saying that we should make a table lets call it SimTable and put those 3 entries in the sim table, and then put a FK in the UsersTable linking the two. Why? Because that's what I have always been taught NORMALISE your tables!!! Ok so all is good in that regard.
But now my friend says to me yes, but now when you want to query a users phone number, SQL now has to go and:
search for the user
search for the sim fk
search for the correct sim row in the sim database
get the phone number
Now when I go and request 10000 users phone numbers, the number of operations done seriously grows in size.
Vs the other approach
search for the user
find the phone number
Now the argument is purely performance based. As much as I understand why we do normalize the data (to remove redundant data, maintainability, make changes to data in one table which propagate up etc.. ) It does appear to me that the approach with the data in one table will be faster or will at least less tasks/ operations to give me the data I want?
So what is the case in this situation? I do hope that I have not asked anything insanely silly , it is early in the morning so do forgive me if im not thinking clearly
The technology involved in MS SQL server 2012
[EDIT]
This article below also touches on some pf the concepts I have mentioned above
http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
The goal of normalization is not performance. The goal is to model your data correctly with minimum redundancy so you avoid data anomalies.
Say for example two users share the same phone. If you store the phones in the user table, you'd have sim number, IP address, and cell number stored one each user's row.
Then you change the IP address on one row but not the other. How can one sim number have two IP addresses? Is that even valid? Which one is correct? How would you fix such discrepancies? How would you even detect them?
There are times when denormalization is worthwhile, if you really need to optimize data access for one query that you run very frequently. But denormalization comes at a cost, so be prepared to commit yourself to a lot more manual work to take responsibility for data integrity. More code, more testing, more cleanup tasks. Do those count when considering "performance" of the project overall?
Re comments:
I agree with #JoelBrown, as soon as you implement your first case of denormalization, you compromise on data integrity.
I'll expand on what Joel mentions as "well-considered." Denormalization benefits specific queries. So you need to know which queries you have in your app, and which ones you need to optimize for. Do this conservatively, because while denormalization can help a specific query, it harms performance for all other uses of the same data. So you need to know whether you need to query the data in different ways.
Example: suppose you are designing a database for StackOverflow, and you want to support tags for questions. Each question can have a number of tags, and each tag can apply to many questions. The normalized way to design this is to create a third table, pairing questions with tags. That's the physical data model for a many-to-many relationship:
Questions ----<- QuestionsTagged ->---- Tags
But you figure you don't want to do the join to get tags for a given question, so you put tags into a comma-separated string in the questions table. This makes it quicker to query a given question and its associated tags.
But what if you also want to query for one specific tag and find its related questions? If you use the normalized design, it's simply a query against the many-to-many table, but on the tag column.
But if you denormalize by storing tags as a comma-separated list in the Questions table, you'd have to search for tags as substrings within that comma-separated list. Searching for substrings can't be indexed with a standard B-tree style index, and therefore searching for related questions becomes a costly table-scan. It's also more complex and inefficient to insert and delete a tag, or to apply constraints like uniqueness or foreign keys.
That's what I mean by denormalization making an improvement for one type of query at the expense of other uses of the data. That's why it's a good idea to start out with everything in normal form, and then refactor to denormalized designs later on a case by case basis as your bottlenecks reveal themselves.
This goes back to old wisdom:
"Premature optimization is the root of all evil" -- Donald Knuth
In other words, don't denormalize until you can demonstrate during load testing that (a) it makes a real improvement to performance that justifies the loss of data integrity, and (b) it does not degrade performance of other cases unacceptably.
It sounds like you already understand the benefits of normalisation, so I won't cover these.
There are a couple of considerations here:
1. Does a user always have one and only phone number?
If so, then it is still normalised to add these to the user table. However, if the user can have either no phone number or multiple phone numbers, then the phone details should be held in a seperate table.
Assuming you have these in seperate tables, but after conducting performance tests you found that joining on these 2 tables was having a significant effect on performance, then you may choose to deliberately denormalise the tables for performance gains.
Others have already provided some good points and you may also want to take a look at this.
I'd just like to mention one more aspect that is often overlooked: I/O tends to be the greatest component of the cost of most queries, and denormalization generally increases the storage size of data, therefore making the DBMS cache "smaller".
If your normalized database fits into cache and denormalized doesn't, you may actually observe a performance decrease for the latter.
And you won't be able to spot that in development, unless you actually have the amount of data that is similar to production. This is one of many reasons why you should never, ever denormalize without solid measurements (on representative amounts of data) to justify it.

Database Design For Users\Downloads [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I need to design a database for something like a downloads site . I want to keep track of users , the programs each users downloaded and also allow users to rate+comment said programs.The things I need from this database - get average rating for a program , get all comments for a program , know exactly what program was downloaded by whom(I dont care how many times each program was downloaded but I want to know for each users what programs he has downloaded),maybe also count number of comments for each program and thats about it(it's a very small project for personal use that I want to keep simple)
I come up with these entities -
User(uid,uname etc)
Program(pid,pname)
And the following relationships-
UserDownloadedProgram(uid,pid,timestamp)
UserCommentedOnProgram(uid,pid,commentText,timestamp)
UserRatedProgram(uid,pid,rating)
Why I chose it this way - the relationships (user downloads , user comments and rates) are many to many . A user downloads many programs and a program is downloaded by many users. Same goes for the comments (A user comments on many programs and a program is commented or rated by many users). The best practice as far as I know is to create a third table which is one to many (a relationship table).
. I suppose that in this design the average rating and comment retrieval is done by join queries or something similar.
I'm a total noob in database design but I try to adhere to best practices , is this design more or less ok or am I overlooking something ?
I can definitely think of other possibilities - maybe comment and\or rating can be an entity(table) by itself and the relationships are between 3 entities. I'm not really sure what the benefits\drawbacks of that are: I know that I don't really care about the comments or the ratings , I only want to display them where appropriate and maintain them(delete when needed) , so how do I know if they better become an entity themselves?
Any thoughts?
You would create new entities as dictated by the rules of normalization. There is no particular reason to make an additional (separate) table for comments because you already have one. Who made the comment and which program the comment applied to are full-fledged attributes of a comment. The foreign keys representing these relationships (which are many-to-one, from the perspective of the comment table) belong right where you've put them.
The tables you've proposed are in third normal form which is acceptable according to best practices. I would add that you seem to be tracking data on a transactional basis (i.e. recording events as and when they occur). That is a good practice too because you can always figure out whatever you want to based on detailed information.
Calculating number of downloads or number of comments is a simple matter of using SQL Aggregate Functions with filters on the foreign key(s) that apply to your query - e.g. where pid=1234 etc.
I would do an entity for Downloads with its own id. You could have download status, you may have multiple download of the same program for one user. you may need to associate your download to an order or something else,..

Database normalization - who's right?

My professor (who claimed to have a firm understanding about systems development for many years) and I are arguing about the design of our database.
As an example:
My professor insists this design is right:
(list of columns)
Subject_ID
Description
Units_Lec
Units_Lab
Total_Units
etc...
Notice the total units column. He said that this column must be included.
I tried to explain that it is unnecessary, because if you want it, then just make a query by simply adding the two.
I showed him an example I found in a book, but he insists that I don't have to rely on books too much in making our system.
The same thing applies to similar cases as in this one:
student_ID
prelim_grade
midterm_grade
prefinal_grade
average
He wanted me to include the average! Anywhere I go, I can find myself reading articles that convince me that this is a violation of normalization. If I needed the average, I can easily compute the three grades. He enumerated some scenarios including ('Hey! What if the query has been accidentally deleted? What will you do? That is why you need to include it in your table!')
Do I need to reconstruct my database(which consists of about more than 40 tables) to comply with what he want? Am I wrong and just have overlooked these things?
Another thing is that he wanted to include the total amount in the payments table, which I believe is unnecessary. (Just compute the unit price of the product and the quantity.) He pointed out that we need that column for computing debits and/or credits that are critical for the overall system management, that it is needed for balancing transaction. Please tell me what you think.
You are absolutely correct! One of the rules of normalization is to reduce those attributes which can be easily deduced by using other attributes' values. ie, by performing some mathematical calculation. In your case, the total units column can be obtained by simply adding.
Tell your professor that having that particular column will show clear signs of transitive dependency and according to the 3rd normalization rule, its recommended to reduce those.
You are right when you say your solution is more normalized.
However, there is a thing called denormalization (google for it) which is about deliberately violating normalization rules to increase queries performance.
For instance you want to retrieve first five subjects (whatever the thing would be) ordered by decreasing number or total units.
You solution would require a full scan on two tables (subject and unit), joining the resultsets and sorting the output.
Your professor's solution would require just taking first five records from an index on total_units.
This of course comes at the price of increased maintenance cost (both in terms of computational resources and development).
I can't tell you who is "right" here: we know nothing about the project itself, data volumes, queries to be made etc. This is a decision which needs to be made for every project (and for some projects it may be a core decision).
The thing is that the professor does have a rationale for this requirement which may or may not be just.
Why he hasn't explained everything above to you himself, is another question.
In addition to redskins80's great answer I want to point out why this is a bad idea: Every time you need to update one of the source columns you need to update the calculated column as well. This is more work that can contain bugs easily (maybe 1 year later when a different programmer is altering the system).
Maybe you can use a computed column instead? That would be a workable middle-ground.
Edit: Denormalization has its place, but it is the last measure to take. It is like chemotherapy: The doctor injects you poison only to cure an even greater threat to your health. It is the last possible step.
Think it is important to add this because when you see the question the answer is not complete in my opinion. The original question has been answered well but there is a glitch here. So I take in account only the added question quoted below:
Another thing is that he wanted to include the total amount in the
payments table, which I believe is unnecessary(Just compute the unit
price of the product and the quantity.). He pointed out that we need
that column for computing debits and/or credits that are critical for
the overall system management, that it is needed for balancing
transaction. Please tell me what you think.
This edit is interesting. Based on the facts that this is a transactional system handling about money it has to be accountable. I take some basic terms: Transaction, product, price, amount.
In that sense it is very common or even required to denormalize. Why? Because you need it to be accountable. So when the transaction is registered that's it, it may never ever be modified. If you need to correct it then you make another transaction.
Now yes you can calculate for example product price * amount * taxes etc. That makes sense in normalization sense. But then you will need a complete lockdown of all related records. So take for example the products table: If you change the price before the transaction it should be taken into account when the transaction happens. But if the price changes afterwards it does not affect the transaction.
So it is not acceptable to just join transaction.product_id=products.id since that product might change. Example:
2012-01-01 price = 10
2012-01-05 price = 20
Transaction happens here, we sell 10 items so 10 * 20 = 200
2012-01-06 price = 22
Now we lookup the transaction at 2012-01-10, so we do:
SELECT
transactions.amount * products.price AS totalAmount
FROM transactions
INNER JOIN products on products.id=transactions.product_id
That would give 10 * 22 = 220 so it is not correct.
So you have 2 options:
Do not allow updates on the products table. So you make that table versioned, so for every record you add a new INSERT instead of update. So the transaction keeps pointing at the right version of the product.
Or you just add the fields to the transactions table. So add totalAmount to the transactions table and calculate it (in a database transaction) when the transaction is inserted and save it.
Yes, it is denormalized but it has a good reason, it makes it accountable. You just know and it's verified with transactions, locks etc. that the moment that transaction happened it related to the described product with the price = 20 etc.
Next to that, and that is just a nice thing of denormalization when you have to do that anyway, it is very easy to run reports. Total transaction amount of the month, year etc. It is all very easy to calculate.
Normalization has good things, for example no double storage, single point of edit etc. But in this case you just don't want that concept since that is not allowed and not preferred for a transactions log database.
See a transaction as a registration of something happened in real world. It happened, you wrote it down. Now you cannot change history, it was written as it was. Future won't change it, it happened.
If you want to implement the good, old, classic relational model, I think what you're doing is right.
In general, it's actually a matter of philosophy. Some systems, Oracle being an example, even allow you to give up the traditional, relational model in favor of objects, which (by being complex structures kept in tables) violate the 1st NF but give you the power of object-oriented model (you can use inheritance, override methods, etc.), which is pretty damn awesome in some cases. The language used is still SQL, only extended.
I know my answer drifts away from the subject (as we take into consideration a whole new database type) but I thought it's an interesting thing to share on the occasion of a pretty general question.
Database design for actual applications is hardly the question of what tables to make. Currently, there are countless possibilities when it comes to keeping and processing your data. There are relational systems we all know and love, object databases (like db4o), object-relational databases (not to be confused with object relational mapping, what I mean is tools like Oracle 11g with its objects), xml databases (take eXist), stream databases (like Esper) and the currently thriving noSQL databases (some insist they shouldn't be called databases) like MongoDB, Cassandra, CouchDB or Oracle NoSQL
In case of some of these, normalization loses its sense. Each model serves a completely different purpose. I think the term "database" has a much wider meaning than it used to.
When it comes to relational databases, I agree with you and not the professor (although I'm not sure if it's a good idea to oppose him to strongly).
Now, to the point. I think you might win him over by showing that you are open-minded and that you understand that there are many options to take into consideration (including his views) but that the situation requires you to normalize the data.
I know my answer is quite a stream of conscience for a stackoverflow post but I hope it's not received as lunatic babbling.
Good luck in the relational tug of war
You are talking about historical and financial data here. It is common to store some computations that will never change becasue that is the cost that was charged at the time. If you do the calc from product * price and the price changed 6 months after the transaction, then you have the incorrect value. Your professor is smart, listen to him. Further, if you do a lot of reporting off the database, you don't want to often calculate values that are not allowed to be changed without another record of data entry. Why perform calculations many times over the history of the application when you only need to do it once? That is wasteful of precious server resources.
The purpose of normalization is to eliminate redundancies so as to eliminate update anomalies, predominantly in transactional systems. Relational is still the best solution by far for transaction processing, DW, master data and many BI solutions. Most NOSQLs have low-integrity requirements. So you lose my tweet - annoying but not catastrophic. But to lose my million dollar stock trade is a big problem. The choice is not NOSQL vs. relational. NOSQL does certain things very well. But Relational is not going anywhere. It is still the best choice for transactional, update oriented solutions. The requirements for normalization can be loosened when the data is read-only or read-mostly. That's why redundancy is not such a huge problem in DW; there are no updates.

What are the pros and cons of Anchor Modeling? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am currently trying to create a database where a very large percentage of the data is temporal. After reading through many techniques for doing this (most involving 6nf normalization) I ran into Anchor Modeling.
The schema that I was developing strongly resembled the Anchor Modeling model, especially since the use case (Temporal Data + Known Unknowns) is so similar, that I am tempted to embrace it fully.
The two biggest problem I am having is that I can find nothing detailing the negatives of this approach, and I cannot find any references to organizations that have used it in production for war-stories and gotchas that I need to be aware of.
I am wondering if anyone here is familiar enough with to briefly expound on some of the negatives (since the positives are very well advertized in research papers and their site), and any experiences with using it in a production environment.
In reference to the anchormodeling.com
Here are a few points I am aware of
The number of DB-objects is simply too large to maintain manually, so make sure that you use designer all the time to evolve the schema.
Currently, designer supports fully MS SQL Server, so if you have to port code all the time, you may want to wait until your target DB is fully supported. I know it has Oracle in dropdown box, but ...
Do not expect (nor demand) your developers to understand it, they have to access the model via 5NF views -- which is good. The thing is that tables are loaded via (instead-of-) triggers on views, which may (or may not) be a performance issue.
Expect that you may need to write some extra maintenance procedures (for each temporal attribute) which are not auto-generated (yet). For example, I often need a prune procedure for temporal attributes -- to delete same-value-records for the same ID on two consecutive time-events.
Generated views and queries-over-views resolve nicely, and so will probably anything that you write in the future. However, "other people" will be writing queries on views-over-views-over-views -- which does not always resolve nicely. So expect that you may need to police queries more than usual.
Having sad all that, I have recently used the approach to refactor a section of my warehouse, and it worked like a charm. Admittedly, warehouse does not have most of the problems outlined here.
I would suggest that it is imperative to create a demo-system and test, test, test ..., especially point No 3 -- loading via triggers.
With respect to point number 4 above. Restatement control is almost finished, such that you will be able to prevent two consecutive identical values over time.
And a general comment, joins are not necessarily a bad thing. Read: Why joins are a good thing.
One of the great benefits of 6NF in Anchor Modeling is non-destructive schema evolution. In other words, every previous version of the database model is available as a subset in the current model. Also, since changes are represented by extensions in the schema (new tables), upgrading a database is almost instantanous and can safely be done online (even in a production environment). This benefit would be lost in 5NF.
I haven't read any papers on it, but since it's based on 6NF, I'd expect it to suffer from whatever problems follow 6NF.
6NF requires each table consist of a candidate key and no more than one non-key column. So, in the worst case, you'll need nine joins to produce a 10-column result set. But you can also design a database that uses, say, 200 tables that are in 5NF, 30 that are in BCNF, and only 5 that are in 6NF. (I think that would no longer be Anchor Modeling per se, which seems to put all tables in 6NF, but I could be wrong about that.)
The Mythical Man-Month is still relevant here.
The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. The only question is whether to plan in advance to build a throwaway, or to promise to deliver the throwaway to customers.
Fred Brooks, Jr., in The Mythical Man-Month, p 116.
How cheaply can you build a prototype to test your expected worst case?
In this post I will present a large part of the real business that belong to databases. Database's solutions in this big business area can not be solved by using „Anchor modeling" , at all.
In the real business world this case is happening on a daily basis. That is the case when data entry person, enters a wrong data.
In real-world business, errors happen frequently at data entry level. It often happens that data entry generates large amounts of erroneous data. So this is a real and big problem. "Anchor modeling" can not solve this problem.
Anyone who uses the "Anchor Modeling" database can enter incorrect data. This is possible because the authors of "Anchor modeling" have written that the erroneous data can be deleted.
Let me explain this problem by the following example:
A profesor of mathematics gave the best grade to the student who had the worst grade. In this high school, professors enter grades in the corressponding database. This student gave money to the professor for this criminal service. The student managed to enroll at the university using this false grade.
After a summer holiday, the professor of mathematics returned to school. After deleting the wrong grade from the database, the professor entered the correct one in the database. In this school they use "Anchor Modeling" db. So the math profesor deleted false data as it is strictly suggested by authors of "Anchor modeling".
Now, this professor of mathematics who did this criminal act is clean, thanks to the software "Anchor modeling".
This example says that using "Anchor Modeling," you can do crime with data just by applying „Anchor modeling technology“
In section 5.4 the authors of „Anchor modeling“ wrote the following: „Delete statements are allowed only when applied to remove erroneous data.“ .
You can see this text at the paper „ An agile modeling technique using sixth normal form for structurally evolving data“ written by authors of „Anchor modeling“.
Please note that „Anchor modeling“ was presented at the 28th International Conference on Conceptual Modeling and won the best paper award?!
Authors of "Anchor Modeling" claim that their data model can maintain a history! However this example shoes that „Anchor modeling“ can not maintain the history at all.
As „Anchor modeling“ allows deletion of data, then "Anchor modeling" has all the operations with the data, that is: adding new data, deleting data and update. Update can be obtained by using two operations: first delete the data, then add new data.
This further means that Anchor modeling has no history, because it has data deletion and data update.
I would like to point out that in "Anchor modeling" each erroneous data "MUST" be deleted. In the "Anchor modeling" it is not possible to keep erroneous data and corrected data.
"Anchor modeling" can not maintain history of erroneous data.
In the first part of this post, I showed that by using "Anchor Modeling" anyone can do crime with data. This means "Anchor Modeling" runs the business of a company, right into a disaster.
I will give one example so that professionals can see on real and important example, how bad "anchor modeling" is.
Example
People who are professionals in the business of databases, know that there are thousands and thousands of international standards, which have been used successfully in databases as keys.
International standards:
All professionals know what is "VIN" for cars, "ISBN" for books, and thousands of other international standards.
National standards:
All countries have their own standards for passports, personal documents, bank cards, bar codes, etc
Local standards:
Many companies have their own standards. For example, when you pay something, you have an invoice with a standard key written and that key is written in the database, also.
All the above mentioned type of keys from this example can be checked by using a variety of institutions, police, customs, banks credit card, post office, etc. You can check many of these "keys" on the internet or by using a phone.
I believe that percent of these databases, which have entities with standard keys, and which I have presented in this example, is more than 95%.
For all the above cases the "anchor surrogate key" is nonsense. "Anchor modeling" exclusively uses "anchor-surrogate key"
In my solution, I use all the keys that are standard on a global or local level and are simple.
Vladimir Odrljin

Would anyone ever recommend storing dates and numbers in the same field? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
As background, I'm one of two developers in my department. I got into computers my freshman year in high school (1986) and have no formal education. I got into MS Access a little bit in 1994 and more seriously beginning in 2003. I'm self-educated, have always tried to learn as much as I can about database design, and while I believe I know a lot I also know I don't know everything.
The other developer in my department, according to his resume, has a degree in computer science and has been doing IT work, including web design and database design, for about 8 years. He was hired into my department last December. I've been very surprised by what I see as a very fundamental lack of knowledge about the basics of database design and SQL and have been trying to figure out if at least part of the problem is I'm expecting too much or maybe don't know as much as I think I do.
Hence my question. Please note we are 100% MS Access, but I believe this question applies to about any SQL database. This developer was tasked to take a spreadsheet and convert it into a database. Part of the spreadsheet involved tracking inventory for batteries. In the spreadsheet, the column titles were Date and Count. But the data in the date column was a mix of dates and batch numbers. So this developer created a table with a numeric field to contain both the batch number and the date and a second boolean field called IsDate to indicate what value was in the field.
I disagree with this approach and would have created two separate fields, a date field for the date and a numeric field for the batch number. When I suggested this approach, he seemed to not only not understand why but also to get a bit angry about having to change his design.
Which approach would you recommend? Also, assuming everyone agrees with my approach - of course you will! ;) - if you had a developer with this supposed level of experience, would you consider him worth keeping and worth investing the time and effort to educate him?
My own rule of thumb here is:
Always keep data in a native datatype.
This helps comparing, sorting, finding and grouping - especially in a database - and makes your storage less prone to query errors. Moreover, you're not required to use another predicate (AND isdate) when accessing the data. Hence, I think your approach is correct.
Your colleague's approach seems not to be a matter of high education, but one of a personal approach. I've seen workers with PhD who could well listen to a well-reasoned argument, and freshmen who made grave mistakes and would not listen to a polite advice.
I'd most definitely store the date and the batch number in different fields of the appropriate type - setting each with the relevant content or as NULL if no value was available. By doing this you'd be able to see what data you actually have available and perform meaningful operations on that data.
In terms of you second question, I guess it would really depend on what the developer in question said when you asked them why they'd chosen the approach they did.
You are right.
Only under severe memory restrictions might (note might) this kind of architecture be acceptable.
As to dealing with him, I would first talk to him and fiugre out why he chose the given approach, this is something that might have been common in Access Databases 10 years ago (but even then there was enough disk and memory space to not have to do these kind of tricks).
His reluctance to talk about his design is a worse indicator of his abilities than the design itself. Even the most misguided design should have been based on a structured approach or idea. In my mind it is not a bad thing to be wrong, it is a bad thing to create random structures. But not knowing your requirements it is hard to suggest whether it is worth keeping him or not.
Is one of you the 'senior' hierarchy wise or are you sharing responsibilities ?
Point out that he is breaking first normal form by doing so. Be able to describe 1NF 2NF and 3NF before trying to impress him with you fancy pants knowledge.