I am interested in learning SQL. I spun up a Server 2019 machine and installed SQL Express on it. I am currently going through a basic SQL course on Codecademy, but figured it would be a lot better to practice with something hands-on. I came up with the idea of creating a database to track my monthly bills and finances. Once I have enough data, I can leverage some graphing tools to present it visually.
Has anyone done something similar before? Any suggestions as to how I should design the database/tables? Keep in mind that my SQL knowledge is still very limited.
Thanks in advance!
This is a very broad question and nobody can give a definitive answer.
I agree with your approach of trying it yourself in your own situation - that's a definite thumbs-up from me (indeed, when we review new employees, this is one of the best traits to see). People learn so much by 'getting their hands dirty', so to speak - that is, by going in and trying things themselves.
However, I suggest starting with the examples the course provides to get the general concepts down - they usually choose at least decent examples.
Alternatively though, you could give it a shot. Just be prepared to be wrong and start again. But don't worry - in terms of value, having a shot and getting it wrong is worth much more than reading something and only half-understanding it.
If you are familiar with spreadsheets, I suggest:
Imagine how you would keep this in spreadsheets, e.g., one sheet with the bills that are due, and one sheet with your payments.
Each one of those sheets would represent a table in your database.
If you pay each bill with a single payment (e.g., no installments), then it would be easier to do with one spreadsheet (e.g., just listing all the bills on the left side and their payment information on the right). In that case it may not be the best exercise for teaching yourself databases. On the other hand, if you do pay by installments, then this could be useful.
The big difference in approach is that in databases, the rows are not inherently sorted. Instead, you typically give each row an identifier (e.g., Bill_ID or Payment_ID), and then the tables are linked: for a given row in the Payment table, you'd also include the Bill_ID to indicate which bill the payment was for.
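For example, a minimal sketch of what those two tables might look like (the table and column names are purely illustrative, so adjust them to whatever you actually track):

    -- Purely illustrative names.
    CREATE TABLE Bill (
        Bill_ID    INT PRIMARY KEY,
        Biller     VARCHAR(100)   NOT NULL,
        Amount_Due DECIMAL(10, 2) NOT NULL,
        Due_Date   DATE           NOT NULL
    );

    CREATE TABLE Payment (
        Payment_ID   INT PRIMARY KEY,
        Bill_ID      INT NOT NULL REFERENCES Bill (Bill_ID),  -- which bill this payment is for
        Amount_Paid  DECIMAL(10, 2) NOT NULL,
        Payment_Date DATE           NOT NULL
    );

    -- Total paid per bill, however many installments there were:
    SELECT b.Bill_ID, b.Biller, b.Amount_Due, SUM(p.Amount_Paid) AS Total_Paid
    FROM Bill AS b
    LEFT JOIN Payment AS p ON p.Bill_ID = b.Bill_ID
    GROUP BY b.Bill_ID, b.Biller, b.Amount_Due;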
Update: More examples
To choose a relevant thing to try on databases, I suggest choosing things that are related to each other, but are separate from each other (e.g., not linked 1-to-1).
In the bills/payments above, if you paid each bill with one payment, they didn't need to be on separate tables. However, you could try other things e.g.,
You live in a sharehouse where people pay for various things through a 'kitty' system (e.g., on each person's payday, they put in the amount they owe). In this case you may have a Bill table (which includes how the bill is split up and when it was paid), and a Person_Payments table which records when people put money into the kitty.
You have a family with kids and chores. You have a Kids table (with their names, etc.), a Chores table (listing chores and how much each is worth in pocket money) and then a Kids_Chores table listing the Kid_ID, Chore_ID and date. Whenever a kid does a chore it goes into Kids_Chores, and that is used to determine their pocket money (a sketch of this three-table layout follows the list).
You play various computer games and you want to track your win rate on them over time. You have one table for Game (with info about your user ID, etc.), one for Game_Mode (which indicates, for a given game, what mode you were playing, e.g., casual vs. league, easy vs. hard), then one for Game_Stats recording the date you played, the game and game mode, and the number of games and number of wins.
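As promised, a rough sketch of the chore-tracking idea (names are again illustrative); the Kids_Chores table is the linking table that makes the many-to-many relationship work:

    -- Illustrative sketch of the chore-tracking idea above.
    CREATE TABLE Kids (
        Kid_ID INT PRIMARY KEY,
        Name   VARCHAR(50) NOT NULL
    );

    CREATE TABLE Chores (
        Chore_ID    INT PRIMARY KEY,
        Description VARCHAR(100)  NOT NULL,
        Reward      DECIMAL(5, 2) NOT NULL   -- pocket money per completion
    );

    CREATE TABLE Kids_Chores (
        Kid_ID    INT  NOT NULL REFERENCES Kids (Kid_ID),
        Chore_ID  INT  NOT NULL REFERENCES Chores (Chore_ID),
        Done_Date DATE NOT NULL
    );

    -- Pocket money owed to each kid:
    SELECT k.Name, SUM(c.Reward) AS Pocket_Money
    FROM Kids_Chores AS kc
    JOIN Kids   AS k ON k.Kid_ID   = kc.Kid_ID
    JOIN Chores AS c ON c.Chore_ID = kc.Chore_ID
    GROUP BY k.Name;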
I am a web developer, so I don't know a lot about databases. The company I recently joined has a very mature desktop ERP built in .NET and SQL Server; they provide services to huge corporate clients and their design works fine for them. However, they never developed any web-based system, and their database design is quite unusual. Let me explain it and then I will post my questions.
So, now I have joined them to develop a web-based ERP (a replica of their desktop system on the web). Since I am building the application from scratch, they have given me the liberty to revamp anything that I think would have a positive effect.
The current design is:
They have around 150 tables in database.
Every table has the same schema definition.
They have divided the fields into three categories:
Strings (so they assign 50 varchar(250) fields in the database).
DateTime (so they assign 15 smalldatetime fields in the database).
Numeric (so they assign 30 Numeric() fields in the database).
All columns have generic names (these names don't terrify the developers; within a week or two they get accustomed to them and even remember many of the field associations):
Strings (S1, S2, S3, S4 and so on).
DateTime (D1, D2, D3, D4 and so on).
Numerics (N1, N2, N3, N4 and so on).
So that is the schema. Every table consists of 95 columns, and only 15-20 columns are actually used. The remaining 75-80 columns are NULL.
The tables are well normalized and indexes are maintained.
The number of rows in most of the tables is less than 1,000. Only the transaction table reaches several hundred thousand records.
The precision of the numeric columns is (1, 0) by default. When a field is selected for use, the precision is adjusted as per the requirement.
An empty database is of ~4MB.
This design makes their development quite easy: since they have plenty of spare columns, whenever they need a field they just pick the data type (String, Numeric or DateTime) and the next available column is assigned (an abbreviated, hypothetical illustration of the pattern follows this list).
Only 9-10 tables have image fields.
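To make the layout concrete, here is a made-up, abbreviated illustration of the pattern; the real tables have the full 50/15/30 generic columns, and the ID column here is only for illustration:

    -- Made-up, abbreviated illustration; real tables have S1-S50, D1-D15, N1-N30.
    CREATE TABLE SomeEntity (
        ID INT IDENTITY PRIMARY KEY,    -- illustrative key column
        S1 VARCHAR(250)  NULL,          -- meaning in this table: e.g. customer name
        S2 VARCHAR(250)  NULL,          -- meaning in this table: e.g. address
        S3 VARCHAR(250)  NULL,          -- unassigned, stays NULL
        -- ... S4 through S50 ...
        D1 SMALLDATETIME NULL,          -- meaning in this table: e.g. registration date
        -- ... D2 through D15 ...
        N1 NUMERIC(1, 0) NULL           -- precision widened once the column is assigned
        -- ... N2 through N30 ...
    );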
I think this information is quite enough. Now I want to ask:
Since I don't know a lot about SQL: is this design viable for a web environment (a web API which will be called from a web client as well as from mobile)?
Since every table has 75-80 NULL columns, will they cost us a lot of memory in the future, when transaction records reach the millions? (Considering that the application is multi-tenant.)
What are your suggestions to improve this design?
THANKS.
You have two choices:
Use it, and live with it.
Completely redesign it.
I recommend #1, because #2 will be hard and will be viewed skeptically by your colleagues and boss. Any problems with your progress will be put down to your crazy database design.
The database you describe embodies the classic entity-attribute-value (EAV) design error. Instead of defining tables modeled on real-world entities in the universe of discourse, and using the DBMS to enforce and infer logical relationships among them, the designers opted to move all meaning out of the database and into the application. Entities that should be in the database are constructed in memory using application logic that supplies meaning to S1 and the like. From a database perspective, it's an absolute nightmare.
It's also understandable. EAV designs usually arise where there's little database expertise, and where the problem domain is poorly understood. That adds up to "anything could go in the database", and an EAV design will indeed hold "anything". To the extent your customers determine the actual design -- that is, they supply independent meanings for each database column -- the application acts as a kind of DBMS proxy. The fact that every table has a vast number of unused columns suggests their use may be customer-determined: the customer can "add a column", and the application plucks one from the unused pile. No schema change necessary. It's dynamic!
There's a whole industry based on that idea. For example, the so-called "master data management" tools boil down to an EAV design where the customer designs a database within the application, and the application uses the DBMS in much the way you describe.
I am trying to map information of Linux packages (name + version) to their corresponding CPE strings (see http://nvd.nist.gov/cpe.cfm) in order to be able to automatically find possible vulnerabilities of a system.
There is an XML document provided by NIST which contains all relevant CPEs. I thought about parsing this information into an SQL database so I can quickly search by name and version number. That would be some 70,000 rows.
The problem now is, of course, that there are variations in the spellings of the CPEs and the package names. For example, the CPE for Tomcat 6.0.36 is cpe:/a:apache:tomcat:6.0.36, so you have the name tomcat and the version 6.0.36. Now, the package manager could give you something like tomcat6 for the name and 6.0.36-3 for the version. It's likely that both programs are the same or at least have the same vulnerabilities. So I need to be able to automatically identify the above-mentioned CPE as the correct one for my tomcat package.
The first thing to do would be some kind of normalization, maybe converting everything to lowercase. But as you can see from the example, that's not enough. I need some kind of fuzzy search. From what I already found out, there are some solutions for identifying matches in the case of misspelling. That is not exactly what I need, though. The package names are not misspelled but may contain additional characters (or miss some).
The fuzzy search must also be relatively fast, since I need to execute it for multiple hosts, each of which could have a few hundred packages installed, and as I said, the database would have around 70,000 rows. I can introduce a primary lookup which tries to find an exact match first, but since I suspect many packages will not have any corresponding CPE string, that will not reduce the workload dramatically.
Another constraint is that the solution should work on a non-proprietary database, since I don't have the financial means for anything else.
So, is there anything that matches these requirements? Or can you think of any solution to my problem except some kind of fuzzy searching?
Thanks in advance!
A general comment, first. The CPE nomenclature seems to have evolved organically, often depending on the vendors' (inconsistent) nomenclature. For example, Sun Java has major.minor.point_version. Adobe uses major.minor.point.subpoint. Microsoft operating systems use Service Packs_Language Packs. Some other vendors would use point releases with mostly numbers but occasional letters sprinkled in (e.g., .8, .9, .9R2, .10).
When I worked on the stated problem, I started from their XML files and manipulated them in Excel, splitting on the periods. Then I would sort either numerically (if they were all numeric) or as a text string. (Note that letters sprinkled into mostly-numeric versions cause havoc, and that .10 comes lexically before .8.)
This inconsistency is why third-party software vendors have sprouted like mushrooms after a spring rain. Companies would rather pay the software vendors than untangle this Gordian knot.
If you want a truly fuzzy search, please take a look at this question about using Soundex. Expect to get a lot of false positives.
If your goal is accurately mapping the CPE strings, you should probably think about implementing a lookup table that translates from CPE to a library name.
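If a free database such as PostgreSQL is an option, one concrete way to get the fuzzy part is trigram similarity via the pg_trgm extension. A rough sketch, with the table and column names made up:

    -- Assumes a PostgreSQL table holding the parsed CPE dictionary
    -- (made-up names: cpe(product, version, cpe_uri)).
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- A trigram index keeps similarity lookups fast even over ~70,000 rows.
    CREATE INDEX cpe_product_trgm_idx ON cpe USING gin (product gin_trgm_ops);

    -- Candidate CPEs for the package name 'tomcat6', best matches first.
    SELECT cpe_uri, similarity(product, 'tomcat6') AS score
    FROM cpe
    WHERE product % 'tomcat6'   -- % matches above the similarity threshold (default 0.3)
    ORDER BY score DESC
    LIMIT 5;

You would still run the exact-match lookup (and the lookup table suggested above) first, and fall back to the trigram query only for the leftovers.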
I am currently trying to create a database where a very large percentage of the data is temporal. After reading through many techniques for doing this (most involving 6nf normalization) I ran into Anchor Modeling.
The schema that I was developing strongly resembled the Anchor Modeling model, especially since the use case (Temporal Data + Known Unknowns) is so similar, that I am tempted to embrace it fully.
The two biggest problems I am having are that I can find nothing detailing the negatives of this approach, and I cannot find any references to organizations that have used it in production, for war stories and gotchas that I need to be aware of.
I am wondering if anyone here is familiar enough with it to briefly expound on some of the negatives (since the positives are very well advertised in research papers and on their site), and on any experiences with using it in a production environment.
In reference to anchormodeling.com:
Here are a few points I am aware of.
The number of DB-objects is simply too large to maintain manually, so make sure that you use designer all the time to evolve the schema.
Currently, the designer fully supports only MS SQL Server, so if you have to port code all the time, you may want to wait until your target DB is fully supported. I know it has Oracle in the dropdown box, but ...
Do not expect (nor demand) your developers to understand it; they have to access the model via 5NF views -- which is good. The thing is that the tables are loaded via (instead-of) triggers on the views, which may (or may not) be a performance issue.
Expect that you may need to write some extra maintenance procedures (for each temporal attribute) which are not auto-generated (yet). For example, I often need a prune procedure for temporal attributes -- to delete same-value records for the same ID on two consecutive time events (a rough sketch of such a prune appears at the end of this answer).
Generated views and queries-over-views resolve nicely, and so will probably anything that you write in the future. However, "other people" will be writing queries on views-over-views-over-views -- which does not always resolve nicely. So expect that you may need to police queries more than usual.
Having said all that, I have recently used the approach to refactor a section of my warehouse, and it worked like a charm. Admittedly, a warehouse does not have most of the problems outlined here.
I would suggest that it is imperative to create a demo-system and test, test, test ..., especially point No 3 -- loading via triggers.
With respect to point number 4 above: restatement control is almost finished, such that you will be able to prevent two consecutive identical values over time.
And a general comment, joins are not necessarily a bad thing. Read: Why joins are a good thing.
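Regarding the prune procedure from point 4, here is a rough T-SQL sketch, assuming a typical historized attribute table with columns ID, Value and ChangedAt (these names are mine, not the generator's):

    -- Deletes rows whose value equals the value of the immediately preceding
    -- time-event for the same ID, keeping only genuine changes.
    WITH Ordered AS (
        SELECT Value,
               LAG(Value) OVER (PARTITION BY ID ORDER BY ChangedAt) AS PrevValue
        FROM   Some_Temporal_Attribute
    )
    DELETE FROM Ordered
    WHERE Value = PrevValue;   -- first row per ID has PrevValue = NULL and is kept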
One of the great benefits of 6NF in Anchor Modeling is non-destructive schema evolution. In other words, every previous version of the database model is available as a subset of the current model. Also, since changes are represented by extensions to the schema (new tables), upgrading a database is almost instantaneous and can safely be done online (even in a production environment). This benefit would be lost in 5NF.
I haven't read any papers on it, but since it's based on 6NF, I'd expect it to suffer from whatever problems follow 6NF.
6NF requires each table consist of a candidate key and no more than one non-key column. So, in the worst case, you'll need nine joins to produce a 10-column result set. But you can also design a database that uses, say, 200 tables that are in 5NF, 30 that are in BCNF, and only 5 that are in 6NF. (I think that would no longer be Anchor Modeling per se, which seems to put all tables in 6NF, but I could be wrong about that.)
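To make the join cost concrete, here is a made-up miniature in the anchor style (not the official generator's naming or metadata columns), where reassembling even two attributes already takes two joins:

    -- Made-up miniature of a 6NF decomposition in the anchor style.
    CREATE TABLE Actor (
        Actor_ID INT PRIMARY KEY                    -- the anchor: surrogate key only
    );

    CREATE TABLE Actor_Name (
        Actor_ID   INT          NOT NULL REFERENCES Actor (Actor_ID),
        Name       VARCHAR(100) NOT NULL,           -- the single non-key column
        Valid_From DATE         NOT NULL,           -- historization
        PRIMARY KEY (Actor_ID, Valid_From)
    );

    CREATE TABLE Actor_Rating (
        Actor_ID   INT  NOT NULL REFERENCES Actor (Actor_ID),
        Rating     INT  NOT NULL,
        Valid_From DATE NOT NULL,
        PRIMARY KEY (Actor_ID, Valid_From)
    );

    -- Two attributes back in one result set already costs two joins
    -- (ignoring the extra work of picking the latest version of each).
    SELECT a.Actor_ID, n.Name, r.Rating
    FROM Actor AS a
    LEFT JOIN Actor_Name   AS n ON n.Actor_ID = a.Actor_ID
    LEFT JOIN Actor_Rating AS r ON r.Actor_ID = a.Actor_ID;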
The Mythical Man-Month is still relevant here.
The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. The only question is whether to plan in advance to build a throwaway, or to promise to deliver the throwaway to customers.
Fred Brooks, Jr., in The Mythical Man-Month, p 116.
How cheaply can you build a prototype to test your expected worst case?
In this post I will present a large part of real business that belongs in databases. Database solutions in this big business area cannot be handled by "Anchor Modeling" at all.
In the real business world this case happens on a daily basis: a data entry person enters wrong data.
In real-world business, errors happen frequently at the data entry level, and data entry often generates large amounts of erroneous data. So this is a real and big problem, and "Anchor Modeling" cannot solve it.
Anyone who uses an "Anchor Modeling" database can enter incorrect data. This is possible because the authors of "Anchor Modeling" have written that erroneous data can be deleted.
Let me explain this problem by the following example:
A professor of mathematics gave the best grade to the student who had the worst grade. In this high school, professors enter grades into the corresponding database. The student gave money to the professor for this criminal service, and managed to enroll at a university using this false grade.
After the summer holiday, the professor of mathematics returned to school. After deleting the wrong grade from the database, the professor entered the correct one. This school uses an "Anchor Modeling" database, so the math professor deleted the false data exactly as strictly suggested by the authors of "Anchor Modeling".
Now this professor of mathematics, who committed this criminal act, is clean, thanks to the "Anchor Modeling" software.
This example shows that by using "Anchor Modeling" you can commit crimes with data, just by applying the "Anchor Modeling" technology.
In section 5.4, the authors of "Anchor Modeling" wrote the following: "Delete statements are allowed only when applied to remove erroneous data."
You can see this text in the paper "An agile modeling technique using sixth normal form for structurally evolving data", written by the authors of "Anchor Modeling".
Please note that "Anchor Modeling" was presented at the 28th International Conference on Conceptual Modeling and won the best paper award?!
Authors of "Anchor Modeling" claim that their data model can maintain a history! However this example shoes that „Anchor modeling“ can not maintain the history at all.
As „Anchor modeling“ allows deletion of data, then "Anchor modeling" has all the operations with the data, that is: adding new data, deleting data and update. Update can be obtained by using two operations: first delete the data, then add new data.
This further means that Anchor modeling has no history, because it has data deletion and data update.
I would like to point out that in "Anchor modeling" each erroneous data "MUST" be deleted. In the "Anchor modeling" it is not possible to keep erroneous data and corrected data.
"Anchor modeling" can not maintain history of erroneous data.
In the first part of this post, I showed that by using "Anchor Modeling" anyone can do crime with data. This means "Anchor Modeling" runs the business of a company, right into a disaster.
I will give one example so that professionals can see, on a real and important example, how bad "Anchor Modeling" is.
Example
People who are professionals in the database business know that there are thousands and thousands of international standards which have been used successfully as keys in databases.
International standards:
All professionals know what a "VIN" is for cars, what an "ISBN" is for books, and thousands of other international standards.
National standards:
All countries have their own standards for passports, personal documents, bank cards, bar codes, etc.
Local standards:
Many companies have their own standards. For example, when you pay for something, you receive an invoice with a standard key written on it, and that key is also written in the database.
All of the above-mentioned kinds of keys can be checked through a variety of institutions: police, customs, banks, credit card companies, post offices, etc. You can check many of these keys on the internet or by phone.
I believe that the percentage of databases which have entities with standard keys, as presented in this example, is more than 95%.
For all of the above cases the "anchor surrogate key" is nonsense, yet "Anchor Modeling" exclusively uses the anchor surrogate key.
In my solution, I use all the keys that are standard on a global or local level and are simple.
Vladimir Odrljin
As background, I'm one of two developers in my department. I got into computers my freshman year in high school (1986) and have no formal education. I got into MS Access a little bit in 1994 and more seriously beginning in 2003. I'm self-educated, have always tried to learn as much as I can about database design, and while I believe I know a lot I also know I don't know everything.
The other developer in my department, according to his resume, has a degree in computer science and has been doing IT work, including web design and database design, for about 8 years. He was hired into my department last December. I've been very surprised by what I see as a very fundamental lack of knowledge about the basics of database design and SQL and have been trying to figure out if at least part of the problem is I'm expecting too much or maybe don't know as much as I think I do.
Hence my question. Please note we are 100% MS Access, but I believe this question applies to about any SQL database. This developer was tasked to take a spreadsheet and convert it into a database. Part of the spreadsheet involved tracking inventory for batteries. In the spreadsheet, the column titles were Date and Count. But the data in the date column was a mix of dates and batch numbers. So this developer created a table with a numeric field to contain both the batch number and the date and a second boolean field called IsDate to indicate what value was in the field.
I disagree with this approach and would have created two separate fields, a date field for the date and a numeric field for the batch number. When I suggested this approach, he seemed to not only not understand why but also to get a bit angry about having to change his design.
Which approach would you recommend? Also, assuming everyone agrees with my approach - of course you will! ;) - if you had a developer with this supposed level of experience, would you consider him worth keeping and worth investing the time and effort to educate him?
My own rule of thumb here is:
Always keep data in a native datatype.
This helps with comparing, sorting, finding and grouping - especially in a database - and makes your storage less prone to query errors. Moreover, you're not required to use another predicate (AND IsDate) when accessing the data. Hence, I think your approach is correct.
Your colleague's approach seems to be not a matter of education, but one of personal attitude. I've seen workers with PhDs who could well listen to a well-reasoned argument, and freshmen who made grave mistakes and would not listen to polite advice.
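For the battery example, a minimal sketch of the two-column approach in generic SQL (Access's designer and DDL dialect will differ slightly, and the names are guesses at what the spreadsheet tracked):

    -- Each value lives in its own column with its native type; the one that
    -- doesn't apply to a given row is simply left NULL.
    CREATE TABLE BatteryInventory (
        EntryID      INTEGER PRIMARY KEY,
        EntryDate    DATE    NULL,        -- native date type, no IsDate flag needed
        BatchNumber  INTEGER NULL,
        BatteryCount INTEGER NOT NULL
    );

    -- Date-based reporting now works without any type juggling:
    SELECT EntryDate, BatteryCount
    FROM BatteryInventory
    WHERE EntryDate IS NOT NULL
    ORDER BY EntryDate;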
I'd most definitely store the date and the batch number in different fields of the appropriate type - setting each with the relevant content or as NULL if no value was available. By doing this you'd be able to see what data you actually have available and perform meaningful operations on that data.
In terms of your second question, I guess it would really depend on what the developer in question said when you asked them why they'd chosen the approach they did.
You are right.
Only under severe memory restrictions might (note might) this kind of architecture be acceptable.
As to dealing with him, I would first talk to him and figure out why he chose the given approach; this is something that might have been common in Access databases 10 years ago (but even then there was enough disk and memory space not to have to resort to these kinds of tricks).
His reluctance to talk about his design is a worse indicator of his abilities than the design itself. Even the most misguided design should have been based on a structured approach or idea. In my mind it is not a bad thing to be wrong; it is a bad thing to create random structures. But not knowing your requirements, it is hard to say whether it is worth keeping him or not.
Is one of you the 'senior' hierarchy-wise, or are you sharing responsibilities?
Point out that he is breaking first normal form by doing so. Be able to describe 1NF, 2NF and 3NF before trying to impress him with your fancy-pants knowledge.