Where should I begin with this database design?

I have 5 tables, all unnormalised, and I need to create an ER model and a logical model, normalise the data, and also write a bunch of queries.
Where would you begin? Would you normalise the data first? Create the ER model and the relationships?

There are two ways to start data modelling: top-down and bottom-up.
The top-down approach is to ask what things (tangible and intangible) are important to your system. These things become your entities. You then go through the process of figuring out how your entities are related to each other (your relationships) and then flesh out the entities with attributes. The result is your conceptual or logical model. This can be represented in ERD form if you like.
Either along the way or after your entities, relationships and attributes are defined, you go through the normalization process and make other implementation decisions to arrive at your physical model - which can also be represented as an ERD.
The bottom-up approach is to take your existing relations - i.e. whatever screens, reports, datastores, or other existing data representations you have - and then perform a canonical synthesis to reduce the entire set of data representations into a single, coherent, normalized model. This is done by normalizing each of your views of the data and looking for commonalities that let you bring items together into a single model.
Which approach you use depends a little bit on personal choice, and quite a bit on whether you have existing data views to start from.
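To make the normalization step concrete, here is a minimal sketch, assuming a hypothetical unnormalised order list that repeats customer details on every row; normalizing splits it so each fact is stored exactly once (all table and column names are invented):

-- Before: one wide relation like (order_id, customer_name, customer_phone, product, qty)
-- After: each fact lives in exactly one place.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    phone       VARCHAR(30)
);

CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id)
);

CREATE TABLE order_line (
    order_id INTEGER NOT NULL REFERENCES customer_order (order_id),
    product  VARCHAR(100) NOT NULL,
    qty      INTEGER NOT NULL,
    PRIMARY KEY (order_id, product)
);

The same end point is reached whichever approach you start from; only the route differs.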

I think you should first prepare the list of entities and attributes, so that you have a complete picture of the data.
Once you are clear about the data, you can start creating the master tables and normalising them.
Then, after the complete database design is normalised, you can create the ER diagram very easily.

I would start by evaluating the data and then preparing the list of entities and attributes within it.
I would then proceed in this order:
Identify the relationships.
Create the ER model.
Normalise the data.
I know many others will have a different opinion, but this is the way I would go ahead with it :)

Related

I have a problem converting a logical model to a relational model in SQL Data Modeler

I am trying to develop a database for my homework. I designed a logical model in SQL Data Modeler. I tried to convert it to a relational model, but the relations were created as tables, not as relations. I watched some videos on YouTube and built the same tables, but I got the same problems again. Where is my mistake and how can I fix it? Thank you so much...
[Screenshot: Logical Model]
[Screenshot: Relational Model]
There are different terms used in data modelling, and several types of model provided by different tools and named differently, so exactly what a Relational Model is (I'd never heard of that one) will depend on the tool you are using. From the screenshot, it appears to be a physical model, which means tables, columns, foreign keys, indexes, storage details etc. If you wanted something different then please describe what it is.
Traditionally there is:
- a logical model, also known as an entity model or entity relationship diagram (ERD), which is database-agnostic and merely models business entities, their attributes, unique keys and relationships.
- a physical model, defining a schema ready for implementing in a specific vendor's RDBMS. This has tables with columns, primary keys, check constraints, indexes etc.
In my experience, tools often blur the line between entities and tables.
The way I learned it, an entity is a conceptual item of interest to the business, like "Doctor" or "Patient", because the purpose of the model is to show what a Doctor is and what a Patient is and how they are related, and that is why they are named in the singular. A database table, however, (in the physical model) will usually contain rows representing many Doctors and Patients and so the plural is appropriate. Opinions differ, though.
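To illustrate, a hedged sketch of how that plays out in the physical model (column names are invented): the logical model shows the singular entities "Doctor" and "Patient" and their relationship, while the tables that implement them take the plural.

CREATE TABLE doctors (
    doctor_id INTEGER PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    doctor_id  INTEGER REFERENCES doctors (doctor_id) -- the Doctor-Patient relationship
);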

SSAS Tabular - Multiple Models?

We're starting to build an SSAS tabular model and wondering if most people have one model or multiple. If multiple, do you duplicate tables that are needed by each, or is there a way to share tables between models? I think I know the answer, but I'm hoping those with more experience can confirm what we've found...
From what I've researched I think...
- you can't share tables across models - any "common" tables would have to be duplicated in and deployed with each model and would take up memory
- we should create one model, use perspectives to organize the tables and make it easier to work with
- multiple models could be acceptable if there is little or no common data across models
thanks
You are correct, there is no way to share tables between models.
Perspectives can help.
The question of whether to have one model or more depends on the user audience. Who are the users? How analytically savvy are they? Will they have a reasonable understanding of the model structure?
One issue that affects my rather unsophisticated users is when a dimension does not relate to all fact tables. In this scenario, as is expected, measures on the fact table calculate identical values for every member of the unrelated dimension. For less knowledgeable users, this situation is confusing.
I agree with Ari's answer, and am posting this answer to explain my own experience.
We use a few large models for more sophisticated users; these are in memory and processed once a week. We have agreed with the business that these models will not be available during processing, so we can process without holding a transaction open. That lets us keep many more smaller models on the instance, because we do not need to keep twice the size of our largest model available to it. We use perspectives to simplify the presentation and reduce the confusion caused by the multiple fact tables. Even with perspectives, the models are rather complex, and it takes some training to get users used to working with the different facts.
We also use smaller models, usually targeted at a specific audience or need. Many are processed daily and use transactional processing to ensure they are available to users as much as possible. Several dimensions are used in more than one of our small models, but we are able to filter them so that users do not see the full list of members. This reduces size and has been a huge benefit for my users, because they only see members that relate to a fact they are analyzing, instead of every member associated with any fact.
We use views to ensure conformity between models when a dimension is used in multiple models. In my opinion this is very important, as it is very confusing when I have the same dimension with slightly different attribute names.
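As a hedged sketch of that approach (schema, table and column names here are invented), every model sources the dimension from the same view so the attribute names stay identical, and a filtered variant serves the smaller targeted models:

-- Conformed dimension: every model points at this view.
CREATE VIEW v_dim_customer AS
SELECT customer_key,
       customer_name,
       region
FROM dw.dim_customer;

-- Filtered variant for a small model: only members that appear in its fact.
CREATE VIEW v_dim_customer_sales AS
SELECT d.customer_key,
       d.customer_name,
       d.region
FROM dw.dim_customer AS d
WHERE EXISTS (SELECT 1
              FROM dw.fact_sales AS f
              WHERE f.customer_key = d.customer_key);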
To sum up (pun intended)...
I like developing and working with large models. I think they answer more questions with less work.
Most users I have worked with prefer smaller, more concise models. Your server hardware/processing requirements may direct you to smaller models as well, even though some of the dimensions will be duplicated.

How is a graph database different to a graph represented in a relational database?

I can represent a graph trivially in a relational database with two tables: vertex and edge. Richer structure like "properties" and "labels" (in Neo4j terminology) can be represented as more tables. Have I misunderstood, or does a graph database like Neo4j allow me to represent anything that is not easily representable relationally?
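For concreteness, a minimal sketch of that two-table representation (names are illustrative):

CREATE TABLE vertex (
    vertex_id INTEGER PRIMARY KEY,
    label     VARCHAR(100)
);

CREATE TABLE edge (
    src_id INTEGER NOT NULL REFERENCES vertex (vertex_id),
    dst_id INTEGER NOT NULL REFERENCES vertex (vertex_id),
    label  VARCHAR(100),
    PRIMARY KEY (src_id, dst_id)
);
-- "Properties" could be further key/value tables keyed on vertex_id or (src_id, dst_id).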
I can query this graph using SQL, with recursive subqueries if necessary, and with multiple separate queries in a transaction if necessary. Have I misunderstood, or does a graph query language like Cypher provide greater expressivity than SQL?
The relational model of a graph is stored and queried efficiently, AFAIK. Does a graph database structure its storage, or optimize its queries, in some way that provides performance characteristics that cannot be gained from a relational database?
My relational database provides ACID guarantees, and allows me to write fairly expressive constraints on my graph data (and even more constraints if I break out the single vertex table into a properly normalized schema). Have I misunderstood, or does a graph database provide some guarantees or verify some kind of correctness properties that are not available in my relational database?
I am struggling to see how a graph database such as Neo4j is anything but a subset of the relational model. (Apologies for using Neo4j as representative of all graph databases here; it's the only one I've looked at.)
In short: Is graph database ⊆ relational database?
Is One a Subset of the Other?
Definitely not; both are ultimately modeled on the mathematical concepts of relations or graphs. Both models are so general that there is basically no information you cannot represent using either one. This means that while they differ in matters of syntactic sugar, and in the way they encourage you to model and think about data (just as programming languages differ), they both have the same "expressive power".
What you describe in your question is one way of modeling a graph (vertex and edge tables). That implementation of a graph is a subset of what relational can express. Similarly, I could mock up tables and rows using a graph database, but I would have chosen a particular implementation - this wouldn't demonstrate that relational data is a subset of graph data.
So the first insight is that they have roughly equal expressive power. You can model anything in either. So the real question you should be asking is why would you choose one over the other?
Why Would you Choose One Over The Other?
All databases exist to facilitate data access. Simply put, you store it so that you can get at the data. But exactly how do you need to get at the data? There are many different access patterns. The design space for databases in general is enormous. Any time a database makes a certain decision, that tends to automatically make it better at some things, worse at others. For example, when you create an index in a relational database, you've just sped up reads -- but you've degraded the performance of writes, because the index has to be maintained.
So, when approaching the question "graph or relational?", you should first figure out what your data looks like and what your data access patterns look like. Once you know those things, you can evaluate a set of databases, see the choices they have made, and pick the one that is a good fit for what you need. And if a DBMS made a choice that would make certain access patterns difficult, buggy, or slow, you can avoid that DBMS for that data set.
It's (Partly) About Data Access Patterns
Graph databases tend to be better than relational when the data being stored is a graph, when the data access pattern involves a lot of graph traversal, or both. (See this other answer I wrote for a more in-depth discussion of why this is). That link there also provides the answer to your specific question: "Does a graph database structure its storage, or optimize its queries, in some way that provides performance characteristics that cannot be gained from a relational database?"
You say: "I can query this graph using SQL, with recursive subqueries if necessary, and with multiple separate queries in a transaction if necessary." Technically this is true, but let's take an example to see why relational might not be good enough. Say I have a graph (in an RDBMS: a table of nodes and a table of edges, with a join key between them). Let's say I pick out one node, and I want to identify everything that is between 6 and 8 hops away from that node. Here's the Cypher to do that:
match (myChosenNode {id: 'foo'})-[r:relationshipType*6..8]->(y) return y;
I really want to see you write that up as SQL. It's possible, but it's hard and complicated. And it will also perform like a dog, because of the sheer quantity of joining you'll be doing on non-trivial quantities of data.
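For the record, here is roughly what it looks like as a recursive CTE - a sketch only, assuming hypothetical node(id) and edge(src_id, dst_id, rel_type) tables with text ids, in PostgreSQL-style syntax, and note that it does not even attempt to handle cycles:

WITH RECURSIVE hops (node_id, depth) AS (
    -- start from the chosen node's direct neighbours
    SELECT e.dst_id, 1
    FROM edge e
    WHERE e.src_id = 'foo'
      AND e.rel_type = 'relationshipType'
    UNION ALL
    -- follow edges outward, one hop at a time, up to 8 hops
    SELECT e.dst_id, h.depth + 1
    FROM hops h
    JOIN edge e ON e.src_id = h.node_id
    WHERE e.rel_type = 'relationshipType'
      AND h.depth < 8
)
SELECT DISTINCT n.*
FROM hops h
JOIN node n ON n.id = h.node_id
WHERE h.depth BETWEEN 6 AND 8;

Every hop is another self-join over the edge table, which is exactly where the performance goes.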
ACID
OK, now on to the ACID guarantees: Neo4j provides transactions with ACID guarantees. The answer will be different for different graph databases, though, particularly the ones implemented on top of Hadoop/HBase. YMMV there, so check the fine print for each database.
It is true that there are a number of RDBMS features you typically won't find in graph databases, triggers and certain kinds of constraints being examples. As a long-time RDBMS nerd myself, I'm not so happy about those things being missing; I think they are valuable.
Summary
What this mostly boils down to for me, and many other engineers I work with is:
What is your data?
What are your access patterns?
If your data is a graph, or your access patterns involve a lot of graph traversal, you should probably use a graph DB. If your data is more tabular, or your access patterns are more oriented around bulk scans, then you should use an RDBMS. At the end of the day, they're two different tools with different niches. If you use them in their areas of strength, you'll be happy. If you use an RDBMS to model a graph just "because you can", you'll suffer. If you use a graph database to do a lot of bulk scans of every node in every graph, you'll suffer. Like most of tech, it's about using the right tool for the job.

Advantage(s) of cube/tabular model over relational star schemas

I am wondering whether cubes or tabular models have any advantages over star schemas other than MDX/DAX query speed. Any feedback would be very much appreciated. Thanks.
Christian
When you say "advantages over star schemas", I am assuming that you mean a star schema in a relational database? The primary difference is the potentially orders-of-magnitude difference in speed, but in the area of self-service BI, a bigger advantage of cubes or models is that they implement an entirely new semantic layer. They give you the opportunity to rename fields that may have obscure names in the DB to more useful, recognisable names for the business users, and to hide more technical fields that are not useful to end users. You can define reusable named sets and hierarchies that enable easier, more effective and consistent reporting.
But the two biggies for me are the speed and the business-user-friendly semantic layer. JK.

Best practice for storing a neural network in a database

I am developing an application that uses a neural network. Currently I am looking at putting it either into a relational SQL database (probably SQL Server) or a graph database.
From a performance viewpoint, the neural net will be very large.
My questions:
Do relational databases suffer a performance hit when dealing with a neural net in comparison to graph databases?
What graph-database technology would be best suited to dealing with a large neural net?
Can a geospatial database such as PostGIS be used to represent a neural net efficiently?
That depends on how you intend the model to progress.
Do you have a fixed idea of an immutable structure for the network, like a Kohonen map or an off-the-shelf model?
Do you have several relationship structures you need to test out, so that you want to be able to flip a switch and alternate between various structures?
Does your model treat the nodes as fluid automatons, free to seek their own neighbours, where each automaton develops unique values for a common set of parameters and you need to analyse how those values affect its "choice" of neighbours?
Do you have a fixed set of parameters for a fixed number of types/classes of nodes? Or is a node expected to develop a unique range of attributes and relationships?
Do you have frequent need to access each node, especially those embedded deep in the network layers, to analyse and correlate them?
Is your network perceivable as, or quantizable into, a set of state machines?
Disclaimer
First of all, I need to disclaim that I am familiar only with Kohonen maps. (So, I admit to having been derided because Kohonen maps are entry-level, barely neural-network material.) The above questions are the consequence of personal mental exploits I've had over the years, fantasizing after random and lowly-educated reading of various neural schemes.
Category vs Parameter vs Attribute
Can we classify vehicles by the number of wheels or by tonnage? Should wheel count or tonnage be attributes, parameters or category characteristics?
Understanding this debate is a crucial step in structuring your repository. It is especially relevant to disease and patient vectors. I have seen patient-information relational schemata, designed by medical experts who obviously had little training in information science, that presume a common set of parameters for every patient, with thousands of columns, mostly unused, in each patient record. And when they exceed the column limit for a table, they create a new table with thousands more sparsely used columns.
Type 1: All nodes have a common set of parameters and hence a node can be modeled into a table with a known number of columns.
Type 2: There are various classes of nodes. There is a fixed number of classes of nodes. Each class has a fixed set of parameters. Therefore, there is a characteristic table for each class of node.
Type 3: There is no intent to pigeon-hole the nodes. Each node is free to develop and acquire its own unique set of attributes.
Type 4: There is a fixed number of classes of nodes. Each node within a class is free to develop and acquire its own unique set of attributes, but each class restricts the set of attributes a node is allowed to acquire.
Read up on the EAV model to understand the issue of parameters vs attributes. In an EAV table, a node needs only three characterising columns:
node id
attribute name
attribute value
However, under the constraints of technology, an attribute value could be a number, a string, an enumerable or a category. Therefore, there would be four more attribute tables, one for each value type, plus the node table:
node id
attribute type
attribute name
attribute value
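A minimal sketch of that layout, with invented table and column names (only the numeric table is shown; the string, enumerable and category tables follow the same pattern):

CREATE TABLE node (
    node_id INTEGER PRIMARY KEY
);

-- One typed attribute table per value type; this is the numeric one.
CREATE TABLE numeric_attribute (
    node_id         INTEGER NOT NULL REFERENCES node (node_id),
    attribute_name  VARCHAR(100) NOT NULL,
    attribute_value NUMERIC NOT NULL,
    PRIMARY KEY (node_id, attribute_name)
);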
Sequential/linked access versus hashed/direct-address access
Do you have to access individual nodes directly rather than traversing the structural tree to get to a node quickly?
Do you need to find a list of nodes that have acquired a particular trait (set of attributes), regardless of where they sit topologically in the network? Do you need to perform classification (along the lines of principal component analysis) on the nodes of your network?
State-machine
Do you wish to perceive the regions of your network as a collection of state-machines?
State machines are very useful quantization entities. State-machine quantization helps you form empirical entities over a range of nodes, based on neighbourhood similarities and relationships.
Instead of trying to understand and track the individual behaviour of millions of nodes, why not clump them into regions of similarity and track the state-machine flow of those regions?
Conclusion
This is my recommendation: start with a totally relational database. The reason is that a relational database and its SQL give you a very liberal view of relationships. With SQL on a relational model, you can query and correlate relationships that you did not know existed.
As your experiments progress, if you find that certain relationships are modeled more suitably in a network-graph repository, you can move those parts of the schema to such a repository.
In the final state of affairs, I would maintain a dual-mode information repository: store the dynamically mutating structure in a network-graph repository, but have each node refer to a node id in a relational database, where you keep track of the nodes and their attributes. The relational database then allows you to query nodes based on attributes and their values. For example,
SELECT a.id
FROM Nodes a, NumericAttributes b
WHERE b.attributeName = $name
  AND b.value BETWEEN $low AND $high
  AND a.id = b.id
I am thinking that perhaps Hadoop could be used instead of a traditional network-graph database, but I don't know how well Hadoop adapts to dynamically changing relationships. My understanding is that Hadoop is good for write-once, read-by-many workloads, and a dynamic neural network may involve frequent relationship changes that it does not handle well. Then again, a relational table modeling network relationships is not efficient for frequent changes either.
Still, I believe I have only raised questions you need to consider, rather than providing a definite answer, especially given my rusty knowledge of many of these concepts.
Trees can be stored in a table by using self-referencing foreign keys. I'm assuming the only two things that need to be stored are topology and the weights; both of these can be stored in a flattened tree structure. Of course, this can require a lot of recursive selects, which depending on your RDBMS may be a pain to implement natively (thus requiring many SQL queries to achieve). I cannot comment on the comparison, but hopefully that helps with the relational point of view :)
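As a hedged sketch of that (names invented, PostgreSQL-style recursive CTE syntax), the self-referencing table and one recursive select might look like this:

-- Adjacency list: each neuron points at its parent; NULL marks the root.
CREATE TABLE neuron (
    neuron_id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES neuron (neuron_id),
    weight    REAL NOT NULL
);

-- Pull a whole subtree (topology plus weights) starting from neuron 1.
WITH RECURSIVE subtree AS (
    SELECT neuron_id, parent_id, weight
    FROM neuron
    WHERE neuron_id = 1
    UNION ALL
    SELECT n.neuron_id, n.parent_id, n.weight
    FROM neuron n
    JOIN subtree s ON n.parent_id = s.neuron_id
)
SELECT * FROM subtree;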