Automatically connect SQL tables based on keys - sql

Is there a method to automatically join tables that have primary to foreign relationship rather then designate joining on those values?

The out and out answer is "no" - no RDBMS I know of will allow you to get away with not specifying columns in an ON clause intended to join two tables in a non-cartesian fashion, but it might not matter...
...because typically multi tier applications these days are built with data access libraries that DO take into account the relationships defined in a database. Picking on something like entity framework, if your database exists already, then you can scaffold a context in EF from it, and it will make a set of objects that obey the relationships in the frontend code side of things
Technically, you'll never write an ON clause yourself, because if you say something to EF like:
context.Customers.Find(c => c.id = 1) //this finds a customer
.Orders //this gets all the customer's orders
.Where(o => o.date> DateTIme.UtcNow.AddMonths(-1)); //this filters the orders
You've got all the orders raised by customer id 1 in the last month, without writing a single ON clause yourself... EF has, behind the scenes, written it but in the spirit of your question where there are tables related by relation, we've used a framework that uses that relation to relate the data for the purposes thtat the frontend put it to.. All you have to do is use the data access library that does this, if you have an aversion to writing ON clauses yourself :)
It's a virtual certaintythat there will be some similar ORM/mapping/data access library for your front end language of choice - I just picked on EF in C# because it's what I know. If you're after scouting out what's out there, google for {language of choice} ORM (if you're using an OO language) - you mentioned python,. seems SQLAlchemy is a popular one (but note, SO answers are not for recommending particular softwares)

If you mean can you write a JOIN at query time that doesn't need an ON clause, then no.
There is no way to do this in SQL Server.

I am not sure if you are aware of dbForge; it may help. It recognises joinable tables automatically in following cases:
The database contains information that specifies that the tables are related.
If two columns, one in each table, have the same name and data type.
Forge Studio detects that a search condition (e.g. the WHERE clause) is actually a join condition.

Related

Repository Pattern Dilemma: Redundant Queries vs. Database Round Trips

This is the situation:
Say I have an application in which two entity types exist:
Company
Person
Moreover, Person has a reference to Company via Person.employer, which denotes the company a person is employed at.
In my application I am using repositories to separate the database operations from my business-model related services: I have a PersonRepository.findOne(id) method to retrieve a Person entity and a CompanyRepository.findOne(id) method to retrieve a Company. So far so good.
This is the dilemma:
Now if I make a call to PersonRepository.findOne(id) to fetch a Person entity, I also need to have a fully resolved Company included inline via the Person.employer property – and this is where I am facing the dilemma of having two implementation options that are both suboptimal:
Option A) Redundant queries throughout my repositories but less database round trips:
Within the PersonRepository I can build a query which selects the user and also selects the company in a single query – however, the select expression for the company is difficult and includes some joins in order to assemble the company correctly. The CompanyRepository already contains this logic to select the company and rewriting it in the UserRepository is redundant. Hence, ideally I only want the CompanyRepository to take care of the company selection logic in order to avoid having to code the same query expression redundantly in two repositories.
Option B): Separation of concerns without query-code redundancy but at the price of additional db roundtrips and repo-dependencies:
Within the PersonRepository I could reference the CompanyRepository to take care of fetching the Company object and then I would add this entity to the Person.employer property in the PersonRepository. This way, I kept the logic to query the company encapsulated inside the CompanyRepository by which a clean separation of concerns is achieved. The downside of this is that I make additional round trips to the database as two separate queries are executed by two repositories.
So generally speaking, what is the preferred way to deal with this dilemma?
Also, what is the preferred way to handle this situation in ASP.NET Core and EF Core?
Edit: To avoid opinion based answers I want to stress: I am not looking for a pros and cons of the two options presented above but rather striving for a solution that integrates the good parts of both options – because maybe I am just on the wrong track here with my two listed options. I am also fine with an answer that explains why there is no such integrative solution, so I can sleep better and move on.
In order to retrieve a company by ID you need to read Person's data, and fetch company ID from it. Hence if you would like to keep company-querying logic in a single place, you would end up with two round-trips - one to get company ID (along with whatever other attributes a Person has) and one more to get the company itself.
You could reuse the code that makes a company from DbDataReader, but the person+company query would presumably require joining to "forward" person's companyId to the Company query, so the text of these queries would have to be different.
You could have it both ways (one roundtrip, no repeated queries) if you move querying logic into stored procedures. This way your person_sp would execute company_sp, and return you all the relevant data. If necessary, your C# code would be able to harvest multi-part result set using reader.NextResult(). Now the "hand-off" of the company ID would happen on RDBMS side, eliminating the second round-trip. However, this approach would require maintaining stored procedures on RDBMS side, effectively shipping some repository logic out of your C# code base.

sql, define Separate relation to target table or get by joins

We're working on a CMS project with EF and MVC. We've Recently encountered a problem,
Please consider these tables:
Applications
Entities
ProductsCategories
Products
Relations are in this order:
Applications=>Entities=>ProductCategories=>Products
When we select a product by it's Id, always we should check if requested ProductsId is
just for a specific application stored in Applications table, These is for preventing load other applications products,
what is the best way to get a product for specific application id, We have two choice:
Instead of define a relation between products and applications we can do joins with productsCategories,entities, and applications to find it
=> when we want to get products we don't want to know about entities or other tables that we should join it to access applications
we can define a separate relation between products and applications and get it by simple select query
which of these is the best way and why?
Manish first thanks for your comment,Then please consider this that some of our tables does not have any relation with Entities for these tables we should define a relation with Entites to access Applications or define a separate as relation as mentioned above,For these tables we just define a relation and does not have extra work,except performance issue.still some of other tables has relations with entites so for this one defining a separat relation has extra work,
At last please consider this,in fact all of tables should access 'Entities' some by separate relation and others can access from there parents
actually for relation between products and entities we didn't define a separate relation because it doesn't has performance issue,But for relation between products and entities we should consider performance issue because in every request we should access Applications to check request Id is for current Application
So what is your idea?
Let's look at your options
Instead of defining a relationship, you can join the three tables to get the correct set of products: In this case, you won't have to make any database changes and anyway, you won't be fetching all the joined tables data, you would fetch only that data, which you have specified in your Linq Select List. But then, 3-tables join can be a little performance degrading when the number of rows will be very high at some point of time
You can define a separate relationship between the two said tables: In this case you would have to change your database structure, that would mean, making changes in your Entity and Entity Model, and lot of testing. No doubt, it will mean simple code, ease of usage which is always welcome.
So you see, there is no clear answer, ultimately it depends on you and your code environment what you want to go with, as for me, I would go for creating a separate relationship between the Application and Product entity, cause that would cause a cleaner code with a little less effort. Besides as they say, "Code around your data-structure, and not the otherway around"

Is eager loading same as join fetch?

Is eager fetch same as join fetch?
I mean whether eagerly fetching a has-many relation fires 2 queries or a single join query?
How does rails active record implement a join fetch of associations as it doesnt know the table's meta-data in first hand (I mean columns in the table)? Say for example i have
people - id, name
things - id, person_id, name
person has one-to-many relation with the things. So how does it generate the query with all the column aliases even though it cannot know it when i do a join fetch on people?
An answer hasn't been accepted so I will try to answer your questions as I understand them:
"how does it know all the fields available in a table?"
It does a SQL query for every class that inherits from ActiveRecord::Base. If the class is 'Dog', it will do a query to find the column names of the table 'dogs'. In production mode it should only do this query once per run of the server -- in development mode it does it a lot. The query will differ depending on the database you use, and it is usually an expensive query.
"Say if i have a same name for column in a table and in an associated table how does it resolve this?"
If you are doing a join, it generates sql using the table names as prefixes to avoid ambiguities. In fact, if you are doing a join in Rails and want to add a condition (using custom SQL) for name, but both the main table and join table have a name column, you need to specify the table name in your sql. (e.g. Human.join(:pets).where("humans.name = 'John'"))
"I mean whether eagerly fetching a has-many relation fires 2 queries or a single join query?"
Different Rails versions are different. I think that early versions did a single join query at all times. Later versions would sometimes do multiple queries and sometimes a single join query, based on the realization that a single join query isn't always as performant as multiple queries. I'm not sure of the exact logic that it uses to decide. Recently, in Rails 3, I am seeing multiple queries happening in my current codebase -- but maybe it sometimes does a join as well, I'm not sure.
It knows the columns through a type of reflection. Ruby is very flexible and allows you to build functionality that will be used/defined during runtime and doesn't need to be stated ahead of time. It learns the associated "person_id" column by interpreting the "belongs_to :person" and knowing that "person_id" is the field that would be associated and the table would be called "people".
If you do People.includes(:things) then it will generate 2 queries, 1 that gets the people and a second that gets the things that have a relation to the people that exist.
http://guides.rubyonrails.org/active_record_querying.html

How important are lookup tables?

A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I make are often more complicated because of this. They often look like this
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Often times, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant or so in the application? Something like
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign id, I set it as an enum and queried off of that?
Thanks.
The answer depends a little if you are limited to freeware such as PostGreSQL (not fully SQL compliant), or if you are thinking about SQL (ie. SQL compliant) and large databases.
In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if you those values change, you may have to update millions of rows, which is very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
the second point is ease of change, if Closed were changed to Expired, again, one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows need to be changed.
Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
ENUM is a Non-SQL, do not use it. In SQL the "enum" is a Lookup table.
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well don't change it, why would you want to. Those values were supposed to be meaningless, remember, only a meaningful Key need to be changed.
if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open, as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
Your Example
SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'
Another Exaple
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE "Eq%"
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.
For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.
Create a function for each lookup.
There is no easy way. You want performance and query simplicity. Ensure the following is maintained. You could create a SP_TestAppEnums to compare existing lookup values against the function and look for out of sync/zero returned.
CREATE FUNCTION [Enum_Post](#postname varchar(10))
RETURNS int
AS
BEGIN
DECLARE #postId int
SET #postId =
CASE #postname
WHEN 'Open' THEN 1
WHEN 'Closed' THEN 2
END
RETURN #postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')
Question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened.
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your where clause can use a friend name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute in the where clause ('name' in your example above) is more likely to get changes than the key attribute ('name' in your example above): you're more protected by referencing the desire key value(s) of the domain table in your join.
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table,
to populate UI controls and the like with user-friendly information,
Naturally, you'd need to suck domain tables into your queries where you you actually required the non-key attributes from the domain table (e.g., descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.
The answer is "whatever makes sense".
lookup tables involve joins or subqueries which are not always efficient. I make use of enums a lot to do this job. its efficient and fast
Where possible (and It is not always . . .), I use this rule of thumb: If I need to hard-code a value into my application (vs. let it remain a record in the database), and also store that vlue in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a peice of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always acheive "perfect" normalization in a database design (for reason of performance, or simply because you CAN take thngs too far in pursuit of acedemic perfection . . .), we may make some concious choices about where we locate our "look-up" values.
Personally, though, I try to stand on my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the databse, I try to keep it out of the Application code (except, of course, to retrieve it from the database . . .). If it is hardcoded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so three years later, some poor soul will be able to ficure out how it broke, if that happens.
The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.

Should I include user_id in multiple tables?

I'm at the planning stages of a multi-user application where each user will only have access their own data. There'll be a few tables that relate to each other, so I could use JOINs to ensure they're accessing only their data, but should I include user_id in each table? Would this be faster? It would certainly make some of the queries easier in the long run.
Specifically, the question is about multiple tables containing the user_id field.
For example, each user can configure categories, items (in those categories), and sub-items against those items. There's a logical path from user, to sub-items through the other tables, but it would require 3 JOINs. Should I just include user_id in all the tables?
Thanks!
This is a design decision in multi-tenant databases. With "root" tables, obviously you have to have the user_id. But in the non-"root" tables, you do have a choice when you are using surrogate PKs.
Say you have users with projects and projects with actions. Projects obviously has to have a user_id, but if actions are tied to one and only one project, then the user_id is redundant, and also violates normal form, since if it was to move to another user's project (probably not likely in your use cases), both the project FK and the user FK would have to be updated. Typically in multi-tenant scenarios, this isn't really a possible scenario, and so the primary key of every table is really a combination of tenant and a unique primary key "within" the tenant (which may also happen to be globally unique).
If you use natural keys extensively in your design, then clearly tenant+natural key is necessary so that each tenant's natural keys can be used. It's only when using surrogates like IDENTITY or GUIDs or sequences, that this becomes an issue, since it is tempting to make the IDENTITY the PK, after all, it is unique by definition.
Having the user_id in all tables does allow you to do certain things in views to enhance security (defense in depth), giving you a little bit of defensive programming (in SQL Server you can restrict all access through inline table valued function - essentially parametrized views - which require the app to specify user_id on every "table" access), and also allows you to easily scale out to multiple databases by forklifting everything on shared keys.
See this article for some interesting insights.
(In a massively multi-parallel paradigm like Teradata, the PRIMARY INDEX determines the amp on which the data lives, so I would think that this is a must to stop redistribution of rows to the other amps.)
In general, I would say you have a tenantid in each table, it should be the first column in the table, in most indexes and should be part of the primary key in most cases, unless otherwise justified. Where possible, it should be a required parameter in most stored procedures.
Generally, you use foreign keys to relate data between tables. In many cases, this foreign key is the user id. For example:
users
id
name
phonenumbers
user_id
phonenumber
So yes, that'd make perfect sense.
If a category can only belong to one user then yes, you need to include the user_id in the category table. If a category can belong to multiple people then you would have a separate table that maps category IDs to user IDs. You can still do this if you have a one to one mapping between the two, but there is no real reason for it.
You don't need to include the user_id in further tables if you can guarantee that those child tables will always be accessed via joining to the category table. If there is a chance that you will access them independantly of the category table then you should also have the user_id on those tables.
The extent to which to normalize can be a difficult decision. One of the best StackOverflow answers on this topic (Database Development Mistakes Made by App Developers) warns against both (1) failing to normalize, and (2) over-normalizing.
You mention that it might be easier "in the long run" to repeat the same data in multiple tables (that is, not to normalize that data). Look at the "Not simplifying complex queries through views" topic in the previous link. If you use views effectively, you will only have to do the 3 join query once when writing the view and then you can use a query with no joins for most purposes.
Most developers tend to under-normalize because it seems simpler. Go ahead and normalize. Use views to simplify your daily queries. When your requiremens get more complex or you decide to add features, you will be glad that you put time into a relational database design.
Alternatively, depending on your toolset, you may want to use a database abstraction layer that does the relational design under the covers while you manipulate higher level data object.
if it is Oracle, then you would probably set up a fine grained security rule to do the joins and prevent certain activities based on the existence of the original user id... (SELECT INSERT UPDATE DELETE etc)
You would need a map between the logged in user and the user_id. You could use uid, but then remember this umber may change if the database is reconstructed after some disaster...