SQL Server: find relations between tables without Foreign Key constraints - sql

I have a SQL Server database with a lot of tables (several hundred), somehow related to each other. All of them have Primary Keys (GUIDs), but only a few of them have actually defined Foreign Key constraints.
I need to find all tables related to a certain table (let's call it TargetTable), both directly and indirectly (through 1, 2 or more intermediate tables), on any column.
My end goal is to get SQL queries (one per related table) which JOIN all the tables between TargetTable and that related table.
For example, suppose 5 tables related to TargetTable are found:
TargetTable - Table1
TargetTable - Table1 - Table2
TargetTable - Table3
TargetTable - Table3 - Table4
TargetTable - Table3 - Table4 - Table5
I need to get 5 separate JOINs.
Is there any SQL query, software, utility, or other way to get the desired SQL code? It would even be enough to get the relations as some convenient graph, so I could parse them with my favourite scripting language and generate the SQL code myself.

You can certainly generate code by looping through information_schema.columns or sys.columns, but I doubt this is going to work as well as you would like.
If they didn't bother to put in FKs, then they have probably done some other awful things as well, like non-standard naming conventions or generic tables.
You are probably better off looking through the SQL queries/procedures in the database to see where most of the relationships are... then you will have to decide for yourself whether tables are related or not.
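For what it's worth, a minimal sketch of the metadata approach, assuming a hypothetical naming convention where referencing columns contain the referenced table's name (your database may well not follow one):
-- Find candidate referencing columns by name (the pattern is hypothetical)
SELECT c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.COLUMN_NAME LIKE '%TargetTable%'
  AND c.TABLE_NAME <> 'TargetTable';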

You can use SQL Server Management Studio for both: a graph via database diagrams (not ideal, but useful): https://www.mssqltips.com/sqlservertip/1816/getting-started-with-sql-server-database-diagrams/ and SQL with joins via the Query Designer (still in SSMS): https://www.mssqltips.com/sqlservertip/1086/sql-server-management-studio-query-designer/
Hope this helps.

You cannot infer relations from the tables alone. To do that you need knowledge of the domain. For example, suppose you have two tables: T1, which contains an int field X, and T2, which has an int field Y. Then there is a relationship R between the rows of T1 and T2, where (r1, r2) is in R if and only if r1.X = r2.Y.
So I would suggest that you construct a model (an ER model, for example) using your knowledge of the domain, and then add the foreign key constraints manually.
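Once domain knowledge confirms a relationship, recording it is straightforward; a hedged example with hypothetical table and column names:
-- Make an inferred relationship explicit (names are hypothetical)
ALTER TABLE Table1
ADD CONSTRAINT FK_Table1_TargetTable
    FOREIGN KEY (TargetTableId) REFERENCES TargetTable (Id);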

Related

Efficiently delete from one table where ID matches another table

I have two tables with a few million records each in a PostgreSQL database.
I'm trying to delete rows from one table where ID matches ID of another table. I have used the following command:
delete from table1 where id in (select id from table2)
The above command has been taking a lot of time (a few hours), which got me wondering whether there is a faster way to do this operation. Will creating indices help?
I have also tried the delete using a join, as suggested by a few people:
delete from table1 join table2 on table1.id = table2.id
But the above command returned a syntax error. Can this be modified to avoid the error?
Syntax
Your second attempt is not legal DELETE syntax in PostgreSQL. This is:
DELETE FROM table1 t1
USING table2 t2
WHERE t2.id = t1.id;
See the "Notes" section for the DELETE command:
PostgreSQL lets you reference columns of other tables in the WHERE condition by specifying the other tables in the USING clause. For example,
[...]
This syntax is not standard.
[...]
In some cases the join style is easier to write or faster to execute than the sub-select style.
Index
Will creating indices help?
The usefulness of indexes always depends on the complete situation. If table1 is big, and much bigger than table2, an index on table1.id should typically help. Typically, id would be your PRIMARY KEY, which is indexed implicitly anyway ...
Also typically, an index on table2 would not help (and would not be used even if it exists).
But as I said: it depends on the complete situation, and you have disclosed precious little.
Other details of your setup might make the deletes expensive: FK constraints, triggers, indexes, locks held by concurrent transactions, table and index bloat ...
Or non-unique rows in table2. (But I would assume id to be unique?) In that case you would first extract a unique set of IDs from table2. Depending on cardinalities, a simple DISTINCT or more sophisticated query techniques would be in order ...
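A sketch of that variant, deleting against a de-duplicated set of IDs (assuming a simple DISTINCT is enough):
DELETE FROM table1 t1
USING (SELECT DISTINCT id FROM table2) t2
WHERE t2.id = t1.id;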

SQL insert query takes too much time in migration

I am migrating data from an un-normalized database to a normalized one. I could migrate almost all the data, but I got to a point where one query takes around 5 minutes, and I think that's too much.
Here is the entity-relationship diagram:
[Diagram of the normalized database]
And a picture of the un-normalized database:
[Diagram of the un-normalized database]
The table I want to populate, and where I have the problem, is "Items". The query is:
INSERT INTO LOS_CAPOS.Items (Item_Factura_Nro, Item_Compra_Cod, Item_Factura_Monto, Item_Factura_Cantidad, Item_Factura_Descripcion)
SELECT f.Factura_Nro, c.Compra_Cod, Item_Factura_Monto, Item_Factura_Cantidad, Item_Factura_Descripcion
FROM LOS_CAPOS.Facturas f
INNER JOIN gd_esquema.Maestra m ON f.Factura_Nro = m.Factura_Nro
INNER JOIN LOS_CAPOS.Compras c ON c.Compra_Fecha = m.Compra_Fecha AND c.Compra_Cantidad = m.Compra_Cantidad
Facturas has 7,664 rows and Compras has 78,327 rows.
Thanks!
Start by testing the SELECT alone: comment out one join at a time (and the related columns coming from that table) and see which lookup is causing the slowness. After that, check whether you can use other columns that are indexed to do the lookup. Ideally you would join LOS_CAPOS.Compras on its PK. If you can't, start testing: pick a column, create a non-clustered index, and test all SELECT/INSERT/UPDATE/DELETE operations on that table to see the impact.
Any query tuning/optimisation can only be done by examining the query plan. And you need to know that an index will slow down INSERT/UPDATE/DELETE operations, as the index needs to be updated as well. So there are different indexing scenarios depending on which table, which column, and the read-vs-write balance; no single solution cures slowness.
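If you do end up testing an index for this particular query, a sketch might look like the following; the key columns come from the join in the question, and the INCLUDE column avoids lookups for the selected Compra_Cod (measure the impact before keeping it):
-- Hypothetical supporting index for the Compras lookup
CREATE NONCLUSTERED INDEX IX_Compras_Fecha_Cantidad
ON LOS_CAPOS.Compras (Compra_Fecha, Compra_Cantidad)
INCLUDE (Compra_Cod);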

Will a SQL DELETE with a sub query execute inefficiently if there are many rows in the source table?

I am looking at an application and I found this SQL:
DELETE FROM Phrase
WHERE PhraseId NOT IN(SELECT Id FROM PhraseSource)
The intention of the SQL is to delete rows from Phrase that are not in the PhraseSource table.
The two tables are identical and have the following structure
Id - GUID primary key
...
...
...
Modified int
The ... lines stand for about ten columns containing text and numeric data. The PhraseSource table may or may not contain more recent rows, with a higher number in the Modified column and different text and numeric data.
Can someone tell me: will this query execute the SELECT Id FROM PhraseSource for every row in the Phrase table? If so, is there a more efficient way this could be coded?
1. Will this query execute the SELECT Id from PhraseSource for every row?
No.
In SQL you express what you want to do, not how you want it to be done [1]. The engine will create an execution plan to do what you want in the most performant way it can.
For your query, executing the query for each row is not necessary. Instead the engine will create an execution plan that executes the subquery once, then does a left anti-semi join to determine what IDs are not present in the PhraseSource table.
You can verify this when you include the Actual Execution Plan in SQL Server Management Studio.
2. Is there a more efficient way that this could be coded?
A little bit more efficient, as follows:
DELETE p
FROM Phrase AS p
WHERE NOT EXISTS (
    SELECT 1
    FROM PhraseSource AS ps
    WHERE ps.Id = p.PhraseId
);
This has been shown in tests done by user Aaron Bertrand on sqlperformance.com: Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?:
Conclusion
[...] for the pattern of finding all rows in table A where some condition does not exist in table B, NOT EXISTS is typically going to be your best choice.
Another benefit of using NOT EXISTS with a correlated subquery is that it does not have problems when PhraseSource.Id can be NULL. I suggest you read up on IN/NOT IN vs NULL values in the subquery. E.g. you can read more about that on sqlbadpractices.com: Using NOT IN operator with null values.
The PhraseSource.Id column is probably not nullable in your schema, but I prefer using a method that is resilient in all possible schemas.
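To make the NULL pitfall concrete, a tiny self-contained illustration (hypothetical values):
-- 1 NOT IN (2, NULL) evaluates to UNKNOWN, not TRUE,
-- and a WHERE clause only passes rows where the predicate is TRUE
SELECT CASE WHEN 1 NOT IN (2, NULL) THEN 'would match' ELSE 'filtered out' END;
-- returns 'filtered out'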
[1] Exceptions exist, e.g. when forcing the engine to use a specific path with Table Hints or Query Hints. The engine doesn't always get things right.
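For illustration only, such a hint on this statement might look like the sketch below; hints are a last resort, not a recommendation:
DELETE p
FROM Phrase AS p
WHERE NOT EXISTS (
    SELECT 1 FROM PhraseSource AS ps WHERE ps.Id = p.PhraseId
)
OPTION (HASH JOIN); -- forces hash joins for this statement only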
In this case the sub-query could be evaluated for each row if the database system is not smart enough (though in the case of MS SQL Server, I suppose it should be able to recognize that the subquery does not need to be evaluated more than once).
Still there is a better solution:
DELETE p
FROM Phrase p
LEFT JOIN PhraseSource ps ON ps.Id = p.PhraseId
WHERE ps.Id IS NULL
This uses a LEFT JOIN, which matches the rows of both tables, leaving the ps columns NULL where there is no match. Now you just check the right side for NULLs to see which Phrases do not have a match, and delete those.
All types of JOIN statements are very nicely described in this answer.
Here you can see three different approaches to a similar issue compared on MySQL. As @Drammy mentions, to actually see the performance of a given approach, look at the execution plan on your target database and do performance testing on the different approaches to the same problem.
That query should optimise into a join. Have you looked at the execution plan?
If you're experiencing poor performance, it is likely because of the GUID primary keys.
A primary key is clustered by default. If the GUID primary key is clustered on your table, the data in the table is physically ordered by that key. The problem with random GUIDs as clustered keys is that rows arrive in no particular key order, so inserts and deletes keep splitting pages and shuffling data around on disk.
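One quick way to see whether fragmentation is actually a factor (a sketch using the table name from the question; 'LIMITED' keeps the scan cheap):
SELECT index_id, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('Phrase'), NULL, NULL, 'LIMITED');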
This article is a good read on the topic:
https://blog.codinghorror.com/primary-keys-ids-versus-guids/

TSQL: Best way to get data from temp/scratch table to normalized version?

I'm working with relatively large data sets (~200 GB). The data comes from text files that are imported into SQL via a script. They are bulk-copied into a temp table, with the normalized tables waiting to receive the data.
My question comes from the fact that I'm mostly a scripter, so my instinct would be to loop through each row and do individual checks per row to put the data where it needs to go, but I read a different post on SO saying that's really the wrong approach for SQL.
So my question is, if I have one temp table (31 columns) that is to be normalized between 5 others, what's the best way to go about this?
Table relationship is as follows:
System - Table that contains machine information (e.g. name, domain, etc.)
File - File information (e.g. name, size, directory, etc.)
SystemFile - The many-to-many system<->file relationship table.
Metadata - File metadata (language, etc.) - has foreign key relationship to file primary key
DigitalSignature - File digital signature status - has foreign key relationship to file primary key
Thanks
I don't have any links, and I don't have enough experience with things like SSIS etc. to give a balanced view, but when doing the task you are talking about, my normal process would be (a generic, simple version):
1. Look at the normalised data set and consider the least dependent components in the data being imported (e.g. order headers are created before order items).
2. Create queries that select out the data I will have. These often have this form:
select
    t.x, t.y, t.z
from
    temp_table as t
    left outer join normalise_table as n
        on  t.x = n.x
        and t.y = n.y
        and t.z = n.z
where
    n.x is null
Here temp_table may have lots of columns, but these three represent whatever normalised nugget I want to add first; the left outer join and the IS NULL check make sure I only get the new values (the same idea applies if you are merging).
Verify that I am getting good information and that I am only getting the new rows I want. Often you have to use GROUP BY or DISTINCT on the temp data to get accurate data for inserting, something like:
select
    t.x, t.y, t.z
from
    (select distinct
         x, y, z
     from
         temp_table) as t
    left outer join normalise_table as n
        on  t.x = n.x
        and t.y = n.y
        and t.z = n.z
where
    n.x is null
3. Wrap that select in an insert:
insert into
    normalise_table (x, y, z)
select
    t.x, t.y, t.z
from
    (select distinct
         x, y, z
     from
         temp_table) as t
    left outer join normalise_table as n
        on  t.x = n.x
        and t.y = n.y
        and t.z = n.z
where
    n.x is null
In this way you are inserting sets of data. The procedural part is doing this for each set to be inserted, but in general you are not iterating over rows.
By the way, T-SQL has a MERGE command for when you may or may not already have the data in the target table (and for when you want to remove keys missing from the temp tables):
http://msdn.microsoft.com/en-us/library/bb510625.aspx
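A minimal MERGE sketch in the same x, y, z terms as above (untested, in the spirit of the rest of this answer):
merge normalise_table as n
using (select distinct x, y, z from temp_table) as t
    on t.x = n.x and t.y = n.y and t.z = n.z
when not matched by target then
    insert (x, y, z) values (t.x, t.y, t.z)
when not matched by source then
    delete;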
Some comments on foreign keys - these tend to be more specific to the situation:
Can you identify the relationship without the primary key? This is the easiest situation to deal with.
Imagine I have inserted my xyz object into a normalised table, but it has 100 child rows (abc's) in another table (each child may have 100 children too; this would mean 10,000 rows in the de-normalised data for one xyz).
You would have to go through the same validation as before, but your final query may look something like:
insert into
    normalise_table_2 (parentID, a, b, c)
select
    n.id, t.a, t.b, t.c
from
    (select distinct
         x, y, z, a, b, c
     from
         temp_table) as t
    inner join normalise_table as n
        on  t.x = n.x
        and t.y = n.y
        and t.z = n.z
    left outer join normalise_table_2 as n2
        on  n.id = n2.parentID
        and t.a = n2.a
        and t.b = n2.b
        and t.c = n2.c
where
    n2.a is null
or maybe a more readable way:
insert into normalise_table_2 (parentID, a, b, c)
select
    *
from (
    select distinct
        n.id, t.a, t.b, t.c
    from
        normalise_table as n
        inner join temp_table as t
            on  t.x = n.x
            and t.y = n.y
            and t.z = n.z
        left outer join normalise_table_2 as n2
            on  t.a = n2.a
            and t.b = n2.b
            and t.c = n2.c
            and n2.parentID = n.id
    where
        n2.id is null
) as x
If you are having trouble identifying the rows without the id, here are some points to consider:
I often give a unique id to every row in the de-normalised/import data; this makes it easier to track what has and has not been done, not to mention paying off in other ways (e.g. when the source data has blanks that are meant to be the same as the row above). See the sketch after these points.
I have created temp tables to track relationships like this as I go along.
Sometimes (especially for less consistent data) these are not temp tables, as they can be used after the fact to analyse what did and didn't import (and where it went); sometimes I have a comments column that the update queries populate with details about any exceptions relating to the import of that row.
Sometimes you are lucky and there is some kind of source or oldId field in the target that can be used to link the de-normalised data and the normalised version (this is particularly true of system-migration tasks, as people often want to be able to look up items in the old system). Sometimes this can be weird and wonderful, e.g. using the updated-by or created-by field to look for a special account that executes this particular process (though I would not particularly recommend that).
Sometimes it makes sense to update the source tables in some way, e.g. replacing identifiers there.
Sometimes you come up with ID ranges or similar that are reserved for the import, breaking the normal rules about where ids are generated so that your import process creates the IDs.
This often means shutting down all other access to the target system while the import is executed. It may sound mad, but sometimes this is the best way for very complex uploads that require a lot of preparation.
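A sketch of the unique-id and comments-column points above (the column names are made up):
-- stamp every imported row so progress and exceptions can be tracked
alter table temp_table add import_row_id int identity(1,1);
alter table temp_table add import_note nvarchar(400) null;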
But often, when you think about it, there is a particular order you can add your data in that avoids this issue, as you will always be able to identify the correct data. I have used the above techniques to make my life easier, but I am not sure I have ever HAD to use them.
The only exception I can think of is generating IDs outside of the system, which I have had to use, but that was so that IDs would be consistent across multiple trial loads and the final production load. Also, the data was coming from many sources with many people working on it; it made life easier that they could be in control of their own IDs, but it did bring other issues ;).
Generally I would try to leave the source data alone and ensure that if you re-run any of your scripts they won't have any additional effect. This makes the whole system much more robust and gives everyone more confidence, as you can re-import the same data, or a file that has some of the same data, run everything again, and nothing breaks.
Note: I have not tested any of these queries and have just written them off the top of my head, so sorry if they are not totally accurate.

Tips or tricks for translating sql joins from literal language to SQL syntax?

I often know exactly what I want, and I know how the tables are related in order to get it, but I have a really hard time translating that plain-language knowledge into SQL syntax when it comes to joins. Do you have any tips or tricks you can share that have worked for you in the past?
This is a basic, but poor example:
"I have Categories, which have one-to-many Products, which have one-to-many Variants, which have one-to-many Sources. I need all Sources that belong to Category XYZ."
I imagine doing something where you cross out certain language terms and replace them with SQL syntax. Can you share how you formulate your queries based upon some concept similar to that? Thanks!
Use the SQL Query Designer to easily build join queries from the visual table collection right there; then, if you want to learn how it works, simply investigate what it generates. That's how I learned it.
You won't realise how charming it is till you try it.
Visual Representation of SQL Joins - A walkthrough explaining SQL JOINs.
A complete reference of SQL Server joins: Join, Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, in SQL Server 2005 (see the join types listed below).
The ToTraceString method of Entity Framework's ObjectQuery (to which you can add Include shapings) is also a good way to learn it.
SQL-Server Join types (with detailed examples for each join type):
INNER JOIN - Match rows between the two tables specified in the INNER JOIN statement based on one or more columns having matching data. Preferably the join is based on referential integrity enforcing the relationship between the tables to ensure data integrity.
Just to add a little commentary to the basic definitions above: in general, the INNER JOIN is considered the most common join needed in applications and/or queries. Although that is the case in some environments, it really depends on the database design, referential integrity, and the data needed for the application. As such, please take the time to understand the data being requested, then select the proper join option.
Although most join logic is based on matching values between the two columns specified, it is possible to also include logic using greater than, less than, not equals, etc.
LEFT OUTER JOIN - Based on the two tables specified in the join clause, all data is returned from the left table. On the right table, the matching data is returned in addition to NULL values where a record exists in the left table, but not in the right table.
Another item to keep in mind is that the LEFT and RIGHT OUTER JOIN logic is opposite of one another. So you can change either the order of the tables in the specific join statement or change the JOIN from left to right or vice versa and get the same results.
RIGHT OUTER JOIN - Based on the two tables specified in the join clause, all data is returned from the right table. On the left table, the matching data is returned in addition to NULL values where a record exists in the right table but not in the left table.
Self Join - In this circumstance, the same table is specified twice with two different aliases in order to match the data within the same table.
CROSS JOIN - Based on the two tables specified in the join clause, a Cartesian product is created if a WHERE clause does not filter the rows. The size of the Cartesian product is the number of rows in the left table multiplied by the number of rows in the right table. Please use caution with CROSS JOINs.
FULL JOIN - Based on the two tables specified in the join clause, all data is returned from both tables regardless of matching data.
I think most people approach it like this:
Look for nouns, as they can point to potential tables
Look for adjectives, because they are probably fields
Relationships between the nouns give the JOIN rules
Nothing beats drawing these structures on a sheet of paper.
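For instance, the Categories example from the question might translate like this, assuming Id primary keys, <Parent>Id foreign key columns, and a Name column (all hypothetical):
SELECT s.*
FROM Categories AS c
INNER JOIN Products AS p ON p.CategoryId = c.Id
INNER JOIN Variants AS v ON v.ProductId = p.Id
INNER JOIN Sources AS s ON s.VariantId = v.Id
WHERE c.Name = 'XYZ';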
1. Write and debug a query which returns the fields from the table holding most of (or the most important) data. Add constraints which depend only on that table, or which are independent of all tables.
2. Add a new WHERE term which relates another table.
3. Repeat step 2 until done.
I've yet to use the join operator in a query, even after 20+ years of writing SQL queries. One can almost always write them in the form
select field, field2, field3, <etc.>
from table
where field in (select whatever from table2 where whatever)
  and field2 in (select whatever from table2 where whatever) and ...
or
select field, field2, field3, <etc.>
from table1, table2, ...
where table1.field = table2.somefield
  and table1.field2 = table3.someotherfield and ...
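For comparison, that second form is the older comma-style spelling of an inner join; the same query with explicit JOIN syntax (same hypothetical names) would be:
select t1.field, t1.field2, t1.field3
from table1 as t1
inner join table2 as t2 on t1.field = t2.somefield
inner join table3 as t3 on t1.field2 = t3.someotherfield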
Like someone else wrote, just be bold and practice. It will be like riding a bicycle after you have created such a query 4 or 5 times.
One word: Practice.
Open up the query manager and start running queries until you get what you want. Look up similar examples and adapt them to your situation. You will always have to do some trial and error with the queries to get them right.
SQL is very different from imperative programming.
1) To design tables, consider Entities (the real THINGS in the world of concern), Relationships (between the Entities), and Attributes (the values associated with an Entity).
2) To write a Select statement, consider a plastic extrusion press:
a) you put in raw records From tables, Where conditions exist in the records
b) you may have to join tables to get at the data you need
c) you craft extrusion nozzles to make the plastic into the shapes you want; these are the individual expressions of the select List
d) if you want the n-ary sets of list data to come to you in a certain order, you can apply an Order By clause
3) Crafting the List expressions is the part most like imperative programming, once you discover the if(exp, true-exp, false-exp) function.
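A tiny query with the analogy's parts labelled (table and columns are hypothetical):
SELECT o.customer_name,                   -- (c) a "nozzle": a select-list expression
       o.price * o.quantity AS line_total -- (c) another shaped output
FROM orders AS o                          -- (a) raw records come From tables
WHERE o.status = 'open'                   -- (a) Where conditions exist in the records
ORDER BY line_total DESC;                 -- (d) Order By for the output ordering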
Look at the ERD.
Whether the logical or physical version, it will show which tables are related to one another. This way, you can see which table(s) you need to join to get from point/table A to point/table B, and on what criteria.