I have been scouring the internet for days trying to find relationship visualizations in Toad for DB2 that show the relevant table keys/linkages for my JOIN statements. Everything I have been able to find has been less than helpful. My current method is to run the following on the two tables that have the information I want to join...
SELECT *
FROM TABLE1
FETCH FIRST 100 ROWS ONLY;
SELECT *
FROM TABLE2
FETCH FIRST 100 ROWS ONLY;
Then I visually look for markers that look the same. Needless to say, this is taking forever and is extremely inefficient. What feature should I use to quickly see, from table to table, whether there are relevant keys to join on? Help me Obi Wan, you're my only hope!
If your database has foreign keys defined, and if you are using Db2 for LUW, you will find them in the catalog view SYSCAT.REFERENCES:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0001057.html
If there are no Foreign Keys, and the person who designed the database has gone AWOL and left no documentation, then there are tools available that will attempt to discover table relationships from the data contents. An example of such a tool is https://www.ibm.com/uk-en/marketplace/infosphere-information-analyzer
Hi there, I'm new to SQL and I'm facing what I hope is a basic problem.
Scenario: Say I have 2 tables, let these be Client and Reading. I want to find the table k which holds a relationship to both of them, so that I may perform an inner join to link Client to Reading.
Question: Does a query exist to find table k?
P.S. Client and Reading have many attributes, which makes finding a relationship by hand very tedious.
If the tables have foreign keys, then yes you can query the meta tables to find the tables that have relationships to them.
If not, then no, there is no way to find them programmatically.
I'm using SQL Server 2012. I want to join two tables that have no columns in common that I can join on. How can I find all the intermediate tables needed to get from one of these two tables to the other?
For example: I need to join the Table A to table D and to do that I need to connect A to B and then to C and at the end to D.
My question is: can I find the tables B and C among thousands of tables in the database without searching table by table?
Thanks a lot,
Ohad
Assuming that:
You want to automate this process
You have FOREIGN KEY constraints that you can rely on
You should proceed as follows:
Query sys.foreign_keys and create a directed graph structure that will contain the links between tables.
Then implement a graph search algorithm that will start from table A and try to find a path to table D and from D to A.
Once you have found the path, it will be easy to construct dynamic SQL containing the join of all tables on the path. You will need to query sys.foreign_key_columns as well to be able to construct the ON clauses of the JOINs.
Let me know if you need help with more detail.
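The steps above can be sketched roughly as follows. This is a minimal illustration, not a finished tool: the FK_ROWS list is hypothetical data standing in for the result of a query over sys.foreign_keys and sys.foreign_key_columns, the table and column names are made up, and only single-column foreign keys are handled. Edges are stored in both directions so the search can walk from A to D regardless of which side holds the foreign key.

```python
from collections import deque

# Hypothetical rows, shaped like what a query over sys.foreign_keys joined to
# sys.foreign_key_columns might return:
# (child_table, child_column, parent_table, parent_column)
FK_ROWS = [
    ("B", "a_id", "A", "id"),
    ("C", "b_id", "B", "id"),
    ("D", "c_id", "C", "id"),
    ("E", "a_id", "A", "id"),
]

def build_graph(fk_rows):
    """Map each table to its neighbours, keeping the ON condition per edge."""
    graph = {}
    for child, child_col, parent, parent_col in fk_rows:
        cond = f"{child}.{child_col} = {parent}.{parent_col}"
        # Store the edge in both directions so the search can move from
        # parent to child as well as from child to parent.
        graph.setdefault(child, []).append((parent, cond))
        graph.setdefault(parent, []).append((child, cond))
    return graph

def find_join_path(graph, start, goal):
    """Breadth-first search; returns [(table, on_condition), ...] or None."""
    queue = deque([start])
    came_from = {start: None}
    while queue:
        table = queue.popleft()
        if table == goal:
            break
        for neighbour, cond in graph.get(table, []):
            if neighbour not in came_from:
                came_from[neighbour] = (table, cond)
                queue.append(neighbour)
    if goal not in came_from:
        return None
    # Walk back from the goal to the start, then reverse into join order.
    steps = []
    node = goal
    while came_from[node] is not None:
        prev, cond = came_from[node]
        steps.append((node, cond))
        node = prev
    steps.reverse()
    return steps

def build_join_sql(start, steps):
    """Turn the path into a SELECT with one INNER JOIN per step."""
    lines = [f"SELECT * FROM {start}"]
    for table, cond in steps:
        lines.append(f"INNER JOIN {table} ON {cond}")
    return "\n".join(lines)

graph = build_graph(FK_ROWS)
path = find_join_path(graph, "A", "D")
sql = build_join_sql("A", path)
print(sql)
```

Breadth-first search returns a shortest path in number of joins, which is not necessarily the semantically correct one when several paths exist, so the generated SQL should be reviewed before use.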
There are a couple of things you can do to help your cause, but for the most part there's no direct way, and you would need to know how your database is structured and the purposes of the tables. Furthermore, depending on the database's design, it might be very difficult to find your answer intuitively, and you might just need to get guidance from someone who is knowledgeable about the database design. Regardless:
Fields in your tables A & D:
You can look at primary key fields or unique fields in the tables to determine what other tables may link to those tables. Usually they are named in a way that matches those other tables, so you can tell what table they're coming from.
Information_Schema Views
You can use the information_schema.tables and information_schema.columns views to easily search for names of tables and columns across the entire database and narrow your search to fewer tables.
I am working on a research project, using the IMDb dataset as my source of secondary data. I downloaded the entire database in text format from .ftp servers provided by IMDb itself, and used the IMDbPY python package to compile all of the unsorted information into a relational database. I chose to use SQLite as my SQL engine, as it seemed like the least cumbersome option thanks to its ability to create locally-stored databases. After a bit of poking around and a lot of documentation-reading, I ended up with a 9.04 GB im.db file, hosting the entirety of IMDb.
Now I need to isolate my dataset according to my requirements, but due to my lack of experience with SQL I'm finding it difficult to figure out the most optimal way of doing so.
Specifically, I want to look at:
Movies only (i.e. exclude TV series, episodes, etc.);
Produced in the period between 2000-2015, inclusive;
Of feature length (i.e. running time over 40 minutes);
Non-adult (I didn't even know IMDb hosts information on these, but apparently so);
Produced in the USA;
With complete information on crew.
Here's a representation of my database schema. I was confused by some of the database design choices that IMDbPY creators made, but I'm no SQL expert, and this is what I get to work with. Some clarifications:
The title table holds basic information about every instance of films, shows, episodes, and so on, 3,673,485 rows in total. The id column is an auto-incremented primary key, which is referenced as the movie_id foreign key in all other relevant tables. However, it seems that none of the foreign keys in the other tables are indexed properly, so I can't use simple query statements to get the necessary information just by knowing a particular film's id value.
Running SELECT count(*) FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015; tells me that there are 442,135 instances of movies, produced between 2000-2015. So far so good.
The complete_cast and comp_cast_type tables hold info about the completion status of a film's crew/cast list. Since I only need to consider films with complete crew information, I need to isolate only those instances, where (i) movie_id exists in my previous query (i.e. out of the 442,135 movie rows); (ii) subject_id=2; and (iii) status_id=3 or 4.
This is where it gets tricky for me. The movie_info table holds 20 million rows of information about films and TV shows, including runtimes, genres, countries of production, years of production, etc. Basically all of the information that I need to isolate my dataset. Within that table (i) id is an arbitrary auto-incremented primary key; (ii) movie_id refers to the id values from title; (iii) info_type_id refers to one of the 113 types of information as listed in the table info_type; (iv) info holds the actual information, as integers or strings.
For example: Running SELECT id FROM title WHERE title='2001: A Space Odyssey' AND kind_id=1; returns '2484213'. Running SELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=1; returns '142, 161, 149', indicating the running times in minutes of the three available versions of the film. Running SELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=8 returns 'USA, UK', indicating the countries involved in production. And so on.
Basically I'm trying to create a new table, populated only with films that fall under my requirements, and I'm having a hard time figuring out the most efficient way of doing so. Here's how I translated my requirements into basic SQL syntax:
SELECT * FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015;
Then a bunch of requirements from the movie_info table, which cross-references only those instances, where movie_id exists as id in the query above, and (i) info_type_id=1 AND info>40; (ii) info_type_id=3 AND info!='Adult'; (iii) info_type_id=8 AND info='USA';
Finally, I need to make sure that all of the selections exist in the complete_cast table, and WHERE subject_id=2 AND status_id IN (3, 4);
I've been reading SQLite documentation, and suspect that I need to use some combination of INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements, but not sure how to approach this exactly. I would like to write this code efficiently, since brute-forcing queries requirement by requirement takes a while for my computer to process. Thank you in advance for your help.
TL;DR. I can't figure out an efficient way of using INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements to help me isolate a smaller dataset in accordance with multiple requirements, to satisfy which I need to cross-query a number of existing tables without properly indexed foreign keys.
Is an inner join with all the required tables too slow for your needs?
You could create tables that just contain the subset data that you need and then run an inner join on those.
So create a table "movie" and insert only those records from "title" with kind_id of 1. Then do something like
SELECT *
FROM movie m
INNER JOIN movie_info mi
    ON m.id = mi.movie_id
INNER JOIN complete_cast cc
    ON m.id = cc.movie_id
WHERE
...
Provided your new tables don't have the same kind of volume of data, it should perform better.
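As a variation on the subset-table idea, here is a minimal sketch that builds the "movie" table in one statement using EXISTS / NOT EXISTS, one subquery per requirement from the question. It runs against a tiny in-memory SQLite database with made-up rows; the column layout follows the question's description, the two indexes are an assumption added to address the unindexed movie_id columns, and the CAST assumes runtimes are stored as plain numbers in info (in the real dataset they may carry prefixes and need extra cleaning).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title (id INTEGER PRIMARY KEY, title TEXT, kind_id INTEGER, production_year INTEGER);
CREATE TABLE movie_info (id INTEGER PRIMARY KEY, movie_id INTEGER, info_type_id INTEGER, info TEXT);
CREATE TABLE complete_cast (id INTEGER PRIMARY KEY, movie_id INTEGER, subject_id INTEGER, status_id INTEGER);

-- Assumed indexes on the join columns, so each subquery is a lookup, not a scan.
CREATE INDEX idx_movie_info_movie ON movie_info (movie_id, info_type_id);
CREATE INDEX idx_complete_cast_movie ON complete_cast (movie_id);

-- Made-up sample rows: only 'Feature A' meets every requirement.
INSERT INTO title VALUES
  (1, 'Feature A', 1, 2005), (2, 'Short B', 1, 2010), (3, 'Adult C', 1, 2012),
  (4, 'Old D', 1, 1995), (5, 'Foreign E', 1, 2008), (6, 'Incomplete F', 1, 2011);
INSERT INTO movie_info (movie_id, info_type_id, info) VALUES
  (1, 1, '95'),  (1, 3, 'Drama'),  (1, 8, 'USA'), (1, 8, 'UK'),
  (2, 1, '12'),  (2, 3, 'Comedy'), (2, 8, 'USA'),
  (3, 1, '90'),  (3, 3, 'Adult'),  (3, 8, 'USA'),
  (4, 1, '100'), (4, 3, 'Drama'),  (4, 8, 'USA'),
  (5, 1, '110'), (5, 3, 'Drama'),  (5, 8, 'France'),
  (6, 1, '99'),  (6, 3, 'Drama'),  (6, 8, 'USA');
INSERT INTO complete_cast (movie_id, subject_id, status_id) VALUES
  (1, 2, 3), (2, 2, 4), (3, 2, 3), (4, 2, 3), (5, 2, 4), (6, 2, 1);
""")

conn.execute("""
CREATE TABLE movie AS
SELECT t.*
FROM title AS t
WHERE t.kind_id = 1
  AND t.production_year BETWEEN 2000 AND 2015
  -- at least one runtime over 40 minutes
  AND EXISTS (SELECT 1 FROM movie_info mi
              WHERE mi.movie_id = t.id AND mi.info_type_id = 1
                AND CAST(mi.info AS INTEGER) > 40)
  -- no 'Adult' genre row
  AND NOT EXISTS (SELECT 1 FROM movie_info mi
                  WHERE mi.movie_id = t.id AND mi.info_type_id = 3
                    AND mi.info = 'Adult')
  -- USA among the production countries
  AND EXISTS (SELECT 1 FROM movie_info mi
              WHERE mi.movie_id = t.id AND mi.info_type_id = 8
                AND mi.info = 'USA')
  -- complete crew information
  AND EXISTS (SELECT 1 FROM complete_cast cc
              WHERE cc.movie_id = t.id AND cc.subject_id = 2
                AND cc.status_id IN (3, 4))
""")

rows = conn.execute("SELECT id, title FROM movie ORDER BY id").fetchall()
print(rows)
```

Because each requirement is an existence test rather than a join, a film with several runtimes or countries still appears only once in the result.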
I am working with a SQL Server database which contains almost 850 tables. It has many defined relationships and plenty of undefined relationships (FKs), undefined primary keys, etc. It is a mess. I don't have access to the application source code, so I can't track down the undefined relationships through code.
Is there any software or query that can look at the data and figure out the relationships between the tables? To be more specific, every field (column) in each table would be compared (joined) against every column of all other tables, and I would get a report of some sort. In almost 60% of cases the column names are similar in related tables, but many tables use the same column name for their primary key (for example item_id).
I need all those undefined relationships, which are making my life miserable every day!! :(
I think your best bet would be to use the profiler to capture the statements being executed and try to infer the relationships from that. This is a tough one, and there aren't any easy solutions that I'm aware of.
Good Luck !
Well, you can query the metadata - INFORMATION_SCHEMA.COLUMNS - filter out things which are highly unlikely to be joined as keys - like TEXT/NVARCHAR(MAX). Put it in some kind of data dictionary table where you start to tag the columns with information.
You can query with things like:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS AS C
INNER JOIN INFORMATION_SCHEMA.TABLES AS T
ON C.COLUMN_NAME = T.TABLE_NAME + '_ID';
to see if there are obvious matches.
That might help you get a handle on the database. But it will take a lot of work.
Without a foreign key constraint, it's even possible that they've done things like "multi-keys" where a certain column is a foreign key to one table or another depending on some kind of type selector (these aren't possible with foreign key constraints) - it's possible you won't even see this in the profiler except between separate joins - so one time you might see it join to one table and sometimes another.
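To illustrate what "looking at the data" could mean, here is a rough sketch of a value-containment check: a column is reported as a candidate foreign key if every non-NULL value in it appears in a unique column of another table. It uses an in-memory SQLite database with made-up tables so that it is runnable; on SQL Server the column list would come from INFORMATION_SCHEMA.COLUMNS instead of sqlite_master and PRAGMA table_info. All table names, column names and helper functions here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_id INTEGER, name TEXT);
CREATE TABLE order_line (line_id INTEGER, item_ref INTEGER, qty INTEGER);
INSERT INTO item VALUES (1, 'bolt'), (2, 'nut'), (3, 'washer');
INSERT INTO order_line VALUES (10, 1, 5), (11, 3, 2), (12, 3, 7);
""")

def columns(conn):
    """Return (table, column) for every table, read from SQLite's catalog."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    result = []
    for table in tables:
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            result.append((table, row[1]))  # row[1] is the column name
    return result

def is_unique(conn, table, col):
    """True if the column has values and no non-NULL value repeats."""
    total, distinct = conn.execute(
        f'SELECT COUNT("{col}"), COUNT(DISTINCT "{col}") FROM "{table}"'
    ).fetchone()
    return total > 0 and total == distinct

def is_contained(conn, child, child_col, parent, parent_col):
    """True if every non-NULL child value also appears in the parent column."""
    missing = conn.execute(
        f'SELECT COUNT(*) FROM "{child}" c '
        f'WHERE c."{child_col}" IS NOT NULL AND NOT EXISTS ('
        f'SELECT 1 FROM "{parent}" p WHERE p."{parent_col}" = c."{child_col}")'
    ).fetchone()[0]
    return missing == 0

def candidate_keys(conn):
    """List (child, child_col, parent, parent_col) candidate relationships."""
    cols = columns(conn)
    found = []
    for parent, parent_col in cols:
        if not is_unique(conn, parent, parent_col):
            continue  # only unique columns can act as the referenced key
        for child, child_col in cols:
            if child != parent and is_contained(conn, child, child_col, parent, parent_col):
                found.append((child, child_col, parent, parent_col))
    return found

candidates = candidate_keys(conn)
print(candidates)
```

This compares every column with every other column, so on 850 tables it would need pre-filtering by data type and name first. It also produces false positives (small integer ranges overlap by accident) and, like the profiler, will not detect the "multi-key" pattern described above, so the output is a list of leads to verify rather than a list of relationships.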
I need to explain to somebody how they can determine what fields from multiple tables/views they should join on. Any suggestions? I know how to do it but am having difficulty trying to explain it.
One of the issues they have is they will take two fields from two tables that are the same (zip code) and join on those, when in reality they should be joining on ID columns. When they choose the wrong column to join on it increases records they receive in return.
Should I work in PK and FK somewhere?
While it is indeed typical to join a PK to an FK, any conversation about JOIN clauses that revolves only around PK's and FK's is fairly limited.
For example I had this FROM clause in a recent SQL answer I gave
FROM
    YourTable firstNames
    LEFT JOIN YourTable lastNames
        ON firstNames.Name = lastNames.Name
        AND lastNames.NameType = 2
        AND firstNames.FrequencyPercent < lastNames.FrequencyPercent
The table referenced on each side of the join is the same table (a self join), and the join includes three conditions, one of which is an inequality. Furthermore, there would never be an FK here because it's joining on a field that is, by design, not a Candidate Key.
Also, you don't even have to join one table to another. You can join inline queries to each other, which of course can't possibly have a Key.
So in order to properly understand JOIN you just need to understand that it combines the records from two relations (tables, views, inline queries) where some conditions evaluate to true. This means you need to understand boolean logic and the database and the data in the database.
If your user is having a problem with a specific JOIN, ask them to SELECT some rows from one table and some from the other, and then ask them under what conditions they would want to combine the rows.
You don't need to talk in terms of a primary key of a table but you should point to it and explain that it uniquely identifies a given row and that you must join to related tables using it or you could get duplicated results.
Give them examples of joining with it and joining without it.
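One such example, as a minimal sketch with made-up person and address tables in an in-memory SQLite database: joining on the key returns one row per address, while joining on zip code pairs every person with every address in the same zip.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
CREATE TABLE address (id INTEGER PRIMARY KEY, person_id INTEGER, street TEXT, zip TEXT);
INSERT INTO person VALUES (1, 'Ann', '10001'), (2, 'Bob', '10001'), (3, 'Cy', '94105');
INSERT INTO address VALUES
  (1, 1, '1 Main St', '10001'),
  (2, 2, '2 Oak Ave', '10001'),
  (3, 3, '3 Pine Rd', '94105');
""")

# Join on the key: each address matches exactly one person.
by_key = conn.execute("""
    SELECT p.name, a.street
    FROM person p
    JOIN address a ON a.person_id = p.id
""").fetchall()

# Join on zip: Ann and Bob share a zip, so each picks up both addresses.
by_zip = conn.execute("""
    SELECT p.name, a.street
    FROM person p
    JOIN address a ON a.zip = p.zip
""").fetchall()

print(len(by_key))  # 3
print(len(by_zip))  # 5
```

Showing the two row counts side by side tends to make the point faster than explaining keys in the abstract.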
An ER diagram showing all of the tables they use and their key relationships would help ensure that they always use the correct keys.
It sounds to me like neither you nor the person you are trying to help understands how this particular database is constructed, and perhaps neither of you really understands basic database fundamentals, like PK's and FK's. Most often a PK from one table is joined to an FK in another table.
Assuming the database has the proper PK's and FK's in place, it would probably help a great deal to generate an ER diagram. That would make the joining concept much easier to grasp.
Another approach you could take is to find someone who does understand these things and create some views for this person to use. This way he doesn't need to understand how to join the tables together.
A user shouldn't typically be doing joins. A user should have an interface that lets them get the data that they need in the way that they need it. If you don't have the developer resources to do that then you're going to be stuck with this problem of having to teach a user technical details. You also need to be very careful about what kind of damage the user can do. Do they have update rights on the data? I hope they don't accidentally do a DELETE FROM Table with no WHERE clause. Even if you restrict their permissions, a poorly written query can crush the database server or block resources causing problems for other users (and more work for you).
If you have no choice, then I think that you need to certainly teach them about primary and foreign keys, even if you don't call them that. Point out that the id on your table (or whatever your PK is) identifies a row. Then explain how the id appears in other tables to show the relationship. For example, "See, in the address table we have a person_id which tells us who that address belongs to."
After that, expect to spend a large portion of your time with that user as they make mistakes or come up with other things that they want to get from the database, but which they can't figure out how to get.
From theory, and ideally, you should define primary keys on all tables, and join tables using a primary key to the matching field or fields (foreign key) in the other table.
Even if they are not defined as primary keys, you need to make sure the fields you join on uniquely identify the records in the table, and they should be properly indexed.
For example, let's say the 'person' table has a SSN and a driver's license field. The SSN could be considered and flagged as the 'primary key', but if you join that table to a 'drivers' table which might not have the SSN, but does have the driver's license #, you could join them by the driver's license field (even if it's not flagged as primary key), but you need to make sure that the field is properly indexed in both tables.
...explain to somebody how they can determine what fields from multiple tables/views they should join on.
Simply put, look for the columns with values that match between the tables/views. Preferably, match exactly but some massaging might be necessary.
The existence of foreign key constraints would help to know what matches to what, but the constraint might not be directly to the table/view that is to be joined.
The existence of a primary key doesn't mean it is the criterion that is necessary for the query, so I would overlook this detail (depending on the audience).
I would recommend attacking the desired result set by starting with the columns desired, and working back from there. If there's more than one table's columns in the result set, focus first on the table whose columns should be returning distinct results and then gradually add joins, checking the result set after each JOIN addition to confirm the results are still the same. Otherwise, you need to review the JOIN, or whether a JOIN is actually necessary vs IN or EXISTS.
I did this when I first started out. It comes from thinking of joins as just linking tables together, so I linked at all possible points.
Once you think of joins as a way to combine AND filter the data it becomes easier to understand them.
Writing out your request as a sentence is helpful too, "I want to see all the times Table A interacted with Table B". Then build a query from that using only the ID, noting that if you wanted to know "All the times Table A was in the same zip code as Table B" then you would join by zip code.