Left join on a comma-separated column in SQL

I am working on an ASP.NET application with a SQL Server database. The database has two tables, Vacancies and DutyStations. The Vacancies table has a column named DutyStationID, which stores the IDs of duty stations as a comma-separated list like this:
2,12,15,18,19,23
Now I want to show this vacancy in a grid, and I have used a left join like this:
QUERY
SELECT * FROM dbo.hr_Vacancies
CROSS APPLY dbo.hr_Split(dbo.hr_Vacancies.DutyStationID, ',') AS s
LEFT OUTER JOIN dbo.hr_DutyStations
ON s.Data = dbo.hr_DutyStations.DutyStationID
In the XSD, I have set VacancyID as the primary key, but I get this error:
ERROR
Failed to enable constraints. One or more rows contain values violating non-null, unique, or foreign-key constraints.
If I remove this constraint, I get 6 rows. I want to show one row only. How can I do this?

I stopped reading here:
Vacancies table has a column named dutystationId which stores ids of dutystations in comma separated list
That is your problem right there. If you have comma separated values in an RDBMS, specifically if they contain foreign keys to other tables, you should halt full stop whatever you're doing and start redesigning your database. Many-to-many relations in an RDBMS are implemented with junction tables, and if you use them all your problems will suddenly solve themselves.
Your current design is not only hell to write SQL queries for, as this question illustrates perfectly (you cannot solve a trivial task), but it also hurts performance: the calls to hr_Split must parse a string for every row, which is far more computationally expensive than a proper join on indexed keys.
Don't fall into the XY trap, solve the real problem first. Which is that you're even violating First Normal Form right now.
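To make the junction-table suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module so it can be run anywhere. The junction table name hr_VacancyDutyStations, the Title and Name columns, and the sample rows are all assumptions for illustration; the real schema is not shown in the question. It returns one row per vacancy, with the duty stations aggregated into a single display column.

```python
import sqlite3

# Hypothetical, simplified schema: only the table names hr_Vacancies and
# hr_DutyStations come from the question; everything else is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hr_DutyStations (
    DutyStationID INTEGER PRIMARY KEY,
    Name          TEXT NOT NULL
);
CREATE TABLE hr_Vacancies (
    VacancyID INTEGER PRIMARY KEY,
    Title     TEXT NOT NULL
);
-- Junction table: one row per (vacancy, duty station) pair.
CREATE TABLE hr_VacancyDutyStations (
    VacancyID     INTEGER NOT NULL REFERENCES hr_Vacancies(VacancyID),
    DutyStationID INTEGER NOT NULL REFERENCES hr_DutyStations(DutyStationID),
    PRIMARY KEY (VacancyID, DutyStationID)
);
INSERT INTO hr_DutyStations VALUES (2, 'Station A'), (12, 'Station B'), (15, 'Station C');
INSERT INTO hr_Vacancies VALUES (1, 'Accountant');
INSERT INTO hr_VacancyDutyStations VALUES (1, 2), (1, 12), (1, 15);
""")

# GROUP BY collapses the joined rows back to one row per vacancy, so
# VacancyID stays unique in the result set.
rows = conn.execute("""
SELECT v.VacancyID,
       v.Title,
       COUNT(d.DutyStationID)    AS StationCount,
       group_concat(d.Name, ', ') AS DutyStations
FROM hr_Vacancies AS v
LEFT JOIN hr_VacancyDutyStations AS vd ON vd.VacancyID = v.VacancyID
LEFT JOIN hr_DutyStations AS d ON d.DutyStationID = vd.DutyStationID
GROUP BY v.VacancyID, v.Title
""").fetchall()

for row in rows:
    print(row)
```

group_concat is SQLite's aggregate; on SQL Server 2017 and later the equivalent would be STRING_AGG. Because the result has exactly one row per VacancyID, a primary-key constraint on that column in the typed dataset should no longer be violated.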

Related

Cross-querying multiple tables in an SQLite database without indexed foreign keys

I am working on a research project, using the IMDb dataset as my source of secondary data. I downloaded the entire database in text format from .ftp servers provided by IMDb itself, and used the IMDbPY python package to compile all of the unsorted information into a relational database. I chose to use SQLite as my SQL engine, as it seemed like the least cumbersome option thanks to its ability to create locally-stored databases. After a bit of poking around and a lot of documentation-reading, I ended up with a 9.04 GB im.db file, hosting the entirety of IMDb.
Now I need to isolate my dataset according to my requirements, but due to my lack of experience with SQL I'm finding it difficult to figure out the most optimal way of doing so.
Specifically, I want to look at:
Movies only (i.e. exclude TV series, episodes, etc.);
Produced in the period between 2000-2015, inclusive;
Of feature length (i.e. running time over 40 minutes);
Non-adult (I didn't even know IMDb hosts information on these, but apparently so);
Produced in the USA;
With complete information on crew.
Here's a representation of my database schema. I was confused by some of the database design choices that IMDbPY creators made, but I'm no SQL expert, and this is what I get to work with. Some clarifications:
The title table holds basic information about every instance of films, shows, episodes, and so on, 3,673,485 rows in total. The id column is an auto-incremented primary key, which is referenced as the movie_id foreign key in all other relevant tables. However, it seems that none of the foreign keys in other tables are indexed properly, so I can't use simple query statements to get the necessary information just by knowing a particular film's id value.
Running SELECT count(*) FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015; tells me that there are 442,135 instances of movies, produced between 2000-2015. So far so good.
The complete_cast and comp_cast_type tables hold info about the completion status of a film's crew/cast list. Since I only need to consider films with complete crew information, I need to isolate only those instances, where (i) movie_id exists in my previous query (i.e. out of the 442,135 movie rows); (ii) subject_id=2; and (iii) status_id=3 or 4.
This is where it gets tricky for me. The movie_info table holds 20 million rows of information about films and TV shows, including runtimes, genres, countries of production, years of production, etc. Basically all of the information that I need to isolate my dataset. Within that table (i) id is an arbitrary auto-incremented primary key; (ii) movie_id refers to the id values from title; (iii) info_type_id refers to one of the 113 types of information as listed in the table info_type; (iv) info holds the actual information, as integers or strings.
For example: Running SELECT id FROM title WHERE title='2001: A Space Odyssey' AND kind_id=1; returns '2484213'. Running SELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=1; returns '142, 161, 149', indicating the running times in minutes of the three available versions of the film. Running SELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=8 returns 'USA, UK', indicating the countries involved in production. And so on.
Basically I'm trying to create a new table, populated only with films that fall under my requirements, and I'm having a hard time figuring out the most efficient way of doing so. Here's how I translated my requirements into basic SQL syntax:
SELECT * FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015;
Then a bunch of requirements from the movie_info table, which cross-references only those instances, where movie_id exists as id in the query above, and (i) info_type_id=1 AND info>40; (ii) info_type_id=3 AND info!='Adult'; (iii) info_type_id=8 AND info='USA';
Finally, I need to make sure that all of the selections exist in the complete_cast table, WHERE subject_id=2 AND status_id IN (3, 4);
I've been reading SQLite documentation, and suspect that I need to use some combination of INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements, but not sure how to approach this exactly. I would like to write this code efficiently, since brute-forcing queries requirement by requirement takes a while for my computer to process. Thank you in advance for your help.
TL;DR. I can't figure out an efficient way of using INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements to help me isolate a smaller dataset in accordance with multiple requirements, to satisfy which I need to cross-query a number of existing tables without properly indexed foreign keys.
Is an inner join with all the required tables too slow for your needs?
You could create tables that just contain the subset data that you need and then run an inner join on those.
So create a table "movie" and insert only those records from "title" with kind_id of 1. Then do something like
SELECT *
FROM movie m
INNER JOIN movie_info mi
    ON m.id = mi.movie_id
INNER JOIN complete_cast cc
    ON m.id = cc.movie_id  -- complete_cast points at title via movie_id, not its own id
WHERE
...
Providing your new tables don't have the same kind of volume of data, it should perform better.
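As a rough sketch of how the requirements from the question could be combined in one statement, the example below builds the "movie" subset table with EXISTS / NOT EXISTS subqueries after first indexing the foreign keys. It runs against a tiny in-memory stand-in for the IMDb tables; the index names, the sample rows, and the assumption that each movie_info row holds a single plain value (for example a runtime stored as '95') are mine, not taken from the real IMDbPY data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title (id INTEGER PRIMARY KEY, title TEXT, kind_id INTEGER, production_year INTEGER);
CREATE TABLE movie_info (id INTEGER PRIMARY KEY, movie_id INTEGER, info_type_id INTEGER, info TEXT);
CREATE TABLE complete_cast (id INTEGER PRIMARY KEY, movie_id INTEGER, subject_id INTEGER, status_id INTEGER);

-- Index the unindexed foreign keys once, so every subquery probe below
-- is an index lookup instead of a scan over millions of rows.
CREATE INDEX idx_movie_info_movie ON movie_info (movie_id, info_type_id);
CREATE INDEX idx_complete_cast_movie ON complete_cast (movie_id);

-- Made-up sample data: only 'Keep Me' satisfies every requirement.
INSERT INTO title VALUES
    (1, 'Keep Me', 1, 2005), (2, 'Too Short', 1, 2005), (3, 'Not USA', 1, 2010),
    (4, 'Too Old', 1, 1995), (5, 'Adult One', 1, 2008), (6, 'Incomplete', 1, 2012);
INSERT INTO movie_info (movie_id, info_type_id, info) VALUES
    (1, 1, '95'),  (1, 3, 'Drama'), (1, 8, 'USA'),
    (2, 1, '12'),  (2, 3, 'Drama'), (2, 8, 'USA'),
    (3, 1, '100'), (3, 3, 'Drama'), (3, 8, 'UK'),
    (4, 1, '100'), (4, 3, 'Drama'), (4, 8, 'USA'),
    (5, 1, '90'),  (5, 3, 'Adult'), (5, 8, 'USA'),
    (6, 1, '100'), (6, 3, 'Drama'), (6, 8, 'USA');
INSERT INTO complete_cast (movie_id, subject_id, status_id) VALUES
    (1, 2, 3), (2, 2, 4), (3, 2, 3), (4, 2, 3), (5, 2, 4), (6, 2, 1);
""")

conn.execute("""
CREATE TABLE movie AS
SELECT t.*
FROM title AS t
WHERE t.kind_id = 1
  AND t.production_year BETWEEN 2000 AND 2015
  -- feature length: at least one runtime over 40 minutes
  AND EXISTS (SELECT 1 FROM movie_info mi
              WHERE mi.movie_id = t.id AND mi.info_type_id = 1
                AND CAST(mi.info AS INTEGER) > 40)
  -- non-adult: no genre row equal to 'Adult'
  AND NOT EXISTS (SELECT 1 FROM movie_info mi
                  WHERE mi.movie_id = t.id AND mi.info_type_id = 3
                    AND mi.info = 'Adult')
  -- produced in the USA (possibly among other countries)
  AND EXISTS (SELECT 1 FROM movie_info mi
              WHERE mi.movie_id = t.id AND mi.info_type_id = 8
                AND mi.info = 'USA')
  -- complete crew information
  AND EXISTS (SELECT 1 FROM complete_cast cc
              WHERE cc.movie_id = t.id AND cc.subject_id = 2
                AND cc.status_id IN (3, 4))
""")

kept = [r[0] for r in conn.execute("SELECT title FROM movie ORDER BY id")]
print(kept)
```

EXISTS is used rather than a plain join because a film can have several movie_info rows of the same type (three runtimes, two countries), and a join would return the film once per matching row. If the real runtime strings carry a prefix or suffix, the CAST would need adjusting.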

Multiple record types and how to split them amongst tables

I'm working on a database structure and trying to imagine the best way to split up a host of related records into tables. Records all have the same base type they inherit from, but each then expands on it for their particular use.
These 4 properties are present for every type.
id, name, groupid, userid
Here are the types that expand off those 4 properties.
"Static": value
"Increment": currentValue, maxValue, overMaxAllowed, underNegativeAllowed
"Target": targetValue, result, lastResult
What I tried initially was to create a "records" table with the 4 base properties in it. I then created 3 other tables named "records_static/increment/target", each with their specific properties as columns. I then forged relationships between a "rowID" column in each of these secondary tables with the main table's "id".
Populating the tables with dummy data, I am now having some major problems attempting to extract the data with a query. The only parameter is the userid, beyond that what I need is a table with all of the columns and data associated with the userid.
I am unsure if I should abandon that table design, or if I just am going about the query incorrectly.
I hope I explained that well enough, please let me know if you need additional detail.
Make the design as simple as possible.
First I'd try a single table that contains all attributes that might apply to a record. Irrelevant attributes can be null. You can enforce null values for a specific type with a check constraint.
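A minimal sketch of that single-table option, using the property names from the question and run through Python's sqlite3 module so it can be tried directly. The "type" discriminator column, the INTEGER column types, and the sample rows are assumptions; the CHECK constraints force the attributes that do not belong to a row's type to stay NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    groupid INTEGER NOT NULL,
    userid  INTEGER NOT NULL,
    -- assumed discriminator column telling which kind of record this is
    type    TEXT NOT NULL CHECK (type IN ('static', 'increment', 'target')),
    -- "Static" attributes
    value                INTEGER,
    -- "Increment" attributes
    currentValue         INTEGER,
    maxValue             INTEGER,
    overMaxAllowed       INTEGER,
    underNegativeAllowed INTEGER,
    -- "Target" attributes
    targetValue INTEGER,
    result      INTEGER,
    lastResult  INTEGER,
    -- attributes of the other types must be NULL
    CHECK (type = 'static' OR value IS NULL),
    CHECK (type = 'increment' OR (currentValue IS NULL AND maxValue IS NULL
           AND overMaxAllowed IS NULL AND underNegativeAllowed IS NULL)),
    CHECK (type = 'target' OR (targetValue IS NULL AND result IS NULL
           AND lastResult IS NULL))
);
""")

conn.execute("INSERT INTO records (name, groupid, userid, type, value) "
             "VALUES ('score', 1, 42, 'static', 10)")
conn.execute("INSERT INTO records (name, groupid, userid, type, currentValue, maxValue, "
             "overMaxAllowed, underNegativeAllowed) VALUES ('hp', 1, 42, 'increment', 5, 20, 0, 0)")

# A 'static' row that also sets a 'target' attribute violates a CHECK.
try:
    conn.execute("INSERT INTO records (name, groupid, userid, type, value, targetValue) "
                 "VALUES ('bad', 1, 42, 'static', 1, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Everything for one user comes back from a single table, no joins needed.
rows = conn.execute("SELECT name, type FROM records WHERE userid = ? ORDER BY id",
                    (42,)).fetchall()
print(rejected, rows)
```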
If that doesn't work out, you can create three tables, one for each record type, without a common table.
If that doesn't work out, you can create a base table with 1:1 extension tables. Be aware that querying that is much harder, requiring a join for every operation:
select *
from fruit f
left join
apple a
on a.fruit_id = f.id
left join
pear p
on p.fruit_id = f.id
left join
...
The more complex the design, the more room for an inconsistent database state. In the second option you could have a pear and an apple with the same id. In the third option you can have missing rows in either the base or the extension table. Or the tables can contradict each other, for example a base row saying "pear" with an extension row in the Apple table. I fully trust end users to find a way to get that into your database :)
Throw out the complex design and start with the simplest one. Your first attempt was not a failure: you now know the cost of adding relations between tables. Which can look deceptively trivial (or even "right") at design time.
This is a typical "object-oriented to relational" mapping problem. You can find books about this. Also a lot of google hits like
http://www.ibm.com/developerworks/library/ws-mapping-to-rdb/
The easiest for you to implement is to have one table containing all columns necessary to store all your types. Make sure you define them as nullable. Only the common columns can be not null if necessary.
Just because objects share some of the same properties does not mean you need to have one table for both objects. That leads to unnecessary outer joins over a 1-to-1 relationship, which is not what I think of as good database design.
but...
If you want to continue in your fashion I think all you need is a primary key in the table with common columns "id, name, groupid, userid" (I assume ID) then that would be the foreign key to your table with currentValue, maxValue, overMaxAllowed, underNegativeAllowed

SQL: Reference one to one-of-many

I'm having what some would call a rather strange problem/question.
Suppose I have a table which may reference one (and only one) of many different other tables. How would I do that in the best way? I'm looking for a solution which should work in a majority of databases (MS SQL, MySQL, PostgreSQL etc). The way I see it, there are a couple of different solutions (is any better than the other?):
Have one column for each possible reference. Only one of these columns may contain a value for any given row, all others are null. Allows for strict foreign keys, but it gets tedious when the number of "many" (possible referenced tables) gets large
Have a two column relationship, i.e. one column "describing" which table is referenced, and one referencing the instance (row in that table). Easily extended when the number of "many" (referenced tables) grows, though I can't perform single query lookup in a straightforward way (either left join all possible tables, or union multiple queries which joins towards one table each)
??
Make sense? What's best practise (if any) in this case?
I specifically want to be able to query data from the referenced entity, without really knowing which of the tables are being referenced.
How would you do it?
Both of these methods are suitable in any relational database, so you don't have to worry about that consideration. Both result in rather cumbersome queries. For the first method:
select . . .
from t
left outer join ref1
    on t.ref1id = ref1.ref1id
left outer join ref2
    on t.ref2id = ref2.ref2id . . .
For the second method:
select . . .
from t
left outer join ref1
    on t.anyid = ref1.ref1id and t.anytype = 'ref1'
left outer join ref2
    on t.anyid = ref2.ref2id and t.anytype = 'ref2' . . .
So, from the perspective of query simplicity, I don't see a major advantage for one versus the other. The second version has a small disadvantage -- when writing queries, you have to remember what the name is for the join. This might get lost over time. (Of course, you can use constraints or triggers to ensure that only a fixed set of values make it into the column.)
From the perspective of query performance, the first version has a major advantage. You can identify the column as a foreign key and the database can keep statistics on it. This can help the database choose the right join algorithm, for instance. The second method does not readily offer this possibility.
From the perspective of data size, the first version requires storing an id column for each of the possible referenced tables. The second is more compact. From the perspective of maintainability, the first makes it hard to add a new object type; the second makes it easy.
If you have a set of things that are similar to each other, then you can consider storing them in a single table. Attributes that are not appropriate can be NULLed out. You can even create views for the different flavors of the thing. One table may or may not be an option.
In other words, there is no right answer to this question. As with many aspects of database design, it depends on how the data is going to be used. Absent other information, I would probably first try to coerce the data into a single table. If that is just not reasonable, I would go with the first option if the number of tables can be counted on one hand, and the second if there are more tables.
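For what the first option might look like in practice, here is a small sketch run through Python's sqlite3 module. The table and key names t, ref1, ref2, ref1id and ref2id follow the queries above; the name column, the sample rows, and the is_rejected helper are assumptions added for illustration. A CHECK constraint written with plain CASE expressions enforces "exactly one reference is set", and COALESCE reads from whichever table is referenced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE ref1 (ref1id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE ref2 (ref2id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE t (
    id     INTEGER PRIMARY KEY,
    ref1id INTEGER REFERENCES ref1(ref1id),
    ref2id INTEGER REFERENCES ref2(ref2id),
    -- exactly one of the reference columns must be filled in
    CHECK (
        (CASE WHEN ref1id IS NULL THEN 0 ELSE 1 END) +
        (CASE WHEN ref2id IS NULL THEN 0 ELSE 1 END) = 1
    )
);
INSERT INTO ref1 VALUES (10, 'from ref1');
INSERT INTO ref2 VALUES (20, 'from ref2');
INSERT INTO t VALUES (1, 10, NULL), (2, NULL, 20);
""")


def is_rejected(sql):
    """Return True if the statement violates a constraint (hypothetical helper)."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


both_set_rejected = is_rejected("INSERT INTO t VALUES (3, 10, 20)")    # two references
dangling_rejected = is_rejected("INSERT INTO t VALUES (4, 99, NULL)")  # no such ref1 row

# Query the referenced entity without knowing in advance which table it lives in.
rows = conn.execute("""
SELECT t.id, COALESCE(ref1.name, ref2.name) AS referenced_name
FROM t
LEFT OUTER JOIN ref1 ON t.ref1id = ref1.ref1id
LEFT OUTER JOIN ref2 ON t.ref2id = ref2.ref2id
ORDER BY t.id
""").fetchall()
print(both_set_rejected, dangling_rejected, rows)
```

The CASE form is used instead of adding boolean expressions directly because not every database treats booleans as integers.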
1)
This is legitimate for small number of static tables. If you anticipate a number of new tables might need to be added in the future, take a look at 3) below...
2)
Please don't do that. You'd be forfeiting the declarative FOREIGN KEYs, which is one of the most important mechanisms for maintaining data integrity.
3)
Use inheritance. More info in this post:
What is the best design for a database table that can be owned by two different resources, and therefore needs two different foreign keys?
You might also be interested in looking at:
Implementing comments and Likes in database
Multiple one to many relationship design
How to avoid multiple tables tables to relations M: M?
database table design thoughts
Relating two database tables (associating an employee with an activity)
How to structure table Activities in a database?

How to best explain on what fields should a user join on?

I need to explain to somebody how they can determine what fields from multiple tables/views they should join on. Any suggestions? I know how to do it but am having difficulty trying to explain it.
One of the issues they have is that they will take two fields from two tables that are the same (zip code) and join on those, when in reality they should be joining on ID columns. When they choose the wrong column to join on, it increases the number of records they receive in return.
Should I work in PK and FK somewhere?
While it is indeed typical to join a PK to an FK, any conversation about JOIN clauses that only revolves around PKs and FKs is fairly limited.
For example I had this FROM clause in a recent SQL answer I gave
FROM
YourTable firstNames
LEFT JOIN YourTable lastNames
ON firstnames.Name = lastNames.Name
AND lastNames.NameType =2
and firstnames.FrequencyPercent < lastNames.FrequencyPercent
The table referenced on each side of the join is the same table (a self join), and the join includes three conditions, one of which is an inequality. Furthermore, there would never be an FK here because it's looking to join on a field that is, by design, not a candidate key.
Also, you don't even have to join one table to another. You can join inline queries to each other, which of course can't possibly have a key.
So in order to properly understand JOIN you just need to understand that it combines the records from two relations (tables, views, inline queries) where some conditions evaluate to true. This means you need to understand boolean logic and the database and the data in the database.
If your user is having a problem with a specific JOIN ask them to SELECT some rows from one table and also the other and then ask them under what conditions would you want to combine the rows.
You don't need to talk in terms of a primary key of a table but you should point to it and explain that it uniquely identifies a given row and that you must join to related tables using it or you could get duplicated results.
Give them examples of joining with it and joining without it.
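One way such a side-by-side example might look, sketched with Python's sqlite3 module. The person and address tables, their columns, and the sample rows are made up for illustration. Two people share a zip code, so joining on zip pairs each of them with the other's address as well as their own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables: person.id is the key, address.person_id points at it.
CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
CREATE TABLE address (id INTEGER PRIMARY KEY, person_id INTEGER REFERENCES person(id),
                      street TEXT, zip TEXT);
INSERT INTO person VALUES (1, 'Ann', '10001'), (2, 'Bob', '10001'), (3, 'Cy', '20002');
INSERT INTO address VALUES
    (1, 1, '1 First St',  '10001'),
    (2, 2, '2 Second St', '10001'),
    (3, 3, '3 Third St',  '20002');
""")

# Joining on the key: each person matches only their own address.
by_id = conn.execute("""
SELECT p.name, a.street
FROM person AS p
JOIN address AS a ON a.person_id = p.id
""").fetchall()

# Joining on zip: Ann and Bob share 10001, so each also matches the other's address.
by_zip = conn.execute("""
SELECT p.name, a.street
FROM person AS p
JOIN address AS a ON a.zip = p.zip
""").fetchall()

print(len(by_id), len(by_zip))
```

Showing the two row counts next to each other tends to make the point faster than an abstract explanation: the key join returns one row per address, while the zip join returns extra rows that pair people with addresses that are not theirs.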
An ER diagram showing all of the tables they use and their key relationships would help ensure that they always use the correct keys.
It sounds to me like neither you, nor the person you are trying to help understands how this particular database is constructed and perhaps don't really even understand basic database fundamentals, like PK's and FK's. Most often a PK from one table is joined to a FK to another table.
Assuming the database has the proper PK's and FK's in place, it would probably help a great deal to generate an ER diagram. That would make the joining concept much easier to grasp.
Another approach you could take is to find someone who does understand these things and create some views for this person to use. This way he doesn't need to understand how to join the tables together.
A user shouldn't typically be doing joins. A user should have an interface that lets them get the data that they need in the way that they need it. If you don't have the developer resources to do that then you're going to be stuck with this problem of having to teach a user technical details. You also need to be very careful about what kind of damage the user can do. Do they have update rights on the data? I hope they don't accidentally do a DELETE FROM Table with no WHERE clause. Even if you restrict their permissions, a poorly written query can crush the database server or block resources causing problems for other users (and more work for you).
If you have no choice, then I think that you need to certainly teach them about primary and foreign keys, even if you don't call them that. Point out that the id on your table (or whatever your PK is) identifies a row. Then explain how the id appears in other tables to show the relationship. For example, "See, in the address table we have a person_id which tells us who that address belongs to."
After that, expect to spend a large portion of your time with that user as they make mistakes or come up with other things that they want to get from the database, but which they can't figure out how to get.
From theory, and ideally, you should define primary keys on all tables, and join tables using a primary key to the matching field or fields (foreign key) in the other table.
Even if you don't define or if they're not defined as primary keys, you need to make sure the fields uniquely identify the records in the table, and that they should be properly indexed.
For example, let's say the 'person' table has a SSN and a driver's license field. The SSN could be considered and flagged as the 'primary key', but if you join that table to a 'drivers' table which might not have the SSN, but does have the driver's license #, you could join them by the driver's license field (even if it's not flagged as primary key), but you need to make sure that the field is properly indexed in both tables.
...explain to somebody how they can determine what fields from multiple tables/views they should join on.
Simply put, look for the columns with values that match between the tables/views. Preferably, match exactly but some massaging might be necessary.
The existence of foreign key constraints would help to know what matches to what, but the constraint might not be directly to the table/view that is to be joined.
The existence of a primary key doesn't mean it is the criteria that is necessary for the query, so I would overlook this detail (depending on the audience).
I would recommend attacking the desired result set by starting with the columns desired, and working back from there. If there's more than one table's columns in the result set, focus on the table whose columns should be returning distinct results first and then gradually add joins, checking the result set after each JOIN addition to confirm the results are still the same. If they are not, you need to review the JOIN, or whether a JOIN is actually necessary versus IN or EXISTS.
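To illustrate the JOIN versus EXISTS point, here is a small sketch with made-up customers and orders tables (none of these names come from the question), run through Python's sqlite3 module. When the second table is only needed as a filter, a JOIN repeats the customer once per matching order, while EXISTS keeps one row per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables for illustration only.
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers(id));
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO orders VALUES (1, 1), (2, 1), (3, 3);  -- Ann has two orders, Bob none
""")

# "Which customers have placed an order?" answered with a JOIN:
# Ann appears twice because she matches two order rows.
with_join = [r[0] for r in conn.execute("""
SELECT c.name
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
ORDER BY c.id
""")]

# The same question with EXISTS: orders only filters, so each customer appears once.
with_exists = [r[0] for r in conn.execute("""
SELECT c.name
FROM customers AS c
WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
ORDER BY c.id
""")]

print(with_join, with_exists)
```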
I did this when I first started out, it comes from thinking of joins as just linking tables together, so I linked at all possible points.
Once you think of joins as a way to combine AND filter the data it becomes easier to understand them.
Writing out your request as a sentence is helpful too, "I want to see all the times Table A interacted with Table B". Then build a query from that using only the ID, noting that if you wanted to know "All the times Table A was in the same zip code as Table B" then you would join by zip code.

SQL Modeling / Query Question

I currently have this database structure:
One entry can have multiple items of the type "file", "text" and "url".
Everyone of these items has exactly one corresponding item in either the texts, urls or files table - where data is stored.
I need a query to efficiently select an entry with all its corresponding items and their data.
So my first approach was something like
SELECT * FROM entries LEFT JOIN entries_items LEFT JOIN texts LEFT JOIN urls LEFT JOIN files
and then loop through it and do the post processing in my application.
But the thing is that it's very unlikely that multiple items of different types exist. It's even a rare case that more than one item exists per entry. And in most cases it will be a file. But I need it anyway...
So, to avoid scanning all 3 tables for every item, I thought I could do something like case/switch and scan the corresponding table based on the value of "type" in entries_items.
But I couldn't get it working.
I also thought about making the case/switch logic in the application, but then I would have multiple queries which would probably be slower as the MySQL server will be external.
I can also change the structure if you have a better approach!
I also thought about having all the fields of "texts", "urls" and "files" inside the table entries_items, as it's only a 1:1 relation, and just leaving everything that is not needed null.
What would be the pros/cons of that? I think it needs more storage space and I can't do my constraints as I have them now. Everything also needs to be nullable...
Well I am open to all sorts of ideas. The application is not written yet, so I can basically change whatever I like.
You have three different entity types (URL, TEXT, FILE) being linked to the primary ENTRIES table via the intermediary table ENTRIES_ITEMS, and you are violating normal form with this "conditional join" approach. Given your structure, it is impossible to declare a foreign key constraint on ENTRIES_ITEMS.id because the id column could reference the URLS, the TEXTS, or the FILES table. To normalize the ENTRIES_ITEMS table you would have to add three separate fields, urlid, textid, and fileid and allow them to be nullable, and then you could join each of the three entities tables to the ENTRIES table via your linking table. The approach you are taking is very commonly found in legacy databases that were not SQL92-compliant, where the values were grabbed from the entities tables programmatically/procedurally rather than declaratively using SQL selects.
I would first consider adding a column to your "entries_items" table that contains an XML representation of texts, urls, and files. I can't speak for MySQL, but SQL Server has fantastic facilities for handling XML. I bet MySQL does too.
If not a state-of-the-art technique like that, then I would consider going retro and just having one items table with many nulls, as you already considered.
This may get you started, but it will not resolve the hierarchical structure (parent_id) of entries and entries_items.
select *
from entries as e
join entries_items as i on i.entry_id = e.id
left join texts as t on t.item_id = i.id and i.type = 'text'
left join urls as u on u.item_id = i.id and i.type = 'url'
left join files as f on f.file_id = i.id and i.type = 'file'
;
If considering the model cleanup, this may be a starting point.