Best Practice for Querying a Lookup Table - SQL

I am trying to figure out a way to query a property feature lookup table.
I have a property table that contains rental property information (address, rent, deposit, # of bedrooms, etc.) along with another table (Property_Feature) that represents the features of this property (pool, air conditioning, laundry on-site, etc.). The features themselves are defined in yet another table labeled Feature.
Property
    pid - primary key
    other property details

Feature
    fid - primary key
    name
    value

Property_Feature
    id - primary key
    pid - foreign key (Property)
    fid - foreign key (Feature)
Let's say someone wants to search for a property that has air conditioning, a pool, and laundry on-site. How do you query the Property_Feature table for multiple features for the same property if each row only represents one feature? What would the SQL query look like? Is this possible? Is there a better solution?
Thanks for the help and insight.

In terms of database design, yours is the right way to do it. It's correctly normalized.
For the query, I would simply use exists, like this:
select * from Property
where
exists (select * from Property_Feature where pid = property.pid and fid = 'key_air_conditioning')
and
exists (select * from Property_Feature where pid = property.pid and fid = 'key_pool')
Where key_air_conditioning and key_pool are obviously the keys for those features.
The performance will be OK even for large databases.

Here's the query that will find all the properties with a pool:
select
p.*
from
property p
inner join property_feature pf on
p.pid = pf.pid
inner join feature f on
pf.fid = f.fid
where
f.name = 'Pool'
I use inner joins instead of EXISTS since they tend to be a bit faster.
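If you need several features at once with this join style, a common variant (not part of the original answer; the feature names here are just examples) is to group by the property and count the distinct features matched:
select pf.pid
from property_feature pf
inner join feature f on pf.fid = f.fid
where f.name in ('Pool', 'Air Conditioning', 'Laundry On-Site')
group by pf.pid
having count(distinct f.fid) = 3
You can then join the resulting pids back to property to pick up the remaining columns.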

You can also do something like this:
SELECT *
FROM Property p
WHERE 3 =
    ( SELECT COUNT(*)
      FROM Property_Feature pf, Feature f
      WHERE pf.pid = p.pid
        AND pf.fid = f.fid
        AND f.name IN ('air conditioning', 'pool', 'laundry on-site')
    );
Obviously, if your front end is capturing the fids of the feature items when the user is selecting them, you can dispense with the join to Feature and constrain directly on fid. Your front end would know what the count of features selected was, so determining the value for "3" above is trivial.
Compare it, performance-wise, to the tekBlues construction above; depending on your data distribution, either one of these might be the faster query.
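For illustration, assuming the front end captured fids 17, 42 and 99 (hypothetical values) and that (pid, fid) pairs are unique in Property_Feature, the fid-only version might look like:
SELECT *
FROM Property p
WHERE 3 =
    ( SELECT COUNT(*)
      FROM Property_Feature pf
      WHERE pf.pid = p.pid
        AND pf.fid IN (17, 42, 99)
    );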


select items where id is contained in another table field

I have the following database schema in sqlite3:
Basically, a member has multiple characters. A character plays in an activity (with a mode type) and has results for that activity (character_activity_stats).
I select all of the stats (activity / character_activity_stats) for a specific character and mode like so:
SELECT
*,
activity.mode as activity_mode,
character_activity_stats.id as character_activity_stats_index
FROM
character_activity_stats
INNER JOIN
activity ON character_activity_stats.activity = activity.id,
modes ON modes.activity = activity.id
WHERE
modes.mode = 5 AND
character_activity_stats.character = 1
This works great.
However, now I want to select the same set of data, but by member (basically combine results for all characters for a member).
However, I am not really sure how to even approach this.
Basically, I need to retrieve all character_activity_stats where character_activity_stats.character is a character of the specified member (by id). Any suggestions or pointers? (I am very new to sql).
Join those 3 tables on the right keys:
select *
from character_activity_stats
join character on character_activity_stats.character = character.id
join member on member.id = character.member
where member.id = ?
If you don't need any data from member other than to limit by its id, you can leave that join off and just use character.member = ? instead.
It's much easier if you use the same name for the primary and foreign keys (i.e. don't use a plain id for the primary key). It also lets you use natural joins, so you don't even need to spell out the join conditions. The usual convention for the primary key is <table>_id. Your schema mixes id and _id styles across most of the tables, so I'm not sure what that is about.
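For illustration only (this is not from the original answer): if the keys were renamed consistently, say character.character_id referenced by character_activity_stats.character_id, and member.member_id referenced by character.member_id, the query could shrink to natural joins. Note that a natural join matches on every shared column name, so this only works if the naming is disciplined:
select *
from character_activity_stats
natural join character
natural join member
where member_id = ?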

SQL query to search an unique ID that can be in three different tables

I have three tables that control products, colors and sizes. Products can have or not colors and sizes. Colors can or not have sizes.
product
    id
    unique_id
    stock
    title

color
    id
    id_product (FK from product)
    unique_id
    stock

size
    id
    id_product (FK from version)
    id_version (FK from version)
    unique_id
    stock
The unique_id column, present in all three tables, is a serial type (autoincrement) whose counter is shared between them; basically it works as a global unique ID across the tables.
It works fine, but I am trying to improve query performance when I have to select some fields based on the unique_id.
As I don't know which table holds the unique_id I am looking for, I am using UNION, like below:
select title, stock
from product
where unique_id = 10
UNION
select p.title, c.stock
from color c
join product p on c.id_product = p.id
where c.unique_id = 10
UNION
select p.title, s.stock
from size s
join product p on s.id_product = p.id
where s.unique_id = 10;
Is there a better way to do this? Thanks for any suggestion!
EDIT 1
Based on @ErwinBrandstetter's and @ErikE's answers I decided to use the query below. The main reasons are:
1) As unique_id is indexed in all tables, I will get good performance
2) Using the unique_id I can find the product id, so I can get all the columns I need with another simple join
SELECT
p.title,
ps.stock
FROM (
select id as id_product, stock
from product
where unique_id = 10
UNION
select id_product, stock
from color
where unique_id = 10
UNION
select id_product, stock
from size
where unique_id = 10
) AS ps
JOIN product p ON ps.id_product = p.id;
PL/pgSQL function
To solve the problem at hand, a plpgsql function like the following should be faster:
CREATE OR REPLACE FUNCTION func(int)
  RETURNS TABLE (title text, stock int) LANGUAGE plpgsql AS
$BODY$
BEGIN
   RETURN QUERY
   SELECT p.title, p.stock
   FROM   product p
   WHERE  p.unique_id = $1;  -- Put the most likely table first.

   IF NOT FOUND THEN
      RETURN QUERY
      SELECT p.title, c.stock
      FROM   color c
      JOIN   product p ON c.id_product = p.id
      WHERE  c.unique_id = $1;

      IF NOT FOUND THEN
         RETURN QUERY
         SELECT p.title, s.stock
         FROM   size s
         JOIN   product p ON s.id_product = p.id
         WHERE  s.unique_id = $1;
      END IF;
   END IF;
END;
$BODY$;
Updated function with table-qualified column names to avoid naming conflicts with OUT parameters.
RETURNS TABLE requires PostgreSQL 8.4, RETURN QUERY requires version 8.2. You can substitute both for older versions.
It goes without saying that you need to index the columns unique_id of every involved table. id should be indexed automatically, being the primary key.
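For completeness, a sketch of those indexes (names are illustrative; if unique_id is unique within each table, UNIQUE indexes would be even better):
CREATE INDEX product_unique_id_idx ON product (unique_id);
CREATE INDEX color_unique_id_idx   ON color   (unique_id);
CREATE INDEX size_unique_id_idx    ON size    (unique_id);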
Redesign
Ideally, you can tell which table an ID belongs to from the ID alone. You could keep using one common sequence, but add 100000000 for the first table, 200000000 for the second and 300000000 for the third - or whatever suits your needs. This way, the leading digit of the number tells you immediately which table it came from.
A plain integer spans numbers from -2147483648 to +2147483647, move to bigint if that's not enough for you. I would stick to integer IDs, though, if possible. They are smaller and faster than bigint or text.
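A rough PostgreSQL sketch of the offset idea, assuming a shared sequence named common_seq (all names here are illustrative, not from the original answer):
CREATE SEQUENCE common_seq;

CREATE TABLE product (
    id        serial PRIMARY KEY,
    unique_id int NOT NULL DEFAULT nextval('common_seq') + 100000000,
    title     text,
    stock     int
);
-- color would use + 200000000 and size + 300000000 as the DEFAULT offset.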
CTEs (experimental!)
If you cannot create a function for some reason, this pure SQL solution might do a similar trick:
WITH x(uid) AS (SELECT 10) -- provide unique_id here
, a AS (
SELECT title, stock
FROM x, product
WHERE unique_id = x.uid
)
, b AS (
SELECT p.title, c.stock
FROM x, color c
JOIN product p ON c.id_product = p.id
WHERE NOT EXISTS (SELECT 1 FROM a)
AND c.unique_id = x.uid
)
, c AS (
SELECT p.title, s.stock
FROM x, size s
JOIN product p ON s.id_product = p.id
WHERE NOT EXISTS (SELECT 1 FROM b)
AND s.unique_id = x.uid
)
SELECT * FROM a
UNION ALL
SELECT * FROM b
UNION ALL
SELECT * FROM c;
I am not sure whether it avoids additional scans like I hope. Would have to be tested. This query requires at least PostgreSQL 8.4.
Upgrade!
As I just learned, the OP runs on PostgreSQL 8.1.
Upgrading alone would speed up the operation a lot.
Query for PostgreSQL 8.1
As you are limited in your options and a plpgsql function is not possible, this query should perform better than the one you have. Test with EXPLAIN ANALYZE, which is available in v8.1.
SELECT title, stock
FROM product
WHERE unique_id = 10
UNION ALL
SELECT p.title, ps.stock
FROM product p
JOIN (
SELECT id_product, stock
FROM color
WHERE unique_id = 10
UNION ALL
SELECT id_product, stock
FROM size
WHERE unique_id = 10
) ps ON ps.id_product = p.id;
I think it's time for a redesign.
You have things that you're using as bar codes for items that are basically all the same in one respect (they are SerialNumberItems), but have been split into multiple tables because they are different in other respects.
I have several ideas for you:
Change the Defaults
Just make each product required to have one color "no color" and one size "no size". Then you can query any table you want to find the info you need.
SuperType/SubType
Without too much modification you could use the supertype/subtype database design pattern.
In it, there is a parent table where all the distinct detail-level identifiers live, and the shared columns of the subtype tables go in the supertype table (the ways that all the items are the same). There is one subtype table for each different way that the items are distinct. If mutual exclusivity of the subtype is required (you can have a Color or a Size but not both), then the parent table is given a TypeID column and the subtype tables have an FK to both the ParentID and the TypeID. Looking at your design, in fact you would not use mutual exclusivity.
If you use the pattern of a supertype table, you do have the issue of having to insert in two parts, first to the supertype, then the subtype. Deleting also requires deleting in reverse order. But you get a great benefit of being able to get basic information such as Title and Stock out of the supertype table with a single query.
You could even create schema-bound views for each subtype, with instead-of triggers that convert inserts, updates, and deletes into operations on the base table + child table.
A Bigger Redesign
You could completely change how Colors and Sizes are related to products.
First, your patterns of "has-a" are these:
Product (has nothing)
Product->Color
Product->Size
Product->Color->Size
There is a problem here. Clearly Product is the main item that has other things (colors and sizes) but colors don't have sizes! That is an arbitrary assignment. You may as well have said that Sizes have Colors--it doesn't make a difference. This reveals that your table design may not be best, as you're trying to model orthogonal data in a parent-child type of relationship. Really, products have a ColorAndSize.
Furthermore, when a product comes in colors and sizes, what does the uniqueid in the Color table mean? Can such a product be ordered without a size, having only a color? This design is assigning a unique ID to something that (it seems to me) should never be allowed to be ordered--but you can't find this information out from the Color table, you have to compare the Color and Size tables first. It is a problem.
I would design this as: Table Product. Table Size listing all distinct sizes possible for any product ever. Table Color listing all distinct colors possible for any product ever. And table OrderableProduct that has columns ProductId, ColorID, SizeID, and UniqueID (your bar code value). Additionally, each product must have one color and one size or it doesn't exist.
Basically, Color and Size are like X and Y coordinates into a grid; you are filling in the boxes that are allowable combinations. Which one is the row and which the column is irrelevant. Certainly, one is not a child of the other.
If there are any reasonable rules, in general, about what colors or sizes can be applied to various sub-groups of products, there might be utility in a ProductType table and a ProductTypeOrderables table that, when creating a new product, could populate the OrderableProduct table with the standard set—it could still be customized but might be easier to modify than to create anew. Or, it could define the range of colors and sizes that are allowable. You might need separate ProductTypeAllowedColor and ProductTypeAllowedSize tables. For example, if you are selling T-shirts, you'd want to allow XXXS, XXS, XS, S, M, L, XL, XXL, XXXL, and XXXXL, even if most products never use all those sizes. But for soft drinks, the sizes might be 6-pack 8oz, 24-pack 8oz, 2 liter, and so on, even if each soft drink is not offered in that size (and soft drinks don't have colors).
In this new scheme, you only have one table to query to find the correct orderable product. With proper indexes, it should be blazing fast.
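A hypothetical sketch of that OrderableProduct table (names and types are illustrative, not from the original answer); the bar-code lookup then becomes a single indexed query:
CREATE TABLE orderable_product (
    unique_id  serial PRIMARY KEY,          -- the bar code value
    product_id int NOT NULL REFERENCES product (id),
    color_id   int NOT NULL REFERENCES color (id),
    size_id    int NOT NULL REFERENCES size (id),
    stock      int NOT NULL DEFAULT 0,
    UNIQUE (product_id, color_id, size_id)  -- one row per allowable combination
);

SELECT p.title, op.stock
FROM   orderable_product op
JOIN   product p ON p.id = op.product_id
WHERE  op.unique_id = 10;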
Your Question
You asked:
In PostgreSQL, do you think that if I use indexes on unique_id I will get satisfactory performance?
Any column or set of columns that you use to repeatedly look up data must have an index! Any other pattern will result in a full table scan each time, which will be awful performance. I am sure that these indexes will make your queries lightning fast as it will take only one leaf-level read per table.
There's an easier way to generate unique IDs using three separate auto_increment columns. Just prepend a letter to the ID to uniquify it:
Colors:
C0000001
C0000002
C0000003
Sizes:
S0000001
S0000002
S0000003
...
Products:
P0000001
P0000002
P0000003
...
A few advantages:
You don't need to serialize creation of ids across tables to ensure uniqueness. This will give better performance.
You don't actually need to store the letter in the table. All IDs in the same table start with the same letter, so you only need to store the number. This means that you can use an ordinary auto_increment column to generate your IDs.
If you have an ID you only need to check the first character to see which table it can be found in. You don't even need to make a query to the database if you just want to know whether it's a product ID or a size ID.
A disadvantage:
It's no longer a number. But you can get around that by using 1,2,3 instead of C,S,P.
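As a quick illustration of the first-character check (not from the original answer; substr() works in most dialects, including SQLite, MySQL and PostgreSQL):
SELECT CASE substr('C0000042', 1, 1)
           WHEN 'P' THEN 'product'
           WHEN 'C' THEN 'color'
           WHEN 'S' THEN 'size'
       END AS source_table;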
Your query will be pretty efficient, as long as you have an index on unique_id on every table and indexes on the joining columns.
You could turn those UNIONs into UNION ALL, but there won't be any difference in performance for this query.
This is a bit different. I don't understand the intended behaviour if stock exists in more than one of the {product, color, zsize} tables. (UNION will remove duplicates, but only for the row as a whole, e.g. the {product_id, stock} tuples; that makes no sense to me.) I just take the first non-null value. (Note the funky self-join!)
SELECT p.title
, COALESCE (p2.stock, c.stock, s.stock) AS stock
FROM product p
LEFT JOIN product p2 on p2.id = p.id AND p2.unique_id = 10
LEFT JOIN color c on c.id_product = p.id AND c.unique_id = 10
LEFT JOIN zsize s on s.id_product = p.id AND s.unique_id = 10
WHERE COALESCE (p2.stock, c.stock, s.stock) IS NOT NULL
;

Modeling Existential Facts in a Relational Database

I need a way to represent existential relations in a database. For instance I have a bio-historical table (i.e. a family tree) that stores a parent id and a child id which are foreign keys to a people table. This table is used to describe arbitrary family relationships. Thus I’d like to be able to say that X and Y are siblings without having to know exactly who the parents of X and Y are. I just want to be able to say that there exists two different people A and B such that A and B are each parents of X and Y. Once I do know who A and/or B are I’d need to be able to reconcile them.
The simplest solution I can think of is to store existential people with negative integer user ids. Once I know who the people are, I’d need to cascade update all of the IDs. Are there any well-known techniques for this?
Does existential mean "non-existent"?
They don't have to be negative. You could just add a record to People table with no last/first name and perhaps a flag "unknown person". Or existential if you like.
Then when you know something (e.g. like last name but not first) you update this record.
Reconciling duplicate people could be more difficult. I guess you could just update FamilyTree set parent_id=new_id where parent_id=old_id, etc. But this means for instance that the same person could end up with too many parents, so you'll need to perform a number of complex checks before doing that.
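For what it's worth, the reconciliation step described here might look roughly like this (the column names parent_id, child_id and id are hypothetical placeholders; run the too-many-parents checks before committing):
-- Fold the placeholder person :old_id into the confirmed person :new_id,
-- then remove the placeholder row.
UPDATE FamilyTree SET parent_id = :new_id WHERE parent_id = :old_id;
UPDATE FamilyTree SET child_id  = :new_id WHERE child_id  = :old_id;
DELETE FROM People WHERE id = :old_id;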
I would document only the known relationships in a link table which links your Person table to itself with:
FK Person1ID
FK Person2ID
RelationshipTypeID (Sibling, Father, Mother, Step-Father, Step-Mother, etc.)
With some appropriate constraints on that table (or multiple tables, one for each relationship type if that makes the constraints more logical)
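A hypothetical sketch of that link table with a couple of basic constraints (names are illustrative and assume a RelationshipType lookup table; richer rules such as "at most two parents" would still need triggers or application checks):
CREATE TABLE Relationship (
    Person1ID          int NOT NULL REFERENCES People (PersonID),
    Person2ID          int NOT NULL REFERENCES People (PersonID),
    RelationshipTypeID int NOT NULL REFERENCES RelationshipType (RelationshipTypeID),
    PRIMARY KEY (Person1ID, Person2ID, RelationshipTypeID),
    CHECK (Person1ID <> Person2ID)  -- nobody is related to themselves
);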
Then, when other relationships can be inferred (by running an exception query) but are missing, create them. Keep in mind that a half-sibling will only share one parent.
For instance, people who are siblings who don't have all their parents identified:
SELECT *
FROM People p1
INNER JOIN Relationship r_sibling
    ON r_sibling.Person1ID = p1.PersonID
    AND r_sibling.RelationshipType = SIBLING_TYPE_CONSTANT
INNER JOIN People p2
    ON r_sibling.Person2ID = p2.PersonID
WHERE EXISTS (
    -- p1 has a father
    SELECT *
    FROM Relationship r_father
    WHERE r_father.RelationshipType = FATHER_TYPE_CONSTANT
      AND r_father.Person2ID = p1.PersonID
)
AND NOT EXISTS (
    -- p2 (p1's sibling) doesn't have a father yet
    SELECT *
    FROM Relationship r_father
    WHERE r_father.RelationshipType = FATHER_TYPE_CONSTANT
      AND r_father.Person2ID = p2.PersonID
)
You might need to UNION the reverse of this query depending on how you want your relationships constrained (siblings are always commutative, unlike other relationships) and then handle mothers similarly.
Hmmm, come to think of it, I guess I need a general way to reconcile duplicate people anyway and I can use it for this purpose. Thoughts?

SQL multiple join on many to many tables + comma separation

I have these tables:
media
    id (int primary key)
    uri (varchar)

media_to_people
    media_id (int primary key)
    people_id (int primary key)

people
    id (int primary key)
    name (varchar)
    role (int) -- specifies whether the person is an artist, publisher, writer, actor, etc. relative to the media; range 1-10
This is a many-to-many relation.
I want to fetch a media item and all its associated people in a single select. So if a media item has 10 people associated with it, all 10 must come back.
Furthermore, if multiple people with the same role exist for a given media item, they must come as comma-separated values under a column for that role.
Result headings must look like: media.id, media.uri, people.name(actor), people.name(artist), people.name(publisher) and so on.
I'm using sqlite.
SQLite doesn't have the "pivot" functionality you'd need for starters, and the "comma separated values" part is definitely a presentation issue that it would be absurd (and possibly unfeasible) to try to push into any database layer, whatever dialect of SQL may be involved -- it's definitely a part of the job you'd do in the client, e.g. a reporting facility or programming language.
Use SQL for data access, and leave presentation to other layers.
How you get your data is
SELECT media.id, media.uri, people.name, people.role
FROM media
JOIN media_to_people ON (media.id = media_to_people.media_id)
JOIN people ON (media_to_people.people_id = people.id)
WHERE media.id = ?
ORDER BY people.role, people.name
(the ? is one way to indicate a parameter in SQLite, to be bound to the specific media id you're looking for in ways that depend on your client); the data will come from the DB to your client code in several rows, and your client code can easily put them into the single column form that you want.
It's hard for us to say how to code the client-side part w/o knowing anything about the environment or language you're using as the client. But in Python for example:
import collections

def showit(dataset):
    by_role = collections.defaultdict(list)
    for mediaid, mediauri, name, role in dataset:
        by_role[role].append(name)
    headers = ['mediaid', 'mediauri']
    result = [mediaid, mediauri]
    for role in sorted(by_role):
        headers.append('people(%s)' % role)
        result.append(','.join(by_role[role]))
    return ' '.join(headers) + '\n' + ' '.join(result)
even this won't quite match your spec -- you ask for headers such as 'people(artist)' while you also specify that the role's encoded as an int, and mention no way to go from the int to the string 'artist', so it's obviously impossible to match your spec exactly... but it's as close as my ingenuity can get;-).
I agree with Alex Martelli's answer, that you should get the data in multiple rows and do some processing in your application.
If you try to do this with just joins, you need to join to the people table for each role type, and if there are multiple people in each role, your query will have Cartesian products between these roles.
So you need to do this with GROUP_CONCAT() and produce a scalar subquery in your select-list for each role:
SELECT m.id, m.uri,
(SELECT GROUP_CONCAT(name)
FROM media_to_people JOIN people ON (people_id = id)
WHERE media_id = m.id AND role = 1) AS Actors,
(SELECT GROUP_CONCAT(name)
FROM media_to_people JOIN people ON (people_id = id)
WHERE media_id = m.id AND role = 2) AS Artists,
(SELECT GROUP_CONCAT(name)
FROM media_to_people JOIN people ON (people_id = id)
WHERE media_id = m.id AND role = 3) AS Publishers
FROM media m;
This is truly ugly! Don't try this at home!
Take our advice, and don't try to format the pivot table using only SQL.

Query to get all revisions of an object graph

I'm implementing an audit log on a database, so everything has a CreatedAt and a RemovedAt column. Now I want to be able to list all revisions of an object graph but the best way I can think of for this is to use unions. I need to get every unique CreatedAt and RemovedAt id.
If I'm getting a list of countries with provinces the union looks like this:
SELECT c.CreatedAt AS RevisionId from Countries as c where localId=#Country
UNION
SELECT p.CreatedAt AS RevisionId from Provinces as p
INNER JOIN Countries as c ON p.CountryId=c.LocalId AND c.LocalId = #Country
UNION
SELECT c.RemovedAt AS RevisionId from Countries as c where localId=#Country
UNION
SELECT p.RemovedAt AS RevisionId from Provinces as p
INNER JOIN Countries as c ON p.CountryId=c.LocalId AND c.LocalId = #Country
For more complicated queries this could get quite complicated and possibly perform very poorly so I wanted to see if anyone could think of a better approach. This is in MSSQL Server.
I need them all in a single list because this is being used in a from clause and the real data comes from joining on this.
You have most likely already implemented your solution, but to address a few issues, I would suggest considering Aleris's solution, or some derivative thereof.
In your tables you have a "removed at" field -- well, if that field is populated, technically the data shouldn't be there any more; or perhaps your implementation only flags the row for deletion, in which case the logging breaks once the row is actually removed.
What happens when you have multiple updates during a reporting period? The previous log entries would be overwritten.
Having a separate log allows for archival of the log information and allows you to set a different log analysis cycle from your update/edit cycles.
Add whatever "linking" fields required to enable you to get back to your original source data OR make the descriptions sufficiently verbose.
The fields contained in your log are up to you, but Aleris's solution is direct. I might create an action table and change the Action field type from varchar to int, as a link into the action table -- forcing the developers to a standardized set of actions.
Hope it helps.
An alternative would be to create an audit log that might look like this:
AuditLog table
EntityName varchar(2000),
Action varchar(255),
EntityId int,
OccuranceDate datetime
where EntityName is the name of the table (e.g. Countries, Provinces), Action is the audit action (e.g. Created, Removed, etc.) and EntityId is the primary key of the modified row in the original table.
The table would need to be kept synchronized with each action performed on the tables. There are a couple of ways to do this:
1) Make triggers on each table that add rows to AuditLog (see the sketch after this list)
2) From your application, add rows to AuditLog each time a change is made to the respective tables
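For illustration, a trigger for option 1 might look roughly like this in SQL Server, assuming Countries exposes its key as Id (as the join example further down does) and the AuditLog table proposed above:
CREATE TRIGGER trg_Countries_Audit
ON Countries
AFTER INSERT, DELETE
AS
BEGIN
    -- Log newly created rows
    INSERT INTO AuditLog (EntityName, Action, EntityId, OccuranceDate)
    SELECT 'Countries', 'Created', i.Id, GETDATE()
    FROM inserted i;

    -- Log removed rows
    INSERT INTO AuditLog (EntityName, Action, EntityId, OccuranceDate)
    SELECT 'Countries', 'Removed', d.Id, GETDATE()
    FROM deleted d;
END;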
With this solution it is very simple to get a list of audit entries.
If you also need columns from the original table, that is possible using joins like this:
select *
from
Countries C
join AuditLog L on C.Id = L.EntityId and L.EntityName = 'Countries'
You could probably do it with a full outer join and COALESCE, but the union is probably still better from a performance standpoint. You can test each, though.
SELECT
COALESCE(C.CreatedAt, P.CreatedAt)
FROM
dbo.Countries C
FULL OUTER JOIN dbo.Provinces P ON
1 = 0
WHERE
C.LocalID = #Country OR
P.LocalID = #Country