Tips on Database schema - sql

I have a database that tracks UK Horse races.
Race contains all the information for a particular race.
CREATE TABLE "race" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT,
"date" TEXT NOT NULL,
"time" TEXT NOT NULL,
"name" TEXT NOT NULL,
"class" INTEGER NOT NULL,
"distance" INTEGER NOT NULL,
"extra" TEXT NOT NULL,
"going" TEXT NOT NULL,
"handicap" INTEGER NOT NULL,
"prize" REAL,
"purse" REAL,
"surface" TEXT NOT NULL,
"type" TEXT NOT NULL,
"course_id" INTEGER NOT NULL,
"betfair_path" TEXT NOT NULL UNIQUE,
"racingpost_id" INTEGER NOT NULL UNIQUE,
UNIQUE("betfair_path", "racingpost_id")
);
A race can have many entries.
CREATE TABLE "entry" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT,
"weight" INTEGER,
"allowance" INTEGER,
"horse_id" INTEGER NOT NULL,
"jockey_id" INTEGER,
"trainer_id" INTEGER,
"race_id" INTEGER NOT NULL,
UNIQUE("race_id", "horse_id")
);
An entry can have 0 or 1 runner. This takes into account non-runners, horses entered for a race but who failed to start.
CREATE TABLE "runner" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT,
"position" TEXT NOT NULL,
"beaten" INTEGER,
"isp" REAL NOT NULL,
"bsp" REAL,
"place" REAL,
"over_weight" INTEGER,
"entry_id" INTEGER NOT NULL UNIQUE
);
My question is
Is that actually the best way to store my Entry vs Runner data? Note: Entry data is always harvested in a single sweep, and runner (basically result) is found later.
What query would I need to quickly find total entries vs. total runners for a particular race?
How can I easily match the runner information with entry information without multiple selects?
Apologies if I am missing something obvious but I am now brain dead from coding this application.

Your schema looks reasonable. The key construct to use to address your SQL questions is LEFT JOIN, for example:
SELECT COUNT(entry.id) entry_count, COUNT(runner.id) runner_count
FROM entry
LEFT JOIN runner ON runner.entry_id = entry.id
WHERE race_id = 1
From Wikipedia:
... a left outer join returns all the values from the left table, plus matched values from the right table (or NULL in case of no matching join predicate).
So in general for your schema, focus on the entry table and LEFT JOIN the runner table as needed.
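If you want that breakdown for every race at once rather than a single race_id, the same LEFT JOIN grouped by race works too (just a sketch against the tables above):
SELECT entry.race_id,
       COUNT(entry.id)  AS entry_count,
       COUNT(runner.id) AS runner_count
FROM entry
LEFT JOIN runner ON runner.entry_id = entry.id
GROUP BY entry.race_id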

The question is tagged relational-database, and you want advice on your schema, as per the title. Even though the single question has been answered, you may have more tomorrow.
I couldn't make any sense of your three flat files, so I drew them up into what they might look like in a Relational database, where the information is organised and queries are easy. Going brain dead is not unusual when the information remains in its complex form.
If you have not seen the Relational Modelling Standard, you might need the IDEF1X Notation.
Note, OwnerId, JockeyId, and TrainerId are all PersonIds. No use manufacturing new ones when there is a perfectly good unique one already sitting there in the table. Just rename it to reflect its Role, and the PK of the table that it is in (the relevance of this will become clear when you code).
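To make the Role idea concrete, a minimal DDL sketch (illustrative only; the column names follow the query further down, not your actual model):
CREATE TABLE Person (
    PersonId INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL
);
CREATE TABLE RaceEntry (
    RacecourseCode TEXT    NOT NULL,
    RaceDate       TEXT    NOT NULL,
    RaceNo         INTEGER NOT NULL,
    HorseId        INTEGER NOT NULL,
    TrainerId      INTEGER REFERENCES Person (PersonId), -- PersonId in its Trainer Role
    JockeyId       INTEGER REFERENCES Person (PersonId), -- PersonId in its Jockey Role
    PRIMARY KEY (RacecourseCode, RaceDate, RaceNo, HorseId)
);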
Multiple SELECTs are nothing to be scared of; SQL is a cumbersome language, but that is all we have. The problem is:
the complexity (necessary due to a bad model) of each SELECT
and whether you learn and understand how to use subqueries or not.
Single level queries are obviously very limited, and will lead to procedural (row-by-row) processing instead of set-processing.
Single level queries result in huge result sets that then have to be beaten into submission using GROUP BY, etc. Not good for performance, churning through all that unwanted data; better to get only the data you really want.
Now the queries.
When you are printing race forms, I think you will need the Position scheduled and advertised for the RaceEntry; it is not an element of a Runner.
Now that we have gotten rid of those Ids all over the place, which force all sorts of unnecessary joins, we can join directly to the parents concerned (fewer joins). Eg. for the Race Form, which is only concerned with RaceEntry, for the Owner, you can join directly to Person using WHERE OwnerId = Person.PersonId; no need to join HorseRegistered or Owner.
LEFT and RIGHT joins are OUTER joins, which means the rows on one side may be missing. That method has been answered, and you will get Nulls, which you have to process later (more code and cycles). I do not think that is what you want, if you are filling forms or a web page.
The concept here is to think in terms of Relational sets, not row-by-row processing. But you need a database for that. Now that we have a bit of Relational power in the beast, you can try this for the Race Result (not the Race Form), instead of procedural processing. These are Scalar Subqueries, for the passed Race identifiers (the outer query is only concerned with a Race):
SELECT (SELECT ISNULL(Place, " ")
FROM Runner
WHERE RacecourseCode = RE.RacecourseCode
AND RaceDate = RE.RaceDate
AND RaceNo = RE.RaceNo
AND HorseId = RE.HorseId) AS Finish,
(SELECT ISNULL(Name, "SCRATCH")
FROM Runner R,
Horse H
WHERE R.RacecourseCode = RE.RacecourseCode
AND R.RaceDate = RE.RaceDate
AND R.RaceNo = RE.RaceNo
AND R.HorseId = RE.HorseId
AND H.HorseId = RE.HorseId) AS Horse,
-- Details,
(SELECT Name FROM Person WHERE PersonId = RE.TrainerId) AS Trainer,
(SELECT Name FROM Person WHERE PersonId = RE.JockeyId) AS Jockey,
ISP AS SP,
Weight AS Wt
FROM RaceEntry RE
WHERE RaceDate = #RaceDate
AND RacecourseCode = #RacecourseCode -- to print entire race form,
AND RaceNo = #RaceNo -- remove these 2 lines
ORDER BY Position

This matches entries and runners for a given race:
SELECT E.*, R.*
FROM entry E LEFT JOIN runner R on R.entry_id = E.id
WHERE E.race_id = X
If the entry has no runner, then the R.* fields are all null. You can count such null fields to answer your entries-vs-runners count question (or, perhaps more easily, subtract).
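Putting numbers on it, the same join gives both totals (and the non-runner count) in one pass; a sketch using the same placeholder X for the race id:
SELECT COUNT(*)               AS total_entries,
       COUNT(R.id)            AS total_runners,
       COUNT(*) - COUNT(R.id) AS non_runners
FROM entry E
LEFT JOIN runner R ON R.entry_id = E.id
WHERE E.race_id = X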

Related

"Data warehouse"-like SQLite store design

I am interested in designing a SQL-based (SQLite, actually) storage layer for an application processing a large number of similar data entries. For this example, let it be a store of chat messages.
The application has to provide the capabilities of filtering and analyzing the data by message participants, tags, etc., all of those implying N-to-N relationships.
So, the schema (kind of star) will look something like:
create table messages (
message_id INTEGER PRIMARY KEY,
time_stamp INTEGER NOT NULL
-- other fact fields
);
create table users (
user_id INTEGER PRIMARY KEY,
-- user dimension data
);
create table message_participants (
user_id INTEGER references users(user_id),
message_id INTEGER references messages(message_id)
);
create table tags (
tag_id INTEGER PRIMARY KEY,
tag_name TEXT NOT NULL,
-- tag dimension data
);
create table message_tags (
tag_id INTEGER references tags(tag_id),
message_id INTEGER references messages(message_id)
);
-- etc.
So, all good and well, until I have to perform analytic operations and filtering based on the N-to-N dimensions. Given millions of rows in the messages table and thousands in the dimensions (there are more than shown in the example), all the joins are simply too much of a performance hit.
For example, I would like to analyze the number of messages each user participated in, given the data is filtered based on selected tags, selected users and other aspects:
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id=MP.message_id
join users as U on MP.user_id=U.user_id
where
MP.user_id not in ( /* some user ID's set */ )
and M.time_stamp between #StartTime and #EndTime
and
-- more fact table fields filtering
and M.message_id in
(select message_id
from message_tags
where tag_id in ( /* some tag ID's set */ ))
and
-- more N-to-N filtering
group by U.user_id
I am constrained to SQL and, specifically, SQLite. And I do use indices on the tables.
Is there some way I don't see to improve the schema, maybe a clever way to de-normalize it?
Or maybe there is a way to somehow index the dimension keys inside the message row (I thought about using FTS capabilities but not sure if searching the textual index and joining on the results will provide any performance leverage)?
Too long to put in a comment, and might help with performance but isn't exactly a direct answer to your question (your schema seems fine): have you tried messing with your query itself?
I often see that kind of subselect filter for many-to-many, and I have found that on large queries like this I frequently see improvements in performance from running a CTE/join rather than a WHERE column IN (subselect):
;with tagMessages as (
select distinct message_id
from message_tags
where tag_id in ( /* some tag ID's set */ )
) -- more N-to-N filtering
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id=MP.message_id
join users as U on MP.user_id=U.user_id
join tagMessages on M.message_id = tagMessages.message_id
where
MP.user_id not in ( /* some user ID's set */ )
and M.time_stamp between #StartTime and #EndTime
and
-- more fact table fields filtering
group by U.user_id
We can tell they're the same, but the query planner can sometimes find this more helpful.
Disclaimer: I don't do SQLite, I do SQL Server, so sorry if I've made some obvious (or otherwise) error.
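Not a rewrite of the query itself, but for what it's worth: if the junction tables aren't already covered both ways, composite indexes roughly like these (names illustrative) are usually what lets SQLite resolve the tag filter and the participant join without full scans:
CREATE INDEX idx_message_tags_tag_msg ON message_tags (tag_id, message_id);
CREATE INDEX idx_message_participants_msg_usr ON message_participants (message_id, user_id);
CREATE INDEX idx_messages_time ON messages (time_stamp);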

Subquery that matches column with several ranges defined in table

I've got a pretty common setup for an address database: a person is tied to a company with a join table, the company can have an address and so forth.
All pretty normalized and easy to use. But for search performance, I'm creating a materialized, rather denormalized view. I only need a very limited set of information and quick queries. Most of what's usually done via a join table is now in an array. Depending on the query, I can either search it directly or join it via unnest.
As a complement to my zipcodes column (varchar[]), I'd like to add a states column that has the (German federal) states already precomputed, so that I don't have to transform a query to include all kinds of range comparisons.
My mapping data is in a table like this:
CREATE TABLE zip2state (
state TEXT NOT NULL,
range_start CHARACTER VARYING(5) NOT NULL,
range_end CHARACTER VARYING(5) NOT NULL
)
Each state has several ranges, and ranges can overlap (one zip code can be for two different states). Some ranges have range_start = range_end.
Now I'm a bit at wit's end on how to get that into a materialized view all at once. Normally, I'd feel tempted to just do it iteratively (via trigger or on the application level).
Or as we're just talking about 5 digits, I could create a big table mapping zip to state directly instead of doing it via a range (my current favorite, yet something ugly enough that it prompted me to ask whether there's a better way)
Any way to do that in SQL, with a table like the above (or something similar)? I'm at postgres 9.3, all features allowed...
For completeness' sake, here's the subquery for the zip codes:
(select array_agg(distinct address.zipcode)
from affiliation
join company
on affiliation.ins_id = company.id
join address
on address.com_id = company.id
where affiliation.per_id = person.id) AS zipcodes,
I suggest a LATERAL join instead of the correlated subquery to conveniently compute both columns at once. Could look like this:
SELECT p.*, z.*
FROM person p
LEFT JOIN LATERAL (
SELECT array_agg(DISTINCT d.zipcode) AS zipcodes
, array_agg(DISTINCT z.state) AS states
FROM affiliation a
-- JOIN company c ON a.ins_id = c.id -- suspect you don't need this
JOIN address d ON d.com_id = a.ins_id -- c.id
LEFT JOIN zip2state z ON d.zipcode BETWEEN z.range_start AND z.range_end
WHERE a.per_id = p.id
) z ON true;
If referential integrity is guaranteed, you don't need to join to the table company at all. I took the shortcut.
Be aware that varchar or text behaves differently than expected for numbers. For example: '333' > '0999'. If all zip codes have 5 digits you are fine.
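A quick way to see that caveat, and the lpad guard if some codes might be stored without leading zeros (a sketch; lpad is available on 9.3):
SELECT '333' > '0999'               AS text_compare,  -- true: compared as text, '3' sorts after '0'
       lpad('333', 5, '0') > '0999' AS padded;        -- false: '00333' sorts before '0999'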
Related:
What is the difference between LATERAL and a subquery in PostgreSQL?

Creating a denormalized table from a normalized key-value table using 100s of joins

I have an ETL process which takes values from an input table which is a key value table with each row having a field ID and turning it into a more denormalized table where each row has all the values. Specifically, this is the input table:
StudentFieldValues (
FieldId INT NOT NULL,
StudentId INT NOT NULL,
Day DATE NOT NULL,
Value FLOAT NULL
)
FieldId is a foreign key from table Field, Day is a foreign key from table Days. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:
StudentDays (
StudentId INT NOT NULL,
Day DATE NOT NULL,
NumberOfClasses FLOAT NULL,
MinutesLateToSchool FLOAT NULL,
... -- the rest of the 188 fields
)
The PK is the first 2 fields.
Currently the query that populates the output table does a self join with StudentFieldValues 188 times, one for each field. Each join equates StudentId and Day and takes a different FieldId. Specifically:
SELECT Students.StudentId, Days.Day,
StudentFieldValues1.Value NumberOfClasses,
StudentFieldValues2.Value MinutesLateToSchool,
...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1
ON Students.StudentId=StudentFieldValues1.StudentId AND
Days.Day=StudentFieldValues1.Day AND
StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2
ON Students.StudentId=StudentFieldValues2.StudentId AND
Days.Day=StudentFieldValues2.Day AND
StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId
I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.
More details:
MS SQL Server 2014, 2x XEON E5 2690v2 (20 cores, 40 threads total), 128GB RAM. Windows 2008R2.
352 million rows in the input table, 18 million rows in the output table - both expected to increase over time.
Query takes 20 minutes and I'm very happy with that, but performance degrades as I add more fields.
Think about doing this using conditional aggregation:
SELECT s.StudentId, d.Day,
max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses,
max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d LEFT OUTER JOIN
StudentFieldValues sfv
ON s.StudentId = sfv.StudentId AND
d.Day = sfv.Day
GROUP BY s.StudentId, d.Day;
This has the advantage of easy scalability. You can add hundreds of fields and the processing time should be comparable (longer, but comparable) to fewer fields. It is also easier to add new fields.
EDIT:
A faster version of this query would use subqueries instead of aggregation:
SELECT s.StudentId, d.Day,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 1 and sfv.StudentId = s.StudentId and sfv.Day = d.Day) as NumberOfClasses,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 2 and sfv.StudentId = s.StudentId and sfv.Day = d.Day) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d;
For performance, you want a composite index on StudentFieldValues(StudentId, day, FieldId, Value).
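Spelled out, that index could look like this (name illustrative; Value is carried as an included column so each subquery is fully covered):
CREATE INDEX IX_StudentFieldValues_Student_Day_Field
    ON StudentFieldValues (StudentId, Day, FieldId)
    INCLUDE (Value);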
Yes, this is going to blow up. You have your definitions of "normalized" and "denormalized" backwards. The Field/Value table design is not a relational design. It's a variation of the entity-attribute-value design, which has all sorts of problems.
I recommend you do not try to pivot the data in an SQL query. It doesn't scale well that way. Instead, you need to query it as a set of rows, as it is stored in the database, and fetch back the result set into your application. There you write code to read the data row by row, and apply the "fields" to fields of an object or a hashmap or something.
I think there may be some trial and error here to see what works but here are some things you can try:
Disable indexes and re-enable after data load is complete (a sketch follows after this list).
Disable any triggers that don't need to run during data load scenarios.
The above was taken from an msdn post where someone was doing something similar to what you are doing.
Think about trying to only update the de-normalized table based on changed records if this is possible. Limiting the result set would be much more efficient if this is a possibility.
You could try a more threaded iterative approach in code (C#, vb, etc) to build this table by student where you aren't doing the X number of joins all at one time.
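For the first suggestion, a T-SQL sketch (the index name is hypothetical; only disable nonclustered indexes, since disabling the clustered index makes the table unreadable):
ALTER INDEX IX_StudentDays_SomeNonclustered ON StudentDays DISABLE;
-- ... run the INSERT ... SELECT load here ...
ALTER INDEX IX_StudentDays_SomeNonclustered ON StudentDays REBUILD;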

Collapsing multiple subqueries into one in Postgres

I have two tables:
CREATE TABLE items
(
root_id integer NOT NULL,
id serial NOT NULL,
-- Other fields...
CONSTRAINT items_pkey PRIMARY KEY (root_id, id)
)
CREATE TABLE votes
(
root_id integer NOT NULL,
item_id integer NOT NULL,
user_id integer NOT NULL,
type smallint NOT NULL,
direction smallint,
CONSTRAINT votes_pkey PRIMARY KEY (root_id, item_id, user_id, type),
CONSTRAINT votes_root_id_fkey FOREIGN KEY (root_id, item_id)
REFERENCES items (root_id, id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
-- Other constraints...
)
I'm trying to, in a single query, pull out all items of a particular root_id along with a few arrays of user_ids of the users who voted in particular ways. The following query does what I need:
SELECT *,
ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 0 AND direction = 1) as upvoters,
ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 0 AND direction = -1) as downvoters,
ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 1) as favoriters
FROM items i
WHERE root_id = 1
ORDER BY id
The problem is that I'm using three subqueries to get the information I need when it seems like I should be able to do the same in one. I thought that Postgres (I'm using 8.4) might be smart enough to collapse them all into a single query for me, but looking at the explain output in pgAdmin it looks like that's not happening - it's running multiple primary key lookups on the votes table instead. I feel like I could rework this query to be more efficient, but I'm not sure how.
Any pointers?
EDIT: An update to explain where I am now. At the advice of the pgsql-general mailing list, I tried changing the query to use a CTE:
WITH v AS (
SELECT item_id, type, direction, array_agg(user_id) as user_ids
FROM votes
WHERE root_id = 5305
GROUP BY type, direction, item_id
ORDER BY type, direction, item_id
)
SELECT *,
(SELECT user_ids from v where item_id = i.id AND type = 0 AND direction = 1) as upvoters,
(SELECT user_ids from v where item_id = i.id AND type = 0 AND direction = -1) as downvoters,
(SELECT user_ids from v where item_id = i.id AND type = 1) as favoriters
FROM items i
WHERE root_id = 5305
ORDER BY id
Benchmarking each of these from my application (I set up each as a prepared statement to avoid spending time on query planning, and then ran each one several thousand times with a variety of root_ids) my initial approach averages 15 milliseconds and the CTE approach averages 17 milliseconds. I was able to repeat this result over a few runs.
When I have some time I'm going to play with jkebinger's and Dragontamer5788's approaches with my test data and see how they work, but I'm also starting a bounty to see if I can get more suggestions.
I should also mention that I'm open to changing my schema (the system isn't in production yet, and won't be for a couple months) if it can speed up this query. I designed my votes table this way to take advantage of the primary key's uniqueness constraint - a given user can both favorite and upvote an item, for example, but not upvote it AND downvote it - but I can relax/work around that constraint if representing these options in a different way makes more sense.
EDIT #2: I've benchmarked all four solutions. Amazingly, Sequel is flexible enough that I was able to write all four without dropping to SQL once (not even for the CASE statements). Like before, I ran them all as prepared statements, so that query planning time wouldn't be an issue, and did each run several thousand times. Then I ran all the queries under two situations - a worst-case scenario with a lot of rows (265 items and 4911 votes) where the relevant rows would be in the cache pretty quickly, so CPU usage should be the deciding factor and a more realistic scenario where a random root_id was chosen for each run. I wound up with:
Original query - Typical: ~10.5 ms, Worst case: ~26 ms
CTE query - Typical: ~16.5 ms, Worst case: ~70 ms
Dragontamer5788 - Typical: ~15 ms, Worst case: ~36 ms
jkebinger - Typical: ~42 ms, Worst case: ~180 ms
I suppose the lesson to take from this right now is that Postgres' query planner is very smart and is probably doing something clever under the surface. I don't think I'm going to spend any more time trying to work around it. If anyone would like to submit another query attempt I'd be happy to benchmark it, but otherwise I think Dragontamer is the winner of the bounty and correct (or closest to correct) answer. Unless someone else can shed some light on what Postgres is doing - that would be pretty cool. :)
There are two questions being asked:
A syntax to collapse multiple subqueries into one.
Optimization.
For #1, I can't get the "complete" thing into a single Common Table Expression, because you're using a correlated subquery on each item. Still, you might have some benefits if you used a common table expression. Obviously, this will depend on the data, so please benchmark to see if it would help.
For #2, because there are three commonly accessed "classes" of items in your table, I expect partial indexes to increase the speed of your query, regardless of whether or not you were able to increase the speed due to #1.
First, the easy stuff then. To add a partial index to this table, I'd do:
CREATE INDEX upvote_vote_index ON votes (type, direction)
WHERE (type = 0 AND direction = 1);
CREATE INDEX downvote_vote_index ON votes (type, direction)
WHERE (type = 0 AND direction = -1);
CREATE INDEX favoriters_vote_index ON votes (type)
WHERE (type = 1);
The smaller these indexes, the more efficient your queries will be. Unfortunately, in my tests, they didn't seem to help :-( Still, maybe you can find a use for them; it depends greatly on your data.
As for an overall optimization, I'd approach the problem differently. I'd "unroll" the query into this form (using an inner join and using conditional expressions to "split up" the three types of votes), and then use "Group By" and the "array" aggregate operator to combine them. IMO, I'd rather change my application code to accept it in the "unrolled" form, but if you can't change the application code, then the "group by"+aggregate function ought to work.
SELECT array_agg(v.user_id), -- array_agg(anything else you needed),
i.root_id, i.id, -- I presume you needed the primary key?
CASE
WHEN v.type = 0 AND v.direction = 1
THEN 'upvoter'
WHEN v.type = 0 AND v.direction = -1
THEN 'downvoter'
WHEN v.type = 1
THEN 'favoriter'
END as vote_type
FROM items i
JOIN votes v ON i.root_id = v.root_id AND i.id = v.item_id
WHERE i.root_id = 1
AND ((type=0 AND (direction=1 OR direction=-1))
OR type=1)
GROUP BY i.root_id, i.id, vote_type
ORDER BY id
It's still "one step unrolled" compared to your code (vote_type is vertical, while in your case it's horizontal, across the columns). But this seems to be more efficient.
Just a guess, but maybe it could be worth trying:
Maybe SQL can optimize the query if you create a VIEW of
SELECT user_id from votes where root_id = i.root_id AND item_id = i.id
and then select 3 times from there with the different where-clauses about type and direction.
If that's not helping either, maybe you could fetch the 3 types as additional boolean columns and then only work with one query?
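A sketch of that last idea (names illustrative): one row per (root_id, item_id, user_id), with the mutually exclusive up/down vote as a single nullable column and the favorite as its own flag, which keeps "can favorite and upvote, but not upvote and downvote" enforceable:
CREATE TABLE votes2
(
  root_id integer NOT NULL,
  item_id integer NOT NULL,
  user_id integer NOT NULL,
  direction smallint CHECK (direction IN (1, -1)), -- NULL means no up/down vote
  favorited boolean NOT NULL DEFAULT false,
  CONSTRAINT votes2_pkey PRIMARY KEY (root_id, item_id, user_id),
  CONSTRAINT votes2_root_id_fkey FOREIGN KEY (root_id, item_id)
    REFERENCES items (root_id, id) MATCH SIMPLE
    ON UPDATE CASCADE ON DELETE CASCADE
)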
Would be interested to hear, if you find a solution. Good luck.
Here's another approach. It has the (possibly) undesirable result of including NULL values in the arrays, but it works in one pass, rather than three. I find it helpful to think of some SQL queries in a map-reduce manner, and case statements are great for that.
select
v.root_id, v.item_id,
array_agg(case when type = 0 AND direction = 1 then user_id else NULL end) as upvoters,
array_agg(case when type = 0 AND direction = -1 then user_id else NULL end) as downvoters,
array_agg(case when type = 1 then user_id else NULL end) as favoriters
from items i
join votes v on i.root_id = v.root_id AND i.id = v.item_id
group by 1, 2
With some sample data, I get this result set:
root_id | item_id | upvoters | downvoters | favoriters
---------+---------+----------------+------------------+------------------
1 | 2 | {100,NULL,102} | {NULL,101,NULL} | {NULL,NULL,NULL}
2 | 4 | {100,NULL,101} | {NULL,NULL,NULL} | {NULL,100,NULL}
I believe you need postgres 8.4 to get array_agg, but there's been a recipe for an array_accum function prior to that.
There's a discussion on postgres-hackers list about how to build a NULL-removing version of array_agg if you're interested.
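For reference, on newer Postgres (9.4 and later, so not the 8.4 in the question) the FILTER clause on array_agg avoids the NULL placeholders entirely; a sketch of the same aggregation:
select
  v.root_id, v.item_id,
  array_agg(user_id) filter (where type = 0 AND direction = 1) as upvoters,
  array_agg(user_id) filter (where type = 0 AND direction = -1) as downvoters,
  array_agg(user_id) filter (where type = 1) as favoriters
from items i
join votes v on i.root_id = v.root_id AND i.id = v.item_id
group by 1, 2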

Best Practice to querying a Lookup table

I am trying to figure out a way to query a property feature lookup table.
I have a property table that contains rental property information (address, rent, deposit, # of bedrooms, etc.) along with another table (Property_Feature) that represents the features of this property (pool, air conditioning, laundry on-site, etc.). The features themselves are defined in yet another table labeled Feature.
Property
pid - primary key
other property details
Feature
fid - primary key
name
value
Property_Feature
id - primary key
pid - foreign key (Property)
fid - foreign key (Feature)
Let say someone wants to search for property that has air conditioning, and a pool and laundry on-site. How do you query the Property_Feature table for multiple features for the same property if each row only represents one feature? What would the SQL query look like? Is this possible? Is there a better solution?
Thanks for the help and insight.
In terms of database design, yours is the right way to do it. It's correctly normalized.
For the query, I would simply use exists, like this:
select * from Property
where
exists (select * from Property_Feature where pid = property.pid and fid = 'key_air_conditioning')
and
exists (select * from Property_Feature where pid = property.pid and fid = 'key_pool')
Where key_air_conditioning and key_pool are obviously the keys for those features.
The performance will be OK even for large databases.
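If pid and fid aren't already indexed together, a composite index along these lines (name illustrative) keeps each EXISTS probe to a simple index lookup:
CREATE INDEX idx_property_feature_pid_fid ON Property_Feature (pid, fid);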
Here's the query that will find all the properties with a pool:
select
p.*
from
property p
inner join property_feature pf on
p.pid = pf.pid
inner join feature f on
pf.fid = f.fid
where
f.name = 'Pool'
I use inner joins instead of EXISTS since it tends to be a bit faster.
You can also do something like this:
SELECT *
FROM Property p
WHERE 3 =
( SELECT COUNT(*)
FROM Property_Feature pf
, Feature f
WHERE pf.pid = p.pid
AND pf.fid = f.fid
AND f.name in ('air conditioning', 'pool', 'laundry on-site')
);
Obviously, if your front end is capturing the fids of the feature items when the user is selecting them, you can dispense with the join to Feature and constrain directly on fid. Your front end would know what the count of features selected was, so determining the value for "3" above is trivial.
Compare it, performance-wise, to the tekBlues construction above; depending on your data distribution, either one of these might be the faster query.
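For completeness, a sketch of that fid-only variant (fids 1, 2, 3 are hypothetical stand-ins for the selected features; the 3 must match the number of fids supplied, and each (pid, fid) pair is assumed to appear at most once):
SELECT p.*
FROM Property p
WHERE 3 =
( SELECT COUNT(*)
  FROM Property_Feature pf
  WHERE pf.pid = p.pid
  AND pf.fid IN (1, 2, 3)
);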