SQL querying, group relationships - sql

Suppose I have two tables:
CREATE TABLE Group
(
    id integer primary key,
    someData1 text,
    someData2 text
);

CREATE TABLE GroupMember
(
    id integer primary key,
    group_id integer REFERENCES Group (id),
    someData text
);
I'm aware that my SQL syntax is not correct :) Hopefully it's clear enough. My problem is this: I want to load a group record and all the GroupMember records associated with that group. As I see it, there are two options.
A single query:
SELECT Group.id, Group.someData1, Group.someData2, GroupMember.id, GroupMember.someData
FROM Group INNER JOIN GroupMember ON GroupMember.group_id = Group.id
WHERE Group.id = 4;
Two queries:
SELECT id, someData1, someData2
FROM Group
WHERE id = 4;
SELECT id, someData
FROM GroupMember
WHERE group_id = 4;
The first solution has the advantage of requiring only one database round trip, but has the disadvantage of returning redundant data (all the group data is duplicated for every group member).
The second solution returns no duplicate data but involves two round trips to the database.
What is preferable here? I suppose there's some threshold such that if the group sizes become sufficiently large, the cost of returning all the redundant data is going to be greater than the overhead involved with an additional database call. What other things should I be thinking about here?
Thanks,
Jordan

If you actually want the results joined, I believe it is always more efficient to do the joining at the server level. The SQL processor is designed to match sets of data.
If you really want the results of two SQL statements, you can always send both statements in one batch separated by a semicolon, and get two result sets back with one round trip to the DB.
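For example, a minimal sketch of such a batch, assuming your client library or driver can return multiple result sets from one call:
SELECT id, someData1, someData2 FROM Group WHERE id = 4;
SELECT id, someData FROM GroupMember WHERE group_id = 4;
Both result sets come back from the single round trip; how you step from the first to the second depends on the data access API you are using.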

How the data is finally used is an important and unknown factor.
I suggest the single query method for most applications. Proper indexing will keep the query more efficient than the two query method.
The single query method also has the benefit of remaining valid if you need to select more than one group.
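For example, a sketch of the multi-group case, reusing the join condition implied by the foreign key in the question:
SELECT Group.id, Group.someData1, Group.someData2, GroupMember.id, GroupMember.someData
FROM Group INNER JOIN GroupMember ON GroupMember.group_id = Group.id
WHERE Group.id IN (4, 5, 6);
The two-query approach would instead need either one member query per group or a second IN list that has to be kept consistent with the first query.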

If you are only ever going to be retrieving a single group record with each request to the database, then I would go with the second option. If you are retrieving multiple group records and their associated group member records, go with the join, as it will be much quicker.

In general, it depends on what type of data you are trying to display.
If you are showing a single group and all its members, performance differences between the two options would be negligible.
If you are showing many groups and all of their members, the overhead of having to make a roundtrip to the database for each successive group will quickly outweigh any benefit you got from receiving a little less data.
Some other things you might want to consider in your reasoning:
Result Set Size - For many groups and members, your result set size may become a limiting factor as the amount of data to retrieve and hold in memory increases. This is likely to occur with the second option. You may want to consider paging the data, so that you are only retrieving a certain subset at a time (see the sketch after this list).
Lazy Loading - If you are only getting the members of some groups, or a user is requesting the members one group at a time, consider Lazy Loading. This means only making the additional query to get the group's members when needed. This makes sense only in certain use cases, but it can be much more effective than retrieving all data up front.
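A paging sketch for the member query from the question, assuming a database that supports LIMIT/OFFSET (the exact syntax varies by vendor):
SELECT id, someData
FROM GroupMember
WHERE group_id = 4
ORDER BY id
LIMIT 50 OFFSET 0;
Each subsequent page increases the OFFSET; the ORDER BY keeps the pages stable between calls.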

Depending on the type of database and your frontend application, you can return the results of two SQL statements in one trip (a stored procedure in SQL Server 2005, for example, as sketched below).
If you are creating a report that requires many fields from the Group table, you may not want the increased amount of data with the first query.
If this is some type of data entry app, you've probably already presented the Group data to the user, so they could fill in the group id on the where clause (or preferably via some parameter) and now they need the member results.
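A minimal T-SQL sketch of that stored-procedure approach, assuming SQL Server and the table names from the question (Group has to be bracketed because it is a reserved word):
CREATE PROCEDURE GetGroupWithMembers
    @GroupId int
AS
BEGIN
    -- first result set: the group itself
    SELECT id, someData1, someData2 FROM [Group] WHERE id = @GroupId;
    -- second result set: its members
    SELECT id, someData FROM GroupMember WHERE group_id = @GroupId;
END
A single EXEC GetGroupWithMembers @GroupId = 4 then returns both result sets in one round trip.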

It really, really, really depends on what use you will make of the data.
For instance, if you were assembling a list of group members for a mail shot, and you need the group name for each letter you're going to send to a member, and you have no use for the Group level on its own, then the single joined query makes a lot of sense.
But if, say, you're coding a master-detail screen or report, with a page for each group displaying information at both the Group and the Member levels, then the two separate queries are probably most useful.
Unless you are retrieving quite large amounts of data (tens of thousands of groups with hundreds of members per group, or similar orders of magnitude), it is unlikely you are going to see much difference between the performance of the two approaches.

On a simple query like this I would try to perform it in one query. The overhead of two database calls will probably exceed the additional SQL processing time of the query.
A UNION clause will do this for you:
SELECT id, someData1, someData2
FROM Group
WHERE id = 4
UNION
SELECT id, someData, null
FROM GroupMember
WHERE group_id = 4;

Related

Current State Query / Architecture

I expect this is a common enough use case, but I'm unsure of the best way to leverage database features to do it. Hopefully the community can help.
Given a business domain where a record is made up of a number of attributes; we can just call these a, b, c.
Each of these belongs to a parent record, of which there can be many.
Given an external data source that posts updates to those attributes at arbitrary times, and typically only to a subset of them, you get instructions like
z:{a:3}
or
y:{b:2,c:100}
What are good ways to query Postgres for the 'current state', i.e. a single-row result that represents the most recent value of each of a, b, c for each of the parent records?
The current state overall looks like:
x:{a:0, b:0, c:1}
y:{a:1, b:2, c:3}
z:{a:2, b:65, c:6}
If it matters, the difference in time between updates to a single value could be arbitrarily long.
I am deliberately avoiding having a table that keeps updating and writing an individual row for the state because the write-contention could be a problem, and I think there must be a better overall pattern.
Your question is a bit theoretical - but in essence you are describing a top-1-per-group problem. In Postgres, you can use distinct on for this.
Assuming that your table is called mytable, where attributes are stored in column attribute, and that column ordering_id defines the ordering of the rows (that could be a timestamp or a serial, for example), you would phrase the query as:
select distinct on (attribute) t.*
from mytable t
order by attribute, ordering_id desc
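If the table also carries a parent_id column (an assumption here; the answer's mytable, attribute and ordering_id names are already illustrative), the same pattern extends to the latest value per parent per attribute:
select distinct on (parent_id, attribute) t.*
from mytable t
order by parent_id, attribute, ordering_id desc
Each (parent_id, attribute) pair then contributes exactly one row: the one with the highest ordering_id.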

How to speed up joins

I'm using SQL Server 2008 R2. I have a problem returning data to the user because of massive joins (for example, I need to make 5 inner + 6 left joins in one query, usually against TVFs, sometimes tables). It takes too long.
What are the workarounds for this problem?
Should I denormalize my database?
What are the best practices to avoid a huge number of joins?
I'd have to see the SQL to troubleshoot specifics, but here are a few things I do when pulling results that have extremely high demand:
Use your tools. Display Estimated Execution Plan can expose some obvious vagaries in your logic.
Learn to love 'where exists' and 'having'. You can minimize the focus and scope sometimes by qualifying in creative ways that don't require HARD IO. This is more true for sub-queries than joins but I add a clause for every outer join I need.
Most importantly IMO, don't be afraid of staging your results. You sometimes need to process billions/trillions of transactions against millions of records, and what takes hours with joins can be accomplished in minutes or seconds by staging. If you only need x% of your top 2 or 3 tables, why join every record top to bottom? Sometimes it's just too much overhead.
Pull your simplest result-set down to a stage table (or temp, whatever you need), index it and then go after the next chunk. That usually saves me a fortune in memory.
Use CTEs when you can. However, my experience has been they degrade beyond a certain point. Nice for ancillary tables but not for serious volume.
Be creative in your combinations. I'll use those exists clauses in Stage 1 (reading Tables a, b and c) to only bring back the records that also exist in tables d, e and f.
A lot of the expert SQL advice is not based on VLDBs - it's based on Customer, Orders, Demographic type schemas.
Are these stored procs run natively?
Here's a good (over-simplified) example of staging:
Let's say you wanted to find all of the high-risk individuals in your city (might as well be interesting about it). You have a phone company DB (national) indexed by state, city, last name, first name, address, and an FBI DB (global) indexed by last name, first name, country, region, address. Let's say the FBI DB has multiple records for each individual due to multiple past addresses.
You could join the two DBs on the common elements and then qualify your criteria. Or...
SELECT RecordID
FROM Phone AS P1
WHERE State = 'MyState'
  AND City = 'MyCity'
  AND EXISTS (SELECT 1
              FROM TheMan AS M1
              WHERE M1.Last = P1.Last
                AND M1.First = P1.First
                AND M1.Risk > 80)
Now I have a small record-set to qualify and a small result-set to work from. From there I can go get details. That's a good candidate for a CTE, and I could shoot a dozen holes in the logic, but it illustrates the concept. If you bring M1.Risk (a non-indexed field) into the equation with a full join, you're forcing SQL Server to plan against it in certain situations. Not necessarily here, but it matters as your logic gets more complex and further non-indexed criteria come into play.
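A minimal T-SQL sketch of the staging step itself, reusing the hypothetical Phone and TheMan tables from the example above:
-- Stage 1: pull the narrow candidate set into a temp table
SELECT P1.RecordID, P1.Last, P1.First
INTO #Candidates
FROM Phone AS P1
WHERE P1.State = 'MyState'
  AND P1.City = 'MyCity'
  AND EXISTS (SELECT 1
              FROM TheMan AS M1
              WHERE M1.Last = P1.Last
                AND M1.First = P1.First
                AND M1.Risk > 80);

-- Index the stage table before going after the detail data
CREATE INDEX IX_Candidates_Name ON #Candidates (Last, First);

-- Stage 2: join only the small candidate set back for the details
SELECT c.RecordID, m.Address, m.Risk
FROM #Candidates AS c
JOIN TheMan AS m
  ON m.Last = c.Last AND m.First = c.First;
The heavy tables are each touched once with a narrow predicate, and the expensive detail join only ever sees the already-reduced candidate set.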

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly ever want to fetch all of them; that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint).
Case 2, however, is a bit trickier with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMSs. So we nest our "business" query in a technical wrapper like this:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... USER SORTED business query ...]
) inner
)
WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated so we know how many pages we will have, even without fetching all the data. Now this paging query is usually quite satisfactory. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), this runs for 2-3 minutes.
EDIT: Please note that a potential bottleneck is not so easy to circumvent, because of the sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?
The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query, they will be called less often.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
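As a sketch of where the hint goes, using a hypothetical business query and a page size of 20:
SELECT /*+ FIRST_ROWS(20) */ e.*
FROM employees e
WHERE e.department_id = 10
ORDER BY e.hire_date DESC;
The hint nudges the optimiser towards a plan that returns the first rows quickly (for example via an index that avoids the sort) rather than one optimised for total throughput.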
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?
You might want to trace the query that takes a lot of time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation. Oracle has to read all the data even if you only fetch one row; this is a common problem that all RDBMSs face with this type of query. No implementation of TOTAL_ROWS will get around that.
The radical way to speed up this type of query is to forgo the TOTAL_ROWS calculation. Just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimation may be sufficient. That's another solution, implemented by Google search for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.
A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.
It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without any sorting or joining to all the tables (join table elimination based on declared foreign key constraints). This is what we generally do in our application. For performance-critical statements, we write a separate query that we know will return the correct count, as we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.
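A rough sketch of such a temporary result store, with hypothetical names (the real structure depends on what a page row contains):
CREATE TABLE page_cache (
    query_key   VARCHAR2(100),
    page_no     NUMBER,
    row_payload VARCHAR2(4000),
    expires_at  DATE
);

-- serve pages 1-5 from the cache while they are still fresh
SELECT row_payload
FROM page_cache
WHERE query_key = :key
  AND page_no BETWEEN 1 AND 5
  AND expires_at > SYSDATE;

-- background task: remove expired entries periodically
DELETE FROM page_cache WHERE expires_at <= SYSDATE;
Whether the resulting staleness window is acceptable is exactly the performance-versus-recency trade-off described above.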

Methods to speed up specific query

I have an existing SQL query which works well but takes what I consider to be quite a bit of time and resources for such a small result set. I am trying to figure out if the following query can be optimized in ways I am unfamiliar with, for better performance.
Query
SELECT
a.programname, count(b.id)
FROM
groups a
LEFT JOIN
selections b ON (a.id_selection = b.id AND a.min_age = 18 AND a.max_age = 24)
LEFT JOIN
member_info c ON (b.memberid = c.memberid AND (c.status = 1 OR c.term_date > '2011-01-31'))
WHERE
a.flag = 3
GROUP BY
a.programname
ORDER BY
a.programid asc;
There are three tables at work here:
Groups - A
Groups contains a list of possible program selections a member can make. A member can have multiple selections within the entire table but can only have one selection per programname and only one age bracket. The overall program is determined by the flag which limits the 400+ programs to only say 100 possible mixes. The program names grouped together are:
member only, member plus spouse, member plus child, family
The result set must return the count of all active members who have that particular selection, even if the result is 0 (i.e. we cannot limit the result set to 3 rows just because one has a zero count).
Selections
This table links the member selections to the groups selections. One member can have multiple IDs from groups, but only one of each type.
Member_info
Contains information about each particular member, including their status (1 is active) and, if they are not active, whether their termination date has passed.
My query takes nearly 3/4 of a full second, which I find to be way too much for this type of information, but maybe I am wrong given all the necessary joins.
Any help is greatly appreciated. I can further expand my question if necessary.
EXPLAIN details
id  select_type  table  type   key          key_len  rows  Extra
1   SIMPLE       a      ALL                          184   Using where; Using temporary; Using filesort
1   SIMPLE       b      index  memberid_id  7        3845  Using index
1   SIMPLE       c      ALL                          1551
EDIT REGARDING INDEX SUGGESTION
I have given much thought to the use of indexes for this query but, as nearly all sources would suggest, their use in an example like this may actually be harmful. The best summary I found was:
Indexes are something extra that you can enable on your MySQL tables to increase performance, but they do have some downsides. When you create a new index MySQL builds a separate block of information that needs to be updated every time there are changes made to the table. This means that if you are constantly updating, inserting and removing entries in your table this could have a negative impact on performance.
The member_information table will grow daily, the groups will stay fairly constant but the selections table can change drastically on a daily basis. As such, the use of indexes really seems to have a negative effect in this case.
Do you have indexes on the columns being joined? That would be an obvious first step.
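For example, based on the join and filter columns in the query above (a sketch only; whether these pay off depends on the write load described in the question's edit):
CREATE INDEX idx_groups_flag ON groups (flag, id_selection);
CREATE INDEX idx_member_info_memberid ON member_info (memberid, status, term_date);
The first index lets MySQL locate the a.flag = 3 rows without the full scan shown in the EXPLAIN output; the second covers the member_info join along with its status/term_date conditions.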
There seem to be no problems with the query itself. Your options are:
using indexes: if you plan to read way more than write
using parameterized queries, so that the db engine can cache the execution plan for reuse
Beyond this, there must be some serious bottleneck in the system or millions of rows in the tables that causes a long execution.
How does your query perform if you run it 100 times in parallel?
If you run this query often, try using bind parameters instead of just concatenating SQL. That way the db engine can cache the execution plan.
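A MySQL-side illustration of the idea, using the query from the question (normally your data access library exposes this as parameterized or prepared statements rather than raw PREPARE):
PREPARE member_counts FROM
  'SELECT a.programname, COUNT(b.id)
   FROM groups a
   LEFT JOIN selections b ON (a.id_selection = b.id AND a.min_age = ? AND a.max_age = ?)
   LEFT JOIN member_info c ON (b.memberid = c.memberid AND (c.status = 1 OR c.term_date > ?))
   WHERE a.flag = ?
   GROUP BY a.programname
   ORDER BY a.programid';

SET @min_age = 18, @max_age = 24, @cutoff = '2011-01-31', @flag = 3;
EXECUTE member_counts USING @min_age, @max_age, @cutoff, @flag;
Because the statement text never changes, the server does not have to re-parse it on every call; only the parameter values vary.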

Performance benefit in this data model?

I have a MySQL (InnoDB) table 'items' with the following characteristics:
Large number of rows, and keeps on increasing.
Large number of columns of various data-types including 'text';
primary key 'item_id' is present.
There are additional requirements as follows:
Need to query items based on their status
Need to update status
The above two operations happen quite frequently.
Given the above scenario I have two questions
Would making a separate table with two columns, namely item_id and status, with item_id as primary key, provide increased performance?
If the above is true, how am I going to tackle querying item_ids based on status?
I am inexperienced in handling databases. I hope you will bear with me :)
This is called vertical segmentation. It is often used when a data entity has multiple access patterns which access different subsets of the entity's attributes (table columns) with different frequencies. If one function needs access to only one or two columns hundreds of times per second, and another application function needs access to all the other columns, but only once or twice a day, then this approach is warranted and will garner a substantial performance improvement.
Basically, as you suggested, you "split" the table into two tables, both with the same key, with a one-to-one FK/PK->PK relationship. In one table you put only those few columns that are accessed more frequently, and you put the rest of the columns in the other table that will be accessed less frequently. You can then apply indexing to each table more appropriately based on the actual access pattern for each table separately.
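A sketch of the split, using the item_id and status columns named in the question (the column types are assumptions):
CREATE TABLE item_status (
    item_id INT NOT NULL,
    status  TINYINT NOT NULL,
    PRIMARY KEY (item_id),
    KEY idx_status (status),
    CONSTRAINT fk_item_status_item FOREIGN KEY (item_id) REFERENCES items (item_id)
) ENGINE=InnoDB;
Frequent status reads and updates then touch only this narrow table, and part 2 of the question becomes SELECT item_id FROM item_status WHERE status = 1, optionally joined back to items when the full record is needed.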
It would make more sense to create an index on your status and item_id columns if those are the only columns you need to fetch.
create index status_item_id_items on items (status)
You can then query your result that will use this index:
select item_id, status from items where status = 'status'
Keep in mind that if you don't have many different statuses, your query may end up returning a lot of rows and could be slow. If you can constrain it with a more 'selective' column, like a datetime, it would be better.
Answering part 2 first, you'd do an inner join of your two tables:
SELECT i.*, s.StatusCode FROM items AS i INNER JOIN status AS s ON s.item_id = i.item_id
To answer part 1, though, I don't think doing this would gain you any performance advantage.