Rails 3 Improve Active Record Query Performance - ruby-on-rails-3

Hi, I have the following query:
Player.select("Players.*, (SELECT COUNT(*) FROM Results WHERE Results.player_id = Players.id and win = true and competition_id = 4) as wins").where("competition_id = 4").order("wins desc")
Essentially it counts and orders Player records by the number of times their foreign key occurs in the Results table with win set to true for a particular competition. To give you a better idea, some sample data from the Results table might be...
Player_ID | Match_ID|Win|Elapsed_Time|etc..
1 | 1 |T | 1:00 |etc..
2 | 1 |F | 1:00 |etc..
1 | 2 |T | 3:00 |etc..
3 | 2 |F | 3:00 |etc..
As you can see, two SELECTs occur within this statement, which I think could cause a performance hit in the future. Players has a one-to-many relationship with Results, as you might guess, but I couldn't figure out how to make this work with a join, which I imagine might be more efficient.
Perhaps I'm wrong and there isn't any problem with the above query - in either case please provide some advice.
I'm using a PostgreSQL database.

Maybe something like this?
Player.select("players.*, COUNT(results.id) AS wins").joins(:results).where("results.win = true AND results.competition_id = 4").group("players.id").order("wins DESC")
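For comparison, here is a minimal sketch of the join-and-group idea as plain SQL, run against an in-memory SQLite database through Python's sqlite3 module rather than PostgreSQL. The player names and the LEFT JOIN (used so that players with zero wins still appear, as they do in the original subquery version) are assumptions for illustration; the table and column names follow the question.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; the player names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, competition_id INTEGER);
CREATE TABLE results (player_id INTEGER, match_id INTEGER, win INTEGER, competition_id INTEGER);
INSERT INTO players VALUES (1, 'ann', 4), (2, 'bob', 4), (3, 'cat', 4);
INSERT INTO results VALUES (1, 1, 1, 4), (2, 1, 0, 4), (1, 2, 1, 4), (3, 2, 0, 4);
""")

# LEFT JOIN keeps players with zero wins. The win/competition filter sits in the
# ON clause so that it does not turn the outer join into an inner join.
WINS_SQL = """
SELECT p.id, p.name, COUNT(r.player_id) AS wins
FROM players p
LEFT JOIN results r
       ON r.player_id = p.id AND r.win = 1 AND r.competition_id = 4
WHERE p.competition_id = 4
GROUP BY p.id, p.name
ORDER BY wins DESC, p.id
"""

rows = conn.execute(WINS_SQL).fetchall()
print(rows)
```

Whether this beats the correlated subquery on PostgreSQL depends on the data and indexes, so comparing both with EXPLAIN ANALYZE is the safer way to decide.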

Related

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored on a server running PostgreSQL 9.3.0. However, I find myself blocked on the last query. The database models a reservation system for an opera house. The query is about associating each spectator with the other spectators who attend the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez@gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille@gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of each spectator and the date of the show they are attending. The query looks like this:
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand the problem better and point me towards a solution? Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the STUFF function. The only drawback here is that, since your ids are ints, placing a comma between values requires a workaround (the values need to be cast to strings). Below is the method I can think of using that workaround.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
@> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine the relevant shows, then only consider those:
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
@> is the "contains" operator for arrays - so we get all spectators that have at least seen the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
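As a rough illustration of the counting idea behind query 2, here is a sketch that runs on SQLite through Python's sqlite3 module instead of PostgreSQL. The sample reservations, the index name and the companions helper are assumptions made up for this example; it relies on (id_spectator, id_show) being unique, as assumed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reservations (
    id_spectator INTEGER NOT NULL,
    id_show      INTEGER NOT NULL,
    UNIQUE (id_spectator, id_show)          -- one reservation per spectator and show
);
-- Second index with leading id_show, as recommended for the join below.
CREATE INDEX reservations_show_idx ON reservations (id_show, id_spectator);
INSERT INTO reservations VALUES
    (1, 1), (1, 14),
    (2, 1), (2, 14), (2, 20),
    (3, 1), (3, 14),
    (9, 1);
""")

# Relational division by counting: another spectator qualifies when the number
# of shows shared with the prime spectator equals the prime spectator's total.
DIVISION_SQL = """
WITH shows AS (
    SELECT id_show FROM reservations WHERE id_spectator = :prime
)
SELECT r.id_spectator
FROM shows s
JOIN reservations r ON r.id_show = s.id_show
WHERE r.id_spectator <> :prime
GROUP BY r.id_spectator
HAVING COUNT(*) = (SELECT COUNT(*) FROM shows)
ORDER BY r.id_spectator
"""

def companions(prime):
    """Return ids of spectators who attended every show that `prime` attended."""
    return [row[0] for row in conn.execute(DIVISION_SQL, {"prime": prime})]

print(companions(1))
```

Spectator 9 attended only one of spectator 1's two shows and is therefore left out, while spectator 2 qualifies despite having seen an extra show.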
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final answer I got looks like this:
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

SQL join two tables using value from one as column name for other

I'm a bit stumped on a query I need to write for work. I have the following two tables:
|===============Patterns==============|
|type | bucket_id | description |
|-----------------------|-------------|
|pattern a | 1 | Email |
|pattern b | 2 | Phone |
|==========Results============|
|id | buc_1 | buc_2 |
|-----------------------------|
|123 | pass | |
|124 | pass |fail |
In the results table, I can see that entity 124 failed a validation check in buc_2. Looking at the patterns table, I can see bucket 2 belongs to pattern b (bucket_id corresponds to the column name in the results table), so entity 124 failed phone validation. But how do I write a query that joins these two tables on the value of one of the columns? Limitations to how this query is going to be called will most likely prevent me from using any cursors.
Some crude solutions:
SELECT "id", "description" FROM
Results JOIN Patterns
ON "buc_1" = 'fail' AND "bucket_id" = 1
union all
SELECT "id", "description" FROM
Results JOIN Patterns
ON "buc_2" = 'fail' AND "bucket_id" = 2
Or, very probably with a better execution plan:
SELECT "id", "description" FROM
Results JOIN Patterns
ON "buc_1" = 'fail' AND "bucket_id" = 1
OR "buc_2" = 'fail' AND "bucket_id" = 2;
This will report all failure descriptions for each id having a fail case in bucket 1 or 2.
See http://sqlfiddle.com/#!4/a3eae/8 for a live example
That being said, the right solution would probably be to change your schema to something more manageable, say by using an association table to store each failed test, since what you have here is in fact a many-to-many relationship.
Another approach, if you are using Oracle ≥ 11g, would be to use the UNPIVOT operation, which translates columns to rows at query execution:
select * from Results
unpivot ("result" for "bucket_id" in ("buc_1" as 1, "buc_2" as 2))
join Patterns
using("bucket_id")
where "result" = 'fail';
Unfortunately, you still have to hard-code the various column names.
See http://sqlfiddle.com/#!4/a3eae/17
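On databases without UNPIVOT, the same columns-to-rows step can be emulated with UNION ALL. The following is only a sketch on SQLite via Python's sqlite3 module, not the Oracle syntax above; the CTE name unpivoted is made up, and the column names still have to be hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patterns (type TEXT, bucket_id INTEGER, description TEXT);
CREATE TABLE results (id INTEGER, buc_1 TEXT, buc_2 TEXT);
INSERT INTO patterns VALUES ('pattern a', 1, 'Email'), ('pattern b', 2, 'Phone');
INSERT INTO results VALUES (123, 'pass', NULL), (124, 'pass', 'fail');
""")

# Each branch of the UNION ALL turns one buc_N column into (id, bucket_id, result)
# rows, which can then be joined to patterns on bucket_id like any other table.
UNPIVOT_SQL = """
WITH unpivoted AS (
    SELECT id, 1 AS bucket_id, buc_1 AS result FROM results
    UNION ALL
    SELECT id, 2 AS bucket_id, buc_2 AS result FROM results
)
SELECT u.id, p.description
FROM unpivoted u
JOIN patterns p ON p.bucket_id = u.bucket_id
WHERE u.result = 'fail'
ORDER BY u.id, p.bucket_id
"""

failures = conn.execute(UNPIVOT_SQL).fetchall()
print(failures)
```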
It looks to me as if what you really want to know is the description (in your example, Phone) of a Pattern entry, given that its bucket failed. You want a solution that fulfills that condition in general, not just for your particular example.
I agree with the comment above. Your bucket entries should be tuples (rows) and not attributes (columns), and the two tables should share an id so that you can actually join them. For example, consider adding a bucket_id column holding the bucket number, and then add just ONE column to store the status. Like this:
|===============Patterns==============|
|type | bucket_id | description |
|-----------------------|-------------|
|pattern a | 1 | Email |
|pattern b | 2 | Phone |
|==========Results====================|
|entity_id | bucket_id |status |
|-------------------------------------|
|123 | 1 |pass |
|124 | 1 |pass |
|123 | 2 | |
|124 | 2 |fail |
1. Use an INNER JOIN (http://www.w3schools.com/sql/sql_join_inner.asp) and a WHERE clause to keep only the buckets that failed.
2. Would this example help?
SELECT Patterns.type, Patterns.description, Results.entity_id, Results.status
FROM Patterns
INNER JOIN Results
ON
Patterns.bucket_id=Results.bucket_id
WHERE
Results.status='fail'
Lastly, I would also add a primary_key column to each table to make sure indexing is faster for each unique combination.
Thanks!
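A small sketch of the normalized layout suggested in this answer, run on SQLite via Python's sqlite3 module. The key constraints and the failed_checks helper are assumptions added for illustration; the tables and sample rows follow the ones shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patterns (type TEXT, bucket_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE results (
    entity_id INTEGER NOT NULL,
    bucket_id INTEGER NOT NULL REFERENCES patterns (bucket_id),
    status    TEXT,                          -- NULL when the check was not run
    PRIMARY KEY (entity_id, bucket_id)       -- one status per entity and bucket
);
INSERT INTO patterns VALUES ('pattern a', 1, 'Email'), ('pattern b', 2, 'Phone');
INSERT INTO results VALUES (123, 1, 'pass'), (124, 1, 'pass'), (123, 2, NULL), (124, 2, 'fail');
""")

def failed_checks(status="fail"):
    """Return (entity_id, type, description, status) rows with the given status."""
    sql = """
    SELECT r.entity_id, p.type, p.description, r.status
    FROM patterns p
    INNER JOIN results r ON r.bucket_id = p.bucket_id
    WHERE r.status = ?
    ORDER BY r.entity_id, p.bucket_id
    """
    return conn.execute(sql, (status,)).fetchall()

print(failed_checks())
```

With one row per entity and bucket, adding a third bucket means inserting rows rather than adding a column and editing every query.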

Selecting adjacent rows in an SQL query

The following is a problem which is not well-suited to an RDBMS, I think, but that is what I've got to deal with.
I am trying to write a tool to search through logs stored in a database.
Some rows might be:
Time | ID | Object | Description
2012-01-01 13:37 | 1 | 1 | Something happened
2012-01-01 13:39 | 2 | 2 | Something else happened
2012-01-01 13:50 | 3 | 2 | Bad
2012-01-01 14:08 | 4 | 1 | Good
2012-01-01 14:27 | 5 | 1 | Bad
2012-01-01 14:30 | 6 | 2 | Good
Object is a foreign key. In practice, Time will increase with ID but that is not an actual constraint. In reality there are more fields. It's a Postgres database - I'd like to be able to support SQLite as well but am aware this may well be impossible.
Now, I want to be able to run a query for, say, all Bad events that happened to Object 2:
SELECT * FROM table WHERE Object = 2 AND Description = 'Bad';
But it would often be useful to see some lines of context around the results - just as the -C option to grep is very useful when searching through text logs.
For the above query, if we wanted one line of context either side, we would want rows 2 and 6 in addition to row 3.
If the original query returned multiple rows, more context would need to be retrieved.
Notice that the context is not retrieved from the events associated with Object 1; we eliminate only the restriction on the Description.
Also, the order involved, and hence what determines what is adjacent to what, is that induced by the Time field.
This specifies what I want to achieve, but the database concerned is fairly big, at least in comparison to the power of the machine it's running on.
The most often cited solution for getting adjacent rows requires you to run one extra query per result in what I'll call the base query; this is no good because that might be thousands of queries.
My current least bad solution is to run a query to retrieve the IDs of all possible rows that could be context - in the above example, that would be a search for all rows relating to Object 2. Then I get the IDs matching the base query, expand (using the list of all possible IDs) to a list of IDs of rows matching the base query or in context, then finally retrieve the data for those IDs.
This works, but is inelegant and slow.
It is especially slow when using the tool from a remote computer, as that initial list of IDs can be very large, and retrieving it and then transmitting it over the internet can take an inordinate amount of time.
Another solution I have tried is using a subquery or view that computes the "buffer sequence" of the rows.
Here's what the table looks like with this field added:
Time | ID | Sequence | Object | Description
2012-01-01 13:37 | 1 | 1 | 1 | Something happened
2012-01-01 13:39 | 2 | 1 | 2 | Something else happened
2012-01-01 13:50 | 3 | 2 | 2 | Bad
2012-01-01 14:08 | 4 | 2 | 1 | Good
2012-01-01 14:27 | 5 | 3 | 1 | Bad
2012-01-01 14:30 | 6 | 3 | 2 | Good
Running the base query on this table then allows you to generate the list of IDs you want by adding or subtracting from the Sequence value.
This eliminates the problem of transferring loads of rows over the wire, but now the database has to run this complicated subquery, and it's unacceptably slow, especially on the first run - given the use-case, queries are sporadic and caching is not very effective.
If I were in charge of the schema I'd probably just store this field there in the database, but I'm not, so any suggestions for improvements are welcome. Thanks!
You should use the ROW_NUMBER windowing function
http://www.postgresql.org/docs/current/static/functions-window.html
Adjacency is an abstract construct and relies on an explicit sort order (an ORDER BY, possibly with PARTITION BY, in the OVER clause). Do you mean the row with the preceding timestamp?
Decide what kind of "adjacent" you want, then compute ROW_NUMBER over that ordering.
Once you have that you would just JOIN each row on the item having ROW_NUMBER +/- 1
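Here is one way the ROW_NUMBER idea could look, sketched on SQLite (window functions need SQLite 3.25 or later) through Python's sqlite3 module; the same SQL shape should carry over to PostgreSQL. The table name log, the CTE name numbered and the search helper are assumptions for illustration. Rows are numbered within a single object, ordered by time, and each hit is joined to the rows whose number lies within n of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log (time TEXT, id INTEGER PRIMARY KEY, object INTEGER, description TEXT);
INSERT INTO log VALUES
    ('2012-01-01 13:37', 1, 1, 'Something happened'),
    ('2012-01-01 13:39', 2, 2, 'Something else happened'),
    ('2012-01-01 13:50', 3, 2, 'Bad'),
    ('2012-01-01 14:08', 4, 1, 'Good'),
    ('2012-01-01 14:27', 5, 1, 'Bad'),
    ('2012-01-01 14:30', 6, 2, 'Good');
""")

# Number the rows of one object by time (id breaks ties), then self-join so that
# every matching row pulls in the rows at most :n positions before or after it.
# DISTINCT removes rows that fall into the context of more than one hit.
CONTEXT_SQL = """
WITH numbered AS (
    SELECT id, time, description,
           ROW_NUMBER() OVER (ORDER BY time, id) AS rn
    FROM log
    WHERE object = :object
)
SELECT DISTINCT ctx.id, ctx.time, ctx.description
FROM numbered hit
JOIN numbered ctx ON ctx.rn BETWEEN hit.rn - :n AND hit.rn + :n
WHERE hit.description = :description
ORDER BY ctx.time, ctx.id
"""

def search(obj, description, n=1):
    """Return matching rows for one object plus n rows of context on each side."""
    params = {"object": obj, "description": description, "n": n}
    return conn.execute(CONTEXT_SQL, params).fetchall()

for row in search(2, "Bad"):
    print(row)
```

Everything happens in one round trip, so only the hits and their context cross the wire; the database still has to number all rows of the object, which may or may not be fast enough on the hardware described in the question.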
You can try this with sqlite
SELECT DISTINCT t2.*
FROM (SELECT * FROM t WHERE object=2 AND description='Bad') t1
JOIN
(SELECT * FROM t WHERE object=2) t2
ON t1.id = t2.id OR
t2.id IN (SELECT id FROM t WHERE object=2 AND t.time<t1.time ORDER BY t.time DESC LIMIT 1) OR
t2.id IN (SELECT id FROM t WHERE object=2 AND t.time>t1.time ORDER BY t.time ASC LIMIT 1)
ORDER BY t2.time
;
Increase the LIMIT values to get more lines of context.

SQL duration between dates for different persons

Hopefully someone can help me with the following task:
I have two tables, Treatment and Person. Treatment contains the dates on which treatments for the different persons were started; Person contains personal information, e.g. last name.
Now I have to find all persons where the duration between the first and last treatment is over 20 years.
The Tables look something like this:
Person
| PK_Person | First name | Name |
_________________________________
| 1 | A_Test | Karl |
| 2 | B_Test | Marie |
| 3 | C_Test | Steve |
| 4 | D_Test | Jack |
Treatment
| PK_Treatment | Description | Starting time | PK_Person |
_________________________________________________________
| 1 | A | 01.01.1989 | 1
| 2 | B | 02.11.2001 | 1
| 3 | A | 05.01.2004 | 1
| 4 | C | 01.09.2013 | 1
| 5 | B | 01.01.1999 | 2
So in this example, the output should be person Karl, A_Test.
Hopefully it's understandable what the problem is and someone can help me.
Edit: There seems to be a problem with the formatting; the tables are not displayed correctly. I hope it's readable.
SELECT *
FROM person p
INNER JOIN Treatment t on t.PK_Person = p.PK_Person
WHERE DATEDIFF(year,[TREATMENT_DATE_1], [TREATMENT_DATE_2]) > 20
This should do it, it is however untested so will need tweaking to your schema
Your data looks a bit suspicious, because the first name doesn't look like a first name.
But, what you want to do is aggregate the Treatment table for each person and get the minimum and maximum starting times. When the difference is greater than 20 years, then keep the person, and join back to the person table to get the names.
select p.FirstName, p.LastName
from Person p join
(select pk_person, MIN(StartingTime) as minst, MAX(StartingTime) as maxst
from Treatment t
group by pk_person
having MAX(StartingTime) - MIN(StartingTime) > 20*365.25
) t
on p.pk_person = t.pk_person;
Note that date arithmetic does vary between databases. In most databases, taking the difference of two dates counts the number of days between them, so this is a pretty general approach (although not guaranteed to work on all databases).
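To make the aggregate-then-join idea concrete, here is a sketch on SQLite through Python's sqlite3 module. It swaps the day-count comparison for SQLite's date(..., '+20 years') modifier, so it compares calendar years rather than 20*365.25 days. The snake_case column names and the ISO date strings are assumptions; the dates from the question were read as dd.mm.yyyy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (pk_person INTEGER PRIMARY KEY, first_name TEXT, name TEXT);
CREATE TABLE treatment (pk_treatment INTEGER PRIMARY KEY, description TEXT,
                        starting_time TEXT, pk_person INTEGER);
INSERT INTO person VALUES (1, 'A_Test', 'Karl'), (2, 'B_Test', 'Marie'),
                          (3, 'C_Test', 'Steve'), (4, 'D_Test', 'Jack');
INSERT INTO treatment VALUES
    (1, 'A', '1989-01-01', 1), (2, 'B', '2001-11-02', 1),
    (3, 'A', '2004-01-05', 1), (4, 'C', '2013-09-01', 1),
    (5, 'B', '1999-01-01', 2);
""")

# ISO-8601 date strings sort chronologically, so MIN/MAX and '>' work on them.
# The subquery keeps persons whose last treatment started more than 20 calendar
# years after their first one; the outer join fetches their names.
LONG_SPAN_SQL = """
SELECT p.first_name, p.name
FROM person p
JOIN (
    SELECT pk_person
    FROM treatment
    GROUP BY pk_person
    HAVING MAX(starting_time) > date(MIN(starting_time), '+20 years')
) t ON t.pk_person = p.pk_person
ORDER BY p.pk_person
"""

long_span = conn.execute(LONG_SPAN_SQL).fetchall()
print(long_span)
```

Persons without any treatment (Steve and Jack here) never appear in the aggregated subquery, so the inner join drops them automatically.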
I've taken a slightly different approach and worked with SQL Fiddle to verify that the below statements work.
As mentioned previously, the data does seem a bit suspicious; nonetheless per your requirements, you would be able to do the following:
select P.PK_Person, p.FirstName, p.Name
from person P
inner join treatment T on T.pk_person = P.pk_person
where DATEDIFF((select x.startingtime from treatment x where x.pk_person = p.pk_person order by startingtime desc limit 1), T.StartingTime) > 7305
First, we inner join treatment, which ignores any persons who are not in the treatment table. The WHERE clause then just needs to select based on your criteria (in this case a difference of dates). The subquery finds the last date on which a person was treated, which is compared to each of that person's records and filtered by number of days (7305 = 20 years * 365.25).
Here is the working SQL Fiddle sample.

Best way to join the two tables *including* duplicates from one table

Accounts (table)
+----+----------+----------+-------+
| id | account# | supplier | RepID |
+----+----------+----------+-------+
| 1 | 123xyz | Boston | 2 |
| 2 | 245xyz | Chicago | 2 |
| 3 | 425xyz | Chicago | 3 |
+----+----------+----------+-------+
PayOut (table)
+----+----------+----------+-------------+--------+
| id | account# | supplier | datecreated | Amount |
+----+----------+----------+-------------+--------+
| 5 | 245xyz | Chicago | 01-15-2009 | 25 |
| 6 | 123xyz | Boston | 10-15-2011 | 50 |
| 7 | 123xyz | Boston | 10-15-2011 | -50 |
| 8 | 123xyz | Boston | 10-15-2011 | 50 |
| 9 | 425xyz | Chicago | 10-15-2011 | 100 |
+----+----------+----------+-------------+--------+
I have an accounts table and a payout table. The payout table comes from an external source, so we have no control over it. As a result we can't join the two tables on a record ID field, and that is a problem we can't solve. We therefore join on account# and supplier (the 2nd and 3rd columns), which can create a many-to-many relationship. We then filter the account records on whether they are active and apply a second filter to the payout table on the date the payout was created. Payouts are created month to month. In my view there are two problems with this:
The query takes quite a bit of time to complete (could be inefficient)
Certain duplicates are removed that should not be. An example is records 6 and 8 in the payout table. What happened here is that we got a customer, the customer cancelled, and then we got him back, giving +50, -50 and +50. All of these values are valid and must show in the report for audit purposes. Currently only one +50 is shown; the other is lost. There are a couple of other problems in the report that come up once in a while.
Here is the query. It uses GROUP BY to remove duplicates. I would like a more advanced query that performs better and treats no record in the PayOut table as a duplicate, as long as it falls within the month of the report.
Here is our current query
/* Supplied to Store Procedure */
-----------------------------------
@RepID // the person for whom the payout is calculated
@Month // month of payment date
@year // year of payment date
-----------------------------------
select distinct
A.col1,
A.col2,
...
A.col10,
B.col2,
B.Col2,
B.Amount /* this is the important column, portion of which goes to Rep */
from records A
JOIN payout B
on A.Supplier = B.Supplier AND A.Account# = B.Account#
where datepart(mm, B.datecreated) = @Month /* parameter to stored procedure */
and datepart(yyyy, B.datecreated) = @Year
and A.[rep ID] = @RepID /* parameter to SP */
group by
col1,col2,col3,....col10
order by customerName
Is this query optimal? Can I improve it using CROSS APPLY or WHERE EXISTS so that it is faster and no longer has the duplicate problem?
Note that this query is used to get the payout of a rep, so every account record has a RepID field for the rep it is assigned to. Ideally I would like to use a SELECT ... WHERE EXISTS query.
It's difficult to understand exactly what you want because in one place you say you 'want' the duplicates but then you say that you are using the group by to remove duplicates. So the first thought would be "Why not just get rid of the group by?". But I have to believe you are smart enough to have thought of that yourself, so I assume it's got to be there for a reason.
I think someone here could help you pretty easily if you could post the actual query, but since you say you can't I will just try to give you some direction in solving the problem...
Instead of trying to do everything in one statement, use temporary tables or views to split it up. It may be easier for you to think about how to get rid of the duplicates you don't want and keep the ones you do first and put those into a temporary table, and then join the tables together and work with that.
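Since the question asks about WHERE EXISTS, here is a hedged sketch of that shape on SQLite via Python's sqlite3 module, not the poster's actual SQL Server procedure. The column names account_no and rep_id, the parameter names and the payouts_for helper are assumptions. Selecting from payout and only testing for a matching account means no payout row is multiplied or collapsed, so rows 6, 7 and 8 all survive without DISTINCT or GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, account_no TEXT, supplier TEXT, rep_id INTEGER);
CREATE TABLE payout (id INTEGER PRIMARY KEY, account_no TEXT, supplier TEXT,
                     datecreated TEXT, amount INTEGER);
INSERT INTO accounts VALUES (1, '123xyz', 'Boston', 2), (2, '245xyz', 'Chicago', 2),
                            (3, '425xyz', 'Chicago', 3);
INSERT INTO payout VALUES
    (5, '245xyz', 'Chicago', '2009-01-15', 25),
    (6, '123xyz', 'Boston',  '2011-10-15', 50),
    (7, '123xyz', 'Boston',  '2011-10-15', -50),
    (8, '123xyz', 'Boston',  '2011-10-15', 50),
    (9, '425xyz', 'Chicago', '2011-10-15', 100);
""")

# One output row per payout row. The half-open date range replaces the two
# datepart() filters and allows an index on datecreated to be used.
PAYOUT_SQL = """
SELECT b.id, b.account_no, b.supplier, b.amount
FROM payout b
WHERE b.datecreated >= :first_day AND b.datecreated < :next_month
  AND EXISTS (
        SELECT 1 FROM accounts a
        WHERE a.supplier = b.supplier
          AND a.account_no = b.account_no
          AND a.rep_id = :rep_id)
ORDER BY b.id
"""

def payouts_for(rep_id, year, month):
    """Return every payout row of the given month that belongs to the rep."""
    first_day = f"{year:04d}-{month:02d}-01"
    next_month = f"{year + month // 12:04d}-{month % 12 + 1:02d}-01"
    params = {"rep_id": rep_id, "first_day": first_day, "next_month": next_month}
    return conn.execute(PAYOUT_SQL, params).fetchall()

for row in payouts_for(2, 2011, 10):
    print(row)
```

If columns from the accounts table are needed in the report, a plain JOIN without DISTINCT or GROUP BY also keeps every payout row, provided (account#, supplier) identifies at most one account.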