Store specific column from 1:N relationship on parent table

Apologies for the long question, but I want to make sure my problem is clear. Say I have the following tables:
CREATE TABLE project (
    id NUMBER(38, 0),
    status_id NUMBER(38, 0), -- FK to a status table
    title VARCHAR2(4000 CHAR)
);

CREATE TABLE project_status_log (
    id NUMBER(38, 0),
    project_id NUMBER(38, 0),
    status_id NUMBER(38, 0),
    user_id NUMBER(38, 0), -- FK to a user table
    created_on DATE
);
Projects go through a complex workflow, where each status log entry represents a step in the workflow. An example workflow: Draft -> Submitted -> Review -> Returned To Draft -> Submitted -> Review -> Approved
Now let's say a very common need is to get the user_id of the user who last submitted a project. I typically create a view that I can join to project:
CREATE VIEW project_submitter (project_id, user_id) AS
SELECT project.id, submitter.user_id
FROM project
JOIN (
    SELECT DISTINCT
        project_id,
        FIRST_VALUE(user_id) OVER (PARTITION BY project_id ORDER BY created_on DESC) AS user_id
    FROM project_status_log
    WHERE status_id = 5 -- ID of submitted status
) submitter ON submitter.project_id = project.id
The problem is there are many rows and lots of helper views like this, and when I need to use many of them in a single query the performance gets really bad. Some of these queries are taking several seconds to finish. I've added indexes and made sure there aren't full table scans, but the problem seems to be all the aggregation and subqueries in a single query.
I'm considering adding a project.submitted_by column that is set programmatically any time a project's status is updated to submitted. This would drastically simplify my queries and make life much easier. Is this a bad approach? It feels a little bit like de-normalized data, but I'm not sure it actually is.
Are there any potential problems with a project.submitted_by column I'm not thinking about? If so, are there any alternatives to solve the performance issues short of putting the entire thing in an elasticsearch index?

I would suggest that you separate your state tables from your log tables, and not query the log tables except when you need something like a history. What you are doing is a step in that direction.
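If you do add a project.submitted_by column, it can be maintained either from the application code that performs the status change or from a trigger on the log table. A minimal sketch of the trigger option, assuming the submitted_by column has been added and reusing the hard-coded status ID 5 from the question (none of this is prescribed here, it is just one way to do it):
CREATE OR REPLACE TRIGGER project_status_log_ai
AFTER INSERT ON project_status_log
FOR EACH ROW
BEGIN
    IF :NEW.status_id = 5 THEN -- submitted
        UPDATE project
           SET submitted_by = :NEW.user_id
         WHERE id = :NEW.project_id;
    END IF;
END;
/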
Other approaches to the problem include
the creation and maintenance of read models.
if you are ok with slightly stale data, a shortcut would be to create materialized views (based off your current views) that are refreshed periodically
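A minimal sketch of such a materialized view, based on the submitted-status view from the question (the MV name and the switch from FIRST_VALUE to an equivalent MAX ... KEEP aggregate are my assumptions):
CREATE MATERIALIZED VIEW project_submitter_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT project_id,
       MAX(user_id) KEEP (DENSE_RANK LAST ORDER BY created_on) AS user_id
FROM project_status_log
WHERE status_id = 5 -- ID of submitted status
GROUP BY project_id;
You would then refresh it from a scheduled job (for example with DBMS_MVIEW.REFRESH) and join to it instead of the view.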
One more note on the query itself: the DISTINCT is what collapses the duplicate rows that FIRST_VALUE leaves behind (it is an analytic function, so it returns one row per log entry). If you want to drop the DISTINCT, switch to a plain aggregate instead, as in the following answer.

You can simplify the query. One method is aggregation:
SELECT project_id,
       MAX(user_id) KEEP (DENSE_RANK FIRST ORDER BY created_on DESC) AS user_id
FROM project_status_log
WHERE status_id = 5
GROUP BY project_id;
Or using window functions if you really want more columns:
SELECT . . .  -- whatever columns you want
FROM (SELECT psl.*,
             ROW_NUMBER() OVER (PARTITION BY project_id ORDER BY created_on DESC) AS seqnum
      FROM project_status_log psl
      WHERE status_id = 5
     ) psl
WHERE seqnum = 1;
Oracle has a smart optimizer and it should be able to use an index on project_status_log(status_id, project_id, created_on) for both of these queries.
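For reference, a sketch of that index (the index name is mine, and created_on is the column name from the question's schema):
CREATE INDEX project_status_log_sub_ix ON project_status_log (status_id, project_id, created_on);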
Note: Sometimes views can impede the optimizer. It might be worth trying a correlated subquery as well:
select p.*,
       (select psl.user_id
        from project_status_log psl
        where psl.status_id = 5 and psl.project_id = p.id
        order by psl.created_on desc
        fetch first 1 row only
       ) as user_id
from project p;
The goal here is that the subquery is only run for the rows that remain after any filtering on project. This also benefits from the same index mentioned above.


How can I order by a specific order?

It would be something like:
SELECT * FROM users ORDER BY id ORDER("abc","ghk","pqr"...);
In my order clause there might be 1000 records and all are dynamic.
A quick Google search gave me the result below:
SELECT * FROM users ORDER BY case id
    when 'abc' then 1
    when 'ghk' then 2
    when 'pqr' then 3 end;
As I said all my order clause values are dynamic. So is there any suggestion for me?
Your example isn't entirely clear, as it appears that a simple ORDER BY would suffice to order your ids alphabetically. However, it appears you are trying to create a dynamic ordering scheme that may not be alphabetical. In that case, my recommendation would be to use a lookup table for the values that you will be ordering by. This serves two purposes: first, it allows you to easily reorder the items without altering each entry in the users table, and second, it avoids (or at least reduces) problems with typos and other issues that can occur with "magic strings."
This would look something like:
Lookup Table:
CREATE TABLE LookupValues (
    Id CHAR(3) PRIMARY KEY,
    SortOrder INT -- "Order" is a reserved word, so use a different column name
);
Query:
SELECT
    u.*
FROM
    users u
INNER JOIN
    LookupValues l
ON
    u.Id = l.Id
ORDER BY
    l.SortOrder
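Populating the lookup table is then a separate, occasional task; a hypothetical example using the values from the question:
INSERT INTO LookupValues (Id, SortOrder) VALUES ('abc', 1);
INSERT INTO LookupValues (Id, SortOrder) VALUES ('ghk', 2);
INSERT INTO LookupValues (Id, SortOrder) VALUES ('pqr', 3);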

Select query with join in huge table taking over 7 hours

Our system is facing performance issues selecting rows out of a 38 million row table.
This table with 38 million rows stores information from clients/suppliers etc. These appear across many other tables, such as Invoices.
The main problem is that our database is far from normalized. The Clients_Suppliers table has a composite key made of 3 columns: Code (varchar2(16)), Category (char(2)), and up_date (a date). Every change in a client's address is stored in that same table with a new date. So we can have records such as this:
code ca up_date
---------------- -- --------
1234567890123456 CL 01/01/09
1234567890123456 CL 01/01/10
1234567890123456 CL 01/01/11
1234567890123456 CL 01/01/12
6543210987654321 SU 01/01/10
6543210987654321 SU 08/03/11
Worse, in every table that uses a client's information, only the code and category are stored instead of the full composite key. Invoices, for instance, has its own keys, including the emission date. So we can have something like this:
invoice_no serial_no emission code ca
---------- --------- -------- ---------------- --
1234567890 12345 05/02/12 1234567890123456 CL
My specific problem is that I have to generate a list of clients for which invoices were created in a given period. Since I have to get the most recent info from the clients, I have to use max(up_date).
So here's my query (in Oracle):
SELECT
    CL.CODE,
    CL.CATEGORY
    -- other address fields
FROM
    CLIENTS_SUPPLIERS CL,
    INVOICES I
WHERE
    CL.CODE = I.CODE AND
    CL.CATEGORY = I.CATEGORY AND
    CL.UP_DATE =
        (SELECT
             MAX(CL2.UP_DATE)
         FROM
             CLIENTS_SUPPLIERS CL2
         WHERE
             CL2.CODE = I.CODE AND
             CL2.CATEGORY = I.CATEGORY AND
             CL2.UP_DATE <= I.EMISSION
        ) AND
    I.EMISSION BETWEEN DATE1 AND DATE2
It takes up to seven hours to select 178,000 rows. Invoices has 300,000 rows between DATE1 and DATE2.
It's a (very, very, very) bad design, and I've raised the fact that we should improve it by normalizing the tables. That would involve creating a table for clients with a new int primary key for each pair of code/category and another one for Addresses (with the client primary key as a foreign key), then using the Addresses primary key in each table that relates to clients.
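For concreteness, the normalized layout would look roughly like this (a sketch only; names and types are illustrative, based on the columns described above):
CREATE TABLE clients (
    client_id NUMBER PRIMARY KEY, -- new surrogate key
    code VARCHAR2(16),
    category CHAR(2),
    CONSTRAINT clients_code_category_uq UNIQUE (code, category)
);
CREATE TABLE addresses (
    address_id NUMBER PRIMARY KEY,
    client_id NUMBER REFERENCES clients (client_id),
    up_date DATE
    -- address fields...
);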
But it would mean changing the whole system, so my suggestion has been shunned. I need to find a different way of improving performance (apparently using only SQL).
I've tried indexes, views, temporary tables but none have had any significant improvement on performance. I'm out of ideas, does anyone have a solution for this?
Thanks in advance!
What does the DBA have to say?
Has he/she tried:
Coalescing the tablespaces
Increasing the parallel query slaves
Moving indexes to a separate tablespace on a separate physical disk
Gathering stats on the relevant tables/indexes
Running an explain plan
Running the query through the index optimiser
I'm not saying the SQL is perfect, but if performance is degrading over time, the DBA really needs to be having a look at it.
SELECT
    CL2.CODE,
    CL2.CATEGORY,
    MAX(CL2.UP_DATE) AS UP_DATE
    -- pick the other address fields with MAX(...) KEEP (DENSE_RANK LAST ORDER BY CL2.UP_DATE)
FROM
    CLIENTS_SUPPLIERS CL2 INNER JOIN (
        SELECT DISTINCT
            CL.CODE,
            CL.CATEGORY,
            I.EMISSION
        FROM
            CLIENTS_SUPPLIERS CL INNER JOIN INVOICES I
                ON CL.CODE = I.CODE AND CL.CATEGORY = I.CATEGORY
        WHERE
            I.EMISSION BETWEEN DATE1 AND DATE2
    ) CL3 ON CL2.CODE = CL3.CODE AND CL2.CATEGORY = CL3.CATEGORY
WHERE
    CL2.UP_DATE <= CL3.EMISSION
GROUP BY
    CL2.CODE,
    CL2.CATEGORY
The idea is to separate the process: first we tell Oracle to give us the list of clients that have invoices in the period you want, and then we get the latest version of each. In your version there's a check against MAX 38,000,000 times, which I really think is what cost most of the time spent in the query.
However, I'm not going into indexes here, assuming they are correctly set up...
Assuming that the number of rows for a (code,ca) is smallish, I would try to force an index scan per invoice with an inline view, such as:
SELECT i.invoice_no,
       (SELECT MAX(c.rowid) KEEP (DENSE_RANK FIRST ORDER BY c.up_date DESC)
          FROM clients_suppliers c
         WHERE c.code = i.code
           AND c.category = i.category
           AND c.up_date <= i.emission) AS client_rowid
  FROM invoices i
 WHERE i.emission BETWEEN :p1 AND :p2
You would then join this query to CLIENTS_SUPPLIERS, hopefully triggering a join via rowid (300k rowid reads are negligible).
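For example, the join back might look like this (a sketch; the inline-view alias x and the column name client_rowid are mine):
SELECT c.*
FROM (
    SELECT i.invoice_no,
           (SELECT MAX(cs.rowid) KEEP (DENSE_RANK FIRST ORDER BY cs.up_date DESC)
              FROM clients_suppliers cs
             WHERE cs.code = i.code
               AND cs.category = i.category
               AND cs.up_date <= i.emission) AS client_rowid
      FROM invoices i
     WHERE i.emission BETWEEN :p1 AND :p2
) x
JOIN clients_suppliers c ON c.rowid = x.client_rowid;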
You could improve the above query by using SQL objects:
CREATE TYPE client_obj AS OBJECT (
    name VARCHAR2(50),
    add1 VARCHAR2(50)
    /*address2, city...*/
);

SELECT i.o.name, i.o.add1 /*...*/
FROM (SELECT DISTINCT
             (SELECT client_obj(
                        max(name) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC),
                        max(add1) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                        /*city...*/
                     )
                FROM clients_suppliers c
               WHERE c.code = i.code
                 AND c.category = i.category
                 AND c.up_date <= i.emission) o
      FROM invoices i
      WHERE i.emission BETWEEN :p1 AND :p2) i
The correlated subquery may be causing issues, but to me the real problem is what seems to be your main client table: you cannot easily grab the most recent data without doing the max(up_date) mess. It's really a mix of history and current data and, as you describe, poorly designed.
Anyway, it will help you in this and other long-running joins to have a table/view with ONLY the most recent data for a client. So, first build a materialized view for this (untested):
create or replace materialized view recent_clients_view
tablespace my_tablespace
nologging
build deferred
refresh complete on demand
as
select * from
(
    select c.*, row_number() over (partition by code, category order by up_date desc, rowid desc) rnum
    from clients_suppliers c
)
where rnum = 1;
Add a unique index on (code, category), as shown below. The assumption is that this will be refreshed periodically on some off-hours schedule, and that your queries using this will be ok with showing data AS OF the date of the last refresh. In a DW env or for reporting, this is usually the norm.
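For example (the index name is an assumption):
CREATE UNIQUE INDEX recent_clients_view_uq ON recent_clients_view (code, category);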
The snapshot table for this view should be MUCH smaller than the full clients table with all the history.
Now you are joining invoices to this smaller view, with an equijoin on code and category (where emission is between date1 and date2). Something like:
select cv.*
from
recent_clients_view cv,
invoices i
where cv.code = i.code
and cv.category = i.category
and i.emission between :date1 and :date2;
Hope that helps.
You might try rewriting the query to use analytic functions rather than a correlated subquery:
select *
from (SELECT CL.CODE, CL.CATEGORY, -- other address fields
             CL.UP_DATE,
             max(cl.up_date) over (partition by cl.code, cl.category) as max_up_date
      FROM CLIENTS_SUPPLIERS CL join
           INVOICES I
           on CL.CODE = I.CODE AND
              CL.CATEGORY = I.CATEGORY and
              I.EMISSION BETWEEN DATE1 AND DATE2 and
              cl.up_date <= i.emission
     ) t
where t.up_date = max_up_date
You might want to remove the max_up_date column in the outside select.
As some have noticed, this query is subtly different from the original, because it is taking the max of up_date over all dates. The original query has the condition:
CL2.UP_DATE <= I.EMISSION
However, by transitivity, this means that:
CL2.UP_DATE <= DATE2
So the only difference is when the max of the update date is less than DATE1 in the original query. However, these rows would be filtered out by the comparison to UP_DATE.
Although this query is phrased slightly differently, I think it does the same thing. I must admit to not being 100% positive, since this is a subtle situation on data that I'm not familiar with.

Optimizing a category filter

This recent question had me thinking about optimizing a category filter.
Suppose we wish to create a database referencing a huge number of audio tracks, with their release date and a list of world locations from which the audio track is downloadable.
The requests we wish to optimize for are:
Give me the 10 most recent tracks downloadable from location A.
Give me the 10 most recent tracks downloadable from locations A or B.
Give me the 10 most recent tracks downloadable from locations A and B.
How would one go about structuring that database? I have a hard time coming up with a simple solution that doesn't require reading through all the tracks for at least one location...
To optimise these queries, you need to slightly de-normalise the data.
For example, you may have a track table that contains the track's id, name and release date, and a map_location_to_track table that describes where those tracks can be downloaded from. To answer "10 most recent tracks for location A" you need to get ALL of the tracks for location A from map_location_to_track, then join them to the track table to order them by release date, and pick the top 10.
If instead all the data is in a single table, the ordering step can be avoided. For example...
CREATE TABLE map_location_to_track (
location_id INT,
track_id INT,
release_date DATETIME,
PRIMARY KEY (location_id, release_date, track_id)
)
SELECT * FROM map_location_to_track
WHERE location_id = A
ORDER BY release_date DESC LIMIT 10
Having location_id as the first entry in the primary key ensures that the WHERE clause is simply an index seek. There is then no requirement to re-order the data: it's already ordered for us by the primary key, so we just pick the 10 records at the end.
You may indeed still join to the track table to get the name, price, etc., but you now only have to do that for 10 records, not everything at that location.
To solve the same query for "locations A OR B", there are a couple of options that can perform differently depending on the RDBMS you are using.
The first is simple, though some RDBMS don't play nice with IN...
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id IN (A, B)
GROUP BY track_id, release_date
ORDER BY release_date DESC LIMIT 10
The next option is nearly identical, but still some RDBMS don't play nice with OR logic being applied to INDEXes.
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A or location_id = B
GROUP BY track_id, release_date
ORDER BY release_date DESC LIMIT 10
In either case, the algorithm being used to rationalise the list of records down to 10 is hidden from you. It's a matter of try it and see; the index is still available such that this CAN be performant.
An alternative is to explicitly determine part of the approach in your SQL statement...
SELECT
*
FROM
(
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A
ORDER BY release_date DESC LIMIT 10
UNION
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = B
ORDER BY release_date DESC LIMIT 10
)
AS data
ORDER BY
release_date DESC
LIMIT 10
-- NOTE: This is a UNION and not a UNION ALL
-- The same track can be available in both locations, but should only count once
-- It's in place of the GROUP BY in the previous 2 examples
It is still possible for an optimiser to realise that these two unioned data sets are ordered, and so make the external order by very quick. Even if not, however, ordering 20 items is pretty quick. More importantly, it's a fixed overhead: it doesn't matter if you have a billion tracks in each location, we're just merging two lists of 10.
The hardest to optimise is the AND condition, but even then the existence of the "TOP 10" constraint can help work wonders.
Adding a HAVING clause to the IN or OR based approaches can solve this, but, again, depending on your RDBMS, may run less than optimally.
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A or location_id = B
GROUP BY track_id, release_date
HAVING COUNT(*) = 2
ORDER BY release_date DESC LIMIT 10
The alternative is to try the "two queries" approach...
SELECT
location_a.*
FROM
(
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A
)
AS location_a
INNER JOIN
(
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = B
)
AS location_b
ON location_a.release_date = location_b.release_date
AND location_a.track_id = location_b.track_id
ORDER BY
location_a.release_date DESC
LIMIT 10
This time we can't restrict the two sub-queries to just 10 records; for all we know the most recent 10 in location A don't appear in location B at all. The primary key rescues us again though. The two data sets are organised by release date, so the RDBMS can just start at the top record of each set and merge the two until it has 10 records, then stop.
NOTE: Because the release_date is in the primary key, and before the track_id, one should ensure that it is used in the join.
Depending on the RDBMS, you don't even need the sub-queries. You may be able to just self-join the table without altering the RDBMS' plan...
SELECT
location_a.*
FROM
map_location_to_track AS location_a
INNER JOIN
map_location_to_track AS location_b
ON location_a.release_date = location_b.release_date
AND location_a.track_id = location_b.track_id
WHERE
location_a.location_id = A
AND location_b.location_id = B
ORDER BY
location_a.release_date DESC
LIMIT 10
All in all, the combination of three things makes this pretty efficient:
- Partially De-Normalising the data to ensure it's in a friendly order for our needs
- Knowing we only ever need the first 10 results
- Knowing we're only ever dealing with 2 locations at the most
There are variations that can optimise to any number of records and any number of locations, but these are significantly less performant than the problem stated in this question.
In a classic relational schema you would have a many-to-many relationship between tracks and locations in order to avoid redundancy:
CREATE TABLE tracks (
id INT,
...
release_date DATETIME,
PRIMARY KEY (id)
)
CREATE TABLE locations (
id INT,
...
PRIMARY KEY (id)
)
CREATE TABLE tracks_locations (
location_id INT,
track_id INT,
...
PRIMARY KEY (location_id, track_id)
)
SELECT tracks.* FROM tracks_locations LEFT JOIN tracks ON tracks.id = tracks_locations.track_id
WHERE tracks_locations.location_id = A
ORDER BY tracks.release_date DESC LIMIT 10
You could modify that schema using table partitions by location. Problem is that it depends on implementation issues or usage constraints. For example, AFAIK in MySQL you cannot have foreign keys in partitioned tables. To solve this you could also have a collection of tables (call it "partitioning by hand") like tracks_by_location_#, where # is the ID of a known location. These tables could store filtered results and be created/updated/deleted using triggers.
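As a rough illustration of the "partitioning by hand" idea in MySQL (the table and trigger names are mine, and this only covers the insert path; updates and deletes would need similar triggers):
CREATE TABLE tracks_by_location_1 (
    track_id INT,
    release_date DATETIME,
    PRIMARY KEY (release_date, track_id)
);
CREATE TRIGGER tracks_locations_ai AFTER INSERT ON tracks_locations
FOR EACH ROW
    INSERT INTO tracks_by_location_1 (track_id, release_date)
    SELECT NEW.track_id, t.release_date
    FROM tracks t
    WHERE t.id = NEW.track_id
      AND NEW.location_id = 1;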

Collapsing multiple subqueries into one in Postgres

I have two tables:
CREATE TABLE items
(
root_id integer NOT NULL,
id serial NOT NULL,
-- Other fields...
CONSTRAINT items_pkey PRIMARY KEY (root_id, id)
)
CREATE TABLE votes
(
root_id integer NOT NULL,
item_id integer NOT NULL,
user_id integer NOT NULL,
type smallint NOT NULL,
direction smallint,
CONSTRAINT votes_pkey PRIMARY KEY (root_id, item_id, user_id, type),
CONSTRAINT votes_root_id_fkey FOREIGN KEY (root_id, item_id)
REFERENCES items (root_id, id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
-- Other constraints...
)
I'm trying to, in a single query, pull out all items of a particular root_id along with a few arrays of user_ids of the users who voted in particular ways. The following query does what I need:
SELECT *,
ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 0 AND direction = 1) as upvoters,
ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 0 AND direction = -1) as downvoters,
ARRAY(SELECT user_id from votes where root_id = i.root_id AND item_id = i.id AND type = 1) as favoriters
FROM items i
WHERE root_id = 1
ORDER BY id
The problem is that I'm using three subqueries to get the information I need when it seems like I should be able to do the same in one. I thought that Postgres (I'm using 8.4) might be smart enough to collapse them all into a single query for me, but looking at the explain output in pgAdmin it looks like that's not happening - it's running multiple primary key lookups on the votes table instead. I feel like I could rework this query to be more efficient, but I'm not sure how.
Any pointers?
EDIT: An update to explain where I am now. At the advice of the pgsql-general mailing list, I tried changing the query to use a CTE:
WITH v AS (
SELECT item_id, type, direction, array_agg(user_id) as user_ids
FROM votes
WHERE root_id = 5305
GROUP BY type, direction, item_id
ORDER BY type, direction, item_id
)
SELECT *,
(SELECT user_ids from v where item_id = i.id AND type = 0 AND direction = 1) as upvoters,
(SELECT user_ids from v where item_id = i.id AND type = 0 AND direction = -1) as downvoters,
(SELECT user_ids from v where item_id = i.id AND type = 1) as favoriters
FROM items i
WHERE root_id = 5305
ORDER BY id
Benchmarking each of these from my application (I set up each as a prepared statement to avoid spending time on query planning, and then ran each one several thousand times with a variety of root_ids) my initial approach averages 15 milliseconds and the CTE approach averages 17 milliseconds. I was able to repeat this result over a few runs.
When I have some time I'm going to play with jkebinger's and Dragontamer5788's approaches with my test data and see how they work, but I'm also starting a bounty to see if I can get more suggestions.
I should also mention that I'm open to changing my schema (the system isn't in production yet, and won't be for a couple months) if it can speed up this query. I designed my votes table this way to take advantage of the primary key's uniqueness constraint - a given user can both favorite and upvote an item, for example, but not upvote it AND downvote it - but I can relax/work around that constraint if representing these options in a different way makes more sense.
EDIT #2: I've benchmarked all four solutions. Amazingly, Sequel is flexible enough that I was able to write all four without dropping to SQL once (not even for the CASE statements). Like before, I ran them all as prepared statements, so that query planning time wouldn't be an issue, and did each run several thousand times. Then I ran all the queries under two situations - a worst-case scenario with a lot of rows (265 items and 4911 votes) where the relevant rows would be in the cache pretty quickly, so CPU usage should be the deciding factor and a more realistic scenario where a random root_id was chosen for each run. I wound up with:
Original query - Typical: ~10.5 ms, Worst case: ~26 ms
CTE query - Typical: ~16.5 ms, Worst case: ~70 ms
Dragontamer5788 - Typical: ~15 ms, Worst case: ~36 ms
jkebinger - Typical: ~42 ms, Worst case: ~180 ms
I suppose the lesson to take from this right now is that Postgres' query planner is very smart and is probably doing something clever under the surface. I don't think I'm going to spend any more time trying to work around it. If anyone would like to submit another query attempt I'd be happy to benchmark it, but otherwise I think Dragontamer is the winner of the bounty and correct (or closest to correct) answer. Unless someone else can shed some light on what Postgres is doing - that would be pretty cool. :)
There are two questions being asked:
A syntax to collapse multiple subqueries into one.
Optimization.
For #1, I can't get the "complete" thing into a single Common Table Expression, because you're using a correlated subquery on each item. Still, you might have some benefits if you used a common table expression. Obviously, this will depend on the data, so please benchmark to see if it would help.
For #2, because there are three commonly accessed "classes" of items in your table, I expect partial indexes to increase the speed of your query, regardless of whether or not you were able to increase the speed due to #1.
First, the easy stuff then. To add a partial index to this table, I'd do:
CREATE INDEX upvote_vote_index ON votes (type, direction)
WHERE (type = 0 AND direction = 1);
CREATE INDEX downvote_vote_index ON votes (type, direction)
WHERE (type = 0 AND direction = -1);
CREATE INDEX favoriters_vote_index ON votes (type)
WHERE (type = 1);
The smaller these indexes, the more efficient your queries will be. Unfortunately, in my tests, they didn't seem to help :-( Still, maybe you can find a use for them; it depends greatly on your data.
As for an overall optimization, I'd approach the problem differently. I'd "unroll" the query into this form (using an inner join and using conditional expressions to "split up" the three types of votes), and then use "Group By" and the "array" aggregate operator to combine them. IMO, I'd rather change my application code to accept it in the "unrolled" form, but if you can't change the application code, then the "group by"+aggregate function ought to work.
SELECT array_agg(v.user_id), -- array_agg(anything else you needed),
i.root_id, i.id, -- I presume you needed the primary key?
CASE
WHEN v.type = 0 AND v.direction = 1
THEN 'upvoter'
WHEN v.type = 0 AND v.direction = -1
THEN 'downvoter'
WHEN v.type = 1
THEN 'favoriter'
END as vote_type
FROM items i
JOIN votes v ON i.root_id = v.root_id AND i.id = v.item_id
WHERE i.root_id = 1
AND ((type=0 AND (direction=1 OR direction=-1))
OR type=1)
GROUP BY i.root_id, i.id, vote_type
ORDER BY id
It's still "one step unrolled" compared to your code (vote_type is vertical, while in your case it's horizontal, across the columns). But this seems to be more efficient.
Just a guess, but maybe it could be worth trying:
Maybe Postgres can optimize the query if you create a VIEW of
SELECT user_id from votes where root_id = i.root_id AND item_id = i.id
and then select from it three times with the different WHERE clauses on type and direction.
If that's not helping either, maybe you could fetch the 3 types as additional boolean columns and then only work with one query?
Would be interested to hear if you find a solution. Good luck.
Here's another approach. It has the (possibly) undesirable result of including NULL values in the arrays, but it works in one pass, rather than three. I find it helpful to think of some SQL queries in a map-reduce manner, and case statements are great for that.
select
v.root_id, v.item_id,
array_agg(case when type = 0 AND direction = 1 then user_id else NULL end) as upvoters,
array_agg(case when type = 0 AND direction = -1 then user_id else NULL end) as downvoters,
array_agg(case when type = 1 then user_id else NULL end) as favoriters
from items i
join votes v on i.root_id = v.root_id AND i.id = v.item_id
group by 1, 2
With some sample data, I get this result set:
root_id | item_id | upvoters | downvoters | favoriters
---------+---------+----------------+------------------+------------------
1 | 2 | {100,NULL,102} | {NULL,101,NULL} | {NULL,NULL,NULL}
2 | 4 | {100,NULL,101} | {NULL,NULL,NULL} | {NULL,100,NULL}
I believe you need Postgres 8.4 to get array_agg, but there's been a recipe for an array_accum function prior to that.
There's a discussion on postgres-hackers list about how to build a NULL-removing version of array_agg if you're interested.
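As an aside, on much newer Postgres versions (9.4 and later, so not applicable to 8.4) the aggregate FILTER clause avoids the NULL entries entirely; a sketch against the same schema:
SELECT i.root_id, i.id,
       array_agg(v.user_id) FILTER (WHERE v.type = 0 AND v.direction = 1)  AS upvoters,
       array_agg(v.user_id) FILTER (WHERE v.type = 0 AND v.direction = -1) AS downvoters,
       array_agg(v.user_id) FILTER (WHERE v.type = 1)                      AS favoriters
FROM items i
LEFT JOIN votes v ON v.root_id = i.root_id AND v.item_id = i.id
WHERE i.root_id = 1
GROUP BY i.root_id, i.id
ORDER BY i.id;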

SQL aggregation question

I have three tables:
unmatched_purchases table:
unmatched_purchases_id --primary key
purchases_id --foreign key to events table
location_id --which store
purchase_date
item_id --item purchased
purchases table:
purchases_id --primary key
location_id --which store
customer_id
credit_card_transactions:
transaction_id --primary key
trans_timestamp --timestamp of when the transaction occurred
item_id --item purchased
customer_id
location_id
All three tables are very large. The purchases table has 590,130,404 records. (Yes, half a billion.) Unmatched_purchases has 192,827,577 records. Credit_card_transactions has 79,965,740 records.
I need to find out how many purchases in the unmatched_purchases table match up with entries in the credit_card_transactions table. I need to do this for one location at a time (i.e. run the query for location_id = 123, then run it for location_id = 456). "Match up" is defined as:
1) same customer_id
2) same item_id
3) the trans_timestamp is within a certain window of the purchase_date
(e.g. if the purchase_date is Jan 3, 2005
and the trans_timestamp is 11:14PM Jan 2, 2005, that's close enough)
I need the following aggregated:
1) How many unmatched purchases are there for that location
2) How many of those unmatched purchases could have been matched with credit_card_transactions for a location.
So, what is a query (or queries) to get this information that won't take forever to run?
Note: all three tables are indexed on location_id
EDIT: as it turns out, the credit_card_transactions table has been partitioned based on location_id. So that will help speed this up for me. I'm asking our DBA if the others could be partitioned as well, but the decision is out of my hands.
CLARIFICATION: I only will need to run this on a few of our many locations, not all of them separately. I need to run it on 3 locations. We have 155 location_ids in our system, but some of them are not used in this part of our system.
try this (I have no idea how fast it will be - that depends on your indices)
Select Count(*) TotalPurchases,
       Sum(Case When c.transaction_id Is Not Null
                Then 1 Else 0 End) MatchablePurchases
From unmatched_purchases u
Join purchases p
    On p.purchases_id = u.purchases_id
Left Join credit_card_transactions c
    On c.customer_id = p.customer_id
    And c.item_id = u.item_id
    And c.trans_timestamp - u.purchase_date < #DelayThreshold
Where u.location_id = #Location
At least, you'll need more indexes. I propose at least the following:
An index on unmatched_purchases.purchases_id, one on purchases.location_id, and
another index on credit_card_transactions(location_id, customer_id, item_id, trans_timestamp).
Without those indexes, there is little hope IMO.
I suggest you query ALL locations at once. It will cost you 3 full scans (each table once) + sorting. I bet this will be faster than querying locations one by one.
But if you want not to guess, you at least need to examine EXPLAIN PLAN and 10046 trace of your query...
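For example, in Oracle (substitute the real query; the location value 123 comes from the question):
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM unmatched_purchases u WHERE u.location_id = 123;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- and for a 10046 trace of the current session:
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';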
The query ought to be straightforward, but the tricky part is to get it to perform. I'd question why you need to run it once for each location when it would probably be more efficient to run it for every location in a single query.
The join would be a big challenge, but the aggregation ought to be straightforward. I would guess that your best hope performance-wise for the join would be a hash join on the customer and item columns, with a subsequent filter operation on the date range. You might have to fiddle with putting the customer and item join in an inline view and then try to stop the date predicate from being pushed into the inline view.
The hash join would be much more efficient with tables that are being equi-joined both having the same hash partitioning key on all join columns, if that can be arranged.
Whether to use the location index or not ...
Whether the index is worth using or not depends on the clustering factor for the location index, which you can read from the user_indexes table. Can you post the clustering factor along with the number of blocks that the table contains? That will give a measure of the way that values for each location are distributed throughout the table. You could also extract the execution plan for a query such as:
select some_other_column
from my_table
where location_id in (value1, value2, value3)
... and see if Oracle thinks the index is useful.
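As a side note, the clustering factor and block counts mentioned above can be pulled straight from the data dictionary; a sketch (substitute the relevant table name):
SELECT index_name, clustering_factor, leaf_blocks
FROM user_indexes
WHERE table_name = 'UNMATCHED_PURCHASES';
SELECT blocks
FROM user_tables
WHERE table_name = 'UNMATCHED_PURCHASES';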