What's the most efficient query? - sql

I have a table named Projects that has the following relationships:
has many Contributions
has many Payments
In my result set, I need the following aggregate values:
Number of unique contributors (DonorID on the Contribution table)
Total contributed (SUM of Amount on Contribution table)
Total paid (SUM of PaymentAmount on Payment table)
Because there are so many aggregate functions and multiple joins, it gets messy do use standard aggregate functions the the GROUP BY clause. I also need the ability to sort and filter these fields. So I've come up with two options:
Using subqueries:
SELECT Project.ID AS PROJECT_ID,
(SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = PROJECT_ID) AS TotalPaidBack,
(SELECT COUNT(DISTINCT DonorID) FROM Contribution WHERE RecipientID = PROJECT_ID) AS ContributorCount,
(SELECT SUM(Amount) FROM Contribution WHERE RecipientID = PROJECT_ID) AS TotalReceived
FROM Project;
Using a temporary table:
DROP TABLE IF EXISTS Project_Temp;
CREATE TEMPORARY TABLE Project_Temp (project_id INT NOT NULL, total_payments INT, total_donors INT, total_received INT, PRIMARY KEY(project_id)) ENGINE=MEMORY;
INSERT INTO Project_Temp (project_id,total_payments)
SELECT `Project`.ID, IFNULL(SUM(PaymentAmount),0) FROM `Project` LEFT JOIN `Payment` ON ProjectID = `Project`.ID GROUP BY 1;
INSERT INTO Project_Temp (project_id,total_donors,total_received)
SELECT `Project`.ID, IFNULL(COUNT(DISTINCT DonorID),0), IFNULL(SUM(Amount),0) FROM `Project` LEFT JOIN `Contribution` ON RecipientID = `Project`.ID GROUP BY 1
ON DUPLICATE KEY UPDATE total_donors = VALUES(total_donors), total_received = VALUES(total_received);
SELECT * FROM Project_Temp;
Tests for both are pretty comparable, in the 0.7 - 0.8 seconds range with 1,000 rows. But I'm really concerned about scalability, and I don't want to have to re-engineer everything as my tables grow. What's the best approach?

Knowing the timing for each 1K rows is good, but the real question is how they'll be used.
Are you planning to send all these back to a UI? Google doles out results 25 per page; maybe you should, too.
Are you planning to do calculations in the middle tier? Maybe you can do those calculations on the database and save yourself bringing all those bytes across the wire.
My point is that you may never need to work with 1,000 or one million rows if you think carefully about what you do with them.
You can EXPLAIN PLAN to see what the difference between the two queries is.

I would go with the first approach. You are allowing the RDBMS to do it's job, rather than trying to do it's job for it.
By creating a temp table, you will always create the full table for each query. If you only want data for one project, you still end up creating the full table (unless you restrict each INSERT statement accordingly.) Sure, you can code it, but it's already becoming a fair amount code and complexity for a small performance gain.
With a SELECT, the db can fetch the appriate amount of data, optimizing the whole query based on context. If other users have queried the same data, it may even be cached (query, and possibly data, depending upon your db). If performance is truly a concern, you might consider using Indexed/Materialized Views, or generating a table on an INSERT/UPDATE/DELETE trigger. Scaling out, you can use server clusters and partioned views - something that I believe will be difficult if you are creating temporary tables.
EDIT: the above is written without any specific rdbms in mind, although the OP added that mysql is the target db.

There is a third option which is derived tables:
Select Project.ID AS PROJECT_ID
, Payments.Total AS TotalPaidBack
, Coalesce(ContributionStats.DonarCount, 0) As ContributorCount
, ContributionStats.Total As TotalReceived
From Project
Left Join (
Select C1.RecipientId, Sum(C1.Amount) As Total, Count(Distinct C1.DonarId) ContributorCount
From Contribution As C1
Group By C1.RecipientId
) As ContributionStats
On ContributionStats.RecipientId = Project.Project_Id
Left Join (
Select P1.ProjectID, Sum(P1.PaymentAmount) As Total
From Payment As P1
Group By P1.RecipientId
) As Payments
On Payments.ProjectId = Project.Project_Id
I'm not sure if it will perform better, but you might give it shot.

A few thoughts:
The derived table idea would be good on other platforms, but MySQL has the same issue with derived tables that it does with views: they aren't indexed. That means that MySQL will execute the full content of the derived table before applying the WHERE clause, which doesn't scale at all.
Option 1 is good for being compact, but syntax might get tricky when you want to start putting the derived expressions in the WHERE clause.
The suggestion of materialized views is a good one, but MySQL unfortunately doesn't support them. I like the idea of using triggers. You could translate that temporary table into a real table that persists, and then use INSERT/UPDATE/DELETE triggers on the Payments and Contribution tables to update the Project Stats table.
Finally, if you don't want to mess with triggers, and if you aren't too concerned with freshness, you can always have the separate stats table and update it offline, having a cron job that runs every few minutes that does the work that you specified in Query #2 above, except on the real table. Depending on the nuances of your application, this slight delay in updating the stats may or may not be acceptable to your users.

Related

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is, is it plausible that so many intermediate tables are necessary? Or this a sign that I am "doing it wrong?"
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.
While temporary tables are used a lot in SQL Server (mainly to overcome the peculiar locking behaviour that it has) it is far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system store intermediate on disk.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not completely and using this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables create them at least as temporary or unlogged tables to minimized the IO overhead that comes with writing that data and thus keep as much data in memory as possible.
However I would always start with a single query instead of maintaining many different tables (that all need to be changed if you have to change the structure of the report).
For example your first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions Postgres will automatically store the data on disk if it gets too big, if not it will process it in-memory. When manually creating the tables you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM ace_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
If it performs well and results in the right output, I see nothing wrong with it. I do however suggest to use (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.

Oracle join of two tables where joined to table returns highest of a result set restricted by the joining (parent?) table?

Sorry for the long title, but its a difficult dilemma to distill down to a short title.. :-)
I have two tables, an auto maintenance log table and an auto trip log table, something like the following:
auto_maint_log:
auto_id
maint_datetime
maint_description
auto_trip_log:
auto_id
trip_datetime
ending_odometer
I need to select all maintenance events for each auto, and for each event, lookup the most recent ending_odometer value at the time of the maintenance from the trip log table. The only way I have successfully accomplished this task is using a function (i.e. get_odometer(auto_id, maint_datetime) as a column in my query) in which the odometer reading from the trip log is evaluated to pull the most recent trip prior to the passed maint_datetime, thus returning the most recent ending_odometer.
While the function call does work, in theory, in practice it does not. My end goal is to create a view of the maintenance data that includes the (then) current odometer value, and I have millions of maintenance rows across hundreds of vehicles. The performance of the function call makes its use as a solution impractical, nay impossible.. :-)
I've tried all variations of MAX, rownum, subselects, analytics (over/partition, etc.) I've been able to scour from google'ing this, and have not been able to code a working query.
Any suggestions, or brilliant solutions are welcome!
Thanks,
You can do this with a correlated subquery:
select aml.*,
(select max(atl.ending_odometer)
from auto_trip_log atl
where alt.auto_id = aml.auto_id and
atl.trip_datetime <= aml.maint_datetime
) as ending_odometer
from auto_maint_log aml;
This query uses max() because (presumably) the odometer readings are steadily increasing.
EDIT:
select aml.*,
(select max(atl.ending_odometer) over (dense_rank first order by atl.trip_datetime desc)
from auto_trip_log atl
where alt.auto_id = aml.auto_id and
atl.trip_datetime <= aml.maint_datetime
) as ending_odometer
from auto_maint_log aml;

Store results of SQL Server query for pagination

In my database I have a table with a rather large data set that users can perform searches on. So for the following table structure for the Person table that contains about 250,000 records:
firstName|lastName|age
---------|--------|---
John | Doe |25
---------|--------|---
John | Sams |15
---------|--------|---
the users would be able to perform a query that can return about 500 or so results. What I would like to do is allow the user see his search results 50 at a time using pagination. I've figured out the client side pagination stuff, but I need somewhere to store the query results so that the pagination uses the results from his unique query and not from a SELECT * statement.
Can anyone provide some guidance on the best way to achieve this? Thanks.
Side note: I've been trying to use temp tables to do this by using the SELECT INTO statements, but I think that might cause some problems if, say, User A performs a search and his results are stored in the temp table then User B performs a search shortly after and User A's search results are overwritten.
In SQL Server the ROW_NUMBER() function is great for pagination, and may be helpful depending on what parameters change between searches, for example if searches were just for different firstName values you could use:
;WITH search AS (SELECT *,ROW_NUMBER() OVER (PARTITION BY firstName ORDER BY lastName) AS RN_firstName
FROM YourTable)
SELECT *
FROM search
WHERE RN BETWEEN 51 AND 100
AND firstName = 'John'
You could add additional ROW_NUMBER() lines, altering the PARTITION BY clause based on which fields are being searched.
Historically, for us, the best way to manage this is to create a complete new table, with a unique name. Then, when you're done, you can schedule the table for deletion.
The table, if practical, simply contains an index id (a simple sequenece: 1,2,3,4,5) and the primary key to the table(s) that are part of the query. Not the entire result set.
Your pagination logic then does something like:
SELECT p.* FROM temp_1234 t, primary_table p
WHERE t.pkey = p.primary_key
AND t.serial_id between 51 and 100
The serial id is your paging index.
So, you end up with something like (note, I'm not a SQL Server guy, so pardon):
CREATE TABLE temp_1234 (
serial_id serial,
pkey number
);
INSERT INTO temp_1234
SELECT 0, primary_key FROM primary_table WHERE <criteria> ORDER BY <sort>;
CREATE INDEX i_temp_1234 ON temp_1234(serial_id); // I think sql already does this for you
If you can delay the index, it's faster than creating it first, but it's a marginal improvement most likely.
Also, create a tracking table where you insert the table name, and the date. You can use this with a reaper process later (late at night) to DROP the days tables (those more than, say, X hours old).
Full table operations are much cheaper than inserting and deleting rows in to an individual table:
INSERT INTO page_table SELECT 'temp_1234', <sequence>, primary_key...
DELETE FROM page_table WHERE page_id = 'temp_1234';
That's just awful.
First of all, make sure you really need to do this. You're adding significant complexity, so go & measure whether the queries and pagination really hurts or you just "feel like you should". The pagination can be handled with ROW_NUMBER() quite easily.
Assuming you go ahead, once you've got your query, clearly you need to build a cache so first you need to identify what the key is. It will be the SQL statement or operation identifier (name of stored procedure perhaps) and the criteria used. If you don't want to share between users then the user name or some kind of session ID too.
Now when you do a query, you first look up in this table with all the key data then either
a) Can't find it so you run the query and add to the cache, storing the criteria/keys and the data or PK of the data depending on if you want a snapshot or real time. Bear in mind that "real time" isn't really because other users could be changing data under you.
b) Find it, so remove the results (or join the PK to the underlying tables) and return the results.
Of course now you need a background process to go and clean up the cache when it's been hanging around too long.
Like I said - you should really make sure you need to do this before you embark on it. In the example you give I don't think it's worth it.

How to convert a loop in SQL to Set-based logic

I have spent a good portion of today and yesterday attempting to decide whether to utilize a loop or cursor in SQL or to figure out how to use set based logic to solve the problem. I am not new to set logic, but this problem seems to be particularly complex.
The Problem
The idea is that if I have a list of all transactions (10's, 100's of millions) and a date they occurred, I can start combining some of that data into a daily totals table so that it is more rapidly view able by reporting and analytic systems. The pseudocode for this is as such:
foreach( row in transactions_table )
if( row in totals_table already exists )
update totals_table, add my totals to the totals row
else
insert into totals_table with my row as the base values
delete ( or archive ) row
As you can tell, the block of the loop is relatively trivial to implement, and as is the cursor/looping iteration. However, the execution time is quite slow and unwieldy and my question is: is there a non-iterative way to perform such a task, or is this one of the rare exceptions where I just have to "suck it up" and use a cursor?
There have been a few discussions on the topic, some of which seem to be similar, but not usable due to the if/else statement and the operations on another table, for instance:
How to merge rows of SQL data on column-based logic? This question doesn't seem to be applicable because it simply returns a view of all sums, and doesn't actually make logical decisions about additions or updates to another table
SQL Looping seems to have a few ideas about selection with a couple of cases statements which seems possible, but there are two operations that I need done dependent upon the status of another table, so this solution does not seem to fit.
SQL Call Stored Procedure for each Row without using a cursor This solution seems to be the closest to what I need to do, in that it can handle arbitrary numbers of operations on each row, but there doesn't seem to be a consensus among that group.
Any advice how to tackle this frustrating problem?
Notes
I am using SQL Server 2008
The schema setup is as follows:
Totals: (id int pk, totals_date date, store_id int fk, machine_id int fk, total_in, total_out)
Transactions: (transaction_id int pk, transaction_date datetime, store_id int fk, machine_id int fk, transaction_type (IN or OUT), transaction_amount decimal)
The totals should be computed by store, by machine, and by date, and should total all of the IN transactions into total_in and the OUT transactions into total_out. The goal is to get a pseudo data cube going.
You would do this in two set-based statements:
BEGIN TRANSACTION;
DECLARE #keys TABLE(some_key INT);
UPDATE tot
SET totals += tx.amount
OUTPUT inserted.some_key -- key values updated
INTO #keys
FROM dbo.totals_table AS tot WITH (UPDLOCK, HOLDLOCK)
INNER JOIN
(
SELECT t.some_key, amount = SUM(amount)
FROM dbo.transactions_table AS t WITH (HOLDLOCK)
INNER JOIN dbo.totals_table AS tot
ON t.some_key = tot.some_key
GROUP BY t.some_key
) AS tx
ON tot.some_key = tx.some_key;
INSERT dbo.totals_table(some_key, amount)
OUTPUT inserted.some_key INTO #keys
SELECT some_key, SUM(amount)
FROM dbo.transactions_table AS tx
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.totals_table
WHERE some_key = tx.some_key
)
GROUP BY some_key;
DELETE dbo.transactions_table
WHERE some_key IN (SELECT some_key FROM #keys);
COMMIT TRANSACTION;
(Error handling, applicable isolation level, rollback conditions etc. omitted for brevity.)
You do the update first so you don't insert new rows and then update them, performing work twice and possibly double counting. You could use output in both cases to a temp table, perhaps, to then archive/delete rows from the tx table.
I'd caution you to not get too excited about MERGE until they've resolved some of these bugs and you have read enough about it to be sure you're not lulled into any false confidence about how much "better" it is for concurrency and atomicity without additional hints. The race conditions you can work around; the bugs you can't.
Another alternative, from Nikola's comment
CREATE VIEW dbo.TotalsView
WITH SCHEMABINDING
AS
SELECT some_key_column(s), SUM(amount), COUNT_BIG(*)
FROM dbo.Transaction_Table
GROUP BY some_key_column(s);
GO
CREATE UNIQUE CLUSTERED INDEX some_key ON dbo.TotalsView(some_key_column(s));
GO
Now if you want to write queries that grab the totals, you can reference the view directly or - depending on query and edition - the view may automatically be matched even if you reference the base table.
Note: if you are not on Enterprise Edition, you may have to use the NOEXPAND hint to take advantage of the pre-aggregated values materialized by the view.
I do not think you need the loop.
You can just
Update all rows/sums that match your filters/ groups
Archive/ delete previous.
Insert all rows that do not match your filter/ groups
Archive/ delete previous.
SQL is supposed to use mass data not rows one by one.

how do i cache a variable every 24 hours?

I have a query with horrible performance:
select COUNT(distinct accession_id) count,MONTH(received_date) month,YEAR(received_date) year
into #tmpCounts
from F_ACCESSION_DAILY
where CLIENT_ID not in (select clientid from SalesDWH..TestPractices)
group by MONTH(received_date),YEAR(received_date)
Instead of waiting for this query, I would like to create a variable or a view or anything else that I can store on the server and have the server automatically calculate it every 24 hours.
I would like to then be able to do a select * from #tmpCounts
How can I achieve this ?
I don't know if this meets your need, but I would create a table for storing it, use SQL Server Agent to create a job that just truncates the table, runs the query you mention above and inserts the rows. This way from then on you could query the table for those results.
As an aside, likely the easiest way to truncate and load the table would be running a very simple SSIS package.
Aside from your issue of performance and trying an alternative to prebuild each time once per day... How many records are you dealing with for this query to process.
In addition, I would alter the query since IN ( SUBSELECT ) are ALWAYS horrible. I would change to a LEFT-JOIN to the client table and test for NULL or not
select
YEAR(FAD.received_date) year,
MONTH(FAD.received_date) month,
COUNT(distinct FAD.accession_id) count
from
F_ACCESSION_DAILY FAD
LEFT JOIN SalesDWH..TestPractices TP
on FAD.Client_ID = TP.ClientID
where
TP.CLIENT_ID IS NULL
group by
YEAR(FAD.received_date),
MONTH(FAD.received_date)
I would also ensure having an index on your ACCESSION table on received_date, and your TestPractices table on its ClientID
Instead of creating a cache table, consider creating an indexed view. There are limitations, but you may be able to improve your performance dramatically without much extra processing or code.
Here's some basic information to start with:
http://beyondrelational.com/quiz/sqlserver/tsql/2011/questions/advantage-and-disadvantage-of-using-indexed-view-in-sql-server.aspx