SQL: comparing record versions in the same table

I have a table that loads employee records weekly on Monday. The load date is stored on the record. I need to sum the total changed (add/update) records from one week to the next.
This is what I have so far. It splits new record and updated record counts for the latest load date compared to the previous load date.
I'm not sure if this is a good way to do this and I'd really appreciate any feedback I could get about my method, or advice on a better way to accomplish my goal.
Thanks.
SELECT
RIGHT(CONVERT(VARCHAR(10), REPORT_DATE, 103), 7) AS REPORT_DATE,
[NEW],
[UPDATED]
FROM
(
SELECT
CUR.LOAD_DATE AS REPORT_DATE,
CASE
WHEN PRV.LOAD_DATE IS NULL THEN 'NEW'
ELSE 'UPDATED'
END AS RECORD_TYPE,
COUNT(*) AS RECORD_COUNT
FROM
(SELECT *
FROM EMPLOYEES
WHERE LOAD_DATE = (SELECT MAX(LOAD_DATE) FROM EMPLOYEES)) CUR
LEFT OUTER JOIN
(SELECT *
FROM EMPLOYEES
WHERE LOAD_DATE = (SELECT DATEADD(WEEK,-1,MAX(LOAD_DATE)) FROM EMPLOYEES)) PRV
ON
CUR.EMPLOYEE_ID = PRV.EMPLOYEE_ID
WHERE
PRV.EMPLOYEE_ID IS NULL
OR (CUR.FIRST_NAME != PRV.FIRST_NAME
OR CUR.LAST_NAME != PRV.LAST_NAME
OR CUR.ADDRESS1 != PRV.ADDRESS1
OR CUR.ADDRESS2 != PRV.ADDRESS2
OR CUR.CITY != PRV.CITY
OR CUR.STATE != PRV.STATE
OR CUR.ZIP != PRV.ZIP
OR CUR.POSITION != PRV.POSITION
OR CUR.LOCATION != PRV.LOCATION)
GROUP BY
CUR.LOAD_DATE,
PRV.LOAD_DATE
) DT
PIVOT
(SUM(RECORD_COUNT) FOR RECORD_TYPE IN ([NEW], [UPDATED])) PV;

I have a couple of suggestions that could simplify your code and even improve the performance of the query.
Since you are looking up the "last date of loading data for employees", consider adding a table that logs each load, including the time of loading. This would improve performance, and you wouldn't have to run SELECT MAX(LOAD_DATE) FROM ... twice.
You could also add a column that records the time each record was last updated; then, when you are looking for changed records, you only need to compare each record's updated time with the load time. Putting an update trigger on this table would be a good way to maintain the updated time.
The point of both suggestions is to avoid joining the table to itself and touching the data pages twice. Since your report only retrieves counts, you don't need the full contents of the EMPLOYEES table.
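A rough sketch of the idea (SQL Server syntax; the LOAD_LOG table and the UPDATED_TIME column are assumed names, not part of your current schema):
-- Hypothetical load log: one row per weekly load, so finding the latest
-- load date scans a tiny table instead of EMPLOYEES.
CREATE TABLE LOAD_LOG (
    LOAD_DATE DATE NOT NULL PRIMARY KEY
);

-- Assuming the load process (or an update trigger) stamps UPDATED_TIME only
-- when a row is inserted or actually changes, counting changed records needs
-- little more than an index on (LOAD_DATE, UPDATED_TIME).
SELECT COUNT(*) AS CHANGED_RECORDS
FROM EMPLOYEES
WHERE LOAD_DATE = (SELECT MAX(LOAD_DATE) FROM LOAD_LOG)
  AND UPDATED_TIME >= LOAD_DATE;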
First, code like this states your intention ("sum the total changed records") more clearly. Second, the database only needs an index to COUNT your metric (a proper index on LOAD_DATE, of course), so performance should be better than the self-join method.
There are multiple ways to generate a report with SQL. Because SQL can be hard to read, concise writing matters for maintainability, and because tracking down performance problems in SQL is hard work, writing an efficient query up front is worth more than rewriting it afterwards.
In my experience, "decent SQL" comes down to:
Acceptable performance under any plausible workload.
Code that is as readable as possible without sacrificing that performance.
Forgive me for repeating my point: if you have a complex SQL statement that performs poorly, you take on more risk when you later have to modify it just to improve performance.

Related

What is the fastest way to perform a date query in Oracle SQL?

We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, the statement above applies to any data type, not only dates. Also, the word "many" is relative to the number of records in the table. If the optimizer decides that the query will return a large share of all the records in the table, it may decide that a full table scan is faster than using the index. In your situation, that translates to: how many records fall in 2017 out of all the records in the table? That calculation gives you the cardinality of your query, which in turn gives you an idea of whether an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query; you are only comparing the year part, so an index on the date column will not be used by this query. You need to create an index on the year part, using the same expression as the condition.
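In Oracle, indexing "the year part" means a function-based index. A minimal sketch (your_table stands in for the real table name):
CREATE INDEX ix_performed_year
    ON your_table (EXTRACT(YEAR FROM PERFORMED_DATE_TIME));

-- The query must use exactly the same expression for this index to be considered:
SELECT *
FROM your_table
WHERE Event_Code = 102225120
  AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017;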
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
A function applied to the column can also cause slowness, given the number of records involved. I'm not sure whether a function-based index would help here, but you can try.
Have you tried adding a year column to the table? If not, add one and populate it with the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider table partitioning for data this large. For starters, see: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters, I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you had an index on the field, the latter would hardly be able to make use of it, while the first method would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date-values AND the PK of the original table in another database and query those to find the relevant PK values and then JOIN those back to your original table to find whatever you need. The biggest problem with this would be that you need to keep the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
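A hedged sketch of that shadow idea, assuming Oracle on both sides, a numeric primary key called ID, and a database link named vendor_db (all of these are assumptions):
-- Local shadow of just the key and the date, refreshed overnight.
CREATE TABLE shadow_dates (
    pk                  NUMBER PRIMARY KEY,
    performed_date_time DATE NOT NULL
);
CREATE INDEX ix_shadow_date ON shadow_dates (performed_date_time);

-- Nightly refresh (a full reload is shown for simplicity; merging only changes is preferable).
TRUNCATE TABLE shadow_dates;
INSERT INTO shadow_dates (pk, performed_date_time)
SELECT id, performed_date_time FROM event_table@vendor_db;

-- Find the relevant keys locally, then join back to the vendor table.
SELECT t.*
FROM event_table@vendor_db t
JOIN shadow_dates s ON s.pk = t.id
WHERE t.event_code = 102225120
  AND s.performed_date_time >= DATE '2017-01-01'
  AND s.performed_date_time <  DATE '2018-01-01';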
Maybe this could be useful (you avoid functions, which are a cause of context switching, and if you have an index on your date field it could be used):
with
dt as
(
select
to_date('01/01/2017', 'DD/MM/YYYY') as d1,
to_date('31/01/2017', 'DD/MM/YYYY') as d2
from dual
),
dates as
(
select
dt.d1 + rownum -1 as d
from dt
connect by dt.d1 + rownum -1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME
Move the date literal to RHS:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.
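If you go that route, the scripts themselves are trivial; the index and table names below are placeholders:
-- add the index (re-run after each vendor upgrade)
CREATE INDEX ix_event_perf_dt ON event_table (performed_date_time);

-- remove it before the next vendor upgrade
DROP INDEX ix_event_perf_dt;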

Oracle join of two tables where joined to table returns highest of a result set restricted by the joining (parent?) table?

Sorry for the long title, but it's a difficult dilemma to distill down to a short title... :-)
I have two tables, an auto maintenance log table and an auto trip log table, something like the following:
auto_maint_log:
auto_id
maint_datetime
maint_description
auto_trip_log:
auto_id
trip_datetime
ending_odometer
I need to select all maintenance events for each auto, and for each event, lookup the most recent ending_odometer value at the time of the maintenance from the trip log table. The only way I have successfully accomplished this task is using a function (i.e. get_odometer(auto_id, maint_datetime) as a column in my query) in which the odometer reading from the trip log is evaluated to pull the most recent trip prior to the passed maint_datetime, thus returning the most recent ending_odometer.
While the function call does work in theory, in practice it does not. My end goal is to create a view of the maintenance data that includes the (then) current odometer value, and I have millions of maintenance rows across hundreds of vehicles. The performance of the function call makes its use as a solution impractical, nay impossible.. :-)
I've tried all the variations of MAX, ROWNUM, subselects, and analytics (OVER/PARTITION, etc.) that I've been able to scour from googling this, and have not been able to code a working query.
Any suggestions, or brilliant solutions are welcome!
Thanks,
You can do this with a correlated subquery:
select aml.*,
(select max(atl.ending_odometer)
from auto_trip_log atl
where atl.auto_id = aml.auto_id and
atl.trip_datetime <= aml.maint_datetime
) as ending_odometer
from auto_maint_log aml;
This query uses max() because (presumably) the odometer readings are steadily increasing.
EDIT:
select aml.*,
(select max(atl.ending_odometer) keep (dense_rank first order by atl.trip_datetime desc)
from auto_trip_log atl
where atl.auto_id = aml.auto_id and
atl.trip_datetime <= aml.maint_datetime
) as ending_odometer
from auto_maint_log aml;
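For comparison, a hedged sketch of the analytic (OVER/PARTITION) formulation the question mentions trying; it ranks each maintenance event's prior trips and keeps the latest one (untested against the real tables):
select auto_id, maint_datetime, maint_description, ending_odometer
from (
    select m.auto_id, m.maint_datetime, m.maint_description,
           t.ending_odometer,
           row_number() over (partition by m.auto_id, m.maint_datetime
                              order by t.trip_datetime desc) as rn
    from auto_maint_log m
    left join auto_trip_log t
      on t.auto_id = m.auto_id
     and t.trip_datetime <= m.maint_datetime
)
where rn = 1;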

How do I cache a variable every 24 hours?

I have a query with horrible performance:
select COUNT(distinct accession_id) count, MONTH(received_date) month, YEAR(received_date) year
into #tmpCounts
from F_ACCESSION_DAILY
where CLIENT_ID not in (select clientid from SalesDWH..TestPractices)
group by MONTH(received_date),YEAR(received_date)
Instead of waiting for this query, I would like to create a variable or a view or anything else that I can store on the server and have the server automatically calculate it every 24 hours.
I would like to then be able to do a select * from #tmpCounts
How can I achieve this ?
I don't know if this meets your need, but I would create a table for storing the results, then use SQL Server Agent to create a job that truncates the table, runs the query you mention above, and inserts the rows. From then on you could query that table for the results.
As an aside, likely the easiest way to truncate and load the table would be running a very simple SSIS package.
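If you stay in plain T-SQL, the job step could look roughly like this (dbo.AccessionCountsCache is a made-up name; the SELECT is the one from the question):
-- one-time setup
CREATE TABLE dbo.AccessionCountsCache (
    [year]  INT NOT NULL,
    [month] INT NOT NULL,
    [count] INT NOT NULL
);

-- job step, scheduled every 24 hours
TRUNCATE TABLE dbo.AccessionCountsCache;

INSERT INTO dbo.AccessionCountsCache ([year], [month], [count])
SELECT YEAR(received_date), MONTH(received_date), COUNT(DISTINCT accession_id)
FROM F_ACCESSION_DAILY
WHERE CLIENT_ID NOT IN (SELECT clientid FROM SalesDWH..TestPractices)
GROUP BY YEAR(received_date), MONTH(received_date);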
Aside from the performance issue and the idea of prebuilding the result once per day: how many records does this query have to process?
In addition, I would alter the query, since IN (SUBSELECT) is almost always horrible for performance. I would change it to a LEFT JOIN to the client table and test for NULL:
select
YEAR(FAD.received_date) year,
MONTH(FAD.received_date) month,
COUNT(distinct FAD.accession_id) count
from
F_ACCESSION_DAILY FAD
LEFT JOIN SalesDWH..TestPractices TP
on FAD.Client_ID = TP.ClientID
where
TP.ClientID IS NULL
group by
YEAR(FAD.received_date),
MONTH(FAD.received_date)
I would also ensure having an index on your ACCESSION table on received_date, and your TestPractices table on its ClientID
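Something along these lines, for example (the index names and the INCLUDE column are assumptions about what would help this particular query):
CREATE INDEX IX_FAD_received_date
    ON F_ACCESSION_DAILY (received_date, CLIENT_ID)
    INCLUDE (accession_id);

-- and, in the SalesDWH database:
CREATE INDEX IX_TP_clientid ON TestPractices (clientid);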
Instead of creating a cache table, consider creating an indexed view. There are limitations, but you may be able to improve your performance dramatically without much extra processing or code.
Here's some basic information to start with:
http://beyondrelational.com/quiz/sqlserver/tsql/2011/questions/advantage-and-disadvantage-of-using-indexed-view-in-sql-server.aspx

Get Day, Month, Year, Lifetime total records with one query w/ optimizations

I have a Postgres DB running 7.4 (Yeah we're in the midst of upgrading)
I have four separate queries to get the Daily, Monthly, Yearly and Lifetime record counts
SELECT COUNT(field)
FROM database
WHERE date_field
BETWEEN DATE_TRUNC('DAY', LOCALTIMESTAMP)
AND DATE_TRUNC('DAY', LOCALTIMESTAMP) + INTERVAL '1 DAY'
For Month, just replace the word DAY with MONTH in the query, and so on for each time duration.
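For instance, the monthly count (same placeholder table and column names as above) would be:
SELECT COUNT(field)
FROM database
WHERE date_field
BETWEEN DATE_TRUNC('MONTH', LOCALTIMESTAMP)
AND DATE_TRUNC('MONTH', LOCALTIMESTAMP) + INTERVAL '1 MONTH'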
Looking for ideas on how to get all the desired results with one query and any optimizations one would recommend.
Thanks in advance!
NOTE: date_field is timestamp without time zone
UPDATE:
Sorry, I do filter out records with additional query constraints; I just wanted to give the gist of the date_field comparisons. Sorry for any confusion.
One idea is to use prepared statements and a simple statistics table (record_count_t) for this:
-- DROP TABLE IF EXISTS record_count_t;
-- DEALLOCATE record_count;
-- DROP FUNCTION updateRecordCounts();
CREATE TABLE record_count_t (type char, count bigint);
INSERT INTO record_count_t (type) VALUES ('d'), ('m'), ('y'), ('l');
PREPARE record_count (text) AS
UPDATE record_count_t SET count =
(SELECT COUNT(field)
FROM database
WHERE
CASE WHEN $1 <> 'l' THEN
DATE_TRUNC($1, date_field) = DATE_TRUNC($1, LOCALTIMESTAMP)
ELSE TRUE END)
WHERE type = $1;
CREATE FUNCTION updateRecordCounts() RETURNS void AS
$$
EXECUTE record_count('d');
EXECUTE record_count('m');
EXECUTE record_count('y');
EXECUTE record_count('l');
$$
LANGUAGE SQL;
SELECT updateRecordCounts();
SELECT type,count FROM record_count_t;
Call the updateRecordCounts() function any time you need to update the statistics.
I'd guess that it is not possible to optimize this any further than it already is.
If you're collecting daily/monthly/yearly stats, as I'm assuming you are doing, one option (after upgrading, of course) is a with statement and the relevant joins, e.g.:
with daily_stats as (
(what you posted)
),
monthly_stats as (
(what you posted, monthly)
),
etc.
select daily_stats.stats,
monthly_stats.stats,
etc.
from lifetime_stats
left join yearly_stats on ...
left join monthly_stats on ...
left join daily_stats on ...
However, that will actually perform less well than running each query separately in a production environment, because you'll introduce left joins in the DB which could be done just as well in the middleware (i.e. show daily, then monthly, then yearly and finally lifetime stats). (If not better, since you'll be avoiding full table scans.)
By keeping things as is, you'll save precious DB resources for dealing with reads and writes of actual data. The tradeoff (less network traffic between your database and your app) is almost certainly not worth it.
Yikes! Don't do this!!! Not because you can't do what you're asking, but because you probably shouldn't be doing what you're asking in this manner. I'm guessing the reason you've got date_field in your example is because you've got a date_field attached to a user or some other meta-data.
Think about it: you are asking PostgreSQL to scan 100% of the records relevant to a given user. Unless this is a one-time operation, you almost assuredly do not want to do this. If this is a one-time operation and you are planning on caching this value as a meta-data, then who cares about the optimizations? Space is cheap and will save you heaps of execution time down the road.
You should add 4x per-user (or whatever it is) meta-data fields that help sum up the data. You have two options, I'll let you figure out how to use this so that you keep historical counts, but here's the easy version:
CREATE TABLE user_counts_only_keep_current (
user_id , -- Your user_id
lifetime INT DEFAULT 0,
yearly INT DEFAULT 0,
monthly INT DEFAULT 0,
daily INT DEFAULT 0,
last_update_utc TIMESTAMP WITH TIME ZONE,
FOREIGN KEY(user_id) REFERENCES "user"(id)
);
CREATE UNIQUE INDEX this_tbl_user_id_udx ON user_counts_only_keep_current(user_id);
Set up some stored procedures that zero out individual columns if last_update_utc doesn't match the current day according to NOW(). You can get creative from here, but incrementing records like this is going to be the way to go.
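A hedged sketch of that increment-plus-reset step for a single new record (the exact reset rules are an assumption; adapt them to how you define your day/month/year boundaries):
UPDATE user_counts_only_keep_current
SET daily    = CASE WHEN last_update_utc::date = now()::date
                    THEN daily + 1 ELSE 1 END,
    monthly  = CASE WHEN date_trunc('month', last_update_utc) = date_trunc('month', now())
                    THEN monthly + 1 ELSE 1 END,
    yearly   = CASE WHEN date_trunc('year', last_update_utc) = date_trunc('year', now())
                    THEN yearly + 1 ELSE 1 END,
    lifetime = lifetime + 1,
    last_update_utc = now()
WHERE user_id = 42;  -- the user the new record belongs to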
Time series data in any relational database requires special handling and maintenance. Look into PostgreSQL's table inheritance if you want good temporal data management... but really, don't do whatever it is you're about to do to your application, because it's almost certainly going to result in bad things(tm).

What's the most efficient query?

I have a table named Projects that has the following relationships:
has many Contributions
has many Payments
In my result set, I need the following aggregate values:
Number of unique contributors (DonorID on the Contribution table)
Total contributed (SUM of Amount on Contribution table)
Total paid (SUM of PaymentAmount on Payment table)
Because there are so many aggregate functions and multiple joins, it gets messy to use standard aggregate functions with the GROUP BY clause. I also need the ability to sort and filter these fields. So I've come up with two options:
Using subqueries:
SELECT Project.ID AS PROJECT_ID,
(SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = PROJECT_ID) AS TotalPaidBack,
(SELECT COUNT(DISTINCT DonorID) FROM Contribution WHERE RecipientID = PROJECT_ID) AS ContributorCount,
(SELECT SUM(Amount) FROM Contribution WHERE RecipientID = PROJECT_ID) AS TotalReceived
FROM Project;
Using a temporary table:
DROP TABLE IF EXISTS Project_Temp;
CREATE TEMPORARY TABLE Project_Temp (project_id INT NOT NULL, total_payments INT, total_donors INT, total_received INT, PRIMARY KEY(project_id)) ENGINE=MEMORY;
INSERT INTO Project_Temp (project_id,total_payments)
SELECT `Project`.ID, IFNULL(SUM(PaymentAmount),0) FROM `Project` LEFT JOIN `Payment` ON ProjectID = `Project`.ID GROUP BY 1;
INSERT INTO Project_Temp (project_id,total_donors,total_received)
SELECT `Project`.ID, IFNULL(COUNT(DISTINCT DonorID),0), IFNULL(SUM(Amount),0) FROM `Project` LEFT JOIN `Contribution` ON RecipientID = `Project`.ID GROUP BY 1
ON DUPLICATE KEY UPDATE total_donors = VALUES(total_donors), total_received = VALUES(total_received);
SELECT * FROM Project_Temp;
Tests for both are pretty comparable, in the 0.7 - 0.8 seconds range with 1,000 rows. But I'm really concerned about scalability, and I don't want to have to re-engineer everything as my tables grow. What's the best approach?
Knowing the timing for each 1K rows is good, but the real question is how they'll be used.
Are you planning to send all these back to a UI? Google doles out results 25 per page; maybe you should, too.
Are you planning to do calculations in the middle tier? Maybe you can do those calculations on the database and save yourself bringing all those bytes across the wire.
My point is that you may never need to work with 1,000 or one million rows if you think carefully about what you do with them.
You can use EXPLAIN PLAN to see what the difference between the two queries is.
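In MySQL that amounts to prefixing each candidate query with EXPLAIN and comparing the plans, for example:
EXPLAIN
SELECT Project.ID AS PROJECT_ID,
       (SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = Project.ID) AS TotalPaidBack
FROM Project;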
I would go with the first approach. You are allowing the RDBMS to do its job, rather than trying to do its job for it.
By creating a temp table, you will always build the full table for each query. If you only want data for one project, you still end up building the full table (unless you restrict each INSERT statement accordingly). Sure, you can code it, but it's already becoming a fair amount of code and complexity for a small performance gain.
With a SELECT, the db can fetch the appropriate amount of data, optimizing the whole query based on context. If other users have queried the same data, it may even be cached (the query, and possibly the data, depending upon your db). If performance is truly a concern, you might consider using Indexed/Materialized Views, or generating a table on an INSERT/UPDATE/DELETE trigger. Scaling out, you can use server clusters and partitioned views, something that I believe will be difficult if you are creating temporary tables.
EDIT: the above is written without any specific rdbms in mind, although the OP added that mysql is the target db.
There is a third option which is derived tables:
Select Project.ID AS PROJECT_ID
, Payments.Total AS TotalPaidBack
, Coalesce(ContributionStats.ContributorCount, 0) As ContributorCount
, ContributionStats.Total As TotalReceived
From Project
Left Join (
Select C1.RecipientID, Sum(C1.Amount) As Total, Count(Distinct C1.DonorID) As ContributorCount
From Contribution As C1
Group By C1.RecipientID
) As ContributionStats
On ContributionStats.RecipientID = Project.ID
Left Join (
Select P1.ProjectID, Sum(P1.PaymentAmount) As Total
From Payment As P1
Group By P1.ProjectID
) As Payments
On Payments.ProjectID = Project.ID
I'm not sure if it will perform better, but you might give it a shot.
A few thoughts:
The derived table idea would be good on other platforms, but MySQL has the same issue with derived tables that it does with views: they aren't indexed. That means that MySQL will execute the full content of the derived table before applying the WHERE clause, which doesn't scale at all.
Option 1 is good for being compact, but syntax might get tricky when you want to start putting the derived expressions in the WHERE clause.
The suggestion of materialized views is a good one, but MySQL unfortunately doesn't support them. I like the idea of using triggers. You could translate that temporary table into a real table that persists, and then use INSERT/UPDATE/DELETE triggers on the Payment and Contribution tables to keep that project stats table up to date (a rough sketch follows these notes).
Finally, if you don't want to mess with triggers, and if you aren't too concerned with freshness, you can always have the separate stats table and update it offline, with a cron job that runs every few minutes and does the work you specified in Query #2 above, except against the real table. Depending on the nuances of your application, this slight delay in updating the stats may or may not be acceptable to your users.
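A rough sketch of the trigger-maintained stats table mentioned above (ProjectStats and its column names are made up; only the Payment insert trigger is shown, and the Contribution triggers plus the UPDATE/DELETE cases would follow the same pattern):
CREATE TABLE ProjectStats (
    project_id        INT PRIMARY KEY,
    total_payments    DECIMAL(12,2) NOT NULL DEFAULT 0,
    total_received    DECIMAL(12,2) NOT NULL DEFAULT 0,
    contributor_count INT NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER payment_ai AFTER INSERT ON Payment
FOR EACH ROW
BEGIN
    -- create the stats row on first payment, otherwise add to the running total
    INSERT INTO ProjectStats (project_id, total_payments)
    VALUES (NEW.ProjectID, NEW.PaymentAmount)
    ON DUPLICATE KEY UPDATE total_payments = total_payments + NEW.PaymentAmount;
END//
DELIMITER ;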