Efficient sliding window sum over a database table - sql

A database has a transactions table with columns: account_id, date, transaction_value (signed integer). Another table (account_value) stores the current total value of each account, which is the sum of all transaction_values per account. It is updated with a trigger on the transactions table (i.e., INSERTs, UPDATEs and DELETEs to transactions fire the trigger to change the account_value.)
A new requirement is to calculate the account's total transaction value only over the last 365 days. Only the current running total is required, not previous totals. This value will be requested often, almost as often as the account_value.
How would you implement this "sliding window sum" efficiently? A new table is ok. Is there a way to avoid summing over a year's range every time?

This can be done with standard windowing functions:
SELECT account_id,
sum(transaction_value) over (partition by account_id order by date)
FROM transactions
The order by inside the over() claues makes the sum a "sliding sum".
For the "only the last 356 days" you'd need a second query that will limit the rows in the WHERE clause.
The above works in PostgreSQL, Oracle, DB2 and (I think) Teradata. SQL Server does not support the order by in the window definition (the upcoming Denali version will AFAIK)

As simple as this?
SELECT
SUM(transaction_value), account_id
FROM
transactions t
WHERE
-- SQL Server, Sybase t.DATE >= DATEADD(year, -1, GETDATE())
-- MySQL t.DATE >= DATE_SUB(NOW(), INTERVAL 12 MONTH)
GROUP BY
account_id;
You may want to remove the time component from the date expressions using DATE (MySQL) or this way in SQL Server

If queries of the transactions table are more frequent than inserts to the transactions table, then perhaps a view is the way to go?

You are going to need a one-off script to populate the existing table with values for the preceding year for each existing record - that will need to run for the whole of the previous year for each record generated.
Once the rolling year column is populated, one alternative to summing the previous year would be to derive each new record's value as the previous record's rolling year value, plus the transaction value(s) since the last update, minus the transaction values between one year prior to the last update and one year ago from now.
I suggest trying both approaches against realistic test data to see which will perform better - I would expect summing the whole year to perform at least as well where data is relatively sparse, while the difference method may work better if data is to be frequently updated on each account.

I'll avoid any actual SQL here as it varies a lot depending on the variety of SQL that you are using.
You say that you have a trigger to maintain the existing running total.
I presume that it also (or perhaps a nightly process) creates new daily records in the account_value table. Then INSERTs, UPDATEs and DELETEs fire the trigger to add or subtract from the existing running total?
The only changes you need to make are:
- add a new field, "yearly_value" or something
- have the existing trigger update that in the same way as the existing field
- use gbn's type of answer to create today's records (or however far you backdate)
- but initialise each new daily record in a slightly different way...
When you insert a new row for a new day, it should be initialised to yesterday's value - the value 365 days ago. After that, the behavior should be identical to what you're already used to.

Related

creating materialized view for annual report based on slow function

Consider the following scenario:
I have a table with 1 million product ids products :
create table products (
pid number,
p_description varchar2(200)
)
also there is a relatively slow function
function gerProductMetrics(pid,date) return number
which returns some metric for the given product at given date.
there is also an annual report executed every year that is based on the following query:
select pid,p_description,getProductMetrics(pid,'2019-12-31') from
products
that query takes about 20-40 minutes to execute for a given year.
would it be correct approach to create Materialized View (MV) for this scenario using the following
CREATE TABLE mydates
(
mydate date
);
INSERT INTO mydates (mydate)
VALUES (DATE '2019-12-31');
INSERT INTO mydates (mydate)
VALUES (DATE '2018-12-31');
INSERT INTO mydates (mydate)
VALUES (DATE '2017-12-31');
CREATE MATERIALIZED VIEW metrics_summary
BUILD IMMEDIATE
REFRESH FORCE ON DEMAND
AS
SELECT pid,
getProductMetrics(pid,mydate AS annual_metric,
mydate
FROM products,mydates
or it would take forever?
Also, how and how often would I update this MV?
Metrics data is required for the end of each year.
But any year's data could be requested at any time.
Note, that I have no control over the slow function - it's just a given.
thanks.
First, you do not have a "group by" query, so you can remove that.
An MV would be most useful if you needed to recompute all of the data for all years. As this appears to be a summary, with no need to reprocess old data, updated only when certain threshold dates like end of year are passed, I would recommend putting the results in a normal table and only adding the updates as often as your threshold dates occur (annually?) using a stored procedure. Otherwise your MV will take longer to run and require more system resources with every execution that adds a new date.
Do not create a materialized view. This is not just a performance issue. It is also an archiving issue: You don't want to run the risk that historical results could change.
My advice is to create a single table with a "year" column. Run the query once per year and insert the rows into the new table. This is an archive of the results.
Note: If you want to recalculate previous years because the results may have changed (say the data is updated somehow), then you should store those results in a separate table and decide which version is the "right" version. You may find that you want an archive table with both the "as-of" date and the "run-date" to see how results might be changing.

Automatically add date for each day in SQL

I'm working on BigQuery and have created a view using multiple tables. Each day data needs to be synced with multiple platforms. I need to insert a date or some other field via SQL through which I can identify which rows were added into the view each day or which rows got updated so only that data I can take forward each day instead of syncing all every day. Best way I can think is to somehow add the the current date wherever an update to a row happens but that date needs to be constant until a further update happens for that record.
Ex:
Sample data
Say we get the view T1 on 1st September and T2 on 2nd. I need to to only spot ID:2 for 1st September and ID:3,4,5 on 2nd September. Note: no such date column is there.I need help in creating such column or any other approach to verify which rows are getting updated/added daily
You can create a BigQuery schedule queries with frequency as daily (24 hours) using below INSERT statement:
INSERT INTO dataset.T1
SELECT
*
FROM
dataset.T2
WHERE
date > (SELECT MAX(date) FROM dataset.T1);
Your table where the data is getting streamed to (in your case: sample data) needs to be configured as a partitioned table. Therefor you use "Partition by ingestion time" so that you don't need to handle the date yourself.
Configuration in BQ
After you recreated that table append your existing data to that new table with the help of the format options in BQ (append) and RUN.
Then you create a view based on that table with:
SELECT * EXCEPT (rank)
FROM (
SELECT
*,
ROW_NUMBER() OVER (GROUP BY invoice_id ORDER BY _PARTITIONTIME desc) AS rank
FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
Always use the view from that on.

Oracle query to fetch data based upon a order without using order by

Below is the problem statement :
I have a table where we are inserting some data along with a business date of that Hospital along with the create time stamp.
The business date (Bday) defines the logical day for which the hospital has done the business and the create time stamp (create_ts) defines the timestamp at which the record was inserted in our database.
The business date could be 1 day ahead of the create timestamp as the create timestamp is in PST and the Hospital can be in South Asia, Australia region.
Also a hospital can be opening for next day if they accidentally closed for today's business date in the application.
I need to sync the record from a staging database to main database.
I want to sync the record with minimum create timestamp first. But order by is a costly operation as sometimes the number of records to be synced are more than 100k
10 Different threads are running to sync the records each with a batch size of 50.
First solution tries was to pickup the records with business date = min business date but there were some complexities:
1) as there were some records with minimum business date which were not syncing due to some issue as a result the next day records were never picked up without manual interventions.
2) There were hospitals which opened for a future business date ( for example today is 13th and they closed accidentally for it so they were opened for business date 14th ). SO these records did not get process as the business date was greater than min business date.
Thought of picking records where the create timestamp is between minimum create timestamp and +1 hour .
But there could be again issue as mentioned in point (1) and there may not be any record to be synced between the stuck record with minimum create timestamp and +1 hour.
Please suggest a solution for the query
Few columns in the table are : Hname (Hospital Name) , Hloc (Hospital location) , Dseq (Per day sequence number ) , Bday (business date ) , create_ts and modify_ts
Oracle doesn't guaranty the order of the resultset if you don't use "ORDER BY" clause.
But you may consider using CDC for Oracle 11g or GoldenGate for Oracle 12c - it should be very efficient. New data will be replicated almost immediately. Oracle will take care of it.
in general: no order by --> no guarantee on the order of result set.

How do I Automatically insert monthly records into a table via SQL?

I'm trying to generate monthly records in one table based on instructions in another table. Software - MS Access 2007, though I'm looking for an SQL solution here. To greatly simplify the matter, let's say the following describes the tables:
TaskManager:
- DayDue
- TaskName
Task:
- DateDue
- TaskName
So what happens is that there may be an entry in TaskManager {15, "Accounts due"}, so this should lead to an "Account due" record in the Task table with the due date being the 15th of each month. I'd want it to create records for the last few months and the next year.
What I'm thinking that I need to do is first create a SELECT query that results in x records for each record in the TaskManager table, with a date for each month. After that, I do an INSERT query which inserts records into the Task table if they do not EXIST in the aforementioned SELECT query.
I think I can manage the INSERT query, though I'm having trouble figuring out how to do the SELECT query. Could someone give me a pointer?
You could use a calendar table.
INSERT INTO Task ( DateDue, TaskName )
SELECT calendar.CalDate, TaskManager.TaskName
FROM calendar, TaskManager
WHERE (((Day([CalDate]))=TaskManager.DayDue)
AND ((calendar.CalDate)<#7/1/2013#));
The calendar table would simply contain all dates and other such relevant fields as work day (yesno). Calendar tables are generally quite useful.
Here is the solution I developed using Remou's Calendar table idea.
First create a Calendar table, which simply contains all dates for a desired range. It's easy to just make the dates in Excel and paste them into the table. This is also a very reliable way of doing it, as Excel handles leap years correctly for the modern range of dates.
After building this table, there are three queries to run. The first is a SELECT, which selects every possible task generated by the TaskManager based on the date and frequency. This query is called TaskManagerQryAllOptions, and has the following code:
SELECT TaskManager.ID, Calendar.CalendarDate
FROM TaskManager INNER JOIN Calendar ON
TaskManager.DateDay = Day(Calendar.CalendarDate)
WHERE (TaskManager.Frequency = "Monthly")
OR (TaskManager.Frequency = "Yearly" AND
TaskManager.DateMonth = Month(Calendar.CalendarDate))
OR (TaskManager.Frequency = "Quarterly" AND
(((Month(Calendar.CalendarDate)- TaskManager.DateMonth) Mod 3) = 0));
The bulk of the above is to cover the different options a quarterly Day and Month pair could cover. The next step is another SELECT query, which selects records from the TaskManagerQryAllOptions in which the date is within the required range. This query is called TaskManagerQrySelect.
SELECT TaskManagerQryAllOptions.ID, TaskManager.TaskName,
TaskManagerQryAllOptions.CalendarDate
FROM TaskManagerQryAllOptions INNER JOIN TaskManager
ON TaskManagerQryAllOptions.ID = TaskManager.ID
WHERE (TaskManagerQryAllOptions.CalendarDate > Date()-60)
AND (TaskManagerQryAllOptions.CalendarDate < Date()+370)
AND (TaskManagerQryAllOptions.CalendarDate >= TaskManager.Start)
AND ((TaskManagerQryAllOptions.CalendarDate <= TaskManager.Finish)
OR (TaskManager.Finish Is Null))
ORDER BY TaskManagerQryAllOptions.CalendarDate;
The final query is an INSERT. As we will be using this query frequently, we don't want it to generate duplicates, so we need to filter out already created records.
INSERT INTO Task ( TaskName, TaskDate )
SELECT TaskManagerQrySelect.TaskName, TaskManagerQrySelect.CalendarDate
FROM TaskManagerQrySelect
WHERE Not Exists(
SELECT *
FROM Task
WHERE Task.TaskName = TaskManagerQrySelect.TaskName
AND Task.TaskDate = TaskManagerQrySelect.CalendarDate);
One limitation of this method is that if the date of repetition (e.g. the 15th of each month) is changed, the future records with the wrong day will remain. A solution to this would be to update all the future records with the adjusted date, then run the insert.
One possibility could be to create a table of Months, and a table of Years (prior year, current, and next one). I could run a SELECT query which takes the Day from the TaskManager table, the Month from the Month table, and the Year from the Year table - I imagine that this could somehow create my desired multiple records for a single TaskManager record. Though I'm not sure what the exact SQL would be.

How to query the number of changes that have been made to a particular column in SQL

I have a database with a column that I want to query the amount of times it has changed over a period of time. For example, I have the username, user's level, and date. How do I query this database to see the number of times the user's level has changed over x amount of years?
(I've looked in other posts on stackoverflow, and they're telling me to use triggers. But in my situation, I want to query the database for the number of changes that has been made. If my question can't be answered, please tell me what other columns might I need to look into to figure this out. Am I supposed to use Lag for this? )
A database will not inherently capture this information for you. Two suggestions would be to either store your data as a time series so instead of updating the value you add a new row to a table as the new current value and expire the old value. The other alternative would be to just add a new column for tracking the number of updates to the column you care about. This could be done in code or in a trigger.
Have you ever heard of the LOG term ?
You have to create a new table, in wich you will store your wanted changes.
I can imagine this solution for the table:
id - int, primary key, auto increment
table - the table name where the info has been changed
table_id - the information unique id from the table where changes
have been made
year - integer
month - integer
day - integer
knowin this, you can count everything
In case you are already keeping track of the level history by adding a new row with a different level and date every time a user changes level:
SELECT username, COUNT(date) - 1 AS changes
FROM table_name
WHERE date >= '2011-01-01'
GROUP BY username
That will give you the number of changes since Jan 1, 2011. Note that I'm subtracting 1 from the COUNT. That's because a user with a single row on your table has never changed levels, that row represents the user's initial level.