In this scenario, "inactive" can generally refer to data that has not been accessed in the last month by users from the web server.
Knowing the "inactive" status of records can be used to optimize queries for the active data as the database table grows larger.
I know one approach can be to:
Update a record with a last_accessed timestamp each time it is accessed.
Monthly, when there is low traffic, have the web server tell the database to update an inactive flag on records according to whether they have been accessed in the past month.
But two major issues with this approach are:
Updating when the client is just trying to select data has a performance impact.
If there are too many records, the monthly update may take too long and cause issues, like locking rows.
I'm wondering what a better or alternative approach could be.
Here is one approach.
You could write a query that essentially checks whether the last_accessed_date falls within the last 30 days (CASE WHEN last_accessed_date < SYSDATE-30) and creates an is_active indicator. That would allow you to mark the historic records as active or inactive.
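For illustration, a minimal sketch of that batch flagging, assuming a hypothetical records table with last_accessed_date and is_active columns (Oracle-style SYSDATE, as in the snippet above):

UPDATE records
   SET is_active = CASE WHEN last_accessed_date < SYSDATE - 30 THEN 'N' ELSE 'Y' END;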
Then, once you have this in place, you would run the script routinely (daily, weekly, or monthly) to refresh the status of these records. Monthly might be a good idea, run during non-business hours so it does not affect performance much (Saturday mornings at 3:00 AM, say). You could schedule this with your team and have your communications team send a notice to the end users that they may see delays during that window (the first Saturday of the month from 3:00 AM to 6:00 AM, or something similar).
Additionally, you could add a second condition. Whenever someone does access a record, a small logic check could say: "Set the last_accessed_date to today; if is_active is currently No, switch it to Yes." That would keep your database more up to date.
A last piece, for optimization: if you include the second option (the logic check), you could add a Last_Updated_Indicator field holding the date the indicator was last changed. If a row's indicator was updated more recently than the last run of your full database update, that row can be skipped. This would drastically cut down on the performance impact of the update procedure.
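A sketch of that touch-on-access logic, using the same hypothetical names (record_id and the bind variable are placeholders):

UPDATE records
   SET last_accessed_date     = SYSDATE,
       is_active              = 'Y',
       last_updated_indicator = SYSDATE   -- lets the monthly job skip this row
 WHERE record_id = :accessed_record_id;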
I have a historical table which contains many price columns, and only a few columns change at a time. Currently I am just inserting all the data as new records, and changes can come in 100+ times every second, so the table is growing in size pretty quickly.
I am trying to find a better design for the table to keep its size to a minimum, along with the best query to retrieve the data when required. I am not too worried about retrieval performance, but it should be reasonable when used for reports. The priority is to keep the table size to a minimum.
Data from this historical table is not retrieved on a day-to-day basis. I have a transaction table like the one in 1) Current Design for that purpose.
Here are the details of my implementation.
1) Current Design
2) Planned design - 1
Question:
1) If I use the above table structure, what is the best query to get results like those shown in Current design #1?
3) Planned design - 2
Question:
1) How much of a performance hit would this be compared to Planned design #1?
2) Also, if I go that route, what is the best query to get results like those shown in Current design #1?
End question:
I assume Planned design #1 will take more table space than Planned design #2, but Planned design #2 will take more time to retrieve the data. Is there any case where my assumption could go wrong?
Edit: I have only inserts going into this table. No updates or deletes are ever made to it.
In fact, I think there is a better option: you can use Temporal Tables, which were introduced in SQL Server 2016.
This table type is managed by SQL Server and tracks changes to a table for you.
Visit this Link: https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-2017
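As an illustration, a minimal sketch of a system-versioned temporal table (hypothetical table and column names; requires SQL Server 2016 or later):

CREATE TABLE dbo.Prices
(
    PriceId   int            NOT NULL PRIMARY KEY CLUSTERED,
    Symbol    varchar(20)    NOT NULL,
    Price     decimal(18, 4) NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PricesHistory));

With this in place you only update the current row; SQL Server moves the previous version into dbo.PricesHistory automatically, and you can query past states with FOR SYSTEM_TIME AS OF.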
I have a similar situation where I'm loading a bunch of temperature sensors every 10 seconds. As I'm using the Express edition of MSSQL I'm looking at a max database size of 10 GB, so I got creative to make it last as long as possible.
My table-layout is pretty much identical to yours in that I've got 1 timestamp + 30 value columns + another 30 flag columns.
The value columns are numeric(9,2)
The value columns are marked SPARSE; if a value is identical (enough) to the one before it, I store NULL instead of repeating the value.
The flag columns are bit and indicate whether a value is 'extrapolated' or comes from an actual source (more on that later).
I've also got another table that holds the following information for each sensor:
last time the sensor was updated; that way if a new value comes in I can easily decide if this requires just a new insert at the end of the table or whether I need to go through all the logic of inserting/updating a value somewhere in-between existing numbers.
the value of that latest entry
the sensitivity for said sensor; this way I don't have to hardcode it and can set it on a per-sensor basis
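To make that concrete, a rough sketch of the two tables described above (hypothetical names, trimmed to three sensors; the real layout has 30 value columns and 30 flag columns):

CREATE TABLE dbo.Readings
(
    ReadingTime datetime2(0)  NOT NULL PRIMARY KEY CLUSTERED,
    Sensor01    numeric(9, 2) SPARSE NULL,
    Sensor02    numeric(9, 2) SPARSE NULL,
    Sensor03    numeric(9, 2) SPARSE NULL,
    Sensor01Ex  bit           NULL,  -- 1 = extrapolated, 0 = actual source
    Sensor02Ex  bit           NULL,
    Sensor03Ex  bit           NULL
);

CREATE TABLE dbo.SensorState
(
    SensorId    int           NOT NULL PRIMARY KEY,
    LastUpdated datetime2(0)  NOT NULL,  -- last time this sensor was updated
    LastValue   numeric(9, 2) NOT NULL,  -- value of that latest entry
    Sensitivity numeric(9, 2) NOT NULL   -- per-sensor threshold for 'identical enough'
);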
Anyway, for now my stream of information is that I've got several programs, each reading data from different sources (Arduino, web, ...) and dumping it into .csv files, and then a 'parser' program that reads these files into the database once in a while. As I'm loading the values one by one instead of row-based this isn't very efficient, but I'm currently doing about 3,500 values/second so I'm not overly concerned. I'll admit that this only holds when loading the values in historical order and because I'm using the helper table.
Currently I've got almost 1 year of information in there which corresponds to
2,209,883 rows
5,799,511 values spread over 18 sensors (yes, I've still got room for 12 more without needing to change the table)
This means only about 15% of the fields are filled in; or, looking at it the other way around, if I filled in every field rather than putting NULL in case of repetition, I'd have almost 8 times as many numbers in there.
As for space requirements: I decided to reload all the numbers last night 'for fun' but noticed that even though most .csv files come in historically, they would cover a range of columns from Jan to Dec, then another couple of columns from Jan to Dec, etc. This resulted in quite a bit of fragmentation: 70% in fact! At that point the table required 282 MB of disk space.
I then rebuilt the table, bringing the fragmentation down to 0%, and the reserved space went down to 118 MB (!).
For me this is more than good enough:
it's unlikely the table will outgrow the 10GB limit anytime soon, especially if I stick to (online) rebuilding it now and then.
loading data is sufficiently fast (although reloading the entire year took a couple of hours)
reporting is sufficiently fast (for now, haven't tried to connect any 'interactive' reporting tools to it yet; but for some simple graphs in excel it works just fine IMHO).
FYI: for reporting I've created a rather simple stored procedure that picks a from-to range for a given set of columns, dumps it into a temp table, and then fills in the blanks by figuring out the NULL ranges and filling them in with the value that preceded the range. This works quite well, although fetching the 'first' value sometimes takes a while as I can't predict how far back in time the last value has to be looked for (sometimes there is none).
To work around this I've added another process that extrapolates the values for every 'hour' timestamp. That way the report never needs to go back more than 1 hour. A flag-column in the readings table indicates whether the value on a record for a given field was extrapolated or not.
(note: this makes updating values in the past more problematic but not impossible)
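For what it's worth, a minimal sketch of that gap-filling idea for a single column, using the hypothetical dbo.Readings layout sketched earlier (a correlated lookup per row, fine for report-sized ranges):

DECLARE @from datetime2(0) = '2014-01-01', @to datetime2(0) = '2014-01-07';  -- example range; in the procedure these are parameters

SELECT r.ReadingTime,
       COALESCE(r.Sensor01,
                (SELECT TOP (1) p.Sensor01
                 FROM dbo.Readings AS p
                 WHERE p.ReadingTime < r.ReadingTime
                   AND p.Sensor01 IS NOT NULL
                 ORDER BY p.ReadingTime DESC)) AS Sensor01Filled
FROM dbo.Readings AS r
WHERE r.ReadingTime BETWEEN @from AND @to;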
Hope this helps you out a bit in your endeavors, good luck!
Let's say I have three tables
TRANSACTIONS
amount
date
RECORDS
amount
date
CUSTOM_RECORDS
amount
date
(Let's just say there are many other fields to justify splitting of these tables)
To calculate the running balance I have two methods:
-------------METHOD 1 -------------
Heavy on READ, Light on WRITE
Whenever we read, just join the tables, sort by date, and calculate the running balance.
PRO
Write is easy, just write into each table
CON
Reading is very heavy; the calculation needs to be done on each read.
It is very strange to be querying (let's say a span of 1 week) and to have the calculation done for ALL the records. If I query for 10 records, the calculation needs to be done over 1 million records just to know the balance of those 10.
-------------METHOD 2 -------------
Heavy on WRITE, Light on READ
I have another table
FINAL_TABLE
date
amount
running balance
Whenever I write, I refresh this table and recalculate all the running balances.
PRO
Read is easy, running balance already computed.
Querying over a time period is as easy as extracting the dates within that span from FINAL_TABLE.
CON
Write is really slow; each write to any of the three tables means refreshing the whole FINAL_TABLE!
Why don't I just reuse the latest running balance? That only works if entries are guaranteed to arrive in chronological order; however, sometimes an entry is added late.
Currently, I am using Method 2, and every time a client saves/updates a row in any of the three tables, the server freezes as it tries to refresh and recompute FINAL_TABLE. Obviously, this is not very scalable.
Method 1 is also not very scalable in terms of querying. I would have to calculate the running balance from the beginning of time in order to know the running balance for last week.
Neither method is very scalable. What is a good design to ensure scalability and relatively fast performance on READ and WRITE? What do banks use to keep track of running balances?
It depends.
Suppose you have a report, like a transaction report, where each account's running balance is shown. If you want to show real-time data then method 1 is always preferable, and I suggest using a Quirky Update for this rather than cursors, loops, sub-queries, or recursion.
On the other hand, if you don't need a real-time running total, then you could use method 2 with a little customization. I would not recommend refreshing the final table as part of each transaction; instead, update it on a schedule. Depending on your traffic or load, you might refresh the running total at some interval.
And for real time I would discourage method 2, as it will make your transactions costly.
To make your method 1 faster, here are some links:
Calculating Running Total
Quirky Update
Quirky Update Performance
Halloween Protection
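The links above cover the quirky update itself; as an alternative illustration of method 1, on SQL Server 2012 and later a windowed SUM over the unioned tables also works (column names taken from the question):

SELECT t.[date],
       t.amount,
       SUM(t.amount) OVER (ORDER BY t.[date]
                           ROWS UNBOUNDED PRECEDING) AS running_balance
FROM
(
    SELECT amount, [date] FROM TRANSACTIONS
    UNION ALL
    SELECT amount, [date] FROM RECORDS
    UNION ALL
    SELECT amount, [date] FROM CUSTOM_RECORDS
) AS t
ORDER BY t.[date];
-- add a unique tiebreaker column to the ORDER BY if several rows can share a date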
CREATE TABLE AccBalance
(
    AccountNO int            NOT NULL PRIMARY KEY,  -- column types are examples; adjust to your schema
    Balance   decimal(18, 2) NOT NULL
);

CREATE TABLE AccDateWiseCumBalance
(
    AccountNO         int            NOT NULL,
    SystemDate        date           NOT NULL,
    CumulativeBalance decimal(18, 2) NOT NULL,
    PRIMARY KEY (AccountNO, SystemDate)
);
The first table is updated by each transaction and keeps the real-time balance, but no history.
The second table keeps the account- and date-wise cumulative balance, which is updated at each day's end.
So if you need the cumulative balance up to a previous date, you retrieve it from the second table.
And if you need the cumulative balance up to the current date, you retrieve data from the second table up to the day before the current date and combine it with the current-date data from the first table.
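A sketch of that combined lookup for one account, using the tables above (@AccountNO is a hypothetical parameter):

DECLARE @AccountNO int = 1;  -- hypothetical account number

SELECT d.SystemDate, d.CumulativeBalance
FROM AccDateWiseCumBalance AS d
WHERE d.AccountNO = @AccountNO
  AND d.SystemDate < CAST(GETDATE() AS date)
UNION ALL
SELECT CAST(GETDATE() AS date), b.Balance   -- today's figure comes from the real-time table
FROM AccBalance AS b
WHERE b.AccountNO = @AccountNO
ORDER BY SystemDate;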
I am designing a database which will hold transaction level data. It will work the same way as a bank account - debits/credits to an Account Number.
What is the best / most efficient way of obtaining the aggregation of these transactions?
I was thinking about using a summary table and then adding that to a list of today's transactions in order to derive how much each account has (i.e. their balance).
I want this to be scalable (i.e. 1 billion transactions), so I don't want to have to hit the main fact table, since it would need to find all the debits/credits associated with a given account number, potentially scanning a billion rows.
Thanks, any help or resources would be awesome.
(I have been working in banks for almost 10 years. Here is how it is actually done.)
TLDR: your idea is good.
Every now and then you store the balance somewhere else (a "carry forward balance"), e.g. every month or so (or after a given number of transactions). To calculate the actual balance (or any balance in the past) you accumulate all relevant transactions going back in time until the most recent carry forward balance you kept, which you then add, of course.
The "current" balance is not kept anywhere. Just alone for the locking problems you would have if you'd update this balance all the time. (In real banks you'll hit some bank internal accounts with almost every single transactions. There are plenty of bank internal accounts to be able to get the figures required by law. These accounts are hit very often and thus would cause locking issues when you'd like to update them with every transaction. Instead every transactions is just insert — even the carry forward balances are just inserts).
Also in real banks you have many use cases which make this approach more favourable:
Being able to get back-dated balances at any time.
Being able to get balances based on different dates at any time (e.g. value date vs. transaction date).
Reversals/cancellations are fun of their own. Imagine reversing a transaction from two weeks ago and still keeping all of the above working.
You see, this is a long story. However, the answer to your question is: yes, you cannot keep accumulating an ever-increasing number of transactions; you need to keep intermediate balances to limit how many you have to accumulate. Hitting the main table for a limited number of rows should be no issue.
Make sure your main query uses an Index-Only Scan.
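For illustration, a minimal sketch of that lookup under assumed names (CarryForwardBalances, Transactions, and the columns shown are hypothetical):

DECLARE @AccountNo int = 42;  -- hypothetical account number

-- Current balance = latest carry-forward balance + all transactions after it.
-- Assumes at least one carry-forward row exists for the account.
SELECT b.Balance + COALESCE(SUM(t.Amount), 0) AS CurrentBalance
FROM
(
    SELECT TOP (1) Balance, AsOfDate
    FROM CarryForwardBalances
    WHERE AccountNo = @AccountNo
    ORDER BY AsOfDate DESC
) AS b
LEFT JOIN Transactions AS t
       ON t.AccountNo = @AccountNo
      AND t.TransactionDate > b.AsOfDate
GROUP BY b.Balance;

A covering index on Transactions (AccountNo, TransactionDate) INCLUDE (Amount) lets the join read only the rows since the last carry-forward from the index.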
Do an object-oriented design: create a table for each object, for example Account, Transaction, etc. Here's a good website for your reference, but there's a lot more on the web discussing OODBMSs. The reference I gave is just the basis I used when I started doing an OODBMS.
I have a table with 281,433 records in it, ranging from March 2010 to the current date (Sept 2014). It's a transaction table which consists of records that determine stock which is currently in and out of the warehouse.
When making picks from the warehouse, the system needs to look over every transaction from a particular customer that was ever made (based on the AccountListID field, which determines the customer, a customer might on average have about 300 records in the table). This happens 2-3 times per request from the particular .NET application when a picking run is done.
There are times when the database seemingly locks out. Some requests complete no bother, within about 3 seconds. Others hang for 'up to 4 minutes' according to the end users.
My guess is with 4-5 requests at the same time all looking at this one transaction table things are getting locked up.
I'm thinking about partitioning this table so that the primary transaction table only contains records from the last 2 years. The end user has agreed that any records older than this are unnecessary.
But I can't just delete them; they're used elsewhere in the system. I already have indexes in place and they make a massive difference (going from >30 seconds to <2 on the AccountListID field). It seems partitioning is the next step.
1) Am I going down the right route as a solution to my 'locking' problem?
2) When moving a set of records (e.g. records where the field DateTimeCheckedIn is more than 2 years old) is this a manual process or does partitioning automatically do this?
Partitioning shouldn't be necessary on a table with fewer than 300,000 rows, unless each record is really big. If a record occupies more than 4 KB, only one fits on an 8 KB page, so you have 300,000 pages (roughly 2,400,000,000 bytes), and that is getting large.
Indexes are usually the solution for something like this. Taking more than a second to return 300 records in an indexed database seems like a long time (unless the records are really big and the network overhead adds to the time). Your table and index should both fit into memory. Check your memory configuration.
The next question is about the application code. If it uses cursors, then these might be the culprit, locking rows under certain circumstances. For read-only cursors, "FAST_FORWARD" or "FORWARD_ONLY READ_ONLY" should be fast. If the application code locks all the historical records, you might get contention; after all, this would occur when two records (for different customers) are on the same data page. The solution is to not lock the historical records as you read them, or to avoid using cursors altogether.
I don't think partitioning will be necessary here. You can probably fix this with a well-placed index: I'm thinking a single index covering (in order) company, part number, and quantity. Or, if it's an old server, possibly just add RAM. Finally, since this reads a lot of older transaction data, where individual transactions themselves are likely never (or at most very rarely) updated once written, you might do better with the READ UNCOMMITTED isolation level for this query.
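As an illustration of both suggestions, a sketch under assumed names (the table name, PartNumber, Quantity, and @AccountListID are hypothetical; AccountListID and DateTimeCheckedIn come from the question):

CREATE NONCLUSTERED INDEX IX_Transactions_AccountListID
    ON dbo.Transactions (AccountListID, DateTimeCheckedIn)
    INCLUDE (PartNumber, Quantity);  -- include whatever columns the pick query actually reads

DECLARE @AccountListID int = 1;  -- hypothetical customer id

-- Read the customer's history without taking shared locks; only acceptable if
-- occasional dirty reads of old, rarely updated rows are tolerable.
SELECT PartNumber, Quantity, DateTimeCheckedIn
FROM dbo.Transactions WITH (READUNCOMMITTED)
WHERE AccountListID = @AccountListID;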
I'm working on a Data Warehouse which, in the end, will require me to create reports based on business hours. Currently, my time dimension is granular to the hour. I'm wondering whether I should modify my Time dimension to include a bit field for "business hour", or whether I should create some sort of calculated measure for it on the analysis end. Any examples would be super magnificent!
Use a bit (or even another column) to specify whether an hour is a business hour at the time it is stored. Otherwise when you change the business hours you will become unable to reproduce historical reports.
Is all of your sales data in the same time zone? For example, are you tracking sales for outlets in different time zones, or end users in different time zones? If so, you may want to create that bit field for "business hour" in the sales fact table, because it'll be pretty difficult to calculate that on the fly for users and outlets in different time zones.
Plus, you only want to calculate this once - when the sale is imported into the data warehouse - because this data is not likely to change very often. It's not like you're going to say, "This sale used to be in business hours, but it no longer is."
business hours are business rules, therefore they may change in the future
represent business hours as a base time and a duration, e.g. StartTime 0900, Duration 9.5 hours; that way you can easily change the interval, do what-if scenarios based on different business hours, and business hours can cross date lines without complicating queries
of course, all datetimes should be GMT (UTC), never local time, to avoid daylight savings time complexities
EDIT: I think I misunderstood the question; your data is already granular to the hour... No, I think my answer stands, but with the addition of Effective Start and End dates for the business-hour intervals. That would permit the granularity to change in the future while still preserving history.
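A sketch of such a reference table (hypothetical names; duration in hours as suggested above):

CREATE TABLE dbo.BusinessHours
(
    BusinessHoursId int           NOT NULL PRIMARY KEY,
    StartTime       time(0)       NOT NULL,  -- e.g. 09:00
    DurationHours   decimal(4, 2) NOT NULL,  -- e.g. 9.50
    EffectiveFrom   date          NOT NULL,  -- when this definition starts to apply
    EffectiveTo     date          NULL       -- NULL = still current
);

A report (or the load process) can then flag an hour as a business hour by joining the fact's UTC timestamp against the row whose effective range covers that date and checking whether the hour falls within StartTime plus DurationHours.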
I'm not sure if this helps, but I'd use UTC to store all the times, and then have start and end times to specify the business hours. Once that is set up, it would be a simple If (SpecificHour >= BusinessStartingHour) And (SpecificHour <= BusinessEndingHour) Then ... operation.
You can play with and test your different options if you use Microsoft PerformancePoint 2007. You can modify your dimensions and output the results to charts, pivot tables, other reporting tools, etc.
http://office.microsoft.com/en-us/performancepoint/FX101680481033.aspx
Could the "business hours" change over time? I guess I'm asking whether each row needs to tie to a business hour flag, or whether just having the reports themselves (or some reference) table decide whether that transaction took place during a business hour or not is enough.
All else equal, I'd probably have the report do it for you, instead of flagging rows, but if business hours are volatile over time, you'd have to flag the rows to make sure your historic data stays correct.
It's a judgement call I think... one that depends on performance testing, system usage, etc. Personally, I'd probably create an indexed field to hold a flag in the interest of dealing with the logic to determine what is and isn't a business hour up-front (i.e. when the data is loaded). If done correctly (and again, depending on the specific usage) I think you might be able to get a performance gain as well.