High throughput summary tables - sql

I have a very high-throughput production web application that I need to design effective SQL summary tables for. For each request going through the application I need to add 4 or 5 values to hourly or daily stats in the DB.
Three options I thought of:
Basic summary table with cols such as "day", "totalA", "totalB", "totalC"
I know from experience that the throughput of the application in production is high enough that any attempt to constantly update a single row will cause huge lock waits and stall all the threads.
Also this means I always have to do a query like "UPDATE WHERE today .. ROWS AFFECTED == 0 ? INSERT...", which seems like a bad pattern.
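For reference, that pattern would look roughly like this in T-SQL (table, column and variable names are just placeholders):
-- Hypothetical daily summary table: SummaryDaily([day], totalA, totalB, totalC)
UPDATE SummaryDaily
SET totalA = totalA + @a,
    totalB = totalB + @b,
    totalC = totalC + @c
WHERE [day] = @today;

IF @@ROWCOUNT = 0
    -- note: two concurrent callers can both see 0 rows affected and both insert,
    -- which is part of why this feels like a bad pattern
    INSERT INTO SummaryDaily ([day], totalA, totalB, totalC)
    VALUES (@today, @a, @b, @c);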
Table with 1 row per application request with cols such as "day", "amountA", "amountB", "amountC"
Due to the super high throughput this table would get at least a few million rows added per day. As I would need to keep this data forever, the table size would soon become a real problem.
Option 2 + job to summarise data in another table
Let's say I set a job to sum up data from the option 2 table, then insert into a separate summary table, then delete from option 2 table
The problem I think with this solution is that during the DELETE process the table would lock and cause INSERT delays in the application, which is unfortunately not acceptable even for a period of a few seconds.
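For concreteness, the kind of cleanup job I am imagining would look roughly like this (all names are placeholders; the batching is my attempt to keep each lock short):
DECLARE @batch  INT  = 5000;
DECLARE @cutoff DATE = CAST(GETDATE() AS DATE);   -- keep today's rows out of the cleanup

WHILE 1 = 1
BEGIN
    -- small batches keep each lock short so application INSERTs can interleave
    DELETE TOP (@batch)
    FROM RequestLog
    WHERE [day] < @cutoff;        -- only rows already rolled into the summary table

    IF @@ROWCOUNT < @batch BREAK;
END;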
I'm actually at a loss as to the best practice in this scenario, any input would be greatly appreciated.

Related

Deleting millions of rows without impacting transaction log

Our client database is growing at an increasing pace; more specifically, entities like Auditing & Logging are growing at a much greater speed.
For instance, as of now the Auditing table has ~30 million rows and it is growing at a rate of 1.5 million rows per week.
Similarly, the Logging table is growing at the rate of ~1 million rows per week. This table has ~50 million rows.
We have decided to archive these tables based on our data retention policy and delete some 'N' number of records from them whenever the archiving job runs.
I am looking for the best advice on defining a chunkSize that will not hammer the SQL Server transaction log or cause table locking. I know this value cannot be derived straight away; we need to run different test scenarios to come up with this magic number.
The best advice is to partition the data, presumably by date.
Then you can remove entire partitions without having to log every deleted row.
The subject of partitioning tables is rather broad. The documentation is a good place to start.
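For example, with monthly partitions on the audit date, removing a month of data stops being millions of logged deletes and becomes a quick metadata operation. A rough SQL Server sketch, where all object names and boundary dates are illustrative:
CREATE PARTITION FUNCTION pf_AuditByMonth (datetime2)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME ps_AuditByMonth
    AS PARTITION pf_AuditByMonth ALL TO ([PRIMARY]);

-- Create (or rebuild) the Auditing table on the scheme, e.g.
--   CREATE TABLE dbo.Auditing (...) ON ps_AuditByMonth (AuditDate);

-- Removing the oldest month is then close to instant:
TRUNCATE TABLE dbo.Auditing WITH (PARTITIONS (1));   -- SQL Server 2016+
-- or, on older versions, switch it out to an identically structured staging table:
-- ALTER TABLE dbo.Auditing SWITCH PARTITION 1 TO dbo.Auditing_Stage;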

Slow queries on 'transaction' table - sql partition as a solution?

I have a table with 281,433 records in it, ranging from March 2010 to the current date (Sept 2014). It's a transaction table consisting of records that track which stock is currently in and out of the warehouse.
When making picks from the warehouse, the system needs to look over every transaction from a particular customer that was ever made (based on the AccountListID field, which determines the customer, a customer might on average have about 300 records in the table). This happens 2-3 times per request from the particular .NET application when a picking run is done.
There are times when the database seemingly locks out. Some requests complete no bother, within about 3 seconds. Others hang for 'up to 4 minutes' according to the end users.
My guess is that with 4-5 requests at the same time, all looking at this one transaction table, things are getting locked up.
I'm thinking about partitioning this table so that the primary transaction table only contains records from the last 2 years. The end user has agreed that any records older than this are unnecessary.
But I can't just delete them, they're used elsewhere in the system. I have indexes already in place and they make a massive difference (going from >30 seconds to <2, on the accountlistid field). It seems partitioning is the next step.
1) Am I going down the right route as a solution to my 'locking' problem?
2) When moving a set of records (e.g. records where the field DateTimeCheckedIn is more than 2 years old) is this a manual process or does partitioning automatically do this?
Partitioning shouldn't be necessary on a table with fewer than 300,000 rows, unless each record is really big. If a record occupies more than 4k bytes, then you have 300,000 pages (2,400,000,000 bytes), and that is getting large.
Indexes are usually the solution for something like this. Taking more than a second to return 300 records in an indexed database seems like a long time (unless the records are really big and the network overhead adds to the time). Your table and index should both fit into memory. Check your memory configuration.
The next question is about the application code. If it uses cursors, then these might be the culprit by locking rows under certain circumstances. For read-only cursors, "FAST_FORWARD" or "FORWARD_ONLY READ_ONLY" should be fast. It is possible that if the application code is locking all the historical records, then you might get contention. After all, this could happen when two records (for different customers) are on the same data page. The solution is not to lock the historical records as you read them, or to avoid using cursors altogether.
I don't think partitioning will be necessary here. You can probably fix this with a well-placed index: I'm thinking a single index covering (in order) company, part number, and quantity. Or, if it's an old server, possibly just add RAM. Finally, since this reads a lot of older data for transactions, where individual transactions themselves are likely never (or at most very rarely) updated once written, you might do better with a READ UNCOMMITTED isolation level for this query.
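A rough sketch of both ideas (index and column names are assumptions based on the description above):
-- Covering index along the lines suggested above
CREATE INDEX IX_Transactions_Picking
    ON dbo.Transactions (Company, PartNumber, Quantity);

-- Run the read-only picking query under READ UNCOMMITTED so it does not
-- take shared locks or wait on concurrent writers
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT PartNumber, Quantity, DateTimeCheckedIn
FROM dbo.Transactions
WHERE AccountListID = @AccountListID;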

How to efficiently keep count by reading, incrementing it & updating a column in the database

I have a column in the database which keeps a count of incoming requests, but it is updated from different sources and systems.
And the incoming requests are in thousands per minute.
What is the best way to update this column with the new request count?
The 2 ways off the top of my head are:
Read the current value from the column, increment it by one, and then update it back (all part of a sproc).
The problem I see with this is that every source/system that updates the count needs to lock this column, which might increase the wait time for reading and updating the column and slow down the DB.
Put requests in a queue, and have a job read the queue and update the column, one at a time. This method looks safer, at least to me, but is it too much work just to get a count of requests coming in?
What approach would you typically take for this kind of high-volume read-and-update of a single column?
Thanks
1000s per minute is not "huge". Let's say it's 10k per minute. That leaves 6 ms of time per update. For an in-memory row with a simple integer increment and not too many indexes, expect <1 ms per update. It works out fine.
So just use
UPDATE T SET Count = Count + 1 WHERE ID = 1234
Put an index on the table and just do:
update t
set request_count = request_count + 1
where <whatever conditions are appropriate>;
Be sure that the conditions in the where clause all refer to indexed columns, so finding the row is as fast as possible.
Without strenuous effort, I would expect the update to be fast enough; you should test whether this is true. You could also insert a row into a requests table and do the counting when you query that table; inserts are faster than updates, because the engine doesn't have to find the row first.
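A minimal sketch of that insert-and-count alternative (table and column names are made up):
-- Record each request as its own row; concurrent INSERTs don't contend on a single row
INSERT INTO Requests (SourceSystem, ReceivedAt)
VALUES (@source, GETDATE());

-- Derive the count only when it is actually needed
SELECT COUNT(*)
FROM Requests
WHERE ReceivedAt >= @periodStart
  AND ReceivedAt <  @periodEnd;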
If this doesn't meet performance goals, then some sort of distributed mechanism may prove successful. I don't see that batching the requests using sequences would be a simple solution. Although the queue is likely to be distributed, you then have the problem that the request counts are out-of-sync with the actual updates.

Indexing of table is messed up which causes the queries to take extremely long to fetch the rows

I have a table A [schema: key_id, x, y, z, ...] with a key_id column which is unique and auto-increments by 1. I also have a table B [schema: key_id, key_idOfA, x, y, z, ...] which is a backup of A with a similar schema (the only difference is that table B has its own key_id and also keeps the original key_id of table A).
I have a service which transfers some rows from A to B based on a WHERE clause. I ran this service once and it worked fine, transferring rows from A to B. To check the service again, I then had to transfer the rows (key_idOfA, x, y, z, ...) from B back to A.
To avoid losing the original key_id values of table A, I first used
SET IDENTITY_INSERT A ON
and transferred the rows, which worked fine. After the transfer I used
SET IDENTITY_INSERT A OFF
Now when I run the service again it takes a long time to get a few rows from table A, which causes a timeout. Precisely speaking, it takes 5 minutes to get 30,000 rows in SQL Server Management Studio. From the service, the query times out because of a 3-minute timeout.
I am aware that switching a table's IDENTITY_INSERT ON and OFF is bad practice, but this was a test-bed DB and I would never do it on a production DB.
My questions:
Is the indexing messed up, and is that why the query is taking so much time? Or is there some other issue?
Could I have taken a different approach to transfer the rows back without messing up the indexing?
Well, to look into WHY it is slow, look at the actual query plan.
In SSMS this is done through the Query menu at the top: select Include Actual Execution Plan (or Ctrl+M).
When you then run the query you get an extra tab beside your Results and Messages tabs, called Execution Plan.
When you look into that, start at the far right; that is where the query starts. Look for high percentages. If you see index scans or table scans, those are worse than an index seek, which is a faster access method.
If you see high percentages on table/index scans, try to defragment your indexes, for example along the lines of the sketch below.
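A rough sketch of checking fragmentation and then reorganising or rebuilding (the thresholds are common rules of thumb, not hard rules; dbo.A stands in for your table):
-- Check fragmentation per index on the table
SELECT i.name, s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.A'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id;

-- Then reorganise (lighter) or rebuild (heavier) as appropriate
ALTER INDEX ALL ON dbo.A REORGANIZE;
-- ALTER INDEX ALL ON dbo.A REBUILD;   -- for heavier fragmentation (roughly > 30%)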
You might want to take a look at a script I wrote to defrag indexes: www.plixa.nl/fragment-an-index
Hope it helps, if not, let us know :)

Appropriate query and indexes for a logging table in SQL

Assume a table named 'log' that contains a huge number of records.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the sheer number of records makes it take a long time to retrieve data.
How do we fix this?
Look at your execution plan / "EXPLAIN PLAN" result. If you are retrieving large amounts of data then there is very little you can do to improve performance; you could try changing your SELECT statement to only include the columns you are interested in, however it won't change the number of logical reads you are doing, so I suspect it will have only a negligible effect on performance.
If you are only retrieving small numbers of records then an index on LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL Server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). It's not really geared up for returning truly large data sets. If the amount of data you are returning is genuinely large then there is only a certain amount you will be able to do, and so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, so you might also want to look into efficient methods of paging SQL data; if you are only returning, say, 500 or so records at a time it should still be very fast.
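For example, a simple paging sketch (standard OFFSET/FETCH syntax; @from, @to and @pageNumber are parameters, and any column other than logLevel and creationData is an assumption):
-- Page through the results instead of pulling everything at once
DECLARE @pageSize INT = 500;
DECLARE @offset   INT = @pageSize * @pageNumber;

SELECT logLevel, creationData, message
FROM log
WHERE logLevel = 2
  AND creationData BETWEEN @from AND @to
ORDER BY creationData
OFFSET @offset ROWS FETCH NEXT @pageSize ROWS ONLY;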
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)
1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it), set up a job to archive it to some backup and then remove the unused records, as sketched below. That will keep the table size down, reducing the amount of time it takes to search the table.
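Such an archiving job can be as simple as copy-then-delete (a sketch in SQL Server syntax; the archive table name and one-week retention window are assumptions):
-- Copy anything older than a week to an archive table, then delete it
INSERT INTO log_archive
SELECT * FROM log
WHERE creationData < DATEADD(DAY, -7, GETDATE());

DELETE FROM log
WHERE creationData < DATEADD(DAY, -7, GETDATE());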
Depending on what kind of SQL database you're using, you might look into horizontal partitioning. Oftentimes this can be done entirely on the database side of things, so you won't need to change your code.
Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives at your application (populate a data set / read it sequentially / ?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?
A couple of things
Do you need all the columns? People usually do SELECT * because they are too lazy to list the 5 columns they need out of the 15 the table has.
Get more RAM; the more RAM you have, the more data can live in cache, which is 1000 times faster than reading from disk.
For me there are two things you can do:
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In pre-aggregation you would have a "logs" table, a "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structures of the logs and logs_temp tables are identical. The flow of the application would be as follows: all logs are logged in the logs table, and then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records is far smaller once the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day; it all depends on your needs.
The inserts will be fast too, because there is not much data in the logs table, as it holds only the last hour's data, so index maintenance on inserts takes far less time than it would with a very large data set, which keeps the inserts fast.
You can read more about Shadow Table trick here
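A rough sketch of that hourly job in MySQL-flavoured SQL (all object names are illustrative, and logs_shadow is assumed to be an empty copy of logs created up front):
-- a. swap in the empty shadow table so new inserts are never blocked
RENAME TABLE logs TO logs_temp, logs_shadow TO logs;

-- b + c. aggregate the captured hour into the summary table
INSERT INTO logs_summary (hour_start, logLevel, entry_count)
SELECT DATE_FORMAT(creationData, '%Y-%m-%d %H:00:00'), logLevel, COUNT(*)
FROM logs_temp
GROUP BY DATE_FORMAT(creationData, '%Y-%m-%d %H:00:00'), logLevel;

-- d. move the raw rows to the archive, then the emptied table becomes
--    the shadow table for the next run
INSERT INTO logs_archive SELECT * FROM logs_temp;
TRUNCATE TABLE logs_temp;
RENAME TABLE logs_temp TO logs_shadow;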
I employed the pre-aggregation method on a news website built on WordPress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items; there are around 100K hits per day, and this pre-aggregation approach has really helped us a lot. The query time came down from more than 2 seconds to under a second. I intend to make the plugin publicly available soon.
As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index with both values; the order you put them in will affect performance, but assuming you have a small number of possible logLevel values (and the data is not skewed) you'll get better performance putting creationData first.
Note that optimally an index will reduce the cost of a query to log(N) i.e. it will still get slower as the number of records increases.
I really hope that by creationData you mean creationDate.
First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just the date seems a more likely scenario than filtering on just the log level.)
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least a few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
  l_ownname          VARCHAR2(255) := 'owner';  -- owner (schema) of the table to analyze
  l_tabname          VARCHAR2(255) := 'log';    -- table to analyze
  l_estimate_percent NUMBER(3)     := 5;        -- percentage of rows to estimate (NULL means compute)
BEGIN
  dbms_stats.gather_table_stats(
    ownname          => l_ownname,
    tabname          => l_tabname,
    estimate_percent => l_estimate_percent,
    method_opt       => 'FOR ALL INDEXED COLUMNS',
    cascade          => TRUE
  );
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you should consider partitioning it by range on the creationDate column. See these links for the details, and the sketch after them:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle
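As a minimal sketch of what that might look like (assuming Oracle 11g+ interval partitioning; column types, the message column and the boundary date are illustrative):
CREATE TABLE log (
  logLevel     NUMBER,
  creationDate DATE,
  message      VARCHAR2(4000)
)
PARTITION BY RANGE (creationDate)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
  PARTITION p_start VALUES LESS THAN (DATE '2014-01-01')
);

-- Old data can then be removed by dropping whole partitions rather than deleting rows:
-- ALTER TABLE log DROP PARTITION FOR (DATE '2014-02-15');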