How to efficiently filter a large number of records based on user permissions on specific records with specific criteria? - sql

I'm working as a maintainer on a legacy Java-based cargo railway consignment note accounting system. There is a serious performance issue with retrieving the list of consignment notes to display on the website.
I cannot publish the entire query, but here are some statistics to give the general idea:
- it has 17 left joins
- it has a huge where clause with 5 OR groups that determine whether a user is allowed to access a record because of a specific relation to it (consignor, consignee, carrier, payer, supervisor) and that check the user's permission to access records related to a specific railway station
- each OR group has, on average, two exists() checks with subqueries on data related to the record, plus the station permission check
- when expanded to be human-readable, the query is about 200 lines long
Essentially, the availability of each record to the currently logged-in user depends on the following factors:
- the company of the user
- the company of the carrier, consignee, consignor, payer of each specific consignment note
- every consignment note has multiple route sections and every section has its own carrier and payer, thus requiring further access control conditions to make these records visible to the user
- every consignment note and every route section has origin and destination stations, and a user is allowed to see the record only if he has been given access to any of these stations (using a simple relation table).
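To give a feel for the shape of one of those OR groups, here is a simplified, hypothetical sketch (all names are invented; the real query has five such groups, each larger than this):
SELECT cn.*
  FROM consignment_note cn
 WHERE -- group 1 of 5: the user's company is the consignor of the note
       (cn.consignor_company_id = :user_company_id
        -- ...and the user has been granted access to one of the note's stations
        AND EXISTS (SELECT 1
                      FROM user_station_access usa
                     WHERE usa.user_id    = :user_id
                       AND usa.station_id IN (cn.origin_station_id, cn.destination_station_id)))
       -- group 2 of 5: the user's company is the carrier of one of the route sections
    OR (EXISTS (SELECT 1
                  FROM route_section rs
                 WHERE rs.consignment_note_id = cn.id
                   AND rs.carrier_company_id  = :user_company_id)
        AND EXISTS (SELECT 1
                      FROM user_station_access usa
                     WHERE usa.user_id    = :user_id
                       AND usa.station_id IN (cn.origin_station_id, cn.destination_station_id)));
-- ...plus three more groups of the same shape for consignee, payer and supervisor.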
There are about 2 million consignment note records in the database and the customer is complaining that it takes too long to load a page with 20 records.
Unfortunately it is not possible to optimize the final query before passing it to the RDBMS (Oracle 11g, to be specific), because the system has a complex architecture and a home-brew ORM tool, and the final query is assembled in at least three different places responsible for collecting the fields to select, collecting the joins, adding the criteria selected in the UI and, finally, the reason for this question: the permission-related filter.
I wouldn't say that the final query is very complex; on the contrary, it is simple in nature, it's just huge.
I'm afraid that caching solutions wouldn't be very effective in this case because the data changes very often and the cache would be invalidated every minute or so. Also, because of the individual permissions, each user would need their own cache, which would have to be maintained separately.
Besides the usual recommendations - dealing with indexes and optimizing each subquery as much as possible - are there any other well-known solutions for filtering large numbers of records based on complex permission rules?

Just my two cents, since I see no other answers around.
First of all you would need to get the execution plan of the query. Without it, it's not easy to get an idea of what could be improved. It sounds like a nice challenge, if it weren't for your urgency.
Well, you say the query has 17 left joins. Does that mean there is a single main table in the query? If so, then that's the first section I would optimize. The key aspect is to reduce the TABLE ACCESS BY ROWID operations as much as possible on that table. The typical solution is to add well tailored indexes to narrow down the INDEX RANGE SCAN as much as possible on that table, therefore reducing the heap fetches.
Then, when navigating the rest of the [outer] tables (presumably using NESTED LOOPS) you can try materializing some of those conditions into simple 0/1 flags you could use, instead of the whole conditions.
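For instance, one hypothetical way to do that (invented names) is to compute each permission condition once as a 0/1 flag in an inline view and filter on the flags, so each EXISTS is evaluated at most once per candidate row:
SELECT *
  FROM (SELECT cn.*,
               CASE WHEN EXISTS (SELECT 1
                                   FROM user_station_access usa
                                  WHERE usa.user_id    = :user_id
                                    AND usa.station_id = cn.origin_station_id)
                    THEN 1 ELSE 0 END AS origin_station_flag,
               CASE WHEN cn.payer_company_id = :user_company_id
                    THEN 1 ELSE 0 END AS payer_flag
          FROM consignment_note cn) t
 WHERE t.origin_station_flag = 1
    OR t.payer_flag = 1;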
Also, if you only need 20 rows, I would expect that to be very fast... well, as long as the query is properly pipelined. If in your case it's taking too long, then it probably isn't. Are you sorting/aggregating/windowing by some specific condition that prevents pipelining? That condition could be the most important one to index if you just need 20 rows.
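As an illustration, a pipelined top-N page in Oracle 11g usually takes this shape (the names and the simplified filter are hypothetical); if an index matches the filter and the ORDER BY, Oracle can stop after the requested rows:
SELECT *
  FROM (SELECT x.*, ROWNUM rn
          FROM (SELECT cn.*
                  FROM consignment_note cn
                 WHERE cn.consignor_company_id = :user_company_id  -- permission filter (simplified)
                 ORDER BY cn.created_date DESC) x
         WHERE ROWNUM <= :page_end)   -- e.g. 20 for the first page
 WHERE rn > :page_start;              -- e.g. 0 for the first page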
Finally, you could try avoiding heap fetches by using "covering indexes". That could really improve the performance of your query, but I would leave it as a last resort, since they have their downsides.
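For example (hypothetical columns), if the permission subqueries only need the user/station pair and the main filter only needs the note's company and station ids, indexes like these could let Oracle answer those checks from index blocks alone:
CREATE INDEX idx_usa_user_station
    ON user_station_access (user_id, station_id);

CREATE INDEX idx_cn_permission_cols
    ON consignment_note (consignor_company_id, consignee_company_id, payer_company_id,
                         origin_station_id, destination_station_id);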
Well, again, a good solution really requires taking a good look at the execution plan. If you're still game, post it and I can take a look at it.

Related

What is the best approach to fetch a flag if objects are connected?

Suppose we have two entities/tables - Users and Games (could be anything instead). And a user can mark multiple games as favourites. So we also have a user_favourite_game (user_id, game_id) table.
Then suppose a user is fetching a list of all available games, and some of them should have the "favourite" flag = true (pagination is used, so we'll assume 20 games are fetched each time). So I see two approaches here:
We can make one request populating the "favourite" field, e.g.
SELECT
    g.*,
    ufg.game_id IS NOT NULL AS favourite
FROM
    games g
    LEFT JOIN user_favourite_game ufg
        ON ufg.user_id = :userId AND g.id = ufg.game_id
ORDER BY
    g.id;
We can select the games and then perform 20 requests to check whether each game is one of the user's favourites.
Which approach is better to use and why? Any other ideas?
On the last project we used the second approach because of the complexity of the computations required for each entity. It was a lot more complicated than the example above and close to impossible to calculate inside a single query.
But in general, it seems to me that in such simple cases a single query with a JOIN should run faster than 20 simple queries. Although I'm not sure how it will behave when there is a lot of data in the user_favourite_game table.
Use the database for what it's designed to do and have it give you the results as part of your original query.
The time your DB will spend performing the outer join on the user favorite game table will likely be less than the network overhead of 20 separate requests for the favorite flag.
Make sure the tables are indexed appropriately as they grow and have accurate statistics.
This isn't a hard and fast rule, and actual performance testing should be your guide, but I have observed plenty of applications that were harmed by network chattiness. If your round-trip cost for each request is 250ms, your 20 calls will be very expensive. If your round-trip cost is 1ms, people might never notice.
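For example, a composite index like this (using the column names from the question) lets the outer join be answered from the index alone as the table grows:
CREATE INDEX idx_ufg_user_game
    ON user_favourite_game (user_id, game_id);
-- if (user_id, game_id) is already the primary key, that key typically covers this lookup already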
Firing 20 queries (irrespective of how simple they are) will always slow your application down. The factors include network cost, query execution overhead, etc.
You should fire one query to get the page of available games and then a second query to get the list of "favourite" games of that user, passing the ids of the games present on that page. Then set/unset the flag by looping over the result. This way you make only 2 DB calls, and it will improve performance significantly.
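A rough sketch of that two-query approach (LIMIT/OFFSET pagination syntax assumed; the id list in the second query is built by the application from the first result):
-- query 1: the current page of games
SELECT g.*
  FROM games g
 ORDER BY g.id
 LIMIT 20 OFFSET 0;

-- query 2: which of those games the user has marked as favourite
SELECT ufg.game_id
  FROM user_favourite_game ufg
 WHERE ufg.user_id = :userId
   AND ufg.game_id IN (:id1, :id2, :id3);  -- ...the 20 ids returned by query 1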

Calculating counts over several columns in DB

We have a product backed by a DB (currently Oracle, planning to support MS SQL Server as well) with several dozen tables. For simplicity let's take one table called TASK.
We have a use case where we need to present the user with the number of tasks matching specific criteria. For example, suppose that among the many columns the TASK table has, there are 3 columns suitable for this use case:
PRIORITY - possible values: LOW, MEDIUM, HIGH
OWNER - possible values are users registered in the system (there can be tens of them)
STATUS - possible values: IDLE, IN_PROCESS, DONE
So we want to display to the user exactly how many tasks are LOW, MEDIUM, or HIGH, how many of them are owned by some specific user, and how many are in each status. Of course the basic implementation would be to keep these counts up to date on every modification of the TASK table. However, what complicates the matter is the fact that the user can additionally filter the result by criteria that may include (or not) some of the columns mentioned above.
For example, the user might want to see those counts only for tasks that are owned by him and were created last month. The number of possible filter combinations is endless here, so needless to say, maintaining up-to-date counts for every combination is impossible.
So the question is: how can this problem be solved without a serious impact on DB performance? Can it be solved solely in the DB, or should we resort to other data stores, like a sparse data store? It feels like a problem that is present all over, in many companies. For example, in the Amazon store you can see the counts on categories while using arbitrary text search criteria, which means that they also calculate them on the spot instead of maintaining them up to date all the time.
One last thing: we can accept a certain functional limitation, saying that the count should be exact up to 100, but beyond that it can just say "over 100 tasks". Maybe this mitigation can allow us to emit more efficient SQL queries.
Thank you!
As I understand it, you would like to have info about 3 different distributions: across PRIORITY, OWNER and STATUS. I suppose the best way to solve this problem is to maintain 3 different data sources (an SQL query, aggregated info in the DB or in Redis, etc.).
The simplest way to calculate this data, as I see it, is to build a separate SQL query for each distribution. For example, for priority it would be something like:
SELECT PRIORITY, COUNT(*)
FROM TASK
[WHERE <additional search criteria>]
GROUP BY PRIORITY
Of course, this is not the most efficient way in terms of database performance, but it keeps the counts up to date.
If you would like to store aggregated values, which may significantly decrease the database load (it depends on the row count), then you probably need to build a cube whose dimensions are the available search criteria. With this approach you can also implement the limitation functionality.
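On the "exact up to 100" relaxation mentioned in the question, a capped count can be expressed by counting over a row-limited subquery, so the database stops scanning after 101 matching rows (sketch using Oracle-style ROWNUM since Oracle is mentioned; the filter is a placeholder, and SQL Server's TOP works similarly):
SELECT COUNT(*) AS capped_count   -- a result of 101 means "over 100 tasks"
  FROM (SELECT 1
          FROM TASK
         WHERE STATUS = 'IDLE'    -- the user's current filter goes here
           AND ROWNUM <= 101);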

Is it OK to loop a SQL query in a programming language?

I have a question about retrieving data from a database.
There are two tables, and the master table's id is always inserted into the other table.
I know that the data can be retrieved from the two tables with a join, but I want to know:
if I first retrieve all my desired data from the master table and then, in a loop (in the programming language), query the other table and retrieve its data, which approach is more efficient and why?
As far as efficiency goes the rule is you want to minimize the number of round trips to the database, because each trip adds a lot of time. (This may not be as big a deal if the database is on the same box as the application calling it. In the world I live in the database is never on the same box as the application.) Having your application loop means you make a trip to the database for every row in the master table, so the time your operation takes grows linearly with the number of master table rows.
Be aware that in dev or test environments you may be able to get away with inefficient queries if there isn't very much test data. In production you may see a lot more data than you tested with.
It is more efficient to work in the database, in fewer, larger queries, but unless the site or program is going to be very busy, I doubt it'll make much difference whether the loop is inside the database or outside it. If it is a web application, then running large loops outside the database and waiting on the results will take a significantly longer time.
What you're describing is sometimes called the N+1 problem. The 1 is your first query against the master table, the N is the number of queries against your detail table.
This is almost always a big mistake for performance.*
The problem is typically associated with using an ORM. The ORM queries your database entities as though they were plain objects; the mistake is to assume that instantiating a data object is no more costly than creating an ordinary object. But of course you can write code that does the same thing yourself, without using an ORM.
The hidden cost is that you now have code that automatically runs N queries, and N is determined by the number of matching rows in your master table. What happens when 10,000 rows match your master query? You won't get any warning before your database is expected to execute those queries at runtime.
And it may be unnecessary. What if the master query matches 10,000 rows, but you really only wanted the 27 rows for which there are detail rows (in other words, an INNER JOIN)?
Some people are concerned with the number of queries because of network overhead. I'm not as concerned about that. You should not have a slow network between your app and your database. If you do, then you have a bigger problem than the N+1 problem.
I'm more concerned about the overhead of running thousands of queries per second when you don't have to. The overhead is in memory and all the code needed to parse and create an SQL statement in the server process.
Just Google for "sql n+1 problem" and you'll find lots of people discussing how bad this is, how to detect it in your code, and how to solve it (spoiler: do a JOIN).
* Of course every rule has exceptions, so to answer this for your application, you'll have to do load-testing with some representative sample of data and traffic.
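To make the contrast concrete, here is a sketch with hypothetical master/detail tables; the first pattern issues 1 + N statements, the second issues one:
-- N+1: one query for the master rows...
SELECT id FROM master WHERE status = 'OPEN';
-- ...then the application runs this once per returned id:
SELECT * FROM detail WHERE master_id = :id;

-- The fix: one round trip, and only the masters that actually have detail rows
SELECT m.id, d.*
  FROM master m
  JOIN detail d ON d.master_id = m.id
 WHERE m.status = 'OPEN';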

Table design and Querying

I have a table design that is represented by this awesome hand drawn image.
Basically, I have an account event, which can be either a Transaction (Payment to or from a third party) or a Transfer (transfer between accounts held by the user).
All common data is held in the event table (Date, CreatedBy, Source Account Id...). If it's a transaction, the transaction-specific data is held in the account_transaction table (Third Party, transaction type (Debit, Credit)...). If the event is a transfer, the transfer-specific data is in the account_transfer table (Amount, destination account id...).
Note, something I forgot to draw, is that the Event table has an event_type_id. If event_type_id = 1, then it's a transaction. If it's a 2, then it's a Transfer.
Both the transfer and transaction tables are linked to the event table via an event id foreign key.
Note though that a transaction doesn't have an amount, as the transaction can be split into multiple payment lines, so it has a child account_transaction_line table. To get the amount of a transaction, you sum its child lines.
Foreign keys are all set up, with indexes on the primary keys...
My question is about design and querying. If I want to list all events for a specific account, I can either:
Select
from Event,
where event_type = 1 (transaction),
then INNER join to the Transaction table,
and INNER join to the transaction line (to sum the total)...
and then UNION to another selection,
selecting
from Event,
where event_type = 2 (transfer),
INNER join to transfer table...
and producing a list of all events.
or
Select
from Event,
then LEFT join to transaction,
then LEFT join to transaction line,
then LEFT join to transfer ...
and sum up totals (because of the transaction lines).
Which is more efficient? I think option 1 is best, as it avoids the LEFT joins (Scans?)
OR...
An Indexed View of option 1?
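For reference, a rough sketch of option 1 as a single statement (table and column names simplified from the description above):
SELECT e.id, e.event_date, SUM(tl.amount) AS amount
  FROM account_event e
  JOIN account_transaction t       ON t.event_id = e.id
  JOIN account_transaction_line tl ON tl.transaction_id = t.id
 WHERE e.event_type_id = 1                -- transactions
   AND e.source_account_id = @accountId
 GROUP BY e.id, e.event_date
UNION ALL
SELECT e.id, e.event_date, tr.amount
  FROM account_event e
  JOIN account_transfer tr ON tr.event_id = e.id
 WHERE e.event_type_id = 2                -- transfers
   AND e.source_account_id = @accountId
 ORDER BY event_date;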
On performance
For performance analysis in SQL server, there are quite a few factors at play, e.g.
What is the number of queries you are going to run, especially on the same data? For example, if 80% of your queries hit around 20% of your data, then caching may help significantly. (See the design section below on how this can matter.)
Are your databases distributed or collocated on the same server? I assume it's a single server system, but if they were distributed, the design and optimization might vary.
Are these queries executed in a background process or on-demand and a user is expecting to get the results quicker?
Without these answers (and perhaps some other follow-up questions once they are provided), it would be unwise to give an answer stating that one option is preferable over the other.
Having said that, based on my personal experience, your best bet specifically for SQL Server is to use the query analyzer, which is actually pretty reasonable, as your first stop. After that, you can do some performance analysis to find the optimal solution. Typically, this is done by modeling the query traffic as it would be when the system is under regular load. (FYI: the modeling link is to ASP.NET performance modeling, but various core concepts apply to SQL as well.) You typically put the system under load and then:
Look at how many connections are lost -- this can increase if the queries are expensive.
Performance counters on the server(s) to see how the system is dealing with the load.
Responses from the queries to see if some start failing to provide a valid response, although this is unlikely to happen
FYI: This is based on my personal experience, after having done various types of performance analyses for multiple projects. We expect to do it again for our current project, although this time around we're using AD and Azure tables instead of SQL, so the methodology is not specific to SQL Server, although the tools, traffic profiles, and what to measure vary.
On design
Introducing event id in the account transaction line:
Although you do not explicitly state so, it seems that the event ID and transaction ID are not going to change after the first entry has been made. If that's the case, and you are only interested in getting the totals for a transaction in this query, then another option (which will optimize your queries) would be to add to the transaction line table a foreign key to AccountEvent's primary key (which I think is the event id). In the strictest DB sense you are de-normalizing the table a bit, but in practice it often helps with performance.
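A hypothetical sketch of that change (SQL Server syntax, names guessed from the description):
-- carry the event id down to the line table so totals can skip the account_transaction join
ALTER TABLE account_transaction_line ADD event_id INT NULL;

ALTER TABLE account_transaction_line
  ADD CONSTRAINT fk_line_event FOREIGN KEY (event_id) REFERENCES account_event (id);

-- the total per event now needs a single join
SELECT e.id, SUM(tl.amount) AS total
  FROM account_event e
  JOIN account_transaction_line tl ON tl.event_id = e.id
 GROUP BY e.id;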
Computing totals on inserts:
The other approach that I have taken in a past project (just because I was using FoxPro in the previous century, and FoxPro tended to be extremely slow at joins) was to keep the total amounts in the primary table, the equivalent of your transactions table. This would be quite useful if your reads heavily outweigh your writes, and in the case of SQL you can issue a transaction to make the entries in the other tables and update the totals simultaneously (hence my question about your query profiles).
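A minimal sketch of that idea, assuming a total_amount column has been added to the transaction table and that @transactionId/@amount are supplied by the application (invented names):
BEGIN TRANSACTION;

INSERT INTO account_transaction_line (transaction_id, amount)
VALUES (@transactionId, @amount);

UPDATE account_transaction
   SET total_amount = total_amount + @amount
 WHERE id = @transactionId;

COMMIT TRANSACTION;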
Join transaction & transfers tables:
Keep a value to indicate which is which, and keep the totals there -- similar to previous one but at a different level. This will decrease the joins on query, but still have sum of totals on inserts -- I would prefer the previous over this one.
De-normalize completely:
This is yet another approach that folks have used (especially in the NoSQL space), but it gives me shivers when applied in SQL Server, so I have a personal bias against it; you could very well search for it and read up on it.

Aggregates on large databases: best platform?

I have a postgres database with several million rows, which drives a web app. The data is static: users don't write to it.
I would like to be able to offer users query-able aggregates (e.g. the sum of all rows with a certain foreign key value), but the size of the database now means it takes 10-15 minutes to calculate such aggregates.
Should I:
1. start pre-calculating aggregates in the database (since the data is static), or
2. move away from postgres and use something else?
The only problem with 1. is that I don't necessarily know which aggregates users will want, and it will obviously increase the size of the database even further.
If there was a better solution than postgres for such problems, then I'd be very grateful for any suggestions.
You are trying to solve an OLAP (On-Line Analytical Processing) problem with an OLTP (On-Line Transaction Processing) database structure.
You should build another set of tables that store just the aggregates and update these tables in the middle of the night. That way your customers can query the aggregate set of tables and it won't interfere with the on-line transaction processing system at all.
The only caveat is that the aggregate data will always be one day behind.
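A minimal sketch of such a nightly refresh, assuming a hypothetical fact table big_table(fk_id, value); the summary table is rebuilt by a scheduled job:
-- one-time setup
CREATE TABLE big_table_sums (
    fk_id  integer PRIMARY KEY,
    total  numeric NOT NULL,
    n      bigint  NOT NULL
);

-- nightly job
BEGIN;
TRUNCATE big_table_sums;
INSERT INTO big_table_sums (fk_id, total, n)
SELECT fk_id, SUM(value), COUNT(*)
  FROM big_table
 GROUP BY fk_id;
COMMIT;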
Yes
Possibly. Presumably there are a whole heap of things you would need to consider before changing your RDBMS. If you moved to SQL Server, you would use Indexed views to accomplish this: Improving Performance with SQL Server 2008 Indexed Views
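For flavour, a SQL Server indexed view over a hypothetical sales(customer_id, amount) table looks roughly like this (indexed views require SCHEMABINDING, two-part names and COUNT_BIG):
CREATE VIEW dbo.sales_by_customer
WITH SCHEMABINDING
AS
SELECT customer_id,
       SUM(amount)  AS total_amount,   -- amount must be declared NOT NULL
       COUNT_BIG(*) AS row_count
  FROM dbo.sales
 GROUP BY customer_id;
GO

-- materializes the view; the optimizer can then answer matching aggregate queries from it
CREATE UNIQUE CLUSTERED INDEX ix_sales_by_customer
    ON dbo.sales_by_customer (customer_id);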
If you store the aggregates in an intermediate object (something like MyAggregatedResult), you could consider a caching proxy:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ResultsProxy {
    private final Map<String, MyAggregatedResult> cache = new ConcurrentHashMap<>();

    MyAggregatedResult calculateResult(String param1, String param2) {
        // retrieve from cache; if not found, calculate and store in cache
        return cache.computeIfAbsent(param1 + "|" + param2,
                key -> calculateFromDatabase(param1, param2)); // the expensive aggregate query
    }
}
There are quite a few caching frameworks for Java, and most likely for other languages/environments such as .NET as well. These solutions can take care of invalidation (how long a result should be stored in memory) and memory management (removing old cache items when a memory limit is reached, etc.).
If you have a set of commonly-queried aggregates, it might be best to create an aggregate table that is maintained by triggers (or an observer pattern tied to your OR/M).
Example: say you're writing an accounting system. You keep all the debits and credits in a General Ledger table (GL). Such a table can quickly accumulate tens of millions of rows in a busy organization. To find the balance of a particular account on the balance sheet as of a given day, you would normally have to calculate the sum of all debits and credits to that account up to that date, a calculation that could take several seconds even with a properly indexed table. Calculating all figures of a balance sheet could take minutes.
Instead, you could define an account_balance table. For each account and dates or date ranges of interest (usually each month's end), you maintain a balance figure by using a trigger on the GL table to update balances by adding each delta individually to all applicable balances. This spreads the cost of aggregating these figures over each individual persistence to the database, which will likely reduce it to a negligible performance hit when saving, and will decrease the cost of getting the data from a massive linear operation to a near-constant one.
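A minimal PostgreSQL sketch of that pattern, with hypothetical gl(account_id, entry_date, amount) and account_balance tables (a real ledger would also handle updates/deletes and period boundaries more carefully):
CREATE TABLE account_balance (
    account_id  integer NOT NULL,
    period_end  date    NOT NULL,
    balance     numeric NOT NULL DEFAULT 0,
    PRIMARY KEY (account_id, period_end)
);

CREATE OR REPLACE FUNCTION apply_gl_delta() RETURNS trigger AS $$
BEGIN
    -- add the new entry's amount to every period-end balance it affects
    UPDATE account_balance
       SET balance = balance + NEW.amount
     WHERE account_id = NEW.account_id
       AND period_end >= NEW.entry_date;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gl_balance_trg
AFTER INSERT ON gl
FOR EACH ROW EXECUTE PROCEDURE apply_gl_delta();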
For that data volume you shouldn't have to move off Postgres.
I'd look at tuning first - 10-15 minutes seems pretty excessive for 'a few million rows'. This ought to take just a few seconds. Note that the out-of-the-box config settings for Postgres don't (or at least didn't) allocate much disk buffer memory. You might look at that as well.
More complex solutions involve implementing some sort of data mart or an OLAP front-end such as Mondrian on top of the database. The latter does pre-calculate aggregates and caches them.
If you have a set of common aggregates, you can calculate them beforehand (say, once a week) in a separate table and/or columns, and users get them fast.
But I'd look into the tuning route too - revise your indexing strategy. As your database is read-only, you don't need to worry about index-update overhead.
Revise your database configuration as well; maybe you can squeeze some performance out of it - the default configuration is aimed at easing the life of first-time users and quickly becomes short-sighted for large databases.
Maybe even some denormalization can speed things up after you have revised your indexing and database configuration - but if you find you still need even more performance, try it as a last resort.
Oracle supports a concept called Query Rewrite. The idea is this:
When you want a lookup (WHERE ID = val) to go faster, you add an index. You don't have to tell the optimizer to use the index - it just does. You don't have to change the query to read FROM the index... you hit the same table as you always did but now instead of reading every block in the table, it reads a few index blocks and knows where to go in the table.
Imagine if you could add something like that for aggregation. Something that the optimizer would just 'use' without being told to change. Let's say you have a table called DAILY_SALES for the last ten years. Some sales managers want monthly sales, some want quarterly, some want yearly.
You could maintain a bunch of extra tables that hold those aggregations, but then you'd have to tell the users to change their queries to use a different table. In Oracle, you build those as materialized views instead. You do no work except defining the MV and an MV log on the source table. Then if a user queries DAILY_SALES for a sum by month, Oracle will change the query to use the appropriate level of aggregation. The key is that this happens WITHOUT changing the query at all.
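A small sketch of that setup, assuming a hypothetical DAILY_SALES(sale_date, amount) table (an MV log and a fast-refresh clause would be added if you want incremental refresh):
CREATE MATERIALIZED VIEW monthly_sales_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT TRUNC(sale_date, 'MM') AS sale_month,
       SUM(amount)            AS total_amount,
       COUNT(*)               AS cnt
  FROM daily_sales
 GROUP BY TRUNC(sale_date, 'MM');

-- with query rewrite enabled, the optimizer can answer this from the MV,
-- even though the query still reads FROM daily_sales:
SELECT TRUNC(sale_date, 'MM'), SUM(amount)
  FROM daily_sales
 GROUP BY TRUNC(sale_date, 'MM');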
Maybe other DBs support something like that... but this is clearly what you are looking for.