Calculating counts over several columns in the DB - SQL

We have a product backed by a DB (currently Oracle, planning to support MS SQL Server as well) with several dozen tables. For simplicity let's take one table called TASK.
We have a use case where we need to present to the user the number of tasks matching specific criteria. For example, suppose that among the many columns the TASK table has, there are 3 columns suitable for this use case:
- PRIORITY - possible values: LOW, MEDIUM, HIGH
- OWNER - possible values are users registered in the system (can be tens of them)
- STATUS - possible values: IDLE, IN_PROCESS, DONE
So we want to display to the user exactly how many tasks are LOW, MEDIUM, HIGH, how many are owned by some specific user, and how many are in each status. Of course, the basic implementation would be to maintain these counts up to date on every modification to the TASK table. However, what complicates the matter is that the user can additionally filter the result by criteria that may (or may not) involve some of the columns mentioned above.
For example, the user might want to see those counts only for tasks that are owned by him and were created last month. The number of possible filter combinations is endless here, so needless to say, maintaining pre-computed counts for every combination is impossible.
So the question is: how can this problem be solved without a serious impact on DB performance? Can it be solved solely in the DB, or should we resort to other data stores, like a sparse data store? It feels like a problem that comes up all over the place in many companies. For example, in the Amazon store you can see counts per category while using arbitrary text search criteria, which means that they also calculate them on the spot instead of maintaining them up to date all the time.
One last thing: we can accept a certain functional limitation, saying that the count should be exact up to 100, but starting from 100 it can just say "over 100 tasks". Maybe this mitigation can allow us to emit more efficient SQL queries.
Thank you!

As I understand it, you would like to have info about 3 different distributions: across PRIORITY, OWNER and STATUS. I suppose the best way to solve this problem is to maintain 3 different data sources (a SQL query, aggregated info in the DB or Redis, etc.).
The simplest way I see to calculate this data is to build a separate SQL query for each distribution. For example, for priority it would be something like:
SELECT PRIORITY, COUNT(*)
FROM TASK
[WHERE <additional search criteria>]
GROUP BY PRIORITY
Of course, this is not the most efficient approach in terms of database performance, but it keeps the counts up to date.
If you would like to store aggregated values, which may significantly decrease the database load (it depends on the row count), then you probably need to build a cube whose dimensions are the available search criteria. With this approach you can also implement the limitation functionality.
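As for the "over 100" mitigation, one possible trick (just a sketch; :current_user and the CREATED_DATE filter are hypothetical examples of the user's criteria) is to count over a subquery that stops scanning after 101 matching rows, so the database never has to count the full result set:
SELECT COUNT(*) AS capped_count
FROM (
  SELECT 1
  FROM TASK
  WHERE PRIORITY = 'HIGH'                      -- one of the values you need a count for
    AND OWNER = :current_user                  -- hypothetical additional filter criteria
    AND CREATED_DATE >= ADD_MONTHS(SYSDATE, -1)
    AND ROWNUM <= 101                          -- Oracle; in SQL Server use SELECT TOP (101) 1 ...
) t
If the result is 101, display "over 100 tasks"; otherwise it is the exact count. You would need one such query per value (or drop the cap and use the GROUP BY form above).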

Related

How to efficiently filter large amount of records based on user permissions on specific records with specific criteria?

I'm working as a maintainer for a legacy Java-based cargo railway consignment note accounting system. There is a serious performance issue with retrieving a list of consignment notes to display on their website.
I cannot publish the entire query, but here are some statistics to give the general idea:
- it has 17 left joins
- it has a huge WHERE clause with 5 OR groups to determine whether a user is allowed to access a record because of a specific relation to the record (consignor, consignee, carrier, payer, supervisor) and to check the user's permission to access records related to a specific railway station
- each of the OR groups has, on average, two EXISTS() checks with subqueries on some data related to the record, plus the station permission check
- when expanded to be human-readable, the query is about 200 lines long
Essentially, the availability of each record to currently logged-in user depends on the following factors:
- the company of the user
- the company of the carrier, consignee, consignor, payer of each specific consignment note
- every consignment note has multiple route sections and every section has its own carrier and payer, thus requiring further access control conditions to make these records visible to the user
- every consignment note and every route section has origin and destination stations, and a user is allowed to see the record only if he has been given access to any of these stations (using a simple relation table).
There are about 2 million consignment note records in the database and the customer is complaining that it takes too long to load a page with 20 records.
Unfortunately it is not possible to optimize the final query before passing it to the RDBMS (Oracle 11g, to be specific) because the system has a complex architecture and a home-brew ORM tool, and the final query is assembled in at least three different places that are responsible for collecting the fields to select, collecting the joins, adding the criteria selected in the UI and, finally, the reason for this question - the permission-related filter.
I wouldn't say that the final query is very complex; on the contrary, it is simple in its nature but it's just huge.
I'm afraid that caching solutions wouldn't be very effective in this case because the data changes very often and the cache would be overwritten every minute or so. Also, because of individual permissions, each user would need their own cache that would have to be maintained.
Besides the usual recommendations - dealing with indexes and optimizing each subquery as much as possible - are there any other well-known solutions for filtering large amounts of records based on complex permission rules?
Just my two cents, since I see no other answers around.
First of all you would need to get the execution plan of the query. Without it, it's not that easy to get an idea of what could be improved. It sounds like a nice challenge, if it weren't for your urgency.
Well, you say the query has 17 left joins. Does that mean there is a single main table in the query? If so, then that's the first thing I would optimize. The key aspect is to reduce the TABLE ACCESS BY ROWID operations on that table as much as possible. The typical solution is to add well-tailored indexes to narrow down the INDEX RANGE SCAN on that table as much as possible, thereby reducing the heap fetches.
Then, when navigating the rest of the [outer] tables (presumably using NESTED LOOPS) you can try materializing some of those conditions into simple 0/1 flags you could use, instead of the whole conditions.
Also, if you only need 20 rows, I would expect that to be very fast... well, as long as the query is properly pipelined. If in your case it's taking too long, then that may not be the case. Are you sorting/aggregating/windowing by some specific condition that prevents pipelining? That condition could be the most important thing to index if you just need 20 rows.
Finally, you could try avoiding heap fetches by using "covering indexes". That could really improve performance of your query, but I would leave it as a last resort, since they have their downsides.
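To illustrate the covering-index idea with made-up names (a sketch only - the real column list has to match exactly what your query filters on and selects from the main table), the point is that if the index contains every column the query touches on that table, the plan can skip the TABLE ACCESS BY ROWID step entirely:
-- hypothetical main table and columns
CREATE INDEX ix_note_covering
  ON consignment_note (status, consignor_id, consignee_id, carrier_id, payer_id, created_date);
The downsides I mentioned are mainly the extra storage and slower DML on a table whose data "changes very often".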
Well, again, a good solution really requires taking a good look at the execution plan. If you are still game, post it and I can take a look at it.

Data Warehouse - Storing unique data over time

Basically we are building a reporting dashboard for our software. We are giving the Clients the ability to view basic reporting information.
Example: (I've removed 99% of the complexity of our actual system out of this example, as this should still get across what I'm trying to do)
One example metric would be the number of unique products viewed over a certain time period. I.e., if 5 products were each viewed by customers 100 times over the course of a month and you run the report for that month, it should just say 5 for the number of products viewed.
Are there any recommendations on how to go about storing data in such a way that it can be queried for any time range and return a unique count of products viewed? For the sake of this example, let's say there is a rule that the application cannot query the source tables directly, and we have to store summary data in a different database and query it from there.
As a side note, we have tons of other metrics we are storing, which we store aggregated by day. But this particular metric is different because of the uniqueness issue.
I personally don't think it's possible. And our current solution is that we offer 4 pre-computed time ranges where metrics affected by uniqueness are available. If you use a custom time range, then that metric is no longer available because we don't have the data pre-computed.
Your problem is that you're trying to change the grain of the fact table. This can't be done.
Your best option is what I think you are doing now - define aggregate fact tables at the grain of day, week and month to support your performance constraint.
You can address the custom time range simply by advising your users that this will be slower than the standard aggregations. For example, a user wanting to know the counts of unique products sold on Tuesdays can write a query like this, at the expense of some performance loss:
select dim_prod.pcode, count(*)
from fact_sale
join dim_prod on dim_prod.pkey = fact_sale.pkey
join dim_date on dim_date.dkey = fact_sale.dkey
where dim_date.day_name = 'Tuesday'
group by dim_prod.pcode
The query could also be written against a daily aggregate rather than the transactional fact, and as it would be scanning less data it would run faster, maybe even meeting your need.
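For the headline "how many unique products" number itself, a sketch against a hypothetical daily aggregate table (here agg_daily_sale, with one row per product per day) could look like this:
select count(distinct agg.pkey) as unique_products
from agg_daily_sale agg
join dim_date on dim_date.dkey = agg.dkey
where dim_date.day_name = 'Tuesday'
The distinct count is still correct at the daily grain because the product key is preserved; only the per-view detail is collapsed.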
From the information that you have provided, I think you are trying to measure the 'number of unique products viewed over a month (for example)'.
Not sure if you are using Kimball methodologies to design your fact tables. I believe that in the Kimball methodology, an Accumulating Snapshot fact table would be recommended to meet such a requirement.
I might be preaching to the converted (apologies in that case), but if not, I would suggest you go through the following link, where the experts explain the concept in detail:
http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
I have also included another link from Kimball, which explains different types of fact tables in detail:
http://www.kimballgroup.com/2014/06/design-tip-167-complementary-fact-table-types/
Hope that explains the concepts in detail. More than happy to answer any questions (to the best of my ability).
Cheers
Nithin

Are there downsides to nesting data in BigQuery?

We have data of different dimensions, for example:
Name by Company
Stock prices by Date, Company
Commodity prices by Date & Commodity
Production volumes by Date, Commodity & Company
We're thinking of the best way of storing these in BigQuery. One potential method is to put them all in the same table, and nest the extra dimensions.
That would mean:
Almost all the data would be nested - e.g. there would be a single 'row' for each Company, and then its prices would be nested by Date.
Data would have to share at least one dimension - I don't think there would be a way of representing Commodity prices in a table whose first column was the company's Name
Are there disadvantages? Are there performance implications? Is it sensible to nest 5000 dates + associated values within each company's row?
It's common to have nested/repeated columns in BigQuery schemas since it makes reasoning about the data easier. Firebase produces schemas with repetition at many levels, for instance. If you flatten everything, the downside is you need some kind of unique ID for each row in order to associate events with each other, and then you'll need aggregations (using the ID as a key) rather than simple filters if you want to do any kind of counting.
As for downsides of nested/repeated schemas, one is that you may find yourself performing complicated transformations of the structure with ARRAY subqueries or STRUCT operators, for instance. These are generally fast, but they do have some overhead relative to queries without any structure imposed on the result at all.
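For example, with a hypothetical nested schema (a companies table with a repeated prices field containing price_date and close_price), a standard-SQL query over the nested data might look something like this, just as a sketch:
SELECT
  c.name,
  AVG(p.close_price) AS avg_price        -- aggregate over the repeated field
FROM companies AS c,
  UNNEST(c.prices) AS p                  -- flattens the nested prices for this query only
WHERE p.price_date BETWEEN DATE '2017-01-01' AND DATE '2017-12-31'
GROUP BY c.name
This reads naturally, but anything beyond it (re-nesting results, correlating two repeated fields, etc.) is where the ARRAY/STRUCT gymnastics mentioned above come in.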
My best suggestion would be to load some data and run some experiments. Storage and querying both are relatively cheap, so you can try a few different schema shapes and see which works better for your purposes.
Updating in BigQuery is pretty new, but based on the publicly available info on BigQuery DML, it is currently limited to only 48 UPDATE statements per table per day.
Quotas
DML statements are significantly more expensive to process than SELECT statements.
- Maximum UPDATE/DELETE statements per day per table: 48
- Maximum UPDATE/DELETE statements per day per project: 500
- Maximum INSERT statements per day per table: 1,000
- Maximum INSERT statements per day per project: 10,000
Processing nested data is also very expensive since all of the data from that column is loaded on every query. It is also slow if you are doing a lot of operations on nested data.

Efficient database model for points system?

I'm trying to add a very simple points system to a site. In my SQL database, there is a table for awarding points, since admins can increase a user's points by any amount, along with a rationale for the points increase. So, simplified a little, this table contains the # of points awarded, rationale and userid for each time an admin awards points. So far so good.
However, there are 10 usergroups on the site that compete for the highest total points. The number of points for a single usergroup can easily hit 15 000 total, as there are already more than 10 000 members of the site (admittedly, most are inactive). I want to have a leaderboard to show the competing usergroups and their total scores, but I'm worried that when implementing the system, summing the points will take too long to do each time.
Here's the question: at what level (if any) should I save the points aggregate in the database? Should I have a field in the user table for total points per user and sum those up on the fly for the usergroup leaderboard? Or should I have an aggregate field for each usergroup that I update each time points are added to a single user? Before actually implementing the system, I'd like to have a good idea of how long it will take to sum these up on the fly, since a bad implementation will affect thousands of users, and I don't have much practical experience with large databases.
It depends on your hardware, but summing thousands of rows should be no problem. In general though, you should avoid summing all the user scores except when you absolutely need to. I would recommend adding a rollup table that stores the total score for each group, and then running a nightly cron that validates the total scores (basically do the summation and then store the absolutely correct values).
I suggest keeping your table that logs the points awarded and the reason for the award. Also store the summed score per user separately, and update it at the same time you insert into the logging table, along with another table holding the total score per group. That should work well at your activity level. You could also do asynchronous updates of the total group scores if it gets too contentious, but it should be fine.
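A sketch of that layout, with made-up table and column names - the log stays the source of truth and the two total tables are updated in the same transaction as the insert:
INSERT INTO points_award (user_id, points, rationale) VALUES (:user_id, :points, :rationale);
UPDATE user_total SET total_points = total_points + :points WHERE user_id = :user_id;
UPDATE usergroup_total SET total_points = total_points + :points WHERE group_id = :group_id;
-- the leaderboard then becomes a trivial read
SELECT group_id, total_points FROM usergroup_total ORDER BY total_points DESC;
The nightly validation job mentioned above just re-sums points_award and overwrites both total tables, so any drift gets corrected.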
Honestly, your aggregates will still likely compute in less than a second with less than 10k rows; you should leave your operational database atomic and just store each point transaction, then compute the aggregates when you query. If you really wanted to, you could precompute your aggregates into a materialized view, but I really don't think you'd need to.
You could create a materialized view with the clause
refresh fast
start with sysdate
next sysdate + 1/24
to have it refresh hourly.
You wouldn't have real-time aggregate data (it could be off by an hour), but it could increase the performance of your aggregate queries by quite a bit if the data gets huge. As your data is now, I wouldn't even bother with it though -- your performance should be alright.
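For completeness, the full definition might look roughly like this (Oracle syntax, with made-up table and column names; fast refresh also requires a materialized view log on the source table and has extra requirements for aggregate views):
CREATE MATERIALIZED VIEW LOG ON points_award
  WITH ROWID, SEQUENCE (group_id, points) INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW usergroup_total_mv
  REFRESH FAST START WITH SYSDATE NEXT SYSDATE + 1/24   -- hourly
AS
SELECT group_id,
       SUM(points)   AS total_points,
       COUNT(points) AS points_count,   -- Oracle wants these counts for fast refresh of aggregates
       COUNT(*)      AS row_count
FROM points_award
GROUP BY group_id;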
edit: not sure why I'm being down-voted. I think this is a better solution than storing aggregates in tables.

Aggregates on large databases: best platform?

I have a postgres database with several million rows, which drives a web app. The data is static: users don't write to it.
I would like to be able to offer users query-able aggregates (e.g. the sum of all rows with a certain foreign key value), but the size of the database now means it takes 10-15 minutes to calculate such aggregates.
Should I:
start pre-calculating aggregates in the database (since the data is static)
move away from postgres and use something else?
The only problem with 1. is that I don't necessarily know which aggregates users will want, and it will obviously increase the size of the database even further.
If there was a better solution than postgres for such problems, then I'd be very grateful for any suggestions.
You are trying to solve an OLAP (On-Line Analytical Processing) database structure problem with an OLTP (On-Line Transaction Processing) database structure.
You should build another set of tables that store just the aggregates and update these tables in the middle of the night. That way your customers can query the aggregate set of tables and it won't interfere with the on-line transaction processing system at all.
The only caveat is that the aggregate data will always be one day behind.
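A sketch of that approach with hypothetical names (the real grouping key is whatever foreign key your users filter on); the table is rebuilt nightly by cron or a scheduler:
BEGIN;
TRUNCATE agg_totals;
INSERT INTO agg_totals (fk_value, row_count, amount_sum)
SELECT fk_value, COUNT(*), SUM(amount)
FROM big_table
GROUP BY fk_value;
COMMIT;
The web app then queries agg_totals instead of big_table. TRUNCATE is transactional in Postgres, so the swap is atomic from a reader's point of view (readers may briefly block during the rebuild).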
1. Yes.
2. Possibly. Presumably there are a whole heap of things you would need to consider before changing your RDBMS. If you moved to SQL Server, you would use indexed views to accomplish this: Improving Performance with SQL Server 2008 Indexed Views
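A rough sketch of what such an indexed view looks like (hypothetical table and column names; SCHEMABINDING and COUNT_BIG(*) are mandatory for an aggregate indexed view):
CREATE VIEW dbo.v_totals_by_fk
WITH SCHEMABINDING
AS
SELECT fk_value,
       SUM(amount)  AS amount_sum,
       COUNT_BIG(*) AS row_count        -- required when the view uses GROUP BY
FROM dbo.big_table
GROUP BY fk_value;
GO
CREATE UNIQUE CLUSTERED INDEX ix_v_totals_by_fk ON dbo.v_totals_by_fk (fk_value);
Once the clustered index exists, SQL Server (Enterprise edition, or with the NOEXPAND hint) can answer matching aggregate queries from the view automatically.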
If you store the aggregates in an intermediate object (something like MyAggregatedResult), you could consider a caching proxy:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ResultsProxy {
    private final Map<String, MyAggregatedResult> cache = new ConcurrentHashMap<>();

    MyAggregatedResult calculateResult(String param1, String param2) {
        // retrieve from cache; if not found, calculate and store in cache
        return cache.computeIfAbsent(param1 + "|" + param2,
                key -> calculate(param1, param2));  // calculate(...) = the expensive aggregation
    }
}
There are quite a few caching frameworks for Java, and most likely for other languages/environments such as .NET as well. These solutions can take care of invalidation (how long a result should be stored in memory) and memory management (removing old cache items when reaching a memory limit, etc.).
If you have a set of commonly-queried aggregates, it might be best to create an aggregate table that is maintained by triggers (or an observer pattern tied to your OR/M).
Example: say you're writing an accounting system. You keep all the debits and credits in a General Ledger table (GL). Such a table can quickly accumulate tens of millions of rows in a busy organization. To find the balance of a particular account on the balance sheet as of a given day, you would normally have to calculate the sum of all debits and credits to that account up to that date, a calculation that could take several seconds even with a properly indexed table. Calculating all figures of a balance sheet could take minutes.
Instead, you could define an account_balance table. For each account and dates or date ranges of interest (usually each month's end), you maintain a balance figure by using a trigger on the GL table to update balances by adding each delta individually to all applicable balances. This spreads the cost of aggregating these figures over each individual persistence to the database, which will likely reduce it to a negligible performance hit when saving, and will decrease the cost of getting the data from a massive linear operation to a near-constant one.
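A sketch of such a trigger in Postgres syntax (the gl and account_balance layouts here are hypothetical, with one balance row per account per month-end):
CREATE OR REPLACE FUNCTION apply_gl_delta() RETURNS trigger AS $$
BEGIN
  -- add this posting's amount to every balance at or after its posting date
  UPDATE account_balance
     SET balance = balance + NEW.amount
   WHERE account_id = NEW.account_id
     AND balance_date >= NEW.posting_date;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gl_after_insert
AFTER INSERT ON gl
FOR EACH ROW EXECUTE PROCEDURE apply_gl_delta();
An observer in the OR/M layer would do the same update from application code instead of a database trigger.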
For that data volume you shouldn't have to move off Postgres.
I'd look at tuning first - 10-15 minutes seems pretty excessive for 'a few million rows'. This ought to take just a few seconds. Note that the out-of-the-box config settings for Postgres don't (or at least didn't) allocate much disk buffer memory. You might look at that also.
More complex solutions involve implementing some sort of data mart or an OLAP front-end such as Mondrian over the database. The latter does pre-calculate aggregates and caches them.
If you have a set of common aggregates, you can calculate them beforehand (like, well, once a week) in a separate table and/or columns, and users get them fast.
But I'd look into the tuning route too - revise your indexing strategy. As your database is read-only, you don't need to worry about index-updating overhead.
Revise your database configuration; maybe you can squeeze some performance out of it - default configurations are normally targeted at easing the life of first-time users and quickly fall short with large databases.
Maybe even some denormalization can speed things up, once you have revised your indexing and database configuration and still find yourself needing more performance - but try it as a last resort.
Oracle supports a concept called Query Rewrite. The idea is this:
When you want a lookup (WHERE ID = val) to go faster, you add an index. You don't have to tell the optimizer to use the index - it just does. You don't have to change the query to read FROM the index... you hit the same table as you always did but now instead of reading every block in the table, it reads a few index blocks and knows where to go in the table.
Imagine if you could add something like that for aggregation. Something that the optimizer would just 'use' without being told to change. Let's say you have a table called DAILY_SALES for the last ten years. Some sales managers want monthly sales, some want quarterly, some want yearly.
You could maintain a bunch of extra tables that hold those aggregations and then tell the users to change their query to use a different table. In Oracle, instead, you'd build those as materialized views. You do no work except defining the MV and an MV log on the source table. Then if a user queries DAILY_SALES for a sum by month, Oracle will rewrite the query to use an appropriate level of aggregation. The key is that this happens WITHOUT changing the query at all.
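A sketch of what that definition could look like (the sale_date and amount columns are made up, and the exact fast-refresh prerequisites depend on your Oracle version):
CREATE MATERIALIZED VIEW LOG ON daily_sales
  WITH ROWID, SEQUENCE (sale_date, amount) INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW monthly_sales_mv
  BUILD IMMEDIATE
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE
AS
SELECT TRUNC(sale_date, 'MM') AS sale_month,
       SUM(amount)   AS total_amount,
       COUNT(amount) AS amount_count,   -- needed for fast refresh of SUM
       COUNT(*)      AS row_count
FROM daily_sales
GROUP BY TRUNC(sale_date, 'MM');
With ENABLE QUERY REWRITE, a query that sums DAILY_SALES by month can be answered from monthly_sales_mv by the optimizer, with no change to the query text.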
Maybe other DBs support that... but this is clearly what you are looking for.