This is likely my ignorance of basic data science, but here goes...
I have a massive database of events--many billions. I have up to 5 or 6 filters users can select, so whatever filters the user selects become the WHERE clause of my SQL query. In a status display I need to show a couple of simple computed stats over this data, with the filters applied (average, %, that kind of thing--easy computations).
My problem is this:
If I do it the straightforward way, i.e. plain SQL where we say
select a,b,c from Events where fieldX=filter1_value and fieldY=filter2_value
where the filter values come from the user's manipulation of the filter controls, and then feed the results through a simple compute engine to roll up the stats.
I have high confidence this works functionally, but I'm very concerned that even with filters you can easily get many millions of results, and simply rolling these up, even for a simple computation, will take too long for real-time display (i.e. providing results to a web UI via REST).
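For concreteness, the same filters with the rollup pushed into the query itself would look something like this, so that only the aggregate values come back instead of millions of rows; the expressions and the 'SOME_VALUE' literal are purely illustrative:
select count(*) as event_count,
       avg(a) as avg_a,
       100.0 * sum(case when b = 'SOME_VALUE' then 1 else 0 end) / nullif(count(*), 0) as pct_b_some_value
from Events
where fieldX=filter1_value and fieldY=filter2_value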
Alternatively, the combinatorial explosion of potential filter values is likewise (I think) too vast to allow me to precompute ready-to-eat stats for every combination.
Is there a different/better way to allow me to do this and still be fast?
Suppose we have two entities/tables - Users and Games (it could be anything else). A user can mark multiple games as a favourite, so we also have a user_favourite_game (user_id, game_id) table.
Then suppose a user is fetching a list of all available games, and some of them should have the "favourite" flag = true (pagination is used, so we'll assume 20 games are fetched each time). I see two approaches here:
We can make one request that populates the "favourite" field, e.g.
SELECT
g.*,
ufg.game_id IS NOT NULL AS favourite
FROM
games g LEFT JOIN
user_favourite_game ufg ON ufg.user_id = :userId AND g.id = ufg.game_id
ORDER BY
g.id;
We can select the games and then perform 20 separate requests to check whether each game is one of the user's favourites.
Which approach is better to use and why? Any other ideas?
On my last project we used the second approach because of the complexity of the computations required for each entity. It was a lot more complicated than the example above and close to impossible to calculate inside a single query.
But in general, it seems to me that in such simple cases a single query with a JOIN should run faster than 20 simple queries, although I'm not sure how it will behave once there is a lot of data in the user_favourite_game table.
Use the database for what it's designed to do and have it give you the results as part of your original query.
The time your DB will spend performing the outer join on the user favorite game table will likely be less than the network overhead of 20 separate requests for the favorite flag.
Make sure the tables are indexed appropriately as they grow and have accurate statistics.
This isn't a hard and fast rule, and actual performance testing should guide you, but I have observed plenty of applications that were harmed by network chattiness. If your round-trip cost for each request is 250ms, your 20 calls will be very expensive (5 seconds of latency alone); if your round-trip cost is 1ms, people might never notice.
Firing 20 queries (irrespective of how simple they are) will always slow your application down. The factors include network cost, query execution, etc.
You should fire one query to get the page of available games and then another query to get the list of that user's "favourite" games by passing the ids of the games present in that page. Then set/unset the flag by looping over the result. This way you make only 2 DB calls, which will improve performance significantly.
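For illustration, the second query might look roughly like this, assuming the first query already returned the ids of the 20 games on the current page:
SELECT game_id
FROM user_favourite_game
WHERE user_id = :userId
  AND game_id IN (:pageGameIds); -- the 20 ids returned by the page query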
I'm working as a maintainer for a legacy Java-based cargo railway consignment note accounting system. There is a serious performance issue with retrieving a list of consignment notes to display on their website.
I cannot publish the entire query, but here are some statistics to give the general idea:
it has 17 left joins
it has a huge where clause with 5 OR groups to determine if a user is allowed to access a record because of a specific relation to the record (consignor, consignee, carrier, payer, supervisor) and to check the user's permission to access records related to a specific railway station
each of the OR groups has, on average, two exists() checks with subqueries on data related to the record, and also checks the station permission
when expanded to be human-readable, the query is about 200 lines long
Essentially, the availability of each record to currently logged-in user depends on the following factors:
- the company of the user
- the company of the carrier, consignee, consignor, payer of each specific consignment note
- every consignment note has multiple route sections and every section has its own carrier and payer, thus requiring further access control conditions to make these records visible to the user
- every consignment note and every route section has origin and destination stations, and a user is allowed to see the record only if he has been given access to any of these stations (using a simple relation table).
There are about 2 million consignment note records in the database and the customer is complaining that it takes too long to load a page with 20 records.
Unfortunately it is not possible to optimize the final query before passing it to the RDBMS (Oracle 11g, to be specific), because the system has a complex architecture and a home-brew ORM tool, and the final query is assembled in at least three different places that are responsible for collecting the fields to select, collecting the joins, adding the criteria selected in the UI and, finally, the reason for this question - the permission-related filter.
I wouldn't say that the final query is very complex; on the contrary, it is simple in its nature but it's just huge.
I'm afraid that caching solutions wouldn't be very effective in this case because the data changes very often and the cache would be overwritten every minute or so. Also, because of individual permissions, each user would need their own cache, which would also have to be maintained.
Besides the usual recommendations - dealing with indexes and optimizing each subquery as much as possible - are there any other well-known solutions for filtering a large number of records based on complex permission rules?
Just my two cents, since I see no other answers around.
First of all you would need to get the execution plan of the query. Without it, it's not that easy to get an idea of what could be improved. It sounds like a nice challenge, if it weren't for your urgency.
Well, you say the query has 17 left joins. Does that mean there is a single main table in the query? If so, then that's the first section I would optimize. The key aspect is to reduce the TABLE ACCESS BY ROWID operations as much as possible on that table. The typical solution is to add well-tailored indexes that narrow down the INDEX RANGE SCAN on that table as much as possible, thereby reducing the heap fetches.
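As a sketch only - the real column choice has to come from the execution plan and the actual filter selectivity, and these names are made up:
-- Hypothetical composite index on the main consignment note table, ordered so that
-- the most selective permission columns come first and narrow the INDEX RANGE SCAN.
CREATE INDEX ix_consignment_note_access
    ON consignment_note (carrier_company_id, consignor_company_id, origin_station_id);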
Then, when navigating the rest of the [outer] tables (presumably using NESTED LOOPS), you can try materializing some of those conditions into simple 0/1 flags you could use instead of evaluating the whole conditions.
Also, if you only need 20 rows, I would expect that to be very fast... well, as long as the query is properly pipelined. If in your case it's taking too long, then that may not be happening. Are you sorting/aggregating/windowing on some specific condition that prevents pipelining? That condition could be the most important thing to index if you just need 20 rows.
Finally, you could try avoiding heap fetches by using "covering indexes". That could really improve the performance of your query, but I would leave it as a last resort, since they have their downsides.
Well, again, a good solution really requires taking a good look at the execution plan. If you're still game, post it and I can take a look.
We have a product backed by a DB (currently Oracle, planning to support MS SQL Server as well) with several dozen tables. For simplicity let's take one table called TASK.
We have a use case when we need to present the user the number of tasks having specific criteria. For example, suppose that among many columns the TASK table has, there are 3 columns suitable for this use case:
PRIORITY - possible values: LOW, MEDIUM, HIGH
OWNER - possible values are users registered in the system (can be 10s)
STATUS - possible values: IDLE, IN_PROCESS, DONE
So we want to display to the user exactly how many tasks are LOW, MEDIUM, HIGH, how many of them are owned by some specific user, and how many pertain to the different statuses. Of course the basic implementation would be to keep these counts up to date on every modification to the TASK table. However, what complicates the matter is that the user can additionally filter the result by criteria that may or may not include some of the columns mentioned above.
For example, the user might want to see those counts only for tasks that are owned by him and were created last month. The number of possible filter combinations is endless here, so needless to say maintaining up-to-date counts for each is impossible.
So the question is: how can this problem be solved without a serious impact on DB performance? Can it be solved solely in the DB, or should we resort to other data stores, like a sparse data store? It feels like a problem that comes up all over the place in many companies. For example, in the Amazon store you can see counts on categories while using arbitrary text search criteria, which means they also calculate them on the spot instead of maintaining them up to date all the time.
One last thing: we can accept a certain functional limitation, saying that the count should be exact up to 100, but starting from 100 it can just say "over 100 tasks". Maybe this mitigation can allow us to emit more efficient SQL queries.
Thank you!
As I understand it, you would like to have info about 3 different distributions: across PRIORITY, OWNER and STATUS. I suppose the best way to solve this problem is to maintain 3 different data sources (like a SQL query, aggregated info in the DB or Redis, etc.).
The simplest way to calculate this data, as I see it, is to build a separate SQL query for each distribution. For example, for priority it would be something like:
SELECT PRIORITY, COUNT(*)
FROM TASKS
[WHERE <additional search criteria>]
GROUP BY PRIORITY
Of course, it is not the most efficient way in terms of database performance, but it allows you to keep the counts up to date.
If you would like to store aggregated values, which may significantly decrease database load (it depends on the row count), then you probably need to build a cube whose dimensions are the available search criteria. With this approach, you can also implement the limitation functionality.
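The "exact up to 100" relaxation mentioned in the question can also be exploited directly in SQL by counting over a capped subquery, so the database can stop scanning after 101 matching rows. A rough sketch in Oracle syntax (SQL Server would use SELECT TOP 101 instead of ROWNUM; the OWNER filter is just an example of user-selected criteria):
SELECT COUNT(*) AS capped_count
FROM (SELECT 1
        FROM TASKS
       WHERE OWNER = :owner     -- user-selected filters go here
         AND ROWNUM <= 101);
-- If capped_count = 101, display "over 100 tasks"; otherwise show the exact number.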
Working on a dashboard page which does a lot of analytics to display BOTH graphical and tabular data to users.
When the dashboard is filtered by a given year, I have to display analytics for the selected year, another year chosen for comparison, and historical averages from all time.
For the selected and comparison years, I create start/end DateTime objects that are set to the beginning_of_year and end_of_year.
year = Model.where("closed_at >= ?", start_date).where("closed_at <= ?", end_date).all
comp = Model.where("closed_at >= ?", comp_start).where("closed_at <= ?", comp_end).all
These queries are essentially the same, just different date filters. I don't really see any way to optimize this besides trying to only "select(...)" the fields I need, which will probably be all of them.
Since there will be an average of 250-1000 records in a given year, they aren't "horrible" (in my not-very-skilled opinion).
However, the historical averages are causing me a lot of pain. In order to adequately show the averages, I have to query ALL the records for all time and perform calculations on them. This is a bad idea, but I don't know how to get around it.
all_for_average = Model.all
Surely people have run into these kinds of problems before and have some means of optimizing them? Returning somewhere in the ballpark of 2,000 - 50,000 records for historical average analysis can't be very efficient. However, I don't see another way to perform the analysis unless I first retrieve the records.
Option 1: Grab everything and filter using Ruby
Since I'm already grabbing everything via Model.all, I "could" remove the 2 year queries by simply filtering the desired records out of that all-time set in Ruby instead. But this seems wrong...I'm literally "downloading" my DB (so to speak) and then querying it with Ruby code instead of SQL. Seems very inefficient. Has anyone tried this before and seen any performance gains?
Option 2: Using multiple SQL DB calls to get select information
This would mean instead of grabbing all records for a given time period, I would make several DB queries to get the "answers" from the DB instead of analyzing the data in Ruby.
Instead of running something like this,
year = Model.where("closed_at >= ?", start_date).where("closed_at <= ?", end_date).all
I would perform multiple queries:
year_total_count = Model.where(DATE RANGE).size
year_amount_sum = Model.where(DATE RANGE).sum("amount")
year_count_per_month = Model.where(DATE RANGE).group("EXTRACT(MONTH FROM closed_at)").count
...other queries to extract selected info...
Again, this seems very inefficient, but I'm not knowledgeable enough about SQL and Ruby code efficiencies to know which would lead to obvious downsides.
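For what it's worth, the per-month count and amount sum for a year could also come back in a single grouped query. A sketch in raw PostgreSQL, assuming the model maps to a models table with closed_at and amount columns (the names are illustrative):
SELECT date_part('month', closed_at) AS month,
       COUNT(*)                      AS closed_count,
       SUM(amount)                   AS total_amount
FROM models
WHERE closed_at >= :start_date
  AND closed_at <= :end_date
GROUP BY date_part('month', closed_at)
ORDER BY month;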
I "can" code both routes and then compare them with each other, but it will take a few days to code/run them since there's a lot of information on the dashboard page I'm leaving out. Certainly these situations have been run into multiple times for dashboard/analytics pages; is there a general principle for these types of situations?
I'm using PostgreSQL on Rails 4. I've been looking into DB-specific solutions as well, as being "database agnostic" really is irrelevant for most applications.
Dan, I would look into using a materialized view (MV) for the all-time historical average. This would definitely fall under the "DB-specific" solutions category, as MVs are implemented differently in different databases (or sometimes not at all). Here is the basic PG documentation.
A materialized view is essentially a physical table, except its data is based on a query of other tables. In this case, you could create an MV that is based on a query that averages the historical data. This query only gets run once if the underlying data does not change. Then the dashboard could just do a simple read query on this MV instead of running the costly query on the underlying table.
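A minimal PostgreSQL sketch, assuming an illustrative models table with closed_at and amount columns:
-- Precompute the all-time aggregates once; the dashboard then reads this tiny result
-- instead of scanning the full table every time.
CREATE MATERIALIZED VIEW historical_averages AS
SELECT date_part('month', closed_at) AS month,
       AVG(amount)                   AS avg_amount,
       COUNT(*)                      AS closed_count
FROM models
GROUP BY date_part('month', closed_at);

-- Rebuild on a schedule or after bulk data changes:
REFRESH MATERIALIZED VIEW historical_averages;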
After discussing the issue with other more experienced DBAs and developers, I decided I was trying to optimize a problem that didn't need any optimization yet.
For my particular use case, I would have a few hundred users a day running these queries anywhere from 5-20 times each, so I wasn't really having major performance issues (i.e., I'm not Google or Amazon servicing billions of requests a day).
I am actually just having the PostgreSQL DB execute the queries each time and I haven't noticed any major performance issues for my users; the page loads very quickly and the queries/graphs have no noticeable delay.
For others trying to solve similar issues, I recommend running it for a while in a staging environment to see if you really have a problem that needs solving in the first place.
If I hit performance hiccups, my first step will be specifically indexing data that I query on, and my 2nd step will be creating DB views that "pre-load" the queries more efficiently than querying them over live data each time.
Thanks to the incredible advances in DB speed and technology, however, I don't have to worry about this problem.
I'm answering my own question so others can spend time resolving more profitable questions.
A Windows Forms application of ours pulls records from a view on SQL Server through ADO.NET and a SOAP web service, displaying them in a data grid. We have had several cases with ~25,000 rows, which works relatively smoothly, but a potential customer needs to have many times that much in a single list.
To figure out how well we scale right now, and how (and how far) we can realistically improve, I'd like to implement a simulation: instead of displaying actual data, have the SQL Server send fictional, random data. The client and transport side would be mostly the same; the view (or at least the underlying table) would of course work differently. The user specifies the amount of fictional rows (e.g. 100,000).
For the time being, I just want to know how long it takes for the client to retrieve and process the data, up to the point where it is just about ready to display it.
What I'm trying to figure out is this: how do I make the SQL Server send such data?
Do I:
Create a stored procedure that has to be run beforehand to fill an actual table?
Create a function that I point the view to, thus having the server generate the data 'live'?
Somehow replicate and/or randomize existing data?
The first option sounds to me like it would yield the results closest to the real world. Because the data is actually 'physically there', the SELECT query would be quite similar performance-wise to one on real data. However, it taxes the server with an otherwise meaningless operation. The fake data would also be backed up, as it would live in one and the same database — unless, of course, I delete the data after each benchmark run.
The second and third option tax the server while running the actual simulation, thus potentially giving unrealistically slow results.
In addition, I'm unsure how to create those rows, short of using a loop or cursor. I can use SELECT top <n> random1(), random2(), […] FROM foo if foo actually happens to have <n> entries, but otherwise I'll (obviously) only get as many rows as foo happens to have. A GROUP BY newid() or similar doesn't appear to do the trick.
For data for testing CRM-type tables, I highly recommend fakenamegenerator.com; you can get 40,000 fake names for free.
You didn't mention if you're using SQL Server 2008. If you use 2008 and you use Data Compression, be aware that random data will act very differently (slower) than real data. Random data is much harder to compress.
Quest Toad for SQL Server and Microsoft Visual Studio Data Dude both have test data generators that will put fake "real" data into records for you.
If you want results you can rely on, you need to make the testing scenario as realistic as possible, which makes option 1 by far your best bet. As you point out, if you get results that aren't good enough with the other options, you won't be sure that it wasn't due to the different database behaviour.
How you generate the data will depend to a large degree on the problem domain. Can you take data sets from multiple customers and merge them into a single mega-dataset? If the data is time series then maybe it can be duplicated over a different range.
The data is typically CRM-like, i.e. contacts, projects, etc. It would be fine to simply duplicate the data (e.g., if I only have 20,000 rows, I'll copy them five times to get my desired 100,000 rows). Merging, on the other hand, would only work if we never deploy the benchmarking tool publicly, for obvious privacy reasons (unless, of course, I apply a function to each column that renders the original data unintelligible beyond repair? Similar to a hashing function, only without modifying the value's size too much).
To populate the rows, perhaps something like this would do:
WHILE (SELECT COUNT(1) FROM benchmark) < 100000
    INSERT INTO benchmark
    SELECT TOP 100000 * FROM actualData;
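If duplicating real data ever becomes unworkable, arbitrary row counts can also be generated synthetically, without a loop, a cursor, or a large source table, by cross-joining a system view and numbering the result. A rough sketch with made-up target columns:
-- Cross-joining sys.all_objects with itself yields far more than 100,000 rows to draw from.
;WITH numbers AS (
    SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO benchmark (id, some_text, some_date)   -- hypothetical column list
SELECT n,
       LEFT(CONVERT(varchar(36), NEWID()), 20),                                -- pseudo-random text
       DATEADD(DAY, -CAST(RAND(CHECKSUM(NEWID())) * 3650 AS int), GETDATE())  -- random date within the last ~10 years
FROM numbers;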