Why is getting an item count in distributed storage a costly operation? - api

I was reading the book about APIs called "API Design Patterns" by JJ Geewax, and there is a section that talks about getting the count of items; he says it's not a good idea, especially in distributed storage systems. From page 102:
Next, there is often the temptation to include a count of the items along with the listing. While this might be nice for user-interface consumers to show a total number of matching results, it often adds far more headache as time goes on and the number of items in the list grows beyond what was originally projected. This is particularly complicated for distributed storage systems that are not designed to provide quick access to counts matching specific queries. In short, it's generally a bad idea to include item counts in the responses to a standard List method.
Does anyone have a clue why that is, or at least some keywords to search for?

In a typical database (e.g., a MySQL db with a few gigs of data in there), counting the number of rows is pretty easy. If that's all you'll ever deal with, then providing a count of matching results isn't a huge deal -- the concern comes up when things get bigger.
As the amount of data starts growing (e.g., say... 10T?), dynamically computing an accurate count of matching rows can start to get pretty expensive (you have to scan and keep a running count of all the matching data). Even with a distributed storage system this can be fast, but it's still expensive. This means your API will be spending a lot of computing resources calculating the total number of results when it could be doing other important things. In my opinion, this is wasteful (a large expense for a "nice-to-have" on the API). If counts are critical to the API, then that changes the calculation.
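To make the cost difference concrete, here is a minimal sketch in MySQL syntax (the `shop.orders` table and `status` column are invented for illustration, not from the book): an exact count has to visit every matching row, while a rough estimate comes from statistics the engine already keeps.

```sql
-- Exact count: scans every row (or index entry) that matches the filter,
-- so the cost grows with the size of the result set.
SELECT COUNT(*) AS exact_count
FROM shop.orders
WHERE status = 'shipped';

-- Rough estimate: InnoDB's row-count statistic is nearly free to read,
-- but it is approximate and only available at whole-table granularity.
SELECT TABLE_ROWS AS approximate_count
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'shop'
  AND TABLE_NAME = 'orders';
```

That statistic can be off by a noticeable margin, which is exactly the "good estimate vs. useless estimate" trade-off discussed below.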
Further, as changes to the data become more frequent (more creates, updates, and deletes), a count becomes less and less accurate as it might change drastically from one second to the next. In that case, not only is there more work being done to come up with a number, but that number isn't even all that accurate (and presumably, not super useful at that point).
So overall... result counts on larger data sets tend to be:
Expensive
More nice-to-have than business critical
Inaccurate
And since APIs tend to live much longer than we ever predict and can grow to a size far larger than we imagine, I discourage including result counts in API responses.
Every API is different though, so maybe it makes sense to have counts in your API, though I'd still suggest using rough estimates rather than exact counts to future-proof the API.
Some good reasons to include a count:
Your data size will stay reasonably small (i.e., able to be served by a single MySQL database).
Result counts are critical to your API (not just "nice-to-have").
Whatever numbers you come up with are accurate enough for your use cases (i.e., exact numbers for small data sets or "good estimates", not useless estimates).

Related

Why store whole records in audit tables?

I have worked at several companies, and in each of them the audit tables have stored full snapshots of records for every change.
To my understanding, it's enough to store only the changed columns to recreate a record at any given point in time. That would obviously reduce storage space. Moreover, I suppose it would improve performance, as we would need to write a much smaller amount of data.
As I've seen this across different databases and frameworks, I'm not putting any specific tag here.
I'd gladly understand the reasoning behind this approach.
Here are some important reasons.
First, storage is becoming cheaper and cheaper. So there is little financial benefit in reducing the number of records or their size.
Second, the "context" around a change can be very helpful. Reconstructing records as they look when the change occurs can be tricky.
Third, the logic to detect changes is tricker than it seems. This is particularly true when you have NULL values. If there is a bug in the code, then you lose the archive. Entire records are less error-prone.
Fourth, remember that (2) and (3) need to be implemented for every table being archived, further introducing the possibility of error.
I might summarize this as saying that storing the entire record uses fewer lines of code. Fewer lines of code are easier to maintain and less error-prone. And those savings outweigh the benefits of reducing the size of the archive.
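To illustrate the "fewer lines of code" point, here is a minimal whole-record audit sketch in T-SQL (the `dbo.customer` table and its columns are invented for the example): one `INSERT ... SELECT` from the `deleted` pseudo-table captures the full pre-change image, with no per-column diff logic and no special handling of NULLs.

```sql
-- Hypothetical audit table: the audited table's columns plus audit metadata.
CREATE TABLE dbo.customer_audit (
    audit_id    BIGINT IDENTITY PRIMARY KEY,
    changed_at  DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    customer_id INT,
    name        NVARCHAR(200),
    email       NVARCHAR(200)
);
GO

CREATE TRIGGER dbo.trg_customer_audit
ON dbo.customer
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- "deleted" holds the full pre-change image of every affected row,
    -- so archiving whole records is a single statement per trigger.
    INSERT INTO dbo.customer_audit (customer_id, name, email)
    SELECT customer_id, name, email
    FROM deleted;
END;
```

A changed-columns-only design would instead have to compare `inserted` and `deleted` column by column (with NULL-safe comparisons) and reassemble records at read time, which is where the extra code and the extra opportunities for bugs come from.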

Pulling large quantities of data takes too long. Need a way to speed it up

I'm creating a client dashboard website that displays many different graphs and charts of different views of data in our database.
The data consists of records of medical patients and the companies they work for, for insurance purposes. The data is displayed as aggregate charts, but there is a filter feature on the page that the user can use to filter individual patient records. The fields that they can filter by are:
Date range of the medical claim
Relationship to the insurance holder
Sex
Employer groups (user selects a number of different groups they work with, and can turn them on and off in the filter)
User Lists (the user of the site can create arbitrary lists of patients and save their IDs and edit them later). Either none, one, or multiple lists can be selected. There is also an any/all selector if multiple are chosen.
A set of filters that the user can define (with preset defaults) from other, more internally structured pieces of data. The user can customize up to three of them and can select any one, or none of them, and they return a list of patient IDs that is stored in memory until they're changed.
The problem is that loading the data can take a long time, with some pages taking from 30 seconds to a minute to load (the page is loaded first and the data is then downloaded as JSON via an ajax function while a loading spinner is displayed). Some of the stored procedures we use are very complex, requiring multiple levels of nested queries. I've tried using the Query Analyzer to simplify them, but we've made all the recommended changes and it still takes a long time. Our database people have looked and don't see any other way to make the queries simpler while still getting the data that we need.
The way it's set up now, only changes to the date range and the employer groups cause the database to be hit again. The database never filters on any of the other fields. Any other changes to the filter selection are made on the front end. I tried changing the way it worked and sending all the fields to the back end for the database to filter on, and it ended up taking even longer, not to mention having to wait on every change instead of just a couple.
We're using MS SQL 2014 (SP1). My question is, what are our options for speeding things up? Even if it means completely changing the way our data is stored?
You don't provide any specifics - so this is pretty generic.
Speed up your queries - this is the best, easiest, least error-prone option. Modern hardware can cope with huge datasets and still provide sub-second responses. Post your queries, DDL, sample data and EXPLAINs to Stack Overflow - it's very likely you can get significant improvements.
Buy better hardware - if you really can't speed up the queries, figure out what the bottleneck is, and buy better hardware. It's so cheap these days that maxing out on SSDs, RAM and CPU will probably cost less than the time it takes to figure out how to deal with the less optimal routes below.
Caching - rather than going back to the database for everything, use a cache. Figure out how "up to date" your dashboards need to be, and how unique the data is, and cache query results if at all possible. Many development frameworks have first-class support for caching. The problem with caching is that it makes debugging hard - if a user reports a bug, are they looking at cached data? If so, is that cache stale - is it a bug in the data, or in the caching?
Pre-compute - if caching is not feasible, you can pre-compute data. For instance, when you create a new patient record, you could update the reports for "patients by sex", "patients by date", "patients by insurance co", etc. (see the sketch after this list). This creates a lot of work - and even more opportunity for bugs.
De-normalize - this is the nuclear option. Denormalization typically improves reporting speed at the expense of write speed, and at the expense of introducing lots of opportunities for bugs.
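As a concrete illustration of the pre-compute option, here is a hedged sketch for SQL Server (all table and column names are assumptions, not taken from the question): a small summary table holds the aggregates the dashboard actually charts, and a scheduled job rebuilds it, so page loads read a few thousand summary rows instead of re-aggregating millions of claims.

```sql
-- Hypothetical pre-aggregated summary the dashboard reads directly.
CREATE TABLE dbo.claim_summary (
    employer_group_id INT     NOT NULL,
    claim_month       DATE    NOT NULL,
    sex               CHAR(1) NOT NULL,
    claim_count       INT     NOT NULL,
    total_amount      MONEY   NOT NULL,
    PRIMARY KEY (employer_group_id, claim_month, sex)
);

-- Body of a scheduled job (e.g. SQL Agent) that refreshes the summary.
TRUNCATE TABLE dbo.claim_summary;

INSERT INTO dbo.claim_summary (employer_group_id, claim_month, sex, claim_count, total_amount)
SELECT employer_group_id,
       DATEFROMPARTS(YEAR(claim_date), MONTH(claim_date), 1),
       sex,
       COUNT(*),
       ISNULL(SUM(claim_amount), 0)
FROM dbo.claims
GROUP BY employer_group_id,
         DATEFROMPARTS(YEAR(claim_date), MONTH(claim_date), 1),
         sex;
```

The trade-offs listed above still apply: the summary is only as fresh as its last refresh, and every new filterable dimension means another column (or another summary table) to maintain.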

Database design for: Very hierarchical data; off-server subset caching for processing; small to moderate size; (complete beginner)

I found myself with a project (very relaxed, little to no consequences on failure) that I think requires a database of some sort to solve. The problem is that, while I'm still quite inexperienced in general, I've never touched any database beyond the tutorials I could dig up with Google and setting up your average home cloud. I got myself stuck on not knowing what I do not know.
That's about the situation:
Several hundred different automated test systems will frequently write small amounts of data into a database over a slow network. A few users will then infrequently pull large subsets of that data from the database over a slow network. The data will then be processed, which will require a large number of reads; very high performance is desired at this point.
This will be the data (in orders of magnitude):
1000 products containing
10 variants containing
100 batches containing
100 objects containing
10 test-systems containing
100 test-steps containing
10 entries
It is basically a labeled B-tree with the test-steps as leaf nodes (since their format has been standardized).
A batch will always belong to one variant, an object will always belong to the same variant (but possibly multiple batches), and a variant will always belong to one product. There are hundreds of thousands of different test-steps.
Possible queries will try to get (e.g.):
Everything from a batch (optional: and the value of an entry within a range)
Everything from a variant
All test-steps of the type X and Y from a test-system with the name Z
As far as I can tell, rows hundreds of thousands of columns wide (containing everything described above) do not seem like a good idea, and neither does about a trillion rows (and the middle ground between the two still seems quite extreme).
I'd really like to leverage the hierarchical nature of the data, but all I've found on, e.g., something like nested databases is that they're simply not a thing.
It'd be nice if you could help me with:
What to search for
What'd be a good approach to structure and store this data
Some place where I can learn to avoid the SQL horror stories, of which even I have found plenty
Whether there is a great way / best practice I should know of for transmitting the queried data and caching it locally for processing
Thank you and have a lovely day
Andreas
Search for "database normalization".
A normalized relational database is a fine structure.
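To give a sense of what "normalized" could look like here, below is a hedged sketch in generic SQL (every table and column name is invented; adjust it to your real data). Each level of the hierarchy becomes its own table with a foreign key to its parent, so neither the very wide row nor the trillion-row flat table is needed.

```sql
CREATE TABLE product      (product_id BIGINT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE variant      (variant_id BIGINT PRIMARY KEY,
                           product_id BIGINT NOT NULL REFERENCES product (product_id));
CREATE TABLE batch        (batch_id   BIGINT PRIMARY KEY,
                           variant_id BIGINT NOT NULL REFERENCES variant (variant_id));
CREATE TABLE test_object  (object_id  BIGINT PRIMARY KEY,
                           variant_id BIGINT NOT NULL REFERENCES variant (variant_id));
-- An object can appear in several batches, so that relationship gets its own table.
CREATE TABLE object_batch (object_id  BIGINT NOT NULL REFERENCES test_object (object_id),
                           batch_id   BIGINT NOT NULL REFERENCES batch (batch_id),
                           PRIMARY KEY (object_id, batch_id));
CREATE TABLE test_system  (system_id  BIGINT PRIMARY KEY, name TEXT NOT NULL,
                           object_id  BIGINT NOT NULL REFERENCES test_object (object_id));
CREATE TABLE test_step    (step_id    BIGINT PRIMARY KEY, step_type TEXT NOT NULL,
                           system_id  BIGINT NOT NULL REFERENCES test_system (system_id));
CREATE TABLE entry        (entry_id   BIGINT PRIMARY KEY, value DOUBLE PRECISION,
                           step_id    BIGINT NOT NULL REFERENCES test_step (step_id));

-- "All test-steps of type X or Y from a test-system named Z" becomes a join:
SELECT ts.*
FROM test_step   ts
JOIN test_system sys ON sys.system_id = ts.system_id
WHERE sys.name = 'Z'
  AND ts.step_type IN ('X', 'Y');
```

Rows stay narrow, and indexes on the foreign keys and on test_step.step_type keep the example queries cheap.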
If you want to avoid the horrors of SQL, you could also try a No-SQL Document-oriented Database, like MongoDB. I actually prefer this kind of database in a great many scenarios.
The database will cache your query results, and of course, whichever tool you use to query the database will cache the data in the tool's memory (or it will cache at least a subset of the query results if the number of results is very large). You can also write your results to a file. There are many ways to "cache", and they are all useful in different situations.

Rapidly changing large data processing advice

My team has the following dilemma and we need some architectural/resource advice:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large data set that we process during the day
this "process" gets executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use SQL db to store all the data and we retrieve/update as process requires
Unsolved problems and desired goals:
since these processes are user-triggered, we never know when to scale up/down, which causes high spikes; Azure doesn't make it easy to autoscale based on demand without a data warehouse, which we want to stay away from because of its lack of aggregates and various other "buggy" issues
because of the constant IO to the db, we hit 100% DTU as soon as one process begins (we are using an Azure P1 DB), which of course would force us to grow even larger if multiple processes start at the same time (which is very likely)
while we understand that cost comes with high-compute tasks, we think there is a better way to go about this (the SQL is about 99% optimized, so there is not much left to do there)
We are looking for some tool that can:
Process a large number of transactions QUICKLY
Can handle constant updates of this large amount of data
supports all major aggregations
is "reasonably" priced (i know this is an arguable keyword, just take it lightly..)
Considered:
Apache Spark
we don't have a ton of experience with HDP, so any pros/cons here will certainly be useful (does the use case fit the tool??)
ArangoDB
Seems promising; seems fast and has all the aggregations we need.
Azure Data Warehouse
too many various issues we ran into, just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
It's hard to try them all and compare which one fits best, as we have a fully functional system and are required to make adjustments for whichever way we go.
Hence, any beforehand opinions are welcome before we pull the trigger.

Leaderboard design and performance in Oracle

I'm developing a game and I'm using a leaderboard to keep track of a player's score. There is also the requirement to keep track of about 200 additional statistics. These stats are things like: kills, deaths, time played, weapon used, achievements gained and so on.
What players will be interested in is the score, kills, deaths, and time played. All the other stats don't necessarily need to be shown in the game, but they should be accessible if I want to view them or compare them against other players. The expected number of players to be stored in this leaderboard table is about 2 million.
Currently the design is to store a player id together will all the stats in one table, for instance:
player_id,points,stat_1 .. stat_200,date_created,date_updated
If I want to show a sorted leaderboard based on points then I would have to put an index on points and do a sort on it with a select query and limit the results to return say 50 every time. There are also ideas to be able to have a player sort the leaderboard on a couple of other stats like time played or deaths up to a maximum of say 5 sortable stats.
The number of expected users playing the game is about 40k concurrently. Maybe a quarter of them (this is really a ballpark figure) will actively browse the leaderboard; the rest will just play the game and upload their scores when they are finished.
I have a number of questions about this approach below:
It seems (though I have my doubts) that the consensus is that leaderboards with millions of records that should be sortable on a couple of stats don't scale very well in an RDBMS. Is this correct?
Is sorting the leaderboard on points through a select query, assuming we have an index on it, going to be extremely slow, and if so, how can I work around this?
Should I split the storing of the additional stats that are not to be sorted into a separate table, or is there another, even better approach?
Is caching the sorted results in memory or in a separate table going to be needed, keeping the expected load in mind, and if so, which solutions or options should I consider?
If my approach is completely wrong and I would be better off doing things another way, please let me know; even options like NoSQL solutions in cloud hosting environments are open to be considered.
Cheers
1) With multiple indexes it will become more costly to update the table. It all boils down to how often each player's stats are written to the db.
2) It will be very fast as long as the indexes are small enough to fit into RAM. After that, performance takes a big hit.
3) Sometimes you can gain performance if you add all the fields you need to the index, because then the DBMS doesn't need to access the table at all. This approach has the highest probability of working if the accessed fields are small compared to the size of a row.
4) Oracle will probably be good at doing the caching for you, but if you have a massive load of users all doing the same query it is probably better to run that query regularly and store the result in memory (or a memory-mapped file).
For instance, if the high-score list is accessed 50 times/second, you can decrease the load caused by that query by 99% by re-running it and dumping the result every 2 seconds.
My advice on this is: don't do it unless you need it. Measure the performance first, and add it if necessary.
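Here is a hedged, Oracle-flavoured sketch of points 3 and 4 (the `leaderboard` table and its columns are assumptions, not from the question): a covering index lets the top-50 query be answered from the index alone, and a periodically refreshed materialized view is one way to "run that query regularly and store the result".

```sql
-- Covering index: everything the leaderboard page shows is in the index,
-- so the top-N query never needs to touch the table.
CREATE INDEX ix_leaderboard_points
    ON leaderboard (points DESC, player_id, kills, deaths, time_played);

-- Top 50 players, read from the index in sorted order.
SELECT *
FROM (SELECT player_id, points, kills, deaths, time_played
      FROM leaderboard
      ORDER BY points DESC)
WHERE ROWNUM <= 50;

-- One way to "dump it every 2 seconds": a complete-refresh materialized view.
-- The interval only mirrors the example above; tune it to the real load.
CREATE MATERIALIZED VIEW top50_leaderboard_mv
    REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 2 / 86400
AS
SELECT *
FROM (SELECT player_id, points, kills, deaths, time_played
      FROM leaderboard
      ORDER BY points DESC)
WHERE ROWNUM <= 50;
```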
I've been working on a game with a leaderboard myself recently, using MS SQL Server rather than Oracle, and though the number of records and players aren't the same, here's what I've learnt - in answer to your questions:
As long as you have the right underlying hardware, creating a leaderboard with millions of records and sorting on score etc. should work just fine - databases are really, really efficient at querying and sorting based on indexes.
No, it will be fast.
I see no reason to partition into other tables - you'll have to join to those tables to retrieve the data, and that will incur a performance penalty. Though this might be the issue the normalization comment was aimed at.
I assume you will need to include caching to reach the scale you mention; I wouldn't cache in the database layer (your table is effectively a denormalized, flat record already - I don't think you can partition it much more). Not sure what other layers you've got, but I'd look at how "cacheable" your data is (sounds like leaderboards are fairly static), and cache either in the layer immediately above the database, or add something like ehcache to the mix.
General points:
I'd try it out to get a feel for how it would work. Use something like dbmonster to populate a test system with millions of records, and query against that puppy to get a feel for what works and doesn't.
Once you have that up and running, I'd invest in some more serious load and performance testing before deciding to add caching etc. - the more complex you make the architecture, the harder it is to debug, the more costly it is to build, and the more there is to go wrong. So, only add caching if you really need to because you can prove - through load and performance tests - that you can't meet your response time goals.
Whilst it's true that adding indexes to a table slows down insert/update/delete statements, in most cases that's a negligible penalty - I'd definitely not worry too much about it at this stage.
I don't like tables having hundreds of columns to begin with, but it could be OK. Personally, I would prefer having a separate ID table and a scores table holding the ID, score type and value, both indexed only on the ID columns. If you organize them as a cluster (sketched below), the parent and child records are all fetched in one IO.
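A minimal sketch of that ID-table / scores-table cluster layout in Oracle (all names are invented for illustration): both tables are stored in the same cluster keyed on the player ID, so a player's parent row and all of its stat rows end up in the same blocks.

```sql
CREATE CLUSTER player_cluster (player_id NUMBER(10))
    SIZE 1024;  -- rough bytes per player; tune to the real row sizes

CREATE INDEX ix_player_cluster ON CLUSTER player_cluster;

CREATE TABLE player (
    player_id NUMBER(10) PRIMARY KEY,
    name      VARCHAR2(100)
) CLUSTER player_cluster (player_id);

-- One row per (player, stat) instead of 200 stat columns.
CREATE TABLE player_stat (
    player_id  NUMBER(10)   NOT NULL REFERENCES player (player_id),
    stat_type  VARCHAR2(50) NOT NULL,
    stat_value NUMBER,
    PRIMARY KEY (player_id, stat_type)
) CLUSTER player_cluster (player_id);
```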
The number of transactions you mention calls for some scalability. You have no real idea about the load yet. I assume there is some application server [farm] that handles the requests.
That is a good fit for the Oracle In-Memory Database Cache option (see "result caches ..... what about heavily modified data"). This is a smart way of caching your Oracle data on the application server. You create a cache grid consisting of at least one grid member, and for best performance you combine them with the application server[s]. When you add an application server, you automatically add a Cache Grid member. It works very well; it is the good old TimesTen technology integrated with the database.
You can make that combination, but you don't have to. If you don't, you don't get top performance, but you are more flexible in the number of grid members.
meh - millions of records? not a big table.
I'd just create the table (avoid the "stat_1, stat_2" naming - give them their proper names, e.g. "score", "kill_count", etc.), add indexes with leading columns on what the users are most likely to want to sort on (that way Oracle can avoid a sort by using the index to access the table in sorted order).
If the number of stats grows too large, you could "partition" it vertically - e.g. keep the most frequently accessed stats in one table, then have one or more other tables holding the extra stats (sketched below). Each table would have an identical primary key.
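A hedged sketch of that vertical split (all column names are invented, following the advice above to use proper names): the hot, sortable stats live in the main table with a descending index on the sort key, and the rarely read stats sit in a 1:1 side table sharing the same primary key.

```sql
CREATE TABLE player_stats (
    player_id    NUMBER(10) PRIMARY KEY,
    score        NUMBER NOT NULL,
    kill_count   NUMBER NOT NULL,
    death_count  NUMBER NOT NULL,
    time_played  NUMBER NOT NULL,
    date_created DATE,
    date_updated DATE
);

-- Leading DESC column on the sort key, so the leaderboard can be read in order
-- from the index instead of sorting ~2 million rows per page view.
CREATE INDEX ix_player_stats_score ON player_stats (score DESC, player_id);

-- The long tail of rarely read stats, vertically "partitioned" into a 1:1 table
-- that shares the primary key.
CREATE TABLE player_stats_extra (
    player_id         NUMBER(10) PRIMARY KEY REFERENCES player_stats (player_id),
    favourite_weapon  VARCHAR2(100),
    achievement_count NUMBER
    -- ... remaining stats ...
);
```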