Database design for: Very hierarchical data; off-server subset caching for processing; small to moderate size; (complete beginner) - sql

I found myself with a project (very relaxed, with little to no consequences for failure) that I think requires a database of some sort to solve. The problem is that, while I'm still quite inexperienced in general, I've never touched any database beyond the tutorials I could dig up with Google and setting up your average home cloud. I've got myself stuck on not knowing what I do not know.
This is roughly the situation:
Several hundred different automated test systems will frequently write small amounts of data into a database over a slow network. A few users will then infrequently pull large subsets of that data from the database over a slow network. The data will then be processed, which will require a large number of reads; very high performance is desired at this point.
This will be the data (orders of magnitude):
1,000 products, each containing
  10 variants, each containing
    100 batches, each containing
      100 objects, each containing
        10 test-systems, each containing
          100 test-steps, each containing
            10 entries
It is basically a labeled tree with the test-steps as leaf nodes (since their format has been standardized).
A batch will always belong to one variant, an object will always belong to one variant (but possibly multiple batches), and a variant will always belong to one product. There are hundreds of thousands of different test-steps.
Possible queries will try to get, e.g.:
Everything from a batch (optionally restricted to entries whose value lies within a range)
Everything from a variant
All test-steps of the type X and Y from a test-system with the name Z
As far as I can tell, rows that are hundreds of thousands of columns wide (containing everything described above) do not seem like a good idea, and neither does about a trillion rows (and the middle ground between the two still seems quite extreme).
I'd really like to leverage the hierarchical nature of the data, but everything I could find on, e.g., something like nested databases says that they're simply not a thing.
It'd be nice if you could help me with:
What to search for
What'd be a good approach to structure and store this data
Some place I can learn about avoiding the SQL horror stories even I've found plenty of
If there is a great way / best practice I should know of for transmitting the queried data and caching it locally for processing
Thank you and have a lovely day
Andreas

Search for "database normalization".
A normalized relational database is a fine structure.
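To make that concrete, here is a minimal, PostgreSQL-flavoured sketch of what a normalized layout for your hierarchy could look like. All names and types are assumptions, not a finished design:

    CREATE TABLE product (
        product_id  INT PRIMARY KEY,
        name        TEXT NOT NULL
    );

    CREATE TABLE variant (
        variant_id  INT PRIMARY KEY,
        product_id  INT NOT NULL REFERENCES product,
        name        TEXT NOT NULL
    );

    CREATE TABLE batch (
        batch_id    INT PRIMARY KEY,
        variant_id  INT NOT NULL REFERENCES variant,
        name        TEXT NOT NULL
    );

    CREATE TABLE object (
        object_id   BIGINT PRIMARY KEY,
        variant_id  INT NOT NULL REFERENCES variant,
        serial_no   TEXT
    );

    -- An object can appear in several batches, so the link is its own table.
    CREATE TABLE batch_object (
        batch_id    INT    NOT NULL REFERENCES batch,
        object_id   BIGINT NOT NULL REFERENCES object,
        PRIMARY KEY (batch_id, object_id)
    );

    CREATE TABLE test_run (
        run_id       BIGINT PRIMARY KEY,
        object_id    BIGINT NOT NULL REFERENCES object,
        system_name  TEXT   NOT NULL     -- the test-system that produced this run
    );

    CREATE TABLE test_step (
        test_step_id BIGINT PRIMARY KEY,
        run_id       BIGINT NOT NULL REFERENCES test_run,
        step_type    TEXT   NOT NULL
    );

    CREATE TABLE entry (
        entry_id     BIGINT PRIMARY KEY,
        test_step_id BIGINT NOT NULL REFERENCES test_step,
        value        DOUBLE PRECISION
    );

    -- "Everything from a batch, restricted to entries whose value is in a range":
    SELECT e.*
    FROM   batch_object bo
    JOIN   test_run  r  ON r.object_id    = bo.object_id
    JOIN   test_step ts ON ts.run_id      = r.run_id
    JOIN   entry     e  ON e.test_step_id = ts.test_step_id
    WHERE  bo.batch_id = 42
    AND    e.value BETWEEN 0.5 AND 1.5;

Each level of the hierarchy becomes its own table and queries walk it with joins; indexes on the foreign-key columns are what keep those joins fast at your row counts.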
If you want to avoid the horrors of SQL, you could also try a NoSQL document-oriented database, like MongoDB. I actually prefer this kind of database in a great many scenarios.
The database will cache your query results, and of course, whichever tool you use to query the database will cache the data in the tool's memory (or it will cache at least a subset of the query results if the number of results is very large). You can also write your results to a file. There are many ways to "cache", and they are all useful in different situations.
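As one concrete example of the write-your-results-to-a-file option: assuming you end up on PostgreSQL and use its psql client (other databases have similar export tools), the client-side \copy meta-command writes a query result to a file on your machine, which you can then feed to whatever does the processing. The query and file name below are placeholders that match the sketch schema above:

    -- Run inside psql on the machine that does the processing.
    \copy (SELECT e.* FROM entry e JOIN test_step ts ON ts.test_step_id = e.test_step_id WHERE ts.step_type = 'leak_test') TO 'leak_test_entries.csv' WITH (FORMAT csv, HEADER)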

Related

Pulling large quantities of data takes too long. Need a way to speed it up

I'm creating a client dashboard website that displays many different graphs and charts of different views of data in our database.
The data consists of records of medical patients and the companies they work for, for insurance purposes. The data is displayed as aggregate charts, but there is a filter feature on the page that the user can use to filter individual patient records. The fields they can filter by are:
Date range of the medical claim
Relationship to the insurance holder
Sex
Employer groups (user selects a number of different groups they work with, and can turn them on and off in the filter)
User Lists (the user of the site can create arbitrary lists of patients and save their IDs and edit them later). Either none, one, or multiple lists can be selected. There is also an any/all selector if multiple are chosen.
A set of filters that the user can define (with preset defaults) from other, more internally structured pieces of data. The user can customize up to three of them and can select any one, or none of them, and they return a list of patient IDs that is stored in memory until they're changed.
The problem is that loading the data can take a long time, with some pages taking from 30 seconds to a minute to load (the page is loaded first and the data is then downloaded as JSON via an AJAX call while a loading spinner is displayed). Some of the stored procedures we use are very complex, requiring multiple levels of nested queries. I've tried using the Query Analyzer to simplify them, but we've made all the recommended changes and it still takes a long time. Our database people have looked and don't see any other way to make the queries simpler while still getting the data that we need.
The way it's set up now, only changes to the date range and the employer groups cause the database to be hit again. The database never filters on any of the other fields. Any other changes to the filter selection are made on the front end. I tried changing the way it worked and sending all the fields to the back end for the database to filter on, and it ended up taking even longer, not to mention having to wait on every change instead of just a couple.
We're using MS SQL 2014 (SP1). My question is, what are our options for speeding things up? Even if it means completely changing the way our data is stored?
You don't provide any specifics - so this is pretty generic.
Speed up your queries - this is the best, easiest, least error-prone option. Modern hardware can cope with huge datasets and still provide sub-second responses. Post your queries, DDL, sample data and EXPLAINs to Stack Overflow - it's very likely you can get significant improvements.
Buy better hardware - if you really can't speed up the queries, figure out what the bottleneck is, and buy better hardware. It's so cheap these days that maxing out on SSDs, RAM and CPU will probably cost less than the time it takes to figure out how to deal with the less optimal routes below.
Caching - rather than going back to the database for everything, use a cache. Figure out how "up to date" your dashboards need to be, and how unique the data is, and cache query results if at all possible. Many development frameworks have first-class support for caching. The problem with caching is that it makes debugging hard - if a user reports a bug, are they looking at cached data? If so, is that cache stale - is it a bug in the data, or in the caching?
Pre-compute - if caching is not feasible, you can pre-compute data (a small sketch follows the last option below). For instance, when you create a new patient record, you could update the reports for "patient by sex", "patient by date", "patient by insurance co" etc. This creates a lot of work - and even more opportunity for bugs.
De-normalize - this is the nuclear option. Denormalization typically improves reporting speed at the expense of write speed, and at the expense of introducing lots of opportunities for bugs.
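Here is the sketch of the pre-compute option promised above, in SQL Server flavour; the table and column names are assumptions, not your schema. The idea is to keep a tiny report table up to date at write time so the dashboard never scans patient records for that chart:

    -- Running counts per sex, read directly by the "patients by sex" chart.
    CREATE TABLE report_patients_by_sex (
        sex           CHAR(1) PRIMARY KEY,
        patient_count INT NOT NULL
    );

    -- Maintained whenever a patient record is inserted (from a trigger or
    -- from the application code that performs the insert).
    UPDATE report_patients_by_sex
    SET    patient_count = patient_count + 1
    WHERE  sex = @NewPatientSex;   -- @NewPatientSex comes from the new record

The same pattern repeats for every pre-computed report, which is exactly where the extra work and the extra bug surface come from.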

Best (NoSQL?) DB for small docs/records, unchanging data, lots of writes, quick reads?

I found a few questions in the same vein as this, but they did not include much detail on the nature of the data being stored, how it is queried, etc... so I thought this would be worthwhile to post.
My data is very simple, three fields:
- a "datetimestamp" value (date/time)
- two strings, "A" and "B", both < 20 chars
My application is very write-heavy (hundreds per second). All writes are new records; once inserted, the data is never modified.
Regular reads happen every few seconds, and are used to populate some near-real-time dashboards. I query against the date/time value and one of the string values. e.g. get all records where the datetimestamp is within a certain range and field "B" equals a specific search value. These queries typically return a few thousand records each.
Lastly, my database does not need to grow without limit; I would be looking at purging records that are 10+ days old either by manually deleting them or using a cache-expiry technique if the DB supported one.
I initially implemented this in MongoDB, without being aware of the way it handles locking (writes block reads). As I scale, my queries are taking longer and longer (30+ seconds now, even with proper indexing). Now with what I've learned, I believe that the large number of writes are starving out my reads.
I've read the kkovacs.eu post comparing various NoSQL options, and while I learned a lot I don't know if there is a clear winner for my use case. I would greatly appreciate a recommendation from someone familiar with the options.
Thanks in advance!
I have faced a problem like this before in a system recording process control measurements. This was done with 5 MHz IBM PCs, so it is definitely possible. The use cases were more varied - summarization by minute, hour, eight-hour shift, day, week, month, or year - so the system recorded all the raw data, but it also aggregated it on the fly for the most common queries (which were five-minute averages). In the case of your dashboard, it seems like five-minute aggregation is also a major goal.
Maybe this could be solved by writing a pair of text files for each input stream: One with all the raw data; another with the multi-minute aggregation. The dashboard would ignore the raw data. A database could be used, of course, to do the same thing. But simplifying the application could mean no RDB is needed. Simpler to engineer and maintain, easier to fit on a microcontroller, embedded system, etc., or a more friendly neighbor on a shared host.
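If a database does stay in the picture, the same idea can be expressed as a periodic roll-up into five-minute buckets that the dashboard reads instead of the raw rows. This is a sketch in PostgreSQL syntax with made-up table names, not a MongoDB-specific answer:

    -- Roll the newest raw records up into five-minute buckets per value of B.
    INSERT INTO readings_5min (bucket_start, b_value, reading_count)
    SELECT to_timestamp(floor(extract(epoch FROM datetimestamp) / 300) * 300),
           b,
           count(*)
    FROM   readings_raw
    WHERE  datetimestamp >= now() - interval '5 minutes'   -- only the newest slice
    GROUP  BY 1, 2;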
Deciding on the right NoSQL product is not an easy task. I would suggest you learn more about NoSQL before making your choice, if you really want to make sure that you don't end up just trusting someone else's suggestions or favorites.
There is a good book which gives a really good background on NoSQL, and anyone who is starting out with NoSQL should read it.
http://www.amazon.com/Professional-NoSQL-Wrox-Programmer/dp/047094224X
I hope reading some of the chapters in the book will really help you. There are comparisons and explanations about what is good for which job, and a lot more.
Good luck.

Leaderboard design and performance in oracle

I'm developing a game and I'm using a leaderboard to keep track of a player's score. There is also the requirement to keep track of about 200 additional statistics. These stats are things like: kills, deaths, time played, weapon used, achievements gained and so on.
What players will be interested in is the score, kills, deaths, and time played. All the other stats are not necessarily needed in the game display but should be accessible if I want to view them or compare them against other players. The expected number of players to be stored in this leaderboard table is about 2 million.
Currently the design is to store a player id together will all the stats in one table, for instance:
player_id,points,stat_1 .. stat_200,date_created,date_updated
If I want to show a sorted leaderboard based on points then I would have to put an index on points and do a sort on it with a select query and limit the results to return say 50 every time. There are also ideas to be able to have a player sort the leaderboard on a couple of other stats like time played or deaths up to a maximum of say 5 sortable stats.
The expected number of users playing the game is about 40k concurrently. Maybe a quarter of them (this is really a ballpark figure) will actively browse the leaderboard; the rest will just play the game and upload their scores when they are finished.
I have a number of questions about this approach below:
It seems, though I have my doubts, that the consensus is that leaderboards with millions of records that should be sortable on a couple of stats don't scale very well in an RDBMS. Is this correct?
Is sorting the leaderboard on points through a select query, assuming we have an index on it, going to be extremely slow, and if so, how can I work around this?
Should I split up the storing of the additional stats that are not to be sorted in a separate table or is there another even better approach ?
Is caching the sorted results in memory or in a separate table going to be needed, keeping the expected load in mind, and if so which solutions or options should I consider ?
If my approach is completely wrong and I would be better of doing things like this in another way please let me know, even options like NoSQL solutions in cloud hosting environments are open to be considered.
Cheers
1) With multiple indexes it will become more costly to update the table. It all boils down to how often each player status is written to the db.
2) It will be very fast as long as the indexes are small enough to fit into RAM. After that, performance takes a big hit.
3) Sometimes you can gain performance if you add all the fields you need to the index, because then the DBMS doesn't need to access the table at all. This approach has the highest probability of working if the accessed fields are small compared to the size of a row.
4) Oracle will probably be good at doing the caching for you, but if you have a massive load of users all doing the same query it is probably better to run that query regularly and store the result in memory (or a memory-mapped file).
For instance, if the high-score list is accessed 50 times/second, you can decrease the load caused by that query by 99% by dumping it every 2 seconds.
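A database-side variant of that idea (the suggestion above keeps the cache in the application, which works just as well) is a small cache table that a scheduled job rewrites every few seconds, so the expensive sort runs once per interval instead of 50 times per second. Table and column names here are assumptions:

    CREATE TABLE leaderboard_cache (
        rnk       NUMBER,
        player_id NUMBER,
        points    NUMBER
    );

    -- Body of the periodic refresh job:
    DELETE FROM leaderboard_cache;

    INSERT INTO leaderboard_cache (rnk, player_id, points)
    SELECT ROWNUM, player_id, points
    FROM  (SELECT player_id, points
           FROM   player_stats
           ORDER  BY points DESC)
    WHERE  ROWNUM <= 50;

    COMMIT;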
My advice on this is: don't do it unless you need it. Measure the performance first, and add it if necessary.
I've been working on a game with a leaderboard myself recently, using MS SQL Server rather than Oracle, and though the number of records and players aren't the same, here's what I've learnt - in answer to your questions:
As long as you have the right underlying hardware, creating a leaderboard with millions of records and sorting on score etc. should work just fine - databases are really, really efficient at querying and sorting based on indexes.
No, it will be fast.
I see no reason to partition into other tables - you'll have to join to those tables to retrieve the data, and that will incur a performance penalty. Though this might be the issue the normalization comment was aimed at.
I assume you will need to include caching to reach the scale you mention; I wouldn't cache in the database layer (your table is effectively a denormalized, flat record already - I don't think you can partition it much more). Not sure what other layers you've got, but I'd look at how "cacheable" your data is (sounds like leaderboards are fairly static), and cache either in the layer immediately above the database, or add something like ehcache to the mix.
General points:
I'd try it out to get a feel for how it would work. Use something like dbmonster to populate a test system with millions of records, and query against that puppy to get a feel for what works and doesn't.
Once you have that up and running, I'd invest in some more serious load and performance testing before deciding to add caching etc. - the more complex you make the architecture, the harder it is to debug, the more costly it is to build, and the more there is to go wrong. So, only add caching if you really need to because you can prove - through load and performance tests - that you can't meet your response time goals.
Whilst it's true that adding indexes to a table slows down insert/update/delete statements, in most cases that's a negligible penalty - I'd definitely not worry too much about it at this stage.
To begin with, I don't like tables having hundreds of columns, but it could be OK. Personally, I would prefer having a separate ID table and a scores table holding the ID, score type, and value, both indexed only on the ID columns. If you organize them as a cluster, the parent and child records are all fetched in one I/O.
The number of transactions you mention calls for some scalability, but you have no real idea about the load yet. I assume there is some application server (or farm of them) that handles the requests.
That is a good fit for the Oracle In-Memory Database Cache option (see the documentation on result caches, and how they behave with heavily modified data). It is a smart way of caching your Oracle data on the application server. You create a cache grid, consisting of at least one grid member, and for best performance you combine them with the application server(s). When you add an application server, you automatically add a Cache Grid Member. It works very well; it is the good old TimesTen technology integrated with the database.
You can make that combination, but you don't have to. If you don't, you give up some top-end performance but are more flexible in the number of Grid Members.
meh - millions of records? not a big table.
I'd just create the table (avoid the "stat_1, stat_2" naming - give them their proper names, e.g. "score", "kill_count", etc.), add indexes with leading columns on what the users are most likely to want to sort on (that way Oracle can avoid a sort by using the index to access the table in sorted order).
If the number of stats grows too large, you could "partition" it vertically - e.g. have most of the most frequently accessed stats in one table, then have one or more other tables which have extra stats. Each table would have an identical primary key.
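A rough sketch of both suggestions combined (column names are examples, not a spec): the hot, sortable stats live in one table with a descending index, so Oracle can usually read the index in order and skip the sort for top-N queries, while the long tail of rarely-read stats lives in a second table sharing the same primary key:

    CREATE TABLE player_stats (
        player_id    NUMBER PRIMARY KEY,
        points       NUMBER,
        kill_count   NUMBER,
        death_count  NUMBER,
        time_played  NUMBER,
        date_created DATE,
        date_updated DATE
    );

    CREATE TABLE player_stats_extra (
        player_id    NUMBER PRIMARY KEY REFERENCES player_stats (player_id),
        achievements NUMBER,
        weapon_used  VARCHAR2(50)
        -- remaining, rarely-displayed stats as properly named columns
    );

    CREATE INDEX player_stats_points_ix ON player_stats (points DESC, player_id);

    -- Top-50 page of the leaderboard:
    SELECT player_id, points, kill_count, death_count, time_played
    FROM   player_stats
    ORDER  BY points DESC
    FETCH FIRST 50 ROWS ONLY;   -- Oracle 12c+; use the ROWNUM idiom on older versions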

Would this method work to scale out SQL queries?

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), with one partition placed on each node. The idea is to take an incoming SQL query and simply run it, exactly as it is, on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum for the nodes, and that the coordinator then divides the sum of the sums by the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model? I believe one issue would be the count distinct function.
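For concreteness, the AVG rewrite described above could look like this (the table and column names are invented for illustration):

    -- Sent unchanged to every node, except that AVG(amount) is rewritten:
    SELECT SUM(amount)   AS sum_amount,
           COUNT(amount) AS cnt_amount
    FROM   orders
    WHERE  region = 'EU';

    -- On the coordinator, over the union of the rows the nodes returned
    -- (collected here in a table called node_results):
    SELECT SUM(sum_amount) / NULLIF(SUM(cnt_amount), 0) AS avg_amount
    FROM   node_results;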
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your database is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset of the data, then the simple selection query will scale to the network (i.e. you will eventually become network bound to the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
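As a sketch of that pre-summarization (PostgreSQL syntax, table names invented): collapse the sub-second hit records into one row per URL per hour, and point the reports at the summary table instead of the raw one:

    -- Run once a day to summarize yesterday's raw hits.
    INSERT INTO hits_hourly (hit_hour, url, hit_count)
    SELECT date_trunc('hour', hit_time) AS hit_hour,
           url,
           count(*)                     AS hit_count
    FROM   hits_raw
    WHERE  hit_time >= current_date - 1
    AND    hit_time <  current_date
    GROUP  BY 1, 2;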
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only solution, but did you first try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with the Cube is once built it can be queried in lots of ways that you may not have considered. And as #HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query that results in a scan, then your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with #Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau: will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains from doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with pk/fk/constraints. If this were my world, I would probably create many indexed views on top of my table (and other views) that optimized for the particular queries I need to execute (which I have worked with successfully on 10million+ row tables.)
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.

real-time data warehouse for web access logs

We're thinking about putting up a data warehouse system to load with web access logs that our web servers generate. The idea is to load the data in real-time.
To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.
The question is how to balance and design the system so that:
(1) the data can be fetched and presented to the user in real time (<2 seconds),
(2) data can be aggregated on a per-hour and per-day basis, and
(3) as large an amount of data as possible can still be stored in the warehouse.
Our current data rate is roughly ~10 accesses per second, which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds when we have more than 8 million rows.
Is it possible to get real-time query performance from a "simple" data warehouse like this,
and still have it store a lot of data (it would be nice to be able to never throw away any data)
Are there ways to aggregate the data into higher resolution tables?
I've got a feeling that this isn't really a new question (I've googled quite a lot, though). Could someone maybe give pointers to data warehouse solutions like this? One that comes to mind is Splunk.
Maybe I'm grasping for too much.
UPDATE
My schema looks like this:
dimensions:
client (ip-address)
server
url
facts:
timestamp (in seconds)
bytes transmitted
Seth's answer above is a very reasonable answer and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.
Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.
Another technology that you might want to investigate would be MongoDB. It is a document store database that has a few features that make it potentially a great fit for this use case.
Namely, the capped collections (do a search for mongodb capped collections for more info)
And the fast increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
Doesn't sound like it would be a problem. MySQL is very fast.
For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.
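A sketch of that MyISAM-plus-merge-table setup, with a column layout guessed from the schema listed in the question (the monthly split and all names are assumptions):

    -- One MyISAM table per month keeps each table a manageable size.
    CREATE TABLE access_log_2010_01 (
        hit_time  DATETIME NOT NULL,
        client_ip VARCHAR(45),
        server    VARCHAR(64),
        url       VARCHAR(255),
        bytes     INT UNSIGNED,
        KEY hit_time_ix (hit_time)
    ) ENGINE=MyISAM;

    CREATE TABLE access_log_2010_02 LIKE access_log_2010_01;

    -- The merge table presents the monthly tables as one big logical table.
    CREATE TABLE access_log (
        hit_time  DATETIME NOT NULL,
        client_ip VARCHAR(45),
        server    VARCHAR(64),
        url       VARCHAR(255),
        bytes     INT UNSIGNED,
        KEY hit_time_ix (hit_time)
    ) ENGINE=MERGE UNION=(access_log_2010_01, access_log_2010_02) INSERT_METHOD=LAST;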
If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.
Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep it around. Even then, it probably won't be for more than 3-4 years.)
Also, look into partitioning, especially if your queries mostly access latest data; you could -- for example -- set-up weekly partitions of ~5.5M rows.
If aggregating per-day and per hour, consider having date and time dimensions -- you did not list them so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as fact tables.
With this in place, the query optimizer can use partition pruning, so the total size of tables does not influence the query response as before.
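A MySQL-flavoured sketch of that layout (names and the weekly boundaries are assumptions): the fact table carries a surrogate date key, is range-partitioned on it, and queries filter on that key directly instead of calling HOUR() or DATE(), which is what lets the optimizer prune partitions:

    CREATE TABLE dim_date (
        date_key     INT PRIMARY KEY,   -- e.g. 20100105 for 2010-01-05
        full_date    DATE NOT NULL,
        day_of_week  TINYINT,
        week_of_year TINYINT,
        month        TINYINT,
        year         SMALLINT
    );

    CREATE TABLE fact_access (
        date_key   INT NOT NULL,        -- matches dim_date (not enforced as an FK here)
        time_key   INT NOT NULL,        -- seconds since midnight, or a key into a time dimension
        client_key INT NOT NULL,
        server_key INT NOT NULL,
        url_key    INT NOT NULL,
        bytes      INT UNSIGNED
    )
    PARTITION BY RANGE (date_key) (
        PARTITION p2010_w01 VALUES LESS THAN (20100111),   -- one partition per week
        PARTITION p2010_w02 VALUES LESS THAN (20100118),
        PARTITION pmax      VALUES LESS THAN MAXVALUE
    );

    -- Pruning works because the filter is on the raw partition key:
    SELECT SUM(bytes)
    FROM   fact_access
    WHERE  date_key BETWEEN 20100104 AND 20100110;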
This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1 second response time (from database), over a second from web server. This isn't even on a huge server.
Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.
But you will want to mostly hit aggregate data rather than detail data, especially for time-series graphing over days, months, etc. Aggregate data can either be created periodically in your database through an asynchronous process, or, in cases like this, it typically works best if the ETL process that transforms your data also creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.
As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.
Regarding the database - if it were me I'd try Postgresql rather than MySQL. The reason is primarily optimizer maturity: postgresql can better handle the kinds of queries you're likely to run. MySQL is more likely to get confused on five-way joins, go bottom up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like db2, oracle, sql server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.