Best (NoSQL?) DB for small docs/records, unchanging data, lots of writes, quick reads?

I found a few questions in the same vein as this, but they did not include much detail on the nature of the data being stored, how it is queried, etc... so I thought this would be worthwhile to post.
My data is very simple, three fields:
- a "datetimestamp" value (date/time)
- two strings, "A" and "B", both < 20 chars
My application is very write-heavy (hundreds per second). All writes are new records; once inserted, the data is never modified.
Regular reads happen every few seconds, and are used to populate some near-real-time dashboards. I query against the date/time value and one of the string values. e.g. get all records where the datetimestamp is within a certain range and field "B" equals a specific search value. These queries typically return a few thousand records each.
Lastly, my database does not need to grow without limit; I would be looking at purging records that are 10+ days old either by manually deleting them or using a cache-expiry technique if the DB supported one.
I initially implemented this in MongoDB, without being aware of the way it handles locking (writes block reads). As I scale, my queries are taking longer and longer (30+ seconds now, even with proper indexing). Now with what I've learned, I believe that the large number of writes are starving out my reads.
I've read the kkovacs.eu post comparing various NoSQL options, and while I learned a lot I don't know if there is a clear winner for my use case. I would greatly appreciate a recommendation from someone familiar with the options.
Thanks in advance!

I have faced a problem like this before in a system recording process control measurements. This was done with 5 MHz IBM PCs, so it is definitely possible. The use cases were more varied (summarization by minute, hour, eight-hour shift, day, week, month, or year), so the system recorded all the raw data but also aggregated it on the fly for the most common queries (which were five-minute averages). In the case of your dashboard, it seems like five-minute aggregation is also a major goal.
Maybe this could be solved by writing a pair of text files for each input stream: one with all the raw data, another with the multi-minute aggregation. The dashboard would ignore the raw data. A database could be used, of course, to do the same thing, but simplifying the application could mean no RDB is needed. That would be simpler to engineer and maintain, easier to fit on a microcontroller or embedded system, and a friendlier neighbor on a shared host.
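As a rough illustration of the raw-plus-aggregate idea, here is a minimal Python sketch. It assumes the three-field records from the question (timestamp, "A", "B") and aggregates by counting records per five-minute bucket and "B" value, since those records carry no numeric measurement to average. The names StreamRecorder and bucket_start are made up for this example.

```python
import io
from collections import Counter
from datetime import datetime


def bucket_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its five-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


class StreamRecorder:
    """Appends raw records to one stream and keeps per-bucket counts for another."""

    def __init__(self, raw_out):
        self.raw_out = raw_out   # any writable text stream, e.g. an open file
        self.counts = Counter()  # (bucket start, B) -> number of records

    def record(self, ts: datetime, a: str, b: str) -> None:
        # Raw data: one tab-separated line per record, append-only.
        self.raw_out.write(f"{ts.isoformat()}\t{a}\t{b}\n")
        # Aggregate: updated on the fly, so the dashboard never scans raw data.
        self.counts[(bucket_start(ts), b)] += 1

    def write_summary(self, summary_out) -> None:
        for (bucket, b), n in sorted(self.counts.items()):
            summary_out.write(f"{bucket.isoformat()}\t{b}\t{n}\n")


raw = io.StringIO()  # stands in for the raw-data file
recorder = StreamRecorder(raw)
recorder.record(datetime(2024, 1, 1, 12, 3, 10), "x", "k")
recorder.record(datetime(2024, 1, 1, 12, 4, 59), "y", "k")
recorder.record(datetime(2024, 1, 1, 12, 5, 0), "z", "k")

summary = io.StringIO()  # stands in for the aggregate file
recorder.write_summary(summary)
print(summary.getvalue(), end="")
```

The dashboard would then read only the summary stream, which grows by at most one line per bucket and "B" value regardless of the write rate.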

Choosing the right NoSQL product is not an easy task. I would suggest learning more about NoSQL before making your choice, so that you don't end up relying on someone else's suggestions or favorites.
There is a good book that gives solid background on NoSQL; anyone starting out with NoSQL should read it.
http://www.amazon.com/Professional-NoSQL-Wrox-Programmer/dp/047094224X
I hope reading some of the chapters in the book will really help you. There are comparisons and explanations of what is good for which job, and a lot more.
Good luck.

Related

Database design for: Very hierarchical data; off-server subset caching for processing; small to moderate size; (complete beginner)

I found myself with a project (very relaxed, little to no consequences on failure) that I think requires a database of some sort. The problem is that, while I'm still quite inexperienced in general, I've never touched any database beyond the tutorials I could dig up with Google and setting up your average home cloud. I got myself stuck on not knowing what I do not know.
That's about the situation:
Several hundred different automated test-systems will frequently write small amounts of data into a database over a slow network. A few users will then infrequently get large subsets of that data from the database over a slow network. The data will then be processed, which will require a large number of reads; very high performance is desired at this point.
This will be the data (in order of magnitudes):
1000 products containing
10 variants containing
100 batches containing
100 objects containing
10 test-systems containing
100 test-steps containing
10 entries
It is basically a labeled B-tree with the test-steps as leaf nodes (since their format has been standardized).
A batch will always belong to one variant, an object will always belong to the same variant (but possibly multiple batches), and a variant will always belong to one product. There are hundreds of thousands of different test-steps.
Possible queries will try to get (e.g.):
Everything from a batch (optional: and the value of an entry within a range)
Everything from a variant
All test-steps of the type X and Y from a test-system with the name Z
As far as I can tell, rows hundreds of thousands of columns wide (containing everything described above) do not seem like a good idea, and neither do about a trillion rows (and the middle ground between the two still seems quite extreme).
I'd really like to leverage the hierarchical nature of the data, but all I found on something like nested databases is that they're simply not a thing.
It'd be nice if you could help me with:
What to search for
What'd be a good approach to structure and store this data
Some place I can learn about avoiding the SQL horror stories even I've found plenty of
If there is a great way / best practice I should know of for transmitting the queried data and caching it locally for processing
Thank you and have a lovely day
Andreas
Search for "database normalization".
A normalized relational database is a fine structure.
If you want to avoid the horrors of SQL, you could also try a NoSQL document-oriented database, like MongoDB. I actually prefer this kind of database in a great many scenarios.
The database will keep frequently accessed data in memory, and of course, whichever tool you use to query the database will hold the results in the tool's memory (or at least a subset of them if the number of results is very large). You can also write your results to a file. There are many ways to "cache", and they are all useful in different situations.
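To make the normalization suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names are assumptions chosen for this example, the ten entries per test-step are collapsed into a single value column for brevity, and a real design would depend on the actual queries.

```python
import sqlite3

SCHEMA = """
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE variant (id INTEGER PRIMARY KEY,
                      product_id INTEGER NOT NULL REFERENCES product(id),
                      name TEXT NOT NULL);
CREATE TABLE batch (id INTEGER PRIMARY KEY,
                    variant_id INTEGER NOT NULL REFERENCES variant(id),
                    name TEXT NOT NULL);
CREATE TABLE test_object (id INTEGER PRIMARY KEY,
                          variant_id INTEGER NOT NULL REFERENCES variant(id));
-- An object may belong to several batches, so that link gets its own table.
CREATE TABLE object_batch (object_id INTEGER NOT NULL REFERENCES test_object(id),
                           batch_id INTEGER NOT NULL REFERENCES batch(id),
                           PRIMARY KEY (object_id, batch_id));
-- One narrow row per test-step instead of one very wide row per object.
CREATE TABLE test_step (id INTEGER PRIMARY KEY,
                        object_id INTEGER NOT NULL REFERENCES test_object(id),
                        test_system TEXT NOT NULL,
                        step_type TEXT NOT NULL,
                        value REAL);
CREATE INDEX idx_step_object ON test_step (object_id);
CREATE INDEX idx_step_system_type ON test_step (test_system, step_type);
"""


def steps_for_batch(conn, batch_id, low, high):
    """Return every test step of a batch whose value lies in [low, high]."""
    return conn.execute(
        """SELECT s.object_id, s.test_system, s.step_type, s.value
           FROM test_step AS s
           JOIN object_batch AS ob ON ob.object_id = s.object_id
           WHERE ob.batch_id = ? AND s.value BETWEEN ? AND ?
           ORDER BY s.id""",
        (batch_id, low, high),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO product VALUES (1, 'P')")
conn.execute("INSERT INTO variant VALUES (1, 1, 'V')")
conn.execute("INSERT INTO batch VALUES (1, 1, 'B1')")
conn.execute("INSERT INTO test_object VALUES (1, 1)")
conn.execute("INSERT INTO object_batch VALUES (1, 1)")
conn.executemany(
    "INSERT INTO test_step (object_id, test_system, step_type, value) VALUES (?, ?, ?, ?)",
    [(1, "Z", "X", 0.5), (1, "Z", "Y", 2.5)],
)
print(steps_for_batch(conn, 1, 0.0, 1.0))
```

The hierarchy lives in the foreign keys rather than in nested storage, so "everything from a batch" or "everything from a variant" becomes a join over a few tables rather than a scan of one enormous one.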

When to use a query or code [closed]

I am asking for a concrete case for Java + JPA / Hibernate + Mysql, but I think you can apply this question to a great number of languages.
Sometimes I have to perform a query on a database to get some entities, such as employees. Let's say you need some specific employees (the ones with 'John' as their first name). Would you rather run a query returning this exact set of employees, or would you prefer to fetch all the employees and then use a programming language to pick out the ones you are interested in? Why (ease, efficiency)?
Which is (in general) more efficient?
Is one approach better than the other depending on the table size?
Considering:
Same complexity, reusability in both cases.
Always do the query on the database. If you do not, you have to copy more data over to the client; databases are also written to filter data efficiently and will almost certainly be more efficient than your code.
The only exception I can think of is if the filter condition is computationally complex and you can spread the calculation over more CPU power than the database has.
In the cases where I have had a database, the server has had more CPU power than the clients, so unless it is overloaded it will simply run the query more quickly for the same amount of code.
Also, you have to write less code to do the query on the database using Hibernate's query language than to manipulate the data on the client. Hibernate queries will also make use of any client caching in the configuration without you having to write more code.
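The difference between the two approaches can be sketched as follows. This uses Python's built-in sqlite3 module purely for illustration rather than Hibernate, and the employee table and function names are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT)")
conn.execute("CREATE INDEX idx_employee_firstname ON employee (firstname)")
conn.executemany(
    "INSERT INTO employee (firstname, lastname) VALUES (?, ?)",
    [("John", "Smith"), ("Mary", "Jones"), ("John", "Doe"), ("Ann", "Lee")],
)


def johns_from_database(conn):
    """Let the database filter: only matching rows cross the connection."""
    return conn.execute(
        "SELECT id, firstname, lastname FROM employee WHERE firstname = ? ORDER BY id",
        ("John",),
    ).fetchall()


def johns_from_code(conn):
    """Fetch every row, then filter on the client."""
    rows = conn.execute(
        "SELECT id, firstname, lastname FROM employee ORDER BY id"
    ).fetchall()
    return [row for row in rows if row[1] == "John"]


print(johns_from_database(conn))
print(johns_from_code(conn))
```

Both functions return the same rows, but the second one transfers the whole table first, and its cost grows with the table size rather than with the number of matches.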
There is a general trick often used in programming - paying with memory for operation speedup. If you have lots of employees, and you are going to query a significant portion of them, one by one (say, 75% will be queried at one time or the other), then query everything, cache it (very important!), and complete the lookup in memory. The next time you query, skip the trip to RDBMS, go straight to the cache, and do a fast look-up: a roundtrip to a database is very expensive, compared to an in-memory hash lookup.
On the other hand, if you are accessing a small portion of employees, you should query just one employee: data transfer from the RDBMS to your program takes a lot of time, a lot of network bandwidth, a lot of memory on your side, and a lot of memory on the RDBMS side. Querying lots of rows to throw away all but one never makes sense.
In general, I would let the database do what databases are good at. Filtering data is something databases are really good at, so it would be best left there.
That said, there are some situations where you might just want to grab all of them and do the filtering in code though. One I can think of would be if the number of rows is relatively small and you plan to cache them in your app. In that case you would just look up all the rows, cache them, and do subsequent filtering against what you have in the cache.
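A minimal sketch of the "load once, then look up in memory" approach, again using Python's sqlite3 for illustration. EmployeeCache is a hypothetical helper; it deliberately ignores invalidation, so it only suits data that rarely changes.

```python
import sqlite3


class EmployeeCache:
    """Loads the whole table once, then answers lookups from memory."""

    def __init__(self, conn):
        self._conn = conn
        self._by_id = None
        self.database_trips = 0  # counts round trips, for illustration only

    def get(self, employee_id):
        if self._by_id is None:
            # First lookup: one query for everything, kept as an id -> name dict.
            rows = self._conn.execute("SELECT id, firstname FROM employee").fetchall()
            self.database_trips += 1
            self._by_id = dict(rows)
        # Later lookups are plain hash lookups with no database round trip.
        return self._by_id.get(employee_id)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, firstname TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "John"), (2, "Mary")])

cache = EmployeeCache(conn)
print(cache.get(1), cache.get(2), cache.database_trips)
```

However many lookups follow, the database is contacted only once, which is where the trade of memory for speed pays off.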
It's situational. I think in general, it's better to use SQL to get the exact result set.
The problem with loading all the entities and then searching programmatically is that you have to load all the entities, which could take a lot of memory. Additionally, you then have to search all the entities. Why do that when you can leverage your RDBMS and get the exact results you want? In other words, why load a large dataset that could use too much memory, then process it, when you can let your RDBMS do the work for you?
On the other hand, if you know the size of your dataset is not too large, you can load it into memory and then query it -- this has the advantage that you don't need to go to the RDBMS, which might or might not require going over your network, depending on your system architecture.
However, even then, you can use various caching utilities so that the common query results are cached, which removes the advantage of caching the data yourself.
Remember that your approach should scale over time. What may be a small data set now could turn into a huge data set later. We had an issue with a programmer who coded the application to query the entire table and then run manipulations on it. The approach worked fine when there were only 100 rows with two subselects, but as the data grew over the years, the performance issues became apparent. Even adding a date filter to query only the last 365 days could help your application scale better.
-- if you are looking for an answer specific to Hibernate, check @Mark's answer
Given the Employee example, and assuming the number of employees can scale over time, it is better to query the database for the exact data.
However, if you are considering something like Department (for example), where the chances of the data growing rapidly is less, it is useful to query all of them and have in memory - this way you don't have to reach to the external resource (database) every time, which could be costly.
So the general parameters are these:
scaling of data
criticality to business
volume of data
frequency of usage
To put it simply: when the data is not going to grow quickly, is not mission critical, fits in memory on the application server, and is used frequently, bring it all over and filter it programmatically if needed.
Otherwise, get only the specific data.
What is better: to store a lot of food at home or buy it little by little? When you travel a lot? Just when hosting a party? It depends, doesn't it? Similarly, the best approach is a matter of performance optimization. That involves a lot of variables. The art is both to avoid painting yourself into a corner when designing your solution and to optimize later, when you know your real bottlenecks. A good starting point is here: en.wikipedia.org/wiki/Performance_tuning One thing that could be more or less universally helpful: encapsulate your data access well.

web application receiving millions of requests and leads to generating millions of row inserts per 30 seconds in SQL Server 2008

I am currently addressing a situation where our web application receives at least a million requests per 30 seconds. These requests lead to 3-5 million row inserts across 5 tables. This is a pretty heavy load to handle. Currently we are using multithreading to handle this situation (which is a bit faster, but we are unable to get better CPU throughput). However, the load will definitely increase in the future and we will have to account for that too. Six months from now we expect double the load we are currently receiving, and I am looking for a possible new solution that is scalable and makes it easy to accommodate any further increase in this load.
Currently, multithreading makes the whole debugging scenario quite complicated, and sometimes we have problems tracing issues.
FYI we are already utilizing the SQL Bulk Insert/Copy that is mentioned in this previous post
Sql server 2008 - performance tuning features for insert large amount of data
However I am looking for a more capable solution (which I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also the solution should have a better utilization of the threads and processes. And I do not want my threads/processes to even wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert...however most of them will lead to some SQL operation. The application performs different types of transactions and these will lead to a lot of bulk SQL operations. I am more concerned about inserts and updates.
These operations need not be real time; there can be a bit of lag...however, processing them in real time would be very helpful.
I think your problem is mostly about getting better CPU throughput, which will lead to better performance. So I would probably look at something like asynchronous processing, wherein a thread never sits idle, and you will probably have to maintain a queue in the form of a linked list or any other data structure that suits your programming model.
The way this would work is your threads will try to perform a given job immediately and if there is anything that would stop them from doing it then they will push that job into the queue and these pushed items will be processed based on how it stores the items in the container/queue.
In your case since you are already using bulk sql operations you should be good to go with this strategy.
lemme know if this helps you.
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition to the data by client or geography or some other factor?
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high performance design including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need to have your inserts/updates in real time, you might consider having two databases: one for reads and one for writes. This is similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that would move data over to the read database at certain time intervals. When I have seen this in the past, the data is usually moved over on a nightly basis when few people will be using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.
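The split described above can be sketched in a few lines. This is only an illustration of the idea under simplifying assumptions: it uses two in-memory SQLite databases via Python's sqlite3 instead of SQL Server and SSIS, a made-up event table, and a publish step that copies new inserts only (updates are not handled).

```python
import sqlite3


def open_write_db():
    """Write side: no secondary indexes, so inserts stay cheap."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")
    return db


def open_read_db():
    """Read side: indexed for queries, refreshed only when publish() runs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")
    db.execute("CREATE INDEX idx_event_payload ON event (payload)")
    return db


def publish(write_db, read_db):
    """Copy rows not yet present in the read database; return how many were copied."""
    last_id = read_db.execute("SELECT COALESCE(MAX(id), 0) FROM event").fetchone()[0]
    rows = write_db.execute(
        "SELECT id, payload FROM event WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    with read_db:  # one transaction for the whole batch
        read_db.executemany("INSERT INTO event (id, payload) VALUES (?, ?)", rows)
    return len(rows)


write_db, read_db = open_write_db(), open_read_db()
with write_db:
    write_db.executemany("INSERT INTO event (payload) VALUES (?)", [("a",), ("b",), ("c",)])

# Until publish() runs, readers see stale (here: empty) data.
stale_count = read_db.execute("SELECT COUNT(*) FROM event").fetchone()[0]
copied = publish(write_db, read_db)
print(stale_count, copied)
```

How often publish() runs sets how far out of date the read side may be, which is the first trade-off in the list above.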

Design a database with a lot of new data

I'm new to database design and need some guidance.
A lot of new data is inserted to my database throughout the day. (100k rows per day)
The data is never modified or deleted once it has been inserted.
How can I optimize this database for retrieval speed?
My ideas
Create two databases (possibly on different hard drives) and merge the two at night when traffic is low
Create some special indexes...
Your recommendation is highly appreciated.
UPDATE:
My database only has a single table.
100k/day is actually fairly low. 3M/month, 40M/year. You can store 10 years archive and not reach 1B rows.
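The estimate is easy to check. The short calculation below assumes a steady 100k rows every day, 30-day months, and 365-day years; the yearly total lands slightly below the rounded 40M quoted above, and the ten-year total stays well under one billion rows.

```python
ROWS_PER_DAY = 100_000  # insert rate stated in the question

rows_per_month = ROWS_PER_DAY * 30     # 30-day month assumed
rows_per_year = ROWS_PER_DAY * 365
rows_in_ten_years = rows_per_year * 10

print(f"per month:    {rows_per_month:,}")
print(f"per year:     {rows_per_year:,}")
print(f"in ten years: {rows_in_ten_years:,}")
```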
The most important thing to choose in your design will be the clustered key(s). You need to make sure that they are narrow and can serve all the queries your application will normally use. Any query that will end up in table scan will completely trash your memory by fetching in the entire table. So, no surprises there, your driving factor in your design is the actual load you'll have: exactly what queries will you be running.
A common problem (more often neglected than not) with any high insert rate is that eventually every row inserted will have to be deleted. Not acknowledging this is a pipe dream. The proper strategy depends on many factors, but probably the best bet is on a sliding window partitioning scheme. See How to Implement an Automatic Sliding Window in a Partitioned Table. This cannot be some afterthought, the choice for how to remove data will permeate every aspect of your design and you better start making a strategy now.
The best tip I can give, which all big sites use to speed up their websites, is:
CACHE CACHE CACHE
Use redis/memcached to cache your data, because memory is (blazingly) fast and disk I/O is expensive.
Queue writes
Also, for extra performance you could queue up the writes in memory for a little while before flushing them to disk, i.e. writing them to the SQL database. Of course, you then run the risk of losing data if you keep it in memory and your computer crashes or has a power failure.
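A minimal sketch of queued writes, using Python's sqlite3 for illustration. BufferedWriter, the event table, and the max_pending threshold are all assumptions made up for this example; anything still in the buffer is lost if the process dies before flush() runs, which is exactly the risk mentioned above.

```python
import sqlite3


class BufferedWriter:
    """Collects rows in memory and writes them in one batch per flush."""

    def __init__(self, conn, max_pending=500):
        self._conn = conn
        self._max_pending = max_pending
        self._pending = []

    def add(self, payload):
        self._pending.append((payload,))
        # Flush automatically once enough rows have piled up.
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self):
        """Write all pending rows in a single transaction; return the count."""
        if not self._pending:
            return 0
        with self._conn:  # commits on success, rolls back on error
            self._conn.executemany(
                "INSERT INTO event (payload) VALUES (?)", self._pending
            )
        written = len(self._pending)
        self._pending.clear()
        return written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")

writer = BufferedWriter(conn, max_pending=3)
for i in range(5):
    writer.add(f"row {i}")

stored_before_flush = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
```

A production version would also flush on a timer and on shutdown, so that a quiet period does not leave rows sitting in memory indefinitely.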
Context missing
Also I don't think you gave us much context!
What I think is missing is:
architecture.
What kind of server do you have: VPS or shared hosting?
What kind of Operating system does it have linux/windows/macosx
computer specifics like how much memory available, cpu etc.
I find your definition of data a bit vague. Could you attach a diagram or something that explains your domain a little? For example, something like this, made using http://yuml.me/
Your requirements are way too general. For MS SQL Server, 100k (more or less "normal") records per day should not be a problem if you have decent hardware. Obviously you want to write fast to the database, but you ask for optimization of retrieval performance. That does not match very well! ;-) Tuning a database is a special skill of its own, so you will never get the general answer you would like to have.

In terms of today's technology, are these meaningful concerns about data size?

We're adding extra login information to an existing database record on the order of 3.85KB per login.
There are two concerns about this:
1) Is this too much on-the-wire data added per login?
2) Is this too much extra data we're storing in the database per login?
Given today's technology, are these valid concerns?
Background:
We don't have concrete usage figures, but we average about 5,000 logins per month. We hope to scale to larger customers; however, that would still be in the 10's of 1000's per month, not 1000's per second.
In the US (our market) broadband has 60% market adoption.
Assuming you have ~80,000 logins per month, you would be adding ~ 3.75 GB per YEAR to your database table.
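The arithmetic behind that estimate, as a small sketch. It assumes decimal units (1 GB = 1,000,000 KB) and the 3.85 KB per login from the question; the function name gb_per_year is made up for the example.

```python
KB_PER_LOGIN = 3.85    # extra data per login, from the question
KB_PER_GB = 1_000_000  # decimal units assumed


def gb_per_year(logins_per_month: int) -> float:
    """Extra storage per year, in GB, for a given monthly login count."""
    return logins_per_month * KB_PER_LOGIN * 12 / KB_PER_GB


print(f"{gb_per_year(5_000):.2f} GB/year at 5,000 logins/month")
print(f"{gb_per_year(80_000):.2f} GB/year at 80,000 logins/month")
```

At the current 5,000 logins per month this is well under a gigabyte per year, and even at 80,000 per month it stays in the low single-digit gigabytes.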
If you are using a decent RDBMS like MySQL, PostgreSQL, SQLServer, Oracle, etc... this is a laughable amount of data and traffic. After several years, you might want to start looking at archiving some of it. But by then, who knows what the application will look like?
It's always important to consider how you are going to be querying this data, so that you don't run into performance bottlenecks. Without those details, I cannot comment very usefully on that aspect.
But to answer your concern, do not be concerned. Just always keep thinking ahead.
How many users do you have? How often do they have to log in? Are they likely to be on fast connections, or damp pieces of string? Do you mean you're really adding 3.85K per time someone logs in, or per user account? How long do you have to store the data? What benefit does it give you? How does it compare with the amount of data you're already storing? (i.e. is most of your data going to be due to this new part, or will it be a drop in the ocean?)
In short - this is a very context-sensitive question :)
Given that storage and hardware are SOOO cheap these days (relatively speaking of course) this should not be a concern. Obviously if you need the data then you need the data! You can use replication to several locations so that the added data doesn't need to move over the wire as far (such as a server on the west coast and the east coast). You can manage your data by separating it by state to minimize the size of your tables (similar to what banks do: choose the state as part of the login process so that they look to the right data store). You can use horizontal partitioning to minimize the number of records per table to keep your queries speedy. There are lots of ways to keep large data optimized. Also check into Lucene if you plan to do lots of reads on this data.
In terms of today's average server technology it's not a problem. In terms of your server technology it could be a problem. You need to provide more info.
In terms of storage, this is peanuts, although you want to eventually archive or throw out old data.
In terms of network (?) traffic, this is not much on the server end, but it will affect the speed at which your website appears to load and function for a good portion of customers. Although many have broadband, someone somewhere will try it on EDGE, on a modem, or while using BitTorrent heavily; your site will then appear slow or malfunction altogether, and you'll get loud complaints all over the web. Does it matter? If your users really need your service, they can surely wait; if you are developing the new Twitter, the increase in page load time is hardly acceptable.