Is RavenDB Right for my Situation? - ravendb

I have an interesting situation where I'm near the end of an evaluation period for a RavenDB prototype for use with a project at our company. The reason it's interesting is that 99.99% of the time, I believe it fits Raven's sweet spot; it repeatedly queries for new data, often, and in small batches (< 1000 documents at a time).
However, we do have an initial load period, where we need to load two days' worth of data, which can be 3 million (or more) records in some cases.
A diagram might help:
It's the Transfer Service that is responsible for getting the correct data out of three production databases and storing it in RavenDB. The WCF service will query this data and make it available to its clients.
Once we do the initial load of millions of records/documents into RavenDB, we'll rarely have to do that again.
As an initial load test, on a machine with 4GB RAM and two processors, it took just over 23 minutes to read the initial data. In this case, it was only about 1.28 million records. I eliminated all async operations from this initial load, because I wanted each read to not be interfered with by other read operations. I found the best results this way.
I know it's not recommended, but to accomplish all this, I had to change settings that aren't recommended to be changed:
I had to increase the timeout:
documentStore.JsonRequestFactory.ConfigureRequest += (e, x) => ((HttpWebRequest)x.Request).Timeout = ravenTimeoutInMilliseconds;
In the Raven.Server.exe.config, I had to increase the page size (to int.MaxValue):
<add key="Raven/MaxPageSize" value="2147483647"/>
And in my retrieval methods, I had to use Take(int.MaxValue):
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
Remember this is all for that one-time, initial load. After that, it's many queries, quickly, and often. I should also note that each document is self-contained in RavenDB. There are no relationships to manage.
Knowing all this, is RavenDB a good fit?

A good fit for what?
Full text search? Yes. Background aggregations (map/reduce ones)? Yes. Easy replication and sharding, say scaling? Yes...
Ad-hoc reporting? No. Support for probably thousands of third party tools? No...
If you're talking about performance, you probably want to look at Orens latest post on that. His numbers are quite similar to your ones: http://ayende.com/blog/154913/ravendb-amp-freedb-an-optimization-story

From what I understand of your question, you need to "prep" the WCF web-service. To do this you read 1.2M docs from RavenDB (in about 23 mins) and hold them in memory, so the WCF service can then serve queries from them, is this right? Or am I missing something?
Why not get the WCF service to send it's queries to Raven one-at-a-time? I.e. for each query it gets from a Client, ask RavenDB to do the query for it?

From what you've told us in the other answers comments, I believe the only good way to serve the wcf clients fast enough, is to actually store everything in memory, so just the way you do it now.
The question, if RavenDB is a good fit for that situation depends on whether your data model benefits in others way from the document oriented nature. So, in case you have dynamic data that would require some kind of EAV in a relational databases and lots of joins, then RavenDB will probably be a very good solution. However, if you just need something you can throw flat data in, then I would go with a relational database here. In terms of licensing costs and ease of use, you might also want to take a look at PostgreSql, as this is a really awesome database that comes completely free.

Related

What database for crawler/scraper?

I am currently researching what database to use for a project I am working on. Hopefully you guys can give me some hints.
The project is an automated web crawler that checks websites as per a user's request, scrapes data under certain circumstances, and creates log files of what was done.
Requirements:
Only few tables with few columns; predefining columns is no problem
No overly complex associations between models
Huge amount of date & time based queries
Due to logging, database will grow rapidly and use up a lot of space
Should be able to scale over multiple servers
Fields contain mostly ids (int), strings (around 200-500 characters max), and unix timestamps
Two different types of servers will simultaneously read/write data directly to/from it:
One(/later more) rails app that takes user input and displays results upon request
One(/later more) Node.js server that functions as the executing crawler/scraper. It will have enough load to run continuously and make dozens of database queries every second.
I assume it will neither be a graph database (no complex associations), nor a memory based key/value store (too much data to hold in cached). I'm still on the fence for every other type of database I could find, each seems to have it's merits.
So, any advice from the pros how I should decide?
Thanks.
I would agree with Vladimir that you would want to consider a document-based database for this scenario. I am most familiar with MongoDB. My reasons for using it here are as follows:
Your 'schema requirements' of "only a few tables with few columns" fits well with the NoSQL nature of MongoDB.
Same as above for "no overly complex associations between nodes" -- you will want to decide whether you'd prefer nested documents or using dbref (I prefer the former)
Huge amount of time-based data (and other scaling requirements) - MongoDB scales well via sharding or partitioning
Read/write access - this is why I am recommending MongoDB over something like Hadoop. The interactive query requirement is best met by something other than a Hadoop-style store, as this type of storage is designed for batch (rather than interactive query) requirements.
Google built a database called "BigTable" for crawling, indexing and the search related business. They released a paper about it (google for "BigTable" if you're interested). There are several open source implementations for bigtable-like designs, one of them is Hypertable. We have a blog posting describing a crawler/indexer implementation (http://hypertable.com/blog/sehrchcom_a_structured_search_engine_powered_by_hypertable/) written by the guys from sehrch.com. And looking at your requirements: all of them are supported and are common use cases.
(disclaimer: i work for hypertable.)
Take a look at document-oriented database like a CouchDB or MongoDB.

How to go from a full SQL querying to something like a NoSQL?

In one of my process I have this SQL query that take 10-20% of the total execution time. This SQL query does a filter on my Database, and load a list of PricingGrid object.
So I want to improve these performance.
So far I guessed 2 solutions :
Use a NoSQL solution, AFAIK these are good solutions for improving reading process.
But the migration seems hard and needs a lot of work (like import the data from sql server to nosql in a regular basis)
I don't have any knowledge , I even don't know which one I should use (the first I'd use is Ravendb because I follow ayende and it's done by the .net community).
I might have some stuff to change in my model to make my object ok for a nosql database
Load all my PricingGrid object in memory (in a static IEnumerable)
This might be a problem when my server won't have enough memory to load everything
I might reinvent the wheel (indexes...) invented by the NoSQL providers
I think I'm not the first one wondering this, so what would be the best solution ? Is there any tools that could help me ?
.net 3.5, SQL Server 2005, windows server 2005
Migrating your data from SQL is only the first step.
Moving to a document store (like RavenDB or MongoDB) also means that you need to:
Denormalize your data
Perform schema validation in your code
Handle concurrency of complex operations in your code since you no longer have transactions (at least not the same way)
Perform rollbacks in the event of partial commits (changes)
Depending on your updates, reads and network model you might also need to handle conflicts
You provided very limited information but it sounds like your needs include a single database server and that your data fits well in the relational model.
In such a case I would vote against a NoSQL solution, it is more likely that you can speed up your queries with database optimizations and still retain all the added value of a RDBMS.
Non-relational databases are tools for a specific job (no matter how they sell them), if you need them it is usually because your data doesn't fit well in the relational model or if you have a need to distribute your data over multiple machines (size or availability). For instance, I use MongoDB for a write-intensive high throughput job management application. It is centralized and the data is very transient so the "cost" of having low durability is acceptable. This doesn't sound like the case for you.
If prefer to use a NoSQL solution perhaps you should try using Memcached+MySQL (InnoDB) this will allow you to get the speed benefits of an in-memory cache (in the form of a memcached daemon plugin) with the underlying protection and capabilities of an RDBMS (MySQL). It should also ease data migration and somewhat reduce the amount of changes required in your code.
I myself have never used it, I find that I either need NoSQL for the reasons I stated above or that I can optimize the RDBMS using stored procedures, indexes and table views in a way which is sufficient for my needs.
Asaf has provided great information in regards to the usage of NoSQL and when it is most appropriate. Given that your main concern was performance, I would tend to agree with his opinion - it would take you much more time and effort to adopt a completely new (and very different) data persistence platform than it would to trick out your SQL Server cluster. That said, my answer is mainly to address the "how" part of your question.
Addressing misunderstandings:
Denormalizing Data - You do not need to manually denormalize your existing data. This will be done for you when it is migrated over. More than anything you need to simply think about your data in a different fashion - root aggregates, entity and value types, etc.
Concurrency/Transactions - Transactions are possible in both Mongo and Raven, they are simply done in a different fashion. One of the inherent ways Raven does this is by using an ORM-like "unit of work" pattern with its RavenSession objects. Yes, your data validation needs to be done in code, but you already should be doing it there anyway. In my experience this is an over-hyped con.
How:
Install Raven or Mongo on a primary server, run it as a service.
Create or extend an existing application that uses the database you intend to port. This application needs all the model classes/libraries that your SQL database provides persistence for.
a. In your "data layer" you likely have a repository class somewhere. Extract an interface form this, and use it to build another repository class for your Raven/Mongo persistence. Both DB's have plenty good documentation for using their APIs to push/pull/update changes in the document graphs. It's pretty damn simple.
b. Load your SQL data into C# objects in memory. Pull back your top-level objects (just the entities) and load their inner collections and related data in memory. Your repository is probably already doing this (ex. when fetching an Order object, ensure not only its properties but associated collections like Items are loaded in memory.
c. Instantiate your Raven/Mongo repository and push the data to it. Primary entities become "top level documents" or "root aggregates" serialized in JSON, and their collections' data nested within. Save changes and close the repository. Note: You may break this step down into as many little pieces as your data deems necessary.
Once your data is migrated, play around with it and ensure you are satisfied. You may want to modify your application Models a little to adjust the way they are persisted to Raven/Mongo - for instance you may want to make both Orders and Items top-level documents and simply use reference values (much like relationships in RDBMS systems). Watch out here though, as doing so sort-of goes against the principal and performance behind NoSQL as now you have to tap the DB twice to get the Order and the Items.
If satisfied, shard/replicate your mongo/raven servers across your remaining available server boxes.
Obviously there are tons of little details I did not explain, but that is the general process, and much of it depends on the applications already consuming the database and may be tricky if more than one app/system talks to it.
Lastly, just to reiterate what Asaf said... learn as much as you can about NoSQL and its best use-cases. It is an amazing tool, but not golden solution for all data persistence. In your case try to really find the bottlenecks in your current solution and see if they are solvable. As one of my systems guys says, "technology for technology's sake is bullshit"

How to speed up serialization and transport for large object graphs; WCF 3.5 & SL3

I have a 3.5 SP1 project, WCF service which is limited to consumption by Silverlight 3 clients. Due to the business requirements we have to work with large object graphs that are hydrated via SQL Server on the WCF side and then sent to the Silverlight client. They are deep, you might have a class that has two collection properties and each item in the collection has collections inside of it. The fundamental design is what it is, something I inherited and must work within for the short term. How big are we talking? An example top level collection with 250 items once serialized is 14mb when coming across the wire using no modiciations (httpBinding and the DataContractSerializer). 250 items is small, the requirements we are facing require us to be able to work with 10,000+ items which given my limited math skills is well over 500mb to pull across the wire. No walk in the park - in fact - you could probably take a walk in the park while that churned away.
So there are several things we are considering, one is to move away from DataContractSerializer and use the XmlSerializer so we can move lots of these properties into attributes and reduce the payload size. We are also looking at Binary Xml bindings.
My question is this, what would you do? Can IIS compression play a role here? Moving away from the DCS a bad idea? Is there a better technique? Am I up a creek without a paddle?
Do you absolutely need to pull down all of that data in one shot? If this is a Silverlight app, I can't see you physically displaying 10,000+ (or even 250 for that matter) records of the size you describe. Is it possible to use paging to reduce the amount of data pulled across the wire?
Instead of having 10,000+, you could be showing 10-20 records per page, so the system only queries 10-20 depending on the circumstances.
Not sure if that is an option for you, given the requirements you describe, but that seems to be the most obvious solution to me.
Also, to marginally help your performance, have you made sure serialization assemblies are being generated for your business objects? This would optimize at least the serialization/deserialization piece of your code. It may not be a huge performance boost, but it will help.
As others have said, look if you need to transfer this amount of data.
The best way to send large amounts of data is to use streaming, see:
http://msdn.microsoft.com/en-us/library/ms733742.aspx

Highest Performance Database Storage Mechanism

I need ideas to implement a (really) high performance in-memory Database/Storage Mechanism. In the range of storing 20,000+ objects, with each object updated every 5 or so seconds. I would like a FOSS solution.
What is my best option? What are your experiences?
I am working primarily in Java, but I need the datastore to have good performance so the datastore solution need not be java centric.
I also need like to be able to Query these objects and I need to be able to restore all of the objects on program startup.
SQLite is an open-source self-contained database that supports in-memory databases (just connect to :memory:). It has bindings for many popular programming languages. It's a traditional SQL-based relational database, but you don't run a separate server – just use it as a library in your program. It's pretty quick. Whether it's quick enough, I don't know, but it may be worth an experiment.
Java driver.
are you updating 20K objects every 5 seconds or updating one of the 20K every 5 seconds?
What kind of objects? Why is a traditional RDBMS not sufficient?
Check out HSQLDB and Prevayler. Prevayler is a paradigm shift from traditional RDBMS - one which I have used (the paradigm, that is, not specifically Prevayler) in a number of projects and found it to have real merit.
Depends exactly how you need to query it, but have you looked into memcached?
http://www.danga.com/memcached/
Other options could include MySQL MEMORY Tables, the APC Cache if you're using PHP.
Some more detail about the project/requirements would be helpful.
An in-memory storage ?
1) a simple C 'malloc' array where all your structures would be indexed.
2) berkeleyDB: http://www.oracle.com/technology/products/berkeley-db/index.html. It is fast because you build your own indexes (secondary database) and there is no SQL expression to be evaluated.
Look at some of the products listed here: http://en.wikipedia.org/wiki/In-memory_database
What level of durability do you need? 20,000 updates every 5 seconds will probably be difficult for most IO hardware in terms of number of transactions if you write the data back to disc for every one.
If you can afford to lose some updates, you could probably flush it to disc every 100ms with no problem with fairly cheap hardware if your database and OS support doing that.
If it's really an in-memory database that you don't want to flush to disc often, that sounds pretty trivial. I've heard that H2 is pretty good, but SQLite may work as well. A properly tuned MySQL instance could also do it (But may be more convoluted)
Oracle TimesTen In-Memory Database. See: http://www.informationweek.com/whitepaper/Business-Intelligence/Datamarts-Data-Warehouses/oracle-timesten-in-memory-databas-wp1228511232361
Chronicle Map is an pure Java key-value store
it has really high performance, sustaining 1 million writes/second from a single thread. It's a myth that a fast database couldn't be written in Java.
Seamlessly stores and loads any serializable Java objects, provides a simple Map interface
LGPLv3
Since you don't have many "tables" a full-blown SQL database could be an overkill solution, indexes & queries could be implemented with a handful of distinct key-value stores which are updated manually by vanilla Java code. Chronicle Map provides mechanisms to make such updates concurrently isolated from each other, if you need it.

PostgreSQL performance monitoring tool

I'm setting up a web application with a FreeBSD PostgreSQL back-end. I'm looking for some database performance optimization tool/technique.
Database optimization is usually a combination of two things
Reduce the number of queries to the database
Reduce the amount of data that needs to be looked at to answer queries
Reducing the amount of queries is usually done by caching non-volatile/less important data (e.g. "Which users are online" or "What are the latest posts by this user?") inside the application (if possible) or in an external - more efficient - datastore (memcached, redis, etc.). If you've got information which is very write-heavy (e.g. hit-counters) and doesn't need ACID-semantics you can also think about moving it out of the Postgres database to more efficient data stores.
Optimizing the query runtime is more tricky - this can amount to creating special indexes (or indexes in the first place), changing (possibly denormalizing) the data model or changing the fundamental approach the application takes when it comes to working with the database. See for example the Pagination done the Postgres way talk by Markus Winand on how to rethink the concept of pagination to make it more database efficient
Measuring queries the slow way
But to understand which queries should be looked at first you need to know how often they are executed and how long they run on average.
One approach to this is logging all (or "slow") queries including their runtime and then parsing the query log. A good tool for this is pgfouine which has already been mentioned earlier in this discussion, it has since been replaced by pgbadger which is written in a more friendly language, is much faster and more actively maintained.
Both pgfouine and pgbadger suffer from the fact that they need query-logging enabled, which can cause a noticeable performance hit on the database or bring you into disk space troubles on top of the fact that parsing the log with the tool can take quite some time and won't give you up-to-date insights on what is going in the database.
Speeding it up with extensions
To address these shortcomings there are now two extensions which track query performance directly in the database - pg_stat_statements (which is only helpful in version 9.2 or newer) and pg_stat_plans. Both extensions offer the same basic functionality - tracking how often a given "normalized query" (Query string minus all expression literals) has been run and how long it took in total. Due to the fact that this is done while the query is actually run this is done in a very efficient manner, the measurable overhead was less than 5% in synthetic benchmarks.
Making sense of the data
The list of queries itself is very "dry" from an information perspective. There's been work on a third extension trying to address this fact and offer nicer representation of the data called pg_statsinfo (along with pg_stats_reporter), but it's a bit of an undertaking to get it up and running.
To offer a more convenient solution to this problem I started working on a commercial project which is focussed around pg_stat_statements and pg_stat_plans and augments the information collected by lots of other data pulled out of the database. It's called pganalyze and you can find it at https://pganalyze.com/.
To offer a concise overview of interesting tools and projects in the Postgres Monitoring area i also started compiling a list at the Postgres Wiki which is updated regularly.
pgfouine works fairly well for me. And it looks like there's a FreeBSD port for it.
I've used pgtop a little. It is quite crude, but at least I can see which query is running for each process ID.
I tried pgfouine, but if I remember, it's an offline tool.
I also tail the psql.log file and set the logging criteria down to a level where I can see the problem queries.
#log_min_duration_statement = -1 # -1 is disabled, 0 logs all statements
# and their durations, > 0 logs only
# statements running at least this time.
I also use EMS Postgres Manager to do general admin work. It doesn't do anything for you, but it does make most tasks easier and makes reviewing and setting up your schema more simple. I find that when using a GUI, it is much easier for me to spot inconsistencies (like a missing index, field criteria, etc.). It's only one of two programs I'm willing to use VMWare on my Mac to use.
Munin is quite simple yet effective to get trends of how the database is evolving and performing over time. In the standard kit of Munin you can among other thing monitor the size of the database, number of locks, number of connections, sequential scans, size of transaction log and long running queries.
Easy to setup and to get started with and if needed you can write your own plugin quite easily.
Check out the latest postgresql plugins that are shipped with Munin here:
http://munin-monitoring.org/browser/branches/1.4-stable/plugins/node.d/
Well, the first thing to do is try all your queries from psql using "explain" and see if there are sequential scans that can be converted to index scans by adding indexes or rewriting the query.
Other than that, I'm as interested in the answers to this question as you are.
Check out Lightning Admin, it has a GUI for capturing log statements, not perfect but works great for most needs. http://www.amsoftwaredesign.com
DBTuna http://www.dbtuna.com/postgresql_monitor.php has recently started supporting PostgreSQL monitoring. We use it extensively for MySQL monitoring, so if it provides the same for Postgres then it should be a good fit for you too.