What DB for big databases? - sql

I'm embarking on a project that will likely grow to several million rows in the near future, so I am researching which database to use, since that is sure to become an issue. From what I have read, SQL in all its incarnations has problems once a table reaches around 2,000,000 rows. Is there a good database recommended for projects of this size?
It is a website I am talking about, and archiving old entries is not ideal, though can be done if it proves to be an issue that I can't overcome.
Thanks.

No database worth calling itself an SQL database would have issues with 2 million records. You can get into trouble with some databases at 2 billion records, though.
I've had MySQL databases with well over 150 million records without trouble. You need to figure out what features you need from a database before deciding, rather than pondering over a few million rows - which is not much at all.

First off, a million records is not exactly a lot where databases are concerned. Any database worth its salt should be able to handle that just fine.
Create proper indexes on your tables and almost any database will be able to handle those numbers of records. I've seen MySQL databases with millions of rows that worked just fine, and MySQL is not a heavyweight in database land.
MS SQL Server, PostgreSQL, DB2, Progress OpenEdge - almost anything will do if you create proper indexes. Things like MS Access (and possibly sqlite) may fall apart when you put a lot of data in them.
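To make "proper indexes" concrete, here is a minimal sketch (the table and column names are made up for illustration; adapt them to your schema):

-- A hypothetical table of site entries.
CREATE TABLE entries (
    entry_id   INT PRIMARY KEY,
    user_id    INT NOT NULL,
    created_at DATETIME NOT NULL,
    body       TEXT
);

-- Index the columns you filter and join on, so lookups seek instead of scanning millions of rows.
CREATE INDEX idx_entries_user_created ON entries (user_id, created_at);

-- With that index in place, a query like this can find one user's recent rows directly:
SELECT entry_id, created_at
FROM entries
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 20;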

I've had tables in MS SQL Server with a fair bit more than 2 million rows without trouble. Of course, it depends on how you're using that data.
Just don't try using MySQL for something like this. At least in my experience, it just doesn't allow enough tweaking to provide high enough performance. I've run into a few cases with large amounts of data in (almost) identically set up tables, and MySQL 5 performed something like 30 times slower than SQL Server on the same hardware. An extreme example, maybe, but still.
I have too little experience with PostgreSQL or Oracle to judge, so I will just stick with not recommending MySQL. Or Access ;)

One of the tables in my current project has 13 million rows in it. MS SQL Server handles it just fine. Really, 2 million rows is nothing.
But, seriously, if you want a high-end database, look to Oracle, Teradata, and DB2.

We run lots of databases with row counts in the hundreds of millions in MSSQL (2000, 2005, 2008). Your row count isn't where your problem will arise, it's in the characteristics of access to the data. Depending on how it looks, you may need to scale across separate hardware, and that is where the differences between database servers will really show up (that and price...)

Microsoft SQL Server, MySQL, Oracle, and DB2 can all handle millions and millions of rows without a problem.
The problem will be finding a DBA who knows how to design and manage it properly so you get the performance characteristics you're looking for.

2,000,000 rows is really not much at all. I've seen plenty of tables with > 50 million rows with acceptable performance, in MS SQL.
IMHO you're still pretty far away from being a 'big database'

As others have said, any decent DB can handle that sort of load. I've used MS SQL Server and PostgreSQL for databases of that size before, both work great. I'd recommend PostgreSQL because it's free and open. I've never done a performance comparison, but it seems to be very capable. I'd avoid DB2 or Oracle because they're very hard to use (unless you want to pay for a full time DBA, in which case such a person might be able to squeeze better performance out of those than any other solution, especially with Oracle).

I concur with richardtallent. The big name database systems have all provided us with good tools for large databases. (2 million rows is nothing, although you can see performance issues with lousy indexes or poor choices in the select statements, especially if you are joining across multiple tables of similar size.) It comes down to pros and cons with costs, usability, cost of support, etc.
I can speak most to Oracle and SQL Server. Oracle is pretty pricey, and it takes a pricey, dedicated DBA to really use it right. It isn't known for usability, but a DBA or programmer comfortable with it can work just fine in it. It also has great flexibility and some believe it is more powerful than the others. (I don't know if that's true or not, but I know it certainly provides lots of different ways you can tweak it for efficiency, etc.)
SQL Server can certainly handle large datasets just fine. It has a "prettier" face and tends to be considered more usable, but usability in the end is a matter of opinion. It does have a cheaper price tag, but you might have just a bit less flexibility than Oracle. You can get a "cheap" SQL Server DBA, because its user-friendly interface makes it easy for people to do some of the basic DBA tasks without being experts. But you get what you pay for (usually), and if you really want efficiency and security, you pay for an expert anyway.
Those are just a few of the things to consider when looking at DBs. I'm sure MySQL and DB2 have their own pros and cons to be weighed.
But none of them have a problem with a measly 2 million rows. (I regularly work in a database with hundreds of tables, some of which have over 50 million rows, and I see little performance hit b/c the DBAs know what they are doing.)
FOLLOW UP EDIT: Since this is for a website, maybe your biggest consideration should be integration of front/back. For example, if you are using ASP for the web, SQL Server is a natural choice.

For most apps MS SQL will work fine. MySQL will work for smaller apps, but to answer your question: if you are truly concerned about DB performance, I would go with Oracle if you can afford it; if you are like most of us who can't use an $80,000 database, I would suggest MS SQL - it works well. By the sound of what you are doing (a website), I would use MS SQL and utilize caching. Using the database correctly tends to be more important than using the correct database.

Try looking at other large organizations to see what they're using. MS's proof of concept for very large databases is TerraServer, which runs a database that is several terabytes in size.
Any database will have problems with a small dataset if you are doing table scans, cartesian products, expensive calculations for each row, etc.
To really stress a relational DB with a table of 2 million rows, you'd have to be doing cross tabs while doing a large number of inserts and updates - at which point you'd want to switch to an OLAP datastore.
Do you have anything else to describe the expected workload? Is it mostly read-only, read-write, etc.?

Properly configured, 2MM rows is not a big deal for most of the commercial DBs and may not be for the Open Source DBs - I don't know enough about MySQL et al to have an opinion.
By SQL I assume the original poster means MS SQL Server. While there were some scaling issues in the 2000 release, they seem to have been mostly addressed in 2005 and 2008. I have one test DB with significantly more than 2 million rows running now, and running quite well.
Respectfully, I think the question is badly stated - you need to provide much more information to get a useful answer: size of the database, number of tables, number of common joins, whether it will be optimized for reads, writes or both, number of concurrent users that will be supported, replication, geographic location of end users vs. the database server, hardware configuration.
In general I have found SQL Server post 2005 works in a lot of cases very well. If you need the ability to tune everything at the lowest level both Oracle and DB2 give you better access and documentation to do that.
If your need is primarily a data warehouse and you have the cash, then I would look at Netezza or Teradata. I am a fan of Netezza, but we are partners so I am biased.
Hope that helps,
Terence

Keep in mind that if you have a large amount of data:
indexing columns you join tables on is ESPECIALLY important (see the sketch after this list)
writing efficient queries can make a huge difference
if you query data all the time and rarely write new rows, you can create clustered indexes and materialized views to retrieve data much more efficiently, based on what queries you use most often
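For example, a rough sketch of the first and last points (the names are hypothetical; the materialized view uses PostgreSQL syntax, since MySQL has no built-in materialized views and you would use a summary table instead):

-- Index the column you join on so the join doesn't scan the whole table.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Precompute an expensive aggregate for read-heavy workloads.
CREATE MATERIALIZED VIEW order_totals AS
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;

-- Refresh on whatever schedule your freshness requirements allow.
REFRESH MATERIALIZED VIEW order_totals;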

As a lot of people have already said, that amount of records is not a problem if your database design is properly done.
But there may be another aspect worth considering. How many users, namely how many simultaneous users, do you expect your application to have? If you expect a lot of users, you also need to consider the scalability of the database engine, or of the database design.
MS SQL Server may not be expensive for a single-server setup, but if you need to scale up, e.g. run on 4 CPUs, the licensing becomes very expensive. And when you have pushed the limits of a single server and need to scale out to multiple servers, what do you do?
I don't have the answer to that, except that as far as I know, MS SQL Server does not directly support load balancing.
Just a thought

Related

Performance of calling a stored procedure from a different database

Is there any performance hit when doing a select across another DB on the same physical machine? So I have 2 databases on the same physical machine running within the same SQL 2008 instance.
For instance in SomStoreProc on_this_db I run
SELECT someFields FROM the_other_db.dbo.someTable
So far from what I have read on the internet, most people seem to indicate NO.
Even if it is not a performance hit, it could be a problem in data integrity as FKs can't be enforced across databases.
However, it is more likely your procs need to be tuned, especially if they are thousands of lines long. To begin with, look for cursors, correlated subqueries and bad indexing. Also look for WHERE clauses that are non-sargable and scalar functions that are running row-by-agonizing-row.
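As a rough illustration of the correlated-subquery point (table names invented for the example), a subquery that runs once per row can usually be rewritten as a join:

-- Slow pattern: the subquery is re-evaluated for every row of orders.
SELECT o.order_id,
       (SELECT c.name FROM customers c WHERE c.customer_id = o.customer_id) AS customer_name
FROM orders o;

-- Usually faster and easier for the optimizer to handle: a plain join.
SELECT o.order_id, c.name AS customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;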
Of course the best way to prove that the separate database is not the issue is to take one slow proc and convert those tables to one database and test performance both ways. Please at least convince them to do this smaller test before they go ahead and make the horribly complicated and time consuming change to one database and then find out they still have performance problems.
And remember, the execution plan is your friend when looking at these things.

Are SQL tuning techniques always the same for different DB engines?

I have used Oracle for the past half year and learned some SQL tuning tricks, but now our DB is moving to Greenplum, and the project manager has suggested we change some of the code written in Oracle SQL for efficiency or syntax reasons.
I am curious: are SQL tuning techniques the same across different DB engines, like Oracle, PostgreSQL, MySQL and so on? If yes or no, why? Any suggestions are welcome!
Some examples:
1. IN or EXISTS
2. COUNT(*) or COUNT(column)
3. using an index or not
4. selecting exact columns instead of SELECT *
For the most part the syntax that is used will remain the same, there may be small differences from one engine to another and you may run into different terms to achieve some of the more specific output or do more complex tasks. In order to achieve parity you will need to learn those new terms.
As far as tuning, this will vary from system to system. Specifically, going from Oracle to Greenplum you are moving from a database where query efficiency is often driven by putting an index on the data to a parallel execution system where efficiency is gained by effectively distributing the data across multiple systems and querying them in parallel. In Greenplum, indexing is an additional layer that usually does not add benefit, just additional overhead.
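For instance, in Greenplum the main physical design decision is the distribution key rather than an index (a sketch, with invented names):

-- Rows are hash-distributed across segments by the chosen key, so joins and
-- aggregates on that key can run in parallel on each segment.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      NUMERIC(10,2),
    sold_at     DATE
)
DISTRIBUTED BY (customer_id);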
Even within a single system, changing the storage engine type can result in different ways to optimize a query. In practice, queries are often moved to a new platform and work, but are far from optimal because they don't take advantage of that platform's optimizations. I would strongly suggest getting an understanding of the new platform; don't go in assuming that a query optimized for one platform is the optimal way to run it on another.
Getting into specifics of why they differ requires someone who is an expert in both in order to compare them. I don't claim to know much about Greenplum.
The basic principles which I would expect all developers to learn over time don't really change. But there are "quirks" of individual engines which make specific differences. From your question I would personally anticipate points 1 and 4 (IN vs EXISTS, and selecting exact columns instead of SELECT *) to remain the same.
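A quick sketch of those two points, with hypothetical tables (the relative cost of each form still depends on the optimizer, so check the plan on your own engine):

-- Point 1: IN vs EXISTS - both find customers that have at least one order.
SELECT c.customer_id
FROM customers c
WHERE c.customer_id IN (SELECT o.customer_id FROM orders o);

SELECT c.customer_id
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);

-- Point 4: list only the columns you need instead of SELECT *,
-- which cuts I/O and lets covering indexes be used on any engine.
SELECT customer_id, name
FROM customers;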
Indexing is something which does vary. For example, the ability to use two indexes was not (is not?) ubiquitous. I wouldn't like to guess which DBMSs can or can't use columns from the second field onward in a composite index. And the way indexes are maintained is very different from one DBMS to the next.
From my own experience I've also seen differences caused by:
Different capabilities in the data access path. As an example, one optimisation is for a DBMS to create a bitmap of rows (matching and not matching), then combine multiple bitmaps to select rows. A DBMS with this feature can use multiple indexes in a single query. One without it can't.
Availability of hints / lack of hints. Not all DBMS support them. I know they are very common in Oracle.
Different locking strategies. This is a big one and can really affect update and insert queries.
In some cases DBMS have very specific capabilities for certain types of data such as geographic data or searchable free text (natural language). In these cases the way of working with the data is entirely different from one DBMS to the next.

What is the overhead for a cross server query vs. a native query?

If I am connected to a particular server in SQL Server, and I use a query referencing a table in another server, does this slow down the query and if so how much (e.g. if I am joining a lot to tables in the server I currently am in)? I ask because I want to know if it is ever more efficient to import that table to the server I am in than to cross server query.
I would say it depends. Yeah, probably not the answer you wanted, but it depends on the situation. Using linked servers does come with a cost, especially depending on the number of rows you're trying to return and the types of queries you're trying to run. You should be able to view your execution plan on your queries and see how much is being used to hit the linked server tables. That along with the time it takes to return the results would probably help determine if it's needed.
In regards to bringing the tables locally, then that depends as well. Do you need up-to-date data or is the data static? If static, then importing the table would not be a bad idea. If that data constantly changes and you need the changes, then this might not be the best solution. You could always look into creating SSIS packages to do this on a nightly basis, but again, it just depends.
Good luck.
As sgeddes mentions, it depends.
My own experiences with linked server queries were that they are pretty slow for large tables. Depending on how you write the query the predicates may have to be evaluated on the server that is running the statements (very likely if you're joining to them), meaning it could be transferring the entire table to that server anyways, and then filtering the result. And that is definitely bad for performance.
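One way to see and sometimes avoid that in SQL Server is OPENQUERY (a sketch; the linked server name RemoteSrv and the table are made up): the pass-through query is executed entirely on the remote server, so only the matching rows come back over the link.

-- Four-part name: the local server may pull much of the remote table across before filtering.
SELECT t.id, t.amount
FROM RemoteSrv.SalesDb.dbo.BigTable AS t
WHERE t.amount > 1000;

-- Pass-through query: the WHERE clause runs on the remote server.
SELECT q.id, q.amount
FROM OPENQUERY(RemoteSrv, 'SELECT id, amount FROM SalesDb.dbo.BigTable WHERE amount > 1000') AS q;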

SQL versus noSQL (speed)

When people are comparing SQL and noSQL, and concluding the upsides and downsides of each one, what I never hear anyone talking about is the speed.
Isn't performing SQL queries generally faster than performing noSQL queries?
I mean, for me this would be a really obvious conclusion, because you should always be able to find something faster if you know the structure of your database than if you don't.
But people never seem to mention this, so I want to know if my conclusion is right or wrong.
People who tend to use noSQL use it specifically because it fits their use cases. Being divorced from normal RDBMS table relationships and constraints, as well as ACID-ity of data, it's very easy to make it run a lot faster.
Consider Twitter, which uses NoSQL because a user only does very limited things on the site, or exactly one thing - tweet. And concurrency can be considered non-existent, since (1) nobody else can modify your tweet and (2) you won't normally be tweeting simultaneously from multiple devices.
The definition of noSQL systems is a very broad one -- a database that doesn't use SQL / is not a RDBMS.
Therefore, the answer to your question is, in short: "it depends".
Some noSQL systems are basically just persistent key/value storages (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will be (or at least should be) faster than an RDBMS, because it only needs to support a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB).
These databases have no predefined data structure.
Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.
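For contrast, here is what the relational version of that blog example looks like at read time (a sketch with made-up table names): the post and its comments live in separate tables and are joined on every read.

SELECT p.title, p.body, c.author, c.comment_text
FROM posts p
LEFT JOIN comments c ON c.post_id = p.post_id
WHERE p.post_id = 123;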
As Einstein would say, speed is relative.
If you need to store a simple master/detail application (like a shopping cart), you would need to do several INSERT statements in your SQL application, and you will get back a data set of information when you run a query to fetch the purchase. If you're using NoSQL, and you're using it well, then you would have all the data for a single order in one simple "record" (a document, to use the terms of NoSQL databases like djondb).
So I really think that the performance of an application can be measured by the number of things it needs to do to achieve a single requirement. If you need to do several inserts to store an order, while you only need one simple insert in a database like djondb, then the performance will be 10x faster in the NoSQL world, simply because you're making 10 times fewer calls to the database layer; that's it.
To illustrate my point, let me link an example I wrote some time ago about the differences between the NoSQL and SQL data model approaches: https://web.archive.org/web/20160510045647/http://djondb.com/blog/nosql-masterdetail-sample/. I know it's a self-reference, but I basically wrote it to address this question, which I found to be the most challenging question an RDBMS person could ask, and it's always a good way to explain why NoSQL is so different from the SQL world and why it will achieve better performance - not because we use "NASA" technology, but because NoSQL lets the developer do less and get more, and less code = greater performance.
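To make the master/detail point concrete on the SQL side (a sketch only; the table names are invented), storing one shopping-cart order takes several statements, ideally wrapped in a transaction, and reading it back takes a join:

START TRANSACTION;

INSERT INTO orders (order_id, customer_id, created_at)
VALUES (1001, 42, NOW());

INSERT INTO order_items (order_id, product_id, quantity, price)
VALUES (1001, 7, 2, 19.99),
       (1001, 9, 1, 5.50);

COMMIT;

SELECT o.order_id, i.product_id, i.quantity, i.price
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
WHERE o.order_id = 1001;

A document store would instead write and read the whole order as a single document.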
The answer is: it depends. Generally speaking, the objective of NoSQL databases (not "queries") is scalability. An RDBMS usually hits some hard limit at some point (I'm talking about millions and millions of rows) where you cannot scale any more by traditional means (replication, clustering, partitioning), and you need something more because your needs keep growing. Or even if you manage to scale, the overall setup is quite complicated. Or you can scale reads, but not writes.
And query speed depends on the particular implementation of your server, the type of query you are doing, the columns in the table, etc... remember that queries are just one part of the RDBMS.
As one rough data point: the query time of a relational (SQL) database over data for 1,000 people was about 2,000 ms, while a graph database like Neo4j took about 2 ms - and if you create more nodes, up to 1,000,000, the speed stays stable at around 2 ms.

How can my application benefit from temporary tables?

I've been reading a little about temporary tables in MySQL but I'm an admitted newbie when it comes to databases in general and MySQL in particular. I've looked at some examples and the MySQL documentation on how to create a temporary table, but I'm trying to determine just how temporary tables might benefit my applications and I guess secondly what sorts of issues I can run into. Granted, each situation is different, but I guess what I'm looking for is some general advice on the topic.
I did a little googling but didn't find exactly what I was looking for on the topic. If you have any experience with this, I'd love to hear about it.
Thanks,
Matt
Temporary tables are often valuable when you have a fairly complicated SELECT you want to perform and then perform a bunch of queries on that...
You can do something like:
CREATE TEMPORARY TABLE myTopCustomers
SELECT customers.*, COUNT(*) AS num
FROM customers
JOIN purchases USING (customerID)
JOIN items USING (itemID)
GROUP BY customers.customerID
HAVING num > 10;
And then run a bunch of queries against myTopCustomers without having to redo the joins to purchases and items each time. When your application's connection closes, the temporary table is dropped automatically, so no cleanup needs to be done.
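For instance, follow-up queries then hit the temporary table directly (continuing the sketch above; the country column is assumed for illustration):

-- Re-use the precomputed result without repeating the joins.
SELECT COUNT(*) FROM myTopCustomers;

SELECT *
FROM myTopCustomers
WHERE country = 'DE'   -- assumes customers has a country column
ORDER BY num DESC
LIMIT 10;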
Almost always you'll see temporary tables used for derived tables that were expensive to create.
First a disclaimer - my job is reporting so I wind up with far more complex queries than any normal developer would. If you're writing a simple CRUD (Create Read Update Delete) application (this would be most web applications) then you really don't want to write complex queries, and you are probably doing something wrong if you need to create temporary tables.
That said, I use temporary tables in Postgres for a number of purposes, and most will translate to MySQL. I use them to break up complex queries into a series of individually understandable pieces. I use them for consistency - by generating a complex report through a series of queries and offloading some of those queries into modules that I use in multiple places, I can make sure that different reports are consistent with each other. (And make sure that if I need to fix something, I only need to fix it once.) And, rarely, I deliberately use them to force a specific query plan. (Don't try this unless you really understand what you are doing!)
So I think temp tables are great. But that said, it is very important for you to understand that databases generally come in two flavors. The first is optimized for pumping out lots of small transactions, and the other is optimized for pumping out a smaller number of complex reports. The two types need to be tuned differently, and a complex report run on a transactional database runs the risk of blocking transactions (and therefore making web pages not return quickly). Therefore you generally want to avoid using one database for both purposes.
My guess is that you're writing a web application that needs a transactional database. In that case, you shouldn't use temp tables. And if you do need complex reports generated from your transactional data, a recommended best practice is to take regular (eg daily) backups, restore them on another machine, then run reports against that machine.
The best place to use temporary tables is when you need to pull a bunch of data from multiple tables, do some work on that data, and then combine everything to one result set.
In MS SQL, temporary tables should also be used in place of cursors whenever possible, because of the speed and resource impact associated with cursors (a sketch of this pattern follows).
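A minimal sketch of that pattern in T-SQL (table and column names are hypothetical): stage the rows once in a temp table, then update set-based instead of looping with a cursor.

-- Stage an aggregate once...
SELECT customer_id, SUM(amount) AS recent_total
INTO #recent_totals
FROM orders
WHERE created_at >= DATEADD(DAY, -30, GETDATE())
GROUP BY customer_id;

-- ...then apply it to the whole set instead of cursoring through customers row by row.
UPDATE c
SET c.recent_order_total = t.recent_total
FROM customers AS c
JOIN #recent_totals AS t ON t.customer_id = c.customer_id;

DROP TABLE #recent_totals;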
If you are new to databases, there are some good books by Joe Celko that review best practices for ANSI SQL. SQL for Smarties describes in great detail the use of temp tables, the impact of indexes, WHERE clauses, etc. It's a great reference book with in-depth detail.
I've used them in the past when I needed to create evaluated data. That was before the time of views and sub selects in MySQL though and I generally use those now where I would have needed a temporary table. The only time I might use them is if the evaluated data took a long time to create.
I haven't done them in MySQL, but I've done them on other databases (Oracle, SQL Server, etc).
Among other tasks, temporary tables provide a way for you to create a queryable (and returnable, say from a sproc) dataset that's purpose-built. Let's say you have several tables of figures -- you can use a temporary table to roll those figures up to nice, clean totals (or other math), then join that temp table to others in your schema for final output. (An example of this, in one of my projects, is calculating how many scheduled calls a given sales-related employee must make per week, bi-weekly, monthly, etc.)
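A rough sketch of that roll-up pattern (MySQL-style temporary table; the names are made up):

-- Roll the detail rows up into clean totals once...
CREATE TEMPORARY TABLE weekly_totals
SELECT employee_id, SUM(calls_made) AS calls
FROM call_log
WHERE call_date >= CURDATE() - INTERVAL 7 DAY
GROUP BY employee_id;

-- ...then join the small summary back to the rest of the schema for final output.
SELECT e.name, t.calls
FROM employees e
JOIN weekly_totals t ON t.employee_id = e.employee_id
ORDER BY t.calls DESC;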
I also often use them as a means of "tilting" the data -- turning columns to rows, etc. They're good for advanced data processing -- but only use them when you need to. (My golden rule, as always, applies: If you don't know why you're using x, and you don't know how x works, then you probably shouldn't use it.)
Generally, I wind up using them most in sprocs, where complex data processing is needed. I'd love to give a concrete example, but mine would be in T-SQL (as opposed to MySQL's more standard SQL), and also they're all client/production code which I can't share. I'm sure someone else here on SO will pick up and provide some genuine sample code; this was just to help you get the gist of what problem domain temp tables address.