I have a database with over a million users, and each user has an enormous amount of data stored.
Needless to say, performance has degraded.
(Each user has its own website and CMS)
How do I handle the database for many users?
I heard an idea of saving each user's information in its own database, instead of tables with foreign keys.
What are your thoughts on this idea? What are the advantages and disadvantages?
What other ways should I be considering?
One million users with referenced data is not big data.
If the performance is bad then you might have a look at your SQL code or front-end code.
Also use indexes to improve query execution time. Most of the time, indexes and code optimization do the trick. A lot of other things also play a big role, like your CPU, memory, disk, etc.
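For example, a minimal sketch (assuming a hypothetical pages table with a user_id foreign key; the names are placeholders for whatever your schema actually uses):
-- Index the foreign key plus a commonly filtered column so that
-- fetching one user's rows becomes an index seek instead of a scan.
CREATE INDEX ix_pages_user_created
    ON pages (user_id, created_at);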
I would first have a look at the code and see if you can optimize anything, and only if that doesn't help, separate the data into multiple databases.
Even if you do this, you might still have performance problems if the databases are hosted on the same server.
Good luck!
I have a database with tables that grow every day. I cannot predict which tables are going to grow and which are not as I'm not the one who is putting the data into them.
Is there a way to find tables that need indexes at a particular point in time? Is there a way, in SQL Server, to notify me if a database needs tuning on certain tables?
This is a product we have deployed at different client locations, and we cannot go onto their servers every time to check whether they have a performance issue. What I was thinking about is something that can notify me if there are performance issues on certain tables, so that as new patches go out to the clients we can add these indexes or tuned queries.
After referring to "Insertion of data after creating index on empty table or creating unique index after inserting data on oracle?", I'm not willing to create indexes while installing databases, or when the tables have few rows or are empty.
As per my understanding, we must not create indexes on smaller tables, as they can affect write performance.
This is only a real concern if you're bulk loading or otherwise generating a hundred million records each day and write performance is a problem. Indexes do increase write times because they have to be updated when data is written, but unless you're running on a potato or running very high loads it's unlikely to be a problem. You'd know it was a problem before you encountered it.
If we're talking about small tables (less than 100 pages) then it's much more likely that indexes won't be useful because the data set is so small, but you shouldn't be concerned about impacting write performance.
Overall, your application should have indexes that support the queries you expect to be run, based on your unit testing and staging. You will need feedback from your customers or clients, but until you really know how people use their data, you're going to have to make a best guess.
The general question of "How do I know what indexes I need when I don't know what queries will be run?" is better suited to DBA Stack Exchange. Briefly, you'll need to use dynamic management views for that. The three missing index dynamic views can be used for this. The example query given isn't horrible:
SELECT mig.*, statement AS table_name,
       column_id, column_name, column_usage
FROM sys.dm_db_missing_index_details AS mid
CROSS APPLY sys.dm_db_missing_index_columns (mid.index_handle)
INNER JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_handle = mid.index_handle
ORDER BY mig.index_group_handle, mig.index_handle, column_id;
You shouldn't just blindly follow what this view says, however. It's a good lead on what to look at, but you have to look at the column order and queries actually being used to tell.
You should also monitor index usage statistics and examine how much and in what way indexes are used compared to how much they have to be updated. Indexes that are updated a million times a day but are used once or twice should be considered for removal.
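For example, something along these lines (SQL Server; a rough sketch, trimmed for readability) shows how often each index is read versus how often it has to be maintained:
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY s.user_updates DESC;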
You will also want to monitor query stats to look for queries that run for a long time. This may be poor development on the part of your client, but can also be a sign of design problems.
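The query stats DMV can surface those long runners; a sketch of the usual approach (SQL Server):
SELECT TOP (20)
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time,
       st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time / qs.execution_count DESC;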
This is not even a comprehensive overview of things to look for, however. There's a lot to database maintenance and operations. That's why DBAs make a good living. This is just the tip of the iceberg. Just the tip for indexes, even.
If you want to maintain this, what I'd do is consider asking your customers to allow you to send feedback for performance analysis. Set up a broker that monitors the management views and sends compiled and sanitized information back to you. You'll need to be very careful about what you send, because you don't want to be sending actual customer data, of course.
Keep in mind that dynamic management views typically reset when the instance does, so the results will not typically represent the entire lifespan of the database.
The question title may seem irrelevant, but I couldn't find one that better fits my intent.
I am working on an e-commerce project and using a relational database for storage, but I have an issue with the performance of my product domain queries. What we want to accomplish is fairly complex, so the database schema we designed is also fairly complex. We have over 40 tables for the product domain, and we are having performance problems when querying the DB to render our product pages. To view even simple information about a product we need to query at least 5-7 tables, and this causes a huge penalty for the website, since we have hundreds of thousands of requests per day.
By the way, our data is not big. The problem is that we have many tables containing little data, and we have to join many of them at a time.
Is there a better way or place to store such data? I have looked at NoSQL DBs, but as I understand it, they might not be the best fit for my solution. Could a graph DB like Neo4j be helpful in my case?
Thanks for any help...
I'm currently researching a very large table (~100 million rows, 35 columns). It's currently stored in a SQL database, but the queries I'm running (and they're varied) run very, very slowly.
So I gather I should probably move to a NoSQL DB. My questions are:
How can I tell which (NoSQL) db is best for me?
How can I move my current SQL table to the new NoSQL scheme?
OR should I stay in SQL and just fine tune it?
A few more details: rows will not be added or removed; this is historical data, and all of the analysis will be done on that table. I plan to run various queries on it. The data is numerical.
I routinely work with a SQL Server 2012 table that has 900 million rows. This table has rows being added to it about every 2 minutes with a total of about 200K per day. I can query this table and get rows back in a couple seconds (using the clustered index / PK). I can also query on one of the other indexes and get results back in seconds or less.
So, it's all a matter of making sure your indexes are set up correctly, AND BEING USED!! Check your queries against the query plan being generated and make sure seeks are being done.
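For example (a sketch using a hypothetical measurements table; substitute your own query), you can run the slow query with statistics and the actual plan enabled and confirm the plan contains index seeks rather than scans:
SET STATISTICS IO ON;    -- logical reads per table
SET STATISTICS XML ON;   -- returns the actual execution plan with the results

-- Hypothetical query; look for Index Seek operators in the returned plan.
SELECT reading_value
FROM   measurements
WHERE  sensor_id = 42
  AND  reading_date >= '2015-01-01';

SET STATISTICS XML OFF;
SET STATISTICS IO OFF;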
There could be good reasons for moving to NoSQL, or something similar. But moving to NoSQL because you think you can't get good performance in SQL Server, before making sure you've done everything you can do to improve performance first, is not a good reason.
Some food for thought:
100M rows is well within SQL's "sweet spot". You can grow by x10 and still be assured that SQL will be able to support you with fairly trivial effort.
NoSQL is not a silver bullet for solving performance problems at scale. It offers a set of tradeoffs which, with careful planning, can provide better results. But it sounds like you don't fully understand your performance issues in SQL, and without that understanding, your chances of making the correct design decisions in a NoSQL environment are slim.
One of the common tradeoffs in NoSQL systems is that they typically provide less flexibility in querying, in return for greater flexibility in schema management. You mentioned your queries are "various": if they are truly varied, or more importantly, frequently changing, then moving to a NoSQL system can put you in a world of pain, especially if you are not familiar with the technology yet.
Bottom line: you aren't doing anything that is clearly "beyond" the capabilities of SQL, and your problems are probably caused more by an inefficient implementation than by any inherent platform limitations. Moving to a NoSQL system won't magically solve your problems, and will probably introduce new ones.
If you are running queries on columns that are not indexed, they will be very slow. You can add more indexes to speed them up; since your DB is static, this should work.
One major speed-up is the use of map-reduce queries, where aggregations are carried out by multiple processes or computers. NoSQL databases like MongoDB can be used in this way, but even MySQL has clustering capabilities nowadays: http://www.mysql.de/products/cluster/scalability.html. SQL Server can be clustered as well.
So I guess the best first shot would be to optimize the table's indexes for your queries. Each column the query filters or aggregates on (comparisons, counts, etc.) should be indexed.
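For instance, assuming a hypothetical measurements table (all names are placeholders), a composite index whose leading column matches the filter lets the aggregation run off the index instead of the full table:
CREATE INDEX ix_measurements_date_sensor
    ON measurements (reading_date, sensor_id);

-- The WHERE clause can now be answered with a range seek on the index,
-- and sensor_id is available in the index for the GROUP BY.
SELECT sensor_id, COUNT(*) AS readings
FROM   measurements
WHERE  reading_date >= '2015-01-01'
GROUP BY sensor_id;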
If this doesn't do any better, you probably count and calculate a lot, and you should use map-reduce jobs and a DB that can handle this, like MongoDB: http://docs.mongodb.org/manual/aggregation/
I hope this helps
I'm new to database design and need some guidance.
A lot of new data is inserted to my database throughout the day. (100k rows per day)
The data is never modified or deleted once it has been inserted.
How can I optimize this database for retrieval speed?
My ideas
Create two databases (possibly on different hard drives) and merge the two at night when traffic is low
Create some special indexes...
Your recommendation is highly appreciated.
UPDATE:
My database only has a single table.
100k/day is actually fairly low: about 3M/month and under 40M/year. You could store a 10-year archive and not reach 1B rows.
The most important thing to choose in your design will be the clustered key(s). You need to make sure that they are narrow and can serve all the queries your application will normally use. Any query that ends up in a table scan will completely trash your memory by fetching the entire table. So, no surprises there: the driving factor in your design is the actual load you'll have, i.e. exactly what queries you will be running.
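For illustration, a minimal sketch in SQL Server, assuming a single append-only table and time-range queries (every name here is a placeholder):
-- Narrow, ever-increasing clustered key: inserts append at the end and
-- the key stays small in every nonclustered index.
CREATE TABLE dbo.events
(
    event_id    bigint IDENTITY(1,1) NOT NULL,
    recorded_at datetime2     NOT NULL,
    payload     nvarchar(400) NULL,
    CONSTRAINT PK_events PRIMARY KEY CLUSTERED (event_id)
);

-- A nonclustered index so that time-range queries seek instead of scanning.
CREATE INDEX ix_events_recorded_at ON dbo.events (recorded_at);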
A common problem (more often neglected than not) with any high insert rate is that eventually every row inserted will have to be deleted; pretending otherwise is a pipe dream. The proper strategy depends on many factors, but probably the best bet is a sliding window partitioning scheme. See How to Implement an Automatic Sliding Window in a Partitioned Table. This cannot be an afterthought: the choice of how to remove data will permeate every aspect of your design, and you had better start making a strategy now.
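A rough sketch of the monthly partitioning that a sliding-window scheme builds on (SQL Server; boundary dates, filegroup, and names are placeholders):
CREATE PARTITION FUNCTION pf_events_monthly (datetime2)
    AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');

CREATE PARTITION SCHEME ps_events_monthly
    AS PARTITION pf_events_monthly ALL TO ([PRIMARY]);

-- Table stored on the partition scheme; an old month can later be
-- switched out to a staging table and dropped instead of deleted row by row.
CREATE TABLE dbo.events_partitioned
(
    event_id    bigint        NOT NULL,
    recorded_at datetime2     NOT NULL,
    payload     nvarchar(400) NULL
) ON ps_events_monthly (recorded_at);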
The best tip I can give, which all big sites use to speed up their websites, is:
CACHE CACHE CACHE
Use Redis/memcached to cache your data, because memory is (blazingly) fast and disk I/O is expensive.
Queue writes
Also, for extra performance, you could queue up the writes in memory for a little while before flushing them to disk, i.e. writing them to the SQL database. Of course, you then run the risk of losing data if you keep it in memory and your computer crashes or has a power failure or something.
Context missing
Also I don't think you gave us much context!
What I think is missing is:
Architecture.
What kind of server are you running: VPS or shared hosting?
What operating system does it have: Linux/Windows/macOS?
Computer specifics, like how much memory is available, CPU, etc.
Also, I find your definition of the data a bit vague. Could you attach a diagram or something that explains your domain a little bit? For example, something made using http://yuml.me/
Your requirements are way too general. For MS SQL Server, 100k (more or less "normal") records per day should not be a problem if you have decent hardware. Obviously you want to write to the database fast, but you ask for optimization of retrieval performance; those goals don't match very well! ;-) Tuning a database is a special skill of its own, so you will never get the general answer you would like to have.
I am interested to know the performance impact of using MySQL as a key-value database vs., say, Redis/MongoDB/CouchDB. I have used both Redis and CouchDB in the past, so I'm very familiar with their use cases, and I know that it's generally better to store key/value pairs in a NoSQL store than in MySQL.
But here's the situation:
the bulk of our applications already have lots of MySQL tables
We host everything on Heroku (which only has MongoDB and MySQL, and is basically 1-db-type per app)
we don't want to be using multiple different databases in this case.
So basically, I'm looking for some info on the scalability of having a key/value table in MySQL, maybe at a few arbitrary tiers:
1000 writes per day
1000 writes per hour
1000 writes per second
1000 reads per hour
1000 reads per second
A practical example is in building something like MixPanel's Real-time Web Analytics Tracker, which would require writing very often depending on traffic.
WordPress and other popular software use this all the time: a Post has a "Meta" model, which is just key/value, so you can add arbitrary properties to an object that can then be searched over.
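For concreteness, a minimal sketch of such a meta table in MySQL (table and column names are placeholders):
CREATE TABLE object_meta (
    object_id  BIGINT UNSIGNED NOT NULL,
    meta_key   VARCHAR(191)    NOT NULL,
    meta_value TEXT            NULL,
    PRIMARY KEY (object_id, meta_key),
    KEY ix_meta_key (meta_key)
) ENGINE=InnoDB;

-- Typical access is a point lookup or an upsert on the composite key.
SELECT meta_value FROM object_meta WHERE object_id = 42 AND meta_key = 'page_views';

INSERT INTO object_meta (object_id, meta_key, meta_value)
VALUES (42, 'page_views', '100')
ON DUPLICATE KEY UPDATE meta_value = VALUES(meta_value);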
Another option is to store a serialized hash in a blob, but that seems worse.
What is your take?
I'd say that you'll have to run your own benchmark, because only you know the following important aspects:
the size of the data to be stored in this KV table
the level of parallelism you want to achieve
the number of existing queries reaching your MySQL instance
I'd also say that, depending on the durability requirements for this data, you'll want to test multiple engines: InnoDB and MyISAM.
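For example, declaring the same KV table twice, once per engine, makes the comparison straightforward (a sketch; names are placeholders):
CREATE TABLE kv_innodb (
    k VARCHAR(191) NOT NULL PRIMARY KEY,
    v TEXT NULL
) ENGINE=InnoDB;

CREATE TABLE kv_myisam (
    k VARCHAR(191) NOT NULL PRIMARY KEY,
    v TEXT NULL
) ENGINE=MyISAM;

-- Run the same read/write benchmark against both and compare.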
While I do expect some NoSQL solutions to be faster, based on your constraints you may find that MySQL performs well enough for your requirements.
SQL databases are more and more used as a persistence layer, with computations and delivery cached in key-value repositories.
With this in mind, those guys have done quite a test here:
InnoDB inserts 43,000 records per second AT ITS PEAK*;
TokuDB inserts 34,000 records per second AT ITS PEAK*;
This KV store inserts 100 million records per second (2,000+ times more).
To answer your question, a Key-Value repository is more than likely to outdo MySQL by several orders of magnitude:
Processing 100,000,000 items:
kv_add()....time:....978.32 ms
kv_get().....time:....297.07 ms
kv_free()....time:........0.00 ms
OK, your test was 1,000 ops per second, but it can't hurt to be able to do 1,000 times more!
See this for further details (they also compare it with Tokyo Cabinet).
There is no doubt that using a NOSQL solution is going to be faster, since it is simpler.
NOSQL and Relational do not compete with each other, they are different tools that can solve different problems.
That being said, for 1,000 writes per day or per hour, MySQL will have no problem.
For 1,000 per second, you will need some fancy hardware to get there. For the NoSQL solution, you will probably still need some kind of distributed file system.
It also depends on what you are storing.
Check out the series of blog posts here where the author runs tests comparing MongoDB and MySQL performance, and fights through the MySQL performance tuning mess. MongoDB was doing ~100K row reads per second; MySQL in client/server mode was doing 43K max, but with the embedded library he managed to get it up to 172K row reads per second.
It sounds a little complicated to get that high on a single node, so ymmv.
The writes/second question is a little harder, but this still might give you some ideas on configs to try.
You should first implement it in the simplest way then compare that. Always test things. This means:
Create a schema that's representative of your use case.
Create queries representative of your use case.
Create significant amounts of dummy data representative of your use case.
Benchmark it in a variety of loops, including both random and sequential access.
Ensure you use concurrency (run many processes randomly hammering the server with all kinds of queries representative of your use cases).
Once you have that, measure, test. There are different ways you can go about it. Some tests can be simple but might be less realistic. Measure throughput and latency.
Then try to optimise it.
MySQL has one particular limitation for KV: the standard engines with persistence use indexes optimised for range lookups rather than pure key-value access, which might introduce some overhead. It's also difficult to make structures such as hash tables work with persistent storage, due to rehashing. MEMORY tables, however, do support a hash index.
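For example (a sketch; keep in mind MEMORY tables lose their contents on restart, so this is cache-style storage only):
CREATE TABLE kv_cache (
    k VARCHAR(191)  NOT NULL,
    v VARCHAR(1024) NULL,
    PRIMARY KEY USING HASH (k)
) ENGINE=MEMORY;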
Many people associate certain things with being slow such as SQL, RELATIONAL, JOINS, ACID, etc.
When using an ACID capable relational database, you don't have to necessarily use ACID or relations.
While joins have a bad reputation for being slow, this is usually down to misconceptions about joins. Often people simply write bad queries. This is made more difficult because SQL is declarative: the engine can get things wrong, especially with JOINs, where there are often multiple ways to perform the join. What people are actually getting out of NoSQL in this case is an imperative model; "NoDeclarative" would be more accurate, as that's the problem a lot of people are having with SQL. Quite often people simply lack indexes. That's not an argument in favour of joins, but rather to illuminate where people can get it wrong on speed.
Traditional databases can be extremely fast if you make certain special tradeoffs, such as ignoring data integrity or handling it elsewhere. You don't have to wait for the hard drive to flush writes, you don't have to enforce relations, you don't have to enforce unique constraints, and you don't have to use transactions; but if you do trade safety for speed, then you need to know what you're doing.
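One concrete (MySQL/InnoDB) example of that trade, offered as a sketch rather than a recommendation: relax how aggressively the logs are flushed, accepting that a crash can lose roughly the last second of commits.
-- Flush the InnoDB redo log to disk about once per second instead of at
-- every commit, and let the OS decide when to flush the binary log.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;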
NoSQL solutions, by comparison, first and foremost tend to be designed to support various modes of scaling out of the box. The performance of an individual node might not be quite what you expect. NoSQL solutions can also struggle for general use, with many having quite unusual performance characteristics or limited feature sets.