Optimization techniques for large databases [closed] - sql

What optimization techniques do you use on extremely large databases? If our estimations are correct, our application will have billions of records stored in the db (MS SQL Server 2005), mostly logs that will be used for statistics. The data contains numbers (mostly integer) and text (error message texts, URLs) alike.
I am interested in ANY kind of tips, hacks, solutions.

The question is a little bit vague, but here are a few tips:
Use appropriate hardware for your databases. I'd opt for 64-bit OS as well.
Have dedicated machines for the DBs. Use fast disks configured for optimal performance. The more disks you can span over, the better the performance.
Optimize the DB for the type of queries that will be performed. Which happens more often: SELECTs or INSERTs?
Does the load happen throughout the entire day, or only during a few hours? Can you postpone some of the work to run at night?
Have incremental backups.
If you'll consider Oracle instead of SQL Server, you could use features such as Grid and Table Partitioning, which might boost performance considerably.
Consider having some load-balancing solution between the DB servers.
Pre-design the schemas and tables so queries will be performed as fast as possible. Consider the appropriate indexes as well.
You're going to have to be more specific about how you intend to store those logs. Are they LOBs in the DB? Simple text records?
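As a rough sketch of what "pre-designed tables plus appropriate indexes" might look like for a log table of this kind (table and column names are invented for illustration; adapt them to your real data):

-- Hypothetical log table: mostly integers plus message/URL text.
CREATE TABLE dbo.AppLog (
    LogId     BIGINT IDENTITY(1,1) PRIMARY KEY,
    LoggedAt  DATETIME      NOT NULL,
    Severity  TINYINT       NOT NULL,
    SourceId  INT           NOT NULL,
    Url       VARCHAR(2000) NULL,
    Message   VARCHAR(4000) NULL
);
-- Narrow index to support time-windowed statistics queries.
CREATE INDEX IX_AppLog_LoggedAt_Severity
    ON dbo.AppLog (LoggedAt, Severity) INCLUDE (SourceId);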

I don't use it myself, but I have read that one can use Hadoop in combination with HBase for distributed storage and distributed analysis of data such as logs.

duncan's link has a good set of tips. Here are a few more:
If you do not need to query against totally up-to-date data (i.e. if data up to the last hour or close of business yesterday is acceptable), consider building a separate data mart for the analytics. This allows you to optimise this for fast analytic queries.
The SQL Server query optimiser has a star transformation operator. If the query optimiser recognises this type of query, it can select the slice of data you want by filtering on the dimension tables before it touches the fact table. This reduces the amount of I/O needed for the query (see the sketch after this list).
For VLDB applications involving large table scans, consider direct attach storage with as many controllers as possible rather than a SAN. You can get more bandwidth cheaper. However, if your data set is less than (say) 1TB or so it probably won't make a great deal of difference.
A 64-bit server with lots of RAM is good for caching if you have locality of reference in your query accesses. However, a table scan has no locality of reference so once it gets significantly bigger than the RAM on your server extra memory doesn't help so much.
If you partition your fact tables, consider putting each partition on a separate disk array - or at least a separate SAS or SCSI channel if you have SAS arrays with port replication. Note that this will only make a difference if you routinely run queries across multiple partitions.
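To make the star transformation point concrete, this is the sort of star-join query it applies to (the fact and dimension table names are made up): the optimiser can resolve the dimension filters first and only then touch the much larger fact table.

SELECT d.calendar_month, f.error_code, COUNT(*) AS events
FROM fact_log f
JOIN dim_date d ON d.date_key = f.date_key
JOIN dim_source s ON s.source_key = f.source_key
WHERE d.calendar_year = 2008
  AND s.source_name = 'web-frontend'
GROUP BY d.calendar_month, f.error_code;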

Related

Abnormal query execution time [closed]

I'm executing the same query in two different environments:
The first environment has 4 GB RAM and a 3.09 GHz Intel processor.
The second environment has 32 GB RAM and a 2.20 GHz AMD processor.
I'm wondering why the query takes 6 minutes to execute in the first environment and many, many hours in the second one.
I checked the memory allocated to SQL Server by executing the query below; the two environments have the same value.
SELECT value_in_use
FROM sys.configurations
WHERE name = 'max server memory (MB)'
What could be the reason why the query is taking too much time in the second environment?
PS: The number of rows is the same in both environments (about 2 million).
Several factors can make the results different. I hesitate to mention them because they are the kind of things people tend to reject, but I have experienced them all and had some confirmed by Microsoft.
First, the person who suggested you look at execution plans is on the right track. That is likely to at least give you a clue as to what's different.
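One simple way to gather that evidence (a sketch: run the same query on both servers with the actual execution plan enabled and compare the plans along with the I/O and time statistics):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the slow query here, with "Include Actual Execution Plan" enabled in SSMS
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;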
Reasons (assuming size and content of data are identical)
Statistics are different on the two servers, causing different execution plans.
Hardware performance is different. Slower or faster CPUs (different generation for example, even if clock speed is close), slower or faster disk. More cores vs. one core (leads to parallel plans vs. serial plans)
Hardware configuration is different (i.e. a performance difference). One uses a SAN, one has direct storage. Sometimes this matters a lot; sometimes it makes no difference once the data is in cache.
Data storage is physically different. One server has greatly fragmented, sparse data blocks/pages in the table(s) or indexes you care about; the other has compact, fast ones. This can occur due to the different scenarios used to load data into the two systems.
SQLServer configuration settings: Limited Memory configuration on one system, for example.
Competing workload on one of the systems.
Missing or different indexes.
Different collation settings resulting in different index statistics and different plans.
Slightly different software version.
Size of data is about the same but content is different, changing query plan.
etc.
The time difference involved suggests a different execution plan is most likely.
If all else (indexes, etc.) is 'the same', sometimes it is worth forcing SQLServer to rerun statistics, then try the query again. Historically SQLServer has automatically updated statistics based on the volume of changes to data in a table. Sometimes this leads to statistics that are temporarily bad, as one table hits the threshold for an update but other tables do not, and the optimizer chooses bad plans. It all seems very improbable, but I once built a product that hit this problem regularly, and was fortunate enough to have a visiting MSFT SQLServer developer help me prove it.
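A minimal sketch of forcing a statistics refresh (the table name is a placeholder for the tables involved in your query):

UPDATE STATISTICS dbo.YourBigTable WITH FULLSCAN;
-- or refresh statistics across the whole database:
EXEC sp_updatestats;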
Another one that has really hit me is a slow SAN or slow VMware environment. SANs are often claimed to be fast, but turn out to deliver terrible performance to actual users due to slow network connections or competing workloads. VMware or other virtualization environments often lead to the same problem, especially in large organizations that don't want to figure out what workload is hitting their clusters. Someone else's workload affects yours, so performance testing loses meaning.
My money is on some factor that changes the plan, like data content, statistics, index configuration.

Best SQL\NoSQL solution for specified requirements? [closed]

There's a data set with around 6 million records. Each record has the same number of fields. There are 8 fields in total:
ID Title Color Date1 Date2 Date3 Date4...
There should be a way to filter these records by title and all date fields (or, 'columns' in RDBMS terms).
The size of the data is not so huge, around a few gigabytes. We don't have long text fields etc. (we got rid of them during architecture creation, so now we have only really important fields in the data set).
The backend reads and writes the data quite intensively. We would really like to speed up both reads and writes (and filtering by fields) as much as possible. Currently we're using Postgres and we like its reliability, but it seems it's not really fast. Yes, we did some tweaking and optimization, added indexes, installed it on a 32 GB RAM machine and set all the necessary settings. In other words, it works, but I still believe it could be better. What we need is speed: filtering records by dates and titles should be fast, really fast. Data insertion can be slower. The backend fetches all records that have not been processed yet, processes them, and sets the date flag (the datetime when each was processed). There are around 50 backend 'workers' executed every 5-10 seconds, so the DB should be able to perform really fast. Also we do some DB iterations (kind of like map/reduce jobs), so the DB solution should be able to handle this kind of task (RDBMSs are not really good here).
We don't have joins there, the data is already optimized for big data solutions. Only one 'big table'.
And we would like to run it on a single node, or on many small instances. The data is not really important. But we would like to avoid expensive solutions so we're looking for a SQL or NoSQL solution that will perform faster than Postgres on the same cheap hardware.
I remember I tried MongoDB about a year or two ago. From what I remember, filtering was not so quick at that time. Cassandra was better, but I remember it was able to perform only a small subset of filtering queries. Riak is good, but only for a big cluster with many machines. This is my very basic experience; if you know that one of these solutions performs great, please write that. Or suggest another solution.
Thanks!
I agree with Ryan above. Stick with PostgreSQL.
You haven't described what your write load is actually like (are you updating a few records here and there, but with a lot of parallel queries? Updating with a smaller number of parallel queries but a lot of rows updated at once? etc.), so I can't tell you what you need to do to get more speed.
However, based on your question and the things you say you have tried so far, I would recommend that you consider hiring a consultant to look at your db, look at your environment, etc. with fresh eyes and suggest improvements. My guess is that you have a lot of stuff going on that could be optimized quite a bit and you will spend a lot less on such optimizations than you will switching to a new environment.
I agree with Denis that you should stick with Postgres. In my experience, relational databases, when tuned correctly, give incredibly fast results. Or put another way: I've found it much harder to tune Mongo to get complex queries returning in 10 ms or less than I have tuning SQL Server and MySQL.
Read this website http://use-the-index-luke.com/ for ideas on how to further tune. The guy also wrote a book that will likely be useful to you.
Like Denis said, the data size is not so big that it would be worth the price to start from scratch with a NoSQL solution.
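As one concrete illustration of the kind of index tuning that site covers, applied to the table described in the question (the column names are assumptions, in particular which date column marks "processed", so adjust to the real schema):

-- Composite index for the "filter by title and a date column" pattern:
CREATE INDEX records_title_date1_idx ON records (title, date1);
-- Partial index so the workers' "not yet processed" scan stays small:
CREATE INDEX records_unprocessed_idx ON records (id) WHERE date4 IS NULL;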

When to use a query or code [closed]

I am asking for a concrete case for Java + JPA / Hibernate + Mysql, but I think you can apply this question to a great number of languages.
Sometimes I have to perform a query on a database to get some entities, such as employees. Let's say you need some specific employees (the ones with 'John' as their first name): would you rather do a query returning this exact set of employees, or would you prefer to fetch all the employees and then use the programming language to pick out the ones you are interested in? Why (ease, efficiency)?
Which is (in general) more efficient?
Is one approach better than the other depending on the table size?
Considering:
Same complexity, reusability in both cases.
Always do the query on the database. If you do not, you have to copy more data over to the client, and databases are written to filter data efficiently - almost certainly more efficiently than your code.
The only exception I can think of is if the filter condition is computationally complex and you can spread the calculation over more CPU power than the database has.
In the cases I have dealt with, the database server has had more CPU power than the clients, so unless it is overloaded it will simply run the query more quickly for the same amount of code.
Also, you have to write less code to do the query on the database using Hibernate's query language than you would to manipulate the data on the client. Hibernate queries will also make use of any client-side caching in the configuration without you having to write more code.
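A minimal illustration of the difference in plain SQL (the HQL/JPQL version looks much the same; the table and column names are assumed):

-- Let the database do the filtering:
SELECT * FROM employee WHERE first_name = 'John';
-- ...rather than SELECT * FROM employee and discarding the non-Johns in a client-side loop.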
There is a general trick often used in programming - paying with memory for operation speedup. If you have lots of employees, and you are going to query a significant portion of them, one by one (say, 75% will be queried at one time or the other), then query everything, cache it (very important!), and complete the lookup in memory. The next time you query, skip the trip to RDBMS, go straight to the cache, and do a fast look-up: a roundtrip to a database is very expensive, compared to an in-memory hash lookup.
On the other hand, if you are accessing a small portion of employees, you should query just one employee: data transfer from the RDBMS to your program takes a lot of time, a lot of network bandwidth, a lot of memory on your side, and a lot of memory on the RDBMS side. Querying lots of rows to throw away all but one never makes sense.
In general, I would let the database do what databases are good at. Filtering data is something databases are really good at, so it would be best left there.
That said, there are some situations where you might just want to grab all of them and do the filtering in code though. One I can think of would be if the number of rows is relatively small and you plan to cache them in your app. In that case you would just look up all the rows, cache them, and do subsequent filtering against what you have in the cache.
It's situational. I think in general, it's better to use sql to get the exact result set.
The problem with loading all the entities and then searching programmatically is that you have to load all the entities, which could take a lot of memory. Additionally, you then have to search through all of them. Why do that when you can leverage your RDBMS and get exactly the results you want? In other words, why load a large dataset that could use too much memory, then process it, when you can let your RDBMS do the work for you?
On the other hand, if you know the size of your dataset is not too large, you can load it into memory and then query it - this has the advantage that you don't need to go to the RDBMS, which may or may not require going over your network, depending on your system architecture.
However, even then, you can use various caching utilities so that the common query results are cached, which removes the advantage of caching the data yourself.
Remember that your approach should scale over time. What is a small data set now could turn into a huge data set later. We had an issue with a programmer who coded the application to query the entire table and then run manipulations on it. The approach worked fine when there were only 100 rows with two subselects, but as the data grew over the years, the performance issues became apparent. Inserting even a date filter to query only the last 365 days could help your application scale better.
-- if you are looking for an answer specific to hibernate, check #Mark's answer
Given the Employee example - assuming the number of employees can grow over time - it is better to query the database for the exact data.
However, if you are considering something like Department (for example), where the chances of the data growing rapidly are low, it is useful to query all of them and keep them in memory - this way you don't have to reach out to the external resource (the database) every time, which can be costly.
So the general parameters are these:
scaling of data
criticality to business
volume of data
frequency of usage
To put this concretely: when the data is not going to grow frequently, is not mission critical, has a volume manageable in memory on the application server, and is used frequently - bring it all over and filter it programmatically, if needed.
Otherwise, get only the specific data.
What is better: to store a lot of food at home or buy it little by little? When you travel a lot? When hosting a party? It depends, doesn't it? Similarly, the best approach here is a matter of performance optimization, and that involves a lot of variables. The art is both to avoid painting yourself into a corner when designing your solution and to optimize later, when you know your real bottlenecks. A good starting point is here: en.wikipedia.org/wiki/Performance_tuning One thing could be more or less universally helpful: encapsulate your data access well.

Pros and Cons of using MongoDB instead of MS SQL Server [closed]

I am new to the NoSQL world and thinking of replacing my MS SQL Server database with MongoDB. My application (written in .NET C#) interacts with IP cameras and records metadata for each image coming from a camera into the MS SQL database. On average, I am inserting about 86,400 records per day for each camera, and in the current database schema I have created a separate table for each camera's images, e.g. Camera_1_Images, Camera_2_Images ... Camera_N_Images. A single image record consists of simple metadata such as AutoId, FilePath, CreationDate. To add more detail, my application starts a separate process (.exe) for each camera, and each process inserts 1 record per second into the relevant table in the database.
I need suggestions from (MongoDB) experts on following concerns:
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
Are there any benefits to using multiple databases on the same machine, so that one database holds the current day's images for all cameras and the second one is used to archive the previous days' images? I am thinking of this with respect to splitting reads and writes across separate databases, because all read requests could be served by the second database and writes would go to the first one. Would this help or not? If yes, any ideas on how to ensure that both databases always stay in sync?
Any other suggestions are welcomed please.
I am myself a beginner with NoSQL databases, so I am answering this at the risk of downvotes, but it will be a great learning experience for me.
Before trying my best to answer your questions, I should say that if MS SQL Server is working well for you then stick with it. You have not mentioned any valid reason WHY you want to use MongoDB except the fact that you learnt about it as a document-oriented db. Moreover, I see that you are capturing almost the same set of metadata for each camera, i.e. your schema is not really dynamic.
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
MongoDB, being a document-oriented db, is good at querying within an aggregate (you call it a document). Since you already store each camera's data in its own table, in MongoDB you would have a separate collection created for each camera. Date range queries are done with the standard range operators ($gte, $lt) on the date field.
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
All NoSQL databases are built to scale out on commodity hardware. But from the way you have asked the question, you might be thinking of improving performance by scaling up. You can start with a reasonable machine and, as the load increases, keep adding more servers (scaling out). You don't need to plan for and buy a high-end server.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
MongoDB locks the entire db for a single write (but yields for other operations) and is meant for systems that have more reads than writes, so this depends on how your system behaves. There are multiple ways of sharding, and the choice should be domain-specific. A generic answer is not possible; however, some examples can be given, like sharding by geography, by branch, etc.
Also read A plain english introduction to CAP Theorem
Updated with answer to the comment on sharding
According to their documentation, You should consider deploying a sharded cluster, if:
your data set approaches or exceeds the storage capacity of a single node in your system.
the size of your system’s active working set will soon exceed the capacity of the maximum amount of RAM for your system.
your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention.
So based upon the last point, yes. The auto-sharding feature is built to scale writes. In that case you have a write lock per shard, not per database. But mine is a theoretical answer; I suggest you seek consultation from the 10gen.com group.
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)?
This question is too subjective for me to answer. From personal experience with numerous SQL solutions (ironically not MS SQL), I would say they are both equally good, if done right.
Also:
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
Depends on too many variables that only you know, however a small cluster of commodity hardware works quite well. I cannot really give a factual response to this question and it will come down to your testing.
As for a schema I would go for a document of the structure:
{
    _id: {},
    camera_name: "my awesome camera",
    images: [
        {
            url: "http://I_like_S3_here.amazons3.com/my_image.png",
            // All your other fields per image
        }
    ]
}
This should be quite easy to maintain and update, so long as you are not embedding much deeper - then it could become a bit of a pain; however, that depends upon your queries.
Not only that but this should be good for sharding since you have all the data you need in one document, if you were to shard on _id you could probably get the perfect setup here.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
Possibly. Many people assume they need to shard when in reality they just need to be more intelligent in how they design the database. MongoDB is very free-form, so there are a lot of ways to do it wrong, but that being said, there are also a lot of ways of doing it right. I personally would keep sharding in mind. Replication can be very useful too.
Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images?
Even though MongoDB's write lock is at the DB level (currently), I would say: no. The right document structure and the right sharding/replication (if needed) should be able to handle this in a single document-based collection (or collections) under a single DB. Not only that, but you can direct writes and reads within a cluster to certain servers so as to create a concurrency situation between certain machines in your cluster. I would promote the correct usage of MongoDB's concurrency features over DB separation.
Edit
After reading the question again, I realised I had overlooked that you are inserting 80k+ images per camera per day. As such, instead of the embedded option I would actually make a document per image in a collection called images, plus a camera collection, and query the two much as you would in SQL.
Sharding the images collection should be just as easy on camera_id.
Also make sure you take your working set into consideration for your server.
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
MongoDB can do this. For better performance, you can set an index on your time field.
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
I think RAM and Disk would be important.
If you don't want to do sharding to scale out, you should consider a larger size of disk so you can store all your data in it.
Your hot data should fit into your RAM. If not, then you should consider a larger amount of RAM, because the performance of MongoDB mainly depends on RAM.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
I don't know how many cameras you have, but even 1000 inserts/second from a total of 1000 cameras should still be easy for MongoDB. If you are concerned about insert performance, I don't think you need sharding (unless the data size is so big that you have to split it across several machines).
Another question is the read frequency of your application. If it is very high, then you can consider sharding or replication here.
And you can use (timestamp + camera_id) as your sharding key if your query only on one camera in a time range.
Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images?
You can separate the table into two collections (archive and current), and set an index only on archive if you only query by date on the archive. Without the overhead of index maintenance, the current collection should benefit on inserts.
And you can write a daily program to dump the current data into archive.

SQL searchable cache - high scalability [closed]

I have developed a website which provides very generic data storage. Currently it works just fine but I am thinking about optimizing the speed.
The INSERT/SELECT ratio is hard to predict and changes from case to case, but usually SELECTs are more frequent. INSERTs are fast enough; SELECTs are what worry me. There are a lot of LEFT JOINs. E.g. each object can have an image, which is stored in a separate table (as it can be shared across multiple objects) that also stores additional information about the image.
Up to 8 joins are made for every SELECT and it can take up to 1 second to process - the mean value is around 0.3 s. There can be multiple such SELECTs for every request. The SQL side has already been optimized multiple times and there is not much more that can be done there.
Other than buying more powerful machine for DB, what can be done (if anything)?
Django is not a speed demon here either, but we still have some optimizations left there (switch to PyPy if we must). On the DB side I had a few ideas, but they seem to be uncommon - I couldn't find any real case scenario:
Use different, faster storage for this part. We need transactions and consistency checks, so it may not be preferable.
A searchable cache? Does it make any sense here? E.g. maintain a flat copy of all tables combined in NoSQL or something similar. Inserts would be more expensive - they would need to update multiple records in the NoSQL store if some common table changes. It would be tough to maintain as well.
Is there anything here that would make sense, or is this about as fast as it gets - just add more RAM, increase the cache size in the RDBMS, get SSDs and leave it? Should we instead focus on optimizing other parts, like pooling database connections, since they are expensive as well?
Technologies used: PostgreSQL 9.1 and Django (python).
To summarize, the question is: after optimizing all of the SQL side - indexes, clustering, etc. - what can be done to optimize further when a static timeout cache for results is not an option (different request arguments give different results anyway)?
---EDIT 30-08-2012---
We are already checking for slow queries on a daily basis. This IS our bottleneck. We only order and filter on indexed columns. Also, sorry for not being clear about this - we don't store actual images in the db, just file paths.
JOINs and ORDER BY are killing our performance here. E.g. one complex query that spits out 20,000 results takes 1800 ms (measured with EXPLAIN ANALYZE), and that's without any filtering based on the JOINed tables.
If we skip all the JOINs we are down to 110 ms. That's insane... That's why we are thinking about some kind of searchable cache or a flat NoSQL copy.
Without the ordering we get 60 ms, which is great, but what's with the JOIN performance in PostgreSQL?
Is there some different DB that can do better for us? Preferably free one.
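To make the "flat copy" idea a bit more concrete, this is roughly what we have in mind if it were kept inside Postgres itself rather than a NoSQL store (table and column names are simplified; PostgreSQL 9.1 has no materialized views, so it would be a plain table rebuilt by a scheduled job):

CREATE TABLE objects_flat AS
SELECT o.id, o.title, o.created_at, img.path AS image_path
FROM objects o
LEFT JOIN images img ON img.id = o.image_id;
CREATE INDEX objects_flat_created_idx ON objects_flat (created_at);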
First, although I think there are times and places to store image files in the database, in general you are going to have extra I/O and memory associated with that sort of operation. If I were looking at optimizing this, I would store every image with a path and be able to bulk-save them to the fs. That way they are still in your db for backup purposes, but you can just pull the relative path out and generate links, saving you a bunch of SQL queries and reducing overhead. In a web-based backend you aren't going to get transactions working really well between generating the HTML and retrieving the image anyway, since these come in under different HTTP requests.
As for speed, I can't tell if you are looking at total HTTP request time or db time. The first thing you need to do is break everything apart and look at where most of your time is being spent; this may surprise you. The next thing is to get query plans for the slow queries:
http://heatware.net/databases/how-to-find-log-slow-queries-postgresql/
Then from there, start using explain analyze to find out what is the problem.
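For example (the query below is just a stand-in for one of the multi-join SELECTs described above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*, img.path
FROM objects o
LEFT JOIN images img ON img.id = o.image_id
ORDER BY o.created_at DESC
LIMIT 100;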
Also, in deciding whether to upgrade hardware, you want a good idea of where you are currently hitting limits. More RAM generally helps (and it is helpful if your db can fit comfortably in RAM), but beyond that it makes no sense to put faster storage in a CPU-bound server or to switch to faster CPUs in an I/O-bound server. top is your friend there. Similarly, depending on the concurrency issues, it might (or might not!) make sense to use a hot standby for your SELECT statements.
But without a lot more information I can't tell you what the best way to go about further optimizing your db is. PostgreSQL is capable of running really fast under the right conditions and scaling very well.