database index and memory usage - sql

Suppose I have a table that stores 100 million records, with a column of strings of varying sizes up to 20 characters. I need to index this column, and I only have a machine with 2 GB of RAM; is this sufficient to perform such a task? Is MySQL a recommended DB engine for this kind of storage?

Databases are generally designed in a way that allows them to work with more data than you have available RAM. Giving the database more working memory will speed things up, but it should be able to build the index and perform searches on it just fine.

If you have 2 GB of main memory, then yes, you should be able to build the index without any problems; virtual memory is a wonderful thing, and the DBMS may well arrange to spill data to disk as it goes.
If you only have 2 GB of disk space, you don't have enough space for the data and the index.
To no-one's surprise, it is 2 GB of main memory, not 2 GB of disk (that comment was mainly in jest - but these days, if someone says 256 GB, it is not clear whether they're referring to disk space or main memory; it could be either).
Yes, if the DBMS cannot create the index within that constraint, it is not worthy of being termed a DBMS.
MySQL probably can do the job. It isn't what I'd recommend, but I'm very biased in this area as a result of being one of the developers of an alternative (commercial) DBMS. We don't have enough information about your budget etc. to be able to advise reliably.
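If it helps, here is a rough sketch of what that looks like in MySQL; the table and column names are made up, so adjust them to your schema. With only 2 GB of RAM the index build simply spills to disk and takes longer than it would with more memory:

    -- Hypothetical table holding the 100 million strings (up to 20 characters)
    CREATE TABLE items (
        id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(20)     NOT NULL
    ) ENGINE=InnoDB;

    -- Index the whole column
    CREATE INDEX idx_items_name ON items (name);

    -- If the leading characters are selective enough, a prefix index keeps the
    -- index considerably smaller, which is friendlier to a 2 GB machine
    CREATE INDEX idx_items_name_prefix ON items (name(10));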

Related

What about performance of cursors, reindex and shrinking?

I recently came to know that in SQL Server, if I delete or modify a column, the space it used is not released, so I need to reindex and shrink the database. I have done this and my database size went down from 2.82 GB to 1.62 GB, which is good. But now I am confused, and several questions about this subject come to mind; please help me with them:
1. Is it necessary to recreate (refresh) indexes after a particular interval?
2. Is it necessary to shrink the database after a particular time so that performance stays up to date?
3. If the above is yes, then at what interval should I refresh (shrink) my database?
I have no idea what should be done about the disk space problem: I have 77,000 records and they take 2.82 GB of data space, which is not acceptable. I have two tables, and only one of them has an NVARCHAR(MAX) column, so the database should take a minimal amount of space. Can anyone help me with this one? Thanks in advance.
I am going to simplify things a little for you, so you might want to read up on the things I talk about in my answer.
Two concepts you must understand: allocated space vs free space. A database might be 2 GB in size but only be using 1 GB, so it has allocated 2 GB with 1 GB of free space. When you shrink a database it removes the free space, so free space should be about 0. Don't think a smaller file size is faster. As your database grows it has to allocate space again. When you shrink the file and it then grows every so often, it cannot allocate space in a contiguous fashion. This creates fragmentation of the files, which slows you down even more.
With data files (.mdf) this is not so bad, but with the transaction log, shrinking the log can lead to virtual log file fragmentation issues which can slow you down. So in a nutshell there is very little reason to shrink your database on a schedule. Go read about Virtual Log Files in SQL Server; there are a lot of articles about it. This is a good article about shrinking log files and why it is bad. Use it as a starting point.
Secondly, indexes get fragmented over time. This will lead to bad performance of SELECT queries mainly, but will also affect other queries. Thus you need to perform some index maintenance on the database. See this answer on how to defragment your indexes.
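As a rough sketch of what checking for fragmentation looks like (this assumes SQL Server 2005 or later, and dbo.YourTable is a placeholder for your own table):

    -- Which indexes are fragmented, and by how much?
    SELECT OBJECT_NAME(ips.object_id)        AS table_name,
           i.name                            AS index_name,
           ips.avg_fragmentation_in_percent
    FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN   sys.indexes AS i
           ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE  ips.avg_fragmentation_in_percent > 10;

    -- Common rule of thumb: REORGANIZE for light fragmentation (~10-30%),
    -- REBUILD for heavy fragmentation (above ~30%)
    ALTER INDEX ALL ON dbo.YourTable REORGANIZE;
    ALTER INDEX ALL ON dbo.YourTable REBUILD;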
Update:
Well, the time at which you rebuild indexes is not clear cut. Index rebuilds lock the index during the rebuild; essentially the index is offline for the duration. In your case it would be fast: 77,000 rows is nothing for SQL Server. Rebuilding the indexes will, however, consume server resources. If you have Enterprise Edition you can do online index rebuilding, which will NOT lock the indexes but will consume more space.
So what you need to do is find a maintenance window. For example, if your system is used from 8:00 till 17:00 you can schedule maintenance rebuilds after hours. Schedule this with SQL Server Agent. The script in the link can be automated to run.
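The job itself could be as simple as the following sketch (ONLINE = ON needs Enterprise Edition, dbo.YourTable is a placeholder, and note that versions before SQL Server 2012 cannot rebuild indexes that include NVARCHAR(MAX) columns online):

    -- Rebuild all indexes on a table without taking them offline
    ALTER INDEX ALL ON dbo.YourTable
        REBUILD WITH (ONLINE = ON);

    -- Refreshing statistics at the same time is usually a good idea
    EXEC sp_updatestats;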
Your database is not big. I have seen SQL Server handle tables of 750 GB without taking strain if the IO is split over several disks. The slowest part of any database server is not the CPU or the RAM but the IO pathways to the disks. This is a huge topic though. Back to your point: you are storing data in NVARCHAR(MAX) fields, which I assume is large text. So after you shrink the database you see the size at 1.62 GB, which means that each row in your database is about 1.62 GB / 77,000, or roughly 22 KB. This seems reasonable. Export the table to a text file and check the size; you will be surprised, it will probably be larger than 1.62 GB.
Feel free to ask more detail if required.

How to store 15 x 100 million 32-byte records for sequential access?

I've got 15 x 100 million 32-byte records. Only sequential access and appends are needed. The key is a Long. The value is a tuple - (Date, Double, Double). Is there something in this universe which can do this? I am willing to have 15 separate databases (sql/nosql) or files for each of those 100 million records. I only have an i7 core and 8 GB RAM and a 2 TB hard disk.
I have tried PostgreSQL, MySQL, Kyoto Cabinet (with fine tuning) with Protostuff encoding.
SQL DBs (with indices) take forever to do the silliest query.
Kyoto Cabinet's B-Tree can handle up to 15-18 million records, beyond which appends take forever.
I am fed up so much that I am thinking of falling back on awk + CSV which I remember used to work for this type of data.
If your scenario means always going through all records in sequence, then it may be overkill to use a database. If you start to need random lookups, replacing/deleting records or checking whether a new record is a duplicate of an older one, a database engine would make more sense.
For sequential access, a couple of text files or hand-crafted binary files will be easier to handle. You sound like a developer - I would probably go for my own binary format and access it with the help of memory-mapped files to improve the sequential read/append speed. No caching, just a sliding window to read the data. I think it would perform better, even on ordinary hardware, than any DB would; I did such a data analysis once. It would also be faster than awking CSV files; however, I am not sure by how much, and whether it would justify the effort of developing the binary storage in the first place.
As soon as the database becomes interesting, you can have a look at MongoDB and CouchDB. They are used for storing and serving very large amounts of data. (There is a flattering evaluation that compares one of them to traditional DBs.) Databases usually need reasonably powerful hardware to perform well; maybe you could check out how those two would do with your data.
--- Ferda
Ferdinand Prantl's answer is very good. Two points:
Given your requirements, I recommend that you create a very tight binary format. This will be easy to do because your records are fixed size.
If you understand your data well you might be able to compress it. For example, if your key is an increasing log value you don't need to store it entirely. Instead, store the difference to the previous value (which is almost always going to be one). Then, use a standard compression algorithm/library to save on data size big time.
For sequential reads and writes, leveldb will handle your dataset pretty well.
I think that's about 48 gigs of data in one table.
When you get into large databases, you have to look at things a little differently. With an ordinary database (say, tables less than a couple million rows), you can do just about anything as a proof of concept. Even if you're stone ignorant about SQL databases, server tuning, and hardware tuning, the answer you come up with will probably be right. (Although sometimes you might be right for the wrong reason.)
That's not usually the case for large databases.
Unfortunately, you can't just throw 1.5 billion rows straight at an untuned PostgreSQL server, run a couple of queries, and say, "PostgreSQL can't handle this." Most SQL dbms have ways of dealing with lots of data, and most people don't know that much about them.
Here are some of the things that I have to think about when I have to process a lot of data over the long term. (Short-term or one-off processing, it's usually not worth caring a lot about speed. A lot of companies won't invest in more RAM or a dozen high-speed disks--or even a couple of SSDs--for even a long-term solution, let alone a one-time job.)
Server CPU.
Server RAM.
Server disks.
RAID configuration. (RAID 3 might be worth looking at for you.)
Choice of operating system. (64-bit vs 32-bit, BSD v. AT&T derivatives)
Choice of DBMS. (Oracle will usually outperform PostgreSQL, but it costs.)
DBMS tuning. (Shared buffers, sort memory, cache size, etc.)
Choice of index and clustering. (Lots of different kinds nowadays.)
Normalization. (You'd be surprised how often 5NF outperforms lower NFs. Ditto for natural keys.)
Tablespaces. (Maybe putting an index on its own SSD.)
Partitioning. (A sketch covering tablespaces and partitioning follows at the end of this answer.)
I'm sure there are others, but I haven't had coffee yet.
But the point is that you can't determine whether, say, PostgreSQL can handle a 48 gig table unless you've accounted for the effect of all those optimizations. With large databases, you come to rely on the cumulative effect of small improvements. You have to do a lot of testing before you can defensibly conclude that a given dbms can't handle a 48 gig table.
Now, whether you can implement those optimizations is a different question--most companies won't invest in a new 64-bit server running Oracle and a dozen of the newest "I'm the fastest hard disk" hard drives to solve your problem.
But someone is going to pay either for optimal hardware and software, for dba tuning expertise, or for programmer time and waiting on suboptimal hardware. I've seen problems like this take months to solve. If it's going to take months, money on hardware is probably a wise investment.
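To make two of the bullet points above concrete, here is a PostgreSQL-flavoured sketch of tablespaces and partitioning. The paths, names and partition scheme are invented for illustration, and declarative partitioning needs PostgreSQL 10 or later (older versions used table inheritance plus CHECK constraints):

    -- Range-partition the big table, e.g. by date
    CREATE TABLE records (
        key bigint           NOT NULL,
        ts  date             NOT NULL,
        v1  double precision,
        v2  double precision
    ) PARTITION BY RANGE (ts);

    CREATE TABLE records_2012 PARTITION OF records
        FOR VALUES FROM ('2012-01-01') TO ('2013-01-01');

    -- Give a hot index its own device (e.g. an SSD) via a tablespace
    CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';
    CREATE INDEX records_2012_key_idx ON records_2012 (key) TABLESPACE fast_ssd;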

What happens when maxing out Postgres' work_mem?

How does the work_mem option in Postgres work? Here's the description from http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html:
Specifies the amount of memory to be used by internal sort operations and hash tables before switching to temporary disk files. The value defaults to one megabyte (1MB). Note that for a complex query, several sort or hash operations might be running in parallel; each one will be allowed to use as much memory as this value specifies before it starts to put data into temporary files. Also, several running sessions could be doing such operations concurrently. So the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
I'm probably totally wrong here, but... isn't "switching to temporary disk files" essentially the same thing as "virtual memory" in the operating system? Wouldn't the OS just create a swap file once the RAM is gone? Wouldn't it be better to set this to something like 100TB and let the OS figure it out? Before I potentially mess up my system, I want to check if anyone actually tried this approach.
PostgreSQL will, for example, switch to a sort algorithm more suitable for on-disk sorting than for in-memory sorting if it knows the sort will happen on disk - which it won't know if it happens in swap.
Also, PostgreSQL can switch to a completely different plan (for example, using a different JOIN method) if it figures out the data does not fit in RAM.
Setting work_mem too high will get you a very slow database as soon as you have enough data so that everything doesn't always fit in RAM anymore.
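You can watch this behaviour with EXPLAIN ANALYZE, since work_mem can be changed per session; a quick sketch with made-up names (big_table, some_column):

    SET work_mem = '1MB';
    EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;
    -- The plan reports something like "Sort Method: external merge  Disk: ..."

    SET work_mem = '256MB';
    EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;
    -- With enough memory the same sort reports "Sort Method: quicksort  Memory: ..."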
Keep in mind that work_mem is the maximum amount of RAM that can be used by each individual sort operation. Within a single query, multiple sort operations might run in parallel, and there might be multiple connections querying the database at once. For that reason, all sort operations together may use many times work_mem worth of RAM (which is why a conservative value is recommended).
Now back to your question: if you set work_mem to such a high value, sort operations might use up most of your RAM, which leads to pages being swapped in and out (keep in mind that there are lots of other processes and PostgreSQL components that also need some, or even lots of, RAM). Disk-based sort operations are far more efficient than page swaps done by the OS. As some of the other replies pointed out, a database server that constantly swaps in and out will perform extremely slowly.
Another point is that with such a high work_mem value, a single query (on purpose or by accident) could more or less make the whole database server unresponsive.
A database server that swaps is a dead database server.
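A back-of-the-envelope check before raising work_mem might look like this; the numbers are purely illustrative, and ALTER SYSTEM needs PostgreSQL 9.4 or later (on older versions, edit postgresql.conf instead):

    -- Worst case is roughly: connections x sort/hash nodes per query x work_mem,
    -- e.g. 100 connections x 4 nodes x 256MB = ~100 GB of RAM
    SHOW max_connections;                  -- how many sessions can hit you at once
    SHOW work_mem;                         -- current per-operation limit
    ALTER SYSTEM SET work_mem = '16MB';    -- keep the product well under physical RAM
    SELECT pg_reload_conf();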
In RAM, Postgres uses quicksort; on disk it uses another algorithm which is much better suited to hard disks. Using quicksort on swapped-out memory will be incredibly slow.
The OS is generic in how it handles swap. Besides, there is a finite amount of address space a process can use, which isn't that big on 32-bit systems (2 GB on a 32-bit Windows platform, which can be raised to 3 GB). But you're right, you could let the OS handle this through virtual memory.
PostgreSQL is not 'generic': it knows much better than the OS how to structure data once disk access is involved, so letting the database switch over to explicit file handling once memory is exhausted has benefits over letting the OS handle it.

GemStone-Linux-Apache-Seaside-Smalltalk.. how practical is 4GB?

I am really interested in GLASS. The 4 GB limit for the free version has me concerned, especially when I consider the price for the next level ($7,000 a year).
I know this can be subjective and variable, but can someone describe for me in everyday terms what 4 GB of GLASS will get you? Maybe a business example. 4 GB may get me more storage than I realize.. and I don't have to worry about it.
In my app, some messages have file attachments up to 5 MB in size. Can I conserve the 4 GB of Gemstone space by saving these attachments directly to files on the operating system, instead of inside Gemstone? I'm thinking yes.
I'm aware of one GLASS system that is ~944 MB and has 8.3 million objects, or ~118 bytes per object. At this rate, it can grow to over 36 million objects and stay under 4 GB.
As to "attachments", I'd suggest that even in an RDBMS you should consider storing larger, static data in the file system and referencing it from the database. If you are building a web-based application, serving static content (JPG, CSS, etc.) should be done by your web server (e.g., Apache) rather than through the primary application.
By comparison, Oracle and Microsoft SQL Server have no-cost licenses for a 4-GB database.
What do you think would be a good price for the next level?
The 4 GB limit was removed a while ago. The free version is now limited to the use of two cores and 2 GB of RAM.
4 GB is quite a decent database size. Not having used GemStone before, I can only speculate as to how efficient it is at storing objects, but having played with a few other similar object databases (MongoDB, db4o), I know that you're going to be able to fit several (5-10) million records before you even get close to that limit. In reality, how many records depends highly on the type of data you're storing.
As an example, I was storing ~2 million listings and ~1 million transactions in a MySQL database, and the space used was < 1 GB. You have a small overhead from serializing a whole object, but not that much.
Files can definitely be stored on the file system.
4 GB an issue... I guess you think you're building the next eBay!
Nowadays, there is no limit on the size of the repository. See the latest specs for GemStone.
If you have multiple simultaneous users with attachments of 5 MB, you need a separate strategy for them anyway, as each one takes about a twentieth of a second of bandwidth on a gigabit Ethernet network.

Database Disk Queue too high, what can be done?

I have a problem with a large database I am working with, which resides on a single drive. This database contains around a dozen tables; the two main ones are around 1 GB each and cannot be made smaller. My problem is that the disk queue for the database drive is around 96% to 100% even when the website that uses the DB is idle. What optimisation could be done, or what is the source of the problem? The DB on disk is 16 GB in total and almost all the data is required - transaction data, customer information and stock details.
What are the reasons why the disk queue is always high no matter the website traffic?
What can be done to help improve performance on a database this size?
Any suggestions would be appreciated!
The database is an MS SQL 2000 database running on Windows Server 2003 and, as stated, 16 GB in size (data file size on disk).
Thanks
Well, how much memory do you have on the machine? If you can't store the pages in memory, SQL Server is going to have to go to the disk to get its information. If your memory is low, you might want to consider upgrading it.
Since the database is so big, you might want to consider adding two separate physical drives and then putting the transaction log on one drive and partitioning some of the other tables onto the other drive (you have to do some analysis to see what the best split between tables is).
In doing this, you are allowing IO accesses to occur in parallel, instead of in serial, which should give you some more performance from your DB.
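As a sketch of what that might look like on SQL Server (database, path, file and index names are all placeholders; moving the log file itself on SQL Server 2000 is a detach/move/re-attach operation with sp_detach_db and sp_attach_db, while 2005 and later can use ALTER DATABASE ... MODIFY FILE):

    -- Add a filegroup that lives on the second physical drive
    ALTER DATABASE MyDb ADD FILEGROUP SecondaryFG;
    ALTER DATABASE MyDb ADD FILE (
        NAME = 'MyDb_Secondary',
        FILENAME = 'E:\SQLData\MyDb_Secondary.ndf',
        SIZE = 4GB
    ) TO FILEGROUP SecondaryFG;

    -- Place an index for one of the big tables on the new drive; moving a table
    -- itself means recreating its clustered index on the new filegroup
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
        ON dbo.Orders (CustomerId)
        ON SecondaryFG;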
Before buying more disks and shifting things around, you might also update statistics and check your queries - if you are doing lots of table scans and so forth you will be creating unnecessary work for the hardware.
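For example (object names are placeholders):

    -- Refresh statistics for one big table, or for the whole database
    UPDATE STATISTICS dbo.Orders WITH FULLSCAN;
    EXEC sp_updatestats;

    -- Then measure a suspect query
    SET STATISTICS IO ON;
    -- run the slow query here: high logical/physical reads plus table scans in
    -- the execution plan are the usual signs of missing or unusable indexes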
Your database isn't that big after all - I'd first look at tuning your queries. Have you profiled what sort of queries are hitting the database?
If your disk activity is that high while your site is idle, I would look for other processes that might be running that could be affecting it. For example, are you sure there aren't any scheduled backups running? Especially with a large db, these could be running for a long time.
As Mike W pointed out, there is usually a lot you can do with query optimization with existing hardware. Isolate your slow-running queries and find ways to optimize them first. In one of our applications, we spent literally 2 months doing this and managed to improve the performance of the application, and the hardware utilization, dramatically.