Coming from MySQL and PostgreSQL, I would very much like to know how SQL Server stores and handles large physical database files.
According to this article,
http://msdn.microsoft.com/en-us/library/aa174545%28SQL.80%29.aspx
SQL Server has three types of files: .mdf, .ndf, and .ldf.
Due to the nature of how data grows, a database can eventually contain hundreds of thousands of rows, which would affect the size of the .mdf file.
So the question is, how does SQL Server handle large physical database files?
I might seem to be asking a lot of questions, but I would like an answer that also covers the sub-questions below:
Theoretically, the .mdf file size could grow to gigabytes or perhaps terabytes. Is this common in real-world scenarios?
Since SQL Server deals with a single file, it would perform considerably large read/write operations on that same file. How would this impact performance?
Is it possible (has there been any case) to split the .mdf into parts? Instead of having one uber-large .mdf file, would it be better to split it into chunks?
Note: I am new to SQL Server. Basic queries in SQL Server appear similar to MySQL; I would like to know a bit about what is going on "under the hood".
1. Theoretically, the .mdf file size could grow to gigabytes or perhaps terabytes. Is this common in real-world scenarios?
Yes, it is common. It depends on the number of read/write operations per second and on your disk subsystem. Nowadays, a database of a few hundred GB is considered small.
2. Since SQL Server deals with a single file, it would perform considerably large read/write operations on that same file. How would this impact performance?
This is one of the most common performance bottlenecks. You need to choose an appropriate disk subsystem, and possibly divide your database into several filegroups and place them on different disk subsystems.
3. Is it possible (has there been any case) to split the .mdf into parts? Instead of having one uber-large .mdf file, would it be better to split it into chunks?
Yes, you can. These "chunks" are called filegroups. You can create different tables, indexes, and other objects, or even parts of tables, in different filegroups (if your version and edition of SQL Server allow it). But it will only give you an advantage if you create the filegroups across multiple disks, RAID arrays, and so on. For more information, read Using Files and Filegroups.
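As a rough illustration (not part of the original answer), here is a minimal T-SQL sketch that adds a filegroup backed by a file on a second physical drive and places a table on it; the database name, file path, and sizes are placeholders:

    -- Hypothetical sketch: database name, file path and sizes are placeholders.
    ALTER DATABASE SalesDB ADD FILEGROUP FG_Archive;

    ALTER DATABASE SalesDB
    ADD FILE
    (
        NAME = SalesDB_Archive1,
        FILENAME = 'E:\SQLData\SalesDB_Archive1.ndf',  -- a different physical disk
        SIZE = 10GB,
        FILEGROWTH = 1GB
    )
    TO FILEGROUP FG_Archive;

    -- New tables (and indexes) can then be placed on that filegroup explicitly.
    CREATE TABLE dbo.OrderHistory
    (
        OrderId   INT          NOT NULL PRIMARY KEY,
        OrderedOn DATETIME2    NOT NULL,
        Payload   VARCHAR(200) NULL
    ) ON FG_Archive;

The gain comes from the extra spindles, not from the split itself; on a single disk this mostly adds management overhead.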
Related
I come from a Sybase background, and with it, if a backup to one file took 20 minutes, a backup to two files would take 10 minutes (plus a bit of overhead), four files would take 5 minutes (plus a bit more overhead), etc. I expected to see the same results with DB2 but it doesn't seem to be reducing the overall backup time at all. While not optimal, in both the Sybase and DB2 tests the files were all being written to the same filesystem. Am I misunderstanding what the multi-file backup achieves in DB2? Thanks.
When you take a look at the BACKUP DATABASE syntax and options, you will notice that Db2 supports several storage targets (with their respective options) as well as options for how the database data is read. The backup process consists of reading the relevant data from the database and writing it to the backup device.
For the reading part, there are options like BUFFER and PARALLELISM that impact performance and throughput. By default, if they are not specified by the user, Db2 tries to come up with good values. This is something you could look into.
Are you compressing or encrypting the backup file? Are you writing the backup to the same file system your database is on? Those would be more things to consider.
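Purely as an illustration of those options, a hedged Db2 command-line sketch of a backup with two targets and explicit read parallelism (the database alias and target paths are placeholders):

    -- Hypothetical sketch: database alias and target paths are placeholders.
    -- Two targets, explicit buffers and read parallelism, compressed backup.
    BACKUP DATABASE sample
        TO /backup/target1, /backup/target2
        WITH 4 BUFFERS BUFFER 4096
        PARALLELISM 4
        COMPRESS;

If the read side (PARALLELISM, buffers) or the shared target filesystem is the bottleneck, simply adding more output files will not shorten the backup.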
I have read quite a bit about spreading tempdb and database files across LUNs to optimize SQL Server. My question is: if I have just one LUN, would spreading tempdb have any positive effect, or is it better to request an additional separate disk to get performance benefits?
Having more than one database file (for any database) on a single physical disk (LUN or not) does not provide improved performance most of the time.
Regarding tempdb, it will improve performance if there is allocation contention. In that case you should create as many equally sized data files as you have logical processors.
But if you can request more LUNs, why assign them to tempdb? Is tempdb really that overloaded? Typically you will get better overall performance if the new LUNs are assigned to other databases.
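If you do hit allocation contention on a single LUN, a hedged T-SQL sketch of adding equally sized tempdb data files (file names, path, and sizes are placeholders; common guidance caps the file count at eight):

    -- Hypothetical sketch: file names, path and sizes are placeholders.
    -- Keep all tempdb data files the same size and growth increment.
    ALTER DATABASE tempdb
        MODIFY FILE (NAME = tempdev, SIZE = 4096MB, FILEGROWTH = 512MB);

    ALTER DATABASE tempdb
        ADD FILE (NAME = tempdev2,
                  FILENAME = 'D:\SQLData\tempdb2.ndf',
                  SIZE = 4096MB,
                  FILEGROWTH = 512MB);

Note that this addresses allocation contention (PFS/SGAM pressure), not raw I/O throughput, which is why it can help even on one LUN.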
I have updated two databases on SQL Server 2008 R2 using Liquibase.
Both of them started from the same database, but one ran through several Liquibase updates incrementally up to the final one, while the other went straight to the final update.
I have checked that they have the same schema and the same set of data, but their .mdf file sizes are 10 GB apart.
What areas can I look into (ideally with the SQL commands to use) to investigate what could give me this 10 GB difference (e.g. indexes? unused empty space? etc.)?
I am not trying to make them the same (so no shrinking); I just want to find out what contributes to this 10 GB size difference. I will even accept answers like using a hex editor to open the .mdf files and compare them byte by byte, but I need to know what I am looking at.
Thank you
The internal structure (physical organization, not logical data) of databases is opaque both by design and due to the real-world scenarios that affect how data is created, updated and accessed.
In most cases there is literally no telling why two logically equivalent databases are different on a physical level. It is some combination of deleted objects, unbalanced pages, disk-based temporary tables, history of garbage collection, and many other potential causes.
In short, you would never expect a physical database to be 1:1 with the logical data it contains.
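That said, if you still want to see where the allocated space sits on each side, a hedged starting point (standard catalog views and DMVs, no changes to either database) is to run the following against both databases and compare the output:

    -- Hedged sketch: run in each database and compare side by side.

    -- Reserved vs. used space per database file
    SELECT name                                   AS file_name,
           size / 128                             AS size_mb,
           FILEPROPERTY(name, 'SpaceUsed') / 128  AS used_mb
    FROM   sys.database_files;

    -- Reserved and used space per table and index
    SELECT OBJECT_NAME(object_id)                 AS table_name,
           index_id,
           SUM(reserved_page_count) * 8 / 1024    AS reserved_mb,
           SUM(used_page_count)     * 8 / 1024    AS used_mb
    FROM   sys.dm_db_partition_stats
    GROUP  BY object_id, index_id
    ORDER  BY reserved_mb DESC;

Comparing the per-index reserved space usually narrows the difference down to a handful of objects, or to plain unused space inside the files.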
I have some large (200 GB is normal) flat files of data that I would like to store in some kind of database so that it can be accessed quickly and in the intuitive way that the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (samples) and can be thought of as a row. One of these files normally has about 100,000 recordings of 2,000,000 samples each in length.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.
My recommendation would be to store the files on disk but create a DB-based index. Most filesystems get cranky or slow once you have more than about 10,000 files in a single directory. Your application can generate the filename and store the metadata in the DB, then organize the files by the generated name on disk. The downside is that the file contents may not be directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins and a sophisticated partitioning or incremental backup scheme. Also, seeks within a file become much simpler operations (skip ahead, rewind, etc.); there is generally better support for these operations in a file system than in a DB.
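A minimal sketch of what that DB-based index might look like (table and column names are purely illustrative):

    -- Hypothetical sketch: the application generates StoredPath, writes the
    -- recording to disk, and the database holds only the index and metadata.
    CREATE TABLE dbo.RecordingIndex
    (
        RecordingId  BIGINT IDENTITY(1,1) PRIMARY KEY,
        StoredPath   NVARCHAR(400) NOT NULL,   -- generated name, hashed into subfolders
        CollectedOn  DATETIME2     NOT NULL,
        SampleRateHz INT           NOT NULL,
        OwnerName    NVARCHAR(100) NULL
    );

Hashing the generated name into subfolders (for example, two levels of two hex characters) keeps each directory well below the point where filesystems start to slow down.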
I wonder what makes you think that an RDBMS would be limited to mere thousands of rows; there's no reason this would be the case.
Also, at least some databases (Oracle, for example) allow direct access to parts of LOB data without loading the full LOB, if you know the offset and length you want. So you could have a table with some searchable metadata and then the LOB column, and, if needed, an additional metadata table describing the LOB contents, so that you'd have some kind of keyword -> (offset, length) relation available for partial loading of LOBs.
Somewhat echoing another post here, incremental backups (which you might wish to have here) are not quite feasible with databases (OK, they can be possible, but at least in my experience they tend to have a nasty price tag attached).
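To make the partial-LOB access mentioned above concrete, a hedged Oracle sketch (table and column names are placeholders); DBMS_LOB.SUBSTR takes the amount and the offset, so only the requested slice is read:

    -- Hypothetical sketch: read 1,001 bytes starting at offset 1,000
    -- from a BLOB column, without loading the whole LOB.
    SELECT DBMS_LOB.SUBSTR(r.samples, 1001, 1000) AS sample_window
    FROM   recordings r
    WHERE  r.recording_id = 42;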
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)
If it is 2 million samples making up 200 GB, then each sample is ~100 KB and each recording is ~2 MB (to have 100,000 per file, which is 20 samples per recording)?
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading into memory only a certain range, if you have indexed the sample ids, then you could very quickly query for only the subset you want, loading only that range into memory from the DB query result.
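One way to read that suggestion is a table with one row per sample; a hedged sketch of that layout and the range query (all names are placeholders):

    -- Hypothetical sketch: one row per sample, clustered on (recording, position).
    CREATE TABLE dbo.Samples
    (
        RecordingId INT     NOT NULL,
        SampleId    INT     NOT NULL,   -- position within the recording
        Value       TINYINT NOT NULL,   -- 8-bit sample
        CONSTRAINT PK_Samples PRIMARY KEY CLUSTERED (RecordingId, SampleId)
    );

    -- Load only samples 1,000-2,000 of one recording into memory
    SELECT SampleId, Value
    FROM   dbo.Samples
    WHERE  RecordingId = 42
      AND  SampleId BETWEEN 1000 AND 2000;

The per-row overhead makes this much larger on disk than packed binary, so it only pays off if range queries dominate the workload.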
I think that Microsoft SQL Server does what you need with the varbinary(MAX) field type when used in conjunction with FILESTREAM storage.
Have a read on TechNet for more depth: http://technet.microsoft.com/en-us/library/bb933993.aspx
Basically, you can enter any descriptive fields into your database as normal, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
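A hedged sketch of such a table (it assumes FILESTREAM is enabled on the instance and that a FILESTREAM filegroup, here called FG_Samples, exists in the database; all names are placeholders):

    -- Hypothetical sketch: FILESTREAM column for the raw samples,
    -- ordinary columns for the per-recording metadata.
    CREATE TABLE dbo.Recordings
    (
        RecordingId  UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
        CollectedOn  DATETIME2 NOT NULL,
        SampleRateHz INT       NOT NULL,
        Samples      VARBINARY(MAX) FILESTREAM NULL
    )
    FILESTREAM_ON FG_Samples;

    -- Partial reads are still possible through T-SQL, e.g. samples 1,000-2,000
    -- at one byte per sample (placeholder key variable):
    DECLARE @RecordingId UNIQUEIDENTIFIER = NEWID();
    SELECT SUBSTRING(Samples, 1000, 1001) AS SampleWindow
    FROM   dbo.Recordings
    WHERE  RecordingId = @RecordingId;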
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)
I have a problem with a large database I am working with, which resides on a single drive. This database contains around a dozen tables; the two main ones are around 1 GB each and cannot be made smaller. My problem is that the disk queue for the database drive is around 96% to 100% even when the website that uses the DB is idle. What optimisation could be done, or what is the source of the problem? The DB on disk is 16 GB in total and almost all the data is required: transaction data, customer information, and stock details.
What are the reasons why the disk queue is always high no matter the website traffic?
What can be done to help improve performance on a database this size?
Any suggestions would be appreciated!
The database is a SQL Server 2000 database running on Windows Server 2003 and, as stated, 16 GB in size (data file size on disk).
Thanks
Well, how much memory do you have on the machine? If SQL Server can't keep the pages in memory, it is going to have to go to disk to get its information. If your memory is low, you might want to consider upgrading it.
Since the database is so big, you might want to consider adding two separate physical drives, then putting the transaction log on one drive and moving some of the other tables onto the other drive (you have to do some analysis to see what the best split between tables is).
In doing this, you allow IO accesses to occur in parallel instead of serially, which should give you some more performance from your DB.
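On SQL Server 2000, one hedged way to move the log onto the new drive is detach/attach (the database name and paths are placeholders):

    -- Hypothetical sketch for SQL Server 2000: detach, move the .ldf file
    -- to the new drive at the OS level, then re-attach from the new paths.
    EXEC sp_detach_db 'ShopDB';

    EXEC sp_attach_db 'ShopDB',
         'D:\SQLData\ShopDB.mdf',
         'E:\SQLLogs\ShopDB_log.ldf';

Moving individual tables to the second drive would additionally need a new filegroup there and the clustered indexes rebuilt onto it.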
Before buying more disks and shifting things around, you might also update statistics and check your queries - if you are doing lots of table scans and so forth you will be creating unnecessary work for the hardware.
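As a hedged starting point on SQL Server 2000 (the query and object names are placeholders), refresh statistics and check a suspect query's plan for table scans:

    -- Refresh statistics for all user tables in the current database.
    EXEC sp_updatestats;
    GO

    -- Show the estimated plan as text for a suspect query; look for 'Table Scan'.
    SET SHOWPLAN_TEXT ON;
    GO
    SELECT OrderID, CustomerID, OrderDate     -- placeholder query
    FROM   dbo.Orders
    WHERE  CustomerID = 42;
    GO
    SET SHOWPLAN_TEXT OFF;
    GO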
Your database isn't that big after all - I'd first look at tuning your queries. Have you profiled what sort of queries are hitting the database?
If your disk activity is that high while your site is idle, I would look for other processes that might be running and could be affecting it. For example, are you sure there aren't any scheduled backups running? Especially with a large DB, these could run for a long time.
As Mike W pointed out, there is usually a lot you can do with query optimization on existing hardware. Isolate your slow-running queries and find ways to optimize them first. In one of our applications, we spent literally two months doing this and managed to dramatically improve the performance of the application and its hardware utilization.