MDF file size much larger than actual data - sql

For some reason my MDF file is 154 GB; however, I only loaded 7 GB worth of data from flat files. Why is the MDF file so much larger than the actual source data?
More info:
Only a few tables with ~25 million rows. No large varchar fields (the biggest is 300; most are less than varchar(50)). Not very wide tables, < 20 columns. Also, none of the large tables are indexed yet. Tables with indexes have less than 1 million rows. I don't use char, only varchar for strings. Datatype is not the issue.
It turned out it was the log file, not the MDF file. The MDF file is actually 24 GB, which seems more reasonable, but still big IMHO.
UPDATE:
I fixed the problem with the LDF (log) file by changing the recovery model from FULL to SIMPLE. This is okay because this server is only used for internal development and ETL processing. In addition, before changing to SIMPLE I had to shrink the LOG file. Shrinking is not recommended in most cases; however, this was one of those cases where the log file should never have grown so big and so fast. For further reading see this
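For reference, roughly what that fix looks like in T-SQL (the database and logical log file names below are placeholders; the real logical names are listed in sys.database_files):
ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;
DBCC SHRINKFILE (MyDatabase_Log, 1024);  -- shrink the log file down to ~1 GB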

There could be a lot of reasons: maybe you are using char(5000) instead of varchar(5000), maybe you are using bigint instead of int, or nvarchar when all you need is varchar, and so on. Maybe you are using a lot of indexes per table; these will all add up. Maybe your autogrow settings are wrong. You are sure this is the MDF and not the LDF file, right?
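A quick way to check which file is actually the big one (a sketch; run it in the database in question - sys.database_files reports sizes in 8 KB pages):
SELECT name, type_desc, size * 8 / 1024 AS size_mb
FROM sys.database_files;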

Because the MDF was allocated at 154 GB, or has grown to 154 GB through various operations. A database file is at least the size of the data in it, but it can exceed the used amount by any margin.
An obvious question is: how do you measure the amount of data in the database? Did you use sp_spaceused? Did you check sys.allocation_units? Did you guess?
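For example, either of these gives you a real number instead of a guess:
EXEC sp_spaceused;  -- database size vs. unallocated space
SELECT SUM(total_pages) * 8 / 1024 AS reserved_mb  -- pages are 8 KB each
FROM sys.allocation_units;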
If the used size is indeed 7 GB out of 154 GB, then you should leave it as it is. The database was sized at this size by somebody, or has grown to it, and it is likely to grow back. If you believe that the growth or pre-sizing was accidental, the previous point still applies and you should leave it as is.
If you are absolutely positive the overallocation is a mistake, you can shrink the database, with all the negative consequences of shrinking.
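If you do go down that road, a minimal sketch (the logical file name is a placeholder; the target size is in MB):
DBCC SHRINKFILE (MyDatabase_Data, 10240);  -- shrink the data file down to ~10 GB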

In case this is useful for someone out there: I found this query on dba.stackexchange. It uses sys.dm_db_database_page_allocations, which counts the number of pages per object; this includes internal storage and gives you a real overview of the space used by your database.
SELECT sch.[name], obj.[name], ISNULL(obj.[type_desc], N'TOTAL:') AS [type_desc],
COUNT(*) AS [ReservedPages],
(COUNT(*) * 8) AS [ReservedKB],
(COUNT(*) * 8) / 1024.0 AS [ReservedMB],
(COUNT(*) * 8) / 1024.0 / 1024.0 AS [ReservedGB]
FROM sys.dm_db_database_page_allocations(DB_ID(), NULL, NULL, NULL, DEFAULT) pa
INNER JOIN sys.all_objects obj
ON obj.[object_id] = pa.[object_id]
INNER JOIN sys.schemas sch
ON sch.[schema_id] = obj.[schema_id]
GROUP BY GROUPING SETS ((sch.[name], obj.[name], obj.[type_desc]), ())
ORDER BY [ReservedPages] DESC;
Thanks to Solomon Rutzky:
https://dba.stackexchange.com/questions/175649/sum-of-table-sizes-dont-match-with-mdf-size

Either AUTO_SHRINK is not enabled, or the initial size was set to a larger value.

Related

PostgreSQL: Set a full column to null value & database size increased. Why?

I'm working with PostgreSQL. I have a database named db_as with 25,000,000 rows of data. I wanted to free some disk space, so I updated an entire column to null, thinking that I would decrease the database's size, but it didn't happen. In fact, I did the opposite: I increased the database's size, and I don't know why. It increased from 700MB to 1425MB, that's a lot :( .
I used this query to get each column's size:
SELECT sum(pg_column_size(_column)) as size FROM _table
And this one to get the size of every database:
SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database;
The original values will still be on disk, just dead: in PostgreSQL an UPDATE writes a new row version and leaves the old one behind as a dead tuple, which is why the size went up rather than down.
Run a vacuum on the database to remove these.
vacuum full
Documentation
https://www.postgresql.org/docs/12/sql-vacuum.html
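For example (VACUUM FULL rewrites the table and takes an exclusive lock, so run it in a quiet window; a plain VACUUM only marks the dead rows as reusable without returning the space to the OS):
VACUUM (FULL, VERBOSE);  -- the whole database
VACUUM FULL _table;      -- or just the affected table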

How can I optimize my varchar(max) column?

I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was advised to store the profile image in the database. This seemed OK and worked fine, but now that I'm dealing with real data and querying more rows, the data is taking a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing them?
It takes time to return data. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data, if retrieval time is so important. You may be able to retrieve and uncompress the data faster than reading the uncompressed data.
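For instance, on SQL Server 2016 or later you could gzip the values with the built-in COMPRESS/DECOMPRESS functions. This is only a sketch: the compressed column and the column name picture are hypothetical, and already-compressed image formats such as JPEG won't shrink much:
ALTER TABLE userProfile ADD pictureCompressed VARBINARY(MAX) NULL;
GO
UPDATE userProfile SET pictureCompressed = COMPRESS(picture);  -- 'picture' stands in for the existing varchar(max) column
GO
SELECT id, DECOMPRESS(pictureCompressed) AS picture
FROM userProfile
WHERE id = 1;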
That said, you should not be using SELECT * unless you specifically want the image column. If you are using it in places where the image is not necessary, dropping that column from the query can improve performance. If you want to make this safe for other users, you can add a view without the image column and encourage them to use the view.
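For example, a minimal sketch (every column name other than id is made up):
CREATE VIEW userProfileInfo AS
SELECT id, firstName, lastName, email  -- everything except the image column
FROM userProfile;
GO
SELECT id, firstName, lastName, email
FROM userProfileInfo
ORDER BY id;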
If it is still possible, take one step back: drop the idea of storing images in the table. Instead, save the path in the database and the image in a folder. This is the most efficient approach.
SELECT *
FROM userProfile
ORDER BY id
Do not use *, and why are you using ORDER BY? You can sort in the UI code instead.

nvarchar(max) - how to speed up getting only meaningful string in SQL

I have a table with an NVARCHAR(MAX) column. 90% of the time the string length is between 255 and 500 characters. Some values go well over 22,000 characters, which isn't required, as it's XML or something that the business won't ever use for reporting purposes. Anyway, to cut a long story short, what is the best way to trim out all the excess bulk? I have tried the usual
left(column,500)
and
substring(column,1,500)
I have set the destination column to a length of 500.
However, loading the table from source to destination takes a while just because of that column alone. I am doing this in SSIS, in the source. I have also gone to the output column and set it to ignore truncation. Is there any way I can reduce the time taken loading this column? These methods seem to take as long as loading the full length. Any suggestion will be greatly appreciated.
NVARCHAR(MAX) (even when using a function like SUBSTRING or LEFT) will cost a lot of memory and will fill up your buffers quickly. Check the Data Flow Task's DefaultBufferSize and DefaultBufferMaxRows properties, and also BLOBTempStoragePath and BufferTempStoragePath; setting them to optimal values might increase performance, but note that you have to configure them carefully because they are a double-edged sword.
Also, if the source and destination are on different servers, the network could be an issue, because all the data has to go from your SQL Server via the network to your SSIS server. You could try changing the network packet size.
More info is provided in these links:
Set BLOBTempStoragePath and BufferTempStoragePath to Fast Drives
Troubleshooting Package Performance
Perfomance Issue with NVarchar(MAX) in SSIS
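Another option worth trying (not covered by the links above, just a sketch with placeholder table/column names): do the trimming in the source query and CAST the result to a fixed length, so the column's SSIS metadata becomes DT_WSTR(500) instead of the blob type DT_NTEXT:
SELECT id,
       CAST(LEFT(big_column, 500) AS NVARCHAR(500)) AS big_column_trimmed
FROM dbo.SourceTable;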

Storing trillions of document similarities

I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document IDs, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2,000,000 * (2,000,000 - 1) / 2 ≈ 2,000,000,000,000 records.
A text file with 1 million records is already 9 MB. Extrapolating, that means I'd need about 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document IDs in the first column, but that would only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2,000,000 documents, you're kind of stuck with an integer for the document IDs; that makes 4 bytes + 4 bytes. The similarity seems to be between 0.00 and 1.00, so I guess a single byte would do by encoding 0.00-1.00 as 0..100.
So your table would be: id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 * 10^6)^2) * 9 / 2 bytes are needed; that's about 17 TB.
Of course, that's if you have just a basic table. Since you don't plan on querying it very often, I guess performance isn't that much of an issue, so you could get 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square, and each 'intersection' would be a byte representing the relationship between its coordinates. This would "only" require about 3.6 TB, but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
So I'd suggest a hybrid approach: a table with 2 columns. The first column would hold the 'left' document ID (4 bytes); the 2nd column, a varbinary, would hold a string of all the values for documents whose ID is above the ID in the first column. Since a varbinary only takes the space it needs, this helps win back some of the space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2,000,000 - 1) bytes as value for the 2nd column
record 2 would have a string of (2,000,000 - 2) bytes as value for the 2nd column
record 3 would have a string of (2,000,000 - 3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2 TB (including overhead) to store the information. Add compression to it and I'm pretty sure you can store it on a modern disk.
Of course the system is far from optimal. In fact, querying the information will require some patience, as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach is that you can easily add new documents by adding a new byte to the string of EACH record plus 1 extra record at the end. Operations like that will be costly, though, as they will result in page splits; but at least it will be possible without having to completely rewrite the table. It will cause quite a bit of fragmentation over time, and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah... technicalities.
Selecting and Updating will require some creative use of SubString() operations, but nothing too complex..
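Purely as an illustration (table and variable names made up), the layout and a point lookup could look something like this:
CREATE TABLE DocSimilarity
(
    doc_id INT NOT NULL PRIMARY KEY,
    scores VARBINARY(MAX) NOT NULL  -- one byte per document with a higher id, encoded 0..100
);
-- similarity between @id1 and @id2 (assuming @id1 < @id2): the byte for @id2
-- sits at position (@id2 - @id1) in the row belonging to @id1
DECLARE @id1 INT = 1, @id2 INT = 4;
SELECT CAST(SUBSTRING(scores, @id2 - @id1, 1) AS TINYINT) / 100.0 AS similarity
FROM DocSimilarity
WHERE doc_id = @id1;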
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze the last bit out of it you could actually store 8 values in 7 bytes and save roughly another 250 GB, but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you go searching for all documents that have a relation value of 0.89 or above, the system will have to scan the entire table, and even with modern disks that IS going to take a while.
Mind you, all of this is the result of half an hour of brainstorming; I'm actually hoping that someone will chime in with a neater approach =)

Is varchar(128) better than varchar(100)

Quick question. Does it matter, from the point of view of storing data, whether I use round decimal field limits or powers of two (say 16, 32, 64 instead of 10, 20, 50)?
I ask because I wonder if this will have anything to do with clusters on HDD?
Thanks!
VARCHAR(128) is better than VARCHAR(100) if you need to store strings longer than 100 bytes.
Otherwise, there is very little to choose between them; you should choose the one that better fits the maximum length of the data you might need to store. You won't be able to measure the performance difference between them. All else apart, the DBMS probably only stores the data you send, so if your average string is, say, 16 bytes, it will only use 16 (or, more likely, 17 - allowing 1 byte for storing the length) bytes on disk. The bigger size might affect the calculation of how many rows can fit on a page - detrimentally. So choosing the smallest size that is adequate makes sense - waste not, want not.
So, in summary, there is precious little difference between the two in terms of performance or disk usage, and aligning to convenient binary boundaries doesn't really make a difference.
If it were a C program I'd spend some time thinking about that, too. But with a database I'd leave it to the DB engine.
DB programmers spend a lot of time thinking about the best memory layout, so just tell the database what you need and it will store the data in a way that suits the DB engine best (usually).
If you want to align your data, you'll need exact knowledge of the internal data organization: How is the string stored? One, two or four bytes to store the length? Is it stored as a plain byte sequence or encoded in UTF-8, UTF-16 or UTF-32? Does the DB need extra bytes to identify NULL or > MAXINT values? Maybe the string is stored as a NUL-terminated byte sequence - then one more byte is needed internally.
Also, with VARCHAR it is not necessarily true that the DB will always allocate 100 (or 128) bytes for your string. Maybe it just stores a pointer to where the space for the actual data is.
So I'd strongly suggest using VARCHAR(100) if that is your requirement. If the DB decides to align it somehow, there's room for extra internal data too.
The other way around: let's assume you use VARCHAR(128) and everything comes together: the DB allocates 128 bytes for your data. Additionally it needs 2 more bytes to store the actual string length - that makes 130 bytes - and then it could be that the DB aligns the data to the next (let's say 32-byte) boundary: the actual space needed on disk is now 160 bytes 8-}
Yes, but it's not that simple. Sometimes 128 can be better than 100, and sometimes it's the other way around.
So what is going on? varchar only allocates space as necessary, so if you store 'hello world' in a varchar(100) it will take exactly the same amount of space as in a varchar(128).
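To illustrate, a quick sketch using SQL Server's DATALENGTH (which reports the bytes actually stored):
DECLARE @a VARCHAR(100) = 'hello world',
        @b VARCHAR(128) = 'hello world';
SELECT DATALENGTH(@a) AS bytes_in_varchar_100,  -- 11
       DATALENGTH(@b) AS bytes_in_varchar_128;  -- 11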
The question is: If you fill up the rows, will you hit a "block" limit/boundary or not?
Databases store their data in blocks. These have a fixed size, for example 512 bytes (this value can be configured for some databases). So the question is: how many blocks does the DB have to read to fetch each row? Rows that span several blocks will need more I/O, so this will slow you down.
But again: this doesn't depend on the theoretical maximum size of the columns, but on a) how many columns you have (each column needs a little bit of space even when it's empty or null), b) how many fixed-width columns you have (number/decimal, char), and finally c) how much data you have in variable-width columns.