Why would two MySQL files (same table, same contents) be different in size?

I took an existing MySQL database, and set up a copy on a new host.
The files for some tables on the new host are 1-3% smaller than their counterpart files on the old host.
I am curious why that is.
My guess is that the old host's files have grown over time, so there is more fragmentation within their B-tree structures, whereas the new host, because it built each file from scratch (via a binary log), avoided such fragmentation.
Does it even make sense for there to be fragmentation within the B-tree structure itself? (Speaking of the database layer, not the OS file system layer.) I originally thought "no", but then again, isn't such fragmentation the reason DBAs periodically compact their database files?
I'm also wondering whether this is simply an artifact of the file system layer, i.e. the new host has a mostly empty disk, so allocating a new file produces less fragmentation. Then again, I didn't think file-system fragmentation would show up in the reported file size (Linux).

There can certainly be fragmentation in MySQL data files or index files. This is common, even deliberate.
That is, the storage engine may deliberately leave some extra space here and there so when you change values, it can fit the rows in without having to reorder the whole data file. There are even server properties you can use to configure how much of this slop space to allocate.
I wouldn't even blink at a file discrepancy of 1-3%.
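If you want to see how much of that slack space a given table is carrying on each host, a minimal sketch along these lines (assuming the mysql-connector-python driver, placeholder credentials, and a hypothetical table named my_table) compares the reported sizes and then rebuilds the table:

    import mysql.connector  # assumes the mysql-connector-python package

    # Placeholder connection details -- substitute your own.
    conn = mysql.connector.connect(host="old-host", user="me",
                                   password="secret", database="mydb")
    cur = conn.cursor(dictionary=True)

    # Data_length/Index_length are the used portions of the table;
    # Data_free is space allocated to the table but currently unused.
    cur.execute("SHOW TABLE STATUS LIKE 'my_table'")
    for row in cur.fetchall():          # one row per matching table
        print(row["Data_length"], row["Index_length"], row["Data_free"])

    # Rebuilding the table packs the pages together again and, for
    # file-per-table InnoDB tables, usually shrinks the file on disk.
    cur.execute("OPTIMIZE TABLE my_table")
    print(cur.fetchall())

    cur.close()
    conn.close()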

From what I understand of MySQL, it uses a growth algorithm as a file approaches capacity; when the copy was created it chose a different size, probably trimming the excess storage.

Related

Is it wrong to store image bytes in the database?

When should I store images directly in the database?
In what situations?
I know I can store just the path of the image in the database.
In addition to the cost being higher as mentioned, one must take into account several factors:
Data Volume: For a low volume of data there may be no problem. On the other hand, for mass storage of data, keeping the files in the database is practically unfeasible.
Clustering: One advantage of the database is that if your system runs on multiple servers, all of them will have uniform access to the files.
Scalability: If demand for volume or availability increases, can you add more capacity to the system? It is much easier to split files between different servers than to distribute records from one table to more servers.
Flexibility: Backing up, moving files from one server to another, doing some processing on the stored files, all this is easier if the files are in a directory.
There are several strategies for scaling a system in terms of both availability and volume. Basically, these strategies consist of distributing the files across several different servers and redirecting each user to one of them according to some criterion. The implementation details vary: data update strategy, redundancy, distribution criteria, and so on.
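As a minimal sketch of "redirecting according to some criterion" (the server list and the hashing scheme here are purely hypothetical, not something the answer prescribes), you could pick the storage server deterministically from the file name:

    import hashlib

    # Hypothetical pool of file servers.
    SERVERS = ["files1.example.com", "files2.example.com", "files3.example.com"]

    def server_for(filename):
        """Pick a server deterministically from the file name, so every
        front-end node sends requests for the same file to the same place."""
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return SERVERS[int(digest, 16) % len(SERVERS)]

    print(server_for("12345678.jpg"))   # always the same server for this file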
One of the big difficulties of managing files outside the database is that you end up with two distinct data sources that must always be kept in sync.
From the security point of view, there is actually little difference. If an attacker can compromise the server, they can read both the files your system writes to disk and the files of the database system. If this concern is critical, an alternative is to store the data encrypted.
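On the point about keeping the two data sources in sync, a minimal sketch (the images table, the MEDIA_ROOT directory, and sqlite3 as a stand-in for your actual RDBMS are all assumptions for illustration) is to write the file first and roll that write back if the database insert fails:

    import os
    import uuid
    import sqlite3  # stand-in for whatever RDBMS you actually use

    MEDIA_ROOT = "/var/app/media"  # hypothetical storage directory

    def save_image(db, image_bytes, original_name):
        """Write the file to disk, then record only its path in the database.
        If the INSERT fails, remove the file again so the two stores agree."""
        filename = f"{uuid.uuid4().hex}_{original_name}"
        path = os.path.join(MEDIA_ROOT, filename)
        with open(path, "wb") as fh:
            fh.write(image_bytes)
        try:
            db.execute("INSERT INTO images (path) VALUES (?)", (path,))
            db.commit()
        except Exception:
            os.remove(path)  # undo the disk write on failure
            raise
        return path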
I also convert my images into byte arrays and store them in a SQL Server database, but in the long run I am sure someone will tell you that you should only save the (server) path of the image.
The biggest disadvantage of storing them as binary, I think, is this:
Retrieving images from the database is significantly more expensive than using the file system.

Do files on your server influence website speed

If you have a webserver for your website, does it make a difference if there are a lot of other files on the server, even if they aren't used?
Example
An average webserver has an SSD with 500 GB of space. It's hosting a single active website, but also holds a ton of other websites which are inactive. Though that single website is only 1 GB in size, the drive is 50% full. Will that influence site speed?
And does SSD vs HDD make a difference in that, apart from the general speed difference between the two types?
Edit: I've read somewhere that the number of files on your server influences its speed, and it sounds logical given Andrei's answer about having to search through more files. I've discussed it with someone, however, and he firmly states that it makes no difference.
Having other/unused files always has some impact on performance, but the question is how big it is. Usually not much, and you will not notice it at all.
But think about how files are read from disk. First, the file's record has to be located in the file system's metadata (the file allocation table, FAT, on FAT-style filesystems). Searching that metadata is similar to searching a tree-like data structure, because we have to deal with folders that contain other folders, and so on.
The more files you have, the bigger that metadata gets, and the slower the search becomes.
All in all, with memory caching and other tricks, this is not an issue.
You will notice the impact when you have thousands of files in one folder. That's why picture-related services that host large numbers of images usually store them in a folder structure that holds only a limited number of files per folder. For example, a file named '12345678.jpg' would be stored at the path '/1/2/3/4/5/12345678.jpg', along with the other files whose names are '12345000'...'12345999'. Thus only 1000 files end up in each folder.
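A minimal sketch of that sharding scheme (the root directory and the shard depth are arbitrary choices for illustration) looks like this:

    import os

    def sharded_path(root, filename, depth=5):
        """Map '12345678.jpg' to '<root>/1/2/3/4/5/12345678.jpg' so that no
        single folder ends up holding more than about 1000 files."""
        stem = os.path.splitext(filename)[0]
        shards = list(stem[:depth])          # first <depth> characters of the name
        return os.path.join(root, *shards, filename)

    print(sharded_path("/srv/images", "12345678.jpg"))
    # -> /srv/images/1/2/3/4/5/12345678.jpg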

Is it better to cache HTML in a database or on the file system?

I have a few programs that have to fetch HTML or XML and then cache it locally. Imagine that there are 1000-10,000 documents that need to be cached. Data is then extracted from the documents and inserted into a PostgreSQL database.
My question is whether it would be better to cache these documents in TEXT fields in a PostgreSQL table, or whether I should just cache them on the filesystem.
The documents really don't serve much of a purpose beyond temporary caching and perhaps serving as a debugging tool if something goes wrong with the data extraction.
If you store things in the filesystem, ideally do so on a RAM disk.
If you do so in the database, ideally do so with a tablespace that lives on a RAM disk. In PG 9.1 (currently in beta), additionally make sure that your table is UNLOGGED, so that writes to it skip the WAL.
Even more ideally, though, place all of this in Memcache... (Or whichever other applicable memory-based cache solution your platform offers.)
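A minimal sketch of the unlogged-table variant (assuming the psycopg2 driver, a placeholder DSN, and a hypothetical html_cache table; note the ON CONFLICT upsert needs PostgreSQL 9.5+, on older versions delete-then-insert instead):

    import psycopg2  # assumes the psycopg2 package; the DSN below is a placeholder

    conn = psycopg2.connect("dbname=cache_db user=me")
    cur = conn.cursor()

    # UNLOGGED skips the write-ahead log entirely; the contents are lost on a
    # crash, which is acceptable for a throwaway document cache.
    cur.execute("""
        CREATE UNLOGGED TABLE IF NOT EXISTS html_cache (
            url        text PRIMARY KEY,
            fetched_at timestamptz NOT NULL DEFAULT now(),
            body       text NOT NULL
        )
    """)

    cur.execute(
        "INSERT INTO html_cache (url, body) VALUES (%s, %s) "
        "ON CONFLICT (url) DO UPDATE SET body = EXCLUDED.body, fetched_at = now()",
        ("http://example.com/doc1.xml", "<html>...</html>"),
    )
    conn.commit()
    cur.close()
    conn.close()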
Caching in the database will allow the database engine to keep it in memory, possibly reducing disk access and improving performance.

Which technology should be used for serving large number of static files?

My main aim is to serve a large number of XML files (>1bn, each <1KB) via a web server. The files can be considered static, as they will be modified by external code at a relatively low frequency (about 50k updates per day). Files will be requested at high frequency (>30 req/sec).
The current suggestion from my team is to create a dedicated Java application to implement the HTTP protocol and use memcached to speed things up, keeping all file data in an RDBMS and getting rid of the file system.
On the other hand, I think a tweaked Apache web server or lighttpd should be enough. Caching can be left to the OS or to the web server's default caching. There is no point in keeping data in a DB if the same output is required and it is only queried by file name. I'm not sure how memcached would work here. Also, updating the external cache (memcached) whenever a file is updated by the external code will add complexity.
Another question: if I choose to use files, is it possible to store them in a directory structure like \a\b\c\d.xml and access them via abcd.xml? Or should I put all 1bn files in a single directory (I'm not sure whether the OS will allow that)?
This is NOT a website; it is an API for an application on a closed network, so a cloud/CDN is of no use.
I am planning to use CentOS + Apache/lighttpd. Please suggest any alternative or the best possible solution.
This is the only public note I could find on the topic, and it is a little old too.
1bn files at 1KB each, that's about 1TB of data. Impressive. So it won't fit into memory unless you have very expensive hardware. It can even be a problem on disk if your file system wastes a lot of space for small files.
30 requests a second is far less impressive. It's certainly not the limiting factor for the network, nor for any serious web server out there. It might be a bit of a challenge for a slow hard disk.
So my advice is: put the XML files on a hard disk and serve them with a plain vanilla web server of your choice. Then measure the throughput and optimize it if you don't reach 50 files a second. But don't invest in anything unless you have shown it to be a limiting factor.
Possible optimizations are:
Find a better layout in the file system, i.e. distribute your files over enough directories so that you don't have too many files (more than 5,000) in a single directory.
Distribute the files over several hard disks so that they can be accessed in parallel
Use faster hard disks
Use solid state disks (SSD). They are expensive, but can easily serve hundreds of files a second.
If a large number of the files are requested several times a day, then even a slow hard disk should be enough, because your OS will have the files in the file cache. And with today's file cache sizes, a considerable share of your daily deliveries will fit into the cache: at 30 requests a second you serve at most 30 × 86,400 ≈ 2.6 million requests a day, which touches roughly 0.26% of the one billion files at most.
Regarding distributing your files over several directories, you can hide this with an Apache RewriteRule, e.g.:
RewriteRule ^/xml/(.)(.)(.)(.)(.*)\.xml /xml/$1/$2/$3/$4/$5.xml
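With that rule in place, a request for /xml/12345678.xml would be served from the file /xml/1/2/3/4/5678.xml on disk, so clients keep using flat names while the files themselves stay spread across subdirectories.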
Another thing you could look at is Pomegranate, which seems very similar to what you are trying to do.
I believe that a dedicated application with everything feeding off a memcached cache would be the best bet.

Performance implications of storing 600,000+ images in the same folder (NTFS)

I need to store about 600,000 images on a web server that uses NTFS. Am I better off storing images in 20,000-image chunks in subfolders? (Windows Server 2008)
I'm concerned about incurring operating-system overhead during image retrieval.
Go for it. As long as you have an external index and a direct file path to each file, without listing the contents of the directory, you are OK.
I have a folder that is over 500 GB in size with over 4 million subfolders (which contain more folders and files). I have somewhere on the order of 10 million files in total.
If I accidentally open this folder in Windows Explorer, it gets stuck at 100% CPU usage (on one core) until I kill the process. But as long as you refer to a file/folder directly, performance is great (meaning I can access any of those 10 million files with no noticeable overhead).
Depending on whether NTFS has directory indexes, it should be alright from the application level.
I mean that opening, deleting, and renaming files by name programmatically should work nicely.
But the problem is always tools. Third-party tools (such as MS Explorer, your backup tool, etc.) are likely to struggle or at least be nearly unusable with large numbers of files per directory.
Anything which does a directory scan is likely to be quite slow; worse, some of these tools have poor algorithms which don't scale to even modest (10k+) numbers of files per directory.
NTFS folders store an index file with links to all of their contents. With a large number of images, that file is going to grow a lot and hurt your performance. So yes, on that argument alone you are better off storing the images in chunks in subfolders. Fragmentation inside those indexes is a pain.
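A minimal sketch of the 20,000-image chunking (the root path, numeric ids, and extension are assumptions for illustration):

    import os

    IMAGES_PER_FOLDER = 20_000  # chunk size from the question

    def chunked_path(root, image_id, ext=".jpg"):
        """Put images 0..19999 in folder '0', 20000..39999 in folder '1', and
        so on, so no single NTFS directory index has to track more than
        20,000 entries."""
        folder = str(image_id // IMAGES_PER_FOLDER)
        return os.path.join(root, folder, f"{image_id}{ext}")

    print(chunked_path(r"D:\images", 123456))  # D:\images\6\123456.jpg on Windows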