RDBMS caching vs disk I/O -- comparison across vendors - sql

I know little about how leading RDBMSs go about retrieving data. So these questions may seem a bit rudimentary:
Does each SELECT in commonly used RDBMSs such as Oracle, SQL Server, MySQL, PostgeSQL etc. always mean a trip to read the data from the disk or do they, to some extent allowable by the hardware, cache commonly requested data to avoid the expensive I/O operation?
How do they determine which data segments to cache?
How do they go about synchronizing the cache once an update of some of the cached data occurs by a different process?
Is there a comparison matrix on how different RDBMSs cache frequently requested data?
Thanks

I'll answer for SQL Server:
Reads are served from cache if possible. Else, an IO occurs.
From what has been written and from what I observe, it is an LRU algorithm. I don't think this is documented anywhere. The LRU items are database pages of 8KB.
SQL Server is the only process which has access to the database files. So no other process can cause modifications. Regarding concurrent transactions: Multiple transactions can modify the same page. Locking (mostly at row-level, sometimes page or table level) ensures that the transactions do not disturb each other.
I don't know.

The answers for Informix are pretty similar to those given for SQL Server:
Reads and writes both use the cache if at all possible. If the page needed is not already in cache, an appropriate collection of I/O operations occurs (typically, evicting some page from cache, perhaps a dirty page that must be written before a new page can be read in, and then reading the new page where the old one was).
There are various algorithms, but page size and usage are the key parts. There are LRU queues for each page size.
The DBMS as a whole is an ensemble of processes that use a buffer pool in shared memory (and, where possible, direct disk I/O instead of going through the kernel cache), and uses various forms of locking (semaphores, spin-locks, mutexes, etc) to handle concurrency and synchronization. (On Windows, Informix uses a single process with multiple threads; on Unix, it uses multiple processes.)
Probably not.

Related

Object storage for a web application

I am currently working on a website where, roughly 40 million documents and images should be served to it's users. I need suggestions on which method is the most suitable for storing content with subject to these requirements.
System should be highly available, scale-able and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd party object storage providers such as Amazon S3 and CDNs are not suitable.
File size of content can vary from 1 MB to 30 MB. (However about 90% of the files would be less than 2 MB)
Content retrieval latency is not much of a problem. Therefore indexing or caching is not very important.
I did some research and found out about the following solutions;
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for last two years, the work is still in progress. However, requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes fall in range from several bytes to several megabytes, and, the most important one, the deduplication, which should be implemented both on the document and block levels. If two different users upload the same file to the storage, the only copy of the file should be kept. Also if two different files partially intersect with each other, it's necessary to store the only copy of the common part of these files.
But let's focus on your requirements, so deduplication is not the case. First of all, high availability implies replication. You'll have to store your file in several replicas (typically 2 or 3, but there are techniques to decrease data parity) on independent machines in order to stay alive in case if one of the storage servers in your backend dies. Also, taking into account the estimation of the data amount, it's clear that all your data just won't fit into a single server, so vertical scaling is not possible and you have to consider partitioning. Finally, you need to take into account concurrency control to avoid race conditions when two different clients are trying to write or update the same data simultaneously. This topic is close to the concept of transactions (I don't mean ACID literally, but something close). So, to summarize, these facts mean that you're are actually looking for distributed database designed to store BLOBs.
On of the biggest problems in distributed systems is difficulties with global state of the system. In brief, there are two approaches:
Choose leader that will communicate with other peers and maintain global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that in this case leader becomes the single point of failure. If leader dies, either some observer must assign leader role to one of the replicas (common case for master-slave replication in RDBMS world), or remaining peers need to elect new one (algorithms like Paxos and Raft are designed to target this issue). Anyway, almost whole incoming system traffic goes through the leader. This leads to the "hot spots" in backend: the situation when CPU and IO costs are unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check etcd and consul limitations if you are interested).
Avoid global state at all. Weaken the guarantees to eventual consistency. Disable the update of files. If someone wants to edit the file, you need to save it as new file. Use the system which is organized as a peer-to-peer network. There is no peer in the cluster that keeps the full track of the system, so there is no single point of failure. This results in high write throughput and nice horizontal scalability.
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think it's a good option to store files in traditional RDBMS because they provide optimizations for structured data and strong consistency, and you don't need neither of this. Also you'll have difficulties with backups and scaling. People usually don't use RDBMS in this way.
Using GridFS to chunk and store content.
I'm not sure, but it looks like GridFS is built on the top of MongoDB. Again, this is document-oriented database designed to store JSONs, not BLOBs. Also MongoDB had problems with a cluster for many years. MongoDB passed Jepsen tests only in 2017. This may mean that MongoDB cluster is not mature yet. Make performance and stress tests, if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop object storage on your own. Consider all the problems I've mentioned above.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I used neither of these solutions, but HDFS looks like overkill, because you get dependent on Hadoop stack. Have no idea about GlusterFS performance. Always consider the design of distributed file systems. If they have some kind of dedicated "metadata" serves, treat it as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well-known outside of the russian part of the Internet, but it's mature and stable, and performance is perfect. It was developed at Yandex (russian search engine) and a lot of Yandex services (like Disk, Mail, Music, Picture hosting and so on) are built on the top of it. I used it in previous project, this may take some time for your ops to get into it, but it's worth it, if you're OK with GPL license.
Ceph. This is real object storage. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it. So get ready to a vendor lock. Also I heard that it have too complicated settings. Never used in production, so don't know about performance.
Minio. This is S3-compatible object storage, under active development at the moment. Never used it in production, but it seems to be well-designed.
You may also check wiki page with the full list of available solutions.
And the last point: I strongly recommend not to use OpenStack Swift (there are lot of reasons why, but first of all, Python is just not good for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and, "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored – e.g. using a database as an index; and (2) actually storing them, this being a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it "offhand" seems to me ...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things which you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss-over all of those messy filesystem details . . .

Load table mostly or partly into working memory for querying

I have a PostgreSQL read replica with the sole purpose to do some aggregate queries. Currently, a lot of I/O is going on in order to do the aggregates, even though there's a lot of memory available and since the instance serves only this purpose I was wondering if it is possible:
To simply tell PostgreSQL to cache most of the content of the table in order to speed up aggregate queries. Is that possible?
Per Craig's comment:
The pg_prewarm module provides a convenient way to load relation data into either the operating system buffer cache or the PostgreSQL buffer cache.
http://www.postgresql.org/docs/9.4/static/pgprewarm.html
Don't forget pg_prewarm is an extension, so you'll have to add it with:
create extension pg_prewarm;
.. a lot of I/O is going on in order to do the aggregates, even though there's a lot of memory available ..
You might also want to research some of the memory config options (http://www.postgresql.org/docs/9.4/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY)

Fine tuning oracle query with pipelined function

I have a query (that powers an Oracle Application Express Report) that I was told by our users was executing "slowly" or at an unacceptable speed (wasn't given an actual load time for the page and the query is the only thing on the page).
The query involves many tables and actually references a pipelined function which identifies the currently logged-in users to our website and returns a custom "table" of records they have permission to based upon a custom security scheme we have.
My main question is around Oracle's caching of queries and how they could be affected by our setup.
When I took the query out of the webpage and ran it in Sql Developer (and manually specified a user ID to simulate a logged-in user to the website), the performance went from 71 seconds to 19 seconds to .5 seconds. Clearly, Oracle is utilizing its caching mechanism to make subsequent runs faster.
How is this affected by?:
The fact that different users will get different tables from the
pipe-lined function (all the same columns, just different number of
rows and the values in the rows). Does the pipe-lining prevent
caching from working? Am I only seeing caching because I'm running
a very isolated test?
Further more - is caching easily influenced by the number of people using the system? I'm not sure how "much" can get cached. Therefore, if we have 50 concurrent users that are accessing different parts of the website that are loading different queries all day long, is it likely that oracle won't be able to cache many/any of them because it is constantly seeing different request for queries?
Sorry my question isn't very technical.
I'm a developer who has been asked to help out in this seemingly DBA question.
Also, this is complicated because I can't really determine what the actual load times are since our users don't report that level of detail.
Any thoughts on:
how I can determine if this query is actually slow?
what the average processing time would be?
and how to proceed with fine tuning if it is a problem?
Thanks!
It doesn't sound like this has anything to do with APEX, pipelined table functions, or query caching. It sounds like you are describing the effects of plain old data caching (most likely at the database level but potentially at the operating system and disk subsystem layers).
As a very basic overview, data is stored in rows, rows are stored in blocks (most commonly 8 kb in size), blocks are stored in extents (generally a few MB in size), and extents roll up to segments (i.e. a table). Oracle maintains a buffer cache where the most recently accessed blocks are stored. When you run a query, Oracle figures out which blocks it needs to read in order to get your data (this is the query plan). It then looks to see whether those blocks are in the buffer cache or whether they have to be read from disk. Obviously, reading a block from cache is much more efficient than reading it off the disk since RAM is much faster than disk. If you run the same query with the same set of bind variable values multiple times in a row, you'll be accessing the same set of blocks each time but more and more of the blocks you care about are going to be in the cache. So you'd generally expect that the second and third time that you call the query, you'll see faster performance.
If you run the query with a different set of bind variable values, if the second set of bind variable values causes Oracle to access many of the same blocks, those executions will benefit from the data the prior test cached. Otherwise, you'd be back to square 1 potentially reading all the data you need off disk. Most likely, you'll see some combination of the two.
Remember as well that it is not just Oracle that is caching data. Frequently, the operating system will be caching the most active pieces of the underlying Oracle data files. And the I/O subsystem will be caching the most recently accessed data as well. So even if Oracle thinks that it needs to go out to fetch a block because it is not in the database's buffer cache, the file system or the I/O subsystem may have cached that data so it may not require an actual physical read off of disk. These other caches behave similarly where running the same query multiple times in a row is likely to cause the cache to be "warm" and improve the performance of the later runs.

Optimise PostgreSQL for fast testing

I am switching to PostgreSQL from SQLite for a typical Rails application.
The problem is that running specs became slow with PG.
On SQLite it took ~34 seconds, on PG it's ~76 seconds which is more than 2x slower.
So now I want to apply some techniques to bring the performance of the specs on par with SQLite with no code modifications (ideally just by setting the connection options, which is probably not possible).
Couple of obvious things from top of my head are:
RAM Disk (good setup with RSpec on OSX would be good to see)
Unlogged tables (can it be applied on the whole database so I don't have change all the scripts?)
As you may have understood I don't care about reliability and the rest (the DB is just a throwaway thingy here).
I need to get the most out of the PG and make it as fast as it can possibly be.
Best answer would ideally describe the tricks for doing just that, setup and the drawbacks of those tricks.
UPDATE: fsync = off + full_page_writes = off only decreased time to ~65 seconds (~-16 secs). Good start, but far from the target of 34.
UPDATE 2: I tried to use RAM disk but the performance gain was within an error margin. So doesn't seem to be worth it.
UPDATE 3:*
I found the biggest bottleneck and now my specs run as fast as the SQLite ones.
The issue was the database cleanup that did the truncation. Apparently SQLite is way too fast there.
To "fix" it I open a transaction before each test and roll it back at the end.
Some numbers for ~700 tests.
Truncation: SQLite - 34s, PG - 76s.
Transaction: SQLite - 17s, PG - 18s.
2x speed increase for SQLite.
4x speed increase for PG.
First, always use the latest version of PostgreSQL. Performance improvements are always coming, so you're probably wasting your time if you're tuning an old version. For example, PostgreSQL 9.2 significantly improves the speed of TRUNCATE and of course adds index-only scans. Even minor releases should always be followed; see the version policy.
Don'ts
Do NOT put a tablespace on a RAMdisk or other non-durable storage.
If you lose a tablespace the whole database may be damaged and hard to use without significant work. There's very little advantage to this compared to just using UNLOGGED tables and having lots of RAM for cache anyway.
If you truly want a ramdisk based system, initdb a whole new cluster on the ramdisk by initdbing a new PostgreSQL instance on the ramdisk, so you have a completely disposable PostgreSQL instance.
PostgreSQL server configuration
When testing, you can configure your server for non-durable but faster operation.
This is one of the only acceptable uses for the fsync=off setting in PostgreSQL. This setting pretty much tells PostgreSQL not to bother with ordered writes or any of that other nasty data-integrity-protection and crash-safety stuff, giving it permission to totally trash your data if you lose power or have an OS crash.
Needless to say, you should never enable fsync=off in production unless you're using Pg as a temporary database for data you can re-generate from elsewhere. If and only if you're doing to turn fsync off can also turn full_page_writes off, as it no longer does any good then. Beware that fsync=off and full_page_writes apply at the cluster level, so they affect all databases in your PostgreSQL instance.
For production use you can possibly use synchronous_commit=off and set a commit_delay, as you'll get many of the same benefits as fsync=off without the giant data corruption risk. You do have a small window of loss of recent data if you enable async commit - but that's it.
If you have the option of slightly altering the DDL, you can also use UNLOGGED tables in Pg 9.1+ to completely avoid WAL logging and gain a real speed boost at the cost of the tables getting erased if the server crashes. There is no configuration option to make all tables unlogged, it must be set during CREATE TABLE. In addition to being good for testing this is handy if you have tables full of generated or unimportant data in a database that otherwise contains stuff you need to be safe.
Check your logs and see if you're getting warnings about too many checkpoints. If you are, you should increase your checkpoint_segments. You may also want to tune your checkpoint_completion_target to smooth writes out.
Tune shared_buffers to fit your workload. This is OS-dependent, depends on what else is going on with your machine, and requires some trial and error. The defaults are extremely conservative. You may need to increase the OS's maximum shared memory limit if you increase shared_buffers on PostgreSQL 9.2 and below; 9.3 and above changed how they use shared memory to avoid that.
If you're using a just a couple of connections that do lots of work, increase work_mem to give them more RAM to play with for sorts etc. Beware that too high a work_mem setting can cause out-of-memory problems because it's per-sort not per-connection so one query can have many nested sorts. You only really have to increase work_mem if you can see sorts spilling to disk in EXPLAIN or logged with the log_temp_files setting (recommended), but a higher value may also let Pg pick smarter plans.
As said by another poster here it's wise to put the xlog and the main tables/indexes on separate HDDs if possible. Separate partitions is pretty pointless, you really want separate drives. This separation has much less benefit if you're running with fsync=off and almost none if you're using UNLOGGED tables.
Finally, tune your queries. Make sure that your random_page_cost and seq_page_cost reflect your system's performance, ensure your effective_cache_size is correct, etc. Use EXPLAIN (BUFFERS, ANALYZE) to examine individual query plans, and turn the auto_explain module on to report all slow queries. You can often improve query performance dramatically just by creating an appropriate index or tweaking the cost parameters.
AFAIK there's no way to set an entire database or cluster as UNLOGGED. It'd be interesting to be able to do so. Consider asking on the PostgreSQL mailing list.
Host OS tuning
There's some tuning you can do at the operating system level, too. The main thing you might want to do is convince the operating system not to flush writes to disk aggressively, since you really don't care when/if they make it to disk.
In Linux you can control this with the virtual memory subsystem's dirty_* settings, like dirty_writeback_centisecs.
The only issue with tuning writeback settings to be too slack is that a flush by some other program may cause all PostgreSQL's accumulated buffers to be flushed too, causing big stalls while everything blocks on writes. You may be able to alleviate this by running PostgreSQL on a different file system, but some flushes may be device-level or whole-host-level not filesystem-level, so you can't rely on that.
This tuning really requires playing around with the settings to see what works best for your workload.
On newer kernels, you may wish to ensure that vm.zone_reclaim_mode is set to zero, as it can cause severe performance issues with NUMA systems (most systems these days) due to interactions with how PostgreSQL manages shared_buffers.
Query and workload tuning
These are things that DO require code changes; they may not suit you. Some are things you might be able to apply.
If you're not batching work into larger transactions, start. Lots of small transactions are expensive, so you should batch stuff whenever it's possible and practical to do so. If you're using async commit this is less important, but still highly recommended.
Whenever possible use temporary tables. They don't generate WAL traffic, so they're lots faster for inserts and updates. Sometimes it's worth slurping a bunch of data into a temp table, manipulating it however you need to, then doing an INSERT INTO ... SELECT ... to copy it to the final table. Note that temporary tables are per-session; if your session ends or you lose your connection then the temp table goes away, and no other connection can see the contents of a session's temp table(s).
If you're using PostgreSQL 9.1 or newer you can use UNLOGGED tables for data you can afford to lose, like session state. These are visible across different sessions and preserved between connections. They get truncated if the server shuts down uncleanly so they can't be used for anything you can't re-create, but they're great for caches, materialized views, state tables, etc.
In general, don't DELETE FROM blah;. Use TRUNCATE TABLE blah; instead; it's a lot quicker when you're dumping all rows in a table. Truncate many tables in one TRUNCATE call if you can. There's a caveat if you're doing lots of TRUNCATES of small tables over and over again, though; see: Postgresql Truncation speed
If you don't have indexes on foreign keys, DELETEs involving the primary keys referenced by those foreign keys will be horribly slow. Make sure to create such indexes if you ever expect to DELETE from the referenced table(s). Indexes are not required for TRUNCATE.
Don't create indexes you don't need. Each index has a maintenance cost. Try to use a minimal set of indexes and let bitmap index scans combine them rather than maintaining too many huge, expensive multi-column indexes. Where indexes are required, try to populate the table first, then create indexes at the end.
Hardware
Having enough RAM to hold the entire database is a huge win if you can manage it.
If you don't have enough RAM, the faster storage you can get the better. Even a cheap SSD makes a massive difference over spinning rust. Don't trust cheap SSDs for production though, they're often not crashsafe and might eat your data.
Learning
Greg Smith's book, PostgreSQL 9.0 High Performance remains relevant despite referring to a somewhat older version. It should be a useful reference.
Join the PostgreSQL general mailing list and follow it.
Reading:
Tuning your PostgreSQL server - PostgreSQL wiki
Number of database connections - PostgreSQL wiki
Use different disk layout:
different disk for $PGDATA
different disk for $PGDATA/pg_xlog
different disk for tem files (per database $PGDATA/base//pgsql_tmp) (see note about work_mem)
postgresql.conf tweaks:
shared_memory: 30% of available RAM but not more than 6 to 8GB. It seems to be better to have less shared memory (2GB - 4GB) for write intensive workloads
work_mem: mostly for select queries with sorts/aggregations. This is per connection setting and query can allocate that value multiple times. If data can't fit then disk is used (pgsql_tmp). Check "explain analyze" to see how much memory do you need
fsync and synchronous_commit: Default values are safe but If you can tolerate data lost then you can turn then off
random_page_cost: if you have SSD or fast RAID array you can lower this to 2.0 (RAID) or even lower (1.1) for SSD
checkpoint_segments: you can go higher 32 or 64 and change checkpoint_completion_target to 0.9. Lower value allows faster after-crash recovery

Will it be faster to use several threads to update the same database?

I wrote a Java program to add and retrieve data from an MS Access. At present it goes sequentially through ~200K insert queries in ~3 minutes, which I think is slow. I plan to rewrite it using threads with 3-4 threads handling different parts of the hundred thousands records. I have a compound question:
Will this help speed up the program because of the divided workload or would it be the same because the threads still have to access the database sequentially?
What strategy do you think would speed up this process (except for query optimization which I already did in addition to using Java's preparedStatement)
Don't know. Without knowing more about what the bottle neck is I can't comment if it will make it faster. If the database is the limiter then chances are more threads will slow it down.
I would dump the access database to a flat file and then bulk load that file. Bulk loading allows for optimzations which are far, far faster than running multiple insert queries.
First, don't use Access. Move your data anywhere else -- SQL/Server -- MySQL -- anything. The DB engine inside access (called Jet) is pitifully slow. It's not a real database; it's for personal projects that involve small amounts of data. It doesn't scale at all.
Second, threads rarely help.
The JDBC-to-Database connection is a process-wide resource. All threads share the one connection.
"But wait," you say, "I'll create a unique Connection object in each thread."
Noble, but sometimes doomed to failure. Why? Operating System processing between your JVM and the database may involve a socket that's a single, process-wide resource, shared by all your threads.
If you have a single OS-level I/O resource that's shared across all threads, you won't see much improvement. In this case, the ODBC connection is one bottleneck. And MS-Access is the other.
With MSAccess as the backend database, you'll probably get better insert performance if you do an import from within MSAccess. Another option (since you're using Java) is to directly manipulate the MDB file (if you're creating it from scratch and there are no other concurrent users - which MS Access doesn't handle very well) with a library like Jackess.
If none of these are solutions for you, then I'd recommend using a profiler on your Java application and see if it is spending most of its time waiting for the database (in which case adding threads probably won't help much) or if it is doing processing and parallelizing will help.
Stimms bulk load approach will probably be your best bet but everything is worth trying once. Note that your bottle neck is going to be disk IO and multiple threads may slow things down. MS access can also fall apart when multiple users are banging on the file and that is exactly what your multi-threaded approach will act like (make a backup!). If performance continues to be an issue consider upgrading to SQL express.
MS Access to SQL Server Migrations docs.
Good luck.
I would agree that dumping Access would be the best first step. Having said that...
In a .NET and SQL environment I have definitely seen threads aid in maximizing INSERT throughputs.
I have an application that accepts asynchronous file drops and then processes them into tables in a database.
I created a loader that parsed the file and placed the data into a queue. The queue was served by one or more threads whose max I could tune with a parameter. I found that even on a single core CPU with your typical 7200RPM drive, the ideal number of worker threads was 3. It shortened the load time an almost proportional amount. The key is to balance it such that the CPU bottleneck and the Disk I/O bottleneck are balanced.
So in cases where a bulk copy is not an option, threads should be considered.
On modern multi-core machines, using multiple threads to populate a database can make a difference. It depends on the database and its hardware. Try it and see.
Just try it and see if it helps. I would guess not because the bottleneck is likely to be in the disk access and locking of the tables, unless you can figure out a way to split the load across multiple tables and/or disks.
IIRC access don't allow for multiple connections to te same file because of the locking policy it uses.
And I agree totally about dumping access for sql.