Low MySQL Table Cache Hit Rate - sql

I've been working on optimizing my site and databases, and I have been using mysqltuner.pl to help with this. I've gotten just about everything correct, except for the table cache hit rate, no matter how high I raise it in my.cnf, I am still hitting about 0% (284 open / 79k opened).
My problem is that I don't really understand exactly what affects this so I don't really know what to look for in my queries/database structure to fix this.

The table cache defines the number of simultaneous file descriptors that MySQL has open. So table cache hit rate will be affected by how many tables you have relative to your limit as well as how frequently you re-reference tables (keeping in mind that it is counting not just a single connection, but simultaneous connections)
For instance, if your limit is 100 and you have 101 tables and you query each one in order, you'll never get any table cache hits. On the other hand, if you only have 1 table, you should generally get close to 100% hit rate unless you run FLUSH TABLES a lot ( as long as your table_cache is set higher than the number of typically simultaneous connections).
So for tuning, you need to look at how many distinct tables you might reference by one process/client and then look at how many simultaneous connections you might typically have.
Without more details, I can't guess whether your case is due to too many simultaneous connections or too many frequently referenced tables.

A cache is supposed to maintain copies of hot data. Hot data is data that is used a lot. If you cannot retrieve data out of a certain cache it means the DB has to go to disk to retrieve it.
--edit--
sorry if the definition seemed a bit obnoxious. a specific cache often covers a lot of entities, and these are database specific, you need to find out what is cached by the table cache firstly.
--edit: some investigation --
Ok, it seems (from the reply to this post), that Mysql uses the table cache for the data structures used to represent a table. the data structures also (via encapsulation or by having duplicate table entries for each table) represent a set of file descriptors open for the data files on the file system. The MyIsam engine uses one for a table and one for each index, additionally each active query element requires its own descriptors.
A file descriptor is a kernel entity used for file IO, it represents the low-level context of a particular file read or write.
I think you are either interpreting the value's incorrectly or they need to be interpreted differently in this context. 284 is the amount of active tables at the instance you took the snapshot and the second value represents the amount of times a table was acquired since you started Mysql.
I would hazard a guess that you need to take multiple snapshots of this reading and see if the first value (active fd's at that instance) ever exceed your cache size capacity.
p.s., the kernel generally has a upper limit on the amount of file descriptors it will allow each process to open -- so you might need to tune this if it is too low.

Related

Where are sql results stored in a gui client?

Suppose I have a dataset that contains 100B rows and I do a SELECT * sql query from the table without a limit, and let's suppose the client doesn't impose a limit on top of it either --
As the data is running it usually loads the results incrementally into the interface. However, the dataset is much to large to fit onto my local machine. What actually happens when it is "Running query..."? Is the data loaded directly to program memory? Is the data saved to something like a tmp file that is memory mapped (I would think not), or what is the most common way to 'display' the results here? And then finally, what would happen once my local memory limit is exceeded -- would the program just hang or crash?
I know this is a slightly abstract question, but mainly I'm asking how a SQL result-set is usually 'loaded' in order to display the results to a user in a user interface.
.There may not be a "ususal" answer. Different applications are likely to take different approaches depending on the trade-offs they want to make.
The simplest approach is for the client to fetch the first N rows (you tagged this for Oracle SQL Developer where the default N is 50). If you then scroll down in the results, the client will fetch the next N rows. The client keeps the results it has already fetched in memory. If you try to fetch more data than the client machine has memory available (and, of course, the client may have been configured to have virtual memory larger than the physical memory available), the application either crashes or generates some sort of error. Note that depending on the specific implementation, the data could be cached either by the ODBC/JDBC/etc. driver or by the actual application code.
If there is some reason for the client to expect that it would be beneficial to display gigabytes worth of data to a human (or if crashing or erroring out is particularly problematic), the client might write results to a file rather than keeping them in memory. That doesn't seem particularly common in a GUI IDE but I don't use a terribly large number of different GUIs.
Other options are possible (but probably not worth implementing in an application that is supposed to provide results to a human who isn't going to scroll through billions of results). Under the covers, the application or driver could cache a key (in Oracle, normally the ROWID) for the previously returned data rather than the entire row and then re-fetch that data if the user tries to scroll back to the top. The application could discard data that you had already fetched and throw an error if you tried to scroll back from row 1 billion to row 1. Or it could silently re-execute the query if you wanted to go back to the first row.

Allowing many users to view stale BigQuery data query results concurrently

If I have a BigQuery dataset with data that I would like to make available to 1000 people (where each of these people would only be allowed to view their subset of the data, and is OK to view a 24hr stale version of their data), how can I do this without exceeding the 50 concurrent queries limit?
In the BigQuery documentation there's mention of 50 concurrent queries being permitted which give on-the-spot accurate data, which I would surpass if I needed them to all be able to view on-the-spot accurate data - which I don't.
In the documentation there is mention of Batch jobs being permitted and saving of results into destination tables which I'm hoping would somehow allow a reliable solution for my scenario, but am having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and whether or not someone querying results that exist in those destination tables is in itself counting towards the 50 concurrent users limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds kind of like a dashboading/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per-user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS. You can then break it up into multiple flat files (or just read it into memory on your frontend). Each user hits your frontend, you determine which part of the output to serve up to them and respond.
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.

What is a good way to manage large ever growing tables in a database?

I am building a web application for medical record keeping. A requirement for this application is logging all changes (view, create, update, delete) to a patients data and pretty much any other useful info in the system (login, cron run, data export, etc).
I am storing the data into a database table currently which is working fine. However it is likely this table will grow unruly very quickly and bloat the database. I am not allowed to delete log entries.
My current plan is to choose an arbitrary size (such as 1 million entries, large but still manageable). When the table hits 1 million entries I move 100,000 oldest entries into a file and store it onto our file server.
Does anyone have any experience with this issue that has other/better ideas on how to handle it?
Additional info:
My primary concern is nothing will ever be deleted from this data. However the data does not necessarily need to be accessed after several months. Since this data could logically hit 1 Billion entries in a matter of a couple years (and I have 300 copies of this db that all include this table) what is a good way to manage the size and performance. This table needs to be on a pager which is obviously going to be an issue when it breaks 1 Million let alone 1 Billion.
Cases like this are tailor-made for partitioning. Using a partitioning strategy, you span your data across multiple tables. This helps to balance I/O, speed up access times for partition-specific queries, etc. This is a discipline in and of itself, and the choice of partitioning key is crucial. In many cases such as log data like this, people often partition on a datetime value.
Partitioned Tables and Indexes (SQL Server)

How to organize primary keys for good locality?

I have a table for users and a table for documents. Documents have exactly one user as an owner, and for the application I'm building, I know that I will typically be accessing a group of documents associated with a single given user.
Let's say the average user has K documents, and certain common queries fetch all of the documents for a given user. I don't want my database (PostgreSQL) to have to do K disk seeks (on average) to fetch all the documents for a user. Ideally, the documents would be stored in contiguous blocks so that fetches would only require a few seeks.
Is it possible (and reasonable) to organize the document table schema to create such locality? I know that no-SQL implementations do this all the time? E.g. the BigTable paper talks about how row keys for web tables are assigned by URL, except that the url is reversed, e.g. com.cnn.www, so that all the pages for CNN are located near eachother in the data store. It doesn't appear possible to something similar in Postgres because the tables cannot be index-organized, although it might be possible in MySQL w/ InnoDB. This post comes to a similar conclusion.
The command you're looking for is CLUSTER, but it has drawbacks. It completely rewrites the table when you run it, which requires a lock on it, so you may only want to do this when traffic is low. Also, Postgres will do nothing to keep rows in that order during INSERTs and UPDATEs, so your data will tend to fragment as the table is written to and you may have to recluster it regularly.
What you can also do is set a low fillfactor on the table, so that UPDATEs are more likely to keep a given row on the same page. This should prevent some fragmentation, which just leaves INSERTs, but with a low fillfactor INSERTs will tend to be placed on newer pages, and these will probably be commonly accessed enough to be kept in RAM. I'm making assumptions about your usage patterns which may be wrong, but regardless, your best course of action is probably to just recluster whenever you see I/O start to become a problem.
Finally, there's also a tool called pg_repack that can cluster a table without taking such a heavy lock, in a similar manner to how CREATE INDEX CONCURRENTLY works, but it's a third-party tool, so you'll want to experiment with it before running in production.

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more products names. Analytics shows that the feature is not heavily used (about 10 users requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes it would. You have very little data, so there is no need to try and 'cache' this in some way. (Apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from doing these requests from the webserver to the database server. With as little information as this you will not find performance an issue, but even if it would be when everything grows, there is a lot to be gained on the database-side (indexes etc) that will help you survive the next 100 years in this fashion.
The amount of requests from your users (also extremely small) does not need any special treatment, so again, direct query would be the best.
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify your CSV method. Examples and why you need not worry, could be
the connection with the databaseserver is down.
This is an issue for both methods, but with only one connection per day the change of a 1-in-10000 failures might seem to be better for once-a-day methods. But these issues should not come up very often, and if they do, you should be able to handle them. (retry request, give a message to user). This is what enourmous amounts of websites do, so trust me if I say that this will not be an issue. Also, think of what it would mean if your daily update failed? That would present a bigger problem!
Performance issues
as said, this is due to the amount of data and requests, not a problem. And even if it becomes one, this is a problem you should be able to catch at a different level. Use a caching system (non CSV) on the database server. Use a caching system on the webserver. Fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data-warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse-database (the one I just defended as being good enough to query directly) on another machine. You might get good results by doing a master-slave system
your datawarehouse is a master-database: it sends all changes to the slave but is inexcessible otherwise
your 2nd database (on your webserver even) gets all updates from the master, and is read-only. you can only query it for data
your webserver cannot connect to the datawarehouse, but can connect to your slave to read information. Even if there was an injection hack, it doesn't matter, as it is read-only.
Now you don't have a single moment where you update the queried database (the master-slave replication will keep it updated always), but no chance that the queries from the webserver put your warehouse in danger. profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar type field that the user fills in to get data out. If this is the only form just ensure that the only field that is in it is a date then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file with just the connection function should do fine in most cases so that a user can't, say open your webpage in an HTML viewer and see your database connection string.
As for the CSV, I would have to say querying a database per user, especially if it's only used ~10 times weekly would be much more efficient than the CSV. I just equate the CSV as overkill because again you only have ~10 users attempting to get some information, to export an updated CSV every day would be too much for such little pay off.
EDIT:
Also if an attack is a big concern, which that really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is a possibility that even with the best security an attack could happen. That mainly just depends on if the attackers want the information you have.