How to organize primary keys for good locality? - sql

I have a table for users and a table for documents. Documents have exactly one user as an owner, and for the application I'm building, I know that I will typically be accessing a group of documents associated with a single given user.
Let's say the average user has K documents, and certain common queries fetch all of the documents for a given user. I don't want my database (PostgreSQL) to have to do K disk seeks (on average) to fetch all the documents for a user. Ideally, the documents would be stored in contiguous blocks so that fetches would only require a few seeks.
Is it possible (and reasonable) to organize the document table schema to create such locality? I know that NoSQL implementations do this all the time. E.g. the BigTable paper talks about how row keys for web tables are assigned by URL, except that the URL is reversed, e.g. com.cnn.www, so that all the pages for CNN are located near each other in the data store. It doesn't appear possible to do something similar in Postgres because tables cannot be index-organized, although it might be possible in MySQL with InnoDB. This post comes to a similar conclusion.
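(For reference, the InnoDB version of what I have in mind would look roughly like this; table and column names are just illustrative:)

    CREATE TABLE documents (
        owner_id BIGINT NOT NULL,   -- the owning user
        doc_id   BIGINT NOT NULL,
        body     TEXT,
        -- InnoDB is index-organized: rows are stored in primary-key order,
        -- so all of a user's documents end up physically close together
        PRIMARY KEY (owner_id, doc_id)
    ) ENGINE=InnoDB;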

The command you're looking for is CLUSTER, but it has drawbacks. It completely rewrites the table when you run it, which requires a lock on it, so you may only want to do this when traffic is low. Also, Postgres will do nothing to keep rows in that order during INSERTs and UPDATEs, so your data will tend to fragment as the table is written to and you may have to recluster it regularly.
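As a rough sketch, assuming a documents table with an owner_id column (the names here are made up):

    -- build an index on the owner column, then physically reorder the table by it
    CREATE INDEX documents_owner_idx ON documents (owner_id);
    CLUSTER documents USING documents_owner_idx;
    ANALYZE documents;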
What you can also do is set a low fillfactor on the table, so that UPDATEs are more likely to keep a given row on the same page. This should prevent some fragmentation, which just leaves INSERTs, but with a low fillfactor INSERTs will tend to be placed on newer pages, and these will probably be commonly accessed enough to be kept in RAM. I'm making assumptions about your usage patterns which may be wrong, but regardless, your best course of action is probably to just recluster whenever you see I/O start to become a problem.
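Something along these lines, using the same hypothetical table; the exact fillfactor value is a guess you would tune:

    -- leave ~30% of each page free so UPDATEs can keep rows on the same page
    ALTER TABLE documents SET (fillfactor = 70);
    -- existing pages are only repacked the next time the table is rewritten, e.g.:
    CLUSTER documents USING documents_owner_idx;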
Finally, there's also a tool called pg_repack that can cluster a table without taking such a heavy lock, in a similar manner to how CREATE INDEX CONCURRENTLY works, but it's a third-party tool, so you'll want to experiment with it before running in production.

Related

Why use multiple ElasticSearch indices for one web application?

In asking questions related to using ES for web applications, suggestions have been made to have one index for things like user profiles, another index for data, etc., and several other ones for logs.
With all of these on a cluster shared by several web applications, it seems like things could get messy or disorganized.
In that case, are people using one cluster per application? I am a bit confused because when I read articles about indexing logs, they seem to refer to storing the data in multiple indices, rather than types within an index.
Secondly, why not have one index per app, with types for logs, user profiles, data, etc.?
Is there some benefit to using multiple indices rather than many types within an index for a web application?
-- UPDATE --
To add to this, the comments in this question, Elastic search, multiple indexes vs one index and types for different data sets?, don't seem to go far enough in explaining why:
data retention: for application log/metric data, use different indexes if you require different retention period
Is that recommended because it's just simpler to delete an entire index rather than a type within an index? Does it have to do with the way the data is stored and then reclaimed after the data is deleted?
I found the primary reason for creating multiple indices, one that satisfies my quest for an answer, in Elasticsearch's pagination documentation:
To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the requesting node, which then sorts all 50 results in order to select the overall top 10.
Now imagine that we ask for page 1,000—results 10,001 to 10,010. Everything works in the same way except that each shard has to produce its top 10,010 results. The requesting node then sorts through all 50,050 results and discards 50,040 of them!
You can see that, in a distributed system, the cost of sorting results grows exponentially the deeper we page. There is a good reason that web search engines don’t return more than 1,000 results for any query.

What is a good way to manage large ever growing tables in a database?

I am building a web application for medical record keeping. A requirement for this application is logging all changes (view, create, update, delete) to a patient's data and pretty much any other useful info in the system (login, cron run, data export, etc).
I am currently storing the data in a database table, which is working fine. However, it is likely this table will grow unwieldy very quickly and bloat the database. I am not allowed to delete log entries.
My current plan is to choose an arbitrary size (such as 1 million entries, large but still manageable). When the table hits 1 million entries, I move the 100,000 oldest entries into a file and store it on our file server.
Does anyone have any experience with this issue that has other/better ideas on how to handle it?
Additional info:
My primary concern is that nothing will ever be deleted from this data. However, the data does not necessarily need to be accessed after several months. Since this data could plausibly hit 1 billion entries within a couple of years (and I have 300 copies of this DB that all include this table), what is a good way to manage the size and performance? This table also needs to be displayed with a pager, which is obviously going to be an issue when it breaks 1 million rows, let alone 1 billion.
Cases like this are tailor-made for partitioning. Using a partitioning strategy, you span your data across multiple tables. This helps to balance I/O, speed up access times for partition-specific queries, etc. This is a discipline in and of itself, and the choice of partitioning key is crucial. In many cases such as log data like this, people often partition on a datetime value.
Partitioned Tables and Indexes (SQL Server)
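As an illustration, a monthly range partition on the log timestamp could look roughly like this in SQL Server (the names and boundary dates are made up):

    -- map datetime2 values to monthly partitions
    CREATE PARTITION FUNCTION pfAuditByMonth (datetime2)
        AS RANGE RIGHT FOR VALUES ('2014-01-01', '2014-02-01', '2014-03-01');

    -- for simplicity, place every partition on the primary filegroup
    CREATE PARTITION SCHEME psAuditByMonth
        AS PARTITION pfAuditByMonth ALL TO ([PRIMARY]);

    CREATE TABLE dbo.AuditLog (
        AuditLogId bigint IDENTITY(1,1) NOT NULL,
        LoggedAt   datetime2     NOT NULL,
        EventType  varchar(50)   NOT NULL,
        Detail     nvarchar(max) NULL
    ) ON psAuditByMonth (LoggedAt);

New boundaries are added over time with ALTER PARTITION FUNCTION ... SPLIT RANGE, and old months can be switched out to an archive table instead of deleting rows one at a time.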

Should I create separate SQL Server database for each user?

I am working on an ASP.NET MVC web application; the back end is SQL Server 2012.
This application will provide billing, accounting, and inventory management. The user will create an account by signing up, just like http://www.quickbooks.in. Each user will create some masters and various transactions. There is no limit; a user can create unlimited records in the database.
I want to keep database performance stable under heavy data load. I am maintaining proper indexing and primary keys, but there will be a heavy load on the database per user.
So, should I create a separate database for each user, or should I maintain one database keyed by UserID, adding a UserID column to each table and partitioning on it?
I am not an expert in SQL Server, so please provide suggestions with clear specifications.
Please inform me if there is any lack of information.
A DB per user is what happens when customers need to be able to pack up and leave, taking the actual database with them. Think of a self-hosted WordPress website. It also makes sense if there are incredible risks to one user accidentally seeing another user's data, so it's safer to rely on the server's security model than to rely on remembering to add the UserID filter to all your queries. I can't imagine a scenario like that, but who knows -- maybe if the privacy laws allowed for jail time, I would rather have data partitioned by security rules than rely on carefully written WHERE clauses.
If you did do user-per-database, creating a new user would be 10x more effort. While INSERT, UPDATE and so on stay the same from version to version, the syntax for database creation, user creation, permission granting and so on evolves enough to break those scripts with each SQL Server version upgrade.
Also, this will multiply your migration headaches by the number of users. Let's say you have 5,000 users and you need to add some new columns, change a column's data type, update a trigger, and so on. Instead of running that change script once, you need to run it 5,000 times.
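In practice that means scripting the rollout yourself, something like the following T-SQL sketch (the customer[_]% naming convention and the dbo.Invoices change are hypothetical):

    DECLARE @db sysname, @sql nvarchar(max);
    DECLARE dbs CURSOR FOR
        SELECT name FROM sys.databases WHERE name LIKE N'customer[_]%';
    OPEN dbs;
    FETCH NEXT FROM dbs INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- run the same migration in every customer database
        SET @sql = N'USE ' + QUOTENAME(@db) +
                   N'; ALTER TABLE dbo.Invoices ADD Notes nvarchar(500) NULL;';
        EXEC (@sql);
        FETCH NEXT FROM dbs INTO @db;
    END
    CLOSE dbs;
    DEALLOCATE dbs;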
Per-user DBs also probably waste disk space. Each of those databases is going to have a transaction log sitting idle, taking up at least the minimum log space.
As for load, if your 5,000 users are collectively doing 1 billion inserts, updates and so on per day, my intuition tells me it's going to be faster on one database, unless there is some sort of contention issue (everyone reading and writing the same pages of the same table at the same time). Each database consumes machine resources (probably threads and memory) for its own housekeeping, so these extra DBs can't be free.
Anyhow, the best thing to do is to simulate the two architectures and use a random data generator to simulate load and see how they perform.
It's not an easy answer to give.
First, there is logical design to be considered. Then you have integrity, security, management and performance (in this very order).
A database is a logical, self-contained unit of data. Ideally, you should be able to take a database, move it to another instance, probably change the connection strings and be running again.
All the constraints are database-level. No foreign keys can exist referencing some object outside the database.
So, try thinking in these terms first.
How would you reliably prevent one user from messing up another user's data? Keep in mind that it's just a matter of time before someone opens an Excel sheet and fires up queries against the database, bypassing your application. Row-level security in SQL Server is something you don't want to deal with.
Multiple databases mean that all management tasks should be scripted out and executed on all databases. Yes, there is some overhead to it, but once you set it up it's just a matter of monitoring. If a database goes suspect, it's a single customer down, not all of them. You can even have different versions for different customers if each customer has its own database. Additionally, if you roll out an upgrade, you can do it per customer, so the impact will be much smaller.
Performance is the least relevant factor here. Of course, it really depends on how many customers and how much data, but proper indexing will solve these issues. Scale-out is much easier with multiple databases.
BTW, partitioning, as you mentioned, is never a performance booster; it's simply a management feature, allowing for faster loading and evicting of data from a table.
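For example, evicting old data becomes a near-instant metadata operation on a partitioned table (a sketch; it assumes a hypothetical archive table with an identical structure on the same filegroup):

    -- move the oldest partition's rows into the archive table without copying them
    ALTER TABLE dbo.Orders SWITCH PARTITION 1 TO dbo.OrdersArchive;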
I'd probably put each customer in a separate database, but it's up to you eventually to make the decision for yourself. Hope I've helped some with this.

Should I create multiple tables, or even databases for multiple users of a CRM

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything, with custom attributes handled by one table that defines each custom attribute and another that holds its values, which can then be joined to the existing contact records for the user. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables reach millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using a WHERE userID=? everywhere, I can use a FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but for that reason alone I can not rule out, is that a new database should be created for each user with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. Database retrieval time does depend on the number of records, but there is a trade-off: you can write efficient SQL queries to fetch the data. According to this site, you probably won't run out of memory or run into concurrency issues, because with a good server the performance ought to be fine, provided that you write your queries efficiently.
If you recreate table sets, you will just end up creating lots of tables, which can slow down indexing and is bad practice. It is better to keep a single relational schema and normalize the database and its tables to improve efficiency.
Creating a new database for each and every user just combines the complexity of both of the above, resulting in shabby and disorganized database access. If you run an individual database for every single user, you end up consuming your server's physical resources (RAM and CPU), which will affect the quality of service for all the other users.
Go with option 1. Assign separate userIds and give them roles and privileges where needed. That is more efficient than the other two methods.
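A minimal sketch of what option one could look like (all names here are hypothetical):

    CREATE TABLE contacts (
        contact_id INT PRIMARY KEY,
        user_id    INT NOT NULL,     -- owning account; every query filters on this
        name       VARCHAR(200)
    );

    -- one row per custom field an account has defined
    CREATE TABLE custom_attributes (
        attribute_id INT PRIMARY KEY,
        user_id      INT NOT NULL,
        name         VARCHAR(100)
    );

    -- one row per (contact, custom field) value, joined back to contacts
    CREATE TABLE custom_attribute_values (
        contact_id   INT NOT NULL REFERENCES contacts (contact_id),
        attribute_id INT NOT NULL REFERENCES custom_attributes (attribute_id),
        value        VARCHAR(4000),
        PRIMARY KEY (contact_id, attribute_id)
    );

    CREATE INDEX idx_contacts_user ON contacts (user_id);

The index on user_id is what keeps "find one account's records among millions" cheap.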

Low MySQL Table Cache Hit Rate

I've been working on optimizing my site and databases, and I have been using mysqltuner.pl to help with this. I've gotten just about everything right except for the table cache hit rate: no matter how high I raise it in my.cnf, I am still hitting about 0% (284 open / 79k opened).
My problem is that I don't really understand exactly what affects this so I don't really know what to look for in my queries/database structure to fix this.
The table cache defines the number of table file descriptors that MySQL keeps open simultaneously. So the table cache hit rate will be affected by how many tables you have relative to your limit, as well as how frequently you re-reference tables (keeping in mind that it counts not just a single connection, but all simultaneous connections).
For instance, if your limit is 100 and you have 101 tables and you query each one in order, you'll never get any table cache hits. On the other hand, if you only have 1 table, you should generally get close to 100% hit rate unless you run FLUSH TABLES a lot ( as long as your table_cache is set higher than the number of typically simultaneous connections).
So for tuning, you need to look at how many distinct tables you might reference by one process/client and then look at how many simultaneous connections you might typically have.
Without more details, I can't guess whether your case is due to too many simultaneous connections or too many frequently referenced tables.
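To see where you stand, compare the counters against the limit over time (MySQL; the variable is table_open_cache on newer versions, table_cache on older ones):

    -- descriptors open right now vs. how many have ever been opened since startup
    SHOW GLOBAL STATUS LIKE 'Open%tables';
    -- the configured limit
    SHOW GLOBAL VARIABLES LIKE 'table%cache';
    -- raise it at runtime (if your version allows it) before committing to my.cnf
    SET GLOBAL table_open_cache = 1024;

If Opened_tables keeps climbing between snapshots even after the limit comfortably exceeds your distinct tables times typical concurrent connections, the churn is coming from somewhere else, such as frequent FLUSH TABLES.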
A cache is supposed to maintain copies of hot data. Hot data is data that is used a lot. If you cannot retrieve data out of a certain cache it means the DB has to go to disk to retrieve it.
--edit--
Sorry if the definition seemed a bit obnoxious. A specific cache often covers a lot of entities, and these are database-specific; you need to find out what is cached by the table cache first.
--edit: some investigation --
OK, it seems (from the reply to this post) that MySQL uses the table cache for the data structures used to represent a table. These data structures also (via encapsulation, or by having duplicate table entries for each table) represent a set of file descriptors open for the data files on the file system. The MyISAM engine uses one for the table and one for each index; additionally, each active query element requires its own descriptors.
A file descriptor is a kernel entity used for file IO, it represents the low-level context of a particular file read or write.
I think you are either interpreting the values incorrectly or they need to be interpreted differently in this context. 284 is the number of active tables at the instant you took the snapshot, and the second value represents the number of times a table was acquired since you started MySQL.
I would hazard a guess that you need to take multiple snapshots of this reading and see whether the first value (active file descriptors at that instant) ever exceeds your cache size capacity.
P.S. The kernel generally has an upper limit on the number of file descriptors it will allow each process to open, so you might need to raise this if it is too low.